How Cleora Works

A technical deep-dive into Cleora's architecture — from mathematical foundations to implementation details.

Architecture Overview

1. Input: Edge List

TSV/space-separated edge file. Each line defines a hyperedge connecting entities. Column types control how entities interact.

alice item_laptop
bob item_keyboard
alice item_mouse
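
To make ingestion concrete, here is a minimal sketch of feeding this edge list into pycleora's SparseMatrix.from_iterator (described under Rust Core below). The two-column spec is illustrative; check the pycleora documentation for the exact column-spec syntax.

from pycleora import SparseMatrix

edges = [
    "alice item_laptop",
    "bob item_keyboard",
    "alice item_mouse",
]

# Two space-separated columns: user and item. Each line is a hyperedge.
mat = SparseMatrix.from_iterator(iter(edges), columns="user item")
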
2. Sparse Transition Matrix

Construct a sparse transition matrix M where M[i][j] is the probability of transitioning from node i to node j. For left-Markov normalization, M = D^(-1) A is row-stochastic: each row sums to 1. For symmetric normalization, M = D^(-1/2) A D^(-1/2). In the sample edge list above, alice connects to two items, so each transition from alice has probability 1/2.

3. Deterministic Initialization

Hash-based initialization: each entity gets a deterministic d-dimensional vector derived from its string ID. No random seeds. Same entity always gets the same initial vector.
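
A from-scratch sketch of the idea (illustrative only; pycleora's actual implementation uses SipHash in Rust, as noted under Rust Core below):

import hashlib
import numpy as np

def hash_init(entity_id: str, dim: int) -> np.ndarray:
    # Seed a PRNG from a stable hash of the string ID alone, so the
    # same entity always maps to the same vector. No global random state.
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "little"))
    return rng.standard_normal(dim).astype(np.float32)

assert np.array_equal(hash_init("alice", 8), hash_init("alice", 8))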

4. Iterative Propagation

Repeat k times: E_{t+1} = normalize(M × E_t). Each multiplication extends the aggregation by one hop, so step t aggregates every possible random walk of length t. After k iterations, each node's embedding encodes its k-hop neighborhood structure.

5. Output: Embedding Matrix

N × d matrix of L2-normalized embeddings. Each row is a unit vector. Cosine similarity between any two rows measures structural proximity in the original graph.
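
Because every row is a unit vector, cosine similarity reduces to a plain dot product. A quick nearest-neighbor sketch over the output matrix:

import numpy as np

def most_similar(emb: np.ndarray, i: int, top_k: int = 5) -> np.ndarray:
    # Rows are unit-length, so emb @ emb[i] is the cosine similarity of
    # every entity to entity i; drop rank 0 of the result (the entity itself).
    scores = emb @ emb[i]
    return np.argsort(-scores)[1 : top_k + 1]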

Core Algorithm

Cleora's algorithm is a form of spectral embedding. The key insight: a single sparse matrix multiplication M × E computes the weighted average of all neighbors' embeddings for every node simultaneously. This is mathematically equivalent to aggregating all possible random walks of length 1.

After k multiplications, the embedding of node v encodes:

  • All paths of length ≤ k starting from v
  • The probability of reaching every other node in k steps
  • The structural role of v in its k-hop neighborhood

The L2 normalization after each iteration prevents embedding collapse and ensures the cosine similarity metric is meaningful.

E_0 = hash_init(entity_ids, dim)
E_{t+1} = L2_normalize(M × E_t)
output = E_k
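
A from-scratch NumPy/SciPy rendering of this loop, assuming M is already a row-stochastic scipy.sparse matrix and E0 comes from a hash-based initializer like the one sketched earlier:

import numpy as np
import scipy.sparse as sp

def cleora_embed(M: sp.csr_matrix, E0: np.ndarray, k: int) -> np.ndarray:
    E = E0
    for _ in range(k):
        E = M @ E  # aggregate all length-1 walks for every node at once
        # L2-normalize rows; the epsilon guards isolated nodes with zero rows
        E = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)
    return E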

Markov Propagation Types

Left Markov (default)

Row-stochastic: M = D^(-1) A. Each row sums to 1, making M a random-walk transition matrix. Better for co-occurrence/similarity tasks where the intuition is "nodes visited by the same random walker are similar."

Use for: recommendations, similarity search, clustering

Symmetric Markov

M = D^(-1/2) A D^(-1/2). Preserves spectral properties: the eigenvectors of this matrix form the graph's spectral embedding. Better for tasks requiring structural equivalence.

Use for: node classification, community detection, role discovery
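
Both normalizations are one-liners in SciPy. A sketch, assuming an undirected adjacency matrix A in CSR format:

import numpy as np
import scipy.sparse as sp

def markov_matrices(A: sp.csr_matrix):
    deg = np.asarray(A.sum(axis=1)).ravel()
    d_inv = sp.diags(1.0 / np.maximum(deg, 1))                 # D^(-1)
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(deg, 1)))   # D^(-1/2)
    left = d_inv @ A                          # row-stochastic: rows sum to 1
    symmetric = d_inv_sqrt @ A @ d_inv_sqrt   # preserves spectral properties
    return left, symmetric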

Convergence Properties

Cleora converges rapidly due to L2 normalization constraining the embedding manifold. Practical guidelines:

Iterations   Neighborhood        Best For
2-3          Local (2-3 hops)    Direct neighbors, immediate similarity
4-5          Medium (4-5 hops)   Community-level similarity, recommendations
7+           Global              Contextual similarity (like skip-gram), role equivalence

Beyond ~10 iterations, all embeddings collapse toward the principal eigenvector of M and lose discriminative power. The default of 4 iterations is a good fit for most applications.

Rust Core (PyO3)

The performance-critical operations are implemented in Rust and exposed to Python via PyO3:

SparseMatrix (Rust)

  • from_iterator — parses edge strings, builds adjacency in a single pass. Multithreaded via Rayon for large inputs.
  • initialize_deterministically — SipHash-based vector initialization. No randomness, no seeds.
  • left_markov_propagate / symmetric_markov_propagate — sparse matrix-vector multiplication with adaptive thread pool. Processes rows in parallel chunks sized to L2 cache.
  • to_sparse_csr — exports the Markov matrix to scipy-compatible CSR format for advanced operations in Python.

The Rust layer accounts for ~95% of compute time; Python orchestrates iteration and normalization. This hybrid approach combines Rust's speed with Python's flexibility.
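
Putting the methods above together, a sketch of the full pipeline (method names are taken from the list above; exact signatures may differ from the released pycleora API):

import numpy as np
from pycleora import SparseMatrix

# Build the transition matrix from an edge file, one hyperedge per line.
mat = SparseMatrix.from_iterator(open("edges.tsv"), columns="user item")

emb = mat.initialize_deterministically(1024)  # hash-based, no seeds
for _ in range(4):
    emb = mat.left_markov_propagate(emb)
    emb /= np.linalg.norm(emb, axis=-1, keepdims=True)  # L2-normalize rows

csr = mat.to_sparse_csr()  # scipy-compatible CSR for custom analysis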

Python API Layer

The Python layer provides high-level APIs on top of the Rust core:

pycleora (core)

embed(), embed_multiscale(), embed_with_attention(), supervised_refine(), CleoraEmbedder

pycleora.algorithms

ProNE, RandNE, HOPE, NetMF, GraRep, DeepWalk, Node2Vec

pycleora.metrics

AUC, MRR, Hits@K, MAP@K, nDCG, F1, ARI, Silhouette

pycleora.sampling

neighborhood, subgraph, GraphSAINT, negative, train/test splits

pycleora.viz

t-SNE, UMAP, PCA reduction + matplotlib plotting

pycleora.io_utils

NetworkX, PyG, DGL export. NPZ/CSV/TSV/Parquet save/load
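
A hypothetical end-to-end call through the high-level layer (embed() is named in the listing above, but the parameter names here are assumptions, not the documented signature):

# Hypothetical parameters, shown for orientation only.
from pycleora import embed

emb = embed("edges.tsv", dim=1024, iterations=4)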

Memory Model

Cleora's memory usage is dominated by two structures:

  • Sparse matrix — O(edges) memory. CSR format stores only non-zero values. A graph with 5M edges uses ~60 MB for the matrix.
  • Embedding matrix — O(nodes × dim) memory. 2M nodes × 1024 dims × 4 bytes = ~8 GB.

Total for roadNet-CA (2M nodes, 5.5M edges, dim=1024): ~9 GB. Compare: NetMF requires O(nodes²) memory for its dense log matrix: (2M)² × 8 bytes = 32 TB. This is why NetMF crashes on large graphs.
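
The arithmetic behind these numbers, as a quick estimator (the per-element byte counts are assumptions about the CSR layout):

def embedding_gb(nodes: int, dim: int, bytes_per_float: int = 4) -> float:
    return nodes * dim * bytes_per_float / 1e9

def csr_mb(edges: int, bytes_per_value: int = 8, bytes_per_index: int = 4) -> float:
    # One stored value plus one column index per non-zero entry;
    # the O(nodes) row-pointer array is omitted for simplicity.
    return edges * (bytes_per_value + bytes_per_index) / 1e6

print(embedding_gb(2_000_000, 1024))  # ~8.2 GB, matching the figure above
print(csr_mb(5_000_000))              # ~60 MB, matching the figure above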

Why Not Random Walks?

Traditional methods (DeepWalk, Node2Vec) sample random walks, then train a skip-gram model on the walk sequences. This approach has several fundamental limitations:

Random Walk Methods

  • Sample a small subset of possible walks
  • Require negative sampling (approximation)
  • Non-deterministic: different runs give different results
  • Need many walks per node for convergence
  • Skip-gram training adds O(walks × walk_length × dim) time

Cleora (Matrix Propagation)

  • Computes ALL walks simultaneously via matrix multiplication
  • No sampling, no approximation
  • Fully deterministic: same input = same output
  • Converges in 3-5 iterations
  • Time complexity: O(iterations × edges × dim)

The result: Cleora is typically 100-240x faster than random walk methods on the same graph, with equal or better accuracy on node classification benchmarks.