How Cleora Works
A technical deep-dive into Cleora's architecture — from mathematical foundations to implementation details.
Architecture Overview
Input: Edge List
TSV/space-separated edge file. Each line defines a hyperedge connecting entities. Column types control how entities interact.
alice item_laptop
bob item_keyboard
alice item_mouse
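Each line above is a hyperedge: a set of entities that co-occur. One common way to interpret such lines is clique expansion, where every pair of entities in a hyperedge becomes an edge. The helper below is an illustrative sketch of that interpretation, not Cleora's actual parser:

```python
from itertools import combinations

def parse_hyperedges(lines):
    """Expand each whitespace-separated line (one hyperedge) into pairwise edges.
    Illustrative clique expansion; not Cleora's internal parser."""
    edges = []
    for line in lines:
        nodes = line.split()
        # Every pair of entities appearing in the same hyperedge is connected.
        edges.extend(combinations(nodes, 2))
    return edges

lines = ["alice item_laptop", "bob item_keyboard", "alice item_mouse"]
print(parse_hyperedges(lines))
```

For two-column files like the one above, each line simply yields one (user, item) edge; longer lines would yield all pairs.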
Sparse Transition Matrix
Construct a sparse transition matrix M where M[i][j] = the weight of transitioning from node i to node j. For left Markov: M = D^(-1)A, each row sums to 1 (row-stochastic). For symmetric: M = D^(-1/2) A D^(-1/2).
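The left-Markov construction can be sketched in a few lines of NumPy on a toy graph (dense here for clarity; Cleora stores the matrix in sparse CSR form):

```python
import numpy as np

# Adjacency matrix of a toy 4-node undirected graph.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

deg = A.sum(axis=1)        # node degrees (row sums of A)
M = A / deg[:, None]       # left Markov: M = D^(-1) A

print(M.sum(axis=1))       # every row sums to 1
```

Row i of M is the transition distribution of a random walker standing at node i.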
Deterministic Initialization
Hash-based initialization: each entity gets a deterministic d-dimensional vector derived from its string ID. No random seeds. Same entity always gets the same initial vector.
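A minimal sketch of the idea, using SHA-256 as an illustrative stand-in for the SipHash function used in the Rust core:

```python
import hashlib
import numpy as np

def hash_init(entity_ids, dim):
    """Derive a deterministic d-dimensional vector from each entity's string ID.
    Illustrative stand-in for the SipHash-based Rust implementation."""
    vectors = np.empty((len(entity_ids), dim))
    for row, entity in enumerate(entity_ids):
        for j in range(dim):
            digest = hashlib.sha256(f"{entity}:{j}".encode()).digest()
            # Map the first 8 hash bytes to a value in [-1, 1).
            vectors[row, j] = int.from_bytes(digest[:8], "big") / 2**63 - 1.0
    return vectors

E0 = hash_init(["alice", "bob", "alice"], dim=8)
assert np.array_equal(E0[0], E0[2])  # same ID, same vector; no seeds involved
```

Because the vector is a pure function of the string ID, embeddings are reproducible across runs and machines without storing any random state.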
Iterative Propagation
Repeat k times: E_(t+1) = normalize(M × E_t). Each multiplication aggregates every possible random walk of that length. After k iterations, each node's embedding encodes its k-hop neighborhood structure.
Output: Embedding Matrix
N × d matrix of L2-normalized embeddings. Each row is a unit vector. Cosine similarity between any two rows measures structural proximity in the original graph.
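Because every row is a unit vector, cosine similarity between nodes reduces to a plain dot product:

```python
import numpy as np

def cosine_matrix(E):
    """Pairwise cosine similarity; for L2-normalized rows this is just E @ E.T."""
    return E @ E.T

# Toy embedding matrix: two orthogonal unit vectors and a copy of the first.
E = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0]])
S = cosine_matrix(E)
assert S[0, 2] == 1.0 and S[0, 1] == 0.0
```

This is why downstream similarity search over Cleora embeddings can use fast inner-product indexes directly.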
Core Algorithm
Cleora's algorithm is a form of spectral embedding. The key insight: a single sparse matrix multiplication M × E computes the weighted average of all neighbors' embeddings for every node simultaneously. This is mathematically equivalent to aggregating all possible random walks of length 1.
After k multiplications, the embedding of node v encodes:
- All paths of length ≤ k starting from v
- The probability of reaching every other node in k steps
- The structural role of v in its k-hop neighborhood
The L2 normalization after each iteration prevents embedding collapse and ensures the cosine similarity metric is meaningful.
E_0 = hash_init(entity_ids, dim)
E_(t+1) = L2_normalize(M × E_t)
output = E_k
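The three pseudocode lines above translate almost directly to NumPy. This sketch uses a dense toy matrix in place of the sparse one and a fixed placeholder in place of hash initialization:

```python
import numpy as np

def l2_normalize(E):
    """Scale each row to unit length (guarding against zero rows)."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    return E / np.where(norms == 0, 1.0, norms)

def cleora_embed(M, E0, iterations=4):
    """E_(t+1) = L2_normalize(M @ E_t), repeated `iterations` times."""
    E = E0
    for _ in range(iterations):
        E = l2_normalize(M @ E)
    return E

# Toy run: 4-node path graph, fixed positive values standing in for hash init.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
M = A / A.sum(axis=1, keepdims=True)       # left Markov
E0 = np.arange(8.0).reshape(4, 2) + 1.0    # deterministic placeholder init
E = cleora_embed(M, E0, iterations=4)
assert np.allclose(np.linalg.norm(E, axis=1), 1.0)  # rows are unit vectors
```

The real implementation does the `M @ E` step in Rust over a sparse matrix; the orchestration is the same loop.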
Markov Propagation Types
Left Markov (default)
Row-stochastic: M = D^(-1)A. Each row sums to 1. Equivalent to a random walk transition matrix. Better for co-occurrence/similarity tasks where you want "nodes visited by the same random walker are similar."
Use for: recommendations, similarity search, clustering
Symmetric Markov
M = D^(-1/2) A D^(-1/2). Preserves spectral properties. The eigenvectors of this matrix are the graph's spectral embedding. Better for tasks requiring structural equivalence.
Use for: node classification, community detection, role discovery
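The two normalizations can be sketched side by side. The symmetry of D^(-1/2) A D^(-1/2) is what guarantees real eigenvalues and eigenvectors, which ties it to spectral embedding:

```python
import numpy as np

# Toy graph: a triangle (all degrees equal 2).
A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
deg = A.sum(axis=1)

M_left = A / deg[:, None]                               # D^(-1) A
d_inv_sqrt = 1.0 / np.sqrt(deg)
M_sym = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^(-1/2) A D^(-1/2)

assert np.allclose(M_sym, M_sym.T)     # symmetric => real spectrum
eigvals = np.linalg.eigvalsh(M_sym)
assert np.all(eigvals <= 1.0 + 1e-9)   # spectrum of the normalized adjacency is bounded by 1
```

The left-Markov matrix is generally not symmetric (unless the graph is regular), which is why its interpretation is "random walk" rather than "spectral decomposition."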
Convergence Properties
Cleora converges rapidly due to L2 normalization constraining the embedding manifold. Practical guidelines:
| Iterations | Neighborhood | Best For |
|---|---|---|
| 2-3 | Local (2-3 hops) | Direct neighbors, immediate similarity |
| 4-5 | Medium (4-5 hops) | Community-level similarity, recommendations |
| 7+ | Global | Contextual similarity (like skip-gram), role equivalence |
Beyond ~10 iterations, embeddings converge toward the principal eigenvector direction and lose discriminative power. The default of 4 iterations works well for most applications.
Rust Core (PyO3)
The performance-critical operations are implemented in Rust and exposed to Python via PyO3:
SparseMatrix (Rust)
- from_iterator — parses edge strings, builds adjacency in a single pass. Multithreaded via Rayon for large inputs.
- initialize_deterministically — SipHash-based vector initialization. No randomness, no seeds.
- left_markov_propagate / symmetric_markov_propagate — sparse matrix-vector multiplication with adaptive thread pool. Processes rows in parallel chunks sized to L2 cache.
- to_sparse_csr — exports the Markov matrix to scipy-compatible CSR format for advanced operations in Python.
The Rust layer handles ~95% of compute time. Python orchestrates iteration and normalization. This hybrid approach gives Rust speed with Python flexibility.
Python API Layer
The Python layer provides high-level APIs on top of the Rust core:
pycleora (core)
embed(), embed_multiscale(), embed_with_attention(), supervised_refine(), CleoraEmbedder
pycleora.algorithms
ProNE, RandNE, HOPE, NetMF, GraRep, DeepWalk, Node2Vec
pycleora.metrics
AUC, MRR, Hits@K, MAP@K, nDCG, F1, ARI, Silhouette
pycleora.sampling
neighborhood, subgraph, GraphSAINT, negative, train/test splits
pycleora.viz
t-SNE, UMAP, PCA reduction + matplotlib plotting
pycleora.io_utils
NetworkX, PyG, DGL export. NPZ/CSV/TSV/Parquet save/load
Memory Model
Cleora's memory usage is dominated by two structures:
- Sparse matrix — O(edges) memory. CSR format stores only non-zero values. A graph with 5M edges uses ~60 MB for the matrix.
- Embedding matrix — O(nodes × dim) memory. 2M nodes × 1024 dims × 4 bytes = ~8 GB.
Total for roadNet-CA (2M nodes, 5.5M edges, dim=1024): ~9 GB. Compare: NetMF requires O(nodes²) for the dense log matrix — 2M² × 8 bytes = 32 TB. This is why NetMF crashes on large graphs.
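The estimates above are straightforward arithmetic (float32 values, 4-byte column indices, 8-byte row pointers assumed for the CSR layout):

```python
# Back-of-envelope memory estimates for roadNet-CA at dim=1024.
nodes, edges, dim = 2_000_000, 5_500_000, 1024

embedding_bytes = nodes * dim * 4                  # O(nodes * dim), float32
csr_bytes = edges * (4 + 4) + (nodes + 1) * 8      # values + col indices + row ptrs
netmf_dense_bytes = nodes**2 * 8                   # O(nodes^2) dense float64 log matrix

print(f"embeddings:  {embedding_bytes / 1e9:.1f} GB")    # ~8.2 GB
print(f"CSR matrix:  {csr_bytes / 1e6:.0f} MB")          # ~60 MB
print(f"NetMF dense: {netmf_dense_bytes / 1e12:.0f} TB") # 32 TB
```

The quadratic term is the entire story: embedding memory grows linearly in nodes and edges, while a dense pairwise matrix does not fit in any machine at this scale.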
Why Not Random Walks?
Traditional methods (DeepWalk, Node2Vec) sample random walks, then train a skip-gram model on the walk sequences. This has several fundamental limitations:
Random Walk Methods
- Sample a small subset of possible walks
- Require negative sampling (approximation)
- Non-deterministic: different runs give different results
- Need many walks per node for convergence
- Skip-gram training adds O(walks × walk_length × dim) time
Cleora (Matrix Propagation)
- Computes ALL walks simultaneously via matrix multiplication
- No sampling, no approximation
- Fully deterministic: same input = same output
- Converges in 3-5 iterations
- Time complexity: O(iterations × edges × dim)
The result: Cleora is typically 100-240x faster than random walk methods on the same graph, with equal or better accuracy on node classification benchmarks.