How Cleora Works
A technical deep-dive into Cleora's architecture — from mathematical foundations to implementation details.
Architecture Overview
Input: Edge List
TSV/space-separated edge file. Each line defines a hyperedge connecting entities. Column types control how entities interact.
alice item_laptop
bob item_keyboard
alice item_mouse
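Each line above is a hyperedge: a set of entities that co-occur. One common way to interpret such lines is clique expansion, where every pair of entities in a hyperedge becomes an edge. The helper below is an illustrative sketch of that interpretation, not Cleora's actual parser:

```python
from itertools import combinations

def parse_hyperedges(lines):
    """Expand each whitespace-separated line (one hyperedge) into pairwise edges.
    Illustrative clique expansion; not Cleora's internal parser."""
    edges = []
    for line in lines:
        nodes = line.split()
        # Every pair of entities appearing in the same hyperedge is connected.
        edges.extend(combinations(nodes, 2))
    return edges

lines = ["alice item_laptop", "bob item_keyboard", "alice item_mouse"]
print(parse_hyperedges(lines))
```

For two-column files like the one above, each line simply yields one (user, item) edge; longer lines would yield all pairs.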
Sparse Transition Matrix
Construct a sparse transition matrix M where M[i][j] = the weight of transitioning from node i to node j. For left Markov: M = D^(-1)A, each row sums to 1 (row-stochastic). For symmetric: M = D^(-1/2) A D^(-1/2).
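The left-Markov construction can be sketched in a few lines of NumPy on a toy graph (dense here for clarity; Cleora stores the matrix in sparse CSR form):

```python
import numpy as np

# Adjacency matrix of a toy 4-node undirected graph.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

deg = A.sum(axis=1)        # node degrees (row sums of A)
M = A / deg[:, None]       # left Markov: M = D^(-1) A

print(M.sum(axis=1))       # every row sums to 1
```

Row i of M is the transition distribution of a random walker standing at node i.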
Deterministic Initialization
Hash-based initialization: each entity gets a deterministic d-dimensional vector derived from its string ID. No random seeds. Same entity always gets the same initial vector.
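A minimal sketch of the idea, using SHA-256 as an illustrative stand-in for the SipHash function used in the Rust core:

```python
import hashlib
import numpy as np

def hash_init(entity_ids, dim):
    """Derive a deterministic d-dimensional vector from each entity's string ID.
    Illustrative stand-in for the SipHash-based Rust implementation."""
    vectors = np.empty((len(entity_ids), dim))
    for row, entity in enumerate(entity_ids):
        for j in range(dim):
            digest = hashlib.sha256(f"{entity}:{j}".encode()).digest()
            # Map the first 8 hash bytes to a value in [-1, 1).
            vectors[row, j] = int.from_bytes(digest[:8], "big") / 2**63 - 1.0
    return vectors

E0 = hash_init(["alice", "bob", "alice"], dim=8)
assert np.array_equal(E0[0], E0[2])  # same ID, same vector; no seeds involved
```

Because the vector is a pure function of the string ID, embeddings are reproducible across runs and machines without storing any random state.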
Iterative Propagation
Repeat k times: E_(t+1) = normalize(M × E_t). Each multiplication aggregates every possible random walk of that length. After k iterations, each node's embedding encodes its k-hop neighborhood structure.
Output: Embedding Matrix
N × d matrix of L2-normalized embeddings. Each row is a unit vector. Cosine similarity between any two rows measures structural proximity in the original graph.
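Because every row is a unit vector, cosine similarity between nodes reduces to a plain dot product:

```python
import numpy as np

def cosine_matrix(E):
    """Pairwise cosine similarity; for L2-normalized rows this is just E @ E.T."""
    return E @ E.T

# Toy embedding matrix: two orthogonal unit vectors and a copy of the first.
E = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0]])
S = cosine_matrix(E)
assert S[0, 2] == 1.0 and S[0, 1] == 0.0
```

This is why downstream similarity search over Cleora embeddings can use fast inner-product indexes directly.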
Core Algorithm
Cleora's algorithm is a form of spectral embedding. The key insight: a single sparse matrix multiplication M × E computes the weighted average of all neighbors' embeddings for every node simultaneously. This is mathematically equivalent to aggregating all possible random walks of length 1.
After k multiplications, the embedding of node v encodes:
- All paths of length ≤ k starting from v
- The probability of reaching every other node in k steps
- The structural role of v in its k-hop neighborhood
The L2 normalization after each iteration prevents embedding collapse and ensures the cosine similarity metric is meaningful.
E_0 = hash_init(entity_ids, dim)
E_(t+1) = L2_normalize(M × E_t)
output = E_k
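The three pseudocode lines above translate almost directly to NumPy. This sketch uses a dense toy matrix in place of the sparse one and a fixed placeholder in place of hash initialization:

```python
import numpy as np

def l2_normalize(E):
    """Scale each row to unit length (guarding against zero rows)."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    return E / np.where(norms == 0, 1.0, norms)

def cleora_embed(M, E0, iterations=4):
    """E_(t+1) = L2_normalize(M @ E_t), repeated `iterations` times."""
    E = E0
    for _ in range(iterations):
        E = l2_normalize(M @ E)
    return E

# Toy run: 4-node path graph, fixed positive values standing in for hash init.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
M = A / A.sum(axis=1, keepdims=True)       # left Markov
E0 = np.arange(8.0).reshape(4, 2) + 1.0    # deterministic placeholder init
E = cleora_embed(M, E0, iterations=4)
assert np.allclose(np.linalg.norm(E, axis=1), 1.0)  # rows are unit vectors
```

The real implementation does the `M @ E` step in Rust over a sparse matrix; the orchestration is the same loop.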
Markov Propagation Types
Left Markov (default)
Row-stochastic: M = D^(-1)A. Each row sums to 1. Equivalent to a random walk transition matrix. Better for co-occurrence/similarity tasks where you want "nodes visited by the same random walker are similar."
Use for: recommendations, similarity search, clustering
Symmetric Markov
M = D^(-1/2) A D^(-1/2). Preserves spectral properties. The eigenvectors of this matrix are the graph's spectral embedding. Better for tasks requiring structural equivalence.
Use for: node classification, community detection, role discovery
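The two normalizations can be sketched side by side. The symmetry of D^(-1/2) A D^(-1/2) is what guarantees real eigenvalues and eigenvectors, which ties it to spectral embedding:

```python
import numpy as np

# Toy graph: a triangle (all degrees equal 2).
A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
deg = A.sum(axis=1)

M_left = A / deg[:, None]                               # D^(-1) A
d_inv_sqrt = 1.0 / np.sqrt(deg)
M_sym = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^(-1/2) A D^(-1/2)

assert np.allclose(M_sym, M_sym.T)     # symmetric => real spectrum
eigvals = np.linalg.eigvalsh(M_sym)
assert np.all(eigvals <= 1.0 + 1e-9)   # spectrum of the normalized adjacency is bounded by 1
```

The left-Markov matrix is generally not symmetric (unless the graph is regular), which is why its interpretation is "random walk" rather than "spectral decomposition."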
Convergence Properties
Cleora converges rapidly due to L2 normalization constraining the embedding manifold. Practical guidelines:
| Iterations | Neighborhood | Best For |
|---|---|---|
| 2-3 | Local (2-3 hops) | Direct neighbors, immediate similarity |
| 4-5 | Medium (4-5 hops) | Community-level similarity, recommendations |
| 7+ | Global | Contextual similarity (like skip-gram), role equivalence |
Beyond ~10 iterations, embeddings converge toward the principal eigenvector direction and lose discriminative power. The default of 4 iterations works well for most applications.
Rust Core (PyO3)
The performance-critical operations are implemented in Rust and exposed to Python via PyO3:
SparseMatrix (Rust)
- from_iterator — parses edge strings, builds adjacency in a single pass. Multithreaded via Rayon for large inputs.
- initialize_deterministically — SipHash-based vector initialization. No randomness, no seeds.
- left_markov_propagate / symmetric_markov_propagate — sparse matrix-vector multiplication with adaptive thread pool. Processes rows in parallel chunks sized to L2 cache.
- to_sparse_csr — exports the Markov matrix to scipy-compatible CSR format for advanced operations in Python.
The Rust layer handles ~95% of compute time. Python orchestrates iteration and normalization. This hybrid approach gives Rust speed with Python flexibility.
Python API Layer
The Python layer provides high-level APIs on top of the Rust core:
pycleora (core)
embed(), embed_multiscale(), embed_with_attention(), supervised_refine(), CleoraEmbedder
pycleora.algorithms
ProNE, RandNE, HOPE, NetMF, GraRep, DeepWalk, Node2Vec
pycleora.metrics
AUC, MRR, Hits@K, MAP@K, nDCG, F1, ARI, Silhouette
pycleora.sampling
neighborhood, subgraph, GraphSAINT, negative, train/test splits
pycleora.viz
t-SNE, UMAP, PCA reduction + matplotlib plotting
pycleora.io_utils
NetworkX, PyG, DGL export. NPZ/CSV/TSV/Parquet save/load
Memory Model
Cleora's memory usage is dominated by two structures:
- Sparse matrix — O(edges) memory. CSR format stores only non-zero values. A graph with 5M edges uses ~60 MB for the matrix.
- Embedding matrix — O(nodes × dim) memory. 2M nodes × 1024 dims × 4 bytes = ~8 GB.
Total for roadNet-CA (2M nodes, 5.5M edges, dim=1024): ~9 GB. Compare: NetMF requires O(nodes²) for the dense log matrix — 2M² × 8 bytes = 32 TB. This is why NetMF crashes on large graphs.
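The estimates above are straightforward arithmetic (float32 values, 4-byte column indices, 8-byte row pointers assumed for the CSR layout):

```python
# Back-of-envelope memory estimates for roadNet-CA at dim=1024.
nodes, edges, dim = 2_000_000, 5_500_000, 1024

embedding_bytes = nodes * dim * 4                  # O(nodes * dim), float32
csr_bytes = edges * (4 + 4) + (nodes + 1) * 8      # values + col indices + row ptrs
netmf_dense_bytes = nodes**2 * 8                   # O(nodes^2) dense float64 log matrix

print(f"embeddings:  {embedding_bytes / 1e9:.1f} GB")    # ~8.2 GB
print(f"CSR matrix:  {csr_bytes / 1e6:.0f} MB")          # ~60 MB
print(f"NetMF dense: {netmf_dense_bytes / 1e12:.0f} TB") # 32 TB
```

The quadratic term is the entire story: embedding memory grows linearly in nodes and edges, while a dense pairwise matrix does not fit in any machine at this scale.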
Why Not Random Walks?
Traditional methods (DeepWalk, Node2Vec) sample random walks, then train a skip-gram model on the walk sequences. This has several fundamental limitations:
Random Walk Methods
- Sample a small subset of possible walks
- Require negative sampling (approximation)
- Non-deterministic: different runs give different results
- Need many walks per node for convergence
- Skip-gram training adds O(walks × walk_length × dim) time
Cleora (Matrix Propagation)
- Computes ALL walks simultaneously via matrix multiplication
- No sampling, no approximation
- Fully deterministic: same input = same output
- Converges in 3-5 iterations
- Time complexity: O(iterations × edges × dim)
The result: Cleora is typically 100-240x faster than random walk methods on the same graph, with equal or better accuracy on node classification benchmarks.