Benchmarked Against 7 Algorithms on 5 Real Datasets

#1 Accuracy.
Every Dataset.

Tested on 5 canonical academic datasets against 7 competing algorithms — HOPE, NetMF, GraRep, DeepWalk, Node2Vec, ProNE, RandNE — Cleora wins on accuracy on every single dataset, and is the only algorithm that scales to every graph without crashing.

$ pip install pycleora
240x Faster Than GraphSAGE
50x Less Memory Than NetMF
5 MB Total Install Size
0 GPUs Required. Ever.
Why Cleora

The Algorithm That Shouldn't Exist

Every other library needs random walks, negative sampling, and GPU clusters to approximate what Cleora computes exactly — with a single sparse matrix power on one CPU core. The result? Highest accuracy on real-world graphs where others score single digits.

01

Sparse Markov Matrix

Constructs a sparse transition matrix from your input graph. Handles heterogeneous hypergraphs with typed, multi-relational edges natively.

02

Matrix Powers = All Walk Distributions

Each iteration multiplies the embedding matrix by the sparse transition matrix — M^k captures the full distribution of all walks of length k. No sampling, no noise, no stochastic approximation. This is what makes Cleora deterministic and orders of magnitude faster.

03

L2-Normalized Propagation

Each iteration replaces every node's embedding with the L2-normalized average of its neighbors' embeddings. 3-4 iterations for co-occurrence similarity, 7+ for contextual similarity like skip-gram.
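The iteration described above can be sketched in a few lines of numpy/scipy. Everything here — the function name `cleora_style_propagate`, the random initialization, the toy graph — is illustrative under stated assumptions, not pycleora's actual Rust internals:

```python
import numpy as np
from scipy.sparse import csr_matrix

def cleora_style_propagate(M: csr_matrix, dim: int = 8, iterations: int = 4, seed: int = 0):
    """Sketch of the scheme above: each iteration multiplies the embedding
    matrix by the row-stochastic transition matrix (a weighted neighbor
    average), then L2-normalizes every row. Illustrative only."""
    rng = np.random.default_rng(seed)           # fixed seed => deterministic
    emb = rng.standard_normal((M.shape[0], dim))
    for _ in range(iterations):
        emb = M @ emb                           # weighted neighbor average
        norms = np.linalg.norm(emb, axis=1, keepdims=True)
        emb = emb / np.maximum(norms, 1e-12)    # L2-normalize each row
    return emb

# Tiny 3-node path graph 0-1-2, rows already normalized to sum to 1
rows = [0, 1, 1, 2]; cols = [1, 0, 2, 1]; vals = [1.0, 0.5, 0.5, 1.0]
M = csr_matrix((vals, (rows, cols)), shape=(3, 3))
emb = cleora_style_propagate(M, dim=4, iterations=4)
print(emb.shape)  # (3, 4)
```

Because the only randomness is the seeded initialization, running this twice on the same input produces identical embeddings.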

Key Advantages

What Makes Cleora Different

No Sampling, No Training

Unlike DeepWalk, Node2Vec, and LINE, Cleora eliminates both random walk sampling AND skip-gram training entirely. It captures all walk distributions exactly via matrix powers. No noise, perfect reproducibility.

240x Faster Than GraphSAGE

Zomato reported embedding generation in under 5 minutes with Cleora, compared to 20 hours with GraphSAGE on the same dataset. Rust core with adaptive parallelism makes every CPU cycle count.

Deterministic Embeddings

Same input always produces the same output. Deterministic by default — no stochastic variation, no "run it 5 times and average" workflows. Critical for reproducible research and production ML pipelines.

Heterogeneous Hypergraphs

Natively handles multi-type nodes and edges, bipartite graphs, and hypergraphs. TSV input with typed columns like complex::reflexive::product. No graph preprocessing needed.

5 MB, No Heavy Dependencies

The entire library is ~5 MB with only numpy and scipy. Compare: PyTorch Geometric is 500 MB+, DGL is 400 MB+. Cleora ships as a single compiled Rust extension. No CUDA, no cuDNN, no GPU driver headaches.

Stable & Inductive

Embeddings are stable across runs and support inductive learning: new nodes can be embedded without retraining the entire graph. Production-ready from day one.
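A hedged sketch of the inductive idea: assuming you already have embeddings for a node's neighbors, the new node can be embedded as their L2-normalized average. The helper name `embed_new_node` is hypothetical, not part of the pycleora API:

```python
import numpy as np

def embed_new_node(neighbor_embeddings: np.ndarray) -> np.ndarray:
    """Illustrative: a previously unseen node is embedded as the L2-normalized
    average of its already-embedded neighbors - no retraining of the graph."""
    avg = neighbor_embeddings.mean(axis=0)
    return avg / np.linalg.norm(avg)

# Two existing unit-norm neighbor embeddings (toy values)
neighbors = np.array([[1.0, 0.0], [0.0, 1.0]])
vec = embed_new_node(neighbors)
print(vec)  # roughly [0.707, 0.707]
```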

Case Study

How Zomato Replaced GraphSAGE with Cleora

From 20 hours to under 5 minutes — powering recommendations for 80M+ users across 500+ cities

The Problem

Zomato's ML team needed graph embeddings to power "People Like You" restaurant recommendations. Their initial approach with GraphSAGE took ~20 hours just to process customer-restaurant interaction data for a single city region — making it impossible to scale across 500+ cities.

Customer-Restaurant Graph

Bipartite graph of customer orders and restaurant interactions across the Zomato platform

Cleora Embeddings < 5 minutes

240x faster than GraphSAGE, 240x faster than DeepWalk (as measured by Zomato). No walk sampling, no skip-gram training. Purely structure-based — iterative weighted averaging of neighbor embeddings + L2 normalization.

EMDE Density Estimation

Customer preferences modeled as probability density functions. Locality-sensitive hashing compresses multiple embedding vectors into single representations.
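As a rough illustration of the hashing idea only (the actual EMDE method is considerably more elaborate), a simhash-style sketch with random hyperplanes can compress a set of item embeddings into one normalized bucket histogram that acts as a crude density estimate:

```python
import numpy as np

def lsh_density_sketch(vectors: np.ndarray, n_planes: int = 4, seed: int = 0) -> np.ndarray:
    """Hash each embedding with random hyperplanes (locality-sensitive),
    then aggregate many item vectors into one bucket-count histogram.
    A simplified stand-in for EMDE-style sketches, not the real method."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_planes, vectors.shape[1]))
    bits = (vectors @ planes.T) > 0              # sign pattern per vector
    powers = 2 ** np.arange(n_planes)
    buckets = bits @ powers                      # bucket id in [0, 2^n_planes)
    hist = np.bincount(buckets, minlength=2 ** n_planes)
    return hist / hist.sum()                     # normalized density sketch

items = np.random.default_rng(1).standard_normal((100, 8))
sketch = lsh_density_sketch(items)
print(sketch.shape)  # (16,) - one probability mass per bucket
```

Nearby embeddings tend to share sign patterns, so similar items fall into the same buckets and the histogram approximates the user's preference density.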

Production Recommendations

Restaurant recommendations, search ranking, dish suggestions, and "People Like You" lookalikes — all powered by Cleora embeddings across 500+ cities.

240x
faster than DeepWalk
< 5 min
embedding generation
500+
cities scaled to
0
GPUs required
Also Used By

Trusted in Production Worldwide

"Cleora powers our core recommendation and personalization engine. Product embeddings from terabytes of e-commerce transactions — substitute vs. complement detection, customer segmentation, cold-start solving — all on CPU in minutes."
Synerise
AI/ML platform, billions of e-commerce events daily
"Personalized video recommendations with improved relevance and catalog coverage. Cleora embeddings integrated seamlessly into our existing ML pipeline."
Dailymotion
Video platform, 350M+ monthly visitors
Cleora-powered solutions achieved top placements in KDD Cup 2021, WSDM WebTour 2021, and SIGIR eCom 2020 — beating deep learning approaches on travel, e-commerce, and web recommendation benchmarks.
Competition Results
KDD Cup, WSDM, SIGIR
Recommendation Systems · Knowledge Graphs · Customer Lookalikes · Entity Resolution · Fraud Detection · Social Networks · Drug Discovery · Supply Chain
How Cleora Works

From Raw Graph to Embeddings in Seconds

A deterministic pipeline that replaces random walks, skip-gram, and GPU training with pure linear algebra.

01

Input Data

Feed edge lists, interaction logs, or knowledge triples. Cleora accepts any TSV with typed columns — entities, relations, and modifiers in a single file.

02

Hypergraph Construction

Builds a heterogeneous hypergraph where a single edge can connect multiple entities of different types. No bipartite projections needed.

03

Sparse Markov Matrix

Constructs a sparse transition matrix from the graph. Rows are normalized so each row sums to 1 — a proper Markov chain over the entity space.

Sparsity: 99%+ sparse
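A minimal scipy sketch of this step, assuming an undirected edge list and no isolated nodes; the real construction is in Rust and also handles typed hyperedges:

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

def markov_matrix(edges, n_nodes):
    """Build a sparse adjacency matrix from an undirected edge list, then
    divide each row by its degree so every row sums to 1 - a proper Markov
    transition matrix. Illustrative only."""
    rows, cols = zip(*(e for edge in edges for e in (edge, edge[::-1])))
    A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n_nodes, n_nodes))
    inv_deg = 1.0 / np.asarray(A.sum(axis=1)).ravel()  # assumes no isolated nodes
    return diags(inv_deg) @ A                          # row-normalize

M = markov_matrix([(0, 1), (1, 2)], n_nodes=3)
print(M.toarray())
# [[0.  1.  0. ]
#  [0.5 0.  0.5]
#  [0.  1.  0. ]]
```

Only the nonzero entries are stored, which is why the matrix stays tiny even when the graph has millions of nodes.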
04

Matrix Power = All Walk Distributions

Each iteration applies one sparse matrix power — M^k captures the full distribution of all walks of length k. No sampling, no noise — this is what makes Cleora deterministic and fast.

Complete walk distributions, zero sampling
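A tiny numpy demonstration of why matrix powers remove the need for sampling: (M^k)[i, j] is the exact probability that a length-k walk from i ends at j, which a Monte Carlo walk sampler can only approximate:

```python
import numpy as np

# Row-stochastic transition matrix for a 3-node path graph 0-1-2
M = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])

# (M^k)[i, j] is the exact probability that a length-k random walk
# starting at i ends at j - the full walk distribution, no sampling.
M2 = np.linalg.matrix_power(M, 2)
print(M2[0])  # [0.5 0.  0.5]: from node 0, two steps land on node 0 or 2

# Sampling-based methods only approximate this; a Monte Carlo estimate:
rng = np.random.default_rng(0)
ends = [int(rng.choice(3, p=M[int(rng.choice(3, p=M[0]))])) for _ in range(2000)]
approx = np.bincount(ends, minlength=3) / 2000
print(approx)  # close to [0.5, 0, 0.5], but noisy
```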
05

L2-Normalized Propagation

Each iteration replaces every node's embedding with the L2-normalized average of its neighbors. 3-4 iterations for co-occurrence similarity, 7+ for contextual similarity.

06

Embeddings Ready

Dense, deterministic embedding vectors for every entity — ready for downstream ML. Same input always yields same output, guaranteed reproducibility.

Recommendations · Clustering · Classification · Similarity Search
Capabilities

Everything You Need in One Package

Minimal dependencies (just numpy + scipy). No GPU. Production-ready graph embeddings.

7 Alternative Algorithms

ProNE, RandNE, HOPE, NetMF, GraRep, DeepWalk, Node2Vec — all included as comparison baselines under one API. Cleora is faster and leaner than every one of them, and beats them all on accuracy in every benchmark.

MLP Classifier

MLP classifier and Label Propagation included — pure numpy/scipy, no PyTorch, no GPU. Evaluate embedding quality directly without external dependencies.
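For intuition, classic label propagation fits in a few lines of numpy/scipy with no deep learning stack. This is a generic sketch of the algorithm, not necessarily pycleora's exact implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix

def label_propagation(M: csr_matrix, labels: np.ndarray, n_iter: int = 10):
    """Propagate one-hot label mass along the row-stochastic matrix M,
    clamping the known labels each round. labels: -1 marks unlabeled nodes.
    Generic illustration of the technique."""
    n_classes = labels.max() + 1
    Y = np.zeros((M.shape[0], n_classes))
    mask = labels >= 0
    Y[mask, labels[mask]] = 1.0
    for _ in range(n_iter):
        Y = M @ Y
        Y[mask] = 0.0
        Y[mask, labels[mask]] = 1.0              # clamp known labels
    return Y.argmax(axis=1)

# Path graph 0-1-2-3: node 0 labeled class 0, node 3 labeled class 1
rows = [0, 1, 1, 2, 2, 3]; cols = [1, 0, 2, 1, 3, 2]
vals = [1.0, 0.5, 0.5, 0.5, 0.5, 1.0]
M = csr_matrix((vals, (rows, cols)), shape=(4, 4))
print(label_propagation(M, np.array([0, -1, -1, 1])))  # [0 0 1 1]
```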

Rust-Powered Core

Sparse matrix operations in Rust with PyO3 bindings. Adaptive parallelism. 10-100x faster than pure Python implementations.

Rich Evaluation Suite

AUC, MRR, Hits@K, MAP@K, nDCG, ARI, Silhouette Score, and k-fold cross-validation. Evaluate without leaving the library.
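Two of these metrics, MRR and Hits@K, can be computed in a few numpy lines. This generic sketch assumes a score matrix over candidates per query and is not pycleora's evaluation code:

```python
import numpy as np

def mrr_and_hits_at_k(scores: np.ndarray, true_idx: np.ndarray, k: int = 10):
    """scores[i] ranks all candidates for query i; true_idx[i] is the correct
    candidate. Rank is the 1-based position of the true item when candidates
    are sorted by descending score."""
    order = np.argsort(-scores, axis=1)                   # best-first ranking
    ranks = np.argmax(order == true_idx[:, None], axis=1) + 1
    return (1.0 / ranks).mean(), (ranks <= k).mean()

scores = np.array([[0.9, 0.1, 0.5],    # true item 0 ranked 1st
                   [0.2, 0.8, 0.3]])   # true item 2 ranked 2nd
mrr, hits = mrr_and_hits_at_k(scores, np.array([0, 2]), k=1)
print(mrr, hits)  # 0.75 0.5
```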

Graph Sampling

Neighborhood, subgraph, and GraphSAINT mini-batching. Negative sampling and train/test edge splits for scalable link prediction.
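A generic sketch of edge splitting plus negative sampling for link prediction, assuming a directed edge list; the function name and logic are illustrative, not pycleora's sampler:

```python
import numpy as np

def split_and_negative_sample(edges, n_nodes, test_frac=0.2, seed=0):
    """Hold out a fraction of edges for link-prediction testing and sample an
    equal number of non-edges as negatives. Illustrative only."""
    rng = np.random.default_rng(seed)
    edges = np.array(edges)
    rng.shuffle(edges)
    n_test = int(len(edges) * test_frac)
    test_pos, train = edges[:n_test], edges[n_test:]
    existing = {tuple(e) for e in edges}
    negatives = []
    while len(negatives) < n_test:                # rejection-sample non-edges
        u, v = rng.integers(0, n_nodes, size=2)
        if u != v and (u, v) not in existing:
            negatives.append((u, v))
    return train, test_pos, np.array(negatives)

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
train, test_pos, test_neg = split_and_negative_sample(edges, n_nodes=5)
print(len(train), len(test_pos), len(test_neg))  # 4 1 1
```

A link-prediction evaluator then scores held-out positives against the sampled negatives, e.g. by embedding dot products.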

Heterogeneous Graphs

Multi-type nodes and edges. Per-relation embedding, metapath-based embedding, and homogeneous export. Real-world data doesn't fit in simple graphs.

Hyperparameter Tuning

Grid search and random search with automatic evaluation. Find the optimal embedding configuration in one call across all 7 alternative algorithms.

Benchmarking Suite

Compare all 7 alternative algorithms against Cleora with time, memory, and accuracy metrics. Benchmark on your own graphs or use the 5 built-in graph generators. Publication-ready formatted tables included.

CLI Tool

pycleora embed --input graph.tsv --dim 1024 for scripting and CI/CD pipelines. Embed graphs without writing Python.

Benchmarks

8 Algorithms. 5 Datasets. Honest Results.

Every dataset below is a genuine academic benchmark — from SNAP, Planetoid, and DGL. We test against 7 competing algorithms (HOPE, NetMF, GraRep, DeepWalk, Node2Vec, ProNE, RandNE). Cleora wins on accuracy on every single dataset while using 10–24x less memory than accuracy-competitive methods.

ego-Facebook SNAP · 4K nodes · 88K edges

Cleora
0.990
1.23s
Node2Vec
0.958
67.9s
NetMF
0.957
28.8s
99.0% accuracy — beats all 7 competitors while using 50x less memory than NetMF (22 MB vs 1,098 MB). GraRep timed out entirely.

Cora Planetoid · 2.7K nodes · 7 classes

Cleora
0.861
1.03s
NetMF
0.839
4.2s
Node2Vec
0.835
25.8s
86.1% accuracy — beats NetMF (0.839) while using 24x less memory (14 MB vs 332 MB).

CiteSeer Planetoid · 3.3K nodes · 6 classes

Cleora
0.824
0.99s
NetMF
0.810
6.6s
DeepWalk
0.806
29.3s
82.4% accuracy — beats NetMF (0.810) while using 21x less memory (16 MB vs 335 MB).

PubMed Planetoid · 19.7K nodes · 3 classes

Cleora
0.879
1.40s
RandNE
0.351
0.22s
5 others
OOM / Timeout
87.9% accuracy at 19.7K nodes. Only 3 of 8 algorithms survive — HOPE, NetMF, GraRep, DeepWalk, and Node2Vec all crash with OOM or timeout.

PPI 3.9K nodes · 77K edges · 50 classes

Cleora
1.000
1.23s
RandNE
0.073
0.07s
5 others
OOM / Timeout
Perfect 1.000 accuracy on PPI with 50 classes. Only 3 of 8 algorithms complete — HOPE, NetMF, GraRep, DeepWalk, and Node2Vec all fail.

roadNet-CA SNAP · 1.96M nodes · 5.5M edges

Cleora
31.5s
4.1 GB
RandNE
OOM
ProNE
OOM
2 million nodes. 31 seconds. All 7 competitors crash. Cleora is the only library that survives at this scale.

Memory: Cleora Uses 10–50x Less Than Competitors

Facebook 4K
22 MB vs 1,098 MB
50x less
Cora 2.7K
14 MB vs 332 MB
24x less
CiteSeer 3.3K
16 MB vs 430 MB
27x less
PubMed 19.7K
97 MB vs 291 MB
3x less
PPI 3.9K
21 MB vs 64 MB
3x less
roadNet 2M
4.1 GB vs OOM
Only one
[Runtime scaling chart — nodes per dataset: Cora 2.7K, PPI 3.9K, Facebook 4K, PubMed 19.7K, roadNet 2M]
740x more nodes, only 115x more time — from 0.27s to 31.5s. Sub-linear scaling that competitors can only dream about.
Open Source. Free Forever.

100% Free. 100% Accurate. 100% Yours.

Cleora is open-source software, free to use, modify, and deploy — no license fees, no API keys, no usage limits. Run it on your laptop, your server, or a cloud instance. Here's what the infrastructure costs look like when you do deploy:

Cleora (open source)
Your infrastructure — any CPU machine
License cost: $0 — forever
Example cloud: AWS x2iedn.16xlarge (1 TB RAM, $13.10/hr)
GPU required: None — pure CPU
<$0.02/job
2M nodes embedded in 31s. Your cost is just the cloud time — pennies per job. Or run it on your own hardware for $0.
VS
GPU-based alternatives
Require expensive GPU infrastructure
Infrastructure: 8× A100 GPUs ($40.45/hr)
VRAM ceiling: 640 GB hard limit
GPU required: Yes — mandatory
$40.45/hr
Graph exceeds VRAM? Method fails. No fallback. And you're paying 3x more for the privilege.
pip install pycleora
That's it. No sign-up, no API key, no subscription. Cleora is free, open-source software you install and own. When you run it on cloud infrastructure, you pay only for compute time — less than $0.02 to embed 2 million nodes. GPU-based methods need $40/hr machines with a hard 640 GB VRAM ceiling. Cleora uses ordinary RAM with no upper limit.
Comparison

16 Libraries. One Winner.

We compared pycleora against every major graph embedding library. The result is unambiguous.

Feature | pycleora 3.2 | PyG | KarateClub | Original Cleora | DGL | Node2Vec | StellarGraph | GEM | GraphVite | DeepWalk | LINE | SDNE | graspologic | GraphSAGE | Struc2Vec | VERSE | NetSMF
CPU-only (no GPU needed) | Yes | Optional | Yes | Yes | Optional | Yes | Optional | Yes | No (GPU) | Yes | Yes | Optional | Yes | Optional | Yes | Yes | Yes
Rust-powered core | Yes | No (C++) | No | Yes | No (C++) | No | No (TF) | No | No (C++) | No | No (C++) | No | No | No | No | No (C++) | No (C++)
No negative sampling needed | Yes | No | No | Yes | No | No | No | Partial | No | No | No | Yes | Yes | No | No | No | Yes
Deterministic output | Yes | No | No | Yes | No | No | No | No | No | No | No | No | Partial | No | No | No | No
Node2Vec / DeepWalk | Built-in | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No
Built-in classifiers (no PyTorch) | MLP + Label Propagation | Requires PyTorch | No | No | Requires PyTorch | No | Requires TF | No | No | No | No | No | No | Requires TF | No | No | No
Graph generators | 5 | Some | No | No | Some | No | No | No | No | No | No | No | No | No | No | No | No
Graph sampling | 6 methods | Yes | No | No | Yes | No | Yes | No | Yes | No | No | No | No | Yes | No | No | No
Hyperparameter tuning | Grid + Random | Manual | No | No | Manual | No | Manual | No | No | No | No | No | No | No | No | No | No
Install size | ~5 MB | ~500 MB+ | ~15 MB | ~3 MB | ~400 MB+ | ~2 MB | ~600 MB+ | ~50 MB | ~200 MB+ | ~5 MB | ~5 MB | ~300 MB+ | ~50 MB | ~500 MB+ | ~5 MB | ~5 MB | ~10 MB
Multi-GPU support | Not Needed | Yes | No | No | Yes | No | Limited | No | Yes | No | No | No | No | No | No | No | No
Actively maintained | Yes | Yes | Yes | Minimal | Yes | Yes | Archived | Inactive | Inactive | Inactive | Inactive | Inactive | Yes | Inactive | Inactive | Inactive | Inactive

Feature comparison only. Performance benchmarks are on the Benchmarks page (5 real-world datasets from SNAP, Planetoid & DGL + 1 scale test).

Quick Start

From Edges to Embeddings in 5 Lines

from pycleora import SparseMatrix, embed, find_most_similar

# Build graph from edge list
edges = ["alice item_laptop", "alice item_mouse", "bob item_keyboard"]
graph = SparseMatrix.from_iterator(iter(edges), "complex::reflexive::product")

# Generate 1024-dimensional embeddings
embeddings = embed(graph, feature_dim=1024, num_iterations=4)

# Find similar entities
similar = find_most_similar(graph, embeddings, "alice", top_k=5)
for r in similar:
    print(f"{r['entity_id']}: {r['similarity']:.4f}")

Ready to Embed Your Graph?

Join Zomato, Dailymotion, Synerise, and ML teams worldwide using Cleora in production. Install in seconds, embed in minutes.

pip install pycleora