# Benchmark Results

We benchmarked 8 algorithms across 5 real-world datasets plus 1 scale test. Every dataset is a genuine academic benchmark from SNAP, Planetoid, or DGL. Cleora achieves the best accuracy on every dataset while using 10–24x less memory than the accuracy-competitive methods, and the dense and random-walk baselines fail outright (OOM or timeout) on the larger graphs.
## Summary Table

Best accuracy per dataset. "OOM" = out of memory; "Timed Out" = exceeded the 90-second time budget.
### ego-Facebook (4,039 nodes, 18 Louvain communities)

Facebook ego network from SNAP (~4K nodes, ~88K edges). Community labels were detected via Louvain. All 8 algorithms were benchmarked with 256-dimensional embeddings:
| Algorithm | Accuracy | Macro F1 | Time | Memory |
|---|---|---|---|---|
| Cleora | 0.990 | 0.989 | 1.23s | 22 MB |
| DeepWalk | 0.958 | 0.956 | 59.2s | 572 MB |
| Node2Vec | 0.958 | 0.956 | 67.9s | 572 MB |
| NetMF | 0.957 | 0.958 | 28.8s | 1,098 MB |
| HOPE | 0.890 | 0.905 | 31.5s | 857 MB |
| RandNE | 0.212 | 0.181 | 0.07s | 42 MB |
| ProNE | 0.075 | 0.056 | 0.26s | 67 MB |
| GraRep | Timed Out (dense SVD per k-step exceeds the 90s budget) | – | – | – |
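The ego-Facebook labels come from Louvain community detection with a fixed seed (seed=42, per the methodology below). A minimal sketch of that labeling step, assuming `networkx` ≥ 3.0 is available (`louvain_labels` is our own helper name, not part of any benchmarked library):

```python
import networkx as nx

def louvain_labels(G: nx.Graph, seed: int = 42) -> dict:
    """Assign each node the index of its Louvain community."""
    communities = nx.community.louvain_communities(G, seed=seed)
    return {node: cid for cid, comm in enumerate(communities) for node in comm}

# Small built-in graph for illustration; for the benchmark this would be the
# SNAP ego-Facebook edge list loaded via nx.read_edgelist(...).
G = nx.karate_club_graph()
labels = louvain_labels(G)
```

With the seed fixed, the community assignment is deterministic, so the same labels are reused across all 8 algorithms.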
### Cora (2,708 nodes, 7 classes)
Citation network from Planetoid (Yang et al., 2016). 2,708 ML papers across 7 subject areas, 10,858 citation edges:
| Algorithm | Accuracy | Macro F1 | Time | Memory |
|---|---|---|---|---|
| Cleora | 0.861 | 0.858 | 1.03s | 14 MB |
| NetMF | 0.839 | 0.836 | 4.23s | 332 MB |
| DeepWalk | 0.835 | 0.833 | 24.1s | 227 MB |
| Node2Vec | 0.835 | 0.833 | 25.8s | 227 MB |
| HOPE | 0.821 | 0.818 | 15.97s | 330 MB |
| GraRep | 0.809 | 0.806 | 16.4s | 322 MB |
| RandNE | 0.247 | 0.246 | 0.03s | 24 MB |
| ProNE | 0.179 | 0.178 | 0.13s | 40 MB |
### CiteSeer (3,312 nodes, 6 classes)
Citation network from Planetoid. 3,312 CS papers across 6 subject areas, 9,464 citation edges:
| Algorithm | Accuracy | Macro F1 | Time | Memory |
|---|---|---|---|---|
| Cleora | 0.824 | 0.822 | 0.99s | 16 MB |
| NetMF | 0.810 | 0.810 | 6.58s | 335 MB |
| DeepWalk | 0.806 | 0.806 | 29.3s | 294 MB |
| Node2Vec | 0.806 | 0.806 | 29.6s | 294 MB |
| GraRep | 0.756 | 0.756 | 27.3s | 411 MB |
| HOPE | 0.740 | 0.740 | 19.6s | 430 MB |
| RandNE | 0.244 | 0.244 | 0.02s | 27 MB |
| ProNE | 0.189 | 0.188 | 0.14s | 45 MB |
### PubMed (19,717 nodes, 3 classes)
Citation network from Planetoid. 19,717 diabetes papers across 3 categories, 88,676 citation edges:
| Algorithm | Accuracy | Macro F1 | Time | Memory |
|---|---|---|---|---|
| Cleora | 0.879 | 0.878 | 1.40s | 97 MB |
| RandNE | 0.351 | 0.351 | 0.22s | 175 MB |
| ProNE | 0.339 | 0.339 | 0.75s | 291 MB |
| HOPE | Timed Out (sparse inverse too slow at 19.7K nodes) | – | – | – |
| NetMF | OOM (requires an O(n²) dense matrix: 19,717² × 8 bytes ≈ 3.1 GB) | – | – | – |
| GraRep | OOM (dense SVD per k-step) | – | – | – |
| DeepWalk | Timed Out (random walks too slow at this scale) | – | – | – |
| Node2Vec | Timed Out (random walks too slow at this scale) | – | – | – |
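The O(n²) memory wall for dense methods is simple arithmetic; a quick back-of-the-envelope check (the helper name is ours, for illustration only):

```python
def dense_matrix_gb(n_nodes: int, bytes_per_elem: int = 8) -> float:
    """Memory for a single dense n x n float64 matrix, in GB."""
    return n_nodes ** 2 * bytes_per_elem / 1e9

print(round(dense_matrix_gb(19_717), 1))     # PubMed: ≈ 3.1 GB, as in the NetMF row
print(round(dense_matrix_gb(1_965_206), 0))  # roadNet-CA: ≈ 30,900 GB (~30 TB)
```

This is why NetMF and GraRep already fail at ~20K nodes, and why no dense method can ever reach the ~2M-node scale test.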
### PPI (3,890 nodes, 50 classes)

Protein-protein interaction graph from DGL (Zitnik & Leskovec, 2017). 3,890 proteins, 76,584 edges, 50 functional classes:
| Algorithm | Accuracy | Macro F1 | Time | Memory |
|---|---|---|---|---|
| Cleora | 1.000 | 1.000 | 1.23s | 21 MB |
| RandNE | 0.073 | 0.070 | 0.07s | 40 MB |
| ProNE | 0.023 | 0.021 | 1.45s | 64 MB |
| HOPE | Timed Out | – | – | – |
| NetMF | OOM | – | – | – |
| GraRep | OOM | – | – | – |
| DeepWalk | Timed Out | – | – | – |
| Node2Vec | Timed Out | – | – | – |
### roadNet-CA (1,965,206 nodes, speed/memory only)
California road network from SNAP (~2M nodes, ~5.5M edges). No ground-truth community labels, so only speed and memory metrics are reported:
| Algorithm | Time | Memory |
|---|---|---|
| Cleora | 31.5s | 4,129 MB |
| RandNE | OOM | – |
| ProNE | OOM | – |
| HOPE | OOM | – |
| NetMF | OOM | – |
| GraRep | OOM | – |
| DeepWalk | OOM | – |
| Node2Vec | OOM | – |
## Speed Comparison
Embedding time across real datasets (256 dim). Dashed bars indicate algorithms that failed (OOM or Timed Out):
## Memory Usage
Peak memory footprint per algorithm across real datasets (256 dim). Dashed bars indicate OOM failures:
## Accuracy vs Speed Tradeoff
Scatter plot showing how each algorithm trades off accuracy against embedding time across all real datasets with labels (256 dim):
## When to Use What
## Methodology
- Datasets: All datasets are downloaded from their canonical academic sources at runtime. ego-Facebook and roadNet-CA from SNAP. Cora, CiteSeer, and PubMed from Planetoid (Yang et al., ICML 2016). PPI from DGL (Zitnik & Leskovec, Bioinformatics 2017). No synthetic data is used.
- Algorithms: Cleora, ProNE, RandNE, HOPE, NetMF, GraRep, DeepWalk, and Node2Vec. All use 256-dimensional embeddings. Cleora uses 40 iterations; all other algorithms use default hyperparameters.
- Labels: ego-Facebook uses Louvain community detection (seed=42). Cora/CiteSeer/PubMed use paper category labels. PPI uses functional class labels.
- Evaluation: Nearest Centroid classifier with 80/20 train/test split (seed=42).
- Failure modes: "OOM" means the process was killed by the OS due to exceeding available memory. "Timed Out" means the algorithm exceeded a 90-second time budget. Both indicate that the algorithm is not feasible for that dataset size.
- Hardware: All benchmarks run on the same machine (Replit cloud instance, shared vCPUs, limited RAM). Times are wall-clock. Memory is peak allocation as measured by tracemalloc (Python heap allocations, not full process RSS).
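The evaluation step above can be sketched as follows, assuming scikit-learn (`evaluate` is our own wrapper for illustration, not the benchmark harness itself):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid
from sklearn.metrics import accuracy_score, f1_score

def evaluate(embeddings: np.ndarray, labels: np.ndarray, seed: int = 42):
    """Score embeddings with a Nearest Centroid classifier on an 80/20 split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.2, random_state=seed)
    pred = NearestCentroid().fit(X_tr, y_tr).predict(X_te)
    return accuracy_score(y_te, pred), f1_score(y_te, pred, average="macro")
```

Nearest Centroid is deliberately weak as a downstream model: it has no hyperparameters to tune, so the reported accuracy and macro F1 reflect embedding quality rather than classifier capacity.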