Changelog
Complete history of pycleora — from the original Synerise Cleora Rust implementation to the current full-featured Python graph embedding library.
v3.2.0
New Rust-native Full Embed Loop (embed_fast)
The entire embedding pipeline — initialization, propagation, normalization, and iteration — now runs inside a single Rust call. Eliminates Python↔Rust boundary crossing on every iteration. 3.7× faster on roadNet (2M nodes): 15.8s → 4.3s. 1.7× faster on Cora (2.5K nodes).
New Embedding Whitening (whiten_embeddings)
Post-processing that mean-centers and decorrelates embedding dimensions via eigendecomposition. Boosts node classification accuracy on Cora from 0.26 to 0.70; combined with multiscale, it reaches 0.83. Available via the whiten=True parameter of embed().
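A minimal NumPy sketch of what such a whitening step can look like, assuming PCA-style whitening (the changelog does not say which variant is used). The function name `whiten`, the `eps` parameter, and the toy data are illustrative assumptions, not pycleora's implementation.

```python
import numpy as np

def whiten(emb, eps=1e-8):
    """PCA-whitening sketch: mean-center, then decorrelate and rescale."""
    centered = emb - emb.mean(axis=0, keepdims=True)
    # Covariance between embedding dimensions, shape (dim, dim).
    cov = centered.T @ centered / max(len(emb) - 1, 1)
    # Symmetric eigendecomposition of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate onto the eigenvectors and scale each axis to unit variance.
    return centered @ eigvecs / np.sqrt(np.maximum(eigvals, 0.0) + eps)

rng = np.random.default_rng(0)
# Toy correlated "embeddings": 500 entities, 4 dimensions (made-up data).
mixing = np.array([[2.0, 1.0, 0.0, 0.0],
                   [0.0, 1.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0, 0.5],
                   [0.0, 0.0, 0.0, 1.0]])
raw = rng.normal(size=(500, 4)) @ mixing
white = whiten(raw)
white_cov = np.cov(white, rowvar=False)  # close to the identity matrix
```

After whitening, every dimension has zero mean and unit variance and the dimensions are uncorrelated, so no single direction dominates a downstream classifier.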
New Residual Connections
Mix propagated embeddings with previous iteration: emb = (1-α)·propagated + α·prev. Prevents over-smoothing on deep iterations. Parameter: residual_weight in embed().
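The mixing formula above, written out as a small NumPy sketch. The helper name `residual_step` and the toy 3-node path graph are assumptions for illustration; only the formula and the `residual_weight` name come from the changelog.

```python
import numpy as np

def residual_step(transition, prev, residual_weight):
    """One propagation step blended with the previous iterate."""
    propagated = transition @ prev
    # emb = (1 - alpha) * propagated + alpha * prev
    return (1.0 - residual_weight) * propagated + residual_weight * prev

# Row-normalized transition matrix of the path graph 0 - 1 - 2.
transition = np.array([[0.0, 1.0, 0.0],
                       [0.5, 0.0, 0.5],
                       [0.0, 1.0, 0.0]])
prev = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 0.0]])
mixed = residual_step(transition, prev, residual_weight=0.25)
# mixed == [[0.25, 0.75], [0.75, 0.25], [0.25, 0.75]]
```

With residual_weight=0 this reduces to plain propagation; with residual_weight=1 the embeddings never change.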
New Convergence-Based Early Stopping
Automatically detects when embeddings stabilize (RMSE between iterations drops below threshold). Saves compute on graphs that converge before max_iterations. Parameter: convergence_threshold in embed().
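A sketch of how RMSE-based early stopping can work, under the assumption that embeddings are L2-normalized after every step. The function name, default values, and toy transition matrix are illustrative; the actual stopping rule inside embed() may differ in detail.

```python
import numpy as np

def propagate_until_stable(transition, emb, max_iterations=100,
                           convergence_threshold=1e-4):
    """Propagate until the RMSE between consecutive iterates is small."""
    iterations_run = 0
    for iterations_run in range(1, max_iterations + 1):
        new_emb = transition @ emb
        # L2-normalize rows so consecutive iterates are on the same scale.
        new_emb /= np.maximum(np.linalg.norm(new_emb, axis=1, keepdims=True), 1e-12)
        rmse = float(np.sqrt(np.mean((new_emb - emb) ** 2)))
        emb = new_emb
        if rmse < convergence_threshold:
            break  # embeddings have stabilized; skip the remaining iterations
    return emb, iterations_run

# Lazy random walk (self-loop weight 0.5) on the path graph 0 - 1 - 2.
transition = np.array([[0.5, 0.5, 0.0],
                       [0.25, 0.5, 0.25],
                       [0.0, 0.5, 0.5]])
start = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
stable, iterations_run = propagate_until_stable(transition, start)
```

On this toy graph the loop stops well before max_iterations, which is the compute saving the entry describes.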
New Graph Statistics Module (pycleora.stats)
- graph_summary(): one-call overview of all graph metrics
- degree_distribution(): node degree histogram
- clustering_coefficient(): average local clustering coefficient
- connected_components(): list of components
- diameter(): graph diameter via BFS
- betweenness_centrality(): Brandes algorithm, top-K
- pagerank(): power iteration with dangling node handling
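To illustrate "power iteration with dangling node handling", here is a dependency-free sketch. It assumes the common convention of redistributing the rank of dangling nodes uniformly; the function signature and the dict-of-lists graph format are hypothetical and not the pycleora.stats API.

```python
def pagerank(out_edges, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank; dangling mass is spread uniformly."""
    nodes = sorted(set(out_edges) | {v for ts in out_edges.values() for v in ts})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges would otherwise leak away.
        dangling = sum(rank[u] for u in nodes if not out_edges.get(u))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for u, targets in out_edges.items():
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank

# "c" has no outgoing edges, so it is a dangling node.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": []})
```

Because the dangling mass is redistributed, the ranks keep summing to 1 on every iteration.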
New Graph Preprocessing Module (pycleora.preprocess)
- clean_graph(): remove self-loops, deduplicate edges, filter by degree
- largest_connected_component(): extract the LCC as a new SparseMatrix
- filter_by_degree(): degree-based edge filtering
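The cleaning steps can be sketched on a plain edge list as follows. This assumes undirected edges and a single filtering pass; `clean_edges` and its `min_degree` parameter are hypothetical names, and the real clean_graph() operates on a SparseMatrix rather than a Python list.

```python
from collections import Counter

def clean_edges(edges, min_degree=1):
    """Drop self-loops and duplicate undirected edges, then filter by degree."""
    # Sorting each pair makes (a, b) and (b, a) the same undirected edge.
    unique = {tuple(sorted((u, v))) for u, v in edges if u != v}
    degree = Counter(node for edge in unique for node in edge)
    # Single pass: keep edges whose endpoints both meet the degree threshold.
    return [e for e in sorted(unique) if all(degree[n] >= min_degree for n in e)]

raw_edges = [("a", "b"), ("b", "a"), ("a", "a"), ("b", "c"), ("c", "d")]
cleaned = clean_edges(raw_edges, min_degree=2)
# cleaned == [("b", "c")]
```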
New ANN Search Module (pycleora.search)
ANNIndex class for approximate nearest neighbor queries. HNSW backend via optional hnswlib, ball-tree fallback without dependencies. Same result format as find_most_similar. query() and query_vector() methods.
New Embedding Compression Module (pycleora.compress)
- pca_compress(): PCA-based dimensionality reduction
- random_projection(): fast random projection (Johnson-Lindenstrauss)
- product_quantize(): returns a PQIndex with reconstruct() and search()
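As an illustration of the Johnson-Lindenstrauss idea behind random projection, the sketch below uses a dense Gaussian projection matrix. That choice, the function name, and the `seed` parameter are assumptions; the library may use a different (for example sparse) projection.

```python
import numpy as np

def random_projection_sketch(emb, target_dim, seed=0):
    """Project embeddings to target_dim with a Gaussian random matrix."""
    rng = np.random.default_rng(seed)
    # Scaling by 1/sqrt(target_dim) keeps squared norms unchanged in expectation.
    projection = rng.normal(size=(emb.shape[1], target_dim)) / np.sqrt(target_dim)
    return emb @ projection

rng = np.random.default_rng(1)
emb = rng.normal(size=(50, 1024))
small = random_projection_sketch(emb, target_dim=256)
# Ratio of one pairwise distance after vs. before projection (near 1.0).
ratio = np.linalg.norm(small[0] - small[1]) / np.linalg.norm(emb[0] - emb[1])
```

Pairwise distances are approximately preserved, which is why similarity search still works on the 4x smaller vectors.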
New Embedding Alignment Module (pycleora.align)
- procrustes(): orthogonal Procrustes alignment via SVD
- cca_align(): Canonical Correlation Analysis into a shared space
- alignment_score(): mean cosine similarity after Procrustes
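Orthogonal Procrustes alignment via SVD is a standard technique, sketched here in NumPy. The helper names `procrustes_rotation` and `mean_cosine` are illustrative and are not the signatures of procrustes() or alignment_score().

```python
import numpy as np

def procrustes_rotation(source, target):
    """Orthogonal matrix R minimizing ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

def mean_cosine(a, b):
    """Mean cosine similarity between matching rows of a and b."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
target = rng.normal(size=(100, 8))
# Simulate a second run: same geometry under a hidden orthogonal transform.
hidden, _ = np.linalg.qr(rng.normal(size=(8, 8)))
source = target @ hidden
rotation = procrustes_rotation(source, target)
aligned = source @ rotation
score = mean_cosine(aligned, target)  # close to 1.0 for a perfect alignment
```

This is useful when two embedding runs (different seeds or graph snapshots) encode the same structure in differently rotated coordinate systems.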
New Ensemble Embeddings Module (pycleora.ensemble)
combine() — merge multiple embedding matrices via concat, mean, weighted average, or SVD reduction.
New Extended I/O: Pandas, SciPy, NumPy, Edge Lists
- from_pandas(df, source_col, target_col): build graph from DataFrame
- from_scipy_sparse(matrix): from scipy sparse adjacency
- from_numpy(adjacency_matrix): from dense numpy array
- from_edge_list(edges): from list of (source, target) tuples
Improved Rust Core: Double-Buffered Propagation
Instead of allocating a new embedding matrix on every iteration, two pre-allocated buffers are swapped. Reduces allocation churn and memory allocator overhead.
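The actual change lives in the Rust core; the NumPy analogue below only illustrates the buffer-swapping pattern. The function name and the toy matrices are assumptions.

```python
import numpy as np

def propagate_double_buffered(transition, emb, iterations):
    """Repeated propagation that reuses two preallocated buffers."""
    current = np.array(emb, dtype=np.float64)   # private copy of the input
    scratch = np.empty_like(current)            # second buffer, allocated once
    for _ in range(iterations):
        np.matmul(transition, current, out=scratch)  # write result in place
        current, scratch = scratch, current          # swap roles, no allocation
    return current

transition = np.array([[0.5, 0.5, 0.0],
                       [0.25, 0.5, 0.25],
                       [0.0, 0.5, 0.5]])
emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
fast = propagate_double_buffered(transition, emb, iterations=3)
naive = transition @ (transition @ (transition @ emb))  # allocates every step
```

Both versions compute the same result; the double-buffered loop simply never allocates inside the iteration.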
Improved Rust Core: Faster Initialization Hashing
Replaced SipHash (DefaultHasher) with FxHash in init_value(). FxHash runs at ~0.3 cycles/byte vs SipHash's ~4 cycles/byte, making hashing during initialization roughly 10× faster on large graphs.
Improved Rust Core: GIL Release During Embedding
py.allow_threads() releases Python's GIL during the entire Rust embedding computation, enabling true multi-threaded parallelism.
Improved Rust Core: Vectorization-Friendly Inner Loop
SpMM kernel rewritten with direct slice access instead of ndarray iterators, enabling better auto-vectorization by LLVM.
v3.1.0
New Scikit-learn Compatible API (CleoraEmbedder)
New CleoraEmbedder class with fit(), transform(), fit_transform(), get_params(), and set_params() methods. Drop-in compatible with scikit-learn pipelines and grid search workflows. Transform preserves entity order.
New Use Cases & Tutorials Page
6 documented use cases with working code: recommendations, fraud detection, social networks, knowledge graphs, entity resolution, drug discovery. 3 step-by-step tutorials.
New Architecture Deep-Dive Page
Technical documentation: Markov propagation types, convergence properties, Rust/PyO3 architecture, memory model, and a comparison with random-walk methods.
Improved Interactive Benchmark Visualizations
Chart.js-powered charts: accuracy bars, speed comparison, memory usage, scatter, cross-validation with error bars. Chart/table toggle.
v3.0.0
New MLP Classifier & Label Propagation
Pure numpy/scipy classifiers for evaluating embedding quality. MLP for supervised node classification, Label Propagation for semi-supervised. No PyTorch required.
New Graph Sampling Module (pycleora.sampling)
7 methods: sample_nodes, sample_edges, sample_neighborhood (k-hop), sample_subgraph, graphsaint_sample (mini-batching), negative_sampling, train_test_split_edges.
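As an example of one of these methods, k-hop neighborhood sampling can be sketched as a depth-limited breadth-first search. The function name and the dict-based adjacency format are hypothetical; sample_neighborhood itself works on pycleora graphs and may have a different signature.

```python
from collections import deque

def k_hop_neighborhood(adjacency, seed, k):
    """Return all nodes within k hops of seed (seed included)."""
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if hops[node] == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return set(hops)

# Path graph a - b - c - d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
two_hop = k_hop_neighborhood(path, "a", k=2)
# two_hop == {"a", "b", "c"}
```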
New Cross-Validation
k-fold cross-validation for node classification. Per-fold accuracy/F1 with mean±std.
New Enhanced Evaluation Metrics
- map_at_k(): Mean Average Precision at K
- ndcg_at_k(): Normalized Discounted Cumulative Gain at K
- adjusted_rand_index()
- silhouette_score()
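For reference, the two ranking metrics can be computed as below for a single query with binary relevance. Normalizing average precision by min(|relevant|, k) is one common convention and is an assumption here, as are the function names; MAP@K is the mean of this value over all queries.

```python
import math

def average_precision_at_k(ranked, relevant, k):
    """Average precision over the top-k ranked items (binary relevance)."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this hit position
    return score / min(len(relevant), k) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["b", "x", "a", "y"]
relevant = {"a", "b"}
ap = average_precision_at_k(ranked, relevant, k=3)   # (1/1 + 2/3) / 2 = 0.8333...
ndcg = ndcg_at_k(ranked, relevant, k=3)              # 1.5 / (1 + 1/log2(3))
```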
New Hyperparameter Tuning (pycleora.tuning)
grid_search() and random_search() with automatic evaluation. Custom param grids and scoring functions.
New CLI Tool (pycleora)
Command-line interface: pycleora embed|info|benchmark|similar. Supports all algorithms, output formats (npz/txt), dataset operations.
New Benchmarking Suite (pycleora.benchmark)
Automated comparison of 8 algorithms across 6+ datasets. Formatted comparison tables.
New Graph Generators (pycleora.generators)
5 models: Erdos-Renyi, Barabasi-Albert, Stochastic Block Model, Planted Partition, Watts-Strogatz.
Improved Test Suite
81 tests covering all modules. Full suite runs in under 4 seconds.
Fixed Cross-validate with small labeled sets
Fixed Barabasi-Albert crash for m >= num_nodes
Fixed nDCG edge alignment with MAP@K
v2.1.0
New Heterogeneous Graph Support (pycleora.hetero)
HeteroGraph class with add_node_type(), add_edge_type(), embed_per_relation(), embed_metapath(). Multi-type nodes and edges.
New Attention-Weighted Embedding (embed_with_attention)
Temperature-based softmax attention weighting during propagation. Learns neighbor importance from embedding similarity.
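One plausible reading of this entry, sketched in NumPy: each node weights its neighbors by a softmax over cosine similarities divided by a temperature. The function name, the dense adjacency input, and the exact scoring are assumptions; embed_with_attention may differ.

```python
import numpy as np

def attention_step(adjacency, emb, temperature=1.0):
    """One propagation step with softmax attention over neighbors.

    Assumes every node has at least one neighbor.
    """
    unit = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    scores = unit @ unit.T / temperature                 # scaled cosine similarity
    scores = np.where(adjacency > 0, scores, -np.inf)    # mask non-neighbors
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)        # each row sums to 1
    return weights @ emb, weights

# Star graph: node 0 is connected to nodes 1 and 2.
adjacency = np.array([[0, 1, 1],
                      [1, 0, 0],
                      [1, 0, 0]])
emb = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
_, soft = attention_step(adjacency, emb, temperature=1.0)
_, sharp = attention_step(adjacency, emb, temperature=0.1)
```

Node 0 is far more similar to node 1 than to node 2, so node 1 receives most of the weight; lowering the temperature sharpens that preference.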
Improved Visualization Module
Community-colored t-SNE/UMAP/PCA plots. Custom colormaps. PNG/SVG export.
v2.0.0
Breaking New Algorithm API
All alternative algorithms moved to pycleora.algorithms. Unified parameter naming across all embedding functions.
New 7 Embedding Algorithms
- embed_deepwalk(): random walks + SVD
- embed_node2vec(): biased walks with p/q parameters
- embed_netmf(): Network Matrix Factorization
- embed_prone(): ProNE, sparse matrix factorization with spectral propagation
- embed_randne(): random projection with power iteration
- embed_grarep(): multi-scale transition matrix factorization
- embed_hope(): Higher-Order Proximity (Katz similarity)
New Node Classification (pycleora.classify)
mlp_classify(), label_propagation(), label_propagation_predict(). Accuracy and macro/weighted F1.
New Link Prediction Metrics
AUC, MRR, Hits@K scoring via link_prediction_scores().
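The three metrics follow standard definitions, sketched below without dependencies. The helper names and input formats (score lists, 1-based ranks) are assumptions and do not reflect the signature of link_prediction_scores().

```python
def auc(pos_scores, neg_scores):
    """Probability that a true edge outscores a non-edge (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def mrr_and_hits(ranks, k=10):
    """ranks: 1-based rank of each true edge among its candidate edges."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits

auc_value = auc(pos_scores=[0.9, 0.6], neg_scores=[0.7, 0.1])   # 3 of 4 pairs: 0.75
mrr, hits_at_3 = mrr_and_hits([1, 2, 4, 20], k=3)               # 0.45 and 0.5
```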
New Community Detection
detect_communities_kmeans(), detect_communities_spectral(), detect_communities_louvain(), modularity().
Improved Louvain Community Detection
Binary adjacency fixes, proper self-loop removal, correct modularity computation.
v1.2.0
New Multi-Scale Embedding (embed_multiscale)
Concatenation of embeddings at different propagation depths (e.g. scales=[1,2,4,8]). Captures both local and global graph structure in one vector.
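A NumPy sketch of the concatenation idea, assuming each scale is simply the embedding after that many propagation steps with L2 row normalization. The function names and the toy graph are illustrative; embed_multiscale may normalize or weight the blocks differently.

```python
import numpy as np

def l2_normalize(emb):
    return emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)

def multiscale_sketch(transition, init, scales=(1, 2, 4)):
    """Concatenate row-normalized embeddings taken at several depths."""
    blocks, emb, depth_done = [], init, 0
    for depth in sorted(scales):
        # Continue propagating from the previous scale instead of restarting.
        for _ in range(depth - depth_done):
            emb = l2_normalize(transition @ emb)
        depth_done = depth
        blocks.append(emb)
    return np.concatenate(blocks, axis=1)

transition = np.array([[0.5, 0.5, 0.0],
                       [0.25, 0.5, 0.25],
                       [0.0, 0.5, 0.5]])
init = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
multi = multiscale_sketch(transition, init, scales=(1, 2, 4))
# multi has shape (3, 6): three scales times two dimensions each.
```

Shallow blocks keep local neighborhood detail while deeper blocks reflect wider graph structure, at the cost of a vector that grows with the number of scales.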
New Node Feature Integration (embed_with_node_features)
Weighted combination of structural Cleora embeddings with external node attributes.
New Edge Feature Embedding (embed_edge_features)
Multi-dimensional features per edge with concat/mean/edge_only combination modes.
New I/O Module (pycleora.io_utils)
NetworkX/PyG/DGL export (to_networkx, to_pyg_data, to_dgl_graph). Save/load in NPZ, CSV, TSV, Parquet formats. Edge list export.
New Visualization Module (pycleora.viz)
reduce_dimensions() (t-SNE/UMAP/PCA), plot_embeddings(), visualize().
v1.1.0
New Dynamic Graph Updates
update_graph() — add new edges to an existing graph. remove_edges() — remove edges. Both rebuild the SparseMatrix with updated connectivity.
New Inductive Embedding (embed_inductive)
Embed new nodes that weren't in the training graph by propagating from existing trained embeddings.
New Streaming Embedding (embed_streaming)
Process edge batches incrementally with per-batch callbacks. For graphs that arrive in chunks.
New Weighted Edges (embed_weighted)
Build and embed graphs with explicit edge weights.
New Directed Graphs (embed_directed)
Embedding for directed edge relationships.
New Supervised Refinement (supervised_refine)
Fine-tune embeddings using labeled positive/negative entity pairs with margin-based loss.
New Link Prediction (predict_links)
Predict top-K most likely missing edges based on embedding similarity.
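A brute-force sketch of similarity-based link prediction: score every non-adjacent node pair by cosine similarity and keep the best K. The function name, integer node ids, and return format are assumptions rather than the predict_links API, and the quadratic loop is only suitable for small graphs.

```python
import numpy as np

def predict_links_sketch(emb, existing_edges, top_k=2):
    """Return the top_k non-adjacent pairs as (i, j, cosine_similarity)."""
    unit = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    sim = unit @ unit.T
    known = {frozenset(edge) for edge in existing_edges}
    n = len(emb)
    candidates = [(float(sim[i, j]), i, j)
                  for i in range(n) for j in range(i + 1, n)
                  if frozenset((i, j)) not in known]
    candidates.sort(reverse=True)  # highest similarity first
    return [(i, j, s) for s, i, j in candidates[:top_k]]

emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
predicted = predict_links_sketch(emb, existing_edges=[(0, 1)], top_k=2)
# Most likely missing edge: (2, 3), followed by (1, 3).
```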
Improved Normalization Options
L1, L2, spectral, and "none" normalization modes for all embedding functions.
v1.0.0
New Python API over Rust Core
First full Python package wrapping the Cleora Rust engine via PyO3. embed() function with configurable dimensions, iterations, propagation type, normalization, seed, and worker count.
New SparseMatrix Class
Core graph representation with from_iterator(), from_files(), entity lookup, neighbor access, CSR export, serialization (pickle). Properties: entity_ids, entity_degrees, num_entities, num_edges.
New Similarity Search
cosine_similarity() and find_most_similar() for top-K entity retrieval.
New Column Definition System
Flexible column configuration: complex (multi-entity), reflexive (self-relations), transient (intermediate only). Example: "complex::reflexive::user product".
New 14 Built-in Datasets (pycleora.datasets)
From Karate Club (34 nodes) to com-Friendster (65M nodes). SNAP datasets auto-downloaded and cached.
New GPU Propagation (propagate_gpu)
Optional PyTorch-based GPU acceleration for the propagation step.
v0.1.0 — Original Cleora
The original Cleora repository by Synerise. A pure Rust CLI tool for graph embedding via Markov propagation on hypergraph structures.
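The core idea, Markov propagation with left or symmetric normalization followed by L2 row normalization, can be sketched in NumPy as follows. This is a simplification under stated assumptions: it uses a plain dense adjacency matrix rather than a hypergraph expansion, a seeded random generator stands in for the deterministic hash-based initialization, and all names are illustrative.

```python
import numpy as np

def markov_matrix(adjacency, kind="left"):
    """Left (D^-1 A) or symmetric (D^-1/2 A D^-1/2) Markov normalization.

    Assumes every node has at least one edge.
    """
    degree = adjacency.sum(axis=1)
    if kind == "left":
        return adjacency / degree[:, None]
    inv_sqrt = 1.0 / np.sqrt(degree)
    return adjacency * inv_sqrt[:, None] * inv_sqrt[None, :]

def cleora_sketch(adjacency, dim=4, iterations=3, kind="left", seed=0):
    rng = np.random.default_rng(seed)
    # Stand-in for hash-based init: uniform values in [-1, 1].
    emb = rng.uniform(-1.0, 1.0, size=(adjacency.shape[0], dim))
    transition = markov_matrix(adjacency, kind)
    for _ in range(iterations):
        emb = transition @ emb                                   # propagate
        emb /= np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    return emb

# 4-cycle with one chord between nodes 1 and 2.
adjacency = np.array([[0.0, 1.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0, 1.0],
                      [1.0, 1.0, 0.0, 1.0],
                      [0.0, 1.0, 1.0, 0.0]])
left = markov_matrix(adjacency, "left")
symmetric = markov_matrix(adjacency, "symmetric")
embeddings = cleora_sketch(adjacency, dim=4, iterations=3)
```

No training loop or negative sampling is involved: the embedding is the result of a fixed number of sparse matrix products, which is what makes the method fast.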
What the original repo provided:
- Rust binary (cleora): command-line only, no Python API
- SparseMatrix struct with left and symmetric Markov propagation
- Deterministic hash-based initialization
- File-based input only (TSV/CSV) via from_files()
- Column definition system (complex, reflexive, transient)
- Parallel graph building via Crossbeam channels + Rayon
- Output: NPY embedding files
- L2 normalization only
- No Python bindings, no pip install, no evaluation metrics, no algorithms
Rust crates used: rayon, ndarray, crossbeam, dashmap, smallvec, twox-hash, rustc-hash, itertools
pycleora vs Original Cleora — Full Comparison
| Capability | Original Cleora | pycleora 3.2 |
|---|---|---|
| Language | Rust CLI only | Python API + Rust core |
| Install | Compile from source | pip install pycleora |
| Input formats | TSV/CSV files | Files, iterators, pandas, scipy, numpy, NetworkX, edge lists |
| Output formats | NPY files | NPZ, CSV, TSV, Parquet, NetworkX, PyG, DGL |
| Embedding methods | 1 (Cleora) | 8 (Cleora + 7 alternatives: ProNE, RandNE, HOPE, NetMF, GraRep, DeepWalk, Node2Vec) + Cleora variants (multiscale, attention, weighted, directed, inductive, streaming) |
| Normalization | L2 only | L2, L1, spectral, none |
| Propagation types | Left, symmetric | Left, symmetric + residual connections |
| Post-processing | None | Whitening, PCA compression, random projection, product quantization |
| Evaluation | None | 12 metrics (accuracy, F1, AUC, MRR, Hits@K, MAP@K, nDCG, ARI, silhouette, modularity, cross-validation) |
| Classification | None | MLP, label propagation |
| Community detection | None | K-means, spectral, Louvain + modularity |
| Graph analysis | None | Degree distribution, clustering coefficient, diameter, PageRank, betweenness centrality, connected components |
| Preprocessing | None | Self-loop removal, deduplication, degree filtering, LCC extraction |
| Similarity search | None | Brute force + ANN (HNSW/ball tree) |
| Alignment | None | Procrustes, CCA, alignment scoring |
| Ensemble | None | Concat, mean, weighted, SVD combination |
| Visualization | None | t-SNE, UMAP, PCA with matplotlib |
| Sampling | None | 7 methods (node, edge, neighborhood, subgraph, GraphSAINT, negative sampling, train/test edge split) |
| Graph generators | None | 5 models (ER, BA, SBM, planted partition, Watts-Strogatz) |
| Datasets | None (BYO files) | 14 built-in (34 to 65M nodes, auto-download) |
| Tuning | None | Grid search, random search |
| Benchmarking | None | Automated multi-algorithm multi-dataset benchmarks |
| Heterogeneous graphs | None | HeteroGraph class with per-relation and metapath embedding |
| Dynamic graphs | None | Add/remove edges, inductive embedding, streaming |
| Sklearn integration | None | CleoraEmbedder with fit/transform/get_params |
| GPU support | None | Optional PyTorch-based propagation |
| Serialization | None | Full pickle support for SparseMatrix |
| CLI | Rust binary | Python CLI (pycleora embed\|info\|benchmark\|similar) |
| Performance (roadNet 2M) | ~16s (Rust CLI) | 4.3s (3.7× faster via optimized Rust loop) |
| Public functions | ~5 | 100+ |