Changelog

Complete history of pycleora — from the original Synerise Cleora Rust implementation to the current full-featured Python graph embedding library.

v3.2.0

March 2026 — Performance & Ecosystem Release

New Rust-native Full Embed Loop (embed_fast)

The entire embedding pipeline — initialization, propagation, normalization, and iteration — now runs inside a single Rust call. Eliminates Python↔Rust boundary crossing on every iteration. 3.7× faster on roadNet (2M nodes): 15.8s → 4.3s. 1.7× faster on Cora (2.5K nodes).

New Embedding Whitening (whiten_embeddings)

Post-processing that mean-centers and decorrelates embedding dimensions via eigendecomposition. Boosts node classification accuracy from 0.26 → 0.70 on Cora. Combined with multiscale, achieves 0.83 accuracy. Available as whiten=True parameter in embed().
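
As a rough illustration of the whitening step described above, the following NumPy sketch mean-centers a toy embedding matrix and decorrelates its dimensions through an eigendecomposition of the covariance matrix. The function name `whiten`, the `eps` regularizer, and the toy data are assumptions made for this example; this is not pycleora's actual implementation.

```python
import numpy as np

def whiten(emb, eps=1e-8):
    """Mean-center and decorrelate embedding dimensions (PCA whitening)."""
    centered = emb - emb.mean(axis=0, keepdims=True)
    # Covariance between embedding dimensions, shape (dim, dim).
    cov = centered.T @ centered / max(emb.shape[0] - 1, 1)
    # eigh suits the symmetric covariance matrix and returns real eigenpairs.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate onto the eigenvectors, then rescale every axis to unit variance.
    return (centered @ eigvecs) / np.sqrt(np.maximum(eigvals, 0.0) + eps)

rng = np.random.default_rng(0)
# Hypothetical mixing matrix that makes the toy dimensions correlated.
mixing = np.array([[2.0, 1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.5, 0.0],
                   [0.0, 0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.0, 0.5]])
raw = rng.normal(size=(500, 4)) @ mixing + 3.0  # correlated, off-center toy embeddings
white = whiten(raw)
cov_after = np.cov(white, rowvar=False)  # close to the identity matrix
```

After whitening, every dimension has roughly unit variance and the off-diagonal covariances are close to zero.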

New Residual Connections

Mix propagated embeddings with previous iteration: emb = (1-α)·propagated + α·prev. Prevents over-smoothing on deep iterations. Parameter: residual_weight in embed().
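
The mixing formula above can be sketched in a few lines of NumPy. The helper `residual_mix`, the 3-node path graph, and the starting embeddings are illustrative assumptions; only the formula and the `residual_weight` name come from this entry.

```python
import numpy as np

def residual_mix(prev, propagated, residual_weight):
    """emb = (1 - alpha) * propagated + alpha * prev, with alpha = residual_weight."""
    return (1.0 - residual_weight) * propagated + residual_weight * prev

# Row-normalized adjacency of the path graph 0 - 1 - 2 (a stand-in transition matrix).
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
prev = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])

propagated = P @ prev  # one propagation step: each node averages its neighbors
mixed = residual_mix(prev, propagated, residual_weight=0.25)
```

With a weight of 0 the result is the plain propagated embedding; with a weight of 1 the previous iteration is returned unchanged, so larger weights keep more of each node's own signal.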

New Convergence-Based Early Stopping

Automatically detects when embeddings stabilize (RMSE between iterations drops below threshold). Saves compute on graphs that converge before max_iterations. Parameter: convergence_threshold in embed().
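
A minimal sketch of RMSE-based early stopping, assuming a plain linear propagation step without normalization. The function `propagate_until_stable`, the lazy random walk on a 4-node path graph, and the default values are hypothetical; only the parameter names `max_iterations` and `convergence_threshold` mirror this entry.

```python
import numpy as np

def propagate_until_stable(P, emb, max_iterations=200, convergence_threshold=1e-4):
    """Propagate until the RMSE between consecutive iterations drops below the threshold."""
    iterations_run = 0
    for _ in range(max_iterations):
        new_emb = P @ emb
        rmse = float(np.sqrt(np.mean((new_emb - emb) ** 2)))
        emb = new_emb
        iterations_run += 1
        if rmse < convergence_threshold:
            break  # embeddings have stabilized, skip the remaining iterations
    return emb, iterations_run

# Lazy random walk on the path graph 0 - 1 - 2 - 3 (the self-loops make it aperiodic).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = 0.5 * (np.eye(4) + A / A.sum(axis=1, keepdims=True))

rng = np.random.default_rng(0)
final, iterations_run = propagate_until_stable(P, rng.normal(size=(4, 8)))
```

On this toy graph the loop stops well before the iteration cap, which is the compute saving the entry describes.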

New Graph Statistics Module (pycleora.stats)

  • graph_summary() — one-call overview of all graph metrics
  • degree_distribution() — node degree histogram
  • clustering_coefficient() — average local clustering coefficient
  • connected_components() — list of components
  • diameter() — graph diameter via BFS
  • betweenness_centrality() — Brandes algorithm, top-K
  • pagerank() — power iteration with dangling node handling
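
To illustrate what "power iteration with dangling node handling" means, here is a small dense-matrix sketch. It assumes the common convention of spreading the rank of dangling nodes (nodes without outgoing edges) uniformly over all nodes; the signature and defaults are assumptions for this example and may differ from pycleora.stats.pagerank().

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10, max_iter=200):
    """PageRank by power iteration on a dense adjacency matrix (rows are sources)."""
    n = adj.shape[0]
    out_degree = adj.sum(axis=1)
    dangling = out_degree == 0
    # Row-normalize; dangling rows stay all-zero and are handled separately.
    transition = adj / np.where(dangling, 1.0, out_degree)[:, None]
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # Rank sitting on dangling nodes is redistributed uniformly.
        dangling_mass = rank[dangling].sum()
        new_rank = damping * (rank @ transition + dangling_mass / n) + (1.0 - damping) / n
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    return rank

# Directed toy graph: 0 -> 1, 0 -> 2, 1 -> 2; node 2 has no outgoing edges.
adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [0, 0, 0]], dtype=float)
scores = pagerank(adj)
```

Without the dangling-node term, rank would leak out of the graph at node 2 and the scores would no longer sum to 1.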

New Graph Preprocessing Module (pycleora.preprocess)

  • clean_graph() — remove self-loops, deduplicate edges, filter by degree
  • largest_connected_component() — extract LCC as new SparseMatrix
  • filter_by_degree() — degree-based edge filtering

New ANN Search Module (pycleora.search)

ANNIndex class for approximate nearest neighbor queries. HNSW backend via optional hnswlib, ball-tree fallback without dependencies. Same result format as find_most_similar. query() and query_vector() methods.

New Embedding Compression Module (pycleora.compress)

  • pca_compress() — PCA-based dimensionality reduction
  • random_projection() — fast random projection (Johnson-Lindenstrauss)
  • product_quantize() — PQIndex with reconstruct() and search()

New Embedding Alignment Module (pycleora.align)

  • procrustes() — orthogonal Procrustes alignment via SVD
  • cca_align() — Canonical Correlation Analysis into shared space
  • alignment_score() — mean cosine similarity after Procrustes
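
The orthogonal Procrustes step can be sketched with NumPy alone: the rotation that best maps one embedding matrix onto another comes from the SVD of their cross-product. The helper name `procrustes_rotation` and the synthetic data are assumptions for this example, and the score below is just a mean cosine similarity computed after alignment, not necessarily the exact formula used by alignment_score().

```python
import numpy as np

def procrustes_rotation(source, target):
    """Orthogonal R minimizing ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

rng = np.random.default_rng(0)
source = rng.normal(size=(100, 3))
# Build a target that is an exact orthogonal transform of the source.
true_rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
target = source @ true_rotation

R = procrustes_rotation(source, target)
aligned = source @ R

# Mean cosine similarity between aligned rows and target rows.
cosine = np.sum(aligned * target, axis=1) / (
    np.linalg.norm(aligned, axis=1) * np.linalg.norm(target, axis=1))
score = float(cosine.mean())
```

Because the target here is an exact rotation of the source, the recovered matrix matches the true one and the score is essentially 1; real embeddings from separate runs would score lower.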

New Ensemble Embeddings Module (pycleora.ensemble)

combine() — merge multiple embedding matrices via concat, mean, weighted average, or SVD reduction.

New Extended I/O: Pandas, SciPy, NumPy, Edge Lists

  • from_pandas(df, source_col, target_col) — build graph from DataFrame
  • from_scipy_sparse(matrix) — from scipy sparse adjacency
  • from_numpy(adjacency_matrix) — from dense numpy array
  • from_edge_list(edges) — from list of (source, target) tuples

Improved Rust Core: Double-Buffered Propagation

Instead of allocating a new embedding matrix on every iteration, two pre-allocated buffers are swapped. Avoids per-iteration allocation and the associated memory allocator overhead.

Improved Rust Core: Faster Initialization Hashing

Replaced SipHash (DefaultHasher) with FxHash in init_value(). FxHash runs at ~0.3 cycles/byte vs SipHash's ~4 cycles/byte, making the hashing step of initialization roughly 10× faster on large graphs.

Improved Rust Core: GIL Release During Embedding

py.allow_threads() releases Python's GIL during the entire Rust embedding computation, enabling true multi-threaded parallelism.

Improved Rust Core: Vectorization-Friendly Inner Loop

SpMM kernel rewritten with direct slice access instead of ndarray iterators, enabling better auto-vectorization by LLVM.

v3.1.0

March 2026

New Scikit-learn Compatible API (CleoraEmbedder)

New CleoraEmbedder class with fit(), transform(), fit_transform(), get_params(), and set_params() methods. Drop-in compatible with scikit-learn pipelines and grid search workflows. Transform preserves entity order.

New Use Cases & Tutorials Page

6 documented use cases with working code: recommendations, fraud detection, social networks, knowledge graphs, entity resolution, drug discovery. 3 step-by-step tutorials.

New Architecture Deep-Dive Page

Technical documentation: Markov propagation types, convergence properties, Rust/PyO3 architecture, memory model, and a comparison with random-walk methods.

Improved Interactive Benchmark Visualizations

Chart.js-powered charts: accuracy bars, speed comparison, memory usage, scatter, cross-validation with error bars. Chart/table toggle.

v3.0.0

March 2026 — Major Release

New MLP Classifier & Label Propagation

Pure numpy/scipy classifiers for evaluating embedding quality. MLP for supervised node classification, Label Propagation for semi-supervised classification. No PyTorch required.

New Graph Sampling Module (pycleora.sampling)

7 methods: sample_nodes, sample_edges, sample_neighborhood (k-hop), sample_subgraph, graphsaint_sample (mini-batching), negative_sampling, train_test_split_edges.

New Cross-Validation

k-fold cross-validation for node classification. Per-fold accuracy/F1 with mean±std.

New Enhanced Evaluation Metrics

  • map_at_k() — Mean Average Precision at K
  • ndcg_at_k() — Normalized Discounted Cumulative Gain at K
  • adjusted_rand_index()
  • silhouette_score()
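
For readers unfamiliar with the ranking metrics above, this pure-Python sketch computes average precision at K (MAP@K is its mean over queries) and nDCG@K with binary relevance. The function names and the normalization by min(number of relevant items, K) are assumptions for this example; conventions vary between libraries, and pycleora's map_at_k() and ndcg_at_k() may differ.

```python
import math

def average_precision_at_k(ranked, relevant, k):
    """Average precision over the top-k of one ranked list (binary relevance)."""
    hits, score = 0, 0.0
    for position, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / position  # precision at this cut-off
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def ndcg_at_k(ranked, relevant, k):
    """DCG of the top-k divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(position + 1)
              for position, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

ranked = ["b", "a", "d", "c"]   # system output, best first
relevant = {"a", "c"}           # ground truth
ap = average_precision_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
```

Here only "a" appears in the top 3, at position 2, so both metrics are well below 1; a ranking that starts with "a" and "c" would score 1.0 on both.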

New Hyperparameter Tuning (pycleora.tuning)

grid_search() and random_search() with automatic evaluation. Custom param grids and scoring functions.

New CLI Tool (pycleora)

Command-line interface: pycleora embed|info|benchmark|similar. Supports all algorithms, output formats (npz/txt), dataset operations.

New Benchmarking Suite (pycleora.benchmark)

Automated comparison of 8 algorithms across 6+ datasets. Formatted comparison tables.

New Graph Generators (pycleora.generators)

5 models: Erdos-Renyi, Barabasi-Albert, Stochastic Block Model, Planted Partition, Watts-Strogatz.

Improved Test Suite

81 tests covering all modules. Full suite runs in under 4 seconds.

Fixed Cross-validate with small labeled sets

Fixed Barabasi-Albert crash for m >= num_nodes

Fixed nDCG edge alignment with MAP@K

v2.1.0

January 2026

New Heterogeneous Graph Support (pycleora.hetero)

HeteroGraph class with add_node_type(), add_edge_type(), embed_per_relation(), embed_metapath(). Multi-type nodes and edges.

New Attention-Weighted Embedding (embed_with_attention)

Temperature-based softmax attention weighting during propagation. Learns neighbor importance from embedding similarity.

Improved Visualization Module

Community-colored t-SNE/UMAP/PCA plots. Custom colormaps. PNG/SVG export.

v2.0.0

October 2025

Breaking New Algorithm API

All alternative algorithms moved to pycleora.algorithms. Unified parameter naming across all embedding functions.

New 7 Embedding Algorithms

  • embed_deepwalk() — random walks + SVD
  • embed_node2vec() — biased walks with p/q parameters
  • embed_netmf() — Network Matrix Factorization
  • embed_prone() — ProNE: sparse matrix factorization followed by spectral propagation
  • embed_randne() — Random projection with power iteration
  • embed_grarep() — Multi-scale transition matrix factorization
  • embed_hope() — High-Order Proximity preserved Embedding (Katz similarity)

New Node Classification (pycleora.classify)

mlp_classify(), label_propagation(), label_propagation_predict(). Accuracy and macro/weighted F1.

New Link Prediction Metrics

AUC, MRR, Hits@K scoring via link_prediction_scores().
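
As a brief illustration of two of these metrics, the sketch below ranks each true edge against a set of sampled negative edges and then computes MRR and Hits@K. The helper names, the tie-breaking rule, and the scores are hypothetical; this shows the definitions only and is not the code behind link_prediction_scores().

```python
def rank_of_positive(positive_score, negative_scores):
    """1-based rank of the true edge among its negatives (ties favor the true edge)."""
    return 1 + sum(1 for s in negative_scores if s > positive_score)

def mrr_and_hits(ranks, k):
    """Mean reciprocal rank and the fraction of queries ranked within the top k."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits

# One (true-edge score, negative-edge scores) pair per query.
queries = [
    (0.9, [0.1, 0.5, 0.3]),  # true edge ranked 1st
    (0.4, [0.8, 0.6, 0.1]),  # true edge ranked 3rd
    (0.7, [0.9, 0.2, 0.3]),  # true edge ranked 2nd
]
ranks = [rank_of_positive(pos, negs) for pos, negs in queries]
mrr, hits_at_2 = mrr_and_hits(ranks, k=2)
```

MRR rewards placing the true edge near the top of the list, while Hits@K only asks whether it appears within the first K candidates.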

New Community Detection

detect_communities_kmeans(), detect_communities_spectral(), detect_communities_louvain(), modularity().

Improved Louvain Community Detection

Binary adjacency fixes, proper self-loop removal, correct modularity computation.

v1.2.0

August 2025

New Multi-Scale Embedding (embed_multiscale)

Concatenation of embeddings at different propagation depths (e.g. scales=[1,2,4,8]). Captures both local and global graph structure in one vector.
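
The idea can be sketched as follows: propagate once up to the largest depth, keep a snapshot at every requested scale, and concatenate the snapshots. The function `multiscale_embed`, the per-step L2 row normalization, and the toy transition matrix are assumptions for this example, not the actual embed_multiscale() code.

```python
import numpy as np

def multiscale_embed(P, init, scales=(1, 2, 4, 8)):
    """Concatenate embeddings captured after each propagation depth in `scales`."""
    blocks, emb, steps_done = [], init, 0
    for depth in sorted(scales):
        for _ in range(depth - steps_done):
            emb = P @ emb
            # Assumed L2 row normalization after every propagation step.
            emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        steps_done = depth
        blocks.append(emb)
    return np.hstack(blocks)

# Lazy random walk on the path graph 0 - 1 - 2 - 3 as a stand-in transition matrix.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = 0.5 * (np.eye(4) + A / A.sum(axis=1, keepdims=True))

rng = np.random.default_rng(0)
multi = multiscale_embed(P, rng.normal(size=(4, 3)))  # shape (4, 3 * 4)
```

The output dimension grows with the number of scales: a 3-dimensional base embedding with four scales yields 12-dimensional vectors, where the early blocks reflect local neighborhoods and the later blocks reflect wider graph structure.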

New Node Feature Integration (embed_with_node_features)

Weighted combination of structural Cleora embeddings with external node attributes.

New Edge Feature Embedding (embed_edge_features)

Multi-dimensional features per edge with concat/mean/edge_only combination modes.

New I/O Module (pycleora.io_utils)

NetworkX/PyG/DGL export (to_networkx, to_pyg_data, to_dgl_graph). Save/load in NPZ, CSV, TSV, Parquet formats. Edge list export.

New Visualization Module (pycleora.viz)

reduce_dimensions() (t-SNE/UMAP/PCA), plot_embeddings(), visualize().

v1.1.0

June 2025

New Dynamic Graph Updates

update_graph() — add new edges to an existing graph. remove_edges() — remove edges. Both rebuild the SparseMatrix with updated connectivity.

New Inductive Embedding (embed_inductive)

Embed new nodes that weren't in the training graph by propagating from existing trained embeddings.

New Streaming Embedding (embed_streaming)

Process edge batches incrementally with per-batch callbacks. For graphs that arrive in chunks.

New Weighted Edges (embed_weighted)

Build and embed graphs with explicit edge weights.

New Directed Graphs (embed_directed)

Embedding for directed edge relationships.

New Supervised Refinement (supervised_refine)

Fine-tune embeddings using labeled positive/negative entity pairs with margin-based loss.

New Link Prediction (predict_links)

Predict top-K most likely missing edges based on embedding similarity.

Improved Normalization Options

L1, L2, spectral, and "none" normalization modes for all embedding functions.

v1.0.0

March 2025 — Initial pycleora Release

New Python API over Rust Core

First full Python package wrapping the Cleora Rust engine via PyO3. embed() function with configurable dimensions, iterations, propagation type, normalization, seed, and worker count.

New SparseMatrix Class

Core graph representation with from_iterator(), from_files(), entity lookup, neighbor access, CSR export, serialization (pickle). Properties: entity_ids, entity_degrees, num_entities, num_edges.

New Similarity Search

cosine_similarity() and find_most_similar() for top-K entity retrieval.

New Column Definition System

Flexible column configuration: complex (multi-entity), reflexive (self-relations), transient (intermediate only). Example: "complex::reflexive::user product".

New 14 Built-in Datasets (pycleora.datasets)

From Karate Club (34 nodes) to com-Friendster (65M nodes). SNAP datasets auto-downloaded and cached.

New GPU Propagation (propagate_gpu)

Optional PyTorch-based GPU acceleration for the propagation step.

v0.1.0 — Original Cleora

2021 — Synerise/Cleora (github.com/synerise/cleora)

The original Cleora repository by Synerise. A pure Rust CLI tool for graph embedding via Markov propagation on hypergraph structures.

What the original repo provided:

  • Rust binary (cleora) — command-line only, no Python API
  • SparseMatrix struct with left and symmetric Markov propagation
  • Deterministic hash-based initialization
  • File-based input only (TSV/CSV) via from_files()
  • Column definition system (complex, reflexive, transient)
  • Parallel graph building via Crossbeam channels + Rayon
  • Output: NPY embedding files
  • L2 normalization only
  • No Python bindings, no pip install, no evaluation metrics, no algorithms

Rust crates used: rayon, ndarray, crossbeam, dashmap, smallvec, twox-hash, rustc-hash, itertools

pycleora vs Original Cleora — Full Comparison

What changed from the original Synerise repository
Capability | Original Cleora | pycleora 3.2
Language | Rust CLI only | Python API + Rust core
Install | Compile from source | pip install pycleora
Input formats | TSV/CSV files | Files, iterators, pandas, scipy, numpy, NetworkX, edge lists
Output formats | NPY files | NPZ, CSV, TSV, Parquet, NetworkX, PyG, DGL
Embedding methods | 1 (Cleora) | 8 (Cleora + 7 alternatives: ProNE, RandNE, HOPE, NetMF, GraRep, DeepWalk, Node2Vec) + Cleora variants (multiscale, attention, weighted, directed, inductive, streaming)
Normalization | L2 only | L2, L1, spectral, none
Propagation types | Left, symmetric | Left, symmetric + residual connections
Post-processing | None | Whitening, PCA compression, random projection, product quantization
Evaluation | None | 12 metrics (accuracy, F1, AUC, MRR, Hits@K, MAP@K, nDCG, ARI, silhouette, modularity, cross-validation)
Classification | None | MLP, label propagation
Community detection | None | K-means, spectral, Louvain + modularity
Graph analysis | None | Degree distribution, clustering coefficient, diameter, PageRank, betweenness centrality, connected components
Preprocessing | None | Self-loop removal, deduplication, degree filtering, LCC extraction
Similarity search | None | Brute force + ANN (HNSW/ball tree)
Alignment | None | Procrustes, CCA, alignment scoring
Ensemble | None | Concat, mean, weighted, SVD combination
Visualization | None | t-SNE, UMAP, PCA with matplotlib
Sampling | None | 7 methods (neighborhood, subgraph, GraphSAINT, negative, train/test split)
Graph generators | None | 5 models (ER, BA, SBM, planted partition, Watts-Strogatz)
Datasets | None (BYO files) | 14 built-in (34 to 65M nodes, auto-download)
Tuning | None | Grid search, random search
Benchmarking | None | Automated multi-algorithm multi-dataset benchmarks
Heterogeneous graphs | None | HeteroGraph class with per-relation and metapath embedding
Dynamic graphs | None | Add/remove edges, inductive embedding, streaming
Sklearn integration | None | CleoraEmbedder with fit/transform/get_params
GPU support | None | Optional PyTorch-based propagation
Serialization | None | Full pickle support for SparseMatrix
CLI | Rust binary | Python CLI (pycleora embed|info|benchmark|similar)
Performance (roadNet 2M) | ~16s (Rust CLI) | 4.3s (3.7× faster via optimized Rust loop)
Public functions | ~5 | 100+