Changelog

v3.2.0

March 2026 — Performance & Ecosystem Release

New Rust-native Full Embed Loop (embed_fast)

The entire embedding pipeline — initialization, propagation, normalization, and iteration — now runs inside a single Rust call. Eliminates Python↔Rust boundary crossing on every iteration. 3.7× faster on roadNet (2M nodes): 15.8s → 4.3s. 1.7× faster on Cora (2.5K nodes).

New Embedding Whitening (whiten_embeddings)

Post-processing that mean-centers and decorrelates embedding dimensions via eigendecomposition. Boosts node classification accuracy on Cora from 0.26 to 0.70; combined with multiscale, accuracy reaches 0.83. Available as the whiten=True parameter in embed().

New Residual Connections

Mixes each iteration's propagated embeddings with the previous iteration's: emb = (1-α)·propagated + α·prev, where α is set by the residual_weight parameter of embed(). Prevents over-smoothing at high iteration counts.
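
The mixing rule above can be illustrated with a minimal pure-Python sketch. The function name residual_mix and the assumption that the weight lies in [0, 1] are illustrative and not taken from pycleora; only the formula itself comes from the release note.

```python
def residual_mix(propagated, previous, residual_weight):
    """Blend two equally shaped matrices: (1 - a) * propagated + a * previous."""
    # Assumption: the weight is a convex-combination coefficient in [0, 1].
    if not 0.0 <= residual_weight <= 1.0:
        raise ValueError("residual_weight must lie in [0, 1]")
    a = residual_weight
    return [
        [(1.0 - a) * p + a * q for p, q in zip(prop_row, prev_row)]
        for prop_row, prev_row in zip(propagated, previous)
    ]


propagated = [[1.0, 0.0], [0.0, 1.0]]
previous = [[0.0, 1.0], [1.0, 0.0]]
mixed = residual_mix(propagated, previous, residual_weight=0.25)
print(mixed)  # [[0.75, 0.25], [0.25, 0.75]]
```

With a weight of 0 the result is the plain propagated matrix; larger weights keep more of the previous iteration in each row.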

New Convergence-Based Early Stopping

Automatically detects when embeddings stabilize (RMSE between iterations drops below threshold). Saves compute on graphs that converge before max_iterations. Parameter: convergence_threshold in embed().
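
As a rough sketch of the stopping rule described above, the loop below compares successive embedding matrices by RMSE and exits once the change falls below a threshold. The helper names (rmse, iterate_until_stable) and the toy update step are assumptions for illustration; the actual check runs inside the library and may differ in detail.

```python
import math


def rmse(current, previous):
    """Root-mean-square difference over all entries of two equally shaped matrices."""
    total, count = 0.0, 0
    for cur_row, prev_row in zip(current, previous):
        for c, p in zip(cur_row, prev_row):
            total += (c - p) ** 2
            count += 1
    return math.sqrt(total / count)


def iterate_until_stable(step, emb, max_iterations, convergence_threshold):
    """Apply `step` repeatedly; stop once successive results differ by less than the threshold."""
    for iteration in range(1, max_iterations + 1):
        new_emb = step(emb)
        if rmse(new_emb, emb) < convergence_threshold:
            return new_emb, iteration
        emb = new_emb
    return emb, max_iterations


def halve(matrix):
    """Toy update: halving every value makes the change shrink geometrically."""
    return [[v * 0.5 for v in row] for row in matrix]


final, used = iterate_until_stable(
    halve, [[1.0, 1.0]], max_iterations=50, convergence_threshold=1e-3
)
print(used)  # 10
```

In this toy case the change after step k is 0.5**k, which first drops below 1e-3 at k = 10, so 40 of the 50 allowed iterations are skipped.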

New Graph Statistics Module (pycleora.stats)

graph_summary() — one-call overview of all graph metrics
degree_distribution() — node degree histogram
clustering_coefficient() — average local clustering coefficient
connected_components() — list of components
diameter() — graph diameter via BFS
betweenness_centrality() — Brandes algorithm, top-K
pagerank() — power iteration with dangling node handling
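
To make the last entry concrete, here is a small standalone sketch of power-iteration PageRank in which dangling nodes spread their rank uniformly over all nodes. It is not the pycleora.stats implementation; the function signature, the dict-based graph format, and the default damping of 0.85 are assumptions chosen for the example.

```python
def pagerank(out_edges, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank on a dict {node: [successors]}."""
    nodes = sorted(set(out_edges) | {v for vs in out_edges.values() for v in vs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-edges is redistributed uniformly.
        dangling_mass = sum(rank[u] for u in nodes if not out_edges.get(u))
        base = (1.0 - damping) / n + damping * dangling_mass / n
        new_rank = {u: base for u in nodes}
        for u in nodes:
            successors = out_edges.get(u) or []
            for v in successors:
                new_rank[v] += damping * rank[u] / len(successors)
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


graph = {"a": ["b"], "b": ["c"], "c": []}  # "c" is a dangling node
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))  # ['c', 'b', 'a']
```

Without the dangling-node term, rank would leak out of the graph at "c" and the scores would no longer sum to 1.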

New Graph Preprocessing Module (pycleora.preprocess)

clean_graph() — remove self-loops, deduplicate edges, filter by degree
largest_connected_component() — extract LCC as new SparseMatrix
filter_by_degree() — degree-based edge filtering
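
The cleaning steps listed above can be sketched on a plain edge list. This is an illustrative stand-in, not the clean_graph() implementation: the function name clean_edges, the min_degree parameter, and the choice to treat edges as undirected and filter in a single pass are all assumptions.

```python
def clean_edges(edges, min_degree=1):
    """Drop self-loops, collapse duplicate undirected edges, then drop edges
    touching any node whose degree is below `min_degree` (single pass)."""
    unique = {tuple(sorted((u, v))) for u, v in edges if u != v}
    degree = {}
    for u, v in unique:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sorted(
        (u, v) for u, v in unique
        if degree[u] >= min_degree and degree[v] >= min_degree
    )


raw = [("a", "b"), ("b", "a"), ("a", "a"), ("b", "c"), ("c", "d"), ("a", "c")]
cleaned = clean_edges(raw, min_degree=2)
print(cleaned)  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Here ("b", "a") duplicates ("a", "b"), ("a", "a") is a self-loop, and ("c", "d") is dropped because "d" has degree 1.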

New ANN Search Module (pycleora.search)

ANNIndex class for approximate nearest neighbor queries. HNSW backend via optional hnswlib, ball-tree fallback without dependencies. Same result format as find_most_similar. query() and query_vector() methods.

New Embedding Compression Module (pycleora.compress)

pca_compress() — PCA-based dimensionality reduction
random_projection() — fast random projection (Johnson-Lindenstrauss)
product_quantize() → PQIndex with reconstruct() and search()

New Embedding Alignment Module (pycleora.align)

procrustes() — orthogonal Procrustes alignment via SVD
cca_align() — Canonical Correlation Analysis into shared space
alignment_score() — mean cosine similarity after Procrustes

New Ensemble Embeddings Module (pycleora.ensemble)

combine() — merge multiple embedding matrices via concat, mean, weighted average, or SVD reduction.

New Extended I/O: Pandas, SciPy, NumPy, Edge Lists

from_pandas(df, source_col, target_col) — build graph from DataFrame
from_scipy_sparse(matrix) — from scipy sparse adjacency
from_numpy(adjacency_matrix) — from dense numpy array
from_edge_list(edges) — from list of (source, target) tuples

Improved Rust Core: Double-Buffered Propagation

Instead of allocating a new embedding matrix on every iteration, two pre-allocated buffers are swapped. Reduces allocation churn and memory allocator overhead.
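
The buffer-swapping idea can be shown in a few lines of Python, even though the real implementation lives in the Rust core. The sketch below uses simple neighbor averaging as the propagation step and assumes every node has at least one neighbor; the function name and data layout are illustrative only.

```python
def propagate_double_buffered(adjacency, emb, iterations):
    """Neighbor-average propagation that reuses two preallocated buffers.

    adjacency[i] lists the neighbor row indices of row i (assumed non-empty).
    """
    n, dim = len(emb), len(emb[0])
    front = [row[:] for row in emb]          # buffer read during an iteration
    back = [[0.0] * dim for _ in range(n)]   # buffer written during an iteration
    for _ in range(iterations):
        for i in range(n):
            neighbors = adjacency[i]
            for d in range(dim):
                back[i][d] = sum(front[j][d] for j in neighbors) / len(neighbors)
        front, back = back, front            # swap roles; no new matrix is allocated
    return front


adjacency = [[1], [0]]                       # two nodes joined by one edge
start = [[1.0], [3.0]]
once = propagate_double_buffered(adjacency, start, iterations=1)
twice = propagate_double_buffered(adjacency, start, iterations=2)
print(once, twice)  # [[3.0], [1.0]] [[1.0], [3.0]]
```

Every entry of the write buffer is overwritten before the swap, so stale values from two iterations earlier never leak into the result.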

Improved Rust Core: Faster Initialization Hashing

Replaced SipHash (DefaultHasher) with FxHash in init_value(). FxHash runs at ~0.3 cycles/byte vs SipHash's ~4 cycles/byte, so hashing during initialization is roughly 10× faster on large graphs.
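
For readers unfamiliar with hash-based initialization, the sketch below shows the general idea: each starting value is derived from a hash of the entity, the dimension index, and a seed, so repeated runs produce identical starting embeddings. It is not the Rust init_value(); it substitutes BLAKE2b from Python's standard library for FxHash, and the key layout and the [-1, 1] output range are assumptions.

```python
import hashlib


def init_value(entity, dim_index, seed=0):
    """Map (entity, dimension, seed) to a reproducible float in [-1, 1]."""
    # Stand-in hash: BLAKE2b with an 8-byte digest, not FxHash.
    key = f"{seed}:{dim_index}:{entity}".encode("utf-8")
    digest = hashlib.blake2b(key, digest_size=8).digest()
    as_int = int.from_bytes(digest, "big")   # 0 .. 2**64 - 1
    return as_int / 2 ** 63 - 1.0


row = [init_value("user_42", d, seed=7) for d in range(4)]
same_row = [init_value("user_42", d, seed=7) for d in range(4)]
print(row == same_row)  # True
```

Because no random number generator state is involved, values for different entities can be computed independently and in any order.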

Improved Rust Core: GIL Release During Embedding

py.allow_threads() releases Python's GIL during the entire Rust embedding computation, enabling true multi-threaded parallelism.

Improved Rust Core: Vectorization-Friendly Inner Loop

SpMM kernel rewritten with direct slice access instead of ndarray iterators, enabling better auto-vectorization by LLVM.

v3.1.0

March 2026

New Scikit-learn Compatible API (CleoraEmbedder)

New CleoraEmbedder class with fit(), transform(), fit_transform(), get_params(), and set_params() methods. Drop-in compatible with scikit-learn pipelines and grid search workflows. Transform preserves entity order.

New Use Cases & Tutorials Page

6 documented use cases with working code: recommendations, fraud detection, social networks, knowledge graphs, entity resolution, drug discovery. 3 step-by-step tutorials.

New Architecture Deep-Dive Page

Technical documentation: Markov propagation types, convergence properties, Rust/PyO3 architecture, memory model, and a comparison with random-walk methods.

Improved Interactive Benchmark Visualizations

Chart.js-powered charts: accuracy bars, speed comparison, memory usage, scatter, cross-validation with error bars. Chart/table toggle.

v3.0.0

March 2026 — Major Release

New MLP Classifier & Label Propagation

Pure numpy/scipy classifiers for evaluating embedding quality. MLP for supervised node classification, Label Propagation for semi-supervised. No PyTorch required.

New Graph Sampling Module (pycleora.sampling)

7 methods: sample_nodes, sample_edges, sample_neighborhood (k-hop), sample_subgraph, graphsaint_sample (mini-batching), negative_sampling, train_test_split_edges.

New Cross-Validation

k-fold cross-validation for node classification. Per-fold accuracy/F1 with mean±std.

New Enhanced Evaluation Metrics

map_at_k() — Mean Average Precision at K
ndcg_at_k() — Normalized Discounted Cumulative Gain at K
adjusted_rand_index()
silhouette_score()
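
The two ranking metrics can be sketched for a single query with binary relevance, as below. These are textbook definitions rather than the library's code: the function names are illustrative, and conventions vary, notably the denominator used for average precision. MAP@K is the mean of the per-query average precision over all queries.

```python
import math


def average_precision_at_k(ranked, relevant, k):
    """Precision averaged over the top-k ranks at which a relevant item appears."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    denom = min(len(relevant), k)            # one common normalization choice
    return score / denom if denom else 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
ap = average_precision_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
print(round(ap, 4), round(ndcg, 4))  # 0.8333 0.9197
```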

New Hyperparameter Tuning (pycleora.tuning)

grid_search() and random_search() with automatic evaluation. Custom param grids and scoring functions.

New CLI Tool (pycleora)

Command-line interface: pycleora embed|info|benchmark|similar. Supports all algorithms, output formats (npz/txt), dataset operations.

New Benchmarking Suite (pycleora.benchmark)

Automated comparison of 8 algorithms across 6+ datasets. Formatted comparison tables.

New Graph Generators (pycleora.generators)

5 models: Erdos-Renyi, Barabasi-Albert, Stochastic Block Model, Planted Partition, Watts-Strogatz.
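
As an example of the simplest model in the list, here is a short sketch of an Erdos-Renyi G(n, p) generator that returns an undirected edge list. The function name, signature, and output format are assumptions and do not reflect the pycleora.generators API.

```python
import random


def erdos_renyi(num_nodes, p, seed=None):
    """G(n, p): keep each of the n*(n-1)/2 possible undirected edges with probability p."""
    rng = random.Random(seed)                # local generator keeps runs reproducible
    return [
        (u, v)
        for u in range(num_nodes)
        for v in range(u + 1, num_nodes)
        if rng.random() < p
    ]


complete = erdos_renyi(5, 1.0)
empty = erdos_renyi(5, 0.0)
sample_a = erdos_renyi(50, 0.1, seed=3)
sample_b = erdos_renyi(50, 0.1, seed=3)
print(len(complete), len(empty), sample_a == sample_b)  # 10 0 True
```

This direct approach checks every node pair, so it costs O(n²) time and suits small synthetic graphs only.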

Improved Test Suite

81 tests covering all modules. Full suite runs in under 4 seconds.

Fixed Cross-validate with small labeled sets

Fixed Barabasi-Albert crash for m >= num_nodes

Fixed nDCG edge alignment with MAP@K

v2.1.0

January 2026

New Heterogeneous Graph Support (pycleora.hetero)

HeteroGraph class with add_node_type(), add_edge_type(), embed_per_relation(), embed_metapath(). Multi-type nodes and edges.

New Attention-Weighted Embedding (embed_with_attention)

Temperature-based softmax attention weighting during propagation. Learns neighbor importance from embedding similarity.
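
A minimal sketch of temperature-based softmax weighting follows. It assumes dot-product similarity between a node and its neighbors, which the release note does not specify, and the function names are illustrative rather than part of embed_with_attention.

```python
import math


def attention_weights(center, neighbors, temperature=1.0):
    """Softmax over dot-product similarities between `center` and each neighbor vector."""
    scores = [
        sum(c * x for c, x in zip(center, nb)) / temperature
        for nb in neighbors
    ]
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(center, neighbors, temperature=1.0):
    """Weighted average of neighbor vectors using the attention weights."""
    weights = attention_weights(center, neighbors, temperature)
    return [
        sum(w * nb[d] for w, nb in zip(weights, neighbors))
        for d in range(len(center))
    ]


center = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0]]
weights = attention_weights(center, neighbors)
sharp = attention_weights(center, neighbors, temperature=0.1)
print(weights[0] > weights[1], sharp[0] > weights[0])  # True True
```

Lower temperatures concentrate weight on the most similar neighbor, while higher temperatures approach a plain unweighted average.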

Improved Visualization Module

Community-colored t-SNE/UMAP/PCA plots. Custom colormaps. PNG/SVG export.

v2.0.0

October 2025

Breaking New Algorithm API

All alternative algorithms moved to pycleora.algorithms. Unified parameter naming across all embedding functions.

New 7 Embedding Algorithms

embed_deepwalk() — random walks + SVD
embed_node2vec() — biased walks with p/q parameters
embed_netmf() — Network Matrix Factorization
embed_prone() — ProNE: sparse matrix factorization with spectral propagation
embed_randne() — Random projection with power iteration
embed_grarep() — Multi-scale transition matrix factorization
embed_hope() — Higher-Order Proximity (Katz similarity)

New Node Classification (pycleora.classify)

mlp_classify(), label_propagation(), label_propagation_predict(). Accuracy and macro/weighted F1.

New Link Prediction Metrics

AUC, MRR, Hits@K scoring via link_prediction_scores().
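
Two of these metrics are easy to state in code once each true edge has a rank among its candidate edges. The sketch below uses the standard definitions; the helper names and the pessimistic tie-breaking rule are assumptions, not the behavior of link_prediction_scores().

```python
def rank_of_positive(pos_score, neg_scores):
    """1-based rank of a true edge among negative candidates (ties count against it)."""
    return 1 + sum(1 for s in neg_scores if s >= pos_score)


def mrr(ranks):
    """Mean reciprocal rank over the 1-based ranks of the true edges."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of true edges ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


ranks = [
    rank_of_positive(0.9, [0.1, 0.2, 0.3]),  # rank 1
    rank_of_positive(0.5, [0.7, 0.2, 0.1]),  # rank 2
    rank_of_positive(0.4, [0.9, 0.8, 0.6]),  # rank 4
]
print(ranks, round(mrr(ranks), 4), round(hits_at_k(ranks, 2), 4))
# [1, 2, 4] 0.5833 0.6667
```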

New Community Detection

detect_communities_kmeans(), detect_communities_spectral(), detect_communities_louvain(), modularity().

Improved Louvain Community Detection

Binary adjacency fixes, proper self-loop removal, correct modularity computation.

v1.2.0

August 2025

New Multi-Scale Embedding (embed_multiscale)

Concatenation of embeddings at different propagation depths (e.g. scales=[1,2,4,8]). Captures both local and global graph structure in one vector.
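
The concatenation scheme can be sketched as follows, using plain neighbor averaging as a stand-in for the library's propagation step. Function names and the list-based adjacency format are assumptions; every node is assumed to have at least one neighbor.

```python
def propagate(adjacency, emb):
    """One step of neighbor averaging; adjacency[i] lists the neighbors of row i."""
    dim = len(emb[0])
    return [
        [sum(emb[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        for nbrs in adjacency
    ]


def multiscale(adjacency, emb, scales):
    """Concatenate snapshots of the embedding taken after each depth in `scales`."""
    wanted = set(scales)
    snapshots = {}
    current = emb
    for depth in range(1, max(scales) + 1):
        current = propagate(adjacency, current)
        if depth in wanted:
            snapshots[depth] = current
    return [
        [value for depth in scales for value in snapshots[depth][i]]
        for i in range(len(emb))
    ]


path = [[1], [0, 2], [1]]                    # path graph 0 - 1 - 2
start = [[1.0], [0.0], [0.0]]
combined = multiscale(path, start, scales=[1, 2])
print(combined)  # [[0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
```

The output width is the base dimension times the number of scales, so scales=[1,2,4,8] on a 128-dimensional embedding would yield 512 columns under this scheme.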

New Node Feature Integration (embed_with_node_features)

Weighted combination of structural Cleora embeddings with external node attributes.

New Edge Feature Embedding (embed_edge_features)

Multi-dimensional features per edge with concat/mean/edge_only combination modes.

New I/O Module (pycleora.io_utils)

NetworkX/PyG/DGL export (to_networkx, to_pyg_data, to_dgl_graph). Save/load in NPZ, CSV, TSV, Parquet formats. Edge list export.

New Visualization Module (pycleora.viz)

reduce_dimensions() (t-SNE/UMAP/PCA), plot_embeddings(), visualize().

v1.1.0

June 2025

New Dynamic Graph Updates

update_graph() — add new edges to an existing graph. remove_edges() — remove edges. Both rebuild the SparseMatrix with updated connectivity.

New Inductive Embedding (embed_inductive)

Embed new nodes that weren't in the training graph by propagating from existing trained embeddings.

New Streaming Embedding (embed_streaming)

Process edge batches incrementally with per-batch callbacks. For graphs that arrive in chunks.

New Weighted Edges (embed_weighted)

Build and embed graphs with explicit edge weights.

New Directed Graphs (embed_directed)

Embedding for directed edge relationships.

New Supervised Refinement (supervised_refine)

Fine-tune embeddings using labeled positive/negative entity pairs with margin-based loss.

New Link Prediction (predict_links)

Predict top-K most likely missing edges based on embedding similarity.

Improved Normalization Options

L1, L2, spectral, and "none" normalization modes for all embedding functions.
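
The L1, L2, and "none" modes are standard row normalizations and can be sketched directly. The spectral mode is omitted because its definition is not given here; the function name and the handling of all-zero rows are assumptions.

```python
import math


def normalize_rows(matrix, mode="l2"):
    """Scale each row to unit L1 or L2 norm; "none" returns an unchanged copy."""
    if mode == "none":
        return [row[:] for row in matrix]
    result = []
    for row in matrix:
        if mode == "l1":
            norm = sum(abs(v) for v in row)
        elif mode == "l2":
            norm = math.sqrt(sum(v * v for v in row))
        else:
            raise ValueError(f"unsupported mode: {mode!r}")
        # All-zero rows are left as they are to avoid dividing by zero.
        result.append([v / norm for v in row] if norm else row[:])
    return result


l2 = normalize_rows([[3.0, 4.0]], mode="l2")
l1 = normalize_rows([[3.0, 4.0]], mode="l1")
print(l2)  # [[0.6, 0.8]]
```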

v1.0.0

March 2025 — Initial pycleora Release

New Python API over Rust Core

First full Python package wrapping the Cleora Rust engine via PyO3. embed() function with configurable dimensions, iterations, propagation type, normalization, seed, and worker count.

New SparseMatrix Class

Core graph representation with from_iterator(), from_files(), entity lookup, neighbor access, CSR export, serialization (pickle). Properties: entity_ids, entity_degrees, num_entities, num_edges.

New Similarity Search

cosine_similarity() and find_most_similar() for top-K entity retrieval.
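
A brute-force version of top-K retrieval by cosine similarity looks roughly like the sketch below. The names cosine and top_k_similar, and the dict-of-vectors input, are illustrative and do not mirror the signatures of cosine_similarity() or find_most_similar().

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(query_id, embeddings, k):
    """Brute-force top-k neighbors of `query_id` in a dict {entity_id: vector}."""
    query = embeddings[query_id]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


embeddings = {
    "a": [1.0, 0.0],
    "b": [2.0, 0.0],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
nearest = top_k_similar("a", embeddings, k=2)
print([entity for entity, _ in nearest])  # ['b', 'c']
```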

New Column Definition System

Flexible column configuration: complex (multi-entity), reflexive (self-relations), transient (intermediate only). Example: "complex::reflexive::user product".
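
To show how a definition string such as the example above might be read, here is a small parsing sketch. It assumes that whitespace separates columns and that "::" separates optional modifiers from the trailing column name; the function name and the returned structure are hypothetical and not part of pycleora.

```python
KNOWN_MODIFIERS = {"complex", "reflexive", "transient"}


def parse_columns(spec):
    """Split a column definition string into a list of {name, modifiers} records."""
    columns = []
    for token in spec.split():
        *modifiers, name = token.split("::")  # the last piece is the column name
        unknown = set(modifiers) - KNOWN_MODIFIERS
        if unknown:
            raise ValueError(f"unknown modifiers: {sorted(unknown)}")
        columns.append({"name": name, "modifiers": set(modifiers)})
    return columns


parsed = parse_columns("complex::reflexive::user product")
print([c["name"] for c in parsed])  # ['user', 'product']
```

In this reading, "user" carries both the complex and reflexive modifiers, while "product" is a plain column.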

New 14 Built-in Datasets (pycleora.datasets)

From Karate Club (34 nodes) to com-Friendster (65M nodes). SNAP datasets auto-downloaded and cached.

New GPU Propagation (propagate_gpu)

Optional PyTorch-based GPU acceleration for the propagation step.

v0.1.0 — Original Cleora

2021 — Synerise/Cleora (github.com/synerise/cleora)

The original Cleora repository by Synerise. A pure Rust CLI tool for graph embedding via Markov propagation on hypergraph structures.

What the original repo provided:

Rust binary (cleora) — command-line only, no Python API
SparseMatrix struct with left and symmetric Markov propagation
Deterministic hash-based initialization
File-based input only (TSV/CSV) via from_files()
Column definition system (complex, reflexive, transient)
Parallel graph building via Crossbeam channels + Rayon
Output: NPY embedding files
L2 normalization only
No Python bindings, no pip install, no evaluation metrics, no algorithms

Rust crates used: rayon, ndarray, crossbeam, dashmap, smallvec, twox-hash, rustc-hash, itertools

pycleora vs Original Cleora — Full Comparison

What changed from the original Synerise repository

Capability	Original Cleora	pycleora 3.2
Language	Rust CLI only	Python API + Rust core
Install	Compile from source	`pip install pycleora`
Input formats	TSV/CSV files	Files, iterators, pandas, scipy, numpy, NetworkX, edge lists
Output formats	NPY files	NPZ, CSV, TSV, Parquet, NetworkX, PyG, DGL
Embedding methods	1 (Cleora)	8 (Cleora + 7 alternatives: ProNE, RandNE, HOPE, NetMF, GraRep, DeepWalk, Node2Vec) + Cleora variants (multiscale, attention, weighted, directed, inductive, streaming)
Normalization	L2 only	L2, L1, spectral, none
Propagation types	Left, symmetric	Left, symmetric + residual connections
Post-processing	None	Whitening, PCA compression, random projection, product quantization
Evaluation	None	12 metrics (accuracy, F1, AUC, MRR, Hits@K, MAP@K, nDCG, ARI, silhouette, modularity, cross-validation)
Classification	None	MLP, label propagation
Community detection	None	K-means, spectral, Louvain + modularity
Graph analysis	None	Degree distribution, clustering coefficient, diameter, PageRank, betweenness centrality, connected components
Preprocessing	None	Self-loop removal, deduplication, degree filtering, LCC extraction
Similarity search	None	Brute force + ANN (HNSW/ball tree)
Alignment	None	Procrustes, CCA, alignment scoring
Ensemble	None	Concat, mean, weighted, SVD combination
Visualization	None	t-SNE, UMAP, PCA with matplotlib
Sampling	None	7 methods (neighborhood, subgraph, GraphSAINT, negative, train/test split)
Graph generators	None	5 models (ER, BA, SBM, planted partition, Watts-Strogatz)
Datasets	None (BYO files)	14 built-in (34 to 65M nodes, auto-download)
Tuning	None	Grid search, random search
Benchmarking	None	Automated multi-algorithm multi-dataset benchmarks
Heterogeneous graphs	None	HeteroGraph class with per-relation and metapath embedding
Dynamic graphs	None	Add/remove edges, inductive embedding, streaming
Sklearn integration	None	CleoraEmbedder with fit/transform/get_params
GPU support	None	Optional PyTorch-based propagation
Serialization	None	Full pickle support for SparseMatrix
CLI	Rust binary	Python CLI (`pycleora embed|info|benchmark|similar`)
Performance (roadNet 2M)	~16s (Rust CLI)	4.3s (3.7× faster via optimized Rust loop)
Public functions	~5	100+