Documentation
Complete guide to pycleora 3.2, a fast graph embedding library that runs entirely on CPU, no GPU required.
Installation
From PyPI (recommended)
pip install pycleora
From source (with Rust compiler)
pip install maturin==1.2.3
git clone https://github.com/pycleora/pycleora
cd pycleora
maturin build --release -i python3
pip install target/wheels/pycleora-*.whl
Quick Start
from pycleora import SparseMatrix, embed, find_most_similar
# 1. Define edges (space-separated entity pairs)
edges = [
    "user_alice product_laptop",
    "user_alice product_mouse",
    "user_bob product_keyboard",
    "user_bob product_monitor",
]
# 2. Build the graph
graph = SparseMatrix.from_iterator(
    iter(edges),
    "complex::reflexive::entity"
)
# 3. Generate embeddings (whitening is on by default for best accuracy)
embeddings = embed(graph, feature_dim=256, num_iterations=16)
# 4. Query similarity
results = find_most_similar(graph, embeddings, "user_alice", top_k=5)
for r in results:
    print(f"{r['entity_id']}: {r['similarity']:.4f}")
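The `whiten=True` default decorrelates the embedding dimensions before they are normalized. A minimal numpy sketch of what whitening does in general (ZCA whitening; illustrative only, not pycleora's internal implementation):

```python
import numpy as np

def whiten(emb: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """ZCA-whiten an (n, d) matrix: zero mean, (near-)identity covariance."""
    centered = emb - emb.mean(axis=0)
    cov = centered.T @ centered / (len(emb) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate into the eigenbasis, rescale each direction to unit variance, rotate back
    return centered @ eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))  # correlated features
white = whiten(emb)
cov_after = np.cov(white, rowvar=False)  # approximately the identity matrix
```

After whitening, no single direction dominates the similarity computation, which is why it tends to help accuracy on downstream tasks.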
Core Concepts
Column Definitions
The column string defines how edges are parsed. Common patterns:
complex::reflexive::entity — bipartite graph, entities interact with themselves
complex::reflexive::user complex::reflexive::product — two separate entity types
transient::reflexive::context — context column, not embedded
Propagation Types
"left" — Left Markov (D⁻¹A, row-stochastic). Default. Best for most tasks.
"symmetric" — Symmetric normalization (D⁻¹/²AD⁻¹/²). Faster, but may reduce accuracy.
Why does "left" usually outperform "symmetric"?
Left Markov normalizes only on the source side — it divides each row by the source node's degree. This means high-degree nodes (hubs) retain their full influence over their neighbors. In social and interaction graphs, hubs are often community leaders or connectors between groups, so preserving their signal is critical for downstream tasks like community detection and node classification.
Symmetric normalization divides by the square root of the degree on both sides of every edge. This dampens the influence of hubs, making embeddings more uniform across the graph. The effect is that community boundaries become less sharp — the very signal that classification relies on gets smoothed away.
In benchmarks, this difference is measurable: on ego-Facebook, left propagation achieves 0.964 accuracy vs 0.293 for symmetric (dim=1024, whiten=True). On roadNet-CA, a more regular graph, the gap is narrower: 0.598 vs 0.431.
When to use symmetric: On regular graphs where node degrees are roughly uniform (lattices, road networks, mesh topologies), both normalization types perform similarly, but symmetric can be slightly faster computationally. Symmetric is also the standard choice when feeding embeddings into spectral methods or GCN-style architectures that expect D⁻¹/²AD⁻¹/² as input.
Recommendation: Use "left" (default) unless you specifically need symmetric normalization for downstream compatibility or are working with very regular graphs.
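The difference between the two propagation matrices is easy to see on a toy graph. A numpy sketch (illustrative, not library code) computing both normalizations for a small hub-and-spoke graph:

```python
import numpy as np

# Toy graph: node 0 is a hub connected to 1, 2, 3; nodes 1 and 2 are also connected.
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)
deg = A.sum(axis=1)  # degrees: [3, 2, 2, 1]

# Left Markov: D^-1 A — each row sums to 1, the hub keeps its full outgoing weight
left = A / deg[:, None]

# Symmetric: D^-1/2 A D^-1/2 — every edge damped by both endpoints' degrees
d_inv_sqrt = 1.0 / np.sqrt(deg)
sym = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
```

Note how the hub-to-leaf edge (0, 3) carries weight 1/3 under left propagation but 1/√3 under symmetric, while hub-to-hub edges are damped much more: the relative structure of the hub's neighborhood changes between the two schemes.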
Normalization
"l2" — Unit length vectors (default, recommended)
"l1" — L1 normalization
"spectral" — SVD-based spectral normalization
"none" — No normalization
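For intuition, a sketch of what row-wise "l2" and "l1" normalization do to an embedding matrix (an illustrative helper, not the library's implementation):

```python
import numpy as np

def normalize_rows(emb: np.ndarray, norm: str = "l2") -> np.ndarray:
    """Normalize each embedding vector independently; "none" passes through."""
    if norm == "l2":
        scale = np.linalg.norm(emb, axis=1, keepdims=True)
    elif norm == "l1":
        scale = np.abs(emb).sum(axis=1, keepdims=True)
    else:
        return emb
    # Avoid dividing all-zero rows by zero
    return emb / np.where(scale == 0, 1.0, scale)

emb = np.array([[3.0, 4.0], [0.0, 0.0]])
unit = normalize_rows(emb, "l2")  # first row becomes [0.6, 0.8]
```

With "l2" vectors, cosine similarity reduces to a plain dot product, which is why it is the recommended default for similarity queries.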
Cleora Embedding
from pycleora import embed, embed_multiscale
# Standard embedding (whiten=True by default for best accuracy)
emb = embed(graph, feature_dim=256, num_iterations=16)
# Symmetric propagation
emb_sym = embed(graph, feature_dim=256, propagation="symmetric")
# Multi-scale (captures different neighborhood sizes)
emb_multi = embed_multiscale(graph, feature_dim=64, scales=[1, 2, 4, 8])
# Result: (n_entities, 64*4) = (n, 256) dimensions
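The multi-scale idea can be sketched in plain numpy: propagate node features to increasing depths and concatenate the snapshots. This is a simplified stand-in for what `embed_multiscale` computes with Cleora iterations, not its actual code:

```python
import numpy as np

def multiscale_sketch(P: np.ndarray, X: np.ndarray, scales=(1, 2, 4)) -> np.ndarray:
    """Concatenate features propagated to different depths: [P^1 X | P^2 X | P^4 X]."""
    out, H, step = [], X.copy(), 0
    for s in scales:
        while step < s:
            H = P @ H  # one more propagation step
            step += 1
        out.append(H)
    return np.hstack(out)

# Tiny example: 3 nodes, 2-dim features, row-stochastic propagation matrix P
P = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
X = np.eye(3)[:, :2]
emb = multiscale_sketch(P, X, scales=(1, 2, 4))  # shape (3, 2 * 3) = (3, 6)
```

Small scales capture immediate neighborhoods; larger scales capture community-level structure, so the concatenation carries both.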
Alternative Algorithms
from pycleora.algorithms import (
    embed_deepwalk, embed_node2vec,
    embed_prone, embed_randne,
    embed_hope, embed_netmf, embed_grarep
)
# DeepWalk — random walks + SVD (requires negative sampling)
emb = embed_deepwalk(graph, feature_dim=1024, num_walks=20, walk_length=40)
# Node2Vec — biased walks (requires negative sampling)
emb = embed_node2vec(graph, feature_dim=1024, p=0.5, q=2.0)
# RandNE — ultra-fast random projection
emb = embed_randne(graph, feature_dim=1024)
# NetMF — matrix factorization (requires negative sampling)
emb = embed_netmf(graph, feature_dim=1024, negative_samples=1.0)
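Node2Vec's `p` (return) and `q` (in-out) parameters bias each walk step toward revisiting (small `p`) or exploring outward (small `q`). A sketch of the standard second-order transition weights (illustrative helper, not pycleora code):

```python
import numpy as np

def biased_step_probs(prev, neighbors, prev_neighbors, p=0.5, q=2.0):
    """Unnormalized node2vec weights for the next step, having arrived from `prev`:
    1/p to go back, 1 to stay near `prev` (a common neighbor), 1/q to move outward."""
    weights = []
    for nxt in neighbors:
        if nxt == prev:
            weights.append(1.0 / p)
        elif nxt in prev_neighbors:
            weights.append(1.0)
        else:
            weights.append(1.0 / q)
    w = np.array(weights)
    return w / w.sum()

# At node B (neighbors A, C, D), having arrived from A (A's neighbors: {B, C})
probs = biased_step_probs(prev="A", neighbors=["A", "C", "D"],
                          prev_neighbors={"B", "C"}, p=0.5, q=2.0)
```

With `p=0.5, q=2.0` as above, the walk strongly prefers returning toward A's neighborhood, mimicking BFS-like, community-preserving exploration.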
Node Features
import numpy as np
from pycleora import embed_with_node_features
node_features = {"alice": np.array([1.0, 0.5, ...]), ...}
emb = embed_with_node_features(graph, node_features, feature_weight=0.7)
Edge Features
import numpy as np
from pycleora import embed_edge_features
# Multi-dimensional features per edge
edge_feats = {"alice item_laptop": np.array([1.0, 0.8, ...]), ...}
emb = embed_edge_features(graph, edge_feats, feature_dim=64, combine="concat")
# combine: "concat" | "mean" | "edge_only"
Attention-Weighted Embedding
from pycleora import embed_with_attention
emb = embed_with_attention(graph, feature_dim=1024, attention_temperature=0.5)
Building Graphs
# From edge list
graph = SparseMatrix.from_iterator(iter(edges), columns)
# Dynamic updates
from pycleora import update_graph, remove_edges
graph2 = update_graph(edges, new_edges, columns)
graph3 = remove_edges(edges, edges_to_remove, columns)
# Streaming (batch-by-batch)
from pycleora import embed_streaming
graph, emb = embed_streaming(batches, columns)
Heterogeneous Graphs
from pycleora.hetero import HeteroGraph
hg = HeteroGraph()
hg.add_node_type("user")
hg.add_node_type("product")
hg.add_edge_type("purchased", "user", "product", [("alice", "laptop")])
# Per-relation embedding
graphs, embeddings, combined = hg.embed_per_relation(feature_dim=64)
# Metapath embedding
g, emb = hg.embed_metapath(["purchased", "sold_at"], feature_dim=64)
Graph Generation
from pycleora.generators import (
    erdos_renyi, barabasi_albert,
    stochastic_block_model, watts_strogatz
)
er = erdos_renyi(num_nodes=1000, p=0.01)
ba = barabasi_albert(num_nodes=1000, m=3)
sbm = stochastic_block_model([100, 100, 100], p_within=0.1, p_between=0.001)
Graph Sampling
from pycleora.sampling import (
    sample_neighborhood, sample_subgraph,
    graphsaint_sample, negative_sampling,
    train_test_split_edges
)
# K-hop neighborhood
nbhood = sample_neighborhood(graph, ["alice"], num_hops=2, max_neighbors_per_hop=10)
# Mini-batches for large graphs
batches = graphsaint_sample(graph, batch_size=512, num_batches=10)
# Link prediction data split
split = train_test_split_edges(graph, test_ratio=0.2)
neg_edges = negative_sampling(graph, num_negatives=1000)
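The idea behind uniform negative sampling is simple: draw random node pairs and keep those that are not existing edges. A minimal sketch (illustrative, not the library's implementation):

```python
import numpy as np

def sample_negatives(edges, num_nodes, num_negatives, seed=0):
    """Rejection-sample node pairs that are not edges (in either direction)."""
    rng = np.random.default_rng(seed)
    existing = set(edges)
    negatives = []
    while len(negatives) < num_negatives:
        u, v = rng.integers(num_nodes, size=2)
        if u != v and (u, v) not in existing and (v, u) not in existing:
            negatives.append((int(u), int(v)))
    return negatives

edges = [(0, 1), (1, 2), (2, 3)]
neg = sample_negatives(edges, num_nodes=5, num_negatives=4)
```

Rejection sampling like this is fine for sparse graphs; dense graphs need smarter strategies because most random pairs would be rejected.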
Classification
Built-in classifiers — pure numpy/scipy, no PyTorch or TensorFlow required:
from pycleora.classify import mlp_classify, label_propagation
# MLP classifier — dense layers on embedding vectors
result = mlp_classify(graph, emb, labels, hidden_dim=64)
print(f"Accuracy: {result['accuracy']:.4f}")
# Semi-supervised label propagation — no embeddings needed
predictions = label_propagation(graph, partial_labels)
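The idea behind label propagation can be sketched in numpy: clamp the known labels and repeatedly average neighbors' label distributions until the unknowns settle. A minimal sketch of the general algorithm, not the library's code:

```python
import numpy as np

def propagate_labels(A, labels, num_classes, iterations=50):
    """Semi-supervised label propagation; unknown nodes are marked with -1."""
    n = len(labels)
    known = labels >= 0
    Y = np.zeros((n, num_classes))
    Y[known, labels[known]] = 1.0
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # row-stochastic
    for _ in range(iterations):
        Y = P @ Y                      # average neighbor label distributions
        Y[known] = 0.0
        Y[known, labels[known]] = 1.0  # clamp known labels every step
    return Y.argmax(axis=1)

# Path graph 0-1-2-3 with the two endpoints labeled
A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], dtype=float)
pred = propagate_labels(A, np.array([0, -1, -1, 1]), num_classes=2)
# Each interior node adopts the label of its nearer endpoint: [0, 0, 1, 1]
```

Because it works directly on graph structure, no embeddings are needed, which is why it is a good baseline before training anything.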
Evaluation Metrics
from pycleora.metrics import (
    node_classification_scores, link_prediction_scores,
    clustering_scores, map_at_k, ndcg_at_k,
    adjusted_rand_index, silhouette_score
)
# Node classification
scores = node_classification_scores(graph, emb, labels)
# Returns: accuracy, macro_f1, weighted_f1
# Link prediction
lp = link_prediction_scores(graph, emb, test_edges)
# Returns: auc, mrr, hits@1/3/10/50
# Ranking metrics
map_at_k(graph, emb, test_edges, k=10)
ndcg_at_k(graph, emb, test_edges, k=10)
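For reference, a minimal single-list NDCG@k in numpy (illustrative; the library version aggregates this kind of score over test edges per query node):

```python
import numpy as np

def single_ndcg_at_k(relevances, k=10):
    """NDCG@k for one ranked list of binary/graded relevance scores."""
    rel = np.asarray(relevances, dtype=float)[:k]
    # Log discount: position 1 divides by log2(2)=1, position 2 by log2(3), ...
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    dcg = (rel * discounts).sum()
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = (ideal * discounts[:len(ideal)]).sum()
    return dcg / idcg if idcg > 0 else 0.0

score = single_ndcg_at_k([1, 0, 1, 0], k=4)  # relevant items at ranks 1 and 3
```

A perfect ranking scores 1.0; burying relevant items lower in the list costs logarithmically more the deeper they sit.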
Cross-Validation
from pycleora.metrics import cross_validate
cv = cross_validate(graph, emb, labels, k_folds=5)
print(f"Accuracy: {cv['mean_accuracy']:.4f} +/- {cv['std_accuracy']:.4f}")
print(f"Macro F1: {cv['mean_macro_f1']:.4f} +/- {cv['std_macro_f1']:.4f}")
Hyperparameter Tuning
from pycleora.tuning import grid_search, random_search
result = grid_search(
    graph, labels,
    embed_fn=lambda g, feature_dim=1024, num_iterations=4: embed(g, feature_dim, num_iterations),
    param_grid={"feature_dim": [128, 512, 1024], "num_iterations": [2, 4, 8]},
)
print(f"Best: {result['best_params']} -> {result['best_score']:.4f}")
CLI Tool
# Generate embeddings
$ pycleora embed --input edges.tsv --dim 1024 --output emb.npz
# With algorithm choice
$ pycleora embed -i graph.tsv -d 256 -a deepwalk -o emb.npz
# Graph info
$ pycleora info --input edges.tsv
# Find similar entities
$ pycleora similar --input edges.tsv --entity alice --top-k 10
# Run benchmarks
$ pycleora benchmark --dataset karate_club
Import/Export
from pycleora.io_utils import save_embeddings, load_embeddings, to_networkx
# Save/load
save_embeddings(graph, emb, "embeddings.npz")
emb, ids = load_embeddings("embeddings.npz")
# Export to NetworkX, PyG, DGL
G = to_networkx(graph, emb)
Visualization
from pycleora.viz import visualize, reduce_dimensions
# t-SNE / PCA reduction
emb_2d = reduce_dimensions(emb, method="tsne")
# Full visualization with labels
visualize(graph, emb, labels=labels, save_path="graph.png")