Documentation

Complete guide to pycleora 3.2 — a graph embedding library that runs entirely on CPU, with no GPU or deep-learning framework required.

Installation

From PyPI (recommended)

pip install pycleora

From source (with Rust compiler)

pip install maturin==1.2.3
git clone https://github.com/pycleora/pycleora
cd pycleora
maturin build --release -i python3
pip install target/wheels/pycleora-*.whl

Dependencies: Only numpy and scipy are required. Matplotlib is optional for visualization.

Quick Start

from pycleora import SparseMatrix, embed, find_most_similar

# 1. Define edges (space-separated entity pairs)
edges = [
    "user_alice product_laptop",
    "user_alice product_mouse",
    "user_bob product_keyboard",
    "user_bob product_monitor",
]

# 2. Build the graph
graph = SparseMatrix.from_iterator(
    iter(edges),
    "complex::reflexive::entity"
)

# 3. Generate embeddings (whitening is on by default for best accuracy)
embeddings = embed(graph, feature_dim=256, num_iterations=16)

# 4. Query similarity
results = find_most_similar(graph, embeddings, "user_alice", top_k=5)
for r in results:
    print(f"{r['entity_id']}: {r['similarity']:.4f}")
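
The similarity query above can be understood as a nearest-neighbor search in embedding space. A minimal sketch of the idea, assuming L2-normalized embeddings so that dot product equals cosine similarity (`most_similar` is an illustrative helper, not pycleora API):

```python
import numpy as np

def most_similar(embeddings: np.ndarray, ids: list, query: str, top_k: int = 5):
    """Rank entities by cosine similarity to the query entity.

    Assumes rows of `embeddings` are L2-normalized, so the dot
    product equals cosine similarity.
    """
    q = embeddings[ids.index(query)]
    scores = embeddings @ q                  # cosine similarity per entity
    order = np.argsort(-scores)              # descending
    return [(ids[i], float(scores[i])) for i in order if ids[i] != query][:top_k]

# Toy example: 3 entities in 2-D, already unit length
ids = ["alice", "laptop", "mouse"]
emb = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
print(most_similar(emb, ids, "alice", top_k=2))
```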

Core Concepts

Column Definitions

The column string defines how edges are parsed. Common patterns:

  • complex::reflexive::entity — single entity column; every entity on a line is connected to every other
  • complex::reflexive::user complex::reflexive::product — two separate entity types
  • transient::reflexive::context — context column, not embedded
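
To make the reflexive behavior concrete: each input line is a hyperedge, and a reflexive column expands it into pairwise edges among the entities on that line (clique expansion). A plain-Python sketch of that expansion (`clique_expand` is an illustrative helper, not pycleora API):

```python
from itertools import combinations

def clique_expand(lines):
    """Expand each space-separated line (a hyperedge) into pairwise edges:
    every entity on a line connects to every other entity on that line."""
    edges = set()
    for line in lines:
        nodes = line.split()
        for a, b in combinations(nodes, 2):
            edges.add((a, b))
    return sorted(edges)

print(clique_expand(["a b c"]))   # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```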

Propagation Types

  • "left" — Left Markov (D⁻¹A, row-stochastic). Default. Best for most tasks.
  • "symmetric" — Symmetric normalization (D⁻¹/²AD⁻¹/²). Faster, but may reduce accuracy.

Why does "left" usually outperform "symmetric"?

Left Markov normalizes only on the source side — it divides each row by the source node's degree. This means high-degree nodes (hubs) retain their full influence over their neighbors. In social and interaction graphs, hubs are often community leaders or connectors between groups, so preserving their signal is critical for downstream tasks like community detection and node classification.

Symmetric normalization divides by the square root of the degree on both sides of every edge. This dampens the influence of hubs, making embeddings more uniform across the graph. The effect is that community boundaries become less sharp — the very signal that classification relies on gets smoothed away.

In benchmarks, this difference is measurable: on ego-Facebook, left propagation achieves 0.964 accuracy vs 0.293 for symmetric (dim=1024, whiten=True). On roadNet-CA, a far more regular graph, the gap narrows: 0.598 vs 0.431.

When to use symmetric: On regular graphs where node degrees are roughly uniform (lattices, road networks, mesh topologies), both normalization types perform similarly, but symmetric can be slightly faster computationally. Symmetric is also the standard choice when feeding embeddings into spectral methods or GCN-style architectures that expect D⁻¹/²AD⁻¹/² as input.

Recommendation: Use "left" (default) unless you specifically need symmetric normalization for downstream compatibility or are working with very regular graphs.
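
The two normalizations above differ only in how degree enters the matrix. A small numpy demo on a hub graph (the matrices, not pycleora internals, are what's shown here):

```python
import numpy as np

# Adjacency of a tiny "hub" graph: node 0 connects to nodes 1, 2, 3.
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]], dtype=float)
deg = A.sum(axis=1)

# Left Markov: D^-1 A — each row sums to 1 (row-stochastic).
left = A / deg[:, None]

# Symmetric: D^-1/2 A D^-1/2 — every edge is damped by both endpoint degrees.
d_inv_sqrt = 1.0 / np.sqrt(deg)
sym = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

print(left[1])   # leaf row: all transition mass goes to the hub
print(sym[1])    # same edge, damped by the hub's degree (1/sqrt(3))
```

Note how under symmetric normalization the leaf-to-hub weight shrinks with the hub's degree, which is exactly the hub-dampening effect described above.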

Normalization

  • "l2" — Unit length vectors (default, recommended)
  • "l1" — L1 normalization
  • "spectral" — SVD-based spectral normalization
  • "none" — No normalization
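
The row-wise normalizations are straightforward to sketch in numpy (`normalize` is an illustrative helper; the spectral variant is omitted since it involves an SVD step):

```python
import numpy as np

def normalize(emb, mode="l2"):
    """Row-wise normalization of an embedding matrix (illustrative)."""
    if mode == "l2":
        return emb / np.linalg.norm(emb, axis=1, keepdims=True)
    if mode == "l1":
        return emb / np.abs(emb).sum(axis=1, keepdims=True)
    if mode == "none":
        return emb
    raise ValueError(mode)

emb = np.array([[3.0, 4.0], [1.0, 1.0]])
print(normalize(emb, "l2"))   # every row has unit Euclidean length
```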

Cleora Embedding

from pycleora import embed, embed_multiscale

# Standard embedding (whiten=True by default for best accuracy)
emb = embed(graph, feature_dim=256, num_iterations=16)

# Symmetric propagation
emb_sym = embed(graph, feature_dim=256, propagation="symmetric")

# Multi-scale (captures different neighborhood sizes)
emb_multi = embed_multiscale(graph, feature_dim=64, scales=[1, 2, 4, 8])
# Result: (n_entities, 64*4) = (n, 256) dimensions
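
The whiten=True step can be understood as decorrelating the embedding dimensions before use. A ZCA-style sketch under that assumption (this is the standard whitening transform, not necessarily pycleora's exact implementation):

```python
import numpy as np

def whiten(emb, eps=1e-8):
    """ZCA-style whitening: center, then rotate/scale so the feature
    covariance becomes (approximately) the identity matrix."""
    centered = emb - emb.mean(axis=0)
    cov = centered.T @ centered / len(emb)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return centered @ W

rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # correlated dims
white = whiten(emb)
print(np.round(white.T @ white / len(white), 2))  # ≈ identity matrix
```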

Alternative Algorithms

Cleora vs walk-based methods: DeepWalk and Node2Vec sample random walks and train a skip-gram model (which uses negative sampling to approximate the softmax). NetMF factorizes the same co-occurrence matrix directly but still requires a negative sampling parameter. Cleora eliminates both walk sampling and skip-gram training entirely — it computes the full walk distribution via matrix powers.

from pycleora.algorithms import (
    embed_deepwalk, embed_node2vec,
    embed_prone, embed_randne,
    embed_hope, embed_netmf, embed_grarep
)

# DeepWalk — random walks + SVD (requires negative sampling)
emb = embed_deepwalk(graph, feature_dim=1024, num_walks=20, walk_length=40)

# Node2Vec — biased walks (requires negative sampling)
emb = embed_node2vec(graph, feature_dim=1024, p=0.5, q=2.0)

# RandNE — ultra-fast random projection
emb = embed_randne(graph, feature_dim=1024)

# NetMF — matrix factorization (requires negative sampling)
emb = embed_netmf(graph, feature_dim=1024, negative_samples=1.0)
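
The contrast drawn above — sampled walks versus the full walk distribution — comes down to matrix powers. Walk-based methods estimate the k-step visit distribution by Monte Carlo sampling; powers of the transition matrix give it exactly. A small numpy illustration (the graph is a toy, not pycleora internals):

```python
import numpy as np

# Row-stochastic transition matrix of a 3-node path graph: 0 - 1 - 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)

# Exact 2-step walk distribution starting from node 0: one row of P^2.
start = np.array([1.0, 0.0, 0.0])
dist = start @ np.linalg.matrix_power(P, 2)
print(dist)   # [0.5 0.  0.5]
```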

Node Features

import numpy as np
from pycleora import embed_with_node_features

node_features = {"alice": np.array([1.0, 0.5, ...]), ...}
emb = embed_with_node_features(graph, node_features, feature_weight=0.7)

Edge Features

import numpy as np
from pycleora import embed_edge_features

# Multi-dimensional features per edge
edge_feats = {"alice item_laptop": np.array([1.0, 0.8, ...]), ...}
emb = embed_edge_features(graph, edge_feats, feature_dim=64, combine="concat")
# combine: "concat" | "mean" | "edge_only"
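
The three combine modes can be sketched per node in numpy; this is an illustration of the naming, assuming "concat" stacks the two vectors, "mean" averages them, and "edge_only" discards the node part:

```python
import numpy as np

def combine(node_emb, edge_emb, mode="concat"):
    """Combine a node embedding with an aggregated edge-feature embedding."""
    if mode == "concat":
        return np.concatenate([node_emb, edge_emb])
    if mode == "mean":
        return (node_emb + edge_emb) / 2.0   # requires equal dimensions
    if mode == "edge_only":
        return edge_emb
    raise ValueError(mode)

node = np.array([1.0, 0.0])
edge = np.array([0.0, 1.0])
print(combine(node, edge, "concat"))   # [1. 0. 0. 1.]
```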

Attention-Weighted Embedding

from pycleora import embed_with_attention

emb = embed_with_attention(graph, feature_dim=1024, attention_temperature=0.5)

Building Graphs

# From edge list
graph = SparseMatrix.from_iterator(iter(edges), columns)

# Dynamic updates
from pycleora import update_graph, remove_edges
graph2 = update_graph(edges, new_edges, columns)
graph3 = remove_edges(edges, edges_to_remove, columns)

# Streaming (batch-by-batch)
from pycleora import embed_streaming
graph, emb = embed_streaming(batches, columns)

Heterogeneous Graphs

from pycleora.hetero import HeteroGraph

hg = HeteroGraph()
hg.add_node_type("user")
hg.add_node_type("product")
hg.add_edge_type("purchased", "user", "product", [("alice", "laptop")])

# Per-relation embedding
graphs, embeddings, combined = hg.embed_per_relation(feature_dim=64)

# Metapath embedding
g, emb = hg.embed_metapath(["purchased", "sold_at"], feature_dim=64)

Graph Generation

from pycleora.generators import (
    erdos_renyi, barabasi_albert,
    stochastic_block_model, watts_strogatz
)

er = erdos_renyi(num_nodes=1000, p=0.01)
ba = barabasi_albert(num_nodes=1000, m=3)
sbm = stochastic_block_model([100, 100, 100], p_within=0.1, p_between=0.001)
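
The Erdős–Rényi model above is simple enough to sketch directly: include each undirected pair with independent probability p. A numpy version (`erdos_renyi_edges` is an illustrative helper returning an edge list rather than a pycleora graph):

```python
import numpy as np

def erdos_renyi_edges(num_nodes, p, seed=0):
    """G(n, p): keep each undirected pair (i, j), i < j, with probability p."""
    rng = np.random.default_rng(seed)
    i, j = np.triu_indices(num_nodes, k=1)
    keep = rng.random(len(i)) < p
    return list(zip(i[keep].tolist(), j[keep].tolist()))

edges = erdos_renyi_edges(1000, 0.01)
# Expected edge count: p * n(n-1)/2 = 0.01 * 499500 ≈ 4995
print(len(edges))
```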

Graph Sampling

from pycleora.sampling import (
    sample_neighborhood, sample_subgraph,
    graphsaint_sample, negative_sampling,
    train_test_split_edges
)

# K-hop neighborhood
nbhood = sample_neighborhood(graph, ["alice"], num_hops=2, max_neighbors_per_hop=10)

# Mini-batches for large graphs
batches = graphsaint_sample(graph, batch_size=512, num_batches=10)

# Link prediction data split
split = train_test_split_edges(graph, test_ratio=0.2)
neg_edges = negative_sampling(graph, num_negatives=1000)
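
Negative sampling for link prediction just draws node pairs that are not edges. A minimal sketch of the idea (the helper below is illustrative; it works on an explicit edge list rather than a pycleora graph):

```python
import numpy as np

def sample_negatives(edges, num_nodes, num_negatives, seed=0):
    """Sample node pairs that are NOT existing edges (candidate non-links)."""
    existing = set(edges) | {(b, a) for a, b in edges}
    rng = np.random.default_rng(seed)
    negatives = []
    while len(negatives) < num_negatives:
        a, b = rng.integers(0, num_nodes, size=2)
        if a != b and (int(a), int(b)) not in existing:
            negatives.append((int(a), int(b)))
    return negatives

pos = [(0, 1), (1, 2), (2, 3)]
neg = sample_negatives(pos, num_nodes=10, num_negatives=5)
print(neg)
```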

Classification

Built-in classifiers — pure numpy/scipy, no PyTorch or TensorFlow required:

from pycleora.classify import mlp_classify, label_propagation

# MLP classifier — dense layers on embedding vectors
result = mlp_classify(graph, emb, labels, hidden_dim=64)
print(f"Accuracy: {result['accuracy']:.4f}")

# Semi-supervised label propagation — no embeddings needed
predictions = label_propagation(graph, partial_labels)

Why MLP, not GCN? Cleora embeddings already encode multi-hop graph structure through iterative Markov propagation. A simple MLP is the right classifier — adding GCN graph convolution layers on top of already-convolved embeddings causes over-smoothing and adds no value. That's the whole point: Cleora replaces the GCN feature extraction step entirely.
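
Label propagation itself is a short algorithm: repeatedly average neighbor label distributions while clamping the known labels. A numpy sketch of the classic scheme (illustrative, not pycleora's implementation):

```python
import numpy as np

def propagate_labels(A, labels, num_iters=20):
    """Semi-supervised label propagation: repeatedly average neighbor label
    distributions, re-clamping the known labels after every step.
    `labels` holds a class index for known nodes and -1 for unknown nodes."""
    n = len(labels)
    classes = int(labels.max()) + 1
    P = A / A.sum(axis=1, keepdims=True)           # row-stochastic transition
    Y = np.zeros((n, classes))
    known = labels >= 0
    Y[known, labels[known]] = 1.0
    for _ in range(num_iters):
        Y = P @ Y
        Y[known] = 0.0
        Y[known, labels[known]] = 1.0              # clamp known labels
    return Y.argmax(axis=1)

# Path graph 0-1-2-3; node 0 labeled class 0, node 3 labeled class 1.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
labels = np.array([0, -1, -1, 1])
print(propagate_labels(A, labels))   # [0 0 1 1]
```

Each unlabeled node ends up with the class of the nearer labeled node, which is the behavior that makes the method useful when labels are sparse.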

Evaluation Metrics

from pycleora.metrics import (
    node_classification_scores, link_prediction_scores,
    clustering_scores, map_at_k, ndcg_at_k,
    adjusted_rand_index, silhouette_score
)

# Node classification
scores = node_classification_scores(graph, emb, labels)
# Returns: accuracy, macro_f1, weighted_f1

# Link prediction
lp = link_prediction_scores(graph, emb, test_edges)
# Returns: auc, mrr, hits@1/3/10/50

# Ranking metrics
map_at_k(graph, emb, test_edges, k=10)
ndcg_at_k(graph, emb, test_edges, k=10)
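
NDCG@k rewards placing relevant items near the top, with a logarithmic position discount. A self-contained sketch of the formula (this simplified version computes the ideal ranking within the top k only):

```python
import numpy as np

def ndcg_at_k(relevance, k):
    """NDCG@k: DCG of the predicted ranking divided by DCG of the ideal one.
    `relevance` lists gains in predicted rank order (1 = relevant, 0 = not)."""
    rel = np.asarray(relevance[:k], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    dcg = (rel * discounts).sum()
    ideal = np.sort(rel)[::-1]                 # best possible ordering
    idcg = (ideal * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0

# Relevant item ranked 2nd instead of 1st:
print(round(ndcg_at_k([0, 1, 0], k=3), 4))   # 0.6309
```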

Cross-Validation

from pycleora.metrics import cross_validate

cv = cross_validate(graph, emb, labels, k_folds=5)
print(f"Accuracy: {cv['mean_accuracy']:.4f} +/- {cv['std_accuracy']:.4f}")
print(f"Macro F1: {cv['mean_macro_f1']:.4f} +/- {cv['std_macro_f1']:.4f}")

Hyperparameter Tuning

from pycleora.tuning import grid_search, random_search

result = grid_search(
    graph, labels,
    embed_fn=lambda g, feature_dim=1024, num_iterations=4: embed(g, feature_dim, num_iterations),
    param_grid={"feature_dim": [128, 512, 1024], "num_iterations": [2, 4, 8]},
)
print(f"Best: {result['best_params']} -> {result['best_score']:.4f}")
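
Grid search is an exhaustive loop over the Cartesian product of parameter values. A generic sketch of the mechanism (the toy objective stands in for an actual embed-and-score pipeline):

```python
from itertools import product

def search_grid(param_grid, score_fn):
    """Exhaustive grid search: evaluate every parameter combination
    and keep the best-scoring one."""
    best_params, best_score = None, float("-inf")
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return {"best_params": best_params, "best_score": best_score}

# Toy objective that peaks at feature_dim=512, num_iterations=4:
result = search_grid(
    {"feature_dim": [128, 512, 1024], "num_iterations": [2, 4, 8]},
    lambda feature_dim, num_iterations: -abs(feature_dim - 512) - abs(num_iterations - 4),
)
print(result["best_params"])   # {'feature_dim': 512, 'num_iterations': 4}
```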

CLI Tool

# Generate embeddings
$ pycleora embed --input edges.tsv --dim 1024 --output emb.npz

# With algorithm choice
$ pycleora embed -i graph.tsv -d 256 -a deepwalk -o emb.npz

# Graph info
$ pycleora info --input edges.tsv

# Find similar entities
$ pycleora similar --input edges.tsv --entity alice --top-k 10

# Run benchmarks
$ pycleora benchmark --dataset karate_club

Import/Export

from pycleora.io_utils import save_embeddings, load_embeddings, to_networkx

# Save/load
save_embeddings(graph, emb, "embeddings.npz")
emb, ids = load_embeddings("embeddings.npz")

# Export to NetworkX, PyG, DGL
G = to_networkx(graph, emb)
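
The .npz format used above is plain numpy; the key detail is storing entity ids alongside the matrix so rows stay aligned after loading. An illustrative round-trip using numpy directly (not pycleora's save_embeddings internals):

```python
import numpy as np

# Store the embedding matrix together with the entity ids so that
# row i of the matrix still corresponds to entity i after loading.
emb = np.array([[0.1, 0.2], [0.3, 0.4]])
ids = np.array(["alice", "bob"])

np.savez("embeddings.npz", embeddings=emb, entity_ids=ids)

data = np.load("embeddings.npz")
emb2, ids2 = data["embeddings"], data["entity_ids"]
print(ids2.tolist())   # ['alice', 'bob']
```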

Visualization

from pycleora.viz import visualize, reduce_dimensions

# t-SNE / PCA reduction
emb_2d = reduce_dimensions(emb, method="tsne")

# Full visualization with labels
visualize(graph, emb, labels=labels, save_path="graph.png")