API Reference

Complete API documentation for pycleora 3.2. All functions, parameters, and return values.

pycleora.SparseMatrix

SparseMatrix.from_iterator(edge_iter, columns)

Build a graph from an iterator of space-separated edge strings.

edge_iter : Iterator[str] — Iterator of edge strings, e.g. "entity_a entity_b"

columns : str — Column definition string (e.g. "complex::reflexive::entity")

Returns: SparseMatrix — The constructed graph object

SparseMatrix.entity_ids

List of entity ID strings in the graph.

Returns: List[str]

SparseMatrix.to_sparse_csr()

Convert the graph to a scipy CSR sparse matrix.

Returns: scipy.sparse.csr_matrix — Row-stochastic adjacency matrix
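The input format for from_iterator can be illustrated without the library. A minimal sketch (the basket data and entity names are hypothetical): each line passed to the iterator lists the entities of one hyperedge, space-separated.

```python
# Hypothetical per-customer baskets; each basket becomes one hyperedge line.
baskets = {
    "cust_1": ["apple", "banana"],
    "cust_2": ["banana", "cherry", "apple"],
}

# One space-separated line per hyperedge, as from_iterator expects.
edges = [" ".join(items) for items in baskets.values()]

# With columns="complex::reflexive::entity", each line expands into
# pairwise edges among its entities (plus self-loops via "reflexive"):
# mat = SparseMatrix.from_iterator(iter(edges), columns="complex::reflexive::entity")
```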

pycleora.embed()

embed(graph, feature_dim=128, num_iterations=4, propagation="left", normalization="l2", whiten=True)

Generate Cleora embeddings via Markov propagation. Whitening is enabled by default — it dramatically improves accuracy at all dimensions. Use num_iterations="auto" to automatically scale iterations based on dimension.

graph : SparseMatrix — Input graph

feature_dim : int — Embedding dimensionality (default: 128)

num_iterations : int or "auto" — Propagation iterations (default: 4). "auto" selects 4 for dim ≤256, 8 for dim ≤512, 16 for dim >512

propagation : str — "left" or "symmetric" (default: "left")

normalization : str — "l2", "l1", "spectral", or "none" (default: "l2")

whiten : bool — Apply PCA whitening post-processing (default: True). Dramatically improves accuracy by decorrelating and standardizing dimensions

Returns: np.ndarray — Shape (n_entities, feature_dim)
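The documented "auto" rule for num_iterations can be restated as a plain function (a sketch of the thresholds listed above, not part of the library's public API):

```python
def auto_iterations(feature_dim: int) -> int:
    """Thresholds behind num_iterations="auto", as documented above."""
    if feature_dim <= 256:
        return 4
    if feature_dim <= 512:
        return 8
    return 16

# e.g. embed(graph, feature_dim=1024, num_iterations="auto")
# corresponds to auto_iterations(1024) propagation steps.
```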

pycleora.find_most_similar()

find_most_similar(graph, embeddings, entity_id, top_k=10)

Find the top-K most similar entities by cosine similarity.

graph : SparseMatrix — Input graph

embeddings : np.ndarray — Embedding matrix

entity_id : str — Query entity ID

top_k : int — Number of results (default: 10)

Returns: List[Dict] — List of dicts with entity_id and similarity keys
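For intuition, a rough pure-NumPy equivalent of this lookup (assuming embedding rows align with graph.entity_ids; this is an illustration, not the library's internal code):

```python
import numpy as np

def most_similar(entity_ids, embeddings, query_id, top_k=10):
    """Top-k entities by cosine similarity to the query (query excluded)."""
    unit = embeddings / np.clip(
        np.linalg.norm(embeddings, axis=1, keepdims=True), 1e-12, None
    )
    sims = unit @ unit[entity_ids.index(query_id)]
    results = []
    for i in np.argsort(-sims):
        if entity_ids[i] != query_id:
            results.append({"entity_id": entity_ids[i], "similarity": float(sims[i])})
        if len(results) == top_k:
            break
    return results

ids = ["a", "b", "c"]
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
nearest = most_similar(ids, emb, "a", top_k=2)
```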

pycleora.embed_multiscale()

embed_multiscale(graph, feature_dim=128, scales=[1, 2, 4, 8], whiten=True)

Generate multi-scale embeddings by concatenating embeddings at different propagation depths. Each scale captures a different neighborhood radius. Whitening is enabled by default.

feature_dim : int — Dimension per scale (total output = feature_dim × len(scales))

scales : List[int] — Iteration counts per scale (default: [1, 2, 4, 8])

whiten : bool — Apply PCA whitening to each scale before concatenation (default: True)

Returns: np.ndarray — Shape (n_entities, feature_dim × len(scales))
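The output-shape relationship can be sketched with stand-in arrays (random data here only models the shapes; in the real call each block comes from propagation at the corresponding iteration count):

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, feature_dim, scales = 5, 8, [1, 2, 4, 8]

# Stand-in per-scale embeddings, one (n_entities, feature_dim) block per scale.
per_scale = [rng.normal(size=(n_entities, feature_dim)) for _ in scales]
multi = np.concatenate(per_scale, axis=1)

print(multi.shape)  # (n_entities, feature_dim * len(scales))
```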

pycleora.embed_with_node_features()

embed_with_node_features(graph, node_features, feature_weight=0.5, feature_dim=1024)

Combine structural embeddings with external node feature vectors.

node_features : Dict[str, np.ndarray] — Feature vectors per entity

feature_weight : float — Weight for features vs structure (0-1)

Returns: np.ndarray

pycleora.embed_edge_features()

embed_edge_features(graph, edge_features, feature_dim=64, combine="concat")

Embed with multi-dimensional edge features.

edge_features : Dict[str, np.ndarray] — Feature vectors per edge (key: "src dst")

combine : str — "concat", "mean", or "edge_only" (default: "concat")

Returns: np.ndarray

pycleora.embed_with_attention()

embed_with_attention(graph, feature_dim=1024, attention_temperature=1.0)

Embedding with attention-weighted propagation using temperature-based softmax.

attention_temperature : float — Lower = sharper attention (default: 1.0)

Returns: np.ndarray

algorithms.embed_deepwalk()

embed_deepwalk(graph, feature_dim=1024, num_walks=20, walk_length=40, window_size=5)

DeepWalk: uniform random walks + SVD factorization of co-occurrence matrix.

num_walks : int — Walks per node (default: 20)

walk_length : int — Steps per walk (default: 40)

window_size : int — Context window for co-occurrence (default: 5)

Returns: np.ndarray

algorithms.embed_node2vec()

embed_node2vec(graph, feature_dim=1024, p=1.0, q=1.0, num_walks=20, walk_length=40)

Node2Vec: biased random walks. The return parameter p and in-out parameter q interpolate between BFS-like (local) and DFS-like (outward) exploration.

p : float — Return parameter. Low p = BFS-like (default: 1.0)

q : float — In-out parameter. Low q = DFS-like (default: 1.0)

Returns: np.ndarray
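The p/q biasing follows the standard second-order walk weighting, which can be sketched as follows (a generic illustration of Node2Vec's scheme, not this library's internal code; the graph data is hypothetical):

```python
def step_weight(prev, nxt, neighbors, p=1.0, q=1.0):
    """Unnormalized weight for stepping to `nxt` after arriving from `prev`."""
    if nxt == prev:
        return 1.0 / p        # return to the previous node
    if nxt in neighbors[prev]:
        return 1.0            # stay at distance 1 from prev (BFS-like)
    return 1.0 / q            # move outward (DFS-like)

nbrs = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}
# The walk arrived at "b" from "a"; weights for candidate next steps:
w_return = step_weight("a", "a", nbrs, p=0.25, q=2.0)   # low p favors returning
w_local = step_weight("a", "c", nbrs, p=0.25, q=2.0)    # common neighbor of prev
w_outward = step_weight("a", "d", nbrs, p=0.25, q=2.0)  # low q would favor this
```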

algorithms.embed_netmf()

embed_netmf(graph, feature_dim=1024, window_size=5, negative_samples=1.0)

NetMF: network embedding as matrix factorization. Directly factorizes the closed-form matrix that DeepWalk implicitly approximates through sampling.

window_size : int — Context window (default: 5)

negative_samples : float — Negative sampling parameter (default: 1.0)

Returns: np.ndarray

algorithms.embed_prone()

embed_prone(graph, feature_dim=1024)

ProNE: scalable network embedding via spectral propagation with a band-pass filter.

algorithms.embed_randne()

embed_randne(graph, feature_dim=1024, num_iterations=3, weights=None, seed=0)

RandNE: Random Projection-based Network Embedding. Ultra-fast Gaussian projection with iterative power refinement.

num_iterations : int — Number of power iterations (default: 3)

weights : List[float] or None — Weights for each iteration (default: auto)

algorithms.embed_grarep()

embed_grarep(graph, feature_dim=1024, max_step=4)

GraRep: Multi-scale transition matrix factorization. Captures k-step neighborhood patterns.

max_step : int — Max transition steps (default: 4)

algorithms.embed_hope()

embed_hope(graph, feature_dim=1024, beta=0.01)

HOPE: High-Order Proximity preserved Embedding. Preserves asymmetric proximities (Katz similarity) in directed graphs.

beta : float — Decay factor for Katz similarity (default: 0.01)

classify.mlp_classify()

mlp_classify(graph, embeddings, labels, hidden_dim=64)

Multi-layer perceptron classifier on embedding vectors.

classify.label_propagation()

label_propagation(graph, partial_labels, max_iter=100)

Semi-supervised label propagation using graph structure.

partial_labels : Dict[str, int] — Known labels (can be sparse)

Returns: Dict[str, int] — Predicted labels for all entities
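The idea can be sketched in plain Python as iterated neighbor majority voting with seed labels held fixed (a simplified illustration; the library's update rule may differ):

```python
from collections import Counter

def propagate_labels(edges, partial_labels, max_iter=100):
    """Spread known labels along edges by iterated neighbor majority vote."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    labels = dict(partial_labels)
    for _ in range(max_iter):
        changed = False
        for node in nbrs:
            if node in partial_labels:
                continue  # seed labels stay fixed
            votes = Counter(labels[n] for n in nbrs[node] if n in labels)
            if votes and labels.get(node) != votes.most_common(1)[0][0]:
                labels[node] = votes.most_common(1)[0][0]
                changed = True
        if not changed:
            break
    return labels

predicted = propagate_labels([("a", "b"), ("b", "c")], {"a": 0})
```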

metrics.node_classification_scores()

node_classification_scores(graph, embeddings, labels, test_ratio=0.2)

Evaluate node classification using nearest centroid.

Returns: Dict with accuracy, macro_f1, weighted_f1

metrics.link_prediction_scores()

link_prediction_scores(graph, embeddings, test_edges, num_neg=100)

Evaluate link prediction with ranking metrics.

Returns: Dict with auc, mrr, hits_at_1, hits_at_3, hits_at_10, hits_at_50
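The MRR and hits@k components can be sketched from raw scores, ranking each positive edge against its sampled negatives (an illustration of the metric definitions, not the library's code; ties count against the positive):

```python
def ranking_metrics(pos_scores, neg_scores, ks=(1, 3, 10)):
    """MRR and hits@k, ranking each positive edge against its negatives."""
    n = len(pos_scores)
    mrr, hits = 0.0, {k: 0 for k in ks}
    for pos, negs in zip(pos_scores, neg_scores):
        rank = 1 + sum(neg >= pos for neg in negs)  # ties count against the positive
        mrr += 1.0 / rank
        for k in ks:
            hits[k] += rank <= k
    return {"mrr": mrr / n, **{f"hits_at_{k}": hits[k] / n for k in ks}}

m = ranking_metrics([0.9, 0.2], [[0.5, 0.1], [0.8, 0.3, 0.1]])
```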

metrics.map_at_k()

map_at_k(graph, embeddings, test_edges, k=10)

Mean Average Precision at K for link prediction ranking.

metrics.ndcg_at_k()

ndcg_at_k(graph, embeddings, test_edges, k=10)

Normalized Discounted Cumulative Gain at K.
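The NDCG@k definition for a single ranked list can be sketched directly (a restatement of the standard formula, not the library's implementation):

```python
import math

def ndcg(relevances, k=10):
    """NDCG@k for one ranked list of binary or graded relevance scores."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```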

metrics.cross_validate()

cross_validate(graph, embeddings, labels, k_folds=5)

k-fold cross-validation for node classification.

Returns: Dict with mean_accuracy, std_accuracy, mean_macro_f1, std_macro_f1, fold_accuracies

sampling.sample_neighborhood()

sample_neighborhood(graph, seed_nodes, num_hops=2, max_neighbors_per_hop=None, seed=42)

K-hop neighborhood sampling from seed nodes.

seed_nodes : List[str] — Starting entity IDs

num_hops : int — Number of expansion hops

max_neighbors_per_hop : int or None — Max neighbors sampled per hop (None = all)

Returns: Dict with sampled_nodes and edges
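The expansion logic can be sketched as frontier-by-frontier BFS with optional per-hop subsampling (a simplified illustration on an adjacency dict; the library operates on SparseMatrix directly):

```python
import random

def khop_sample(nbrs, seed_nodes, num_hops=2, max_per_hop=None, seed=42):
    """Expand a frontier hop by hop, optionally subsampling each new hop."""
    rng = random.Random(seed)
    visited = set(seed_nodes)
    frontier = list(seed_nodes)
    for _ in range(num_hops):
        fresh = sorted({n for u in frontier for n in nbrs.get(u, ()) if n not in visited})
        if max_per_hop is not None and len(fresh) > max_per_hop:
            fresh = rng.sample(fresh, max_per_hop)
        visited.update(fresh)
        frontier = fresh
    return visited

chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
sampled = khop_sample(chain, ["a"], num_hops=2)
```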

sampling.graphsaint_sample()

graphsaint_sample(graph, batch_size=512, num_batches=10)

GraphSAINT-style random mini-batch sampling.

Returns: List[Dict] — List of mini-batch subgraphs

sampling.negative_sampling()

negative_sampling(graph, num_negatives=1000)

Generate negative (non-existing) edges for link prediction training.

Returns: List[Tuple[str, str]]
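The core idea is rejection sampling over node pairs, which can be sketched as follows (an illustration assuming the graph is sparse enough that rejection terminates quickly; not the library's code):

```python
import random

def sample_negatives(nodes, edges, num_negatives, seed=0):
    """Draw node pairs that are not existing (undirected) edges."""
    existing = {frozenset(e) for e in edges}
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < num_negatives:
        u, v = rng.sample(nodes, 2)
        if frozenset((u, v)) not in existing:
            negatives.add((u, v))
    return sorted(negatives)

negs = sample_negatives(["a", "b", "c", "d"], [("a", "b")], num_negatives=3)
```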

generators.erdos_renyi()

erdos_renyi(num_nodes, p, seed=None)

Generate an Erdős–Rényi random graph G(n, p).

p : float — Edge probability between any two nodes

Returns: Tuple[List[str], Dict] — (edge_list, metadata)
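The G(n, p) model is simple enough to sketch standalone; this version emits just the "u v" edge strings (the format from_iterator consumes) rather than the documented (edge_list, metadata) tuple, and the n-prefixed node names are an arbitrary choice:

```python
import random

def erdos_renyi_edges(num_nodes, p, seed=None):
    """G(n, p): include each possible undirected edge with probability p."""
    rng = random.Random(seed)
    return [
        f"n{i} n{j}"
        for i in range(num_nodes)
        for j in range(i + 1, num_nodes)
        if rng.random() < p
    ]

dense = erdos_renyi_edges(4, 1.0)  # all n*(n-1)/2 possible edges
empty = erdos_renyi_edges(4, 0.0)  # no edges
```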

generators.barabasi_albert()

barabasi_albert(num_nodes, m=3, seed=None)

Generate a Barabási–Albert preferential attachment graph.

m : int — Number of edges per new node (default: 3)

Returns: Tuple[List[str], Dict]

generators.stochastic_block_model()

stochastic_block_model(block_sizes, p_within=0.1, p_between=0.01, seed=None)

Generate a Stochastic Block Model graph with community structure.

block_sizes : List[int] — Number of nodes per block

p_within : float — Intra-community edge probability

p_between : float — Inter-community edge probability

Returns: Tuple[List[str], Dict]

grid_search()

grid_search(graph, labels, embed_fn, param_grid, metric="accuracy")

Exhaustive grid search over embedding hyperparameters.

embed_fn : Callable — Function that takes graph + params and returns embeddings

param_grid : Dict[str, List] — Parameter name to values mapping

Returns: Dict with best_params, best_score, all_results
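The search loop amounts to a Cartesian product over the grid; a self-contained sketch with toy embed/score functions standing in for real embedding and evaluation (function names and the scoring stand-ins are invented for illustration):

```python
from itertools import product

def grid_search_sketch(embed_fn, score_fn, param_grid):
    """Evaluate every parameter combination and keep the best score."""
    keys = sorted(param_grid)
    result = {"best_params": None, "best_score": float("-inf"), "all_results": []}
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(embed_fn(**params))
        result["all_results"].append({"params": params, "score": score})
        if score > result["best_score"]:
            result["best_params"], result["best_score"] = params, score
    return result

found = grid_search_sketch(
    embed_fn=lambda feature_dim, num_iterations: feature_dim * num_iterations,
    score_fn=lambda x: -abs(x - 512),  # pretend 512 is the sweet spot
    param_grid={"feature_dim": [128, 256], "num_iterations": [2, 4]},
)
```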

random_search()

random_search(graph, labels, embed_fn, param_distributions, n_iter=20, metric="accuracy")

Random search over embedding hyperparameters.

param_distributions : Dict[str, List] — Parameter name to candidate values

n_iter : int — Number of random combinations to try

Returns: Dict with best_params, best_score, all_results