API Reference
Complete API documentation for pycleora 3.2. All functions, parameters, and return values.
pycleora.SparseMatrix
SparseMatrix.from_iterator(edge_iter, columns)
Build a graph from an iterator of space-separated edge strings.
edge_iter : Iterator[str] — Iterator of edge strings, e.g. "entity_a entity_b"
columns : str — Column definition string (e.g. "complex::reflexive::entity")
Returns: SparseMatrix — The constructed graph object
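The iterator just yields space-separated strings, so any source of pairs can feed it. A minimal sketch of preparing the input (entity names here are illustrative; the from_iterator call itself is shown as a comment since it requires pycleora to be installed):

```python
# Turn (src, dst) tuples into the space-separated edge strings
# that SparseMatrix.from_iterator expects.
edges = [("user_1", "product_a"), ("user_1", "product_b"), ("user_2", "product_a")]
edge_strings = [f"{src} {dst}" for src, dst in edges]

# With pycleora installed, the graph would be built as:
# graph = SparseMatrix.from_iterator(iter(edge_strings),
#                                    columns="complex::reflexive::entity")
```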
SparseMatrix.entity_ids
List of entity ID strings in the graph.
Returns: List[str]
SparseMatrix.to_sparse_csr()
Convert the graph to a scipy CSR sparse matrix.
Returns: scipy.sparse.csr_matrix — Row-stochastic adjacency matrix
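"Row-stochastic" means each row sums to 1, so row i holds the outgoing transition probabilities of entity i. A library-independent sketch of that normalization on a toy dense matrix:

```python
# Row-normalize an adjacency matrix so each row sums to 1
# (the property the docs call "row-stochastic").
adj = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
]
row_stochastic = [
    [v / sum(row) for v in row] if sum(row) else row  # keep all-zero rows as-is
    for row in adj
]
```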
pycleora.embed()
embed(graph, feature_dim=128, num_iterations=4, propagation="left", normalization="l2", whiten=True)
Generate Cleora embeddings via iterative Markov propagation. PCA whitening is applied by default and substantially improves accuracy at all dimensions. Pass num_iterations="auto" to scale the iteration count with feature_dim.
graph : SparseMatrix — Input graph
feature_dim : int — Embedding dimensionality (default: 128)
num_iterations : int or "auto" — Propagation iterations (default: 4). "auto" selects 4 for dim ≤256, 8 for dim ≤512, 16 for dim >512
propagation : str — "left" or "symmetric" (default: "left")
normalization : str — "l2", "l1", "spectral", or "none" (default: "l2")
whiten : bool — Apply PCA whitening post-processing, which decorrelates and standardizes dimensions (default: True)
Returns: np.ndarray — Shape (n_entities, feature_dim)
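As a sanity check, the "auto" schedule above reduces to a simple threshold function (a restatement of the documented rule, not library code):

```python
def auto_iterations(feature_dim: int) -> int:
    """Iteration count selected by num_iterations="auto" per the documented thresholds."""
    if feature_dim <= 256:
        return 4
    if feature_dim <= 512:
        return 8
    return 16
```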
pycleora.find_most_similar()
find_most_similar(graph, embeddings, entity_id, top_k=10)
Find the top-K most similar entities by cosine similarity.
graph : SparseMatrix — Input graph
embeddings : np.ndarray — Embedding matrix
entity_id : str — Query entity ID
top_k : int — Number of results (default: 10)
Returns: List[Dict] — List of dicts with entity_id and similarity keys
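Independent of pycleora, the ranking is plain cosine similarity against the query row. A pure-Python sketch on a toy 2-dimensional embedding table (data illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

entity_ids = ["a", "b", "c", "d"]
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}

def find_most_similar_sketch(entity_id, top_k=2):
    query = embeddings[entity_id]
    scored = [
        {"entity_id": e, "similarity": cosine(embeddings[e], query)}
        for e in entity_ids if e != entity_id
    ]
    # Highest cosine similarity first, truncated to top_k results.
    return sorted(scored, key=lambda r: -r["similarity"])[:top_k]
```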
pycleora.embed_multiscale()
embed_multiscale(graph, feature_dim=128, scales=[1, 2, 4, 8], whiten=True)
Generate multi-scale embeddings by concatenating embeddings at different propagation depths. Each scale captures a different neighborhood radius. Whitening is enabled by default.
feature_dim : int — Dimension per scale (total output = feature_dim × len(scales))
scales : List[int] — Iteration counts per scale (default: [1, 2, 4, 8])
whiten : bool — Apply PCA whitening to each scale before concatenation (default: True)
Returns: np.ndarray — Shape (n_entities, feature_dim × len(scales))
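The output width follows directly from concatenation: one feature_dim block per scale. A toy illustration with stand-in per-scale vectors for a single entity:

```python
feature_dim = 4
scales = [1, 2, 4, 8]

# Stand-in per-scale embeddings for one entity (values illustrative).
per_scale = [[float(s)] * feature_dim for s in scales]

# Concatenation: total width = feature_dim * len(scales).
multiscale = [v for block in per_scale for v in block]
```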
pycleora.embed_with_node_features()
embed_with_node_features(graph, node_features, feature_weight=0.5, feature_dim=1024)
Combine structural embeddings with external node feature vectors.
node_features : Dict[str, np.ndarray] — Feature vectors per entity
feature_weight : float — Weight for features vs structure (0-1)
Returns: np.ndarray
pycleora.embed_edge_features()
embed_edge_features(graph, edge_features, feature_dim=64, combine="concat")
Embed with multi-dimensional edge features.
edge_features : Dict[str, np.ndarray] — Feature vectors per edge (key: "src dst")
combine : str — "concat", "mean", or "edge_only"
Returns: np.ndarray
pycleora.embed_with_attention()
embed_with_attention(graph, feature_dim=1024, attention_temperature=1.0)
Embedding with attention-weighted propagation using temperature-based softmax.
attention_temperature : float — Lower = sharper attention (default: 1.0)
Returns: np.ndarray
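Temperature enters in the usual softmax(w / T) form, so lowering T concentrates weight on the strongest neighbor. A self-contained sketch of that effect (the exact weighting inside pycleora may differ):

```python
import math

def softmax_with_temperature(weights, temperature):
    scaled = [w / temperature for w in weights]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

weights = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(weights, 0.1)   # low T: near one-hot
soft = softmax_with_temperature(weights, 10.0)   # high T: near uniform
```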
algorithms.embed_deepwalk()
embed_deepwalk(graph, feature_dim=1024, num_walks=20, walk_length=40, window_size=5)
DeepWalk: uniform random walks followed by SVD factorization of the resulting co-occurrence matrix.
num_walks : int — Walks per node (default: 20)
walk_length : int — Steps per walk (default: 40)
window_size : int — Context window for co-occurrence (default: 5)
Returns: np.ndarray
algorithms.embed_node2vec()
embed_node2vec(graph, feature_dim=1024, p=1.0, q=1.0, num_walks=20, walk_length=40)
Node2Vec: biased random walks. The return parameter p and the in-out parameter q trade off local, BFS-like exploration against outward, DFS-like exploration.
p : float — Return parameter. Low p = BFS-like (default: 1.0)
q : float — In-out parameter. Low q = DFS-like (default: 1.0)
Returns: np.ndarray
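Assuming standard node2vec semantics, the unnormalized transition weight toward a candidate x depends on x's distance from the previous node t: 1/p for returning to t, 1 for t's immediate neighbors, 1/q for moving further out. A sketch of that bias term:

```python
def node2vec_bias(dist_from_prev: int, p: float, q: float) -> float:
    """Unnormalized transition weight given d(t, x), the distance from the previous node."""
    if dist_from_prev == 0:   # stepping back to the previous node t
        return 1.0 / p
    if dist_from_prev == 1:   # staying within t's immediate neighborhood
        return 1.0
    return 1.0 / q            # moving outward (distance 2 from t)

# Low p favors returning (BFS-like); low q favors exploring outward (DFS-like).
```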
algorithms.embed_netmf()
embed_netmf(graph, feature_dim=1024, window_size=5, negative_samples=1.0)
NetMF: network embedding as matrix factorization. Factorizes in closed form the matrix that DeepWalk implicitly approximates.
window_size : int — Context window (default: 5)
negative_samples : float — Negative sampling parameter (default: 1.0)
Returns: np.ndarray
algorithms.embed_prone()
embed_prone(graph, feature_dim=1024)
ProNE: fast network embedding via sparse matrix factorization followed by spectral propagation with a band-pass filter.
Returns: np.ndarray
algorithms.embed_randne()
embed_randne(graph, feature_dim=1024, num_iterations=3, weights=None, seed=0)
RandNE: Random Projection-based Network Embedding. Ultra-fast Gaussian projection with iterative power refinement.
num_iterations : int — Number of power iterations (default: 3)
weights : List[float] or None — Weights for each iteration (default: auto)
Returns: np.ndarray
algorithms.embed_grarep()
embed_grarep(graph, feature_dim=1024, max_step=4)
GraRep: Multi-scale transition matrix factorization. Captures k-step neighborhood patterns.
max_step : int — Max transition steps (default: 4)
Returns: np.ndarray
algorithms.embed_hope()
embed_hope(graph, feature_dim=1024, beta=0.01)
HOPE: High-Order Proximity preserved Embedding. Preserves asymmetric relationships in directed graphs.
beta : float — Decay factor for Katz similarity (default: 0.01)
Returns: np.ndarray
classify.mlp_classify()
mlp_classify(graph, embeddings, labels, hidden_dim=64)
Multi-layer perceptron classifier on embedding vectors.
classify.label_propagation()
label_propagation(graph, partial_labels, max_iter=100)
Semi-supervised label propagation using graph structure.
partial_labels : Dict[str, int] — Known labels (can be sparse)
Returns: Dict[str, int] — Predicted labels for all entities
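A minimal sketch of the idea on a toy adjacency list, assuming majority voting over labeled neighbors (pycleora's exact update rule may differ):

```python
adjacency = {
    "b": ["a1", "a2", "c"], "c": ["d"],
    "a1": ["b"], "a2": ["b"], "d": ["c"],
}
seeds = {"a1": 0, "a2": 0, "d": 1}  # sparse known labels
labels = dict(seeds)

for _ in range(100):  # max_iter
    updated = dict(labels)
    for node, neighbors in adjacency.items():
        if node in seeds:
            continue  # seed labels stay fixed
        votes = [labels[n] for n in neighbors if n in labels]
        if votes:
            updated[node] = max(set(votes), key=votes.count)  # majority vote
    if updated == labels:
        break  # converged
    labels = updated
```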
metrics.node_classification_scores()
node_classification_scores(graph, embeddings, labels, test_ratio=0.2)
Evaluate node classification using nearest centroid.
Returns: Dict with accuracy, macro_f1, weighted_f1
metrics.link_prediction_scores()
link_prediction_scores(graph, embeddings, test_edges, num_neg=100)
Evaluate link prediction with ranking metrics.
Returns: Dict with auc, mrr, hits_at_1, hits_at_3, hits_at_10, hits_at_50
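These ranking metrics all derive from the rank of the true edge's score among the sampled negatives. A self-contained sketch for a single test edge, with illustrative scores:

```python
def rank_of_positive(pos_score, neg_scores):
    """1-based rank of the true edge among negative candidates."""
    return 1 + sum(1 for s in neg_scores if s > pos_score)

pos = 0.8
negs = [0.9, 0.5, 0.3, 0.7]          # one negative outranks the positive
rank = rank_of_positive(pos, negs)
mrr = 1.0 / rank                      # reciprocal rank for this edge
hits_at_1 = 1 if rank <= 1 else 0
hits_at_3 = 1 if rank <= 3 else 0
```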
metrics.map_at_k()
map_at_k(graph, embeddings, test_edges, k=10)
Mean Average Precision at K for link prediction ranking.
metrics.ndcg_at_k()
ndcg_at_k(graph, embeddings, test_edges, k=10)
Normalized Discounted Cumulative Gain at K.
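For binary relevance, NDCG@K is the DCG of the predicted ordering divided by the DCG of the ideal ordering. A compact reference implementation of that definition:

```python
import math

def ndcg_at_k_sketch(relevance, k):
    """NDCG@K for a binary relevance list ordered by predicted score."""
    def dcg(rels):
        # Log-discounted gain: position i contributes rels[i] / log2(i + 2).
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(relevance, reverse=True)
    denom = dcg(ideal)
    return dcg(relevance) / denom if denom else 0.0
```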
metrics.cross_validate()
cross_validate(graph, embeddings, labels, k_folds=5)
k-fold cross-validation for node classification.
Returns: Dict with mean_accuracy, std_accuracy, mean_macro_f1, std_macro_f1, fold_accuracies
sampling.sample_neighborhood()
sample_neighborhood(graph, seed_nodes, num_hops=2, max_neighbors_per_hop=None, seed=42)
K-hop neighborhood sampling from seed nodes.
seed_nodes : List[str] — Starting entity IDs
num_hops : int — Number of expansion hops
max_neighbors_per_hop : int or None — Max neighbors sampled per hop (None = all)
Returns: Dict with sampled_nodes and edges
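A library-independent sketch of the expansion loop on a toy adjacency list, assuming uniform sampling when a hop exceeds max_neighbors_per_hop:

```python
import random

def sample_neighborhood_sketch(adjacency, seed_nodes, num_hops=2,
                               max_neighbors_per_hop=None, seed=42):
    rng = random.Random(seed)
    sampled = set(seed_nodes)
    frontier = list(seed_nodes)
    for _ in range(num_hops):
        # Collect unvisited neighbors of the current frontier.
        candidates = [n for node in frontier for n in adjacency.get(node, [])
                      if n not in sampled]
        if max_neighbors_per_hop is not None and len(candidates) > max_neighbors_per_hop:
            candidates = rng.sample(candidates, max_neighbors_per_hop)
        sampled.update(candidates)
        frontier = candidates
    return sampled

adjacency = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
nodes = sample_neighborhood_sketch(adjacency, ["a"], num_hops=2)
```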
sampling.graphsaint_sample()
graphsaint_sample(graph, batch_size=512, num_batches=10)
GraphSAINT-style random mini-batch sampling.
Returns: List[Dict] — List of mini-batch subgraphs
sampling.negative_sampling()
negative_sampling(graph, num_negatives=1000)
Generate negative (non-existing) edges for link prediction training.
Returns: List[Tuple[str, str]]
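A minimal rejection-sampling sketch: draw node pairs and keep only those absent from the (undirected) edge set. Names and sizes are illustrative:

```python
import random

def negative_sampling_sketch(nodes, edges, num_negatives, seed=0):
    rng = random.Random(seed)
    # Store both orientations so the membership test is orientation-free.
    existing = set(edges) | {(b, a) for a, b in edges}
    negatives = set()
    while len(negatives) < num_negatives:
        a, b = rng.sample(nodes, 2)
        if (a, b) not in existing and (b, a) not in negatives:
            negatives.add((a, b))
    return list(negatives)

nodes = [f"n{i}" for i in range(10)]
edges = [("n0", "n1"), ("n1", "n2"), ("n2", "n3")]
negs = negative_sampling_sketch(nodes, edges, 5)
```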
generators.erdos_renyi()
erdos_renyi(num_nodes, p, seed=None)
Generate an Erdős–Rényi random graph G(n, p).
p : float — Edge probability between any two nodes
Returns: Tuple[List[str], Dict] — (edge_list, metadata)
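A G(n, p) sketch that emits edges in the space-separated string format used elsewhere in this reference (node naming is illustrative):

```python
import random

def erdos_renyi_sketch(num_nodes, p, seed=None):
    rng = random.Random(seed)
    nodes = [f"node_{i}" for i in range(num_nodes)]
    # Include each of the n*(n-1)/2 possible undirected edges with probability p.
    edges = [
        f"{nodes[i]} {nodes[j]}"
        for i in range(num_nodes)
        for j in range(i + 1, num_nodes)
        if rng.random() < p
    ]
    return edges, {"num_nodes": num_nodes, "p": p}

edges, meta = erdos_renyi_sketch(20, 0.3, seed=42)
```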
generators.barabasi_albert()
barabasi_albert(num_nodes, m=3, seed=None)
Generate a Barabási–Albert preferential attachment graph.
m : int — Number of edges per new node (default: 3)
Returns: Tuple[List[str], Dict]
generators.stochastic_block_model()
stochastic_block_model(block_sizes, p_within=0.1, p_between=0.01, seed=None)
Generate a Stochastic Block Model graph with community structure.
block_sizes : List[int] — Number of nodes per block
p_within : float — Intra-community edge probability
p_between : float — Inter-community edge probability
Returns: Tuple[List[str], Dict]
tuning.grid_search()
grid_search(graph, labels, embed_fn, param_grid, metric="accuracy")
Exhaustive grid search over embedding hyperparameters.
embed_fn : Callable — Function that takes graph + params and returns embeddings
param_grid : Dict[str, List] — Parameter name to values mapping
Returns: Dict with best_params, best_score, all_results
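The candidate set is just the Cartesian product of the param_grid values; itertools.product makes the enumeration explicit (parameter names here are illustrative):

```python
from itertools import product

param_grid = {"feature_dim": [64, 128], "num_iterations": [2, 4, 8]}
keys = list(param_grid)

# One dict per combination: 2 x 3 = 6 candidate settings,
# each of which grid_search would pass to embed_fn and score.
combos = [dict(zip(keys, values)) for values in product(*param_grid.values())]
```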
tuning.random_search()
random_search(graph, labels, embed_fn, param_distributions, n_iter=20, metric="accuracy")
Random search over embedding hyperparameters.
param_distributions : Dict[str, List] — Parameter name to candidate values
n_iter : int — Number of random combinations to try
Returns: Dict with best_params, best_score, all_results