More than 197x faster than DeepWalk, ~4x-8x faster than PyTorch-BigGraph by Facebook.
Cleora computes embeddings of your relational data. Entities such as clients, products, stores, accounts, and others can be represented with embeddings, just like Word2Vec or BERT for text or CLIP for images. Cleora embeddings are behavioral - they represent entities by their behavior history, which has the form of large graphs.
Cleora PRO (Enterprise) vs Cleora Open Source
Self-service Cleora PRO is now available for selected customers.
Cleora Open Source is publicly available on GitHub and used by many industry leaders.
Key improvements in Cleora PRO over the open source version:
automatic scaling: no expensive hardware required
ease of use: only 2 columns extracted from your DB are required. Graphs are detected automatically in the data.
performance optimizations: 10x faster embedding times
The task is to predict the existence of edges in the graph. For example, predicting whether a certain product will be bought by a certain customer. Higher score is better.
Embedding speed
Total time of computing the embeddings.
About Cleora
A machine learning tool that enables faster and hyper-easy production of graph embeddings for big graphs.
Cleora embeds entities in n-dimensional spherical spaces using extremely fast, stable, iterative random projections, which allows for unparalleled performance and scalability. Types of data which can be embedded include, for example:
graphs
hypergraphs
text and other categorical array data
any combination of the above
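The iterative, normalized propagation described above can be illustrated with a small pure-Python sketch. This is not Cleora's implementation (Cleora is written in Rust); the function names, the hash-based initialization, and the neighbor-averaging update are assumptions chosen to show the general idea of deterministic start vectors followed by repeated averaging and projection back onto the unit sphere.

```python
import hashlib
import math


def init_vector(entity_id, dim):
    """Deterministic pseudo-random start vector derived from a hash of the ID (assumed scheme)."""
    vec = []
    for d in range(dim):
        digest = hashlib.sha256(f"{entity_id}:{d}".encode()).digest()
        # Map the first 8 bytes of the digest to a float in [-1, 1].
        value = int.from_bytes(digest[:8], "big") / 2**64
        vec.append(2.0 * value - 1.0)
    return vec


def l2_normalize(vec):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def embed(edges, dim=8, iterations=3):
    """Average neighbor vectors and re-normalize, repeated `iterations` times."""
    neighbors = {}
    for a, b in edges:  # treat every edge as undirected
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    vectors = {node: init_vector(node, dim) for node in neighbors}
    for _ in range(iterations):
        updated = {}
        for node, adj in neighbors.items():
            mean = [sum(vectors[n][d] for n in adj) / len(adj) for d in range(dim)]
            updated[node] = l2_normalize(mean)
        vectors = updated
    return vectors
```

Because the start vectors depend only on the entity IDs, calling `embed` twice on the same edge list returns identical vectors, and every resulting vector has unit length.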
How?
Key technical features of Cleora embeddings
The embeddings produced by Cleora differ from those produced by Node2Vec, Word2Vec, DeepWalk, or other systems in this class in a number of key properties:
Efficiency
Cleora is two orders of magnitude faster than Node2Vec or DeepWalk. We’ve embedded graphs with hundreds of billions of edges on a single machine without GPUs. It is likely the fastest approach possible.
Inductivity
As Cleora embeddings of an entity are defined only by interactions with other entities, vectors for new entities can be computed on-the-fly.
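As a rough illustration of inductivity, the sketch below derives a vector for a previously unseen entity from the vectors of the entities it interacted with. The averaging-plus-normalization rule and the name `embed_new_entity` are assumptions made for this example, not a documented Cleora API.

```python
import math


def embed_new_entity(neighbor_ids, known_vectors):
    """Average the vectors of known entities the new entity interacted with, then L2-normalize."""
    seen = [known_vectors[n] for n in neighbor_ids if n in known_vectors]
    if not seen:
        raise ValueError("new entity has no interactions with known entities")
    dim = len(seen[0])
    mean = [sum(v[d] for v in seen) / len(seen) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean


# Hypothetical pre-computed product embeddings.
known = {"product_1": [1.0, 0.0], "product_2": [0.0, 1.0]}
# A new customer who interacted with both products gets a vector without re-training.
new_vector = embed_new_entity(["product_1", "product_2"], known)
```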
Cross-dataset compositionality
Thanks to the stability of Cleora embeddings, embeddings of the same entity from multiple datasets can be combined by averaging, yielding meaningful vectors.
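A minimal sketch of this kind of averaging is shown below. The dataset names, entity IDs, and the helper `combine_across_datasets` are hypothetical; the sketch assumes all embeddings share the same dimensionality.

```python
def combine_across_datasets(embedding_sets):
    """Average each entity's vectors over all datasets in which it appears."""
    sums, counts = {}, {}
    for embeddings in embedding_sets:
        for entity, vec in embeddings.items():
            if entity not in sums:
                sums[entity] = [0.0] * len(vec)
                counts[entity] = 0
            sums[entity] = [s + x for s, x in zip(sums[entity], vec)]
            counts[entity] += 1
    return {e: [s / counts[e] for s in sums[e]] for e in sums}


# Hypothetical embeddings of the same users computed on two separate datasets.
store_a = {"user_1": [1.0, 0.0], "user_2": [0.0, 1.0]}
store_b = {"user_1": [0.0, 1.0]}
combined = combine_across_datasets([store_a, store_b])
```

Entities that appear in only one dataset keep their original vector.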
Stability
All starting vectors for entities are deterministic, which means that Cleora embeddings on similar datasets will end up being similar. Methods like Word2Vec, Node2Vec, or DeepWalk return different results with every run.
Extreme parallelism and performance
Cleora is written in Rust and uses thread-level parallelism for all calculations except input file loading. In practice, this means that the embedding process is often faster than loading the input data.
Dim-wise independence
Thanks to the process that produces Cleora embeddings, every dimension is independent of the others. This property allows for an efficient, low-parameter method of combining multi-view embeddings with Conv1d layers.
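To make the low-parameter claim concrete, the sketch below mimics what a Conv1d layer with kernel size 1 computes when each view is treated as one input channel: the same small set of weights is applied at every dimension. The kernel size, the view names, and the fixed weights are assumptions for illustration; in practice the weights would be learned in a deep learning framework.

```python
def pointwise_conv1d(views, weights, bias=0.0):
    """Combine V views of length D into one length-D vector.

    out[d] = bias + sum over v of weights[v] * views[v][d]
    Only V weights plus one bias are needed, regardless of D.
    """
    if len(views) != len(weights):
        raise ValueError("need exactly one weight per view")
    dim = len(views[0])
    return [bias + sum(w * view[d] for w, view in zip(weights, views)) for d in range(dim)]


# Hypothetical embeddings of one entity computed from two different views of the data.
purchases_view = [0.2, -0.4, 0.6]
clicks_view = [0.4, 0.0, -0.2]
combined = pointwise_conv1d([purchases_view, clicks_view], weights=[0.5, 0.5])
```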
We used Cleora for customer-restaurant graph data in the National Capital Region (NCR) area. And to our delight, the embedding generation was superfast (i.e., <5 minutes). For context, do remember that GraphSAGE took ~20 hours for the same NCR data.
Technical requirements
Data formats supported by Cleora PRO
Cleora supports 2 input file formats:
TSV (tab-separated-values) - preferred
CSV (comma-separated-values)
Option 1: Database export without timestamps
Cleora needs a 2-column extract from your database. The IDs within the columns can (and probably should) be anonymized. Each row expresses an event: the fact that a certain user ID interacted with a certain product ID at a given point in time.
Column 1 will give Cleora an idea about what should constitute a connection in the graph. This will usually be user IDs or session IDs.
Column 2 will be embedded - you will obtain embeddings for each entity which appears in this column. This will usually be product IDs.
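As an illustration of the 2-column layout, the sketch below parses a small, made-up TSV extract and groups the column 2 entities by their column 1 ID, which is one way to see which entities would end up connected. The IDs, the absence of a header row, and the helper `group_by_first_column` are assumptions for this example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical 2-column TSV extract: <user or session ID> TAB <product ID>
tsv_extract = "u1\tp10\nu1\tp11\nu2\tp10\nu2\tp12\nu2\tp11\n"


def group_by_first_column(tsv_text):
    """Group column 2 entities (to be embedded) by the column 1 ID that connects them."""
    groups = defaultdict(list)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row!r}")
        connector_id, entity_id = row
        groups[connector_id].append(entity_id)
    return dict(groups)


groups = group_by_first_column(tsv_extract)
```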
Option 2: Database export with timestamps
Optionally, you can also use a 3-column format. The logic is the same as in the 2-column format, but the 3rd column additionally has a timestamp for each event.
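A 3-column extract could be read as sketched below. The timestamp format is not specified here, so the example assumes ISO 8601 strings; the IDs and the helper `read_events` are likewise hypothetical.

```python
import csv
import io
from datetime import datetime

# Hypothetical 3-column TSV extract: <user ID> TAB <product ID> TAB <timestamp>
tsv_extract = (
    "u1\tp11\t2023-01-05T10:00:00\n"
    "u1\tp10\t2023-01-02T09:30:00\n"
    "u2\tp10\t2023-01-03T12:00:00\n"
)


def read_events(tsv_text):
    """Parse (connector ID, entity ID, timestamp) rows and order them chronologically."""
    events = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row!r}")
        connector_id, entity_id, raw_timestamp = row
        events.append((connector_id, entity_id, datetime.fromisoformat(raw_timestamp)))
    events.sort(key=lambda event: event[2])
    return events


events = read_events(tsv_extract)
```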