Fresh 2D-Matryoshka Embedding Model

Reading Time: 8 min read

Publish Date: March 4, 2024

Authors

  • Sean Lee

  • Aamir Shakir

  • Julius Lipp

  • Darius Koenig

We are excited to release the world's first 2D-🪆 embedding model. Like our previous release, it comes with an Apache 2.0 license and is available on Hugging Face.

Read on to learn more about our approach and to check out our benchmarks. If you want to skip straight to the model instead, you can find it on Hugging Face.

Why Embeddings?

A significant hurdle for modern generative models is their inability to directly interact with specific organizational data. Consider a scenario where your task is to generate a report on recent market trends based on internal research documents. Traditional generative models fall short here as they don't have access to or understanding of your internal documents, making it impossible for them to generate the required report.

To address this challenge, the Retrieval-Augmented Generation (RAG) technique offers a solution. Imagine you have a repository of internal research on market trends. This repository can be processed through an embedding model to convert the documents into a searchable format within a vector database. When you need a report on market trends, the embedding model can locate and fetch the most relevant documents. These documents can then inform a generative model, enabling it to produce a detailed report based on your specific data.
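To make this flow concrete, here is a minimal retrieval sketch. The document snippets and query are made up for illustration, the model is the one introduced later in this post, and a production setup would store the embeddings in a vector database instead of comparing them in memory.

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import cos_sim

    model = SentenceTransformer("mixedbread-ai/mxbai-embed-2d-large-v1")

    # A tiny stand-in for an internal research repository (hypothetical snippets).
    documents = [
        "Q1 internal research: demand for sourdough starters grew 12%.",
        "HR policy update: remote work guidelines for 2024.",
    ]
    query = "What are the recent market trends?"

    doc_emb = model.encode(documents)   # embed the repository once
    query_emb = model.encode(query)     # embed the incoming question

    # Rank documents by cosine similarity; the top hits would then be handed
    # to a generative model as context for the report.
    scores = cos_sim(query_emb, doc_emb)[0]
    print(documents[int(scores.argmax())])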

What Is a Matryoshka Model? And 2D-🪆?

Dense embedding models typically produce embeddings with a fixed size, such as 768 or 1024 dimensions. All further computations (clustering, classification, semantic search, retrieval, reranking, etc.) must then be done on these full embeddings.

Matryoshka Representation Learning revisits this idea and proposes a way to train embedding models whose embeddings remain useful after dimensionality reduction -- truncation to much smaller sizes -- as shown in the figure below. This allows for considerably faster (bulk) processing and storage savings while maintaining most of the performance. However, the impact on inference speed and memory footprint is small, because the model still runs through all of its layers.
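As a rough sketch of what such a truncation looks like in practice (assuming our model introduced later in this post and a target size of 64 dimensions), you keep only the first dimensions of each embedding and re-normalize before comparing:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("mixedbread-ai/mxbai-embed-2d-large-v1")
    emb = model.encode(["Who is german and likes bread?", "Everybody in Germany."])

    # Matryoshka-style truncation: keep only the first 64 dimensions ...
    small = emb[:, :64]
    # ... and re-normalize so cosine similarity still behaves as expected.
    small = small / np.linalg.norm(small, axis=1, keepdims=True)

    print(float(small[0] @ small[1]))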

2D-🪆 takes this idea further and proposes chunkable layers (see the second part of the figure). Here, the hidden layers are also trained to generate high-quality embeddings without the higher layers. As a result, layers can be chunked off the model without losing too much performance in the embedding generation process. This allows a user to train one large model and get multiple smaller models out of it. The first dimension of 2D-🪆, which chunks layers, allows faster inference and a lower memory footprint; the second dimension, which chunks the embeddings, allows faster retrieval while using less storage capacity.

Visualization of the difference between regular and 2D-🪆

Introducing the First 2D-🪆 Embedding Model

We are excited to announce the (to our knowledge) first embedding model that supports 2D-🪆. The model is based on our embedding model, which we will also release soon (🤫). It was pretrained using contrastive training on over 700 million pairs covering a huge variety of topics from across the internet, and then finetuned on over 30 million high-quality triplets using novel loss functions. The model lets you get multiple models out of one and use different embedding sizes, giving you full control over the tradeoff between speed, storage consumption, and model performance.

Using It in Action

Our model is extremely easy to use with your existing search stack: you replace the first-stage retrieval with our model, and you're ready to go. You have two options: use the model offline by hosting it yourself, or online through our (upcoming) API.

To get started, install the necessary packages:

pip install -U mixedbread-ai sentence-transformers

Here is a quick example: given two sentences, we want to compute their similarity. We can modify the number of layers (depth) with new_num_layers and the dimensionality of the embeddings with new_embedding_size; the API example below sets the embedding size via the dimensions parameter.

    from mixedbread_ai.client import MixedbreadAI
    from sentence_transformers.util import cos_sim

    # Initialize the client with your API key
    mxbai = MixedbreadAI(api_key="YOUR_API_KEY")

    # Embed both sentences with the 2D-Matryoshka model, truncated to 768 dimensions
    result = mxbai.embeddings(
        model="mixedbread-ai/mxbai-embed-2d-large-v1",
        input=[
            'Who is german and likes bread?',
            'Everybody in Germany.'
        ],
        dimensions=768
    )

    # Similarity between the two sentences
    similarities = cos_sim(result.data[0].embedding, result.data[1].embedding)

    print('similarities:', similarities)

This yields a similarity score of 0.7342.
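If you prefer the offline route, here is a hedged sketch of what self-hosting could look like with sentence-transformers. The manual layer pruning below assumes a BERT-style backbone; the new_num_layers and new_embedding_size parameters mentioned above wrap these steps more conveniently.

    import numpy as np
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import cos_sim

    model = SentenceTransformer("mixedbread-ai/mxbai-embed-2d-large-v1")

    # Dimension 1: keep only the first 16 of the 24 transformer layers
    # (assumes a BERT-style encoder exposing encoder.layer as a ModuleList).
    backbone = model[0].auto_model
    backbone.encoder.layer = backbone.encoder.layer[:16]

    emb = model.encode(['Who is german and likes bread?', 'Everybody in Germany.'])

    # Dimension 2: truncate the embeddings to 512 dimensions and re-normalize.
    emb = emb[:, :512]
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    print('similarities:', cos_sim(emb[0], emb[1]))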

Performance

Our first iteration of the model yields performance that is competitive with models that support Matryoshka only for the embeddings. We recognise that the performance may currently lag behind some larger models, but this is a big first step toward bringing 2D-🪆 to real-world use cases. We are working hard to match the overall performance of state-of-the-art embedding models.

MTEB: Massive Text Embedding Benchmark

MTEB is a large text embedding benchmark that measures embedding models across seven tasks: classification, clustering, pair classification, reranking, retrieval, STS (semantic textual similarity), and summarization. It includes 56 datasets from various domains and with various text lengths.

Unfortunately, many new models have started overfitting to the MTEB tasks. Some even train on the respective test sets (i.e., they show the model the correct answers for the test set, which is basically cheating) or generate synthetic data for those datasets. For our training, we excluded any potential overlap with the test sets and used no data that appears in MTEB, not even the training sets (except MS MARCO) -- the reported performances are zero-shot!

As the results show, 2D-🪆 in its base form performs on par with current embedding models of various sizes on MTEB. Next, we investigate the model's performance on various tasks, factoring in its downsizing abilities.

Matryoshka on the Embeddings Layer

First, we investigate the model's performance under dimensionality reduction of the embeddings only -- essentially like a 'traditional' Matryoshka model. We have not yet managed to evaluate MTEB for every embedding size and are working on providing the full results; for now, we report performance on the STS tasks and SciFact. We only include a comparison to nomic-embed-text-v1.5 and text-embedding-3-large, since the reported performance of text-embedding-3-small only covers the setting without Matryoshka.

STS (whole subset of MTEB)

Model performance for different embedding sizes against the STS (whole subset of MTEB) benchmark

Clearly, the 2D-🪆 embedding model can perform the task with similar or even higher performance compared to currently available models. Even when the embedding size is reduced by a factor of 16, the model remains competitive.

SciFact

Model                            native   512              256              128              64
text-embedding-3-large           77.77    --               73.1 (-6.1%)     --               --
mxbai-embed-2d-large-v1 (ours)   74.11    71.41 (-3.6%)    68.74 (-7.2%)    67.92 (-8.4%)    63.75 (-14.0%)
nomic-embed-text-v1.5            70.28    70.12 (-0.2%)    68.24 (-2.9%)    64.28 (-8.5%)    52.71 (-25.0%)

Again, the model performs the task at a level comparable to the available Matryoshka models. While our model loses more performance at the milder dimensionality reductions, it holds up comparatively much better under the strongest size reductions.

TREC-COVID

Model                            native   512              256              128              64
text-embedding-3-large           79.59    --               76.24 (-4.3%)    --               --
mxbai-embed-2d-large-v1 (ours)   68.64    69.67 (+1.5%)    69.90 (+1.8%)    65.27 (-4.9%)    59.81 (-12.9%)
nomic-embed-text-v1.5            82.30    82.12 (-0.2%)    80.65 (-2.0%)    74.58 (-9.4%)    67.83 (-17.6%)

On this task, our model exhibited some curious behaviour. While its performance remained a bit behind that of the other Matryoshka models, we observed an interesting increase in performance for the milder size reductions and comparatively slightly more stable performance towards the stronger size reductions.

2D-Matryoshka

Now, we investigate the model's performance taking full advantage of the 2D-🪆 principle. Essentially, we repeatedly evaluate the model (using the 'classic' Matryoshka functionality), cutting one layer from the model at each step. We start with the full 24-layer model and reduce it step by step to 13 layers, discarding almost 50% of the model. Since we have already compared our model against other Matryoshka models and this 2D downsizing process has (to our knowledge) not been done before, there is no comparison to other models in this section. We are happy to investigate what our model is capable of and excited to see where we can take the technology going forward. Because this process is very compute-intensive and we are working with quite limited resources, we only evaluated our model on two tasks, SciFact and STS. In this post, we show a selection of results for the base model, a reduction by one third of the layers, and half of the base model; the full list of results is available on our website.
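For readers who want to reproduce this kind of sweep, here is a rough sketch of the loop using the open-source mteb package; the manual layer pruning again assumes a BERT-style backbone, and our actual evaluation setup may differ in its details.

    from copy import deepcopy

    from mteb import MTEB
    from sentence_transformers import SentenceTransformer

    base = SentenceTransformer("mixedbread-ai/mxbai-embed-2d-large-v1")

    # Evaluate SciFact for every layer count from the full 24 down to 13.
    for num_layers in range(24, 12, -1):
        model = deepcopy(base)
        model[0].auto_model.encoder.layer = model[0].auto_model.encoder.layer[:num_layers]
        MTEB(tasks=["SciFact"]).run(model, output_folder=f"results/layers_{num_layers}")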

SciFact

Model performance for 24, 16, and 13 layers and different embedding sizes against the SciFact benchmark

For SciFact, we see a reduction of about a quarter of total performance relative to the base model when cutting the model down to 13 layers. In other words, even after cutting half of the model, we still retain about 75% of the performance. The decline observed when reducing the embedding size is consistent with the results in the Matryoshka section above.

STS (whole subset of MTEB)

Model performance for 24, 16, and 13 layers and different embedding sizes against the STS (whole subset of MTEB) benchmark

For STS, the model's performance after downsizing is even more promising. We still observe more than 85% of the original performance even after half of the model has been discarded. The results in combination with embedding downsizing are particularly interesting, as the performance decrease after a factor-8 dimensionality reduction is only slightly above 1%. In effect, even the combination of a 50% reduction in model size and a factor-8 reduction in embedding size still retains about 85% of the performance.

Give Us Feedback

This is the first model of its kind, and we welcome any feedback to make our models better and refine their user-friendliness or capabilities. Please let us know if you’re hungry for any new features or have encountered any issues. We value your feedback!

Please share your feedback and thoughts with us. We are here to help and always happy to chat about the exciting field of machine learning!