Mixedbread

mxbai-embed-2d-large-v1

Explore mxbai-embed-2d-large-v1, the world's first 2D-Matryoshka embedding model. Learn about its innovative approach to reducing model size while maintaining high performance, and discover how to leverage its flexible dimensionality for various NLP tasks and efficient information retrieval.

Model Description

mxbai-embed-2d-large-v1 is the world's first 2D-Matryoshka embedding model. The 2D-Matryoshka approach lets you reduce both the number of layers and the dimensions of the embeddings within the model. This dual reduction strategy yields a more compact model while still delivering performance on par with that of leading models. Specifically, reducing the model's layers by approximately 50% retains up to 85% of its original performance, even without additional training.

The model was pretrained using contrastive training on over 700 million pairs, covering a wide variety of topics across the internet. It was then fine-tuned with over 30 million high-quality triplets using novel loss functions. mxbai-embed-2d-large-v1 allows users to get multiple models out of one and use different embedding sizes, providing full control over the trade-offs between speed, storage consumption, and model performance.

On the Massive Text Embedding Benchmark (MTEB), mxbai-embed-2d-large-v1 performs on par with current embedding models of various sizes. The model's performance remains competitive even when the embedding size is reduced by a factor of 16. Additionally, the model retains about 75% of its performance after cutting half of its layers, demonstrating the effectiveness of the 2D-Matryoshka approach.
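
To make the size trade-off concrete, here is a minimal sketch of reducing an embedding on the client side. It assumes that a smaller embedding is obtained by keeping the leading components and rescaling to unit length, which is the usual convention for Matryoshka-style embeddings but is not spelled out on this page. The helper names truncate_embedding and storage_bytes are hypothetical, and the storage figures assume 4-byte (float32) values.

```python
import math


def truncate_embedding(embedding, dimensions):
    """Keep the first `dimensions` values and rescale to unit length.

    Assumption: leading components carry the most information, as is
    typical for Matryoshka-style embeddings.
    """
    if not 0 < dimensions <= len(embedding):
        raise ValueError("dimensions must be between 1 and len(embedding)")
    head = list(embedding[:dimensions])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalized
    return [x / norm for x in head]


def storage_bytes(dimensions, bytes_per_value=4):
    """Bytes needed per vector, assuming float32 values by default."""
    return dimensions * bytes_per_value


# Toy 8-dimensional vector standing in for a real 1024-dimensional embedding.
full = [0.5, -0.5, 0.5, -0.5, 0.1, 0.2, -0.1, 0.05]
small = truncate_embedding(full, 4)
print(small)

# Reducing 1024 dimensions by a factor of 16 leaves 64 dimensions.
print(storage_bytes(1024), storage_bytes(1024 // 16))  # 4096 256
```

When calling the hosted API, the dimensions parameter shown in the examples further down does this for you; a helper like this is only relevant if you store full-size vectors and shrink them later.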

Layers   Embedding Dimension   Recommended Sequence Length   Language
24       1024                  512                           English

Using a Prompt

Adding a domain-specific prompt to a text can help the model understand how the embedding will be used.

For retrieval tasks, the query can be preceded by the prompt: Represent this sentence for searching relevant passages:. For other tasks, the text can be used as-is without any additional prompt.
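
If you embed text without the SDK's prompt parameter, the rule above can be applied by hand. The following is a rough sketch: the helper name build_query_text is hypothetical, and joining prompt and query with a single space is an assumption, not something this page specifies.

```python
RETRIEVAL_PROMPT = "Represent this sentence for searching relevant passages:"


def build_query_text(query, prompt=RETRIEVAL_PROMPT):
    """Prefix a retrieval query with the prompt.

    Assumption: prompt and query are separated by one space.
    """
    return f"{prompt} {query.strip()}"


# Only the query receives the prompt; documents are embedded as-is.
query_text = build_query_text("A man is eating a piece of bread")
doc_text = "A man is eating pasta."
print(query_text)
```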

Suitable Scoring Methods

  • Cosine Similarity: Ideal for measuring the similarity between text vectors, commonly used in tasks like semantic textual similarity and information retrieval.
  • Euclidean Distance: Useful for measuring dissimilarity between embeddings, especially effective in clustering and outlier detection.
  • Dot Product: Appropriate when embeddings are normalized; used in tasks where alignment of vector orientation is critical.
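
The three scoring methods can be compared side by side with a small, dependency-free sketch. The vectors are toy values rather than real model output, and the function names are illustrative. It also shows why the dot product suits normalized embeddings: for unit-length vectors it equals the cosine similarity, and the squared Euclidean distance equals 2 - 2 * cosine.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(v):
    """Rescale v to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


# Toy 2-dimensional vectors, normalized to unit length.
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

cos = cosine_similarity(a, b)
dp = dot(a, b)
dist = euclidean_distance(a, b)

print(cos, dp, dist)
# For unit vectors: dp matches cos, and dist**2 matches 2 - 2 * cos.
```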

Limitations

  • Language: mxbai-embed-2d-large-v1 is trained on English text and is specifically designed for the English language.
  • Sequence Length: The suggested maximum sequence length is 512 tokens. Longer sequences may be truncated, leading to a loss of information.

Examples

Calculate Sentence Similarities

The following code illustrates how to compute similarities between sentences using the cosine similarity score function. The number of dimensions can be adjusted using the dimensions parameter of the SDK.

from mixedbread_ai.client import MixedbreadAI
from sentence_transformers.util import cos_sim
 
mxbai = MixedbreadAI(api_key="YOUR_API_KEY")
model = "mixedbread-ai/mxbai-embed-2d-large-v1"
 
docs = [
    "A man is eating food.",
    "A man is eating pasta.",
]
 
result = mxbai.embeddings(
    model=model,
    input=docs,
    dimensions=512
)
 
embeddings = [item.embedding for item in result.data]
 
# Calculate cosine similarity
similarity = cos_sim(embeddings[0], embeddings[1])
print(similarity)

Information Retrieval

The following code snippet demonstrates the retrieval of information related to a specific query from a given corpus. The number of dimensions can be adjusted using the dimensions parameter of the SDK.

from mixedbread_ai.client import MixedbreadAI
from sentence_transformers.util import cos_sim
 
mxbai = MixedbreadAI(api_key="YOUR_API_KEY")
model = "mixedbread-ai/mxbai-embed-2d-large-v1"
 
prompt = 'Represent this sentence for searching relevant passages:'
query = "A man is eating a piece of bread"
 
docs = [
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]
 
query_result = mxbai.embeddings(
    model=model,
    prompt=prompt,
    input=[query],
    dimensions=512
)
 
docs_result = mxbai.embeddings(
    model=model,
    input=docs,
    dimensions=512
)
 
query_embedding = query_result.data[0].embedding
docs_embeddings = [item.embedding for item in docs_result.data]
 
# Calculate cosine similarity
similarities = cos_sim(query_embedding, docs_embeddings)
similarity_scores = similarities.squeeze().tolist()
 
# Retrieve documents sorted by similarity
retrieved_docs = sorted(zip(docs, similarity_scores), key=lambda x: x[1], reverse=True)
 
# Print the retrieved documents and their similarity scores
for doc, score in retrieved_docs:
    print(f"Document: {doc}\nSimilarity Score: {score}\n")
