
ColBERTus Maximus - Introducing mxbai-colbert-large-v1

We are excited to announce our first ColBERT model, pushing the space one step forward! It comes with an Apache 2.0 license and is available on Hugging Face.

Read on to learn more about our approach and to check out our benchmarks. If you want to skip right to the model, you can find it on the Hugging Face Hub.

What is ColBERT?

In real-world use cases, search is extremely complex: different domains, languages, varying text lengths, and many more hurdles have to be dealt with. We try to overcome this challenge with smart embedding models, which take text as input and produce a fixed-size vector.

The typical search approach uses the same model to encode both documents and queries. We then choose a metric, such as cosine similarity, to measure the distance between the query and the documents. However, there is an issue with this approach: the model has to determine the optimal placement within the latent space so that queries and relevant documents end up close together, yet there is no interaction between query and document within the model.
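To make this concrete, here is a minimal NumPy sketch of that scoring step, with random vectors standing in for the outputs of a real embedding model:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product of the two vectors, divided by the product of their norms.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for real embeddings: one fixed-size vector per document.
rng = np.random.default_rng(42)
doc_vectors = rng.normal(size=(1000, 1024))  # 1,000 documents, 1024-dim vectors
query_vector = rng.normal(size=1024)

# Queries and documents are embedded independently, so document vectors can
# be computed once and indexed offline; scoring is a cheap vector comparison.
scores = np.array([cosine_similarity(query_vector, d) for d in doc_vectors])
top_10 = np.argsort(scores)[::-1][:10]
print(top_10)
```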

On the other hand, we have models like cross-encoders, where query and document are fed to the model together, improving search accuracy. Unfortunately, cross-encoders are extremely compute-heavy, since we need to pass every possible query-document combination through the model. These models are therefore not suitable for large-scale search and are mostly used for reranking.

Similarity scoring process of query and document in a ColBERT model

ColBERT stands for contextualized late interaction over BERT, and it combines the strengths of vector search and cross-encoders. In ColBERT, queries and documents are first encoded separately. However, instead of creating a single embedding for the entire document, ColBERT generates contextualized embeddings for each token in the document. To search, the token-level query embeddings are compared with the token-level document embeddings using the lightweight scoring function MaxSim. This allows ColBERT to capture more nuanced matching signals while remaining computationally efficient. The resulting scores are then used to rank the documents by their relevance to the query.
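To make the scoring step concrete, here is a minimal NumPy sketch of MaxSim. The function and the random toy embeddings are ours for illustration; in practice, the token embeddings come from the model itself:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between one query and one document.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized token embeddings
    """
    # Cosine similarity between every query token and every document token.
    sim = query_emb @ doc_emb.T  # shape: (num_query_tokens, num_doc_tokens)
    # Each query token keeps only its best-matching document token; the
    # document's relevance score is the sum of those per-token maxima.
    return float(sim.max(axis=1).sum())

# Toy usage: rank three "documents" of different lengths against one "query".
rng = np.random.default_rng(0)
l2_normalize = lambda m: m / np.linalg.norm(m, axis=1, keepdims=True)
query = l2_normalize(rng.normal(size=(5, 128)))  # 5 query tokens, 128 dims
docs = [l2_normalize(rng.normal(size=(n, 128))) for n in (12, 40, 7)]
scores = [maxsim_score(query, d) for d in docs]
ranking = sorted(range(len(docs)), key=scores.__getitem__, reverse=True)
print(scores, ranking)
```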

Introducing mxbai-colbert-large-v1

Last week we released our powerful embedding model, mxbai-embed-large-v1, on which our ColBERT model is based. Our ColBERT model therefore inherits all the benefits of the embedding model: it has seen a huge amount of diverse data from all kinds of domains. It is also easy to use, with no fluff like trusting remote code required.

As of March 2024, our model achieves state-of-the-art reranking performance among ColBERT models on the 13 publicly available BEIR benchmarks.

Using It in Action

We recommend using our model with the RAGatouille framework. To get started, let's install the library:
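```bash
pip install ragatouille
```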

Now, let's see how to use our model with RAGatouille:
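Here is a minimal reranking sketch; the query and candidate documents are toy examples of ours:

```python
from ragatouille import RAGPretrainedModel

# Load the ColBERT model directly from the Hugging Face Hub.
RAG = RAGPretrainedModel.from_pretrained("mixedbread-ai/mxbai-colbert-large-v1")

# A toy query and a few candidate documents to rerank.
query = "What is late interaction in neural search?"
documents = [
    "ColBERT compares token-level query and document embeddings with MaxSim.",
    "Croissants are made by laminating butter into a yeasted dough.",
    "Cross-encoders process query and document together in a single pass.",
]

# Rerank the candidates against the query and print them in order.
results = RAG.rerank(query=query, documents=documents, k=3)
for result in results:
    print(result)
```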

The results come back as a ranked list, with each entry holding the document text, its relevance score, and its rank.

Built for RAG and Reranking

While a lot of models use ready-made datasets, which are often outdated and quite far removed from real-world use cases, we spent a lot of time building our own datasets. We scraped a large part of the internet, cleaned the data, and used it to construct our training dataset.

We initialized our ColBERT model from our embedding model, mxbai-embed-large-v1, which was trained on over 700 million samples from various domains. We then adapted the embedding model to the late interaction mechanism using around 96 million samples. This allows our ColBERT model to be used for a wide range of tasks and domains.

Model Evaluation with BEIR for Out-of-Domain Information Retrieval

BEIR is a benchmark focused on out-of-domain information retrieval. We benchmark our ColBERT model in two different settings: reranking and retrieval.

Unfortunately, many recently published models were trained on the BEIR training sets and frequently even on the actual test sets (i.e., telling the model the correct answers for the test set, which is basically cheating). For our training, we excluded any potential overlap with the test sets by removing candidate samples that might appear in them from our training data. This ensures that our model is evaluated on unseen data and that the results are reliable.

Reranking

Since reranking is currently the most significant setting for the use of ColBERT models, we focused on benchmarking our model against other currently available ColBERT options on all 13 publicly available BEIR tasks.

Specifically, we evaluated the models using the NDCG@10 metric, which scores a model's ranking of search results against their actual relevance, placing higher weight on results nearer the top of the list.
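For reference, one common formulation, assuming a graded relevance label \(rel_i\) for the result at rank \(i\), with IDCG@10 denoting the DCG@10 of the ideal ordering:

```latex
\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}}
```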

Reranking performance in NDCG@10:

| Dataset | ColBERTv2 | Jina-ColBERT-v1 | mxbai-colbert-large-v1 |
| --- | --- | --- | --- |
| ArguAna | 29.99 | 33.42 | 33.11 |
| ClimateFEVER | 16.51 | 20.66 | 20.85 |
| DBPedia | 31.80 | 42.16 | 40.61 |
| FEVER | 65.13 | 81.07 | 80.75 |
| FiQA | 23.61 | 35.60 | 35.86 |
| HotPotQA | 63.30 | 68.84 | 67.62 |
| NFCorpus | 33.75 | 36.69 | 36.37 |
| NQ | 30.55 | 51.27 | 51.43 |
| Quora | 78.86 | 85.18 | 86.95 |
| SCIDOCS | 14.90 | 15.39 | 16.98 |
| SciFact | 67.89 | 70.20 | 71.48 |
| TREC-COVID | 59.47 | 75.00 | 81.04 |
| Webis-touché2020 | 44.22 | 32.12 | 31.70 |
| Average | 43.08 | 49.82 | 50.37 |

mxbai-colbert-large-v1 outperforms the other models on average as well as head-to-head in most of the tasks. Curiously, its exceptionally high score on TREC-COVID even beats typical scores for cross-encoder-based reranker models on that benchmark, despite the ColBERT architecture's lower resource use.

Retrieval

As mentioned, ColBERT is currently mainly used for reranking. However, since more and more people are starting to use ColBERT for retrieval tasks as well, we also tested our model's performance on retrieval tasks on a subset of the BEIR benchmarks.

Due to resource limitations, we have so far only been able to test our model on three BEIR tasks, with NDCG@10 serving as the main metric. We aim to complete testing on the full set of tasks and will provide the full results as soon as possible.

Retrieval performance in NDCG@10:

| Dataset | ColBERTv2 | Jina-ColBERT-v1 | mxbai-colbert-large-v1 |
| --- | --- | --- | --- |
| NFCorpus | 33.7 | 33.8 | 36.5 |
| SciFact | 68.9 | 70.1 | 71.3 |
| TREC-COVID | 72.6 | 75.0 | 80.5 |

The rest of the results will be added soon. We're on it!

On this small subset, our model exhibits state-of-the-art retrieval performance compared to other currently available ColBERT models. However, while our ColBERT model performs well on retrieval, we still recommend using our embedding model, mxbai-embed-large-v1, in this setting.

Give Us Feedback

This is our first ColBERT model, and we greatly welcome any feedback that helps us make our models better, refine their user-friendliness, or improve their capabilities. Please let us know if you’re hungry for any new features or have encountered any issues. We value your feedback!

Please share your feedback and thoughts with us through our community channels. We are here to help and are always happy to chat about the exciting field of machine learning!