Background

In the current implementation of GenGO, the ACL paper exploration system, we use a small and rather old sentence encoder, all-MiniLM-L6-v2, to provide the semantic search and paper recommendation features. While this model is quite powerful for its size, I am not entirely happy with the search results I get in GenGO, especially for the chat-based extension I am currently building on top of GenGO's dataset.

To find a better model for my use case, I ran a small benchmark experiment in which I evaluated 12 small models (small enough to run within a web browser, as is done in GenGO) on a very recent retrieval dataset called LitSearch. In this blog post, I compile some simple results and my thoughts, hopefully to help me decide on a new model for GenGO.

Setup

Fortunately, the MTEB Python library already implements retrieval evaluation on the LitSearch dataset, so I simply use this library for my evaluation. This is super nice: I do not have to write any code, I just need to run the following command,

mteb run -m <model-name> \
  -t LitSearchRetrieval \
  --verbosity 3
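
If you prefer the Python API over the CLI, the same evaluation can be run roughly like this (a minimal sketch, assuming a recent version of the mteb library together with sentence-transformers; the model name is just a placeholder),

import mteb
from sentence_transformers import SentenceTransformer

# Load the model to evaluate; all-MiniLM-L6-v2 is only a placeholder here.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Select the LitSearch retrieval task and run it, writing scores to ./results.
tasks = mteb.get_tasks(tasks=["LitSearchRetrieval"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")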

(In the original LitSearch paper, the authors run experiments separately for the two query specificity levels, but since the MTEB implementation does not make this distinction, I just use the whole dataset, which makes my numbers non-comparable to the tables in the paper.)

Model selection

I know empirically that models with fewer than 100M parameters can easily be run within a web browser using the Transformers.js library, so I picked the models with roughly 100M params or fewer from the top of the retrieval subset of the MTEB leaderboard.
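
For reference, a quick way to check whether a candidate fits that budget is to simply count its parameters (a small sketch using Hugging Face transformers; the model name is just an example),

from transformers import AutoModel

# Example: count the parameters of a candidate encoder to check that it
# stays under the ~100M budget for in-browser use.
model = AutoModel.from_pretrained("Snowflake/snowflake-arctic-embed-xs")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")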

Results

Here is a table showing the model names, model sizes in millions of parameters, and nDCG@10 scores.

Model                                      Params (M)  nDCG@10
Snowflake/snowflake-arctic-embed-m-v1.5    109         0.51715
Snowflake/snowflake-arctic-embed-m         109         0.51237
infgrad/stella-base-en-v2                  55          0.45465
intfloat/e5-small-v2                       33          0.45465
Snowflake/snowflake-arctic-embed-s         33          0.45086
Snowflake/snowflake-arctic-embed-xs        23          0.44751
avsolatorio/GIST-Embedding-v0              109         0.4452
abhinand/MedEmbed-small-v0.1               33          0.43317
BAAI/bge-small-en-v1.5                     33          0.42833
thenlper/gte-small                         33          0.42357
avsolatorio/GIST-small-Embedding-v0        33          0.38049
sentence-transformers/all-MiniLM-L6-v2     23          0.35318

As expected, the currently used model (all-MiniLM-L6-v2) underperforms the other models by quite some gap. A similarly sized model (snowflake-arctic-embed-xs, 23M) outperforms it by roughly 9 points, which makes me think I should at least switch to this model, since the inference speed shouldn't be very different. In fact, snowflake-arctic-embed-xs performs remarkably well for its size: it is on par with bigger models, and there is less than a 1-point gap to stella-base-en-v2, which has more than double the parameters.

Another observation is that going over 100M parameters clearly brings improvements: snowflake-arctic-embed-m-v1.5, with 109M params, reaches an nDCG@10 of 0.517, a big jump from snowflake-arctic-embed-xs.

Arctic model series from Snowflake

I was interested in how Snowflake trained such powerful small models, so I quickly checked their technical report on these models (URL; it's such a nice time we live in, we can read everything for free).

Here are a few highlights,

  • The models are not explicitly trained for the scientific domain, but web corpora make up a large part of the training data and should contain some in-domain material.
  • They balance the domain distribution of the training data for better performance, weighting domains by an “importance” computed in a preliminary, isolated experiment. NLI datasets are excluded from training because of this.
  • Synthetic training data is generated by LLMs. They note the importance of including negative documents in the prompt. It is unclear which LLM is used for the data generation.

Conclusion

The Arctic model series from Snowflake is quite impressive: these models show strong performance even when compared to larger ones. snowflake-arctic-embed-s and snowflake-arctic-embed-xs are particularly interesting to me, since by switching the current model to one of these, I can enjoy better search performance without sacrificing inference speed.
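
Before wiring a new model into GenGO (which runs the encoder in the browser via Transformers.js), a quick sanity check on the Python side could look like the sketch below. The paper titles and query are made-up examples, and note that some retrieval models expect a specific query prompt or prefix, so the model card should be checked first,

from sentence_transformers import SentenceTransformer, util

# Candidate model from the benchmark above.
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-xs")

# Made-up example documents and query, just to eyeball the rankings.
papers = [
    "A Survey of Retrieval-Augmented Generation for NLP",
    "Sentence Embeddings for Scientific Document Search",
    "Neural Machine Translation with Attention",
]
query = "papers about retrieval augmented generation"

# Some retrieval models recommend a query prompt/prefix; see the model card.
paper_emb = model.encode(papers, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

print(util.cos_sim(query_emb, paper_emb))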