Background
In the current implementation of GenGO, the ACL paper exploration system, we use a small and rather old sentence encoder, all-MiniLM-L6-v2, to provide the semantic search and paper recommendation features. While this model is quite powerful for its size, I am not quite happy with the search results I get in GenGO, especially for the chat-based extension I am currently building on top of GenGO’s dataset.
To find a better model for my use case, I ran a small benchmark experiment in which I evaluated 12 small models (small enough to run inside a web browser, as is done in GenGO) on a very recent retrieval dataset called LitSearch. In this blog post, I compile some simple results and my thoughts, which will hopefully help me decide on the new model for GenGO.
Setup
Fortunately, the MTEB Python library already implements retrieval evaluation on the LitSearch dataset, so I simply use this library for my evaluation. This is super nice: I do not have to write any code, I just need to run the following command:
```bash
mteb run -m <model-name> \
  -t LitSearchRetrieval \
  --verbosity 3
```
(In the original LitSearch paper, the authors run experiments separately for the two query specificity levels, but the MTEB implementation does not make this distinction, so I use the whole query set. This makes my numbers non-comparable to the tables in the paper.)
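If you prefer to script the evaluation, a rough Python equivalent of the CLI call above looks like this. This is only a minimal sketch: the model name and output folder are just examples, and the API details may differ slightly between mteb versions.

```python
import mteb
from sentence_transformers import SentenceTransformer

# Load the candidate model via sentence-transformers (example model name).
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Fetch the LitSearch retrieval task and run it; results are written as JSON
# files under the output folder.
tasks = mteb.get_tasks(tasks=["LitSearchRetrieval"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")
```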
Model selection
I know empirically that models with fewer than roughly 100M parameters can easily be run inside a web browser using the Transformers.js library (as GenGO does), so I picked the models with fewer than ca. 100M params from the top of the retrieval subset of the MTEB leaderboard.
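As a quick sanity check on the "fits in the browser" criterion, the parameter count can be read directly off a loaded model. A minimal sketch (the model name is just an example):

```python
from sentence_transformers import SentenceTransformer

def param_count_millions(model_name: str) -> float:
    # Sum the number of elements across all parameter tensors.
    model = SentenceTransformer(model_name)
    return sum(p.numel() for p in model.parameters()) / 1e6

# Expect a value around 23 for this one.
print(param_count_millions("Snowflake/snowflake-arctic-embed-xs"))
```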
Results
Here is a table showing the model names, model sizes in parameters (millions), and nDCG@10 scores on LitSearch.
| Model | Params (M) | nDCG@10 |
|---|---|---|
| Snowflake/snowflake-arctic-embed-m-v1.5 | 109 | 0.51715 |
| Snowflake/snowflake-arctic-embed-m | 109 | 0.51237 |
| infgrad/stella-base-en-v2 | 55 | 0.45465 |
| intfloat/e5-small-v2 | 33 | 0.45465 |
| Snowflake/snowflake-arctic-embed-s | 33 | 0.45086 |
| Snowflake/snowflake-arctic-embed-xs | 23 | 0.44751 |
| avsolatorio/GIST-Embedding-v0 | 109 | 0.4452 |
| abhinand/MedEmbed-small-v0.1 | 33 | 0.43317 |
| BAAI/bge-small-en-v1.5 | 33 | 0.42833 |
| thenlper/gte-small | 33 | 0.42357 |
| avsolatorio/GIST-small-Embedding-v0 | 33 | 0.38049 |
| sentence-transformers/all-MiniLM-L6-v2 | 23 | 0.35318 |
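(For reference, here is roughly how the scores can be collected from the JSON files mteb writes. The exact output layout varies between mteb versions, so this sketch simply searches each result file recursively for the metric key.)

```python
import json
from pathlib import Path

def find_ndcg_at_10(obj):
    # Recursively look for an "ndcg_at_10" key anywhere in the parsed JSON.
    if isinstance(obj, dict):
        if "ndcg_at_10" in obj:
            return obj["ndcg_at_10"]
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_ndcg_at_10(child)
        if found is not None:
            return found
    return None

# Assumes mteb wrote one JSON file per task somewhere under ./results.
for path in sorted(Path("results").rglob("LitSearchRetrieval.json")):
    score = find_ndcg_at_10(json.loads(path.read_text()))
    print(f"{path}: {score}")
```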
As expected, the currently used model (all-MiniLM-L6-v2) underperforms the other models by quite a gap. A similarly sized model (snowflake-arctic-embed-xs, 23M) outperforms it by about 9 points, which makes me think I should at least switch to this model, since the inference speed should not be very different. This snowflake-arctic-embed-xs is actually quite strong: it performs on par with bigger models, trailing stella-base-en-v2, which has more than twice the parameters, by less than 1 point.
Another observation is that going over 100M parameters clearly helps: snowflake-arctic-embed-m-v1.5 with 109M params reaches 0.517, a big jump from snowflake-arctic-embed-xs.
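Before actually swapping the model inside GenGO, a quick local sanity check of a candidate is easy to do with sentence-transformers. A minimal sketch (the texts are made up, and note that the arctic-embed model cards recommend a query-side prefix/prompt, so check the card before reusing this as-is):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-xs")

# Toy corpus standing in for paper abstracts.
papers = [
    "We present a benchmark for retrieving scientific literature from natural-language queries.",
    "A survey of sentence embedding models and their applications to semantic search.",
]
query = "retrieval benchmarks for scientific papers"

# Normalized embeddings so cosine similarity is a plain dot product.
paper_emb = model.encode(papers, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_emb, paper_emb)[0]
for paper, score in sorted(zip(papers, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {paper[:70]}")
```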
Arctic model series from Snowflake
I was interested in how Snowflake trained such powerful small models, so I quickly checked their technical report on these models (URL; it is such a nice time we live in, we can read everything for free).
A few highlights from the report:
- The models are not explicitly trained for the scientific domain; web corpora make up a large part of the training data, which should include some in-domain text.
- They balance the domain distribution in the training data for better performance, weighting domains by an "importance" score computed in a preliminary isolated experiment; NLI datasets are excluded from training as a result.
- They use synthetic training data generated by LLMs and note the importance of including negative documents in the prompt. It is unclear which LLM was used for the data generation.
Conclusion
The series of Arctic models from Snowflake is quite impressive; they show strong performance even compared to larger models. snowflake-arctic-embed-s and snowflake-arctic-embed-xs are particularly interesting to me, since switching to either of them should give better search performance without sacrificing inference speed.