SOLR-18267: Add flat vector index with no HNSW #4492
alessandrobenedetti
left a comment
It's a nice addition, it would be great to have this in and then solve the limitations!
 * @lucene.spi {@value #NAME}
 * @since 10.1
 */
public final class SolrFlatVectorFormat extends KnnVectorsFormat {
I went back-and-forth about the naming of this class, because it is used as the name of segment files. For example: _0_SolrFlatVectorFormat_0.vec and _0_SolrFlatVectorFormat_0.vemf. Since the name is baked into the index, versioning it might make it easier to evolve in the future.
An alternative would be to add a version in the name like Solr101FlatVectorFormat (indicating it was introduced in Solr 10.1), or a similar approach.
@alessandrobenedetti do you have a strong opinion about either approach?
mmmm I never liked the versioning in the name of the class, to be honest, let's see if anybody has any suggestions and let's take it from there!
I would name it something that indicates the Lucene version and codec. There doesn't seem to be any consistent convention here. Some codecs don't embed versions in their file names, but some do. When they do, it is typically the Lucene version that is embedded. I don't see a huge downside to having the codec/Lucene version displayed more prominently when inspecting index files. For reference, the vector codecs do seem to have more descriptive file names, so I am leaning towards that naming pattern for consistency:
*_Lucene99HnswVectorsFormat_0.vec
*_Lucene99HnswVectorsFormat_0.vem
I see now why we need this wrapper. But is it really a "Solr Flat Vector" format? I feel it is a bit of a stretch to call it that as the implementation is entirely Lucene and this is just to work around exposing it as an SPI. I suppose you can change the Lucene implementation under the hood without changing this "Solr format" but then you lose the benefit of the naming which is to immediately know that you have, say, lucene flat vector files from two different versions just by looking at the index files.
Edit: I don't think you can actually hot swap another Lucene implementation here, otherwise you won't know how to read the old index files if you ever upgrade, right? This makes me think versioning is the way to go. Otherwise the unversioned SolrVectorFormat will forever be tied to the Lucene99 format, any later iteration will be versioned, and that will be confusing to look at. So we should version it from the start.
I agree that having a version from the start is a good idea given the potential for this to evolve in the future. The question is whether it should be the Solr version (e.g. Solr101FlatVectorFormat for Solr 10.1) or the Lucene version (e.g. SolrLucene99FlatVectorsFormat).
The class is currently a very thin wrapper that delegates to the Lucene version. However, I could see the need to extend functionality to support the KNN query parser search logic. It looks like Elastic had a similar discussion in this thread, so there's precedent there.
I've updated the PR to use Solr101FlatVectorFormat but am open to changing it.
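To make the versioning argument concrete, here is a minimal, self-contained sketch of name-based format resolution. It is not Lucene or Solr code: `FormatNameSketch`, `VectorsFormat`, `wrapper`, `register`, `forName`, and the layout strings are hypothetical stand-ins, and the real lookup goes through Lucene's SPI. It only illustrates that the name written into segment file names is also the key used to find a reader later, so a given name has to keep meaning one on-disk layout.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: models how a format name stored with a segment is resolved at read time. */
final class FormatNameSketch {

  /** Hypothetical stand-in for a vectors format; this is NOT Lucene's KnnVectorsFormat. */
  interface VectorsFormat {
    String name();         // name that ends up in segment file names
    String onDiskLayout(); // layout that a reader for this format expects
  }

  /** Thin wrapper: registers under its own name and delegates the layout to a fixed implementation. */
  static VectorsFormat wrapper(final String name, final String delegateLayout) {
    return new VectorsFormat() {
      @Override public String name() { return name; }
      @Override public String onDiskLayout() { return delegateLayout; }
    };
  }

  private final Map<String, VectorsFormat> registry = new HashMap<>();

  /** Makes a format discoverable by name (a simplified stand-in for SPI registration). */
  void register(VectorsFormat format) {
    registry.put(format.name(), format);
  }

  /** Builds a file name shaped like the ones quoted in this thread, e.g. _0_<format>_0.vec. */
  static String segmentFileName(String segment, String formatName, int suffix, String ext) {
    return segment + "_" + formatName + "_" + suffix + "." + ext;
  }

  /** Read-time lookup: fails if no format is registered under the recorded name. */
  VectorsFormat forName(String name) {
    VectorsFormat format = registry.get(name);
    if (format == null) {
      throw new IllegalArgumentException("No format registered under name: " + name);
    }
    return format;
  }

  public static void main(String[] args) {
    FormatNameSketch formats = new FormatNameSketch();
    // Assumption for the example: the versioned name maps to the Lucene99 flat layout.
    formats.register(wrapper("Solr101FlatVectorFormat", "Lucene99 flat"));
    System.out.println(segmentFileName("_0", "Solr101FlatVectorFormat", 0, "vec"));
    System.out.println(formats.forName("Solr101FlatVectorFormat").onDiskLayout());
  }
}
```

Under these assumptions, a segment written under one name can only be read while that exact name stays registered with a compatible layout, which is the argument for putting a version in the name from the start.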
https://issues.apache.org/jira/browse/SOLR-18267
Description
There are certain use cases, such as highly selective filters on large datasets, where it can be more efficient to perform a brute-force KNN search as a post-filter, instead of during ANN search.
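For illustration, the brute-force path described above amounts to scoring only the documents that pass the filter and keeping the best k. The sketch below is plain Java, not Solr or Lucene code; `ExactKnnSketch`, `topK`, and the in-memory `float[][]` layout are assumptions made for the example, and cosine similarity is just one possible similarity function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Illustrative only: exact (brute-force) top-k search over the documents that pass a filter. */
final class ExactKnnSketch {

  /** Cosine similarity of two equal-length vectors; assumes neither vector is all zeros. */
  static double cosine(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  /** Returns the ids of the k most similar documents among those accepted by the filter. */
  static List<Integer> topK(float[][] vectors, IntPredicate filter, float[] query, int k) {
    List<Integer> candidates = new ArrayList<>();
    double[] scores = new double[vectors.length];
    for (int doc = 0; doc < vectors.length; doc++) {
      if (filter.test(doc)) { // a selective filter keeps this loop short
        scores[doc] = cosine(vectors[doc], query);
        candidates.add(doc);
      }
    }
    // Highest similarity first.
    candidates.sort((x, y) -> Double.compare(scores[y], scores[x]));
    return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
  }

  public static void main(String[] args) {
    float[][] vectors = {{1f, 0f}, {0f, 1f}, {1f, 1f}, {-1f, 0f}};
    // The filter excludes document 0; the remaining three documents are scored exactly.
    System.out.println(topK(vectors, doc -> doc != 0, new float[] {1f, 0f}, 2)); // prints [2, 1]
  }
}
```

The cost is linear in the number of documents that pass the filter, and no graph is consulted at any point, which is why a highly selective filter makes this approach attractive.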
Solr currently supports this use case with the vectorSimilarity function and an fq, but still requires an HNSW graph to be built during indexing when using DenseVectorField, even if it's not used during search. The goal of this feature is to avoid paying the cost of HNSW graph construction and rebuilding during ingestion when ANN search isn't used.
Solution
This PR introduces a new knnAlgorithm=flat option to DenseVectorField that uses Lucene99FlatVectorsFormat. This stores vectors in the index (.vec/.vemf files) without building the HNSW graph (.vex/.vem files).
Lucene99FlatVectorsFormat is not registered in Lucene's SPI, so this PR includes a wrapper class SolrFlatVectorFormat that delegates to Lucene99FlatVectorsFormat as a workaround. There are examples in other Lucene-based engines using a similar pattern to provide a flat vector format for exact KNN search that wraps Lucene99FlatVectorsFormat.
Limitations
This PR currently doesn't support:
- knnAlgorithm=flat for quantized variants
- knn, knn_text_to_vector and vectorSimilarity query parsers. Only the vectorSimilarity function query is initially supported.
Both features could be shipped as follow-ups.
AI Disclosure: Claude was used to assist with this PR. All code has been reviewed and tested by me.
Tests
Unit tests for Dense Vector Fields and quantized variants.
Checklist
Please review the following and check all that apply:
main branch.
./gradlew check.