SOLR-18267: Add flat vector index with no HNSW #4492

Open
adamjq wants to merge 8 commits into apache:main from adamjq:SOLR-18267-add-flat-vector-index

adamjq commented Jun 1, 2026

Contributor

https://issues.apache.org/jira/browse/SOLR-18267

Description

There are certain use cases, such as highly selective filters on large datasets, where it can be more efficient to perform a brute-force KNN search as a post-filter, instead of during ANN search.
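To make the post-filter idea concrete, here is a minimal, self-contained sketch of exact KNN over an already-filtered candidate set. It is not Solr's or Lucene's implementation: the class `BruteForceKnn`, the `Hit` type, and the in-memory `float[][]` layout are hypothetical names chosen for illustration, and cosine similarity is assumed as the scoring function.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration: exact top-k search restricted to filtered doc ids. */
class BruteForceKnn {

    /** A scored hit: document id plus its similarity to the query. */
    static final class Hit {
        final int docId;
        final double score;

        Hit(int docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Cosine similarity between two vectors of equal length. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Scores only the documents that passed the filter and keeps the k best.
     * Cost grows with the number of filtered documents, not with the index size,
     * which is why a highly selective filter makes this approach attractive.
     */
    static List<Hit> topK(float[][] vectors, int[] filteredDocIds, float[] query, int k) {
        // Min-heap on score: the weakest of the current top-k sits at the head.
        PriorityQueue<Hit> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (int docId : filteredDocIds) {
            heap.add(new Hit(docId, cosine(vectors[docId], query)));
            if (heap.size() > k) {
                heap.poll(); // evict the weakest candidate
            }
        }
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits;
    }
}
```

No graph structure is consulted anywhere in this loop, which is the reason an HNSW graph built at index time is wasted work for this access pattern.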

Solr currently supports this use case with the vectorSimilarity function and an fq, but DenseVectorField still requires an HNSW graph to be built during indexing, even if the graph is never used during search. The goal of this feature is to avoid paying the cost of building and rebuilding the HNSW graph during ingestion when ANN search isn't used.

Solution

This PR introduces a new knnAlgorithm=flat option to DenseVectorField that uses Lucene99FlatVectorsFormat. This stores vectors in the index (.vec/.vemf files) without building the HNSW graph (.vex/.vem files).

Lucene99FlatVectorsFormat is not registered in Lucene's SPI, so this PR includes a wrapper class SolrFlatVectorFormat that delegates to Lucene99FlatVectorsFormat as a workaround. There are examples in other Lucene-based engines using a similar pattern to provide a flat vector format for exact KNN search that wraps Lucene99FlatVectorsFormat.
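The wrapper pattern described above can be sketched without any Lucene dependency. Everything below is a stand-in: `NamedFormat`, `FlatFormat`, `WrappedFlatFormat`, and `FormatRegistry` are hypothetical types that only mimic the shape of the problem (a format that must be resolvable by name when an index is opened), and the `Map`-based registry is a simplification of real SPI discovery. It is not the code in this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Stand-in for a format that is resolved by name when a segment is opened. */
abstract class NamedFormat {
    private final String name;

    NamedFormat(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    abstract String describeReader(String segment);
}

/** Stand-in for a flat format that works when used directly but is never registered by name. */
final class FlatFormat extends NamedFormat {
    FlatFormat() {
        super("FlatFormat");
    }

    @Override
    String describeReader(String segment) {
        return "flat reader for " + segment;
    }
}

/** Thin wrapper: it owns a registered name and delegates all behaviour to FlatFormat. */
final class WrappedFlatFormat extends NamedFormat {
    static final String NAME = "WrappedFlatFormat";
    private final NamedFormat delegate = new FlatFormat();

    WrappedFlatFormat() {
        super(NAME);
    }

    @Override
    String describeReader(String segment) {
        return delegate.describeReader(segment);
    }
}

/** Minimal name-based registry, used here in place of SPI lookup. */
final class FormatRegistry {
    private final Map<String, Supplier<NamedFormat>> byName = new HashMap<>();

    void register(String name, Supplier<NamedFormat> factory) {
        byName.put(name, factory);
    }

    NamedFormat forName(String name) {
        Supplier<NamedFormat> factory = byName.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("No format registered under name: " + name);
        }
        return factory.get();
    }
}
```

In this sketch, registering only `WrappedFlatFormat` lets `forName("WrappedFlatFormat")` succeed while `forName("FlatFormat")` fails. That failure is the gap the wrapper is meant to close. It also shows why the wrapper's name matters: the name, not the delegate, is what a reader uses to find the format again.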

Limitations

This PR currently doesn't support:

  • knnAlgorithm=flat for quantized variants
  • search across flat dense vector fields using the knn, knn_text_to_vector and vectorSimilarity query parsers. Only the vectorSimilarity function query is initially supported.

Both features could be shipped as follow-ups.

AI Disclosure: Claude was used to assist with this PR. All code has been reviewed and tested by me.

Tests

Unit tests for Dense Vector Fields and quantized variants.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide
  • I have added a changelog entry for my change

github-actions bot added the documentation, tests, cat:search, and cat:schema labels Jun 1, 2026
@adamjq adamjq marked this pull request as ready for review June 1, 2026 20:09

@alessandrobenedetti alessandrobenedetti left a comment

Contributor

It's a nice addition, it would be great to have this in and then solve the limitations!

Comment thread solr/core/src/java/org/apache/solr/schema/DenseVectorField.java
* @lucene.spi {@value #NAME}
* @since 10.1
*/
public final class SolrFlatVectorFormat extends KnnVectorsFormat {

Contributor Author

I went back-and-forth about the naming of this class, because it is used as the name of segment files. For example: _0_SolrFlatVectorFormat_0.vec and _0_SolrFlatVectorFormat_0.vemf. Since the name is baked into the index, versioning it might make it easier to evolve in the future.

An alternative would be to add a version in the name like Solr101FlatVectorFormat (indicating it was introduced in Solr 10.1), or a similar approach.

@alessandrobenedetti do you have a strong opinion about either approach?

Contributor

mmmm I never liked the versioning in the name of the class, to be honest, let's see if anybody has any suggestions and let's take it from there!

Contributor

I would name it something that indicates the Lucene version and codec. There doesn't seem to be any consistent convention here. Some codecs don't embed versions in their file names but some do. When they do, it is typically the Lucene version that is embedded. I don't see a huge downside to having the codec/Lucene version displayed more prominently when inspecting index files. For reference, the vector codecs do seem to have more descriptive file names, so I am leaning towards that naming pattern for consistency:

*_Lucene99HnswVectorsFormat_0.vec
*_Lucene99HnswVectorsFormat_0.vem
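The file names quoted in this thread all follow the shape `<segment>_<formatName>_<suffix>.<extension>`. The helper below is a hypothetical sketch of that pattern, inferred only from the examples given here rather than taken from Lucene's source. It illustrates that whatever name the format class registers ends up in every vector file name on disk.

```java
/** Hypothetical helper that mirrors the file-name pattern seen in the examples above. */
final class SegmentFileNames {

    private SegmentFileNames() {}

    /**
     * Builds a per-field vector file name.
     *
     * @param segmentName segment prefix, e.g. "_0"
     * @param formatName  registered name of the vectors format
     * @param suffix      numeric suffix that distinguishes format instances
     * @param extension   file extension without the dot, e.g. "vec"
     */
    static String vectorFileName(String segmentName, String formatName, int suffix, String extension) {
        return segmentName + "_" + formatName + "_" + suffix + "." + extension;
    }
}
```

Under this pattern, renaming the format changes the names of all newly written vector files, while segments written earlier keep the old name. This is why the choice of name is hard to revisit later.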

@kotman12 kotman12 Jun 5, 2026

Contributor

I see now why we need this wrapper. But is it really a "Solr Flat Vector" format? I feel it is a bit of a stretch to call it that, as the implementation is entirely Lucene and this is just to work around exposing it as an SPI. I suppose you can change the Lucene implementation under the hood without changing this "Solr format", but then you lose the benefit of the naming, which is to immediately know that you have, say, Lucene flat vector files from two different versions just by looking at the index files.

Edit: I don't think you can actually hot swap another Lucene implementation here; otherwise you won't know how to read the old index files if you ever upgrade, right? This makes me think versioning is the way to go. Otherwise the unversioned SolrVectorFormat will forever be tied to the Lucene99 format, any later iteration will be versioned, and that will be confusing to look at. So we should version it from the start.

Contributor Author

I agree that having a version from the start is a good idea given the potential for this to evolve in the future. The question is whether it should be the Solr version (e.g. Solr101FlatVectorFormat for Solr 10.1) or the Lucene version (e.g. SolrLucene99FlatVectorsFormat).

The class is currently a very thin wrapper that delegates to the Lucene version. However, I could see the need to extend functionality to support the KNN query parser search logic. It looks like Elastic had a similar discussion in this thread, so there's precedent there.

I've updated the PR to use Solr101FlatVectorFormat but am open to changing it.

3 participants