Skip to content

CASSANALYTICS-26: Support vector data type#144

Open
lukasz-antoniak wants to merge 4 commits into
apache:trunkfrom
lukasz-antoniak:CASSANALYTICS-26
Open

CASSANALYTICS-26: Support vector data type#144
lukasz-antoniak wants to merge 4 commits into
apache:trunkfrom
lukasz-antoniak:CASSANALYTICS-26

Conversation

@lukasz-antoniak

Copy link
Copy Markdown
Member

@lukasz-antoniak lukasz-antoniak marked this pull request as ready for review February 18, 2026 06:50

public abstract CqlField.CqlList list(CqlField.CqlType type);

public abstract CqlField.CqlVector vector(CqlField.CqlType type, int dimentions);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: dimentions -> dimensions

public abstract class CassandraTypes
{
public static final Pattern COLLECTION_PATTERN = Pattern.compile("^(set|list|map|tuple)<(.+)>$", Pattern.CASE_INSENSITIVE);
public static final Pattern VECTOR_PATTERN = Pattern.compile("^(vector)<(.+),(.*)>$", Pattern.CASE_INSENSITIVE);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the last group (.*) could be improved. We do not want to match 0 characters, right?
I'd propose the pattern ^(vector)<(.+),(.+)>$.

Comment on lines +247 to +248
throw new UnsupportedOperationException("Only native, collection, tuples or UDT data types are supported, "
+ "unsupported data type: " + cqlType.toString());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for cassandra 5, vector is also supported, so the error message ("Only native, collection, tuples or UDT data types are supported") is misleading.

Comment on lines +120 to +130
for (Object o : (List<?>) value)
{
if (ttl != NO_TTL)
{
rowBuilder.addCell(BufferCell.expiring(cd, timestamp, ttl, now, type().serialize(o),
CellPath.create(TimeUUID.Generator.nextTimeUUID().toBytes())));
}
else
{
rowBuilder.addCell(BufferCell.live(cd, timestamp, type().serialize(o), randomCellPath()));
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think vector is a single cell data type. But the for-loop adds multiple cells. Interesting that the tests are passing.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. This method is only used for creating test mutations for CDC.


@ParameterizedTest
@MethodSource("org.apache.cassandra.bridge.VersionRunner#bridges")
public void testVector(CassandraBridge bridge)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this test uses bridge to write sstables to disk and doesn't test complete bulk writer and bulk reader path. Can you add atleast one dtest using bulkWriterDataFrameWriter and bulkReaderDataFrame to test e2e writing and reading of vectors?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure do we need a vector converter in SqlToCqlTypeConverter.java similar to what we did for Tuples in this PR https://github.com/apache/cassandra-analytics/pull/174/changes. A dtest using bulk writer and reader will expose it if a converter is needed

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good catch! Still working on it.

bulkWriterDataFrameWriter(sourceData, VECTOR_TABLE_NAME).save();

// Count rows because Java driver 3.x cannot read vector type
assertThat(countDataWithDriver(VECTOR_TABLE_NAME)).isEqualTo(numRowsInserted);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the data was inserted before the bulk write, it may be possible for this assertion to succeed even if bulk write fails silently. Can we please insert into a empty table instead, to ensure bulk write indeed inserted the data?


import static org.apache.cassandra.spark.data.CqlField.NO_TTL;

public class CqlVector extends CqlCollection implements CqlField.CqlVector

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Looks like need to overwrite write() in this class, to serialize dimensions as well. Then readType() in AbstractCassandraTypes need to use it to to create a vector from deserialized values.

  • Also, we may need to overwrite equals() in this to compare dimensions as well

*/
protected Long countDataWithDriver(QualifiedName table, ConsistencyLevel consistency)
{
Cluster driverCluster = createDriverCluster(cluster.delegate());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try-with-resources needed here I guess

*
* @param table the qualified Cassandra table name
* @param consistency the consistency level to use for querying the data
* @return record count limited to 10000 rows

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This @return need to be corrected

@Override
public Object convertForCqlWriter(Object value, CassandraVersion version, boolean isCollectionElement)
{
return ((List<?>) value).stream()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to verify list.size() == dimensions

long now,
Object value)
{
for (Object o : (List<?>) value)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not very sure, instead looping and creating cell per element, i.,e multi cell, should we create a single cell for entire vector 'value' ? Does C* expect vector to be single cell or multi cell?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants