CASSANALYTICS-26: Support vector data type#144
Conversation
|
|
||
| public abstract CqlField.CqlList list(CqlField.CqlType type); | ||
|
|
||
| public abstract CqlField.CqlVector vector(CqlField.CqlType type, int dimentions); |
There was a problem hiding this comment.
typo: dimentions -> dimensions
| public abstract class CassandraTypes | ||
| { | ||
| public static final Pattern COLLECTION_PATTERN = Pattern.compile("^(set|list|map|tuple)<(.+)>$", Pattern.CASE_INSENSITIVE); | ||
| public static final Pattern VECTOR_PATTERN = Pattern.compile("^(vector)<(.+),(.*)>$", Pattern.CASE_INSENSITIVE); |
There was a problem hiding this comment.
I think the last group (.*) could be improved. We do not want to match 0 characters, right?
I'd propose the pattern ^(vector)<(.+),(.+)>$.
| throw new UnsupportedOperationException("Only native, collection, tuples or UDT data types are supported, " | ||
| + "unsupported data type: " + cqlType.toString()); |
There was a problem hiding this comment.
nit: for cassandra 5, vector is also supported, so the error message ("Only native, collection, tuples or UDT data types are supported") is misleading.
| for (Object o : (List<?>) value) | ||
| { | ||
| if (ttl != NO_TTL) | ||
| { | ||
| rowBuilder.addCell(BufferCell.expiring(cd, timestamp, ttl, now, type().serialize(o), | ||
| CellPath.create(TimeUUID.Generator.nextTimeUUID().toBytes()))); | ||
| } | ||
| else | ||
| { | ||
| rowBuilder.addCell(BufferCell.live(cd, timestamp, type().serialize(o), randomCellPath())); | ||
| } |
There was a problem hiding this comment.
I think vector is a single cell data type. But the for-loop adds multiple cells. Interesting that the tests are passing.
There was a problem hiding this comment.
Ah. This method is only used for creating test mutations for CDC.
|
|
||
| @ParameterizedTest | ||
| @MethodSource("org.apache.cassandra.bridge.VersionRunner#bridges") | ||
| public void testVector(CassandraBridge bridge) |
There was a problem hiding this comment.
I believe this test uses bridge to write sstables to disk and doesn't test complete bulk writer and bulk reader path. Can you add atleast one dtest using bulkWriterDataFrameWriter and bulkReaderDataFrame to test e2e writing and reading of vectors?
There was a problem hiding this comment.
Not sure do we need a vector converter in SqlToCqlTypeConverter.java similar to what we did for Tuples in this PR https://github.com/apache/cassandra-analytics/pull/174/changes. A dtest using bulk writer and reader will expose it if a converter is needed
There was a problem hiding this comment.
Very good catch! Still working on it.
33f4d30 to
2ca9e3f
Compare
a7a85d5 to
c77ab91
Compare
| bulkWriterDataFrameWriter(sourceData, VECTOR_TABLE_NAME).save(); | ||
|
|
||
| // Count rows because Java driver 3.x cannot read vector type | ||
| assertThat(countDataWithDriver(VECTOR_TABLE_NAME)).isEqualTo(numRowsInserted); |
There was a problem hiding this comment.
As the data was inserted before the bulk write, it may be possible for this assertion to succeed even if bulk write fails silently. Can we please insert into a empty table instead, to ensure bulk write indeed inserted the data?
|
|
||
| import static org.apache.cassandra.spark.data.CqlField.NO_TTL; | ||
|
|
||
| public class CqlVector extends CqlCollection implements CqlField.CqlVector |
There was a problem hiding this comment.
-
Looks like need to overwrite write() in this class, to serialize dimensions as well. Then readType() in AbstractCassandraTypes need to use it to to create a vector from deserialized values.
-
Also, we may need to overwrite equals() in this to compare dimensions as well
| */ | ||
| protected Long countDataWithDriver(QualifiedName table, ConsistencyLevel consistency) | ||
| { | ||
| Cluster driverCluster = createDriverCluster(cluster.delegate()); |
There was a problem hiding this comment.
try-with-resources needed here I guess
| * | ||
| * @param table the qualified Cassandra table name | ||
| * @param consistency the consistency level to use for querying the data | ||
| * @return record count limited to 10000 rows |
| @Override | ||
| public Object convertForCqlWriter(Object value, CassandraVersion version, boolean isCollectionElement) | ||
| { | ||
| return ((List<?>) value).stream() |
There was a problem hiding this comment.
Good to verify list.size() == dimensions
| long now, | ||
| Object value) | ||
| { | ||
| for (Object o : (List<?>) value) |
There was a problem hiding this comment.
I am not very sure, instead looping and creating cell per element, i.,e multi cell, should we create a single cell for entire vector 'value' ? Does C* expect vector to be single cell or multi cell?
Fixes CASSANALYTICS-26.