Conversation
Signed-off-by: Jiffin Tony Thottan <thottanjiffin@gmail.com>
```python
fields = [
    FieldSchema(name='url', dtype=DataType.VARCHAR, max_length=2048, is_primary=True),  # VARCHAR fields need a maximum length; here it is set to 2048 characters
    FieldSchema(name='embedded_vector', dtype=DataType.FLOAT_VECTOR, dim=int(os.getenv("VECTOR_DIMENSION"))),
    FieldSchema(name='start_offset', dtype=DataType.INT64, default_value=0),
]
```
Can you try adding is_primary=True to the start_offset field?

Probably not needed for end_offset, as we don't expect overlaps.
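One caveat with that suggestion: pymilvus allows only a single field with is_primary=True per collection schema, so start_offset cannot simply become a second primary key. A minimal sketch of an alternative (hypothetical, not from this PR) is to derive one VARCHAR primary key from the url plus the chunk's start offset:

```python
# Hypothetical helper, not part of this PR: Milvus supports one primary
# field per collection, so per-chunk uniqueness can instead come from a
# key that combines the object's url with the chunk's start offset.

def chunk_primary_key(url: str, start_offset: int) -> str:
    """Build a VARCHAR primary key that is unique per (url, chunk)."""
    return f"{url}#{start_offset}"

print(chunk_primary_key("s3://bucket/doc.txt", 500))
# → s3://bucket/doc.txt#500
```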
```yaml
MILVUS_ENDPOINT: "http://my-release-milvus.default.svc:19530"
OBJECT_TYPE: "TEXT"
VECTOR_DIMENSION: "384"
# CHUNK_SIZE: "500"
```
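These settings map onto environment lookups in the service; a minimal sketch, assuming these exact variable names (the fallback values here are illustrative, not from the PR):

```python
import os

# Variable names taken from the config snippet above; the fallbacks used
# when a variable is unset are illustrative assumptions.
milvus_endpoint = os.getenv("MILVUS_ENDPOINT", "http://localhost:19530")
object_type = os.getenv("OBJECT_TYPE", "TEXT")
vector_dimension = int(os.getenv("VECTOR_DIMENSION", "384"))
```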
```python
app.logger.debug("object size zero cannot be chunked")
return
text_splitter = CharacterTextSplitter(
    separator=".",
```
do you split by . or by size?
is it possible to demo chunking done by content (by the language model itself)?
> do you split by . or by size?

First it checks for "."; if that is not found, chunking happens based on size.
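A minimal sketch of that behaviour in plain Python (illustrative only; the PR uses LangChain's CharacterTextSplitter, whose actual implementation differs): split on the separator first, then pack the pieces back together up to chunk_size characters.

```python
# Illustrative stand-in for the separator-then-size chunking described
# above; not the LangChain CharacterTextSplitter implementation.

def split_text(text: str, chunk_size: int, separator: str = ".") -> list[str]:
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = current + separator + piece if current else piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would exceed chunk_size: flush the chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

print(split_text("one. two. three. four", chunk_size=12))
# → ['one. two', ' three. four']
```

Note that a single piece longer than chunk_size is still kept whole here, which mirrors the warning LangChain emits in that situation.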
```python
    objectlist = text_splitter.split_text(object_content)
    app.logger.debug("chunk size " + str(chunk_size) + " no of chunks " + str(len(objectlist)))
else:
    objectlist.append(object_content)
```
why do you append the entire object content to the object list?

If chunking is disabled, the entire content is added as a single entry.
so, chunk_size=1 is the indication that chunking is disabled?
why not "0"?
also, what would be the value if the env var is not set?

It will be set to 1 if it is not defined; that handling is missing in this PR.
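A sketch of what that missing default handling might look like, under the assumption from this thread that CHUNK_SIZE is read from the environment and a value of 1 means chunking is disabled:

```python
import os

def effective_chunk_size(env=os.environ) -> int:
    """Assumption from the discussion above: an unset CHUNK_SIZE
    defaults to 1."""
    return int(env.get("CHUNK_SIZE", "1"))

def chunking_enabled(env=os.environ) -> bool:
    """A chunk size of 1 is treated as 'chunking disabled'."""
    return effective_chunk_size(env) > 1
```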
IMO, the main issue here is that we use delimiter/size-based chunking.
Signed-off-by: Jiffin Tony Thottan <thottanjiffin@gmail.com>
Please read https://zilliz.com/learn/pandas-dataframe-chunking-anf-vectorizing-with-milvus and go to the last part of the page, on Content-Aware Chunking.

They have an example based on the token chunk length. Since we know the model we use for embedding, we can probably use that method?
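A sketch of token-length-based chunking along those lines: in practice the tokenizer should be the embedding model's own (e.g. from the sentence-transformers model loaded for embedding), but a whitespace split stands in for it here.

```python
# Illustrative token-based chunking; a whitespace split stands in for the
# embedding model's real tokenizer, which would be used in practice.

def tokenize(text: str) -> list[str]:
    return text.split()

def chunk_by_tokens(text: str, max_tokens: int) -> list[str]:
    """Group tokens into chunks of at most max_tokens tokens each."""
    tokens = tokenize(text)
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

print(chunk_by_tokens("a b c d e f g", max_tokens=3))
# → ['a b c', 'd e f', 'g']
```

The advantage over character-count chunking is that each chunk is sized in the same units the embedding model actually consumes, so no chunk silently exceeds the model's input limit.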