
add support for chunking #7

Open
thotz wants to merge 2 commits into main from chunking-option-for-text

Conversation

@thotz (Owner) commented Feb 28, 2025

No description provided.

Signed-off-by: Jiffin Tony Thottan <thottanjiffin@gmail.com>
Comment thread on pythonvectordbceph.py (Outdated)
fields = [
FieldSchema(name='url', dtype=DataType.VARCHAR, max_length=2048, is_primary=True), # VARCHARs need a maximum length; here it is set to 2048 characters
FieldSchema(name='embedded_vector', dtype=DataType.FLOAT_VECTOR, dim=int(os.getenv("VECTOR_DIMENSION"))),
FieldSchema(name='start_offset', dtype=DataType.INT64, default_value=0),
Contributor:
Can you try adding is_primary=True to the start_offset field?

Contributor:

Probably not needed for end_offset, as we don't expect overlaps.

MILVUS_ENDPOINT : "http://my-release-milvus.default.svc:19530"
OBJECT_TYPE : "TEXT"
VECTOR_DIMENSION: "384"
# CHUNK_SIZE : "500"
Contributor:

Why is this commented out?

Comment thread on pythonvectordbceph.py
app.logger.debug("object size zero cannot be chunked")
return
text_splitter = CharacterTextSplitter(
separator=".",
Contributor:
Do you split by "." or by size?

Contributor:

Is it possible to demo chunking done by content (by the language model itself)?

Owner Author:

> do you split by . or by size?

It first checks for "."; if the separator is not found, chunking happens based on size.
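The behavior described above (separator first, size as the fallback, with a size cap on merged chunks) can be sketched without the LangChain dependency. This is a minimal illustration of the described logic, not the PR's actual code; the function name and parameters are hypothetical, and LangChain's CharacterTextSplitter additionally supports chunk overlap.

```python
def chunk_text(text, separator=".", chunk_size=500):
    # Split on the separator first; fall back to fixed-size slices if the
    # separator does not appear in the text at all.
    if separator in text:
        pieces = [p for p in text.split(separator) if p.strip()]
    else:
        pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    # Greedily merge pieces so each emitted chunk stays within chunk_size.
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) + 1 > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = (current + separator + piece) if current else piece
    if current:
        chunks.append(current)
    return chunks
```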

Comment thread on pythonvectordbceph.py
objectlist = text_splitter.split_text(object_content)
app.logger.debug("chunk size " + str(chunk_size) + " no of chunks " + str(len(objectlist)))
else:
    objectlist.append(object_content)
Contributor:

Why do you append the entire object content to the object list?

Owner Author:

If chunking is disabled, the entire content is added as a single entry.

Contributor:

So, chunk_size=1 is the indication that chunking is disabled?
Why not "0"?
Also, what would be the value if the env var is not set?

Owner Author:

I will set it to 1 if it is not defined; that part is missing in this PR.
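The missing default could look like the sketch below, assuming CHUNK_SIZE is read from the environment (as in the config shown earlier) and that 1 means chunking is disabled, per the author's convention; the variable names are illustrative.

```python
import os

# Read CHUNK_SIZE from the environment; fall back to "1", which this PR
# treats as "chunking disabled". Using 1 rather than 0 as the sentinel is
# the author's convention; the reviewer's suggestion of 0 would work the
# same way with the comparison adjusted.
chunk_size = int(os.getenv("CHUNK_SIZE", "1"))
chunking_enabled = chunk_size > 1
```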

@yuvalif (Contributor) commented Mar 17, 2025

IMO, the main issue here is that we use delimiter/size-based chunking.
Is there a way to use the LLM to do the chunking?

Signed-off-by: Jiffin Tony Thottan <thottanjiffin@gmail.com>
@thotz (Owner, Author) commented Mar 17, 2025

> IMO, the main issue here, is that we use delimiter/size based chunking. is there a way to use the LLM to do the chunking?

Please read https://zilliz.com/learn/pandas-dataframe-chunking-anf-vectorizing-with-milvus and go to the Content-Aware Chunking section at the end of the page.

@yuvalif (Contributor) commented Mar 17, 2025

> IMO, the main issue here, is that we use delimiter/size based chunking. is there a way to use the LLM to do the chunking?

> Please read https://zilliz.com/learn/pandas-dataframe-chunking-anf-vectorizing-with-milvus, goto last part of Content-Aware Chunking in the page

They have an example based on the token chunk length. Since we know the model we use for embedding, we can probably use that method?
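A sketch of the token-length approach being discussed: pack whole sentences into chunks whose token count fits the embedding model's context window. The tokenizer here is a whitespace-split stand-in, purely for illustration; in practice you would use the embedding model's own tokenizer, and the function names, the 256-token limit, and the packing strategy are all assumptions, not anything from this PR.

```python
def count_tokens(text):
    # Stand-in tokenizer: splits on whitespace. With the real embedding
    # model you would call its own tokenizer instead, so chunk lengths
    # match what the model actually sees.
    return len(text.split())

def chunk_by_tokens(sentences, max_tokens=256):
    # Pack whole sentences into chunks whose total token count stays
    # within max_tokens, so no chunk is truncated by the embedding model.
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```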
