
add support for chunking #7

Open
thotz wants to merge 2 commits into main from chunking-option-for-text

Conversation

@thotz (Owner) commented Feb 28, 2025

No description provided.

Signed-off-by: Jiffin Tony Thottan <thottanjiffin@gmail.com>
Comment thread on pythonvectordbceph.py (Outdated)
fields = [
FieldSchema(name='url', dtype=DataType.VARCHAR, max_length=2048, is_primary=True), # VARCHARs need a maximum length; here it is set to 2048 characters
FieldSchema(name='embedded_vector', dtype=DataType.FLOAT_VECTOR, dim=int(os.getenv("VECTOR_DIMENSION"))),
FieldSchema(name='start_offset', dtype=DataType.INT64, default_value=0),
Contributor:
Can you try adding is_primary=True to the start_offset field?

Contributor:

Probably not needed for end_offset, as we don't expect overlaps.

MILVUS_ENDPOINT : "http://my-release-milvus.default.svc:19530"
OBJECT_TYPE : "TEXT"
VECTOR_DIMENSION: "384"
# CHUNK_SIZE : "500"
Contributor:

Why is this commented out?

Comment thread on pythonvectordbceph.py
app.logger.debug("object size zero cannot be chunked")
return
text_splitter = CharacterTextSplitter(
separator=".",
Contributor:
Do you split by "." or by size?

Contributor:

Is it possible to demo chunking done by content (by the language model itself)?

Owner Author:

> do you split by . or by size?

It first checks for "."; if the separator is not found, chunking happens based on size.
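The behavior described above (separator first, size as the fallback, with a size cap on merged chunks) can be sketched without the LangChain dependency. This is a minimal illustration of the described logic, not the PR's actual code; the function name and parameters are hypothetical, and LangChain's CharacterTextSplitter additionally supports chunk overlap.

```python
def chunk_text(text, separator=".", chunk_size=500):
    # Split on the separator first; fall back to fixed-size slices if the
    # separator does not appear in the text at all.
    if separator in text:
        pieces = [p for p in text.split(separator) if p.strip()]
    else:
        pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    # Greedily merge pieces so each emitted chunk stays within chunk_size.
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) + 1 > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = (current + separator + piece) if current else piece
    if current:
        chunks.append(current)
    return chunks
```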

Comment thread on pythonvectordbceph.py
objectlist = text_splitter.split_text(object_content)
app.logger.debug("chunk size " + str(chunk_size) + " no of chunks " + str(len(objectlist)))
else:
    objectlist.append(object_content)
Contributor:

Why do you append the entire object content to the object list?

Owner Author:

If chunking is disabled, the entire content is added as a single entry.

Contributor:

So, chunk_size=1 is the indication that chunking is disabled?
Why not "0"?
Also, what would be the value if the env var is not set?

Owner Author:

I will set it to 1 if it is not defined; that part is missing in this PR.
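The missing default could look like the sketch below, assuming CHUNK_SIZE is read from the environment (as in the config shown earlier) and that 1 means chunking is disabled, per the author's convention; the variable names are illustrative.

```python
import os

# Read CHUNK_SIZE from the environment; fall back to "1", which this PR
# treats as "chunking disabled". Using 1 rather than 0 as the sentinel is
# the author's convention; the reviewer's suggestion of 0 would work the
# same way with the comparison adjusted.
chunk_size = int(os.getenv("CHUNK_SIZE", "1"))
chunking_enabled = chunk_size > 1
```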

@yuvalif (Contributor) commented Mar 17, 2025

IMO, the main issue here is that we use delimiter/size-based chunking.
Is there a way to use the LLM to do the chunking?

Signed-off-by: Jiffin Tony Thottan <thottanjiffin@gmail.com>
@thotz (Owner, Author) commented Mar 17, 2025

> IMO, the main issue here, is that we use delimiter/size based chunking. is there a way to use the LLM to do the chunking?

Please read https://zilliz.com/learn/pandas-dataframe-chunking-anf-vectorizing-with-milvus and go to the Content-Aware Chunking section at the end of the page.

@yuvalif (Contributor) commented Mar 17, 2025

> IMO, the main issue here, is that we use delimiter/size based chunking. is there a way to use the LLM to do the chunking?

> Please read https://zilliz.com/learn/pandas-dataframe-chunking-anf-vectorizing-with-milvus, goto last part of Content-Aware Chunking in the page

They have an example based on the token chunk length. Since we know the model we use for embedding, we can probably use that method?
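A sketch of the token-length approach being discussed: pack whole sentences into chunks whose token count fits the embedding model's context window. The tokenizer here is a whitespace-split stand-in, purely for illustration; in practice you would use the embedding model's own tokenizer, and the function names, the 256-token limit, and the packing strategy are all assumptions, not anything from this PR.

```python
def count_tokens(text):
    # Stand-in tokenizer: splits on whitespace. With the real embedding
    # model you would call its own tokenizer instead, so chunk lengths
    # match what the model actually sees.
    return len(text.split())

def chunk_by_tokens(sentences, max_tokens=256):
    # Pack whole sentences into chunks whose total token count stays
    # within max_tokens, so no chunk is truncated by the embedding model.
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```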
