chore: Updating VectorStore batch size to improve performance by jamie-ons · Pull Request #182 · datasciencecampus/classifai

jamie-ons · 2026-06-08T17:56:14Z

✨ Summary

VectorStore previously exposed batch_size as a repeated parameter on individual methods, creating multiple independent sources of truth. This PR consolidates that to a single value set at construction time.

To inform the choice of default, a profiling analysis was run across the target GCP instance range at batch sizes from 2 to 250. The default has been updated to the value that minimises search time without risking OOM on the smallest supported instances.

Constraints: must not break or perform significantly worse on 2 vCPU instances; optimised for typical cloud deployments at 4–8 vCPUs.

📜 Changes Introduced

VectorStore methods updated to read batch_size from self.batch_size, which is set in the constructor.
Profiling analysis across e2-standard-2, e2-medium, and e2-standard-8, measuring latency and memory.
Default batch_size updated from 8 to 250 based on the analysis findings.
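
As an illustration of the single-source-of-truth pattern described above, here is a minimal sketch. The class `SketchVectorStore` and its methods `_batches` and `embed_all` are hypothetical stand-ins invented for this example, not the actual classifai implementation; only the `batch_size` attribute name and the default of 250 come from this PR.

```python
from typing import Iterator, Sequence


class SketchVectorStore:
    """Hypothetical stand-in, not the real classifai VectorStore."""

    def __init__(self, batch_size: int = 250) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be a positive integer")
        # The only place the value is stored.
        self.batch_size = batch_size

    def _batches(self, items: Sequence[str]) -> Iterator[Sequence[str]]:
        # Every method that batches work reads self.batch_size
        # instead of taking its own batch_size default.
        for start in range(0, len(items), self.batch_size):
            yield items[start : start + self.batch_size]

    def embed_all(self, texts: Sequence[str]) -> list[int]:
        # Placeholder "embedding": the length of each text, computed per batch.
        out: list[int] = []
        for batch in self._batches(texts):
            out.extend(len(t) for t in batch)
        return out
```

With this shape, changing the value at construction changes the behaviour of every batching method at once.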

✅ Checklist

Code passes linting with Ruff
DocStrings follow Google-style and are added as per Pylint recommendations
Documentation has been updated if needed

🔍 How to Test

Run DEMO/general_workflow_demo.ipynb and confirm it all runs as usual with a batch size of 250.
Add batch_size=any_value to the VectorStore creation and re-run; confirm both the VectorStore creation and search use this batch size.
Swap the VectorStore creation for reloading the saved VectorStore via VectorStore.from_filespace, and check the value of batch_size is identical to the value you changed it to previously.
Still using the reloaded VectorStore, check that setting batch_size in the search overrides it temporarily for that search.
Read the documentation updates and confirm they are clear to a prospective or current user.

Additionally, test this on one GCP instance (workstation or other) and locally to ensure it works on both types of hardware.

e2-medium
e2-standard-2
e2-standard-8
MacBook M4


lukeroantreeONS · 2026-06-09T10:50:25Z


        return result_df

-    def search(self, query: VectorStoreSearchInput, n_results=10, batch_size=8) -> VectorStoreSearchOutput:  # noqa: C901, PLR0912, PLR0915


I think we'd like to retain the option for users to specify a different batch size at this point, but we'd want the default behaviour to follow the single source of truth.
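
A minimal sketch of the behaviour suggested here, assuming an optional `batch_size` argument that falls back to the store-level value when omitted. `SketchStore` and its simplified `search` signature are hypothetical, and the method returns the query batches only so the effect of the override is visible.

```python
from typing import Optional


class SketchStore:
    """Hypothetical stand-in, not the real classifai VectorStore."""

    def __init__(self, batch_size: int = 250) -> None:
        self.batch_size = batch_size

    def search(
        self,
        queries: list[str],
        n_results: int = 10,
        batch_size: Optional[int] = None,
    ) -> list[list[str]]:
        # None means "follow the single source of truth";
        # an explicit value applies to this call only.
        effective = self.batch_size if batch_size is None else batch_size
        if effective < 1:
            raise ValueError("batch_size must be a positive integer")
        # Return the batches themselves so the override is observable.
        return [queries[i : i + effective] for i in range(0, len(queries), effective)]
```

Because the override is a local variable, `self.batch_size` is left unchanged after the call.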

lukeroantreeONS · 2026-06-11T10:46:30Z

A few things to note for the updates:

We're moving to having all (default) batch sizes inherit from the VectorStore's, so we'll need to make sure that batch_size:

is persisted in metadata
is made available as an attribute when a VectorStore is reloaded via the .from_filespace() method
can be overridden to a new value via a parameter in the .from_filespace() method
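
The three requirements above could be sketched roughly as follows. This is an assumption-laden illustration: the `SketchStore` class, the `save` method, the `metadata.json` file name, and the `from_filespace` signature shown here are all hypothetical and may differ from the real classifai code.

```python
import json
from pathlib import Path
from typing import Optional


class SketchStore:
    """Hypothetical stand-in, not the real classifai VectorStore."""

    def __init__(self, batch_size: int = 250) -> None:
        self.batch_size = batch_size

    def save(self, folder: Path) -> None:
        # 1. Persist batch_size alongside the other metadata.
        folder.mkdir(parents=True, exist_ok=True)
        metadata = {"batch_size": self.batch_size}
        (folder / "metadata.json").write_text(json.dumps(metadata))

    @classmethod
    def from_filespace(cls, folder: Path, batch_size: Optional[int] = None) -> "SketchStore":
        metadata = json.loads((folder / "metadata.json").read_text())
        # 2. Restore the saved value as an attribute on reload
        #    (falling back to 250 for metadata written before this change).
        saved = metadata.get("batch_size", 250)
        # 3. Let the caller override it with a new value at load time.
        return cls(batch_size=saved if batch_size is None else batch_size)
```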

jamie-ons · 2026-06-23T11:41:41Z

Using the Hugging Face vectoriser:
vectoriser = HuggingFaceVectoriser(model_name="sentence-transformers/all-MiniLM-L6-v2")
and two vector stores of 92 records (small dataset) and 44,000 records (standard dataset).

I ran a search of 2,000 input queries using a range of batch sizes. I repeated this test 3 times across different hardware.

Machine type	vCPUs	Fractional vCPUs	Memory (GB)
e2-medium	2	1	4
e2-standard-2	2	N/A	8
e2-standard-8	8	N/A	32
MacBook M4	16-CPU 40-GPU	N/A	36

As GCP models only allow up to 250 input texts for each request, I tested the following batch sizes:

BATCH_SIZES = [2, 4, 8, 16, 32, 64, 128, 250]

We can see that the relationship between batch_size and the time taken to process the 2,000 input queries resembles exponential decay.

The advantage of a higher batch size is more pronounced with larger datasets and greater compute.
The effect is smaller with low compute and small datasets, but no adverse effect is seen either.

Therefore I think setting the default to 250, the max value allowed by GCP models, is the best choice.
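
For anyone wanting to repeat this kind of measurement, a rough timing harness might look like the sketch below. The helper names `n_batches`, `time_search`, and `fake_search` are hypothetical, the real search call is replaced by a placeholder, and averaging over 3 repeats is an assumption about how the runs were combined. The batch counts also hint at one possible contributor to the flattening curve: each doubling of batch_size halves the number of batches, so any fixed per-batch overhead shrinks quickly at first and then by very little.

```python
import math
import time
from statistics import mean
from typing import Callable, Sequence

BATCH_SIZES = [2, 4, 8, 16, 32, 64, 128, 250]
N_QUERIES = 2_000


def n_batches(n_queries: int, batch_size: int) -> int:
    # Number of batches needed to cover all queries.
    return math.ceil(n_queries / batch_size)


def time_search(
    search: Callable[[Sequence[str], int], object],
    queries: Sequence[str],
    batch_size: int,
    repeats: int = 3,
) -> float:
    # Mean wall-clock seconds over `repeats` runs (assumed aggregation).
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        search(queries, batch_size)
        timings.append(time.perf_counter() - start)
    return mean(timings)


def fake_search(queries: Sequence[str], batch_size: int) -> None:
    # Placeholder for the real search call: only slices the batches.
    for start in range(0, len(queries), batch_size):
        _ = queries[start : start + batch_size]


if __name__ == "__main__":
    queries = ["example query"] * N_QUERIES
    for size in BATCH_SIZES:
        seconds = time_search(fake_search, queries, size)
        print(f"batch_size={size:>3}  batches={n_batches(N_QUERIES, size):>4}  {seconds:.6f}s")
```

Swapping `fake_search` for a function that calls the real store would give timings comparable across machines, provided the same queries and datasets are used.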

frayle-ons · 2026-06-23T13:14:25Z

Have we done any testing for this with the On-Net machines? If not it would be a good idea to test and confirm these findings since our current main user base use these machines

jamie-ons · 2026-06-24T11:17:08Z

Have we done any testing for this with the On-Net machines? If not it would be a good idea to test and confirm these findings since our current main user base use these machines
@frayle-ons

Yes, the Mac in the graph is the On-Net machine. It performs best on the On-Net machine, which is good.

We did some brief further testing at batch sizes higher than 250, and performance does increase on the On-Net machines (although the gains are marginal, depending on the dataset).
Given the requirement for it to work on cloud and (relatively) low-compute VMs, I think 250 is best.

If you mean the ThinkPad, then since its compute is far greater than that of the chosen GCP instances, I would assume it will also perform well.

updated VectorStore methods to use self.batch_size so there is one si…

5399d59

…ngle source of truth

jamie-ons linked an issue Jun 8, 2026 that may be closed by this pull request

Review 'batch_size' behaviour #181

Open

lukeroantreeONS reviewed Jun 9, 2026


jamie-ons added 3 commits June 12, 2026 15:29

updated VectorStore search method to allow a query batch_size argument

8769e9e

persist batch_size in metadata and expose via from_filespace

fa949a7

set default batch_size=250 as it is max batch size for gcp models

2d6f4e5

lukeroantreeONS self-requested a review June 14, 2026 09:42

jamie-ons marked this pull request as ready for review June 22, 2026 15:27

jamie-ons requested a review from a team as a code owner June 22, 2026 15:27

Merge branch 'main' into 181-review-batch_size-behaviour

f5bdaa8

github-actions Bot added the chore label Jun 22, 2026

lukeroantreeONS requested changes Jun 23, 2026


Comment thread src/classifai/indexers/main.py Outdated

fix query_batch_size typo from merge

e8fac53

jamie-ons self-assigned this Jun 23, 2026

jamie-ons requested review from lukeroantreeONS and rileyok-ons June 23, 2026 11:43

jamie-ons wants to merge 6 commits into main from 181-review-batch_size-behaviour
