diff --git a/DEMO/README.md b/DEMO/README.md index b3b0f8f..915d583 100644 --- a/DEMO/README.md +++ b/DEMO/README.md @@ -66,6 +66,25 @@ It covers: * A showcase of the pre-made `RagHook` class and performing different RAG type tasks on `VectorStoreSearchOutput` results including classification, reranking and keyword identication, and how to customise the specific task using this class. +### 5. Evaluating VectorStore Performance with Metrics : `evaluation_workflow_demo.ipynb` + +This notebook demonstrates how to use the Evaluation module to assess the performance of one or more VectorStore instances against ground-truth labelled data. + +It covers: + +* An introduction to the Evaluation module and its multi-class, single-label classification focus. + +* The available evaluation metrics. + +* Creating multiple VectorStore instances with varying data coverage to showcase performance differences. + +* Instantiating an `Evaluation` object with ground truth data and selected metrics. + +* Running the `evaluate()` method to compute metrics across multiple VectorStores. + +* Memory-efficient evaluation using callable functions to load VectorStores on-demand, useful when evaluating many or large VectorStores. + +> **Note:** The Evaluation module is currently in development and its API is subject to change in future releases. 
--- ## Installation of classifai diff --git a/DEMO/data/fake_soc_eval_queries.csv b/DEMO/data/fake_soc_eval_queries.csv new file mode 100644 index 0000000..a292fce --- /dev/null +++ b/DEMO/data/fake_soc_eval_queries.csv @@ -0,0 +1,49 @@ +text,label +"grows apples and berries in orchards",101 +"raises cows for milk and produces dairy goods",102 +"lays bricks and concrete blocks to build walls",103 +"builds custom wooden furniture and fittings",104 +"installs and repairs wiring in residential buildings",105 +"fixes leaking pipes and drainage systems",106 +"develops and tests software applications",107 +"analyzes datasets to produce business insights",108 +"prepares and reviews financial records for compliance",109 +"teaches students in schools and colleges",110 +"provides direct patient care in hospitals",111 +"prepares meals in a busy restaurant kitchen",112 +"creates visual branding assets and illustrations",113 +"diagnoses and repairs faults in cars and trucks",114 +"captures and edits video content for events",115 +"prepares espresso drinks as a cafe barista",116 +"creates personal workout plans for clients",117 +"organizes archives and historical records",118 +"writes technical manuals for software products",119 +"conducts lab experiments in biology and chemistry",120 +"investigates crimes and gathers forensic evidence",121 +"responds to fires and emergency rescue calls",122 +"operates commercial aircraft for passenger travel",123 +"writes scripts for films and television",124 +"composes original music for performances",125 +"coaches athletes to improve competitive performance",126 +"designs clothing collections and fashion accessories",127 +"helps clients buy and sell residential property",128 +"plans weddings and coordinates vendors",129 +"treats pets and livestock as a veterinarian",130 +"supports families with counseling and social care",131 +"advises businesses on strategy and growth",132 +"manages warehousing and transportation logistics",133 +"tests 
applications for bugs and usability issues",134 +"builds statistical models to forecast trends",135 +"recruits candidates and manages employee relations",136 +"assists head chefs with kitchen operations",137 +"mixes cocktails and serves drinks at a bar",138 +"plans travel itineraries and bookings for clients",139 +"designs gameplay mechanics for video games",140 +"installs office lighting systems and electrical fixtures",106 +"maintains gas appliances and heating pipework",105 +"builds mobile apps for smartphones and tablets",108 +"analyzes sales and market data for business decisions",109 +"teaches students with special educational needs",111 +"assists with childbirth and postnatal care",112 +"protects company systems from cyber attacks",120 +"organizes entertainment and activities on cruise ships",123 \ No newline at end of file diff --git a/DEMO/evaluation_workflow_demo.ipynb b/DEMO/evaluation_workflow_demo.ipynb new file mode 100644 index 0000000..43dd1f1 --- /dev/null +++ b/DEMO/evaluation_workflow_demo.ipynb @@ -0,0 +1,374 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# ClassifAI Evaluation Module - Overview and Usage\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Evaluation module provides a toolkit to evaluate the performance of VectorStores in a multi-class, single-label classification setting. Provided the user has:\n", + "\n", + "- a constructed VectorStore (or multiple VectorStores) built from historically labeled (or similar) data;\n", + "- a held out collection of labelled ground truth data, not in the VectorStore;\n", + "\n", + "then this module can be used to evaluate VectorStore performance, presenting results with a variety of available metrics that can be specified by the user." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Multi-Class Single-Label Evaluation\n", + "\n", + "Currently the evaluation module only evaluates single label predictions meaning that, while ClassifAI is designed to return a ranked list of several semantically similar candidate entries to a provided query sample, only the top result will be considered when comparing the VectorStore result to a ground truth label provided by a user.\n", + "\n", + "![top_1_eval_image](files/eval_top_1_diagram.png)\n", + "\n", + "The Evaluation module is currently in development, and in the future its feature set may be extended to include a broader range of evaluation tasks such as multi-class multi-label classification, where potentially multiple labels for a ground truth sample can be compared and evaluated against multiple ranked candidate predictions of the VectorStore." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For multi-class, single-label evaluation we have implemented several metrics that can be calculated, their names and descriptions are as follows: \n", + "| Metric | Description |\n", + "|------------------|-------------------------------------------------------------------------------------------------|\n", + "| Accuracy | The proportion of correctly predicted labels out of the total number of predictions. |\n", + "| Macro Recall | The average recall calculated independently for each class, treating all classes equally. |\n", + "| Macro Precision | The average precision calculated independently for each class, treating all classes equally. |\n", + "| Macro F1 | The harmonic mean of Macro Precision and Macro Recall, providing a balance between the two. 
|" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## In this notebook\n", + "\n", + "This notebook will provide a demonstration of the following concepts:\n", + "\n", + "- An introduction to the Evaluation Module and its main class `Evaluation`\n", + "\n", + "- Use of the demo file `fake_soc_eval_queries.csv` which is in the `DEMO/data/` repo folder. This data contains ground truth samples related to the demo knowledgebase CSV file `DEMO/data/fake_soc_dataset.csv`\n", + "\n", + "- Creating several VectorStores using the 'fake_soc_dataset.csv' file, using different amounts of data from the file to create several VectorStores of varying quality.\n", + "\n", + "- Setting up an Evaluation task using the fake queries file and a list of metrics to evaluate.\n", + "\n", + "- Tying this all together, evaluating the VectorStores against the ground truth query file using the evaluation module.\n", + "\n", + "\n", + "See the ClassifAI GitHub repository and `DEMO/README.md` for information on accessing the associated demo datasets needed for this (and other) notebook tutorials.
\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Building VectorStores to Evaluate\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To begin, we're going to create 2 VectorStores from our `fake_soc_dataset.csv` file which contains mock SOC survey responses and their corresponding occupation codes. One VectorStore will be built from the full dataset, and the second one will be built from half the dataset. \n", + "\n", + "Since the second VectorStore will contain only half the training data, we can reason that it will not perform as well as the VectorStore built with the full dataset due to lack of coverage." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from classifai.indexers import VectorStore\n", + "from classifai.vectorisers import HuggingFaceVectoriser\n", + "\n", + "# Initialize a vectoriser using a HuggingFace model\n", + "demo_vectoriser = HuggingFaceVectoriser(model_name=\"sentence-transformers/all-MiniLM-L6-v2\")\n", + "\n", + "# Initialize a vector store using the vectoriser and a CSV file\n", + "demo_vectorstore_full = VectorStore(\n", + " file_name=\"data/fake_soc_dataset.csv\",\n", + " data_type=\"csv\",\n", + " vectoriser=demo_vectoriser,\n", + " skip_save=True,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We're now going to modify the `fake_soc_dataset.csv` file to remove half the samples, then build a VectorStore with that reduced dataset. We'll do this by loading in the original data, cutting it in half and saving it to a new CSV file in the code below. Then build another VectorStore like with the code above." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "# Load the CSV file into a DataFrame\n", + "df = pd.read_csv(\"data/fake_soc_dataset.csv\")\n", + "print(df.shape)\n", + "\n", + "# cut the dataframe in half\n", + "half_df = df.iloc[: len(df) // 2]\n", + "print(half_df.shape)\n", + "\n", + "# save back to CSV\n", + "half_df.to_csv(\"data/fake_soc_dataset_half.csv\", index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# now building the vector store with the half dataset, same as before but changing the file path\n", + "demo_vectorstore_half = VectorStore(\n", + " file_name=\"data/fake_soc_dataset_half.csv\",\n", + " data_type=\"csv\",\n", + " vectoriser=demo_vectoriser,\n", + " skip_save=True,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Creating an Evaluation Object" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "ClassifAI provides an `Evaluation` class (in the Evaluation module) which can be used to load a collection of ground truth queries and specify metrics to use. 
The evaluator constructor accepts 4 arguments:\n", + "\n", + "- A pandas dataframe with `['text', 'label']` columns/headers both of type string, which represent the text sample queries and gold standard ground truth label respectively,\n", + "- A list of evaluation metric names which must be strings corresponding to one of the currently available metrics: `['accuracy', 'macro_recall', 'macro_precision', 'macro_f1']`,\n", + "- A `batch_size` which determines how many samples should be processed at once (smaller size will take longer but be more memory efficient),\n", + "- A boolean argument `save_output` which determines if generated results should be saved to CSV.\n", + "\n", + "\n", + "Calling the constructor with these arguments will check that the inputs are valid and suitable for the task:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from classifai.evaluation import Evaluation\n", + "\n", + "# loading our mock ground truth file into a pandas dataframe, setting the type of both columns to string.\n", + "ground_truths = pd.read_csv(\n", + " \"data/fake_soc_eval_queries.csv\", dtype={\"text\": str, \"label\": str}\n", + ") # Load the ground truths from the CSV file\n", + "\n", + "# using the Evaluation class, passing the ground truth DF and metrics we want to evaluate.\n", + "evaluator = Evaluation(\n", + " ground_truths=ground_truths,\n", + " metrics=[\"macro_precision\", \"accuracy\"], # we chose just 2 metrics this time\n", + " batch_size=16,\n", + " save_output=True,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Running the above code should also present a FutureWarning. 
This is because at this time the Evaluation module is still in development and may be subject to future breaking changes in later updates.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Evaluating the performance of VectorStores" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the evaluator object instantiates correctly, we can then use the `.evaluate()` method which takes VectorStores and some corresponding names to evaluate against the ground truth. \n", + "\n", + "For each VectorStore passed to it, the `evaluate()` method will:\n", + "\n", + "1. Trigger the VectorStore to perform search over each of the queries in the ground truth dataset, obtaining 1 result for each,\n", + "\n", + "2. The results will be collected and combined with the ground truth labels,\n", + "\n", + "3. The predictions and ground truths will be used to calculate each of the metrics specified by the user in the constructor.\n", + "\n", + "4. The evaluate method returns the results as a dataframe object and saves the results to a file if the user set the constructor argument `save_output=True`\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "results = evaluator.evaluate(\n", + " vectorstores=[demo_vectorstore_full, demo_vectorstore_half],\n", + " vectorstore_names=[\"full data vectorstore\", \"half data vectorstore\"],\n", + " output_file=\"./classifai_temp/demo_eval_results.csv\", # omitting this argument will save the results to evaluation_results.csv\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The results object is a dataframe with provided VectorStore names as the row indexes, and each column is associated with a given metric. We should also see the results have been saved to the specified output CSV file." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Following through this demo correctly, the results should show that a VectorStore containing only half the available labelled data sees a significant drop in performance on the ground truth dataset compared to the VectorStore using all available data, according to the chosen metrics." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Efficient VectorStore Loading and other settings" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We've already demonstrated the core features of the Evaluation module. This final section will show some additional implemented features including:\n", + "\n", + "- Efficient ways to load VectorStores to avoid memory issues when using many and/or large VectorStores,\n", + "- More of the available metrics,\n", + "- Saving results to file without a specified filename." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In some cases, the user may want to evaluate many VectorStores in a single evaluation run, but more VectorStores require more memory. It may not be possible to load all VectorStores into memory at the same time. For this reason, the `vectorstores` parameter of `Evaluation.evaluate()` accepts callable functions that return VectorStores as well as instantiated VectorStores.\n", + "\n", + "With this design, the user can write functions that will load VectorStores into memory at the time of their evaluation. Once the evaluation is complete, the VectorStore will be dropped from memory, and memory is managed more efficiently." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# writing a function that will build/load the larger of our 2 VectorStores into memory when the function is called.\n", + "# you can also use the VectorStore.from_filespace method to load the VectorStore from the filespace if it has already been built and saved to disk.\n", + "def load_largest_vectorStore():\n", + " efficient_vectoriser = HuggingFaceVectoriser(model_name=\"sentence-transformers/all-MiniLM-L6-v2\")\n", + " efficient_vectorstore = VectorStore(\n", + " file_name=\"data/fake_soc_dataset.csv\",\n", + " data_type=\"csv\",\n", + " vectoriser=efficient_vectoriser,\n", + " skip_save=True,\n", + " )\n", + "\n", + " return efficient_vectorstore # these functions must return only the VectorStore object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "second_evaluator = Evaluation(\n", + " ground_truths=ground_truths,\n", + " metrics=[\n", + " \"accuracy\",\n", + " \"macro_precision\",\n", + " \"macro_recall\",\n", + " \"macro_f1\",\n", + " ], # this is all the available metrics, we can use them all at the same time.\n", + " batch_size=16,\n", + " save_output=True, # setting to False, only the results will be returned, not saved to a file.\n", + ")\n", + "\n", + "\n", + "second_results = second_evaluator.evaluate(\n", + " vectorstores=[\n", + " load_largest_vectorStore,\n", + " demo_vectorstore_half,\n", + " ], # passing the function itself, not calling it. 
We can also pass one of our existing in-memory vectorstores as well.\n", + " vectorstore_names=[\"efficiently loaded vectorstore\", \"half data vectorstore\"],\n", + ")\n", + "\n", + "\n", + "second_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## That's It!\n", + "\n", + "The results (this time with more metrics) from the last evaluation run should be saved in the current directory as `evaluation_results.csv`, since we didn't specify an output file name this time. Remember you can disable writing results to file by setting `save_output=False` in the Evaluation constructor.\n", + "\n", + "Finally, one more reminder that this module is still in development and may be subject to breaking changes in the future.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "classifai", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/DEMO/files/eval_top_1_diagram.png b/DEMO/files/eval_top_1_diagram.png new file mode 100644 index 0000000..4db8fdd Binary files /dev/null and b/DEMO/files/eval_top_1_diagram.png differ diff --git a/README.md b/README.md index 2600c4a..6f95193 100644 --- a/README.md +++ b/README.md @@ -165,6 +165,8 @@ Further guides and tutorials can be found in the [DEMO folder](./DEMO/) of this - [General workflow](./DEMO/general_workflow_demo.ipynb) - A general introduction to using the ClassifAI package. +- [Evaluation Module](./DEMO/evaluation_workflow_demo.ipynb) + - Evaluate and compare VectorStore performance in a multi-class single-label setting using your own ground truth data. 
- [Custom vectorisers](./DEMO/custom_vectoriser.ipynb) - make your own custom vectoriser model that will interact with the core features of the package. - [Custom pre- and post-processing "hooks"](./DEMO/using_hooks.ipynb) diff --git a/src/classifai/evaluation/main.py b/src/classifai/evaluation/main.py index abdb374..100ec8b 100644 --- a/src/classifai/evaluation/main.py +++ b/src/classifai/evaluation/main.py @@ -131,7 +131,6 @@ class Evaluation: batch_size (int): Batch size for vectorstore search operations. save_output (bool): Whether to save evaluation results to a file. parsed_metrics (dict): Dictionary of parsed metrics to compute. - results (pd.DataFrame | None): DataFrame containing overall evaluation results. metric_results (dict): Dictionary of individual metric results for detailed inspection. """ @@ -163,6 +162,7 @@ def __init__( self.ground_truths["qid"] = self.ground_truths.index.astype(str) self.batch_size = batch_size self.save_output = save_output + self.metric_results = {} # parse the provided metrics and store them in the instance try: @@ -290,11 +290,10 @@ def evaluate( # noqa: C901, PLR0912 del resolved_vs # Compute metrics for the current VectorStore and store results - vs_metrics = {} try: for _metric_name, metric in self.parsed_metrics.items(): result = metric.evaluate(results_df) - vs_metrics[result.name] = result.value + self.metric_results[result.name] = result.value except Exception as e: raise EvaluationError( "Metric computation failed.", @@ -302,7 +301,7 @@ def evaluate( # noqa: C901, PLR0912 ) from e # Append the current VectorStore's metrics to the overall results DataFrame - vectorstore_df = pd.DataFrame([vs_metrics], index=[name]) + vectorstore_df = pd.DataFrame([self.metric_results], index=[name]) overall_results_df = pd.concat([overall_results_df, vectorstore_df]) # Save results to CSV if requested
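
Note on the metrics computed in the loop above: the four metric names in the notebook's table can be illustrated with a small standalone sketch. This is not the code in `src/classifai/evaluation/`; the function name `top1_metrics` is hypothetical, and two conventions here are assumptions rather than confirmed behaviour of the module: classes are taken as the union of true and predicted labels, and a class with no predictions (or no true samples) contributes 0 to the average. Macro F1 follows the description in the notebook table (harmonic mean of macro precision and macro recall), which differs from averaging per-class F1 scores.

```python
from collections import defaultdict


def top1_metrics(y_true, y_pred):
    """Accuracy and macro-averaged precision/recall/F1 for top-1 predictions.

    Hypothetical helper for illustration only; not part of classifai.
    """
    # Assumption: the class set is every label seen in truths or predictions.
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def safe_div(num, den):
        # Assumption: undefined ratios (zero denominator) count as 0.0.
        return num / den if den else 0.0

    precision = [safe_div(tp[c], tp[c] + fp[c]) for c in labels]
    recall = [safe_div(tp[c], tp[c] + fn[c]) for c in labels]
    macro_p = safe_div(sum(precision), len(labels))
    macro_r = safe_div(sum(recall), len(labels))
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "macro_precision": macro_p,
        "macro_recall": macro_r,
        # Harmonic mean of the two macro averages, as described in the table.
        "macro_f1": safe_div(2 * macro_p * macro_r, macro_p + macro_r),
    }


# Labels are strings, matching the dtype used for the ground truth CSV.
metrics = top1_metrics(["101", "102", "103", "103"], ["101", "103", "103", "103"])
print(metrics)
```

Each call returns one dict per set of predictions, which is the same shape of data the loop above turns into a single row of the results DataFrame for the named VectorStore.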