4 changes: 4 additions & 0 deletions .openpublishing.redirection.ai.json
@@ -13,6 +13,10 @@
"redirect_url": "/dotnet/ai/microsoft-extensions-ai",
"redirect_document_id": true
},
{
"source_path_from_root": "/docs/ai/conceptual/ai-tools.md",
"redirect_url": "/dotnet/ai/conceptual/calling-tools"
},
{
"source_path_from_root": "/docs/ai/conceptual/evaluation-libraries.md",
"redirect_url": "/dotnet/ai/evaluation/libraries",
File renamed without changes.
32 changes: 2 additions & 30 deletions docs/ai/conceptual/data-ingestion.md
@@ -16,7 +16,7 @@ Data ingestion is the process of collecting, reading, and preparing data from di
- **Transform** the data by cleaning, chunking, enriching, or converting formats.
- **Load** the data into a destination like a database, vector store, or AI model for retrieval and analysis.

For AI and machine learning scenarios, especially Retrieval-Augmented Generation (RAG), data ingestion is not just about converting data from one format to another. It is about making data usable for intelligent applications. This means representing documents in a way that preserves their structure and meaning, splitting them into manageable chunks, enriching them with metadata or embeddings, and storing them so they can be retrieved quickly and accurately.
For AI and machine learning scenarios, especially retrieval-augmented generation (RAG), data ingestion is not just about converting data from one format to another. It is about making data usable for intelligent applications. This means representing documents in a way that preserves their structure and meaning, splitting them into manageable chunks, enriching them with metadata or embeddings, and storing them so they can be retrieved quickly and accurately.

## Why data ingestion matters for AI applications

Expand All @@ -26,37 +26,9 @@ Your chatbot needs to understand and search through thousands of documents to pr

This is where data ingestion becomes critical. You need to extract text from different file formats, break large documents into smaller chunks that fit within AI model limits, enrich the content with metadata, generate embeddings for semantic search, and store everything in a way that enables fast retrieval. Each step requires careful consideration of how to preserve the original meaning and context.

## The Microsoft.Extensions.DataIngestion library

The [📦 Microsoft.Extensions.DataIngestion package](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) provides foundational .NET building blocks for data ingestion. It enables developers to read, process, and prepare documents for AI and machine learning workflows, especially Retrieval-Augmented Generation (RAG) scenarios.

With these building blocks, you can create robust, flexible, and intelligent data ingestion pipelines tailored for your application needs:

- **Unified document representation:** Represent any file type (for example, PDF, Image, or Microsoft Word) in a consistent format that works well with large language models.
- **Flexible data ingestion:** Read documents from both cloud services and local sources using multiple built-in readers, making it easy to bring in data from wherever it lives.
- **Built-in AI enhancements:** Automatically enrich content with summaries, sentiment analysis, keyword extraction, and classification, preparing your data for intelligent workflows.
- **Customizable chunking strategies:** Split documents into chunks using token-based, section-based, or semantic-aware approaches, so you can optimize for your retrieval and analysis needs.
- **Production-ready storage:** Store processed chunks in popular vector databases and document stores, with support for embedding generation, making your pipelines ready for real-world scenarios.
- **End-to-end pipeline composition:** Chain together readers, processors, chunkers, and writers with the <xref:Microsoft.Extensions.DataIngestion.IngestionPipeline`1> API, reducing boilerplate and making it easy to build, customize, and extend complete workflows.
- **Performance and scalability:** Designed for scalable data processing, these components can handle large volumes of data efficiently, making them suitable for enterprise-grade applications.

All of these components are open and extensible by design. You can add custom logic and new connectors, and extend the system to support emerging AI scenarios. By standardizing how documents are represented, processed, and stored, .NET developers can build reliable, scalable, and maintainable data pipelines without "reinventing the wheel" for every project.

### Built on stable foundations

![Data Ingestion Architecture Diagram](../media/data-ingestion/dataingestion.png)

These data ingestion building blocks are built on top of proven and extensible components in the .NET ecosystem, ensuring reliability, interoperability, and seamless integration with existing AI workflows:

- **Microsoft.ML.Tokenizers:** Tokenizers provide the foundation for chunking documents based on tokens. This enables precise splitting of content, which is essential for preparing data for large language models and optimizing retrieval strategies.
- **Microsoft.Extensions.AI:** This set of libraries powers enrichment transformations using large language models. It enables features like summarization, sentiment analysis, keyword extraction, and embedding generation, making it easy to enhance your data with intelligent insights.
- **Microsoft.Extensions.VectorData:** This set of libraries offers a consistent interface for storing processed chunks in a wide variety of vector stores, including Qdrant, Azure SQL, CosmosDB, MongoDB, ElasticSearch, and many more. This ensures your data pipelines are ready for production and can scale across different storage backends.

In addition to familiar patterns and tools, these abstractions build on already extensible components. Plug-in capability and interoperability are paramount, so as the rest of the .NET AI ecosystem grows, the capabilities of the data ingestion components grow as well. This approach empowers developers to easily integrate new providers, enrichments, and storage options, keeping their pipelines future-ready and adaptable to evolving AI scenarios.

## Data ingestion building blocks

The [Microsoft.Extensions.DataIngestion](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) library is built around several key components that work together to create a complete data processing pipeline. This section explores each component and how they fit together.
The [Microsoft.Extensions.DataIngestion](medi-library.md) library is built around several key components that work together to create a complete data processing pipeline. This section explores each component and how they fit together.

### Documents and document readers

39 changes: 39 additions & 0 deletions docs/ai/conceptual/medi-library.md
@@ -0,0 +1,39 @@
---
title: "The Microsoft.Extensions.DataIngestion library"
description: "Learn about the Microsoft.Extensions.DataIngestion library, which provides foundational .NET building blocks for data ingestion."
ms.topic: concept-article
ms.date: 04/15/2026
ai-usage: ai-assisted
---

# The Microsoft.Extensions.DataIngestion library

The [📦 Microsoft.Extensions.DataIngestion package](https://www.nuget.org/packages/Microsoft.Extensions.DataIngestion) provides foundational .NET building blocks for data ingestion. It enables developers to read, process, and prepare documents for AI and machine learning workflows, especially retrieval-augmented generation (RAG) scenarios.

With these building blocks, you can create robust, flexible, and intelligent data ingestion pipelines tailored for your application needs:

- **Unified document representation:** Represent any file type (for example, PDF, image, or Microsoft Word) in a consistent format that works well with large language models.
- **Flexible data ingestion:** Read documents from both cloud services and local sources using multiple built-in readers, making it easy to bring in data from wherever it lives.
- **Built-in AI enhancements:** Automatically enrich content with summaries, sentiment analysis, keyword extraction, and classification, preparing your data for intelligent workflows.
- **Customizable chunking strategies:** Split documents into chunks using token-based, section-based, or semantic-aware approaches, so you can optimize for your retrieval and analysis needs.
- **Production-ready storage:** Store processed chunks in popular vector databases and document stores, with support for embedding generation, making your pipelines ready for real-world scenarios.
- **End-to-end pipeline composition:** Chain together readers, processors, chunkers, and writers with the <xref:Microsoft.Extensions.DataIngestion.IngestionPipeline`1> API, reducing boilerplate and making it easy to build, customize, and extend complete workflows.
- **Performance and scalability:** Designed for scalable data processing, these components can handle large volumes of data efficiently, making them suitable for enterprise-grade applications.

All of these components are open and extensible by design. You can add custom logic and new connectors, and extend the system to support emerging AI scenarios. By standardizing how documents are represented, processed, and stored, .NET developers can build reliable, scalable, and maintainable data pipelines without "reinventing the wheel" for every project.
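To make the composition model concrete, the following sketch chains a reader, a chunker, and a writer. Only `IngestionPipeline<T>` is named by this article; the reader, chunker, and writer type names, and the constructor and method shapes shown here, are hypothetical placeholders, not the package's actual API. Consult the package documentation for the real types.

```csharp
// Illustrative sketch only: the types and signatures below are hypothetical
// placeholders standing in for the package's concrete implementations.
using Microsoft.Extensions.DataIngestion;

// Compose a pipeline: read PDFs, split into token-sized chunks,
// and write the chunks (with embeddings) to a vector store.
IngestionPipeline<string> pipeline = new(
    reader: new HypotheticalPdfReader(),                      // hypothetical
    chunker: new HypotheticalTokenChunker(maxTokens: 512),    // hypothetical
    writer: new HypotheticalVectorStoreWriter(vectorStore));  // hypothetical

await pipeline.ProcessAsync("docs/manual.pdf");
```

The value of the pattern is that each stage is swappable: the same pipeline shape works whether the source is a local folder or a cloud service, and whether the destination is Qdrant, Azure SQL, or another supported store.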

## Built on stable foundations

![Data Ingestion Architecture Diagram](../media/data-ingestion/dataingestion.png)

These data ingestion building blocks are built on top of proven and extensible components in the .NET ecosystem, ensuring reliability, interoperability, and seamless integration with existing AI workflows:

- **Microsoft.ML.Tokenizers:** Tokenizers provide the foundation for chunking documents based on tokens. This enables precise splitting of content, which is essential for preparing data for large language models and optimizing retrieval strategies.
- **Microsoft.Extensions.AI:** This set of libraries powers enrichment transformations using large language models. It enables features like summarization, sentiment analysis, keyword extraction, and embedding generation, making it easy to enhance your data with intelligent insights.
- **Microsoft.Extensions.VectorData:** This set of libraries offers a consistent interface for storing processed chunks in a wide variety of vector stores, including Qdrant, Azure SQL, Azure Cosmos DB, MongoDB, Elasticsearch, and many more. This ensures your data pipelines are ready for production and can scale across different storage backends.

In addition to familiar patterns and tools, these abstractions build on already extensible components. Plug-in capability and interoperability are paramount, so as the rest of the .NET AI ecosystem grows, the capabilities of the data ingestion components grow as well. This approach empowers developers to easily integrate new providers, enrichments, and storage options, keeping their pipelines future-ready and adaptable to evolving AI scenarios.
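As a concrete illustration of the token-based chunking that Microsoft.ML.Tokenizers enables, the following sketch splits text into fixed-size token windows. The 512-token window is an arbitrary choice for the example, not a library default, and a production chunker would typically also overlap windows to preserve context across boundaries.

```csharp
using Microsoft.ML.Tokenizers;

// Count and slice text by tokens rather than characters, so each
// chunk stays within a model's context limits.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

static IEnumerable<string> ChunkByTokens(Tokenizer tokenizer, string text, int maxTokens = 512)
{
    // Encode the full text once, then decode each fixed-size window of IDs.
    IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
    for (int i = 0; i < ids.Count; i += maxTokens)
    {
        yield return tokenizer.Decode(ids.Skip(i).Take(maxTokens));
    }
}
```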

## See also

- [Data ingestion](data-ingestion.md)
30 changes: 30 additions & 0 deletions docs/ai/conceptual/mevd-library.md
@@ -0,0 +1,30 @@
---
title: "The Microsoft.Extensions.VectorData library"
description: "Learn how to use Microsoft.Extensions.VectorData to build semantic search features."
ms.topic: concept-article
ms.date: 04/15/2026
ai-usage: ai-assisted
---

# The Microsoft.Extensions.VectorData library

The [📦 Microsoft.Extensions.VectorData.Abstractions](https://www.nuget.org/packages/Microsoft.Extensions.VectorData.Abstractions) package provides a unified layer of abstractions for interacting with vector stores in .NET. These abstractions let you write simple, high-level code against a single API, and swap out the underlying vector store with minimal changes to your application.

The library provides the following key capabilities:

- **Seamless .NET type mapping**: Map your .NET type directly to the database, similar to an object/relational mapper.
- **Unified data model**: Define your data model once using .NET attributes and use it across any supported vector store.
- **CRUD operations**: Create, read, update, and delete records in a vector store.
- **Vector and hybrid search**: Query records by semantic similarity using vector search, or combine vector and text search for hybrid search.
- **Embedding generation management**: Configure your embedding generator once and let the library transparently handle generation.
- **Collection management**: Create, list, and delete collections (tables or indices) in a vector store.

Microsoft.Extensions.VectorData is also the building block for additional, higher-level layers that need to interact with vector databases, for example, the [Microsoft.Extensions.DataIngestion](../conceptual/data-ingestion.md) library.

## Microsoft.Extensions.VectorData and Entity Framework Core

If you're already using [Entity Framework Core](/ef/core) to access your database, your database provider likely already supports vector search, and you can express such searches with LINQ queries. In such applications, Microsoft.Extensions.VectorData isn't strictly necessary. However, most dedicated vector databases aren't supported by EF Core, and Microsoft.Extensions.VectorData provides a good experience for working with them. You might also find yourself using both EF Core and Microsoft.Extensions.VectorData in the same application, for example, when using an additional layer such as [Microsoft.Extensions.DataIngestion](../conceptual/medi-library.md).
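As a sketch of what the unified API looks like in use: the snippet below assumes a record type `Snippet` annotated with the library's attributes, a provider-specific `VectorStore` instance (each connector package supplies its own), and a precomputed `queryEmbedding`. The member names follow a recent version of the abstractions package; check the package documentation for the version you're using.

```csharp
using Microsoft.Extensions.VectorData;

// Assumes: `vectorStore` from a connector package (in-memory, Qdrant,
// Azure AI Search, ...), a `Snippet` record type annotated with the
// library's attributes, and a float[] `queryEmbedding`.
VectorStoreCollection<ulong, Snippet> collection =
    vectorStore.GetCollection<ulong, Snippet>("snippets");
await collection.EnsureCollectionExistsAsync();

// Vector search: stream the three records most similar to the query embedding.
await foreach (VectorSearchResult<Snippet> result in
    collection.SearchAsync(queryEmbedding, top: 3))
{
    Console.WriteLine($"{result.Record.Id}: {result.Score}");
}
```

Because the code targets the abstractions, swapping the underlying store means changing only how `vectorStore` is constructed.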

## See also

- [Vector databases for .NET AI apps](../vector-stores/overview.md)