Merged
1 change: 1 addition & 0 deletions Pipfile
@@ -39,6 +39,7 @@ google-genai = "*"
python-multipart = "*"
langchain-xai = "*"
xai-sdk = "*"
+python-docx = "*"

[dev-packages]
pytest = "*"
1,390 changes: 625 additions & 765 deletions Pipfile.lock

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/open-api-docs.yaml
@@ -2,7 +2,7 @@ openapi: 3.0.3
info:
title: The Agent's user-facing API
description: The user-facing parts of The Agent's API service (excluding system-level endpoints, chat completion, maintenance endpoints, etc.)
- version: 5.14.1
+ version: 5.15.0
license:
name: MIT
url: https://opensource.org/licenses/MIT
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-05-22
@@ -0,0 +1,143 @@
## Context

The attachment-processing pipeline today handles three media families:

- **Images** → `ComputerVisionAnalyzer` (batched)
- **Audio** → `AudioTranscriber`
- **Documents** → `DocumentSearch` (PDF only)

The document path is the narrowest: only `.pdf` is recognized, and the only strategy is "load via `PyMuPDFLoader`, embed into an in-memory vector store, semantic-search the top-K chunks, run a copywriter LLM over them, return summary." This works well for large PDFs but is overkill, and unnecessarily costly, for short text files.

The pipeline already provides the surrounding machinery we need: download via URL, 13-week TTL cache, per-attachment graceful failure (`__process_single` catches `Exception`, stores formatted `str(e)` into `errors[i]`), and a structured per-attachment result dict consumed by the LLM tool layer.

Recent commits (`414e4a5` and adjacent) added user-configurable `max_output_tokens` per chat membership, but that constrains LLM **output**, not the **input** context size we control here. There is no user-facing knob for input context; this design keeps it that way.

## Goals / Non-Goals

**Goals:**
- Extend the document attachment path to recognize and extract content from a broad range of plain-text formats (text, markup, configs, source code) plus `.docx`.
- Choose between "raw dump" and "semantic search" per attachment based on size, transparently to the caller.
- Bound memory exposure for URL-attached files via a hard size cap.
- Surface extraction failures as structured, per-attachment errors using existing error codes; do not abort the tool call when one attachment fails.
- Keep the LLM tool-call result shape (`list[dict[str, str]]`) backwards-compatible.

**Non-Goals:**
- No support for `.doc` (legacy binary Word) or `.rtf` (low demand, extra dep).
- No OCR for image-only PDFs/DOCX: we report "no extractable text" and stop.
- No user-configurable token threshold or strategy override (would require frontend changes; not justified for V1).
- No richer per-attachment error dict (`to_llm_dict()`): it would change the result type signature. Deferred.
- No new error codes: reuse `DOCUMENT_SEARCH_FAILED` and `ATTACHMENT_PROCESSING_FAILED`.
- No changes to image or audio paths.

## Decisions

### D1: Two strategies, selected by size

**Decision**: After extracting text, estimate token count as `len(joined_text) // 3`. If the estimate is ≤ 15,000 tokens, return the joined text directly; otherwise feed the documents into `DocumentSearch`.

**Rationale**:
- Embedding + similarity search on a small file is wasted spend and latency.
- Raw dump gives the LLM the full document, producing better answers for small inputs.
- 15K tokens is roughly 7.5% of a typical 200K-token context window, leaving ample room for chat history and other attachments.
- `len // 3` is conservative (assumes denser tokenization than `len // 4`), reducing the risk that a "raw" dump silently exceeds the context window for non-Latin scripts or code-heavy text.

**Alternatives considered**:
- Per-format thresholds (e.g., always search PDFs, always dump `.txt` raw): rejected because behavior at the boundary is surprising (a 100 KB `.pdf` is searched but a 100 KB `.md` is not).
- Use `tiktoken` for accurate counts: rejected for V1 because it adds startup cost and complexity for a decision boundary that doesn't need precision.
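
The size-based selection in D1 can be sketched as follows. This is a minimal illustration, assuming only the divisor and threshold stated in the decision; the names `estimate_tokens` and `choose_strategy` and the newline join separator are hypothetical and not taken from the codebase.

```python
RAW_TOKEN_LIMIT = 15_000  # threshold stated in D1
CHARS_PER_TOKEN = 3       # conservative divisor stated in D1


def estimate_tokens(text: str) -> int:
    """Cheap token estimate: character count divided by 3, rounded down."""
    return len(text) // CHARS_PER_TOKEN


def choose_strategy(texts: list[str]) -> str:
    """Return "raw" for small inputs and "search" otherwise (hypothetical helper)."""
    # Assumption: extracted chunks are joined with newlines before measuring.
    joined_text = "\n".join(texts)
    if estimate_tokens(joined_text) <= RAW_TOKEN_LIMIT:
        return "raw"
    return "search"
```

Because the estimate uses integer division, any joined text of up to 45,002 characters stays on the raw path under these assumptions.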

### D2: Unified `PlainTextLoader` for all decode-based formats

**Decision**: A single new class loads every plain-text extension. UTF-8 strict first; on `UnicodeDecodeError`, retry with `errors="replace"`. Returns a single `Document` with the full file contents and metadata `{"chunk": 0}`.

**Rationale**:
- Logic is identical for `.txt`, `.md`, `.log`, `.csv`, `.json`, `.xml`, `.yaml`, `.yml`, and all source-code extensions; splitting per format would be pure duplication.
- `errors="replace"` covers Windows-1252 / Latin-1 / mojibake files lossily but never crashes; matches the "graceful per-attachment failure" theme.

**Alternatives considered**:
- Use `chardet` for accurate encoding detection: rejected as a heavy dependency with marginal benefit; modern uploads are nearly all UTF-8.
- Use `langchain_community.document_loaders.TextLoader`: rejected because it takes a file path, not bytes/URL, and would force a temp-file detour.
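
The decode-with-fallback behavior in D2 might look like the sketch below. The `Document` dataclass is a stand-in for whatever document type the pipeline actually uses, and `load_plain_text` is a hypothetical function name; only the strict-then-`errors="replace"` order and the `{"chunk": 0}` metadata come from the decision above.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for the pipeline's document type (assumption)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_plain_text(raw: bytes) -> list[Document]:
    """Decode bytes as strict UTF-8, falling back to lossy replacement."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Invalid byte sequences become U+FFFD instead of raising.
        text = raw.decode("utf-8", errors="replace")
    # A single document holding the whole file, as described in D2.
    return [Document(page_content=text, metadata={"chunk": 0})]
```

A Latin-1 file containing `b"caf\xe9"` would come back as `caf` followed by a replacement character rather than failing the attachment.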

### D3: `DocumentSearch` becomes loader-agnostic

**Decision**: Move `PyMuPDFLoader(document_url).load()` out of `DocumentSearch.__init__`. The new constructor accepts a pre-loaded `list[Document]` plus the existing tools. The caller is responsible for choosing and invoking the right loader.

**Rationale**:
- Required to support `.docx` and (eventually) any other format through the same search path.
- Keeps `DocumentSearch` focused on its real job: embed → search → copywrite.
- Makes the class trivially unit-testable with hand-built `Document` lists.

**Alternatives considered**:
- Loader factory inside `DocumentSearch`: rejected because it couples the class to every supported format.
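
To illustrate the split of responsibilities in D3, here is a rough sketch in which the caller picks a loader by extension and the search class only ever sees documents. Every name here (`DocumentSearchSketch`, `LOADERS`, `build_search`, `load_text`) is hypothetical, and the real constructor also takes the existing tools, which are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    """Stand-in for the pipeline's document type (assumption)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class DocumentSearchSketch:
    """Loader-agnostic: receives documents, never a URL (illustrative only)."""

    def __init__(self, documents: list[Document], top_k: int = 3):
        self.documents = documents
        self.top_k = top_k


def load_text(raw: bytes) -> list[Document]:
    """Hypothetical plain-text loader used for the registry below."""
    return [Document(raw.decode("utf-8", errors="replace"), {"chunk": 0})]


# Hypothetical registry; PDF and DOCX loaders would be added the same way.
LOADERS: dict[str, Callable[[bytes], list[Document]]] = {
    ".txt": load_text,
    ".md": load_text,
}


def build_search(filename: str, raw: bytes) -> DocumentSearchSketch:
    """The caller chooses the loader, then hands plain documents to the search class."""
    extension = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    loader = LOADERS[extension]  # raises KeyError for unsupported extensions
    return DocumentSearchSketch(loader(raw))
```

With this shape, a unit test can construct `DocumentSearchSketch([Document("hello")])` directly, with no download or parsing involved.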

### D4: 10MB hard cap at the loader level

**Decision**: Each loader checks `len(downloaded_bytes) > 10 * 1024 * 1024` first and raises `ExternalServiceError("File too large for processing (>10MB)", ATTACHMENT_PROCESSING_FAILED) from None`. Caught by `__process_single`, reported as an error string for that attachment.

**Rationale**:
- Telegram/WhatsApp already cap uploads (~20 MB / 16 MB respectively), so the cap mainly matters for URL-attached files where users could point at arbitrarily large resources.
- 10 MB comfortably covers any reasonable user document while bounding memory and download time.
- Reuses `ATTACHMENT_PROCESSING_FAILED` (currently defined but unused), so no frontend updates are needed.

**Alternatives considered**:
- Streaming size check during download: rejected as over-engineered; the existing flow already loads the full bytes into memory.
- `ValidationError` + 422 status: rejected because it would mismatch the code's 5xxx range; the HTTP status is irrelevant inside the per-attachment flow anyway.
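
The size guard in D4 could be sketched as below. The `ExternalServiceError` class here is a stand-in with an assumed `(message, code)` signature, and `ensure_within_size_cap` is a hypothetical helper name; the message text, the 10 MB constant, and the `from None` come from the decision.

```python
MAX_FILE_BYTES = 10 * 1024 * 1024  # 10 MB cap from D4

ATTACHMENT_PROCESSING_FAILED = 5006  # code value as listed in D7


class ExternalServiceError(Exception):
    """Stand-in for the project's error class (assumed signature)."""

    def __init__(self, message: str, code: int):
        super().__init__(message)
        self.code = code


def ensure_within_size_cap(downloaded_bytes: bytes) -> None:
    """Raise a coded error when the downloaded payload exceeds the cap."""
    if len(downloaded_bytes) > MAX_FILE_BYTES:
        # `from None` suppresses exception chaining; there is no underlying cause.
        raise ExternalServiceError(
            "File too large for processing (>10MB)",
            ATTACHMENT_PROCESSING_FAILED,
        ) from None
```

Keeping the limit in a single constant matches the note under Risks that the cap should be easy to bump.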

### D5: Empty-extraction short-circuit

**Decision**: If extracted text is empty or whitespace-only after joining all documents, return a fixed message (`"Document contains no extractable text (possibly an image-only document)."`). Do not invoke `DocumentSearch`.

**Rationale**:
- Running semantic search on `""` produces noise and burns tokens for no gain.
- A clear message lets the chat LLM tell the user something useful ("looks like a scanned PDF; try sending the image instead").
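
A minimal sketch of the D5 short-circuit, assuming the extracted page texts are joined with newlines before the check. The helper name `short_circuit_if_empty` is hypothetical; the message string is the one given in the decision.

```python
from typing import Optional

NO_TEXT_MESSAGE = (
    "Document contains no extractable text (possibly an image-only document)."
)


def short_circuit_if_empty(page_texts: list[str]) -> Optional[str]:
    """Return the fixed message when nothing but whitespace was extracted."""
    joined_text = "\n".join(page_texts)
    if not joined_text.strip():
        return NO_TEXT_MESSAGE
    # None means "continue with the raw or search strategy".
    return None
```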

### D6: Cache key includes strategy

**Decision**: Cache key becomes `{prefix}-{attachment_id}-{strategy}-{additional_context_hash}` where `strategy ∈ {"raw", "search"}`.

**Rationale**:
- The cached value differs in structure between strategies (raw text vs. copywriter summary). Conflating them under one key would return wrong content on the second access after a strategy switch.
- Strategy almost never changes for the same attachment (size is stable), so cache hit-rate is unaffected in practice.
- Existing cache entries become orphaned but harmless; they expire on the 13-week TTL.
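
The key shape in D6 can be illustrated as follows. How the additional-context hash is actually computed is not stated here, so the truncated SHA-256 below is purely an assumption, as is the function name `build_cache_key`.

```python
import hashlib

VALID_STRATEGIES = ("raw", "search")


def build_cache_key(
    prefix: str, attachment_id: str, strategy: str, additional_context: str
) -> str:
    """Compose `{prefix}-{attachment_id}-{strategy}-{additional_context_hash}`."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Assumption: any stable digest of the context works; SHA-256 is one option.
    digest = hashlib.sha256(additional_context.encode("utf-8")).hexdigest()
    return f"{prefix}-{attachment_id}-{strategy}-{digest[:16]}"
```

Two lookups for the same attachment and context but different strategies then resolve to different entries, which is the property the rationale relies on.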

### D7: Reuse existing error codes; raise as `ExternalServiceError`

**Decision**:
- `DOCUMENT_SEARCH_FAILED` (5010) for any extraction/parse failure: corrupt `.docx`, broken `.pdf`, unexpected loader exception.
- `ATTACHMENT_PROCESSING_FAILED` (5006) for size violations.
- Both raised as `ExternalServiceError(message, CODE) from e` (where `e` is the underlying cause). `ServiceError.__str__()` formats as `[E5010] 🌐 message # Caused by: ...`, which the existing `__process_single` catch path writes into `errors[i]`.

**Rationale**:
- Both codes exist but are unused, so wiring them up does not require new frontend localization.
- 5xxx range matches `ExternalServiceError` convention.
- HTTP status (502) does not surface for this flow; the error path is purely the per-attachment string.
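
The per-attachment error path described in D7 might be sketched like this. The error class is a simplified stand-in (its string format omits the icon and the "Caused by" suffix of the real `ServiceError.__str__()`), `extract_text` uses a strict UTF-8 decode purely as a stand-in for a parser that can fail, and `process_attachments` is a hypothetical name for the loop around `__process_single`.

```python
from typing import Optional

DOCUMENT_SEARCH_FAILED = 5010  # code value as listed in D7


class ExternalServiceError(Exception):
    """Simplified stand-in; the real class formats its own string."""

    def __init__(self, message: str, code: int):
        super().__init__(f"[E{code}] {message}")
        self.code = code


def extract_text(raw: bytes) -> str:
    """Hypothetical extractor that wraps the underlying failure in a coded error."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as e:
        raise ExternalServiceError(
            "Failed to extract document text", DOCUMENT_SEARCH_FAILED
        ) from e


def process_attachments(
    attachments: list[bytes],
) -> tuple[list[Optional[str]], list[Optional[str]]]:
    """Process each attachment independently; one failure never aborts the rest."""
    results: list[Optional[str]] = [None] * len(attachments)
    errors: list[Optional[str]] = [None] * len(attachments)
    for i, raw in enumerate(attachments):
        try:
            results[i] = extract_text(raw)
        except Exception as e:
            # Store the formatted error string for this attachment only.
            errors[i] = str(e)
    return results, errors
```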

### D8: Increase `SEARCH_RESULT_PAGES` from 2 to 3

**Decision**: Bump the top-K from 2 to 3 chunks for the semantic-search path.

**Rationale**:
- Small change requested by user during exploration.
- Three chunks give the copywriter LLM more context to work with on large documents without significantly inflating the prompt.

## Risks / Trade-offs

| Risk | Mitigation |
|---|---|
| `.ts` extension collides with MPEG Transport Stream (`video/mp2t`). A platform might report TS video files as `application/typescript` (or vice versa) | Routing is extension-first; existing audio/video MIME detection in the bot SDKs runs before document routing. Add a code comment near the `.ts` entry flagging the collision. |
| Lossy decoding (`errors="replace"`) may produce garbled content for legacy-encoded files | Replacement characters are visible to the LLM, which can typically signal "file content looks corrupted" to the user. The real fix is `chardet`; defer until needed. |
| Empty docx/PDF check is a heuristic: a document with only formatting whitespace gets the "no text" message | Acceptable: such documents have nothing useful to chat about anyway. |
| `python-docx` is a new dependency | Maintained, widely used, MIT-licensed, ~1 MB install. Low risk. |
| Cache key change orphans existing PDF cache entries | Entries expire naturally on 13-week TTL. No correctness risk. |
| 10 MB cap is arbitrary | Configurable in code (single constant). Easy to bump if a real use case complains. |
| Source-code extensions are added speculatively (no proven user demand yet) | Same code path as `.txt`; the cost is dict entries plus one comment about `.ts`. Trivial to remove if it causes confusion. |

## Migration Plan

1. Add `python-docx` to `Pipfile` and lock it.
2. Implement new loaders and the strategy selector behind the existing tool-call API; no caller-side changes are needed.
3. Deploy. Behavior changes are entirely additive (existing PDF flow now goes through the refactored `DocumentSearch`; result shape unchanged).
4. No rollback gymnastics needed: the feature is internal to the attachment tool call. Reverting the commit fully restores prior behavior. Cached entries from the new flow become orphans on revert and expire on TTL.

## Open Questions

- None blocking. `.ts` MIME collision is documented and not actionable until/unless a real user case surfaces.
@@ -0,0 +1,42 @@
## Why

Users can currently "chat with a PDF" but cannot send any other text-bearing file types. Real conversations involve `.txt`, `.md`, `.docx`, source code, configs, logs, and structured data. Today the bot rejects them all as unsupported. Closing this gap turns the bot from a PDF-only document assistant into a general "chat with any file" agent.

## What Changes

- Support new attachment extensions, all routed through the existing attachment-processing pipeline:
  - Plain-text formats: `.txt`, `.md`, `.log`, `.csv`, `.json`, `.xml`, `.yaml`, `.yml`
  - Source-code formats: `.js`, `.ts`, `.jsx`, `.tsx`, `.py`, `.java`, `.c`, `.h`, `.cpp`, `.hpp`, `.go`, `.rs`, `.rb`, `.php`, `.sh`, `.bash`, `.zsh`, `.swift`, `.kt`, `.html`, `.css`, `.scss`
  - Binary document format: `.docx`
- Add a size-based strategy selector: small files (≤ ~15K-token estimate) are loaded fully into the LLM's context as raw text; larger files continue to use the existing semantic-search pipeline.
- Add a 10MB hard cap at the loader level to prevent oversized URL-attached files from exhausting memory.
- Introduce graceful, per-attachment error reporting using existing-but-unused error codes (`DOCUMENT_SEARCH_FAILED` for extraction failures; `ATTACHMENT_PROCESSING_FAILED` for size violations); one corrupt attachment does not fail the whole tool call.
- Short-circuit empty extractions with a clear "no extractable text" message instead of running semantic search on nothing.
- **BREAKING (internal)**: `DocumentSearch.__init__` now accepts a pre-loaded `list[Document]` instead of a raw `document_url`. The PDF loader moves out of the class into the caller. No public-API impact.
- Increase `SEARCH_RESULT_PAGES` from 2 to 3 for slightly richer search results.
- Add `python-docx` as a new dependency for `.docx` extraction.
- Drop `.doc` and `.rtf` from scope (legacy formats with significant extraction cost and minimal user demand).

## Capabilities

### New Capabilities
- `chat-attachments`: End-to-end behavior for processing user-attached files in chat: extension/MIME recognition, content extraction, raw-vs-search strategy selection, caching, and per-attachment error reporting. Covers all attachment types (images, audio, documents), but this change focuses on extending the document path.

### Modified Capabilities
<!-- None: this is the first change in OpenSpec; there are no existing specs to modify. -->

## Impact

- **Code**:
  - `src/features/chat/supported_files.py`: expanded `KNOWN_DOCS_FORMATS`
  - `src/features/chat/chat_attachment_processor.py`: strategy selector, size guard, empty-extract short-circuit, updated cache key, structured error raises
  - `src/features/documents/document_search.py`: constructor signature change, `SEARCH_RESULT_PAGES` bump
  - `src/features/documents/plain_text_loader.py`: new (unified UTF-8 loader with `errors="replace"` fallback)
  - `src/features/documents/docx_loader.py` (or use `Docx2txtLoader` directly): new
  - `src/di/di.py`: wire new loaders / updated `document_search` factory
- **API**: No public API or HTTP contract changes. Behavior changes are internal to the attachment-processing tool call.
- **Dependencies**: Add `python-docx` to `Pipfile`.
- **Error codes**: Reuse `DOCUMENT_SEARCH_FAILED` (5010) and `ATTACHMENT_PROCESSING_FAILED` (5006), both currently unused. No new codes added (avoids frontend updates).
- **Tests**: New unit tests for plain-text loading (encoding fallback, oversize), processor strategy selection (raw/search boundary, corrupt input, empty extract), and refactored `DocumentSearch`.
- **Caching**: Cache key gains a `strategy` segment (`raw` | `search`); previously cached entries are no longer matched once the key shape changes, so they become orphaned and expire on the 13-week TTL.
- **Performance**: Small files skip embedding + semantic search, reducing latency and embedding-API spend on common cases.