GeorgesAlkhouri/docctl

Local-first CLI for agent and human document retrieval, with provenance-grounded answers, a local vector store, and predictable machine-readable output.

Why docctl

  • Optimized for agentic retrieval loops with fast multi-step questions and answers.
  • Runs locally with a persistent Chroma-backed index.
  • Ingests .pdf, .docx, .txt, and .md with provenance metadata (doc_id, source, title).
  • Uses sentence-aware chunking for better retrieval quality.
  • Supports deterministic --json output for automation and agents.
  • Exposes stable CLI workflows for ingest, search, diagnostics, and inventory.

Agent Integration

Use SKILL.md when you want an agent to drive docctl end-to-end. The skill relies on session for fast iterative retrieval.

Quickstart

Requirements:

  • Python 3.12 or 3.13
  • pip

# 1) Install from PyPI
pip install docctl

# 2) Verify CLI
docctl --help

# 3) Ingest supported files
docctl ingest ./docs --recursive --approve-write --allow-model-download

# 4) Search indexed content
docctl search "security gateway diagnostics" --top-k 5 --allow-model-download

# 5) Show one chunk by id (replace with an id from search output)
docctl show <chunk_id_from_search> --allow-model-download

Command Overview

| Command | Purpose |
| --- | --- |
| docctl ingest <path> | Ingest one supported file or a directory of supported files (mutates local index state). |
| docctl export <archive_path> | Export current index data to one .zip snapshot file. |
| docctl import <archive_path> | Import index data from one .zip snapshot file (mutates local index state). |
| docctl search <query> | Search indexed content with optional metadata filters. |
| docctl show <chunk_id> | Show one indexed chunk by exact id. |
| docctl stats | Show index statistics. |
| docctl catalog | Show index summary and per-document inventory. |
| docctl doctor | Run local diagnostics for index and embedding setup. |
| docctl session | Run a read-only NDJSON request session on stdin/stdout. |

JSON and Session Mode

Use --json for deterministic machine-readable output:

docctl --json search "security gateway diagnostics" --top-k 5 --allow-model-download
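
For automation written in Python, the command above could be wrapped roughly as follows. This is a minimal sketch, not part of docctl: the helper names build_search_argv and run_search are hypothetical, and because the shape of the JSON payload is not described here, the sketch only parses stdout and returns whatever it contains.

```python
import json
import subprocess


def build_search_argv(query: str, top_k: int = 5) -> list[str]:
    """Build the argv for `docctl --json search`, mirroring the command above."""
    return [
        "docctl", "--json", "search", query,
        "--top-k", str(top_k),
        "--allow-model-download",
    ]


def run_search(query: str, top_k: int = 5):
    """Run the search and parse stdout as JSON (requires docctl on PATH)."""
    completed = subprocess.run(
        build_search_argv(query, top_k),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    # The payload structure is not assumed; it is returned unmodified.
    return json.loads(completed.stdout)
```

Passing the query as its own argv element avoids shell quoting problems when the query contains spaces or quotes.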

Use session for NDJSON request/response flows. For agents, this is the preferred fast path whenever one workflow needs two or more read operations:

cat <<'EOF' | docctl session --allow-model-download
{"id":"q1","op":"search","query":"security gateway diagnostics","top_k":5}
{"id":"q2","op":"catalog"}
EOF
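
The same session can be driven from a program instead of a heredoc. The sketch below is an illustration only: encode_requests and run_session are hypothetical helper names, the request fields are copied from the example above, and it assumes that each non-empty line on stdout is one JSON response object, which is not confirmed here.

```python
import json
import subprocess

# Requests copied from the heredoc example above.
REQUESTS = [
    {"id": "q1", "op": "search", "query": "security gateway diagnostics", "top_k": 5},
    {"id": "q2", "op": "catalog"},
]


def encode_requests(requests: list[dict]) -> str:
    """Serialize requests as NDJSON: one compact JSON object per line."""
    return "".join(
        json.dumps(request, separators=(",", ":")) + "\n" for request in requests
    )


def run_session(requests: list[dict]) -> list[dict]:
    """Send all requests to one `docctl session` process (requires docctl on PATH)."""
    completed = subprocess.run(
        ["docctl", "session", "--allow-model-download"],
        input=encode_requests(requests),
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: every non-empty stdout line is one JSON response object.
    return [
        json.loads(line)
        for line in completed.stdout.splitlines()
        if line.strip()
    ]
```

Batching both reads into one process is the point of session mode: the startup cost is paid once rather than once per query.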

Configuration

Global options:

  • --index-path (default: .docctl)
  • --collection (default: default)
  • --json (deterministic JSON payloads on stdout)
  • --verbose (extra diagnostics)

Model downloads are explicit:

  • Use --allow-model-download when embedding artifacts are not already available.

Mutation boundaries:

  • ingest and import are mutating.
  • search, show, stats, catalog, doctor, and session are read-only.
  • export is read-only with respect to the index; it writes one .zip snapshot file.
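
An agent wrapper may want to enforce these boundaries before it launches a command. The following is a minimal sketch under that assumption: the two sets are transcribed from the lists above, while is_mutating and check_allowed are hypothetical helpers that docctl itself does not provide.

```python
# Command classification transcribed from the mutation boundaries above.
MUTATING = frozenset({"ingest", "import"})
READ_ONLY = frozenset(
    {"search", "show", "stats", "catalog", "doctor", "session", "export"}
)


def is_mutating(command: str) -> bool:
    """Return True if the docctl subcommand changes local index state."""
    if command in MUTATING:
        return True
    if command in READ_ONLY:
        return False
    # Unknown names are rejected rather than guessed at.
    raise ValueError(f"unknown docctl command: {command!r}")


def check_allowed(command: str, allow_writes: bool = False) -> None:
    """Raise PermissionError for a mutating command unless writes are allowed."""
    if is_mutating(command) and not allow_writes:
        raise PermissionError(f"{command} mutates the index; writes not allowed")
```

Rejecting unknown command names keeps the guard conservative if the command set changes later.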

Development

Run core quality checks:

make lint
make format-check
make typecheck
make security-lint
make import-lint
make test
make test-cov
make check-markdown-links

Apply formatting fixes:

make format

Build release artifacts locally:

make build-dist
make check-dist
make release-dry-run

Documentation Map

Contributing

For implementation and validation workflow, start with:

  1. AGENTS.md
  2. ARCHITECTURE.md
  3. The indexed docs under docs/ listed above.
