10xHub/textxtract

TextXtract

A professional, extensible Python package for extracting text from multiple file formats with both synchronous and asynchronous support.

🚀 Features

  • 🔄 Dual Input Support: Works with file paths or raw bytes
  • ⚡ Sync & Async APIs: Choose the right approach for your use case
  • 📁 Multiple Formats: PDF, DOCX, DOC, TXT, ZIP, Markdown, RTF, HTML, CSV, JSON, XML
  • 🎯 Optional Dependencies: Install only what you need
  • 🛡️ Robust Error Handling: Comprehensive exception hierarchy
  • 📊 Professional Logging: Detailed debug and info level logging
  • 🔒 Thread-Safe: Async operations use thread pools for I/O-bound tasks
  • 🧹 Context Manager Support: Automatic resource cleanup

Documentation

For complete documentation, including installation instructions, usage examples, and API reference, please visit our documentation site.

📦 Installation

Basic Installation

pip install textxtract

Install with File Type Support

# Install support for specific formats
pip install textxtract[pdf]          # PDF support
pip install textxtract[docx]         # Word documents
pip install textxtract[all]          # All supported formats

# Multiple formats
pip install textxtract[pdf,docx,html]

🏃 Quick Start

Synchronous Extraction

from textxtract import SyncTextExtractor

extractor = SyncTextExtractor()

# Extract from file path
text = extractor.extract("document.pdf")
print(text)

# Extract from bytes (filename required for type detection)
with open("document.pdf", "rb") as f:
    file_bytes = f.read()
text = extractor.extract(file_bytes, "document.pdf")
print(text)

Asynchronous Extraction

from textxtract import AsyncTextExtractor
import asyncio

async def extract_text():
    extractor = AsyncTextExtractor()
    
    # Extract from file path
    text = await extractor.extract("document.pdf")
    return text

# Run async extraction
text = asyncio.run(extract_text())
print(text)

Context Manager Usage

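import asyncio
from textxtract import AsyncTextExtractor, SyncTextExtractor

# Automatic resource cleanup
with SyncTextExtractor() as extractor:
    text = extractor.extract("document.pdf")

# Async context manager (must run inside a coroutine)
async def extract_text():
    async with AsyncTextExtractor() as extractor:
        return await extractor.extract("document.pdf")

text = asyncio.run(extract_text())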

📋 Supported File Types

| Format      | Extensions   | Dependencies | Installation                  |
|-------------|--------------|--------------|-------------------------------|
| Text        | .txt, .text  | Built-in     | pip install textxtract        |
| Markdown    | .md          | Optional     | pip install textxtract[md]    |
| PDF         | .pdf         | Optional     | pip install textxtract[pdf]   |
| Word        | .docx        | Optional     | pip install textxtract[docx]  |
| Word Legacy | .doc         | Optional     | pip install textxtract[doc]   |
| Rich Text   | .rtf         | Optional     | pip install textxtract[rtf]   |
| HTML        | .html, .htm  | Optional     | pip install textxtract[html]  |
| CSV         | .csv         | Built-in     | pip install textxtract        |
| JSON        | .json        | Built-in     | pip install textxtract        |
| XML         | .xml         | Optional     | pip install textxtract[xml]   |
| ZIP         | .zip         | Built-in     | pip install textxtract        |
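
To make the table easier to apply, here is a small sketch that maps a filename to the install command listed for its extension. The `EXTRAS` mapping is transcribed from the table above; `install_hint` is a hypothetical helper written for illustration and is not part of the package.

```python
from pathlib import Path

# Extension -> optional extra, transcribed from the table above.
# None means the format is built in and needs no extra.
EXTRAS = {
    ".txt": None, ".text": None, ".csv": None, ".json": None, ".zip": None,
    ".md": "md", ".pdf": "pdf", ".docx": "docx", ".doc": "doc",
    ".rtf": "rtf", ".html": "html", ".htm": "html", ".xml": "xml",
}


def install_hint(filename: str) -> str:
    """Return the pip command the table lists for this file's extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTRAS:
        raise ValueError(f"extension not listed in the table: {suffix!r}")
    extra = EXTRAS[suffix]
    if extra is None:
        return "pip install textxtract"
    return f"pip install textxtract[{extra}]"


print(install_hint("report.PDF"))
print(install_hint("notes.txt"))
```

The lookup lowercases the suffix, so `report.PDF` and `report.pdf` resolve to the same extra.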

🔧 Advanced Usage

Error Handling

from textxtract import SyncTextExtractor
from textxtract.exceptions import (
    FileTypeNotSupportedError,
    InvalidFileError,
    ExtractionError
)

extractor = SyncTextExtractor()

try:
    text = extractor.extract("document.pdf")
    print(text)
except FileTypeNotSupportedError:
    print("❌ File type not supported")
except InvalidFileError:
    print("❌ File is invalid or corrupted")
except ExtractionError:
    print("❌ Extraction failed")
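
When many call sites need the same handling, one option is to wrap extraction in a helper that returns either the text or an error message. The sketch below is one possible shape, not an API the package provides: `try_extract` and `fake_extract` are hypothetical names, and `fake_extract` stands in for `extractor.extract` so the example runs without the package installed.

```python
from typing import Callable, Optional, Tuple


def try_extract(
    extract: Callable[[str], str], path: str
) -> Tuple[Optional[str], Optional[str]]:
    """Call extract(path); return (text, None) or (None, error message)."""
    try:
        return extract(path), None
    except Exception as exc:  # narrow this to the library's exceptions in real code
        return None, f"{type(exc).__name__}: {exc}"


def fake_extract(path: str) -> str:
    """Hypothetical stand-in so the sketch runs without the package."""
    if not path.endswith(".txt"):
        raise ValueError("unsupported file type")
    return "hello"


print(try_extract(fake_extract, "a.txt"))
print(try_extract(fake_extract, "a.bin"))
```

In real use, pass `extractor.extract` as the callable and catch the specific exception classes shown above instead of `Exception`.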

Custom Configuration

from textxtract import SyncTextExtractor
from textxtract import ExtractorConfig

# Custom configuration
config = ExtractorConfig(
    encoding="utf-8",
    max_file_size=50 * 1024 * 1024,  # 50MB limit
    logging_level="DEBUG"
)

extractor = SyncTextExtractor(config)
text = extractor.extract("document.pdf")
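
The README does not show how the library enforces `max_file_size`. As a caller-side complement, the following sketch checks a file's size before handing it to an extractor; `within_limit` is a hypothetical helper, and the 50 MB default simply mirrors the configuration above.

```python
import os
import tempfile

MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MB, mirroring the config above


def within_limit(path: str, limit: int = MAX_FILE_SIZE) -> bool:
    """Return True if the file at path is no larger than limit bytes."""
    return os.path.getsize(path) <= limit


# Demonstrate with a small temporary file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1024)
    tmp_path = tmp.name

print(within_limit(tmp_path))             # 1 KiB file against the 50 MB limit
print(within_limit(tmp_path, limit=512))  # same file against a 512-byte limit
os.remove(tmp_path)
```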

Batch Processing

import asyncio
from pathlib import Path
from textxtract import AsyncTextExtractor

async def process_files(file_paths):
    async with AsyncTextExtractor() as extractor:
        # Process files concurrently
        tasks = [extractor.extract(path) for path in file_paths]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

# Process multiple files
files = [Path("doc1.pdf"), Path("doc2.docx"), Path("doc3.txt")]
results = asyncio.run(process_files(files))

for file, result in zip(files, results):
    if isinstance(result, Exception):
        print(f"❌ {file}: {result}")
    else:
        print(f"✅ {file}: {len(result)} characters extracted")
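
The batch example above starts every extraction at once. For large batches it may help to cap how many run concurrently. This is a sketch of one way to do that with `asyncio.Semaphore`; `gather_limited` is a hypothetical helper, and `fake_async_extract` stands in for `extractor.extract` so the example runs on its own.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_limited(
    extract: Callable[[str], Awaitable[str]],
    paths: List[str],
    limit: int = 4,
) -> list:
    """Run extract(path) for every path with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(path: str):
        async with semaphore:
            return await extract(path)

    # return_exceptions=True mirrors the batch example above
    return await asyncio.gather(
        *(worker(p) for p in paths), return_exceptions=True
    )


async def fake_async_extract(path: str) -> str:
    """Hypothetical stand-in for extractor.extract."""
    await asyncio.sleep(0.01)
    if path.endswith(".bad"):
        raise ValueError(f"cannot read {path}")
    return path.upper()


results = asyncio.run(
    gather_limited(fake_async_extract, ["a.txt", "b.bad"], limit=1)
)
print(results)
```

Results come back in input order, with failures returned as exception objects rather than raised.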

Logging Configuration

import logging
from textxtract import SyncTextExtractor

# Enable debug logging
logging.basicConfig(level=logging.DEBUG)

extractor = SyncTextExtractor()
text = extractor.extract("document.pdf")  # Will show detailed logs
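
Setting the root logger to DEBUG also turns on debug output from every other library. A narrower option is to raise the level only for the package's own logger. The sketch below assumes the package logs under a logger named `textxtract`, which the README does not confirm.

```python
import logging

# Keep the root logger quiet so other libraries do not flood the output
logging.basicConfig(level=logging.WARNING)

# Assumption: the package logs under a logger named "textxtract".
# Check the package's source or docs for the actual logger name.
package_logger = logging.getLogger("textxtract")
package_logger.setLevel(logging.DEBUG)

print(package_logger.getEffectiveLevel() == logging.DEBUG)
```

Child loggers such as `textxtract.something` inherit this level unless they set their own.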

🧪 Testing

# Install test dependencies
pip install textxtract[all] pytest pytest-asyncio pytest-cov

# Run tests
pytest

# Run with coverage
pytest --cov=textxtract

🎯 Use Cases

Document Processing

from textxtract import SyncTextExtractor

def process_document(file_path):
    extractor = SyncTextExtractor()
    text = extractor.extract(file_path)
    
    # Process extracted text
    word_count = len(text.split())
    return {
        "file": file_path,
        "text": text,
        "word_count": word_count
    }

Content Analysis

import asyncio
from textxtract import AsyncTextExtractor

async def analyze_content(files):
    async with AsyncTextExtractor() as extractor:
        results = []
        for file in files:
            try:
                text = await extractor.extract(file)
                # Perform analysis
                analysis = {
                    "file": file,
                    "length": len(text),
                    "words": len(text.split()),
                    "contains_email": "@" in text
                }
                results.append(analysis)
            except Exception as e:
                results.append({"file": file, "error": str(e)})
        return results
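
The `"@" in text` check above flags any at-sign, including social handles and prices. If a tighter heuristic is wanted, a regular expression is one option. The pattern below is a deliberately simple assumption, not a full address validator, and `find_emails` is a hypothetical helper.

```python
import re

# Deliberately simple pattern: local part, "@", domain with at least one dot.
# It is a heuristic, not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(text: str) -> list:
    """Return all substrings of text that look like email addresses."""
    return EMAIL_RE.findall(text)


sample = "Contact ops@example.com or follow @handle; price is 3 @ $5."
print(find_emails(sample))
```

In the analysis dictionary, `bool(find_emails(text))` could then replace the plain substring test.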

Data Pipeline Integration

from datetime import datetime
from textxtract import SyncTextExtractor

def extract_and_store(file_path, database):
    extractor = SyncTextExtractor()
    
    try:
        text = extractor.extract(file_path)
        
        # Store in database
        database.store({
            "file_path": str(file_path),
            "content": text,
            "extracted_at": datetime.now(),
            "status": "success"
        })
        
    except Exception as e:
        database.store({
            "file_path": str(file_path),
            "error": str(e),
            "extracted_at": datetime.now(),
            "status": "failed"
        })
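
The `database` object above is left undefined; any object with a `store` method would do. For testing the pipeline without a real database, an in-memory stand-in is enough. In this sketch `InMemoryStore` and `record_result` are hypothetical names, and the `extract` argument is any callable that maps a path to text.

```python
from datetime import datetime, timezone


class InMemoryStore:
    """Hypothetical stand-in for the `database` argument used above."""

    def __init__(self):
        self.records = []

    def store(self, record: dict) -> None:
        self.records.append(record)


def record_result(database, file_path, extract) -> None:
    """Extract one file and store a success or failure record."""
    # Timezone-aware UTC timestamps avoid ambiguity across machines
    base = {
        "file_path": str(file_path),
        "extracted_at": datetime.now(timezone.utc),
    }
    try:
        text = extract(file_path)
    except Exception as exc:
        database.store({**base, "error": str(exc), "status": "failed"})
    else:
        database.store({**base, "content": text, "status": "success"})


db = InMemoryStore()
record_result(db, "notes.txt", lambda p: "hello")
record_result(db, "broken.pdf", lambda p: 1 / 0)
print([r["status"] for r in db.records])
```

Keeping the store call outside the `try` body means a database failure is not mistaken for an extraction failure.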

🔧 Requirements

  • Python 3.9+
  • Optional dependencies for specific file types
  • See Installation Guide for details

🀝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Quick Contribution Setup

# Fork and clone the repo
git clone https://github.com/10XScale-in/textxtract.git
cd textxtract

# Set up development environment
pip install -e .[all]
pip install pytest pytest-asyncio black isort mypy

# Run tests
pytest

# Format code
black textxtract tests
isort textxtract tests

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🌟 Support

🙏 Acknowledgments

  • Thanks to all contributors who have helped improve this project
  • Built with Python and the amazing open-source ecosystem
  • Special thanks to the maintainers of underlying libraries
