Python tools for PDF text extraction, text replacement with font sizing, and page manipulation using PyMuPDF

LiteObject/pdf-tools

PDF Tools

A collection of CLI utilities for working with PDF files.

Tools

PDF to Text (pdf_to_text.py)

A tiny CLI utility to stream large PDF files into plain text without loading the entire file into memory. It wraps pdfminer.six with page-based iteration, configurable LAParams, a friendly CLI spinner, and safe logging so you can batch-process enormous PDFs. When conversion finishes, the CLI prints a summary showing file sizes and elapsed time.

Text Replacement (replace_text.py)

Replace text in PDF files, with support for case-sensitive or case-insensitive search, whole-word matching, regular expressions, and custom font sizing. When no custom font size is given, the tool preserves the original font, size, and style.

Key Features:

  • Simple text replacement with formatting preservation
  • Case-insensitive search option
  • Whole word matching
  • Regular expression support
  • Custom font size specification (--size)
  • Batch processing of multiple occurrences
  • Enhanced debugging with detailed logging

Replace Page (replace_page.py)

A Python utility to replace a page in a PDF file with an image. The image is automatically scaled to fit the dimensions of the page being replaced while maintaining its aspect ratio.

Key Features:

  • Replace any page in a PDF with an image
  • Automatically scales images to match page dimensions
  • Maintains image aspect ratio
  • Preserves all other pages in the PDF
  • Supports various image formats (PNG, JPEG, BMP, etc.)
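
The scaling behaviour described above can be sketched in a few lines of arithmetic. This is a minimal illustration, not the code in replace_page.py: the function name fit_image_to_page is hypothetical, and centering the image on the page is an assumption (the README only states that the image is scaled to fit and keeps its aspect ratio).

```python
def fit_image_to_page(img_w, img_h, page_w, page_h):
    """Return (x0, y0, x1, y1): the largest box with the image's aspect
    ratio that fits inside the page. Centering is an assumption."""
    if min(img_w, img_h, page_w, page_h) <= 0:
        raise ValueError("all dimensions must be positive")
    # The smaller of the two ratios is the largest scale that fits both axes.
    scale = min(page_w / img_w, page_h / img_h)
    new_w, new_h = img_w * scale, img_h * scale
    # Split the leftover space evenly on each side.
    x0 = (page_w - new_w) / 2
    y0 = (page_h - new_h) / 2
    return (x0, y0, x0 + new_w, y0 + new_h)


# A 200x100 image on a 400x400 page is limited by width (scale 2.0).
print(fit_image_to_page(200, 100, 400, 400))  # (0.0, 100.0, 400.0, 300.0)
```

Taking the minimum of the two ratios means the image never overflows the page; the unused space ends up as margins on the other axis.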

Installation

  1. Create or activate your Python virtual environment (the repository already contains .venv/).
  2. Install the requirements:
pip install -r requirements.txt

Usage

PDF to Text

python pdf_to_text.py INPUT_PDF [-o OUTPUT_TXT] [OPTIONS]

Examples

Convert an entire PDF:

python pdf_to_text.py documents/manual.pdf

Extract a page range, overwriting the output file if it already exists:

python pdf_to_text.py big-output.pdf --page-range 50-150 \
    --output extracted.txt --overwrite

Options

  • --page-range: specify START-END to limit the page window (e.g., 10- for page 10 onward)
  • --encoding: control the output text encoding (default utf-8)
  • --char-margin, --line-margin, --word-margin, --boxes-flow, --detect-vertical: customize pdfminer.six layout heuristics
  • --quiet / --log-level: silence informational logging or set the logging level
  • --no-spinner: disable the CLI animation
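
As a rough illustration of how a START-END value for --page-range might be interpreted, here is a small parser. It is a sketch under stated assumptions, not the parsing code in pdf_to_text.py: the function name parse_page_range is hypothetical, and it assumes pages are 1-indexed, both bounds are inclusive, and a missing bound defaults to the first or last page.

```python
def parse_page_range(spec, total_pages):
    """Turn 'START-END' into an inclusive, 1-indexed (start, end) pair.

    Assumptions (not confirmed by the README): bounds are inclusive,
    and an omitted bound means the first or last page.
    """
    start_text, sep, end_text = spec.partition("-")
    if not sep:
        # A bare number selects a single page.
        start = end = int(start_text)
    else:
        start = int(start_text) if start_text.strip() else 1
        end = int(end_text) if end_text.strip() else total_pages
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    # Clamp the upper bound to the length of the document.
    return start, min(end, total_pages)


print(parse_page_range("50-150", 300))  # (50, 150)
print(parse_page_range("10-", 40))      # (10, 40)
```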

Text Replacement

python replace_text.py INPUT_PDF SEARCH_TEXT REPLACE_TEXT [OPTIONS]

Examples

Replace text while preserving formatting:

python replace_text.py document.pdf "old text" "new text"

Replace with custom font size:

python replace_text.py document.pdf "11-21-2020" "11-21-2025" --size 14

Case-insensitive replacement:

python replace_text.py document.pdf "Old Text" "New Text" --ignore-case

Regex replacement (email redaction):

python replace_text.py document.pdf "\b[\w.-]+@[\w.-]+\.\w+\b" "[EMAIL]" --regex

Phone number redaction:

python replace_text.py document.pdf "\b\d{3}-\d{3}-\d{4}\b" "[PHONE]" --regex
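
Before running a regex replacement on a real PDF, it can help to try the pattern on a plain string. The sketch below uses only Python's re module with the two patterns from the examples above; the sample sentence is made up, and how replace_text.py applies the pattern internally is not shown here.

```python
import re

# The same patterns as in the CLI examples above.
EMAIL = re.compile(r"\b[\w.-]+@[\w.-]+\.\w+\b")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

# Made-up sample text for a dry run.
sample = "Contact jane.doe@example.com or call 555-123-4567 (ext. 12)."
redacted = PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", sample))
print(redacted)  # Contact [EMAIL] or call [PHONE] (ext. 12).
```

Note that the phone pattern only matches the NNN-NNN-NNNN layout; a number written as (555) 123-4567 would need a different expression.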

Options

  • --ignore-case (-i): Case-insensitive search
  • --whole-word (-w): Match whole words only
  • --regex (-r): Treat search text as a regular expression
  • --size SIZE: Font size to use for replacement text (preserves original size if not specified)
  • --overwrite: Overwrite the output file if it exists
  • --output (-o): Specify output file path
  • --quiet: Suppress informational logging
  • --log-level: Set logging level (DEBUG, INFO, WARNING, ERROR)
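
One way the --ignore-case, --whole-word, and --regex options could combine into a single search pattern is sketched below. The function build_pattern is hypothetical and only illustrates the idea with Python's re module; it is not taken from replace_text.py.

```python
import re


def build_pattern(search, ignore_case=False, whole_word=False, regex=False):
    """Compile a search pattern from CLI-style options (illustrative only)."""
    # Literal searches are escaped so characters like '.' match themselves.
    body = search if regex else re.escape(search)
    if whole_word:
        # \b anchors the match at word boundaries on both sides.
        body = rf"\b(?:{body})\b"
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(body, flags)


print(build_pattern("cat", whole_word=True).findall("cat concat cat."))
# ['cat', 'cat']
print(build_pattern("old text", ignore_case=True).sub("new text", "Old Text here"))
# new text here
```

Escaping literal input keeps a search for 1.5 from also matching 125, which is the usual reason to keep literal and regex modes separate.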

Limitations:

  • Works best with PDFs that have selectable text
  • Scanned PDFs (images) require OCR preprocessing
  • Complex layouts may not be perfectly preserved
  • Encrypted PDFs require a password

Replace Page

python replace_page.py INPUT_PDF IMAGE_FILE [OPTIONS]

Examples

Replace the first page (cover):

python replace_page.py document.pdf new_cover.png

Replace a specific page:

python replace_page.py report.pdf diagram.jpg --page 3 --output updated_report.pdf

Replace with overwrite:

python replace_page.py document.pdf image.png --page 5 --overwrite

Options

  • -p, --page PAGE: Page number to replace (1-indexed, default: 1)
  • -o, --output OUTPUT: Path to the output PDF file
  • --overwrite: Overwrite the output file if it exists
  • --quiet: Suppress informational logging
  • --log-level LOG_LEVEL: Set logging level (DEBUG, INFO, WARNING, ERROR)

Supported Image Formats: PNG, JPEG, BMP, GIF, TIFF

Testing

Run the CLIs with --help to verify the scripts start without errors:

python pdf_to_text.py --help
python replace_text.py --help
python replace_page.py --help

Run the test suite:

pytest tests/ -v

Requirements

  • Python 3.8+
  • PyMuPDF (fitz) >= 1.23.0
  • PyPDF2 >= 4.0
  • Pillow >= 10.0
  • pdfminer.six

See requirements.txt for complete dependencies.

Notes

  • Always test on a copy of your PDF first
  • Complex PDFs with multiple layers may not work perfectly
  • The tools preserve images and non-text content when possible
  • Text replacement preserves formatting when custom font size is not specified
