
Textify (OCR wrapper)

Turn the provided Python wrapper around ocrmypdf into a convenient terminal command named textify that can:

  • OCR PDFs (single file or whole directory)
  • Optionally combine all newly created PDFs into one file
  • Optionally recompress PDFs to reduce size

What you’ll get

After installation you’ll be able to run:

  • Directory mode (OCR everything into a new folder)

    textify -d ~/Scans ~/Scans_text
    

  • Directory mode + combine result into one PDF

    textify -d -c ~/Scans ~/Scans_text
    

  • Directory mode + combine + compress (with a level)

    textify -d -c -z --compress-level ebook ~/Scans ~/Scans_text
    

  • Single file mode (optionally compress)

    textify -z ./input.pdf ./output.pdf
    


Prerequisites

Required

  • macOS with Homebrew (or your package manager of choice)
  • Python 3.10 or newer at minimum (the script's X | None type hints fail on older interpreters)
  • ocrmypdf

Optional (used if found)

  • For combining many PDFs into one: qpdf (preferred), pdfunite (from Poppler), or gs (Ghostscript)
  • For compressing PDFs in place: gs (preferred, supports levels) or qpdf

Install on macOS with Homebrew
# Required
brew install ocrmypdf

# Optional: any of the following will enable extra features
brew install qpdf             # best for combining, fallback compressor
brew install poppler          # gives `pdfunite` as combine fallback
brew install ghostscript      # combine (fallback) + best compression levels

Note

If ocrmypdf is missing, textify exits with 127 and suggests installing it via Homebrew.
If no combiner is available, combine will be skipped with a helpful error.
If no compressor is available, compression will be skipped with a helpful error.
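
The fallback rules in this note can be checked before running anything. The sketch below is illustrative and not part of textify: it relies on shutil.which, the same lookup the script uses, and the names COMBINERS, COMPRESSORS, first_available and report are made up for this example.

```python
import shutil

# Tool names and preference order as described in the prerequisites.
# The constants and function names are hypothetical, for illustration only.
COMBINERS = ["qpdf", "pdfunite", "gs"]
COMPRESSORS = ["gs", "qpdf"]


def first_available(candidates, which=shutil.which):
    """Return the first candidate found on PATH, or None if none is installed."""
    for name in candidates:
        if which(name):
            return name
    return None


def report(which=shutil.which):
    """Summarize which textify features this machine could support."""
    return {
        "ocr": which("ocrmypdf") is not None,
        "combine": first_available(COMBINERS, which),
        "compress": first_available(COMPRESSORS, which),
    }


print(report())
```

The which parameter is injectable only so the logic can be exercised without installing or removing tools.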


1) Create the script

Pick a location you control, e.g. ~/bin (create it if it doesn’t exist).

mkdir -p ~/bin
vim ~/bin/textify

Paste the script below and save the file as textify.

Script (Python), save as ~/bin/textify
#!/usr/bin/env python3
"""
textify — thin wrapper around ocrmypdf with optional PDF combine and compression

Usage:
Directory mode:
    ./textify -d [-c|--combine] [-z|--compress [--compress-level LEVEL]] [-V|--ocr-verbose] SRC_DIR DST_DIR

Single file mode:
    ./textify [-z|--compress [--compress-level LEVEL]] [-V|--ocr-verbose] SRC_PDF DST_PDF
"""

import argparse
import os
import shutil
import sys
from pathlib import Path
from subprocess import run, PIPE, STDOUT

# ── Neon ANSI styling ──────────────────────────────────────────────────────────
USE_COLOR = sys.stdout.isatty() and os.environ.get("NO_COLOR") is None

def _paint(code: str, text: str) -> str:
    return f"\033[{code}m{text}\033[0m" if USE_COLOR else text

NEON = {
    "pink":   "1;38;5;199",
    "cyan":   "1;38;5;51",
    "purple": "1;38;5;135",
    "green":  "1;38;5;82",
    "yellow": "1;38;5;226",
    "red":    "1;38;5;197",
    "blue":   "1;38;5;45",
    "orange": "1;38;5;208",  # iteration counters
    "dim":    "2;38;5;244",
}

def neon(text: str, color: str) -> str:
    return _paint(NEON[color], text)

def tag(label: str, color: str) -> str:
    return neon(f"[{label}]", color)

def arrow() -> str:
    return neon(" → ", "blue")

def bullet() -> str:
    return neon("◆", "purple")

# ── Pink vertical gutter bar ───────────────────────────────────────────────────
def pick_bar_char() -> str:
    try:
        "┃".encode(sys.stdout.encoding or "utf-8")
        return "┃"
    except Exception:
        return "|"

VBAR_RAW = pick_bar_char()
VBAR = neon(VBAR_RAW, "pink")  # make the gutter bar pink like the ━ lines

def gutter(line: str = "", spaces: int = 2) -> str:
    # spaces=2 for OCR/COMPRESS groups, spaces=1 for other groups
    return f"{VBAR}{' ' * spaces}{line}"

def group_header(left: str) -> str:
    return neon(left, "pink")

def group_footer() -> str:
    return neon("┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━", "pink")

# ── OCR config ─────────────────────────────────────────────────────────────────
OCR_CMD = ["ocrmypdf", "--force-ocr", "--optimize", "0", "--deskew"]

def which(cmd: str) -> Path | None:
    p = shutil.which(cmd)
    return Path(p) if p else None

def ensure_ocrmypdf():
    if which("ocrmypdf") is None:
        print(neon("error:", "red"), "'ocrmypdf' not found in PATH. Install it via Homebrew.", file=sys.stderr)
        sys.exit(127)

def is_pdf(path: Path) -> bool:
    return path.is_file() and path.suffix.lower() == ".pdf"

def run_ocr_quiet(src: Path, dst: Path) -> tuple[int, list[str] | None]:
    """Run ocrmypdf, capturing output. Return (rc, tail_lines_on_error_or_None)."""
    cmd = OCR_CMD + [str(src), str(dst)]
    result = run(cmd, stdout=PIPE, stderr=STDOUT, check=False)
    if result.returncode != 0 and result.stdout is not None:
        lines = result.stdout.decode(errors="ignore").splitlines()
        tail = lines[-20:] if lines else []
        return result.returncode, tail
    return result.returncode, None

def run_ocr_verbose(src: Path, dst: Path) -> int:
    """Run ocrmypdf with passthrough output."""
    cmd = OCR_CMD + [str(src), str(dst)]
    return run(cmd, check=False).returncode

def convert_one(src: Path, dst: Path, passthrough_logs: bool) -> tuple[int, list[str] | None]:
    if not src.exists():
        return 1, [f"source does not exist: {src}"]
    if not is_pdf(src):
        return 1, [f"source is not a .pdf: {src}"]
    if dst.exists():
        return 1, [f"destination already exists (won't overwrite): {dst}"]
    dst.parent.mkdir(parents=True, exist_ok=True)
    if passthrough_logs:
        rc = run_ocr_verbose(src, dst)
        return rc, None
    else:
        return run_ocr_quiet(src, dst)

# ── Combine PDFs ───────────────────────────────────────────────────────────────
def find_unique_combined_path(out_dir: Path) -> Path:
    base = out_dir / "textified-combined.pdf"
    if not base.exists():
        return base
    i = 1
    while True:
        cand = out_dir / f"textified-combined{i}.pdf"
        if not cand.exists():
            return cand
        i += 1

def combine_pdfs(inputs: list[Path], out_dir: Path) -> tuple[int, Path | None]:
    inputs = [p for p in inputs if p.exists()]
    if not inputs:
        print(neon("note:", "yellow"), "nothing to combine.", file=sys.stderr)
        return 0, None
    out_file = find_unique_combined_path(out_dir)
    if which("qpdf"):
        cmd = ["qpdf", "--empty", "--pages", *map(str, inputs), "--", str(out_file)]
        tool = "qpdf"
    elif which("pdfunite"):
        cmd = ["pdfunite", *map(str, inputs), str(out_file)]
        tool = "pdfunite"
    elif which("gs"):
        cmd = ["gs", "-dBATCH", "-dNOPAUSE", "-q", "-sDEVICE=pdfwrite", f"-sOutputFile={out_file}", *map(str, inputs)]
        tool = "ghostscript"
    else:
        print(
            neon("error:", "red")
            + " no PDF combiner found (tried qpdf, pdfunite, gs). Install one via:\n"
            + "  brew install qpdf   # recommended\n"
            + "  # or: brew install poppler   # for pdfunite\n"
            + "  # or: brew install ghostscript",
            file=sys.stderr,
        )
        return 127, None

    print(group_header(f"┏━ COMBINE {neon(tool, 'purple')}{arrow()}{neon(out_file.name, 'green')}"))
    rc = run(cmd, check=False).returncode
    if rc == 0:
        print(gutter(f"  {tag('OK', 'green')} wrote {out_file}", spaces=1))
    else:
        print(gutter(f"  {tag('FAIL', 'red')} combining into {out_file} {neon(f'(rc={rc})', 'dim')}", spaces=1))
    print(group_footer(), end="\n\n")
    return (0, out_file) if rc == 0 else (rc, None)

# ── Compress PDFs (single block like OCR) ──────────────────────────────────────
def choose_compressor(level: str):
    """Pick a compressor once, return (tool_label, build_cmd(in_path, out_path))."""
    if which("gs"):
        tool = f"ghostscript/{level}"
        def build_cmd(inp: Path, outp: Path):
            return [
                "gs", "-dBATCH", "-dNOPAUSE", "-dQUIET",
                "-sDEVICE=pdfwrite",
                "-dCompatibilityLevel=1.6",
                f"-dPDFSETTINGS=/{level}",
                f"-sOutputFile={outp}",
                str(inp),
            ]
        return tool, build_cmd
    if which("qpdf"):
        tool = "qpdf"
        def build_cmd(inp: Path, outp: Path):
            return ["qpdf", "--object-streams=generate", "--compress-streams=y", str(inp), str(outp)]
        return tool, build_cmd
    return None, None

def compress_many(paths: list[Path], level: str) -> int:
    paths = [p for p in paths if p and p.exists()]
    if not paths:
        return 0

    tool, build_cmd = choose_compressor(level)
    if tool is None:
        print(neon("error:", "red") + " cannot compress — install ghostscript or qpdf.", file=sys.stderr)
        return 127

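    # One big COMPRESS group header (mirrors OCR's look).
    # The rule is built outside the f-string: a backslash inside an f-string
    # expression is a SyntaxError before Python 3.12.
    rule = neon("━━━━━━━━━━━━━━━━━━━━━━━━━━\n┃", "pink")
    print(group_header(f"\n┏━ COMPRESS {neon(tool, 'purple')} {rule}"))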

    failures = 0
    total = len(paths)
    for idx, p in enumerate(paths, 1):
        iter_tag = neon(f"[{idx}/{total}]", "orange")
        print(gutter(f"{bullet()} {iter_tag} compressing {neon(p.name, 'cyan')}", spaces=2))

        tmp_out = p.with_name(p.stem + ".tmp.pdf")
        cmd = build_cmd(p, tmp_out)
        rc = run(cmd, check=False).returncode

        if rc == 0 and tmp_out.exists():
            try:
                os.replace(tmp_out, p)
                print(gutter(f"  {tag('OK', 'green')} compressed {p}", spaces=2))
            except OSError as e:
                print(gutter(f"  {tag('FAIL', 'red')} could not replace original for {p}: {e}", spaces=2))
                Path(tmp_out).unlink(missing_ok=True)
                failures += 1
        else:
            print(gutter(f"  {tag('FAIL', 'red')} compressing {p} {neon(f'(rc={rc})', 'dim')}", spaces=2))
            Path(tmp_out).unlink(missing_ok=True)
            failures += 1

        print(gutter("", spaces=2))  # blank line between items inside the group

    print(group_footer(), end="\n\n")

    if failures:
        print(tag("SUMMARY", "yellow"), neon(f"compression: {total - failures} ok, {failures} failed", "yellow"), end="\n\n")
        return 1
    else:
        print(tag("SUMMARY", "green"), neon(f"compression: {total} ok", "green"), end="\n\n")
        return 0

# ── Directory & File modes ─────────────────────────────────────────────────────
def handle_directory_mode(src_dir: Path, dst_dir: Path, combine: bool, do_compress: bool, compress_level: str, passthrough_logs: bool) -> int:
    if not src_dir.exists() or not src_dir.is_dir():
        print(neon("error:", "red"), f"not a directory: {src_dir}", file=sys.stderr)
        return 1
    if src_dir.resolve() == dst_dir.resolve():
        print(neon("error:", "red"), "output directory must be different from input directory.", file=sys.stderr)
        return 1

    dst_dir.mkdir(parents=True, exist_ok=True)
    pdfs = sorted([p for p in src_dir.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"])
    if not pdfs:
        print(neon("note:", "yellow"), f"no .pdf files found in {src_dir}")
        return 0

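    # ── OCR GROUP ──
    # The rule is built outside the f-string: a backslash inside an f-string
    # expression is a SyntaxError before Python 3.12.
    rule = neon("━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n┃", "pink")
    print(group_header(f"\n┏━ OCR {neon('ocrmypdf', 'purple')} {rule}"))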
    failures = 0
    produced: list[Path] = []
    total = len(pdfs)
    for idx, src in enumerate(pdfs, 1):
        iter_tag = neon(f"[{idx}/{total}]", "orange")
        line_left = f"{bullet()} {iter_tag} converting {neon(src.name, 'cyan')}"
        print(gutter(line_left, spaces=2))

        dst = dst_dir / src.name
        rc, tail = convert_one(src, dst, passthrough_logs)
        if rc == 0:
            print(gutter(f"  {tag('OK', 'green')} {src}{arrow()}{dst}", spaces=2))
            produced.append(dst)
        else:
            print(gutter(f"  {tag('FAIL', 'red')} {src}{arrow()}{dst} {neon(f'(rc={rc})', 'dim')}", spaces=2))
            if tail:
                print(gutter(neon("── ocrmypdf (last 20 lines) ─────────────────────────────", "dim"), spaces=2))
                for t in tail:
                    print(gutter(neon(t, "dim"), spaces=2))
            failures += 1
        print(gutter("", spaces=2))  # blank line between items inside the group
    print(group_footer(), end="\n\n")

    print(tag("SUMMARY", "purple"), neon(f"{len(produced)} succeeded, {failures} failed.", "yellow"), end="\n\n")

    combined_path: Path | None = None
    rc2 = 0
    if combine:
        produced_sorted = sorted(produced, key=lambda p: p.name.lower())
        rc2, combined_path = combine_pdfs(produced_sorted, dst_dir)

    rc3 = 0
    if do_compress:
        to_compress = produced.copy()
        if combined_path and combined_path.exists() and rc2 == 0:
            to_compress.append(combined_path)
        rc3 = compress_many(to_compress, compress_level)

    return 0 if (failures == 0 and rc2 == 0 and rc3 == 0) else 1

def handle_file_mode(src: Path, dst: Path, combine: bool, do_compress: bool, compress_level: str, passthrough_logs: bool) -> int:
    if combine:
        print(neon("note:", "yellow"), "--combine is intended for -d (directory) mode; ignoring in single-file mode.")
    # Single-file: mimic group look with a mini OCR group
    print(group_header(f"\n┏━ OCR {neon('ocrmypdf', 'purple')} {neon('━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━', 'pink')}"))
    iter_tag = neon("[1/1]", "orange")
    print(gutter(f"{bullet()} {iter_tag} converting {neon(Path(src).name, 'cyan')}", spaces=2))
    rc, tail = convert_one(src, dst, passthrough_logs)
    if rc == 0:
        print(gutter(f"  {tag('OK', 'green')} {src}{arrow()}{dst}", spaces=2))
    else:
        print(gutter(f"  {tag('FAIL', 'red')} {src}{arrow()}{dst} {neon(f'(rc={rc})', 'dim')}", spaces=2))
        if tail:
            print(gutter(neon("── ocrmypdf (last 20 lines) ─────────────────────────────", "dim"), spaces=2))
            for t in tail:
                print(gutter(neon(t, "dim"), spaces=2))
    print(group_footer(), end="\n\n")

    if rc == 0 and do_compress:
        _rc = compress_many([dst], compress_level)
        return 0 if _rc == 0 else 1
    return rc

# ── CLI ────────────────────────────────────────────────────────────────────────
def parse_args():
    p = argparse.ArgumentParser(
        prog="textify",
        description="Convert PDFs to OCR'd PDFs using ocrmypdf, with optional combine & compression."
    )
    p.add_argument("-d", "--directory", action="store_true",
                help="Directory mode: treat the two paths as <src_dir> <dst_dir>.")
    p.add_argument("-c", "--combine", action="store_true",
                help="After directory conversion, combine all new PDFs into a single 'textified-combined*.pdf'.")
    p.add_argument("-z", "--compress", action="store_true",
                help="After conversion (and optional combine), recompress all produced PDFs in place.")
    p.add_argument("--compress-level",
                choices=["screen", "ebook", "printer", "prepress", "default"],
                default="ebook",
                help="Compression level for Ghostscript (ignored if only qpdf is available). Default: ebook.")
    p.add_argument("-V", "--ocr-verbose", action="store_true",
                help="Show raw ocrmypdf output (passthrough). Default is quiet.")
    p.add_argument("src", help="Source path (pdf or directory).")
    p.add_argument("dst", help="Destination path (new pdf or output directory).")
    return p.parse_args()

def main():
    # Parse first so that `textify -h` and usage errors work even without ocrmypdf.
    args = parse_args()
    ensure_ocrmypdf()
    src = Path(args.src)
    dst = Path(args.dst)
    if args.directory:
        code = handle_directory_mode(src, dst, args.combine, args.compress, args.compress_level, args.ocr_verbose)
    else:
        code = handle_file_mode(src, dst, args.combine, args.compress, args.compress_level, args.ocr_verbose)
    sys.exit(code)

if __name__ == "__main__":
    main()

2) Make it executable

chmod u+x ~/bin/textify
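
If you would rather set or verify the permission from Python, the following is a minimal sketch of what chmod u+x does. The helper names are hypothetical; only standard-library calls are used.

```python
import stat
from pathlib import Path


def make_user_executable(path: Path) -> None:
    """Add the owner-execute bit (like `chmod u+x`), leaving other bits unchanged."""
    mode = path.stat().st_mode
    path.chmod(mode | stat.S_IXUSR)


def is_user_executable(path: Path) -> bool:
    """True if the owner-execute bit is set on the file."""
    return bool(path.stat().st_mode & stat.S_IXUSR)
```

For example, is_user_executable(Path.home() / "bin" / "textify") should return True after this step.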

3) Make it accessible everywhere

Choose one of the options below.

PATH

Zsh or Bash

# Add once (run only the line for the shell you use)
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.zshrc   # if you use zsh
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc  # if you use bash
# Reload the file you actually updated
source ~/.zshrc    # or: source ~/.bashrc
# Test
command -v textify

Fish

set -U fish_user_paths $HOME/bin $fish_user_paths
command -v textify

Symlink

sudo ln -s ~/bin/textify /usr/local/bin/textify
command -v textify

Alias

Zsh or Bash

echo 'alias textify="$HOME/bin/textify"' >> ~/.zshrc   # or ~/.bashrc
source ~/.zshrc    # or: source ~/.bashrc
textify -h

Fish

alias textify $HOME/bin/textify
funcsave textify
textify -h

Warning

If you use an alias, it exists only in interactive shells. Prefer PATH or a symlink for scripts and non-interactive contexts.
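
To see why PATH or a symlink is more robust than an alias, it can help to resolve the command the way a non-interactive program would. This is a rough sketch with hypothetical helper names; shutil.which only consults PATH and never sees shell aliases.

```python
import os
import shutil
from pathlib import Path


def dir_on_path(directory: Path, path_value: str) -> bool:
    """True if `directory` is one of the entries in a PATH-style string."""
    target = directory.expanduser().resolve()
    entries = [e for e in path_value.split(os.pathsep) if e]
    return any(Path(e).expanduser().resolve() == target for e in entries)


def resolve_command(name: str, path_value: str):
    """Resolve a command against a PATH-style string; aliases are not consulted."""
    return shutil.which(name, path=path_value)
```

A typical check would be resolve_command("textify", os.environ.get("PATH", "")), which returns None if only an alias was set up.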


4) Verify the installation

Run the command to see the help:

textify -h

Expected (abridged)

usage: textify [-h] [-d] [-c] [-z] [--compress-level {screen,ebook,printer,prepress,default}] [-V] src dst

Convert PDFs to OCR'd PDFs using ocrmypdf, with optional combine & compression.

positional arguments:
  src                   Source path (pdf or directory).
  dst                   Destination path (new pdf or output directory).

options:
  -h, --help            show this help message and exit
  -d, --directory       Directory mode: treat the two paths as <src_dir> <dst_dir>.
  -c, --combine         After directory conversion, combine all new PDFs into a single
                        'textified-combined*.pdf'.
  -z, --compress        After conversion (and optional combine), recompress all produced PDFs in place.
  --compress-level {screen,ebook,printer,prepress,default}
                        Compression level for Ghostscript (ignored if only qpdf is available). Default: ebook.
  -V, --ocr-verbose     Show raw ocrmypdf output (passthrough). Default is quiet.

Note

Help text can vary slightly by platform and argparse version.


Usage

Directory mode

Convert every *.pdf from a source folder into a destination folder (same filenames):

textify -d ~/Scans ~/Scans_text
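
To preview what directory mode would pick up, here is a small sketch of the selection rule the script applies: regular files directly inside the source folder whose suffix is .pdf in any letter case, each mapped to the same filename in the destination. The function names are hypothetical.

```python
from pathlib import Path


def list_pdfs(src_dir: Path) -> list:
    """Regular files directly inside src_dir with a .pdf suffix (case-insensitive)."""
    return sorted(
        p for p in src_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def plan_outputs(src_dir: Path, dst_dir: Path) -> list:
    """Pair each source PDF with its destination path (same filename)."""
    return [(src, dst_dir / src.name) for src in list_pdfs(src_dir)]
```

Subfolders are not descended into, and a folder that merely happens to be named like a PDF is skipped.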

Combine the newly created PDFs into one file (e.g. textified-combined.pdf, auto-incremented if the name exists):

textify -d -c ~/Scans ~/Scans_text
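
The auto-increment naming can be sketched as follows. This mirrors the behavior of the script's find_unique_combined_path, but the function below is a standalone illustration with a hypothetical name and a stem parameter that the script does not have.

```python
from itertools import count
from pathlib import Path


def unique_combined_path(out_dir: Path, stem: str = "textified-combined") -> Path:
    """Return <stem>.pdf, or <stem>1.pdf, <stem>2.pdf, ... if earlier names exist."""
    base = out_dir / f"{stem}.pdf"
    if not base.exists():
        return base
    for i in count(1):
        candidate = out_dir / f"{stem}{i}.pdf"
        if not candidate.exists():
            return candidate
```

Because existing files are never reused, re-running with -c on the same destination folder adds a new combined file instead of replacing the previous one.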

Convert + Combine + Compress in one go:

textify -d -c -z --compress-level ebook ~/Scans ~/Scans_text

Single file mode

Create an OCR’d PDF at a new path (will not overwrite):

textify ./input.pdf ./output.pdf
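
Before OCR starts, the script runs a few checks on the source and destination. The sketch below restates those checks as a standalone function; precheck is a hypothetical name, and the messages are copied from the script.

```python
from pathlib import Path
from typing import Optional


def precheck(src: Path, dst: Path) -> Optional[str]:
    """Return an error message if conversion should not start, else None."""
    if not src.exists():
        return f"source does not exist: {src}"
    if not (src.is_file() and src.suffix.lower() == ".pdf"):
        return f"source is not a .pdf: {src}"
    if dst.exists():
        return f"destination already exists (won't overwrite): {dst}"
    return None
```

If you do want to regenerate an output, delete or rename the existing destination file first.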

Optionally compress the result:

textify -z ./input.pdf ./output.pdf

Verbose vs. quiet

Default runs are quiet and summarize success/failure. To see raw ocrmypdf logs as they happen:

textify -V ./input.pdf ./output.pdf
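
Quiet mode works by capturing the child process output and showing only the last 20 lines when the run fails. A minimal sketch of that pattern follows; run_quiet is a hypothetical name, and a small Python child process stands in for ocrmypdf so the example runs anywhere.

```python
import subprocess
import sys


def run_quiet(cmd, tail_lines=20):
    """Run cmd with stdout and stderr merged and captured.

    Returns (returncode, None) on success, or (returncode, last lines) on failure.
    """
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, check=False
    )
    if result.returncode != 0:
        lines = result.stdout.decode(errors="ignore").splitlines()
        return result.returncode, lines[-tail_lines:]
    return result.returncode, None


# Stand-in for a failing OCR run: prints 30 lines, then exits with status 3.
child = "import sys\nfor i in range(30):\n    print('line', i)\nsys.exit(3)"
rc, tail = run_quiet([sys.executable, "-c", child])
print(rc, len(tail))
```

With -V the script skips the capture and lets ocrmypdf write directly to the terminal.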

Tip

Output uses tasteful ANSI colors when writing to a TTY. Set NO_COLOR=1 to disable colorization.


Compression levels

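When Ghostscript is available, you can pick a target quality via --compress-level; the value is passed to Ghostscript as -dPDFSETTINGS=/LEVEL. If only qpdf is available, the script recompresses streams with qpdf (--object-streams=generate --compress-streams=y) and ignores the level.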

Level      Typical use
screen     Smallest files, lower quality
ebook      Balanced (default)
printer    Higher quality
prepress   Highest quality
default    Ghostscript default profile

Examples

textify -d -z --compress-level screen  ~/Scans  ~/Scans_text   # smallest
textify -d -z --compress-level printer ~/Scans  ~/Scans_text   # higher quality
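
For reference, this is roughly how a level turns into a command line. The argument lists are taken from the script's choose_compressor; the function names and the explicit level validation are additions for this sketch.

```python
from pathlib import Path

LEVELS = ("screen", "ebook", "printer", "prepress", "default")


def ghostscript_cmd(src: Path, dst: Path, level: str = "ebook") -> list:
    """Build the Ghostscript command used when gs is available."""
    if level not in LEVELS:
        raise ValueError(f"unknown compression level: {level}")
    return [
        "gs", "-dBATCH", "-dNOPAUSE", "-dQUIET",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.6",
        f"-dPDFSETTINGS=/{level}",
        f"-sOutputFile={dst}",
        str(src),
    ]


def qpdf_cmd(src: Path, dst: Path) -> list:
    """Build the qpdf fallback command; it takes no level."""
    return ["qpdf", "--object-streams=generate", "--compress-streams=y", str(src), str(dst)]
```

Only the -dPDFSETTINGS value changes between levels; everything else stays the same.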

What combining & compressing looks like

Combine group

┏━ COMBINE qpdf → textified-combined.pdf
┃   [OK] wrote /path/to/Scans_text/textified-combined.pdf
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Compress group (multi-file)

┏━ COMPRESS ghostscript/ebook ━━━━━━━━━━━━━━━━━━━━━━━━━━
┃   ◆ [1/4] compressing FileA.pdf
┃     [OK] compressed /path/to/FileA.pdf

┃   ◆ [2/4] compressing FileB.pdf
┃     [OK] compressed /path/to/FileB.pdf

┃   ◆ [3/4] compressing FileC.pdf
┃     [FAIL] compressing /path/to/FileC.pdf (rc=1)

┃   ◆ [4/4] compressing textified-combined.pdf
┃     [OK] compressed /path/to/textified-combined.pdf
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[SUMMARY] compression: 3 ok, 1 failed
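
A [FAIL] line in the compress group leaves the original file untouched, because the script writes to a temporary sibling file and only swaps it in on success. Here is a minimal sketch of that pattern; rewrite_in_place and the transform callback are hypothetical stand-ins for the script's Ghostscript or qpdf call.

```python
import os
from pathlib import Path
from typing import Callable


def rewrite_in_place(path: Path, transform: Callable[[Path, Path], bool]) -> bool:
    """Write a new version next to `path`, then replace the original on success.

    `transform(src, tmp)` must write its result to `tmp` and return True if it worked.
    """
    tmp = path.with_name(path.stem + ".tmp.pdf")
    try:
        ok = transform(path, tmp)
        if ok and tmp.exists():
            os.replace(tmp, path)  # replaces the original in a single rename step
            return True
        return False
    finally:
        # Remove any leftover temporary file (no-op after a successful replace).
        tmp.unlink(missing_ok=True)
```

The temporary file lives in the same folder as the original, so the final rename stays on one filesystem.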

Behavior & conventions

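  • Destination safety: textify never overwrites an existing output file. Single-file mode stops with an error; directory mode marks that file as failed and continues with the rest.
  • Input filtering: Directory mode processes only files directly inside the source folder (not recursive) with a .pdf suffix (case-insensitive).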
  • Combine output name: textified-combined.pdf, then textified-combined1.pdf, textified-combined2.pdf, … if needed.
  • Return codes:
      • 0: all requested steps completed successfully
      • 1: at least one file failed, or a requested step failed (including a combine or compress step skipped because no suitable tool is installed)
      • 127: ocrmypdf is not installed or not on PATH
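
When calling textify from another program, the return codes above can be turned into messages along these lines. This is only a sketch: describe_exit and run_textify are hypothetical helpers, and run_textify assumes textify is already on PATH.

```python
import subprocess


def describe_exit(code: int) -> str:
    """Map a textify exit code to a short human-readable summary."""
    if code == 0:
        return "ok: every requested step succeeded"
    if code == 1:
        return "failed: at least one file or requested step failed"
    if code == 127:
        return "missing tool: a required program was not found on PATH"
    return f"unexpected exit code {code}"


def run_textify(args) -> int:
    """Run textify with the given argument list and return its exit code."""
    return subprocess.run(["textify", *args], check=False).returncode
```

Note that subprocess does not expand ~, so pass absolute paths (for example built with Path.home()) rather than ~/Scans.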

Why so many combine tools?

The script prefers qpdf, then falls back to pdfunite (Poppler), then gs (Ghostscript).
Install at least one; qpdf yields the most consistent results for combining.
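
Each of the three tools takes its inputs and output in a different shape, which is why the script builds a separate command per tool. The argument lists below are copied from the script's combine_pdfs; the wrapper function combine_cmd is a hypothetical name for this sketch.

```python
from pathlib import Path


def combine_cmd(tool: str, inputs: list, out_file: Path) -> list:
    """Build the combine command for one of the supported tools."""
    files = [str(p) for p in inputs]
    if tool == "qpdf":
        # qpdf starts from an empty document and appends pages from each input.
        return ["qpdf", "--empty", "--pages", *files, "--", str(out_file)]
    if tool == "pdfunite":
        # pdfunite takes the inputs first and the output path last.
        return ["pdfunite", *files, str(out_file)]
    if tool == "gs":
        # Ghostscript names the output via -sOutputFile and takes inputs last.
        return ["gs", "-dBATCH", "-dNOPAUSE", "-q", "-sDEVICE=pdfwrite",
                f"-sOutputFile={out_file}", *files]
    raise ValueError(f"unknown combiner: {tool}")
```

Inputs are combined in the order given; the script sorts them by lowercase filename first.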


Troubleshooting

'ocrmypdf' not found in PATH

  • Install it (see prerequisites) and ensure your shell PATH includes Homebrew’s bin directory.

Combine or compress step says cannot find tool

  • Install qpdf (combine), poppler (pdfunite alternative), or ghostscript (combine+compress).
  • Re-run textify.

No color / weird characters

  • Colors are enabled only if writing to a TTY. Set NO_COLOR=1 to disable.
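  • If stdout's encoding cannot represent the gutter bar (┃), the script falls back to an ASCII |. Other box-drawing characters and symbols are not substituted, so a UTF-8 terminal is still needed for clean output.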
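
The two decisions behind these symptoms are small enough to test in isolation. The sketch below restates the script's logic with the inputs passed in explicitly; the function names and parameters are hypothetical, chosen so the behavior can be checked without a real terminal.

```python
def use_color(is_tty: bool, env: dict) -> bool:
    """Colors are on only for a TTY and only when NO_COLOR is not set at all."""
    return is_tty and env.get("NO_COLOR") is None


def pick_bar(encoding) -> str:
    """Return the heavy vertical bar if the encoding can represent it, else '|'."""
    try:
        "┃".encode(encoding or "utf-8")
        return "┃"
    except (UnicodeEncodeError, LookupError):
        return "|"


def paint(code: str, text: str, enabled: bool) -> str:
    """Wrap text in an ANSI SGR sequence when colors are enabled."""
    return f"\033[{code}m{text}\033[0m" if enabled else text
```

In the script, the equivalent inputs are sys.stdout.isatty(), os.environ and sys.stdout.encoding.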

Quick reference (cheat sheet)

  • OCR a directory:

    textify -d SRC_DIR DST_DIR
    

  • OCR + combine:

    textify -d -c SRC_DIR DST_DIR
    

  • OCR + combine + compress (default = ebook):

    textify -d -c -z SRC_DIR DST_DIR
    

  • OCR one file (no overwrite):

    textify SRC_PDF DST_PDF
    

  • OCR one file + compress:

    textify -z SRC_PDF DST_PDF
    

  • Show verbose OCR logs:

    textify -V SRC_PDF DST_PDF