Textify (OCR wrapper)
Turn the provided Python wrapper around ocrmypdf into a convenient terminal command named textify that can:
- OCR PDFs (single file or whole directory)
- Optionally combine many PDFs into one output
- Optionally recompress PDFs to reduce file size
- Optionally skip OCR (copy-only) via --no-ocr (useful if you just want to combine or compress)
Note
This page uses Material for MkDocs markdown patterns.
What you’ll get
After installation you’ll be able to run textify in these modes:
- Directory mode (OCR everything into a new folder)
- Directory mode + combine the result into one PDF
- Directory mode + combine + compress (with a level)
- Single file mode (optionally compress)
Prerequisites
Required
- Python 3 (the script uses #!/usr/bin/env python3)
- ocrmypdf (required unless you run with --no-ocr)
- A working shell environment with the command available on your PATH
Optional (enables extra features)
- Combining many PDFs → one: qpdf (preferred), or pdfunite (from Poppler), or gs (Ghostscript)
- Compressing PDFs: gs (preferred for --compress-level), or qpdf (fallback compressor; compression levels are ignored)
Language packs
OCR is done via Tesseract through ocrmypdf. The default language is eng. If you use --lang for other languages (e.g. isl+eng), you may need to install the corresponding Tesseract language data on your system.
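To check in advance whether the language data for a given --lang value is installed, something like the following may help. It is a minimal sketch and not part of the script: it assumes the tesseract binary is on your PATH and that `tesseract --list-langs` prints a header line followed by one language code per line. The helper names parse_langs, missing_langs, and installed_langs are hypothetical.

```python
import subprocess


def parse_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output.

    Assumption: the first non-empty line is a header that starts with
    "List of", and every other non-empty line is one language code.
    """
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    return {ln for ln in lines if not ln.lower().startswith("list of")}


def missing_langs(requested: str, installed: set[str]) -> list[str]:
    """Return the codes from a `--lang` value (e.g. "isl+eng") that are not installed."""
    return [code for code in requested.split("+") if code not in installed]


def installed_langs() -> set[str]:
    """Ask Tesseract which languages it can see (requires tesseract on PATH)."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Read both streams in case a build prints the list to stderr.
    return parse_langs(result.stdout + result.stderr)


# Example use (not executed here):
#   print(missing_langs("isl+eng", installed_langs()))
```

An empty list from missing_langs suggests the requested languages are available; anything else names the language data you would still need to install.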
Install dependencies
- Install ocrmypdf using your package manager.
- Install at least one combine tool (qpdf, pdfunite, or gs) if you want -c/--combine.
- Install ghostscript or qpdf if you want -z/--compress.
What happens when tools are missing?
- If you run OCR (no --no-ocr) and ocrmypdf is missing, textify exits with 127.
- If you request combine and no combiner is available, the combine step fails with a clear error and textify exits with 1.
- If you request compress and no compressor is available, the compression step fails with a clear error; textify exits with 127 in directory mode and with 1 in single-file mode.
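The tool lookups described above can be approximated before running textify. The following is a minimal sketch that assumes only the tool names listed on this page; the helpers first_available and preflight are hypothetical and not part of the script.

```python
import shutil

# Tool names as listed on this page; order reflects the stated preference.
COMBINERS = ["qpdf", "pdfunite", "gs"]
COMPRESSORS = ["gs", "qpdf"]


def first_available(candidates, which=shutil.which):
    """Return the first candidate found on PATH, or None if none is installed."""
    for name in candidates:
        if which(name):
            return name
    return None


def preflight(which=shutil.which):
    """Report which tool each step would use (hypothetical helper)."""
    return {
        "ocr": first_available(["ocrmypdf"], which),
        "combine": first_available(COMBINERS, which),
        "compress": first_available(COMPRESSORS, which),
    }


if __name__ == "__main__":
    for step, tool in preflight().items():
        print(f"{step:9s} -> {tool or 'missing'}")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default shutil.which is enough.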
1) Create the script
Pick a location you control, e.g. ~/bin, and create it if it doesn’t exist (for example with mkdir -p ~/bin).
Paste the script below and save the file as textify.
Show script (Python) — save as ~/bin/textify
#!/usr/bin/env python3
"""
textify — thin wrapper around ocrmypdf with optional PDF combine and compression
Usage:
Directory mode:
./textify -d [--no-ocr] [-c|--combine] [-z|--compress [--compress-level LEVEL]] [--lang LANG] [-V|--ocr-verbose] SRC_DIR DST_DIR
Single file mode:
./textify [--no-ocr] [-z|--compress [--compress-level LEVEL]] [--lang LANG] [-V|--ocr-verbose] SRC_PDF DST_PDF
"""
import argparse
import os
import shutil
import sys
from pathlib import Path
from subprocess import run, PIPE, STDOUT
# ── Neon ANSI styling ──────────────────────────────────────────────────────────
USE_COLOR = sys.stdout.isatty() and os.environ.get("NO_COLOR") is None


def _paint(code: str, text: str) -> str:
    return f"\033[{code}m{text}\033[0m" if USE_COLOR else text


NEON = {
    "pink": "1;38;5;199",
    "cyan": "1;38;5;51",
    "purple": "1;38;5;135",
    "green": "1;38;5;82",
    "yellow": "1;38;5;226",
    "red": "1;38;5;197",
    "blue": "1;38;5;45",
    "orange": "1;38;5;208",  # iteration counters
    "dim": "2;38;5;244",
}


def neon(text: str, color: str) -> str:
    return _paint(NEON[color], text)


def tag(label: str, color: str) -> str:
    return neon(f"[{label}]", color)


def arrow() -> str:
    return neon(" → ", "blue")


def bullet() -> str:
    return neon("◆", "purple")


# ── Pink vertical gutter bar ───────────────────────────────────────────────────
def pick_bar_char() -> str:
    # Fall back to an ASCII bar if the terminal encoding cannot represent "┃".
    try:
        "┃".encode(sys.stdout.encoding or "utf-8")
        return "┃"
    except Exception:
        return "|"


VBAR_RAW = pick_bar_char()
VBAR = neon(VBAR_RAW, "pink")


def gutter(line: str = "", spaces: int = 2) -> str:
    return f"{VBAR}{' ' * spaces}{line}"


def group_header(left: str) -> str:
    return neon(left, "pink")


def group_footer() -> str:
    return neon("┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━", "pink")
# ── OCR config ─────────────────────────────────────────────────────────────────
OCR_BASE = ["ocrmypdf", "--force-ocr", "--optimize", "0", "--deskew"]


def build_ocr_cmd(ocr_lang: str) -> list[str]:
    cmd = OCR_BASE.copy()
    if ocr_lang:
        cmd += ["-l", ocr_lang]
    return cmd


def which(cmd: str) -> Path | None:
    p = shutil.which(cmd)
    return Path(p) if p else None


def ensure_ocrmypdf():
    if which("ocrmypdf") is None:
        print(neon("error:", "red"), "'ocrmypdf' not found in PATH. Install it via Homebrew.", file=sys.stderr)
        sys.exit(127)


def is_pdf(path: Path) -> bool:
    return path.is_file() and path.suffix.lower() == ".pdf"


def run_ocr_quiet(src: Path, dst: Path, ocr_lang: str) -> tuple[int, list[str] | None]:
    """Run ocrmypdf, capturing output. Return (rc, tail_lines_on_error_or_None)."""
    cmd = build_ocr_cmd(ocr_lang) + [str(src), str(dst)]
    result = run(cmd, stdout=PIPE, stderr=STDOUT, check=False)
    if result.returncode != 0 and result.stdout is not None:
        lines = result.stdout.decode(errors="ignore").splitlines()
        tail = lines[-20:] if lines else []
        return result.returncode, tail
    return result.returncode, None


def run_ocr_verbose(src: Path, dst: Path, ocr_lang: str) -> int:
    """Run ocrmypdf with passthrough output."""
    cmd = build_ocr_cmd(ocr_lang) + [str(src), str(dst)]
    return run(cmd, check=False).returncode


def convert_one(
    src: Path, dst: Path, ocr_lang: str, passthrough_logs: bool, skip_ocr: bool
) -> tuple[int, list[str] | None]:
    if not src.exists():
        return 1, [f"source does not exist: {src}"]
    if not is_pdf(src):
        return 1, [f"source is not a .pdf: {src}"]
    if dst.exists():
        return 1, [f"destination already exists (won't overwrite): {dst}"]
    dst.parent.mkdir(parents=True, exist_ok=True)
    # ── SKIP OCR / COPY MODE ──
    if skip_ocr:
        try:
            shutil.copy2(src, dst)
            return 0, None
        except Exception as e:
            return 1, [f"copy failed: {e}"]
    # ── NORMAL OCR MODE ──
    if passthrough_logs:
        rc = run_ocr_verbose(src, dst, ocr_lang)
        return rc, None
    else:
        return run_ocr_quiet(src, dst, ocr_lang)
# ── Combine PDFs ───────────────────────────────────────────────────────────────
def find_unique_combined_path(out_dir: Path) -> Path:
    base = out_dir / "textified-combined.pdf"
    if not base.exists():
        return base
    i = 1
    while True:
        cand = out_dir / f"textified-combined{i}.pdf"
        if not cand.exists():
            return cand
        i += 1


def run_combine_tool(inputs: list[Path], out_file: Path) -> tuple[int, str]:
    """Core logic to run the best available PDF combine tool."""
    inputs_str = [str(p) for p in inputs]
    if which("qpdf"):
        cmd = ["qpdf", "--empty", "--pages", *inputs_str, "--", str(out_file)]
        return run(cmd, check=False).returncode, "qpdf"
    elif which("pdfunite"):
        cmd = ["pdfunite", *inputs_str, str(out_file)]
        return run(cmd, check=False).returncode, "pdfunite"
    elif which("gs"):
        cmd = ["gs", "-dBATCH", "-dNOPAUSE", "-q", "-sDEVICE=pdfwrite", f"-sOutputFile={out_file}", *inputs_str]
        return run(cmd, check=False).returncode, "ghostscript"
    else:
        return 127, "none"


def combine_files_visually(inputs: list[Path], out_file: Path) -> int:
    """Runs the combine tool with UI feedback."""
    if not inputs:
        print(neon("note:", "yellow"), "nothing to combine.", file=sys.stderr)
        return 0
    rc, tool = run_combine_tool(inputs, out_file)
    if tool == "none":
        print(
            neon("error:", "red")
            + " no PDF combiner found (tried qpdf, pdfunite, gs). Install one via:\n"
            + " brew install qpdf # recommended\n",
            file=sys.stderr,
        )
        return 127
    print(group_header(f"┏━ COMBINE {neon(tool, 'purple')}{arrow()}{neon(out_file.name, 'green')}"))
    if rc == 0:
        print(gutter(f" {tag('OK', 'green')} combined {len(inputs)} files into {out_file.name}", spaces=1))
    else:
        print(gutter(f" {tag('FAIL', 'red')} combining into {out_file} {neon(f'(rc={rc})', 'dim')}", spaces=1))
    print(group_footer(), end="\n\n")
    return rc
# ── Compress PDFs ──────────────────────────────────────────────────────────────
def choose_compressor(level: str):
    """Pick a compressor once, return (tool_label, build_cmd(in_path, out_path))."""
    if which("gs"):
        tool = f"ghostscript/{level}"

        def build_cmd(inp: Path, outp: Path):
            return [
                "gs",
                "-dBATCH",
                "-dNOPAUSE",
                "-dQUIET",
                "-sDEVICE=pdfwrite",
                "-dCompatibilityLevel=1.6",
                f"-dPDFSETTINGS=/{level}",
                f"-sOutputFile={outp}",
                str(inp),
            ]

        return tool, build_cmd
    if which("qpdf"):
        tool = "qpdf"

        def build_cmd(inp: Path, outp: Path):
            return ["qpdf", "--object-streams=generate", "--compress-streams=y", str(inp), str(outp)]

        return tool, build_cmd
    return None, None


def compress_many(paths: list[Path], level: str) -> int:
    paths = [p for p in paths if p and p.exists()]
    if not paths:
        return 0
    tool, build_cmd = choose_compressor(level)
    if tool is None:
        print(neon("error:", "red") + " cannot compress — install ghostscript or qpdf.", file=sys.stderr)
        return 127
    # Built outside the f-string: a backslash inside an f-string expression
    # is a SyntaxError before Python 3.12.
    rule = neon("━━━━━━━━━━━━━━━━━━━━━━━━━━\n┃", "pink")
    print(group_header(f"\n┏━ COMPRESS {neon(tool, 'purple')} {rule}"))
    failures = 0
    total = len(paths)
    for idx, p in enumerate(paths, 1):
        iter_tag = neon(f"[{idx}/{total}]", "orange")
        print(gutter(f"{bullet()} {iter_tag} compressing {neon(p.name, 'cyan')}", spaces=2))
        # Write to a temporary sibling file, then swap it in on success.
        tmp_out = p.with_name(p.stem + ".tmp.pdf")
        cmd = build_cmd(p, tmp_out)
        rc = run(cmd, check=False).returncode
        if rc == 0 and tmp_out.exists():
            try:
                os.replace(tmp_out, p)
                print(gutter(f" {tag('OK', 'green')} compressed {p.name}", spaces=2))
            except OSError as e:
                print(gutter(f" {tag('FAIL', 'red')} could not replace original for {p}: {e}", spaces=2))
                Path(tmp_out).unlink(missing_ok=True)
                failures += 1
        else:
            print(gutter(f" {tag('FAIL', 'red')} compressing {p} {neon(f'(rc={rc})', 'dim')}", spaces=2))
            Path(tmp_out).unlink(missing_ok=True)
            failures += 1
        print(gutter("", spaces=2))
    print(group_footer(), end="\n\n")
    if failures:
        print(tag("SUMMARY", "yellow"), neon(f"compression: {total - failures} ok, {failures} failed", "yellow"), end="\n\n")
        return 1
    else:
        print(tag("SUMMARY", "green"), neon(f"compression: {total} ok", "green"), end="\n\n")
        return 0
# ── Directory & File modes ─────────────────────────────────────────────────────
def handle_directory_mode(
    src_dir: Path,
    dst_dir: Path,
    combine: bool,
    do_compress: bool,
    compress_level: str,
    ocr_lang: str,
    passthrough_logs: bool,
    skip_ocr: bool,
) -> int:
    if not src_dir.exists() or not src_dir.is_dir():
        print(neon("error:", "red"), f"not a directory: {src_dir}", file=sys.stderr)
        return 1
    if src_dir.resolve() == dst_dir.resolve():
        print(neon("error:", "red"), "output directory must be different from input directory.", file=sys.stderr)
        return 1
    dst_dir.mkdir(parents=True, exist_ok=True)
    pdfs = sorted([p for p in src_dir.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"])
    if not pdfs:
        print(neon("note:", "yellow"), f"no .pdf files found in {src_dir}")
        return 0
    # Header rules are built outside the f-strings below: a backslash inside
    # an f-string expression is a SyntaxError before Python 3.12.
    # ──────────────────────────────────────────────────────────────────────────
    # MODE A: COMBINE IS ON (Combine Source -> OCR Combined -> Compress Combined)
    # ──────────────────────────────────────────────────────────────────────────
    if combine:
        # 1. Combine Raw Sources to a temp file
        temp_merged = dst_dir / ".textify_temp_merged_source.pdf"
        rc_comb = combine_files_visually(pdfs, temp_merged)
        if rc_comb != 0 or not temp_merged.exists():
            return 1
        # 2. Convert (OCR or Copy) the single merged file
        final_dst = find_unique_combined_path(dst_dir)
        if skip_ocr:
            rule = neon("━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n┃", "pink")
            header_title = f"┏━ COPY {neon('(NO OCR)', 'dim')} {rule}"
            action_verb = "copying"
        else:
            lang_hint = f" -l {ocr_lang}" if ocr_lang else ""
            rule = neon("━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n┃", "pink")
            header_title = f"┏━ OCR {neon('ocrmypdf', 'purple')}{neon(lang_hint, 'dim')} {rule}"
            action_verb = "converting"
        print(group_header(header_title))
        # Mimic iteration UI for the single big file
        print(gutter(f"{bullet()} {neon('[1/1]', 'orange')} {action_verb} {neon('merged document', 'cyan')}", spaces=2))
        rc_ocr, tail = convert_one(temp_merged, final_dst, ocr_lang, passthrough_logs, skip_ocr)
        # Clean up temp raw file immediately
        try:
            temp_merged.unlink()
        except OSError:
            pass
        if rc_ocr == 0:
            print(gutter(f" {tag('OK', 'green')} {temp_merged.name}{arrow()}{final_dst.name}", spaces=2))
        else:
            print(gutter(f" {tag('FAIL', 'red')} {temp_merged.name}{arrow()}{final_dst.name} {neon(f'(rc={rc_ocr})', 'dim')}", spaces=2))
            if tail:
                print(gutter(neon("── ocrmypdf (last 20 lines) ─────────────────────────────", "dim"), spaces=2))
                for t in tail:
                    print(gutter(neon(t, "dim"), spaces=2))
        print(group_footer(), end="\n\n")
        if rc_ocr != 0:
            return 1
        # 3. Compress the single result
        if do_compress:
            return compress_many([final_dst], compress_level)
        return 0
    # ──────────────────────────────────────────────────────────────────────────
    # MODE B: COMBINE IS OFF (Process Individually)
    # ──────────────────────────────────────────────────────────────────────────
    else:
        if skip_ocr:
            rule = neon("━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n┃", "pink")
            header_title = f"┏━ COPY {neon('(NO OCR)', 'dim')} {rule}"
            action_verb = "copying"
        else:
            lang_hint = f" -l {ocr_lang}" if ocr_lang else ""
            rule = neon("━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n┃", "pink")
            header_title = f"┏━ OCR {neon('ocrmypdf', 'purple')}{neon(lang_hint, 'dim')} {rule}"
            action_verb = "converting"
        print(group_header(header_title))
        failures = 0
        produced: list[Path] = []
        total = len(pdfs)
        for idx, src in enumerate(pdfs, 1):
            iter_tag = neon(f"[{idx}/{total}]", "orange")
            line_left = f"{bullet()} {iter_tag} {action_verb} {neon(src.name, 'cyan')}"
            print(gutter(line_left, spaces=2))
            dst = dst_dir / src.name
            rc, tail = convert_one(src, dst, ocr_lang, passthrough_logs, skip_ocr)
            if rc == 0:
                print(gutter(f" {tag('OK', 'green')} {src.name}{arrow()}{dst.name}", spaces=2))
                produced.append(dst)
            else:
                print(gutter(f" {tag('FAIL', 'red')} {src.name}{arrow()}{dst.name} {neon(f'(rc={rc})', 'dim')}", spaces=2))
                if tail:
                    print(gutter(neon("── ocrmypdf (last 20 lines) ─────────────────────────────", "dim"), spaces=2))
                    for t in tail:
                        print(gutter(neon(t, "dim"), spaces=2))
                failures += 1
            print(gutter("", spaces=2))
        print(group_footer(), end="\n\n")
        print(tag("SUMMARY", "purple"), neon(f"{len(produced)} succeeded, {failures} failed.", "yellow"), end="\n\n")
        if do_compress and produced:
            rc_comp = compress_many(produced, compress_level)
            if rc_comp != 0:
                return rc_comp
        # Report conversion failures even when compression succeeded.
        return 0 if failures == 0 else 1
def handle_file_mode(
    src: Path,
    dst: Path,
    combine: bool,
    do_compress: bool,
    compress_level: str,
    ocr_lang: str,
    passthrough_logs: bool,
    skip_ocr: bool,
) -> int:
    if combine:
        print(neon("note:", "yellow"), "--combine is intended for -d (directory) mode; ignoring in single-file mode.")
    # Single-file: mimic group look with a mini OCR group
    if skip_ocr:
        header_title = f"┏━ COPY {neon('(NO OCR)', 'dim')} {neon('━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━', 'pink')}"
        action_verb = "copying"
    else:
        lang_hint = f" -l {ocr_lang}" if ocr_lang else ""
        header_title = f"\n┏━ OCR {neon('ocrmypdf', 'purple')}{neon(lang_hint, 'dim')} {neon('━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━', 'pink')}"
        action_verb = "converting"
    print(group_header(header_title))
    iter_tag = neon("[1/1]", "orange")
    print(gutter(f"{bullet()} {iter_tag} {action_verb} {neon(Path(src).name, 'cyan')}", spaces=2))
    rc, tail = convert_one(src, dst, ocr_lang, passthrough_logs, skip_ocr)
    if rc == 0:
        print(gutter(f" {tag('OK', 'green')} {src.name}{arrow()}{dst.name}", spaces=2))
    else:
        print(gutter(f" {tag('FAIL', 'red')} {src.name}{arrow()}{dst.name} {neon(f'(rc={rc})', 'dim')}", spaces=2))
        if tail:
            print(gutter(neon("── ocrmypdf (last 20 lines) ─────────────────────────────", "dim"), spaces=2))
            for t in tail:
                print(gutter(neon(t, "dim"), spaces=2))
    print(group_footer(), end="\n\n")
    if rc == 0 and do_compress:
        _rc = compress_many([dst], compress_level)
        return 0 if _rc == 0 else 1
    return rc
# ── CLI ────────────────────────────────────────────────────────────────────────
def parse_args():
    p = argparse.ArgumentParser(
        prog="textify", description="Convert PDFs to OCR'd PDFs using ocrmypdf, with optional combine & compression."
    )
    p.add_argument(
        "-d", "--directory", action="store_true", help="Directory mode: treat the two paths as <src_dir> <dst_dir>."
    )
    p.add_argument(
        "--no-ocr", action="store_true", help="Skip the OCR step (copy only). Useful if just compressing/combining."
    )
    p.add_argument(
        "-c",
        "--combine",
        action="store_true",
        help="Combine sources FIRST, then OCR/Compress only the single combined file (only in -d mode).",
    )
    p.add_argument(
        "-z",
        "--compress",
        action="store_true",
        help="After conversion, recompress PDFs (in place) using Ghostscript/qpdf.",
    )
    p.add_argument(
        "--compress-level",
        choices=["screen", "ebook", "printer", "prepress", "default"],
        default="ebook",
        help="Compression level for Ghostscript (ignored if only qpdf is available). Default: ebook.",
    )
    p.add_argument(
        "--lang",
        "--ocr-lang",
        dest="ocr_lang",
        default="eng",
        help="OCR language(s) for Tesseract via ocrmypdf (-l). Example: --lang isl+eng",
    )
    p.add_argument(
        "-V", "--ocr-verbose", action="store_true", help="Show raw ocrmypdf output (passthrough). Default is quiet."
    )
    p.add_argument("src", help="Source path (pdf or directory).")
    p.add_argument("dst", help="Destination path (new pdf or output directory).")
    return p.parse_args()


def main():
    args = parse_args()
    if not args.no_ocr:
        ensure_ocrmypdf()
    src = Path(args.src)
    dst = Path(args.dst)
    if args.directory:
        code = handle_directory_mode(
            src, dst, args.combine, args.compress, args.compress_level, args.ocr_lang, args.ocr_verbose, args.no_ocr
        )
    else:
        code = handle_file_mode(
            src, dst, args.combine, args.compress, args.compress_level, args.ocr_lang, args.ocr_verbose, args.no_ocr
        )
    sys.exit(code)


if __name__ == "__main__":
    main()
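2) Make it executable
Mark the script as executable, e.g. chmod +x ~/bin/textify.
3) Make it accessible everywhere
Choose one of these options: add ~/bin to your PATH, create a symlink to the script in a directory that is already on your PATH, or define a shell alias.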
Warning
If you use an alias, it exists only in interactive shells. Prefer PATH or a symlink for scripts and non-interactive contexts.
4) Verify the installation
Run textify --help to print the help text.
Note
Help text can vary slightly by platform and argparse version.
Usage
Directory mode
Convert every *.pdf from a source folder into a destination folder (same filenames), e.g. textify -d SRC_DIR DST_DIR.
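Which files directory mode picks up, and where each result lands, can be sketched as follows. This is an illustration modeled on the script's own filter (top-level files only, case-insensitive .pdf suffix, sorted); the function name plan_directory_run is hypothetical.

```python
from pathlib import Path


def plan_directory_run(src_dir: Path, dst_dir: Path) -> list[tuple[Path, Path]]:
    """Pair each top-level .pdf in src_dir with its same-named destination."""
    sources = sorted(
        p
        for p in src_dir.iterdir()  # not recursive: subfolders are skipped
        if p.is_file() and p.suffix.lower() == ".pdf"  # case-insensitive suffix
    )
    return [(src, dst_dir / src.name) for src in sources]
```

Files in subfolders and files without a .pdf suffix are left out of the plan, which matches the behavior described above.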
Directory mode + combine
Combine the source PDFs first, then OCR (or copy) only the single combined document.
Combine behavior
- The script prefers qpdf, then pdfunite, then ghostscript.
- The combined output is written to the destination directory as textified-combined.pdf, or as textified-combined1.pdf, textified-combined2.pdf, ... if the name already exists.
- A temporary merge file is created in the destination folder during combine (and cleaned up afterward).
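The naming rule for the combined output can be sketched as below. It follows the same idea as the script's find_unique_combined_path; the function name next_combined_name and its stem parameter are hypothetical.

```python
from pathlib import Path


def next_combined_name(out_dir: Path, stem: str = "textified-combined") -> Path:
    """Return the first unused name: stem.pdf, then stem1.pdf, stem2.pdf, ..."""
    candidate = out_dir / f"{stem}.pdf"
    i = 1
    while candidate.exists():
        candidate = out_dir / f"{stem}{i}.pdf"
        i += 1
    return candidate
```

Because the name is chosen before the file is written, two runs started at the same moment against the same folder could in principle pick the same name; running one combine at a time avoids that.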
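Directory mode + combine + compress
Add -z (optionally with --compress-level) to recompress the combined PDF, e.g. textify -d -c -z --compress-level ebook SRC_DIR DST_DIR.
Single file mode
Create an OCR’d PDF at a new path (will not overwrite existing files), e.g. textify SRC_PDF DST_PDF.
Optionally compress the result by adding -z, e.g. textify -z SRC_PDF DST_PDF.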
Note
--combine is intended for -d/--directory mode. In single-file mode, --combine is ignored with a helpful note.
Verbose vs. quiet OCR logs
Default runs are quiet and summarize success/failure; when a file fails, the last 20 lines of ocrmypdf output are shown. To see raw ocrmypdf logs as they happen, add -V (--ocr-verbose).
Tip
Output uses ANSI colors when writing to a TTY. Set NO_COLOR=1 in the environment to disable colorization.
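The color decision amounts to two checks. The sketch below mirrors the rule the script applies (a TTY, and NO_COLOR not set at all); the function names use_color and paint and their parameters are hypothetical, added so the rule can be exercised without a real terminal.

```python
import os
import sys


def use_color(stream=None, env=None) -> bool:
    """Color only when writing to a TTY and NO_COLOR is absent from the environment."""
    stream = sys.stdout if stream is None else stream
    env = os.environ if env is None else env
    return stream.isatty() and env.get("NO_COLOR") is None


def paint(code: str, text: str, enabled: bool) -> str:
    """Wrap text in an ANSI SGR sequence when color is enabled."""
    return f"\033[{code}m{text}\033[0m" if enabled else text
```

Note that under this rule any value of NO_COLOR, including an empty one, turns color off, and piping the output to a file or another program does so as well.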
Copy-only mode (skip OCR)
Use --no-ocr to skip OCR entirely (it will copy PDFs instead). This is useful when you only want to:
- Combine a directory into a single PDF, or
- Compress PDFs, without OCR
Note
When --no-ocr is set, ocrmypdf is not required and is not checked.
CLI options
| Option | Meaning |
|---|---|
| -d, --directory | Directory mode: treat the two paths as <src_dir> <dst_dir> |
| --no-ocr | Skip OCR step (copy only) |
| -c, --combine | Directory mode only: combine sources first, then OCR/copy the single combined file |
| -z, --compress | Compress output PDFs in-place after creation |
| --compress-level {screen,ebook,printer,prepress,default} | Ghostscript profile (ignored if only qpdf is available). Default: ebook |
| --lang, --ocr-lang | OCR language(s) for Tesseract via ocrmypdf -l (example: isl+eng) |
| -V, --ocr-verbose | Show raw ocrmypdf output (passthrough) |
Compression levels
When Ghostscript is available, you can pick a target quality via --compress-level. If only qpdf is available, a reasonable stream compression is applied (levels are ignored).
| Level | Typical use |
|---|---|
| screen | Smallest files, lower quality |
| ebook | Balanced (default) |
| printer | Higher quality |
| prepress | Highest quality |
| default | Ghostscript default profile |
Examples
textify -d -z --compress-level screen ~/Scans ~/Scans_text # smallest
textify -d -z --compress-level printer ~/Scans ~/Scans_text # higher quality
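To make the link between --compress-level and Ghostscript concrete, the sketch below builds the kind of command line the script passes to gs, where the level ends up in -dPDFSETTINGS. It follows the script's own argument list; the function name ghostscript_cmd and the validation step are hypothetical additions.

```python
from pathlib import Path

LEVELS = ("screen", "ebook", "printer", "prepress", "default")


def ghostscript_cmd(src: Path, dst: Path, level: str = "ebook") -> list[str]:
    """Build a Ghostscript pdfwrite command for the given compression level."""
    if level not in LEVELS:
        raise ValueError(f"unknown compression level: {level}")
    return [
        "gs",
        "-dBATCH",
        "-dNOPAUSE",
        "-dQUIET",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.6",
        f"-dPDFSETTINGS=/{level}",  # the level selects a Ghostscript preset
        f"-sOutputFile={dst}",
        str(src),
    ]
```

The function only builds the argument list; running it (for example with subprocess.run) requires gs to be installed.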
Behavior & conventions
- Destination safety: textify refuses to overwrite an existing destination file (in directory mode, that file is reported as failed).
- Input filtering: Directory mode processes only files with a .pdf suffix (case-insensitive).
- Combine output name: textified-combined.pdf, then textified-combined1.pdf, textified-combined2.pdf, … if needed.
- Return codes:
  - 0: all requested steps completed successfully
  - 1: at least one file failed, or a requested step failed (including a combine step with no combiner installed)
  - 127: a required tool is missing (ocrmypdf, or a compressor in directory mode)
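When textify is called from another program, the return codes above are the main signal to act on. A minimal sketch, assuming textify is installed and on PATH; the names EXIT_MEANINGS, describe_exit, and run_textify are hypothetical.

```python
import subprocess

# Short labels for the return codes documented on this page.
EXIT_MEANINGS = {
    0: "all requested steps completed",
    1: "at least one file or step failed",
    127: "a required tool is missing",
}


def describe_exit(code: int) -> str:
    """Translate a textify exit code into a short human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_textify(*args: str) -> int:
    """Run textify with the given arguments and return its exit code."""
    return subprocess.run(["textify", *args], check=False).returncode


# Example use (not executed here):
#   code = run_textify("-d", "-c", "SRC_DIR", "DST_DIR")
#   print(describe_exit(code))
```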
Why so many combine tools?
The script prefers qpdf, then falls back to pdfunite (Poppler), then gs (Ghostscript).
Install at least one; qpdf yields the most consistent results for combining.
Troubleshooting
“'ocrmypdf' not found in PATH”
- Install it (see prerequisites) and ensure your shell PATH includes your package manager’s bin directory.
- If you only want combine/compress, you can use --no-ocr.
Combine says it can’t find a tool
- Install one of: qpdf, poppler (for pdfunite), or ghostscript.
- Re-run textify -d -c ....
Compress says it can’t compress
- Install ghostscript (best) or qpdf.
- If you only have qpdf, --compress-level is ignored.
No color / weird characters
- Colors are enabled only if writing to a TTY. Set NO_COLOR=1 to disable.
- If your terminal’s encoding can’t represent the box-drawing gutter bar (┃), the script automatically falls back to an ASCII | for it.