PDF · April 2026 · 8 min read

How to Extract Text from a PDF (2026)

PDF was designed for fixed layout, not for text portability. That's why copying text from a PDF into a Word doc or an email so often produces a garbled mess: extra line breaks, missing spaces, jumbled column order, and fi and fl ligatures showing up as question marks. Extracting text cleanly from a PDF is its own small discipline, and the right approach depends on whether your PDF has a text layer, how the original was structured, and what you plan to do with the extracted text. This guide covers both text-based and scanned PDFs.

Derek Giordano

Designer & Developer

In this guide

01 Two Types of PDFs, Two Extraction Paths
02 Method 1: UDT PDF Text Extractor (Free, Browser-Based)
03 Method 2: Acrobat and Other Alternatives
04 Handling Tables, Columns, and Layout
05 Common Pitfalls and Cleanups

⚡ Key Takeaways

Pull clean text out of any PDF, for free, in your browser.
Text-based PDFs can be extracted directly; scanned PDFs need OCR first.
The UDT PDF Text Extractor runs locally in your browser using pdf.js.
Acrobat, Word, Google Docs, and pdftotext are workable alternatives.
Tables and multi-column layouts need structure-aware tools such as Tabula or Camelot.

Two Types of PDFs, Two Extraction Paths

There are two fundamentally different kinds of PDFs when it comes to text extraction. A text-based PDF stores text as characters with positions, fonts, and styles — the same way a Word document does internally. Extracting from this type is fast, lossless, and produces clean output because the text is already there, just wrapped in PDF structure.

A scanned PDF is the opposite: it stores text as images with no underlying character data. Extracting text requires OCR — running the images through a recognition engine that reads pixels and outputs characters. This is slower, imperfect (98–99% accuracy on clean scans, worse on poor ones), and always requires a separate processing step.

The quick test: open your PDF, select a line of text with your cursor, and try to copy it. If clean text appears in the paste, you have a text-based PDF — use direct extraction. If nothing highlights or you get garbage, you have a scanned PDF — use OCR first via the PDF OCR tool, then extract from the OCR'd version.
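The same check can be approximated in code. The sketch below is a heuristic, not a definitive detector: it assumes you already have one extracted string per page (for example from pypdf's `PdfReader(path).pages` and `page.extract_text()`), and the function name and both thresholds are illustrative assumptions.

```python
def looks_text_based(page_texts, min_chars_per_page=20, min_page_ratio=0.5):
    """Guess whether a PDF has a usable text layer.

    page_texts: one extracted string (or None) per page.
    Both thresholds are arbitrary starting points, not established values.
    """
    if not page_texts:
        return False
    # Count pages that yielded a meaningful amount of non-whitespace text.
    pages_with_text = sum(
        1 for text in page_texts
        if len((text or "").strip()) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) >= min_page_ratio
```

If this returns False, the document is probably scanned and should go through OCR before extraction.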

Method 1: UDT PDF Text Extractor (Free, Browser-Based)

The UDT PDF Text Extractor pulls text from any text-based PDF directly in your browser using pdf.js. Select your file (it is processed locally, so nothing leaves your browser), and the tool outputs plain text ready to paste anywhere: email, Word, a terminal, another document. You can choose between preserving the original formatting (line breaks where they appear visually) or collapsing to natural paragraphs (useful for feeding into other processing).
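To illustrate what collapsing to natural paragraphs means, here is a minimal sketch. It is not the tool's implementation; it assumes the extracted text separates paragraphs with blank lines, and `collapse_to_paragraphs` is a hypothetical helper name.

```python
import re

def collapse_to_paragraphs(text):
    """Join visually wrapped lines into paragraphs, keeping blank-line breaks."""
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A blank line (possibly containing whitespace) separates paragraphs.
    blocks = re.split(r"\n\s*\n", text)
    # Inside a paragraph, every run of whitespace becomes a single space.
    paragraphs = [" ".join(block.split()) for block in blocks]
    return "\n\n".join(p for p in paragraphs if p)
```

If the source text has no blank lines between paragraphs, this approach merges everything into one paragraph, so check a sample of the output first.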

The workflow: drop your PDF into the tool, choose output format (plain text, JSON with positional data, or Markdown), and download or copy. For scanned PDFs, the tool will detect the lack of a text layer and route you to OCR first — you can't extract text that isn't there.

One quality-of-life feature worth mentioning: the tool handles column detection automatically for most layouts. A two-column article extracts with the left column fully read before the right column starts, rather than producing the classic "left-right-left-right" zigzag you get from naive extractors.

Method 2: Acrobat and Other Alternatives

Acrobat Reader (free) can export PDF text via File → Export To → Text. Acrobat Pro adds Save As → More Options → Text (Plain) with additional formatting controls. Both handle text-based PDFs well. For scanned PDFs, Acrobat Pro runs OCR automatically during export; Reader doesn't.

Microsoft Word (2013+) can open PDFs directly, converting them into editable documents. Word's conversion is often best-in-class for preserving formatting (fonts, tables, images), though it is overkill if you just need the words. Google Docs has similar PDF import capabilities.

Command-line tools like pdftotext (from Poppler) are the workhorses for automated extraction. pdftotext handles layout preservation well, supports columns, and is what most document-processing pipelines use under the hood. Install via Homebrew on macOS, apt on Linux, or prebuilt binaries on Windows.
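For scripted use, a thin wrapper around pdftotext might look like the sketch below. It assumes pdftotext is installed and on your PATH; the function names and file paths are illustrative. `-layout`, `-f`, and `-l` are pdftotext's options for layout preservation, first page, and last page.

```python
import subprocess

def build_pdftotext_command(pdf_path, txt_path, first_page=None,
                            last_page=None, keep_layout=True):
    """Build the argument list for a pdftotext call."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")            # keep the physical page layout
    if first_page is not None:
        cmd += ["-f", str(first_page)]   # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]    # last page to convert
    cmd += [pdf_path, txt_path]
    return cmd

def run_pdftotext(pdf_path, txt_path, **options):
    """Run pdftotext; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, **options),
                   check=True)
```

For example, `run_pdftotext("report.pdf", "report.txt", first_page=5, last_page=10)` would extract pages 5 through 10 of a hypothetical report.pdf.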

Handling Tables, Columns, and Layout

Tables are the hardest part of text extraction. A PDF table is just text positioned in a grid — there's no underlying "this is a table" structure the way there is in HTML or Word. Extraction tools guess at table structure by looking for aligned text runs, which works for clean tables and fails for complex ones with merged cells, nested columns, or inconsistent spacing.

For critical table extraction (financial data, scientific tables, anything where getting the structure right matters), specialized tools like Tabula or Camelot outperform general text extractors. They detect table boundaries, infer columns from x-coordinate clustering, and output CSV or JSON directly. General extraction tools give you text in reading order; table-specific tools give you rows and columns.
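The x-coordinate clustering idea can be sketched in a few lines. This is a simplified illustration, not how Tabula or Camelot are implemented: it assumes each text fragment arrives as an `(x, y, text)` tuple with y increasing down the page, and the tolerances are arbitrary values you would tune per document.

```python
def cluster_positions(values, tolerance):
    """Group 1-D positions; a gap larger than `tolerance` starts a new cluster."""
    clusters = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= tolerance:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    # Represent each cluster by its mean position.
    return [sum(c) / len(c) for c in clusters]

def nearest_index(centers, value):
    """Index of the cluster centre closest to `value`."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))

def fragments_to_rows(fragments, x_tolerance=10.0, y_tolerance=3.0):
    """Arrange (x, y, text) fragments into a grid of rows and columns."""
    fragments = list(fragments)
    col_centers = cluster_positions([x for x, _, _ in fragments], x_tolerance)
    row_centers = cluster_positions([y for _, y, _ in fragments], y_tolerance)
    grid = [["" for _ in col_centers] for _ in row_centers]
    for x, y, text in fragments:
        r = nearest_index(row_centers, y)
        c = nearest_index(col_centers, x)
        # Fragments that land in the same cell are joined with a space.
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid
```

The resulting grid can be written out with Python's `csv` module. Merged cells and ragged alignment will defeat a sketch this simple, which is exactly where the dedicated tools earn their keep.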

Multi-column layouts (magazines, academic papers, brochures) need column-aware extraction. A naive tool reads left-to-right across all columns on each row, producing nonsense. A better tool detects column boundaries and reads each column top-to-bottom in sequence. If your extracted text looks like interleaved fragments, you're probably using a tool that doesn't do column detection — switch to one that does.
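Column-aware reading order can be sketched as follows. The example assumes each line of text is available as `(x, y, text)`, where x is the line's left edge and y increases down the page; `column_gap` is an assumed threshold, and real layouts (full-width headings, figures spanning columns) need more care than this.

```python
def reading_order(fragments, column_gap=40.0):
    """Return texts column by column, each column read top to bottom."""
    columns = []  # each entry is a list of (x, y, text) in one column
    for frag in sorted(fragments, key=lambda f: f[0]):
        # A jump in left edge larger than column_gap starts a new column.
        if columns and frag[0] - columns[-1][-1][0] <= column_gap:
            columns[-1].append(frag)
        else:
            columns.append([frag])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda f: f[1]))
    return [text for _, _, text in ordered]
```

A naive extractor effectively sorts by y first, which is what produces the interleaved output.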

Common Pitfalls and Cleanups

Ligatures are the most common complaint. PDFs often render "fi" and "fl" as single glyphs to look better typographically, and some extractors output these as question marks, boxes, or nothing at all. Modern extractors (including pdf.js-based tools) handle ligatures correctly; older ones or command-line defaults may not. If your output has missing letter-pairs, check ligature handling in your tool's settings.
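When the extractor does emit the ligature as a Unicode presentation form (for example U+FB01 for "fi"), a small post-processing step can expand it. The sketch below covers only the Latin ligature block U+FB00 to U+FB06; it cannot help if the extractor already replaced the glyph with a question mark or dropped it.

```python
# Latin ligature presentation forms and their plain-letter expansions.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature
    "\ufb06": "st",
}
LIGATURE_TABLE = str.maketrans(LIGATURES)

def expand_ligatures(text):
    """Replace ligature code points with their component letters."""
    return text.translate(LIGATURE_TABLE)
```

`unicodedata.normalize("NFKC", text)` is a broader alternative, but it also rewrites other compatibility characters, so the targeted table is the more conservative choice.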

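Hyphenated line breaks are another trap. When a word such as "however" wraps across two lines, the PDF stores "how-" at the end of one line and "ever" at the start of the next. A careless extractor keeps the hyphen and the line break, producing "how- ever" in your text. Good extractors rejoin these automatically; if yours doesn't, a find-and-replace for a hyphen followed by a line break (or "- " once lines have been joined) fixes most cases, though it can also merge genuinely hyphenated compounds that happened to break at the hyphen.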
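A slightly safer version of that find-and-replace is a regex that only rejoins a hyphen at the end of a line when lowercase letters sit on both sides. This is a sketch under that assumption; it will still merge real compounds that happened to break at their hyphen.

```python
import re

# A lowercase letter, a hyphen, a line break (with optional surrounding
# spaces or tabs), then another lowercase letter.
_HYPHEN_BREAK = re.compile(r"([a-z])-[ \t]*\n[ \t]*([a-z])")

def rejoin_hyphenated(text):
    """Rejoin words that were split by a hyphen at a line break."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)
```

Hyphens in the middle of a line, as in "well-known", are left alone because no line break follows them.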

Header and footer repetition is the third common issue. Page numbers, running headers, and footers appear on every page, and naive extraction includes them in the output — producing "Page 1\nThe quick brown fox jumped...\nPage 2\n...over the lazy dog." Either skip headers and footers in the tool settings or strip them with a regex after extraction.
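As a sketch of the regex approach, the pattern below drops lines that consist only of a page marker such as "Page 2", "3", or "Page 4 of 10". The pattern is an assumption about what your footers look like. It will also remove any body line that is just a number, so review the output if your document has numeric-only lines.

```python
import re

# Matches lines like "Page 2", "3", or "page 4 of 10" and nothing else.
PAGE_NUMBER = re.compile(r"^\s*(?:Page\s+)?\d+(?:\s+of\s+\d+)?\s*$",
                         re.IGNORECASE)

def strip_page_numbers(text):
    """Remove lines that contain only a page-number marker."""
    kept = [line for line in text.split("\n") if not PAGE_NUMBER.match(line)]
    return "\n".join(kept)
```

Running headers with actual text (a chapter title on every page, say) need a different strategy, such as dropping lines that repeat verbatim across most pages.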

Frequently Asked Questions

Why does copy-paste from a PDF produce garbled text?

Several reasons: columns being read left-to-right across the full page instead of one column at a time, ligatures rendering as question marks, hyphenated line breaks splitting words, or the PDF being scanned rather than text-based. A proper extraction tool handles all these correctly; Ctrl+C from the PDF viewer often doesn't.

Can I extract text from a scanned PDF?

Not directly — a scanned PDF has no text layer, just images. Run the PDF through OCR first (which adds a text layer), then extract from the OCR'd version. Some extraction tools chain these steps automatically; others require a separate pass.

Does extraction preserve formatting like bold and italics?

Plain-text extraction strips formatting. If you need bold, italics, or styling preserved, export to Word, HTML, or Markdown instead of plain text. Those formats carry style information where plain text can't.

Can I extract just specific pages?

Yes. Most extraction tools accept a page range (e.g., pages 5–10). If your tool doesn't support ranges, extract the whole document and keep only the sections you need, or use the PDF splitter first to isolate the pages you want.

Is my PDF private when I use a browser-based extractor?

If the tool runs locally (client-side processing with JavaScript or WebAssembly), yes — the file never leaves your browser. Check your browser's network tab during processing: no outbound uploads should occur. Avoid extraction sites that require upload to their server for "processing."

Derek Giordano

Written by the creator of Ultimate Design Tools. BA in Business Marketing.
