Understanding PDF Structure: Why PDF to Markdown Conversion Is Challenging
A technical deep-dive into how PDFs store content internally, why text extraction and reflow are hard, and how modern converters tackle table detection, multi-column layouts, and OCR.


Converting a PDF to Markdown sounds simple — extract the text, add formatting markers, done. In practice, it's one of the harder problems in document processing. The reason lies in how PDFs actually store content, which is fundamentally different from how humans read documents.
Adobe created PDF in 1993 for one purpose: documents should look identical on every device. This design priority — visual fidelity over semantic structure — is fundamentally at odds with what Markdown needs.
How PDFs Store Content
A PDF is not a sequence of paragraphs and headings. It's a collection of drawing instructions that tell a renderer what to place where on a page.
Here's what a content stream looks like internally:
BT                       % Begin text object
/F1 12 Tf                % Set font: Helvetica at 12 pt
100 700 Td               % Move to position (100, 700)
(Introduction) Tj        % Draw "Introduction"
0 -20 Td                 % Move down 20 units
/F2 10 Tf                % Switch to Times Roman at 10 pt
(This is the first ) Tj  % Draw partial string
(paragraph.) Tj          % Draw next fragment
ET                       % End text object
Three things make this hard for converters:
- Text is positioned absolutely. There are no paragraphs or headings — just coordinates. "Introduction" is at position (100, 700), not tagged as a heading.
- Words are split into fragments. "This is the first paragraph." is stored as separate string operations. The converter must reassemble them.
- Structure is inferred from visual cues. "Introduction" uses a larger font — that's the only signal it's a heading. The PDF never says so explicitly.
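To make the reassembly problem concrete, here is a minimal sketch in Python. It assumes fragments have already been extracted from the content stream as (x, y, text) tuples — real extractors get these from a content stream interpreter — and groups fragments that share a baseline, then joins them left to right. The `y_tolerance` value is a hypothetical threshold, not part of any specification.

```python
def reassemble_lines(fragments, y_tolerance=2):
    """Group (x, y, text) fragments sharing a baseline, join left to right."""
    lines = {}
    for x, y, text in fragments:
        # Fragments whose y coordinates differ by less than
        # y_tolerance are treated as sitting on the same line.
        key = round(y / y_tolerance)
        lines.setdefault(key, []).append((x, text))
    result = []
    for key in sorted(lines, reverse=True):  # PDF y grows upward, so top first
        parts = sorted(lines[key])           # left to right by x
        result.append("".join(text for _, text in parts))
    return result

fragments = [
    (100, 700, "Introduction"),
    (100, 680, "This is the first "),
    (190, 680, "paragraph."),
]
print(reassemble_lines(fragments))
# ['Introduction', 'This is the first paragraph.']
```

Even this toy version shows why fragment order matters less than position: the fragments could arrive in any order and the output would be the same.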
Why Text Reflow Is Hard
When you read a PDF, you see paragraphs. A converter sees positioned text fragments on a coordinate plane.
Consider a paragraph wrapping across three lines:
Position (72, 700): "The quick brown fox jumps"
Position (72, 686): "over the lazy dog. This is"
Position (72, 672): "a sample paragraph."
Are these three separate lines (like an address) or one continuous paragraph? The converter uses heuristics: line spacing, indentation, margin alignment, and whether the previous line reaches the right edge. None are 100% reliable.
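A rough version of that heuristic can be sketched as follows. The thresholds (`leading`, `fill_ratio`) are hypothetical values chosen for illustration; real converters tune them per document. A line is merged into the previous one when the vertical gap looks like normal line leading and the previous line nearly reaches the right margin.

```python
def reflow(lines, right_margin, leading=16, fill_ratio=0.85):
    """lines: list of (x, y, text, width) tuples. Returns merged paragraphs."""
    paragraphs = []
    prev = None
    for x, y, text, width in lines:
        if prev is not None:
            px, py, pwidth = prev
            gap = py - y                      # PDF y decreases down the page
            filled = (px + pwidth) / right_margin
            if gap <= leading and filled >= fill_ratio:
                paragraphs[-1] += " " + text  # continuation of same paragraph
                prev = (x, y, width)
                continue
        paragraphs.append(text)
        prev = (x, y, width)
    return paragraphs

lines = [
    (72, 700, "The quick brown fox jumps", 430),
    (72, 686, "over the lazy dog. This is", 428),
    (72, 672, "a sample paragraph.", 180),
]
print(reflow(lines, right_margin=540))
```

Here all three lines merge into one paragraph because each gap is 14 units and the first two lines fill over 85% of the text width. An address block, with short lines, would fail the fill test and stay as separate lines.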
Multi-Column Layouts
Two-column academic papers are a common source of errors. If the converter reads left-to-right across the full page width, it interleaves text from both columns:
Abstract 1. Introduction
We present... The field of...
Converters must detect column boundaries through spatial clustering — finding vertical whitespace gaps that separate columns. The added complication: content stream order doesn't match visual order. A PDF might draw the right column before the left.
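One simple form of that spatial clustering is an x-axis projection: mark which horizontal bins contain text, then look for the widest empty run between the first and last occupied bins. This is an illustrative sketch with hypothetical sample data, not a production algorithm — real converters also handle per-region layouts and pages that mix one- and two-column blocks.

```python
def find_column_gap(fragments, page_width, bins=100):
    """fragments: (x0, x1) horizontal extents. Returns (gap_start, gap_end)."""
    occupied = [False] * bins
    scale = bins / page_width
    for x0, x1 in fragments:
        for b in range(int(x0 * scale), min(int(x1 * scale) + 1, bins)):
            occupied[b] = True
    # Ignore page margins: search only between first and last occupied bins
    lo = occupied.index(True)
    hi = len(occupied) - 1 - occupied[::-1].index(True)
    best, run_start = (0, 0), None
    for i in range(lo, hi + 1):
        if not occupied[i]:
            if run_start is None:
                run_start = i
            if i + 1 - run_start > best[1] - best[0]:
                best = (run_start, i + 1)
        else:
            run_start = None
    return best[0] / scale, best[1] / scale

# Left column spans x 72-280, right column 312-540 on a 612 pt page
fragments = [(72, 280)] * 5 + [(312, 540)] * 5
print(find_column_gap(fragments, page_width=612))
```

Once the gap is found, the converter reads everything left of it top to bottom, then everything right of it — regardless of the order the content stream drew them in.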
Table Detection
Tables are arguably the hardest element to extract. PDF's drawing model has no table primitive (tagged PDFs, discussed below, are a partial exception). What you see as a table is actually:
- Text fragments positioned in a grid pattern
- Optional line segments forming borders
- Possible background fills
The converter must reconstruct table structure from these raw elements. Tables with visible borders convert reasonably well. Borderless tables that rely on whitespace alignment are much harder. Merged cells, multi-line cells, and nested tables are where most converters fail.
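For the easy case — a small grid of aligned fragments — reconstruction can be sketched by clustering x coordinates into columns and y coordinates into rows, then emitting a Markdown table. The sample cells and tolerances below are hypothetical; this sketch would break on exactly the merged and multi-line cells described above.

```python
def fragments_to_markdown(cells, col_tol=10, row_tol=5):
    """cells: (x, y, text) tuples aligned on a grid. Returns a Markdown table."""
    col_keys = sorted({round(x / col_tol) for x, _, _ in cells})
    row_keys = sorted({round(y / row_tol) for _, y, _ in cells}, reverse=True)
    grid = {(round(y / row_tol), round(x / col_tol)): t for x, y, t in cells}
    rows = [[grid.get((r, c), "") for c in col_keys] for r in row_keys]
    header = "| " + " | ".join(rows[0]) + " |"
    sep = "| " + " | ".join("---" for _ in col_keys) + " |"
    body = ["| " + " | ".join(r) + " |" for r in rows[1:]]
    return "\n".join([header, sep] + body)

cells = [
    (72, 700, "Name"), (200, 700, "Qty"),
    (72, 680, "Widget"), (200, 680, "3"),
]
print(fragments_to_markdown(cells))
```

The fragility is visible in the tolerances: a cell whose text starts a few points off the column grid lands in the wrong column, which is why whitespace-aligned borderless tables fail so often.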
Modern tools like Marker and MinerU use machine learning models trained to detect table regions, which is more robust than rule-based approaches but computationally expensive.
Scanned PDFs and OCR
Native PDFs contain actual text data. Scanned PDFs contain images of pages — no text layer at all. Before any structural analysis can happen, the converter must run OCR to convert pixels to characters.
OCR accuracy on clean printed text reaches 95-99%. But even 99% means about one error every two lines. Accuracy drops significantly with low-resolution scans, faded text, unusual fonts, or non-Latin scripts.
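The "one error every two lines" figure follows from simple arithmetic, assuming roughly 60 characters per line (an assumption for illustration, not a standard):

```python
# Back-of-envelope check of the error-rate claim at 99% character accuracy
chars_per_line = 60          # assumed typical line length
accuracy = 0.99
errors_per_line = chars_per_line * (1 - accuracy)  # 0.6 errors per line
lines_per_error = 1 / errors_per_line              # roughly 1.7 lines
print(errors_per_line, lines_per_error)
```

At 95% accuracy the same arithmetic gives three errors per line — enough to make unreviewed OCR output unusable for anything precise.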
Some PDFs contain a structure tree with explicit tags for headings, paragraphs, and tables (primarily for accessibility). When present, these tags dramatically improve conversion quality — but most PDFs in the wild are untagged.
How Modern Converters Work
Tools like pdf2md.net build on libraries such as PDF.js, Mozilla's open-source PDF renderer, which exposes text positions, font information, and transform data for each page. The conversion logic built on top uses this raw data to infer headings (from font size), paragraph boundaries (from spacing), and table structure (from alignment patterns).
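The font-size heuristic for headings can be sketched like this. The multipliers (1.5x and 1.2x the body size) are hypothetical; real converters typically derive thresholds from the document's own font-size distribution rather than fixed ratios.

```python
from collections import Counter

def classify(spans):
    """spans: (font_size, text) tuples. Body size = most common size."""
    body = Counter(size for size, _ in spans).most_common(1)[0][0]
    out = []
    for size, text in spans:
        if size >= body * 1.5:
            out.append("# " + text)    # much larger than body: top heading
        elif size >= body * 1.2:
            out.append("## " + text)   # somewhat larger: subheading
        else:
            out.append(text)           # body text
    return out

spans = [(18, "Introduction"), (12, "This is body text."),
         (14.5, "Background"), (12, "More body text.")]
print(classify(spans))
# ['# Introduction', 'This is body text.', '## Background', 'More body text.']
```

Taking the most common size as "body text" is itself a heuristic — a document that is mostly tables or captions would mislead it.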
The emerging frontier is AI-powered conversion using vision-language models that "see" the rendered page and produce structured output. These handle ambiguous layouts better but are slower, more expensive, and can hallucinate text. The most promising approach is hybrid: traditional extraction for speed and privacy, AI post-processing for structural accuracy.
Conclusion
PDF to Markdown conversion is hard because PDFs were designed for visual fidelity, not semantic structure. Understanding this helps you set realistic expectations, choose the right tool for your document types, and know where to focus cleanup efforts.
No converter is perfect because the problem is fundamentally hard. But for most born-digital PDFs with straightforward layouts, tools like pdf2md.net produce excellent results without cloud processing or AI costs.