Understanding PDF Structure: Why PDF to Markdown Conversion Is Challenging

A technical deep-dive into how PDFs store content internally, why text extraction and reflow are hard, and how modern converters tackle table detection, multi-column layouts, and OCR.

PDF2MD Team
April 3, 2026


Converting a PDF to Markdown sounds like it should be straightforward — extract the text, add some formatting markers, done. In practice, it is one of the harder problems in document processing. The reason lies in how PDFs actually store content, which is fundamentally different from how humans perceive documents.

This article explains the internal structure of PDF files, why text extraction is surprisingly difficult, and how modern converters tackle these challenges.


A Brief History of PDF

Adobe Systems created the Portable Document Format in 1993 to solve a specific problem: documents needed to look identical regardless of the software, hardware, or operating system used to view them. The key word is portable — the format was designed for visual fidelity above all else.

In 2008, PDF became an open standard under ISO 32000-1. The current version, ISO 32000-2 (PDF 2.0), was published in 2017 and updated in 2020. Despite being an open standard, the format’s complexity means that most implementations support only a subset of the full specification.

This history matters because PDF’s design priorities — visual fidelity and portability — are fundamentally at odds with what Markdown needs: semantic structure and logical reading order.


How PDFs Store Content Internally

A PDF file is not a sequence of paragraphs and headings. It is a collection of objects that describe how to draw content on a page. Understanding these objects explains why conversion is hard.

The Object Hierarchy

A PDF file contains these key object types:

  • Catalog — the root object that points to everything else
  • Page Tree — organizes pages in a hierarchy (not necessarily sequential)
  • Page Objects — define individual pages with their dimensions and content references
  • Content Streams — sequences of drawing operators that produce visible content
  • Font Objects — define fonts, including glyph mappings and metrics
  • Image Objects — store raster images (typically JPEG, JPEG 2000, CCITT fax, or Flate-compressed bitmaps); vector graphics are not image objects but are drawn directly by content-stream path operators
  • Annotation Objects — links, form fields, comments

Content Streams: Where the Text Lives

The actual visible content of a page is defined in content streams — sequences of operators that instruct the PDF renderer what to draw and where.

Here is a simplified example of what a content stream looks like internally:

BT                          % Begin Text block
  /F1 12 Tf                 % Set font to F1 (e.g., Helvetica) at 12pt
  100 700 Td                % Move to position (100, 700) on the page
  (Introduction) Tj         % Draw the string "Introduction"
  0 -20 Td                  % Move down 20 units
  /F2 10 Tf                 % Switch to font F2 (e.g., Times Roman) at 10pt
  (This is the first ) Tj   % Draw partial string
  (paragraph of the ) Tj    % Draw next fragment
  (document.) Tj            % Draw final fragment
ET                          % End Text block

Several things are immediately apparent:

  1. Text is positioned absolutely. There are no paragraphs, headings, or sections — just coordinates on a page. The text “Introduction” appears at position (100, 700) not because it is a heading, but because that is where the PDF creator placed it.

  2. Words can be split into fragments. “This is the first paragraph of the document.” is stored as three separate string operations. A converter must reassemble these fragments into coherent text.

  3. Structure is implied by visual cues, not explicit markup. “Introduction” is a heading because it uses a different font (F1 vs F2) and a larger size (12pt vs 10pt). The PDF does not say “this is a heading” — the converter must infer that from the visual properties.
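
As a sketch of that inference, a converter might pick the most common font size on a page as the body size and flag anything noticeably larger as a heading. The fragment shape and the 1.15 ratio below are illustrative assumptions, not any particular library's API:

```javascript
// Hypothetical sketch: classify extracted text fragments as headings or body
// text by comparing each fragment's font size to the dominant (body) size.
function classifyFragments(fragments) {
  // Find the most common font size; assume it is the body text size.
  const counts = new Map();
  for (const f of fragments) {
    counts.set(f.size, (counts.get(f.size) || 0) + 1);
  }
  let bodySize = 0, bestCount = 0;
  for (const [size, count] of counts) {
    if (count > bestCount) { bestCount = count; bodySize = size; }
  }
  // Anything noticeably larger than body text is treated as a heading.
  return fragments.map((f) => ({
    ...f,
    role: f.size > bodySize * 1.15 ? "heading" : "body",
  }));
}

const fragments = [
  { text: "Introduction", size: 12 },
  { text: "This is the first ", size: 10 },
  { text: "paragraph of the ", size: 10 },
  { text: "document.", size: 10 },
];
console.log(classifyFragments(fragments).map((f) => f.role));
// → ["heading", "body", "body", "body"]
```

Real converters weigh more signals (bold weight, whitespace above, numbering), but the core move is the same: infer semantics from visual properties.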

Key PDF Operators

Here are the most important operators that converters must interpret:

Operator   Purpose                                       Example
--------   -------                                       -------
BT / ET    Begin/End text block                          BT ... ET
Tf         Set font and size                             /F1 12 Tf
Td         Move text position (relative)                 0 -14 Td
Tm         Set text matrix (absolute position)           1 0 0 1 72 720 Tm
Tj         Show text string                              (Hello) Tj
TJ         Show text with individual glyph positioning   [(H) 20 (ello)] TJ
Tc         Set character spacing                         0.5 Tc
Tw         Set word spacing                              2.0 Tw
re         Draw rectangle (used in table borders)        100 500 200 20 re
l / m      Line/Move operators (table rules)             100 500 m 300 500 l

The TJ operator is particularly important. It allows micro-positioning of individual characters for kerning:

[(T) -80 (o) 20 (da) -15 (y)] TJ

This draws “Today” but with specific spacing adjustments between characters. A converter must recognize that these fragments form a single word, not separate text elements.
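
A minimal sketch of that reassembly, assuming the PDF convention that TJ adjustments are in thousandths of a text-space unit and that negative values widen the gap: only gaps wider than a threshold become word spaces. The threshold of 200 is an arbitrary illustration; real converters derive it from font metrics:

```javascript
// Interpret a TJ-style array of strings and numeric adjustments.
// Small adjustments are kerning; a large negative value (which widens
// the gap between glyphs) is treated as a word break.
function joinTJ(tjArray, wordGapThreshold = 200) {
  let out = "";
  for (const item of tjArray) {
    if (typeof item === "string") {
      out += item;
    } else if (-item > wordGapThreshold) {
      // Gap is wide enough to be a word space rather than kerning.
      out += " ";
    }
  }
  return out;
}

console.log(joinTJ(["T", -80, "o", 20, "da", -15, "y"])); // → "Today"
console.log(joinTJ(["Hello", -250, "world"]));            // → "Hello world"
```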


Why Text Reflow Is Hard

When you read a PDF, you see paragraphs. When a converter reads a PDF, it sees positioned text fragments scattered across a coordinate plane. Reassembling these fragments into logical paragraphs is called text reflow, and it is genuinely difficult.

The Line Break Problem

Consider a paragraph that wraps across three lines in a PDF:

Position (72, 700):  "The quick brown fox jumps"
Position (72, 686):  "over the lazy dog. This is"
Position (72, 672):  "a sample paragraph."

A converter must determine: are these three separate lines (like an address or poem), or one continuous paragraph that should be joined? The answer depends on context — spacing, indentation, font consistency, and surrounding content.

Heuristics converters use:

  • If the vertical gap between lines matches the font’s leading (line height), they are likely the same paragraph.
  • If the next line starts at the same horizontal position, it is likely a continuation.
  • If the next line is indented, it might be a new paragraph or a nested list item.
  • If the previous line ends near the right margin, the line break is probably wrapping, not intentional.

None of these heuristics are 100% reliable. Every converter makes mistakes here.
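
The heuristics above can be sketched as a simple merge loop. The line objects and the 1.6× leading multiplier here are hypothetical; real converters tune these thresholds per document:

```javascript
// Merge positioned lines into paragraphs: join when the vertical gap looks
// like normal leading and the next line starts at the same left margin.
function reflow(lines) {
  const paragraphs = [];
  let current = lines[0].text;
  for (let i = 1; i < lines.length; i++) {
    const prev = lines[i - 1];
    const next = lines[i];
    const gap = prev.y - next.y; // PDF y coordinates grow upward
    const sameMargin = Math.abs(next.x - prev.x) < 1;
    const normalLeading = gap < prev.size * 1.6;
    if (sameMargin && normalLeading) {
      current += " " + next.text; // continuation of the same paragraph
    } else {
      paragraphs.push(current);   // larger gap: start a new paragraph
      current = next.text;
    }
  }
  paragraphs.push(current);
  return paragraphs;
}

const lines = [
  { x: 72, y: 700, size: 10, text: "The quick brown fox jumps" },
  { x: 72, y: 686, size: 10, text: "over the lazy dog. This is" },
  { x: 72, y: 672, size: 10, text: "a sample paragraph." },
  { x: 72, y: 640, size: 10, text: "A new paragraph starts here." },
];
console.log(reflow(lines));
// → two paragraphs: the three wrapped lines merge, the fourth stands alone
```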

The Word Spacing Problem

PDF text can be stored character by character, word by word, or as arbitrary fragments. The converter must determine word boundaries by analyzing character positions and spacing.

For the TJ operator [(W) -30 (or) 15 (d)], is the adjustment -30 between “W” and “or” kerning or a word space? (The values are thousandths of a text-space unit; negative adjustments widen the gap, positive ones tighten it.) The answer depends on the font’s metrics and the typical character width. Converters must compare the widened gap against a threshold to decide.


Multi-Column Layouts and Reading Order

Multi-column layouts are one of the most common sources of conversion errors.

The Problem

A two-column academic paper might have text positioned like this:

Left column:                Right column:
(72, 700) "Abstract"        (306, 700) "1. Introduction"
(72, 686) "We present..."   (306, 686) "The field of..."
(72, 672) "a novel..."      (306, 672) "natural language..."

If the converter reads text in the order it appears in the content stream (which is not guaranteed to match visual order), or if it simply reads left-to-right, top-to-bottom across the full page width, the output becomes:

Abstract 1. Introduction
We present... The field of...
a novel... natural language...

This interleaving is useless. The converter must detect column boundaries and read each column independently.

How Converters Detect Columns

  1. Spatial clustering — group text fragments by their horizontal position. If there is a clear vertical gap in the middle of the page where no text appears, it likely separates columns.
  2. Font and size analysis — column headers often use consistent formatting.
  3. Line alignment — text lines within a column share the same left margin.
  4. Vertical flow analysis — within a column, text flows downward. Between columns, there is a horizontal jump.
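
The spatial-clustering idea (step 1) can be sketched as a horizontal occupancy profile: divide the page width into narrow bands, mark the bands any text touches, and look for a run of empty bands between occupied ones. The fragment shape, page width, and band width are illustrative assumptions:

```javascript
// Find a vertical whitespace gap that likely separates two columns.
function findColumnGap(fragments, pageWidth, bandWidth = 10) {
  const bands = Math.ceil(pageWidth / bandWidth);
  const occupied = new Array(bands).fill(false);
  for (const f of fragments) {
    const start = Math.floor(f.x / bandWidth);
    const end = Math.floor((f.x + f.width) / bandWidth);
    for (let b = start; b <= end && b < bands; b++) occupied[b] = true;
  }
  // Look for a run of empty bands that starts after occupied text,
  // so the page margins themselves are not mistaken for a column gap.
  for (let b = 1; b < bands - 1; b++) {
    if (!occupied[b] && occupied[b - 1]) {
      let end = b;
      while (end < bands && !occupied[end]) end++;
      if (end < bands) return { from: b * bandWidth, to: end * bandWidth };
    }
  }
  return null; // single-column page
}

// Two columns on a US Letter page (612 pt wide).
const fragments = [
  { x: 72, width: 160 },  // left-column text
  { x: 306, width: 160 }, // right-column text
];
console.log(findColumnGap(fragments, 612)); // → { from: 240, to: 300 }
```

Once the gap is found, the converter assigns every fragment to a column and reads each column top to bottom before moving on.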

Reading Order in Content Streams

A critical complication: the order of drawing operators in a content stream does not necessarily match the visual reading order. A PDF might draw the right column first, then the left column. Or it might draw the header, then the footer, then the body.

The PDF specification does not require any particular ordering of content within a stream. This means converters cannot simply process operators sequentially and expect correct output.


Table Detection: One of the Hardest Problems

Tables are arguably the single hardest element to extract correctly from PDFs. Here is why.

PDFs Do Not Have Tables

There is no “table” object in the PDF specification. What humans perceive as a table is actually a collection of:

  • Text fragments positioned in a grid pattern
  • Optional line segments or rectangles forming visible borders
  • Possible background fills for alternating rows

A converter must reconstruct the table structure from these raw elements.

Detection Approaches

Rule-based detection:

  1. Look for horizontal and vertical line segments that form a grid.
  2. Identify cells as regions bounded by these lines.
  3. Extract text within each cell boundary.
  4. Determine header rows by font differences or position.

Problem: Many tables do not have visible borders. They rely on whitespace alignment alone.
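
As an illustration of steps 1 and 2, a converter can derive a cell grid from the distinct positions of horizontal and vertical rules. This toy version assumes full-width rules and no merged cells, which the failure modes below quickly break:

```javascript
// Infer a cell grid from ruling lines: each adjacent pair of horizontal
// rules bounds a row, each adjacent pair of vertical rules bounds a column.
function gridFromRules(hLines, vLines) {
  const ys = [...new Set(hLines.map((l) => l.y))].sort((a, b) => b - a);
  const xs = [...new Set(vLines.map((l) => l.x))].sort((a, b) => a - b);
  const cells = [];
  for (let r = 0; r < ys.length - 1; r++) {
    for (let c = 0; c < xs.length - 1; c++) {
      cells.push({
        row: r, col: c,
        top: ys[r], bottom: ys[r + 1],
        left: xs[c], right: xs[c + 1],
      });
    }
  }
  return cells;
}

// Three horizontal and three vertical rules enclose a 2x2 grid.
const cells = gridFromRules(
  [{ y: 700 }, { y: 680 }, { y: 660 }],
  [{ x: 72 }, { x: 200 }, { x: 328 }]
);
console.log(cells.length); // → 4
```

Step 3 then assigns each text fragment to the cell whose bounds contain it.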

Whitespace-based detection:

  1. Analyze text positions for grid-like alignment.
  2. Detect columns by finding vertical bands of whitespace.
  3. Detect rows by finding horizontal alignment of text baselines.

Problem: Columns with varying text lengths create irregular whitespace patterns.
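
For borderless tables, a toy version of whitespace-based reconstruction groups cell texts by shared baseline (rows) and shared left edge (columns). It assumes perfectly aligned cells, which real documents rarely provide; production converters must tolerate small misalignments:

```javascript
// Rebuild a table grid from positioned cell texts by exact alignment.
function buildGrid(cells) {
  const ys = [...new Set(cells.map((c) => c.y))].sort((a, b) => b - a); // top to bottom
  const xs = [...new Set(cells.map((c) => c.x))].sort((a, b) => a - b); // left to right
  const grid = ys.map(() => xs.map(() => ""));
  for (const c of cells) {
    grid[ys.indexOf(c.y)][xs.indexOf(c.x)] = c.text;
  }
  return grid;
}

const cells = [
  { x: 72, y: 700, text: "Name" },  { x: 200, y: 700, text: "Value" },
  { x: 72, y: 686, text: "Alpha" }, { x: 200, y: 686, text: "0.95" },
];
console.log(buildGrid(cells));
// → [["Name", "Value"], ["Alpha", "0.95"]]
```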

Machine learning approaches: Modern tools like Marker and MinerU use trained models to detect table regions in the rendered page image, then extract cell content from the identified regions. This is more robust but computationally expensive.

Common Table Conversion Failures

  • Merged cells — a cell spanning two columns confuses column detection algorithms
  • Nested tables — tables within tables are rare in PDFs but devastating for converters
  • Multi-line cells — text wrapping within a cell can be mistaken for multiple rows
  • Header detection — distinguishing header rows from data rows often relies on font weight, which is not always different


Scanned PDFs vs Native PDFs: The OCR Challenge

Native (born-digital) PDFs contain actual text data. Scanned PDFs contain images of pages. The difference in conversion difficulty is enormous.

Native PDF Conversion

Text is directly available in content streams. The challenge is structural — figuring out paragraphs, headings, tables, and reading order from positioned text fragments.

Scanned PDF Conversion

Before any structural analysis can happen, the converter must:

  1. Detect that the page is an image (no text layer present)
  2. Pre-process the image — deskew, denoise, adjust contrast
  3. Run OCR (Optical Character Recognition) to convert pixel patterns to characters
  4. Determine confidence levels for each recognized character
  5. Then perform all the structural analysis that native PDFs require

OCR accuracy depends on:

  • Scan resolution — 300 DPI is the practical minimum for reliable OCR
  • Image quality — faded text, coffee stains, creases all reduce accuracy
  • Font type — standard printed fonts are well-recognized; handwriting and decorative fonts are not
  • Language — English OCR is highly mature; less-common languages may have lower accuracy
  • Character set — mathematical notation, chemical formulas, and mixed-script documents are particularly challenging

Modern OCR engines like Tesseract (open source) and cloud services from Google, AWS, and Azure achieve 95-99% character accuracy on clean printed text. But even 99% accuracy means roughly one error per 100 characters — which is about one error every two lines.


Tagged PDFs and Accessibility Structure

Some PDFs contain a structure tree — a tagged hierarchy that explicitly defines headings, paragraphs, lists, tables, and other semantic elements. This is primarily used for accessibility (screen readers) and is required by standards like PDF/UA.

How Tags Help Conversion

A tagged PDF might contain:

<Document>
  <H1>Introduction</H1>
  <P>This is the first paragraph...</P>
  <H2>Background</H2>
  <P>Previous work has shown...</P>
  <Table>
    <TR><TH>Name</TH><TH>Value</TH></TR>
    <TR><TD>Alpha</TD><TD>0.95</TD></TR>
  </Table>
</Document>

This is exactly the semantic structure a Markdown converter needs. If tags are present and accurate, conversion quality jumps dramatically.
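
When such a tree is available, the mapping to Markdown is close to a lookup. The object shape below is a simplified stand-in; real structure-tree traversal varies by library and handles many more tag types:

```javascript
// Map a simplified structure tree to Markdown by tag name.
function tagsToMarkdown(node) {
  switch (node.tag) {
    case "H1": return "# " + node.text + "\n\n";
    case "H2": return "## " + node.text + "\n\n";
    case "P":  return node.text + "\n\n";
    case "Document":
      return node.children.map((child) => tagsToMarkdown(child)).join("");
    default:   return node.text || "";
  }
}

const doc = {
  tag: "Document",
  children: [
    { tag: "H1", text: "Introduction" },
    { tag: "P", text: "This is the first paragraph..." },
  ],
};
console.log(tagsToMarkdown(doc));
// → "# Introduction" followed by the paragraph text
```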

The Reality

Most PDFs are not tagged. Estimates vary, but tagged PDFs represent a small minority of documents in the wild. Government and academic publications are more likely to be tagged (due to accessibility requirements), while most business documents are not.

Even when tags exist, they may be:

  • Incomplete — some elements tagged, others not
  • Incorrect — auto-generated tags that do not match the actual document structure
  • Outdated — tags from an earlier version of the document that were not updated

Converters that can leverage tags when available — but fall back gracefully when they are absent — produce the most reliable results.


How PDF.js Approaches These Challenges

PDF.js is Mozilla’s open-source PDF rendering library, written in JavaScript. It powers the PDF viewer in Firefox and is used by tools like pdf2md.net for client-side conversion.

PDF.js Text Extraction

PDF.js provides a getTextContent() API that returns text items with position, font, and transform information for each page. This gives converters:

  • The actual text strings
  • X and Y coordinates for each text fragment
  • Font name and size
  • Text direction (important for RTL languages)
  • Character width information
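
A sketch of what building on this API looks like: group items into visual lines by their baseline y, then order each line left to right. The items array below is hard-coded in the shape PDF.js returns (str, transform, width, fontName), so the example runs without loading a PDF; in real use the items come from page.getTextContent(). transform[4] and transform[5] are the x and y of the text origin:

```javascript
// Group PDF.js-style text items into visual lines.
function groupIntoLines(items) {
  const lines = new Map(); // rounded y coordinate -> items on that baseline
  for (const item of items) {
    const y = Math.round(item.transform[5]);
    if (!lines.has(y)) lines.set(y, []);
    lines.get(y).push(item);
  }
  // Sort lines top to bottom (PDF y grows upward), items left to right.
  return [...lines.entries()]
    .sort((a, b) => b[0] - a[0])
    .map(([, lineItems]) =>
      lineItems
        .sort((a, b) => a.transform[4] - b.transform[4])
        .map((i) => i.str)
        .join("")
    );
}

const items = [
  { str: "Introduction", transform: [1, 0, 0, 1, 100, 700], width: 72, fontName: "g_d0_f1" },
  { str: "first ", transform: [1, 0, 0, 1, 130, 680], width: 24, fontName: "g_d0_f2" },
  { str: "This is the ", transform: [1, 0, 0, 1, 72, 680], width: 58, fontName: "g_d0_f2" },
];
console.log(groupIntoLines(items));
// → ["Introduction", "This is the first "]
```

Note that the two fragments on the second baseline arrive out of visual order, exactly the reading-order problem described earlier; sorting by x restores it.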

Advantages of PDF.js for Conversion

  1. Runs in the browser — no server-side processing required, which means user files stay on their device
  2. Handles most PDF features — text extraction, font handling, and image rendering are mature
  3. Active development — Mozilla maintains it continuously alongside Firefox
  4. Cross-platform — works identically on Windows, Mac, Linux, and mobile browsers

Limitations

  1. No built-in structural analysis — PDF.js extracts text content but does not determine paragraph boundaries, heading levels, or table structures. That logic must be built on top of the raw extraction.
  2. OCR not included — for scanned PDFs, an additional OCR library is needed.
  3. Performance — processing very large PDFs (500+ pages) in the browser can be slow and memory-intensive.

Tools like pdf2md.net build conversion logic on top of PDF.js, using the extracted text positions and font information to infer document structure.


The Future: AI-Powered Conversion

Traditional rule-based converters are being supplemented — and in some cases replaced — by AI-powered approaches.

Vision-Language Models

The newest approach uses multimodal AI models that can “see” the rendered PDF page as an image and directly produce structured output. These models understand layout, reading order, and document conventions in ways that rule-based systems cannot.

Advantages of AI Conversion

  • Better handling of ambiguous layouts
  • Improved table detection (models can be trained on millions of table examples)
  • Context-aware decisions about structure (a model can understand that text after “References:” is a bibliography)
  • Ability to handle documents that mix multiple layout conventions

Current Limitations

  • Cost — running large vision models is computationally expensive
  • Latency — AI inference is slower than rule-based extraction
  • Privacy — cloud-based AI models require uploading documents to servers
  • Hallucination — AI models can fabricate text that does not appear in the original PDF
  • Reproducibility — the same document may produce slightly different output on different runs

The Hybrid Future

The most promising approaches combine traditional extraction (fast, reliable, deterministic) with AI post-processing (smart, context-aware, adaptive). Extract the raw text and positions using PDF.js or a similar library, then use AI to determine the correct structure, fix OCR errors, and handle ambiguous layouts.

This hybrid approach gives you the speed and privacy of client-side extraction with the intelligence of AI-powered structural analysis.


Conclusion

PDF to Markdown conversion is challenging because PDFs were designed for visual fidelity, not semantic structure. Every converter must bridge the gap between positioned text fragments on a coordinate plane and the logical document hierarchy that Markdown represents.

Understanding these challenges helps you:

  • Set realistic expectations for conversion output quality
  • Choose the right tool for your specific document types
  • Know where to focus cleanup efforts after conversion
  • Appreciate why some documents convert perfectly while others produce garbled output

The field is advancing rapidly. AI-powered tools are addressing challenges that stumped rule-based systems for decades. But for most documents — especially born-digital PDFs with straightforward layouts — tools like pdf2md.net that build on proven libraries like PDF.js produce excellent results without requiring cloud processing or AI inference costs.

The key takeaway: no converter is perfect because the problem is fundamentally hard. But understanding why it is hard helps you work with the tools more effectively and know when to invest time in manual cleanup versus trying a different converter.

Last updated: April 3, 2026