How to Convert PDF to Markdown: A Complete Step-by-Step Guide

PDF files are everywhere — research papers, technical documentation, ebooks, invoices, slide decks. But when you need to edit, version-control, or repurpose that content, PDF becomes a painful format to work with. Markdown, on the other hand, is lightweight, human-readable, and works seamlessly with tools like GitHub, Notion, Obsidian, Jekyll, and Hugo.

This guide walks you through converting PDF to Markdown effectively, covering the tools, the edge cases, and the cleanup techniques that actually matter.

Why Convert PDF to Markdown?

Before diving into the how, it helps to understand when and why this conversion makes sense.

Version control. Markdown is plain text. You can track changes in Git, diff two versions, and collaborate through pull requests. PDFs are binary blobs — Git can store them but cannot meaningfully diff them.

Portability. A single .md file works in hundreds of editors and platforms. You can render it as HTML, convert it to DOCX, publish it as a blog post, or import it into a wiki. PDF locks content into a fixed visual layout.

Editing efficiency. Copying text from a PDF often produces broken formatting, missing line breaks, and garbled special characters. Converting to Markdown gives you a clean, editable starting point.

Content repurposing. If you need to turn a PDF report into web documentation, a knowledge base article, or a set of notes, Markdown is the natural intermediate format.

AI and LLM workflows. Large language models work with text, not PDFs. Converting to Markdown preserves document structure (headings, lists, tables) in a format that LLMs can parse and reason about effectively.

Step-by-Step Guide: Converting PDF to Markdown with pdf2md.net

pdf2md.net is a browser-based tool that converts PDF files to Markdown without requiring any software installation. Here is the full process.

Step 1: Open the Tool

Navigate to https://pdf2md.net in any modern browser. The tool runs client-side: the conversion happens in your browser, so your files are not uploaded to a remote server.

Step 2: Upload Your PDF

You have two options:

Drag and drop the PDF file directly onto the upload area.
Click the upload button to open a file picker and select your PDF.

The tool accepts PDF files up to the size limit shown on the page. For very large files (100+ pages), expect the conversion to take a few seconds longer.

Step 3: Configure Conversion Settings

Depending on the tool’s current feature set, you may see options such as:

Page range — Convert specific pages instead of the entire document. Useful for extracting a single chapter from a long report.
Table detection — Enable or adjust how the tool identifies and converts tables.
Image handling — Choose whether to extract embedded images or skip them.
OCR mode — If the PDF contains scanned pages (images of text rather than actual text), OCR (Optical Character Recognition) can extract the content.

If you are unsure which settings to use, start with the defaults. You can always re-run the conversion with different options.

Step 4: Run the Conversion

Click the convert button and wait for processing to complete. The tool will parse the PDF structure, identify headings, paragraphs, lists, tables, and other elements, then generate corresponding Markdown syntax.

Step 5: Review the Output

Before downloading, review the Markdown output in the preview pane. Look for:

Headings that match the original document hierarchy
Tables that are properly formatted with pipes and dashes
Lists that use the correct bullet or numbered syntax
Code blocks that are fenced with triple backticks
Links and references that are intact

Step 6: Download or Copy

Once you are satisfied with the result:

Download the .md file to your local machine.
Copy to clipboard if you want to paste it directly into an editor.

If the PDF contained images, they may be available as a separate download or bundled in a ZIP file alongside the Markdown.

Handling Different PDF Types

Not all PDFs are created equal. The internal structure of a PDF dramatically affects conversion quality.

Text-Based PDFs (Born Digital)

These are PDFs created from word processors, LaTeX, or HTML-to-PDF tools. The text is stored as actual character data with font and position information.

Conversion quality: High. The text is directly extractable, so headings, paragraphs, and lists come through cleanly.

Watch out for:

Multi-column layouts that confuse reading order. The converter may interleave text from left and right columns.
Headers and footers repeating on every page. You will likely need to strip these manually.
Hyphenated words at line breaks (e.g., “docu-\nmentation”). Some converters rejoin these; others leave them split.
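
If your converter leaves hyphenated line breaks in place, a short post-processing pass can rejoin most of them. The sketch below is a minimal example, assuming the output contains a literal hyphen followed by a newline; the function name rejoin_hyphenated is illustrative. It also merges genuine compounds that happened to break at a line end (such as "well-" followed by "known"), so review the result.

```python
import re

# A lowercase letter, a hyphen at the end of a line, then a lowercase letter
# at the start of the next line. Restricting the match to lowercase letters
# skips cases like "Figure 3-" followed by "A".
_HYPHEN_BREAK = re.compile(r"([a-z])-\n([a-z])")

def rejoin_hyphenated(text):
    """Rejoin words that were split with a hyphen at a line break."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)

print(rejoin_hyphenated("See the docu-\nmentation for details."))
# prints: See the documentation for details.
```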

Scanned PDFs (Image-Based)

These are PDFs created by scanning physical documents. Each page is essentially a photograph. There is no text layer — just pixels.

Conversion quality: Depends entirely on OCR accuracy, which depends on scan quality, font clarity, and language.

Tips for better results:

Use a high-resolution scan (300 DPI minimum).
Ensure the document is not skewed or rotated.
For handwritten content, expect significantly lower accuracy.
OCR struggles with mathematical notation, non-Latin scripts, and unusual fonts.

Post-conversion cleanup: OCR output almost always needs manual review. Common errors include confusing l (lowercase L) with 1 (one), O (letter) with 0 (zero), and merging or splitting words incorrectly.
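
Some of these look-alike errors can be narrowed down automatically before the manual review. The following is a rough heuristic sketch, not a complete OCR corrector: it only rewrites tokens that already contain at least one real digit, on the assumption that such tokens were meant to be numbers. The function name and the choice of look-alike letters are assumptions, and it will mangle legitimate identifiers such as "O2".

```python
import re

# Tokens made up only of digits and common look-alike letters, e.g. "2O24".
_NUMERIC_LOOKALIKE = re.compile(r"\b[0-9OolI]+\b")
_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})

def fix_digit_lookalikes(text):
    """Replace O/o with 0 and l/I with 1 inside tokens that look numeric."""
    def repair(match):
        token = match.group(0)
        # Only touch tokens containing a real digit, so ordinary words
        # such as "lol" or "Ill" are left alone.
        if any(ch.isdigit() for ch in token):
            return token.translate(_FIXES)
        return token
    return _NUMERIC_LOOKALIKE.sub(repair, text)

print(fix_digit_lookalikes("Invoice 2O24-l5 totals 1OO items"))
# prints: Invoice 2024-15 totals 100 items
```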

Academic Papers

Academic papers have predictable structure — title, authors, abstract, sections, references — but they also have challenging elements.

What converts well:

Section headings and body text
Numbered references
Simple bullet lists

What typically breaks:

Mathematical equations. LaTeX math in the source PDF becomes garbled text. You will need to reconstruct equations manually or use a specialized tool like Mathpix.
Footnotes and endnotes. The converter may place footnote text inline or at the bottom of the wrong section.
Multi-column layouts. Most academic papers use two columns, which confuses many converters.
Citations. Inline citations like [1] or (Smith, 2023) usually survive, but cross-referencing to the bibliography may not.

Recommendation: For academic papers, consider converting to LaTeX first (using tools like pdf2latex or Mathpix) if you need to preserve equations. Then convert from LaTeX to Markdown if needed.

Tables

Tables are one of the hardest elements to convert from PDF to Markdown.

Why tables break: PDF does not have a concept of a “table” — it is just text positioned at specific coordinates. The converter must infer table structure from visual alignment, which is error-prone.
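
To make the inference problem concrete, here is a toy sketch of position-based table detection. It is not how pdf2md.net or any particular converter works; the function name, the tolerances, and the (x, y, text) input format are all assumptions chosen for illustration. Words are grouped into columns by their left edge and into rows by their top edge.

```python
def infer_table(words, row_tol=3.0, col_tol=30.0):
    """Group (x, y, text) tuples into rows of cells using only positions.

    words is a list of (x, y, text), with x and y the left and top
    coordinates of each word. Returns a list of rows (lists of strings).
    """
    # 1. Cluster left edges into column anchors: a new column starts when
    #    a word begins more than col_tol to the right of the last anchor.
    columns = []
    for x in sorted(w[0] for w in words):
        if not columns or x - columns[-1] > col_tol:
            columns.append(x)

    def column_index(x):
        # Assign a word to the nearest column anchor.
        return min(range(len(columns)), key=lambda i: abs(columns[i] - x))

    # 2. Walk the words top to bottom, starting a new row whenever the
    #    vertical gap to the current row exceeds row_tol.
    rows = []
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if not rows or y - rows[-1][0] > row_tol:
            rows.append((y, [""] * len(columns)))
        cells = rows[-1][1]
        i = column_index(x)
        cells[i] = (cells[i] + " " + text).strip()
    return [cells for _, cells in rows]

words = [
    (10, 100, "Name"), (80, 100, "Age"), (130, 100, "City"),
    (10, 115, "Alice"), (80, 115, "30"), (130, 115, "New"), (152, 115, "York"),
    (10, 130, "Bob"), (80, 130, "25"), (130, 130, "London"),
]
for row in infer_table(words):
    print(row)
```

The fragility is visible in the tolerances: with col_tol set to 15 instead of 30, the word "York" would be treated as the start of a fourth column.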

What to expect:

Simple tables with clear grid lines convert reasonably well.
Tables with merged cells, nested headers, or spanning rows almost always need manual repair.
If the table has more than 6-7 columns, Markdown’s pipe syntax becomes unwieldy. Consider converting wide tables to HTML within your Markdown file.

Manual table repair example:

Broken output:

| Name | Age | City
| --- | --- |
| Alice | 30 | New York |
| Bob 25 | London |

Fixed output:

| Name  | Age | City     |
| ----- | --- | -------- |
| Alice | 30  | New York |
| Bob   | 25  | London   |

The key issues to fix: missing pipe characters, misaligned columns, and merged cell content.
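
Once the cell contents are corrected, the realignment itself is mechanical. The sketch below is one possible way to do it, assuming you have the table as a list of rows; format_table is an illustrative name, not part of any library. It pads short rows, so every line ends up with the same number of pipes. It measures width with len(), so wide characters may not line up, and it does not escape pipe characters inside cells.

```python
def format_table(rows):
    """Render rows (lists of strings) as an aligned Markdown pipe table.

    The first row is treated as the header. Short rows are padded with
    empty cells so every line has the same number of pipes.
    """
    width = max(len(row) for row in rows)
    padded = [list(row) + [""] * (width - len(row)) for row in rows]
    # Column width: the longest cell, but at least 3 for the "---" separator.
    col_widths = [max(3, *(len(r[i]) for r in padded)) for i in range(width)]

    def line(cells):
        body = " | ".join(c.ljust(w) for c, w in zip(cells, col_widths))
        return "| " + body + " |"

    separator = line(["-" * w for w in col_widths])
    return "\n".join([line(padded[0]), separator] + [line(r) for r in padded[1:]])

print(format_table([
    ["Name", "Age", "City"],
    ["Alice", "30", "New York"],
    ["Bob", "25", "London"],
]))
```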

PDFs with Forms and Interactive Elements

PDFs with form fields, checkboxes, dropdowns, or JavaScript often do not convert well. The converter extracts the static text but loses the interactive structure. For form-heavy PDFs, you may need to recreate the form structure manually in Markdown using checkboxes (- [ ]) or description lists.

Common Issues and How to Fix Them

Garbled or Mojibake Text

Symptom: Characters appear as â€™ or Ã© or □□□ instead of proper text.

Cause: Character encoding mismatch. The PDF uses one encoding, but the converter interprets it as another.

Fix:

Check if the PDF is using a non-standard or embedded font with custom encoding.
Try a different converter — some handle encoding issues better than others.
As a last resort, open the converted Markdown file in a text editor that lets you reopen it with a different encoding (like VS Code) and fix the remaining garbled characters manually.

Common substitutions to watch for:

| Garbled | Correct |
| ------- | ------- |
| `â€™` | `’` (right single quote) |
| `â€œ` | `“` (left double quote) |
| `â€”` | `—` (em dash) |
| `Ã©` | `é` |
| `Ã¼` | `ü` |
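
Substitutions like these typically arise when UTF-8 bytes are decoded as Windows-1252, and in that case the damage can often be reversed in one step. The sketch below assumes exactly that mismatch; fix_mojibake is an illustrative name. It works on the whole string at once, so a single character outside Windows-1252 leaves the input unchanged; applying it line by line is usually more forgiving. For harder cases, the third-party ftfy library covers many more encoding mix-ups.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Windows-1252."""
    try:
        # Re-encode to recover the original bytes, then decode them correctly.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text is not (purely) this kind of mojibake; leave it unchanged.
        return text

print(fix_mojibake("cafÃ©"))
# prints: café
```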

Broken Tables

See the Tables section above. The short version: expect to manually fix most tables. Use a Markdown table formatter (many editors have one built in) to realign pipes and dashes after fixing the content.

Lost Formatting

Bold and italic. Some converters detect bold text from font weight metadata and wrap it in **bold**. Others ignore formatting entirely. If your converted text is missing emphasis markers, compare with the original PDF and add them manually.

Headings. If headings come through as regular text, look for patterns: they are usually short standalone lines, often numbered or in title case, that appear in a larger font in the original PDF. Add # markers manually.

Lists. Numbered and bulleted lists may lose their structure and become plain paragraphs. Look for lines that start with numbers or bullet-like characters and reformat them.
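
For bulleted lists, the reformatting can be partly automated when the converter keeps the original bullet glyph at the start of each line. This is a minimal sketch under that assumption; the set of glyphs and the name normalize_bullets are illustrative, and you may need to extend the character class for your documents.

```python
import re

# Bullet glyphs that converters commonly leave at the start of a line.
# Leading spaces or tabs are captured so nested indentation is preserved.
_BULLET = re.compile(r"^([ \t]*)[•◦▪‣·][ \t]*", flags=re.MULTILINE)

def normalize_bullets(text):
    """Replace leading bullet glyphs with the Markdown '- ' list marker."""
    return _BULLET.sub(r"\1- ", text)

print(normalize_bullets("• one\n  ◦ two\nplain"))
```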

Image Handling

PDFs can contain embedded images (raster or vector). How these are handled during conversion varies:

Extracted as separate files. The converter saves images as PNG or JPG and inserts ![alt text](image-path.png) in the Markdown. This is the best-case scenario.
Ignored entirely. The converter skips images. You see gaps in the text where images were.
Described with placeholder text. Some tools insert [Image] or a similar placeholder.

If images are important: Extract them separately using a tool like pdfimages (part of the Poppler utilities) or take screenshots from the PDF viewer. Then manually insert image references in your Markdown:

![Figure 1: Architecture diagram](./images/figure1.png)

Line Break Issues

Symptom: Every line in the PDF becomes a separate line in Markdown, even within the same paragraph.

Cause: PDF stores text with explicit line positions. The converter treats each line as a separate element.

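Fix: Use a regex find-and-replace to merge lines within paragraphs. The pattern to look for is a newline between two non-empty lines where the second line does not start with a heading, list, table, or blockquote marker. Numbered list items are not excluded by this pattern, so check those matches by hand. In an editor with regex search (some editors use \1 instead of $1 in the replacement):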

Find: ([^\n])\n([^\n#\-\*\|>]) (regex)
Replace: $1 $2

This joins consecutive non-special lines with a space. Be careful around code blocks and tables — you do not want to join those lines.
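
A script gives more control over the exceptions than a single regex. The sketch below is one way to do the same merge while skipping fenced code blocks, tables, headings, and list items, including numbered ones. The function name and the exact rules are assumptions, and it does not handle front matter, horizontal rules, indented code blocks, or hard breaks marked with two trailing spaces.

```python
import re

# Lines that open a block element and should never be glued onto the
# previous line: headings, bulleted or numbered list items, table rows,
# and blockquotes.
_BLOCK_START = re.compile(r"^\s*(#{1,6}\s|[-*+]\s|\d+[.)]\s|\||>)")
# Previous lines that nothing should be appended to: headings and table rows.
_NO_APPEND = re.compile(r"^\s*(#{1,6}\s|\|)")

def merge_wrapped_lines(text):
    """Join hard-wrapped paragraph lines, leaving fenced code untouched."""
    out = []
    in_fence = False
    for line in text.split("\n"):
        is_fence = line.lstrip().startswith("```")
        if is_fence:
            in_fence = not in_fence
        joinable = (
            bool(out)
            and not in_fence
            and not is_fence
            and line.strip() != ""
            and out[-1].strip() != ""
            and not out[-1].lstrip().startswith("```")
            and not _BLOCK_START.match(line)
            and not _NO_APPEND.match(out[-1])
        )
        if joinable:
            out[-1] = out[-1].rstrip() + " " + line.strip()
        else:
            out.append(line)
    return "\n".join(out)
```

Because a wrapped continuation line is appended to whatever paragraph or list item precedes it, list items that were broken across lines are repaired as well.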

Headers and Footers Repeating

Symptom: Page numbers, document titles, or section headers appear repeatedly throughout the text.

Fix: Use find-and-replace with the exact repeated text. For page numbers, a regex like \n\d+\n can catch most cases, but review each match to avoid deleting legitimate numbered items.
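
Since every match needs review anyway, it can help to list the candidates first instead of deleting them blindly. The following sketch reports lines that look like page numbers or running headers; the name find_running_lines and the threshold of three repeats are assumptions to adjust for your document.

```python
import re
from collections import Counter

def find_running_lines(text, min_repeats=3):
    """Return (line_number, text) pairs that look like page furniture.

    A line is a suspect if it is a bare number of up to four digits, or
    if the same text appears at least min_repeats times in the document.
    """
    lines = text.split("\n")
    counts = Counter(line.strip() for line in lines if line.strip())
    suspects = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped:
            continue
        is_page_number = re.fullmatch(r"\d{1,4}", stripped) is not None
        is_repeated = (
            counts[stripped] >= min_repeats
            and any(ch.isalnum() for ch in stripped)       # skip rules like ---
            and not stripped.startswith(("|", "```"))      # skip tables, fences
        )
        if is_page_number or is_repeated:
            suspects.append((number, stripped))
    return suspects
```

Reviewing the returned list before deleting anything keeps legitimate numbered items, such as a line containing only a year or a quantity, from being removed by accident.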

Tips for Cleaning Up Converted Markdown

Raw converter output almost always needs some cleanup. Here is a systematic approach.

1. Fix the Document Structure First

Start from the top:

Ensure there is exactly one # (H1) heading — the document title.
Check that the heading hierarchy is logical: H2 for major sections, H3 for subsections, and so on.
Remove any repeated headers/footers from page breaks.
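
The first two checks are easy to script. This is a small sketch, assuming ATX-style headings (lines starting with #); check_headings is an illustrative name. It reports skipped levels and a missing or duplicated H1, and ignores # lines inside fenced code blocks.

```python
import re

_HEADING = re.compile(r"^(#{1,6})\s+\S")

def check_headings(text):
    """Return a list of human-readable problems with the heading structure."""
    problems = []
    h1_count = 0
    previous_level = 0
    in_fence = False
    for number, line in enumerate(text.split("\n"), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        match = None if in_fence else _HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        if level == 1:
            h1_count += 1
        # Going deeper by more than one level at a time skips a level.
        if previous_level and level > previous_level + 1:
            problems.append(f"line {number}: H{previous_level} jumps to H{level}")
        previous_level = level
    if h1_count != 1:
        problems.append(f"expected exactly one H1, found {h1_count}")
    return problems

print(check_headings("# Title\n## Section\n#### Detail"))
# prints: ['line 3: H2 jumps to H4']
```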

2. Fix Paragraphs and Line Breaks

Merge broken paragraphs (see the regex tip above). Then verify that intentional line breaks — like in poetry, addresses, or code — are preserved.

3. Fix Tables

Tables require the most manual work:

Ensure every row has the same number of pipe characters.
Add the separator row (| --- | --- |) after the header row.
Align content so it is readable in raw Markdown (optional but helpful).
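
The row-consistency check can be automated before you start editing. The sketch below is a simple heuristic, assuming tables whose rows begin with a pipe; find_ragged_table_rows is an illustrative name. It compares each row's cell count with the first row of the same table and does not account for escaped pipes or pipes inside code spans.

```python
def find_ragged_table_rows(text):
    """Return (line_number, cell_count, expected_count) for inconsistent rows."""
    problems = []
    expected = None  # cell count of the current table's first row
    for number, line in enumerate(text.split("\n"), start=1):
        stripped = line.strip()
        if not stripped.startswith("|"):
            expected = None  # any non-table line ends the current table
            continue
        # Drop the outer pipes, then count the cells between inner pipes.
        cells = stripped.strip("|").split("|")
        if expected is None:
            expected = len(cells)
        elif len(cells) != expected:
            problems.append((number, len(cells), expected))
    return problems

broken = "| Name | Age | City\n| --- | --- |\n| Alice | 30 | New York |\n| Bob 25 | London |"
print(find_ragged_table_rows(broken))
# prints: [(2, 2, 3), (4, 2, 3)]
```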

4. Fix Lists

Ensure bulleted lists use a consistent marker (-, *, or +).
Ensure numbered lists are sequential.
Check nested list indentation (2 or 4 spaces, depending on your Markdown flavor).

5. Fix Links and References

Convert bare URLs to proper Markdown links: [descriptive text](https://example.com).
Fix broken reference-style links.
Verify that internal document links (like table of contents entries) point to the correct headings.
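
Descriptive link text has to be written by hand, but as an interim step bare URLs can at least be turned into angle-bracket autolinks, which CommonMark renders as clickable links. The sketch below is a rough heuristic with an illustrative function name: it decides whether a URL is already part of a link only from the character immediately before it, it also rewrites URLs inside code, and it cuts off URLs that contain parentheses.

```python
import re

# A URL that is not directly preceded by "(", "<", or "[", which would
# suggest it is already part of a Markdown link or autolink.
_BARE_URL = re.compile(r"(?<![(<\[])https?://[^\s<>()\[\]]+")

def wrap_bare_urls(text):
    """Turn bare URLs into <autolinks>, leaving existing links alone."""
    def wrap(match):
        url = match.group(0)
        # Trailing punctuation is usually sentence punctuation, not URL.
        trimmed = url.rstrip(".,;:!?")
        return f"<{trimmed}>{url[len(trimmed):]}"
    return _BARE_URL.sub(wrap, text)

print(wrap_bare_urls("See https://example.com/docs. Also [docs](https://example.com)."))
# prints: See <https://example.com/docs>. Also [docs](https://example.com).
```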

6. Verify with a Markdown Preview

Open the cleaned file in a Markdown previewer (VS Code, Typora, or an online tool like Dillinger) and compare against the original PDF. Look for:

Missing sections
Formatting that looks wrong
Images that do not render
Tables that are misaligned

7. Run a Linter

Tools like markdownlint can catch common Markdown syntax issues:

npx markdownlint-cli article.md

Common issues it catches: inconsistent heading styles, trailing whitespace, missing blank lines around headings, and bare URLs.

Best Practices for Different Use Cases

Documentation (Technical Docs, READMEs)

Use ATX-style headings (# Heading) consistently.
Add a table of contents at the top for documents longer than a few sections. Many Markdown renderers auto-generate TOCs from headings.
Use fenced code blocks with language identifiers for syntax highlighting:

```python
def hello():
    print("Hello, world!")
```

Keep line length reasonable. Wrapping at 80-100 characters makes diffs cleaner in version control, though many modern editors soft-wrap.
Use reference-style links for URLs that appear multiple times:

See the [installation guide][install] for details.

[install]: https://example.com/docs/install

Note-Taking (Obsidian, Logseq, Notion)

Use wiki-links if your tool supports them: [[Related Note]].
Add frontmatter for metadata:

---
title: Meeting Notes - Q1 Review
date: 2026-04-03
tags: [meeting, quarterly, planning]
---

Use callouts or admonitions for important information:

> [!note]
> This decision was reversed in the March meeting.

Keep files atomic. One concept per file makes linking and searching easier than one massive file.

GitHub (README, Wiki, Issues)

Use GitHub-Flavored Markdown (GFM) syntax: task lists, tables, and strikethrough.
Add badges at the top if relevant:

![Build Status](https://img.shields.io/github/actions/workflow/status/user/repo/ci.yml)

Use <details> for collapsible sections to keep long documents scannable:

<details>
<summary>Click to expand advanced configuration</summary>

Your detailed content here...

</details>

Test rendering on GitHub before publishing. Some Markdown features (like colored text or custom HTML) render differently or not at all on GitHub.

Blog Posts (Jekyll, Hugo, Astro)

Include frontmatter with the fields your static site generator expects:

---
title: "How to Convert PDF to Markdown"
date: 2026-04-03
description: "A practical guide to converting PDF files to Markdown format."
tags: ["pdf", "markdown", "tools"]
---

Optimize images. Compress extracted images, use descriptive filenames (architecture-diagram.png not image1.png), and add meaningful alt text.
Use relative paths for images and internal links so they work in both local preview and production.

Frequently Asked Questions

Is the conversion lossless?

No. PDF and Markdown represent documents in fundamentally different ways. PDF stores visual layout (exact character positions, fonts, colors). Markdown stores semantic structure (headings, paragraphs, lists). Some information is inevitably lost in translation — particularly precise visual formatting, custom fonts, and complex layouts.

Can I convert a password-protected PDF?

You need to unlock the PDF first. If you have the password, use a tool like qpdf to remove the protection:

qpdf --password=yourpassword --decrypt protected.pdf unlocked.pdf

Then convert the unlocked PDF. Note: removing DRM or password protection from files you do not have permission to modify may violate terms of service or laws in your jurisdiction.

How do I handle PDFs with mixed content (text + scanned pages)?

Some PDFs contain a mix of born-digital text and scanned images. For these, you need a converter that supports OCR and can detect which pages need it. Convert the entire document, then review the OCR pages more carefully since they will have lower accuracy than the text-extracted pages.
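
If your converter does not detect scanned pages itself, you can find them by checking which pages have little or no extractable text. The sketch below assumes PyMuPDF is installed for the extraction step; the function names and the 20-character threshold are assumptions. The decision logic is kept separate from the extraction so it can be used with any extractor.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is nearly empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]

def extract_page_texts(pdf_path):
    """Extract the text layer of each page using PyMuPDF (third-party)."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependency
    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]

# Example with already-extracted text: pages 2 and 3 have no usable text layer.
print(pages_needing_ocr(["A full page of extracted text here.", "", "  \n 3 "]))
# prints: [2, 3]
```

A scanned page with a stray text annotation can still pass the threshold, so treat the result as a starting point for review.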

What about converting Markdown back to PDF?

If you need the round trip, tools like Pandoc handle this well:

pandoc input.md -o output.pdf --pdf-engine=xelatex

For simpler needs, most Markdown editors (Typora, VS Code with extensions) can export to PDF directly.

Can I automate PDF-to-Markdown conversion for many files?

Yes. For batch processing, command-line tools are more practical than web-based converters. Some options:

Marker — an open-source Python tool that converts PDF to Markdown with high accuracy, including OCR support.
PyMuPDF (fitz) — a Python library that can extract text and structure from PDFs programmatically.
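Pandoc — does not read PDF as an input format, so it cannot do this conversion on its own. It is useful as a later step, for example to turn HTML or DOCX produced by another extractor into Markdown.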

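Example with Marker (the flags below follow recent 1.x releases; run marker_single --help to confirm them for your installed version):

pip install marker-pdf
marker_single input.pdf --output_dir output/ --output_format markdown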
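
For a whole folder, a small wrapper script can run the converter once per file and report failures at the end. The sketch below is deliberately tool-agnostic: the function names are illustrative, and the command template in the usage comment is a hypothetical placeholder to replace with the real invocation of whichever converter you use.

```python
import subprocess
from pathlib import Path

def build_commands(input_dir, output_dir, template):
    """Build one command per PDF by filling {pdf} and {out} in a template."""
    commands = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        out = Path(output_dir) / pdf.stem  # one output location per document
        commands.append([part.format(pdf=pdf, out=out) for part in template])
    return commands

def run_batch(commands):
    """Run each command, collecting failures instead of stopping at the first."""
    failed = []
    for command in commands:
        try:
            result = subprocess.run(command, capture_output=True, text=True)
        except OSError as error:  # for example, the converter is not installed
            failed.append((command, str(error)))
            continue
        if result.returncode != 0:
            failed.append((command, result.stderr.strip()))
    return failed

# Usage (hypothetical converter name and flags, shown only as a placeholder):
# failures = run_batch(build_commands("pdfs", "markdown",
#                                     ["my-converter", "{pdf}", "--out", "{out}"]))
```

Passing each command as a list rather than a shell string avoids quoting problems with file names that contain spaces.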

How do I handle mathematical equations?

This is one of the hardest problems in PDF-to-Markdown conversion. Options:

Mathpix — a specialized tool that converts equations in PDFs to LaTeX notation, which you can embed in Markdown as $inline$ or $$block$$ math.
Manual transcription — for a small number of equations, typing them in LaTeX math syntax is often faster than debugging automated output.
Screenshots — as a last resort, screenshot equations and embed them as images.

What Markdown flavor should I use?

For maximum compatibility, stick to CommonMark with GFM (GitHub-Flavored Markdown) extensions. CommonMark covers the core syntax, including fenced code blocks, and GFM adds tables, task lists, and strikethrough. Most modern Markdown tools support this combination.

Does pdf2md.net store my files?

The conversion runs in your browser. Your PDF files are processed locally and are not uploaded to a remote server. This makes it suitable for sensitive or confidential documents. Always verify the specific privacy policy of any tool you use.

Wrapping Up

Converting PDF to Markdown is rarely a one-click operation. The quality of the output depends on the quality of the input PDF, the capabilities of the converter, and the amount of manual cleanup you are willing to do.

For simple text-heavy PDFs, tools like pdf2md.net produce clean output that needs minimal editing. For complex documents with tables, equations, or scanned pages, plan to spend time on post-conversion cleanup.

The key takeaway: treat the converter output as a first draft, not a finished product. Review it against the original, fix structural issues, and run it through a linter before publishing. The time you invest in cleanup pays off in a clean, portable, version-controllable document that works across every tool in your workflow.