Question 1

Is the output semantic HTML or absolute-positioned divs?

Accepted Answer

Semantic HTML. Headings use h1, h2, and h3 tags. Paragraphs use p. Lists use ul/li and ol/li. There is no absolute positioning, no inline left/top pixel coordinates, and no embedded-font clutter. The output works in screen readers, indexes cleanly in search engines, and renders responsively on mobile.

Question 2

Are links from the original PDF preserved?

Accepted Answer

Yes. The tool reads the PDF annotation layer, where hyperlinks live, and emits each one as a standard anchor tag in the output HTML with the same href the PDF used. Both external links and internal cross-references are preserved when the source PDF contains them.
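
To make the annotation-to-anchor step concrete, here is a minimal sketch. The `LinkAnnotation` shape and the `linkToAnchor` helper are hypothetical names chosen for illustration, loosely modeled on the kind of data a PDF parser reports for link annotations; this is not the tool's actual code, and it covers only external URLs, not internal cross-references.

```typescript
// Hypothetical shape of a link annotation; the field names are assumptions.
interface LinkAnnotation {
  subtype: string; // "Link" for hyperlinks
  url?: string;    // target of an external link, if any
  text: string;    // visible text covered by the annotation
}

// Escape characters that are unsafe inside HTML text or attribute values.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Emit an anchor that points at the URL the PDF used. Fall back to plain
// text when the annotation is not a link or carries no URL.
function linkToAnchor(annotation: LinkAnnotation): string {
  const text = escapeHtml(annotation.text);
  if (annotation.subtype !== "Link" || !annotation.url) {
    return text;
  }
  return `<a href="${escapeHtml(annotation.url)}">${text}</a>`;
}
```

The href is escaped for HTML, so an ampersand in the URL is written as `&amp;` in the markup while still resolving to the original address.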

Question 3

Does my PDF get uploaded anywhere?

Accepted Answer

No. The PDF is parsed locally in your browser by pdf.js and the HTML is built locally. Nothing leaves your tab. You can convert confidential documents, internal reports, and anything else sensitive with full privacy.

Question 4

How are headings inferred without explicit heading tags in PDFs?

Accepted Answer

Each text run in a PDF carries its font size. The tool treats the most common size as body text and bands the larger sizes into h1, h2, and h3: the largest distinct size becomes h1, the next largest becomes h2, and the smallest band that is still larger than body text becomes h3. The result is real heading structure that search engines and screen readers understand.
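
The banding described above can be sketched roughly as follows. The `TextRun` type and `inferTags` function are hypothetical names. The sketch also makes two assumptions the answer does not spell out: "most common" is counted per run, and every larger size beyond the second falls into h3.

```typescript
type Tag = "h1" | "h2" | "h3" | "p";

// Hypothetical text run: a piece of text plus the font size it was set in.
interface TextRun {
  text: string;
  fontSize: number;
}

// Return one tag per run, in the same order as the input.
function inferTags(runs: TextRun[]): Tag[] {
  if (runs.length === 0) return [];

  // Count how many runs use each font size.
  const counts: Record<number, number> = {};
  for (const run of runs) {
    counts[run.fontSize] = (counts[run.fontSize] || 0) + 1;
  }
  const sizes = Object.keys(counts).map(Number);

  // The most frequent size is treated as body text.
  let bodySize = sizes[0];
  for (const size of sizes) {
    if (counts[size] > counts[bodySize]) bodySize = size;
  }

  // Distinct sizes above body text, largest first.
  const larger = sizes.filter((s) => s > bodySize).sort((a, b) => b - a);

  return runs.map((run): Tag => {
    const rank = larger.indexOf(run.fontSize);
    if (rank === -1) return "p"; // body size or smaller
    if (rank === 0) return "h1"; // largest distinct size
    if (rank === 1) return "h2"; // next largest
    return "h3";                 // assumption: all remaining larger sizes
  });
}
```

Text smaller than the body size, such as footnotes, stays a paragraph in this sketch, since only sizes above the body size are banded.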

Question 5

Will the output preserve images embedded in the PDF?

Accepted Answer

Embedded raster images are extracted and inlined as base64 data: URLs, so the resulting HTML is a single portable file. Vector graphics drawn directly on the PDF canvas are not preserved; they require a different rendering approach. For PDFs where the images are the content, use the PDF to Image tool instead.
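
As an illustration of what inlining as a data: URL means, the sketch below encodes raw image bytes as base64 and wraps them in a data: URL. The `toBase64` and `toDataUrl` names are hypothetical. The encoder is written by hand only to keep the example self-contained; browser code would more likely rely on a built-in such as `FileReader.readAsDataURL`. How the tool itself does this is not stated.

```typescript
const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode bytes as standard base64, three input bytes per four output characters.
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const hasB1 = i + 1 < bytes.length;
    const hasB2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = hasB1 ? bytes[i + 1] : 0;
    const b2 = hasB2 ? bytes[i + 2] : 0;
    out += BASE64_ALPHABET.charAt(b0 >> 2);
    out += BASE64_ALPHABET.charAt(((b0 & 0x03) << 4) | (b1 >> 4));
    // Pad with "=" when the final group has fewer than three bytes.
    out += hasB1 ? BASE64_ALPHABET.charAt(((b1 & 0x0f) << 2) | (b2 >> 6)) : "=";
    out += hasB2 ? BASE64_ALPHABET.charAt(b2 & 0x3f) : "=";
  }
  return out;
}

// Build a data: URL that can be used directly as an img src.
function toDataUrl(bytes: Uint8Array, mimeType: string): string {
  return `data:${mimeType};base64,${toBase64(bytes)}`;
}
```

Base64 grows the payload by about a third, which is the trade-off for keeping everything in one file.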

Question 6

Can I paste the output directly into a CMS?

Accepted Answer

Yes. The output is clean enough for WordPress, Ghost, Webflow, Notion, Confluence, and any block-based or HTML-editor CMS. The minimal style tag at the top provides readable defaults; strip it if your CMS handles styling.
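
If you want to strip the style tag programmatically before pasting, something along these lines would work. The `stripLeadingStyle` name is hypothetical, and the sketch assumes the style block is the first element in the output, as the answer above describes.

```typescript
// Remove a single <style>...</style> block at the very start of the HTML,
// along with any whitespace around it. HTML without a leading style block
// is returned unchanged.
function stripLeadingStyle(html: string): string {
  return html.replace(/^\s*<style[^>]*>[\s\S]*?<\/style>\s*/i, "");
}
```

A regular expression is enough here only because the target is one known block at a fixed position; it is not a general-purpose HTML sanitizer.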

Question 7

Does it handle multi-column layouts well?

Accepted Answer

Multi-column source layouts (newspapers, academic two-column papers, magazine layouts) are flattened into a single linear text flow because the output is semantic HTML, not a pixel-faithful replica. Reading order is preserved when the source PDF tagged its columns correctly. For visual-fidelity replicas of complex layouts, pdf2htmlEX is the better tool.

Question 8

What about scanned PDFs with no text layer?

Accepted Answer

A scanned PDF is an image of text. There is no text layer for the tool to read, so the output HTML will be empty. Run scanned PDFs through the PDF OCR tool first to add a recognized text layer, then re-run this converter.
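
A quick pre-check can tell you whether a PDF needs OCR first. The sketch below assumes you already have the extracted text of each page as a string (for example, joined from a PDF parser's text items). The `hasTextLayer` name is hypothetical, and this is not necessarily how the tool decides.

```typescript
// Return true if at least one page contains non-whitespace text.
// A scanned PDF with no text layer yields only empty strings.
function hasTextLayer(pageTexts: string[]): boolean {
  return pageTexts.some((text) => text.trim().length > 0);
}
```

If this returns false, run the file through OCR before converting.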

PDF to HTML Converter