BEHIND May 16, 2026 · 5 min read

Behind the scenes: a .docx file is just a ZIP — and we built a browser-only inspector that proves it

Most developers know intellectually that .docx is a ZIP — the Office Open XML spec from 2006 made it a ZIP, every Word file since 2007 has been a ZIP, and you can prove it by renaming any .docx to .zip and opening it. What's less obvious is how short the path is between "it's a ZIP" and "I can read its metadata without any third-party library, in any modern browser, without uploading anything." We just shipped a DOCX Metadata Inspector that does exactly this. The implementation is short. The lessons are interesting.

UDT

UDT Engineering

Engineering

The ZIP, the directory, and the two XMLs

A .docx file is a ZIP archive with a fixed internal layout. The two files that matter for metadata both live in the docProps/ directory: core.xml contains the Dublin Core properties (title, author, last-modified-by, dates), and app.xml contains the Microsoft-specific extensions (page count, word count, total edit minutes, template path, application version). There's also an optional custom.xml for arbitrary user-defined properties. That's it. You don't need to parse the word/document.xml body, the styles, or anything else. Three small XML files at known paths.

The total uncompressed size of all three XMLs is usually under 4 KB, and the compressed size is usually under 1 KB. You're reading less data than the first few paragraphs of this post.

Reading the ZIP without a ZIP library

Here's the part that surprises people. A ZIP file has a well-defined trailer called the End of Central Directory (EOCD) record, which lives in the last 22 bytes of the file (or up to 65,557 bytes if there's a comment, which there essentially never is for .docx). The EOCD points at the central directory, which lists every file in the archive with its name, compressed size, uncompressed size, compression method, and offset to its local header. That's all you need to find the two XMLs.
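A minimal sketch of that lookup, assuming the whole file is already in memory as a Uint8Array and that the archive is not ZIP64 (a safe bet for typical .docx files). The names findEocdOffset, readCentralDirectory, and ZipEntry are illustrative, not taken from the shipped tool; the byte offsets come from the ZIP format specification.

```typescript
// Little-endian signatures defined by the ZIP format.
const EOCD_SIGNATURE = 0x06054b50; // "PK\x05\x06"
const CENTRAL_HEADER_SIGNATURE = 0x02014b50; // "PK\x01\x02"
const EOCD_MIN_SIZE = 22; // size of the EOCD record with an empty comment
const MAX_COMMENT_SIZE = 0xffff; // comment length is a 16-bit field

// Hypothetical shape for one central directory record.
interface ZipEntry {
  name: string;
  method: number; // 0 = stored, 8 = deflate
  compressedSize: number;
  uncompressedSize: number;
  localHeaderOffset: number;
}

// Scan backward from the last position where an EOCD record could start.
function findEocdOffset(bytes: Uint8Array): number {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const lowest = Math.max(0, bytes.length - EOCD_MIN_SIZE - MAX_COMMENT_SIZE);
  for (let i = bytes.length - EOCD_MIN_SIZE; i >= lowest; i--) {
    if (view.getUint32(i, true) === EOCD_SIGNATURE) return i;
  }
  throw new Error("End of Central Directory record not found");
}

// Walk the central directory and collect one record per archived file.
function readCentralDirectory(bytes: Uint8Array): ZipEntry[] {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const eocd = findEocdOffset(bytes);
  const entryCount = view.getUint16(eocd + 10, true); // total number of entries
  let offset = view.getUint32(eocd + 16, true); // where the central directory starts
  const decoder = new TextDecoder();
  const entries: ZipEntry[] = [];

  for (let i = 0; i < entryCount; i++) {
    if (view.getUint32(offset, true) !== CENTRAL_HEADER_SIGNATURE) {
      throw new Error("Corrupt central directory");
    }
    const nameLength = view.getUint16(offset + 28, true);
    const extraLength = view.getUint16(offset + 30, true);
    const commentLength = view.getUint16(offset + 32, true);
    entries.push({
      name: decoder.decode(bytes.subarray(offset + 46, offset + 46 + nameLength)),
      method: view.getUint16(offset + 10, true),
      compressedSize: view.getUint32(offset + 20, true),
      uncompressedSize: view.getUint32(offset + 24, true),
      localHeaderOffset: view.getUint32(offset + 42, true),
    });
    // Each record is 46 fixed bytes followed by three variable-length fields.
    offset += 46 + nameLength + extraLength + commentLength;
  }
  return entries;
}
```

Finding the metadata files is then a filter over the result, for example readCentralDirectory(bytes).filter((e) => e.name.startsWith("docProps/")). The sketch takes the first signature match from the end and does not guard against the magic bytes appearing inside an archive comment.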

And to decompress them, modern browsers ship DecompressionStream('deflate-raw'), which reads raw DEFLATE, the compression method ZIP uses for nearly every compressed entry. So the full read path is: locate the EOCD by scanning backward from the end of the file for its magic bytes (the last 65,557 bytes is the most you ever need to search), parse the central directory to find the entries named docProps/core.xml and docProps/app.xml, read each entry's local header to find where its compressed data starts, slice the compressed bytes, run them through DecompressionStream, and you have the XMLs. About 80 lines of TypeScript total.
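The last three steps of that path can be sketched as follows. This is an illustration under assumptions rather than the shipped code: readEntryBytes and EntryLocation are hypothetical names, the entry fields are expected to come from the central directory, and only stored (method 0) and deflated (method 8) entries are handled.

```typescript
const LOCAL_HEADER_SIGNATURE = 0x04034b50; // "PK\x03\x04"

// Hypothetical subset of a central directory record needed to read one file.
interface EntryLocation {
  method: number; // 0 = stored, 8 = deflate
  compressedSize: number;
  localHeaderOffset: number;
}

async function readEntryBytes(file: Uint8Array, entry: EntryLocation): Promise<Uint8Array> {
  const view = new DataView(file.buffer, file.byteOffset, file.byteLength);
  const base = entry.localHeaderOffset;
  if (view.getUint32(base, true) !== LOCAL_HEADER_SIGNATURE) {
    throw new Error("Bad local file header");
  }

  // The local header repeats the name and extra-field lengths, and they can
  // differ from the central directory's, so read them here to find the data.
  const nameLength = view.getUint16(base + 26, true);
  const extraLength = view.getUint16(base + 28, true);
  const dataStart = base + 30 + nameLength + extraLength;

  // Use the central directory's compressed size: the local header's copy may
  // be zero when the writer streamed the entry and used a data descriptor.
  const compressed = file.slice(dataStart, dataStart + entry.compressedSize);

  if (entry.method === 0) return compressed; // stored: bytes are already plain
  if (entry.method !== 8) {
    throw new Error(`Unsupported compression method ${entry.method}`);
  }

  // Raw DEFLATE has no zlib or gzip wrapper, hence "deflate-raw".
  const stream = new Blob([compressed])
    .stream()
    .pipeThrough(new DecompressionStream("deflate-raw"));
  return new Uint8Array(await new Response(stream).arrayBuffer());
}
```

Turning the result into text is one more call: new TextDecoder().decode(await readEntryBytes(file, entry)).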

Parsing the XML in the browser

The browser also ships a DOMParser. We feed it the decompressed XML, query for the Dublin Core elements (dc:title, dc:creator, cp:lastModifiedBy, etc.), then query app.xml for the Microsoft-extended properties (Pages, Words, TotalTime, Template). The whole thing runs in a single function. No JSZip, no PizZip, no fflate. The bundle stays at zero extra bytes.
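As a rough sketch of those queries (browser-only, since it relies on DOMParser): the function and field names below are illustrative assumptions, while the namespace URIs are the standard ones for core and extended properties. Querying by namespace avoids depending on whichever prefix the producing application chose.

```typescript
// Namespace URIs used inside docProps/core.xml and docProps/app.xml.
const CORE_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
const DC_NS = "http://purl.org/dc/elements/1.1/";
const DCTERMS_NS = "http://purl.org/dc/terms/";
const APP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

function parseXml(xml: string): Document {
  const doc = new DOMParser().parseFromString(xml, "application/xml");
  // DOMParser reports malformed XML through a <parsererror> element, not a throw.
  if (doc.getElementsByTagName("parsererror").length > 0) {
    throw new Error("Malformed XML");
  }
  return doc;
}

// Text of the first element with this namespace and local name, or null.
function firstText(doc: Document, ns: string, localName: string): string | null {
  const el = doc.getElementsByTagNameNS(ns, localName)[0];
  return el?.textContent ?? null;
}

function parseCoreXml(xml: string) {
  const doc = parseXml(xml);
  return {
    title: firstText(doc, DC_NS, "title"),
    creator: firstText(doc, DC_NS, "creator"),
    lastModifiedBy: firstText(doc, CORE_NS, "lastModifiedBy"),
    created: firstText(doc, DCTERMS_NS, "created"),
    modified: firstText(doc, DCTERMS_NS, "modified"),
  };
}

function parseAppXml(xml: string) {
  const doc = parseXml(xml);
  const toNumber = (s: string | null): number | null => (s === null ? null : Number(s));
  return {
    pages: toNumber(firstText(doc, APP_NS, "Pages")),
    words: toNumber(firstText(doc, APP_NS, "Words")),
    totalEditMinutes: toNumber(firstText(doc, APP_NS, "TotalTime")),
    template: firstText(doc, APP_NS, "Template"),
  };
}
```

Any property a document omits comes back as null, which is common: many producers write only a subset of these elements.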

The principle generalizes. Any format that's a ZIP under the hood — .xlsx, .pptx, .odt, .epub, .jar, .apk, .nupkg — can be inspected this way. The only difference is which inner files you target. For Excel files, that's xl/sharedStrings.xml for the string table and xl/workbook.xml for the sheet list.
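One way to express "only the targets change" is a lookup from file extension to inner paths. This is a sketch, not part of the shipped tool: the .docx and .xlsx rows come from the text above, the .odt, .epub, and .jar rows are well-known entry points added here as assumptions, and targetEntriesFor is a hypothetical name.

```typescript
// Inner files worth reading first, keyed by lowercase file extension.
const TARGET_ENTRIES: Record<string, string[]> = {
  docx: ["docProps/core.xml", "docProps/app.xml", "docProps/custom.xml"],
  xlsx: ["xl/workbook.xml", "xl/sharedStrings.xml"],
  // Assumptions beyond the article: common entry points for other ZIP-based formats.
  odt: ["meta.xml"],
  epub: ["META-INF/container.xml"],
  jar: ["META-INF/MANIFEST.MF"],
};

// Returns the inner paths to look up for a given filename, or [] if unknown.
function targetEntriesFor(filename: string): string[] {
  const dot = filename.lastIndexOf(".");
  if (dot === -1) return [];
  const extension = filename.slice(dot + 1).toLowerCase();
  return TARGET_ENTRIES[extension] ?? [];
}
```

The ZIP-reading code stays the same across formats; only this table and the parser for each inner file differ.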

Why this matters for privacy tools

Browser-only is the whole point. The metadata inspector handles documents that have no business leaving your laptop — résumés you're about to send to a recruiter, contracts under negotiation, internal financial reports, medical records, drafts you don't want anyone but you reading. Tools that require an upload to inspect the metadata are tools you can't run on confidential documents. Tools that run entirely in the browser are tools you can run on anything. The Network tab proves it, every time.

The tradeoff is that browser-only tools require a modern browser. DecompressionStream itself shipped in Chrome 80 and Edge 80 in 2020, but the 'deflate-raw' format this approach depends on arrived later: Chrome 103 and Edge 103 (2022), Firefox 113, and Safari 16.4 (both 2023). That's enough coverage in 2026 to be the default rather than a progressive enhancement. For older browsers we ship a noscript fallback that explains the constraint and links to docs; in practice, almost no one hits it.
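Version tables go stale, so a runtime check is the safer gate. A minimal sketch, assuming a hypothetical helper named supportsDeflateRaw: the DecompressionStream constructor throws a TypeError when handed a format it does not recognize, which makes the probe cheap.

```typescript
// True when this runtime can decompress raw DEFLATE via DecompressionStream.
function supportsDeflateRaw(): boolean {
  if (typeof DecompressionStream === "undefined") return false;
  try {
    // Constructing the stream is enough; an unsupported format throws here.
    new DecompressionStream("deflate-raw");
    return true;
  } catch {
    return false;
  }
}
```

A page could call this once on load and show its fallback message when it returns false, instead of failing on the first dropped file.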

Engineering notes from Ultimate Design Tools. The technical work behind the tools you use, documented as it happens for anyone interested in the implementation tradeoffs.