Why PDF parsing for catalog data isn't trivial

The core misconception

A PDF looks like a structured document to a human. To a machine, it's a stack of positioned text and image objects with no inherent logical hierarchy. A table isn't a table in the PDF spec — it's a collection of text boxes with X/Y coordinates that happen to be arranged like a table.

Classical PDF parsers reconstruct the logical structure with heuristics: if text positions are regular enough, it's probably a table. That works for administrative documents with a clean layout. For marketing-heavy manufacturer catalogs, it breaks down.
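To make the heuristic concrete, here is a minimal sketch of how a classical parser might guess at a table from positions alone. The `TextBox` type, the tolerance value, and the "same cell count in every row" rule are illustrative assumptions, not the logic of any particular parser.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    text: str
    x: float  # left edge in PDF points
    y: float  # baseline in PDF points (PDF y grows upward)


def group_rows(boxes, y_tol=2.0):
    """Group boxes whose baselines lie within y_tol of each other into rows."""
    rows = []
    # Sort top to bottom, then left to right.
    for box in sorted(boxes, key=lambda b: (-b.y, b.x)):
        if rows and abs(rows[-1][0].y - box.y) <= y_tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b.x) for row in rows]


def looks_like_table(boxes, y_tol=2.0, min_rows=2):
    """Assumed rule: several rows, each with the same number (>= 2) of cells."""
    rows = group_rows(boxes, y_tol)
    counts = {len(row) for row in rows}
    return len(rows) >= min_rows and len(counts) == 1 and counts.pop() >= 2


spec_boxes = [
    TextBox("Model", 50, 700), TextBox("Weight", 200, 700.5),
    TextBox("BX100", 50, 680), TextBox("120 kg", 200, 680),
]
prose_boxes = [TextBox("The stove weighs", 50, 700), TextBox("120 kg", 50, 688)]
```

With clean input, `looks_like_table(spec_boxes)` is true and `looks_like_table(prose_boxes)` is false. The rule is brittle by design: one wrapped cell or one decorative text element changes the cell count of a row and the guess flips.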

The five most common traps

1. Tables without border lines

Modern catalogs drop table borders in favour of whitespace. The human reader sees a clean table. The parser sees disorganised text blocks, because its heuristic relies on lines.

Otto works around this with a visual parse: the model sees the page as an image and recognises the table by the alignment of text blocks, not by lines.
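The idea of reading columns from alignment rather than from lines can be illustrated geometrically. The sketch below is a simplified stand-in, not Otto's visual parse (which works on the page image): it assumes text boxes arrive as hypothetical `(text, x_left, y)` tuples and treats any left edge shared by enough boxes as a column.

```python
def column_edges(boxes, x_tol=3.0, min_support=3):
    """Return the x positions where at least min_support boxes share a left edge.

    boxes: iterable of (text, x_left, y) tuples (an assumed input shape).
    """
    clusters = []  # each cluster is a list of nearby x values
    for x in sorted(box[1] for box in boxes):
        if clusters and x - clusters[-1][-1] <= x_tol:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    # Keep only edges that enough boxes agree on; report the cluster mean.
    return [round(sum(c) / len(c), 1) for c in clusters if len(c) >= min_support]


borderless = [
    ("Model", 50.0, 700), ("Weight", 200.0, 700),
    ("BX100", 51.0, 680), ("120 kg", 201.0, 680),
    ("BX200", 49.0, 660), ("145 kg", 202.0, 660),
    ("New!", 330.0, 690),  # stray marketing label, not a column
]
edges = column_edges(borderless)
```

Here `edges` holds two column positions and the stray label is ignored because nothing else aligns with it. No border line is consulted at any point.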

2. SKU codes in headlines, prose, and tables

An SKU like "KO-2024-BX100" can appear three times on one page: once in the headline, once in the descriptive prose, once in the technical spec table. Which one is canonical?

Otto resolves this through context relevance: the headline is marked as the primary SKU, repetitions are treated as cross-references rather than new products. That keeps the product list clean.
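A rough sketch of that rule, under stated assumptions: the SKU regex is fitted to the example code above, the block roles are hypothetical labels, and ranking the table ahead of prose is an arbitrary choice made for illustration. Only "headline wins" comes from the description.

```python
import re

# Assumed SKU shape, fitted to the example "KO-2024-BX100".
SKU_PATTERN = re.compile(r"\b[A-Z]{2}-\d{4}-[A-Z]{2}\d{3}\b")
# Lower number = more authoritative. Headline first; the rest is an assumption.
ROLE_PRIORITY = {"headline": 0, "table": 1, "prose": 2}


def resolve_skus(blocks):
    """blocks: list of (role, text). Returns one entry per distinct SKU."""
    mentions = {}
    for role, text in blocks:
        for sku in SKU_PATTERN.findall(text):
            mentions.setdefault(sku, []).append(role)
    resolved = {}
    for sku, roles in mentions.items():
        ranked = sorted(roles, key=lambda r: ROLE_PRIORITY.get(r, len(ROLE_PRIORITY)))
        resolved[sku] = {"primary": ranked[0], "cross_refs": ranked[1:]}
    return resolved


page = [
    ("prose", "The KO-2024-BX100 heats rooms of up to 80 square metres."),
    ("headline", "KO-2024-BX100 Wood Stove"),
    ("table", "SKU KO-2024-BX100 Weight 120 kg"),
]
products = resolve_skus(page)
```

Three mentions collapse into a single product entry whose primary source is the headline; the other two are kept as cross-references instead of becoming duplicate products.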

3. Technical data as prose

Instead of structured attributes ("Weight: 120 kg"), some manufacturers write prose ("The stove weighs 120 kilograms and sits solidly in the room"). Both variants carry the same fact, but only one is machine-readable.

Otto parses both through language-model extraction. Facts get mapped into a single attribute schema, regardless of how they were phrased in the PDF.
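The target of that mapping can be shown with a deliberately simple stand-in. The sketch below uses two regular expressions in place of a language model, and the schema keys (`attribute`, `value`, `unit`) are hypothetical; it only demonstrates that both phrasings should end up as the same record.

```python
import re

UNIT_ALIASES = {"kg": "kg", "kilogram": "kg", "kilograms": "kg"}
WEIGHT_PATTERNS = [
    re.compile(r"weight\s*:\s*(\d+(?:\.\d+)?)\s*([a-z]+)", re.IGNORECASE),  # "Weight: 120 kg"
    re.compile(r"weighs\s+(\d+(?:\.\d+)?)\s*([a-z]+)", re.IGNORECASE),      # "weighs 120 kilograms"
]


def extract_weight(text):
    """Return a normalised weight record, or None if no known phrasing matches."""
    for pattern in WEIGHT_PATTERNS:
        match = pattern.search(text)
        if match and match.group(2).lower() in UNIT_ALIASES:
            return {
                "attribute": "weight",
                "value": float(match.group(1)),
                "unit": UNIT_ALIASES[match.group(2).lower()],
            }
    return None


structured = extract_weight("Weight: 120 kg")
prose = extract_weight("The stove weighs 120 kilograms and sits solidly in the room")
```

Both calls return the same record. A regex list like this covers only the phrasings someone thought to write down, which is the reason a language model is the more practical extractor for free prose.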

4. Images as PDF layers

In marketing PDFs, product images often aren't embedded JPEGs — they're compositions of several layers: image, frame, logo overlay, claim text. If you only extract the JPEG, you get either the product without the frame or the frame without the product, never the final composed image.

Otto renders the relevant page region as a bitmap, not as a layer composition. Since the final shop image is regenerated anyway, the PDF version only serves as a reference — layer separation doesn't matter.
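Rendering a page region means translating a rectangle from PDF coordinates into pixels on the rendered bitmap. A minimal sketch of that conversion follows, assuming the default PDF user space (origin at the bottom-left, 72 points per inch) and an unrotated page; the function name and the choice of DPI are illustrative, and the rendering itself would be done by whatever PDF renderer is in use.

```python
def pdf_rect_to_pixels(rect, page_height_pt, dpi=150):
    """Convert (x0, y0, x1, y1) in PDF points to a (left, top, right, bottom) pixel box.

    PDF user space puts the origin at the bottom-left; bitmaps put it at the
    top-left, so the y axis has to be flipped.
    """
    scale = dpi / 72.0  # 72 PDF points per inch
    x0, y0, x1, y1 = rect
    left = round(x0 * scale)
    right = round(x1 * scale)
    top = round((page_height_pt - y1) * scale)
    bottom = round((page_height_pt - y0) * scale)
    return left, top, right, bottom


# Product image region on an A4 page (842 pt high), rendered at 144 DPI.
crop_box = pdf_rect_to_pixels((72, 542, 288, 770), page_height_pt=842, dpi=144)
```

Cropping the rendered page to `crop_box` yields the composed image as the reader sees it, with frame, overlay, and product already merged.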

5. Multilingual pages

EN/DE/FR catalogs present the same product description three times on one page. Classical parsers concatenate all three into a single text blob that is usable in none of the languages.

Otto detects language per text block and routes the right version to the right shop field. For shops with multi-language modules (WPML on WooCommerce, Shopify Markets), all variants get written directly into the matching language fields.
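Per-block detection and routing can be sketched with a crude stopword count. This is an assumed stand-in for a real language detector: the word lists are tiny and the `description_<lang>` field names are hypothetical, not the field names of WPML or Shopify Markets.

```python
import re

STOPWORDS = {
    "en": {"the", "and", "with", "for", "is", "of"},
    "de": {"der", "die", "das", "und", "mit", "für", "ist"},
    "fr": {"le", "la", "les", "et", "avec", "pour", "est"},
}


def detect_language(text):
    """Pick the language whose stopwords overlap most with the block, or None."""
    words = set(re.findall(r"\w+", text.lower()))
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def route_blocks(blocks):
    """Collect text blocks into one field per detected language."""
    fields = {}
    for text in blocks:
        lang = detect_language(text)
        if lang is not None:
            fields.setdefault(f"description_{lang}", []).append(text)
    return {name: " ".join(parts) for name, parts in fields.items()}


page_blocks = [
    "The stove is built for large rooms.",
    "Der Ofen ist für große Räume gebaut.",
    "Le poêle est conçu pour les grandes pièces.",
]
fields = route_blocks(page_blocks)
```

Each block lands in its own field instead of one concatenated blob. Blocks with no recognisable stopwords are dropped here; in practice they would more sensibly go to review.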

What stays unreliable

Catalogs built as pure design layouts — magazine-style spreads with background imagery and product text floating on image layers — are still a fight. Extraction rate drops below 80%, and rework time grows.

The pragmatic fix: click through those catalogs, mark the important 20 to 30 products, and extract only those. The rest gets filled in manually.

Why visual parse makes the difference

Most of these traps share a property: they depend on visual conventions that a human grasps at a glance but classical text extraction cannot see. Once the parse step lets the model see the page, most of them disappear. The parse isn't perfect, but it's good enough for 90% of real catalog data, and the rest shows up explicitly in review.

