The starting scenario

A distributor for cast-iron wood-burning stoves. Three manufacturer suppliers from Poland, the Czech Republic, and Germany. Each supplier ships a PDF catalog twice a year with 80 to 180 products. The distributor wants to sell those products in their own shop — with their own images, their own copy, their own pricing.

The PDF data is unstructured: borderless tables, product codes embedded in prose, technical data listed next to photos that sit in the PDF's layer stack. No shop connector reads those PDFs directly. And manual retyping of 400 products isn't an option.

Why this problem isn't trivial

Manufacturer catalogs are designed for trade contacts, not machine parsers. The classic traps:

  • Tables without border lines — classical OCR doesn't recognise cells
  • SKU codes in headlines, in running text, and in tables — same format, different meaning
  • Technical specifications as prose with bullet lists — often in a different order per product
  • Images as PDF layers, not embedded image files — with crops, shadows, and marketing overlays
  • Sometimes multilingual within the same document — EN/DE/FR on the same page

Classical PDF parsers break the document into a text stream that loses the logical structure. The result is a word list, not a product list.
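
A minimal sketch of what that loss looks like for SKU detection. The SKU format (two capital letters, a hyphen, four digits), the sample text, and the product names are invented for illustration; real manufacturer codes will differ.

```python
import re

# Hypothetical flattened text stream, as a classical PDF parser might emit it.
# The SKU format and all product names are invented for this illustration.
flat_text = (
    "KN-2040 Cast-iron stove Bohemia 8 kW "
    "Compatible with flue adapter KN-9001 "
    "Replaces KN-2030 "
    "Code KN-2040 Output 8 kW Weight 112 kg"
)

# Assumed SKU pattern: two capital letters, a hyphen, four digits.
SKU_PATTERN = re.compile(r"\b[A-Z]{2}-\d{4}\b")

# The regex finds every code, but the flat stream no longer says which one
# is the product's own SKU, which is an accessory, and which is a predecessor.
matches = SKU_PATTERN.findall(flat_text)
unique_codes = sorted(set(matches))

print(matches)
print(unique_codes)
```

Pattern matching alone yields three candidate codes for what is a single product on the page. The role of each code depends on where it sits in the layout, which is exactly the information the text stream has dropped.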

The approach

Step 1 — Visual parse with a language model

Instead of classical PDF extraction, Otto receives each page as an image plus a raw text pass. A vision-capable language model sees the page the way a trade contact sees it — recognising which blocks form a product, which text is the headline, and where the spec table starts.

The model returns a structured JSON representation per page: products with SKU, name, description, spec attributes, image-region coordinates.
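
One way such a per-page result could be typed and validated is sketched below. The field names, the bounding-box convention, and the sample payload are assumptions made for illustration; the actual schema Otto uses is not documented here.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ImageRegion:
    # Bounding box on the page; units and origin are assumptions.
    x: float
    y: float
    width: float
    height: float


@dataclass
class Product:
    sku: str
    name: str
    description: str
    specs: dict[str, str] = field(default_factory=dict)
    image_regions: list[ImageRegion] = field(default_factory=list)


def parse_page(raw_json: str) -> list[Product]:
    """Turn one page's model output into typed records.

    Missing required keys raise KeyError, so a malformed page fails loudly
    instead of producing half-empty products.
    """
    page = json.loads(raw_json)
    products = []
    for item in page["products"]:
        regions = [ImageRegion(**r) for r in item.get("image_regions", [])]
        products.append(Product(
            sku=item["sku"],
            name=item["name"],
            description=item["description"],
            specs=item.get("specs", {}),
            image_regions=regions,
        ))
    return products


# Hypothetical model output for a single catalog page.
sample = json.dumps({
    "page": 12,
    "products": [{
        "sku": "KN-2040",
        "name": "Bohemia 8",
        "description": "Cast-iron stove.",
        "specs": {"output": "8 kW"},
        "image_regions": [{"x": 40, "y": 120, "width": 300, "height": 260}],
    }],
})

products = parse_page(sample)
print(products[0].sku, products[0].specs["output"])
```

Validating at this boundary keeps malformed model output out of the later image and copy steps.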

Step 2 — Extract images, regenerate them

The images inside the PDF usually aren't shop-ready. They carry manufacturer marketing elements — logo overlays, claim bars, price bubbles. And they vary in background, lighting, and style from product to product.

Otto crops the image regions identified in Step 1 as references and generates consistent new versions: uniform light background, product centred, no marketing elements. For scene images, Otto picks an appropriate setting per category: a wood stove in a living-room scene, a hand tool in a workshop scene.

Step 3 — SEO copy from manufacturer text

Manufacturer descriptions are rarely shop-ready. They're often marketing-heavy, sometimes incomplete, and almost always already copy-pasted by other resellers. Google penalises duplicate content.

Otto reads the manufacturer description and the technical data, then writes a structurally different version on that basis — same facts, different structure, different tone, different keyword selection. That's not sentence-level rewriting; it's a full restructure. Details in Product descriptions from manufacturer text — without duplicate content.
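
The difference between a light edit and a full restructure can be made measurable. The sketch below uses word-shingle Jaccard similarity, a common generic overlap measure. It is not Otto's method and not how any search engine scores duplicates; the sample texts are invented.

```python
import re


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of n-word shingles, lowercased, punctuation ignored."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0  # too short to compare
    return len(sa & sb) / len(sa | sb)


# Invented example texts.
manufacturer = (
    "The Bohemia 8 is a powerful cast-iron stove with 8 kW output "
    "and a large glass door."
)
light_edit = (
    "The Bohemia 8 is a strong cast-iron stove with 8 kW output "
    "and a large glass door."
)
restructured = (
    "With 8 kW of heat output, this cast-iron stove warms mid-sized "
    "living rooms. Its large glass door keeps the fire in view."
)

print(f"light edit:   {jaccard(manufacturer, light_edit):.2f}")
print(f"restructured: {jaccard(manufacturer, restructured):.2f}")
```

Swapping a single adjective leaves most shingles intact, while the restructured version shares only a few factual fragments with the source.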

Step 4 — Review and publish

As with physical intake, every product passes through a review interface before it reaches the shop. The operator sees the extracted data, the generated images, and the new copy, side by side with the PDF source, and clicks Pass or Fail. A Fail triggers targeted regeneration.
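
The review loop can be pictured as a small state machine. In the sketch below, "targeted" is read as "regenerate only the artifact the operator rejected", which is an assumption; the class, the function, and the attempt counter are hypothetical stand-ins for the real regeneration calls.

```python
from dataclasses import dataclass, field

# The three artifact kinds the operator reviews.
ARTIFACTS = ("data", "images", "copy")


@dataclass
class ReviewItem:
    sku: str
    # Per-artifact attempt counter; every artifact starts at attempt 1.
    attempts: dict[str, int] = field(
        default_factory=lambda: {a: 1 for a in ARTIFACTS}
    )
    status: str = "pending"  # "pending" or "approved"


def review(item: ReviewItem, verdict: str, failed: tuple[str, ...] = ()) -> ReviewItem:
    """Apply one operator decision to a review item."""
    if verdict == "pass":
        item.status = "approved"
    elif verdict == "fail":
        for artifact in failed:
            if artifact not in ARTIFACTS:
                raise ValueError(f"unknown artifact: {artifact}")
            # Stand-in for the real regeneration call.
            item.attempts[artifact] += 1
        item.status = "pending"  # back into the review queue
    else:
        raise ValueError(f"unknown verdict: {verdict}")
    return item


item = ReviewItem(sku="KN-2040")
review(item, "fail", failed=("images",))  # only the images are redone
review(item, "pass")
print(item.status, item.attempts)
```

Keeping the extracted data and the copy untouched when only the images fail means the operator does not have to re-check work that was already acceptable.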

The publish phase is identical to the warehouse workflow: the connector writes to Shopify, WooCommerce, or Magento.

Numbers from the wood-stove project

  • Input: 3 PDF catalogs, 412 products total, 487 PDF pages
  • Parse phase: 42 minutes of compute across all pages
  • Product extraction rate: 94% in batch 1, 98% in batch 2 after prompt adaptation
  • Image generation: 824 images in 16 hours of distributed compute
  • SEO copy: 412 product descriptions in 3.5 hours
  • Review time: 19 hours of operator work
  • Publish: 1 hour of REST latency across all approved products
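
Broken down per unit, reading each figure above as a total for the whole project, the numbers look like this. The image figure is distributed compute time, not wall-clock time.

```python
# Figures from the list above, each read as a project total.
PAGES = 487
PRODUCTS = 412
IMAGES = 824

parse_minutes = 42
image_hours = 16
copy_hours = 3.5
review_hours = 19

seconds_per_page = parse_minutes * 60 / PAGES
seconds_per_image = image_hours * 3600 / IMAGES
seconds_per_description = copy_hours * 3600 / PRODUCTS
review_minutes_per_product = review_hours * 60 / PRODUCTS
images_per_product = IMAGES / PRODUCTS

print(f"parse:  {seconds_per_page:.1f} s per page")                  # 5.2
print(f"images: {seconds_per_image:.1f} s per image")                # 69.9
print(f"copy:   {seconds_per_description:.1f} s per description")    # 30.6
print(f"review: {review_minutes_per_product:.1f} min per product")   # 2.8
print(f"images per product: {images_per_product:.0f}")               # 2
```

Operator review, at just under three minutes per product, is the largest share of human time in the project.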

Otto cost: Pack L covers 412 SKUs at €1,199. For regular catalog updates, Monthly is often better: €349/month for ongoing 250-SKU coverage.
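
A quick comparison of the two pricing options, using only the prices quoted above. How the Monthly quota resets is not spelled out here, so the sketch compares prices only and makes no assumption about quota rollover.

```python
PACK_L_PRICE_EUR = 1199
PACK_L_SKUS = 412        # SKUs covered in this project
MONTHLY_PRICE_EUR = 349
MONTHLY_SKUS = 250       # quoted ongoing coverage

pack_cost_per_sku = PACK_L_PRICE_EUR / PACK_L_SKUS
monthly_cost_per_covered_sku = MONTHLY_PRICE_EUR / MONTHLY_SKUS
monthly_cost_per_year = MONTHLY_PRICE_EUR * 12
packs_for_same_money = monthly_cost_per_year / PACK_L_PRICE_EUR

print(f"Pack L:  EUR {pack_cost_per_sku:.2f} per SKU")                    # 2.91
print(f"Monthly: EUR {monthly_cost_per_covered_sku:.2f} per covered SKU") # 1.40
print(f"Monthly for a year: EUR {monthly_cost_per_year}")                 # 4188
print(f"Equivalent Pack L purchases: {packs_for_same_money:.1f}")         # 3.5
```

On these prices, a year of Monthly costs about as much as three and a half Pack L purchases, so the choice depends on how many catalog onboardings of this size fall into a year.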

What this pipeline can't do

Catalogs that are pure imagery with no accompanying prose — coffee-table-book-style presentations — don't give enough text material for solid SEO descriptions. There you need a human to extract facts from the images.

Catalogs with proprietary pricing and tiered rebate models are also tricky. Pricing logic is customer-specific and belongs in the commerce configuration, not in the shop database. Otto doesn't handle pricing — that gets set per shop, manually or via ERP integration.

When the model pays off

Distributors who regularly onboard manufacturer catalogs into their own assortment. Seasonal catalogs (spring/autumn), annual main catalogs, addenda for new product lines. The question isn't "can we do this", it's "how often per year do we repeat this process". From two catalog onboardings per year upward, the investment pays back three to five times over compared with manual processing.

Start a Pack. Otto handles the rest.

Pay once. Upload your SKUs. Shelf-ready images and SEO copy in your store this week.