A field report on backfilling structured data from listing photos: the dead ends, the counterintuitive failures, and the architecture that finally worked for about $0.29.

The 30-second version

  1. Mine your own data first. The rows where the field is already filled are a free, honest test set. Build that harness before you build anything else.
  2. Split the work. Cheap, free OCR finds where the text is, and a vision LLM reads what it says. You only pay the AI on a tiny crop, only when there is something worth reading.
  3. Validation alone will not save you. For dense identifiers like tail numbers, VINs, and serials, an OCR error usually lands on another valid identifier, so registries and checksums give false confidence. Require agreement across several photos instead.
  4. Optimize for precision over recall when writing to production, and profile before you optimize speed. Our bottleneck was flaky downloads and CPU image processing, never the model.

The problem

We run an aircraft marketplace that aggregates listings from 100+ broker sites via automated crawlers. Every aircraft has a registration, the tail number you see painted on the fuselage (N565XL, D-EHUH, G-OSEM). It is one of the most useful fields we have: it deduplicates the same aircraft listed on five different broker sites, it powers search, and buyers use it to look up history.

The trouble: about a quarter of our listings had no registration. The crawlers could not find it in the page text because the brokers simply did not print it there. But the aircraft was sitting right there in the listing photos.

We had roughly 2,700 listings missing a registration, about 2,000 of them with images. We had already proven a large vision LLM could read the tail numbers beautifully, but running a frontier model over every photo was too expensive to justify for a backfill, and it would keep costing us as new listings arrived.

So the question became: how cheaply and how accurately can you extract one structured field from marketplace photos, accurately enough to write it straight into a live production database?

This is a write-up of what we learned. Most of it generalizes to any marketplace trying to pull structured data out of seller images: VINs from car listings, model numbers from electronics, ISBNs from book photos, serial numbers from equipment.

Learning 1: Your existing data is a free, honest test set. Use it.

The single most valuable decision we made took five minutes. Before building anything, we noticed that the roughly 7,300 listings that already had a registration, and also had a photo, were a labeled validation set we got for free.

Run any candidate extractor over those photos, compare what it reads to the registration we already know, and you get a real precision and recall number. No manual labeling, no guesswork.

This shaped the entire project. Every model, threshold, and gate was chosen against measured numbers on 150 known-answer listings, not on a hunch. If you are doing AI extraction on a marketplace, you almost certainly have the same asset: rows where the field is already filled. That is your test harness. Build it first.

One subtlety is worth flagging. The labeled set is slightly optimistic. Listings that have the field tend to be the ones where it was easy to obtain. Listings missing the field are missing it precisely because it was hard to get, so your real-world recall will be lower than your test-set recall.

Learning 2: Free OCR hits a hard precision ceiling, and the reason is subtle

We started where everyone starts: free, open-source OCR. PaddleOCR (PP-OCRv5, Apache-2.0, runs on CPU) is genuinely excellent scene-text software. We paired it with a strict matcher that only accepted strings shaped like real registrations, validated against the national prefix codes, plus a guard that only committed when a single confident candidate survived.

Best case, pure OCR topped out around 78 to 80 percent precision. That is not good enough to write to a live database. One wrong tail number in five corrupts dedup, rotates URLs, and misleads buyers.

The obvious fix is to validate against a registry of real aircraft so a misread gets rejected. We tried it, using the free OpenSky database of about 514,000 aircraft. It barely helped. Here is the subtle and important reason:

A one-character OCR error usually lands on another valid registration.

N35CT misreads as N35C, and N35C is a real, registered aircraft. S5-DSG becomes S5-OSG. N747KE becomes N747XE. The error space overlaps the valid space almost completely. A registry tells you a string is plausible. It cannot tell you it is the right one. Any validation layer whose valid set overlaps your error set will hand you false confidence.

This is the trap to internalize. For densely packed identifier spaces like tail numbers, VINs, and serials, checksums and registries catch typos that fall outside the valid set, but a large fraction of OCR errors fall inside it. You cannot validate your way to safety.

Learning 3: The winning shape is OCR locates, the LLM reads

The architecture that worked separates two jobs that everyone instinctively bundles together:


So we used OCR purely as a locator. It scans the photo, finds boxes whose text looks registration-shaped even when the exact characters are garbled, and we crop a tight region around each candidate. Only that small crop goes to the vision LLM, which reads the exact characters.

This split is the whole point, for three reasons.


If you take one architectural idea from this post, take this one. Do not ask a single model to both find and read. Let cheap tooling find, and spend your AI budget only on reading, only where it is needed.

Learning 4: The model tier matters more than you think, and cheapest is not best value

Our first instinct for the cheap vision reader was the cheapest vision model available. It gave about 74 percent precision on single images, barely better than free OCR, because it made the same one-character mistakes on stylized, angled, partially occluded tail numbers.

We then ran a proper bake-off across model tiers on the exact cases the cheap model got wrong. A mid-tier reasoning model fixed most of them. Of 19 hard errors, it turned 5 into correct reads and, just as valuable, turned another 5 into honest "I am not sure" skips instead of confident wrong answers.

Now the counterintuitive economics. With the OCR-locates-first architecture, your cost is dominated by image input tokens on a tiny crop, not by the model's per-token price. Moving from the cheapest model to a much stronger one took our projected backfill cost from a couple of dollars to still under thirty dollars. For a one-time job that writes to production, paying ten times more per token on a tiny payload to eliminate wrong answers is obviously correct.

Two practical notes that cost us time. First, we assumed bigger crops and higher resolution would fix the misreads. We tested it. It did not. The model, not the pixels, was the bottleneck, so always test the assumption before building around it. Second, reasoning models use a different API shape, with no temperature, a cap on completion tokens, and a reasoning-effort setting. Budget an hour for the plumbing, and set reasoning effort to minimal for a read-this-text task or you will pay for invisible thinking tokens.

Learning 5: For live writes, precision beats recall, and "skip" is a feature

A demo that reads tail numbers and a system that writes them to production are different products. The demo optimizes recall, so it can say look how much it found. The production system optimizes precision, because a wrong write is far more expensive than a missing one.

We layered three gates, each tuned on the labeled set, and every one of them reduces yield on purpose.


The mindset shift is that a model saying "I am not sure" is doing its job. Our chosen reader skipped when uncertain, and we treated that as a positive outcome. About 17 percent of missing-registration listings got filled. The other 83 percent were left untouched, with no regression and no garbage written.

Learning 6: Profile before you parallelize. We guessed wrong twice.

When the bulk run was slow, our instinct was that the LLM calls were the bottleneck, so we should batch them or parallelize them. Both instincts were wrong, and ten minutes of profiling saved hours.


The fixes were correspondingly unglamorous and effective: short download timeouts, concurrent downloads per listing, a lighter OCR recognizer, a lower localization resolution, and process-level sharding across CPU cores. End to end, about 2,000 listings processed in roughly an hour for $0.29 in API cost.

The lesson is that "make it faster" almost never means what you assume. Measure the phases. Our slowest phases were network I/O on bad URLs and CPU OCR on cluttered images, neither of which a fancier model or a batch endpoint would have touched.

Learning 7: The boring infrastructure will eat your day, and that is normal

A representative sample of what actually consumed time, none of it AI work:


If you are scoping an AI-on-marketplace project, assume the model is the easy 20 percent. The data plumbing, the heterogeneous sources, and the environment quirks are the other 80.

Learning 8: When you write to production, make every change reversible

This data went into a live marketplace, so the rollout discipline mattered as much as the model.


What we would tell another marketplace team

  1. Mine your own filled rows for a free, honest test set before you build.
  2. For identifier extraction, do not trust validation or checksums alone. Errors often land on other valid identifiers. Cross-evidence, meaning several photos that agree, beats single-shot validation.
  3. Split finding the text, which is cheap or free, from reading the text, which is the AI job. Spend the AI budget only on small crops, only where there is something to read. This is what makes per-item cost almost disappear.
  4. Do not reflexively pick the cheapest model. On a tiny crop, a stronger model costs cents more and removes the confident-but-wrong answers that actually hurt you.
  5. Optimize for precision, not recall, when writing to production. A model that abstains is a feature. Leave the hard ones blank.
  6. Profile before optimizing. Your bottleneck is probably network I/O or CPU preprocessing, not the model.
  7. Dry-run, tag, and make every automated write reversible.

The headline result: a backfill that the obvious frontier-model-on-every-image approach would have made too costly to consider, done for under a dollar, at a precision we were comfortable writing straight to production. We got there by being deliberate about which 15 percent of the work actually needed the expensive model.

The model was never the hard part. Knowing exactly where to point it was.