Building an LLM-Based Ad Server

An ad server has a few milliseconds to decide what to put in front of a reader. For years that decision leaned on cheap signals — keywords, URL patterns, page metadata — that only approximate what a page is really about. Large language models change the economics of understanding. Put one in the serving pipeline and the server can read a page the way a person would: grasp the topic, judge whether it’s safe for a brand, and score how relevant a given ad actually is — then attach all of that to the bid before it ever reaches the DSP.

This post walks through one practical architecture for an LLM-based ad server, and the four jobs the model does inside it: content understanding, category extraction, brand safety, and ad relevance scoring. It closes with the engineering reality of keeping a slow, expensive model out of a fast auction.

01 — ARCHITECTURE: The serving pipeline

The shape is a standard programmatic stack with one new component bolted in. A Publisher sends the page and request context to the Ad Server. Instead of going straight to auction, the server consults an LLM Context Engine that turns raw page content into structured signals. Those signals flow into Bid Enrichment, which attaches them to the OpenRTB bid request, and the enriched request goes out to the DSP Auction — where buyers now bid with far more context than a bare URL.
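As a rough illustration of the Bid Enrichment step, the sketch below attaches pre-computed signals to an OpenRTB-style request held as a plain dictionary. It assumes the categories go in the `site.content.cat` array and that everything without a standard field goes under an `ext` object. The `llm_context` key, the category IDs, and the shape of `signals` are all hypothetical, chosen only for this example.

```python
import copy


def enrich_bid_request(bid_request: dict, signals: dict) -> dict:
    """Return a copy of an OpenRTB-style request with context signals attached."""
    enriched = copy.deepcopy(bid_request)  # never mutate the caller's request
    content = enriched.setdefault("site", {}).setdefault("content", {})

    # Content categories as taxonomy IDs (the IDs used below are placeholders).
    content["cat"] = [c["id"] for c in signals["categories"]]

    # Hypothetical extension block for signals that have no standard field.
    ext = content.setdefault("ext", {})
    ext["llm_context"] = {
        "category_confidence": {
            c["id"]: c["confidence"] for c in signals["categories"]
        },
        "brand_safety": signals["brand_safety"],
    }
    return enriched


request = {"id": "req-1", "site": {"page": "https://example.com/ev-review"}}
signals = {
    "categories": [
        {"id": "CAT_AUTOMOTIVE", "confidence": 0.91},
        {"id": "CAT_ELECTRIC_VEHICLES", "confidence": 0.78},
    ],
    "brand_safety": {"risk": "low"},
}
enriched = enrich_bid_request(request, signals)
```

Working on a deep copy keeps the original request intact, so a failure during enrichment can fall back to sending the request unchanged.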

02 — UNDERSTANDING: GPT for content understanding

The first job is comprehension. A GPT-style model reads the page’s title, headings and body and builds a semantic picture: what the article is about, the entities it mentions, the tone. This is where it improves most on keyword matching: it can tell “Apple” the company from the fruit, and “shooting a scene” from a crime report, because it interprets words in context rather than matching isolated keywords.

In practice you don’t hand the model a raw HTML dump. You strip boilerplate (nav, ads, footers), keep the main content, truncate to a sensible token budget, and prompt for a structured response. The output of this stage is a compact semantic profile that the next three stages consume.
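The preprocessing steps above can be sketched with nothing but the standard library. This is a minimal version under a few assumptions: boilerplate is identified purely by tag name, the token budget is approximated by a word count, and the prompt wording and JSON keys are invented for illustration. A production pipeline would likely use a proper readability extractor and the model's own tokenizer.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate (an illustrative choice).
SKIP_TAGS = {"script", "style", "nav", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any of the SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def build_prompt(html: str, max_words: int = 600) -> str:
    """Strip boilerplate, truncate to a word budget, and wrap in a prompt."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.chunks).split()
    text = " ".join(words[:max_words])  # crude stand-in for a token budget
    return (
        "Describe the page below as JSON with the keys "
        '"topic", "entities" and "tone".\n\n'
        f"PAGE:\n{text}"
    )


page = (
    "<html><nav>Home | About</nav><h1>EV review</h1>"
    "<p>The new sedan has a long range.</p><footer>Copyright</footer></html>"
)
prompt = build_prompt(page)
```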

The trick isn’t calling an LLM during the auction. It’s having already called it — and serving the answer from cache.

03 — CATEGORY EXTRACTION: Mapping content to a taxonomy

Understanding only becomes useful when it’s machine-readable. So the next step maps the semantic profile onto a fixed taxonomy — the IAB Content Taxonomy is the common choice — as multi-label categories with confidence scores. A page might come back as Automotive 0.91, Electric Vehicles 0.78.

The engineering tip that matters here: constrain the output to the taxonomy. Give the model the allowed category list (or an enum / JSON schema) and validate what comes back, so you never enrich a bid with a hallucinated category. Pass the top few categories downstream; drop anything below a confidence floor.
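A minimal validation sketch for that tip follows. It assumes the model returns a JSON array of objects with `category` and `confidence` keys, which is a format made up for this example, and the allowed set is a tiny illustrative subset rather than the real taxonomy. Anything malformed, off-list, or below the floor is dropped.

```python
import json

# Illustrative subset only; a real deployment would load the full taxonomy.
ALLOWED_CATEGORIES = {"Automotive", "Electric Vehicles", "Travel", "Personal Finance"}


def validate_categories(raw_json: str, floor: float = 0.5, top_k: int = 3):
    """Keep only allowed categories at or above the floor, best first."""
    try:
        items = json.loads(raw_json)
    except json.JSONDecodeError:
        return []  # unparseable output enriches nothing
    if not isinstance(items, list):
        return []

    valid = []
    for item in items:
        if not isinstance(item, dict):
            continue
        name = item.get("category")
        conf = item.get("confidence")
        if not isinstance(name, str) or name not in ALLOWED_CATEGORIES:
            continue  # hallucinated or malformed category
        if isinstance(conf, bool) or not isinstance(conf, (int, float)):
            continue  # confidence must be a real number
        if not floor <= conf <= 1.0:
            continue  # below the floor or out of range
        valid.append((name, float(conf)))

    valid.sort(key=lambda pair: pair[1], reverse=True)
    return valid[:top_k]


model_output = json.dumps([
    {"category": "Automotive", "confidence": 0.91},
    {"category": "Electric Vehicles", "confidence": 0.78},
    {"category": "Flying Cars", "confidence": 0.95},  # not in the taxonomy
    {"category": "Travel", "confidence": 0.2},        # below the floor
])
categories = validate_categories(model_output)
```

Validating after the call is still worthwhile when the provider supports schema-constrained output, because it also enforces the confidence floor and the top-k cut.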

04 — BRAND SAFETY: Flagging unsafe content before the bid

The same model can classify a page against a brand-safety framework (for example the GARM floor and suitability tiers): violence, hate, adult, illegal activity, and so on, each with a severity. The ad server can then block unsafe inventory outright, down-weight it, or simply pass a safety signal so each advertiser’s DSP applies its own thresholds.

This catches what blocklists miss — sarcasm, quotation, and context that a banned-words list reads as unsafe when it isn’t (and vice-versa). The caveat is calibration: over-blocking quietly destroys yield, so tune your thresholds and keep a human-review path for the grey area rather than trusting a single score.
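One way to express that policy is a pair of thresholds with a grey band between them. The sketch below is only an illustration: the 0 to 1 severity scale, the threshold values, and the action names are assumptions, and real values would come out of calibration against labelled pages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyFlag:
    category: str    # e.g. "violence", "adult"
    severity: float  # hypothetical scale: 0.0 benign to 1.0 severe


def safety_action(flags, review_at: float = 0.4, block_at: float = 0.8) -> str:
    """Map the worst flag on a page to one of three actions."""
    worst = max((f.severity for f in flags), default=0.0)
    if worst >= block_at:
        return "block"                  # withhold the inventory outright
    if worst >= review_at:
        return "downweight_and_review"  # grey area: also queue for a human
    return "pass_signal"                # let each DSP apply its own threshold


flags = [SafetyFlag("violence", 0.55), SafetyFlag("adult", 0.10)]
action = safety_action(flags)
```

Raising `review_at` shrinks the review queue at the cost of letting more borderline pages through untouched, and lowering `block_at` trades yield for caution.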

05 — RELEVANCE: Ad relevance scoring

Finally, the engine scores fit: given the page’s semantic profile and a candidate creative or advertiser category, how relevant is this ad, on a 0–1 scale? That score rides along in bid enrichment so the bidder can lean into strong matches and ease off weak ones.

You don’t need a full LLM call per candidate to do this. Embed the page and the creative once, compare them with cosine similarity, and you get a fast, cheap relevance score that scales to the candidate set — reserving the heavyweight model for the understanding and safety stages.
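Here is a small sketch of that comparison in plain Python. The three-dimensional vectors are toy stand-ins for real embeddings, which would come from whichever embedding model you use. One detail worth noting: cosine similarity ranges from -1 to 1, so the score has to be mapped onto the 0 to 1 scale. Clamping negatives to zero, as done here, is just one reasonable choice.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as unrelated
    return dot / (norm_a * norm_b)


def relevance_score(page_vec, ad_vec) -> float:
    """Cosine similarity clamped into the 0-1 range used for bid enrichment."""
    return max(0.0, min(1.0, cosine_similarity(page_vec, ad_vec)))


# Toy embeddings; real ones would have hundreds of dimensions.
page_embedding = [0.9, 0.1, 0.0]
candidates = {
    "ev_charger_ad": [0.8, 0.2, 0.0],
    "loan_ad": [0.0, 0.1, 0.9],
}
ranked = sorted(
    candidates,
    key=lambda name: relevance_score(page_embedding, candidates[name]),
    reverse=True,
)
```

Because the page embedding is computed once and each creative embedding can be stored with the creative, scoring a candidate set costs only a dot product and two norms per ad.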

06 — ENGINEERING: Keeping the model off the hot path

Here’s the constraint that shapes the whole design: an LLM call takes hundreds of milliseconds and costs money, and the auction budget is under 100 ms. You cannot call GPT synchronously inside the bid request. So you don’t.

Pre-compute and cache. Run the LLM Context Engine at crawl or first-seen time, key the result by URL or content hash, and refresh only when the content changes.
Serve from cache. The auction reads the pre-computed signals synchronously; the model never sits in the request path.
Use light models on the hot path. Embeddings and small distilled classifiers handle anything that must run live; the large model runs offline.
Fail safe. Set timeouts and degrade gracefully to keyword/URL signals if enrichment is missing, so a slow model never blocks a bid.
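The pre-compute, serve-from-cache and fail-safe rules fit together roughly as follows. This is a sketch using an in-memory dictionary, with a stub function standing in for the slow model. The class, function and key names are hypothetical, and a real system would more likely put the cache in a shared store with a short lookup timeout.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ContextCache:
    """Signals keyed by URL, refreshed only when the page content changes."""

    def __init__(self, analyze):
        self._analyze = analyze  # slow, expensive call; offline path only
        self._entries = {}       # url -> (content hash, signals)

    def refresh(self, url: str, text: str) -> bool:
        """Offline path. Returns True if the model had to be called."""
        digest = content_hash(text)
        entry = self._entries.get(url)
        if entry is not None and entry[0] == digest:
            return False  # content unchanged, keep the cached signals
        self._entries[url] = (digest, self._analyze(text))
        return True

    def lookup(self, url: str):
        """Hot path. A dictionary read, never a model call."""
        entry = self._entries.get(url)
        return entry[1] if entry is not None else None


def url_keyword_signals(url: str) -> dict:
    """Cheap fallback built from the URL alone (toy keyword map)."""
    keywords = {"auto": "Automotive", "travel": "Travel"}
    cats = [cat for kw, cat in keywords.items() if kw in url.lower()]
    return {"categories": cats, "source": "url_keywords"}


def signals_for_bid(cache: ContextCache, url: str) -> dict:
    cached = cache.lookup(url)
    if cached is not None:
        return {**cached, "source": "llm_cache"}
    return url_keyword_signals(url)  # degrade gracefully, never block the bid


calls = []

def fake_llm(text: str) -> dict:  # stand-in for the real model call
    calls.append(text)
    return {"categories": ["Electric Vehicles"]}

cache = ContextCache(fake_llm)
first = cache.refresh("https://example.com/ev", "EV review v1")
second = cache.refresh("https://example.com/ev", "EV review v1")
hit = signals_for_bid(cache, "https://example.com/ev")
miss = signals_for_bid(cache, "https://example.com/auto/new-suv")
```

Storing the content hash next to the signals is what lets a re-crawl skip the model call when nothing on the page has changed.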

07 — OUTCOME: Richer requests, better auctions

An LLM-based ad server doesn’t replace the auction; it feeds it. Content understanding, category extraction, brand safety and relevance scoring turn a thin bid request into a rich one, and richer requests tend to clear at better prices for publishers and deliver better outcomes for advertisers. The engineering art is simply keeping the model where it belongs: understand offline, enrich in advance, and serve the answer from cache.