When it comes to integrating Large Language Models (LLMs) into your marketing strategy, Paid Search is the perfect place to start. At its core, campaign management boils down to text and data: exactly what AI handles well. The main bottleneck for PPC specialists is typically the complexity of managing large, multi-account structures. The question isn’t whether LLMs can help, but how.

This article focuses on the reasoning side: LLMs automating tasks like search term and keyword management, product feed optimisation, ad copy automation, landing page optimisation, competitor monitoring. The math-driven side (tROAS bidding, budget allocation, pLTV modelling for measurement) plugs into the same plumbing but runs on statistical or ML engines and deserves its own deep dive per topic.

Most people start by wiring a coding agent like Claude Code, Codex, or Cursor directly to Google Ads (e.g. with MCP). Great for ad-hoc work, fragile at scale: same prompt yields different output across runs, tokens stack with every turn, no safety gates on bulk edits, and it stops the moment you stop typing. A Python pipeline does the same thing the same way every time (at a fraction of the token cost).

What works at scale: AI systems in Python, running on a schedule, with every decision logged in BigQuery. One pipeline run can classify 50,000 search terms in a batch, route the outputs to the right campaigns, and queue a Monday digest for review. SQL and models do the math, the LLM does the judgment, and the work happens whether you log in or not.

These systems can already largely be built by AI. The two skills that matter most: writing specs (what to build, what correct looks like, how to verify it) and mastering evals. Get good at both and you can build and maintain AI systems without writing most of the code yourself. More on spec-driven development in my Claude Code Guide.

Where to start using AI

Prioritize the tasks where volume is too high for humans to keep up. Imagine you hired a team of specialists, never tired and never complaining, what repetitive judgment work would you hand them?

Not every decision needs an LLM. Threshold-based pausing, anomaly detection, bid adjustments, budget shifts, pLTV scoring, geo-test readouts: SQL and statistical models handle those cleanly, every time, with no prompt to drift. Reserve the LLM for the calls those can’t make: reading intent in a search term, deciding whether two products belong in the same theme, writing ad copy in a brand voice.

Used smartly, it’s how you take back control of the surface Google is wrapping in black boxes (Performance Max, Broad Match, AI Max for Search).


Give the model your business context

Without context, an LLM behaves like a new PPC analyst on their first day. Sharp, fast, knows the platforms, but missing the details that make their output strategically right: campaign strategy, success criteria, acceptable trade-offs, hard limits.

Context comes in two layers, and you need both.

Shared context lives in one central place: account strategy, KPIs, products, business rules, brand voice. Maintained centrally, reused everywhere. Your coding agents read it during builds. Your pipelines pull it into prompts at runtime:

  • Goals and KPIs. What each campaign is actually optimized against. tROAS targets, CPA limits, new-customer share, lifetime value tiers, the metrics that matter and the ones to ignore.
  • Account strategy. Why the account is structured the way it is. Why certain campaigns exist, which themes are deliberate, what each campaign should and shouldn’t capture.
  • Products and offers. Categories, pricing tiers, seasonal items, loss leaders, protected SKUs, what’s allowed in non-brand vs. brand campaigns.
  • Business rules and guardrails. Branded vs. non-brand budget split, competitor strategy, exclusion lists, hard caps, the things an experienced operator does without thinking.
  • Voice and brand. Tone for ad copy, claims you can and can’t make, terminology that has to be exact.

Task-specific context lives in the prompts that drive specific output. Each one is a Markdown file owned by its pipeline, layered on top of the shared docs at runtime. More examples further down.

Documenting context properly is the single biggest lever on output quality.

Markdown files in the repo are usually enough. For context that lives elsewhere, pull it in via the Confluence or Notion API. For context too large to fit inline, stand up a vector store (Pinecone, Supabase, pgvector) and retrieve the relevant chunks at runtime.

If your warehouse runs on dbt, the dbt project doubles as agent context for free: column descriptions, model lineage, and tests are already documented in schema.yml. An agent reading the project understands what each table means without you re-explaining it in prose.

Treat the docs like code: review, update, delete what’s stale. Stale docs are worse than none, especially for fast-moving stuff (promotional calendars, inventory, margin targets) where the system will confidently execute last month’s strategy. A last_reviewed tag in each file header keeps ownership explicit; a CI check that flags docs untouched in 90 days is cheap to add and removes the reliance on remembering.


The data foundation: BigQuery and Data Transfer

Next, LLMs need access to your performance data. The Google Ads Data Transfer syncs daily snapshots of every entity in your account: campaigns, ad groups, keywords, search terms, conversions, performance stats, tROAS targets, budget settings, audiences. All of it, queryable with SQL.

Your coding agents use it interactively while building: writing SQL, running it, fixing errors, prototyping derived tables, exploring before they commit anything. Your production pipelines use it on a schedule: same tables, structured input for the next decision.

The Data Transfer also gives you full history, not just the current state. tROAS targets, target CPAs, budgets, keyword statuses, paused ads, label changes, bid adjustments, all captured over time. You can ask “what was this campaign’s tROAS six months ago and how did performance respond” and actually get an answer.

Besides the Google Ads Data Transfer, you can feed any other data from your warehouse directly into LLMs for decision making: revenue, margins, pLTV, forecasts, product info.


The pattern: BigQuery → decision engine → BigQuery

A workflow built on this foundation typically takes the same shape:

BigQuery → fetch data → run decision logic → write back to BigQuery → take action

The decision logic is where the use cases differ. Sometimes it’s a SQL rule or a deterministic Python function. Sometimes it’s a statistical model. Sometimes it’s an LLM with a prompt. Reach for the LLM last: every step you can keep in SQL, rules, or a model running in BigQuery is one less source of prompt variance. The infrastructure around the decision is identical regardless.

# LLM-driven (search terms, ad copy, classification)
data    = bq_client.fetch("get_search_terms.sql")    # pull data
prompt  = open("prompts/classify-main.md").read()    # load instructions
results = llm_client.classify(prompt, data)          # LLM recommends
bq_client.write(results)                             # write back

# Statistical or rule-based (anomaly detection, budget models)
data    = bq_client.fetch("daily_spend.sql")
results = detect_anomalies(data)                     # z-scores, regression, custom logic
slack.send(results)                                  # notify, or write to BQ, or call an API

Calling the LLM itself is the trivial part. Sign up with Anthropic, OpenAI, or Google, grab an API key, and call the model in a few lines of Python. The substance is everywhere else: the prompt, the data shape, the schema check, what you do with the output.

Coding agents shine here. Staging tables, MERGE statements, schema migrations, chained pipelines: bread and butter. Guardrails matter though, agents can DROP the wrong table. Scope service accounts to specific datasets, snapshot before destructive writes, require dry-runs, gate anything irreversible. You spec the system, the agent builds it.

BigQuery handles most of this for PPC, but the same pattern works with any source and destination: feeds, scrapes, APIs, files. Pull from where the data lives, run a decision, write to where it needs to go.

Wrapping it all in a Python pipeline

A “pipeline” is a Python script that wraps a small set of components and runs them in order. Each one is small, replaceable, and testable in isolation.

┌─────────────────────────────────────────────────────────┐
│  Python Pipeline (run on a schedule, e.g. daily cron)   │
│                                                         │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐  │
│  │ .sql    │   │ .md     │   │ schema  │   │ APIs/   │  │
│  │ queries │   │ prompts │   │ .json   │   │ clients │  │
│  └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘  │
│       │             │             │             │       │
│       ▼             ▼             ▼             ▼       │
│   ┌─────────────────────────────────────────────────┐   │
│   │ pipeline.py                                     │   │
│   │  1. fetch input   ── BigQuery + API clients     │   │
│   │  2. build request ── load prompt + format data  │   │
│   │  3. call engine   ── LLM client / stats / rules │   │
│   │  4. validate      ── schema check, retries      │   │
│   │  5. write output  ── staging table + MERGE      │   │
│   │  6. notify        ── Sheet / dashboard / Slack  │   │
│   └─────────────────────────────────────────────────┘   │
│                              │                          │
│                              ▼                          │
│                        BigQuery (output table)          │
└─────────────────────────────────────────────────────────┘

The structure is simple. Most of the substance lives outside Python itself: in the SQL (what data to pull), the prompts (what the model should do), the schema (what valid output looks like), the API clients (every tool, platform, or scraper you wire in), the BigQuery tables (the data and the audit log). Python is the glue.

Where it lives and how it runs: the whole pipeline (Python, SQL, prompts, configs) sits in a git repo on GitHub or GitLab. For a handful of pipelines on a simple schedule, GitHub Actions or GitLab CI/CD handle the daily cron, store secrets, and keep run logs without a separate orchestrator. Trigger any pipeline manually from the CLI for ad-hoc runs, or have a coding agent run it while debugging or experimenting.

Once pipelines start chaining (output of one becomes input of the next), backfills become routine, or you need per-task observability, the standard path is to move to a proper orchestrator: Airflow (commonly managed via Astronomer or Cloud Composer), Prefect, or Dagster. For the SQL layer specifically, dbt is the default for managing derived tables, tests, and lineage in BigQuery; the orchestrator just calls dbt run as one task in a larger DAG.


What you can build with this setup

The surface area opens up fast once this is in place, and it stops being just PPC. Every tool you touch ships an API. A few starting points:

Search term and keyword management. SQL filters and ranks by whichever performance signal matters most (highest spend without conversions, highest converting unclassified terms). The LLM only classifies what’s left, a human approves, the system pushes to Ads. Walked through in detail below.

Shopping feed optimisation. Pull feed data from a feed management tool (Channable, Productsup, Feedonomics) or your PIM, join it against shopping performance to surface products losing impression share or missing margin targets. The LLM only enriches that subset: titles, categories, custom labels, GTINs. Performance picks the targets, the prompt does the writing.

Ad copy automation and test synthesis. SQL ranks winners and losers in BigQuery (CTR, conversion rate, share by ad slot). The LLM extracts themes from those (hooks, CTAs, claims), generates new headlines and descriptions against product attributes and voice rules, scores candidates against a creative brief. Candidates land in BigQuery for human review.

Landing page alignment. Rank LPs by paid spend and conversions first, so the pipeline only scrapes the highest-impact pages. The LLM classifies each one against the keywords and ads routing traffic to it and flags mismatches: transactional queries on glossary pages, branded queries on generic homepages, ads promising a discount the LP doesn’t show.

Competitor monitoring. Pull competitor ads (auction insights, ad libraries, search scrapes) and their landing pages, overlay your own impression share and CPC trend so shifts surface where they actually hurt you. The LLM classifies positioning and messaging changes, SQL ranks which shifts matter. Manual review caps at a handful of competitors. A pipeline scales to dozens.

PMax and Demand Gen optimisation. Google keeps tightening what these products expose, and the available signal is often limited. Work with what’s there: search terms where visible, assets, placements, audiences. Classify themes, flag brand-safety risks, and use exclusion lists and asset groups to steer what you can.

Analyst agents and reporting. Scheduled weekly or monthly insight runs (pull from BigQuery, summarise, post to Slack or Notion). Same plumbing, human consumes the output instead of the account.

Embeddings as a cost-saving layer. Cosine similarity between a search term and its matched keyword flags loose broad-match decisions cheaply, and can pre-filter what reaches the LLM classifier at scale. BigQuery has ML.GENERATE_EMBEDDING and VECTOR_SEARCH built in.

Analytical and ML use cases plug in too. The same pipelines wrap statistical and ML models: pLTV-aware bidding via Offline Conversion Import, lead-scoring measurement and audience building, budget and tROAS automation, demand forecasting and inventory-aware bidding, scenario modelling, experimentation frameworks (geo tests, holdouts, MMM). Different math, same plumbing.

The use cases above are just the starting point. The deep dive below walks search term management end-to-end as a concrete example.


Example: classifying search terms for Nike running

Search term classification is a good place to start. It’s the densest decision surface in paid search: thousands of terms a month, each one a small business decision (relevant? off-target? bid on it? block it? in which campaign/ad group?) that compounds at scale. Most accounts can’t keep up with the volume, so most of those decisions never get made. Closing that gap is exactly the kind of work these systems do well.

The setup: non-brand campaigns targeting people searching for running shoes, apparel, and accessories, with brand searches routed to dedicated brand campaigns.

How the system fits together

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Search terms │───▶│   Classify   │───▶│ Human review │───▶│ Push to Ads  │
│  (BigQuery)  │    │   + route    │    │              │    │ (kw or neg)  │
└──────▲───────┘    └──────▲───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │                   │
       │  new search       │  prompt           │ rejections        │
       │  term data        │  improvements     ▼                   │
       │                   │               ┌──────────────┐        │
       │                   └───────────────│  Eval loop   │        │
       │                                   └──────────────┘        │
       │                                                           │
       └───────────────────────────────────────────────────────────┘

The classification pipeline follows the pattern from earlier:

  1. BigQuery in. Pull unclassified search terms from the Data Transfer, filtered and ranked in SQL by whichever performance signal matters (spend without conversions, conversion volume, recency, account-specific thresholds). The prompt never sees a term the numbers haven’t already justified looking at. Skip data that hasn’t matured yet: yesterday’s “zero conversion” terms could become tomorrow’s best performers.
  2. Business logic as a prompt. A Markdown file defines what to do with each term: intent scoring, audience detection, category relevance, branded vs. competitor handling, the action (ADD_KEYWORD, ADD_NEGATIVE, TO_REVIEW), match type, target campaign, target ad group. Your domain expertise lives here, not in Python.
  3. Call the LLM. Send a batch with the prompt. Validate the structured output against a schema. Uncertain terms get enriched (web search, internal docs) and reclassified.
  4. Decisions back to BigQuery. Write to staging, MERGE into production. Every recommendation is queryable: input, prompt hash, output, run ID, timestamp. Log the hash, not just a label: a pinned hash lets you replay any historical run and bisect exactly when a regression appeared.
  5. Human review. Approved recommendations push to Google Ads as ADD_KEYWORD or ADD_NEGATIVE via the API. Rejections, with a reviewer comment, flow into the eval pipeline.

Pulling search terms from BigQuery

Step 1 in code. Pull unclassified broad match terms from the Data Transfer, ordered by spend so the highest-cost terms get processed first:

-- Fetch unclassified broad match terms, ordered by spend descending.
-- search_query_stats: from Google Ads Data Transfer.
-- search_term_classification: written by this pipeline, starts empty.
SELECT
  sq.search_term_view_search_term AS search_term,
  SUM(sq.metrics_cost_micros) AS total_cost_micros
FROM
  `your-project.your_dataset.search_query_stats` AS sq
  LEFT JOIN `your-project.your_dataset.search_term_classification` AS cls
    ON sq.search_term_view_search_term = cls.search_term
WHERE
  cls.search_term IS NULL                              -- not yet classified
  AND sq.segments_search_term_match_type != 'EXACT'   -- broad match only
GROUP BY
  sq.search_term_view_search_term
HAVING
  total_cost_micros > 5000000                          -- >$5 spend threshold
ORDER BY
  total_cost_micros DESC
LIMIT
  100;

The LEFT JOIN against your own classification table makes the pipeline idempotent: terms already classified don’t come back. The spend threshold keeps cost in check (Google Ads stores costs in millionths of the currency unit, so 5000000 means $5). Adjust to your account size.

Prompt engineering

The prompt does most of the work.

Sequential, decomposed steps. Walk each term through a numbered process: confidence check first (return UNKNOWN if guessing), then intent score, audience, category relevance, brand/competitor flags, then the action, match type, and final keyword text. Hard rubrics inside each step keep it consistent across thousands of runs.

A trimmed version of the Nike running classification prompt. Production prompts run longer than this, with detailed category definitions, full competitor lists, and worked examples in the reference material.

The STEPS 0-13 logic is reusable as-is across accounts. What you customize is the REFERENCE MATERIAL block at the bottom: products, audiences, competitors, branded terms.

### ROLE & GOAL
You receive a list of 100 search terms from the Nike Running Google Ads account.
Classify each one for Nike non-brand running campaigns. For each
term, follow the strict sequential process below using your own
knowledge in combination with the REFERENCE MATERIAL.

---

### STEP 0 — Confidence Check (no exceptions)

Make a binary decision: do you KNOW this term with certainty, or NOT?
Flag as UNKNOWN (competitor: 99) if ANY is true:
- Multiple plausible meanings and you're not 100% sure which is intended
- You don't explicitly recognise the exact term as a known brand or product
- Your reasoning includes "probably", "likely", "seems", "might be"
- You are inferring meaning from parts rather than knowing the whole term

UNKNOWN output — stop, return only:
{"search_term": "...", "competitor": 99}

KNOWN terms proceed to Steps 1-10.

---

### STEPS 1-10 — Classification (KNOWN terms only)

1. commercial_intent_score (1-10)
   1-2: irrelevant, off-topic, no running context
   3-5: informational, how-to, training tips — no buying intent
   6-8: evaluation, comparison, product research
   9-10: transactional — specific product, buy, discount, near me

2. audience — one of: men, women, kids, unknown

3. competitor (0/1) — does the term mention a competing brand?
   [See REFERENCE MATERIAL for full competitor list]

4. score_shoes (1-10)        — relevance to running footwear
5. score_apparel (1-10)      — relevance to running clothing
6. score_accessories (1-10)  — relevance to running accessories
7. score_training (1-10)     — relevance to training, recovery, activity

8. justification — one sentence explaining scores 1-7
9. language — language code (en, de, fr, nl, ...)
10. branded (0/1) — does the term mention Nike?
    [See REFERENCE MATERIAL for full branded term list]

---

### STEP 11 — Recommended Action (priority order)

Exception first: comparison queries ("brand A vs brand B") → TO_REVIEW
regardless of branded/competitor flags. Human judgment required.

1. Branded (branded=1)       → ADD_NEGATIVE (always — protect non-brand)
2. Competitor (competitor=1) → TO_REVIEW (team decides on competitor strategy)
3. Standard terms:
   - commercial_intent_score >= 6 → ADD_KEYWORD
   - commercial_intent_score <= 5 → ADD_NEGATIVE
   - Borderline or ambiguous      → TO_REVIEW

---

### STEP 12 — Match Type

ADD_KEYWORD → match type is a per-account config decision; set it in your routing rule, not here. BROAD is a reasonable default for Smart Bidding accounts with room to grow, but tighter structures warrant PHRASE or EXACT for proven terms.
ADD_NEGATIVE:
  - Branded terms                              → EXACT
  - How-to / instructional queries             → EXACT (avoid overblocking)
  - Terms with running-adjacent words          → EXACT
  - Completely irrelevant (no running context) → PHRASE
TO_REVIEW → EXACT (human decides)

---

### STEP 13 — Keyword Text

ADD_KEYWORD: original search term, unmodified.
ADD_NEGATIVE:
  - Branded: full original search term
  - Non-branded: refine to the core irrelevant concept
    (strip prefixes like "how to", isolate competitor brand name)
  Risk check: if refined phrase contains running-adjacent words,
  use the full original search term + EXACT instead.

---

### OUTPUT FORMAT

Single JSON array. UNKNOWN: {"search_term": "...", "competitor": 99}
KNOWN — all fields in this order:

{
  "search_term": "best running shoes for women 2026",
  "commercial_intent_score": 8,
  "audience": "women",
  "competitor": 0,
  "score_shoes": 10,
  "score_apparel": 2,
  "score_accessories": 1,
  "score_training": 3,
  "justification": "High-intent category search, women's audience. No brand signal.",
  "language": "en",
  "branded": 0,
  "recommended_action": "ADD_KEYWORD",
  "recommended_match_type": "BROAD",
  "recommended_to_add": "best running shoes for women 2026",
  "recommended_reason": "Score 8, non-brand, core product search. Broad to capture variations."
}

---

### REFERENCE MATERIAL

Company: Nike — non-brand running campaigns
Products: shoes (road, trail, racing flats), apparel (shorts, tights,
  jackets, vests), accessories (socks, bags, hats)
Audiences: men, women, kids
Competitors: [list your direct running brand competitors]
Branded terms: nike, pegasus, vaporfly, dri-fit, [your product lines]

Use the context window. Prompts aren’t only system instructions, they’re where you offload your business logic as well. Prompts grow longer than you’d expect once you’ve encoded every rubric, exception, and worked example a domain expert carries in their head. Use as much context as the task needs, but watch for attention degradation on very long prompts: critical rubrics buried in the middle of a massive prompt get less reliable than ones near the top or bottom.

Cost is rarely the constraint. The gains from tighter routing and negative coverage outpace API costs before you’ve run any optimisation. When it matters: enable prompt caching from the start if your system prompt includes a large REFERENCE MATERIAL block. Anthropic’s prompt caching keeps the cached prefix token cost at roughly 10% of the normal input rate — at the volumes this system implies, that’s material. The Batch API cuts costs further for non-real-time workloads. Smaller models for confident cases once you have signal.

Capture as much as you can per term. Every score and flag becomes input for a downstream pipeline. Audience and category scores feed routing (this term goes to the women’s running campaign, that one to the trail ad group). Commercial intent decides ADD_KEYWORD vs ADD_NEGATIVE. Branded and competitor flags feed negative scoping. Output the full reasoning once, query it from any pipeline that needs it. The marginal cost of extra fields is a few hundred tokens; the downstream reuse is worth a lot more.

Batch to fit the context window. Send terms in batches of around 100 per call. The prompt is the same every batch; only the input array changes. Cuts cost and latency, keeps quality stable across runs.

Web search for the UNKNOWN cases. When the model flags a term as UNKNOWN, send it to a web search API like Tavily or Exa to enrich the context. The web result gives the model a working definition of what the term actually is, and then it reclassifies with that. Catches new brands, regional terms, and competitor mentions the model wouldn’t otherwise know.

Match type is where the prompt above is most opinionated. Broad as default makes sense in accounts on Smart Bidding with room to grow. In tighter structures it creates overlap and muddier attribution. A more controlled approach: exact or phrase for proven terms, broad only for themes you’re actively expanding into. The classifier already outputs intent and category scores per term, so rather than hardcoding match type in the prompt, feed those scores into a separate routing rule that assigns match type based on your account’s setup.

Example output

What the classification pipeline returns after a batch of search terms for a non-branded Nike running campaign:

Search Term Action Match Type To Add Audience Intent Comp Brand S.Shoes S.App S.Acc S.Train Justification
best running shoes for women 2026ADD_KEYWORDBROADbest running shoes for women 2026Women80010213High-intent category search, women's audience. Core product query.
trail running shoes women waterproofADD_KEYWORDBROADtrail running shoes women waterproofWomen90010112High-intent product query with feature modifier. Women's audience, shoes category.
waterproof running jacket mensADD_KEYWORDBROADwaterproof running jacket mensMen70011022High-intent apparel query, men's audience.
kids running shoes lightweightADD_KEYWORDBROADkids running shoes lightweightKids80010111High-intent product query with feature modifier. Kids audience, shoes category.
nike pegasus 41 mensADD_NEGATIVEEXACT[nike pegasus 41 mens]Men90110111Branded Nike product. Exclude from non-brand campaigns.
running shoe repair near meADD_NEGATIVEEXACT[running shoe repair near me]Unknown2003112Service query, no purchase intent. Running-adjacent words → exact match.
free couch to 5k app downloadADD_NEGATIVEPHRASE"couch to 5k"Unknown1001114Off-topic app download. No product relevance. No overblocking risk → phrase match.
adidas trailrunning shoesTO_REVIEWEXACTadidas trailrunning shoesUnknown81010111Competitor brand (Adidas). High intent, but team decides whether to bid on competitor terms.
how to start running beginnerADD_NEGATIVEEXACT[how to start running beginner]Unknown3002226Informational query, intent score 3. Instructional → exact match to avoid overblocking.

This row-per-term output is the audit log and the input to every downstream pipeline. Audience and category scores route each ADD_KEYWORD to the right campaign and ad group. Match type and keyword text are ready to push to the API as-is. Justifications surface why each term was classified that way (the first place to look when something’s off).

That same table feeds a Monday Slack digest of last week’s recommendations, a dashboard tracking rejection rates, a QA loop pulling the lowest-confidence justifications.

The classifier’s output table feeds the next pipeline in the chain: routing (ADD_KEYWORD terms to the right campaign and ad group) and negative scoping (where each ADD_NEGATIVE should apply: ad group, campaign, or account). Each new decision surface gets its own pipeline, its own prompt, its own evals, plugged into the same data layer and the same review interface. The system doesn’t grow as one giant prompt; it grows as a chain of small specialised pipelines that each do one thing well.

One thing worth noting: the spend > $5 filter that decides what goes into the classifier is a simple spending cutoff. Embedding-based similarity between the search term and its matched keyword can make this smarter, but that’s a separate layer on top of the core pipeline.

These pipelines don’t become accurate by themselves. Evals are how you measure and improve them over time.


Evals: continuous system optimization

LLMs can hallucinate and make wrong decisions. Even with good prompts, good context, and good data, the model will misclassify or ignore a rule you thought you wrote clearly. But that’s part of the operating reality, and something you will gradually improve with evaluations.

An eval is just a measurement: take a small set of inputs where you already know the correct answer (a “gold set”), run your prompt against it, count how often the output matches. That number is your accuracy. The workflow runs in two phases: stabilize the prompt against your gold set before any real traffic, then improve it from real-world feedback once it’s live. If you only invest in one thing, invest here. Hamel Husain’s writing on AI evals is the clearest material I’ve found on the practice.

Start with gold sets for regression and human review for ground truth. Add LLM-as-judge once volume outpaces manual review, pulling rejections from the approval queue as real-world signal. Calibrate the judge against human labels periodically; it drifts. The same term labeled differently across runs is the first sign it needs tightening.

You run two gold sets in parallel: a prod gold set for continuous accuracy measurement, and an experimental gold set for testing fixes against rejected terms. Both grow over time. Experimental terms graduate into prod once a fix holds without regressing the rest, so prod widens gradually as the system learns.

  1. Build a gold set. Start with 25-50 terms with known correct answers, covering your real categories: a clear keyword, a clear negative, a competitor, a branded term, an informational query, an ambiguous case. One type repeated dozens of times tells you nothing. The set grows continuously from here. Expect it to grow into the hundreds before you stop finding new failure modes.
  2. Stabilize the prompt on the gold set. Run it against the same terms ten or more times before going anywhere near production. LLM outputs vary across runs, so a one-off pass at 95% can hide a flaky prompt swinging between 80% and 100%. Iterate the prompt until accuracy is high and consistent across runs. Only then move to real data.
  3. Run real data. Once the prompt is stable, classify a few hundred terms from the live account, push to your review interface, approve or reject with specific notes. “Wrong” tells you nothing. “Classified as shoes but this is an apparel query” is signal.
  4. Open coding. One person, the domain expert, reads every rejection and writes a short note describing what went wrong. No predefined categories. Don’t run this by committee.
  5. Axial coding. Group the notes into failure modes (COMPETITOR_ERROR, CATEGORY_ERROR, MATCH_TYPE_ERROR). Count them. The most frequent one is where you start. An LLM can do the grouping. You keep the judgment.
  6. Targeted experiment. Add the failing terms to a dev gold set. Copy the production prompt, never edit it directly. Change only the section that controls the failure mode. Run it.
  7. Two regression gates. Dev gold set: did the change fix the failures? Prod gold set: did it break anything that was passing (including the original 25-50 terms you started with)? Both pass: merge the prompt change to production and promote the dev terms into the prod gold set. The prod gold set grows with every successful fix, permanently guarding against the failure modes you’ve already solved. Either fails: keep iterating.
  8. Repeat. Early cycles catch big errors fast. Later cycles catch edge cases. At some point you stop finding new failure modes. That’s theoretical saturation, and it’s the signal the prompt is production-ready for the current data.

You do the open coding because humans see things models miss. Claude Code handles everything else: triage, the prompt diff, both regressions, the deltas, the gold set graduation, the commit. A typical loop:

You:    "Run triage on the last 40 rejections and show me the top failure mode"

Claude: → reads eval_feedback (status: RECEIVED)
        → classifies each rejection into a failure mode
        → reports: COMPETITOR_ERROR: 18 (45%) ← top failure mode

You:    "Improve competitor detection. Identify why this happens."

Claude: → reads prompts/classify-main.md
        → identifies Step 0 (confidence check) as the root cause
        → proposes a diff: stricter UNKNOWN criteria + new examples
        → eval --regression --gold-set dev  → 71% → 96%  ✓
        → eval --regression --gold-set prod → 96% → 96%  ✓ no regression
        "Both gates pass. Approve to merge."

You:    approve

Claude: → updates the prompt
        → graduates dev gold set terms into prod
        → commits: "tighten competitor confidence check, 71→96%"

The part you keep is the part that needs taste: deciding which failure modes matter, and reading the rejections without flinching. Evals also tell you when to stop. When new batches stop producing new failure modes, the prompt is done for now. Move on to the next decision surface.

One number hides a lot. Each term has six-plus output fields: intent, audience, category, flags, action, match type. Track per-field precision and recall in your gold set, not a single aggregate accuracy score. A prompt change that fixes COMPETITOR_ERROR can quietly regress MATCH_TYPE on the same terms.


From recommendations to live in your account

A recommendation sitting in your data warehouse doesn’t change anything, no matter how good your evals look. Someone has to push it to the ad platform, and how you do that determines how much risk lives in the system. There are two modes worth running, and you should start with the first.

Manual review, manual application. The system writes recommendations to a review interface (Google Sheets, a custom dashboard, a Slack thread). A reviewer reads each one, approves or rejects with a comment, and applies approved changes in the Google Ads UI or via Editor. Slow, but safe, and the right starting point. Run here until you trust the system’s accuracy on real data.

Manual review, one-click push. Same review interface, same human-in-the-loop, but the approve button calls the Google Ads API and pushes the whole approved batch in one action. The reviewer still owns every decision; the system just handles the mechanics. This is where most teams should land once accuracy is proven.

Keep a human at the gate. Even for boring operations (negatives that match a strict rule, pausing zero-conversion keywords with high spend over an extended window) a 30-second sanity check costs nothing and catches the kind of mistakes that quietly compound. The cost of one bad batch pushed at 3am is much higher than the cost of a daily approval queue.

Connecting to the API. Use the official Google Ads Python client. Authenticate via OAuth refresh tokens stored in a secrets manager, never .env. Test mutations against a test account first. Scope credentials to the minimum customer IDs each pipeline needs. Log every mutation with an execution ID so a single grep can reconstruct what the system did. Plan for partial batch failures (some mutations succeed, others fail on policy or dependencies) and respect object ordering: you can’t add a keyword to an ad group that was removed two seconds earlier.

Anomaly detection sits on top. Whichever tier you run, a separate watchdog has to monitor the account: spend over budget, CPA blowouts, bid or budget shifts that look off, mutations the system shouldn’t have made. Mix statistical detection (z-scores) with fixed rules (auto-pause if spend exceeds 2x daily budget). Not a substitute for human review. The backstop that catches what review misses.


Engineering for production

Every pipeline in this system needs the same engineering rigor underneath it. The patterns below are what make the difference between a script that runs once and a system that runs every day on a real account without losing your team’s trust the first time something breaks. Build them in from day one, not after the first incident.

Tests and CI. The first thing you wire in. Unit tests on pure functions, integration tests against a sandbox dataset, end-to-end tests against a frozen sample, all in CI. Gold sets run on every prompt change; build fails if accuracy drops. Tests verify the code is correct before it ships. Dry runs (below) verify a specific batch is safe when you run it.

Thin API clients. Wrap every external service (LLM, web search, ad platform) in a thin client with no business logic. Swapping providers (one model for another, one search API for another) stays a one-file change instead of a refactor.

Structured output validation. Use the model’s native structured output mode (Anthropic tool schemas, OpenAI structured outputs, or a library like instructor on top of Pydantic) so the model returns valid JSON in the first place. Validate every response against the schema before it touches the warehouse. Invalid objects get logged and dropped or retried, not silently propagated.

Fail soft between clients. An enrichment failure (web search down, a 429 from a side API) shouldn’t abort the pipeline. Return empty, log it, let downstream steps continue with partial context. Failed items rerun next cycle.

Mutation safety. Every UPDATE, DELETE, or external API write goes through a protocol: scope lock (explicit row IDs, never open-ended WHERE clauses), pre-snapshot, pre-count, execute, post-count, validate. Warehouse writes can roll back; external API writes can’t, so store the previous state and reverse with a compensating mutation if validation fails.

Dry runs and --limit. Every pipeline ships with --dry-run (runs everything except the final write) and --limit N (caps input volume). New code, new prompt, new SQL? Dry-run first. Cheap insurance against schema mismatches, off-by-one filters, runaway API costs.

Rate limits and timeouts. External APIs throttle. Set explicit timeouts on every request so a hung connection never blocks the pipeline. Use exponential backoff with jitter on retries. Cap total retries so failures fail loudly instead of silently looping.

Batching and concurrency. LLM and ad platform APIs have request-size limits and per-call latency. Batch inputs into chunks, run batches concurrently with asyncio or a thread pool where rate limits allow, aggregate before writing. Done right, this cuts pipeline runtime from hours to minutes.

Idempotency. Every pipeline must be safe to re-run. If a job crashes halfway, the next run picks up where it stopped, no duplicates. Staging tables with run IDs, MERGE statements with dedup keys, append-only audit logs.

Logging with execution IDs. Every run gets a UUID. Every log line, every BigQuery write, every API mutation includes that ID. When something breaks, you grep one ID and reconstruct the entire run.

Cost and token observability. Log token usage and dollar cost per run alongside every decision in BigQuery. Build a small dashboard or daily Slack digest, and alert when daily spend deviates meaningfully from the moving average. The first time a prompt change or an infinite retry loop quietly 10x’s your API bill, you’ll be glad the data was already there.

Encode them in the project rules your coding agent reads (CLAUDE.md, AGENTS.md, or whatever your tool of choice uses) once, and the agent will write them automatically into every new pipeline.


Alternatives: n8n, Zapier, and no-code automation

You can build a lot of what we’ve described in n8n, Zapier, Make, or similar no-code automation tools. They ship nodes for HTTP, BigQuery, OpenAI and Anthropic, scheduling, branching, retries. For a prototype or a small set of workflows, they get you running fast.

The wall comes when you try to scale. Gold sets, regression gates, prompt versioning, data engineering, SQL writing and validation: each one is friction in a no-code editor, and some are flat-out impossible without dropping into a code node.

The flip side, and this is the part that sounds backwards until you’ve lived it: the historical reason no-code was attractive (the “I don’t want to write code” barrier) is mostly gone. A coding agent does the writing. Claude Code writes the SQL, generates the Python, debugs failures, builds the Pydantic schema, tests the prompt against the gold set, transforms a staging table, ships the CI.

I’ve tried both. n8n is a good place to start if you’ve never wired one of these things up: seeing what a pipeline is, what a trigger does, what a structured output looks like, all useful first steps. But once you understand the shape, the path of least resistance for a real system is code plus an agent (or an engineer who knows what they’re doing), not a canvas.


The bigger picture

Once you start thinking in systems, the impact of AI and automation compounds. Every pipeline you build feeds the next one. Every eval makes the next prompt sharper. Every API you wire in adds another decision surface. The operators who learn to design, build, test, and improve these systems are the ones who’ll run paid search at scale from here.

Today, you’re the orchestrator. You write the specs, design the prompts, run the evals, decide when each piece is good enough to ship. Your coding agent builds the system, the pipelines run in production, and you validate and steer.

That boundary is moving. For every task you still do by hand, ask whether you can structure it clearly enough that coding agents can own the build and pipelines can own the execution: the spec, the success criteria, the eval that proves it worked. That transfer of logic is the real skill to develop.