Applied AI Engineering for Google Ads with Python

PPC is text and numbers. LLMs are built for exactly that. They handle complexity no human can keep up with: hundreds of thousands of search terms across dozens of campaigns and ad groups, with targeting, bid, and budget decisions at every level. The question isn’t whether AI belongs in paid search. It’s how you build it so it actually works on a real account.

The fastest way to get started is also the one that doesn’t scale. Plug an agentic coding tool (Claude Code, Codex, Cursor) into a Google Ads MCP and use the account from a chat session. It’s genuinely useful for ad-hoc analysis, one-off cleanups, or debugging an odd campaign. But it’s still a session: nothing runs when you’re not in it, there’s no schedule, no continuous accuracy testing, no audit trail, no human approval queue. The work disappears when the conversation ends.

What you actually need is a system. A set of pipelines that pull data from BigQuery on a schedule, send batches through an LLM with your business logic baked into the prompts, land every decision as a row in a table, queue recommendations for human review, and prove their accuracy with evals before anything touches the live account.

These systems can already largely be built by AI. The two skills that matter most: writing specs (what to build, what correct looks like, how to verify it) and mastering evals. Get good at both and you can build and maintain AI systems without writing most of the code yourself. More on spec-driven development in my Claude Code Guide.

Pave the roads for AI

The system makes decisions. You build the road it runs on. Without rails, every run is improvisation. With them, decisions are measurable, repeatable, and reversible. Production AI systems for PPC don’t differ from production data systems in any deep way. They need scale, traceability, accuracy you can prove, a way to improve over time, and a human in the loop where the stakes are real.

Decisions at scale. Google Ads is a complex surface: thousands of search terms, keywords, campaigns, ad groups, bids, budgets, and campaign settings, all moving every day. Manual review breaks down fast. Python pipelines work through all of it on a schedule without fatigue. Volume isn’t the hard part. Making every decision repeatable is.

Traceability. Every recommendation lands in BigQuery: input, prompt version, output, timestamp, run ID. Six months later you can pull any decision and see exactly why it happened. Treat your warehouse as the audit log, because that’s what it becomes.

Accuracy you can prove. “Looks right” doesn’t ship. You need an eval framework: gold sets of known-correct answers, sampled human review of recent outputs, LLM-as-judge for scale, regression tests on every prompt change, agreement metrics between reviewers. Different evals catch different failures. Run them on every change, watch the numbers. If accuracy drops, you find out before production does.

Continuous optimization. Reviewer rejections and eval misses are signal. Each one tells you where the prompt is weak. Test the fix, measure it, ship it if accuracy improves. The system gets sharper every cycle.

Human in the loop where it matters. The system recommends, a human approves. Not because the system is unreliable, but because PPC is real money and accountability sticks. An hour a day on an approval queue beats hours unwinding bad mutations, and more importantly, it beats mistakes burning your ad budget.

Each one is a feature you have to build, not a property you get for free.

Document what the model doesn’t know

Think of an LLM as a new PPC analyst joining your team on its first day. Sharp, fast, knows the platforms. They can read a campaign export, write a sensible SQL query, structure a clean recommendation. What they don’t have yet is the context: which campaigns are protected, what “good” looks like for each segment, which trade-offs are acceptable, what’s off-limits. Without that, even strong analysts produce output that looks reasonable on the surface and is strategically wrong underneath.

You’d onboard the analyst with a doc, a 30-minute walkthrough, and a list of guardrails. You do exactly the same for the model. The Data Transfer already gives the model the operational picture: account structure, ad group naming, landing page mappings, historical tROAS and budget changes, all queryable in BigQuery. What you write down on top of that is the strategic layer, the stuff an experienced operator carries in their head.

The list is longer than people expect:

Goals and KPIs. What each campaign is actually optimized against. tROAS targets, CPA limits, new-customer share, lifetime value tiers, the metrics that matter and the ones to ignore.
Account strategy. Why the account is structured the way it is. Why certain campaigns exist, which themes are deliberate, what each campaign should and shouldn’t capture.
Products and offers. Categories, pricing tiers, seasonal items, loss leaders, protected SKUs, what’s allowed in non-brand vs. brand campaigns.
Business rules and guardrails. Branded vs. non-brand budget split, competitor strategy, exclusion lists, hard caps, the things an experienced operator does without thinking.
Voice and brand. Tone for ad copy, claims you can and can’t make, terminology that has to be exact.

Writing this down pays off twice: your coding agent stops re-asking the same questions every session, and the prompts that drive classification, routing, and budget logic pull the same context directly into production.

Markdown files in the repo are the lowest overhead and usually enough. Confluence or Notion works too, agents read and write both via their APIs. Past a few hundred pages, layer retrieval (chunk, embed, fetch) on top. Treat the docs like code: review, update, delete what’s stale. Stale docs are worse than none, especially for the fast-moving stuff (promotional calendars, inventory, margin targets) where the system will confidently execute last month’s strategy.

The foundation: BigQuery + Google Ads Data Transfer

Every system in this article starts with the same question: where does the data live? For Google Ads, the answer is the Google Ads Data Transfer. It syncs daily snapshots of every entity in your account into BigQuery: campaigns, ad groups, keywords, search terms, conversions, performance stats, tROAS targets, budget settings. All of it, queryable with SQL.

Once data is in BigQuery, two things open up. Your coding agent can crunch it directly: writing SQL, running it, fixing errors, building derived tables. And you can join Google Ads data to anything else: revenue from your warehouse, margins from your ERP, LTV from your CRM. You stop optimising for what Google reports and start optimising for what the business cares about.

The Data Transfer also gives you full history, not just the current state. tROAS targets, target CPAs, budgets, keyword statuses, paused ads, label changes, bid adjustments, all captured over time. You can ask “what was this campaign’s tROAS six months ago and how did performance respond” and actually get an answer. That’s what makes serious modelling possible later.

The pattern: BigQuery → decision engine → BigQuery

A workflow built on this foundation typically takes the same shape:

BigQuery → fetch data → run decision logic → write back to BigQuery → take action

The decision logic is where the use cases differ. Sometimes it’s an LLM with a prompt. Sometimes it’s a statistical model. Sometimes it’s a deterministic Python function. The infrastructure around it is identical.

# LLM-driven (search terms, ad copy, classification)
data    = bq_client.fetch("get_search_terms.sql")    # pull data
prompt  = open("prompts/classify-main.md").read()    # load instructions
results = llm_client.classify(prompt, data)          # LLM recommends
bq_client.write(results)                             # write back

# Statistical or rule-based (anomaly detection, budget models)
data    = bq_client.fetch("daily_spend.sql")
results = detect_anomalies(data)                     # z-scores, regression, custom logic
slack.send(results)                                  # notify, or write to BQ, or call an API

Three principles hold across every variant.

Prompts as Markdown. When the engine is an LLM, prompts live in version-controlled .md files loaded at runtime. A real system has many: a classifier prompt, a routing prompt, a triage prompt, a copy generation prompt. Each one captures a slice of domain expertise (classification rules, scoring logic, edge cases) in plain English instead of buried in Python strings. Editable without touching code, diffable in git, A/B testable in evals.

SQL as code. Queries live in .sql files, not embedded in Python. Same reasons: versioned, reviewable, testable. Your coding agent can read them, modify them, run them.

Decisions as data. Every output lands in BigQuery as a structured row. That row is your audit log, your eval input, and the next pipeline’s input. Decisions become queryable like everything else.

This is where coding agents earn their keep. Staging tables, MERGE statements, migrations, chaining pipelines so each decision feeds the next: that’s standard data engineering work, and agents are good at it. Classification feeds routing, routing feeds review, review feeds evals, evals feed the next prompt. You spec the system. The agent builds the pieces. With BigQuery underneath, the limit is how clearly you can describe what you want.

What you can build with this pattern

Once the pattern is in place, the surface area opens up fast, and it stops being just PPC. With your account data sitting in BigQuery next to revenue, margin, inventory, returns, LTV, and product feeds, you stop being a siloed channel. You become the team that plugs into systems other teams own.

“AI” here doesn’t mean only LLMs. The same Python pipeline pattern wraps LLMs, ML models, and statistical functions, and you pick whichever decision engine fits the problem. Some of the use cases below are prompt-heavy, some are pure math, all of them sit on the same data layer and run through the same review and quality gate. A few starting points worth building:

Search term and keyword management. The worked example below. An LLM classifies each term (add as keyword, add as negative, route to review), routing logic picks campaign and ad group, a human approves before anything ships. Heavy on business logic and prompts.

Shopping feed optimisation. Pull feed data from a feed management tool (Channable, Productsup, Feedonomics) or your PIM, use an LLM to enrich titles, categories, custom labels, and GTINs, score variants against shopping performance, write back to the feed. Massive lever in retail and most teams never touch it.

Demand forecasting and inventory-aware bidding. Forecast SKU-level demand from historical sales, seasonality, and promo calendars. Combine with live inventory to push in-stock items, pause low-stock SKUs, and lift bids ahead of peaks. Pure data-science work, ideal for a coding agent.

Ad copy automation. Pull top performers, product attributes, and voice rules from a Markdown file, generate headlines and descriptions, score against a creative brief, write candidates to BigQuery for human review. Same pattern, different prompt.

Anomaly detection. Rule-based: z-scores against rolling baselines, fixed thresholds, statistical checks. Python pulls daily metrics and change history from BigQuery, flags drift (cost spikes, conversion drops, suspicious mutations), Slacks soft alerts, auto-pauses egregious cases.

Budget and tROAS adjustment. With the Data Transfer’s history of tROAS, target CPA, and budgets alongside performance, you can model the sweetspot between efficiency and volume. Run a regression or a marginal-ROAS curve, output recommended shifts per campaign, queue for approval.

pLTV-aware bidding via offline conversion import. Train a model on CRM data to predict lifetime value at first conversion, push predicted values back to Google Ads via Offline Conversion Import so Smart Bidding optimises for expected LTV instead of first-purchase revenue. Stops you overpaying for discount hunters and underbidding on customers who repurchase ten times.

Same data layer, same loop, different decision engines. Search term management is the worked example below because it’s the most complex: most business logic, most prompt engineering, most edge cases.

What a Python pipeline actually does

Before the reliability patterns, it’s worth being concrete about the unit of work. A “pipeline” in this system is a Python script that wraps a small set of components and runs them in order. Each one is small, replaceable, and testable in isolation.

┌─────────────────────────────────────────────────────────┐
│  Python Pipeline (run on a schedule, e.g. daily cron)   │
│                                                         │
│   ┌──────────┐      ┌──────────┐      ┌──────────────┐  │
│   │ .sql     │      │ .md      │      │ schema.json  │  │
│   │ queries  │      │ prompts  │      │ (validation) │  │
│   └────┬─────┘      └────┬─────┘      └──────┬───────┘  │
│        │                 │                   │          │
│        ▼                 ▼                   ▼          │
│   ┌─────────────────────────────────────────────────┐   │
│   │ pipeline.py                                     │   │
│   │  1. fetch input   ── BigQuery client            │   │
│   │  2. build request ── load prompt + format data  │   │
│   │  3. call engine   ── LLM client / stats / rules │   │
│   │  4. validate      ── schema check, retries      │   │
│   │  5. write output  ── staging table + MERGE      │   │
│   │  6. notify        ── Sheet / dashboard / Slack  │   │
│   └─────────────────────────────────────────────────┘   │
│                              │                          │
│                              ▼                          │
│                        BigQuery (output table)          │
└─────────────────────────────────────────────────────────┘

The structure is simple. Most of the substance lives outside the Python: in the SQL (what data to pull), the prompts (what the model should do), the schema (what valid output looks like), the BigQuery tables (the data and the audit log). The Python is glue. The reliability patterns in the next section are what make that glue robust enough to run on real money. That’s where the complexity goes.

That separation is what makes the system maintainable. Change a prompt without touching Python. Change a query without touching prompts. Add a new pipeline by copying the scaffold and swapping ingredients. Your coding agent does most of the assembly.

This is the basic shape, and it scales as far as your account demands. Add more APIs (search console, GA4, your CRM, competitor data), wire in more context, chain pipelines so each output becomes the next input, layer in retrieval, build dashboards on top. The principle doesn’t change: pull data, run a decision, write it back, review it, learn from it.

Engineering for production

These are the patterns that make the difference between a script that runs once and a system that runs every day on a real account without losing your team’s trust the first time something breaks. Build them in from day one, not after the first incident.

Mutation safety. Every UPDATE, DELETE, or external API write goes through a protocol: scope lock (explicit row IDs, never open-ended WHERE clauses), pre-snapshot, pre-count, execute, post-count, validate. Warehouse writes can roll back; external API writes can’t, so store the previous state and reverse with a compensating mutation if validation fails.

Dry runs and --limit. Every pipeline ships with --dry-run (runs everything except the final write) and --limit N (caps input volume). New code, new prompt, new SQL? Dry-run first. Cheap insurance against schema mismatches, off-by-one filters, runaway API costs.

Rate limits and timeouts. External APIs throttle. Set explicit timeouts on every request so a hung connection never blocks the pipeline. Use exponential backoff with jitter on retries. Cap total retries so failures fail loudly instead of silently looping.

Batching. LLM and ad platform APIs have request-size limits and per-call latency. Batch inputs into chunks, parallelize where rate limits allow, aggregate before writing. Done right, batching cuts pipeline runtime from hours to minutes.

Idempotency. Every pipeline must be safe to re-run. If a job crashes halfway, the next run picks up where it stopped, no duplicates. Staging tables with run IDs, MERGE statements with dedup keys, append-only audit logs.

Logging with execution IDs. Every run gets a UUID. Every log line, every BigQuery write, every API mutation includes that ID. When something breaks, you grep one ID and reconstruct the entire run.

Tests and CI. Unit tests on the pure functions (parsers, validators, routing logic), integration tests against a sandbox dataset (does the pipeline produce the expected rows for a known input), and end-to-end tests that run the whole pipeline in --dry-run against a frozen sample. Wire it all into CI so every prompt, SQL, or Python change has to pass before it merges. Eval regressions belong here too: gold sets run on every prompt change, build fails if accuracy drops below the threshold.

These are the bones of every pipeline in this system. Encode them in the project rules your coding agent reads (CLAUDE.md, AGENTS.md, or whatever your tool of choice uses) once, and the agent will write them automatically into every new pipeline.

From recommendation to live account

A decision sitting in BigQuery doesn’t change anything. Someone has to push it to the ad platform, and how you do that determines how much risk lives in the system. There are two modes worth running, and you should start with the first.

Manual review, manual application. The system writes recommendations to a review interface (Google Sheets, a custom dashboard, a Slack thread). A reviewer reads each one, approves or rejects with a comment, and applies approved changes in the Google Ads UI or via Editor. Slow, but safe, and the right starting point. Run here until you trust the system’s accuracy on real data.

Manual review, one-click push. Same review interface, same human-in-the-loop, but the approve button calls the Google Ads API and pushes the whole approved batch in one action. The reviewer still owns every decision; the system just handles the mechanics. This is where most teams should land once accuracy is proven.

Keep a human at the gate. Even for boring operations (negatives that match a strict rule, pausing zero-conversion keywords with high spend over an extended window) a 30-second sanity check costs nothing and catches the kind of mistakes that quietly compound. The cost of one bad batch pushed at 3am is much higher than the cost of a daily approval queue.

Connecting to the API. Use the official Google Ads Python client. Authenticate via OAuth refresh tokens stored in a secrets manager, never .env. Test mutations against a test account first. Scope credentials to the minimum customer IDs each pipeline needs. Log every mutation with an execution ID so a single grep can reconstruct what the system did. Plan for partial batch failures (some mutations succeed, others fail on policy or dependencies) and respect object ordering: you can’t add a keyword to an ad group that was removed two seconds earlier.

Anomaly detection sits on top. Whichever tier you run, a separate watchdog has to monitor the account: spend over budget, CPA blowouts, bid or budget shifts that look off, mutations the system shouldn’t have made. Mix statistical detection (z-scores) with fixed rules (auto-pause if spend exceeds 2x daily budget). Not a substitute for human review. The backstop that catches what review misses.

Worked example: search term classification

Search term classification is a good place to start. It’s the densest decision surface in paid search: thousands of terms a month, each one a small business decision (relevant? off-target? bid on it? block it?) that compounds at scale. Most accounts can’t keep up with the volume, so most of those decisions never get made. Closing that gap is exactly the kind of work a pipeline like this does well.

The pipeline follows the pattern from earlier:

BigQuery in. Pull unclassified search terms from the Data Transfer, filtered by spend and recency. Skip data that hasn’t matured yet: with long conversion lags (B2B, considered purchases) yesterday’s “zero conversion” terms become tomorrow’s best performers, and a system that pauses on raw recency will quietly kill your top-of-funnel.
Business logic as a prompt. A Markdown file defines what to do with each term: intent scoring, audience detection, category relevance, branded vs. competitor handling, the action (ADD_KEYWORD, ADD_NEGATIVE, TO_REVIEW), match type, target campaign, target ad group. Your domain expertise lives here, not in Python.
Call the LLM. Send a batch with the prompt. Validate the structured output against a schema. Uncertain terms get enriched (web search, internal docs) and reclassified.
Decisions back to BigQuery. Write to staging, MERGE into production. Every recommendation is queryable: input, prompt version, output, run ID, timestamp.
Human review. Approved recommendations push to Google Ads. Rejections, with a reviewer comment, flow into the eval pipeline.

Each step is its own pipeline. Each pipeline writes to its own table. Each table is the input to the next. Your coding agent builds the whole chain, and once it’s built, the chain runs without an agent in it.

It doesn’t stop at classification either. “Add as keyword” raises the next question: which campaign, which ad group, what match type? Routing is its own pipeline with its own prompt and its own evals. Negative scoping is another. tROAS adjustment at the campaign level is another. Each new decision surface gets its own pipeline, plugged into the same data layer and the same review interface. That’s how the system grows: not one giant prompt, but a chain of small specialised pipelines that each do one thing well.

Step 5 is where evals come in: rejections are the highest-quality signal you’ll get about how the prompts are wrong, and they’re what makes the system sharper every cycle.

Evals: continuous system optimization

LLMs make wrong decisions. Even with good prompts, good context, and good data, the model will misclassify or ignore a rule you spent an hour writing. That’s the operating reality, not a defect to fix. Evals are how you measure where it’s wrong and make the next version better than the last. If you only invest in one thing, invest here.

One thing evals don’t tell you: whether the decisions actually move the business. They prove the system did what you told it to, not whether what you told it to was right. For changes that touch spend (budget shifts, bid moves, broad-match expansion, feed swaps) you still need causal measurement: geo tests, holdouts, MMM (Google’s Meridian, GeoX, or your own model). Evals catch decision drift; experiments catch strategy drift.

There’s no single right way to run them. A real eval framework mixes methods: small gold sets for fast regression, sampled human review on production output for ground truth, LLM-as-judge to scale that review across thousands of decisions, agreement scoring between reviewers to keep labels honest, and feedback from the approval queue itself (rejections are evals you didn’t have to design). Each catches a different kind of failure. The loop below is the minimum, the one that makes the whole system trustworthy. Build out from there as the system matures.

You run two gold sets in parallel: a prod gold set for continuous accuracy measurement, and an experimental gold set for testing fixes against rejected terms. Both grow over time. Experimental terms graduate into prod once a fix holds without regressing the rest, so prod widens gradually as the system learns.

Seed a gold set. Start with 25-50 terms with known correct answers, covering your real categories: a clear keyword, a clear negative, a competitor, a branded term, an informational query, an ambiguous case. One type repeated dozens of times tells you nothing. The set grows continuously from here.
Run real data. Classify a few hundred terms, push to your review interface, approve or reject with specific notes. “Wrong” tells you nothing. “Classified as shoes but this is an apparel query” is signal.
Open coding. One person, the domain expert, reads every rejection and writes a short note describing what went wrong. No predefined categories. Don’t run this by committee.
Axial coding. Group the notes into failure modes (COMPETITOR_ERROR, CATEGORY_ERROR, MATCH_TYPE_ERROR). Count them. The most frequent one is where you start. An LLM can do the grouping. You keep the judgment.
Targeted experiment. Add the failing terms to a dev gold set. Copy the production prompt, never edit it directly. Change only the section that controls the failure mode. Run it.
Two regression gates. Dev gold set: did the change fix the failures? Prod gold set: did it break anything that was passing? Both pass: merge to production, graduate the dev terms into prod. Either fails: keep iterating.
Repeat. Early cycles catch big errors fast. Later cycles catch edge cases. At some point you stop finding new failure modes. That’s theoretical saturation, and it’s the signal the prompt is production-ready for the current data.

You do the open coding because humans see things models miss. Claude Code runs everything else: triage, the prompt diff, both regressions, the deltas, the gold set graduation, the commit. A typical loop:

You:    "Run triage on the last 40 rejections and show me the top failure mode"

Claude: → reads eval_feedback (status: RECEIVED)
        → classifies each rejection into a failure mode
        → reports: COMPETITOR_ERROR: 18 (45%) ← top failure mode

You:    "Improve competitor detection. Identify why this happens."

Claude: → reads prompts/classify-main.md
        → identifies Step 0 (confidence check) as the root cause
        → proposes a diff: stricter UNKNOWN criteria + new examples
        → eval --regression --gold-set dev  → 71% → 96%  ✓
        → eval --regression --gold-set prod → 96% → 96%  ✓ no regression
        "Both gates pass. Approve to merge."

You:    approve

Claude: → updates the prompt
        → graduates dev gold set terms into prod
        → commits: "tighten competitor confidence check, 71→96%"

The part you keep is the part that needs taste: deciding which failure modes matter, and reading the rejections without flinching. Evals also tell you when to stop. When new batches stop producing new failure modes, the prompt is done for now. Move on to the next decision surface.

The bigger picture

Nothing here is a black box. BigQuery is the data layer, prompts are files in git, decisions are rows you can query. The AI stays bounded, and the rest is the data engineering anyone who’s worked on a pipeline already recognises. That’s what makes it safe to point at a real account.

Today, you’re the orchestrator. You write the specs, design the prompts, run the evals, decide when each piece is good enough to ship. Your coding agent builds and executes. The pipelines run in production. You validate and steer.

That boundary is moving. For every task you still do by hand, ask whether you can structure it clearly enough that coding agents can own the build and pipelines can own the execution: the spec, the success criteria, the eval that proves it worked. That transfer of logic is the real skill, and it’s the one that compounds.

Google is also automating large parts of the in-account surface itself: AI Max for Search, auto-created assets, AI-driven landing-page expansion. The leverage is in the layer Google can’t see: your business systems, your margin and inventory data, your LTV models, your strategic decisions about where AI gets to act. The PPC role that scales from here owns the Python stack on top of BigQuery and wires ads into the rest of the business: revenue, margin, inventory, LTV, feeds. That’s where the leverage is.

Pave the roads for AI#

Document what the model doesn’t know#

The foundation: BigQuery + Google Ads Data Transfer#

The pattern: BigQuery → decision engine → BigQuery#

What you can build with this pattern#

What a Python pipeline actually does#

Engineering for production#

From recommendation to live account#

Worked example: search term classification#

Evals: continuous system optimization#

The bigger picture#