Living document. Last updated March 2026. I update this when the stack meaningfully changes.


Why Claude Code?

Think of it as a software engineer and data analyst working as one. It builds production pipelines, writes tests, enforces type safety, and commits clean code. It queries your data warehouse directly: ad platform data, revenue, margins, whatever you’ve connected. It connects to external services through APIs or MCP servers: Google Sheets, Slack, BigQuery, Google Ads, Salesforce. Almost any tool you use can be connected nowadays.

You give it a goal, it executes using real tools on your machine: SQL, Python, CLI, whatever the task needs. You design the systems and validate the outcomes. Claude operates them.

I run it from the terminal inside Cursor. My project files are visible in the editor while I talk to Claude in the integrated terminal. You can also run Claude Code standalone from any terminal.

Writing code with Claude is the easy part. Getting code you can trust is harder, and it’s not about prompting tricks. It’s about specification: what the system should do, what correct looks like, and how to verify the output before it touches a live account.


The Persistent Context

Without persistent context, every session starts from zero. CLAUDE.md and your repo structure are what turn Claude Code from a generic coding tool into an assistant that understands your projects, your data, and your conventions.

CLAUDE.md: The Operating Instructions

Every repository has a CLAUDE.md file at the root. Claude Code reads it automatically at the start of every session. It tells Claude how to work with you: what to read first, how to validate its work, what rules to follow, and what not to touch.

Your CLAUDE.md looks different for every project: 20 lines for a prototyping repo, full architecture rules and quality gates for production. The example below is from a production pipeline repo. Only include what Claude can’t figure out on its own from your code, configs, or tool output. If something feels like a warning, ask: can this be fixed in the code instead? A docstring, a better variable name, or a clearer structure is always better than another rule.

# Paid Search Keyword Automation

## How to Assist

This is a production repo. Present the plan, wait for confirmation,
then execute autonomously.

### Before starting work
1. Read ROADMAP.md for strategic context (vision, milestones, non-goals)
2. Read tasks/todo.md to know what's active and in scope
3. Read relevant docs/ files for context (ls docs/ to discover)
4. Run make test before making changes
5. For multi-file or architectural work: write a spec in tasks/specs/
   before starting. Small changes don't need a spec.

### After completing work (mandatory, do not skip steps)
1. Run make test. Report pass count + coverage for changed files.
2. Run make lint. Report clean or list violations.
3. If pipeline or config changed: verify docs/ was updated
4. Check if tasks/todo.md needs updating
5. Report with evidence: "Ready to commit" or "Fix first: ..."

### Session Closeoff Review
At the end of a substantial work session:
1. Verify — make test + make lint + dry-run changed pipelines
2. Summarize — what changed, with before/after evidence
3. Reflect — what could we do better? CLAUDE.md updates?
4. Backlog — capture ideas in tasks/todo.md
5. Cleanup — scan for stale/unused files

## Working Style

### Verification Discipline
- After any code change, RUN make test and make lint freshly.
  Never rely on memory of a previous run.
- READ the full output before claiming success.
- Never use hedging language ("should work", "probably fixes it").
  Either verify it works or say "I have not verified this yet."
- When reporting test results, include actual counts
  (e.g., "240 passed, 0 failed"), not just "all tests pass."

### Bug Fixing
- Obvious bugs (typo, wrong variable, missing import): fix and explain.
- Non-obvious bugs: investigate root cause first. State what you think
  is happening and why BEFORE changing code.
- 3-strike rule: After 3 failed fix attempts on the same issue, STOP.
  Present findings and ask the user before continuing.

## Architecture
CLI entry points > pipelines/ > clients/ (bq_client, gemini, tavily)
Pipelines and dashboard share: config/, domain/, prompts/, sql/

- Pipeline BQ access goes through clients/bq_client.py (Python SDK)
- Dashboard BQ access goes through services/bq_service.py (Python SDK)

## Code Quality
E2E test requirement: every new or modified pipeline needs at least one
E2E test with all externals mocked. Cover: happy path, empty input,
dry run.

## Pipeline Rules
Staging-to-production: INSERT staging (audit trail), MERGE production
(dedup). Idempotent, safe to re-run. Every pipeline has --dry-run
and --limit N flags.

## What NOT to Do
- Don't run pipelines without --dry-run unless explicitly asked
- Don't modify BigQuery schemas without DDL in sql/ddl/
- Never use string concatenation with user input in SQL

The file evolves through use. The Session Closeoff Review is the mechanism: Claude proposes updates based on mistakes or patterns it noticed. Cut aggressively. Every line competes with the actual task for context.

How Claude Reads Your Repo

Claude doesn’t need you to pre-load everything. It explores on demand: CLAUDE.md is loaded automatically every session, then Claude runs ls to discover directories, reads files when it needs specifics, and runs tools (make test, git log, bq query) to understand the current state. Good file naming matters more than maintaining an index. For a focused production repo, CLAUDE.md + sensible file names is the whole system.

Projects & File-Based Tracking

Each project tracks its own work. The task tracking lives where the work lives:

paid-search-keyword-automation/
├── CLAUDE.md            ← Operating instructions (loaded every session)
├── ROADMAP.md           ← Vision, milestones, non-goals
├── tasks/
│   ├── todo.md          ← Active + backlog
│   └── specs/           ← Intent docs for non-trivial builds
├── docs/                ← Architecture, operator manual, etc.
├── pipelines/           ← Pipeline logic
├── clients/             ← API clients (BigQuery, Gemini, etc.)
├── tests/               ← E2E and unit tests
├── sql/                 ← BigQuery queries and DDL
├── prompts/             ← LLM prompts as Markdown files
└── config/              ← Settings, constants

ROADMAP.md defines what “done” looks like, what the milestones are, and what the project deliberately won’t do. Without it, the agent makes locally reasonable decisions that conflict with where the project is heading: building a feature you planned to replace next month, or over-engineering something that’s deliberately out of scope.

tasks/todo.md is self-updating. CLAUDE.md tells the agent to read it before starting work and update it after completing work. You come back a week later, say “let’s continue,” and the agent picks up where it left off. For multi-step builds, it plans the work as checkable items before starting, checks them off as it goes, and adds follow-ups. The file is committed to git, so the full history is always there.

For non-trivial builds, a spec file in tasks/specs/ captures intent that doesn’t survive execution. More on this in Plan Mode below.


Tooling

Tools You Can Use Daily

  • bq CLI: Query BigQuery directly from the terminal. Claude writes SQL, runs it, reads the results, identifies issues, and iterates until the output is correct.
  • gcloud: Google Cloud authentication, project switching, service account management.
  • git: Commits, diffs, branch management, history.
  • python / uv: Execute scripts, manage dependencies, run full pipelines. uv handles package management with lock files so environments are reproducible.
  • gh: GitHub CLI for pull requests, issues, code review.
  • ruff: Linting and formatting. Catches style issues and basic code problems before they compound.
  • mypy: Static type checking. Verifies that type hints are consistent across the codebase, catching logic errors before runtime.
  • pytest: Runs tests automatically. Unit tests for logic, integration tests for BigQuery dry runs.

The Execute, Validate, Iterate Loop

This is where the real power sits. Claude doesn’t just write a query. It runs it, validates the output, and fixes problems autonomously.

No copy-pasting between tools. No switching to BigQuery console. No manual debugging of column names. The agent handles the iteration loop that a human would otherwise do manually across multiple browser tabs.


Plan Mode: Spec-First Execution

Write the Spec First

Before opening plan mode, write a spec in tasks/specs/. Think of it as a 15-minute waterfall: you brainstorm requirements, edge cases, architecture decisions, and a testing strategy with Claude before any code is written. Not a design doc, but not a few bullet points either. The spec defines what success looks like and how you’ll verify it. That second part, the test plan, is what execution gets measured against.

Before writing the spec, have the agent restate the goal in its own words. One sentence back. If it’s wrong, correct it now rather than after 30 minutes of implementation. Have it flag 2-3 edge cases that could change the approach. This takes under a minute and catches misunderstandings before they get built into a spec you then have to rewrite.

# [What you're building, one line]

## Goal
One sentence: what exists when this is done.

## The Problem
Why this needs to exist, or what breaks without it.

## Approach
1. Step one
2. Step two
3. ...

## Success Criteria
- Measurable outcome
- Measurable outcome

## Test Plan
- Unit: what logic gets tested in isolation
- E2E: what you run end-to-end to verify
- CI gate: what blocks a merge

The test plan is the part most people skip. Don’t. It defines the testing strategy before a line of code is written: which unit tests verify the logic, which end-to-end tests confirm the pipeline runs correctly. Claude writes them, runs them, and fixes failures in a loop. Success criteria plus test plan is the definition of done. Code isn’t finished when it runs; it’s finished when it passes the tests the spec required.

Execute Against the Spec

For any non-trivial build, plan mode is where you should spend the majority of your time. Type /plan or press Shift+Tab to cycle to plan mode. Claude enters read-only research mode: it explores your codebase, reads docs, and queries schemas, but doesn’t write files yet. It’s forced to think before acting.

For big tasks, I typically spend around 70% of the time in plan mode and the remaining 30% on execution and validation. Read through the plan, refine it, add context, and only once you’re confident the plan is solid let Claude execute.

When to use it:

  • Before any build that touches multiple files or systems
  • When the architecture has multiple valid approaches
  • Before building any pipeline that involves multiple APIs or data sources

Why This Order Matters

The spec tells Claude what “done” means before it starts exploring. Plan mode can then validate: does this approach satisfy all success criteria? The test plan becomes the acceptance gate for execution, not an afterthought once the code is written. Once code exists, it’s easy to convince yourself it’s correct because it runs. Write the test plan first and you create an honest external benchmark. Skip it for explorations and one-off queries.


Commands & Skills: Controlling the Pace

As Claude Code gets faster at producing code, you need more deliberate pause points, not fewer. This is where commands and skills come in, and understanding the difference between them matters.

Claude Code has three layers of instructions, each with a different level of automation:

  • CLAUDE.md: runs every session, automatically. Always-on guardrails and architecture. Examples: verification discipline, bug-fixing protocol, safety rules.
  • Skills: auto-detected by context. Recurring rituals the agent triggers itself. Examples: sprint review, roadmap, weekly update.
  • Commands: run when you type /name. Deliberate checkpoints you invoke. Examples: /debug, /review-tests, /dry-run-all.

Commands are stored in .claude/commands/ and only run when you explicitly type the slash command. You control when they fire:

/debug            structured debugging (hypothesis, diagnostic, verify)
/review-tests     honest assessment of what tests cover and what they miss
/dry-run-all      run all pipelines against real BigQuery in safe mode
/run-eval         run the eval pipeline and report accuracy metrics

Each command encodes a workflow you’d otherwise do manually, but with consistent structure every time.

Skills are stored in .claude/skills/ and auto-trigger when Claude detects they’re relevant. The skills that work well as auto-triggered have a clear, predictable trigger:

sprint-review       Monday/Friday ritual: reviews tasks and git history across all repos
sync-confluence     pushes markdown docs to Confluence, regenerates READMEs from codebase
update-bq-docs      scans repo for BigQuery table references, flags missing documentation
context-audit       portfolio-wide health check: repo structure, CLAUDE.md quality, stale docs
visualize-flow      generates Mermaid diagrams of data flows and pipeline architecture

Global skills (~/.claude/skills/) run across every repo, good for cross-project workflows. Per-repo skills (.claude/skills/ inside the repo) trigger only within that project.

Be intentional about what you put in skills versus commands. Skills auto-fire based on heuristics, so anything that should only run when you decide to run it belongs in commands. /debug fires when you decide “this is stuck, let me think systematically,” not when the agent guesses you’re debugging.

Why Pace Control Matters

This connects to something fundamental about working with AI coding agents. The agent can produce code incredibly fast. Faster than you can review it, faster than you can understand it. That speed is the feature, but it’s also the risk.

Commands are how you deliberately slow down. Not because slow is better, but because understanding is better. /debug forces you to see the debugging process laid out step by step instead of watching Claude try 10 random fixes. /review-tests forces an honest accounting of what’s actually tested versus what just looks tested. These aren’t just engineering tools. They’re comprehension tools.

One more mechanism: for changes that touch 3 or more files, the agent pauses after each logical unit of work to summarize what it did before moving on. No more big-bang reveals where 10 files changed and you’re reviewing everything at once. If you want to course-correct, you do it before the agent is five steps deep, not after. A simple rule in CLAUDE.md:

### Review gates
For changes touching 3+ files: after completing each logical unit of work,
pause and summarize what was done before moving to the next unit.

The balance looks like this: let CLAUDE.md handle the things that should always happen (run tests, verify before claiming success). Let skills handle predictable recurring rituals with a clear trigger. Put deliberate checkpoints in commands, and invoke them when you decide you need to slow down and think.


Model Routing & Context Window

Use the best model available. On a flat-rate subscription, always use the most capable model. On API billing, you can switch models mid-conversation with /model to drop to a lighter model after the planning phase.

Three things to understand about context during long sessions:

Auto-compaction. When a conversation gets long, Claude Code automatically compresses earlier messages to make room. Your session doesn’t crash, but details from early in the conversation may fade. This is why CLAUDE.md matters: it’s re-loaded every time, so critical rules always survive. Keep your CLAUDE.md tight, and front-load the most important instructions.

Subagents and isolated context. When Claude spawns a subagent (a parallel task like reading docs or running searches), that subagent gets its own separate workspace. It doesn’t consume space from your main conversation. The subagent does its work, returns a summary, and the full intermediate exploration stays outside your session.

Memory vs. local context. Memory (stored in ~/.claude/) is cross-session storage: personal preferences, workflow patterns, things that apply everywhere regardless of which repo you’re in. Local context (CLAUDE.md, docs/, tasks/) is repo-specific knowledge committed to git: architecture rules, pipeline patterns, domain knowledge that only matters for this project. The rule of thumb: if it applies to you as a person across all projects, it’s memory. If it applies to a specific codebase or system, it’s local context.


Code Quality & Testing

There are two layers of testing, and you need both.

Layer 1: Technical tests. Claude handles these. Unit tests, type checking, linting, mocking. The agent writes them, runs them, fixes failures, and iterates until everything passes. You review the result, not every intermediate step.

Layer 2: Domain validation. You handle this. Does the output match reality? Do the numbers match what you see in the platform? Would you trust this running unattended? No automated test can answer these questions. Before every project, ask: how would I validate this manually? That answer becomes part of the test plan.

The Code Quality Stack

Dependency management with uv. uv is a modern Python package manager that creates a lock file (uv.lock), guaranteeing your code runs with the exact same library versions everywhere. No more “works on my machine” failures. This matters especially for agent-built code, where dependency management tends to be an afterthought.

Type checking with mypy. One of the most effective guards against AI hallucinations in code. Type hints force every function to declare exactly what it expects and returns:

# Without types: what is "amount"? A number? A string? A currency object?
def calculate_bid(amount):
    return amount * 1.2

# With types: clear contract, mypy verifies it everywhere
def calculate_bid(amount: float) -> float:
    return amount * 1.2

If Claude writes a function that returns a float but later uses it as a string, mypy catches it immediately, before you run anything. This one rule eliminates an entire class of bugs.

Testing with pytest: does the pipeline work? Be precise about what tests verify. Not whether your AI output is accurate. That’s what evals are for. Tests answer a different question: does data flow through the pipeline correctly? Does your response parser handle every format the LLM returns? Does the BigQuery write fail gracefully when a row is malformed? Tests are deterministic. Same input, same output, every time.

The key technique: mocking. Tests never call the real API. Instead, you replace it with a fake that returns a pre-scripted response. Tests run in milliseconds, cost nothing, and work even when the API is down. You’re testing your pipeline’s logic, not the LLM’s behavior.
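A minimal sketch of the technique with Python’s unittest.mock. The pipeline function, the `classify` method, and the labels are hypothetical; the point is that the fake client returns pre-scripted responses, so the test exercises only the pipeline’s own logic:

```python
from unittest.mock import Mock

def classify_terms(terms, llm_client):
    """Pipeline logic under test: one classification call per search term."""
    return {term: llm_client.classify(term) for term in terms}

# Fake client: pre-scripted responses, no network, runs in milliseconds
fake_llm = Mock()
fake_llm.classify.side_effect = lambda t: "brand" if "acme" in t else "generic"

result = classify_terms(["acme shoes", "running shoes"], fake_llm)
print(result)  # {'acme shoes': 'brand', 'running shoes': 'generic'}
```

The assertion you write against `result` checks your dict-building and error handling, not the model.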

Types of tests:

  • End-to-end tests are the most valuable starting point. “If I feed 100 search terms into this pipeline, do I get 100 classified results back in the right format?” This doesn’t check whether the AI classifications are good (that’s what evals are for). It checks that the pipeline runs without crashing and produces structured output.

  • Unit tests catch specific logic bugs. “Does the score normalization math work correctly?” “Does the JSON parser handle a malformed LLM response?” Each test checks one function with a known input and expected output. Fast to run, and when one fails you know exactly which piece of logic broke.

  • Integration tests verify that components talk to each other correctly. “Does the BigQuery client return data in the format the pipeline expects?” “Does the LLM response parse into the right schema?”

The spec-test-validate loop. This is what makes autonomous execution work. You define what success looks like in the spec, write tests that verify those criteria, and give Claude a way to validate output. Then it doesn’t matter how the code gets built, because you can verify the result. Claude runs make test, reads the output, fixes failures, and iterates until everything passes. The 3-strike rule (defined in CLAUDE.md) is the safety net that keeps this loop from running away.
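The control flow of that loop is simple enough to sketch. The function names here are illustrative, not Claude Code internals: run the tests, attempt a fix on failure, and stop after three failed attempts:

```python
def fix_loop(run_tests, apply_fix, max_attempts=3):
    """Spec-test-validate loop with the 3-strike safety net.

    run_tests() -> (passed, output); apply_fix(output) attempts a repair.
    """
    for _ in range(max_attempts):
        passed, output = run_tests()
        if passed:
            return "green"
        apply_fix(output)  # one strike
    # three strikes: stop, present findings, ask the user
    return "escalate"
```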

Tests are your safety gate. No code gets committed unless all tests pass. Pre-commit hooks enforce this: if tests fail, the commit is blocked, Claude reads the error, fixes the code, and retries until green. The tests that verify your spec’s success criteria aren’t just checks for today’s build: they become a permanent regression gate. Commit them to CI and any future change that breaks the spec fails the pipeline automatically. The spec doesn’t just guide the build; it becomes the contract all future code must satisfy.

Dry runs complete the picture. Mocked tests are fast and cheap, but they can’t catch everything: a renamed column, a schema mismatch, a date filter off by one. Every pipeline should have a --dry-run flag that runs the full pipeline against your real database but skips the final write step. Run it after every change that touches SQL, table references, or API calls. It takes 30 seconds and catches an entire class of bugs that mocks can’t. Layer 2 validation (comparing output against your dashboard, the ad platform, or a known baseline) is the final check that no automated test can replace. Only you have that context.
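The flags themselves cost almost nothing to carry in every pipeline CLI. A sketch with argparse, matching the flag names in the Pipeline Rules above:

```python
import argparse

def build_parser():
    """Safety flags every pipeline entry point carries."""
    parser = argparse.ArgumentParser(description="pipeline entry point")
    parser.add_argument("--dry-run", action="store_true",
                        help="run the full pipeline but skip the final write")
    parser.add_argument("--limit", type=int, default=None, metavar="N",
                        help="process at most N rows")
    return parser

args = build_parser().parse_args(["--dry-run", "--limit", "50"])
print(args.dry_run, args.limit)  # True 50
```

In the pipeline body, the final write step is simply guarded by `if args.dry_run: return` after logging what would have been written.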

Re-evaluate architecture as you grow. Early on, getting things working matters more than getting the structure perfect. But as your codebase grows, the architecture decisions you made at 500 lines start to strain at 5,000. Periodically step back and ask whether your module boundaries, data flow patterns, and abstractions still make sense. Claude can help you map dependencies and spot structural issues, but for real architectural decisions, getting a more experienced engineer to review your code is even better. A fresh pair of eyes catches things that no amount of refactoring from within will surface.

Evaluations for AI Pipelines

Tests verify that code works. Evals verify that AI output is good. You need both, and what evals look like is entirely dependent on your use case.

The process behind evals is universal, though, and it’s the same as software engineering: systematic measurement, controlled changes, measured outcomes. Among AI product builders, there is growing consensus that evals are the most important new skill. Hamel Husain and Shreya Shankar teach the definitive framework, and the core of it is surprisingly simple: error analysis first, automation second.

Step 1: Open coding. Look at 20-50 actual AI outputs and write free-form notes about what’s wrong. No predefined categories, just honest observation. You don’t know what you don’t know, and the whole point is discovering failure modes you didn’t anticipate. One person with domain expertise owns this (what Hamel calls the “benevolent dictator”), not a committee.

Step 2: Axial coding. After enough observations, patterns emerge. Group your notes into failure mode categories. An LLM is actually good at this part: feed it your open codes and ask it to synthesize categories.

Step 3: Basic counting. Count how often each category appears. This is, as Hamel puts it, the most powerful analytical technique in data science because it’s so simple. Now you know what to fix first.

From there, you build a gold set of labeled examples from your errors, run your pipeline against it after every change, and track accuracy over time. Only experiment on patterns (the same mistake appearing 5+ times), not individual errors. Make one change, measure the impact, keep what works. Most teams skip error analysis and jump straight to writing tests or tweaking prompts based on gut feel. That’s where things go off the rails.
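Step 3 needs nothing fancier than a counter. A sketch with hypothetical failure-mode labels produced by axial coding:

```python
from collections import Counter

# Hypothetical axial codes assigned while reviewing 30 pipeline outputs
failure_modes = [
    "hallucinated_brand", "wrong_match_type", "hallucinated_brand",
    "truncated_json", "hallucinated_brand", "wrong_match_type",
]

counts = Counter(failure_modes)
for mode, n in counts.most_common():
    print(f"{mode}: {n}")
# hallucinated_brand appears most often -> that's what you fix first
```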

Session Closeoff: The Continuous Improvement Loop

The Session Closeoff Review in the CLAUDE.md example above is the practice that compounds everything. Your CLAUDE.md gets better with every session. The first version is generic. After a month of closeoff reviews, it contains hard-won lessons: “always dry-run SQL changes against real BQ before moving on,” “capture baseline counts before touching query logic,” “shared SQL definitions must stay identical across files, enforce with tests.” These aren’t rules you could have written upfront. They come from experience.

Every session starts with better instructions than the last one. Over months, you build a personalized engineering playbook that an AI agent follows precisely, every time.


BigQuery: The Data Foundation

Every pipeline pulls from BigQuery and writes results back to it. It’s the single queryable location for your complete account structure and performance history.

The Google Ads Data Transfer syncs daily snapshots of every entity in your account to BigQuery: campaigns, ad groups, keywords, search terms, historical tROAS targets, budget settings, cost and conversion stats. Claude knows how to query all of it.

Two table types exist for each entity:

  • ads_* views: latest snapshot only. Use for dimension lookups. Filter: WHERE _DATA_DATE = _LATEST_DATE
  • p_ads_* partitioned tables: full history. Use for date-range analysis. Filter: WHERE segments_date BETWEEN ... AND ...

The critical rule: always use p_ads_* tables with segments_date for any performance-stats query over a date range. The ads_* views only contain the latest day.

The BigQuery Data Transfer Service also supports DV360, Campaign Manager 360, Search Ads 360, and GA4. For non-Google platforms, use a connector (Fivetran, Supermetrics, Funnel) or build your own ingestion. Once the data lands in BigQuery, the same patterns apply.

Beyond Ad Platforms

BigQuery is a data warehouse, so you can connect anything your business has: revenue, LTV, margins, product catalogs, CRM data, whatever. Once you do, you’re no longer optimizing based on what Google Ads tells you. You’re optimizing based on what your business actually cares about: real margins, customer quality, predicted lifetime value. Document the table schema, and Claude starts using it.

Documentation & Tooling

Document table purpose, not schema. Claude can read schemas via INFORMATION_SCHEMA. What it can’t discover: business context, naming quirks, metric definitions. Keep a lightweight index that routes Claude to the right table for the right question.

Two files matter most: a calculated metrics file (canonical SQL for every business formula like CAC, LTV:CAC, conversion rate) and a channel taxonomy (channel hierarchy, funnel stages, default filters). Without these, Claude writes technically correct SQL that produces meaningless results.

In Claude Code, use the bq CLI over the BigQuery MCP. Zero context token cost, works across any GCP project, full CLI feature set. The MCP costs 15-20K tokens just to load and is locked to a single project.

Query Pattern

One example to show the shape. Search term stats over a date range, with campaign name, ad group name, triggering keyword, and match type:

SELECT
  sqs.search_term_view_search_term        AS search_term,
  c.campaign_name,
  ag.ad_group_name,
  ag.ad_group_criterion_display_name      AS keyword,
  sqs.segments_search_term_match_type     AS match_type,
  SUM(sqs.metrics_cost_micros) / 1000000  AS cost,
  SUM(sqs.metrics_clicks)                 AS clicks,
  SUM(sqs.metrics_impressions)            AS impressions,
  SAFE_DIVIDE(SUM(sqs.metrics_conversions), SUM(sqs.metrics_clicks)) AS cvr
FROM `your-project.your_dataset.p_ads_SearchQueryStats_XXXXXXXXXX` sqs
LEFT JOIN (
  SELECT campaign_id, campaign_name
  FROM `your-project.your_dataset.ads_Campaign_XXXXXXXXXX`
  WHERE _DATA_DATE = _LATEST_DATE
  GROUP BY 1, 2
) c USING (campaign_id)
LEFT JOIN (
  SELECT
    ad_group_id,
    ad_group_criterion_criterion_id AS criterion_id,
    ad_group_name,
    ad_group_criterion_display_name
  FROM `your-project.your_dataset.ads_AdGroupCriterion_XXXXXXXXXX`
  WHERE _DATA_DATE = _LATEST_DATE
  GROUP BY 1, 2, 3, 4
) ag
  ON CAST(SPLIT(SPLIT(sqs.segments_keyword_ad_group_criterion, '/')[OFFSET(3)], '~')[OFFSET(0)] AS INT64) = ag.ad_group_id
  AND CAST(SPLIT(SPLIT(sqs.segments_keyword_ad_group_criterion, '/')[OFFSET(3)], '~')[OFFSET(1)] AS INT64) = ag.criterion_id
WHERE sqs.segments_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY) AND CURRENT_DATE()
GROUP BY 1, 2, 3, 4, 5
ORDER BY cost DESC
LIMIT 500

The tricky part is the criterion join: Data Transfer embeds the ad group and keyword IDs as a resource path in segments_keyword_ad_group_criterion, which has to be parsed out. Claude handles it correctly once it knows your dataset and account ID. From there the pattern extends to anything: SA360, GA4, DV360, backend revenue, margins, CRM data. The query shape stays the same.
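The same parsing works in Python, for pipelines that join in code rather than SQL. The function and sample path below are illustrative, but the path shape matches what the SQL above splits apart:

```python
def parse_criterion_path(path: str) -> tuple[int, int]:
    """Extract (ad_group_id, criterion_id) from the resource path stored in
    segments_keyword_ad_group_criterion, shaped like:
    customers/{customer_id}/adGroupCriteria/{ad_group_id}~{criterion_id}
    """
    last_segment = path.split("/")[3]            # "{ad_group_id}~{criterion_id}"
    ad_group_id, criterion_id = last_segment.split("~")
    return int(ad_group_id), int(criterion_id)

print(parse_criterion_path("customers/1234567890/adGroupCriteria/111222~333444"))
# (111222, 333444)
```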


Agentic Workflows: Adding LLM APIs to Your Pipelines

This is where the system becomes truly agentic. An agentic workflow is a Python pipeline that connects an LLM API to your data, makes decisions based on that data, and executes actions.

The pattern is consistent across every workflow:

  1. Query BigQuery for the data the workflow needs (search terms, performance stats, budget actuals)
  2. Send that data to an LLM API (Gemini, Claude, or any model) with a structured prompt that defines the classification or analysis task
  3. Process the LLM’s response, validate it against a strict output schema (structured JSON via Pydantic or equivalent), write results back to BigQuery
  4. Execute the output via whatever API is relevant: Google Ads API for account changes, Google Docs API for reports, Slack webhooks for notifications
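Step 3 is the one worth sketching. This standard-library version stands in for Pydantic; the labels, field names, and schema are hypothetical, but the principle is the same: reject anything off-schema before it reaches BigQuery.

```python
import json
from dataclasses import dataclass

ALLOWED_LABELS = {"brand", "generic", "competitor", "irrelevant"}

@dataclass(frozen=True)
class Classification:
    search_term: str
    label: str
    confidence: float

def parse_llm_response(raw: str) -> list[Classification]:
    """Validate the LLM's JSON against a strict schema; fail loudly on drift."""
    rows = json.loads(raw)
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array")
    parsed = []
    for row in rows:
        if row["label"] not in ALLOWED_LABELS:
            raise ValueError(f"unknown label: {row['label']}")
        confidence = float(row["confidence"])
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {confidence}")
        parsed.append(Classification(row["search_term"], row["label"], confidence))
    return parsed
```

A malformed response raises immediately instead of writing garbage downstream, which is exactly the failure mode you want.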

The key insight: these are just Python scripts with API calls. No special orchestration framework needed. Claude Code builds them, you test them, and they run. If something breaks, Claude reads the logs and fixes it. If output quality drifts, the eval pipeline catches it.

The Staging-to-Production Pattern

Raw AI results go into a staging table first (append-only audit trail), then a second step merges them into production. If a run produces bad output, staging has the full history. Re-runs are safe and idempotent.

Safe Data Mutations

The staging-to-production pattern handles new data. But what about UPDATE and DELETE on production tables? Every mutation must be reversible, auditable, and validated.

The protocol: scope lock (explicit row IDs, never open-ended WHERE clauses), pre-snapshot to a timestamped backup table, pre-count affected vs unaffected rows, execute, post-count, validate. If counts don’t match, roll back automatically. For multi-table mutations, wrap everything in a single transaction group: all-or-nothing.
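The count check at the core of that protocol is pure logic. A sketch, assuming the scoped and unscoped row counts come from pre/post queries against the locked WHERE clause:

```python
def counts_reconcile(op: str, pre_scoped: int, pre_rest: int,
                     post_scoped: int, post_rest: int) -> bool:
    """Validate row-count invariants after a mutation.
    False means restore from the timestamped snapshot table."""
    if op == "UPDATE":
        # updates change values, never row counts, in or out of scope
        return post_scoped == pre_scoped and post_rest == pre_rest
    if op == "DELETE":
        # scoped rows are gone; everything else is untouched
        return post_scoped == 0 and post_rest == pre_rest
    raise ValueError(f"unsupported operation: {op}")

print(counts_reconcile("DELETE", pre_scoped=42, pre_rest=9000,
                       post_scoped=0, post_rest=9000))  # True
```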

This feels like overkill until a bad WHERE clause corrupts production data and you spend a day reconstructing it from logs.

Production Pipeline Requirements

Every pipeline ships with batching, rate limiting, retry logic, error handling, and logging. For pipelines that write to external APIs (Google Ads, Merchant Center), log every mutation with an execution ID so you can trace and revert. Claude includes these automatically because the production checklist is encoded in the architecture rules.

Scheduling and Orchestration

Once built and tested, schedule it: GitHub Actions cron, Cloud Scheduler, or a simple bash script. When pipelines start depending on each other, graduate to proper orchestration: Airflow (or Cloud Composer on GCP), Prefect, or Dagster. Most systems start with cron.

SQL Transformation Layer

If your SQL layer is growing (more tables, more transformations, more downstream consumers), dbt deserves a serious look. It lets you build SQL queries that depend on each other in a defined order, with automatic checks that your data is clean (no nulls where there shouldn’t be, no duplicate rows, all references valid). It also generates documentation that stays in sync with your actual queries. Your Python pipelines then read from clean, tested tables instead of writing raw SQL against source tables directly.


The Google Ads API

The Google Ads API is the execution layer. Once your pipelines produce decisions in BigQuery, the API is how those decisions become platform actions. Use the Data Transfer to read data. Use the API to write changes: add keywords, add negatives, adjust bids, modify tROAS targets, pause entities, reallocate budgets, update product feeds via Merchant Center.

Credentials live in .env (never committed to git, chmod 600 immediately). For production, use a secrets manager and scope service accounts to only the tables and mutation types each pipeline needs.
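A minimal `.env` loader that also enforces the `chmod 600` rule; this is a dependency-free sketch (python-dotenv does the loading part robustly), and the variable name in the usage example is hypothetical. The permission check is POSIX-only.

```python
import os
import pathlib
import stat

def load_env(path=".env"):
    """Load KEY=VALUE lines into os.environ, refusing over-permissive files."""
    p = pathlib.Path(path)
    # Refuse group/world-readable credential files (expects chmod 600).
    mode = p.stat().st_mode
    if mode & (stat.S_IRGRP | stat.S_IROTH):
        raise PermissionError(f"{path} is readable by others; run: chmod 600 {path}")
    for line in p.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            # setdefault: real environment variables win over .env values
            os.environ.setdefault(key.strip(), value.strip())
```

For production the same shape applies, just with `os.environ` populated by a secrets manager instead of a file on disk.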

The Architecture

This is a closed feedback loop:

BigQuery (analysis) -> Python (mutations) -> Google Ads API (live changes)
       ^                                                 |
       |________________ Data Transfer _________________|

  1. BigQuery queries identify pending changes: a negative to add, a bid to adjust, an ad to update
  2. Python prepares and executes mutations against the Google Ads API
  3. Google Ads API applies the changes to the live account
  4. Data Transfer syncs the updated account state back into BigQuery
  5. The next run reads from that updated state, so every cycle builds on the last

The loop is what makes it autonomous. Each run feeds on its own output: the analysis in step 1 already includes the effect of what was changed in the previous run. You’re not writing one-off scripts; you’re building a system that observes, acts, and observes again.

Claude handles the API mechanics: GAQL syntax, mutation structure, error handling for partial failures. You define the business rules: what performance threshold triggers an action, what budget reallocation formula to use, when to pause entities.
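What a business rule looks like in practice: a small pure function that turns one row of performance data into a mutation decision. The thresholds, field names, and actions below are hypothetical placeholders for rules you would define yourself.

```python
def decide_action(row):
    """Return a mutation decision for one search term, or None to do nothing."""
    # Rule 1: lots of clicks, zero conversions -> add as negative keyword.
    if row["clicks"] >= 50 and row["conversions"] == 0:
        return {"action": "add_negative", "term": row["term"]}
    # Rule 2: unprofitable with meaningful spend -> lower the bid.
    if row["roas"] is not None and row["roas"] < 1.0 and row["spend"] > 100:
        return {"action": "lower_bid", "term": row["term"], "factor": 0.85}
    return None
```

Keeping rules as pure functions like this makes them trivial to unit-test against hand-picked rows before any mutation ever reaches a live account.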


Why Documentation Matters More Now

AI agents are only as good as the context they receive. Every piece of documentation (not just table schemas, but marketing workflows, naming conventions, business logic) makes every agent faster and more accurate. With AI agents, documentation compounds: better docs lead to better output, which generates better docs.

Git-based markdown works for technical context agents read directly. Shared wikis (Confluence, Notion) make documentation visible to the whole team. RAG is for scale: when you have hundreds of documents and need agents to find relevant pieces automatically. Most teams start with git docs and a shared wiki. The tooling matters less than the habit: document what you know, keep it current, make it findable.


Build Deliberately

You own everything your agent builds. Claude writes the code, but you are responsible for what ships. If a pipeline breaks or a budget reallocation fires incorrectly, that’s on you. You need to understand what was built well enough to review it, debug it, and fix it.

This isn’t a limitation. It’s the mechanism that makes you better.

Comprehension debt is real

The biggest risk with AI-generated code isn’t that it’s wrong. It’s that it’s right, and you don’t understand why. Everything looks fine: tests pass, linter is clean, pipeline runs every morning. But you can’t explain how the retry logic works or modify the routing rules without worrying you’ll break something else.

The practices in this article (specs, tests, dry runs, evals, /debug, /review-tests) exist to close that gap. Skip them and you’ll ship faster in the short term. But the first time something breaks at 2am and you’re staring at code you can’t explain, you’ll wish you’d slowed down.

Know what you can review, and when to bring in help

As your systems grow, you’ll keep running into new engineering territory: system design, security, testing patterns, scaling. You learn it by building. Engineers on your team become invaluable along the way: not to take over, but to point you in the right direction and help you make better decisions as your skills develop. A code review can redirect weeks of work.

The converging skills effect

Something interesting happens when marketers build with engineering practices: the skills converge. Your specs are written in plain language: engineers read them and immediately understand the business logic. They can explain why a design choice is better, and you follow the explanation because you’ve been building the system yourself.

Making code cheap to generate doesn’t make understanding cheap to skip. Every script you ship creates a dependency. A prototype costs nothing to abandon. A scheduled pipeline your team relies on costs real time to maintain. Know the difference before you ship.