Architecture

How BillSnap is built. Multi-tenant Cloud Run + BigQuery, optimized for ≤ 50 customers, ~$5/month to operate at 10. This page describes what we chose, why, and what we explicitly rejected — the same depth you'd want before granting any service account read access to your billing data.

On this page

Data flow at a glance
The four Cloud Run services
Cross-project IAM model
Detection: z-score per SKU
Firestore data model
Stack + dependencies
Security posture
Roadmap

1. Data flow at a glance

Every morning at 09:00 UTC, Cloud Scheduler kicks off a fan-out through Cloud Tasks to one detection job per active customer. Anomalies (if any) flow through Pub/Sub to Slack + email.

Cloud Scheduler  daily-detect (cron: 0 9 * * *)
   │
   ▼  OIDC POST
detector /dispatch                       lists active Firestore customers
   │
   ▼  enqueue 1 task / customer
Cloud Tasks queue: detect-queue          max_concurrent_dispatches = 5
   │                                     max_attempts = 3, exp backoff
   ▼  OIDC POST
detector /detect?project_doc_id=...
   │
   ├──▶ BigQuery: customer's billing_export dataset (cross-project read,
   │              impersonating billsnap-reader)
   │
   ├──▶ Compute 14-day z-score per service.sku in SQL
   │
   ├──▶ Write alert event to Firestore (idempotency key: project_doc_id_YYYY-MM-DD)
   │
   └──▶ Pub/Sub topic: anomaly-detected
            │
            ├──▶ subscription: anomaly-detected-to-slack
            │        ▼  push (OIDC)
            │      notifier /notify-slack
            │
            └──▶ subscription: anomaly-detected-to-email
                     ▼  push (OIDC)
                   notifier /notify-email  (SendGrid)

Both push subscriptions:  max_delivery_attempts = 5 → anomaly-dead-letter
                          7-day retention on dead-letter inbox
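
The dispatch step in the diagram above can be sketched roughly as follows. This is an illustration, not BillSnap's actual code: `build_detect_tasks`, `DETECTOR_URL`, and `INVOKER_SA` are hypothetical names and values, and the dicts only mirror the REST shape of a Cloud Tasks HTTP task instead of calling the Cloud Tasks client. It assumes the fan-out iterates project documents, since /detect takes a project_doc_id.

```python
from urllib.parse import urlencode

# Hypothetical values for illustration only.
DETECTOR_URL = "https://detector.example.run.app"
INVOKER_SA = "tasks-invoker@example-project.iam.gserviceaccount.com"


def build_detect_tasks(projects: list) -> list:
    """Return one Cloud Tasks-style HTTP task per active project document."""
    tasks = []
    for doc in projects:
        if doc.get("state") != "active":
            continue  # paused projects are left out of the fan-out
        query = urlencode({"project_doc_id": doc["id"]})
        tasks.append({
            "httpRequest": {
                "httpMethod": "POST",
                "url": f"{DETECTOR_URL}/detect?{query}",
                # Cloud Tasks attaches an OIDC token minted for this SA.
                "oidcToken": {"serviceAccountEmail": INVOKER_SA},
            }
        })
    return tasks
```

Concurrency and retries (max_concurrent_dispatches, max_attempts) are queue settings, so the dispatcher itself only has to enqueue.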

A separate Cloud Scheduler job (nightly-billing-state, cron 30 8 * * * UTC) calls reconcile, which walks every customers/{uid} doc, pulls Stripe's authoritative subscription status, and writes back any drift. This catches the case where a Stripe webhook event is dropped or filtered.
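
A minimal sketch of the drift check, assuming a simple status mapping. The state names come from the Firestore data model on this page; the `STRIPE_TO_STATE` table, the function names, and the rule that access_revoked is left alone are assumptions made for illustration, not the real reconciler.

```python
from typing import Optional

# Assumed mapping from Stripe subscription status to the Firestore
# customer state; the real mapping is not given on this page.
STRIPE_TO_STATE = {
    "trialing": "trialing",
    "active": "active",
    "past_due": "past_due",
    "unpaid": "past_due",
    "canceled": "canceled",
    "incomplete_expired": "canceled",
}


def expected_state(stripe_status: Optional[str], had_subscription: bool) -> str:
    """Derive the state Firestore should hold from Stripe's view."""
    if stripe_status is None:
        # No subscription in Stripe: never subscribed vs. subscription ended.
        return "canceled" if had_subscription else "pending_checkout"
    return STRIPE_TO_STATE.get(stripe_status, "pending_checkout")


def reconcile(customer: dict, stripe_status: Optional[str]) -> Optional[dict]:
    """Return the fields to write back, or None when nothing drifted."""
    if customer.get("state") == "access_revoked":
        return None  # assumption: access state is not owned by billing
    had_sub = bool(customer.get("stripe_subscription_id"))
    target = expected_state(stripe_status, had_sub)
    if customer.get("state") == target:
        return None
    return {"state": target}
```

Returning only the changed fields keeps the nightly run write-free for customers whose state already matches Stripe.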

2. The four Cloud Run services

detector (private, OIDC)

Daily anomaly run.
/dispatch enumerates active customers and enqueues tasks; /detect runs the SQL against one customer. Impersonates billsnap-reader for cross-project BigQuery reads.

notifier (private, Pub/Sub push)

Slack + email fan-out.
/notify-slack uses Block Kit; /notify-email uses SendGrid + a Jinja2 HTML template. Idempotent against alerts/{id}.

webhook (public)

Stripe webhook receiver + public API.
Verifies Stripe signatures, handles 5 subscription events. Also serves /me, /me/alerts, /create-checkout-session, /verify-access — all gated by Firebase ID-token bearer auth.

reconcile (private, OIDC)

Nightly billing-state reconciler.
For every customer, retrieves Stripe truth and updates Firestore state if drifted. Defensive against dropped webhooks. Distinguishes never-had-sub (pending_checkout) from sub-ended (canceled).

All four are Python 3.11 FastAPI services, deployed to Cloud Run in us-central1. Three are private and require an OIDC token; only webhook is public, because Stripe needs to reach it from the public internet. On the Stripe route, signature verification is the real auth; the other webhook routes require a Firebase ID token.
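
To illustrate what "idempotent against alerts/{id}" can mean for the notifier, here is a sketch of a push handler that sends at most once per alert. The envelope layout (`message.data` as base64) is the standard Pub/Sub push format; the `alert_id` payload field, the in-memory `alerts` dict standing in for Firestore, and the `send` callback are assumptions.

```python
import base64
import json
from datetime import datetime, timezone


def decode_push(envelope: dict) -> dict:
    """Decode the JSON payload carried in a Pub/Sub push envelope."""
    data = envelope["message"]["data"]  # base64-encoded payload
    return json.loads(base64.b64decode(data))


def handle_notify_slack(envelope: dict, alerts: dict, send) -> bool:
    """Send at most one Slack message per alert; return True if sent."""
    alert_id = decode_push(envelope)["alert_id"]  # hypothetical field name
    alert = alerts[alert_id]
    if alert.get("notified_slack_at"):
        return False  # redelivery: already notified, so just acknowledge
    send(alert)
    alert["notified_slack_at"] = datetime.now(timezone.utc).isoformat()
    return True
```

A real handler would need a Firestore transaction around the check and the write so that two concurrent deliveries cannot both send.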

3. Cross-project IAM model

BillSnap reads your billing data from your GCP project. The grant is dataset-scoped and revocable in one command:

your-project                          billsnap-prod-495823
─────────────                          ───────────────────────
billing_export.gcp_billing_export_v1_*
                              ▲                    │
                              │                    │
                              │  dataset:dataViewer│
                              │  project:jobUser   │
                              │                    │
                              ◀───────── impersonates
                                                   │
                                       billsnap-runtime SA
                                                   │
                                       (Cloud Run detector + webhook)

Two service accounts, one purpose each:

billsnap-runtime — runs every Cloud Run service. Lives entirely inside the BillSnap project. No customer-side IAM, ever.
billsnap-reader — the SA you grant access to during onboarding. Two narrow roles:
- roles/bigquery.dataViewer on your specific billing-export dataset (not the project)
- roles/bigquery.jobUser on the project that runs the BigQuery job (so the scan is billed to you, ~$0.01/run)

At runtime, the detector requests an OAuth token impersonating billsnap-reader. The runtime SA has roles/iam.serviceAccountTokenCreator on the reader SA, granted by Terraform inside the BillSnap project. No keys. No JSON files. The token lives in memory for 5 minutes per request.

To revoke, run two bq remove-iam-policy-binding commands, one per role; removing the dataset-level dataViewer binding alone already cuts off reads of your billing data. The next daily run then fails with a precise error. Our handler catches it, marks the project access_revoked in Firestore, halts subsequent runs, and sends a one-time email asking you to re-verify.
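
The revocation handling described above might look something like this. It is a sketch under assumptions: `AccessDenied` stands in for the 403 that BigQuery returns, and the `Project` fields `revocation_email_sent` and `outbox` are hypothetical stand-ins for Firestore state and the email sender.

```python
from dataclasses import dataclass, field


class AccessDenied(Exception):
    """Stand-in for the 403 raised once the grant is removed."""


@dataclass
class Project:
    doc_id: str
    state: str = "active"
    last_run_status: str = ""
    revocation_email_sent: bool = False  # hypothetical flag
    outbox: list = field(default_factory=list)  # stand-in for the email sender


def run_detection(project: Project, query) -> None:
    """Run one detection; on a permission error, halt the project and notify once."""
    if project.state == "access_revoked":
        return  # halted: later runs are skipped until re-verification
    try:
        query(project)
        project.last_run_status = "ok"
    except AccessDenied as exc:
        project.state = "access_revoked"
        project.last_run_status = f"error: {exc}"
        if not project.revocation_email_sent:
            project.outbox.append("please re-verify access")
            project.revocation_email_sent = True
```

Catching only the permission error matters here: a transient BigQuery failure should go through the Cloud Tasks retry path rather than halt the project.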

4. Detection: z-score per SKU

BillSnap looks at days T-15..T-2 for the 14-day baseline window and T-1 for "today" (GCP's billing export lags ~24h, so "today" is yesterday in clock time). The query therefore scans T-15..T-1 in total.
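
As a quick worked example of the date arithmetic, assuming the run date is T and the baseline is the 14 days before "today" (the function name and the `baseline_days` parameter are illustrative):

```python
from datetime import date, timedelta


def detection_window(run_date: date, baseline_days: int = 14) -> tuple:
    """Return (baseline_start, today) for a detection run on run_date."""
    today = run_date - timedelta(days=1)  # T-1: latest fully exported day
    baseline_start = today - timedelta(days=baseline_days)  # T-15
    return baseline_start, today
```

For a run on 2024-03-16, this gives a scan from 2024-03-01 through 2024-03-15, with 2024-03-15 treated as "today" and the 14 days before it as the baseline.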

The query, simplified:

-- Daily cost per service + SKU across the scan window (baseline days plus "today").
WITH daily AS (
  SELECT DATE(usage_start_time) AS d, service.description AS svc,
         sku.description AS sku, SUM(cost) AS daily_cost
  FROM `project.dataset.gcp_billing_export_v1_*`
  WHERE DATE(_PARTITIONTIME) BETWEEN @baseline_start AND @today
    AND project.id = @project_id AND currency = 'USD'
  GROUP BY d, svc, sku
),
-- Mean, sample standard deviation, and day count over the days before "today".
baseline AS (
  SELECT svc, sku, AVG(daily_cost) AS mean,
         STDDEV_SAMP(daily_cost) AS stddev, COUNT(*) AS n_days
  FROM daily WHERE d < @today
  GROUP BY svc, sku
)
-- Keep SKUs with a full baseline, a cost above the floor, and z above the threshold.
SELECT t.svc, t.sku, t.daily_cost AS today_cost, b.mean, b.stddev,
       SAFE_DIVIDE(t.daily_cost - b.mean, NULLIF(b.stddev, 0)) AS z
FROM (SELECT * FROM daily WHERE d = @today) t
JOIN baseline b USING (svc, sku)
WHERE b.n_days     >= 14
  AND b.stddev     >  0
  AND t.daily_cost >= @min_daily_cost_usd
  AND SAFE_DIVIDE(t.daily_cost - b.mean, b.stddev) > @z_threshold
ORDER BY z DESC LIMIT 50;

Full version (the one that actually runs) is published at /sql.txt.
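
For readers who prefer code to SQL, the per-SKU test in the simplified query can be mirrored in a few lines of Python. This is only a sketch of the same math for a single SKU; the function name and argument layout are ours, and the production logic is the SQL.

```python
from statistics import mean, stdev


def z_score(baseline: list, today_cost: float,
            z_threshold: float = 2.0, min_daily_cost_usd: float = 1.0,
            min_days: int = 14):
    """Return z for an anomalous SKU, or None when any filter rejects it."""
    if len(baseline) < min_days or today_cost < min_daily_cost_usd:
        return None
    sd = stdev(baseline)  # sample standard deviation, like STDDEV_SAMP
    if sd == 0:
        return None  # flat baseline: z is undefined
    z = (today_cost - mean(baseline)) / sd
    return z if z > z_threshold else None  # directional: spikes only
```

With a baseline alternating between $10 and $12 for 14 days, a $20 day is flagged, while an $11.50 day or a drop to $2 is not.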

Four anti-noise layers

min_daily_cost_usd = $1 floor — a SKU jumping from $0.001 to $0.05 is a 50× spike statistically but $0 in dollars. The floor keeps tiny SKUs from generating per-day alerts.
n_days >= 14 — brand-new SKUs with fewer than 14 daily observations get a "watching" state, not a flag. Prevents false positives on small samples.
One digest per project per day — even if 12 SKUs fire, you get one Slack message + one email listing all of them, sorted by z-score. Not 12 alerts.
Per-project tunable threshold — default z = 2.0; raise to 2.5 or 3.0 from the dashboard if alerts feel noisy.

We use directional z > threshold, not ABS(z) > threshold — BillSnap alerts on spend spikes, not drops. Drops are usually fine and would generate noise.
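
The "one digest per project per day" layer amounts to grouping before sending. A rough sketch, assuming each flagged SKU is a dict with the column names from the simplified query (`svc`, `sku`, `today_cost`, `z`); the message format itself is made up:

```python
def build_digest(project_id: str, day: str, flagged: list) -> str:
    """Format all flagged SKUs for one project and day as a single message."""
    ranked = sorted(flagged, key=lambda s: s["z"], reverse=True)
    lines = [f"{project_id} {day}: {len(ranked)} anomalous SKU(s)"]
    for s in ranked:
        lines.append(
            f"- {s['svc']} / {s['sku']}: ${s['today_cost']:.2f} (z={s['z']:.1f})"
        )
    return "\n".join(lines)
```

Twelve flagged SKUs produce one message with twelve lines, highest z-score first.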

5. Firestore data model

Native mode, four collections, no subcollections (flat is faster and cheaper to query):

customers/{uid}                              # uid = Firebase uid
  email, stripe_customer_id, stripe_subscription_id,
  state ∈ {pending_checkout, trialing, active,
           past_due, canceled, access_revoked},
  trial_ends_at, created_at, last_payment_failure_at

projects/{uid}_{gcp_project_id}
  customer_id, gcp_project_id, bq_dataset, bq_table_suffix,
  z_threshold (default 2.0), min_daily_cost_usd (default 1.0),
  slack_webhook_url, email_alerts,
  state ∈ {active, paused},
  last_run_at, last_run_status, last_run_error

alerts/{project_doc_id}_{YYYY-MM-DD}            # idempotency key
  project_doc_id, customer_id, date,
  flagged_skus[], total_anomalous_cost,
  notified_slack_at, notified_email_at, created_at

stripe_events/{stripe_event_id}                # webhook idempotency
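
The alerts document ID doubles as the idempotency key, so a retried Cloud Tasks delivery cannot create a second alert for the same project and day. A sketch of the idea, with a plain dict standing in for the alerts collection and hypothetical function names:

```python
from datetime import date


def alert_doc_id(project_doc_id: str, day: date) -> str:
    """Build the alerts/{project_doc_id}_{YYYY-MM-DD} document ID."""
    return f"{project_doc_id}_{day.isoformat()}"


def create_alert_once(alerts: dict, project_doc_id: str, day: date,
                      flagged_skus: list) -> bool:
    """Insert the alert unless one already exists for this project and day."""
    doc_id = alert_doc_id(project_doc_id, day)
    if doc_id in alerts:
        return False  # retried task: keep the first write
    alerts[doc_id] = {
        "project_doc_id": project_doc_id,
        "date": day.isoformat(),
        "flagged_skus": flagged_skus,
    }
    return True
```

Against Firestore itself, a create-only write that fails when the document already exists would play the role of the membership check here.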

Composite indexes (declared in firestore.indexes.json):

alerts(customer_id ASC, date DESC) — powers the "Recent alerts" table on the dashboard
projects(customer_id ASC, state ASC) — powers the dispatch fan-out query

Security rules: all client reads + writes denied. Every BillSnap data access goes through Cloud Run services using the Firestore Admin SDK (which authenticates via the runtime SA and bypasses rules). The dashboard never reads Firestore directly; it talks to webhook /me and /me/alerts over a Firebase-ID-token-gated REST API. If anyone ever finds a path to the Firestore JS SDK in the dashboard bundle, treat it as a bug.

6. Stack + dependencies

Layer	Choice	Why
Runtime	Python 3.11 + FastAPI	Boring + fast. ~50ms cold start, ~5ms warm.
Hosting	Cloud Run (us-central1)	Free tier handles 2M req/mo; pay-per-request, scale-to-zero.
Build	Cloud Build → Artifact Registry	One command from source: `gcloud builds submit`.
Data	BigQuery (cross-project read)	Same engine that emits the billing data. No copies.
State	Firestore Native	Cheap, scale-to-zero, real-time listeners available if needed.
Queue	Cloud Tasks	Rate-limited fan-out; retries with backoff; no quota state to manage.
Schedule	Cloud Scheduler	Two cron jobs; no Composer/Airflow overhead.
Events	Pub/Sub + dead-letter	Decouples detection from notification; 5-attempt cap.
Secrets	Secret Manager	Terraform-managed containers; values loaded out-of-band.
Billing	Stripe (Checkout + Customer Portal)	Hosted; we never see card data.
Email	SendGrid (Twilio)	Free tier covers ~3K emails/mo; SPF/DKIM/DMARC configured.
Frontend	Astro (static) + Firebase Hosting	Two sites: `billsnap.dev` (marketing), `app.billsnap.dev` (dashboard).
Auth	Firebase Auth (Google OIDC only)	No passwords. Server verifies ID tokens via firebase-admin.
DNS	Cloudflare	Free tier; API token scoped to single zone.
Uptime	Better Stack (free tier)	Public status page at status.billsnap.dev.
Infra	Terraform (local state)	Single workspace, ~400 lines. State file gitignored.

7. Security posture

No raw billing data leaves your project. The BigQuery query runs against your dataset and bills the scan to your project. We never copy raw billing rows out; only the SKUs that triggered an alert are written to Firestore, which is the minimum needed to render the dashboard.
Read-only, dataset-scoped. The customer-side grant is exactly two roles on exactly one dataset (data) and one project (job runner). No project-wide reader. No write permissions of any kind.
Revocation is self-service. Run bq remove-iam-policy-binding to drop the dataset grant; read access ends as soon as IAM propagates, the next daily run gets a precise 403, and our handler halts the project. You never need to ask us.
No keys. The reader SA is never key-exported. Cross-project access uses short-lived OAuth tokens (impersonated_credentials with a 5-minute lifetime).
Stripe signatures verified. webhook /stripe/webhook rejects any POST whose signature doesn't validate against the live-mode signing secret (whsec_*) stored in Secret Manager.
Firestore client access denied. The dashboard cannot read or write any Firestore document from the browser. All reads go through the webhook public API, which gates every endpoint on a Firebase ID token belonging to the requesting user.
CSP + standard headers. Both hosting sites send Content-Security-Policy, X-Content-Type-Options: nosniff, X-Frame-Options: SAMEORIGIN, Referrer-Policy: strict-origin-when-cross-origin, and a restrictive Permissions-Policy.
No SOC 2 / HIPAA / ISO 27001. BillSnap is solo-built. Appropriate for indie projects and small-startup workloads; not for regulated industries. See the privacy policy for details.
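
For the Stripe signature check listed above, this is roughly what verification involves under Stripe's documented v1 scheme: HMAC-SHA256 over "{timestamp}.{raw body}", compared against the v1 values in the Stripe-Signature header. It is a stdlib sketch for illustration; the function name and the 300-second tolerance are our assumptions, and a real service would more likely call the official library's stripe.Webhook.construct_event.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(payload: bytes, header: str, secret: str,
                            tolerance_s: int = 300,
                            now: Optional[float] = None) -> bool:
    """Check a Stripe-Signature header (v1 scheme) against the raw request body."""
    timestamp = None
    candidates = []
    for part in header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_s:
        return False  # stale timestamp: reject a possible replay
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # Constant-time comparison against every v1 signature in the header.
    return any(hmac.compare_digest(expected.encode(), c.encode()) for c in candidates)
```

The check must run on the raw request bytes; re-serializing parsed JSON changes the body and breaks the signature.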

8. Roadmap

Things we explicitly haven't built yet and the order we'd add them based on early signal:

Weekly digest tier ($5/mo add-on). Even on quiet weeks, a Friday email: total spend, w/w delta, top 3 SKUs, idle-resource hints. Addresses the silent-churn risk — product feels present even when nothing fires.
Slack OAuth app. Replace the per-customer Incoming Webhook URL flow with a proper "Add to Slack" install. Cuts onboarding friction, enables channel selection, and lists in the Slack App Directory for free distribution.
AWS Cost & Usage Report support. Same z-score math against the AWS CUR. Requires a parallel impersonation chain (IAM role assume).
Day-of-week seasonality. v1 ignores DOW because most indie projects are flat 24/7 batch jobs. Larger workloads with strong weekday/weekend cycles would benefit from DOW-stratified baselines.
Per-SKU mean-relative floor. Instead of a flat $1/day floor, configurable as "10% of the SKU's mean", so larger SKUs aren't accidentally suppressed.
Public API. Programmatic access to your alerts for integration with internal tooling.

Questions, corrections, or just want to chat about cross-project IAM weirdness? support@billsnap.dev.
