BillSnap

Architecture

How BillSnap is built: multi-tenant Cloud Run + BigQuery, optimized for ≤ 50 customers, and about $5/month to operate at 10 customers. This page describes what we chose, why, and what we explicitly rejected, at the depth you'd want before granting any service account read access to your billing data.

1. Data flow at a glance

Every morning at 09:00 UTC, Cloud Scheduler kicks off a fan-out through Cloud Tasks to one detection job per active customer. Anomalies (if any) flow through Pub/Sub to Slack + email.

Cloud Scheduler  daily-detect (cron: 0 9 * * *)
   │
   ▼  OIDC POST
detector /dispatch                       lists active Firestore customers
   │
   ▼  enqueue 1 task / customer
Cloud Tasks queue: detect-queue          max_concurrent_dispatches = 5
   │                                     max_attempts = 3, exp backoff
   ▼  OIDC POST
detector /detect?project_doc_id=...
   │
   ├──▶ BigQuery: customer's billing_export dataset (cross-project read,
   │              impersonating billsnap-reader)
   │
   ├──▶ Compute 14-day z-score per service.sku in SQL
   │
   ├──▶ Write alert event to Firestore (idempotency key: project_doc_id_YYYY-MM-DD)
   │
   └──▶ Pub/Sub topic: anomaly-detected
            │
            ├──▶ subscription: anomaly-detected-to-slack
            │        ▼  push (OIDC)
            │      notifier /notify-slack
            │
            └──▶ subscription: anomaly-detected-to-email
                     ▼  push (OIDC)
                   notifier /notify-email  (SendGrid)

Both push subscriptions:  max_delivery_attempts = 5 → anomaly-dead-letter
                          7-day retention on dead-letter inbox
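The fan-out step in the diagram can be sketched as a pure function that turns a list of project document ids into one HTTP task each. This is an illustration under assumptions, not the deployed code: the detector URL, the invoker service account, and the helper name build_detect_tasks are hypothetical. Only the /detect?project_doc_id=... path, the POST method, and OIDC auth come from the diagram. The dict keys follow the field names of a Cloud Tasks HTTP task.

```python
from urllib.parse import urlencode

# Hypothetical values for illustration; the real URL and invoker SA are not published.
DETECTOR_URL = "https://detector.example.invalid"
INVOKER_SA = "invoker@example.invalid"


def build_detect_tasks(project_doc_ids: list[str]) -> list[dict]:
    """Build one Cloud Tasks HTTP task payload per project document."""
    tasks = []
    for doc_id in project_doc_ids:
        query = urlencode({"project_doc_id": doc_id})
        tasks.append({
            "http_request": {
                "http_method": "POST",
                "url": f"{DETECTOR_URL}/detect?{query}",
                # Cloud Tasks attaches an OIDC token minted for this account.
                "oidc_token": {"service_account_email": INVOKER_SA},
            }
        })
    return tasks


tasks = build_detect_tasks(["uid1_proj-a", "uid2_proj-b"])
print(len(tasks), tasks[0]["http_request"]["url"])
```

Concurrency and retry limits (max_concurrent_dispatches, max_attempts) live on the queue itself, so the dispatcher only needs to enqueue.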

A separate Cloud Scheduler job (nightly-billing-state, cron 30 8 * * * UTC) calls reconcile, which walks every customers/{uid} doc, pulls Stripe's authoritative subscription status, and writes back any drift. This catches the case where a Stripe webhook event is dropped or filtered.
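The drift check can be sketched as a small pure function. The mapping from Stripe subscription status to BillSnap state below is an assumption made for illustration (the real mapping is not published); the state names are the ones listed in the Firestore data model, and the function names are hypothetical.

```python
from typing import Optional

# Assumed mapping from Stripe subscription status to customer state.
STRIPE_TO_STATE = {
    "trialing": "trialing",
    "active": "active",
    "past_due": "past_due",
    "unpaid": "past_due",
    "canceled": "canceled",
    "incomplete_expired": "canceled",
}


def expected_state(stripe_status: Optional[str]) -> Optional[str]:
    """Return the state Firestore should hold, or None to leave it untouched."""
    if stripe_status is None:
        # No subscription was ever created: checkout was never completed.
        return "pending_checkout"
    return STRIPE_TO_STATE.get(stripe_status)


def reconcile_one(current_state: str, stripe_status: Optional[str]) -> Optional[str]:
    """Return the new state to write if Firestore drifted, else None."""
    target = expected_state(stripe_status)
    if target is None or target == current_state:
        return None
    return target


print(reconcile_one("active", "canceled"))
```

Returning None for unmapped statuses keeps the reconciler conservative: it only writes when it is sure what the state should be.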

2. The four Cloud Run services

detector (private, OIDC)

Daily anomaly run.
/dispatch enumerates active customers and enqueues tasks; /detect runs the SQL against one customer. Impersonates billsnap-reader for cross-project BigQuery reads.

notifier (private, Pub/Sub push)

Slack + email fan-out.
/notify-slack uses Block Kit; /notify-email uses SendGrid + a Jinja2 HTML template. Idempotent against alerts/{id}.

webhook (public)

Stripe webhook receiver + public API.
Verifies Stripe signatures, handles 5 subscription events. Also serves /me, /me/alerts, /create-checkout-session, /verify-access — all gated by Firebase ID-token bearer auth.

reconcile (private, OIDC)

Nightly billing-state reconciler.
For every customer, retrieves Stripe truth and updates Firestore state if drifted. Defensive against dropped webhooks. Distinguishes never-had-sub (pending_checkout) from sub-ended (canceled).

All four are Python 3.11 FastAPI services, deployed to Cloud Run in us-central1. Three are private (require OIDC) and only webhook is public (Stripe needs to reach it from outside the VPC; signature verification is the real auth).
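Because signature verification is the real auth on the public service, it helps to see what the check involves. The sketch below implements Stripe's documented scheme (HMAC-SHA256 over "timestamp.payload", compared against the v1 entries of the Stripe-Signature header) using only the standard library. It is an illustration, not BillSnap's code; a production service would more likely call stripe.Webhook.construct_event from the official library. The function name and the 300-second tolerance default are assumptions.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(payload: bytes, header: str, secret: str,
                            tolerance_s: int = 300,
                            now: Optional[float] = None) -> bool:
    """Check a Stripe-Signature header of the form 't=<ts>,v1=<hex>[,v1=<hex>]'."""
    timestamp = None
    candidates = []
    for part in header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    # Reject stale events to limit replay attacks.
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_s:
        return False
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # Constant-time comparison against every v1 signature in the header.
    return any(hmac.compare_digest(expected.encode(), c.encode()) for c in candidates)
```

The raw request body must be used as payload; re-serializing parsed JSON changes the bytes and breaks the signature.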

3. Cross-project IAM model

BillSnap reads your billing data from your GCP project. The grant is dataset-scoped and revocable in one command:

your-project                          billsnap-prod-495823
─────────────                          ───────────────────────
billing_export.gcp_billing_export_v1_*
                              ▲                    │
                              │                    │
                              │  dataset:dataViewer│
                              │  project:jobUser   │
                              │                    │
                              ◀───────── impersonates
                                                   │
                                       billsnap-runtime SA
                                                   │
                                       (Cloud Run detector + webhook)

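Two service accounts, one purpose each: billsnap-runtime runs the Cloud Run services, and billsnap-reader holds the read grants on your dataset.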

At runtime, the detector requests an OAuth token impersonating billsnap-reader. The runtime SA has roles/iam.serviceAccountTokenCreator on the reader SA, granted by Terraform inside the BillSnap project. No keys. No JSON files. The token lives in memory for 5 minutes per request.
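A minimal sketch of keyless impersonation with the google-auth library, assuming the reader SA lives in the BillSnap project shown in the diagram and that a BigQuery scope is requested (neither is confirmed by this page). The helper names are hypothetical. The library import is deferred so the email helper works without google-auth installed.

```python
BILLSNAP_PROJECT = "billsnap-prod-495823"  # project id from the diagram above
TOKEN_LIFETIME_S = 300                     # 5 minutes, as stated above


def sa_email(name: str, project: str = BILLSNAP_PROJECT) -> str:
    """Standard service-account email format for a given account name."""
    return f"{name}@{project}.iam.gserviceaccount.com"


def reader_credentials():
    """Mint short-lived credentials for billsnap-reader via impersonation."""
    # Imported lazily: only needed when actually running on GCP.
    import google.auth
    from google.auth import impersonated_credentials

    source, _ = google.auth.default()  # resolves to the runtime SA on Cloud Run
    return impersonated_credentials.Credentials(
        source_credentials=source,
        target_principal=sa_email("billsnap-reader"),
        target_scopes=["https://www.googleapis.com/auth/bigquery"],  # assumed scope
        lifetime=TOKEN_LIFETIME_S,
    )


print(sa_email("billsnap-reader"))
```

The returned object can be passed as the credentials argument of a BigQuery client, so no key material ever touches disk.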

To revoke, run two bq remove-iam-policy-binding commands and the next daily run will fail with a precise error — our handler catches that and marks the project access_revoked in Firestore, halts subsequent runs, and sends a one-time email asking you to re-verify.
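The revocation path described above (catch the error, mark access_revoked, halt later runs, email once) can be sketched as follows. Everything here is a hypothetical stand-in: AccessDenied substitutes for the permission error the BigQuery client raises (google.api_core.exceptions.Forbidden in the real library), and the Project fields are illustrative rather than the actual Firestore schema.

```python
from dataclasses import dataclass
from typing import Callable


class AccessDenied(Exception):
    """Stand-in for the permission error raised after access is revoked."""


@dataclass
class Project:
    doc_id: str
    state: str = "active"
    revocation_email_sent: bool = False


def run_detection(project: Project, query: Callable[[], list],
                  send_email: Callable[[str], None]) -> list:
    """Run one detection; on lost access, flag the project and stop."""
    if project.state == "access_revoked":
        return []  # halted: no queries until access is re-verified
    try:
        return query()
    except AccessDenied:
        project.state = "access_revoked"
        if not project.revocation_email_sent:
            send_email(project.doc_id)  # one-time notice
            project.revocation_email_sent = True
        return []
```

Checking the state before querying is what makes the halt cheap: a revoked project costs one Firestore read per day instead of a failed BigQuery job.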

4. Detection: z-score per SKU

BillSnap uses days T-15..T-2 as the 14-day baseline window and T-1 as "today" (GCP's billing export lags ~24h, so "today" is yesterday in clock time).
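The date arithmetic can be sketched as below, assuming a 14-day baseline that ends the day before the evaluated day (which matches the d < today filter and the n_days >= 14 requirement in the query). The function name is hypothetical.

```python
from datetime import date, timedelta


def detection_window(run_date: date, baseline_days: int = 14) -> tuple[date, date, date]:
    """Return (baseline_start, baseline_end, evaluated_day) for a run on run_date."""
    evaluated = run_date - timedelta(days=1)        # export lags ~24h
    baseline_end = evaluated - timedelta(days=1)    # baseline excludes the evaluated day
    baseline_start = evaluated - timedelta(days=baseline_days)
    return baseline_start, baseline_end, evaluated


start, end, evaluated = detection_window(date(2024, 3, 20))
print(start, end, evaluated)
```

For a run on 2024-03-20 this yields a baseline of 2024-03-05 through 2024-03-18 and an evaluated day of 2024-03-19.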

The query, simplified:

-- @baseline_start, @today, @project_id, @min_daily_cost_usd and @z_threshold
-- are query parameters supplied by the detector.
WITH daily AS (
  -- One row per day, service, and SKU.
  SELECT DATE(usage_start_time) AS d, service.description AS svc,
         sku.description AS sku, SUM(cost) AS daily_cost
  FROM `project.dataset.gcp_billing_export_v1_*`
  WHERE DATE(_PARTITIONTIME) BETWEEN @baseline_start AND @today
    AND project.id = @project_id AND currency = 'USD'
  GROUP BY d, svc, sku
),
baseline AS (
  -- Mean and sample standard deviation over the days before @today.
  SELECT svc, sku, AVG(daily_cost) AS mean,
         STDDEV_SAMP(daily_cost) AS stddev, COUNT(*) AS n_days
  FROM daily WHERE d < @today
  GROUP BY svc, sku
)
SELECT t.svc, t.sku, t.daily_cost AS today_cost, b.mean, b.stddev,
       SAFE_DIVIDE(t.daily_cost - b.mean, b.stddev) AS z
FROM (SELECT * FROM daily WHERE d = @today) t
JOIN baseline b USING (svc, sku)
WHERE b.n_days     >= 14
  AND b.stddev     >  0
  AND t.daily_cost >= @min_daily_cost_usd
  AND SAFE_DIVIDE(t.daily_cost - b.mean, b.stddev) > @z_threshold
ORDER BY z DESC LIMIT 50;

Full version (the one that actually runs) is published at /sql.txt.

Four anti-noise layers

We use directional z > threshold, not ABS(z) > threshold — BillSnap alerts on spend spikes, not drops. Drops are usually fine and would generate noise.
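To make the filters concrete, here is a small Python sketch that mirrors the WHERE clause of the simplified query for a single SKU. It is not the production path (detection runs in SQL); the function name is hypothetical, and the defaults are taken from the z_threshold and min_daily_cost_usd defaults in the data model.

```python
from statistics import mean, stdev
from typing import Optional


def spike_z(baseline: list[float], today_cost: float, *,
            z_threshold: float = 2.0, min_daily_cost_usd: float = 1.0,
            min_days: int = 14) -> Optional[float]:
    """Return the z-score if today's cost is a spike, else None."""
    if len(baseline) < min_days:         # not enough history
        return None
    sd = stdev(baseline)                 # sample std dev, like STDDEV_SAMP
    if sd == 0:                          # flat spend: z is undefined
        return None
    if today_cost < min_daily_cost_usd:  # ignore trivially small costs
        return None
    z = (today_cost - mean(baseline)) / sd
    return z if z > z_threshold else None  # directional: drops never alert


history = [10.0, 12.0] * 7               # 14 days alternating $10 and $12
print(spike_z(history, 20.0), spike_z(history, 2.0))
```

With this history, a $20 day is flagged while a $2 day returns None, even though its absolute z-score is just as large.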

5. Firestore data model

Native mode, four collections, no subcollections (flat is faster and cheaper to query):

customers/{uid}                              # uid = Firebase uid
  email, stripe_customer_id, stripe_subscription_id,
  state ∈ {pending_checkout, trialing, active,
           past_due, canceled, access_revoked},
  trial_ends_at, created_at, last_payment_failure_at

projects/{uid}_{gcp_project_id}
  customer_id, gcp_project_id, bq_dataset, bq_table_suffix,
  z_threshold (default 2.0), min_daily_cost_usd (default 1.0),
  slack_webhook_url, email_alerts,
  state ∈ {active, paused},
  last_run_at, last_run_status, last_run_error

alerts/{project_doc_id}_{YYYY-MM-DD}            # idempotency key
  project_doc_id, customer_id, date,
  flagged_skus[], total_anomalous_cost,
  notified_slack_at, notified_email_at, created_at

stripe_events/{stripe_event_id}                # webhook idempotency
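The idempotency keys above work because the document id is deterministic: a retried task or a redelivered Pub/Sub message resolves to the same document. The sketch below shows the idea with an in-memory dict standing in for Firestore; the function names are hypothetical. With the real client, DocumentReference.create() gives comparable create-if-absent behaviour, since it raises when the document already exists.

```python
from datetime import date, datetime, timezone


def alert_doc_id(project_doc_id: str, day: date) -> str:
    """Deterministic document id: one alert per project per day."""
    return f"{project_doc_id}_{day.isoformat()}"


def create_alert_once(store: dict, project_doc_id: str, day: date, skus: list) -> bool:
    """Insert the alert unless it already exists; return True if created."""
    doc_id = alert_doc_id(project_doc_id, day)
    if doc_id in store:
        return False  # a retried task hit the same key: nothing to do
    store[doc_id] = {"flagged_skus": skus, "notified_slack_at": None}
    return True


def mark_slack_notified(store: dict, doc_id: str) -> bool:
    """Record the Slack send time once; return False if already sent."""
    alert = store[doc_id]
    if alert["notified_slack_at"] is not None:
        return False
    alert["notified_slack_at"] = datetime.now(timezone.utc)
    return True
```

The same pattern applies to stripe_events: keying on the Stripe event id means a replayed webhook is detected by a single existence check.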

Composite indexes (declared in firestore.indexes.json):

Security rules: all client reads + writes denied. Every BillSnap data access goes through Cloud Run services using the Firestore Admin SDK (which authenticates via the runtime SA and bypasses rules). The dashboard never reads Firestore directly; it talks to webhook /me and /me/alerts over a Firebase-ID-token-gated REST API. If anyone ever finds a path to the Firestore JS SDK in the dashboard bundle, treat it as a bug.
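The token gate in front of /me and /me/alerts can be sketched as two small functions. The verifier is injected so the example stays self-contained; in a real service it would plausibly be firebase_admin.auth.verify_id_token, which returns the decoded claims including uid. The function names and the choice to treat every verification failure as unauthenticated are assumptions.

```python
from typing import Callable, Optional


def extract_bearer(authorization: Optional[str]) -> Optional[str]:
    """Pull the token out of an 'Authorization: Bearer <token>' header value."""
    if not authorization:
        return None
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()


def authenticate(authorization: Optional[str],
                 verify: Callable[[str], dict]) -> Optional[str]:
    """Return the caller's uid, or None if the token is missing or invalid."""
    token = extract_bearer(authorization)
    if token is None:
        return None
    try:
        claims = verify(token)  # e.g. firebase_admin.auth.verify_id_token
    except Exception:
        return None  # any verification failure means unauthenticated
    return claims.get("uid")
```

Since customers/{uid} is keyed by the Firebase uid, the value returned here is all the handler needs to scope its Firestore reads to the caller.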

6. Stack + dependencies

Layer     Choice                               Why
Runtime   Python 3.11 + FastAPI                Boring + fast. ~50ms cold start, ~5ms warm.
Hosting   Cloud Run (us-central1)              Free tier handles 2M req/mo; pay-per-request, scale-to-zero.
Build     Cloud Build → Artifact Registry      One command from source: gcloud builds submit.
Data      BigQuery (cross-project read)        Same engine that emits the billing data. No copies.
State     Firestore Native                     Cheap, scale-to-zero, real-time listeners available if needed.
Queue     Cloud Tasks                          Rate-limited fan-out; retries with backoff; no quota state to manage.
Schedule  Cloud Scheduler                      Two cron jobs; no Composer/Airflow overhead.
Events    Pub/Sub + dead-letter                Decouples detection from notification; 5-attempt cap.
Secrets   Secret Manager                       Terraform-managed containers; values loaded out-of-band.
Billing   Stripe (Checkout + Customer Portal)  Hosted; we never see card data.
Email     SendGrid (Twilio)                    Free tier covers ~3K emails/mo; SPF/DKIM/DMARC configured.
Frontend  Astro (static) + Firebase Hosting    Two sites: billsnap.dev (marketing), app.billsnap.dev (dashboard).
Auth      Firebase Auth (Google OIDC only)     No passwords. Server verifies ID tokens via firebase-admin.
DNS       Cloudflare                           Free tier; API token scoped to single zone.
Uptime    Better Stack (free tier)             Public status page at status.billsnap.dev.
Infra     Terraform (local state)              Single workspace, ~400 lines. State file gitignored.

7. Security posture

8. Roadmap

Things we explicitly haven't built yet and the order we'd add them based on early signal:

Questions, corrections, or just want to chat about cross-project IAM weirdness? support@billsnap.dev.
