Architecture
How BillSnap is built: multi-tenant Cloud Run + BigQuery, optimized for ≤ 50 customers, and about $5/month to operate at 10 customers. This page describes what we chose, why, and what we explicitly rejected, at the same depth you'd want before granting any service account read access to your billing data.
1. Data flow at a glance
Every morning at 09:00 UTC, Cloud Scheduler kicks off a fan-out through Cloud Tasks to one detection job per active customer. Anomalies (if any) flow through Pub/Sub to Slack + email.
Cloud Scheduler   daily-detect (cron: 0 9 * * *)
      │
      ▼  OIDC POST
detector /dispatch   lists active Firestore customers
      │
      ▼  enqueue 1 task / customer
Cloud Tasks queue: detect-queue   max_concurrent_dispatches = 5
      │                           max_attempts = 3, exp backoff
      ▼  OIDC POST
detector /detect?project_doc_id=...
      │
      ├──▶ BigQuery: customer's billing_export dataset (cross-project read,
      │              impersonating billsnap-reader)
      │
      ├──▶ Compute 14-day z-score per service.sku in SQL
      │
      ├──▶ Write alert event to Firestore (idempotency key: project_doc_id_YYYY-MM-DD)
      │
      └──▶ Pub/Sub topic: anomaly-detected
                │
                ├──▶ subscription: anomaly-detected-to-slack
                │         ▼  push (OIDC)
                │    notifier /notify-slack
                │
                └──▶ subscription: anomaly-detected-to-email
                          ▼  push (OIDC)
                     notifier /notify-email (SendGrid)

Both push subscriptions: max_delivery_attempts = 5 → anomaly-dead-letter
7-day retention on dead-letter inbox
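The fan-out step in the diagram can be sketched in plain Python. This is a minimal illustration rather than BillSnap's actual code: the function name `build_detect_tasks`, the `DETECTOR_URL` value, and the dict layout are assumptions made for the example. Only the `/detect?project_doc_id=...` route and the one-task-per-customer rule come from the diagram.

```python
from urllib.parse import urlencode

# Hypothetical base URL of the private detector service (assumption).
DETECTOR_URL = "https://detector.example.internal"


def build_detect_tasks(project_doc_ids):
    """Return one task description per active project document.

    Each description holds what a queue would need to issue the POST to
    /detect. Duplicate ids are dropped so nothing is scanned twice per run.
    """
    tasks = []
    for doc_id in dict.fromkeys(project_doc_ids):  # de-duplicate, keep order
        query = urlencode({"project_doc_id": doc_id})
        tasks.append({"method": "POST", "url": f"{DETECTOR_URL}/detect?{query}"})
    return tasks


tasks = build_detect_tasks(["uid1_proj-a", "uid2_proj-b", "uid1_proj-a"])
```

In a real deployment each description would be handed to the queue, which then applies the concurrency cap and retry policy shown above.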
A separate Cloud Scheduler job (nightly-billing-state,
cron 30 8 * * * UTC) calls reconcile,
which walks every customers/{uid} doc, pulls
Stripe's authoritative subscription status, and writes back any
drift. This catches the case where a Stripe webhook event is
dropped or filtered.
2. The four Cloud Run services
detector (private, OIDC)
Daily anomaly run.
/dispatch enumerates active customers and enqueues
tasks; /detect runs the SQL against one customer.
Impersonates billsnap-reader for cross-project
BigQuery reads.
notifier (private, Pub/Sub push)
Slack + email fan-out.
/notify-slack uses Block Kit;
/notify-email uses SendGrid + a Jinja2 HTML
template. Idempotent against alerts/{id}.
webhook (public)
Stripe webhook receiver + public API.
Verifies Stripe signatures, handles 5 subscription events.
Also serves /me, /me/alerts,
/create-checkout-session,
/verify-access — all gated by Firebase ID-token
bearer auth.
reconcile (private, OIDC)
Nightly billing-state reconciler.
For every customer, retrieves Stripe truth and updates
Firestore state if drifted. Defensive against dropped
webhooks. Distinguishes never-had-sub (pending_checkout)
from sub-ended (canceled).
All four are Python 3.11 FastAPI services, deployed to Cloud Run
in us-central1. Three are private (callers must present an OIDC token) and
only webhook is public, because Stripe and the dashboard must reach it from
the public internet; Stripe signature verification and Firebase ID tokens are the real auth.
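The reconciler's drift check can be sketched as a pure function. This is an illustration under stated assumptions, not the service's code: the status-to-state mapping, the `had_subscription` flag, and both function names are hypothetical. Only the state names and the never-had-sub versus sub-ended distinction come from the description above.

```python
# Assumed mapping from Stripe subscription status to the customer state.
# The real service may handle more statuses or map them differently.
STRIPE_TO_STATE = {
    "trialing": "trialing",
    "active": "active",
    "past_due": "past_due",
    "canceled": "canceled",
}


def reconciled_state(stripe_status, had_subscription):
    """Return the state Firestore should hold, or None if undecided.

    stripe_status is None when Stripe reports no subscription at all.
    """
    if stripe_status is None:
        # Never subscribed and subscription ended are different states.
        return "canceled" if had_subscription else "pending_checkout"
    # Statuses outside the assumed mapping yield None (leave state alone).
    return STRIPE_TO_STATE.get(stripe_status)


def drift(current_state, stripe_status, had_subscription):
    """Return the corrected state if Firestore has drifted, else None."""
    target = reconciled_state(stripe_status, had_subscription)
    if target is None or target == current_state:
        return None
    return target
```

For example, a customer stored as `active` whose subscription Stripe reports as `canceled` (a dropped webhook) would be corrected to `canceled` on the next nightly pass.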
3. Cross-project IAM model
BillSnap reads your billing data from your GCP project. The grant is dataset-scoped and revocable in one command:
your-project billsnap-prod-495823
───────────── ───────────────────────
billing_export.gcp_billing_export_v1_*
▲ │
│ │
│ dataset:dataViewer│
│ project:jobUser │
│ │
◀───────── impersonates
│
billsnap-runtime SA
│
(Cloud Run detector + webhook)
Two service accounts, one purpose each:
- billsnap-runtime: runs every Cloud Run service. Lives entirely inside the BillSnap project. No customer-side IAM, ever.
- billsnap-reader: the SA you grant access to during onboarding. Two narrow roles:
  - roles/bigquery.dataViewer on your specific billing-export dataset (not the project)
  - roles/bigquery.jobUser on the project that runs the BigQuery job (so the scan is billed to you, ~$0.01/run)
At runtime, the detector requests an OAuth token impersonating
billsnap-reader. The runtime SA has
roles/iam.serviceAccountTokenCreator on the reader
SA, granted by Terraform inside the BillSnap project. No keys.
No JSON files. Each request mints a fresh token that is held only in memory and expires after 5 minutes.
To revoke, run two bq remove-iam-policy-binding
commands and the next daily run will fail with a precise error —
our handler catches that and marks the project
access_revoked in Firestore, halts subsequent runs,
and sends a one-time email asking you to re-verify.
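The revocation flow described above (403, mark revoked, halt, email once) can be sketched with an in-memory record. The `ProjectRecord` class and `handle_detect_result` function are hypothetical stand-ins for the Firestore document and the real handler, and the email is reduced to a list append.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectRecord:
    """In-memory stand-in for the stored project state (assumption)."""
    state: str = "active"
    emails_sent: list = field(default_factory=list)


def handle_detect_result(project, http_status):
    """Apply the revocation flow; return True if detection may continue."""
    if project.state == "access_revoked":
        return False  # halted: skip until the customer re-verifies
    if http_status == 403:
        project.state = "access_revoked"
        # Appended only on the transition, so the email goes out once.
        project.emails_sent.append("please re-verify access")
        return False
    return True


record = ProjectRecord()
first = handle_detect_result(record, 403)   # access was just revoked
second = handle_detect_result(record, 403)  # later run: already halted
```

Because the email is tied to the state transition rather than to the error itself, repeated failures on later days stay silent.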
4. Detection: z-score per SKU
BillSnap uses days T-15..T-2 as the 14-day baseline
window and T-1 as "today" (GCP's billing export
lags ~24h, so "today" is yesterday in clock time).
The query, simplified:
-- In this simplified form, @baseline_start and @today stand for DATE query parameters.
WITH daily AS (
  -- One row per day, service, and SKU for the target project.
  SELECT DATE(usage_start_time) AS d, service.description AS svc,
         sku.description AS sku, SUM(cost) AS daily_cost
  FROM `project.dataset.gcp_billing_export_v1_*`
  WHERE DATE(_PARTITIONTIME) BETWEEN @baseline_start AND @today
    AND project.id = @project_id AND currency = 'USD'
  GROUP BY d, svc, sku
),
baseline AS (
  -- Mean and sample standard deviation over the days before "today".
  SELECT svc, sku, AVG(daily_cost) AS mean,
         STDDEV_SAMP(daily_cost) AS stddev, COUNT(*) AS n_days
  FROM daily WHERE d < @today
  GROUP BY svc, sku
)
SELECT t.svc, t.sku, t.daily_cost AS today_cost, b.mean, b.stddev,
       SAFE_DIVIDE(t.daily_cost - b.mean, NULLIF(b.stddev, 0)) AS z
FROM (SELECT * FROM daily WHERE d = @today) t
JOIN baseline b USING (svc, sku)
WHERE b.n_days >= 14                    -- enough history
  AND b.stddev > 0                      -- skip perfectly flat SKUs
  AND t.daily_cost >= @min_daily_cost_usd
  AND SAFE_DIVIDE(t.daily_cost - b.mean, b.stddev) > @z_threshold  -- spikes only
ORDER BY z DESC LIMIT 50;
Full version (the one that actually runs) is published at
/sql.txt.
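To make the arithmetic concrete, here is the same z-score computed in Python for a single SKU. This is a worked sketch, not the production path (which runs in SQL): the function name `z_score` and the sample costs are invented for illustration. `statistics.stdev` is the sample standard deviation, matching `STDDEV_SAMP`.

```python
import math
import statistics


def z_score(baseline, today_cost, min_days=14):
    """Return the z-score of today's cost, or None if it cannot be judged."""
    if len(baseline) < min_days:
        return None  # too little history, like the n_days >= 14 guard
    mean = statistics.fmean(baseline)
    stddev = statistics.stdev(baseline)  # sample std dev, like STDDEV_SAMP
    if stddev == 0:
        return None  # perfectly flat baseline, like the stddev > 0 guard
    return (today_cost - mean) / stddev


# Made-up data: seven days at $10 and seven at $12, then a $15 day.
baseline = [10.0] * 7 + [12.0] * 7
z = z_score(baseline, 15.0)
# mean = 11, stddev = sqrt(14/13), so z is about 3.85 (above the 2.0 default)
```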
Four anti-noise layers
- min_daily_cost_usd = $1 floor: a SKU jumping from $0.001 to $0.05 is a 50× spike statistically but $0 in dollars. The floor keeps tiny SKUs from generating per-day alerts.
- n_days >= 14: brand-new SKUs with fewer than 14 daily observations get a "watching" state, not a flag. Prevents false positives on small samples.
- One digest per project per day: even if 12 SKUs fire, you get one Slack message + one email listing all of them, sorted by z-score. Not 12 alerts.
- Per-project tunable threshold: default z = 2.0; raise to 2.5 or 3.0 from the dashboard if alerts feel noisy.
We use directional z > threshold, not
ABS(z) > threshold — BillSnap alerts on
spend spikes, not drops. Drops are usually fine and would
generate noise.
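The four layers and the directional test combine into a single filter followed by one digest. The sketch below assumes a hypothetical `build_digest` function and a made-up candidate format (dicts with `sku`, `today_cost`, `z`, `n_days`); it only illustrates the rules listed above.

```python
def build_digest(candidates, z_threshold=2.0, min_daily_cost_usd=1.0):
    """Return the flagged SKU names for one project-day, or None if quiet.

    Flagged SKUs are merged into a single list sorted by z-score, highest
    first, so one message covers all of them.
    """
    flagged = [
        c for c in candidates
        if c["n_days"] >= 14                       # enough history
        and c["today_cost"] >= min_daily_cost_usd  # dollar floor
        and c["z"] > z_threshold                   # directional: spikes only
    ]
    flagged.sort(key=lambda c: c["z"], reverse=True)
    return [c["sku"] for c in flagged] or None


candidates = [
    {"sku": "A", "today_cost": 8.0, "z": 2.4, "n_days": 14},
    {"sku": "B", "today_cost": 0.05, "z": 9.0, "n_days": 14},   # under floor
    {"sku": "C", "today_cost": 25.0, "z": -4.0, "n_days": 14},  # a drop
    {"sku": "D", "today_cost": 12.0, "z": 5.2, "n_days": 6},    # watching
    {"sku": "E", "today_cost": 40.0, "z": 3.1, "n_days": 14},
]
digest = build_digest(candidates)
```

With these invented inputs only E and A survive, in that order, and they would go out as one message rather than two.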
5. Firestore data model
Native mode, four collections, no subcollections (flat is faster and cheaper to query):
customers/{uid}                          # uid = Firebase uid
    email, stripe_customer_id, stripe_subscription_id,
    state ∈ {pending_checkout, trialing, active,
             past_due, canceled, access_revoked},
    trial_ends_at, created_at, last_payment_failure_at

projects/{uid}_{gcp_project_id}
    customer_id, gcp_project_id, bq_dataset, bq_table_suffix,
    z_threshold (default 2.0), min_daily_cost_usd (default 1.0),
    slack_webhook_url, email_alerts,
    state ∈ {active, paused},
    last_run_at, last_run_status, last_run_error

alerts/{project_doc_id}_{YYYY-MM-DD}     # idempotency key
    project_doc_id, customer_id, date,
    flagged_skus[], total_anomalous_cost,
    notified_slack_at, notified_email_at, created_at

stripe_events/{stripe_event_id}          # webhook idempotency
Composite indexes (declared in
firestore.indexes.json):
- alerts(customer_id ASC, date DESC): powers the "Recent alerts" table on the dashboard
- projects(customer_id ASC, state ASC): powers the dispatch fan-out query
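The document-ID scheme is what makes retries safe: a second run for the same project and date resolves to the same alert ID. The sketch below uses a plain dict as a stand-in for Firestore, and the helper names are hypothetical. A real implementation would need an atomic create-if-absent write instead of the check-then-set shown here.

```python
from datetime import date


def project_doc_id(uid, gcp_project_id):
    """Build the projects/ document ID: {uid}_{gcp_project_id}."""
    return f"{uid}_{gcp_project_id}"


def alert_doc_id(project_doc, run_date):
    """Build the alerts/ document ID: {project_doc_id}_{YYYY-MM-DD}."""
    return f"{project_doc}_{run_date.isoformat()}"


def create_once(store, doc_id, payload):
    """Create the document only if absent; return True when it was created."""
    if doc_id in store:
        return False
    store[doc_id] = payload
    return True


alerts = {}  # dict standing in for the alerts collection
pid = project_doc_id("uid123", "my-proj")
aid = alert_doc_id(pid, date(2024, 5, 1))
created_first = create_once(alerts, aid, {"flagged_skus": ["A"]})
created_retry = create_once(alerts, aid, {"flagged_skus": ["A"]})  # retried task
```

The retried task finds the existing document and does nothing, so a Cloud Tasks retry cannot produce a duplicate alert for the same day.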
Security rules: all client reads + writes
denied. Every BillSnap data access goes through Cloud
Run services using the Firestore Admin SDK (which authenticates
via the runtime SA and bypasses rules). The dashboard never reads
Firestore directly; it talks to webhook /me and
/me/alerts over a Firebase-ID-token-gated REST API.
If anyone ever finds a path to the Firestore JS SDK in the
dashboard bundle, treat it as a bug.
6. Stack + dependencies
| Layer | Choice | Why |
|---|---|---|
| Runtime | Python 3.11 + FastAPI | Boring + fast. ~50ms cold start, ~5ms warm. |
| Hosting | Cloud Run (us-central1) | Free tier handles 2M req/mo; pay-per-request, scale-to-zero. |
| Build | Cloud Build → Artifact Registry | One command from source: gcloud builds submit. |
| Data | BigQuery (cross-project read) | Same engine that emits the billing data. No copies. |
| State | Firestore Native | Cheap, scale-to-zero, real-time listeners available if needed. |
| Queue | Cloud Tasks | Rate-limited fan-out; retries with backoff; no quota state to manage. |
| Schedule | Cloud Scheduler | Two cron jobs; no Composer/Airflow overhead. |
| Events | Pub/Sub + dead-letter | Decouples detection from notification; 5-attempt cap. |
| Secrets | Secret Manager | Terraform-managed containers; values loaded out-of-band. |
| Billing | Stripe (Checkout + Customer Portal) | Hosted; we never see card data. |
| Email | SendGrid (Twilio) | Free tier covers ~3K emails/mo; SPF/DKIM/DMARC configured. |
| Frontend | Astro (static) + Firebase Hosting | Two sites: billsnap.dev (marketing), app.billsnap.dev (dashboard). |
| Auth | Firebase Auth (Google OIDC only) | No passwords. Server verifies ID tokens via firebase-admin. |
| DNS | Cloudflare | Free tier; API token scoped to single zone. |
| Uptime | Better Stack (free tier) | Public status page at status.billsnap.dev. |
| Infra | Terraform (local state) | Single workspace, ~400 lines. State file gitignored. |
7. Security posture
- No raw billing data leaves your project. The BigQuery query runs against your dataset and bills the scan to your project. We never copy raw billing rows out; only the SKUs that triggered an alert are written to Firestore, which is the minimum needed to render the dashboard.
- Read-only, dataset-scoped. The customer-side grant is exactly two roles on exactly one dataset (data) and one project (job runner). No project-wide reader. No write permissions of any kind.
- Revocation is one command + immediate effect. Run bq remove-iam-policy-binding; the next daily run gets a precise 403; our handler halts the project. No appeal needed.
- No keys. The reader SA is never key-exported. Cross-project access uses short-lived OAuth tokens (impersonated_credentials with a 5-minute lifetime).
- Stripe signatures verified. webhook /stripe/webhook rejects any POST whose signature doesn't validate against the whsec_live_* secret in Secret Manager.
- Firestore client access denied. The dashboard cannot read or write any Firestore document from the browser. All reads go through the webhook public API, which gates every endpoint on a Firebase ID token belonging to the requesting user.
- CSP + standard headers. Both hosting sites send Content-Security-Policy, X-Content-Type-Options: nosniff, X-Frame-Options: SAMEORIGIN, Referrer-Policy: strict-origin-when-cross-origin, and a restrictive Permissions-Policy.
- No SOC 2 / HIPAA / ISO 27001. BillSnap is solo-built. Appropriate for indie projects and small-startup workloads; not for regulated industries. See the privacy policy for details.
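For readers curious what "signature doesn't validate" means in practice, the sketch below follows Stripe's documented scheme: the `Stripe-Signature` header carries a timestamp `t` and one or more `v1` signatures, each an HMAC-SHA256 of `"{t}.{payload}"` keyed with the endpoint secret. This is an illustration using only the standard library. The function name and the 300-second tolerance are assumptions, and a production service would more likely rely on Stripe's official library (`stripe.Webhook.construct_event`).

```python
import hashlib
import hmac
import time


def verify_stripe_signature(payload, header, secret, tolerance_s=300, now=None):
    """Return True if header holds a valid, recent v1 signature for payload."""
    timestamp = None
    signatures = []
    for item in header.split(","):
        key, _, value = item.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            signatures.append(value)
    if timestamp is None or not timestamp.isdigit() or not signatures:
        return False
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_s:
        return False  # too old or too far in the future: possible replay
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # Constant-time comparison against every v1 signature in the header.
    return any(
        hmac.compare_digest(expected.encode(), s.encode()) for s in signatures
    )
```

The check must run against the raw request body bytes, since re-serializing parsed JSON can change them and break the signature.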
8. Roadmap
Things we explicitly haven't built yet and the order we'd add them based on early signal:
- Weekly digest tier ($5/mo add-on). Even on quiet weeks, a Friday email: total spend, w/w delta, top 3 SKUs, idle-resource hints. Addresses the silent-churn risk — product feels present even when nothing fires.
- Slack OAuth app. Replace the per-customer Incoming Webhook URL flow with a proper "Add to Slack" install. Cuts onboarding friction, enables channel selection, and lists in the Slack App Directory for free distribution.
- AWS Cost & Usage Report support. Same z-score math against the AWS CUR. Requires a parallel impersonation chain (IAM role assume).
- Day-of-week seasonality. v1 ignores DOW because most indie projects are flat 24/7 batch jobs. Larger workloads with strong weekday/weekend cycles would benefit from DOW-stratified baselines.
- Per-SKU mean-relative floor. Instead of a flat $1/day floor, configurable as "10% of the SKU's mean", so larger SKUs aren't accidentally suppressed.
- Public API. Programmatic access to your alerts for integration with internal tooling.
Questions, corrections, or just want to chat about cross-project IAM weirdness? support@billsnap.dev.