The Pipeline I Tried Not to Build

architecture · firebase · bigquery

My app runs on Firestore, a realtime document database. When a user edits a record, every other user looking at that data sees the change instantly. No polling. No refresh button. No "someone else has modified this record" modals. It just works. Multiple people editing the same organization at the same time, seeing each other's changes appear as they happen. This is the core experience. It's the reason users tell me the product feels alive.

Then a customer needed to generate reports that aggregate hundreds of thousands of rows across multiple dimensions with filters. Firestore doesn't aggregate. That's not what document databases do. I had a database that was excellent at realtime collaboration and incapable of analytical queries, and I needed both.

The alternatives I tried to talk myself into

Every solution I considered solved the reporting problem by degrading something else.

"Just query harder." Read every document into a Cloud Function, aggregate in application code. This works at small scale. At 300,000 rows with multi-dimensional filters (by account, by fund, by date range, by custom label) each report costs real money in document reads and real seconds in wall time. A single report shouldn't cost more than the user's monthly subscription.
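To make "real money" concrete, here is a rough back-of-envelope sketch. The $0.06-per-100,000-reads figure is Firestore's commonly cited list price (check current pricing for your region), and the report volume is hypothetical:

```javascript
// Rough cost model for read-and-aggregate reporting on Firestore.
// Assumes ~$0.06 per 100,000 document reads (list price; verify for your region).
function reportReadCost(rowsPerReport, pricePer100kReads = 0.06) {
  return (rowsPerReport / 100_000) * pricePer100kReads;
}

const perReport = reportReadCost(300_000); // ≈ $0.18 in reads per report
const perMonth = perReport * 30 * 20;      // e.g. 20 users running it daily
console.log(perReport.toFixed(2), perMonth.toFixed(2));
```

Pennies per report, but it scales linearly with rows and report frequency, and that's before counting the seconds of wall time per run.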

"Just use Postgres." The voice in every developer's head. I looked seriously at Supabase, which wraps Postgres with a realtime layer. But "realtime layer" in this context means polling with good abstractions on top. It's genuinely good, but it can't match native Firestore listeners, which push changes over a persistent WebSocket connection with sub-second latency. I would be rewriting my strongest feature, the thing users actually compliment, to solve a reporting problem. Trading proven strength for theoretical adequacy.

"Precompute the aggregates." Maintain running totals in summary documents, updated on every write. This is tempting because it sounds simple. It works until someone edits a record from three months ago and changes which category it belongs to. Now your cached totals are wrong across multiple dimensions, and the error is silent. Nothing will tell you the numbers are off until a user notices their report doesn't add up. You're building a fragile cache that's always one edge case from silent corruption.

The shift: stop evaluating solutions, start evaluating failure modes

I spent longer than I'd like to admit going back and forth between these options before I realized I was asking the wrong question. I was asking "which architecture is simplest?" The better question was "which architecture fails best?"

Every system breaks. A Cloud Function cold-starts and drops a write. A batch import races with a manual edit. A network timeout hits at exactly the wrong moment. The question isn't whether your data pipeline will have edge cases. It's whether those edge cases are detectable and repairable, or silent and cumulative.

This is where the precomputed aggregates approach falls apart conceptually, not just practically. When it fails, it fails silently. Your totals drift by some small amount, and nothing in the system knows. The error is like being off by one degree at the start of a long hike. After a hundred feet you'd never notice, but after ten miles you're nowhere near where you should be. The error compounds, and by the time someone catches it, you can't easily trace back to where it went wrong.

A sync pipeline fails differently. It either syncs or it doesn't. When it doesn't, you can log it, detect it, retry it, and verify it. The errors are discrete, not cumulative. You can build a system that catches its own mistakes.

Once I reframed the problem this way, the answer was obvious. Firestore stays as the source of truth for all writes and realtime collaboration. A separate analytics store handles the aggregation queries. A pipeline connects them. And instead of trying to make the pipeline perfect, I'd make it honest: a system that knows when it's wrong and fixes itself.

The implementation: four patterns worth stealing

I'm going to strip the implementation down to the design patterns, because they're not specific to my stack. If you're building any architecture where a source of truth feeds a derived data store (event sourcing, CQRS, search indexes, analytics pipelines) these apply.

Pattern 1: Event-driven sync with deterministic error capture

Every write to the source database triggers a sync to the derived store. If the sync fails, the failure is logged with a deterministic ID. Same failure always produces the same error document, so retries don't create duplicates.

const { onDocumentWritten } = require("firebase-functions/v2/firestore");
const { getFirestore, FieldValue } = require("firebase-admin/firestore");

const db = getFirestore();

exports.trackChanges = onDocumentWritten(
  { document: "orgs/{orgId}/records/{recordId}" },
  async (event) => {
    const { orgId, recordId } = event.params;
    // Derive the action from the before/after snapshots
    const action = !event.data.after.exists
      ? "delete"
      : !event.data.before.exists
        ? "create"
        : "update";
    try {
      await syncToDerivedStore({ orgId, recordId, action });
    } catch (error) {
      // Deterministic ID: same failure overwrites, never duplicates
      const errorDocId = `${orgId}_${recordId}_${action}`;
      await db.collection("syncErrors").doc(errorDocId).set({
        error: error.message,
        timestamp: new Date(),
        attempts: FieldValue.increment(1),
      }, { merge: true });
    }
  }
);

The deterministic ID matters more than it looks. Both the trigger and the sync function might try to log the same failure. The second write merges instead of creating a duplicate record. This is a small thing that prevents your error log from becoming noise.
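The other half of the pattern is a scheduled retry pass over that error log. This is a hypothetical sketch, not the post's actual code: it takes the logged failures plus the same sync function the trigger uses, and reports which IDs recovered and which still fail (thanks to the deterministic IDs, a re-failure just overwrites the same error document):

```javascript
// Hypothetical retry pass over logged sync failures.
// `errors` is a list of { id, orgId, recordId, action } read from syncErrors;
// `syncFn` is the same sync the trigger calls. Returns the outcome per ID.
async function retrySyncErrors(errors, syncFn) {
  const recovered = [];
  const stillFailing = [];
  for (const err of errors) {
    try {
      await syncFn(err);
      recovered.push(err.id); // success: the error doc can be deleted
    } catch (_) {
      stillFailing.push(err.id); // failure: same doc ID, attempts increments
    }
  }
  return { recovered, stillFailing };
}
```

In production this would run on a schedule, delete the error documents for recovered IDs, and alert once attempts crosses some threshold.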

Pattern 2: Delete by inserting

This was the part that felt genuinely weird to write.

My analytics store has a streaming buffer. Recent inserts aren't immediately queryable. They'll appear in a few minutes, but right now, they're invisible. If a user creates a record and deletes it 30 seconds later, there's nothing to delete. The original row hasn't materialized yet.

So deletes become inserts. A "tombstone" row with is_deleted: true and a version number guaranteed to be higher than the original. A view filters to the latest version per record, and the tombstone always wins.

// Can't DELETE from the streaming buffer.
// Delete by inserting a tombstone with a version that always wins.
// Unix epoch seconds (~1.7 billion) beats any sequential version (1, 2, 3...).
const preemptiveVersion = Math.floor(Date.now() / 1000);

// `lines` is the record's line items; `emptyRow` is a row with all value
// columns nulled out, keeping only the identifying keys.
const tombstoneRows = lines.map((_, idx) => ({
  insertId: `${recordId}_preemptive_delete_${Date.now()}_${idx}`,
  json: {
    ...emptyRow,
    is_deleted: true,
    row_version: preemptiveVersion,
    operation_type: "PREEMPTIVE_DELETE",
  },
}));

Deleting by inserting. It sounds wrong. But look at what you get: an audit trail for free because every version of every record is preserved. Race conditions resolve deterministically because highest version wins, period. And a cleanup job handles old rows on its own schedule. The "weird" solution turned out to be the most robust one.
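The "latest version per record" rule the view enforces is easy to state in plain code. A minimal sketch, assuming each row carries record_id, row_version, and is_deleted:

```javascript
// Resolve a record's full history to its current state: the highest
// row_version wins, and a winning tombstone means the record is gone.
function resolveLatest(rows) {
  const latest = new Map();
  for (const row of rows) {
    const current = latest.get(row.record_id);
    if (!current || row.row_version > current.row_version) {
      latest.set(row.record_id, row);
    }
  }
  // Drop records whose winning row is a tombstone
  return [...latest.values()].filter((row) => !row.is_deleted);
}
```

In the real system this logic lives in a view over the raw table; the epoch-seconds tombstone version is what guarantees the tombstone outranks any sequential version.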

Pattern 3: Reconciliation as a first-class system

This is the core of the architecture, and the thing I'd tell every developer building a derived data store to invest in early rather than late.

Don't trust your pipeline to be perfect. Build a scheduled job that walks both stores side by side, comparing what should exist in each. You're looking for three types of drift:

// errorMap: logged sync failures from `syncErrors`, keyed by record ID
const [sourceRecords, derivedRecords] = await Promise.all([
  getAllFromSource(orgId),
  getAllFromDerived(orgId),
]);

// Type 1: In source, not in derived → sync dropped a write
for (const [id, sourceData] of sourceRecords) {
  if (!derivedRecords.has(id)) {
    const hasLoggedError = errorMap.has(id);
    // Known failure (logged) → expected, will be retried
    // Silent failure (no log) → your error capture has a gap
  }
}

// Type 2: In both, but different → partial sync or failed edit

// Type 3: In derived, not in source → delete never propagated
for (const [id, derivedData] of derivedRecords) {
  if (!sourceRecords.has(id)) {
    // Orphaned row. Auto-repair: write a tombstone.
  }
}

The reconciliation engine doesn't just fix data. It audits your error handling. A discrepancy with a corresponding logged error is a known issue, queued for repair. A discrepancy with no logged error is a silent failure, which means your error capture has a gap you didn't know about. The reconciliation system tests your error logging as a side effect of testing your data.
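Stripped of the store-specific fetching, the classification reduces to set arithmetic over three ID sets. A minimal sketch (names hypothetical; the "in both but different" case additionally needs a content comparison, which is omitted here):

```javascript
// Classify drift between source and derived stores.
// knownMissing:  dropped writes that were logged (expected, will be retried)
// silentMissing: dropped writes with no log entry (error capture has a gap)
// orphaned:      derived rows with no source record (delete never propagated)
function classifyDrift(sourceIds, derivedIds, loggedErrorIds) {
  const result = { knownMissing: [], silentMissing: [], orphaned: [] };
  for (const id of sourceIds) {
    if (!derivedIds.has(id)) {
      (loggedErrorIds.has(id) ? result.knownMissing : result.silentMissing).push(id);
    }
  }
  for (const id of derivedIds) {
    if (!sourceIds.has(id)) result.orphaned.push(id);
  }
  return result;
}
```

Anything landing in silentMissing is the interesting output: it's not just bad data, it's evidence of an error-capture gap.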

This was the insight from immutable architecture that made the whole approach work: you don't need a perfect pipeline. You need a pipeline with a system that knows when it's imperfect. The reconciliation engine is that system. It turns "eventually consistent" from a euphemism for "sometimes wrong" into a genuine guarantee. The data will converge, because you have a process that detects and repairs every divergence.

Pattern 4: Atomic rebuild as the escape hatch

When things go sideways enough that you need to rebuild the derived store for an entity from scratch, you need atomicity. Users should never see partial state. Either they see the old data or the fully rebuilt data, nothing in between.

// PENDING mode: write everything, nothing visible yet
const [stream] = await client.createWriteStream({
  parent: tableName, // projects/{p}/datasets/{d}/tables/{t}
  writeStream: { type: "PENDING" },
});

// Fetch from source, append rows in batches...

// Finalize the stream, then commit: all rows become visible at once
await client.finalizeWriteStream({ name: stream.name });
await client.batchCommitWriteStreams({
  parent: tableName,
  writeStreams: [stream.name],
});

This is the escape hatch you hope to rarely use, but when you need it, nothing else will do. I've used it for data migrations, for recovering from bugs, and for onboarding large customers whose initial data import skips the normal sync path for performance.

What I actually learned

The instinct through all of this was to avoid complexity. Every alternative looked simpler on a whiteboard. Postgres is one database instead of two. Precomputed aggregates avoid the pipeline entirely. I kept trying to find the answer that let me not build the hard thing.

But simpler architecture doesn't mean simpler outcomes. The "simple" solutions either degraded the product or introduced failure modes that were harder to detect than the complexity they avoided. The Postgres migration would have been a cleaner architecture diagram and a worse product. The precomputed aggregates would have been fewer moving parts and less trustworthy data.

The right architecture was the one with the best failure mode. Not the fewest components. Not the cleanest diagram. The one where, when something goes wrong, you know about it and can fix it. Precomputed aggregates can drift silently for months. A sync pipeline with reconciliation tells you within hours, and repairs itself.

I think about this framing constantly now, and not just for databases. Whenever I'm choosing between approaches, I've stopped asking "which is simpler?" and started asking "when this breaks, what does the failure look like?" A system that fails loudly and recovers automatically is more trustworthy than one that appears to never fail but can't tell you when it's wrong.

One more thing. I built this entire pipeline with significant help from AI coding tools, and this was the part of the codebase where I had to redirect the AI the most. It would consistently try to simplify away the complexity. Suggest precomputed aggregates. Propose a Postgres migration. Collapse the tombstone pattern into a simpler delete-and-reinsert. Every suggestion was a perfectly reasonable implementation of the wrong architecture.

The AI could write any of these patterns once I told it what to build. What it kept getting wrong was which pattern to build. It optimized for simplicity on the whiteboard when I needed resilience in production. I'd ask for a sync solution and get back the clean, simple version, the one I'd already tried to talk myself into and rejected because I understood how my users actually work, what my scale actually demands, and what happens when the numbers are wrong in my specific domain.

That's the skill I think gets more valuable as implementation gets cheaper. Not writing the code. Knowing which code to write, and having the conviction to build the complex thing when the complex thing is actually the right answer.