Thursday, June 18, 2026

We Had a Perfectly Good Data Store. That Was the Problem.


Nobody files a ticket that says “our architecture has an abstraction problem.” They file tickets saying the data is wrong, or missing, or late. So engineering spends two weeks chasing a data-quality issue that does not exist, fixes nothing, and the same ticket comes back the following quarter wearing a slightly different hat.

That was us. The most useful thing I learned from the whole effort is that the bug was never in the data. It was in what we were asking the data to be.

We had an on-premises MongoDB instance serving as the registered golden source for enterprise reference data. Codes, classifications, identity lookups, the unglamorous shared data that quietly underpins customer onboarding, regulatory reporting, and a dozen other things people only notice when they break. It was well-maintained, authoritative, the genuine single source of truth. The team that owned it was rightly proud of it. By every reasonable measure, the system was healthy.

And yet every time an analytics team or a downstream product group needed something from it, the experience was miserable. They reverse-engineered the operational schema. They wrote one-off queries against nested JSON they only half understood. They tracked down whoever still carried the institutional memory of the collection structure, waited, and then repeated the entire ritual three months later when the requirement shifted by an inch.

The diagnosis took longer than it should have

I watched this play out for months before it clicked. The data was fine. We were asking an operational store to moonlight as an analytical platform, and it was bad at the second job. Not through any flaw of its own. It was simply never built for that.

Operational stores optimise for correctness and lifecycle management. Analytics teams need something else entirely: stable shapes, fields that are actually documented, a refresh cadence you can predict, and a way to judge whether a dataset is fit for purpose without reverse-engineering someone else’s schema. Those are not the same requirements, and conflating them is precisely how you end up with a system that is technically perfect and practically useless. Healthy uptime, miserable consumers.

So we stopped asking people to consume reference data directly from MongoDB. We started treating each dataset as a data product: something with a named owner, a definition, quality gates, governed access, and a real path to publication. The technical pipeline followed from that decision rather than driving it: MongoDB through Kafka Connect into Landing, then Bronze and Silver layers as Iceberg tables on S3, with Athena on top and publication through the Data Marketplace. Twenty-one reference data products eventually shipped down that single path.

Figure 1: The full pipeline. MongoDB as the authoritative golden source, events flowing through Kafka into Landing, Bronze and Silver layers as Iceberg tables on S3, Athena providing the query surface, and the enterprise Data Marketplace as the publication endpoint. Airflow orchestrates everything; DPPS UI gives operational visibility.

What “data product” actually forced us to decide

“Data product” is one of those phrases that can mean almost anything, which usually means it means nothing. So we made it mean something specific and non-negotiable: a dataset could not be published until it had a named owner, a data dictionary, business and technical metadata, documented audit expectations, quality gates, and a governed route into the Marketplace. Compliance with all active standards at deployment time was mandatory, enforced at publication, not requested in a review meeting.
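A publication gate of this kind can be sketched in a few lines. This is a minimal illustration, not the Marketplace's actual check: the artifact keys, the `DataProduct` class, and both function names are hypothetical, chosen only to mirror the requirements listed above.

```python
from dataclasses import dataclass, field

# Hypothetical keys mirroring the requirements named in the text.
# The real metadata model is not described in the article.
REQUIRED_ARTIFACTS = (
    "owner",
    "data_dictionary",
    "business_metadata",
    "technical_metadata",
    "audit_expectations",
    "quality_gates",
    "marketplace_route",
)


@dataclass
class DataProduct:
    name: str
    artifacts: dict = field(default_factory=dict)


def missing_artifacts(product: DataProduct) -> list:
    """Return the required artifacts that are absent or empty, in a fixed order."""
    return [key for key in REQUIRED_ARTIFACTS if not product.artifacts.get(key)]


def can_publish(product: DataProduct) -> bool:
    """Publication is blocked unless every required artifact is present."""
    return not missing_artifacts(product)


draft = DataProduct(
    "currency_codes",
    {"owner": "ref-data-team", "data_dictionary": "dict-v1"},
)
print(can_publish(draft))        # False
print(missing_artifacts(draft))  # the five artifacts still to be supplied
```

The point of returning the missing list rather than a bare boolean is that the gate tells the owning team what to fix, which is what makes enforcement at publication workable.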

That framing immediately surfaced questions that should have been answered years earlier. What is the actual boundary of this product? Which attributes matter to consumers, and which are operational plumbing nobody outside the owning team cares about? What does “current” mean for this dataset, and how would a consumer know if it had gone stale? How does anyone discover it without filing a ticket and waiting for a human to point them at the right S3 path?

None of that was governance overhead bolted on for show. Answering those questions was the architecture. The Kafka connectors and Iceberg tables were almost the easy part by comparison.

The three decisions that shaped everything else

The first decision was to keep MongoDB as the golden source. No rip-and-replace. Authority stayed where it belonged, with the team that understood the data’s lifecycle and had maintained it correctly for years. The business requirement was explicit: no business-logic transformation, a one-to-one mapping from source to destination, faithful preservation rather than enrichment. The temptation to crown a shiny new system as the source of truth lurks in every modernisation project, and it is almost always wrong. MongoDB did its job well. We were building a delivery layer, not replacing a foundation, and confusing the two is how good migrations turn into eighteen-month disasters.

The second was to build one delivery model instead of tolerating four. Before this work, at least four teams had independently extracted roughly the same reference data, each with its own refresh logic, its own reading of the field semantics, and its own private definition of “current.” The diplomatic word for that situation is “decentralised.” The honest word is chaos. The replacement was a single path anyone could reason about: events flowed from MongoDB through Kafka Connect into the pipeline, Airflow orchestrated a monthly batch on the 5th at 07:00 UTC with no dependency on working days or holiday calendars, and schema validation fired before anything touched S3. That one path replaced all four private empires.
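The schedule itself is simple enough to state precisely. In standard five-field cron syntax, "the 5th at 07:00" is `0 7 5 * *`. The sketch below uses only the standard library to compute the next run; `next_run` is a hypothetical helper for illustration, not part of Airflow, and it assumes a timezone-aware input.

```python
from datetime import datetime, timezone

# Standard cron fields: minute hour day-of-month month day-of-week.
CRON_MONTHLY_5TH_0700_UTC = "0 7 5 * *"


def next_run(after: datetime) -> datetime:
    """Return the first 5th-of-month 07:00 UTC strictly after `after`.

    `after` is assumed to be timezone-aware. Weekends and holidays are
    deliberately ignored, matching the schedule described in the text.
    """
    after = after.astimezone(timezone.utc)
    candidate = after.replace(day=5, hour=7, minute=0, second=0, microsecond=0)
    if candidate <= after:
        # This month's slot has passed, so roll over to the next month.
        year = after.year + (1 if after.month == 12 else 0)
        month = 1 if after.month == 12 else after.month + 1
        candidate = candidate.replace(year=year, month=month)
    return candidate


print(next_run(datetime(2026, 6, 18, tzinfo=timezone.utc)))
```

Because the rule never consults a calendar of working days, every consumer can work out the refresh date for any month without asking anyone.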

The cost of those four pipelines was never the compute or the storage, which was trivial. It was the reconciliation tax. Whenever two copies disagreed, and they did, someone senior and busy had to work out which one to believe. Multiply a half-day investigation by every quarter and every consuming team and you arrive at a genuinely expensive habit that never appeared on any budget line, because it was hidden inside everyone’s ordinary work. Collapsing four pipelines into one did not just simplify the diagram. It deleted an entire recurring category of argument.

The third was to treat publication as a real pipeline stage rather than an afterthought. Data that reached Silver got published into the Data Marketplace with metadata, a Kitemark quality score, documentation, and subscription behaviour already attached. Consumption happened exclusively through the Marketplace subscription model, never by handing someone an S3 path. Consumers could find a product, judge whether it fit, and subscribe to it without needing to know which bucket to ask about or which Slack channel to beg in. Publication meant the product went live. It did not mean a file quietly appeared in storage and someone hoped the right people would notice.

The boring stuff turned out to be the hard stuff

I kept waiting for the hard problems to show up in the pipeline itself. Kafka connector configuration, Iceberg table maintenance, Athena partition tuning, all of it needed attention, and all of it got sorted in due course. But the gap between “a pipeline that works” and “a platform people trust” came from the things I used to wave off as housekeeping. Naming conventions. Audit column standards. Documentation templates someone would actually open. Ownership that was real rather than nominal.

Naming is a good example of how unglamorous and how decisive this gets. A consumer searching the catalogue has to find a dataset using enterprise-standard terminology, not the internal shorthand that made sense to the team that built it. Mapping the metadata framework to the enterprise standard is tedious work that shows up in no demo. It is also the entire difference between a catalogue people can navigate and a list of cryptic table names only the authors understand.

Here is the uncomfortable part I did not appreciate going in: shared enterprise data tends to fail socially before it fails technically. The Kafka connector will be fine. What corrodes is the shared understanding of what “authoritative” means in practice, whether a given dataset is the real one or a copy somebody made eighteen months ago and forgot to deprecate. No amount of Iceberg optimisation touches that. You fix it at the layer where consumers decide whether to trust a dataset, which is the product layer, and nowhere else.

A concrete example of how social this gets. Early on, two teams disagreed about which currency-code dataset was correct. Both were internally consistent. Both had been “right” at some point. The difference came down to a refresh one team had quietly stopped running a year earlier, and neither team could prove which copy reflected the live source, because nothing in either dataset recorded where it came from or when. We did not fix that with a better connector. We fixed it by making provenance first-class columns. Every Silver record now carries SOURCE_SYSTEM, JOB_RUN_ID, VALID_FROM and VALID_TO, so the question “is this the real one, and is it current?” has a documented answer instead of a hallway debate.
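To make the idea concrete, here is a minimal sketch of how those four columns answer the "is it current?" question. The column names come from the text; everything else is an assumption made for illustration, including the helper names, the example values, the convention that an empty VALID_TO means "still open", and the half-open validity interval.

```python
from datetime import datetime, timezone


def stamp(record: dict, source_system: str, job_run_id: str,
          valid_from: datetime) -> dict:
    """Return a copy of the record with provenance columns attached.

    Assumption: VALID_TO is None while the record is the open, current version.
    """
    return {
        **record,
        "SOURCE_SYSTEM": source_system,
        "JOB_RUN_ID": job_run_id,
        "VALID_FROM": valid_from,
        "VALID_TO": None,
    }


def close_out(record: dict, valid_to: datetime) -> dict:
    """Return a copy of the record marked as superseded from `valid_to`."""
    return {**record, "VALID_TO": valid_to}


def is_current(record: dict, as_of: datetime) -> bool:
    """True if `as_of` falls inside [VALID_FROM, VALID_TO), assumed half-open."""
    if as_of < record["VALID_FROM"]:
        return False
    return record["VALID_TO"] is None or as_of < record["VALID_TO"]


utc = timezone.utc
eur = stamp({"code": "EUR"}, "mongodb-refdata", "run-2026-06-05",
            datetime(2026, 6, 5, 7, tzinfo=utc))
print(is_current(eur, datetime(2026, 6, 18, tzinfo=utc)))  # True
```

With these columns in place, two disagreeing copies can be compared on JOB_RUN_ID and the validity window rather than on anyone's recollection of when a refresh last ran.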

Storage is not the product

I have watched teams land data in S3, declare victory on self-service, and then spend six months baffled that nobody is using it. The answer is nearly always the same. “The data is in S3” is not a product. It is a location. People need to know the data exists, work out what it means, judge whether it fits their purpose, and find out who to contact when something looks wrong. A path gives them none of that.

The Marketplace addressed this more than any individual pipeline component did. It turned a scattered set of S3 paths into a governed catalogue of subscribable products, each with documentation, a quality score, and clear ownership. That is the difference between handing someone a warehouse address and handing them a shop. And because subscription is the only sanctioned route to the data, the catalogue stays the single front door rather than one option among several private back channels.

Separate truth, transport, and consumption

If I had five minutes with someone starting this work, I would spend all of it on one idea. Separate truth, transport, and consumption, and treat them as three different concerns owned by three different parts of the system. MongoDB holds truth, and stays authoritative. The pipeline, Landing through Bronze to Silver, moves that truth reliably and proves it arrived intact with checksum reconciliation and inter-layer record-count checks. The product layer, Silver tables, Athena, and the Marketplace, makes truth consumable by people who do not know and should never need to know how MongoDB organises its collections.
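The "proves it arrived intact" step can be illustrated with a small reconciliation check between two layers. This is a sketch under stated assumptions, not the pipeline's actual implementation: the function names are hypothetical, SHA-256 over canonical JSON is one reasonable choice of checksum among several, and it assumes both layers are compared on the same set of fields (any audit columns added downstream would need to be excluded first).

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """Hash one record via a canonical JSON form so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def layer_checksum(records) -> tuple:
    """Return (record_count, checksum) where the checksum ignores record order."""
    digests = sorted(record_digest(r) for r in records)
    combined = hashlib.sha256("".join(digests).encode("ascii")).hexdigest()
    return len(digests), combined


def reconcile(upstream, downstream) -> dict:
    """Compare two layers on record count and on content checksum."""
    up_count, up_sum = layer_checksum(upstream)
    down_count, down_sum = layer_checksum(downstream)
    return {
        "count_match": up_count == down_count,
        "checksum_match": up_sum == down_sum,
    }


bronze = [{"code": "EUR", "name": "Euro"}, {"code": "USD", "name": "US Dollar"}]
silver = [{"name": "US Dollar", "code": "USD"}, {"name": "Euro", "code": "EUR"}]
print(reconcile(bronze, silver))
```

Running both checks matters because they catch different failures: a count check notices dropped or duplicated records, while a checksum notices a value that changed in transit even when the counts agree. With a one-to-one mapping and no business-logic transformation, both should match at every hop.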


Figure 2: The same data, three separated planes. Truth stays in the operational golden source; transport moves it and proves it arrived intact; consumption exposes it as governed, subscribable products. Separating the three concerns, each with its own owner, is what removes the friction between producers and consumers.

When those three are genuinely separate, an enormous amount of organisational friction simply evaporates. Producers stop getting dragged into ad-hoc reporting. Consumers stop reverse-engineering operational intent. The ops team can evolve the MongoDB schema without shattering six downstream jobs. And a new team that needs country codes or currency classifications can find them in the Marketplace, read the documentation, and be done in an afternoon instead of a quarter.

The data was always fine. What we actually built was the boundary that let everyone stop arguing about it.

 
