LuvAI Journal | May 09, 2026 | 5 min read

Notes on data you don't own

The stock tool runs entirely on data I don't control. Notes on the quiet, defensive craft of building on borrowed facts.

By Sakuya | An independent studio in Taipei, Taiwan

The stock tool depends, completely and permanently, on data I do not own and cannot control. Every price, every institutional-flow number, every financial statement comes from somewhere else — the Taiwan Stock Exchange, the over-the-counter exchange, a securities depository, a handful of third-party APIs. My software is a translation layer on top of other people's facts. Which means that on any given morning, the most likely reason something is broken is that a source I don't control changed something without telling me.

I want to write down what building on borrowed data is actually like, because it's a category of indie work that nobody warns you about, and it's quietly one of the hardest parts.

Sources break, and they break silently.

The failure mode I've learned to fear most isn't the loud one — a source going down, throwing an error, the page showing a red box. That's the *easy* case. An error is honest; you see it, you fix it. The dangerous case is the silent one: a source that returns a perfectly valid, perfectly formatted response that happens to be empty, or stale, or subtly wrong. The job ran. No error was thrown. The dashboard is green. And the data quietly stopped updating four days ago, and nobody notices until a user emails to ask why a number looks off.

Early on, one of my upstream sources started returning a redirect from one network path but not another — visible only when the request came from a particular kind of server. The ingestion job didn't crash. It got an empty list, wrote zero rows, logged a tidy success, and moved on. The data went stale for days behind a green checkmark. The lesson stuck: when you depend on data you don't own, "the job succeeded" and "the data is correct" are two completely different claims, and you have to verify the second one separately — because the first one will lie to you cheerfully.
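One way to keep "the job succeeded" from standing in for "the data is correct" is to make an empty batch an error at the point of ingestion. What follows is only a minimal sketch under assumptions: `ingest`, `EmptyPayloadError`, and the `min_rows` floor are hypothetical names for illustration, not the tool's actual code.

```python
class EmptyPayloadError(RuntimeError):
    """The source answered without an error but delivered too few rows."""


def ingest(rows: list[dict], min_rows: int = 1) -> int:
    """Accept a batch only if it contains at least `min_rows` rows.

    `rows` stands in for whatever the upstream call returned.
    `min_rows` is a hypothetical per-source floor chosen by the operator.
    Returns the number of rows accepted.
    """
    if len(rows) < min_rows:
        # A valid-looking but empty response is treated as a failure,
        # so the job cannot log a success after writing nothing.
        raise EmptyPayloadError(
            f"expected at least {min_rows} rows, got {len(rows)}"
        )
    # ... the real database write would go here ...
    return len(rows)
```

With a guard like this, the redirect that produced an empty list would have surfaced as a failed job on the first day rather than a green checkmark.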

So you build a second system whose only job is suspicion.

The fix wasn't to make the ingestion more reliable. I can't; it isn't mine. The fix was to build a small, separate watchdog whose entire purpose is to distrust the first system. It doesn't ask "did the job run?" It asks "is the freshest row actually as fresh as it should be, given what day it is?" If the latest trading day in the database is older than it ought to be, or a pipeline that should have written a thousand rows wrote a hundred, it says so — loudly, by email, before a user has to. It assumes success is a lie until the data proves otherwise.
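A watchdog of this kind can be quite small. The sketch below is an assumption about its shape, not the real implementation: `check_pipeline`, its parameters, and the `min_ratio` threshold are all hypothetical, and the alerting step (the email) is left out.

```python
from datetime import date


def check_pipeline(
    latest_row_day: date,
    expected_day: date,
    rows_written: int,
    expected_rows: int,
    min_ratio: float = 0.5,
) -> list[str]:
    """Return a list of complaints; an empty list means nothing looks wrong.

    This inspects the data itself, never the job's own success flag.
    `min_ratio` is a hypothetical tolerance: a run that writes less than
    this fraction of the usual row count is reported as suspicious.
    """
    problems = []
    if latest_row_day < expected_day:
        problems.append(
            f"stale data: latest row is {latest_row_day}, expected {expected_day}"
        )
    if rows_written < expected_rows * min_ratio:
        problems.append(
            f"short write: {rows_written} rows, expected about {expected_rows}"
        )
    return problems
```

Anything returned in the list would be handed to whatever channel is loud enough to be noticed before a user notices first.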

This is, I think, a generalizable principle for anyone building on borrowed data: separate the thing that does the work from the thing that checks the work, and let the checker be paranoid. A system that monitors itself by asking "did I finish?" will always answer yes. You need a second system that asks "is the result any good?" — and it has to be allowed to be rude about it.

You inherit other people's calendars, units, and conventions.

Borrowed data comes with borrowed assumptions, and they bite. One source reports volume in shares; another, for the same market, reports it in lots of a thousand — and if you mix them without noticing, a stock's trading volume is suddenly wrong by three orders of magnitude in a way that looks almost plausible. Holidays are another. The market is closed, so there's no new data, so a naive freshness check screams that everything is broken every weekend and every national holiday, until you teach it the trading calendar. Each source has its own rhythm, its own quiet conventions, its own definition of words you assumed were universal. You don't learn these from documentation. You learn them by getting them wrong in production.
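Both traps can be handled at the boundary, before the data reaches anything else. The sketch below is illustrative only: `to_shares` and `last_trading_day` are hypothetical helpers, the unit labels are assumed, and the holiday set is supplied by the caller rather than taken from any real exchange calendar. It also ignores time of day, so it does not ask whether today's data should have been published yet.

```python
from datetime import date, timedelta

SHARES_PER_LOT = 1000  # one lot is a thousand shares, as described above


def to_shares(volume: int, unit: str) -> int:
    """Normalize a volume figure to shares, whatever unit the source used."""
    if unit == "shares":
        return volume
    if unit == "lots":
        return volume * SHARES_PER_LOT
    # Refuse to guess: an unknown unit is safer as an error than as a number.
    raise ValueError(f"unknown volume unit: {unit!r}")


def last_trading_day(today: date, holidays: set[date]) -> date:
    """Most recent day on or before `today` that is neither a weekend
    day nor in `holidays` (a caller-supplied set of market closures)."""
    day = today
    while day.weekday() >= 5 or day in holidays:  # 5 = Saturday, 6 = Sunday
        day -= timedelta(days=1)
    return day
```

The result of `last_trading_day` can serve as the expected day in a freshness check, so weekends and holidays stop raising false alarms.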

The responsibility runs one direction only.

Here's the part I think matters most, and it isn't technical. When you build on borrowed data, the user trusts *you*, not your sources. If the exchange publishes a bad number and I display it, the reader doesn't blame the exchange — they've never heard of the API. They blame the tool. The trust flows to the surface they touch, which is mine, and so does the blame, even for failures that originate three layers upstream and entirely out of my hands.

That's not unfair, exactly. It's the deal. If I'm going to stand between a person and a pile of public data and claim to make it legible, then I've accepted responsibility for the legibility being *correct* — including on the days my suppliers let me down. Which is why so much of the unglamorous work on this product is defensive: sanity bounds on every number, fallback sources, the paranoid watchdog, friendly empty states for the days when there's genuinely nothing to show. None of it is visible when it's working. All of it is the job.
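Two of those defenses, sanity bounds and fallback sources, fit together naturally. The following is a rough sketch under stated assumptions: `within_bounds`, `first_plausible`, and the `max_move` limit are hypothetical, and each source is modeled as a plain callable rather than any real API client.

```python
from typing import Callable, Optional


def within_bounds(price: float, previous_close: float, max_move: float = 0.10) -> bool:
    """True if `price` is positive and within `max_move` of the previous close.

    `max_move` is a hypothetical limit; the right value depends on the market.
    """
    if price <= 0 or previous_close <= 0:
        return False
    return abs(price - previous_close) / previous_close <= max_move


def first_plausible(
    sources: list[Callable[[], Optional[float]]],
    is_plausible: Callable[[float], bool],
) -> Optional[float]:
    """Try each source in order and return the first value that passes the check.

    Returns None when every source fails or looks wrong, which the caller
    can render as an empty state instead of a suspicious number.
    """
    for fetch in sources:
        try:
            value = fetch()
        except Exception:
            continue  # a loud failure: move on to the next source
        if value is not None and is_plausible(value):
            return value
    return None
```

Returning `None` rather than the least-bad value is a deliberate choice in this sketch: showing nothing is easier to forgive than showing something wrong.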

Building on data you don't own is a little like cooking with ingredients delivered by a supplier you can't phone. Most days the box arrives and it's fine. The craft is entirely in what you do on the days it doesn't — and in noticing, before your guests do, that today is one of those days.