Where does time live in your data?

Part 1 of 2 - From theory to system-temporal-merge

In my previous article, I explored time series data and introduced bi-temporal modeling - tracking not just what changed, but when we learned about it. It is a useful concept, but only as useful as the implementation behind it. If we’re going to claim that data matters at Opto, we need to show how it actually works. Otherwise, it’s just nice words on a blog. So this article is about the implementation.

Full transparency: I wrote the first draft over a month ago and kept delaying it. Not because of writer’s block, but because something felt wrong in the code. I couldn’t find clean language for what we were actually trying to achieve, which usually means the thinking isn’t done. So I did what I had to do: I ran a backfill. The first version of the pipeline blew up in my face - millions of duplicated records. I had missed the flaw entirely.

The reality of what we ingest

At Opto, data doesn’t arrive in a friendly format. We deal with two fundamentally different delivery modes. Some sources send us full snapshots - weekly, sometimes on irregular schedules - where each file contains the entire dataset, not just what changed. The source gives us no indication of what actually changed between deliveries: no change logs, no per-record timestamps, no guarantees on update frequency. Records can change at any time, and we only find out the next time a file lands. Other sources send incremental files - monthly, sometimes on irregular schedules - deltas containing only new or changed records.

The problem is simple to state and surprisingly annoying to solve: how do you identify what actually changed without reprocessing everything every time? The default answer is: just append everything. Which works, until it doesn’t. Append-only pipelines are great if you don’t care about downstream users. Every query becomes a deduplication problem, every consumer must understand the ingestion logic, and you completely lose visibility into what changed and when. Append-only is simple for ingestion, but expensive for understanding (the sketch at the end of this section makes the tax concrete).

My colleague Noel and I decided not to shy away from the complexity. We were going to nail this. It took three attempts.
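To make the append-only tax concrete, here is a sketch of the latest-per-key reconstruction every consumer of a raw append table ends up writing. The table name filings_append is illustrative, and DISTINCT ON assumes Postgres; any window-function equivalent works elsewhere. The columns match the Step 1 example later in this article.

-- The hidden tax of append-only: before doing anything useful,
-- every consumer must first reduce the pile to latest-per-key.
SELECT DISTINCT ON (key_hash) *
FROM filings_append
ORDER BY key_hash, system_valid_from DESC;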

[Delivery : Load : Query]

We believe that the problem has three dimensions.

Dimension 1 - Delivery mode is a property of the source, not something you control. Either the source sends everything every time (Full), or it sends only what changed (Incremental). You can’t change this; you can only design around it.

Dimension 2 - Load mode is how you persist data at the destination. This is where most of the complexity lives, and where our three attempts played out.

Dimension 3 - Query mode is how consumers read the data - whether they want the current state, a stream of changes, a point-in-time snapshot, or the full history.

Each pipeline configuration is a three-part tuple: [Delivery : Load : Query]. The insight is that these three dimensions are independent of each other. A single dataset loaded one way can serve multiple query patterns simultaneously - no re-ingestion required.
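To see that independence in practice, here is a minimal sketch of the four query modes, all served by the single merge table built in Step 3 below. The table name filings_merge and the parameters :as_of and :checkpoint are illustrative.

-- Current state: only the open versions.
SELECT * FROM filings_merge WHERE system_valid_to IS NULL;

-- Point-in-time snapshot: what did we know at :as_of?
SELECT * FROM filings_merge
WHERE system_valid_from <= :as_of
  AND (system_valid_to IS NULL OR system_valid_to > :as_of);

-- Change stream: everything learned since a consumer's checkpoint.
SELECT * FROM filings_merge WHERE system_valid_from > :checkpoint;

-- Full history: the table itself is the history.
SELECT * FROM filings_merge;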

Our three attempts at load mode

Attempt 1 - Upsert. The instinct. On each delivery, upsert records by key. Simple, familiar, fast to implement. It broke for two reasons: the code was slow, and it expected end users to manage their own history before inserting. The burden was on the wrong side.

Attempt 2 - Incremental Merge. Noel suggested being smarter about how to compare records and leveraging database operations where possible. But we got ahead of ourselves. We assumed every data source could be tied to a business time, so we forced a bi-temporal structure onto data that only had a system timestamp. We were tracking two dimensions of time on data that only had one. It seemed fine - until we ran a backfill and found millions of duplicates staring back at us.

Attempt 3 - System-Temporal-Merge. This is what we shipped. After working through how delivery modes shape what data is available, we narrowed our focus: start with sources that have no business time. Only the system’s timeline matters - when we ingested it. That’s exactly what the name reflects.

The principle: every record gets two sets of components - one that identifies it (the key), one that captures its content (the value). When a new record arrives, we compare hashes of those components. Same content as before? Skip it; nothing changed. Content is different? Close the current row with a timestamp and open a new version. Every row carries the exact moment our system first observed it.

This is the entry point to our pipeline, and it aligns with a core philosophy: as long as you don’t know what you don’t know, store the data as received. Don’t impose structure you haven’t earned yet.
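In code, the key/value split is small. A minimal sketch, assuming Postgres: each component is hashed over a delimited concatenation of its columns. The exact split is illustrative - in the SEC example below, the identifying columns would be concept, period_end, filed and form, and the content column val_millions.

-- Deriving the two components at staging time (sketch; md5 and
-- concat_ws are one choice of many).
SELECT
    md5(concat_ws('|', concept, period_end, filed, form)) AS key_hash,
    md5(val_millions::text)                               AS val_hash,
    s.*
FROM staged_batch s;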

System-Temporal-Merge answers one question cleanly: what did our system know, and when? For most datasets, that’s enough. But some data carries its own sense of time - when something actually happened in the real world, not just when we learned about it. That’s where Bi-Temporal-Merge comes in, and it’s the subject of Part 2.

Let’s make this concrete. Below is a real example using SEC filings — two extractions of the same dataset, two weeks apart.

Step 1 — Append Table

All records as received: 13 rows from 2 extractions.

Every row that arrives gets stamped with system_valid_from — the moment our system first saw it. No deduplication yet. This is the raw truth of what we received and when.

-- 13 rows: 6 from batch #1 (2026-02-18) + 7 from batch #2 (2026-03-04)
 key_hash  | val_hash | system_valid_from   | concept | period_end | filed      | form | val_millions
-----------+----------+---------------------+---------+------------+------------+------+-------------
 b4e967f6  | ee2a7ad1 | 2026-02-18 04:00:00 | Assets  | 2022-12-31 | 2023-05-11 | 10-Q |   11036.362
 1febf3fb  | d594c1c7 | 2026-02-18 04:00:00 | Assets  | 2023-03-31 | 2023-05-11 | 10-Q |    12140.49
 441c1cf0  | 0e7ca21a | 2026-02-18 04:00:00 | Assets  | 2022-12-31 | 2023-08-11 | 10-Q |   11036.362
 d22146b6  | 839b6274 | 2026-02-18 04:00:00 | Assets  | 2022-12-31 | 2024-03-07 | 10-K |   11036.362
 d5ec1f19  | 236ff5d0 | 2026-02-18 04:00:00 | Assets  | 2024-03-31 | 2024-05-09 | 10-Q |   19833.938
 006ea3c7  | 2d2dd76e | 2026-02-18 04:00:00 | Assets  | 2024-12-31 | 2025-03-04 | 10-K |   28063.881
 b4e967f6  | ee2a7ad1 | 2026-03-04 09:00:00 | Assets  | 2022-12-31 | 2023-05-11 | 10-Q |   11036.362
 1febf3fb  | d594c1c7 | 2026-03-04 09:00:00 | Assets  | 2023-03-31 | 2023-05-11 | 10-Q |    12140.49
 441c1cf0  | 0e7ca21a | 2026-03-04 09:00:00 | Assets  | 2022-12-31 | 2023-08-11 | 10-Q |   11036.362
 d22146b6  | 839b6274 | 2026-03-04 09:00:00 | Assets  | 2022-12-31 | 2024-03-07 | 10-K |   11036.362
 d5ec1f19  | 236ff5d0 | 2026-03-04 09:00:00 | Assets  | 2024-03-31 | 2024-05-09 | 10-Q |   19833.938
 006ea3c7  | 2d2dd76e | 2026-03-04 09:00:00 | Assets  | 2024-12-31 | 2025-03-04 | 10-K |   28063.881
 c1380b27  | e81b4017 | 2026-03-04 09:00:00 | Assets  | 2025-12-31 | 2026-03-03 | 10-K |   37010.766

Notice that the first 6 rows from batch #1 reappear in batch #2 with identical key_hash and val_hash — the source gave us the same data again. Only the last row (c1380b27) is truly new.
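The append step itself needs no intelligence. A sketch, assuming the incoming file has been landed in a staging table (filings_append and staged_batch are illustrative names):

-- Stamp and append, no dedup yet. now() is stable within the
-- transaction, so every row in a batch shares one system_valid_from.
INSERT INTO filings_append (key_hash, val_hash, system_valid_from,
                            concept, period_end, filed, form, val_millions)
SELECT key_hash, val_hash, now(),
       concept, period_end, filed, form, val_millions
FROM staged_batch;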

Step 2 — Merge Decision Log

What happens to each row in batch #2?

The merge logic compares each incoming row’s key_hash against what already exists. Same key, same value? Skip. Same key, different value? Update. New key entirely? Insert.

-- Merge decisions for batch #2 (7 incoming rows)
 key_hash  | concept | period_end | filed      | form | val_millions | merge_decision
-----------+---------+------------+------------+------+--------------+--------------------
 b4e967f6  | Assets  | 2022-12-31 | 2023-05-11 | 10-Q |   11036.362  |  SKIP (duplicate)
 1febf3fb  | Assets  | 2023-03-31 | 2023-05-11 | 10-Q |    12140.49  |  SKIP (duplicate)
 441c1cf0  | Assets  | 2022-12-31 | 2023-08-11 | 10-Q |   11036.362  |  SKIP (duplicate)
 d22146b6  | Assets  | 2022-12-31 | 2024-03-07 | 10-K |   11036.362  |  SKIP (duplicate)
 d5ec1f19  | Assets  | 2024-03-31 | 2024-05-09 | 10-Q |   19833.938  |  SKIP (duplicate)
 006ea3c7  | Assets  | 2024-12-31 | 2025-03-04 | 10-K |   28063.881  |  SKIP (duplicate)
 c1380b27  | Assets  | 2025-12-31 | 2026-03-03 | 10-K |   37010.766  |  INSERT (new key)

Six duplicates skipped, one new record inserted. This is the power of content-addressable hashing: we never store what we already know.
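Expressed as SQL, the decision log above collapses into one statement. A sketch assuming Postgres 15+ MERGE and the same illustrative table names; one caveat is that standard MERGE can close a changed key’s current row but cannot also insert its replacement in the same pass - that follow-up is sketched under Step 3.

-- One pass over the incoming batch: skip, close, or insert.
MERGE INTO filings_merge m
USING staged_batch s
   ON m.key_hash = s.key_hash AND m.system_valid_to IS NULL
WHEN MATCHED AND m.val_hash = s.val_hash THEN
    DO NOTHING                           -- SKIP: same key, same content
WHEN MATCHED THEN
    UPDATE SET system_valid_to = now()   -- UPDATE: close the current version
WHEN NOT MATCHED THEN                    -- INSERT: never-seen key
    INSERT (key_hash, val_hash, system_valid_from, system_valid_to,
            concept, period_end, filed, form, val_millions)
    VALUES (s.key_hash, s.val_hash, now(), NULL,
            s.concept, s.period_end, s.filed, s.form, s.val_millions);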

Step 3 — Merge Table

After processing 2 full extractions: 7 rows, not 13.

This is the state of truth. Each row has a system_valid_from (when we first learned it) and a system_valid_to (when it was superseded, or NULL if still current).

-- The deduplicated, temporally-aware merge table
 key_hash  | val_hash | system_valid_from   | system_valid_to | concept | period_end | filed      | form | val_millions
-----------+----------+---------------------+-----------------+---------+------------+------------+------+-------------
 b4e967f6  | ee2a7ad1 | 2026-02-18 04:00:00 | NULL            | Assets  | 2022-12-31 | 2023-05-11 | 10-Q |   11036.362
 1febf3fb  | d594c1c7 | 2026-02-18 04:00:00 | NULL            | Assets  | 2023-03-31 | 2023-05-11 | 10-Q |    12140.49
 441c1cf0  | 0e7ca21a | 2026-02-18 04:00:00 | NULL            | Assets  | 2022-12-31 | 2023-08-11 | 10-Q |   11036.362
 d22146b6  | 839b6274 | 2026-02-18 04:00:00 | NULL            | Assets  | 2022-12-31 | 2024-03-07 | 10-K |   11036.362
 d5ec1f19  | 236ff5d0 | 2026-02-18 04:00:00 | NULL            | Assets  | 2024-03-31 | 2024-05-09 | 10-Q |   19833.938
 006ea3c7  | 2d2dd76e | 2026-02-18 04:00:00 | NULL            | Assets  | 2024-12-31 | 2025-03-04 | 10-K |   28063.881
 c1380b27  | e81b4017 | 2026-03-04 09:00:00 | NULL            | Assets  | 2025-12-31 | 2026-03-03 | 10-K |   37010.766

7 rows. Clean. Each one represents a distinct fact we know, tagged with exactly when we learned it.
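Nothing was superseded in this example, which is why every system_valid_to is still NULL. When a value does change, the MERGE sketched under Step 2 closes the old row, and a follow-up statement opens the new version - again a sketch, not our exact code:

-- Open a new version for every key whose content changed in this batch.
INSERT INTO filings_merge (key_hash, val_hash, system_valid_from, system_valid_to,
                           concept, period_end, filed, form, val_millions)
SELECT s.key_hash, s.val_hash, now(), NULL,
       s.concept, s.period_end, s.filed, s.form, s.val_millions
FROM staged_batch s
WHERE NOT EXISTS (
    -- after the MERGE, only changed keys are left without an open row
    SELECT 1 FROM filings_merge m
    WHERE m.key_hash = s.key_hash
      AND m.system_valid_to IS NULL
);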

Step 4 — Incremental Query

Consumer checkpoint = 2026-02-18 04:00:00 → “What is new since my last read?”

Downstream consumers don’t re-scan the entire table. They store a checkpoint and ask: give me only what changed since I last looked.

-- Incremental read: only rows with system_valid_from > checkpoint
 key_hash  | system_valid_from   | concept | period_end | filed      | form | val_millions
-----------+---------------------+---------+------------+------------+------+-------------
 c1380b27  | 2026-03-04 09:00:00 | Assets  | 2025-12-31 | 2026-03-03 | 10-K |   37010.766

One new filing. That’s it. The consumer doesn’t see 13 rows — they see exactly what changed since their last read. No deduplication required.
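The consumer side is just as small. A sketch, with :checkpoint standing in for wherever the consumer persists its high-water mark:

-- Read only what arrived since the last run, then advance the
-- checkpoint to the max system_valid_from returned.
SELECT key_hash, system_valid_from, concept, period_end,
       filed, form, val_millions
FROM filings_merge
WHERE system_valid_from > :checkpoint
ORDER BY system_valid_from;

Because system_valid_from is written once and never mutated, the checkpoint only moves forward, and re-running a read with the same checkpoint returns the same rows.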

Circling back to my previous article, we can now answer the first question — when did we know it? Part 2 answers the harder one: when did it actually happen?


For disclaimers, visit https://www.optoinvest.com/disclaimers.