Building a production-grade streaming data pipeline for financial markets: Lessons from two years of iteration
Introduction
More than two years ago, I published an article on How to build a scalable streaming data pipeline for financial big data. It described a first iteration; a pragmatic foundation that allowed us to ingest, normalize, and persist market data at scale. It worked, but it was just that – a first iteration.
In the time since, we have rebuilt that pipeline from the ground up. Not because the original design was flawed, but because the requirements outgrew it. What started as a pure data ingestion problem has evolved into a full trading platform with three loosely coupled subsystems: data ingestion, complex event processing (CEP) and order execution with risk management. The streaming data pipeline is no longer a standalone artifact – it is the connective tissue of a real-time, latency-sensitive trading platform that must simultaneously serve historical data for backtesting, research and analytics.
This article describes the current architecture and the technical decisions behind it.
High-level architecture
The streaming data pipeline is deployed as a microservice architecture on an Amazon ECS EC2 cluster with placement constraints pinning workloads to two container instances (primary and replica) in two different availability zones. Both container instances are intentionally scaled vertically to keep linked microservices co-located on the same host, which minimizes latency on the hot path.
The deployment is fully reproducible via Terraform and isolated per environment (dev and prod) using separate AWS accounts under an AWS Organizations structure.
At the highest level, the system consists of:
- third-party API providers: the upstream sources of market data (exchanges, data vendors, OTC venues, …)
- schema registry (Buf Schema Registry): the single source of truth for all gRPC message contracts
- API session managers: services that maintain authenticated stateless and stateful sessions to each provider and that subscribe to market data streams
- market data normalizers: services that translate vendor-specific market data into a canonical schema (in our case, we use the Databento Binary Encoding (DBN) format)
- batch processors: services that aggregate vendor-specific market data into batches for cold storage
- in-memory pub/sub layer (Amazon ElastiCache for Redis): the hot path for real-time consumers
- data warehouse and query layer (Amazon S3 and Amazon Athena): the cold path for historical market data retrieval
- NautilusTrader: the platform that performs complex event processing (CEP) and executes/manages orders and trades on behalf of trading algorithms with arbitrarily simple or complex trading rules
Design principles
Before going deeper into the system’s components, it is worth stating the principles that shape the current architecture:
- schema-first, contract-driven communication on the hot path: every protobuf message over gRPC is defined in a centrally versioned
.protofile in the Buf Schema Registry; all payloads published to Redis are encoded in the canonical Databento Binary Encoding (DBN) format - separation of hot and cold paths: real-time consumers (hot path) must never wait on batch operations, and batch operations (cold path) must never block the real-time path
- stateless compute, stateful storage: every compute service is horizontally scalable and disposable; state lives in Redis (hot path) or Amazon S3 (cold path)
- idempotency at every boundary: reconnects, replays and retries are the norm, not the exception
Component deep dive
API session managers
The API session managers are the only stateful components on the ingestion side. Each one maintains authenticated REST sessions and persistent WebSocket (and other streaming) connections to one or multiple third-party API providers. They are deployed as Amazon ECS services with one critical constraint: exactly two instances (primary and replica) per third-party API provider per environment.
This is deliberate. Most third-party API providers enforce aggressive per-account connection and rate limits. Running more than two API session managers per third-party API provider would result in either disconnections, throttling or – in the worst case – API key revocation.
The API session manager’s responsibilities are intentionally narrow:
- maintain authenticated REST sessions and persistent WebSocket and other streaming connections
- manage subscription state
- forward raw payloads to the corresponding market data normalizer over gRPC
- emit connection-health telemetry to Amazon CloudWatch
Why gRPC and not a message broker? At the scale of raw market data (millions of messages per second per third-party API provider in very extreme cases), the overhead of a broker like Kafka or Redis Streams becomes the bottleneck. A direct gRPC stream with HTTP/2 multiplexing gives us sub-millisecond fan-out. Exactly-once delivery is not guaranteed if a consumer fails to acknowledge a message; however, in the context of high-volume, largely stateless market data, this guarantee can be traded off in favor of higher throughput – provided that recovery mechanisms are in place for the subset of stateful market data where it matters.
The protobuf schemas are defined in the Buf Schema Registry, and language-specific packages are generated for consumption in the microservices. CI enforces backward compatibility on every schema change via buf breaking.
Market data normalizers
Each market data normalizer is paired one-to-one with an API session manager and runs as a sidecar container in the same Amazon ECS service. his co-location matters: it eliminates network hops between the raw stream and the normalization stage, and it ensures that if one component fails and self-recovery is not possible, the other restarts with it.
The market data normalizer’s responsibilities are:
- receive raw payloads encoded as protobuf messages sent over gRPC
- parse and validate them against the corresponding protobuf schema
- transform them into the canonical Databento Binary Encoding (DBN) format
- publish the normalized payload to Redis Pub/Sub
Redis Pub/Sub (hot path)
The hot path is built on Amazon ElastiCache for Redis in cluster mode with multi-AZ failover. We use Redis Pub/Sub (not Streams) for the real-time distribution layer and Redis as a key-value store for cached and platform-specific state.
Redis Pub/Sub offers the PSUBSCRIBE command, which allows consumers to subscribe to glob-style patterns instead of fully qualified channel names. Combined with a hierarchical channel naming convention, this enables expressive subscription patterns – for example, subscribing to all data from a specific third-party API provider, or to all data of a specific data type. The channel names are constructed as follows:
<data-type>.<stream-instance>.<market-data-schema>.<api-provider>.<channel>
where:
<data-type>: the data type of the messages in this channel, e.g.MARKET_DATA<stream-instance>: the stream instance (PRIMARYorREPLICA)<market-data-schema>: the Databento Binary Encoding (DBN) schema, e.g.MBO<api-provider>: the third-party API provider identifier, e.g.DERIBIT<channel>: the third-party API provider-specific data channel identifier, e.g.ticker.BTC-PERPETUAL.raw
With PSUBSCRIBE, pattern subscription strings such as
MARKET_DATA.PRIMARY.*.DERIBIT.*
can be expressed. In this example, consumers subscribe to all Deribit market data on primary streams. It also becomes clear that, with a small amount of additional logic, consumer-side failover from primary to replica streams is straightforward to implement.
Batch processors, Amazon S3 and Amazon Athena (cold path)
In parallel to the market data normalizers, raw payloads are forwarded to batch processors running in dedicated Amazon ECS services. The batch processors accumulate raw payloads in memory and write them to Amazon S3 as gzip-compressed CSV files (≈ 90% compression ratio) as soon as a configurable time threshold is reached (time is progressed via payload timestamps, not via the consumer-side wall clock). Since Amazon S3 charges per PUT request, this configurable threshold is the lever for balancing in-memory data loss risk (in case a batch processor crashes) against object storage cost.
In Amazon S3, data for the current day is stored in an intermediate bucket, partitioned as follows:
.../instance=<stream-instance>/provider=<api-provider>/schema=<market-data-schema>/channel=<channel>/year=<year>/month=<month>/day=<day>
After each day, scheduled nightly post-processor Amazon ECS tasks coalesce data from the primary and replica streams. For more details and the underlying concepts, see How to build a scalable streaming data pipeline for financial big data.
Schema registry
The Buf Schema Registry warrants its own section because it is one of the most consequential architectural choices for the long-term maintainability of the system.
Both services on the hot path of the pipeline – API session managers and market data normalizers – depend on a versioned set of .proto files for their message contracts. The Buf Schema Registry:
- provides a central catalog of all schemas across these services
- enforces backward compatibility on every change via CI
- enforces style and linting rules (
buf lint) - generates language-specific code (JavaScript/TypeScript, Rust) as part of the build
Without this, the system would be ungovernable. A single careless field rename in a protobuf schema could break many services and service instances in subtle, hard-to-diagnose ways.
NautilusTrader
The complex event processing (CEP) layer is built on NautilusTrader, an open-source algorithmic trading platform written in Rust with Python bindings. NautilusTrader instances are launched as Amazon ECS services to execute trading algorithms based on arbitrarily simple or complex trading rules. To receive and process real-time market data, they subscribe to the Redis Pub/Sub interface described above.
Observability
Because the streaming data pipeline is composed of many components that can fail individually, in-depth observability is critical – even though all components have extensive self-healing capabilities. For this, we rely heavily on Amazon CloudWatch. As a convention, every failure state, raised error and exception is logged to Amazon CloudWatch log streams. These logs are additionally replicated to dedicated Discord channels on a company-internal Discord server, which allows us to receive them as push notifications on mobile devices. This way, we know what is happening in the system at any time and in any location.
Potential future improvements
At present, the data warehouse stores raw market data exactly as received from the third-party API providers. This preserves full fidelity of the upstream payloads and keeps the cold path independent of any changes to the normalization layer, but it also means that downstream consumers of historical data – backtesting engines, research notebooks and analytics pipelines – have to re-implement (or call into) the same normalization logic that already exists on the hot path. This duplication is a source of subtle inconsistencies between live and historical data, and it ties every consumer to the schemas of every vendor.
A natural next step is to move the batch processors behind the market data normalizers (or Redis Pub/Sub) so that the cold path stores data in the canonical Databento Binary Encoding (DBN) format, consistent with what is published on the hot path. This would yield a single, unified data ecosystem in which historical and real-time data are schema-equivalent, simplify all downstream consumers, and eliminate the normalization drift between the two paths. The non-trivial implication is that the entire existing data warehouse would need to be re-processed into the normalized format as part of the migration – effectively a one-time backfill across the full historical dataset – and the post-processing and partitioning logic would need to be adapted accordingly. Given the long-term simplification this brings to every downstream system, the investment is well justified.
Closing thoughts
The first iteration of this pipeline taught us how to build a streaming system that works. This iteration taught us how to build one that is maintainable, observable, correct and self-healing under failure and that fits into the bigger picture of a highly capable trading platform. The biggest lesson is that the hard parts of a streaming data pipeline are not necessarily the throughput numbers or the framework choices – they are schema governance, normalization techniques in highly heterogeneous environments, failure semantics, and the discipline to separate hot and cold paths cleanly while respecting latency constraints and ensuring high availability through redundancy.
If I were starting over today, the things I would insist on from day one are: a centralized schema registry with breaking-change enforcement, idempotent microservices wherever possible and a normalization layer that serves as a clear boundary between external systems and our internal system.
