Lambda Architecture
Lambda architecture combines batch and stream-based data processing to provide a low-latency and accurate view of time-based operational data.
Origin of Lambda Architecture
In 2011, Nathan Marz introduced the idea of the Lambda Architecture on his blog as a way to reduce the latencies inherent in MapReduce. The original concept was broadened to Big Data in Nathan Marz and James Warren’s book, Big Data: Principles and Best Practices of Scalable Realtime Data Systems, published in 2013. In 2014, Jay Kreps published the article Questioning the Lambda Architecture on O’Reilly Radar, pointing out downsides he had uncovered at LinkedIn and offering alternatives. Kreps proposed the simpler Kappa Architecture, which uses only a streaming approach.
Why is Lambda Architecture Important?
A pure batch-based approach must wait for a batch to complete before the fetched data can be queried. Batch processing has the benefit of using resources efficiently and providing high throughput.
Stream processing is less efficient than batch processing but provides lower latency because records are processed as soon as they are created. Lambda architecture offers the immediacy of stream data with the throughput of batch processing and adds a level of fault tolerance by removing a single point of transmission failure. The source data is unchanged, so it remains authoritative.
The Components of a Lambda Architecture
Below is an overview of the primary components of the Lambda Architecture.
The Source Data
The raw data source does not change. Transformations are applied to a copy of the original to maintain auditability. The batch and streaming paths use the same base dataset.
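The sketch below illustrates this idea with a hypothetical append-only dataset of order events; the record fields and names are assumptions for illustration, and a real system would use an append-only store such as a distributed file system or log rather than an in-memory list.

```python
# Minimal sketch of an immutable master dataset (hypothetical OrderEvent records).
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen=True keeps each record immutable
class OrderEvent:
    event_id: int
    product: str
    quantity: int
    timestamp: float  # seconds since epoch

# The master dataset is only ever appended to, never updated in place.
master_dataset: list[OrderEvent] = []

def append_event(event: OrderEvent) -> None:
    """Append a new record; existing records are never modified."""
    master_dataset.append(event)
```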
The Batch Layer
The batch layer generates views by transforming the complete dataset. These views can be recomputed if errors are found or the transformation code changes.
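As a rough sketch, a batch view could be a function over the full dataset, assuming the hypothetical OrderEvent records from the previous example; because it reads every record up to a cut-off, rerunning it after a bug fix or code change rebuilds the view from scratch.

```python
# Minimal sketch of a batch view recomputed from the complete dataset.
from collections import defaultdict

def compute_batch_view(events, as_of: float) -> dict[str, int]:
    """Total quantity per product for all events up to the batch cut-off."""
    view: dict[str, int] = defaultdict(int)
    for event in events:
        if event.timestamp <= as_of:
            view[event.product] += event.quantity
    return dict(view)
```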
The Speed Layer
The speed layer uses stream processing to fill the time gap while a batch update runs. Once the next batch completes, the corresponding streamed data can be cleared. If the batch path fails, the entire dataset can be streamed instead, increasing data availability.
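A minimal sketch of that behavior, again assuming the hypothetical OrderEvent records, is an incrementally updated view that only covers events newer than the last batch cut-off and is reset once a newer batch catches up.

```python
# Minimal sketch of a real-time (speed) view covering only post-batch events.
class SpeedView:
    def __init__(self, batch_cutoff: float):
        self.batch_cutoff = batch_cutoff
        self.totals: dict[str, int] = {}

    def process(self, event) -> None:
        """Incrementally fold a newly arrived event into the view."""
        if event.timestamp > self.batch_cutoff:
            self.totals[event.product] = (
                self.totals.get(event.product, 0) + event.quantity
            )

    def reset(self, new_cutoff: float) -> None:
        """Clear the view once a newer batch has caught up."""
        self.batch_cutoff = new_cutoff
        self.totals = {}
```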
The Serving Layer
The serving layer is typically a database that supports join queries across the batch and streamed output tables to provide a single coherent view of the transformed dataset. The architecture does not dictate a specific database or file system technology, as long as it can transparently present a unified result set spanning both the streamed and batch-created tables.
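In spirit, the serving layer merges the two views into one answer. The sketch below does this in memory using the hypothetical batch and speed views from the previous examples; in practice, the merge would be a query joining two tables in the serving database.

```python
# Minimal sketch of serving a query over the combined batch and speed views.
def serve_query(batch_view: dict[str, int],
                speed_view: dict[str, int]) -> dict[str, int]:
    """Combine batch and real-time totals into one coherent result set."""
    result = dict(batch_view)
    for product, quantity in speed_view.items():
        result[product] = result.get(product, 0) + quantity
    return result
```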
The Benefits of the Lambda Architecture
The primary benefits of the lambda architecture include:
- The user always sees a complete and transformed dataset.
- The source data is immutable, so result sets can always be recreated from the source.
- New records are visible with minimal delay, unlike with a batch-only approach.
- Intermediate results can be stored, providing an audit trail that makes the transformation code easier to debug.
- Using dual paths to get the raw data creates high availability, as either the batch or streamed path can create a full target dataset if needed.
The Downside of Lambda Architecture
The primary problem with the Lambda Architecture is complexity. Two different mechanisms, each with its own code base, move data from a source to a destination that supports queries. One table is dedicated to batch data and the other to streamed data, so queries must join the two tables and filter out duplicates, which slows them down. Both the batch and streaming code paths must apply the same data transformations.
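The duplicate-filtering burden can be illustrated with a hypothetical case where a streamed record is also picked up by the next batch run; the sketch below assumes rows carrying the event_id field from the earlier examples and keeps each event only once.

```python
# Minimal sketch of the query-side deduplication the two paths force on readers.
def merge_without_duplicates(batch_rows, speed_rows):
    """Keep each event once, preferring the batch copy when both exist."""
    seen: dict[int, object] = {}
    for row in list(batch_rows) + list(speed_rows):
        seen.setdefault(row.event_id, row)  # first (batch) copy wins
    return list(seen.values())
```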
Actian and Lambda Architecture
The Actian Data Platform can provide Lambda architecture capabilities using a combination of streamed and batch-based pipelines from the source dataset. The built-in Vector columnar database includes storage for the streamed and batch-loaded tables, which can be queried using structured query language (SQL) joins to provide a view of the unique records from both tables. The beauty of the Actian approach is that both the batch and streamed data use the same transformation code to ensure consistent results.