Hadoop data lake: is it the end?
What is a Hadoop data lake?
A Hadoop data lake is a managed collection of Hadoop clusters. A data lake is a repository that stores data in its native format with full fidelity. Data is typically unstructured or semi-structured, including JSON objects, flat files, log files, images, IoT event streams and weblogs.
Is it the end of the Hadoop data lake?
During Hadoop’s heyday, more than a decade ago, the idea of a low-cost, highly available and scalable file system was very attractive. Many vendors, including Cloudera, Hortonworks and MapR, offered open-source distributions that drove enterprise adoption. Since those days, the market has consolidated, and Hadoop clusters have languished due to scarce skills, high administration costs and the emergence of better alternatives. Newer solutions from cloud vendors offer a better cost per terabyte and lower administration costs.
What technology can replace a Hadoop data lake?
The popularity of Hadoop has motivated cloud providers to offer a broad array of choices for businesses that want to modernize their big data clusters. The Hadoop Distributed File System (HDFS) and the Spark API for accessing Hadoop data are at the core of Hadoop distributions. Because Spark can read and write data in Amazon S3, moving data to S3 is a logical first step to the cloud for on-premises clusters. S3 is a highly elastic object store that costs less and, depending on the workload, can be faster than an on-premises cluster.
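As a rough illustration of that first step, the sketch below maps existing HDFS locations to S3 locations so that Spark jobs can be repointed without changing their directory layout. It assumes the data is accessed through Hadoop's S3A connector (the `s3a://` scheme); the function name `hdfs_to_s3a`, the bucket `example-lake` and the NameNode address are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse


def hdfs_to_s3a(hdfs_uri: str, bucket: str, prefix: str = "") -> str:
    """Map an hdfs:// URI to an s3a:// URI, keeping the directory layout.

    Hypothetical helper: the bucket and prefix are assumptions, not
    values taken from any real deployment.
    """
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri!r}")
    # Drop the NameNode host:port and keep only the path below the root.
    parts = [p for p in (prefix.strip("/"), parsed.path.strip("/")) if p]
    return f"s3a://{bucket}/" + "/".join(parts)


# Example: repoint a job's input path from the cluster to a bucket.
old_path = "hdfs://namenode:8020/data/weblogs/2024"
new_path = hdfs_to_s3a(old_path, bucket="example-lake", prefix="migrated")
print(new_path)
```

A Spark job that previously read `old_path` could then be given `new_path` instead, provided the cluster has the S3A connector and credentials configured.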
Microsoft has developed HDInsight on Azure to provide a cloud-based implementation of Apache Spark to ease the migration of existing Spark jobs.
The Actian Cloud Data Platform supports the Spark API on-premises, and on multiple clouds, so you can access semi-structured data stored outside the platform’s built-in columnar relational database.
Vector in Hadoop
Vector provides a high-performance database capability directly in Hadoop and uses the underlying HDFS storage structure for data. Vector in Hadoop supports multiple Hadoop distributions, including Amazon Elastic MapReduce (EMR).
Performance is the primary reason for running Vector in Hadoop because multiple cluster nodes can parallelize operations such as SQL queries. Many businesses have evolved their Hadoop big data environments into data lakes for storing semi-structured data sets such as web activity log files and IoT data. Vector can deliver SQL performance on Hadoop that is as much as 100x faster than Apache Impala. The performance benefit is not just for queries. You can also gain the benefit of zero-penalty, real-time data updates. Some traditional Hadoop analytic databases make you sacrifice data consistency for performance. Vector for Hadoop processes real-time data updates without any associated performance penalty, ensuring that an organization’s analytic insight is always current, using the freshest data available.
Newer data lakes start life in the cloud. The Actian Cloud Data Platform provides the perfect complement to cloud-based data lakes by executing data analytics wherever your data lake exists.
The Actian Cloud Data Platform and Vector in Hadoop provide massively parallel processing (MPP) performance. Through its native Spark support, Vector delivers optimized access to Hadoop data file formats, including Parquet and ORC, can perform functions such as SQL joins across different table types, and serves as a faster query execution engine for Spark SQL and SparkR applications.
Key reasons to consider Actian in Hadoop environments
- Vectorized query execution: exploits Single Instruction, Multiple Data (SIMD) capabilities in commodity Intel x86 architecture CPUs, enabling processing of hundreds or thousands of data values using a single instruction.
- MPP architecture: provides exceptional scalability on Hadoop clusters that scale out to thousands of users, hundreds of nodes, and petabytes of data, with built-in data redundancy and system-wide data protection.
- Full ACID compliance: performs data updates with multi-version read consistency, maintaining transaction integrity.
- Zero-penalty real-time data updates: enable in-the-moment computing using patented Positional Delta Trees (PDTs) for incremental small inserts, updates and deletes without impacting query performance.
- CPU cache optimization: uses dedicated CPU cores and caches as execution memory to run queries up to 100x faster than executing from RAM, delivering significantly greater throughput than conventional in-memory approaches.
- CPU optimizations: include hardware-accelerated string-based operations for accelerating selections on strings using wildcard matching, aggregations on string-based values, and joins or sorts using string keys.
- Column-based storage: reduces I/O to relevant columns, provides the opportunity for greater data compression, and enables storage indexes to maximize efficiency.
- Data compression: provides multiple options to maximize compression, typically 4-10x for Hadoop storage.
- Storage indexes: provide automatic min-max indices to enable fast block skipping on reads and eliminate the need for an explicit data partitioning strategy.
- Parallel execution: uses adaptive algorithms to maximize concurrency while enabling load prioritization.
- Spark-powered direct query access: provides direct access to Hadoop data files stored in Parquet, ORC, and other standard formats allowing users to realize significant performance benefits without converting to the Vector file format first.
- User-defined function (UDF) support: extends the database to perform operations that are not available through the built-in, system-defined functions provided by Vector. Vector for Hadoop 6 provides the capability to create scalar UDFs.
- Faster machine learning execution: deploys machine learning (ML) models that run alongside the database, leveraging new extended UDF capabilities. Running ML models alongside the Vector database reduces data movement, which allows for faster data scoring.
- SQL and NoSQL in a single database: combines classic relational columns with columns that contain documents formatted as JSON text in the same table, and parses and imports JSON documents into relational structures. Bridging semi-structured and relational data supports additional use cases in which the underlying data structures change rapidly.
- Extensive SQL support: covers standard ANSI SQL plus advanced analytics, including cubing, grouping, and window functions.
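To make the storage index idea from the list above more concrete, here is a minimal sketch of min-max block skipping over a single column. It is a generic illustration of the technique, not Vector's implementation; the `Block` class, the block size of 100 and the sample data are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    values: List[int]  # the column values stored in this block
    lo: int            # smallest value in the block (the "min" index entry)
    hi: int            # largest value in the block (the "max" index entry)


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks and record min/max per block."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return the values in [low, high] and the number of blocks read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        # Skip any block whose min-max range cannot overlap the predicate.
        if block.hi < low or block.lo > high:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, blocks_read


# Hypothetical column of 1,000 increasing event timestamps.
timestamps = list(range(1000))
blocks = build_blocks(timestamps, block_size=100)
matches, blocks_read = scan_range(blocks, low=250, high=349)
print(f"read {blocks_read} of {len(blocks)} blocks, found {len(matches)} rows")
```

Because the sample data is naturally ordered, the range predicate touches only the blocks whose min-max range overlaps it, which is why this kind of index can stand in for an explicit partitioning scheme on data that arrives roughly in order.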