Hadoop Cluster
Hadoop is an open-source software stack designed to provide scalable data management on a loosely coupled collection of commodity servers. The cluster operates in a primary-secondary configuration, with a single Name Node coordinating multiple worker nodes. The base distribution includes a distributed file system and a workload scheduler that together provide highly scalable, parallel data processing. Hadoop is particularly well suited to big data analytics.
Why are Hadoop Clusters Important?
Hadoop is important to businesses because it makes highly scalable, parallel processing of large datasets widely available. Before Hadoop, high-performance clusters and massively parallel processing existed only on proprietary hardware and software, putting them out of reach for smaller businesses.
A business can use the recommended Apache distribution or choose one of the many commercial and cloud distribution options.
Hadoop clusters are elastic: nodes can easily be added or removed as workload demands change.
Software Modules of a Hadoop Cluster
The following are the four core modules of the Hadoop distribution.
Hadoop Common
The common libraries and utilities that support the other Hadoop modules.
MapReduce
MapReduce enables efficient parallel processing of large datasets across multiple cluster nodes. The Map task converts the source data into key/value pairs in an intermediate dataset. The Reduce task then combines the Map output into a smaller, aggregated result set.
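To make the two phases concrete, here is a minimal word-count job written against Hadoop's standard MapReduce Java API. The input and output paths are supplied as arguments and would typically point at HDFS directories:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: emit a (word, 1) key/value pair for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce task: sum the counts for each word emitted by all Map tasks.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g., an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```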
Hadoop Distributed File System (HDFS™)
HDFS stores large data files across the nodes of a cluster. Applications using HDFS benefit from parallel access that uses multiple servers for fast query performance and high data throughput. Data is protected by replicating it across nodes; by default, each block is replicated to three nodes. Data Nodes can rebalance data to maintain an even distribution.
HDFS uses the following five services:
- The Name Node is the primary node that tracks where every data block, including its replicas, is stored. It is the node that clients contact.
- The Secondary Name Node maintains checkpoints of the file system metadata used by the Name Node.
- The Job Tracker receives requests for MapReduce execution and consults the Name Node for the location of the data to be processed.
- Data Nodes store the actual data blocks and serve read and write requests, acting as workers for the Name Node.
- Task Trackers act as workers for the Job Tracker, running the Map and Reduce tasks it assigns.
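A minimal sketch of a client interacting with these services through Hadoop's Java FileSystem API follows; the Name Node address shown is a placeholder, as a real deployment would normally pick it up from core-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical Name Node URI; usually configured in core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/example.txt");

    // Write a small file; HDFS replicates its blocks (three copies by default)
    // across Data Nodes, with placement tracked by the Name Node.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello hdfs");
    }

    // Inspect the file's replication factor and block size.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("replication = " + status.getReplication());
    System.out.println("block size  = " + status.getBlockSize());

    fs.close();
  }
}
```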
Hadoop YARN
YARN (Yet Another Resource Negotiator) manages cluster-wide resources and schedules application jobs across the cluster. Resources are defined as CPU, network, disk, and memory. Each application runs an ApplicationMaster that requests containers from the global ResourceManager; a NodeManager on every worker node launches and monitors the containers assigned to that node.
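As a small illustration, a MapReduce job can declare per-container resource requests through standard configuration properties, which YARN uses when scheduling the job's containers. The values below are illustrative only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnResourceHints {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Per-container resource requests; the ApplicationMaster forwards these to
    // the ResourceManager, and NodeManagers enforce them on each worker node.
    conf.setInt("mapreduce.map.memory.mb", 2048);    // 2 GB per map container
    conf.setInt("mapreduce.reduce.memory.mb", 4096); // 4 GB per reduce container
    conf.setInt("mapreduce.map.cpu.vcores", 1);      // 1 virtual core per map task
    conf.setInt("mapreduce.reduce.cpu.vcores", 2);   // 2 virtual cores per reduce task

    Job job = Job.getInstance(conf, "resource-hinted job");
    // ... configure mapper, reducer, and I/O paths as in the word-count example
  }
}
```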
The Evolution of Hadoop
In 2002, Doug Cutting and Mike Cafarella began work on the Apache Nutch project. In 2004, they implemented ideas from the Google white papers describing the Google File System and MapReduce within Nutch. In 2007, Yahoo began using Hadoop on a 1,000-node cluster. By 2009, Hadoop had been used to sort a one-petabyte dataset. In 2011, the Apache Software Foundation released Apache Hadoop version 1.0.
Hadoop Distributions
The base version of Hadoop is maintained within an Apache open-source project. Software providers distribute extended versions that they maintain and support. Cloudera, Hortonworks (now part of Cloudera), Amazon EMR, and Microsoft Azure HDInsight are examples of Hadoop distributions.
Apache Spark™
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning tasks on single-node machines or clusters.
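As a brief illustration of Spark alongside Hadoop, the sketch below uses Spark's Java DataFrame API to read a hypothetical CSV file from HDFS and aggregate it; changing the master from "local[*]" to "yarn" would run the same job on a Hadoop cluster:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkExample {
  public static void main(String[] args) {
    // Runs on a single machine; use .master("yarn") to run on a Hadoop cluster.
    SparkSession spark = SparkSession.builder()
        .appName("spark-example")
        .master("local[*]")
        .getOrCreate();

    // Read a (hypothetical) CSV file from HDFS into a DataFrame.
    Dataset<Row> df = spark.read()
        .option("header", "true")
        .csv("hdfs:///data/sales.csv");

    // Count rows per region and print the result.
    df.groupBy("region").count().show();

    spark.stop();
  }
}
```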
Actian Data Platform and Hadoop
Thanks to its massively parallel processing (MPP) architecture, the Actian Data Platform scales to thousands of nodes and provides direct access to Hadoop data formats through its Spark connector. The Actian Data Platform can store data in the Hadoop Distributed File System (HDFS) using its proprietary data format to protect data. Queries can be parallelized within a server node and across nodes, using YARN to schedule and coordinate worker tasks.