Amazon EMR as an Easy-to-Set-Up Hadoop Platform
By Lawrence Stoker, May 3, 2016

Recently I helped a customer evaluate Actian Vector in Hadoop (VectorH) to see if it could maintain “a few seconds” performance as data sizes grew from one to tens of billions of rows (it did, but that’s not the subject of this entry). The customer in question was only moderately invested in Hadoop and was looking for a way to establish the environment without getting deep into the details of Hadoop. This aligns nicely with Actian’s vision for the Hadoop versions of its software: allowing customers to run real applications for business users in Hadoop without getting into the “developer heavy” aspects that Hadoop typically requires.

Actian has released a provisioning tool called the Actian Management Console (AMC) that installs and configures the operating system, Hadoop, and Actian software in click-of-a-button style, which will be ideal for this customer. AMC currently supports the Amazon EC2 and Rackspace clouds. At the time of the evaluation, however, AMC wasn’t available, so we looked for something else and decided to try Amazon’s EMR (Elastic MapReduce), a fast way to get a Hadoop cluster up and running with minimal effort. This entry looks at what we found and lists the pros and cons.

Pros and Cons of Amazon EMR

EMR is very easy to set up: go to the Amazon console, select the instance types you want and the number of them, push the button, and in a few minutes you have a running Hadoop cluster. This is actually faster than using the AMC to provision a cluster from scratch, since EMR pre-bakes some of the installation steps for you. By default you get Amazon’s Hadoop flavour, which is a form of Apache Hadoop with most of the recommended patches and tweaks applied.
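The console steps above can also be scripted. Below is a minimal sketch, in Python, of assembling the parameters that boto3’s `run_job_flow` call expects for a small EMR cluster; the cluster name, release label, instance types, and node count are illustrative assumptions, not the exact configuration we used.

```python
# Sketch of launching an EMR cluster programmatically, mirroring the
# console steps described above. All values here are illustrative
# placeholders rather than the evaluation's actual settings.
def build_cluster_request(master_type="m4.xlarge",
                          core_type="m4.xlarge",
                          core_count=4):
    """Assemble run_job_flow parameters for a small EMR cluster."""
    return {
        "Name": "vectorh-evaluation",           # hypothetical cluster name
        "ReleaseLabel": "emr-4.6.0",            # an EMR release current in 2016
        "Instances": {
            "MasterInstanceType": master_type,  # EMR's "master" node
            "SlaveInstanceType": core_type,     # EMR's "core" nodes
            "InstanceCount": 1 + core_count,    # master plus core nodes
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# To actually launch (requires AWS credentials):
# import boto3
# boto3.client("emr").run_job_flow(**build_cluster_request())
```

The actual API call is left commented out so the sketch can be read without an AWS account to hand.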
It is possible, however, to specify Hortonworks, Cloudera, or MapR for use with EMR, along with add-ons such as Apache Spark and Apache Zeppelin. In this case we used the default Apache Hadoop distribution. Hadoop purists may get a “wrinkled brow” from some of the terms used: for example, the DataNodes are referred to as “core” nodes in EMR, and the NameNode is referred to as the “master” node (those new to both EMR and VectorH should be aware that the term “master” is used by both, for different things).

The configuration of certain items is adjusted automatically depending on the size of the cluster. For example, the HDFS replication factor is set to one for clusters of fewer than four nodes, to two for clusters of four to nine nodes, and to the usual three for clusters of ten or more. This and other properties can be set explicitly in the start-up sequence via “bootstrap” options.

That brings us nicely to the main thing to be aware of when using EMR: everything (including Hadoop) is transient. When you restart the cluster, everything gets deleted and rebuilt from scratch. If you want configuration to persist, you need to express it in the bootstrap options. Similarly, if you need data to persist across cluster reboots, you need to stage it externally in something like S3. EMR can read and write S3 directly, so you would not need to copy raw data from S3 into HDFS just to load it into VectorH: you could load directly from S3 using something like Actian’s new Spark loader.

This may seem like a drawback, and for some use cases it is, but it really just reflects the purpose of EMR. EMR was never intended as a long-running home for a persistent data lake, but rather as a transient “spin up, process, then shut down” environment for running MapReduce-style jobs. Amazon would probably say that S3 should be the location for persistent data.
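Because in-place changes are lost when the cluster is rebuilt, settings have to be supplied at launch time. A minimal sketch of the two mechanisms, assuming the boto3 parameter shapes for `run_job_flow`: a configuration classification that pins the HDFS replication factor (overriding EMR’s size-based default described above), and a bootstrap action pointing at a setup script in S3. The replication value and S3 path are illustrative assumptions.

```python
# Sketch of supplying settings at launch so they survive cluster
# rebuilds. The replication value and S3 script path below are
# hypothetical placeholders, not from the original setup.

def hdfs_overrides(replication):
    """EMR configuration classification pinning dfs.replication,
    overriding EMR's automatic size-based choice."""
    return [{
        "Classification": "hdfs-site",
        "Properties": {"dfs.replication": str(replication)},
    }]

def bootstrap_actions(script_s3_path):
    """A bootstrap action runs a script on each node as it starts,
    which is where any remaining custom setup has to live."""
    return [{
        "Name": "custom-setup",
        "ScriptBootstrapAction": {"Path": script_s3_path},
    }]

# These would be passed to run_job_flow alongside the instance
# settings, e.g.:
# boto3.client("emr").run_job_flow(
#     ...,
#     Configurations=hdfs_overrides(3),
#     BootstrapActions=bootstrap_actions("s3://my-bucket/setup.sh"),
# )
```

Anything not captured this way, including data in HDFS, is gone after a rebuild, which is why persistent data belongs in S3.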
In our case we were testing a high-performance database, definitely a “long lived” service with persistent data requirements, so in that sense EMR was perhaps not such a good choice. Although we did have to reload the database a few times after reconfiguring the cluster, we quickly adapted to provisioning from scratch by building a rudimentary provisioning script (helped by the fast loading speed of VectorH and the fact that it doesn’t need indexes), and so EMR actually worked quite well as a testing environment.

The second thing to be aware of with EMR, especially for those used to Hadoop, is that (at least with the default Hadoop we were using) some of the normal facilities aren’t there. What we noticed most was the absence of the usual start/stop controls for the Hadoop daemons. We wanted to restart the DataNodes to enable short-circuit reads and found it tricky; in the end we had to restart the cluster anyway, and so put the setting into the bootstrap options.

Another thing you can’t control with EMR is the use of an AWS Placement Group. Normally in AWS, if you want a high-performance cluster, you would try to ensure that the nodes were “close” in terms of physical location, thereby minimizing network latency. With EMR there didn’t seem to be any way to specify that. It could be that EMR is being clever under the covers and doing this for you anyway, or it could be that Placement Groups become impractical for larger clusters. Either way, our setup didn’t use any specified Placement Groups. Even without them, performance was good: good enough, in fact, that the customer reported the results to be better than their comparable testing on Amazon Redshift!

In summary, this exercise showed that you can use Amazon EMR as a “quick start” Hadoop environment even for long-lived services like Actian VectorH.
Apart from the absence of a few facilities, most of which can be dealt with via bootstrap options, EMR behaves like a regular Hadoop distribution. Due to its transient nature, though, EMR should only be considered for transient use cases (including testing of long-lived services). Customers wanting click-of-a-button provisioning of Hadoop for production on platforms like Amazon EC2 should consider installing with Actian’s Management Console, available now from http://esd.actian.com.

About Lawrence Stoker

Lawrence Stoker has worked for database specialist companies for over 20 years and operated in the Big Data space before it was called Big Data. Lawrence joined the Actian Sales Engineering team in 2013. His focus is on technologies that take alternative approaches to achieving extreme performance on large data sets, which today means the confluence of columnar architecture, in-memory processing, and Hadoop.