If you’ve been working with Hadoop then you’ve likely come across the concept of Short Circuit Reads (SCRs) and how they can aid performance. These days they are mostly enabled by default (although not in “vanilla” Apache or close derivatives like Amazon EMR).
Actian VectorH brings high performance SQL, ACID compliance, and enterprise security into Hadoop and I wanted to test how important, or otherwise, SCRs were to the observed performance.
The first challenge I had was figuring out exactly what is in place and what is deprecated. This is quite common when working with Hadoop – sometimes the usually very helpful “google search” throws up lots of conflicting information and not all of it tagged with a handy date to assess the relevance of the material. In the case of SCRs the initial confusion is mainly down to there being two ways of implementing; the older way that grants direct access to the HDFS blocks – which achieves the performance gains but opens a security hole, and the newer method which uses a memory socket and thereby keeps the DataNode in control.
Note that for this entry I’m excluding MapR from the discussion. As far as I’m aware MapR implements its equivalent of SCRs automatically and does not require configuration (please let me know if that’s not the case).
With the newer way, the only things needed to get SCRs working is a valid native library, the property dfs.client.read.shortcircuit set to true, and the property dfs.domain.socket.path set to something that both the client and the DataNode can access. *Note there are other settings that effect the performance of SCRs, but this entry doesn’t examine those.
On my test Hortonworks cluster this is what I get as default:
# hadoop checknative -a
16/03/09 12:21:41 INFO bzip2.Bzip2Factory: Successfully loaded & initialized …
16/03/09 12:21:41 INFO zlib.ZlibFactory: Successfully loaded & initialized …
hadoop: true /usr/hdp/188.8.131.52-2950/hadoop/lib/native/libhadoop.so.1.0.0
zlib: true /lib64/libz.so.1
snappy: true /usr/hdp/184.108.40.206-2950/hadoop/lib/native/libsnappy.so.1
lz4: true revision:99
bzip2: true /lib64/libbz2.so.1
openssl: false Cannot load libcrypto.so (libcrypto.so: cannot open shared …
# hdfs getconf -confkey dfs.client.read.shortcircuit
# hdfs getconf -confkey dfs.domain.socket.path
For the testing itself I used one of our standard demo clusters. This has 5 nodes (1 NameNode and 4 DataNodes) running, in this case, HDP 2.2 on top of RHEL 6.5. The DataNodes are HP DL380 G7s with 2 x 6 cores @2.8Ghz, 288GB RAM, 1x 10GbE network card, and 16x300GB SFF SAS drives (so a reasonable spec as of 2012 but a long way from today’s “state of art”).
The data for the test is a star schema with around 25 dimension tables ranging in size from 100 to 50 million rows, and a single fact table with 8 billion rows.
The queries in the demo join two or more of the dimension tables to the fact table and filter on a date range along with other predicates.
Here I hit my second challenge – most of the queries used in the demo run in fractions of second and so there is not much opportunity to measure the effect of SCRs. For example the query below runs in 0.3 seconds against 8 billion rows (each date targets around 80 million rows):
d1.DateLabel, d2.MeasureLabel, sum(f.MeasureValue)
f.DateId = d1.DateId
and d2.MeasureId = f.MeasureId
and d1.DateLabel in (’05-Feb-2015′)
[/sql] To provide a chance to observe the benefits of SCRs I created a query that aggregated all 8 billion rows against all rows both dimension tables (i.e. removing the predicate “and d1.DateLabel in (…)“. This creates a result set of tens of thousands or rows but that doesn’t skew the result time enough to invalidate the test.
To make sure that all data was being read from the DataNode, Actian VectorH was restarted before the query was run. The Linux file system cache was left as is so as not to disrupt anything that the DataNode might be doing internally with file handles etc.
Armed with this new query I ran my first tests comparing the difference with and without SCR’s and got no difference at all – Huh! Exploring this a little I found that VectorH was using the network very efficiently so that the absence of SCRs was not affecting the read times. So I then simulated some network load using scp and sending data between the different nodes while running the queries. Under these conditions SCRs had an overall positive impact of around 10%.
|No network load, first run.||29.8 seconds||29.8 seconds|
|No network load, second run.||3.9 seconds||3.9 seconds|
|With busy network, first run.||35.1 seconds||38.3 seconds|
|With busy network, second run.||4.6 seconds||5.1 seconds|
In conclusion, given a good networking setup and software that makes good use of it, SCRs may not provide a performance benefit. However, if the network is reasonably busy, then SCR’s will likely help. The difference measured here was around 10%.