Defining and Managing High Performance Data Integration Approaches in Support of Big Data Analytics


A report published by market research firm IDC estimated that the total amount of data created or replicated worldwide in 2012 would add up to 2.8 zettabytes (ZB).  A zettabyte is 1,000 exabytes, 1 million petabytes, or 1 billion terabytes.  By 2020, IDC expects the annual data-creation total to reach 40 ZB, which would amount to a 50-fold increase from where things stood at the start of 2010.

It’s the storage and integration of this data that cause problems.  We’ve come up with new ways to store and retrieve data through “big data” approaches such as Hadoop and other emerging databases.  Moreover, we’ve become cleverer about how we move data from place to place.

However, our ability to solve these problems through the opportunistic use of technology has limits.  As more data becomes the responsibility of IT, the more we need to consider the strategic use of data storage and data integration technology.  The days of just winging it and hoping for the best are long over.  The cloud is here, big data is here, and the business is well aware of their value.

The use of big data brings a few issues to the world of data integration, including:

  • An increase in the volume of data moving from transactional systems to a central data store.
  • An increase in the volume of data moving from the central data store to the analytics engine to the end user.
  • An increase in the need to view data analytics using real-time information, including both data in flight and at rest.
  • An increase in the security requirements around the data integration solution, including the need to support existing and emerging regulations around the use of the data.
  • An increase in the complexity of some of the semantic transformation that needs to occur as data moves from place to place.
  • The need to support transactionality, that is, the guarantee that no information is lost when systems fail.
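
The transactionality requirement in that last bullet is worth making concrete.  Below is a minimal sketch of the underlying idea, using a hypothetical file-backed write-ahead log: each record is made durable before it is acknowledged, so a failure between receipt and processing never loses data.  The `DurableQueue` class and the record fields are illustrative assumptions, not any particular product's API.

```python
import json
import os
import tempfile

class DurableQueue:
    """Toy write-ahead log: append-and-fsync before acknowledging."""

    def __init__(self, path):
        self.path = path

    def enqueue(self, record):
        # Flush and fsync so the record survives a crash before we ack.
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # After a failure, unprocessed records are recovered from the log.
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f if line.strip()]

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
q = DurableQueue(log_path)
q.enqueue({"order_id": 1, "amount": 99.5})
q.enqueue({"order_id": 2, "amount": 12.0})
recovered = q.replay()  # simulates a restart after a crash
```

Real data integration products layer acknowledgments, deduplication, and distributed commit on top of this same append-before-ack principle.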

If these are the issues that we need to address, and thus the objectives or requirements for our data integration approaches and solutions, we have much work to do.  Also, we don’t have a lot of time to do it.

There are a few core things that must be accomplished:

First, we need the ability to understand the flow of information in great detail, from the source to the target.  We must understand the structures of the source and target systems, as well as the volume of data that is likely to flow now and into the future.

This is data integration 101: Define the data semantics of the systems that produce information, define the transformations that need to occur, and then deliver that information to the target.  The emerging complexity comes from the intricate structures within the source transactional systems.  Those structures must be simplified before the data is delivered to the target big data systems, and formatted correctly for the purpose of analysis.
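As a concrete sketch of that source-to-target flow, the snippet below flattens one nested transactional record into the flat, analysis-friendly rows a big data target typically expects.  The field names and the `flatten_order` helper are hypothetical, chosen only to illustrate the simplification step.

```python
def flatten_order(order):
    # One nested order becomes one flat row per line item,
    # the shape most analytics engines prefer.
    return [
        {
            "order_id": order["id"],
            "customer": order["customer"]["name"],
            "region": order["customer"]["region"],
            "sku": item["sku"],
            "quantity": item["qty"],
            "revenue": item["qty"] * item["unit_price"],
        }
        for item in order["items"]
    ]

# A nested record as it might arrive from a transactional system.
source_record = {
    "id": 1001,
    "customer": {"name": "Acme Corp", "region": "EMEA"},
    "items": [
        {"sku": "A-1", "qty": 2, "unit_price": 10.0},
        {"sku": "B-7", "qty": 1, "unit_price": 25.0},
    ],
}
target_rows = flatten_order(source_record)
```

The transformation also computes derived values (here, `revenue`) in flight, so the target system receives data already formatted for analysis.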

Second, we need the ability to provide a data integration solution that’s able to support real-time data analytics.  This means the ability to move data into a big data system at a speed that gives analytics services access to information that may be minutes old as well as decades old.

This is no small feat.  Consider the types of operations you need to perform on the data flowing from place to place, such as the semantic transformation we previously described.  You must not only define an approach that provides real-time data analytics, and thus real-time data integration, but also select the proper technology to make it work the first time.
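One common pattern for this kind of real-time integration is micro-batching: events are buffered briefly, transformed in flight, and landed in the analytics store within seconds rather than via a nightly batch.  The sketch below shows the idea; the `transform` step, the sensor data, and the `analytics_store` list are assumptions standing in for a real pipeline.

```python
import time
from collections import deque

analytics_store = []  # stands in for the big data target

def transform(event):
    # The in-flight semantic transformation: normalize units,
    # stamp each record with its load time.
    return {
        "sensor": event["sensor"],
        "celsius": (event["fahrenheit"] - 32) * 5 / 9,
        "loaded_at": time.time(),
    }

def run_micro_batch(buffer, batch_size=100, max_wait_s=1.0):
    # Flush when the batch fills or the wait budget expires,
    # whichever comes first: the usual latency/throughput trade-off.
    deadline = time.time() + max_wait_s
    batch = []
    while buffer and len(batch) < batch_size and time.time() < deadline:
        batch.append(transform(buffer.popleft()))
    analytics_store.extend(batch)
    return len(batch)

incoming = deque({"sensor": f"s{i}", "fahrenheit": 212.0} for i in range(5))
loaded = run_micro_batch(incoming)
```

Tuning `batch_size` and `max_wait_s` is exactly the design decision real-time data integration forces: smaller batches mean fresher analytics but more per-record overhead.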

Finally, you need to evaluate the overall solution set, considering whether the technology and the approaches can keep up with the data growth you’ll experience over the next 3 to 5 years.  As the IDC report outlined, data is the single most valuable asset a company has available.
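A back-of-the-envelope projection helps frame that 3-to-5-year evaluation.  Under the IDC-style assumption that data roughly doubles every two years, a simple compound-growth calculation shows the scale a solution must be sized for; the starting volume here is a made-up input, not a figure from the report.

```python
def projected_volume(current_tb, years, doubling_period_years=2.0):
    # Compound growth: volume doubles once per doubling period.
    return current_tb * 2 ** (years / doubling_period_years)

today_tb = 500.0  # hypothetical terabytes under management now
in_five_years = projected_volume(today_tb, 5)
# Doubling every two years compounds to roughly 5.7x over five years,
# so a pipeline sized for today's volume will be badly undersized.
```

Whatever the exact inputs, the lesson is the same: evaluate integration technology against the projected curve, not the current snapshot.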

However, the growth and management of data will become more cumbersome over time.  Moreover, data integration will become more complex, and the stress will be on the data integration technology to support an ever-increasing demand for speed.

So, does this scare you?  I assert that it should be a core concern of IT as it moves into a world where data and analysis are available on demand in support of the business.  The days of 30-day-old reports, and of information having to walk from place to place within enterprises, are rapidly coming to a close.  Emerging in their place is the strategic use of data to get quickly to business solutions and value.

About David Linthicum

Dave Linthicum is the CTO of Cloud Technology Partners, and an internationally known cloud computing and SOA expert. He is a sought-after consultant, speaker, and blogger. In his career, Dave has formed or enhanced many of the ideas behind modern distributed computing including EAI, B2B Application Integration, and SOA, approaches and technologies in wide use today. For the last 10 years, he has focused on the technology and strategies around cloud computing, including working with several cloud computing startups. His industry experience includes tenure as CTO and CEO of several successful software and cloud computing companies, and upper-level management positions in Fortune 500 companies. In addition, he was an associate professor of computer science for eight years, and continues to lecture at major technical colleges and universities, including University of Virginia and Arizona State University. He keynotes at many leading technology conferences, and has several well-read columns and blogs. Linthicum has authored 10 books, including the ground-breaking "Enterprise Application Integration" and "B2B Application Integration."
