Data Integration

Don’t Just Move Data, Integrate It!

Traci Curran

June 18, 2020

Moving Data with Data Connect

If you are simply “lifting and shifting” data from one place to another, you are missing out on the power that a data integration platform can bring you. It is time to look beyond extract, transform, load (ETL) from individual source systems and expand your integrations to include multi-source joins that let you see across source systems. Don’t just move data, integrate it!

Data integration is more than moving data from source to target systems. It is part of the greater data value chain that transforms raw source data into information and actionable insights that help drive decisions and operational processes. Like any other value chain, each step in the process moves the data one step closer to consumption by transforming it in ways that add value to the end user. One might argue that moving data into a centralized repository or a downstream database adds value. Yes, it does, and if all you have is essentially a “data forklift,” this may be the best you can do. If you have a true data integration platform like Actian DataConnect, you can do a whole lot more (and you should).

Multi-Source Joins

A data integration platform like Actian DataConnect gives you a powerful set of tools at your fingertips to help you not just move data from one system to another but integrate it along the way. You might be familiar with creating SQL joins – inner, outer, left, and right – within a database, but did you know you can access data from multiple source systems in the same query? The DataConnect Studio IDE was recently re-engineered with regard to how joins are implemented, so you can leverage multiple source connections in your queries.
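
To make the idea concrete, here is a minimal sketch of the multi-source join pattern in Python with pandas. The file names, table, and columns are invented for illustration, and pandas stands in for the integration engine; this is not DataConnect syntax, just the shape of pulling records from two source systems and reconciling them into one output set.

```python
import sqlite3
import pandas as pd

# Source 1: customer records exported from a CRM system (hypothetical CSV).
crm_customers = pd.read_csv("crm_customers.csv")   # columns: customer_id, name, segment

# Source 2: order history living in an operational database (hypothetical SQLite file).
with sqlite3.connect("orders.db") as conn:
    erp_orders = pd.read_sql_query(
        "SELECT customer_id, order_id, order_total FROM orders", conn
    )

# Multi-source join: combine both systems on the shared key and
# aggregate into a unified output set for the target system.
unified = (
    crm_customers.merge(erp_orders, on="customer_id", how="left")
    .groupby(["customer_id", "name", "segment"], as_index=False)
    .agg(order_count=("order_id", "count"), lifetime_value=("order_total", "sum"))
)

unified.to_csv("customer_360.csv", index=False)    # load into the target
```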

With DataConnect Studio, you can build integrations that span multiple data sources, reconciling them together into a unified output set in the target system. Let’s consider where you might want to do this.

Analytics and Reporting

By merging data across source systems earlier in the data value chain, you can normalize your data into a canonical data model that is easier for your analysts and business users to understand.  This means they can spend less time finding data and more time interpreting data to determine its relevance to your business.

eCommerce Systems

Customer-facing systems, whether they be on a website or a mobile app, should provide a consistent and simple interface to users.  Multi-source joins in your data queries enable you to combine data from different systems, so your users get a high-quality experience without having to deal with whatever complexity is taking place behind the scenes.

Customer Support

Any company that has tried to develop a 360-degree view of its customers knows that the data comes from many different source systems.  Actian DataConnect enables you to join data from different customer records and transactional systems to give you the big picture perspective you are looking for.

Operations Monitoring

Many companies are integrating IoT devices, mobile apps, and embedded sensors into their operations processes.  The multi-source join capability enables you to leverage data from different types of monitoring devices and more easily reconstruct the virtual process flows that your operations staff need to monitor.

Data in motion is one of the best times to perform integration.  If you are going to merge data at rest, you either have to copy data into a merged table, or you create views and don’t really integrate the data until later in the data value chain – your options are limited.  When you are moving data, you have the opportunity to transform it – changing data structures, summarizing, categorizing, and aggregating data from different sources.  Each time data moves, you should be seeking ways to make it even more valuable for your organization.

Actian DataConnect can help make managing data easier – not just moving data, but really integrating it. To learn more, visit DataConnect.

About Traci Curran

Traci Curran is Director of Product Marketing at Actian, focusing on the Actian Data Platform. With 20+ years in tech marketing, Traci has led launches at startups and established enterprises like CloudBolt Software. She specializes in communicating how digital transformation and cloud technologies drive competitive advantage. Traci's articles on the Actian blog demonstrate how to leverage the Data Platform for agile innovation. Explore her posts to accelerate your data initiatives.
Data Management

SQLite’s Serverless Architecture Doesn’t Serve IoT Well

Actian Corporation

June 17, 2020

Part Three: SQLite, the “Flat File” of Databases

Over the past few articles, our SQLite blog series has been looking at SQLite Serverless Architecture and how it is unsuitable for IoT environments. Those of you who have been following can jump ahead to the next section, but if you’re new to this discussion, you may want to review the predecessor parts.

  • In part one, Mobile May Be IoT, But IoT Is Not Mobile When It Comes to Data, we examined the fact that though SQLite is the most popular database on the planet—largely due to its ubiquitous deployment on mobile smartphones and tablets, where it supports embedded applications for a single user—it cannot support the multi-connection, multi-user, multi-application requirements of the IoT use cases that are proliferating with viral ferocity in every industry. In a world that calls for the performance of cheetahs and peregrine falcons, SQLite is a banana slug.
  • In part two, Rethinking What Client-Server Means for Edge Data Management, we considered key features and characteristics of the SQLite Serverless Architecture (portability, little-to-no configuration, small footprint, SQL API, and some initially free version to seed adoption) in light of the needs of modern edge data management and discussed the shortcomings of the SQLite architecture in terms of its ability to integrate with critical features found in traditional client-server databases (chiefly those multi-point qualifiers above).

In our final analysis of this serverless architecture, I’d very much like to explore (read: clarify) what will happen if a developer ignores these cautionary points and doubles down on SQLite as a way to handle IoT use cases.

Don’t Mistake Multi-Connection and Multi-Threaded for Client Server

In the late 90s, applications became more sophisticated, generating and ingesting more data and performing more complex operations on that data internally. Consequently, app developers had to develop a lot of workarounds to deal with the limitations of routine, operating system-based file management services. Instead of spending time on all these DIY efforts, application developers were clamoring for a dedicated database they could embed into an application to support their specific data management needs.

At the turn of the 21st century, SQLite appeared and seemed tailor-made to meet these needs. SQLite enabled indexing, querying, and other data management functionality through a series of standard SQL calls that could be inserted into the application code, with the entire database bundled as a set of libraries that became part of the final deployed executable. Keep in mind that the majority of these applications tended to be monolithic, single-purpose, single-user applications designed for the simpler CPU architectures in use at the time. They were not designed to run multiple processes, let alone multiple threads. End-user and data security were not yet the high priorities they are today. And as for performance in a networked environment? Wireless networks were reactive and spotty at best. Multiple, external, high-bandwidth data connections were uncommon.

So it’s really no surprise that SQLite wasn’t able to service simultaneous read and write requests for a single connection (let alone for multiple connections) when it was designed. Designers were thrilled to have an embeddable database that would allow multiple processes to have sequential read and write access to a data table within an application. They were not looking for enterprise-grade client-server capabilities. They were not designing stand-alone database systems that would support multiple applications simultaneously. They simply needed more than flat-file access mediated by an operating system.

And therein lies the heart of the issue with SQLite. It was never intended to handle multiple external applications or their connections asynchronously, as would a traditional client-server database. Modern networked applications commonly have multiple processes and/or multiple threads. When you throw SQLite into a situation with multiple connections and the potential for multiple simultaneous read and write requests, you quickly encounter the possibility of race conditions and data corruption.

To be fair, SQLite has tried to accommodate these evolving demands. The current version of SQLite handles multiple connections through its thread-mode options: single-thread, multi-thread, and serialized. Single-thread is the original SQLite processing mode, handling one transaction at a time, either a read or a write from one and only one connection. Multi-thread will support multiple connections but still one at a time for read or write. Serialized—the default mode for the most current SQLite versions—can support multiple concurrent connections (and, therefore, can support a multi-threaded or multi-process application), but it cannot handle all of them simultaneously. SQLite can handle simultaneous read connections in multi-thread and serialized modes, but it locks the data tables to prevent attempts at simultaneous writes. Nor can SQLite handle the orchestration of writes from several connections.
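
A short Python sketch illustrates that table-level write locking. The file and table names are hypothetical; the point is that once one connection holds the write lock, a second writer is not orchestrated alongside it but simply waits and then fails.

```python
import sqlite3

# Two independent connections to the same database file,
# standing in for two IoT processes writing concurrently.
conn_a = sqlite3.connect("sensors.db", timeout=0.5)
conn_b = sqlite3.connect("sensors.db", timeout=0.5)

conn_a.execute("CREATE TABLE IF NOT EXISTS readings (ts REAL, value REAL)")
conn_a.commit()

# Connection A opens a write transaction and holds the write lock.
conn_a.execute("BEGIN IMMEDIATE")
conn_a.execute("INSERT INTO readings VALUES (1.0, 20.5)")

# Connection B's write cannot proceed while A holds the lock; after the
# timeout expires it fails with "database is locked".
try:
    conn_b.execute("INSERT INTO readings VALUES (1.1, 21.0)")
except sqlite3.OperationalError as exc:
    print("second writer blocked:", exc)

conn_a.commit()
conn_a.close()
conn_b.close()
```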

Compare that to the architecture of a true client-server database that is built to manage simultaneous writes. The client-server database evaluates each write service request and, if attempts are made to write to the same data within a table, it blocks the request until the current operation on that data is completed. If the writes target different parts of the data table, the server allows them to go forward. That’s true orchestration. Locking the entire table and holding off writes (or approximating concurrency by letting sequential writes occur alongside multiple reads with WAL) is not the same thing.

Why is this a showstopper for SQLite in an IoT environment? One of the most basic operations with IoT devices and gateways involves writing data from a variety of devices into your data repository, and the write locks imposed during multi-threaded/multi-connection operations render it non-viable in a production environment. Furthermore, a second basic operation taking place within an IoT environment involves performing data processing and analytics on previously collected datasets. While these may be read-intensive operations that are executed independently (either as separate processes or as separate threads) of the write-intensive operations just described, they still cannot occur concurrently in an SQLite environment and maintain ACID compliance.

As you scale up your deployments, or as system complexity increases—say you want to instrument more and more within an environment, be that an autonomous car or a smart building—you will invariably add more data connection points downstream or within your local environment. Each of these entities will have one or more additional database connections, if not their own database that needs a connection. You could try to establish these connections, but they will need to be handled through add-on application logic that will likely result in response times that are outside the design constraints for your IoT system.

Workarounds Designed to Deny (or Defy) Reality

SQLite partisans will wave their hands with dismissive nonchalance and tell you that SQLite is fast enough (it’s not; we’ve already discussed how slow SQLite is) and that you can build your own functionality to handle simultaneous reads and writes across multiple connections—in effect, manually synchronizing them specific to the use case being handled. One method by which they manage this scenario involves using the serialized mode mentioned above and building functionality to handle synchronization and orchestration within the application threads. This approach tries to avoid the transmission of read and write requests on multiple channels (thereby avoiding race conditions and the potential for data corruption). However, this approach also requires a high degree of skill, the assumption of long-term responsibility for the code, and extensive testing and validation to ensure that operations are transpiring properly.

An alternative approach would be to build the equivalent of a client-server orchestration front-end and use the single-thread option within SQLite, which would preclude race conditions or data corruption. But dropping back to a single-thread option would be like watching this banana slug move in even slower motion. That’s not a viable approach, given the high-speed, parallel write operations needed to accommodate multiple high-resolution data feeds or large-scale sensor grids. Moreover, all you’ve done is to accommodate the weaknesses of the database architecture by forcing the application to do something that the database should be doing. And you’d have to do that over and over, for every app in your IoT portfolio.

There are several sets of code and a couple of small shops that have tried to productize this latter approach, but with limited success. They work only with certain development platforms on a few of the SQLite supported platforms. Even if those platforms are a match for your use case, the performance issues may still increase the risk and difficulty of coding this workaround into your application.

We’ve Seen This Iceberg Before

This cautionary tale isn’t just about the amount of DIY that will be incurred with the unquestioned reliance on SQLite for a given application. Like the IoT itself, it’s much bigger than that. For example, if you commit to handling this in your own code, how will you handle the movement of data from a device to the edge on-premises? How will you handle moving data to or from the cloud? The requirements for interacting with servers on either tier may be different, requiring you to write more code to perform data transformations (remember the blog on SQLite and ETL?). You might try to avoid the ETL bottleneck by using SQLite on both ends, but that would just kick the virtual can down the virtual road. You would still have to write code to handle SQLite masquerading as a server-based database on the gateway and in the cloud.

Ultimately, you can’t escape the need to write more code to make SQLite work in any of these scenarios. And that’s just the tip of this iceberg. You would need to make trade-off comparisons between DIY and partial-DIY plus code modules/libraries for other functionality—from data encryption and public key management to SQL query editing, and more. The list of features that a true client-server infrastructure brings to the table—all lacking in SQLite—goes on and on.

Back in the day, SQLite enabled developers to avoid much of the DIY that flat-file management had required. For the use cases that were emerging back then, it was an ideal solution. For today’s use cases, though, even more DIY would be required to make SQLite work—and even then it would not work all that well. The vast majority of IoT use cases require a level of client-server functionality that SQLite cannot provide without incurring significant costs—in performance, in development time, and in risk. In a nutshell, it’s déjà vu, but now SQLite is the flat file whose deficiencies we must leave in the past.

Oh, and if you think that all this is just an issue for developers, think again. In the next and final blog in this series, we’ll widen the lens a bit and look at what this means for the business and the bottom line.

If you’re ready to reconsider SQLite and learn more about Actian Zen, you can just kick the tires for free with Zen Core, which is royalty-free for development and distribution.

About Actian Corporation

Actian empowers enterprises to confidently manage and govern data at scale. Actian data intelligence solutions help streamline complex data environments and accelerate the delivery of AI-ready data. Designed to be flexible, Actian solutions integrate seamlessly and perform reliably across on-premises, cloud, and hybrid environments. Learn more about Actian, the data division of HCLSoftware, at actian.com.
Data Intelligence

Data Governance and Data From ERP/CRM Packages: A Must Have

Actian Corporation

June 16, 2020

For the last three decades, companies have been relying on ERP and CRM packages to run their operations.

In response to the need to comply with regulations, reduce risk, and improve profitability, competitiveness, and customer engagement, they have to become data-driven.

In addition to leveraging the wide variety of new data assets now being produced by new means, strategic data from those historical systems has to be involved in any Data Initiative.

Challenges Faced by Companies Trying to Leverage Data From ERP/CRM to Feed Their Digital Initiatives

In the gold rush toward Artificial Intelligence, Advanced Analytics, and Digital Transformation programs, understanding and leveraging data from ERP/CRM packages is on the critical path of any Data Governance journey.

First, these packages have large, complex, hard-to-understand, and heavily customized database models. Understanding table descriptions, relationship definitions, and the other information needed to serve Data Citizens is almost impossible without an appropriate Data Catalog like the Actian Data Intelligence Platform with ad hoc ERP/CRM connectors.

As an example, SAP has more than 90,000 tables. As a consequence, a data scientist will struggle to make sense of the so-called TF120 table in SAP or the F060116 table in JD Edwards.

Secondly, identifying a comprehensive subset of accurate datasets to serve a specific Data initiative is an obstacle course.

Indeed, a large percentage of the tables in those systems are empty, may appear redundant, or have complex links that are opaque to those who are not experts in the ERP/CRM domain.

Thirdly, the demand for fast, agile, and ROI-focused data-driven initiatives puts ERP/CRM-knowledgeable personnel in the middle of the game.

ERP/CRM experts are rare, busy, and expensive, and companies cannot afford to grow those teams or have them lose their focus.

And finally, if a Data Catalog is not able to properly store metadata from those systems in a smooth, comprehensive, and effective way, any data initiative will be deprived of a large part of its capabilities.

The need for financial data, manufacturing data, and customer data, to take a few examples, is obvious and therefore makes ERP/CRM systems mandatory data sources for any Metadata Management program.

Actian Data Intelligence Platform Value Proposition

An Agile and Easy Way

We believe in a Data Democracy world, where any employee of a company can discover, understand, and trust any dataset that is useful.

This is only possible with a reality-proven data catalog that connects easily and straightforwardly to any data source, including the ones from ERP/CRM packages.

But mostly, a Data Catalog has to be smart, easy to use, easy to implement and easy to scale in a complex IT Landscape.

A Wide Connectivity

Actian Data Intelligence Platform provides Premium ERP/CRM connectors for the following packages:

  • SAP and SAP/4HANA
  • SAP BW
  • Salesforce
  • Oracle E-Business Suite
  • JD Edwards
  • Siebel
  • PeopleSoft
  • MS Dynamics AX
  • MS Dynamics CRM

Premium ERP/CRM Connectors Help Companies in Various Ways

Discovering and Assessing

Actian Data Intelligence Platform connectors help companies build an automatic translation layer that hides the complexity of the underlying database tables and automatically feeds the metadata registry with accurate and useful information, saving the Data Governance team time and money.

Scoping Useful Metadata Information for Specific Cases

In a world with thousands of datasets, the platform provides a means to build accurate, self-sufficient models that serve focused business needs by comprehensively extracting the following (an illustrative example of one extracted record follows the list):

  • Business and technical names for tables.
  • Business and technical names for columns in tables.
  • Relationships between tables.
  • Data Elements.
  • Domains.
  • Views.
  • Indexes.
  • Table row count.
  • Application hierarchy (where available from the package).
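
For illustration, a single record extracted by such a connector might look like the following sketch. Every name and value here is invented; real connectors populate these fields automatically from the package’s own metadata.

```python
# Hypothetical example of one extracted catalog record for an ERP table.
catalog_entry = {
    "technical_name": "TF120",
    "business_name": "Fixed Asset Transactions",
    "application_hierarchy": ["Finance", "Asset Accounting"],
    "row_count": 1_284_503,
    "columns": [
        {"technical_name": "BUKRS", "business_name": "Company Code"},
        {"technical_name": "ANLN1", "business_name": "Asset Number"},
    ],
    "relationships": [
        {"to_table": "ANLA", "type": "foreign_key", "on": "ANLN1"},
    ],
    "tags": ["finance", "erp"],
}
```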

Compliance

Actian Data Intelligence Platform’s Premium ERP/CRM connectors are able to identify and tag any personal data or Personally Identifiable Information coming from the supported CRM/ERP packages in its Data Catalog, to comply with GDPR/CCPA regulations.

Data Integration

Deploy and Manage Your Integrations Anywhere, Anytime

Traci Curran

June 15, 2020

With Actian DataConnect Integration Manager, you can deploy, configure, manage, and repair your integrations anywhere – in the cloud, on-prem, or even embedded in your SaaS applications – and anytime. The latest release of Actian DataConnect Integration Manager includes an important set of enhancements to the Integration Manager API that will increase your organization’s ability to define integrations and enable them for either synchronous or asynchronous execution. Okay, this may sound like a bunch of technical jargon, but let’s break it down so you can see why this new feature is so important. Two primary execution patterns are used for data integration – synchronous and asynchronous.

Request-Response Integration

Synchronous integrations, sometimes called “request-response” integrations, are used when you want to tightly couple two applications together. In this pattern, one system generates a message to the other, waits for a response, and when it receives the response, it sends the next message. You can think of this much like a chat conversation where two parties are communicating back and forth with each other. Another example is a user interacting with a website – issuing a command or clicking a button and waiting for a response from the server. This is the most common type of data integration because it is most intuitive to implement and affords the sending system the ability to verify receipt of the message before continuing to the next step in a workflow.

The benefit of synchronous communication is that it works well for real-time integration and complex workflows with many back-and-forth interactions. We see this a lot when multiple applications serve as components of an overarching system or when the integration is part of a transactional workflow (such as a CRM system looking up the status of a customer order in an ERP system). The drawback is that both systems must be actively engaged in the messaging interactions to avoid processing delays.

Set and Forget Integration

Asynchronous integrations, sometimes called “set and forget” integrations, are used when you want to loosely couple applications together. In this pattern, one system sends out a message, then moves on with doing other things – it is not waiting for a response. The receiving system may have a listener configured, waiting to receive the message in real-time, or it may process incoming messages periodically (in batches). You can think of this much like a news agency publishing a story. Some readers may be watching the news feed for updates in real-time while others may check for news updates once per day. In either case, there is no expectation that the receiver of the communication will respond to the sender or even acknowledge receipt of the message.

The benefit of asynchronous communication is that it enables the publishing of data to many recipients at the same time. We see this pattern used often when a system performs batch processing of reports or pushes data to downstream systems. Asynchronous messaging is also used for things like event logs, alerts, and system status messages that do not interfere with transactional processing. The drawback of this method is that the sending system has no visibility into the acceptance and subsequent processing of the message that is sent. Was it received? How long was the message waiting before processing? It is difficult to build transactional workflows using asynchronous integration because of time delays and the inability to monitor the quality of service.
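
The two patterns can be sketched in a few lines of Python. This is a conceptual illustration only, not the Integration Manager API: a direct function call stands in for a request-response integration, and an in-process queue stands in for a message channel that a downstream listener drains on its own schedule.

```python
import queue
import threading
import time

# --- Synchronous (request-response): the caller blocks until it gets an answer.
def order_status_service(order_id):
    time.sleep(0.1)                      # simulated lookup in the downstream system
    return f"order {order_id}: shipped"

print(order_status_service("A-1001"))    # caller waits for the reply before moving on

# --- Asynchronous (set and forget): the sender drops a message and moves on.
outbox = queue.Queue()

def downstream_listener():
    while True:
        msg = outbox.get()
        if msg is None:                  # sentinel to stop the listener
            break
        print("processed later:", msg)

listener = threading.Thread(target=downstream_listener)
listener.start()

outbox.put("inventory update #42")       # sender does not wait for a response
outbox.put(None)
listener.join()
```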

Your Integration Platform Needs to Support Both

As you can see, there are different situations where you might want to use one of these integration patterns over the other. That is why the enhancements to the Actian DataConnect Integration Manager are so important. You now have the flexibility to use both of these patterns in your integrations, depending on the unique needs of your business. There may even be times when you need both synchronous and asynchronous integration between the same systems. That is okay; Actian DataConnect can help you do that.

To learn more, visit DataConnect.

To download the latest DataConnect Integration Manager, visit Actian ESD.

Data Intelligence

Data Science: Accelerate Your Data Lake Initiatives With Metadata

Actian Corporation

June 15, 2020

Data lakes offer unlimited storage for data and present many potential benefits for data scientists in the exploration and creation of new analytical models. However, this structured, unstructured, and semi-structured data is mashed together, and the business insights it contains are often overlooked or misunderstood by data users.

The reason for this is that many technologies used to implement data lakes lack the necessary information capabilities that organizations usually take for granted. It is, therefore, necessary for these enterprises to manage their data lakes by putting in place effective metadata management that considers metadata discovery, data cataloguing, and overall enterprise metadata management applied to the company’s data lake.

2020 is the year that most data and analytics use cases will require connecting to distributed data sources, leading enterprises to double their investments in metadata management. – Gartner 2019.

How to Leverage Your Data Lake With Metadata Management

To get value from their data lake, companies need to have both skilled users (such as data scientists or citizen data scientists) and effective metadata management for their data science initiatives. To begin with, an organization could focus on a specific dataset and its related metadata. Then, leverage this metadata as more data is added into the data lake. Setting up metadata management can make it easier for data lake users to initiate this task.

Here are the Areas of Focus for Successful Metadata Management in Your Data Lake

Creating a Metadata Repository

Semantic tagging is essential for discovering enterprise metadata. Metadata discovery is defined as the process of using solutions to discover the semantics of data elements in datasets. This process usually results in a set of mappings between different data elements in a centralized metadata repository. This allows data science users to understand their data and have visibility on whether or not they are clean, up-to-date, trustworthy, etc.

Automating Metadata Discovery

As numerous and diverse data gets added to a data lake on a daily basis, maintaining ingestion can be quite a challenge! Automated solutions not only make it easier for data scientists or citizen data scientists to find their information, but they also support metadata discovery.

Data Cataloguing

A data catalog consists of metadata in which various data objects, categories, properties, and fields are stored. Data cataloguing is used for both internal and external data (from partners or suppliers, for example). In a data lake, it captures a robust set of attributes for every piece of content within the lake and enriches the metadata catalog by leveraging these information assets. This enables data science users to have a view into the flow of the data, perform impact analysis, share a common business vocabulary, and maintain accountability and an audit trail for compliance.

Data and Analytics Governance

Data and analytics governance is an important use case when it comes to metadata management. Applied to data lakes, the question “could it be exposed?” must become an essential part of the organization’s governance model. Enterprises must therefore extend their existing information governance models to specifically address business analytics and data science use cases that are built on the data lakes. Enterprise metadata management helps in providing the means to better understand the current governance rules that relate to strategic types of information assets.

Contrary to traditional approaches, the key objective of metadata management is to drive a consistent approach to the management of information assets. The more metadata semantics are consistent across all assets, the greater the consistency and understanding, allowing the leveraging of information knowledge across the company. When investing in data lakes, organizations need to consider an effective metadata strategy for those information assets to be leveraged from the data lake.

Start Metadata Management

As mentioned above, implementing metadata management into your organization’s data strategy is not only beneficial but essential for enterprises looking to create business value with their data. Data science teams working with large amounts of varied data in a data lake need the right solutions to be able to trust and understand their information assets. To support this emerging discipline, the Actian Data Intelligence Platform gives you everything you need to collect, update, and leverage your metadata.

Data Management

SQLite’s Serverless Architecture Doesn’t Serve IoT Environments – Part 2

Actian Corporation

June 11, 2020

Part Two: Rethinking What Client-Server Means for Edge Data Management

Over the past few weeks, our SQLite blog series has considered the performance deficiencies of SQLite when handling local persistent data and looked at the performance complications created by the need for ETL when sharing SQLite data with back-end databases. In our last installment—Mobile may be IoT but IoT is not Mobile—we started to understand why the SQLite serverless architecture doesn’t serve IoT environments very well. SQLite became the most popular database on the planet because it was inexpensive (read: free) and seemingly sufficient for the single-user embedded applications emerging on mobile smartphones and tablets.

That was yesterday. Tomorrow is a very different story.

The IoT is expanding at an explosive rate, and what’s happening at the edge—in terms of applications, analytics, processing demands, and throughput—will make the world of single-user SQLite deployments seem quaint. As we’ll see in this and the next installment of this blog, the data requirements for modern edge use cases lie far outside SQLite’s wheelhouse.

SQLite Design-Ins for the IoT: Putting the Wrong Foot Forward

As we’ve noted, SQLite is based on an elegant but simple B-tree architecture. It can store any type of data, is implemented in C, and has a very small footprint—a few hundred KBs—which makes it portable to virtually any environment with minimal resourcing. And while it’s not fully ANSI-standard SQL, it’s close enough for horseshoes, hand grenades, and mobile applications.

For all these reasons, and because it has been used ubiquitously as mobile devices have proliferated over the past decade, IoT developers naturally adopted SQLite into many early IoT applications. These early design-ins were almost mirror images of mobile applications (minus the need for much effort at the presentation layer). Data was captured and cached on the device, with the expectation that it would be moved to the cloud for data processing and analytics.

But that expectation was simply an extrapolation of the mobile world that we knew, and it was shortsighted. It didn’t consider how much processing power could be packed into an ever-smaller CPU package nor where those packages might end up. It didn’t envision the edge as a locus for analytics (wasn’t that the domain of the cloud and the data center?). It didn’t envision the true power of AI and ML and the role those would soon begin to play throughout the IoT. And it didn’t count on the sheer volume of data that would soon be washing through the networks like a virtual tsunami.

Have you been to an IoT trade show recently? Three to five years ago, many of the sessions described PoCs and small pilots in which all data was sent up into the cloud. Engineers and developers we spoke to on the trade show floor expressed skepticism about the need for anything more than SQLite. Some even questioned the need for a database at all (let alone databases that were consistent across clients and servers). In the last three years, though, the common theme of the sessions has changed. They began to center on scaling up pilots to full production and infusing ML routines into local devices and gateways. The conversations started to consider more robust local data management needs. Discussions, in hushed tones at first, about client-server configurations (OMG!) began to appear. The realization that the IoT is not the same as mobile was beginning to sink in.

Rethinking Square Pegs and Round Holes

Of course, the rationale for not using a client-server database in an IoT environment (or, for that matter, any embedded environment) made perfect sense—as long as the client-server model you were eschewing was the enterprise client-server model that had been in use since the ‘80s. In that client-server paradigm, databases were designed for the data center. They were built to run on big iron and to support enterprise applications like ERP, with tens, hundreds, even thousands of concurrent users interacting from barely sentient machines. Collect these databases, add in sophisticated management overlays, an army of DBAs, maybe an outside systems integrator, and steep them in millions of dollars of investment monies — and soon you’ve got yourself a nice little enterprise data warehouse.

That’s not something you’re going to squeeze into an embedded application. Square peg, round hole. And that explains why developers and line-of-business technical staff tended to announce that they had pressing business elsewhere whenever the words “client-server” began to pop up in conversations about the IoT. The use cases emerging in what we began to think of as the IoT were not human end-user centric. Unless someone were prototyping or doing some sort of test and maintenance on a device or gateway or some complex instrumentation, little or no ad hoc querying was taking place. Client-server was serious overkill.

In short, given a very limited set of use cases, limited budgets, and an awareness of the cost and complexity of traditional client-server database environments, relying on SQLite made perfect sense.

Reimagining Client-Server With the IoT in Mind

The dynamics of modern edge data management demand that we reframe our notions of client-server, for the demands of the IoT differ from those of distributed computing as envisioned in the 80s. The old client-server paradigm involved a lot of ad hoc database interaction—both directly through ad hoc queries and indirectly through applications that involved human end-users. In IoT use cases, data access is more prescribed, often repeated and event-driven; you know exactly which data needs to be accessed, as well as when (or at least under which circumstances) an event will generate the request.

Similarly, in a given IoT use case there are no unknowns about how many applications are running on a device or about how many external devices will be requesting data from (or sending data to) an application and its database pairing (and here, whether the database is embedded or separate standalone doesn’t really matter). While these numbers vary among use cases and deployments, a virtual team of developers, systems integrators, product managers, and others will design structure, repeatability, and visibility into the system—even if it’s stateless (and more so if it’s stateful).

In the modern IoT space, client-server database requirements are more like well-defined publish and subscribe relationships (post by publisher/read by subscriber and access from publisher/write to subscriber). They operate as automated machine-to-machine relationships, in which publishing/broadcasting and parallel multichannel intake activities often take place concurrently. Indeed, client-server in the IoT is like publish-subscribe—except that everything needs to perform both operations, and most complex devices (including gateways and intelligent equipment) will need to be able to perform both operations not just simultaneously but also across parallel channels.

Let me repeat that for emphasis: most complex IoT devices (read: pretty much anything other than a sensor) are going to need to be able to read simultaneously and write simultaneously.

SQLite cannot do this.

Traditional client-server databases can, but they were not designed with a small footprint in mind. Most cloud and data center client-server databases require hundreds of megabytes, even gigabytes, of storage space. However, the core functions needed to handle simultaneous reads and writes efficiently take up far less space. The Actian Zen edge database, for example, has a footprint of less than 50MB. And while this is 100X the installed footprint of SQLite, it’s merely a sliver of the space attached to the 64-bit ARM and Intel embedded processor-based platforms we see today. Moreover, Actian Zen edge’s footprint provides all the resources necessary for multi-user management, integration with external applications through ODBC and other standards, security management, and other functionality that is a must once you jump from serverless to client-server. A serverless database like SQLite does not provide those services because their need—like the edge itself—was simply not envisioned at the time.

If we look at the difference between Actian Zen edge and Actian Zen enterprise (with its footprint under 200MB), we can see that most of the difference has to do with human end-user enablement. For example, Actian Zen enterprise includes an SQL editor that enables ad-hoc queries and other data management operations from a command line. While most of that same functionality resides in Zen edge, it is accessed and executed through API calls from an application rather than a CLI.

But Does Every IoT Edge Scenario Need a Server?

Those of you who have been following closely will now sit up and say, Hey, wait: Didn’t you say that not every IoT edge data management scenario needs a client-server architecture?

Yes, I did. Props to you for paying attention. Not all scenarios do—but that’s not really the question you should be asking. The salient question is, do you really want to master one architecture, implementation, and vendor solution for those serverless use cases and separate architectures, implementations, and vendor solutions for the Edge, cloud, and data center? And, from which direction do you approach this question?

Historically, the vast majority of data architects and developers have approached this question from the bottom up. That’s why we started with flat files and then moved to SQLite. Rather than looking from the bottom up, I’m arguing that we need to step back, embrace a new understanding of what client-server can be, and then revisit the question from the top down. Don’t just try to force-fit serverless into a world for which it was never intended—or worse, kluge up from serverless to a jury-rigged implementation of a late 20th century-server configuration.

That way madness lies, as we’ll see in the final installment of this series, where we’ll look at what happens if developers decide to use SQLite anyway.

Ready to reconsider SQLite? Learn more about Actian Zen. Or, you can just kick the tires for free with Zen Core, which is royalty-free for development and distribution.

Data Management

SQLite’s Serverless Architecture Doesn’t Serve IoT Environments – Part 1

Actian Corporation

June 11, 2020

Part One: Mobile May Be IoT—But, When it Comes to Data, IoT is Not Mobile

Three weeks ago, we looked at the raw performance—or the lack thereof—of SQLite. After that, we looked at SQLite within the broader context of modern edge data management and discovered that its performance shortcomings were in fact compounded by the demands of the environment. As a serverless database, SQLite requires integration with a server-based database—which inevitably incurs a performance hit as the SQLite data is transformed through an ETL process for compatibility with the server-based database’s architecture.

SQLite partisans might then adopt a snarky tone and say: “Yeah? Well if SQLite is so slow and integration is so burdensome, can you remind me why it is the most ubiquitous database out there?”

Well, yeah, we can. And in the same breath, we can provide even partisans with ample reason to doubt that the popularity of SQLite will continue going forward. Spoiler alert: What do the overall growth curves of the IoT look like outside the realm of mobile handsets and tablets?

How the Banana Slug Won the Race

In the first blog in this series, we looked at why embedded developers adopted SQLite over both simple file management systems on one end of the data management spectrum and large, complex RDBMS systems on the other. The key technical reasons, just to recap, include its small footprint; its ability to be embedded in an application; its portability to almost any operating system and programming language thanks to a simple architecture (key-value store); and its ability to deliver standard data management functionality through an SQL API. The key non-technical reason—okay, the key reason—is that, well, it’s free! SQLite took hold in use cases dominated by personal applications that needed built-in data management (including developer tools), web applications that needed a data cache, and mobile applications that needed something with a very small footprint. If you combine free with these technical characteristics and consider where and how SQLite has been deployed, it’s no surprise that, in terms of raw numbers, SQLite found itself more widely deployed than any other database.

What all three of the aforementioned use cases have in common, though, is that they are single-user scenarios in which data associated with a user can be stored in a single file and data table (which, in SQLite are one and the same). Demand for data in these use cases generally involves serial reads and writes; there’s little likelihood of concurrent reads, let alone concurrent writes. In fact, it wasn’t until later iterations of SQLite that the product’s developers even felt the need to enable simultaneous reads with a single write.

But here’s the thing: Going forward, those three use cases are not going to be the ones driving the key architectural decisions. Ironically, the characteristics of SQLite that made it so popular among developers helped give rise to a world in which billions of devices are acting, reacting, and interacting in real time—at the edge, in the cloud, and in the data center—and that’s a world for which those same characteristics are singularly ill-suited.

SQLite has essentially worked itself out of a role in the realm of modern edge data management.

As we’ve mentioned earlier, SQLite is based on an elegant but simple key-value store architecture that enables you to store any type of data. It is implemented in C with a very small footprint, a few hundred KBs, making it portable to virtually any environment with minimal resourcing. And, while it’s not fully ANSI-standard SQL, it’s close enough for horseshoes, hand grenades, and mobile applications.

SQLite was adopted in many early IoT applications because these early design-ins were almost mirror images of mobile applications (minus the need for much effort at the presentation layer), focused on local caching of data with the expectation that it would be moved to the cloud for data processing and analytics. Pilot projects on the cheap meant designers and developers defaulted to what they knew and what was free – ta-da, SQLite!

Independent of SQLite, the IoT market and its use cases have rapidly moved off this initial trajectory. Clear proof of this is readily apparent if you’ve had the opportunity to go to IoT trade shows over the last few years. Three to five years ago, recall how many of the sessions described proof of concepts (PoCs) and small pilots where all data was sent up into the cloud. When we spoke to engineers and developers on the trade show floor, they were skeptical about the need for anything more than SQLite, or whether a database was needed at all – let alone client-server versions. However, in the last three years, more of the sessions have centered on scaling up pilots to full production and infusing ML routines into local devices and gateways. Many more of the conversations involved consideration of more robust local data management, including client-server options.

Intelligent IoT is Redefining Edge Data Management

For all its strengths in the single-user application space, SQLite and its serverless architecture are unequal to the demands of autonomous vehicles, smart agriculture, medical instrumentation, and other industrial IoT spaces. The same is true with regard to the horizontal spaces occupied by key industrial IoT components, such as IoT gateways, 5G networking gear, and so forth. Unlike single-user applications designed to support human-to-machine requirements, innumerable IoT applications are being built for machine-to-machine relationships occurring in highly automated environments. Modern machine-to-machine scenarios involve far fewer one-to-one relationships and a far greater number of peer-to-peer and hierarchical relationships (including one-to-many and many-to-one subscription and publication scenarios), all of which have far more complex data management requirements than those for which SQLite was built. Moreover, as CPU power has migrated out of the data center into the cloud and now out to the edge, a far wider array of systems are performing complex software-defined operations, data processing, and analytics than ever before. Processing demands are becoming both far more sophisticated and far more local.

Consider: Tomorrow’s IoT sensor grids will run the gamut from low-speed, low-resolution structured data feeds (capturing tens of thousands of pressure, volume, and temperature readings, for example) to high-speed, high-resolution video feeds from hundreds of streaming UHD cameras. In a chemical processing plant, both sensor grids could be flowing into one or more IoT gateways that, in turn, could flow into a network of edge systems (each with the power one would only have found in a data center a few years ago) for local processing and analysis, after which some or all of the data and analytical information would be passed on to a network of servers in the cloud.

Dive deeper: The raw data streams flowing in from these grids would need to be read and processed in parallel. These activities could involve immediately discarding spurious data points, running signal-to-noise filters, normalizing data, or fusing data from multiple sensors, to name just a few of the obvious data processing functions. Some of the data would be stored as it arrived—either temporarily or permanently, as the use case demanded—while other data might be discarded.

A World of Increasing Complexity

Throughout these scenarios we see far more complex operations taking place at every level, including ML inference routines being run locally on devices, at the gateway level, or both. There may be additional operations running in parallel on these same datasets—including downstream device monitoring and management operations, which effectively create new data streams moving in the opposite direction (e.g., reads from the IoT gateway and writes down the hierarchical ladder). Or data could be extracted simultaneously for reporting and analysis by business analysts and data scientists in the cloud or data center. In an environment such as the chemical plant we have envisioned, there may also be more advanced analytics and visualization activities performed at, say, a local operations center.

These scenarios are both increasingly commonplace and wholly unlike the scenarios that propelled SQLite to prominence. They are combinatorial and additive; they present a world of processing and data management demands that is as far from that of the single-user, single-application world—the sweet-spot for SQLite—as one can possibly get:

  • Concurrent writes are a requirement, and not just to a single file or data table—with response times between write requests of as little as a few milliseconds.
  • Multiple applications will be reading and writing data to the same data tables (or joining them) in IoT gateways and other edge devices, requiring the same kind of sophisticated orchestration that would be required with multiple concurrent users.
  • On-premise edge systems may have local human oversight of operations, and their activities will add further complexity to the orchestration of multiple activities reading and writing to the databases and data tables.

If all of this sounds like an environment for which SQLite is inadequately prepared, you’re right.  In parts two and three of this blog we’ll delve into these issues further.

Ready to reconsider SQLite? Learn more about Actian Zen. Or, you can just kick the tires for free with Zen Core, which is royalty-free for development and distribution.

Data Intelligence

Build Your Citizen Data Scientist Team

Actian Corporation

June 8, 2020

“There aren’t enough expert data scientists to meet data science and machine learning demands, hence the emergence of citizen data scientists. Data and analytics leaders must empower “citizens” to scale efforts, or risk failure to secure data science as a core competency.” – Gartner 2019

As data science provides competitive advantages for organizations, the demand for expert data scientists is at an all-time high. However, supply remains scarce relative to that demand! This limitation is a threat to enterprises’ competitiveness and, in some cases, their survival in the market.

In response to this challenge, an important analytical role providing a bridge between data scientists and business functions was born: the citizen data scientist.

What is a Citizen Data Scientist?

Gartner defines the citizen data scientist as “an emerging set of capabilities and practices that allows users to extract predictive and prescriptive insights from data while not requiring them to be as skilled and technically sophisticated as expert data scientists”. A “Citizen Data Scientist” is not a job title. They are “power users” who can perform both simple and sophisticated analytical tasks.

Typically, citizen data scientists don’t have coding expertise but can nevertheless build models using drag-and-drop tools and run prebuilt data pipelines and models using tools such as Dataiku. Be aware: citizen data scientists do NOT replace expert data scientists. They bring their domain expertise but do not have the specialized skills required for advanced data science.

The citizen data scientist is a role that has evolved as an “extension” from other roles within the organization! This means that organizations must develop a citizen data scientist persona. Potential citizen data scientists will vary based on their skills and interests in data science and machine learning. Roles that filter into the citizen data scientist category include:

  • Business Analysts.
  • BI Analysts/Developers.
  • Data Analysts.
  • Data Engineers.
  • Application Developers.
  • Business Line Managers.

How to Empower Citizen Data Scientists

As expert skills for data science initiatives tend to be quite expensive and difficult to come by, utilizing a citizen data scientist can be an effective way to close the current gap.

Here are ways you can empower your data science teams:

Break Enterprise Silos

As I’m sure you’ve heard many times before, many organizations tend to operate in independent silos. As mentioned above, all of these roles are important in an organization’s data management strategy, and they have all expressed interest in learning data science and machine learning skills. However, most data science and machine learning knowledge is siloed in the data science department or in specific roles. As a result, data science efforts often go unvalidated and unleveraged. Lack of collaboration between data roles makes it difficult for citizen data scientists to access and understand enterprise data!

Establishing a community of both business and IT roles that provides detailed guidelines and/or resources allows enterprises to empower citizen data scientists. It is important for organizations to encourage the sharing of data science efforts throughout the organization and thus break silos.

Provide Augmented Data Analytics Technology

Technology is fueling the rise of the citizen data scientist. Traditional BI vendors such as SAP, Microsoft, and Tableau Software provide advanced statistical and predictive analytics as part of their offerings. Meanwhile, data science and machine learning platforms such as SAS, H2O.ai, and TIBCO Software provide users who lack advanced analytics skills with “augmented analytics”. Augmented analytics leverages automated machine learning to transform how analytics content is developed, consumed, and shared. It includes:

  • Augmented data preparation: Uses machine learning automation to augment data profiling, data quality, modeling, enrichment, and data cataloguing.

  • Augmented data discovery: Enables business and IT users to automatically find, visualize, and analyze relevant information, such as correlations, clusters, segments, and predictions, without having to build models or write algorithms.

  • Augmented data science and machine learning: Automates key aspects of advanced analytics modeling, such as feature selection, algorithm selection, and other time-consuming steps of the modeling process.

By incorporating the necessary tools and solutions and extending resources and efforts, enterprises can empower citizen data scientists.

Empower Citizen Data Scientists With a Metadata Management Platform

Metadata management is an essential discipline for enterprises wishing to bolster innovation or regulatory compliance initiatives on their data assets. By implementing a metadata management strategy, where metadata is well-managed and correctly documented, citizen data scientists are able to easily find and retrieve relevant information from an intuitive platform.


About Actian Corporation

Actian empowers enterprises to confidently manage and govern data at scale. Actian data intelligence solutions help streamline complex data environments and accelerate the delivery of AI-ready data. Designed to be flexible, Actian solutions integrate seamlessly and perform reliably across on-premises, cloud, and hybrid environments. Learn more about Actian, the data division of HCLSoftware, at actian.com.
Data Architecture

Actian Vector for Hadoop for Fuller SQL Functionality and Current Data

Actian Corporation

June 7, 2020


In this second part of a three-part blog series, we’ll explain how SQL execution in Actian Vector in Hadoop (VectorH) is more complete and ready to run in an operational environment, and how VectorH’s ability to handle data updates efficiently enables your production environment to stay current with the state of your business. Part one showed the tremendous performance advantage VectorH has over other SQL on Hadoop alternatives, and part three will cover the advantages of the VectorH file format.

Better SQL Functionality for Business Productivity

One of the original barriers to getting value out of Hadoop is the need for MapReduce skills, which are rare and expensive, and take time to apply to a given analytical question. Those challenges led to the rise of many SQL on Hadoop alternatives, many of which are now projects in the Apache ecosystem for Hadoop. While those different projects open up access to the millions of business users already fluent in writing SQL queries, in many cases they require other tradeoffs: differences in syntax, limitations on certain functions and extensions, immature optimization technology, and inefficient implementations. Is there a better way to get SQL on Hadoop?

Yes! Actian VectorH 6.0 supports a much more complete implementation, with full ANSI SQL:2003 support, plus analytic extensions like CUBE, ROLLUP, GROUPING SETS, and WINDOWING for advanced analytics. Let’s look at the workload we evaluated in our SIGMOD paper, based on the 22 queries in the TPC-H benchmark.
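
For example, here is a hedged sketch of the kind of ANSI SQL:2003 a business user can run unchanged on an engine with full support for these extensions (the sales table and its region, product, and revenue columns are hypothetical, not part of the TPC-H schema):

    -- Subtotals per region and product, plus a grand total, using ROLLUP
    SELECT region, product, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY ROLLUP (region, product);

    -- Windowing: rank products by revenue within each region
    SELECT region, product, revenue,
           RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS revenue_rank
    FROM sales;

Queries like these run as-is where SQL:2003 is fully supported; on platforms with only partial support, they typically have to be emulated with self-joins or post-processing.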

Each of the other SQL on Hadoop alternatives had issues running the standard SQL queries that comprise the TPC-H benchmark, which means that business users who know SQL may have to make changes manually or suffer from poor results or even failed queries:

  • Apache Hive 1.2.1 couldn’t complete query number 5.
  • Performance for Cloudera Impala 2.3 is hindered by single-core joins and aggregation processing, creating bottlenecks for exploiting parallel processing resources.
  • Apache Drill 1.5 couldn’t complete query number 21, and only 9 of the queries ran without modification to their SQL code.
  • Since Apache Spark SQL version 1.5.2 supports only a limited subset of ANSI SQL, most queries had to be rewritten to avoid IN/EXISTS/NOT EXISTS sub-queries (see the sketch after this list), and some queries required manual definition of join orders. VectorH, by contrast, has a mature query optimizer that reorders joins based on cost metrics to improve performance and reduce I/O bandwidth requirements.
  • Apache Hawq version 1.3.1 is based on PostgreSQL, so its older technology foundations can’t compete with the performance of a vectorized query engine.
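
To illustrate the kind of manual rewriting the Spark SQL limitation forced, here is a hedged sketch: the same question expressed naturally with NOT EXISTS, then as an equivalent anti-join. The customers and orders tables are illustrative stand-ins modeled on the TPC-H CUSTOMER and ORDERS tables.

    -- Natural formulation: customers who have placed no orders
    SELECT c.c_custkey, c.c_name
    FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.o_custkey = c.c_custkey);

    -- Rewrite required where NOT EXISTS is unsupported: anti-join via outer join
    -- (equivalent as long as o_custkey is non-null, as it is in TPC-H)
    SELECT c.c_custkey, c.c_name
    FROM customers c
    LEFT OUTER JOIN orders o ON o.o_custkey = c.c_custkey
    WHERE o.o_custkey IS NULL;

The rewrite is mechanical for one query, but repeating it by hand across 22 benchmark queries, or a real workload, is exactly the kind of friction a more complete SQL implementation avoids.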

Efficient Updates for More Consistent View of the Business

Another barrier to Hadoop adoption is that HDFS is an append-only file system, which limits its ability to handle inserts and deletes. Yet many business applications require updates to the data, putting the burden on the database management system to handle those changes. VectorH can receive and apply updates from transactional data sources to ensure that analytics are performed on the most current representation of your business, not on data from an hour ago, yesterday, or the last batch load into your data warehouse.
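
To make that concrete, here is a minimal, hedged sketch of the kind of trickle updates a transactional feed might apply to a fact table while analytic queries keep running; the orders table and values are illustrative, not the actual TPC-H refresh functions.

    -- New orders arriving from the transactional source
    INSERT INTO orders (o_orderkey, o_custkey, o_totalprice, o_orderdate)
    VALUES (4200001, 117, 8432.50, DATE '2020-06-01');

    -- Orders cancelled upstream are removed
    DELETE FROM orders
    WHERE o_orderkey = 3999977;

    -- Analytics immediately reflect the current state of the business
    SELECT o_orderdate, SUM(o_totalprice) AS daily_revenue
    FROM orders
    GROUP BY o_orderdate;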

  • As part of the ad hoc decision support workload it represents, TPC-H requires inserts and deletes to be run as part of the workload. Two refresh streams apply inserts and deletes to the ORDERS and LINEITEM fact tables.
  • Four of the SQL on Hadoop alternatives do not support updates on HDFS: Impala, Drill, SparkSQL, and Hawq. They would not be able to meet the requirements for a full audited result.
  • The fifth, Hive, does support updates but incurs a significant performance penalty executing queries after handling the updates.
  • VectorH executed the updates more quickly than Hive. With its patent-pending Positional Delta Trees (PDTs), VectorH tracks inserts and deletes separately from the data blocks, maintaining full ACID compliance while preserving the same level of query performance (no penalty!).
  • Here is the summary data from our testing that shows the performance penalty on Hive while there is no impact on VectorH from executing updates (detailed data follows):
    • Inserts took 36% longer and deletes required 796% more time on Hive than on VectorH.

Query performance afterwards shows PDTs have no measurable overhead, compared to the 38% performance penalty on Hive:

  • The average speedup for VectorH over Hive increases from 229x before the refresh cycles to 331x after updates are applied, with a range of 23 to 1141 on individual queries.

Appendix: Detailed Query Execution Times


About Actian Corporation

Actian empowers enterprises to confidently manage and govern data at scale. Actian data intelligence solutions help streamline complex data environments and accelerate the delivery of AI-ready data. Designed to be flexible, Actian solutions integrate seamlessly and perform reliably across on-premises, cloud, and hybrid environments. Learn more about Actian, the data division of HCLSoftware, at actian.com.
Data Management

Actian Shows Big Advantages Over SQL on Hadoop Alternatives

Actian Corporation

June 6, 2020


Imagine if reports that currently take many minutes to run in Hadoop could come back with results in seconds. Get answers to detailed questions about sales figures and customer trends in real-time. Make revenue predictions based on up-to-date customer metrics across a spectrum of sources. Iterate more quickly simulating different business decisions to achieve better outcomes. The Actian Vector for Hadoop analytics platform can deliver those improvements in your Hadoop big data environment.

Actian Vector for Hadoop has demonstrated one to three orders of magnitude better query performance in a comparison with other major SQL on Hadoop alternatives. In this first of a three-part blog series describing the results, we’ll show the astounding performance results and explain the factors that contribute to such a large advantage. Part two will cover the unique abilities Vector has to handle updates, and part three will go into the efficiencies of the Vector for Hadoop file format.

Actian performance engineering used the full set of 22 TPC-H queries to run unaudited benchmarks on several of the SQL on Hadoop solutions in the market, and the results may surprise you (but not us). Here is a quick summary:

These results have been published in an academic paper submitted to and presented at the International Conference on Management of Data (ACM SIGMOD). That paper goes into the technical reasons Vector for Hadoop is able to achieve such a performance advantage; here is the short version:

  • Efficient, multi-core parallel and vectorized execution – Vector for Hadoop is designed to take advantage of the performance features in the Intel CPU architecture, including the AVX2 vector instruction set and large, multi-layer caches.
  • Well-tuned query optimizer – Vector for Hadoop extends the mature optimizer from its original SMP version to exploit the multiple levels of parallelism and advantages of data locality in an MPP Hadoop system. The Vector for Hadoop optimizer can change the join order or partition data tables to improve parallel operations, steps that have to be done manually for queries in the other alternatives.
  • Control over HDFS block locality – since Vector for Hadoop operates natively within HDFS and YARN, it can participate in resource management and make allocation decisions in the context of the larger cluster workload. At the same time, specific table storage optimizations reduce overhead, accelerate reads, maximize disk efficiency, and reduce data skew to help deliver faster query results.
  • Effective I/O filtering – tracking the range of values in a column (MinMax) allows skipping the reading of blocks that fall outside the range of the query, reducing disk I/O and read delays and avoiding decompression work, sometimes significantly (a concrete example follows this list).
  • Lightweight compression – Vector’s compression achieves good levels of compaction at high speed and enables faster vectorized execution by minimizing branches and instruction counts. Our compression algorithms can run fully in CPU cache, effectively increasing memory bandwidth. Different compression algorithms are tailored to the various data types, and Vector automatically calibrates and chooses among them to reach higher levels of compression and efficiency than general-purpose compression algorithms.
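
As a concrete illustration of the I/O filtering point above, consider a selective date-range query; this is a hedged sketch against a TPC-H-style lineitem table, not a description of VectorH internals beyond what is stated above.

    -- Only one week of data is requested.
    -- Blocks whose MinMax range for l_shipdate falls entirely outside
    -- the predicate can be skipped without being read or decompressed.
    SELECT l_shipmode, COUNT(*) AS shipments
    FROM lineitem
    WHERE l_shipdate BETWEEN DATE '2020-05-25' AND DATE '2020-05-31'
    GROUP BY l_shipmode;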

How was the testing conducted?

  • Actian performance engineering built a 10-node Hadoop cluster, each node with 2x Intel 3.0 GHz E5-2690 v2 CPUs, 256 GB RAM, 24x 600 GB HDDs, and 10 Gb Ethernet, running Hadoop 2.6.0. There was one name node and nine SQL-on-Hadoop nodes, set up using Cloudera Express 5.5.
  • These tests were conducted in early 2016, running the then-most-current release of each of the SQL on Hadoop alternatives (Actian Vector for Hadoop 4.2.2, Apache Hive 1.2.1, Cloudera Impala 2.3, Apache Drill 1.5, Apache Spark SQL 1.5.2, and Pivotal HAWQ 1.3.1). Reasonable efforts were made to tune each platform to make fair comparisons.

Here are the actual individual query execution times and the speed-up factor for Vector for Hadoop versus each of the alternatives:

In part two of this blog series, we will cover the advantages Vector for Hadoop 6.0 delivers in SQL functionality and data updates capability compared to the other alternatives, and part three will show the benefits of the Vector file format for faster query performance and lower storage requirements.


About Actian Corporation

Actian empowers enterprises to confidently manage and govern data at scale. Actian data intelligence solutions help streamline complex data environments and accelerate the delivery of AI-ready data. Designed to be flexible, Actian solutions integrate seamlessly and perform reliably across on-premises, cloud, and hybrid environments. Learn more about Actian, the data division of HCLSoftware, at actian.com.
Data Architecture

Actian Vector for Hadoop File Format is Faster and More Efficient

Actian Corporation

June 5, 2020


In this third and last part of the series on Actian Vector in Hadoop (VectorH), we will cover how the VectorH file format supports the performance and efficiency of our data analytics platform to accelerate business insights, as well as some of the other enterprise features that can help businesses move their Hadoop applications into production. Part one of this series showed the huge performance advantages VectorH has over other SQL on Hadoop alternatives, while part two explored the benefits of the richer implementation of SQL and the ability to perform data updates in VectorH.

The file format for VectorH is one of the key contributors to its industry-leading performance. A columnar orientation allows VectorH to choose compression techniques optimized by data type, and VectorH can use various measures described in the SIGMOD paper to employ storage and I/O bandwidth more efficiently. In some simple benchmarks described in that paper, we compared the speed and efficiency of VectorH against other query engines (such as Impala and Presto) and other file formats (such as Parquet and ORC). Three observations become clear from the benchmark results:

  • VectorH handles queries much faster than the other alternatives when the data is already in memory (from 26x to over 110x faster), primarily due to the efficiencies of decompression using vectorized processing. The chart below shows query times for each of the alternatives and how they vary depending on the percentage of the data selected out of the entire set of tables. VectorH and Presto avoid processing data outside the selected range, while Impala does not and performs much worse in the 10% and 30% cases.
[Chart: query times for each alternative by percentage of data selected]

  • VectorH is also significantly faster when data hasn’t yet been loaded into memory. VectorH reduces the amount of I/O required for data residing on disk by using I/O filtering, where MinMax indexes in memory allow skipping read operations for blocks on disk with no data in the selected range. The chart below, similar to the one above, reflects the percentage of data in the range selected; only VectorH shows significant savings in read operations as less data fits the selection criteria. Although some other formats also carry range information, it is stored as metadata inside the data blocks, so every block still needs to be read at least partly before deciding whether its data is relevant. VectorH performed significantly less I/O, from 20% to 98% less, compared to Impala and Presto.

[Chart: disk I/O by percentage of data in the selected range]

  • VectorH has the most effective compression across a variety of data types, requiring only 11 GB of storage compared to 18 GB for Parquet and 19 GB for ORC, a savings of 39-42%. Imagine the savings over a multi-petabyte data store!

[Chart: storage required by VectorH, Parquet, and ORC]

Additional advantages for VectorH that contribute to deploying successful analytics solutions:

  • Spark integration is an example of Actian’s continuing commitment to incorporating open interfaces and frameworks directly into the VectorH solution.
    • Actian VectorH 6.0 integrates with the latest Hadoop distributions and can be deployed both on-premises and in the cloud (e.g., Microsoft Azure HDInsight).
    • Actian VectorH 6.0 supports multiple file systems as well as multiple data formats (Parquet, ORC, CSV, and many others through the Spark connector).
    • Users can execute queries in VectorH on data stored in any file format supported by Spark by leveraging the Spark connector. This is fully transparent to the user: full ANSI SQL can be used to query data in any of these formats without the user even knowing Spark is involved.
    • With the Spark connector, data stored in VectorH can be processed in Spark through the use of Dataframes or Spark SQL. Any Spark operation can be performed on data backed by a VectorH table.
  • Overall, Actian provides more complete enterprise-grade functionality to support moving analytics applications from development into a production environment.
    • Role- and row-based security is built into VectorH, providing the access control needed to support privacy policies and regulatory requirements.
    • Actian Director provides a web-based tool for monitoring and managing VectorH and cluster resources.
    • Actian Management Console automates provisioning, deploying, and monitoring analytics in the cloud, making it quicker and easier to get your new project started.

This three-part blog series (see parts one and two) shows how Actian provides customers with the performance, flexibility and support needed when integrating with other big data technologies to deliver faster and richer insights to make better business decisions.


About Actian Corporation

Actian empowers enterprises to confidently manage and govern data at scale. Actian data intelligence solutions help streamline complex data environments and accelerate the delivery of AI-ready data. Designed to be flexible, Actian solutions integrate seamlessly and perform reliably across on-premises, cloud, and hybrid environments. Learn more about Actian, the data division of HCLSoftware, at actian.com.
Data Intelligence

Why are Your Data Scientists Leaving Your Enterprise?

Actian Corporation

May 29, 2020


In 2019, Data Scientist was named the most promising job by LinkedIn. From Fortune 500 companies to small enterprises around the world, building a team of data science professionals became a priority in business strategies. Reflecting this trend, 2019 broke all records for AI and data science investment.

Despite all of these positive trends, Data Scientists are quitting and changing companies at a rapid pace. How come? Let’s analyze the situation.

They Don’t Spend Their Time Doing What They Were Hired For

Unfortunately, many companies that hire data scientists do not have a suitable AI infrastructure in place. Surveys still suggest that roughly 80% of data scientists’ time is spent on cleaning, organizing, and finding data (instead of analyzing it), which is one of the last things they want to spend their time doing. In their article “How We Improved Data Discovery for Data Scientists at Spotify”, Spotify explains how, in the beginning, their “datasets lacked clear ownership or documentation, making it difficult for data scientists to find them.” Even data scientists working for Web Giants have felt frustration in their data journey.

Many data scientists end up leaving their companies because they spend their days sifting through the clutter in their data environments. Having clean and well-documented data is key for your data scientists: it helps them find, discover, and understand the company’s data, save time on tedious tasks, and produce actionable insights.

Business and Data Science Goals are not Aligned

With all the hype around AI and machine learning, executives and investors want to showcase their data science projects as being at the forefront of the latest technological advances. They often hire AI and data experts thinking they will reach their business objectives in half the time. However, this is rarely the case. Data science projects typically involve a lot of experimentation, trial and error, and repeated iterations of the same process before reaching an outcome.

Many companies hire more data specialists in order to expand research and insight production across the organization. However, this research often has only a “local impact” in specific parts of the enterprise, going unseen by other departments that might find it useful in their decision-making.

It is therefore important for both parties to work together effectively and efficiently by establishing solid communication. Aligning business objectives with data science objectives is the key to not losing your data scientists. With a DataOps approach, data scientists are able to work in an agile, collaborative, and change-friendly environment that promotes communication between the business and IT departments.

They Struggle to Understand & Contextualize Data at Enterprise Level

Most organizations have numerous complex solutions in place, usually poorly understood by the majority of the enterprise, which makes it difficult to train new data science employees. Without a single centralized solution, data scientists find themselves jumping between various technologies, losing sight of which data is useful, up to date, and of sufficient quality for their use cases.

This lack of visibility into data is frustrating for data scientists who, as mentioned above, spend the majority of their time looking for data across multiple tools and sources.

By putting a single source of truth in place, data science experts are able to view their enterprise data in one location and produce data-driven insights.

Accelerate Your Data Scientists Work With a Metadata Management Solution

Metadata management is an essential discipline for enterprises wishing to bolster innovation or regulatory compliance initiatives on their data assets. By implementing a metadata management strategy, where metadata is well-managed and correctly documented, data scientists are able to easily find and retrieve relevant information from an intuitive platform. Empower your data science teams by providing them with the right tools that enable them to create new machine learning algorithms for their data projects and, in turn, value for your enterprise.


About Actian Corporation

Actian empowers enterprises to confidently manage and govern data at scale. Actian data intelligence solutions help streamline complex data environments and accelerate the delivery of AI-ready data. Designed to be flexible, Actian solutions integrate seamlessly and perform reliably across on-premises, cloud, and hybrid environments. Learn more about Actian, the data division of HCLSoftware, at actian.com.