Data Management

SQLite’s Serverless Architecture Doesn’t Serve IoT Environments – Part 2

Actian Corporation

June 11, 2020

Part Two: Rethinking What Client-Server Means for Edge Data Management

Over the past few weeks, our SQLite blog series has considered the performance deficiencies of SQLite when handling local persistent data and looked at the performance complications created by the need for ETL when sharing SQLite data with back-end databases. In our last installment—Mobile may be IoT but IoT is not Mobile—we started to understand why the SQLite serverless architecture doesn’t serve IoT environments very well. SQLite became the most popular database on the planet because it was inexpensive (read: free) and seemingly sufficient for the single-user embedded applications emerging on mobile smartphones and tablets.

That was yesterday. Tomorrow is a very different story.

The IoT is expanding at an explosive rate, and what’s happening at the edge—in terms of applications, analytics, processing demands, and throughput—will make the world of single-user SQLite deployments seem quaint. As we’ll see in this and the next installment of this blog, the data requirements for modern edge use cases lie far outside SQLite’s wheelhouse.

SQLite Design-Ins for the IoT: Putting the Wrong Foot Forward

As we’ve noted, SQLite is based on an elegant but simple B-tree architecture. It can store any type of data, is implemented in C, and has a very small footprint—a few hundred KBs—which makes it portable to virtually any environment with minimal resourcing. And while it’s not fully ANSI-standard SQL, it’s close enough for horseshoes, hand grenades, and mobile applications.

For all these reasons, and because it has been used ubiquitously as mobile devices have proliferated over the past decade, IoT developers naturally adopted SQLite into many early IoT applications. These early design-ins were almost mirror images of mobile applications (minus the need for much effort at the presentation layer). Data was captured and cached on the device, with the expectation that it would be moved to the cloud for data processing and analytics.

But that expectation was simply an extrapolation of the mobile world that we knew, and it was shortsighted. It didn’t consider how much processing power could be packed into an ever-smaller CPU package nor where those packages might end up. It didn’t envision the edge as a locus for analytics (wasn’t that the domain of the cloud and the data center?). It didn’t envision the true power of AI and ML and the role those would soon begin to play throughout the IoT. And it didn’t count on the sheer volume of data that would soon be washing through the networks like a virtual tsunami.

Have you been to an IoT trade show recently? Three to five years ago, many of the sessions described PoCs and small pilots in which all data was sent up into the cloud. Engineers and developers we spoke to on the trade show floor expressed skepticism about the need for anything more than SQLite. Some even questioned the need for a database at all (let alone databases that were consistent across clients and servers). Over the last three years, though, the common theme of the sessions has changed: they now center on scaling pilots up to full production and infusing ML routines into local devices and gateways. The conversations have turned to more robust local data management needs. Discussions about client-server configurations (OMG!), in hushed tones at first, have begun to appear. The realization that the IoT is not the same as mobile is beginning to sink in.

Rethinking Square Pegs and Round Holes

Of course, the rationale for not using a client-server database in an IoT environment (or, for that matter, any embedded environment) made perfect sense—as long as the client-server model you were eschewing was the enterprise client-server model that had been in use since the ‘80s. In that client-server paradigm, databases were designed for the data center. They were built to run on big iron and to support enterprise applications like ERP, with tens, hundreds, even thousands of concurrent users interacting from barely sentient machines. Collect these databases, add in sophisticated management overlays, an army of DBAs, maybe an outside systems integrator, and steep them in millions of dollars of investment monies — and soon you’ve got yourself a nice little enterprise data warehouse.

That’s not something you’re going to squeeze into an embedded application. Square peg, round hole. And that explains why developers and line-of-business technical staff tended to announce that they had pressing business elsewhere whenever the words “client-server” began to pop up in conversations about the IoT. The use cases emerging in what we began to think of as the IoT were not human end-user centric. Unless someone were prototyping or doing some sort of test and maintenance on a device or gateway or some complex instrumentation, little or no ad hoc querying was taking place. Client-server was serious overkill.

In short, given a very limited set of use cases, limited budgets, and an awareness of the cost and complexity of traditional client-server database environments, relying on SQLite made perfect sense.

Reimagining Client-Server With the IoT in Mind

The dynamics of modern edge data management demand that we reframe our notions of client-server, for the demands of the IoT differ from those of distributed computing as envisioned in the 80s. The old client-server paradigm involved a great deal of ad hoc database interaction—both directly through ad hoc queries and indirectly through applications driven by human end-users. In IoT use cases, data access is more prescribed, often repeated and event-driven; you know exactly which data needs to be accessed, as well as when (or at least under which circumstances) an event will generate the request.

Similarly, in a given IoT use case there are no unknowns about how many applications are running on a device or about how many external devices will be requesting data from (or sending data to) an application and its database pairing (and here, whether the database is embedded or a separate standalone instance doesn’t really matter). While these numbers vary among use cases and deployments, a virtual team of developers, systems integrators, product managers, and others will design structure, repeatability, and visibility into the system—even if it’s stateless (and more so if it’s stateful).

In the modern IoT space, client-server database requirements are more like well-defined publish and subscribe relationships (the publisher posts and the subscriber reads; the publisher is accessed and the subscriber is written to). They operate as automated machine-to-machine relationships, in which publishing/broadcasting and parallel multichannel intake activities often take place concurrently. Indeed, client-server in the IoT is like publish-subscribe—except that everything needs to perform both operations, and most complex devices (including gateways and intelligent equipment) will need to be able to perform both operations not just simultaneously but also across parallel channels.

Let me repeat that for emphasis: most complex IoT devices (read: pretty much anything other than a sensor) are going to need to be able to handle simultaneous reads and simultaneous writes.

SQLite cannot do this.
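
To make that limitation concrete, here is a minimal sketch in Python using the standard sqlite3 module: two threads try to write to the same SQLite database at the same time. The file and table names are illustrative. With the default journal mode, one writer takes the single write lock and the other typically fails with a “database is locked” error.

```python
import sqlite3
import threading
import time

DB = "edge_demo.db"  # illustrative file name

def setup():
    conn = sqlite3.connect(DB, isolation_level=None)
    conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor_id INTEGER, value REAL)")
    conn.close()

def writer(sensor_id):
    # timeout=0 surfaces lock contention immediately instead of silently retrying;
    # isolation_level=None leaves transaction control entirely to our SQL.
    conn = sqlite3.connect(DB, timeout=0, isolation_level=None)
    try:
        conn.execute("BEGIN IMMEDIATE")               # take the single write lock
        for i in range(100):
            conn.execute("INSERT INTO readings VALUES (?, ?)", (sensor_id, i * 0.1))
            time.sleep(0.01)                          # simulate a slow burst of inserts
        conn.execute("COMMIT")
        print(f"writer {sensor_id}: committed")
    except sqlite3.OperationalError as exc:           # typically "database is locked"
        print(f"writer {sensor_id}: {exc}")
    finally:
        conn.close()

setup()
threads = [threading.Thread(target=writer, args=(n,)) for n in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```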

Traditional client-server databases can, but they were not designed with a small footprint in mind. Most cloud and data center client-server databases require hundreds of megabytes, even gigabytes, of storage space. However, the core functions needed to handle simultaneous reads and writes efficiently take up far less space. The Actian Zen edge database, for example, has a footprint of less than 50MB. And while this is 100X the installed footprint of SQLite, it’s merely a sliver of the space attached to the 64-bit ARM and Intel embedded processor-based platforms we see today. Moreover, Actian Zen edge’s footprint provides all the resources necessary for multi-user management, integration with external applications through ODBC and other standards, security management, and other functionality that is a must once you jump from serverless to client-server. A serverless database like SQLite does not provide those services because their need—like the edge itself—was simply not envisioned at the time.
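
By contrast, a client-server engine arbitrates concurrent connections for you. The sketch below is a rough illustration rather than Actian documentation: it assumes a hypothetical ODBC data source named ZenEdge has been configured for an edge server and that a readings table already exists. Each worker opens its own connection and interleaves reads and writes, leaving the coordination to the server.

```python
import threading
import pyodbc  # any ODBC-capable client library would do

CONN_STR = "DSN=ZenEdge"  # hypothetical data source name

def worker(sensor_id):
    # Each worker gets its own connection; the server coordinates concurrent access.
    conn = pyodbc.connect(CONN_STR, autocommit=False)
    cursor = conn.cursor()
    for i in range(100):
        cursor.execute("INSERT INTO readings (sensor_id, value) VALUES (?, ?)",
                       sensor_id, i * 0.1)
        cursor.execute("SELECT COUNT(*) FROM readings WHERE sensor_id = ?", sensor_id)
        cursor.fetchone()
    conn.commit()
    conn.close()

threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```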

If we look at the difference between Actian Zen edge and Actian Zen enterprise (with its footprint under 200MB), we can see that most of the difference has to do with human end-user enablement. For example, Actian Zen enterprise includes an SQL editor that enables ad-hoc queries and other data management operations from a command line. While most of that same functionality resides in Zen edge, it is accessed and executed through API calls from an application rather than a CLI.

But Does Every IoT Edge Scenario Need a Server?

Those of you who have been following closely will now sit up and say, “Hey, wait: didn’t you say that not every IoT edge data management scenario needs a client-server architecture?”

Yes, I did. Props to you for paying attention. Not all scenarios do—but that’s not really the question you should be asking. The salient question is, do you really want to master one architecture, implementation, and vendor solution for those serverless use cases and separate architectures, implementations, and vendor solutions for the edge, cloud, and data center? And, from which direction do you approach this question?

Historically, the vast majority of data architects and developers have approached this question from the bottom up. That’s why we started with flat files and then moved to SQLite. Rather than looking from the bottom up, I’m arguing that we need to step back, embrace a new understanding of what client-server can be, and then revisit the question from the top down. Don’t just try to force-fit serverless into a world for which it was never intended—or worse, kluge up from serverless to a jury-rigged implementation of a late-20th-century server configuration.

That way madness lies, as we’ll see in the final installment of this series, where we’ll look at what happens if developers decide to use SQLite anyway.

Ready to reconsider SQLite? Learn more about Actian Zen, or just kick the tires for free with Zen Core, which is royalty-free for development and distribution.

About Actian Corporation

Actian empowers enterprises to confidently manage and govern data at scale. Actian data intelligence solutions help streamline complex data environments and accelerate the delivery of AI-ready data. Designed to be flexible, Actian solutions integrate seamlessly and perform reliably across on-premises, cloud, and hybrid environments. Learn more about Actian, the data division of HCLSoftware, at actian.com.
Data Management

SQLite’s Serverless Architecture Doesn’t Serve IoT Environments – Part 1

Actian Corporation

June 11, 2020

Part One: Mobile May Be IoT—But, When it Comes to Data, IoT is Not Mobile

Three weeks ago, we looked at the raw performance—or the lack thereof—of SQLite. After that, we looked at SQLite within the broader context of modern edge data management and discovered that its performance shortcomings were in fact compounded by the demands of the environment. As a serverless database, SQLite requires integration with a server-based database—which inevitably incurs a performance hit as the SQLite data is transformed through an ETL process for compatibility with the server-based database’s architecture.

SQLite partisans might then adopt a snarky tone and say: “Yeah? Well, if SQLite is so slow and integration is so burdensome, can you remind me why it is the most ubiquitous database out there?”

Well, yeah, we can. And in the same breath, we can provide even partisans with ample reason to doubt that the popularity of SQLite will continue going forward. Spoiler alert: What do the overall growth curves of the IoT look like outside the realm of mobile handsets and tablets?

How the Banana Slug Won the Race

In the first blog in this series, we looked at why embedded developers adopted SQLite over both simple file management systems at one end of the data management spectrum and large, complex RDBMSs at the other. The key technical reasons, just to recap, include its small footprint; its ability to be embedded in an application; its simple architecture (a key-value store) and its portability to almost any operating system and programming language; and its ability to deliver standard data management functionality through an SQL API. The key non-technical reason (okay, the reason) is that, well, it’s free! It found a home in use cases dominated by personal applications that needed built-in data management (including developer tools), web applications that needed a data cache, and mobile applications that needed something with a very small footprint. Combine free with these technical characteristics, consider where and how SQLite has been deployed, and it’s no surprise that, in terms of raw numbers, SQLite found itself more widely deployed than any other database.

What all three of the aforementioned use cases have in common, though, is that they are single-user scenarios in which the data associated with a user can be stored in a single file and data table (which, in SQLite, are one and the same). Demand for data in these use cases generally involves serial reads and writes; there’s little likelihood of concurrent reads, let alone concurrent writes. In fact, it wasn’t until later iterations of SQLite that the product’s developers even felt the need to enable simultaneous reads alongside a single write.
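
For reference, the concurrency that later SQLite releases added is write-ahead logging (WAL), enabled per database file with a pragma. The minimal sketch below (file and table names are illustrative) shows the switch; in WAL mode readers no longer block the single writer, but SQLite still permits only one writer at a time.

```python
import sqlite3

conn = sqlite3.connect("cache.db")                     # illustrative file name
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)                                            # prints "wal" once the switch succeeds

# Readers and the single writer no longer block one another, but concurrent
# write transactions still serialize: a second writer waits or fails with
# "database is locked", exactly as before.
conn.execute("CREATE TABLE IF NOT EXISTS cache (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("INSERT OR REPLACE INTO cache VALUES ('key', 'value')")
conn.commit()
conn.close()
```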

But here’s the thing: Going forward, those three use cases are not going to be the ones driving the key architectural decisions. Ironically, the characteristics of SQLite that made it so popular among developers helped give rise to a world in which billions of devices are acting, reacting, and interacting in real time at the edge, in the cloud, and in the data center. And that is a world for which those same key characteristics are singularly ill-suited.

SQLite has essentially worked itself out of a role in the realm of modern edge data management.

As we’ve mentioned earlier, SQLite is based on an elegant but simple key-value storage architecture that lets you store any type of data. It is implemented in C with a very small footprint of a few hundred KBs, making it portable to virtually any environment with minimal resourcing. And, while it’s not fully ANSI-standard SQL, it’s close enough for horseshoes, hand grenades, and mobile applications.

SQLite was adopted in many early IoT applications because these early design-ins were almost mirror images of mobile applications (minus the need for much effort at the presentation layer): they focused on local caching of data with the expectation that it would be moved to the cloud for data processing and analytics. Pilot projects built on the cheap meant that designers and developers defaulted to what they knew and what was free: ta-dah, SQLite!

Independent of SQLite, the IoT market and its use cases have rapidly moved off this initial trajectory. Clear proof is readily apparent if you’ve had the opportunity to go to IoT trade shows over the last few years. Three to five years ago, many of the sessions described proofs of concept (PoCs) and small pilots where all data was sent up into the cloud. When we spoke to engineers and developers on the trade show floor, they were skeptical about the need for anything more than SQLite, or about whether a database was needed at all, let alone a client-server version. In the last three years, however, more of the sessions have centered on scaling pilots up to full production and infusing ML routines into local devices and gateways. Many more of the conversations have involved considering more robust local data management, including client-server options.

Intelligent IoT is Redefining Edge Data Management

For all its strengths in the single-user application space, SQLite and its serverless architecture are unequal to the demands of autonomous vehicles, smart agriculture, medical instrumentation, and other industrial IoT spaces. The same is true with regard to the horizontal spaces occupied by key industrial IoT components, such as IoT gateways, 5G networking gear, and so forth. Unlike single-user applications designed to support human-to-machine requirements, innumerable IoT applications are being built for machine-to-machine relationships occurring in highly automated environments. Modern machine-to-machine scenarios involve far fewer one-to-one relationships and a far greater number of peer-to-peer and hierarchical relationships (including one-to-many and many-to-one subscription and publication scenarios), all of which have far more complex data management requirements than those for which SQLite was built. Moreover, as CPU power has migrated out of the data center into the cloud and now out to the edge, a far wider array of systems are performing complex software-defined operations, data processing, and analytics than ever before. Processing demands are becoming both far more sophisticated and far more local.

Consider: Tomorrow’s IoT sensor grids will run the gamut from low-speed, low-resolution structured data feeds (capturing tens of thousands of pressure, volume, and temperature readings, for example) to high-speed, high-resolution video feeds from hundreds of streaming UHD cameras. In a chemical processing plant, both sensor grids could be flowing into one or more IoT gateways that, in turn, could flow into a network of edge systems (each with the power one would only have found in a data center a few years ago) for local processing and analysis, after which some or all of the data and analytical information would be passed on to a network of servers in the cloud.

Dive deeper: The raw data streams flowing in from these grids would need to be read and processed in parallel. These activities could involve immediately discarding spurious data points, running signal-to-noise filters, normalizing data, or fusing data from multiple sensors, to name just a few of the obvious data processing functions. Some of the data would be stored as it arrived—either temporarily or permanently, as the use case demanded—while other data might be discarded.
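
To give a feel for the kind of in-stream work this implies, here is a simplified, hypothetical sketch of the first few steps: rejecting out-of-range readings, smoothing noise with an exponential moving average, and normalizing and fusing the surviving values. The sensor names, ranges, and smoothing factor are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    sensor: str
    value: float

# Invented plausibility limits for two illustrative sensor types.
VALID_RANGE = {"pressure": (0.0, 500.0), "temperature": (-40.0, 200.0)}

def is_plausible(r: Reading) -> bool:
    lo, hi = VALID_RANGE[r.sensor]
    return lo <= r.value <= hi

class EmaFilter:
    """Simple signal-to-noise filter: exponential moving average."""
    def __init__(self, alpha: float = 0.2):
        self.alpha, self.state = alpha, None

    def update(self, x: float) -> float:
        self.state = x if self.state is None else self.alpha * x + (1 - self.alpha) * self.state
        return self.state

def normalize(x: float, lo: float, hi: float) -> float:
    return (x - lo) / (hi - lo)

filters = {name: EmaFilter() for name in VALID_RANGE}

def process(batch):
    fused = {}
    for r in batch:
        if not is_plausible(r):                  # discard spurious points immediately
            continue
        smoothed = filters[r.sensor].update(r.value)
        lo, hi = VALID_RANGE[r.sensor]
        fused[r.sensor] = normalize(smoothed, lo, hi)
    return fused                                  # store locally and/or forward upstream

# The out-of-range temperature reading is dropped; the rest are smoothed and normalized.
print(process([Reading("pressure", 101.3), Reading("temperature", 999.0),
               Reading("temperature", 21.5)]))
```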

A World of Increasing Complexity

Throughout these scenarios we see far more complex operations taking place at every level, including ML inference routines being run locally on devices, at the gateway level, or both. There may be additional operations running in parallel on these same datasets—including downstream device monitoring and management operations, which effectively create new data streams moving in the opposite direction (e.g., reads from the IoT gateway and writes down the hierarchical ladder). Or data could be extracted simultaneously for reporting and analysis by business analysts and data scientists in the cloud or data center. In an environment such as the chemical plant we have envisioned, there may also be more advanced analytics and visualization activities performed at, say, a local operations center.

These scenarios are both increasingly commonplace and wholly unlike the scenarios that propelled SQLite to prominence. They are combinatorial and additive; they present a world of processing and data management demands that is as far from that of the single-user, single-application world—the sweet-spot for SQLite—as one can possibly get:

  • Concurrent writes are a requirement, and not just to a single file or data table—with response times between write requests of as little as a few milliseconds.
  • Multiple applications will be reading and writing data to the same data tables (or joining them) in IoT gateways and other edge devices, requiring the same kind of sophisticated orchestration that would be required with multiple concurrent users.
  • On-premises edge systems may have local human oversight of operations, and their activities will add further complexity to the orchestration of multiple activities reading and writing to the databases and data tables.

If all of this sounds like an environment for which SQLite is inadequately prepared, you’re right.  In parts two and three of this blog we’ll delve into these issues further.

Ready to reconsider SQLite? Learn more about Actian Zen, or just kick the tires for free with Zen Core, which is royalty-free for development and distribution.

Data Intelligence

Build Your Citizen Data Scientist Team

Actian Corporation

June 8, 2020

“There aren’t enough expert data scientists to meet data science and machine learning demands, hence the emergence of citizen data scientists. Data and analytics leaders must empower ‘citizens’ to scale efforts, or risk failure to secure data science as a core competency.” – Gartner, 2019

As data science provides competitive advantages for organizations, the demand for expert data scientists is at an all-time high. However, the supply remains far too scarce to meet that demand! This limitation is a threat to enterprises’ competitiveness and, in some cases, their survival in the market.

In response to this challenge, an important analytical role providing a bridge between data scientists and business functions was born: the citizen data scientist.

What is a Citizen Data Scientist?

Gartner defines the citizen data scientist as “an emerging set of capabilities and practices that allows users to extract predictive and prescriptive insights from data while not requiring them to be as skilled and technically sophisticated as expert data scientists”. A “Citizen Data Scientist” is not a job title. They are “power users” who can perform both simple and sophisticated analytical tasks.

Typically, citizen data scientists don’t have coding expertise but can nevertheless build models using drag-and-drop tools and run prebuilt data pipelines and models using tools such as Dataiku. Be aware: citizen data scientists do NOT replace expert data scientists. They bring their domain knowledge but do not have the specialized skills required for advanced data science.

The citizen data scientist is a role that has evolved as an “extension” from other roles within the organization! This means that organizations must develop a citizen data scientist persona. Potential citizen data scientists will vary based on their skills and interests in data science and machine learning. Roles that filter into the citizen data scientist category include:

  • Business Analysts.
  • BI Analysts/Developers.
  • Data Analysts.
  • Data Engineers.
  • Application Developers.
  • Business Line Managers.

How to Empower Citizen Data Scientists

As expert skills for data science initiatives tend to be quite expensive and difficult to come by, utilizing a citizen data scientist can be an effective way to close the current gap.

Here are ways you can empower your data science teams:

Break Enterprise Silos

As I’m sure you’ve heard many times before, many organizations tend to operate in independent silos. As mentioned above, all of these roles are important in an organization’s data management strategy, and they have all expressed interest in learning data science and machine learning skills. However, most data science and machine learning knowledge is siloed within the data science department or specific roles. As a result, data science efforts often go unvalidated and under-leveraged. This lack of collaboration between data roles makes it difficult for citizen data scientists to access and understand enterprise data!

Establishing a community of both business and IT roles that provides detailed guidelines and resources allows enterprises to empower citizen data scientists. It is important for organizations to encourage the sharing of data science efforts throughout the organization and, in doing so, break down silos.

Provide Augmented Data Analytics Technology

Technology is fueling the rise of the citizen data scientist. Traditional BI vendors, such as SAP, Microsoft, and Tableau Software, provide advanced statistical and predictive analytics as part of their offerings. Meanwhile, data science and machine learning platforms, such as SAS, H2O.ai, and TIBCO Software, provide users who lack advanced analytics skills with “augmented analytics”. Augmented analytics leverages automated machine learning to transform how analytics content is developed, consumed, and shared. It includes:

Augmented data preparation: Machine learning automation to augment data profiling and quality, modeling, enrichment and data cataloguing.

Augmented data discovery: Enables business and IT users to automatically find, visualize, and analyze relevant information, such as correlations, clusters, segments, and predictions, without having to build models or write algorithms.

Augmented data science and machine learning: Automates key aspects of advanced analytics modeling, such as feature selection, algorithm selection, and other time-consuming steps in the process.

By incorporating the necessary tools and solutions and extending resources and efforts, enterprises can empower citizen data scientists.

Empower Citizen Data Scientists With a Metadata Management Platform

Metadata management is an essential discipline for enterprises wishing to bolster innovation or regulatory compliance initiatives on their data assets. By implementing a metadata management strategy, where metadata is well-managed and correctly documented, citizen data scientists are able to easily find and retrieve relevant information from an intuitive platform.

Data Architecture

Actian Vector for Hadoop for Fuller SQL Functionality and Current Data

Actian Corporation

June 7, 2020

In this second of a three-part blog series (part 1), we’ll explain how SQL execution in Actian Vector in Hadoop (VectorH) is much more functional than the SQL on Hadoop alternatives and ready to run in an operational environment, and how VectorH’s ability to handle data updates efficiently can enable your production environment to stay current with the state of your business. In the first part of this three-part blog post, we showed the tremendous performance advantage VectorH has over other SQL on Hadoop alternatives. The third part will cover the advantages of the VectorH file format.

Better SQL Functionality for Business Productivity

One of the original barriers to getting value out of Hadoop is the need for MapReduce skills, which are rare and expensive, and take time to apply to a given analytical question. Those challenges led to the rise of many SQL on Hadoop alternatives, many of which are now projects in the Apache ecosystem for Hadoop. While those different projects open up access to the millions of business users already fluent in writing SQL queries, in many cases they require other tradeoffs: differences in syntax, limitations on certain functions and extensions, immature optimization technology, and inefficient implementations. Is there a better way to get SQL on Hadoop?

Yes! Actian VectorH 6.0 supports a much more complete implementation, with full ANSI SQL:2003 support, plus analytic extensions like CUBE, ROLLUP, GROUPING SETS, and WINDOWING for advanced analytics. Let’s look at the workload we evaluated in our SIGMOD paper, based on the 22 queries in the TPC-H benchmark.
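
To give a flavor of what those extensions look like in practice, here is the kind of single ANSI SQL:2003 statement they enable, combining GROUPING SETS with a window function over the TPC-H lineitem table. The query text is our own illustration (not taken from the paper) and is simply held in a Python string; submit it through whatever SQL interface you already use.

```python
# One query returns per-(returnflag, linestatus) totals, per-returnflag subtotals,
# and a grand total, plus a rank computed with a window function.
QUERY = """
SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity)                              AS total_qty,
    SUM(l_extendedprice * (1 - l_discount))      AS total_revenue,
    RANK() OVER (ORDER BY SUM(l_quantity) DESC)  AS qty_rank
FROM lineitem
WHERE l_shipdate <= DATE '1998-09-02'
GROUP BY GROUPING SETS (
    (l_returnflag, l_linestatus),   -- detailed rows
    (l_returnflag),                 -- subtotals per return flag
    ()                              -- grand total
)
"""
print(QUERY)
```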

Each of the other SQL on Hadoop alternatives had issues running the standard SQL queries that comprise the TPC-H benchmark, which means that business users who know SQL may have to make changes manually or suffer from poor results or even failed queries:

  • Apache Hive 1.2.1 couldn’t complete query number 5.
  • Performance for Cloudera Impala 2.3 is hindered by single-core joins and aggregation processing, creating bottlenecks for exploiting parallel processing resources.
  • Apache Drill 1.5 couldn’t complete query number 21, and only 9 of the queries ran without modification to their SQL code.
  • Since Apache Spark SQL version 1.5.2 supports only a limited subset of ANSI SQL, most queries had to be rewritten in Spark SQL to avoid IN/EXISTS/NOT EXISTS sub-queries (see the sketch after this list), and some queries required manual definition of join orders. VectorH, by contrast, has a mature query optimizer that will reorder joins based on cost metrics to improve performance and reduce I/O bandwidth requirements.
  • Apache HAWQ version 1.3.1 is based on PostgreSQL, so its older technology foundations can’t compete with the performance of a vectorized query engine.
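
To illustrate the kind of manual rewriting the Spark SQL limitation forced, here is a hedged example of our own (the tables and columns follow the TPC-H schema): a NOT EXISTS anti-join in its standard form, and the outer-join workaround a user would have had to write by hand. VectorH runs the first form directly.

```python
# Standard form: customers with no orders, expressed with NOT EXISTS.
STANDARD_SQL = """
SELECT c_custkey, c_name
FROM customer
WHERE NOT EXISTS (
    SELECT 1 FROM orders WHERE o_custkey = c_custkey
)
"""

# Manual rewrite for engines without EXISTS/NOT EXISTS support:
# an outer join that keeps only the unmatched customer rows.
REWRITTEN_SQL = """
SELECT c.c_custkey, c.c_name
FROM customer c
LEFT OUTER JOIN orders o ON o.o_custkey = c.c_custkey
WHERE o.o_custkey IS NULL
"""

print(STANDARD_SQL)
print(REWRITTEN_SQL)
```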

Efficient Updates for More Consistent View of the Business

Another barrier to Hadoop adoption is that HDFS is an append-only file system, which limits its ability to handle updates and deletes in place. Yet many business applications require updates to the data, putting the burden on the database management system to handle those changes. VectorH can receive and apply updates from transactional data sources to ensure that analytics are performed on the most current representation of your business, not on data from an hour ago, or yesterday, or the last batch load into your data warehouse.

  • As part of the ad hoc decision-support workload it represents, TPC-H requires that inserts and deletes be run as part of the benchmark: two refresh streams insert rows into and delete rows from the fact tables.
  • Four of the SQL on Hadoop alternatives do not support updates on HDFS: Impala, Drill, Spark SQL, and HAWQ. They would not be able to meet the requirements for a fully audited result.
  • The fifth, Hive, does support updates but incurs a significant performance penalty executing queries after handling the updates.
  • VectorH executed the updates more quickly than Hive. With its patent-pending Positional Delta Trees (a conceptual sketch appears at the end of this section), VectorH tracks inserts and deletes separately from the data blocks, maintaining full ACID compliance while preserving the same level of query performance (no penalty!).
  • Here is the summary data from our testing that shows the performance penalty on Hive while there is no impact on VectorH from executing updates (detailed data follows):
    • Inserts took 36% longer and deletes required 796% more time on Hive than on VectorH.

Query performance afterwards shows PDTs have no measurable overhead, compared to the 38% performance penalty on Hive:

  • The average speedup for VectorH over Hive increases from 229x before the refresh cycles to 331x after updates are applied, with a range of 23 to 1141 on individual queries.
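
Positional Delta Trees are patent-pending, engine-level machinery, but the general idea of positional deltas is easy to sketch: leave the base data untouched, record deletes by position and inserts in small side structures, and merge everything on the fly at scan time. The toy Python below illustrates only that concept, not VectorH’s actual implementation.

```python
class PositionalDeltaColumn:
    """Toy illustration of positional deltas: an immutable base column plus
    side structures recording deletions (by base position) and appended inserts."""

    def __init__(self, base):
        self.base = list(base)        # immutable base data (think: an HDFS block)
        self.deleted = set()          # base positions marked as deleted
        self.inserts = []             # newly inserted values

    def delete(self, position):
        self.deleted.add(position)    # no rewrite of the base data

    def insert(self, value):
        self.inserts.append(value)    # appended, again without touching the base

    def scan(self):
        # Merge base and deltas on the fly, so queries see the current state
        # even though the underlying storage is append-only.
        for pos, value in enumerate(self.base):
            if pos not in self.deleted:
                yield value
        yield from self.inserts


col = PositionalDeltaColumn([10, 20, 30, 40])
col.delete(1)               # logically remove the value 20
col.insert(50)              # logically append 50
print(list(col.scan()))     # [10, 30, 40, 50]
```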

Appendix: Detailed Query Execution Times

Data Management

Actian Shows Big Advantages Over SQL on Hadoop Alternatives

Actian Corporation

June 6, 2020

Imagine if reports that currently take many minutes to run in Hadoop could come back with results in seconds. Get answers to detailed questions about sales figures and customer trends in real time. Make revenue predictions based on up-to-date customer metrics across a spectrum of sources. Iterate more quickly by simulating different business decisions to achieve better outcomes. The Actian Vector for Hadoop analytics platform can deliver those improvements in your Hadoop big data environment.

Actian Vector for Hadoop has demonstrated one to three orders of magnitude better query performance in a comparison with other major SQL on Hadoop alternatives. In this first of a three-part blog series describing the results, we’ll show the astounding performance results and explain the factors that contribute to such a large advantage. Part two will cover the unique abilities Vector has to handle updates, and part three will go into the efficiencies of the Vector for Hadoop file format.

Actian performance engineering used the full set of 22 TPC-H queries to run unaudited benchmarks on several of the SQL on Hadoop solutions in the market, and the results may surprise you (but not us). Here is a quick summary:

These results have been published in an academic paper submitted to and presented at the International Conference on Management of Data (ACM SIGMOD). That paper goes into many of the technical reasons why Vector for Hadoop is able to achieve such a performance advantage – here is the short version:

  • Efficient, multi-core parallel and vectorized execution – Vector for Hadoop is designed to take advantage of the performance features in the Intel CPU architecture, including the AVX2 vector instruction set and large, multi-layer caches.
  • Well-tuned query optimizer – Vector for Hadoop extends the mature optimizer from its original SMP version to exploit the multiple levels of parallelism and advantages of data locality in an MPP Hadoop system. The Vector for Hadoop optimizer can change the join order or partition data tables to improve parallel operations, steps that have to be done manually for queries in the other alternatives.
  • Control over HDFS block locality – since Vector for Hadoop operates natively within HDFS and YARN, it can participate in resource management and make allocation decisions in the context of the larger cluster workload. At the same time, specific table storage optimizations reduce overhead, accelerate reads, maximize disk efficiency, and reduce data skew to help deliver faster query results.
  • Effective I/O filtering – tracking the range of values in a column (MinMax) allows skipping the reading of blocks that fall outside the range of the query (see the sketch after this list), reducing disk I/O and read delays and avoiding decompression computations, sometimes significantly.
  • Lightweight compression – Vector’s compression achieves good levels of compaction at high speed, enabling faster vectorized execution by minimizing branches and instruction counts. Our compression algorithms are capable of running fully in CPU cache, effectively increasing memory bandwidth. Different compression algorithms are tailored to the various data types, and Vector automatically calibrates and chooses among them to reach higher levels of compression and efficiency than general-purpose compression algorithms.
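
The I/O-filtering idea in the list above (often called min-max or zone-map skipping) is simple to sketch: keep each block’s minimum and maximum value in memory and read a block only if its range can intersect the query predicate. The sketch below is a generic illustration of the technique, not VectorH’s storage code; the block statistics are invented.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BlockStats:
    block_id: int
    min_val: int
    max_val: int

def blocks_to_read(stats: List[BlockStats], lo: int, hi: int) -> List[int]:
    """Return only the blocks whose [min, max] range overlaps [lo, hi];
    every other block is skipped without any disk I/O or decompression."""
    return [s.block_id for s in stats if s.max_val >= lo and s.min_val <= hi]

# In-memory MinMax metadata for four on-disk blocks (values are invented).
stats = [
    BlockStats(0, min_val=1,   max_val=90),
    BlockStats(1, min_val=91,  max_val=250),
    BlockStats(2, min_val=251, max_val=400),
    BlockStats(3, min_val=401, max_val=999),
]

# Predicate: WHERE value BETWEEN 100 AND 300 -> only blocks 1 and 2 are read.
print(blocks_to_read(stats, 100, 300))
```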

How was the testing conducted?

  • Actian performance engineering built a 10-node Hadoop cluster, each node 2xIntel 3.0GHz E5-2690v2 CPUs, 256GB RAM, 24x600GB HDD, 10Gb Ethernet, Hadoop 2.6.0. There was one name node and nine SQL-on-Hadoop nodes, set up using Cloudera Express 5.5.
  • These tests were conducted in early 2016, running the then-most-current release of each of the SQL on Hadoop alternatives (Actian Vector for Hadoop 4.2.2, Apache Hive 1.2.1, Cloudera Impala 2.3, Apache Drill 1.5, Apache Spark SQL 1.5.2, and Pivotal HAWQ 1.3.1). Reasonable efforts were made to tune each platform to make fair comparisons.

Here are the actual individual query execution times and the speed-up factor for Vector for Hadoop versus each of the alternatives:

In part two of this blog series, we will cover the advantages Vector for Hadoop 6.0 delivers in SQL functionality and data updates capability compared to the other alternatives, and part three will show the benefits of the Vector file format for faster query performance and lower storage requirements.

Data Architecture

Actian Vector for Hadoop File Format is Faster and More Efficient

Actian Corporation

June 5, 2020

In this third and last part of the series on Actian Vector in Hadoop (VectorH), we will cover how the VectorH file format supports the performance and efficiency of our data analytics platform to accelerate business insights, as well as some of the other enterprise features that can help businesses move their Hadoop applications into production. Part one of this series showed the huge performance advantages VectorH has over other SQL on Hadoop alternatives, while part two explored the benefits of the richer implementation of SQL and the ability to perform data updates in VectorH.

The file format for VectorH is one of the key contributors to its industry-leading performance. Having a columnar orientation allows VectorH to choose compression techniques optimized by data type, and VectorH can use various measures described in the SIGMOD paper to employ storage and I/O bandwidth more efficiently. In some simple benchmarks described in the same paper, we compared VectorH to the speed and efficiency of other query engines (such as Impala and Presto) and other file formats (like Parquet and ORC). Three observations become clear from the benchmark results:

  • VectorH handles queries much faster than the other alternatives when the data is already in memory, from 26x to over 110x faster, primarily due to the efficiency of decompression using vectorized processing. The chart below shows query times for each of the alternatives and how they vary depending on the percentage of data selected out of the entire set of tables. VectorH and Presto avoid processing data outside the selected range, while Impala does not and performs much worse in the 10% and 30% cases.
[Chart: query times for each alternative by percentage of data selected]

  • VectorH is also significantly faster when data hasn’t yet been loaded into memory. VectorH reduces the amount of I/O required for data residing on disk by using I/O filtering, where MinMax indexes in memory allow skipping read operations for blocks on disk with no data in the range selected. The chart shown below (similar to above) reflects the percentage of data in the range selected, and only VectorH shows significant savings from read operations as less data fits the selection criteria. Although some other formats also have range information, it is stored as metadata inside the data blocks. Every block still needs to be read at least partly before deciding whether the data is relevant. VectorH performed significantly less I/O, from 20% to 98% less, compared to Impala and Presto.

[Chart: read I/O for each alternative by percentage of data in the selected range]

  • VectorH has the most effective compression across a variety of data types, requiring only 11GB of storage compared to 18GB for Parquet and 19GB for ORC, a savings of 39-42% (a simple compression sketch follows the chart below). Imagine the savings over a multi-petabyte data store!

[Chart: storage required by VectorH, Parquet, and ORC across a variety of data types]
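
As a rough illustration of what lightweight, type-aware compression can mean (VectorH’s actual schemes are more sophisticated and automatically calibrated), here is run-length encoding, a classic lightweight technique that works especially well on sorted, low-cardinality columns such as flags or dates.

```python
from typing import Iterable, List, Tuple

def rle_encode(values: Iterable[str]) -> List[Tuple[str, int]]:
    """Run-length encoding: store each value once, together with its repeat count."""
    runs: List[Tuple[str, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def rle_decode(runs: List[Tuple[str, int]]) -> List[str]:
    # Decoding is branch-light and cheap, which suits vectorized execution.
    return [v for v, count in runs for _ in range(count)]

column = ["A"] * 5000 + ["N"] * 3000 + ["R"] * 2000   # e.g., a sorted return-flag column
encoded = rle_encode(column)
print(encoded)                                         # [('A', 5000), ('N', 3000), ('R', 2000)]
assert rle_decode(encoded) == column
```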

Additional advantages for VectorH that contribute to deploying successful analytics solutions:

  • Spark integration is an example of Actian’s continuing commitment to incorporating open interfaces and frameworks directly into the VectorH solution.
    • Actian VectorH 6.0 integrates with the latest Hadoop distributions and can be deployed both on-premises and in the cloud, e.g., Microsoft Azure HDInsight.
    • Actian VectorH 6.0 supports multiple file systems as well as multiple data formats (Parquet, ORC, CSV, and many others through the Spark connector).
    • Users can execute queries in VectorH on data stored in any file format supported by Spark by leveraging the Spark connector. This is fully transparent to the user: full ANSI SQL can be used to query data in any file format without even knowing about the existence of Spark.
    • With the Spark connector, data stored in VectorH can be processed in Spark through the use of Dataframes or Spark SQL. Any Spark operation can be performed on data backed by a VectorH table.
  • Overall, Actian provides more complete enterprise-grade functionality to support moving analytics applications from development into a production environment.
    • Role- and row-based security is built into VectorH, providing the access control needed to support privacy policies and regulatory requirements.
    • Actian Director provides a web-based tool for monitoring and managing VectorH and cluster resources.
    • Actian Management Console automates provisioning, deploying, and monitoring analytics in the cloud, making it quicker and easier to get your new project started.

This three-part blog series (see parts one and two) shows how Actian provides customers with the performance, flexibility and support needed when integrating with other big data technologies to deliver faster and richer insights to make better business decisions.

Data Intelligence

Why are Your Data Scientists Leaving Your Enterprise?

Actian Corporation

May 29, 2020

In 2019, the Data Scientist was named the most promising job by LinkedIn. From Fortune 500 companies to small enterprises all around the world, building a team of data science professionals was a priority in business strategies. Underscoring that trend, 2019 broke all records for AI and data science investment.

Despite all of these positive trends, Data Scientists are quitting and changing companies at a rapid pace. How come? Let’s analyze the situation.

They Don’t Spend Their Time Doing What They Were Hired For

Unfortunately, many companies that hire data scientists do not have a suitable AI infrastructure in place. Surveys still suggest that roughly 80% of data scientists’ time is spent on cleaning, organizing, and finding data (instead of analyzing it), which is one of the last things they want to spend their time doing. In their article “How We Improved Data Discovery for Data Scientists at Spotify”, Spotify explains how, in the beginning, their “datasets lacked clear ownership or documentation, making it difficult for data scientists to find them.” Even data scientists working for Web Giants have felt frustration in their data journey.

Most data scientists end up leaving their companies because they spend their days filtering the trash in their data environments. Having clean and well-documented data is key for your data scientists to not only better find, discover, and understand the company’s data but also save time on tedious tasks and produce actionable insights.

Business and Data Science Goals are not Aligned

With all the hype around AI and Machine Learning, executives and investors want to showcase their data science projects at the forefront of the latest technological advances. They often hire AI and data experts thinking that they will reach their business objectives in half the time. However, this is rarely the case. Data science projects typically involve a lot of experimentation, trial-and-error methods, and iterations of the same process before reaching the outcome.

A lot of companies increase their hiring of data specialists in order to increase the research and insight production across their company. However, this research often only has a “local impact” in specific parts of the enterprise, going unseen by other departments that might find it useful in their decision-making.

It is therefore important for both parties to work together effectively and efficiently by establishing solid communication. Aligning business objectives with data science objectives is key to not losing your data scientists. By using a DataOps approach, data scientists are able to work in an agile, collaborative, and change-friendly environment that promotes communication between the business and IT departments.

They Struggle to Understand & Contextualize Data at Enterprise Level

Most organizations have numerous complex solutions in place, usually misunderstood by the majority of the enterprise, which makes it difficult to train new data science employees. Without a single centralized solution, data scientists find themselves wading through various technologies, losing sight of which data is useful, up-to-date, and of sufficient quality for their use cases.

This lack of visibility into data is frustrating to data scientists who, as mentioned above, spend the majority of their time looking for data in multiple tools and sources.

By putting in place a single source of truth, data science experts are able to view their enterprise data in one place and produce data-driven insights.

Accelerate Your Data Scientists’ Work With a Metadata Management Solution

Metadata management is an essential discipline for enterprises wishing to bolster innovation or regulatory compliance initiatives on their data assets. By implementing a metadata management strategy, where metadata is well-managed and correctly documented, data scientists are able to easily find and retrieve relevant information from an intuitive platform. Empower your data science teams by providing them with the right tools, ones that enable them to create new machine learning algorithms for their data projects and, thus, value for your enterprise.

Data Intelligence

How a Business Glossary Empowers Your Data Scientists

Actian Corporation

May 26, 2020

In the data world, a business glossary is a sacred text that represents long hours of hard work and collaboration between the IT & business departments. In metadata management, it is a crucial part of delivering business value from data. According to Gartner, it is one of the most important solutions to put in place in an enterprise to support business objectives.

A business glossary provides clear meanings and context for any company data or business term to help your data scientists with their machine learning algorithms and data initiatives.

Back to Basics: What is a Business Glossary?

A business glossary is a place where business and/or data terms are defined and made accessible throughout the organization. As simple as this may sound, it addresses a common problem: not all employees agree on or share a common understanding of even basic terms such as “contact” or “customer.”

Its main objectives, among others, are to:

  • Use the same definitions and create a common language between all employees.
  • Have a better understanding and collaboration between business and IT teams.
  • Associate business terms to other assets in the enterprise and offer an overview of their different connections.
  • Elaborate and share a set of rules regarding data governance.

Organizations are therefore able to treat information as a second language.

How Does a Business Glossary Benefit Your Data Scientists?

Centralized business information allows an organization to share what is essentially tribal knowledge around its enterprise data. In fact, it allows data scientists to make better decisions when choosing which datasets to use. It also enables:

A Data Literate Organization

Gartner predicts that by 2023, data literacy will become an explicit and necessary driver of business value, demonstrated by its formal inclusion in over 80% of data and analytics strategies and change management programs. Increasingly, organizations are realizing this and beginning to look at data and analytics in a new way.

As part of the Chief Data Officer’s job description, it is essential that all parts of the organization can understand data and business jargon. A business glossary helps all parts of the organization better understand the meaning, context, and usage of data. By putting a business glossary in place, data scientists are able to collaborate efficiently with all departments in the company, whether IT or business. There are fewer communication errors, and data scientists can participate in building and improving knowledge of the enterprise’s data assets.

The Implementation of a Data Culture

Closely related to data literacy, data culture refers to a workplace environment where decisions are made on the basis of empirical data. In other words, executives make decisions based on data evidence, not just on instinct.

A business glossary promotes data quality awareness and an overall understanding of data. As a result, the environment becomes more data-driven. Furthermore, business glossaries can help data scientists gain better visibility into their data.

An Increase in Trusting Data

A business glossary ensures that the right definitions are used effectively for the right data. It will assist with general problem solving when data misunderstandings are identified. When all datasets are correctly documented with the correct terminology that is understood by all, it increases overall trust in enterprise data, allowing data scientists to efficiently work on their data projects.

Data scientists spend less of their time cleaning and organizing data and more of it producing valuable insights that maximize business value.

Implement a Business Glossary

The Actian Data Intelligence Platform provides a business glossary within our data catalog. The business glossary automatically connects to and imports your existing glossaries and dictionaries through our APIs. You can also manually create a glossary within the Actian Data Intelligence Platform’s interface.

Check out the benefits our business glossary offers your data scientists.

Data Analytics

Real-Time Decision-Making Use Cases in the Retail Industry – Part 3

Actian Corporation

May 26, 2020

In the first part of my blog series on Real-Time Decision-Making (RTDM) highlighting retail industry use cases, we discussed how combining existing historical data patterns with disparate new sources of data completes the Common Operational Picture (COP). To illustrate the use case, we looked at an Actian customer, Kiabi, and how they used RTDM capabilities strategically to enhance their customer loyalty program.

In the second part, we explored how different roles and responsibilities can use the COP in business-as-usual scenarios versus periods of market disruption, using another Actian retail customer, LeRoy Merlin. LeRoy Merlin wanted to extend data-driven decision-making down to managers in their retail locations across Asia, Europe, Latin America, and Africa, with an emphasis on sales performance data by product across a range of factors.

The key point I was hoping readers would take away was that in times of market uncertainty, enhanced customer focus requires business agility that can only be achieved by proper use of the COP to deliver situational awareness based on role and proximity to the point of action. The further downstream you’re able to push the analysis and decisions, the better the results – provided you can balance speed and accuracy.

It Takes Two to Make a Thing Go Right

OK, it’s an old party song stuck in my head, but the sentiment is spot on. One major theme of leveraging the COP to deliver more accurate, fresh intelligence down to frontline decision-makers on the ground is to ensure the focus on customers keeps them happy and satisfies their needs. But it’s also about satisfying the needs of the business. With Kiabi, they wanted to make sure their loyal customers got exactly what they wanted by monitoring their buying behaviors to predict what marketing programs would incentivize loyalty program members to buy more apparel. With LeRoy Merlin, they wanted to empower their store managers to understand locally which inventory was moving and which was just sitting on the shelves to determine how to improve sales performance in that individual store.

Across both these use cases, we’re leveraging RTDM intelligence to drive existing programs and operations to yield better business outcomes. Satisfaction for customers is measured in many ways, with the most critical from a business standpoint being repeat, profitable business. During periods of market uncertainty, demand fluctuates, and so does the cost of goods and services. Organizations must ensure they can serve their constituents (customers for businesses, patients in healthcare, students in education) in a way that avoids unexpected costs or risk. In other words, profitability must be maintained. Even for non-profits, operational expenses must be covered for the mid-to-long term. In summary, the thing that needs to go right is the relationship on both sides, for the customer and for the business.

Balancing Customer Response and Risk

For several years now, we’ve been supporting The AA, the leading provider of roadside assistance services in the UK. In addition to roadside services, The AA uses independent insurance brokers who work with a group of AA underwriters to offer a range of vehicle and home insurance policies. Actian has helped The AA with its RTDM capability. The AA uses the Actian hybrid database solution to analyze insurance applicant-supplied data against third party data and verification services to assess risk that is critical to quoting a competitive yet profitable policy.

In this case, the COP consists of internal but fluid actuarial data; fraud detection data and models; external sources used to verify applicants’ prior driving records and claims; relevant demographic data such as traffic accident and property crime rates by location; and so forth. Decision-making is pushed down to the frontline underwriters, in that they assign risk, and to the independent brokers, in that they provide the quotes. However, their roles essentially form the feedback loop on a risk assessment and quoting process that is automated, and the interaction with prospective policyholders takes place on insurance comparison websites like GoCompare.com and CompareTheMarket.com. Prospect expectations and competitive table stakes dictate that all quotes be delivered side-by-side in under a second.

The Actian Data Platform was selected to support the risk assessment and quoting operation because of performance requirements in two separate areas.

  1. The speed of collecting the information internally and externally to generate quotes in one second or less.
  2. The speed at which fresh data can be visualized for the underwriters in Looker, enabling them to tweak risk-based decisions based on the competitive landscape.

Real-time is in the “Eye of the Beholder”

In The AA’s use case, speed and accuracy are both important. Driving record data changes all the time. There is no point in underwriting on a clean driving record based on yesterday’s data when today’s data says the 16-year-old daughter just got her learner’s permit. In other words, if you were to use a cube to retrieve the data in order to meet performance requirements, your data may be stale depending on the business requirements. Speed is really only one part of real-time; the other part is the freshness of the data.

Speed and freshness together define real-time. The requirement for The AA was one second; for LeRoy Merlin, it’s daily. For many businesses, the real-time requirement is weekly. For example, a grocery chain may review sales at each supermarket weekly as part of a regular resupply process: the speed requirement may be an hour or less to populate the data across all stocked items, as long as the stock data is ready before store managers show up for work Monday morning. In this scenario, stock data doesn’t need to be updated every hour, perhaps only once per week.

During periods of market uncertainty, the speed requirement, the accuracy requirement, or both may need to change. Take the grocery chain’s weekly refresh of stock data: a weekly cube may be too coarse when panic buying hits everything from pasta to peanut butter and rolls across different products by the day. At that point, your need for fresh data changes your real-time requirement from weekly to daily, and the time allowed for data collection, analysis, and visualization may drop from an hour to minutes.
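
To make the speed-versus-freshness distinction concrete, here is a minimal sketch of a freshness check a reporting job might run before trusting a stock snapshot. The thresholds are illustrative assumptions: weekly for business as usual, daily when demand is being disrupted.

    # A minimal sketch of treating real-time as speed plus freshness.
    # The refresh thresholds are illustrative, not a recommendation.
    from datetime import datetime, timedelta

    FRESHNESS_LIMITS = {
        "business_as_usual": timedelta(days=7),   # weekly resupply cycle
        "market_disruption": timedelta(days=1),   # panic buying: tighten to daily
    }

    def is_fresh(last_loaded_at: datetime, mode: str = "business_as_usual") -> bool:
        """Return True if the stock snapshot is still usable for decisions."""
        return datetime.utcnow() - last_loaded_at <= FRESHNESS_LIMITS[mode]

    last_load = datetime.utcnow() - timedelta(days=3)
    print(is_fresh(last_load))                        # True under a weekly requirement
    print(is_fresh(last_load, "market_disruption"))   # False once the requirement is daily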

In The AA’s case, the real-time requirements for speed and accuracy are already set, and the business-as-usual RTDM capability translates easily to scenarios where business disruption is taking place. For many organizations, this is not the case, and the real question is: how do you ascertain your speed and accuracy requirements for periods of market uncertainty? In the next blog in this series, we’ll look at what’s needed for speed and accuracy … on a budget. Until then, find out how the Actian Real-Time Connected Data Warehouse can help you achieve your RTDM goals.

Data Analytics

Real-Time Decision-Making Use Cases in the Retail Industry – Part 2

Actian Corporation

May 18, 2020


In my last blog on Real-Time Decision-Making (RTDM), I shifted from a theoretical discussion of what RTDM is to use cases that highlight key aspects of the value it can bring to business in periods of market uncertainty. I’m using the retail industry because it’s an industry everyone knows something about, and because it’s one of the sectors most impacted by our current business disruptor, COVID-19.

Further, I used Kiabi, a global retailer headquartered in France, as the use case for exploring the importance of combining existing historical data patterns with disparate new sources of data to generate RTDM insights for a strategic capability – in this case, Kiabi’s customer loyalty program. The key point I wanted you to take away from that example is that in times of market uncertainty, enhanced customer focus requires business agility that can only be achieved with a complete, common operational picture – and that requires additional pieces of the data puzzle.

Democratizing the Common Operational Picture

I briefly mentioned that the Common Operational Picture, or COP, has several stakeholders and users who may need to leverage it in different ways suited to their specific roles, responsibilities, and evolving needs during periods of instability. Making current and accurate information available to and between parties is critical and can make all the difference in service. Take, for example, the old days when you needed a taxi: you’d call the dispatch center, the center would send you a cab, and the dispatcher would call you when the cab arrived.

If you got lost and couldn’t make it to the pick-up point, they wouldn’t know. If the cabby got lost, you wouldn’t know. If you were running late, they wouldn’t know and would just leave. Then along came Uber and Lyft, and now you and the driver know precisely where each other is at all times, with multiple means of immediate communication – the middleman is the app, and it delivers the COP. In other words, you have different roles – driver and fare-paying passenger – but you share a COP and a means of leveraging it to make decisions and act on them independently of any third party, in a peer-to-peer fashion.

In modern, integrated online and brick-and-mortar retail, you want a COP around stock availability and location (specific stores and online) for your customer-facing employees – online, in call centers, at stores – and for your customers. Under normal circumstances, it would be nice to know before making a trip to a store whether the item you want is in stock. Now it’s critical, as we try to limit trips and their duration to reduce the chance of spreading COVID-19. It’s also a less life-threatening but still critical business requirement: you don’t want to frustrate customers who will remain dissatisfied long after this period passes.

Decentralization of Decision-Making

In periods of market uncertainty, the issues may be more complicated than whether something is in or out of stock, and historical data based on business-as-usual stock depletion and reordering patterns will only get you so far. You may need visibility into daily or even hourly buying patterns across your stores – and not just at the corporate level. As an individual store manager, you may even need visibility into your vendors’ stock and the ability to order from them or trade with other stores.

Actian helped another retailer, LeRoy Merlin, build out an RTDM strategic capability targeting its individual stores. The capability they wanted to enhance was the ability to monitor sales performance as a function of product SKU, time of purchase and quantity, stock on hand, and several other factors. LeRoy Merlin is a do-it-yourself home improvement retailer with over 400 stores in 12 countries across Europe, Asia, Africa, and South America. Their enterprise data warehouse (EDW) was simply too slow to provide analytics services to high volumes of concurrent interactive users across tens of thousands of products. With Actian Vector (the engine inside the Actian Cloud Data Warehouse), LeRoy Merlin was able to perform daily uploads from its EDW and support real-time ad hoc queries and reporting for up to 2,000 interactive users. The intelligence from the Actian solution enables individual store managers to determine what’s selling and what’s sitting on the shelves so that they can adjust stock – critical during periods of rapid change in demand, as we’ve seen with COVID-19.
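
As a rough illustration of the kind of store-level, ad hoc question this enables, here is a minimal sketch of a “slow movers” query issued over ODBC. The table, columns, DSN, and thresholds are hypothetical, and the SQL is written in a generic dialect rather than copied from LeRoy Merlin’s schema.

    # A minimal sketch of a store-level "what's sitting on the shelves" query.
    # The DSN, table, columns, and thresholds are placeholders for illustration.
    import pyodbc

    SLOW_MOVER_SQL = """
    SELECT sku,
           SUM(quantity_sold)  AS sold_last_7_days,
           MAX(stock_on_hand)  AS on_hand
    FROM   daily_store_sales
    WHERE  store_id = ? AND sale_date >= ?
    GROUP  BY sku
    HAVING SUM(quantity_sold) < 5 AND MAX(stock_on_hand) > 50
    ORDER  BY on_hand DESC
    """

    def slow_movers(store_id: int, since_date: str):
        # Placeholder ODBC DSN pointing at the analytics database.
        with pyodbc.connect("DSN=analytics_dw") as conn:
            return conn.cursor().execute(SLOW_MOVER_SQL, store_id, since_date).fetchall()

    # Example: items in store 42 that sold fewer than 5 units last week but have 50+ on hand.
    # print(slow_movers(42, "2020-05-11"))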

And it doesn’t have to be COVID-19. DIY retailers in the US, for example, can tell where they will run out of generators based on hurricane trajectory forecasts. The key point is this: instead of a small group of decision-makers at HQ determining what stock should be ordered and sent where on a quarterly or annual basis – or scaling that team up to shrink the hindsight window to months or weeks – the RTDM capability gives LeRoy Merlin the ability to do it daily, at the individual store manager level, decentralizing decision-making to improve speed, accuracy, and business agility.

Dynamic Pricing and Dynamic Rationing

As was the case with the RTDM strategic capability Actian supported for Kiabi, the LeRoy Merlin solution can be used for more than the business agility to adjust stock and improve sales performance. It’s not uncommon for store managers to have the latitude to discount items that aren’t moving off the shelves. For this solution, the business-as-usual requirement was speed: reports and queries couldn’t be returned in real time to so many people at the same time. It was not an accuracy issue – that is, not a data-freshness issue. The sales performance data the Actian solution provides would certainly help managers make the right discounting decisions, but the data needed to decide what to discount or reorder is probably not stale even if it’s refreshed weekly or monthly.

In a period of market uncertainty that generates a rapid change in demand, you still need the real-time response – the speed – but you also need accuracy: current data. With panicked customers, you may need to reorder immediately or face an empty shelf. And before your shelves empty of existing stock, you may need to set limits on purchase quantities, adjust those limits daily, and let customers know before they make a trip to the store. Those limits may need to be different at different stores, again leveraging those closest to the transaction to make the decision.
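
As a rough sketch of how such limits might be derived, the function below caps per-customer purchases so that the remaining stock is expected to last until the next delivery. The inputs and thresholds are illustrative assumptions, not a recommendation or an Actian feature.

    # A minimal sketch of dynamic rationing: derive a per-store daily purchase limit
    # from recent demand and stock on hand. Thresholds are illustrative.
    def daily_purchase_limit(units_sold_yesterday: int, stock_on_hand: int,
                             days_until_next_delivery: int) -> int:
        """Cap per-customer purchases so remaining stock lasts until resupply."""
        if units_sold_yesterday == 0:
            return 10  # no demand pressure: generous default cap
        expected_customers = max(units_sold_yesterday, 1)
        units_available_per_day = stock_on_hand / max(days_until_next_delivery, 1)
        limit = max(1, int(units_available_per_day / expected_customers))
        return min(limit, 10)

    # Example: 400 units left, 200 sold yesterday, next delivery in 2 days -> limit of 1 per customer.
    print(daily_purchase_limit(200, 400, 2))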

Capability Reuse

You may even look to leverage the RTDM capability you’ve built across multiple core business processes. Everything we’ve just discussed could be combined with your loyalty program, and you could send out notifications stating when new deliveries of toilet paper will arrive and what the purchasing quota will be, setting expectations in advance.

Finally, as I mentioned last time, RTDM capabilities are needed in almost any industry, and virtually all industries are impacted in some way by business disruptions. In the next blog, we’ll start to gradually shift to a discussion of use cases in other industries and what it takes to build a world-class RTDM capability.

In the meantime, learn more about RTDM and Actian here.

Data Intelligence

Data Culture: 5 Steps for Your Enterprise to Acculturate to Data

Actian Corporation

May 18, 2020


Exploding quantities of data have the potential to fuel innovation and produce more value for organizations. Stimulated by the hopes of satisfying customers, enterprises have, for the past decade or so, invested in technologies and paid handsomely for analytical talent. Yet, for many, data-driven culture remains elusive, and data is rarely used as the basis for decision-making.

The reason is that the challenges of becoming data-driven aren’t technical but rather cultural. Describing how to inject data into decision-making processes is far easier than shifting an entire organization’s mindset. In this article, we describe five ways to help enterprises create and sustain data culture at its core.

By 2023, data literacy will become an explicit and necessary driver of business value, demonstrated by its formal inclusion in over 80% of data and analytics strategies and change management programs.

What is Data Culture?

“Data culture” is a relatively new concept that is becoming increasingly important to put in place, especially for organizations developing their digital and data management strategies. Just like organizational or corporate culture, data culture refers to a workplace environment where decisions are made on the basis of empirical evidence from data. In other words, executives make decisions based on data, not just on instinct.

Data culture gives organizations more power to organize, operate, predict, and create value with their data.

Here are our five tips for creating and sustaining data culture:

Step 1: Align With Business Objectives

“The fundamental objective of collecting, analyzing, and deploying data is to make better decisions.” (McKinsey)

Trusting your data is one of the most important factors in building a data culture; distrust in data is corrosive to organizational culture. And for data to be trusted, it must align with business objectives. To drive strategic and cultural change, the enterprise must agree on common business goals, as well as on the relevant metrics for measuring achievements and failures across the entire organization.

Ask yourself the right questions: How can we not only get ahead of our competitors, but also maintain the lead? What data would we need to decide what our next product offering should be? How is our product performing in the market? By introducing data into your business decision-making process, your enterprise will have taken the first step toward building a data culture.

Step 2: Destroy Data Silos

In this case, data silos refer to departments, groups, or individuals who act as the guardians of data and who don’t share, or don’t know how to share, data knowledge with other parts of the enterprise. When crucial information is locked away and available to only a few connoisseurs, it prevents your company from developing a cross-departmental data culture. It is also problematic from a technical standpoint: multiple data pipelines are harder to monitor and maintain, which leads to data being stale and obsolete by the time anyone uses it for decision-making.

To break data silos, enterprises must put in place a single source of truth. Empower employees to make data-driven decisions by relying on a centralized solution. A data catalog enables both technical and non-technical users to understand and trust in the enterprise’s data assets.

Step 3: Hire Data-Driven People

When building a data culture, it’s important to hire data-driven people. Enterprises are reorganizing themselves, forcing the creation of new roles to support this organizational change:

Data Stewards

Data Stewards orchestrate an enterprise’s data systems. Often called the “masters of data,” they hold both technical and business knowledge of the data. Their main mission is to ensure that data is properly documented and to facilitate its availability to users such as data scientists and project managers.

The profession is on the rise. Their cross-cutting role allows data stewards to work with both technical and business departments. They are the first point of reference for data in the enterprise and serve as the entry point for accessing it.

Chief Data Officers

Chief Data Officers, or CDOs for short, play a key role in the enterprise’s data strategy. They are in charge of improving the organization’s overall efficiency and its capacity to create value from data. At first, the CDO’s mission was to convince the organization of the value of exploiting its data, a mission often supported in the early years by building a data environment adapted to new uses, typically in the form of a data lake or data mart. But with the exponential growth of data, the role of the CDO has taken on a new scope. CDOs must now rethink the organization in a cross-functional, global way. They must become the new leaders of data democracy.

To win support for data initiatives from all employees, CDOs must not only help them understand the data (its original context, how it is produced, etc.) but also help them invest in the strategy for producing and exploiting it.

Step 4: Don’t Neglect Your Metadata

When data is created, so is metadata (its origin, format, type, etc.). However, this information alone is not enough to properly manage data in an expanding digital era; data managers must invest time in making sure this business asset is properly named, tagged, stored, and archived in a taxonomy consistent with all of the enterprise’s other assets.

This metadata allows enterprises to ensure greater data quality and discoverability, and allows data teams to better understand their data. Without metadata, enterprises find themselves with datasets that lack context, and data without context has little value.
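
As a simple illustration, here is a minimal sketch of the kind of metadata record a team might keep alongside a dataset so it never loses its context. The field names are illustrative and do not refer to any specific catalog product.

    # A minimal sketch of a dataset metadata record. Field names are illustrative.
    from dataclasses import dataclass, field
    from datetime import date
    from typing import List

    @dataclass
    class DatasetMetadata:
        name: str
        owner: str
        origin: str                       # source system or pipeline
        format: str                       # e.g. table, csv, parquet
        created_on: date
        tags: List[str] = field(default_factory=list)
        retention_days: int = 365

    orders = DatasetMetadata(
        name="orders_daily",
        owner="sales-analytics",
        origin="edw.sales.orders",
        format="table",
        created_on=date(2020, 5, 18),
        tags=["sales", "pii:none", "refresh:daily"],
    )
    print(orders)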

Step 5: Respect the Various Data Regulations

If you’re in Europe, this is old news by now. With the GDPR in force since May 2018, and with various other regulations gradually emerging in the United States, the UK, and Japan, it is important for enterprises to respect and follow these guidelines in order to stay compliant.

Implementing data governance is a way to ensure personal data privacy, data security, and risk management. It is a set of practices, policies, standards, and guides that supplies a solid foundation for managing data properly and, in doing so, creating value within the organization.

Step 6 (BONUS TIP): Choose the Right Solutions

Metadata management is the new black: it is an emerging discipline, necessary for enterprises wishing to bolster innovation or regulatory compliance initiatives on their data assets. A metadata management solution offers enterprises a centralized platform to empower all data users in their data culture implementation.

For more information on metadata management, contact us.

Data Management

SQLite Equals ETL Heavy

Actian Corporation

May 18, 2020


Two weeks ago, I likened the performance of SQLite to that of a banana slug. Now, some may consider that a bit of hyperbole (and some UC Santa Cruz alumni may feel that I impugned the good name of their mascot, which was not my intent), but the numbers don’t lie. The measurable difference in local processing performance between SQLite and a modern edge data management system like Actian Zen is two to three orders of magnitude. So, the cheetah-to-banana slug comparison is quantitatively accurate.

In fact, strap in—because I’m doubling down on the banana slug analogy—but I’m going to swap out the cheetah (with its top speed of 70 mph) for a peregrine falcon, with a top speed of 240 mph. The reason? In considering modern edge data management for IoT, or any edge-to-cloud environment for that matter, you have to consider performance in terms of distributed data reads and writes—across devices, gateways, on-premises workstations and servers, and, of course, the cloud. Such distribution poses complications that SQLite simply cannot overcome with anything remotely resembling speed.

Let me give you an example: SQLite only works as a serverless database, which mandates integration with, and therefore transformation of its data into, a client-server database. You’ll often see SQLite paired with Microsoft SQL Server, MySQL, Oracle, or Postgres. Additionally, there are stealth pairings in which SQLite is present but seemingly invisible. You don’t see SQLite paired with MongoDB or Couchbase, for example, but the mobile client versions of both of these databases are really SQLite. “Sync servers” between the mobile client and the database servers perform the required extract, transform, and load (ETL) functions to move data from SQLite into the main MongoDB or Couchbase databases.
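
To make that hidden work visible, here is a minimal sketch of the extract-transform-load hop such a sync layer performs: rows are pulled from the SQLite client, coerced into the types the server expects, and bulk-inserted into the client-server database. The table names, columns, and connection string are placeholders, and the transform shown is deliberately simple.

    # A minimal sketch of the ETL hop between a SQLite client and a client-server database.
    # The server DSN, tables, and columns are placeholders for illustration.
    import sqlite3
    import pyodbc

    def sync_readings(sqlite_path: str, server_dsn: str) -> int:
        src = sqlite3.connect(sqlite_path)
        # Extract: pull unsynced rows from the local SQLite cache.
        rows = src.execute(
            "SELECT device_id, reading, recorded_at FROM readings WHERE synced = 0"
        ).fetchall()

        # Transform: SQLite stores these loosely typed; coerce to what the server schema expects.
        payload = [(str(d), float(r), str(t)) for d, r, t in rows]

        # Load: bulk-insert into the client-server target over ODBC.
        with pyodbc.connect(server_dsn) as dst:
            dst.cursor().executemany(
                "INSERT INTO readings (device_id, reading, recorded_at) VALUES (?, ?, ?)",
                payload,
            )

        # Mark rows as synced (a production sync would track specific row ids).
        src.execute("UPDATE readings SET synced = 1 WHERE synced = 0")
        src.commit()
        src.close()
        return len(payload)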

But wait, you say: Isn’t the whole point about modern edge data management that data at the edge is going to be captured, shared, and processed by smarter components at the edge? Moreover, that some of that data is going to be sent from devices at the edge up to servers in the cloud? Why are you picking on SQLite?

So, in order, the response to your objections is yes, yes, and I’ll tell you.

Modern Edge Data Management is Shared and Distributed

Yes, we should all take it as a given that IoT and mobile environments will send data from devices and IoT sensors at the edge up to servers in the cloud. And, yes, new network standards like 5G (and industrial variants of 4G-LTE), coupled with AI running on more edge devices, will lead to more local and peer-to-peer processing. That will bring device and metadata management out of the cloud/data center to edge gateways and on-premises servers. Both scenarios share and distribute massive amounts of data, and, where SQLite is involved, will entail an explosion of ETL, because SQLite isn’t going to be running on the larger servers at the edge or in the cloud. That’s where you’re seeing SQL Server, MySQL, Oracle, Postgres, and others (including the Actian Zen Edge, Enterprise, and Cloud editions).

Which brings us to the question of why ETL matters. When you think about the characteristics of the systems that will be sharing and distributing all this data, three key things stand out: performance, integration, and security. We’ve already discussed the actual processing performance characteristics of our banana slug when it comes to local data operations. When we look closely at SQLite in the broader context of data sharing and distribution, it becomes apparent that the use of SQLite can have a profound impact on operational performance and security.

It’s All About the “T” in ETL

From a data management system standpoint, the transform action in ETL is the most critical element of that initialism. The extract and load steps are largely independent of the data management system—data transfer is a function of the virtual machine, operating system, hardware abstraction layers, and, of course, I/O subsystems—but the data management implementations dictate if, when, and how data transformations occur. When moving data from one database or file management format to another, the data must be reformatted so that the receiving system can read it. SQLite touts the consistency of its underlying file format across all platforms, which would suggest that moving data from one platform to another requires no transformation. For an SQLite application operating as a simple data cache on a mobile device or a moderately trafficked website, this may be true. But that’s not what a shared and distributed IoT environment looks like. Today’s modern edge data management environments are fully managed, secure, and built to perform complex data processing and analysis on a variety of systems in a variety of places—on device, at the edge, and in the cloud. These are environments replete with data aggregation, peer-to-peer sharing, and other data management operations that require transformation from the SQLite format into something else—quite possibly several something elses.
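
A small example of why the transform step is unavoidable: SQLite’s type affinity lets a single column hold mixed representations of the “same” value, so every row has to be normalized before a strictly typed target schema will accept it. The schema and values below are illustrative.

    # A minimal sketch of per-value normalization forced by SQLite's dynamic typing.
    import sqlite3
    from datetime import datetime
    from decimal import Decimal

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER, amount, ordered_at)")
    # SQLite happily stores the 'same' columns as text, float, and integer.
    db.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
        (1, "19.99", "2020-05-18 09:30:00"),
        (2, 20.5, 1589794200),            # amount as float, timestamp as epoch seconds
    ])

    def to_decimal(value):
        return Decimal(str(value)).quantize(Decimal("0.01"))

    def to_datetime(value):
        if isinstance(value, int):
            return datetime.utcfromtimestamp(value)
        return datetime.fromisoformat(value)

    # The "T" in ETL: coerce every row to match a strict DECIMAL/TIMESTAMP target schema.
    normalized = [(i, to_decimal(amount), to_datetime(ordered_at))
                  for i, amount, ordered_at in db.execute("SELECT id, amount, ordered_at FROM orders")]
    print(normalized)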

And You Thought the Banana Slug Was Slow

That’s the second dimension where SQLite simply becomes sluggish. Actian conducted a series of tests comparing the transfer and transformation performance of Zen Core and SQLite. One set of tests compared the performance of data transfers between SQLite and MS SQL Server to the same data transfer between Zen Core and Zen Enterprise Server. Both the SQLite and Zen Core serverless clients ran on a Raspberry Pi device, while SQL Server and Zen Enterprise ran on a Windows server-based system.

The performance results are eye-popping: Taking a block of 25K rows from Zen Core and inserting it into Zen Enterprise took an average of 3 ms. Taking the same block from SQLite and inserting it into Microsoft SQL Server took 73 ms, or roughly 24X more time. Other tests, comparing indexed and non-indexed updates, reads, and deletes, all had similar results. Why? Because of the transformations required. In moving data between SQLite and SQL Server, the data from SQLite had to be transformed into a format that SQL Server, which has a different format and a different data model, can read. When moving the data from Zen Core to Zen Enterprise Server, which rely on the same format and data model, no such transformation is necessary.
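
For readers who want to reproduce the shape of such a test (not the Actian benchmark itself), here is a minimal sketch that times a 25K-row batch insert from a SQLite source into an ODBC-connected server. The connection string, table, and columns are placeholders, and absolute numbers will vary with hardware, drivers, and network.

    # A minimal sketch of timing a batch transfer from SQLite into a server database.
    # The DSN, table, and columns are placeholders; results depend heavily on the environment.
    import sqlite3
    import time
    import pyodbc

    BATCH = 25_000

    def timed_transfer(sqlite_path: str, server_dsn: str) -> float:
        rows = sqlite3.connect(sqlite_path).execute(
            "SELECT id, payload FROM samples LIMIT ?", (BATCH,)
        ).fetchall()
        start = time.perf_counter()
        with pyodbc.connect(server_dsn) as dst:
            cur = dst.cursor()
            cur.fast_executemany = True       # pyodbc bulk binding to reduce round trips
            cur.executemany("INSERT INTO samples (id, payload) VALUES (?, ?)", rows)
        return (time.perf_counter() - start) * 1000  # elapsed milliseconds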

So Much for Faster, Better, Cheaper

Zen isn’t the only database with a common architecture stretching from client to server. Microsoft SQL Server has such an architecture, but it only runs on Windows-based devices. Actian Zen runs on pretty much everything—from Android-based IoT and mobile devices to Windows-based edge devices, to data center and cloud servers running a wide range of Linux implementations. Zen has a single, secure, modular architecture that allows the serverless version to interact with the Edge, Enterprise and Cloud versions using the same APIs, data formats, and file system, removing any need for transformations.

And that’s really where the distinction between the peregrine falcon and the banana slug becomes palpably real. If SQLite were capable of interacting directly with other elements in the modern edge data management environment, everyone happily using SQLite could avoid data transformations and heavy ETL. But that’s not the world in which we operate. SQLite will always involve heavy ETL, and a banana slug it will remain.

There’s an age-old tradeoff in the world of engineering development that goes like this: We can give you faster, better, or cheaper. Pick two. SQLite promises faster but in practice delivers slower, as the benchmarks above prove. That leaves better and cheaper—except that, as we’ll see, with SQLite we don’t even get better or cheaper. Stay tuned for the next post in this series, where we’ll discuss why SQLite is not better. After that, we’ll take a sharp, falcon-like look at total cost of ownership.

You can learn more about Actian Zen. Or, you can just kick the tires for free with Zen Core, which is royalty-free for development and distribution.


About Actian Corporation

Actian empowers enterprises to confidently manage and govern data at scale. Actian data intelligence solutions help streamline complex data environments and accelerate the delivery of AI-ready data. Designed to be flexible, Actian solutions integrate seamlessly and perform reliably across on-premises, cloud, and hybrid environments. Learn more about Actian, the data division of HCLSoftware, at actian.com.