Blog | Data Analytics | | 5 min read

Delivering Real-Time Reporting at Speed and Scale

When a major UK logistics company wanted to improve reporting for its large accounts, it turned to Actian to design, implement and support the underlying database system (“LARS”), built on the Ingres, HVR and Vector products.

The Brief

The customer had around 100 customer accounts representatives dedicated to large accounts, with each rep manually producing their own set of spreadsheet-based daily, twice-daily and ad hoc reports for emailing to their account contacts, based on a range of daily extracts from an Ingres operational-level database.

The customer wanted to standardize the format of the reports and to automate their production in order to save reps’ time, to deliver reports to their accounts in a consistent and timely manner, and ultimately to make it feasible to outsource the function.

The challenge was not just to produce the volume of scheduled complex analytical reports (over 1,000 per day, tightly clustered around critical times in mid-morning and mid-afternoon) while simultaneously supporting ad hoc complex report production for 200 users with response times of seconds, but also to do this a) without significant overhead on the source operational-level database and b) while reducing the need for the range of existing extracts from that database. An additional requirement was that it should be possible to ‘switch’ other existing applications from the operational-level database to this new database at a future stage, which mandated that the new database design be as similar as possible to the existing operational-level design.

Because of delays to the start of the project (due to changes within the customer’s organization), there was considerable pressure to deliver the project in as short a timescale as possible.

The Architecture

To provide the user-visible front-end analytical and reporting facility a semi-customized package from a partner organization was chosen, based on the Logi Analytics product.

The database schema design was constrained by the source database schema design, which resulted in the need to provide a range of database views involving joins over 12 tables, with some of the tables having over 300 million rows. In order to provide interactive users with realistic response times whilst also servicing the needs of scheduled reports, Vector was chosen as the ideal DBMS for this database, due to its very high speed of processing complex retrieval queries and its ability to mirror the Ingres source database structure virtually unchanged.

Since the source Ingres database and the target Vector database had essentially similar schemas, HVR (High Volume Replicator) was chosen as the software solution for keeping the Vector database in line with the source Ingres database. The HVR Capture process reads the Ingres transaction log and passes insert and update operations via the HVR Hub to the target machine, where the HVR Integrate process applies them as ‘upserts’ to the Vector database, placing very little load on the source database machine. (Deletes were suppressed within HVR, so that the regular purges of the source database would not also purge the target database.)

The Implementation

The Ingres source database runs on an older HP-UX platform, so HVR was installed on a dedicated Linux server to act as its Hub. The Vector database sits on a separate dedicated Linux server. An HVR ‘capture’ component runs on the Ingres machine, captures the source database changes from the transaction log and sends them via the HVR Hub to the HVR ‘integrate’ component running on the Vector server, which applies the same changes (via ‘upserts’) to the target Vector database.

To meet the customer’s need for reduced development timescales the project was delivered ready for user acceptance testing in 3 months from the start of development, thanks to Vector’s ability to mirror an Ingres schema with little change.

In order to reduce the number of table joins in the views from 12 down to a more manageable 9, a regularly-scheduled job (running every 10 minutes) was created to maintain a de-normalized table.
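As a sketch of how such a job might look (the table and column names are invented for illustration, and this assumes MERGE support; an UPDATE/INSERT pair works where MERGE is unavailable):

```sql
-- Hypothetical sketch of the 10-minute denormalization refresh.
-- Table and column names are invented for illustration.
MERGE INTO consignment_denorm d
USING (
    SELECT c.consignment_id, c.status, a.postcode, r.route_code
    FROM   consignment c
    JOIN   address a ON a.address_id = c.delivery_address_id
    JOIN   route   r ON r.route_id   = c.route_id
) s ON (d.consignment_id = s.consignment_id)
WHEN MATCHED THEN UPDATE SET
    status = s.status, postcode = s.postcode, route_code = s.route_code
WHEN NOT MATCHED THEN INSERT
    (consignment_id, status, postcode, route_code)
    VALUES (s.consignment_id, s.status, s.postcode, s.route_code);
```

Each frequently joined table folded into the denormalized table this way removes one join from every view that uses it.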

The denormalization update job, HVR’s ‘upsert’ job, the large number of scheduled reports, and the interactive users happily co-exist on the Vector server.

Vector Performance

It is often fairly meaningless to quote retrieval response times from a system since there are so many variables involved, but we can provide a flavour of the retrieval performance of the Vector database compared with its Ingres source database. A member of the customer’s IT staff needed to run an unreasonably heavy ad-hoc SQL query against the Ingres source database which ran for 10 minutes before she killed it as untenable. We ran the same SQL against the live Vector database, during ‘prime-time’ activity – it completed in 0.05 seconds. Although this is not a direct comparison since the two databases were running on different platforms and hardware configurations, it does illustrate the dramatic retrieval speed of which Vector is capable.

In fact the performance of Vector was so impressive that it changed the specified requirements from the client-facing team. The envisioned work practice was to allow up to ~200 complex reports to run between 10:00 AM and 10:30 AM, but Vector was so fast and comfortable at scale that these reports now all run within 5 minutes of 10:00 AM, limited only by the resources (cores, memory, etc.) on the machine.

Customer Satisfaction

The customer was sufficiently impressed with the novel architecture of the LARS implementation that they commissioned a second more challenging Vector-based project to be fed from a continuous message stream. This will be the subject of a future blog entry.


Blog | Insights | | 6 min read

Vector in Hadoop 5.0 – New Features You Should Care About

Actian Vector was renamed to Actian Analytics Engine in 2026.

Today we announce the introduction of the next release of Actian Vector in Hadoop, extending our support of Apache Spark to include direct access to native Hadoop file formats and tighter integration with Spark SQL and Spark R applications. In this release, we also incorporate performance improvements, integration with Hadoop security frameworks, and administrative enhancements. I’ll cover each of these in greater detail below.

Combine Native Hadoop Tables With Vector Tables

In previous releases, Vector in Hadoop required data to be stored in a proprietary format which optimized analytics performance and delivered great compression to reduce access latency. Vector in Hadoop 5.0 provides the ability to register Hadoop data files (such as Parquet, ORC, and CSV files) as tables in VectorH and to join these external tables with native Vector tables. Vector in Hadoop will provide the fastest analytics execution against data in these formats, even faster than their native query engines. However, query execution will never be as fast with external tables as with native Vector data. If performance matters we suggest that you load that data into Vector in Hadoop using our high-speed loader.
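For illustration, registering a Parquet file and joining it with a native table might look like the following (the table, column and path names are invented; check the VectorH 5.0 documentation for the exact external-table clause):

```sql
-- Illustrative: expose a Parquet file in HDFS as an external table,
-- then join it with a native Vector table.
CREATE EXTERNAL TABLE clicks_ext (
    user_id INTEGER,
    url     VARCHAR(1024),
    ts      TIMESTAMP
) USING SPARK
WITH REFERENCE = 'hdfs://namenode/data/clicks.parquet',
     FORMAT    = 'parquet';

SELECT u.name, COUNT(*) AS clicks
FROM   users u                        -- native Vector table
JOIN   clicks_ext c ON c.user_id = u.user_id
GROUP BY u.name;
```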

This feature enables customers who have standardized on a particular file format, and who want to avoid copying data into a proprietary format, to still get the performance acceleration VectorH offers. The storage benchmark we conducted as part of our SIGMOD paper showed the Vector file format to be more efficient in both query performance (data read) and data compression. See our blog post from July 2016, which explains that benchmark further.

True Enterprise Hadoop Security Integration

A Forrester survey last year indicated that data security is the number one concern with Hadoop deployments. Vector in Hadoop natively provides the enterprise-grade security one expects in a mature EDW platform: discretionary access control (control over who can read, write, and update what data in the database), column-level data-at-rest encryption, data-in-motion encryption, security auditing with SQL-addressable audit logs, and security alarms. For the rest of the Hadoop ecosystem, these concerns have driven the development of Hadoop security frameworks through projects like Apache Knox and Apache Ranger. As we see these frameworks starting to appear on customer RFIs, we’ve provided documentation on how to configure VectorH for integration with Apache Knox and Apache Ranger.

Significant Performance Enhancements

The performance enhancements which resulted in Vector 5.0 claiming top performance in the TPC-H 3000GB benchmark for non-clustered systems are now available in Vector in Hadoop 5.0, where we typically see linear or better than linear scalability.

Automatic Histogram Generation

Query execution plans rely heavily on knowledge of the underlying data; without statistics, the optimizer has to make assumptions about data distribution, e.g. that all zip codes have the same number of residents, or that customer last names are as likely to begin with an X as with an M. VectorH 5.0 includes automatic statistics/histogram generation for Vector tables: histograms are automatically created and cached in memory when a query references a column in a WHERE, HAVING or ON clause that has no explicitly created (by optimizedb or CREATE STATISTICS) histogram.
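In practice this means a query can get good estimates without a prior optimizedb run. A sketch (table and column names invented; the explicit statement mirrors what the automatic mechanism now does for you):

```sql
-- Previously, histograms had to be created explicitly, e.g.:
CREATE STATISTICS FOR customers (zip_code);

-- In VectorH 5.0, a query like this triggers automatic histogram
-- creation on zip_code if no histogram exists yet:
SELECT COUNT(*) FROM customers WHERE zip_code = '94111';
```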

Accelerate Startup and Shutdown With Distributed Write Ahead Log

In earlier Vector in Hadoop releases the write-ahead log file, which holds details of updates in the system, was managed on the VectorH Leader Node. This memory-resident log file consumed a lot of the Leader Node memory and became a bottleneck at startup, as the log file needed to be replayed and that process could take several minutes. In VectorH 5.0 we have implemented a distributed Write Ahead Log (WAL), where each node has a local WAL. This alleviates pressure on memory, improves our startup times and, as a side effect, also results in much faster COMMIT processing.

Speed Up Queries With Distributed Indexes

In earlier releases, the VectorH Leader Node was responsible for maintaining the automatic min-max indexes for all partitions. As a reminder, the min-max index keeps track of the minimum and maximum value stored within a data block; this internal index allows us to quickly identify which blocks will participate in solving a query and which ones don’t need to be read. The index is memory-resident and is built on server startup. In VectorH 5.0 each node is responsible for maintaining its own portion of the index, which alleviates pressure on memory on the Leader Node, improves our startup times by distributing the work, and speeds up DML queries.

Simplified Partition Management With Partition Specification

We found that a number of VectorH customers encountered performance problems because they didn’t know to include the PARTITION clause when creating tables, especially when using CREATE TABLE AS SELECT (CTAS). Say a customer had an existing table distributed across 15 partitions and created a new table based on it: the natural assumption was that the new table would also have 15 partitions. That is not what the SQL standard intends, and in this case being true to the standard hurt us. To alleviate this we have added a configuration parameter which can be set to require the use of either NOPARTITION or PARTITION= when creating a Vector table, whether explicitly or via CTAS.
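With the parameter enabled, the partitioning intent has to be spelled out, for example (table names and partition count invented for illustration):

```sql
-- CTAS inherits nothing: without a PARTITION clause the copy is
-- unpartitioned. Stating it explicitly avoids the surprise:
CREATE TABLE orders_copy AS
    SELECT * FROM orders
WITH PARTITION = (HASH ON order_id 15 PARTITIONS);

-- Or declare that no partitioning is intended:
CREATE TABLE orders_small AS
    SELECT * FROM orders WHERE order_date > DATE '2016-01-01'
WITH NOPARTITION;
```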

Simplify Backup and Restore With Database Cloning

VectorH 5.0 introduces a new utility, clonedb, which enables users to make an exact copy of their database into a separate Vector instance e.g. take a copy of a production database into a development environment for testing purposes. This feature was requested by one of our existing customers but has been very well received across all Vector/VectorH accounts.

Faster Exports With Spark Connector Parallel Unload

The Vector Spark Connector can now be used to unload large data volumes in parallel across all nodes.

Simplified Loading With SQL Syntax for vwload

VectorH 5.0 includes the ability to utilize vwload with the SQL COPY statement for fast parallel data load from within SQL.
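For example, a parallel load from HDFS files might be expressed as follows (file paths and options are illustrative; see the vwload/COPY documentation for the full option list):

```sql
-- Illustrative: parallel bulk load via SQL, using vwload under the hood.
COPY lineitem()
VWLOAD FROM 'hdfs://namenode/data/lineitem1.csv',
            'hdfs://namenode/data/lineitem2.csv'
WITH FDELIM = '|';
```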

Simplified Creation of CSV Exports From SQL

VectorH 5.0 includes the ability to export data in CSV format from SQL using the following syntax:

INSERT INTO EXTERNAL CSV 'filename' SELECT ... [WITH NULL_MARKER='NULL', FIELD_SEPARATOR=',', RECORD_SEPARATOR='\n']
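For example, to export a query result (file path and table names invented for illustration):

```sql
-- Illustrative: write a filtered result set out as CSV.
INSERT INTO EXTERNAL CSV 'hdfs://namenode/out/top_customers.csv'
    SELECT name, total_spend
    FROM   customers
    WHERE  total_spend > 10000
WITH NULL_MARKER = 'NULL', FIELD_SEPARATOR = ',';
```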

Next Steps

To learn more, request a demo or a trial version of VectorH to try within your Hadoop cluster. You can also explore the single-server version of Actian Vector running on Linux, distributed free as a community edition, available for download.


Blog | Actian Life | | 3 min read

The Essential Guide to Gartner Catalyst 2017 in San Diego, CA

If you’re headed to Gartner Catalyst 2017 in San Diego, CA August 21 – 24, or if this is your first time at a Gartner event, here’s your essential guide to get the most out of the upcoming conference. We hope you find it useful.

This is a conference for technical professionals. You’ll have plenty of opportunities to meet with your peers across all disciplines including CIOs, CTOs, solution architects, developers, database admins, data scientists, data engineers, business analysts and DevOps, amongst others.

Gartner has lined up a host of hot topics and session tracks, so be sure to check out the official session calendar to build your personalized schedule that you can access from the Gartner Events Navigator. They also have a mobile app (Android, iOS and Windows) that you can use after you have registered, and this will come in handy as you move around at the conference between sessions, roundtables, meetups, networking, and breakfast/lunches.

Actian CTO Mike Hoskins will be sharing Actian’s Hybrid Data Vision as part of his talk titled “Actian: Drowning in Data? How to Bridge the Gap to Business Insights.” The talk will be held in the TechZone Theatre/Harbor Ballroom, Second Level on Wednesday, August 23 @ 1:30 PM PT, so be sure to add this one to your calendar. This talk is in the same area as, and during, the coffee/desserts break, so seats tend to fill up fast… don’t be late or you’ll be left standing!

Remember that the event is in San Diego, California and not in San Diego, Texas! The nearest airport is San Diego International Airport (SAN), formerly known as Lindbergh Field, which is located a short distance by car/taxi from the event hotel. Most local school districts have either just started or are starting their new school year, so the event is perfectly timed to miss the peak Summer vacations for many US tourists. Be sure to check out the local weather forecast before you pack your suitcase. Remember to find some time to stretch your legs and explore the nearby Gaslamp District and Seaport Village. Check out Gartner’s latest event-related venue and travel information, as there are some travel alerts to be aware of.

The Actian team will be there, and we look forward to meeting you in person at the Actian Booth #108 in the Harbor Ballroom, second level of the Manchester Grand Hyatt San Diego.

We’ll be sharing our hybrid data vision and will have subject matter experts available onsite to walk you through our portfolio of hybrid data-management, analytics and integration products and services for Technology Professionals like you.

If you’re new to Actian products, here are a few of the portfolio highlights:

We hope you’ll stop by to say “Hi” to the team and learn about Actian’s products, community, and customers.

Follow us on Twitter, and on LinkedIn to stay connected with what we are up to. If you fancy a job to pursue your passion in data management, data integration, and data analytics, check out our careers page and come join our team – WE’RE HIRING!


One important trend in database management is integrating location data better to improve insights about events and activities that matter to your business.

“…Interest in analyzing geospatial/location data has increased over the past four years from 26% to 36%.”
Source: Gartner Survey Analysis: Big Data Investments

Tracking customer location can be critical for offering location-based services, particularly for travelers (think Uber matching cars to riders, or restaurants making offers to customers nearby) and for shoppers (to optimize shelf locations for popular items and perhaps make real-time offers).  Tracking and managing assets by location can not only improve response time to failures but also track potential interactions that ultimately predict future failures.

Actian Ingres has supported geospatial data for a few years now, recognizing location as a data type to improve the validity, accuracy, and processing of location data. Earlier this year, we extended that support in Ingres by introducing a plugin for ESRI ArcGIS 10.x users to view and manipulate geospatial data. ArcGIS is ESRI’s geographic information system (GIS) for creating and working with maps and geographic information.

The ESRI plugin supports two of the tools, ArcMap and ArcCatalog, in versions 10.x of ArcGIS on Windows, and Actian supports the plugin on Ingres 10S and 10.2. ArcMap is the primary application used in ArcGIS for mapping, editing, analysis, and data management. With the ESRI plugin and ArcMap, users can access geospatial data to create maps and to visualize, filter, summarize, analyze, compare, and interpret spatial data. ArcCatalog allows users to store and organize geospatial data (like a Windows Explorer for geospatial data).
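Once the data is in Ingres as a geometry type, standard spatial predicates apply. A hypothetical query (the depot table and coordinates are invented; function names follow the OGC simple-features convention that Ingres’s geospatial support is based on):

```sql
-- Hypothetical: find depots inside a bounding polygon (SRID 4326 = WGS 84).
SELECT d.depot_name
FROM   depots d
WHERE  ST_Within(
           d.location,
           ST_GeomFromText(
               'POLYGON((-1 51, 1 51, 1 53, -1 53, -1 51))', 4326));
```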

Actian is working with a couple of partners to help our customers get the most out of Ingres and ArcGIS:

  • Critigen provides implementation services with ESRI expertise to develop and deploy geospatial applications.
  • Safe Software supports Ingres through their FME integration tool and complements Actian DataConnect.

The ESRI plugin and documentation are available to existing Actian customers for download at esd.actian.com. To find out more about geospatial features, go to docs.actian.com.

Download the ESRI plugin and let us know what you think!


Summary

  • Data management is evolving from traditional business transactions to include complex human and machine-generated event trails.
  • A hybrid data landscape requires systems that can integrate and analyze diverse data types across on-premises and cloud environments.
  • Correlating machine observations with business events enables a “closed-loop” for more timely and accurate analytic decisions.
  • The ability to manage stunningly heterogeneous data at scale is the primary key to succeeding with modern enterprise analytics.

The Age of Data has arrived, with new data sources, targets and processing models proliferating madly across enterprises of all sizes. While data has never been more valuable to a business — it now informs the who, what, where, when and how of decision-making — this new hybrid data landscape introduces new challenges. We anticipate the following innovative efforts in data management, integration and analytics to address these challenges.

The Rise of HTAP – Best of Both Worlds in Data Management

One of the most exciting trends for the balance of this decade will be HTAP (Hybrid Transactional/Analytical Processing), which is a Gartner-coined term representing a hybrid, converged software infrastructure that can handle both traditional transactional data management workloads AND modern analytic data management workloads.

Every business is struggling to find tools and techniques to effectively analyze the volume, variety and velocity of data. A new generation of columnar analytic SQL databases (like Actian Analytics Engine) will be critical to delivering on the promise of data-driven decisions. At the same time, organizations are familiar with, and trying to preserve, their investment in traditional transactional SQL databases (like Actian Ingres) that represent the backbone of data management in most organizations. How to marry those two data management needs?

What if you could have both capabilities in the same database? What if you could have the best of both worlds? Robust, enterprise-class OLTP database capabilities that leverage a 30+ year history of pioneering work in data management. And then add the world’s highest-performance columnar analytic database engine (with vector processing) into the same database infrastructure. One database, one security model, one SQL, one vendor – providing an innovative hybrid of operational and analytic processing that covers the entire spectrum of data management! With the ability to deploy to the cloud or on-premise. Now that is something to get excited about.

The Rise of Edge Databases for IoT Data Management

The emerging IoT stacks and solutions are missing one important element of scalable architectures – an elastic middle tier that can sit at the “edge” of the network and deliver robust processing services to the onboarding and analysis of IoT data. Most conventional IoT architectures focus simply on the two main end-points – the sensors themselves, spitting out low-level data, and the cloud, where sensor events should eventually “land” for analysis.

The sheer volume and repetition of sensor data make it impractical to imagine “landing” all sensor data in the cloud. The smarter IoT architectures will provide an intelligent middle tier – a kind of gateway function that resides near the sensors, at the edge. This layer is intended for early capture, processing and local analysis of the sensor data before only vital information is sent to the cloud.

The natural technology to deploy at the onboarding “edge” of the network is a bullet-proof embedded IoT edge database. Apart from the obvious advantages of deploying an embedded IoTDB at the “edge” of the network (persistence, security, etc.), you could also apply crucial local filtering (e.g. duplicates, errors, steady states, etc.) and data operations (e.g. sorts, aggregates, model application and local analytics) on the data prior to “landing” the data in the cloud – a much more efficient and productive setup for cloud-based analytics of sensor data.

The Rise of Hybrid Integration Platforms

It seems that regardless of how much we invest, integration remains an unsolved problem – permanently atop the priority list in all IT shops and organizations. The diversity of IT systems guarantees a baseline of integration challenges. An uncountable number of new end-points every year exacerbates the situation. Factor in that old and new end-points are changing constantly, and you multiply the problem further. Add the requirement for different integration patterns and delivery models and you begin to see the many intimidating dimensions of the integration problem.

Is there hope? Yes, tools that surpass the limited nature of today’s typical integration offerings are making their way into the market. Instead of focusing on one dimension of today’s integration problem – legacy on-premises ETL, heavy EAI tooling or lightweight cloud services, we will see customers turn to hybrid integration platforms – modern, dynamic and cloud-based solutions – to tackle all dimensions. Whether it is the variety of end-points (cloud, mobile or on-prem), or the variety of patterns (A2A via APIs or B2B via data), or the variety of skills (IT expert to LoB practitioner) or the variety of delivery models (cloud or on-premise), a modern hybrid integration platform like the Actian DataCloud will enable customers to adapt to today’s data integration needs.

The Rise of Graph Analytics in the Cloud

Neo4J, the leading commercial provider of on-premises graph database technology, recently raised a funding round of $36 million. This funding establishes graph databases (and the associated graph analytics space) as first class citizens in the pantheon of modern analytic techniques.

Why graph? In the now-immortal words of Donald Rumsfeld, there are “known knowns” (handled via BI and reporting), there are “known unknowns” (handled via predictive analytics to get a grip on a known analytic challenge such as fraud), and then there are “unknown unknowns.” These are the questions you never knew to ask, the queries you never knew to write. What are the unknown/unseen patterns hidden away in your data, and how do you find them? This is one of the great analytic challenges in datasets – what are the inherent (but unseen) relationships in the data – what objects are “close” to what other objects? What objects are “outliers”? What heretofore seemingly unrelated events share space and time?

It is exactly for this reason that graph is an important new analytic weapon. Graph analytics in the cloud are the ideal implementation platform, and we expect to see offerings that let you transfer your data into the cloud, load it into a back-end graph datastore like Actian Versant, and then “graph it” to see patterns inherent in the data (and even see new patterns emerge spontaneously as you add more data).


We are proud to introduce the latest release of Actian DataConnect. We listened to our customers and adopted a ‘back to basics’ approach with the new product architecture. A lightweight desktop installation for design and a flexible SDK and CLI for run-time are the core components which will plug into any existing job management infrastructure you may have already built around a previous version. For users who want to take advantage of our out-of-the-box, robust and secure cloud infrastructure, DataCloud is the preferred deployment option.

Backwards compatibility with Actian DataConnect versions 9 and 10 is another core theme in the version 11 release. Pervasive Data Integrator version 9 users can skip Version 10 altogether and upgrade directly to Version 11. Those maps, schema, processes, and other artifacts can be imported to Version 11 without the need to perform a migration. Simply import and use them.

The design environment focuses on developer productivity and integration architecture simplification with a small-footprint, desktop IDE installation. It includes all the familiar mapping and event features that were available in Data Integrator version 9. We’ve also added more development tools to help you iterate faster, both when developing new integration projects and when modifying existing ones.

We wanted to get this release into the field so our version 9 users could begin to take advantage of it immediately. Version 10 users will be able to upgrade in a subsequent release targeted later in 2017.

What’s New and Different About DataConnect 11?

Architecture:

  • Lightweight desktop design interface built on a widely adopted extensible open-source IDE framework.
  • Ability to import, rather than migrate, integration artifacts from prior DataConnect versions.
  • Full support for Data Integrator Version 9 Events and Actions for backward compatibility.
  • Open, file system-based metadata repository that enables use of your existing source control systems.
  • Flexible software development kit (SDK) and command line interface (CLI) to support your custom job management infrastructure.
  • DataCloud deployment option: Manage in the cloud, run time on-premises via agents.

Integration Features:

  • REST Invoker 3.0: Easy-to-use and standardized approach to RESTful web service APIs.
  • Engine execution profiler provides immediate, interactive performance feedback.
  • Built-in XML and Text editors for power users to directly modify metadata.
  • Content assist in the script editor (aka code completion).
  • Reject connection tab for improved ease of use.
  • Optional support for macro sets and encrypted values.
  • Improved “Search and Replace” functionality and Help system.

Want to Learn More?

For the hands-on users, here is a short series of videos showing the new user interface in action.

Download data sheet and whitepaper: click here


Blog | Data Management | | 4 min read

Architecting Next-Generation Data Management Solutions

Hybrid for the Data Driven Enterprise

This is part 2 of our conversation with Forrester analyst Michele Goetz. Please click here to read the first post: Rethink Hybrid for the Data-Driven Enterprise.

After a recent Actian webinar featuring Forrester Research, John Bard, senior director of product marketing at Actian, asked Forrester principal analyst Michele Goetz more about next-generation data management solutions. Here is the second part of that conversation (see part one here):

John Bard, Actian:  What are key business imperatives that are forcing a greater priority of speed of query processing for systems of insight? 

Michele Goetz, Forrester:  More and more businesses are becoming digital. Retailers are creating digital experiences in their brick-and-mortar stores. Oil and gas companies are placing thousands of sensors on wells to get information on production and equipment states in real-time. And the mobile mind shift is driving more and more consumer and business engagement through mobile apps. Everything is in real-time, delivered through a web of microservices, and increasingly sophisticated analytics are embedded in streams and processes. This places a significant demand on systems that have to hit high-performance levels on massively orchestrated data services to get insight on demand, make decisions quickly, take action quickly, and achieve outcomes that meet business goals.

JB:  How important is it for operational data and systems of insight to be tightly linked? What are some applications/use cases driving that integration?

MG:  More and more, transactional systems have to operate on insight and not just as entry points to capture a transactional event. Analytics are running on streams of data and individual transactions such as purchases and business process events and transactions. These analytics provide suggestions and instructions to inform pricing, offers, next best action, and security/fraud patterns, along with automating manual processes. Today’s modern data platform has to run analytic and operational workloads side by side to not only enable a process but also capitalize on opportunities and threats as they occur.

JB:  How does an enterprise strike a balance between best-in-class solutions that often require integration versus all-in-one platforms that often force compromises?

MG:  For each business process, customer engagement, automated process, and partner engagement, there are different service-level needs for data and analytics. Data and data services have to be more personalized to the tasks at hand and desired outcomes. Upstream in-development applications are designed with specific requirements for data, insights, and the cadence for when data and insight are needed. These requirements manifest within the data and application APIs that drive microservices and business services. A monolithic all-in-one platform creates rigidity as a purpose-built system that is inflexible to business changes. The cost to purchase and maintain is significant and has an impact on the ability to modernize, thus building up technical debt. Additionally, for every new capability, a new silo is built, further fragmenting data and inhibiting insight. Companies need to move toward a hybrid approach that takes into account the cloud, data variety, service levels, best-in-class technologies, and open source for innovation. Hybrid systems allow flexibility and adaptability to drive service-oriented data toward business value without the cost and delivery bottlenecks that one-size-fits-all systems create.

JB:  What is the best design approach to accelerate development to achieve faster deployment to production and therefore business value?

MG:  Start with what the solution is supporting and the service levels it requires. Have an understanding of how that fits into specific data architecture patterns: data science for advanced analytics and visualization, intelligent transactional data, or analytic and BI workspaces. These patterns guide the choices for database, integration, and cloud while also helping to establish governance that guides trusted sources, repeatable and reusable data APIs and services, and the management of security policies.

JB:  What sort of new applications and services can be created from these new hybrid data architectures?

MG:  Hybrid data management is about putting the right data services and systems to the task and outcome at hand. It provides more freedom to introduce modern data technologies to quickly take advantage of capabilities to scale, get to insights you couldn’t see because of lack of data access, and deliver data and insight in real time without the lag from nightly batch processing and reconciliation. Additionally, hybrid data management has better administrative layers to help manage the peaks and valleys across the ecosystem and avoid performance bottlenecks, as well as right-cost data service levels between cloud and on-premises systems. Going hybrid means getting access to all the data to create customer 360s that take personalization to the next level. It allows analytics to mature toward machine learning, advanced visualizations, and AI by providing a better data infrastructure backbone. And apps and products become more intelligent as hybrid systems create engagement that is insightful and adaptive to the way the solutions are used.


After a recent Actian webinar featuring Forrester Research, John Bard, senior director of product marketing at Actian, asked Forrester principal analyst Michele Goetz more about the trends in today’s enterprise data market.  Here is the first part of that conversation:

John Bard, Actian:  The enterprise market tends to think of “hybrid” as on-premises or cloud, but there are several other dimensions for hybrid. Can you elaborate on other ways “hybrid” applies to the data management and integration markets?

Michele Goetz, Forrester:  Hybrid architecture is really about spanning a number of dimensions: deployment, data types, access, and owner ecosystem. Analysts and data consumers can’t be hindered by technology and platform constraints that limit the reach into needed information to drive strategy, decisions, and actions to be competitive in fast-paced business environments. Information architects are required to think about aligning to data service levels and expectations, forcing them to make hybrid architecture decisions about cloud, operational and analytic workloads; self-service and security; and where information sits internally or with trusted partners.

JB:  What factors do you think are important to customers evaluating databases when it comes to satisfying both transactional and analytic workloads?

MG:  Traditional approaches to database selection fell into either operational or analytic. Database environments were designed for one or the other. Today, operational and analytic workloads converge as transactional and log events are analyzed in streams and intelligently drive capabilities such as robotic process automation, just-in-time maintenance, and next-best-action or advise workers in their activities. Databases need the ability to run Lambda architectures and manage workloads across historical and stream data in a manner that supports real-time actions.

JB:  What are some of the market forces driving these other aspects of “hybrid” in data management?

MG:  Hybrid offers companies the ability to build adaptive composable systems that are flexible to changing business demands for data and insight. New data marts can spin up and be retired at will, allowing organizations to reduce legacy marts and conflicting data silos. Hybrid data management provides a platform where services can be built on top using APIs and connectors to connect any application. Cloud helps lower the total cost of ownership as new capabilities are spun up, while management layers allow administrators to easily shift workloads and data between cloud and on-premises to further optimize cost. Additionally, data service levels are better met by hybrid data management, as users can independently source, wrangle, and build out insights with lightweight tools for integration and analytics. In each of these examples, engineering and administration resource hours are reduced or current processes are optimized for rapid deployment and faster time-to-value for data.

JB:  What about hybrid data integration? That can span both data integration and application integration. What about business-to-business (B2B) integration? What about integration specialists versus “citizen integrators”?

MG:  Hybrid integration is defined by service-oriented data that spans data integration and application integration capabilities. Rather than relying strictly on extract, transform, and load (ETL)/extract, load, and transform (ELT) and change data capture, integration specialists have more integration tools in their toolbox to design data services based on service-level requirements. Streams allow integration to happen in real time with embedded analytics. Virtualization lets data views come into applications and business services without the burden of mass data movement. Open source ingestion provides support for a wider variety of data types and formats to take advantage of all data. APIs containerize data with application requirements and connectivity for event-driven views and insight. Data becomes tailored to business needs.

The other wave in data integration is the emergence of self-service, or the citizen integrator. With little more than an understanding of data and how to manipulate it in Excel with simple formulas, people with less technical skills can leverage and reuse APIs to get access to data or use catalog and data preparation tools to wrangle data and create data sets and APIs for data sharing. Data administrators and engineers still have visibility into and control over citizen integrator content, but they are able to focus on complex solutions and open up the bottlenecks to data that users experienced in the past.

Overall, these two trends extend flexibility, allow deployments to scale, and get to data value faster.

Hybrid data management and integration is the next-generation strategy for enterprises to go from data rich to data driven. As companies retool their businesses for digital, the internet of things (IoT), and new competitive threats, the ability to have architectures that are flexible and adapt and scale for real-time data demands will be critical to keep up with the pace of business change. Ultimately, companies will be defined and valued by the market according to their ability to harness data to stay ahead and viable.


Imagine this scenario – you have just “clicked” on an item that you are ordering online. What kind of “data trail” have you generated? Well, you are sure to have generated a transaction – the kind of “business event” that goes into a seller’s accounting system, and then on to their data warehouse for subsequent sales analysis. This was pretty much your entire data trail until just a few years ago.

In recent times, the whole notion of data trails has exploded. The first wave of new data entering your data trail consisted of web and mobile interactions – those dozens or hundreds (or even thousands) of “human events” – research clicks and social media postings that you execute leading up to and after an online order. It turns out that these human interactions, when blended with business transactions, are critical to yielding more insight into behavior.

And now we are entering the next wave of new data – the observations made by the ever-increasing number of intelligent sensors that record every “machine event.” In our example above, for each human interaction supporting your online order, there may be hundreds or thousands of software, network, location, and device metrics being gathered and added to your data trail. Further integrating and correlating these machine observations into your particular flow of business transactions and human interactions would enable game-changing advanced analytic capabilities – promising a “closed-loop” of ever more timely and accurate decisions.

The bottom line is that we find ourselves in a hybrid data landscape of such stunning heterogeneity that it forever changes both the challenges and the opportunities around the capture and analysis of relevant operational data – the business, human, and machine events that make up your data trail. The ability to manage, integrate, and analyze all these hybrid data events at price/performant scale – to build the necessary data-to-decision pipelines – becomes the key to modern data infrastructure and succeeding with modern analytics.


It is rare in life that one gets an opportunity to step back, take a fresh look and reset one’s mission and trajectory. For Actian, today is such a day, as we launch a new vision, a new product solution portfolio and of course, a new tagline. Got to have a new tagline! Although time will tell whether we have hit the mark, I can safely say that we are excited to reveal our new thinking and shine a bright light on it for all the world to see.

Our new vision is built on three observations and a call to action:

1. The World is Flat
It is an incontrovertible fact that data is “flattening” within organizations today. Diverse data is being created and consumed in every corner of a company and across its data ecosystem. Increasingly, the traditional one-place-for-everything data warehouse and today’s centralized data lake just seem like old tired thinking.

2. Data is a Social Animal
Data doesn’t like to live alone – to be effective, data needs to live in an ecosystem that is constantly changing and expanding as it is touched by entities both within and outside a company’s four walls. To truly extract insight from data, one needs context, and that context more often than not comes from other applications, processes, and data sources.

3. Think Big When You Think of the Cloud
Today, the cloud is much more than a place to deploy apps and data. Although the agility and economics of hybrid cloud computing are compelling, it is just the start. A “true” cloud solution is designed to enable companies to blend together data without physically moving it and derive actionable insight, including machine learning, that can be put to work at the speed of an organization’s business (e.g., make a real-time offer to a customer on your website). The traditional static monthly report, for most companies, has the same value as yesterday’s news – zero!

4. Activate Your Data
It seems clear that now is the time for a simple call to action – a call for organizations to “activate their data.” Forward-thinking companies are applying best-fit design tools and innovative technologies to embrace their data ecosystems to ensure their data makes a difference. Whether it is powering a real-time e-commerce offer, detecting financial fraud before it happens or predicting supply chain disruptions, it is critical that the underlying insight garnered can be acted upon at the speed of a company’s business.

This is a reversal of the traditional thinking that analytics tools dictate to the business what the data can and cannot do. Now, the business dictates what insight is needed—where, when and for whom. If an organization’s IT department can’t address these needs in an economical and agile fashion, then knowledge workers are increasingly finding alternative ways, often through a new generation of SaaS solutions, to get their needs met. Serve or be served…out the door!

Meet a Big Idea – Hybrid Data!
And behind all this new thinking is the powerful new concept of hybrid data. Hybrid data has multiple dimensions, including diverse data type and format, operational and transactional data, self-service access, external B2B data exchange and hybrid cloud deployment. Our view is simple – all data needs to be viewed as hybrid data that can be joined and blended with other data across an enterprise’s data ecosystem by anyone at any point in time. It is only when an organization can adopt this progressive approach that it can address the inherent limitations of traditional monolithic data repositories (a nice way to say Oracle or SAP) or alternative siloed point solutions.


Back when I started off in the industry, some 20-something years ago (I do pretend I am still in my 20s so that number has a nice ring to it) there was only one IT Department with one manager in most large organizations. Now there are multiple managers within different departments, some aligned to different parts of the organization. Some pieces are outsourced, some in-sourced and some have contractors working on them.

When it comes to connecting most systems together, the industry is focused on “having a connector to this or that” while the real hard part is how to connect to that particular implementation of that system.

As the technologies evolved over the years, so did the pillars (or silos) of teams. So providing an integration solution to connect multiple systems together is more of a project management (herding cats) nightmare than a connector nightmare. Let’s take a typical mid-sized company that wants to connect its cloud-based applications (CRM, HR, etc.) to its on-premises applications (SAP, Oracle Finance, Dynamics, databases, etc.). This is a pretty simple task, as we have all the connector options, and in the worst case we can always fall back on a web-service-based JSON/XML connector and database connectors. The problem of “do we have a connector to each system?” is solved within minutes.

The real problem and the time killer is how to connect and to whom we will give access. If we consider the layers of technology involved (taking OSI model as a method of stepping through access):

  • Physical Layer – how is the server connected and what speed limits could this be restricted by (is the server connected?).
  • Data Link Layer – what level of QoS do we have, are there any restrictions, which VLAN are we on and what does that VLAN have/not have access to?
  • Network Layer – can we perform a network test to each system we need to connect?
  • Transport Layer – can we retain a connection and what is the performance of that connection?
  • Session Layer – what are the authentication mechanisms for each system? Can we authenticate?
  • Presentation Layer – can we gain access to the metadata behind each system? Do we have sufficient rights?
  • Application Layer – can we see a sample of the data that we are connecting to? Does the data look like what we expected? Can we perform updates, inserts, upserts, deletes, and reads? Has the application been customized and can we access those customizations?
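
The lower-layer checks above can be scripted before any integration work begins. Here is a minimal sketch using only Python’s standard library; the function names are my own, and real target hosts and ports would be substituted in:

```python
import socket
from typing import Optional

def resolve(host: str) -> Optional[str]:
    """Network-layer check: does the hostname resolve to an address?"""
    try:
        return socket.gethostbyname(host)
    except socket.gaierror:
        return None

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Transport-layer check: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Session-layer concerns and above (authentication, metadata rights, sample reads) still need the system-specific clients, but a script like this answers the physical-to-transport questions in seconds rather than in an email thread.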

Achieving all of this requires working with different IT teams both internally and externally. It may require working with vendors or other developers outside of the organization as well. Consider the following roles (not an exhaustive list) that would require gaining their trust and knowledge/assistance:

  • Server/Hardware Manager – Virtual server, capacity, server install.
  • Operating System Specialists – Windows / Linux / AIX / etc. Ability to run your integration software? Installation, patching and maintenance? Remote access to the server?
  • Network Manager – In which zone was the server installed? Does it have connectivity to each system? Remote access to the server?
  • Security/Firewall – Which ports are locked down and needed opening for this new service? Is the anti-virus software causing issues? Remote access to the server? Browser access to the server?
  • Cloud Application Specialist – Method of access, security, ability to access? Can we log in?
  • Database Administrators – Database access, rights, simple database read tests.
  • Specialist Applications (SAP BAPI Developers) – Are there some custom BAPIs that need to be used? Which of the standard BAPIs should not be used? Can we use the fat client/web application to view and query the system? Can we use a test/development system?
  • Application Developers – Is there a standard method for requirements gathering, development methodology, peer reviews, user acceptance testing, system testing, load testing?

When we are required to prove we can connect to a system, we spend 90% of our time working with the people above and 10% in doing the actual connection. Knowing who to work with and gaining their trust and buy-in is the real hard yards.


Recently I worked on a POC that required some non-standard thinking. The challenge was that the customer’s use case needed not only high-performance SQL analytics but also a healthy amount of ETL (Extract, Transform, and Load). More specifically, the requirement was for ELT (or even ETLT if we want to be absolutely precise).

Why “might” this have been an issue? Well, typically analytical databases and ETL-style processing don’t play well together; the latter tends to be row-orientated, while the typical analytical database definitely prefers to deal with data in a “chunky” fashion. Analytical databases are usually able to load data in bulk at very high speed but tend to offer modest row-by-row throughput.

Another typical characteristic is the use of table-level write locking – serializing write transactions to one at a time. This is generally accepted as the use cases for analytical databases tend to be about queries rather than any kind of transaction processing. However, when some form of ETL is required it is perhaps even more problematic than the row-by-row throughput as it requires the designer and the loading tool to be aware of this characteristic. The designer often has to “jump through hoops” to figure out how to get the data into the analytical database in a way that other team members can understand and that the tool can deliver.

I’m setting the scene here for the “big reveal” that Actian’s vector processing databases do not suffer from these drawbacks. They can deliver high-end analytical capabilities while also offering “OLTP capabilities” in the manner of HTAP (Hybrid Transactional/Analytical Processing) technologies.

Note the quotes around “OLTP capabilities” – just to be clear, we at Actian wouldn’t position these as high-performance OLTP databases; we’re just saying that the capabilities (row-level locking and concurrent table modifications) are there even though the database is a columnar, in-memory, vector processing engine.

However they are viewed, it was these capabilities that allowed us to achieve the customer’s goals – albeit with a little cajoling. In the rest of this post, I’ll describe the steps we went through and the results we achieved. If you’re not currently a user of either Actian Vector or Actian Vector in Hadoop (VectorH), you might just skip to the end; if you are using the technology, read on.

Configuring for ETL

So, coming back to the use case: this customer needed to load large volumes of data from different sources in parallel into the same tables. Above we said that we offer “OLTP capabilities”; however, out of the box the configuration is geared towards one bulk update per table, so we needed to alter the configuration to handle multiple concurrent bulk modifications.

At their core, Actian databases have a columnar architecture, and in all cases the underlying column store is modified in a single transaction. The concurrent update feature comes from some clever technology that buffers updates in memory in a seamless and ACID-compliant way. The default configuration assumes a small memory model, and so routes large-scale changes directly to the column store while smaller updates are routed to the in-memory buffer. The maintenance operations performed on the in-memory buffer – such as flushing changes to the column store – are triggered by resource thresholds set in the configuration.

It’s here where, with the default configuration, you can face a challenge – situations arise where large-scale updates sent directly to the column store clash with the maintenance routine of the in-memory buffer. To make this work well, we need to adjust the configuration to cater for the fact that there is – almost certainly – more memory available than the default configuration assumes. Perhaps the installer could set these values accordingly, but with a large installed base it’s safer to keep the default behaviour the same for consistency between versions.

So we needed to do two things: first, route all changes through the in-memory buffer, and second, configure the in-memory buffer to be large enough for the amount of data we were going to load. A third, optional step would have been to make the maintenance routines manual and bake the commands that trigger them into the ETL processes themselves, giving those processes complete control of what happens when.

Routing all changes through the in-memory buffer is done using the insertmode setting. Changing this means that bulk operations that would normally go straight to the column store now go through the in-memory buffer allowing multiple bulk operations to be done concurrently.

Sizing the in-memory buffer is simply a matter of adjusting the threshold values to match the amount of memory available or, as suggested above, putting maintenance fully under the control of the ETL process.

The following configuration options affect the process:

  • update_propagation – Whether automatic maintenance is enabled.
  • max_global_update_memory – The total amount of memory that can be used by the in-memory buffer.
  • max_update_memory_per_transaction – As above, but per transaction.
  • max_table_update_ratio – The percentage of a table that can be held in the buffer before the maintenance process is initiated.
  • min_propagate_table_count – The minimum row count a table must have to be considered by the maintenance process.
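
As an illustration only, the changes might look like this in the server configuration. Treat this fragment as a sketch: the section names, value formats, and numbers here are assumptions to be verified against the Vector documentation for your version, not copy-paste settings.

```ini
; Illustrative fragment only -- confirm section names, valid values, and
; units in the Actian Vector documentation for your release.
[engine]
; Route bulk operations through the in-memory buffer so that multiple
; bulk modifications of the same table can run concurrently.
insertmode=row

[memory]
; Size the buffer for the expected volume of concurrent loads.
max_global_update_memory=4G
max_update_memory_per_transaction=1G
; Raise the threshold so automatic propagation to the column store
; triggers less often during heavy load windows.
max_table_update_ratio=50
```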

To trigger the maintenance process manually, execute:

modify <table> to combine

If you want to see more technical details of how to implement this processing, a knowledge base article is available here:

Results

The initial load run of the customer’s data – with the default configuration – took around 13 minutes. With some tuning of the memory parameters so that the maintenance routine was invoked less often, this came down to just over 9 minutes. Switching to all in-memory (still a single stream at this point) moved the needle to just under 9 minutes. This was an interesting aspect of the testing: routing everything through the in-memory buffer did not slow down the process – in fact it improved the time, albeit by a small factor.

Once the load was going via the in-memory buffer, the load could be done in parallel streams. The final result was being able to load the data in just over a minute via eight parallel streams. This was a nice result given that the customer’s existing – OLTP-based – system took over 90 minutes to load the same data with ten parallel streams.
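
The parallel streams in our POC were driven by the ETL tooling, but the pattern can be sketched with a small orchestration script. Here `vwload` is Vector’s bulk loader; the exact flags shown and the helper functions are illustrative assumptions, not a tested invocation:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def partition(files, streams):
    """Round-robin the input files across the requested number of streams."""
    buckets = [[] for _ in range(streams)]
    for i, name in enumerate(files):
        buckets[i % streams].append(name)
    return [b for b in buckets if b]  # drop empty buckets

def load_stream(dbname, table, files):
    """Run one loader process for one bucket of files (flags are illustrative)."""
    cmd = ["vwload", "-t", table, dbname] + files
    return subprocess.run(cmd, check=True)

def parallel_load(dbname, table, files, streams=8):
    """Launch the loader processes concurrently, one per bucket."""
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [pool.submit(load_stream, dbname, table, bucket)
                   for bucket in partition(files, streams)]
        for f in futures:
            f.result()  # propagate any loader failure
```

Because the in-memory buffer accepts concurrent bulk modifications to the same table, each stream can run as a straightforward bulk load with no coordination between them.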

Conclusion

Analytical databases typically face challenges when loading data via traditional ETL tools and methods, being characterised by low row-by-row processing speed and, most notably, table-level write locking.

Actian’s vector processing databases have innovative technology that allows them to avoid these problems and offer “OLTP capabilities”. While stopping short of targeting OLTP use cases, these capabilities allow Actian’s databases to perform high-performance loading concurrently and thereby deliver good performance for ETL workloads.

Read KB Article