Data Integration

Connecting ServiceNow Doesn’t Have to Be Difficult

Actian Corporation

May 4, 2017

We’re excited to be at Knowledge17 this year to show how the Actian hybrid integration platform can connect ServiceNow to your existing data sources, including on-premises and cloud-based applications. The ServiceNow model can define, structure and automate the flow of work within your enterprise, but only if you can connect to other applications and data supporting IT, human resources, facilities, field service and more. You certainly don’t want to hand code your connections and map every field shared with Workday, Salesforce, Oracle, SAP and NetSuite. The Actian hybrid integration platform can help!

Actian provides pre-built, certified integration connectors that map most of the commonly used fields across those popular applications and more. Actian also allows you to customize additional integrations through a drag-and-drop graphical user interface. How much time and effort could you save and how much faster could you start getting more value out of ServiceNow? In our experience, the average time and cost to implement decreased by 50%, and there’s minimal ongoing maintenance.

  • With Actian DataConnect, you can orchestrate service transactions within a single service model.
  • You can create and share a single system of record for service transactions that is instantly reflected back into other applications and services.
  • Actian enables you to achieve the scale, reliability, and flexibility needed to rapidly grow adoption of the ServiceNow platform.

To get that ‘lights-out, light-speed’ experience through the ServiceNow enterprise cloud, you need invisible integration with existing applications and data across the enterprise. Actian DataConnect and DataCloud can deliver those connections. Visit us at Knowledge17 in booth 633 or on the web at actian.com to find out more.

About Actian Corporation

Actian empowers enterprises to confidently manage and govern data at scale, streamlining complex data environments and accelerating the delivery of AI-ready data. The Actian data intelligence approach combines data discovery, metadata management, and federated governance to enable smarter data usage and enhance compliance. With intuitive self-service capabilities, business and technical users can find, understand, and trust data assets across cloud, hybrid, and on-premises environments. Actian delivers flexible data management solutions to 42 million users at Fortune 100 companies and other enterprises worldwide, while maintaining a 95% customer satisfaction score.
Data Management

Architecting Next-Generation Data Management Solutions

Actian Corporation

May 2, 2017

This is part 2 of our conversation with Forrester analyst Michele Goetz. The first post, Rethink Hybrid for the Data-Driven Enterprise, covers the trends in today's enterprise data market.

After a recent Actian webinar featuring Forrester Research, John Bard, senior director of product marketing at Actian, asked Forrester principal analyst Michele Goetz more about next-generation data management solutions. Here is the second part of that conversation:

John Bard, Actian:  What are the key business imperatives forcing a greater priority on query-processing speed for systems of insight?

Michele Goetz, Forrester:  More and more businesses are becoming digital. Retailers are creating digital experiences in their brick-and-mortar stores. Oil and gas companies are placing thousands of sensors on wells to get information on production and equipment states in real time. And the mobile mind shift is driving more and more consumer and business engagement through mobile apps. Everything happens in real time, delivered through a web of microservices, and increasingly sophisticated analytics are embedded in streams and processes. This places a significant demand on systems that have to hit high performance levels on massively orchestrated data services to get insight on demand, make decisions quickly, take action quickly, and achieve outcomes that meet business goals.

JB:  How important is it for operational data and systems of insight to be tightly linked? What are some applications/use cases driving that integration?

MG:  More and more, transactional systems have to operate on insight and not just as entry points to capture a transactional event. Analytics are running on streams of data and individual transactions such as purchases and business process events and transactions. These analytics provide suggestions and instructions to inform pricing, offers, next best action, and security/fraud patterns, along with automating manual processes. Today’s modern data platform has to run analytic and operational workloads side by side to not only enable a process but also capitalize on opportunities and threats as they occur.

JB:  How does an enterprise strike a balance between best-in-class solutions that often require integration versus all-in-one platforms that often force compromises?

MG:  For each business process, customer engagement, automated process, and partner engagement, there are different service-level needs for data and analytics. Data and data services have to be more personalized to the tasks at hand and desired outcomes. Upstream in-development applications are designed with specific requirements for data, insights, and the cadence for when data and insight are needed. These requirements manifest within the data and application APIs that drive microservices and business services. A monolithic all-in-one platform creates rigidity as a purpose-built system that is inflexible to business changes. The cost to purchase and maintain is significant and has an impact on the ability to modernize, thus building up technical debt. Additionally, for every new capability, a new silo is built, further fragmenting data and inhibiting insight. Companies need to move toward a hybrid approach that takes into account the cloud, data variety, service levels, best-in-class technologies, and open source for innovation. Hybrid systems allow flexibility and adaptability to drive service-oriented data toward business value without the cost and delivery bottlenecks that one-size-fits-all systems create.

JB:  What is the best design approach to accelerate development to achieve faster deployment to production and therefore business value?

MG:  Start with what the solution is supporting and the service levels it requires. Have an understanding of how that fits into specific data architecture patterns: data science for advanced analytics and visualization, intelligent transactional data, or analytic and BI workspaces. These patterns guide the choices for database, integration, and cloud while also helping to establish governance that guides trusted sources, repeatable and reusable data APIs and services, and the management of security policies.

JB:  What sort of new applications and services can be created from these new hybrid data architectures?

MG:  Hybrid data management is about putting the right data services and systems to the task and outcome at hand. It provides more freedom to introduce modern data technologies to quickly take advantage of capabilities to scale, get to insights you couldn't see because of lack of data access, and deliver data and insight in real time without the lag from nightly batch processing and reconciliation. Additionally, hybrid data management has better administrative layers to help manage the peaks and valleys across the ecosystem, avoid performance bottlenecks, and right-cost data service levels between cloud and on-premises systems. Going hybrid means getting access to all the data to create customer 360s that take personalization to the next level. It allows analytics to mature toward machine learning, advanced visualizations, and AI by providing a better data infrastructure backbone. And apps and products become more intelligent as hybrid systems create engagement that is insightful and adaptive to the way the solutions are used.

Data Management

Rethink Hybrid for the Data-Driven Enterprise

Actian Corporation

April 25, 2017

After a recent Actian webinar featuring Forrester Research, John Bard, senior director of product marketing at Actian, asked Forrester principal analyst Michele Goetz more about the trends in today’s enterprise data market.  Here is the first part of that conversation:

John Bard, Actian:  The enterprise market tends to think of “hybrid” as on-premises or cloud, but there are several other dimensions for hybrid. Can you elaborate on other ways “hybrid” applies to the data management and integration markets?

Michele Goetz, Forrester:  Hybrid architecture is really about spanning a number of dimensions: deployment, data types, access, and owner ecosystem. Analysts and data consumers can’t be hindered by technology and platform constraints that limit the reach into needed information to drive strategy, decisions, and actions to be competitive in fast-paced business environments. Information architects are required to think about aligning to data service levels and expectations, forcing them to make hybrid architecture decisions about cloud, operation, and analytic workloads; self-service and security; and where information sits internally or with trusted partners.

JB:  What factors do you think are important to customers evaluating databases when it comes to satisfying both transactional and analytic workloads?

MG:  Traditional approaches to database selection fell into either operational or analytic. Database environments were designed for one or the other. Today, operational and analytic workloads converge as transactional and log events are analyzed in streams and intelligently drive capabilities such as robotic process automation, just-in-time maintenance, and next-best-action or advise workers in their activities. Databases need the ability to run Lambda architectures and manage workloads across historical and stream data in a manner that supports real-time actions.

JB:  What are some of the market forces driving these other aspects of “hybrid” in data management?

MG:  Hybrid offers companies the ability to build adaptive, composable systems that are flexible to changing business demands for data and insight. New data marts can spin up and be retired at will, allowing organizations to reduce legacy marts and conflicting data silos. Hybrid data management provides a platform where services can be built on top, using APIs and connectors to connect any application. Cloud helps lower the total cost of ownership as new capabilities are spun up, while management layers allow administrators to easily shift workloads and data between cloud and on-premises to further optimize cost. Additionally, data service levels are better met by hybrid data management, as users can independently source, wrangle, and build out insights with lightweight tools for integration and analytics. In each of these examples, engineering and administration hours are reduced or current processes are optimized for rapid deployment and faster time-to-value for data.

JB:  What about hybrid data integration? That can span both data integration and application integration. What about business-to-business (B2B) integration? What about integration specialists versus “citizen integrators”?

MG:  Hybrid integration is defined by service-oriented data that spans data integration and application integration capabilities. Rather than relying strictly on extract, transform, and load (ETL)/extract, load, and transform (ELT) and change data capture, integration specialists have more integration tools in their toolbox to design data services based on service-level requirements. Streams allow integration to happen in real time with embedded analytics. Virtualization lets data views come into applications and business services without the burden of mass data movement. Open source ingestion provides support for a wider variety of data types and formats to take advantage of all data. APIs containerize data with application requirements and connectivity for event-driven views and insight. Data becomes tailored to business needs.

The other wave in data integration is the emergence of self-service, or the citizen integrator. With little more than an understanding of data and how to manipulate data in Excel with simple formulas, people with less technical skills can leverage and reuse APIs to get access to data or use catalog and data preparation tools to wrangle data and create data sets and APIs for data sharing. Data administrators and engineers still have visibility into and control over citizen integrator content, but they are able to focus on complex solutions and open up the bottlenecks to data that users experienced in the past.

Overall, these two trends extend flexibility, allow deployments to scale, and get to data value faster.

Hybrid data management and integration is the next-generation strategy for enterprises to go from data rich to data driven. As companies retool their businesses for digital, the internet of things (IoT), and new competitive threats, the ability to have architectures that are flexible and adapt and scale for real-time data demands will be critical to keep up with the pace of business change. Ultimately, companies will be defined and valued by the market according to their ability to harness data to stay ahead and viable.

Data Management

Hybrid Data is a Gamechanger

Actian Corporation

April 19, 2017

Imagine this scenario – you have just “clicked” on an item that you are ordering online. What kind of “data trail” have you generated? Well, you are sure to have generated a transaction – the kind of “business event” that goes into a seller’s accounting system, and then on to their data warehouse for subsequent sales analysis. This was pretty much your entire data trail until just a few years ago.

In recent times, the whole notion of data trails has exploded. The first wave of new data entering your data trail consisted of web and mobile interactions – those dozens or hundreds (or even thousands) of “human events” – research clicks and social media postings that you execute leading up to and after an online order. It turns out that these human interactions, when blended with business transactions, are critical to yielding more insight into behavior.

And now we are entering the next wave of new data – the observations made by the ever-increasing number of intelligent sensors that record every “machine event.” In our example above, for each human interaction supporting your online order, there may be hundreds or thousands of software, network, location, and device metrics being gathered and added to your data trail. Further integrating and correlating these machine observations into your particular flow of business transactions and human interactions would enable game-changing advanced analytic capabilities – promising a “closed-loop” of ever more timely and accurate decisions.

The bottom line is that we find ourselves in a hybrid data landscape of such stunning heterogeneity that it forever changes both the challenges and the opportunities around the capture and analysis of relevant operational data – the business, human, and machine events that make up your data trail. The ability to manage, integrate, and analyze all these hybrid data events at scale and at the right price/performance – to build the necessary data-to-decision pipelines – becomes the key to modern data infrastructure and to succeeding with modern analytics.

Actian Life

The Start of Something Really Big

Actian Corporation

April 17, 2017

It is rare in life that one gets an opportunity to step back, take a fresh look and reset one’s mission and trajectory. For Actian, today is such a day, as we launch a new vision, a new product solution portfolio and of course, a new tagline. Got to have a new tagline! Although time will tell whether we have hit the mark, I can safely say that we are excited to reveal our new thinking and shine a bright light on it for all the world to see.

Our new vision is built on three observations and a call to action:

1. The World is Flat
It is an incontrovertible fact that data is “flattening” within organizations today. Diverse data is being created and consumed in every corner of a company and across its data ecosystem. Increasingly, the traditional one-place-for-everything data warehouse and today’s centralized data lake just seem like old tired thinking.

2. Data is a Social Animal
Data doesn’t like to live alone – to be effective, data needs to live in an ecosystem that is constantly changing and expanding as it is touched by entities both within and outside a company’s four walls. To truly extract insight from data, one needs context, and that context more often than not comes from other applications, processes, and data sources.

3. Think Big When You Think of the Cloud
Today, the cloud is much more than a place to deploy apps and data. Although the agility and economics of hybrid cloud computing are compelling, it is just the start. A “true” cloud solution is designed to enable companies to blend together data without physically moving it and derive actionable insight, including machine learning, that can be put to work at the speed of an organization’s business (e.g., make a real-time offer to a customer on your website). The traditional static monthly report, for most companies, has the same value as yesterday’s news – zero!

4. Activate Your Data
It seems clear that now is the time for a simple call to action – a call for organizations to “activate their data.” Forward-thinking companies are applying best-fit design tools and innovative technologies to embrace their data ecosystems to ensure their data makes a difference. Whether it is powering a real-time e-commerce offer, detecting financial fraud before it happens or predicting supply chain disruptions, it is critical that the underlying insight garnered can be acted upon at the speed of a company’s business.

This is a reversal of the traditional thinking that analytics tools dictate to the business what the data can and cannot do. Now, the business dictates what insight is needed—where, when and for whom. If an organization’s IT department can’t address these needs in an economical and agile fashion, then knowledge workers are increasingly finding alternative ways, often through a new generation of SaaS solutions, to get their needs met. Serve or be served…out the door!

Meet a Big Idea – Hybrid Data!
And behind all this new thinking is the powerful new concept of hybrid data. Hybrid data has multiple dimensions, including diverse data type and format, operational and transactional data, self-service access, external B2B data exchange and hybrid cloud deployment. Our view is simple – all data needs to be viewed as hybrid data that can be joined and blended with other data across an enterprise’s data ecosystem by anyone at any point in time. It is only when an organization can adopt this progressive approach that it can address the inherent limitations of traditional monolithic data repositories (a nice way to say Oracle or SAP) or alternative siloed point solutions.

Data Integration

Knowing Who is Who in the Zoo is Important in the Data Integration Industry

Actian Corporation

October 12, 2016

Back when I started off in the industry, some 20-something years ago (I do pretend I am still in my 20s, so that number has a nice ring to it), there was only one IT department with one manager in most large organizations. Now there are multiple managers within different departments, some aligned to different parts of the organization. Some pieces are outsourced, some are in-sourced, and some have contractors working on them.

When it comes to connecting most systems together, the industry is focused on “having a connector to this or that” while the real hard part is how to connect to that particular implementation of that system.

As the technologies evolved over the years, the pillars (or silos) of teams evolved with them. So providing an integration solution to connect multiple systems together is more of a project management (herding cats) nightmare than a connector nightmare. Let's take a typical mid-sized company that wants to connect its cloud-based applications (CRM, HR, etc.) to its on-premises applications (SAP, Oracle Finance, Dynamics, databases, etc.). It is a pretty simple task, as we have all the connector options, and in the worst case we can always fall back on a web-service-based JSON/XML connector and database connectors. The problem of "do we have a connector to each system" is solved within minutes.

The real problem, and the time killer, is how to connect and to whom we will give access. Consider the layers of technology involved (taking the OSI model as a method of stepping through access):

  • Physical Layer – how is the server connected, and what speed limits might restrict it (is the server even connected)?
  • Data Link Layer – what level of QoS do we have, are there any restrictions, which VLAN are we on and what does that VLAN have/not have access to?
  • Network Layer – can we perform a network test to each system we need to connect?
  • Transport Layer – can we retain a connection and what is the performance of that connection?
  • Session Layer – what are the authentication mechanisms for each system? Can we authenticate?
  • Presentation Layer – can we gain access to the metadata behind each system? Do we have sufficient rights?
  • Application Layer – Can we see a sample of the data that we are connecting to? Does the data look like what we expected? Can we perform updates, inserts, upserts, deletes, and reads? Has the application been customized and can we access those customizations?

Achieving all of this requires working with different IT teams both internally and externally. It may require working with vendors or other developers outside of the organization as well. Consider the following roles (not an exhaustive list) that would require gaining their trust and knowledge/assistance:

  • Server/Hardware Manager – Virtual server, capacity, server install.
  • Operating System Specialists – Windows / Linux / AIX / etc. Ability to run your integration software? Installation, patching and maintenance? Remote access to the server?
  • Network Manager – In which zone was the server installed? Does it have connectivity to each system? Remote access to the server?
  • Security/Firewall – Which ports are locked down and needed opening for this new service? Is the anti-virus software causing issues? Remote access to the server? Browser access to the server?
  • Cloud Application Specialist – Method of access, security, ability to access? Can we log in?
  • Database Administrators – Database access, rights, simple database read tests.
  • Specialist Applications (SAP BAPI Developers) – Are there custom BAPIs that need to be used? Which of the standard BAPIs should not be used? Can we use the fat client/web application to view and query the system? Can we use a test/development system?
  • Application Developers – Is there a standard method for requirements gathering, development methodology, peer reviews, user acceptance testing, system testing, load testing?

When we are required to prove we can connect to a system, we spend 90% of our time working with the people above and 10% in doing the actual connection. Knowing who to work with and gaining their trust and buy-in is the real hard yards.

Data Management

Hadoop Short Circuit Reads and Database Performance

Actian Corporation

August 2, 2016

If you’ve been working with Hadoop then you’ve likely come across the concept of Short Circuit Reads (SCRs) and how they can aid performance. These days they are mostly enabled by default (although not in “vanilla” Apache or close derivatives like Amazon EMR).

Actian VectorH brings high performance SQL, ACID compliance, and enterprise security into Hadoop and I wanted to test how important, or otherwise, SCRs were to the observed performance.

The first challenge I had was figuring out exactly what is in place and what is deprecated. This is quite common when working with Hadoop – sometimes the usually very helpful Google search throws up lots of conflicting information, and not all of it is tagged with a handy date to assess the relevance of the material. In the case of SCRs, the initial confusion is mainly down to there being two ways of implementing them: the older way, which grants direct access to the HDFS blocks – achieving the performance gains but opening a security hole – and the newer method, which uses a memory socket and thereby keeps the DataNode in control.

Note that for this entry I’m excluding MapR from the discussion. As far as I’m aware MapR implements its equivalent of SCRs automatically and does not require configuration (please let me know if that’s not the case).

With the newer way, the only things needed to get SCRs working are a valid native library, the property dfs.client.read.shortcircuit set to true, and the property dfs.domain.socket.path set to something that both the client and the DataNode can access. Note that there are other settings that affect the performance of SCRs, but this entry doesn't examine those.

On my test Hortonworks cluster this is what I get as default:

# hadoop checknative -a
16/03/09 12:21:41 INFO bzip2.Bzip2Factory: Successfully loaded & initialized …
16/03/09 12:21:41 INFO zlib.ZlibFactory: Successfully loaded & initialized …
hadoop: true /usr/hdp/2.3.2.0-2950/hadoop/lib/native/libhadoop.so.1.0.0
zlib: true /lib64/libz.so.1
snappy: true /usr/hdp/2.3.2.0-2950/hadoop/lib/native/libsnappy.so.1
lz4: true revision:99
bzip2: true /lib64/libbz2.so.1
openssl: false Cannot load libcrypto.so (libcrypto.so: cannot open shared …

# hdfs getconf -confkey dfs.client.read.shortcircuit
true
# hdfs getconf -confkey dfs.domain.socket.path
/var/lib/hadoop-hdfs/dn_socket

For the testing itself I used one of our standard demo clusters. This has 5 nodes (1 NameNode and 4 DataNodes) running, in this case, HDP 2.2 on top of RHEL 6.5. The DataNodes are HP DL380 G7s with 2 x 6 cores @ 2.8GHz, 288GB RAM, 1x 10GbE network card, and 16x 300GB SFF SAS drives (so a reasonable spec as of 2012, but a long way from today's "state of the art").

The data for the test is a star schema with around 25 dimension tables ranging in size from 100 to 50 million rows, and a single fact table with 8 billion rows.

The queries in the demo join two or more of the dimension tables to the fact table and filter on a date range along with other predicates.

Here I hit my second challenge – most of the queries used in the demo run in fractions of a second, so there is not much opportunity to measure the effect of SCRs. For example, the query below runs in 0.3 seconds against 8 billion rows (each date targets around 80 million rows):
[sql]
select
    d1.DateLabel, d2.MeasureLabel, sum(f.MeasureValue)
from
    Fact f,
    Date_Dim d1,
    Measure_Dim d2
where
    f.DateId = d1.DateId
    and d2.MeasureId = f.MeasureId
    and d1.DateLabel in ('05-Feb-2015')
group by
    d1.DateLabel,
    d2.MeasureLabel
[/sql]
To provide a chance to observe the benefits of SCRs, I created a query that aggregated all 8 billion rows against all rows of both dimension tables (i.e., removing the predicate "and d1.DateLabel in (…)"). This creates a result set of tens of thousands of rows, but that doesn't skew the result time enough to invalidate the test.
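Concretely, the test query is just the one above with the date predicate removed, so it looks roughly like this:

[sql]
-- same joins as before, but aggregating across all dates and measures
select
    d1.DateLabel, d2.MeasureLabel, sum(f.MeasureValue)
from
    Fact f,
    Date_Dim d1,
    Measure_Dim d2
where
    f.DateId = d1.DateId
    and d2.MeasureId = f.MeasureId
group by
    d1.DateLabel,
    d2.MeasureLabel
[/sql]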

To make sure that all data was being read from the DataNode, Actian VectorH was restarted before the query was run. The Linux file system cache was left as is so as not to disrupt anything that the DataNode might be doing internally with file handles etc.

Armed with this new query, I ran my first tests comparing the results with and without SCRs and got no difference at all – huh! Exploring this a little, I found that VectorH was using the network very efficiently, so the absence of SCRs was not affecting the read times. I then simulated some network load by using scp to send data between the different nodes while running the queries. Under these conditions, SCRs had an overall positive impact of around 10%.

Test                             SCR             No SCR
No network load, first run       29.8 seconds    29.8 seconds
No network load, second run      3.9 seconds     3.9 seconds
With busy network, first run     35.1 seconds    38.3 seconds
With busy network, second run    4.6 seconds     5.1 seconds

In conclusion, given a good networking setup and software that makes good use of it, SCRs may not provide a performance benefit. However, if the network is reasonably busy, then SCRs will likely help. The difference measured here was around 10%.

Data Integration

Efficient ETL in an Analytical Database?

Actian Corporation

June 27, 2016

Recently I worked on a POC that required some non-standard thinking. The challenge was that the customer's use case needed not only high-performance SQL analytics but also a healthy amount of ETL (Extract, Transform, and Load) processing. More specifically, the requirement was for ELT (or even ETLT, if we want to be absolutely precise).

Why "might" this have been an issue? Well, typically analytical databases and ETL-style processing don't play well together; the latter tends to be row-oriented, while the typical analytical database definitely prefers to deal with data in a "chunky" fashion. Analytical databases are typically able to load data in bulk at very high speed but tend to offer modest row-by-row throughput.

Another typical characteristic is the use of table-level write locking – serializing write transactions to one at a time. This is generally accepted as the use cases for analytical databases tend to be about queries rather than any kind of transaction processing. However, when some form of ETL is required it is perhaps even more problematic than the row-by-row throughput as it requires the designer and the loading tool to be aware of this characteristic. The designer often has to “jump through hoops” to figure out how to get the data into the analytical database in a way that other team members can understand and that the tool can deliver.

I'm setting the scene here for the "big reveal": the Actian vector processing databases do not suffer from these drawbacks. They deliver high-end analytical capabilities while also offering "OLTP capabilities" in the manner of HTAP (Hybrid Transactional/Analytical Processing) technologies.

Note the quotes around "OLTP capabilities" – just to be clear, we at Actian wouldn't position these as high-performance OLTP databases; we're just saying that the capabilities (row-level locking and concurrent table modifications) are there even though the database is a columnar, in-memory, vector processing engine.

However they are viewed, it was these capabilities that allowed us to achieve the customer's goals – albeit with a little cajoling. In the rest of this post, I'll describe the steps we went through and the results we achieved. If you're not currently a user of either Actian Vector or Actian Vector in Hadoop (VectorH), you might just skip to the end; if you are using the technology, read on.

Configuring for ETL

So coming back to the use case: this customer's requirement was to load large volumes of data from different sources, in parallel, into the same tables. As noted above, we offer "OLTP capabilities"; however, the out-of-the-box configuration is more aligned to one bulk update per table – we needed to alter the configuration to deal with multiple concurrent bulk modifications.

At their core, Actian databases have a columnar architecture and in all cases the underlying column store is modified in a single transaction. The concurrent update feature comes from some clever technology that buffers updates in-memory in a seamless and ACID compliant way. The default configuration assumes a small memory model and so routes large scale changes directly to the column store while smaller updates are routed to the in-memory buffer. The maintenance operations performed on the in-memory buffer – such as flushing changes to the column store – are triggered by resource thresholds set in the configuration.

It's here where, with the default configuration, you can face a challenge – situations arise where large-scale updates sent directly to the column store can clash with the maintenance routine of the in-memory buffer. To make this work well, we need to adjust the configuration to cater for the fact that there is – almost certainly – more memory available than the default configuration assumes. Perhaps the installer could set these values accordingly, but with a large installed base it is safer to keep the default behaviour the same for consistency between versions.

So we needed to do two things: first, route all changes through the in-memory buffer, and second, configure the in-memory buffer to be large enough to cater for the amount of data we were going to load. We might also have done a third thing, which is to make the maintenance routines manual and bake the commands that trigger them into the ETL processes themselves, giving those processes complete control over what happens when.

Routing all changes through the in-memory buffer is done using the insertmode setting. Changing this means that bulk operations that would normally go straight to the column store now go through the in-memory buffer, allowing multiple bulk operations to be performed concurrently.

Sizing the in-memory buffer is simply a matter of adjusting the threshold values to match the amount of memory available or, as suggested above, putting the ETL process completely in control.

The table below describes the configuration options that affect the process:

Option                              Meaning
update_propagation                  Whether automatic maintenance is enabled.
max_global_update_memory            The amount of memory that can be used by the in-memory buffer.
max_update_memory_per_transaction   As above, but per transaction.
max_table_update_ratio              The percentage of a table held in the buffer above which the maintenance process is initiated.
min_propagate_table_count           The minimum row count a table must have to be considered by the maintenance process.

To trigger the maintenance process manually, execute:

modify <table> to combine
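As a minimal sketch of how an ETL process might use this (the table name sales_fact is hypothetical), the final step of a load job could trigger the merge of the buffered changes into the column store itself:

[sql]
-- run once all parallel load streams have committed;
-- merges the in-memory buffer contents into the column store
modify sales_fact to combine
[/sql]

Putting this statement at the end of the load job, with the automatic maintenance thresholds raised, gives the ETL process full control over when the propagation cost is paid.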

If you want to see more technical details of how to implement this processing, a knowledge base article is available via the link at the end of this post.

Results

The initial load of the customer's data – with the default configuration – took around 13 minutes. With some tuning of the memory parameters to have the maintenance routine invoked less often, this came down to just over 9 minutes. Switching to all in-memory (still a single stream at this point) moved the needle to just under 9 minutes. This was an interesting aspect of the testing – routing everything through the in-memory buffer did not slow down the process; in fact, it improved the time, albeit by a small factor.

Once the load was going via the in-memory buffer, it could be done in parallel streams. The final result was being able to load the data in just over a minute via eight parallel streams. This was a nice result given that the customer's existing – OLTP-based – system took over 90 minutes to load the same data with ten parallel streams.

Conclusion

Analytical databases typically face challenges when trying to load data via traditional ETL tools and methods, being characterised by low row-by-row processing speed and, most notably, table-level write locking.

Actian's vector processing databases have innovative technology that allows them to avoid these problems and offer "OLTP capabilities". While stopping short of targeting OLTP use cases, these capabilities allow Actian's databases to perform high-speed loading concurrently and thereby provide good performance for ETL workloads.

Read KB Article

Insights

Performance Troubleshooting Tips for Actian Vector in Hadoop

Actian Corporation

June 20, 2016

Actian Vector and Vector in Hadoop are powerful tools for efficiently running queries. However, most users of data analytics platforms seek ways to optimize performance and gain incremental query improvements.

The Actian Service and Support team works with our customers to identify common areas that should be investigated when trying to improve query performance. Most of our recommendations apply equally well to Actian Vector (single node) as to Actian Vector in Hadoop (VectorH, a multi-node cluster on Hadoop).

Actian has recently published an in-depth overview of technical insights and best practices to help all Vector and VectorH users optimize performance, with a special focus on VectorH.

Unlike SQL query engines on Hadoop (Hive, Impala, Spark SQL, etc.), VectorH is a true columnar MPP RDBMS with full SQL capabilities, ACID transactions (i.e., support for updates and deletes in place), built-in robust security options, and more. This flexibility allows VectorH to be optimized for complex workloads and environments.

Note that Vector and VectorH are very capable of running queries efficiently without using any of the examined techniques. But these techniques will come in handy for demanding workloads and busy Hadoop environments and will allow you to get the best results from your platform.

Through our work with customers, we have found the following areas should be investigated to achieve maximum performance.

Partition Your Tables

One very important consideration in schema design for any Massively Parallel Processing (MPP) system like VectorH is how to spread data around a cluster so as to balance query execution evenly across all of the available resources. If you do not explicitly partition your tables when they are created, VectorH will by default create non-partitioned tables – but for best performance, you should always partition the largest tables in your database.
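As a rough sketch of what that looks like (the table, columns, and partition count here are hypothetical – the appropriate number of partitions depends on your cluster), the largest tables would be created with an explicit partitioning clause along these lines:

[sql]
-- hash-partition the large fact table across the cluster
create table fact_sales (
    sale_id   bigint  not null,
    date_id   integer not null,
    store_id  integer not null,
    amount    decimal(12,2)
) with partition = (hash on sale_id 16 partitions)
[/sql]

Picking a high-cardinality key for the hash also helps avoid the data skew discussed next.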

Avoid Data Skew

Unbalanced data, where a small number of machines have much more data than most of the others, is known as data skew. Data skew can cause severe performance problems for queries, since the machine with the disproportionate amount of data governs the overall query speed and can become a bottleneck.

Missing Statistics

Data distribution statistics are essential to the proper creation of a good query plan. In the absence of statistics, the VectorH query optimizer makes certain default assumptions about things like how many rows will match when two tables are joined together. When dealing with larger data sets, it is much better to have real data about the actual distribution of data, rather than to rely on these estimates.
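As an illustrative sketch (the table name is hypothetical, and the exact syntax and options vary by release, so check the SQL guide for your version), statistics can be regenerated after a large load with a statement along these lines, or with the optimizedb utility:

[sql]
-- refresh optimizer statistics so the planner sees realistic
-- row counts and value distributions for the fact table
create statistics for fact_sales
[/sql]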

Sorting Data

The relational model of processing does not require that data is sorted on disk – instead, an ORDER BY clause is used on a query that needs the data back in a particular sequence.

However, by using what are known as MinMax indexes (maintained automatically within a table's structure, without user intervention), VectorH can use ordered data to efficiently eliminate unnecessary blocks of data from processing and hence speed up query execution, when queries have a WHERE clause or join restriction on a column that the table is sorted on.
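One common way to take advantage of this (a sketch only, with hypothetical staging and target tables) is to populate the table in sorted order on the column that most queries filter on, so the MinMax indexes can skip blocks that fall outside the requested range:

[sql]
-- load the fact table ordered by the commonly filtered date column
insert into fact_sales
    select sale_id, date_id, store_id, amount
    from   fact_sales_staging
    order by date_id
[/sql]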

Using the Most Appropriate Data Types

As with any database, choosing the right data types for your schema and queries can make a big difference to VectorH's performance, so don't make it a practice to use the maximum column size for convenience. Instead, consider the largest values you are likely to store in a VARCHAR column, for example, and size your columns accordingly.

Because VectorH compresses column data very effectively, creating columns much larger than necessary has minimal impact on the size of data tables. As VARCHAR columns are internally stored as null-terminated strings, the size of the VARCHAR has no effect on query processing times. However, it does influence frontend communication times, as the data is stored at the maximum defined length after it leaves the engine. Note, however, that storing data that is inherently numeric (IDs, timestamps, etc.) as VARCHAR data is very detrimental to the system, as VectorH can process numeric data much more efficiently than character data.
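A small sketch of the point (the table and columns are hypothetical): keep identifiers and timestamps in native numeric and temporal types, and size character columns to the data you actually expect rather than defaulting to the maximum.

[sql]
-- preferred: native types and realistically sized character columns
create table customer_events (
    event_id     bigint    not null,   -- not varchar(64)
    customer_id  integer   not null,
    event_time   timestamp not null,   -- not varchar(32)
    channel      varchar(20),          -- sized to the real data
    notes        varchar(400)
)
[/sql]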

Memory Management for Small Changes

VectorH has a patent-pending mechanism for efficiently dealing with many small changes to data, called Positional Delta Trees (PDTs). These also allow the use of update and delete statements on data stored in an append-only file system like HDFS.

However, if a lot of update, insert, or delete statements are being executed, memory usage for the PDT structures can grow quickly. If large amounts of memory are used, the system becomes slower at processing future changes, and eventually memory will be exhausted. Management of this memory is handled automatically; however, the user can also directly issue a 'combine' statement, which merges the changes from the PDTs back into the main table in a process called update propagation. There are a number of triggers that make the system perform this maintenance automatically in the background (such as thresholds for the total memory used for PDTs, or the percentage of rows updated), so this is usually transparent to the user.

Optimizing for Concurrency

VectorH is designed to allow a single query to run using as many parallel execution threads as possible to attain maximum performance. However, perhaps atypically for an MPP system, it is also designed to allow for high concurrency, with democratic allocation of resources when a high number of queries is presented to the system. VectorH will handle both of these situations with "out-of-the-box" settings, but it can be tuned to suit the needs of the application (for example, to cater for a higher throughput of heavy-duty queries by curtailing the maximum resources any one query can acquire).

The number of concurrent connections (64 by default) that a given VectorH instance will accept is governed by the connect_limit parameter, stored in config.dat and managed through the CBF utility. But there are usually more connections than executing queries, so how are resources allocated among concurrent queries?

By default, VectorH tries to balance single-query and multi-query workloads. The key parameters in balancing this are:

  • The number of CPU cores in the VectorH cluster.
  • The number of threads that a query is able to use.
  • The number of threads that a query is granted by the system.
  • The number of queries currently executing in the system.

Summary

Vector and VectorH are very capable of running queries efficiently without using any of the techniques and tips described here. But the more demanding your workload, in terms of data volumes, query complexity, or user concurrency, the more that applying some of the tips defined in the full report will allow you to get the best results from your platform.

Data Analytics

Actian Vector Extends Dominance as Certifiably Fastest!

Actian Corporation

June 13, 2016

Actian announced a new record TPC-H result with Hewlett Packard Enterprise to demonstrate the big data analytics performance of the new Intel Xeon processor E7-8890 v4. With over 2.14M QphH and $0.38/QphH, Actian Vector sets new records for performance and price/performance on non-clustered systems for the 3000GB scale factor. These numbers beat the previous record performance published on Cisco using Microsoft SQL Server by 2X, at a 36% lower cost per query.

What do these results show about the Actian Vector database performance that relates to a business benefit? Reducing the overall solution cost makes it possible to incorporate more data into the business analytics system to provide a broader perspective from which to gain business insights. Faster query performance shortens turn-around time on business questions and makes insight into bigger problems more reasonable. With better-informed and faster business insights, companies can increase revenue and profits, improve productivity and customer loyalty, and reduce operating costs and business risks.

The Transaction Processing Performance Council (TPC) develops objective, verifiable performance benchmarks. This decision support workload was designed for broad industry-wide relevance, with 22 different business-oriented queries simulating the analysis needed to make better business decisions around sales, inventory, production, etc. There are also multiple steps of mandated inserts and deletes that an independent auditor verifies to ensure the database maintains ACID semantics, even through a hardware failure.

These record-setting results show how well Actian Vector handles this workload – 5 times faster and 50% less expensive than an Oracle solution from 3 years ago, for example. Ad-hoc cost per query has dropped from $21 to just $0.38 over the past six years – a 98% cost reduction – while queries-per-hour capability has increased by 19X.

Internally, we compared our performance running the same workload on exactly the same configuration using the prior generation processors (E7-8890 v3), which showed a 20% difference due to the processor change. Clearly, one large factor contributing to the doubling of TPC-H performance over the Cisco configuration was changing the database from Microsoft SQL Server 2016 to Actian Vector with our columnar data format and efficient data compression. Actian Vector takes direct advantage of the large caches and vector instruction extensions Intel has designed into the Xeon processors. Vector software features deliver on the performance of industry-standard CPUs and servers, eliminating the need for specialized hardware appliances to achieve capacity and performance. Other factors may have been the choice of operating system (RHEL 7 rather than Windows Server), and the different storage subsystems.

Since the X100 query engine that drives Actian Vector performance is the same query engine at the heart of Actian Vector in Hadoop, we look forward to sharing audited results on a Hadoop cluster in the near future.

Insights

Accelerating Spark With Actian Vector in Hadoop

Actian Corporation

May 9, 2016

One of the hottest projects in the Apache Hadoop community is Spark, and we at Actian are pleased to announce a Spark-Vector connector for the Actian Vector in Hadoop platform (VectorH) that links the two together. VectorH provides the fastest and most complete SQL in Hadoop solution, and connecting to Spark opens up interfaces to new data formats and functionality like streaming and machine learning.

Why Use VectorH With Spark?

VectorH is a high-performance, ACID-compliant analytical SQL database management system that leverages the Hadoop Distributed File System (HDFS) or MapR-FS for storage and Hadoop YARN for resource management. If you want to write in SQL and do complex SQL tasks, you need VectorH. SparkSQL is just a subset of SQL and must be invoked from a program written in Scala, R, Python, or Java.

VectorH is a mature, enterprise-grade RDBMS, with an advanced query optimizer, support for incremental updates, and certification with the most popular BI tools. It also includes advanced security features and workload management. The columnar data format in VectorH and optimized compression means faster query performance and more efficient storage utilization than other common Hadoop formats.

Why Use Spark With VectorH?

Spark offers a distributed computational engine that extends functionality to new services like structured processing, streaming, machine learning, and graph analysis. As a platform for the data scientist, Spark opens the door to anyone who wants to work in Scala, R, Python, or Java.

This Spark-Vector connector dramatically expands VectorH access to the broader reach of Spark connectivity and functionality. One very powerful use case is the ability to transfer data from Spark into VectorH in a highly parallel fashion. This ETL capability is one of the most common use cases for Apache Spark.

If you are not a Spark programmer yet, the connector provides a simple command line loader that leverages Spark internally and allows you to load CSV, Parquet and ORC files without having to write a single line of Spark code. Spark is a standard supported component with all major Hadoop distributions so you should be able to use the connector by following the information on the connector site.

How Does it Work?

The connector loads data from Spark into Vector and also retrieves data from VectorH via SQL. The first part is done in parallel: data coming from every input RDD partition is serialized using Vector's binary protocol and transferred through socket connections to Vector end points using Vector's DataStream API. Most of the time, the connector will assign only local RDD partitions to each Vector end point to preserve data locality and avoid delays incurred by network communication. In the case of data retrieval from Vector into Spark, data is exported from Vector and ingested into Spark using a JDBC connection to the leader Vector node. The connector works with both Vector SMP and VectorH MPP, and with Spark 1.5.x. An overview of the data movement is shown below:

[Diagram: data movement between Spark and Vector/VectorH via the Spark-Vector connector]

What Else is There?

This latest VectorH release (4.2.3) also includes the following new features:

  • YARN support on MapR, in addition to the Cloudera and Hortonworks distributions already certified. With native YARN support, you can run multiple workloads in the same Hadoop cluster and share the entire pool of resources.
  • Per-query profile files can be written to a specified directory, including an HDFS directory, giving you more flexibility and control to manage and share query profiles across users.
  • New options to display the status of cluster services, including basic node health, Kerberos access (if enabled), MPI access, HDFS Safemode, and HDFS fsck.
  • A new option to create MinMax indexes on a subset of columns, along with improved MinMax memory management that lowers CPU and memory overhead.

Learn more at https://www.actian.com/products/ or contact sales@actian.com to speak with a representative.


About Actian Corporation

Actian empowers enterprises to confidently manage and govern data at scale, streamlining complex data environments and accelerating the delivery of AI-ready data. The Actian data intelligence approach combines data discovery, metadata management, and federated governance to enable smarter data usage and enhance compliance. With intuitive self-service capabilities, business and technical users can find, understand, and trust data assets across cloud, hybrid, and on-premises environments. Actian delivers flexible data management solutions to 42 million users at Fortune 100 companies and other enterprises worldwide, while maintaining a 95% customer satisfaction score.
Insights

Amazon EMR as an Easy-to-Set-Up Hadoop Platform

Actian Corporation

May 3, 2016


Recently I helped a customer evaluate Actian Vector in Hadoop (VectorH) to see whether it could maintain "a few seconds" query performance as data volumes grew from one billion to tens of billions of rows (it did, but that's not the subject of this entry).

The customer in question was only moderately invested in Hadoop and was looking for a way to set up the environment without getting deep into Hadoop's details. This aligns nicely with Actian's vision for the Hadoop versions of its software: allowing customers to run real applications for business users in Hadoop without the "developer-heavy" work that Hadoop typically requires.

Actian has released a provisioning tool, the Actian Management Console (AMC), that installs and configures the operating system, Hadoop, and the Actian software in a click-of-a-button style that will be ideal for this customer. AMC currently supports the Amazon EC2 and Rackspace clouds.

At the time of the evaluation, however, AMC wasn't yet available, so we looked for an alternative and decided to try Amazon's EMR (Elastic MapReduce), a fast way to get a Hadoop cluster up and running with minimal effort. This entry looks at what we found and lists the pros and cons.

Pros and Cons of Amazon EMR

EMR is very easy to set up: go to the Amazon console, select the instance types you want and how many of them, push the button, and in a few minutes you have a running Hadoop cluster. Getting up and running this way is faster than using AMC to provision a cluster from scratch, since EMR pre-bakes some of the installation steps for you.

By default, you get Amazon’s Hadoop flavor, which is a form of Apache Hadoop with most of the recommended patches and tweaks applied. It is possible however to specify Hortonworks, Cloudera, and MapR to be used with EMR, and other add-ons such as Apache Spark and Apache Zeppelin. In this case, we used the default Apache Hadoop distribution.

Hadoop purists may get a "wrinkled brow" from some of the terms used. For example, the DataNodes are referred to as "core" nodes in EMR, and the NameNode is referred to as the "master" node (for those new to both EMR and VectorH, be aware that both use the term "master", for different things).

The configuration of certain items is adjusted automatically depending on the size of the cluster. For example, the HDFS replication factor is set to one for clusters of fewer than four nodes, two for clusters of four to nine nodes, and the usual three for clusters of ten or more. This and other properties can be set explicitly in the startup sequence via "bootstrap" options.

That brings us nicely to the main thing to be aware of when using EMR: everything, including Hadoop itself, is transient. When you restart the cluster, everything gets deleted and rebuilt from scratch. If you want configuration to persist, you need to handle it in the bootstrap options; similarly, if you need data to persist across cluster restarts, you need to stage it externally in something like S3. EMR can read and write S3 directly, so you would not need to copy raw data from S3 into HDFS just to load it into VectorH; you could load directly from S3 using something like Actian's new Spark loader.
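For example, a minimal sketch of that S3-to-VectorH path, reusing the same (assumed) connector data source as in the earlier post; the bucket, database, and table names are purely illustrative:

    // Read raw data straight from S3; no intermediate copy into HDFS is needed.
    val events = sqlContext.read.parquet("s3://my-bucket/raw/events/") // illustrative bucket

    // Hand the DataFrame to VectorH via the Spark loader / connector (names assumed).
    events.write
      .format("com.actian.spark_vector.sql.DefaultSource") // assumed data source name
      .option("host", "vectorh-leader")
      .option("database", "eventsdb")
      .option("table", "events")
      .save()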

This may seem like a drawback, and for some use cases it is, but it really just reflects the purpose of EMR. EMR was never intended to be a long-running home for a persistent data lake; it is a transient "spin up, process, then shut down" environment for running MapReduce-style jobs. Amazon would probably say that S3 should be the location for persistent data.

In our case we were testing a high-performance database, very much a "long-lived" service with persistent data requirements, so in that sense EMR was perhaps not an obvious choice. Although we did have to reload the database a few times after reconfiguring the cluster, we quickly adapted to provisioning from scratch by building a rudimentary provisioning script (helped by VectorH's fast load speed and the fact that it doesn't need indexes), so EMR actually worked quite well as a testing environment.

The second thing to be aware of with EMR, especially for those used to Hadoop, is that (at least with the default Hadoop we were using) some of the normal facilities aren't there. What we noticed most of all was the absence of the usual start/stop controls for the Hadoop daemons. We wanted to restart the DataNodes to enable short-circuit reads and found it a little tricky; in the end we had to restart the cluster anyway, so we put this into the bootstrap options.

Another aspect you don't control with EMR is the use of an AWS placement group. Normally in AWS, if you want a high-performance cluster you try to make sure the nodes are physically "close" to minimize network latency, but with EMR there didn't seem to be any way to specify that. It could be that EMR is being a little clever under the covers and doing this for you anyway, or that placement groups become impractical for larger clusters. Either way, our setup didn't use a specified placement group, and even without one, performance was good.

In fact, performance was good enough that the customer reported the results to be better than their comparable testing on Amazon Redshift!

In summary, this exercise showed that you can use Amazon EMR as a "quick start" Hadoop environment even for long-lived services like Actian VectorH. Apart from a few missing facilities, most of which can be dealt with via bootstrap options, EMR behaves like a regular Hadoop distribution. Given its transient nature, though, EMR should only be considered for use cases that are themselves transient (including testing of long-lived services).

Customers wanting click-of-a-button provisioning of Hadoop for production on platforms like Amazon EC2 should consider installing with Actian's Management Console, available now from https://esd.actian.com.


About Actian Corporation

Actian empowers enterprises to confidently manage and govern data at scale, streamlining complex data environments and accelerating the delivery of AI-ready data. The Actian data intelligence approach combines data discovery, metadata management, and federated governance to enable smarter data usage and enhance compliance. With intuitive self-service capabilities, business and technical users can find, understand, and trust data assets across cloud, hybrid, and on-premises environments. Actian delivers flexible data management solutions to 42 million users at Fortune 100 companies and other enterprises worldwide, while maintaining a 95% customer satisfaction score.