Choosing where to run your analytics compute is a dilemma that prompts careful debate at some companies, while others never consider it. If you work with distributed systems, embedded sensors in machinery, mobile apps, or IoT devices, this topic is for you! Where you locate compute resources for analytics has a significant impact on the performance of your solutions, the load on your network, and the cost of centralized data storage (whether on-premises or in the cloud).
What is Analytics Compute?
In simple terms, analytics compute is the data processing capability that you use to convert raw data into meaningful information. Embedded systems generate data either through sensors or through logs of their activities. This data has relevance but isn’t very useful until it is processed and refined into information and further into actionable insights. That refinement process involves things like collecting, filtering, sorting, classifying, assessing, interpreting, and summarizing streams of data using business rules, mathematical algorithms, statistical methods, and pattern matching templates. When you apply these analytical techniques and business rules to your streaming data, that’s analytics compute.
The Dilemma Related to Where You Locate Compute Capabilities
Embedded systems are also typically distributed systems – existing in many different locations within and outside your company. The people who need to understand the data generated by the embedded systems are often centralized – located in an office building far removed from the system generating the data. This matters because the data from embedded systems not only needs to be refined into meaningful information, it also needs to be transported from its original source to the person who needs to view it. Here is where the dilemma comes in: where should you do the analytics?
- At the source, within the embedded system itself?
- At the edge of the network in an edge device?
- In the cloud or some centralized data center?
- In a data warehouse, whether on-premises, in the cloud, or in a hybrid environment?
- In the reporting system, where the information is to be consumed?
Each of these options has its benefits and drawbacks. Centralized data centers and cloud data warehouses are large-scale systems that offer economies of scale, making compute resources cheaper. The problem is getting the data to them. If you stream all your raw data to a cloud service, a data warehouse, or your end users’ desktops, you transmit large amounts of data that you don’t really need (data that will be filtered out in the analytics). You also place a load on your network, which increases latency, raises your networking costs, and can slow down other business activities that need that capacity. If you perform compute operations within the embedded system or in an edge device, you avoid the network traffic, but you give up the economies of scale, so each unit of compute costs more.
A General Rule for Optimal Efficiency: “Put Your Analytics Compute Where the Data Lies.”
Relatively speaking, compute resources (even distributed ones) are cheaper than network capacity. Most systems under-utilize their compute capacity, which means you are already paying for resources that are going to waste. The general rule data experts recommend is “Put your analytics compute where the data lies,” or as close to it as possible. If the raw data is being generated in embedded systems, perform as much of the analytics as possible on that data, either in the embedded system or in a network edge device. At a minimum, perform the tasks that filter out unneeded data and pare down the volume of data that needs to be transmitted over the network.
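To make the “filter and pare down at the source” idea concrete, here is a minimal Python sketch of what first-level analytics on an edge device might look like. Everything in it is an assumption for illustration: the `Reading` and `WindowSummary` types, the `summarize_windows` function, the 60-second window, and the valid sensor range are hypothetical and do not come from any particular product. It drops out-of-range readings and reduces each time window to a single summary row, so only the summaries need to cross the network.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List, Tuple


@dataclass
class Reading:
    """One raw sensor sample (hypothetical shape)."""
    device_id: str
    timestamp: float  # seconds since epoch
    value: float      # e.g. a temperature reading


@dataclass
class WindowSummary:
    """What the edge device would transmit instead of the raw samples."""
    device_id: str
    window_start: float
    count: int
    minimum: float
    maximum: float
    average: float


def summarize_windows(
    readings: Iterable[Reading],
    window_seconds: float = 60.0,                     # assumed window size
    valid_range: Tuple[float, float] = (-40.0, 125.0),  # assumed sensor limits
) -> List[WindowSummary]:
    """Drop out-of-range readings, then reduce each time window to one row."""
    low, high = valid_range
    buckets: Dict[Tuple[str, float], List[float]] = {}
    for r in readings:
        if not (low <= r.value <= high):
            continue  # filter unneeded (here: implausible) data at the source
        window_start = (r.timestamp // window_seconds) * window_seconds
        buckets.setdefault((r.device_id, window_start), []).append(r.value)
    return [
        WindowSummary(dev, start, len(vals), min(vals), max(vals), mean(vals))
        for (dev, start), vals in sorted(buckets.items())
    ]
```

With a window of 60 seconds and a sensor that samples once per second, a sketch like this would send one row per minute instead of sixty, at the cost of losing the individual samples. Whether that trade is acceptable depends on what the downstream analytics need.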
Once you’ve pared down the streaming data by doing the first level of analytics processing in the field, you will probably need to combine the data from different embedded systems or distributed devices. Data aggregation and further analytics can’t be done effectively in the field, because no single device sees the data from the others, so they are better suited to a data warehouse. So go ahead and stream your (pre-processed) data to a cloud or on-premises data warehouse, where you can work with the data in batches. The analytics compute for data stored in a warehouse should also be performed as close as possible to where the data lies. If you can perform the compute on the same physical host, awesome! If not, do it within the same data center location to avoid network latency in your processing.
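As a rough illustration of this second stage, the Python sketch below combines pre-processed summaries from several devices into one fleet-wide figure per time window. The `Summary` type and the `fleet_average_by_window` function are hypothetical names chosen for this example, and a real warehouse would more likely express the same step as a query over stored tables. The point it shows is that averages computed in the field have to be weighted by their reading counts when they are merged centrally.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Summary(NamedTuple):
    """A pre-processed row received from one device (hypothetical shape)."""
    device_id: str
    window_start: float  # start of the time window, seconds since epoch
    count: int           # number of raw readings behind the average
    average: float


def fleet_average_by_window(summaries: Iterable[Summary]) -> Dict[float, float]:
    """Merge per-device summaries into one count-weighted average per window."""
    # window_start -> [sum of (average * count), total reading count]
    totals = defaultdict(lambda: [0.0, 0])
    for s in summaries:
        totals[s.window_start][0] += s.average * s.count
        totals[s.window_start][1] += s.count
    return {
        window: weighted_sum / n
        for window, (weighted_sum, n) in sorted(totals.items())
        if n > 0  # skip windows with no readings to avoid dividing by zero
    }
```

Simply averaging the per-device averages would give a device that reported two readings the same influence as one that reported six, which is why the count travels with each summary.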
If you are looking for a cloud data warehouse, Actian is a perfect choice. Avalanche is a cloud-based operational data warehouse solution designed for aggregating and processing streaming data. Actian DataConnect gives you the integration capabilities to manage all of your streaming data sources in one place.