Data Intelligence

7 Lies of Data Catalogs #7: Complex, Not Complicated

Actian Corporation

July 20, 2021


The Data Catalog market has developed rapidly, and it is now deemed essential when deploying a data-driven strategy. Victim of its own success, this market has attracted a number of players from adjacent markets, who have rejigged their marketing positioning in order to present themselves as Data Catalog solutions.

The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

A Data Catalog is Complex…but isn’t Complicated

This closing statement strikes us as a fitting summary of everything that has come before, and it will serve as our conclusion.

We have seen too many Data Catalog initiatives morph into endless data governance projects that try to solve a laundry list of issues, ignoring the ones a Data Catalog solves easily. Once you have removed the extra baggage, deploying a Data Catalog takes only a few days, rather than months, to produce value.

The services rendered by a Data Catalog are simple. In its leanest form, a Data Catalog presents itself as a search bar in which any user can type a few keywords (or even pose a question in natural language) and obtain a list of results whose first few entries are the most relevant, providing all the information needed to use the data (just like a web search engine or an online retailer).
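To make the idea concrete, here is a minimal sketch of that search experience in Python: keywords in, a short ranked list out. The toy catalog entries and the naive keyword-overlap score are illustrative assumptions, not any product's actual ranking algorithm.

```python
# A minimal sketch of the search experience described above, assuming a toy
# in-memory catalog and a naive keyword-overlap score (not any product's
# actual ranking algorithm).
def search(catalog, query, top_n=5):
    """Rank catalog entries by how many query terms they mention."""
    terms = set(query.lower().split())
    scored = []
    for entry in catalog:
        text = (entry["name"] + " " + entry["description"]).lower()
        score = sum(1 for term in terms if term in text)
        if score:
            scored.append((score, entry["name"]))
    scored.sort(reverse=True)  # most relevant entries first
    return [name for _, name in scored[:top_n]]

catalog = [
    {"name": "sales_orders_2021", "description": "cleaned sales orders by region"},
    {"name": "customer_churn", "description": "monthly churn rate per segment"},
    {"name": "raw_clickstream", "description": "unprocessed web events"},
]
print(search(catalog, "sales orders by region"))  # ['sales_orders_2021']
```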

This ease of use is crucial to guarantee adoption by the data teams. On the user front, the Data Catalog should be a simple affair with a clean design. As with any other search or recommendation engine, however, the underlying complexity is substantial.

The good news for the customer is that this complexity is nothing to worry about: it's on us.

Actian Data Intelligence Platform has invested enormously in the structure of the information (building a knowledge graph), in automation, and in the search and recommendations engine. This complexity isn't visible, but it is what constitutes the value of a Data Catalog.

The obsession for simplicity is at the heart of our values. Each functionality we choose to add to the product has to tick one of the two boxes below:

  • Does this functionality help deploy the catalog faster in the organization?
  • Does this functionality enable the data teams to find the information more quickly in order to get on with their projects?

If the answer to both questions is no, the functionality is discarded.

The result is that you can connect the Actian Data Intelligence Platform to your operational systems, configure and feed your first metamodel, and open the catalog to the end users within a matter of days.

Of course, going forward, you’ll need to complete the metamodel, integrate other sources, etc. But the value creation is instant.


About Actian Corporation

Actian empowers enterprises to confidently manage and govern data at scale. Actian data intelligence solutions help streamline complex data environments and accelerate the delivery of AI-ready data. Designed to be flexible, Actian solutions integrate seamlessly and perform reliably across on-premises, cloud, and hybrid environments. Learn more about Actian, the data division of HCLSoftware, at actian.com.
Data Management

It’s Time for Data Historians to Become…History

Actian Corporation

July 17, 2021


Data Historians…History?

Why a modern time-series capable database can simplify yet enhance time-series data analysis.

Despite the professorial image the term suggests, a data historian is not an instructor or researcher, but a purpose-built software solution. And evolutions in how operational data is used and managed have eclipsed the need for data historian software solutions.

What is a Data Historian?

There are many Operational Technology (OT) environments within manufacturing, oil and gas, engineering research, and countless other industries. In these environments, complex equipment, machinery, and networks of sensors and devices generate time-series data. These time-series streams range from sensor data representing pressure, volume and temperature to video streams for machine vision and surveillance.

Initially, these streams were ignored or sampled only at low periodic rates. As time-series streams increased in volume and local data processing incorporated multiple feed reconciliation, OT engineers began to build data collection, aggregation, and minimal processing systems to better handle these time-series data streams. Eventually, these proprietary and bespoke systems were collectively labeled data historians.

The Data Historian Process Gap

The use and users of OT data have both changed considerably over the past few years. Increasingly, OT data is leveraged by a host of players within an organization beyond OT professionals. These newer users include developers, business analysts and data scientists supporting the OT, and product and service managers driving the business.

However, no data historian software solution was ever designed for use with a range of external systems or by users who were not OT professionals. Instead, the typical data historian platform was little more than libraries of data collected by and intended only for the use of OT professionals. And they typically built each data historian software solution from the ground up, directly or by proxy through vendors of manufacturing or other specialized equipment. In essence, data historian solutions are libraries built only for the librarians.

In addition, much data historian software was implemented on expensive legacy hardware. Resource constraints and lack of standards meant that functionality was pared down and focused only on the localized and immediate requirements of the OT infrastructure and process at hand. The result is that data historian software solutions are not easily extended for functions such as localized analytics and visualization or sharing data across local systems. It is also difficult or impossible for the typical data historian platform to easily and securely exchange data with modern backend systems for further analytics and visualization.

Technology That Empowers Historical Data to Shape the Future

As with any other part of the business and IT industries, the technology for data management is continuously evolving, with new capabilities emerging every day. Currently, three primary technology shifts are combining to move beyond the capabilities and expected outcomes of data historian software.

Modern Time-Series Databases: Beyond the Data Historian

Outside of the OT domain, the rest of your company data is likely stored in traditional relational databases and data warehouses. Data historian solutions were focused on capturing largely structured data in time-series formats. Today’s data is a vast superset of the data captured by these legacy systems.

Modern time-series databases include traditional time-series data capabilities. However, those modern solutions are designed and optimized for capturing data chronology and ingesting data from unstructured and multi-variate streaming data sources. These can range from Binary Large Objects (BLOBs) and data compliant with the JavaScript Object Notation (JSON) open standard to the latest in Internet of Things (IoT) connectivity.
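As a hedged illustration of that ingestion pattern, the sketch below stores timestamped, semi-structured JSON payloads and reads a field back at query time. SQLite stands in for a modern time-series database purely to keep the example self-contained; the table and field names are assumptions.

```python
# A self-contained sketch of that ingestion pattern: timestamped rows whose
# payloads are semi-structured JSON rather than fixed columns. SQLite stands
# in for a modern time-series database; table and field names are assumptions.
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts TEXT, device_id TEXT, payload TEXT)")

# An IoT-style, multi-variate event stored schema-on-read.
event = {"pressure_kpa": 101.4, "temp_c": 22.8, "vibration": [0.02, 0.03]}
conn.execute(
    "INSERT INTO readings VALUES (?, ?, ?)",
    (datetime.now(timezone.utc).isoformat(), "sensor-17", json.dumps(event)),
)

# Read back in time order and unpack the payload at query time.
for ts, device_id, payload in conn.execute(
    "SELECT ts, device_id, payload FROM readings ORDER BY ts"
):
    print(ts, device_id, json.loads(payload)["temp_c"])
```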

Ad-Hoc Analysis and Reporting: the Right Data for Everyone

Data historians tend to rely upon NoSQL application programming interfaces (APIs). These store and access data based on so-called “key values,” rather than in the rows and columns of traditional databases. NoSQL APIs are great for data collection and local data management. However, they are not readily accessible for post-collection ad hoc analysis and reporting – particularly by business analysts and data scientists outside the OT domain.

Modern time-series databases provide both a NoSQL API and APIs compliant with the American National Standards Institute (ANSI) Structured Query Language (SQL) standard. The latter feature enables easy extraction of data to support remote ad-hoc analysis, reporting and visualization through widely used business intelligence and reporting tools that rely on standard IT connectivity mechanisms such as Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC).
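Here is a sketch of what that access pattern looks like from the analyst's side: plain ANSI SQL over a standard ODBC connection. The DSN, credentials, and table are hypothetical, and the snippet assumes an ODBC driver has already been configured for the database.

```python
# A hedged sketch of the analyst-side access pattern: plain ANSI SQL over a
# standard ODBC connection. The DSN, credentials, and table are hypothetical,
# and the snippet assumes an ODBC driver has already been configured.
import pyodbc  # pip install pyodbc

conn = pyodbc.connect("DSN=timeseries_db;UID=analyst;PWD=secret")
cursor = conn.cursor()

# Ad-hoc reporting query: average pressure per device over one day.
cursor.execute(
    """
    SELECT device_id, AVG(pressure_kpa) AS avg_pressure
    FROM sensor_readings
    WHERE ts >= ? AND ts < ?
    GROUP BY device_id
    """,
    "2021-07-01", "2021-07-02",
)
for device_id, avg_pressure in cursor.fetchall():
    print(device_id, avg_pressure)
```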

Artificial Intelligence (AI): Enabling History to Support Predicting the Future

Traditional data historian solutions can enable operations managers in the field to catch problems with their infrastructures, such as when pressure is too high or a part has failed. But these alerts always come after the fact. How quickly afterwards depends on the collection and processing speed of the specific data historian solution, but hindsight is always the default.

AI, powered by modern Machine Learning (ML) capabilities, can deliver alerts that are more insightful. Depending on the combinations of data, past patterns, and the ability to analyze them, AI-driven successors to data historian solutions can even deliver predictive guidance about when a part is likely to fail. Modern, integrated time-series databases can support AI and ML capabilities locally at the point of action within the OT domain by integrating OT with backend IT. The result is that data scientists and engineers can craft AI and ML capabilities for backend IT systems. Developers and front-end OT engineers can then invoke those capabilities in the OT environment. This approach provides a new and modern way of interacting with your company’s data to generate more useful insights and improved outcomes.
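As a hedged sketch of that idea, the example below trains a deliberately simple model on synthetic, labelled sensor history and scores a live reading for failure risk. Nothing here reflects a specific product; it only illustrates the shift from after-the-fact alerts to predictive guidance.

```python
# A hedged sketch of the predictive pattern: train a deliberately simple model
# on synthetic, labelled sensor history, then score a live reading before a
# part actually fails. Nothing here reflects a specific product.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic history: columns are [vibration, temperature]; label 1 marks
# readings that preceded a failure (here, clustered at high values).
X = rng.normal(loc=[0.3, 60.0], scale=[0.1, 5.0], size=(500, 2))
y = ((X[:, 0] > 0.42) & (X[:, 1] > 64.0)).astype(int)

model = LogisticRegression().fit(X, y)

# Score a live reading arriving from the OT environment.
live_reading = np.array([[0.47, 67.0]])
print("failure risk:", model.predict_proba(live_reading)[0, 1])
```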

Respect the Legacy, But Move into the Future

Data historian solutions have been crucial to the evolution of OT and the IT industry since the 1980s and earlier, and their contributions should be acknowledged and respected. Their time has passed, however, and modern technology solutions are replacing them. These allow you to better manage the data your company needs today and have faster, more complete, and more accurate information insights for the future.

Actian is the industry leader in operational data warehouse and edge data management solutions for modern businesses. With a complete set of solutions to help you manage data on-premises, in the cloud, and at the edge, including mobile and IoT devices, Actian can help you develop the technical foundation you need to support true business agility. To learn more, visit www.actian.com.

Data Intelligence

7 Lies of Data Catalogs #6: Must Rely on Automation

Actian Corporation

July 9, 2021


The Data Catalog market has developed rapidly, and it is now deemed essential when deploying a data-driven strategy. Victim of its own success, this market has attracted a number of players from adjacent markets, who have rejigged their marketing positioning in order to present themselves as Data Catalog solutions.

The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

A Data Catalog Must Rely on Automation

Some Data Catalog vendors, who hail from the world of cartography, have developed the rhetoric that automation is a secondary topic, which can be addressed at a later stage.

They will tell you that a few manual file imports suffice, along with a generous user community collaborating on their tool to feed and use the catalog. A little arithmetic is enough to understand why this approach is doomed to failure in a data-centric organization.

An active Data Lake, even a modest one, quickly hoovers up hundreds and even thousands of datasets across its different layers. To these can be added datasets from other systems (database applications, various APIs, CRMs, ERPs, NoSQL stores, etc.) that we usually want to integrate into the catalog.

The orders of magnitude quickly go beyond thousands, sometimes tens of thousands of datasets. Each dataset contains dozens of fields. Datasets and fields alone represent several hundred thousand objects (we could also include other assets: ML models, dashboards, reports, etc.). For the catalog to be useful, inventorying those objects isn't enough.

You also need to attach to them all the properties (metadata) that will enable end users to find, understand, and exploit these assets. There are several types of metadata: technical information, business classification, semantics, security, sensitivity, quality, norms, uses, popularity, contacts, etc. Here again, for each asset, there are dozens of properties.

Back to the arithmetic: overall, we are dealing with millions of attributes that need to be maintained.
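A quick back-of-the-envelope calculation, with assumed but plausible orders of magnitude, shows how fast the numbers grow:

```python
# Back-of-the-envelope arithmetic with assumed, but plausible, orders of
# magnitude for a mid-sized data-centric organization.
datasets = 10_000          # "thousands, sometimes tens of thousands"
fields_per_dataset = 30    # "dozens of fields"
properties_per_asset = 20  # "dozens of properties" per asset

assets = datasets * (1 + fields_per_dataset)  # each dataset plus its fields
attributes = assets * properties_per_asset
print(f"{assets:,} assets, {attributes:,} attributes to maintain")
# 310,000 assets, 6,200,000 attributes to maintain
```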

Such volumes alone should disqualify any temptation to choose the manual approach. But there is more. The stock of informational assets isn’t static. It is constantly growing. In a data-centric organization, datasets are created daily, others are moved or changed.

The Data Catalog Needs to Reflect These Changes.

Otherwise, its content will be permanently obsolete and the end users will reject it. Who is going to trust a Data Catalog that is incomplete and wrong? If you feel that your organization can absorb the load and keep your catalog up to date by hand, that's wonderful. Otherwise, we suggest you evaluate, as early as possible, the level of automation provided by the different solutions you are considering.

What can we Automate in a Data Catalog?

In terms of automation, the most important capability is the inventory.

A Data Catalog should be able to regularly scan all your data sources and automatically update the asset inventory (datasets, structures, and technical metadata at a minimum) to reflect the day-to-day reality of the hosting systems.
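As an illustration, the sketch below inventories tables and their technical metadata directly from a source system, the way a catalog's connector might on each scan. SQLAlchemy's inspector and an in-memory SQLite database are stand-ins chosen to keep the example self-contained.

```python
# A minimal sketch of such a scan: enumerate datasets and their technical
# metadata straight from the source system. SQLAlchemy's inspector and an
# in-memory SQLite database are stand-ins to keep the example self-contained.
from sqlalchemy import create_engine, inspect, text

engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:  # create a table to stand in for a real source
    conn.execute(text("CREATE TABLE orders (id INTEGER, amount REAL, ts TEXT)"))

inventory = {}
inspector = inspect(engine)
for table in inspector.get_table_names():
    # Technical metadata only: column names and types, refreshed on each scan.
    inventory[table] = [
        (column["name"], str(column["type"]))
        for column in inspector.get_columns(table)
    ]
print(inventory)  # {'orders': [('id', 'INTEGER'), ('amount', 'REAL'), ('ts', 'TEXT')]}
```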

Believe us: a Data Catalog that cannot connect to your data sources will quickly become useless, because its content will always be in doubt.

Once the inventory is completed, the next challenge is to automate the metamodel feed.

Here, beyond the technical metadata, complete automation seems hard to imagine. It is still possible, however, to significantly reduce the workload needed to maintain the metamodel. The value of certain properties can be determined by simply applying rules when objects are integrated into the catalog.

It is also possible to suggest property values using more or less sophisticated algorithms (semantic analysis, pattern matching, etc.).
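The sketch below illustrates both mechanisms under assumed rules and patterns: hard rules applied automatically when an object enters the catalog, and pattern-matched suggestions left for a human steward to confirm.

```python
# A sketch of both mechanisms, under assumed rules and patterns: hard rules
# applied automatically at integration time, and pattern-matched suggestions
# left for a steward to confirm.
import re

RULES = {
    # datasets under this (hypothetical) schema inherit fixed properties
    "finance.": {"domain": "Finance", "sensitivity": "internal"},
}
PATTERNS = {
    "personal_data": re.compile(r"email|phone|birth|ssn", re.IGNORECASE),
}

def enrich(dataset_name, field_names):
    properties, suggestions = {}, {}
    for prefix, values in RULES.items():
        if dataset_name.startswith(prefix):
            properties.update(values)  # applied without human intervention
    for field in field_names:
        for tag, pattern in PATTERNS.items():
            if pattern.search(field):
                suggestions.setdefault(tag, []).append(field)  # needs review
    return properties, suggestions

print(enrich("finance.invoices", ["id", "customer_email", "amount"]))
# ({'domain': 'Finance', 'sensitivity': 'internal'}, {'personal_data': ['customer_email']})
```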

Lastly, it's often possible to feed part of the catalog by integrating the systems that produce or contain metadata. This applies, for instance, to quality measurement, lineage information, business ontologies, etc.

For this approach to work, the Data Catalog must be open and offer a complete set of APIs that allow the metadata to be updated from other systems.
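As a hedged example of that openness, here is how an external system might push lineage metadata into a catalog over REST. The endpoint, payload shape, and token are assumptions for illustration; the real API surface depends on the catalog product.

```python
# A hedged sketch of that openness: an external system pushing lineage
# metadata into the catalog over REST. The endpoint, payload shape, and token
# are assumptions; the real API surface depends on the catalog product.
import requests  # pip install requests

CATALOG_URL = "https://catalog.example.com/api/v1"  # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}

payload = {
    "property": "lineage",
    "value": {"upstream": ["raw.invoices"], "job": "nightly_finance_etl"},
    "source": "orchestrator",
}
response = requests.patch(
    f"{CATALOG_URL}/datasets/finance.invoices/metadata",
    json=payload,
    headers=HEADERS,
    timeout=10,
)
response.raise_for_status()
```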

Take Away

A Data Catalog handles millions of pieces of information in a constantly shifting landscape.

Maintaining this information manually is virtually impossible, or extremely costly. Without automation, the content of the catalog will always be in doubt, and the data teams will not use it.

Data Intelligence

7 Lies of Data Catalogs #5: Not a Business Modeling Solution

Actian Corporation

July 9, 2021


The Data Catalog market has developed rapidly, and it is now deemed essential when deploying a data-driven strategy. Victim of its own success, this market has attracted a number of players from adjacent markets, who have rejigged their marketing positioning to present themselves as Data Catalog solutions.

The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

A Data Catalog is NOT a Business Modeling Solution

Some organizations, usually large ones, have invested for years in the modeling of their business processes and information architecture.

They have developed several layers of models (conceptual, logical, physical) and have put in place an organization that helps the maintenance and sharing of these models with specific populations (business experts and IT people mostly).

We do not question the value of these models. They play a key role in urbanization (enterprise architecture), schema blueprints, and IS management, as well as in regulatory compliance. But we seriously doubt that these modeling tools can provide a decent Data Catalog.

There is also a market phenomenon at play here: certain historical business modeling players are looking to widen the scope of their offer by positioning themselves on the Data Catalog market. After all, they do already manage a great deal of information on physical architecture, business classifications, glossaries, ontologies, information lineage, processes and roles, etc. But we can identify two major flaws in their approach.

The first is organic. By their nature, modeling tools produce top-down models to outline the information in an IS. However accurate it may be, a model remains a model: a simplified representation of reality.

They are very useful communication tools in a variety of domains, but they are not an exact reflection of the day-to-day operational reality which, for us, is crucial to keeping the promises of a Data Catalog (enabling teams to find data, understand it, and know how to use the datasets).

The second flaw: it is not user-friendly.

A modeling tool is complex and handles a large number of abstract concepts, which makes for a steep learning curve. It's a tool for experts.

We could, of course, consider improving user-friendliness to open it up to a wider audience. But the built-in complexity of the information won't go away.

Understanding the information provided by these tools requires a solid grasp of modeling principles (object classes, logical levels, nomenclatures, etc.). It is quite a challenge for data teams, and one that seems difficult to justify from an operational perspective.

The truth is, modeling tools that have been turned into Data Catalogs face significant adoption issues with the teams (they have to make huge efforts to learn how to use the tool, only to not find what they are looking for).

A prospective client recently presented us with a metamodel they had built and asked us whether it was possible to implement it in the Actian Data Intelligence Platform. Derived from their business models, the metamodel had several dozen classes of objects and thousands of attributes. To their question, the official answer was yes (the platform metamodel is very flexible). But instead, we tried to dissuade them from taking that path: A metamodel that sophisticated ran the risk, in our opinion, of losing the end users, and turning the Data Catalog project into a failure…

Should we Therefore Abandon Business Models When Putting a Data Catalog in Place? Absolutely Not.

It must, however, be remembered that business models are there to handle some issues, and the Data Catalog others. Some of the information contained within the models helps structure the catalog and enrich its content in a very useful way (for instance responsibilities, classifications, and of course business glossaries).

The best approach, in our view, is therefore to conceive the catalog metamodel by focusing exclusively on the added value to the data teams (always with the same underlying question: does this information help find, localize, understand, and correctly use the data?), and then to integrate the modeling tool and the Data Catalog in order to automate the supply of certain elements of the metamodel already present in the business model.

Take Away

As useful and complete as they may be, business models are still just models: they are an imperfect reflection of the operational reality of the systems, and therefore they struggle to provide a useful Data Catalog.

Modeling tools, as well as business models, are too complex and too abstract to be adopted by data teams. Our recommendation is that you define the metamodel of your catalog with a view to answering the questions of the data teams and supply some aspects of the metamodel with the business model.

Data Intelligence

7 Lies of Data Catalogs #4: Not a Query Solution

Actian Corporation

July 2, 2021


The Data Catalog market has developed rapidly, and it is now deemed essential when deploying a data-driven strategy. Victim of its own success, this market has attracted a number of players from adjacent markets, who have rejigged their marketing positioning to present themselves as Data Catalog solutions.

The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

A Data Catalog is NOT a Query Solution

Here is another oddity of the Data Catalog market. Several vendors, whose initial aim was to allow users to query several data sources simultaneously, have “pivoted” towards a Data Catalog positioning on the market.

There is a reason for them to pivot.

The emergence of Data Lakes and Big Data has cornered them in a technological cul-de-sac that has weakened the market segment they were initially in.

A Data Lake is typically segmented into several layers. The “raw” layer integrates data without transformation, in formats that are more or less structured and in great quantities. A second layer, which we’ll call “clean”, contains roughly the same data in normalized formats, after a dust down. After that, there can be one or several “business” layers ready for use: a data warehouse and visualization tool for analytics, a Spark cluster for data science, a storage system for commercial distribution, etc. Within these layers, data is transformed, aggregated, and optimized for use, along with the tools supporting this use (data visualization tools, notebooks, massive processing, etc.).
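To make the layering concrete, here is a minimal sketch of the same record kept exactly as received in the “raw” layer and then normalized into the “clean” layer. Field names and formats are illustrative assumptions.

```python
# A minimal sketch of the layering: the same record kept as-is in the "raw"
# layer, then normalized into the "clean" layer. Field names and formats are
# illustrative assumptions.
import json
from datetime import datetime

raw_event = '{"Temp": "22,8", "ts": "01/07/2021 14:02"}'  # exactly as received

def to_clean(raw: str) -> dict:
    """Normalize a raw event: consistent types, formats, and naming."""
    record = json.loads(raw)
    return {
        "temperature_c": float(record["Temp"].replace(",", ".")),
        "timestamp": datetime.strptime(record["ts"], "%d/%m/%Y %H:%M").isoformat(),
    }

raw_layer = [raw_event]                         # append-only, untransformed
clean_layer = [to_clean(e) for e in raw_layer]  # normalized copy of the same data
print(clean_layer)  # [{'temperature_c': 22.8, 'timestamp': '2021-07-01T14:02:00'}]
```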

In This Landscape, a Universal Self-Service Query Tool isn’t Suitable.

It is of course possible to set up an SQL interpretation layer on top of the “clean” layer (like Hive) but query execution remains a domain for specialists. The volumes of data are huge and rarely indexed.

Allowing users to define their own queries is very risky: On on-prem systems, they run the risk of collapsing the cluster by running a very expensive query. And on the Cloud, the bill could run very high indeed. Not to mention security and data sensitivity issues.

As for the “business” layers, they are generally coupled with more specialized solutions (such as a combination of Snowflake and Tableau for analytics) that offer very complete, secure tooling and great performance for self-service queries. With their market space melting away like snow in the sun, some multi-source query vendors have pivoted towards Data Catalogs.

Their pitch is now to convince customers that the ability to execute queries makes their solution the Rolls-Royce of Data Catalogs (in order to justify their six-figure pricing). We would invite you to think twice about it.

Take Away

On a modern data architecture, the capacity to execute queries from a Data Catalog isn’t just unnecessary, it’s also very risky (performance, cost, security, etc.).

Data teams already have their own tools to execute queries on data, and if they haven’t, it may be a good idea to equip them. Integrating data access issues in the deployment of a catalog is the surest way to make it a long, costly, and disappointing project.

Data Intelligence

What is a Data Mesh?

Actian Corporation

June 28, 2021


In this new era of information, new terms are used in organizations working with data: Data Management Platform, Data Quality, Data Lake, Data Warehouse…

Behind each of these words, we find specificities, technical solutions, etc. Let’s decipher.

Did you say “Data Mesh”? Don’t be embarrassed if you’re not familiar with the concept. The term wasn’t coined until 2019, as a response to the growing number of data sources and the need for business agility.

The Data Mesh model is based on the principle of a decentralized or distributed architecture exploiting a literal mesh of data.

While a Data Lake can be thought of as a storage space for raw data, and the Data Warehouse is designed as a platform for collecting and analyzing heterogeneous data, Data Mesh responds to a different use case.

On paper, a Data Warehouse and Data Mesh have a lot in common, especially when it comes to their main purpose, which is to provide permanent, real-time access to the most up-to-date information possible. But Data Mesh goes further. The freshness of the information is only one element of the system.

Because it is part of a distributed model, Data Mesh is designed to address each business line in your company with the key information that concerns it.

To meet this challenge, Data Mesh is based on the creation of data domains. 

The advantages? Your teams are more autonomous through local data management, your enterprise is decentralized so it can aggregate more and more data, and finally, you gain more control over the overall organization of your data assets.

Data Mesh: Between Logic and Organization

If a Data Lake is ultimately a single reservoir for all your data, Data Mesh is the opposite. Forget the monolithic dimension of a Data Lake. Data is a living, evolving asset, a tool for understanding your market and your ecosystem and an instrument of knowledge and understanding. 

Therefore, in order to appropriate the concept of meshing data, you need to think differently about data. How can we do this? By laying the foundations for a multi-domain organization. Each type of data has its own use, its own target, and its own exploitation. From then on, all the business areas of your company will have to base their actions and decisions on the data that is really useful to them to accomplish their missions. The data used by marketing is not the same as the data used by sales or your production teams. 

The implementation of a Data Catalog is therefore the essential prerequisite for the creation of a Data Mesh. Without a clear vision of your data’s governance, it will be difficult to initiate your company’s transformation. Data quality is also a central element. But ultimately, Data Mesh will help you by decentralizing the responsibility for data to the domain level and by delivering high-quality transformed data.

The Challenges

Does adopting Data Mesh seem impossible because the project seems both complex and technical? No cause for panic! Data Mesh, beyond its technicality, its requirements, and the rigor that goes with it, is above all a new paradigm. It must lead all the stakeholders in your organization to think of data as a product addressed to the business. 

In other words, by moving towards a Data Mesh model, the technical infrastructure of the data environment is centralized, while the operational management of the data is decentralized and entrusted to the business.

With Data Mesh, you create the conditions for an acculturation to data for all your teams so that each employee can base his or her daily action on data.

The Data Mesh Paradox

Data Mesh is meant to put data at the service of the business. This means that your teams must be able to access it easily, at any time, and to manipulate the data to make it the basis of their daily activities.

But in order to preserve the quality of your data, or to guarantee compliance with governance rules, change management is crucial and the definition of each person’s prerogatives is decisive. When deploying Data Mesh, you will have to lay a sound foundation in the organization. 

On the one hand, free access to data for each employee (what we call functional governance). On the other hand, management and administration, in other words, technical governance in the hands of the Data teams.

Decompartmentalizing uses by compartmentalizing roles: that is the paradox of Data Mesh.

Data Intelligence

7 Lies of Data Catalogs #3: Not a Compliance Solution

Actian Corporation

June 25, 2021


The Data Catalog market has developed rapidly, and it is now deemed essential when deploying a data-driven strategy. Victim of its own success, this market has attracted a number of players from adjacent markets, who have rejigged their marketing positioning to present themselves as Data Catalog solutions.

The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

A Data Catalog is NOT a Compliance Solution

As with governance, regulatory compliance is a crucial issue for any data-centric organization.

There is a plethora of data-handling regulations spanning all sectors of activity and countries. On the subject of personal data alone, GDPR is mandatory across all EU countries, but each Member State has a lot of wiggle room on how it's implemented, and most States have a large arsenal of legislation to complete, reinforce, and adapt it (Germany alone, for instance, has several dozen regulations related to personal data across different sectors of activity).

In the US, there are hundreds of laws and regulations across States and sectors of activity (with varying degrees of adherence). And here we are only referring to personal data… Rules and regulations also exist for financial data, medical data, biometric data, banking data, risk data, insurance data, etc. Put simply, every organization has some regulation it has to comply with.

So What Does Compliance Mean in this Case?

The vast majority of regulatory audits center on the following:

  • The ability to provide complete and up-to-date documentation on the procedures and controls put in place in order to meet the norms.
  • The ability to prove that the procedures described in the documentation are rolled out in the field.
  • The ability to supervise all the measures deployed with a view towards continuous improvement.

A Data Catalog is neither a procedures library nor an evidence consolidation system, and even less a process supervision solution.

It strikes us as obvious that assigning those responsibilities to a Data Catalog will make it considerably less simple to use (norms are too obscure for most people) and will jeopardize adoption by those most likely to benefit from it (data teams).

Should we Therefore Forget About Data Catalogs in our Quest for Compliance?

No, of course not. Again, in terms of compliance, it would be much wiser to use the Data Catalog for the literacy of the data teams, and to tag the data appropriately, thus enabling the teams to quickly identify any norm or procedure they need to adhere to before using the data. The Catalog can even help place the tags using a variety of approaches. It can, for example, automatically detect sensitive or personal data.

That said, even with the help of ML, detection will never work perfectly (the notion of “personal data” defined by GDPR, for instance, is much broader and harder to detect than North American PII). The Catalog's ability to manage these tags is therefore critical.
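As an illustration of why detection stays imperfect, here is a minimal value-based detector that samples column values against a few patterns. The patterns and the confidence threshold are assumptions, and deliberately incomplete; this is exactly why the resulting tags must remain manageable and reviewable.

```python
# A minimal value-based detector: sample a few values per column and flag
# likely personal data. Patterns and threshold are assumptions and deliberately
# incomplete, which is precisely why the resulting tags must stay reviewable.
import re

DETECTORS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "phone": re.compile(r"^\+?[\d\s().-]{7,15}$"),
}

def detect_pii(column_sample):
    hits = {}
    for tag, pattern in DETECTORS.items():
        matches = sum(1 for value in column_sample if pattern.match(str(value)))
        if matches / len(column_sample) > 0.8:  # confidence threshold (assumed)
            hits[tag] = matches
    return hits

sample = ["anna@example.com", "bob@example.org", "carol@example.net"]
print(detect_pii(sample))  # {'email': 3}
```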

Take Away

Regulatory compliance is above all a matter of documentation and proof and has no place in a Data Catalog.

However, the Data Catalog can help identify (more or less automatically) data that is subject to regulations. The Data Catalog plays a key role in the acculturation of the data teams with respect to the importance of regulations.

Data Intelligence

Data Lakes: The Benefits and Challenges

Actian Corporation

June 24, 2021


Data Lakes are increasingly used by companies for storing their enterprise data. However, storing large quantities of data in a variety of formats can lead to data chaos! Let’s take a look at the pros and cons of Data Lakes.

To understand what a Data Lake is, let’s imagine a reservoir or a water retention basin that runs alongside the road. Regardless of the type of data, its origin, its purpose, everything, absolutely everything, ends up in the Data Lake. Whether that data is raw or refined, cleansed or not, all of this information ends up in this single place where it isn’t modified, filtered, or deleted before being stored.

Sounds a bit messy, doesn’t it? But that’s the whole point of the Data Lake.

It is precisely because it frees the data from any preconceived idea that a Data Lake offers real added value. How? By allowing data teams to constantly reinvent the use and exploitation of your company's data.

Improving the customer experience with a 360° analysis of the customer journey, detecting personas to refine marketing strategies, rapidly integrating new data flows from IoT: the Data Lake is an agile response to very structured problems for companies.

Data Lakes: The Undeniable Advantages

The first advantage of a Data Lake is that it allows you to store considerable volumes of protean data. Structured or unstructured, straight from NoSQL databases…a Data Lake is, by nature, agnostic to the type of information it contains. It is precisely because it has no strict data exploitation scheme that the Data Lake is a valuable tool, and for good reason: none of the data it contains is ever altered, degraded, or distorted.

This is not the only advantage of a Data Lake. Indeed, since the data is raw, it can be analyzed on an ad-hoc basis.

The objective: to detect trends and generate reports according to business needs, without it becoming a vast project involving another platform or another data repository.

Thus, the data available in the Data Lake can be easily exploited in real time, placing your company in a data-centric scheme so that your decisions, your choices, and your strategies are never disconnected from the reality of your market or your activities.

Nevertheless, the raw data stored in your Data Lake can (and should!) be processed in a specific way, as part of a larger, more structured project. But your company’s data teams will know that they have, within reach of a click, an unrefined ore that can be put to use for further analysis.

The Challenges of a Data Lake

When you think of a Data Lake, poetic mental images come to mind: crystalline waves rippling in the wind of success that carries you away…but beware! A Data Lake carries the seeds of murky, muddy waters. This receptacle of data must be the object of particular attention, because without rigorous governance, the risk of sinking into “data chaos” is real.

In order for your Data Lake to reveal its full potential, you must have a clear and standardized vision of your data sources.

The control of these flows is a first, essential safeguard to guarantee the proper exploitation of data of a heterogeneous nature. You must also be very vigilant about data security and the organization of your data.

The fact that the data in a Data Lake is raw does not mean that it should not have a minimum structure to allow you to at least identify and find the data you want to exploit.

Finally, a Data Lake often requires significant computing power in order to refine masses of raw data in a very short time. This power must be adapted to the volume of data that will be hosted in the Data Lake.

Between method, rigor and organization, a Data Lake is a tool that serves your strategic decisions.

Data Intelligence

7 Lies of Data Catalogs #2: Not a Quality Solution

Actian Corporation

June 21, 2021


The Data Catalog market has developed rapidly, and it is now deemed essential when deploying a data-driven strategy. Victim of its own success, this market has attracted several players from adjacent markets, who have rejigged their marketing positioning in order to present themselves as Data Catalog solutions.

The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

A Data Catalog is NOT a Data Quality Management (DQM) Solution

We do not underestimate the importance of data quality in successfully delivering a data project, quite the contrary. It just seems absurd to us to put it in the hands of a solution which, by its very nature, cannot perform the controls at the right time.

Let us explain. There is a very elementary rule of quality control, one that applies in virtually any domain where quality is an issue, be it an industrial production chain, software development, or the cuisine of a 5-star restaurant: the sooner a problem is detected, the less it costs to correct.

To demonstrate the point: a car manufacturer does not wait until a new vehicle is built, when all the production costs have already been incurred and solving a defect would cost the most, to test its battery. No. Each piece is closely controlled, each step of the production is tested, defective pieces are removed before ever being integrated into the production circuit, and the entire chain of production can be halted if quality issues are detected at any stage. Quality issues are corrected at the earliest possible stage of the production process, where fixes are the least costly and the most durable.

“In a modern data organization, data production rests on the same principles. We are dealing with an assembly chain whose aim is to provide usage with high added value. Quality control and correction must happen at each step. The nature and level of controls will depend on what the data is used for.”

If you are handling data, you obviously have pipelines at your disposal to feed your use cases. These pipelines can involve dozens of steps – data acquisition, data cleaning, various transformations, mixing various data sources, etc.

In order to develop these pipelines, you probably have a number of technologies at play, anything from in-house scripts to costly ETLs and exotic middleware tools. It's within those pipelines that you need to insert and pilot your quality controls, as early as possible, adapting them to what is at stake for the end product. Measuring data quality only at the end of the chain isn't just absurd, it's totally inefficient.
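Here is a minimal sketch of that principle: a validation step placed at ingestion, before any transformation, so defective records never enter the pipeline. The checks themselves are illustrative assumptions.

```python
# A minimal sketch of "control as early as possible": a validation step at
# ingestion, before any transformation, so defective records never enter the
# pipeline. The checks themselves are illustrative assumptions.
def validate(record: dict) -> list[str]:
    errors = []
    if record.get("amount") is None or record["amount"] < 0:
        errors.append("amount missing or negative")
    if not record.get("currency"):
        errors.append("currency missing")
    return errors

def ingest(batch):
    accepted, rejected = [], []
    for record in batch:
        errors = validate(record)  # the first step of the pipeline, not the last
        if errors:
            rejected.append((record, errors))
        else:
            accepted.append(record)
    return accepted, rejected

accepted, rejected = ingest([
    {"amount": 42.0, "currency": "EUR"},
    {"amount": -3.0, "currency": ""},
])
print(len(accepted), "accepted;", rejected[0][1])
# 1 accepted; ['amount missing or negative', 'currency missing']
```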

It is therefore difficult to see how a Data Catalog (whose purpose is to inventory and document all potentially usable datasets in order to facilitate data discovery and usage) can be a useful tool to measure and manage quality.

A Data Catalog operates on available datasets, on any system that contains data, and should be as unintrusive as possible in order to be deployed quickly throughout the organization.

A DQM solution works on the data feed (the pipelines), focuses on production data, and is, by design, intrusive and time-consuming to deploy. We cannot think of any software architecture that can tackle both issues without compromising the quality of either one.

Data Catalog vendors promising to solve your data quality issues are, in our opinion, in a bind and it seems unlikely they can go beyond a “salesy” demo.

As for DQM vendors (who also often sell ETLs), their solutions are often too complex and costly to deploy as credible Data Catalogs.

The good news is that the orthogonal nature of data quality and data cataloging makes it easy for specialized solutions in each domain to coexist without encroaching on each other’s lane.

Indeed, while a Data Catalog isn't purposed for quality control, it can exploit the information on the quality of the datasets it contains, which obviously provides many benefits.

The Data Catalog uses this metadata, for example, to share the information (and any alerts it may identify) with data consumers. The catalog can also use this information to adjust its search and recommendation engine and thus steer users towards higher-quality datasets.

And both solutions can be integrated at little cost with a couple of APIs here and there.
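For instance, a scheduled job of a dozen lines can read the latest scores from the DQM tool and write them into the catalog. Both endpoints and payload shapes below are hypothetical:

```python
# A hedged sketch of that integration: a scheduled job reads the latest scores
# from the DQM tool and writes them to the catalog. Both endpoints and payload
# shapes are hypothetical.
import requests  # pip install requests

DQM_URL = "https://dqm.example.com/api/scores"      # hypothetical
CATALOG_URL = "https://catalog.example.com/api/v1"  # hypothetical

def sync_quality_scores():
    # e.g. {"finance.invoices": 0.98, "sales.orders": 0.91}
    scores = requests.get(DQM_URL, timeout=10).json()
    for dataset, score in scores.items():
        requests.patch(
            f"{CATALOG_URL}/datasets/{dataset}/metadata",
            json={"property": "quality_score", "value": score},
            timeout=10,
        ).raise_for_status()

# Typically triggered nightly by a scheduler (cron, Airflow, etc.).
```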

Take Away

Data quality needs to be assessed as early as possible in the pipeline feeds.

The role of the Data Catalog is not to do quality control but to share the results of these controls as widely as possible. By their nature, Data Catalogs are bad DQM solutions, and DQM solutions are mediocre and overly complex Data Catalogs.

An integration between a DQM solution and a Data Catalog is very straightforward and is the most pragmatic approach.

Events

Hybrid Data Conference Recap and Highlights

Actian Corporation

June 17, 2021


That’s a Wrap!

Wow! What a wonderful time we had at the 2021 Hybrid Data Conference! Over two days, we showcased amazing demos, customer stories and technology advancements across the Actian portfolio. For those in attendance, we hope you enjoyed the event and the opportunity to see a few of the ways Actian is innovating and enabling our customers to gain greater value from their data at a fraction of the time and cost of other cloud data platforms.

For those who missed the event, here’s a quick recap of some of our most popular sessions.

Some of Our Favorite Sessions from the 2021 Hybrid Data Conference

Delivering on the Vision – Actian Hybrid Data Platform, presented by Emma McGrattan, Actian VP of Engineering

Emma McGrattan, Actian's VP of Engineering, gave an in-depth overview of how Actian products are delivering on the vision of hybrid cloud. Highlighting the Actian Data Platform, Emma showcased how Actian's product portfolio is accelerating cloud adoption and changing the way customers advance along their cloud journey. Whether you're looking to make the shift right away or to modernize and preserve investments in critical applications, this session is a great overview of the many options and use cases that support your unique path to the cloud.

Actian on Google Cloud, Presented by Lak Lakshmanan, Google’s Director of Analytics

This brief 15-minute session, presented by Lak Lakshmanan, Google's Director of Analytics and AI Solutions, is a great introduction to why Actian has chosen Google as our preferred cloud. We all love a better-together story, but Lak provides a glimpse from the cloud provider's perspective.

Of course, no conference would be complete without perspectives from our customers. Actian would like to thank all of the customers and partners that made the 2021 Hybrid Data Conference a success.

Actian Customer Panel Featuring Key Customer Speakers from Sabre, Finastra, and Goldstar Software

One Final Highlight


We were delighted to have Greg Williams, Editor-in-Chief of Wired, deliver his thoughts on why data-driven insights are no longer optional in today's modern world. Greg summarized it best in his presentation: every company is a data company.

Please visit the on-demand conference to hear more of his outstanding commentary on the future of data and how companies are creating advantage in a global economy.

Once again, we want to thank everyone who attended this year's Hybrid Data Conference. We hope you found the networking and content valuable, and we can't wait to see you in 2022 – hopefully in person! Stay safe, and enjoy your summer!

Data Intelligence

7 Lies of Data Catalog Providers #1: Not a Data Governance Solution

Actian Corporation

June 16, 2021


The Data Catalog market has developed rapidly, and it is now deemed essential when deploying a data-driven strategy. Victim of its own success, this market has attracted a number of players from adjacent markets, who have rejigged their marketing positioning to present themselves as Data Catalog solutions.

The reality is that, while relatively weak on the data catalog functionalities themselves, these companies attempt to convince, with degrees of success proportional to their marketing budgets, that a Data Catalog is not merely a high-performance search tool for data teams, but an integrated solution likely to address a host of other topics.

The purpose of this blog series is to deconstruct the pitch of these eleventh-hour Data Catalog vendors.

A Data Catalog is NOT a Data Governance Solution

This is probably our most controversial stance on the role of a Data Catalog. The controversy originates with the powerful marketing messages pumped out by the world leader in metadata management, whose solution is in reality a data governance platform being sold as a Data Catalog.

To be clear, having sound data governance is one of the pillars of an effective data strategy. Governance, however, has little to do with tooling.

Its main purpose is the definition of roles, responsibilities, company policies, procedures, controls, and committees. In a nutshell, its function is to deploy and orchestrate, in its entirety, the internal control of data in all its dimensions.

Let's just acknowledge that data governance has many different aspects (processing and storage architecture, classification, retention, quality, risk, conformity, innovation, etc.) and that there is no universal “one-size-fits-all” model adapted to all organizations. As in other governance domains, each organization must conceive and pilot its own landscape based on its capacities and ambitions, as well as a thorough risk analysis.

Putting effective data governance in place is not a project; it is a transformation program.

No commercial “solution” can replace that transformation effort.

So Where Does the Data Catalog fit into All This?

The quest for a Data Catalog is usually the result of a very operational requirement: Once the Data Lake and a number of self-service tools are set up, the next challenge quickly becomes to find out what the Data Lake actually contains (both from a technical and a semantic perspective), where the data comes from, what transformations the data may have incurred, who is in charge of the data, what internal policies apply to the data, who is currently using the data and why etc.

An inability to provide this type of information to the end user can have serious consequences for an organization, and a Data Catalog is the best means to mitigate that risk. When dealing with a transverse solution involving people from many different departments, the task of selecting it is often given to those in charge of data governance, as they appear to be in the best position to coordinate the expectations of the largest number of stakeholders.

This is where the alchemy begins. The Data Catalog, whose initial purpose was to provide data teams with a quick solution to discover, explore, understand, and exploit the data, becomes a gargantuan project in which all aspects of governance have to be solved.

The project will be expected to:

  • Manage data quality.
  • Manage personal data and compliance (GDPR first and foremost).
  • Manage confidentiality, security, and data access.
  • Propose a new Master Data Management (MDM).
  • Ensure field-by-field automated lineage for all datasets.
  • Support all the roles as defined in the system of governance and enable the relevant workflow configuration.
  • Integrate all the business models produced in the last 10 years for the urbanization program.
  • Authorize cross-querying on the data sources while complying with user permissions on those same sources, as well as anonymizing the results.

Certain vendors manage to convince their clients that their solution can be this unique one-stop shop for data governance. If you believe this is possible, by all means call them; they will gladly oblige. But to be frank, we simply do not believe such a platform is possible, or even desirable. Too complex, too rigid, too expensive, and too bureaucratic, this kind of solution can never be adapted to a data-centric organization.

For us, the Data Catalog plays a key role in a data governance program. This role should not involve supporting all aspects of governance but should rather be utilized to facilitate communication and awareness of governance rules within the company and to help each stakeholder become an active part of this governance.

In our opinion, a Data Catalog is one of the components that delivers the biggest return on investment in data-centric organizations that rely on Data Lakes with modern data pipelines…provided it can be deployed quickly and has a reasonable pricing associated with it.

Take Away

A Data Catalog is not a data governance management platform.

Data governance is essentially a transformation program with multiple layers that cannot be addressed by one single solution. In a data-centric organization, the best way to start, learn, educate, and remain agile is to blend clear governance guidelines with a modern Data Catalog that can share those guidelines with the end users.

Data Intelligence

Data Governance Framework | S03-E02 – Start in Under 6 Weeks

Actian Corporation

June 9, 2021

This is the last episode of our third and final season of “The Effective Data Governance Framework”.

Divided into two episodes, this final season will focus on the implementation of metadata management with a data catalog.

In this final episode, we will help you start a 3-6 week data journey and then deliver the first iteration of your Data Catalog.

Season 1: Alignment

  • Evaluate your Data maturity
  • Specify your Data strategy
  • Get sponsors
  • Build a SWOT analysis

Season 2: Adapting

  • Organize your Data Office
  • Organize your Data Community
  • Create Data awareness

Season 3: Implementing Metadata Management with a Data Catalog

  • The importance of metadata
  • 6 weeks to start your data governance journey

We are using an iterative approach based on short cycles (6 to 12 weeks at most) to progressively deploy and extend the metadata management initiative in the Data Catalog.

These short cycles make it possible to quickly obtain value. They also provide an opportunity to communicate regularly via the Data Community on each initiative and its associated benefits.

Each cycle is organized in predetermined steps, as follows:

1. Identify the Goal: a perimeter (data, people) and a target.

2. Deploy / Connect: technical configuration of the scanners and the ability to harvest the information. Deliverable: scanners deployed and operational.

3. Conceive and Configure: a metamodel tailored to meet expectations.

4. Import the Items: define the core (minimum viable) information needed to properly serve the users.

5. Open and Test: validate whether the effort produced the expected value.

6. Measure the Gains: a fine-grained analysis of the cycle to identify what worked, what didn't, and how to improve the next cycle.
