Summary
- TDWI webinar featuring Databricks and Actian.
- Native Delta Lake observability.
- 100% coverage, no blind spots.
- Massive governance reach.
- The power of the semantic layer.
- Context over content.
Chapters
Hello everyone, and welcome to the TDWI webinar program. I'm Andrew Miller, and I'll be your moderator. For today's program, we're going to talk about No Observability, No Agents: The Key to Delivering Data to AI.
And our sponsors today are Actian and Databricks. For our presentations today, we'll hear first from Evan Levy with TDWI, and after Evan speaks, we'll be joined by Emma McGrattan with Actian and Raja Parimal with Databricks for presentations and a panel discussion. Before I turn over the time to our speakers, please allow me to go over a few basics.
Today's webinar will be about an hour long, and at the end of their presentations, our speakers will host a question and answer period. So if at any time during these presentations you'd like to submit a question, just use the Ask a Question area on your screen to type in your question. If you have any technical difficulties during the webinar, you can click on the technical FAQs area and you'll receive technical assistance.
And if you'd like a copy of today's presentation, you can locate the resource window to download a PDF. Lastly, we are recording today's event and we'll be emailing you a link to an archived version so you can view the presentation again later if you'd like, or if you'd like, you can share with a colleague. Again, today we're going to be discussing No Observability, No Agents: The Key to Delivering Data to AI.
And our first speaker is Evan Levy. He's a research fellow at TDWI and partner at Integral Data LLC. Evan is a management consultant in enterprise data strategy, data management, analytics, and systems integration.
He advises clients on strategies to address business challenges using their existing data and technology assets, coupled with new and creative methods and practices. With more than 25 years of experience consulting with clients, Evan leads classes and workshops offering practical, real-world experience to address challenges to ensure IT and business success. With that, please welcome Evan, and I will hand it over to you now.
Thanks, Andrew. It's terrific to be here. I'm actually pretty excited with the discussion we're going to have today.
You'd have to be hiding off planet for a few months or a few years to not know the importance not only of trying to move forward with AI, but also all the dependencies and challenges that we're trying to address because AI, in many instances, is only as powerful as the data that it can learn from, analyze, and address. So with that, going to cover a few different topics or concepts within our conversation today before we jump into the panel discussion. The first, obviously, what is data observability and semantics?
For those of you that are new to this domain, obviously, probably not new to AI, but the whole issue of what data observability and semantics are. We'll go over a few data points that TDWI research has uncovered with its surveys. And for those of you who aren't aware, TDWI research actually conducts a fair number of surveys throughout the year to try and stay on top of what are trends, interests, and ideas that our members have.
We'll get into some of those challenges with our panel discussion, then how, obviously, Actian and Databricks can help overcome and address some of these challenges. So with that, let me jump into a very simple diagram. When we talk about the idea of data observability, obviously data quality is the concept that comes top of mind, but it's more than that.
If we take a look at a typical environment, you've got a whole slew of source systems, whether you have premise-based environments, third-party data content, subscription or syndicated research, as well as cloud-based applications. And inevitably, there's a fair amount of work that goes into how do we go get that data. So you inevitably have, right at the beginning, part of your data supply chain.
What is it that I do to gather, collect, or receive the data? I then move into placing it in my core repositories or analytic platforms and/or query and reporting systems. So there's a whole slew of services and functions that are necessary to be able to transform, cleanse, manipulate those data details before they can be put into that environment.
The idea of observability is as data moves through the supply chain, you're actually able to measure, monitor, pretty much in an auditing and an inspection type idea, because you want to catch and identify the challenges or the goofs that occur. Maybe data that you don't expect, values that you don't expect, or transformation or cleansing issues that occur. You want to catch them as soon as possible rather than waiting for someone to run a report only to find out the data's missing.
Then obviously, I want this data to be prepared and usable by the analytics and the AI environment, because it's one thing if I'm looking at a report looking for an individual event. If I have a bad record, well, that report's going to be troublesome. The challenge with AI is it's studying tens of thousands, hundreds of thousands, hundreds of millions of events and details, and if in fact we've made a mistake, it could conceivably produce what we often call hallucinations or inaccurate details; it's going to conclude things inaccurately.
Now, to make all these pieces work, there's also this aspect of data governance, which includes things like a data catalog, lineage, and other details so people know where did the data come from. So when we talk about the idea of data observability, what we're really hitting on is the concept of being able to track and trace through the entire supply chain and have the supportive details to know, well, what are all the stops and/or changes that could possibly occur or that did occur? And the whole premise of semantics is in what do we call these elements or these data attributes as they move through the system, because in some instances, they could actually change name because we've got calculations dealing with things like revenue and cost or come up with profitability.
So when are those elements and details available, known, and understood? And again, this is all, in many instances, a matched set. You need one piece to have the other, and you need the other to have the one piece.
But the idea of, again, observability is not just for the data loading or the data engineering. There's this misunderstanding that observability is only about that aspect, when it really spans from the point of origin or creation all the way through to data consumption and usage, so we can see success and/or identify challenges. If I give you the formal definition of data observability, it's the ability to understand, monitor, and troubleshoot the health and quality of data throughout an organization's data systems.
So again, data quality is obviously a core component, but it's not the only piece. It's really about making sure, do I have the data that I need? Is it fit for purpose?
And can it support the job and activity that needs to be performed? So it is about freshness. How recent is the content?
The volumes: is it the right level of detail? There's this belief that I can use AI to figure anything out. The challenge when you don't have enough history or detail is you can't analyze and have a more thorough understanding, or be able to address things like prediction and some of the mathematical aspects that really require lots of detail.
Obviously, distribution. The concept's pretty straightforward, which is: is the data that I'm receiving in an anticipated range and set of values, or do I have something off the beaten path that I didn't expect? The schema is obviously how the data was designed and how it relates to one another.
Then lineage, as we've discussed, where did the data come from? Who touched it, and how might it have changed? So again, it's not just, is the data reflective of what happened?
Is it accurate, understandable? Do I know what the formats, the details, and the representations are? So let me jump real quick, briefly go over a few data points that TDWI research has uncovered, and things that we ask, and this was this year's, or pardon me, 2025's AI readiness assessment.
My organization has systems in place to ensure that data is easily accessible and can be integrated from diverse sources, including internal and external data sets for analytics and AI applications. And the thing that I'd remind everyone is a lot of AI and the analysis associated with it isn't always just premise or internal content. It sometimes needs to be merged or intermingled with third-party and/or external content.
But in this particular circumstance, we went and asked everybody: do you agree with this statement, or do you disagree? And if so, what are the shades of gray? So as you can see, 15.33% completely disagree.
They don't have the pieces in place. 15% more disagree. Neutral is, well, they neither disagree nor agree.
So real quickly, that's more than half. And the thing that I'd have you consider with these numbers: it's not that one organization is pure, either entirely a disagree or entirely an agree.
In fact, what we often find with enterprise and larger organizations is that more than 50% of the people either are not confident or have had trouble finding the data. So it's not one organization versus another organization; it's the disparity of needs and/or circumstances within a very large organization, too. Let's go on to the next one.
My organization has a trusted data foundation in place for analytics. 16% completely disagree, 13% disagree, 24% are neutral. So lo and behold, again, we have a bit more than half that don't necessarily have a trusted foundation for all the analytic and the data that they actually require.
So again, it's not that this is evil or bad. I think it's really more reflective of the maturing of our environment. We've been talking about the importance of data quality, observability, accuracy, and correction for many years.
But I think AI is bringing that more top of mind because of just the critical importance. And when you're going through potentially tens or hundreds of millions of records and the associated attributes, putting something in place to be able to detect and inspect in an automated fashion is critically important because you can't throw people at it. There's just too much content.
Let's go on to the last one. My organization is data literate. Business users as well as business analysts can use data to derive insights.
And when we use the concept of data literacy, what we're talking about is being able to use data in business conversations, and is there an awareness and an understanding of what the concepts, attributes, details, or terms that we're using actually mean? So in this particular survey question, what we found was about 17% completely disagree, another 16% disagree. So that's a third right off the bat.
And then almost another 28% said neither. So if we're looking at it, the idea is that about 40% of the organization is data literate, but a lot of it is not. So that's clearly an area that's going to require some level of investment, even with AI.
So people understand where this data came from. When AI comes up with what seems like a peculiar or an interesting concept, inevitably, they're going to ask, "Well, where did this data come from? How did we conclude or come up with this idea?" And you want to be able to answer that question.
So I don't want to characterize that we haven't accomplished a great deal. Obviously, we have. But where you find these challenges, particularly with AI, is AI tends to have a broader, deeper appetite for content.
And what we've seen with many of our members is they're not only using the data that they have used for their analytics and dashboard and reporting environments, but they're also augmenting and adding new content they may not have used because of the particular focus and purposes of the AI environment. So let me tie up or conclude my quick and brief discussion about observability and semantics. Really, if you take a look at the challenges within the AI environment specific to data, clearly data quality and degradation, and a lot of people forget that as data ages, there's an implicit issue of it not necessarily being accurate for the purpose at need.
I don't want to know what my customer trends were eight months ago if I've got to address a pricing issue right now. So freshness is part of that quality and degradation. The whole idea of data governance and access controls.
Who's allowed to look at the data? Who's allowed to change and update the content? But also the rules associated with that data.
Is someone allowed to look at these things? Are they allowed to look at certain individual client or product details, or do we have policies that contradict or conflict that? The idea of data awareness and accessibility, do we even know the data that's there?
One of the biggest obstacles we find, in fact, in addition to the literacy and accuracy or quality is, are the AI developers and the AI users aware of all the data assets that exist? And if you recall on that illustration that I showed you briefly at the beginning, there's that whole data governance across the spectrum of time, which is do we make sure that people know the raw data that we have, the cleaned-up and rationalized content that we've put in our analytic platforms, and the data, in fact, that's being used by analytics. Semantic consistency, and you'll see my cute little diagram here of the idea of income, gain, return, earnings.
Do people know what the terminology is that they should be using, or are they using alternative or synonyms, not being aware of that? Because there are instances where they may actually mean something else. They're using terms that they think are synonyms or that mean the same thing, but in fact, they're entirely different concepts.
Margin and profit can be very confusing unless you've laid that out in your company, for example. And then finally, stuff that we've touched on already, which is the whole importance of data literacy. Giving individuals the ability to know and discern the topics and issues that you're talking about without it sounding like a bunch of acronyms or abbreviations and things that they've never heard of.
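The synonym problem described here (income, gain, return, earnings all meaning the same thing, while margin and profit do not) can be made concrete with a small, machine-readable glossary. The following is a hypothetical Python sketch; the term names and canonical metric names are invented for illustration, not any vendor's feature:

```python
# Hypothetical glossary: each business term maps to one agreed canonical metric.
CANONICAL_TERMS = {
    "income": "net_revenue",
    "gain": "net_revenue",
    "return": "net_revenue",
    "earnings": "net_revenue",
    "margin": "gross_margin_pct",  # deliberately distinct from profit
    "profit": "net_profit",
}

def canonicalize(term: str) -> str:
    """Resolve a user's word to the agreed metric name, or fail loudly."""
    key = term.strip().lower()
    if key not in CANONICAL_TERMS:
        raise KeyError(f"'{term}' is not in the business glossary")
    return CANONICAL_TERMS[key]
```

A glossary like this makes the disagreement visible: "margin" and "profit" resolve to different metrics, so a tool (or an AI agent) using it cannot silently treat them as synonyms.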
So clearly, the challenge of AI is basically the next tier of data knowledge and usability. So with that, thank you for bearing with my brief introduction, and let me turn it back over to Andrew. Fantastic.
Thank you so much, Evan. That was a great presentation there, and now it's my pleasure to introduce our first guest speaker today, Emma McGrattan, with Actian. As CTO at Actian, Emma leads technology strategy and innovation, supporting Actian's mission to simplify data management.
With nearly 30 years at Actian, she is a recognized leader in the database industry, known for her expertise in data architecture and cloud transformation. With that, please welcome Emma, and I will hand it over to you now. Great.
Thank you, Andrew. So I'm going to start with a little bit of doom and gloom, right? The stuff that keeps us awake at night.
So these continue to be interesting to me because AI isn't new, right? It's been around forever. My very first job in the US was in the Lab for Artificial Intelligence at MIT back in 1987.
So I go back with AI quite some time, but we continue to see organizations struggle with it. These are some quite sobering statistics here. If we look at the first one, which says that 95% of gen AI initiatives deliver zero ROI, right?
And this is a stat that comes from an MIT study. So how do we get into the 5% that are delivering an ROI on those investments? Because there's massive investments being made right now, and we need to show that return on the investments.
The next statistic we have here is that less than 10% of organizations have managed to scale agentic AI. So by scaling it, we mean more than one AI agent in the organization. So how do we get into that 10% that is scaling AI successfully across the organization?
That's a statistic that comes from McKinsey, and the third one here comes from Gartner, and that is that 40% of agentic AI programs are predicted to be canceled by the end of next year, so by the end of 2027. So this is quite an alarming set of statistics, and to me, when we put these statistics in front of the enterprise, they typically agree that this is a likely outcome for them if they don't get their AI and data foundations correct. So let's have a look at some of the challenges that the enterprise faces when it comes to data.
So typically, think of these in two areas, right? So the first would be the business challenges, and oftentimes we hear from the business user that it just takes too long to get to the data that they need. And even if they can access it, they don't know how to interpret it.
So getting back to what Evan was talking about there with data literacy, and they don't know if they can trust it, right? Where does it come from? What's the lineage?
What's the provenance of this data? So from the business side, they continue to have those challenges. And then it's shocking to me, but we continue to have some very fundamental data challenges and data quality issues.
So data can be incomplete. It can be inconsistent. Stale data, right?
Maybe we've got broken pipelines that aren't updating data at the frequency the business requires. Maybe we're dealing with duplicated data, and we don't know which version of it to trust. We continue to have, and will continue to have, data silos across the organization.
According to Gartner, the typical enterprise has over 400 data sources that they're having to work with. So that continues to be a challenge, right? Just finding out where the data is that you need to answer questions for the business.
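Two of the quality issues just mentioned, incomplete records and duplicated data, lend themselves to simple automated checks. This is a minimal, hypothetical sketch; the field names and records are invented for illustration:

```python
# Hypothetical required fields for a customer record.
REQUIRED_FIELDS = {"customer_id", "email", "region"}

def completeness_issues(record: dict) -> set:
    """Return required fields that are missing or empty (incomplete data)."""
    return {f for f in REQUIRED_FIELDS if not record.get(f)}

def find_duplicates(records: list, key: str = "customer_id") -> set:
    """Return key values that appear more than once (duplicated data)."""
    seen, dupes = set(), set()
    for r in records:
        k = r.get(key)
        if k in seen:
            dupes.add(k)
        seen.add(k)
    return dupes
```

Running checks like these at ingest, rather than after a report breaks, is exactly the "shift left" idea discussed next.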
Compliance and governance continue to be an issue, and they need to be addressed, and the only way that we can deal with data challenges is if we have governance in place to handle all of the requirements we have around laws like GDPR, HIPAA, the EU AI Act. And what we like to think of at Actian is implementing governance by design. So it's not something that's bolted on after the end of a data pipeline, but rather something from its inception that data is governed, and we'd like to do it in a way where it's basically providing some guardrails to enable the data teams to innovate on the data.
So having it in the background and not something that gets in the way of doing work. And then finally, we look at the fact that many organizations continue to grapple with issues like data and AI literacy, bringing in accountability, and looking at incentives so that we can build out data literacy across the organization. Because this is a cultural and organizational shift that needs to happen with this new paradigm of AI being brought into the enterprise. And so, at Actian, we talk about governance by design.
We really believe in shifting left some of the challenges that we have here. Shifting left data quality, shifting left data governance, embedding this in from the data's inception. And I like to personalize this with a story.
When I moved to the US first, I moved to California. And I went to the DMV to get my driver's license. And the DMV asked for my weight, and in Ireland we use kilograms, and I knew that in the US we use pounds for measuring weight, so I filled in my weight in pounds.
And then they asked for my height, and I assumed that the equivalent of meters and centimeters was going to be inches. So when I was filling out my application, I said that I was 63 inches, but I got a license that said six foot three inches. Now, if the person looking over the counter at me had paid any attention, they'd have known that I wasn't six foot three; I could barely look over the countertop.
But they didn't. So they didn't fix that data issue at the earliest opportunity. So we think of that as a $1 prevention cost.
Had they just fixed it at that point, it would've been very simple. They then issued me a license that said that I was six foot three, and I thought it was hilarious. A whole foot taller than I actually am.
So I didn't do anything about it. So if I had spent an afternoon at the DMV, I could've got my license reissued with the correct height on it, and that we would think of as a $10 correction cost. I didn't do that.
Instead, I went on a weekend to Las Vegas, and I got caught for speeding in the middle of the desert. And when I got pulled over and the police officer looked at my license and said that I obviously wasn't the person represented in the license, and given the disparity in our heights, it became a bigger issue. So instead of being able to sweet talk my way out of a ticket, I wound up in court and having to hire a lawyer to deal with the complications over my license issues.
So this, to me, was a $100, in fact, more than $100 failure cost. So we really believe if you have some data challenges, address them upfront. The earlier they're addressed, the better it is for everybody.
And as I said, with governance, bolt it on upfront. Look at governance by design. Think about data contracts and data products, and have that governance on your data from its inception.
Makes life so much easier, and the earlier we fix these things, obviously the better it is for everybody. We heard from Evan about data observability. Observability is not one and done.
So observability is a continuous process. So we're constantly looking. Is the data being refreshed?
Are we meeting our freshness objectives for the data streams that we're having to deal with? Is the quality, the completeness, the freshness, are all of those being handled here? Volume is an interesting one.
We had that CrowdStrike outage that took down a lot of internet properties that we all depend upon in our daily lives, and when we looked into what was the root cause of that CrowdStrike outage, it came down to the fact that a data file was dropped into the server, and it was twice the size it should have been, and they didn't have a system that could deal with that file size, and that's what led to the outage. So looking for spikes, looking for data volumes that are a trickle and should in fact be much larger than that, can give us some indication that there's a problem someplace along the data pipeline. Schemas are constantly changing.
Columns being added and dropped and changed, and data types being changed over time. So making sure that we understand that and the impact of that. And then lineage.
Understanding where the data came from, any transformations that happened to it along the way. So we believe that data observability is key to successful AI deployments, and because making sure that you understand your data and that you get an early heads-up should there be some problems with your data, is fundamentally important. And if we're going to build AI that we can trust, we need to power it with data that we can trust.
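The continuous checks described above (freshness, volume, schema) can be sketched as simple automated tests that run every time a pipeline delivers data. This is a hypothetical Python illustration, not any vendor's API; the SLA, expected volume, tolerance, and schema are invented assumptions you would replace with your own:

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; real values would come from your SLAs.
FRESHNESS_SLA = timedelta(hours=1)
EXPECTED_ROWS = 100_000       # typical delivery volume
VOLUME_TOLERANCE = 0.5        # flag if volume drifts by more than 50%
EXPECTED_SCHEMA = {"order_id": "int", "amount": "float", "region": "str"}

def check_freshness(last_updated: datetime) -> bool:
    """Flag a dataset whose latest record is older than the SLA allows."""
    return datetime.utcnow() - last_updated <= FRESHNESS_SLA

def check_volume(row_count: int) -> bool:
    """Flag spikes or trickles, like a file arriving at twice its normal size."""
    ratio = row_count / EXPECTED_ROWS
    return (1 - VOLUME_TOLERANCE) <= ratio <= (1 + VOLUME_TOLERANCE)

def check_schema(observed: dict) -> bool:
    """Flag added, dropped, or retyped columns before they break consumers."""
    return observed == EXPECTED_SCHEMA
```

The point is not the specific checks but that they run continuously and automatically, giving the early heads-up Emma describes instead of waiting for a downstream failure.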
So really important that we look at the data's provenance. So you could think of provenance as the data's origin story. Where did it come from?
Lineage is basically the data's life story. What has happened to that data along its lifespan? Where does it come from?
What transformations did it go through? What pipelines did it pass through? Ultimately, that tells us if we can trust the data that we're dealing with.
And then freshness, making sure that as we're making decisions and as we're automating decisions, that we're doing that on the latest data. And then finally from me, we've talked a lot about how we should be handling the data infrastructure, making sure that we get our fundamentals right. But to me, there's three things that we need to think about here to be successful with AI.
The first is people, and thinking about building out a data culture, and building out data literacy. And I'll personalize this again. I grew up in Ireland, you might be able to tell from the accent.
And in Ireland, we talk about the weather all the time, which is amusing because it rains all the time. But every conversation typically starts with the weather. And I thought I had a very good understanding of data and weather, and I actually learned last year that when we say that there's a 60% chance of rain, I always thought it was a weather forecaster kind of hedging their bets and saying, "Oh, it's more likely to rain than not.
60% chance of rain." It actually means there's a 100% chance of rain in 60% of the area covered by the forecast. So here I am, a data nerd and a weather buff, and I completely misunderstood what a data term meant. So it's really important that we don't make those mistakes in the business.
So if we're talking about something like revenue, do the sales team, the finance team, and the engineering teams all understand revenue to mean the same thing? Important that we sort that out through a data literacy program. We also need to look at process.
Putting governance frameworks in place. Making sure that we have SLAs and that we're adhering to the SLAs that we're promising to the business. And then the technology.
Building out data platforms that can actually drive successful AI across the business. Looking at automation, looking at the fact that everything's moving at machine speed now, so we have to build out metadata-driven solutions. So to me, this is an incredibly exciting time in data and in AI.
And it's, to me, a great opportunity to get into those stats that I mentioned earlier: to be among the companies that are going to be successfully deploying AI, the companies that are going to show a return on the investments they're making, and the companies that are going to scale their agentic AI programs. And we're going to see those grow in the coming years. So that's it from me, from the Actian side of things. So let's hand it back over to you, Andrew.
Thank you so much, Emma.
That was a fantastic presentation there. And now it's my pleasure to introduce our next guest speaker today, Raja Parimal with Databricks. Raja is a technology alliances director at Databricks, focused on data governance and business intelligence ISVs.
Prior to Databricks, he spent 10 years in analytics consulting, helping organizations deal with large amounts of data before working at Alation and Snowflake. With that, please welcome Raja, and I will hand it over to you now. Hey, everyone.
Thank you. I think that this conversation has been really awesome in helping us understand the ways that governance is changing in the world of agentic AI: agents that can act on your behalf. But I think it's useful to set some context for how we get from where we are, where we have lots of data in lots of places, to developing these agentic AI experiences, and how, along the way, data platforms like Databricks can help us make the governance journey easier.
So first, let's start off with why we're doing this. There are a ton of different AI use cases across industries that are beyond the horizontal use cases of coding agents that we see today. Very soon, the actual use cases that we see up on this page by industry are going to be performed by AI agents that need really broad access to data across your enterprise platform.
Naturally, that means there's going to be a struggle to break the silos that prevent AI agents from being deployed in production. And the reason why we have these silos is that even though we have one data stack, that data stack is ultimately split up into multiple different parts. You can have a transactional database, but that transactional database also offloads data to your data warehouse.
Then there might be a separate data mart in your business intelligence tools. The data science and machine learning teams might have their own set of tooling. Across all of this, you have orchestration and ETL layers that glue it all together, and SaaS applications that hold some data within their own analytics or data marts and that need to be synced across your various different environments.
Each one of these data stores creates an area of lock-in. Lock-in is always a problem, but in the era of agentic AI, where agents really need broad access to all of your data, lock-in adds another layer of challenges to getting ROI from your data. On top of that, each of these data environments will have its own security policies, which are oftentimes necessary, but if they're not governed in a homogenous way or managed in an enterprise fashion, you can end up with conflicting policies that prevent access to data.
And a natural outcome of having multiple tools, each with its own set of policies, is that each individual tool will then try to have its own agent that may or may not work with the other agents. And the major problem with this is that without a common set of enterprise semantics, all of your AI agents across your different tooling will come up with different answers and different decisions when you need them to perform their actual work. So that is where Databricks comes in with our approach with the lakehouse and lake-based environments.
We took a look at all of the key aspects of your agentic AI stack, and we realized that in order to really make this work for our customers, we need to ensure that all of the data is available in an open format and accessible by a variety of tools. We pioneered the Delta Lake format. We participate heavily in Iceberg, including the recent Iceberg 3.0 release, and our lake-based environment is readable in a standard Postgres interface.
All of this is then governed by our technical metadata solution, the technical governance layer underneath all of our different use cases, called Unity Catalog. Unity Catalog provides one single set of access controls across all of your different types of data objects. We do more than just traditional tables.
We'll do access control and governance across unstructured data like files, AI models, MCP servers, even the agents themselves. So as your data estate grows, Databricks will be able to provide coverage and a technical governance capability across all of your different data assets. Once you have all of your data in Unity Catalog, then you can take advantage of our higher level and broader governance capabilities.
The two I'm going to call out here are AI governance, which lets you understand what your models are doing and what models you're actually running. Are you using models that run within your Databricks compute environment? Are you sending data to an API and therefore sending data over to a frontier lab?
Unity Catalog will help you make sense of those different kinds of commitments. Also, we have business semantics, which is really important as we stand here today, because last week we went broad with our GA announcement of Unity Catalog metrics, which is our business semantic layer that allows all different tools, both inside and outside Databricks, to use the same common understanding of enterprise semantics. So both our humans and our AI agents are all speaking the same language when they're performing actions.
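To illustrate what a shared semantic layer buys you, here is a neutral sketch of a single metric definition that every tool and agent reads the same way. This is hypothetical Python for illustration, not Unity Catalog's actual metric syntax; the metric name, expression, and fields are invented:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    """A hypothetical shared metric definition (not real Unity Catalog syntax)."""
    name: str
    expression: str   # how the number is computed
    grain: str        # the level rows are aggregated at
    owner: str        # who is accountable for the definition

# One definition, consumed identically by every BI tool and AI agent.
REVENUE = Metric(
    name="revenue",
    expression="SUM(order_amount) - SUM(refund_amount)",
    grain="order",
    owner="finance",
)

def describe(metric: Metric) -> str:
    """Render the metric the way any consuming tool would see it."""
    return f"{metric.name} = {metric.expression} (grain: {metric.grain})"
```

Because the definition lives in one governed place rather than in each tool, two agents asked for "revenue" compute the same expression at the same grain, which is the consistency being described here.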
And then finally, we can federate to other data sources to make sure that the same data governance layer extends over your other SaaS applications and enterprise databases. With this governed data format, you can then take advantage of our traditional lakehouse analytics environment and our data warehousing capabilities. You can use Lakebase, which is our capability to have transactional applications take advantage of that same separation of storage and compute as our traditional lakehouse environment.
But now we can have a write-optimized environment that lets our customers build AI apps that are either accessed by AI agents or built by AI agents themselves. And you can use all of the same data, using our lakehouse capabilities to adjust and transform data as it moves through our compute platform. With this core data platform, you can then return to all of your key applications that you need for building an agentic AI stack. And now, instead of having multiple different AI capabilities across your environment, they're all united on a common platform.
Okay.
With that being said, I think I'll hand it back to Andrew. Thank you so much, Raja. That was fantastic, and now it's time for our panel discussion, so I'll ask that Emma and Evan rejoin us.
And Evan, I will toss it back over to you, to begin the discussion, please. Well, obviously, Raja and Emma, thank you so much for a great set of presentation materials, and I think that gives us a good starting point for the panel discussion. We've put together a few questions to try and gain everybody's perspective.
For those of you that are listening, we'll allow each panelist to respond to each question, and we'll do just a round robin. So on the first question, Emma, let me start with you. What do you think are the most top-of-mind data issues relating to AI and analytics?
You've covered just a ton of stuff, but if you had to cherry-pick a few, what would be the ones that are most top of mind? So one of the things where Actian gets brought into an opportunity is where teams have seen a lot of success with AI in the lab, right? They've got some pilots that are working pretty well.
They try to roll it out into production, and then everything gets really messy, right? Schemas change, definitions conflict, permissions get in the way of retrieving data, and then suddenly they've got all kinds of problems with AI. So, to me, the top-of-mind issue in these scenarios is whether or not the data layer is really operationally ready for AI.
So that's one that we see time and again where people are struggling to get AI into production. Yeah, you're really talking about the whole production deployment aspect: I've got things working, but can I turn it over to the audience? Do they have user IDs and access rights to see stuff, and are we following the rules?
Okay. Terrific. Raja, do you have anything to add or what's your perspective of the top-of-mind, the most visible things that you're dealing with?
Yeah. I really want to plus one what Emma said, and I just want to say, I think that with AI and analytics, it's the same issue that we run into even with traditional data warehouse and BI environments, which is a matter of transparency. When an executive or a VP gets a number, or even gets a decision made by an agentic AI, they want the answer of, "Hey, how did we get this number?
Why is it right? How can I believe it?" And that leads us to all the traditional data governance challenges that we have, just at a broader scale, because we are dealing with AI doing it at scale. I love the answers, because what both of you are saying, and I'm going to nitpick, not to question, is fascinating: you're hitting on the fact that organizations have the chops to build this stuff.
But where they're likely struggling is with the issues of turning it over. Transparency, production migration, and deployment: are we ready to turn this over to the masses? Not only is the application functionally available and the data accessible, but then there are all the other details that go along with it.
So for those of you that are listening, I know a lot of folks spend all this energy, do we have the data scientists and the chops and the skills that we need? And what you're hearing from the experts is you probably have that, but are you ready to deploy and turn it over, and have you thought about institutionalizing the new AI technology? So very interesting.
Thank you, guys. Let's move on to the next one. So what do you believe is the biggest misunderstanding about data in the domain of AI and agentic AI?
And this time, Raja, we'll let you start, and then once you're done- Yeah ... Emma.
But go ahead. This is a great question because I think that there's a lot of misunderstanding in the domain of AI and agentic AI. And I'm going to say two answers that are related but may seem like the opposite.
I think enterprise organizations both underestimate and overestimate the use of AI and agentic AI. I think that they might look at something like ChatGPT or Claude that is completely divorced from their enterprise context, and they just ask it a question, and it might spit out an incorrect answer or hallucinate something or just not be very useful. However, if you are taking some of these same AI capabilities, these same frontier models, and plugging them in with your enterprise context, connecting them to your data lake or your data warehouse and pointing it at governed data, suddenly the same types of questions you ask become a lot more powerful once they have your enterprise context and have your enterprise understanding.
So, yeah. I think that companies are both not asking enough of their AI models and also expecting too much without the actual tooling or a proper understanding of what the tool can do. What you're really hitting on, and again, please fill in the blank here for me, is that it's not a functional issue; it's that the people making the request aren't really aware of what the strengths and limitations of the particular technology are.
They just assume for an LLM like Claude, "Well, let me point Claude at my data warehouse. Now it can answer questions using plain language query." And what you're saying is there's a lot more work involved to make something like that happen. Am I hearing you correctly on that one?
Yeah. That's correct. AI is not quite at the let it loose and it'll figure everything out standpoint.
We still need to make sure that it's pointing at the right datasets, with the right set of governance and the right set of compliance or policy set up, to make sure that you use it correctly. But once you have that, then I do think that we can really stretch our imagination on what we can accomplish with some of this tooling. What you're really saying is I not only need the data engineering aspect, but I've also got to have data governance; otherwise, data could be misused, misunderstood, or misprocessed.
That's exactly right. Emma, please, what else? How do I add to that?
You guys pretty much nailed it, right? Because without semantics and governance, AI is very confident, but very confidently wrong, right? So what we see as a big misunderstanding is that the AI will figure out the business context, right?
People think that the AI system can magically figure all of that out for them, when we have people working in marketing and finance that might think of customers as two completely different things, right? Revenue can be calculated five or six different ways; data sets can be completely deprecated and still accessible. The AI can't figure all of this out, right?
So getting the semantics right, getting the governance right, is incredibly important. As I say, the AI is going to very confidently provide wrong answers to the questions you're asking of it. You've actually hit a dimension that you had touched on in your presentation, the whole premise of literacy, and we're- Yeah ...
so focused on the user audience and their data literacy. Sometimes we don't think about, "Has my development community got the same level of data literacy?" And you had hit on not just the meaning of terms, but also what are the reasonable data sets to use and which ones are out of date because I could be using the wrong data. And what is in place to make sure my entire audience, not just users, but developers, are not only literate but know the governance rules and those types of things.
Is that something that you see where there's an awareness that it's not just the users that need to be brought up to speed on this stuff, but it's also the developers themselves? Yeah. We view this as kind of an organizational challenge, and we actually recommend that every department across the enterprise hire data people, right?
That we really believe in federated governance. We believe in bringing a subject matter expert into each domain across the business and making them responsible for working with the central IT team in building out your governance policies and applying those policies, and the same thing for data, right? Hire for data curiosity.
Make sure that across every part of the business you've got people that understand the importance of getting governance right, of becoming a champion for data literacy within the organization, and of making sure that as we think about rolling out governance, literacy, and-- what's the third one I'm thinking of here, Evan? Cultural, like data and AI cultural changes-- that each department or each domain within the organization is covered. I think you're hitting on- And improve on top of that.
Yeah. Yeah. I think you're hitting on something that's incredibly important because in a lot of the organizations that I work with, there's this unwritten set of styles or personality traits, which is, well, data people have to know the data, but applications people don't need to know the data.
And what you're saying is that isn't going to fly, and I've seen this, and I'm sure all of us have seen this case in point where you've got a data scientist or an AI developer where they're desperate to find a particular type of element or content, and they'll go get it anywhere not knowing that, in fact, they're getting data that's aged or unusable or flat out not reliable, the whole trustworthiness aspect. So, I mean, for those of you that are listening, clearly, you can't hide from having data knowledge. It's a necessity in the era of AI.
Emma, thank you for the details. Let's jump to the next question. So back to you, Emma.
What are reasonable goals that folks should have in addressing semantics and literacy? You've touched on the idea of let's make every organization have an awareness of it, but if you were to pick a goal or just a means of how do I know I'm going in the right direction, what do you think it might be? So I would make sure that across the organization that people understand three key things, right?
Where the data comes from, right? What it actually means, and where it should and should not be used, right? And I think if we get that knowledge out across the organization, get a better understanding as to how this magic actually works in reality, I think that's going to be incredibly helpful for addressing data literacy across the organization.
As I said, I love to encourage organizations to promote data curiosity. If you can find those data nerds within the various functions across the enterprise, encourage those people by giving them the education and the tooling that they need so that they can ask questions of that data and really understand it, so that when you need somebody to act as the human in the loop with the AI, right, you've got an educated human that's got a good sense for what looks and smells right about the data. I think that becomes an incredibly important role that needs to be filled.
And I think enabling that by bringing in people that have expertise in that particular domain, and then growing these data skills in them, Emma, I think that can be a really quick path to success. Well, you're actually characterizing bringing in the expertise not to put them in a silo, but in fact to position them more as mentors to everyone else. The idea of developers knowing the tools, the meaning, and the content usage, I mean, whether I'm a baker, a software developer, or a plumber, it's know the tools you have, know where you got the parts from, and know what the parts can do.
I'm not trying to dumb this down. What you're really talking about is we need to look at data very much like an ingredient or a base component on everything we build, so you bring in the expertise to mentor and coach people that aren't necessarily aware of it. It's not to hide them or create this wall.
It's let that knowledge spread. I mean, am I hearing you correctly? You're absolutely bang on.
I was describing data to an Uber driver at the weekend, trying to describe what it is that I do, right? And we're talking about metadata, and I was describing it as, you think of it as the label on the container of data that you're using. Right?
So the metadata tells you the ingredients. Right? What's in here, right?
The freshness, right? When should it be used by? If you can't understand what's in the label, maybe you shouldn't be using it.
Right? So if you can't make sense of it, maybe you need to look beyond that. But yeah, I think incredibly important that we really understand the data that we're using, and we understand its freshness, right?
When it should be used by, and that we are measured in our deployment of AI, not moving too quickly, and making sure that we have humans in the process that actually know what good looks like so that they can identify when the AI systems might be hallucinating. Well, Raja, I didn't mean to cut you out of this, but Emma just hit on something that I want to make sure the audience thinks. What can you add?
What are your thoughts about what are the reasonable goals? Because there's so many ways to take a look at this three-dimensional or five-dimensional issue. Yeah, and again, just to agree with what Emma is saying, I think that's the right way to think about this.
When I think about reasonable goals and when I think about any sort of data journey or AI journey, I always think about it in steps. Like, what do you do for your foundation, and how can you build upon that? And the three steps I would say for addressing semantics or talking about data literacy is document, harmonize, and then educate, which I think kind of lines up to what Emma was saying.
First, you just document all the definitions of your metrics and your semantics as much as you can. You don't need to be perfect, but if you can get the most used assets documented and describe how folks are using it, that's honestly going to put you at a higher step than many organizations out there. Harmonize.
I always use the example of a previous customer of mine who sold to Walmart but had 14 definitions of Walmart in their system. So once you have things documented, then you can take the next step and say, "How can we harmonize this to make sure that we are understanding and simplifying our definitions?" And then finally, educate. Once you have this documentation and harmonization, then you can start trying to change the culture of your organization and say, "Hey, go look up a definition of Walmart before you create a 15th definition of Walmart." So if you can get those three things, you're going to be really far ahead.
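The document-then-harmonize steps Raja describes can be sketched in a few lines of code. This is a hypothetical illustration, not any vendor's tooling: the canonical names, the alias table, and the normalization rules are all invented to show how many variant spellings of one entity can be collapsed onto a single documented definition.

```python
# Sketch: collapse messy source values (e.g. 14 spellings of "Walmart")
# onto one documented, canonical definition. All names are illustrative.
import re

# Step 1: document — a registry of canonical entities and known aliases.
CANONICAL = {
    "walmart": "Walmart Inc.",
    "target": "Target Corporation",
}
ALIASES = {
    "wal-mart": "walmart",
    "wal mart stores": "walmart",
    "wmt": "walmart",
}

def normalize(raw: str) -> str:
    """Lowercase, drop store numbers and punctuation, collapse whitespace."""
    s = re.sub(r"[#\d]+", "", raw.lower())        # strip store numbers
    s = re.sub(r"[^a-z\s-]", "", s).strip()       # keep letters/spaces/hyphens
    return re.sub(r"\s+", " ", s)

def harmonize(raw: str) -> str:
    """Step 2: harmonize — map a messy value to its canonical definition."""
    key = normalize(raw)
    key = ALIASES.get(key, key)
    # Unknown values pass through unchanged, flagged for human review
    # (step 3: educate — add them to the registry, don't invent a 15th copy).
    return CANONICAL.get(key, raw)

for messy in ["WALMART #1234", "Wal-Mart", "wmt", "Acme Co"]:
    print(messy, "->", harmonize(messy))
```

The point of the pass-through default is cultural, matching the "educate" step: an unrecognized value is a prompt to extend the shared registry, not to define the entity again locally.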
And then with the advent of AI tools, hopefully a lot of these things can be sped up, even as you go through the process of, as an example, harmonizing your different semantics. I mean, you're also making the point that this is some missionary work. This isn't just establishing data standards or practices; it's also communicating to everybody and gaining adoption.
Coming up with the best schema or the best naming conventions and a process that every element needs to be reviewed isn't enough. It's also ensuring the audience themselves understands why, so they, in turn, can be the pied pipers of data so people understand this is an adjustment to the way that we live. I hope I'm not putting words in your mouth.
I'm hearing very much like what Emma said. You're agreeing, but taking it to the next step, which is we need to be missionaries about this type of stuff. Yep.
That's exactly right. This is about culture change. Culture change is a laborious process at times that requires us to be patient and bring everyone with us instead of trying to push them too quickly.
Yeah. I'm glad you made the remark because it isn't-- I'm going to split hairs, and then feel free to correct me. It's not that we're changing something.
We're evolving something that, in many instances, doesn't exist. I mean, it's not that anybody's doing anything wrong. It's we're not confirming that people understand the data.
We're not confirming and injecting expertise into the broad audience of what these things mean and what the importance is. I mean, it's not dissimilar, I hate to say it, from an HR world of no one ever disputed we shouldn't have harassment in the workplace, but we created classes that everyone takes to make sure that there's an awareness of that, listen, you're not allowed to be violent in the workplace. And what you and Emma hit on is we need to not only broach this idea and introduce this idea, but make sure everybody understands this is going to be a practice moving forward.
No one's doing it wrong. We're just not doing enough of it, and if we understand, it becomes easy. It becomes second nature.
So very much, thanks for the clarity. Let's move on to the next question, and Raja, we'll start with you. Are there any particular data risks that should be considered with AI delivery and deployment?
We've touched a lot on the importance of transparency, literacy, and deployment. But so now let's talk about, well, are there some risks? Yeah, absolutely.
And that's a great question when talking about AI deployment, and you can split that into two ways, and I have a perspective on both. So the first data risk is what goes into the AI systems. This is why I think about my vision for a unified data platform with a unified governance tool before you go down the AI journey.
Because the last thing you would like to do is train your enterprise AI systems off of data that is either not PII-compliant, simply contains wrong calculations or wrong information about how your business runs and what the rules your business uses to run. And that can oftentimes be very difficult to catch after you've set up your AI system if you don't fully understand what is the data that's going inside of your AI system, your agentic AI system. The second, obviously, is what comes out of it.
You've got to make sure that you have accurate evaluations in place to understand that whatever the decision-making process or the analytical output of your AI systems is, it has a way of being proven against some sort of evaluation framework, so you can say, "Is this working correctly? Is it working fast enough? Is it accurate enough?" And both of those things are, I think, data risks that, again, are not too different from traditional data governance.
It's just at the scale of these AI systems, you need additional tooling to make sure that you're dealing with it effectively. You can't just have one data governance steward searching for all of the data out of an AI system, the outputs of an AI system, because that's going to get overwhelming pretty quickly. Well, the whole idea, we tend to throw terms around like a data architect or a data designer or someone like that.
What you're really hitting on, it's at a higher level, which is, from a governance perspective, what level of data trustworthiness testing has occurred to ensure not only that the process worked accurately, but that the data results were reliable and acceptable? And that's a continuous process. For those of you that are new to AI, this is not a crazy concept.
Any bank that has credit card and/or account verification does this day in and day out. Very, very common. What we're really talking about here is applying operational concepts to the AI environment.
Certainly, it's not apples and apples. It's more complex testing and proof, but the concept is well understood in a lot of organizations. Go leverage some of the things that you have.
Is that a fair statement, Raja? I don't want to, again, put words in your mouth, but sometimes we think- No, that is well said. Yeah.
This is not a from-scratch activity. There might actually be people, depending on the type of industry, that may have some of that expertise that you can apply to the AI environment. Emma, let me jump to you.
And what's your perspective of another potential or a couple of potential risks that we haven't discussed?
Yeah, so I totally agree with Raja. I think we've done a good job here of predicting what the other may say. But I think what I would add here is thinking about RAG search, right?
So this is an area that a lot of organizations are using today. So, retrieval-augmented generation, right? And when we look at RAG searching, we're always going to be searching for the most semantically similar content, which is not always going to be the most correct content, right?
So here, to me, there's a big concern that if we are not applying governance here and if we're not filtering out older documents, that there's a risk here of just delivering the wrong content to answer questions. And when we deal with regulated industries, this becomes a massive risk to the business. So that's one that I think Raja hadn't touched on.
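The risk Emma describes can be made concrete with a small sketch: apply the governance filter (deprecation, freshness) before ranking by similarity, so a stale document can never win just because it is the closest semantic match. The similarity function here is a toy word-overlap stand-in for real embedding search, and the documents, dates, and policy texts are all invented for illustration.

```python
# Sketch: governance-filtered retrieval. The most similar document is not
# always the most correct one, so filter on metadata before ranking.
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    text: str
    last_reviewed: date
    deprecated: bool = False

def similarity(query: str, doc: Doc) -> float:
    """Toy stand-in for vector similarity: fraction of query words present."""
    q = set(query.lower().split())
    return len(q & set(doc.text.lower().split())) / len(q)

def retrieve(query, docs, max_age_days=365):
    """Rank only documents that pass the governance filter first."""
    today = date(2025, 6, 1)  # fixed date so the example is reproducible
    eligible = [
        d for d in docs
        if not d.deprecated and (today - d.last_reviewed).days <= max_age_days
    ]
    return max(eligible, key=lambda d: similarity(query, d), default=None)

docs = [
    # The deprecated doc is *more* similar to the query, but wrong.
    Doc("the refund policy is a 30 day refund window",
        last_reviewed=date(2020, 1, 1), deprecated=True),
    Doc("refund policy allows a 14 day window",
        last_reviewed=date(2025, 3, 1)),
]
best = retrieve("what is the refund policy", docs)
print(best.text)  # -> refund policy allows a 14 day window
```

Note that a pure similarity ranking over these two documents would have returned the deprecated 30-day policy, which is exactly the regulated-industry failure mode Emma warns about.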
Well, you've actually hit on a fairly broad and important topic because, again, can the data support the type of analysis we're addressing, and is it reliable and trustworthy, or does it have bias? There's all kinds of case studies where people went and used data they thought was perfectly acceptable for the purpose at hand, only to find out that they didn't necessarily understand the context of how that data was collected. Or in fact, they're misusing it for the purposes that were intended.
A good example that I use frequently is I was working on a project for the US government, and someone came to me saying they wanted travel information on dietary preferences. They wanted to know halal meal orders, and I said, "You can't have that. That's actually racial profiling.
That's against the law." And you've touched on something that, again, this is where that injection of expertise, the data knowledge, is so critically important because this person was trying to go around and be creative and break a rule. And the goal here is to let everybody step up and learn what the rules and guidelines are, not to try and penalize or be an obstacle, because in many instances, this is a new domain for everybody. But I love you're a little more diplomatic about things than I am.
But it is a huge issue that you hit on, which is, is it fit for purpose, and are we following the rules themselves? Let's go ahead. We've got a few minutes to go.
Let's go ahead and hit on the last question, and Emma, you get to be first, which is if you could make, given all the things that we've discussed, and my goodness, you and Raja have hit a huge horizon of, or should I say, cornucopia of content and ideas. But if you could make one recommendation for the audience, those that are listening to us discuss this, what would that one recommendation be? So I would say as we move forward and with AI, everything's happening at machine speed.
So it's unlikely that we're going to have a human in the loop every place where we would like to have one, just because it's just all moving far too quickly. I would say focus on making your data understandable to machines. So invest in things like having clear semantic definitions, right?
So when Raja talks about 15 definitions of Walmart, make sure that we're very clear in how we understand things. Make sure that we've got clearly identified and documented lineage and provenance, make sure that our governance by design is enforced automatically, not bolted on at the end of data pipelines, and make sure that our metadata is machine-readable. So that's it for me.
Clear semantic definitions, lineage and provenance, governance enforced automatically, and machine-readable metadata so that machines can understand your data in the same way that humans would. Excellent. Raja, to you, what do you think?
What would your one recommendation to the audience be? Emma stole, I think, basically all my big ones. So let me go in a different direction and go with something maybe a little bit lighter.
I would say with the new world that we're entering, you should experiment and experiment safely. There's a lot of different capabilities out there that you can go really take advantage of. There's a lot of different AI tools and even different frontier models that you can play with.
I would recommend go and experiment. Go and see the capabilities for yourself. Do it in a safe way.
Don't do it on production data. Don't do it in a production system. Make sure your PII is accurately tagged.
But if you have those concerns, I think we're entering a new world that's really worth exploring and poking at. Well, terrific. We have hit the end of our segment for the panel.
Let me turn it back over to Andrew, and let's get to some of the audience questions. Absolutely. I thank you all for that great discussion there.
We do have some questions here that came in during your discussion, and I fear that you may have already touched on some of these, so I'll ask you to double-click on them if you don't mind. So Raja, we'll start with you, and of course, Emma, Evan, if you have anything to chime in on, please go right ahead. So this came in a little earlier here, Raja.
This question is: As organizations scale their AI programs and implement production use cases, how critical is the integration and governance between the data platform and the data model or semantic layer? Yeah. This is a great question.
I think that in our legacy environment, when we are talking about data and data tooling, it was very common to have your definitions of your business semantics across multiple different places, whether it's your BI tool or multiple BI tools, your data warehouse. You could even have a third-party semantic layer vendor that managed it for you. And it was very common to have these definition documents in many places.
And I think that this worked in a world where data silos were more tolerable, but increasingly in an AI first and agentic AI world, we need to make sure that the semantic layer is synced and readable across your enterprise landscape. My obvious first recommendation is to pick one place to be your master or your source of truth for semantics. I think that should be the data platform, like Databricks, but we can make different cases as to which platform it should be in.
The key is that you have a source of truth, and every other semantic layer or any other set of tooling that uses semantics is either syncing to or downstream from that source of truth. All right. Terrific.
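Raja's one-source-of-truth point can be sketched simply: metric definitions live once in a shared registry, and every downstream tool resolves them from there rather than keeping a local copy. The registry shape, metric names, and expressions below are invented for illustration; they are not Databricks APIs or any specific semantic-layer format.

```python
# Sketch: a single registry of business metric definitions that all
# downstream tools (BI, notebooks, agents) resolve from, instead of each
# tool carrying its own, slowly diverging copy.
METRIC_REGISTRY = {
    "net_revenue": {
        "expression": "SUM(gross_amount) - SUM(refund_amount)",
        "grain": "order",
        "owner": "finance",
    },
    "active_customers": {
        "expression": "COUNT(DISTINCT customer_id)",
        "grain": "day",
        "owner": "growth",
    },
}

def resolve_metric(name: str) -> dict:
    """Downstream tools call this instead of redefining the metric locally."""
    if name not in METRIC_REGISTRY:
        raise KeyError(f"Unknown metric {name!r}: add it to the registry "
                       "rather than defining it inline in your tool.")
    return METRIC_REGISTRY[name]

# Two different "tools" asking for the same metric get the same definition,
# so a human dashboard and an AI agent are speaking the same language.
bi_tool_view = resolve_metric("net_revenue")
agent_view = resolve_metric("net_revenue")
assert bi_tool_view == agent_view
print(bi_tool_view["expression"])
```

The design choice mirrors the panel's advice: whichever platform you pick as the master, every other semantic consumer should sync to or read downstream from it, never fork its own definition.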
We'll move on to the next question. I think this will be the last one that we have time for today. And Emma, I'll start with you on this.
This question came in a little earlier as well, and it says: As organizations begin deploying agentic AI systems that can make decisions or trigger actions, what new responsibilities fall on the data platform to ensure those systems operate safely and reliably? Yeah. So this is a great question, right?
Because when we deal with preparing data for analytics, right, there's always been a human that's interpreting a dashboard, and you've got that gut feel as to what's right and what's wrong. When we're dealing with agentic AI systems, they're making decisions for us and actioning these things automatically, without that human being there to jump in front of it should things go wrong. So incredibly important here that the data that our agentic systems are working on is of high quality, that we've got our observability in place so that we can identify if there should be a problem with a broken pipeline or with a schema change that needs to be addressed, that we get an early heads-up that that needs remediation, and ultimately we'll get to a point where we automate that remediation.
So, yeah, the platform has to be able to provide the machine-readable trust signals for that agentic AI. Governance, hugely important, right? Governance is not just there at data ingestion time; it's there through the data's entire life cycle, so that governance applies at query time.
And then the third is that the agentic AI systems are going to need semantic context. So the data platforms like Databricks will need to be able to provide that. So it sounds like Raja's got all of that under control for us.
And you can't overstate the importance of observability, to make sure that everything that the agentic AI system is depending upon is being observed and, where we need to make remediations, that those are being made so that we can be successful. All right.
Well, that was fantastic. And unfortunately, this leaves us with no time remaining today, so please allow me to thank our speakers. We did hear from Evan Levy with TDWI, Emma McGrattan with Actian, and Raja Parimal with Databricks.
We'd also like to thank once again Actian and Databricks for sponsoring today's webinar. And please remember, we did record today's webinar, and we'll be emailing you a link to an archived version. And if you'd like a copy of today's presentation, you can locate the resource window to download the PDF.
And finally, not only does TDWI offer leading-edge team training to advance team skills in BI, AI, data governance, strategy, and more, but TDWI now offers team enablement to build in-house capability across your full digital journey. More details will follow in your post-webinar communications. And from all of us here, let me say thank you so much for attending.
This does conclude today's event.