107. Kevin Hu - Data observability and why it matters

Towards Data Science - A podcast by The TDS team

Imagine for a minute that you’re running a profitable business, and that part of your sales strategy is to send the occasional mass email to people who’ve signed up to be on your mailing list. For a while, this approach leads to a reliable flow of new sales, but then one day, that abruptly stops. What happened? You pore over logs, looking for an explanation, but it turns out that the problem wasn’t with your software; it was with your data. Maybe the new intern accidentally added a character to every email address in your dataset, or shuffled the names on your mailing list so that Christina got a message addressed to “John”, or vice versa.

Versions of this story happen surprisingly often, and when they happen, the cost can be significant: lost revenue, disappointed customers, or worse, an irreversible loss of trust. Today, entire products are being built on top of datasets that aren’t monitored properly for critical failures, and an increasing number of those products are operating in high-stakes situations. That’s why data observability matters so much: it is the ability to track the origin, transformations and characteristics of mission-critical data in order to detect problems before they lead to downstream harm.

And it’s also why we’ll be talking to Kevin Hu, the co-founder and CEO of Metaplane, one of the world’s first data observability startups. Kevin has a deep understanding of data pipelines, and of the problems that can pop up if they aren’t properly monitored. He joined me to talk about data observability, why it matters, and how it might be connected to responsible AI on this episode of the TDS podcast.

Intro music:
➞ Artist: Ron Gelinas
➞ Track Title: Daybreak Chill Blend (original mix)
➞ Link to Track: https://youtu.be/d8Y2sKIgFWc

Chapters:
0:00 Intro
2:00 What is data observability?
8:20 Difference between a dataset’s internal and external characteristics
12:20 Why is data so difficult to log?
17:15 Tracing back models
22:00 Algorithmic analyzation of a date
26:30 Data ops in five years
33:20 Relation to cutting-edge AI work
39:25 Software engineering and startup funding
42:05 Problems on a smaller scale
46:40 Future data ops problems to solve
48:45 Wrap-up
