Sveriges mest populära poddar

Data Engineering Podcast

Achieving Data Reliability: The Role of Data Contracts in Modern Data Management

49 min • 28 juli 2024
Summary
Data contracts are both an enforcement mechanism for data quality, and a promise to downstream consumers. In this episode Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, the organizational impact, and the potential for standardization in the field.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to motific.ai today to learn more!
  • Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your data
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe the scope and purpose of data contracts in the context of this conversation?
  • In what way(s) do they differ from data quality/data observability?
  • Data contracts are also known as the API for data, can you elaborate on this?
  • What are the types of guarantees and requirements that you can enforce with these data contracts?
  • What are some examples of constraints or guarantees that cannot be represented in these contracts?
  • Are data contracts related to the shift-left?
  • Data contracts are also known as the API for data, can you elaborate on this?
  • The obvious application of data contracts are in the context of pipeline execution flows to prevent failing checks from propagating further in the data flow. What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?
  • How did you approach the design of the syntax and implementation for Soda's data contracts?
  • Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap in e.g. dbt, great expectations?
  • Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?
  • What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?
  • When are data contracts the wrong choice?
  • What do you have planned for the future of data contracts?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Förekommer på
00:00 -00:00