Data Engineering Podcast

Data Modeling That Evolves With Your Business Using Data Vault

66 min • February 9, 2020

Summary

Designing the structure for your data warehouse is a complex and challenging process. As businesses deal with a growing number of sources and types of information that they need to integrate, they need a data modeling strategy that provides them with flexibility and speed. Data Vault is an approach that lets you evolve a data model in place, without destructive transformations or massive up-front design, to answer valuable questions. In this episode, Kent Graziano shares his journey with Data Vault, explains how it enables an agile approach to data warehousing, and walks through the core principles of how to use it. If you’re struggling with unwieldy dimensional models, slow-moving projects, or challenges integrating new data sources, then listen in on this conversation and give Data Vault a try for yourself.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve ClickHouse, the open source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Kent Graziano about data vault modeling and the role that it plays in the current data landscape

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of what data vault modeling is and how it differs from other approaches such as third normal form or the star/snowflake schema?
    • What is the history of this approach and what limitations of alternate styles of modeling is it attempting to overcome?
    • How did you first encounter this approach to data modeling and what is your motivation for dedicating so much time and energy to promoting it?
  • What are some of the primary challenges associated with data modeling that contribute to the long lead times for data requests or outright project failure?
  • What are some of the foundational skills and knowledge that are necessary for effective modeling of data warehouses?
    • How has the era of data lakes, unstructured/semi-structured data, and non-relational storage engines impacted the state of the art in data modeling?
    • Is there any utility in data vault modeling in a data lake context (S3, Hadoop, etc.)?
  • What are the steps for establishing and evolving a data vault model in an organization?
    • How does that approach scale from one to many data sources and their varying lifecycles of schema changes and data loading?
  • What are some of the changes in query structure that consumers of the model will need to plan for?
  • Are there any performance or complexity impacts imposed by the data vault approach?
  • Can you talk through the overall lifecycle of data in a data vault modeled warehouse?
    • How does that compare to approaches such as audit/history tables in transaction databases or slowly changing dimensions in a star or snowflake model?
  • What are some cases where a data vault approach doesn’t fit the needs of an organization or application?
  • For listeners who want to learn more, what are some references or exercises that you recommend?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
