How AI Is Built

Navigating the Modern Data Stack, Choosing the Right OSS Tools, From Problem to Requirements to Architecture | ep 7

38 min • 17 May 2024

From Problem to Requirements to Architecture.

In this episode, Nicolay Gerold and Jon Erik Kemi Warghed discuss the data engineering landscape: selecting the right tools, implementing effective data governance, and applying concepts such as software-defined assets. They cover the challenges of keeping up with a fast-changing tech landscape and offer practical advice for building sustainable data platforms. Tune in to learn how to simplify complex data pipelines, get more out of orchestration tools, and ultimately create more value from your data.

  • "Don't overcomplicate what you're actually doing."
  • "Getting your basic programming software development skills down is super important to becoming a good data engineer."
  • "Who has time to learn 500 new tools? It's like, this is not humanly possible anymore."

Key Takeaways:

  • Data Governance: Data governance is about transparency and understanding the data you have. It's crucial for organizations as they scale and data becomes more complex. Tools like dbt and Dagster can help achieve this.
  • Open Source Tooling: When choosing open source tools, assess their backing, commit frequency, community support, and ease of use.
  • Agile Data Platforms: Focus on the capabilities you want to enable and prioritize solving the core problems of your data engineers and analysts.
  • Software-Defined Assets: This concept, exemplified by Dagster, shifts the focus from how data is processed to what data should exist. This change in mindset can greatly simplify data orchestration and management.
  • The Importance of Fundamentals: Strong programming and software development skills are crucial for data engineers, and understanding the basics of data management and orchestration is essential for success.
  • The Importance of Versioning Data: Data has to be versioned so you can easily track changes, revert to previous states if needed, and ensure reproducibility in your data pipelines. lakeFS applies the concepts of Git to your data lake. This gives you the ability to create branches for different development environments, commit changes to specific versions, and merge branches together once changes have been tested and validated.
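
To make the branch, commit, and merge workflow from the versioning takeaway concrete, here is a minimal in-memory sketch. It is a toy model only, not the lakeFS API: the `VersionedStore` class, its method names, and the `orders.csv` object are all hypothetical, and the merge is deliberately simplified (the source branch overwrites the target, and deletions are not handled).

```python
class VersionedStore:
    """Toy in-memory model of Git-style data versioning (NOT the lakeFS API)."""

    def __init__(self):
        self.commits = [{}]           # commit id = list index; commit 0 is the empty snapshot
        self.branches = {"main": 0}   # branch name -> id of its latest commit

    def branch(self, name, source="main"):
        # A new branch simply points at the source branch's current commit.
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        # Store a new snapshot: the branch's previous snapshot plus the changes.
        snapshot = {**self.commits[self.branches[branch]], **changes}
        self.commits.append(snapshot)
        self.branches[branch] = len(self.commits) - 1
        return self.branches[branch]

    def read(self, branch):
        return self.commits[self.branches[branch]]

    def merge(self, source, target):
        # Simplified merge: objects on the source overwrite those on the target.
        return self.commit(target, self.read(source))

    def revert(self, branch, commit_id):
        # Point the branch back at an earlier snapshot.
        self.branches[branch] = commit_id


store = VersionedStore()
v1 = store.commit("main", {"orders.csv": "1 row"})
store.branch("dev")                              # isolated development environment
store.commit("dev", {"orders.csv": "2 rows"})
print(store.read("main"))                        # main is untouched by the dev commit
store.merge("dev", "main")
print(store.read("main"))                        # main now holds the dev data
store.revert("main", v1)
print(store.read("main"))                        # back to the first snapshot
```

Because every commit is kept as its own snapshot, reverting is only a pointer move, which is what makes earlier pipeline runs reproducible in this model.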

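The software-defined assets takeaway above can also be sketched in a few lines. This is a standard-library illustration of the declarative idea, not Dagster's actual API: the `asset` decorator, the `ASSETS` registry, the `materialize` function, and the example assets are all hypothetical. Each function declares a piece of data that should exist, its parameter names declare which assets it depends on, and a small resolver works out the execution order.

```python
import inspect

# Hypothetical mini-registry; NOT Dagster's API, only an illustration of
# declaring *what* data should exist and letting a resolver work out *how*.
ASSETS = {}


def asset(fn):
    """Register a function as an asset; its parameter names are its upstream assets."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    return [{"id": 1, "amount": 40}, {"id": 2, "amount": 60}]


@asset
def order_total(raw_orders):
    return sum(row["amount"] for row in raw_orders)


@asset
def order_report(raw_orders, order_total):
    return f"{len(raw_orders)} orders, total {order_total}"


def materialize(name, cache=None):
    """Build the named asset, materializing each upstream asset at most once.

    No cycle detection: this sketch assumes the dependencies form a DAG.
    """
    cache = {} if cache is None else cache
    if name not in cache:
        deps = inspect.signature(ASSETS[name]).parameters
        cache[name] = ASSETS[name](**{dep: materialize(dep, cache) for dep in deps})
    return cache[name]


print(materialize("order_report"))  # prints "2 orders, total 100"
```

Nothing in the sketch states the order in which to run things; asking for `order_report` is enough, and the dependency graph is derived from the declarations themselves.
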
Jon Erik Kemi Warghed:

Nicolay Gerold:

Chapters

00:00 The Problem with the Modern Data Stack: Too many tools and buzzwords

00:57 How to Choose the Right Tools: Considerations for startups and large companies

03:13 Evaluating Open Source Tools: Background checks and due diligence

07:52 Defining Data Governance: Transparency and understanding of data

10:15 The Importance of Data Governance: Challenges and solutions

12:21 Data Governance Tools: dbt and Dagster

17:05 The Impact of Dagster: Software-defined assets and declarative thinking

19:31 The Power of Software-Defined Assets: How Dagster differs from Airflow and Mage

21:52 State Management and Orchestration in Dagster: Real-time updates and dependency management

26:24 Why Use Orchestration Tools?: The role of orchestration in complex data pipelines

28:47 The Importance of Tool Selection: Thinking about long-term sustainability

31:10 When to Adopt Orchestration: Identifying the need for orchestration tools
