How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.
The podcast How AI Is Built is created by Nicolay Gerold. The podcast and the artwork on this page are embedded on this page using the public podcast feed (RSS).
Alex Garcia is a developer focused on making vector search accessible and practical. As he puts it: "I'm a SQLite guy. I use SQLite for a lot of projects... I want an easier vector search thing that I don't have to install 10,000 dependencies to use."
Core Mantra: "Simple, Local, Scalable"
Why SQLite Vec?
"I didn't go along thinking, 'Oh, I want to build vector search, let me find a database for it.' It was much more like: I use SQLite for a lot of projects, I want something lightweight that works in my current workflow."
SQLiteVec uses row-oriented storage with some key design choices:
Practical limits:
Key advantages:
Garcia's preferred tools for local AI:
1. Choose Your Storage
"There's two ways of storing vectors within SQLiteVec. One way is a manual way where you just store a JSON array... [second is] using a virtual table."
2. Optimize Performance
"With binary quantization it's 1/32 of the space... and holds up at 95 percent quality"
3. Integration Patterns
"It's a single file, right? So you can like copy and paste it if you want to make a backup."
4. Real-World Tips
"I typically choose the really small model... it's 30 megabytes. It quantizes very easily... I like it because it's very small, quick and easy."
Alex Garcia
Nicolay Gerold:
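On the binary quantization figure quoted in the SQLite Vec notes above: the 1/32 comes from storing one bit per dimension instead of a 32-bit float. The following is a minimal sketch in plain Python, assuming sign-based thresholding at zero and Hamming distance for comparison. The function names are hypothetical and this is not the sqlite-vec API.

```python
def binary_quantize(vector):
    """Pack one bit per dimension: 1 if the value is positive, else 0."""
    packed = bytearray((len(vector) + 7) // 8)
    for i, value in enumerate(vector):
        if value > 0:
            # Most significant bit first within each byte.
            packed[i // 8] |= 1 << (7 - i % 8)
    return bytes(packed)


def hamming_distance(a, b):
    """Count differing bits between two equally sized packed vectors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Illustrative 8-dimensional embedding (made-up values).
embedding = [0.12, -0.48, 0.33, -0.05, 0.91, -0.27, 0.08, -0.64]
quantized = binary_quantize(embedding)

float32_bytes = len(embedding) * 4  # 4 bytes per float32 dimension
binary_bytes = len(quantized)       # 1 bit per dimension
compression = float32_bytes // binary_bytes
```

Hamming distance on the packed bytes is cheap to compute, which is one reason binary vectors are often used for a fast first pass before rescoring with the full-precision vectors.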
Today, I (Nicolay Gerold) sit down with Trey Grainger, author of the book AI-Powered Search. We discuss the different techniques for search and recommendations and how to combine them.
While RAG (Retrieval-Augmented Generation) has become a buzzword in AI, Trey argues that the current understanding of "RAG" is overly simplified – it's actually a bidirectional process he calls "GARRAG," where retrieval and generation continuously enhance each other.
Trey uses a three-context framework for search architecture:
Trey shares insights on:
For engineers building search systems, Trey offers practical advice on choosing the right tools and techniques, from traditional search engines like Solr and Elasticsearch to modern approaches like ColBERT.
He also covers how to layer different techniques to make search tunable and debuggable.
Quotes:
Trey Grainger:
Nicolay Gerold:
00:00 Introduction to Search Challenges 00:50 Layered Approach to Ranking 01:00 Personalization and Signal Boosting 02:25 Broader Principles in Software Engineering 02:51 Interview with Trey Grainger 03:32 Understanding RAG and Retrieval 04:35 Nonlinear Pipelines in Search 06:01 Generative AI and Retrieval 08:10 Search Renaissance and AI 10:27 Misconceptions in AI-Powered Search 18:12 Search vs. Recommendation Systems 22:26 Three Buckets of Relevance 38:19 Traditional Learning to Rank 39:11 Semantic Relevance and User Behavior 39:53 Layered Ranking Algorithms 41:40 Personalization in Search 43:44 Technological Setup for Query Understanding 48:21 Personalization and User Behavior Vectors 52:10 Choosing the Right Search Engine 56:35 Future of AI-Powered Search 01:00:48 Building Effective Search Applications 01:06:50 Three Critical Context Frameworks 01:12:08 Modern Search Systems and Contextual Understanding 01:13:37 Conclusion and Recommendations
Today we are back, continuing our series on search. We are talking to Brandon Smith about his work for Chroma. He led one of the largest studies in the field on different chunking techniques. So today we will look at how we can unfuck our RAG systems from badly chosen chunking hyperparameters.
The biggest lie in RAG is that semantic search is simple. The reality is that it's easy to build, it's easy to get up and running, but it's really hard to get right. And if you don't have a good setup, it's near impossible to debug. One of the reasons it's really hard is actually chunking. And there are a lot of things you can get wrong.
And even OpenAI bungled it a little bit, in my opinion, by using an 800-token length for the chunks. This might work for legal text, where you have a lot of boilerplate that carries little semantic meaning, but often you have the opposite: very information-dense content. Imagine fitting an entire Wikipedia page into the size of a tweet. A lot of information would be lost, and that is what happens with long chunks. The next issue is overlap. OpenAI uses, or used to use, a 400-token overlap. The idea is to bring the important context into the chunk, but in reality, we don't really know where the context is coming from.
It could be from a few pages prior, not just the 400 tokens before. It could also be from a definition that's not even in the document at all. There is a really interesting solution from Anthropic, Contextual Retrieval, where you preprocess all the chunks to see whether any information is missing and try to reintroduce it.
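To make the two hyperparameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes the text is already tokenized (a plain list stands in for real tokenizer output), and `chunk_tokens` is a hypothetical helper, not any library's API. With the values mentioned above you would call it with `chunk_size=800` and `overlap=400`.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Small numbers so the windows are easy to read.
example_tokens = list(range(10))
example_chunks = chunk_tokens(example_tokens, chunk_size=4, overlap=2)
```

Note what the overlap does and does not buy you: each chunk repeats the tail of the previous one, but context that lives pages earlier, or outside the document, is still missing.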
Brandon Smith:
Nicolay Gerold:
00:00 The Biggest Lie in RAG: Semantic Search Simplified 00:43 Challenges in Chunking and Overlap 01:38 Introducing Brandon Smith and His Research 02:05 The Motivation and Mechanics of Chunking 04:40 Issues with Current Chunking Methods 07:04 Optimizing Chunking Strategies 23:04 Introduction to Chunk Overlap 24:23 Exploring LLM-Based Chunking 24:56 Challenges with Initial Approaches 28:17 Alternative Chunking Methods 36:13 Language-Specific Considerations 38:41 Future Directions and Best Practices
Most LLMs you use today already use synthetic data.
It’s not a thing of the future.
The large labs use a large model (e.g. gpt-4o) to generate training data for a smaller one (gpt-4o-mini).
This lets you build fast, cheap models that do one thing well.
This is “distillation”.
But the vision for synthetic data is much bigger.
It is to enable people to train specialized AI systems without having a lot of training data.
Today we are talking to Adrien Morisot, an ML engineer at Cohere.
We talk about how Cohere uses synthetic data to train their models, their learnings, and how you can use synthetic data in your training.
We are slightly diverging from our search focus, but I wanted to create a deeper dive into synthetic data after our episode with Saahil.
You could use it in a lot of places: generate hard negatives, generate training samples for classifiers and rerankers and much more.
Scaling Synthetic Data Creation: https://arxiv.org/abs/2406.20094
Adrien Morisot:
Nicolay Gerold:
00:00 Introduction to Synthetic Data in LLMs 00:18 Distillation and Specialized AI Systems 00:39 Interview with Adrien Morisot 02:00 Early Challenges with Synthetic Data 02:36 Breakthroughs and Rediscovery 03:54 The Evolution of AI and Synthetic Data 07:51 Data Harvesting and Internet Scraping 09:28 Generating Diverse Synthetic Data 15:37 Manual Review and Quality Control 17:28 Automating Data Evaluation 18:54 Fine-Tuning Models with Synthetic Data 21:45 Avoiding Behavioral Cloning 23:47 Ensuring Model Accuracy with Verification 24:31 Adapting Models to Specific Domains 26:41 Challenges in Financial and Legal Domains 28:10 Improving Synthetic Data Sets 30:45 Evaluating Model Performance 32:21 Using LLMs as Judges 35:42 Practical Tips for AI Practitioners 41:26 Synthetic Data in Training Processes 43:51 Quality Control in Synthetic Data 45:41 Domain Adaptation Strategies 46:51 Future of Synthetic Data Generation 47:30 Conclusion and Next Steps
Modern RAG systems build on flexibility.
At their core, they match each query with the best tool for the job.
They know which tool fits each task. A question about sales figures might need SQL, while a search through company policy documents works better with vector search or BM25. The key is building systems that can switch between these tools smoothly.
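As a rough illustration of the routing idea, the sketch below picks a tool per query. In an agentic setup an LLM or a trained classifier would usually make this decision; the keyword rules here are only a stand-in, and the hint words, the ID pattern, and the tool names are all assumptions made for the example.

```python
import re

# Hypothetical hint words suggesting an aggregate question over structured data.
SQL_HINTS = {"sales", "revenue", "total", "average", "count"}
# Hypothetical pattern for exact identifiers such as "HR-204".
EXACT_TERM = re.compile(r"\b[A-Z]{2,}-\d+\b")


def route(query):
    """Return the name of the retrieval tool to use for this query."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    if words & SQL_HINTS:
        return "sql"     # numbers and aggregates live in tables
    if EXACT_TERM.search(query):
        return "bm25"    # exact identifiers favor lexical matching
    return "vector"      # everything else: semantic search


sales_tool = route("What were total sales in Q3?")
lookup_tool = route("Find policy HR-204")
policy_tool = route("How much parental leave do I get?")
```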
But all types of retrieval start with metadata. By tagging documents with key details during processing, we narrow the search space before diving in.
The best systems use a mix of approaches: they might keep full documents for context, summaries for quick scanning, and metadata for filtering. They cast a wide net at first, then use neural ranking to zero in on the most relevant results.
The quality of embeddings can make or break a system. General-purpose models often fall short in specialized fields. Testing different embedding models on your specific data pays off - what works for general text might fail for legal documents or technical manuals. Sometimes, fine-tuning a model for your domain is worth the effort.
When building search systems, think modular. Start with pieces that can be swapped out as needs change or better tools emerge. Add metadata processing early - it's harder to add later. Break the retrieval process into steps: first find possible matches quickly, then rank them carefully. For complex documents with tables or images, add tools that can handle different types of content.
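The staged retrieval described above can be sketched as three small steps: filter on metadata, collect candidates cheaply, then rank the shortlist more carefully. Both scoring functions below are toy stand-ins (term overlap for the fast stage, Jaccard similarity in place of a neural reranker), and the document layout is an assumption for the example.

```python
def retrieve(documents, query, metadata_filter, candidates=3, top_k=2):
    """Filter by metadata, shortlist cheaply, then rerank the shortlist."""
    query_terms = set(query.lower().split())

    # Step 0: metadata narrows the search space before any scoring.
    pool = [
        doc for doc in documents
        if all(doc["meta"].get(key) == value
               for key, value in metadata_filter.items())
    ]

    # Step 1: fast, recall-oriented scoring (count of shared terms).
    def cheap_score(doc):
        return len(query_terms & set(doc["text"].lower().split()))

    shortlist = sorted(pool, key=cheap_score, reverse=True)[:candidates]

    # Step 2: slower, precision-oriented scoring (stand-in for a reranker).
    def careful_score(doc):
        doc_terms = set(doc["text"].lower().split())
        return len(query_terms & doc_terms) / len(query_terms | doc_terms)

    return sorted(shortlist, key=careful_score, reverse=True)[:top_k]


docs = [
    {"id": "a", "meta": {"type": "policy"},
     "text": "remote work policy for all employees in every office worldwide"},
    {"id": "b", "meta": {"type": "policy"}, "text": "remote work policy"},
    {"id": "c", "meta": {"type": "report"}, "text": "remote work policy"},
    {"id": "d", "meta": {"type": "policy"}, "text": "travel expenses"},
]
results = retrieve(docs, "remote work policy", {"type": "policy"})
result_ids = [doc["id"] for doc in results]
```

Because each stage is its own function, any one of them can be swapped for a better tool later without touching the others.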
The best systems also check their work. They ask: "Did I actually answer the question?" If not, they try a different approach. But they also know when to stop - endless loops help no one. In the end, RAG isn't just about finding information. It's about finding the right information, in the right way, at the right time.
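A minimal sketch of "check your work, but know when to stop": try one strategy after another, verify each answer, and give up after a fixed number of attempts. The strategies and the check are stubs; in a real system the check might be an LLM judging whether the answer addresses the question. All names here are hypothetical.

```python
def answer_with_checks(question, strategies, is_good_enough, max_attempts=3):
    """Try strategies in order until one passes the check or the budget runs out."""
    tried = []
    for name, strategy in strategies[:max_attempts]:
        answer = strategy(question)
        tried.append(name)
        if is_good_enough(question, answer):
            return answer, tried
    # Budget exhausted: report failure instead of looping forever.
    return None, tried


# Stub strategies for illustration only.
strategies = [
    ("vector", lambda q: ""),                  # finds nothing
    ("bm25", lambda q: "30 days of leave"),    # finds an answer
    ("sql", lambda q: "unused"),
]
answer, tried = answer_with_checks(
    "How much leave do I get?", strategies, lambda q, a: bool(a)
)
no_answer, tried_once = answer_with_checks(
    "How much leave do I get?", strategies, lambda q, a: bool(a), max_attempts=1
)
```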
Stephen Batifol:
Nicolay Gerold:
00:00 Introduction to Agentic RAG 00:04 Understanding Control Flow in Agentic RAG 00:33 Decision Making with LLMs 01:11 Exploring Agentic RAG with Stephen Batifol 03:35 Comparing RAG and GAR 06:31 Implementing Agentic RAG Workflows 22:36 Filtering with Prefix, Suffix, and Midfix 24:15 Breaking Mechanisms in Workflows 28:00 Evaluating Agentic Workflows 30:31 Multimodal and VLLMs in Document Processing 33:51 Challenges and Innovations in Parsing 34:51 Overrated and Underrated Aspects in LLMs 39:52 Building Effective Search Applications
Many companies use Elastic or OpenSearch and use 10% of the capacity.
They have to build ETL pipelines.
Get data normalized.
Worry about race conditions.
All in all, at the moment, when you want to do search on top of your transactional data, you are forced to build a distributed system.
Not anymore.
ParadeDB is building an open-source PostgreSQL extension to enable search within your database.
Today, I am talking to Philippe Noël, the founder and CEO of ParadeDB.
We talk about how they built it, how they integrate with the Postgres query engine, and how you can build search on top of Postgres.
Key Insights:
Search is changing. We're moving from separate search clusters to search inside databases. Simpler architecture, stronger guarantees, lower costs up to a certain scale.
Most search engines force you to duplicate data. ParadeDB doesn't. You keep data normalized and join at query time. It hooks deep into Postgres's query planner. It doesn't just bolt on search - it lets Postgres optimize search queries alongside SQL ones.
Search indices can work with ACID. ParadeDB's BM25 index keeps Lucene-style components (term frequency, normalization) but adds Postgres metadata for transactions. Search + ACID is possible.
Two storage types matter: inverted indices for text, columnar "fast fields" for analytics. Pick the right one or queries get slow. Integers now default to columnar to prevent common mistakes.
Mixing query engines looks tempting but fails. The team tried using DuckDB and DataFusion inside Postgres. Both were fast but broke ACID compliance. They had to rebuild features natively.
Philippe Noël:
Nicolay Gerold:
00:00 Introduction to ParadeDB 00:53 Building ParadeDB with Rust 01:43 Integrating Search in Postgres 03:04 ParadeDB vs. Elastic 05:48 Technical Deep Dive: Postgres Integration 07:27 Challenges and Solutions 09:35 Transactional Safety and Performance 11:06 Composable Data Systems 15:26 Columnar Storage and Analytics 20:54 Case Study: Alibaba Cloud 21:57 Data Warehouse Context 23:24 Custom Indexing with BM25 24:01 Postgres Indexing Overview 24:17 Fast Fields and Columnar Format 24:52 Lucene Inspiration and Data Storage 26:06 Setting Up and Managing Indexes 27:43 Query Building and Complex Searches 30:21 Scaling and Sharding Strategies 35:27 Query Optimization and Common Mistakes 38:39 Future Developments and Integrations 39:24 Building a Full-Fledged Search Application 42:53 Challenges and Advantages of Using ParadeDB 46:43 Final Thoughts and Recommendations
RAG isn't a magic fix for search problems. While it works well at first, most teams find it's not good enough for production out of the box. The key is to make it better step by step, using good testing and smart data creation.
Today, we are talking to Saahil Ognawala from Jina AI to start to understand RAG.
To build a good RAG system, you need three things: ways to test it, methods to create training data, and plans to make it better over time. Testing starts with a set of example searches that users might make. These should include common searches that happen often, medium-rare searches, and rare searches that only happen now and then. This mix helps you measure if changes make your system better or worse.
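One way to set up such a test set is to tag every example query with its frequency segment and report a metric per segment, so a change that helps common queries but hurts rare ones is visible. The sketch below uses recall@k; the segment names, the data layout, and the `fake_search` stub are assumptions for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def evaluate(golden_set, search, k=3):
    """Average recall@k per query segment (for example head, torso, tail)."""
    per_segment = {}
    for example in golden_set:
        score = recall_at_k(search(example["query"]), example["relevant"], k)
        per_segment.setdefault(example["segment"], []).append(score)
    return {seg: sum(vals) / len(vals) for seg, vals in per_segment.items()}


# Tiny hand-made golden set and a stubbed search function.
golden_set = [
    {"query": "q1", "segment": "head", "relevant": ["d1", "d2"]},
    {"query": "q2", "segment": "tail", "relevant": ["d9"]},
]
canned_results = {"q1": ["d1", "d3", "d2"], "q2": ["d4", "d5", "d6"]}


def fake_search(query):
    return canned_results[query]


report = evaluate(golden_set, fake_search, k=3)
```

Running the same report before and after every change gives you the "better or worse" signal described above.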
Creating synthetic data helps make the system stronger, especially in spotting wrong answers that look right. Think of someone searching for a "gluten-free chocolate cake." A "sugar-free chocolate cake" might look like a good answer because it shares many words, but it's wrong.
These tricky examples help the system learn the difference between similar but different things.
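A simple way to surface such look-alike wrong answers is to rank the non-relevant documents by how much they resemble the query. The sketch below uses token overlap (Jaccard similarity) as the resemblance measure, which is an assumption made for the example; in practice teams often use an embedding model or an LLM to propose hard negatives.

```python
import re


def tokens(text):
    """Lowercase word tokens; hyphens split words, so 'gluten-free' gives two."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mine_hard_negatives(query, positives, corpus, top_n=1):
    """Return the non-relevant documents that look most like the query."""
    query_tokens = tokens(query)
    negatives = [doc for doc in corpus if doc not in positives]
    ranked = sorted(
        negatives,
        key=lambda doc: jaccard(query_tokens, tokens(doc)),
        reverse=True,
    )
    return ranked[:top_n]


corpus = [
    "gluten-free chocolate cake recipe",
    "sugar-free chocolate cake",
    "grilled salmon with lemon",
]
hard_negatives = mine_hard_negatives(
    "gluten-free chocolate cake",
    positives=["gluten-free chocolate cake recipe"],
    corpus=corpus,
)
```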
When creating synthetic data, you need rules. The best way is to show the AI a few real examples and give it a list of topics to work with. Most teams find that using half real data and half synthetic data works best. This gives you enough variety while keeping things real.
Getting user feedback is hard with RAG. In normal search, you can see if users click on results. But with RAG, the system creates an answer from many pieces. A good answer might come from both good and bad pieces, making it hard to know which parts helped. This means you need smart ways to track which pieces of information actually helped make good answers.
One key rule: don't make things harder than they need to be. If simple keyword search (called BM25) works well enough, adding fancy AI search might not be worth the extra work.
Success with RAG comes from good testing, careful data creation, and steady improvements based on real use. It's not about using the newest AI models. It's about building good systems and processes that work reliably.
"It isn’t a magic wand you can place on your catalog and expect results you didn’t get before."
“Most of our users are enterprise users who have seen the most success in their RAG systems are the ones that very early implemented a continuous feedback mechanism.”
“If you can't tell in real time usage whether an answer is a bad answer or a right answer because the LLM just makes it look like the right answer then you only have your retrieval dataset to blame”
Saahil Ognawala:
Nicolay Gerold:
00:00 Introduction to Retrieval Augmented Generation (RAG) 00:29 Interview with Saahil Ognawala 00:52 Synthetic Data in Language Generation 01:14 Understanding the E5 Mistral Instructor Embeddings Paper 03:15 Challenges and Evolution in Synthetic Data 05:03 User Intent and Retrieval Systems 11:26 Evaluating RAG Systems 14:46 Setting Up Evaluation Frameworks 20:37 Fine-Tuning and Embedding Models 22:25 Negative and Positive Examples in Retrieval 26:10 Synthetic Data for Hard Negatives 29:20 Case Study: Marine Biology Project 29:54 Addressing Errors in Marine Biology Queries 31:28 Ensuring Query Relevance with Human Intervention 31:47 Few Shot Prompting vs Zero Shot Prompting 35:09 Balancing Synthetic and Real World Data 37:17 Improving RAG Systems with User Feedback 39:15 Future Directions for Jina and Synthetic Data 40:44 Building and Evaluating Embedding Models 41:24 Getting Started with Jina and Open Source Tools 51:25 The Importance of Hard Negatives in Embedding Models
Documentation quality is the silent killer of RAG systems. A single ambiguous sentence might corrupt an entire set of responses. But the hardest part isn't fixing errors - it's finding them.
Today we are talking to Max Buckley on how to find and fix these errors.
Max works at Google and has run a lot of interesting experiments on using LLMs to improve knowledge bases for generation.
We talk about identifying ambiguities, fixing errors, creating improvement loops in the documents and a lot more.
Some Insights:
Max Buckley: (All opinions are his own and not of Google)
Nicolay Gerold:
00:00 Understanding LLM Hallucinations 00:02 Challenges with Temporal Inconsistencies 00:43 Issues with Document Structure and Terminology 01:05 Introduction to Retrieval Augmented Generation (RAG) 01:49 Interview with Max Buckley 02:27 Anthropic's Approach to Document Chunking 02:55 Contextualizing Chunks for Better Retrieval 06:29 Challenges in Chunking and Search 07:35 LLMs in Internal Knowledge Management 08:45 Identifying and Fixing Documentation Errors 10:58 Using LLMs for Error Detection 15:35 Improving Documentation with User Feedback 24:42 Running Processes on Retrieved Context 25:19 Challenges of Terminology Consistency 26:07 Handling Definitions and Glossaries 30:10 Addressing Context Misinterpretation 31:13 Improving Documentation Quality 36:00 Future of AI and Search Technologies 42:29 Ensuring Documentation Readiness for AI
Ever wondered why vector search isn't always the best path for information retrieval?
Join us as we dive deep into BM25 and its unmatched efficiency in our latest podcast episode with David Tippett from GitHub.
Discover how BM25 transforms search efficiency, even at GitHub's immense scale.
BM25, short for Best Match 25, uses term frequency (TF) and inverse document frequency (IDF) to score document-query matches. It addresses limitations of plain TF-IDF by adding term-frequency saturation and document length normalization.
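For readers who want to see the two additions in code, here is a compact sketch of BM25 scoring. It assumes whitespace tokenization, the commonly used defaults `k1=1.2` and `b=0.75`, and the Lucene-style IDF variant; other implementations differ in details. `k1` controls how quickly repeated terms saturate, and `b` controls how strongly long documents are penalized.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        # Length normalization: longer-than-average documents get a larger value.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            # Saturation: the gain from each extra occurrence shrinks.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


documents = [
    "vector search",
    "vector vector vector vector search",
    "keyword search engine",
]
scores = bm25_scores("vector", documents)
```

In this toy corpus the second document repeats the query term four times but scores well under four times the first document, which is the saturation effect at work.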
Search Is About User Expectations
The Challenge of Vector Search at Scale
Vector Search vs. BM25: A Trade-off of Precision vs. Cost
David Tippett:
Nicolay Gerold:
00:00 Introduction to RAG and Vector Search Challenges 00:28 Introducing BM25: The Efficient Search Solution 00:43 Guest Introduction: David Tippett 01:16 Comparing Search Engines: Vespa, Weaviate, and More 07:53 Understanding BM25 and Its Importance 09:10 Deep Dive into BM25 Mechanics 23:46 Field-Based Scoring and BM25F 25:49 Introduction to Zero Shot Retrieval 26:03 Vector Search vs BM25 26:22 Combining Search Techniques 26:56 Favorite BM25 Adaptations 27:38 Postgres Search and Term Proximity 31:49 Challenges in GitHub Search 33:59 BM25 in Large Scale Systems 40:00 Technical Deep Dive into BM25 45:30 Future of Search and Learning to Rank 47:18 Conclusion and Future Plans
Ever wondered why your vector search becomes painfully slow after scaling past a million vectors? You're not alone - even tech giants struggle with this.
Charles Xie, founder of Zilliz (company behind Milvus), shares how they solved vector database scaling challenges at 100B+ vector scale:
Key Insights:
Perfect for teams hitting scaling walls with their current vector search implementation or planning for future growth.
Worth watching if you're building production search systems or need to optimize costs vs performance.
Charles Xie:
Nicolay Gerold:
00:00 Introduction to Search System Challenges 00:26 Introducing Milvus: The Open Source Vector Database 00:58 Interview with Charles: Founder of Zilliz 02:20 Scalability and Performance in Vector Databases 03:35 Challenges in Distributed Systems 05:46 Data Consistency and Real-Time Search 12:12 Hierarchical Storage and GPU Acceleration 18:34 Emerging Technologies in Vector Search 23:21 Self-Learning Indexes and Future Innovations 28:44 Key Takeaways and Conclusion
Modern search systems face a complex balancing act between performance, relevancy, and cost, requiring careful architectural decisions at each layer.
While vector search generates buzz, hybrid approaches combining traditional text search with vector capabilities yield better results.
The architecture typically splits into three core components:
Critical but often overlooked aspects include query understanding depth, systematic relevancy testing (avoid anecdote-driven development), and data governance as search systems naturally evolve into organizational data hubs.
Performance optimization requires careful tradeoffs between index-time vs query-time computation, with even 1-2% improvements being significant in mature systems.
Success requires testing against production data (staging environments prove unreliable), implementing proper evaluation infrastructure (golden query sets, A/B testing, interleaving), and avoiding the local maxima trap where improving one query set unknowingly damages others.
The end goal is finding an acceptable balance between corpus size, latency requirements, and cost constraints while maintaining system manageability and relevance quality.
"It's quite easy to end up in local maxima, whereby you improve a query for one set and then you end up destroying it for another set."
"A good marker of a sophisticated system is one where you actually see it's getting worse... you might be discovering a maxima."
"There's no free lunch in all of this. Often it's a case that, to service billions of documents on a vector search, less than 10 millis, you can do those kinds of things. They're just incredibly expensive. It's really about trying to manage all of the overall system to find what is an acceptable balance."
Search Pioneers:
Stuart Cam:
Russ Cam:
Nicolay Gerold:
00:00 Introduction to Search Systems 00:13 Challenges in Search: Relevancy vs Latency 00:27 Insights from Industry Experts 01:00 Evolution of Search Technologies 03:16 Storage and Compute in Search Systems 06:22 Common Mistakes in Building Search Systems 09:10 Evaluating and Improving Search Systems 19:27 Architectural Components of Search Systems 29:17 Understanding Search Query Expectations 29:39 Balancing Speed, Cost, and Corpus Size 32:03 Trade-offs in Search System Design 32:53 Indexing vs Querying: Key Considerations 35:28 Re-ranking and Personalization Challenges 38:11 Evaluating Search System Performance 44:51 Overrated vs Underrated Search Techniques 48:31 Final Thoughts and Contact Information
Today we are talking to Michael Günther, a senior machine learning scientist at Jina AI, about his work on Jina CLIP.
Some key points:
Types of Text-Image Models
Training Insights from Jina CLIP
Practical Considerations
Future Directions
Practical Applications
Key Takeaways for Engineers
Michael Guenther
Nicolay Gerold:
00:00 Introduction to Uni-modal and Multimodal Embeddings 00:16 Exploring Multimodal Embeddings and Their Applications 01:06 Training Multimodal Embedding Models 02:21 Challenges and Solutions in Embedding Models 07:29 Advanced Techniques and Future Directions 29:19 Understanding Model Interference in Search Specialization 30:17 Fine-Tuning Jina CLIP for E-Commerce 32:18 Synthetic Data Generation and Pseudo-Labeling 33:36 Challenges and Learnings in Embedding Models 40:52 Future Directions and Takeaways
Imagine a world where data bottlenecks, slow data loaders, or memory issues on the VM don't hold back machine learning.
Machine learning and AI success depends on the speed you can iterate. LanceDB is here to enable fast experiments on top of terabytes of unstructured data. It is the database for AI. Dive with us into how LanceDB was built, what went into the decision to use Rust as the main implementation language, the potential of AI on top of LanceDB, and more.
"LanceDB is the database for AI...to manage their data, to do a performant billion scale vector search."
"We're big believers in the composable data systems vision."
"You can insert data into LanceDB using pandas data frames...to sort of really large 'embed the internet' kind of workflows."
"We wanted to create a new generation of data infrastructure that makes their [AI engineers] lives a lot easier."
"LanceDB offers up to 1,000 times faster performance than Parquet."
Chang She:
LanceDB:
Nicolay Gerold:
00:00 Introduction to Multimodal Embeddings
00:26 Challenges in Storage and Serving
02:51 LanceDB: The Solution for Multimodal Data
04:25 Interview with Chang She: Origins and Vision
10:37 Technical Deep Dive: LanceDB and Rust
18:11 Innovations in Data Storage Formats
19:00 Optimizing Performance in Lakehouse Ecosystems
21:22 Future Use Cases for LanceDB
26:04 Building Effective Recommendation Systems
32:10 Exciting Applications and Future Directions
Today’s guest is Mór Kapronczay. Mór is the Head of ML at Superlinked. Superlinked is a compute framework for your information retrieval and feature engineering systems, where they turn anything into embeddings.
When most people think about embeddings, they think about ada and OpenAI.
You just take your text and throw it in there.
But that’s too crude.
OpenAI embeddings are trained on the internet.
But your data set (most likely) is not the internet.
You have different nuances.
And you have more than just text.
So why not use it?
Some highlights:
➡️ Pouring everything into a text embedding model won't yield magical results ➡️ Language is lossy - it's a poor compression method for complex information
➡️ Direct number embeddings don't work well for vector search ➡️ Consider projecting number ranges onto a quarter circle ➡️ Apply logarithmic transforms for skewed distributions
➡️ Create separate vector parts for different data aspects ➡️ Normalize individual parts ➡️ Weight vector parts based on importance
A multi-vector approach helps you understand the contribution of each modality or embedding. It also makes the retrieval system easier to tune: instead of fine-tuning your embedding models, you tune the weights in your vector database, much as you would tune a search database like Elastic.
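The highlights above can be sketched in a few lines: project a number onto a quarter circle, optionally log-transform skewed values first, then normalize each vector part and weight it before concatenation. This is an illustrative sketch under those assumptions, not Superlinked's implementation, and all function names are hypothetical.

```python
import math


def embed_number(value, low, high, log_scale=False):
    """Map a number in [low, high] to a point on the unit quarter circle."""
    if log_scale:
        # log1p tames skewed distributions; assumes non-negative inputs.
        value, low, high = (math.log1p(v) for v in (value, low, high))
    fraction = min(max((value - low) / (high - low), 0.0), 1.0)
    angle = fraction * math.pi / 2
    return [math.cos(angle), math.sin(angle)]


def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def combine(parts, weights):
    """Normalize each named part, scale it by its weight, and concatenate."""
    combined = []
    for name, vector in parts.items():
        combined.extend(weights[name] * x for x in normalize(vector))
    return combined


cheap = embed_number(0, 0, 100)
mid_price = embed_number(50, 0, 100)
expensive = embed_number(100, 0, 100)

item_vector = combine(
    {"text": [3.0, 4.0], "price": cheap},   # made-up text embedding
    {"text": 1.0, "price": 0.5},            # price counts half as much
)
```

The quarter circle keeps every number at unit length, and the dot product of two such vectors is the cosine of the angle between them, so closer numbers score higher. Changing a weight changes how much that part contributes to the overall similarity without retraining any model.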
Mór Kapronczay
Nicolay Gerold:
00:00 Introduction to Embeddings 00:30 Beyond Text: Expanding Embedding Capabilities 02:09 Challenges and Innovations in Embedding Techniques 03:49 Unified Representations and Vector Computers 05:54 Embedding Complex Data Types 07:21 Recommender Systems and Interaction Data 08:59 Combining and Weighing Embeddings 14:58 Handling Numerical and Categorical Data 20:35 Optimizing Embedding Efficiency 22:46 Dynamic Weighting and Evaluation 24:35 Exploring AB Testing with Embeddings 25:08 Joint vs Separate Embedding Spaces 27:30 Understanding Embedding Dimensions 29:59 Libraries and Frameworks for Embeddings 32:08 Challenges in Embedding Models 33:03 Vector Database Connectors 34:09 Balancing Production and Updates 36:50 Future of Vector Search and Modalities 39:36 Building with Embeddings: Tips and Tricks 42:26 Concluding Thoughts and Next Steps
Today we have Jessica Talisman with us, who is working as an Information Architect at Adobe. She is (in my opinion) the expert on taxonomies and ontologies.
That’s what you will learn today in this episode of How AI Is Built. Taxonomies, ontologies, knowledge graphs.
Everyone is talking about them, but no one knows how to build them.
But before we look into that, what are they good for in search?
Imagine a large corpus of academic papers. When a user searches for "machine learning in healthcare", the system can:
So we are building the plumbing, the necessary infrastructure for tagging, categorization, query expansion and relaxation, and filtering.
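As a small illustration of what this plumbing enables, the sketch below stores each concept with a primary label, a definition, alternative labels, and narrower concepts, and uses that structure to expand a query. The taxonomy entry is made up for the example and the field names are assumptions loosely modeled on common taxonomy vocabularies.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    label: str                                       # unique primary label
    definition: str                                  # removes ambiguity
    alt_labels: list = field(default_factory=list)   # synonyms, abbreviations
    narrower: list = field(default_factory=list)     # more specific concepts


def expand_query(query, taxonomy):
    """Return the query plus synonyms and narrower terms, if it is a known concept."""
    concept = taxonomy.get(query.lower())
    if concept is None:
        return [query]
    expanded = []
    for term in [concept.label, *concept.alt_labels, *concept.narrower]:
        if term not in expanded:
            expanded.append(term)
    return expanded


# Hypothetical taxonomy with a single concept.
taxonomy = {
    "machine learning": Concept(
        label="machine learning",
        definition="Methods that learn patterns from data.",
        alt_labels=["ML"],
        narrower=["deep learning", "reinforcement learning"],
    )
}
expanded_terms = expand_query("Machine Learning", taxonomy)
unknown_terms = expand_query("quantum widgets", taxonomy)
```

Query relaxation works the same way in the other direction: store broader concepts on each node and fall back to them when a specific query returns too few results.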
So how can we build them?
1️⃣ Start with Industry Standards • Leverage established taxonomies (e.g., Google, GS1, IAB) • Audit them for relevance to your project • Use as a foundation, not a final solution
2️⃣ Customize and Fill Gaps • Adapt industry taxonomies to your specific domain • Create a "coverage model" for your unique needs • Mine internal docs to identify domain-specific concepts
3️⃣ Follow Ontology Best Practices • Use clear, unique primary labels for each concept • Include definitions to avoid ambiguity • Provide context for each taxonomy node
Jessica Talisman:
Nicolay Gerold:
00:00 Introduction to Taxonomies and Knowledge Graphs 02:03 Building the Foundation: Metadata to Knowledge Graphs 04:35 Industry Taxonomies and Coverage Models 06:32 Clustering and Labeling Techniques 11:00 Evaluating and Maintaining Taxonomies 31:41 Exploring Taxonomy Granularity 32:18 Differentiating Taxonomies for Experts and Users 33:35 Mapping and Equivalency in Taxonomies 34:02 Best Practices and Examples of Taxonomies 40:50 Building Multilingual Taxonomies 44:33 Creative Applications of Taxonomies 48:54 Overrated and Underappreciated Technologies 53:00 The Importance of Human Involvement in AI 53:57 Connecting with the Speaker 55:05 Final Thoughts and Takeaways
ColPali makes us rethink how we approach document processing.
ColPali revolutionizes visual document search by combining late interaction scoring with visual language models. This approach eliminates the need for extensive text extraction and preprocessing, handling messy real-world data more effectively than traditional methods.
In this episode, Jo Bergum, chief scientist at Vespa, shares his insights on how ColPali is changing the way we approach complex document formats like PDFs and HTML pages.
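Late interaction scoring is easy to state in code. For every query vector, take its best match among the document vectors, then sum those maxima (often called MaxSim). In ColPali the query vectors would be token embeddings and the document vectors image patch embeddings; the tiny hand-written vectors below are placeholders for illustration only.

```python
def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vectors, doc_vectors):
    """Sum, over query vectors, of the best dot product with any document vector."""
    return sum(
        max(dot(q, d) for d in doc_vectors)
        for q in query_vectors
    )


query = [[1.0, 0.0], [0.0, 1.0]]      # two query token vectors
page_a = [[1.0, 0.0], [0.0, 1.0]]     # has a strong match for each token
page_b = [[0.5, 0.5]]                 # one blurry vector for everything

score_a = maxsim(query, page_a)
score_b = maxsim(query, page_b)
```

Because each query token finds its own best match, a page with precise matches for every token outscores a page that only matches each token weakly.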
Introduction to ColPali:
Advantages of ColPali:
Jo Bergum:
Nicolay Gerold:
00:00 Messy Data in AI 01:19 Challenges in Search Systems 03:41 Understanding Representational Approaches 08:18 Dense vs Sparse Representations 19:49 Advanced Retrieval Models and ColPali 30:59 Exploring Image-Based AI Progress 32:25 Challenges and Innovations in OCR 33:45 Understanding ColPali and MaxSim 38:13 Scaling and Practical Applications of ColPali 44:01 Future Directions and Use Cases
Today, we're talking to Aamir Shakir, the founder and baker at mixedbread.ai, where he's building some of the best embedding and re-ranking models out there. We go into the world of rerankers, looking at how they can classify, deduplicate documents, prioritize LLM outputs, and delve into models like ColBERT.
We discuss:
Still not sure whether to listen? Here are some teasers:
Aamir Shakir:
Nicolay Gerold:
00:00 Introduction and Overview 00:25 Understanding Rerankers 01:46 Maxsim and Token-Level Embeddings 02:40 Setting Thresholds and Similarity 03:19 Guest Introduction: Aamir Shakir 03:50 Training and Using Rerankers (Episode Start) 04:50 Challenges and Solutions in Reranking 08:03 Future of Retrieval and Recommendation 26:05 Multimodal Retrieval and Reranking 38:04 Conclusion and Takeaways
Text embeddings have limitations when it comes to handling long documents and out-of-domain data.
Today, we are talking to Nils Reimers. He is one of the researchers who kickstarted the field of dense embeddings, developed sentence transformers, started HuggingFace’s Neural Search team and now leads the development of search foundational models at Cohere. Tbh, he has too many accolades to count off here.
We talk about the main limitations of embeddings:
Are you still not sure whether to listen? Here are some teasers:
Nils Reimers:
Nicolay Gerold:
text embeddings, limitations, long documents, interpretation, fine-tuning, re-ranking, future research
00:00 Introduction and Guest Introduction 00:43 Early Work with BERT and Argument Mining 02:24 Evolution and Innovations in Embeddings 03:39 Contrastive Learning and Hard Negatives 05:17 Training and Fine-Tuning Embedding Models 12:48 Challenges and Limitations of Embeddings 18:16 Adapting Embeddings to New Domains 22:41 Handling Long Documents and Re-Ranking 31:08 Combining Embeddings with Traditional ML 45:16 Conclusion and Upcoming Episodes
Hey! Welcome back.
Today we look at how we can get our RAG system ready for scale.
We discuss the common problems that appear when you introduce more users and more requests to your system, and how to solve them.
For this we are joined by Nirant Kasliwal, the author of fastembed.
Nirant shares practical insights on metadata extraction, evaluation strategies, and emerging technologies like ColPali. This episode is a must-listen for anyone looking to level up their RAG implementations.
"Naive RAG has a lot of problems on the retrieval end and then there's a lot of problems on how LLMs look at these data points as well."
"The first 30 to 50% of gains are relatively quick. The rest 50% takes forever."
"You do not want to give the same answer about company's history to the co-founding CEO and the intern who has just joined."
"Embedding similarity is the signal on which you want to build your entire search is just not quite complete."
Key insights:
Nirant Kasliwal:
Nicolay Gerold:
query understanding, AI-powered search, Lambda Mart, e-commerce ranking, networking, experts, recommendation, search
In this episode of How AI is Built, Nicolay Gerold interviews Doug Turnbull, a search engineer at Reddit and author of “Relevant Search”. They discuss how methods and technologies, including large language models (LLMs) and semantic search, contribute to relevant search results.
Key Highlights:
Key Quotes:
"There's not like a perfect measure or definition of what a relevant search result is for a given application. There are a lot of really good proxies, and a lot of really good like things, but you can't just like blindly follow the one objective, if you want to build a good search product." - Doug Turnbull
"I think 10 years ago, what people would do is they would just put everything in Solr, Elasticsearch or whatever, and they would make the query to Elasticsearch pretty complicated to rank what they wanted... What I see people doing more and more these days is that they'll use each retrieval source as like an independent piece of infrastructure." - Doug Turnbull on the evolution of search architecture
"Honestly, I feel like that's a very practical and underappreciated thing. People talk about RAG and I talk, I call this GAR - generative AI augmented retrieval, so you're making search smarter with generative AI." - Doug Turnbull on using LLMs to enhance search
"LambdaMART and gradient boosted decision trees are really powerful, especially for when you're expressing your re-ranking as some kind of structured learning problem... I feel like we'll see that and like you're seeing papers now where people are like finding new ways of making BM25 better." - Doug Turnbull on underappreciated techniques
Doug Turnbull
Nicolay Gerold:
Chapters
00:00 Introduction and Guest Introduction 00:52 Understanding Relevant Search Results 01:18 Search Behavior on Social Media 02:14 Challenges in Defining Relevance 05:12 Query Understanding and Ranking Signals 10:57 Evolution of Search Technologies 15:15 Combining Search Techniques 21:49 Leveraging LLMs and Embeddings 25:49 Operational Considerations in Search Systems 39:09 Concluding Thoughts and Future Directions
In this episode, we talk data-driven search optimizations with Charlie Hull.
Charlie is a search expert from Open Source Connections. He has built Flax, one of the leading open source search companies in the UK, has written “Searching the Enterprise”, and is one of the main voices on data-driven search.
We discuss strategies to improve search systems quantitatively and much more.
Key Points:
Resources mentioned:
Charlie Hull:
Nicolay Gerold:
search results, search systems, assessing, evaluation, improvement, data quality, user behavior, proactive, test dataset, search engine optimization, SEO, search quality, metadata, query classification, user intent, metrics, business objectives, user objectives, experimentation, continuous improvement, data modeling, embeddings, machine learning, information retrieval
00:00 Introduction
01:35 Challenges in Measuring Search Relevance
02:19 Common Mistakes in Search System Assessment
03:22 Methods to Measure Search System Performance
04:28 Human Evaluation in Search Systems
05:18 Leveraging User Interaction Data
06:04 Implementing AI for Search Evaluation
09:14 Technical Components for Assessing Search Systems
12:07 Improving Search Quality Through Data Analysis
17:16 Proactive Search System Monitoring
24:26 Balancing Business and User Objectives in Search
25:08 Search Metrics and KPIs: A Contract Between Teams
26:56 The Role of Recency and Popularity in Search Algorithms
28:56 Experimentation: The Key to Optimizing Search
30:57 Offline Search Labs and A/B Testing
34:05 Simple Levers to Improve Search
37:38 Data Modeling and Its Importance in Search
43:29 Combining Keyword and Vector Search
44:24 Bridging the Gap Between Machine Learning and Information Retrieval
47:13 Closing Remarks and Contact Information
Welcome back to How AI Is Built.
We have got a very special episode to kick off season two.
Daniel Tunkelang is a search consultant currently working with Algolia. He is a leader in the field of information retrieval, recommender systems, and AI-powered search. He worked for Canva, Algolia, Cisco, Gartner, Handshake, to pick a few.
His core focus is query understanding.
**Query understanding is about focusing less on the results and more on the query.** The query of the user is the first-class citizen. It is about figuring out what the user wants and then finding, scoring, and ranking results based on it. So most of the work happens before you hit the database.
**Key Takeaways:**
- The "bag of documents" model for queries and "bag of queries" model for documents are useful approaches for representing queries and documents in search systems.
- Query specificity is an important factor in query understanding. It can be measured using cosine similarity between query vectors and document vectors.
- Query classification into broad categories (e.g., product taxonomy) is a high-leverage technique for improving search relevance and can act as a guardrail for query expansion and relaxation.
- Large Language Models (LLMs) can be useful for search, but simpler techniques like query similarity using embeddings can often solve many problems without the complexity and cost of full LLM implementations.
- Offline processing to enhance document representations (e.g., filling in missing metadata, inferring categories) can significantly improve search quality.
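The "bag of documents" and query-specificity takeaways above can be illustrated with a minimal sketch. It assumes one possible reading of the idea: a query is represented by the centroid of the vectors of documents associated with it, and specificity is the mean cosine similarity between that centroid and those documents. The function names and the toy three-dimensional vectors are hypothetical; a real system would use embeddings from a model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def query_specificity(doc_vectors):
    """Mean cosine similarity between the query's bag-of-documents
    centroid and each document in the bag (assumed definition)."""
    query_vector = centroid(doc_vectors)
    sims = [cosine(query_vector, d) for d in doc_vectors]
    return sum(sims) / len(sims)

# Toy stand-ins for embeddings (hypothetical values).
specific_docs = [[0.90, 0.10, 0.00], [0.88, 0.12, 0.00], [0.92, 0.08, 0.01]]
broad_docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

specific_score = query_specificity(specific_docs)  # documents agree: close to 1
broad_score = query_specificity(broad_docs)        # documents scatter: much lower
```

Under this definition, a query whose documents cluster tightly scores close to 1, while a broad query whose documents point in different directions scores lower, which is one way to decide when query expansion or relaxation is safe.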
**Daniel Tunkelang**
- [LinkedIn](https://www.linkedin.com/in/dtunkelang/)
- [Medium](https://queryunderstanding.com/)
**Nicolay Gerold:**
- [LinkedIn](https://www.linkedin.com/in/nicolay-gerold/)
- [X (Twitter)](https://twitter.com/nicolaygerold)
- [Substack](https://nicolaygerold.substack.com/)
Query understanding, search relevance, bag of documents, bag of queries, query specificity, query classification, named entity recognition, pre-retrieval processing, caching, large language models (LLMs), embeddings, offline processing, metadata enhancement, FastText, MiniLM, sentence transformers, visualization, precision, recall
[00:00:00] 1. Introduction to Query Understanding
[00:05:30] 2. Query Representation Models
[00:12:00] 3. Query Specificity and Classification
[00:19:30] 4. Named Entity Recognition in Query Understanding
[00:24:00] 5. Pre-Retrieval Query Processing
[00:28:30] 6. Performance Optimization Techniques
[00:33:00] 7. Advanced Techniques: Embeddings and Language Models
[00:39:00] 8. Practical Implementation Strategies
[00:44:00] 9. Visualization and Analysis of Query Spaces
[00:47:00] 10. Future Directions and Closing Thoughts - Emerging trends in query understanding - Key takeaways for search system engineers
[00:53:00] End of Episode
Today we are launching season 2 of How AI Is Built.
Over the last few weeks, we spoke to a lot of regular listeners and past guests, collected feedback, and analyzed our episode data. We will be applying the learnings to season 2.
This season will be all about search.
We are trying to make it better, more actionable, and more in-depth. The goal is that at the end of this season, you have a fully fledged course on search in podcast form, with mini-courses on specific elements like RAG.
We will be talking to experts from information retrieval, information architecture, recommendation systems, and RAG; from academia and industry. Fields that do not really talk to each other.
We will try to unify and transfer the knowledge and give you a full tour of search, so you can build your next search application or feature with confidence.
We will be talking to Charlie Hull on how to systematically improve search systems, with Nils Reimers on the fundamental flaws of embeddings and how to fix them, with Daniel Tunkelang on how to actually understand the queries of the user, and many more.
We will try to bridge the gaps. How to use decades of research and practice in iteratively improving traditional search and apply it to RAG. How to take new methods from recommendation systems and vector databases and bring it into traditional search systems. How to use all of the different methods as search signals and combine them to deliver the results your user actually wants.
We will be using two types of episodes:
We will be starting with episodes next week, looking at the first, last, and overarching action in search: understanding user intent and understanding the queries with Daniel Tunkelang.
I am really excited to kick this off.
I would love to hear from you:
Yeah, let me know in the comments or just slide into my DMs on Twitter or LinkedIn.
I am looking forward to hearing from you guys.
I want to try to be more interactive. So anytime you encounter anything unclear or a question pops up in one of the episodes, give me a shout and I will try to answer it for you and for everyone.
Enough of me rambling. Let’s kick this off. I will see you next Thursday, when we start with query understanding.
Shoot me a message and stay up to date:
In this episode of "How AI is Built," host Nicolay Gerold interviews Jonathan Yarkoni, founder of Reach Latent. Jonathan shares his expertise in extracting value from unstructured data using AI, discussing challenging projects, the impact of ChatGPT, and the future of generative AI. From weather prediction to legal tech, Jonathan provides valuable insights into the practical applications of AI across various industries.
Key Takeaways
Key Quotes
"I think we're going to see another wave in 2024 and another one in 2025. And people are familiarized. That's kind of the wave of 2023. 2024 is probably still going to be a lot of internal use cases because it's a low risk environment and there was a lot of opportunity to be had."
"To really get to production reliably, we have to have these tools evolve further and get more standardized so people can still use the old ways of doing production with the new technology."
Jonathan Yarkoni
Nicolay Gerold:
Chapters
00:00 Introduction: Extracting Value from Unstructured Data
03:16 Flexible Tailoring Solutions to Client Needs
05:39 Monitoring and Retraining Models in the Evolving AI Landscape
09:15 Generative AI: Disrupting Industries and Unlocking New Possibilities
17:47 Balancing Immediate Results and Cutting-Edge Solutions in AI Development
28:29 Dream Tech Stack for Generative AI
unstructured data, textual data, automation, weather prediction, data cleaning, ChatGPT, AI disruption, legal, education, software engineering, marketing, biotech, immediate results, cutting-edge solutions, tech stack
This episode of "How AI Is Built" is all about data processing for AI. Abhishek Choudhary and Nicolay discuss Spark and alternatives to process data so it is AI-ready.
Spark is a distributed system that speeds up data processing by keeping data in memory. Its core abstraction is the RDD (Resilient Distributed Dataset), and the DataFrame API built on top of it simplifies data processing.
When should you use Spark to process your data for your AI Systems?
→ Use Spark when:
→ Consider alternatives when:
Spark isn't always necessary. Evaluate your specific needs and resources before committing to a Spark-based solution for AI data processing.
In today’s episode of How AI Is Built, Abhishek and I discuss data processing:
Abhishek Choudhary:
Nicolay Gerold:
In this episode, Nicolay talks with Rahul Parundekar, founder of AI Hero, about the current state and future of AI agents. Drawing from over a decade of experience working on agent technology at companies like Toyota, Rahul emphasizes the importance of focusing on realistic, bounded use cases rather than chasing full autonomy.
They dive into the key challenges, like effectively capturing expert workflows and decision processes, delivering seamless user experiences that integrate into existing routines, and managing costs through techniques like guardrails and optimized model choices. The conversation also explores potential new paradigms for agent interactions beyond just chat.
Key Takeaways:
Key Quotes:
Rahul Parundekar:
Nicolay Gerold:
00:00 Exploring the Potential of Autonomous Agents
02:23 Challenges of Accuracy and Repeatability in Agents
08:31 Capturing User Workflows and Improving Prompts
13:37 Tech Stack for Implementing Agents in the Enterprise
agent development, determinism, user experience, agent paradigms, private use, human-agent interaction, user workflows, agent deployment, human-in-the-loop, LLMs, declarative ways, scalability, AI Hero
In this conversation, Nicolay and Richmond Alake discuss various topics related to building AI agents and using MongoDB in the AI space. They cover the use of agents and multi-agents, the challenges of controlling agent behavior, and the importance of prompt compression.
When you are building agents, build them iteratively. Start with simple LLM calls before moving to multi-agent systems.
Main Takeaways:
Richmond Alake:
Nicolay Gerold:
00:00 Reducing the Scope of AI Agents
01:55 Seamless Data Ingestion
03:20 Challenges and Considerations in Implementing Multi-Agents
06:05 Memory Modeling for Robust Agents with MongoDB
15:05 Performance Optimization in AI Agents
18:19 RAG Setup
AI agents, multi-agents, prompt compression, MongoDB, data storage, data ingestion, performance optimization, tooling, generative AI
In this episode, Kirk Marple, CEO and founder of Graphlit, shares his expertise on building efficient data integrations.
Kirk breaks down his approach using relatable concepts:
Kirk Marple:
Nicolay Gerold:
Chapters
00:00 Building Integrations into Different Tools
00:44 The Two-Sided Funnel Model for Data Flow
04:07 Using Well-Defined Interfaces for Faster Integration
04:36 Managing Feeds and State with Actor Models
06:05 The Importance of Data Normalization
10:54 Tech Stack for Data Flow
11:52 Progression towards a Kappa Architecture
13:45 Reusability of Patterns for Faster Integration
data integration, data sources, data flow, two-sided funnel model, canonical format, stream of ingestible objects, competing consumer model, well-defined interfaces, actor model, data normalization, tech stack, Kappa architecture, reusability of patterns
In our latest episode, we sit down with Derek Tu, Founder and CEO of Carbon, a cutting-edge ETL tool designed specifically for large language models (LLMs).
Carbon is streamlining AI development by providing a platform for integrating unstructured data from various sources, enabling businesses to build innovative AI applications more efficiently while addressing data privacy and ethical concerns.
Derek Tu:
Nicolay Gerold:
Key Takeaways:
00:00 Introduction and Optimizing Embedding Models
03:00 The Evolution of Carbon and Focus on Unstructured Data
06:19 Customer Progression and Target Group
09:43 Interesting Use Cases and Handling Different Data Representations
13:30 Chunking Strategies and Normalization
20:14 Approach to Chunking and Choosing a Vector Database
23:06 Tech Stack and Recommended Tools
28:19 Future of Carbon: Multimodal Models and Building a Platform
Carbon, LLMs, RAG, chunking, data processing, global customer base, GDPR compliance, AI founders, AI agents, enterprises
In this episode, Nicolay sits down with Hugo Lu, founder and CEO of Orchestra, a modern data orchestration platform. As data pipelines and analytics workflows become increasingly complex, spanning multiple teams, tools and cloud services, the need for unified orchestration and visibility has never been greater.
Orchestra is a serverless data orchestration tool that aims to provide a unified control plane for managing data pipelines, infrastructure, and analytics across an organization's modern data stack.
The core architecture involves users building pipelines as code, which then run on Orchestra's serverless infrastructure. It can orchestrate tasks like data ingestion, transformation, and AI calls, as well as monitoring and analytics on data products. All of this comes with end-to-end visibility, data lineage, and governance, even when an organization's data architecture is scattered and modular across teams and tools.
Key Quotes:
Hugo Lu:
Nicolay Gerold:
00:00 Introduction to Orchestra and its Focus on Data Products
08:03 Unified Control Plane for Data Stack and End-to-End Control
14:42 Use Cases and Unique Applications of Orchestra
19:31 Retaining Existing Dev Workflows and Best Practices in Orchestra
22:23 Event-Driven Architectures and Monitoring in Orchestra
23:49 Putting Data Products First and Monitoring Health and Usage
25:40 The Future of Data Orchestration: Stream-Based and Cost-Effective
data orchestration, Orchestra, serverless architecture, versatility, use cases, maturity levels, challenges, AI workloads
Ever wondered how AI systems handle images and videos, or how they make lightning-fast recommendations? Tune in as Nicolay chats with Zain Hasan, an expert in vector databases from Weaviate. They break down complex topics like quantization, multi-vector search, and the potential of multimodal search, making them accessible for all listeners. Zain even shares a sneak peek into the future, where vector databases might connect our brains with computers!
Zain Hasan:
Nicolay Gerold:
Key Insights:
Key Quotes:
Chapters
00:00 - 01:24 Introduction
01:24 - 03:48 Underappreciated aspects of vector databases
03:48 - 06:06 Quantization trade-offs and techniques
06:06 - 08:24 Binary quantization
08:24 - 10:44 Product quantization and other techniques
10:44 - 13:08 Quantization as a "superpower" to reduce costs
13:08 - 15:34 Comparing quantization approaches
15:34 - 17:51 Placing vector databases in the database landscape
17:51 - 20:12 Pruning unused vectors and nodes
20:12 - 22:37 Improving precision beyond similarity thresholds
22:37 - 25:03 Multi-vector search
25:03 - 27:11 Impact of vector databases on data interaction
27:11 - 29:35 Interesting and weird use cases
29:35 - 32:00 Future of multimodal search and recommendations
32:00 - 34:22 Extending recommendations to user data
34:22 - 36:39 What's next for Weaviate
36:39 - 38:57 Exciting technologies beyond vector databases and LLMs
vector databases, quantization, hybrid search, multi-vector support, representation learning, cost reduction, memory optimization, multimodal recommender systems, brain-computer interfaces, weather prediction models, AI applications
In this episode of "How AI is Built", data architect Anjan Banerjee provides an in-depth look at the world of data architecture and building complex AI and data systems. Anjan breaks down the basics using simple analogies, explaining how data architecture involves sorting, cleaning, and painting a picture with data, much like organizing Lego bricks to build a structure.
Summary by Section
Introduction
Sources and Tools
Airflow and Orchestration
AI and Data Processing
Data Lakes and Storage
Data Quality and Standardization
Hot Takes and Wishes
Anjan Banerjee:
Nicolay Gerold:
00:00 Understanding Data Architecture
12:36 Choosing the Right Tools
20:36 The Benefits of Serverless Functions
21:34 Integrating AI in Data Acquisition
24:31 The Trend Towards Single Node Engines
26:51 Choosing the Right Database Management System and Storage
29:45 Adding Additional Storage Components
32:35 Reducing Human Errors for Better Data Quality
39:07 Overhyped and Underutilized Tools
Data architecture, AI, data systems, data sources, data extraction, data storage, multi-modal storage engines, data orchestration, Airflow, edge computing, batch processing, data lakes, Delta Lake, Iceberg, data quality, standardization, poka-yoke, compliance, entity resolution
Jorrit Sandbrink, a data engineer specializing in open table formats, discusses the advantages of decoupling storage and compute, the importance of choosing the right table format, and strategies for optimizing your data pipelines. This episode is full of practical advice for anyone looking to build a high-performance data analytics platform.
Key Takeaways:
Sound Bites
"The Lake house is sort of a modular setup where you decouple the storage and the compute."
"A lake house is an architecture, an architecture for data analytics platforms."
"The most popular table formats for a lake house are Delta, Iceberg, and Apache Hudi."
Jorrit Sandbrink:
Nicolay Gerold:
Chapters
00:00 Introduction to the Lake House Architecture
03:59 Choosing Storage and Table Formats
06:19 Comparing Compute Engines
21:37 Simplifying Data Ingress
25:01 Building a Preferred Data Stack
lake house, data analytics, architecture, storage, table format, query execution engine, document store, DuckDB, Polars, orchestration, Airflow, Dagster, DLT, data ingress, data processing, data storage
Kirk Marple, CEO and founder of Graphlit, discusses the evolution of his company from a data cataloging tool to a platform designed for ETL (Extract, Transform, Load) and knowledge retrieval for Large Language Models (LLMs). Graphlit empowers users to build custom applications on top of its API that go beyond naive RAG.
Key Points:
Notable Quotes:
Kirk Marple:
Nicolay Gerold:
Chapters
00:00 Graphlit’s Hybrid Approach 02:23 Use Cases and Transition to Graphlit 04:19 Knowledge Graphs as a Filtering Mechanism 13:23 Using Gremlin for Querying the Graph 32:36 XML in Prompts for Better Segmentation 35:04 The Future of LLMs and Graphlit 36:25 Getting Started with Graphlit
Graphlit, knowledge graphs, AI, document store, graph database, search index, co-pilot, entity extraction, Azure Cognitive Services, XML, event-driven architecture, serverless architecture, graph RAG, developer portal
From Problem to Requirements to Architecture.
In this episode, Nicolay Gerold and Jon Erik Kemi Warghed discuss the landscape of data engineering, sharing insights on selecting the right tools, implementing effective data governance, and leveraging powerful concepts like software-defined assets. They discuss the challenges of keeping up with the ever-evolving tech landscape and offer practical advice for building sustainable data platforms. Tune in to discover how to simplify complex data pipelines, unlock the power of orchestration tools, and ultimately create more value from your data.
Key Takeaways:
Jon Erik Kemi Warghed:
Nicolay Gerold:
Chapters
00:00 The Problem with the Modern Data Stack: Too many tools and buzzwords
00:57 How to Choose the Right Tools: Considerations for startups and large companies
03:13 Evaluating Open Source Tools: Background checks and due diligence
07:52 Defining Data Governance: Transparency and understanding of data
10:15 The Importance of Data Governance: Challenges and solutions
12:21 Data Governance Tools: dbt and Dagster
17:05 The Impact of Dagster: Software-defined assets and declarative thinking
19:31 The Power of Software Defined Assets: How Dagster differs from Airflow and Mage
21:52 State Management and Orchestration in Dagster: Real-time updates and dependency management
26:24 Why Use Orchestration Tools?: The role of orchestration in complex data pipelines
28:47 The Importance of Tool Selection: Thinking about long-term sustainability
31:10 When to Adopt Orchestration: Identifying the need for orchestration tools
In this episode, Nicolay Gerold interviews John Wessel, the founder of Agreeable Data, about data orchestration. They discuss the evolution of data orchestration tools, the popularity of Apache Airflow, the crowded market of orchestration tools, and the key problem that orchestrators solve. They also explore the components of a data orchestrator, the role of AI in data orchestration, and how to choose the right orchestrator for a project. They touch on the challenges of managing orchestrators, the importance of monitoring and optimization, and the need for product people to be more involved in the orchestration space. They also discuss data residency considerations and the future of orchestration tools.
Sound Bites
"The modern era, definitely airflow. Took the market share, a lot of people running it themselves."
"It's like people are launching new orchestrators every day. This is a funny one. This was like two weeks ago, somebody launched an orchestrator that was like a meta-orchestrator."
"The DAG introduced two other components. It's directed acyclic graph is what DAG means, but direct is like there's a start and there's a finish and the acyclic is there's no loops."
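The DAG description in the last quote (a start, a finish, and no loops) can be made concrete with a small sketch using Python's standard-library `graphlib`. The task names and the `pipeline` mapping are hypothetical and do not correspond to any particular orchestrator's API; this only shows how an orchestrator could derive a run order from dependencies and reject a graph that contains a loop.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

def execution_order(dag):
    """Return tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())

def has_cycle(dag):
    """True if the graph contains a loop and is therefore not a DAG."""
    try:
        execution_order(dag)
    except CycleError:
        return True
    return False

order = execution_order(pipeline)

# Making "extract" depend on "report" closes a loop, so no valid order exists.
looped = dict(pipeline, extract={"report"})
```

Because the graph is directed, every task knows what must run before it; because it is acyclic, a valid order always exists, which is what lets an orchestrator schedule, retry, and resume runs deterministically.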
Key Topics
John Wessel:
Nicolay Gerold:
Data orchestration, data movement, Apache Airflow, orchestrator selection, DAG, AI in orchestration, serverless, Kubernetes, infrastructure as code, monitoring, optimization, data residency, product involvement, generative AI.
Chapters
00:00 Introduction and Overview
00:34 The Evolution of Data Orchestration Tools
04:54 Components and Flow of Data in Orchestrators
08:24 Deployment Options: Serverless vs. Kubernetes
11:14 Considerations for Data Residency and Security
13:02 The Need for a Clear Winner in the Orchestration Space
20:47 Optimization Techniques for Memory and Time-Limited Issues
23:09 Integrating Orchestrators with Infrastructure-as-Code
24:33 Bridging the Gap Between Data and Engineering Practices
27:22 Exciting Technologies Outside of Data Orchestration
30:09 The Feature of Dagster
In this episode of "How AI is Built", we learn how to build and evaluate real-world language model applications with Shahul and Jithin, creators of Ragas. Ragas is a powerful open-source library that helps developers test, evaluate, and fine-tune Retrieval Augmented Generation (RAG) applications, streamlining their path to production readiness.
Main Insights
Practical Takeaways
Interesting Quotes
Ragas:
Jithin James:
Shahul ES:
Nicolay Gerold:
00:00 Introduction
02:03 Introduction to Open Assistant project
04:05 Creating Customizable and Fine-Tunable Models
06:07 Ragas and the LLM Use Case
08:09 Introduction to Language Model Metrics (LLMs)
11:12 Reducing the Cost of Data Generation
13:19 Evaluation of Components at Melvess
15:40 Combining Ragas Metrics with AutoML Providers
20:08 Improving Performance with Fine-tuning and Reranking
22:56 End-to-End Metrics and Component-Specific Metrics
25:14 The Importance of Deep Knowledge and Understanding
25:53 Robustness vs Optimization
26:32 Challenges of Evaluating Models
27:18 Creating a Dream Tech Stack
27:47 The Future Roadmap for Ragas
28:02 Doubling Down on Grid Data Generation
28:12 Open-Source Models and Expanded Support
28:20 More Metrics for Different Applications
RAG, Ragas, LLM, Evaluation, Synthetic Data, Open-Source, Language Model Applications, Testing.
In this episode of Changelog, Weston Pace dives into the latest updates to LanceDB, an open-source vector database and file format. Lance's new V2 file format redefines the traditional notion of columnar storage, allowing for more efficient handling of large multimodal datasets like images and embeddings. Weston discusses the goals driving LanceDB's development, including null value support, multimodal data handling, and finding an optimal balance for search performance.
Sound Bites
"A little bit more power to actually just try."
"We're becoming a little bit more feature complete with returns of arrow."
"Weird data representations that are actually really optimized for your use case."
Key Points
Conversation Highlights
LanceDB:
Weston Pace:
Nicolay Gerold:
Chapters
00:00 Introducing Lance: A New File Format
06:46 Enabling Custom Encodings in Lance
11:51 Exploring the Relationship Between Lance and Arrow
20:04 New Chapter
Lance file format, nulls, round-tripping data, optimized data representations, full-text search, encodings, downsides, multimodal data, compression, point lookups, full scan performance, non-contiguous columns, custom encodings
Had a fantastic conversation with Christopher Williams, Solutions Architect at Supabase, about setting up Postgres the right way for AI. We dug deep into Supabase, exploring:
If you've ever wanted a simpler way to integrate AI directly into your database, SuperDuperDB might be the answer. SuperDuperDB lets you easily apply AI processes to your data while keeping everything up-to-date with real-time calculations. It works with various databases and aims to make AI development less of a headache.
In this podcast, we explore:
Takeaways
Duncan Blythe:
SuperDuperDB:
Nicolay Gerold:
Chapters
00:00 Introduction to SuperDuperDB
04:19 Real-time Computation and Data Deployment
13:46 Bringing Compute and Database Closer Together
29:30 Declarative Machine Learning with SuperDuperDB
35:09 Future Plans for SuperDuperDB
SuperDuperDB, AI, databases, embeddings, classifications, data deployment, operational databases, analytical databases, AI development, data science
Supabase just acquired OrioleDB, a storage engine for PostgreSQL.
Oriole gets creative with MVCC! It uses an UNDO log rather than keeping multiple versions of an entire data row (tuple). This means when you update data, Oriole tracks the changes needed to "undo" the update if necessary. Think of this like the "undo" function in a text editor. Instead of keeping a full copy of the old text, it just remembers what changed. This can be much smaller. This also saves space by eliminating the need for a garbage collection process.
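To make the undo-log idea concrete, here is a toy in-memory sketch. It is not OrioleDB's implementation; the class name, method names, and transaction numbering are assumptions made for illustration. Rows are updated in place, only the overwritten column values are logged, and a reader with an older snapshot rebuilds its view by applying undo records in reverse.

```python
class UndoLogTable:
    """Toy table that updates rows in place and logs only what changed."""

    def __init__(self):
        self.rows = {}       # row_id -> current version of the row
        self.undo_log = []   # (txn_id, row_id, {column: old_value})
        self.next_txn = 1    # inserts are treated as happening at txn 0

    def insert(self, row_id, row):
        self.rows[row_id] = dict(row)

    def update(self, row_id, changes):
        """Overwrite existing columns in place; log only their old values."""
        txn_id = self.next_txn
        self.next_txn += 1
        current = self.rows[row_id]
        old_values = {column: current[column] for column in changes}
        self.undo_log.append((txn_id, row_id, old_values))
        current.update(changes)
        return txn_id

    def read_as_of(self, row_id, snapshot_txn):
        """Row as seen by a reader whose snapshot includes txns <= snapshot_txn."""
        row = dict(self.rows[row_id])
        # Walk the log newest-first and undo every change the reader must not see.
        for txn_id, logged_row, old_values in reversed(self.undo_log):
            if logged_row == row_id and txn_id > snapshot_txn:
                row.update(old_values)
        return row


table = UndoLogTable()
table.insert(1, {"name": "Ada", "bio": "x" * 1000, "visits": 0})
t1 = table.update(1, {"visits": 1})
t2 = table.update(1, {"visits": 2})

latest = table.rows[1]["visits"]                   # current value after both updates
at_insert = table.read_as_of(1, 0)["visits"]       # view from before any update
after_first = table.read_as_of(1, t1)["visits"]    # view that includes only t1
```

Note that the large `bio` column never appears in the undo log, because it was never changed. Keeping a full copy of the row per version would have duplicated it on every update.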
It also has a bunch of additional performance boosters like data compression, easy integration with data lakes, and index-organized tables.
Show notes:
Chris Gwilliams:
Nicolay Gerold:
00:42 Introduction to OrioleDB
04:38 The Undo Log Approach
08:39 Improving Performance for High Throughput Databases
11:08 My take on OrioleDB
OrioleDB, storage engine, Postgres, table access methods, undo log, high throughput databases, automated features, new use cases, S3, data migration
Today’s guest is Antonio Bustamante, a serial entrepreneur who previously built Kite and Silo and is now working to fix bad data. He is building bem, the data tool to transform any data into the schema your AI and software needs.
bem.ai is a data tool that focuses on transforming any data into the schema needed for AI and software. It acts as a system's interoperability layer, allowing systems that couldn't communicate before to exchange information. Learn what role LLMs play in data transformation, how to build reliable data infrastructure, and more.
"Surprisingly, the hardest was semi-structured data. That is data that should be structured, but is unreliable, undocumented, hard to work with."
"We were spending close to four or five million dollars a year just in integrations, which is no small budget for a company that size. So I was pretty much determined to fix this problem once and for all."
"bem focuses on being the system's interoperability layer."
"We basically take in anything you send us, we transform it exactly into your internal data schema so that you don't have to parse, process, transform anything of that sort."
"LLMs are a 30% of it... A lot of it is very, very like thorough validation layers, great infrastructure, just ensuring reliability and connection to our user systems.”
"You can use a million token context window and feed an entire document to an LLM. I can guarantee you if you don't, semantically chunk it out before you're not going to get the right results.”
"We're obsessed with time to value... Our milestone is basically five minute onboarding max, and then you're ready to go."
Antonio Bustamante
Nicolay Gerold:
Semi-structured data, Data integrations, Large language models (LLMs), Data transformation, Schema interoperability, Fault tolerance, Validation layers, System reliability, Schema evolution, Enterprise software, Data pipelines.
Chapters
00:00 The Problem of Integrations
05:58 Building Fault Tolerant Systems
13:51 Versioning and Semantic Validation
27:33 BEM in the Data Ecosystem
34:40 Future Plans and Onboarding
Imagine a world where data bottlenecks, slow data loaders, or memory issues on the VM don't hold back machine learning.
Machine learning and AI success depends on how fast you can iterate. LanceDB is here to enable fast experiments on top of terabytes of unstructured data. It is the database for AI. Dive with us into how LanceDB was built, what went into the decision to use Rust as the main implementation language, the potential of AI on top of LanceDB, and more.
"LanceDB is the database for AI...to manage their data, to do a performant billion scale vector search."
"We're big believers in the composable data systems vision."
"You can insert data into LanceDB using pandas data frames...to sort of really large 'embed the internet' kind of workflows."
"We wanted to create a new generation of data infrastructure that makes their [AI engineers] lives a lot easier."
"LanceDB offers up to 1,000 times faster performance than Parquet."
Chang She:
LanceDB:
Nicolay Gerold:
Chapters:
00:00 Introduction to LanceDB
02:16 Building LanceDB in Rust
12:10 Optimizing Data Infrastructure
26:20 Surprising Use Cases for LanceDB
32:01 The Future of LanceDB
LanceDB, AI, database, Rust, multimodal AI, data infrastructure, embeddings, images, performance, Parquet, machine learning, model database, function registries, agents.