Podd: How AI Is Built

#048 Why Your AI Agents Need Permission to Act, Not Just Read

6 maj 2025 | 57 min

Nicolay here,

most AI conversations obsess over capabilities. This one focuses on constraints - the right ones that make AI actually useful rather than just impressive demos.

Today I have the chance to talk to Dexter Horthy, who recently put out a long piece called the “12-factor agents”.

It’s like the 10 commandments, but for building agents.

One of it is “Contact human with tool calls”: the LLM can call humans for high-stakes decisions or “writes”.

The key insight is brutally simple. AI can get to 90% accuracy on most tasks - good enough for spam-like activities but disastrous for anything that requires trust. The solution isn't to wait for models to get smarter; it's to add a human approval layer for critical actions.

Imagine you are writing to a database or sending an email. Each “write” has to be approved by a human. So you post the email in a Slack channel and in most cases, your sales people will approve. In the 10%, it’s stopped in its tracks and the human can take over. You stop the slop and get good training data in the mean time.

Dexter’s company is building exactly this: an approval mechanism that lets AI agents send requests to humans before executing.

In the podcast, we also touch on a bunch of other things:

MCP and that they are (atm) just a thin client
Are we training LLMs toward mediocrity?
What infrastructure do we need for human in the loop (e.g. DBOS)?
and more

💡 Core Concepts

Context Engineering: Crafting the information representation for LLMs - selecting optimal data structures, metadata, and formats to ensure models receive precisely what they need to perform effectively.
Token Bloat Prevention: Ruthlessly eliminating irrelevant information from context windows to maintain agent focus during complex tasks, preventing the pattern of repeating failed approaches.
Human-in-the-loop Approval Flows: Achieving 99% reliability through a "90% AI + 10% human oversight" framework where agents analyze data and suggest actions but request explicit permission before execution.
Rubric Engineering: Systematically evaluating AI outputs through dimension-specific scoring criteria to provide precise feedback and identify exceptional results, helping escape the trap of models converging toward mediocrity.

📶 Connect with Dexter:

📶 Connect with Nicolay:

LinkedIn
X / Twitter
Bluesky
Website
My Agency Aisbach (for ai implementations / strategy)

⏱️ Important Moments

MCP Servers as Clients: [03:07] Dexter explains why what many call "MCP servers" actually function more like clients when examining the underlying code.
Authentication Challenges: [04:45] The discussion shifts to how authentication should be handled in MCP implementations and whether it belongs in the protocol.
Asynchronous Agent Execution: [08:18] Exploring how to handle agents that need to pause for human input without wasting tokens on continuous polling.
Token Bloat Prevention: [14:41] Strategies for keeping context windows focused and efficient, moving beyond standard chat formats.
Context Engineering: [29:06] The concept that everything in AI agent development ultimately comes down to effective context engineering.
Fine-tuning vs. RAG for Writing Style: [20:05] Contrasting personal writing style fine-tuning versus context window examples.
Generating Options vs. Deterministic Outputs: [19:44] The unexplored potential of having AI generate diverse creative options for human selection.
The "Mediocrity Convergence" Question: [37:11] The philosophical concern that popular LLMs may inevitably trend toward average quality.
Data Labeling Interfaces: [35:25] Discussion about the need for better, lower-friction interfaces to collect human feedback on AI outputs.
Human-in-the-loop Approval Flows: [42:46] The core approach of HumanLayer, allowing agents to ask permission before taking action.

🛠️ Tools & Tech Mentioned

📚 Recommended Resources

🔮 What's Next

Next week, we will continue going more into getting generative AI into production talking to Vibhav from BAML.

💬 Join The Conversation

Follow How AI Is Built on YouTube, Bluesky, or Spotify.

If you have any suggestions for future guests, feel free to leave it in the comments or write me (Nicolay) directly on LinkedIn, X, or Bluesky. Or at [email protected].

I will be opening a Discord soon to get you guys more involved in the episodes! Stay tuned for that.

♻️ I am trying to build the new platform for engineers to share their experience that they have earned after building and deploying stuff into production. I am trying to produce the best content possible - informative, actionable, and engaging. I'm asking for two things: hit subscribe now to show me what content you like (so I can do more of it), and if this episode helped you, pay it forward by sharing with one engineer who's facing similar challenges. That's the agreement - I deliver practical value, you help grow this resource for everyone. ♻️

#047 Architecting Information for Search, Humans, and Artificial Intelligence

27 mars 2025 | 57 min

Today on How AI Is Built, Nicolay Gerold sits down with Jorge Arango, an expert in information architecture. Jorge emphasizes that aligning systems with users' mental models is more important than optimizing backend logic alone. He shares a clear framework with four practical steps:

Key Points:

Information architecture should bridge user mental models with system data models
Information's purpose is to help people make better choices and act more skillfully
Well-designed systems create learnable (not just "intuitive") interfaces
Context and domain boundaries significantly impact user understanding
Progressive disclosure helps accommodate users with varying expertise levels

Chapters

00:00 Introduction to Backend Systems
00:36 Guest Introduction: Jorge Arango
01:12 Podcast Dynamics and Guest Experiences
01:53 Timeless Principles in Technology
02:08 Interesting Conversations and Learnings
04:04 Physical vs. Digital Organization
04:21 Smart Defaults and System Maintenance
07:20 Data Models and Conceptual Structures
08:53 Designing User-Centric Systems
10:20 Challenges in Information Systems
10:35 Understanding Information and Choices
15:49 Clarity and Context in Design
26:36 Progressive Disclosure and User Research
37:05 The Role of Large Language Models
54:59 Future Directions and New Series (MLOps)

Information Architecture Fundamentals

What Is Information?

Information helps people make better choices to act more skillfully
Example: "No dog pooping" signs help predict consequences of actions
Poor information systems fail to provide relevant guidance for users' needs

Mental Models vs. Data Models

Systems have underlying conceptual structures that should reflect user mental models
Data models make these conceptual models "normative" in the infrastructure
Designers serve as translators between user needs and technical implementation
Goal: Users should think "the person who designed this really gets me"

Design Strategies for Complex Systems

Progressive Disclosure

Present simple interfaces by default with clear paths to advanced functionality
Example: HyperCard - visual interface for beginners with programming layer for experts
Allows both novice and expert users to use the same system effectively

Context Setting and Domain Boundaries

All interactions happen within a context that influences understanding
Words acquire different meanings in different contexts (e.g., "save" in computing vs. banking)
Clearer domain boundaries make information architecture design easier
Hardest systems to design: those serving many purposes for diverse audiences

Conceptual Modeling (Underrated Practice)

Should precede UI sketching but often skipped by designers
Defines concepts needed in the system and their relationships
Creates more cohesive and coherent systems, especially for complex projects
More valuable than sitemaps, which imply rigid hierarchies

LLMs and Information Architecture

Current and Future Applications

Transforming search experiences (e.g., Perplexity providing answers vs. link lists)
Improving intent parsing in traditional search
Helping information architects with content analysis and navigation structure design
Enabling faster, better analysis of large content repositories

Implementation Advice

For Engineers and Designers

Designers should understand how systems are built (materials of construction)
Engineers benefit from understanding user perspectives and mental models
Both disciplines have much to teach each other

For Complex Applications

Map conceptual models before writing code
Test naming with real users
Implement progressive disclosure with good defaults
Remember: "If the user can't find it, it doesn't exist"

Notable Quotes:

"People only understand things relative to things they already understand." - Richard Saul Wurman

"The hardest systems to design are the ones that are meant to do a lot of things for a lot of different people." - Jorge Arango

"Very few things are intuitive. There's a long running joke in the industry that the only intuitive interface for humans is the nipple. Everything else is learned." - Jorge Arango

Jorge Arango

Nicolay Gerold:

#046 Building a Search Database From First Principles

13 mars 2025 | 53 min

Modern search is broken. There are too many pieces that are glued together.

Vector databases for semantic search
Text engines for keywords
Rerankers to fix the results
LLMs to understand queries
Metadata filters for precision

Each piece works well alone.

Together, they often become a mess.

When you glue these systems together, you create:

Data Consistency Gaps Your vector store knows about documents your text engine doesn't. Which is right?
Timing Mismatches New content appears in one system before another. Users see different results depending on which path their query takes.
Complexity Explosion Every new component doubles your integration points. Three components means three connections. Five means ten.
Performance Bottlenecks Each hop between systems adds latency. A 200ms search becomes 800ms after passing through four components.
Brittle Chains When one system fails, your entire search breaks. More pieces mean more breaking points.

I recently built a system where we had query specific post-filters but the requirement to deliver a fixed number of results to the user.

A lot of times, the query had to be run multiple times to achieve the desired amount.

So we had an unpredictable latency. A high load on the backend, where some queries hammered the database 10+ times. A relevance cliff, where results 1-6 look great, but the later ones were poor matches.

Today on How AI Is Built, we are talking to Marek Galovic from TopK.

We talk about how they built a new search database with modern components. "How would search work if we built it today?”

Cloud storage is cheap. Compute is fast. Memory is plentiful.

One system that handles vectors, text, and filters together - not three systems duct-taped into one.

One pass handles everything:

Vector search + Text search + Filters → Single sorted result

Built with hand-optimized Rust kernels for both x86 and ARM, the system scales to 100M documents with 200ms P99 latency.

The goal is to do search in 5 lines of code.

Marek Galovic:

Nicolay Gerold:

00:00 Introduction to TopK and Snowflake Comparison

00:35 Architectural Patterns and Custom Formats

01:30 Query Execution Engine Explained

02:56 Distributed Systems and Rust

04:12 Query Execution Process

06:56 Custom File Formats for Search

11:45 Handling Distributed Queries

16:28 Consistency Models and Use Cases

26:47 Exploring Database Versioning and Snapshots

27:27 Performance Benchmarks: Rust vs. C/C++

29:02 Scaling and Latency in Large Datasets

29:39 GPU Acceleration and Use Cases

31:04 Optimizing Search Relevance and Hybrid Search

34:39 Advanced Search Features and Custom Scoring

38:43 Future Directions and Research in AI

47:11 Takeaways for Building AI Applications

#045 RAG As Two Things - Prompt Engineering and Search

6 mars 2025 | 63 min

John Berryman moved from aerospace engineering to search, then to ML and LLMs. His path: Eventbrite search → GitHub code search → data science → GitHub Copilot. He was drawn to more math and ML throughout his career.

RAG Explained

"RAG is not a thing. RAG is two things." It breaks into:

Search - finding relevant information
Prompt engineering - presenting that information to the model

These should be treated as separate problems to optimize.

The Little Red Riding Hood Principle

When prompting LLMs, stay on the path of what models have seen in training. Use formats, structures, and patterns they recognize from their training data:

For code, use docstrings and proper formatting
For financial data, use SEC report structures
Use Markdown for better formatting

Models respond better to familiar structures.

Testing Prompts

Testing strategies:

Start with "vibe testing" - human evaluation of outputs
Develop systematic tests based on observed failure patterns
Use token probabilities to measure model confidence
For few-shot prompts, watch for diminishing returns as examples increase

Managing Token Limits

When designing prompts, divide content into:

Static elements (boilerplate, instructions)
Dynamic elements (user inputs, context)

Prioritize content by:

Must-have information
Nice-to-have information
Optional if space allows

Even with larger context windows, efficiency remains important for cost and latency.

Completion vs. Chat Models

Chat models are winning despite initial concerns about their constraints:

Completion models allow more flexibility in document format
Chat models are more reliable and aligned with common use cases
Most applications now use chat models, even for completion-like tasks

Applications: Workflows vs. Assistants

Two main LLM application patterns:

Assistants: Human-in-the-loop interactions where users guide and correct
Workflows: Decomposed tasks where LLMs handle well-defined steps with safeguards

Breaking Down Complex Problems

Two approaches:

Horizontal: Split into sequential steps with clear inputs/outputs
Vertical: Divide by case type, with specialized handling for each scenario

Example: For SOX compliance, break horizontally (understand control, find evidence, extract data, compile report) and vertically (different audit types).

On Agents

Agents exist on a spectrum from assistants to workflows, characterized by:

Having some autonomy to make decisions
Using tools to interact with the environment
Usually requiring human oversight

Best Practices

For building with LLMs:

Start simple: API key + Jupyter notebook
Build prototypes and iterate quickly
Add evaluation as you scale
Keep users in the loop until models prove reliability

John Berryman:

Nicolay Gerold:

⁠LinkedIn⁠
⁠X (Twitter)
00:00 Introduction to RAG: Retrieval and Generation
00:19 Optimizing Retrieval Systems
01:11 Introducing John Berryman
02:31 John's Journey from Search to Prompt Engineering
04:05 Understanding RAG: Search and Prompt Engineering
05:39 The Little Red Riding Hood Principle in Prompt Engineering
14:14 Balancing Static and Dynamic Elements in Prompts
25:52 Assistants vs. Workflows: Choosing the Right Approach
30:15 Defining Agency in AI
30:35 Spectrum of Assistance and Workflows
34:35 Breaking Down Problems Horizontally and Vertically
37:57 SOX Compliance Case Study
40:56 Integrating LLMs into Existing Applications
44:37 Favorite Tools and Missing Features
46:37 Exploring Niche Technologies in AI
52:52 Key Takeaways and Future Directions

#044 Graphs Aren't Just For Specialists Anymore

28 februari 2025 | 64 min

Kuzu is an embedded graph database that implements Cypher as a library.

It can be easily integrated into various environments—from scripts and Android apps to serverless platforms.

Its design supports both ephemeral, in-memory graphs (ideal for temporary computations) and large-scale persistent graphs where traditional systems struggle with performance and scalability.

Key Architectural Decisions:

Columnar Storage:
Kuzu stores node and relationship properties in separate, contiguous columns. This design reduces I/O by allowing queries to scan only the needed columns, unlike row-based systems (e.g., Neo4j) that read full records even when only a subset of properties is required.
Efficient Join Indexing with CSR:
The join index is maintained using a Compressed Sparse Row (CSR) format. By sorting and compressing relationship data, Kuzu ensures that adjacent node relationships are stored contiguously, minimizing random I/O and speeding up traversals.
Vectorized Query Processing:
Instead of processing one tuple at a time, Kuzu processes blocks (vectors) of tuples. This block-based (or vectorized) approach reduces function-call overhead and improves cache locality, boosting performance for analytic queries.
Factorization and ASP Join:
For many-to-many queries that can generate enormous intermediate results, Kuzu uses factorization to represent data compactly. Its ASP join algorithm integrates factorization, sequential scanning, and sideways information passing to avoid unnecessary full scans and materializations.

Kuzu is optimized for read-heavy, analytic workloads. While batched writes are efficient, the system is less tuned for high-frequency, small transactions. Upcoming features include:

A WebAssembly (Wasm) version for running in browsers.
Enhanced vector and full-text search indices.
Built-in graph data science algorithms for tasks like PageRank and centrality analysis.

Kuzu can be a powerful backend for AI applications in several ways:

Knowledge Graphs:
Store and query complex relationships between entities to support natural language understanding, semantic search, and reasoning tasks.
Graph Data Science:
Run built-in graph algorithms (like PageRank, centrality, or community detection) that help uncover patterns and insights, which can improve recommendation systems, fraud detection, and other AI-driven analyses.
Retrieval-Augmented Generation (RAG):
Integrate with large language models by efficiently retrieving relevant, structured graph data. Kuzu’s vector search capabilities and fast query processing make it ideal for augmenting AI responses with contextual information.
Graph Embeddings & ML Pipelines:
Serve as the foundation for generating graph embeddings, which are used in downstream machine learning tasks—such as clustering, classification, or link prediction—to enhance model performance.

Semih Salihoğlu:

Nicolay Gerold:

00:00 Introduction to Graph Databases
00:18 Introducing Kuzu: A Modern Graph Database
01:48 Use Cases and Applications of Kuzu
03:03 Kuzu's Research Origins and Scalability
06:18 Columnar Storage vs. Row-Oriented Storage
10:27 Query Processing Techniques in Kuzu
22:22 Compressed Sparse Row (CSR) Storage
27:25 Vectorization in Graph Databases
31:24 Optimizing Query Processors with Vectorization
33:25 Common Wisdom in Graph Databases
35:13 Introducing ASP Join in Kuzu
35:55 Factorization and Efficient Query Processing
39:49 Challenges and Solutions in Graph Databases
45:26 Write Path Optimization in Kuzu
54:10 Future Developments in Kuzu
57:51 Key Takeaways and Final Thoughts

#043 Knowledge Graphs Won't Fix Bad Data

20 februari 2025 | 71 min

Metadata is the foundation of any enterprise knowledge graph.

By organizing both technical and business metadata, organizations create a “brain” that supports advanced applications like AI-driven data assistants.

The goal is to achieve economies of scale—making data reusable, traceable, and ultimately more valuable.

Juan Sequeda is a leading expert in enterprise knowledge graphs and metadata management. He has spent years solving the challenges of integrating diverse data sources into coherent, accessible knowledge graphs. As Principal Scientist at data.world, Juan provides concrete strategies for improving data quality, streamlining feature extraction, and enhancing model explainability. If you want to build AI systems on a solid data foundation—one that cuts through the noise and delivers reliable, high-performance insights—you need to listen to Juan’s proven methods and real-world examples.

Terms like ontologies, taxonomies, and knowledge graphs aren’t new inventions. Ontologies and taxonomies have been studied for decades—even since ancient Greece. Google popularized “knowledge graphs” in 2012 by building on decades of semantic web research. Despite current buzz, these concepts build on established work.

Traditionally, data lives in siloed applications—each with its own relational databases, ETL processes, and dashboards. When cross-application queries and consistent definitions become painful, organizations face metadata management challenges. The first step is to integrate technical metadata (table names, columns, code lineage) into a unified knowledge graph. Then, add business metadata by mapping business glossaries and definitions to that technical layer.

A modern data catalog should:

Integrate Multiple Sources: Automatically ingest metadata from databases, ETL tools (e.g., dbt, Fivetran), and BI tools.
Bridge Technical and Business Views: Link technical definitions (e.g., table “CUST_123”) with business concepts (e.g., “Customer”).
Enable Reuse and Governance: Support data discovery, impact analysis, and proper governance while facilitating reuse across teams.

Practical Approaches & Use Cases:

Start with a Clear Problem: Whether it’s reducing churn, improving operational efficiency, or meeting compliance needs, begin by solving a specific pain point.
Iron Thread Method: Follow one query end-to-end—from identifying a business need to tracing it back to source systems—to gradually build and refine the graph.
Automation vs. Manual Oversight: Technical metadata extraction is largely automated. For business definitions or text-based entity extraction (e.g., via LLMs), human oversight is key to ensuring accuracy and consistency.

Technical Considerations:

Entity vs. Property: If you need to attach additional details or reuse an element across contexts, model it as an entity (with a unique identifier). Otherwise, keep it as a simple property.
Storage Options: The market offers various graph databases—Neo4j, Amazon Neptune, Cosmos DB, TigerGraph, Apache Jena (for RDF), etc. Future trends point toward multimodel systems that allow querying in SQL, Cypher, or SPARQL over the same underlying data.

Juan Sequeda:

LinkedIn
data.world
Semantic Web for the Working Ontologist
Designing and Building Enterprise Knowledge Graphs (before you buy, send Juan a message, he is happy to send you a copy)
Catalog & Cocktails (Juan’s podcast)

Nicolay Gerold:

00:00 Introduction to Knowledge Graphs 00:45 The Role of Metadata in AI 01:06 Building Knowledge Graphs: First Steps 01:42 Interview with Juan Sequira 02:04 Understanding Buzzwords: Ontologies, Taxonomies, and More 05:05 Challenges and Solutions in Data Management 08:04 Practical Applications of Knowledge Graphs 15:38 Governance and Data Engineering 34:42 Setting the Stage for Data-Driven Problem Solving 34:58 Understanding Consumer Needs and Data Challenges 35:33 Foundations and Advanced Capabilities in Data Management 36:01 The Role of AI and Metadata in Data Maturity 37:56 The Iron Thread Approach to Problem Solving 40:12 Constructing and Utilizing Knowledge Graphs 54:38 Trends and Future Directions in Knowledge Graphs 59:17 Practical Advice for Building Knowledge Graphs

#042 Temporal RAG, Embracing Time for Smarter, Reliable Knowledge Graphs

13 februari 2025 | 94 min

Daniel Davis is an expert on knowledge graphs. He has a background in risk assessment and complex systems—from aerospace to cybersecurity. Now he is working on “Temporal RAG” in TrustGraph.

Time is a critical—but often ignored—dimension in data. Whether it’s threat intelligence, legal contracts, or API documentation, every data point has a temporal context that affects its reliability and usefulness. To manage this, systems must track when data is created, updated, or deleted, and ideally, preserve versions over time.

Three Types of Data:

Observations:
- Definition: Measurable, verifiable recordings (e.g., “the hat reads ‘Sunday Running Club’”).
- Characteristics: Require supporting evidence and may be updated as new data becomes available.
Assertions:
- Definition: Subjective interpretations (e.g., “the hat is greenish”).
- Characteristics: Involve human judgment and come with confidence levels; they may change over time.
Facts:
- Definition: Immutable, verified information that remains constant.
- Characteristics: Rare in dynamic environments because most data evolves; serve as the “bedrock” of trust.

By clearly categorizing data into these buckets, systems can monitor freshness, detect staleness, and better manage dependencies between components (like code and its documentation).

Integrating Temporal Data into Knowledge Graphs:

Challenge:
Traditional knowledge graphs and schemas (e.g., schema.org) rarely integrate time beyond basic metadata. Long documents may only provide a single timestamp, leaving the context of internal details untracked.
Solution:
Attach detailed temporal metadata (such as creation, update, and deletion timestamps) during data ingestion. Use versioning to maintain historical context. This allows systems to:
- Assess whether data is current or stale.
- Detect conflicts when updates occur.
- Employ Bayesian methods to adjust trust metrics as more information accumulates.

Key Takeaways:

Focus on Specialization:
Build tools that do one thing well. For example, design a simple yet extensible knowledge graph rather than relying on overly complex ontologies.
Integrate Temporal Metadata:
Always timestamp data operations and version records. This is key to understanding data freshness and evolution.
Adopt Robust Infrastructure:
Use scalable, proven technologies to connect specialized modules via APIs. This reduces maintenance overhead compared to systems overloaded with connectors and extra features.
Leverage Bayesian Updates:
Start with initial trust metrics based on observed data and refine them as new evidence arrives.
Mind the Big Picture:
Avoid working in isolated silos. Emphasize a holistic system design that maintains in situ context and promotes collaboration across teams.

Daniel Davis

Nicolay Gerold:

00:00 Introduction to Temporal Dimensions in Data 00:53 Timestamping and Versioning Data 01:35 Introducing Daniel Davis and Temporal RAG 01:58 Three Buckets of Data: Observations, Assertions, and Facts 03:22 Dynamic Data and Data Freshness 05:14 Challenges in Integrating Time in Knowledge Graphs 09:41 Defining Observations, Assertions, and Facts 12:57 The Role of Time in Data Trustworthiness 46:58 Chasing White Whales in AI 47:58 The Problem with Feature Overload 48:43 Connector Maintenance Challenges 50:02 The Swiss Army Knife Analogy 51:16 API Meshes and Glue Code 54:14 The Importance of Software Infrastructure 01:00:10 The Need for Specialized Tools 01:13:25 Outro and Future Plans

#041 Context Engineering, How Knowledge Graphs Help LLMs Reason

6 februari 2025 | 94 min

Robert Caulk runs Emergent Methods, a research lab building news knowledge graphs. With a Ph.D. in computational mechanics, he spent 12 years creating open-source tools for machine learning and data analysis. His work on projects like Flowdapt (model serving) and FreqAI (adaptive modeling) has earned over 1,000 academic citations.

His team built AskNews, which he calls "the largest news knowledge graph in production." It's a system that doesn't just collect news - it understands how events, people, and places connect.

Current AI systems struggle to connect information across sources and domains. Simple vector search misses crucial relationships. But building knowledge graphs at scale brings major technical hurdles around entity extraction, relationship mapping, and query performance.

Emergent Methods built a hybrid system combining vector search and knowledge graphs:

Vector DB (Quadrant) handles initial broad retrieval
Custom knowledge graph processes relationships
Translation pipeline normalizes multi-language content
Entity extraction model identifies key elements
Context engineering pipeline structures data for LLMs

Implementation Details:

Data Pipeline:

All content normalized to English for consistent embeddings
Entity names preserved in original language when untranslatable
Custom Gleiner News model handles entity extraction
Retrained every 6 months on fresh data
Human review validates entity accuracy

Entity Management:

Base extraction uses BERT-based Gleiner architecture
Trained on diverse data across topics/regions
Disambiguation system merges duplicate entities
Manual override options for analysts
Metadata tracking preserves relationship context

Knowledge Graph:

Selective graph construction from vector results
On-demand relationship processing
Graph queries via standard Cypher
Built for specific use cases vs general coverage
Integration with S3 and other data stores

System Validation:

Custom "Context is King" benchmark suite
RAGAS metrics track retrieval accuracy
Time-split validation prevents data leakage
Manual review of entity extraction
Production monitoring of query patterns

Engineering Insights:

Key Technical Decisions:

English normalization enables consistent embeddings
Hybrid vector + graph approach balances speed/depth
Selective graph construction keeps costs down
Human-in-loop validation maintains quality

Dead Ends Hit:

Full multi-language entity system too complex
Real-time graph updates not feasible at scale
Pure vector or pure graph approaches insufficient

Top Quotes:

"At its core, context engineering is about how we feed information to AI. We want clear, focused inputs for better outputs. Think of it like talking to a smart friend - you'd give them the key facts in a way they can use, not dump raw data on them." - Robert
"Strong metadata paints a high-fidelity picture. If we're trying to understand what's happening in Ukraine, we need to know not just what was said, but who said it, when they said it, and what voice they used to say it. Each piece adds color to the picture." - Robert
"Clean data beats clever models. You can throw noise at an LLM and get something that looks good, but if you want real accuracy, you need to strip away the clutter first. Every piece of noise pulls the model in a different direction." - Robert
"Think about how the answer looks in the real world. If you're comparing apartments, you'd want a table. If you're tracking events, you'd want a timeline. Match your data structure to how humans naturally process that kind of information." - Nico
"Building knowledge graphs isn't about collecting everything - it's about finding the relationships that matter. Most applications don't need a massive graph. They need the right connections for their specific problem." - Robert
"The quality of your context sets the ceiling for what your AI can do. You can have the best model in the world, but if you feed it noisy, unclear data, you'll get noisy, unclear answers. Garbage in, garbage out still applies." - Robert
"When handling multiple languages, it's better to normalize everything to one language than to try juggling many. Yes, you lose some nuance, but you gain consistency. And consistency is what makes these systems reliable." - Robert
"The hard part isn't storing the data - it's making it useful. Anyone can build a database. The trick is structuring information so an AI can actually reason with it. That's where context engineering makes the difference." - Robert
"Start simple, then add complexity only when you need it. Most teams jump straight to sophisticated solutions when they could get better results by just cleaning their data and thinking carefully about how they structure it." - Nico
"Every token in your context window is precious. Don't waste them on HTML tags or formatting noise. Save that space for the actual signal - the facts, relationships, and context that help the AI understand what you're asking." - Nico

Robert Caulk:

Nicolay Gerold:

00:00 Introduction to Context Engineering 00:24 Curating Input Signals 01:01 Structuring Raw Data 03:05 Refinement and Iteration 04:08 Balancing Breadth and Precision 06:10 Interview Start 08:02 Challenges in Context Engineering 20:25 Optimizing Context for LLMs 45:44 Advanced Cypher Queries and Graphs 46:43 Enrichment Pipeline Flexibility 47:16 Combining Graph and Semantic Search 49:23 Handling Multilingual Entities 52:57 Disambiguation and Deduplication Challenges 55:37 Training Models for Diverse Domains 01:04:43 Dealing with AI-Generated Content 01:17:32 Future Developments and Final Thoughts

#040 Vector Database Quantization, Product, Binary, and Scalar

31 januari 2025 | 52 min

When you store vectors, each number takes up 32 bits.

With 1000 numbers per vector and millions of vectors, costs explode.

A simple chatbot can cost thousands per month just to store and search through vectors.

The Fix: Quantization

Think of it like image compression. JPEGs look almost as good as raw photos but take up far less space. Quantization does the same for vectors.

Today we are back continuing our series on search with Zain Hasan, a former ML engineer at Weaviate and now a Senior AI/ ML Engineer at Together. We talk about the different types of quantization, when to use them, how to use them, and their tradeoff.

Three Ways to Quantize:

Binary Quantization
- Turn each number into just 0 or 1
- Ask: "Is this dimension positive or negative?"
- Works great for 1000+ dimensions
- Cuts memory by 97%
- Best for normally distributed data
Product Quantization
- Split vector into chunks
- Group similar chunks
- Store cluster IDs instead of full numbers
- Good when binary quantization fails
- More complex but flexible
Scalar Quantization
- Use 8 bits instead of 32
- Simple middle ground
- Keeps more precision than binary
- Less savings than binary

Key Quotes:

"Vector databases are pretty much the commercialization and the productization of representation learning."
"I think quantization, it builds on the assumption that there is still noise in the embeddings. And if I'm looking, it's pretty similar as well to the thought of Matryoshka embeddings that I can reduce the dimensionality."
"Going from text to multimedia in vector databases is really simple."
"Vector databases allow you to take all the advances that are happening in machine learning and now just simply turn a switch and use them for your application."

Zain Hasan:

Nicolay Gerold:

vector databases, quantization, hybrid search, multi-vector support, representation learning, cost reduction, memory optimization, multimodal recommender systems, brain-computer interfaces, weather prediction models, AI applications

#039 Local-First Search, How to Push Search To End-Devices

23 januari 2025 | 53 min

Alex Garcia is a developer focused on making vector search accessible and practical. As he puts it: "I'm a SQLite guy. I use SQLite for a lot of projects... I want an easier vector search thing that I don't have to install 10,000 dependencies to use.”

Core Mantra: "Simple, Local, Scalable"

Why SQLite Vec?

"I didn't go along thinking, 'Oh, I want to build vector search, let me find a database for it.' It was much more like: I use SQLite for a lot of projects, I want something lightweight that works in my current workflow."

SQLiteVec uses row-oriented storage with some key design choices:

Vectors are stored in large chunks (megabytes) as blobs
Data is split across 4KB SQLite pages, which affects analytical performance
Currently uses brute force linear search without ANN indexing
Supports binary quantization for 32x size reduction
Handles tens to hundreds of thousands of vectors efficiently

Practical limits:

500ms search time for 500K vectors (768 dimensions)
Best performance under 100ms for user experience
Binary quantization enables scaling to ~1M vectors
Metadata filtering and partitioning coming soon

Key advantages:

Fast writes for transactional workloads
Simple single-file database
Easy integration with existing SQLite applications
Leverages SQLite's mature storage engine

Garcia's preferred tools for local AI:

Sentence Transformers models converted to GGUF format
Llama.cpp for inference
Small models (30MB) for basic embeddings
Larger models like Arctic Embed (hundreds of MB) for recent topics
SQLite L-Embed extension for text embeddings
Transformers.js for browser-based implementations

1. Choose Your Storage

"There's two ways of storing vectors within SQLiteVec. One way is a manual way where you just store a JSON array... [second is] using a virtual table."

Traditional row storage: Simple, flexible, good for small vectors
Virtual table storage: Optimized chunks, better for large datasets
Performance sweet spot: Up to 500K vectors with 500ms search time

2. Optimize Performance

"With binary quantization it's 1/32 of the space... and holds up at 95 percent quality"

Binary quantization reduces storage 32x with 95% quality
Default page size is 4KB - plan your vector storage accordingly
Metadata filtering dramatically improves search speed

3. Integration Patterns

"It's a single file, right? So you can like copy and paste it if you want to make a backup."

Two storage approaches: manual columns or virtual tables
Easy backups: single file database
Cross-platform: desktop, mobile, IoT, browser (via WASM)

4. Real-World Tips

"I typically choose the really small model... it's 30 megabytes. It quantizes very easily... I like it because it's very small, quick and easy."

Start with smaller, efficient models (30MB range)
Use binary quantization before trying complex solutions
Plan for partitioning when scaling beyond 100K vectors

Alex Garcia

Nicolay Gerold:

#038 AI-Powered Search, Context Is King, But Your RAG System Ignores Two-Thirds of It

9 januari 2025 | 74 min

Today, I (Nicolay Gerold) sit down with Trey Grainger, author of the book AI-Powered Search. We discuss the different techniques for search and recommendations and how to combine them.

While RAG (Retrieval-Augmented Generation) has become a buzzword in AI, Trey argues that the current understanding of "RAG" is overly simplified – it's actually a bidirectional process he calls "GARRAG," where retrieval and generation continuously enhance each other.

Trey uses a three context framework for search architecture:

Content Context: Traditional document understanding and retrieval
User Context: Behavioral signals driving personalization and recommendations
Domain Context: Knowledge graphs and semantic understanding

Trey shares insights on:

Why collecting and properly using user behavior signals is crucial yet often overlooked
How to implement "light touch" personalization without trapping users in filter bubbles
The evolution from simple vector similarity to sophisticated late interaction models
Why treating search as a non-linear pipeline with feedback loops leads to better results

For engineers building search systems, Trey offers practical advice on choosing the right tools and techniques, from traditional search engines like Solr and Elasticsearch to modern approaches like ColBERT.

Also how to layer different techniques to make search tunable and debuggable.

Quotes:

"I think of whether it's search or generative AI, I think of all of these systems as nonlinear pipelines."
"The reason we use retrieval when we're working with generative AI is because A generative AI model these LLMs will take your query, your request, whatever you're asking for. They will then try to interpret them and without access to up to date information, without access to correct information, they will generate a response from their highly compressed understanding of the world. And so we use retrieval to augment them with information."
"I think the misconception is that, oh, hey, for RAG I can just, plug in a vector database and a couple of libraries and, a day or two later everything's magically working and I'm off to solve the next problem. Because search and information retrieval is one of those problems that you never really solve. You get it, good enough and quit, or you find so much value in it, you just continue investing to constantly make it better."
"To me, they're, search and recommendations are fundamentally the same problem. They're just using different contexts."
"Anytime you're building a search system, whether it's traditional search, whether it's RAG for generative AI, you need to have all three of those contexts in order to effectively get the most relevant results to solve solve the problem."
"There's no better way to make your users really angry with you than to stick them in a bucket and get them stuck in that bucket, which is not their actual intent."

Trey Grainger:

Nicolay Gerold:

00:00 Introduction to Search Challenges 00:50 Layered Approach to Ranking 01:00 Personalization and Signal Boosting 02:25 Broader Principles in Software Engineering 02:51 Interview with Trey Greinger 03:32 Understanding RAG and Retrieval 04:35 Nonlinear Pipelines in Search 06:01 Generative AI and Retrieval 08:10 Search Renaissance and AI 10:27 Misconceptions in AI-Powered Search 18:12 Search vs. Recommendation Systems 22:26 Three Buckets of Relevance 38:19 Traditional Learning to Rank 39:11 Semantic Relevance and User Behavior 39:53 Layered Ranking Algorithms 41:40 Personalization in Search 43:44 Technological Setup for Query Understanding 48:21 Personalization and User Behavior Vectors 52:10 Choosing the Right Search Engine 56:35 Future of AI-Powered Search 01:00:48 Building Effective Search Applications 01:06:50 Three Critical Context Frameworks 01:12:08 Modern Search Systems and Contextual Understanding 01:13:37 Conclusion and Recommendations

#037 Chunking for RAG: Stop Breaking Your Documents Into Meaningless Pieces

3 januari 2025 | 49 min

Today we are back continuing our series on search. We are talking to Brandon Smith, about his work for Chroma. He led one of the largest studies in the field on different chunking techniques. So today we will look at how we can unfuck our RAG systems from badly chosen chunking hyperparameters.

The biggest lie in RAG is that semantic search is simple. The reality is that it's easy to build, it's easy to get up and running, but it's really hard to get right. And if you don't have a good setup, it's near impossible to debug. One of the reasons it's really hard is actually chunking. And there are a lot of things you can get wrong.

And even OpenAI boggled it a little bit, in my opinion, using an 800 token length for the chunks. And this might work for legal, where you have a lot of boilerplate that carries little semantic meaning, but often you have the opposite. You have very information dense content and imagine fitting an entire Wikipedia page into the size of a tweet There will be a lot of information that's actually lost and that's what happens with long chunks The next is overlap openai uses a foreign token overlap or used to And what this does is actually we try to bring the important context into the chunk, but in reality, we don't really know where the context is coming from.

It could be from a few pages prior, not just the 400 tokens before. It could also be from a definition that's not even in the document at all. There is a really interesting solution actually from Anthropic Contextual Retrieval, where you basically pre process all the chunks to see whether there is any missing information and you basically try to reintroduce it.

Brandon Smith:

Nicolay Gerold:

00:00 The Biggest Lie in RAG: Semantic Search Simplified 00:43 Challenges in Chunking and Overlap 01:38 Introducing Brandon Smith and His Research 02:05 The Motivation and Mechanics of Chunking 04:40 Issues with Current Chunking Methods 07:04 Optimizing Chunking Strategies 23:04 Introduction to Chunk Overlap 24:23 Exploring LLM-Based Chunking 24:56 Challenges with Initial Approaches 28:17 Alternative Chunking Methods 36:13 Language-Specific Considerations 38:41 Future Directions and Best Practices

#036 How AI Can Start Teaching Itself - Synthetic Data Deep Dive

19 december 2024 | 48 min

#035 A Search System That Learns As You Use It (Agentic RAG)

13 december 2024 | 46 min

Modern RAG systems build on flexibility.

At their core, they match each query with the best tool for the job.

They know which tool fits each task. When you ask about sales numbers, they reach for SQL. When you need to company policies, they use vector search or BM25. The key is switching tools smoothly.

A question about sales figures might need SQL, while a search through policy documents works better with vector search. The key is building systems that can switch between these tools smoothly.

But all types of retrieval start with metadata. By tagging documents with key details during processing, we narrow the search space before diving in.

The best systems use a mix of approaches: they might keep full documents for context, summaries for quick scanning, and metadata for filtering. They cast a wide net at first, then use neural ranking to zero in on the most relevant results.

The quality of embeddings can make or break a system. General-purpose models often fall short in specialized fields. Testing different embedding models on your specific data pays off - what works for general text might fail for legal documents or technical manuals. Sometimes, fine-tuning a model for your domain is worth the effort.

When building search systems, think modular. Start with pieces that can be swapped out as needs change or better tools emerge. Add metadata processing early - it's harder to add later. Break the retrieval process into steps: first find possible matches quickly, then rank them carefully. For complex documents with tables or images, add tools that can handle different types of content.

The best systems also check their work. They ask: "Did I actually answer the question?" If not, they try a different approach. But they also know when to stop - endless loops help no one. In the end, RAG isn't just about finding information. It's about finding the right information, in the right way, at the right time.

Stephen Batifol:

Nicolay Gerold:

00:00 Introduction to Agentic RAG 00:04 Understanding Control Flow in Agentic RAG 00:33 Decision Making with LLMs 01:11 Exploring Agentic RAG with Stephen Batifol 03:35 Comparing RAG and GAR 06:31 Implementing Agentic RAG Workflows 22:36 Filtering with Prefix, Suffix, and Midfix 24:15 Breaking Mechanisms in Workflows 28:00 Evaluating Agentic Workflows 30:31 Multimodal and VLLMs in Document Processing 33:51 Challenges and Innovations in Parsing 34:51 Overrated and Underrated Aspects in LLMs 39:52 Building Effective Search Applications

#034 Rethinking Search Inside Postgres, From Lexemes to BM25

5 december 2024 | 47 min

Many companies use Elastic or OpenSearch and use 10% of the capacity.

They have to build ETL pipelines.

Get data Normalized.

Worry about race conditions.

All in all. At the moment, when you want to do search on top of your transactional data, you are forced to build a distributed systems.

Not anymore.

ParadeDB is building an open-source PostgreSQL extension to enable search within your database.

Today, I am talking to Philippe Noël, the founder and CEO of ParadeDB.

We talk about how they build it, how they integrate into the Postgres Query engines, and how you can build search on top of Postgres.

Key Insights:

Search is changing. We're moving from separate search clusters to search inside databases. Simpler architecture, stronger guarantees, lower costs up to a certain scale.

Most search engines force you to duplicate data. ParadeDB doesn't. You keep data normalized and join at query time. It hooks deep into Postgres's query planner. It doesn't just bolt on search - it lets Postgres optimize search queries alongside SQL ones.

Search indices can work with ACID. ParadeDB's BM25 index keeps Lucene-style components (term frequency, normalization) but adds Postgres metadata for transactions. Search + ACID is possible.

Two storage types matter: inverted indices for text, columnar "fast fields" for analytics. Pick the right one or queries get slow. Integers now default to columnar to prevent common mistakes.

Mixing query engines looks tempting but fails. The team tried using DuckDB and DataFusion inside Postgres. Both were fast but broke ACID compliance. They had to rebuild features natively.

Philippe Noël:

Nicolay Gerold:

00:00 Introduction to ParadeDB 00:53 Building ParadeDB with Rust 01:43 Integrating Search in Postgres 03:04 ParadeDB vs. Elastic 05:48 Technical Deep Dive: Postgres Integration 07:27 Challenges and Solutions 09:35 Transactional Safety and Performance 11:06 Composable Data Systems 15:26 Columnar Storage and Analytics 20:54 Case Study: Alibaba Cloud 21:57 Data Warehouse Context 23:24 Custom Indexing with BM25 24:01 Postgres Indexing Overview 24:17 Fast Fields and Columnar Format 24:52 Lucene Inspiration and Data Storage 26:06 Setting Up and Managing Indexes 27:43 Query Building and Complex Searches 30:21 Scaling and Sharding Strategies 35:27 Query Optimization and Common Mistakes 38:39 Future Developments and Integrations 39:24 Building a Full-Fledged Search Application 42:53 Challenges and Advantages of Using ParadeDB 46:43 Final Thoughts and Recommendations

#033 RAG's Biggest Problems & How to Fix It (ft. Synthetic Data)

28 november 2024 | 51 min

RAG isn't a magic fix for search problems. While it works well at first, most teams find it's not good enough for production out of the box. The key is to make it better step by step, using good testing and smart data creation.

Today, we are talking to Saahil Ognawala from Jina AI to start to understand RAG.

To build a good RAG system, you need three things: ways to test it, methods to create training data, and plans to make it better over time. Testing starts with a set of example searches that users might make. These should include common searches that happen often, medium-rare searches, and rare searches that only happen now and then. This mix helps you measure if changes make your system better or worse.

Creating synthetic data helps make the system stronger, especially in spotting wrong answers that look right. Think of someone searching for a "gluten-free chocolate cake." A "sugar-free chocolate cake" might look like a good answer because it shares many words, but it's wrong.

These tricky examples help the system learn the difference between similar but different things.

When creating synthetic data, you need rules. The best way is to show the AI a few real examples and give it a list of topics to work with. Most teams find that using half real data and half synthetic data works best. This gives you enough variety while keeping things real.

Getting user feedback is hard with RAG. In normal search, you can see if users click on results. But with RAG, the system creates an answer from many pieces. A good answer might come from both good and bad pieces, making it hard to know which parts helped. This means you need smart ways to track which pieces of information actually helped make good answers.

One key rule: don't make things harder than they need to be. If simple keyword search (called BM25) works well enough, adding fancy AI search might not be worth the extra work.

Success with RAG comes from good testing, careful data creation, and steady improvements based on real use. It's not about using the newest AI models. It's about building good systems and processes that work reliably.

"It isn’t a magic wand you can place on your catalog and expect results you didn’t get before."

“Most of our users are enterprise users who have seen the most success in their RAG systems are the ones that very early implemented a continuous feedback mechanism.“

“If you can't tell in real time usage whether an answer is a bad answer or a right answer because the LLM just makes it look like the right answer then you only have your retrieval dataset to blame”

Saahil Ognawala:

Nicolay Gerold:

00:00 Introduction to Retrieval Augmented Generation (RAG) 00:29 Interview with Saahil Ognawala 00:52 Synthetic Data in Language Generation 01:14 Understanding the E5 Mistral Instructor Embeddings Paper 03:15 Challenges and Evolution in Synthetic Data 05:03 User Intent and Retrieval Systems 11:26 Evaluating RAG Systems 14:46 Setting Up Evaluation Frameworks 20:37 Fine-Tuning and Embedding Models 22:25 Negative and Positive Examples in Retrieval 26:10 Synthetic Data for Hard Negatives 29:20 Case Study: Marine Biology Project 29:54 Addressing Errors in Marine Biology Queries 31:28 Ensuring Query Relevance with Human Intervention 31:47 Few Shot Prompting vs Zero Shot Prompting 35:09 Balancing Synthetic and Real World Data 37:17 Improving RAG Systems with User Feedback 39:15 Future Directions for Jina and Synthetic Data 40:44 Building and Evaluating Embedding Models 41:24 Getting Started with Jina and Open Source Tools 51:25 The Importance of Hard Negatives in Embedding Models

#032 Improving Documentation Quality for RAG Systems

21 november 2024 | 47 min

Documentation quality is the silent killer of RAG systems. A single ambiguous sentence might corrupt an entire set of responses. But the hardest part isn't fixing errors - it's finding them.

Today we are talking to Max Buckley on how to find and fix these errors.

Max works at Google and has built a lot of interesting experiments with LLMs on using them to improve knowledge bases for generation.

We talk about identifying ambiguities, fixing errors, creating improvement loops in the documents and a lot more.

Some Insights:

A single ambiguous sentence can systematically corrupt an entire knowledge base's responses. Fixing these "documentation poisons" often requires minimal changes but identifying them is challenging.
Large organizations develop their own linguistic ecosystems that evolve over time. This creates unique challenges for both embedding models and retrieval systems that need to bridge external and internal vocabularies.
Multiple feedback loops are crucial - expert testing, user feedback, and system monitoring each catch different types of issues.

Max Buckley: (All opinions are his own and not of Google)

LinkedIn

Nicolay Gerold:

00:00 Understanding LLM Hallucinations 00:02 Challenges with Temporal Inconsistencies 00:43 Issues with Document Structure and Terminology 01:05 Introduction to Retrieval Augmented Generation (RAG) 01:49 Interview with Max Buckley 02:27 Anthropic's Approach to Document Chunking 02:55 Contextualizing Chunks for Better Retrieval 06:29 Challenges in Chunking and Search 07:35 LLMs in Internal Knowledge Management 08:45 Identifying and Fixing Documentation Errors 10:58 Using LLMs for Error Detection 15:35 Improving Documentation with User Feedback 24:42 Running Processes on Retrieved Context 25:19 Challenges of Terminology Consistency 26:07 Handling Definitions and Glossaries 30:10 Addressing Context Misinterpretation 31:13 Improving Documentation Quality 36:00 Future of AI and Search Technologies 42:29 Ensuring Documentation Readiness for AI

#031 BM25 As The Workhorse Of Search; Vectors Are Its Visionary Cousin

15 november 2024 | 54 min

Ever wondered why vector search isn't always the best path for information retrieval?

Join us as we dive deep into BM25 and its unmatched efficiency in our latest podcast episode with David Tippett from GitHub.

Discover how BM25 transforms search efficiency, even at GitHub's immense scale.

BM25, short for Best Match 25, use term frequency (TF) and inverse document frequency (IDF) to score document-query matches. It addresses limitations in TF-IDF, such as term saturation and document length normalization.

Search Is About User Expectations

Search isn't just about relevance but aligning with what users expect:
- GitHub users, for example, have diverse use cases—finding security vulnerabilities, exploring codebases, or managing repositories. Each requires a different prioritization of fields, boosting strategies, and possibly even distinct search workflows.
Key Insight: Search is deeply contextual and use-case driven. Understanding your users' intent and tailoring search behavior to their expectations matters more than chasing state-of-the-art technology.

The Challenge of Vector Search at Scale

Vector search systems require in-memory storage of vectorized data, making them costly for datasets with billions of documents (e.g., GitHub’s 100 billion documents).
IVF and HNSW offer trade-offs:
- IVF: Reduces memory requirements by bucketing vectors but risks losing relevance due to bucket misclassification.
- HNSW: Offers high relevance but demands high memory, making it impractical for massive datasets.
Architectural Insight: When considering vector search, focus on niche applications or subdomains with manageable dataset sizes or use hybrid approaches combining BM25 with sparse/dense vectors.

Vector Search vs. BM25: A Trade-off of Precision vs. Cost

Vector search is more precise and effective for semantic similarity, but its operational costs and memory requirements make it prohibitive for massive datasets like GitHub’s corpus of over 100 billion documents.
BM25’s scaling challenges (e.g., reliance on disk IOPS) are manageable compared to the memory-bound nature of vector search engines like HNSW and IVF.
Key Insight: BM25’s scalability allows for broader adoption, while vector search is still a niche solution requiring high specialization and infrastructure.

David Tippett:

Nicolay Gerold:

00:00 Introduction to RAG and Vector Search Challenges 00:28 Introducing BM25: The Efficient Search Solution 00:43 Guest Introduction: David Tippett 01:16 Comparing Search Engines: Vespa, Weaviate, and More 07:53 Understanding BM25 and Its Importance 09:10 Deep Dive into BM25 Mechanics 23:46 Field-Based Scoring and BM25F 25:49 Introduction to Zero Shot Retrieval 26:03 Vector Search vs BM25 26:22 Combining Search Techniques 26:56 Favorite BM25 Adaptations 27:38 Postgres Search and Term Proximity 31:49 Challenges in GitHub Search 33:59 BM25 in Large Scale Systems 40:00 Technical Deep Dive into BM25 45:30 Future of Search and Learning to Rank 47:18 Conclusion and Future Plans

#030 Vector Search at Scale, Why One Size Doesn't Fit All

7 november 2024 | 36 min

#029 Search Systems at Scale, Avoiding Local Maxima and Other Engineering Lessons

31 oktober 2024 | 55 min

Modern search systems face a complex balancing act between performance, relevancy, and cost, requiring careful architectural decisions at each layer.

While vector search generates buzz, hybrid approaches combining traditional text search with vector capabilities yield better results.

The architecture typically splits into three core components:

ingestion/indexing (requiring decisions between batch vs streaming)
query processing (balancing understanding vs performance)
analytics/feedback loops for continuous improvement.

Critical but often overlooked aspects include query understanding depth, systematic relevancy testing (avoid anecdote-driven development), and data governance as search systems naturally evolve into organizational data hubs.

Performance optimization requires careful tradeoffs between index-time vs query-time computation, with even 1-2% improvements being significant in mature systems.

Success requires testing against production data (staging environments prove unreliable), implementing proper evaluation infrastructure (golden query sets, A/B testing, interleaving), and avoiding the local maxima trap where improving one query set unknowingly damages others.

The end goal is finding an acceptable balance between corpus size, latency requirements, and cost constraints while maintaining system manageability and relevance quality.

"It's quite easy to end up in local maxima, whereby you improve a query for one set and then you end up destroying it for another set."

"A good marker of a sophisticated system is one where you actually see it's getting worse... you might be discovering a maxima."

"There's no free lunch in all of this. Often it's a case that, to service billions of documents on a vector search, less than 10 millis, you can do those kinds of things. They're just incredibly expensive. It's really about trying to manage all of the overall system to find what is an acceptable balance."

Search Pioneers:

Stuart Cam:

LinkedIn

Russ Cam:

Nicolay Gerold:

00:00 Introduction to Search Systems 00:13 Challenges in Search: Relevancy vs Latency 00:27 Insights from Industry Experts 01:00 Evolution of Search Technologies 03:16 Storage and Compute in Search Systems 06:22 Common Mistakes in Building Search Systems 09:10 Evaluating and Improving Search Systems 19:27 Architectural Components of Search Systems 29:17 Understanding Search Query Expectations 29:39 Balancing Speed, Cost, and Corpus Size 32:03 Trade-offs in Search System Design 32:53 Indexing vs Querying: Key Considerations 35:28 Re-ranking and Personalization Challenges 38:11 Evaluating Search System Performance 44:51 Overrated vs Underrated Search Techniques 48:31 Final Thoughts and Contact Information

#028 Training Multi-Modal AI, Inside the Jina CLIP Embedding Model

25 oktober 2024 | 49 min

Today we are talking to Michael Günther, a senior machine learning scientist at Jina about his work on JINA Clip.

Some key points:

Uni-modal embeddings convert a single type of input (text, images, audio) into vectors
Multimodal embeddings learn a joint embedding space that can handle multiple types of input, enabling cross-modal search (e.g., searching images with text)
Multimodal models can potentially learn richer representations of the world, including concepts that are difficult or impossible to put into words

Types of Text-Image Models

CLIP-like Models
- Separate vision and text transformer models
- Each tower maps inputs to a shared vector space
- Optimized for efficient retrieval
Vision-Language Models
- Process image patches as tokens
- Use transformer architecture to combine image and text information
- Better suited for complex document matching
Hybrid Models
- Combine separate encoders with additional transformer components
- Allow for more complex interactions between modalities
- Example: Google's Magic Lens model

Training Insights from Jina CLIP

Key Learnings
- Freezing the text encoder during training can significantly hinder performance
- Short image captions limit the model's ability to learn rich text representations
- Large batch sizes are crucial for training embedding models effectively
Training Process
- Three-stage training approach:
  - Stage 1: Training on image captions and text pairs
  - Stage 2: Adding longer image captions
  - Stage 3: Including triplet data with hard negatives

Practical Considerations

Similarity Scales
- Different modalities can produce different similarity value scales
- Important to consider when combining multiple embedding types
- Can affect threshold-based filtering
Model Selection
- Evaluate models based on relevant benchmarks
- Consider the domain similarity between training data and intended use case
- Assessment of computational requirements and efficiency needs

Future Directions

Areas for Development
- More comprehensive benchmarks for multimodal tasks
- Better support for semi-structured data
- Improved handling of non-photographic images
Upcoming Developments at Jina AI
- Multilingual support for Jina ColBERT
- New version of text embedding models
- Focus on complex multimodal search applications

Practical Applications

E-commerce
- Product search and recommendations
- Combined text-image embeddings for better results
- Synthetic data generation for fine-tuning
Fine-tuning Strategies
- Using click data and query logs
- Generative pseudo-labeling for creating training data
- Domain-specific adaptations

Key Takeaways for Engineers

Be aware of similarity value scales and their implications
Establish quantitative evaluation metrics before optimization
Consider model limitations (e.g., image resolution, text length)
Use performance optimizations like flash attention and activation checkpointing
Universal embedding models might not be optimal for specific use cases

Michael Guenther

Nicolay Gerold:

00:00 Introduction to Uni-modal and Multimodal Embeddings 00:16 Exploring Multimodal Embeddings and Their Applications 01:06 Training Multimodal Embedding Models 02:21 Challenges and Solutions in Embedding Models 07:29 Advanced Techniques and Future Directions 29:19 Understanding Model Interference in Search Specialization 30:17 Fine-Tuning Jina CLIP for E-Commerce 32:18 Synthetic Data Generation and Pseudo-Labeling 33:36 Challenges and Learnings in Embedding Models 40:52 Future Directions and Takeaways

#027 Building the database for AI, Multi-modal AI, Multi-modal Storage

23 oktober 2024 | 45 min

#026 Embedding Numbers, Categories, Locations, Images, Text, and The World

10 oktober 2024 | 47 min

Today’s guest is Mór Kapronczay. Mór is the Head of ML at superlinked. Superlinked is a compute framework for your information retrieval and feature engineering systems, where they turn anything into embeddings.

When most people think about embeddings, they think about ada, openai.

You just take your text and throw it in there.

But that’s too crude.

OpenAI embeddings are trained on the internet.

But your data set (most likely) is not the internet.

You have different nuances.

And you have more than just text.

So why not use it.

Some highlights:

Text Embeddings are Not a Magic Bullet

➡️ Pouring everything into a text embedding model won't yield magical results ➡️ Language is lossy - it's a poor compression method for complex information

Embedding Numerical Data

➡️ Direct number embeddings don't work well for vector search ➡️ Consider projecting number ranges onto a quarter circle ➡️ Apply logarithmic transforms for skewed distributions

Multi-Modal Embeddings

➡️ Create separate vector parts for different data aspects ➡️ Normalize individual parts ➡️ Weight vector parts based on importance

A Multi-Vector approach can help you understand the contributions of each modality or embedding and give you an easier time to fine-tune your retrieval system without fine-tuning your embedding models by tuning your vector database like you would a search database (like Elastic).

Mór Kapronczay

Nicolay Gerold:

00:00 Introduction to Embeddings 00:30 Beyond Text: Expanding Embedding Capabilities 02:09 Challenges and Innovations in Embedding Techniques 03:49 Unified Representations and Vector Computers 05:54 Embedding Complex Data Types 07:21 Recommender Systems and Interaction Data 08:59 Combining and Weighing Embeddings 14:58 Handling Numerical and Categorical Data 20:35 Optimizing Embedding Efficiency 22:46 Dynamic Weighting and Evaluation 24:35 Exploring AB Testing with Embeddings 25:08 Joint vs Separate Embedding Spaces 27:30 Understanding Embedding Dimensions 29:59 Libraries and Frameworks for Embeddings 32:08 Challenges in Embedding Models 33:03 Vector Database Connectors 34:09 Balancing Production and Updates 36:50 Future of Vector Search and Modalities 39:36 Building with Embeddings: Tips and Tricks 42:26 Concluding Thoughts and Next Steps

#025 Data Models to Remove Ambiguity from AI and Search

4 oktober 2024 | 59 min

Today we have Jessica Talisman with us, who is working as an Information Architect at Adobe. She is (in my opinion) the expert on taxonomies and ontologies.

That’s what you will learn today in this episode of How AI Is Built. Taxonomies, ontologies, knowledge graphs.

Everyone is talking about them no-one knows how to build them.

But before we look into that, what are they good for in search?

Imagine a large corpus of academic papers. When a user searches for "machine learning in healthcare", the system can:

Recognize "machine learning" as a subcategory of "artificial intelligence"
Identify "healthcare" as a broad field with subfields like "diagnostics" and "patient care"
We can use these to expand the query or narrow it down.
We can return results that include papers on "neural networks for medical imaging" or "predictive analytics in patient outcomes", even if these exact phrases weren't in the search query
We can also filter down and remove papers not tagged with AI that might just mention it in a side not.

So we are building the plumbing, the necessary infrastructure for tagging, categorization, query expansion and relexation, filtering.

So how can we build them?

1️⃣ Start with Industry Standards • Leverage established taxonomies (e.g., Google, GS1, IAB) • Audit them for relevance to your project • Use as a foundation, not a final solution

2️⃣ Customize and Fill Gaps • Adapt industry taxonomies to your specific domain • Create a "coverage model" for your unique needs • Mine internal docs to identify domain-specific concepts

3️⃣ Follow Ontology Best Practices • Use clear, unique primary labels for each concept • Include definitions to avoid ambiguity • Provide context for each taxonomy node

Jessica Talisman:

LinkedIn

Nicolay Gerold:

00:00 Introduction to Taxonomies and Knowledge Graphs 02:03 Building the Foundation: Metadata to Knowledge Graphs 04:35 Industry Taxonomies and Coverage Models 06:32 Clustering and Labeling Techniques 11:00 Evaluating and Maintaining Taxonomies 31:41 Exploring Taxonomy Granularity 32:18 Differentiating Taxonomies for Experts and Users 33:35 Mapping and Equivalency in Taxonomies 34:02 Best Practices and Examples of Taxonomies 40:50 Building Multilingual Taxonomies 44:33 Creative Applications of Taxonomies 48:54 Overrated and Underappreciated Technologies 53:00 The Importance of Human Involvement in AI 53:57 Connecting with the Speaker 55:05 Final Thoughts and Takeaways

#024 How ColPali is Changing Information Retrieval

27 september 2024 | 55 min

#023 The Power of Rerankers in Modern Search

26 september 2024 | 42 min

#022 The Limits of Embeddings, Out-of-Domain Data, Long Context, Finetuning (and How We're Fixing It)

19 september 2024 | 46 min

#021 The Problems You Will Encounter With RAG At Scale And How To Prevent (or fix) Them

12 september 2024 | 50 min

Hey! Welcome back.

Today we look at how we can get our RAG system ready for scale.

We discuss common problems and their solutions, when you introduce more users and more requests to your system.

For this we are joined by Nirant Kasliwal, the author of fastembed.

Nirant shares practical insights on metadata extraction, evaluation strategies, and emerging technologies like Colipali. This episode is a must-listen for anyone looking to level up their RAG implementations.

"Naive RAG has a lot of problems on the retrieval end and then there's a lot of problems on how LLMs look at these data points as well."

"The first 30 to 50% of gains are relatively quick. The rest 50% takes forever."

"You do not want to give the same answer about company's history to the co-founding CEO and the intern who has just joined."

"Embedding similarity is the signal on which you want to build your entire search is just not quite complete."

Key insights:

Naive RAG often fails due to limitations of embeddings and LLMs' sensitivity to input ordering.
Query profiling and expansion:
- Use clustering and tools like latent Scope to identify problematic query types
- Expand queries offline and use parallel searches for better results
Metadata extraction:
- Extract temporal, entity, and other relevant information from queries
- Use LLMs for extraction, with checks against libraries like Stanford NLP
User personalization:
- Include user role, access privileges, and conversation history
- Adapt responses based on user expertise and readability scores
Evaluation and improvement:
- Create synthetic datasets and use real user feedback
- Employ tools like DSPY for prompt engineering
Advanced techniques:
- Query routing based on type and urgency
- Use smaller models (1-3B parameters) for easier iteration and error spotting
- Implement error handling and cross-validation for extracted metadata

Nirant Kasliwal:

Nicolay Gerold:

query understanding, AI-powered search, Lambda Mart, e-commerce ranking, networking, experts, recommendation, search

#020 The Evolution of Search, Finding Search Signals, GenAI Augmented Retrieval

5 september 2024 | 52 min

In this episode of How AI is Built, Nicolay Gerold interviews Doug Turnbull, a search engineer at Reddit and author on “Relevant Search”. They discuss how methods and technologies, including large language models (LLMs) and semantic search, contribute to relevant search results.

Key Highlights:

Defining relevance is challenging and depends heavily on user intent and context
Combining multiple search techniques (keyword, semantic, etc.) in tiers can improve results
LLMs are emerging as a powerful tool for augmenting traditional search approaches
Operational concerns often drive architectural decisions in large-scale search systems
Underappreciated techniques like LambdaMART may see a resurgence

Key Quotes:

"There's not like a perfect measure or definition of what a relevant search result is for a given application. There are a lot of really good proxies, and a lot of really good like things, but you can't just like blindly follow the one objective, if you want to build a good search product." - Doug Turnbull

"I think 10 years ago, what people would do is they would just put everything in Solr, Elasticsearch or whatever, and they would make the query to Elasticsearch pretty complicated to rank what they wanted... What I see people doing more and more these days is that they'll use each retrieval source as like an independent piece of infrastructure." - Doug Turnbull on the evolution of search architecture

"Honestly, I feel like that's a very practical and underappreciated thing. People talk about RAG and I talk, I call this GAR - generative AI augmented retrieval, so you're making search smarter with generative AI." - Doug Turnbull on using LLMs to enhance search

"LambdaMART and gradient boosted decision trees are really powerful, especially for when you're expressing your re-ranking as some kind of structured learning problem... I feel like we'll see that and like you're seeing papers now where people are like finding new ways of making BM25 better." - Doug Turnbull on underappreciated techniques

Doug Turnbull

Nicolay Gerold:

Chapters

00:00 Introduction and Guest Introduction 00:52 Understanding Relevant Search Results 01:18 Search Behavior on Social Media 02:14 Challenges in Defining Relevance 05:12 Query Understanding and Ranking Signals 10:57 Evolution of Search Technologies 15:15 Combining Search Techniques 21:49 Leveraging LLMs and Embeddings 25:49 Operational Considerations in Search Systems 39:09 Concluding Thoughts and Future Directions

#019 Data-driven Search Optimization, Analysing Relevance

30 augusti 2024 | 51 min

In this episode, we talk data-driven search optimizations with Charlie Hull.

Charlie is a search expert from Open Source Connections. He has built Flax, one of the leading open source search companies in the UK, has written “Searching the Enterprise”, and is one of the main voices on data-driven search.

We discuss strategies to improve search systems quantitatively and much more.

Key Points:

Relevance in search is subjective and context-dependent, making it challenging to measure consistently.
Common mistakes in assessing search systems include overemphasizing processing speed and relying solely on user complaints.
Three main methods to measure search system performance:
- Human evaluation
- User interaction data analysis
- AI-assisted judgment (with caution)
Importance of balancing business objectives with user needs when optimizing search results.
Technical components for assessing search systems:
- Query logs analysis
- Source data quality examination
- Test queries and cases setup

Resources mentioned:

Quepid: Open-source tool for search quality testing
Haystack conference: Upcoming event in Berlin (September 30 - October 1)
Relevance Slack community
OpenSource Connections

Charlie Hull:

Nicolay Gerold:

search results, search systems, assessing, evaluation, improvement, data quality, user behavior, proactive, test dataset, search engine optimization, SEO, search quality, metadata, query classification, user intent, search results, metrics, business objectives, user objectives, experimentation, continuous improvement, data modeling, embeddings, machine learning, information retrieval

00:00 Introduction
01:35 Challenges in Measuring Search Relevance
02:19 Common Mistakes in Search System Assessment
03:22 Methods to Measure Search System Performance
04:28 Human Evaluation in Search Systems
05:18 Leveraging User Interaction Data
06:04 Implementing AI for Search Evaluation
09:14 Technical Components for Assessing Search Systems
12:07 Improving Search Quality Through Data Analysis
17:16 Proactive Search System Monitoring
24:26 Balancing Business and User Objectives in Search
25:08 Search Metrics and KPIs: A Contract Between Teams
26:56 The Role of Recency and Popularity in Search Algorithms
28:56 Experimentation: The Key to Optimizing Search
30:57 Offline Search Labs and A/B Testing
34:05 Simple Levers to Improve Search
37:38 Data Modeling and Its Importance in Search
43:29 Combining Keyword and Vector Search
44:24 Bridging the Gap Between Machine Learning and Information Retrieval
47:13 Closing Remarks and Contact Information

#018 Query Understanding: Doing The Work Before The Query Hits The Database

15 augusti 2024 | 53 min

Welcome back to How AI Is Built.

We have got a very special episode to kick off season two.

Daniel Tunkelang is a search consultant currently working with Algolia. He is a leader in the field of information retrieval, recommender systems, and AI-powered search. He worked for Canva, Algolia, Cisco, Gartner, Handshake, to pick a few.

His core focus is query understanding.

**Query understanding is about focusing less on the results and more on the query.** The query of the user is the first-class citizen. It is about figuring out what the user wants and than finding, scoring, and ranking results based on it. So most of the work happens before you hit the database.

**Key Takeaways:**

- The "bag of documents" model for queries and "bag of queries" model for documents are useful approaches for representing queries and documents in search systems.
- Query specificity is an important factor in query understanding. It can be measured using cosine similarity between query vectors and document vectors.
- Query classification into broad categories (e.g., product taxonomy) is a high-leverage technique for improving search relevance and can act as a guardrail for query expansion and relaxation.
- Large Language Models (LLMs) can be useful for search, but simpler techniques like query similarity using embeddings can often solve many problems without the complexity and cost of full LLM implementations.
- Offline processing to enhance document representations (e.g., filling in missing metadata, inferring categories) can significantly improve search quality.

**Daniel Tunkelang**

- [LinkedIn](https://www.linkedin.com/in/dtunkelang/)
- [Medium](https://queryunderstanding.com/)

**Nicolay Gerold:**

- [⁠LinkedIn⁠](https://www.linkedin.com/in/nicolay-gerold/)
- [⁠X (Twitter)](https://twitter.com/nicolaygerold)
- [Substack](https://nicolaygerold.substack.com/)

Query understanding, search relevance, bag of documents, bag of queries, query specificity, query classification, named entity recognition, pre-retrieval processing, caching, large language models (LLMs), embeddings, offline processing, metadata enhancement, FastText, MiniLM, sentence transformers, visualization, precision, recall

[00:00:00] 1. Introduction to Query Understanding

Definition and importance in search systems
Evolution of query understanding techniques

[00:05:30] 2. Query Representation Models

The "bag of documents" model for queries
The "bag of queries" model for documents
Advantages of holistic query representation

[00:12:00] 3. Query Specificity and Classification

Measuring query specificity using cosine similarity
Importance of query classification in search relevance
Implementing and leveraging query classifiers

[00:19:30] 4. Named Entity Recognition in Query Understanding

Role of NER in query processing
Challenges with unique or tail entities

[00:24:00] 5. Pre-Retrieval Query Processing

Importance of early-stage query analysis
Balancing computational resources and impact

[00:28:30] 6. Performance Optimization Techniques

Caching strategies for query understanding
Offline processing for document enhancement

[00:33:00] 7. Advanced Techniques: Embeddings and Language Models

Using embeddings for query similarity
Role of Large Language Models (LLMs) in search
When to use simpler techniques vs. complex models

[00:39:00] 8. Practical Implementation Strategies

Starting points for engineers new to query understanding
Tools and libraries for query understanding (FastText, MiniLM, etc.)
Balancing precision and recall in search systems

[00:44:00] 9. Visualization and Analysis of Query Spaces

Discussion on t-SNE, UMAP, and other visualization techniques
Limitations and alternatives to embedding visualizations

[00:47:00] 10. Future Directions and Closing Thoughts - Emerging trends in query understanding - Key takeaways for search system engineers

[00:53:00] End of Episode

Season 2 Trailer: Mastering Search

8 augusti 2024 | 4 min

Today we are launching the season 2 of How AI Is Built.

The last few weeks, we spoke to a lot of regular listeners and past guests and collected feedback. Analyzed our episode data. And we will be applying the learnings to season 2.

This season will be all about search.

We are trying to make it better, more actionable, and more in-depth. The goal is that at the end of this season, you have a full-fleshed course on search in podcast form, which mini-courses on specific elements like RAG.

We will be talking to experts from information retrieval, information architecture, recommendation systems, and RAG; from academia and industry. Fields that do not really talk to each other.

We will try to unify and transfer the knowledge and give you a full tour of search, so you can build your next search application or feature with confidence.

We will be talking to Charlie Hull on how to systematically improve search systems, with Nils Reimers on the fundamental flaws of embeddings and how to fix them, with Daniel Tunkelang on how to actually understand the queries of the user, and many more.

We will try to bridge the gaps. How to use decades of research and practice in iteratively improving traditional search and apply it to RAG. How to take new methods from recommendation systems and vector databases and bring it into traditional search systems. How to use all of the different methods as search signals and combine them to deliver the results your user actually wants.

We will be using two types of episodes:

Traditional deep dives, like we have done them so far. Each one will dive into one specific topic within search interviewing an expert on that topic.
Supplementary episodes, which answer one additional question; often either complementary or precursory knowledge for the episode, which we did not get to in the deep dive.

We will be starting with episodes next week, looking at the first, last, and overarching action in search: understanding user intent and understanding the queries with Daniel Tunkelang.

I am really excited to kick this off.

I would love to hear from you:

What would you love to learn in this season?
What guest should I have on?
What topics should I make a deep dive on (try to be specific)?

Yeah, let me know in the comments or just slide into my DMs on Twitter or LinkedIn.

I am looking forward to hearing from you guys.

I want to try to be more interactive. So anytime you encounter anything unclear or any question pops up in one of the episode, give me a shout and I will try to answer it to you and to everyone.

Enough of me rambling. Let’s kick this off. I will see you next Thursday, when we start with query understanding.

Shoot me a message and stay up to date:

#017 Unlocking Value from Unstructured Data, Real-World Applications of Generative AI

16 juli 2024 | 36 min

In this episode of "How AI is Built," host Nicolay Gerold interviews Jonathan Yarkoni, founder of Reach Latent. Jonathan shares his expertise in extracting value from unstructured data using AI, discussing challenging projects, the impact of ChatGPT, and the future of generative AI. From weather prediction to legal tech, Jonathan provides valuable insights into the practical applications of AI across various industries.

Key Takeaways

Generative AI projects often require less data cleaning due to the models' tolerance for "dirty" data, allowing for faster implementation in some cases.
The success of AI projects post-delivery is ensured through monitoring, but automatic retraining of generative AI applications is not yet common due to evaluation challenges.
Industries ripe for AI disruption include text-heavy fields like legal, education, software engineering, and marketing, as well as biotech and entertainment.
The adoption of AI is expected to occur in waves, with 2024 likely focusing on internal use cases and 2025 potentially seeing more customer-facing applications as models improve.
Synthetic data generation, using models like GPT-4, can be a valuable approach for training AI systems when real data is scarce or sensitive.
Evaluation frameworks like RAGAS and custom metrics are essential for assessing the quality of synthetic data and AI model outputs.
Jonathan’s ideal tech stack for generative AI projects includes tools like Instructor, Guardrails, Semantic Routing, DSPY, LangChain, and LlamaIndex, with a growing emphasis on evaluation stacks.

Key Quotes

"I think we're going to see another wave in 2024 and another one in 2025. And people are familiarized. That's kind of the wave of 2023. 2024 is probably still going to be a lot of internal use cases because it's a low risk environment and there was a lot of opportunity to be had."

"To really get to production reliably, we have to have these tools evolve further and get more standardized so people can still use the old ways of doing production with the new technology."

Jonathan Yarkoni

Nicolay Gerold:

Chapters

00:00 Introduction: Extracting Value from Unstructured Data
03:16 Flexible Tailoring Solutions to Client Needs
05:39 Monitoring and Retraining Models in the Evolving AI Landscape
09:15 Generative AI: Disrupting Industries and Unlocking New Possibilities
17:47 Balancing Immediate Results and Cutting-Edge Solutions in AI Development
28:29 Dream Tech Stack for Generative AI

unstructured data, textual data, automation, weather prediction, data cleaning, chat GPT, AI disruption, legal, education, software engineering, marketing, biotech, immediate results, cutting-edge solutions, tech stack

#016 Data Processing for AI, Integrating AI into Data Pipelines, Spark

12 juli 2024 | 46 min

#015 Building AI Agents for the Enterprise, Agent Cost Controls, Seamless UX

4 juli 2024 | 35 min

In this episode, Nicolay talks with Rahul Parundekar, founder of AI Hero, about the current state and future of AI agents. Drawing from over a decade of experience working on agent technology at companies like Toyota, Rahul emphasizes the importance of focusing on realistic, bounded use cases rather than chasing full autonomy.

They dive into the key challenges, like effectively capturing expert workflows and decision processes, delivering seamless user experiences that integrate into existing routines, and managing costs through techniques like guardrails and optimized model choices. The conversation also explores potential new paradigms for agent interactions beyond just chat.

Key Takeaways:

Agents need to focus on realistic use cases rather than trying to be fully autonomous. Enterprises are unlikely to allow agents full autonomy anytime soon.
Capturing the logic and workflows in the user's head is the key challenge. Shadowing experts and having them demonstrate workflows is more effective than asking them to document processes.
User experience is crucial - agents must integrate seamlessly into existing user workflows without major disruptions. Interfaces beyond just chat may be needed.
Cost control is important - techniques like guardrails, context windowing, model choice optimization, and dev vs production modes can help manage costs.
New paradigms beyond just chat could be powerful - e.g. workflow specification, state/declarative definition of desired end-state.
Prompt engineering and dynamic prompt improvement based on feedback remain an open challenge.

Key Quotes:

"Empowering users to create their own workflows is essential for effective agent usage."
"Capturing workflows accurately is a significant challenge in agent development."
"Preferences, right? So a lot of the work becomes like, hey, can you do preference learning for this user so that the next time the user doesn't have to enter the same information again, things like that."

Rahul Parundekar:

Nicolay Gerold:

00:00 Exploring the Potential of Autonomous Agents

02:23 Challenges of Accuracy and Repeatability in Agents

08:31 Capturing User Workflows and Improving Prompts

13:37 Tech Stack for Implementing Agents in the Enterprise

agent development, determinism, user experience, agent paradigms, private use, human-agent interaction, user workflows, agent deployment, human-in-the-loop, LLMs, declarative ways, scalability, AI Hero

#014 Building Predictable Agents through Prompting, Compression, and Memory Strategies

27 juni 2024 | 32 min

In this conversation, Nicolay and Richmond Alake discuss various topics related to building AI agents and using MongoDB in the AI space. They cover the use of agents and multi-agents, the challenges of controlling agent behavior, and the importance of prompt compression.

When you are building agents. Build them iteratively. Start with simple LLM calls before moving to multi-agent systems.

Main Takeaways:

Prompt Compression: Using techniques like prompt compression can significantly reduce the cost of running LLM-based applications by reducing the number of tokens sent to the model. This becomes crucial when scaling to production.
Memory Management: Effective memory management is key for building reliable agents. Consider different memory components like long-term memory (knowledge base), short-term memory (conversation history), semantic cache, and operational data (system logs). Store each in separate collections for easy access and reference.
Performance Optimization: Optimize performance across multiple dimensions - output quality (by tuning context and knowledge base), latency (using semantic caching), and scalability (using auto-scaling databases like MongoDB).
Prompting Techniques: Leverage prompting techniques like ReAct (observe, plan, act) and structured prompts (JSON, pseudo-code) to improve agent predictability and output quality.
Experimentation: Continuous experimentation is crucial in this rapidly evolving field. Try different frameworks (LangChain, Crew AI, Haystack), models (Claude, Anthropic, open-source), and techniques to find the best fit for your use case.

Richmond Alake:

Nicolay Gerold:

00:00 Reducing the Scope of AI Agents

01:55 Seamless Data Ingestion

03:20 Challenges and Considerations in Implementing Multi-Agents

06:05 Memory Modeling for Robust Agents with MongoDB

15:05 Performance Optimization in AI Agents

18:19 RAG Setup

AI agents, multi-agents, prompt compression, MongoDB, data storage, data ingestion, performance optimization, tooling, generative AI

Data Integration and Ingestion for AI & LLMs, Architecting Data Flows | changelog 3

25 juni 2024 | 15 min

#013 ETL for LLMs, Integrating and Normalizing Unstructured Data

19 juni 2024 | 37 min

In our latest episode, we sit down with Derek Tu, Founder and CEO of Carbon, a cutting-edge ETL tool designed specifically for large language models (LLMs).

Carbon is streamlining AI development by providing a platform for integrating unstructured data from various sources, enabling businesses to build innovative AI applications more efficiently while addressing data privacy and ethical concerns.

"I think people are trying to optimize around the chunking strategy... But for me, that seems a bit maybe not focusing on the right area of optimization. These embedding models themselves have gone just like, so much more advanced over the past five to 10 years that regardless of what representation you're passing in, they do a pretty good job of being able to understand that information semantically and returning the relevant chunks." - Derek Tu on the importance of embedding models over chunking strategies
"If you are cost conscious and if you're worried about performance, I would definitely look at quantizing your embeddings. I think we've probably been able to, I don't have like the exact numbers here, but I think we might be saving at least half, right, in storage costs by quantizing everything." - Derek Tu on optimizing costs and performance with vector databases

Derek Tu:

Nicolay Gerold:

Key Takeaways:

Understand your data sources: Before building your ETL pipeline, thoroughly assess the various data sources you'll be working with, such as Slack, Email, Google Docs, and more. Consider the unique characteristics of each source, including data format, structure, and metadata.
Normalize and preprocess data: Develop strategies to normalize and preprocess the unstructured data from different sources. This may involve parsing, cleaning, and transforming the data into a standardized format that can be easily consumed by your AI models.
Experiment with chunking strategies: While there's no one-size-fits-all approach to chunking, it's essential to experiment with different strategies to find what works best for your specific use case. Consider factors like data format, structure, and the desired granularity of the chunks.
Leverage metadata and tagging: Metadata and tagging can play a crucial role in organizing and retrieving relevant data for your AI models. Implement mechanisms to capture and store important metadata, such as document types, topics, and timestamps, and consider using AI-powered tagging to automatically categorize your data.
Choose the right embedding model: Embedding models have advanced significantly in recent years, so focus on selecting the right model for your needs rather than over-optimizing chunking strategies. Consider factors like model performance, dimensionality, and compatibility with your data types.
Optimize vector database usage: When working with vector databases, consider techniques like quantization to reduce storage costs and improve performance. Experiment with different configurations and settings to find the optimal balance for your specific use case.

00:00 Introduction and Optimizing Embedding Models

03:00 The Evolution of Carbon and Focus on Unstructured Data

06:19 Customer Progression and Target Group

09:43 Interesting Use Cases and Handling Different Data Representations

13:30 Chunking Strategies and Normalization

20:14 Approach to Chunking and Choosing a Vector Database

23:06 Tech Stack and Recommended Tools

28:19 Future of Carbon: Multimodal Models and Building a Platform

Carbon, LLMs, RAG, chunking, data processing, global customer base, GDPR compliance, AI founders, AI agents, enterprises

#012 Serverless Data Orchestration, AI in the Data Stack, AI Pipelines

14 juni 2024 | 28 min

In this episode, Nicolay sits down with Hugo Lu, founder and CEO of Orchestra, a modern data orchestration platform. As data pipelines and analytics workflows become increasingly complex, spanning multiple teams, tools and cloud services, the need for unified orchestration and visibility has never been greater.

Orchestra is a serverless data orchestration tool that aims to provide a unified control plane for managing data pipelines, infrastructure, and analytics across an organization's modern data stack.

The core architecture involves users building pipelines as code which then run on Orchestra's serverless infrastructure. It can orchestrate tasks like data ingestion, transformation, AI calls, as well as monitoring and getting analytics on data products. All with end-to-end visibility, data lineage and governance even when organizations have a scattered, modular data architecture across teams and tools.

Key Quotes:

Find the right level of abstraction when building data orchestration tasks/workflows. "I think the right level of abstraction is always good. I think like Prefect do this really well, right? Their big sell was, just put a decorator on a function and it becomes a task. That is a great idea. You know, just make tasks modular and have them do all the boilerplate stuff like error logging, monitoring of data, all of that stuff.”
Modularize data pipeline components: "It's just around understanding what that dev workflow should look like. I think it should be a bit more modular." Having a modular architecture where different components like data ingestion, transformation, model training are decoupled allows better flexibility and scalability.
Adopt a streaming/event-driven architecture for low-latency AI use cases: "If you've got an event-driven architecture, then, you know, that's not what you use an orchestration tool for...if you're having a conversation with a chatbot, like, you know, you're sending messages, you're sending events, you're getting a response back. That I would argue should be dealt with by microservices."

Hugo Lu:

Nicolay Gerold:

00:00 Introduction to Orchestra and its Focus on Data Products

08:03 Unified Control Plane for Data Stack and End-to-End Control

14:42 Use Cases and Unique Applications of Orchestra

19:31 Retaining Existing Dev Workflows and Best Practices in Orchestra

22:23 Event-Driven Architectures and Monitoring in Orchestra

23:49 Putting Data Products First and Monitoring Health and Usage

25:40 The Future of Data Orchestration: Stream-Based and Cost-Effective

data orchestration, Orchestra, serverless architecture, versatility, use cases, maturity levels, challenges, AI workloads

#011 Mastering Vector Databases, Product & Binary Quantization, Multi-Vector Search

7 juni 2024 | 40 min

Ever wondered how AI systems handle images and videos, or how they make lightning-fast recommendations? Tune in as Nicolay chats with Zain Hassan, an expert in vector databases from Weaviate. They break down complex topics like quantization, multi-vector search, and the potential of multimodal search, making them accessible for all listeners. Zain even shares a sneak peek into the future, where vector databases might connect our brains with computers!

Zain Hasan:

Nicolay Gerold:

Key Insights:

Vector databases can handle not just text, but also image, audio, and video data
Quantization is a powerful technique to significantly reduce costs and enable in-memory search
Binary quantization allows efficient brute force search for smaller datasets
Multi-vector search enables retrieval of heterogeneous data types within the same index
The future lies in multimodal search and recommendations across different senses
Brain-computer interfaces and EEG foundation models are exciting areas to watch

Key Quotes:

"Vector databases are pretty much the commercialization and the productization of representation learning."
"I think quantization, it builds on the assumption that there is still noise in the embeddings. And if I'm looking, it's pretty similar as well to the thought of Matryoshka embeddings that I can reduce the dimensionality."
"Going from text to multimedia in vector databases is really simple."
"Vector databases allow you to take all the advances that are happening in machine learning and now just simply turn a switch and use them for your application."

Chapters

00:00 - 01:24 Introduction

01:24 - 03:48 Underappreciated aspects of vector databases

03:48 - 06:06 Quantization trade-offs and techniques

Various quantization techniques: binary quantization, product quantization, scalar quantization

06:06 - 08:24 Binary quantization

Reducing vectors from 32-bits per dimension down to 1-bit
Enables efficient in-memory brute force search for smaller datasets
Requires normally distributed data between negative and positive values

08:24 - 10:44 Product quantization and other techniques

Alternative to binary quantization, segments vectors and clusters each segment
Scalar quantization reduces vectors to 8-bits per dimension

10:44 - 13:08 Quantization as a "superpower" to reduce costs

13:08 - 15:34 Comparing quantization approaches

15:34 - 17:51 Placing vector databases in the database landscape

17:51 - 20:12 Pruning unused vectors and nodes

20:12 - 22:37 Improving precision beyond similarity thresholds

22:37 - 25:03 Multi-vector search

25:03 - 27:11 Impact of vector databases on data interaction

27:11 - 29:35 Interesting and weird use cases

29:35 - 32:00 Future of multimodal search and recommendations

32:00 - 34:22 Extending recommendations to user data

34:22 - 36:39 What's next for Weaviate

36:39 - 38:57 Exciting technologies beyond vector databases and LLMs

vector databases, quantization, hybrid search, multi-vector support, representation learning, cost reduction, memory optimization, multimodal recommender systems, brain-computer interfaces, weather prediction models, AI applications

#010 Building Robust AI and Data Systems, Data Architecture, Data Quality, Data Storage

31 maj 2024 | 46 min

In this episode of "How AI is Built", data architect Anjan Banerjee provides an in-depth look at the world of data architecture and building complex AI and data systems. Anjan breaks down the basics using simple analogies, explaining how data architecture involves sorting, cleaning, and painting a picture with data, much like organizing Lego bricks to build a structure.

Summary by Section

Introduction

Anjan Banerjee, a data architect, discusses building complex AI and data systems
Explains the basics of data architecture using Lego and chat app examples

Sources and Tools

Identifying data sources is the first step in designing a data architecture
Pick the right tools to extract data based on use cases (block storage for images, time series DB, etc.)
Use one tool for most activities if possible, but specialized tools offer benefits
Multi-modal storage engines are gaining popularity (Snowflake, Databricks, BigQuery)

Airflow and Orchestration

Airflow is versatile but has a learning curve; good for orgs with Python/data engineering skills
For less technical orgs, GUI-based tools like Talend, Alteryx may be better
AWS Step Functions and managed Airflow are improving native orchestration capabilities
For multi-cloud, prefer platform-agnostic tools like Astronomer, Prefect, Airbyte

AI and Data Processing

ML is key for data-intensive use cases to avoid storing/processing petabytes in cloud
TinyML and edge computing enable ML inference on device (drones, manufacturing)
Cloud batch processing still dominates for user targeting, recommendations

Data Lakes and Storage

Storage choice depends on data types, use cases, cloud ecosystem
Delta Lake excels at data versioning and consistency; Iceberg at partitioning and metadata
Pulling data into separate system often needed for advanced analytics beyond source system

Data Quality and Standardization

"Poka-yoke" error-proofing of input screens is vital for downstream data quality
Impose data quality rules and unified schemas (e.g. UTC timestamps) during ingestion
Complexity arises with multi-region compliance (GDPR, CCPA) requiring encryption, sanitization

Hot Takes and Wishes

Snowflake is overhyped; great UX but costly at scale. Databricks is preferred.
Automated data set joining and entity resolution across systems would be a game-changer

Anjan Banerjee:

LinkedIn

Nicolay Gerold:

00:00 Understanding Data Architecture

12:36 Choosing the Right Tools

20:36 The Benefits of Serverless Functions

21:34 Integrating AI in Data Acquisition

24:31 The Trend Towards Single Node Engines

26:51 Choosing the Right Database Management System and Storage

29:45 Adding Additional Storage Components

32:35 Reducing Human Errors for Better Data Quality

39:07 Overhyped and Underutilized Tools

Data architecture, AI, data systems, data sources, data extraction, data storage, multi-modal storage engines, data orchestration, Airflow, edge computing, batch processing, data lakes, Delta Lake, Iceberg, data quality, standardization, poka-yoke, compliance, entity resolution

#009 Modern Data Infrastructure for Analytics and AI, Lakehouses, Open Source Data Stack

24 maj 2024 | 28 min

Jorrit Sandbrink, a data engineer specializing on open table formats, discusses the advantages of decoupling storage and compute, the importance of choosing the right table format, and strategies for optimizing your data pipelines. This episode is full of practical advice for anyone looking to build a high-performance data analytics platform.

Lake house architecture: A blend of data warehouse and data lake, addressing their shortcomings and providing a unified platform for diverse workloads.
Key components and decisions: Storage options (cloud or on-prem), table formats (Delta Lake, Iceberg, Apache Hoodie), and query engines (Apache Spark, Polars).
Optimizations: Partitioning strategies, file size considerations, and auto-optimization tools for efficient data layout and query performance.
Orchestration tools: Airflow, Dagster, Prefect, and their roles in triggering and managing data pipelines.
Data ingress with DLT: An open-source Python library for building data pipelines, focusing on efficient data extraction and loading.

Key Takeaways:

Lake houses offer a powerful and flexible architecture for modern data analytics.
Open-source solutions provide cost-effective and customizable alternatives.
Carefully consider your specific use cases and preferences when choosing tools and components.
Tools like DLT simplify data ingress and can be easily integrated with serverless functions.
The data landscape is constantly evolving, so staying informed about new tools and trends is crucial.

Sound Bites

"The Lake house is sort of a modular setup where you decouple the storage and the compute." "A lake house is an architecture, an architecture for data analytics platforms." "The most popular table formats for a lake house are Delta, Iceberg, and Apache Hoodie."

Jorrit Sandbrink:

Nicolay Gerold:

Chapters

00:00 Introduction to the Lake House Architecture

03:59 Choosing Storage and Table Formats

06:19 Comparing Compute Engines

21:37 Simplifying Data Ingress

25:01 Building a Preferred Data Stack

lake house, data analytics, architecture, storage, table format, query execution engine, document store, DuckDB, Polars, orchestration, Airflow, Dexter, DLT, data ingress, data processing, data storage

#008 Knowledge Graphs for Better RAG, Virtual Entities, Hybrid Data Models

20 maj 2024 | 37 min

Kirk Marple, CEO and founder of Graphlit, discusses the evolution of his company from a data cataloging tool to an platform designed for ETL (Extract, Transform, Load) and knowledge retrieval for Large Language Models (LLMs). Graphlit empowers users to build custom applications on top of its API that go beyond naive RAG.

Key Points:

Knowledge Graphs: Graphlet utilizes knowledge graphs as a filtering layer on top of keyword metadata and vector search, aiding in information retrieval.
Storage for KGs: A single piece of content in their data model resides across multiple systems: a document store with JSON, a graph node, and a search index. This hybrid approach creates a virtual entity with representations in different databases.
Entity Extraction: Azure Cognitive Services and other models are employed to extract entities from text for improved understanding.
Metadata-first approach: The metadata-first strategy involves extracting comprehensive metadata from various sources, ensuring it is canonicalized and filterable. This approach aids in better indexing and retrieval of data, crucial for effective RAG.
Challenges: Entity resolution and deduplication remain significant challenges in knowledge graph development.

Notable Quotes:

"Knowledge graphs is a filtering [mechanism]...but then I think also the kind of spidering and pulling extra content in is the other place this comes into play."
"Knowledge graphs to me are kind of like index per se...you're providing a new type of index on top of that."
"[For RAG]...you have to find constraints to make it workable."
"Entity resolution, deduping, I think is probably the number one thing."
"I've essentially built a connector infrastructure that would be like a FiveTran or something that Airflow would have..."
"One of the reasons is because we're a platform as a service, the burstability of it is really important. We can spin up to a hundred instances without any problem, and we don't have to think about it."
"Once cost and performance become a no-brainer, we're going to start seeing LLMs be more of a compute tool. I think that would be a game-changer for how applications are built in the future."

Kirk Marple:

Nicolay Gerold:

Chapters

00:00 Graphlit’s Hybrid Approach 02:23 Use Cases and Transition to Graphlit 04:19 Knowledge Graphs as a Filtering Mechanism 13:23 Using Gremlin for Querying the Graph 32:36 XML in Prompts for Better Segmentation 35:04 The Future of LLMs and Graphlit 36:25 Getting Started with Graphlit

Graphlit, knowledge graphs, AI, document store, graph database, search index co-pilot, entity extraction, Azure Cognitive Services, XML, event-driven architecture, serverless architecture graph rag, developer portal

#007 Navigating the Modern Data Stack, Choosing the Right OSS Tools, From Problem to Requirements to Architecture

17 maj 2024 | 38 min

From Problem to Requirements to Architecture.

In this episode, Nicolay Gerold and Jon Erich Kemi Warghed discuss the landscape of data engineering, sharing insights on selecting the right tools, implementing effective data governance, and leveraging powerful concepts like software-defined assets. They discuss the challenges of keeping up with the ever-evolving tech landscape and offer practical advice for building sustainable data platforms. Tune in to discover how to simplify complex data pipelines, unlock the power of orchestration tools, and ultimately create more value from your data.

"Don't overcomplicate what you're actually doing."
"Getting your basic programming software development skills down is super important to becoming a good data engineer."
"Who has time to learn 500 new tools? It's like, this is not humanly possible anymore."

Key Takeaways:

Data Governance: Data governance is about transparency and understanding the data you have. It's crucial for organizations as they scale and data becomes more complex. Tools like dbt and Dagster can help achieve this.
Open Source Tooling: When choosing open source tools, assess their backing, commit frequency, community support, and ease of use.
Agile Data Platforms: Focus on the capabilities you want to enable and prioritize solving the core problems of your data engineers and analysts.
Software Defined Assets: This concept, exemplified by Dagster, shifts the focus from how data is processed to what data should exist. This change in mindset can greatly simplify data orchestration and management.
The Importance of Fundamentals: Strong programming and software development skills are crucial for data engineers, and understanding the basics of data management and orchestration is essential for success.
The Importance of Versioning Data: Data has to be versioned so you can easily track changes, revert to previous states if needed, and ensure reproducibility in your data pipelines. lakeFS applies the concepts of Git to your data lake. This gives you the ability to create branches for different development environments, commit changes to specific versions, and merge branches together once changes have been tested and validated.

Jon Erik Kemi Warghed:

LinkedIn

Nicolay Gerold:

Chapters

00:00 The Problem with the Modern Data Stack: Too many tools and buzzwords

00:57 How to Choose the Right Tools: Considerations for startups and large companies

03:13 Evaluating Open Source Tools: Background checks and due diligence

07:52 Defining Data Governance: Transparency and understanding of data

10:15 The Importance of Data Governance: Challenges and solutions

12:21 Data Governance Tools: dbt and Dagster

17:05 The Impact of Dagster: Software-defined assets and declarative thinking

19:31 The Power of Software Defined Assets: How Dagster differs from Airflow and Mage

21:52 State Management and Orchestration in Dagster: Real-time updates and dependency management

26:24 Why Use Orchestration Tools?: The role of orchestration in complex data pipelines

28:47 The Importance of Tool Selection: Thinking about long-term sustainability

31:10 When to Adopt Orchestration: Identifying the need for orchestration tools

#006 Data Orchestration Tools, Choosing the right one for your needs

10 maj 2024 | 33 min

In this episode, Nicolay Gerold interviews John Wessel, the founder of Agreeable Data, about data orchestration. They discuss the evolution of data orchestration tools, the popularity of Apache Airflow, the crowded market of orchestration tools, and the key problem that orchestrators solve. They also explore the components of a data orchestrator, the role of AI in data orchestration, and how to choose the right orchestrator for a project. They touch on the challenges of managing orchestrators, the importance of monitoring and optimization, and the need for product people to be more involved in the orchestration space. They also discuss data residency considerations and the future of orchestration tools.

Sound Bites

"The modern era, definitely airflow. Took the market share, a lot of people running it themselves." "It's like people are launching new orchestrators every day. This is a funny one. This was like two weeks ago, somebody launched an orchestrator that was like a meta-orchestrator." "The DAG introduced two other components. It's directed acyclic graph is what DAG means, but direct is like there's a start and there's a finish and the acyclic is there's no loops."

Key Topics

The evolution of data orchestration: From basic task scheduling to complex DAG-based solutions
What is a data orchestrator and when do you need one? Understanding the role of orchestrators in handling complex dependencies and scaling data pipelines.
The crowded market: A look at popular options like Airflow, Daxter, Prefect, and more.
Best practices: Choosing the right tool, prioritizing serverless solutions when possible, and focusing on solving the use case before implementing complex tools.
Data residency and GDPR: How regulations influence tool selection, especially in Europe.
Future of the field: The need for consolidation and finding the right balance between features and usability.

John Wessel:

Nicolay Gerold:

Data orchestration, data movement, Apache Airflow, orchestrator selection, DAG, AI in orchestration, serverless, Kubernetes, infrastructure as code, monitoring, optimization, data residency, product involvement, generative AI.

Chapters

00:00 Introduction and Overview

00:34 The Evolution of Data Orchestration Tools

04:54 Components and Flow of Data in Orchestrators

08:24 Deployment Options: Serverless vs. Kubernetes

11:14 Considerations for Data Residency and Security

13:02 The Need for a Clear Winner in the Orchestration Space

20:47 Optimization Techniques for Memory and Time-Limited Issues

23:09 Integrating Orchestrators with Infrastructure-as-Code

24:33 Bridging the Gap Between Data and Engineering Practices

27:2 2Exciting Technologies Outside of Data Orchestration

30:09 The Feature of Dagster

#005 Building Reliable LLM Applications, Production-Ready RAG, Data-Driven Evals

3 maj 2024 | 30 min

In this episode of "How AI is Built", we learn how to build and evaluate real-world language model applications with Shahul and Jithin, creators of Ragas. Ragas is a powerful open-source library that helps developers test, evaluate, and fine-tune Retrieval Augmented Generation (RAG) applications, streamlining their path to production readiness.

Main Insights

Challenges of Open-Source Models: Open-source large language models (LLMs) can be powerful tools, but require significant post-training optimization for specific use cases.
Evaluation Before Deployment: Thorough testing and evaluation are key to preventing unexpected behaviors and hallucinations in deployed RAGs. Ragas offers metrics and synthetic data generation to support this process.
Data is Key: The quality and distribution of data used to train and evaluate LLMs dramatically impact their performance. Ragas is enabling novel synthetic data generation techniques to make this process more effective and cost-efficient.
RAG Evolution: Techniques for improving RAGs are continuously evolving. Developers must be prepared to experiment and keep up with the latest advancements in chunk embedding, query transformation, and model alignment.

Practical Takeaways

Start with a solid testing strategy: Before launching, define the quality metrics aligned with your RAG's purpose. Ragas helps in this process.
Embrace synthetic data: Manually creating test data sets is time-consuming. Tools within Ragas help automate the creation of synthetic data to mirror real-world use cases.
RAGs are iterative: Be prepared for continuous improvement as better techniques and models emerge.

Interesting Quotes

"...models are very stochastic and grading it directly would rather trigger them to give some random number..." - Shahul, on the dangers of naive model evaluation.
"Reducing the developer time in acquiring these test data sets by 90%." - Shahul, on the efficiency gains of Ragas' synthetic data generation.
"We want to ensure maximum diversity..." - Shahul, on creating realistic and challenging test data for RAG evaluation.

Ragas:

Jithin James:

LinkedIn

Shahul ES:

Nicolay Gerold:

00:00 Introduction

02:03 Introduction to Open Assistant project

04:05 Creating Customizable and Fine-Tunable Models

06:07 Ragas and the LLM Use Case

08:09 Introduction to Language Model Metrics (LLMs)

11:12 Reducing the Cost of Data Generation

13:19 Evaluation of Components at Melvess

15:40 Combining Ragas Metrics with AutoML Providers

20:08 Improving Performance with Fine-tuning and Reranking

22:56 End-to-End Metrics and Component-Specific Metrics

25:14 The Importance of Deep Knowledge and Understanding

25:53 Robustness vs Optimization

26:32 Challenges of Evaluating Models

27:18 Creating a Dream Tech Stack

27:47 The Future Roadmap for Ragas

28:02 Doubling Down on Grid Data Generation

28:12 Open-Source Models and Expanded Support

28:20 More Metrics for Different Applications

RAG, Ragas, LLM, Evaluation, Synthetic Data, Open-Source, Language Model Applications, Testing.

Lance v2: Rethinking Columnar Storage for Faster Lookups, Nulls, and Flexible Encodings | changelog 2

29 april 2024 | 22 min

In this episode of Changelog, Weston Pace dives into the latest updates to LanceDB, an open-source vector database and file format. Lance's new V2 file format redefines the traditional notion of columnar storage, allowing for more efficient handling of large multimodal datasets like images and embeddings. Weston discusses the goals driving LanceDB's development, including null value support, multimodal data handling, and finding an optimal balance for search performance.

Sound Bites

"A little bit more power to actually just try." "We're becoming a little bit more feature complete with returns of arrow." "Weird data representations that are actually really optimized for your use case."

Key Points

Weston introduces LanceDB, an open-source multimodal vector database and file format.
The goals behind LanceDB's design: handling null values, multimodal data, and finding the right balance between point lookups and full dataset scan performance.
Lance V2 File Format:
Potential Use Cases

Conversation Highlights

On the benefits of Arrow integration: Strengthening the connection with the Arrow data ecosystem for seamless data handling.
Why "columnar container format"?: A broader definition than "table format" to encompass more unconventional use cases.
Tackling multimodal data: How LanceDB V2 enables storage of large multimodal data efficiently and without needing tons of memory.
Python's role in encoding experimentation: Providing a way to rapidly prototype custom encodings and plug them into LanceDB.

LanceDB:

Weston Pace:

Nicolay Gerold:

Chapters

00:00 Introducing Lance: A New File Format

06:46 Enabling Custom Encodings in Lance

11:51 Exploring the Relationship Between Lance and Arrow

20:04 New Chapter

Lance file format, nulls, round-tripping data, optimized data representations, full-text search, encodings, downsides, multimodal data, compression, point lookups, full scan performance, non-contiguous columns, custom encodings

#004 AI with Supabase, Postgres Configuration, Real-Time Processing, and more

26 april 2024 | 32 min

#003 AI Inside Your Database, Real-Time AI, Declarative ML/AI

19 april 2024 | 36 min

Supabase acquires OrioleDB, A New Database Engine for PostgreSQL | changelog 1

17 april 2024 | 14 min

#002 AI Powered Data Transformation, Combining gen & trad AI, Semantic Validation

12 april 2024 | 37 min

Today’s guest is Antonio Bustamante, a serial entrepreneur who previously built Kite and Silo and is now working to fix bad data. He is building bem, the data tool to transform any data into the schema your AI and software needs.

bem.ai is a data tool that focuses on transforming any data into the schema needed for AI and software. It acts as a system's interoperability layer, allowing systems that couldn't communicate before to exchange information. Learn what place LLMs play in data transformation, how to build reliable data infrastructure and more.

"Surprisingly, the hardest was semi-structured data. That is data that should be structured, but is unreliable, undocumented, hard to work with."

"We were spending close to four or five million dollars a year just in integrations, which is no small budget for a company that size. So I was pretty much determined to fix this problem once and for all."

"bem focuses on being the system's interoperability layer."

"We basically take in anything you send us, we transform it exactly into your internal data schema so that you don't have to parse, process, transform anything of that sort."

"LLMs are a 30% of it... A lot of it is very, very like thorough validation layers, great infrastructure, just ensuring reliability and connection to our user systems.”

"You can use a million token context window and feed an entire document to an LLM. I can guarantee you if you don't, semantically chunk it out before you're not going to get the right results.”

"We're obsessed with time to value... Our milestone is basically five minute onboarding max, and then you're ready to go."

Antonio Bustamante

bem.ai

Nicolay Gerold:

Semi-structured data, Data integrations, Large language models (LLMs), Data transformation, Schema interoperability, Fault tolerance, Validation layers, System reliability, Schema evolution, Enterprise software, Data pipelines.

Chapters

00:00 The Problem of Integrations

05:58 Building Fault Tolerant Systems

13:51 Versioning and Semantic Validation

27:33 BEM in the Data Ecosystem

34:40 Future Plans and Onboarding

#001 Multimodal AI, Storing 1 Billion Vectors, Building Data Infrastructure at LanceDB

5 april 2024 | 34 min

How AI Is Built

Real engineers.

Om podden

Avsnitt

#048 Why Your AI Agents Need Permission to Act, Not Just Read

#047 Architecting Information for Search, Humans, and Artificial Intelligence

#046 Building a Search Database From First Principles

#045 RAG As Two Things - Prompt Engineering and Search

#044 Graphs Aren't Just For Specialists Anymore

#043 Knowledge Graphs Won't Fix Bad Data

#042 Temporal RAG, Embracing Time for Smarter, Reliable Knowledge Graphs

#041 Context Engineering, How Knowledge Graphs Help LLMs Reason

#040 Vector Database Quantization, Product, Binary, and Scalar

#039 Local-First Search, How to Push Search To End-Devices

#038 AI-Powered Search, Context Is King, But Your RAG System Ignores Two-Thirds of It

#037 Chunking for RAG: Stop Breaking Your Documents Into Meaningless Pieces

#036 How AI Can Start Teaching Itself - Synthetic Data Deep Dive

#035 A Search System That Learns As You Use It (Agentic RAG)

#034 Rethinking Search Inside Postgres, From Lexemes to BM25

#033 RAG's Biggest Problems & How to Fix It (ft. Synthetic Data)

#032 Improving Documentation Quality for RAG Systems

#031 BM25 As The Workhorse Of Search; Vectors Are Its Visionary Cousin

#030 Vector Search at Scale, Why One Size Doesn't Fit All

#029 Search Systems at Scale, Avoiding Local Maxima and Other Engineering Lessons

#028 Training Multi-Modal AI, Inside the Jina CLIP Embedding Model

#027 Building the database for AI, Multi-modal AI, Multi-modal Storage

#026 Embedding Numbers, Categories, Locations, Images, Text, and The World

#025 Data Models to Remove Ambiguity from AI and Search

#024 How ColPali is Changing Information Retrieval

#023 The Power of Rerankers in Modern Search

#022 The Limits of Embeddings, Out-of-Domain Data, Long Context, Finetuning (and How We're Fixing It)

#021 The Problems You Will Encounter With RAG At Scale And How To Prevent (or fix) Them

#020 The Evolution of Search, Finding Search Signals, GenAI Augmented Retrieval

#019 Data-driven Search Optimization, Analysing Relevance

#018 Query Understanding: Doing The Work Before The Query Hits The Database

Season 2 Trailer: Mastering Search

#017 Unlocking Value from Unstructured Data, Real-World Applications of Generative AI

#016 Data Processing for AI, Integrating AI into Data Pipelines, Spark

#015 Building AI Agents for the Enterprise, Agent Cost Controls, Seamless UX

#014 Building Predictable Agents through Prompting, Compression, and Memory Strategies

Data Integration and Ingestion for AI & LLMs, Architecting Data Flows | changelog 3

#013 ETL for LLMs, Integrating and Normalizing Unstructured Data

#012 Serverless Data Orchestration, AI in the Data Stack, AI Pipelines

#011 Mastering Vector Databases, Product & Binary Quantization, Multi-Vector Search

#010 Building Robust AI and Data Systems, Data Architecture, Data Quality, Data Storage

#009 Modern Data Infrastructure for Analytics and AI, Lakehouses, Open Source Data Stack

#008 Knowledge Graphs for Better RAG, Virtual Entities, Hybrid Data Models

#007 Navigating the Modern Data Stack, Choosing the Right OSS Tools, From Problem to Requirements to Architecture

#006 Data Orchestration Tools, Choosing the right one for your needs

#005 Building Reliable LLM Applications, Production-Ready RAG, Data-Driven Evals

Lance v2: Rethinking Columnar Storage for Faster Lookups, Nulls, and Flexible Encodings | changelog 2

#004 AI with Supabase, Postgres Configuration, Real-Time Processing, and more

#003 AI Inside Your Database, Real-Time AI, Declarative ML/AI

Supabase acquires OrioleDB, A New Database Engine for PostgreSQL | changelog 1

#002 AI Powered Data Transformation, Combining gen & trad AI, Semantic Validation

#001 Multimodal AI, Storing 1 Billion Vectors, Building Data Infrastructure at LanceDB