Running Generative AI Models In Production

58 min • 28 October 2024

Summary
In this episode, Philip Kiely from Baseten talks about the intricacies of running open models in production. Philip shares his journey into AI and ML engineering, highlighting the importance of understanding product-level requirements and selecting the right model for deployment. The conversation covers the operational aspects of deploying AI models, including model evaluation, compound AI, and model serving frameworks such as TensorFlow Serving and AWS SageMaker. Philip also discusses the challenges of model quantization, rapid model evolution, and monitoring and observability in AI systems, and looks ahead to future trends in AI, including local inference and the competition between open source and proprietary models.

Announcements

Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems.
Your host is Tobias Macey, and today I'm interviewing Philip Kiely about running open models in production.

Interview

Introduction
How did you get involved in machine learning?
Can you start by giving an overview of the major decisions to be made when planning the deployment of a generative AI model?
How does the model selected in the beginning of the process influence the downstream choices?
In terms of application architecture, the major patterns that I've seen are RAG, fine-tuning, multi-agent, or large model. What are the most common methods that you see? (and any that I failed to mention)
- How has the rapid succession of model generations impacted the ways that teams think about their overall application? (capabilities, features, architecture, etc.)
In terms of model serving, I know that Baseten created Truss. What are some of the other notable options that teams are building with?
- What is the role of the serving framework in the context of the application?
There are also a large number of inference engines that have been released. What are the major players in that arena?
- What are the features and capabilities that they are each basing their competitive advantage on?
For someone who is new to AI Engineering, what are some heuristics that you would recommend when choosing an inference engine?
Once a model (or set of models) is in production and serving traffic, it's necessary to have visibility into how it is performing. What are the key metrics that are necessary to monitor for generative AI systems?
- In the event that one (or more) metrics are trending negatively, what are the levers that teams can pull to improve them?
When running models constructed with e.g. linear regression or deep learning there was a common issue with "concept drift". How does that manifest in the context of large language models, particularly when coupled with performance optimization?
What are the most interesting, innovative, or unexpected ways that you have seen teams manage the serving of open gen AI models?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with generative AI model serving?
When is Baseten the wrong choice?
What are the future trends and technology investments that you are focused on in the space of AI model serving?

Contact Info

Parting Question

From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Links

Baseten
- Podcast Episode
Copyleft
Llama Models
Nomic
Olmo
Allen Institute for AI
Playground 2
The Peace Dividend Of The SaaS Wars
Vercel
Netlify
RAG == Retrieval Augmented Generation
- Podcast Episode
Compound AI
Langchain
Outlines: structured output for AI systems
Truss
Chains
Llamaindex
Ray
MLFlow
Cog (Replicate): containers for ML
BentoML
Django
WSGI
uWSGI
Gunicorn
Zapier
vLLM
TensorRT-LLM
TensorRT
Quantization
LoRA == Low Rank Adaptation of Large Language Models
Pruning
Distillation
Grafana
Speculative Decoding
Groq
Runpod
Lambda Labs

The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
