283 episodes • Length: 25 min • Monthly
Artificial Intelligence, algorithms and tech tales that are shaping the world. Hype not included.
The podcast Data Science at Home is created by Francesco Gadaleta. The podcast and its artwork are embedded on this page via the public podcast feed (RSS).
🎙️ In this episode of Data Science at Home, we sit down with Kenny Vaneetvelde, the mastermind behind Atomic Agents, a groundbreaking framework redefining AI development.
🔍 Discover how atomicity simplifies complex AI systems, why modularity matters more than ever, and how Atomic Agents is eliminating hidden assumptions and redundant complexity in AI workflows.
💡 From real-world applications to the tech stack behind the framework, Kenny takes us on a deep dive into this lightweight, powerful tool for creating consistent and brand-aligned AI.
🔥 Whether you’re a seasoned developer or just AI-curious, this conversation is packed with insights you don’t want to miss.
📌 Timestamps:
0:00 - Intro
2:30 - Kenny’s journey in AI
5:00 - What are Atomic Agents?
10:45 - Why atomicity matters in AI
18:20 - The tech behind Atomic Agents: Instructor, Pydantic & more
25:00 - Real-world use cases and future vision
40:45 - Advice for AI developers and businesses
📲 Check out Atomic Agents on GitHub: https://github.com/BrainBlend-AI/atomic-agents
https://brainblendai.com/
🔗 Follow Kenny on LinkedIn: https://www.linkedin.com/in/kennyvaneetvelde/
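As a taste of the stack discussed at 18:20, here is an illustrative sketch of the schema-first approach using Instructor and Pydantic. To be clear, this is not the Atomic Agents API itself; the model name, prompt, and schema below are placeholders.

```python
# Illustrative sketch of schema-first LLM I/O with Instructor + Pydantic
# (not the Atomic Agents API; model and schema are placeholders).
# Assumes `pip install instructor openai pydantic` and an OPENAI_API_KEY.
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class ArticleSummary(BaseModel):
    """An explicit output schema: no hidden assumptions about response shape."""
    title: str = Field(description="Short title for the summary")
    key_points: list[str] = Field(description="Three to five bullet points")

client = instructor.from_openai(OpenAI())  # API per current Instructor docs

summary = client.chat.completions.create(
    model="gpt-4o-mini",                 # placeholder model name
    response_model=ArticleSummary,       # Instructor validates (and retries) against this
    messages=[{"role": "user", "content": "Summarize: LLM agents need typed I/O."}],
)
print(summary.title, summary.key_points)
```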
💬 Got questions? Drop them in the comments! Don’t forget to like, subscribe, and share this episode with your fellow AI enthusiasts.
#AI #AtomicAgents #DataScience #Podcast
AI shouldn’t be limited to those with access to expensive hardware. This episode explores how to break down barriers by running massive AI models on "crappy machines"—affordable, low-spec devices. Clever techniques like quantization, pruning, and model distillation might not be enough on their own.
With edge offloading, we can make state-of-the-art AI accessible to hobbyists, researchers, and innovators everywhere.
By democratizing AI, we can empower individuals and small teams to experiment, create, and solve problems without needing deep pockets or enterprise-grade resources.
AI for everyone, on everything. Let’s make it happen.
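To ground one of the techniques mentioned above, here is a minimal numpy sketch of symmetric int8 post-training quantization. Real toolchains (GGUF, bitsandbytes, and friends) are far more sophisticated; this only shows the core arithmetic.

```python
# Minimal symmetric int8 post-training quantization: 4x less memory per weight.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0                      # map [-max, max] to [-127, 127]
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)             # toy weight matrix
q, s = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize(q, s)).max())
```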
✨ Connect with us!
📩 Newsletter: https://datascienceathome.substack.com
🎙 Podcast: Available on Spotify, Apple Podcasts, and more.
🐦 Twitter: @DataScienceAtHome
📘 LinkedIn: https://www.linkedin.com/in/fragadaleta/
Instagram: https://www.instagram.com/datascienceathome/
Facebook: https://www.facebook.com/datascienceAH
LinkedIn: https://www.linkedin.com/company/data-science-at-home-podcast
Discord Channel: https://discord.gg/4UNKGf3
NEW TO DATA SCIENCE AT HOME?
Welcome! Data Science at Home explores the latest in AI, data science, and machine learning. Whether you’re a data professional, tech enthusiast, or just curious about the field, our podcast delivers insights, interviews, and discussions. Learn more at https://datascienceathome.com
SEND US MAIL!
We love hearing from you! Send us mail at: [email protected]
Don’t forget to like, subscribe, and hit the 🔔 for updates on the latest in AI and data science!
Is DeepSeek the next big thing in AI? Can OpenAI keep up? And how do we truly understand these massive LLMs?
Enter WeightWatcher—the AI detective tool that peeks inside neural networks without needing their data.
In this episode, we chat with its creator, Dr. Charles Martin, to uncover what makes LLMs tick, the hidden patterns inside models like GPT-4 and DeepSeek, and whether AI is headed for a breakthrough—or a bottleneck.
If you’re into cutting-edge AI, you won’t want to miss this one!
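For listeners who want to try it themselves, here is a hedged sketch of WeightWatcher's documented usage (details may vary by version). Note that it analyzes the trained weight matrices directly, so no training data is required.

```python
# Hedged sketch of typical WeightWatcher usage (per the project's docs;
# exact API may differ across versions). Works on any trained PyTorch model.
import weightwatcher as ww
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1")  # any pretrained model will do
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()           # per-layer spectral metrics (e.g., power-law alpha)
print(watcher.get_summary(details))   # aggregate quality indicators, no data needed
```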
From the viral article "Tech's Dumbest Mistake: Why Firing Programmers for AI Will Destroy Everything" on my newsletter at https://defragzone.substack.com/p/techs-dumbest-mistake-why-firing
here are my thoughts about AI replacing programmers...
In this episode, we dive into the transformative world of AI, data analytics, and cloud infrastructure with Josh Miramant, CEO of Blue Orange Digital. As a seasoned entrepreneur with over $25 million raised across ventures and two successful exits, Josh shares invaluable insights on scaling data-driven businesses, integrating machine learning frameworks, and navigating the rapidly evolving landscape of cloud data architecture. From generative AI to large language models, Josh explores cutting-edge trends shaping financial services, real estate, and consumer goods.
Tune in for a masterclass in leveraging data for impact and innovation!
Links
https://blueorange.digital/blog/a-data-intelligence-platform-what-is-it/
https://blueorange.digital/blog/ai-makes-bi-tools-accessible-to-anyone/
AI is revolutionizing the military with autonomous drones, surveillance tech, and decision-making systems. But could these innovations spark the next global conflict? In this episode of Data Science at Home, we expose the cutting-edge tech reshaping defense—and the chilling ethical questions that follow. Don’t miss this deep dive into the AI arms race!
🎧 LISTEN / SUBSCRIBE TO THE PODCAST
Chapters
00:00 - Intro
01:54 - Autonomous Vehicles
03:11 - Surveillance And Reconnaissance
04:15 - Predictive Analysis
05:57 - Decision Support System
08:24 - Real World Examples
10:42 - Ethical And Strategic Considerations
12:25 - International Regulation
13:21 - Conclusion
14:50 - Outro
✨ Connect with us!
🎥Youtube: https://www.youtube.com/@DataScienceatHome
📩 Newsletter: https://datascienceathome.substack.com
🎙 Podcast: Available on Spotify, Apple Podcasts, and more.
🐦 Twitter: @DataScienceAtHome
📘 LinkedIn: Francesco Gadaleta
📷 Instagram: https://www.instagram.com/datascienceathome/
📘 Facebook: https://www.facebook.com/datascienceAH
💼 LinkedIn: https://www.linkedin.com/company/data-science-at-home-podcast
💬 Discord Channel: https://discord.gg/4UNKGf3
NEW TO DATA SCIENCE AT HOME?
Welcome! Data Science at Home explores the latest in AI, data science, and machine learning. Whether you’re a data professional, tech enthusiast, or just curious about the field, our podcast delivers insights, interviews, and discussions. Learn more at https://datascienceathome.com.
📫 SEND US MAIL!
We love hearing from you! Send us mail at:
[email protected]
Don’t forget to like, subscribe, and hit the 🔔 for updates on the latest in AI and data science!
#DataScienceAtHome #ArtificialIntelligence #AI #MilitaryTechnology #AutonomousDrones #SurveillanceTech #AIArmsRace #DataScience #DefenseInnovation #EthicsInAI #GlobalConflict #PredictiveAnalysis #AIInWarfare #TechnologyAndEthics #AIRevolution #MachineLearning
In this episode of Data Science at Home, we’re diving deep into the powerful strategies that top AI companies, like OpenAI, use to scale their systems to handle millions of requests every minute! From stateless services and caching to the secrets of async processing, discover 8 essential strategies to make your AI and machine learning systems unstoppable. Whether you're working with traditional ML models or large LLMs, these techniques will transform your infrastructure. Hit play to learn how the pros do it and apply it to your own projects!
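To make a couple of these strategies concrete, here is a minimal, illustrative Python sketch combining a stateless handler with caching; function names are placeholders, not from the episode.

```python
# Stateless handler + in-process cache in front of an expensive model call.
# Because the handler keeps no per-user state, any replica can serve any request.
from functools import lru_cache

def expensive_model_call(prompt: str) -> str:
    # stand-in for a real inference call
    return prompt.upper()

@lru_cache(maxsize=10_000)          # caching: repeated prompts never hit the model twice
def cached_predict(prompt: str) -> str:
    return expensive_model_call(prompt)

print(cached_predict("hello"))      # computed once
print(cached_predict("hello"))      # served from the cache
```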
LISTEN / SUBSCRIBE TO THE PODCAST
YouTube: https://www.youtube.com/@DataScienceatHome
Apple Podcasts: https://podcasts.apple.com/us/podcast/data-science-at-home/id1069871378
Podbean Podcasts: https://datascienceathome.podbean.com/
Player Fm: https://player.fm/series/data-science-at-home-2600992
Chapters
00:00 Intro
00:34 Scalability Strategies
01:08 Stateless Services
02:47 Horizontal Scaling
04:51 Load Balancing
06:14 Auto Scaling
07:41 Caching
09:27 Database Replication
11:07 Database Sharding
12:54 Async Processing
14:50 Infographics
RESOURCES & LINKS
Data Science at home: https://datascienceathome.com
Amethix Technologies: https://amethix.com
CONNECT WITH US!
Instagram: https://www.instagram.com/datascienceathome/
Twitter: @datascienceathome
Facebook: https://www.facebook.com/datascienceAH
LinkedIn: https://www.linkedin.com/company/data-science-at-home-podcast
Discord Channel: https://discord.gg/4UNKGf3
NEW TO DATA SCIENCE AT HOME?
Welcome! Data Science at Home explores the latest in AI, data science, and machine learning. Whether you’re a data professional, tech enthusiast, or just curious about the field, our podcast delivers insights, interviews, and discussions. Learn more at https://datascienceathome.com
SEND US MAIL!
We love hearing from you! Send us mail at: [email protected]
In this episode of Data Science at Home, host Francesco Gadaleta dives deep into the evolving world of AI-generated content detection with experts Souradip Chakraborty, Ph.D. grad student at the University of Maryland, and Amrit Singh Bedi, CS faculty at the University of Central Florida.
Together, they explore the growing importance of distinguishing human-written from AI-generated text, discussing real-world examples from social media to news. How reliable are current detection tools like DetectGPT? What are the ethical and technical challenges ahead as AI continues to advance? And is the balance between innovation and regulation tipping in the right direction?
Tune in for insights on the future of AI text detection and the broader implications for media, academia, and policy.
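As a taste of the watermarking discussion (around 11:28), here is a toy Python sketch of the green-list idea used in schemes like Kirchenbauer et al. (2023). The hashing below is simplified for illustration; real schemes operate on model token IDs and logits.

```python
# Toy green-list watermark detector: the generator favors a pseudorandom
# "green" subset of tokens keyed on the previous token; the detector needs
# only the key, not the model or its training data.
import hashlib, math

def is_green(prev_token: str, token: str, key: str = "secret") -> bool:
    h = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return h[0] % 2 == 0                     # ~half the vocabulary is "green" at each step

def detect(tokens: list[str], key: str = "secret") -> float:
    n = len(tokens) - 1
    hits = sum(is_green(p, t, key) for p, t in zip(tokens, tokens[1:]))
    return (hits - 0.5 * n) / math.sqrt(0.25 * n)   # z-score vs. unwatermarked text

# z >> 2 suggests watermarked text; values near 0 look unwatermarked
print(detect("the cat sat on the mat and then it slept".split()))
```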
Chapters
00:00 - Intro
00:23 - Guests: Souradip Chakraborty and Amrit Singh Bedi
01:25 - Distinguishing AI-Generated Text
04:33 - Research on Safety and Alignment of Generative Models
06:01 - Tools to Detect AI-Generated Text
11:28 - Watermarking
18:27 - Challenges in Detecting Large Documents Generated by AI
23:34 - Number of Tokens
26:22 - Adversarial Attack
29:01 - True Positive and False Positive of Detectors
31:01 - Limit of Technologies
41:01 - Future of AI Detection Techniques
46:04 - Closing Thought
Subscribe to our new YouTube channel https://www.youtube.com/@DataScienceatHome
Welcome to Data Science at Home, where we don’t just drink the AI Kool-Aid. Today, we’re dissecting Sam Altman’s “AI manifesto”—a magical journey where, apparently, AI will fix everything from climate change to your grandma's back pain. Superintelligence is “just a few thousand days away,” right? Sure, Sam, and my cat’s about to become a calculus tutor.
In this episode, I’ll break down the bold (and often bizarre) claims in Altman’s grand speech for the Intelligence Age. I’ll give you the real scoop on what’s realistic, what’s nonsense, and why some tech billionaires just can’t resist overselling. Think AI’s all-knowing, all-powerful future is just around the corner? Let’s see if we can spot the fairy dust.
Strap in, grab some popcorn, and get ready to see past the hype!
Chapters
00:00 - Intro
00:18 - Baidu CEO's Statement on the AI Bubble
03:47 - News on Sam Altman and OpenAI
06:43 - Online Manifesto: "The Intelligence Age"
13:14 - Deep Learning
16:26 - AI gets Better With Scale
17:45 - Conclusion On Manifesto
Still have popcorn?
Get some laughs at https://ia.samaltman.com/
#AIRealTalk #NoHypeZone #InvestorBaitAlert
In this episode of Data Science at Home, we dive into the hidden costs of AI’s rapid growth — specifically, its massive energy consumption. With tools like ChatGPT reaching 200 million weekly active users, the environmental impact of AI is becoming impossible to ignore. Each query, every training session, and every breakthrough come with a price in kilowatt-hours, raising questions about AI’s sustainability.
Join us as we uncover the staggering figures behind AI's energy demands and explore practical solutions for the future. From efficiency-focused algorithms and specialized hardware to decentralized learning, this episode examines how we can balance AI’s advancements with our planet's limits. Discover what steps we can take to harness the power of AI responsibly!
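For a sense of scale, here is a hedged back-of-envelope estimate. The per-query and usage figures are assumptions for illustration only (public estimates vary widely); the user count is the one cited above.

```python
# Back-of-envelope weekly energy estimate under clearly labeled assumptions.
WH_PER_QUERY = 0.3            # assumed Wh per LLM query (illustrative, not measured)
WEEKLY_USERS = 200_000_000    # figure cited in the episode
QUERIES_PER_USER_WEEK = 10    # assumed usage rate (illustrative)

weekly_kwh = WH_PER_QUERY * WEEKLY_USERS * QUERIES_PER_USER_WEEK / 1000
print(f"~{weekly_kwh:,.0f} kWh per week")   # ~600,000 kWh/week under these assumptions
```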
Check our new YouTube channel at https://www.youtube.com/@DataScienceatHome
Chapters
00:00 - Intro
01:25 - Findings on Summary Statistics
05:15 - Energy Required to Query GPT
07:20 - Energy Efficiency in Blockchain
10:41 - Efficiency-Focused Algorithms
14:02 - Hardware Optimization
17:31 - Decentralized Learning
18:38 - Edge Computing with Local Inference
19:46 - Distributed Architectures
21:46 - Outro
#AIandEnergy #AIEnergyConsumption #SustainableAI #AIandEnvironment #DataScience #EfficientAI #DecentralizedLearning #GreenTech #EnergyEfficiency #MachineLearning #FutureOfAI #EcoFriendlyAI #FrancescoFrag #DataScienceAtHome #ResponsibleAI #EnvironmentalImpact
Subscribe to our new channel https://www.youtube.com/@DataScienceatHome
In this episode of Data Science at Home, we confront a tragic story highlighting the ethical and emotional complexities of AI technology. A U.S. teenager recently took his own life after developing a deep emotional attachment to an AI chatbot emulating a character from Game of Thrones. This devastating event has sparked urgent discussions on the mental health risks, ethical responsibilities, and potential regulations surrounding AI chatbots, especially as they become increasingly lifelike.
🎙️ Topics Covered:
AI & Emotional Attachment: How hyper-realistic AI chatbots can foster intense emotional bonds with users, especially vulnerable groups like adolescents.
Mental Health Risks: The potential for AI to unintentionally contribute to mental health issues, and the challenges of diagnosing such impacts.
Ethical & Legal Accountability: How companies like Character AI are being held accountable and the ethical questions raised by emotionally persuasive AI.
🚨 Analogies Explored:
From VR to CGI and deepfakes, we discuss how hyper-realism in AI parallels other immersive technologies and why its emotional impact can be particularly disorienting and even harmful.
🛠️ Possible Mitigations:
We cover potential solutions like age verification, content monitoring, transparency in AI design, and ethical audits that could mitigate some of the risks involved with hyper-realistic AI interactions.
👀 Key Takeaways:
As AI becomes more realistic, it brings both immense potential and serious responsibility. Join us as we dive into the ethical landscape of AI—analyzing how we can ensure this technology enriches human lives without crossing lines that could harm us emotionally and psychologically. Stay curious, stay critical, and make sure to subscribe for more no-nonsense tech talk!
Chapters
00:00 - Intro
02:21 - Emotions In Artificial Intelligence
04:00 - Unregulated Influence and Misleading Interaction
06:32 - Overwhelming Realism In AI
10:54 - Virtual Reality
13:25 - Hyper-Realistic CGI Movies
15:38 - Deep Fake Technology
18:11 - Regulations To Mitigate AI Risks
22:50 - Conclusion
#AI #ArtificialIntelligence #MentalHealth #AIEthics #Podcast #AIRegulation #EmotionalAI #HyperRealisticAI #TechTalk #AIChatbots #Deepfakes #VirtualReality #TechEthics #DataScience #AIDiscussion #StayCuriousStayCritical
Ever feel like VC advice is all over the place? That’s because it is. In this episode, I expose the madness behind the money and how to navigate their confusing advice!
Watch the video at https://youtu.be/IBrPFyRMG1Q
Subscribe to our new Youtube channel https://www.youtube.com/@DataScienceatHome
00:00 - Introduction
00:16 - The Wild World of VC Advice
02:01 - Grow Fast vs. Grow Slow
05:00 - Listen to Customers or Innovate Ahead
09:51 - Raise Big or Stay Lean?
11:32 - Sell Your Vision in Minutes?
14:20 - The Real VC Secret: Focus on Your Team and Vision
17:03 - Outro
Can AI really out-compress PNG and FLAC? 🤔 Or is it just another overhyped tech myth? In this episode of Data Science at Home, Frag dives deep into the wild claims that Large Language Models (LLMs) like Chinchilla 70B are beating traditional lossless compression algorithms. 🧠💥
But before you toss out your FLAC collection, let's break down Shannon's Source Coding Theorem and why entropy sets the ultimate limit on lossless compression.
We explore:
⚙️ How LLMs leverage probabilistic patterns for compression
📉 Why compression efficiency doesn’t equal general intelligence
🚀 The practical (and ridiculous) challenges of using AI for compression
💡 Can AI actually BREAK Shannon’s limit—or is it just an illusion?
If you love AI, algorithms, or just enjoy some good old myth-busting, this one’s for you. Don't forget to hit subscribe for more no-nonsense takes on AI, and join the conversation on Discord!
Let’s decode the truth together.
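To make the argument concrete, here is a minimal sketch of the bound in question. Note that it computes the zeroth-order (per-symbol) entropy; an LLM wins by modeling higher-order structure, which lowers the modeled entropy, not by beating Shannon's limit.

```python
# Shannon's source coding theorem: no lossless code can average fewer bits
# per symbol than the source entropy H = -sum(p * log2(p)).
import math
from collections import Counter

def entropy_bits_per_symbol(data: bytes) -> float:
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

msg = b"abracadabra abracadabra"
h = entropy_bits_per_symbol(msg)
print(f"entropy: {h:.3f} bits/symbol -> at least {math.ceil(h * len(msg))} bits total")
```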
Join the discussion on the new Discord channel of the podcast https://discord.gg/4UNKGf3
Don't forget to subscribe to our new YouTube channel
https://www.youtube.com/@DataScienceatHome
References
Have you met Shannon? https://datascienceathome.com/have-you-met-shannon-conversation-with-jimmy-soni-and-rob-goodman-about-one-of-the-greatest-minds-in-history/
Are AI giants really building trustworthy systems? A groundbreaking transparency report by Stanford, MIT, and Princeton says no. In this episode, we expose the shocking lack of transparency in AI development and how it impacts bias, safety, and trust in the technology. We’ll break down Gary Marcus’s demands for more openness and what consumers should know about the AI products shaping their lives.
Check our new YouTube channel https://www.youtube.com/@DataScienceatHome and Subscribe!
Cool links
We're revisiting one of our most popular episodes from last year, where renowned financial expert Chris Skinner explores the future of money. In this fascinating discussion, Skinner dives deep into cryptocurrencies, digital currencies, AI, and even the metaverse. He touches on government regulations, the role of tech in finance, and what these innovations mean for humanity.
Now, one year later, we encourage you to listen again and reflect—how much has changed? Are Chris Skinner's predictions still holding up, or has the financial landscape evolved in unexpected ways? Tune in and find out!
In this episode, join me and the Kaggle Grand Master, Konrad Banachewicz, for a hilarious journey into the zany world of data science trends. From algorithm acrobatics to AI, creativity, Hollywood movies, and music, we just can't get enough. It's the typical episode with a dose of nerdy comedy you didn't know you needed. Buckle up, it's a data disco, and we're breaking down the binary!
Sponsors
🔗 Links Mentioned in the Episode:
And finally, don't miss Konrad's Substack for more nerdy goodness! (If you're there already, be there again! 😄)
In this episode we delve into the dynamic realm of game development and the transformative role of artificial intelligence (AI).
Join Frag, Jim and Mike as they explore the current landscape of game development processes, from initial creative ideation to the integration of AI-driven solutions.
With Mike's expertise as a software executive and avid game developer, we uncover the potential of AI to revolutionize game design, streamline development cycles, and enhance player experiences. Discover insights into AI's applications in asset creation, code assistance, and even gameplay itself, as we discuss real-world implementations and cutting-edge research.
From the innovative GameGPT framework to the challenges of balancing automation with human creativity, this episode offers valuable perspectives and practical advice for developers looking to harness the power of AI in their game projects. Don't miss out on this insightful exploration at the intersection of technology and entertainment!
Sponsors
References
In this episode, we dive into the wild world of Large Language Models (LLMs) and their knack for… making things up. Can they really generalize without throwing in some fictional facts? Or is hallucination just part of their charm?
Let’s separate the genius from the guesswork in this insightful breakdown of AI’s creativity problem.
TL;DR:
LLM Generalisation without hallucinations. Is that possible?
References
https://github.com/lamini-ai/Lamini-Memory-Tuning/blob/main/research-paper.pdf
https://www.lamini.ai/blog/lamini-memory-tuning
The hype around Generative AI is real, but is the bubble about to burst?
Join me as we dissect the recent downturn in AI investments and what it means for the tech giants like OpenAI and Nvidia.
Could this be the end of the AI gold rush, or just a bump in the road?
References
In this insightful episode, we dive deep into the pressing issue of data privacy, where 86% of U.S. consumers express growing concerns and 40% don't trust companies to handle their data ethically.
Join us as we chat with the Vice President of Engineering at MetaRouter, a cutting-edge platform enabling enterprises to regain control over their customer data. We explore how MetaRouter empowers businesses to manage data in a 1st-party context, ensuring ethical, compliant handling while navigating the complexities of privacy regulations.
Sponsors
References
Join us as David Marom, Head of Panoply Business, explores the benefits of all-in-one data platforms.
Learn how tech stack consolidation boosts efficiency, improves data accuracy, and cuts costs.
David shares insights on overcoming common challenges, enhancing data governance, and success stories from organizations thriving with Panoply.
Sponsors
References
Blog: The Transformative Power of an All-in-One Data Platform
Whitepaper: Eradicating Platform Inefficiencies
Join us in this exciting episode of the Data Science at Home podcast. It's all about GPUs. We'll take you on a journey through the inner workings of these powerful processors, explaining how they handle complex computations and drive everything from gaming graphics to scientific simulations.
Whether you're a budding programmer or a tech enthusiast, understanding GPUs is key to unlocking new levels of performance and efficiency in your projects. Tune in and get ready to turbocharge your tech knowledge!
Sponsors
In this episode, we sit down with Ryan Smith, Founder of QFunction LLC, to explore how AI and machine learning are revolutionizing cybersecurity. With over 8 years of experience, including work at NASA's Jet Propulsion Laboratory, Ryan shares insights on the future of threat detection and prevention, the challenges businesses face in maintaining effective cybersecurity, and the ethical considerations of AI implementation.
Learn about cost-effective strategies for small businesses, the importance of collaboration in combating cyber threats, and how QFunction tailors its AI solutions to meet diverse industry needs.
Sponsors
In this last episode of the series "Rust in the Cosmos" we speak about what happens in space: which projects are currently active, and what can we learn from those of the past?
What about Rust and space applications? As always, let's find out ;)
Sponsors
Intrepid AI, AeroRust, Bytenook
In this episode of "Rust in the Cosmos" we delve into the challenges of building embedded applications for space.
Did you know that once you ship your app to space... you can't get it back? :P
What role is Rust playing here? Let's find out ;)
Sponsors
AeroRust, Intrepid, Bytenook
In this episode of "Rust in the Cosmos" we delve into the challenge of testing software for... ehm ... space
How can Rust help? Let's find out ;)
Sponsors
AeroRust, Intrepid, Bytenook
References
In this inaugural episode of "Rust in the Cosmos," we delve into the intricacies of communication in space and some of the challenges in space application development.
Sponsors
In this episode of Data Science at Home, we explore the game-changing impact of low-code solutions in robotics development. Discover how these tools bridge the coding gap, simplify integration, and enable trial-and-error development. We'll also uncover challenges with traditional coding methods using ROS. Join us for a concise yet insightful discussion on the future of robotics!
Sponsors
Join us in a dynamic conversation with Yori Lavi, Field CTO at SQream, as we unravel the data analytics landscape. From debunking the data lakehouse hype to SQream's GPU-based magic, discover how extreme data challenges are met with agility.
Yori shares success stories, insights into SQream's petabyte-scale capabilities, and a roadmap to breaking down organizational bottlenecks in data science.
Dive into the future of data analytics with SQream's commitment to innovation, leaving legacy formats behind and leading the charge in large-scale, cost-effective data projects.
Tune in for a dose of GPU-powered revolution!
References
In this episode from a month ago, join me as we unravel the controversial CEO firing at OpenAI in December 2023. I share my insights on the events, decode the intricacies, and explore what lies ahead for this influential organization. Don't miss this concise yet insightful take on the intersection of leadership and artificial intelligence innovation.
Sponsor
Learn what the new year holds for ransomware as a service, Active Directory, artificial intelligence and more when you download the 2024 Arctic Wolf Labs Predictions Report today at arcticwolf.com/datascience
!!WARNING!!
Due to some technical issues the volume is not always constant during the show. I sincerely apologise for any inconvenience.
Francesco
In this episode, I speak with Richie Cotton, Data Evangelist at DataCamp, as he delves into the dynamic intersection of AI and education. Richie, a seasoned expert in data science and the host of the podcast, brings together a wealth of knowledge and experience to explore the evolving landscape of AI careers, the skills essential for generative AI technologies, and the symbiosis of domain expertise and technical skills in the industry.
References
Dive into the world of Data Science at Home with our latest episode, where we explore the dynamic relationship between Artificial Intelligence and the redemption of open source software. In this thought-provoking discussion, I share my insights on why now, more than ever, is the opportune moment for open source to leave an indelible mark on the field of AI. Join me as I unpack my opinions and set expectations for the near future, discussing the pivotal role open source is set to play in shaping the landscape of data science and artificial intelligence. Don't miss out—tune in to gain a deeper understanding of this revolutionary intersection!
This episode is available as YouTube stream at https://www.youtube.com/live/0Enenz1HqIs?si=woyYdjJVz656BneH&t=915
In this captivating podcast episode, join renowned financial expert Chris Skinner as he delves into the fascinating realm of the future of money.
From cryptocurrencies to government currencies, the metaverse to artificial intelligence (AI), Skinner explores the intricate interplay between technology and humanity. Gain valuable insights as he defines the future of money, examines the potential impact of cryptocurrencies on traditional government currencies, and addresses the advantages and disadvantages of digital currencies.
Delve into the complex issues of regulation and governance in the context of emerging financial technologies, and discover Skinner's unique perspective on the metaverse and its implications for the future of money and technology.
Brace yourself for an enlightening discussion on the integration of AI in the financial sector and its potential impact on humanity. Tune in to explore the cutting-edge concepts that shape our financial landscape and get a glimpse of what lies ahead.
You can read about Chris at https://thefinanser.com/
Sponsors
This episode is sponsored by Setapp. Setapp is a platform that combines 230+ powerful macOS and iOS apps and tools under one $9.99 subscription. Their selection of apps is mostly helpful for people who use their Macs as an actual working tool, covering complete use cases like coding, designing, project and time management, and so on. Once subscribed, you get full access to paid features of the apps, as well as to new apps that are being constantly added.
So you’ll always be sure you’re not missing out on any cool apps and services that actually help you do your work more efficiently, for just a fraction of the price. Get 7 days for free at https://stpp.co/dsat
In this thought-provoking episode, we sit down with the renowned AI expert Filip Piekniewski, PhD, who fearlessly challenges the prevailing narratives surrounding artificial general intelligence (AGI) and the singularity. With a no-nonsense approach and a deep understanding of the field, Filip dismantles the hype and exposes some of the misconceptions about AI, LLMs and AGI.
Join us as we delve into the real-world implications of AI, separating fact from fiction, and gaining a firm grasp on the tangible possibilities of AI advancement.
If you're seeking a refreshingly pragmatic perspective on the future of AI, this episode is an absolute must-listen.
Filip Piekniewski Bio
Filip Piekniewski is a distinguished computer vision researcher and engineer, specializing in visual object tracking and perception. He approaches machine learning with a pragmatic mindset, recognizing its current limitations. Filip earned his Ph.D. from Warsaw University, where he explored neuroscience and later joined Brain Corporation in San Diego. His extensive study of neuroscience inspired him to develop innovative, bio-inspired machine learning architectures. Filip's unique blend of scientific curiosity and software engineering expertise allows him to quickly prototype and implement new ideas. He is known for his realistic perspective on AI, debunking AGI hype and focusing on tangible advancements.
Sponsors
Finally, a better way to do B2B research. NewtonX The World’s Leading B2B Market Research Company
Explore the Complex World of Regulations. Compliance can be overwhelming. Multiple frameworks. Overlapping requirements. Let Arctic Wolf be your guide.
Check it out at https://arcticwolf.com/datascience
References
Brace yourselves, dear friends!
In this episode, we delve into the earth-shattering revelation that OpenAI might have stumbled upon AGI (lol) and we're all just seconds away from being replaced by highly sophisticated toasters (lol lol).
Spoiler alert: OpenAI's CEO is just playing 7D chess with the entire human race. So, sit back, relax, and enjoy this totally not ominous exploration into the 'totally not happening' future of AI!
Dive into the cool world of AI chips with us! 🚀 We're breaking down how these special computer chips for AI have evolved and what makes them different. Think of them like the superheroes of the tech world!
Don't miss out! 🎙️🔍 #AIChips #TechTalk #SimpleScience
Hey there, engineering enthusiasts! Ever wondered how engineers deal with the wild, unpredictable twists and turns in their projects? In this episode, we're spilling the beans on uncertainty and why it's the secret sauce in every engineering recipe, not just the fancy stuff like deep learning and neural networks!
Join us for a ride through the world of uncertainty quantification. Tune in and let's demystify the unpredictable together! 🎲🔧🚀
References
https://www.osti.gov/servlets/purl/1428000
https://arc.aiaa.org/doi/pdf/10.2514/6.2010-124
https://arxiv.org/pdf/2001.10411
In this episode, dive deep into the world of Language Models as we decode their intricate structure, revealing how these powerful algorithms exploit concepts from the past.
But... what if LLMs were just a database?
References
https://fchollet.substack.com/p/how-i-think-about-llm-prompt-engineering
In this episode, I delve into Elon Musk's foresight on the future of AI as he champions Rust programming language.
Here is why Rust stands at the forefront of AI technology and the potential it holds.
References
https://github.com/WasmEdge/mediapipe-rs
https://blog.stackademic.com/why-did-elon-musk-say-that-rust-is-the-language-of-agi-eb36303ce341
As a continuation of Episode 238, I explain some effective and fun attacks to conduct against LLMs. Such attacks are even more effective on models served locally, that are hardly controlled by human feedback.
Have great fun and learn them responsibly.
References
https://www.jailbreakchat.com/
https://www.reddit.com/r/ChatGPT/comments/10tevu1/new_jailbreak_proudly_unveiling_the_tried_and/
https://arxiv.org/abs/2305.13860
Join me on an enlightening journey through the world of prompt engineering. Explore the multifaceted skills and strategies involved in harnessing the potential of large language models for various applications. From enhancing safety measures to augmenting models with domain knowledge, learn how prompt engineering is shaping the future of AI.
References
In this era of AI-powered code generation, software architects are facing a concerning decline in the quality of their creations. The once meticulously crafted software architectures are now being compromised. Should LLMs be held responsible?
References
Program Design in the UNIX Environment
https://harmful.cat-v.org/cat-v/unix_prog_design.pdf
Let's delve into the emerging trend in database design – or is it really a new trend?
The realm of vector databases and their revolutionary influence on AI and ML is making headlines.
Come along as we investigate how these groundbreaking databases are revolutionizing the landscape of data storage, retrieval, and processing, ultimately unlocking the complete potential of artificial intelligence and machine learning.
But are they genuinely as innovative as they seem?
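Under the hood, the core operation is strikingly simple; here is a minimal numpy sketch. Real vector databases add approximate indexes (HNSW, IVF, and similar) to make this fast at scale.

```python
# Embedding lookup by cosine similarity: the heart of every vector database.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 384)).astype(np.float32)   # stand-in embeddings
docs /= np.linalg.norm(docs, axis=1, keepdims=True)        # unit-normalize rows

query = docs[42] + 0.1 * rng.normal(size=384).astype(np.float32)
query /= np.linalg.norm(query)

scores = docs @ query                    # cosine similarity, since vectors are unit-norm
print(np.argsort(-scores)[:5])           # top-5 neighbors; index 42 should rank first
```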
References
https://partee.io/2022/08/11/vector-embeddings/
https://blog.det.life/why-you-shouldnt-invest-in-vector-databases-c0cd3f59d23c
In this episode, we dive into the world of data analytics and artificial intelligence with Ryan, the CEO, and Paul, the CTO of Zenlytic. Having graduated from Harvard and with extensive backgrounds in venture capital, consulting, and data engineering, Ryan and Paul provide valuable insights into their journey of building Zenlytic, a cutting-edge analytics platform.
Join us as we explore how Zenlytic's natural language interface enhances user experiences, enabling seamless access and analysis of analytics data. Discover how their self-service platform empowers teams to leverage business intelligence effectively, and learn about the unique features that set Zenlytic apart from other analytics platforms in the market. Delve into the crucial aspects of data security and privacy while granting team access, and find out how Zenlytic's analytics capabilities have transformed companies into data-driven decision-makers, ultimately improving their performance.
In this exciting episode, we dive into the world of Forward-Forward Neural Networks, unveiling their mind-boggling power and potential.
Join us as we demystify these advanced AI algorithms and explore how they're reshaping industries and revolutionizing machine learning.
From self-driving cars to personalized medicine, discover the cutting-edge applications that are propelling us into a new era of AI greatness.
Get ready to unlock the secrets of Forward-Forward Neural Networks and witness the future of artificial intelligence unfold before your eyes.
Don't miss out – tune in now and be part of the AI revolution!
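For the curious, here is a hedged PyTorch sketch of the core local training step described in the paper referenced below; layer sizes, threshold, and learning rate are illustrative.

```python
# Forward-Forward in miniature: each layer is trained locally so its
# "goodness" (sum of squared activations) is high on positive data and low
# on negative data; no end-to-end backward pass through the whole network.
import torch
import torch.nn.functional as F

layer = torch.nn.Linear(784, 500)
opt = torch.optim.SGD(layer.parameters(), lr=0.03)
theta = 2.0                                   # goodness threshold

def goodness(x: torch.Tensor) -> torch.Tensor:
    return torch.relu(layer(x)).pow(2).sum(dim=1)

def ff_step(x_pos: torch.Tensor, x_neg: torch.Tensor) -> float:
    # push goodness above theta for positives, below theta for negatives
    loss = F.softplus(torch.cat([theta - goodness(x_pos),
                                 goodness(x_neg) - theta])).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(ff_step(torch.randn(32, 784), torch.randn(32, 784)))
```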
References
The Forward-Forward Algorithm: Some Preliminary Investigations
Brace yourselves as we uncover the mind-blowing AI model, Google Bard, that's poised to challenge ChatGPT and other conversational AI systems. Join us as we explore the revolutionary features of Bard, its cutting-edge architecture, and its ability to generate human-like responses. Discover why AI enthusiasts are buzzing with excitement. References: [1] [2] [3]
Sponsors
Finally, a better way to do B2B research. NewtonX The World's Leading B2B Market Research Company
References
Google Unveils Palm-2: Its Revolutionary AI Model. https://datascientest.com/en/google-unveils-palm-2-its-revolutionary-ai-model
Google AI - Discover Palm-2 https://ai.google/discover/palm2/
"Palm-2: A Large Scale Language Model for Conversational AI." ArXiv preprint arXiv:2305.10403 (2023). https://arxiv.org/abs/2305.10403
In this enlightening episode of our podcast, we delve into the fascinating realm of Physics Informed Neural Networks (PINNs) and explore how they combine the extraordinary prediction capabilities of neural networks with the unparalleled accuracy of physics models.
Join us as we unravel the mysteries behind PINNs and their potential to revolutionize various scientific and engineering domains. We'll discuss the underlying principles that enable these networks to incorporate physical laws and constraints, resulting in enhanced predictions and a deeper understanding of complex systems.
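Here is a minimal, illustrative PyTorch sketch of the idea on a toy problem (not from the episode): fit u(x) to the ODE u' + u = 0 with u(0) = 1, so the physics enters the loss rather than labeled data.

```python
# Toy PINN: the network is penalized on the ODE residual u' + u = 0 and the
# boundary condition u(0) = 1; the exact solution is u(x) = exp(-x).
import math
import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    x = torch.rand(64, 1, requires_grad=True)                 # collocation points in [0, 1]
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    physics = (du + u).pow(2).mean()                          # ODE residual
    boundary = (net(torch.zeros(1, 1)) - 1.0).pow(2).mean()   # u(0) = 1
    loss = physics + boundary
    opt.zero_grad()
    loss.backward()
    opt.step()

print(net(torch.tensor([[1.0]])).item(), "vs exact exp(-1) =", math.exp(-1))
```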
Sponsors
This episode is supported by Mimecast - the email security solution that every business needs. With Mimecast, you get a security solution that is specifically designed for email and workplace collaboration. Head to mimecast.com for a free trial.
References
Physics Informed Deep Learning https://maziarraissi.github.io/PINNs/
In this episode, we dive into the ways in which AI and machine learning are disrupting traditional software engineering principles. With the advent of automation and intelligent systems, developers are increasingly relying on algorithms to create efficient and effective code. However, this reliance on AI can come at a cost to the tried-and-true methods of software engineering. Join us as we explore the pros and cons of this paradigm shift and discuss what it means for the future of software development.
Sponsors Bloomberg
At Bloomberg, they solve complex, real-world problems for customers across the global capital markets. From real-time market data to sophisticated analytics, powerful trading tools, and more, Bloomberg engineers work with systems that operate at scale.
If you're a software engineer looking for an exciting and fulfilling career, head over to bloomberg.com/careers to learn more.
Cybercriminals are evolving. Their techniques and tactics are more advanced, intricate, and dangerous than ever before. Industries and governments around the world are fighting back, unveiling new regulations meant to better protect data against this rising threat. Arctic Wolf — the leader in security operations — is on a mission to end cyber risk by giving organizations the protection, information, and confidence they need to protect their people, technology, and data.
Visit arcticwolf.com/datascience to take your first step.
Hold on to your calculators and buckle up for a wild mathematical ride in this episode! Brace yourself as we dive into the fascinating realm of Liquid Time-Constant Networks (LTCs), where mathematical content reaches new heights of excitement.
In this mind-bending adventure, we demystify the intricacies of LTCs, from complex equations to mind-boggling mathematical concepts, we break them down into digestible explanations.
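For reference, the state equation that gives these networks their input-dependent ("liquid") time constants, as we recall it from Hasani et al.'s LTC paper (see the links below), reads:

$$\frac{d\mathbf{x}(t)}{dt} = -\left[\frac{1}{\tau} + f\big(\mathbf{x}(t), \mathbf{I}(t), t, \theta\big)\right]\mathbf{x}(t) + f\big(\mathbf{x}(t), \mathbf{I}(t), t, \theta\big)\,A$$

so the effective time constant, roughly $\tau_{\mathrm{sys}} = \tau / \big(1 + \tau f(\cdot)\big)$, varies with the input, which is exactly what makes the dynamics "liquid".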
References
https://www.science.org/doi/10.1126/scirobotics.adc8892
https://spectrum.ieee.org/liquid-neural-networks#toggle-gdpr
Get ready for an eye-opening episode! 🎙️
In our latest podcast episode, we dive deep into the world of LoRA (Low-Rank Adaptation) for large language models (LLMs). This groundbreaking technique is revolutionizing the way we approach language model training by leveraging low-rank approximations.
Join us as we unravel the mysteries of LoRa and discover how it enables us to retrain LLMs with minimal expenditure of money and resources. We'll explore the ingenious strategies and practical methods that empower you to fine-tune your language models without breaking the bank.
Whether you're a researcher, developer, or language model enthusiast, this episode is packed with invaluable insights. Learn how to unlock the potential of LLMs without draining your resources.
Tune in and join the conversation as we unravel the secrets of LoRA (low-rank adaptation) and show you how to retrain LLMs on a budget.
Listen to the full episode now on your favorite podcast platform! 🎧✨
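For the hands-on listener, here is a minimal PyTorch sketch of the low-rank update itself; rank, scaling, and layer sizes are illustrative.

```python
# LoRA in miniature: freeze the pretrained weight W and learn a rank-r delta
# B @ A, so only r * (d_in + d_out) parameters are trained instead of d_in * d_out.
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                         # freeze pretrained weights
        self.A = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))  # delta starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(torch.nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable: {trainable:,} vs frozen: {4096 * 4096:,}")   # 65,536 vs 16,777,216
```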
References
This is the first episode about the latest trend in artificial intelligence that's shaking up the industry - running large language models locally on your machine. This new approach allows you to bypass the limitations and constraints of cloud-based models controlled by big tech companies, and take control of your own AI journey.
We'll delve into the benefits of running models locally, such as increased speed, improved privacy and security, and greater customization and flexibility. We'll also discuss the technical requirements and considerations for running these models on your own hardware, and provide practical tips and advice to get you started.
Join us as we uncover the secrets to unleashing the full potential of large language models and taking your AI game to the next level!
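As a hedged starting point, one popular route today is llama.cpp through its Python bindings; the model path below is a placeholder for whatever GGUF checkpoint you download.

```python
# Local inference with llama-cpp-python (pip install llama-cpp-python).
# The model file is a placeholder: supply any GGUF checkpoint you have locally.
from llama_cpp import Llama

llm = Llama(model_path="./models/your-model.gguf", n_ctx=2048)
out = llm("Q: Why run LLMs locally? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```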
Sponsors
AI-powered Email Security: best-in-class protection against the most sophisticated attacks, from phishing and impersonation to BEC and zero-day threats. https://www.mimecast.com/
References
The journey of porting our projects to Rust was intense, but it was a decision we made to improve the quality of our software. The migration was not an easy task, as it required a considerable amount of time and resources. However, it was worth the effort as we have seen significant improvements in code reusability, code cleanliness, and performance.
In this episode I will tell you why you should consider taking that journey too.
In this episode of our podcast, we dive deep into the fascinating world of Graph Neural Networks.
First, we explore Hierarchical Networks, which allow for the efficient representation and analysis of complex graph structures by breaking them down into smaller, more manageable components.
Next, we turn our attention to Generative Graph Models, which enable the creation of new graph structures that are similar to those in a given dataset. We discuss the inner workings of these models and their potential applications in fields such as drug discovery and social network analysis.
Finally, we delve into the essential Pooling Mechanism, which allows for the efficient passing of information across different parts of the graph neural network. We examine the various types of pooling mechanisms and their advantages and disadvantages.
Whether you're a seasoned graph neural network expert or just starting to explore the field, this episode has something for you. So join us for a deep dive into the power and potential of Graph Neural Networks.
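For newcomers, here is a bare-bones numpy sketch of one message-passing layer plus a mean-pooling readout, the two building blocks discussed above; a toy, not a substitute for a proper GNN library.

```python
# One mean-aggregation message-passing layer and a pooling readout.
import numpy as np

def gnn_layer(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    return np.maximum((A_hat / deg) @ H @ W, 0)     # average neighbors, project, ReLU

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-node path graph
H = np.random.randn(3, 4)                            # node features
W = np.random.randn(4, 8)                            # learnable weights

H1 = gnn_layer(A, H, W)                              # updated node embeddings
graph_embedding = H1.mean(axis=0)                    # mean pooling: one vector per graph
print(graph_embedding.shape)                         # (8,)
```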
References
Machine Learning with Graphs - http://web.stanford.edu/class/cs224w/
A Comprehensive Survey on Graph Neural Networks - https://arxiv.org/abs/1901.00596
In this episode, I explore the cutting-edge technology of graph neural networks (GNNs) and how they are revolutionizing the field of artificial intelligence. I break down the complex concepts behind GNNs and explain how they work by modeling the relationships between data points in a graph structure.
I also delve into the various real-world applications of GNNs, from drug discovery to recommendation systems, and how they are outperforming traditional machine learning models.
Join me and demystify this exciting area of AI research and discover the power of graph neural networks.
In this episode, we dive into the not-so-secret sauce of ChatGPT, and what makes it a different model than its predecessors in the field of NLP and Large Language Models.
We explore how human feedback can be used to speed up the learning process in reinforcement learning, making it more efficient and effective.
Whether you're a machine learning practitioner, researcher, or simply curious about how machines learn, this episode will give you a fascinating glimpse into the world of reinforcement learning with human feedback.
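As a pointer to the mechanics, here is a hedged sketch of the preference loss at the heart of reward-model training for RLHF, a Bradley-Terry style objective as in the paper referenced below; the scores are made-up examples.

```python
# Reward-model preference loss: the score of the human-preferred response
# should exceed the score of the rejected one.
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected), averaged over comparison pairs
    return -F.logsigmoid(r_chosen - r_rejected).mean()

r_chosen = torch.tensor([1.3, 0.2])     # reward scores for preferred answers (made up)
r_rejected = torch.tensor([0.1, 0.4])   # scores for rejected answers (made up)
print(preference_loss(r_chosen, r_rejected))
```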
Sponsors
This episode is supported by How to Fix the Internet, a cool podcast from the Electronic Frontier Foundation and Bloomberg, global provider of financial news and information, including real-time and historical price data, financial data, trading news, and analyst coverage.
References
Learning through human feedback
https://www.deepmind.com/blog/learning-through-human-feedback
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
https://arxiv.org/abs/2204.05862
In this episode, we explore the potential of the highly anticipated GPT-4 language model and the challenges that come with its development. From its ability to generate highly coherent and creative text to concerns about ethical considerations and the potential misuse of such technology, we delve into the promise and pitfalls of GPT-4.
Join us as we speak with experts in the field to gain insights into the latest developments and the impact that GPT-4 could have on the future of natural language processing.
In this episode, we dive into the fascinating world of zero-knowledge proofs and their impact on data science. Zero-knowledge proofs allow one party to prove to another that they know a secret without revealing the secret itself. This powerful concept has numerous applications in data science, from ensuring data privacy and security, to facilitating secure transactions and identity verification. We explore the mechanics of zero-knowledge proofs, its real-world applications, and how it is revolutionizing the way we handle sensitive information.
Join us as we uncover the secrets of zero-knowledge proofs and its impact on the future of data science.
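For a concrete feel, here is a toy, non-production Python sketch of a Schnorr-style interactive proof: the prover convinces the verifier that it knows x with y = g^x mod p, without revealing x. Parameters are toy-sized; real deployments use carefully chosen groups.

```python
# Toy Schnorr-style proof of knowledge of a discrete logarithm.
import random

p = 2**127 - 1                    # a Mersenne prime (toy-sized, not for production)
g = 3
x = random.randrange(2, p - 1)    # the prover's secret
y = pow(g, x, p)                  # public value

r = random.randrange(2, p - 1)
t = pow(g, r, p)                  # prover's commitment
c = random.randrange(2, 2**64)    # verifier's random challenge
s = (r + c * x) % (p - 1)         # prover's response (exponents live mod p - 1)

# verifier checks g^s == t * y^c (mod p) and learns nothing about x itself
assert pow(g, s, p) == (t * pow(y, c, p)) % p
print("proof verified")
```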
Sponsors
Want to enjoy the 4K video anytime, anywhere?
With ASUS ZenWiFi you can. Asus ZenWiFi XD5 mesh system puts your WiFi on steroids. It has a super easy Setup, with Flexible Network Naming, Lifelong free AiProtection and of course WiFi 6 technology. With Asus ZenWifi XD5 you get superfast, reliable and secure WiFi connections in every corner of your home!
With Asus ZenWifi XD5, you get the best WiFi experience!
Find more at https://asus.click/ZenWiFi_XD5
Deep learning methods are not as effective with tabular data. Here is why, and what to do about it.
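As a hedged illustration of the usual takeaway, here is a minimal sklearn baseline worth benchmarking before reaching for a neural network; the data is synthetic, for demonstration only.

```python
# Gradient-boosted trees: the standard strong baseline on tabular data.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
clf = HistGradientBoostingClassifier(random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())   # competitive with little tuning
```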
Sponsors
If you're ready to take your WiFi game to the next level, head over to asus.click/ZenWiFi_XD5 or check out the show notes for this episode. Trust me, with ASUS ZenWiFi XD5, you'll get the best WiFi experience ever!
References
In this episode I speak about online learning systems and why blindly choosing such a paradigm can lead to very unpredictable and expensive outcomes.
Also in this episode, I have to deal with an intruder :)
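For context, here is the paradigm in one hedged snippet: incremental updates with sklearn's partial_fit on a simulated stream. Every update shifts the model, which is exactly why drift, feedback loops, and ordering effects become operational risks.

```python
# Online learning sketch: the model is updated one mini-batch at a time.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])

for step in range(100):                         # simulated data stream
    X = np.random.randn(32, 5)
    y = (X[:, 0] > 0).astype(int)
    model.partial_fit(X, y, classes=classes)    # incremental update, no full retrain

X_test = np.random.randn(1000, 5)
y_test = (X_test[:, 0] > 0).astype(int)
print(model.score(X_test, y_test))
```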
Links
Birman, K.; Joseph, T. (1987). "Exploiting virtual synchrony in distributed systems". Proceedings of the Eleventh ACM Symposium on Operating Systems Principles - SOSP '87. pp. 123–138. doi:10.1145/41457.37515. ISBN 089791242X. S2CID 7739589.
In this episode, I'll be discussing the capabilities and limitations of ChatGPT, an advanced language AI model. I'll go over its power to understand and respond to natural language, and its applications in tasks such as language translation and text summarization.
However, I'll also touch on the challenges that still need to be overcome such as bias and data privacy concerns.
Tune in for a comprehensive look at the current state of advanced language AI.
References
NordPass Business has developed a password manager that will save you a lot of time and energy whenever you need access to business accounts, work across devices, even with the other members of your team, or whenever you need to share sensitive data with your colleagues, or make payments efficiently. All this with the highest standard of cyber-secure technology.
See NordPass Business in action now with a 3-month free trial here: https://nordpass.com/DATASCIENCE with code DATASCIENCE
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.
Is it possible to reconstruct a 3D model from a simple image?
Under certain constraints, it is!
In this episode I tell you how.
Our Sponsors
Explore the Complex World of Regulations. Compliance can be overwhelming. Multiple frameworks. Overlapping requirements. Let Arctic Wolf be your guide.
Check it out at https://arcticwolf.com/datascience
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.
https://github.com/isl-org/Open3D
https://huggingface.co/docs/transformers/model_doc/glpn
https://arxiv.org/abs/2201.07436
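Putting the cited pieces together, here is a hedged sketch of monocular depth estimation with the GLPN model via Hugging Face transformers; the checkpoint name follows the model docs, and the image file is a placeholder.

```python
# Monocular depth estimation with GLPN (see the transformers docs cited above).
from PIL import Image
import torch
from transformers import GLPNImageProcessor, GLPNForDepthEstimation

processor = GLPNImageProcessor.from_pretrained("vinvino02/glpn-nyu")
model = GLPNForDepthEstimation.from_pretrained("vinvino02/glpn-nyu")

image = Image.open("room.jpg")                    # placeholder: any indoor photo
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    depth = model(**inputs).predicted_depth       # per-pixel relative depth map
print(depth.shape)                                # (1, H', W')
```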
What if we borrowed from physics some theories that would interpret deep learning and machine learning in general?
Here is a list of plausible ways to interpret our beloved ML models and understand why they work, or why they don't.
Enjoy the show!
NordPass Business has developed a password manager that will save you a lot of time and energy whenever you need access to business accounts, work across devices, even with the other members of your team, or whenever you need to share sensitive data with your colleagues, or make payments efficiently. All this with the highest standard of cyber-secure technology.
See NordPass Business in action now with a 3-month free trial here: https://nordpass.com/DATASCIENCE with code DATASCIENCE
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.
If you think that the problem of self-driving cars has been solved, think twice.
As a matter of fact, the problem of self-driving cars cannot be solved with the technical solutions that companies are currently considering.
Don't get fooled by marketing and PR on social media. Whoever tells you they have solved the problem of driving a vehicle fully autonomously is lying.
Here is why.
Our Sponsors
Explore the Complex World of Regulations. Compliance can be overwhelming. Multiple frameworks. Overlapping requirements. Let Arctic Wolf be your guide.
Check it out at https://arcticwolf.com/datascience
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.
Let's look at the history of data platforms. How did they evolve? Why?
Should I switch to the latest architecture?
Enjoy the show!
Our Sponsors
Explore the Complex World of Regulations. Compliance can be overwhelming. Multiple frameworks. Overlapping requirements. Let Arctic Wolf be your guide.
Check it out at https://arcticwolf.com/datascience
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.
Companies and other business entities are actively involved in defining data products and applied research every year. Academia has always played a role in creating new methods and solutions/algorithms in the fields of machine learning and artificial intelligence.
However, there is doubt about how powerful and effective such research efforts are.
Is studying AI in academia a waste of time?
Our Sponsors
Ready to advance your career in data science? University of Cincinnati Online offers nationally recognized educational programs in business analytics and information systems. Predictive Analytics Today named UC the No. 1 MS Data Science school in the country, and the program has a proven track record of placing students at high-profile companies such as Google, Amazon and P&G.
Discover more about the University of Cincinnati’s 100% online master’s degree programs at online.uc.edu/obais
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.
There are many solutions to private machine learning. I am pretty confident when I say that the one we are speaking about in this episode is probably one of the most feasible and reliable.
I am with Daniel Huynh, CEO of Mithril Security, a graduate from Ecole Polytechnique with a specialisation in AI and data science. He worked at Microsoft on Privacy Enhancing Technologies under the office of the CTO of Microsoft France. He has written articles on Homomorphic Encryptions with the CKKS explained series (https://blog.openmined.org/ckks-explained-part-1-simple-encoding-and-decoding/). He is now focusing on Confidential Computing at Mithril Security and has written extensive articles on the topic: https://blog.mithrilsecurity.io/.
In this show we speak about confidential computing, SGX and private machine learning.
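As a taste of what CKKS makes possible, here is a minimal sketch using TenSEAL, the OpenMined library that accompanies the CKKS-explained series linked above. The parameters are illustrative defaults from the TenSEAL tutorials, not a security recommendation.

import tenseal as ts

# Create a CKKS context: the polynomial degree and coefficient modulus sizes
# control precision and how many multiplications a ciphertext can absorb.
context = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                     coeff_mod_bit_sizes=[60, 40, 40, 60])
context.global_scale = 2 ** 40
context.generate_galois_keys()

enc_a = ts.ckks_vector(context, [1.0, 2.0, 3.0])
enc_b = ts.ckks_vector(context, [4.0, 5.0, 6.0])

# Arithmetic happens on ciphertexts; only decryption reveals the (approximate) result.
print((enc_a + enc_b).decrypt())  # ~ [5.0, 7.0, 9.0]
print((enc_a * enc_b).decrypt())  # ~ [4.0, 10.0, 18.0]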
Ready to advance your career in data science? University of Cincinnati Online offers nationally recognized educational programs in business analytics and information systems. Predictive Analytics Today named UC the No. 1 MS Data Science school in the country, and the program has a proven track record of placing students at high-profile companies such as Google, Amazon and P&G.
Discover more about the University of Cincinnati’s 100% online master’s degree programs at online.uc.edu/obais
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.
How does an autonomous vehicle see? How does it sense the road?
They are equipped with many sensors, of course. Are they all powerful enough? Small enough to hide, so your car still looks beautiful?
In this episode I speak about LIDAR, high resolution cameras and some machine learning methods adapted to a minimal number of sensors.
Our Sponsors
Ready to advance your career in data science? University of Cincinnati Online offers nationally recognized educational programs in business analytics and information systems. Predictive Analytics Today named UC the No. 1 MS Data Science school in the country, and the program has a proven track record of placing students at high-profile companies such as Google, Amazon and P&G.
Discover more about the University of Cincinnati’s 100% online master’s degree programs at online.uc.edu/obais
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.
References
https://patents.google.com/patent/US20220043449A1/en?oq=20220043449
Sometimes applications crash. Other times they crash because memory is exhausted. Such issues exist because of bugs in the code, or heavy memory usage for reasons that were not expected during design and implementation.
Can we use machine learning to predict and eventually detect out of memory kills from the operating system?
Apparently, the Netflix app many of us use on a daily basis leverages ML and time series analysis to prevent OOM kills.
Enjoy the show!
Our Sponsors
Explore the Complex World of Regulations. Compliance can be overwhelming. Multiple frameworks. Overlapping requirements. Let Arctic Wolf be your guide.
Check it out at https://arcticwolf.com/datascience
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.
Transcript
1
00:00:04,150 --> 00:00:09,034
And here we are again with the season four of the Data Science at Home podcast.
2
00:00:09,142 --> 00:00:19,170
This time we have something for you: if you want to help us shape the data science leaders of the future, we have created the Data Science at Home Ambassador program.
3
00:00:19,340 --> 00:00:28,378
Ambassadors are volunteers who are passionate about data science and want to give back to our growing community of data science professionals and enthusiasts.
4
00:00:28,534 --> 00:00:37,558
You will be instrumental in helping us achieve our goal of raising awareness about the critical role of data science in cutting edge technologies.
5
00:00:37,714 --> 00:00:45,740
If you want to learn more about this program, visit the Ambassadors page on our website, datascienceathome.com.
6
00:00:46,430 --> 00:00:49,234
Welcome back to another episode of Data Science at Home podcast.
7
00:00:49,282 --> 00:00:55,426
I'm Francesco, podcasting from the regular office of Amethix Technologies, based in Belgium.
8
00:00:55,618 --> 00:01:02,914
In this episode, I want to speak about a machine learning problem that has been formulated at Netflix.
9
00:01:03,022 --> 00:01:22,038
And for the record, Netflix is not sponsoring this episode, though I still believe that this problem is a very well known one, a very common one across sectors, which is how to predict an out of memory kill in an application and formulate this as a machine learning problem.
10
00:01:22,184 --> 00:01:39,142
So this is something that, as I said, is very interesting, not just because of Netflix, but because it allows me to explain a few points that, as I said, are kind of invariant across sectors.
11
00:01:39,226 --> 00:01:56,218
Regardless of whether your application is a video streaming application or any other communication type of application, or a fintech application, or energy, or whatever, this out of memory kill still occurs.
12
00:01:56,314 --> 00:02:05,622
And what is an out of memory kill? Well, it's essentially the extreme event in which the machine doesn't have any more memory left.
13
00:02:05,756 --> 00:02:16,678
And so usually the operating system can eventually start swapping, which means using the SSD or the hard drive as a source of memory.
14
00:02:16,834 --> 00:02:19,100
But that, of course, will slow down a lot.
15
00:02:19,430 --> 00:02:45,210
And eventually, when there is a bug or a memory leak, or if there are other applications running on the same machine, there is some kind of limiting factor that essentially kills the application, something the operating system does most of the time in order to prevent the application from monopolizing the entire machine, the hardware of the machine.
16
00:02:45,710 --> 00:02:48,500
And so this is a very important problem.
17
00:02:49,070 --> 00:03:03,306
Also, it is important to have an episode about this because there are some strategies they've used at Netflix that are pretty much in line with what I believe machine learning should be about.
18
00:03:03,368 --> 00:03:25,062
And usually people would go for the fancy solution there, like extremely accurate predictors or machine learning models with a massive number of parameters that try to figure out whatever is happening on the machine that is running the application.
19
00:03:25,256 --> 00:03:29,466
While the solution at Netflix is pretty straightforward, it's pretty simple.
20
00:03:29,588 --> 00:03:33,654
And so one would say, then why make an episode out of this? Well,
21
00:03:33,692 --> 00:03:45,730
Because I think that we need more sobriety when it comes to machine learning and I believe we still need to spend a lot of time thinking about what data to collect.
22
00:03:45,910 --> 00:03:59,730
Reasoning about what is the problem at hand and what is the data that can actually feed the particular machine learning model, and then of course move to the actual prediction, that is, the actual model.
23
00:03:59,900 --> 00:04:15,910
That, most of the time, doesn't need to be one of these super fancy things that you see in the news around chatbots or autonomous gaming agents or drivers and so on and so forth.
24
00:04:16,030 --> 00:04:28,518
So there are essentially two data sets that the people at Netflix focus on, which are considerably different, dramatically different in fact.
25
00:04:28,604 --> 00:04:45,570
These are data about device characteristics and capabilities, and of course data that are collected at runtime and that give you a picture of what's going on in the memory of the device, right? So that's the so called runtime memory data and out of memory kills.
26
00:04:45,950 --> 00:05:03,562
So the first type of data I would consider very static, because it includes, for example, the device type ID, the version of the software development kit that the application is running, cache capacities, buffer capacities and so on and so forth.
27
00:05:03,646 --> 00:05:11,190
So it's something that most of the time doesn't change across sessions and so that's why it's considered static.
28
00:05:12,050 --> 00:05:18,430
In contrast, the other type of data, the Runtime memory data, as the name says it's runtime.
29
00:05:18,490 --> 00:05:24,190
So it varies across the life of the session; it's collected at runtime.
30
00:05:24,250 --> 00:05:25,938
So it's very dynamic data.
31
00:05:26,084 --> 00:05:36,298
And examples of these records are, for example, profile, movie details, playback information, current memory usage, et cetera, et cetera.
32
00:05:36,334 --> 00:05:56,086
So this is the data that actually moves and moves in the sense that it changes depending on how the user is actually using the Netflix application, what movie or what profile description, what movie detail has been loaded for that particular movie and so on and so forth.
33
00:05:56,218 --> 00:06:15,094
So of course the first difficulty, the first challenge, that the people at Netflix had to deal with was how to combine these two things: very static and usually small tables versus very dynamic and usually large tables or views.
34
00:06:15,142 --> 00:06:36,702
Well, there is some sort of join on key that is performed by the people at Netflix in order to put together these different data resolutions, right, which is data of the same phenomenon but from different sources, carrying very different signals.
35
00:06:36,896 --> 00:06:48,620
So the device capabilities are usually captured by the static data, and of course the other data, the runtime memory and out of memory kill data.
36
00:06:48,950 --> 00:07:04,162
These are also, as I said, the data that will describe pretty accurately how the user is using that particular application on that particular hardware.
37
00:07:04,306 --> 00:07:17,566
Now of course, when it comes to data engineering, there is nothing new that the people at Netflix have introduced: dealing with missing data, for example, or incorporating knowledge of devices.
38
00:07:17,698 --> 00:07:26,062
It's all stuff that's part of the so called data cleaning and data collection strategy, right? Or data preparation.
39
00:07:26,146 --> 00:07:40,782
That is, whatever you're going to do in order to make that data or a combination of these data sources, let's say, compatible with the way your machine learning model will understand or will read that data.
40
00:07:40,916 --> 00:07:58,638
So if you think of a big data platform, the first challenge you have to deal with is: how can I, first of all, collect the right amount of information, the right data, but also how do I transform this data for my particular big data platform.
41
00:07:58,784 --> 00:08:12,798
And that's something that, again, nothing new, nothing fancy, just basics, what we have been used to, what we are used to seeing now for the last decade or more, that's exactly what they do.
42
00:08:12,944 --> 00:08:15,222
And now let me tell you something important.
43
00:08:15,416 --> 00:08:17,278
Cybercriminals are evolving.
44
00:08:17,374 --> 00:08:22,446
Their techniques and tactics are more advanced, intricate and dangerous than ever before.
45
00:08:22,628 --> 00:08:30,630
Industries and governments around the world are fighting back, rolling out new regulations meant to better protect data against this rising threat.
46
00:08:30,950 --> 00:08:39,262
Today, the world of cybersecurity compliance is a complex one, and understanding the requirements your organization must adhere to can be a daunting task.
47
00:08:39,406 --> 00:08:42,178
But not when the Pack has your back.
48
00:08:42,214 --> 00:08:53,840
Arctic Wolf, the leader in security operations, is on a mission to end cyber risk by giving organizations the protection, information and confidence they need to protect their people, technology and data.
49
00:08:54,170 --> 00:09:02,734
The new interactive compliance portal helps you discover the regulations in your region and industry and start the journey towards achieving and maintaining compliance.
50
00:09:02,902 --> 00:09:07,542
Visit arcticwolf.com/datascience to take your first step.
51
00:09:07,676 --> 00:09:11,490
That's arcticwolf.com/datascience.
52
00:09:12,050 --> 00:09:18,378
And now to the most important part, though I think all parts are actually equally important.
53
00:09:18,464 --> 00:09:26,854
But the way they treat runtime memory data and out of memory kill data is by using sliding windows.
54
00:09:26,962 --> 00:09:38,718
So that's something that is really worth mentioning, because the way you would frame this problem is something is happening at some point in time and I have to kind of predict that event.
55
00:09:38,864 --> 00:09:49,326
That is usually an outlier, in the sense that these events are quite rare, fortunately, because otherwise Netflix would not be as usable as we believe it is.
56
00:09:49,448 --> 00:10:04,110
So you would like to predict these weird events by looking at a historical view or an historical amount of records that you have before this particular event, which is the kill of the application.
57
00:10:04,220 --> 00:10:12,870
So the concept of the sliding window, the sliding window approach is something that comes as the most natural thing anyone would do.
58
00:10:13,040 --> 00:10:18,366
And that's exactly what the researchers at Netflix have done.
59
00:10:18,488 --> 00:10:25,494
Not unexpectedly, in my opinion, they treated this problem as a time series, which is exactly what it is.
60
00:10:25,652 --> 00:10:26,190
Now,
61
00:10:26,300 --> 00:10:26,754
they,
62
00:10:26,852 --> 00:10:27,330
of course,
63
00:10:27,380 --> 00:10:31,426
use this sliding window with different horizons:
64
00:10:31,558 --> 00:10:32,190
five minutes,
65
00:10:32,240 --> 00:10:32,838
four minutes,
66
00:10:32,924 --> 00:10:33,702
two minutes,
67
00:10:33,836 --> 00:10:36,366
as close as possible to the event.
68
00:10:36,548 --> 00:10:38,886
Because maybe there are some,
69
00:10:39,008 --> 00:10:39,762
let's say,
70
00:10:39,896 --> 00:10:45,678
other dynamics that can arise when you are very close to the event or when you are very far from it.
71
00:10:45,704 --> 00:10:50,166
Like, five minutes away from the out of memory kill
72
00:10:50,348 --> 00:10:51,858
might have some other,
73
00:10:51,944 --> 00:10:52,410
let's say,
74
00:10:52,460 --> 00:10:55,986
diagrams or shapes in the data.
75
00:10:56,168 --> 00:11:11,310
So for example, you might have a certain number of allocations that keep growing and growing, but eventually they grow with a certain curve or a certain rate that you can measure when you are five to ten minutes far from the out of memory kill.
76
00:11:11,420 --> 00:11:16,566
When you are two minutes far from the out of memory kill, probably this trend will change.
77
00:11:16,688 --> 00:11:30,800
And so probably what you would expect is that the memory is already half or more saturated and therefore, for example, the operating system starts swapping, or other things are happening that you are going to measure in this window.
78
00:11:31,550 --> 00:11:39,730
And that would give you a much better picture of what's going on in the, let's say, closest neighborhood of that event, the time window.
79
00:11:39,790 --> 00:11:51,042
The sliding window and time window approach is definitely worth mentioning, because this is something that you can apply, if you think about it, pretty much anywhere.
80
00:11:51,116 --> 00:11:52,050
What they did,
81
00:11:52,160 --> 00:12:04,146
In addition to having a time window, a sliding window, they also assign different levels to memory readings that are closer to the out of memory kill.
82
00:12:04,208 --> 00:12:10,062
And usually these levels are higher and higher as we get closer and closer to the out of memory kill.
83
00:12:10,136 --> 00:12:15,402
So this means that, for example, for a five minute window we would have a level one.
84
00:12:15,596 --> 00:12:22,230
Five minutes means five minutes away from the out of memory kill; four minutes would be a level two.
85
00:12:22,280 --> 00:12:37,234
Three minutes, which is much closer, would be a level three, and two minutes a level four, which reflects, let's say, the severity of the event as we get closer and closer to the actual event, when the application is actually killed.
86
00:12:37,342 --> 00:12:51,474
So looking at this approach, there is nothing new there; I would say even a not so seasoned data scientist would have understood that using a sliding window is the way to go.
87
00:12:51,632 --> 00:12:55,482
I'm not saying that Netflix engineers are not seasoned enough.
88
00:12:55,556 --> 00:13:04,350
Actually they do a great job every day to keep giving us a video streaming platform that never fails, or almost never fails.
89
00:13:04,910 --> 00:13:07,460
So spot on there, guys, good job.
90
00:13:07,850 --> 00:13:27,738
But looking at this sliding window approach, the direct consequence of this is that they can plot, they can do some sort of graphical analysis of the out of memory kills versus the memory usage that can give the reader or the data scientist a very nice picture of what's going on there.
91
00:13:27,824 --> 00:13:39,330
And so you would have, for example, and I will definitely report some of the pictures, some of the diagrams and graphs, in the show notes of this episode on the official website datascienceathome.com.
92
00:13:39,500 --> 00:13:48,238
But essentially what you can see there is that there might be premature peaks at, let's say, a lower memory reading.
93
00:13:48,334 --> 00:14:08,958
And usually these are some kind of false positives or anomalies that should not be there. Then it's possible to set a threshold at which to start lowering the memory usage, because after that threshold something nasty can happen, and usually does happen, according to your data.
94
00:14:09,104 --> 00:14:18,740
And then of course there is another graph showing a Gaussian distribution, or in fact no sharp peak at all.
95
00:14:19,250 --> 00:14:21,898
That is, kills, or out of memory
96
00:14:21,934 --> 00:14:33,754
kills, are more or less distributed in a normalized fashion, and then of course there are the genuine peaks that indicate kills near, let's say, the threshold.
97
00:14:33,802 --> 00:14:38,758
And so usually you would see that, after that particular threshold of memory usage,
98
00:14:38,914 --> 00:14:42,142
you see most of the out of memory kills.
99
00:14:42,226 --> 00:14:45,570
Which makes sense because, given a particular device,
100
00:14:45,890 --> 00:14:48,298
which means a certain amount of memory,
101
00:14:48,394 --> 00:14:50,338
certain memory characteristics,
102
00:14:50,494 --> 00:14:53,074
a certain version of the SDK and so on and so forth,
103
00:14:53,182 --> 00:14:53,814
you can say,
104
00:14:53,852 --> 00:14:54,090
okay,
105
00:14:54,140 --> 00:15:10,510
well, for this device type I have this memory usage threshold, and I see that I have a relatively high number of out of memory kills immediately after this threshold.
106
00:15:10,570 --> 00:15:18,150
And this means that probably that is the threshold you would like to consider as the critical threshold you should never or almost never cross.
107
00:15:18,710 --> 00:15:38,758
So once you have this picture in front of you, you can start thinking of implementing some mechanisms that can monitor the memory usage and, of course, kind of preemptively deallocate things, or keep the memory usage as low as possible with respect to the critical threshold.
108
00:15:38,794 --> 00:15:53,446
So you can start implementing some logic that prevents the application from being killed by the operating system so that you would in fact reduce the rate of out of memory kills overall.
109
00:15:53,578 --> 00:16:11,410
Now, as always, and as the engineers also state in their technical blog post, they say: well, it's much more important for us to predict with a certain amount of false positives rather than false negatives.
110
00:16:11,590 --> 00:16:18,718
A false negative means missing an out of memory kill that actually occurred but was not predicted.
111
00:16:18,874 --> 00:16:40,462
If you are a regular listener of this podcast, that statement should resonate with you because this is exactly what happens, for example in healthcare applications, which means that doctors or algorithms that operate in healthcare would definitely prefer to have a bit more false positives rather than more false negatives.
112
00:16:40,486 --> 00:16:54,800
Because missing that someone is sick means that you are not providing a cure and you're just sending the patient home when he or she is sick, right? That's the false positive, it's the miss.
113
00:16:55,130 --> 00:16:57,618
So that's a false negative, it's the miss.
114
00:16:57,764 --> 00:17:09,486
But having a false positive, what can go wrong with having a false positive? Well, probably you will undergo another test to make sure that the first test is confirmed or not.
115
00:17:09,608 --> 00:17:16,018
So having a false positive in this case is relatively okay with respect to having a false negative.
116
00:17:16,054 --> 00:17:19,398
And that's exactly what happens to the Netflix application.
117
00:17:19,484 --> 00:17:32,094
Now, I don't want to say that the Netflix application is as critical as, for example, an application that predicts a cancer from an X-ray, or a disorder or disease of some sort.
118
00:17:32,252 --> 00:17:48,090
But what I'm saying is that there are some analogies when it comes to machine learning and artificial intelligence and especially data science, the old school data science, there are several things that kind of are, let's say, invariant across sectors.
119
00:17:48,410 --> 00:17:56,826
And so, you know, two worlds like the media streaming or video streaming and healthcare are of course very different from each other.
120
00:17:56,888 --> 00:18:05,274
But when it comes to machine learning and data science applications, well, there are a lot of analogies there.
121
00:18:05,372 --> 00:18:06,202
And indeed,
122
00:18:06,286 --> 00:18:10,234
in terms of the models that they use at Netflix to predict,
123
00:18:10,342 --> 00:18:24,322
once they have the sliding window data, they essentially have the ground truth of where the out of memory kill happened and what happened before to the memory of the application or the machine.
124
00:18:24,466 --> 00:18:24,774
Well,
125
00:18:24,812 --> 00:18:30,514
then the models they use to predict these events are artificial neural networks,
126
00:18:30,622 --> 00:18:31,714
XGBoost,
127
00:18:31,822 --> 00:18:36,742
AdaBoost (adaptive boosting), Elastic Net with softmax, and so on and so forth.
128
00:18:36,766 --> 00:18:39,226
So nothing fancy.
129
00:18:39,418 --> 00:18:45,046
As you can see, XGBoost is probably one of the most used; I would have expected even random forest.
130
00:18:45,178 --> 00:18:47,120
Probably they have tried that too.
131
00:18:47,810 --> 00:18:58,842
But XGBoost is probably one of the most used models in Kaggle competitions for a reason: because it works, and it leverages a lot
132
00:18:58,916 --> 00:19:04,880
the data preparation step, which already solves more than half of the problem.
133
00:19:05,810 --> 00:19:07,270
Thank you so much for listening.
134
00:19:07,330 --> 00:19:11,910
I also invite you, as always, to join the Discord Channel.
135
00:19:12,020 --> 00:19:15,966
You will find a link on the official website, datascienceathome.com.
136
00:19:16,148 --> 00:19:17,600
Speak with you next time.
137
00:19:18,350 --> 00:19:21,382
You've been listening to the Data Science at Home podcast.
138
00:19:21,466 --> 00:19:26,050
Be sure to subscribe on iTunes, Stitcher, or Podbean to get new, fresh episodes.
139
00:19:26,110 --> 00:19:31,066
For more, please follow us on Instagram, Twitter and Facebook or visit our website at datascienceathome.com
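To make the approach described in the transcript concrete, here is a toy sketch (mine, not Netflix's code) of the sliding-window labeling and the kind of model mentioned above: windows of memory readings get a severity level that grows as the out of memory kill approaches, and XGBoost learns to predict it. All window sizes, features and thresholds are invented for illustration.

import numpy as np
import pandas as pd
from xgboost import XGBClassifier

WINDOW = 60  # one reading per second -> one-minute windows (assumption)

def windows_with_severity(mem: pd.Series, kill_time: int):
    X, y = [], []
    for start in range(0, kill_time - WINDOW, WINDOW):
        w = mem.iloc[start:start + WINDOW]
        minutes_to_kill = (kill_time - (start + WINDOW)) / 60
        # Severity grows near the kill: 5+ minutes away -> 1, ..., 2 minutes or less -> 4.
        level = 4 if minutes_to_kill <= 2 else 3 if minutes_to_kill <= 3 \
            else 2 if minutes_to_kill <= 4 else 1
        X.append([w.mean(), w.max(), w.diff().mean()])  # simple summary features
        y.append(level)
    return np.array(X), np.array(y)

# Synthetic session: memory usage ramps up until the app is killed at t = 600 s.
rng = np.random.default_rng(0)
mem = pd.Series(np.linspace(0.2, 0.95, 600) + rng.normal(0, 0.01, 600))
X, y = windows_with_severity(mem, kill_time=600)

model = XGBClassifier(n_estimators=100).fit(X, y - 1)  # XGBoost wants 0-based labels
print(model.predict(X[-3:]) + 1)  # predicted severity of the last few windows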
Companies and other business entities are actively involved in defining data products and applied research every year. Academia has always played a role in creating new methods and solutions/algorithms in the fields of machine learning and artificial intelligence.
However, there is doubt about how powerful and effective such research efforts are.
Is studying AI in academia a waste of time?
Our Sponsors
Explore the Complex World of Regulations. Compliance can be overwhelming. Multiple frameworks. Overlapping requirements. Let Arctic Wolf be your guide.
Check it out at https://arcticwolf.com/datascience
Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.
Neural networks are becoming massive monsters that are hard to train (without the "regular" 12 last-generation GPUs).
Is there a way to skip that?
Let me introduce you to zero-cost proxies.
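Here is the gist, in a minimal sketch of one popular family of zero-cost proxies (a plain gradient-norm score, my own illustration rather than any specific paper's code): rank candidate architectures by a statistic computed from a single minibatch on the untrained network, with no training at all.

import torch
import torch.nn as nn

def grad_norm_score(net: nn.Module, x: torch.Tensor, y: torch.Tensor) -> float:
    # One forward/backward pass at initialization; the summed gradient norm is the score.
    net.zero_grad()
    loss = nn.functional.cross_entropy(net(x), y)
    loss.backward()
    return sum(p.grad.norm().item() for p in net.parameters() if p.grad is not None)

x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
candidates = [
    nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)),
    nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)),
]
for i, net in enumerate(candidates):
    print(f"candidate {i}: score = {grad_norm_score(net, x, y):.3f}")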
In this episode I speak about online learning systems and why blindly choosing such a paradigm can lead to very unpredictable and expensive outcomes.
Also in this episode, I have to deal with an intruder :)
Links
Birman, K.; Joseph, T. (1987). "Exploiting virtual synchrony in distributed systems". Proceedings of the Eleventh ACM Symposium on Operating Systems Principles - SOSP '87. pp. 123–138. doi:10.1145/41457.37515. ISBN 089791242X. S2CID 7739589.
In this episode, I am with Chip Kent, chief data scientist at Deephaven Data Labs.
We speak about streaming data, real-time, and other powerful tools part of the Deephaven platform.
Links
GitHub:
YouTube Channel - https://www.youtube.com/channel/UCoaYOlkX555PSTTJz8ZaI_w
Blog posts
Careers https://deephaven.io/company/careers/
Community Slack: http://deephaven.io/slack
In this episode I speak with Matt Swalley, Chief Business Officer of Omneky, an AI platform that generates, analyzes and optimizes personalized ad creatives at scale.
We speak about the way AI is used for generating customized recommendations and creating experiences with data aggregation and analytics. And yes, respecting the privacy of individuals!
Links
Grow your business with personalized ads https://www.omneky.com/
Data Science at Home Podcast (Live) https://www.twitch.tv/datascienceathome
Let's take a break and think about the state of AI in 2022.
In this episode I summarize the long report from the Stanford Institute for Human-Centered Artificial Intelligence (HAI).
Enjoy!
References
https://spectrum.ieee.org/artificial-intelligence-index
In this episode I have a conversation with Itai Bar-Sinai, CPO and co-founder of Mona.
We speak about several interesting points about data and monitoring.
Why is AI monitoring so different from monitoring classic software?
How to reduce the gap between data science and business?
What is the role of MLOps in the data monitoring field?
With over 10 years of experience with AI and as the CPO and head of customer success at Mona, the leading AI monitoring intelligence company, Itai has a unique view of the AI industry. Working closely with data science and ML teams applying dozens of AI solutions in over 10 industries, Itai encounters the wide variety of business use-cases, organizational structures and cultures, and technologies and tools used in today’s AI world.
I am with Ander Steele, data scientist and mathematician with a passion for privacy and Shannon Bayatpur, product manager with a background in technical writing and computer science, from Tonic.ai. We speak about data. Fake data.
But all we say is authentic.
In this episode my friend and I speak about AI, batteries and automotive.
Dennis Berner, founder of Digitlabs, has been operating in the field of automotive and batteries for a long time. His points of view are absolutely worth listening to.
Below is a list of the links he mentioned in the show.
In this episode I speak with Manavalan Krishnan from Tsecond about capturing massive amounts of data at the edge with security and reliability in mind.
This episode is brought to you by NordVPN
NordVPN protects your privacy while you are online. Get secure and private access to the internet by surfing nordvpn.com/DATASCIENCE or use coupon code DATASCIENCE and get a massive discount.
and by Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
References
https://tsecond.us/company/manavalan-krishnan/
This is one episode where passion for math, statistics and computers all merge.
I have a very interesting conversation with Ravin, a data scientist at Google, where he uses data to inform decisions.
He has previously worked at Sweetgreen, designing systems that would benefit team members and communities through sustainable and healthy food, and SpaceX, creating tools that would ultimately launch rocket ships.
All opinions in this episode are his own and none of the companies he has worked for are represented.
This episode is brought to you by RailzAI
The Railz API connects to major accounting platforms to provide you with quick access to normalized and analyzed financial data. Get free access to their API and more. Just tell them you came through Data Science at Home podcast.
and by Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
In this episode I am with Matt Forrest, VP of Solutions Engineering at Carto. We speak about machine learning applied to spatial data, spatial SQL and GIS (Geographic Information System).
Enjoy the show!
This episode is brought to you by RailzAI
The Railz API connects to major accounting platforms to provide you with quick access to normalized and analyzed financial data. Get free access to their API and more. Just tell them you came through Data Science at Home podcast.
and by Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
References
Carto https://carto.com
Spatial Feature Engineering: https://geographicdata.science/book/intro.html
CARTO Blog: https://carto.com/blog/
Spatial SQL Resources: https://forrest.nyc/learn-spatial-sql/
Spatial Data Science: https://forrest.nyc/geospatial-python
In this episode I am with Pasha Zavari, Director of Data Science, and Derek Manuge, Co-founder and CTO, at Railz.
Railz is a very interesting company with an incredible mission: normalizing and extracting insights from the most tedious data out there, financial data.
Guess what technology stack they are on?
Enjoy the show!
This episode is brought to you by RailzAI
The Railz API connects to major accounting platforms to provide you with quick access to normalized and analyzed financial data.
Sponsored by NordVPN
NordVPN protects your privacy while you are online. Get secure and private access to the internet by surfing nordvpn.com/DATASCIENCE or use coupon code DATASCIENCE and get a massive discount.
and by Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
References
Railz Homepage: https://go.railz.ai/Railz-DSH
Railz API Document: https://go.railz.ai/RailzAPI-DSH
Railz API Signup: https://go.railz.ai/RailzSignup-DSH
Railz Startup Pricing: https://go.railz.ai/RailzStartupPricing-DSH
Railz Careers: https://secure.collage.co/jobs/railz
How did we get here? Who invented the methods data scientists use every day?
We answer such questions and much more in this wonderful episode with Triveni Gandhi, Senior Data Scientist and Shaun McGirr, AI Evangelist at Dataiku. We cover topics about the history of data science, ethical AI and...
This episode is brought to you by Dataiku
With Dataiku, you have everything you need to build and deploy AI projects in one place, including easy-to-use data preparation and pipelines, AutoML, and advanced automation.
Sponsored by NordVPN
NordVPN protects your privacy while you are online. Get secure and private access to the internet by surfing nordvpn.com/DATASCIENCE or use coupon code DATASCIENCE and get a massive discount.
and by Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
In this episode I speak about AI and cloud automation with Leon Kuperman, co-founder and CTO at CAST AI. Formerly Vice President of Security Products OCI at Oracle, Leon’s professional experience spans across tech companies such as IBM, Truition, and HostedPCI.
Enjoy the episode!
Chat with me
Join us on Discord community chat to discuss the show, suggest new episodes and chat with other listeners!
Sponsored by Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
Sponsored by NordVPN
NordVPN protects your privacy while you are online. Get secure and private access to the internet by surfing nordvpn.com/DATASCIENCE or use coupon code DATASCIENCE and get a massive discount.
This is the last episode of the series "Embedded ML" and I made it for the bravest :)
I speak about machine learning compiler optimization in much greater detail.
Enjoy the episode!
Chat with me
Join us on Discord community chat to discuss the show, suggest new episodes and chat with other listeners!
Sponsored by Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
In this episode I speak about machine learning compilers, the most important tools to bridge the gap between high level frontends, ML backends and hardware target architectures.
There are several compilers one can choose from. Before that, let's get familiar with what a compiler is supposed to do.
Enjoy the episode!
Chat with me
Join us on Discord community chat to discuss the show, suggest new episodes and chat with other listeners!
Sponsored by Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
In this episode I speak about neural network quantization, a technique that makes networks feasible for embedded systems and small devices.
There are many quantization techniques depending on several factors that are all important to consider during design and implementation.
Enjoy the episode!
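As a flavour of the simplest variant, here is a minimal sketch of post-training dynamic quantization with PyTorch (one of the many techniques the episode surveys, not the only one): the weights of Linear layers are stored as int8 and dequantized on the fly at inference time.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 784)
print(quantized(x).shape)  # same interface as the float model, smaller weights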
Chat with me
Join us on Discord community chat to discuss the show, suggest new episodes and chat with other listeners!
Sponsored by Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
In Part 2 of Embedded Machine Learning, I speak about one important technique to prune a neural network and perform inference on small devices. Such a technique helps preserve most of the accuracy with a model that is orders of magnitude smaller.
Enjoy the show!
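If you want to try the idea yourself, here is a minimal sketch of magnitude pruning with PyTorch's pruning utilities (an illustration of the general technique, not necessarily the exact method discussed): zero out the 70% smallest weights of each Linear layer, then bake the mask in.

import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.7)
        prune.remove(module, "weight")  # make the sparsity permanent

zeros = sum((m.weight == 0).sum().item() for m in model if isinstance(m, nn.Linear))
total = sum(m.weight.numel() for m in model if isinstance(m, nn.Linear))
print(f"global sparsity: {zeros / total:.0%}")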
This episode is the first of a series about Embedded Machine Learning. I explain the requirements of tiny devices and how it is possible to run machine learning models on them.
Join us on Discord community chat to discuss the show, suggest new episodes and chat with other listeners!
Sponsored by Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
References
https://datascienceathome.com/compressing-deep-learning-models-distillation-ep-104/
How did we get here? Who invented the methods data scientists use every day?
We answer such questions and much more in this wonderful episode with Triveni Gandhi, Senior Data Scientist and Shaun McGirr, AI Evangelist at Dataiku. We cover topics about the history of data science, ethical AI and...
This episode is brought to you by Dataiku
With Dataiku, you have everything you need to build and deploy AI projects in one place, including easy-to-use data preparation and pipelines, AutoML, and advanced automation.
and by Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
In this episode I speak with Manavalan Krishnan from Tsecond about capturing massive amounts of data at the edge with security and reliability in mind.
This episode is brought to you by Tsecond
The growth of data being created at static and moving edges across industries such as air travel, ocean and space exploration, shipping and freight, oil and gas, media, and more poses numerous challenges in capturing, processing, and analyzing large amounts of data.
and by Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
References
https://tsecond.us/company/manavalan-krishnan/
If you think deep learning is a method to get to AGI, think again. Humans, as well as all mammals, think in a... composable way.
Come chat with us on Discord
Sponsors
This episode is brought to you by Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
Join us on Discord
Feel free to drop by and have a chat with the host and the followers of the show
This episode is brought to you by Advanced RISC Machines (ARM). ARM is a family of reduced instruction set computing architectures for computer processors https://www.arm.com/
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
This episode summarizes a study about AI trends in 2021, the way AI is perceived by people of different backgrounds, and some other weird questions.
For instance, would you have sexual intercourse with a robot? Would you be in a relationship with an artificial intelligence?
The study has been conducted by Tidio.com and reported at https://www.tidio.com/blog/ai-trends/
Sponsors
This episode is supported by Amethix Technologies.
Amethix uses machine learning and advanced analytics to empower people and organizations to ask and answer complex questions like never before.
Coming soon at https://amethix.com
If you think deep learning is a method to get to AGI, think again. Humans, as well as all mammals, think in a... composable way.
Sponsors
This episode is brought to you by Advanced RISC Machines (ARM). ARM is a family of reduced instruction set computing architectures for computer processors https://www.arm.com/
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
AI, ethics, and explainability.
Are these topics that only large corporations can spend resources on?
Can product-focused startups even think about them?
We answer such questions in this amazing episode with Erika Agostinelli from the AI Elite team at IBM.
Sponsored by Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
References
AIX360:
Questioning the AI: Informing Design Practices for Explainable AI User Experiences
https://arxiv.org/abs/2001.02478
Explainable AI - IBM
https://www.ibm.com/uk-en/watson/explainable-ai
AI Ethics - IBM
https://www.ibm.com/artificial-intelligence/ethics
Erika Agostinelli’s personal webpage:
This episode is brought to you by Advanced RISC Machines (ARM). ARM is a family of reduced instruction set computing architectures for computer processors https://www.arm.com/
The content of this episode has been created by Sylvain Kerkour
Feel free to subscribe to his newsletter at https://kerkour.com
Mara Pometti is IBM's first Global AI Strategist. She defines and designs the strategy for AI solutions by revealing overlooked insights hidden in enterprises' data.
In this episode we speak about strategy, explainable AI and data storytelling.
References
IBM Trustworthy AI: https://www.ibm.com/watson/trustworthy-ai
IBM AIX360: https://aix360.mybluemix.net/
Explainable AI and Data Storytelling: https://medium.com/aixdesign/the-next-generation-of-storytelling-1d5fecc8f999
Mara Pometti’s website: www.marapometti.com
Mara Pometti LinkedIn: https://www.linkedin.com/in/mara-pometti-99962594
Mara Pometti Twitter: https://twitter.com/91_pometti
In this episode Mikkel and Francesco have a really interesting conversation about some key differences between large and small organizations in approaching machine learning.
Listen to the episode to know more.
Quantum Metric
Stay off the naughty list this holiday season by reducing customer friction, increasing conversions, and personalizing the shopping experience. Want a sneak peek? Visit us at quantummetric.com/podoffer and see if you qualify to receive our “12 Days of Insights” offer with code DATASCIENCE. This offer gives you 12-day access to our platform coupled with a bespoke insight report that will help you identify where customers are struggling or engaging in your digital product.
Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
A few weeks ago I was the guest of a very interesting show called "AI Today".
In that episode I talked about some of the biggest trends emerging in AI and machine learning today as well as how organizations are dealing with and managing their data.
The original show has been published at https://www.cognilytica.com/2021/08/11/ai-today-podcast-interview-with-francesco-gadaleta-host-of-data-science-at-home-podcast/
Our Sponsors
Quantum Metric
Stay off the naughty list this holiday season by reducing customer friction, increasing conversions, and personalizing the shopping experience. Want a sneak peek? Visit us at quantummetric.com/podoffer and see if you qualify to receive our “12 Days of Insights” offer with code DATASCIENCE. This offer gives you 12-day access to our platform coupled with a bespoke insight report that will help you identify where customers are struggling or engaging in your digital product.
Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
Remember GANs? Generative Adversarial Networks for synthetic data generation?
There is a new method called Generative Teaching Networks that uses similar concepts - just quite the opposite :P - to train models faster, better, and with less data.
Enjoy the show!
Our Sponsors
Quantum Metric
Stay off the naughty list this holiday season by reducing customer friction, increasing conversions, and personalizing the shopping experience. Want a sneak peek? Visit us at quantummetric.com/podoffer and see if you qualify to receive our “12 Days of Insights” offer with code DATASCIENCE. This offer gives you 12-day access to our platform coupled with a bespoke insight report that will help you identify where customers are struggling or engaging in your digital product.
Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
It's time we get serious about replacing the CSV format with something that, guess what, has been around for a long time already.
In this episode I explain the good parts of CSV files and the not so good ones. It's time we evolve to something better.
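The episode's pick isn't named in these notes, but as a hint of what "better" can look like, here is a minimal sketch using Apache Parquet, one long-standing candidate: unlike CSV, it preserves column types, compresses well, and lets you read only the columns you need.

import pandas as pd  # reading/writing Parquet also requires pyarrow or fastparquet

df = pd.DataFrame({
    "user": ["a", "b"],
    "score": [0.1, 0.2],
    "ts": pd.to_datetime(["2021-01-01", "2021-01-02"]),
})
df.to_parquet("data.parquet")  # column types survive the round trip, unlike CSV
back = pd.read_parquet("data.parquet", columns=["score"])  # column pruning
print(back.dtypes)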
Our Sponsors
Quantum Metric
Stay off the naughty list this holiday season by reducing customer friction, increasing conversions, and personalizing the shopping experience. Want a sneak peek? Visit us at quantummetric.com/podoffer and see if you qualify to receive our “12 Days of Insights” offer with code DATASCIENCE. This offer gives you 12-day access to our platform coupled with a bespoke insight report that will help you identify where customers are struggling or engaging in your digital product.
Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
Is reinforcement learning sufficient to build truly intelligent machines? Listen to this episode to find out.
Our Sponsors
Quantum Metric
Stay off the naughty list this holiday season by reducing customer friction, increasing conversions, and personalizing the shopping experience. Want a sneak peek? Visit us at quantummetric.com/podoffer and see if you qualify to receive our “12 Days of Insights” offer with code DATASCIENCE. This offer gives you 12-day access to our platform coupled with a bespoke insight report that will help you identify where customers are struggling or engaging in your digital product.
Amethix Technologies
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
Our Sponsor
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
References
Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. “Model Class Reliance: Variable importance measures for any machine learning model class, from the ‘Rashomon’ perspective.” http://arxiv.org/abs/1801.01489 (2018).
Remember the Netflix challenge?
It was a ton of money for the one who would have cracked the problem of recommending the best possible movie.
Was it a fair challenge? Did it work?
Let me tell you what happened...
Sponsors
Get one of the best VPNs at a massive discount with coupon code DATASCIENCE. It provides you with an 83% discount, which unlocks the best price on the market, plus 3 extra months for free. Here is the link https://surfshark.deals/DATASCIENCE
In this episode Fetch AI CTO Jonathan Ward speaks about decentralization, AI, blockchain for smart cities and the enterprise.
Below are some great links about collective learning, smart contracts in Rust, and the Fetch AI ecosystem.
Do you want to know the latest in big data analytics frameworks? Have you ever heard of Apache Arrow? Rust? Ballista? In this episode I speak with Andy Grove, one of the main authors of Apache Arrow and the Ballista compute engine.
Andy explains some challenges he faced while designing the Arrow and Ballista memory models, and describes some amazing solutions.
If building software is your passion, you’ll love ThoughtWorks Technology Podcast. It’s a podcast for techies by techies. Their team of experienced technologists take a deep dive into a tech topic that’s piqued their interest — it could be how machine learning is being used in astrophysics or maybe how to succeed at continuous delivery.
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with higher transparency and disintermediation, and builds the statistical models that will support your business.
References
https://github.com/ballista-compute/ballista
It has already made quite some noise in the news: GitHub Copilot promises to be your pair programmer for life.
In this episode I explain what GitHub Copilot does and how. Should developers be happy, scared, or just keep coding the traditional way?
Sponsors
Get one of the best VPNs at a massive discount with coupon code DATASCIENCE. It provides you with an 83% discount, which unlocks the best price on the market, plus 3 extra months for free. Here is the link https://surfshark.deals/DATASCIENCE
Get one of the best VPNs at a massive discount with coupon code DATASCIENCE. It provides you with an 83% discount, which unlocks the best price on the market, plus 3 extra months for free.
Here is the link https://surfshark.deals/DATASCIENCE
Data from the real world are never perfectly balanced. In this episode I explain a simple yet effective trick to train models with very unbalanced data. Enjoy the show!
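The notes don't spell the trick out, so as one plausible example of this kind of trick, here is a minimal sketch of class re-weighting with scikit-learn: the loss contribution of each class is scaled inversely to its frequency, so the rare class is not drowned out.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# A 98/2 imbalanced toy dataset.
X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" re-weights each class inversely to its frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))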
Get one of the best VPNs at a massive discount with coupon code DATASCIENCE. It gives you an 83% discount, which unlocks the best price on the market, plus 3 extra months for free. Here is the link https://surfshark.deals/DATASCIENCE
References
In this episode I have a really interesting conversation with Karan Grewal, member of the research staff at Numenta, where he investigates how biological principles of intelligence can be translated into silicon.
We speak about the thousand brains theory and why neural networks forget.
References
Delivering unstoppable data to unstoppable apps is now possible with the Streamr Network.
Streamr is a layer zero protocol for real-time data which powers the decentralized Streamr pub/sub network. The technology works in tandem with companion blockchains - currently Ethereum and xDai chain - which are used for identity, security and payments. On top is the application layer, including the Data Union framework, Marketplace and Core, and all third party applications.
In this episode I have a very interesting conversation with Streamr founder and CEO Henri Pihkala
Our Sponsor
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with greater transparency and disintermediation, and builds the statistical models that will support your business.
If you think that knowing Tensorflow and Scikit-learn is enough, think again.
MLOps is one of those trendy terms today.
What is MLOps and why is it important?
In this episode I speak about the undeniable evolution of the data scientist over the last 5-10 years.
Sponsors
If building software is your passion, you’ll love ThoughtWorks Technology Podcast. It’s a podcast for techies by techies. Their team of experienced technologists take a deep dive into a tech topic that’s piqued their interest — it could be how machine learning is being used in astrophysics or maybe how to succeed at continuous delivery.
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with greater transparency and disintermediation, and builds the statistical models that will support your business.
Is there a gap between life sciences and data science?
What's the situation when it comes to interdisciplinary research?
In this episode I am with Laura Harris, Director of Training for the Institute of Cyber-Enabled Research (ICER) at Michigan State University (MSU), and we try to answer some of those questions.
You can contact Laura at [email protected] or on LinkedIn
This episode is supported by Chapman’s Schmid College of Science and Technology, where master’s and PhD students join in cutting-edge research as they prepare to take the next big leap in their professional journey.
To learn more about the innovative tools and collaborative approach that distinguish the Chapman program in Computational and Data Sciences, visit chapman.edu/datascience
If building software is your passion, you’ll love ThoughtWorks Technology Podcast. It’s a podcast for techies by techies. Their team of experienced technologists take a deep dive into a tech topic that’s piqued their interest — it could be how machine learning is being used in astrophysics or maybe how to succeed at continuous delivery.
Links
In this episode I speak with Ritchie Vink, the author of Polars, a crate that is the fastest dataframe library at the time of speaking :) If you want to participate in an amazing Rust open source project, this is your chance to contribute to the official repository in the references.
References
https://github.com/ritchie46/polars
Do you want to know the latest in big data analytics frameworks? Have you ever heard of Apache Arrow? Rust? Ballista? In this episode I speak with Andy Grove, one of the main authors of Apache Arrow and the Ballista compute engine.
Andy explains some of the challenges he faced while designing the Arrow and Ballista memory models, and describes some amazing solutions.
This episode is supported by Chapman’s Schmid College of Science and Technology, where master’s and PhD students join in cutting-edge research as they prepare to take the next big leap in their professional journey.
To learn more about the innovative tools and collaborative approach that distinguish the Chapman program in Computational and Data Sciences, visit chapman.edu/datascience
If building software is your passion, you’ll love ThoughtWorks Technology Podcast. It’s a podcast for techies by techies. Their team of experienced technologists take a deep dive into a tech topic that’s piqued their interest — it could be how machine learning is being used in astrophysics or maybe how to succeed at continuous delivery.
References
https://github.com/ballista-compute/ballista
Pandas is the de-facto standard for data loading and manipulation. Python is the de-facto programming language for such operations. Rust is the underdog. Or is it?
In this episode I am showing you why that is no longer the case.
Our Sponsors
This episode is supported by Chapman’s Schmid College of Science and Technology, where master’s and PhD students join in cutting-edge research as they prepare to take the next big leap in their professional journey.
To learn more about the innovative tools and collaborative approach that distinguish the Chapman program in Computational and Data Sciences, visit chapman.edu/datascience
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with greater transparency and disintermediation, and builds the statistical models that will support your business.
Useful Links
https://github.com/haixuanTao/Data-Manipulation-Rust-Pandas
https://github.com/ritchie46/polars
https://github.com/rust-ndarray/ndarray
In plain English, concurrent and parallel are synonyms. Not for a CPU. And definitely not for programmers. In this episode I summarize the ways to parallelize on different architectures and operating systems.
In my humble opinion, rock-star data scientists must know how concurrency works and when to use it.
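As a tiny illustration of the difference (not from the episode; Python shown since it is what most data scientists reach for): threads in CPython are concurrent but, for CPU-bound work, not parallel because of the GIL, while separate processes do run in parallel across cores.

```python
# Threads vs processes on a CPU-bound task.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_bound(n: int) -> int:
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    work = [5_000_000] * 8
    with ThreadPoolExecutor() as pool:    # concurrent, but serialized by the GIL
        thread_results = list(pool.map(cpu_bound, work))
    with ProcessPoolExecutor() as pool:   # truly parallel across CPU cores
        process_results = list(pool.map(cpu_bound, work))
```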
Our Sponsors
This episode is supported by Chapman’s Schmid College of Science and Technology, where master’s and PhD students join in cutting-edge research as they prepare to take the next big leap in their professional journey.
To learn more about the innovative tools and collaborative approach that distinguish the Chapman program in Computational and Data Sciences, visit chapman.edu/datascience
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with greater transparency and disintermediation, and builds the statistical models that will support your business.
Useful Links
http://web.mit.edu/6.005/www/fa14/classes/17-concurrency/
https://doc.rust-lang.org/book/ch16-00-concurrency.html
https://urban-institute.medium.com/using-multiprocessing-to-make-python-code-faster-23ea5ef996ba
This is one of the most dynamic and fascinating topics: API technologies for machine learning.
It's always fun to build ML models. But how about serving them in the real world? In this episode I speak about three must-know technologies to place your model behind an API.
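The three technologies are named in the episode; as a flavour of what "a model behind an API" looks like, here is a minimal sketch using FastAPI (one popular option, not necessarily one of the three), with a placeholder where a real trained model would go.

```python
# A model behind an API, sketched with FastAPI.
# Run with: uvicorn app:app --reload
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Features(BaseModel):
    values: List[float]

# In a real service you would load a trained model here, e.g. with joblib.
@app.post("/predict")
def predict(features: Features):
    score = sum(features.values)  # placeholder so the sketch runs as-is
    return {"prediction": score}
```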
Our Sponsors
This episode is supported by Chapman’s Schmid College of Science and Technology, where master’s and PhD students join in cutting-edge research as they prepare to take the next big leap in their professional journey.
To learn more about the innovative tools and collaborative approach that distinguish the Chapman program in Computational and Data Sciences, visit chapman.edu/datascience
If building software is your passion, you’ll love ThoughtWorks Technology Podcast. It’s a podcast for techies by techies. Their team of experienced technologists take a deep dive into a tech topic that’s piqued their interest — it could be how machine learning is being used in astrophysics or maybe how to succeed at continuous delivery.
Links
The financial system is changing. It is becoming more efficient and integrated with many more services making our life more... digital. Is the old banking system doomed to fail? Or will it just be disrupted by the smaller players of the fintech industry?
In this episode we answer some of these fundamental questions with Alessandro E. Hatami from Pacemakers
Subscribe to the Newsletter and come chat with us on the official Discord channel
Our Sponsors
This episode is supported by Chapman’s Schmid College of Science and Technology, where master’s and PhD students join in cutting-edge research as they prepare to take the next big leap in their professional journey.
To learn more about the innovative tools and collaborative approach that distinguish the Chapman program in Computational and Data Sciences, visit chapman.edu/datascience
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with greater transparency and disintermediation, and builds the statistical models that will support your business.
Have you clicked the button? Accepted the new terms?
It's time we have a talk.
In this podcast I get inspired by Paul Done's presentation about The Six Principles for Building Robust Yet Flexible Shared Data Applications, and show how powerful a language Rust is while still maintaining the flexibility of less strict languages.
Our Sponsor
This episode is supported by Chapman’s Schmid College of Science and Technology, where master's and PhD students join in cutting-edge research as they prepare to take the next big leap in their professional journey.
To learn more about the innovative tools and collaborative approach that distinguish the Chapman program in Computational and Data Sciences, visit chapman.edu/datascience
In this episode I explain the basics of computer architecture and introduce some features of the Apple M1
Is it good for Machine Learning tasks?
References
In this episode I speak with Daniel McKenna about Rust, machine learning and artificial intelligence.
You can find Daniel from
Don't forget to come join me in our Discord channel speaking about all things data science.
Subscribe to the official Newsletter and never miss an episode
Let's finish this year with an amazing episode about scaling ML with clusters and GPUs. Kind of as a continuation of Episode 112, I have a terrific conversation with Aaron Richter from Saturn Cloud about, well, making ML faster and scaling it to massive infrastructure.
Aaron can be reached on his website https://rikturr.com and Twitter @rikturr
Our Sponsor
Saturn Cloud is a data science and machine learning platform for scalable Python analytics. Users can jump into cloud-based Jupyter and Dask to scale Python for big data using the libraries they know and love, while leveraging Docker and Kubernetes so that work is reproducible, shareable, and ready for production.
Try Saturn Cloud for free at https://saturncloud.io
Twitter: @saturn_cloud
What is data ethics? In this episode I have an interesting chat with Denny Wong from FaqBot and Muna.
Our Sponsor
Amethix uses advanced Artificial Intelligence and Machine Learning to build data platforms and predictive engines in domains like finance, healthcare, pharmaceuticals, logistics, and energy. Amethix provides solutions to collect and secure data with greater transparency and disintermediation, and builds the statistical models that will support your business.
References
Come join me in our Discord channel speaking about all things data science.
Subscribe to the official Newsletter and never miss an episode
Follow me on Twitch during my live coding sessions usually in Rust and Python
Our Sponsors
In this episode Adam Leon Smith, CTO of DragonFly and an expert in data regulations, explains some of the consequences of Schrems II for data transfers from the EU to the US.
For very interesting references and a practical example, subscribe to our Newsletter
Come join me in our Discord channel speaking about all things data science.
Subscribe to the official Newsletter and never miss an episode
Follow me on Twitch during my live coding sessions usually in Rust and Python
Our Sponsors
Come join me in our Discord channel speaking about all things data science.
Follow me on Twitch during my live coding sessions usually in Rust and Python
Subscribe to the official Newsletter and never miss an episode
Our Sponsors
Come join me in our Discord channel speaking about all things data science.
Follow me on Twitch during my live coding sessions usually in Rust and Python
Our Sponsors
Dataset distillation (official paper)
Come join me in our Discord channel speaking about all things data science.
Follow me on Twitch during my live coding sessions usually in Rust and Python
Our Sponsors
Come join me in our Discord channel speaking about all things data science.
Follow me on Twitch during my live coding sessions usually in Rust and Python
Our Sponsors
Come join me in our Discord channel speaking about all things data science.
Follow me on Twitch during my live coding sessions usually in Rust and Python
Our Sponsors
Come join me in our Discord channel speaking about all things data science.
Follow me on Twitch during my live coding sessions usually in Rust and Python
References
A Simple Framework for Contrastive Learning of Visual Representations
Come join me in our Discord channel speaking about all things data science.
Follow me on Twitch during my live coding sessions usually in Rust and Python
This episode is supported by Monday.com
The Monday Apps Challenge is bringing developers around the world together to compete in order to build apps that can improve the way teams work together on monday.com.
Let's talk about federated learning. Why is it important? Why are large organizations not ready for it yet?
Come join me in our Discord channel speaking about all things data science.
Follow me on Twitch during my live coding sessions usually in Rust and Python
This episode is supported by Monday.com
The Monday Apps Challenge is bringing developers around the world together to compete in order to build apps that can improve the way teams work together on monday.com.
Come join me in our Discord channel speaking about all things data science.
Follow me on Twitch during my live coding sessions usually in Rust and Python
This episode is supported by Monday.com
Monday.com brings teams together so you can plan, manage, and track everything your team is working on in one centralized place.
The monday Apps Challenge is bringing developers around the world together to compete in order to build apps that can improve the way teams work together on monday.com.
Come join me in our Discord channel speaking about all things data science.
Follow me on Twitch during my live coding sessions usually in Rust and Python
This episode is supported by Women in Tech by Manning Conferences
Hey there! Having the best time of my life ;)
This is the first episode I record while I am live on my new Twitch channel :) So much fun!
Feel free to follow me for the next live streaming. You can also see me coding machine learning stuff in Rust :))
Don't forget to jump on the usual Discord and have a chat
I'll see you there!
In this episode I speak with Adam Leon Smith, CTO at DragonFly and an expert in testing strategies for software and machine learning.
We cover testing with deep learning (neuron coverage, threshold coverage, sign change coverage, layer coverage, etc.), combinatorial testing and their practical aspects.
On September 15th there will be a live@Manning Rust conference. In one Rust-full day you will attend many talks about what's special about Rust, building high-performance web services or video games, WebAssembly, and much more.
If you want to meet the tribe, tune in on September 15th to the live@Manning Rust conference.
In this episode I speak with Adam Leon Smith, CTO at DragonFly and an expert in testing strategies for software and machine learning.
On September 15th there will be a live@Manning Rust conference. In one Rust-full day you will attend many talks about what's special about Rust, building high-performance web services or video games, WebAssembly, and much more.
If you want to meet the tribe, tune in on September 15th to the live@Manning Rust conference.
After deep learning, a new entry is about ready to go on stage. The usual journalists are warming up their keyboards for blogs, news feeds, tweets... in one word: hype.
This time it's all about privacy and data confidentiality. The new words: homomorphic encryption.
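To make the term less abstract before the hype arrives, here is a tiny sketch of what "homomorphic" means in practice, using the python-paillier library (an additively homomorphic scheme; fully homomorphic encryption also supports multiplication on ciphertexts).

```python
# Additively homomorphic encryption with python-paillier (pip install phe):
# two numbers are added while still encrypted.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()
a, b = public_key.encrypt(3.5), public_key.encrypt(1.5)
total = a + b                      # the addition happens on ciphertexts
print(private_key.decrypt(total))  # 5.0
```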
Join and chat with us on the official Discord channel.
Sponsors
This episode is supported by Amethix Technologies.
Amethix works to create and maximize the impact of the world’s leading corporations, startups, and nonprofits, so they can create a better future for everyone they serve. They are a consulting firm focused on data science, machine learning, and artificial intelligence.
References
Towards a Homomorphic Machine Learning Big Data Pipeline for the Financial Services Sector
In this episode I speak about a testing methodology for machine learning models that are supposed to be integrated into production environments.
Don't forget to come chat with us in our Discord channel
Enjoy the show!
--
This episode is supported by Amethix Technologies.
Amethix works to create and maximize the impact of the world’s leading corporations, startups, and nonprofits, so they can create a better future for everyone they serve. They are a consulting firm focused on data science, machine learning, and artificial intelligence.
The hype around GPT-3 is alarming and paints an awful picture of people misunderstanding artificial intelligence. In response to some comments claiming that GPT-3 will take developers' jobs, in this episode I express some personal opinions about the state of AI in generating source code (and GPT-3 in particular).
If you have comments about this episode or just want to chat, come join us on the official Discord channel.
This episode is supported by Amethix Technologies.
Amethix works to create and maximize the impact of the world’s leading corporations, startups, and nonprofits, so they can create a better future for everyone they serve. They are a consulting firm focused on data science, machine learning, and artificial intelligence.
There is definitely room for improvement in the stochastic gradient descent family of algorithms. In this episode I explain a relatively simple method that has been shown to improve on the Adam optimizer. But watch out! This approach does not generalize well.
Join our Discord channel and chat with us.
References
In this episode I speak about data transformation frameworks available for the data scientist who writes Python code.
The usual suspect is clearly Pandas, the most widely used library and de-facto standard. However, when data volumes increase and distributed algorithms are in place (following a map-reduce paradigm of computation), Pandas no longer performs as expected. Other frameworks play a role in such contexts.
In this episode I explain the frameworks that are the best equivalents to Pandas in big-data contexts (a minimal sketch follows the references below).
Don't forget to join our Discord channel and comment previous episodes or propose new ones.
This episode is supported by Amethix Technologies
Amethix works to create and maximize the impact of the world’s leading corporations, startups, and nonprofits, so they can create a better future for everyone they serve. Amethix is a consulting firm focused on data science, machine learning, and artificial intelligence.
References
Pandas, a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool - https://pandas.pydata.org/
Modin - Scale your pandas workflows by changing one line of code - https://github.com/modin-project/modin
Dask advanced parallelism for analytics https://dask.org/
Ray is a fast and simple framework for building and running distributed applications https://github.com/ray-project/ray
RAPIDS - GPU data science https://rapids.ai/
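As a minimal sketch of how the drop-in alternatives listed above differ from plain Pandas (illustrative only; file and column names are made up):

```python
import pandas as pd              # eager, single-core
# import modin.pandas as pd      # Modin: same API, parallel execution

import dask.dataframe as dd      # Dask: lazy, partitioned dataframes

df = dd.read_csv("data-*.csv")             # nothing is read yet
result = df.groupby("key").value.mean()    # builds a task graph
print(result.compute())                    # executes, possibly on a cluster
```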
In this episode I speak with Filip Piekniewski about some of the most noteworthy findings in AI and machine learning in 2019. As a matter of fact, the entire field of AI has been inflated by hype and claims that are hard to believe. A lot of the promises made a few years ago have proved quite hard to achieve, if not impossible. Let's stay grounded and realistic about the potential of this amazing field of research, so as not to bring disillusionment in the near future.
Join us on our Discord channel to discuss your favorite episode and propose new ones.
This episode is brought to you by Protonmail
Click on the link in the description or go to protonmail.com/datascience and get 20% off their annual subscription.
In this episode I make a non-exhaustive list of machine learning tools and frameworks written in Rust. Not all of them are mature enough for production environments. I believe that community effort can change this very quickly.
To make a comparison with the Python ecosystem I will cover frameworks for linear algebra (numpy), dataframes (pandas), off-the-shelf machine learning (scikit-learn), deep learning (tensorflow) and reinforcement learning (openAI).
Rust is the language of the future.
Happy coding!
In the 3rd episode of Rust and machine learning I speak with Alec Mocatta.
Alec is a professional programmer with 20+ years of experience who has been working at the intersection of distributed systems and data analytics. He's the founder of two startups in the distributed systems space and the author of Amadeus, an open-source framework that encourages you to write clean and reusable code that works, regardless of data scale, locally or distributed across a cluster.
Only for June 24th, LDN *Virtual* Talks June 2020 with Bippit (Alec speaking about Amadeus)
In the second episode of Rust and Machine Learning I am speaking with Luca Palmieri, who has spent a large part of his career at the intersection of machine learning and data engineering.
In addition, Luca contributed to several projects closer to the machine learning community using the Rust programming language. Linfa is an ambitious project that definitely deserves the attention of the data science community (and it's written in Rust, with Python bindings! How cool??!).
References
This is the first episode of a series about the Rust programming language and the role it can play in the machine learning field.
Rust is one of the most beautiful languages I have ever studied so far. I personally come from the C programming language, though for professional activities in machine learning I had to switch to the loved and hated Python language.
This episode is clearly not providing you with an exhaustive list of the benefits of Rust, nor of its capabilities. For that you can check the references and start getting familiar with what I think is going to be the language of the next 20 years.
Sponsored
This episode is supported by Pryml Technologies. Pryml offers secure and cost-effective data privacy solutions for your organisation. It generates a synthetic alternative without disclosing your confidential data.
References
In this episode I have a chat with Sandeep Pandya, CEO at Everguard.ai, a company that uses sensor fusion, computer vision and more to provide safer working environments to workers in heavy industry.
Sandeep is a senior executive who can hide the complexity of the topic with great talent.
This episode is supported by Pryml.io
Pryml is an enterprise-scale platform to synthesise data and deploy applications built on that data back to a production environment.
Test ideas. Launch new products. Fast. Secure.
As a continuation of the previous episode, in this one I cover the topic of compressing deep learning models and explain another simple yet fantastic approach that can lead to much smaller models that still perform as well as the original one.
Don't forget to join our Slack channel and discuss previous episodes or propose new ones.
This episode is supported by Pryml.io
Pryml is an enterprise-scale platform to synthesise data and deploy applications built on that data back to a production environment.
References
Comparing Rewinding and Fine-tuning in Neural Network Pruning
https://arxiv.org/abs/2003.02389
Using large deep learning models on limited hardware or edge devices is definitely prohibitive. There are methods to compress large models by orders of magnitude and maintain similar accuracy during inference.
In this episode I explain one of the first methods: knowledge distillation
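For the curious, here is a minimal PyTorch sketch of the classic distillation loss (Hinton-style soft targets; a simplification, not necessarily the exact recipe discussed in the episode).

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: match the teacher's probabilities at temperature T.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * T * T  # rescale so gradients stay comparable with the hard loss
    # Hard targets: the usual cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```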
Come join us on Slack
Reference
Covid-19 is an emergency. True. Let's just not create another emergency, this time about privacy violations, when this one is over.
Join our new Slack channel
This episode is supported by Proton. You can check them out at protonmail.com or protonvpn.com
Whenever people reason about probability of events, they have the tendency to consider average values between two extremes.
In this episode I explain why such a way of approximating is wrong and dangerous, with a numerical example.
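The episode's numerical example isn't reproduced in these notes, but here is one classic illustration of the danger (Jensen's inequality in disguise): with a nonlinear payoff, the payoff of the average scenario can be wildly different from the average payoff.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.choice([0.5, 1.5], size=100_000)   # two equally likely extremes
payoff = lambda v: v ** 10                 # strongly nonlinear payoff

print(payoff(x.mean()))   # ~1.0  : payoff of the average scenario
print(payoff(x).mean())   # ~28.8 : the actual expected payoff
```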
We are moving our community to Slack. See you there!
In this episode I briefly explain the concept behind activation functions in deep learning. One of the most widely used activation function is the rectified linear unit (ReLU).
While there are several flavors of ReLU in the literature, in this episode I speak about a very interesting approach that keeps computational complexity low while improving performance quite consistently.
This episode is supported by pryml.io. At pryml we let companies share confidential data. Visit our website.
Don't forget to join us on discord channel to propose new episode or discuss the previous ones.
References
Dynamic ReLU
https://arxiv.org/abs/2003.10027
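For readers who want the gist in code, here is a rough PyTorch sketch of the idea (a simplified DY-ReLU-A variant, not the authors' reference implementation): instead of a fixed max(x, 0), the activation is a max over k linear functions whose coefficients are predicted from the input itself by a tiny squeeze-and-excitation-style branch.

```python
import torch
import torch.nn as nn

class DynamicReLU(nn.Module):
    def __init__(self, channels: int, k: int = 2, reduction: int = 8):
        super().__init__()
        self.k = k
        self.hyper = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 2 * k),
        )
        # start close to a standard ReLU: slopes (1, 0, ...), intercepts 0
        self.register_buffer("base", torch.tensor([1.0] + [0.0] * (2 * k - 1)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        residual = 2 * torch.sigmoid(self.hyper(x)) - 1      # in [-1, 1]
        theta = self.base + residual                         # (B, 2k)
        a = theta[:, : self.k].reshape(-1, 1, self.k, 1, 1)  # slopes
        b = theta[:, self.k :].reshape(-1, 1, self.k, 1, 1)  # intercepts
        # max over k input-dependent linear functions a*x + b
        return (x.unsqueeze(2) * a + b).max(dim=2).values
```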
One of the best features of neural networks and machine learning models is to memorize patterns from training data and apply those to unseen observations. That's where the magic is.
However, there are scenarios in which the same machine learning models learn patterns so well that they can disclose some of the data they have been trained on. This phenomenon goes under the name of unintended memorization and it is extremely dangerous.
Think about a language generator that discloses the passwords, the credit card numbers and the social security numbers of the records it has been trained on. Or more generally, think about a synthetic data generator that can disclose the training data it is trying to protect.
In this episode I explain why unintended memorization is a real problem in machine learning. Apart from differentially private training, there is no other way to mitigate such a problem in realistic conditions.
At Pryml we are very aware of this. Which is why we have been developing a synthetic data generation technology that is not affected by such an issue.
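For reference, the paper's "exposure" metric fits in a few lines; log_perplexity below is a hypothetical helper wrapping your language model, not a real API.

```python
import numpy as np

def exposure(canary, candidates, log_perplexity):
    # Rank of the inserted canary among random candidate secrets,
    # by model log-perplexity (lower = more strongly memorized).
    scores = np.array([log_perplexity(c) for c in candidates])
    rank = 1 + np.sum(scores < log_perplexity(canary))
    return np.log2(len(candidates)) - np.log2(rank)
```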
This episode is supported by Harmonizely.
Harmonizely lets you build your own unique scheduling page based on your availability so you can start scheduling meetings in just a couple minutes.
Get started by connecting your online calendar and configuring your meeting preferences.
Then, start sharing your scheduling page with your invitees!
References
The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks
https://www.usenix.org/conference/usenixsecurity19/presentation/carlini
In this episode I explain a very effective technique that allows one to infer whether any record at hand was part of the (private) training dataset used to train the target model. The effectiveness of this technique comes from the fact that it works on black-box models, with no access to the data used for training, nor to model parameters and hyperparameters. Such a scenario is very realistic and typical of machine-learning-as-a-service APIs.
This episode is supported by pryml.io, a platform I am personally working on that enables data sharing without giving up confidentiality.
As promised, below is the schema of the attack explained in the episode.
References
Membership Inference Attacks Against Machine Learning Models
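As a rough sketch of the shadow-model idea from the paper (scikit-learn used for brevity; all names and model choices here are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def attack_features(model, X):
    # The attack only sees the target's output probabilities (black box);
    # sorting makes the feature vector class-agnostic.
    return np.sort(model.predict_proba(X), axis=1)

def train_attack(shadow_X, shadow_y, n_shadows=5):
    feats, members = [], []
    for _ in range(n_shadows):
        X_in, X_out, y_in, _ = train_test_split(shadow_X, shadow_y, test_size=0.5)
        shadow = RandomForestClassifier().fit(X_in, y_in)   # mimics the target
        feats += [attack_features(shadow, X_in), attack_features(shadow, X_out)]
        members += [np.ones(len(X_in)), np.zeros(len(X_out))]
    # The attack model predicts P(record was in the training set).
    return RandomForestClassifier().fit(np.vstack(feats), np.concatenate(members))
```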
Masking, obfuscating, stripping, shuffling.
All of the above techniques try to do one simple thing: keep the data private while sharing it with third parties. Unfortunately, they are not the silver bullet of confidentiality.
All the players in the synthetic data space rely on simplistic techniques that are not secure, might not be compliant, and are risky for production.
At pryml we do things differently.
There are very good reasons why a financial institution should never share its data. Actually, it should never even move its data. Ever.
In this episode I explain why.
Building reproducible models is essential for all those scenarios in which the lead developer is collaborating with other team members. Reproducibility in machine learning should not be an art; rather, it should be achieved via a methodical approach.
In this episode I give a few suggestions about how to make your ML models reproducible and keep your workflow smooth.
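The suggestions are in the episode; as a first taste, here is the most basic one in sketch form (seeding every source of randomness; real workflows also pin package versions and version the data).

```python
import os
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # frameworks add their own, e.g. torch.manual_seed(seed)
```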
Enjoy the show!
Come visit us on our discord channel and have a chat
Data science and data engineering are usually two different departments in organisations. Bridging the gap between the two is essential to success. Many times the brilliant applications created by data scientists don't find a match in production, just because they are not production-ready.
In this episode I have a talk with Daan Gerits, co-founder and CTO at Pryml.io
Why so much silence? Building a company! That's why :)
I am building pryml, a platform that allows data scientists to build their applications on data they cannot get access to.
This is the first of a series of episodes in which I will speak about the technology and the challenges we are facing while we build it.
Happy listening and stay tuned!
In the last episode of 2019 I speak with Filip Piekniewski about some of the most noteworthy findings in AI and machine learning in 2019. As a matter of fact, the entire field of AI has been inflated by hype and claims that are hard to believe. A lot of the promises made a few years ago have proved quite hard to achieve, if not impossible. Let's stay grounded and realistic about the potential of this amazing field of research, so as not to bring disillusionment in the near future.
Join us on our Discord channel to discuss your favorite episode and propose new ones.
I would like to thank all of you for supporting and inspiring us. I wish you a wonderful 2020!
Francesco and the team of Data Science at Home
This is the fourth and last episode of mini series "The dark side of AI".
I am your host Francesco and I'm with Chiara Tonini from London. The title of today's episode is "Bias in the machine".
C: Francesco, today we are starting with an infuriating discussion. Are you ready to be angry?
F: yeah sure is this about brexit?
No, I'm not talking about that. In 1986, New York City's Rockefeller University conducted a study on breast and uterine cancers and their link to obesity. Like in all clinical trials up to that point, the subjects of the study were all men.
So Francesco, do you see a problem with this approach?
F: No problem at all, as long as those men had a perfectly healthy uterus.
In medicine, up to the end of the 20th century, medical studies and clinical trials were conducted on men, with medicine dosages and therapies calculated for men (white men). The female body has historically been considered an exception to, or variation from, the male body.
F: Like Eve coming from Adam’s rib. I thought we were past that...
When the female body has been under analysis, the focus was on the difference between it and the male body, the so-called “bikini approach”: the reproductive organs are different, therefore we study those, and those only. For a long time medicine assumed this was the only difference.
Oh good ...
This has led to a hugely harmful fallout across society. Because women had reproductive organs, they should reproduce, and all else about them was deemed uninteresting. Still today, a woman without children is somehow considered to have betrayed her biological destiny. This somehow does not apply to a man without children, who also has reproductive organs.
F: so this is an example of a very specific type of bias in medicine, regarding clinical trials and medical studies, that is not only harmful for the purposes of these studies, but has ripple effects in all of society
Only in the 2010s did a serious conversation start about the damage caused by not including women in clinical trials. There are many, many examples (which we list in the references for this episode).
Give me one
Researchers consider cardiovascular disease a male disease - they even call it “the widower”. They conduct studies on male samples. But it turns out, the symptoms of a heart attack, especially the ones leading up to one, are different in women. This led to doctors not recognising or dismissing the early symptoms in women.
F: I was reading that women are also subject to chronic pain much more than men: for example migraines, and pain related to endometriosis. But there is extensive evidence now of doctors dismissing women’s pain, as either imaginary, or “inevitable”, like it is a normal state of being and does not need a cure at all.
The failure of the medical community as a whole to recognise this obvious bias up to the 21st century is an example of how insidious the problem of bias is.
There are 3 fundamental types of bias:
Bias is a warping of our understanding of reality. We see reality through the lens of our experience and our culture. The origin of bias can date back to traditions going back centuries, and is so ingrained in our way of thinking, that we don’t even see it anymore.
F: And let me add, when it comes to machine learning, we see reality through the lens of data. Bias is everywhere, and we could spend hours and hours talking about it. It’s complicated.
It’s about to become more complicated.
F: of course, if I know you…
Let’s throw artificial intelligence in the mix.
F: You know, there was a happier time when this sentence didn’t fill me with a sense of dread...
ImageNet is an online database of over 14 million photos, compiled more than a decade ago at Stanford University. They used it to train machine learning algorithms for image recognition and computer vision, and it played an important role in the rise of deep learning. We've all played with it, right? The cats and dogs classifier when learning Tensorflow? (I am a dog, by the way.)
F: ImageNet has been a critical asset for computer-vision research. There was an annual international competition to create algorithms that could most accurately label subsets of images.
In 2012, a team from the University of Toronto used a Convolutional Neural Network to handily win the top prize. That moment is widely considered a turning point in the development of contemporary AI. The final year of the ImageNet competition was 2017, and accuracy in classifying objects in the limited subset had risen from 71% to 97%. But that subset did not include the “Person” category, where the accuracy was much lower...
ImageNet contained photos of thousands of people, with labels. This included straightforward tags like “teacher,” “dancer” and “plumber”, as well as highly charged labels like “failure, loser” and “slut, slovenly woman, trollop.”
F: Uh Oh.
Then “ImageNet Roulette” was created, by an artist called Trevor Paglen and a Microsoft researcher named Kate Crawford. It was a digital art project, where you could upload your photo and let the classifier identify you, based on the labels of the database. Imagine how well that went.
F: I bet it didn't work
Of course it didn’t work. Random people were classified as “orphans” or “non-smoker” or “alcoholic”. Somebody with glasses was a “nerd”. Tabong Kima, a 24-year old African American, was classified as “offender” and “wrongdoer”.
F: and there it is.
Quote from Trevor Paglen: “We want to show how layers of bias and racism and misogyny move from one system to the next. The point is to let people see the work that is being done behind the scenes, to see how we are being processed and categorized all the time.”
F: The ImageNet labels were applied by thousands of unknown people, most likely in the United States, hired by the team from Stanford and working through the crowdsourcing service Amazon Mechanical Turk. They earned pennies for each photo they labeled, churning through hundreds of labels an hour. The labels were not verified in any way: if a labeler thought someone looked "shady", this label is just a result of their prejudice, but has no basis in reality.
As they did, biases were baked into the database. Paglen quote again: “The way we classify images is a product of our worldview,” he said. “Any kind of classification system is always going to reflect the values of the person doing the classifying.” They defined what a “loser” looked like. And a “slut.” And a “wrongdoer.”
F: The labels originally came from another sprawling collection of data called WordNet, a kind of conceptual dictionary for machines built by researchers at Princeton University in the 1980s. But with these inflammatory labels included, the Stanford researchers may not have realized what they were doing.
What is happening here is the transferring of bias from one system to the next.
Tech jobs, in past decades but still today, predominantly go to white males from a narrow social class. Inevitably, they imprint the technology with their worldview. So their algorithms learn that a person of color is a criminal, and a woman with a certain look is a slut.
I’m not saying they do it on purpose, but the lack of diversity in the tech industry translates into a narrower world view, which has real consequences in the quality of AI systems.
F: Diversity in tech teams is often framed as an equality issue (which of course it is), but there are enormous advantages to it: it creates the cognitive diversity that will be reflected in superior products or services.
I believe this is an ongoing problem. In recent months, researchers have shown that face-recognition services from companies like Amazon, Microsoft and IBM can be biased against women and people of color.
Crawford and Paglen argue this:
“In many narratives around AI it is assumed that ongoing technical improvements will resolve all problems and limitations.
But what if the opposite is true? What if the challenge of getting computers to “describe what they see” will always be a problem? The automated interpretation of images is an inherently social and political project, rather than a purely technical one. Understanding the politics within AI systems matters more than ever, as they are quickly moving into the architecture of social institutions: deciding whom to interview for a job, which students are paying attention in class, which suspects to arrest, and much else.”
F: You are using the words “interpretation of images” here, as opposed to “description” or “classification”. Certain images depict something concrete, with an objective reality. Like an apple. But other images… not so much?
ImageNet contains images corresponding only to nouns (not verbs, for example). Noun categories such as "apple" are well defined.
But not all nouns are created equal. Linguist George Lakoff points out that the concept of an “apple” is more nouny than the concept of “light”, which in turn is more nouny than a concept such as “health.”
Nouns occupy various places on an axis from concrete to abstract, and from descriptive to judgmental. The images corresponding to these nouns become more and more ambiguous.
These gradients have been erased in the logic of ImageNet. Everything is flattened out and pinned to a label.
The results can be problematic, illogical, and cruel, especially when it comes to labels applied to people.
F: so when an image is interpreted as Drug Addict, Crazy, Hypocrite, Spinster, Schizophrenic, Mulatto, Red Neck… this is not an objective description of reality, it’s somebody’s worldview coming to the surface.
The selection of images for these categories skews the meaning in ways that are gendered, racialized, ableist, and ageist. ImageNet is an object lesson in what happens when people are categorized like objects.
And this practice has only become more common in recent years, often inside the big AI companies, where there is no way for outsiders to see how images are being ordered and classified.
The bizarre thing about these systems is that they remind us of early 20th-century criminologists like Lombroso, or phrenologists (including Nazi scientists), and physiognomy in general. This was a discipline founded on the assumption that there is a relationship between an image of a person and the character of that person. If you are a murderer, or a Jew, the shape of your head, for instance, will supposedly tell.
F: In reaction to these ideas, René Magritte produced that famous painting of the pipe with the tag "This is not a pipe".
You know that famous photograph of the soldier kissing the nurse at the end of the Second World War? The nurse went public about it when she was around 90 years old, and told how this total stranger in the street had grabbed her and kissed her. This is a picture of sexual harassment. And knowing that, it does not seem romantic anymore.
F: not romantic at all indeed
Images do not describe themselves. This is a feature that artists have explored for centuries. We see those images differently when we see how they’re labeled. The correspondence between image, label, and referent is fluid. What’s more, those relations can change over time as the cultural context of an image shifts, and can mean different things depending on who looks, and where they are located. Images are open to interpretation and reinterpretation. Entire subfields of philosophy, art history, and media theory are dedicated to teasing out all the nuances of the unstable relationship between images and meanings.
The common mythos of AI and the data it draws on, is that they are objectively and scientifically classifying the world. But it’s not true, everywhere there is politics, ideology, prejudices, and all of the subjective stuff of history.
F: When we survey the most widely used training sets, we find that this is the rule rather than the exception.
Training sets are the foundation on which contemporary machine-learning systems are built. They are central to how AI systems recognize and interpret the world.
By looking at the construction of these training sets and their underlying structures, we discover many unquestioned assumptions that are shaky and skewed. These assumptions inform the way AI systems work—and fail—to this day.
And the impenetrability of the algorithms, the impossibility of reconstructing the decision-making of a neural network, hides the bias further from scrutiny. When an algorithm is a black box and you can't look inside, you have no way of analysing its bias.
And the skewness and bias of these algorithms have real effects in society: the more you use AI in the judicial system, in medicine, the job market, in security systems based on facial recognition... the list goes on and on.
Last year Google unveiled BERT (Bidirectional Encoder Representations from Transformers). It’s an AI system that learns to talk: it’s a Natural Language Processing engine to generate written (or spoken) language.
F: we have an episode in which we explain all that
They trained it on lots and lots of digitized information, as varied as old books, Wikipedia entries and news articles. Decades and even centuries of biases (along with a few new ones) were baked into all that material. So for instance BERT is extremely sexist: it associates almost all professions and positive attributes with men (except for "mom").
BERT is widely used in industry and academia. For example, it can interpret news headlines automatically. Even Google's search engine uses it.
Try googling "CEO" and you get a gallery of images of old white men.
F: such a pervasive and flawed AI system can propagate inequality at scale. And it’s super dangerous because it’s subtle. Especially in industry, query results will not be tested and examined for bias. AI is a black box and researchers take results at face value.
There are many cases of algorithm-based discrimination in the job market. Targeting candidates for tech jobs for instance, may be done by algorithms that will not recognise women as potential candidates. Therefore, they will not be exposed to as many job ads as men. Or, automated HR systems will rank them lower (for the same CV) and screen them out.
In the US, algorithms are used to calculate bail. The majority of the prison population in the US is composed of people of colour, as a result of a systemic bias that goes back centuries. An algorithm learns that a person of colour is more likely to commit a crime, is more likely to not be able to afford bail, is more likely to violate parole. Therefore, people of colour will receive harsher punishments for the same crime. This amplifies this inequality at scale.
Conclusion
Question everything, never take predictions of your models at face value. Always question how your training samples have been put together, who put them together, when and in what context. Always remember that your model produces an interpretation of reality, not a faithful depiction.
Treat reality responsibly.
We always hear the word “metadata”, usually in a sentence that goes like this
Your Honor, I swear, we were not collecting users data, just metadata.
Usually the guy saying this sentence is Zuckerberg, but could be anybody from Amazon or Google. “Just” metadata, so no problem. This is one of the biggest lies about the reality of data collection.
F: Ok the first question is, what the hell is metadata?
Metadata is data about data.
F: Ok… still not clear.
Imagine you make a phone call to your mum. How often do you call your mum, Francesco?
F: Every day of course! (coughing)
Good boy! Ok, so let’s talk about today’s phone call. Let’s call “data” the stuff that you and your mum actually said. What did you talk about?
F: She was giving me the recipe for her famous lasagna.
So your mum’s lasagna is the DATA. What is the metadata of this phone call? The lasagna has data of its own attached to it: the date and time when the conversation happened, the duration of the call, the unique hardware identifiers of your phone and your mum’s phone, the identifiers of the two sim cards, the location of the cell towers that pinged the call, the GPS coordinates of the phones themselves.
F: yeah well, this lasagna comes with a lot of data :)
And this is assuming that this data is not linked to any other data like your Facebook account or your web browsing history. More of that later.
F: Whoa Whoa Whoa, ok. Let’s put a pin in that. Going back to the “basic” metadata that you describe. I think we understand the concept of data about data. I am sure you did your research and you would love to paint me a dystopian nightmare, as always. Tell us why is this a big deal?
Metadata is a very big deal. In fact, metadata is far more “useful” than the actual data, where by “useful” I mean that it allows a third party to learn about you and your whole life. What I am saying is, the fact that you talk with your mum every day for 15 minutes is telling me more about you than the content of the actual conversations. In a way, the content does not matter. Only the metadata matters.
F: Ok, can you explain this point a bit more?
Imagine this scenario: you work in an office in Brussels, and you go by car. Every day, you use your time in the car while you go home to call your mum. So every day around 6pm, a cell tower along the path from your office to your home pings a call from your phone to your mum’s phone. Someone who is looking at your metadata, knows exactly where you are while you call your mum. Every day you will talk about something different, and it doesn't really matter. Your location will come through loud and clear. A lot of additional information can be deduced from this too: for example, you are moving along a motorway, therefore you have a car. The metadata of a call to mum now becomes information on where you are at 6pm, and the way you travel.
F: I see. So metadata about the phone call is, in fact, real data about me.
Exactly. YOU are what is interesting, not your mum’s lasagna.
F: you say so because you haven’t tried my mum’s lasagna. But I totally get your point.
Now, imagine that one day, instead of going straight home, you decide to go somewhere else. Maybe you are secretly looking for another job. Your metadata is recording the fact that after work you visit the offices of a rival company. Maybe you are a journalist and you visit your anonymous source. Your metadata records wherever you go, and one of these places is your secret meeting with your source. Anyone’s metadata can be combined with yours. There will be someone who was with you at the time and place of your secret meeting. Anyone who comes in contact with you can be tagged and monitored. Now their anonymity has been reduced.
F: I get it. So, compared to the content of my conversation, its metadata contains more actionable information. And this is the most useful, and most precious, kind of information about me. What I do, what I like, who I am, beyond the particular conversation.
Precisely. If companies like Facebook or the phone companies had the explicit permission to collect all the users’ data, including all content of conversations, it’s still the metadata that would generate the most actionable information. They would probably throw the content of conversations away. In the vast majority of instances, the content does not matter. Unless you are an actual spy talking about state secrets, nobody cares.
F: Let's stay on the spy point for a minute. One could say: so what? I have heard this many times. So what if my metadata contains actionable information, and there are entities that collect it. If I am an honest person, I have nothing to hide.
There are two aspects to the problem of privacy. Government surveillance, and corporate - in other words private - surveillance.
Government surveillance is a topic that has been covered flawlessly by Edward Snowden in his book “Permanent Record”, and in the documentary about his activity, “Citizenfour”. Which I both recommend, and in fact I think every data scientist should read and watch.
Let’s just briefly mention the obvious: just because something comes from a government, it does not mean it’s legal or legitimate, or even ethical or moral. What if your government is corrupt, or authoritarian. What if you are a dissident and you are fighting for human rights. What if you are a journalist, trying to uncover government corruption.
F: In other words, it is a false equivalence to say that protecting your privacy has anything to do with having something to hide.
Mass surveillance of private citizens without cause is a danger to individual freedom as well as civil liberties. Government exists to serve its citizens, not the other way around. To freely paraphrase Snowden, as individuals have no power compared to the government, the only way the system works is if the government is completely transparent to the citizens, so that they can collectively change it, and at the same time the single citizens are opaque to the government, so that it cannot abuse its power. But today the opposite happens: we citizens are completely naked and exposed in front of a completely opaque government machine, with secret surveillance programs on us, that we don’t even know exist. We are not free to self-determine, or do anything about government power, really.
F: We could really talk for days and days about government mass surveillance. But let’s go back to metadata, and let’s talk about the commercial use of it. Metadata for sale. You mentioned this term, “corporate surveillance”. It sounds…. Ominous.
We live in privacy hell, Francesco.
F: I get that. According to your research, where can we find metadata?
First of all, metadata is everywhere. We are swimming in it. In each and every interaction between two people, that make use of digital technology, metadata is generated automatically, without the user’s consent. When two people interact, two machines also interact, recording the “context” of this interaction. Who we are, when, where, why, what we want.
F: And that doesn’t seem avoidable. In fact metadata must be generated by devices and software to just work properly. I look at it as an intrinsic component that cannot be removed from the communication system, whatever it is. The problem is who owns it. So tell me, who has such data?
It does not matter, because it’s all for sale. Which means, we are for sale.
F: Ok, holy s**t, this keeps getting darker. Let’s have a practical example, shall we?
Have you booked a flight recently?
F: Yep. I’m going to Berlin, and in fact so are you. For a hackathon, no less.
Have you ever heard of a company called Adara?
F: No… Cannot say that I have.
Adara is a “Predictive Traveler Intelligence” company.
F: sounds pretty pretentious. Kinda douchy.
This came up on the terrifying Twitter account of Wolfie Christl, author, among other things, of a great report about corporate surveillance for Cracked Labs. Go check him out on Twitter, he's great.
F: Sure I will add what I find to the show notes of this episode. Oh and by the way you can find all this stuff on datascienceathome.com
Sorry go ahead.
Adara collects data - metadata - about travel-related online searches, purchases, devices, passenger records, loyalty program records. Data from clients that include major airlines, major airports, hotel chains and car rental chains. It creates a profile, a “traveler graph” in real time, for 750 million people around the world. A profile based on personal identifiers.
F: uhh uhh Then what?
Then Adara sells these profiles.
F: Ok… I have to say, the box that I tick giving consent to the third-party use of my personal data when I use an airline website does not quite convey how far my data actually goes.
Consent. LOL. Adara calculates a: “traveler value score” based on customer behaviour and needs across the global travel ecosystem, over time.
The score is in the Salesforce Service Cloud, for sale to anyone.
This score, and your profile, determine the personalisation of travel offers and treatment, before purchase, during booking, post purchase, at check in, in airport, at destination.
On their own website, Adara explains how customer service agents for their myriad of clients - for example, a front desk agent at a hotel - can instantly see the traveler value score. Therefore they will treat you differently based on this score.
F: Oh so if you have money to spend they will treat you differently
The score is used to assess your potential value, to inform service and customer service strategies for you, as well as personalised messaging and relevant offers. And of course, the pricing you see when you look for flights. Low score? Prepare yourself to wait to have your call rerouted to a customer service agent. Would you ever tick a box to give consent to this?
F: F*ck no. How is this even legal? What about the GDPR?
It is, in fact, illegal. Adara is based in the US, but they collect data through data warehouses in the Netherlands. They claim they are GDPR-compliant. However, they collect all the data, and then decide on the specific business use, which is definitely not GDPR compliant.
F: exactly! According to the GDPR, the user has to know in advance the specific business use of the data they are giving consent for!!
With the GDPR and future regulations, there is a way to control how data is used and for what purpose. But regulations are still blurry or undefined when it comes to metadata. For example, there’s no regulation about the number of records in a database, or the timestamp at which such a record was created. As a matter of fact, data is useless without metadata.
One cannot even collect data without metadata.
WhatsApp, Telegram, Facebook Messenger... they all create metadata. So one might say “I’ve got end-to-end encryption, buddy”. Sure thing. How about the metadata attached to that encrypted gibberish nobody is really interested in? To show you how unavoidable the concept of metadata is: even Signal, developed by the Signal Foundation and considered the truly end-to-end and open-source protocol for confidential information exchange, can see metadata. At Signal they claim they just don’t keep it, as they also state in Signal’s privacy policy:
"Certain information (e.g. a recipient's identifier, an encrypted message body, etc.) is transmitted to us solely for the purpose of placing calls or transmitting messages. Unless otherwise stated below, this information is only kept as long as necessary to place each call or transmit each message, and is not used for any other purpose."
This is one of those issues that will have to be solved with legislation.
But like money laundering, your data is caught in a storm of transactions so intricate that at a certain point, how do you even check...
All participating companies share customer data with each other (a process called value exchange). They let marketers utilize the data, for example to target people after they have searched for flights or hotels. Adara creates audience segments and sells them, for example to Google, for advertisement targeting. The consumer data broker LiveRamp for example lists Adara as a data provider.
F: consumer data broker. I am starting to get what you mean when you say that we are for sale.
Let’s talk about LiveRamp, part of Acxiom.
F: there they go... Acxiom... I heard of them
They self-describe as an “Identity Resolution Platform”.
F: I mean, George Orwell would be proud.
Their mission? “To connect offline data and online data back to a single identifier”. In other words, clients can “resolve all” of their “offline and online identifiers back to the individual consumer”.
Various digital profiles, like the ones generated on social media or when you visit a website, are matched against databases containing names, postal addresses, email addresses, phone numbers, geolocations and IP addresses, and online and mobile identifiers, such as cookie and device IDs.
F: well, all this stuff is possible if and only if someone gets in possession of all these profiles, or well... they purchase them. Still, what the f**k.
A cute example? Imagine you register on any random website but you don’t want to give them your home address. They just buy it from LiveRamp, which gets it from your phone geolocation data - which is for sale. Where does your phone sit still for 12 hours every night? That’s your home address. Easy.
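To make the mechanics concrete, here is a hedged sketch of how trivially such an inference could be made from a raw geolocation log. This is illustrative Python, not anyone's actual pipeline; the column names, the night-time window and the ~100 m rounding are all assumptions.

import pandas as pd

# Hypothetical sketch: inferring a "home" location from phone geolocation logs.
# Assumes a DataFrame with columns: timestamp, lat, lon (all invented here).
def infer_home(df: pd.DataFrame) -> tuple:
    df = df.copy()
    df["hour"] = pd.to_datetime(df["timestamp"]).dt.hour
    # Keep only night-time pings, when most people are at home.
    night = df[(df["hour"] >= 22) | (df["hour"] < 6)]
    # Round coordinates to roughly 100 m cells and take the most frequent cell.
    cells = night.assign(lat=night["lat"].round(3), lon=night["lon"].round(3))
    return cells.groupby(["lat", "lon"]).size().idxmax()  # (lat, lon) of "home"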
F: And they definitely know how much time I spend at the gym, without even checking my Instagram! Ok, this is another level of creepy.
Clients of LiveRamp can upload their own consumer data to the platform, combine it with data from hundreds of third-party data providers, and then utilize it on more than 500 marketing technology platforms. They can use this data to find and target people with specific characteristics, to recognize and track consumers across devices and platforms, to profile and categorize them, to personalize content for them, and to measure how they behave. For example, clients could “recognize a website visitor” and “provide a customized offer” based on extensive profile data, without requiring said user to log in to the website. Furthermore, LiveRamp has a data store where other companies can “buy and sell valuable customer data”.
F: What is even the point of giving me the choice to consent to anything online?
In short, there is no point.
F: it seems we are so behind with regulations on data sharing. GDPR is not cutting it, not really. With programmatic advertising we have created a monster that has really grown out of control.
So: our lives are completely transparent to private corporations that constantly surveil us en masse and exploit all of our data to sell us s**t. How does this affect our freedom? How about we just don’t buy it? Can it be that simple? And I would not take no for an answer here.
Unfortunately, no.
F: oh crap!
I’m going to read you a passage from Permanent Record:
Who among us can predict the future? Who would dare to?
The answer to the first question is no one, really, and the answer to the second is everyone, especially every government and business on the planet. This is what that data of ours is used for. Algorithms analyze it for patterns of established behaviour in order to extrapolate behaviours to come, a type of digital prophecy that’s only slightly more accurate than analog methods like palm reading. Once you go digging into the actual technical mechanisms by which predictability is calculated, you come to understand that its science is, in fact, anti-scientific, and fatally misnamed: predictability is actually manipulation.
A website that tells you that because you liked book 1 then you might also like book 2, isn’t offering an educated guess as much as a mechanism of subtle coercion. We can’t allow ourselves to be used in this way, to be used against the future. We can’t permit our data to be used to sell us the very things that must not be sold, such as journalism. [....]
We can’t let the god-like surveillance we’re under be used to “calculate” our citizenship scores, or to “predict” our criminal activity; to tell us what kind of education we can have, or what kind of job we can have [...], to discriminate against us based on our financial, legal, and medical histories, not to mention our ethnicity or race, which are constructs that data often assumes or imposes.
[...] if we allow [our data] to be used to identify us, then it will be used to victimize us, even to modify us - to remake the very essence of our humanity in the image of the technology that seeks its control. Of course, all of the above has already happened.
F: In other words, we are surveilled and our data collected, and used to affect every aspect of our lives - what we read, what movies we watch, where we travel, what we buy, who we date, what we study, where we work… This is a self-fulfilling prophecy for all of humanity, and the prophet is a stupid, imperfect algorithm optimised just to make money.
So I guess my message of today for all Data Scientists out there is this: just… don't.
In 2017 a research group at the University of Washington did a study on the Black Lives Matter movement on Twitter. They constructed what they call a “shared audience graph” to analyse the different groups of audiences participating in the debate, and found an alignment of the groups with the political left and political right, as well as clear alignments with groups participating in other debates, like environmental issues, abortion issues and so on. In simple terms, someone who is pro-environment, pro-abortion, left-leaning, is also supportive of the Black Lives Matter movement, and vice versa.
F: Ok, this seems to make sense, right? But… I suspect there is more to this story?
So far, yes… What they did not expect to find, though, was a pervasive network of Russian accounts participating in the debate, which turned out to be orchestrated by the Internet Research Agency, the not-so-secret Russian secret service agency of internet black ops. The same agency allegedly connected with the US election and the Brexit referendum.
F: Are we talking about actual spies? Where are you going with this?
Basically, the Russian accounts (some of them human and some of them bots) were infiltrating all aspects of the debate, both on the left and on the right side, always taking the most extreme stances on any particular aspect of the debate. The aim was to radicalise the conversation, to make it more and more extreme, in a tactic of divide-and-conquer: turn the population against itself in an online civil war, and push for policies that would normally be considered too extreme (for instance, giving tanks to the police to control riots, forcing a curfew, trying to ban Muslims from the country). Chaos and unrest have repercussions on international trade and relations, and can align with foreign interests.
F: It seems like a pretty indirect and convoluted way of influencing a foreign power…
You might think so, but you are forgetting social media. This sort of operation is directly exploiting a core feature of internet social media platforms. And that feature, I am afraid, is recommender systems.
F: Whoa. Let’s take a step back. Let’s recap the general features of recommender systems, so we are on the same page.
The main purpose of recommender systems is to recommend to people the items that similar people have shown an interest in.
Let’s think about books and readers. The general idea is to find a way to predict the best book to the best reader. Amazon is doing it, Netflix is doing it, probably the bookstore down the road does that too, just on a smaller scale.
Some of the most common methods to implement recommender systems use concepts such as cosine/correlation similarity, matrix factorization, neural autoencoders and sequence predictors.
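For the curious, here is a minimal sketch of the first of those ideas, item-based recommendation with cosine similarity. It is a toy illustration with made-up ratings, not any platform's actual code.

import numpy as np

# Users x items rating matrix, 0 = not rated (toy data).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(ratings, axis=0)
sim = (ratings.T @ ratings) / np.outer(norms, norms)

def recommend(user: int, k: int = 1):
    scores = ratings[user] @ sim           # weight items by similarity to liked ones
    scores[ratings[user] > 0] = -np.inf    # never recommend what was already rated
    return np.argsort(scores)[::-1][:k]    # indices of the top-k items

print(recommend(0))  # -> [2], the only item user 0 has not rated yet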
The major issue of recommender systems lies in their validation. Even though validation occurs in a way that is similar to many machine learning methods, one should first recommend a set of items (in production) and then measure the efficacy of that recommendation. But recommending already alters the entire scenario, a bit in the flavour of Heisenberg’s uncertainty principle.
F: In the attention economy, the business model is to monetise the time the user spends on a platform, by showing them ads. Recommender systems are crucial for this purpose.
As you say, recommender systems exist because the business model of social media platforms is to monetise attention. The most effective way to keep users’ attention is to show them stuff they could show an interest in.
In order to do that, one must segment the audience to find the best content for each user. But then, for each user, how do you keep them engaged, and make them consume more content?
Spot on. To keep the user on the platform, you start by showing them content that they are interested in, and that agrees with their opinion.
But that is not all. How many videos of the same stuff can you watch, how many articles can you read? You must also escalate the content that the user sees, increasing the wow factor. The content goes from mild to extreme (conspiracy theories, hate speech etc).
The recommended content pushes the user opinion towards more extreme stances. It is hard to see from inside the bubble, but a simple experiment will show it. If you continue to click the first recommended video on YouTube, and you follow the chain of first recommended videos, soon you will find yourself watching stuff you’d never have actively looked for, like conspiracy theories, or alt-right propaganda (or pranks that get progressively more cruel, videos by people committing suicide, and so on).
F: So you are saying that this is not an accident: is this the basis of the optimisation of the recommender system?
Yes, and it’s very effective. But obviously there are consequences.
F: And I’m guessing they are not good.
The collective result of single users being pushed toward more radical stances is a radicalisation of the whole conversation, the disappearance of nuance in the argument, the trivialisation of complex issues. For example, the Brexit debate in 2016 was about trade deals and customs unions, and now it is about remain vs no deal, with almost nothing in between.
F: Yes, the conversation is getting stupider. Is this just a giant accident? Just a sensible system that got out of control?
Yes and no. Recommender systems originated as a tool for boosting commercial revenue, by selling more products. But applied to social media, they have caused an aberration: the recommendation of information, which leads to the so-called filter bubbles, the rise of fake news and disinformation, and the manipulation of the masses.
There is an intense debate in the scientific community about the polarising effects of the internet and social media on the population. One example is a paper by Johnson et al., which predicts that whether and how a population becomes polarised is dictated by the nature of the underlying competition, rather than by the validity of the information that individuals receive or by their online bubbles.
F: I would like to stress this finding, because it is really f*cked up. Polarisation is caused neither by the particular subject nor by the way a debate is conducted, but by how legitimate the information seems to the single person. Which means that if I find a way to convince single individuals about something, I will in fact be manipulating the debate at a community scale or, in some cases, globally!
Take for instance the people who believe that the Earth is flat. Or the time it took people to recognise global warming as scientific, despite the fact that the threshold for scientific confirmation had been reached decades ago.
F: So, recommender systems let loose on social media platforms amplify controversy, conflict, and fringe opinions. I know I’m not going to like the answer, but I’m going to ask the question anyway.
Last year, the European Data Protection Supervisor published a report on online manipulation at scale.
F: That does not sound good.
The online digital ecosystem has connected people across the world, with over 50% of the population on the Internet, albeit very unevenly in terms of geography, wealth and gender. The initial optimism about the potential of internet tools and social media for civic engagement has given way to concern that people are being manipulated. This happens through the combination of the constant harvesting of often intimate information about them, and control over the information they see online according to the category they are put into (so-called segmentation of the audience). Arguably since 2016, but probably before, mass manipulation at scale has occurred during democratic elections, by using algorithms to game recommender systems, among other things, and to spread misinformation. Remember Cambridge Analytica?
F: I remember. I wish I didn’t. But why does it work? Are we so easy to manipulate?
Here is an interesting point. When one receives information collectively, as for example from the television news, it is far less likely that one develops extreme views (like, the Earth is flat), because the discourse is based on a common understanding of reality. And people call out each other’s bulls*it.
F: Fair enough.
But when one receives information singularly, as happens via a recommender system through micro-targeting, then reality has a different manifestation for each audience member, with no common ground. It becomes far more likely to adopt extreme views, because there is no way to fact-check, and because the news feels personal. In fact, such news is tailored to the users precisely to push their buttons.
Francesco, if you show me George Clooney shirtless and holding a puppy, and George tells me that the Earth is flat, I might have doubts for a minute. Too personal?
Solutions have focused on transparency measures, exposing the source of information while neglecting the accountability of players in the ecosystem who profit from harmful behaviour. But these are band-aids on bullet wounds.
The problem is the social media platforms themselves. In October 2019 Zuckerberg was in front of Congress again, because Facebook refuses to fact-check political advertisements - in 2019, after everything that has happened. At the same time, market concentration and the rise of platform dominance threaten media pluralism. This, in turn, leads to a handful of news pieces being repeated and amplified, and to independent journalism being silenced.
This seems relatively benign, although, if you think about it some more, you realise that this mechanism prevents you from actually discovering anything new: it just gives you more of what you are likely to like. Still, one would not think that this could have world-changing consequences.
If you think of the news, this mechanism becomes lethal: in the mildest form – which is already bad – you will only hear opinions that already align with those of your own peer group. In the worst scenario, you will not hear some news at all, or you will hear a misleading or false version of the news, and you don’t even know that a different version exists.
In the Brexit referendum, misleading or false content (like the famous claim about NHS money supposedly going to the EU) was amplified in filter bubbles. Each bubble of people was essentially understanding a different version of the same issue. Brexit was a million different things, depending on your social media feeds.
And of course, there are malicious players in the game, like the Russian Internet Research Agency and Cambridge Analytica, who actively exploited these features in order to swing the vote.
Recommender systems are used in a variety of applications.
For instance, in the job market: a recommender system that limits exposure to certain information about jobs on the basis of the person’s gender or inferred health status perpetuates discriminatory attitudes and practices. In the US, algorithmic systems are used to calculate bail for people who have been arrested, disproportionately penalising people of colour. This has to do with the training of the algorithm: in an already unequal system (where, for instance, there are few women in top managerial positions, and more African-Americans in jail than white Americans), a recommender system will by design amplify such inequality.
Yep. The problem with recommender systems goes even deeper. I would rather connect it to the problem of privacy. A recommender system only works if it knows its audience. They are so powerful, because they know everything about us.
We don’t have any privacy anymore. Online players know exactly who we are, our lives are transparent to both corporations and governments. For an excellent analysis of this, read Snowden’s book “Permanent Record”. I highly recommend it.
With all this information about us, we are put into “categories” for specific purposes: selling us products, influencing our vote. They target us with ads aimed at our specific category, and this generates more discussion and more content on our social media. Recommender systems amplify the targeting by design. They would be much less effective, and much less dangerous, in a world where our lives are private.
F: Social media platforms base their whole business model on “knowing us”. The business model itself is problematic.
As we said in the previous episode, the internet has become centralised, with a handful of platforms controlling most of the traffic. In some countries, like Myanmar, internet access itself is provided and controlled by Facebook.
F: Chiara, where’s Myanmar?
In South-East Asia, between India and Thailand.
In effect, the forum for public discourse and the available space for freedom of speech is now bounded by the profit motives of powerful private companies. Due to technical complexity or on the grounds of commercial secrecy, such companies decline to explain how decisions are made. Mostly, they make decisions via recommender algorithms, which amplify bias and segregation. And at the same time the few major platforms with their extraordinary reach offer an easy target for people seeking to use the system for malicious ends.
This is our call to all data scientists out there: be aware of personalisation when building recommender systems. Personalising is not always beneficial. There are a few cases where it is, e.g. medicine, genetics, drug discovery, and many other cases where it is detrimental, e.g. news, consumer products/services, opinions.
Personalisation by algorithm, and in particular of the news, leads to a fragmentation of reality that undermines democracy. Collectively we need to push for reining in targeted advertising, and the path to this leads through stricter rules on privacy. As long as we are completely transparent to commercial and governmental players, like we are today, we are vulnerable to lies, misdirection and manipulation.
As Christopher Wylie (the Cambridge Analytica whistleblower) eloquently said, it’s like going on a date, where you know nothing about the other person, but they know absolutely everything about you.
We are left without agency, and without real choice.
In other words, we are f*cked.
Black lives matter / Internet Research Agency (IRA) articles:
http://faculty.washington.edu/kstarbi/Stewart_Starbird_Drawing_the_Lines_of_Contention-final.pdf
https://faculty.washington.edu/kstarbi/BLM-IRA-Camera-Ready.pdf
EDPS report
https://edps.europa.eu/sites/edp/files/publication/18-03-19_online_manipulation_en.pdf
Johnson et al. “Population polarization dynamics and next-generation social media algorithms” https://arxiv.org/abs/1712.06009
Chamath Palihapitiya, former Vice President of User Growth at Facebook, was giving a talk at Stanford University, when he said this:
“I feel tremendous guilt. The short-term, dopamine-driven feedback loops that we have created are destroying how society works.”
He was referring to how social media platforms leverage our neurological build-up in the same way slot machines and cocaine do, to keep us using their products as much as possible. They turn us into addicts.
F: how many times do you check your Facebook in a day?
I am not a fan of Facebook. I do not have it on my phone. Still, I check it in the morning on my laptop, and maybe twice more per day. I have a trick though: I do not scroll down. I only check the top bar to see if someone has invited me to an event, or contacted me directly. But from time to time, this resolution of mine slips, and I catch myself scrolling down, without even realising it!
F: is it the first thing you check when you wake up?
No because usually I have a message from you!! :) But yes, while I have my coffee I do a sweep on Facebook and twitter and maybe Instagram, plus the news.
F: Check how much time you spend on Facebook
And then sum it up to your email, twitter, reddit, youtube, instagram, etc. (all viable channels for ads to reach you)
We have an answer. More on that later.
Clearly in this episode there is some form of addiction we would like to talk about. So let’s start from the beginning: how does addiction work?
Dopamine is a hormone produced by our body, and in the brain it works as a neurotransmitter, a chemical that neurons use to transmit signals to each other. One of the main functions of dopamine is to shape “reward-motivated behaviour”: this is the way our brain learns through association, positive reinforcement, incentives, and positively-valenced emotions, in particular pleasure. In other words, it makes our brain desire more of the things that make us feel good. These things can be, for example, good food, sex, and crucially, good social interactions, like hugging your friends or your baby, or having a laugh together. Because we have evolved to be social animals with complex social structures, successful social interactions are an evolutionary advantage, and therefore they trigger dopamine release in our brain, which makes us feel good and reinforces the association between the action and the reward. This feeling motivates us to repeat the behaviour.
F: now that you mention reinforcement, I recall that this mechanism is so powerful and effective that we have been inspired by nature and replicated it in silico with reinforcement learning. The idea is to motivate an agent (eventually creating an addictive pattern) to follow what is called the optimal policy, by giving it positive rewards or punishing it when things don’t go the way we planned.
In our brain, every time an action produces a reward, the connection between action and reward becomes stronger. Through reinforcement, a baby learns to distinguish a cat from a dog, or that fire hurts (that was me).
F: and so this means that all the social interactions people get from social media platforms are in fact doing the same, right?
Yes, but with a difference: smartphones in our pockets keep us connected to an unlimited reserve of constant social interactions. This constant flux of notifications - the rewards - floods our brain with dopamine. The mechanism of reinforcement can spin out of control. The reward pathways in our brain can malfunction, and this leads to addiction.
F: you are saying that social media has LITERALLY the effect of a drug?
Yes. In fact, social media platforms are DESIGNED to exploit the rewards systems in our brain. They are designed to work like a drug.
Have you been to a casino and played roulette or the slot machines?
F: ...maybe?
Why is it fun to play roulette? The fun comes from the WAIT before the reward. You put a chip on a number, you don’t know how it’s going to go. You wait for the ball to spin, you get excited. And from time to time, BAM! Your number comes out. Now, compare this with posting something on facebook. You write a message into the void, wait…. And then the LIKES start coming in.
F: yeah i find that familiar...
Contrary to the casino, social media platforms do not want our money; in fact, they are free. What they want, and what we are actually paying with, is our time. Because the longer we stay on, the longer they can show us ads, and the more money advertisers pay them. This is no accident, this is the business model. But asking for our time out loud would not work; we would probably not consciously give it to them. So, like a casino, they make it hard for us to get off once we are on: they make us crave the likes, the right-swipes, the retweets, the subscriptions. So we check in, we stay on, we keep scrolling, because we hope to get those rewards. The short-term satisfaction of getting a “like” is a little boost of dopamine in our brain. We get used to it, and we want more.
F: a lot of machine learning is also being deployed to amplify this form of addiction and make it... well, more addictive :) But here is the question: how much of the effectiveness of such ads and scenarios comes from the algorithms, and how much from the fact that humans are simply wired to obey such dynamics? In other words: are we essentially flawed, or are these algorithms truly powerful?
It is not a flaw, it’s a feature. The way our brain has evolved has been in response to very specific needs. In particular for this conversation, our brain is wired to favour social interactions, because it is an evolutionary advantage. These algorithms exploit these features of the brain on purpose, they are designed to exploit them.
F: I believe so, but I also believe that the human brain is a powerful machine, so it should be able to predict what satisfaction it can get from social media. So how does it happen that we become addicted?
An example of an optimisation strategy that social media platforms use is based on the principle of “reward prediction error coding”. Our brain learns to find patterns in data - a basic survival skill - and therefore learns when to expect a reward for a given set of actions. I eat cake, therefore I am happy. Every time.
Imagine a scenario where we have learnt through experience that when we play slot machines in a casino, we win some money once every 100 times we pull the lever. The difference between predicted and received rewards is a known, fixed quantity; just after winning once, we have almost zero incentive to play again. So the casino fixes the slot machines to introduce a random element in the timing of the reward. Suddenly our prediction error increases substantially. In this margin of error, in the time between the action (pulling the lever) and the possible reward, our brain has time to anticipate the result and get excited at the possibility, and this releases dopamine. Playing in itself becomes a reward.
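To see how dramatic the difference between a fixed and a variable schedule is, here is a toy simulation, purely illustrative and in no way a neuroscience model: a simple learner tracks its expected reward per "state" (pulls since the last win), and we measure its average prediction error under both schedules.

import random

def average_error(schedule: str, pulls: int = 200_000, period: int = 100) -> float:
    value = {}                 # V(s): expected reward after s pulls since last win
    since, err, alpha = 0, 0.0, 0.05
    for _ in range(pulls):
        if schedule == "fixed":
            reward = 1.0 if since == period - 1 else 0.0    # win every 100th pull
        else:
            reward = 1.0 if random.random() < 1.0 / period else 0.0  # win at random
        v = value.get(since, 0.0)
        err += abs(reward - v)                   # reward prediction error
        value[since] = v + alpha * (reward - v)  # learn the expectation
        since = 0 if reward else since + 1
    return err / pulls

random.seed(0)
print(average_error("fixed"))     # tends towards zero: the schedule is learnable
print(average_error("variable"))  # stays high: the surprise never goes away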
F: There is an equivalent in reinforcement learning called the grid world, which consists of a mouse getting to the cheese in a maze. In reinforcement learning, everything works smoothly as long as the cheese stays in the same place.
Exactly! Now social media apps implement an equivalent trick, called “variable reward schedules”.
In our brain, after an action we get a reward or punishment, and we generate positive or negative feedback to that action.
Social media apps optimise their algorithms for the ideal balance of negative and positive feedback in our brains caused by the difference between these predicted and received rewards.
If we perceive a reward to be delivered at random, and - crucially - if checking for the reward comes at little cost, like opening the Facebook app, we end up checking for rewards all the time. Every time we are just a little bit bored, without even thinking, we check the app. The Facebook reward system (the schedule and triggers of notification and likes) has been optimised to maximise this behaviour.
F: are you saying that buffering some likes and then finding the right moment to show them to the user can make the user crave the reward?
Oh yes. Instagram will withhold likes for a period of time, causing a dip in reward compared to the expected level. It will then deliver them later in larger bundles, thus boosting the reward above the expected value, which triggers extra dopamine release, which sends us on a high akin to a cocaine hit.
F: Dear audience, do you remember my question? How much time do each of you spend on social media (or similar) in a day? And why do we still do it?
The fundamental feature here is how low the perceived cost of checking for the reward is: I just need to open the app. We perceive this cost as minimal, so we don’t even think about it. YouTube, for instance, has the autoplay feature, so you need to do absolutely nothing to remain on the app. But the cost is cumulative over time: it becomes hours in our day, days in a month, years in our lives. Two hours of social media per day amounts to about one month per year (2 h/day × 365 days = 730 h ≈ 30 days).
F: But it’s so EASY, it has become so natural to use social media for everything. To use Google for everything.
The convenience that the platforms give us is one of the most dangerous things about them, and not only for our individual lives. The convenience of reaching so many users, together with the business model of monetising attention, is one of the causes of the centralisation of the internet, i.e. the fact that a few giant platforms control most of the internet traffic. Revenue from ads is concentrated on the big platforms, and content creators have no choice but to use them if they want to be competitive. The internet went from looking like a distributed network to a centralised network. And this in turn causes data to be centralised, in a self-reinforcing loop. Most human conversations and interactions now pass through the servers of a handful of private corporations.
Conclusion
As data scientists we should be aware of this (and we think mostly we are). We should also be ethically responsible. Being a data scientist no longer has a neutral connotation. Algorithms have this huge power of manipulating human behaviour, and let’s be honest, we are the only ones who really understand how they work. So we have a responsibility here.
There are some organisations, like Data for Democracy, advocating for something equivalent to the Hippocratic Oath for data scientists: do no harm.
References
Dopamine reward prediction error coding https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4826767/
Skinner - Operant Conditioning https://www.simplypsychology.org/operant-conditioning.html
Dopamine, Smartphones & You: A battle for your time http://sitn.hms.harvard.edu/flash/2018/dopamine-smartphones-battle-time/
Reward system https://en.wikipedia.org/wiki/Reward_system
Data for democracy datafordemocracy.org
Some of the most powerful NLP models like BERT and GPT-2 have one thing in common: they all use the transformer architecture.
Such architecture is built on top of another important concept already known to the community: self-attention.
In this episode I explain what these mechanisms are, how they work and why they are so powerful.
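For readers who like code more than words, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. The shapes and random weights are of course toy assumptions, not anything from an actual model.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) learned projections.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how much each token attends to the others
    return softmax(scores) @ V                # weighted mix of the value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                  # 5 tokens, 16-dim embeddings (toy)
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 8)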
Don't forget to subscribe to our Newsletter or join the discussion on our Discord server
Generative Adversarial Networks, or GANs, are very powerful tools to generate data. However, training a GAN is not easy. More specifically, GANs suffer from three major issues: instability of the training procedure, mode collapse and vanishing gradients.
In this episode I explain not only the most challenging issues one encounters while designing and training Generative Adversarial Networks, but also some methods and architectures to mitigate them. In addition, I elucidate the three specific strategies that researchers are considering to improve the accuracy and the reliability of GANs.
The most tedious issues of GANs
Convergence to equilibrium
A typical GAN is formed by at least two networks: a generator G and a discriminator D. The generator's task is to generate samples from random noise. In turn, the discriminator has to learn to distinguish fake samples from real ones. While it is theoretically possible that generators and discriminators converge to a Nash Equilibrium (at which both networks are in their optimal state), reaching such equilibrium is not easy.
Vanishing gradients
Moreover, a very accurate discriminator pushes the loss function towards lower and lower values. This, in turn, might cause the gradient to vanish and the entire network to stop learning completely.
Mode collapse
Another phenomenon that is easy to observe when dealing with GANs is mode collapse: the incapability of the model to generate diverse samples. This, in turn, leads to generated data that are more and more similar to one another. Hence, the entire generated dataset would be concentrated around a particular statistical value.
The solution
Researchers have taken into consideration several approaches to overcome such issues. They have been playing with architectural changes, different loss functions and game theory.
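As a rough illustration of where these issues live in code, here is a hedged sketch of one vanilla GAN training step in PyTorch, on toy 1-D data. The non-saturating generator loss below is one of the common remedies for vanishing generator gradients; everything else (sizes, learning rates, data) is an arbitrary assumption.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))   # generator
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(64, 1) * 0.5 + 2.0        # "real" samples from N(2, 0.5^2)
noise = torch.randn(64, 8)

# Discriminator step: push real towards 1, fake towards 0.
fake = G(noise).detach()                     # detach: don't update G here
loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to make D output 1 on fakes (non-saturating loss).
loss_g = bce(D(G(noise)), torch.ones(64, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()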
Listen to the full episode to know more about the most effective strategies to build GANs that are reliable and robust.
Don't forget to join the conversation on our new Discord channel. See you there!
What happens to a neural network trained with random data?
Are massive neural networks just lookup tables or do they truly learn something?
Today’s episode will be about memorisation and generalisation in deep learning, with Stanislaw Jastrzębski from New York University.
Stan spent two summers as a visiting student with Prof. Yoshua Bengio, and has been working on memorisation and generalisation in deep learning ever since.
What makes deep learning unique?
I asked him a few questions I had been seeking answers to for a long time. For instance, what does deep learning bring to the table that other methods don’t, or are not capable of?
Stan believes that the one thing that makes deep learning special is representation learning. All the other competing methods, be they kernel machines or random forests, do not have this capability. Moreover, optimisation (SGD) lies at the heart of representation learning, in the sense that it allows finding good representations.
What really improves the training quality of a neural network?
We discussed how the accuracy of neural networks depends to a large extent on how good the Stochastic Gradient Descent method is at finding minima of the loss function. What would influence such minima?
Stan's answer revealed that training-set accuracy, or the loss value, is actually not that interesting. It is relatively easy to overfit data (i.e. achieve the lowest loss possible), provided a large enough network and a large enough computational budget. However, the shape of the minima, and performance on validation sets, are influenced by optimisation in quite fascinating ways.
Optimisation at the beginning of the trajectory steers that trajectory towards minima with certain properties, properties that go much further than just training accuracy.
As always we spoke about the future of AI and the role deep learning will play.
I hope you enjoy the show!
Don't forget to join the conversation on our new Discord channel. See you there!
References
Homepage of Stanisław Jastrzębski https://kudkudak.github.io/
A Closer Look at Memorization in Deep Networks https://arxiv.org/abs/1706.05394
Three Factors Influencing Minima in SGD https://arxiv.org/abs/1711.04623
Don't Decay the Learning Rate, Increase the Batch Size https://arxiv.org/abs/1711.00489
Stiffness: A New Perspective on Generalization in Neural Networks https://arxiv.org/abs/1901.09491
Join the discussion on our Discord server
In this episode I explain how a research group from the University of Lübeck tamed the curse of dimensionality for the generation of large medical images with GANs.
The problem is not as trivial as it seems. Many researchers have failed at generating large images with GANs before. One interesting application of such an approach is in medicine, for the generation of CT and X-ray images.
Enjoy the show!
References
Multi-scale GANs for Memory-efficient Generation of High Resolution Medical Images https://arxiv.org/abs/1907.01376
In this episode, I am with Aaron Gokaslan, computer vision researcher and AI Resident at Facebook AI Research. Aaron is the author of OpenGPT-2, an open replication of the much-discussed model that OpenAI decided not to release because it was deemed too dangerous to be published.
We discuss image-to-image translation, the dangers of the GPT-2 model, and the future of AI.
Moreover, Aaron provides some very interesting links and demos that will blow your mind!
Join the discussion on our Discord server
Reinforcement learning agents have done great at playing Atari video games and Go, trading financial instruments, and modelling language. Let me tell you the real story here.
In this episode I want to shine some light on reinforcement learning (RL) and the limitations that every practitioner should consider before taking certain directions. RL seems to work so well! What is wrong with it?
Are you a listener of Data Science at Home podcast?
A reader of the Amethix Blog?
Or did you subscribe to the Artificial Intelligence at your fingertips newsletter?
In any case let’s stay in touch!
https://amethix.com/survey/
Join the discussion on our Discord server
In this episode I have an amazing conversation with Jimmy Soni and Rob Goodman, authors of “A Mind at Play”, a book entirely dedicated to the life and achievements of Claude Shannon. Claude Shannon does not need any introduction. But for those who need a refresher, Shannon is the inventor of the information age.
Have you heard of binary code, entropy in information theory, data compression theory (the stuff behind mp3, mpg, zip, etc.), error correcting codes (the stuff that makes your RAM work well), n-grams, block ciphers, the beta distribution, the uncertainty coefficient?
All that stuff has been invented by Claude Shannon :)
Articles: https://medium.com/the-mission/10-000-hours-with-claude-shannon-12-lessons-on-life-and-learning-from-a-genius-e8b9297bee8f
Join the discussion on our Discord server
As ML plays a more and more relevant role in many domains of everyday life, it is no surprise to see more and more attacks against ML systems. In this episode we talk about the most popular attacks against machine learning systems, and some mitigations designed by researchers Ambra Demontis and Marco Melis from the University of Cagliari (Italy). The guests are also the authors of SecML, an open-source Python library for the security evaluation of Machine Learning (ML) algorithms. Both Ambra and Marco are members of the research group PRAlab, under the supervision of Prof. Fabio Roli.
Marco Melis (Ph.D Student, Project Maintainer, https://www.linkedin.com/in/melismarco/)
Ambra Demontis (Postdoc, https://pralab.diee.unica.it/it/AmbraDemontis)
Maura Pintor (Ph.D Student, https://it.linkedin.com/in/maura-pintor)
Battista Biggio (Assistant Professor, https://pralab.diee.unica.it/it/BattistaBiggio)
SecML: an open-source Python library for the security evaluation of Machine Learning (ML) algorithms https://secml.gitlab.io/.
Demontis et al., “Why Do Adversarial Attacks Transfer? Explaining Transferability of Evasion and Poisoning Attacks,” presented at the 28th USENIX Security Symposium (USENIX Security 19), 2019, pp. 321–338. https://www.usenix.org/conference/usenixsecurity19/presentation/demontis
P. W. Koh and P. Liang, “Understanding Black-box Predictions via Influence Functions,” in International Conference on Machine Learning (ICML), 2017. https://arxiv.org/abs/1703.04730
M. Melis, A. Demontis, B. Biggio, G. Brown, G. Fumera, and F. Roli, “Is Deep Learning Safe for Robot Vision? Adversarial Examples Against the iCub Humanoid,” in 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), 2017, pp. 751–759. https://arxiv.org/abs/1708.06939
B. Biggio and F. Roli, “Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning,” Pattern Recognition, vol. 84, pp. 317–331, 2018. https://arxiv.org/abs/1712.03141
B. Biggio et al., “Evasion attacks against machine learning at test time,” in Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Part III, 2013, vol. 8190, pp. 387–402. https://arxiv.org/abs/1708.06131
B. Biggio, B. Nelson, and P. Laskov, “Poisoning attacks against support vector machines,” in 29th Int’l Conf. on Machine Learning, 2012, pp. 1807–1814. https://arxiv.org/abs/1206.6389
N. Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma, “Adversarial classification,” in Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Seattle, 2004, pp. 99–108. https://dl.acm.org/citation.cfm?id=1014066
Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. "Axiomatic attribution for deep networks." Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017. https://arxiv.org/abs/1703.01365
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Model-agnostic interpretability of machine learning." arXiv preprint arXiv:1606.05386 (2016). https://arxiv.org/abs/1606.05386
Guo, Wenbo, et al. "Lemna: Explaining deep learning based security applications." Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2018. https://dl.acm.org/citation.cfm?id=3243792
Bach, Sebastian, et al. "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation." PloS one 10.7 (2015): E0130140. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0130140
Join the discussion on our Discord server
Scaling technology and scaling business processes are not the same thing. Since the beginning of enterprise technology, scaling software has been a difficult task to get right inside large organisations. When it comes to Artificial Intelligence and Machine Learning, it becomes vastly more complicated.
In this episode I propose a framework - in five pillars - for the business side of artificial intelligence.
Join the discussion on our Discord server
Training neural networks faster usually involves powerful GPUs. In this episode I explain an interesting method from a group of researchers at Google Brain, who train neural networks faster by squeezing the most out of the hardware and making the training pipeline denser.
Enjoy the show!
References
Faster Neural Network Training with Data Echoing
https://arxiv.org/abs/1907.05550
Join the discussion on our Discord server
In this episode I am with Jadiel de Armas, senior software engineer at Disney and author of Videoflow, a Python framework that facilitates the quick development of complex video analysis applications and other series-processing based applications in a multiprocessing environment.
I have inspected the Videoflow repo on GitHub along with some of the capabilities of this framework, and I must say that it’s really interesting. Jadiel is going to tell us a lot more than what you can read on GitHub.
References
Videoflow official GitHub repository
https://github.com/videoflow/videoflow
In this episode, I am with Dr. Charles Martin from Calculation Consulting a machine learning and data science consulting company based in San Francisco. We speak about the nuts and bolts of deep neural networks and some impressive findings about the way they work.
The questions that Charles answers in the show are essentially two:
In this episode I explain how a community detection algorithm known as Markov clustering can be constructed by combining simple concepts like random walks, graphs and similarity matrices. Moreover, I highlight how one can build a similarity graph and then run a community detection algorithm on such a graph to find clusters in tabular data.
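As a taste of the idea, here is a hedged NumPy sketch of Markov clustering (expansion and inflation on a column-stochastic matrix). It is not the snippet from the blog, just a toy illustration with an obvious two-community graph.

import numpy as np

def mcl(S, expansion=2, inflation=2.0, iters=50):
    # S: symmetric (n, n) similarity/adjacency matrix with self-loops.
    M = S / S.sum(axis=0, keepdims=True)          # column-stochastic random walk
    for _ in range(iters):
        M = np.linalg.matrix_power(M, expansion)  # expansion: let the walk spread
        M = M ** inflation                        # inflation: strengthen strong flows
        M = M / M.sum(axis=0, keepdims=True)      # renormalise the columns
    clusters = {}                                 # group nodes by their attractor
    for j in range(M.shape[1]):
        clusters.setdefault(int(M[:, j].argmax()), []).append(j)
    return list(clusters.values())

# Two obvious communities (nodes 0-2 and 3-5), self-loops included:
S = np.zeros((6, 6))
S[:3, :3] = 1.0
S[3:, 3:] = 1.0
print(mcl(S))  # -> [[0, 1, 2], [3, 4, 5]]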
You can find a simple hands-on code snippet to play with on the Amethix Blog
Enjoy the show!
References
[1] S. Fortunato, “Community detection in graphs”, Physics Reports, volume 486, issues 3-5, pages 75-174, February 2010.
[2] Z. Yang, et al., “A Comparative Analysis of Community Detection Algorithms on Artificial Networks”, Scientific Reports volume 6, Article number: 30750 (2016)
[3] S. Dongen, “A cluster algorithm for graphs”, Technical Report, CWI (Centre for Mathematics and Computer Science) Amsterdam, The Netherlands, 2000.
[4] A. J. Enright, et al., “An efficient algorithm for large-scale detection of protein families”, Nucleic Acids Research, volume 30, issue 7, pages 1575-1584, 2002.
The two most widely considered software development models in modern project management are, without any doubt, the Waterfall Methodology and the Agile Methodology. In this episode I make a comparison between the two and explain what I believe is the best choice for your machine learning project.
An interesting post to read (mentioned in the episode) is How businesses can scale Artificial Intelligence & Machine Learning https://amethix.com/how-businesses-can-scale-artificial-intelligence-machine-learning/
In this episode I have a wonderful conversation with Chris Skinner.
Chris and I recently got in touch at The Banking Scene 2019, a fintech conference recently held in Brussels. During that conference he talked as a real troublemaker - that’s how he defines himself - saying that “People are not educated with loans, credit, money” and that “Banks are failing at digital”.
After I got my hands on his last book Digital Human, I invited him to the show to ask him a few questions about innovation, regulation and technology in finance.
Today I am with David Kopec, author of Classic Computer Science Problems in Python, published by Manning Publications.
His book deepens your knowledge of problem solving techniques from the realm of computer science by challenging you with interesting and realistic scenarios, exercises, and of course algorithms.
There are examples in the major topics any data scientist should be familiar with, for example search, clustering, graphs, and much more.
Get the book from https://www.manning.com/books/classic-computer-science-problems-in-python and use coupon code poddatascienceathome19 to get 40% discount.
References
Twitter https://twitter.com/davekopec
GitHub https://github.com/davecom
In this episode I talk about a new paradigm of learning, one that may at first seem a bit blurry and not really different from the methods we already know, such as supervised and unsupervised learning. The method I introduce here is called self-supervised learning.
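To give a flavour of what "the labels come from the data itself" means, here is a toy sketch of a classic pretext task (predicting image rotations); the task choice and the shapes are illustrative assumptions.

import numpy as np

def make_rotation_task(images):
    # images: iterable of square (H, H) arrays, so shapes match after rotation.
    xs, ys = [], []
    for img in images:
        for k in range(4):                 # 0, 90, 180, 270 degrees
            xs.append(np.rot90(img, k))
            ys.append(k)                   # the label comes for free
    return np.stack(xs), np.array(ys)

# A network trained to predict ys from xs must learn useful visual features,
# which can then be fine-tuned on a downstream task with few real labels.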
Enjoy the show!
Don't forget to subscribe to our Newsletter at amethix.com and get the latest updates in AI and machine learning. We do not spam. Promise!
References
Deep Clustering for Unsupervised Learning of Visual Features
Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey
The successes of deep learning for text analytics, also introduced in a recent post about sentiment analysis published here, are undeniable. Many other tasks in NLP have also benefited from the superiority of deep learning methods over more traditional approaches. Such extraordinary results have also been possible due to the neural network approach to learning meaningful character and word embeddings, that is, the representation space in which semantically similar objects are mapped to nearby vectors.
All this is strictly related to a field one might initially find disconnected or off-topic: biology.
Don't forget to subscribe to our Newsletter at amethix.com and get the latest updates in AI and machine learning. We do not spam. Promise!
References
[1] Rives A., et al., “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences”, biorxiv, doi: https://doi.org/10.1101/622803
[2] Vaswani A., et al., “Attention is all you need”, Advances in neural information processing systems, pp. 5998–6008, 2017.
[3] Bahdanau D., et al., “Neural machine translation by jointly learning to align and translate”, arXiv, http://arxiv.org/abs/1409.0473.
The rapid diffusion of social media like Facebook and Twitter, and the massive use of different types of forums like Reddit, Quora, etc., is producing an impressive amount of text data every day.
There is one specific activity that many business owners have been contemplating over the last five years: identifying the social sentiment of their brand by analysing the conversations of their users.
In this episode I explain how one can get the best shot at classifying sentences with deep learning and word embedding.
Additional material
Schematic representation of how to learn a word embedding matrix E by training a neural network that, given the previous M words, predicts the next word in a sentence.
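A hedged sketch of that scheme in Keras; the vocabulary size, context length and dimensions are invented for illustration.

import numpy as np
from tensorflow.keras import layers, models

V, M, D = 5000, 4, 64                      # vocabulary, context length, embedding dim

inp = layers.Input(shape=(M,))             # the previous M word ids
emb = layers.Embedding(V, D)               # this layer's weights are the matrix E
x = layers.Flatten()(emb(inp))
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(V, activation="softmax")(x)   # distribution over the next word
model = models.Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# X: (n, M) integer word ids; y: (n,) id of the word that follows each context.
X = np.random.randint(0, V, size=(256, M)); y = np.random.randint(0, V, size=256)
model.fit(X, y, epochs=1, verbose=0)
E = emb.get_weights()[0]                   # the learned (V, D) embedding matrix
print(E.shape)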
Word2Vec example source code
https://gist.github.com/rlangone/ded90673f65e932fd14ae53a26e89eee#file-word2vec_example-py
References
[1] Mikolov, T. et al., "Distributed Representations of Words and Phrases and their Compositionality", Advances in Neural Information Processing Systems 26, pages 3111-3119, 2013.
[2] The Best Embedding Method for Sentiment Classification, https://medium.com/@bramblexu/blog-md-34c5d082a8c5
[3] The state of sentiment analysis: word, sub-word and character embedding
https://amethix.com/state-of-sentiment-analysis-embedding/
In this episode I speak to Alexandr Honchar, data scientist and owner of blog https://medium.com/@alexrachnog
Alexandr has written very interesting posts about time series analysis for financial data. His blog is in my personal list of best tutorial blogs.
We discuss financial time series and machine learning, what makes predicting the price of stocks a very challenging task, and why machine learning might not be enough.
As usual, I ask Alexandr how he sees machine learning in the next 10 years. His answer - in my opinion quite futuristic - makes perfect sense.
You can contact Alexandr on
Enjoy the show!
It all starts from physics. The entropy of an isolated system never decreases… Everyone, at some point in their life, learned this in a physics class. What does this have to do with machine learning?
To find out, listen to the show.
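As a tiny teaser, here is how Shannon entropy and the cross-entropy loss used everywhere in machine learning look in NumPy (toy numbers only):

import numpy as np

def entropy(p):
    # H(p) = -sum_i p_i * log(p_i), in nats; convention: 0 * log 0 = 0.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log(q_i); minimised when q matches p.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

print(entropy([0.5, 0.5]))                # ~0.693: maximal uncertainty for 2 outcomes
print(entropy([0.99, 0.01]))              # ~0.056: almost no uncertainty
print(cross_entropy([1, 0], [0.9, 0.1]))  # ~0.105: a classifier's loss for one sample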
References
Entropy in machine learning
https://amethix.com/entropy-in-machine-learning/
Deep learning is the future. Get a crash course on deep learning. Now!
In this episode I speak to Oliver Zeigermann, author of Deep Learning Crash Course published by Manning Publications at https://www.manning.com/livevideo/deep-learning-crash-course
Oliver (Twitter: @DJCordhose) is a veteran of neural networks and machine learning. In addition to the course - that teaches you concepts from prototype to production - he's working on a really cool project that predicts something people do every day... clicking their mouse.
If you use promo code poddatascienceathome19 you get a 40% discount for all products on the Manning platform
Enjoy the show!
References:
Deep Learning Crash Course (Manning Publications)
https://www.manning.com/livevideo/deep-learning-crash-course?a_aid=djcordhose&a_bid=e8e77cbf
Companion notebooks for the code samples of the video course "Deep Learning Crash Course"
https://github.com/DJCordhose/deep-learning-crash-course-notebooks/blob/master/README.md
Next-button-to-click predictor source code
https://github.com/DJCordhose/ux-by-tfjs
In this episode I met three crazy researchers from KULeuven (Belgium) who found a method to fool surveillance cameras and stay hidden just by holding a special t-shirt.
We discussed the technique they used and some consequences of their findings.
They published their paper on arXiv and made their source code available at https://gitlab.com/EAVISE/adversarial-yolo
Enjoy the show!
References
Fooling automated surveillance cameras: adversarial patches to attack person detection
Simen Thys, Wiebe Van Ranst, Toon Goedemé
Eavise Research Group KULeuven (Belgium)
https://iiw.kuleuven.be/onderzoek/eavise
There is a connection between gradient descent based optimizers and the dynamics of damped harmonic oscillators. What does that mean? We now have a better theory for optimization algorithms.
In this episode I explain how all this works.
All the formulas I mention in the episode can be found in the post The physics of optimization algorithms
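As a toy illustration of the analogy (not the derivation from the post), here is the classic heavy-ball momentum update, which can be read as a discretized damped harmonic oscillator: the gradient plays the role of the restoring force and the momentum coefficient that of damping.

# Minimize f(x) = 0.5 * k * x^2; its gradient k * x is a spring-like restoring force
k = 1.0
grad = lambda x: k * x

x, v = 5.0, 0.0        # initial position and velocity
lr, mu = 0.1, 0.9      # learning rate and momentum (damping) coefficient

for step in range(200):
    v = mu * v - lr * grad(x)   # velocity update with friction
    x = x + v                   # position update

print(round(x, 6))  # the iterate oscillates like a damped spring and settles near 0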
Enjoy the show.
How are differential equations related to neural networks? What are the benefits of re-thinking neural network as a differential equation engine? In this episode we explain all this and we provide some material that is worth learning. Enjoy the show!
Residual Block
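In code, the connection is one line: a residual block computes x + f(x), which is exactly an explicit Euler step x(t+h) = x(t) + h * f(x(t)) of the ODE dx/dt = f(x) with h = 1. A minimal sketch (toy weights and a toy branch f, not the notation of the papers below):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))    # toy weights of the residual branch

def f(x):
    # The residual branch: a small non-linear transformation
    return np.tanh(W @ x)

def residual_block(x):
    return x + f(x)                        # skip connection + branch

def euler_step(x, h=1.0):
    return x + h * f(x)                    # explicit Euler for dx/dt = f(x)

x = rng.normal(size=4)
print(np.allclose(residual_block(x), euler_step(x, h=1.0)))   # True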
References
[1] K. He, et al., “Deep Residual Learning for Image Recognition”, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016
[2] S. Hochreiter, et al., “Long short-term memory”, Neural Computation 9(8), pages 1735-1780, 1997.
[3] Q. Liao, et al.,”Bridging the gaps between residual learning, recurrent neural networks and visual cortex”, arXiv preprint, arXiv:1604.03640, 2016.
[4] Y. Lu, et al., “Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equation”, Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 2018.
[5] T. Q. Chen, et al., “Neural Ordinary Differential Equations”, Advances in Neural Information Processing Systems 31, pages 6571-6583, 2018.
From the beginning of AI in the 1950s until the 1980s, symbolic AI approaches dominated the field. These approaches, also known as expert systems, used mathematical symbols to represent objects and the relationships between them, in order to depict the extensive knowledge bases built by humans.
The opposite of the symbolic AI paradigm is named connectionism, which is behind the machine learning approaches of today.
The successes that deep learning systems have achieved in the last decade in all kinds of domains are unquestionable. Self-driving cars, skin cancer diagnostics, movie and song recommendations, language translation, automatic video surveillance, and digital assistants are just a few examples of the ongoing revolution that affects, or will soon disrupt, our everyday life.
But all that glitters is not gold…
Read the full post on the Amethix Technologies blog
In this episode I speak about how important reproducible machine learning pipelines are.
When you are collaborating with diverse teams, tasks get distributed among different individuals. Everyone will have good reasons to change parts of your pipeline, leading to confusion and to a number of variants that soon explodes.
In all those cases, tracking data and code is extremely helpful to build models that are reproducible anytime, anywhere.
Listen to the podcast and learn how.
Have you ever wanted to get an estimate of the uncertainty of your neural network? Clearly Bayesian modelling provides a solid framework to estimate uncertainty by design. However, there are many realistic cases in which Bayesian sampling is not really an option and ensemble models can play a role.
In this episode I describe a simple yet effective way to estimate uncertainty, without changing your neural network’s architecture nor your machine learning pipeline at all.
The post with mathematical background and sample source code is published here.
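To give a flavour of the idea (a minimal sketch, not the code from the post; data, model and hyperparameters are made up), train the same architecture several times with different random seeds and read the spread of the predictions as uncertainty:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# An ensemble of identical architectures, differing only in the random seed
models = [
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=s).fit(X, y)
    for s in range(5)
]

X_test = np.array([[0.0], [2.9]])
preds = np.stack([m.predict(X_test) for m in models])

print("mean prediction:  ", preds.mean(axis=0))
print("uncertainty (std):", preds.std(axis=0))   # larger where the models disagree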
In this episode I am completing the explanation about the integration between fitchain and Ocean Protocol, which allows secure on-premise compute to operate in the decentralized data marketplace designed by Ocean Protocol.
As mentioned in the show, here is a picture that provides a 10,000-foot view of the integration.
I hope you enjoy the show!
In this episode I briefly explain how two massive technologies have been merged in 2018 (work in progress :) - one providing secure machine learning on isolated data, the other implementing a decentralized data marketplace.
In this episode I explain:
I hope you enjoy the show!
References
fitchain.io decentralized machine learning
Ocean protocol decentralized data marketplace
It's always good to put all the findings in AI in perspective, in order to clear up some of the most common misunderstandings and promises.
In this episode I make a list of some of the most misleading statements about what artificial intelligence can achieve in the near future.
In this episode - which I advise to consume at night, in a quiet place - I speak about private machine learning and blockchain, while I sip a cup of coffee in my home office.
There are several reasons why I believe we should start thinking about private machine learning...
It doesn't really matter what approach becomes successful and gets adopted, as long as it makes private machine learning possible. If people own their data, they should also own the by-product of such data.
Decentralized machine learning makes this scenario possible.
Today I am having a conversation with Filip Piękniewski, researcher working on computer vision and AI at Koh Young Research America.
His adventure with AI started in the 90s, and since then a long list of experiences at the intersection of computer science and physics led him to the conclusion that deep learning might not be sufficient nor appropriate to solve the problem of intelligence, specifically artificial intelligence.
I read some of his publications and got familiar with some of his ideas. Honestly, I have been attracted by the fact that Filip does not buy the hype around AI and deep learning in particular.
He doesn’t seem to share the vision of folks like Elon Musk who claimed that we are going to see an exponential improvement in self driving cars among other things (he actually said that before a Tesla drove over a pedestrian).
In this episode I continue the conversation from the previous one, about failing machine learning models.
When data scientists have access to the distributions of training and testing datasets it becomes relatively easy to assess if a model will perform equally on both datasets. What happens with private datasets, where no access to the data can be granted?
At fitchain we might have an answer to this fundamental problem.
The success of a machine learning model depends on several factors and events. True generalization to data that the model has never seen before is more a chimera than a reality. But under specific conditions a well trained machine learning model can generalize well and perform with testing accuracy that is similar to the one performed during training.
In this episode I explain when and why machine learning models fail from training to testing datasets.
In this episode I don't talk about data. In fact, I talk about metadata.
While many machine learning models rely on certain amounts of data, e.g. text, images, audio and video, it has been proved how powerful the signal carried by metadata is - that is, all the data that is invisible to the end user.
Behind a tweet of 140 characters there are more than 140 fields of data that draw a much more detailed profile of the sender and the content she is producing... without ever considering the tweet itself.
References
You are your Metadata: Identification and Obfuscation of Social Media Users using Metadata Information https://www.ucl.ac.uk/~ucfamus/papers/icwsm18.pdf
Today’s episode is about text analysis with python.
Python is the de facto standard in machine learning. A large community and a generous choice of libraries, at the price of sometimes slower execution. But overall a decent language for typical data science tasks.
I am with Rebecca Bilbro, co-author of Applied Text Analysis with Python, with Benjamin Bengfort and Tony Ojeda.
We speak about the evolution of applied text analysis, tools and pipelines, chatbots.
Deep learning models have shown very promising results in computer vision and sound recognition. As more and more deep learning based systems get integrated in disparate domains, they will keep affecting the life of people. Autonomous vehicles, medical imaging and banking applications, surveillance cameras and drones, digital assistants, are only a few real applications where deep learning plays a fundamental role. A malfunction in any of these applications will affect the quality of such integrated systems and compromise the security of the individuals who directly or indirectly use them.
In this episode, we explain how machine learning models can be attacked and what we can do to protect intelligent systems from being compromised.
Today’s episode will be about deep learning and reasoning. There has been a lot of discussion about the effectiveness of deep learning models and their capability to generalize, not only across domains but also on data that such models have never seen.
But there is a research group from the Department of Computer Science, Duke University, that seems to be onto something with deep learning and interpretability in computer vision.
References
Prediction Analysis Lab Duke University https://users.cs.duke.edu/~cynthia/lab.html
This looks like that: deep learning for interpretable image recognition https://arxiv.org/abs/1806.10574
Today’s episode will be about deep learning and compression of data, and in particular compressing images. We all know how important compressing data is, reducing the size of digital objects without affecting the quality.
As a very general rule, the more one compresses an image the lower the quality, due to a number of factors like bitrate, quantization error, etcetera. I am glad to be here with Tong Chen, researcher at the School of Electronic Science and Engineering of Nanjing University, China.
Tong developed a deep learning based compression algorithm for images, that seems to improve over state of the art approaches like BPG, JPEG2000 and JPEG.
Reference
Deep Image Compression via End-to-End Learning - Haojie Liu, Tong Chen, Qiu Shen, Tao Yue, and Zhan Ma School of Electronic Science and Engineering, Nanjing University, Jiangsu, China
In this episode I explain the differences between the L1 and L2 regularization terms that you can find in the function minimization of basically any machine learning model.
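For a hands-on feel of the difference, here is a small sketch with scikit-learn (toy data and arbitrary regularization strengths): the L1 penalty (Lasso) drives some coefficients exactly to zero, while the L2 penalty (Ridge) only shrinks them.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
# Only the first two features actually matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)    # penalty: alpha * sum(|w|)
ridge = Ridge(alpha=10.0).fit(X, y)   # penalty: alpha * sum(w^2)

print("L1 coefficients:", np.round(lasso.coef_, 2))   # irrelevant ones become exactly 0
print("L2 coefficients:", np.round(ridge.coef_, 2))   # small, but rarely exactly 0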
In the second part of this episode I am interviewing Johannes Castner from CollectiWise, a platform for collective intelligence.
I am moving the conversation towards the more practical aspects of the project, asking about the centralised AGI and blockchain components that are an essential part of the platform.
This is the first part of the amazing episode with Johannes Castner, CEO and founder of CollectiWise. Johannes is finishing his PhD in Sustainable Development from Columbia University in New York City, and he is building a platform for collective intelligence. Today we talk about artificial general intelligence and wisdom.
All references and shownotes will be published after the next episode.
Enjoy and stay tuned!
Predicting the weather is one of the most challenging tasks in machine learning, due to the fact that physical phenomena are dynamic and rich in events. Moreover, most traditional approaches to climate forecasting are computationally prohibitive.
It seems that joint research between Earth System Science at the University of California, Irvine and the Faculty of Physics at LMU Munich has produced an interesting improvement in the scalability and accuracy of climate predictive modeling. The solution is... superparameterization and deep learning.
References
Could Machine Learning Break the Convection Parameterization Deadlock?
Humans seem to have reached a crossroads, where they are asked to choose between functionality and privacy. But not both. Not both at all. No data, no service. That’s what companies building personal finance services say. The same applies to marketing companies, social media companies, search engine companies, and healthcare institutions.
In this episode I speak about the reasons to aggregate data for precision medicine, the consequences of such strategies and how can researchers and organizations provide services to individuals while respecting their privacy.
In the attempt to democratize machine learning, data scientists should have the possibility to train their models on data they do not necessarily own, nor see. A model that is privately trained should be verified and uniquely identified across its entire life cycle, from its random initialization to setting the optimal values of its parameters.
How does blockchain allow all this? Fitchain is the decentralized machine learning platform that provides models with an identity and a certification of their training procedure, the proof-of-train.
I know, I have been away too long without publishing much in the last 3 months.
But, there's a reason for that. I have been building a platform that combines machine learning with blockchain technology.
Let me introduce you to fitchain and tell you more in this episode.
If you want to collaborate on the project or just think it's interesting, drop me a line on the contact page at fitchain.io
Francesco Gadaleta introduces Fitchain, a decentralized machine learning platform that combines blockchain technology and AI to solve the data manipulation problem in restrictive environments such as healthcare or financial institutions.
Francesco Gadaleta is the founder of Fitchain.io and senior advisor to Abe AI. Fitchain is a platform that officially started in October 2017, which allows data scientists to write machine learning models on data they cannot see or access due to restrictions imposed in healthcare or financial environments. In the Fitchain platform there are two actors, the data owner and the data scientist. They both run the Fitchain POD, which orchestrates the relationship between the two sides. The idea behind Fitchain is summarized in the thesis “do not move the data, move the model – bring the model where the data is stored.”
The Fitchain team has also coined a new term, “proof of train” – a way to guarantee that the model is truly trained at the organization, and that it becomes traceable on the blockchain. To develop the complex technological aspects of the platform, Fitchain has partnered up with BigChainDB, the project we have recently featured on Crypto Radio.
Roadmap
The Fitchain team is currently validating the assumptions and increasing the security of the platform. In the next few months they will extend the portfolio of machine learning libraries, and they are planning to move from a B2B product towards a Fitchain for consumers.
By June 2018 they plan to start the Internet of PODs. They will also design the Fitchain token – FitCoin – a utility token to enable operating on the Fitchain platform.
Data is a complex topic, not only related to machine learning algorithms, but also and especially to privacy and security of individuals, the same individuals who create such data just by using the many mobile apps and services that characterize their digital life.
In this episode I am together with B.J. Mendelson, author of “Social Media is Bullshit” from St. Martin’s Press and world-renowned speaker on issues involving the myths and realities of today’s Internet platforms. B.J. has a new book about privacy and sent me a free copy of “Privacy, and how to get it back” that I read in just one day. That was enough to realise how much we have in common when it comes to data and data collection.
Despite what researchers claim about genetic evolution, in this episode we give a realistic view of the field.
In order to succeed with artificial intelligence, it is better to know how to fail first. It is easier than you think.
Here are 9 easy steps to fail your AI startup.
The enthusiasm for artificial intelligence is raising some concerns, especially with respect to some adventurous conclusions about what AI can really do and what its direct descendant, artificial general intelligence, would be capable of doing in the immediate future. From stealing jobs to exterminating the entire human race, the creativity (of some) seems to have no limits.
In this episode I make sure that everyone comes back to reality - which might sound less exciting than Hollywood but definitely more... real.
In the aftermath of the Barclays Accelerator, powered by Techstars experience, one of the most innovative and influential startup accelerators in the world, I’d like to give back to the community lessons learned, including the need for confidence, soft-skills, and efficiency, to be applied to startups that deal with artificial intelligence and data science.
In this episode I also share some thoughts about the culture of fireflies in modern and dynamic organisations.
In this episode I speak about deep learning technology applied to Alzheimer's disease prediction. I had a great chat with Saman Sarraf, machine learning engineer at Konica Minolta, former lab manager at the Rotman Research Institute at Baycrest, University of Toronto, and author of DeepAD: Alzheimer's Disease Classification via Deep Convolutional Neural Networks using MRI and fMRI.
I hope you enjoy the show.
In this episode, I speak about the requirements and the skills to become a data scientist and join an amazing community that is changing the world with data analytics.
In machine learning and data science in general it is very common to deal at some point with imbalanced datasets and class distributions. This is the typical case where the number of observations that belong to one class is significantly lower than the number belonging to the other classes. Actually this happens all the time, in several domains, from finance, to healthcare, to social media, just to name a few I have personally worked with.
Think about a bank detecting fraudulent transactions among millions or billions of daily operations, or equivalently in healthcare for the identification of rare disorders.
In genetics, but also with clinical lab tests, this is a normal scenario: fortunately there are very few patients affected by a disorder, and therefore very few cases with respect to the large pool of healthy (unaffected) patients.
No algorithm takes the class distribution or the number of observations in each class into account unless it is explicitly designed to handle such situations.
In this episode I speak about some effective techniques to handle such common and challenging scenarios, and how to match the most appropriate method to the dataset or problem at hand.
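One of the simplest of those techniques, sketched below with scikit-learn on a synthetic dataset with 1% positives (all numbers are illustrative), is cost-sensitive learning: reweight the classes in the loss instead of touching the data.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# 1% positives, mimicking fraud detection or a rare disorder
X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Recall on the rare class is what matters here, not overall accuracy
print("plain recall:   ", recall_score(y_te, plain.predict(X_te)))
print("weighted recall:", recall_score(y_te, weighted.predict(X_te)))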
Ensemble methods have been designed to improve the performance of the single model, when the single model is not very accurate. According to the general definition of ensembling, it consists in building a number of single classifiers and then combining or aggregating their predictions into one classifier that is usually stronger than the single one.
The key idea behind ensembling is that some models will do well when they model certain aspects of the data while others will do well in modelling other aspects.
In this episode I show with a numeric example why and when ensemble methods work.
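The gist of such a numeric example is the following (a sketch with made-up numbers, not the exact figures from the episode): if each of n independent classifiers is correct with probability p > 0.5, the majority vote is correct whenever more than half of them are, and that probability grows quickly with n.

from math import comb

def majority_vote_accuracy(n, p):
    # P(majority of n independent classifiers is correct), each correct with prob. p
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

p = 0.6   # each individual model is only slightly better than chance
for n in (1, 5, 11, 101):
    print(n, round(majority_vote_accuracy(n, p), 3))
# accuracy climbs from 0.6 towards 1.0 as more (truly independent!) models vote

The independence caveat is exactly why ensembling works best when the members model different aspects of the data.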
Continuing the discussion of the last two episodes, there is one more aspect of deep learning that I would love to consider, and have therefore left for a full episode: parallelising and distributing deep learning on relatively large clusters.
As a matter of fact, computing architectures are changing in a way that encourages parallelism more than ever before. Deep learning is no exception: despite the great improvements brought by commodity GPUs (graphical processing units), when it comes to speed there is still room for improvement.
Together with the last two episodes, this one completes the picture of deep learning at scale. Indeed, as I mentioned in the previous episode, How to master optimisation in deep learning, the function optimizer is the horsepower of deep learning and neural networks in general. A slow and inaccurate optimisation method leads to networks that slowly converge to unreliable results.
In another episode, titled “Additional strategies for optimizing deep learning”, I explained some ways to improve function minimisation and model tuning in order to get better parameters in less time. So feel free to listen to these episodes again, share them with your friends, even re-broadcast or download them for your commute.
While the methods that I have explained so far represent a good starting point for prototyping a network, when you need to switch to production environments or take advantage of the most recent and advanced hardware capabilities of your GPU, well... in all those cases, you would like to do something more.
In the last episode How to master optimisation in deep learning I explained some of the most challenging tasks of deep learning and some methodologies and algorithms to improve the speed of convergence of a minimisation method for deep learning.
I explored the family of gradient descent methods - even though not exhaustively - giving a list of approaches that deep learning researchers are considering for different scenarios. Every method has its own benefits and drawbacks, pretty much depending on the type of data, and data sparsity. But there is one method that seems to be, at least empirically, the best approach so far.
Feel free to listen to the previous episode, share it, re-broadcast or just download for your commute.
In this episode I would like to continue that conversation about some additional strategies for optimising gradient descent in deep learning, and introduce you to some tricks that might come in handy when your neural network stops learning from data, or when the learning process becomes so slow that it really seems to have reached a plateau even when feeding in fresh data.
The secret behind deep learning is not really a secret. It is function optimisation. What a neural network essentially does is optimise a function. In this episode I illustrate a number of optimisation methods and explain which one is the best and why.
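To make “optimising a function” concrete, here is a toy NumPy implementation of one popular candidate, the Adam optimizer, assuming the standard update rule (the objective and hyperparameters below are illustrative; which method wins, and why, is discussed in the episode):

import numpy as np

def grad(x):
    # Gradient of f(x) = (x0 - 3)^2 + 10 * (x1 + 2)^2, a simple convex bowl
    return np.array([2 * (x[0] - 3), 20 * (x[1] + 2)])

x = np.array([0.0, 0.0])
m, v = np.zeros(2), np.zeros(2)
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 2001):
    g = grad(x)
    m = b1 * m + (1 - b1) * g        # first moment: running mean of gradients
    v = b2 * v + (1 - b2) * g**2     # second moment: running mean of squared gradients
    m_hat = m / (1 - b1**t)          # bias corrections for the zero initialization
    v_hat = v / (1 - b2**t)
    x -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(np.round(x, 3))   # approaches the minimum at (3, -2)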
Over the past few years, neural networks have re-emerged as powerful machine-learning models, reaching state-of-the-art results in several fields like image recognition and speech processing. More recently, neural network models have also been applied to textual data in order to deal with natural language, and there too with promising results. In this episode I explain why deep learning performs the way it does, and what some of the most tedious causes of failure are.
Artificial intelligence allows machines to learn patterns from data. The way humans learn, however, is different and more efficient. With Lifelong Machine Learning, machines can learn the way human beings do: faster, and more efficiently.
Talking about security of communication and privacy is never enough, especially when political instabilities are driving leaders towards decisions that will affect people on a global scale.
We strongly believe 2017 will be a very interesting year for data science and artificial intelligence. Let me tell you what I expect and why.
Is the market really predictable? How do stock prices increase? What are their dynamics? Here is what I think about the magic and the reality of predictions applied to markets and the stock exchange.
Why the job of the data scientist could soon disappear, and what it takes for a data scientist to survive the inflation of the role.
Data science is also making a difference in fraud detection. In this episode I have a conversation with an expert in the field, engineer Eyad Sibai, who works on fraud detection at iZettle, a payments company.
Extracting knowledge from large datasets with a large number of variables is always tricky. Dimensionality reduction helps in analyzing high dimensional data while still maintaining most of the information hidden behind its complexity. Here are some methods that you must try before further analysis (Part 1).
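Principal Component Analysis is typically the first method on such a list; here is a minimal sketch with scikit-learn on synthetic data (the dimensions are made up for illustration):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 500 observations and 50 variables, but only 3 truly independent directions
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 50)) + rng.normal(scale=0.05, size=(500, 50))

pca = PCA(n_components=10).fit(X)
print(np.round(pca.explained_variance_ratio_, 3))
# The first 3 components capture almost all the variance:
# the remaining 47 dimensions were mostly redundant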
How would you perform accurate classification on a very large dataset by just looking at a sample of it?
What is deep learning? If you have no patience, deep learning is the result of training many layers of non-linear processing units for feature extraction and data transformation, e.g. from pixels, to edges, to shapes, to object classification, to scene description, captioning, etc.
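That definition fits in a few lines of code; here is a sketch of the idea with random weights (so it transforms rather than recognizes anything yet):

import numpy as np

rng = np.random.default_rng(0)

def layer(x, n_out):
    # One layer of non-linear processing: affine transform + non-linearity
    W = rng.normal(scale=0.1, size=(n_out, x.shape[0]))
    return np.tanh(W @ x)

x = rng.normal(size=784)         # e.g. the pixels of a 28x28 image
for n_out in (256, 64, 10):      # "many layers": pixels -> features -> classes
    x = layer(x, n_out)

print(x.shape)   # (10,) - a representation ready for classification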
At some point, statistical problems need sampling. Sampling consists in generating observations from a specific distribution.
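A classic recipe is inverse transform sampling: draw u uniformly in (0, 1) and push it through the inverse CDF of the target distribution. A minimal sketch for the exponential distribution:

import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                        # rate of the target Exponential(lam) distribution

u = rng.uniform(size=100_000)    # the only randomness we need
x = -np.log(1 - u) / lam         # inverse CDF of the exponential

print(round(x.mean(), 3))        # close to 1/lam = 0.5, as expected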
In this show I interview Sebastian Raschka, data scientist and author of Python Machine Learning. In addition to the fun we had offline, there are great elements about machine learning, data science, and current and future trends to keep an ear on. Moreover, it is the conversation of two data scientists who contribute and operate in the field on a daily basis.
There are statisticians and data scientists... Among statisticians, there are some who just count. Some others who… think differently. In this show we explore the age-old dilemma between frequentists and Bayesians. Given a statistical problem, who’s going to be right?
In this episode, we tell you how to become a data scientist and join an amazing community that is changing the world with data analytics.
Should data scientists follow the good old practices of software engineering? Data scientists make software, after all.
It’s time to experiment with Data Science at Home. Since we are still dealing with our hosting service, consider the first episode purely experimental, even though the content might be of interest no matter what.
Have you ever thought of getting a Big Data infrastructure on your desk? That’s right! On your desk.
In this episode I meet Dr Eliseo Ferrante, formerly at the University of Leuven, currently researcher at the Université de Technologie de Compiègne, who studies self-organization and evolution.