Podcast: Data Skeptic

The Small World Hypothesis

21 april 2025 | 17 min

Thinking in Networks

12 april 2025 | 34 min

Fraud Networks

1 april 2025 | 43 min

Criminal Networks

17 mars 2025 | 44 min

Graph Bugs

10 mars 2025 | 29 min

Organizational Network Analysis

3 mars 2025 | 44 min

Organizational Networks

25 februari 2025 | 28 min

Networks of the Mind

18 februari 2025 | 43 min

LLMs and Graphs Synergy

10 februari 2025 | 35 min

A Network of Networks

4 februari 2025 | 46 min

Auditing LLMs and Twitter

29 januari 2025 | 40 min

Fraud Detection with Graphs

22 januari 2025 | 37 min

Optimizing Supply Chains with GNN

15 januari 2025 | 38 min

The Mystery Behind Large Graphs

10 januari 2025 | 48 min

Customizing a Graph Solution

16 december 2024 | 38 min

Graph Transformations

9 december 2024 | 33 min

Networks for AB Testing

25 november 2024 | 37 min

Lessons from eGamer Networks

18 november 2024 | 38 min

Github Collaboration Network

11 november 2024 | 42 min

Graphs and ML for Robotics

4 november 2024 | 42 min

Graphs for HPC and LLMs

29 oktober 2024 | 52 min

Graph Databases and AI

21 oktober 2024 | 36 min

Network Analysis in Practice

14 oktober 2024 | 30 min

Animal Intelligence Final Exam

7 oktober 2024 | 30 min

Process Mining with LLMs

24 september 2024 | 26 min

Open Animal Tracks

17 september 2024 | 23 min

Bird Distribution Modeling with Satbird

10 september 2024 | 40 min

Ant Encounters

26 augusti 2024 | 31 min

Computing Toolbox

19 augusti 2024 | 39 min

Biodiversity Monitoring

14 augusti 2024 | 32 min

Hacking the Colony

8 augusti 2024 | 41 min

Primate Poses

31 juli 2024 | 33 min

Generating 3D Animals with YouDream

23 juli 2024 | 60 min

Weird Communication

15 juli 2024 | 38 min

Reducing the Impact of Ship Noise on Marine Mammals

1 juli 2024 | 36 min

Analysis of Unstructured Data

28 juni 2024 | 27 min

iNaturalist

24 juni 2024 | 38 min

Learn to Code

18 juni 2024 | 50 min

Animal Computer Interaction

10 juni 2024 | 43 min

Ape Gestures

3 juni 2024 | 49 min

Evaluating AI Abilities

27 maj 2024 | 50 min

HMMs for Behavior

20 maj 2024 | 45 min

Bioinspired Engineering

14 maj 2024 | 38 min

Modelling Evolution

9 maj 2024 | 41 min

Behavioral Genetics

30 april 2024 | 47 min

Signal in the Noise

25 april 2024 | 42 min

Pose Tracking

16 april 2024 | 51 min

Modeling Group Behavior

8 april 2024 | 41 min

Advances in Data Loggers

25 mars 2024 | 36 min

What You Know About Intelligence is Wrong (fixed)

20 mars 2024 | 42 min

Animal Decision Making

12 mars 2024 | 37 min

Octopus Cognition

8 mars 2024 | 38 min

Optimal Foraging

28 februari 2024 | 38 min

Memory in Chess

12 februari 2024 | 49 min

OpenWorm

5 februari 2024 | 34 min

What the Antlion Knows

30 januari 2024 | 42 min

AI Roundtable

17 januari 2024 | 51 min

Uncontrollable AI Risks

27 december 2023 | 39 min

I LLM and You Can Too

23 december 2023 | 24 min

Q&A with Kyle

19 december 2023 | 40 min

LLMs for Data Analysis

12 december 2023 | 29 min

AI Platforms

4 december 2023 | 34 min

Deploying LLMs

27 november 2023 | 35 min

A Survey Assessing Github Copilot

20 november 2023 | 26 min

Program Aided Language Models

13 november 2023 | 32 min

Which Programming Language is ChatGPT Best At

6 november 2023 | 40 min

GraphText

31 oktober 2023 | 31 min

arXiv Publication Patterns

23 oktober 2023 | 28 min

Do LLMs Make Ethical Choices

16 oktober 2023 | 29 min

Emergent Deception in LLMs

9 oktober 2023 | 27 min

Agents with Theory of Mind Play Hanabi

2 oktober 2023 | 38 min

LLMs for Evil

25 september 2023 | 26 min

The Defeat of the Winograd Schema Challenge

11 september 2023 | 31 min

LLMs in Social Science

4 september 2023 | 34 min

LLMs in Music Composition

28 augusti 2023 | 34 min

Cuttlefish Model Tuning

21 augusti 2023 | 27 min

Which Professions Are Threatened by LLMs

15 augusti 2023 | 39 min

Why Prompting is Hard

8 augusti 2023 | 49 min

Automated Peer Review

31 juli 2023 | 36 min

Prompt Refusal

24 juli 2023 | 44 min

A Long Way Till AGI

18 juli 2023 | 37 min

Brain Inspired AI

11 juli 2023 | 36 min

Computable AGI

3 juli 2023 | 36 min

AGI Can Be Safe

26 juni 2023 | 46 min

AI Fails on Theory of Mind Tasks

19 juni 2023 | 52 min

AI for Mathematics Education

12 juni 2023 | 36 min

Evaluating Jokes with LLMs

6 juni 2023 | 43 min

Why Machines Will Never Rule the World

29 maj 2023 | 55 min

A Psychopathological Approach to Safety in AGI

23 maj 2023 | 49 min

The NLP Community Metasurvey

15 maj 2023 | 50 min

Skeptical Survey Interpretation

10 maj 2023 | 22 min

The Gallup Poll

1 maj 2023 | 40 min

Inclusive Study Group Formation at Scale

25 april 2023 | 32 min

The PhilPapers Survey

21 april 2023 | 32 min

Non-Response Bias

10 april 2023 | 36 min

Measuring Trust in Robots with Likert Scales

3 april 2023 | 47 min

CAREER Prediction

27 mars 2023 | 41 min

The Panel Study of Income Dynamics

21 mars 2023 | 34 min

Survey Design Working Session

14 mars 2023 | 62 min

Bot Detection and Dyadic Surveys

6 mars 2023 | 35 min

Reproducible ESP Testing

20 februari 2023 | 47 min

A Survey of Data Science Methodologies

13 februari 2023 | 25 min

Opinion Dynamics Models

6 februari 2023 | 36 min

Casual Affective Triggers

30 januari 2023 | 36 min

Conversational Surveys

23 januari 2023 | 40 min

Do Results Generalize for Privacy and Security Surveys

17 januari 2023 | 40 min

4 out of 5 Data Scientists Agree

10 januari 2023 | 29 min

Crowdfunded Board Games

26 december 2022 | 35 min

Russian Election Interference Effectiveness

19 december 2022 | 42 min

Placement Laundering Fraud

15 december 2022 | 33 min

Data Clean Rooms

12 december 2022 | 32 min

Dark Patterns in Site Design

5 december 2022 | 35 min

Internet Advertising Bureau Media Lab

3 december 2022 | 37 min

Your Mouse Reveals Your Gender and Age

28 november 2022 | 40 min

Measuring Web Search Behavior

21 november 2022 | 36 min

StrategyQA and Big Bench

18 november 2022 | 42 min

Ad Blockers Effect on News Consumption

14 november 2022 | 39 min

Your Consent is Worth 75 Euros a Year

7 november 2022 | 24 min

Automated Email Generation for Targeted Attacks

31 oktober 2022 | 45 min

Tribal Marketing

24 oktober 2022 | 38 min

Nano-targetted Facebook Ads

17 oktober 2022 | 45 min

Debiasing GPT-3 Job Ads

10 oktober 2022 | 49 min

ML Ops in Production

6 oktober 2022 | 42 min

Ad Network Tomography

3 oktober 2022 | 35 min

First Party Tracking Cookies

26 september 2022 | 35 min

The Harms of Targeted Weight Loss Ads

19 september 2022 | 35 min

Podcast Advertising

12 september 2022 | 35 min

Fairness in e-Commerce Search

5 september 2022 | 41 min

Fraudulent Amazon Reviewers

29 augusti 2022 | 41 min

Ad Targeting in Amazon Smart Speakers

22 augusti 2022 | 33 min

Adwords with Unknown Budgets

15 augusti 2022 | 34 min

ML Ops Best Practices

12 augusti 2022 | 30 min

Affiliate Marketing Rabbithole

8 augusti 2022 | 52 min

Monetization of Youtube Conspiracy Theorists

1 augusti 2022 | 54 min

User Perceptions of Problematic Ads

25 juli 2022 | 38 min

Political Digital Advertising Analysis

21 juli 2022 | 36 min

Fraud Detection in Crowdfunding Campaigns

18 juli 2022 | 36 min

Artificial Intelligence and Auction Design

11 juli 2022 | 43 min

Privacy Preference Signals

4 juli 2022 | 33 min

Neural Architecture Search for CTR Prediction

27 juni 2022 | 28 min

Algorithmic PPC Management

21 juni 2022 | 44 min

Data Skeptic: Ad Tech

18 juni 2022 | 42 min

The Reliability of Mobile Phone Data

13 juni 2022 | 50 min

Haywire Algorithms

6 juni 2022 | 34 min

School Reopening Analysis

30 maj 2022 | 33 min

Modern Data Stacks

26 maj 2022 | 35 min

Emoji as a Predictor

23 maj 2022 | 21 min

Polarizing Trends in the Gig Economy

16 maj 2022 | 46 min

Remote Learning in Applied Engineering

12 maj 2022 | 25 min

Remote Productivity

9 maj 2022 | 30 min

Does Remote Learning Work?

1 maj 2022 | 48 min

Covid-19 Impact on Bicycle Usage

25 april 2022 | 31 min

Learning Digital Fabrication Remotely

22 april 2022 | 34 min

Remote Software Development

18 april 2022 | 38 min

Quantum K-Means

11 april 2022 | 40 min

K-Means in Practice

4 april 2022 | 31 min

Fair Hierarchical Clustering

28 mars 2022 | 34 min

Matrix Factorization For k-Means

21 mars 2022 | 30 min

Breathing K-Means

14 mars 2022 | 43 min

Power K-Means

7 mars 2022 | 33 min

Explainable K-Means

3 mars 2022 | 26 min

Customer Clustering

28 februari 2022 | 22 min

k-means Image Segmentation

22 februari 2022 | 23 min

Tracking Elephant Clusters

18 februari 2022 | 26 min

k-means clustering

14 februari 2022 | 24 min

Snowflake Essentials

7 februari 2022 | 47 min

Explainable Climate Science

31 januari 2022 | 35 min

Energy Forecasting Pipelines

24 januari 2022 | 43 min

Matrix Profiles in Stumpy

17 januari 2022 | 39 min

The Great Australian Prediction Project

14 januari 2022 | 25 min

Water Demand Forecasting

10 januari 2022 | 26 min

Open Telemetry

3 januari 2022 | 36 min

Fashion Predictions

27 december 2021 | 35 min

Time Series Mini Episodes

25 december 2021 | 37 min

Forecasting Motor Vehicle Collision

20 december 2021 | 39 min

Deep Learning for Road Traffic Forecasting

13 december 2021 | 32 min

Bike Share Demand Forecasting

6 december 2021 | 41 min

Forecasting in Supply Chain

29 november 2021 | 36 min

Black Friday

26 november 2021 | 45 min

Aligning Time Series on Incomparable Spaces

22 november 2021 | 34 min

Comparing Time Series with HCTSA

15 november 2021 | 43 min

Change Point Detection Algorithms

8 november 2021 | 31 min

Time Series for Good

1 november 2021 | 38 min

Long Term Time Series Forecasting

25 oktober 2021 | 38 min

Fast and Frugal Time Series Forecasting

17 oktober 2021 | 38 min

Causal Inference in Educational Systems

11 oktober 2021 | 41 min

Boosted Embeddings for Time Series

4 oktober 2021 | 29 min

Change Point Detection in Continuous Integration Systems

27 september 2021 | 34 min

Applying k-Nearest Neighbors to Time Series

20 september 2021 | 24 min

Ultra Long Time Series

13 september 2021 | 28 min

MiniRocket

6 september 2021 | 26 min

ARiMA is not Sufficient

30 augusti 2021 | 23 min

Comp Engine

23 augusti 2021 | 36 min

Detecting Ransomware

16 augusti 2021 | 31 min

GANs in Finance

9 augusti 2021 | 23 min

Predicting Urban Land Use

2 augusti 2021 | 27 min

Opportunities for Skillful Weather Prediction

26 juli 2021 | 34 min

Predicting Stock Prices

19 juli 2021 | 34 min

N-Beats

12 juli 2021 | 34 min

Translation Automation

6 juli 2021 | 36 min

Time Series at the Beach

28 juni 2021 | 23 min

Automatic Identification of Outlier Galaxy Images

21 juni 2021 | 36 min

Do We Need Deep Learning in Time Series

16 juni 2021 | 29 min

Detecting Drift

11 juni 2021 | 27 min

Darts Library for Time Series

31 maj 2021 | 25 min

Forecasting Principles and Practice

24 maj 2021 | 32 min

Prequisites for Time Series

21 maj 2021 | 9 min

Orders of Magnitude

7 maj 2021 | 33 min

They're Coming for Our Jobs

3 maj 2021 | 44 min

Pandemic Machine Learning Pitfalls

26 april 2021 | 40 min

Flesch Kincaid Readability Tests

19 april 2021 | 20 min

Fairness Aware Outlier Detection

9 april 2021 | 40 min

Life May be Rare

5 april 2021 | 43 min

Social Networks

29 mars 2021 | 50 min

The QAnon Conspiracy

22 mars 2021 | 44 min

Benchmarking Vision on Edge vs Cloud

15 mars 2021 | 48 min

Goodhart's Law in Reinforcement Learning

5 mars 2021 | 37 min

Video Anomaly Detection

1 mars 2021 | 24 min

Fault Tolerant Distributed Gradient Descent

22 februari 2021 | 36 min

Decentralized Information Gathering

15 februari 2021 | 33 min

Leaderless Consensus

5 februari 2021 | 27 min

Automatic Summarization

29 januari 2021 | 28 min

Gerrymandering

22 januari 2021 | 34 min

Even Cooperative Chess is Hard

15 januari 2021 | 23 min

Consecutive Votes in Paxos

11 januari 2021 | 30 min

Visual Illusions Deceiving Neural Networks

1 januari 2021 | 34 min

Earthquake Detection with Crowd-sourced Data

25 december 2020 | 29 min

Byzantine Fault Tolerant Consensus

22 december 2020 | 36 min

Alpha Fold

11 december 2020 | 23 min

Arrow's Impossibility Theorem

4 december 2020 | 26 min

Face Mask Sentiment Analysis

27 november 2020 | 41 min

Counting Briberies in Elections

20 november 2020 | 38 min

Sybil Attacks on Federated Learning

13 november 2020 | 32 min

Differential Privacy at the US Census

6 november 2020 | 30 min

Distributed Consensus

30 oktober 2020 | 28 min

ACID Compliance

23 oktober 2020 | 24 min

National Popular Vote Interstate Compact

16 oktober 2020 | 31 min

Defending the p-value

12 oktober 2020 | 30 min

Retraction Watch

5 oktober 2020 | 32 min

Crowdsourced Expertise

21 september 2020 | 28 min

The Spread of Misinformation Online

14 september 2020 | 36 min

Consensus Voting

7 september 2020 | 23 min

Voting Mechanisms

31 augusti 2020 | 27 min

False Consensus

24 augusti 2020 | 33 min

Fraud Detection in Real Time

18 augusti 2020 | 38 min

Listener Survey Review

11 augusti 2020 | 23 min

Human Computer Interaction and Online Privacy

27 juli 2020 | 33 min

Authorship Attribution of Lennon McCartney Songs

20 juli 2020 | 33 min

GANs Can Be Interpretable

11 juli 2020 | 27 min

Sentiment Preserving Fake Reviews

6 juli 2020 | 29 min

Interpretability Practitioners

26 juni 2020 | 32 min

Facial Recognition Auditing

19 juni 2020 | 48 min

Robust Fit to Nature

12 juni 2020 | 38 min

Black Boxes Are Not Required

5 juni 2020 | 32 min

Robustness to Unforeseen Adversarial Attacks

30 maj 2020 | 22 min

Estimating the Size of Language Acquisition

22 maj 2020 | 25 min

Interpretable AI in Healthcare

15 maj 2020 | 36 min

Understanding Neural Networks

8 maj 2020 | 35 min

Self-Explaining AI

2 maj 2020 | 32 min

Plastic Bag Bans

24 april 2020 | 35 min

Self Driving Cars and Pedestrians

18 april 2020 | 31 min

Computer Vision is Not Perfect

10 april 2020 | 26 min

Uncertainty Representations

4 april 2020 | 40 min

AlphaGo, COVID-19 Contact Tracing and New Data Set

28 mars 2020 | 34 min

Visualizing Uncertainty

20 mars 2020 | 33 min

Interpretability Tooling

13 mars 2020 | 43 min

Shapley Values

6 mars 2020 | 20 min

Anchors as Explanations

28 februari 2020 | 37 min

Mathematical Models of Ecological Systems

22 februari 2020 | 37 min

Adversarial Explanations

14 februari 2020 | 37 min

ObjectNet

7 februari 2020 | 39 min

Visualization and Interpretability

31 januari 2020 | 36 min

Interpretable One Shot Learning

26 januari 2020 | 31 min

Fooling Computer Vision

22 januari 2020 | 25 min

Algorithmic Fairness

14 januari 2020 | 42 min

Interpretability

7 januari 2020 | 33 min

NLP in 2019

31 december 2019 | 39 min

The Limits of NLP

24 december 2019 | 30 min

Jumpstart Your ML Project

15 december 2019 | 21 min

Serverless NLP Model Training

10 december 2019 | 29 min

Team Data Science Process

3 december 2019 | 41 min

Ancient Text Restoration

1 december 2019 | 41 min

ML Ops

27 november 2019 | 37 min

Annotator Bias

23 november 2019 | 26 min

NLP for Developers

20 november 2019 | 29 min

Indigenous American Language Research

13 november 2019 | 23 min

Talking to GPT-2

31 oktober 2019 | 29 min

Reproducing Deep Learning Models

23 oktober 2019 | 23 min

What BERT is Not

14 oktober 2019 | 27 min

SpanBERT

8 oktober 2019 | 25 min

BERT is Shallow

23 september 2019 | 20 min

BERT is Magic

16 september 2019 | 18 min

Applied Data Science in Industry

6 september 2019 | 22 min

Building the howto100m Video Corpus

19 augusti 2019 | 23 min

BERT

29 juli 2019 | 14 min

Onnx

22 juli 2019 | 21 min

Catastrophic Forgetting

15 juli 2019 | 21 min

Transfer Learning

8 juli 2019 | 30 min

Facebook Bargaining Bots Invented a Language

21 juni 2019 | 23 min

Under Resourced Languages

15 juni 2019 | 17 min

Named Entity Recognition

8 juni 2019 | 17 min

The Death of a Language

1 juni 2019 | 20 min

Neural Turing Machines

25 maj 2019 | 25 min

Data Infrastructure in the Cloud

18 maj 2019 | 30 min

NCAA Predictions on Spark

11 maj 2019 | 24 min

The Transformer

3 maj 2019 | 15 min

Mapping Dialects with Twitter Data

26 april 2019 | 25 min

Sentiment Analysis

20 april 2019 | 27 min

Attention Primer

13 april 2019 | 15 min

Cross-lingual Short-text Matching

5 april 2019 | 25 min

ELMo

29 mars 2019 | 24 min

BLEU

23 mars 2019 | 42 min

Simultaneous Translation at Baidu

15 mars 2019 | 24 min

Human vs Machine Transcription

8 mars 2019 | 33 min

seq2seq

1 mars 2019 | 22 min

Text Mining in R

22 februari 2019 | 20 min

Recurrent Relational Networks

15 februari 2019 | 19 min

Text World and Word Embedding Lower Bounds

8 februari 2019 | 39 min

word2vec

1 februari 2019 | 31 min

Authorship Attribution

25 januari 2019 | 51 min

Very Large Corpora and Zipf's Law

18 januari 2019 | 24 min

Semantic search at Github

11 januari 2019 | 35 min

Let's Talk About Natural Language Processing

4 januari 2019 | 36 min

Data Science Hiring Processes

28 december 2018 | 33 min

Holiday Reading - Epicac

25 december 2018 | 21 min

Drug Discovery with Machine Learning

21 december 2018 | 29 min

Sign Language Recognition

14 december 2018 | 20 min

Data Ethics

7 december 2018 | 20 min

Escaping the Rabbit Hole

30 november 2018 | 34 min

[MINI] Theorem Provers

23 november 2018 | 19 min

Automated Fact Checking

16 november 2018 | 32 min

[MINI] Single Source of Truth

9 november 2018 | 30 min

Detecting Fast Radio Bursts with Deep Learning

2 november 2018 | 45 min

Being Bayesian

26 oktober 2018 | 25 min

Modeling Fake News

19 oktober 2018 | 33 min

The Louvain Method for Community Detection

12 oktober 2018 | 27 min

Cultural Cognition of Scientific Consensus

5 oktober 2018 | 32 min

False Discovery Rates

28 september 2018 | 26 min

Deep Fakes

21 september 2018 | 30 min

Fake News Midterm

14 september 2018 | 19 min

Quality Score

7 september 2018 | 19 min

The Knowledge Illusion

31 augusti 2018 | 40 min

Click Through Rates

24 augusti 2018 | 32 min

A Click Through Rate (CTR) is the proportion of clicks to impressions of some item of content shared online. This terminology is most commonly used in digital advertising but applies just as well to content websites might choose to feature on their homepage or in search results.
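As a minimal sketch of the definition above, CTR can be computed as clicks divided by impressions. The function name and the counts below are hypothetical, chosen only for illustration.

```python
def click_through_rate(clicks, impressions):
    """Return clicks / impressions, or 0.0 when nothing was shown."""
    if clicks < 0 or impressions < 0:
        raise ValueError("counts must be non-negative")
    if impressions == 0:
        return 0.0
    return clicks / impressions


# Hypothetical counts for two pieces of content.
ctr_a = click_through_rate(clicks=30, impressions=1000)
ctr_b = click_through_rate(clicks=12, impressions=200)
```

Content B received fewer clicks in total, yet it has the higher CTR because it was shown far less often.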

A CTR is intuitively appealing as a metric for optimization. After all, if users are uninterested in some content, under normal circumstances, it's reasonable to assume they would ignore the content rather than click on it. On the other hand, the best content is likely to elicit a high CTR as users signal their interest by following the hyperlink.

In the advertising world, a website could charge per impression, per click, or per action. Both impression and action based pricing have asymmetrical results for the publisher and advertiser. However, paying per click (CPC based advertising) seems to strike a nice balance. For this and other numeric reasons, many digital advertising mechanisms (such as Google Adwords) use CPC as the payment mechanism.

When charging per click, an advertising platform will value a high CTR when selecting which ad to show. As we learned in our episode on Goodhart's Law, once a measure is turned into a target, it ceases to be a good measure. While CTR alone does not entirely drive most online advertising algorithms, it does play an important role. Thus, advertisers are incentivized to adopt strategies that maximize CTR.
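One way to see why a pay-per-click platform cares about CTR: the expected revenue from showing an ad once is roughly the price per click times the chance of a click. The sketch below assumes that simple model. The ad names, bids, and CTR estimates are made up, and real platforms weigh additional factors.

```python
def expected_revenue_per_impression(ad):
    """Price per click times the estimated probability of a click."""
    return ad["cpc_bid"] * ad["estimated_ctr"]


# Hypothetical ads competing for the same slot.
ads = [
    {"name": "high bid, low CTR", "cpc_bid": 2.00, "estimated_ctr": 0.01},
    {"name": "low bid, high CTR", "cpc_bid": 0.50, "estimated_ctr": 0.05},
]

best_ad = max(ads, key=expected_revenue_per_impression)
```

Under these assumed numbers the lower bid wins the slot, because its higher CTR more than makes up for the lower price per click.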

On the surface, this sounds like a great idea: provide internet users what they are looking for, and be rewarded with their attention and lower advertising costs. However, one possible unintended consequence of this type of optimization is the creation of ads which are designed solely to generate clicks, regardless of whether the users are happy with the page they visit after clicking a link.

So, at least in part, websites that optimize for higher CTRs are going to favor content that does a good job getting viewers to click it. Getting a user to view a page is not totally synonymous with getting a user to appreciate the content of a page. The gap between the algorithmic goal and the user experience could be one of the factors that has promoted the creation of fake news.

Algorithmic Detection of Fake News

17 augusti 2018 | 46 min

Ant Intelligence

10 augusti 2018 | 28 min

Human Detection of Fake News

3 augusti 2018 | 28 min

Spam Filtering with Naive Bayes

27 juli 2018 | 20 min

Today's spam filters are advanced, data-driven tools. They rely on a variety of techniques to effectively and often seamlessly separate junk email from good email.

Whitelists, blacklists, traffic analysis, network analysis, and a variety of other tools are probably employed by most major players in this area. Naturally, content analysis can be an especially powerful tool for detecting spam.

Given the binary nature of the problem (spam or not spam), it's clear that this is a great problem to use machine learning to solve. In order to apply machine learning, you first need a labelled training set. Thankfully, many standard corpora of labelled spam data are readily available. Further, if you're working for a company with a spam filtering problem, often asking users to self-moderate or flag things as spam can be an effective way to generate a large amount of labels for "free".

With a labeled dataset in hand, a data scientist working on spam filtering must next do feature engineering. This should be done with consideration of the algorithm that will be used. The Naive Bayesian Classifier has been a popular choice for detecting spam because it tends to perform pretty well on high dimensional data, unlike a lot of other ML algorithms. It is also very efficient to compute, making it possible to train a per-user classifier if one wished to. While we might do some basic NLP tricks, for the most part, we can turn each word in a document (or perhaps each bigram or n-gram in a document) into a feature.

The Naive part of the Naive Bayesian Classifier stems from the naive assumption that all features in one's analysis are considered to be independent. If $A$ and $B$ are known to be independent, then $P(A \cap B) = P(A) \cdot P(B)$. In other words, you just multiply the probabilities together. Shh, don't tell anyone, but this assumption is actually wrong! Certainly, if a document contains the word algorithm, it's more likely to contain the word probability than some randomly selected document. Thus, $P(\text{algorithm} \cap \text{probability}) \neq P(\text{algorithm}) \cdot P(\text{probability})$, violating the assumption. Despite this "flaw", the Naive Bayesian Classifier works remarkably well on many problems. If one employs the common approach of converting a document into bigrams (pairs of words instead of single words), then you can capture a good deal of this correlation indirectly.
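To make the "just multiply the probabilities" idea concrete, here is a minimal word-level sketch of a Naive Bayes spam classifier. It is an illustration under stated assumptions, not the method of any particular filter: the tiny training set is invented, "ham" is used as a label for non-spam, add-one (Laplace) smoothing is an added choice to avoid zero probabilities, and logarithms are summed instead of multiplying raw probabilities to avoid numeric underflow.

```python
import math
from collections import Counter


def train(documents):
    """documents: list of (text, label) pairs with label 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in documents:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocabulary


def log_score(text, label, model):
    """Log of prior times the product of per-word likelihoods."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / total_docs)  # log prior
    for word in text.lower().split():
        if word not in vocabulary:
            continue  # skip words never seen during training
        # Naive independence: each word contributes its own factor.
        smoothed = word_counts[label][word] + 1  # add-one smoothing
        score += math.log(smoothed / (total_words + len(vocabulary)))
    return score


def classify(text, model):
    return max(("spam", "ham"), key=lambda label: log_score(text, label, model))


# Invented toy training data.
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training)
```

With this toy data, `classify("free money", model)` picks "spam" and `classify("monday meeting", model)` picks "ham". Swapping single words for bigrams would only change how the text is split into features.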

In the final leg of the discussion, we explore the question of whether or not a Naive Bayesian Classifier would be a good choice for detecting fake news.

The Spread of Fake News

20 juli 2018 | 45 min

Fake News

13 juli 2018 | 38 min

Dev Ops for Data Science

11 juli 2018 | 38 min

First Order Logic

6 juli 2018 | 17 min

Blind Spots in Reinforcement Learning

29 juni 2018 | 28 min

Defending Against Adversarial Attacks

22 juni 2018 | 31 min

Transfer Learning

15 juni 2018 | 18 min

Medical Imaging Training Techniques

8 juni 2018 | 25 min

Kalman Filters

1 juni 2018 | 22 min

AI in Industry

25 maj 2018 | 43 min

AI in Games

18 maj 2018 | 26 min

Game Theory

11 maj 2018 | 24 min

The Experimental Design of Paranormal Claims

4 maj 2018 | 28 min

Winograd Schema Challenge

27 april 2018 | 37 min

The Imitation Game

20 april 2018 | 61 min

Eugene Goostman

13 april 2018 | 17 min

The Theory of Formal Languages

6 april 2018 | 24 min

The Loebner Prize

30 mars 2018 | 33 min

Chatbots

23 mars 2018 | 27 min

The Master Algorithm

16 mars 2018 | 47 min

The No Free Lunch Theorems

9 mars 2018 | 27 min

ML at Sloan Kettering Cancer Center

2 mars 2018 | 39 min

Optimal Decision Making with POMDPs

23 februari 2018 | 19 min

AI Decision-Making

16 februari 2018 | 43 min

[MINI] Reinforcement Learning

9 februari 2018 | 23 min

Evolutionary Computation

2 februari 2018 | 25 min

[MINI] Markov Decision Processes

26 januari 2018 | 20 min

Neuroscience Frontiers

19 januari 2018 | 29 min

Neuroimaging and Big Data

12 januari 2018 | 27 min

The Agent Model of Artificial Intelligence

5 januari 2018 | 17 min

Artificial Intelligence, a Podcast Approach

29 december 2017 | 33 min

Holiday reading 2017

22 december 2017 | 13 min

Complexity and Cryptography

15 december 2017 | 36 min

Mercedes Benz Machine Learning Research

14 december 2017 | 27 min

[MINI] Parallel Algorithms

8 december 2017 | 21 min

Quantum Computing

1 december 2017 | 48 min

Azure Databricks

28 november 2017 | 28 min

[MINI] Exponential Time Algorithms

24 november 2017 | 16 min

P vs NP

17 november 2017 | 39 min

[MINI] Sudoku \in NP

10 november 2017 | 18 min

Algorithms with similar runtimes are said to be in the same complexity class. That runtime is measured in how many steps an algorithm takes relative to the input size.

The class P contains all problems that can be solved by an algorithm which runs in polynomial time (basically, a fixed number of nested for loops iterating over the input). NP includes problems which seem to require brute force. Brute force search generally cannot be done in polynomial time, so it seems that the hardest problems in NP are more difficult than problems in P. I say it "seems" this way because, while most people believe it to be true, it has not been proven. This is the famous P vs. NP conjecture. It will be discussed in more detail in a future episode.

Given a solution to a particular problem, if it can be verified/checked in polynomial time, that problem is in NP. If someone hands you a completed Sudoku puzzle, it's not difficult to see if they made any mistakes. The effort of developing the solution to the Sudoku game seems to be intrinsically more difficult. In fact, as far as anyone knows, in the general case of all possible examples of the game, no known strategy is guaranteed to do fundamentally better than systematically trying out guesses.
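Checking a completed puzzle really is cheap, which the following sketch tries to show. It assumes the standard 9x9 game and a grid given as a list of nine lists of integers. The function name and the example grid are hypothetical.

```python
def is_valid_sudoku_solution(grid):
    """Return True if every row, column, and 3x3 box of a completed
    9x9 grid contains the digits 1 through 9 exactly once."""
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(grid[r][c] for r in range(9)) for c in range(9)]
    boxes = [
        set(grid[r][c]
            for r in range(br, br + 3)
            for c in range(bc, bc + 3))
        for br in range(0, 9, 3)
        for bc in range(0, 9, 3)
    ]
    return all(group == digits for group in rows + cols + boxes)


# A valid grid built from a simple shifting pattern.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
          for r in range(9)]

# Swapping two cells in one row keeps the row valid but breaks two columns.
broken = [row[:] for row in solved]
broken[0][0], broken[0][1] = broken[0][1], broken[0][0]
```

The check touches each cell a constant number of times, so its cost grows polynomially with the size of the grid. Nothing comparable is known for finding a solution in the general case.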

This notion of randomly guessing the solution is where the N in NP comes from: Non-deterministic. Imagine a machine with a random input already written in its memory. Given enough such machines, one of them will have the right answer. If they all ran in parallel, one of them could verify its input in polynomial time. This guess / provided input is often called a witness string.

NP is an important concept for many reasons. To me, the most important reason to know about NP is a practical one. Depending on your goals or the goals of your employer, there are many challenging problems you may attempt to solve. If a problem you are trying to solve happens to be one of the hard problems in NP, then you should consider the implications very carefully. Perhaps you'll be lucky and discover that your particular instance of the problem is easy. Sudoku is pretty easy if only 2 remaining squares need to be filled in. The traveling salesman problem is easy to solve if you live in a country where all roads form a ring with exactly one road in and out.

If the problem you wish to solve is not trivial, or if you will face many instances of the problem and expect some will not be trivial, then it's unlikely you'll be able to find the exact solution. Sure, maybe you can grab a bunch of commodity servers and try to scale the heck out of your attempt. Depending on the problem you're solving, that might just work. If you can out-purchase your problem in computing power, then problems in NP will surrender to you. But if your input size ever grows, it's unlikely you'll be able to keep up.

If your problem is intractable in this way, all is not lost. You might be able to find an approximate solution to your problem. Good enough is better than no solution at all, right? Most of the time, probably. However, some tremendous work has also been done studying topics like this. Are there problems which are not even approximable in polynomial time? What approximation techniques work best? Alas, those answers lie elsewhere.

This episode avoids a discussion of a few key points in order to keep the material accessible. If you find this interesting, you should next familiarize yourself with the notions of NP-Complete, NP-Hard, and co-NP. These are topics we won't necessarily get to in future episodes. Michael Sipser's Introduction to the Theory of Computation is a good resource.

The Computational Complexity of Machine Learning

3 november 2017 | 48 min

In this episode, Professor Michael Kearns from the University of Pennsylvania joins host Kyle Polich to talk about the computational complexity of machine learning, complexity in game theory, and algorithmic fairness. Michael's doctoral thesis gave an early broad overview of computational learning theory, in which he emphasizes the mathematical study of efficient learning algorithms by machines or computational systems.

When we look at machine learning algorithms they are almost like meta-algorithms in some sense. For example, a machine learning algorithm will look at some data and build some model, and it’s going to behave presumably very differently under different inputs. But does that mean we need new analytical tools? Or is a machine learning algorithm just the same thing as any deterministic algorithm, only a little trickier to analyze complexity-wise? In other words, is there some overlap between the good old-fashioned analysis of algorithms and the analysis of machine learning algorithms from a complexity viewpoint? And what is the difference between strategies for determining the complexity bounds on samples versus algorithms?

A big area of machine learning (and of the analysis of learning algorithms in general) that Michael and Kyle discuss is the topic known as complexity regularization. Complexity regularization asks: How should one measure the goodness of fit and the complexity of a given model? How should one balance those two, and how can one execute that in a scalable, efficient way algorithmically? From this, Michael and Kyle discuss the broader picture of why one should care whether a problem is efficiently learnable, that is, learnable in polynomial time.
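One common way to write down the balance between fit and complexity is a penalized objective. The form below is a generic sketch, not a formula taken from the episode, and every symbol in it is an assumption introduced for illustration: $\mathcal{H}$ is the set of candidate models, $\hat{L}_S(h)$ is the error of model $h$ on the training sample $S$, $C(h)$ is some measure of the model's complexity, and $\lambda \ge 0$ sets how strongly complexity is penalized.

```latex
\hat{h} \;=\; \arg\min_{h \in \mathcal{H}} \Big[ \hat{L}_S(h) + \lambda \, C(h) \Big]
```

Setting $\lambda = 0$ recovers plain fitting of the training data, while larger values of $\lambda$ favor simpler models at the cost of a worse fit.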

Another interesting topic of discussion is the difference between sample complexity and computational complexity. An active area of research is how one should regularize models so that their complexity is balanced against their goodness of fit on a large training sample.

As mentioned, a good resource for getting started with correlated equilibria is: https://www.cs.cornell.edu/courses/cs684/2004sp/feb20.pdf


[MINI] Turing Machines

27 oktober 2017 | 14 min

The Complexity of Learning Neural Networks

20 oktober 2017 | 39 min

[MINI] Big Oh Analysis

13 oktober 2017 | 19 min

Data science tools and other announcements from Ignite

6 oktober 2017 | 32 min

Generative AI for Content Creation

29 september 2017 | 35 min

[MINI] One Shot Learning

22 september 2017 | 18 min

Recommender Systems Live from FARCON 2017

15 september 2017 | 46 min

[MINI] Long Short Term Memory

8 september 2017 | 15 min

Zillow Zestimate

1 september 2017 | 37 min

Cardiologist Level Arrhythmia Detection with CNNs

25 augusti 2017 | 32 min

[MINI] Recurrent Neural Networks

18 augusti 2017 | 17 min

Project Common Voice

11 augusti 2017 | 31 min

[MINI] Bayesian Belief Networks

4 augusti 2017 | 17 min

pix2code

28 juli 2017 | 27 min

[MINI] Conditional Independence

21 juli 2017 | 15 min

Estimating Sheep Pain with Facial Recognition

14 juli 2017 | 27 min

CosmosDB

7 juli 2017 | 34 min

[MINI] The Vanishing Gradient

30 juni 2017 | 15 min

Doctor AI

23 juni 2017 | 42 min

[MINI] Activation Functions

16 juni 2017 | 14 min

MS Build 2017

9 juni 2017 | 28 min

[MINI] Max-pooling

2 juni 2017 | 13 min

Unsupervised Depth Perception

26 maj 2017 | 24 min

[MINI] Convolutional Neural Networks

19 maj 2017 | 15 min

Multi-Agent Diverse Generative Adversarial Networks

12 maj 2017 | 29 min

[MINI] Generative Adversarial Networks

5 maj 2017 | 10 min

Opinion Polls for Presidential Elections

28 april 2017 | 53 min

OpenHouse

21 april 2017 | 26 min

[MINI] GPU CPU

14 april 2017 | 11 min

[MINI] Backpropagation

7 april 2017 | 15 min

Data Science at Patreon

31 mars 2017 | 32 min

[MINI] Feed Forward Neural Networks

24 mars 2017 | 16 min

Feed Forward Neural Networks

In a feed forward neural network, neurons cannot form a cycle. In this episode, we explore how such a network would be able to represent three common logical operators: OR, AND, and XOR. The XOR operation is the interesting case.

Below are the truth tables that describe each of these functions.

AND Truth Table

Input 1 | Input 2 | Output
0 | 0 | 0
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1

OR Truth Table

Input 1 | Input 2 | Output
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 1

XOR Truth Table

Input 1 | Input 2 | Output
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

The AND and OR functions should seem very intuitive. Exclusive or (XOR) is true if and only if exactly one input is 1. Could a neural network learn these mathematical functions?

Let's consider the perceptron described below. First we see the visual representation, then the activation function, followed by the formula for calculating the output.

Can this perceptron learn the AND function?

Sure. Let and

What about OR?

Yup. Let and

An infinite number of possible solutions exist; I just picked values that hopefully seem intuitive. This is also a good example of why the bias term is important. Without it, the AND function could not be represented.
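
To make the AND and OR cases concrete, here is a minimal sketch of a single perceptron. It assumes a step activation that outputs 1 when the weighted sum plus bias is positive and 0 otherwise. The weights and biases are illustrative choices picked for this sketch, not necessarily the values used in the episode notes.

```python
from itertools import product


def step(z):
    """Step activation (assumed): 1 if z is positive, otherwise 0."""
    return 1 if z > 0 else 0


def perceptron(x1, x2, w1, w2, b):
    """Single perceptron: step(w1*x1 + w2*x2 + b)."""
    return step(w1 * x1 + w2 * x2 + b)


# Illustrative parameters (assumptions, one of infinitely many solutions).
AND_PARAMS = {"w1": 1.0, "w2": 1.0, "b": -1.5}  # needs both inputs to clear the bias
OR_PARAMS = {"w1": 1.0, "w2": 1.0, "b": -0.5}   # one input is enough to clear the bias

inputs = list(product((0, 1), repeat=2))
and_table = {(a, c): perceptron(a, c, **AND_PARAMS) for a, c in inputs}
or_table = {(a, c): perceptron(a, c, **OR_PARAMS) for a, c in inputs}

for a, c in inputs:
    print(a, c, "AND:", and_table[(a, c)], "OR:", or_table[(a, c)])
```

With the bias set to 0 and this step activation, the input (0, 0) and any single active input would have to stay at or below zero while both together exceed it, which no pair of weights can satisfy for AND.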

How about XOR?

No. It is not possible to represent XOR with a single layer. It requires two layers. The image below shows how it could be done with two layers.

In the above example, the weights computed for the middle hidden node capture the essence of why this works. This node activates when receiving two positive inputs, thus contributing a heavy penalty to be summed by the output node. If a single input is 1, this node will not activate.
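
The idea of a middle hidden node that penalizes the output when both inputs are active can be sketched as follows. This is a minimal sketch assuming a step activation and three hidden nodes. The specific weights and biases are illustrative assumptions and may differ from the ones in the original figure.

```python
def step(z):
    """Step activation (assumed): 1 if z is positive, otherwise 0."""
    return 1 if z > 0 else 0


def xor_network(x1, x2):
    """Two-layer feed forward network for XOR with hypothetical weights."""
    # Hidden layer.
    h_left = step(x1 - 0.5)          # activates when input 1 is 1
    h_middle = step(x1 + x2 - 1.5)   # activates only when both inputs are 1
    h_right = step(x2 - 0.5)         # activates when input 2 is 1
    # Output layer: the outer nodes each add 1, the middle node subtracts 2.
    return step(h_left - 2.0 * h_middle + h_right - 0.5)


xor_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}

for (a, b), out in xor_table.items():
    print(a, b, "XOR:", out)
```

When both inputs are 1, the outer nodes contribute 2 in total, the middle node cancels that with its penalty of 2, and the output stays below its threshold.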

The universal approximation theorem tells us that any continuous function on a compact domain can be approximated arbitrarily closely using a neural network with only a single hidden layer and a finite number of neurons. With this in mind, a feed forward neural network should be adequate for any application. However, in practice, other network architectures and the allowance of more hidden layers are empirically motivated.

Other types of neural networks have less strict structural definitions. The various ways one might relax this constraint generate other classes of neural networks that often have interesting properties. We'll get into some of these in future mini-episodes.

Check out our recent blog post on how we're using Periscope Data cohort charts.

Thanks to Periscope Data for sponsoring this episode. More about them at periscopedata.com/skeptics

Reinventing Sponsored Search Auctions

17 mars 2017 | 42 min

[MINI] The Perceptron

10 mars 2017 | 15 min

The Data Refuge Project

3 mars 2017 | 25 min

[MINI] Automated Feature Engineering

24 februari 2017 | 16 min

Big Data Tools and Trends

17 februari 2017 | 31 min

[MINI] Primer on Deep Learning

10 februari 2017 | 14 min

Data Provenance and Reproducibility with Pachyderm

3 februari 2017 | 40 min

[MINI] Logistic Regression on Audio Data

27 januari 2017 | 21 min

Studying Competition and Gender Through Chess

20 januari 2017 | 34 min

[MINI] Dropout

13 januari 2017 | 16 min

The Police Data and the Data Driven Justice Initiatives

6 januari 2017 | 49 min

The Library Problem

30 december 2016 | 35 min

2016 Holiday Special

23 december 2016 | 40 min

[MINI] Entropy

16 december 2016 | 17 min

MS Connect Conference

9 december 2016 | 42 min

Causal Impact

2 december 2016 | 34 min

[MINI] The Bootstrap

25 november 2016 | 11 min

[MINI] Gini Coefficients

18 november 2016 | 16 min

Unstructured Data for Finance

11 november 2016 | 34 min

[MINI] AdaBoost

4 november 2016 | 11 min

Stealing Models from the Cloud

28 oktober 2016 | 37 min

[MINI] Calculating Feature Importance

21 oktober 2016 | 13 min

NYC Bike Share Rebalancing

14 oktober 2016 | 30 min

[MINI] Random Forest

7 oktober 2016 | 13 min

Election Predictions

30 september 2016 | 22 min

[MINI] F1 Score

23 september 2016 | 9 min

Urban Congestion

16 september 2016 | 35 min

[MINI] Heteroskedasticity

9 september 2016 | 9 min

Music21

2 september 2016 | 35 min

[MINI] Paxos

26 augusti 2016 | 15 min

Trusting Machine Learning Models with LIME

19 augusti 2016 | 35 min

[MINI] ANOVA

12 augusti 2016 | 13 min

Machine Learning on Images with Noisy Human-centric Labels

5 augusti 2016 | 23 min

[MINI] Survival Analysis

29 juli 2016 | 14 min

Predictive Models on Random Data

22 juli 2016 | 37 min

[MINI] Receiver Operating Characteristic (ROC) Curve

15 juli 2016 | 11 min

Multiple Comparisons and Conversion Optimization

8 juli 2016 | 30 min

[MINI] Leakage

1 juli 2016 | 12 min

Predictive Policing

24 juni 2016 | 36 min

[MINI] The CAP Theorem

17 juni 2016 | 11 min

Detecting Terrorists with Facial Recognition?

10 juni 2016 | 33 min

[MINI] Goodhart's Law

3 juni 2016 | 11 min

Data Science at eHarmony

27 maj 2016 | 43 min

[MINI] Stationarity and Differencing

20 maj 2016 | 14 min

Feather

13 maj 2016 | 23 min

[MINI] Bargaining

6 maj 2016 | 15 min

deepjazz

29 april 2016 | 30 min

[MINI] Auto-correlative functions and correlograms

22 april 2016 | 15 min

Early Identification of Violent Criminal Gang Members

15 april 2016 | 27 min

[MINI] Fractional Factorial Design

8 april 2016 | 11 min

Machine Learning Done Wrong

1 april 2016 | 25 min

Potholes

25 mars 2016 | 41 min

[MINI] The Elbow Method

18 mars 2016 | 15 min

Too Good to be True

11 mars 2016 | 35 min

[MINI] R-squared

4 mars 2016 | 13 min

Models of Mental Simulation

26 februari 2016 | 40 min

[MINI] Multiple Regression

19 februari 2016 | 18 min

Scientific Studies of People's Relationship to Music

12 februari 2016 | 42 min

[MINI] k-d trees

5 februari 2016 | 14 min

Auditing Algorithms

29 januari 2016 | 43 min

[MINI] The Bonferroni Correction

22 januari 2016 | 14 min

Detecting Pseudo-profound BS

15 januari 2016 | 38 min

[MINI] Gradient Descent

8 januari 2016 | 15 min

Let's Kill the Word Cloud

1 januari 2016 | 15 min

2015 Holiday Special

25 december 2015 | 14 min

Wikipedia Revision Scoring as a Service

18 december 2015 | 43 min

[MINI] Term Frequency - Inverse Document Frequency

11 december 2015 | 10 min

The Hunt for Vulcan

4 december 2015 | 42 min

[MINI] The Accuracy Paradox

27 november 2015 | 17 min

Neuroscience from a Data Scientist's Perspective

20 november 2015 | 40 min

[MINI] Bias Variance Tradeoff

13 november 2015 | 14 min

Big Data Doesn't Exist

6 november 2015 | 32 min

[MINI] Covariance and Correlation

30 oktober 2015 | 14 min

Bayesian A/B Testing

23 oktober 2015 | 30 min

[MINI] The Central Limit Theorem

16 oktober 2015 | 13 min

Accessible Technology

9 oktober 2015 | 39 min

[MINI] Multi-armed Bandit Problems

2 oktober 2015 | 13 min

Shakespeare, Abiogenesis, and Exoplanets

25 september 2015 | 58 min

[MINI] Sample Sizes

18 september 2015 | 13 min

The Model Complexity Myth

11 september 2015 | 30 min

[MINI] Distance Measures

4 september 2015 | 13 min

ContentMine

28 augusti 2015 | 53 min

[MINI] Structured and Unstructured Data

21 augusti 2015 | 13 min

Measuring the Influence of Fashion Designers

14 augusti 2015 | 25 min

[MINI] PageRank

7 augusti 2015 | 8 min

Data Science at Work in LA County

29 juli 2015 | 41 min

[MINI] k-Nearest Neighbors

24 juli 2015 | 9 min

Crypto

17 juli 2015 | 85 min

[MINI] MapReduce

10 juli 2015 | 13 min

Genetically Engineered Food and Trends in Herbicide Usage

3 juli 2015 | 35 min

[MINI] The Curse of Dimensionality

26 juni 2015 | 11 min

Video Game Analytics

19 juni 2015 | 31 min

[MINI] Anscombe's Quartet

12 juni 2015 | 9 min

Proposing Annoyance Mining

9 juni 2015 | 31 min

Preserving History at Cyark

5 juni 2015 | 23 min

[MINI] A Critical Examination of a Study of Marriage by Political Affiliation

29 maj 2015 | 10 min

Detecting Cheating in Chess

22 maj 2015 | 45 min

[MINI] z-scores

15 maj 2015 | 10 min

Using Data to Help Those in Crisis

8 maj 2015 | 35 min

The Ghost in the MP3

1 maj 2015 | 35 min

Data Fest 2015

28 april 2015 | 27 min

[MINI] Cornbread and Overdispersion

24 april 2015 | 16 min

[MINI] Natural Language Processing

17 april 2015 | 13 min

Computer-based Personality Judgments

10 april 2015 | 32 min

[MINI] Markov Chain Monte Carlo

3 april 2015 | 16 min

[MINI] Markov Chains

20 mars 2015 | 11 min

Oceanography and Data Science

13 mars 2015 | 33 min

[MINI] Ordinary Least Squares Regression

6 mars 2015 | 18 min

NYC Speed Camera Analysis with Tim Schmeier

27 februari 2015 | 17 min

[MINI] k-means clustering

20 februari 2015 | 14 min

Shadow Profiles on Social Networks

13 februari 2015 | 39 min

[MINI] The Chi-Squared Test

6 februari 2015 | 18 min

Mapping Reddit Topics with Randy Olson

30 januari 2015 | 30 min

[MINI] Partially Observable State Spaces

23 januari 2015 | 13 min

Easily Fooling Deep Neural Networks

16 januari 2015 | 28 min

[MINI] Data Provenance

9 januari 2015 | 11 min

Doubtful News, Geology, Investigating Paranormal Groups, and Thinking Scientifically with Sharon Hill

3 januari 2015 | 31 min

[MINI] Belief in Santa

26 december 2014 | 10 min

Economic Modeling and Prediction, Charitable Giving, and a Follow Up with Peter Backus

19 december 2014 | 24 min

[MINI] The Battle of the Sexes

12 december 2014 | 18 min

The Science of Online Data at Plenty of Fish with Thomas Levi

5 december 2014 | 59 min

[MINI] The Girlfriend Equation

28 november 2014 | 16 min

The Secret and the Global Consciousness Project with Alex Boklin

21 november 2014 | 42 min

[MINI] Monkeys on Typewriters

14 november 2014 | 3 min

Mining the Social Web with Matthew Russell

7 november 2014 | 50 min

[MINI] Is the Internet Secure?

31 oktober 2014 | 26 min

Practicing and Communicating Data Science with Jeff Stanton

24 oktober 2014 | 37 min

[MINI] The T-Test

17 oktober 2014 | 17 min

Data Myths with Karl Mamer

10 oktober 2014 | 48 min

Contest Announcement

8 oktober 2014 | 12 min

[MINI] Selection Bias

3 oktober 2014 | 15 min

[MINI] Confidence Intervals

26 september 2014 | 12 min

[MINI] Value of Information

19 september 2014 | 14 min

Game Science Dice with Louis Zocchi

17 september 2014 | 47 min

Data Science at ZestFinance with Marick Sinay

12 september 2014 | 31 min

[MINI] Decision Tree Learning

5 september 2014 | 13 min

Jackson Pollock Authentication Analysis with Kate Jones-Smith

29 augusti 2014 | 50 min

[MINI] Noise!!

22 augusti 2014 | 16 min

Guerilla Skepticism on Wikipedia with Susan Gerbic

15 augusti 2014 | 70 min

[MINI] Ant Colony Optimization

8 augusti 2014 | 15 min

Data in Healthcare IT with Shahid Shah

1 augusti 2014 | 57 min

[MINI] Cross Validation

25 juli 2014 | min

Streetlight Outage and Crime Rate Analysis with Zach Seeskin

18 juli 2014 | 33 min

[MINI] Experimental Design

11 juli 2014 | 16 min

The Right (big data) Tool for the Job with Jay Shankar

7 juli 2014 | 50 min

[MINI] Bayesian Updating

27 juni 2014 | 11 min

Personalized Medicine with Niki Athanasiadou

20 juni 2014 | 57 min

[MINI] p-values

13 juni 2014 | 17 min

Advertising Attribution with Nathan Janos

6 juni 2014 | 76 min

[MINI] type i / type ii errors

30 maj 2014 | 11 min

Introduction

23 maj 2014 | 4 min

Data Skeptic

The Data Skeptic Podcast features interviews and discussion of topics related to data science, statistics, machine learning, artificial intelligence and the like, all from the perspective of applying ...

About the podcast

Episodes