Sweden's 100 most popular podcasts

Linear Digressions

Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago.

Subscribe

iTunes / Overcast / RSS

Website

lineardigressions.com

Episodes

So long, and thanks for all the fish

All good things must come to an end, including this podcast. This is the last episode we plan to release, and it doesn't cover data science; it's mostly reminiscing, thanking our wonderful audience (that's you!), and marveling at how this thing that started out as a side project grew into a huge part of our lives for over 5 years. It's been a ride, and a real pleasure and privilege to talk to you each week. Thanks, best wishes, and good night! - Katie and Ben
2020-07-27

A Reality Check on AI-Driven Medical Assistants

The data science and artificial intelligence community has made amazing strides in the past few years to algorithmically automate portions of the healthcare process. This episode looks at two computer vision algorithms, one that diagnoses diabetic retinopathy and another that classifies liver cancer, and asks the question: are patients now getting better care, and achieving better outcomes, with these algorithms in the mix? The answer isn't no, exactly, but it's not a resounding yes, because these algorithms interact with a very complex system (the healthcare system) and other shortcomings of that system are proving hard to automate away. Getting a faster diagnosis from an image might not be an improvement if the image is now harder to capture (because of strict data quality requirements associated with the algorithm that wouldn't stop a human doing the same job). Likewise, an algorithm getting a prediction mostly correct might not be an overall benefit if it introduces more dramatic failures when the prediction happens to be wrong. For every data scientist whose work is deployed into some kind of product, and is being used to solve real-world problems, these papers underscore how important and difficult it is to consider all the context around those problems.
2020-07-20

A Data Science Take on Open Policing Data

A few weeks ago, we put out a call asking data scientists interested in issues of race and racism, or people studying how those topics can be approached with data science methods, to get in touch and come talk to our audience about their work. This week we're excited to bring on Todd Hendricks, Bay Area data scientist and a volunteer who reached out to tell us about his studies with the Stanford Open Policing dataset.
2020-07-13

Procella: YouTube's super-system for analytics data storage

This is a re-release of an episode that originally ran in October 2019. If you're trying to manage a project that serves up analytics data for a few very distinct uses, you'd be wise to consider having custom solutions for each use case, optimized for the needs and constraints of that use case. You also wouldn't be YouTube, which found itself with this problem (gigantic data needs and several very different use cases for that data) and went a different way: it built one analytics data system to serve them all. Procella, the system YouTube built, is the topic of our episode today: by deconstructing the system, we dig into the four motivating uses of this system, the complexity they had to introduce to service all four uses simultaneously, and the impressive engineering that has to go into building something that "just works."
2020-07-06

The Data Science Open Source Ecosystem

Open source software is ubiquitous throughout data science, and enables the work of nearly every data scientist in some way or another. Open source projects, however, are disproportionately maintained by a small number of individuals, some of whom are institutionally supported, but many of whom do this maintenance on a purely volunteer basis. The health of the data science ecosystem depends on the support of open source projects, at both an individual and institutional level. Relevant links: https://hdsr.mitpress.mit.edu/pub/xsrt4zs2/release/2
2020-06-29

Rock the ROC Curve

This is a re-release of an episode that first ran on January 29, 2017. This week: everybody's favorite WWII-era classifier metric! But it's not just for winning wars; it's a fantastic go-to metric for all your classifier quality needs.
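The computation behind the ROC curve is simple enough to sketch: sweep a threshold down through the classifier's scores and track the true positive rate against the false positive rate, then take the area under that curve. A minimal pure-Python sketch with toy scores (invented for illustration, not from the episode):

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs, sweeping thresholds high to low."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
```

A perfect classifier hugs the top-left corner (AUC 1.0); random guessing tracks the diagonal (AUC 0.5).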
2020-06-22

Criminology and Data Science

This episode features Zach Drake, a working data scientist and PhD candidate in the Criminology, Law and Society program at George Mason University. Zach specializes in bringing data science methods to studies of criminal behavior, and got in touch after our last episode (about racially complicated recidivism algorithms). Our conversation covers a wide range of topics: common misconceptions around race and crime statistics, how methodologically-driven criminology scholars think about building crime prediction models, and how to think about policy changes when we don't have a complete understanding of cause and effect in criminology. For the many of us currently re-thinking race and criminal justice, but wanting to be data-driven about it, this conversation with Zach is a must-listen.
2020-06-15

Racism, the criminal justice system, and data science

As protests sweep across the United States in the wake of the killing of George Floyd by a Minneapolis police officer, we take a moment to dig into one of the ways that data science perpetuates and amplifies racism in the American criminal justice system. COMPAS is an algorithm that claims to give a prediction about the likelihood of an offender to re-offend if released, based on the attributes of the individual, and guess what: it shows disparities in the predictions for black and white offenders that would nudge judges toward giving harsher sentences to black individuals. We dig into this algorithm a little more deeply, unpacking how different metrics give different pictures of the "fairness" of the predictions and what is causing its racially disparate output (to wit: race is explicitly not an input to the algorithm, and yet the algorithm gives outputs that correlate with race; what gives?). Unfortunately it's not an open-and-shut case of a tuning parameter being off, or the wrong metric being used: instead the biases in the justice system itself are being captured in the algorithm outputs, in such a way that a self-fulfilling prophecy of harsher treatment for black defendants is all but guaranteed. Like many other things this week, this episode left us thinking about bigger, systemic issues, and why it's proven so hard for years to fix what's broken.
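To make "different metrics give different pictures" concrete, here's a toy sketch (all numbers invented for illustration, not from COMPAS) comparing two group-level metrics on the same hypothetical predictions: the false positive rate can differ sharply between groups even while precision looks similar.

```python
def confusion_rates(predicted, actual):
    """False positive rate and precision from parallel 0/1 lists."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    fpr = fp / (fp + tn)          # of the people who did NOT re-offend, how many were flagged?
    precision = tp / (tp + fp)    # of the people flagged, how many re-offended?
    return fpr, precision

# Hypothetical "will re-offend" predictions vs. actual outcomes, by group.
group_a_pred = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
group_a_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
group_b_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
group_b_true = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]

fpr_a, prec_a = confusion_rates(group_a_pred, group_a_true)
fpr_b, prec_b = confusion_rates(group_b_pred, group_b_true)
```

Here group A's false positive rate is more than double group B's even though the precisions are comparable, which is the kind of tension at the heart of the COMPAS debate.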
2020-06-08

An interstitial word from Ben

A message from Ben around algorithmic bias, and how our models are sometimes reflections of ourselves.
2020-06-05

Convolutional Neural Networks

This is a re-release of an episode that originally aired on April 1, 2018. If you've done image recognition or computer vision tasks with a neural network, you've probably used a convolutional neural net. This episode is all about the architecture and implementation details of convolutional networks, and the tricks that make them so good at image tasks.
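The core operation can be sketched in a few lines of pure Python: slide a small kernel over the image and take a weighted sum at each position. This is a minimal, unoptimized sketch with toy data; real convolutional layers add padding, strides, channels, and learned kernels.

```python
def conv2d(image, kernel):
    """Valid-mode 2D convolution (strictly, cross-correlation, as in most DL libraries)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# A vertical-edge detector applied to an image with a hard left/right edge.
image = [[0, 0, 1, 1]] * 4
kernel = [[-1, 1], [-1, 1]]  # responds where intensity jumps left-to-right

edges = conv2d(image, kernel)
```

The output lights up exactly at the column where the intensity changes, which is the intuition behind why stacks of learned kernels are so good at picking out visual features.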
2020-05-31

Stein's Paradox

This is a re-release of an episode that was originally released on February 26, 2017. When you're estimating something about some object that's a member of a larger group of similar objects (say, the batting average of a baseball player, who belongs to a baseball team), how should you estimate it: use measurements of the individual, or get some extra information from the group? The James-Stein estimator tells you how to combine individual and group information to make predictions that, taken over the whole group, are more accurate than if you treated each individual, well, individually.
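One standard form of this estimator (the Efron-Morris variant that shrinks toward the grand mean, with a "positive-part" tweak) is just a few lines; the numbers below are toy batting averages with a made-up sampling variance, not data from the episode.

```python
def james_stein(xs, sigma2):
    """Shrink each observation toward the grand mean (positive-part Efron-Morris form).

    Assumes each x_i ~ Normal(theta_i, sigma2), independent, with sigma2 known.
    """
    k = len(xs)
    m = sum(xs) / k
    s = sum((x - m) ** 2 for x in xs)
    shrink = max(0.0, 1 - (k - 3) * sigma2 / s)  # data-driven shrinkage factor
    return [m + shrink * (x - m) for x in xs]

# Toy early-season batting averages; extremes get pulled toward the group mean.
observed = [0.400, 0.350, 0.300, 0.250, 0.200, 0.150]
estimates = james_stein(observed, sigma2=0.005)
```

The striking part is that this shrinkage dominates the naive "just use each player's own average" estimator in total squared error whenever you're estimating three or more means at once.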
2020-05-25

Protecting Individual-Level Census Data with Differential Privacy

The power of finely-grained, individual-level data comes with a drawback: it compromises the privacy of potentially anyone and everyone in the dataset. Even for de-identified datasets, there can be ways to re-identify the records or otherwise figure out sensitive personal information. That problem has motivated the study of differential privacy, a set of techniques and definitions for keeping personal information private when datasets are released or used for study. Differential privacy is getting a big boost this year, as it's being implemented across the 2020 US Census as a way of protecting the privacy of census respondents while still opening up the dataset for research and policy use. When two important topics come together like this, we can't help but sit up and pay attention.
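The standard building block here is the Laplace mechanism: add noise calibrated to a query's sensitivity and a privacy budget epsilon. A minimal sketch for a counting query follows (the data, function names, and parameters are our own illustration; the Census Bureau's actual system is far more elaborate than this).

```python
import math
import random

def private_count(records, predicate, epsilon, rng=random):
    """Release a count plus Laplace(0, 1/epsilon) noise; a count's sensitivity is 1."""
    true_count = sum(1 for r in records if predicate(r))
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    noise = -(1 / epsilon) * math.copysign(1, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

# Hypothetical microdata: release "how many people are 40 or older" privately.
ages = [23, 34, 45, 56, 29, 61, 38]
released = private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=random.Random(0))
```

Any single release is noisy, but the noise is unbiased, so aggregate statistics remain useful while any individual's presence in the data is masked; smaller epsilon means more privacy and more noise.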
2020-05-18

Causal Trees

What do you get when you combine the causal inference needs of econometrics with the data-driven methodology of machine learning? Usually these two don't go well together (deriving causal conclusions from naive data methods leads to biased answers) but economists Susan Athey and Guido Imbens are on the case. This episode explores their algorithm for recursively partitioning a dataset to find heterogeneous treatment effects, or for you ML nerds, applying decision trees to causal inference problems. It's not a free lunch, but for those (like us!) who love crossover topics, causal trees are a smart approach from one field hopping the fence to another. Relevant links: https://www.pnas.org/content/113/27/7353
2020-05-11

The Grammar Of Graphics

You may not realize it consciously, but beautiful visualizations have rules. The rules are often implicit and manifest themselves as expectations about how the data is summarized, presented, and annotated so you can quickly extract the information in the underlying data using just visual cues. It's a bit abstract but very profound, and these principles underlie the ggplot2 package in R that makes famously beautiful plots with minimal code. This episode covers a paper by Hadley Wickham (author of ggplot2, among other R packages) that unpacks the layered approach to graphics taken in ggplot2, and makes clear the assumptions and structure of many familiar data visualizations.
2020-05-04

Gaussian Processes

It's pretty common to fit a function to a dataset when you're a data scientist. But in many cases, it's not clear what kind of function might be most appropriate: linear? quadratic? sinusoidal? some combination of these, and perhaps others? Gaussian processes introduce a nonparametric option where you can fit over all the possible types of functions, using the data points in your datasets as constraints on the results that you get (the idea being that, no matter what the "true" underlying function is, it produced the data points you're trying to fit). What this means is a very flexible, but depending on your parameters not-too-flexible, way to fit complex datasets. The math underlying GPs gets complex, and the links below contain some excellent visualizations that help make the underlying concepts clearer. Check them out! Relevant links: http://katbailey.github.io/post/gaussian-processes-for-dummies/ https://thegradient.pub/gaussian-process-not-quite-for-dummies/ https://distill.pub/2019/visual-exploration-gaussian-processes/
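To make the "data points as constraints" idea concrete, here's a minimal pure-Python sketch of a GP posterior mean with a squared-exponential (RBF) kernel. The data, kernel parameters, and helper names are our own toy choices, not from the episode, and real GP libraries use far more numerically careful linear algebra.

```python
import math

def rbf(x1, x2, length=1.0):
    """Squared-exponential kernel: nearby inputs get strongly correlated outputs."""
    return math.exp(-((x1 - x2) ** 2) / (2 * length ** 2))

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def gp_posterior_mean(train_x, train_y, test_x, noise=1e-6):
    """Posterior mean of a zero-mean GP: k_star @ inverse(K + noise*I) @ y."""
    k = [[rbf(xi, xj) + (noise if i == j else 0.0)
          for j, xj in enumerate(train_x)]
         for i, xi in enumerate(train_x)]
    alpha = solve(k, train_y)
    return [sum(rbf(t, xi) * a for xi, a in zip(train_x, alpha)) for t in test_x]

# Toy data: four noiseless samples of sin(x); predict in between.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.sin(x) for x in xs]
mean = gp_posterior_mean(xs, ys, [1.5])
```

With near-zero noise the posterior mean passes through the training points exactly, and between them it produces a smooth interpolation whose wiggliness is controlled by the kernel's length scale.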
2020-04-27

Keeping ourselves honest when we work with observational healthcare data

The abundance of data in healthcare, and the value we could capture from structuring and analyzing that data, is a huge opportunity. It also presents huge challenges. One of the biggest challenges is how, exactly, to do that structuring and analysis: data scientists working with this data have hundreds or thousands of small, and sometimes large, decisions to make in their day-to-day analysis work. What data should they include in their studies? What method should they use to analyze it? What hyperparameter settings should they explore, and how should they pick a value for their hyperparameters? The thing that's really difficult here is that, depending on which path they choose among many reasonable options, a data scientist can get really different answers to the underlying question, which makes you wonder how to conclude anything with certainty at all. The paper for this week's episode performs a systematic study of many, many different permutations of the questions above on a set of benchmark datasets where the "right" answers are known. Which strategies are most likely to yield the "right" answers? That's the whole topic of discussion. Relevant links: https://hdsr.mitpress.mit.edu/pub/fxz7kr65
2020-04-20

Changing our formulation of AI to avoid runaway risks: Interview with Prof. Stuart Russell

AI is evolving incredibly quickly, and thinking now about where it might go next (and how we as a species and a society should be prepared) is critical. Professor Stuart Russell, an AI expert at UC Berkeley, has a formulation for modifications to AI that we should study and try implementing now to keep it much safer in the long run. Prof. Russell's new book, "Human Compatible: Artificial Intelligence and the Problem of Control," gives an accessible but deeply thoughtful exploration of why he thinks runaway AI is something we need to be considering seriously now, and what changes in formulation might be a solution. This episode features Prof. Russell as a special guest, exploring the topics in his book and giving more perspective on the long-term possible futures of AI, both good and bad. Relevant links: https://www.penguinrandomhouse.com/books/566677/human-compatible-by-stuart-russell/
2020-04-13

Putting machine learning into a database

Most data scientists bounce back and forth regularly between doing analysis in databases using SQL and building and deploying machine learning pipelines in R or python. But if we think ahead a few years, a few visionary researchers are starting to see a world in which the ML pipelines can actually be deployed inside the database. Why? One strong advantage of databases is that they have built-in features for data governance, including things like permissioning access and tracking the provenance of data. Adding machine learning as another thing you can do in a database means that, potentially, these enterprise-grade features will be available for ML models too, which will make them much more widely accepted across enterprises with tight IT policies. The papers this week articulate the gap between enterprise needs and current ML infrastructure, how ML in a database could be a way to knit the two closer together, and a proof-of-concept that ML in a database can actually work. Relevant links: https://blog.acolyer.org/2020/02/19/ten-year-egml-predictions/ https://blog.acolyer.org/2020/02/21/extending-relational-query-processing/
2020-04-06

The work-from-home episode

Many of us have the privilege of working from home right now, in an effort to keep ourselves and our family safe and slow the transmission of covid-19. But working from home is an adjustment for many of us, and can hold some challenges compared to coming in to the office every day. This episode explores this a little bit, informally, as we compare our new work-from-home setups and reflect on what's working well and what we're finding challenging.
2020-03-30

Understanding Covid-19 transmission: what the data suggests about how the disease spreads

Covid-19 is turning the world upside down right now. One thing that's extremely important to understand, in order to fight it as effectively as possible, is how the virus spreads, and especially how much of the spread of the disease comes from carriers who are experiencing no or mild symptoms but are contagious anyway. This episode digs into the epidemiological model that was published in Science this week: this model finds that the data suggests that the majority of carriers of the coronavirus, 80-90%, do not have a detected disease. This has big implications for the importance of social distancing as a way to get the pandemic under control, and explains why a more comprehensive testing program is critical for the United States. Also, in lighter news, Katie (a native of Dayton, Ohio) lays a data-driven claim for just declaring the University of Dayton Flyers the 2020 NCAA College Basketball champions. Relevant links: https://science.sciencemag.org/content/early/2020/03/13/science.abb3221
2020-03-23

Network effects re-release: when the power of a public health measure lies in widespread adoption

This week's episode is a re-release of a recent episode, which we don't usually do but it seems important for understanding what we can all do to slow the spread of covid-19. In brief, public health measures for infectious diseases get most of their effectiveness from their widespread adoption: most of the protection you get from a vaccine, for example, comes from all the other people who also got the vaccine. That's why measures like social distancing are so important right now: even if you're not in a high-risk group for covid-19, you should still stay home and avoid in-person socializing because your good behavior lowers the risk for those who are in high-risk groups. If we all take these kinds of measures, the risk lowers dramatically. So stay home, work remotely if you can, avoid physical contact with others, and do your part to manage this crisis. We're all in this together.
2020-03-15

Causal inference when you can't experiment: difference-in-differences and synthetic controls

When you need to untangle cause and effect, but you can't run an experiment, it's time to get creative. This episode covers difference-in-differences and synthetic controls, two observational causal inference techniques that researchers have used to understand causality in complex real-world situations.
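At its simplest, difference-in-differences is arithmetic on four group means: the treated group's before/after change, minus the control group's before/after change. A toy sketch with made-up numbers (the control group's change stands in for what would have happened without treatment, which is the parallel-trends assumption):

```python
def mean(xs):
    return sum(xs) / len(xs)

def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """DiD estimate: the treated group's change minus the control group's change."""
    return ((mean(treated_after) - mean(treated_before))
            - (mean(control_after) - mean(control_before)))

# Hypothetical weekly sales in two regions, before/after a policy change in region A.
effect = diff_in_diff(
    treated_before=[100, 102, 98, 100],
    treated_after=[115, 117, 113, 115],
    control_before=[90, 92, 88, 90],
    control_after=[95, 97, 93, 95],
)
```

Here the treated region rose by 15 but the control region rose by 5 on its own, so the estimated causal effect is 10, not 15.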
2020-03-09

Better know a distribution: the Poisson distribution

This is a re-release of an episode that originally ran on October 21, 2018. The Poisson distribution is a probability distribution function used for events that happen in time or space. It's super handy because it's pretty simple to use and is applicable for tons of things; there are a lot of interesting processes that boil down to "events that happen in time or space." This episode is a quick introduction to the distribution, and then a focus on two of our favorite everyday applications: using the Poisson distribution to identify supernovas and study army deaths from horse kicks.
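The PMF itself is a one-liner: P(k; lambda) = lambda^k e^(-lambda) / k!. As a sketch, the classic horse-kick data averaged roughly 0.61 deaths per corps per year (treat that rate as illustrative here):

```python
import math

def poisson_pmf(k, lam):
    """P(exactly k events) when events arrive independently at average rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Prussian horse-kick deaths: roughly 0.61 per corps per year on average.
lam = 0.61
probs = {k: poisson_pmf(k, lam) for k in range(5)}
```

With a rate below 1, zero events is the most likely outcome, and the probabilities fall off quickly; matching those predicted frequencies against observed counts is exactly how the horse-kick data became the distribution's most famous showcase.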
2020-03-02

The Lottery Ticket Hypothesis

Recent research into neural networks reveals that sometimes, not all parts of the neural net are equally responsible for the performance of the network overall. Instead, it seems like (in some neural nets, at least) there are smaller subnetworks present where most of the predictive power resides. The fascinating thing is that, for some of these subnetworks (so-called "winning lottery tickets"), it's not the training process that makes them good at their classification or regression tasks: they just happened to be initialized in a way that was very effective. This changes the way we think about what training might be doing, in a pretty fundamental way. Sometimes, instead of crafting a good fit from whole cloth, training might be finding the parts of the network that always had predictive power to begin with, and isolating and strengthening them. This research is pretty recent, having only come to prominence in the last year, but nonetheless challenges our notions about what it means to train a machine learning model.
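The ticket-finding procedure involves training, pruning the smallest-magnitude weights, rewinding the survivors to their initial values, and repeating. The pruning step is simple enough to sketch on a flat list of toy weights (a hypothetical helper for illustration, not the paper's code):

```python
def magnitude_prune(weights, keep_fraction):
    """Zero out all but the largest-magnitude weights (flat list, for simplicity)."""
    ranked = sorted((abs(w) for w in weights), reverse=True)
    k = max(1, int(len(weights) * keep_fraction))
    threshold = ranked[k - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

# Toy weights: pruning at 50% keeps the three largest magnitudes (ties may keep more).
weights = [0.5, -0.01, 0.3, 0.02, -0.8, 0.001]
pruned = magnitude_prune(weights, keep_fraction=0.5)
```

The surprising empirical finding is that the surviving mask, reset to its original initialization, can often train to full accuracy on its own.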
2020-02-24

Interesting technical issues prompted by GDPR and data privacy concerns

Data privacy is a huge issue right now, after years of consumers and users gaining awareness of just how much of their personal data is out there and how companies are using it. Policies like GDPR are imposing more stringent rules on who can use what data for what purposes, with an end goal of giving consumers more control and privacy around their data. This episode digs into this topic, but not from a security or legal perspective; this week, we talk about some of the interesting technical challenges introduced by a simple idea: a company should remove a user's data from their database when that user asks to be removed. We talk about two topics, namely using Bloom filters to efficiently find records in a database (and what Bloom filters are, for that matter) and types of machine learning algorithms that can un-learn their training data when it contains records that need to be deleted.
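For reference, a Bloom filter answers "have I seen this item?" using a bit array and several hash functions: it never gives false negatives, and its false-positive rate is tunable via the array size and hash count. A minimal sketch in pure Python (illustrative sizes and names, not the episode's implementation):

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        """Derive num_hashes bit positions from salted SHA-256 digests."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        """False means definitely absent; True means probably present."""
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for user in ["alice", "bob", "carol"]:
    bf.add(user)
```

The "definitely absent" guarantee is what makes Bloom filters useful for deletion requests: a negative lookup lets you skip an expensive database scan entirely.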
2020-02-17

Thinking of data science initiatives as innovation initiatives

Put yourself in the shoes of an executive at a big legacy company for a moment, operating in virtually any market vertical: you're constantly hearing that data science is revolutionizing the world and the firms that survive and thrive in the coming years are those that execute on a data strategy. What does this mean for your company? How can you best guide your established firm through a successful transition to becoming data-driven? How do you balance the momentum your firm has right now, and the need to support all your current products, customers and operations, against a new and relatively unknown future? If you're working as a data scientist at a mature and well-established company, these are the worries on the mind of your boss's boss's boss. The worries on your mind may be similar: you're trying to understand where your work fits into the bigger picture, you need to break down silos, you're often running into cultural headwinds created by colleagues who don't understand or trust your work. Congratulations, you're in the midst of a classic set of challenges encountered by innovation initiatives everywhere. Harvard Business School professor Clayton Christensen wrote a classic business book (The Innovator's Dilemma) explaining the paradox of trying to innovate in established companies, and why the structure and incentives of those companies almost guarantee an uphill climb to innovate. This week's episode breaks down the innovator's dilemma argument, and what it means for data scientists working in mature companies trying to become more data-centric.
2020-02-10

Building a curriculum for educating data scientists: Interview with Prof. Xiao-Li Meng

As demand for data scientists grows, and it remains as relevant as ever that practicing data scientists have a solid methodological and technical foundation for their work, higher education institutions are coming to terms with what's required to educate the next cohorts of data scientists. The heterogeneity and speed of the field makes it challenging for even the most talented and dedicated educators to know what a data science education "should" look like. This doesn't faze Xiao-Li Meng, Professor of Statistics at Harvard University and founding Editor-in-Chief of the Harvard Data Science Review. He's our interview guest in this episode, talking about the pedagogically distinct classes of data science and how he thinks about designing curricula for making anyone more data literate. From new initiatives in data science to dealing with data science FOMO, this wide-ranging conversation with a leading scholar gives us a lot to think about. Relevant links: https://hdsr.mitpress.mit.edu/
2020-02-03

Running experiments when there are network effects

Traditional A/B tests assume that whether or not one person got a treatment has no effect on the experiment outcome for another person. But that's not a safe assumption, especially when there are network effects (like in almost any social context, for instance!). SUTVA, or the stable unit treatment value assumption, is a big phrase for this assumption, and violations of SUTVA make for some pretty interesting experiment designs. From news feeds in LinkedIn to disentangling herd immunity from individual immunity in vaccine studies, indirect (i.e. network) effects in experiments can be just as big as, or even bigger than, direct (i.e. individual) effects. And this is what we talk about this week on the podcast. Relevant links: http://hanj.cs.illinois.edu/pdf/www15_hgui.pdf https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2600548/pdf/nihms-73860.pdf
2020-01-27

Zeroing in on what makes adversarial examples possible

Adversarial examples are really, really weird: pictures of penguins that get classified with high certainty by machine learning algorithms as drumsets, or random noise labeled as pandas, or any one of an infinite number of mistakes in labeling data that humans would never make but computers make with joyous abandon. What gives? A compelling new argument makes the case that it's not the algorithms so much as the features in the datasets that hold the clue. This week's episode goes through several papers pushing our collective understanding of adversarial examples, and giving us clues to what makes these counterintuitive cases possible. Relevant links: https://arxiv.org/pdf/1905.02175.pdf https://arxiv.org/pdf/1805.12152.pdf https://distill.pub/2019/advex-bugs-discussion/ https://arxiv.org/pdf/1911.02508.pdf
2020-01-20

Unsupervised Dimensionality Reduction: UMAP vs t-SNE

Dimensionality reduction redux: this episode covers UMAP, an unsupervised algorithm designed to make high-dimensional data easier to visualize, cluster, etc. It's similar to t-SNE but has some advantages. This episode gives a quick recap of t-SNE, especially the connection it shares with information theory, then gets into how UMAP is different (many say better). Between the time we recorded and released this episode, an interesting argument made the rounds on the internet that UMAP's advantages largely stem from good initialization, not from advantages inherent in the algorithm. We don't cover that argument here, obviously, because it wasn't out there when we were recording, but you can find a link to the paper below. Relevant links: https://pair-code.github.io/understanding-umap/ https://www.biorxiv.org/content/10.1101/2019.12.19.877522v1
2020-01-13

Data scientists: beware of simple metrics

Picking a metric for a problem means defining how you'll measure success in solving that problem. That sounds important because it is, but oftentimes new data scientists only get experience with a few kinds of metrics when they're learning, and those metrics have real shortcomings when you think about what they tell you, or don't, about how well you're really solving the underlying problem. This episode takes a step back and asks: what are some metrics that are popular with data scientists, why are they popular, and what are their shortcomings when it comes to the real world? There's been a lot of great thinking and writing recently on this topic, and we cover a lot of that discussion along with some perspective of our own. Relevant links: https://www.fast.ai/2019/09/24/metrics/ https://arxiv.org/abs/1909.12475 https://medium.com/shoprunner/evaluating-classification-models-1-ff0730801f17 https://hbr.org/2019/09/dont-let-metrics-undermine-your-business
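A classic instance of this shortcoming: on imbalanced data, accuracy can look great for a model that never finds the thing you actually care about. A toy sketch (invented labels):

```python
def accuracy(pred, true):
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def recall(pred, true):
    positives = sum(true)
    return sum(p and t for p, t in zip(pred, true)) / positives

# 1-in-20 positive class: a fraud flag, a rare disease, etc.
true_labels = [1] + [0] * 19
always_negative = [0] * 20  # a "model" that never predicts the positive class

acc = accuracy(always_negative, true_labels)  # looks impressive: 95%
rec = recall(always_negative, true_labels)    # reveals the problem: 0%
```

The lazy model scores 95% accuracy while catching exactly none of the cases that matter, which is why metrics need to be chosen against the underlying problem rather than by habit.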
2020-01-05

Communicating data science, from academia to industry

For something as multifaceted and ill-defined as data science, communication and sharing best practices across the field can be extremely valuable but also extremely, well, multifaceted and ill-defined. That doesn't bother our guest today, Prof. Xiao-Li Meng of the Harvard statistics department, who is leading an effort to start an open-access Data Science Review journal in the model of the Harvard Business Review or Law Review. This episode features Xiao-Li talking about the need he sees for a central gathering place for data scientists in academia, industry, and government to come together to learn from (and teach!) each other. Relevant links: https://hdsr.mitpress.mit.edu/
2019-12-30

Optimizing for the short-term vs. the long-term

When data scientists run experiments, like A/B tests, it's really easy to plan on a period of a few days to a few weeks for collecting data. The thing is, the change that's being evaluated might have effects that last a lot longer than a few days or a few weeks: having a big sale might increase sales this week, but doing that repeatedly will teach customers to wait until there's a sale and never buy anything at full price, which could ultimately drive down revenue in the long term. Increasing the volume of ads on a website might lead people to click on more ads in the short term, but in the long term they'll be more likely to visually block the ads out and learn to ignore them. But these long-term effects aren't apparent from the short-term experiment, so this week we're talking about a paper from Google research that confronts the short-term vs. long-term tradeoff, and how to measure long-term effects from short-term experiments. Relevant links: https://research.google/pubs/pub43887/
2019-12-23

Interview with Prof. Andrew Lo, on using data science to inform complex business decisions

This episode features Prof. Andrew Lo, the author of a paper that we discussed recently on Linear Digressions, in which Prof. Lo uses data to predict whether a medicine in the development pipeline will eventually go on to win FDA approval. This episode gets into the story behind that paper: how the approval prospects of different drugs inform the investment decisions of pharma companies, how to stitch together siloed and incomplete datasets to form a coherent picture, and how the academics building some of these models think about when and how their work can make it out of academia and into industry. Professor Lo is an expert in business (he teaches at the MIT Sloan School of Management) and work like his shows how data science can open up new ways of doing business. Relevant links: https://hdsr.mitpress.mit.edu/pub/ct67j043
2019-12-16

Using machine learning to predict drug approvals

One of the hottest areas in data science and machine learning right now is healthcare: the size of the healthcare industry, the amount of data it generates, and the myriad improvements possible in the healthcare system lay the groundwork for compelling, innovative new data initiatives. One spot that drives much of the cost of medicine is the riskiness of developing new drugs: drug trials can cost hundreds of millions of dollars to run and, especially given that numerous medicines end up failing to get approval from the FDA, pharmaceutical companies want to have as much insight as possible about whether a drug is more or less likely to make it through clinical trials and on to approval. Professor Andrew Lo and collaborators at the MIT Sloan School of Management are taking a look at this prediction task using machine learning, and have an article in the Harvard Data Science Review showing what they were able to find. It's a fascinating example of how data science can be used to address business needs in creative but very targeted and effective ways. Relevant links: https://hdsr.mitpress.mit.edu/pub/ct67j043
2019-12-08

Facial recognition, society, and the law

Facial recognition being used in everyday life seemed far-off not too long ago. Now it's being used and advanced widely, and with increasing speed, which means that our technical capabilities are starting to outpace (if they haven't already) our consensus as a society about what is acceptable in facial recognition and what isn't. The threats to privacy, fairness, and freedom are real, and Microsoft has become one of the first large companies using this technology to speak out in specific support of its regulation through legislation. Their arguments are interesting, provocative, and even if you don't agree with every point they make or harbor some skepticism, there's a lot to think about in what they're saying.
2019-12-02
Link to episode

Lessons learned from doing data science, at scale, in industry

If you've taken a machine learning class, or read up on A/B tests, you likely have a decent grounding in the theoretical pillars of data science. But if you're in a position to have actually built lots of models or run lots of experiments, there's almost certainly a bunch of extra "street smarts" insights you've had that go beyond the "book smarts" of more academic studies. The data scientists at Booking.com, who build models and run experiments constantly, have written a paper that bridges the gap and talks about what non-obvious things they've learned from that practice. In this episode we read and digest that paper, talking through the gotchas that they don't always teach in a classroom but that make data science tricky and interesting in the real world. Relevant links: https://www.kdd.org/kdd2019/accepted-papers/view/150-successful-machine-learning-models-6-lessons-learned-at-booking.com
2019-11-25
Link to episode

Varsity A/B Testing

When you want to understand if doing something causes something else to happen, like if a change to a website causes a dip or rise in downstream conversions, the gold standard analysis method is to use randomized controlled trials. Once you've properly randomized the treatment assignment, the analysis methods are well-understood and there are great tools in R and Python (and other languages) to find the effects. However, when you're operating at scale, the logistics of running all those tests, and reaching correct conclusions reliably, become the main challenge: making sure the right metrics are being computed, knowing when to stop an experiment, minimizing the chances of finding spurious results, and many other issues that are simple to track for one or two experiments but become real challenges for dozens or hundreds of them. Nonetheless, the reality is that there might be dozens or hundreds of experiments worth running. So in this episode, we'll work through some of the most important issues for running experiments at scale, with strong support from a series of great blog posts from Airbnb about how they solve this very issue. For blog post links relevant to this episode, visit lineardigressions.com
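The spurious-results problem above can be sketched with the bluntest multiple-testing fix, a Bonferroni correction. This is a minimal illustration, not any particular company's method, and the p-values below are made up:

```python
# With dozens of experiments, a raw 0.05 significance cutoff will "find"
# winners by chance alone. Bonferroni divides the cutoff by the number of
# tests, trading statistical power for protection against false positives.

def significant_after_bonferroni(p_values, alpha=0.05):
    """Return which experiments survive a Bonferroni-corrected threshold."""
    corrected_alpha = alpha / len(p_values)
    return [p < corrected_alpha for p in p_values]

# 20 hypothetical experiments; a raw 0.05 cutoff would declare three winners.
p_values = [0.04, 0.20, 0.001, 0.03, 0.50] + [0.60] * 15
flags = significant_after_bonferroni(p_values)
# Only the p = 0.001 result clears the corrected threshold of 0.05 / 20 = 0.0025.
```

In practice, teams running many experiments often prefer less conservative corrections (like controlling the false discovery rate), but the tradeoff is the same: per-experiment thresholds must tighten as the number of experiments grows.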
2019-11-18
Link to episode

The Care and Feeding of Data Scientists: Growing Careers

This is the third and final installment of a conversation with Michelangelo D'Agostino, VP of Data Science and Engineering at Shoprunner, about growing and mentoring data scientists on your team. Some of our topics of conversation include how to institute hack time as a way to learn new things, what career growth looks like in data science, and how to institutionalize professional growth as part of a career ladder. As with the other episodes in this series, the topics we cover today are also covered in the O'Reilly report linked below. Relevant links: https://oreilly-ds-report.s3.amazonaws.com/Care_and_Feeding_of_Data_Scientists.pdf
2019-11-11
Link to episode

The Care and Feeding of Data Scientists: Recruiting and Hiring Data Scientists

This week's episode is the second in a three-part interview series with Michelangelo D'Agostino, VP of Data Science at Shoprunner. This discussion centers on building a team, which means recruiting, interviewing and hiring data scientists. Since data science talent is in such high demand, and data scientists are understandably choosy about where they go to work, a good recruiting and hiring program can have a big impact on the size and quality of the team. Our chat covers a couple of sections in our dual-authored O'Reilly report, "The Care and Feeding of Data Scientists," which you can read at the link below. https://oreilly-ds-report.s3.amazonaws.com/Care_and_Feeding_of_Data_Scientists.pdf
2019-11-04
Link to episode

The Care and Feeding of Data Scientists: Becoming a Data Science Manager

Data science management isn't easy, and many data scientists are finding themselves learning on the job how to manage data science teams as they get promoted into more formal leadership roles. O'Reilly recently released a report, written by yours truly (Katie) and another experienced data science manager, Michelangelo D'Agostino, where we lay out the most important tasks of a data science manager and some thoughts on how to unpack those tasks and approach them in a way that makes a new manager successful. This episode is an interview episode, the first of three, where we discuss some of the common paths to data science management and what distinguishes (and unifies) different types of data scientists and data science teams. Relevant links: https://oreilly-ds-report.s3.amazonaws.com/Care_and_Feeding_of_Data_Scientists.pdf
2019-10-28
Link to episode

Procella: YouTube's super-system for analytics data storage

If you're trying to manage a project that serves up analytics data for a few very distinct uses, you'd be wise to consider having custom solutions for each use case, optimized for the needs and constraints of that use case. You also wouldn't be YouTube, which found themselves with this problem (gigantic data needs and several very different things they needed to do with that data) and went a different way: they built one analytics data system to serve them all. Procella, the system they built, is the topic of our episode today: by deconstructing the system, we dig into the four motivating uses of this system, the complexity they had to introduce to service all four uses simultaneously, and the impressive engineering that has to go into building something that "just works." Relevant links: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45a6cea2b9c101761ea1b51c961628093ec1d5da.pdf
2019-10-21
Link to episode

Kalman Runners

The Kalman Filter is an algorithm for taking noisy measurements of dynamic systems and using them to get a better idea of the underlying dynamics than you could get from a simple extrapolation. If you've ever run a marathon, or been a nuclear missile, you probably know all about these challenges already. IMPORTANT NON-DATA SCIENCE CHICAGO MARATHON RACE RESULT FROM KATIE: My finish time was 3:20:17! It was the closest I may ever come to having the perfect run. That's a 34-minute personal record and a qualifying time for the Boston Marathon, so... guess I gotta go do that now.
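The core predict/update loop can be sketched in a few lines for the scalar case. This is a minimal one-dimensional sketch, with illustrative noise parameters and a made-up sequence of noisy "pace" readings:

```python
# A minimal 1D Kalman filter: estimate a slowly-varying true value (say, a
# runner's pace) from noisy measurements. All numbers here are illustrative.

def kalman_1d(measurements, process_var=1e-3, meas_var=0.25):
    """Filter a sequence of noisy scalar measurements.

    process_var: how much we believe the true value drifts per step.
    meas_var: the variance of the measurement noise.
    """
    estimate, error = measurements[0], 1.0  # initialize from the first reading
    estimates = [estimate]
    for z in measurements[1:]:
        # Predict: the state stays put, but our uncertainty grows.
        error += process_var
        # Update: blend prediction and measurement, weighted by the Kalman gain.
        gain = error / (error + meas_var)
        estimate += gain * (z - estimate)
        error *= (1 - gain)
        estimates.append(estimate)
    return estimates

noisy = [5.0, 5.4, 4.7, 5.1, 5.3, 4.9, 5.0]
smoothed = kalman_1d(noisy)
```

The gain moves toward 0 as the filter becomes confident (trusting its own prediction) and toward 1 when measurements are relatively trustworthy; the full matrix version generalizes the same two steps to multidimensional state.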
2019-10-13
Link to episode

What's *really* so hard about feature engineering?

Feature engineering is ubiquitous but gets surprisingly difficult surprisingly fast. What could be so complicated about just keeping track of what data you have, and how you made it? A lot, as it turns out: most data science platforms at this point include explicit features (in the product sense, not the data sense) just for keeping track of and sharing features (in the data sense, not the product sense). Just like a good library needs a catalogue, a city needs a map, and a home chef needs a cookbook to stay organized, modern data scientists need feature libraries, data dictionaries, and a general discipline around generating and caring for their datasets.
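The "feature library" idea can be sketched as a tiny catalogue that records how each feature is computed, so it can be discovered, shared, and regenerated. The registry API below is invented purely for illustration, not a real product:

```python
from datetime import date

# A toy feature registry: each entry pairs a human-readable description with
# the function that computes the feature, so teammates can look both up.
FEATURE_REGISTRY = {}

def register_feature(name, description):
    """Decorator that catalogues a feature-computing function by name."""
    def wrap(fn):
        FEATURE_REGISTRY[name] = {"description": description, "compute": fn}
        return fn
    return wrap

@register_feature("days_since_signup", "Days between signup and a reference date.")
def days_since_signup(user, today):
    return (today - user["signup_date"]).days

# Any teammate can now discover the feature and exactly how it was made:
user = {"signup_date": date(2019, 9, 1)}
value = FEATURE_REGISTRY["days_since_signup"]["compute"](user, date(2019, 10, 7))
```

Real feature stores add versioning, lineage, and backfills on top of this, but the organizing principle is the same: the definition of the feature lives in one shared, documented place.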
2019-10-07
Link to episode

Data storage for analytics: stars and snowflakes

If you're a data scientist or data engineer thinking about how to store data for analytics uses, one of the early choices you'll have to make (or live with, if someone else made it) is how to lay out the data in your data warehouse. There are a couple common organizational schemes that you'll likely encounter, and that we cover in this episode: first is the famous star schema, followed by the also-famous snowflake schema.
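A star schema can be sketched in a few lines of SQL: one central fact table of events, joined to small dimension tables. The table and column names below are invented for illustration, using SQLite for a self-contained example:

```python
import sqlite3

# Build a tiny star schema in an in-memory database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, month TEXT);
    -- The fact table holds measures plus foreign keys into each dimension.
    CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO dim_date VALUES (10, '2019-09'), (11, '2019-10');
    INSERT INTO fact_sales VALUES (1, 10, 9.99), (1, 11, 9.99), (2, 11, 24.5);
""")

# A typical analytics query: aggregate the facts, grouped by dimension attributes.
rows = cur.execute("""
    SELECT p.name, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY p.name, d.month
    ORDER BY p.name, d.month
""").fetchall()
```

A snowflake schema takes the same idea one step further by normalizing the dimension tables themselves (for example, splitting a product dimension into product and category tables), trading simpler joins for less redundancy.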
2019-09-30
Link to episode

Data storage: transactions vs. analytics

Data scientists and software engineers both work with databases, but they use them for different purposes. So if you're a data scientist thinking about the best way to store and access data for your analytics, you'll likely come up with a very different set of requirements than a software engineer looking to power an application. Hence the split between analytics and transactional databases: certain technologies are designed for one or the other, but no single type of database is perfect for both use cases. In this episode we'll talk about the differences between transactional and analytics databases, so no matter whether you're an analytics person or more of a classical software engineer, you can understand the needs of your colleagues on the other side.
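One way to see the split is through data layout. The toy sketch below (plain Python structures, invented data) contrasts a row-oriented layout, which keeps each record together, with a column-oriented one, which keeps each field together:

```python
# Row layout: each record is contiguous. This favors transactional access
# patterns like "fetch everything about order 2."
orders_rows = [
    {"id": 1, "customer": "a", "amount": 10.0},
    {"id": 2, "customer": "b", "amount": 30.0},
    {"id": 3, "customer": "a", "amount": 20.0},
]

# Column layout: each field is contiguous. This favors analytical access
# patterns like "average the amount column," which never touches other fields.
orders_cols = {
    "id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount": [10.0, 30.0, 20.0],
}

# Transactional query: one whole record by key, natural in the row layout.
order_2 = next(r for r in orders_rows if r["id"] == 2)

# Analytical query: scan a single column, natural in the columnar layout.
avg_amount = sum(orders_cols["amount"]) / len(orders_cols["amount"])
```

Real transactional databases (row stores) and analytics databases (column stores) add indexes, compression, and transactions on top, but this access-pattern difference is the root of the divide.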
2019-09-23
Link to episode

GROVER: an algorithm for making, and detecting, fake news

There are a few things that seem to be very popular in discussions of machine learning algorithms these days. First is the role that algorithms play now, or might play in the future, when it comes to manipulating public opinion, for example with fake news. Second is the impressive success of generative adversarial networks, and similar algorithms. Third is making state-of-the-art natural language processing algorithms and naming them after muppets. We get all three this week: GROVER is an algorithm for generating, and detecting, fake news. It's quite successful at both tasks, which raises an interesting question: is it safer to embargo the model (like GPT-2, the algorithm that was "too dangerous to release"), or release it as the best detector and antidote for its own fake news? Relevant links: https://grover.allenai.org/ https://arxiv.org/abs/1905.12616
2019-09-16
Link to episode

Data science teams as innovation initiatives

When a big, established company is thinking about their data science strategy, chances are good that whatever they come up with, it'll be somewhat at odds with the company's current structure and processes. Which makes sense, right? If you're a many-decades-old company trying to defend a successful and long-lived legacy and market share, you won't have the advantage that many upstart competitors have of being able to bake data analytics and science into the core structure of the organization. Instead, you have to retrofit. If you're the data scientist working in this environment, tasked with being on the front lines of a data transformation, you may be grappling with some real institutional challenges in this setup, and this episode is for you. We'll unpack the reason data innovation is necessarily challenging, the different ways to innovate and some of their tradeoffs, and some of the hardest but most critical phases in the innovation process. Relevant links: https://www.amazon.com/Innovators-Dilemma-Revolutionary-Change-Business/dp/0062060244 https://www.amazon.com/Other-Side-Innovation-Execution-Challenge/dp/1422166961
2019-09-09
Link to episode

Can Fancy Running Shoes Cause You To Run Faster?

This is a re-release of an episode that originally aired on July 29, 2018. The stars aligned for me (Katie) this past weekend: I raced my first half-marathon in a long time and got to read a great article from the NY Times about a new running shoe that Nike claims can make its wearers run faster. Causal claims like this one are really tough to verify, because even if the data suggests that people wearing the shoe are faster, that might be because of correlation, not causation, so I loved reading this article that went through an analysis of thousands of runners' data in 4 different ways. Each way has a great explanation with pros and cons (as well as results, of course), so be sure to read the article after you check out this episode! Relevant links: https://www.nytimes.com/interactive/2018/07/18/upshot/nike-vaporfly-shoe-strava.html
2019-09-02
Link to episode

Organizational Models for Data Scientists

When data science is hard, sometimes it's because the algorithms aren't converging or the data is messy, and sometimes it's because of organizational or business issues: the data scientists aren't positioned correctly to bring value to their organization. Maybe they don't know what problems to work on, or they build solutions to those problems but nobody uses what they build. A lot of this can be traced back to the way the team is organized, and (relatedly) how it interacts with the rest of the organization, which is what we tackle in this episode. There are lots of options for how to organize your data science team, each of which has strengths and weaknesses, and Pardis Noorzad wrote a great blog post recently that got us talking. Relevant links: https://medium.com/swlh/models-for-integrating-data-science-teams-within-organizations-7c5afa032ebd
2019-08-26
Link to episode