Podcast: the bioinformatics chat

#70 Prioritizing drug target genes with Marie Sadler

21 December 2023 | 52 min

#69 Suffix arrays in optimal compressed space and δ-SA with Tomasz Kociumaka and Dominik Kempa

29 September 2023 | 57 min

#68 Phylogenetic inference from raw reads and Read2Tree with David Dylus

28 August 2023 | 49 min

#67 AlphaFold and variant effect prediction with Amelie Stein

29 July 2023 | 35 min

#66 AlphaFold and shape-mers with Janani Durairaj

10 July 2023 | 21 min

#65 AlphaFold and protein interactions with Pedro Beltrao

21 June 2023 | 52 min

#64 Enformer: predicting gene expression from sequence with Žiga Avsec

9 November 2021 | 60 min

#63 Bioinformatics Contest 2021 with Maksym Kovalchuk and James Matthew Holt

27 September 2021 | 61 min

#62 Steady states of metabolic networks and Dingo with Apostolos Chalkis

28 July 2021 | 38 min

#61 3D genome organization and GRiNCH with Da-Inn Erika Lee

23 June 2021 | 70 min

#60 Differential gene expression and DESeq2 with Michael Love

12 May 2021 | 91 min

#59 Proteomics calibration with Lindsay Pino

21 April 2021 | 48 min

#58 B cell maturation and class switching with Hamish King

31 March 2021 | 89 min

#57 Enhancers with Molly Gasperini

10 March 2021 | 47 min

#56 Polygenic risk scores in admixed populations with Bárbara Bitarello

17 February 2021 | 90 min

#55 Phylogenetics and the likelihood gradient with Xiang Ji

13 January 2021 | 57 min

#54 Seeding methods for read alignment with Markus Schmidt

16 December 2020 | 61 min

#53 Real-time quantitative proteomics with Devin Schweppe

18 November 2020 | 63 min

#52 How 23andMe finds identical-by-descent segments with William Freyman

27 October 2020 | 43 min

#51 Basset and Basenji with David Kelley

7 October 2020 | 74 min

#50 ENCODE3 with Jill Moore

10 September 2020 | 56 min

#49 Most Permissive Boolean Networks with Loïc Paulevé

19 August 2020 | 64 min

#48 Machine learning for drug development with Marinka Zitnik

29 July 2020 | 85 min

#47 Reproducible pipelines and NGLess with Luis Pedro Coelho

24 June 2020 | 58 min

#46 HiFi reads and HiCanu with Sergey Nurk and Sergey Koren

27 May 2020 | 69 min

#45 Genome assembly and Canu with Sergey Koren and Sergey Nurk

20 May 2020 | 77 min

#44 DNA tagging and Porcupine with Kathryn Doroschak

29 April 2020 | 45 min

#43 Generalized PCA for single-cell data with William Townes

27 March 2020 | 60 min

#42 Spectrum-preserving string sets and simplitigs with Amatur Rahman and Karel Břinda

28 February 2020 | 53 min

#41 Epidemic models with Kris Parag

27 January 2020 | 68 min

#40 Plasmid classification and binning with Sergio Arredondo-Alonso and Anita Schürch

30 December 2019 | 45 min

#39 Amplicon sequence variants and bias with Benjamin Callahan

29 November 2019 | 62 min

#38 Issues in legacy genomes with Luke Anderson-Trocmé

22 October 2019 | 61 min

#37 Causality and potential outcomes with Irineo Cabreros

27 September 2019 | 41 min

#36 scVI with Romain Lopez and Gabriel Misrachi

30 August 2019 | 80 min

#35 The role of the DNA shape in transcription factor binding with Hassan Samee

26 July 2019 | 62 min

#34 Power laws and T-cell receptors with Kristina Grigaityte

29 June 2019 | 87 min

#33 Genome assembly from long reads and Flye with Mikhail Kolmogorov

31 May 2019 | 73 min

#32 Deep tensor factorization and a pitfall for machine learning methods with Jacob Schreiber

29 April 2019 | 75 min

#31 Bioinformatics Contest 2019 with Alexey Sergushichev and Gennady Korotkevich

24 March 2019 | 106 min

#30 Bayesian inference of chromatin structure from Hi-C data with Simeon Carstens

27 February 2019 | 66 min

#29 Haplotype-aware genotyping from long reads with Trevor Pesout

27 January 2019 | 72 min

#28 Space-efficient variable-order Markov models with Fabio Cunial

28 December 2018 | 69 min

#27 Classification of CRISPR-induced mutations and CRISPRpic with HoJoon Lee and Seung Woo Cho

29 November 2018 | 57 min

#26 Feature selection, Relief and STIR with Trang Lê

27 October 2018 | 69 min

#25 Transposons and repeats with Kaushik Panda and Keith Slotkin

24 September 2018 | 101 min

#24 Read correction and Bcool with Antoine Limasset

31 August 2018 | 60 min

#23 RNA design, EteRNA and NEMO with Fernando Portela

27 July 2018 | 91 min

#22 smCounter2: somatic variant calling and UMIs with Chang Xu

29 June 2018 | 64 min

#21 Linear mixed models, GWAS, and lme4qtl with Andrey Ziyatdinov

31 May 2018 | 51 min

#20 B cell receptor substitution profile prediction and SPURF with Kristian Davidsen and Amrit Dhar

30 April 2018 | 121 min

#19 Genome fingerprints with Gustavo Glusman

7 April 2018 | 89 min

#18 Bioinformatics Contest 2018 with Alexey Sergushichev and Ekaterina Vyahhi

3 March 2018 | 113 min

#17 Rarefaction, alpha diversity, and statistics with Amy Willis

22 January 2018 | 74 min

#16 Javier Quilez on what makes large sequencing projects successful

24 December 2017 | 64 min

#15 Optimal transport for single-cell expression data with Geoffrey Schiebinger

26 November 2017 | 69 min

#14 Generating functions for read mapping with Guillaume Filion

13 November 2017 | 70 min

Guillaume Filion recently published a preprint in which he applies generating functions, a concept from analytic combinatorics, to estimating the optimal seed length for read mapping.

In this episode, Guillaume and I attempt to explain the core concepts from analytic combinatorics and why they are useful in modeling sequences.
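As a rough illustration of what a generating function looks like in this setting, consider the simplest possible error model. This is a sketch under stated assumptions, not the derivation from the preprint: each base is wrong independently with probability p (so it is correct with probability q = 1 - p), a seed is any stretch of k consecutive correct bases, and the symbols p, q, k, a_n, F and W are chosen here for illustration. A read with no seed is a sequence of blocks of the form "fewer than k correct bases followed by one error", followed by a final run of fewer than k correct bases, which translates directly into a formula:

```latex
% Assumed model (illustrative only): independent per-base errors with rate p, q = 1 - p.
% a_n = probability that a read of length n contains no error-free stretch of k bases.
\[
  W(z) \;=\; \sum_{n \ge 0} a_n z^n
       \;=\; \frac{F(z)}{1 - p z\, F(z)},
  \qquad
  F(z) \;=\; \sum_{i=0}^{k-1} (qz)^i \;=\; \frac{1 - (qz)^k}{1 - qz},
\]
% Substituting F and using p + q = 1 gives a closed form:
\[
  W(z) \;=\; \frac{1 - q^k z^k}{1 - z + p\, q^k z^{k+1}} .
\]
% Reading off coefficients yields a linear recurrence:
\[
  a_n = 1 \ (0 \le n < k), \qquad
  a_k = 1 - q^k, \qquad
  a_n = a_{n-1} - p\, q^k\, a_{n-k-1} \ (n > k).
\]
```

As a sanity check, for k = 1 the closed form reduces to 1/(1 - pz), so a_n = p^n: the only reads without a single correct base are those in which every base is an error.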

Links:

Guillaume’s preprint: Analytic combinatorics for bioinformatics I: seeding methods
Once upon a BLAST
Guillaume’s blog, «The Grand Locus»
Dan Gusfield’s home page featuring the fast Fourier transform lectures I mention in the podcast

After we recorded the podcast, Guillaume wrote to me to clarify the relationship between read mapping and BLAST:

I looked into my notes about BLAST. The problem that it solves is the following: “Given that a local alignment has score S, what is the probability that it does not contain a word of score T or greater?” The background work of Karlin and Altschul is used to give a statistical significance for S (what is the probability that a “Smith-Waterman random walk” starting at height 0 would reach height S, i.e. what is the probability that aligning two random proteins would yield a score S). The authors write in the original paper: “Theory does not yet exist to calculate the probability q that such a segment pair will contain a word pair with a score of at least T. However, one argument suggests that q should depend exponentially upon the score of the MSP.”

This is the part that I did not remember well. MSP stands for Maximal Segment Pair; this is the “longest fragment” with the “highest score” in the alignment. I thought that Karlin and Altschul solved this part as well, but the authors just go empirical and calibrate the relationship between T and S with simulations.

I realize a little bit better now that my work is precisely about this problem that the authors of BLAST could not solve, but as you pointed out, I am attacking only a very specific sub-case that is much easier because the models of sequencing error are much simpler than protein evolution. BLAST is concerned with local alignment, so it wants to get all the hits with an MSP score above S. Short read mapping just wants the true location of the read, which does not really have the notion of a score S. But still, mathematically, it is equivalent to the case where S is a constant that depends only on the read size and the distribution of the score T depends only on the seed length and the error rate. I have a few ideas of how to use analytic combinatorics to solve the problem for proteins, but it is mostly complicated because the variable of interest T is a fractional number and not an integer…

So what is different from BLAST? The right answer (I think) is that BLAST finds all the hits with an MSP above statistical background, but it says nothing of the probability that the true location contains such an MSP, so it is hard to calibrate the heuristic for that specific problem. In reality, the parallel with BLAST is just the basic strategy: make a statistical model for your problem and use it to calibrate the heuristic.
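To make "use a statistical model to calibrate the heuristic" concrete for the read-mapping case, here is a minimal sketch in Python. It assumes the simplest possible error model (independent per-base errors at a fixed rate), which is far cruder than the models in the preprint, and it uses plain dynamic programming rather than generating functions. The function name `prob_error_free_seed` and its parameters are hypothetical, chosen only for this example.

```python
def prob_error_free_seed(read_len: int, seed_len: int, error_rate: float) -> float:
    """Probability that a read contains at least one error-free stretch of seed_len bases.

    Assumption (illustrative only): every base is wrong independently with
    probability error_rate. Real sequencing errors are not this simple.
    """
    if seed_len <= 0:
        raise ValueError("seed_len must be positive")
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")

    match = 1.0 - error_rate
    # state[j] = probability that no seed has been seen so far and the read
    # currently ends in exactly j consecutive correct bases (0 <= j < seed_len).
    state = [0.0] * seed_len
    state[0] = 1.0

    for _ in range(read_len):
        new_state = [0.0] * seed_len
        # An error at this base resets the current run, whatever its length.
        new_state[0] = error_rate * sum(state)
        # A correct base extends the run by one. A run that would reach
        # seed_len is a seed, so its probability mass leaves the table.
        for j in range(seed_len - 1):
            new_state[j + 1] = match * state[j]
        state = new_state

    # Whatever mass is left corresponds to reads without any seed.
    return 1.0 - sum(state)


if __name__ == "__main__":
    # Scan a few seed lengths for a hypothetical 100-base read with 1% errors.
    for k in (12, 16, 20, 24, 28, 32):
        print(k, prob_error_free_seed(100, k, 0.01))
```

Longer seeds are less likely to survive errors but produce fewer spurious hits in the genome, so a table like the one printed above is one input to choosing a seed length. The other half of the trade-off, the number of off-target hits, is not modeled here.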

If you enjoyed this episode, please consider supporting the podcast on Patreon.

#13 Bracken with Jennifer Lu

21 October 2017 | 47 min

#12 Modelling the immune system and C-ImmSim with Filippo Castiglione

8 October 2017 | 67 min

#11 Collective cell migration with Linus Schumacher

18 September 2017 | 60 min

#10 Spatially variable genes and SpatialDE with Valentine Svensson

3 September 2017 | 58 min

#9 Michael Tessler and Christopher Mason on 16S amplicon vs shotgun sequencing

18 August 2017 | 46 min

#8 Perfect k-mer hashing in Sailfish

5 August 2017 | 22 min

#7 Metagenomics and Kraken

9 July 2017 | 28 min

#6 Allele-specific expression

25 June 2017 | 33 min

#5 Relative data analysis and propr with Thom Quinn

10 June 2017 | 56 min

#4 ChIP-seq and GenoGAM with Georg Stricker and Julien Gagneur

29 May 2017 | 55 min

#3 miRNA target site prediction and seedVicious with Antonio Marco

12 May 2017 | 57 min

#2 Single-cell RNA sequencing with Aleksandra Kolodziejczyk

29 April 2017 | 69 min

#1 Transcriptome assembly and Scallop with Mingfu Shao

16 April 2017 | 44 min

the bioinformatics chat

A podcast about computational biology, bioinformatics, and next generation sequencing.
