CSAMA 2020 – Statistical Data Analysis for Genome-Scale Biology

CSAMA 2020 (18th edition)
Statistical Data Analysis for Genome Scale Biology
Bressanone-Brixen, Italy (South Tyrol Alps)
June 21-26, 2020


Lecturers

  • Laurent Gatto, University of Cambridge
  • Wolfgang Huber, EMBL Heidelberg
  • Katharina Imkeller, German Cancer Research Center, Heidelberg
  • Johannes Rainer, European Academy of Bozen
  • Davide Risso, University of Padova
  • Lori Shepherd, Roswell Park Comprehensive Cancer Center, Buffalo
  • Charlotte Soneson, Friedrich Miescher Institute for Biomedical Research, Basel
  • Britta Velten, DKFZ Heidelberg

Experts participating

  • Robert Gentleman, 23andMe, Mountain View
  • Martin Morgan, Roswell Park Comprehensive Cancer Center, Buffalo
  • Levi Waldron, CUNY Medical School, New York

Teaching Assistants

  • Simone Bell, EMBL, Heidelberg
  • Mike L. Smith, EMBL, Heidelberg

The one-week intensive course Statistical Data Analysis for Genome-Scale Biology teaches statistical and computational analysis of multi-omics studies in biology and biomedicine. It covers the underlying theory and state of the art (the morning lectures) and practical hands-on exercises based on the R / Bioconductor environment (the afternoon labs). At the end of the course, you should be able to run analysis workflows on your own (multi-)omic data, adapt and combine different tools, and make informed and scientifically sound choices about analysis strategies.

Topics include:

  • Introduction to R and Bioconductor
  • The elements of statistics: hypothesis testing, multiple testing, regression, regularization, clustering and classification, parallelization and performance (machine learning), visualisation
  • RNA-Seq data analysis
  • Computing with sequences and genomic intervals
  • Working with annotation – genes, genomic features, variants, transcripts and proteins
  • Gene set enrichment analysis
  • Mass spec proteomics and metabolomics
  • Basics of microbiome analysis
  • Experimental design, batch effects and confounding
  • Reproducible research and workflow authoring with R Markdown
  • Package development, version control and developer tools (incl. Git, GitHub, RStudio)
  • Working with large data: performance, parallelisation and cloud computing

The course consists of

  • morning lectures: 20 x 45 minutes: Monday to Friday 8:30h – 12:00h
  • 4 practical computer tutorials in the afternoons (13:30h – 16:30h) on Monday, Tuesday, Thursday and Friday

Visit the course’s website at: http://www.huber.embl.de/csama

Welcome Alina Batzilla

Alina has a BSc degree in Molecular Biotechnology from the University of Heidelberg. She wrote her Bachelor's thesis at the Max Planck Institute (MPI) for Medical Research and is currently enrolled in the Molecular Biotechnology Master's program. After a research internship at the University of Queensland and an internship in the non-academic sector, Alina joined the Huber Group in January 2020. She is working on the integrative analysis of drug testing data in order to understand the mechanisms of drug response.

Postdoc, PhD and internship positions

We are continually inviting applications for postdoc, PhD and internship positions. You can apply for one of two tracks:

  1. Method development in statistical computing and bioinformatics,
  2. Biological discovery through integrative data analysis (“dry biology”)

For track 1, you will have strong quantitative and analytical skills, such as those acquired through a degree in mathematics, statistics, physics, computer science or a related field. You have the curiosity and motivation to work in interdisciplinary projects, which include the generation of new data and their analysis, and are eager to get to grips with relevant areas of biology and the technologies used in biology research. You will have experience in scientific computing and be familiar with one or several programming languages. Familiarity with R is definitely a plus.

For track 2, you will have training in the life sciences and strong coding skills that enable you to undertake complex data transformations, integrative operations, applications of mathematical models and visualizations. You are driven to make fundamental discoveries by mining cutting-edge, large data sets.

To apply, please contact Wolfgang with your CV, a brief statement of research interests, and examples of your work: besides your publications, this can include theses, research reports, talk slides, software projects (e.g. R packages, github projects) or data analysis reports (e.g. markdown reports or Jupyter notebooks).

Here are some keywords and a non-exhaustive list of collaboration partners with whom we work frequently on new, exciting data types:

  • Latent spaces and manifolds estimation from multi-modal single cell data
  • Genotype-drug interactions, precision oncology, multivariate biomarker discovery
  • Imaging-based phenotyping
  • Bioconductor
  • Thorsten Zenz – pharmacogenomics of drug response in blood cancer
  • Sascha Dietrich – systems medicine of cancer drugs
  • Lars Steinmetz – systems genetics & ‘omics technology development
  • Michael Boutros – high-throughput genetics, genetic interactions & synthetic lethality in cancer
  • Henrik Kaessmann – evolution of cell types


Congratulations Holly Giles

Holly Giles was awarded the Joachim-Herz Add-On Fellowship for Interdisciplinary Science. The fellowship supports interdisciplinary research and qualification (e.g. stays at a research lab and participation in conferences), the purchase of specialized equipment and tools (laptops, software, etc.), and participation in events of the Joachim Herz Foundation and fellowship meetings. It is a €12,500 grant to be spent over a period of two years.

New paper: Non-parametric analysis of thermal proteome profiles reveals novel drug-binding proteins

Dorothee Childs, Karsten Bach, Holger Franken, Simon Anders, Nils Kurzawa, Marcus Bantscheff, Mikhail Savitski and Wolfgang Huber

Detecting the targets of drugs and other molecules in intact cellular contexts is a major objective in drug discovery and in biology more broadly. Thermal proteome profiling (TPP) pursues this aim at proteome-wide scale by inferring target engagement from its effects on temperature-dependent protein denaturation. However, a key challenge of TPP is the statistical analysis of the measured melting curves with controlled false discovery rates at high proteome coverage and detection power. We present non-parametric analysis of response curves (NPARC), a statistical method for TPP based on functional data analysis and nonlinear regression. We evaluate NPARC on five independent TPP datasets and observe that it is able to detect subtle changes in any region of the melting curves, reliably detects the known targets, and outperforms a melting point-centric, single-parameter fitting approach in terms of specificity and sensitivity. NPARC can be combined with established analysis of variance (ANOVA) statistics and enables flexible, factorial experimental designs and replication levels. To facilitate access to a wide range of users, a freely available software implementation of NPARC is provided.
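The model-comparison idea behind a curve-based test of this kind can be sketched in a few lines. This is a rough illustration only, not the NPARC software: it assumes a simplified two-parameter sigmoid melting curve, replaces nonlinear regression with a coarse grid search, and uses nominal degrees of freedom for an F-like statistic. All function names and the toy data are hypothetical.

```python
import math

def sigmoid(temp, tm, slope):
    """Simplified melting curve: fraction of folded protein at temperature `temp`."""
    return 1.0 / (1.0 + math.exp((temp - tm) / slope))

def fit_rss(temps, values):
    """Grid-search melting point and slope; return the smallest residual sum of squares."""
    best = float("inf")
    for tm10 in range(370, 671, 5):        # melting point 37.0 to 67.0 in steps of 0.5
        for s10 in range(5, 51, 5):        # slope 0.5 to 5.0 in steps of 0.5
            tm, slope = tm10 / 10, s10 / 10
            rss = sum((y - sigmoid(t, tm, slope)) ** 2 for t, y in zip(temps, values))
            best = min(best, rss)
    return best

def curve_f_statistic(temps, vehicle, treated):
    """Compare one shared curve (null) against one curve per condition (alternative)."""
    rss_null = fit_rss(temps + temps, vehicle + treated)
    rss_alt = fit_rss(temps, vehicle) + fit_rss(temps, treated)
    d1 = 2                      # extra parameters in the alternative model
    d2 = 2 * len(temps) - 4     # nominal residual degrees of freedom (an assumption)
    f_stat = ((rss_null - rss_alt) / d1) / (max(rss_alt, 1e-12) / d2)
    return rss_null, rss_alt, f_stat

# Toy data: the treated curve is shifted by 4 degrees, plus small fixed "noise".
temps = [37, 41, 44, 47, 50, 53, 56, 59, 63, 67]
noise_a = [0.01, -0.02, 0.015, -0.01, 0.02, -0.015, 0.01, -0.005, 0.0, 0.005]
noise_b = list(reversed(noise_a))
vehicle = [sigmoid(t, 50.0, 2.0) + e for t, e in zip(temps, noise_a)]
treated = [sigmoid(t, 54.0, 2.0) + e for t, e in zip(temps, noise_b)]

rss_null, rss_alt, f_stat = curve_f_statistic(temps, vehicle, treated)
print(f"RSS null = {rss_null:.4f}, RSS alternative = {rss_alt:.4f}, F = {f_stat:.1f}")
```

Because the whole curve enters the residual sums of squares, a shift in any temperature region raises the statistic, not only a change in the melting point. Turning the statistic into a p-value with a controlled false discovery rate needs a calibrated null distribution, which this sketch does not attempt.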


New paper: Biological Plasticity Rescues Target Activity in CRISPR Knockouts

Arne H. Smits, Frederik Ziebell, Gerard Joberty, …, Lars M. Steinmetz, Gerard Drewes and Wolfgang Huber.

Gene knockouts (KOs) are efficiently engineered through CRISPR-Cas9-induced frameshift mutations. While DNA editing efficiency is readily verified by DNA sequencing, a systematic understanding of the efficiency of protein elimination has been lacking. Here, we devised an experimental strategy combining RNA-seq and triple-stage mass spectrometry to characterize 193 genetically verified deletions targeting 136 distinct genes generated by CRISPR-induced frameshifts in HAP1 cells. We observed residual protein expression for about one third of the quantified targets, at variable levels from low to original, and identified two causal mechanisms, translation reinitiation leading to N-terminally truncated target proteins, or skipping of the edited exon leading to protein isoforms with internal sequence deletions. Detailed analysis of three truncated targets, BRD4, DNMT1 and NGLY1, revealed partial preservation of protein function. Our results imply that systematic characterization of residual protein expression or function in CRISPR-Cas9 generated KO lines is necessary for phenotype interpretation.


Emma Dann moves to Wellcome Sanger Institute

After a very successful traineeship in the Huber group in 2018 and 2019, Emma Dann has moved to Cambridge to start her PhD studies at the Sanger Institute, located on the Wellcome Genome Campus and affiliated with the University of Cambridge. She is currently doing three-month rotation projects before picking a lab for her PhD project. At the moment she is working on methods for the alignment of scRNA-seq and scATAC-seq data in the lab of Sarah Teichmann.

Good luck and lots of success, Emma!

New book: Modern Statistics for Modern Biology

The textbook Modern Statistics for Modern Biology by Susan Holmes and Wolfgang Huber has been published by Cambridge University Press (paperback). An online HTML version is also available. From the blurb: “If you are a biologist and want to get the best out of the powerful methods of modern computational statistics, this is your book. You can visualize and analyze your own data, apply unsupervised and supervised learning, integrate datasets, apply hypothesis testing, and make publication-quality figures using the power of R/Bioconductor and ggplot2. This book will teach you ‘cooking from scratch’, from raw data to beautiful illuminating output, as you learn to write your own scripts in the R language and to use advanced statistics packages from CRAN and Bioconductor. It covers a broad range of basic and advanced topics important in the analysis of high-throughput biological data, including principal component analysis and multidimensional scaling, clustering, multiple testing, unsupervised and supervised learning, resampling, the pitfalls of experimental design, and power simulations using Monte Carlo, and it even reaches networks, trees, spatial statistics, image data, and microbial ecology. Using a minimum of mathematical notation, it builds understanding from well-chosen examples, simulation, visualization, and above all hands-on interaction with data and code.”