Laurent Gatto | Institut de Duve

Lien vers notre page CBIO Lab

Modern high throughput biology produces more data that can be analyzed, and the challenges of modern biology are statistical interpretation and integration of these data. The researchers and engineers in the Computational Biology group devise novel computational techniques to comprehend high dimensional biology and enable high throughput biomedical research.

For the last ten years, biology and biomedical sciences have seen an impressive increase in the size of the data that are collected as part of routine research projects. The increase in amount and complexity of such data leads some to call it a data deluge. Indeed, we have reached a situation where the sheer volume of data that is produced is overwhelming the capacity of individual researchers and research groups to manage, analyze and extract meaningful information from them. This revolution is shifting biomedical research towards the quantitative side of science, and has been driven by the technological breakthroughs that, today, allow us to sequence whole genomes, quantify the near complete set of transcripts or proteins, measure epigenetic modifications across whole genomes, assay proteins for post-translational modifications, interactions and localization. But the question remains: what to do with all that data?

To illustrate how large amounts of data can be difficult to deal with, let’s use the Lego bricks as an analogy. In a classic biological experimental setup, researchers would focus on a particular gene or small set of genes of interest. They would design an experimental setup to address their specific question, run the experiments and, after collecting the data, (mostly) manually analyze them. They would draw conclusions that would either support or deny their working model and follow-up by designing the next set of experiments accordingly. From a Lego point of view, this would correspond to acquiring a Lego box as we know it, i.e. containing all the blocks needed to build the model (and only those blocks) and a precise and accurate building plan. There is no need for any special tool to find the blocks and figure out how to assemble them; even for relatively large sets, given enough time, it’s easy enough to follow the instructions and produce the final product.

Now imagine that either the blocks or the building map are messed up.

Imagine that your box contains many more blocks than you actually need, that some of the blocks needed for your final product are missing (and that you don’t know if they are, or which ones are missing), and that the blocks aren’t sorted in little sachets, but provided in a single huge bag, with potentially several orders of magnitude extra blocks. Now imagine that the instructions are missing steps or pages at random, or that the instructions are missing completely, but that you have some idea of what needs to be built. In such situations, you’ll need algorithms and tools that will automatically sort the blocks, and arrange them by size, color, shape, … and algorithms that will inform you what pieces are most likely relevant for the model you want to build.

This sounds like a hopeless situation, but it’s not. There are countless opportunities in acquiring a lot of data, even when the actual building instructions are missing. Indeed, the extra blocks aren’t random, they are part of something bigger. Imagine you originally wanted to build the Millennium Falcon from the Star Wars movies, and that the reason you want to build that ship is that you are interested in the technology of the Rebel Alliance, or even in the whole Star Wars universe. Even if the extra blocks aren’t directly relevant to building the Millennium Falcon, they might provide precious information about the technology that was available when the ship was built. With the right algorithms, you might be able to build your ship, and collect additional information about the Star Wars universe. Or, even if you don’t manage to build the whole ship and achieve only a partial, incomplete product, the additional information might actually reveal much more about the Lego Star Wars universe than solely focusing on the one ship.

Methods that would consider the Lego blocks only, without any additional information (such as whether some blocks are used to build Rebel or Empire ships, or parts of the instruction manual) are termed “unsupervised”. Such methods could be used to group all the blocks and identify clusters of blocks with similar features. If additional information is at hand, such as whether a block is used for a type of ship, and we would want to classify a new block as to what type of ship it belonged, one would refer to a “supervised” analysis. Given the sheer number of blocks at hand, we would also want to summarize our collection by counting the types of blocks, how many blocks of each type we have, and visually represent this diversity.

In bioinformatics, the blocks would typically be replaced by quantitative measurements of the abundance of biological entities, such as transcripts or proteins. The annotation for the supervised analysis would describe whether these samples are wild-type cell lines, or from healthy donors or, on the contrary, cells submitted to a particular drug of interest or with a missing gene, or from patients suffering from a specific disease.

Another important feature of the data and the nature of modern, high throughput biology, is that the questions that are now asked have shifted from unequivocal and universal to context-specific, probabilistic, and definition-dependent (see Quincey Justman [2018] for an insightful documentation of this). The complexity of what we measure and what we ask requires us to accept that certainties and determinism are replaced by probabilities and uncertainties that need to be quantified to acquire confident knowledge.

This is the figurative situation modern biomedicine is in: tremendous potentials to gain a much broader picture of the whole cell, organ, body, but at the cost of a complexity in what we measure, and the need for bespoken methods to sort and manage the data we acquire and to analyze and understand it. That is the role of bioinformatics and computational biology, i.e. to devise ways to understand complex biological data to comprehend complex biological processes.

Page under construction

Complete list on PubMed

Replication of single-cell proteomics data reveals important computational challenges

Vanderaa C, Gatto L.

Expert Rev Proteomics. 2021; pp1-9.
Spatiotemporal proteomic profiling of the pro-inflammatory response to lipopolysaccharide in the THP-1 human leukaemia cell line

Mulvey CM, Breckels LM, Crook OM, Sanders DJ, Ribeiro ALR, Geladaki A, Christoforou A, Britovšek NK, Hurrell T, Deery MJ, Gatto L, Smith AM, Lilley KS.

Nat Commun. 2021; 12(1):5773.
A semi-supervised Bayesian approach for simultaneous protein sub-cellular localisation assignment and novelty detection

Crook OM, Geladaki A, Nightingale DJH, Vennard OL, Lilley KS, Gatto L, Kirk PDW.

PLoS Comput Biol. 2020; 16(11):e1008288.
MSnbase, Efficient and Elegant R-Based Processing and Visualization of Raw Mass Spectrometry Data

Gatto L, Gibb S, Rainer J.

J Proteome Res. 2021; 20(1):1063-1069.
Aberrant Membrane Composition and Biophysical Properties Impair Erythrocyte Morphology and Functionality in Elliptocytosis

Pollet H, Cloos AS, Stommen A, Vanderroost J, Conrard L, Paquot A, Ghodsi M, Carquin M, Léonard C, Guthmann M, Lingurski M, Vermylen C, Killian T, Gatto L, Rider M, Pyr Dit Ruys S, Vertommen D, Vikkula M, Brouillard P, Van Der Smissen P, Muccioli GG, Tyteca D.

Biomolecules. 2020; 10(8):1120.
Molecular networks in Network Medicine: Development and applications.

Silverman EK, Schmidt HHHW, Anastasiadou E, Altucci L, Angelini M, Badimon L, Balligand JL, Benincasa G, Capasso G, Conte F, Di Costanzo A, Farina L, Fiscon G, Gatto L, Gentili M, Loscalzo J, Marchese C, Napoli C, Paci P, Petti M, Quackenbush J, Tieri P, et al.

Wiley Interdiscip Rev Syst Biol Med. 2020: e1489.
Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics.

Crook OM, Gatto L, Kirk PDW.

Stat Appl Genet Mol Biol. 2019; 18(6). pii: /j/sagmb.2019.18.issue-6/sagmb-2018-0065/sagmb-2018-0065.xml.
Proteome Mapping of a Cyanobacterium Reveals Distinct Compartment Organization and Cell-Dispersed Metabolism.

Baers LL, Breckels LM, Mills LA, Gatto L, Deery MJ, Stevens TJ, Howe CJ, Lilley KS, Lea-Smith DJ.

Plant Physiol. 2019; 181(4):1721-1738.
Reproducibility and Transparency by Design.

Petyuk VA, Gatto L, Payne SH.

Mol Cell Proteomics. 2019; 18(8 suppl 1):S202-S204.
A Bioconductor workflow for the Bayesian analysis of spatial proteomics.

Crook OM, Breckels LM, Lilley KS, Kirk PDW, Gatto L.

F1000Res. 2019; 8:446.
Combining LOPIT with differential ultracentrifugation for high-resolution spatial proteomics.

Geladaki A, Kočevar Britovšek N, Breckels LM, Smith TS, Vennard OL, Mulvey CM, Crook OM, Gatto L, Lilley KS.

Nat Commun. 2019; 10(1):331.
ensembldb: an R package to create and use Ensembl-based annotation resources.

Rainer J, Gatto L, Weichenberger CX.

Bioinformatics. 2019; 35(17):3151-3153.
Assessing sub-cellular resolution in spatial proteomics experiments.

Gatto L, Breckels LM, Lilley KS.

Curr Opin Chem Biol. 2018; 48:123-49.
A Bayesian mixture modelling approach for spatial proteomics.

Crook OM, Mulvey CM, Kirk PDW, Lilley KS, Gatto L.

PLoS Comput Biol. 2018; 14(11):e1006516.
A Bioconductor workflow for processing and analysing spatial proteomics data.

Breckels LM, Mulvey CM, Lilley KS, Gatto L.

Version 2. F1000Res. 2016 [revised 2018]; 5:2926.
Learning from Heterogeneous Data Sources: An Application in Spatial Proteomics.

Breckels LM, Holden SB, Wojnar D, Mulvey CM, Christoforou A, Groen A, Trotter MW, Kohlbacher O, Lilley KS, Gatto L.

PLoS Comput Biol. 2016; 12(5):e1004920.
A draft map of the mouse pluripotent stem cell spatial proteome.

Christoforou A, Mulvey CM, Breckels LM, Geladaki A, Hurrell T, Hayward PC, Naake T, Gatto L, Viner R, Martinez Arias A, Lilley KS.

Nat Commun. 2016; 7:8992.
Visualization of proteomics data using R and bioconductor.

Gatto L, Breckels LM, Naake T, Gibb S.

Proteomics. 2015; 15(8):1375-89.
Orchestrating high-throughput genomic analysis with Bioconductor.

Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, Gottardo R, Hahne F, Hansen KD, Irizarry RA, Lawrence M, Love MI, MacDonald J, Obenchain V, Oleś AK, Pagès H, Reyes A, Shannon P, et al.

Nat Methods. 2015; 12(2):115-21.
A foundation for reliable spatial proteomics data analysis.

Gatto L, Breckels LM, Burger T, Nightingale DJ, Groen AJ, Campbell C, Nikolovski N, Mulvey CM, Christoforou A, Ferro M, Lilley KS.

Mol Cell Proteomics. 2014; 13(8):1937-52.
ProteomeXchange provides globally coordinated proteomics data submission and dissemination.

Vizcaíno JA, Deutsch EW, Wang R, Csordas A, Reisinger F, Ríos D, Dianes JA, Sun Z, Farrah T, Bandeira N, Binz PA, Xenarios I, Eisenacher M, Mayer G, Gatto L, Campos A, Chalkley RJ, Kraus HJ, Albar JP, Martinez-Bartolomé S, Apweiler R, Omenn GS, et al.

Nat Biotechnol. 2014; 32(3):223-6.
Mass-spectrometry-based spatial proteomics data analysis using pRoloc and pRolocdata.

Gatto L, Breckels LM, Wieczorek S, Burger T, Lilley KS.

Bioinformatics. 2014; 30(9):1322-4.
Using R and Bioconductor for proteomics data analysis.

Gatto L, Christoforou A.

Biochim Biophys Acta. 2014; 1844(1 Pt A):42-51.
A cross-platform toolkit for mass spectrometry and proteomics.

Chambers MC, Maclean B, Burke R, Amodei D, Ruderman DL, Neumann S, Gatto L, Fischer B, Pratt B, Egertson J, Hoff K, Kessner D, Tasman N, Shulman N, Frewen B, Baker TA, Brusniak MY, Paulse C, Creasy D, Flashner L, Kani K, Moulding C, et al.

Nat Biotechnol. 2012; 30(10):918-20.
MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation.

Gatto L, Lilley KS.

Bioinformatics. 2012; 28(2):288-9.
Exploiting the DepMap cancer dependency data using the depmap R package.

Killian T, Gatto L.

F1000Research 2021; 10:416.