
Automated Composition of Scientific Workflows in Mass Spectrometry-Based Proteomics


Abstract:

The literature describes numerous software utilities operating on mass spectrometry (MS) data that provide specific operations as building blocks for assembling purpose-specific workflows. Working out which tools and combinations are applicable or optimal is often hard: insufficient annotation of tool functions and interfaces impedes finding viable tool combinations, and potentially compatible tools may not, in practice, operate together. Researchers therefore face difficulties in selecting a practical and effective data analysis pipeline for a specific experimental design.
Date of Conference: 29 October 2018 - 01 November 2018
Date Added to IEEE Xplore: 27 December 2018
Conference Location: Amsterdam, Netherlands

We propose a framework, illustrated in Fig. 1, to support researchers in identifying, extensively comparing and benchmarking multiple workflows assembled from individual bioinformatics tools. It is most beneficial when applied to a well-annotated tool collection, yielding workflows that give similar but not identical results, and it opens the door to identifying operational or functional bottlenecks in proteomics data analysis and, more generally, in any use case involving multiple bioinformatics operations. Concretely, we used the PROPHETS automatic workflow composition platform [1], [2] to explore the workflows that could be composed from a selection of public domain analysis tools listed on ms-utils.org (https://ms-utils.org/) or registered in the ELIXIR Tools and Data Services Registry (https://bio.tools) [3]. Composition was facilitated by the tools' semantic annotation with terms from the EDAM ontology [4].
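The core of such composition is chaining tools whose declared operations and data formats are compatible. The following minimal sketch illustrates the idea; the tool names, operations and formats are invented for illustration only, and the real platform relies on EDAM annotations and PROPHETS' synthesis engine rather than this naive enumeration:

```python
# Hypothetical tool annotations in the spirit of EDAM terms: each tool
# declares one operation plus the data format it reads and writes.
# These entries are illustrative, not actual ms-utils.org/bio.tools records.
TOOLS = [
    {"name": "ToolA", "operation": "Peak detection",          "in": "mzML",      "out": "mzML"},
    {"name": "ToolB", "operation": "Peptide identification",  "in": "mzML",      "out": "mzIdentML"},
    {"name": "ToolC", "operation": "Peptide identification",  "in": "mzML",      "out": "pepXML"},
    {"name": "ToolD", "operation": "Protein inference",       "in": "mzIdentML", "out": "mzIdentML"},
    {"name": "ToolE", "operation": "Protein inference",       "in": "pepXML",    "out": "protXML"},
]

def compose(requested_ops, start_format):
    """Enumerate all tool chains performing `requested_ops` in order,
    where each tool's input format matches the previous tool's output."""
    chains = [([], start_format)]
    for op in requested_ops:
        extended = []
        for names, fmt in chains:
            for tool in TOOLS:
                if tool["operation"] == op and tool["in"] == fmt:
                    extended.append((names + [tool["name"]], tool["out"]))
        chains = extended  # keep only chains satisfying the constraint so far
    return [names for names, _ in chains]

print(compose(["Peptide identification", "Protein inference"], "mzML"))
# → [['ToolB', 'ToolD'], ['ToolC', 'ToolE']]
```

Two semantically equivalent chains satisfy the same request here, which is exactly the situation the framework is designed to surface and compare.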

Fig. 1. Schematic outline of the workflow composition: PROPHETS suggests workflows for a selection of tools from ms-utils.org and bio.tools, annotated with terms from the EDAM ontology, under constraints such as requested operations or input and output formats. The resulting workflows are implemented and tested on public data.

To demonstrate the practical use of our framework, we implemented, executed and compared a number of logically and semantically equivalent workflows addressing four use cases representing frequent tasks in MS-based proteomics: peptide retention time prediction, protein identification and enrichment analysis, localization of phosphorylation, and protein quantitation using isotopic labeling. For all use cases, the different workflows produced at least slightly different results. Our assessment of reproducibility tested the robustness of both the data (do different experimental measurements lead to the same end results?) and the analysis (do different tool combinations give similar end results?).

This strongly suggests that as many workflows as feasible should be benchmarked on "ground truth" data to identify optimal tool combinations. With the approach presented here, it is now possible to compare many new pipeline instances to commonly used workflows on the basis of benchmarking data sets, and thereby identify the best-suited alternatives for specific groups of operations as well as for specific data types (e.g. different MS instruments). Benchmarking results, however, may vary across data types, experimental setups and biological sources, so substantial community efforts will still be required to create sets of ground-truth data that allow for generalized conclusions.
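As one way to make such a comparison concrete, equivalent workflows could be scored against a ground-truth set of identifications. The sketch below uses invented peptide identifiers and plain precision/recall; it is an illustration of the benchmarking idea, not the evaluation procedure used in the study:

```python
def benchmark(workflow_results, ground_truth):
    """Score each workflow's identification list against a ground-truth
    set, returning (precision, recall) per workflow."""
    gt = set(ground_truth)
    scores = {}
    for name, ids in workflow_results.items():
        ids = set(ids)
        true_positives = len(ids & gt)
        precision = true_positives / len(ids) if ids else 0.0
        recall = true_positives / len(gt) if gt else 0.0
        scores[name] = (round(precision, 3), round(recall, 3))
    return scores

# Invented identifications from two hypothetical, logically equivalent workflows.
results = {
    "workflow_1": ["PEP1", "PEP2", "PEP3", "PEP9"],
    "workflow_2": ["PEP1", "PEP3", "PEP4"],
}
print(benchmark(results, ["PEP1", "PEP2", "PEP3", "PEP4"]))
# → {'workflow_1': (0.75, 0.75), 'workflow_2': (1.0, 0.75)}
```

Even in this toy setting, two workflows with identical recall differ in precision, illustrating why per-data-type ground-truth sets matter for choosing between otherwise equivalent pipelines.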

The work is described in greater detail in [5]. The project files and workflows are available from [6].
