Choosing tools for building bioinformatics workflows

Wisdom
Interoperability is the keyword
Author

Wolfgang Huber

Published

2025-06-16

I sometimes overhear conversations about tool choice in bioinformatics, where implementation language is stated as the primary criterion. Such thinking stands in the way of scientific quality and productivity. It is a form of self-harm.

What should one really care about? Here are some criteria: the scientific rigor of the tool, its software engineering quality, the number of bugs, the robustness and clarity of its error messages, feature richness, ease of installation, ease of use, documentation, and the viability of the tool’s maintenance over the time I plan to rely on it.

Also important: the format (‘schema’) of the tool’s data inputs and outputs, and how well it interoperates with tools up- and downstream in the workflow. For example, many numerics-heavy packages rely internally on C or Fortran, but the user only sees, and cares about, the matrices or n-dimensional arrays she works with in R, Python, Julia, etc. The keyword is interoperability.
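To make the point about input and output schemas concrete, here is a minimal sketch in Python. It assumes a hypothetical tab-separated count table with the columns `gene_id`, `sample` and `count`; the column names, the `read_counts` helper and the example data are all invented for illustration and do not come from any particular tool. The downstream step checks only the schema and never asks which language produced the file.

```python
import csv
import io

# Hypothetical schema agreed between an upstream and a downstream step.
EXPECTED_COLUMNS = ["gene_id", "sample", "count"]


def read_counts(handle):
    """Parse a tab-separated count table and check it against the schema."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(
            f"expected columns {EXPECTED_COLUMNS}, got {reader.fieldnames}"
        )
    # Convert the count column to integers; the other columns stay as text.
    return [
        {"gene_id": row["gene_id"], "sample": row["sample"], "count": int(row["count"])}
        for row in reader
    ]


# Stand-in for a file written by any upstream tool (R, Rust, awk, ...).
upstream_output = "gene_id\tsample\tcount\ngeneA\ts1\t12\ngeneB\ts1\t0\n"

records = read_counts(io.StringIO(upstream_output))
total = sum(r["count"] for r in records)
print(total)
```

If the upstream tool were swapped for one written in another language, this step would keep working as long as the table keeps the same columns; if the columns change, it fails early with a clear message.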

Similarly, just at a slightly higher level of abstraction, bioinformatics workflow builders should care about data structures. Whether the thing you call is written in R, Python, Rust, bash, awk, … has some undeniable consequences (e.g., how easy I will personally find it to install, debug or extend), but it is just one item on the list of important criteria.

A Bluesky post with essentially the above



Some other decision criteria I’d rather not recommend

  • Herd effect: use a method or an approach because “everyone else” (or instead: a particular group of people that really impresses you) does it. But is your problem really the same? Are your interests aligned?

  • The IBM effect: this is a term that apparently goes back to the 1970s (I learned it when I worked as a postdoc at IBM Research in 1998), a time when IBM completely dominated the market for enterprise-scale databases and information management systems. The expression goes something like: “you never get fired for choosing IBM”. It rides on the fact that people in big (and perhaps in small) organisations would rather minimize the risk of being personally blamed for the failure of a project than optimize the probability or magnitude of its success.