With data in biomedical science becoming unquestionably more important, will biomedical researchers also have to learn data science?
In recent years, biomedical sciences have been revolutionized by the availability of high-throughput, high-dimensionality data. However, most biomedical researchers have not been trained as data scientists and are becoming more and more dependent on bioinformaticians to analyze the results of their experiments. As a result, bioinformaticians must spend time executing simple tasks for researchers, time they could otherwise use to advance their own research projects.
Is coding the only solution to this dilemma?
The impact of Big Data in biomedical sciences
We have entered the era of big data, with substantial volumes of measurements (variables and/or observations) made possible by new technology. Biomedical sciences, in particular, have become increasingly reliant on the processing, analysis, and interpretation of large volumes of data, information, and knowledge.
Advances in DNA sequencing allow for the measuring of whole genomes or transcriptomes.
Fluidics allows for the quick sorting of thousands of cells, as in high-dimensional flow cytometry.
Combining the two fields has led to single-cell sequencing techniques that allow for the detection of expression levels of thousands of genes across thousands of cells.
Wearable devices that capture real-time physiological signals will provide further increases in data volume and complexity.
This wealth of data creates significant opportunities to discover and understand the critical interplay among diverse domains, from genomics, proteomics, and metabolomics to phenomics, including imaging, biometrics, and clinical data. Concomitant advances in processing power and storage have allowed for sophisticated computational analyses of big data. Robust, open-source, and community-driven software libraries have been developed to process and perform statistics on various biological data, including the above-mentioned genomic and flow cytometry fields.
Artificial intelligence (AI) also has a growing role to play in biomedical research and healthcare. We have seen how AI augments the ability of professionals, for example, in the context of diagnosis and treatment (e.g., image analysis). Visual analytics has grown out of the scientific visualization field because our ability to collect and store data is increasing faster than our ability to analyze it.
A researcher with interdisciplinary training will be uniquely positioned to provide the skillset and the technology necessary for processing, managing, retrieving, analyzing, and interpreting big data from both the fundamental and clinical sciences.
Impact on education: online resources
While education in biology is traditionally split into a wet-lab and a dry-lab scientist track, more interdisciplinary programs aim to close this gap. A skilled dry-lab scientist will know how to handle data end-to-end: from the moment it is output by the experiment to the extraction of knowledge. They will be versed in dealing with missing values and in normalizing across different platforms, two perennial problems in biological data. They will have strong knowledge of statistics and understand issues related to sample size, power, multiple hypothesis testing, clustering (unsupervised learning), and classification and generalized regression techniques (supervised learning).
Data integration is crucial, for example, in research on molecular biomarkers of diseases, and it is a powerful way to leverage the existing knowledge base in the biological sciences. Data can originate from the same population cohort (multi-omics) or from heterogeneous populations (meta-omics). Relating omics data to clinical data and phenotypes also requires data integration, along with the use of correlations and survival analysis.
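As a rough illustration of three of the routine tasks named above (missing values, normalization, and multiple hypothesis testing), here is a minimal Python sketch. It assumes simple choices that the article does not prescribe: median imputation, z-score scaling, and the Benjamini-Hochberg correction. The function names and the toy measurements are hypothetical.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def zscore(values):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical readings of the same samples on two platforms with
# different scales; None marks a missing value.
platform_a = [2.0, None, 4.0, 6.0]
platform_b = [200.0, 300.0, None, 600.0]

norm_a = zscore(impute_median(platform_a))
norm_b = zscore(impute_median(platform_b))

# Hypothetical raw p-values from four tests.
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

After scaling, both platforms are on a comparable unitless scale, and the adjusted p-values control the false discovery rate across the four tests. Real analyses would normally rely on established statistical libraries rather than hand-written helpers like these.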
Biological or quantitative science?
Many researchers are initially better versed in either biological or quantitative science and must fill in the blanks between the two. Data science tools are now widespread, and many are free and open-source. Today, access to data analytics learning is easy, with multiple online open resources, courses, and novel platforms available to researchers. However, the steep curve associated with learning to code means that time-pressed wet-bench scientists often do not extend their expertise into the dry-bench side of biology. To empower biologists, Tercen has developed an analytics platform that can be used without any prior coding knowledge.
Impact on careers: A seller’s market
Biomedical data science has become a central resource for many academic centers, universities, and commercial R&D departments (e.g., Biotech and Pharma).
However, candidates with combined biomedical knowledge, data science, and statistics skills are still a rarity in a field that offers many career opportunities. Interdisciplinary scientists get hired as quickly as they can be trained, which has made the market more competitive and driven up salaries for both industrial and faculty staff.
Biomedical data scientists enjoy the creativity that comes with their projects, looking for where the information and answers are in datasets utilizing the correct methodology. Skilled bioinformaticians look for laboratories at the bleeding edge of technical development where inspiration comes from the fact that new technologies can address some of science’s big, unanswered questions.
Shortage of high-throughput scientists
The flip side is that there is a shortage of computational scientists in laboratories that do excellent research in their fields of study but do not develop new genomic or high-throughput techniques. At the same time, in those labs, the wet-bench biologists are the knowledge-holders and know the questions to ask that will elicit knowledge from their data, even if they lack the computational tools to do so. The result is an iterative process where computational scientists have to devote time to small but time-consuming tasks of routine processing and visualization and where wet-bench scientists have to vie for the attention of their overworked computational colleagues.
A platform that empowers the wet-bench scientist to do their own analysis and data plotting and frees the computational scientist to focus on more complex tasks is a win-win situation. This is where Tercen comes in: a data science platform that allows wet-lab scientists to analyze their data without needing to code.
Impact on tools: An opportunity and two decades of stagnation
The tools available for dry-lab work, especially coding, are sophisticated and extensive (GitHub, Bitbucket, Kaggle, Bioconductor, Python, Java, SAS, etc.). Together, they cover the full coding life cycle of development, testing, versioning, sharing, and hosting of computer code. These tools have evolved quickly and intensively, driven primarily by a vibrant software development market. However, they all require extensive coding skills. Meanwhile, the current tools available to biologists, such as Microsoft® Excel and GraphPad Prism®, are limited and only partially cover the full research life cycle. It is disconcerting to see that biologists' analysis tools have barely progressed over the last two decades. The market is finally starting to respond to and rectify this imbalance, and Tercen is fast becoming a key part of this movement.
Tercen: Analyzing biological data without having to code
Tercen is an open data science platform for anyone who wants to get meaning from data. It aims to function with community-led open data projects. It offers non-coders access to high-end data science resources and techniques to generate insights with their multivariate datasets. The platform promotes a social, collaborative approach to data science and has a unique, powerful visual programming paradigm. At the moment, Tercen mostly hosts life science projects, but the platform is ideal for any data project and any user.
On Tercen, researchers can visually customize a workflow without the aid of a bioinformatician. A workflow is a data analysis pipeline composed of a sequence of computation and annotation steps. There are standard workflows for each of the molecular readouts (e.g., RNA-seq, flow cytometry, mass spectrometry). It is easy to add powerful computation, statistical, or visualization plug-ins copied from other projects or developed by a collaborator and shared via the Tercen platform.
Team members can collaborate on the same workflow and data simultaneously and generate reports for presentations or publications. The report contains the conclusions and an automatically generated formal description of the complete process (e.g., normalization, statistical testing, clustering, functional annotation), which is essential for reproducible science. All of this functionality with no coding necessary!
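For readers who do code, the idea of a workflow as an ordered sequence of named steps can be sketched in a few lines of Python. This is a conceptual illustration only, not Tercen's API or internal design: the `Step` class, the `run_workflow` function, and the two example steps are all hypothetical.

```python
import math
from dataclasses import dataclass
from statistics import median
from typing import Callable, List


@dataclass
class Step:
    """One named computation step (hypothetical structure)."""
    name: str
    run: Callable[[List[float]], List[float]]


def run_workflow(steps, data):
    """Apply each step in order and record the step names as they run."""
    log = []
    for step in steps:
        data = step.run(data)
        log.append(step.name)
    return data, log


# Two illustrative steps: a log2 transform followed by median centering.
workflow = [
    Step("log2 transform", lambda xs: [math.log2(x) for x in xs]),
    Step("median centering", lambda xs: [x - median(xs) for x in xs]),
]

result, description = run_workflow(workflow, [1.0, 2.0, 4.0])
print(result)       # [-1.0, 0.0, 1.0]
print(description)  # ['log2 transform', 'median centering']
```

Because every step carries a name, the ordered list of names gives a simple record of what was done to the data, which is the kind of information a reproducible analysis needs to keep.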