In this video, you will learn how to create a biomedical data analysis workflow.
I will show you how to import your biomedical data and build a chain of analysis steps.
Then we will apply a statistical operator, and visualize everything with both heatmap and pairwise plot graphs.
You can download the sample biomedical data file by right-clicking on this link and selecting "Save Link as..."
Biological Data Analysis on Rock Crabs
We are going to recreate a study on rock crabs done by Australian biologists in 1974.
The local crab population showed colour variations, but the crabs were always thought to be the same species.
To prove a hypothesis that they were not the same species, biological researchers began taking morphological measurements.
They compiled data on the characteristics of the crabs.
What I find most interesting about this study is that they used a simultaneous mathematical analysis of multiple characteristics to separate the species.
Our tutorial is going to show how easy it is to do this type of multivariate comparison in biological data analysis using Tercen.
We will also take a look at how to use a statistical technique called PCA to simplify data.
First, some housekeeping to get set up on Tercen.
Once you've signed up, install the PCA Operator to your library and then create a new project in your personal page.
Next you will need to download the example data file and add it to your project.
Let’s take a moment to review our biological data so we can understand the information we want to analyse.
There are one thousand records made up from two hundred crabs that had five carapace measurements each.
Observation is the number that identifies a crab from 1 to 200.
Their colour is orange or blue, and there are 100 crabs of each.
Sex is male or female, and there are 50 crabs of each sex within each colour.
The combination of species and sex makes a natural grouping in biological data analysis that we will look for later.
Variable is the label given to the part of the crab carapace that was measured.
The variables are:
FL – Frontal Lobe
RW – Rear Width
CL – Carapace Length
CW – Carapace Width
BD – Body Depth
Measurement is the recorded size in millimetres.
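If you prefer to think in code, the layout just described can be sketched in plain Python. This is only an illustration of the table's shape: the field names are our own, the measurement is left empty because no real values are reproduced here, and numbering the crabs in blocks of 50 per colour and sex group is an assumption.

```python
from collections import Counter
from itertools import product

VARIABLES = ["FL", "RW", "CL", "CW", "BD"]  # the five carapace variables

def build_layout():
    """Return one record per (crab, variable) pair, i.e. long format.

    Assumption: crabs are numbered in blocks of 50 per colour/sex group.
    The measurement stays None because this sketch holds no real data.
    """
    groups = list(product(["blue", "orange"], ["male", "female"]))
    records = []
    for observation in range(1, 201):
        colour, sex = groups[(observation - 1) // 50]
        for variable in VARIABLES:
            records.append({
                "observation": observation,
                "colour": colour,
                "sex": sex,
                "variable": variable,
                "measurement": None,
            })
    return records

records = build_layout()
crabs = {(r["observation"], r["colour"], r["sex"]) for r in records}
group_sizes = Counter((colour, sex) for _, colour, sex in crabs)

print(len(records))   # 200 crabs x 5 variables
print(group_sizes)    # crabs per colour/sex group
```

Each crab contributes five rows, one per variable, which is why 200 crabs give one thousand records.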
Now that our project has a dataset we can add a new workflow.
The first step in any workflow is bringing in the data.
Right click the white background – press add, and select table.
You can now see that our dataset is available.
Now we can attach steps that perform our biological data analysis.
Let’s attach a step to make a heat-map visualisation.
Right click on the table and add a data step.
This screen is called the cross-tab projection screen; visualisations and calculations are done here.
Click the plus icon beside our dataset.
Now you see the headers of the data file: observation, color, sex, and so on.
On the cross-tab projection screen these are called factors.
A factor can provide data for the visualisation, or affect how other data is visualised.
Pick up a factor with your mouse and drag it.
You see zones that go green to tell you where it can be dropped.
Different zones do different things when you drop a factor onto them.
This element of the screen is called the crosstab grid. It creates the visualisations.
On the inside it has zones for the X and Y axis, and outside them are the zones for row and column.
You can drop factors into any of the zones. While there can be only one x-axis and one y-axis, multiple rows and columns are allowed.
Tercen will display their values in the crosstab cell.
It applies a sort order working from the outside in, and within the crosstab cell Tercen sorts values from low to high.
One of the advantages of Tercen is that, unlike a spreadsheet, multiple values can be held in a cell.
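One way to picture a crosstab cell is as a list of values keyed by its row and column. The sketch below is a plain-Python analogy under that assumption, not Tercen's implementation; the `crosstab` helper and the sample records are hypothetical.

```python
from collections import defaultdict

def crosstab(records, row, column, value):
    """Group `value` by (row, column); each cell keeps every matching value."""
    cells = defaultdict(list)
    for record in records:
        cells[(record[row], record[column])].append(record[value])
    # Order the cells by their keys and sort each cell's values low to high.
    return {key: sorted(values) for key, values in sorted(cells.items())}

# Made-up sample records, for illustration only.
sample = [
    {"sex": "male", "variable": "FL", "measurement": 9.1},
    {"sex": "male", "variable": "FL", "measurement": 8.1},
    {"sex": "female", "variable": "FL", "measurement": 7.2},
    {"sex": "male", "variable": "RW", "measurement": 6.7},
]

grid = crosstab(sample, row="variable", column="sex", value="measurement")
for (row_key, column_key), values in grid.items():
    print(row_key, column_key, values)
```

The cell for FL and male holds two values at once, which is the behaviour a single spreadsheet cell cannot reproduce.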
Let’s now look at a heatmap.
Biological Data Analysis: Heatmaps
We will arrange the individual crabs along the top by moving the observation factor to the column zone.
We will put measurement on the y-axis, and we will drop variable on the row zone.
This will group the measurements by the carapace characteristic.
Now we have an overview of the data that follows standard biological data analysis conventions.
We can change how our visualisation looks by dragging the lines of the grid closer together.
Now we can see the five variables measured on our 200 crabs all projected on a graph.
This is the Configuration Panel.
It has settings zones that control how data is visualised.
If I drag the Measurement factor to colours, Tercen will apply a gradient to the values in the crosstab cell.
Higher values are assigned the hot colours and lower ones the cold colours.
Then by changing the type drop-down our point graph becomes a heat-map graph.
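The idea of a colour gradient can be sketched in a few lines of Python. This assumes a simple linear scale from blue (cold) to red (hot); the palette Tercen actually uses may differ, and the `gradient_colour` function and the sample values are hypothetical.

```python
def gradient_colour(value, low, high):
    """Map a value to an (R, G, B) tuple: blue for low, red for high."""
    if high <= low:
        raise ValueError("high must be greater than low")
    fraction = (value - low) / (high - low)
    fraction = min(1.0, max(0.0, fraction))  # clamp values outside the range
    red = round(255 * fraction)
    return (red, 0, 255 - red)

measurements = [8.1, 16.1, 19.0, 7.0, 12.5]  # made-up values in millimetres
low, high = min(measurements), max(measurements)
colours = [gradient_colour(m, low, high) for m in measurements]

for measurement, colour in zip(measurements, colours):
    print(measurement, colour)
```

The smallest measurement comes out pure blue and the largest pure red, with everything else in between.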
I’ll point out now that you should save your work regularly.
Tercen will let you know a change has been made by displaying this disk icon. Click on it to save.
It’s also good practice to label your workflow steps clearly, so that people can understand your research.
I’ll call this workflow "Heatmap".
Now let's do a visualisation to see if we can reveal our crab species.
We’ll try a simple comparison called a pairwise plot. This way we can compare the crabs' measurements and see if any pattern emerges between blue and orange.
First, attach a new data step to the table. Then, in the projection, put measurement on the y-axis and drop variable onto both the column zone and the row zone.
Put observation on labels and drop colour onto colour to separate our suspected species.
We have just made a multi-pairwise comparison graph.
Every variable is plotted against every other variable, with the blue crabs and the orange crabs shown in different colours.
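To make the structure of a multi-pairwise plot concrete, here is a minimal Python sketch. It assumes the data arrives in the long format described earlier; the helper names and the sample values are made up, and nothing here reflects how Tercen builds the plot internally.

```python
from itertools import product

def to_wide(records):
    """Collect each crab's measurements into one dict keyed by variable."""
    wide = {}
    for record in records:
        crab = wide.setdefault(record["observation"], {})
        crab[record["variable"]] = record["measurement"]
    return wide

def pairwise_panels(wide, variables):
    """Build one panel of (x, y) points for every row/column variable pair."""
    panels = {}
    for row_var, col_var in product(variables, repeat=2):
        panels[(row_var, col_var)] = [
            (crab[col_var], crab[row_var]) for crab in wide.values()
        ]
    return panels

# Made-up sample: two crabs, three variables.
sample = [
    {"observation": 1, "variable": "FL", "measurement": 8.1},
    {"observation": 1, "variable": "RW", "measurement": 6.7},
    {"observation": 1, "variable": "CL", "measurement": 16.1},
    {"observation": 2, "variable": "FL", "measurement": 8.8},
    {"observation": 2, "variable": "RW", "measurement": 7.7},
    {"observation": 2, "variable": "CL", "measurement": 18.1},
]

panels = pairwise_panels(to_wide(sample), ["FL", "RW", "CL"])
print(len(panels))            # 3 variables give a 3 x 3 grid of panels
print(panels[("FL", "RW")])   # RW on the x-axis, FL on the y-axis
```

With the five crab variables the same loop would produce a 5 by 5 grid, one panel per pair.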
But there is a problem.
There is no clear separation between the characteristics of the crabs that would show they were different species.
The dots are very closely bunched.
Even if we split the graph by sex, which is easy to do in Tercen by moving the factor to columns, we still don’t see any conclusive proof.
The males do show a slight pattern of divergence in Carapace Width, but this projection of the data does not reveal the species.
We will have to make some computations on our biological data in order to gain better information from it.
Biological Data Analysis: Principal Component Analysis
In Tercen, computations are made by operators.
We will apply an operator called PCA which we added to our library earlier.
PCA stands for Principal Component Analysis. It transforms the original variables into new variables, called components.
Projecting these new components allows for patterns to emerge that can’t be seen with the raw data.
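For those curious about what such a transformation involves, here is a minimal pure-Python sketch. It uses a generic textbook approach (centre the data, build the covariance matrix, then find components by power iteration with deflation). This is not the code of Tercen's PCA operator, which the video does not show, and the measurements are made up.

```python
import math
import random

def centre(rows):
    """Subtract each column's mean so every variable is centred on zero."""
    means = [sum(column) / len(rows) for column in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centred):
    """Sample covariance matrix of rows that are already centred."""
    n, p = len(centred), len(centred[0])
    return [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(p)]
            for i in range(p)]

def multiply(matrix, vector):
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def normalise(vector):
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]

def leading_eigenpair(matrix, rng, iterations=500):
    """Power iteration: approximate the largest eigenvalue and its vector."""
    vector = normalise([rng.gauss(0.0, 1.0) for _ in matrix])
    for _ in range(iterations):
        image = multiply(matrix, vector)
        if all(x == 0.0 for x in image):
            break  # nothing left to find in this matrix
        vector = normalise(image)
    eigenvalue = sum(x * y for x, y in zip(vector, multiply(matrix, vector)))
    return eigenvalue, vector

def pca(rows, n_components, seed=0):
    """Return (variances, components, scores) for the leading components."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    centred = centre(rows)
    matrix = covariance(centred)
    size = len(matrix)
    variances, components = [], []
    for _ in range(n_components):
        value, vector = leading_eigenpair(matrix, rng)
        variances.append(value)
        components.append(vector)
        # Deflation: remove the component just found before the next search.
        matrix = [[matrix[i][j] - value * vector[i] * vector[j]
                   for j in range(size)] for i in range(size)]
    scores = [[sum(x * w for x, w in zip(row, component))
               for component in components] for row in centred]
    return variances, components, scores

# Made-up measurements: six individuals, three variables.
measurements = [
    [1.0, 2.0, 0.0],
    [2.0, 4.1, 3.0],
    [3.0, 5.9, 1.0],
    [4.0, 8.2, 4.0],
    [5.0, 9.8, 0.5],
    [6.0, 12.1, 3.5],
]
variances, components, scores = pca(measurements, n_components=3)
for number, variance in enumerate(variances, start=1):
    print(f"PC{number}: variance {variance:.4f}")
```

Each row of `scores` gives one individual's position on the new components, which is the kind of value we plot in the following steps. Power iteration is chosen here only because it needs no extra libraries; it assumes the eigenvalues are well separated, as they are in this sample.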
Let’s duplicate our original visualisation and add the PCA operator.
Run it and check its results here in the Computed Tables link.
It has identified five Principal Components and calculated the value for our 200 crabs.
Now we’ll add a downstream data step to use the results of our PCA calculation.
In Tercen the results of calculations performed by an operator, and the original data, are both passed down to the next step for visualisation.
Our original factors are still available and the results of the PCA calculation have been added as new factors.
I’m going to pick two principal components and compare them pairwise.
We will plot PC2 and PC3 on our axes, and put both sex and colour on the colour zone.
Tercen will combine them into a group for the gradient.
Now we can see the possibility that these are two separate species.
If these crabs were the same species the dots would be closer together like on our multi pairwise plot.
Now that we are encouraged by these findings, we should compare all of the Principal Components against each other.
This is possible in Tercen but we will have to wrangle our data into shape first.
We need to group the Components together so they can be used as a block in the next projection.
Working from the PCA operator, we add another step; this time it is a Gather.
From this name, Tercen will make two new data groups: PCA.variable for the names and PCA.value for the calculations.
Then we select the values we want to use.
Let's pick all of the principal components.
After we save the gather we need to run it in the workflow.
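The effect of a gather can be pictured as a wide-to-long reshape. The sketch below shows that idea in plain Python, assuming the step is named "PCA"; the `gather` function is a hypothetical stand-in rather than Tercen's code, and the component values are made up.

```python
def gather(rows, columns, name):
    """Turn the selected wide columns into name.variable / name.value pairs."""
    long_rows = []
    for row in rows:
        # Factors that are not gathered are copied onto every new row.
        kept = {key: value for key, value in row.items() if key not in columns}
        for column in columns:
            long_rows.append({
                **kept,
                f"{name}.variable": column,
                f"{name}.value": row[column],
            })
    return long_rows

# Made-up component values for two crabs.
wide_rows = [
    {"observation": 1, "PC1": -2.5, "PC2": 0.4},
    {"observation": 2, "PC1": 1.7, "PC2": -0.3},
]

long_rows = gather(wide_rows, columns=["PC1", "PC2"], name="PCA")
for row in long_rows:
    print(row)
```

After the reshape, every component value sits in one shared column, so a single factor can be dropped on both axes in the next projection.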
Now we can attach a new data step to visualise it as a multi-pairwise.
To make this projection, place the PCA.value on both x and y-axis.
Then plot PCA.variable into both column and row.
Species and sex are colour coded as before.
Drag Observation to Labels to separate the data points into individual crabs.
When I adjust the grid a little we get to see each Principal Component plotted against all of the others.
Now it’s visually obvious where the species separation occurs.
Well done, you have created a workflow that visualised the crab data and showed that a multivariate analysis of their carapace characteristics can separate them into two distinct species.
Lastly, I will add an export step so my calculations can be downloaded for review or used in other projects.
See, now both my original and computed data are available for download.
If you want to learn more about how PCA works or how to wrangle data into different formats, check out our other videos.
Table of Contents:
00:00 - Introduction
00:19 - Study Introduction
01:04 - Set-up Project, Operator, and Data
01:34 - Review Data
02:28 - Review Data
02:37 - Begin Workflow - Add Data
02:47 - Visualise Data: Heatmap
05:51 - Visualise Data: Multivariate Pairwise
07:24 - Apply PCA Operator
08:20 - Visualise PCA: Pairwise
09:38 - Add Gather Step
10:08 - Visualise PCA: Multivariate Pairwise
11:11 - Add Export step