Humans find it easy to visualise things in three dimensions because that is how our vision works. Problems arise when data has many more factors (dimensions), such as the data modern lab machines can generate. Plotting such high-dimensional data sets directly is not practical, so mathematicians have devised techniques that statistically summarise the most important variation in a data set. Principal Component Analysis (PCA) is one such technique.
Some things to check before starting.
An operator can only be applied to a data step. Data steps are created on a workflow, so you must have a project created and data uploaded to follow these instructions. It is also important that the data is in "Long Format" for PCA to work correctly.
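As a rough illustration of what "Long Format" means, the sketch below reshapes a small wide table (one column per sample) into one row per measurement. The column names gene_id, sample.variable and sample.value are borrowed from the sample data used later in this article; the gene names, sample names and values are invented, and the to_long helper is hypothetical plain Python, not something Tercen runs.

```python
# Hypothetical wide table: one row per gene, one column per sample.
wide_rows = [
    {"gene_id": "g1", "s1": 2.0, "s2": 3.5},
    {"gene_id": "g2", "s1": 0.5, "s2": 1.0},
]


def to_long(rows, id_column="gene_id"):
    """Return one record per (gene, sample) measurement."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every record instead
            long_rows.append({
                id_column: row[id_column],
                "sample.variable": column,  # which sample the value came from
                "sample.value": value,      # the measurement itself
            })
    return long_rows


long_rows = to_long(wide_rows)
for record in long_rows:
    print(record)
```

Each measurement ends up on its own row, with one column naming the gene, one naming the sample and one holding the value.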
If you do not have a project ready, you can clone this sample project. We have uploaded the relevant data and built the workflow up to the point where our video takes it forward. You can use it to follow this article too.
No time to read?
Watch a video of how it's done.
Tip: Open the video in another browser tab beside Tercen. That way, you can pause it and follow along with the instructions.
Steps to applying a PCA Operator
Though we are talking about PCA, the general flow of the instructions below applies to using any operator.
Install PCA to your library
Only the operators you have installed to your Library can be used in a workflow. Your Personal page has a Library, and each Team has a Library, so make sure to check it is installed in the right one for your project.
We made a short video where you can learn the steps to do this.
Add a data Step
Right-click on a previous step to add a new one.
Lay out the crosstab grid
For an Operator to perform its calculation, data must be projected onto the crosstab grid in the structure that Operator expects.
If you are unsure how to lay out the crosstab grid for an operator, you can check the programmer's instructions on GitHub.
Operators that have been added to your library have a link to the GitHub where the instructions can be read.
Go to your personal or team library and click on the Operator version number (highlighted in yellow) to be taken to the instructions page.
Here are the layout instructions for the PCA Operator.
Input projection
Variables (e.g. genes, channels, markers)
Observations (e.g. cells, samples, individuals)
You will have to determine which of your data factors correspond to these inputs. In our sample data, the factors line up with the grid as follows.
gene_id is the variables
sample.variable is the observations
sample.value is the measurement value
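To make the mapping concrete, here is a minimal sketch of how long-format records could be arranged into a grid of variables against observations. It assumes variables go on rows and observations on columns, which is an assumption made for illustration and should be checked against the operator's instructions. The crosstab function and the data values are hypothetical; only the three factor names come from the sample data.

```python
# Hypothetical long-format records using the factor names from the sample data.
records = [
    {"gene_id": "g1", "sample.variable": "s1", "sample.value": 2.0},
    {"gene_id": "g1", "sample.variable": "s2", "sample.value": 3.5},
    {"gene_id": "g2", "sample.variable": "s1", "sample.value": 0.5},
    {"gene_id": "g2", "sample.variable": "s2", "sample.value": 1.0},
]


def crosstab(records, variable_key, observation_key, value_key):
    """Arrange long records as a grid: one row per variable, one column per observation."""
    variables = sorted({r[variable_key] for r in records})
    observations = sorted({r[observation_key] for r in records})
    # Look up each cell by its (variable, observation) pair; missing cells become None.
    cells = {(r[variable_key], r[observation_key]): r[value_key] for r in records}
    grid = [[cells.get((v, o)) for o in observations] for v in variables]
    return variables, observations, grid


variables, observations, grid = crosstab(
    records, "gene_id", "sample.variable", "sample.value"
)
print("columns:", observations)
for name, row in zip(variables, grid):
    print(name, row)
```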
Apply the operator
Add the PCA Operator from your library by pressing the Operator Plus Button.
Once the operator has been added to the data step, press the Run Button to start its calculations.
N.B. Don't forget to save your workflow when it is finished.
You can return to your workflow screen and rename the data step to make it more recognisable.
Add a New Data Step
To visualise the results of the PCA calculation, we must add a new data step.
Lay out the Principal Components
In the new data step, first clear the automatic projection that Tercen made from your original layout. Click the X in the top left corner of the grid where the factor names are (e.g. gene_id). Do this until you have a blank grid.
The PCA operator has created new factors called Principal Components. These are the results of the calculations it made to reduce the data dimensions.
Principal Components are numbered in order of how much of the variation in the data they explain, from most to least, so PC1 and PC2 together capture a larger share of the variation than any other pair of components.
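The ordering of the components can be seen in a small generic PCA sketch. This is a textbook implementation using NumPy's singular value decomposition, not the code of the Tercen PCA operator; the pca function name is hypothetical and the data is synthetic. It assumes one row per observation and one column per variable.

```python
import numpy as np


def pca(data, n_components=2):
    """Generic PCA via SVD; `data` has one row per observation, one column per variable."""
    centered = data - data.mean(axis=0)  # centre each variable on zero
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Singular values come back sorted from largest to smallest,
    # so the variance shares are already in component order.
    explained = s**2 / np.sum(s**2)
    # Coordinates of each observation on the first components (PC1, PC2, ...).
    scores = u[:, :n_components] * s[:n_components]
    return scores, explained


rng = np.random.default_rng(0)
# Synthetic data: 20 observations, 5 variables with very different spreads.
data = rng.normal(size=(20, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

scores, explained = pca(data)
print(scores.shape)            # one (PC1, PC2) pair per observation
print(np.round(explained, 3))  # variance share per component, largest first
```

The two columns of scores play the role of PC1 and PC2: they are the values you would place on the two axes of a scatter plot.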
Lay out PC1 and PC2 in a simple grid.
You can use other factors such as annotations (e.g. Annotations.tumour_name) to colour code the graph and show the clustering more clearly.