In this video we will show you a structured method for deciding the optimal number of clusters to focus your high dimensional flow cytometry analysis on.
The technique uses a combination of flowSOM and MEM scores to avoid over-clustering or under-clustering your sample data.
Table of Contents:
00:00 - Introduction to High Dimensional Flow Cytometry Analysis Series
00:19 - Overclustering
01:09 - MEM Calculation
02:02 - flowSOM Over Cluster
02:44 - MEM Profiles
03:11 - Hierarchical Heatmap
Today we are going to recommend a method for deciding the right number of cell populations to focus on when analysing high dimensional flow cytometry data.
In our previous video we picked seven clusters to restrict our flowSOM operator.
In this video we will show how we came to that number.
Determining the number of clusters is an art, but it doesn’t have to be guesswork.
There are structured methods that can help check if you have the right number of clusters in your high dimensional flow cytometry analysis.
High Dimensional Flow Cytometry Analysis: flowSOM
The method we recommend uses flowSOM combined with a statistical calculation called MEM.
It scores how relevant each marker is to each cluster.
By starting with a really high number of clusters and visualizing their MEM scores in a heatmap, you can use your domain knowledge to observe patterns and narrow the number down.
You can then do as many iterations of this high dimensional flow cytometry analysis as you need until you are satisfied you have found the true number of clusters.
This way you will avoid the problem of under-clustering, which can cause you to miss a rare population.
High Dimensional Flow Cytometry Analysis: MEM
Kirsten Diggins from Vanderbilt University developed the MEM calculation, which we will use for this example.
If you have the time, please check out her work online, as it will help support the bioinformatics community.
Kirsten's MEM calculation is an equation that scores how relevant a marker is to a population.
The first part of the equation is based on the median value of that marker in that cluster, compared to its median value in the other clusters.
Through this we can see its relative signal strength in the cluster.
The second part of the equation is an interquartile range calculation, again comparing the selected cluster with all the other clusters.
This is an assessment of how homogeneous that marker is within that cluster.
Markers with high MEM scores are enriched in the group, and markers with low (negative) scores are depleted in the group.
The scale of relevance is graded from -10 to +10.
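To make the two parts of the equation concrete, here is a minimal Python sketch of the idea. It is a simplified illustration, not the reference MEM implementation: the function names, the floor applied to the interquartile range, the scaling step and the marker values are all assumptions made for this example.

```python
from statistics import median, quantiles

def iqr(values, floor=0.5):
    """Interquartile range, with an assumed floor so a very tight
    population cannot blow up the ratio in mem_score."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return max(q3 - q1, floor)

def mem_score(population, reference):
    """Score one marker for one cluster against all other clusters."""
    # Part 1: difference in median signal (relative signal strength).
    mag_diff = median(population) - median(reference)
    # Part 2: spread elsewhere relative to spread in this cluster (homogeneity).
    score = abs(mag_diff) + iqr(reference) / iqr(population) - 1
    # The sign says whether the marker is enriched (+) or depleted (-).
    return -score if mag_diff < 0 else score

def scale_to_ten(scores):
    """Rescale so the largest absolute score becomes 10."""
    peak = max(abs(s) for s in scores) or 1.0
    return [10 * s / peak for s in scores]

# Made-up, already transformed intensities for two markers.
cd3_in_cluster = [4.0, 4.2, 4.1, 3.9, 4.3, 4.0]
cd3_elsewhere = [0.1, 0.5, 2.0, 0.2, 3.5, 0.3, 1.0, 0.4]
cd19_in_cluster = [0.1, 0.2, 0.0, 0.3, 0.1, 0.2]
cd19_elsewhere = [0.2, 3.0, 3.5, 0.1, 2.8, 0.3, 3.2, 0.2]

raw = [
    mem_score(cd3_in_cluster, cd3_elsewhere),    # positive: enriched
    mem_score(cd19_in_cluster, cd19_elsewhere),  # negative: depleted
]
scaled = scale_to_ten(raw)
print(scaled)
```

In this toy example the first marker scores positive and the second negative, which is the enriched versus depleted distinction described above.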
Now, let me show you a workflow in Tercen which makes this process easy to do.
You see, like in the last video, we have taken our PBMC data set and performed the arcsinh transformation to make it ready for a flowSOM calculation.
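Assuming the transformation referred to here is the arcsinh (inverse hyperbolic sine) scaling commonly applied to cytometry data, a minimal sketch looks like this. It is not the Tercen operator itself: the function name and the cofactor value are assumptions, and the right cofactor depends on your instrument.

```python
import math

def asinh_transform(values, cofactor=5.0):
    """Compress raw intensities with asinh(value / cofactor).

    Roughly linear near zero and logarithmic for large values,
    and, unlike a plain log, defined for zero and negative values.
    """
    return [math.asinh(v / cofactor) for v in values]

raw_intensities = [-10.0, 0.0, 50.0, 5000.0]  # made-up values
transformed = asinh_transform(raw_intensities, cofactor=5.0)
print(transformed)
```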
We started by doing flowSOM with 90 clusters.
We picked this number because we know it is a lot higher than the number of populations we should expect to see in this sample.
We call this over-clustering and, at this stage, it’s more important to cast the net wide, rather than try to accurately predict anything.
High Dimensional Flow Cytometry Analysis: Visualisation
In the next step we visualise the 90 clusters created by flowSOM for our high dimensional flow cytometry analysis.
By looking closer we can see that the cells do show differences in either variability or homogeneity inside a marker.
We’ll apply the MEM Operator to score each one and then see which of those 90 clusters have similar profiles across all of the markers.
This heatmap visualizes the results of the MEM scores.
You see the 90 clusters are along the X-Axis and the signal markers on the Y.
Red indicates enrichment and blue indicates depletion of that marker in the cluster.
Even at this stage we can spot clusters with marker enrichment profiles that are similar to each other.
For example, see clusters 19, 39 and 73.
By using this Shiny heatmap operator, we can create a hierarchical clustering of these enrichment profiles.
The advantage of this heatmap is that it orders the profiles across the X-Axis by how similar they are to each other and draws dendrogram lines to indicate profile groupings.
See how 19, 39 and 73 are all part of the same block, along with some other clusters.
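The grouping behind such a dendrogram can be sketched with a small agglomerative clustering routine. This is an illustration only, not the algorithm used by the heatmap operator: Euclidean distance, average linkage, the function names and the profile values (three made-up MEM scores per cluster) are all assumptions.

```python
import math

def average_linkage(group_a, group_b, profiles):
    """Mean Euclidean distance between every pair of profiles across two groups."""
    dists = [math.dist(profiles[i], profiles[j]) for i in group_a for j in group_b]
    return sum(dists) / len(dists)

def group_profiles(profiles, n_groups):
    """Repeatedly merge the two closest groups until n_groups remain."""
    groups = [[cluster_id] for cluster_id in profiles]
    while len(groups) > n_groups:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = average_linkage(groups[i], groups[j], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups

# Made-up MEM profiles: one score per marker for each cluster id.
profiles = {
    19: [8.0, 7.5, -6.0],
    39: [7.5, 8.0, -5.5],
    73: [8.5, 7.0, -6.5],
    4: [-7.0, -6.0, 9.0],
    12: [-6.5, -5.5, 8.5],
    60: [7.0, -8.0, -5.0],
}
groups = group_profiles(profiles, n_groups=3)
print(groups)
```

With these invented numbers, the three clusters with near-identical profiles end up in one block. That is the kind of pattern to look for when deciding how many clusters to request in the next iteration.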
At this point your domain knowledge becomes very relevant, and you may be able to spot the major groupings based on your most important markers.
For example, we could make an argument based on CD3 and CD4 that the seven major groups are visible here.
Or, we could look at the dendrogram and form an opinion, based on all markers, that there are 18 possible groups and we should further examine them in an iteration of our process.
Adding another branch to a Tercen workflow is really easy, and this time we set flowSOM to 18 clusters.
After applying MEM scores and looking at the hierarchical heatmap we can more clearly see the important clusters.
Finally, to confirm our conclusion that there are 7 clusters, we can start a new branch on the workflow to look at flowSOM with 7 clusters.
Even on the standard heatmap the noise has cleared, and when we switch to the hierarchical heatmap we can see how dissimilar the clusters are to each other.
The dendrogram suggests that clusters 3 and 5 are similar, but I think we will disagree.
We are confident that the 7 clusters are sufficiently different from each other to be the right number to concentrate our analysis on.
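If you want a number to go with that visual judgement, one option is to find the two clusters whose MEM profiles are closest and decide whether that gap is large enough. The sketch below assumes Euclidean distance between profiles. The helper name, the cluster labels and the values are hypothetical and are not taken from the data set in this video.

```python
import math
from itertools import combinations

def closest_pair(profiles):
    """Return (distance, id_a, id_b) for the two most similar profiles."""
    return min(
        (math.dist(profiles[a], profiles[b]), a, b)
        for a, b in combinations(sorted(profiles), 2)
    )

# Made-up MEM profiles for four final clusters.
final_profiles = {
    "A": [9.0, 8.0, -5.0],
    "B": [-6.0, -7.0, 10.0],
    "C": [8.0, -9.0, -4.0],
    "D": [6.0, -7.0, -6.0],
}
distance, first, second = closest_pair(final_profiles)
print(first, second, round(distance, 2))
```

The closest pair is the one to inspect with your domain knowledge. If the markers that separate those two clusters matter biologically, keep them apart even when the dendrogram places them side by side.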
In the next video we will wrap up our Cytometry tutorial by showing you how to add extra sample annotations to your data.
This can enrich your analysis and open up new possibilities for exploring sample groupings.