Methods
Alignment
First, we exclude any RNA viral samples with low coverage (those in which more than 3% of the sequence consists of the ambiguous base 'N'). For each remaining sample, we take the individual genes (using the reference genome: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512) and align each one to the corresponding gene in the reference using pairwise alignment with the Smith-Waterman algorithm. The base-level differences between the aligned sequences (single nucleotide polymorphisms) are counted and used to calculate a similarity score (the number of base changes divided by the length of the gene, so a gene identical to the reference scores 0). This score is computed both for the nucleotides and for the translated amino acids of the aligned gene and the reference gene.
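The filtering and scoring steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, the sequences passed to `change_fraction` are assumed to be already aligned (equal length, with `-` marking gaps), and counting a gap position as a change is an assumption.

```python
def ambiguous_fraction(sequence: str) -> float:
    """Fraction of positions in the sequence that are the ambiguous base 'N'."""
    if not sequence:
        return 1.0  # treat an empty sequence as having no usable coverage
    return sequence.upper().count("N") / len(sequence)


def has_adequate_coverage(sequence: str, max_ambiguous: float = 0.03) -> bool:
    """Keep a sample only if no more than 3% of its bases are 'N'."""
    return ambiguous_fraction(sequence) <= max_ambiguous


def change_fraction(aligned_sample: str, aligned_reference: str) -> float:
    """Number of differing aligned positions divided by the reference gene length.

    Assumes both inputs come from a pairwise alignment, so they have equal
    length and use '-' for gaps. Gap positions count as changes (assumption).
    """
    if len(aligned_sample) != len(aligned_reference):
        raise ValueError("aligned sequences must have equal length")
    gene_length = len(aligned_reference.replace("-", ""))
    if gene_length == 0:
        raise ValueError("reference gene is empty")
    changes = sum(a != b for a, b in zip(aligned_sample, aligned_reference))
    return changes / gene_length
```

For example, `change_fraction("ACGTACGT", "ACGAACGT")` gives 0.125, since one of eight positions differs. The same function can be applied to translated amino acid sequences once they are aligned.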
Clustering
This gives us one similarity feature for each of the aligned genes (ORF1ab, S, ORF3a, E, M, ORF6, ORF7a, ORF8, N, ORF10), measured against the corresponding gene in the reference sequence. We use these features to form a linkage matrix for hierarchical clustering. The distance threshold at which further merging (agglomeration) is stopped is set by examining the derivative of the cluster distances after each round of merging, and the first significant peak is used as this threshold. (A more detailed discussion of the process of selecting cut-offs can be found here: https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/.) We also require that each cluster contain a minimum of 10 individuals.
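One way to sketch this cut-off selection with SciPy is shown below. Several details are assumptions rather than the method described above: Ward linkage is used for illustration, a "significant peak" is taken to be a jump in merge distance more than three standard deviations above the mean jump, and the minimum-size rule is only reported here, not enforced. The function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def choose_distance_threshold(Z: np.ndarray, n_std: float = 3.0) -> float:
    """Pick a cut-off just above the merge that precedes the first large jump."""
    merge_distances = Z[:, 2]          # distance at which each merge happened
    jumps = np.diff(merge_distances)   # discrete derivative between merges
    if jumps.size == 0:
        return float(merge_distances[-1])
    cutoff = jumps.mean() + n_std * jumps.std()  # assumed significance rule
    significant = np.flatnonzero(jumps > cutoff)
    if significant.size == 0:
        # No clear jump: cutting at the last merge leaves a single cluster.
        return float(merge_distances[-1])
    i = int(significant[0])
    # Cut halfway between the last "cheap" merge and the first "expensive" one.
    return float((merge_distances[i] + merge_distances[i + 1]) / 2)


def cluster_samples(features: np.ndarray, min_cluster_size: int = 10):
    """Return cluster labels, the threshold used, and ids of undersized clusters."""
    Z = linkage(features, method="ward")  # linkage method is an assumption
    threshold = choose_distance_threshold(Z)
    labels = fcluster(Z, t=threshold, criterion="distance")
    cluster_ids, sizes = np.unique(labels, return_counts=True)
    too_small = cluster_ids[sizes < min_cluster_size]
    return labels, threshold, too_small
```

Here `features` would be an array with one row per sample and one column per gene. Clusters listed in `too_small` would need to be merged or dropped to satisfy the 10-individual requirement.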
For further comparison, we also run k-means clustering on the same features, selecting k by examining the derivative of the silhouette score (a measure that compares intracluster distances with intercluster distances) for the first significant drop. Both clustering procedures are repeated for the amino acid similarities as well.
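A rough sketch of this selection using scikit-learn follows. It assumes that the "first significant drop" means the first k after which the silhouette score falls by more than a fixed amount. The 0.05 tolerance, the range of k values, and the function names are illustrative choices, not values from the analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(features: np.ndarray, k_values, seed: int = 0) -> dict:
    """Fit k-means for each candidate k and record the mean silhouette score."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(features)
        scores[k] = float(silhouette_score(features, labels))
    return scores


def k_before_first_drop(scores: dict, min_drop: float = 0.05) -> int:
    """Return the k just before the first drop in score larger than min_drop."""
    ks = sorted(scores)
    for k, next_k in zip(ks, ks[1:]):
        if scores[k] - scores[next_k] > min_drop:  # assumed significance rule
            return k
    return ks[-1]  # no significant drop found in the tested range
```

A typical call would be `k_before_first_drop(silhouette_by_k(features, range(2, 11)))`, where the candidate values of k must stay below the number of samples.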
Grouping
Once the relevant clusters have been computed by each technique, we group the data and calculate the proportion of each cluster by country. We also calculate the proportions of each cluster by gender and by age group, each independently.
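The grouping step can be illustrated with a short sketch. It assumes that each sample is a record carrying its cluster label and metadata, and that a "proportion" means the share of each cluster within a group (for example, within one country). The field names and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_proportions(records, group_key: str) -> dict:
    """Share of each cluster within every value of group_key.

    Each record is a dict with a 'cluster' entry plus metadata fields
    such as 'country', 'gender' or 'age_group' (hypothetical names).
    """
    counts = defaultdict(Counter)
    for record in records:
        counts[record[group_key]][record["cluster"]] += 1
    proportions = {}
    for group, cluster_counts in counts.items():
        total = sum(cluster_counts.values())
        proportions[group] = {c: n / total for c, n in cluster_counts.items()}
    return proportions
```

Calling the function once with `"country"`, once with `"gender"` and once with `"age_group"` yields the three independent breakdowns.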