Main

The COVID-19 pandemic has put the use of pathogen genomic sequencing to support public health decision-making on center stage. Rapid sharing of the first viral genome sequences of SARS-CoV-2 (ref. 1) showed that this virus is a member of the species Severe acute respiratory syndrome-related coronavirus in the family Coronaviridae, subfamily Orthocoronavirinae, genus Betacoronavirus, subgenus Sarbecovirus2, and is closely related to SARS-CoV and a diverse group of SARS-like coronaviruses identified in bats. The recent World Health Organization (WHO) mission to search for the origin of SARS-CoV-2 described the jump of the virus from bats, either directly or through an intermediate animal host, to humans as the most likely route by which the virus caused the pandemic3. Interestingly, SARS-CoV-2 also clusters with sequences obtained from pangolin4,5, although the time to the most recent common ancestor of SARS-CoV-2 and the related pangolin viruses dates to around 150 years ago6. Timely sharing of the first viral genome sequences also enabled the establishment of diagnostic tools7, including the development of specific SARS-CoV-2 whole-genome sequencing protocols8 and the rapid development of vaccines9.

With current massive genomic sequencing efforts, virological epidemiological surveillance is being performed in near real time. Mutations in the viral genome are also detected and shared in near real time, while interpretation of their relevance is left for future work. Mutations and other genome changes are part of the normal replication and evolution process, and most mutations will not result in increased viral fitness. This Review summarizes the genomic surveillance efforts as well as the current nomenclature and detection of variants of concern (VOCs) and variants of interest (VOIs). In addition, the emergence of these variants is discussed, and future needs for faster genotype-to-phenotype prediction are described.

SARS-CoV-2 genomic surveillance

Virus genome sequencing has been increasingly used in recent years for outbreak research in the emerging disease field, as seen during the recent Ebola virus outbreak in Africa10 and the arbovirus outbreaks in South America11,12,13,14; however, the scale of genomic surveillance undertaken during the current pandemic is unprecedented. During the first year of the pandemic, a large number of SARS-CoV-2 whole-genome sequences were generated from all around the world and shared, mostly through GISAID. As of 5 July 2021, 25,284 whole-genome sequences from Africa (0.32% of all reported SARS-CoV-2-positive cases from that continent), 146,562 from Asia (0.30% coverage), 1,292,415 from Europe (2.35% coverage), 692,704 from North America (1.75% coverage), 37,913 from South America (0.12% coverage) and 20,613 from Oceania (25% coverage) had been generated (ref. 15 and WHO Coronavirus (COVID-19) Dashboard (https://covid19.who.int/table)). Although the number of genomes is unprecedented, the coverage is still heavily biased toward regions and countries with specialized genomic facilities, programs and research projects16,17. During the current SARS-CoV-2 outbreak, virus genome sequencing combined with metadata has been used to further study the origins of the pandemic18, to determine main routes of introduction and spread for outbreak investigations in hospitals19, nursing homes20, schools21 and mink farms22, to analyze regional, national and international epidemiological trends and to study potential immune escape23,24,25.
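The coverage figures quoted above are simply shared genomes divided by reported cases. As a rough illustration, the sketch below back-calculates the case counts implied by each quoted pair of numbers. The function names `coverage_pct` and `implied_cases` are ours, and because the percentages in the text are rounded, the implied case counts are approximate.

```python
# Sequence counts and coverage percentages exactly as quoted in the text (5 July 2021).
REPORTED = {
    "Africa": (25_284, 0.32),
    "Asia": (146_562, 0.30),
    "Europe": (1_292_415, 2.35),
    "North America": (692_704, 1.75),
    "South America": (37_913, 0.12),
    "Oceania": (20_613, 25.0),
}


def coverage_pct(sequences, cases):
    """Percentage of reported cases for which a genome was shared."""
    return 100.0 * sequences / cases


def implied_cases(sequences, pct):
    """Case count implied by a sequence count and a coverage percentage."""
    return sequences / (pct / 100.0)


for continent, (n_seq, pct) in REPORTED.items():
    cases = implied_cases(n_seq, pct)
    print(f"{continent:<14} {n_seq:>9,} genomes  ~{cases:,.0f} cases implied")
```

Read this way, the numbers make the sampling bias concrete: regions with few cases and strong sequencing capacity reach high coverage, whereas regions with tens of millions of cases remain well below 1%.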

With the expanding scale of genomic sequencing, new analytical challenges arise. The massive spread of the virus, with over 183 million human individuals infected as of 5 July 2021 (ref. 26), led to the accumulation of mutations within the viral genome. This is all part of the game: the virus replication process is not 100% error proof, leading to the generation of progeny genomes with small numbers of mutations or, occasionally, insertions or deletions27. Currently, close to 50,000 nonsynonymous mutations have been observed28. With ongoing transmission, such mutations can be replicated in subsequent rounds of infection, evolving into a unique fingerprint. When sufficient diversity is observed, the fingerprints can be used for epidemiological analyses at different levels of resolution, for instance, to link cases to a cluster, to track the origins of outbreaks, to understand the seeding of pandemic waves and to monitor the effects of control measures19,29,30,31,32,33,34.
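To illustrate how such fingerprints can link cases, the following minimal sketch counts nucleotide differences between aligned sequences and reports pairs that fall within a chosen threshold. The sequences are toy fragments invented for illustration, the threshold `max_dist` is an arbitrary assumption, and real analyses rely on full-genome alignments, phylogenetic methods and epidemiological metadata rather than pairwise counts alone.

```python
from itertools import combinations


def snp_distance(a, b):
    """Count positions at which two aligned sequences differ.

    Positions where either sequence has an ambiguous base (N) or a gap (-)
    are skipped, because they carry no evidence of a true difference.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"N", "-"}
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x not in skip and y not in skip
    )


def link_cases(seqs, max_dist=1):
    """Return (name1, name2, distance) for all pairs within max_dist differences."""
    linked = []
    for (n1, s1), (n2, s2) in combinations(seqs.items(), 2):
        d = snp_distance(s1, s2)
        if d <= max_dist:
            linked.append((n1, n2, d))
    return linked


# Toy aligned fragments (hypothetical, not real genomes).
toy = {
    "case_A": "ACGTACGTAC",
    "case_B": "ACGTACGTAT",  # one difference from case_A
    "case_C": "TCTTACGNAT",  # more distant; the N is ignored
}

print(link_cases(toy))
```

With the toy data, only case_A and case_B are linked, which mirrors the point made above: resolution depends on sufficient diversity having accumulated between transmission chains.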

It is important to consider the biases implicit in generation of genomic data. The current genomic effort is biased toward a limited number of countries with high sequencing capacity. An overview of the percentage of SARS-CoV-2 sequences generated and shared on GISAID15 as compared to the total number of SARS-CoV-2 infections diagnosed as of 5 July 2021 (WHO Coronavirus (COVID-19) Dashboard, https://covid19.who.int/table) is shown in Fig. 1.

Fig. 1: Overview of the percentage of whole-genome sequences generated and shared on GISAID compared to the total number of COVID-19 cases per continent as of 5 July 2021.

The number in each circle indicates the number of diagnosed SARS-CoV-2 infections per 1 million people.

Nomenclature and classification tools

During the pandemic, a plethora of bioinformatics tools have been developed, and the open sharing of genomic data has triggered a massive stream of publications analyzing local, regional or global datasets for a broad range of questions35. For such applications, a key issue has been the need for standardized, downsized reference datasets and for standardization of lineage nomenclature, which has been challenging as this needed to be developed during the evolving pandemic. The most frequently used lineage assignment and data visualization tools such as Pangolin36, Nextstrain37 and GISAID15 have greatly aided this process, but continued reassessment is needed as new challenges arise38. Recently, the WHO published a nomenclature system using Greek alphabet letters to label VOIs and VOCs to make the names easier to remember and more practical39.

The most well-known systems are the Nextstrain SARS-CoV-2 clade naming strategy and Pango. In Pango, the earliest sequences from Wuhan were designated as lineage A (represented by Wuhan/WH04/2020; sampled 5 January 2020; GISAID accession EPI_ISL_406801) and lineage B (represented by Wuhan-Hu-1; sampled 26 December 2019; GenBank accession MN908947). Subsequent lineages were assigned a number, for instance, B.1, B.2 and so on, or letters, depending on the system used34. To make tracking of strains accessible for providers of genetic data, GISAID collaborated with bioinformaticians, using interactive visualization software that provides rough overviews of the distribution of virus lineages across the world based on typical amino acid substitutions35. The Pango nomenclature tool uses a numerical system to classify lineages in more detail36 and seems to have gained the most traction in public health communications, in combination with the WHO classification that is limited to specific variants (for instance, Alpha variant, Pango lineage B.1.1.7).
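Because Pango names are hierarchical, ancestry can be read directly from the dotted name. The sketch below shows one way to do this; the function names are ours, and the two alias entries (P for B.1.1.28 and C for B.1.1.1) are taken from the public Pango alias list rather than from this Review and are included only to show why a lineage such as P.1 or C.37 still descends from B.1.1. The sketch does not encode relationships that are not visible in the name itself.

```python
# Illustrative subset of Pango aliases (an assumption of this sketch; the
# complete list is maintained by the Pango designation project).
ALIASES = {"P": "B.1.1.28", "C": "B.1.1.1"}


def expand(lineage):
    """Replace an alias prefix with the full dotted name it stands for."""
    prefix, _, rest = lineage.partition(".")
    full_prefix = ALIASES.get(prefix, prefix)
    return f"{full_prefix}.{rest}" if rest else full_prefix


def parent(lineage):
    """Return the immediate parent lineage, or None for a root such as A or B."""
    head, sep, _ = expand(lineage).rpartition(".")
    return head if sep else None


def is_descendant(child, ancestor):
    """True if `child` sits strictly below `ancestor` in the naming hierarchy."""
    c = expand(child).split(".")
    a = expand(ancestor).split(".")
    return len(c) > len(a) and c[: len(a)] == a


print(parent("B.1.1.7"))                    # B.1.1
print(expand("P.1"))                        # B.1.1.28.1
print(is_descendant("B.1.617.2", "B.1"))    # True
print(is_descendant("B.1.1.7", "B.1.351"))  # False
```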

VOCs

The implications of mutations, insertions or deletions in SARS-CoV-2 genomes are hard to determine from sequence data alone. While most mutations are silent or might result in phenotypic differences that are neutral or detrimental to viral fitness, some genomic changes may affect properties that are relevant for our ability to detect, treat, control or prevent infections or disease40. Our understanding of the effects of certain genomic signatures is currently limited, as translating genotypes into phenotypes requires carefully designed experimental studies that may require months to complete.

Genomic tracking and data analysis has helped to identify virus variants that have drawn attention because of their epidemiological behavior. A first example was the emergence and global dispersal of viruses with an amino acid substitution (aspartic acid to glycine) in the spike protein at position 614 (ref. 41). This substitution was first described in B.1 lineage viruses identified toward the end of January 2020 in Guangzhou, Sichuan and Shanghai and subsequently in viruses from the same lineage identified in early cases of the pandemic in Germany, linked to a traveler from Shanghai42. This initial cluster was controlled, but viruses with the same substitution have been introduced on multiple occasions, seeding the pandemic in Europe. At that stage, it was not possible to determine whether the substitution reflected a founder event in the country of origin; however, since then, this mutation has been fixed in the genome and is now (as of 19 May 2021) present in 99.27% of the genomes sequenced since the start of 2021. Incursion into the United Kingdom allowed comparison of the spread of B.1 viruses with the 614G substitution over 614D viruses in the same epidemiological background, and displacement of 614D-encoding viruses over time was observed. Subsequent testing of the effect of the substitution on the infectivity of the virus in different cell types (using lentiviral vectors with SARS-CoV-2 spike protein on the viral surface) suggested that the D614G substitution caused an increase in infectivity43,44, while structural analysis suggested a conformational change in the spike protein affecting binding and/or fusion44. In addition, enhanced replication in the upper respiratory tract in hamsters45 and somewhat enhanced transmission in animals were observed46. Considering this in combination with the observed global displacement of D614-encoding viruses, Hou et al. concluded that the virus had adapted to increased transmissibility, possibly through a shift toward more efficient upper respiratory tract infection46.

A more recent phenomenon is the detection of new SARS-CoV-2 variants with multiple mutations across the genome that appear to have undergone a process of natural selection, resulting in an evolutionary jump in comparison to previous circulating viruses (Fig. 2a,b). These variants are declared VOCs when phenotypic traits of relevance to public health are attributed to them23,47,48. The first variant with such an unusual number of mutations (Alpha (B.1.1.7)) was first noted in mid-November 2020 in the United Kingdom, a country that has stood out because of its massive sequencing effort. This VOC differed in 22 nucleotide positions from previously sequenced viruses, including at least 8 nonsynonymous changes mapping to the spike protein48. One consequence of the genetic changes was that one of the three PCR targets used in the routine screening of cases in large test facilities failed, making it relatively easy to track the emergence and spread of the Alpha variant by monitoring the proportion of positive cases with target failure in the spike gene49,50. The Alpha variant rapidly increased in prevalence in large parts of the United Kingdom and beyond and was associated with rapidly expanding community epidemics in different regions. UK scientists, on the basis of phylodynamic analyses and modeling, have suggested that the variant strain may be more transmissible50,51. This conclusion was based on their analyses of virus-lineage-specific trends in COVID-19 reporting, combined with data on social contacts and mobility information48,52. These analyses led to the conclusion that the observed pattern of spread was best explained by assuming that the Alpha variant had increased transmissibility, increasing the reproduction number by 0.4 or more in comparison to previous circulating variants. 
Studies in hamsters showed higher viral shedding of Alpha variant viruses53,54, and it is possible that increased viral load might partly explain the increased rates of transmission between humans as well. Although previously acquired natural or vaccine-induced immunity to SARS-CoV-2 provides protection against severe disease upon infection with the Alpha variant, the possibility that immune escape may explain its rapid spread cannot be excluded as antibody cross-reactivity was variable54,55,56. Thus, a combination of factors ranging from neutral drift and seeding events to viral shedding patterns, immune escape and increased transmissibility may have contributed to the rapid spread of the Alpha variant around the globe.
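The spike gene target failure signal described above lends itself to a simple proxy calculation: among PCR-positive samples, the weekly share in which the spike target was not detected. The sketch below assumes that input records have already been restricted to samples positive on the other targets; the week labels and results are invented for illustration and are not surveillance data.

```python
from collections import defaultdict


def sgtf_share_by_week(results):
    """Share of positive samples per week with spike gene target failure.

    `results` is an iterable of (week_label, s_target_detected) pairs for
    samples that tested positive on the remaining PCR targets.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for week, s_detected in results:
        totals[week] += 1
        if not s_detected:
            failures[week] += 1
    return {week: failures[week] / totals[week] for week in sorted(totals)}


# Hypothetical example records (not real data).
toy_results = [
    ("2020-W46", True), ("2020-W46", True), ("2020-W46", False), ("2020-W46", True),
    ("2020-W50", False), ("2020-W50", True), ("2020-W50", False), ("2020-W50", False),
]

print(sgtf_share_by_week(toy_results))
```

Such a share is only a proxy for variant prevalence. Target failure is not a lineage assignment, so trends of this kind need to be confirmed by sequencing a subset of samples.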

Fig. 2: Overview of amino acid changes in specific proteins of VOCs and currently detected and former VOIs.

a, Amino acid changes in the spike (S) protein of the indicated variants in comparison to the Wuhan-Hu-1 strain (NC_045512.2). b, Amino acid changes in the ORF1ab, ORF3a, envelope (E), membrane (M), ORF6, ORF7a, ORF8 and nucleocapsid (N) proteins of the indicated variants in comparison to the Wuhan-Hu-1 strain (NC_045512.2). NTD, N-terminal domain; RBD, receptor-binding domain; FCS, furin cleavage site; * indicates a stop codon.

In a separate event, another VOC was first detected in South Africa (Beta (B.1.351)). Like the Alpha variant, this variant has accumulated an unusually large number of mutations, some of which are shared with the Alpha variant. The Beta variant is characterized by at least eight nonsynonymous changes in the spike protein, including three that affect key residues in the receptor-binding domain (K417N, E484K and N501Y), which potentially affect receptor binding or antigenicity, or both. As observed with the Alpha variant, Beta variant viruses have rapidly increased in prevalence, with initial modeling suggesting that these viruses have increased transmissibility57. In addition, reduced sensitivity to neutralizing antibodies elicited by either natural infection or vaccination was observed for this variant, which is in line with its first emergence in a region with high seroprevalence due to the first pandemic wave56,58,59,60,61,62,63.

A third highly divergent variant was detected in Japan, traced back to travelers from Brazil64. Subsequent analyses by a sequencing consortium in Brazil confirmed circulation of this variant, referred to as the Gamma variant, in a region that had been hit particularly hard earlier in the pandemic23. The Gamma variant has also been reported to transmit more easily and might be associated with a higher case fatality ratio among young and middle-aged adults65.

More recently, a fourth variant emerged in India, the so-called Delta (B.1.617.2) variant, and was declared a VOC. The Delta variant is characterized by L452R, T478K and P681R substitutions in the spike protein, of which P681R is located in the S1–S2 furin cleavage site, which is an essential site enabling the virus to infect target cells66. It has been speculated that the specific combination of L452R, E484Q and P681R substitutions may result in increased ACE2 binding and a higher rate of S1–S2 cleavage, which could lead to increased transmissibility of variant viruses, but experimental evidence is lacking66. The Delta variant was already identified in December 2020, but has received increased attention recently owing to a rapid surge in COVID-19 cases in India and the United Kingdom caused by this variant since February 2021 (ref. 67). Since then, the Delta variant has rapidly spread across different continents and increased spread as compared to the Alpha variant has been observed68. Additionally, reduced neutralization was observed after vaccination, although vaccination most likely still protects against severe disease and hospitalization69,70.

VOIs

Next to these VOCs, an expanding list of other variants have been identified that might be associated with phenotypic changes but have not yet been demonstrated to circulate widely and/or negatively affect transmissibility, virulence and immune escape or result in decreased effectiveness of available vaccines, diagnostics and therapeutics. These variants are so-called VOIs and need careful monitoring to determine their possible impact on public health.

VOIs might harbor similar mutations as some of the VOCs and have been found in multiple countries or have caused multiple COVID-19 cases. For example, in December 2020, VOI Eta (B.1.525) was detected both in Nigeria and the United Kingdom. This variant shares mutations with the Alpha variant (deletions at positions 69, 70 and 144 of the spike protein) and has the E484K substitution that is found in the Beta and Gamma variants. This specific substitution is monitored because it has been associated with reduced sensitivity to neutralizing antibodies elicited by natural infection or vaccination71. Other examples of VOIs that carry the E484K substitution within the receptor-binding domain are the former VOI Zeta (P.2), former VOI Theta (P.3) and VOI Iota (B.1.526) variants that emerged in Brazil, the Philippines and the United States, respectively. VOI Kappa (B.1.617.1), which was identified in India together with VOC Delta, also has a substitution at position 484 in the spike protein but encodes a glutamine at this position, which is also associated with reduced susceptibility to neutralization with convalescent sera72. Another VOI was first detected in July 2020 in California and subsequently spread rapidly throughout the United States. This variant, former VOI Epsilon (B.1.427/B.1.429), is characterized by a set of substitutions in the spike protein, of which the L452R substitution in the receptor-binding domain is also thought to increase infectivity, and has the potential to escape antibodies73. A variant circulating widely in South America, VOI Lambda (C.37), which was first identified in Peru in August 2020, encodes a substitution in the receptor-binding domain at position 452 as well. Instead of a change from a leucine to an arginine as seen in VOC Delta and VOIs Epsilon and Kappa, a glutamine residue occupies this potentially important site74.
An overview of the currently detected VOCs and currently detected and former VOIs and their substitutions in the spike protein as well as the rest of the virally encoded proteins is shown in Fig. 2a,b, but the number of variants is rapidly expanding and their categorization is constantly being updated on the basis of ongoing risk assessments.
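The substitution notation used throughout (for example, E484K: reference residue, spike position, new residue) can be parsed mechanically, which makes it straightforward to ask which variants share a change at a given site. The sketch below does this for the substitutions named in the text only; the lists are deliberately incomplete and are not the defining mutation sets of these variants, and the function names are ours.

```python
import re
from collections import defaultdict

SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitution(label):
    """Split a label such as 'E484K' into (reference, position, replacement)."""
    match = SUBSTITUTION.match(label)
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


# Spike substitutions as mentioned in the text (partial, for illustration only).
MENTIONED = {
    "Beta": ["K417N", "E484K", "N501Y"],
    "Gamma": ["E484K"],
    "Delta": ["L452R", "T478K", "P681R"],
    "Eta": ["E484K"],
    "Zeta": ["E484K"],
    "Theta": ["E484K"],
    "Iota": ["E484K"],
    "Kappa": ["L452R", "E484Q"],
    "Epsilon": ["L452R"],
    "Lambda": ["L452Q"],
}


def residues_by_position(variants):
    """Map each spike position to {variant name: replacement residue}."""
    by_position = defaultdict(dict)
    for name, labels in variants.items():
        for label in labels:
            _, pos, alt = parse_substitution(label)
            by_position[pos][name] = alt
    return by_position


table = residues_by_position(MENTIONED)
print(table[452])  # which variants change position 452, and to which residue
print(table[484])
```

Grouping by position rather than by exact label keeps related changes together, so that L452R and L452Q, or E484K and E484Q, appear side by side as alternative outcomes at the same site.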

Where did these VOIs and VOCs emerge?

All currently recognized variants were first identified in countries with considerable capacity for genomic surveillance, which does not mean that they also first developed in those countries. At the moment, despite the massive surveillance effort where around 0.93% of all SARS-CoV-2-positive cases around the world are sequenced, the origin of these VOCs has not been found. One hypothesis is that accumulation of multiple mutations may occur within a single specific patient, as several case reports have described the identification of mutations shared with the current VOCs. For instance, deletion of amino acids 141–144 and the E484K and N501Y substitutions in the spike protein were observed in an immunocompromised patient who received plasma therapy in Hong Kong75. In another case report, deletion of amino acids 141–144 in the spike protein was observed in an immunocompromised patient with cancer76, while deletion of positions 69 and 70 in the spike protein, a hallmark of the Alpha variant, was observed in a chronically infected patient77.

A second hypothesis is that the virus mutated in an animal reservoir, as SARS-CoV-2 has been shown to be able to infect many different animal species. Large-scale outbreaks of SARS-CoV-2 have for instance been identified in mink farms22. In the Netherlands, only limited spillback to the human population was observed, while in France and Denmark there seemed to be temporal transmission from humans to animals and vice versa78,79. SARS-CoV-2 infection has also been demonstrated in wild mink80, making it not unlikely that mink can serve as a reservoir host. Other animal species that have been shown to be susceptible to SARS-CoV-2, some of which can also transmit the virus, are hamsters, ferrets, cats, dogs, lions, deer, monkeys and fruit bats, among others81,82,83,84,85,86,87. Of note, newly emerging VOCs may have an extended host range, as the Alpha and Gamma variants have been shown to be able to infect mice88. Taken together, these findings demonstrate that SARS-CoV-2 has a wide host range and that the role of animals as reservoir hosts and as a source for the emergence of new variants needs to be investigated.

A third possibility is that a particular virus variant may have evolved gradually in parts of the world where there is less genomic surveillance but widespread circulation. Whereas in some countries a substantial number of SARS-CoV-2-positive cases have been sequenced, this is not true for all regions of the world. Given that variants with large numbers of mutations have been detected more frequently later in the pandemic and in countries with relatively high seroprevalence due to intense early pandemic waves, it is possible that natural selection of variants with immune escape is occurring during virus circulation in populations with little genomic surveillance.

When does a variant become a VOC?

As genomic monitoring continues to increase in volume, new variants will continue to be detected. A key challenge is to predict and flag VOIs that might be of concern. This requires in-depth knowledge of the genomic profile and possible biological implications of these VOIs. The WHO’s working definition of a VOI is that a variant should be phenotypically different or have a genome with mutations that lead to amino acid changes with established or suspected phenotypic implications. Another requirement is epidemiological evidence of sustained and possibly increased community transmission in one or several countries. The WHO has convened a working group to assess evidence from multiple sources to underpin the assignment of VOIs and VOCs89. A SARS-CoV-2 variant is currently classified as a VOC if it has been demonstrated that this variant is associated with an increase in transmissibility, an increase in virulence or changes in clinical disease presentation, or a decrease in the effectiveness of public health and social measures or available diagnostics, vaccines or therapeutics or when a variant is assessed to be a VOC by the WHO in consultation with the WHO SARS-CoV-2 working group89.
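The working definitions above can be written down as a simple decision rule, which helps to make explicit which kinds of evidence feed into each label. The sketch below is a paraphrase of the criteria as summarized in this section, not an implementation of any WHO procedure; the field names are ours, and actual designation rests on expert assessment of evidence from multiple sources.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical evidence flags for one variant (names are illustrative)."""
    phenotype_or_suspect_mutations: bool = False
    sustained_community_transmission: bool = False
    increased_transmissibility: bool = False
    increased_virulence_or_changed_presentation: bool = False
    reduced_effectiveness_of_countermeasures: bool = False
    assessed_as_voc_by_who: bool = False


def classify(evidence):
    """Return 'VOC', 'VOI' or 'unclassified' following the criteria in the text."""
    if (
        evidence.increased_transmissibility
        or evidence.increased_virulence_or_changed_presentation
        or evidence.reduced_effectiveness_of_countermeasures
        or evidence.assessed_as_voc_by_who
    ):
        return "VOC"
    if (
        evidence.phenotype_or_suspect_mutations
        and evidence.sustained_community_transmission
    ):
        return "VOI"
    return "unclassified"


print(classify(Evidence(phenotype_or_suspect_mutations=True,
                        sustained_community_transmission=True)))  # VOI
print(classify(Evidence(increased_transmissibility=True)))        # VOC
```

Each flag in the sketch stands for a demonstrated association, which in practice is the hard part: establishing it requires the epidemiological and experimental work discussed in the following section.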

Future needs in genomics: fast genotype-to-phenotype prediction

Observed changes in epidemiological patterns can be explained by multiple mechanisms not necessarily related to the observed mutations in viral genomes. To draw conclusions about specific variants, epidemiological observations need to be combined with experimental data to assess virus properties, such as infectivity, transmissibility, tropism, virulence and immune escape. To develop a robust knowledge base for monitoring of viral evolution in relation to pandemic preparedness, the rapidly expanding viral genomic sequencing network needs to include reference centers for virus characterization, working toward a suite of standardized assays, reagents, strain collections and serum samples, none of which is currently available for SARS-CoV-2 (ref. 90). The devil is in the details. For instance, propagation of the viruses in vitro can result in cell culture-adaptive mutations91. This can be overcome by using specific cell lines or organoids and by developing a consensus mechanism for standardization and auditing of cell lines and protocols; however, such harmonization efforts are challenging and their implementation may take years. Currently, such standardization efforts are not part of molecular surveillance work and still remain to be developed by the field.

A key question is what the priorities are for genotype-to-phenotype prediction, based on lessons learned so far92. On the basis of the criteria for assignment of VOCs, it would make sense to focus on virus traits that can provide information about key properties of concern: transmissibility, virulence and immune escape or decreased effectiveness of available vaccines, diagnostics and therapeutics. However, this is a wide scope, and further prioritization may be needed. For instance, the inferred increased transmissibility observed for several VOCs thus far has not translated to fundamental changes in baseline public health strategies93. That could be different if variants emerge for which new traits, such as changes in the age groups predominantly affected or modes of transmission, would warrant updating of interventions on the basis of solid experimental or field data. Arguably, the most urgent question is whether vaccine-derived immunity is affected by variant emergence, for which assessment of both humoral and cellular immune responses will be needed94,95,96,97. This assessment has been performed for the Alpha and Beta variants, where it was shown that, although these variants can partially escape humoral immunity, CD4-specific T cell activation was not affected97. The turnaround time for such assays, however, has to be improved for informed public health decision-making. Alternatively, reduced neutralization of VOCs may become an important screening assay, as neutralizing antibodies are a likely correlate of protection and neutralization assays can be performed relatively quickly once a high-quality virus isolate has been obtained98,99.

It is likely that SARS-CoV-2 will continue to circulate and evolve and that a system analogous to the global influenza virus surveillance network will be needed100,101,102. This is a network of national influenza centers and WHO collaborating centers that collect data on influenza-like illness trends and provide genetic and antigenic characterization data on a representative selection of viruses circulating in a more or less standardized manner. This information is aggregated globally and is used to decide whether and when the vaccine composition for the next season needs to be adapted. However, whereas the global influenza surveillance system has been largely reactive by selecting newly emerging antigenic variants identified during epidemics to generate vaccines for the next season, high-throughput global virus genome sequencing efforts also allow more forward-looking approaches. When robust assays are developed to quantify immune responses to SARS-CoV-2 after infection and vaccination, such assays may be used to test the effect of all amino acid substitutions observed in global surveillance studies on immune escape, in real time. Examples of such studies are already available for immune escape changes resulting from substitutions in the receptor-binding domain of the spike protein71, but additional assays can be developed, including assays for other antibody targets in the spike protein (for example, the N-terminal domain) and for T cell immunity. Robust, standardized and validated assays to measure viral immune escape based on genome sequence data and population immunity studies can provide information on vaccine effectiveness against emerging variants. The development of these assays would allow for timely risk assessment and a more immediate response in the case of emergence of VOCs with increased diversity of responses to vaccines.

Other potential indicators of increased public health risk are changes in transmissibility, changes in disease severity, reduced detection by diagnostic assays and reduced susceptibility to drugs and/or therapeutics. For each of these parameters, several assays and study types are available that should be further developed, standardized and assessed for their suitability for use in rapid risk assessment. Examples include the use of panels of viral antigens to screen for potentially reduced sensitivity of widely used rapid tests and the development of well-characterized reference sera for neutralization assays and reference viruses to be used in competition assays for each of the different variants, as mutations will continue to occur in VOCs and VOIs after their initial detection. Given the fact that SARS-CoV-2 may rapidly acquire mutations mapping to the spike protein upon inoculation in animals, human organoids potentially represent a powerful tool to further characterize SARS-CoV-2 variants. Recent studies, for example, have indicated that the Alpha variant, in comparison to an ancestral SARS-CoV-2 clade B virus, produces higher levels of infectious virus late in infection and has higher replicative fitness in human airway, alveolar and intestinal organoid models103. These assays should also be performed in a timely fashion because the increasing volume of sequencing data otherwise has the potential to become a burden instead of a valuable source of information90.

Whenever possible, this work should be conducted without the use of full-length infectious SARS-CoV-2, for example, by using virus pseudotypes. However, some phenotypes, such as virulence and transmission, cannot be investigated without infectious viruses. Conclusive evidence for other experiments (for example, assays for immune escape) may also require use of infectious viruses. These experiments, conducted under the appropriate biosafety level 3 conditions, are crucial to keep intervention strategies up to date in the interest of public health and animal health. Key recommendations are summarized in Box 1. In combination, globally representative genomic surveillance linked with experimental data to validate signals from genomic data will provide a critical step forward in surveillance of current and potential pandemic threats104.