Topic Introduction

De Novo Genome Sequencing, Annotation, and Taxonomy of Unknown Bacteria

Andrew Camilli
Department of Molecular Biology and Microbiology, Tufts University, School of Medicine, Boston, Massachusetts 02111, USA
Correspondence: andrew.camilli{at}tufts.edu

Abstract

Whole-genome sequencing of viruses and bacteria has become routine thanks to advances in DNA-sequencing technologies. Parallel advances in computing power and software design allow for billions of base pairs of sequence information to be analyzed in hours to minutes. Here, I describe methods to isolate known as well as new species of bacteria from the environment; to purify, sequence, assemble, and bioinformatically annotate their genomes; and to determine their place in the tree of life by phylogenetic analysis. The protocol introduced here was developed as part of Cold Spring Harbor's Advanced Bacterial Genetics course.

BACKGROUND

Massively parallel sequencing (MPS), also known as next-generation sequencing or second-generation sequencing, has revolutionized genome, metagenome, microbiome, and epigenome sequencing by allowing researchers to generate billions of nucleotides of DNA sequence information within a single experiment (Rivera and Ren 2013; Franzosa et al. 2015; Loman and Pallen 2015). MPS works by the parallel sequencing of hundreds of millions of short DNA molecules. The resulting sequence of each DNA molecule is referred to as a “read.” Read lengths can range from 50 to 300 nt. Thus, in a typical experiment, the user can generate anywhere from 1 to 50 billion nucleotides of sequence information, depending on the sequencing instrument used and the type of run. As most bacterial species have genome sizes on the order of 2–6 million bp, the resulting read depth, which is the average number of times that each base pair in a bacterial genome is sequenced, can be dozens to hundreds of times. This redundancy allows computer software programs to “stitch” the reads together using overlapping sequence between pairs of reads to generate the genome sequence de novo.
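The read-depth arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the protocol: the function names and the example run (2 million reads of 150 nt against a 4-Mbp genome) are assumptions chosen to fall within the ranges quoted in the text.

```python
import math


def read_depth(num_reads: int, read_length_nt: int, genome_size_bp: int) -> float:
    """Average number of times each base pair of the genome is sequenced."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return num_reads * read_length_nt / genome_size_bp


def reads_needed(target_depth: float, read_length_nt: int, genome_size_bp: int) -> int:
    """Smallest number of reads whose total length reaches the target average depth."""
    if read_length_nt <= 0:
        raise ValueError("read_length_nt must be positive")
    total_nt = target_depth * genome_size_bp
    return math.ceil(total_nt / read_length_nt)


if __name__ == "__main__":
    # Hypothetical run: 2 million reads of 150 nt, 4-Mbp genome (illustrative values).
    depth = read_depth(num_reads=2_000_000, read_length_nt=150, genome_size_bp=4_000_000)
    print(f"Average read depth: {depth:.0f}-fold")

    # Reads required for 50-fold average depth of the same hypothetical genome.
    print("Reads for 50-fold depth:", reads_needed(50, 150, 4_000_000))
```

Note that this is an average: actual depth varies along the genome, so some regions will be covered more thinly than the computed value suggests.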

Recently, third-generation sequencing, which involves parallel sequencing of long DNA molecules, has made strides, and promises to soon become more powerful than second-generation sequencing for de novo genome sequencing and assembly. Notable examples are the PacBio (www.pacb.com/products-and-services/applications/whole-genome-sequencing/microbial/) and the Nanopore (Deamer et al. 2016) sequencing platforms. The latter threads long DNA molecules through a nanopore, determining the sequence as the molecule transits. The much greater read length of third-generation sequencing greatly facilitates assembly of genomes de novo. However, the high error rate of these methods remains problematic, usually requiring combination with second-generation sequencing to produce an accurate genome sequence.

Here I introduce a protocol for the isolation, sequencing, and taxonomic classification of potentially novel species of culturable bacteria (see Protocol: Isolation and Sequencing of Novel Vibrio Species [Camilli 2022]). It starts with isolation of bacteria as colonies on a nutrient-rich agar plate. The user can vary the type of agar medium used in order to target isolation of different kinds of bacteria. Next, the user isolates total DNA, which consists of chromosomes plus any plasmids or other extrachromosomal elements that may be present. The DNA is then prepared for sequencing by the commonly used tagmentation (support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/samplepreps_nextera/nexteradna/nextera-dna-library-prep-reference-guide-15027987-01.pdf) sample-preparation method and sequenced on the Illumina NextSeq instrument (www.illumina.com/systems/sequencing-platforms/nextseq.html). Genome assembly, annotation, and phylogenetic analysis are then performed using a combination of commercial and freely available software. Specifically, genome assembly is performed using the commercially available CLC Genomics Workbench software suite (digitalinsights.qiagen.com/products-overview/discovery-insights-portfolio/analysis-and-visualization/qiagen-clc-genomics-workbench/) run on a desktop computer. Annotation of genomes is achieved by using the freely available Rapid Annotation Using Subsystem Technology (RAST) (Aziz et al. 2008; Overbeek et al. 2014; Brettin et al. 2015) server. Finally, the relatedness of the resulting bacteria to other known species can be determined by phylogenetic analysis of the 16S ribosomal RNA gene sequence using the freely available BLAST (Sayers et al. 2019) server.

Analysis of genomes that are almost identical to previously sequenced ones is computationally simple, as the reads can be mapped to the known (reference) genome. In this case, one is essentially looking for the rare differences, namely single-nucleotide polymorphisms, deletions or insertions, and inversions or translocations. More computationally demanding is the de novo genome assembly of a previously unknown species of bacteria. In this case, one must obtain deep sequence coverage (ideally ≥50-fold) of the entire genome to generate a completely or nearly completely assembled genome. Because of difficult-to-sequence regions (e.g., homopolymer tracts) and the presence of short and long sequences that are repeated within a genome, MPS and computational assembly are almost never able to generate a complete bacterial genome. Instead, one obtains dozens to hundreds of “contigs,” which are fragments of the genome built from sets of overlapping sequences. Although the sum of these contigs may represent most or all of the unique portion of the genome, and is often published as a “draft” genome, the order of the contigs is unknown. Moreover, the sequences in between contigs, which are usually additional copies of repeat sequences, remain unknown. A more complete draft genome (i.e., larger and fewer contigs) can be obtained by carrying out “paired-end” sequencing, whereby both ends of DNA molecules up to 1 kbp in size are sequenced and this linkage information is later used to aid genome assembly. In this case, repeat sequences up to ∼900 bp can be assembled into their proper context. Repeats longer than this, such as transposons or ribosomal RNA gene operons, will remain unresolved by the assembly software. A complete genome can usually be obtained for bacteria by using “mate-pair” sequencing (Weber and Myers 1997; see Protocol: Preparation of an 8-kb Mate-Pair Library for Illumina Sequencing [Mardis and McCombie 2017]), in which long DNA molecules are circularized and the sequences flanking the ligation junctions are sequenced. As long as the DNA molecules being circularized are longer than the longest repeat within the genome, all repeats will be assembled correctly, resulting in a complete genome. Nevertheless, a draft genome, which can be generated using the procedures introduced here, is sufficient for identification of virtually all of the genes and for taxonomic classification.
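The relationship between fragment length and resolvable repeat length can be made concrete with a short Python sketch. It is a simplified model and not how assembly software actually works: a repeat is treated as resolvable when a sequenced fragment is long enough to cover the whole repeat and still leave some unique sequence at its ends. The `anchor_bp` parameter (100 bp by default, chosen only to reproduce the ∼900-bp limit quoted above for 1-kbp fragments) and the repeat lengths in the example are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repeat:
    """A repeated sequence element; name and length are illustrative."""
    name: str
    length_bp: int


def can_span(repeat_length_bp: int, fragment_length_bp: int, anchor_bp: int = 100) -> bool:
    """True if a fragment can cover the repeat and still leave at least
    anchor_bp of unique flanking sequence (assumed threshold, not a measured one)."""
    return fragment_length_bp - repeat_length_bp >= anchor_bp


def unresolved_repeats(repeats: list[Repeat], fragment_length_bp: int) -> list[str]:
    """Names of repeats that fragments of the given length cannot span."""
    return [r.name for r in repeats if not can_span(r.length_bp, fragment_length_bp)]


if __name__ == "__main__":
    # Hypothetical repeat content of a bacterial genome (lengths are rough examples).
    repeats = [
        Repeat("short repeat", 300),
        Repeat("insertion sequence", 1_300),
        Repeat("rRNA operon", 5_000),
    ]
    # Paired-end library with fragments up to 1 kbp.
    print("Paired-end (1 kbp):", unresolved_repeats(repeats, 1_000))
    # Mate-pair library built from 8-kb molecules.
    print("Mate-pair (8 kb):", unresolved_repeats(repeats, 8_000))
```

Under these assumptions the 1-kbp library leaves the insertion sequence and the rRNA operon unresolved, whereas the 8-kb mate-pair library spans all three, which mirrors the reasoning in the text.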

Genome sequencing technology is rapidly evolving, with the cost per genome decreasing dramatically. Combined with parallel advances in computing power, it has become routine to sequence, annotate, and bioinformatically analyze bacterial genomes. These advances have greatly empowered scientists, from beginners to experts, to explore our microbial world at the genetic and genotypic levels.

ACKNOWLEDGMENTS

I thank Cecilia Alejandra Silva-Valenzuela, Lauren Shull, Miriam Ramliden, Jacob Bourgeios, and David Lazinski for expert assistance and modifications of these procedures.

Footnotes

  • From the Experiments in Bacterial Genetics collection, edited by Lionello Bossi, Andrew Camilli, and Angelika Gründling.

Cold Spring Harb Protoc 2023: pdb.top107847. © 2023 Cold Spring Harbor Laboratory Press