Topic Introduction

Navigating the Maze of Maize Genomics: The Impact of Transposable Elements and Tandem Repeats

Shujun Ou 1,2
1 Department of Molecular Genetics, The Ohio State University, Columbus, Ohio 43210, USA
2 Correspondence: ou.195@osu.edu

Abstract

Transposable elements (TEs) are abundant and ubiquitous components of eukaryotic genomes. Since TEs were first discovered in maize (Zea mays) by Barbara McClintock in the late 1940s, these elements have been shown to be important agents in shaping genome structure and evolution. Today, maize continues to be an important model organism for molecular and quantitative genetics, and represents a particularly useful system for the study of the interplay between TEs and host genomes. While TEs constitute a significant part of the maize genome and are important drivers of genome evolution, their annotation remains a complex and challenging task. Here, we discuss genome annotation of TEs and other repetitive sequences in maize genomes. We briefly review current knowledge on the overall landscape of TE and non-TE repeats in maize, and discuss how these sequences may impact genome structure, and the genotype and phenotype within species. We also provide a summary of the main tools used to find TE polymorphisms, and briefly introduce four different bioinformatic approaches for TE and tandem repeat annotation, explaining how they can be best used by maize researchers.

INTRODUCTION

Transposable elements (TEs) are DNA sequences that are able to move from one location to another in the genome. These mobile DNA sequences are found in virtually all organisms, and they occupy a considerable fraction of the genome of most eukaryotic species (Bourque et al. 2018; Wells and Feschotte 2020). TEs were first discovered by Barbara McClintock, who identified the Activator (Ac) and Dissociation (Ds) elements associated with chromosome breaks in Zea mays (maize) (McClintock 1950, 1953). Since then, TEs have been studied extensively because of their notable impact on genome size, structure, and function in multiple eukaryotic species (Bourque et al. 2018; Wells and Feschotte 2020). Today, maize continues to be a model system for the study of TEs and their impact. Maize features a large and complex genome, with >80% of it composed of TEs, and it displays a remarkable intraspecific genomic diversity (Hufford et al. 2021; Chen et al. 2023b).

Here, we provide an overview of the TE landscape in maize and its diversity between different lineages and species in the genus Zea, also pointing to some key examples of how TEs have impacted the evolution and function of the maize genome. We briefly introduce the general process and methods for annotating TEs and for identifying new TE insertions in maize. We outline the main non-TE repeats found in maize, discussing how they influence genomic structure and function, and briefly introduce methods for annotating those sequences. Finally, we introduce the different approaches used in our protocol for annotating TEs and other repeats in maize, which we include as part of this collection (see Benson et al. 2024).

THE LANDSCAPE OF TES IN MAIZE GENOMES

There are two major classes of TEs: class I, or retrotransposons, which mobilize using an RNA intermediate, and class II, or DNA transposons, which mobilize through a DNA intermediate (Bourque et al. 2018; Wells and Feschotte 2020). These classes can be further divided into subclasses, four of which are present in the maize genome: long terminal repeat (LTR) retrotransposons, non-LTR retrotransposons, terminal inverted repeat (TIR) DNA transposons, and Helitron DNA transposons. Each of them can also be further divided into superfamilies and families. Furthermore, TEs can also be classified as autonomous (i.e., when they encode all the necessary enzymatic machinery to transpose) and nonautonomous (i.e., when functional coding sequences for such machinery are partially or fully missing) (Table 1). Nonautonomous TEs, however, are sometimes able to mobilize using transposition proteins encoded by autonomous elements (Wells and Feschotte 2020). In maize, 83.2% of the reference genome (the B73 lineage) corresponds to TEs that are distributed across thousands of families, with DNA transposons and retrotransposons representing 8.6% and 74.6% of the genome, respectively (Hufford et al. 2021).

Table 1.

Classification, structure, size, and functional relevance of maize transposable elements and tandem repeats

LTR elements are retrotransposons named after the distinctive long terminal repeats found on their 5′ and 3′ ends (Boeke et al. 1985; Ou and Jiang 2018). Autonomous LTR elements have at least two genes. The first is the gag gene, which encodes a polyprotein containing the structural components of the virus-like particles that encase LTR element intermediates, and the second is the pol gene, which encodes a polyprotein containing at least a protease (PR), a reverse transcriptase (RT), an RNase H (RH), and an integrase (IN) domain (Havecker et al. 2004; Wells and Feschotte 2020). The main LTR superfamilies in plant genomes are Ty1 and Ty3. Their sizes range from 0.3 to 11 kb for Ty1 elements, and from 2 to 20 kb for Ty3 elements (Zhao et al. 2016). LTR retrotransposons represent 74.19% of the maize genome (Table 1), which is >99% of the total retrotransposon content in this species (74.6%). Most of this fraction corresponds to the superfamilies Ty1 and Ty3, which represent 24.92% and 44.25% of the maize genome, respectively (Hufford et al. 2021).

Non-LTR elements are retrotransposons that usually contain two open reading frames, ORF1 and ORF2, although only ORF2 is found in all autonomous elements from this subclass. The main protein encoded by ORF2 functions as both an endonuclease (EN) and a reverse transcriptase (RT), and is responsible not only for synthesizing a complementary DNA strand from an RNA intermediate but also for integrating this DNA strand into the host target site. The function of ORF1 is less well understood, but it encodes a protein that has RNA binding and nucleic acid chaperone activity (Han 2010; Wells and Feschotte 2020). The main non-LTR superfamilies in plant genomes are the long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs). In contrast to LINEs, all SINEs are nonautonomous and depend on the enzymatic machinery from LINEs to transpose. The only common structural feature among all maize SINEs is the presence of an RNA polymerase III promoter with conserved A and B motifs (Baucom et al. 2009). These elements have sizes of up to 9 kb in the case of LINEs and up to 0.5 kb in the case of SINEs (Zhao et al. 2016). Together, non-LTR elements represent <1% of the maize genome (Table 1), of which ∼98% belongs to the LINE superfamily (Schnable et al. 2009; Hufford et al. 2021). However, this is likely an underestimate because high-quality methods for de novo identification of non-LTR elements are lacking (Ou et al. 2019).

TIR elements are DNA transposons named after their terminal inverted repeats and typically contain a single ORF encoding a DD(E/D) transposase. This protein is responsible for mobilizing TIR elements by recognizing TIR sequences, cleaving both TE ends, and reinserting them on a new chromosomal location through a strand transfer mechanism, also known as “cut and paste” transposition (Hickman and Dyda 2016; Wells and Feschotte 2020). The main TIR superfamilies in plant genomes are the CACTA, Mutator, PIF/Harbinger, Tc1/Mariner, and hAT superfamilies. Their sizes range from 0.2 to 21 kb for CACTA, 0.12 to 16 kb for Mutator, 0.08 to 7 kb for PIF/Harbinger and Tc1/Mariner, and 0.11 to 6 kb for hAT elements (Zhao et al. 2016). TIR elements occupy 6.71% of the maize B73 genome (Table 1), with CACTA being the largest superfamily of TIR elements, making up 2.96% of the B73 genome (Hufford et al. 2021).
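The defining structural feature described above, a pair of terminal inverted repeats, can be illustrated with a minimal Python sketch. It is only a toy check under stated assumptions: the function names and the example sequence are invented for illustration, matching is exact, and real structure-based TIR finders also handle mismatches, target site duplications, and superfamily-specific motifs.

```python
# Toy sketch: measure the terminal inverted repeat (TIR) of a candidate element.
# All names and the example sequence are hypothetical, not taken from any tool.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def tir_length(element: str) -> int:
    """Count how many 5' terminal bases exactly match the reverse
    complement of the 3' terminus of the same strand."""
    seq = element.upper()
    rc = reverse_complement(seq)
    length = 0
    # Compare at most half of the element so the two arms cannot overlap.
    for five_prime, three_prime_rc in zip(seq[: len(seq) // 2], rc):
        if five_prime != three_prime_rc:
            break
        length += 1
    return length


# Made-up element: 9 bp TIR, 10 bp internal region, 9 bp inverted copy.
toy_element = "CACTACAAG" + "AAAAAGGGGG" + "CTTGTAGTG"
print(tir_length(toy_element))  # 9
```

A sequence without inverted termini, such as a run of a single base, gives a length of 0 in this sketch.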

Helitrons represent another subclass of DNA transposons that are also found in the maize genome. These elements have structural features like small nucleotide motifs in their termini, hairpin loops close to their 3′ end, and at least one ORF encoding a transposase known as RepHel, containing a Rep (or HUH endonuclease) and a helicase (Hel) domain. The RepHel protein is responsible for mobilizing Helitrons within host genomes by a process known as rolling-circle transposition, which is similar to the replication mechanism used by some viral and plasmid rolling-circle elements (Thomas and Pritham 2015; Grabundzija et al. 2018; Heringer and Kuhn 2022). Helitrons can be divided into four subfamilies (or variants), namely Proto-Helentron, Helentron, Helitron2, and the canonical Helitron, with only canonical Helitrons found in maize and other plant genomes (Thomas and Pritham 2015; Wells and Feschotte 2020). In plants, Helitrons have sizes ranging from 0.15 to 20 kb (Zhao et al. 2016), and occupy 1.89% of the maize B73 genome (Table 1; Hufford et al. 2021).
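The subclass percentages quoted in this section can be cross-checked with simple arithmetic. The short Python sketch below uses only the B73 values cited above from Hufford et al. (2021). The text gives the non-LTR fraction only as <1%, so the value used here is derived as the retrotransposon remainder and should be read as an assumption rather than a reported figure.

```python
# Cross-check of the B73 genome fractions (%) quoted in the text.
RETROTRANSPOSON_TOTAL = 74.6   # class I, as quoted
DNA_TRANSPOSON_TOTAL = 8.6     # class II, as quoted
TE_TOTAL = 83.2                # all TEs, as quoted

fractions = {"LTR": 74.19, "TIR": 6.71, "Helitron": 1.89}

# Assumption: non-LTR content is the part of class I not explained by LTRs.
fractions["non-LTR"] = round(RETROTRANSPOSON_TOTAL - fractions["LTR"], 2)

class_i = round(fractions["LTR"] + fractions["non-LTR"], 2)
class_ii = round(fractions["TIR"] + fractions["Helitron"], 2)
te_total = round(class_i + class_ii, 2)

# Share of the retrotransposon content contributed by LTR elements.
ltr_share = round(100 * fractions["LTR"] / RETROTRANSPOSON_TOTAL, 2)

print("derived non-LTR %:", fractions["non-LTR"])
print("class I %:", class_i, "class II %:", class_ii, "total %:", te_total)
print("LTR share of retrotransposons %:", ltr_share)
```

Under these assumptions the subclass values add up to the quoted class totals, and the LTR share of retrotransposons comes out above 99%, in line with the text.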

Different TE groups have distinct distribution patterns in the maize genome. For instance, in the genome of the maize inbred line B73, DNA TEs represented by TIR elements and Helitrons are predominantly found in gene-rich regions. In contrast, LTR elements are found in both gene-rich (e.g., Ty1-like elements) and gene-poor heterochromatic regions (e.g., Ty3-like elements), depending on the family (Schnable et al. 2009; Hufford et al. 2021). An LTR family named centromeric retrotransposons of maize (CRM) and the CentC tandem repeats are predominantly localized in the maize centromeres, where they are the two main structural components (Table 1; Birchler and Han 2009; Wolfgruber et al. 2009). In the maize B chromosome, which is nonessential and only found in some lineages, the four TE subclasses are evenly distributed along the chromosome (Blavet et al. 2021).

TE CONTENT VARIATION IN THE GENUS ZEA

The distribution and relative abundances of TE superfamilies and families among maize lineages, subspecies, and Zea species are similar overall (Tenaillon et al. 2011; Bornowski et al. 2021; Hufford et al. 2021; Chen et al. 2022; Wang et al. 2023). Nonetheless, there are some noticeable differences between the maize B73 line and other Zea taxa regarding specific groups. For instance, there is a general decrease in the number of Ty3 elements in maize genomes when compared to other Zea species, and an overall increase in the abundance of specific DNA transposon families in the clade that includes Zea diploperennis and Zea perennis (Chen et al. 2022). Likewise, although the relative proportions of TE families between maize and Zea luxurians are somewhat similar, the total TE content in Z. luxurians is ∼35% higher, which explains a large portion of the 52% difference in genome size estimated by flow cytometry between these species (Tenaillon et al. 2011). In addition, the activity and abundance of LTR families can vary between tropical and temperate maize lineages. These differences are driven by the amplification and removal of LTR elements, which are likely subject to distinct selective pressures (Ou et al. 2022). Regarding those observations, it is worth mentioning that variation in the content of repetitive sequences (i.e., satellite DNAs and TEs) is one of the main determinants of genome size differences between closely related taxa (SanMiguel and Bennetzen 1998; Tenaillon et al. 2011; Chia et al. 2012; Roessler et al. 2019). Furthermore, in maize lineages, genome size variation is correlated with latitudinal and altitudinal gradients, which could be the result of selection on genome size (Díez et al. 2013; Lai et al. 2017; Bilinski et al. 2018).

EVOLUTIONARY AND FUNCTIONAL IMPACT OF TES ON THE MAIZE GENOME

The large number of TEs in plant genomes not only affects genome size but also generates structural variants (SVs), facilitates chromosomal rearrangements, and influences gene function (Schnable et al. 2009; Jiao et al. 2017; Yang et al. 2017; Anderson et al. 2019; Li et al. 2020; Hu et al. 2021). For instance, the maize genome has a higher TE content (>80%) (Hufford et al. 2021) than that of rice (>45%) (Ou et al. 2019) and sorghum (>65%) (McCormick et al. 2018), and maize genes contain larger introns because of TE insertions (Schnable et al. 2009).

Maize TEs have been directly involved in chromosomal rearrangements, together with gene inactivation, duplication, subfunctionalization, and expression modulation by their insertion into regulatory regions (Lisch 2013). Examples have been found for all TE subclasses in maize. For instance, a Hopscotch LTR element insertion functions as an enhancer for the tb1 gene, which increases apical dominance and is regarded as a key maize domestication event (Studer et al. 2011). Furthermore, the insertion of a Cin4 non-LTR element in the 3′ untranslated region of the A1 gene results in an alternative transcript, with no apparent change in phenotype (Schwarz-Sommer et al. 1987a, b); the insertion of a CACTA-like TIR element in the ZmCCT promoter reduces maize's sensitivity to photoperiod (Yang et al. 2013); and a Helitron insertion in the ZmGDIα gene results in a recessive allele that confers resistance to maize rough dwarf disease (Liu et al. 2020b), among other examples (Table 1).

METHODS FOR MAIZE TE ANNOTATION

The coding and structural features in each of the TE groups discussed above can be used for their identification and annotation by different tools. The protein sequences found in TEs are unique to each subclass, and for that reason, can be used by homology-based methods in their identification. Structural features like LTRs, TIRs, terminal motifs, and secondary structures vary between different groups, and structure- and homology-based methods can be used to identify superfamilies and families. Overall, TE subclasses are defined by their coding sequence and general sequence structure, and superfamilies and families are classified through sequence clustering that reflects their phylogenetic relationships (Bourque et al. 2018; Wells and Feschotte 2020). Ideally, in silico TE annotation tools should make use of those different levels of information in their methodology. Failure to do so may result in misclassifications and a higher rate of false annotations.

TEs are typically identified based on their sequence homology with TE databases, or by using structure-based methods that can identify di-, tri-, and tetranucleotide repeats in addition to other unique and defining features of specific TE classes and superfamilies. General repeat finders represent yet another class of methods that can be used to find TEs. A common feature of these methods is their ability to detect sequences based on their repetitiveness, without necessarily relying on homology or structural similarity to known repeats. Examples of such tools include RepeatExplorer (Novák et al. 2013), Red (Girgis 2015), and Generic Repeat Finder (Shi and Liang 2019).
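The idea behind general repeat finders, detecting sequence purely from its repetitiveness, can be sketched with a naive k-mer count. The Python example below is an illustration only and does not reproduce the algorithms of RepeatExplorer, Red, or Generic Repeat Finder. The function name, the k-mer size, and the count threshold are all assumptions chosen for a tiny toy sequence.

```python
from collections import Counter


def soft_mask_repeats(seq: str, k: int = 4, min_count: int = 3) -> str:
    """Lowercase every base covered by a k-mer that occurs at least
    `min_count` times in the sequence; leave the rest uppercase."""
    seq = seq.upper()
    n_kmers = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_kmers))

    masked = [False] * len(seq)
    for i in range(n_kmers):
        if counts[seq[i:i + k]] >= min_count:
            for j in range(i, i + k):
                masked[j] = True

    return "".join(base.lower() if hit else base
                   for base, hit in zip(seq, masked))


# Toy input: three copies of ACGT followed by a unique tail.
print(soft_mask_repeats("ACGTACGTACGTTTGCAGAC"))  # acgtacgtacgtTTGCAGAC
```

No repeat library or structural model is consulted here, which is the property that distinguishes this class of methods from homology-based and structure-based ones.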

Most maize researchers use a combination of structure-based and homology-based methods to classify TEs (Springer et al. 2018; Yang et al. 2019; Li et al. 2020; Hufford et al. 2021; Lin et al. 2021; Zhao et al. 2022; Chen et al. 2023b; Tian et al. 2023; Wang et al. 2023), and some of the latest algorithms developed for TE annotation now use a blended approach (Ou et al. 2019; Flynn et al. 2020). In nonmodel systems, TE libraries for homology-based classification are not readily available and must be created for high-quality annotations. Maize researchers have an advantage in that maize-specific TE exemplars are heavily curated by the Maize TE Consortium and the community, and are freely available for use in TE annotation (see https://github.com/oushujun/MTEC).

FINDING NEW TE INSERTIONS

TEs are the most dynamic component of plant genomes, and SVs caused by transposition events play a key role in genomic variation observed both within (Springer et al. 2018; Anderson et al. 2019; Carpentier et al. 2019) and between (Tenaillon et al. 2010) species. The rapid advances in sequencing technologies in recent years have resulted in the sequencing of a large number of maize genomes (Hufford et al. 2021), creating an unprecedented opportunity for the study of TE insertion polymorphisms (TIPs) between lineages and populations. Because TEs are one of the fastest-evolving components of eukaryotic genomes, the detection of TIPs is an important step for identifying relevant genomic changes that are potentially involved in phenotypic variation between lineages within maize.

Many methods have been reported for detecting TIPs, and most of them were developed based on short-read whole-genome sequencing (WGS) data. Examples include (1) TEMP2, which identifies TE insertions (or their absence) in genomic sequencing data, pinpoints their junctions, and estimates the frequency of transposition events in a specific population (Zhuang et al. 2014); (2) the RelocaTE2 tool, which maps TIPs at single-base-pair resolution using TE-derived short reads as seeds to cluster read pairs, also detecting target site duplications (TSDs) of insertions in each cluster (Robb et al. 2013; Chen et al. 2017); and (3) the integrated pipeline McClintock2, which executes and evaluates 12 detectors for TIPs, generating an output in a standardized format (Nelson et al. 2017; Chen et al. 2023a).
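Target site duplications, mentioned above as a signature of insertions, are short direct repeats that flank an inserted element. The Python sketch below shows the basic idea on toy flanking sequences. It is not the method used by RelocaTE2 or any other tool, and the function name, length limits, and sequences are hypothetical.

```python
def find_tsd(left_flank: str, right_flank: str,
             min_len: int = 2, max_len: int = 12) -> str:
    """Return the longest sequence that both ends the left flank and
    starts the right flank of an insertion, or '' if none is found."""
    left, right = left_flank.upper(), right_flank.upper()
    longest = min(max_len, len(left), len(right))
    # Try the longest candidate first and shrink until a match appears.
    for size in range(longest, min_len - 1, -1):
        if left[-size:] == right[:size]:
            return left[-size:]
    return ""


# Toy layout: ...GGCATC[GTTAC] <inserted TE> [GTTAC]AGGCT...
print(find_tsd("GGCATCGTTAC", "GTTACAGGCT"))  # GTTAC
```

Exact matching is a simplification; a practical detector would need to tolerate sequencing errors and imprecise breakpoints.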

Although methods for detecting TIPs in short-read WGS data have been historically important, they usually have insufficient sensitivity, as TE sequences are highly repetitive and usually longer than short reads, which limits read mapping to reference genomes (Vendrell-Mir et al. 2019; Rech et al. 2022). Conversely, long-read WGS platforms can sequence fragments from 10 kb to 2 Mb (Warburton and Sebra 2023), largely alleviating the drawbacks of short-read-based methods. Tools like Sniffles2 (Sedlazeck et al. 2018) and SVIM (Heller and Vingron 2019) are SV callers that, despite not being tailored specifically for detecting new TE insertions, can be used for that goal. Alternatively, TELR (Han et al. 2022), which detects SVs using Sniffles2 (Sedlazeck et al. 2018), and then filters TE insertion candidates and estimates TE insertion allele frequency, is an example of a method developed specifically to detect TIPs in long-read WGS data. Finally, there are tools that also identify TIPs in genome assemblies. Examples include (1) TrEMOLO, which detects global TE variations between two assembled genomes and population-level or somatic variation in TE insertion or deletion, also estimating their frequency (Mohamed et al. 2023), and (2) GraffiTE, which detects TIPs from genome assemblies or long-read sequencing data, and genotypes the discovered variants from short- or long-read data (Groza et al. 2023).
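When a general SV caller is used to look for new TE insertions, a first step could be to keep only insertion calls whose length is compatible with a TE. The Python sketch below assumes the calls are VCF records carrying the reserved INFO keys SVTYPE and SVLEN. The length bounds are illustrative and simply span the smallest and largest plant TE sizes quoted earlier from Zhao et al. (2016), that is, 0.08 to 21 kb. The records are invented, and a real workflow would also compare the inserted sequence with a TE library.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO column into a key -> value dictionary."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def candidate_te_insertions(vcf_lines, min_len=80, max_len=21000):
    """Return (chrom, pos, length) for insertion calls of TE-like size.
    The length bounds are illustrative assumptions, not tool defaults."""
    hits = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        info = parse_info(cols[7])
        if info.get("SVTYPE") != "INS":
            continue
        try:
            sv_len = abs(int(info.get("SVLEN", "")))
        except ValueError:
            continue  # missing or non-numeric length
        if min_len <= sv_len <= max_len:
            hits.append((cols[0], int(cols[1]), sv_len))
    return hits


# Invented records for demonstration only.
records = [
    "##fileformat=VCFv4.2",
    "chr1\t1000\tsv1\tN\t<INS>\t60\tPASS\tSVTYPE=INS;SVLEN=5200",
    "chr1\t5000\tsv2\tN\t<DEL>\t60\tPASS\tSVTYPE=DEL;SVLEN=-900",
    "chr2\t300\tsv3\tN\t<INS>\t60\tPASS\tSVTYPE=INS;SVLEN=35",
]
print(candidate_te_insertions(records))
```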

LANDSCAPE OF NON-TE REPEATS IN MAIZE

In addition to interspersed repeats, represented mainly by TEs, the maize genome (like that of most eukaryotes) also contains a large number of tandem repeats, such as satellite DNAs, ribosomal DNAs (rDNAs), and other elements. Although non-TE repeats do not have specific replication machinery to increase their copy number, they are proposed to arise and expand through mechanisms like replication slippage, rolling-circle replication, TE-derived tandem insertions (McGurk and Barbash 2018), and unequal crossover (Garrido-Ramos 2017).

The main types of maize tandem repeats are knob sequences, CentC repeats, ZmBs repeats, and rDNA. Knobs are large, highly variable blocks of heterochromatin found on several chromosomes (Bilinski et al. 2018; Liu et al. 2020a; Hufford et al. 2021). Knobs are composed mostly of a 180 bp tandem repeat (knob180), a 350 bp tandem repeat (TR-1 knob), and different groups of retrotransposons (Xiong et al. 2005; Lamb et al. 2007; Haberer et al. 2020). CentC repeat arrays are composed of 156 bp monomers and are found in all maize centromeres with significant size variability (Birchler and Han 2009; Wolfgruber et al. 2009). ZmBs repeats have 1 kb monomers and are specific to B chromosomes. They are distributed mainly in the pericentromeric region, although a smaller number of copies is found at the distal tip of the long arm (Alfenito and Birchler 1993; Blavet et al. 2021). rDNAs comprise 45S rDNA repeats with 8.8 kb units and 5S rDNA repeats with 320 bp units (Chen et al. 2023b), and are the main components of nucleolus organizer regions (NORs).

Between 0.87% and 4.56% of assembled maize genomes correspond to tandem repeats, and similarly to TEs, the amount of tandem repeats varies between different lineages (Hufford et al. 2021). In the telomere-to-telomere genome assembly of the maize Mo17 lineage, tandem repeats constitute 3.92% of the assembled genome, with satellite DNAs representing 2.7% and rDNAs representing 1.22% (Chen et al. 2023b). Importantly, those values, especially in the case of assemblies with gaps, could be underestimated, as long tandem repeat arrays are usually incomplete in genome assemblies (Jiao et al. 2017; Liu et al. 2020a; Ou et al. 2020). For instance, measures based on flow cytometry and fluorescence in situ hybridization (FISH) indicate that heterochromatic knobs can represent up to ∼15% of some maize genomes (Bilinski et al. 2018). However, the considerably higher values observed in this case could also be due to individual variation, as well as possible overestimations from flow cytometry methods.

Because tandem repeats are abundant and usually found as large heterochromatic blocks, they can be used to karyotype and identify maize chromosomes with cytogenetic techniques such as chromosome painting (Kato et al. 2004). Together with TEs, tandem repeats are responsible for a large portion of the genome size variation observed among lineages and can cause significant genome size changes in relatively short periods of time. For instance, TEs and knobs were shown to be associated with significant genome size variation within maize selfed lineages over a few generations (Roessler et al. 2019). In addition to genome size, tandem repeats have also been shown to influence maize phenotypes. For example, the amount of knob heterochromatin between maize populations was shown to correlate with their vegetative cell cycle length (Realini et al. 2015). Furthermore, variation in genome size, mediated predominantly by the abundance of heterochromatic knobs, was shown to correlate with cell size, rate of cell production, and flowering time in different maize populations (Jian et al. 2017; Bilinski et al. 2018).

Annotation of non-TE repeats typically relies on general repeat finders or homology with previously identified sequences, but structural approaches have begun to emerge as a solution to non-TE annotation in highly contiguous genome assemblies. For example, TRASH was developed for de novo annotation of satellite repeats (Wlodzimierz et al. 2023).
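The structural signal that such approaches exploit is periodicity: a tandem array largely matches itself when shifted by one monomer length. The Python sketch below is a naive illustration of that idea and is not the algorithm implemented in TRASH or Tandem Repeats Finder. The function name, the identity threshold, and the default maximum period (set above the 350 bp TR-1 monomer mentioned earlier) are assumptions.

```python
def tandem_period(seq: str, max_period: int = 400,
                  min_identity: float = 0.9):
    """Return the smallest shift at which the sequence matches itself
    with at least `min_identity`, or None if no such shift exists."""
    seq = seq.upper()
    for period in range(1, min(max_period, len(seq) // 2) + 1):
        compared = len(seq) - period
        matches = sum(seq[i] == seq[i + period] for i in range(compared))
        if matches / compared >= min_identity:
            return period
    return None


# Toy array: six copies of a made-up 7 bp monomer.
print(tandem_period("ACGGTCA" * 6))  # 7
print(tandem_period("ACGTTGCA"))     # None
```

The scan is quadratic in sequence length, so it is suitable only for short toy inputs, not for chromosome-scale arrays.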

APPROACHES FOR MAIZE REPEAT ANNOTATION

As part of this collection, we provide a protocol that describes four approaches for TE and non-TE repeat annotation in maize (see Benson et al. 2024). Each approach offers distinct advantages and can be carefully selected to best accommodate the scientific interests of the research project.

Approach 1 is homology-based and uses the tool RepeatMasker (https://www.repeatmasker.org/) to screen genomes for interspersed repeats and low-complexity sequences. This approach is the least computationally intensive among the four described in the accompanying protocol. With a high-quality, manually curated TE library to guide annotation in maize, the approach is suitable for quickly annotating TEs and is useful for researchers who are not focused on the repetitive fraction of genomes, for example, those working on gene annotation. Approach 2 uses the Extensive de novo TE Annotator (EDTA) (Ou et al. 2019), which is a package developed for automated whole-genome de novo TE annotation. It requires more computational resources but is arguably just as accessible as Approach 1 for researchers with a limited background in computational biology. Approach 2 uses both structure-based and homology-based annotation to allow the detection of rare or novel TEs and other repeats. This approach enables the identification of TE structural features, which might be desirable for users who study TE biology. Approach 3 is similar to Approach 2 in that it uses EDTA, but it is aimed at bolstering potentially low-quality or missing annotations by incorporating tools like AnnoSINE (Li et al. 2022), RepeatModeler v2 (Flynn et al. 2020), and Tandem Repeats Finder (Benson 1999). Approach 3 will be most useful for researchers with intermediate experience in bioinformatics who might be interested in an in-depth annotation to further explore TEs and repeats in their downstream analyses. Last, Approach 4 leverages panEDTA (Ou et al. 2022), which is based on the EDTA pipeline (Ou et al. 2019) and uses the same methodological backbone as Approach 2, but is designed for streamlined and automated annotation of pan-genome assemblies.
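To make the difference between Approaches 1 and 2 concrete, the Python sketch below assembles one possible command line for each tool without running it. This is an illustration and not the commands of the accompanying protocol (Benson et al. 2024). The file names, thread counts, and output directory are hypothetical, and the options shown should be checked against each tool's own help output before use.

```python
import shlex


def repeatmasker_command(genome, library, threads=4,
                         out_dir="repeatmasker_out"):
    """Homology-based run (Approach 1): mask a genome with a curated library."""
    return ["RepeatMasker",
            "-lib", library,        # custom repeat library (FASTA)
            "-pa", str(threads),    # parallel jobs
            "-gff",                 # also write a GFF annotation
            "-dir", out_dir,        # output directory
            genome]


def edta_command(genome, threads=4, curated_lib=None):
    """De novo run (Approach 2): structure- plus homology-based annotation."""
    cmd = ["EDTA.pl",
           "--genome", genome,
           "--species", "Maize",
           "--step", "all",
           "--anno", "1",           # whole-genome annotation after library building
           "--threads", str(threads)]
    if curated_lib:
        cmd += ["--curatedlib", curated_lib]
    return cmd


# Hypothetical file names, printed only; nothing is executed here.
print(shlex.join(repeatmasker_command("genome.fa", "maize_TE_lib.fa")))
print(shlex.join(edta_command("genome.fa", curated_lib="maize_TE_lib.fa")))
```

Building the argument list separately from execution makes it easy to log the exact command and to pass it to `subprocess.run` once the paths have been verified.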

These approaches exploit the strengths of an assortment of the latest tools and aim to thoroughly address the difficulties and limitations of annotating repetitive sequences. Furthermore, in addition to describing their implementation, we describe how to assess TE annotation quality and include scripts to help visually depict the repeat landscape. While these protocols are especially pertinent to maize genomic research, they may also be useful to researchers working on organisms other than maize. These methods offer a robust framework for TE and repeat annotation and will help enhance our understanding of maize genetics and beyond.

CONCLUDING REMARKS

Maize is an important model organism for studying complex eukaryotic genomes. The large number of lineages with high-quality sequenced genomes available in this species, together with its rich landscape of TEs and tandem repeats, represents a unique resource for researchers interested in understanding the importance of repetitive DNAs in plant genomes. This brief review on maize repeats and the introduction to maize repeat annotation methods, together with the accompanying protocol (Benson et al. 2024), should be useful for genomic researchers interested in these elements not only in maize, but also in other plant species.

COMPETING INTEREST STATEMENT

The authors declare no competing interests.

AUTHOR CONTRIBUTIONS

Conceptualization: S.O. Writing—original draft: P.H. and C.W.B. Writing—review and editing: P.H., C.W.B., and S.O.

ACKNOWLEDGMENTS

This work was supported by grant GR130542 of the Ohio State University (OSU) Enterprise for Research, Innovation, and Knowledge (ERIK) STEM Education Faculty Startup Awards and JobsOhio, and by the OSU start-up fund. We also thank Dr. Xingli Li and Dr. Ning Jiang for critical reading of this work.

Footnotes

  • From the Maize collection, edited by Candice N. Hirsch and Marna D. Yandeau-Nelson. The entire Maize collection is available online at Cold Spring Harbor Protocols and can be accessed at https://cshprotocols.cshlp.org/.

Cold Spring Harb Protoc 2025: pdb.top108441. © 2025 Cold Spring Harbor Laboratory Press