Assembling a Colossus
The loblolly pine genome is big. Bloated with retrotransposons and other repetitive sequences, it is seven times larger than the human genome and easily big enough to overwhelm standard genome assembly methods.
This forced the loblolly pine genome sequencing team, led by David Neale at the University of California, Davis, to look for ways to reduce the enormous complexity of their task.
The draft genome sequence, described in the latest issue of GENETICS and the journal Genome Biology, was pieced together from over 16 billion sequence reads. Spanning around 23 billion base pairs, it only just beats out the Norway spruce as the largest genome ever sequenced, but it is substantially more complete. For example, the N50 scaffold size of the current loblolly assembly is 66.9 Kbp, compared to 0.72 Kbp in the Norway spruce.
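For readers unfamiliar with the N50 statistic quoted above, here is a minimal sketch of how it is usually computed: sort the scaffolds from longest to shortest and report the length of the scaffold at which the running total first reaches half of the assembly. The function name `n50` and the toy lengths are illustrative assumptions, not values from either conifer assembly.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy assembly (made-up lengths, in Kbp), not real loblolly data.
toy_scaffolds = [80, 70, 50, 40, 30, 20]
print(n50(toy_scaffolds))  # 70
```

A higher N50 means that more of the assembly sits in long, contiguous pieces, which is why the statistic serves here as a rough measure of how complete and usable a draft genome is.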
So how did they do it?
One strategy was to generate most of the sequence from part of a single pine nut. This tiny source material was the megagametophyte, which is the haploid tissue that provides nutrients to the developing diploid embryo. Despite the limited amount of DNA that can be extracted from this source, the reduced complexity of a haploid genome makes it easier to assemble. To link up all the sequence fragments from the haploid genome, the team also created DNA libraries from diploid needles of the parent genotype.
To cope with the sheer volume of data, the team pre-processed the reads into “super reads”: larger chunks of contiguous haploid sequence, each of which condensed many individual reads. In essence, they dealt with the unambiguous parts of the problem first, getting rid of a huge amount of overlapping and redundant data in the process.
The result was a 100-fold reduction in the amount of megagametophyte sequence that needed to be held in the memory of the assembly computer. That kind of reduction is not just handy for giant genomes; study co-author Steven Salzberg of Johns Hopkins University says it also speeds up projects of more modest scale.
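The super-read idea can be illustrated with a toy sketch. It is not the team's actual software and rests on simplifying assumptions: reads are error-free, reverse complements are ignored, and the helper names (`kmers_of`, `extend`, `super_reads`) and the example sequence are hypothetical. Each read is extended one base at a time for as long as the k-mers seen in the data allow exactly one possible next base; reads that grow into the same string then collapse into a single super read.

```python
def kmers_of(seq, k):
    """List every k-mer (substring of length k) in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def extend(seq, kmer_set, k, direction):
    """Extend seq while exactly one k-mer in kmer_set continues it."""
    visited = set(kmers_of(seq, k))  # guards against looping on repeats
    while True:
        if direction == "right":
            anchor = seq[-(k - 1):]
            options = [anchor + b for b in "ACGT" if anchor + b in kmer_set]
        else:
            anchor = seq[:k - 1]
            options = [b + anchor for b in "ACGT" if b + anchor in kmer_set]
        # Stop at a dead end, at a fork (ambiguity), or on a revisit.
        if len(options) != 1 or options[0] in visited:
            return seq
        visited.add(options[0])
        if direction == "right":
            seq = seq + options[0][-1]
        else:
            seq = options[0][0] + seq


def super_reads(reads, k):
    """Collapse reads into the distinct maximal unambiguous extensions."""
    kmer_set = {km for r in reads for km in kmers_of(r, k)}
    extended = set()
    for read in reads:
        if len(read) >= k:
            grown = extend(read, kmer_set, k, "right")
            extended.add(extend(grown, kmer_set, k, "left"))
    return sorted(extended)


# Toy data: eight overlapping 8-base reads from a made-up 15-base sequence.
toy_genome = "ATGGCATTCAGGACT"
toy_reads = [toy_genome[i:i + 8] for i in range(8)]
print(super_reads(toy_reads, k=4))  # ['ATGGCATTCAGGACT']
```

In this toy case eight reads collapse into one super read, because no stretch of the sequence is ambiguous. Wherever two different bases could follow the same stretch, extension stops, and that ambiguity is left for the later assembly stages to resolve.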
Luckily, says Salzberg, the loblolly genome project wasn’t held back by the masses of repeats that are typical of conifers. Even though around 82% of the loblolly pine genome is repetitive, it turns out that most of the repeats are evolutionarily ancient. That means they have diverged enough to no longer be a big stumbling block for assembly.
All this is good news for sequencing other conifer species, especially since the team is already tackling an even larger behemoth: the 35 gigabase genome of the sugar pine.
Zimin A., Stevens K.A., Crepeau M.W., Holtz-Morris A., Koriabine M., Marcais G., Puiu D., Roberts M., Wegrzyn J.L., de Jong P.J., et al. (2014). Sequencing and Assembly of the 22-Gb Loblolly Pine Genome, Genetics, 196 (3) 875-890. DOI: 10.1534/genetics.113.159715
Wegrzyn J.L., Liechty J.D., Stevens K.A., Wu L.S., Loopstra C.A., Vasquez-Gross H.A., Dougherty W.M., Lin B.Y., Zieve J.J., Martinez-Garcia P.J., et al. (2014). Unique Features of the Loblolly Pine (Pinus taeda L.) Megagenome Revealed Through Sequence Annotation, Genetics, 196 (3) 891-909. DOI: 10.1534/genetics.113.159996
Neale D.B., Wegrzyn J.L., Stevens K.A., Zimin A.V., Puiu D., Crepeau M.W., Cardeno C., Koriabine M., Holtz-Morris A.E., Liechty J.D., et al. (2014). Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies, Genome Biology, 15 (3) R59. DOI: 10.1186/gb-2014-15-3-r59