This 2nd edition contains many new sections, sections that have gone through significant modifications, and some new or rearranged chapters. These are highlighted. Sections that describe advanced, often technical concepts that are not central to the main flow of the book are marked with a star. Use the Map of HTS to see how the main computational steps in high-throughput sequencing map to book chapters.
Part I – Preliminaries
1. Molecular biology and
high-throughput sequencing
1.1 DNA, RNA, proteins
1.2 Genetic variations
1.3 High-throughput sequencing
2. Algorithm design
2.1 Complexity analysis
2.2 Data representations
2.3 Reductions
2.4 Literature
3. Data structures
3.1 Dynamic range minimum queries
3.2 Bitvector rank and select operations
3.3 Wavelet tree
3.3.1 Balanced representation
3.3.2 Range queries
3.4 Static range minimum queries
3.4.1 From RMQs to LCAs through Cartesian tree
3.4.2 From LCAs to special RMQs
3.4.3 Solving short special RMQs
3.4.4 From large special RMQs to shorter general RMQs
3.4.5 Solving general RMQs
3.5 Hashing
3.5.1 Universal hashing
3.5.1 Approximate membership query
3.5.1 Rolling hash
3.5.1 Minimizers
3.6 Literature
4. Graphs
4.1 Directed acyclic graphs (DAGs)
4.1.1 Topological ordering
4.1.2 Shortest paths
4.2 Arbitrary directed graphs
4.2.1 Eulerian paths
4.2.2 Shortest paths and the Bellman-Ford method
4.3 Literature
5. Network flows
5.1 Flows and their decompositions
5.2 Minimum-cost flows and circulations
5.2.1 The residual graph
5.2.2 A pseudo-polynomial algorithm
5.3 Bipartite matching problems
5.3.1 Perfect matching
5.3.2 Matching with capacity constraints
5.3.3 Matching with residual constraints
5.4 Covering problems
5.4.1 Disjoint cycle cover
5.4.2 Minimum path cover in a DAG
5.5 Literature
Part IV – Genome-Scale Algorithms
10. Alignment-based genome analysis
10.1 Variation calling
10.1.1 Calling small variants
10.1.2 Calling large variants
10.3 Pattern partitioning
10.3 Dynamic programming along suffix tree paths
10.4 Backtracking on BWT indexes
10.4.1 Prefix pruning
10.4.2 Case analysis pruning with the bidirectional BWT index
10.5 Suffix filtering for approximate overlaps
10.6 Paired-end and mate pair reads
10.7 Algorithmic approach to variant selection
10.8 Literature
11. Alignment-free genome analysis and comparison
11.1 De novo variation calling
11.2 Space-efficient genome analysis
11.2.1 Maximal repeats
11.2.2 Maximal unique matches
11.2.3 Maximal exact matches
11.3 Comparing genomes without alignment
11.3.1 Substring and k-mer kernels
11.3.2 Substring kernels with Markovian correction
11.3.3 Substring kernels and matching statistics
11.3.4 Mismatch kernels
11.3.5 Compression distance
11.3.6 Approximating Jaccard similarity with min-hash
11.4 Literature
12. Compression of genome collections
12.1 Lempel–Ziv parsing
12.1.1 Basic algorithm for Lempel–Ziv parsing
12.1.2 Space-efficient Lempel–Ziv parsing
12.1.3 Space- and time-efficient Lempel–Ziv parsing
12.2 Bit-optimal Lempel–Ziv compression
12.3 Prefix-free parsing and run-length encoded BWT
12.3.1 Prefix-free parsing
12.3.2 Suffix sorting
12.3.3 Construction of the run-length BWT
12.4 Literature
13. Fragment assembly
13.1 Sequencing by hybridization
13.2 Contig assembly
13.2.1 Read error correction
13.2.2 Reverse complements
13.2.3 Irreducible overlap graphs
13.3 Scaffolding
13.4 Gap filling
13.5 Literature
Part V – Applications
14. Haplotype analysis
14.1 Haplotype assembly and phasing
14.1.1 Minimum error correction
14.1.2 Hardness
14.1.3 Dynamic programming
14.2 Haplotype matching
14.2.1 Haplotype matching in linear time
14.2.2 Positional Burrows–Wheeler transform
14.3 Literature
15. Pangenomics
15.1 Overview of pangenome representations
15.1.1 Colored de Bruijn graphs
15.1.2 Founder graphs and founder sequences
15.2 Aligning reads to pangenome
15.2.1 Indexing a set of individual genomes with a hybrid scheme
15.2.2 Indexing a set of individual genomes with the r-index
15.2.3 Indexing a reference genome and a set of variants
15.3 Variation calling over pangenomes
15.3.1 Analysis of alignments on a set of individual genomes
15.3.2 Analysis of alignment on the labeled DAG of a population
15.3.3 Evaluation of variation calling results
15.4 Literature
16. Transcriptomics
16.1 Split alignment of reads
16.2 Estimating the expression of annotated transcripts
16.3 Transcript assembly
16.3.1 Short reads
16.3.2 Long reads
16.3.3 Paired-end reads
16.4 Simultaneous assembly and expression estimation
16.5 Transcript alignment with minimizers and co-linear chaining
16.6 Literature
17. Metagenomics
17.1 Species estimation
17.1.1 Single-read methods
17.1.2 Multi-read and coverage-sensitive methods
17.2 Read clustering
17.2.1 Filtering reads from low-frequency species
17.2.2 Initializing clusters
17.2.3 Growing clusters
17.3 Comparing metagenomic samples
17.3.1 Sequence-based methods
17.3.2 Read-based methods
17.3.3 Multi-sample methods
17.4 Literature