Genome-Scale Algorithm Design

We maintain here the list of all typos and mistakes found after the book was published.

If you find a new mistake, please let us know via email and we will publish it here.

3. Data Structures

Page 23, Example 3.1: Table second, value 11 should be 2 and value 13 should be 4. (observed at a study group of String Processing Algorithms course, Autumn 2018)
Page 23, Example 3.1: Table third, row 100 should contain 111 rather than 100. (reported by Anna Kuosmanen)
See also illustration of the above corrections by Baraa Orabi and Faraz Hach.
Page 23, line -5, definition of select: it should say maximal i such that rank(B,i)=j AND B[i]=1.
Page 24, down poistions
Page 26, lines 10-15: Indexes i_v and i_w become wrongly updated, as they shift to the left (outside the query interval), although they should shift to the right so that they remain inside the query interval. A fix is to use rank_*(B_v, i_v-1)+1 and rank_*(B_w, i_w-1)+1 to update the coordinates when going down.
- To help infer the idea of the correct algorithm, here is a simulation on word pääsiäinen.

4. Graphs

4.2.1 Eulerian paths

Page 34, Corollary 4.4: the third bullet point |N-(v)| = |N-(v)| should be |N-(v)| = |N+(v)| (reported by Masahiro Kanai)

Exercises

Page 39, Exercise 4.16: c : V(G) → Q ⟹ c : E(G) → Q
Page 39, Exercise 4.16: that passes through at most t sets of the partition S ⟹ that changes the partite sets of S at most t times
Page 39, Exercise 4.17: c : V(G) → Q ⟹ c : E(G) → Q

5. Network flows

5.4 Covering problems

Page 61, Figure 5.9(b): the edge from f_out to g_in should have cost €3, and the one from g_out to h_in should have cost €4 (see Figure 5.9(a))

6. Alignments

6.1.2 Shortest detour

Page 76, line 24: in O(((n-m)+2x)-m) = O((t/delta)m), O(((n-m)+2x)-m) should be O(((n-m)+2x)m) (reported by Masahiro Kanai).
Page 95, line -7,: s_{n+1} should be b_{n+1} (reported by Massimo Equi).

7. Hidden Markov models (HMMs)

Insight 7.1

Number 23 should be 22 through the example (number of non-coding emissions).

Exercises

Page 123, Exercise 7.4: S = s_n...s_n ⟹ S = s₁...s_n (reported by Rob Patro)

8. Classical Indexes

8.2 Suffix array

Page 134, line -4: "Now let [...] W ⁱ ∈ [1.. \sigma]* be a sequence of strings" ⟹ "Now let [...] W ⁱ ∈ [1.. \sigma]⁺ be a sequence of nonempty strings" (reported by Roman Cheplyaka).
Page 135, line 5: "for a given integer p consider the set of distinct prefixes S ^p = { W ^k_1..p | k ∈ [1..n], p ∈ [1..m] }" ⟹ "for a given integer p ∈ [1..m], consider the set of distinct prefixes S ^p = { W ^k_1..p | k ∈ [1..n] }" (reported by Roman Cheplyaka)
Page 135, Lemma 8.7: "A sequence of strings [...] | W ⁱ ∈ [1.. \sigma]* " ⟹ "A sequence of nonempty strings [...] | W ⁱ ∈ [1.. \sigma]⁺ " (reported by Roman Cheplyaka)
Page 138, last line: c = T[\tilde{SA}[k]] ⟹ c = T[\tilde{SA}[k]-1] (reported by Roman Cheplyaka)
Page 137, line 12: S[T]=0 ⟹ S[T]=1 (reported by Roman Cheplyaka).

8.3 Suffix tree

Page 141, Figure 8.4: in the suffix-link tree, the path from the root labelled by GAGG should be relabelled GAGA. The suffix-link tree should also contain an additional path from the root labelled by GCGAGA#.
Page 144, line -6: a tree on n leaves ⟹ a tree of size O(n) on n leaves
Page 144, proof of Lemma 8.17, serious flaw: The statement (with above fix) is correct, but the proof has a flaw: The proof omits the case when "(" is deleted, and pointers to it become invalid. This can be solved adding a union-find data structure, but the running time is no longer linear. However, there exist also correct linear time solutions: see Gabow & Tarjan, STOC 1983, or a recent simplified and practical solution Alzamel, Charalampopoulos, Iliopoulos, and Pissis, IWOCA 2017. Proof of Theorem 8.18 uses Lemma 8.17, but there is also a direct proof using properties of the suffix tree. Also Theorem 8.21 (document counting) uses Lemma 8.17. Another way to convince on the correctness of the claim of Lemma 8.17 is to read The LCA Problem Revisited by Bender and Farach-Colton for the online version of the problem. (reported by the authors)

8.4 Applications of the suffix tree

Page 147, line 10: if V is the leaf ⟹ if v is the leaf (reported by Monika Balvočiūtė)

8.5 Literature

Page 152, line -6: ... whose presentation we partly followed to introduce the repeat finding concepts. ⟹ Add the following sentence: Example applications of such concepts to genome analysis are long terminal repeat retrotransposons, k-mismatch repeats, and transposable element modules, which can be detected using maximal repeats as seeds (Ellinghaus et al. 2008, Rho et al. 2007, Volfovsky et al. 2001, Kurtz et al. 2001, Tempel et al. 2010), and ultraconserved elements, which correspond to maximal unique matches or to high-scoring alignments across multiple genomes (see e.g. Bejerano et al. 2004, Isenbarger et al. 2008).
Page 153, line 4: Crochemore & Vérin (1997b) ⟹ Blumer et al. (1987)
Page 153, line 5: Apostolico & Lonardi (2002) and Crochemore & Vérin (1997a) ⟹ Gusfield (1997), Crochemore & Vérin (1997a), and Crochemore & Vérin (1997b)

9. Burrows-Wheeler indexes

9.7.1 Frequency-oblivious representation

Page 192, line 4: end in a vertex prefixed by c₁W ⟹ end in vertex c₁W

9.7.3 Space-efficient construction

Page 193, line -18: Both the frequency-aware and the frequency-oblivious representations of DBG_T#,k for some string T ⟹ Both the frequency-aware and the frequency-oblivious representations of the de Bruijn graph of a string T

9.8 Literature

Page 195, line -17: is due to Kulekci et al. (2012). ⟹ is described in Kulekci et al. (2012) and in Okanohara et al. (2009).

Exercises

Exercise 9.7: extendRight() and enumerateRight() are not used by Algorithm 9.3, so there is no need to consider enumerateRightExtended() (reported by Jarno Alanko).

11. Genome analysis and comparison

11.1 Space-efficient genome analysis

Page 224, Algorithm 11.2, lines 12,13,14,18: replace I with P (reported by Jarno Alanko).
Page 228, Algorithm 11.3, line 5: the line could be shortened to: "if j₁ < i₁ then" (reported by Monika Balvočiūtė)

11.2 Comparing genomes without alignment

Page 237, line 4: the value of the k-mer kernel is ⟹ the value of the kernel is

11.3 Literature

Page 255, line -8: The space-efficient computation ... (2014). ⟹ Add the following sentence: The computation of the same kernels on generalized suffix trees is described in (Rieck et al. 2006, Rieck & Laskov 2008).
Page 256, line 4: ... and a geometric treatment... is described in Benham et al. (2013). ⟹ Add the following sentence: Generalizations to measure the similarity of all strings in a set, and applications to regulatory DNA, are described in Ren et al. (2013).
Page 256, line 15: The connection between ... was established in Apostolico & Bejerano (2000). ⟹ Add the following sentence: Probabilistic suffix tries were then implemented using lazy suffix trees and enhanced suffix arrays (Lin et al. 2012; Schulz et al. 2008).

Exercises

Page 257, Exercise 11.7 (b): σ^k+1 ⟹ σ^k+1 - σ^k + 1 (reported by Jarno Alanko).

13. Fragment assembly

13.5 Literature

Page 290, second paragraph: The condition for redundancy is not correct, as there could be different overlaps between the nodes. Therefore the removal of redundant arcs is not as simple as claimed. The fix is to use the direct construction of Insight 13.3. (reported by Chirag Jain)
Page 299, last line: ... of the Chinese postman problem to assembly. ⟹ Add the following sentence: Limiting assembly to k-mers with high frequency in the de Bruijn graph has been used to build libraries of genomic repeats (Koch et al. 2014).

Exercises

Page 304, Line 4: conting ⟹ contig

14. Genomics

Exercises

Page 323, Exercise 14.3: This greedy algorithm is in essence the same as for set cover, Exercise 14.4, so the claimed bound does not hold. One could restate this exercise as "give an input where the greedy algorithm produces a suboptimal solution".

15. Transcriptomics

15.4 Transcript alignment with co-linear chaining

Page 343, Problem 15.7, first bullet: 1 ≤ j ≤ p ⟹ 2 ≤ j ≤ p

References

Page 371: add the following: Bejerano, G., Pheasant, M., Makunin, I., Stephen, S., Kent, W. J., Mattick, J. S., & Haussler, D. (2004). 'Ultraconserved elements in the human genome', Science, 304(5675), 1321-1325.
Page 372: add the following: Blumer, A., Blumer, J., Haussler, D., McConnell, R., & Ehrenfeucht, A. (1987), 'Complete inverted files for efficient text retrieval and analysis', Journal of the Association for Computing Machinery 34(3), 578-595.
Page 374: add the following: Ellinghaus, D., Kurtz, S., & Willhoeft, U. (2008), 'LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons', BMC Bioinformatics 9(18).
Page 377: add the following: Isenbarger, T.A., Carr, C.E., Stewart Johnson, S., Finney, M., Church, G.M., Gilbert, W., Zuber, M.T., & Ruvkun, G. (2008), 'The most conserved genome segments for life detection on Earth and other planets', Origins of Life and Evolution of Biospheres 38(6), 517-533.
Page 377: add the following: Koch, P., Platzer, M., & Downie, B. R. (2014), 'RepARK - De novo creation of repeat libraries from whole-genome NGS reads', Nucleic Acids Research 42(9), e80-e80.
Page 378: add the following: Kurtz, S., Choudhuri, J. V., Ohlebusch, E., Schleiermacher, C., Stoye, J., & Giegerich, R. (2001), 'REPuter: the manifold applications of repeat analysis on a genomic scale', Nucleic Acids Research 29(22), 4633-4642.
Page 378: add the following: Lin, J., Adjeroh, D. & Jiang, B.H. (2012), 'Probabilistic suffix array: efficient modeling and prediction of protein families', Bioinformatics 28(10), 1314-1323.
Page 380: add the following: Okanohara D. & Tsujii, J. (2009), Text categorization with all substring features, in Proceedings of the 2009 SIAM International Conference on Data Mining. SIAM, 838-846.
Page 381: add the following: Ren, J., Song, K., Sun, F., Deng, M., & Reinert, G. (2013), 'Multiple alignment-free sequence comparison', Bioinformatics 29(21), 2690-2698.
Page 381: add the following: Rho, M., Choi, J.-H., Kim, S., Lynch, M., & Tang, H. (2007), 'De novo identification of LTR retrotransposons in eukaryotic genomes', BMC Genomics 8(90).
Page 381: add the following: Rieck, K., Laskov, P. & Sonnenburg, S. (2006), Computation of similarity measures for sequential data using generalized suffix trees, in Proceedings of the 2006 conference on Advances in Neural Information Processing Systems, Vol. 19. Cambridge: MIT Press, 1177-1184.
Page 381: add the following: Rieck, K. & Laskov, P. (2008), 'Linear-time computation of similarity measures for sequential data', Journal of Machine Learning Research 9(2008), 23-48.
Page 382: add the following: Schulz, M.H., Weese, D., Rausch, T., Döring, A., Reinert, K. & Vingron, M. (2008), 'Fast and adaptive variable order Markov chain construction', in Algorithms in Bioinformatics, Vol. 5251 of Lecture Notes in Computer Science. Berlin: Springer, 306-317.
Page 383: add the following: Tempel, S., Rousseau, C., Tahi, F., & Nicolas, J. (2010), 'ModuleOrganizer: detecting modules in families of transposable elements', BMC Bioinformatics 11(474).
Page 384: add the following: Volfovsky, N., Haas, B.J. & Salzberg, S.L. (2001), 'A clustering method for repeat analysis in DNA sequences', Genome Biology 2(8).