University of Helsinki Department of Computer Science
 


Seminar: Machine Learning in Bioinformatics

58309106 Seminar: Machine Learning in Bioinformatics (3 cr)
Instructor: Professor Juho Rousu
Time and place: 08.09-06.10, 27.10-01.12, Monday 12-14, C221

Prerequisites and enrolling in the course

The prerequisite is the course Introduction to Bioinformatics or equivalent background knowledge of the topic. Prior knowledge of machine learning techniques will be helpful.

Enroll in the seminar through the registration system.

Note: If demand exceeds the seminar capacity, students in the Master's programme in bioinformatics will have priority.

Participants

Abhishek Tripathi, Dorothee Girbig, Esa Pitkänen, Hitomi Hasegawa, Ignacio Fernandez, Juan Fernandez, Jurkka Näsänen, Krishnan Narayanan, Laura Langohr, Maria Yli-Heikkilä, Markus Heinonen, Pekka Parviainen, Ping Chen, Yuan Zou
| Name | Topic | Seed paper(s) | Reviewers |
|---|---|---|---|
| Abhishek Tripathi | Protein function prediction | #12 | Ping Chen, Krishnan Narayanan |
| Pekka Parviainen | Haplotype Inference | #5 | Ignacio Fernandez, Jurkka Näsänen |
| Laura Langohr | Gene expression/Cluster analysis | #14 | Jurkka Näsänen, Maria Yli-Heikkilä |
| Ping Chen | Protein structure and function prediction | #8, #11 | Juan Fernandez, Krishnan Narayanan |
| Jurkka Näsänen | Gene expression profiling/SVM | #13 | Laura Langohr, Maria Yli-Heikkilä |
| Maria Yli-Heikkilä | micro-RNA modelling | #22 | Laura Langohr, Yuan Zou |
| Yuan Zou | Gene prediction | #2 | Ignacio Fernandez, Hitomi Hasegawa |
| Hitomi Hasegawa | Protein-protein interaction prediction | #21 | Dorothee Girbig, Pekka Parviainen |
| Dorothee Girbig | Biological network inference | #15, #16, #20 | Juan Fernandez, Abhishek Tripathi |
| Juan Fernandez | Protein identification | #10 | Dorothee Girbig, Hitomi Hasegawa |
| Ignacio Fernandez | SNP discovery | #6 | Pekka Parviainen, Yuan Zou |
| Krishnan Narayanan | Protein fold and remote homology detection | #9 | Ping Chen, Abhishek Tripathi |

Seminar goals

Machine learning is one of the key technologies in bioinformatics, making it possible to automatically generate predictive models from data. In this seminar we will get an overview of how machine learning techniques are used in bioinformatics.

We will look at various prediction problems, including

We will look at machine learning techniques in the context of the above-mentioned biological problems, including representative approaches of

Completing the seminar

The language of the seminar is English. To pass the seminar, you need to do the following four tasks:

During Period I all students write their papers in English. The length of the paper is 6-10 pages formatted according to the format given below. The oral presentations, during Period II, should last for about 30-40 minutes, which should leave some time for questions.

Grading

Students will be graded based on i) their written paper (40%), ii) their oral presentation (40%), and iii) their activity in commenting on other students' work and participating in the discussion (20%). To pass the course, the student must write the paper on the agreed subject and present their work. In addition, each student is required to attend at least 80% of the seminar presentations.

Grading is on the scale 0-5 (0 = fail, 5 = excellent).
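
As an illustration of the weighting above, the following minimal sketch combines the three components into one number. The weights and the 80% attendance requirement come from the text. The function name `seminar_grade`, the assumption that each component is itself graded on the 0-5 scale, and the choice to leave the result unrounded are hypothetical; how the final grade is actually rounded is not stated here.

```python
# Weights and attendance threshold are taken from the grading description.
# Everything else in this sketch is an assumption made for illustration.
WEIGHTS = {"paper": 0.40, "presentation": 0.40, "activity": 0.20}
MIN_ATTENDANCE = 0.80  # required share of seminar presentations attended


def seminar_grade(paper, presentation, activity, attended, total_sessions):
    """Return a weighted grade on the 0-5 scale (hypothetical helper).

    Assumptions: each component is graded 0-5, and a missing paper or
    presentation is passed as None.
    """
    # Pass requirements: a paper, a presentation, and at least 80% attendance.
    if paper is None or presentation is None:
        return 0.0
    if attended / total_sessions < MIN_ATTENDANCE:
        return 0.0
    # Weighted sum of the three graded components.
    return (WEIGHTS["paper"] * paper
            + WEIGHTS["presentation"] * presentation
            + WEIGHTS["activity"] * activity)
```

For example, under these assumptions, component grades of 4 (paper), 5 (presentation) and 3 (activity) with 10 of 12 presentations attended give 0.4 * 4 + 0.4 * 5 + 0.2 * 3 = 4.2.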

Schedule

| Date | Time | Name/Agenda/Deadline |
|---|---|---|
| 8.9 | 12.15-14 | Organization, selection of topics |
| 15.9 | | Personal guiding sessions, room A239b |
| | 12.15-12.30 | Krishnan Narayanan |
| | 12.30-12.45 | Yuan Zou |
| | 12.45-13.00 | Maria Yli-Heikkilä |
| | 13.00-13.15 | Ignacio Fernandez |
| | 13.15-13.30 | Pekka Parviainen |
| | 13.30-13.45 | Ping Chen |
| | 13.45-14.00 | Markus Heinonen |
| 22.9 | | Personal guiding sessions, room A239b (continued) |
| | 12.15-12.30 | Abhishek Tripathi |
| | 12.30-12.45 | Juan Fernandez |
| | 12.45-13.00 | Laura Langohr |
| | 13.15-13.30 | Jurkka Näsänen |
| | 13.30-13.45 | Dorothee Girbig |
| | 13.45-14.00 | Hitomi Hasegawa |
| 29.9 | - | No session |
| 6.10 Deadline | 12.00 | Deadline for submitting the first paper draft (via email to the two reviewers and to Juho Rousu) |
| 13.10 Deadline | 12.00 | Reviews of the paper drafts submitted (via email to the author and to Juho) |
| 13.10 Presentation | 12.15-13.00 | Krishnan Narayanan |
| 20.-26.10 | - | No session (Period break) |
| 27.10 Presentations | 12.15-13.00 | Yuan Zou |
| | 13.15-14.00 | |
| 3.11 Presentations | 12.15-13.00 | Ignacio Fernandez |
| | 13.15-14.00 | Pekka Parviainen |
| 10.11 Presentations | 12.15-13.00 | Ping Chen |
| | 13.15-14.00 | Dorothee Girbig |
| 17.11 Presentations | 12.15-13.00 | Abhishek Tripathi |
| | 13.15-14.00 | Juan Fernandez |
| 24.11 Presentations | 12.15-13.00 | Laura Langohr |
| | 13.15-14.00 | Jurkka Näsänen |
| 1.12 Presentations | 12.15-13.00 | Hitomi Hasegawa |
| | 13.15-14.00 | |
| 5.12 Deadline | 12.00 | Final seminar paper returned, via email to Juho |

Guidelines

Some additional guidelines for this seminar are given below. Further helpful material can be found on the home page of the scientific writing course of the Department of Computer Science.

Layout of the seminar paper

Using literature

Sources of information

Categories of information sources for the seminar, in order of preference:
  1. High-quality journals in bioinformatics, computer science, statistics, as well as biological and medical sciences. These are the preferred source of seminar material. A non-exhaustive list of suitable journals: Bioinformatics, BMC Bioinformatics, Data Mining and Knowledge Discovery, Journal of Computational Biology, Journal of Machine Learning Research, Machine Learning.
  2. Proceedings of high-quality conferences in computational biology and machine learning: Intelligent Systems for Molecular Biology (ISMB), Research in Computational Molecular Biology (RECOMB), International Conference on Machine Learning (ICML), Neural Information Processing Systems (NIPS).
  3. Textbooks contain high-quality information. However, as the publication process of books takes a long time, the information in textbooks is rarely the latest in science. Textbooks can be used as sources of information, but they should always be accompanied by journal and conference papers.
  4. Wikipedia contains a lot of information and is sometimes a good source for getting an overview of the seminar topic. However, the quality of Wikipedia articles varies. In particular, the peer-review process behind a Wikipedia article is not always at the same level as that of high-quality scientific journals and conferences. As a consequence, Wikipedia sometimes contains opinions of small groups of scientists that are not shared by the research community. Guideline: You may use Wikipedia as a means to learn about a topic. However, avoid using Wikipedia as the only source of information. Always verify the facts using other sources of information. Whenever possible, rely on journal and conference articles.
  5. Online course material is widely available on the web. It should be used with even more caution than Wikipedia. Some courses are very good and some are not, and there is no peer-review process behind the material. Online courses should not be used as references in your seminar paper.
  6. The rest of the WWW. A random web page of some individual, organization or group about some subject typically has very little quality control behind it. This material is not suitable as seminar paper material.

Finding information


A combination of two search strategies will lead to the best results

Oral presentation

Giving and receiving feedback

Two golden rules:

A matrix of factors affecting grading in scientific writing may be used as a basis for feedback. However, you are NOT supposed to give grades with your feedback, only suggestions and comments.

Seminar material and topics

The seminar will be based on recent scientific articles and textbooks. The following survey article will be a useful starting point:

The following are preselected topics, each with an associated seed article. A seed article is meant to be used as a starting point for a literature search, not as the only or the best reference on a topic.
In addition to the preselected topics, you may suggest your own topic.

Gene prediction

  1. Axel Bernal, Koby Crammer, Artemis Hatzigeorgiou, Fernando Pereira. Global Discriminative Learning for Higher Accuracy Computational Gene Prediction. PLoS Computational Biology 3(3): e54
  2. Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern and Peter Meinicke. Gene prediction in metagenomic fragments: A large scale machine learning approach. BMC Bioinformatics 2008, 9:217

Protein-DNA binding

  3. Nitin Bhardwaj, Robert E. Langlois, Guijun Zhao, and Hui Lu. Kernel-based machine learning protocol for predicting DNA-binding proteins. Nucleic Acids Res. 2005; 33(20): 6486-6493.
  4. Pengyu Hong, X. Shirley Liu, Qing Zhou, Xin Lu, Jun S. Liu, and Wing H. Wong. A boosting approach for motif modeling using ChIP-chip data. Bioinformatics 2005 21: 2636-2643

SNPs and haplotyping

  5. Eric P. Xing, Michael I. Jordan, Roded Sharan. Bayesian Haplotype Inference via the Dirichlet Process. Journal of Computational Biology. April 1, 2007, 14(3): 267-284.
  6. Lakshmi Matukumalli, John Grefenstette, David Hyten, Ik-Young Choi, Perry B Cregan and Curtis P Van Tassell. Application of machine learning in SNP discovery. BMC Bioinformatics 2006, 7:4

Protein structural classification and structure prediction

  7. Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie. Multi-class Protein Classification Using Adaptive Codes. Journal of Machine Learning Research 8 (2007), 1557-1581
  8. Scott Montgomerie, Joseph Cruz, Savita Shrivastava, David Arndt, Mark Berjanskii, David Wishart. PROTEUS2: a web server for comprehensive protein structure prediction and structure-based annotation. Nucleic Acids Research, 2008, Vol. 36, No. suppl_2 W202-W209
  9. Theodoros Damoulas and Mark Girolami. Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection. Bioinformatics 24, 10 (2008):1264-1270

Protein identification

  10. Joshua Elias, Francis Gibbons, Oliver King, Frederick Roth and Steven Gygi. Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nature Biotechnology 22, 2 (2004), 214-

Protein function prediction

  11. Iddo Friedberg. Automated protein function prediction - the genomic challenge. Briefings in Bioinformatics 2006 7(3):225-242
  12. Igor V. Tetko, Igor V. Rodchenkov, Mathias C. Walter, Thomas Rattei and Hans-Werner Mewes. Beyond the 'best' match: machine learning annotation of protein sequences by integration of different sources of information. Bioinformatics 2008 24(5):621-628

Gene expression profiling

  13. Alexander Statnikov, Constantin F. Aliferis, Ioannis Tsamardinos, Douglas Hardin and Shawn Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005 21(5):631-643
  14. Daxin Jiang, Chun Tang, Aidong Zhang. Cluster analysis for gene expression data: a survey. IEEE Transactions on Knowledge and Data Engineering 16, 11 (2004), 1370-1386

Biological network inference

  15. Florian Markowetz and Rainer Spang. Inferring cellular networks - a review. BMC Bioinformatics 8, 6 (2007), S5
  16. Jean-Philippe Vert. Reconstruction of biological networks by supervised machine learning approaches. Technical Report HAL-00283945, June, 2008.
  17. Ashwin Srinivasan and Ross D. King. Incremental Identification of Qualitative Models of Biological Systems using Inductive Logic Programming.

Gene (regulatory) networks

  18. Robert Castelo and Alberto Roverato. A Robust Procedure for Gaussian Graphical Model Search From Microarray Data With p Larger Than n. Journal of Machine Learning Research 7 (2006), 2621-2650.
  19. Jason Ernst, Oded Vainas, Christopher Harbison, Itamar Simon and Ziv Bar-Joseph. Reconstructing Dynamic Regulatory Maps. Molecular Systems Biology 4 (2007):74

Protein-protein interaction networks

  20. Krogan et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440 (2006), 637-643
  21. Yanjun Qi, Ziv Bar-Joseph, Judith Klein-Seetharaman. Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: Structure, Function, and Bioinformatics 63, 3 (2006), 490-500