Seminar: Machine Learning in Bioinformatics

58309106 Seminar: Machine Learning in Bioinformatics (3 cr)
Instructor: Professor Juho Rousu
Time and place: 08.09-06.10 and 27.10-01.12, Mondays 12-14, room C221

Prerequisites and enrolling in the course

The course Introduction to Bioinformatics, or equivalent background knowledge of the topic, is required. Prior knowledge of machine learning techniques will be helpful.

Enroll in the seminar through the registration system.

Note: If demand exceeds the seminar capacity, students in the Master's programme in bioinformatics will have priority.

Participants

Abhishek Tripathi, Dorothee Girbig, Esa Pitkänen, Hitomi Hasegawa, Ignacio Fernandez, Juan Fernandez, Jurkka Näsänen, Krishnan Narayanan, Laura Langohr, Maria Yli-Heikkilä, Markus Heinonen, Pekka Parviainen, Ping Chen, Yuan Zou

Name	Topic	Seed paper(s)	Reviewers
Abhishek Tripathi	Protein function prediction	#12	Ping Chen, Krishnan Narayanan
Pekka Parviainen	Haplotype Inference	#5	Ignacio Fernandez, Jurkka Näsänen
Laura Langohr	Gene expression/Cluster analysis	#14	Jurkka Näsänen, Maria Yli-Heikkilä
Ping Chen	Protein structure and function prediction	#8,#11	Juan Fernandez, Krishnan Narayanan
Jurkka Näsänen	Gene expression profiling/SVM	#13	Laura Langohr, Maria Yli-Heikkilä
Maria Yli-Heikkilä	micro-RNA modelling	#22	Laura Langohr, Yuan Zou
Yuan Zou	Gene prediction	#2	Ignacio Fernandez, Hitomi Hasegawa
Hitomi Hasegawa	Protein-protein interaction prediction	#21	Dorothee Girbig, Pekka Parviainen
Dorothee Girbig	Biological network inference	#15,#16,#20	Juan Fernandez, Abhishek Tripathi
Juan Fernandez	Protein identification	#10	Dorothee Girbig, Hitomi Hasegawa
Ignacio Fernandez	SNP discovery	#6	Pekka Parviainen, Yuan Zou
Krishnan Narayanan	Protein fold and remote homology detection	#9	Ping Chen, Abhishek Tripathi

Seminar goals

Machine learning is one of the key technologies in bioinformatics, making it possible to automatically generate predictive models from data. In this seminar we will get an overview of how machine learning techniques are used in bioinformatics. We will look at various prediction problems, including

prediction problems in biosequences (gene prediction, splice site prediction)
structure prediction (RNA, protein)
protein function prediction
interaction network prediction (protein-protein, gene regulation)

We will look at machine learning techniques in the context of the above-mentioned biological problems, including representative approaches to

classification methods (classification trees, nearest neighbor, neural networks, support vector machines)
clustering (partition-based, hierarchical, mixture models)
probabilistic graphical models (hidden Markov models, Bayesian networks)

Completing the seminar

The language of the seminar is English. To pass the seminar, you need to do the following four tasks:

Write a paper about a topic agreed on during the first meetings,
Review two papers written by other students,
Prepare a presentation and discuss it with the other students, and
Participate in the seminar by asking questions, raising discussions on the topic, and reviewing other students' work.

During Period I all students write their papers in English. The paper should be 6-10 pages long and follow the format given below. The oral presentations, held during Period II, should last about 30-40 minutes, which leaves some time for questions.

Grading

Students will be graded based on i) their written paper (40%), ii) their oral presentation (40%), and iii) their activity in commenting on other students' work and participating in the discussion (20%). To pass the course, the student must write the paper on the agreed subject and present their work. In addition, each student is required to attend at least 80% of the seminar presentations.

Grading will be on the scale 0-5 (0 = fail, 5 = excellent).
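
The weights above can be illustrated with a small sketch. The page does not say how the component scores are combined or rounded, so this assumes (hypothetically) that each component is scored on the same 0-5 scale and that the overall grade is their weighted average. The function name `weighted_grade` and the example scores are made up for illustration.

```python
# Weights taken from the grading description: paper 40%, presentation 40%, activity 20%.
WEIGHTS = {"paper": 0.40, "presentation": 0.40, "activity": 0.20}


def weighted_grade(scores):
    """Return the weighted average of the component scores.

    Assumption (not stated on the course page): every component is
    scored on the same 0-5 scale as the final grade.
    """
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 5:
            raise ValueError(f"{name} score must be between 0 and 5")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical example scores.
example = weighted_grade({"paper": 4, "presentation": 5, "activity": 3})
print(round(example, 1))
```

How the instructor rounds the result, and how the attendance requirement enters, is not specified here, so the sketch leaves both out.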

Schedule

Date	Time	Name/Agenda/Deadline
8.9	12.15-14	Organization, selection of topics
15.9 Personal guiding sessions, room A239b	12.15-12.30	Krishnan Narayanan
	12.30-12.45	Yuan Zou
	12.45-13.00	Maria Yli-Heikkilä
	13.00-13.15	Ignacio Fernandez
	13.15-13.30	Pekka Parviainen
	13.30-13.45	Ping Chen
	13.45-14.00	Markus Heinonen
22.9 Personal guiding sessions, room A239b (continued)	12.15-12.30	Abhishek Tripathi
	12.30-12.45	Juan Fernandez
	12.45-13.00	Laura Langohr
	13.15-13.30	Jurkka Näsänen
	13.30-13.45	Dorothee Girbig
	13.45-14.00	Hitomi Hasegawa
29.9	-	No session
6.10 Deadline	12.00	Deadline for submitting the first paper draft (via email to the two reviewers and to Juho Rousu)
13.10 Deadline	12.00	Reviews of the paper drafts submitted (via email to the author and to Juho)
13.10 Presentation	12.15-13.00	Krishnan Narayanan
20.-26.10	-	No session (Period break)
27.10 Presentations	12.15-13.00	Yuan Zou
27.10 Presentations	13.15-14.00
3.11 Presentations	12.15-13.00	Ignacio Fernandez
3.11 Presentations	13.15-14.00	Pekka Parviainen
10.11 Presentations	12.15-13.00	Ping Chen
10.11 Presentations	13.15-14.00	Dorothee Girbig
17.11 Presentations	12.15-13.00	Abhishek Tripathi
17.11 Presentations	13.15-14.00	Juan Fernandez
24.11 Presentations	12.15-13.00	Laura Langohr
24.11 Presentations	13.15-14.00	Jurkka Näsänen
1.12 Presentations	12.15-13.00	Hitomi Hasegawa
1.12 Presentations	13.15-14.00
5.12 Deadline	12.00	Final seminar paper returned via email to Juho Rousu

Guidelines

Some additional guidelines for this seminar are given below. Further helpful material can be found on the home page of the Scientific Writing course of the Department of Computer Science.

Layout of the seminar paper

The paper should be formatted according to the general format (cover etc.) used on the Scientific Writing course (see the example layout).
Using LaTeX is recommended. LaTeX templates are available from http://www.cs.helsinki.fi/u/kurhila/tiki_k2007/coursedescription.html

Using literature

The seminar paper should cover the biological problem and the computational methods used to solve the problem. You may need to use separate sources for the application and for the method.
Try to locate the best papers about the topic. You will probably end up reading more papers than you will eventually use.
How many references should you use and cite? A rule of thumb is "as many references as there are pages in the paper". This does not mean that you will write exactly one page about each reference; some require more than others.
Try to make a synthesis of the literature. What is the main message of the papers about some topic? How do the individual papers relate to or deviate from this main message?

Sources of information

Categories of information sources for the seminar, in order of preference:

High-quality journals in bioinformatics, computer science, statistics, as well as biological and medical sciences. These are the preferred source of seminar material. A non-exhaustive list of suitable journals: Bioinformatics, BMC Bioinformatics, Data Mining and Knowledge Discovery, Journal of Computational Biology, Journal of Machine Learning Research, Machine Learning.
Proceedings of high-quality conferences in computational biology and machine learning: Intelligent Systems for Molecular Biology (ISMB), Research in Computational Molecular Biology (RECOMB), International Conference on Machine Learning (ICML), Neural Information Processing Systems (NIPS).
Textbooks contain high-quality information. However, as the publication process of books takes a long time, the information in textbooks is rarely the latest in science. Textbooks can be used as sources of information, but they should always be accompanied by journal and conference papers.
Wikipedia contains a lot of information and is sometimes a good source for getting an overview of the seminar topic. However, the quality of Wikipedia articles varies. In particular, the peer-review process behind a Wikipedia article is not always at the same level as that of high-quality scientific journals and conferences. As a consequence, Wikipedia sometimes contains opinions of small groups of scientists that are not shared by the research community. Guideline: You may use Wikipedia as a means to learn about some topic. However, avoid using Wikipedia as the only source of information. Always verify the facts using other sources of information. Whenever possible, rely on journal and conference articles.
Online course material is widely available on the web. It should be used with even more caution than Wikipedia. Some courses are very good, some are not, and there is no peer-review process behind the material. Online courses should not be used as references in your seminar paper.
The rest of the WWW. A random web page of some individual/organization/group about some subject typically has very little quality control behind it. This material is not suitable for seminar paper material.

Finding information

Google Scholar is perhaps the best search engine for finding literature on a given topic.
The University of Helsinki has subscriptions to a wide range of electronic journals, which you can access from the university computers. (To access these via Google Scholar, remember to enable "Library Links" for University of Helsinki in the Google Scholar preferences.)

A combination of two search strategies will lead to the best results:

Google Scholar will give you well-referenced articles that match the keywords. These are often a bit older.
A systematic search through the tables of contents of the latest issues of good journals will give you the very latest work on the topic.

Oral presentation

The oral presentation should not be a copy of the written paper. You should concentrate on getting the main message through and leave minor details to the seminar paper.
Explain both the biological problem and the machine learning method(s).
Allocate enough time for each slide so that the audience has time to understand the contents. 2 minutes per slide is a good rule of thumb.

Giving and receiving feedback

Two golden rules:

When giving feedback, be constructive, suggest improvements rather than just criticizing.
When receiving feedback, try to look at your paper through the reviewer's eyes. Why did this particular comment/suggestion/criticism arise? Usually every bit of feedback contains something useful you can use to improve your paper.

A matrix on factors affecting grading in scientific writing may be used as a basis for feedback. However, you are NOT supposed to give grades with your feedback, only suggestions and comments.

Seminar material and topics

The seminar will be based on recent scientific articles and textbooks. The following survey article will be a useful starting point:

http://bib.oxfordjournals.org/cgi/content/abstract/7/1/86

The following are preselected topics, each with associated seed articles. A seed article is meant to be used as a starting point for a literature search; it is not the only or the best reference on a topic.
In addition to the preselected topics, you may suggest your own topic.

Gene prediction

1. Axel Bernal, Koby Crammer, Artemis Hatzigeorgiou, Fernando Pereira. Global Discriminative Learning for Higher Accuracy Computational Gene Prediction. PLoS Computational Biology 3(3): e54
2. Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern and Peter Meinicke. Gene prediction in metagenomic fragments: A large scale machine learning approach. BMC Bioinformatics 2008, 9:217

Protein-DNA binding

3. Nitin Bhardwaj, Robert E. Langlois, Guijun Zhao, and Hui Lu. Kernel-based machine learning protocol for predicting DNA-binding proteins. Nucleic Acids Res. 2005; 33(20): 6486-6493.
4. Pengyu Hong, X. Shirley Liu, Qing Zhou, Xin Lu, Jun S. Liu, and Wing H. Wong. A boosting approach for motif modeling using ChIP-chip data. Bioinformatics 2005 21: 2636-2643

SNPs and haplotyping

5. Eric P. Xing, Michael I. Jordan, Roded Sharan. Bayesian Haplotype Inference via the Dirichlet Process. Journal of Computational Biology. April 1, 2007, 14(3): 267-284.
6. Lakshmi Matukumalli, John Grefenstette, David Hyten, Ik-Young Choi, Perry B Cregan and Curtis P Van Tassell. Application of machine learning in SNP discovery. BMC Bioinformatics 2006, 7:4

Protein structural classification and structure prediction

7. Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie. Multi-class Protein Classification Using Adaptive Codes. Journal of Machine Learning Research 8 (2007), 1557-1581
8. Scott Montgomerie, Joseph Cruz, Savita Shrivastava, David Arndt, Mark Berjanskii, David Wishart. PROTEUS2: a web server for comprehensive protein structure prediction and structure-based annotation. Nucleic Acids Research, 2008, Vol. 36, No. suppl_2 W202-W209
9. Theodoros Damoulas and Mark Girolami. Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection. Bioinformatics 24, 10 (2008): 1264-1270

Protein identification

10. Joshua Elias, Francis Gibbons, Oliver King, Frederick Roth and Steven Gygi. Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nature Biotechnology 22, 2 (2004), 214-

Protein function prediction

11. Iddo Friedberg. Automated protein function prediction - the genomic challenge. Briefings in Bioinformatics 2006 7(3):225-242
12. Igor V. Tetko, Igor V. Rodchenkov, Mathias C. Walter, Thomas Rattei and Hans-Werner Mewes. Beyond the 'best' match: machine learning annotation of protein sequences by integration of different sources of information. Bioinformatics 2008 24(5):621-628

Gene expression profiling

13. Alexander Statnikov, Constantin F. Aliferis, Ioannis Tsamardinos, Douglas Hardin and Shawn Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005 21(5):631-643
14. Daxin Jiang, Chun Tang, Aidong Zhang. Cluster analysis for gene expression data: a survey. IEEE Transactions on Knowledge and Data Engineering 16, 11 (2004), 1370-1386

Biological network inference

15. Florian Markowetz and Rainer Spang. Inferring cellular networks - a review. BMC Bioinformatics 8, 6 (2007), S5
16. Jean-Philippe Vert. Reconstruction of biological networks by supervised machine learning approaches. Technical Report HAL-00283945, June 2008.
17. Ashwin Srinivasan and Ross D. King. Incremental Identification of Qualitative Models of Biological Systems using Inductive Logic Programming.

Gene (regulatory) networks

18. Robert Castelo and Alberto Roverato. A Robust Procedure for Gaussian Graphical Model Search From Microarray Data With p Larger Than n. Journal of Machine Learning Research 7 (2006), 2621-2650.
19. Jason Ernst, Oded Vainas, Christopher Harbison, Itamar Simon and Ziv Bar-Joseph. Reconstructing Dynamic Regulatory Maps. Molecular Systems Biology 4 (2007):74

Protein-protein interaction networks

20. Krogan et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440 (2006), 637-643
21. Yanjun Qi, Ziv Bar-Joseph, Judith Klein-Seetharaman. Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: Structure, Function, and Bioinformatics 63, 3 (2006), 490-500
