Seminar: Machine Learning in Bioinformatics
58309106 Seminar: Machine Learning in Bioinformatics (3 cr)
Instructor: Professor Juho Rousu
Time and place: 08.09-06.10, 27.10-01.12, Monday 12-14, C221
Prerequisites and enrolling in the course
Required: the course Introduction to Bioinformatics or equivalent background knowledge of the topic. Prior knowledge of machine learning techniques will be helpful.
Enroll in the seminar through the registration system.
Note: In case the demand exceeds seminar capacity, students in the Master's programme in bioinformatics will have priority.
Participants
Abhishek Tripathi, Dorothee Girbig, Esa Pitkänen, Hitomi Hasegawa, Ignacio Fernandez, Juan Fernandez, Jurkka Näsänen, Krishnan Narayanan, Laura Langohr, Maria Yli-Heikkilä, Markus Heinonen, Pekka Parviainen, Ping Chen, Yuan Zou
Name | Topic | Seed paper(s) | Reviewers |
Abhishek Tripathi | Protein function prediction | #12 | Ping Chen, Krishnan Narayanan |
Pekka Parviainen | Haplotype Inference | #5 | Ignacio Fernandez, Jurkka Näsänen |
Laura Langohr | Gene expression/Cluster analysis | #14 | Jurkka Näsänen, Maria Yli-Heikkilä |
Ping Chen | Protein structure and function prediction | #8, #11 | Juan Fernandez, Krishnan Narayanan |
Jurkka Näsänen | Gene expression profiling/SVM | #13 | Laura Langohr, Maria Yli-Heikkilä |
Maria Yli-Heikkilä | micro-RNA modelling | #22 | Laura Langohr, Yuan Zou |
Yuan Zou | Gene prediction | #2 | Ignacio Fernandez, Hitomi Hasegawa |
Hitomi Hasegawa | Protein-protein interaction prediction | #21 | Dorothee Girbig, Pekka Parviainen |
Dorothee Girbig | Biological network inference | #15, #16, #20 | Juan Fernandez, Abhishek Tripathi |
Juan Fernandez | Protein identification | #10 | Dorothee Girbig, Hitomi Hasegawa |
Ignacio Fernandez | SNP discovery | #6 | Pekka Parviainen, Yuan Zou |
Krishnan Narayanan | Protein fold and remote homology detection | #9 | Ping Chen, Abhishek Tripathi |
Seminar goals
Machine learning is one of the key technologies in bioinformatics, making it possible to automatically generate predictive models from data. In this seminar we will get an overview of how machine learning techniques are used in bioinformatics. We will look at various prediction problems, including
- prediction problems in biosequences (gene prediction, splice site prediction)
- structure prediction (RNA, protein)
- protein function prediction
- interaction network prediction (protein-protein, gene regulation)
and at the machine learning methods used to solve them, including
- classification methods (classification trees, nearest neighbor, neural networks, support vector machines)
- clustering (partition-based, hierarchical, mixture models)
- probabilistic graphical models (HMM, Bayesian network)
Completing the seminar
The language of the seminar is English. To pass the seminar, you need to complete the following four tasks:
- Write a paper on a topic agreed on during the first meetings,
- Review two papers written by other students,
- Prepare a presentation and discuss it with the other students, and
- Participate in the seminar by asking questions, raising discussions on the topic, and reviewing other students' work.
Grading
Students will be graded based on i) their written paper (40%), ii) their oral presentation (40%), and iii) their activity in commenting on other students' work and participating in the discussion (20%). To pass the course, the student must write the paper on the agreed subject and present their work. In addition, each student is required to attend at least 80% of the seminar presentations.
Grading will be on the scale 0-5 (0 = fail, 5 = excellent).
Schedule
Date | Time | Name/Agenda/Deadline |
8.9 | 12.15-14 | Organization, selection of topics |
15.9 Personal guiding sessions, room A239b | 12.15-12.30 | Krishnan Narayanan |
15.9 | 12.30-12.45 | Yuan Zou |
15.9 | 12.45-13.00 | Maria Yli-Heikkilä |
15.9 | 13.00-13.15 | Ignacio Fernandez |
15.9 | 13.15-13.30 | Pekka Parviainen |
15.9 | 13.30-13.45 | Ping Chen |
15.9 | 13.45-14.00 | Markus Heinonen |
22.9 Personal guiding sessions, room A239b (continued) | 12.15-12.30 | Abhishek Tripathi |
22.9 | 12.30-12.45 | Juan Fernandez |
22.9 | 12.45-13.00 | Laura Langohr |
22.9 | 13.15-13.30 | Jurkka Näsänen |
22.9 | 13.30-13.45 | Dorothee Girbig |
22.9 | 13.45-14.00 | Hitomi Hasegawa |
29.9 | - | No session |
6.10 Deadline | 12.00 | Deadline for submitting the first paper draft (via email to the two reviewers and to Juho Rousu) |
13.10 Deadline | 12.00 | Reviews of the paper drafts submitted (via email to the author and to Juho) |
13.10 Presentation | 12.15-13.00 | Krishnan Narayanan |
20.-26.10 | - | No session (Period break) |
27.10 Presentations | 12.15-13.00 | Yuan Zou |
27.10 | 13.15-14.00 | |
3.11 Presentations | 12.15-13.00 | Ignacio Fernandez |
3.11 | 13.15-14.00 | Pekka Parviainen |
10.11 Presentations | 12.15-13.00 | Ping Chen |
10.11 | 13.15-14.00 | Dorothee Girbig |
17.11 Presentations | 12.15-13.00 | Abhishek Tripathi |
17.11 | 13.15-14.00 | Juan Fernandez |
24.11 Presentations | 12.15-13.00 | Laura Langohr |
24.11 | 13.15-14.00 | Jurkka Näsänen |
1.12 Presentations | 12.15-13.00 | Hitomi Hasegawa |
1.12 | 13.15-14.00 | |
5.12 Deadline | 12.00 | Final seminar paper returned via email to Juho |
Guidelines
Some additional guidelines for this seminar are given below. Further helpful material can be found on the home page of the Scientific Writing course of the Department of Computer Science.
Layout of the seminar paper
- The paper should be formatted according to the general format (cover etc.) used on the Scientific Writing course (see the example layout).
- Using LaTeX is recommended. LaTeX templates are available from http://www.cs.helsinki.fi/u/kurhila/tiki_k2007/coursedescription.html
Using literature
- The seminar paper should cover the biological problem and the computational methods used to solve the problem. You may need to use separate sources for the application and for the method.
- Try to locate the best papers about the topic. You will probably end up reading more papers than you will eventually use.
- How many references should you use and cite? A rule of thumb is "as many references as there are pages in the paper". This does not mean that you will write exactly one page about each reference; some require more than others.
- Try to make a synthesis of the literature. What is the main message of the papers about some topic? How do the individual papers relate to or deviate from this main message?
Sources of information
Categories of information sources for the seminar, in order of preference:
- High-quality journals in bioinformatics, computer science, statistics, as well as biological and medical sciences. These are the preferred source of seminar material. A non-exhaustive list of suitable journals: Bioinformatics, BMC Bioinformatics, Data Mining and Knowledge Discovery, Journal of Computational Biology, Journal of Machine Learning Research, Machine Learning.
- Proceedings of high-quality conferences in computational biology and machine learning: Intelligent Systems for Molecular Biology (ISMB), Research in Computational Molecular Biology (RECOMB), International Conference on Machine Learning (ICML), Neural Information Processing Systems (NIPS).
- Textbooks contain high-quality information. However, as the publication process of books takes very long, the information in textbooks is rarely the latest in science. Textbooks can be used as sources of information, but they should always be accompanied by journal and conference papers.
- Wikipedia contains a lot of information and sometimes is a good source to get an overview of the seminar topic. However, the quality of Wikipedia articles varies. In particular, the peer-review process behind a Wikipedia article is not always at the same level as high-quality scientific journals and conferences. As a consequence, sometimes Wikipedia contains opinions of small groups of scientists that are not shared by the research community. Guideline: You may use Wikipedia as a means to learn about some topic. However, avoid using Wikipedia as the only source of information. Always verify the facts using other sources of information. Whenever possible rely on journal and conference articles.
- Online course material is widely available on the web. It should be used with even more caution than Wikipedia. Some courses are very good, some are not, and there is no peer-review process behind the material. Online courses should not be used as references in your seminar paper.
- The rest of the WWW. A random web page of some individual/organization/group about some subject typically has very little quality control behind it. This material is not suitable as seminar paper material.
Finding information
- Google Scholar is perhaps the best search engine for finding literature on a given topic.
- The University of Helsinki has subscriptions to a wide range of electronic journals; you can access these from the university computers. (To access these via Google Scholar, remember to enable "Library Links" for University of Helsinki in Google Scholar preferences.)
A combination of two search strategies will lead to the best results:
- Google Scholar will give you well-cited articles that match the keywords. These are often a bit older.
- A systematic search through the tables of contents of the latest issues of good journals will give you the most recent work on the topic.
Oral presentation
- The oral presentation should not be a copy of the written paper. You should concentrate on getting the main message through and leave minor details to the seminar paper.
- Explain both the biological problem and the machine learning method(s).
- Allocate enough time for each slide so that the audience has time to understand the contents. 2 minutes per slide is a good rule of thumb.
Giving and receiving feedback
Two golden rules:
- When giving feedback, be constructive: suggest improvements rather than just criticizing.
- When receiving feedback, try to look at your paper through the reviewers' eyes. Why did this particular comment/suggestion/criticism arise? Usually every bit of feedback contains something useful you can use to improve your paper.
Seminar material and topics
The seminar will be based on recent scientific articles and textbooks. The following survey article will be a useful starting point:
- Pedro Larrañaga, Borja Calvo, Roberto Santana, Concha Bielza, Josu Galdiano, Iñaki Inza, José A. Lozano, Rubén Armañanzas, Guzmán Santafé, Aritz Pérez, and Victor Robles (2006): Machine learning in bioinformatics. Brief Bioinform 7: 86-112. http://bib.oxfordjournals.org/cgi/content/abstract/7/1/86
In addition to the preselected topics, you may suggest your own topic.
Gene prediction
- Axel Bernal, Koby Crammer, Artemis Hatzigeorgiou, Fernando Pereira. Global Discriminative Learning for Higher Accuracy Computational Gene Prediction. PLoS Computational Biology 3(3): e54
- Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern and Peter Meinicke. Gene prediction in metagenomic fragments: A large scale machine learning approach. BMC Bioinformatics 2008, 9:217
Protein-DNA binding
- Nitin Bhardwaj, Robert E. Langlois, Guijun Zhao, and Hui Lu. Kernel-based machine learning protocol for predicting DNA-binding proteins. Nucleic Acids Res. 2005; 33(20): 6486-6493.
- Pengyu Hong, X. Shirley Liu, Qing Zhou, Xin Lu, Jun S. Liu, and Wing H. Wong. A boosting approach for motif modeling using ChIP-chip data. Bioinformatics 2005 21: 2636-2643
SNPs and haplotyping
- Eric P. Xing, Michael I. Jordan, Roded Sharan. Bayesian Haplotype Inference via the Dirichlet Process. Journal of Computational Biology. April 1, 2007, 14(3): 267-284.
- Lakshmi Matukumalli, John Grefenstette, David Hyten, Ik-Young Choi, Perry B Cregan and Curtis P Van Tassell. Application of machine learning in SNP discovery. BMC Bioinformatics 2006, 7:4
Protein structural classification and structure prediction
- Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie. Multi-class Protein Classification Using Adaptive Codes. Journal of Machine Learning Research 8 (2007), 1557-1581
- Scott Montgomerie, Joseph Cruz, Savita Shrivastava, David Arndt, Mark Berjanskii, David Wishart. PROTEUS2: a web server for comprehensive protein structure prediction and structure-based annotation. Nucleic Acids Research, 2008, Vol. 36, No. suppl_2 W202-W209
- Theodoros Damoulas and Mark Girolami. Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection. Bioinformatics 24, 10 (2008):1264-1270
Protein identification
- Joshua Elias, Francis Gibbons, Oliver King, Frederick Roth and Steven Gygi. Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nature Biotechnology 22, 2 (2004), 214-
Protein function prediction
- Iddo Friedberg. Automated protein function prediction - the genomic challenge. Briefings in Bioinformatics 2006 7(3):225-242
- Igor V. Tetko, Igor V. Rodchenkov, Mathias C. Walter, Thomas Rattei and Hans-Werner Mewes. Beyond the 'best' match: machine learning annotation of protein sequences by integration of different sources of information. Bioinformatics 2008 24(5):621-628
Gene expression profiling
- Alexander Statnikov, Constantin F. Aliferis, Ioannis Tsamardinos, Douglas Hardin and Shawn Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005 21(5):631-643
- Daxin Jiang, Chun Tang, Aidong Zhang. Cluster analysis for gene expression data: a survey. IEEE Transactions on Knowledge and Data Engineering 16, 11 (2004), 1370-1386
Biological network inference
- Florian Markowetz and Rainer Spang: Inferring cellular networks - a review. BMC Bioinformatics 8, 6 (2007), S5
- Jean-Philippe Vert: Reconstruction of biological networks by supervised machine learning approaches. Technical Report HAL-00283945, June, 2008.
- Ashwin Srinivasan and Ross D. King. Incremental Identification of Qualitative Models of Biological Systems using Inductive Logic Programming.
Gene (regulatory) networks
- Robert Castelo and Alberto Roverato: A Robust Procedure for Gaussian Graphical Model Search From Microarray Data With p Larger Than n. Journal of Machine Learning Research 7 (2006), 2621-2650.
- Jason Ernst, Oded Vainas, Christopher Harbison, Itamar Simon and Ziv Bar-Joseph. Reconstructing Dynamic Regulatory Maps. Molecular Systems Biology 4 (2007):74
Protein-protein interaction networks
- Krogan et al.: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440 (2006), 637-643
- Yanjun Qi, Ziv Bar-Joseph, Judith Klein-Seetharaman. Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: Structure, Function, and Bioinformatics 63, 3 (2006), 490-500