Exercise session 1

Introduction to bioinformatics, Autumn 2008

Exercise session: Tuesday 9 September 16.15-18.00 Exactum C221.

General instructions: In the exercise session, mark the assignments (1..6 here) that you have done and are willing to present. In addition, send notes on the assignments you are going to mark to Lauri Eronen by email. The notes should briefly describe the steps you took to solve each assignment, as well as the results. Important: when sending email, use a subject of the form "ITB exercise X", where X is the exercise session number. Write your notes in the body of the email. If you need to include a figure, send it as an attachment.

Assignments

In PubMed, search for review articles published in the last year discussing the gene HbA1 in humans.
- Describe briefly what PubMed is.
- How many articles does the query return?
- Which disease or diseases are mentioned in article titles?
Search for gene HbA1 in OMIM.
- Describe briefly what OMIM is.
- How many results do you get?
- Choose one result entry from each of the categories denoted by the symbols +, *, # and %, and describe in your own words what each entry is about.
- What is the meaning of the four symbols here?
Search for HbA1 in NCBI RefSeq using Entrez. Hint: Choose the Nucleotide option from the Search list and set the options in the Limits tab accordingly.
- Describe briefly what RefSeq is.
- How many results did you get?
- How can you separate your RefSeq results from other results?
Access the entry for human HBA1 in NCBI RefSeq and answer the following questions.
- How long is the RNA sequence corresponding to the gene?
- How many exons have been annotated in this sequence?
- On which chromosome is this gene located?
- When was the entry last updated?
- How can you easily download the sequence corresponding to a nucleotide entry in NCBI?
Find entries related to gene HbA in UniProt.
- Describe briefly what UniProt is.
- What are the two sections of UniProt, and how do they differ from each other? How can you distinguish between the two sections in search results?
- Describe your results for the query: how many results in the two sections did you get?
- Access the entry HBA_HUMAN. What does the entry say about evidence for this protein? How is this protein's function characterised? Hint: see the Ontologies section.
- Access the entry Q86YQ5_HUMAN and describe it in the same fashion as HBA_HUMAN.
Download the genome sequences of Escherichia coli (accession NC_000913) and Thermoplasma volcanium (accession NC_002689) from NCBI.
1. Find out, using your favourite programming language (notes on programming languages below) or other method, the nucleotide, dinucleotide and trinucleotide frequencies.
2. What is the G-C content of the sequences?
3. Draw a diagram of the 2-word (dinucleotide) and 3-word (trinucleotide) distributions in both sequences (you can use any available software).
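
As a starting point, the counting in this assignment could be sketched in Python roughly as follows. This is only one possible approach, not a model solution. The function names (read_fasta, kmer_frequencies, gc_content) are made up for this sketch, and it assumes that the genome was saved as a FASTA file with a single record, that words are counted in overlapping windows, and that only the given strand is considered.

```python
from collections import Counter


def read_fasta(path):
    """Return the sequence of a single-record FASTA file as one uppercase string."""
    with open(path) as handle:
        # Skip the header line(s) starting with ">" and join the sequence lines.
        return "".join(
            line.strip() for line in handle if not line.startswith(">")
        ).upper()


def kmer_frequencies(seq, k):
    """Relative frequencies of all overlapping k-letter words in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def gc_content(seq):
    """Fraction of G and C among the A, C, G and T letters of seq."""
    counts = Counter(seq)
    acgt = sum(counts[base] for base in "ACGT")
    return (counts["G"] + counts["C"]) / acgt if acgt else 0.0
```

For example, kmer_frequencies("AAAT", 2) sees the words AA, AA and AT, so it gives AA a frequency of 2/3 and AT a frequency of 1/3, and gc_content("ATGCGC") is 4/6. For a real genome you would call read_fasta on the downloaded file (the file name is up to you) and then kmer_frequencies with k = 1, 2 and 3.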
Write a program in your favourite language that tries to find gene coding regions with the following method.
- Scan the given sequence for start (ATG) and stop codons (TAA, TAG, TGA).
- Report the regions that begin with the start codon and end in a stop codon. Note: remember that within a coding region, ATG codes for methionine and does not "restart" the coding region.
- Take all three reading frames into account by considering codons starting from the first, second and third position of the input sequence.
Test your program with this DNA sequence.
- How many candidate coding regions can you find? Where can you find them? How long are the regions?
- Discuss how you could investigate your findings further.
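
The scanning method described in this assignment could be sketched in Python along the following lines. Treat it as a hedged sketch rather than the expected solution. The name find_candidate_regions and the output format are made up here. The sketch also assumes that only the given strand is scanned (no reverse complement), that the stop codon is counted as part of the region, and that a region left open at the end of the sequence is not reported.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_candidate_regions(seq):
    """Return (frame, start, end, length) tuples for ATG...stop regions.

    frame is 1, 2 or 3. start and end are 0-based positions in seq, with
    end exclusive; the stop codon is included in the region.
    """
    seq = seq.upper()
    regions = []
    for frame in range(3):
        start = None  # position of the ATG that opened the current region
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i  # open a new region
            elif codon in STOP_CODONS:
                # Close the region; an ATG inside it was simply skipped.
                regions.append((frame + 1, start, i + 3, i + 3 - start))
                start = None
    return regions
```

For example, find_candidate_regions("ATGATGTAA") returns [(1, 0, 9, 9)]: the second ATG lies inside the region and does not restart it. For "CATGCCCTGAA" the only region is in the second frame, (2, 1, 10, 9).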

Programming languages

You can solve the programming assignments (problems 5 and 6) with any suitable language. However, I suggest using a relatively high-level language such as Python, Perl, R, Matlab or Octave, because these make the implementation easier. Python and Perl are scripting languages that are probably the easiest to learn if you are new to programming. Good tutorials are provided for both. Furthermore, the languages mentioned above are available on the CS computers (try typing "python", "perl", "R", "matlab" or "octave" in a shell on a CS computer).