Exercise session 1
Introduction to bioinformatics, Autumn 2008
Exercise session: Tuesday 9 September 16.15-18.00 Exactum C221.
General instructions:
In the exercise session, you should mark the assignments (1..6 here)
that you have done and are willing to present in the session.
In addition, you need to send notes on the assignments you
are going to mark to Lauri Eronen by email. These exercise notes should contain
a brief description of the steps you took to solve the assignment,
as well as the results. Important: When sending email,
use a subject of the form "ITB exercise X", where X is the exercise session
number. Send your notes in the body of the email. If you need to include a
figure, send it as an attachment.
Assignments
-
In PubMed, search
for review articles published in the last year
discussing the gene HbA1 in humans.
-
Describe briefly what PubMed is.
-
How many articles does the query return?
-
Which disease or diseases are mentioned in article titles?
-
Search for gene HbA1
in OMIM.
-
Describe briefly what OMIM is.
-
How many results do you get?
-
Choose one result entry from each of the categories denoted by symbols
+, *, # and %,
and describe in your own words what is being described by each
entry.
-
What is the meaning of the four symbols here?
-
Search for HbA1 in NCBI RefSeq
using Entrez.
Hint: Choose Nucleotide option from the Search list and set the
options in Limits tab accordingly.
-
Describe briefly what RefSeq is.
-
How many results did you get?
-
How can you separate your RefSeq results from other results?
Access the entry for human HBA1 in NCBI RefSeq and answer the following questions.
-
How long is the RNA sequence corresponding to the gene?
-
How many exons have been annotated in this sequence?
-
On which chromosome is this gene located?
-
When was the entry last updated?
-
How can you easily download the sequence corresponding to a
nucleotide entry in NCBI?
-
Find entries related to gene HbA in
UniProt.
-
Describe briefly what UniProt is.
-
What are the two sections of UniProt, and how do they differ from
each other? How can you distinguish between the two sections in
search results?
-
Describe your results for the query: how many results in the two
sections did you get?
-
Access the entry HBA_HUMAN. What does the entry say about evidence
for this protein? How is this protein's function
characterised? Hint: see the Ontologies section.
-
Access the entry Q86YQ5_HUMAN and describe it in the same fashion
as HBA_HUMAN.
-
Download genome sequences of Escherichia coli
(GenBank ID NC_000913) and Thermoplasma volcanium (NC_002689)
from NCBI.
-
Find out, using your favourite
programming language (notes on programming languages below)
or other method, the nucleotide, dinucleotide
and trinucleotide frequencies.
-
What is the G-C content of the
sequences?
-
Draw a diagram of 2-word and 3-word distributions
in both sequences (you can use any software available).
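One possible way to approach the frequency and G-C content questions is sketched below in Python. It is only an illustration under some assumptions: the genome has been saved as a single-record FASTA file (the file name in the comment is hypothetical), words are counted as overlapping windows on the given strand only, and the helper names read_fasta, kmer_frequencies and gc_content are made up for this sketch. Other counting conventions are equally valid as long as you describe them in your notes.

```python
from collections import Counter


def read_fasta(path):
    """Return the sequence of a single-record FASTA file as one uppercase string."""
    with open(path) as handle:
        # Skip the header line(s) starting with '>' and join the sequence lines.
        lines = [line.strip() for line in handle if not line.startswith(">")]
    return "".join(lines).upper()


def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping k-letter word in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def gc_content(seq):
    """Fraction of positions in seq that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# Toy sequence for illustration only. For the real data, something like
# seq = read_fasta("NC_000913.fasta") could be used (hypothetical file name).
toy = "ATGCGC"
print(kmer_frequencies(toy, 1))  # nucleotide frequencies
print(kmer_frequencies(toy, 2))  # dinucleotide frequencies
print(kmer_frequencies(toy, 3))  # trinucleotide frequencies
print(gc_content(toy))
```

The dictionaries returned by kmer_frequencies can be passed to any plotting tool to draw the 2-word and 3-word distributions.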
-
Write a program in your favourite language that tries to find gene coding regions with the
following method.
-
Scan the given sequence for start (ATG) and stop codons (TAA,
TAG, TGA).
-
Report the regions that begin with the start codon and end in a
stop codon. Note: remember that within a coding region,
ATG codes for methionine and does not "restart" the coding region.
-
Take into account frame shifts, considering codons starting
from the first, second and third position of the input sequence.
Test your program with this DNA sequence.
-
How many candidate coding regions can you find?
Where can you find them?
How long are the regions?
-
Discuss how you could investigate your findings further.
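The scanning method described above could be sketched in Python roughly as follows. This is a minimal illustration, not a reference solution: the function name find_candidate_regions and the (frame, start, end) output format are choices made for this sketch, positions are 0-based, and only the given strand is scanned (the reverse complement is not considered).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_candidate_regions(seq):
    """Return (frame, start, end) for each ATG...stop region on the given strand.

    frame is 0, 1 or 2. start is the 0-based index of the A in the opening ATG,
    and end is the index just past the stop codon, so seq[start:end] is the region.
    """
    seq = seq.upper()
    regions = []
    for frame in range(3):
        start = None  # index of the ATG that opened the current region, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i  # open a new candidate region
            elif codon in STOP_CODONS:
                regions.append((frame, start, i + 3))
                start = None  # region closed; look for the next ATG
            # An ATG inside an open region codes for methionine,
            # so it does not restart the region.
    return regions


# Toy example: the only region is in the second frame (frame index 1).
print(find_candidate_regions("CATGCCCTAGG"))  # prints [(1, 1, 10)]
```

The length of each region is end - start. A region that is opened by an ATG but never reaches a stop codon before the end of the sequence is not reported in this sketch.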
Programming languages
You can solve the programming assignments (problems 5 and 6) with any
suitable language. However, I suggest using a relatively
high-level language such as Python, Perl, R, Matlab or Octave,
because these make the implementation easier.
Python and Perl are scripting languages that are probably the easiest to
learn if you are new to programming. Good tutorials are provided for
both.
Furthermore, the languages mentioned above are available on the CS
computers
(try typing "python", "perl", "R", "matlab" or "octave" in a shell on a CS computer).