CSCI 241 Labs: PreLab 12

Bioinformatics

Introduction

Bioinformatics can be viewed as the application of computer technology to biological problems. In recent years there has been an explosion of data about biological systems. For example, the study of how DNA encodes genes, how those genes are translated into proteins, and how those protein molecules twist and fold into three-dimensional shapes poses complex problems that can only be addressed using computers.

There is a significant advantage to understanding these processes, too. We may find cures for genetic diseases, or design new drugs on the computer rather than by experiment. Genetic engineering holds promise for developing new strains of agricultural plants to help feed an exploding world population. Fortunately for you, you are not expected to solve the world's problems in CS1. We do want to give you enough background, however, to understand the basics of the field.

DNA

Deoxyribonucleic acid (DNA) encodes the genes present in animals, microbes and plants. Each DNA molecule is made up of sequences of four simpler molecules, called nucleotides. Each nucleotide contains one of four bases: adenine (A), cytosine (C), guanine (G), or thymine (T). We can represent a strand of DNA as a sequence of these bases. Here is an example sequence:

ACGGGAGGACGGGAAAATTACTACGGATTAGC

A real DNA molecule can contain hundreds of thousands to millions of nucleotides.
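
Since a strand is just a sequence of the letters A, C, G and T, one natural representation in a program is a string. The following is a minimal sketch in Python; the function names is_valid_dna and base_counts are made up for this illustration and are not part of the lab.

```python
from collections import Counter

DNA_BASES = "ACGT"  # adenine, cytosine, guanine, thymine


def is_valid_dna(sequence):
    """Return True if every character is one of the four DNA bases.

    Note that an empty string counts as valid under this definition.
    """
    return all(base in DNA_BASES for base in sequence)


def base_counts(sequence):
    """Count how many times each of the four bases occurs."""
    counts = Counter(sequence)
    return {base: counts[base] for base in DNA_BASES}


example = "ACGGGAGGACGGGAAAATTACTACGGATTAGC"
print(is_valid_dna(example))
print(base_counts(example))
```

A string works well for sequences of this size. For molecules with millions of nucleotides, a more compact encoding might be preferable, but that is beyond what we need here.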

A gene is a small portion of a DNA molecule that contains information about how to make one protein. Variations in genes are what make us different from each other. One version of a gene may give a person blond hair, while another version will give a person black hair. A single DNA molecule usually contains many genes.

In 1953, it was discovered that a DNA molecule is made of two strands, joined together in a spiral staircase shape called a double helix. An account of this discovery is given in the book The Double Helix: A Personal Account of the Discovery of the Structure of DNA by James D. Watson, with a foreword by Lawrence Bragg. This book is a very good read and is highly recommended by your instructors. (It is available in the UW-Parkside library.)

The two strands are known as reverse complements. They must match up nucleotide by nucleotide according to the rules that

an A always matches with a T, and
a G always matches with a C.

For example, the DNA sequence given above would require the match:

5'  TGCCCTCCTGCCCTTTTAATGATGCCTAATCG  3'
3'  ACGGGAGGACGGGAAAATTACTACGGATTAGC  5'

The strands have an identifiable left end, called the five prime (5') end, and a right end, called the three prime (3') end. DNA sequences are always written from the 5' end to the 3' end. When DNA is used to generate proteins, the cell processes it from the 5' end to the 3' end. The ends of each strand match with the opposite ends of the other strand, however: the 5' end of one strand pairs with the 3' end of the other. This is shown in the sequences above.

If we were to write the bottom sequence by itself, it should be written in the opposite order so that it reads from 5' to 3', that is:

5'  CGATTAGGCATCATTAAAAGGGCAGGAGGGCA  3'

Each gene appears in only one of the strands of the DNA. When scientists search DNA for genes, it is important that they search both strands, but it is a fairly simple algorithm (i.e., doable in CS 1) to translate one strand of DNA into its reverse complement.
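
As a rough illustration of that algorithm, here is a minimal sketch in Python. The names COMPLEMENT, complement and reverse_complement are chosen for this example only. It applies the two matching rules (A with T, G with C) to every base and then reverses the result so that the new strand also reads from 5' to 3'.

```python
# Matching rules: A pairs with T, and G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Replace each base with its matching base, keeping the order."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """Return the opposite strand, written from its own 5' end to its 3' end."""
    return complement(strand)[::-1]


# The bottom strand from the example, written 5' to 3'.
bottom = "CGATTAGGCATCATTAAAAGGGCAGGAGGGCA"
print(reverse_complement(bottom))
```

Taking the reverse complement twice gives back the original strand, which is a handy check when testing your own version. This sketch raises a KeyError on any character other than A, C, G or T; a more forgiving version might report the bad character instead.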

RNA

Most of the time, genetic information is stored as DNA. To create proteins from the DNA, an intermediate molecule is formed, known as ribonucleic acid (RNA).
Very Brief Aside: Sometimes genetic information is encoded and stored directly in RNA. HIV (human immunodeficiency virus) is one example of this.
To transcribe DNA to RNA, simply replace every T in the DNA with a U.
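
In a program, this replacement rule is a one-line string operation. A minimal sketch in Python follows; the function name transcribe is hypothetical.

```python
def transcribe(dna):
    """Transcribe DNA to RNA by replacing every T with a U."""
    return dna.replace("T", "U")


print(transcribe("ACGGGAGGACGGGAAAATTACTACGGATTAGC"))
```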

Proteins

Proteins are made up of amino acids. There are 20 different amino acids used to build proteins.
Another Very Brief Aside: Two more amino acids, selenocysteine and pyrrolysine, have since been found in proteins, which brings the count to 22.
Yet Another Very Brief Aside: The human body can produce most of these amino acids. A few of them, however, we cannot produce on our own and have to ingest (take into our bodies by eating certain foods or taking dietary supplements). These are called the essential amino acids. Everyone, but particularly vegetarians, must make certain that they take in enough essential amino acids or they risk serious health problems.

Translating RNA to protein takes some work. A single strand of RNA may encode one or more proteins. There are only 4 nucleotides in the RNA, and they need to form enough combinations to represent 20 different amino acids. If we group the nucleotides 2 at a time, we get only 4 x 4 = 16 different possible pairs. Therefore we must take the nucleotides 3 at a time to get enough, but this gives us 4 x 4 x 4 = 64 different combinations, far more than we need. A sequence of 3 nucleotides is called a codon. Scientists have thoroughly worked out which amino acid each codon represents. The following table (boldly stolen from the Internet) lists the amino acids, their single letter codes (SLC), their three letter symbols and the codons that represent them:

One Letter And Three Letter Amino Acid Symbols And Codons

SLC	Symbol	Amino Acid	Codons
A	Ala	Alanine	GCA GCC GCG GCU
C	Cys	Cysteine	UGC UGU
D	Asp	Aspartic Acid	GAC GAU
E	Glu	Glutamic Acid	GAA GAG
F	Phe	Phenylalanine	UUC UUU
G	Gly	Glycine	GGA GGC GGG GGU
H	His	Histidine	CAC CAU
I	Ile	Isoleucine	AUA AUC AUU
K	Lys	Lysine	AAA AAG
L	Leu	Leucine	UUA UUG CUA CUC CUG CUU
M	Met	Methionine	AUG
N	Asn	Asparagine	AAC AAU
P	Pro	Proline	CCA CCC CCG CCU
Q	Gln	Glutamine	CAA CAG
R	Arg	Arginine	AGA AGG CGA CGC CGG CGU
S	Ser	Serine	AGC AGU UCA UCC UCG UCU
T	Thr	Threonine	ACA ACC ACG ACU
V	Val	Valine	GUA GUC GUG GUU
W	Trp	Tryptophan	UGG
Y	Tyr	Tyrosine	UAC UAU

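Note: If more than one codon is listed for an amino acid, any of those codons will translate into that amino acid. The three codons that do not appear in the table (UAA, UAG and UGA) are stop codons: they mark the end of a protein rather than encoding an amino acid.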
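
One way to use the table in a program is to turn it into a lookup from codon to single letter code. The sketch below, in Python, copies the table rows into a dictionary and translates an RNA string three nucleotides at a time, starting at position 0. All names here (CODONS_BY_SLC, CODON_TO_SLC, translate) are hypothetical, and mapping a codon that is not in the table to "?" is an assumption of this example, not something the lab prescribes.

```python
from itertools import product

# Rows of the table: single letter code -> codons.
CODONS_BY_SLC = {
    "A": "GCA GCC GCG GCU",
    "C": "UGC UGU",
    "D": "GAC GAU",
    "E": "GAA GAG",
    "F": "UUC UUU",
    "G": "GGA GGC GGG GGU",
    "H": "CAC CAU",
    "I": "AUA AUC AUU",
    "K": "AAA AAG",
    "L": "UUA UUG CUA CUC CUG CUU",
    "M": "AUG",
    "N": "AAC AAU",
    "P": "CCA CCC CCG CCU",
    "Q": "CAA CAG",
    "R": "AGA AGG CGA CGC CGG CGU",
    "S": "AGC AGU UCA UCC UCG UCU",
    "T": "ACA ACC ACG ACU",
    "V": "GUA GUC GUG GUU",
    "W": "UGG",
    "Y": "UAC UAU",
}

# Invert the table so that each codon maps to its single letter code.
CODON_TO_SLC = {
    codon: slc
    for slc, codons in CODONS_BY_SLC.items()
    for codon in codons.split()
}


def translate(rna, unknown="?"):
    """Translate RNA into single letter codes, one codon at a time.

    Leftover nucleotides at the end (fewer than 3) are ignored.
    Codons missing from the table become `unknown` (an assumption).
    """
    protein = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        protein.append(CODON_TO_SLC.get(codon, unknown))
    return "".join(protein)


# Check the counting argument: 4 * 4 * 4 = 64 possible codons.
all_codons = ["".join(triple) for triple in product("ACGU", repeat=3)]
unlisted = sorted(c for c in all_codons if c not in CODON_TO_SLC)
print(len(all_codons), len(CODON_TO_SLC), unlisted)

print(translate("AUGGCAUGG"))
```

Storing each row as a space-separated string keeps the dictionary close to the printed table, which makes it easier to check by eye.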

One tricky part of translating RNA to amino acids is knowing where the protein's representation starts. A protein might start at position 0, 1 or 2 of an RNA fragment, and each starting position groups the nucleotides into different codons. The gene might also be in the complementary strand of DNA. Thus, given a DNA sequence, there are two possible RNA sequences and, with three starting positions for each, 6 possible protein sequences.
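
To make the count of 6 concrete, the following Python sketch lists the codons for every combination of strand and starting position. It stops short of looking the codons up in the table. The function names and the "forward"/"reverse" labels are made up for this example.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in dna)[::-1]


def codons_in_frame(rna, offset):
    """Split rna into codons starting at offset; drop any leftover bases."""
    return [rna[i:i + 3] for i in range(offset, len(rna) - 2, 3)]


def six_reading_frames(dna):
    """Map (strand label, offset) to the list of codons in that frame."""
    frames = {}
    strands = (("forward", dna), ("reverse", reverse_complement(dna)))
    for label, strand in strands:
        rna = strand.replace("T", "U")  # transcribe DNA to RNA
        for offset in range(3):
            frames[(label, offset)] = codons_in_frame(rna, offset)
    return frames


for frame, codons in six_reading_frames("ATGGCC").items():
    print(frame, codons)
```

Each of the six codon lists could then be translated with the table above to give the six candidate protein sequences.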

Scratching the Surface

So far we have just scratched the surface of bioinformatics. There are many other problems in bioinformatics where computers are helpful. For example,

DNA molecules are huge. Finding genes on them is computationally expensive.
Mutations occur in DNA that lead to changes in proteins, which manifest themselves in disease.
The sequence of amino acids in a protein is determined by the DNA. The sequence twists and bends itself into complex three-dimensional shapes. Current theory suggests that the shape is determined by minimizing the energy required to maintain the bonds. Studying the energy minimization involved is a complex numerical calculation.
Proteins don't work in isolation. Instead, biological processes often require many proteins working in harmony to accomplish a task. Biologists are just starting to get an experimental handle on how to determine which genes get turned on simultaneously to produce the multiple proteins needed and how proteins interact with each other.

If you think this stuff is cool, consider using the CS breadth requirement to take some Bioinformatics courses in UW-Parkside's Department of Biological Sciences.