Computing for Biologists, Part I Python vs. Pathogens

1 Computing for Biologists, Part I Python vs. Pathogens Acknowledgement: The following slides are adapted, some with minor revisions, from several lecture slides of the CS5 Green course at Harvey Mudd.

2 Computer Science Computational Biology Biology

3 DNA is double-stranded

4 Representing DNA molecules on a computer
5' - AATGCCGTGCTTGTAGACGTA - 3'
3' - TTACGGCACGAACATCTGCAT - 5'
By convention, we represent the molecule as a single string going 5' to 3': AATGCCGTGCTTGTAGACGTA or TACGTCTACAAGCACGGCATT. Either of these two strings could be used; they are reverse complements of each other.
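The relationship between the two strands can be sketched in a few lines of Python. This is a minimal illustration, assuming the sequence contains only uppercase A, C, G and T; the names COMPLEMENT and reverse_complement are chosen here for illustration and do not come from the slides.

```python
# Complement of each base (assumes an uppercase A/C/G/T alphabet).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string written 5' to 3'."""
    # Walk the string backwards and swap each base for its partner.
    return "".join(COMPLEMENT[base] for base in reversed(dna))


top = "AATGCCGTGCTTGTAGACGTA"
bottom = reverse_complement(top)
print(bottom)                             # TACGTCTACAAGCACGGCATT
print(reverse_complement(bottom) == top)  # True
```

Taking the reverse complement twice returns the original string, which is why either strand can stand in for the whole molecule.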

5 Salmonella outbreaks: routes of infection From:

6 The central dogma in a nutshell (figure labels: DNA, Promoter, ATG, TAG, RNA, Protein)

7 Finding open reading frames (ORFs): check all 3 reading frames ATGCCCTAACATGAAAATGACTTAGG ATGCCCTAACATGAAAATGACTTAGG ATGCCCTAACATGAAAATGACTTAGG
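One way to check all three reading frames is to step through the string three bases at a time, starting at offsets 0, 1 and 2. The sketch below assumes an ORF runs from an ATG to the next in-frame stop codon (TGA, TAG or TAA); the function names orfs_in_frame and find_orfs are hypothetical, not taken from the course.

```python
START = "ATG"
STOPS = {"TGA", "TAG", "TAA"}


def orfs_in_frame(dna, frame):
    """Return ORFs (start codon through stop codon) in one reading frame."""
    orfs = []
    start = None  # index of the ATG that opened the current ORF, if any
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == START:
            start = i
        elif start is not None and codon in STOPS:
            orfs.append(dna[start:i + 3])
            start = None
    return orfs


def find_orfs(dna):
    """Return the ORFs found in all three reading frames of one strand."""
    return [orf for frame in range(3) for orf in orfs_in_frame(dna, frame)]


seq = "ATGCCCTAACATGAAAATGACTTAGG"
for frame in range(3):
    print(frame, orfs_in_frame(seq, frame))
# 0 ['ATGCCCTAA']
# 1 ['ATGAAAATGACTTAG']
# 2 []
```

An ATG that is never followed by an in-frame stop is ignored here; whether such open-ended stretches should count is a design choice.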

8 Genes can occur on either strand
ATG = start codon; TGA, TAG, TAA = stop codons
Gene 1: coding strand is on top
5' - AATGCCGTGCTTGTAGACGTAGGCTTAGATCGTCATGGG - 3'
3' - TTACGGCACGAACATCTGCATCCGAATCTAGCAGTACCC - 5'
Gene 2: coding strand is on bottom
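Because a gene can sit on either strand, an ORF search can be run once on the sequence as written and once on its reverse complement. The following is a sketch under the same assumptions as before (uppercase A/C/G/T, ORF = ATG through the next in-frame stop); orfs_both_strands and the "+"/"-" labels are illustrative names, not part of the original slides.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
STOPS = {"TGA", "TAG", "TAA"}


def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return "".join(COMPLEMENT[b] for b in reversed(dna))


def find_orfs(dna):
    """ORFs (ATG through the next in-frame stop) in all three frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                orfs.append(dna[start:i + 3])
                start = None
    return orfs


def orfs_both_strands(dna):
    """Search the given strand ('+') and its reverse complement ('-')."""
    return {"+": find_orfs(dna), "-": find_orfs(reverse_complement(dna))}


top = "AATGCCGTGCTTGTAGACGTAGGCTTAGATCGTCATGGG"
print(orfs_both_strands(top))
# {'+': ['ATGCCGTGCTTGTAG'], '-': ['ATGACGATCTAA']}
```

On the slide's example, the "+" result corresponds to Gene 1 on the top strand and the "-" result to Gene 2 on the bottom strand.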

9 Noncoding sequence between Salmonella genes nuoM nuoL GCTATCTCACTCGTCAGCCCAAATCCTGCCAGTGCTCACACAAAACGCAGCGCGTTTTGAACGTCCGTAA GGACGGCCCCGTAGGGGTGAGCTTCGCGAATCATCCTCACGTACTTCAGTACGCTCCGGTTGCTGTGCGC TGGCGGTATCCGGTCTGACTTCGCCGATGACGCCTGTACTTCAGGAACAGATTTTCAACGAATTCTAAAA ATTATTTTTGGGTTTGTAGGCCGGATAAGCACTGCGCC It's orfully strange to find this here!

10 The unfortunate truth: Not every ATG is a start codon. Not every ORF is a gene. How can we separate gene ORFs from ORFs due to random chance?

11 A simple gene finding strategy A sequence of interest AATGGGCCGACCAAGGCGACATAGACGCGAATCGGACCAGACGCCGGCTCACCTGTTCATCTACCTTTCTG CGTTGGCGCTAAAAGTTAACGATCGGGCCCTGCGCCGAAACGAAACGTCAGGAATCGACAAATACCAAGTA TCTAAGCTACGGGATAAGCCCCCCCTCGCGAGAGAGGGGAAGGGGTCAATATTTCCCTGGCCGACTGACAA TGGAGTGTACTTACCGGTATACAGTTTGTACTCTACAGCCATCGCTGTCTTACGACGTATTCGGGGCATTT CAACATGCTGTCTCTCAGGAGTTTTCGCGCGCTGAAAGAACTCCCATCTAAACCCTG ORF: 318 nucs

12 A simple gene finding strategy A sequence of interest AATGGGCCGACCAAGGCGACATAGACGCGAATCGGACCAGACGCCGGCTCACCTGTTCATCTACCTTTCTG CGTTGGCGCTAAAAGTTAACGATCGGGCCCTGCGCCGAAACGAAACGTCAGGAATCGACAAATACCAAGTA TCTAAGCTACGGGATAAGCCCCCCCTCGCGAGAGAGGGGAAGGGGTCAATATTTCCCTGGCCGACTGACAA TGGAGTGTACTTACCGGTATACAGTTTGTACTCTACAGCCATCGCTGTCTTACGACGTATTCGGGGCATTT CAACATGCTGTCTCTCAGGAGTTTTCGCGCGCTGAAAGAACTCCCATCTAAACCCTG ORF: 318 nucs Randomly shuffled versions of this sequence CGCTAGGCACGAAAGGATGGCGTCCCAACATATCAACGAGGTACGTTTGTGGAAGGCCCCGTATTACCGTC AGAGACCTGGTACGAGGGTGACTATTTAACGCGGGAGCCCCAAGGAAAACTCAAGCAAAGCCGGCATCTTT GTCGGGTACAGCGCTGTAACTGCGACCGTGATTGGAATCAACACGGATGACCGTGAAGGCGTGTTTGCCAG CTACCACCCCTGATCCCCGGTCTCTCTTTGCCTGGGTTTATAGCTCAAAACTGTATCACGCGTTTAAAAGC AACAACTGTAACGGCATACCCCCGAATTCCCCGTACCAGACGACGTTAATGCTTTCC Longest ORF: 153 nucs CTGCCCTCGCGACGTAAAGGCCTACCCTTATTCGGGCGCTGGTGTCTGGTGTCTGCACCTTGTACGATTTA ATCCGTCTTACGCACCGCGGGTGTCAGTGCAAAACGACTTGGGCTTACAGACATGAATCACGGAACTCTGA AGTATGGGTCGACCAGTCCACATTATGGAGGGAGCCAGAGTCCAACCCGGGAGGCGGGGCACCACACGCGG TATTTTAAGAGGAACCACGCTTGATCACCAACGGAAAGTAGCCGCTAAATTATCGTCAATCTACCCTCAAA CACAAAACCTCGGCTGAACGTCATATTCGAAAAGCTCTACATTTCGGGTTCAGGCCC Longest ORF: 156 nucs Don't forget to look at the reverse complements!
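The shuffling strategy can be sketched as follows: measure the longest ORF in the sequence of interest, then compare it with the longest ORF seen in many shuffled copies, checking reverse complements each time. This is a minimal sketch, not the course's reference solution; the function names, the default of 50 trials, and the rule "longer than every shuffled ORF" are all assumptions made for illustration.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
STOPS = {"TGA", "TAG", "TAA"}


def reverse_complement(dna):
    return "".join(COMPLEMENT[b] for b in reversed(dna))


def longest_orf_length(dna):
    """Length of the longest ORF on either strand (0 if there is none)."""
    best = 0
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    best = max(best, i + 3 - start)
                    start = None
    return best


def shuffled(dna, rng=random):
    """Return a randomly shuffled copy with the same base composition."""
    bases = list(dna)
    rng.shuffle(bases)
    return "".join(bases)


def shuffle_threshold(dna, trials=50, rng=random):
    """Longest ORF length seen across `trials` shuffled copies of dna."""
    return max(longest_orf_length(shuffled(dna, rng)) for _ in range(trials))


def looks_like_gene(dna, trials=50, rng=random):
    """True if the real longest ORF beats every shuffled version (assumed rule)."""
    return longest_orf_length(dna) > shuffle_threshold(dna, trials, rng)
```

Shuffling keeps the base composition fixed, so any ORF that survives the comparison is longer than what that composition tends to produce by chance. More trials give a more stable threshold at the cost of run time.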

13 Modules and the import statement
>>> L = ['A', 'C', 'G', 'G', 'T', 'C', 'A']
>>> L
['A', 'C', 'G', 'G', 'T', 'C', 'A']
>>> import random
>>> random.shuffle(L)
>>> L
['C', 'T', 'A', 'A', 'C', 'G', 'G']
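random.shuffle rearranges a list in place and returns None, and Python strings cannot be modified in place, so shuffling a DNA string takes a round trip through a list. A small sketch, where shuffle_dna is a hypothetical helper name:

```python
import random


def shuffle_dna(dna):
    """Return a shuffled copy of a DNA string."""
    bases = list(dna)      # strings are immutable, so copy into a list
    random.shuffle(bases)  # shuffles in place and returns None
    return "".join(bases)  # glue the bases back into a string


seq = "ACGGTCA"
mixed = shuffle_dna(seq)
print(mixed)                         # order varies from run to run
print(sorted(mixed) == sorted(seq))  # True: same bases, new order
```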

14 Homework: gene finding in a region of DNA unique to Salmonella (Salmonella pathogenicity island 1)

15 Homework bonus: average length to ATG vs. AAA 'CGAGGCGCGGATATCTGGTTTACCCGTACATACTACATTGATGTTGTA...'
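One possible reading of "average length to" a codon is the mean distance from each position in the sequence to the next place that codon occurs, in any frame. The sketch below assumes that interpretation, which the slide does not spell out; average_distance_to is a hypothetical name, and the toy string is made up for illustration rather than taken from the homework data.

```python
def average_distance_to(dna, codon):
    """Mean distance from each position to the next occurrence of codon.

    Positions with no later occurrence are skipped. Returns None if the
    codon never occurs.
    """
    distances = []
    for i in range(len(dna)):
        j = dna.find(codon, i)
        if j == -1:
            break  # nothing at or after i, so nothing after later positions
        distances.append(j - i)
    return sum(distances) / len(distances) if distances else None


toy = "AAATGAAA"
print(average_distance_to(toy, "ATG"))  # 1.0
print(average_distance_to(toy, "AAA"))  # 10/6, about 1.67
```

Running the same function on a long sequence for both ATG and AAA gives the two averages the bonus question asks to compare.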