Computing for Biologists, Part I Python vs. Pathogens
- Meghan Clarke
- 5 years ago
Transcription
1 Computing for Biologists, Part I Python vs. Pathogens Acknowledgement: The following slides are adapted, some with minor revisions, from several lecture slides of the CS5 Green course at Harvey Mudd.
2 Computer Science Computational Biology Biology
3 DNA is double stranded
4 Representing DNA molecules on a computer 5' - AATGCCGTGCTTGTAGACGTA - 3' 3' - TTACGGCACGAACATCTGCAT - 5' By convention, we represent a DNA molecule as a single string going 5' to 3': AATGCCGTGCTTGTAGACGTA or TACGTCTACAAGCACGGCATT. Either of these two strings could be used; they are reverse complements of each other.
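The reverse-complement relationship on this slide can be checked in a few lines of Python. This is a minimal sketch; the names `COMPLEMENT` and `reverse_complement` are illustrative and do not come from the slides.

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(dna):
    """Return the opposite strand of `dna`, written 5' to 3'."""
    # Walk the string backwards and swap every base for its partner.
    return "".join(COMPLEMENT[base] for base in reversed(dna))

top = "AATGCCGTGCTTGTAGACGTA"      # top strand from the slide
bottom = reverse_complement(top)   # bottom strand, read 5' to 3'
print(bottom)

# Taking the reverse complement twice gives back the original strand.
print(reverse_complement(bottom) == top)
```

Because the operation undoes itself, either string is an equally valid representation of the same double-stranded molecule.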
5 Salmonella outbreaks: routes of infection From:
6 The central dogma in a nutshell (figure labels: DNA, Promoter, ATG, TAG, RNA, Protein)
7 Finding open reading frames (ORFs): check all 3 reading frames ATGCCCTAACATGAAAATGACTTAGG ATGCCCTAACATGAAAATGACTTAGG ATGCCCTAACATGAAAATGACTTAGG
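Checking all three reading frames could look roughly like the sketch below. It assumes an ORF runs from an ATG to the first in-frame stop codon (TGA, TAG, or TAA) and includes that stop codon; the function name `orfs_in_frame` is made up for this example.

```python
STOP_CODONS = {"TGA", "TAG", "TAA"}

def orfs_in_frame(dna, frame):
    """Return the ORFs found when reading `dna` in codons starting at offset `frame`."""
    orfs = []
    start = None  # index of the ATG that opened the current ORF, if any
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == "ATG":
            start = i                        # open a new ORF
        elif start is not None and codon in STOP_CODONS:
            orfs.append(dna[start:i + 3])    # close it, stop codon included
            start = None
    return orfs

seq = "ATGCCCTAACATGAAAATGACTTAGG"  # sequence from the slide
for frame in range(3):
    print(frame, orfs_in_frame(seq, frame))
```

An ATG that is never followed by an in-frame stop codon is not reported by this sketch; whether such open-ended stretches should count is a design choice.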
8 Genes can occur on either strand ATG = start codon; TGA, TAG, TAA = stop codons Gene 1: coding strand is on top 5' - AATGCCGTGCTTGTAGACGTAGGCTTAGATCGTCATGGG - 3' 3' - TTACGGCACGAACATCTGCATCCGAATCTAGCAGTACCC - 5' Gene 2: coding strand is on bottom
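To look for genes on both strands, one option is to scan the sequence and then scan its reverse complement. The sketch below is self-contained and rests on the same assumption as before: an ORF starts at ATG and ends at the first in-frame stop codon. All function names are illustrative.

```python
STOP_CODONS = {"TGA", "TAG", "TAA"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))

def orfs_one_strand(dna):
    """Collect ATG-to-stop ORFs from all three reading frames of one strand."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append(dna[start:i + 3])
                start = None
    return orfs

def orfs_both_strands(dna):
    """Return (ORFs on the given strand, ORFs on the reverse complement)."""
    return orfs_one_strand(dna), orfs_one_strand(reverse_complement(dna))

top = "AATGCCGTGCTTGTAGACGTAGGCTTAGATCGTCATGGG"  # top strand from the slide
forward, reverse = orfs_both_strands(top)
print("top strand:   ", forward)
print("bottom strand:", reverse)
```

For the slide's example this reports one ORF per strand, matching Gene 1 on top and Gene 2 on the bottom.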
9 Noncoding sequence between Salmonella genes nuoM nuoL GCTATCTCACTCGTCAGCCCAAATCCTGCCAGTGCTCACACAAAACGCAGCGCGTTTTGAACGTCCGTAA GGACGGCCCCGTAGGGGTGAGCTTCGCGAATCATCCTCACGTACTTCAGTACGCTCCGGTTGCTGTGCGC TGGCGGTATCCGGTCTGACTTCGCCGATGACGCCTGTACTTCAGGAACAGATTTTCAACGAATTCTAAAA ATTATTTTTGGGTTTGTAGGCCGGATAAGCACTGCGCC It's orfully strange to find this here!
10 The unfortunate truth: Not every ATG is a start codon. Not every ORF is a gene. How can we separate gene ORFs from ORFs due to random chance?
11 A simple gene finding strategy A sequence of interest AATGGGCCGACCAAGGCGACATAGACGCGAATCGGACCAGACGCCGGCTCACCTGTTCATCTACCTTTCTG CGTTGGCGCTAAAAGTTAACGATCGGGCCCTGCGCCGAAACGAAACGTCAGGAATCGACAAATACCAAGTA TCTAAGCTACGGGATAAGCCCCCCCTCGCGAGAGAGGGGAAGGGGTCAATATTTCCCTGGCCGACTGACAA TGGAGTGTACTTACCGGTATACAGTTTGTACTCTACAGCCATCGCTGTCTTACGACGTATTCGGGGCATTT CAACATGCTGTCTCTCAGGAGTTTTCGCGCGCTGAAAGAACTCCCATCTAAACCCTG ORF: 318 nucs
12 A simple gene finding strategy A sequence of interest AATGGGCCGACCAAGGCGACATAGACGCGAATCGGACCAGACGCCGGCTCACCTGTTCATCTACCTTTCTG CGTTGGCGCTAAAAGTTAACGATCGGGCCCTGCGCCGAAACGAAACGTCAGGAATCGACAAATACCAAGTA TCTAAGCTACGGGATAAGCCCCCCCTCGCGAGAGAGGGGAAGGGGTCAATATTTCCCTGGCCGACTGACAA TGGAGTGTACTTACCGGTATACAGTTTGTACTCTACAGCCATCGCTGTCTTACGACGTATTCGGGGCATTT CAACATGCTGTCTCTCAGGAGTTTTCGCGCGCTGAAAGAACTCCCATCTAAACCCTG ORF: 318 nucs Randomly shuffled versions of this sequence CGCTAGGCACGAAAGGATGGCGTCCCAACATATCAACGAGGTACGTTTGTGGAAGGCCCCGTATTACCGTC AGAGACCTGGTACGAGGGTGACTATTTAACGCGGGAGCCCCAAGGAAAACTCAAGCAAAGCCGGCATCTTT GTCGGGTACAGCGCTGTAACTGCGACCGTGATTGGAATCAACACGGATGACCGTGAAGGCGTGTTTGCCAG CTACCACCCCTGATCCCCGGTCTCTCTTTGCCTGGGTTTATAGCTCAAAACTGTATCACGCGTTTAAAAGC AACAACTGTAACGGCATACCCCCGAATTCCCCGTACCAGACGACGTTAATGCTTTCC Longest ORF: 153 nucs CTGCCCTCGCGACGTAAAGGCCTACCCTTATTCGGGCGCTGGTGTCTGGTGTCTGCACCTTGTACGATTTA ATCCGTCTTACGCACCGCGGGTGTCAGTGCAAAACGACTTGGGCTTACAGACATGAATCACGGAACTCTGA AGTATGGGTCGACCAGTCCACATTATGGAGGGAGCCAGAGTCCAACCCGGGAGGCGGGGCACCACACGCGG TATTTTAAGAGGAACCACGCTTGATCACCAACGGAAAGTAGCCGCTAAATTATCGTCAATCTACCCTCAAA CACAAAACCTCGGCTGAACGTCATATTCGAAAAGCTCTACATTTCGGGTTCAGGCCC Longest ORF: 156 nucs Don't forget to look at the reverse complements!
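The strategy on this slide, comparing the ORF in the real sequence with the longest ORFs in shuffled copies, might be sketched as follows. Several details are assumptions made for illustration and are not stated in the slides: an ORF is counted from ATG through the first in-frame stop codon, the noise threshold is the longest ORF seen across all shuffles, and the names `longest_orf`, `noise_threshold`, and `looks_like_gene` as well as the defaults `trials=50` and `seed=0` are hypothetical.

```python
import random

STOP_CODONS = {"TGA", "TAG", "TAA"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(dna):
    return "".join(COMPLEMENT[base] for base in reversed(dna))

def longest_orf_one_strand(dna):
    """Length in nucleotides of the longest ATG-to-stop ORF on one strand."""
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best

def longest_orf(dna):
    """Longest ORF over both strands (do not forget the reverse complement)."""
    return max(longest_orf_one_strand(dna),
               longest_orf_one_strand(reverse_complement(dna)))

def shuffled(dna, rng):
    """Return a copy of `dna` with the same bases in random order."""
    bases = list(dna)       # strings are immutable, so shuffle a list
    rng.shuffle(bases)
    return "".join(bases)

def noise_threshold(dna, trials=50, seed=0):
    """Longest ORF seen in `trials` shuffled copies (assumed cutoff rule)."""
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    return max(longest_orf(shuffled(dna, rng)) for _ in range(trials))

def looks_like_gene(dna, trials=50, seed=0):
    """True if the real longest ORF beats every shuffled copy."""
    return longest_orf(dna) > noise_threshold(dna, trials, seed)

seq = "AATGCCGTGCTTGTAGACGTAGGCTTAGATCGTCATGGG"  # short demo sequence from slide 8
print(longest_orf(seq), noise_threshold(seq), looks_like_gene(seq))
```

Shuffling keeps the base composition of the sequence but destroys its order, so ORFs in the shuffled copies show how long an ORF can get by chance alone. Other cutoff rules, such as a percentile of the shuffled lengths, would fit the same outline.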
13 Modules and the import statement
>>> L = ['A', 'C', 'G', 'G', 'T', 'C', 'A']
>>> L
['A', 'C', 'G', 'G', 'T', 'C', 'A']
>>> import random
>>> random.shuffle(L)
>>> L
['C', 'T', 'A', 'A', 'C', 'G', 'G']
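`random.shuffle` rearranges a list in place and returns `None`, and it cannot be applied to a string directly because strings are immutable. A small helper along the following lines bridges that gap for DNA strings; the name `shuffle_dna` is an illustrative choice, not something defined in the slides.

```python
import random

def shuffle_dna(dna):
    """Return a new string with the bases of `dna` in random order."""
    bases = list(dna)        # convert the string into a mutable list
    random.shuffle(bases)    # shuffles in place; the return value is None
    return "".join(bases)    # glue the bases back into a string

original = "ACGGTCA"
mixed = shuffle_dna(original)
print(mixed)                              # order varies from run to run
print(sorted(mixed) == sorted(original))  # same bases, possibly reordered
```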
14 Homework: gene finding in a region of DNA unique to Salmonella (Salmonella pathogenicity island 1)
15 Homework bonus: average length to ATG vs. AAA 'CGAGGCGCGGATATCTGGTTTACCCGTACATACTACATTGATGTTGTA...'