Imaging informatics computer assisted mammogram reading Clinical aka medical informatics CDSS combining bioinformatics for diagnosis, personalized



4 Imaging informatics: computer-assisted mammogram reading. Clinical (aka medical) informatics: CDSS (clinical decision support systems) combining bioinformatics for diagnosis, personalized medicine, risk assessment, etc. Public health: biosurveillance for infectious diseases.

5 The GenBank sequence database is an open-access, annotated collection of all publicly available nucleotide sequences and their protein translations. It receives sequences produced in labs throughout the world for over 100,000 organisms and, in the 30 years since it was established, has become one of the most important and influential databases in research. It continues to grow at an exponential rate, approximately doubling every 18 months.

6 Proteins are created from mRNA. Protein structure tells us about function, which is why it is so important and is a large area of bioinformatics. The human genome is equivalent to a file of about 800 MB, i.e. a very large encyclopedia.
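
The 800 MB figure can be checked with a little arithmetic. The sketch below assumes roughly 3.2 billion bases and a 2-bit code per base (A, C, G, T); both numbers are assumptions added here, not values taken from the slide.

```python
# Assumptions (not from the slide): ~3.2 billion bases, 2 bits per base.
GENOME_BASES = 3_200_000_000
BITS_PER_BASE = 2


def genome_size_megabytes(bases=GENOME_BASES, bits_per_base=BITS_PER_BASE):
    """Return the storage needed for `bases` nucleotides, in MB (10**6 bytes)."""
    total_bytes = bases * bits_per_base / 8
    return total_bytes / 1_000_000


print(genome_size_megabytes())  # 800.0
```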

7 The cost, complexity, and time needed to sequence DNA drop each year because we're using more efficient tools that require less time and fewer resources. Interpretation and analysis of the data, not its generation, is the bottleneck.


9 Soundness: we want to make sure that the answers we compute are in fact part of the declarative meaning of the program. We don't want the computational machinery to yield answers that are not true according to the program semantics. Completeness: we don't want our computational machinery to miss any answers.
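
One way to picture the two properties is as set inclusions between the answers actually computed and the answers the program declaratively means. The sketch below is only an illustration: the function names and the toy facts are hypothetical.

```python
def is_sound(computed, intended):
    """Sound: every computed answer is true according to the intended meaning."""
    return computed <= intended


def is_complete(computed, intended):
    """Complete: no intended answer is missed by the computation."""
    return intended <= computed


# Toy facts, invented for illustration only.
intended = {"parent(ann, bob)", "parent(bob, cat)", "grandparent(ann, cat)"}
computed = {"parent(ann, bob)", "parent(bob, cat)"}

print(is_sound(computed, intended))     # True: nothing false was derived
print(is_complete(computed, intended))  # False: one answer was missed
```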


13 An algorithm is written in general language but gives you an exact idea of how to go about making a cake (the process). The program (recipe) will make a SPECIFIC kind of cake and will actually produce one; its instructions are coded with quantities. Every recipe/program implements the cake algorithm, but slightly differently, giving a different product every time.

14 Biomarker: a measurable indicator of some biological state or condition (e.g. glucose, blood pressure). In genetics, a biomarker is a DNA sequence that causes disease or is associated with susceptibility to disease.

15 Homology has distinct evolutionary and biological implications: it is similarity in sequence or structure due to a common ancestor. Homologous genes are therefore genes derived from the same ancestral gene. Because homology implies a common ancestor, it can also imply a common function or structure for two homologous proteins, which is a useful pointer to function if one of the proteins is known only from its sequence. However, similar sequences are not necessarily evolutionarily homologous; the similarity can be a result of convergent evolution toward a similar function. Example of the butterfly wing and the bat wing: these are structurally similar but did not come from a common ancestor, so the converse isn't true (function doesn't imply sequence similarity). A sequence alignment is basically a hypothesis. The algorithms we use to build alignments must account for these factors.


18 A global alignment is more gappy if the sequences aren't similar, whereas a local alignment can show the more conserved, highly similar areas of the two sequences.

19 An optimal alignment is one in which the correspondences are greatest and the differences are smallest.
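
"Greatest correspondences, smallest differences" is usually made concrete with a scoring scheme. The sketch below scores an already-aligned pair of rows; the values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, not values from the slides.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows of equal length; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap        # a residue aligned against a gap
        elif x == y:
            score += match      # a correspondence
        else:
            score += mismatch   # a difference
    return score


# Two candidate alignments of ACGT with ACGGT:
print(score_alignment("AC-GT", "ACGGT"))  # 2
print(score_alignment("ACGT-", "ACGGT"))  # 0
```

Under this scheme the first candidate is the better of the two, because it has more matches for the same number of gaps.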

20 Matches are NOT counted. Only one substitution is needed to transform one string into the other.
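
A minimal sketch of a distance that counts only substitutions and ignores matches, assuming the slide refers to a Hamming-style distance between two strings of equal length. The example strings are made up.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    # Matching positions contribute nothing; each mismatch is one substitution.
    return sum(1 for x, y in zip(a, b) if x != y)


print(hamming_distance("AGCT", "AGGT"))  # 1
```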

21 The idea is simple. In general, to solve a given problem we solve different parts of the problem (subproblems), then combine the solutions of the subproblems to reach an overall solution. Dynamic programming stores the solution to each subproblem so that it is computed only once, which makes this a fast method. These algorithms are used for optimization: they effectively examine all possible ways to solve the problem and pick the best solution. Thus, each of the smaller subproblems must also have an optimal solution. The development of dynamic programming algorithms made it possible to incorporate gaps into an alignment.
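
The "solve each subproblem once and store it" idea can be sketched with edit distance, where the subproblem is the distance between two prefixes. Edit distance is used here only as a convenient illustration, and `functools.lru_cache` is one way to do the storing.

```python
from functools import lru_cache


def edit_distance(a, b):
    """Minimum number of substitutions, insertions, and deletions turning a into b."""

    @lru_cache(maxsize=None)  # each (i, j) subproblem is computed only once
    def solve(i, j):
        # Distance between the prefixes a[:i] and b[:j].
        if i == 0:
            return j  # insert the remaining j characters
        if j == 0:
            return i  # delete the remaining i characters
        substitution = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            solve(i - 1, j - 1) + substitution,  # match or substitute
            solve(i - 1, j) + 1,                 # delete from a
            solve(i, j - 1) + 1,                 # insert into a
        )

    return solve(len(a), len(b))


print(edit_distance("ACGT", "AGT"))  # 1
```

The optimal answer for the full strings is built from optimal answers for shorter prefixes, which is the optimal-substructure property the slide describes.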

22 Comparisons are made on the basis of all pairs of amino acids that could be formed between two sequences. The sequences are represented as a two-dimensional matrix, and all possible comparisons are scored using a basic algorithm. Optimality tells us the best places to introduce gaps. Depending on whether we are solving a local or a global alignment problem, a different dynamic programming algorithm is used.
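
The two-dimensional matrix can be sketched for the global case as follows. This is a Needleman-Wunsch-style fill (the slide does not name the algorithm), with assumed scores of match +1, mismatch -1, and a linear gap penalty of -1; it returns only the optimal score, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the DP matrix for a global alignment and return the optimal score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap    # gap in b
            left = score[i][j - 1] + gap  # gap in a
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))  # 2
```

Here the best alignment pairs ACGT with A-GT: three matches and one gap.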

23 Both algorithms are based on dynamic programming.
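
For the local case, a Smith-Waterman-style sketch differs from the global fill in two ways: cell scores are never allowed to drop below zero, and the answer is the largest cell anywhere in the matrix. The scores used (match +2, mismatch -1, gap -1) are assumptions for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]  # borders stay at 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # Flooring at 0 lets an alignment restart anywhere.
            score[i][j] = max(0, diagonal, up, left)
            best = max(best, score[i][j])
    return best


# The shared core ACGT is found despite dissimilar flanking regions.
print(local_alignment_score("TTTACGTTT", "GGACGTGG"))  # 8
```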

24 Dot matrix analysis: a graphical method that allows the comparison of two biological sequences and the identification of regions of close similarity between them.
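
A minimal text-mode sketch of a dot matrix, assuming the simplest rule: a mark wherever the two residues are identical, with no window or threshold filtering. Runs of marks along a diagonal indicate regions of similarity. The sequences are made up.

```python
def dot_matrix(seq_a, seq_b):
    """Return one string per residue of seq_a; '*' marks identity with seq_b."""
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]


for row in dot_matrix("GATTACA", "GATCACA"):
    print(row)
```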

25 Database searching is necessary for making sense of the sequence data becoming available from the genome projects. For example, you may have a protein sequence and want to search a database to find a potential function for your protein. Database searching needs to be both sensitive, in order to detect distantly related homologs and avoid false negatives, and specific, in order to reject unrelated sequences with accidental similarity (false positives).

26 Heuristic refers to experience-based techniques for problem solving, learning, and discovery that give a solution which is not guaranteed to be optimal. Rigorous, exhaustive techniques are time-consuming and require considerable computer resources. Heuristic methods speed up the process of finding a satisfactory solution via mental shortcuts that ease the cognitive load of making a decision. Many database search programs currently in use are modifications of the rigorous methods we discussed, like Smith-Waterman. Heuristic methods prune the search space, using fast approximate methods to select the database sequences that are likely to be similar to the query and to locate the region of similarity inside them.

27 k-tuple: k contiguous residues (the letters must be touching/bordering). Joining procedure: the program checks whether some of the highest-scoring diagonals can be joined together. Limited DP is applied to the best-scoring k-tuples; this increases the speed.
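
The k-tuple step can be sketched as a lookup table plus a count of hits per diagonal. This is a simplified illustration of the idea only, not the full procedure of any particular program; the function name and sequences are hypothetical.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, subject, k=2):
    """Count exact k-tuple matches on each diagonal (subject pos - query pos)."""
    # Lookup table: every k-tuple in the query and where it starts.
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    hits = Counter()
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], ()):
            hits[j - i] += 1  # hits on the same diagonal share one offset
    return hits


print(dict(diagonal_hits("ACGT", "GGACGT", k=2)))  # {2: 3}
```

Diagonals with many hits are the candidates that would then be joined and refined with limited DP.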

28 Relies on finding a core similarity, defined by windows of preset size (words) that score above a given threshold.
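
A minimal sketch of word-based seeding, under loud assumptions: words are compared with a simple match/mismatch score (+2 / -1) rather than a substitution matrix, and every query word is compared against every subject word instead of using an index. The word size, threshold, and function name are hypothetical.

```python
def word_hits(query, subject, w=3, threshold=3, match=2, mismatch=-1):
    """Return (query_pos, subject_pos, score) for word pairs scoring >= threshold."""
    hits = []
    for i in range(len(query) - w + 1):
        q_word = query[i:i + w]
        for j in range(len(subject) - w + 1):
            s_word = subject[j:j + w]
            score = sum(match if x == y else mismatch for x, y in zip(q_word, s_word))
            if score >= threshold:  # the word pair is a candidate core similarity
                hits.append((i, j, score))
    return hits


print(word_hits("ACGT", "ACCT"))  # [(0, 0, 3), (1, 1, 3)]
```

With these settings a word of three residues may contain one mismatch and still reach the threshold.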