General Comments. J Fass UCD Genome Center Bioinformatics Core Tuesday December 16, 2014


1 General Comments J Fass UCD Genome Center Bioinformatics Core Tuesday December 16, 2014

2 Schedule Overview
- Tuesday - Friday morning: lecture and exercises; ad hoc consulting if you brought some data of your own to play with
- Friday afternoon: other AWS topics, sequencing topics, consulting, GC tour?
- Q&A each morning: what didn't work?
- Stop your instances overnight to save $$
- Terminate (instances) & delete (volumes) after the course to save $$$$

3 General Comments
- Bioinformatics is experimental!
  - NextGen / ThirdGen tools & applications are still in an "adaptive radiation" phase
  - Best Practices are extremely hard to come by: do your own benchmarking / optimization
- But computation doesn't consume sample or reagents, so: "Take chances, make mistakes!" - Ms. Frizzle
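As a rough illustration of "do your own benchmarking", here is a minimal Python sketch that times two interchangeable ways of computing GC content on the same input. The function names (`gc_fraction_loop`, `gc_fraction_count`, `benchmark`) and the test read are hypothetical and not part of the course material; real benchmarking would use your own tools and your own data.

```python
import time


def gc_fraction_loop(seq):
    """Count G/C bases one character at a time."""
    gc = 0
    for base in seq:
        if base in "GCgc":
            gc += 1
    return gc / len(seq) if seq else 0.0


def gc_fraction_count(seq):
    """Count G/C bases with str.count on an upper-cased copy."""
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(seq) if seq else 0.0


def benchmark(func, arg, repeats=5):
    """Return the best (minimum) wall-clock time in seconds over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    # Hypothetical input: a short motif repeated to make a long "read".
    read = "ACGTTGCAGGCCN" * 10000
    # Check that both versions agree before comparing their speed.
    assert abs(gc_fraction_loop(read) - gc_fraction_count(read)) < 1e-12
    for func in (gc_fraction_loop, gc_fraction_count):
        print(func.__name__, f"{benchmark(func, read):.6f} s")
```

Checking that the alternatives agree before timing them matters as much as the timing itself: a faster tool that gives a different answer is a different experiment.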

4 Workflows
- See Documentation for examples
- All workflows must be adjusted for:
  - weird data
  - new technologies
  - new tools
- The workflow / pipeline can sometimes make a big difference to the outcome!

5 Open Source & Reproducibility
- We use (virtually) all open source tools: anyone can read the source code & find / fix mistakes! (Open source does not mean free.)
- Open source tools don't guarantee reproducibility! The pipeline, commands, environment (hardware, OS), and data must also be open:
  - public data archives: NCBI (various), GigaDB, etc.
  - code / commands in public, stable repositories (GitHub, Google Code, etc.)

6 Open Source & Reproducibility
- The most critical, and most commonly missing, link in making biological research reproducible is publishing code.
- Insist on submitting all scripts, with version numbers, as Supplemental Materials.
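One way to capture version numbers and environment details alongside a script is sketched below in Python. This is only an illustration under stated assumptions: the helper names (`tool_version`, `environment_report`) are hypothetical, and the `samtools --version` command in the usage section is an assumed example of a tool that reports its version this way; substitute whatever your pipeline actually calls.

```python
import json
import platform
import subprocess
import sys


def tool_version(command):
    """Run a version command and return its first output line.

    Returns None if the program cannot be found or prints nothing.
    """
    try:
        result = subprocess.run(command, capture_output=True, text=True, check=False)
    except (FileNotFoundError, PermissionError):
        return None
    # Some tools print their version to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


def environment_report(version_commands):
    """Collect OS, hardware, interpreter, command line, and tool versions."""
    return {
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "machine": platform.machine(),
        "command_line": sys.argv,
        "tools": {name: tool_version(cmd) for name, cmd in version_commands.items()},
    }


if __name__ == "__main__":
    # Assumed example tool; a missing tool is recorded as null rather than failing.
    report = environment_report({"samtools": ["samtools", "--version"]})
    print(json.dumps(report, indent=2))
```

Saving this JSON next to the results, and submitting it with the scripts, records which versions produced a given output.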

7 Bioinformatics platforms
- Linux command line: R, bash, Python, Perl, etc.
- iPlant (iAnimal, iOrganism ...): heavy grant, programmer, and institutional support; growing community
- Galaxy: light (?) support, heavy community support
- Commercial products like Geneious, CLC, etc.

8 Bioinformatics platforms
- Linux command line
  - will always be the most cutting edge and most flexible
  - steepest learning curve (think months to years)
- iPlant (iAnimal, iOrganism ...)
  - lots of funding = inertia, but requires support from their developers
  - not open source, yet
- Galaxy
  - longer history = inertia, published workflows, deployed on many public servers
  - open source and relatively easy to develop for
- Commercial products like Geneious, CLC, etc.
  - black boxes (no public review for bugs, algorithm flaws)
  - they develop new tools for you ... or they don't

9 Linux command line

10 iPlant

11 Galaxy

12 Cloud Computing
- Cloud computing, whether via the command line or a stock or customized GUI (e.g. Galaxy), may be a cost-effective solution for a small-ish lab with light computing requirements.
- "If within an entire year one only needs to run their computers for less than 30% of time then cloud computing may be worth it." ~ Istvan Albert
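The utilization argument can be made concrete with a small break-even calculation, sketched here in Python. All figures are made-up placeholders (an owned server costing 3000 per year all-in, a comparable cloud instance at 1.00 per hour), and the function names are hypothetical; the sketch ignores storage, data transfer, and staff time, and it does not reproduce how the 30% figure in the quote was derived.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def break_even_utilization(owned_cost_per_year, cloud_cost_per_hour):
    """Fraction of the year at which cloud spending equals the owned machine's yearly cost."""
    return owned_cost_per_year / (cloud_cost_per_hour * HOURS_PER_YEAR)


def cloud_is_cheaper(utilization, owned_cost_per_year, cloud_cost_per_hour):
    """True if paying by the hour for `utilization` of the year costs less than owning."""
    cloud_cost_per_year = utilization * HOURS_PER_YEAR * cloud_cost_per_hour
    return cloud_cost_per_year < owned_cost_per_year


if __name__ == "__main__":
    # Placeholder numbers for illustration only; plug in your own quotes.
    owned = 3000.0  # purchase amortized over its lifetime plus power/admin, per year
    hourly = 1.00   # price of a comparable cloud instance, per hour
    print(f"break-even utilization: {break_even_utilization(owned, hourly):.1%}")
    for used in (0.10, 0.30, 0.50):
        print(f"running {used:.0%} of the year -> cloud cheaper: "
              f"{cloud_is_cheaper(used, owned, hourly)}")
```

Below the break-even fraction the pay-per-hour option wins, which is also why stopping idle instances matters: stopped hours are not billed as compute.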

13 The Big Picture
- Own your own
  - Best for heavy compute users; get what you pay for
  - Fully customizable
  - High training barrier to use
- Public server
  - Often limited compute power, scheduling
  - Not customizable
  - Low training barrier to use
- Cloud servers
  - Some limits on RAM; unlimited nodes / compute
  - Highly customizable
  - High training barrier to use; some "turn-key"-ish solutions (Galaxy's Cloud button, MIT's StarCluster)