We have proudly joined the e-BioGrid project on a mission to bring nation-wide IT infrastructure for large-scale biobanking. The first goal is to build a model infrastructure for the Dutch biobank community, BBMRI-NL, which manages resources for the future of biomedical research. See http://www.bbmriwiki.nl and http://www.nlgenome.nl elsewhere on this wiki.
- Duplicate input checking.
- Quality scores histogram from the BGI.
- Maybe other graphs/data provided by the BGI.
- Quality scores with the FastQC toolkit:
  - GC content.
  - Quality scores per base.
  - Quality scores per read.
  - Length distribution.
  - Overrepresentation of reads.
  - A summary of this data provided by a script.
- Percentage aligned.
- Insert size distribution.
- Visualisation as a wiggle track.
- Intra sample distance calculation of the wiggle tracks.
- Mapping quality distribution.
- Look into Picard Tools.
- Any available statistics from GATK?
- Transition / transversion rate.
- X, Y coverage (check encoding of sample tags).
- Mutation rate.
- Distribution of SNPs found in dbSNP.
- Indel / substitution rate.
- Cross check with immuno-chip.
In all steps, cross check with the data provided by the BGI.
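Two of the read-level checks above (GC content and per-base quality scores) can be sketched in a few lines of Python. This is only a minimal illustration assuming Sanger-style Phred+33 quality encoding; in practice FastQC computes these metrics (and the rest of the list) out of the box.

```python
def parse_fastq(lines):
    """Yield (sequence, quality) tuples from FASTQ lines (4 lines per record)."""
    it = iter(lines)
    for _header in it:
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        yield seq, qual

def gc_content(reads):
    """Fraction of G/C bases over all reads."""
    gc = total = 0
    for seq, _ in reads:
        gc += sum(base in "GCgc" for base in seq)
        total += len(seq)
    return gc / total if total else 0.0

def mean_quality_per_base(reads):
    """Mean Phred score at each read position (Phred+33 encoding assumed)."""
    sums, counts = [], []
    for _, qual in reads:
        for i, ch in enumerate(qual):
            if i == len(sums):
                sums.append(0)
                counts.append(0)
            sums[i] += ord(ch) - 33
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]
```

For example, two 4 bp reads with qualities `IIII` and `II!!` give a GC content depending on their sequences and per-base means of 40 at the first two positions (`I` = Phred 40) dropping to wherever the low-quality tail sits.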
Yesterday we had a successful joint hackathon in Haren across the BBMRI-NL bioinformatics and NBIC biobank and next-generation sequencing task forces. Representatives attended from UMCG, AMC, LUMC and NBIC, as well as guests from the local genomics coordination center/pipeline workgroup. First we had a series of short presentations on progress in the SnpCallingPipeline (Freerk, Morris), general progress in the NGS team (Leon), and pointers to automation of sample tracking (Morris) and analysis tracking (Joeri). Most of the afternoon was spent on establishing QC procedures (Jeroen, Kai, Wil), paving the way for the migration of the SnpCallingPipeline to BigGrid (Barbera, Mark, Freerk), exploring SNP annotation protocols (Alex, Barbera, Jeroen), and analyzing how and where the pipeline can be optimized for throughput time (David, Leon). All in all a great meeting that we hope to have more of. See http://www.bbmriwiki.nl/wiki/MeetingMinutes/2010_11_29 for minutes.
Data for the first two trios have been downloaded from the BGI. Meanwhile the BGI is running the basic analysis, i.e., variant calling on the whole batch. Freerk and Morris went to Boston to learn how to do the same procedure using GATK. Hopefully by the end of next week they will have a running pipeline on the cluster to digest the rest of the pilot data as it comes in and produce VCFs for the GVNL analysis team to sink their teeth into.
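One quick sanity check on those VCFs is the transition/transversion rate from the QC checklist above. A minimal sketch, assuming simple biallelic SNP records in tab-separated VCF lines (a real pipeline would use GATK or vcftools for this):

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# everything else single-base is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv_ratio(vcf_lines):
    """Compute the Ts/Tv ratio over biallelic SNP records in VCF-formatted lines."""
    ts = tv = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        ref, alt = fields[3], fields[4]
        if len(ref) != 1 or len(alt) != 1:
            continue  # skip indels and multiallelic sites
        if (ref.upper(), alt.upper()) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("inf")
```

For whole-genome human data a Ts/Tv around 2.0-2.1 is expected; a markedly lower value is a common signal of false-positive calls.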
Interesting read? Pelak et al. (http://www.ncbi.nlm.nih.gov/pubmed/20838461) present an analysis of twenty human genomes to evaluate the prospects for identifying rare functional variants that contribute to a phenotype of interest. They sequenced at high coverage ten "case" genomes from individuals with severe hemophilia A and ten "control" genomes. They summarize the number of genetic variants emerging from a study of this magnitude, and provide a proof of concept for the identification of rare and highly penetrant functional variants by confirming that the cause of hemophilia A is easily recognizable in this data set. They also show that the number of novel single nucleotide variants (SNVs) discovered per genome seems to stabilize at about 144,000 new variants per genome after the first 15 individuals have been sequenced. Finally, they find that, on average, each genome carries 165 homozygous protein-truncating or stop-loss variants in genes representing a diverse set of pathways.