Protocols for HapMap data-handling
LSID: urn:lsid:dcc.hapmap.org:Protocol:data_release:1
Title: Data release procedure at DCC
Description:
When dumping out data for a public freeze, the primary tracking database is queried
for all available genotype sets. Each genotype set and its parent assay record are first
checked to see whether the submitter has flagged them as passed. Any assay or genotype
set flagged as failed is excluded from the dump.
Next, the genotype set is checked against the following QC parameters, in order. If it
fails any threshold, the genotype set is excluded from the dump. These are the
standard, uniform QC filters applied to all project data before public release (as of
Sep04):
-Completion rate (a.k.a. plate passrate) >= 80%
-Hardy-Weinberg equilibrium p-value >= 0.001
-Mendel inheritance errors <= 1
-Duplicate discrepancies <= 1
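As an illustration, the ordered threshold checks above can be sketched in Python.
This is a minimal sketch, not the DCC dumping script: the GenotypeSetQC record and
its field names are hypothetical, and the completion rate is assumed to be stored
as a fraction between 0 and 1 rather than as a percentage.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds copied from the list above (as of Sep04).
MIN_COMPLETION_RATE = 0.80
MIN_HWE_PVALUE = 0.001
MAX_MENDEL_ERRORS = 1
MAX_DUPLICATE_DISCREPANCIES = 1


@dataclass
class GenotypeSetQC:
    """Hypothetical QC summary for one genotype set (names are illustrative)."""
    completion_rate: float          # assumed fraction in [0, 1]
    hwe_pvalue: float
    mendel_errors: int
    duplicate_discrepancies: int


def first_failed_filter(qc: GenotypeSetQC) -> Optional[str]:
    """Apply the filters in the listed order; return the first failure or None."""
    checks = [
        ("completion_rate", qc.completion_rate >= MIN_COMPLETION_RATE),
        ("hwe_pvalue", qc.hwe_pvalue >= MIN_HWE_PVALUE),
        ("mendel_errors", qc.mendel_errors <= MAX_MENDEL_ERRORS),
        ("duplicate_discrepancies",
         qc.duplicate_discrepancies <= MAX_DUPLICATE_DISCREPANCIES),
    ]
    for name, passed in checks:
        if not passed:
            return name
    return None
```

Under these assumptions, a genotype set would go on to the printout stage only when
first_failed_filter returns None.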
Each genotype set that passes QC-filtering is passed to the file-printout stage, where
the individual genotypes, allele & genotype frequencies, assays etc. are printed to the
appropriate files & formats. This task is handled by Template Toolkit templates, such
that modifying output formats or adding new formats is relatively easy and can be done
with little or no modifications to the dumping script itself. Output formats include
text tables for bulk downloads and GFF-files for loading into Bio::DB::GFF.
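The benefit of template-driven output can be illustrated with a small sketch.
Python's string.Template is used here purely as a stand-in for Template Toolkit,
and the format names and record fields are hypothetical; the real templates and
file layouts are not shown in this protocol.

```python
from string import Template

# Hypothetical row templates keyed by format name. Adding a new output format
# means adding an entry here; the render() function itself does not change.
ROW_TEMPLATES = {
    "tab": Template("$snp_id\t$sample_id\t$genotype"),
    "csv": Template("$snp_id,$sample_id,$genotype"),
}


def render(records, fmt):
    """Render a list of genotype records (dicts) in the named format."""
    template = ROW_TEMPLATES[fmt]
    return "\n".join(template.substitute(record) for record in records)
```

For example, render([{"snp_id": "snp1", "sample_id": "sample1", "genotype": "AG"}],
"csv") yields the single row snp1,sample1,AG.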
After dumping out all data, the public FTP-repository for the new release is created
on the DCC development server. At the same time, the data are loaded into the
visualization database along with other genome annotations, to create the GBrowse back
end for this new release.
Both the FTP-dirs and GBrowse are tested via web browser on the development website
and, if all looks good, finally mirrored via rsync to the public webserver hapmap.org, at
which point they become visible to the public.
LSID: urn:lsid:dcc.hapmap.org:Protocol:data_exchange:1
Title: Data exchange between genotyping centers and the DCC
Description:
The HapMap genotyping center prepares assay and genotype XML datafiles that
conform to the HapMap XML Schema definition. See
http://www.hapmap.org/xml-schema/2003-11-04/hapmap.xsd for the schema itself
and http://www.hapmap.org/downloads/xml_docs/
for more information and sample XML-documents. The center has the ability to
locally validate the XML-documents against the schema using a schema-validating
parser. This helps to catch numerous syntax errors before a file even gets to the
DCC.
The genotyping center then uploads the gzipped assay and genotype XML-files to
the DCC server hapmap.cshl.org via sFTP (secure FTP over SSH). For each
compressed XML-file, the center also provides a file with the MD5 checksum for
the XML-file before compression (see below for the purpose of this checksum
exercise). After uploading, the center can use a web browser to log on to the
password-protected DCC site http://hapmap.cshl.org and see a listing of uploaded
files, processing status, summary for processed batches, and more.
The DCC processes uploaded datafiles in a staggered submission cycle, such that
on the 1st Monday new uploads from center A are processed, on the 2nd Monday
uploads from centers B and C are processed, and so on. This is done to even out the
processing load at the DCC; the amount of data passing from the centers to the
DCC is quite large, and if many centers were to submit on the same day (typically
end of the month), then the DCC would need several days to clear the resulting
queue.
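The staggered cycle can be sketched as a lookup from the n-th Monday of a month to
the centers processed on that day. The SCHEDULE dictionary below only encodes the
example given above (center A, then centers B and C); the full assignment is not
stated here, and the function names are hypothetical.

```python
import datetime

# Example assignment from the text above; the full schedule is not given here.
SCHEDULE = {1: ["A"], 2: ["B", "C"]}


def nth_monday(year, month, n):
    """Return the date of the n-th Monday (n >= 1) of the given month."""
    first = datetime.date(year, month, 1)
    days_to_monday = (7 - first.weekday()) % 7   # Monday is weekday 0
    day = first + datetime.timedelta(days=days_to_monday + 7 * (n - 1))
    if day.month != month:
        raise ValueError("month has no Monday number %d" % n)
    return day


def centers_for(date):
    """Return the centers whose uploads are processed on this date, if any."""
    if date.weekday() != 0:
        return []
    n = (date.day - 1) // 7 + 1      # which Monday of the month this is
    return SCHEDULE.get(n, [])
```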
For each assay or genotype XML-file in a monthly submission batch, a number of tasks
are performed:
1) MD5 checksum provided by submitter is compared to locally computed checksum of
the XML-file on the DCC server. If the two do not match, then a data transfer
error may have occurred and the file is thus rejected.
2) XML-file is validated against the XML schema. If found invalid or not
well-formed, the file is rejected.
3) XML-file is processed with XML-2-RDMS database middleware to import into the
primary DCC tracking database. If any errors are thrown (typically unique-key or
foreign-key violations), the file is rejected.
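Step 1 can be sketched as follows. Because the submitted checksum covers the
XML-file before compression, the digest is computed over the decompressed stream.
The function names are hypothetical, and the checksum file is assumed to hold the
hex digest as its first whitespace-separated token (as md5sum-style output does);
the actual layout of that file is not specified in this protocol.

```python
import gzip
import hashlib


def md5_of_uncompressed(gz_path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the decompressed content of a gzipped file."""
    digest = hashlib.md5()
    with gzip.open(gz_path, "rb") as handle:
        # Read in chunks so that large XML-files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(gz_path, submitted_text):
    """Compare the local digest with the submitter's checksum file content."""
    tokens = submitted_text.split()
    if not tokens:
        return False                 # empty checksum file: treat as a mismatch
    return md5_of_uncompressed(gz_path) == tokens[0].lower()
```

A file for which checksum_matches returns False would be rejected before schema
validation is attempted.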
After all files in a batch have been put through the DCC pipeline described
above, a DCC team member inspects the processing report and identifies possible
problems. The report is then E-mailed to the center data manager(s) along with
the team member's comments. Detailed logs are created at each processing step
and are made available to the data manager.
Immediately following this procedure, the data imported to the database in this
batch (if any) are put through the post-dbimport analysis pipeline (see separate
protocol description).
LSID: urn:lsid:dcc.hapmap.org:Protocol:submission_analysis:1
Title: Post-dbimport analysis of data submitted to the DCC
Description:
After a monthly batch of uploaded files from a genotyping center has been
imported to the tracking database, the post-dbimport analysis is initiated.
This involves A) making sure that the submitted assay and genotype records
are consistent and error-free, and B) running standard quality-control
analyses on the whole data batch:
Assay consistency tests:
1) Assay must point to a valid SNP record. This requires a valid LSID
pointing to a record in the HapMap SNP allocations used by centers to
design assays (see http://www.hapmap.org/downloads/allocated_snps/?N=D).
2) All assay oligos (PCR primers, probes etc., depending on platform) must
align to the sequence around the SNP.
3) The so-called 'primary oligo' for the assay must align to the sequence
around the SNP in the same strand-orientation as the reported assay strand
in the XML-file.
Genotype consistency tests:
1) Sample LSIDs referenced in each individual genotype in a 95-genotype set
must be from the referenced panel.
2) All 90 samples along with 5 duplicates from the panel/plate are present.
3) Reported alleles match alleles of the SNP (according to dbSNP).
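The three genotype consistency tests can be sketched as below. The data layout is
an assumption made for illustration: each genotype is a (sample_id, allele1,
allele2) tuple, a duplicate appears as a second record for the same sample, and a
missing call is written as "N". The function name is hypothetical.

```python
from collections import Counter


def check_genotype_set(genotypes, panel_samples, snp_alleles,
                       n_samples=90, n_duplicates=5, missing="N"):
    """Return a list of problem descriptions; an empty list means consistent."""
    problems = []
    counts = Counter(sample for sample, _, _ in genotypes)

    # 1) Every referenced sample must belong to the referenced panel.
    foreign = sorted(set(counts) - set(panel_samples))
    if foreign:
        problems.append("samples not in panel: " + ", ".join(foreign))

    # 2) All panel samples are present, and exactly n_duplicates appear twice.
    twice = [s for s, c in counts.items() if c == 2]
    too_many = [s for s, c in counts.items() if c > 2]
    if len(counts) != n_samples or len(twice) != n_duplicates or too_many:
        problems.append("expected %d samples with %d duplicates, found %d and %d"
                        % (n_samples, n_duplicates, len(counts), len(twice)))

    # 3) Reported alleles must be alleles of the SNP (or a missing call).
    allowed = set(snp_alleles) | {missing}
    bad = sorted({a for _, a1, a2 in genotypes for a in (a1, a2)
                  if a not in allowed})
    if bad:
        problems.append("alleles not matching SNP: " + ", ".join(bad))

    return problems
```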
QC-analysis:
The following metrics are calculated for each SNP and summarized. See data
release protocol for information on how these results are used to filter
data prior to release:
-Proportion of genotyped individuals (% of the 90-individual panel)
-Hardy-Weinberg Equilibrium (HWE) p-value, exact test (Abecasis, unpublished)
-Mendelian inheritance check
-Duplicate discrepancy check
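The four metrics can be sketched as below. Genotypes are assumed to be two-letter
strings such as "AG", with None for a missing call, and all function names are
hypothetical. The HWE function is a textbook exact test that conditions on the
observed allele counts and sums the probabilities of all heterozygote counts that
are no more likely than the observed one. Whether it matches the unpublished
implementation cited above is not known.

```python
import math


def completion_rate(genotypes, panel_size=90):
    """Fraction of the panel with a called genotype (duplicates excluded)."""
    return sum(1 for g in genotypes if g is not None) / panel_size


def is_mendel_consistent(child, father, mother):
    """True if the child can inherit one allele from each parent."""
    if child is None or father is None or mother is None:
        return True                  # cannot be assessed with a missing call
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def duplicate_discrepancies(pairs):
    """Count (original, duplicate) pairs whose called genotypes disagree."""
    return sum(1 for first, second in pairs
               if first is not None and second is not None
               and sorted(first) != sorted(second))


def hwe_exact_pvalue(n_aa, n_ab, n_bb):
    """Exact HWE p-value from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    n_a = 2 * n_aa + n_ab            # copies of allele A
    n_b = 2 * n_bb + n_ab            # copies of allele B

    def log_fact(k):
        return math.lgamma(k + 1)

    def prob(hets):
        # P(hets | n, n_a) = 2^hets * n! / (aa! hets! bb!) * n_a! n_b! / (2n)!
        aa, bb = (n_a - hets) // 2, (n_b - hets) // 2
        return math.exp(hets * math.log(2) + log_fact(n)
                        - log_fact(aa) - log_fact(hets) - log_fact(bb)
                        + log_fact(n_a) + log_fact(n_b) - log_fact(2 * n))

    # Heterozygote counts share the parity of n_a and cannot exceed min(n_a, n_b).
    probs = [prob(h) for h in range(n_a % 2, min(n_a, n_b) + 1, 2)]
    observed = prob(n_ab)
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))
```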
A summary from this analysis is E-mailed to a DCC team member for review, then
forwarded along with comments to the center data manager. If significant
problems are found (e.g. many oligo-alignment failures, 'orphan' assays),
sometimes the whole batch is withdrawn at this stage and resubmitted after
the center fixes the problem.
If no problems are found and the data manager gives final approval, the
monthly batch is flagged as good and is included in the next data freeze
and subsequent public release on the project website.
Generated 10:13:45 13-Oct-2005
Last updated : protocol.tt2,v 1.2 2004/11/25 16:32:25 mummi Exp