International HapMap Project
 


 

Protocols for HapMap data-handling

LSID:  urn:lsid:dcc.hapmap.org:Protocol:data_release:1
Title: Data release procedure at DCC
Description: 
  When dumping out data for a public freeze, the primary tracking database is queried 
for all available genotype sets. Each genotype set and its parent assay record are first 
checked to see whether the submitter has flagged them as passed. Any fail-flagged assays or 
genotype sets are excluded from the dump.
  Next, the genotype set is checked against the following QC-parameters, in order; if it 
does not pass a given threshold, the genotype set is excluded from the dump. These are the 
standard, uniform QC-filters applied to all project data before public release (as of 
Sep04):

 -Completion rate (a.k.a. plate passrate) >= 80%
 -Hardy-Weinberg equilibrium p-value      >= 0.001
 -Mendel inheritance errors               <= 1
 -Duplicate discrepancies                 <= 1
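
The filter step above can be sketched in Python as follows. This is only an illustration of the four thresholds applied in order; the class `GenotypeSetQC`, the function `first_failed_filter` and the field names are hypothetical and are not taken from the DCC dumping script.

```python
from dataclasses import dataclass

# Thresholds as listed above (as of Sep04).
MIN_COMPLETION_RATE = 0.80
MIN_HWE_PVALUE = 0.001
MAX_MENDEL_ERRORS = 1
MAX_DUPLICATE_DISCREPANCIES = 1


@dataclass
class GenotypeSetQC:
    """Hypothetical container for the QC results of one genotype set."""
    completion_rate: float          # fraction between 0.0 and 1.0
    hwe_pvalue: float
    mendel_errors: int
    duplicate_discrepancies: int


def first_failed_filter(qc):
    """Return the name of the first filter the set fails, or None if it passes.

    Filters are evaluated in the order in which the protocol lists them.
    """
    checks = [
        ("completion_rate", qc.completion_rate >= MIN_COMPLETION_RATE),
        ("hwe_pvalue", qc.hwe_pvalue >= MIN_HWE_PVALUE),
        ("mendel_errors", qc.mendel_errors <= MAX_MENDEL_ERRORS),
        ("duplicate_discrepancies",
         qc.duplicate_discrepancies <= MAX_DUPLICATE_DISCREPANCIES),
    ]
    for name, passed in checks:
        if not passed:
            return name
    return None
```

A genotype set for which `first_failed_filter` returns `None` would go on to the file-printout stage; any other return value names the filter that caused it to be skipped.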

  Each genotype set that passes QC-filtering is passed to the file-printout stage, where 
the individual genotypes, allele & genotype frequencies, assays etc. are printed to the 
appropriate files & formats. This task is handled by Template Toolkit templates, such 
that modifying output formats or adding new formats is relatively easy and can be done 
with little or no modification to the dumping script itself. Output formats include 
text tables for bulk downloads and GFF-files for loading into Bio::DB::GFF.
  
  After dumping out all data, the public FTP-repository for the new release is created 
on the DCC development server. At the same time, the data are loaded into the 
visualization database along with other genome annotations, to create the GBrowse 
back end for this new release.
  Both the FTP-dirs and GBrowse are tested via web browser on the development website 
and, if all looks good, finally mirrored via rsync to the public webserver hapmap.org, at 
which point the new release becomes visible to the public.


LSID:  urn:lsid:dcc.hapmap.org:Protocol:data_exchange:1
Title: Data exchange between genotyping centers and the DCC
Description: 
The HapMap genotyping center prepares assay and genotype XML datafiles that 
conform to the HapMap XML Schema definition. See 
http://www.hapmap.org/xml-schema/2003-11-04/hapmap.xsd for the schema itself 
and http://www.hapmap.org/downloads/xml_docs/ 
for more information and sample XML-documents. The center has the ability to 
locally validate the XML-documents against the schema using a schema-validating 
parser. This helps to catch numerous syntax errors before a file even gets to the 
DCC.

The genotyping center then uploads the gzipped assay and genotype XML-files to 
the DCC server hapmap.cshl.org via sFTP (secure FTP over SSH). For each 
compressed XML-file, the center also provides a file with the MD5 checksum for 
the XML-file before compression (see below for the purpose of this checksum 
exercise). After uploading, the center can use a web browser to log on to the 
password-protected DCC site http://hapmap.cshl.org and see a listing of uploaded 
files, processing status, summary for processed batches, and more.

The DCC processes uploaded datafiles in a staggered submission cycle, such that 
on the 1st Monday new uploads from center A are processed, on the 2nd Monday 
uploads from centers B and C are processed, and so on. This is done to even out the 
processing load at the DCC; the amount of data passing from the centers to the 
DCC is quite large, and if many centers submitted on the same day (typically at the 
end of the month), the DCC would need several days to clear the resulting 
queue.
  For each assay or genotype XML file in a monthly submission batch, the following 
tasks are performed:
1) The MD5 checksum provided by the submitter is compared to a locally computed 
checksum of the XML-file on the DCC server. If the two do not match, a data 
transfer error may have occurred and the file is rejected.
2) The XML-file is validated against the XML schema. If it is found to be invalid 
or not well-formed, the file is rejected.
3) The XML-file is processed with XML-2-RDMS database middleware to import it into 
the primary DCC tracking database. If any errors are thrown (typically unique-key or 
foreign-key violations), the file is rejected.
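
As an illustration of step 1, the checksum comparison might look like the Python sketch below. It assumes the checksum file holds the hex digest as its first whitespace-separated token (as `md5sum` output does); the function names and file layout are hypothetical and do not describe the actual DCC scripts. The digest is computed over the decompressed data, because the submitter's checksum refers to the XML-file before compression.

```python
import gzip
import hashlib


def md5_of_uncompressed(gz_path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the data stored inside a gzip file."""
    digest = hashlib.md5()
    with gzip.open(gz_path, "rb") as handle:
        # Read in chunks so that large XML-files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(gz_path, md5_path):
    """Compare the submitter's checksum file with a locally computed digest."""
    with open(md5_path, "r", encoding="ascii") as handle:
        # Assumption: the digest is the first token in the checksum file.
        submitted = handle.read().split()[0].lower()
    return submitted == md5_of_uncompressed(gz_path)
```

A file for which `checksum_matches` returns `False` would be rejected before schema validation is attempted.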

  After all files in a batch have been put through the DCC pipeline described 
above, a DCC team member inspects the processing report and identifies possible 
problems, and the center data manager(s) are E-mailed the report along with that 
team member's comments. Detailed logs are created at each processing step and are 
made available to the data manager.
  Immediately following this procedure, the data imported to the database in this 
batch (if any) are put through the post-dbimport analysis pipeline (see separate 
protocol description).


LSID:  urn:lsid:dcc.hapmap.org:Protocol:submission_analysis:1
Title: Post-dbimport analysis of data submitted to the DCC
Description: 
After a monthly batch of uploaded files from a genotyping center has been 
imported to the tracking database, the post-dbimport analysis is initiated. 
This involves A) making sure that the submitted assay and genotype records 
are consistent and error-free, and B) running standard quality-control 
analyses on the whole data batch:

Assay consistency tests:
1) Assay must point to a valid SNP record. This requires a valid LSID 
pointing to a record in the HapMap SNP allocations used by centers to 
design assays (see http://www.hapmap.org/downloads/allocated_snps/?N=D).
2) All assay oligos (PCR primers, probes etc., depending on platform) must 
align to the sequence around the SNP. 
3) The so-called 'primary oligo' for the assay must align to the sequence 
around the SNP in the same strand-orientation as the reported assay strand 
in the XML-file.
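
The strand-orientation test in item 3 can be illustrated with the simplified Python sketch below. It substitutes exact substring matching for a real sequence alignment, which is an assumption made to keep the example short; the protocol does not say which alignment method the DCC uses, and all function names and the strand labels 'forward' and 'reverse' are hypothetical.

```python
# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def oligo_strand(oligo, flank):
    """Return 'forward', 'reverse', or None if the oligo is not found.

    Simplification: an exact match stands in for an alignment. An oligo
    that matches on both strands is reported as 'forward'.
    """
    oligo, flank = oligo.upper(), flank.upper()
    if oligo in flank:
        return "forward"
    if reverse_complement(oligo) in flank:
        return "reverse"
    return None


def primary_oligo_consistent(oligo, flank, reported_strand):
    """True if the oligo matches the flank on the strand reported in the XML."""
    return oligo_strand(oligo, flank) == reported_strand
```

For example, with the flanking sequence `TTGACCATGCAAGT`, the oligo `GACCATG` is found on the forward strand, while `CATGGTC` is found only as its reverse complement.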

Genotype consistency tests:
1) Sample LSIDs referenced in each individual genotype in a 95-genotype set 
must be from the referenced panel.
2) All 90 samples along with 5 duplicates from the panel/plate are present.
3) Reported alleles match alleles of the SNP (according to dbSNP).
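
A rough Python sketch of the three genotype consistency tests is given below. The data layout is an assumption made for illustration: each genotype is a `(sample_id, (allele1, allele2))` tuple, a duplicate appears as a second record for the same sample, and `N` marks a missing call. None of these conventions, nor the function name, comes from the HapMap XML schema.

```python
from collections import Counter


def check_genotype_set(records, panel_samples, snp_alleles,
                       expected_duplicates=5, missing="N"):
    """Return a list of problem descriptions (empty if the set is consistent).

    records       -- list of (sample_id, (allele1, allele2)) tuples
    panel_samples -- set of sample ids in the referenced panel
    snp_alleles   -- set of alleles of the SNP, e.g. {"A", "G"}
    """
    problems = []
    counts = Counter(sample_id for sample_id, _ in records)

    # 1) Every referenced sample must belong to the referenced panel.
    unknown = sorted(set(counts) - set(panel_samples))
    if unknown:
        problems.append("samples not in panel: " + ", ".join(unknown))

    # 2) All panel samples must be present, plus the expected duplicates.
    absent = sorted(set(panel_samples) - set(counts))
    if absent:
        problems.append("panel samples missing: " + ", ".join(absent))
    n_duplicates = sum(n - 1 for n in counts.values())
    if n_duplicates != expected_duplicates:
        problems.append("expected %d duplicates, found %d"
                        % (expected_duplicates, n_duplicates))

    # 3) Reported alleles must be alleles of the SNP (or the missing code).
    allowed = set(snp_alleles) | {missing}
    bad = sorted({a for _, gt in records for a in gt} - allowed)
    if bad:
        problems.append("unexpected alleles: " + ", ".join(bad))
    return problems
```

Returning a list of messages rather than a single pass/fail flag lets all problems in a genotype set be reported to the data manager at once.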

QC-analysis:
 The following metrics are calculated for each SNP and summarized. See data 
release protocol for information on how these results are used to filter 
data prior to release:
  -Proportion of genotyped individuals (% of the 90-individual panel)
  -Hardy-Weinberg Equilibrium (HWE) p-value, exact test (Abecasis, unpublished)
  -Mendelian inheritance check
  -Duplicate discrepancy check
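
Three of these metrics are simple enough to sketch in Python; the HWE exact test is left out here because the protocol does not describe its implementation. The sketch assumes genotypes are 2-tuples of alleles with `None` for a failed call, and that Mendelian checks are run on father/mother/child trios. The function names are hypothetical.

```python
def completion_rate(genotypes, panel_size=90):
    """Fraction of the panel with a called genotype.

    genotypes maps sample id to a 2-tuple of alleles, or None for no call.
    """
    called = sum(1 for gt in genotypes.values() if gt is not None)
    return called / panel_size


def is_mendel_error(father, mother, child):
    """True if the child's genotype cannot be inherited from the parents.

    Trios with any missing call are not counted as errors.
    """
    if father is None or mother is None or child is None:
        return False
    a, b = child
    # One allele must come from each parent, in either assignment.
    return not ((a in father and b in mother) or (b in father and a in mother))


def count_duplicate_discrepancies(pairs):
    """Count duplicate pairs whose called genotypes differ (allele order ignored)."""
    return sum(
        1 for first, second in pairs
        if first is not None and second is not None
        and sorted(first) != sorted(second)
    )
```

For instance, a child typed `('A', 'G')` with both parents typed `('A', 'A')` is flagged as a Mendelian error, whereas the duplicate pair `('A', 'G')` and `('G', 'A')` is not counted as a discrepancy.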


A summary from this analysis is E-mailed to a DCC team member for review, then 
forwarded along with comments to the center data manager. If significant 
problems are found (e.g. many oligo-alignment failures, 'orphan' assays), 
the whole batch is sometimes withdrawn at this stage and resubmitted after 
the center fixes the problem.
  If no problems are found and the data manager gives final approval, the 
monthly batch is flagged as good and is included in the next data freeze 
and subsequent public release on the project website.


Generated 10:13:45 13-Oct-2005

Last updated : protocol.tt2,v 1.2 2004/11/25 16:32:25 mummi Exp


Please send questions and comments on website to help@hapmap.org