Batch Genome Variation Server
An NHLBI Program for Genomic Applications  

How to Use Batch GVS
Batch GVS allows you to submit a text file with a list of genes, chromosome regions, or rs IDs, and later download a file of genotypes, SNP summary information, r2 values, tag SNPs, or haplotypes (for individuals in all populations available, or for individuals and/or populations you specify). The database search is performed at low priority (as such requests can be time-consuming). Once the search is complete, an e-mail will be sent to you with a link for downloading a file. If a single file is generated, it will be zipped (use gunzip on a Linux machine). If multiple files are generated, a tarball will be available for download (use tar xvzf on a Linux machine). Information on the underlying GVS database and on the details of all the calculations can be found in the documentation pages of the interactive GVS site.
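
For users who are not on a Linux command line, the unpacking step above can be sketched in Python with the standard library. This is only an illustration: it assumes that the single-file download is gzip-compressed (as the gunzip hint suggests) and that the multi-file download is a gzip-compressed tar archive. The function name unpack_download and the file names are hypothetical.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack_download(path, dest_dir):
    """Unpack a downloaded result file into dest_dir and return member names.

    Assumption: a tarball is a (possibly gzip-compressed) tar archive and a
    single result file is gzip-compressed.
    """
    path = Path(path)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Multiple result files: equivalent of "tar xvzf".
    if tarfile.is_tarfile(path):
        with tarfile.open(path, "r:*") as tar:
            names = tar.getnames()
            tar.extractall(dest)
        return names

    # Single result file: equivalent of "gunzip".
    out_name = path.stem if path.suffix == ".gz" else path.name + ".out"
    with gzip.open(path, "rb") as src, open(dest / out_name, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return [out_name]
```

Only unpack files that you downloaded yourself from the link in the e-mail, since extracting an untrusted tar archive can overwrite files.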

The input file must be plain-text. If, for example, you create the file in Microsoft Word, it must be saved as text-only.

The genotype output file is in a format similar to that of a "prettybase" file. Following a possible header line beginning with #, there is one line for each genotype. The columns are chromosome location (hg18), population:individual, first allele, second allele, rs ID, chromosome number, and region.
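
As a rough sketch, a genotype file with the seven columns listed above could be read in Python as follows. The names Genotype and parse_genotypes are hypothetical, the fields are kept as strings because the exact value formats are not specified here, and the code assumes tab-separated columns with the population and individual joined by a colon.

```python
from typing import Iterable, Iterator, NamedTuple


class Genotype(NamedTuple):
    location: str      # chromosome location (hg18)
    population: str    # part before the colon in population:individual
    individual: str    # part after the colon
    allele1: str
    allele2: str
    rs_id: str
    chromosome: str
    region: str


def parse_genotypes(lines: Iterable[str]) -> Iterator[Genotype]:
    """Yield one Genotype per data line, skipping blank and '#' lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # possible header line beginning with '#'
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError(f"expected 7 tab-separated columns: {line!r}")
        location, pop_ind, allele1, allele2, rs_id, chromosome, region = fields
        population, _, individual = pop_ind.partition(":")
        yield Genotype(location, population, individual,
                       allele1, allele2, rs_id, chromosome, region)
```

Typical use would be to open the unpacked file and iterate, for example `for g in parse_genotypes(open("result.txt")): ...`, where result.txt stands for whatever file name you received.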

The SNP summary file has several header lines, then one line for each SNP, with many annotation columns (see the GVS link; in Batch GVS, version 2 of the Illumina HumanHap300 chip has also been added, as I3.2).

The r2 file has a column header line, then one line for each SNP pair, with several annotation columns. The columns are first-SNP chromosome location, second-SNP chromosome location, r2, first-SNP rs ID, second-SNP rs ID, chromosome number, and region. The calculation of r2 is described here.
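
The r2 file layout described above could be read with a short Python sketch like the one below. The names R2Pair and read_r2_pairs are hypothetical. The code assumes tab-separated columns, treats the first line that does not begin with # as the column header, and reads only the first seven columns, so any further annotation columns are ignored.

```python
from typing import Iterable, Iterator, NamedTuple


class R2Pair(NamedTuple):
    first_location: str
    second_location: str
    r2: float
    first_rs: str
    second_rs: str
    chromosome: str
    region: str


def read_r2_pairs(lines: Iterable[str], min_r2: float = 0.0) -> Iterator[R2Pair]:
    """Yield SNP pairs whose r2 is at least min_r2."""
    header_seen = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        if not header_seen:
            header_seen = True  # assumed: the first data-like line is the column header
            continue
        f = line.split("\t")
        if len(f) < 7:
            raise ValueError(f"expected at least 7 columns: {line!r}")
        pair = R2Pair(f[0], f[1], float(f[2]), f[3], f[4], f[5], f[6])
        if pair.r2 >= min_r2:
            yield pair
```

The min_r2 argument is a local filter applied after download; it is separate from any cutoff applied by the server.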

The tag-SNP file has several header lines, then one or more lines for each bin. For the one-line-per-bin format, the columns are bin number, number of SNPs in the bin, percent average minor allele frequency, list of tagSNPs, and list of other SNPs. SNPs are binned by similar values of the r2 linkage disequilibrium value (see references below). This data is useful for the development of a minimal set of SNPs that can be used for large-scale genotyping of similar sample populations (by selecting one variation from each bin). The "tag SNPs" are those for which the pairwise-r2 values between the SNP and any other SNP in the bin are greater than the "r2Threshold" parameter chosen (see below). The "other SNPs" are those for which the pairwise-r2 value between the SNP and at least one other SNP in the bin is less than the r2 threshold. (It is preferable to choose a SNP from "tag SNPs" rather than "other SNPs" to represent the bin, if no other constraints exist.) If there are individuals from more than one population class, the MultiPop-TagSelect Algorithm may be requested, and there are then additional sections in the output files.
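
The distinction between "tag SNPs" and "other SNPs" within one bin can be illustrated with a small Python sketch. It is not the GVS implementation: the function name split_bin and the rs names are made up, the pairwise r2 values are supplied by the caller, and the treatment of a value exactly equal to the threshold (counted here as passing) is an assumption.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple


def split_bin(
    snps: Sequence[str],
    r2: Dict[FrozenSet[str], float],
    threshold: float = 0.8,
) -> Tuple[List[str], List[str]]:
    """Split the SNPs of one bin into (tag SNPs, other SNPs).

    A SNP is a tag SNP if its pairwise r2 with every other SNP in the bin
    meets the threshold; otherwise it is listed among the other SNPs.
    """
    tags: List[str] = []
    others: List[str] = []
    for snp in snps:
        passes = all(
            r2[frozenset((snp, other))] >= threshold  # assumed: ">=" at the boundary
            for other in snps
            if other != snp
        )
        (tags if passes else others).append(snp)
    return tags, others


# Made-up example: rsA is well correlated with both others, rsB and rsC are not
# well correlated with each other.
example_r2 = {
    frozenset(("rsA", "rsB")): 0.90,
    frozenset(("rsA", "rsC")): 0.85,
    frozenset(("rsB", "rsC")): 0.70,
}
tags, others = split_bin(["rsA", "rsB", "rsC"], example_r2, threshold=0.8)
```

In this made-up bin, rsA alone can stand in for the whole bin, which is why a tag SNP is the preferred choice to represent it.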

In the case of fastPHASE haplotype calculations (see documentation and reference), there are two options. The first is to download only the phased-genotype calls. The second is to download a tarball with all the files created by fastPHASE, as well as the genotype files created by GVS Batch for input to fastPHASE.

Columns in the output files are white-space separated (a tab in the case of genotypes and r2, and a space in the case of tagSNPs and SNP summary).

File Input: Quick Start
Make a file with a list of genes (one per line) and submit it. The downloaded file will list genotypes. Example file content:
	actb
	alad
	
File Input: the Details
The input for a database search is a file with one or more regions (one per line), and optionally some lines to customize the search or the calculations. The lines can be submitted in any order.

Blank lines are ignored, as are lines beginning with "##" (so a comment can be written after the two # characters).

Lines beginning with a "#" indicate optional parameters that can be set. There should be only one line for each parameter, with the exception of individual and population, for which there can be many lines. The line should have a # as the first character, followed by optional whitespace, followed by one of the parameter names, then whitespace (required), then the parameter value. (Any content beyond the parameter's value is ignored.) The parameter names are case-sensitive. The parameter set is

searchType
displayType
numberFiles
fileName
headerLine
writeParametersToOutputFile
r2DisplayCutoff (used only when displayType is r2 or r2LD)
annotation (used only when displayType is snpSummary, tagSNPs, or r2LD)
individual
population
expandUpstream (used only when searchType is geneName or geneID or rs)
expandDownstream (used only when searchType is geneName or geneID or rs)
keepRelatedHapMap
freqCutoff
noMonomorphic
r2Threshold (used only when displayType is tagSNPs)
coverageTagSNPs (used only when displayType is tagSNPs)
coverageClustering (used only when displayType is tagSNPs)
multipop (used only when displayType is tagSNPs)
tagSNPsFormat (used only when displayType is tagSNPs)
fastPHASERandomStarts (used only when displayType is fastPHASE)
returnTarballWithAllFastPHASEFiles (used only when displayType is fastPHASE)
fastPHASEUseClockSeed (used only when displayType is fastPHASE)
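
The line rules above (blank lines, ## comments, # parameter lines, and region lines) can be sketched as a small Python classifier for checking an input file before submission. The function name classify_line is hypothetical and this is not the server's parser. One assumption is labeled in the code: the value of headerLine is taken to be the rest of the line, since the examples use a multi-word header, while every other parameter value is taken to be a single whitespace-free token.

```python
from typing import Optional, Tuple


def classify_line(line: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (kind, parameter_name, value) for one input-file line.

    kind is one of "blank", "comment", "parameter", "invalid", "region".
    """
    text = line.rstrip("\r\n")
    if not text.strip():
        return ("blank", None, None)
    if text.startswith("##"):
        return ("comment", None, text[2:].strip())
    if text.startswith("#"):
        # '#', optional whitespace, name, required whitespace, value
        parts = text[1:].split(None, 1)
        if len(parts) < 2:
            return ("invalid", parts[0] if parts else None, None)
        name, rest = parts
        if name == "headerLine":
            value = rest.strip()      # assumed: header text may contain spaces
        else:
            value = rest.split()[0]   # content beyond the value is ignored
        return ("parameter", name, value)
    # Region line: content beyond the first whitespace is ignored.
    return ("region", None, text.split()[0])
```

For example, `classify_line("# numberFiles multiple")` gives `("parameter", "numberFiles", "multiple")`, and a line such as `actb` is classified as a region.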
The searchType parameter can be one of 4 values: geneName, geneID, chromosome, or rs.

The displayType parameter can be one of 6 values: genotypes, snpSummary, r2, r2LD, tagSNPs, or fastPHASE. The r2 type is used to get r2 values for all SNP pairs in the region. The r2LD type can only be used if the searchType is rs. This option is designed to display SNPs in linkage disequilibrium with a given SNP. All pairs in the output file have the input rs ID as one element of the pair. If tagSNPs is chosen, listing individuals is not allowed. For displayType fastPHASE, the fastPHASE program is run (on our server), and haplotypes are constructed for the genotypes in the query.

The numberFiles parameter can be single or multiple. If single, all data for the input file is put into a single file; if multiple, a separate data file is generated for each line in the input file.

The fileName parameter is a string of characters appropriate to a file name, with no intervening whitespace. In the absence of such a line, the filename will be GVSBatch followed by a time stamp. When present, the value will be inserted in the file name between GVSBatch and the time stamp. This parameter is solely for your use in keeping track of files. (The parameter does not need to be unique within a series of queries, as uniqueness is maintained by the timestamp.)

The headerLine parameter is for putting an information line at the beginning of each of the result files.

The writeParametersToOutputFile parameter is for putting input parameters at the beginning of each of the result files. If this parameter is set to true (rather than omitted or set to false), a line in the output file is added for each parameter line in the input file: # queryInput parameterName value.

The r2DisplayCutoff (displayTypes r2 or r2LD only) sets a threshold, such that only SNP pairs with r2 equal to or greater than this value are written to the output file. The range is 0.0 through 1.0.

The annotation parameter (snpSummary, tagSNPs, or r2LD displayType only) requests SNP annotation columns. For the tagSNPs case, the annotation is added only to the single-population sections, and only if the tagSNPsFormat parameter is set to oneLinePerSNP. For the r2LD case, the annotation is that of the second SNP (the one that is not the input rs ID). The value of the parameter is a comma-separated list of column names (with no whitespace in the list). The order of the columns in the output file will be the same as the order in the list. The available column names are the same as those used on the interactive GVS site, plus 3 allele-count columns:

  Alleles
  MinorAllele
  PercentAlleleFrequency
  Heterozygosity
  Chi-Square
  Genes
  Function
  ConservationScore
  SubmitterIDs
  ChimpAllele
  GenotypingChipIDs
  RepeatMasker
  TandemRepeatsFinder
  CopyNumberVariation
  UpstreamFlank
  DownstreamFlank
  NumberAlleles
  NumberMajorAlleles
  NumberMinorAlleles

The individual parameter sets a dbSNP numerical individual ID. Such a line is optional. If there is at least one individual line, genotypes are returned only for the individuals listed in all the "# individual" lines. If the ID is not a number, the line is ignored. These lines are not allowed (thus far) if the displayType is tagSNPs. To find individual IDs see this list of individuals and their IDs in our database (or see this link for HapMap only).

The population parameter sets a dbSNP numerical population ID. Such a line is optional. If there is at least one population line, genotypes are returned only for the populations listed in all the "# population" lines. If the ID is not a number, the line is ignored. To find population IDs see this list of populations and their IDs in our database (or see this link for HapMap only).

If there are both population and individual lines (allowed if displayType not tagSNPs), the search is restricted to genotypes for which the listed individuals belong to one of the listed populations.

If expandUpstream and/or expandDownstream are included, the region of the genome is extended. Up and down refer to the direction of the genome, not to the direction of the gene for geneName or geneID searches. If the searchType is chromosome, these lines are ignored.

The keepRelatedHapMap parameter is only used for HapMap populations 1412 (African) and 1409 (European), where there are 30 trios, and thus 60 individuals (in each of these two populations) that are related to 30 others. If this parameter is set to true, all genotypes will be analyzed. For tag SNPs it is recommended that the default of false be used. If false, a set of unrelated individuals (including one member of each trio) is used.

The freqCutoff parameter eliminates from consideration SNPs having minor allele frequencies below the cutoff: range is 0 through 50, an integer in units of percent.

The noMonomorphic parameter (true or false) eliminates from consideration SNPs having zero minor allele frequency (i.e., having only one allele).

Five parameters are used only when displayType is tagSNPs. The value r2 is the square of the Pearson correlation coefficient in linkage disequilibrium calculations. The parameter r2Threshold is the minimum value of r2 for variations to belong to the same tag-SNP bin. Its range is from 0.0 through 1.0. For a discussion of linkage disequilibrium and tagSNPs see this link and Carlson et al., Am. J. Hum. Genet., 74:106-120, 2004. The next two tag-SNP parameters are integers, in units of percent. These are designed to put SNPs with many unknown genotypes into separate bins. The coverageTagSNPs parameter is the minimum data coverage for a variation to be considered as a potential tagSNP: range is 0 through 100. The coverageClustering parameter is the minimum data coverage for a variation to be clustered potentially with other variations: range is 0 through 100. Its value must be less than the coverageTagSNPs value. If the multipop parameter is omitted or is set to true (rather than false), the MultiPop-TagSelect Algorithm will be used for tag-SNP selection if there are individuals in different population groups. When using this algorithm, it is best to specify two or more population parameters. The tagSNPsFormat parameter can be set to either of two values: oneLinePerBin (the default for the original format) or oneLinePerSNP. It affects only single-population sections. If oneLinePerSNP is selected, only one SNP is printed on a line, the bins are separated by a blank line, and it's possible to add annotation.
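
The ranges and the ordering constraint described above can be checked before submission with a sketch like the following. The function name check_tag_snp_parameters is hypothetical; the default values are the ones this page gives for tag SNPs (0.8, 85, 70, oneLinePerBin). The server may apply further checks that are not modeled here.

```python
from typing import List


def check_tag_snp_parameters(
    r2_threshold: float = 0.8,
    coverage_tag_snps: int = 85,
    coverage_clustering: int = 70,
    tag_snps_format: str = "oneLinePerBin",
) -> List[str]:
    """Return a list of problems found; an empty list means no problem was found."""
    problems: List[str] = []
    if not 0.0 <= r2_threshold <= 1.0:
        problems.append("r2Threshold must be from 0.0 through 1.0")
    for name, value in (
        ("coverageTagSNPs", coverage_tag_snps),
        ("coverageClustering", coverage_clustering),
    ):
        if not (isinstance(value, int) and 0 <= value <= 100):
            problems.append(f"{name} must be an integer from 0 through 100")
    if coverage_clustering >= coverage_tag_snps:
        problems.append("coverageClustering must be less than coverageTagSNPs")
    if tag_snps_format not in ("oneLinePerBin", "oneLinePerSNP"):
        problems.append("tagSNPsFormat must be oneLinePerBin or oneLinePerSNP")
    return problems
```

With the values 0.75, 90, and 75 for r2Threshold, coverageTagSNPs, and coverageClustering, the function returns an empty list.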

Three parameters are used only when displayType is fastPHASE. The fastPHASERandomStarts parameter selects the number of random starts of the expectation-maximization algorithm in fastPHASE. The default is 20. Higher values may achieve greater accuracy, but the calculation time is increased proportionally. If the returnTarballWithAllFastPHASEFiles parameter is set to true (default is false), a tarball will be returned with all the files created by fastPHASE, as well as the genotype files created by GVS Batch for input to fastPHASE. In this tarball there will also be a monitor file that captures any information fastPHASE would normally write to a console window, and a file echoing your query parameters. If returnTarballWithAllFastPHASEFiles is false (or not specified), only the "hapguess_switch.out" file content of fastPHASE is returned, but preceded by a line specifying the order of the SNPs (as this information is not in the fastPHASE output files, only in the GVS Batch input files). The fastPHASEUseClockSeed parameter controls how the random number generator is seeded. If set to false (the default), the fastPHASE calculations are performed with a seed of 13579, so successive submissions will produce the same result. If set to true, the server clock will be used to seed the random number generator, and successive submissions will produce slightly different results.

If there are no lines in the input file for a given parameter, the defaults are used: geneName for searchType, genotypes for displayType, single for numberFiles, no file name modification for fileName, no header line for headerLine, writeParametersToOutputFile = false, r2DisplayCutoff = 0.0, no additional annotation, all individuals and populations returned, no expansion of the region, and false for keepRelatedHapMap. For tag SNPs, the defaults are 0.8 for r2Threshold, 0 for freqCutoff, true for noMonomorphic, 85 for coverageTagSNPs, 70 for coverageClustering, true for multipop, and oneLinePerBin for tagSNPsFormat. For fastPHASE, the defaults are 20 for fastPHASERandomStarts, false for returnTarballWithAllFastPHASEFiles, and false for fastPHASEUseClockSeed.

For the case of searchType=geneName, the region lines (one line per region) should each contain the name of a gene (upper or lower case is fine). For the case of searchType=geneID, each line should contain a numerical gene ID. For the case of searchType=chromosome, each line should contain a hg18 (NCBI 36) region in the form chr*:begin-end (see example below), where * is 1 through 22 or X or Y, and "end" must be equal to or larger than "begin". If only one base is to be queried, the "-" and end base are optional: e.g. chr7:5534892 will be interpreted as chr7:5534892-5534892. (This is useful if rs IDs are required for a list of chromosome positions; this one-base option with a displayType of snpSummary will list rs IDs if there are known variations at the locations.) For the case of searchType=rs, each line is the dbSNP rs ID, with or without lower-case "rs" in front of the number. No regions can have whitespace characters in the middle. Any content beyond whitespace is ignored.
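
The region-line formats for the chromosome and rs search types can be checked with regular expressions, as in this Python sketch. The function names are hypothetical, and the patterns are one reading of the rules above (chr followed by 1 through 22, X, or Y; an optional end base; an optional lower-case rs prefix), not the server's own validation.

```python
import re

# chr1..chr22, chrX, chrY, then begin and an optional "-end"
_CHROM_RE = re.compile(r"(chr(?:[1-9]|1[0-9]|2[0-2]|X|Y)):(\d+)(?:-(\d+))?")
_RS_RE = re.compile(r"(?:rs)?(\d+)")


def normalize_chromosome_region(token: str) -> str:
    """Return the region as chrN:begin-end, expanding a single base."""
    match = _CHROM_RE.fullmatch(token)
    if match is None:
        raise ValueError(f"not a chromosome region: {token!r}")
    chrom, begin, end = match.group(1), int(match.group(2)), match.group(3)
    end_base = begin if end is None else int(end)
    if end_base < begin:
        raise ValueError(f"end must be equal to or larger than begin: {token!r}")
    return f"{chrom}:{begin}-{end_base}"


def normalize_rs(token: str) -> str:
    """Return the rs ID with a lower-case 'rs' prefix."""
    match = _RS_RE.fullmatch(token)
    if match is None:
        raise ValueError(f"not an rs ID: {token!r}")
    return "rs" + match.group(1)
```

For instance, `normalize_chromosome_region("chr7:5534892")` returns `"chr7:5534892-5534892"`, matching the single-base interpretation described above.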

File Input Examples
Here are several examples of input files. (If you copy and paste any of these, remove the white-space at the beginnings of the lines.)

Example 1    download example 1
	## list of gene names, each gene in a separate file
	# searchType geneName
	# numberFiles multiple
	# headerLine this is a test
	actb
	alad
	
Example 2    download example 2
	## list of gene IDs, all in one file; include additional bases upstream and downstream of the genes; echo the input
	# searchType geneID
	# numberFiles single
	# writeParametersToOutputFile true
	# fileName testIDs
	# expandUpstream 1500
	# expandDownstream 3000
	60
	6624
	
Example 3    download example 3
	## chromosome region, only two individuals, one related
	# headerLine example 3
	# fileName chromosome.example3
	# writeParametersToOutputFile true
	# searchType chromosome
	# numberFiles single
	# individual 362
	# individual 349
	# noMonomorphic false
	# keepRelatedHapMap true
	chr7:5300000-5400000
	
Example 4    download example 4
	## rs IDs, one file, the rs is optional, only for population 693
	# searchType rs
	# numberFiles single
	# population 693
	rs7612
	rs7161563	
	
Example 5    download example 5
	## snp annotation for snps having genotypes for the gene abo, population 595
	# numberFiles single
	# displayType snpSummary
	# population 595
	# headerLine this is a test
	# fileName abo.595.snpSummary
	abo
	
Example 6    download example 6
	## tag SNPs for abo and vkorc1 with non-standard thresholds, population 596
	# numberFiles single
	# searchType geneName
	# fileName abo.vkorc1.596.tagSNPs
	# displayType tagSNPs
	# population 596
	# freqCutoff 10
	# coverageTagSNPs 90
	# coverageClustering 75 
	# r2Threshold 0.75
	abo
	vkorc1
	
Example 7    download example 7
	## tag SNPs for abo and vkorc1, with MultiPop-TagSelect, populations 596 and 595, annotation
	# numberFiles single
	# searchType geneName
	# fileName abo.vkorc1.596.595.tagSNPs.multipop
	# displayType tagSNPs
	# population 596
	# population 595
	# multipop true
	# tagSNPsFormat oneLinePerSNP
	# annotation Function,ConservationScore,PercentAlleleFrequency,Genes
	abo
	vkorc1
	
Example 8    download example 8
	## genotypes for the gene abo, HapMap populations, keep related individuals
	# numberFiles single
	# displayType genotypes
	# fileName abo.HapMap.genotypes.all
	# keepRelatedHapMap true
	# population 1412
	# population 1411
	# population 1410
	# population 1409
	abo
	
Example 9    download example 9
	## r2 for the rs ID 500499, plus SNPs 100K bases on each side, show only r2>=0.5
	# searchType rs
	# numberFiles single
	# displayType r2
	# r2DisplayCutoff 0.5
	# fileName 500499.HapMap.r2
	# population 1412
	# population 1411
	# population 1410
	# population 1409
	# expandUpstream 100000
	# expandDownstream 100000
	500499
	
Example 10    download example 10
	## r2 for the rs ID 500499, plus SNPs 100K bases on each side, show only r2>=0.5, only for pairs where one is 500499
	## additional annotation requested for the other member of the pair
	# searchType rs
	# numberFiles single
	# displayType r2LD
	# r2DisplayCutoff 0.5
	# fileName 500499.HapMap.r2LD
	# population 1412
	# population 1411
	# population 1410
	# population 1409
	# expandUpstream 100000
	# expandDownstream 100000
	# annotation Function,ConservationScore,PercentAlleleFrequency,Genes,RepeatMasker,TandemRepeatsFinder
	500499
	
Example 11    download example 11
	## run fastPHASE for two genes, return only phased genotypes, 20 random starts (the default)
	# fileName fastPHASE.abo.vkorc1.Tdefault
	# numberFiles single
	# headerLine fastPHASE
	# searchType geneName
	# displayType fastPHASE
	# returnTarballWithAllFastPHASEFiles false
	# fastPHASERandomStarts 20
	# freqCutoff 0
	# noMonomorphic true
	# writeParametersToOutputFile true
	# population 596
	abo
	vkorc1
	
Additional, Specialized Parameters

Setting the omitSingleSNPBins parameter to true suppresses all tag SNP bins with only one SNP in them. (The default is false.) If there is only one population, this simply has the effect that such bins do not appear in the output file. If the MultiPop-TagSelect Algorithm is being used, only those bins with at least two SNPs (either tagSNPs or other SNPs) in them will be fed into the algorithm. This parameter should be used with caution, as it suppresses much known variation, and affects the mix of ancient versus recent mutations.


Setting the addR2ToTagSNPs parameter (used only for the tagSNPs displayType) to true adds a section to the output file with the r2 values for each pair in each bin. This option is so far available only for single-population tag SNPs. Extension to multipop usage is in progress.


The chipFilter parameter is used for the snpSummary and r2LD displayTypes. In the snpSummary case, only those SNPs on the designated chip will be written to the output file. In the r2LD case, the output lines are filtered so that only comparison SNPs on a particular chip are written to file. The value of the chipFilter parameter is one (and only one) chip ID (e.g. A5):
  A1 Affymetrix Mapping 100K Set
  A5 Affymetrix Mapping 500K Set
  A9 Affymetrix Genome-Wide Human SNP Array 6.0
  I1 Illumina Human-1 BeadChip
  I3 Illumina HumanHap300 BeadChip (version 1)
  I3.2 Illumina HumanHap300 BeadChip (version 2)
  I5 Illumina HumanHap550 BeadChip
  I6 Illumina HumanHap650Y BeadChip
  I10 Illumina Human1M BeadChip


There is an additional searchType value: chipID. It is used to request SNP annotation for all SNPs on a given chip. This type of search, in the interest of speed, does not access any genotypes. Most parameters are ignored. The only required parameters are searchType and chipFilter (A1, A5, etc. as in the list above). There are no region lines. The optional parameters headerLine and fileName have the same function as in other modes.

An example input file would be
	# headerLine I5ChipAnnotation
	# fileName I5Chip
	# searchType chipID
	# chipFilter I5
	
The columns in the single returned file are:

  base(NCBI.36)
  rsID
  Genes
  Function
  ConservationScore
  SubmitterIDs
  RepeatMasker
  TandemRepeatsFinder
  CopyNumberVariation
  UpstreamFlank
  DownstreamFlank
  InputFileRegion (the chip ID)

If the parameter returnTarballWithEmailMessage is set to true, the file returned will be a tarball including the contents of the email (including any warning or error messages), as well as the data files.

If the parameter compress is set to false, the file will be returned uncompressed. This parameter is ignored unless numberFiles is single, and the parameter returnTarballWithEmailMessage is false (both of these being the default values).

Limits for Large Queries
Once a file is received, the size of the job is evaluated for time, memory, and disk space requirements. If the job is too big, it is cancelled, and an e-mail is sent asking for the job to be broken into smaller pieces and submitted sequentially. For a "genotypes" query, the limit is 10,000 SNPs per line in the file and 250,000 SNPs for the entire file (the number being calculated before any frequency cutoff is applied). For a "tagSNPs" query, the limit is 5,000 SNPs per line, and the per-file limit is 60,000 without multipop and 30,000 with multipop. For "r2" or "r2LD", the limits are 5,000 per line and 150,000 per file. For "snpSummary", the limits are 10,000 per line and 600,000 per file. For "fastPHASE", both the line and file limits are 5,000 SNPs. If the freqCutoff parameter is greater than 0, all of these limits are multiplied by the factor (freqCutoff + 4) / 4.
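
As a worked illustration of the arithmetic, the limits quoted above can be put in a small Python table and scaled by the freqCutoff factor. The function name snp_limits is hypothetical, the numbers are the ones on this page at the time of writing (they may change), and truncating the scaled value to an integer is an assumption.

```python
from typing import Dict, Tuple

# (per-line limit, per-file limit) in SNPs, as quoted on this page
_LIMITS: Dict[str, Tuple[int, int]] = {
    "genotypes": (10_000, 250_000),
    "tagSNPs": (5_000, 60_000),      # per-file limit without multipop
    "r2": (5_000, 150_000),
    "r2LD": (5_000, 150_000),
    "snpSummary": (10_000, 600_000),
    "fastPHASE": (5_000, 5_000),
}
_TAG_SNPS_MULTIPOP_FILE_LIMIT = 30_000


def snp_limits(display_type: str, freq_cutoff: int = 0,
               multipop: bool = False) -> Tuple[int, int]:
    """Return (per-line, per-file) SNP limits for a query."""
    per_line, per_file = _LIMITS[display_type]
    if display_type == "tagSNPs" and multipop:
        per_file = _TAG_SNPS_MULTIPOP_FILE_LIMIT
    factor = (freq_cutoff + 4) / 4   # equals 1 when freqCutoff is 0
    return int(per_line * factor), int(per_file * factor)
```

For example, a tagSNPs query with multipop and freqCutoff 10 has a factor of (10 + 4) / 4 = 3.5, so the limits become 17,500 SNPs per line and 105,000 per file.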

These limits may be changed at any time as we monitor the server load. A job will reach a timeout limit in 24 hours. If no e-mail has been received by then, something has gone wrong. For a large job, it's a good idea to start with a small subset and see how long it takes, so that you can estimate when to expect the result for the entire job. A crude monitoring system is available; it looks at the number of processed region lines in the file.

Please do not submit simultaneous large jobs.

 