Friday, November 30, 2012

Geno 2.0 patch for DIYDodecad

(See important update at the end of this post)

People who have tested with the Genographic Project's Geno 2.0 can now use the DIYDodecad tool with their data. The raw data download from this test has a slightly different format from the 23andMe and Family Finder downloads, so you first need to convert your data into a format that DIYDodecad can interpret.

So, after you have downloaded and extracted the DIYDodecad software as per its instructions, you should also download a couple of extra files into your working directory; these files are included in this patch:

  • standardize.r, which replaces the standardize.r in the DIYDodecad software bundle and allows you to convert your Geno 2.0-formatted data
  • hgdp.base.txt, which includes additional information about SNP markers that is not found in your Geno 2.0 raw data download and is necessary to complete the conversion process.
Once these two files have been extracted into your working directory, the process of using DIYDodecad is exactly the same as for any other user of the software.

The only difference is at the step where you convert your data using the standardize command (see the DIYDodecad README file). There you will use the command:

standardize('johndoe.csv', company='geno2')

where johndoe.csv is your unzipped raw data download. This will write a genotype.txt file in the working directory, and you can proceed the rest of the way as per the instructions.

You can use all ancestry calculators released by the Project (or indeed other projects); the most recent one is globe13.

You should be aware that, because the Geno 2.0 test includes a smaller number of SNPs, and because globe13 and other calculators were developed using the common SNP set of 23andMe and Family Finder, the analysis using globe13 will only include ~34 thousand SNPs and will be "noisier" than usual. In the future, I might develop new calculators that make use of the SNP set of the Geno 2.0 test itself.

PS: Feel free to post a comment below if you experience any difficulty converting your data. Thanks also to CeCe Moore for graciously sharing a raw data file with me, which allowed me to build this converter.


UPDATE: Apparently, the data format has been changed for some Geno 2.0 data downloads.
If your data includes a [Header] ... [Data] preamble followed by a list of 5 comma-separated values, ignore this update and follow the instructions above unchanged (company='geno2').
If it includes a header "SNP,Chr,Allele1,Allele2" followed by a list of 4 comma-separated values, follow the instructions above, but use company='geno2new' instead.
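
If you are not sure which of the two layouts your download uses, a quick check of the first non-blank line of the file is usually enough. The following is only a minimal sketch in Python, run outside R, and it is not part of DIYDodecad or of this patch. The function names guess_company and guess_company_for_file are made up for illustration, and the sketch assumes that the older layout starts with a line beginning with [Header] and that the newer one starts with exactly the header SNP,Chr,Allele1,Allele2.

```python
from typing import Iterable, Optional

# Header line of the newer layout, as quoted in the post.
NEW_HEADER = "SNP,Chr,Allele1,Allele2"


def guess_company(lines: Iterable[str]) -> Optional[str]:
    """Guess the value to pass as company= from the start of a raw data file.

    Hypothetical helper: returns 'geno2', 'geno2new', or None when the
    first non-blank line matches neither layout.
    """
    for raw in lines:
        # Drop a possible byte-order mark and surrounding whitespace.
        line = raw.lstrip("\ufeff").strip()
        if not line:
            continue  # skip blank lines before the first real content
        if line.startswith("[Header]"):
            return "geno2"  # assumed start of the older layout
        if line.replace(" ", "") == NEW_HEADER:
            return "geno2new"  # newer 4-column layout
        return None  # first non-blank line matches neither layout
    return None  # empty input


def guess_company_for_file(path: str) -> Optional[str]:
    """Apply guess_company to a file on disk, e.g. 'johndoe.csv'."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return guess_company(handle)
```

Calling guess_company_for_file('johndoe.csv') gives the string to use as the company argument in the standardize command, or None if the file looks like neither layout, in which case it is safer to open the file and inspect it by hand.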