Tuesday, January 31, 2012

'K12b' and 'K7b' calculators

I am releasing two new calculators with K=12 and K=7 components, named 'K12b' and 'K7b'. You can scroll down to the bottom if you are just interested in the downloads, or read on.


New Features

The new 'K12b' calculator is an update of the previous K12a, inferred using all the new samples submitted during the last submission opportunity. The 12 components are still roughly the same, although their allele frequencies may have changed a bit, so existing participants can expect slightly altered results, and new participants in the Project more so, since their data now contribute to the creation of the new tool. Non-participants can, of course, use the new calculator with DIYDodecad.

I have also taken the opportunity to make some minor tweaks. I am releasing population portraits for K12b (which were lacking in K12a), and I've changed my visualization code so that the sample IDs of non-Dodecad populations can now be seen in the barplots. This may be useful for anyone else using these reference populations, since it makes it quick to identify potential outliers in them.

I have also decided to use normalized median admixture proportions for the populations. For example, if 5 individuals in a population have 0, 0, 0.2, 0.5, 10.0% of a particular component, then the average is 2.14%, but the median is 0.2%. By using the median, the proportions become less susceptible to the presence of outliers (such as the 10%). However, if the median is calculated over every component separately, it is no longer guaranteed that the components will add up to 100%; this can be addressed by re-normalizing them (scaling them by a constant factor) so that they do. I believe that use of the normalized median will not only give better proportions that are less susceptible to outliers, but will also improve results of the new Dodecad Oracle for K12b.
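The calculation can be sketched in R as follows. This is only an illustration: the matrix `props`, its component names A/B/C, and all values except the first column (which repeats the 0, 0, 0.2, 0.5, 10.0 example above) are invented, and the actual Project code may differ.

```r
# Toy matrix: rows are individuals of one population, columns are components.
# Values are in percent and each row sums to 100. Column A repeats the example
# from the text; columns B and C are invented for illustration.
props <- rbind(
  c( 0.0, 60.0, 40.0),
  c( 0.0, 55.0, 45.0),
  c( 0.2, 59.8, 40.0),
  c( 0.5, 50.0, 49.5),
  c(10.0, 50.0, 40.0)
)
colnames(props) <- c("A", "B", "C")

# Mean and median of each component across individuals.
comp_mean   <- colMeans(props)
comp_median <- apply(props, 2, median)

# The medians need not add up to 100, so rescale them by a constant factor.
norm_median <- 100 * comp_median / sum(comp_median)

print(round(comp_mean, 2))
print(comp_median)
print(round(norm_median, 2))
```

In this toy case the per-component medians add up to 95.2 rather than 100, which is why the final rescaling step is needed.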

At the same time I am also releasing 'K7b' which is an update of the existing 'eurasia7' calculator and which has been built on exactly the same dataset as 'K12b' but at a lower (K=7) level of detail.

Information on K7b


Information spreadsheet.

Normalized median admixture proportions barplot for all included populations (a high resolution version of this is included in the download bundle):


Table of Fst divergences:

Neighbor-joining tree (based on above):

Information on K12b


Information spreadsheet.


Normalized median admixture proportions barplot for all included populations (a high resolution version of this is included in the download bundle):

Table of Fst divergences:

Neighbor-joining tree (based on above):

Multidimensional Scaling Plots of K12b and K7b


I have created MDS plots using synthetic individuals representing the 12 ancestral components of K12b and the 7 ancestral components of K7b. By including both in the same plot, one gets an idea of the relationship of the components at different resolutions. The first 10 dimensions can be seen below:

Here is a blowup of the main West Eurasian groups from the plot of the first two dimensions:

Some observations:

  • The Atlantic_Med component which is bi-modal in Basques and Sardinians occupies the apex of the figure; this makes sense, since Southwest Europe is quite distant (along land routes) from both Asia and Africa.
  • The Caucasus component is surrounded by most of the others; this is consistent with my theory elaborated in The womb of nations: how West Eurasians came to be.
  • The Atlantic_Baltic component (from K=7) is intermediate between the Atlantic_Med and North_European components.
  • Similarly, the West_Asian component (from K=7) is intermediate between the Caucasus and Gedrosia components; the Gedrosia component diverges in the direction of the Asian groups (not shown in this figure), and in particular of South Asians. This divergence can also be seen in the plot of dimension #3.
  • The Northwest_African component diverges in the direction of Sub-Saharan Africans.

Technical Details


A dataset of 268 populations/3,115 individuals was assembled. A total of 265,519 SNPs are common to the various source datasets as well as the 23andMe v2/v3 and Family Finder platforms. Iterative removal of distant relatives was performed by removing one individual from each pair within a population if that pair had a RATIO of 2.5 or greater, or more than the mean plus two standard deviations, in the IBD analysis performed in PLINK 1.07. A total of 2,675 individuals remained. 4 individuals were removed for low genotyping rate (less than 97%). 264,328 SNPs remained after removal of SNPs with less than 97% genotyping rate or less than 1% minor allele frequency. 166,770 SNPs remained after linkage-disequilibrium-based pruning (--indep-pairwise 200 25 0.4). The final set thus consisted of 2,671 individuals/268 populations/166,770 SNPs. Ancestral populations (components) were inferred using ADMIXTURE 1.21, with K=7 and K=12 and default parameters.
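The relative-removal rule might look roughly like this in R. This is a sketch under assumptions, not the code that was actually used: the `pairs` table and its values are invented, the column names IID1, IID2 and RATIO are borrowed from PLINK's --genome output, the mean and standard deviation are assumed to refer to RATIO within the population, and the "iterative" step is assumed to mean dropping the individual involved in the most flagged pairs until none remain.

```r
# Hypothetical pairwise table for one population; all values are invented.
# Column names follow PLINK's --genome output (IID1, IID2, RATIO).
pairs <- data.frame(
  IID1  = c("a", "a", "a", "b", "b", "c"),
  IID2  = c("b", "c", "d", "c", "d", "d"),
  RATIO = c(2.0, 2.1, 1.9, 3.2, 2.0, 2.1),
  stringsAsFactors = FALSE
)

# Assumed reading of the rule: flag a pair if RATIO >= 2.5, or if RATIO
# exceeds the population mean by more than two standard deviations.
cutoff  <- mean(pairs$RATIO) + 2 * sd(pairs$RATIO)
flagged <- pairs[pairs$RATIO >= 2.5 | pairs$RATIO > cutoff, ]

# Assumed iteration: repeatedly drop the individual who appears in the most
# flagged pairs, until no flagged pair is left.
removed <- character(0)
while (nrow(flagged) > 0) {
  counts  <- table(c(flagged$IID1, flagged$IID2))
  drop_id <- names(counts)[which.max(counts)]
  removed <- c(removed, drop_id)
  flagged <- flagged[flagged$IID1 != drop_id & flagged$IID2 != drop_id, ]
}

print(removed)
```

With these toy values only the pair b/c is flagged, so a single individual from that pair is removed.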

No individuals were removed from the source datasets, except in the case of the Armenians_Y sample, where one individual (ID: armenia3) was dropped because he/she was the same as a Dodecad Project participant.

Downloads


K7b population portraits, spreadsheet, and DIYDodecad files.
K12b population portraits, spreadsheet, and DIYDodecad files.

Dodecad Oracle (K12b edition) can be downloaded from here. Please read the instructions of the previous Oracle on how to use this tool. Note that the number of populations is now 223.

To use either calculator with DIYDodecad, with your 23andMe or Family Finder data, follow the instructions in the README file, but substitute 'K12b' or 'K7b' for 'dv3'.

Project participant results for both K7b and K12b are found in the spreadsheets in the Individual Results tab.

Terms of Use


You are free to use K12b and K7b, including all downloaded files, for any non-commercial purpose, as long as you attribute them to the Dodecad Project and to Dienekes Pontikos as follows:

The [K7b/K12b] admixture calculator is courtesy of Dienekes Pontikos and was developed as part of the Dodecad Ancestry Project; more information here.

Tuesday, January 24, 2012

Submission Opportunity is OVER

Thank you to everyone who submitted their data. I will not accept any more data at this time. A couple of submissions came in at the last second, so I accepted one more than I promised, who got the brand new DPD001 ID.

Those who submitted in time will get their IDs and their results will be posted in the K12a spreadsheet.
Additionally, I will run all participants over world9, so that spreadsheet will also include everybody.

From now on, I will be reworking some of the Project tools to make use of newer samples submitted during this submission opportunity.

If you wish to submit your data during this off period, note that you must contact me at dodecad@gmail.com. Do not send data at this time, unless I indicate that I can accept it! I will let you know if I can process it, and note that I will normally only consider those who matched the eligibility criteria of the most recent submission period.

Monday, January 23, 2012

Open submission for everybody until DOD999

SUBMISSION OPPORTUNITY IS NOW OVER

Everyone on the planet is invited to submit their data, regardless of their ancestry.

All other rules apply, especially the no relatives clause. Additionally, I will accept a single submission from each submitter, so don't submit all your friends. Moreover, regardless of your ancestry, you should let me know the origin of your four grandparents.

There are 35 spots open, so hurry, since last time I had a free-for-all I had to close it down after about 12 hours due to overwhelming demand. I will close project submission after I assign DOD999.

All submissions after I post the end-of-submission announcement on the blog will be ignored. If you post this in any forums or mailing lists, include this post link so that people will know whether the opportunity is over.

Saturday, January 21, 2012

fastIBD analysis of Afroasiatic groups (Jews, Arabs, Assyrians, Berbers, Somalis, Amharas, etc.)

Please refer to the previous analysis on the Balkans/West Asia for more information about the interpretation of this type of analysis.

I am very pleased with the way this analysis of Afroasiatic groups has turned out, revealing an exceptional degree of resolution. I invite individuals from the Near East and Africa who are eligible, to submit their data, so that they can be included in future runs of this kind.

Clusters Galore


45 clusters were inferred with 29 dimensions.


I can't comment on all 45 clusters, so I'll just limit myself to the ones that are significantly represented among Project participants: 1. Ashkenazi, 4. Assyrian/Mandaean, 6. Somali, 7. Moroccan, 8. Algerian/Tunisian, 9. Sephardic, 10. Morocco Jews, 11. Iran/Iraq Jews, 12. Non-Jewish Ethiopians, 13. Saudi, 14. Arab #1, 15. Arab #2, 16. Egyptian

Inter-Population IBD


Results for Project Participants


The results can be found in the spreadsheet.

I have also added the full IBD sharing matrix which lists how many Morgans of sequence are estimated to be IBD with probability greater than 10^-6 between all pairs of individuals.

You can google any non-Project sample IDs to get some more information about their origin. For example, GSM536710 is an Iraqi Jew who shares about half his genome with GSM536714, also an Iraqi Jew. These two samples are almost certainly first-degree relatives. Or, GSM537032, a Samaritan, shares 740-1,480cM with the other 2 Samaritans, an exceptional amount in this small and probably highly inbred population.

You can manipulate this matrix in R. After you download it and unzip it, you can load it into R as follows:

X <- read.table('afroasiatic_ibd_sharing.txt', row.names=1, header=TRUE)

Then, you can, for example, sort the IBD sharing for a particular individual, as follows:

# unlist() turns the one-row data frame into a named vector that sort() can order
sort(unlist(X['DOD026',]))
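Another thing one might do with the matrix is list all pairs that share an unusually large amount, such as the close relatives mentioned above. The sketch below uses a small invented matrix so that it runs on its own; the IDs S1 to S4, the values, and the 5 Morgan cutoff are all arbitrary assumptions. For the real file you would first convert with as.matrix(X), and I have not assumed anything about how its diagonal is filled in.

```r
# Toy symmetric sharing matrix in Morgans; IDs and values are invented.
ids <- c("S1", "S2", "S3", "S4")
X <- matrix(c(  NA, 16.0,  0.3,  0.1,
              16.0,   NA,  0.2,  0.4,
               0.3,  0.2,   NA,  9.5,
               0.1,  0.4,  9.5,   NA),
            nrow = 4, byrow = TRUE, dimnames = list(ids, ids))

# Look only at the upper triangle so that each pair is reported once;
# the 5 Morgan cutoff is an arbitrary choice for this example.
idx <- which(upper.tri(X) & X > 5, arr.ind = TRUE)

close_pairs <- data.frame(
  id1     = rownames(X)[idx[, "row"]],
  id2     = colnames(X)[idx[, "col"]],
  morgans = X[idx]
)

# Show the pairs from most to least sharing.
print(close_pairs[order(-close_pairs$morgans), ])
```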

fastIBD analysis of Central/Eastern Europe

Please refer to the previous analysis on the Balkans/West Asia for more information about the interpretation of this type of analysis.

Clusters Galore


The Clusters Galore can be found in the spreadsheet. After inspection of the 23 clusters inferred with 21 dimensions, they could be described as:

  1. Mordvin
  2. East Slavic
  3. Polish-Ukrainian
  4. East Balkan
  5. Vologda Russians
  6. Lithuanian
  7. Central European (combining many groups with small sample sizes)
  8. A couple of related (?) individuals
  9. Anatolian
  10. Greek
  11. Chuvash
  12. Ossetian
  13. A couple of related individuals
  14. A couple of related individuals
  15. Balkar
  16. A couple of related individuals
  17. Chechen
  18. Kumyk
  19. A couple of related individuals
  20. Adygei
  21. Lezgin #1 (main)
  22. Lezgin #2
  23. Lezgin #3
If you belong to a population with few other participants, you might end up latching onto a cluster dominated by a bigger group. This does not mean that your population is not distinctive, only that there are not enough samples to reveal its distinctiveness if it exists.

Inter-Population IBD


Results for Dodecad Participants

Results can be found in the spreadsheet.

If you have joined the Project, please consider leaving a comment in the Information about Project samples thread. That will help others make better sense of their results, e.g., if you find that you belong in the same cluster with some other individual, you might want to know something about their origins.

UPDATE: I have added the IBD sharing matrix. See here on how to use it.

Thursday, January 19, 2012

fastIBD analysis of South Asia

Please refer to the previous analysis on the Balkans/West Asia for more information about the interpretation of this type of analysis.

Clusters Galore


The Clusters Galore analysis can be found in the spreadsheet. 59 clusters were inferred with 47 MDS dimensions. The very fine-scale structure (I only considered the first 50 dimensions, but many more seemed significant than in any previous experiment) is probably the result of the size of the South Asian population, as well as the practice of endogamy associated with the caste system. High intra-population IBD sharing is also evident in the following (notice how well-defined the diagonal is):

Inter-Population IBD




Results for Dodecad participants

They can be found in the spreadsheet. Many Project participants belong to a population with 1 or 2 individuals, so cluster #1 seems to be a generalized catch-all for many such individuals. Individuals from the two sub-populations that I've identified recently, Iyer_D and Jatt_D, all belong to the same cluster. The Iyer_D cluster (#4) also seems to include the Iyengar project participants, as might be expected.

It is also interesting how all Dodecad participants fall in just 7 of the 59 clusters. This goes to show how truly diverse people from the Indian subcontinent are. I fully expect that with more participation further structure will be revealed, since it seems that due to endogamy it only takes a few participants from each ethnic group for a specific cluster pertaining to that group to be identified. So, I invite people from South Asia to join the Project during this submission opportunity.

Tuesday, January 17, 2012

fastIBD analysis of Iberia, France, Italy, Balkans, Anatolia and European Jews

On the heels of the previous analysis of Balkans/West Asia, here is a new experiment on a different set of populations. Please refer to the earlier post for some thoughts/explanations about this type of analysis; I'll stick to "just the data" for this post.

Clusters Galore




24 clusters inferred with 17 MDS dimensions.

The Galore analysis provides increased resolution within Iberia (#6-9, 11), Italy, and the Ashkenazi Jewish group (#14-16).

The Iberian results are particularly interesting, showing the power of this approach compared to the one with unlinked data. There appear to be:

  • a Spanish Basque cluster (#6),
  • a French Basque cluster (#11),
  • a Portuguese/Galician/Castilla Y Leon cluster (#9),
  • a complementary Castilla La Mancha/Cantabria/Andalucia/Murcia cluster (#7), and
  • a smaller Aragon/Cataluna cluster (#8).
There is overlap between these clusters, but the geographical contrasts are quite evident. I did not go through the results of Spanish Project participants (all the Portuguese fall in the Galician cluster, and our Basque member in the Basque cluster as expected), so it would be interesting to hear whether they fall in the cluster(s) which exist in their regions of origin.

Inter-Population IBD




Results for Project Participants


The results can be found in the spreadsheet.

Saturday, January 14, 2012

fastIBD analysis of Balkans/West Asia

Now that I've discovered a way to boost Clusters Galore analysis even further by using fastIBD, I will start experimenting with different regional populations. This analysis took about 5 hours to complete, so it appears to be quite practical.

For my first experiment, I carry out an analysis of various populations from the Balkans and West Asia.

Clusters Galore

27 different clusters were inferred with 17 MDS dimensions. Some interesting findings:
  • For the first time there emerge a couple of clusters that appear to be quite specific to Armenians (#2 and #3). 
  • Similarly, Assyrians are broken into a few clusters that appear fairly specific to them (#9-11)
  • Georgians are split into three clusters, one of which (#14) is linked with the neighboring Abkhasians, who in turn have their own exclusive cluster (#25)
  • The cluster modal in Greeks (#6) includes 14 of 19 Greek participants, and a few Greeks are also in the Balkan cluster (#8) and an Iranian-Turkish cluster (#4)
  • The Behar Cypriot sample also splits into two, and the few Turkish Cypriot participants link to one of them (#13)
  • The Ossetian project participant links to one of the three North_Ossetian clusters
  • The major Balkan cluster (#8) still defies resolution. I am certain, however, that structure in this cluster will be uncovered with more participation. MCLUST adapts the cluster size and shape, and a "big", inclusive cluster spanning the Balkans appears more parsimonious than smaller clusters centered on the different groups. With larger participation, I anticipate that regional structure will be uncovered in the Balkans as well.
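For readers who want to experiment, the MDS step of this kind of analysis can be sketched with base R's cmdscale(). This is only a rough illustration under assumptions: the sharing matrix `S` and its values are invented, and turning sharing into a distance by subtracting it from the maximum is my choice for the example, not necessarily the transformation used here. The subsequent clustering of the MDS coordinates with MCLUST is not shown.

```r
# Toy symmetric IBD-sharing matrix for six individuals in two groups;
# all values are invented.
ids <- c("g1_a", "g1_b", "g1_c", "g2_a", "g2_b", "g2_c")
S <- matrix(0.2, nrow = 6, ncol = 6, dimnames = list(ids, ids))
S[1:3, 1:3] <- 2.0            # high sharing within group 1
S[4:6, 4:6] <- 1.5            # high sharing within group 2
S[1, 2] <- S[2, 1] <- 2.4     # one especially close pair
S[4, 6] <- S[6, 4] <- 1.1     # one looser pair
diag(S) <- 3.0                # placeholder self-sharing, the largest value

# Assumed conversion: more sharing means a smaller distance.
D <- max(S) - S

# Classical (metric) MDS; k is the number of dimensions to keep.
coords <- cmdscale(as.dist(D), k = 2)
print(round(coords, 3))
```

In this toy example the first dimension separates the two groups, which is the kind of structure a clustering method would then pick up.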
I cannot stress the importance of participation strongly enough. When groups have more participants, it is possible to both:

  1. Discover group-specific clusters, by identifying what is common between members of groups
  2. Discover within-group clusters, by identifying what is different between members of groups
For example, the great participation of Armenians in the Project has now allowed me to discover structure within the Armenian population. It appears that cluster #2 corresponds to a more "western" Armenian group, and #3 to a more "eastern" one, with some overlap between the two.

Inter-population IBD


You can also see a visual representation of inter-population IBD:

I have only included populations with 5+ participants in this representation. Reddish shades express high IBD sharing; bluish ones low sharing. The heatmap has been scaled by row.

As you might expect, values along the diagonal are "reddish", since individuals within populations tend to have high IBD sharing with each other.

A few features "pop out" of the screen. Going from top to bottom:
  • Intra-Iranic sharing
  • Intra-Armenian sharing
  • Intra-Balkan sharing
  • Georgian-Abkhaz sharing
You can probably get more out of the figure, but these appear to be the most salient features.

Results for Project Participants


The results can be found in the spreadsheet, and include:
  • Probabilities of assignment in each of the 27 clusters of the Clusters Galore analysis
  • Z-scores of IBD between each individual and each of the 20 populations with 5+ participants. Higher values mean more IBD sharing. Note that Z-scores have been calculated for each row, hence each participant must scan his own row to find populations with an excess (+) or deficiency (-) of IBD sharing, and people should not compare across different rows.
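To see why rows should not be compared with each other, here is a small R sketch of row-wise Z-scores. The participants P1 and P2, the population columns, and all values are invented, and I am assuming the usual definition (subtract the row mean, divide by the row's sample standard deviation), which may not match the spreadsheet exactly.

```r
# Toy matrix: rows are participants, columns are populations; the values
# (IBD sharing with each population) are invented.
ibd <- rbind(
  P1 = c(Greek = 0.9, Armenian = 0.3, Georgian = 0.3),
  P2 = c(Greek = 2.0, Armenian = 4.0, Georgian = 3.0)
)

# Standardize each row separately: subtract the row mean, divide by the row SD.
row_z <- t(apply(ibd, 1, function(r) (r - mean(r)) / sd(r)))
print(round(row_z, 2))
```

Here P2 has more raw sharing with the Greek column than P1 does (2.0 versus 0.9), yet P2's Greek Z-score is negative and P1's is positive, because each row is measured against its own average.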
Last but not least, I want to remind new project participants to leave a message in the Information about Project samples thread. Your comment will not appear immediately, since comment moderation is on, and also note that there are multiple pages of comments. 


If you haven't joined the Project yet, I encourage you to do so if you are eligible.

Wednesday, January 11, 2012

Clusters Galore (fastIBD edition) for some northern European participants

You can find some new Clusters Galore results here (scroll down for spreadsheet link). The new methodology described in that post has made it possible to infer even finer-level population structure than "classical" Clusters Galore.

Tuesday, January 3, 2012

Submission opportunity (January 2012)

Who is eligible


Anyone who:

  • has 23andMe or Family Finder autosomal data,
  • is not related to any other Project participants,
  • has 4 grandparents from the same African, European, or Asian ethnic group or country (e.g., 4 Albanian grandparents, 4 grandparents born in Ethiopia, 4 Kazakh grandparents, etc.)
Any ineligible submissions will be blacklisted. Do not send data if you do not meet the eligibility criteria.

What to send


Send your compressed autosomal data (ending .zip or .gz) that you can download from your testing company.
Send to dodecad@gmail.com as an attachment, and include in your e-mail as much information about your ancestry as you can (e.g., birthplace of grandparents, spoken languages, practiced religions, ethnic affiliation, etc.). Samples without adequate ancestral information will be ignored.

Data Privacy Statement


Your raw data or genealogical information will not be shared or distributed in any manner, and it will not be analyzed for any other purpose than assessment of ancestry (i.e., not for any physical or health-related traits). It will be identified by a unique ID, known to you and me, and results will be posted in the blog using that ID. I will continue to analyze your data for ancestry, and new results will be posted using that same ID. Also, I will report aggregate results for populations with at least 5 participants.

What you will receive


I will add you to the K12a spreadsheet of the K12a calculator. You will also be eligible to participate in future data analyses, and newer results will be posted in this blog with your ID.

Clarification (added on 6 Jan, 2012): The results which you will receive will be based on the K12a calculator whose components were inferred in December 2011, and hence included only those who had submitted their data up to that time. As new members of the Project, your data will be used for the development of the next version of the admixture analysis, and this will -in all likelihood- lead to a subtle redrawing of the ancestral components and different ancestral proportions (see technical note).

By participating in the Project, you help better draw both the basic ancestral components underlying genomic variation in Africa/Europe/Asia, and create more robust samples of different populations. This is helpful both to the Project, and to yourself, because it helps you get increasingly better results with newer versions of the analysis. All newer analysis tools are announced on this blog.

End of Submission Opportunity


The end of this submission opportunity will be announced on this blog.