Friday, April 29, 2011

K=12 ADMIXTURE results for selected participants

In comparison to the previous K=11 run:
  • I have included some more general populations, e.g., Italian_D and Scandinavian_D, so that some people who did not fit in the previous populations, e.g., Danes or North Italians, could be included
  • The number of individuals is 1,010 now, the number of populations 65
  • I have upped the number of markers to ~173k after linkage-disequilibrium pruning
  • I am continuing to play around with ways to frame West Eurasia, so now I have included Pakistan_H, North Kannadi and Sakilli for South Asia, She, Miaozu, Chinese, Yakut, and Selkup for North/Eastern Eurasia, and Yoruba, Maasai, Bantu, Ethiopians, and Ethiopian Jews for Africa
The previous analysis had stopped at K=11 because a couple of Iraqi Jews formed a spurious "family" cluster. On inspection, these were a couple of likely first-degree relatives, so I removed one of them to proceed.
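
The linkage-disequilibrium pruning mentioned in the list above removes markers that are strongly correlated with markers already kept. The post does not say which tool, window, or threshold was used, so the following is only a minimal Python sketch of one greedy approach. The function names, the window of 50 kept markers, and the r² cutoff of 0.4 are all assumptions made for illustration.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 allele counts)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker has no measurable correlation
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(genotypes, window=50, threshold=0.4):
    """Greedy pruning over markers in genomic order.

    A marker is kept only if its r^2 with each of the last `window`
    kept markers is below `threshold` (both values are hypothetical).
    Returns the indices of the kept markers.
    """
    kept = []
    for i, g in enumerate(genotypes):
        recent = kept[-window:]
        if all(r_squared(g, genotypes[j]) < threshold for j in recent):
            kept.append(i)
    return kept


# Four toy markers typed in four individuals.
markers = [[0, 1, 2, 1], [0, 1, 2, 1], [2, 1, 0, 1], [0, 2, 0, 2]]
print(ld_prune(markers))
```

In this toy input the second and third markers are perfectly correlated (positively and negatively) with the first, so only the first and fourth survive.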

The clusters of the previous run were more or less recreated, but please check the table of Fst distances to see how the different names are related to each other. The new addition at K=12 is the split of East Eurasians into East Asian and North Eurasian. The latter is centered on the Uralic Selkup and the Altaic Yakut.

I would say that this is a substantial improvement over the standard K=10 analysis of the Project, as:
  • Two main components (North and South European) have been replaced by four new ones (NW/NE European, Sardinian, Basque) that have interesting distributions.
  • The five "framing" components (Sub-Saharan, E African, S Asian, N Eurasian, E Asian) correspond largely to the pre-existing ones, but with more diverse framing populations to make them a little better defined.
The search continues...

Admixture proportions and individual results can be found here. Population portraits are available here or here.

PS: As I've noticed before, at this level of resolution "noise" becomes a real problem, as evidenced by the emergence of a few tenths of a percent of components where one might not expect them.

Wednesday, April 27, 2011

Results for DOD604 to DOD612 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

Current submission opportunity is over. Please subscribe to the feed to be alerted of new ones.

All populations:



Individual bars:

Saturday, April 23, 2011

K=11 ADMIXTURE results for selected participants

In Dienekes' Anthropology Blog, I have created the most comprehensive analysis of genetic structure in West Eurasians so far.

Using a new technique to frame the region of interest (West Eurasia and North Africa) by meta-population controls representing Sub-Saharan Africans, East Eurasians, and South Asians, I was able to extract several robust components that track admixture in unprecedented detail.

In comparison to previous analyses, the North European component has been split into Northwest and Northeast European, while the South European one has been split into Sardinian and Basque. Additionally, the inclusion of the above-mentioned controls, as well as a great variety of West Eurasian populations, has probably improved the accuracy of the results for all the other components substantially.

You can read the details, and obtain detailed data downloads from this study in my other blog. But, it's good to remember that it is only possible for me to do things like this because of the project volunteers.

Because of the anthropological nature of the study, only individuals who belonged to particular populations could participate, and they can all see their individual results here. So, check the figure on the left to see if your Dodecad population is included, and then look at the spreadsheet for your individual results.

Saturday, April 16, 2011

Your nearest IBS neighbors (up to DOD603)

I have calculated your nearest identity-by-state (IBS) neighbors based on the same set of ~146K markers used for the standard K=10 analysis.

As explained here, it is not always the case that your nearest neighbors will belong to the same ethnic group as you. For closely related groups in the global context (e.g., Europeans), it's quite possible for a member of a different group to be more similar to you than a member of your own.

I am distributing the data as an R object. You must first install R, and then you can open this object by double-clicking on it (in Windows), or by using the File->Load Workspace menu within R. Then, you simply enter the following command at the prompt:
closest("DBV001")

Replace "DBV001" with your own project ID. If that ID is not included in the data, or you mistyped it, you will get an error message:
closest("qwerty")
[1] "This ID is not included"

Otherwise, you will get your results:
closest("DBV001")
[1] "Your nearest neighbor is 0.05 standard deviations more distant to you than for the average project participant"
RANK ID IBS
V3 "1" "DBV001" "1"
V133 "2" "DOD151" "0.749907"
V313 "3" "DOD344" "0.749559"
V943 "4" "Ashkenazy_Jews" "0.749298"
V50 "5" "DOD051" "0.74926"
V935 "6" "Ashkenazy_Jews" "0.749082"
V944 "7" "Ashkenazy_Jews" "0.749018"
V243 "8" "DOD272" "0.748985"
V25 "9" "DOD022" "0.748982"
V157 "10" "DOD179" "0.748904"
V942 "11" "Ashkenazy_Jews" "0.748822"
V939 "12" "Ashkenazy_Jews" "0.748821"
V954 "13" "Ashkenazy_Jews" "0.748767"
V251 "14" "DOD280" "0.748607"
V936 "15" "Ashkenazy_Jews" "0.748582"
V940 "16" "Ashkenazy_Jews" "0.748529"
V950 "17" "Ashkenazy_Jews" "0.748515"
V949 "18" "Ashkenazy_Jews" "0.748398"
V201 "19" "DOD228" "0.748382"
V308 "20" "DOD338" "0.748344"

By default, this produces the first 20 closest IBS matches. You can change this behavior by entering:
closest("DBV001", k=50)

Notice that the first line of the output, [1] "Your nearest neighbor is 0.05 standard deviations more distant to you than for the average project participant", gives you an idea of how close your nearest neighbor is to you compared to other Project members.

For people of well-represented groups, their nearest neighbor is likely to be closer to them than average.
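
The distributed R object is not reproduced here, so the following Python sketch only illustrates what a function like closest() and the "standard deviations" sentence could be computing. It assumes a symmetric IBS matrix with 1 on the diagonal, and it assumes the statistic compares each person's best non-self IBS value against the mean and standard deviation of those values over everyone in the matrix. The names nearest_neighbor_z, ibs, and ids are hypothetical.

```python
from statistics import mean, pstdev


def closest(ibs, ids, query, k=20):
    """Return up to k (rank, id, IBS) rows, most similar first.

    The query individual ranks first with IBS 1, as in the output above.
    """
    if query not in ids:
        return "This ID is not included"
    q = ids.index(query)
    # Sort the query itself to the front, then everyone else by decreasing IBS.
    order = sorted(range(len(ids)), key=lambda j: (j != q, -ibs[q][j]))
    return [(rank, ids[j], ibs[q][j]) for rank, j in enumerate(order[:k], start=1)]


def nearest_neighbor_z(ibs, ids, query):
    """Standard deviations by which the query's nearest neighbor is more
    distant (lower IBS) than the average individual's nearest neighbor."""
    best = [max(v for j, v in enumerate(row) if j != i) for i, row in enumerate(ibs)]
    q = ids.index(query)
    return (mean(best) - best[q]) / pstdev(best)


# Toy data: four individuals with made-up IBS values.
ids = ["A", "B", "C", "D"]
ibs = [
    [1.0, 0.75, 0.74, 0.73],
    [0.75, 1.0, 0.72, 0.71],
    [0.74, 0.72, 1.0, 0.745],
    [0.73, 0.71, 0.745, 1.0],
]
print(closest(ibs, ids, "A", k=3))
print(nearest_neighbor_z(ibs, ids, "C"))
```

Under these assumptions a positive value means the nearest neighbor is more distant than average, and a negative value means it is closer, which is what one would expect for members of well-represented groups.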

I have also included the 692 reference individuals from the standard K=10 analysis set, so your list of closest neighbors will include both DOD-labeled project participants and reference individuals.

Thursday, April 14, 2011

IBS similarity matrix and Population Concordance Ratio for Dodecad populations

In Dienekes' Anthropology Blog, I presented a new method of comparing populations, the population concordance ratio. You can refer to that post for the rationale, definitions, and code, but for the present, I will just say that this ratio estimates the probability that two random individuals from a population A are more similar to each other than either of them is to a random individual from another population B. Its expected value ranges from 0.25 (two very similar populations) to 1 (two very dissimilar populations).

Another common way of comparing populations is by computing an identity-by-state (IBS) similarity matrix. Comparing the genomes of two individuals across many loci, you can get a number (IBS) ranging between 0 and 1: in humans 0 is almost never encountered, as two random individuals may share some alleles in common by pure chance, while 1 indicates either monozygotic twins or a clerical error.
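
As a rough illustration of how such a number can be obtained, here is a minimal Python sketch of one common IBS definition: at each locus two individuals share 0, 1, or 2 alleles identical by state, and the similarity is the average proportion shared. The post does not spell out the exact formula used, so this definition, the 0/1/2 genotype coding, and the use of None for missing calls are assumptions.

```python
def ibs_similarity(g1, g2):
    """Proportion of alleles shared identical-by-state between two individuals.

    g1 and g2 are lists of genotypes coded as counts of one allele (0, 1, 2),
    with None for a missing call. Loci missing in either individual are skipped.
    """
    shared = 0
    compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        shared += 2 - abs(a - b)  # 2, 1, or 0 alleles shared at this locus
        compared += 1
    if compared == 0:
        raise ValueError("no loci genotyped in both individuals")
    return shared / (2 * compared)


print(ibs_similarity([0, 1, 2, 2], [0, 1, 2, 2]))     # identical genotypes
print(ibs_similarity([0, 1, 2, None], [2, 1, 1, 0]))  # one locus skipped
```

Identical genotype vectors give exactly 1, which is why a value of 1 in real data points to twins or a duplicated sample.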

I have computed these two statistics over populations of the Dodecad Project with at least 5 individuals. The analysis is based on 282,409 SNPs with a 99%+ genotyping rate over the combined sample.
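
The genotyping-rate filter described above can be sketched as follows. This is a simplified Python illustration, not the pipeline actually used: the SNP identifiers, the dictionary layout, and the function name filter_by_call_rate are hypothetical, and missing calls are again represented by None.

```python
def filter_by_call_rate(genotypes_by_snp, min_rate=0.99):
    """Return the SNP ids whose fraction of non-missing calls is at least min_rate.

    genotypes_by_snp maps a SNP id to a list of genotype calls, one per
    individual in the combined sample.
    """
    kept = []
    for snp_id, calls in genotypes_by_snp.items():
        called = sum(1 for c in calls if c is not None)
        if calls and called / len(calls) >= min_rate:
            kept.append(snp_id)
    return kept


# Toy input: the second SNP is missing in one of four individuals (75% call rate).
toy = {"snp_1": [0, 1, 2, 1], "snp_2": [0, None, 2, 1]}
print(filter_by_call_rate(toy))
print(filter_by_call_rate(toy, min_rate=0.75))
```

Applying the filter over the combined sample, rather than per population, keeps the same marker set for every pairwise comparison.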

The results can be found in this spreadsheet.

[NOTE: I have taken down the spreadsheet on Apr 15, in order to investigate a possible error in the Brazilian_D sample]

[NOTE II: The results seem to be correct, so the spreadsheet is back up]

For the population concordance ratio, each row represents an estimate of the probability that two individuals from that population are more similar to each other than either of them is to a member of the population in each column; this is an asymmetric matrix.
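
The rationale, definitions, and original code are in the post referred to above. Purely as an illustration of the verbal definition given here, the following Python sketch counts, over every pair of individuals from population A and every individual from population B, how often the within-A pair is the more similar one. The function name, the index-based population lists, and the toy similarity values are assumptions, and the estimator in the original post may differ in detail.

```python
from itertools import combinations


def concordance_ratio(sim, pop_a, pop_b):
    """Fraction of (a1, a2, b) triples in which a1 and a2 are more similar
    to each other than either of them is to b.

    sim is a symmetric similarity matrix (e.g. IBS); pop_a and pop_b are
    lists of row indices for the two populations.
    """
    if len(pop_a) < 2 or not pop_b:
        raise ValueError("need at least two individuals in A and one in B")
    hits = 0
    total = 0
    for a1, a2 in combinations(pop_a, 2):
        for b in pop_b:
            total += 1
            if sim[a1][a2] > sim[a1][b] and sim[a1][a2] > sim[a2][b]:
                hits += 1
    return hits / total


# Toy matrix: individuals 0-2 belong to population A, 3-4 to population B.
sim = [
    [1.0, 0.75, 0.74, 0.72, 0.745],
    [0.75, 1.0, 0.73, 0.71, 0.72],
    [0.74, 0.73, 1.0, 0.735, 0.70],
    [0.72, 0.71, 0.735, 1.0, 0.76],
    [0.745, 0.72, 0.70, 0.76, 1.0],
]
print(concordance_ratio(sim, [0, 1, 2], [3, 4]))  # row A, column B
print(concordance_ratio(sim, [3, 4], [0, 1, 2]))  # row B, column A
```

The two calls give different values for the same pair of populations, which is the asymmetry of the matrix: a tight population can be highly concordant against a diffuse one without the reverse being true.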

Below are some visualizations of these statistics for the Greek_D sample.

First, the IBS similarity matrix. These ranged between 0.70383 and 0.73689, so I have subtracted 0.7 in order to bring out the scale of the differences.

Second, the population concordance ratio:


Sunday, April 10, 2011

Results for DOD596 to DOD603 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

Current submission opportunity is over. Please subscribe to the feed to be alerted of new ones.

All populations:
Individual bars:

Friday, April 8, 2011

Structure in West Asian Indo-European groups (part 2)

I will occasionally revisit old posts such as Structure in West Asian Indo-European groups to take advantage of new population samples from project submitters. This time around, I included our first Kurdish and Azeri participants, and limited the analysis to populations for which I had large numbers of markers (the final total is a ~132k pruned set of markers).

I also included Greeks, Caucasian populations (Georgians and Lezgins), and Levantine Arabs (Syrians-Lebanese) who frame this region from the West, North, and South respectively, as well as Assyrians who are interspersed in West Asia as an ethno-religious minority.

Here are dimensions 1 and 2 of the multidimensional scaling plot:

The two Caucasian groups (the South Caucasian Georgians and the Northeast Caucasian Lezgins) form their own clusters. So do the Iranians and the Syro-Lebanese.

(As always, population labels are placed on population averages, and _D denotes Dodecad Project populations)

Curiously, many linguists assert a close relationship of Greek and Armenian within the Indo-European language family. Turks speak an Altaic language due to migration of a numerically small population element, but their pre-Turkish genetic ancestors were Anatolian speakers, Greeks, Armenians, and Iranians, i.e., primarily Indo-Europeans.

Dimensions 1 and 3:

Dimension 3 contrasts Greeks with West Asian groups. Notice also the presence of 3 Armenians at the bottom of the plot; these are outliers of the Behar et al. Armenian sample.

Notice that the Azeri_D sample clusters with Iranians in dimension 2 and with Turks in dimension 3. This is not very surprising, as Azeris speak a Turkic language, but also have clear Iranian antecedents. The Kurd_D sample, on the other hand, clusters consistently with Iranians.

The variability of the Greek_D population sample along dimension 3 is also quite interesting. This could reflect variable levels of influence of extra-Greek European/Anatolian population elements on the basic Greek stock. Greeks who possess 23andMe or FTDNA Illumina population data are strongly encouraged to join the Project to help us better determine regional variation within the ethnic Greek population.

The Clusters Galore analysis results are as follows (9 clusters/3 MDS dimensions retained):

In brief, the modal populations in each cluster are:
  1. Greeks
  2. Iranians
  3. Turks
  4. Syrians and Lebanese
  5. Armenians and Georgians
  6. Armenians and Assyrians
  7. Syrians and Iranians
  8. Lezgins
  9. Georgians
I will be happy to provide all Dodecad Project members from the _D populations with their individual results. If you send me an e-mail at dodecad@gmail.com, I will send you a line with your probabilities of assignment to each of the 9 clusters, as well as your co-ordinates in the first 3 MDS dimensions plotted above.

I strongly encourage individuals from West Asia, the Balkans, and Italy to contact me for inclusion in the Project (send e-mail first, not data). While submission to the Project is currently closed, I usually accept data from these regions.

Wednesday, April 6, 2011

Results for DOD581 to DOD595 posted

Admixture proportions can be found in the spreadsheet.

Don't forget to leave a message in the ancestry thread to help yourselves and others make better sense of these results.

Current submission opportunity is over. Please subscribe to the feed to be alerted of new ones.

All populations:

Individual bars: