Similarity measures for clustering SNP and epidemiological data

Selinski, Silvia

dc.contributor.author	Selinski, Silvia
dc.date.accessioned	2006-05-04T09:54:51Z
dc.date.available	2006-05-04T09:54:51Z
dc.date.issued	2006-05-04T09:54:51Z
dc.description.abstract	The issue of suitable similarity measures for a joint consideration of so-called SNP data and epidemiological variables arises from the GENICA (Interdisciplinary Study Group on Gene Environment Interaction and Breast Cancer in Germany) case-control study of sporadic breast cancer. The GENICA study aims to investigate the influence and interaction of single nucleotide polymorphic (SNP) loci and exogenous risk factors. A single nucleotide polymorphism is a point mutation that is present in at least 1 % of a population. SNPs are the most common form of human genetic variation. In particular, we consider 43 SNP loci in genes involved in the metabolism of hormones, xenobiotics and drugs as well as in the repair of DNA. Assuming that these single nucleotide changes may lead, for instance, to altered enzymes or to a reduced or enhanced amount of the original enzymes – with each alteration alone having minor effects – the aim is to detect combinations of SNPs that under certain environmental conditions increase the risk of sporadic breast cancer. The search for patterns in the present data set may be performed by a variety of clustering and classification approaches. I consider here the problem of suitable measures of proximity of two variables or subjects as an indispensable basis for a further cluster analysis. In the present data situation these measures have to be able to handle different numbers and meanings of categories of nominally scaled data as well as data of different scales. Generally, clustering approaches are a useful tool to detect structures and to generate hypotheses about potential relationships in complex data situations. When searching for patterns in the data, there are two possible objectives: the identification of groups of similar objects or subjects, or the identification of groups of similar variables within the whole population or within subpopulations. The different objectives imply different requirements on the measures of similarity.
For comparing individual genetic profiles as well as genetic information across subpopulations, I discuss possible choices of similarity measures suitable for genetic and epidemiological data, in particular, measures based on the χ²-statistic, Flexible Matching Coefficients and combinations of similarity measures.	en
dc.format.extent	520598 bytes
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/2003/22399
dc.identifier.uri	http://dx.doi.org/10.17877/DE290R-15898
dc.language.iso	en
dc.subject	Cluster analysis	en
dc.subject	Flexible Matching Coefficient	en
dc.subject	GENICA	en
dc.subject	Mixed similarity coefficient	en
dc.subject	Pearson's Corrected Coefficient of Contingency	en
dc.subject	Similarity	en
dc.subject	Single nucleotide polymorphism (SNP)	en
dc.subject	Sporadic breast	en
dc.subject.ddc	004
dc.title	Similarity measures for clustering SNP and epidemiological data	en
dc.type	Text	de
dc.type.publicationtype	report	en
dcterms.accessRights	open access
eldorado.dnb.deposit	true

Files

Original bundle

Name: tr25-06.pdf
Size: 508.4 KB
Format: Adobe Portable Document Format
Description: DNB

Collections

Sonderforschungsbereich (SFB) 475