D-optimal plans for variable selection in data bases

Schiffner, Julia; Weihs, Claus

dc.contributor.author	Schiffner, Julia
dc.contributor.author	Weihs, Claus
dc.date.accessioned	2009-08-05T10:05:44Z
dc.date.available	2009-08-05T10:05:44Z
dc.date.issued	2009-08-05T10:05:44Z
dc.description.abstract	This paper is based on an article by Pumplün et al. (2005a) that investigates the use of Design of Experiments in data bases in order to select variables that are relevant for classification in situations where a sufficient number of measurements of the explanatory variables is available, but measuring the class label is hard, e.g. expensive or time-consuming. Pumplün et al. searched for D-optimal designs in existing data sets by means of a genetic algorithm and assessed variable importance based on the plans found. If the design matrix is standardized, these D-optimal plans are almost orthogonal and the explanatory variables are nearly uncorrelated. Pumplün et al. therefore expected that the importance of the variables for discrimination can be judged independently of each other. In a simulation study, Pumplün et al. applied this approach in combination with five classification methods to eight data sets and compared the resulting error rates with those obtained from variable selection on the basis of the complete data sets. In some cases, considerably lower error rates were achieved based on the D-optimal plans. Although Pumplün et al. (2005a) obtained some promising results, for several reasons it remained unclear whether D-optimality is actually beneficial for variable selection. For example, D-efficiency and orthogonality of the resulting plans were not investigated, and a comparison with variable selection based on random samples of observations of the same size as the D-optimal plans was missing. In this paper we extend the simulation study of Pumplün et al. (2005a) in order to verify their results and to provide a basis for further research in this field. Moreover, in Pumplün et al. D-optimal plans are only used for data preprocessing, that is, variable selection. The classification models are estimated on the whole data set in order to assess the effects of D-optimality on variable selection separately. Since the number of measurements of the class label is in fact limited, one would normally also use the observations that were employed for variable selection for learning. For this reason, our simulation study additionally investigates the appropriateness of D-optimal plans for training classification methods. It turned out that, in terms of the error rate, there is in general no difference between variable selection on the basis of D-optimal plans and variable selection on random samples. However, for training linear classification methods D-optimal plans seem to be beneficial.	en
dc.identifier.uri	http://hdl.handle.net/2003/26363
dc.identifier.uri	http://dx.doi.org/10.17877/DE290R-8705
dc.language.iso	en	de
dc.subject	design	en
dc.subject	experiment	en
dc.subject.ddc	004
dc.title	D-optimal plans for variable selection in data bases	en
dc.type	Text	de
dc.type.publicationtype	report	en
dcterms.accessRights	open access
eldorado.dnb.deposit	true
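
The abstract describes searching an existing data set for a subset of observations that is D-optimal and comparing it with a random sample of the same size. A minimal sketch of that comparison follows. It is not the authors' procedure: a simple random exchange search stands in for the genetic algorithm, the data are synthetic, the columns are standardized on the full data set (an assumption), and the function names `log_det_information` and `exchange_search` are hypothetical. The D-criterion is taken here as log det(X_S' X_S) for the selected rows X_S.

```python
import numpy as np

def log_det_information(X, rows):
    """log det(X_S' X_S) for the sub-design formed by the given row indices."""
    Xs = X[rows]
    sign, logdet = np.linalg.slogdet(Xs.T @ Xs)
    return logdet if sign > 0 else -np.inf

def exchange_search(X, n_select, rng, n_iter=2000):
    """Random exchange search (a stand-in for the genetic algorithm):
    swap one selected row for one unselected row and keep the swap
    only if the D-criterion improves."""
    n = X.shape[0]
    selected = rng.choice(n, size=n_select, replace=False)
    start = best = log_det_information(X, selected)
    for _ in range(n_iter):
        pos = rng.integers(n_select)      # position in the plan to replace
        candidate = rng.integers(n)       # row proposed as replacement
        if candidate in selected:
            continue
        trial = selected.copy()
        trial[pos] = candidate
        value = log_det_information(X, trial)
        if value > best:
            selected, best = trial, value
    return selected, start, best

rng = np.random.default_rng(0)
# Synthetic, correlated explanatory variables (illustrative data only).
cov = np.full((4, 4), 0.7) + 0.3 * np.eye(4)
data = rng.multivariate_normal(np.zeros(4), cov, size=300)
X = (data - data.mean(axis=0)) / data.std(axis=0)  # standardize columns

plan, logdet_random, logdet_plan = exchange_search(X, n_select=20, rng=rng)
print(f"random sample: {logdet_random:.2f}, after search: {logdet_plan:.2f}")
```

The search starts from a random sample, so the two printed values give the random-sample baseline and the criterion of the improved plan on the same scale. Whether a higher D-criterion translates into better variable selection is the question the report examines; the sketch only shows how such plans can be scored.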

Files

Original bundle

Name: tr14-09.pdf
Size: 612.18 KB
Format: Adobe Portable Document Format
Description: DNB

Collections

Sonderforschungsbereich (SFB) 475