Sekundarni povzetek: |
High-throughput DNA microarray technology is nowadays available in any modern biomedical laboratory. Despite the sophistication of the technology, state-of-the-art statistical analysis of microarray data remains a great challenge. A microarray dataset can be described by a matrix with n rows and p columns, where the former correspond to individual samples and the latter to particular genes. It is assumed that n ≪ p. Based on a topological analysis of the geometrical properties of high-dimensional data objects, we can show that in this case the data space is very sparse. This empty-space phenomenon can be effectively managed using various dimensionality reduction techniques. The available evidence indicates that no systematic evaluation of the behavior of different dimensionality reduction methods on microarray data has yet been performed. Moreover, the question of the usefulness of discretizing microarray data remains unanswered. In this thesis, we addressed three different problem tasks. In the first set of experiments, we systematically studied the performance of various classifiers in a standard classification task with two predefined classes. We used a set of state-of-the-art classifiers, including neural networks, nearest neighbors, classification trees with random forests, support vector machines, penalized logistic regression, and three variants of linear discriminant analysis (Fisher, classical and diagonal). In the second experiment, we analyzed the effect of dimensionality reduction on classification performance; in particular, we examined principal component analysis and partial least squares. In the third experiment, we studied the effect of data discretization on classification performance. The analysis included some of the most commonly used discretization algorithms, including equal-width and equal-frequency discretization, 1R, MDLP, and ChiMerge.
Experiments were carried out on a set of 37 real DNA microarray datasets. The effect of the classification method and the variable selection procedure was also evaluated on synthetic data. Learning parameters and performance measures were evaluated using a cross-validation scheme. Classification results were reported using standard performance measures, including classification accuracy, sensitivity, specificity, and area under the ROC curve. Penalized logistic regression showed the best classification performance on the real datasets, and support vector machines on the synthetic data. Neural networks performed worst in both settings. Principal component analysis and partial least squares did not differ statistically significantly in classification performance (with the exception of the area under the ROC curve). Among the discretization methods, the best classification performance was achieved using the MDLP and ChiMerge algorithms. To the best of our knowledge, and according to the available empirical evidence, this is the first study on such a large number of microarray datasets. |