New methods for successive optimization of the discriminant criteria

In [21] we proposed a new method for identifying two discriminant directions obtained by successive optimization of discriminant criteria which determine a single linear one-dimensional subspace of the sample space. We discuss two classes of discriminant criteria, namely, the extended Fisher criterion previously proposed by us [15,16] and the nonparametric criterion proposed by Fukunaga (Intr. to Stat. Patt. Rec. 1990). In the past two methods for successive optimization were applied. The first method uses orthogonal constraints on the discriminant vectors (called method ORTH). The second method does not so constrain the discriminant vectors (called method FREE). Our experience shows that method FREE does not always create new discriminant information along the second discriminant vector. In order to include new information in it, method ORTH was used, but it is known that orthogonal directions do not always suffice; directions containing information for class separation may be oblique to each other. We proposed a new method which is better than FREE and ORTH. It is free to search for discriminant directions oblique to each other and ensures that the informative directions already found will not be chosen again at a later stage.

The new method, called REM, "removal of classification structure", consists of obtaining a discriminant vector, transforming the data along it into data without classification structure, and obtaining the second discriminant vector. The proposed method is stimulated by an idea of Friedman (“Exploratory projection pursuit”, JASA, vol.82, 1987) called "structure removal" which is oriented to cluster analysis. The aim of this work was to employ "structure removal" in discriminant analysis.

We compared the discrimination qualities of the plots spanned by the discriminant vectors obtained by the methods ORTH, FREE and the new method REM. We study a wide spectrum of situations in terms of the size of the design sample drawn from synthetic and real data. Synthetic data sets concern normal class-conditional distributions, nonnormal unimodal distributions and multimodal distributions. The real data concerns the medical diagnosis of the neurological disease cerebrovascular accident (CVA). We compared the discrimination qualities of the plots by the resulting differences of the test error rates. The simulation studies and the real data experiment indicate that the proposed method REM increases the quality of the discriminant plots obtained by optimization of two criteria, namely our extended Fisher criterion [15,16] and the nonparametric criterion proposed by Fukunaga (1990). There appears to be no loss in applying REM instead of both ORTH and FREE. It seems that the new method REM provides a more effective constraint on the discriminant vectors than does the orthogonality constraint usually applied in statistical pattern recognition.

In [22] we proposed a new method for the optimization of a discriminant criterion which has several local maxima. In our previous work [20] we studied discriminant criteria which have unique maxima with respect to the directional vector, the optimal solution being obtained analytically (by solving some eigenvalue problems). Here we discussed the optimization of a criterion, called the Patrick-Fisher (PF) distance, which is a highly nonlinear function and has several local maxima. For this criterion an iterative optimization has to be carried out. In most studies in the past, the vector which maximizes PF distance is searched for along the gradient of this distance hoping that with a good starting point the procedure will converge to the global maximum or at least to a practical one. In the naive optimization, the observed maximum of the PF distance can be merely a local maximum.

We proposed a method of recursive optimization for searching the directions corresponding to several large local maxima of the PF distance. Starting from some directional vector we search to a convergence point of a local maximizer of the PF distance. The directional vector after convergence of the maximizer is called the PF discriminant vector. Then we transform the data along the obtained PF vector into data with a deflated maximum of PF distance (greater overlap of the classes) and iterate to obtain the next PF discriminant vector. In order to deflate the maximum of the PF distance one could use our method REM (proposed in [20]) which will result in full overlap of the classes along the previously defined discriminant vector. This certainly eliminates the local maximum of the PF distance, but it causes large changes of the distributions of the transformed data in some applications discussed in our paper. In order to overcome this problem we developed a method for "reduction of the class separation" which directs the local optimizer to a new local maximum while keeping the data structure of the transformed data as close as possible to the original one.

We carried out experiments with synthetic and real data sets which show that our method succeeded in finding the sequence of directions with significant discriminant information. Our method was more effective than the methods for initialization of the local optimizer of the PF distance used in the past (initializations by the Fisher’s discriminant analysis and principal component analysis). Our method, as is the case with the stochastic smoothing algorithms, may miss some large local maxima including the global one. We view this as desirable and do not think of our proposal as a global optimization method, but rather as a simple tool which detects directions with “interesting” discriminant information.

Department of Electrical & Computer Engineering
Prof. MAYER ALADJEM