Sparse Nonlinear Feature Selection Algorithm via Local Structure Learning

In this paper, we propose a new unsupervised feature selection algorithm that considers both the nonlinear relationships and the similarity relationships within the data. To achieve this, we apply the kernel method and local structure learning to capture the nonlinear relationship between features and the local similarity between features. Specifically, we use a kernel function to map each feature of the data into the kernel space. In the high-dimensional kernel space, each feature is assigned a weight, and features with zero weights are unimportant features (e.g., redundant features). Furthermore, we account for the similarity between features through local structure learning, and we propose an effective optimization method to solve the resulting problem. The experimental results show that the proposed algorithm achieves better performance than the comparison algorithms.

1-Introduction
The kernel function maps the original features into a high-dimensional space so that the nonlinear relationship between the features becomes linearly separable there. In this way, the relationships among the features can be explored more fully, making the mining more thorough. This paper proposes a new unsupervised feature selection algorithm to deal with the above two problems. Specifically, for the nonlinearity of the features, we use a Gaussian kernel function to map each feature to a kernel matrix and then assign a weight to each feature; important features have large weight values, while relatively unimportant features (i.e., redundant features) correspond to weight values of zero. For the similarity of the features, we use local structure learning to consider the similarity between features, which at the same time preserves the local structure of the features. For our proposed algorithm, we adopt an alternating iterative optimization method, so that the objective value decreases gradually in each iteration and finally converges. The main contributions of our approach are as follows.
• The similarity and the nonlinear relationship between features are considered simultaneously in one framework. The relationships between features are explored more fully through kernel sparse learning and local structure learning, thus effectively removing redundant features.
• The global structure of the data is maintained by a low-rank constraint, which complements local structure learning. At the same time, the low-rank constraint can also remove noise samples and outliers, which improves the robustness of the algorithm to some extent.
• We propose an alternating iterative optimization algorithm to solve our objective function. The method gradually reduces the value of the objective function in each iteration and finally converges, and we also prove the convergence of the proposed algorithm theoretically. The experimental results show that, compared with the other comparison algorithms, our proposed algorithm achieves better results on real data sets.

2-Related Work
Feature selection is an important approach to data pre-processing. It mainly looks for a subset of features that represents the original data. Due to the popularity of sparse learning, most existing feature selection algorithms employ sparse learning. From the perspective of machine learning, existing feature selection algorithms are mainly divided into unsupervised, semi-supervised, and supervised feature selection algorithms.
Among the unsupervised feature selection algorithms of the past two years, Almusallam et al. proposed an efficient unsupervised feature selection method for streaming features [8]. The algorithm uses the K-means algorithm to aggregate unknown features into a feature stream, and it uses three independent similarity measures to determine whether to add existing features to the feature subset. Wan et al. proposed global and intrinsic geometric structure embedding for unsupervised feature selection [11]. This method takes into account the information difference between the original feature space and the low-dimensional subspace. The projection matrix is constrained by the $\ell_2$- or $\ell_{2,1}$-norm to select sparser and more discriminative features.
At the same time, some new feature selection algorithms have been proposed in the last two years. For example, Xue et al. proposed online weighted multi-task feature selection [9]. This work proposes a weighted multi-task model that not only selects important features but also yields sparse solutions, while the convergence speed of the algorithm is also guaranteed. Liu et al. proposed global and local structure preservation for feature selection [10]. The article mainly states that maintaining the global similarity structure and the local geometry of the data is extremely important for supervised feature selection algorithms, whereas for unsupervised feature selection algorithms maintaining the local geometry of the data matters more. Li et al. proposed a stable feature selection algorithm [12]. It is a stable feature selection algorithm based on energy learning that uses an $\ell_1$ or $\ell_2$ regularization term to investigate stability. Tsagris et al. proposed feature selection for high-dimensional temporal data [13]. The algorithm extends constraint-based feature selection to high-dimensional temporal data and achieves good results.
There are also some other interesting feature selection algorithms. For example, Zhao et al. proposed cost-sensitive feature selection based on adaptive neighborhood granularity with multi-level confidence [14]. The algorithm establishes fast backtracking based on the accuracy of the data and designs an adaptive neighborhood rough set model that trades off test cost against misclassification cost to select useful features. Wang et al. proposed category-specific dictionary learning for attribute-specific feature selection [15]. This work combines label learning with dictionary learning; feature selection is performed at the dictionary level, which better preserves structural information while suppressing intra-class noise. Sheeja et al. proposed a novel feature selection method using fuzzy rough sets [16]. The algorithm studies the properties of fuzzy rough approximations defined through divergence measures of fuzzy sets, and it uses fuzzy measures to characterize the different fuzzy sets. Zhang et al. proposed a fast feature selection algorithm based on swarm intelligence for acoustic defect detection [17]. The algorithm casts feature selection as a global optimization problem and proposes a filter-based global optimization framework and mathematical model; it takes the shortest time while achieving comparable performance.

3-Our Method
In this section, we first introduce the symbols used in this article and then explain our proposed Unsupervised Feature Selection Algorithm via Local Structure Learning and Kernel Function (abbreviated as LSK_FS) in Sections 3.1 and 3.2, respectively. We then present the optimization method in Section 3.3. Finally, we analyze the convergence of the objective function in Section 3.4.

3-1-Notations
For the data matrix $X \in \mathbb{R}^{n \times d}$, where n and d denote the number of samples and the number of features, respectively, the i-th row and the j-th column are denoted as $x^i$ and $x_j$, and the element in the i-th row and j-th column is denoted as $x_{ij}$. The trace of the matrix X is denoted by $tr(X)$, $X^\top$ denotes the transpose of X, and $X^{-1}$ represents the inverse of X. We denote the Frobenius norm and the $\ell_1$-norm of X as $\|X\|_F$ and $\|X\|_1$, respectively. We break the data set $X \in \mathbb{R}^{n \times d}$ into d column vectors $x_i \in \mathbb{R}^{n}$, $i = 1, \ldots, d$, one per feature, and project each of them into the kernel space to obtain a kernel matrix $K_i \in \mathbb{R}^{n \times n}$; thus the original $X \in \mathbb{R}^{n \times d}$ becomes d kernel matrices.
3-2-The Proposed LSK_FS

The unsupervised feature selection algorithm mainly mines the most representative features in the data. In the absence of the class label Y, using the data matrix X itself as the response matrix allows the internal structure of the original features to be better preserved [18,19]. In order to fully exploit the nonlinear relationships among the data features, we obtain the following expression:

$$\min_{\alpha, W}\ \left\| X - \sum_{i=1}^{d} \alpha_i K_i W \right\|_F^2, \quad (1)$$

where $W \in \mathbb{R}^{n \times d}$ represents the kernel coefficient matrix, $\alpha \in \mathbb{R}^{d}$ is used to perform feature selection and is equivalent to a weight vector over the features, $\alpha_i$ is the i-th element of the vector $\alpha$, and $K_i$ is the kernel matrix of the i-th feature.
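To make the per-feature kernel mapping concrete, the following minimal Python sketch (our own illustration, not code from the paper; the bandwidth sigma is a hypothetical tuning parameter) builds the d Gaussian kernel matrices from the columns of X:

```python
import numpy as np

def feature_kernels(X, sigma=1.0):
    """Map each feature (column) of X to an n x n Gaussian kernel matrix.

    A sketch of the per-feature kernel construction described above;
    `sigma` is a hypothetical bandwidth parameter.
    """
    n, d = X.shape
    kernels = []
    for i in range(d):
        col = X[:, i].reshape(-1, 1)        # i-th feature as a column vector
        sq_dist = (col - col.T) ** 2        # pairwise squared distances
        kernels.append(np.exp(-sq_dist / (2 * sigma ** 2)))
    return kernels                          # list of d (n x n) kernel matrices

# Usage: X = np.random.rand(50, 10); Ks = feature_kernels(X)
```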
Previous work has proved that the local structure of the data can be used for dimensionality reduction [20], so this paper performs local structure learning by establishing a similarity matrix between the data features in the low-dimensional space. Here $S \in \mathbb{R}^{d \times d}$ is the similarity matrix of the high-dimensional data in the low-dimensional space, and $s_{ij}$ is an element of S indicating the similarity between feature $x_i$ and feature $x_j$: if feature $x_j$ is among the k nearest neighbors of feature $x_i$, then $s_{ij}$ is obtained by the Gaussian kernel function; otherwise $s_{ij} = 0$. A similarity matrix built this way, however, is particularly affected by the bandwidth parameter σ. In order to reduce the number of tuning parameters and learn a more effective similarity matrix, while making the fit to X consider the structural relationship between the data features in the low-dimensional space, this paper alternates structure learning with low-dimensional space learning so that each reaches its optimum. Specifically, we obtain the following formula:

$$\min_{S}\ \sum_{i,j=1}^{d} \left( \lambda_1 \left\| W^\top x_i - W^\top x_j \right\|_2^2 s_{ij} + \lambda_2\, s_{ij}^2 \right), \quad s.t.\ \mathbf{1}^\top s_i = 1,\ s_{ij} \ge 0,\ s_{ij} = 0 \ \text{if}\ j \notin \mathcal{N}(i), \quad (5)$$

where $\lambda_1$ and $\lambda_2$ are tuning parameters, $s_i$ is the i-th column of the similarity matrix S, and the term $\|s_i\|_2^2$ is used to avoid trivial solutions. $\mathbf{1}$ represents a vector with all elements equal to 1, and $\mathcal{N}(i)$ represents the set of neighbors of the i-th feature. In order to maintain rotation invariance, we measure the distances between features in the low-dimensional space through the projection W. Therefore, the above formula makes the similarity large for pairs of features that are relatively close and small for pairs of features that are relatively far apart.
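As an illustration, the sketch below (our own simplification of the initialization described above; the neighborhood size k and bandwidth sigma are hypothetical parameters) builds an initial k-nearest-neighbor Gaussian similarity matrix over the features, with each row normalized so that the constraints $\mathbf{1}^\top s_i = 1$ and $s_{ij} \ge 0$ hold:

```python
import numpy as np

def initial_similarity(X, k=5, sigma=1.0):
    """k-NN Gaussian similarity between features (columns of X).

    Rows are renormalized so that each s_i sums to 1 and s_ij >= 0,
    matching the simplex constraints of formula (5).
    """
    d = X.shape[1]
    # pairwise squared Euclidean distances between feature columns
    G = X.T @ X
    sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G
    S = np.zeros((d, d))
    for i in range(d):
        order = np.argsort(sq[i])
        neighbors = [j for j in order if j != i][:k]   # k nearest features
        S[i, neighbors] = np.exp(-sq[i, neighbors] / (2 * sigma ** 2))
        S[i] /= S[i].sum()                             # simplex constraint
    return S
```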
In order to eliminate the interference of outliers and remove noise samples at the same time [21], this paper adds a low-rank constraint [22] to the matrix W, namely

$$W = AB, \quad (6)$$

where $A \in \mathbb{R}^{n \times r}$, $B \in \mathbb{R}^{r \times d}$, and $r < \min(n, d)$. We add an orthogonality constraint $A^\top A = I_r$ on A in order to fully consider the correlation between the output variables. We also add an $\ell_1$-norm on $\alpha$ for sparse learning and feature selection [23]. Finally, we obtain our final objective function as follows:

$$\min_{\alpha, A, B, S}\ \left\| X - \sum_{i=1}^{d} \alpha_i K_i AB \right\|_F^2 + \sum_{i,j=1}^{d} \left( \lambda_1 \left\| B^\top A^\top x_i - B^\top A^\top x_j \right\|_2^2 s_{ij} + \lambda_2\, s_{ij}^2 \right) + \lambda_3 \|\alpha\|_1, \quad s.t.\ A^\top A = I_r,\ \mathbf{1}^\top s_i = 1,\ s_{ij} \ge 0, \quad (7)$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are tuning parameters. The kernel matrices $K_i$ are calculated by the Gaussian kernel function; their main function is to map the data into the kernel space, thereby mining the nonlinear relationships between the data features. The $\ell_1$-norm on $\alpha$ in the last term is used to sparsify the features for feature selection: if the element of the vector $\alpha$ corresponding to a feature is zero, that feature is not selected.
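To keep the pieces straight, the following hedged sketch evaluates objective (7) as reconstructed above for given α, A, B, and S (the Laplacian form of the structure term is used for compactness; all shapes follow the notation of Section 3.1):

```python
import numpy as np

def objective(X, Ks, alpha, A, B, S, lam1, lam2, lam3):
    """Value of the reconstructed objective (7).

    X: n x d data; Ks: list of d (n x n) kernel matrices; alpha: (d,);
    A: n x r with A.T @ A = I; B: r x d; S: d x d feature similarity.
    """
    P = sum(a * K for a, K in zip(alpha, Ks))        # weighted kernel sum
    fit = np.linalg.norm(X - P @ A @ B, 'fro') ** 2  # reconstruction error
    Ssym = (S + S.T) / 2                             # symmetrized similarity
    L = np.diag(Ssym.sum(axis=1)) - Ssym             # graph Laplacian over features
    Z = (A @ B).T @ X                                # columns z_i = W^T x_i
    smooth = 2 * np.trace(Z @ L @ Z.T)               # sum_ij ||z_i - z_j||^2 s_ij
    return fit + lam1 * smooth + lam2 * (S ** 2).sum() + lam3 * np.abs(alpha).sum()
```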

3-3-Optimization
Since the objective function is not jointly convex, a closed-form solution cannot be obtained directly. Therefore, this paper proposes an alternating iterative optimization method to solve it, which is divided into the following four steps:

Update A by fixing S, α and B:
When S, α, and B are fixed, the optimization problem (7) becomes

$$\min_{A}\ \left\| X - PAB \right\|_F^2 + \lambda_1 \sum_{i,j=1}^{d} \left\| B^\top A^\top x_i - B^\top A^\top x_j \right\|_2^2 s_{ij}, \quad s.t.\ A^\top A = I_r, \quad (8)$$

where $P = \sum_{i=1}^{d} \alpha_i K_i$. Formula (8) can be transformed into

$$\min_{A}\ \left\| X - PAB \right\|_F^2 + \lambda_1\, tr\!\left( B^\top A^\top X L X^\top A B \right), \quad s.t.\ A^\top A = I_r. \quad (9)$$

Expanding (9), we have

$$\min_{A}\ tr(X^\top X) - 2\, tr(X^\top PAB) + tr(B^\top A^\top P^\top PAB) + \lambda_1\, tr(B^\top A^\top X L X^\top A B), \quad s.t.\ A^\top A = I_r, \quad (10)$$

where $tr(\cdot)$ represents the trace of a matrix, $L = Q - (S + S^\top)/2$ is a Laplacian matrix, and Q is a diagonal matrix whose i-th diagonal element is $q_{ii} = \sum_{j} (s_{ij} + s_{ji})/2$. Taking the derivative of (10) with respect to A, we have

$$-2 P^\top X B^\top + 2 P^\top P A B B^\top + 2 \lambda_1 X L X^\top A B B^\top. \quad (11)$$

Due to the orthogonality constraint on A, we can optimize it by the method in [24].
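Reference [24] gives a dedicated solver for such orthogonality-constrained problems; as a hedged stand-in (a simple common alternative, not the exact method of [24]), the sketch below takes one gradient step on (10) and retracts back onto the set $A^\top A = I$ via a thin SVD:

```python
import numpy as np

def update_A(X, P, A, B, L, lam1, step=1e-3):
    """One projected-gradient step for A under A.T @ A = I.

    A sketch only: gradient step using derivative (11), then projection
    to the nearest column-orthonormal matrix via thin SVD.
    """
    grad = (-2 * P.T @ X @ B.T
            + 2 * P.T @ P @ A @ B @ B.T
            + 2 * lam1 * X @ L @ X.T @ A @ B @ B.T)   # derivative (11)
    A_new = A - step * grad
    U, _, Vt = np.linalg.svd(A_new, full_matrices=False)
    return U @ Vt                                      # nearest orthonormal matrix
```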

Update B by fixing S, α and A:
By fixing S, α, and A, the objective function (7) can be simplified as follows:

$$\min_{B}\ \left\| X - PAB \right\|_F^2 + \lambda_1\, tr\!\left( B^\top A^\top X L X^\top A B \right). \quad (12)$$

It is easy to see that (12) is equivalent to the following formula:

$$\min_{B}\ tr(X^\top X) - 2\, tr(X^\top PAB) + tr(B^\top A^\top P^\top PAB) + \lambda_1\, tr(B^\top A^\top X L X^\top A B). \quad (13)$$

Setting the derivative of (13) with respect to B to zero, we can get:

$$B = \left( A^\top P^\top P A + \lambda_1 A^\top X L X^\top A \right)^{-1} A^\top P^\top X. \quad (14)$$

Update S by fixing A, α and B:
After fixing A, α, and B, the objective function (7) becomes

$$\min_{S}\ \sum_{i,j=1}^{d} \left( \lambda_1 \left\| B^\top A^\top x_i - B^\top A^\top x_j \right\|_2^2 s_{ij} + \lambda_2\, s_{ij}^2 \right), \quad s.t.\ \mathbf{1}^\top s_i = 1,\ s_{ij} \ge 0. \quad (15)$$

We first calculate the Euclidean distance between every two data features to construct the neighbors of all the features. If the j-th feature does not belong to the nearest neighbors of the i-th feature, then the value of $s_{ij}$ is zero; otherwise, the value of $s_{ij}$ is solved by equation (18). At the same time, optimizing S is equivalent to optimizing each $s_i$ independently:

$$\min_{s_i}\ \sum_{j=1}^{d} \left( e_{ij}\, s_{ij} + \lambda_2\, s_{ij}^2 \right), \quad s.t.\ \mathbf{1}^\top s_i = 1,\ s_{ij} \ge 0, \quad (16)$$

where $e_{ij} = \lambda_1 \left\| B^\top A^\top x_i - B^\top A^\top x_j \right\|_2^2$. Under the KKT conditions, we can get the following closed-form solution:

$$s_{ij} = \left( \eta_i - \frac{e_{ij}}{2\lambda_2} \right)_+, \quad (18)$$

where $(\cdot)_+ = \max(\cdot, 0)$ and the multiplier $\eta_i$ is determined by the constraint $\mathbf{1}^\top s_i = 1$. Since each data feature has a fixed number of neighbors, we sort $e_{i1}, \ldots, e_{id}$ in ascending order and solve (18) over the k nearest neighbors of the i-th feature.
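A hedged sketch of this row-wise update follows (the simplex projection uses the standard sorting routine; the restriction to the k nearest neighbors is applied first, matching the neighbor construction described above):

```python
import numpy as np

def update_S_row(e, k, lam2):
    """Closed-form update of one row s_i of S, in the spirit of equation (18).

    e: (d,) distances e_ij; k: number of neighbors; lam2: regularizer.
    Returns s_i on the probability simplex, supported on the k nearest
    neighbors of feature i.
    """
    d = e.shape[0]
    idx = np.argsort(e)[:k]                  # k nearest neighbors of feature i
    v = -e[idx] / (2 * lam2)                 # unconstrained minimizer direction
    # Euclidean projection of v onto the simplex {s >= 0, sum(s) = 1}
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - 1) / np.arange(1, k + 1) > 0)[0][-1]
    eta = (css[rho] - 1) / (rho + 1)         # the multiplier eta_i in (18)
    s = np.zeros(d)
    s[idx] = np.maximum(v - eta, 0)
    return s
```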

Update α by fixing A, B and S:
After fixing A, B, and S, the objective function (7) becomes

$$\min_{\alpha}\ \left\| X - \sum_{i=1}^{d} \alpha_i K_i AB \right\|_F^2 + \lambda_3 \|\alpha\|_1. \quad (23)$$

In order to simplify the following optimization, we set

$$M = \left[\, vec(K_1 AB), \ldots, vec(K_d AB) \,\right] \in \mathbb{R}^{nd \times d}$$

and have

$$\min_{\alpha}\ f(\alpha) + \lambda_3 \|\alpha\|_1, \qquad f(\alpha) = \left\| vec(X) - M\alpha \right\|_2^2. \quad (24)$$

The above simplification is only for the convenience of the following gradient descent [25]. Note that $\|\alpha\|_1$ is convex but not smooth, so we use the approximate (proximal) gradient method to optimize α and update it by the following rule:

$$\alpha_{t+1} = \arg\min_{\alpha}\ \frac{1}{2\eta} \left\| \alpha - u_t \right\|_2^2 + \lambda_3 \|\alpha\|_1, \qquad u_t = \alpha_t - \eta \nabla f(\alpha_t), \quad (26)$$

where η is a tuning parameter (the step size) and $\alpha_t$ is the value of α in the t-th iteration. Formula (26) is the Euclidean projection $\pi_{\eta}(u_t)$ induced by the $\ell_1$ term; because $\|\alpha\|_1$ has a separable form, formula (26) can be written element-wise as

$$\alpha_i^{t+1} = \arg\min_{\alpha_i}\ \frac{1}{2\eta} \left( \alpha_i - u_{t,i} \right)^2 + \lambda_3 \left| \alpha_i \right|, \quad (28)$$

where $\alpha_i$ and $\alpha_i^{t+1}$ are the i-th elements of α and $\alpha_{t+1}$, respectively. According to formula (28), $\alpha_i^{t+1}$ has the following closed-form (soft-thresholding) solution:

$$\alpha_i^{t+1} = \operatorname{sign}\!\left( u_{t,i} \right) \max\!\left( \left| u_{t,i} \right| - \eta \lambda_3,\ 0 \right). \quad (29)$$

To speed up the approximate gradient algorithm in equation (26), we add the auxiliary variables

$$v_{t+1} = \alpha_{t+1} + \frac{b_t - 1}{b_{t+1}} \left( \alpha_{t+1} - \alpha_t \right), \qquad b_{t+1} = \frac{1 + \sqrt{1 + 4 b_t^2}}{2}, \quad (30)$$

where the gradient in (26) is then evaluated at the auxiliary point $v_{t+1}$ instead of $\alpha_{t+1}$.
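A minimal sketch of this accelerated soft-thresholding update follows (M, lam3, and the step size eta follow the reconstruction above; the default step 1/L uses the Lipschitz constant of the gradient of f):

```python
import numpy as np

def update_alpha(x_vec, M, lam3, eta=None, iters=100):
    """Accelerated proximal gradient for min ||x_vec - M a||^2 + lam3 ||a||_1.

    A sketch of the alpha update: gradient step on the smooth part,
    soft-thresholding (29) for the l1 part, and momentum (30).
    """
    d = M.shape[1]
    if eta is None:
        eta = 1.0 / (2 * np.linalg.norm(M, 2) ** 2)   # 1 / Lipschitz constant
    alpha = np.zeros(d)
    v, b = alpha.copy(), 1.0
    for _ in range(iters):
        grad = 2 * M.T @ (M @ v - x_vec)              # gradient of f at v
        u = v - eta * grad
        alpha_new = np.sign(u) * np.maximum(np.abs(u) - eta * lam3, 0.0)  # (29)
        b_new = (1 + np.sqrt(1 + 4 * b * b)) / 2
        v = alpha_new + (b - 1) / b_new * (alpha_new - alpha)             # (30)
        alpha, b = alpha_new, b_new
    return alpha
```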

3-4-Convergence Analysis
According to (18), for all $i, j = 1, \ldots, d$, $s_{ij}^{(t+1)}$ has a closed-form solution, so we can get

$$obj\!\left( \alpha^{(t)}, S^{(t+1)}, A^{(t)}, B^{(t)} \right) \le obj\!\left( \alpha^{(t)}, S^{(t)}, A^{(t)}, B^{(t)} \right). \quad (31)$$

When α and $S^{(t+1)}$ are fixed, updating $A^{(t+1)}$ and $B^{(t+1)}$ yields

$$obj\!\left( \alpha^{(t)}, S^{(t+1)}, A^{(t+1)}, B^{(t+1)} \right) \le obj\!\left( \alpha^{(t)}, S^{(t+1)}, A^{(t)}, B^{(t)} \right). \quad (32)$$

From the above two inequalities, the objective value does not increase from one iteration to the next.

Theorem 1. Let $\alpha_t$ be the sequence generated by Algorithm 1; then for every $t \ge 1$,

$$F(\alpha_t) - F(\alpha^*) \le \frac{2 \gamma L \left\| \alpha_0 - \alpha^* \right\|_2^2}{(t+1)^2}. \quad (34)$$

According to reference [38], γ is a constant defined in advance, L is the Lipschitz constant of the gradient of $f(\alpha)$ in equation (24), and $\alpha^* = \arg\min_{\alpha} F(\alpha)$. Through the above inequality and Theorem 1, we can easily see that our algorithm is convergent.

4-Experiments
We compared the classification accuracy of our algorithm and 8 comparison algorithms on 12 data sets (shown in Table 1).

4-1-Experimental Settings
We tested our proposed unsupervised feature selection algorithm on four binary data sets and eight multi-class data sets: Yale, Colon, Lung_discrete, Glass, SPECTF, Sonar, Clean, Arrhythmia, Movements, Ecoli, Urban_land, and Forest, where the first three data sets are from the feature selection data repository and the last nine are from the UCI repository. The details of the data sets are shown in Table 1. At the same time, we selected eight representative feature selection algorithms to compare with our proposed algorithm. They are introduced as follows:
EUFS [30]: This paper proposes a new unsupervised feature selection algorithm which, unlike other unsupervised feature selection algorithms that generate labels through clustering, directly embeds feature selection into the clustering algorithm through sparse learning. The most prominent contribution of this method is that other unsupervised feature selection algorithms can be applied within this framework.

Table 1. The information of the data sets
FSASL [31]: The intrinsic structure of the data is rarely considered by existing feature selection algorithms. By placing structure learning and feature selection in one framework, the algorithm can effectively select representative features while maintaining the structure of the data (i.e., structure learning and feature selection complement each other).
NDFS [32]: The algorithm is a new unsupervised feature selection algorithm that mines discriminative information. Specifically, it imposes a non-negative constraint on the class indicator to learn clustering labels more accurately. At the same time, the $\ell_{2,1}$-norm is used to remove redundant features. It is an algorithm that performs clustering and feature selection simultaneously.
NetFS [33]: This method is also an unsupervised feature selection algorithm, but it is mainly designed for networked data. Due to the large amount of noise in networked data, the algorithm combines latent representation learning with feature selection. Through sparse learning and latent representation learning, the algorithm can remove noise interference well and finally achieves good robustness.
RLSR [34]: This paper proposes a semi-supervised feature selection algorithm that finds global and sparse solutions of the projection matrix. The main contribution of the algorithm is to propose a regularization term, the squared $\ell_{2,1}$-norm, together with a detailed theoretical proof that this term can effectively select features while taking global information into account.
RFS [35]: This paper presents a new robust feature selection algorithm based on sparse learning. Specifically, it applies the $\ell_{2,1}$-norm to both the loss function and the regularization term. The sparsity constraint on the regularization term can effectively remove redundant features, while the $\ell_{2,1}$-norm on the loss function can effectively remove noise samples, thus achieving robustness.
RSR [36]: The algorithm is a new unsupervised feature selection algorithm that learns by regularized self-representation. Specifically, if a feature is particularly important, it can be represented linearly by most of the other features (i.e., a feature can be represented by a combination of other features). In addition, the algorithm shows good performance on both synthetic and real data sets.
K_OFSD [37]: The algorithm is an online feature selection algorithm which mainly selects relevant features through neighborhood learning. Through this method, the class imbalance problem can be effectively alleviated. At the same time, the dimensionality of high-dimensional data is effectively reduced via the dependency between the conditional features and the decision class.
In our proposed model, we select the best SVM for classification and, through 10-fold cross-validation, divide the data set into a training set and a test set. In order to minimize the experimental error, we repeat each experiment 10 times and report the average classification accuracy. Since the experiments use the classification accuracy rate to measure the performance of the algorithms, we define the classification accuracy as follows:

$$acc = \frac{X_{correct}}{X},$$

where X represents the total number of samples and $X_{correct}$ represents the number of correctly classified samples. At the same time, we define the standard deviation to measure the stability of our algorithm, as follows:

$$std = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( acc_i - \mu \right)^2 },$$

where N represents the number of experiments, $acc_i$ represents the classification accuracy of the i-th experiment, and μ represents the average classification accuracy; the smaller the std, the more stable the algorithm.
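For concreteness, these two measures can be computed as in the sketch below (plain NumPy; acc_runs stands for the per-repetition accuracies, an assumption matching the 10-repetition protocol described above):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Classification accuracy: correctly classified samples / total samples."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def stability_std(acc_runs):
    """Population standard deviation of per-run accuracies (smaller = more stable)."""
    acc_runs = np.asarray(acc_runs, dtype=float)
    mu = acc_runs.mean()                     # average classification accuracy
    return np.sqrt(np.mean((acc_runs - mu) ** 2))

# Usage: stability_std([0.91, 0.89, 0.93, 0.90])
```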

4-2-Experiment Result
In Figure 1, we can clearly see the classification accuracy over the 10 experiments. The algorithm we propose is not the highest every time, but it is the highest in most cases. In Table 2, we can see the average classification accuracy of each algorithm on the 12 data sets. Our proposed algorithm is clearly superior to the comparison algorithms. Specifically, it is 4.78% higher than EUFS in average classification accuracy and 5.05% higher than FSASL, which indicates that our algorithm outperforms general feature selection algorithms. Compared with K_OFSD, NDFS, NetFS, RLSR, RFS, and RSR, LSK_FS improves by 13.63%, 8.55%, 6.69%, 7.88%, 6.68%, and 3.59%, respectively. In particular, our algorithm is especially effective on the SPECTF data set.

(Figure 1 panels, omitted here: classification accuracy versus the number of experiments for the Arrhythmia, Clean, Colon, Ecoli, Forest, and Glass data sets, among others.)

In Table 3, we can see the average standard deviation of each algorithm on the 12 data sets. The standard deviation of the proposed LSK_FS algorithm is the smallest, indicating that our algorithm has the best stability. The LSK_FS algorithm achieves such good results mainly for two reasons: 1. it considers the similarity between data features; 2. it fully considers the nonlinear relationship between data features.


By adjusting the values of the parameters $\lambda_1$ and $\lambda_2$, we show the results of our proposed algorithm in Figure 2. As shown in Figure 2, the proposed algorithm is not very sensitive to the tuning parameters on most data sets, although subtle changes can be observed. This is because our proposed algorithm has good robustness; $\lambda_1$ is used to control the weight of the local structure term.

4-3-Convergence Analysis
In Figure 3, we show the objective function value at each iteration of the proposed algorithm on the 12 data sets. We set the stopping criterion of the proposed algorithm to a small threshold on the relative change of the objective value between two consecutive iterations.
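As an illustration of this stopping rule, the following sketch (the threshold tol is a hypothetical value, and obj_step stands for one full round of the four updates in Section 3.3) iterates until the relative change falls below the threshold:

```python
def run_until_converged(obj_step, tol=1e-4, max_iter=100):
    """Iterate an alternating-optimization step until the relative change
    of the objective value falls below `tol` (a hypothetical threshold).

    obj_step: callable returning the current objective value after one
    full round of updates for A, B, S, and alpha.
    """
    prev = obj_step()
    history = [prev]
    for _ in range(max_iter):
        cur = obj_step()
        history.append(cur)
        if abs(prev - cur) / max(abs(prev), 1e-12) < tol:
            break
        prev = cur
    return history

# Usage: history = run_until_converged(lambda: my_model_one_iteration())
```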

5-Conclusion
This article has proposed a new feature selection algorithm to remove redundant features. Specifically, local structure learning is applied to the features to take their local structure and similarity into account. Moreover, the kernel function is applied to map all the features into a high-dimensional space, so that the nonlinear relationships among the features become linearly separable. In addition, low-rank constraints are used to remove noise samples, maintain the global structure of the data, and achieve robust results. Finally, the experimental results also show the superiority of our proposed algorithm. In future work, we plan to improve our algorithm through robust statistical learning.

6-Funding and Acknowledgements
This work is partially supported by the China Key Research Program (Grant No. 2016YFB1000905); the Key Program of the National Natural Science Foundation of China (Grant No. 61836016); the Natural Science Foundation of China (Grant Nos. 61876046, 61573270, 81701780, and 61672177); and the Project of Guangxi Science and Technology.
Figure 1. Average classification accuracy of all methods for all data sets

Figure 3. The convergence graph of Algorithm 1 for all data sets