“Feature selection methods and sampling techniques to financial distress prediction for Vietnamese listed companies”

This research investigates the combined effects of variable selection approaches and sampling techniques on the performance of a model that predicts financial distress for companies whose stocks are traded on the securities exchanges of Vietnam. A firm is financially distressed when its stocks are delisted at the requirement of the Vietnamese stock exchanges because it has made a loss in 3 consecutive years or has an accumulated loss greater than the company's equity. There are 12 models, constructed differently in feature selection methods, sampling techniques, and classifiers. The feature selection methods are factor analysis and F-score selection, while 3 data samples are chosen by the choice-based method with different percentages of financially distressed firms. In terms of classifying technique, logistic regression and SVM are used in these models. Data are collected from listed firms in Vietnam from 2009 to 2017 for 1, 2 and 3 years before the announcement of the delisting requirement. The experiment's results highlight the outperformance of the SVM model with the F-score selection method on the data sample containing the highest percentage of non-financially distressed firms.


INTRODUCTION
According to Beaver (1966), in the first study on financial distress prediction, a firm is considered financially distressed or failed if it cannot fulfill its financial obligations as they mature. Since Beaver's pioneering work, the construction of warning models has become a center of research in corporate finance worldwide. Traditionally, a financially distressed firm is a company that falls into bankruptcy because of business failure (Beaver, 1966; Altman, 1968; Norton & Smith, 1980; Ohlson, 1980), a definition that remains popular in more recent research (Zhou et al., 2012; Altman et al., 2016; Liang et al., 2015). Another measure of financial distress is known as the finance-based definition (Pindado et al., 2008). Under this measure, the financial distress of a company does not necessarily put it into bankruptcy (Altman, 1984); thus, a model for financial distress forecasting plays a crucial role in helping firms to avoid bankruptcy (Santoso, 2018).
It is clear that financial distress, regardless of how it is recognized, produces huge potential losses for the stakeholders of a company. Therefore, a financial prediction model works as an early warning system that supports a company's managers in making the necessary adjustments to their financial management strategies to avoid becoming distressed. It also assists investors and creditors in their decision-making and helps the government to alert firms before putting them on the "control" list. Although numerous models have been created and tested, the performance of a financial distress prediction model has been shown to vary when different sets of predictors, data samples and classifiers are applied.
The independent variables in a prediction model are mainly accounting ratios, though they have been extended to other features outside the accounting reports. In order to obtain the optimal variables for the model, different feature selection methods have been applied to choose the most informative and discriminant ratios. In addition to the selection method, there is also evidence that the choice of data sampling affects the prediction model's performance. According to the number of financially distressed companies chosen, the sampling techniques can be divided into choice-based sampling and complete-data sampling. Recently, an increasing number of papers have attempted to draw a comparison between these two approaches, but no consistent conclusion can be discerned.
The third determinant of a model's performance is the choice of classifying technique applied to determine whether a firm is financially distressed or not. Supported by advances in computer science, classifiers have developed from the Univariate model (Beaver, 1966) to the Discriminant Technique model (Altman, 1968), and further to the Logistic Regression model (Ohlson, 1980) and Data Mining models. Although comparisons between different models have been conducted widely, there is no consistent answer on the best classifier, i.e. one that is superior across all data samples.
Based on the factors that influence a model's performance, most of the relevant research focuses on improving the model's accuracy by selecting the optimal set of predictors and an appropriate classifying technique for a particular data sample. However, the combined effects of feature selection methods and sampling techniques have not yet received enough interest from researchers. Therefore, this research aims to build models to predict the financial distress of firms listed on the securities exchanges of Vietnam, focusing on the role of the feature selection method, in association with different sampling choices, in improving the model's performance. The importance of the study is underlined by the increasing number of financially distressed firms in Vietnam, while related research remains limited.
Data are collected from companies listed in the Vietnamese securities market from 2009 to 2017, while a financially distressed company is the one receiving the requirement of being delisted. The analysis results reveal that the model's accuracy is higher as the number of non-financially distressed firms chosen increases. Overall, the SVM models with F-score feature selection outperform the Logistic Regression models.

Review of predictors and predictor selection
In a financial distress prediction model, the choice of predictors can affect the accuracy of prediction. While the usefulness of each predictor varies across models, independent variables in existing studies can be classified into three main groups: accounting ratios, market variables and macroeconomic ratios. Among these, the largest group is the accounting ratios. These are calculated from companies' financial statements, which are prepared according to pre-determined accounting principles. Accounting ratios reflecting companies' financial performance, such as liquidity, profitability, business capacity and capital structure, are favored by researchers. In addition to accounting ratios, market variables are also used in an ex-ante model as they contain information on expected future cash flows, which are relevant to the likelihood of becoming financially distressed (Rees, 1995). A literature review shows that there is a great number of predictors that can be utilized in a financial distress prediction model. According to Zhou et al. (2012), there are 500 different variables to be found in 128 papers, and the predictive power of each variable changes from paper to paper (Sayari & Mugan, 2016). As stated by Powell (2007), a high dimensionality problem arises if too many variables are used for data analysis. Therefore, reducing the number of variables by retaining only informative and discriminative predictors is crucial to improving a prediction model's performance.
Feature selection, defined as the approach for selecting the optimal set of predictors, has been applied broadly in existing papers. It is designed to produce better performance, reduce the cost of processing a model, and provide a better understanding of the company's operation (Guyon & Elisseeff, 2003). In previous articles, variable selection techniques are recognized as expert recommendation and statistical methods (Lin et al., 2014). As examples of the expert recommendation method, the studies by Alifiah (2014) and Liang et al. (2015) select, as the model's indicators, variables that are useful in at least ten previous papers or factors appearing more than 3 times in 127 relevant models. On the other hand, the filter-based and wrapper-based feature selection methods are categorized as statistical methods for variable selection. These methods are considered computationally efficient when applied to a large number of independent variables (Blum & Langley, 1997; Guyon & Elisseeff, 2003).
The filter-based selection methods include the t-test, factor analysis, and stepwise regression, which assess the relevance of the variables according to pre-determined indices. The proposed criteria for this method can be the Fisher score (Yılmaz, 2013), the Laplacian score (Wan et al., 2015) and the F-score (Chen & Lin, 2003). Among those criteria, the F-score is considered the simplest (Song et al., 2017). In contrast, a wrapper method evaluates the variables based on their usefulness through a process that requires a lot of data processing. According to Kittler (1978), wrapper techniques include sequential forward selection and sequential backward selection; in the papers of Kohavi and John (1997) and Goldberg (1989), they take the form of randomized hill climbing and genetic algorithms.
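As an illustration of the wrapper idea, the following is a minimal sketch of sequential forward selection in Python. The feature names and the scoring function are hypothetical stand-ins; in practice the score would be a model's cross-validated accuracy on each candidate subset.

```python
def sequential_forward_selection(features, score, k):
    """Greedy wrapper method: repeatedly add the feature that most
    improves the score of the selected subset until k are chosen."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical per-feature usefulness standing in for a real
# cross-validated model score on each subset.
usefulness = {"roa": 3.0, "leverage": 2.0, "liquidity": 1.0, "firm_size": 0.1}
score = lambda subset: sum(usefulness[f] for f in subset)

print(sequential_forward_selection(usefulness, score, 2))
# → ['roa', 'leverage']
```

Because the subset is re-scored at every step, a wrapper is far more expensive than a filter such as the F-score, which is the trade-off the text describes.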

Review of sampling technique and classifiers
In addition to the discussion of feature selection methods, there is also disagreement on the data selection for the model. Zmijewski (1984) was the first researcher to discuss two data selection techniques in building a financial distress prediction model: the choice-based sampling technique and the complete-data sampling technique. The choice-based sampling technique, or stratified random sampling, is used when all available distressed companies and only a part of the non-financially distressed companies are kept in the sample. The non-distressed firms are chosen randomly or by criteria such as industry or company size. This sampling technique has been applied widely since Beaver (1966), as the choice-based technique "successfully remedies the potential problem of extremely low frequency rate of bankruptcy events in the population". Opponents of this method state that the significant difference between the proportion of distressed firms in the sample and that in the population may lead to biased parameter estimates (Zmijewski, 1984; Shaonan et al., 2015). In contrast to the previous approach, the latter technique brings all available non-financially distressed firms into the data sample. For example, Ohlson (1980) brings the entire records of 2,050 non-distressed companies and 105 failed companies into the data set. A similar approach has been applied in the works of Bharath and Shumway (2008), and Kim and Sohn (2010). Supporters of the complete-data sampling technique argue that the rate of financially distressed companies in a sample should be representative of the population (Ohlson, 1980) and that parameter bias decreases as the likelihood of distressed firms in the sample approaches that of the population (Zmijewski, 1984).
However, because of the great number of non-distressed firms compared to the number of distressed firms in the sample, this technique requires a huge amount of computation that may lead to a class imbalance problem and degradation in the final prediction performance (Liang et al., 2015).
Unquestionably, the classifier used to discriminate among the companies in the data sample according to the selected predictors plays a significant role in increasing the level of accuracy. With the development of statistical and soft computing techniques, a significant number of financial distress prediction models with various classifiers have been constructed, and many of them obtain impressive levels of accuracy. Zhou et al. (2012) summarize the related empirical research and divide these techniques into 2 groups: traditional classifiers and modern classifiers. Beaver (1966), Altman (1968) and Ohlson (1980) are the authors who construct financial distress prediction models with traditional classifiers. Beaver (1966) begins with a univariate model, after which the multiple discriminant analysis (MDA) model of Altman (1968) dominated the field (Balcaen & Ooghe, 2006). Beyond this, however, the domination of the MDA model decreased with the introduction of the Logistic Regression model by Ohlson (1980). This model has overtaken the MDA as the dominant model, as it does not require the assumptions of normal distribution and equal covariance, which are considered drawbacks of the MDA model. In addition to traditional models, the development of Artificial Intelligence (AI) and Data Mining has created modern classifiers such as the Decision Tree (DT), the Neural Network (NN), and Support Vector Machines (SVM).
There have been a number of studies comparing the performance of models with different classifiers. Ugurlu (2006) discovered that the Logit model provided a better accuracy level and overall fit than the MDA model. The same conclusion was reached in the study of Pindado et al. (2008). Recent studies by Lin et al. (2011, 2014) assert that the SVM model outperforms not only traditional models, but also other data mining models. Another paper, by Gepp and Kumar (2015), concluded that the DT model is a superior classifier compared with the Logistic Regression model.

From 2009, the State Securities Commission of Vietnam started to require a company to be delisted because of its financial distress. Specifically, a company is delisted if it incurs losses in 3 consecutive years or has an accumulated loss bigger than its equity. The number of delisted companies increased from 6 in 2010 to a peak of 31 in 2013 and decreased slightly to 27 in 2017. Although the delisting requirement can improve the quality of listed stocks, an increasing number of delisted companies may undermine investors' confidence in the market. Therefore, investors in Vietnam should be supported by a financial distress prediction model for stock selection, while a company's managers also need such a model to make the adjustments necessary to help the company avoid becoming financially distressed.

Research design
The main objective of the research is to construct financial distress prediction models that take into account the combined effects of feature selection and sampling methods for companies listed in the Vietnamese securities market. Two main steps are conducted to fulfill this objective. In the first step, a number of models with different sets of predictors, data and classifiers are designed. In the next step, a comparison is conducted to find the most effective model. The recognition of a firm's financial distress follows the finance-based definition, which emphasizes the independence of financial distress from its outcomes. A company is considered financially distressed when it is required to be delisted from the Hanoi or Ho Chi Minh City securities market because it suffers losses in 3 consecutive years or its accumulated loss rises above the company's equity.

Feature selection methods
In order to highlight the analysis results, the feature selection procedure is applied to 2 different sets of variables (variable set 1 and variable set 2) chosen from empirical analysis. Variable set 1 (see Table A1 in Appendix A) is taken mainly from the research of Geng et al. (2014), with an additional variable measuring the ownership structure of the company in Vietnam. Those predictors cover different features of an enterprise's financial performance, such as solvency, profitability and operational capacity. Variable set 2 (see Table A2 in Appendix A) originates from the paper of Lin et al. (2014), because it is the result of a comprehensive selection method that integrates expert recommendations and the wrapper approach.
In this paper, factor analysis and the F-score method are applied to reduce the number of predictors in order to increase the accuracy of the model. Factor analysis is performed to explore the "variables that seem to be doing the best job in predicting financial distress", i.e. the most informative variables in the model, with VARIMAX applied for rotation. Factor analysis is also used to detect multicollinearity among the predictors. The selection procedure is based on a number of criteria. First, Bartlett's test of sphericity should be significant to ensure the appropriateness of the independent indicators for factor analysis. Second, the most informative variables should have a factor loading above 0.5, an eigenvalue bigger than 1 and a communality greater than 0.8. The results of factor analysis are presented in Table 3 for models 1.1, 1.2 and 1.3. After conducting factor analysis, stepwise regression combined with binary logistic regression provides the significant ratios that act as predictors in the model.
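The retention criteria can be expressed as a small filter over the rotated factor solution. The sketch below is a simplified illustration with hypothetical variable names, loadings and communalities, not the paper's actual SPSS output:

```python
def select_by_factor_criteria(features, eigenvalues):
    """Apply the retention rules: keep a variable only if the factor it
    loads on has eigenvalue > 1, its loading exceeds 0.5 and its
    communality exceeds 0.8."""
    kept_factors = {i for i, ev in enumerate(eigenvalues) if ev > 1.0}
    return [name
            for name, (factor, loading, communality) in features.items()
            if factor in kept_factors
            and abs(loading) > 0.5
            and communality > 0.8]

# Hypothetical rotated solution: name -> (factor index, loading, communality)
eigenvalues = [3.2, 1.4, 0.7]
features = {
    "current_ratio":      (0, 0.82, 0.86),
    "roa":                (1, 0.74, 0.91),
    "debt_to_equity":     (2, 0.88, 0.93),  # dropped: factor eigenvalue < 1
    "inventory_turnover": (0, 0.41, 0.85),  # dropped: loading below 0.5
}
print(select_by_factor_criteria(features, eigenvalues))
# → ['current_ratio', 'roa']
```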
F-score is a simple filter selection method that can be used together with any SVM model. First introduced by Chen and Lin (2003), the F-score measures the discrimination between the two classes provided by a feature. For feature i, with n_+ positive (distressed) and n_- negative (non-distressed) instances, it is defined as:

F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2},

where \bar{x}_i, \bar{x}_i^{(+)} and \bar{x}_i^{(-)} are the averages of feature i over the whole, positive and negative data sets, respectively. Features with a higher F-score have higher discrimination ability and should be chosen for the model. The selection proceeds in 2 steps. In the first step, the F-score of every feature is calculated, a threshold is set, and the features with F-scores below the threshold are removed while those above it are retained. In the next step, the data are again split randomly into new training and testing data.
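The F-score computation can be sketched in a few lines of Python; the sample values below are illustrative only:

```python
from statistics import mean

def f_score(pos, neg):
    """F-score of one feature (Chen & Lin, 2003): between-class
    separation of the class means divided by within-class variance."""
    x_bar, x_pos, x_neg = mean(pos + neg), mean(pos), mean(neg)
    numerator = (x_pos - x_bar) ** 2 + (x_neg - x_bar) ** 2
    denominator = (sum((x - x_pos) ** 2 for x in pos) / (len(pos) - 1)
                   + sum((x - x_neg) ** 2 for x in neg) / (len(neg) - 1))
    return numerator / denominator

# Illustrative values of one ratio for distressed (pos) and healthy (neg) firms
well_separated = f_score([0.9, 1.0, 1.1], [0.1, 0.2, 0.0])
overlapping = f_score([0.5, 0.1, 0.9], [0.4, 0.8, 0.2])
assert well_separated > overlapping  # higher score = more discriminative
```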

Sampling techniques
The researchers use the choice-based sampling technique to choose the firms in the sample. In its basic form, the non-distressed firms are chosen randomly, with their number equal to the number of distressed firms. However, because of the concern about biased parameters produced by inconsistent distress rates between the sample and the population, three data samples are created with increasing numbers of non-financially distressed firms. Data sample 1 consists of 68 distressed firms and 68 non-distressed firms, data sample 2 includes 68 distressed firms and 136 non-distressed firms, while the number of non-distressed firms (204) is triple the number of distressed firms in data sample 3. The increase in the number of non-distressed firms, and thus in the data size, from data set 1 to data set 3 reduces the inconsistency between the distress rate of the sample and that of the population. By choosing these data samples, the authors expect to discover the relationship between the rate of distressed firms in the sample and the prediction model's accuracy.
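A minimal sketch of the choice-based sampling step, using hypothetical firm identifiers (the real study draws from the population of Vietnamese listed firms):

```python
import random

def choice_based_sample(distressed, non_distressed, ratio, seed=0):
    """Keep every distressed firm and randomly draw `ratio` times as
    many non-distressed firms (choice-based / stratified sampling)."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    return distressed + rng.sample(non_distressed, ratio * len(distressed))

# Hypothetical firm identifiers standing in for the listed companies
distressed = [f"D{i}" for i in range(68)]
healthy = [f"N{i}" for i in range(600)]

for ratio in (1, 2, 3):  # data samples 1, 2 and 3 of the study
    print(ratio, len(choice_based_sample(distressed, healthy, ratio)))
# sample sizes: 136, 204 and 272 firms, matching the three data sets
```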

Classification techniques
A classification technique is trained on the training data to construct the classifying function from the selected independent variables. The function is then applied to the testing data set to determine the model's accuracy. This study uses logistic regression and the machine learning algorithm SVM as classifiers.
The logistic regression computes the likelihood of being "financially distressed" for a listed firm. In the function below, the dependent variable Y takes the value of 1 or 0. The former describes the company's financial distress, while the latter denotes non-financial distress:

P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}}.

With the support of computer software such as SPSS, a company is classified as financially distressed when its estimated probability P(Y = 1) exceeds the 0.5 cut-off. The logistic regression model is chosen as it is a traditional classifier that exhibits a high level of prediction accuracy in the recent studies of Ugurlu (2006) and Pindado et al. (2008). In addition, the logistic regression model does not require the assumptions of normality and equal covariance of the independent variables.
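The classification rule can be sketched as follows; the coefficients are hypothetical, not those estimated in Tables 2 and 3:

```python
import math

def distress_probability(x, beta0, betas):
    """P(Y = 1 | x) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bn*xn)))."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two ratios, e.g. leverage and profitability
beta0, betas = -1.0, [2.5, -4.0]
p = distress_probability([0.8, 0.05], beta0, betas)  # z = 0.8, p ≈ 0.69
label = 1 if p >= 0.5 else 0  # classify at the conventional 0.5 cut-off
assert label == 1
```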
SVM, a type of machine learning classifier, establishes a hyperplane that separates two groups of companies according to their financial performance. Specifically, this classifier identifies the optimized hyperplane with the largest margin between the two groups. According to Hsu et al. (2016), in order to construct the optimized hyperplane, the parameters C and γ should be determined by grid search. In this study, the radial basis function (RBF) kernel incorporating C and γ is used. Once the two parameters are selected, the data training is performed again. With the support of the LibSVM tool, the data are separated into 2 sets for training and predicting before the classification techniques are conducted. As stated by Sánchez et al., SVM, a machine learning technique, should be applied in comparison with the logistic model, a traditional classifier.

There are 12 models constructed with combinations of the different data sampling techniques, feature selection and classification methods. According to Table 1, 6 models from 1.1 to 1.6 apply factor analysis to the 3 data sets with different sample sizes and use logistic regression as the classification technique. Meanwhile, 6 models from 2.1 to 2.6 apply F-score feature selection to data sets 1, 2 and 3 and use SVM as the classifier.
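The grid-search step can be sketched as below. The exponentially spaced grids follow common LibSVM practice; the scorer here is a stand-in, since a real run would return the cross-validation accuracy of an RBF-kernel SVM trained at each (C, γ) pair:

```python
from itertools import product

def grid_search(evaluate, C_grid, gamma_grid):
    """Return the (C, gamma) pair with the highest score reported by
    `evaluate` over an exhaustive grid."""
    return max(product(C_grid, gamma_grid), key=lambda cg: evaluate(*cg))

# Stand-in scorer: a real run would call LibSVM's cross-validation.
def fake_cv_accuracy(C, gamma):
    return -abs(C - 8) - abs(gamma - 0.125)  # peaks at C=8, gamma=0.125

C_grid = [2 ** k for k in range(-5, 16, 2)]      # 2^-5 ... 2^15
gamma_grid = [2 ** k for k in range(-15, 4, 2)]  # 2^-15 ... 2^3
print(grid_search(fake_cv_accuracy, C_grid, gamma_grid))
# → (8, 0.125)
```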

Factor analysis
The factor analysis is applied to the 3 data samples. In the first step, Bartlett's test and the Kaiser-Meyer-Olkin (KMO) measure are run in order to assess the overall significance of the correlation matrix and the sampling adequacy. If the value of KMO falls within the range from 0.5 to 1, factor analysis is considered appropriate for the data. In addition, Bartlett's test must be significant to confirm the correlation between the variables. In the next step, VARIMAX rotation is used to select the factors according to their eigenvalue, factor loading and communality. A suitable factor must have an eigenvalue bigger than 1, along with a factor loading and communality greater than 0.5 and 0.8, respectively. As shown in Table A3 and Table A4 in Appendix A, according to those criteria, the number of selected variables decreases quite dramatically in each variable set.
After conducting factor analysis, no multicollinearity is found among the predictors, and the redundant variables are removed. In the next step, a significance test of all informative variables is performed by running stepwise regression for the logistic regression models to ensure the validity and significance of the prediction model. The results of the stepwise regression procedures show that the independent variables are the same across the different models for a given prediction horizon. However, the coefficient of each predictor varies in each data sample. Tables 2 and 3 describe the coefficients of the significant predictors of the 6 models for 1 year, 2 years and 3 years before the distress event.
According to Table 2, the models constructed 1 year before the financial distress event emphasize the importance of the current liabilities to total assets ratio, the net profit to average current assets ratio and the net profit to average fixed assets ratio. Surprisingly, the negative sign of the coefficient of the total liabilities to total assets ratio raises concern about the parameters produced by model 1.1, as there should be a positive relationship between this ratio and the financial distress probability. The number of significant and discriminative independent variables increases from 9 in the 1-year prediction models to 11 in the 2-year prediction models, because the total liabilities to total shareholders' equity ratio and the net profit to sales revenue ratio are included. The signs of the variables' coefficients are consistent across models. The negative coefficient sign of the total liabilities/total assets ratio and the positive coefficient sign of the net assets/number of ordinary shares at the end of the year ratio threaten the application of model 1.

According to Table 3, the models in year 1 contain 8 significant variables, while 9 significant variables appear in the remaining models using variable set 2. In addition, the variables chosen are nearly the same for the 3 models at different prediction times. Most of the selected variables reflect the solvency, profitability, and asset development of the companies. However, while the coefficient sign of the total assets growth ratio is supposed to be positive, it is found to be negative in all 3 models for 1 year before the distress event.

F-score selection method results
In addition to factor analysis, the other filter-based feature selection method is applied for the SVM models by calculating the F-score of each variable. Using LIBSVM, the variables with F-scores bigger than 0.3, 0.04 and 0.03 for data sets 1, 2 and 3, respectively, are selected by the program. Table 7 presents the predictors chosen according to their F-scores for the 3 data sets.
As shown in Table 4 and Table 5, the results of the filter selection process show that the smallest number of selected variables is found in the model with the smallest sample size. However, there is not much difference in the ordering of the selected predictors by F-score among the three models. For example, the net profit to number of ordinary shares ratio receives the highest F-score in models 2.2 and 2.3 and the second highest F-score in model 2.1. Similarly, the acid test ratio obtains the highest F-score in all 3 models from 2.4 to 2.6. By predictor group, the variables selected by F-score mainly reflect the capital expansion capacity, profitability and operational capacity of a company.

Logistic regression model's performance
Using the different sets of predictors resulting from factor analysis and the stepwise regression procedure, the overall significance of the logistic models is tested. As shown in Table A5 and Table A6 in Appendix A, the models with different data sets are significant for all 3 years of prediction. Table 6 presents the level of accuracy of the logistic models 1-3 years prior to the financial distress event. In terms of the sample size, the level of accuracy increases from model 1.1 to model 1.3 using variable set 1 and from model 1.4 to model 1.6 using variable set 2. Therefore, the smaller the percentage of financially distressed firms in a model, the higher the model's accuracy. In terms of prediction time, there is a slight increase in the model's accuracy as the prediction horizon lengthens. In general, the accuracy rates of all models are quite high, with the highest level of 86% belonging to model 1.3, which makes a prediction 3 years in advance.

SVM model's performance
From the original number of variables, 5 predictors are chosen for models 2.1 to 2.6 according to the F-score of each variable, as shown in Table 4. The function used to build the hyperplane depends on the values of C and gamma, which are selected through a grid search (see Table A7 in Appendix A). Using the optimal hyperplane constructed with the chosen C and gamma, each SVM model classifies each firm in the testing data into the distressed group or the non-distressed group. The accuracy levels of the 6 SVM models at different prediction times are summarized in Table 8.
As can be seen, the model's accuracy level increases as the prediction horizon lengthens. The level of accuracy peaks at nearly 87% in model 2.3 for 3 years before the distress event. In terms of the data set used, the model's accuracy improves as the rate of distressed firms decreases.

CONCLUSION
The study is conducted with the intention of considering the combined effects of 3 factors: predictor choice, sampling technique, and classification technique on the performance of an ex-ante model for listed firms in the securities market of Vietnam. In terms of predictor selection, factor analysis is used in the logistic regression models, while the F-score method is used to select the predictors in the SVM models.
Regarding the sampling techniques, three data sets with different numbers of non-distressed firms are created. Each prediction model is constructed from 1 to 3 years before the firm receives the requirement of being delisted from the securities exchanges because of poor financial performance.
There are 12 models constructed from 2 different original variable sets: set 1 is taken from the study of Geng et al. (2014), while set 2 comes from the paper of Lin et al. (2014). Although factor analysis and F-score both reduce the number of variables in all models dramatically, the predictor choices made by these two approaches are quite different. From those selected variables, the performance of each model is evaluated and compared.