“Modeling and predicting earnings per share via regression tree approaches in banking sector: Middle East and North African countries case”

Investment Management and Financial Innovations, Volume 17, Issue 2, 2020. http://dx.doi.org/10.21511/imfi.17(2).2020.05

The regression tree approach is an effective and easy-to-interpret technique that uses a recursive binary partitioning algorithm to split the sample on the partitioning variables most strongly associated with the response variable. Earnings per share can be considered one of the main factors in making investment decisions. This study aims to build a predictive model for earnings per share in the context of the Middle East and North African (MENA) countries. The sample of the study consists of sixty-three banks chosen from eight countries, with a total of six hundred thirty observations. Linear regression, the regression tree and its pruned version, the conditional inference tree, and cubist regression are used to build a predictive model for earnings per share that depends on total assets, total liability, bank book value, stock volatility, age of the bank, and net cash. The results show that cubist regression outperforms the other approaches, reducing the root mean square error of the predictive model by roughly half compared with the other methods. More interesting results are obtained from the importance scores, which show that the bank's total assets, book value, and total liability have the biggest impact on the prediction of earnings per share. In addition, cubist regression improves R-squared over the other methods by at least 30% and 23% on the training and testing data, respectively.


INTRODUCTION
Although financial market analysis requires knowledge, perceptive insight, and experience, automated techniques have been used increasingly because of the availability of huge volumes of financial data. A growing body of work applies data mining, machine learning, and predictive models to business (Bose & Mahapatra, 2001; San Ong, Yichen, & The, 2010; Canhoto & Clear, 2020). Evaluating a firm's stock to buy or sell is a crucial decision for investors, especially with the availability of large data sets. This decision is hard to make without the help of modern models and without determining the best model, which influences investment decisions for a firm (McNichols, 2000; Goel & Gangolly, 2012; Onder & Altintas, 2017). Earnings per share (EPS) is considered an important profitability metric on financial statements for making investment decisions. It represents the returns delivered by the firm for each outstanding share of common stock.

1. LITERATURE REVIEW

In the finance and accounting literature, Ou and Penman (1989) studied a two-step process to predict the sign of earnings changes. They used a stepwise logit regression model to estimate the historical relationship between observed financial ratios and the sign of changes in future earnings. They obtained 78% accuracy for the sign of the changes in one-year-ahead earnings. In the out-of-sample prediction of the sign of one-year-ahead earnings changes, they obtained approximately 60% accuracy.
Lewellen (2004) used regression models to predict aggregate stock returns using financial ratios such as dividend yield. He found that predictive regressions are biased in small samples, but the correction used by previous studies tends to improve predictive power substantially. Bulgurcu (2012) used the TOPSIS technique to analyze the financial performance of technology firms registered on the Istanbul Stock Exchange. This study obtained performance scores by the TOPSIS method to examine and assess the firms in terms of ten financial ratios.
Zekic-Susac, Sarlija, and Bensic (2004) compared neural network, logistic regression, and decision tree models on a Croatian dataset to characterize important features for small business credit scoring. They showed that the neural network models fit the data better than the logistic regression and decision tree models. They concluded that the neural network model extracted entrepreneurs' personal characteristics, business characteristics, and credit program characteristics as important features.
Tsai and Wang (2009) used decision tree and artificial neural network models to predict stock prices on Taiwanese stock market data. They concluded that the F-score on the trained stock exchange data was 77% when the decision tree and artificial neural network were combined, while the F-score was about 67% using a single algorithm.
Gepp, Kumar, and Bhattacharya (2010) studied discriminant, logit, and decision tree models to obtain accurate business failure prediction models in the financial investment and lending sectors. In terms of predicting the failure or success of a business, they concluded that the decision tree model could surpass the logit and discriminant models.
Döpke, Fritsche, and Pierdzioch (2017) studied the usefulness of selected financial leading indicators for forecasting recessions using a boosted regression tree method. Their results showed that the short-term interest rate and the term spread are the most important indicators. Boosted regression trees helped them trace how the recession probability depends on the individual leading indicators. The term spread and the stock market gained importance, while the predictive power of the short-term interest rate declined.
Lin, Lu, Lin, and Lu (2017) applied a cubist regression tree model to data from Taiwanese companies to explain when and why auditors compromise their independence. They showed a positive relationship between auditor dependence and important clients when clients reported net losses in the current year. They also concluded that, although the clients reported net losses in their financial statements, the auditors permitted more important clients to manage their discretionary accruals slightly upward.
Affes and Hentati-Kaffel (2019) studied bankruptcy forecasting using multivariate adaptive regression splines (MARS), classification and regression trees (CART), and hybrid models on US banks' data over a complete market cycle. They concluded that MARS provided better results than CART in terms of correct classification, that the hybrid method increased the correct classification rate in the training sample, and that, in general, the nonparametric models (MARS, CART, hybrid) gave better results for bank failure forecasting than the logit model. Carmona, Climent, and Momparler (2019) used extreme gradient boosting to forecast bank failure in the US banking sector. The data consisted of an annual series of 30 financial ratios for 156 national commercial banks from 2001 to 2015. They indicated that retained earnings, pre-tax return on assets, and the total risk-based capital ratio are related to a higher risk of bank failure. Bank financial distress is increased by an exceedingly high yield on earning assets.
Bellotti, Brigo, Gambetti, and Vrins (2019) applied many regression and machine learning techniques to a database from a European debt collection agency to predict recovery rates on non-performing loans. They found that the cubist regression, boosted tree, and random forest methods performed better than the other approaches.
Chu, He, Hui, and Lehavy (2020) examined the managerial disclosure of new products within the setting of the tension between disclosure and managerial incentives. They developed a dictionary-based innovation disclosure measure obtained from the narratives in new product announcements. They found a significant positive relationship between investor response and the degree of innovation disclosed in new product announcements, and that performance can be predicted up to two years ahead from the degree of innovation disclosure. Performance predictability is affected by managerial disclosure incentives.
Numerous earlier studies (Altman, Sabato, & Wilson, 2010; Altman, Iwanicz-Drozdowska, Laitinen, & Suvas, 2017; Appiah, Chizema, & Arthur, 2015) give evidence that firm size plays an important role in several choices within the firm and can affect the productivity of the firm. Dias and Matias-Fonseca (2010) utilized 31 financial ratios, including liability ratios, to forecast corporate performance.
Different financial ratios are used in building predictive models for different outcomes, such as corporate failure, bankruptcy, financial disasters, and the financial performance of firms. Appiah and Abor (2009) utilized many financial ratios, such as liability, liquidity, and profitability ratios, to construct their model. In Jordan, Alkhatib and Al-Horani (2012) utilized a set of 24 financial ratios to anticipate the financial distress of a sample of listed companies. Kloptchenko, Eklund, Back, Karlsson, Vanharanta, and Visa (2002) utilized 7 ratios to forecast the financial performance of the firm. Balakrishnan, Qiu, and Srinivasan (2010) utilized firm size, the market-to-book ratio, and related ratios in their model. This study differs from previous studies in many respects. It can be considered one of the few studies dealing with applications of regression tree approaches in MENA countries. It responds to the high demand from investors and financial analysts in financial markets, especially in MENA countries, for expectations about the financial performance of firms, and it contributes to the literature on building predictive models in MENA countries.
The main aim of this study is to build a model to predict earnings per share (EPS) based on the logarithm of bank total assets (logTOTA), total liabilities to total assets (LIAB), bank book value to its market value (BOKV), stock volatility with respect to the market (SVOL), age of the bank (AGEB), and net cash of the bank (NCSH) using linear regression (LR), classification and regression trees (CART), conditional inference trees (CIT), and cubist regression trees (CRT).

Data collection and study variables
The sample of the study consists of sixty-three banks chosen from eight MENA countries, with a total of 630 observations; the data are split into training data, used to build the predictive model, and testing data, used to validate it. These countries are selected because of the homogeneity among them in terms of culture, conventions, and financial conditions. The financial data are collected from the websites of the banks listed on each bourse. According to the aim of this study and the previous arguments, seven variables are considered: earnings per share (EPS) as the dependent variable, or measure of profitability, and six independent variables, namely, total assets (TOTA), total liabilities to total assets (LIAB), bank book value to its market value (BOKV), stock volatility with respect to the market (SVOL), age of the bank (AGEB), and net cash of the bank (NCSH).

Predictive models
The linear regression, regression tree, conditional inference tree, and cubist regression are discussed briefly.

Regression tree
Decision trees are nonparametric predictive modeling approaches applied to classification and regression problems, termed classification and regression tree (CART) analysis, first studied by Breiman, Friedman, Olshen, and Stone (1984). The regression tree uses branches to move from feature values to conclusions about the target variable (the leaves) through a set of if-then rules. The splitting points identify non-overlapping regions with the most homogeneous responses of the target variable, and in each region a simple model (such as the average) is fitted.
The splitting continues, deepening the tree, until a stopping criterion is reached. For prediction, new data are passed down the trained split points (Breiman, 1996, 2001; Geurts, Ernst, & Wehenkel, 2006).
In the case of regression, the model starts with all observations, D, and searches each value of each independent variable to locate the variable and split value that divide the observations into two groups, say $D_1$ and $D_2$, such that the sum of squared errors is minimized:

$$SSE = \sum_{i \in D_1} (y_i - \bar{y}_1)^2 + \sum_{i \in D_2} (y_i - \bar{y}_2)^2,$$

where $\bar{y}_1$ and $\bar{y}_2$ are the average responses within groups $D_1$ and $D_2$.
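The split search described above can be sketched in a few lines. This is an illustrative Python analogue, not the authors' R code: every candidate cut point of a single predictor is evaluated, and the cut that minimizes the combined sum of squared errors of the two groups is kept. The toy data are made up for demonstration.

```python
# Exhaustive single-variable split search minimizing SSE, as in CART regression.

def sse(values):
    """Sum of squared deviations from the group mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (cut_point, total_sse) for the best binary split of x."""
    best = (None, float("inf"))
    for cut in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        if not left or not right:
            continue  # a split must leave both groups non-empty
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (cut, total)
    return best

# toy data: the response jumps when x crosses from 4.3 to 4.6
x = [4.0, 4.1, 4.2, 4.3, 4.6, 4.7, 4.8, 4.9]
y = [0.1, 0.2, 0.1, 0.2, 1.5, 1.6, 1.4, 1.5]
cut, total = best_split(x, y)
print(cut)  # 4.6
```

In a full tree, the same search would be repeated over every predictor and then recursively within each resulting group.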

Conditional inference tree
This approach was introduced by Hothorn, Hornik, and Zeileis (2006) to overcome the variable selection bias in the basic regression tree. It separates variable selection from splitting: the splitting variable is chosen by statistical significance tests of the association between each predictor and the response, and splitting stops when no significant association remains.
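The selection idea can be sketched as follows. This is a simplified illustration, not the Hothorn et al. framework itself: a permutation test on the Pearson correlation stands in for their conditional inference tests, and the variables `x1`, `x2` are made up, not the study's predictors.

```python
# Choose the splitting variable by significance of association, not by SSE.
import random

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def perm_pvalue(x, y, n_perm=2000, seed=1):
    """Permutation p-value for the absolute correlation between x and y."""
    random.seed(seed)
    observed = abs(pearson(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        random.shuffle(y_shuffled)
        if abs(pearson(x, y_shuffled)) >= observed:
            hits += 1
    return hits / n_perm

# y depends on x1 but not on x2, so x1 should be selected for splitting
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [5, 1, 4, 2, 8, 3, 7, 6]
y  = [1.1, 1.9, 3.2, 3.9, 5.1, 6.2, 6.8, 8.1]
pvals = {"x1": perm_pvalue(x1, y), "x2": perm_pvalue(x2, y)}
split_var = min(pvals, key=pvals.get)
print(split_var)  # x1
```

Basing the choice on test p-values rather than raw error reduction is what removes the bias toward predictors with many possible split points.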

Cubist regression
Cubist regression is a rule-based regression model described by Quinlan (1992). It combines a tree structure with linear regression models fitted at the terminal nodes.
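A minimal sketch of the model-tree idea behind cubist, in the spirit of Quinlan's M5 but not his actual algorithm: one split partitions the data, a separate linear model is fitted in each partition, and prediction follows the rule "if x < cut, use the left model, else the right model". The data and the cut point are illustrative.

```python
# Model tree with one split and a linear model in each leaf.

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def model_tree_predict(x_new, cut, left_model, right_model):
    """Route x_new to a leaf, then apply that leaf's linear model."""
    a, b = left_model if x_new < cut else right_model
    return a + b * x_new

# two regimes with different slopes around a cut at x = 5
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [2, 4, 6, 8, 30, 35, 40, 45]  # slope 2 below the cut, slope 5 above
cut = 5
left = fit_line(xs[:4], ys[:4])    # -> (0.0, 2.0)
right = fit_line(xs[4:], ys[4:])   # -> (0.0, 5.0)
print(model_tree_predict(3, cut, left, right))  # 6.0
print(model_tree_predict(8, cut, left, right))  # 40.0
```

Cubist additionally smooths the leaf models along the path to the root, builds committees of such rule sets, and can adjust predictions using nearest neighbors; none of that is shown here.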

DATA ANALYSIS AND RESULTS
The general EPS model can be written as

$$EPS_i = f(\mathrm{logTOTA}_i, \mathrm{LIAB}_i, \mathrm{BOKV}_i, \mathrm{SVOL}_i, \mathrm{AGEB}_i, \mathrm{NCSH}_i) + \varepsilon_i.$$

Different performance metrics are used to evaluate the model on the training and testing data: root mean square error (RMSE), the coefficient of determination (R-squared), and mean absolute error (MAE). For RMSE and MAE, lower values indicate better performance; for R-squared, higher values indicate better performance. Variable importance is a measure of the decrease in squared error, where the improvement in squared error due to each independent variable is accumulated within every tree. The improvement values for each independent variable are then averaged over the whole ensemble to produce an aggregate importance value (Friedman, 2002; Ridgeway, 2007). The important variables that contribute to the prediction of EPS are obtained for each method to reflect the rank, or importance, of the independent variables.
All the analysis in this study is done using the R software and the CARET package (Kuhn, 2008; R Core Team, 2017).
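The three metrics above are standard and easy to compute from scratch. The following Python sketch (the toy numbers are illustrative, not the study's data) shows their definitions:

```python
# RMSE, MAE, and R-squared computed from their definitions.
import math

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(rmse(actual, predicted), 4))
print(round(mae(actual, predicted), 4))
print(round(r_squared(actual, predicted), 4))
```

Note that RMSE penalizes large errors more heavily than MAE, which is why the two rankings of models can, in principle, differ.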

Descriptive analysis
The descriptive statistics for the variables of the study are displayed in Table 1. It can be noted that the standard deviation (Sd) is high for the NCSH variable, which indicates high variability among banks with respect to this variable. The mean and median are almost equal for the logTOTA variable. The measures of skewness and kurtosis for the variables EPS, logTOTA, SVOL, AGEB, and NCSH are far from 0 and 3, respectively, which indicates that the distributions of these variables are mostly non-symmetric, while the measures for the variables LIAB and BOKV are near 0 and 3, respectively, which indicates that their distributions are nearly symmetric.
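The symmetry check used above can be reproduced with moment-based skewness and (non-excess) kurtosis, which for a symmetric, mesokurtic variable are near 0 and 3, respectively. The data below are simulated for illustration, not the study's bank data.

```python
# Moment-based skewness and kurtosis; symmetric data -> (about 0, about 3).
import random

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 3 for x in xs) / n / s2 ** 1.5

def kurtosis(xs):
    """Non-excess kurtosis (normal distribution gives about 3)."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 4 for x in xs) / n / s2 ** 2

random.seed(0)
normal_like = [random.gauss(0, 1) for _ in range(10000)]
skewed = [x * x for x in normal_like]  # chi-square-like, strongly right-skewed
print(round(skewness(normal_like), 2), round(kurtosis(normal_like), 2))
print(round(skewness(skewed), 2), round(kurtosis(skewed), 2))
```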
Figure 1 shows the correlations and significance of the study variables. It can be noted that EPS has a significant correlation with logTOTA, LIAB, and AGEB, while it has no significant correlation with BOKV, SVOL, and NCSH. The highest correlation is between EPS and logTOTA, and the lowest correlation is between AGEB and NCSH.

Linear regression
The results of the linear regression analysis are given in Table 2. Since the p-value for the F-statistic is near zero, the model is significant. From the p-value column, it can be seen that the variables logTOTA and LIAB are significant at the 0.001 and 0.10 levels of significance, respectively.
The results of the linear regression performance metrics are given in Table 3.

Basic regression tree approach
The basic regression tree for the EPS model, with 13 terminal nodes, is shown in Figure 3. The first decision node in Figure 3 is logTOTA, the variable most strongly associated with EPS. The results of the pruned regression tree performance metrics are given in Table 5. Figure 6 and Table 5 show the pruned regression tree variable importance for the EPS model. BOKV, logTOTA, and AGEB rise to the top of the importance metric, and the importance scores recede with NCSH and LIAB. Note that the variable SVOL has no importance score. Consequently, BOKV, logTOTA, and AGEB have the biggest impact on EPS.

Conditional inference tree
Conditional inference tree is shown in Figure 7.
The decision nodes are presented as circles, with a number in each circle. In each circle, the independent variable is split in two, with the p-value of the dependence test. The first decision node in Figure 7 is logTOTA, the variable most strongly associated with EPS (p < 0.001). The left branch shows the best cut-off value, greater than or equal to 0.45, which is the best value to reduce the root mean square error, and gives a predicted EPS of about 2.437, using 157 values. The decision node is then divided by the variable logTOTA, which is still strongly associated with EPS (p < 0.001): the left branch takes cut-off values less than or equal to 4.22, and the right branch gives a predicted EPS of 1.066, using 67 observations. Next, the decision node is divided by the variable LIAB, which is still strongly associated with EPS (p < 0.001), with a right branch at cut-off value 3.32, while the left branch gives a predicted EPS of 1.162, using 9 observations. This process continues until the terminal nodes are obtained. The results are displayed as boxplots because the response variable EPS is continuous. The results of the conditional inference tree performance metrics are given in Table 6.

Cubist regression
Cubist regression is an ensemble model that predicts using the linear regression models at the terminal nodes of the tree. The tree is reduced to a set of rules, which are paths from the top of the tree to the bottom. Rules are eliminated through pruning or combined for simplification.
Table 7 shows the resampling results across tuning parameters for 566 samples and 6 predictors. The optimal model is chosen based on the smallest value of RMSE; committees = 20 and neighbors = 5 are the final values used for the model. For the testing data, one can note that the best model in terms of R-squared is cubist regression (54%), followed by linear regression (31%), the conditional inference tree (28%), the regression tree (25%), and the pruned regression tree (22%). The models can be ranked similarly in terms of MAE.
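The tuning rule stated above (keep the parameter pair with the smallest resampled RMSE) can be sketched as below. The RMSE values in this dictionary are made-up placeholders, not the contents of Table 7; only the selection logic is illustrated.

```python
# Select (committees, neighbors) by the smallest resampled RMSE.
resampling_rmse = {
    (1, 0): 0.92,  (1, 5): 0.88,
    (10, 0): 0.71, (10, 5): 0.64,
    (20, 0): 0.58, (20, 5): 0.41,
}
committees, neighbors = min(resampling_rmse, key=resampling_rmse.get)
print(committees, neighbors)  # 20 5
```

With the placeholder values chosen here, the rule picks committees = 20 and neighbors = 5, mirroring the combination the study reports as optimal.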
From Figure 9, the importance scores show the impact of the independent variables on EPS. For cubist regression, logTOTA, BOKV, and LIAB have the biggest impact on EPS, followed by AGEB and SVOL, with no importance score for NCSH. From the above results, one can conclude that each method has strong points on one side and weak points on another. Therefore, multi-dimensional data need different approaches to be modeled. Since the results of this study are limited to regression approaches, namely, LR, CART, CIT, and CRT, future work may compare these results with the results of different approaches such as neural networks.

CONCLUSION
Motivated by the importance of making investment decisions, predictive models are built based on machine learning approaches, namely, linear regression, the regression tree, the pruned regression tree, the conditional inference tree, and cubist regression, to help in making such decisions. These models were carried out on data from eight countries in the MENA region, where sixty-three banks were selected, with a total of 630 observations. The sampled data are divided into training data (from 2009 to 2017) to build the predictive model and testing data (2018) to validate the model.
Root mean square error, R-squared, and mean absolute error are used to assess the performance of the models. The results show that cubist regression outperforms the other methods in terms of all three measures: RMSE, R-squared, and MAE. R-squared on the training data for cubist regression is about 96%, while for the second-best method, the basic regression tree, it is about 66%; this gives at least a 30% (96% − 66%) improvement over the other methods. The root mean square error of cubist regression on the training data is 0.375, while it is 0.978 for the basic regression tree; this gives an improvement in root mean square error of at least 0.603 (0.978 − 0.375) over the other methods. R-squared on the testing data for cubist regression is about 54%, while for the second-best approach, linear regression, it is about 31%; this gives at least a 23% improvement over the other methods. Importance scores are used to identify which variables have the biggest impact in predicting earnings per share.
In terms of the best results, cubist regression has shown that total assets, bank book value, and total liability have the biggest impact on predicting earnings per share. Because each approach has its strengths and weaknesses, multi-dimensional data should be approached with different techniques. Because this study identified a set of important variables, it may help banks' managers in increasing the stability of the financial market. For example, the LIAB variable could give a good indication of how financially solid a bank is. This study can be extended when financial and non-financial data, such as tone, are available. It can also be extended to include data from other types of business, such as service companies.

Classical linear regression,

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i,$$

aims to minimize the sum of squared errors between the actual values, $y_i$, and the estimated values, $\hat{y}_i$, where the $\beta_j$ are the parameters and the $\varepsilon_i$ are the errors (Kuhn & Johnson, 2013).

Figure 1. The correlation matrix, histograms, and scatter plots for the study variables

Figure 6. Pruned regression tree variable importance scores for EPS model
Figure 7. Conditional inference tree for EPS model

A model tree is generated from the training data, and the linear models are estimated and smoothed. The model tree is reduced to rules, and pruning is applied.
The predictive model that predicts EPS is built based on the data from 2009 to 2017 (training data) as year t and is called the training model. The data for 2018, as year t + 1, are used as testing data to validate the training model. In other words, the training model is built based on 9 years of data (2009-2017) to predict EPS in 2018. The training data from 2009 to 2017 consist of 566 observations, while the testing data for 2018 consist of 64 observations.
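The temporal split described above is straightforward to express in code. This is an illustrative Python sketch; the records below are made-up stand-ins for the bank-year observations, not the study's data.

```python
# Split bank-year records by year: 2009-2017 -> training, 2018 -> testing.
records = [
    {"bank": "A", "year": 2009, "EPS": 0.8},
    {"bank": "A", "year": 2017, "EPS": 1.1},
    {"bank": "A", "year": 2018, "EPS": 1.2},
    {"bank": "B", "year": 2016, "EPS": 0.4},
    {"bank": "B", "year": 2018, "EPS": 0.5},
]
train = [r for r in records if 2009 <= r["year"] <= 2017]
test = [r for r in records if r["year"] == 2018]
print(len(train), len(test))  # 3 2
```

Splitting by time, rather than at random, ensures the model is validated on genuinely unseen future observations.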

Table 1. Descriptive statistics for the study variables

Table 2. Linear regression analysis for EPS model

Table 4. Basic regression tree variable importance scores and performance metrics for EPS model

The first decision node in Figure 3 is logTOTA, the variable most strongly associated with EPS. The left and right branches show that the cut-off value of 4.5 is the best value to reduce the root mean square error. Then, the decision nodes are divided by the variable BOKV at cut-off value 0.98 and by logTOTA at cut-off value 4.2, which are the best values to reduce the root mean square error. This process continues until the terminal nodes are obtained. Each terminal node contains two values; the bottom one is the percentage of the data in that node used to compute the average of EPS (the predicted value). For example, the training data have 566 values, and the predicted value 0.14 in the left node is the average of about 340 values (0.60⋅566) that fall in this branch. In other words, if logTOTA is less than 4.5 and then less than 4.2, the predicted EPS is 0.14. As another example, if logTOTA is greater than or equal to 4.5, BOKV is greater than or equal to 0.98, and SVOL is greater than or equal to 0.84, the predicted EPS is 5.3, using the average of about 11 EPS values in this branch (0.02⋅566). Consequently, BOKV, AGEB, and LIAB have the biggest impact on EPS.

Figure 5 shows the pruned regression tree for the EPS model. The number of tree nodes is 7, which is fewer than in the basic regression tree. The first decision node in Figure 5 is again logTOTA, the variable most strongly associated with EPS, with the same cut-off values: 4.5 at the root, then BOKV at 0.98 and logTOTA at 4.2, chosen to reduce the root mean square error. The process continues until the terminal nodes are obtained, and each terminal node again contains the predicted EPS and the percentage of the data in that node. For example, the training data have 566 values, and
the predicted value 0.14 in the left node is the average of about 340 values (0.60⋅566) that fall in this branch. In other words, if logTOTA is less than 4.5 and then less than 4.2, the predicted EPS is 0.14. As another example, if logTOTA is greater than or equal to 4.5 and BOKV is less than 0.98, the predicted EPS is 1.5, using the average of about 45 (0.08⋅566) EPS values in this branch.
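The if-then reading of the tree can be written out directly as code. The cut-off values (4.5, 4.2, 0.98, 0.84) and leaf predictions (0.14, 1.5, 5.3) below are taken from the examples in the text; branches the text does not describe return None in this sketch.

```python
# If-then rules transcribed from the described regression tree branches.
def predict_eps(logTOTA, BOKV=None, SVOL=None):
    if logTOTA < 4.5:
        if logTOTA < 4.2:
            return 0.14
        return None  # branch not described in the text
    if BOKV is not None and BOKV < 0.98:
        return 1.5
    if BOKV is not None and SVOL is not None and SVOL >= 0.84:
        return 5.3
    return None  # branch not described in the text

print(predict_eps(4.0))                      # 0.14
print(predict_eps(5.0, BOKV=0.5))            # 1.5
print(predict_eps(5.0, BOKV=1.2, SVOL=0.9))  # 5.3
```

This rule form is exactly what makes regression trees easy to interpret: each prediction can be traced to a short, explicit chain of conditions.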
Figure 4 and Table 4 show the basic regression tree variable importance for the EPS model. BOKV and AGEB rise to the top of the importance metrics, and the importance scores recede with LIAB, logTOTA, and NCSH. Note that the variable SVOL has no importance score. Consequently, BOKV, AGEB, and LIAB have the biggest impact on EPS.

Figure 3. Basic regression tree for EPS model

Table 5. Pruned regression tree variable importance scores and performance metrics for EPS model

The RMSE is 1.058 for the training data, while it rises to 2.263 for the testing data. R-squared is about 60.2% for the training data and falls to about 21.7% for the testing data. MAE is about 0.535 for the training data and rises to 1.283 for the testing data.

Table 6. Conditional inference tree variable importance scores and performance metrics for EPS model

Table 7. Cubist resampling results across tuning parameters for 566 samples and 6 predictors

Table 8. Cubist regression approach variable importance scores and performance metrics for EPS model

Figure 9 and Table 8 show the cubist regression variable importance for the EPS model. logTOTA, BOKV, and LIAB rise to the top of the important variables, and the importance scores recede with AGEB and SVOL. Note that the variable NCSH has no importance score. Consequently, logTOTA, BOKV, and LIAB have the biggest impact on EPS.