Portfolio creation using graph characteristics

The aim of this work is to combine graph theory and Markowitz portfolio theory to illustrate how some graph characteristics are related to the diversification potential of individual portfolio-forming stocks. Using one graph characteristic, the vertex eccentricity, individual stocks are divided into two groups: a group of large eccentricity and a group of small eccentricity. Eccentricity in this context is considered to be a very suitable metric of the centrality of individual vertices. Different price histories (5 to 30 years) of the stocks forming the Standard and Poor’s index are analyzed. Using simulation analysis, samples of the two groups are generated and then compared to show that larger eccentricity samples, representing stocks on the periphery of the minimum spanning tree of the graph, have a higher potential for diversification than those found in the center of the graph. The results published in the article can be a practical guide for an individual investor during the portfolio creation process and help him/her with decision-making about stock selection.
Jakub Danko (Slovak Republic), Vincent Soltes (Slovak Republic)
BUSINESS PERSPECTIVES LLC “СPС “Business Perspectives” Hryhorii Skovoroda lane, 10, Sumy, 40022, Ukraine www.businessperspectives.org
Received on: 17th of October, 2017
Accepted on: 9th of February, 2018
© Jakub Danko, Vincent Soltes, 2018
Jakub Danko, Ing., Department of Finance, Faculty of Economics, Technical University in Košice, Slovak Republic.
Vincent Soltes, Professor, Department of Finance, Faculty of Economics, Technical University in Košice, Slovak Republic.
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International license, which permits re-use, distribution, and reproduction, provided the materials aren’t used for commercial purposes and the original work is properly cited.
Keywords: portfolio, risk, minimum spanning tree, eccentricity, Markowitz portfolio optimization


INTRODUCTION
The goal of the investor is to create a portfolio that has a high expected return and a low level of risk. The first problem the investor needs to solve is the selection of stocks for the portfolio; he/she then has to determine the weights of the individual stocks.
In this paper, it is proposed to use a graph characteristic known as the vertex eccentricity for the stock selection process. Every vertex of the graph represents an individual stock that can become part of a portfolio. The eccentricity of a graph vertex can identify the so-called centrality of stocks and, as shown in the paper, this measure is very effective in identifying the stocks that are more suitable for the risk diversification the investor requires. Harry Markowitz (1952, 1959) and William Forsyth Sharpe (1992) did the pioneering work in modern portfolio theory. They studied the effect of asset risk, return, correlation, and diversification on expected investment portfolio returns. Methods of Markowitz portfolio optimization are used in this paper.

LITERATURE REVIEW
The approach to portfolio optimization used in this article is based on network theory. Network theory is a tool and methodological approach that can be used to explain and better understand financial markets. In financial markets, it is mostly the mutual development of particular financial instruments (stocks, shares, indices, etc.) that is observed. Since the correlation coefficient is the most widely used dependency measure of this mutual development, the main area of interest is the study of correlation-based networks. These networks can be used to reduce the complexity of financial dependencies and to understand and forecast the dynamics of financial markets. This section introduces some methodological approaches that are frequently used in the field of correlation-based networks.
The first publication on correlation-based networks using graph theory tools was written by Professor Rosario Mantegna (1999). The author introduced the method of the minimum spanning tree for analyzing financial markets and described the main advantages of this methodology. By constructing this subgraph (the minimum spanning tree), the author finds that US stocks are grouped according to their industry sector. This means that the price of a stock includes not only information about the current and past financial situation of the company, but also information about market structure and topology. Onnela et al. (2003a) introduce a new network type: the dynamic asset graph. The methodology is similar to the earlier work on a static time period, but the analysis differs. For analysis and smoothing purposes, the authors divided the data into M windows of width T, where T corresponds to the number of daily returns included in the window, which made their analysis dynamic.
The same authors (2003b) analyzed financial markets from the perspective of portfolio creation. Using a combination of the vertex degree and the concept of the weighted portfolio layer, they showed that the assets with the highest diversification potential are located on the periphery of the minimum spanning tree. This finding forms the core of our analysis. Bonanno et al. (2004) consider how the correlations between returns of market-traded stocks are affected by varying the time horizon used to compute the correlation coefficients. They find that the graph structure progressively changes from a complex organization to a simple form as the time horizon decreases.
Another way to define the structure of the financial market is to use the planar maximally filtered graph. This method is computationally more demanding, but it provides more information about the market structure. This approach was used by Tumminello et al. (2005, 2010) and Kenett et al. (2010). Other authors dealing with correlation-based networks are, for example, Mizuno (2005), Naylor (2006) and Miskiewicz (2012).
There are, of course, other approaches to the analysis of financial markets, for example international portfolio diversification, used by Bailey et al. (1990) and Abidin et al. (2004).
For an institutional investor, it is also important to monitor the so-called credit risk. An interesting approach to assessing credit risk using a correction to the KMV Black and Scholes model is introduced by Iazzolino and Fortino (2012). Risk from the perspective of the company, together with a comparison of real economy data with financially efficient working capital decisions and predictions, is analyzed by Michalski (2016). Šoltés (2003, 2012) gives a theoretical introduction to quantifying return and risk in the case of two- and three-asset portfolios.

METHODS
The same database is used as in Šoltés, Danko (2017). The Standard and Poor's 500 Index (from now on referred to as the S&P 500) serves as the basis for the stock analysis. The S&P 500 is a market-value-weighted index of 500 stocks, seen as a leading indicator of U.S. equities and one of the standard benchmarks for the U.S. stock market. It is widely regarded as the most accurate gauge of the performance of large cap U.S. equities. The individual stocks forming the index at the date of compilation of the database (March 2016) were divided into six groups according to their price history (5-year history, 10-year history, ..., 30-year history). It is assumed that a business year has on average 240 business days (approximately 20 business days in a month). The data come from finance.yahoo.com, where they are available from January 2, 1962.
As can be seen from Table 1, only 197 companies forming the index have a price history of 30 years or more (at the day of database creation). Stocks that do not have a price history of at least 1200 days have not been included in the analysis. As the required price history grows, the number of companies that meet it decreases. Stocks found in the 30-year price history are certainly also found, for example, in the 5-year price history; the converse does not necessarily hold.
When creating a portfolio and analyzing its elements, the issue of so-called survivorship bias arises. In finance and investment, survivorship bias is the tendency for failed or new companies to be excluded from performance studies because they no longer exist or have existed only for a short time period. It often causes the results of studies to skew higher because only companies that were successful enough to survive until the end of the period are included. In our case, survivorship bias is not relevant because the aim of the investor is to create a portfolio, so only stocks of existing companies whose price history is long enough are relevant.
The analysis was performed using the statistical programming language R. The igraph library was used for working with graphs, and the psych library for computing descriptive statistics.

CORRELATION-BASED ANALYSIS
For each price history, a data matrix is created. The rows of the matrix represent individual observations and the columns represent the individual stocks belonging to a particular price history. The value in the i-th row and j-th column represents the adjusted close price of stock j at time i. It is clear from Table 1 that, for example, for the 5-year price history this matrix has a dimension of 1200 × 478, and similarly for the other price histories. The adjusted close price was selected because it is adjusted for the effect of dividends and stock splits.
Furthermore, for each data matrix, the returns matrix is calculated according to formula 1. Because the oldest observation of the original data matrix does not have a predecessor, we lose one observation. It is clear from Table 1 that, for example, for the 5-year price history the returns matrix has a dimension of 1199 × 478, and similarly for the other price histories.
In formula 1, $P_i$ is the adjusted close price at time $i$, $P_{i-1}$ is the adjusted close price on the previous business day (at time $i-1$) and $r_i$ represents the daily return at time $i$.
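Formula 1 itself is not legible in this text, so the following Python sketch shows the two common definitions of a daily return, simple and logarithmic; which of them the authors used is left open here. The function names and the sample prices are hypothetical. In both cases a series of n prices yields n − 1 returns, which is why a 1200-row price matrix gives a 1199-row returns matrix.

```python
import math

def simple_returns(prices):
    """Simple daily returns: (P_i - P_{i-1}) / P_{i-1}."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]

def log_returns(prices):
    """Logarithmic daily returns: ln(P_i / P_{i-1})."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]

# Hypothetical adjusted close prices of one stock on four business days.
prices = [100.0, 102.0, 101.0, 103.5]
print(simple_returns(prices))
print(log_returns(prices))
```

For small daily price changes the two definitions give nearly identical values, so the choice rarely changes the resulting correlation structure much.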
The aim is to compute the mutual relationship of each pair of stocks. For this purpose, the correlation matrix for each price history is calculated using the Pearson correlation coefficient given by formula 2:
$$\rho_{i,j} = \frac{\mathrm{cov}_{i,j}}{\sigma_i \sigma_j}$$
where $\rho_{i,j}$ is the correlation coefficient between the returns of stocks $i$ and $j$, $\mathrm{cov}_{i,j}$ is the covariance between these returns and $\sigma_i$ is the standard deviation of the return of stock $i$.
The correlation matrices were converted to distance matrices using the ultra-metric:
$$d_{i,j} = \sqrt{2\,(1 - \rho_{i,j})}$$
where $d_{i,j}$ is the distance between stocks $i$ and $j$. For further details on the ultra-metric, see Mantegna (1999). As can be seen from Figure 1, there is a nonlinear negative dependence between distance and correlation.
It is clear from Figure 1 that a pair of uncorrelated stocks (correlation coefficient equal to zero) has a distance equal to the square root of two (see the black square). Therefore, the more similar the development of stock returns as measured by the correlation coefficient, the closer the stocks are, and vice versa.
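The two steps of this section can be sketched in Python as follows. This is a minimal illustration, assuming the Pearson coefficient and the Mantegna (1999) distance, the square root of 2(1 − ρ); the function names and the toy return series are hypothetical and do not come from the paper's R code.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long return series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # The 1/(n-1) factors of covariance and variances cancel out.
    return cov / math.sqrt(var_x * var_y)

def distance(rho):
    """Distance sqrt(2 * (1 - rho)); clamped to guard against rounding noise."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - rho)))

def distance_matrix(returns_by_stock):
    """Map a dict {stock: returns} to a nested dict of pairwise distances."""
    names = list(returns_by_stock)
    return {
        a: {
            b: 0.0 if a == b
            else distance(pearson(returns_by_stock[a], returns_by_stock[b]))
            for b in names
        }
        for a in names
    }

# Toy data: three hypothetical stocks with four daily returns each.
returns = {
    "AAA": [0.01, -0.02, 0.015, 0.00],
    "BBB": [0.012, -0.018, 0.01, 0.002],
    "CCC": [-0.01, 0.02, -0.005, 0.001],
}
print(distance_matrix(returns))
```

The distance ranges from 0 for perfectly correlated stocks, through the square root of two for uncorrelated stocks, to 2 for perfectly negatively correlated stocks.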

GRAPH THEORY APPROACH
The distance matrix, together with tools of discrete mathematics, is the basis for constructing new mathematical objects: complete graphs, which are given by vertices and edges. A complete graph has an edge between every pair of vertices. For a given number of vertices, there is a unique complete graph, often written as $K_n$, where n is the number of vertices. It is clear from Table 1 that, for example, for the 5-year price history n is equal to 478. The vertices of these graphs represent the stocks forming the index for a given price history, while the edges are weighted by the distances between the stocks taken from the distance matrix. For each of the six complete graphs we obtain a subgraph by applying the method of the minimum spanning tree: a subgraph of the original graph that is connected (there is a path between every pair of vertices), contains no cycles, and has the minimal total edge weight. The minimum spanning tree represents exactly the structure in which the closest stocks (those with the greatest mutual correlation) are linked to each other. These minimum spanning trees and certain graph characteristics are the basis for the selection of stocks for our portfolio.
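To make the construction concrete, the following Python sketch extracts a minimum spanning tree from a small dense distance matrix using Prim's algorithm. The paper's analysis used the R library igraph; this function, its name and the 4 × 4 matrix are illustrative assumptions, not the authors' code.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense symmetric distance matrix.

    Returns a list of (u, v, weight) edges; a tree on n vertices has n - 1 edges.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known edge linking each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0               # start growing the tree from vertex 0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is cheapest to attach.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((parent[u], u, dist[parent[u]][u]))
        # Update the attachment cost of the remaining vertices.
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges

# Hypothetical distances between four stocks (the complete graph K_4).
dist = [
    [0.0, 0.5, 1.2, 1.4],
    [0.5, 0.0, 0.7, 1.3],
    [1.2, 0.7, 0.0, 0.6],
    [1.4, 1.3, 0.6, 0.0],
]
tree = minimum_spanning_tree(dist)
print(tree)
```

The complete graph on 478 vertices has 478 · 477 / 2 = 114,003 edges, while its minimum spanning tree keeps only 477 of them, which is the complexity reduction the method relies on.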
During the analysis, the eccentricity of the vertex is used, a graph characteristic considered very suitable for identifying the centrality of individual stocks. According to West (2000), the eccentricity of a vertex v in a connected graph is the maximum graph distance between v and any other vertex of the graph. For a disconnected graph, all vertices are defined to have infinite eccentricity. The maximum eccentricity is the graph diameter; the minimum eccentricity is the graph radius. Based on the assumption that the less correlated stocks are, the more suitable they are for diversifying portfolio risk, Onnela et al. (2003b) claim that the longer the minimum spanning tree representing the taxonomy of stocks, the greater the potential of the stock market (or portfolio) for risk diversification. An explanation could be the negative dependence between the correlation coefficient and the distance given by the ultra-metric. Uncorrelated stocks, therefore, form longer minimum spanning trees and create additional opportunities for risk diversification. The authors analyzed the stock market using minimum spanning trees changing over time and concluded that longer minimum spanning trees have greater diversification potential than shorter ones. If the dynamics over time are ignored and only one minimum spanning tree for the whole analyzed time period is assumed, the question remains where the stocks with the greatest diversification potential are located. The authors claim that in any minimum spanning tree, stocks with the greatest diversification potential are located closer to the graph periphery than to the center of the graph.
To verify this assumption, the authors used the so-called weighted portfolio layer, which combines the degree of the vertex (a graph characteristic), the distance between vertices (the length of the shortest path having the two vertices as its endpoints) and the Markowitz theory of the optimal portfolio. The analysis described in the following section is not dynamic in this sense, because six static time periods are analyzed, so the minimum spanning tree length is not used. Based on the findings above, vertex eccentricity was used instead of the weighted portfolio layer to identify the centrality of a vertex. If the assumption about the diversification potential of the stocks located on the graph periphery is correct, then vertex eccentricity appears to be a very appropriate characteristic. Large eccentricity indicates vertices found on the graph periphery, and small eccentricity is related to stocks in the center of the graph.
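The definition of eccentricity can be illustrated with a short Python sketch that runs a breadth-first search from every vertex of a tree. It assumes that the graph distance is the number of edges on the path (hop count) rather than the sum of edge weights; whether the authors' computation used weights is not stated here. The function name and the five-vertex tree are hypothetical.

```python
from collections import deque

def eccentricities(n, edges):
    """Eccentricity of every vertex 0..n-1, with distance counted in edges."""
    adjacency = [[] for _ in range(n)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    result = []
    for source in range(n):
        hops = [-1] * n           # -1 marks a vertex not reached yet
        hops[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if hops[v] == -1:
                    hops[v] = hops[u] + 1
                    queue.append(v)
        # In a disconnected graph every vertex has infinite eccentricity.
        result.append(float("inf") if -1 in hops else max(hops))
    return result

# Hypothetical tree: path 0-1-2-3 with an extra vertex 4 attached to 2.
tree_edges = [(0, 1), (1, 2), (2, 3), (2, 4)]
ecc = eccentricities(5, tree_edges)
print(ecc, "diameter:", max(ecc), "radius:", min(ecc))
```

In this toy tree the peripheral vertices 0, 3 and 4 have eccentricity 3 (the diameter), while the central vertices 1 and 2 have eccentricity 2 (the radius).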
The vertex degree was also considered as a characteristic. The degree of a graph vertex v is the number of graph edges that touch v. A vertex whose degree is one is called a leaf. Leaves are more likely to be located on the graph periphery because they have only one neighbor, so paths through the graph end at them. A neighbor is a vertex adjacent to a given vertex, that is, both are endpoints of the same edge. Vertices with higher degrees should be located closer to the center of the graph. Figure 2 shows both ways of identifying stocks on the graph periphery for the minimum spanning tree of the 30-year history. As can be seen, the use of eccentricity is in this case much more suitable than the use of the vertex degree.
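As a small illustration of why degree alone can mislead, the following Python sketch computes vertex degrees and leaves for a hypothetical six-vertex tree in which one leaf hangs directly off the central vertex. The helper names and the tree are assumptions made for this example only.

```python
from collections import Counter

def degrees(edges):
    """Number of edges touching each vertex."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return dict(counts)

def leaves(edges):
    """Vertices of degree one, in sorted order."""
    return sorted(v for v, d in degrees(edges).items() if d == 1)

# Hypothetical tree: path 0-1-2-3-4 with leaf 5 attached to the central vertex 2.
tree_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (2, 5)]
print(degrees(tree_edges))
print(leaves(tree_edges))
```

Vertex 5 is a leaf, yet it sits next to the center of the tree: its longest path has three edges, whereas the end vertices 0 and 4 are four edges away from each other. A degree-based rule would treat all three leaves as equally peripheral.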

Figure 2. Identification of the central and periphery vertices using eccentricity and degree
In Figure 2, on the left, the vertices are divided by eccentricity into two groups, large eccentricity vertices and small eccentricity vertices, so that both groups have approximately the same number of vertices (for more details of this division, see Table 2). Vertices with large eccentricity are highlighted in black. In Figure 2 on the right, all vertices with a degree equal to one are highlighted in black. If the assumption about the diversification potential of the stocks located on the graph periphery is correct, we consider the use of eccentricity to be more appropriate, because many vertices of degree one are located in the center of the graph.

RESULTS
For each price history, the minimum spanning tree is calculated. Eccentricities are then calculated for the vertices of these spanning trees, and for each price history two groups are formed, large eccentricity vertices and small eccentricity vertices, so that the number of elements in both groups is approximately equal. The procedure is explained in Table 2 below.
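One possible way to form two groups of roughly equal size is sketched below in Python. The exact rule behind Table 2 is not reproduced in this text, so the approach here, choosing the eccentricity threshold that balances the group sizes best, is an assumption, as are the function name and the sample values.

```python
def split_by_eccentricity(ecc):
    """Split stocks into (threshold, large, small) groups of similar size.

    ecc maps a stock name to its eccentricity. Stocks with eccentricity
    greater than or equal to the threshold form the 'large' group.
    """
    values = sorted(set(ecc.values()))
    if len(values) < 2:
        raise ValueError("need at least two distinct eccentricity values")
    n = len(ecc)
    # Skip the minimum value so that the 'small' group is never empty.
    candidates = values[1:]
    threshold = min(
        candidates,
        key=lambda t: abs(2 * sum(1 for e in ecc.values() if e >= t) - n),
    )
    large = sorted(s for s, e in ecc.items() if e >= threshold)
    small = sorted(s for s, e in ecc.items() if e < threshold)
    return threshold, large, small

# Hypothetical eccentricities of six stocks.
sample = {"a": 5, "b": 6, "c": 7, "d": 8, "e": 8, "f": 6}
print(split_by_eccentricity(sample))
```

Because eccentricity takes only a few distinct integer values, an exact 50/50 split is usually impossible, which is consistent with the groups being only approximately equal in size.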
The aim is to test the assumption of the diversification potential of the stocks located on the graph periphery and to find out if there is a statistically significant difference between the risk of portfolios created from the stocks of only the first or only the second group.
For each price history, stocks are divided into two groups based on Table 2. From each group, we randomly selected ten stocks. Each stock in the particular group was given an equal chance of being selected, but no stock could be selected twice within a single portfolio. Subsequently, using Markowitz portfolio theory, we searched for the stock weights $(w_1, \dots, w_{10})$ that minimize the standard deviation of the portfolio.
The standard deviation of the portfolio is calculated as the square root of the product of the vector of individual stock weights, the covariance matrix of the stocks' returns and the transposed vector of these weights:
$$\sigma_p = \sqrt{w \, \Sigma \, w^{T}}$$
where $w$ represents the row weight vector, $\Sigma$ the covariance matrix of returns and $w^{T}$ the transposed row weight vector (column weight vector).
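The calculation can be written out in a few lines of Python. This is a minimal sketch with plain lists; the function name and the two-stock covariance matrix are hypothetical.

```python
import math

def portfolio_std(weights, cov):
    """Portfolio standard deviation: square root of w * Sigma * w^T."""
    n = len(weights)
    variance = sum(
        weights[i] * cov[i][j] * weights[j]
        for i in range(n)
        for j in range(n)
    )
    return math.sqrt(variance)

# Two hypothetical uncorrelated stocks, each with a return variance of 0.04.
cov = [
    [0.04, 0.00],
    [0.00, 0.04],
]
print(portfolio_std([1.0, 0.0], cov))   # everything in one stock
print(portfolio_std([0.5, 0.5], cov))   # equal split between both stocks
```

Splitting the investment equally between the two uncorrelated stocks lowers the standard deviation from 0.2 to the square root of 0.02, roughly 0.141, which is the diversification effect the simulation measures.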
For each price history and for both groups, the random selection of ten stocks was run 10,000 times. After each run, the value of the minimized standard deviation of the portfolio was stored. The optimal portfolio weights were not stored, to save memory and speed up the calculation; finding the optimal weights is not the aim of this analysis. The optimization is performed using the algorithm introduced by Byrd et al. (1995), a limited-memory quasi-Newton method that supports bound constraints.
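The simulation loop can be sketched in Python as follows. Note the substitution: instead of the numerical optimization of Byrd et al. (1995), this sketch uses the closed-form minimum-variance solution with weights summing to one and no bounds on individual weights (short positions allowed), so its numbers need not match a bound-constrained optimization. All function names, the seed and the toy covariance matrix are assumptions.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def min_variance_std(cov):
    """Minimal portfolio standard deviation for weights summing to one.

    With y = Sigma^-1 * 1, the optimal weights are y / sum(y) and the
    minimal variance is 1 / sum(y). Assumes a positive-definite matrix.
    """
    y = solve(cov, [1.0] * len(cov))
    return math.sqrt(1.0 / sum(y))

def simulate(cov, group, k, runs, seed=0):
    """Draw k stocks without replacement, runs times; keep only the risk."""
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        picked = rng.sample(group, k)
        sub_cov = [[cov[i][j] for j in picked] for i in picked]
        results.append(min_variance_std(sub_cov))   # weights are discarded
    return results

# Toy example: four hypothetical uncorrelated stocks with unit variance.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(simulate(identity, group=[0, 1, 2, 3], k=2, runs=5))
```

In the setting described above, `k` would be 10, `runs` would be 10,000 and `group` would hold the column indices of either the large or the small eccentricity stocks.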
For each price history, there are two samples with 10,000 values of the minimized standard deviation of the portfolio. In the first group (sample 1), these deviations are calculated from portfolios whose stocks have a large eccentricity (located on the periphery of the minimum spanning tree). In the second group (sample 2), there are deviations of portfolios whose stocks have a small eccentricity (located in the center of the minimum spanning tree). Table 3 shows descriptive statistics of the samples, where n is the number of observations; sd is the standard deviation; trim is the trimmed mean with trim defaulting to 0.1; mad is the median absolute deviation; min is the minimum; max is the maximum; skew is the skewness of the sample distribution, and kurt is the kurtosis of the sample distribution. As can be seen, the mean values, as well as other characteristics such as the median, the standard deviation and the range of values, differ between the two samples. Another interesting finding is that, for every price history, the skewness and kurtosis of the minimized standard deviation of the portfolio are higher for stocks with large eccentricity (sample 1) than for stocks with small eccentricity (sample 2). This means that in the case of sample 1 the distribution is right-skewed: most values are below the average (the mean is higher than the median). The distribution of the minimized standard deviation of the portfolio is also more peaked for sample 1, while that of sample 2 is flatter.
The aim is to test the differences in the mean values of these samples and to show that stocks located on the periphery of the minimum spanning tree have a better diversification potential. Therefore, we assume that the minimized standard deviation of the portfolio in sample 1 is statistically significantly smaller than that in sample 2.
Tests for comparison of means are used. The minimized standard deviations of the portfolio are not normally distributed, despite the large number of observations. Because of that, the non-parametric independent 2-group Mann-Whitney U Test is used instead of the standard parametric unpaired (two-sample) t-test. For most samples, the assumption of normality is violated mainly because of high kurtosis and a right-skewed distribution. As an example, the density plot and the normal Quantile-Quantile (Q-Q) plot of the minimized standard deviation of the portfolio for the 5-year history and the sample of large eccentricity are shown in Figure 3. The null hypothesis is formulated as follows: the mean value of the minimized standard deviation is equal in both portfolio samples, $H_0: \mu_1 = \mu_2$.
The alternative hypothesis is as follows: the mean value of the minimized standard deviation of sample 1 (the group of peripheral stocks with larger eccentricities) is statistically significantly lower than that of sample 2, $H_1: \mu_1 < \mu_2$, where index 1 represents sample 1 (stocks with larger eccentricities) and index 2 represents sample 2 (stocks with smaller eccentricities). To support our assumption, we would like to reject the null hypothesis.
Because the aim is to show that peripheral stocks have a better diversification potential (a lower minimized standard deviation of the portfolio), one-tailed tests are used.
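A one-tailed Mann-Whitney U test can be sketched in Python as shown below. This is a simplified version that uses the normal approximation without tie or continuity corrections, so its p-values may differ slightly from those of a statistical package; the function name and the toy samples are hypothetical.

```python
import math

def mann_whitney_less(x, y):
    """One-sided Mann-Whitney U test; alternative: x tends to be smaller than y.

    Returns (U statistic of x, approximate p-value).
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    rank_sum_x = 0.0
    i = 0
    while i < len(combined):
        # Find the block of tied values and give each its average rank.
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0          # ranks i+1 .. j
        in_x = sum(1 for k in range(i, j) if combined[k][1] == 0)
        rank_sum_x += average_rank * in_x
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p_value = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # P(Z <= z)
    return u, p_value

# Toy samples: every value of the first sample lies below the second.
u, p = mann_whitney_less([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(u, p)
```

A small U for the first sample means that few of its values exceed values of the second sample, which pushes the one-sided p-value toward zero.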
Due to the high number of observations, we could also rely on the conclusions of the standard unpaired t-test. The results of this test do not differ from those of the non-parametric independent 2-group Mann-Whitney U Test. Table 4 below shows the results of both tests.
As can be seen in Table 4, for both the parametric and the non-parametric tests of mean values, the p-values are practically equal to 0, which means that the null hypothesis is rejected for each price history.
The assumption about the diversification potential of the peripheral stocks is thus supported. As shown, this holds for every price history. Furthermore, the minimized standard deviations of the portfolios do not differ systematically with the price history. In other words, there is no dependence between the price history and the minimized standard deviation of the portfolio.