A Conceptual Development And An Implementation For Exploring Hypothesis Space In Data Mining

The problem faced by data miners is that the information obtained is difficult to understand, and hence to represent, because of its high dimensionality. This paper discusses the effects of errors, data representation, and assessment approach in data mining. Data-mining techniques designed for classification problems usually assume that each observation is a member of only one category. We extend these methods to the case in which observations may be full members of multiple categories. We evaluate three popular methods (decision tree induction, linear discriminant analysis, and neural networks) and explore representational and performance measurement perspectives. We apply data mining to predicting bankruptcy. We show that different parameters for forecasting bankruptcy are obtained when economic conditions are normal and when countries are operating in crisis mode, as has been the case recently in parts of Asia.

1.    Introduction: Corporate bankruptcy brings about large economic losses to management, stockholders, employees, customers, and others, together with a substantial social and economic cost to the nation. A model that predicts corporate failure would therefore help to reduce such losses by giving these stakeholders advance warning. An early warning signal of probable failure enables both management and investors to take preventive action and to shorten the period during which losses are incurred. Accurate prediction of bankruptcy has thus become an important issue in finance.
Numerous researchers have studied bankruptcy prediction over the years. As a result, various theories have evolved in an effort to explain failure or to distinguish firms that have failed from those that have not. Beaver (1966) used a dichotomous classification test, that is, a test that sorts firms into two groups, to determine the error rates a potential creditor would experience if he classified firms as failed or non-failed on the basis of their financial ratios. Beaver was able to classify 78% of his sample firms as failures five years before they actually failed. Altman (1968) used discriminant analysis to rank firms on the basis of a weighted combination of five ratios. His results were 95% effective in selecting future bankruptcies in the year prior to bankruptcy. Numerous follow-up studies have tried to develop more appropriate models by applying data-mining techniques including multivariate discriminant analysis, logistic regression analysis, probit analysis, genetic algorithms, neural networks, decision trees, and other statistical and computational methods.
This study attempts to build a financial ratio model that predicts a financial crisis before it occurs. Such a model could help management to improve its financial structure and reduce the probability of financial distress. Furthermore, the model provides in-depth information that investors and creditors can use to examine their investment risk.
2.    Literature Reviews: Ohlson (1980) built and analyzed a model using a sample of 105 failed companies and 2058 non-failed companies from 1970 to 1976. He set up 3 models from 9 explanatory variables to predict corporate failure. From these it was possible to identify four basic factors as statistically significant in affecting the probability of failure (within one year): (1) the size of the company; (2) a measure(s) of the financial structure; (3) a measure(s) of performance; and (4) a measure(s) of current liquidity (the evidence regarding this factor is not as clear as for factors (1)-(3)). Ohlson's empirical study showed the prediction accuracy of the 3 models to be 96.12%, 95.55%, and 92.84%, respectively.
Model 1: Financial Ratios only
Model 2: Non-financial Information only
Model 3: Financial Ratios and non-financial information

The results indicate that marginally better predictions of a small company's failure may be obtained from non-financial data than from traditional financial ratios. The classification accuracy rates of the three models are 78.7%, 75.3%, and 82.2%, respectively.
3.    Research Design: First, the study defines a failed company, a non-failed company, and the variables. Second, the study uses factor analysis and logistic regression to construct a prediction model.
3.1    Definition: Failed company: A failed company is a company listed on the National Stock Exchange (NSE) market that the authorities have judicially declared a special arrangement company because it has operating difficulties. According to the Operation Rules, a company in an unhealthy financial condition is recognized as a company in financial crisis. Because the sample is drawn from companies listed on the NSE market, the definition of a company with operating difficulties follows the NSE definition.
3.2    Non-failed company: A non-failed company is a company listed on the NSE market that has no special stock arrangement. Its stock is allowed to trade publicly.
4.    Statistical models: The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy is used to judge whether the variables are suitable for factor analysis. If the sum of the squared partial correlation coefficients between all pairs of variables is small compared to the sum of the squared correlation coefficients, then the KMO measure is close to 1. Small values of the KMO measure indicate that a factor analysis of the variables may not be a good idea, since correlations between pairs of variables cannot be explained by the other variables.
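As a rough illustration of the KMO measure described above, the following Python sketch computes it from a correlation matrix. The partial correlations are obtained from the inverse of the correlation matrix, which is the usual construction. The function names `invert` and `kmo` and the example matrix are assumptions made for this illustration and are not taken from the study's data.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def kmo(corr):
    """KMO = sum(r^2) / (sum(r^2) + sum(p^2)) over all off-diagonal pairs,
    where r is a correlation and p the matching partial correlation."""
    inv = invert(corr)
    n = len(corr)
    sum_r2 = 0.0
    sum_p2 = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                partial = -inv[i][j] / (inv[i][i] * inv[j][j]) ** 0.5
                sum_r2 += corr[i][j] ** 2
                sum_p2 += partial ** 2
    return sum_r2 / (sum_r2 + sum_p2)


# Made-up correlation matrix for three financial ratios (illustration only).
example_corr = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]
print(round(kmo(example_corr), 3))
```

The sketch assumes the correlation matrix is invertible and that at least one pair of variables is correlated; otherwise the ratio is undefined.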
The study uses logistic regression because, when the dependent variable can take only two values, the assumptions necessary for hypothesis testing in ordinary regression analysis are violated. Another difficulty with multiple regression analysis is that the predicted values cannot be interpreted as probabilities, as they are not constrained to fall in the interval between 0 and 1. Logistic regression requires far fewer assumptions than discriminant analysis, and even when the assumptions required for discriminant analysis are satisfied, logistic regression still performs well.
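The point about predicted values can be made concrete with a small Python sketch. It assumes a hypothetical two-ratio model; the intercept, the coefficients, and the firm data are invented for illustration and are not estimates from this study. The linear score can fall outside the interval from 0 to 1, whereas the logistic transformation of the same score always lies strictly inside it.

```python
import math


def linear_score(ratios, intercept, coefs):
    """Linear combination of financial ratios, as in multiple regression."""
    return intercept + sum(c * x for c, x in zip(coefs, ratios))


def logistic(z):
    """Map a linear score z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and ratios (illustration only).
intercept = 0.4
coefs = [-1.2, 0.9]
firms = {
    "firm_a": [2.0, 0.1],
    "firm_b": [0.2, 1.5],
}

for name, ratios in firms.items():
    z = linear_score(ratios, intercept, coefs)
    p = logistic(z)
    # z is not bounded, but p can be read as a probability of failure.
    print(name, "linear score:", round(z, 3), "logistic probability:", round(p, 3))
```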
5.    Sample Selection and Data Sources: A minimum of 31 failed companies and a minimum of 31 non-failed companies qualify under the above NSE-based definitions during the study period. The failed and non-failed companies are matched.
6.    Definition of variables: In this study, the dependent variable is a dummy variable coded 0 for a non-failed company and 1 for a failed company. The independent variables fall into two categories: a financial ratio group and a non-financial group.
The financial group consists of 18 financial ratios from the database of all over India Economic Journal. After factor analysis, the study selects the variables with the highest loadings. For one year before failure, the selected variables are the long-term capital ratio to fixed assets, the current ratio, the total assets turnover, the return on assets, and the cash reinvestment ratio. For two years before failure, they are the long-term capital ratio to fixed assets, the current ratio, the inventory turnover, the total assets turnover, and the return on total equity. For three years before failure, they are the long-term capital ratio to fixed assets, the quick ratio, the times interest earned, the inventory turnover, and the net profit before taxes to capital issued.
The stock price can reflect the performance of a company. Before a company moves toward failure, its stock price reflects the related negative information. This variable can therefore be used to measure whether the company is going from bad to worse.
7.    Empirical Results: The financial variables are entered into the factor analysis, while the non-financial variables are not. The non-financial variables are instead added to the logistic regression to see whether they increase the rate of accuracy. To find variables that can be used to establish a model, this study applies factor analysis to the 18 financial ratios and filters out the most appropriate variables. The number of factors is decided by the method proposed by Kaiser, which keeps factors with an eigenvalue > 1; financial ratio variables are retained when the absolute value of their factor loading is > 0.3 and their communality is > 0.7. According to the results of the factor analysis, the variable with the highest loading on each factor is selected. The selected variables are then used to build the failure prediction models. The variables selected in this way are those listed in Section 6.
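The selection rule described above can be sketched in Python as follows. This is one possible reading of the rule, not the study's actual procedure: it assumes the eigenvalues and factor loadings have already been produced by a factor analysis, and it computes each communality over the retained factors only. The function name `select_variables`, the variable names, and all numbers are hypothetical.

```python
def select_variables(eigenvalues, loadings, names,
                     min_eigenvalue=1.0, min_loading=0.3, min_communality=0.7):
    """Pick the highest-loading eligible variable for each retained factor.

    loadings[i][k] is the loading of variable i on factor k.
    """
    # Kaiser criterion: keep factors whose eigenvalue exceeds 1.
    kept = [k for k, ev in enumerate(eigenvalues) if ev > min_eigenvalue]
    # Communality: sum of squared loadings over the retained factors
    # (an assumption of this sketch).
    communality = [sum(row[k] ** 2 for k in kept) for row in loadings]
    selected = {}
    for k in kept:
        candidates = [
            i for i, row in enumerate(loadings)
            if abs(row[k]) > min_loading and communality[i] > min_communality
        ]
        if candidates:
            best = max(candidates, key=lambda i: abs(loadings[i][k]))
            selected[k] = names[best]
    return selected


# Made-up factor analysis output for four ratios and three factors.
names = ["current_ratio", "quick_ratio", "return_on_assets", "inventory_turnover"]
eigenvalues = [2.1, 1.3, 0.4]
loadings = [
    [0.90, 0.10, 0.05],
    [0.85, 0.15, 0.10],
    [0.10, 0.88, 0.20],
    [0.35, 0.40, 0.70],
]
print(select_variables(eigenvalues, loadings, names))
```

In this invented example the third factor is dropped because its eigenvalue is below 1, and the fourth ratio is excluded because its communality over the two retained factors is below 0.7.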
The results are reported for one year, two years, and three years before the failure. In the one-year comparative results, the financial variables selected for inclusion in the prediction model show that four variables are individually significant at the 95% level, and the function as a whole is significant at the 95% level. The financial and non-financial variables selected for inclusion in the prediction model show that four variables are individually significant at the 95% level, and the function as a whole is significant at the 95% level. The non-financial information contained in the prediction model increases the percentage of correct classifications. The correct classification results for the prediction model based on both financial and non-financial information are superior to those of the prediction model based on financial information alone. The model correctly predicts 87.1% of the companies in the sample. This superiority is evident both in overall correct classifications and in correctly classifying failed and non-failed companies.
In the two-year comparative results, the financial variables selected for inclusion in the prediction model show that two variables are individually significant at the 95% level, and the function as a whole is significant at the 95% level. The financial and non-financial variables selected for inclusion in the prediction model show that three variables are individually significant at the 95% level, and the function as a whole is significant at the 95% level. Here the non-financial information contained in the prediction model does not increase the percentage of correct classifications. The correct classification results for the prediction model based on both financial and non-financial information are the same as those of the prediction model based on financial information alone. Both models correctly predict 77.42% of the companies in the sample.
8.    Independent Variables in the Financial Group: In the three-year comparative results, the financial variables selected for inclusion in the prediction model show that no variable is individually significant at the 95% level, although the function as a whole is significant at the 95% level. This seems to indicate that multicollinearity may be present. While multicollinearity causes problems in determining the significance of individual variables, it does not affect the predictive accuracy of the model. The financial and non-financial variables selected for inclusion in the prediction model show that no variable is individually significant at the 95% level, and the function as a whole is not significant. Nevertheless, the non-financial information contained in the prediction model increases the percentage of correct classifications. The correct classification results for the prediction model based on both financial and non-financial information are superior to those of the prediction model based on financial information alone. The model correctly predicts 72.58% of the companies in the sample. This superiority is evident both in overall correct classifications and in correctly classifying failed companies.
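The correct classification percentages discussed above are of the following kind: the share of all companies classified correctly, and the share classified correctly within the failed and the non-failed groups. The Python sketch below shows how such rates can be computed from actual and predicted labels. It assumes the coding 1 = failed and 0 = non-failed; the function name `classification_rates` and the eight toy observations are invented for illustration and are not the study's data.

```python
def classification_rates(actual, predicted):
    """Return (overall, failed, non_failed) correct-classification rates.

    Labels are assumed to be 1 for failed and 0 for non-failed, and both
    groups are assumed to be present in `actual`.
    """
    pairs = list(zip(actual, predicted))
    overall = sum(1 for a, p in pairs if a == p) / len(pairs)
    failed = [p for a, p in pairs if a == 1]
    non_failed = [p for a, p in pairs if a == 0]
    failed_rate = sum(1 for p in failed if p == 1) / len(failed)
    non_failed_rate = sum(1 for p in non_failed if p == 0) / len(non_failed)
    return overall, failed_rate, non_failed_rate


# Toy data: four failed and four non-failed companies (illustration only).
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 0]

overall, failed_rate, non_failed_rate = classification_rates(actual, predicted)
print("overall:", overall)
print("failed:", failed_rate)
print("non-failed:", non_failed_rate)
```

Reporting the two group rates alongside the overall rate shows whether a model's accuracy comes from one group only, which is why the comparison above refers to failed and non-failed companies separately.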
9.    Conclusion: The objective of the paper has been to examine whether it is possible to predict which listed companies will fail from publicly available non-financial information, alone or in conjunction with financial ratios, and how such predictions compare with those obtained solely from financial ratios.

10.     References
1.    Pregibon, D. (1997). Data Mining. Statistical Computing and Graphics, 7, 8.
2.    Altman, E. (1968), "Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy." Journal of Finance, 23 (September), pp. 589-609.
3.    Beaver, W. H. (1966), "Financial Ratios as Predictors of Failure." Journal of Accounting Research, 4, pp. 71-111.
4.    Ohlson, J. A. (1980), "Financial Ratios and the Probabilistic Prediction of Bankruptcy." Journal of Accounting Research (Spring), pp. 109-131.
5.    Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in Knowledge Discovery and Data Mining. Cambridge, MA: MIT Press.
6.    Han, J., & Kamber, M. (2000). Data Mining: Concepts and Techniques. New York: Morgan Kaufmann.