Making Profits with Data Preprocessing

reference.)1 You will find that the correlations and coefficients you get will be different, but the same concepts apply.

The most striking feature of this table is that these formulas seem to have discovered a relationship that is somewhat predictive! The largest correlation of any single input with the dependent variable is 0.28. When the inputs are combined, the result is a multiple R of 0.39. All of these correlations are spurious, the result of a combination of chance and too few examples in a high-dimensional space.
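To see how easily chance produces numbers like these, consider a minimal sketch (a NumPy illustration under assumed sizes of 50 examples and 20 candidate inputs, not the actual study behind the table). Regressing a purely random target on purely random inputs routinely yields single-input correlations and a combined R of roughly this magnitude:

import numpy as np

rng = np.random.default_rng(0)
n_examples, n_inputs = 50, 20            # few examples, many candidate inputs (assumed sizes)
X = rng.normal(size=(n_examples, n_inputs))
y = rng.normal(size=n_examples)          # target is pure noise: nothing is truly predictive

# Largest single-input correlation with the target
corrs = [np.corrcoef(X[:, j], y)[0, 1] for j in range(n_inputs)]
print("largest |correlation|:", max(abs(c) for c in corrs))

# Multiple R from an ordinary least-squares fit using all inputs together
A = np.column_stack([np.ones(n_examples), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print("multiple R:", np.corrcoef(A @ beta, y)[0, 1])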

One key to successful data preprocessing is to develop a method for sifting through the myriad possible inputs to find that handful of candidate inputs with consistent predictive power. Networks with more than 5 or 10 inputs almost always wind up modeling spurious correlations in the noise.
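One common way to do that sifting, sketched below under assumed names and a simple linear scoring model (the text does not prescribe a particular procedure), is greedy forward selection scored on a held-out set: an input is kept only if it improves out-of-sample fit, which naturally caps the model at a handful of inputs.

import numpy as np

def forward_select(X_train, y_train, X_val, y_val, max_inputs=5):
    """Greedily add inputs while each addition improves held-out fit."""
    chosen, remaining = [], list(range(X_train.shape[1]))
    best_score = -np.inf
    while remaining and len(chosen) < max_inputs:
        scores = {}
        for j in remaining:
            cols = chosen + [j]
            A = np.column_stack([np.ones(len(X_train)), X_train[:, cols]])
            beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
            pred = np.column_stack([np.ones(len(X_val)), X_val[:, cols]]) @ beta
            scores[j] = -np.mean((pred - y_val) ** 2)   # negative validation MSE
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:
            break                                       # no remaining input helps out of sample
        best_score = scores[j_best]
        chosen.append(j_best)
        remaining.remove(j_best)
    return chosen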

Another issue faced by anyone using neural networks is how the network interprets the data. Often, without additional constraints, a neural network will find alias solutions that appear to fit the data as well as possible, yet fail to effectively capture essential relationships. This is the motivation for transformations that enhance the information in the data and can potentially reduce the number of inputs to a network. A few well-chosen transformations yield much better models than many poorly selected ones.

To illustrate the problem of alias solutions, Table 16.2 summarizes the training set for a neural network. One additional example has been included to test the trained network. Figure 16.1 shows this data graphically. The problem consists of two inputs whose outputs form a saddle shape. Look at Figure 16.1 for a moment and consider what value you would expect the network to predict for the point labeled "T."

Figure 16.2 shows three neural network solutions to this problem. Notice that all three solutions fit the training data perfectly. Over many training runs, solutions 2(a) and 2(b) each occur about 40 percent of the time, and solution 2(c) occurs approximately 20 percent of the time. All three solutions use three hidden units in a standard fully connected feedforward configuration.

Table 16.2 Nine training records and one test record used to illustrate issues in developing neural models

[Table body not recoverable from this copy; the legible fragments are the column labels "Zero Output" and "One Output," a "Test Case" row, and the value 0.25.]

Which of these is preferable? When asked to choose, most individuals find solution 2(c) the most appealing. It is the only solution that preserves a sense of "rationality": it is the only one of the three that meets the smoothness criterion, which says that if a point lies between two others, its value should lie between the values of those two points. Solutions 2(a) and 2(b) violate this assumption. In a real problem, solutions 2(a) and 2(b) would not generalize as well as 2(c).
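The instability of these alias solutions is easy to reproduce. The sketch below trains a three-hidden-unit network several times on an assumed saddle-shaped data set (Table 16.2's exact values are not reproduced here, and scikit-learn's MLPRegressor stands in for the original simulator); the prediction at an assumed test point "T" varies from run to run even though every run fits the training points.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Assumed stand-in for Table 16.2: nine training points on a 3x3 grid with a saddle-shaped target
grid = [0.0, 0.5, 1.0]
X = np.array([[a, b] for a in grid for b in grid])
y = np.array([4 * (a - 0.5) * (b - 0.5) + 0.5 for a, b in X])
x_test = np.array([[0.25, 0.25]])                     # assumed location of the test point "T"

for seed in range(5):
    net = MLPRegressor(hidden_layer_sizes=(3,), activation="tanh",
                       solver="lbfgs", max_iter=5000, random_state=seed)
    net.fit(X, y)
    print(f"run {seed}: prediction at T = {net.predict(x_test)[0]:+.3f}")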

Tying this back to preprocessing, neural networks often find "lazy" solutions that fit the training data while missing more fundamental relationships. In general, to the extent that the raw data can be transformed in ways that extract these more subtle relationships, the model will need fewer inputs and will typically perform better on a validation set.
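As an illustration of that point (using the assumed saddle data from the sketch above, not an example from the text), a single interaction feature can carry the entire relationship that the network would otherwise have to discover from two raw inputs:

import numpy as np

def saddle_feature(x1, x2):
    # Interaction term that linearizes the assumed saddle-shaped target
    return (x1 - 0.5) * (x2 - 0.5)

grid = [0.0, 0.5, 1.0]
X_raw = np.array([[a, b] for a in grid for b in grid])
y = np.array([4 * (a - 0.5) * (b - 0.5) + 0.5 for a, b in X_raw])

z = saddle_feature(X_raw[:, 0], X_raw[:, 1])          # one transformed input replaces two raw ones
print("correlation of transformed input with target:", np.corrcoef(z, y)[0, 1])   # exactly 1.0 here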

Data preprocessing enhances subtle relationships and sifts through candidate variables to select a few synergistic inputs for a model. The net result is simpler models that generalize better.

Feature Extraction

The fundamental assumption underlying the application of neural networks and other modeling technologies to financial problems is that there is some predictable signal that allows better-than-random decisions. One explanation for why such a signal should exist is the "Financial Ponzi Scheme" hypothesis of Meir Statman.2 This hypothesis leads to a strategy for identifying potentially good transformations on the data.


