movement toward the global minimum. They also believed that some problems are linearly separable and that this would be a good place to start.

It is not the goal of this section to give a complete description of the Cascade Correlation algorithm; rather, it is to show some of the strengths and weaknesses of this methodology. The first step in Cascade Correlation is to determine a linear fit to the training data. This is best illustrated in Figure 17.2a. The three input nodes are shown on the left and the output nodes are on the upper right; the intersections of the lines represent the weights. After this initial training, if the problem is linearly separable, the problem is solved. If there is residual error, then hidden nodes must be added (one at a time) to reduce the error to acceptable levels. A pool of candidate hidden nodes is created using random initial weights. All candidates are trained until either the maximum number of training cycles is reached or there is no more progress toward maximizing the correlation (really the covariance) between each candidate's output and the residual error. The winning node is selected and inserted into the network, its input weights are frozen, and all the output node weights are retrained (Figure 17.2b). This is a very fast process. If residual error remains, another hidden node is added using the same process (Figure 17.2c).
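To make the growth procedure concrete, here is a minimal sketch in Python. It assumes a least-squares linear fit for the output weights and a simple finite-difference gradient-ascent trainer for the candidate node; the function names (linear_fit, train_candidate, cascade_correlation) are ours, not the original algorithm's, and the details are only one reasonable way to fill in the steps described above.

```python
import numpy as np

def linear_fit(X, y):
    """Initial step: least-squares fit of the output weights to the inputs."""
    Xb = np.hstack([X, np.ones((len(X), 1))])       # add a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w, Xb @ w

def train_candidate(X, residual, hidden=np.tanh, epochs=200, lr=0.05):
    """Train one candidate node to maximize the covariance between its
    output and the current residual error (only its input weights move)."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1] + 1)
    Xb = np.hstack([X, np.ones((len(X), 1))])
    for _ in range(epochs):
        v = hidden(Xb @ w)                           # candidate output
        cov = np.dot(v - v.mean(), residual - residual.mean())
        # crude numerical gradient of |covariance| w.r.t. the input weights
        grad = np.zeros_like(w)
        for i in range(len(w)):
            wp = w.copy()
            wp[i] += 1e-4
            vp = hidden(Xb @ wp)
            covp = np.dot(vp - vp.mean(), residual - residual.mean())
            grad[i] = (abs(covp) - abs(cov)) / 1e-4
        w += lr * grad
    return w, hidden(Xb @ w)

def cascade_correlation(X, y, max_hidden=5, tol=1e-3):
    """Grow the network one hidden node at a time."""
    out_w, pred = linear_fit(X, y)                   # step 1: linear fit
    features = X.copy()
    for _ in range(max_hidden):
        residual = y - pred
        if np.mean(residual ** 2) < tol:             # error acceptable: stop
            break
        # train a candidate on all inputs plus earlier hidden outputs,
        # then freeze it and retrain only the output weights
        _, h = train_candidate(features, residual)
        features = np.hstack([features, h[:, None]])
        out_w, pred = linear_fit(features, y)
    return pred
```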

Several things should be noted about the Cascade Correlation algorithm. First, as a node is added, its input includes all the original inputs plus the outputs of all previously added hidden nodes. What this really creates is a network of many layers consisting of one node each. Each added node has also become a feature detector: it is chosen to detect a specific pattern in the input, and it is the only node doing that job. Thus, the hidden nodes no longer have to compete to become feature detectors. Second, notice that the input weights to a node are frozen when it is added to the network; they will not change at any time in the future. This means that backpropagation of the model's output error is not needed, which leads to fast training. It also permits the existing network to be trained further if new data is added to the training set, without requiring relearning of the earlier data. Third, any transfer function may be used in a candidate node; all node types may train and compete to be added to the network. Because the training is uncomplicated, this facilitates the construction of complex networks.
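Because a candidate's input weights are frozen once it wins, candidates with different transfer functions can be trained side by side and compete on the same covariance score. A brief sketch of such a pool, reusing the hypothetical train_candidate function from the sketch above:

```python
import numpy as np

# A pool of candidate transfer functions; each candidate is trained
# independently and the one whose output covaries most strongly with the
# residual error wins a place in the network (illustrative choices only).
CANDIDATE_FUNCTIONS = {
    "sigmoid": lambda z: 1.0 / (1.0 + np.exp(-z)),
    "tanh":    np.tanh,
    "sine":    np.sin,
    "gauss":   lambda z: np.exp(-z ** 2),
}

def best_candidate(features, residual):
    """Train one candidate per transfer function and keep the winner."""
    best = None
    for name, fn in CANDIDATE_FUNCTIONS.items():
        w, out = train_candidate(features, residual, hidden=fn)
        score = abs(np.dot(out - out.mean(), residual - residual.mean()))
        if best is None or score > best[0]:
            best = (score, name, w, out)
    return best   # (covariance score, function name, frozen weights, output)
```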

Figure 17.2 Example cascade correlation network growth.

Why would we want a transfer function other than the sigmoid? Simply because our data may be better represented by another function. For example, if we use a sine as the transfer function, it may represent a cyclic market. Using a sine as a transfer function means we are performing a pseudo-Fourier analysis of the data.
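As a rough illustration of the pseudo-Fourier point, a single hidden node with a sine transfer function computes a term of the form a*sin(w*x + b), so a sum of such nodes behaves like a truncated Fourier-like series, each node contributing one frequency, phase, and amplitude. The frequencies below are made up purely for illustration:

```python
import numpy as np

def sine_node(x, w, b, a):
    """One hidden node with a sine transfer function: a * sin(w*x + b)."""
    return a * np.sin(w * x + b)

# Two such nodes summed approximate a cyclic series with, say, a roughly
# weekly and a roughly monthly component (hypothetical parameters).
t = np.arange(0, 100.0)
signal = sine_node(t, 2 * np.pi / 5, 0.3, 1.0) + sine_node(t, 2 * np.pi / 21, 1.1, 0.5)
```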

For all its good points, Cascade Correlation does have problems. The first is seen in Figure 17.3. Assume that our data is a perfect sine wave and we want to model it using Cascade Correlation with a sine as the node's transfer function. With one hidden node, we should be able to model this data exactly. However, Cascade Correlation cannot do this. Remember that the first step was to do a linear fit and then model the residual error with the hidden nodes. The straight line in Figure 17.3 represents the linear fit. Once this linear fit is accomplished, the hidden nodes must model the residual error. Since the residual is no longer a pure sine wave, this becomes a difficult task.
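The effect is easy to reproduce numerically: fit a straight line to a stretch of a pure sine wave and inspect what remains. The sketch below assumes a least-squares fit as the first step:

```python
import numpy as np

# One and a half cycles of a pure sine wave
x = np.linspace(0, 3 * np.pi, 200)
y = np.sin(x)

# Step 1 of Cascade Correlation: a linear (least-squares) fit
A = np.vstack([x, np.ones_like(x)]).T
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]

# The hidden nodes must now model this residual, which is a sine wave with
# a ramp subtracted from it -- no single sine node can match it exactly.
residual = y - (slope * x + intercept)
```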

In experiments using Cascade Correlation to model the Mackey-Glass equation (a time-delay chaotic function), we found that Cascade Correlation would sometimes do a good job and other times a poor job.7 The problem lies in specifying when to quit: the more complicated the network, that is, the more nodes in the network, the worse its performance was. Networks that quit training with around 30 nodes did best, while networks of 60 nodes were notoriously bad.
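For reference, the Mackey-Glass series comes from the delay differential equation dx/dt = b*x(t-tau)/(1 + x(t-tau)^n) - g*x(t); with the commonly used parameters b = 0.2, g = 0.1, n = 10, and tau = 17 the series is chaotic. A simple Euler-integration sketch (our discretization, not necessarily the one used in the experiments above):

```python
import numpy as np

def mackey_glass(length=1000, beta=0.2, gamma=0.1, n=10, tau=17, dt=1.0, x0=1.2):
    """Generate a Mackey-Glass series by simple Euler integration.
    Step size and initial history are illustrative choices."""
    history = int(tau / dt)
    x = np.full(length + history, x0)            # constant initial history
    for t in range(history, length + history - 1):
        x_tau = x[t - history]                   # delayed value x(t - tau)
        x[t + 1] = x[t] + dt * (beta * x_tau / (1 + x_tau ** n) - gamma * x[t])
    return x[history:]
```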

Figure 17.3 Cascade correlation limitation example.



Probabilistic Neural Networks

Another example of an ontogenic network is the Probabilistic Neural Network (PNN) developed by Donald Specht.8 This network has some very interesting properties that should be considered when deciding which methodology to use in building your model. First and foremost, this network is very fast in its training; Specht claims it to be almost instantaneous. Second, under some easily met conditions its decision surfaces approach the Bayes optimal decision surface, which is essentially the theoretical limit on how well a classification can be performed. Specht also claims that this network is able to perform function approximation. This means that we can use the PNN either to forecast the future or to help us make a trading decision.

What, then, is a PNN? A PNN consists of four layers (Figure 17.4). The first layer, the input layer, is fully connected to the next layer and is used simply to distribute the input components to each node in that layer.

The second layer consists of pattern nodes, where each node represents one training example. This means that the more training examples you have, the more nodes, and thus the more storage, are required. With the memory capacity of current personal computers, plus the ever-diminishing cost of memory, this should not be a consideration if this is the kind of network that you feel would be best. The

Figure 17.4 Probabilistic neural network architecture.

(Layers shown, left to right: input layer, pattern layer, summation layer, output layer.)
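To make the four layers of Figure 17.4 concrete, the sketch below implements a PNN classifier in the standard way: each pattern node holds one training example and applies a Gaussian (Parzen) kernel, the summation layer adds the kernel responses per class, and the output layer picks the class with the largest sum. The smoothing parameter sigma is an assumption of this sketch and would need tuning; this is a generic illustration, not Specht's code.

```python
import numpy as np

class PNN:
    """Minimal probabilistic neural network classifier (sketch).

    Pattern layer:   one Gaussian kernel per training example.
    Summation layer: kernel responses summed per class.
    Output layer:    the class with the largest summed response wins."""

    def __init__(self, sigma=0.1):
        self.sigma = sigma

    def fit(self, X, y):
        # "Training" just stores the examples -- hence the near-instant
        # training time noted in the text.
        self.X, self.y = np.asarray(X), np.asarray(y)
        self.classes = np.unique(self.y)
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X):
            # pattern layer: one response per stored training example
            d2 = np.sum((self.X - x) ** 2, axis=1)
            k = np.exp(-d2 / (2 * self.sigma ** 2))
            # summation layer: add the responses belonging to each class
            sums = [k[self.y == c].sum() for c in self.classes]
            # output layer: pick the class with the largest sum
            preds.append(self.classes[int(np.argmax(sums))])
        return np.array(preds)
```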


