


Output from this node is x = f(y).

Node delta is δ = f′(y)(w1δ1 + w2δ2 + ... + wkδk).

Figure 13.5 Backpropagation.

Once all nodes have a delta value, these values are used to update the weights on the connections and the biases on each node. The basic form of updating weights is

w_new = w_old + νδx,

where δ is the delta on the upper node and x is the output from the lower node. The updating of biases takes a similar form,

w0,new = w0,old + νδ,

where δ is the delta on the node.

The parameter v is called the network learning coefficient. It may be modified during training if the network has an adaptive learning capability. The learning coefficient is similar to the step size of simple iterative optimization algorithms such as gradient search. In fact backpropagation is equivalent to gradient descent optimization of the error function, so any of the standard algorithms can be substituted. Some of these algorithms were described in §4.3.2, but those that are most successful for optimizing GARCH models are not necessarily as good for neural networks. The most common gradient optimization methods used for neural networks are conjugate gradients or the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm.
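The delta propagation and weight-update rules above can be sketched in Python. This is only an illustrative sketch: the function names, the choice of tanh as the activation f, and the array shapes are assumptions, not taken from the text.

```python
import numpy as np

def hidden_delta(y, w_out, delta_out):
    """Delta on a hidden node: δ = f′(y)(w1δ1 + ... + wkδk).
    Here f = tanh is assumed, so f′(y) = 1 - tanh(y)^2."""
    fprime = 1.0 - np.tanh(y) ** 2
    return fprime * (w_out @ delta_out)

def update_layer(w, b, delta_upper, x_lower, nu=0.1):
    """One backpropagation update for the connections into a layer.

    w           : (k_lower, k_upper) weights on the connections
    b           : (k_upper,) biases on the upper nodes
    delta_upper : (k_upper,) delta values on the upper nodes
    x_lower     : (k_lower,) outputs from the lower nodes
    nu          : network learning coefficient (step size)
    """
    # w_new = w_old + ν δ x, applied to every connection at once
    w_new = w + nu * np.outer(x_lower, delta_upper)
    # b_new = b_old + ν δ
    b_new = b + nu * delta_upper
    return w_new, b_new
```

In practice the same update would be computed for every layer in turn, working backwards from the output layer; the learning coefficient nu plays the role of the step size described above.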

13.2.4 Performance Measurement

Any standard performance measure, such as those described in §A.5.3, may be applied when comparing the output of a neural network with its target output.



When no distributional assumptions are made - as, for example, in some neural networks that are used for price prediction - the performance of the network may be measured using a standard error function, such as the root mean square error between the observed and predicted price.

If distributional assumptions are made, it is more informative to use a likelihood than an error criterion. For example, suppose a neural network is used to estimate the parameters of a univariate normal mixture density (§10.2.3). The basic input vector, before any normalization or data compression, will be the returns time series, and the output nodes would give the parameters of the normal mixture: the probabilities, means and (log) standard deviations for each of the constituent normal densities. An appropriate performance measure is then the log-likelihood of the training data given the output parameters. We denote this by ln L(w) to emphasize that the value of the likelihood will depend on the weights w of the connections. During training the network will be optimized by minimizing the error function E(w) = -2 ln L(w) (§A.6.3).
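A minimal sketch of this error criterion, assuming the output nodes deliver the probabilities, means and log standard deviations directly (the function name and argument layout are hypothetical):

```python
import numpy as np

def mixture_error(returns, probs, means, log_stds):
    """E(w) = -2 ln L(w) for a univariate normal mixture whose parameters
    come from the network's output nodes."""
    sds = np.exp(log_stds)                       # undo the log transform
    # standardized residuals of each observation under each component: (T, m)
    z = (returns[:, None] - means[None, :]) / sds[None, :]
    comp = np.exp(-0.5 * z ** 2) / (sds[None, :] * np.sqrt(2.0 * np.pi))
    lik = comp @ probs                           # mixture density per observation
    return -2.0 * np.sum(np.log(lik))
```

During training the optimizer would adjust the network weights so as to drive this quantity down, which is equivalent to maximizing the log-likelihood of the training data.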

Since neural networks are universal approximators, if permitted they can fit the required targets to any degree of accuracy, so that errors in the training data can be arbitrarily small. But this usually results in poor out-of-sample performance when the optimized network is run on a test data set. The old-fashioned method of preventing the network from over-fitting was just to stop the optimization before the fit became too close. This is rather crude and leaves much to the modeller's judgement of the appropriate time to switch off the network.

A more sophisticated method is to add a cost of complexity to the performance measure. For example, Williams (1995, 1996) adds a regularization term n ln(Σ|wi|) to the log-likelihood function, where n is the number of connections with non-zero weights and the summation is over all the weights in the network. The regularized performance measure - that is, the error function that should be minimized - as a function of the weights w, becomes

E(w) = -2 ln L(w) + n ln(Σ|wi|).

This type of error function, which includes a penalty for over-complexity, automatically prunes zero-weight connections from the network. Thus the network architecture is automatically modified during training, which avoids any need for the modeller to apply subjective tolerance levels to convergence metrics in the old-fashioned way.
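Assuming the penalty takes the form described above, the regularized error can be sketched as follows (the function name, the tolerance used to decide when a weight counts as pruned, and the calling convention are all hypothetical):

```python
import numpy as np

def regularized_error(neg2_loglik, weights, tol=1e-8):
    """E(w) = -2 ln L(w) + n ln(sum |w_i|), where n counts the
    connections whose weights have not been pruned to zero."""
    w = np.abs(np.asarray(weights, dtype=float))
    n = int(np.sum(w > tol))           # surviving (non-zero) connections
    return neg2_loglik + n * np.log(np.sum(w))
```

Because n falls each time a weight is driven to zero, the optimizer is rewarded for pruning connections, which is how the architecture modifies itself during training.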

13.2.5 Integration

Error functions of neural networks are usually very complex. Commonly they have many local minima and it is very hard to identify any unique global minimum. Each time the network is run, with different initial values for weights and biases, a different optimum will be achieved. So it is standard practice to run the network many times and to take some type of average over the results.

The form of average taken will depend on the type of output produced. For example, if outputs are price predictions an equally weighted arithmetic average of the results would be appropriate. But if the outputs are parameters of a distribution other types of integration over local minima may be appropriate. For example, variance parameter outputs could be averaged using the formula (10.12) for the variance of a mixture with each probability equal to 1/N, where N is the number of different local optima in the average.
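Assuming formula (10.12) is the usual identity for the variance of a mixture, the equally weighted averaging of variance outputs over N local optima can be sketched as follows (the function name is hypothetical):

```python
import numpy as np

def average_variance(means, variances):
    """Combine (mean, variance) outputs from N local optima by treating them
    as an equally weighted mixture (each p_i = 1/N):
        var = (1/N) * sum(sigma_i^2 + mu_i^2) - ((1/N) * sum(mu_i))^2
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    return np.mean(variances + means ** 2) - np.mean(means) ** 2
```

Note that when the mean outputs from the different optima disagree, the mixture variance exceeds the simple average of the variances, reflecting the extra uncertainty from the disagreement between local optima.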

13.3 Price Prediction Models Based on Chaotic Dynamics

One of the characteristics of a chaotic dynamical system is that points on neighbouring trajectories grow apart from each other exponentially, at least in the short term. Think of the Lorenz butterfly, or some other strange attractor. If two points are distinct but infinitesimally close, they will be indistinguishable at time zero, but some time in the future they could be very far apart on opposite wings of the attractor. Two points that were virtually indistinguishable nevertheless lie on different trajectories at the start; so as time passes they will diverge very rapidly.
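This sensitivity to initial conditions can be illustrated numerically. The following is a rough sketch, not a careful integration: a simple Euler discretization of the Lorenz equations with the classical parameter values, where the step size, initial point and perturbation size are all illustrative choices.

```python
import numpy as np

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One Euler step of the Lorenz equations with the classical parameters."""
    x, y, z = state
    return state + dt * np.array([sigma * (y - x),
                                  x * (rho - z) - y,
                                  x * y - beta * z])

# Two trajectories starting a distance 1e-9 apart
a = np.array([1.0, 1.0, 20.0])
b = a + np.array([1e-9, 0.0, 0.0])
for _ in range(2500):               # 25 time units
    a, b = lorenz_step(a), lorenz_step(b)
separation = np.linalg.norm(a - b)  # grows by many orders of magnitude
```

After a modest number of steps the two trajectories, initially indistinguishable, end up a macroscopic distance apart on the attractor, which is exactly the exponential divergence described above.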

What are the implications if deterministic chaos rather than stochastic models govern a financial market? Very short-term predictions should be more accurate than long-term predictions, but is there anything else of practical use? Of course it would be truly wonderful if, following any positive finding for chaotic dynamics, the equations that govern the attractor could be specified. But this is extremely unlikely since the dimension of any attractor is going to be very high. In any case, financial markets are not likely to be governed by a set of purely deterministic equations. The stochastic components will be dominant, particularly in the high-frequency data that are usually necessary for the statistical tests of chaos that will be described below. Any evidence of chaos would simply tell us that a minuscule error of measurement in the data will produce large prediction errors except for very short-term forecasts. But this could be investigated directly, with or without a finding of chaotic dynamics. So although there is a large literature on the evidence for and against deterministic chaos in financial markets, it is not clear that this research is of any practical use.

However, there are a number of modelling techniques that are based on chaos theory that do have useful applications for prediction purposes. The aim of this section is to present a few of the concepts from the theory of chaos that have been used as a basis for some of the more successful high-frequency data prediction models in financial markets.

"What are the implications if deterministic chaos rather than stochastic models govern a financial market?

Any evidence of chaos would simply tell us that a minuscule error of measurement in the data will produce large prediction errors

13.3.1 Testing for Chaos

The concept of embedding a time series in a higher-dimensional space goes back to a theorem of Takens (1981). Takens showed that it is possible to


