TC Watch > Model-Based TC Signal Probabilities - Methodology [Refresh] [中文]
Methodology |
This document explains the methodology and conceptual ideas of the statistical model fitting procedure. Note: This text contains mathematical symbols and requires javascript to be displayed properly.
Click here to view the product, or here for a guide on its use.
1. The concept of statistical regression
Tropical cyclones (TCs) close to the territory are more likely to necessitate TC signals. The same goes for TCs with higher intensities as they are more destructive (i.e., generate stronger winds) and generally affect a wider area. Our analysis is based on regression, a statistical technique that allows one to attribute the level of a response to possible explanatory variables. In this case, the response variable is whether a particular signal is issued (a binary or 0-1 variable), and some of the potential explanatory variables are the geographic coordinates and the strength of the TC. This relationship can be written as follows:
\begin{equation} (\textrm{Signal status at time }t) \sim (\textrm{Geographic coordinates at time }t) + (\textrm{Strength at time }t), \label{eq:regeqn1} \end{equation}where the tilde ($\sim$) can be informally translated as "is related to" and the plus sign simply means "and". Problems of this sort, where the response is a binary variable, are typically dealt with via the logistic regression. Rather than using the binary variable directly, logistic regression models the probability of signal issuance, i.e., the probability that the binary variable takes the value of TRUE (or FALSE). Therefore, the regression model \eqref{eq:regeqn1} yields a signal probability that is a function of the geographic coordinates and the strength of a TC, and this makes intuitive sense.
2. The importance of signal and track history
Model \eqref{eq:regeqn1} assumes that factors other than the TC's current status are irrelevant in predicting the possibility of issuing TC signals. This assumption is unrealistic and we must take past information into account. It is perhaps obvious that the past signal status has heavy influence on current signal status, as TC signals are typically in force over a period of time during cyclone approach, without multiple issuances and cancellation in between. Meanwhile, a TC's past track contains information on its direction and speed of movement, and is important in determining whether the TC is approaching or receding from the territory. With these factors in mind, a more reasonable regression model can be formulated as follows:
\begin{eqnarray} (\textrm{Signal status at time }t) & \sim & (\textrm{Geographic coordinates at time }t) + (\textrm{Strength at time }t) \nonumber \\ & & + (\textrm{Geographic coordinates at time }t-k) + (\textrm{Signal status at time }t-k) \label{eq:regeqn2} \end{eqnarray}The variable $k$ is the time interval, chosen to be 6 hours for the analysis of the signals 1, 3 and 8, and 2 hours for the analysis of signals 9 and 10. For the lower signals, a time interval of 6 hours is short enough to capture most signal changes and long enough to yield a representative proportion of signal changes (between now and 6 hours before) in the data set. A shorter time interval is needed for the higher signals as they are typically in force for shorter durations, and an interval of 6 hours will likely miss some issuances of those signals.
Note that Model \eqref{eq:regeqn2} can be regarded as a time series model in the signal status. Theoretically this makes sense as this modified model assumes only conditional independence of the successive current signal statuses given the respective statuses at time $t-k$.
3. Using penalized splines for flexible modelling
Both Models \eqref{eq:regeqn1} and \eqref{eq:regeqn2} are additive, implying that a change of a fixed unit in one explanatory variable results in a fixed change in the (transformed) probability of signal issuance, holding other variables constant. For example, this implies that the increase in signal probability^{1} from a geographical coordinate of (20°N, 118°E) to (20°N, 115°E) is the same as that from (20°N, 115°E) to (20°N, 112°E). In practice, the probability is a non-linear function of geographical coordinates and a representative model must make such provisions; this is where splines can prove useful.
Regression splines involves the specification of knots, the number of which relates to the flexibility of the model. The larger the number of knots, the more flexible is the model. Splines allow non-linearities to be modelled as they adapt to the shape of the trend. For a better description of the data, we require the use of 2- or 3-dimensional splines to allow sufficient flexibility. A 2-dimensional spline describes the effect of longitude and latitude on the signal probability, taking account interaction effects (i.e., the effect that the change in probability due to a constant change in latitude is different at various longitudes, and vice versa). A 3-dimensional spline works jointly on latitude, longitude and strength in a similar fashion.
There is a price to pay for such flexibility. Because of the large number of parameters in the resulting models, they will likely overfit the data. This means that the fitted surface is too adapted to the noise (randomness) in the data, hindering the trend or average effect we want to capture. This problem can be alleviated by fitting penalized splines: with this method, a fitted spline is penalized by the degree of wiggliness (or adaptiveness) of the fit, in an attempt to discourage overfitting. When penalized splines are used, the fitted surface is typically smoother and does not respond to an occasional outlier as much as one that does not impose penalty.
^{1} It is actually a transformation of the signal probability that has a constant change in response to a change in the explanatory variable.
4. Data structure and candidate models
The data set contains TC location, strength (based on the 10-minute average standard) and signal information at 6-hourly intervals from 1961 to 2014, for a total of 6,389 observations. Observations outside a rectangular region with negligible issuance probability are excluded. An excerpt is given below, with the observations corresponding to Super Typhoon Usagi in 2013.
Name | Year | Month | Day | Time (UTC) |
Current | 6 hours ago | ||||||
Lat | Long | Strength (kt) | Signal | Lat | Long | Strength (kt) | Signal | |||||
USAGI | 2013 | 9 | 21 | 0 | 20.7 | 121.7 | 105 | 0 | 20.4 | 122.5 | 110 | 0 |
USAGI | 2013 | 9 | 21 | 6 | 20.8 | 120.7 | 105 | 1 | 20.7 | 121.7 | 105 | 0 |
USAGI | 2013 | 9 | 21 | 12 | 21.0 | 119.7 | 95 | 1 | 20.8 | 120.7 | 105 | 1 |
USAGI | 2013 | 9 | 21 | 18 | 21.4 | 118.8 | 90 | 3 | 21.0 | 119.7 | 95 | 1 |
USAGI | 2013 | 9 | 22 | 0 | 21.7 | 118.0 | 90 | 3 | 21.4 | 118.8 | 90 | 3 |
USAGI | 2013 | 9 | 22 | 6 | 22.4 | 116.8 | 90 | 3 | 21.7 | 118.0 | 90 | 3 |
USAGI | 2013 | 9 | 22 | 12 | 22.8 | 115.4 | 85 | 8 | 22.4 | 116.8 | 90 | 3 |
USAGI | 2013 | 9 | 22 | 18 | 23.1 | 113.9 | 70 | 8 | 22.8 | 115.4 | 85 | 8 |
USAGI | 2013 | 9 | 23 | 0 | 23.7 | 112.6 | 40 | 8 | 23.1 | 113.9 | 70 | 8 |
USAGI | 2013 | 9 | 23 | 6 | 24.3 | 111.2 | 25 | 0 | 23.7 | 112.6 | 40 | 8 |
A similar data set with 1,234 observations at 2-hourly intervals is used for the modelling of probabilities for signals 9 and 10.
The models considered are as follows:
Model | Equation | Number of parameters |
M0 | \begin{eqnarray*} (\textrm{Signal status at time }t) & \sim & (\textrm{Spline on lat-long-str at time }t) + (\textrm{Signal status at time }t-6) \end{eqnarray*} | 76 |
M1 | \begin{eqnarray*} (\textrm{Signal status at time }t) & \sim & (\textrm{Spline on lat-long-str at time }t) + (\textrm{Spline on radial speed at time }t) \\ & & + (\textrm{Signal status at time }t-6) \end{eqnarray*} | 85 |
M2 | \begin{eqnarray*} (\textrm{Signal status at time }t) & \sim & (\textrm{Spline on lat-long-str at time }t) + (\textrm{Spline on lat-long-str at time }t-6) \\ & & + (\textrm{Signal status at time }t-6) \end{eqnarray*} | 150 |
M3 | \begin{eqnarray*} (\textrm{Signal status at time }t) & \sim & (\textrm{Spline on lat-long-str at time }t, \textrm{for each signal status at time }t-6) \\ & & + (\textrm{Spline on lat-long-str at time }t-6) \end{eqnarray*} | 224 |
M4 | \begin{eqnarray*} (\textrm{Signal status at time }t) & \sim & (\textrm{Spline on lat-long-str at time }t, \textrm{for each signal status at time }t-6) \\ & & + (\textrm{Spline on lat-long-str at time }t-6, \textrm{for each signal status at time }t-6) \end{eqnarray*} | 298 |
M5 | \begin{eqnarray*} (\textrm{Signal status at time }t) & \sim & (\textrm{Spline on lat-long at time }t, \textrm{for each signal status at time }t-6) \\ & & + (\textrm{Spline on lat-long at time }t-6) + (\textrm{Spline on strength at time }t) \end{eqnarray*} | 82 |
Model M0 acts as a control or baseline upon which the improvement of other models is assessed. Note that Models M0-M4 involve 3-dimensional splines on the current (or 6 hours before) latitude, longitude and strength of the TC, while model M5 involve only 2-dimensional splines. For model M1, the prior TC location is incorporated in the form of radial speed. Models M3-M5 have one or more components that construct a spline for each level of the signal status at time $t-6$ (in hours); this effectively doubles the number of parameters. As the (linear) effect of the signal status is implicitly included in such a construction, there is no separate term on the signal status at time $t-6$. Models M0 to M4 have an increasing number of parameters, while Model M5 contains more parameters than M0 but fewer than the rest.
The above models are fitted for each signal using the statistical software R. The input data set for the fitting of each signal uses a binary response (for each observation) that indicates whether a certain signal (or above) is issued at that time point. Because penalized splines are used, the magnitudes of the parameters are constrained and thus the effective flexibility is less than that indicated by the number of parameters listed above. The magnitudes of the penalities are determined via cross-validation, as detailed in the next section.
Signals 9 and 10 are treated in a similar fashion, except that all $t-6$ variables are here $t-2$ instead.
5. Cross-validation and model verification
With penalized splines, a large penalty on the adaptiveness of the fit leads to an overly smooth fitted surface, while a small penalty may capture noise rather than the trend. The "sweet spot" can be determined by means of cross-validation (CV). In performing a $k$-fold CV, the data set is randomly split into $k$ portions of (roughly) equal size. A model is fitted first to the data set with the first portion withheld, and the fitted model is used to obtain predictions for observations in the withheld portion. This is repeated for each partition, resulting in $k$ fits and $k$ sets of predictions. The predicted signal status (using a decision rule so that probabilities larger than a certain threshold result in a predicted TRUE value) are compared to the observed values of the response, obtaining a misclassification rate. The misclassification rate resulting from the CV procedure gives a more accurate estimate of the model prediction error than the one obtained by the model fit that uses the whole data set, as the CV procedure is constructed to ensure that the observations used in the fit are not involved in the prediction process.
Within the fitting algorithm, a CV is carried out to select the optimal penalty (which results in the smallest misclassification rate) for the regression modelling using penalized splines. However, the data set contains a relatively small proportion of TRUEs (signal issued) especially for the higher signals. To compare the quality of the fitted models, it may be more suitable to other metrics that focus on whether these relatively rare observations are correctly predicted. Let the matrix of observed and predicted signal status be as follows:
Counts | Observed FALSE | Observed TRUE | Total |
Predicted FALSE | $X_{FF}$ | $X_{FT}$ | $X_{F+}$ |
Predicted TRUE | $X_{TF}$ | $X_{TT}$ | $X_{T+}$ |
Total | $X_{+F}$ | $X_{+T}$ | $X_{++}$ |
Then these metrics are defined as:
A model is better if it has a higher CSI or HIT, or a lower FAR. For each signal and fitted model, we evaluate these three metrics based on a training/test set approach: 2/3 of the data are randomly assigned as the training set, and the remaining 1/3 assigned as the test set. Models are fitted to the training set and are used for prediction of the test set. We carry out this procedure three times to obtain an average value for each metric in each combination, using a decision rule that assigns a prediction as TRUE if the probability is larger than 0.5. The verification table is as follows, with the "best" numbers (largest CSI/HIT; smallest FAR) in each column highlighted:
Model | Signal 1+ | Signal 3+ | Signal 8+ | Signal 9+ | Signal 10 | ||||||||||
CSI | HIT | FAR | CSI | HIT | FAR | CSI | HIT | FAR | CSI | HIT | FAR | CSI | HIT | FAR | |
M0 | 0.79 | 0.84 | 0.06 | 0.70 | 0.75 | 0.10 | 0.63 | 0.68 | 0.10 | 0.63 | 0.72 | 0.17 | 0.69 | 0.81 | 0.17 |
M1 | 0.80 | 0.84 | 0.06 | 0.72 | 0.77 | 0.08 | 0.61 | 0.66 | 0.10 | 0.63 | 0.71 | 0.16 | 0.69 | 0.80 | 0.17 |
M2 | 0.80 | 0.85 | 0.07 | 0.72 | 0.78 | 0.09 | 0.62 | 0.67 | 0.11 | 0.67 | 0.75 | 0.13 | 0.68 | 0.78 | 0.15 |
M3 | 0.78 | 0.82 | 0.06 | 0.70 | 0.75 | 0.09 | 0.53 | 0.57 | 0.12 | N/A | N/A | ||||
M4 | 0.78 | 0.82 | 0.05 | 0.70 | 0.75 | 0.07 | 0.50 | 0.53 | 0.11 | N/A | 0.70 | 0.85 | 0.20 | ||
M5 | 0.79 | 0.83 | 0.06 | 0.70 | 0.76 | 0.10 | 0.60 | 0.62 | 0.06 | 0.70 | 0.75 | 0.09 | 0.78 | 0.86 | 0.10 |
One overall model is to be selected for signals 1/3/8, and based on this table we adopt Model M2 as it generally achieves the best CSI and HIT, while its FAR is comparable to the other models. For signals 9/10, Model M5 is the clear winner. We note that Models M3 and M4 are not stable on the data set for signals 9/10 due to their larger number of parameters, and there were instances where the model fitting procedure could not be completed.
6. Interpretation based on the fitted models
We illustrate the use of the fitted models using Model M2 for the signal 1/3/8 cases. For this model, we predict the probability that the signal 1 (or above) is issued at a particular time ($t$) given the following input (explanatory) variables:
We predict the probability at a grid of (time $t$) latitude and longitude values to obtain a fitted surface. In the first example below, we assume that the TC was at 18°N, 119°E (indicated by a pink circle in the plots in Figure 1) 6 hours ago, with an intensity of 55 knots (102 km/h, severe tropical storm). The middle panel of Figure 1 shows the probability surface assuming that the signal was not issued 6 hours ago and the TC's intensity remained unchanged between time $t-6$ and $t$. Because the TC is situated southeast of Hong Kong, it is an intuitive result that the probability of signal issuance is higher in the northwestern quadrant of the plot, indicating that the TC has moved closer to the territory over the 6-hour interval. However, care must be exercised when interpreting the plot. The black arrows in Figure 1 show all of the available historical 6-hour track segments that originate from within 0.5 degrees latitude and longitude of 18°N, 119°E. We observe that, within a period of 6 hours, the range of a typical TC is no more than 300 km. Therefore, the probability surface is the most relevant for geographic coordinates within 300 km of the pink circle. The predicted values outside this range are considered extrapolated probabilities and are subject to larger errors. The accuracy of the predictions are therefore directly related to the amount of historical information available at that location.
Figure 1 - Probability surfaces mentioned in the text above and below. The inner and outer arcs are respectively the 400-km and 800-km radii from Hong Kong (at 22.3°N, 114.2°E), and an outline of the northern Philippines can be seen in the bottom right part of each panel. Click on each plot to view at its original size.
The left panel of Figure 1 is the same as the middle panel except that we assume the signal was already in force 6 hours ago. We see that the red area (corresponding to high probabilities) is much larger. This makes sense as the signal is much more likely to remain in force if it is already issued some hours ago and the TC is approaching. Note that, although the probability values remain high near the southern edge of the 800-km radius circle, these values are not representative due to the lack of historical information on such paths, i.e., TCs moving southwest at high speeds. Finally, the right panel is the same as the middle panel except that we assume the TC strengthened from 55 knots 6 hours ago to 70 knots (130 km/h, typhoon). We can observe that the probabilities are generally higher in this case, but not as high as when the signal has already been issued.
The second example concerns the probability of signal 8+ for a 70-knot typhoon (130 km/h) that was situated at 20°N, 114°E 6 hours ago. Again, the middle panel of Figure 2 depicts the probability surface assuming that the signal (8 or above) was not in force at time $t-6$ and that the TC's strength remained constant; it shows that a change of signal from 3 or below to 8 or above within these 6 hours is not high, unless the TC heads north which then poses a much greater threat to the territory. The left panel assumes the signal was already in force at time $t-6$, and in this case it is very likely for the signal 8 (or above) to remain in force. The right panel assumes the signal was not in force at time $t-6$ and the TC weakened from 70 to 50 knots (93 km/h, severe tropical storm), resulting in smaller probabilities than those for the middle panel. Similar to Figure 1, the arrows correspond to actual observations that originate within 0.5 degrees latitude and longitude of 20°N, 114°E, and they provide us with a general idea of the range of probabilities that are realistic in practice.
Figure 2 - Probability surfaces mentioned in the text above. The inner and outer arcs are respectively the 200-km and 400-km radii from Hong Kong (at 22.3°N, 114.2°E). Click on each plot to view at its original size.
There is a similar interpretation based on Model M5 for signals 9/10; this model requires the same input variables but the time interval is 2 hours instead of 6.
7. From single- to multi-period probabilities
So far we considered single-period probabilities, i.e., predicting the outcome given information 6 hours ago (or 2 hours ago for signals 9/10). However, given an entire track forecast, we can compute multi-period probabilities recursively. We demonstrate this idea using our track forecast for Super Typhoon Haima on October 19, 2016, as follows:
For consistency, we convert the intensities to 10-minute averages before computing the probabilities. The 6-hourly locations and intensities are obtained using linear interpretation. Based on our fitted Model M2, we obtain the following 6-hourly probabilities for signal 3+, assuming that the signal was not in force or was in force 6 hours ago:
Single-period issuance probabilities between times $t-6$ and $t$ | |||||||||||||
Time $t$ (hr) | 6 | 12 | 18 | 24 | 30 | 36 | 42 | 48 | 54 | 60 | 66 | 72 | |
Signal 3+ |
Not issued at $t-6$ | 0.000 | 0.004 | 0.014 | 0.062 | 0.206 | 0.648 | 0.889 | 0.981 | 0.523 | 0.152 | 0.013 | 0.000 |
Issued at $t-6$ | 0.000 | 0.334 | 0.658 | 0.898 | 0.972 | 0.996 | 0.999 | 1.000 | 0.993 | 0.960 | 0.634 | 0.052 |
Let $Y_t$ be the binary or 0-1 variable that represents whether signal 3 (or above) is in force at time $t$. Then, using the law of total probability, the multi-period "in-force" probability can be calculated as follows:
\begin{eqnarray*} \mathbb{P}(Y_6 = 1) & = & \mathbb{P}(Y_6 = 1 | Y_0 = y_0); \\ \mathbb{P}(Y_6 = 0) & = & 1-\mathbb{P}(Y_6 = 1); \\ \mathbb{P}(Y_t = 1) & = & \mathbb{P}(Y_t = 1 | Y_{t-6} = 0)\mathbb{P}(Y_{t-6} = 0) + \mathbb{P}(Y_t = 1 | Y_{t-6} = 1)\mathbb{P}(Y_{t-6} = 1); \\ \mathbb{P}(Y_t = 0) & = & 1-\mathbb{P}(Y_t = 1) \end{eqnarray*}for $t=6,12, \ldots ,72$, where $y_0$ is the (observed) signal status at time 0 and the pipe ($|$) means "given that". These multi-period probabilities are thus dependent on whether the signal is in force at the time of the track forecast.
Meanwhile, one may be interested in the probability that the signal is first issued (cancelled) between times $t-6$ and $t$ given that it has not yet been issued (cancelled). This can also be conveniently obtained as follows:
\begin{eqnarray} \mathbb{P}(Y_{t} = 1, Y_{t-6} = 0, \ldots , Y_6 = 0 | Y_0 = 0) & = & \mathbb{P}(Y_{t} = 1 | Y_{t-6} = 0) \prod_{i=6}^{t-6} \mathbb{P}(Y_{i} = 0 | Y_{i-6} = 0); \label{eq:firstissue}\\ \mathbb{P}(Y_{t} = 0, Y_{t-6} = 1, \ldots , Y_6 = 1 | Y_0 = 1) & = & \mathbb{P}(Y_{t} = 0 | Y_{t-6} = 1) \prod_{i=6}^{t-6} \mathbb{P}(Y_{i} = 1 | Y_{i-6} = 1) \label{eq:firstcancel}\\ \end{eqnarray}for $t=6,12, \ldots ,72$. Note that only one of equations \eqref{eq:firstissue} or \eqref{eq:firstcancel} is relevant, as $Y_0$ has already been observed. Note also that \eqref{eq:firstissue} is not the same as the probability of signal issuance between times $t-6$ and $t$, that is $\mathbb{P}(Y_{t} = 1 | Y_{t-6} = 0) \mathbb{P}(Y_{t-6} = 0)$, as the signal may have been in force prior to time $t-6$, although this rarely happens unless the TC moves erratically.
Using these equations, we obtain multi-period probabilities based on the single-period probabilities listed in the above table. We have $y_0 = 0$ as the signal was not in force at time 0. The "in force" and "first issuance" probabilities are:
Multi-period probabilities | |||||||||||||
Time $t$ (hr) | 6 | 12 | 18 | 24 | 30 | 36 | 42 | 48 | 54 | 60 | 66 | 72 | |
Signal 3+ |
In force at time $t$ | 0.000 | 0.004 | 0.017 | 0.076 | 0.264 | 0.740 | 0.970 | 0.999 | 0.993 | 0.954 | 0.605 | 0.032 |
First issued between times $t-6$ and $t$ | 0.000 | 0.004 | 0.014 | 0.061 | 0.189 | 0.474 | 0.229 | 0.028 | 0.000 | 0.000 | 0.000 | 0.000 |
8. Accounting for the track variability
The above table gives the predicted probabilities assuming the forecast track is correct. However TCs do not move along fixed paths and such probabilities are not very meaningful unless we take track uncertainty into account. Making use of the empirical distribution of our historical track error, we generate "similar" tracks that have the same overall shape as the forecast track, but perturbed in different directions. Using the example of Super Typhoon Haima above, the simulated tracks look like the following:
There are a total of 80 simulated tracks in addition to the forecast track in the middle. At each time point, the simulated track positions form 5 concentric rings, representing the 10th, 30th, 50th, 70th and 90th percentiles of the forecast error distribution. For each simulated track, we compute the multi-period probabilities as in Section 7. They are then averaged to arrive at the weighted probabilities that reflect the potential track error. Because the distances between the perturbed locations and the forecast track location are equally-spaced quantiles of the error distribution by construction, a simple average suffices: Let $W$ be a random variable (in this case, the distance from the forecast position) with density function $f_W(w)$ and distribution function $F_W(w)$. Then the expected value of $g(W)$, where $g$ is a function (in this case it maps the distance $W$ to a probability as determined by the fitted model), is given by
\begin{equation} \mathbb{E}[g(W)]=\int_0^\infty g(w) f_W(w) \, \textrm{d}w = \int_0^\infty g(w) \, \textrm{d}F_W(w) = \int_0^1 g(F^{-1}(u)) \,\textrm{d}u, \label{eq:expectquant} \end{equation}where the inverse of $F$ is simply the quantile function, which we replace by the sample quantile. Note, however, that there are multiple points whose distance from the forecast location is the same; we thus consider all directions in the above track simulation and take the average probabilities for each quantile $u$. The end result is that we take a simple average of all probabilities obtained from the simulated tracks for each time point. Our calculation is only an approximation to the integral \eqref{eq:expectquant} as we choose only 5 quantile points. However, experimentation suggests that the approximation is decent and further increase in the number of quantile points and the number of simulated tracks does not result in a large deviation in the average probability.
For the example above, the average probabilities are listed in the table below; these numbers were those reported in the live bulletin:
Averaged multi-period probabilities taking track variability into account | |||||||||||||
Time $t$ (hr) | 6 | 12 | 18 | 24 | 30 | 36 | 42 | 48 | 54 | 60 | 66 | 72 | |
Signal 3+ |
In force at time $t$ | 0.000 | 0.004 | 0.020 | 0.110 | 0.318 | 0.632 | 0.853 | 0.915 | 0.869 | 0.741 | 0.448 | 0.135 |
First issued between times $t-6$ and $t$ | 0.000 | 0.004 | 0.017 | 0.092 | 0.210 | 0.314 | 0.223 | 0.072 | 0.005 | 0.001 | 0.000 | 0.000 |
For more information on interpreting the plots and tables we issue, please visit this page.
Last Accessed: Mon Sep 28 2020 19:21:59 HKT
Last Modified: Sun May 07 2017
Statistics |