# Diminishing the Dengue Danger: Predicting future dengue outbreaks using Machine Learning on historical dengue and climate data in Singapore

# Introduction

Dengue fever has been identified by the World Health Organisation (WHO) as the most critical mosquito-borne disease globally. In the last few decades, the world has seen a 30-fold increase in the global incidence of dengue fever. People living in tropical and subtropical climates are most vulnerable to dengue, which puts about half the world’s population at risk. There are four dengue virus serotypes (or ‘strains’), hence a person can be infected by the dengue virus up to four times. With no specific treatment for dengue, early detection to prevent the breeding of *Aedes* mosquitoes that carry the virus is the most effective method to reduce dengue outbreaks.

Ever since the dengue fever epidemic hit Singapore almost a decade ago, the ‘5-step Mozzie Wipe-Out’ campaign launched by the National Environment Agency (NEA) has been known to every Singapore resident. Indeed, dengue outbreaks and spikes in dengue cases have been intermittently publicised in Singapore, especially given our equatorial climate that is highly conducive to mosquito breeding. There has also been passive monitoring of stagnant water that could serve as a breeding ground for these deadly pests. Dengue fever came into the spotlight in Singapore and became a topic of everyday discourse in 2005, when there were more than 14,000 dengue cases in total. The high influx of dengue patients created a shortage of hospital beds. Since then, the dengue virus has been an ongoing public issue, with more than 16,000 reported cases in 2019.

# Objective

The objective of our project is to predict the number of dengue cases based on (1) rainfall and (2) temperature measurements at various locations in Singapore, (3) population growth, and (4) time series effects. Our models forecast the weekly number of dengue cases up to eight weeks into the future. We decided that predicting a weekly outcome is preferable to a daily outcome because it reduces the variation in the outcome values, and because dengue case counts are provided to us by week. It also greatly reduces the computing power required for machine learning, which translates into saving time. Using Decision Trees and Neural Networks, we aim to produce a robust prediction model that forecasts the number of dengue cases eight weeks ahead.

# Our Dataset

We were able to leverage a substantial amount of quality data to bring this project to fruition, thanks to the various Singapore ministries that collected important and relevant data. In Singapore, it is mandatory for all medical practitioners and clinical laboratories to report to the Ministry of Health (MOH) all clinically suspected and laboratory-confirmed dengue cases within 24 hours of diagnosis (as per Section 7 of the Infectious Diseases Act).

Our full dataset consists of weekly panel data from 13 August 2011 to 23 November 2019 (433 weeks). We picked the following features as the variables that determine the forecast of dengue cases: weekly rainfall (50 locations) and temperature (15 locations) data from various locations in Singapore, and the yearly population of Singapore. The yearly population data was interpolated to the individual weeks within each year so that it could be processed together with the other weekly data.
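As a rough sketch, weekly population values can be obtained by linear interpolation between yearly figures. The function name `weekly_population`, the choice of 1 July as the anchoring date for each yearly figure, and the numbers in the usage line are illustrative assumptions.

```python
from datetime import date, timedelta


def weekly_population(yearly, start, end):
    """Linearly interpolate yearly figures onto weekly dates from start to end."""
    # Assumption: each yearly figure is anchored at 1 July of that year.
    anchors = sorted((date(year, 7, 1), value) for year, value in yearly.items())
    weeks = []
    day = start
    while day <= end:
        if day <= anchors[0][0]:
            value = anchors[0][1]   # before the first anchor: hold flat
        elif day >= anchors[-1][0]:
            value = anchors[-1][1]  # after the last anchor: hold flat
        else:
            # Find the two yearly anchors that bracket this week.
            for (d0, v0), (d1, v1) in zip(anchors, anchors[1:]):
                if d0 <= day <= d1:
                    fraction = (day - d0).days / (d1 - d0).days
                    value = v0 + fraction * (v1 - v0)
                    break
        weeks.append((day, value))
        day += timedelta(weeks=1)
    return weeks


# Made-up figures, purely for illustration.
series = weekly_population({2011: 100.0, 2012: 200.0}, date(2011, 7, 1), date(2012, 7, 1))
```

Each weekly row then carries its own population value, so the population column lines up with the weekly rainfall, temperature and dengue columns.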

The dependent variable that we want to predict is the number of dengue cases per week, up to 8 weeks into the future.

The rationale behind using these data is as follows:

- Weather data (i.e. rainfall and temperature): these are the factors that affect the breeding of *Aedes aegypti* and *Aedes albopictus* mosquitoes, and hence are related to the risk of someone contracting dengue fever. High temperatures provide a favourable and conducive environment for mosquitoes to breed, and also make their feeding behaviour more aggressive. High rainfall, meanwhile, makes it more likely that stagnant water accumulates and is left unchecked, such as in the roofs of private residential homes, which creates perfect conditions for mosquitoes to breed.
- Population data: we also included the population trend of Singapore, to investigate whether changes in population size lead to significant changes in the number of dengue cases over the years. We hope to be able to account for the portion of the increase in dengue cases over the years that is due to an increase in population.

# Data Pre-processing Methodology

*Working with highly correlated data*

Looking at the correlation plots in Figure 4 above, we observed that many of the rainfall and temperature stations are highly correlated with one another. This makes sense, considering that Singapore’s small land area means most parts of the island share similar atmospheric conditions. The correlation results are also logical from a geographical point of view, given that locations such as Dhoby Ghaut and Somerset (both located in the southern area of Singapore) have very similar rainfall trends. It is worth noting that a significant proportion of the rainfall and temperature data was missing in the original dataset (which dates back to the year 2000). This is why we narrowed the range of data that we use down to eight years. We then replaced the missing rainfall and temperature values within this 8-year range based on the correlation method.

*Feature Extraction to create Dendrogram clusters*

In view of the high correlation among temperature and rainfall features respectively, we want to reduce the number of temperature and rainfall features. However, we still want to retain our model’s ability to explain the variability in dengue cases. Thus, we conducted dimensionality reduction through feature extraction to handle highly correlated features and to prevent overfitting.

We built two dendrograms, one each for weekly rainfall and temperature. The dendrograms were based on the pairwise-correlation distance (1 minus correlation) between two locations. From the results, we derived **11 rainfall clusters and 4 temperature clusters** to represent the rainfall and temperature trends in the different parts of Singapore.
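To make the idea concrete, here is a minimal sketch of clustering stations by correlation distance. It assumes average linkage and a fixed distance threshold for cutting the tree; both are assumptions made for this example, as are the function names, the station names and the readings.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equally long series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def correlation_clusters(series, threshold):
    """Merge stations bottom-up until the closest pair of clusters is
    further apart than `threshold` (distance = 1 - correlation)."""
    dist = {
        frozenset((a, b)): 1 - pearson(series[a], series[b])
        for a, b in combinations(series, 2)
    }
    clusters = [[name] for name in series]

    def linkage(c1, c2):
        # Average linkage (an assumption): mean pairwise distance.
        return mean(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        (i, j), closest = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if closest > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical weekly rainfall for three made-up stations.
rain = {
    "station_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "station_b": [2.0, 4.0, 6.0, 8.0, 11.0],
    "station_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
groups = correlation_clusters(rain, threshold=0.2)
```

Here `station_a` and `station_b` move almost in lockstep, so they end up in one cluster, while `station_c` stays on its own. Lowering the threshold yields more, smaller clusters.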

*Dealing with missing data*

There were patches of missing data in the dataset. For certain weather stations, rainfall and temperature data are unavailable for prolonged periods. Considering the localised nature of rainfall, we approximated the missing rainfall levels using the **rainfall values of the nearest meteorological stations (within the same cluster)**. We believed that doing so would give a more accurate representation of the rainfall/temperature trend than replacing the missing values with other statistics, such as the 8-year rainfall average of the respective station.
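A minimal sketch of this kind of gap filling is shown here. It assumes each station has planar coordinates and that a missing week (marked `None`) is copied from the nearest station in the same cluster that has a reading for that week; the function name, coordinates and readings are hypothetical.

```python
from math import hypot


def fill_from_nearest(readings, coords, cluster):
    """Fill None entries with the same-week value of the nearest
    station in `cluster` that has a reading for that week."""
    filled = {station: list(values) for station, values in readings.items()}
    for station in cluster:
        # Other stations in the cluster, nearest first.
        neighbours = sorted(
            (s for s in cluster if s != station),
            key=lambda s: hypot(coords[s][0] - coords[station][0],
                                coords[s][1] - coords[station][1]),
        )
        for week, value in enumerate(readings[station]):
            if value is not None:
                continue
            for other in neighbours:
                # Only original readings are used as donors, never filled ones.
                if readings[other][week] is not None:
                    filled[station][week] = readings[other][week]
                    break
    return filled


# Made-up readings and coordinates, for illustration only.
readings = {"a": [1.0, None, 3.0], "b": [1.5, 2.5, None], "c": [9.0, 9.0, 9.0]}
coords = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 0.0)}
filled = fill_from_nearest(readings, coords, ["a", "b", "c"])
```

If no station in the cluster has a reading for a given week, the gap is left as `None` so that it can be handled separately.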

*Feature Normalisation of rainfall, temperature, and other values*

To determine the weekly temperature or rainfall of each of the 11 rainfall clusters and 4 temperature clusters, we took the average of all the temperature or rainfall values within each cluster. This averaging is done for each week’s data across all stations within the cluster, producing cluster averages for both temperature and rainfall each week. To illustrate this, temperatures at Khatib and Sembawang meteorological stations form cluster 3. As such, cluster 3’s weekly temperatures will be the average weekly temperatures of these two stations.
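The averaging step can be sketched in a few lines. The function name is hypothetical, and the temperature values are made up for illustration.

```python
from statistics import mean


def cluster_weekly_means(readings, clusters):
    """Average each week's values across all stations in a cluster."""
    return {
        cluster_id: [mean(week) for week in zip(*(readings[s] for s in stations))]
        for cluster_id, stations in clusters.items()
    }


# Made-up weekly temperatures for the two stations in cluster 3.
temps = {"Khatib": [28.0, 27.0], "Sembawang": [27.0, 28.0]}
cluster_temps = cluster_weekly_means(temps, {3: ["Khatib", "Sembawang"]})
```

`zip(*...)` walks through the stations week by week, so each output value is the mean of that week's readings across the cluster.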

We then split our dataset into a training set (from 13 August 2011 until 17 March 2018) and a test set (24 March 2018 until 23 November 2019). We then normalised the values within each temperature or rainfall cluster, within the training set. The median and range used to normalise the training set were used to normalise the test set values as well. The formula used is as mentioned below.

We normalised the cluster values to minimise the magnitude of the values used, which will speed up neural network learning and also prevent any errors when running the neural network due to large numbers.
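A minimal sketch of this train-then-apply normalisation follows. It assumes the normalised value takes the form (value − median) / range, with the median and range computed on the training set only; that exact form, the function names and the numbers are assumptions.

```python
from statistics import median


def fit_scaler(train_values):
    """Return the median and range of the training values."""
    return median(train_values), max(train_values) - min(train_values)


def apply_scaler(values, centre, spread):
    """Normalise values with statistics fitted on the training set."""
    return [(v - centre) / spread for v in values]


# Made-up cluster values, for illustration only.
train = [10.0, 20.0, 40.0]
test = [50.0]

centre, spread = fit_scaler(train)
train_scaled = apply_scaler(train, centre, spread)
test_scaled = apply_scaler(test, centre, spread)  # reuses the training statistics
```

Reusing the training statistics keeps information from the test period out of the inputs. It also means test values can fall outside the range seen during training.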

*Feature Selection*

For the preliminary selection of features, we used regression trees utilising the XGBoost algorithm. We started our feature selection with an initial set of variables comprising:

- dengue levels at T+0 (i.e. where week = 0)
- the maximum and mean values of each rainfall cluster from periods T-3 (i.e. three weeks into the past) to T+0
- minimum values of each temperature cluster from periods T-12 to T-4

The idea behind using the mean of each rainfall cluster from periods T-3 to T+0 is that the rainfall over that window could predict a dengue outbreak eight weeks into the future. This range (T-3 to T+0) was selected after taking into account the life cycle of *Aedes* mosquitoes and the incubation time of the dengue virus (the amount of time the virus takes to cause symptoms in humans). It takes around three weeks from the time mosquito eggs come into contact with water, through their growth into adult mosquitoes, to the appearance of symptoms in an infected human. Importantly, this initial spike in the *Aedes* population could lead to an exponential increase in eggs being laid within the eight-week window of prediction, and would be useful in predicting a spike in dengue cases.

As for the temperature clusters, the minimum values of each temperature cluster from periods T-12 to T-4 were selected because we theorised that the lowest temperatures of the past 12 weeks could anticipate a period of higher temperatures eight weeks later. This is also because outbreaks traditionally occur in the June-October period, and June is typically the hottest month of the year. We want to anticipate an increase in temperature because it results in a shorter virus replication time within mosquitoes, i.e. it reduces the time mosquitoes take to become infectious to humans. This is further exacerbated by more aggressive feeding behaviour in the mosquitoes, which increases the probability of an infected mosquito infecting a human. In summary, higher temperatures *during the period of an outbreak* are an important predictor. We would have loved to look further back in time (i.e. T-15), but we could not do so for fear of having a severe shortage of test data.
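The lagged window features described above can be sketched as follows. The helper names and the toy series are hypothetical. Week `t` plays the role of T+0, and a lag of 3 means three weeks before it.

```python
from statistics import mean


def window(series, t, oldest_lag, newest_lag):
    """Values from week t - oldest_lag up to and including week t - newest_lag."""
    if t < oldest_lag:
        raise ValueError("not enough history for this window")
    return series[t - oldest_lag : t - newest_lag + 1]


def build_row(rain, temp, t):
    """Feature row for week t from one rainfall and one temperature cluster."""
    recent_rain = window(rain, t, 3, 0)   # T-3 to T+0
    older_temp = window(temp, t, 12, 4)   # T-12 to T-4
    return {
        "rain_max": max(recent_rain),
        "rain_mean": mean(recent_rain),
        "temp_min": min(older_temp),
    }


# Toy series, for illustration only.
rain = [float(week) for week in range(20)]
temp = [float(30 - week) for week in range(20)]
row = build_row(rain, temp, t=12)
```

Because the temperature window reaches back 12 weeks, the first 12 weeks of the dataset cannot produce a complete row. This is why looking even further back costs data.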

# Model Building

*Initial findings using Regression Trees*

We first optimised the regression tree parameters on the initial set of variables. We used a regression tree for feature selection simply because it is fast. The parameters most suitable for our data are stated below. Using these optimised values, we then began reducing the number of clusters so as to mitigate the curse of dimensionality when we move on to using neural networks. The reduction of clusters was performed as follows:

- Use the initial results (using all temperature and rainfall clusters) as the benchmark. The results to take note of are the mean-squared error (MSE), training predictions, test predictions, and the lagged correlations.
- Remove the variables belonging to the smallest rainfall clusters, which are situated at the extreme ends of Singapore, e.g. Clusters 1 and 2.
- Compare the results against the benchmark. If the model fares equally well or better without the removed variables, they are left out of the subsequent models to be tested, so as to keep the model as small as possible. If it fares worse than the previous results, that cluster is added back into the model.
- Should a run not be completed (i.e. the test loss line did not cross the training loss line), the number of rounds was increased.
- This was performed iteratively, and the same method was used to select the useful temperature clusters.
- Any variations of the final set of variables were added into the model for testing. Like before, if the results were equivalent or worse than the previous model tested, the variable was removed.
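The elimination loop above can be sketched as a simple greedy procedure. Here `evaluate` is a hypothetical callback standing in for "train the regression tree on these clusters and return the test loss". The toy loss function exists only to make the example runnable.

```python
def prune_clusters(candidates, evaluate):
    """Drop each cluster in turn; keep it out if the loss does not get worse."""
    kept = list(candidates)
    best = evaluate(kept)                  # benchmark with every cluster included
    for cluster in candidates:             # caller decides the order of removal
        trial = [c for c in kept if c != cluster]
        if not trial:
            continue                       # never evaluate an empty feature set
        loss = evaluate(trial)
        if loss <= best:                   # equal or better: leave the cluster out
            kept, best = trial, loss
        # otherwise the cluster stays in `kept`, i.e. it is "added back"
    return kept, best


# Toy stand-in for model training: only clusters c3 and c4 carry signal.
useful = {"c3", "c4"}


def toy_loss(clusters):
    return 1.0 - 0.5 * len(useful & set(clusters))


kept, loss = prune_clusters(["c1", "c2", "c3", "c4"], toy_loss)
```

Ties are resolved in favour of the smaller model, matching the aim of keeping the model as small as possible.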

The best parameters for the regression tree based on our initial set of variables are as follows: [number of rounds = 20; maximum depth of trees = 1250; learning rate = 0.1; number of parallel trees = 10; subsampling = 0.1; column sampling by trees = 1]. We then found our best set of features to be:

- Mean dengue levels of T-4 to T-1, T-8 to T-5 & T-9 to T-12
- Mean differentiated dengue values of T-4 to T-1 & T-9 to T-12
- Population levels at T+0
- Maximum and mean values of T-3 to T+0 of rainfall clusters 2, 4, 5, 6, 7, 8, 9, 10 and 11
- Mean and minimum values of T-7 to T-4 & T-12 to T-8 for temperature clusters 2, 3 and 4

The results of our regression tree model (with a squared error of *0.12095*) are as listed below:

*Deep Learning Model Building*

The metrics used for the deep learning model are similar to the metrics used in the regression tree model. At this point, it would be useful to mention that we decided to go without a validation set due to the small number of data points in our dataset. Instead, we rely solely on the Euclidean loss, as well as the test prediction plot, to determine whether our model is performing well. Furthermore, when we looked at the dengue levels in 2017–2018, they were mostly flat. As such, we thought that even if we had set aside a validation set, it would consist mostly of those values. This could give us a false impression of how well our model is actually doing, since the dengue levels in 2019 began fluctuating and even peaked. *Importantly, we might have optimised the model to fit a validation set which does not have a peak (which corresponds to an outbreak)*. In view of time, we decided to just go with the Euclidean loss and the test prediction. Of course, we took lag into account as well.

Our model performance benchmark is the persistence loss, which on our test dataset was calculated to be 0.015698. The score was calculated using the test set dengue values. These formed one column in our Excel spreadsheet, which serves as our T+0 values. In an adjacent column, we pasted the dengue values from T+8 onwards. This column serves as our T+8 values. The protruding ends of the T+0 and the T+8 columns were truncated. After that, we calculated the squared difference of the two columns in a new column. In a separate cell, we took the average of all the squared differences and divided that average by two to obtain the persistence loss.
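The same spreadsheet calculation can be sketched in a few lines of Python. The function name and the toy series are hypothetical; the steps mirror the description above.

```python
def persistence_loss(series, horizon=8):
    """Loss of the naive forecast 'the level in `horizon` weeks equals today's level'."""
    now = series[:-horizon]       # T+0 column, with the protruding tail truncated
    future = series[horizon:]     # T+8 column, with the protruding head truncated
    squared = [(f - n) ** 2 for n, f in zip(now, future)]
    return sum(squared) / len(squared) / 2   # mean squared difference, halved


# Toy series: flat at 0 for eight weeks, then flat at 2 for eight weeks.
toy = [0.0] * 8 + [2.0] * 8
loss = persistence_loss(toy)
```

A model is only useful if its loss is lower than this figure, since the persistence forecast requires no learning at all.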

# Deep Learning Methodology

We then shifted our focus to neural networks with the final set of features obtained from the regression tree stage. Perhaps the ability of neural networks to generalise well to more complex patterns would allow us to make more accurate predictions.

We ran an initial test on a simple 3-layer neural network with just 3 configurations: 32, 64 or 128 nodes. This was in view of time. We ran 10,000 iterations per configuration, using the Adam optimiser. The whole process was repeated 5 times to ensure that the initial weights did not lead to overly biased or overly optimistic results. Our training label was dengue levels at T+8.
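The repeat-and-compare procedure can be sketched independently of any deep learning library. Here `train_and_score` is a hypothetical callback that would train one network with a given configuration and random seed and return its test loss. The toy scoring function only makes the example runnable and has no bearing on real results.

```python
from itertools import product


def sweep(layer_counts, node_counts, optimisers, repeats, train_and_score):
    """Try every configuration `repeats` times and keep the lowest loss."""
    best = None
    for layers, nodes, optimiser in product(layer_counts, node_counts, optimisers):
        for seed in range(repeats):        # a different weight initialisation per repeat
            loss = train_and_score(layers, nodes, optimiser, seed)
            if best is None or loss < best[0]:
                best = (loss, layers, nodes, optimiser, seed)
    return best


# Toy stand-in for training: pretends one particular configuration is ideal.
def toy_score(layers, nodes, optimiser, seed):
    penalty = 0.0 if optimiser == "adam" else 1.0
    return abs(layers - 4) + abs(nodes - 64) / 64 + penalty + seed * 0.001


best = sweep([3, 4, 5], [32, 64, 128], ["adam", "sgd"], repeats=5,
             train_and_score=toy_score)
```

Repeating each configuration with several seeds makes it less likely that one lucky or unlucky initialisation decides which architecture looks best.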

The best results were:

- Test Loss: 0.0060406
- Number of inputs: 36
- Iterations: 10000
- Optimisation algorithm: Adam
- Number of perceptrons in topmost layer: 128
- Number of layers in neural network: 3

The results were decent, but we wanted better. Also, there was a second spike in the predictions which came right after the real spike, which was a false spike in dengue levels. We would want to remove or minimise this false spike as much as possible.

Through our second round of optimisation on neural networks, we experimented with larger neural networks which had 3, 4 or 5 layers. From there, we also discovered that one more feature improved our results significantly, which is the *second-order difference of dengue levels from T-4 to T-1*. As such, our final set of features is:

- Mean T-4 to T-1, T-8 to T-5 & T-9 to T-12 of dengue levels
- Mean first-order differentiated values of dengue levels at T-4 to T-1 & T-9 to T-12
- Mean second-order differentiated values of dengue levels at T-4 to T-1
- Population levels at T+0
- Maximum and Mean values of T-3 to T+0 for rainfall clusters 2, 4, 5, 6, 7, 8, 9, 10 and 11
- Maximum and Mean values of T-7 to T-4 & T-12 to T-8 for temperature clusters 2, 3 and 4
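For readers unfamiliar with differenced features, here is a minimal sketch. It assumes "differentiated values" means discrete week-on-week differences: first-order is this week minus last week, and second-order is the difference of those differences. The function names and the toy series are hypothetical.

```python
from statistics import mean


def diff(series):
    """Week-on-week differences; the result is one element shorter."""
    return [b - a for a, b in zip(series, series[1:])]


def mean_diff(levels, t, order, oldest_lag=4, newest_lag=1):
    """Mean of the `order`-th difference over weeks t - oldest_lag to t - newest_lag."""
    d = levels
    for _ in range(order):
        d = diff(d)
    # After differencing `order` times, d[k] describes week k + order.
    lo = t - oldest_lag - order
    hi = t - newest_lag - order
    if lo < 0:
        raise ValueError("not enough history for this window")
    return mean(d[lo : hi + 1])


# Toy dengue levels that grow quadratically: level at week w is w squared.
levels = [float(w * w) for w in range(10)]
slope = mean_diff(levels, t=8, order=1)       # average weekly change over T-4 to T-1
curvature = mean_diff(levels, t=8, order=2)   # average change of that change
```

The first-order feature tells the model whether cases are rising or falling, and the second-order feature tells it whether that rise is speeding up or slowing down.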

We also compared those results with results from using larger neural networks of 6, 7, and 8 layers. The node sizes tested in the larger neural networks were the same i.e. 32, 64 and 128. We also tried using the SGD optimiser instead of the Adam optimiser. However, simpler neural networks and the Adam optimiser still worked the best. We ran our final analysis with 20 repeats to get the best results as shown below:

- Test Loss: *0.00453716*
- Number of layers in neural network: 4
- Number of perceptrons in topmost layer: 64
- Number of inputs: 37
- Iterations: 10000
- Optimisation algorithm: Adam

By attaining a *loss score of 0.00453716*, our model beats the persistence loss of 0.015698 by around 0.011. In other words, our model's loss is less than a third (about 29%) of the persistence loss.

From our training prediction curve, it seems that our current set of variables does not allow the model to estimate the peaks present in this dataset. Training seems to work reasonably well on lower levels of dengue, but our model always falls short of predicting higher dengue levels when actual dengue levels peak. This suggests a lack of information within our model. Turning to the test prediction curve, the fit is actually quite good. The predicted values are not too far off from the actual values for low dengue levels, and the peak in dengue cases was reasonably predicted with a lag of 0 (which is awesome!). Although our model did not predict more dengue cases than the actual number, which would have been favourable, it was nevertheless able to predict a peak in dengue cases. The second, smaller peak that came after the initial peak should also be accounted for. We believe that there could have been a second peak, but that it did not happen due to more intervention by the NEA during that period, e.g. increased fogging, although this is just a theory.

# Potential risks of using our model

If there is a caveat to our model results, it is that our model does not seem capable of over-predicting (which is something that we wanted to achieve). The organisation using this model should, at its discretion, notify the relevant healthcare professionals about this tendency and suggest that they anticipate more cases than predicted. Further, as the predicted peak dengue values in the test set were lower than the actual dengue values, we cannot be certain that our model will be able to accurately predict dengue levels in the event of a very extreme outbreak. Perhaps there is a threshold of dengue cases up to which our model can predict, but we need more data to determine if such a threshold exists.

Also, given that this model was developed based on data from Singapore, it would be prudent to trust the accuracy of this model only when it is used in areas with similar traits to Singapore. Some characteristics that should be considered are: the area of Singapore (751.2 km²), which inevitably affects rainfall and temperature patterns, population density, as well as spatiotemporal movement patterns.

# Conclusion and Review

Overall, we think we did pretty well, and it was not easy at all to hit the sweet spot of having a low test score, an acceptable prediction pattern and an acceptable lag pattern. Good model scores were also achieved by just using different variations of four variables: dengue levels, Singapore population levels, temperature & rainfall.

Finding the relevant variables without altering results negatively was tough. We attempted to incorporate Google Search trends on ‘dengue’ to see whether it would be a significant predictor variable, but results were mixed. In addition, the correlation and causation between Google search trends and the outbreak itself have not been widely studied, so we decided to forgo that in the end.

We also believe that this model can be further improved by including more variables (which we were not able to do because of the limited data available online). More experimentation with variations of the neural network architecture, as well as feeding the training phase more data, could be useful in improving the model too.

Nevertheless, we have developed a neural network model which is capable of predicting dengue levels fairly accurately. Moving forward, we believe that better dengue prediction models could be created with more quality data — e.g. dengue serotypes, *Aedes* population levels etc. — which are able to help explain the variance in dengue levels. Additional data could potentially enhance our model and the current set of variables which we propose to be useful in predicting dengue outbreaks.

*Written by: Benita, Clarence, Richard, Wen Hao, Zi Yu*