Modeling Dengue

Problem Overview

Dengue fever and severe dengue are mosquito-borne diseases caused by the dengue virus. In severe cases, dengue can develop into dengue hemorrhagic fever, which can be life-threatening. Dengue is one of the most important arthropod-borne viral diseases worldwide, and its geographical expansion over the past decades has been of growing concern to scientists and public health authorities because of its heavy health burden and economic impact.

Since the dengue virus is carried by mosquitoes, environmental factors such as temperature, humidity, rainfall, degree of urbanization and vegetation have well-defined roles in the transmission cycle. Changes in these factors therefore lead to changes in the incidence of dengue fever. The purpose of this project is to find any significant relationships between such environmental factors and dengue outbreaks, and ultimately to predict future outbreaks, which may help with better prevention methods and timely resource allocation.

Before going further, let's take a look at the causes, behaviour and transmission of dengue. The dengue virus is primarily transmitted by Aedes mosquitoes, particularly A. aegypti. They typically bite early in the morning and in the late evening, but they can bite and cause infection at any time of day. For the mosquitoes, the right breeding and survival conditions include:

Higher temperatures - which accelerate mosquito development stages and increase dengue transmission
Altered rainfall patterns - producing more standing water, which provides potential breeding sites
Humidity - identified as a weather factor that consistently provides favourable conditions for the dengue vectors

In addition to these, the vegetation and land use patterns of cities may also be potential risk factors. For instance, agricultural practices may provide suitable habitats for the vector, but it is also argued that larger shares of human settlement coverage in a neighborhood are associated with higher numbers of dengue cases. One reason for this may be the higher population density in areas with more human settlement, which leads to higher human biting rates.

We learn from this that there are many research studies with varying opinions about how environmental and climatic factors correlate with the spread of dengue. The purpose of this project is to use statistical models to find the environmental variables most strongly associated with the number of dengue cases, and then to use those models to predict future dengue outbreaks.

Citations

https://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0004211
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4784273/
http://www.who.int/denguecontrol/epidemiology/en/
https://en.wikipedia.org/wiki/Dengue_fever#Cause

Data Preparation

To analyze the data, we start by loading, cleaning and preparing it. I decided to use Python (with Jupyter Notebook) to process the data required for this project. The training data included two files: train_features.csv and train_labels.csv. For ease of access, I merged the two files into a single dataframe on the columns 'weekofyear', 'year' and 'city'. I then split this data into a list of two separate dataframes, one for each city: San Juan and Iquitos. This helps in finding correlations and strengths of association for each city separately.
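A minimal sketch of the merge-and-split step, using tiny stand-in frames instead of the two training files. The city codes "sj" and "iq" and the example values are assumptions made for illustration; only the three key columns and total_cases come from the text.

```python
import pandas as pd

# Stand-in data; in the project these frames would come from pd.read_csv
# on the features file and the labels file.
features = pd.DataFrame({
    "city": ["sj", "sj", "iq", "iq"],          # assumed city codes
    "year": [1990, 1990, 2000, 2000],
    "weekofyear": [18, 19, 26, 27],
    "reanalysis_avg_temp_k": [297.7, 298.4, 296.7, 296.2],
})
labels = pd.DataFrame({
    "city": ["sj", "sj", "iq", "iq"],
    "year": [1990, 1990, 2000, 2000],
    "weekofyear": [18, 19, 26, 27],
    "total_cases": [4, 5, 0, 0],
})

keys = ["city", "year", "weekofyear"]
# validate="one_to_one" raises if a key combination is duplicated on either side.
merged = features.merge(labels, on=keys, how="inner", validate="one_to_one")

# One dataframe per city, kept in a list; sort=False preserves first-seen order.
city_frames = [
    group.reset_index(drop=True)
    for _, group in merged.groupby("city", sort=False)
]
```

Merging on all three keys at once guards against weeks from one city being paired with labels from the other.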

Missing Values and Eliminating Extra Features

Looking at the data, we can see that for certain features (e.g., Total Precipitation, Avg. Temperature) the data included values from two sources. The two sources are:

NOAA's GHCN daily climate data weather station measurements
NOAA's NCEP Climate Forecast System Reanalysis measurements (0.5x0.5 degree scale)

To decide which source would be better for our analysis, I looked for data-driven reasons and found two in favour of NOAA's NCEP values.

When I compared the correlation of Total Precipitation from each source (GHCN and NCEP) with total_cases and n_lag_cases (explained below), the NCEP values were much more strongly correlated for both cities.
The GHCN values contained many more missing values (> 5% in one case) than the NCEP values.

For these reasons, I eliminated NOAA's GHCN columns from the data. I filled the remaining missing values using the Pandas fillna() function with method ffill.
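The cleaning step above could look roughly like the sketch below. The "station_" prefix for the GHCN columns and the example values are assumptions; the "reanalysis_" names follow the column list in this report. `DataFrame.ffill()` does the same forward fill as `fillna()` with method ffill, and is used here because recent pandas versions deprecate the `method` argument.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "reanalysis_avg_temp_k": [297.7, np.nan, 298.4, 298.9],
    "reanalysis_precip_amt_kg_per_m2": [32.0, 17.9, np.nan, np.nan],
    "station_avg_temp_c": [25.4, np.nan, np.nan, 27.5],  # hypothetical GHCN column
    "station_precip_mm": [16.0, 8.6, np.nan, 4.0],       # hypothetical GHCN column
})

# Share of missing values per column, useful for comparing the two sources.
missing_share = df.isna().mean()

# Drop every GHCN column (assumed prefix), then forward-fill the remaining gaps
# so each missing week takes the last observed value.
ghcn_cols = [c for c in df.columns if c.startswith("station_")]
cleaned = df.drop(columns=ghcn_cols).ffill()
```

Forward filling cannot fill a gap in the very first row, so a leading missing value would still need separate handling.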

Additional Variables

Lagged total cases ( n_lag_cases )
While researching the problem (discussed in section 1), I found that it is reasonable to assume that environmental variables affect the spread of dengue only after a certain lag time, because there is typically a lag of weeks to months between a change in weather and the associated dengue incidence. According to [], this lag time could be around 1-3 months. To account for it, I created two lag variables, 4_lag_cases and 8_lag_cases, which contain the total_cases values shifted up by 4 and 8 weeks respectively. After each shift, I replaced the last n values with the mean of the preceding n values to give a close approximation of the cases. These two new variables help us determine whether the environmental variables are correlated with the dengue cases reported n weeks later.
Mean Vegetation Index (mean_vegetation_index)
The starter data also included the normalized difference vegetation index for four pixels, one in each direction from the centroid of the city. To get a better sense of the overall vegetation index of the city and obtain a single value for analysis, I created a new variable called mean_vegetation_index which, as the name suggests, is the mean of the four vegetation indexes for each week.
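The two derived variables can be sketched as follows. This is one reading of the description above: the trailing n rows of the shifted column are filled with the mean of the n shifted values just before them. The helper name `add_lag_cases` and the four `ndvi_*` column names are assumptions for illustration.

```python
import pandas as pd

def add_lag_cases(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Add '<n>_lag_cases': total_cases shifted up by n weeks."""
    out = df.copy()
    # After shift(-n), row t holds the cases reported in week t + n.
    shifted = out["total_cases"].shift(-n)
    # The last n rows have no future value; fill them with the mean of the
    # n shifted values immediately before them (assumed interpretation).
    fill_value = shifted.iloc[-2 * n:-n].mean()
    out[f"{n}_lag_cases"] = shifted.fillna(fill_value)
    return out

city = pd.DataFrame({
    "total_cases": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "ndvi_ne": [0.1] * 10,   # assumed names for the four vegetation pixels
    "ndvi_nw": [0.2] * 10,
    "ndvi_se": [0.3] * 10,
    "ndvi_sw": [0.4] * 10,
})

city = add_lag_cases(city, 2)
ndvi_cols = ["ndvi_ne", "ndvi_nw", "ndvi_se", "ndvi_sw"]
# Row-wise mean of the four pixel values gives one index per week.
city["mean_vegetation_index"] = city[ndvi_cols].mean(axis=1)
```

The same helper would be called with n = 4 and n = 8, once per city, so that values are never shifted across the boundary between the two cities.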

After performing basic data preparation and adding the additional variables, the columns of the data are as follows.

city
year
weekofyear
reanalysis_sat_precip_amt_mm (Total Precipitation)
reanalysis_dew_point_temp_k (Mean Dew Point Temperature)
reanalysis_relative_humidity_percent (Mean Relative Humidity)
reanalysis_specific_humidity_g_per_kg (Mean Specific Humidity)
reanalysis_precip_amt_kg_per_m2 (Total Precipitation)
reanalysis_max_air_temp_k (Maximum Air Temperature)
reanalysis_min_air_temp_k (Minimum Air Temperature)
reanalysis_avg_temp_k (Average Air Temperature)
reanalysis_tdtr_k (Diurnal temperature range)
4_lag_cases
8_lag_cases
mean_vegetation_index

Exploratory Data Analysis

Now that we have the data prepped, extra variables removed and new measures added, let's see what the data has to tell us by visualizing it. My outside research and data exploration showed that the two cities in our data have different patterns of occurrence, so I visualize the data for each city separately.

San Juan

Figures: Distribution of total number of dengue cases; Bar graph of Month vs Total Cases

In San Juan, the mean number of dengue cases from 1990 - 2008 was 34.17, with a standard deviation of 51.38. These numbers are much higher than those for Iquitos. In the first figure, the distribution of dengue cases has a tail on the right, with a maximum of 461 cases recorded in a week. In the second figure, we can see that August, September, October and November are the most affected months of the year. This correlates with the weather in San Juan, where temperature and total precipitation peak from May - August. Following a lag of 1-2 months, we see an increase in the dengue cases recorded.

Iquitos

Figures: Distribution of total number of dengue cases; Bar graph of Month vs Total Cases

In Iquitos, the mean number of dengue cases from 2000 - 2010 was 7.59, with a standard deviation of 10.76. In the first figure, we can see that the distribution has a tail towards the right, with a maximum of 116 cases recorded. In the second figure, we can see that the total number of cases peaks in the months of November - January. Iquitos has increased rainfall in the same months.

Looking at these plots and the corresponding weather patterns of both cities, I noticed that neither city shows much difference in average monthly temperature throughout the year, so average temperature might not be the best variable for future modelling. Rainfall and humidity, on the other hand, vary considerably more over the year.

Scatterplots of Total Cases by Week of Year for each year

San Juan

For San Juan, we can see that in most years the total number of dengue cases reported spikes during the second half of the year. This spike corresponds to increased temperature, humidity and precipitation. The years 1994 and 1998 were the most affected, with the highest spikes.

Iquitos

For Iquitos, the number of dengue cases was fairly flat throughout 2000 and 2001 but started increasing from 2002, with a spike in reported cases in the second half of the year.

Correlations between number of dengue cases and environmental variables

I tested the correlation of each variable of interest in our data with total_cases, 4_lag_cases and 8_lag_cases. The top 4 correlating variables for each city turned out to be:

Specific Humidity
Min Air Temperature
Dew Point Temperature
Avg Air Temperature

Interestingly, all of these variables showed a higher correlation with the 1-month and 2-month lagged variables, which strengthens our belief that a lag exists between a change in weather and the resulting dengue cases. Below are some lmplots showing the correlation of Average Precipitation, Average Air Temperature and Specific Humidity with dengue cases. There is a separate chart for each of our dependent variables: total_cases, 4_lag_cases and 8_lag_cases (San Juan only).

Average Precipitation vs Cases Reported (Immediate, 1 Month Lag, 2 Months Lag respectively)

Correlations: 0.07, 0.06, 0.06

We can see that in San Juan, dengue cases do not correlate strongly with average precipitation.

Average Air Temperature vs Cases Reported (Immediate, 1 Month Lag, 2 Months Lag respectively)

Correlations: 0.17, 0.27, 0.30

Average Specific Humidity vs Cases Reported (Immediate, 1 Month Lag, 2 Months Lag respectively)

Correlations: 0.21, 0.28, 0.31

As we discussed earlier, specific humidity and air temperature show much more correlation than precipitation in San Juan. Looking at these correlations, we can assess which climatic factors affect a city more than others and then tune our future models accordingly.
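A correlation table of this kind can be computed along the following lines. The data here is synthetic and built so that cases follow humidity with a 4-week delay; it is not the project data, and Pearson correlation (the pandas default) is an assumption, since the report does not name the coefficient used. For brevity the lagged columns keep their trailing missing values, which `corrwith` ignores.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
weeks = 200
humidity = rng.normal(16.0, 1.5, weeks)
temperature = rng.normal(299.0, 1.0, weeks)
# Synthetic cases driven by the humidity of 4 weeks earlier (illustration only).
cases = 3.0 * np.roll(humidity, 4) + rng.normal(0.0, 1.0, weeks)

df = pd.DataFrame({
    "reanalysis_specific_humidity_g_per_kg": humidity,
    "reanalysis_avg_temp_k": temperature,
    "total_cases": cases,
})
df["4_lag_cases"] = df["total_cases"].shift(-4)
df["8_lag_cases"] = df["total_cases"].shift(-8)

features = ["reanalysis_specific_humidity_g_per_kg", "reanalysis_avg_temp_k"]
targets = ["total_cases", "4_lag_cases", "8_lag_cases"]

# Rows are environmental variables, columns are the three dependent variables.
corr_table = pd.DataFrame({t: df[features].corrwith(df[t]) for t in targets})
```

In this toy setup the humidity row is close to zero under total_cases and close to one under 4_lag_cases, which is the pattern one would look for in the real table.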

Statistical Modelling

After reviewing various kinds of models that might fit our data, I narrowed the choice down to Poisson and Negative Binomial Regressions. Both of these regressions model count variables, which fits our case because we are trying to model the count of dengue cases.

I expected Negative Binomial Regression to outperform Poisson Regression in our case because the variance of our data is much larger than its mean. This indicates overdispersion, and negative binomial regression is more flexible than Poisson in this regard because it has an additional parameter that adjusts the variance independently of the mean.

After reviewing the correlations of each independent variable in our data (discussed in the previous section), I used backward selection to find the variables that produce the lowest Mean Absolute Error. These selected variables also showed the strongest correlations with our dependent variables. After some experimentation, I decided to include the following variables in my model:
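For reference, the variance assumptions of the two models can be written out as below, assuming the NB2 parameterization (the one statsmodels documents for its negative binomial family), with \mu the conditional mean and \alpha the extra dispersion parameter:

```latex
\begin{aligned}
\text{Poisson:} \quad & \operatorname{Var}(Y \mid x) = \mu \\
\text{Negative binomial (NB2):} \quad & \operatorname{Var}(Y \mid x) = \mu + \alpha\,\mu^{2}, \qquad \alpha \ge 0
\end{aligned}
```

As \alpha approaches 0, the negative binomial variance reduces to the Poisson variance. Using the summary statistics reported in the exploratory analysis, the variance of the case counts is about 51.38^2 ≈ 2640 for San Juan (roughly 77 times the mean of 34.17) and 10.76^2 ≈ 116 for Iquitos (roughly 15 times the mean of 7.59). These are marginal rather than conditional figures, so they only hint at overdispersion.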

Average Dew Point Temperature
Mean Vegetation Index
Mean Specific Humidity
Average Air Temperature
Min Air Temperature
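The backward selection mentioned above can be sketched generically as follows. The function `backward_select` and the `toy_mae` scorer are hypothetical: in practice the scorer would fit a model on the given columns and return its Mean Absolute Error, whereas here it is a hand-written stand-in so the example stays self-contained.

```python
from typing import Callable, List, Sequence

def backward_select(
    features: Sequence[str],
    score: Callable[[List[str]], float],
) -> List[str]:
    """Drop one feature at a time for as long as doing so lowers the score."""
    selected = list(features)
    best = score(selected)
    while len(selected) > 1:
        # Score every subset obtained by removing exactly one feature.
        trials = [
            (score([f for f in selected if f != drop]), drop)
            for drop in selected
        ]
        trial_score, drop = min(trials)
        if trial_score >= best:
            break  # no single removal improves the error any further
        selected.remove(drop)
        best = trial_score
    return selected

def toy_mae(cols: List[str]) -> float:
    # Stand-in for "fit the model on cols and return its MAE":
    # 'humidity' helps, 'noise' hurts, 'temp' makes no difference.
    return 10.0 - 3.0 * ("humidity" in cols) + 1.0 * ("noise" in cols)

chosen = backward_select(["humidity", "noise", "temp"], toy_mae)
```

With this scorer the procedure removes only the harmful column and then stops, because no further removal lowers the error.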

I tested both Poisson and Negative Binomial Regression models against our data for each city separately, and calculated the Mean Absolute Error (MAE) of each model to compare accuracy. I ran these models with total_cases, 4_lag_cases and 8_lag_cases as the dependent variable in turn, to see whether the lag affects the performance of our models. I used statsmodels in Python to perform the tests. For the negative binomial model, I also searched for the best value of the alpha hyperparameter, which turned out to be 1e-08. The results (in mean absolute error) for all the tests are provided below.

San Juan

Dependent Variable	Poisson	Negative Binomial
total_cases	28.68	28.69
4_lag_cases	26.14	25.71
8_lag_cases	25.02	25.56

Iquitos

Dependent Variable	Poisson	Negative Binomial
total_cases	6.54	6.42
4_lag_cases	6.45	6.35
8_lag_cases	6.42	6.46

These tests suggest two things:

Using a lagged dependent variable in place of total_cases lowers the error in every case except one (the Negative Binomial model for Iquitos with 8_lag_cases, at 6.46 against 6.42). This supports our initial belief that there is a lag between a change in environmental variables and the consequent dengue cases.
Secondly, the two regression models perform almost identically, with no significant differences. This urges us to dig further into data exploration and tune our parameters more precisely in order to get better results in future predictions.

Interpretation

Poisson Model for Iquitos and San Juan Respectively

Negative Binomial Model for Iquitos and San Juan Respectively

Through these tests, we can observe that neither Poisson nor Negative Binomial Regression outperformed the other by a significant amount in any case. I double-checked my code multiple times and consistently observed very similar values for both regressions. This points me towards a few things to consider in the second part. First, I might create additional variables using dimension reduction techniques and test the accuracy with them. In addition, the plots for the models with lagged cases were more linear than those above, which suggests that a lag effect does occur.

Overall, my analysis leads me to conclude that we can associate environmental variables with changes in dengue cases only to a small extent, and that the environmental variables cannot predict changes in dengue cases with high accuracy. This might, however, change with further parameter tuning and variable selection. The goodness of our model and the strengths of association have been discussed in the previous section. The strength of our models (measured by mean absolute error) is a little lower than the benchmark of 25, but the results show a sense of correctness with scope for improvement.


Saurav Kharb

Core Methods in Data Science
