Dengue fever and severe dengue are mosquito-borne diseases caused by the dengue virus. In severe cases, dengue can develop into dengue hemorrhagic fever, which can be life-threatening. It is one of the most important arthropod-borne viral diseases worldwide, and its geographical expansion over the past decades has been of growing concern to scientists and public health authorities because of its heavy health burden and economic impact.
Because the dengue virus is carried by mosquitoes, environmental factors such as temperature, humidity, rainfall, degree of urbanization and vegetation play well-defined roles in the transmission cycle. Changes in these factors therefore lead to changes in the incidence of dengue fever. The purpose of this project is to find significant relationships between such environmental factors and dengue outbreaks, and ultimately to predict future outbreaks, which may support better prevention and timely resource allocation.
Before going further, let's take a look at the causes, behaviour and transmission of dengue. Dengue virus is primarily transmitted by Aedes mosquitoes, particularly A. aegypti. They typically bite in the early morning and late evening, but can bite and cause infection at any time of day. For the mosquitoes, the right breeding and survival conditions include
We learn from this that research studies differ in their conclusions about how environmental and climatic factors correlate with the spread of dengue. The purpose of this project is to use statistical models to find environmental variables that are strongly associated with the number of dengue cases, and then to use those models to predict future dengue outbreaks.
To analyze the data, we start by loading, cleaning and preparing it. I decided to use Python (with Jupyter Notebook) to process the data for this project. The training data came in two files: train_features.csv and train_labels.csv. For ease of access, I merged them into a single file on the columns 'weekofyear', 'year' and 'city'. I then split this data into a list of two dataframes, one per city: San Juan and Iquitos. This helps in finding correlations and strengths of association for each city separately.
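The merge-and-split step can be sketched with pandas as follows. This is a minimal illustration, not the original notebook code: the tiny dataframes stand in for the two CSV files, their values are made up, and the city codes "sj" and "iq" are an assumption about how the cities are encoded.

```python
import pandas as pd

# Stand-ins for the two CSV files; in the project these would be loaded
# with pd.read_csv(...). All values below are made up for illustration.
features = pd.DataFrame({
    "city": ["sj", "sj", "iq", "iq"],          # assumed city codes
    "year": [1990, 1990, 2000, 2000],
    "weekofyear": [18, 19, 26, 27],
    "reanalysis_avg_temp_k": [297.7, 298.4, 297.9, 298.2],
})
labels = pd.DataFrame({
    "city": ["sj", "sj", "iq", "iq"],
    "year": [1990, 1990, 2000, 2000],
    "weekofyear": [18, 19, 26, 27],
    "total_cases": [4, 5, 0, 0],
})

KEYS = ["city", "year", "weekofyear"]

# validate="one_to_one" raises an error if a (city, year, week) key
# appears more than once in either file.
merged = features.merge(labels, on=KEYS, how="inner", validate="one_to_one")

# A list of two dataframes, one per city (San Juan first, then Iquitos).
city_frames = [
    merged[merged["city"] == code].reset_index(drop=True)
    for code in ("sj", "iq")
]
```

Merging on all three keys keeps each weekly label attached to the right city and week, and the `validate` argument guards against silently duplicated rows.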
Looking at the data, we can see that for certain features (e.g. Total Precipitation, Avg. Temperature) the data included values from two sources. The sources include
After performing basic data preparation and adding additional variables, the metadata looks like the following.
city
year
weekofyear
reanalysis_sat_precip_amt_mm (Total Precipitation)
reanalysis_dew_point_temp_k (Mean Dew Point Temperature)
reanalysis_relative_humidity_percent (Mean Relative Humidity)
reanalysis_specific_humidity_g_per_kg (Mean Specific Humidity)
reanalysis_precip_amt_kg_per_m2 (Total Precipitation)
reanalysis_max_air_temp_k (Maximum Air Temperature)
reanalysis_min_air_temp_k (Minimum Air Temperature)
reanalysis_avg_temp_k (Average Air Temperature)
reanalysis_tdtr_k (Diurnal Temperature Range)
4_lag_cases
8_lag_cases
mean_vegetation_index
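The derived columns in this list (4_lag_cases, 8_lag_cases and mean_vegetation_index) could be built roughly as shown below. The text does not say how they were computed, so this sketch rests on two assumptions: that an n-week lag column holds the case count reported n weeks after the current row, and that the vegetation index is the row-wise mean of several vegetation columns. The `ndvi_` column prefix and the helper name `add_derived_columns` are hypothetical.

```python
import pandas as pd

def add_derived_columns(city_df, lags=(4, 8)):
    """Return a copy of one city's data with lagged-case and mean-vegetation columns."""
    out = city_df.sort_values(["year", "weekofyear"]).reset_index(drop=True)

    for lag in lags:
        # Assumption: "<lag>_lag_cases" is the case count `lag` weeks later,
        # so this week's weather lines up with cases that follow it.
        out[f"{lag}_lag_cases"] = out["total_cases"].shift(-lag)

    # Assumption: vegetation columns share a hypothetical "ndvi_" prefix.
    ndvi_cols = [c for c in out.columns if c.startswith("ndvi_")]
    if ndvi_cols:
        out["mean_vegetation_index"] = out[ndvi_cols].mean(axis=1)
    return out

# Made-up demo data: ten consecutive weeks in one year.
demo = pd.DataFrame({
    "year": [2000] * 10,
    "weekofyear": list(range(1, 11)),
    "total_cases": list(range(10)),
    "ndvi_ne": [0.1] * 10,
    "ndvi_sw": [0.3] * 10,
})
prepared = add_derived_columns(demo)
```

The last n rows of each lag column are empty because no later case counts exist for them. Those rows would need to be dropped before modelling the lagged targets.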
Now that we have the data prepped, extra variables removed and new measures added, let's take a look at what the data has to tell us by visualizing it. My outside research and data exploration showed that the two cities in our data have different patterns of occurrence, so I visualize the data for each city separately.
[Figures, San Juan: distribution of total number of dengue cases; bar graph of month vs. total cases]
In San Juan, the mean number of dengue cases per week from 1990 to 2008 was 34.17, with a standard deviation of 51.38. These numbers are much higher than those for Iquitos. In the first figure, the distribution of dengue cases has a long right tail, with a maximum of 461 cases recorded in a single week. In the second figure, we can see that August, September, October and November are the most affected months of the year. This matches the weather in San Juan, where temperature and total precipitation peak from May to August; after a lag of 1-2 months, we see an increase in recorded dengue cases.
[Figures, Iquitos: distribution of total number of dengue cases; bar graph of month vs. total cases]
In Iquitos, the mean number of dengue cases per week from 2000 to 2010 was 7.59, with a standard deviation of 10.76. In the first figure, the distribution has a right tail, with a maximum of 116 cases recorded. In the second figure, we can see that the total number of cases peaks from November to January. Iquitos also receives more rainfall in these months.
Looking at these plots and the corresponding weather patterns of both cities, I noticed that average monthly temperature varies little over the year in either city, so average temperature might not be the best variable for future modelling. Rainfall and humidity, on the other hand, vary considerably more over the year.
For San Juan, we can see that the total number of reported dengue cases spikes during the second half of the year in most years. This spike coincides with increased temperature, humidity and precipitation. 1994 and 1998 were the most affected years, with especially high spikes.
For Iquitos, the number of dengue cases stayed fairly flat throughout 2000 and 2001 but started increasing from 2002, with a spike in reported cases in the second half of the year.
I tested the correlation of each variable of interest in our data with total_cases, 4_lag_cases and 8_lag_cases. The top 4 correlating variables for each city turned out to be
We can see that in San Juan, dengue cases do not correlate strongly with average precipitation.
As discussed earlier, specific humidity and air temperature show much stronger correlation than precipitation in San Juan. Looking at these correlations, we can assess which climatic factors affect one city more than the other and tune our future models accordingly.
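A ranking of variables by correlation with a target, like the one described above, might be computed along these lines. The helper name `top_correlations` is hypothetical, the demo values are made up, and Pearson correlation is an assumption, since the text does not name the coefficient used.

```python
import pandas as pd

def top_correlations(city_df, target, features, n=4):
    """Rank features by absolute Pearson correlation with the target column."""
    corr = city_df[features].corrwith(city_df[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(n)

# Made-up values for one city, using column names from the metadata list.
demo = pd.DataFrame({
    "reanalysis_specific_humidity_g_per_kg": [1, 2, 3, 4, 5, 6],
    "reanalysis_avg_temp_k": [2, 1, 4, 3, 6, 5],
    "reanalysis_precip_amt_kg_per_m2": [3, 1, 2, 3, 1, 2],
    "total_cases": [2, 4, 6, 8, 10, 12],
})
feature_cols = [c for c in demo.columns if c != "total_cases"]
top = top_correlations(demo, "total_cases", feature_cols, n=2)
```

Ranking by absolute value keeps strongly negative correlations in view. Running the same helper with 4_lag_cases or 8_lag_cases as the target gives the lagged comparisons.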
After reviewing various kinds of models that might fit our data, I narrowed the choice down to Poisson and Negative Binomial regression. Both are designed for modelling count variables, which fits our case because we are trying to model the count of dengue cases.
I expect Negative Binomial regression to outperform Poisson regression in our case because the variance of our data is much larger than its mean (overdispersion). Poisson regression assumes that the variance equals the mean, whereas negative binomial regression is more flexible: it has an additional parameter that adjusts the variance independently of the mean.
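The overdispersion argument can be checked with simple arithmetic on the summary statistics reported earlier (mean 34.17 and standard deviation 51.38 for San Juan, mean 7.59 and standard deviation 10.76 for Iquitos). The sketch below assumes the common NB2 form of the negative binomial variance, Var = mu + alpha * mu^2, and uses marginal moments only, so it ignores any variation that the covariates would explain. The helper name `dispersion_summary` is hypothetical.

```python
def dispersion_summary(mean, std):
    """Compare the sample variance with the Poisson assumption Var = mean."""
    variance = std ** 2
    return {
        "variance": variance,
        # A ratio near 1 is consistent with Poisson; far above 1 is overdispersed.
        "variance_to_mean": variance / mean,
        # Method-of-moments estimate under NB2: Var = mu + alpha * mu**2.
        "nb2_alpha": (variance - mean) / mean ** 2,
    }

san_juan = dispersion_summary(mean=34.17, std=51.38)
iquitos = dispersion_summary(mean=7.59, std=10.76)
```

For both cities the variance is many times the mean, which is the pattern that motivates trying a negative binomial model.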
After reviewing the correlations of each independent variable in our data (discussed in the previous section), I used backward selection to find the variables that produce the lowest mean absolute error. These selected variables also showed the strongest correlations with our dependent variables. After some experimentation, I decided to include the following variables in my model:
I tested both Poisson and Negative Binomial regression models on our data for each city separately. I then calculated the mean absolute error for each model to compare accuracy. In addition, I ran these models with total_cases, 4_lag_cases and 8_lag_cases each as the dependent variable, to see whether lag affects the performance of our models. I used statsmodels in Python to perform the tests. For the negative binomial model, I also searched for the best value of the alpha hyperparameter, which turned out to be 1e-08. The results (as mean absolute error) for all the tests are provided below.
San Juan (mean absolute error)
Dependent Variable | Poisson | Negative Binomial |
total_cases | 28.68 | 28.69 |
4_lag_cases | 26.14 | 25.71 |
8_lag_cases | 25.02 | 25.56 |
Iquitos (mean absolute error)
Dependent Variable | Poisson | Negative Binomial |
total_cases | 6.54 | 6.42 |
4_lag_cases | 6.45 | 6.35 |
8_lag_cases | 6.42 | 6.46 |
These tests prove two things to us:
Through these tests, we can observe that neither Poisson nor Negative Binomial regression outperformed the other by a significant amount in any case. I double-checked my code multiple times and still observed very similar values for both regressions. This points me towards a few things to consider in the second part. First, I might create additional variables using dimension reduction techniques and test whether they improve accuracy. Second, the plots for models with lagged cases were more linear than those above, which suggests that a lag effect does occur. Overall, my analysis suggests that environmental variables are associated with changes in dengue cases only to a small extent, and that they cannot predict those changes with high accuracy. This, however, might change with further parameter tuning and variable selection. The goodness of our model and the strengths of association have been discussed in the previous section. The performance of our models (measured by mean absolute error) falls slightly short of the benchmark of 25, but the results are reasonable and leave room for improvement.