
Analysis of health and economic consequences of storm events in the United States

A report produced as an assignment for the Reproducible Research course run by Johns Hopkins University on Coursera

Photo: John Towner

In this report we aim to show which types of storm events are most harmful to population health, and which have the greatest economic consequences.

The analysis uses a storm database from the United States National Oceanic and Atmospheric Administration (NOAA) containing data for the United States from 1950 to 2011, and focuses on fatalities, property damage, and crop damage caused by storm types.

The conclusion is that, across the United States, tornadoes are most harmful to population health (causing the most deaths), while floods and droughts have the greatest economic consequences (causing the most damage to property and crops). There is, however, great variation among the states.

Data processing

First we load the R packages that we use to analyse the data. We use readr for reading the source CSV, dplyr for manipulating the data, and ggplot2 for plotting charts. R.utils provides bunzip2 for decompressing the download, while knitr and tools are loaded for the kable and toTitleCase functions respectively.

library(dplyr)
library(ggplot2)
library(knitr)
library(R.utils)
library(readr)
library(tools)

Next we parse the source data. If it hasn’t been downloaded already, we retrieve the source file, decompress it, and read the resulting CSV file into a data frame. Most columns are skipped and only those of interest are included. The dates are converted from character strings to Date vectors and the event type is normalised to title case.

# Download the source file and unzip it, but only if it
# doesn't already exist in the working directory.
if (!file.exists("StormData.csv")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2",
                "StormData.csv.bz2")
  bunzip2("StormData.csv.bz2")
}

# Load useful columns, convert dates to Date vectors,
# normalise the event_type column to title case.
storms <- read_csv("StormData.csv",
                   col_types = "-c---ccc---c------ddc-dddcdc---------",
                   col_names = c("start_date",
                                 "county",
                                 "state",
                                 "event_type",
                                 "end_date",
                                 "length",
                                 "width",
                                 "fujita",
                                 "fatalities",
                                 "injuries",
                                 "property_damage",
                                 "property_damage_multiplier",
                                 "crop_damage",
                                 "crop_damage_multiplier"),
                   skip = 1) %>%
  mutate(start_date = parse_date(start_date, "%m/%d/%Y 0:00:00"),
         end_date = parse_date(end_date, "%m/%d/%Y 0:00:00"),
         event_type = toTitleCase(tolower(event_type)))
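
As a quick sanity check on the conversions above, here is a minimal sketch that uses only base R and the tools package. It is not the code used in the report: as.Date stands in for readr's parse_date, and the sample dates and event types are made-up values shaped like the raw data. It also counts the non-skipped characters in the compact col_types string, which should match the number of names passed to col_names.

```r
library(tools)

# Compact column spec copied from the read_csv() call: "-" skips a
# column, "c" reads it as character, "d" reads it as double.
col_spec <- "-c---ccc---c------ddc-dddcdc---------"
spec_chars <- strsplit(col_spec, "")[[1]]
kept_columns <- sum(spec_chars != "-")
kept_columns  # should equal the number of names in col_names

# Hypothetical raw values shaped like the source data.
raw_dates <- c("4/18/1950 0:00:00", "11/15/1951 0:00:00")
raw_types <- c("TORNADO", "FLASH FLOOD")

# Base-R stand-in for parse_date(): as.Date() reads the date part
# and ignores the trailing " 0:00:00".
parsed_dates <- as.Date(raw_dates, format = "%m/%d/%Y")
class(parsed_dates)

# Same normalisation as in the mutate() call above.
clean_types <- toTitleCase(tolower(raw_types))
clean_types
```

If kept_columns ever differs from the length of col_names, the spec and the names have drifted apart and the column labels can no longer be trusted.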

The most complex part of the data processing is parsing the columns containing costs for property and crop damage. Property damage is stored in two columns, one containing a number and another containing a multiplier. We need to multiply the first by the second to get a dollar amount for the property damage. Crop damage is similarly split across two columns.

The problem is that the two multiplier columns (one for property, one for crops) are not numerical. Instead each holds one of a messy set of characters: H means ‘multiply by one hundred’, for example. These columns need to be tidied before they can be used as multipliers, and this is where the damage_multiplier function comes in.

# Function to convert a messy damage multiplier column into
# a number that can be used as a mathematical multiplier.
damage_multiplier <- function(multiplier) {
  if (multiplier %in% c(NA, 0, "-", "+", "?")) {
    # If the multiplier is empty, 0, or punctuation, assume
    # a multiplier of 1.
    multiplier <- 1
  } else {
    multiplier_num <- as.numeric(multiplier)
    if (is.na(multiplier_num)) {
      # If the upper-case multiplier is H, K, M, or B, convert
      # that into the matching number, as taken from
      # documentation described in the class forum at
      # https://class.coursera.org/repdata-035/forum/thread?thread_id=51.
      multiplier_vector <- c("H" = 100,
                             "K" = 1000,
                             "M" = 1000000,
                             "B" = 1000000000)
      multiplier <- multiplier_vector[[toupper(multiplier)]]
    } else {
      # If the multiplier is a number, use it directly.
      multiplier <- multiplier_num
    }
  }
  multiplier
}

# Convert the messy property damage multiplier column and
# add a column, property_damage_abs, that contains a usable
# damage total. Rows with a missing damage value are dropped.
storms <- storms %>%
  filter(!is.na(property_damage)) %>%
  mutate(property_damage_multiplier_abs = sapply(property_damage_multiplier,
                                                 damage_multiplier),
         property_damage_abs = property_damage * property_damage_multiplier_abs)

# Convert the messy crop damage multiplier column and add a
# column, crop_damage_abs, that contains a usable damage
# total. Rows with a missing damage value are dropped.
storms <- storms %>%
  filter(!is.na(crop_damage)) %>%
  mutate(crop_damage_multiplier_abs = sapply(crop_damage_multiplier,
                                             damage_multiplier),
         crop_damage_abs = crop_damage * crop_damage_multiplier_abs)

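From that code you can see that the property and crop damage totals are stored in two new columns, property_damage_abs and crop_damage_abs. Each is the product of the raw damage figure and its cleaned multiplier (property_damage_multiplier_abs and crop_damage_multiplier_abs).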
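
The same mapping can also be sketched as a vectorised lookup, which avoids calling a function once per row. This is an illustrative alternative rather than the code used in the report. The name damage_multiplier_vec is hypothetical, and one behaviour differs by assumption: any unrecognised code falls back to a multiplier of 1 instead of raising an error.

```r
# Hypothetical vectorised variant of damage_multiplier().
# Assumption: codes that are not H/K/M/B or a non-zero digit
# (including NA, "0", "-", "+", "?") map to a multiplier of 1.
damage_multiplier_vec <- function(codes) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  codes <- toupper(as.character(codes))

  # Start from the default multiplier of 1 for every row.
  result <- rep(1, length(codes))

  # Letter codes are translated through the lookup table.
  is_letter <- codes %in% names(lookup)
  result[is_letter] <- lookup[codes[is_letter]]

  # Non-zero numeric codes are used directly.
  digits <- suppressWarnings(as.numeric(codes))
  is_digit <- !is.na(digits) & digits != 0
  result[is_digit] <- digits[is_digit]

  result
}

# Made-up codes covering each branch.
multipliers <- damage_multiplier_vec(c("K", "m", NA, "?", "5", "0"))
multipliers

# A damage figure of 25 with the code "K" becomes 25,000 dollars.
damage_total <- 25 * damage_multiplier_vec("K")
damage_total
```

Whether falling back to 1 for unknown codes is appropriate depends on the data; the original function stops with an error in that case, which makes unexpected codes harder to miss.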

The results include three plots, each of which is created using its own data structure. These are created below.

# Calculate the number of fatalities according to event
# type.
fatalities_by_type <- storms %>%
  filter(fatalities > 0 | injuries > 0) %>%
  group_by(event_type) %>%
  summarise(num = sum(fatalities)) %>%
  filter(num >= 100) %>%
  arrange(num) %>%
  mutate(event_type = factor(event_type, levels = event_type))

# Calculate the cost of damage to property according to
# event type.
property_damage_by_type <- storms %>%
  select(event_type, property_damage_abs) %>%
  filter(property_damage_abs > 0) %>%
  group_by(event_type) %>%
  summarise(cost = sum(property_damage_abs)) %>%
  arrange(desc(cost)) %>%
  mutate(event_type = factor(event_type, levels = rev(event_type))) %>%
  filter(row_number() < 21)

# Calculate the cost of damage to crops according to event
# type.
crop_damage_by_type <- storms %>%
  select(event_type, crop_damage_abs) %>%
  filter(crop_damage_abs > 0) %>%
  group_by(event_type) %>%
  summarise(cost = sum(crop_damage_abs)) %>%
  arrange(desc(cost)) %>%
  mutate(event_type = factor(event_type, levels = rev(event_type))) %>%
  filter(row_number() < 21)
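
The group, sum, sort, and truncate pattern used for the damage summaries can be illustrated on a toy data frame. The sketch below uses base R only and made-up costs, so it is a stand-in for the dplyr pipeline rather than a copy of it. It also shows why the factor levels are reversed: after coord_flip the last level is drawn at the top, so reversing puts the largest total first.

```r
# Made-up costs for three event types.
toy <- data.frame(
  event_type = c("Flood", "Tornado", "Flood", "Hail", "Tornado"),
  cost = c(5, 2, 7, 1, 4),
  stringsAsFactors = FALSE
)

# Sum the cost per event type, then sort from largest to smallest.
totals <- vapply(split(toy$cost, toy$event_type), sum, numeric(1))
totals <- sort(totals, decreasing = TRUE)

# Keep only the top two event types (the report keeps twenty).
top <- head(totals, 2)

# Reverse the level order so the largest total ends up as the
# last level, which a flipped bar chart draws at the top.
ranked <- data.frame(
  event_type = factor(names(top), levels = rev(names(top))),
  cost = unname(top)
)
ranked
levels(ranked$event_type)
```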

Note that the data is not inflation-adjusted. There are also plenty of misspellings and miscategorisations among the event types, but exploratory analysis showed that this doesn’t affect the results.
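
One way to look for inconsistent event-type labels is to group the raw spellings under a normalised key and list the keys that have more than one variant. The sketch below is an assumption about how such a check could be done, not the exploratory analysis referred to above, and the labels are made up for illustration.

```r
# Made-up raw labels with inconsistent case and whitespace.
raw_labels <- c("TSTM WIND", " TSTM WIND", "Tstm Wind",
                "HEAT", "Heat", "FLASH FLOOD")

# Normalised key: trimmed and upper-cased.
key <- toupper(trimws(raw_labels))

# Collect the distinct raw spellings found under each key.
variants <- lapply(split(raw_labels, key), unique)

# Keys with more than one spelling are candidates for merging.
inconsistent <- variants[lengths(variants) > 1]
names(inconsistent)
```

A check like this only catches differences in case and whitespace. Genuine misspellings or overlapping categories would need a manual review or a fuzzy-matching step.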

Results

If we first look at fatalities caused by storm events, we see that tornadoes are by far the deadliest type of event across the United States. Tornadoes caused 5,633 deaths, while the next contender, excessive heat, caused 1,903. The event types that caused at least 100 deaths over the period are shown in the plot below.

# Plot deaths by event type.
ggplot(fatalities_by_type, aes(event_type, num)) +
  geom_bar(stat = "identity") +
  labs(title = "Deaths caused by storm events in the United States\n1950–2011",
       x = "",
       y = "Deaths") +
  coord_flip() +
  theme_minimal() +
  theme(title = element_text(vjust = 1.5),
        axis.text.x = element_text(angle = 0, hjust = 0, vjust = 1),
        axis.title.x = element_text(vjust = -0.5))

Next we look at the economic consequences of storms. Despite causing most deaths, tornadoes aren’t the biggest cause of damage to property and they cause relatively little damage to crops. Instead, floods are the greatest cause of damage to property while drought is the biggest cause of damage to crops. In the two plots below we can see the twenty event types that cause the greatest damage to property and crops.

# Plot cost of property damage by event type.
ggplot(property_damage_by_type, aes(event_type, cost / 1000000000)) +
  geom_bar(stat = "identity") +
  labs(title = "Property damage caused by storm events\nin the United States, 1950–2011",
       x = "",
       y = "US dollars, billions") +
  coord_flip() +
  theme_minimal() +
  theme(title = element_text(vjust = 1.5),
        axis.text.x = element_text(angle = 0, hjust = 0, vjust = 1),
        axis.title.x = element_text(vjust = -0.5))
# Plot cost of crop damage by event type.
ggplot(crop_damage_by_type, aes(event_type, cost / 1000000000)) +
  geom_bar(stat = "identity") +
  labs(title = "Crop damage caused by storm events in the United States\n1950–2011",
       x = "",
       y = "US dollars, billions") +
  coord_flip() +
  theme_minimal() +
  theme(title = element_text(vjust = 1.5),
        axis.text.x = element_text(angle = 0, hjust = 0, vjust = 1),
        axis.title.x = element_text(vjust = -0.5))

These plots show data for the United States as a whole, but the country is famous for its heterogeneity. Below is a table showing the most deadly event type for each state. We can see that although tornadoes cause the most fatalities across the United States, they are the biggest killer mainly in the South and the Midwest. In Alaska the biggest killers are avalanches, in Hawaii it’s high surf, and in Connecticut it’s high wind. The table shows normalised data, fatalities per 1,000,000 people, which highlights the discrepancies in danger between states: for its deadliest event type, Mississippi has a fatality rate more than an order of magnitude higher than nearby Florida.

Each state’s population is its census population on 1 April 2010 as collected from Wikipedia’s list of US states and territories by population.
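
The per-capita normalisation and the choice of each state's deadliest event type can be sketched on toy numbers. The states, counts, and populations below are invented for illustration, and the sketch uses base R rather than the dplyr pipeline in the report.

```r
# Invented fatalities and populations for two fictional states.
toy <- data.frame(
  state = c("AA", "AA", "BB", "BB"),
  event_type = c("Tornado", "Heat", "Flood", "Tornado"),
  fatalities = c(30, 12, 12, 9),
  population = c(3e6, 3e6, 1e6, 1e6),
  stringsAsFactors = FALSE
)

# Fatalities per 1,000,000 residents.
toy$per_million <- toy$fatalities / toy$population * 1e6

# Sort by rate, then keep the first (highest) row for each state.
ranked <- toy[order(-toy$per_million), ]
deadliest <- ranked[!duplicated(ranked$state), ]
deadliest[, c("state", "event_type", "per_million")]
```

In raw counts the tornadoes in state AA (30 deaths) look far worse than the floods in state BB (12 deaths), but once population is taken into account BB's floods have the higher rate. That reversal is the reason for normalising.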

# Download the source population file, but only if it
# doesn't already exist in the working directory.
if (!file.exists("state_populations.csv")) {
  download.file("http://storage.flother.is/etc/2015/state_populations.csv",
                "state_populations.csv")
}
# Load the state population data.
state_populations <- read_csv("state_populations.csv")

# Create a data frame that contains each state's number of
# fatalities, grouped by event type.
state_fatalities <- storms %>%
  filter(fatalities > 0 | injuries > 0) %>%
  group_by(state, event_type) %>%
  summarise(fatalities = sum(fatalities))

# Merge the state population and fatalities datasets.
state_data <- merge(state_populations, state_fatalities,
                    by.x = "state_code", by.y = "state")

# Calculate per capita rates (see https://xkcd.com/1138/)
# and find out which event type is the biggest killer for
# each state.
per_capita_state_fatalities <- state_data %>%
  mutate(fatalities_per_million = fatalities / population * 1000000) %>%
  arrange(desc(fatalities_per_million)) %>%
  group_by(state_code) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  select(state_name, event_type, fatalities_per_million)

# Print the table.
kable(per_capita_state_fatalities, digits = 1,
      col.names = c("State", "Event type",
                    "Deaths per 1,000,000"))

| State | Event type | Deaths per 1,000,000 |
|:---|:---|---:|
| Mississippi | Tornado | 151.7 |
| Arkansas | Tornado | 130.0 |
| Alabama | Tornado | 129.1 |
| Kansas | Tornado | 82.7 |
| Oklahoma | Tornado | 78.9 |
| Missouri | Tornado | 64.8 |
| Tennessee | Tornado | 58.0 |
| Illinois | Heat | 50.9 |
| Alaska | Avalanche | 46.5 |
| Wyoming | Avalanche | 40.8 |
| Indiana | Tornado | 38.9 |
| North Dakota | Tornado | 37.2 |
| Louisiana | Tornado | 33.7 |
| Nebraska | Tornado | 29.6 |
| Kentucky | Tornado | 28.8 |
| Pennsylvania | Excessive Heat | 28.3 |
| Iowa | Tornado | 26.6 |
| Michigan | Tornado | 24.6 |
| South Dakota | Tornado | 22.1 |
| Texas | Tornado | 21.4 |
| Nevada | Heat | 20.0 |
| Minnesota | Tornado | 18.7 |
| Georgia | Tornado | 18.6 |
| Hawaii | High Surf | 17.6 |
| Wisconsin | Tornado | 16.9 |
| Ohio | Tornado | 16.6 |
| Massachusetts | Tornado | 16.5 |
| Utah | Avalanche | 15.9 |
| Maryland | Excessive Heat | 15.2 |
| North Carolina | Tornado | 13.2 |
| West Virginia | Flash Flood | 13.0 |
| South Carolina | Tornado | 12.8 |
| Idaho | Avalanche | 10.2 |
| Arizona | Flash Flood | 9.7 |
| Colorado | Lightning | 9.5 |
| Florida | Rip Current | 9.1 |
| Montana | Lightning | 9.1 |
| Delaware | Excessive Heat | 7.8 |
| New Mexico | Flash Flood | 7.8 |
| Vermont | Flood | 6.4 |
| Washington | Avalanche | 5.2 |
| Oregon | High Wind | 5.0 |
| New York | Excessive Heat | 4.8 |
| New Hampshire | Tstm Wind | 4.6 |
| Maine | Lightning | 4.5 |
| Virginia | Tornado | 4.5 |
| New Jersey | Excessive Heat | 4.4 |
| California | Excessive Heat | 3.0 |
| Rhode Island | Heat | 1.9 |
| Connecticut | High Wind | 1.7 |

Conclusion

Across the United States tornadoes are most harmful to population health, while floods and droughts have the greatest economic consequences, but there is great variation among the states.
