Taking A Peek into My Hiking Data

Ain't No Mountain High Enough

8 minute read

I moved to Seattle at the end of 2016 and since then have done over 100 hikes (depending on your definition of ‘a hike’!). I must admit I’ve been abysmal at tracking any data regarding my hiking activity beyond a Google spreadsheet, despite the ubiquity of trail-tracking apps.

Recently, I signed up on AllTrails to start collecting data on my hikes. The Pro service offers many wonderful features, including the ability to download GPX data on hikes. I was so excited by this that I decided to try to visualize the hikes I have done.

I’m structuring this article a bit differently, with the results/visualizations first; for anybody dying to see the data cleaning process, please see the Methodology or Visualizations sections below! (Interestingly, I ran a poll on Twitter asking whether people embed code in the main text of their blog posts or at the end. 91% embed it in the main text [n = 85]! Still, I prefer having the code at the end.)

Analysis

Disclaimer

For data collection, I downloaded each trail’s GPX files from AllTrails. Because these data are proprietary, I will not be providing them. Some things to note:

  • Because these are data pulled from the website, they are not indicative of my actual hiking path (for example, Franklin Falls is a 2-mile hike in the summer but a 6-mile snowshoe in the winter).
  • There are hikes that I did back-to-back that I’d consider one hike, but the trails might be listed separately on the site. For example, Deception Pass is actually made up of three small loops.

The hikes are wide and varied

Because I’m fortunate enough to live near multiple mountain ranges, the hikes I’ve been on come in all shapes and sizes.

I calculated my ‘average hike’ - that is, the average elevation gain at each point of cumulative distance travelled.

Aggregated Data by Trail

In the aggregate, there seems to be a moderate correlation (r^2 = 0.48) between total distance and total elevation gain.

There Exist Categories of Hikes

I ran a quick cluster analysis to see if I could categorize my hikes in any way; the code is in the Visualizations section below. Four clusters seemed to be optimal. I have dubbed them:

  • Cluster 1: “Let’s Get This Over With” (steep & hard)
  • Cluster 2: “Easy Peasy Lemon Squeezy” (short & flat)
  • Cluster 3: “The Sweet Spot” (not too long, not too high)
  • Cluster 4: “I Don’t Care About My Knees Anyway” (too long for my own good)

I don’t particularly love long hikes

My average hike is 6.4 miles, and most of my hikes are concentrated around that distance. This makes sense, as I usually day hike and need to get back at a reasonable time. My shortest hike was 1.18 miles and my longest was 17.85 miles (the Enchantments…). Across these 90 hikes, I hiked around 576 miles.

I don’t dislike high-elevation hikes, though

Elevation gain on these hikes ranged from ~0 feet to 4,580 feet. I averaged 1,455.4 feet of gain per hike and have climbed 130,984 feet in total (~24 miles!).

Methodology

Choose Packages

It took a bit to decide which packages had the functions needed to run the spatial analyses. In the end, I decided on:

  • plotKML: A package containing functions to read GPX files.
  • geosphere: A package containing functions for geospatial calculations. I decided to use this for calculating distances between lon/lat pairs.
  • googleway: A package allowing access to the Google Maps API. To run this, you need to obtain a Google Maps API key and load it into R using set_key(). I use this for elevation calculations, but the API can also obtain distances between points.
library(tidyverse)
library(googleway)
library(plotKML)
library(geosphere)

googleway::set_key("API_KEY_HERE") # replace with your own Google Maps API key

Upload Data

I downloaded each GPX file from AllTrails and saved them in a folder within my project directory. Their file names were TRAILNAME.gpx.

  • Using plotKML::readGPX() results in the files being loaded as lists.
  • I used purrr in conjunction with plotKML::readGPX() to handily read them in and add the file name to each list.
# find gpx files
data_path <- 
  here::here("data", "raw", "gpx_files")

files <-
  dir(data_path, pattern = "*.gpx", full.names = TRUE)

# get trail names
names <-
  dir(data_path, pattern = "*.gpx", full.names = FALSE) %>% 
  str_extract(".+?(?=\\.gpx)") # trail name is everything before the .gpx extension

# read all gpx files
gpx_dat <-
  map2(files,
       names,
       ~ readGPX(.x,
         metadata = TRUE,
         bounds = TRUE,
         waypoints = TRUE,
         tracks = TRUE,
         routes = TRUE) %>%
         list_modify(trail = .y)) # otherwise you can't tell which entry is for which trail

Calculate Elevation

We can use googleway::google_elevation() to access the Google Elevation API and calculate elevation for every lon/lat pair from the GPX files. Unfortunately, the API accepts only a limited number of locations per request (~200 rows for these files), and we have over 51,000 rows of data. So, we can create a group for every 200 rows and use a loop to make one API call per group.

This results in a list, so we can then create a tibble pulling out the data we want.

lonlat_dat <-
  gpx_dat %>%
  map_df(., ~.x$"routes"[[1]], .id = "trail") %>%
  select(trail, lon, lat) %>% 
  mutate(group_number = (1:nrow(.) %/% 200) + 1) # assign a group to every 200 rows; https://stackoverflow.com/questions/32078578/how-to-group-by-every-7-rows-and-aggregate-those-7-values-by-median

dat_lapply <- lapply(1:max(lonlat_dat$group_number), function(x) {
  Sys.sleep(3)
  
  lonlat_dat %>%
    filter(group_number == x) %>% # added a filter so you only pull a subset of the data.
    do(elev_dat =
         data.frame(
           google_elevation(
             df_locations = dplyr::select(., lon, lat),
             location_type = "individual",
             simplify = TRUE)))
  })

dat_lapply_elev_dat <-
  dat_lapply %>%
  map(., ~ .x$"elev_dat"[[1]])

elev_df <-
  dat_lapply_elev_dat %>% {
    tibble(
      elevation = map(., ~ .x$"results.elevation"),
      lon = map(., ~ .x$"results.location"[["lng"]]),
      lat = map(.,  ~ .x$"results.location"[["lat"]])
    )
  } %>% 
  unnest(.id = "group_number") %>% 
  select(group_number, elevation, lon, lat)

Calculate Distance

Now we have a list of trails, longitudes and latitudes along their paths, and the elevation for each of those points. Next, we want to calculate the distance along the paths.

  • We bring back lonlat_dat so we know which trail each point is associated with.
  • To calculate distance, we can use distHaversine() with two sets of lon/lat. We create the second set of lon/lat by creating a new variable that takes the “next” value in the vector (so we’re calculating the distance between point A and point B, point B and point C, and so on).
  • cumsum() accumulates the distances between each set of lon/lat.
  • Finally, we calculate the elevation gain for each hike.
hiking_dat <-
  plyr::join(elev_df, lonlat_dat, type = "left", match = "first") %>% 
  group_by(trail) %>% 
  mutate(elev_feet = elevation * 3.281, # to convert to feet
         lon2 = lead(lon, 1),
         lat2 = lead(lat, 1)) %>%
  ungroup() %>% 
  mutate(dist = distHaversine(cbind(lon, lat), cbind(lon2, lat2)) / 1609.344) %>% # convert meters to miles
  group_by(trail) %>% 
  mutate(cumdist = cumsum(dist),
         elev_gain = elev_feet - first(elev_feet)) %>%
  ungroup()

Create Additional Tables

For nerdy kicks, I also wanted to find out my ‘average’ hike - that is, the average distance, the average elevation gain, and the average elevation gain at each point of cumulative distance. I also wanted to see the total distance and elevation gain for each trail for which I pulled data.

avg_elev <- # average elevation by distance
  hiking_dat %>% 
  group_by(cumdist_rounded = round(cumdist, 1)) %>% 
  summarize(mean_elev_gain = mean(elev_gain))

hiking_dat_by_trail <- # total gain/distance by trail
  hiking_dat %>% 
  select(trail, cumdist, elev_gain) %>% 
  group_by(trail) %>%
  summarize(tot_dist = max(cumdist, na.rm = T),
            tot_elev_gain = max(elev_gain)) %>% 
  mutate(tot_dist_scaled = scale(tot_dist), # for cluster analysis
         tot_elev_scaled = scale(tot_elev_gain))
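
From hiking_dat_by_trail, the overall figures quoted in the Analysis section (average, shortest, longest, and total distance; average and total elevation gain) can then be summarized in one step. A minimal sketch, assuming the tables above were created as shown (the hiking_dat_summary name is just for illustration):

hiking_dat_summary <- # overall stats quoted in the Analysis section
  hiking_dat_by_trail %>% 
  summarize(avg_dist = mean(tot_dist, na.rm = TRUE),
            min_dist = min(tot_dist, na.rm = TRUE),
            max_dist = max(tot_dist, na.rm = TRUE),
            total_dist = sum(tot_dist, na.rm = TRUE),
            avg_elev_gain = mean(tot_elev_gain, na.rm = TRUE),
            total_elev_gain = sum(tot_elev_gain, na.rm = TRUE))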

Visualizations

Below is the code for the visualizations presented above.

library(tidyverse)
library(viridis)
library(ggridges)
library(cluster)
library(factoextra)

# joy plot

ggplot() + 
  geom_density_ridges(data = na.omit(hiking_dat),
                      aes(x = cumdist,
                          y = trail,
                          group = trail),
                      fill = "#00204c",
                      rel_min_height = 0.01
                      ) +
  theme_minimal() +
  theme(legend.position = "none")

# average hike

ggplot() + 
  geom_ridgeline(data = hiking_dat,
                 aes(x = cumdist,
                     y = trail,
                     group = trail,
                     height = elev_gain),
                 color = "#c9b869",
                 alpha = 0) +
  geom_line(data = avg_elev,
            aes(x = cumdist_rounded,
                y = mean_elev_gain),
            color = "#00204c",
            size = 2) +
  scale_x_continuous(name = "Cumulative Distance (miles)") +
  scale_y_continuous(name = "Cumulative Elevation (ft)", limits = c(0, 5000)) +
  theme_minimal() +
  theme(legend.position = "none")

# aggregate data scatterplot

ggplot() + 
  geom_point(data = hiking_dat_by_trail,
             aes(x = tot_dist,
                 y = tot_elev_gain,
                 color = tot_elev_gain,
                 size = tot_dist)) +
  scale_x_continuous(name = "Total Distance (miles)") +
  scale_y_continuous(name = "Total Elevation (ft)") +
  scale_color_viridis(option = "cividis") +
  theme_minimal() +
  theme(legend.position = "none")
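
# aside: a minimal sketch of how the r^2 (~0.48) reported in the
# Aggregated Data by Trail section could be recovered, assuming a simple
# linear model of total elevation gain on total distance
# (dist_elev_fit is a name used here for illustration)
dist_elev_fit <- lm(tot_elev_gain ~ tot_dist, data = hiking_dat_by_trail)
summary(dist_elev_fit)$r.squared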

# cluster analysis

fviz_nbclust(hiking_dat_by_trail[, 4:5], kmeans, method = "wss") # finding optimal number of clusters
k4 <- kmeans(hiking_dat_by_trail[, 4:5], centers = 4, nstart = 25) # calculating clusters

fviz_cluster(k4, data = hiking_dat_by_trail[, 4:5]) + # plot against the scaled columns used for kmeans
  scale_x_continuous(name = "Scaled Total Distance (miles)") +
  scale_y_continuous(name = "Scaled Total Elevation (ft)") +
  scale_color_viridis(option = "cividis", discrete = T) +
  scale_fill_viridis(option = "cividis", discrete = T) +
  theme_minimal()

#  cumulative distance barplot

hiking_dat_by_trail %>% 
  mutate(cumdist = cumsum(tot_dist)) %>% 
  ggplot(aes(x = trail,
             y = cumdist,
             fill = cumdist)) +
  geom_bar(stat = "identity") +
  scale_fill_viridis(option = "cividis") +
  theme_minimal() +
  theme(legend.position = "none")

# distance histogram

hiking_dat_by_trail %>% 
  ggplot(aes(x = tot_dist)) +
  geom_histogram(fill = "#00204c") +
  xlab("Trail Total Distance (miles)") +
  ylab("Count") +
  scale_fill_viridis(option = "cividis") +
  theme_minimal() +
  theme(legend.position = "none")

# cumulative elevation barplot

hiking_dat_by_trail %>% 
  mutate(cumelev = cumsum(tot_elev_gain)) %>% 
  ggplot(aes(x = trail,
             y = cumelev,
             fill = cumelev)) +
  geom_bar(stat = "identity") +
  scale_fill_viridis(option = "cividis") +
  theme_minimal() +
  theme(legend.position = "none")

# elevation histogram

hiking_dat_by_trail %>% 
  ggplot(aes(x = tot_elev_gain)) +
  geom_histogram(fill = "#00204c") +
  xlab("Trail Total Elevation (ft)") +
  ylab("Count") +
  scale_fill_viridis(option = "cividis") +
  theme_minimal() +
  theme(legend.position = "none")