R and Tidyverse Data Manipulation Tutorial

CCRCN Coding Exercises

 

The CCRCN aspires to make data accessible and easy to manipulate. These exercises are directed towards an audience that has some beginner-level experience with R and RStudio but is open to a new toolset. If you’re not comfortable with R yet, take a look at the Resources listed at the bottom of the last page (Challenge Exercises). Here, we will manipulate coastal carbon data using the tools available in the tidyverse. The tidyverse is a suite of packages designed for data science, all of which adhere to the same underlying grammar and data structures. Familiarity with the tidyverse will let you work with data curated by the CCRCN, and it is also one of the more flexible and popular data science toolsets used in the sciences.

First, we are going to prep our workspace by setting our working directory and loading the data. Next, we will demonstrate how to modify the columns of a dataset to fit one’s needs, and then we will show how to group and summarize observations by certain attributes. From there, we will build a simple visualization of our data. We’ll then learn a few new skills and finally build a map from our data.

 

Prepare workspace

We start by opening RStudio and creating a new project. R projects work through a selected working directory, where all pertinent files are located.


For the sake of these exercises (and for good file management practice), we will expect data files to be kept in “/name_of_your_directory/data”, and scripts to be kept in “/name_of_your_directory/scripts”. Once you have a new RStudio project inside a working directory created especially for this tutorial, you can add a blank R script (in RStudio, select File>New File> R Script) to your /scripts folder, into which you can copy and paste code from these exercises.
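
If you prefer to create those folders from R itself, here is a minimal sketch (run from within your new project, so that the working directory is the project root):

# Create the expected folder structure inside the project directory
dir.create("data", showWarnings = FALSE)
dir.create("scripts", showWarnings = FALSE)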

 

Although base R has many functions built in, plenty of other operations are only available through additional packages. The following lines will install tidyverse and all other packages necessary for these exercises:

# Install packages
install.packages('tidyverse')
install.packages('RCurl')
install.packages('ggmap')
install.packages('rgdal')


We will use data from the CCRCN June 2018 Data Release, which includes metadata on soil cores, the extent of human impact on core sites, field and lab methods, vegetation taxonomy, and soil carbon data. You can download the data files to your computer by clicking the links from the data release DOI, or download them programmatically using the following script:

(note: the following code will only work if you have created the appropriate folders, as described above)

# Load RCurl, a package used to download files from a URL
library(RCurl)

# Create a list of the URLs for each data file
url_list <- list("https://repository.si.edu/bitstream/handle/10088/35684/V1_Holmquist_2018_core_data.csv?sequence=7&isAllowed=y",
                 "https://repository.si.edu/bitstream/handle/10088/35684/V1_Holmquist_2018_depth_series_data.csv?sequence=8&isAllowed=y",
                 "https://repository.si.edu/bitstream/handle/10088/35684/V1_Holmquist_2018_impact_data.csv?sequence=9&isAllowed=y",
                 "https://repository.si.edu/bitstream/handle/10088/35684/V1_Holmquist_2018_methods_data.csv?sequence=10&isAllowed=y",
                 "https://repository.si.edu/bitstream/handle/10088/35684/V1_Holmquist_2018_species_data.csv?sequence=11&isAllowed=y"
)

# Apply a function, which downloads each of the data files, over url_list
lapply(url_list, function(x) {
  # Extract the file name from each URL (the name starts at character 56 of these URLs)
  filename <- as.character(x)
  filename <- substring(filename, 56)
  filename <- gsub("\\..*", "", filename) # Drop everything from the first "." onward
  # Now download the file into the "data" folder
  download.file(x, paste0(getwd(), "/data/", filename, ".csv"))
})

 

Whether you used the links on the DOI or used the script, the data files should now be in the /data folder of your working directory. Let’s load them into our R project in RStudio and look at the column headers. We will use the tidyverse method of loading csv files, read_csv(), which reads data directly into tibbles.

# Load tidyverse packages
library(tidyverse)
# Load data
core_data <- read_csv("data/V1_Holmquist_2018_core_data.csv")
impact_data <- read_csv("data/V1_Holmquist_2018_impact_data.csv")
methods_data <- read_csv("data/V1_Holmquist_2018_methods_data.csv")
species_data <- read_csv("data/V1_Holmquist_2018_species_data.csv")
depthseries_data <- read_csv("data/V1_Holmquist_2018_depth_series_data.csv")

Tibbles are the tidyverse-preferred version of data frames, with superior display options and a simpler design. There are a variety of methods to quickly view our tibbles, such as the depthseries_data tibble:

# Display data
glimpse(depthseries_data) # View all column headers and a sample of observations
## Observations: 16,976
## Variables: 8
## $ study_id                <chr> "Boyd_2012", "Boyd_2012", "Boyd_2012",...
## $ core_id                 <chr> "BBRC_1", "BBRC_1", "BBRC_1", "BBRC_1"...
## $ depth_min               <int> 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 4...
## $ depth_max               <int> 2, 6, 10, 14, 18, 22, 26, 30, 34, 38, ...
## $ dry_bulk_density        <dbl> 0.28, 0.41, 0.40, 0.48, 0.54, 0.59, 0....
## $ fraction_organic_matter <dbl> 0.28, 0.23, 0.27, 0.20, 0.18, 0.16, 0....
## $ fraction_carbon         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fraction_carbon_type    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
head(depthseries_data) # View the first six observations of the dataset
## # A tibble: 6 x 8
##   study_id  core_id depth_min depth_max dry_bulk_density fraction_organic~
##   <chr>     <chr>       <int>     <int>            <dbl>             <dbl>
## 1 Boyd_2012 BBRC_1          0         2             0.28              0.28
## 2 Boyd_2012 BBRC_1          4         6             0.41              0.23
## 3 Boyd_2012 BBRC_1          8        10             0.4               0.27
## 4 Boyd_2012 BBRC_1         12        14             0.48              0.2 
## 5 Boyd_2012 BBRC_1         16        18             0.54              0.18
## 6 Boyd_2012 BBRC_1         20        22             0.59              0.16
## # ... with 2 more variables: fraction_carbon <dbl>,
## #   fraction_carbon_type <chr>
tail(depthseries_data) # View the last six observations of the dataset
## # A tibble: 6 x 8
##   study_id   core_id depth_min depth_max dry_bulk_density fraction_organi~
##   <chr>      <chr>       <int>     <int>            <dbl>            <dbl>
## 1 Windham_M~ GL              0         2             1.67          0.0545 
## 2 Windham_M~ GL              8        10             1.92          0.00906
## 3 Windham_M~ HH              0         2             1.66          0.0270 
## 4 Windham_M~ HH              8        10             1.34          0.0256 
## 5 Windham_M~ HL              0         2             1.40          0.0579 
## 6 Windham_M~ HL              8        10             1.87          0.00947
## # ... with 2 more variables: fraction_carbon <dbl>,
## #   fraction_carbon_type <chr>
names(depthseries_data) # View the names of all column headers
## [1] "study_id"                "core_id"                
## [3] "depth_min"               "depth_max"              
## [5] "dry_bulk_density"        "fraction_organic_matter"
## [7] "fraction_carbon"         "fraction_carbon_type"

The ability to quickly view your data in different ways will inform which manipulations need to be performed. Besides these commands, you can also click on an object in the “Environment” tab of RStudio to open it in a spreadsheet-style viewer.
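
The same viewer can also be opened from code with View() (note the capital V):

# Open depthseries_data in RStudio's spreadsheet-style viewer
View(depthseries_data)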



 

Let’s take a quick look at the columns provided in the depthseries_data tibble (this metadata is documented online and also linked on the dataset DOI):

Column header | Meaning
study_id | Unique identifier for the study, made up of the first author’s family name, as well as the second author’s or ‘et al.’
core_id | Unique identifier assigned to a core.
depth_min | Minimum depth of a sampling increment.
depth_max | Maximum depth of a sampling increment.
dry_bulk_density | Dry mass per unit volume of a soil sample.
fraction_organic_matter | Mass of organic matter relative to sample dry mass.
fraction_carbon | Mass of carbon relative to sample dry mass.
fraction_carbon_type | Code assigned to specify the type of measurement fraction_carbon represents.

There are a lot of interesting types of data to work with here. But we can also change what is included, by adding and removing columns.

 


Add and Remove Columns

 

To manipulate our datasets, we are going to use the dplyr package, a component of the tidyverse (dplyr is automatically installed with tidyverse).

We will be performing many modifications to our datasets in these exercises, but to leave ourselves a trail of breadcrumbs back to the untampered data, we will create new datasets instead of permanently modifying the original tibbles. For this, we will create a new object for each modification. Assigning a value to an object is an integral part of R and of coding practice in general:

# create an object
x <- 3
x
## [1] 3

This will come in handy when we want to modify one of our datasets. For instance, sometimes the data we are seeking is not included in the original data, but is a product of existing columns. For this, we can use the mutate() operation:

# calculate organic matter density
depthseries_data_with_organic_matter_density <- mutate(depthseries_data, 
       organic_matter_density = dry_bulk_density * fraction_organic_matter)

head(depthseries_data_with_organic_matter_density)
## # A tibble: 6 x 9
##   study_id  core_id depth_min depth_max dry_bulk_density fraction_organic~
##   <chr>     <chr>       <int>     <int>            <dbl>             <dbl>
## 1 Boyd_2012 BBRC_1          0         2             0.28              0.28
## 2 Boyd_2012 BBRC_1          4         6             0.41              0.23
## 3 Boyd_2012 BBRC_1          8        10             0.4               0.27
## 4 Boyd_2012 BBRC_1         12        14             0.48              0.2 
## 5 Boyd_2012 BBRC_1         16        18             0.54              0.18
## 6 Boyd_2012 BBRC_1         20        22             0.59              0.16
## # ... with 3 more variables: fraction_carbon <dbl>,
## #   fraction_carbon_type <chr>, organic_matter_density <dbl>

Sometimes a dataset presents more types of information than we need, and we want to simplify it to just a few columns. For this, we select() the columns we desire.

# Select just a handful of columns
depthseries_data_select <- select(depthseries_data_with_organic_matter_density, # Define dataset
                                  core_id, depth_min, depth_max, organic_matter_density) # Chosen columns

head(depthseries_data_select)
## # A tibble: 6 x 4
##   core_id depth_min depth_max organic_matter_density
##   <chr>       <int>     <int>                  <dbl>
## 1 BBRC_1          0         2                 0.0784
## 2 BBRC_1          4         6                 0.0943
## 3 BBRC_1          8        10                 0.108 
## 4 BBRC_1         12        14                 0.096 
## 5 BBRC_1         16        18                 0.0972
## 6 BBRC_1         20        22                 0.0944

We can perform both of these operations in the same command by using the pipe operator, %>% (from the magrittr package, loaded automatically with the tidyverse).

# Calculate organic matter density, then remove all non-critical columns
depthseries_data_select <- depthseries_data %>% # Define dataset
  mutate(organic_matter_density = dry_bulk_density * fraction_organic_matter) %>% # Separate operations with "%>%"
  select(core_id, depth_min, depth_max, organic_matter_density) # Even though we just created the organic_matter_density column, we can immediately use it in the select() operation

head(depthseries_data_select)
## # A tibble: 6 x 4
##   core_id depth_min depth_max organic_matter_density
##   <chr>       <int>     <int>                  <dbl>
## 1 BBRC_1          0         2                 0.0784
## 2 BBRC_1          4         6                 0.0943
## 3 BBRC_1          8        10                 0.108 
## 4 BBRC_1         12        14                 0.096 
## 5 BBRC_1         16        18                 0.0972
## 6 BBRC_1         20        22                 0.0944

 


Organize observations

 

Grouping rows by the values of a column is a common intermediary step for analysis and plot generation, and for this we use group_by(). This operation doesn’t visibly rearrange the data; instead, it tags the tibble with grouping information (in this case, by “core_id”) so that subsequent operations are applied within each group:

# Group data by core_id
depthseries_data_select_grouped <- group_by(depthseries_data_select, core_id)
head(depthseries_data_select)
## # A tibble: 6 x 4
##   core_id depth_min depth_max organic_matter_density
##   <chr>       <int>     <int>                  <dbl>
## 1 BBRC_1          0         2                 0.0784
## 2 BBRC_1          4         6                 0.0943
## 3 BBRC_1          8        10                 0.108 
## 4 BBRC_1         12        14                 0.096 
## 5 BBRC_1         16        18                 0.0972
## 6 BBRC_1         20        22                 0.0944
tail(depthseries_data_select)
## # A tibble: 6 x 4
##   core_id depth_min depth_max organic_matter_density
##   <chr>       <int>     <int>                  <dbl>
## 1 GL              0         2                 0.0910
## 2 GL              8        10                 0.0174
## 3 HH              0         2                 0.0447
## 4 HH              8        10                 0.0344
## 5 HL              0         2                 0.0811
## 6 HL              8        10                 0.0177

The rows appear exactly as they did before: group_by() doesn’t rearrange the data, it just records which rows belong to which group (here, each core_id) so that later verbs can operate group by group. (If you want to sort rows, see the arrange() example below.)
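
To actually sort rows by a column’s values, that’s a different verb, arrange() (the new object name here is just for illustration):

# Sort rows alphabetically by core_id (wrap the column in desc() to reverse the order)
depthseries_data_sorted <- arrange(depthseries_data_select, core_id)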

If you want to condense your data down to a single summary row, you can do so with the summarize() operation.

# What is the average organic matter density of all observations
depthseries_data_avg_organic_matter_density <- summarize(depthseries_data_select, # Define dataset
                                         organic_matter_density_avg = mean(organic_matter_density, # New column that has one entry, the mean organic matter density
                                                                           na.rm = TRUE)) # Remove NA values before summarizing
depthseries_data_avg_organic_matter_density
## # A tibble: 1 x 1
##   organic_matter_density_avg
##                        <dbl>
## 1                     0.0679

These two steps aren’t that interesting by themselves, but are really effective when put together:

# What is the average organic matter density of each core, for all cores
depthseries_data_avg_OMD_by_core <- depthseries_data_select %>%
  group_by(core_id) %>%
  summarize(organic_matter_density_avg = mean(organic_matter_density, na.rm = TRUE))
depthseries_data_avg_OMD_by_core
## # A tibble: 1,534 x 2
##    core_id                     organic_matter_density_avg
##    <chr>                                            <dbl>
##  1 AH                                              0.109 
##  2 AL                                              0.0930
##  3 Alley_Pond                                      0.0734
##  4 Altamaha_River_Site_7_Levee                   NaN     
##  5 Altamaha_River_Site_7_Plain                   NaN     
##  6 Altamaha_River_Site_8_Levee                   NaN     
##  7 Altamaha_River_Site_8_Plain                   NaN     
##  8 Altamaha_River_Site_9_Levee                   NaN     
##  9 Altamaha_River_Site_9_Plain                   NaN     
## 10 AND                                             0.0690
## # ... with 1,524 more rows
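
summarize() can also compute several summary columns at once; for example, we can add the number of samples per core with the n() helper (depthseries_summary is an illustrative object name):

# Average organic matter density and sample count for each core
depthseries_summary <- depthseries_data_select %>%
  group_by(core_id) %>%
  summarize(organic_matter_density_avg = mean(organic_matter_density, na.rm = TRUE),
            n_samples = n())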

 


Recap and plot a histogram

 

Now we will use these skills in unison to create a plot of our data. For this, we will use the ggplot2 package, also part of the tidyverse. The basic template for visualizations generated by ggplot2 starts with a call to ggplot(), with aesthetic mappings defined by aes(), geometries added as geom_*() layers, and layouts adjusted with theme(). There is a vast array of plots we can create in R through ggplot2, but we will start with a simple histogram of carbon density.

First, we must calculate carbon density. Some of the studies in our data measured the fraction of sample mass that is carbon, which is needed to calculate carbon density. However, not all studies included this measurement, so we will need to estimate fraction carbon from fraction organic matter, using the quadratic relationship derived by Holmquist et al. (2018).

To accomplish this, we will create a function that estimates fraction carbon using the Holmquist et al. (2018) equation. Then, within mutate(), we will use ifelse(), a vectorized if-else statement: wherever a logical condition is TRUE, one value is returned; wherever it is FALSE, a different value is returned:

# This function estimates fraction carbon as a function of fraction organic matter
estimate_fraction_carbon <- function(fraction_organic_matter) {
  # Use the quadratic relationship published by Holmquist et al. 2018 to derive
  # the fraction of carbon from the fraction of organic matter
  fraction_carbon <- (0.014 * fraction_organic_matter ^ 2) + (0.421 * fraction_organic_matter) + 0.008
  # pmax() replaces any negative estimates with 0 element-wise, which keeps the
  # function vectorized so it can be applied to whole columns inside mutate()
  fraction_carbon <- pmax(fraction_carbon, 0)
  return(fraction_carbon) # The output is the estimated fraction carbon value
}

depthseries_data_with_carbon_density <- depthseries_data %>% # For each row in depthseries_data...
  mutate(carbon_density = ifelse(! is.na(fraction_carbon), # IF the value for fraction_carbon is not NA.... 
      (fraction_carbon * dry_bulk_density), # Then a new variable, carbon density, equals fraction carbon * dry bulk density....
      (estimate_fraction_carbon(fraction_organic_matter) * dry_bulk_density) # ELSE, carbon density is estimated using our function
      )
  )

glimpse(depthseries_data_with_carbon_density)
## Observations: 16,976
## Variables: 9
## $ study_id                <chr> "Boyd_2012", "Boyd_2012", "Boyd_2012",...
## $ core_id                 <chr> "BBRC_1", "BBRC_1", "BBRC_1", "BBRC_1"...
## $ depth_min               <int> 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 4...
## $ depth_max               <int> 2, 6, 10, 14, 18, 22, 26, 30, 34, 38, ...
## $ dry_bulk_density        <dbl> 0.28, 0.41, 0.40, 0.48, 0.54, 0.59, 0....
## $ fraction_organic_matter <dbl> 0.28, 0.23, 0.27, 0.20, 0.18, 0.16, 0....
## $ fraction_carbon         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fraction_carbon_type    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ carbon_density          <dbl> 0.03555373, 0.04328395, 0.04907624, 0....


All of our newly calculated carbon_density data is now in a new column of the same name. Let’s compile this information into a visual plot. For displaying the distribution of carbon density, a histogram will do.

# Create a histogram of carbon density
histo <- ggplot(data = depthseries_data_with_carbon_density, # Define dataset
       aes(carbon_density)) + # Define which data is displayed
       geom_histogram(bins = 100) # Set the number of bins displayed

histo


This would look a little better if the limit of the x-axis were reduced to fit the distribution. Let’s change the theme background and plot labels as well.

# Update our histogram
histo <- histo + # Load the plot that we've already made
         xlim(0, 0.15) + # Change x-axis limits
         theme_classic() + # Change the plot theme
         ggtitle("Distribution of Carbon Density of CONUS") + # Give the plot a title (CONUS is the contiguous US)
         xlab("Carbon Density") + # Change x axis title
         ylab("Number of Cores") # Change y axis title
histo


Great, much better. This is a basic plot of just one type of data. However, as we will see soon, we can develop more intricate plots with multiple data types. But before we create such a plot, we’ll overview a couple new skills.

 


Join and Filter Data

 

We will now learn a few more dplyr commands that will lead us to developing a map of soil cores.

Sometimes it might be useful to combine the contents of one dataset with another; say, to look at how the fraction of organic carbon varies with geographic location. There is a suite of commands in dplyr that can do so in different fashions, all of which follow the _join() naming format. For our purposes, left_join() works:

# left_join pulls rows in the 2nd given dataset that match existing rows in the first given dataset via a common key (e.g. core_id)
core_depth_data <- left_join(core_data, depthseries_data, by = NULL)
## Joining, by = c("study_id", "core_id")
head(core_depth_data)
## # A tibble: 6 x 16
##   study_id  core_id core_latitude core_longitude position_code
##   <chr>     <chr>           <dbl>          <dbl> <chr>        
## 1 Boyd_2012 BBRC_1           39.4          -75.7 a1           
## 2 Boyd_2012 BBRC_1           39.4          -75.7 a1           
## 3 Boyd_2012 BBRC_1           39.4          -75.7 a1           
## 4 Boyd_2012 BBRC_1           39.4          -75.7 a1           
## 5 Boyd_2012 BBRC_1           39.4          -75.7 a1           
## 6 Boyd_2012 BBRC_1           39.4          -75.7 a1           
## # ... with 11 more variables: core_elevation <dbl>, salinity_code <chr>,
## #   vegetation_code <chr>, inundation_class <chr>, core_length_flag <chr>,
## #   depth_min <int>, depth_max <int>, dry_bulk_density <dbl>,
## #   fraction_organic_matter <dbl>, fraction_carbon <dbl>,
## #   fraction_carbon_type <chr>

Sometimes two datasets you wish to join have different names for the same variable (e.g., “name” and “Name”, or “temp” and “C”). If this is the case, you can specify which variables you wish to join on using the “by” argument [ex: left_join(..., by = c("name" = "Name"))]. Because our data has uniform terminology and syntax, we didn’t need to specify it, so the “by” argument is left as NULL above [NULL is the default value for “by” in all _join() commands, but we included it to illustrate the point].

Now that our core data and depth series data are both in the same tibble, we can perform new operations on them. To explore particular parts of the data separately, we often are interested in subsetting to rows with a common characteristic. The filter() operation allows for this to be done easily.

# Filter to just cores with coordinates from a high quality source
core_data_HQ_coord <- filter(core_data, position_code == 'a1')
# We can use logical symbols in our filter operation
core_data_HQ_coord <- filter(core_data, position_code == 'a1' | position_code == 'a2')
# And, we can filter by more than one criterion
core_data_HQ_coord_estuarine <- filter(core_data, position_code == 'a1' | position_code == 'a2', salinity_code == 'Est')

head(core_data_HQ_coord_estuarine)
## # A tibble: 6 x 10
##   study_id          core_id core_latitude core_longitude position_code
##   <chr>             <chr>           <dbl>          <dbl> <chr>        
## 1 Crooks_et_al_2013 MA_A             48.0          -122. a1           
## 2 Crooks_et_al_2013 MA_B             48.0          -122. a1           
## 3 Crooks_et_al_2013 NE_A             48.0          -122. a1           
## 4 Crooks_et_al_2013 NE_B             48.0          -122. a1           
## 5 Crooks_et_al_2013 QM_A             48.0          -122. a1           
## 6 Crooks_et_al_2013 QM_B             48.0          -122. a1           
## # ... with 5 more variables: core_elevation <dbl>, salinity_code <chr>,
## #   vegetation_code <chr>, inundation_class <chr>, core_length_flag <chr>

This allows us to select observations that fulfill particular requirements; in this case, we wanted only cores with high-quality georeferencing.
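
A handy shorthand when filtering against several allowed values is the %in% operator, which does the same job as the chained | comparisons above:

# Equivalent to position_code == 'a1' | position_code == 'a2'
core_data_HQ_coord <- filter(core_data, position_code %in% c('a1', 'a2'))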

 


Create a Soil Core Map

 

Let’s finish by creating more visuals; this time, a series of maps. The core_data tibble includes longitude and latitude of each soil core, so we can use those coordinates to plot where the cores were taken in geographical space:

# Plot cores on a lat/long coordinate plane
map <- ggplot() + 
       geom_point(data = core_data, aes(x = core_longitude, y = core_latitude)) + # Add coordinate data
       theme_bw() + # Change the plot theme
       ggtitle("Distribution of Sampled Cores across CONUS") + # Give the plot a title
         xlab("Longitude") + # Change x axis title
         ylab("Latitude") # Change y axis title
map

Not bad, but we can do more. Let’s plot a coordinate display of our cores overlaid on a map of the conterminous US, and display cores with a color gradient corresponding to the fraction of organic matter. We’ll use many of the commands that we learned in this tutorial, all in sequence.

We’re also going to integrate a few other packages, ggmap and rgdal.

# Load new packages
# install.packages('devtools') # Run this first if you don't already have devtools
devtools::install_github("dkahle/ggmap") # At the time of writing, ggmap receives frequent updates, so it's best to install the development version from GitHub
library(ggmap)
library(rgdal)
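
# Note (an assumption worth checking for your setup): newer releases of ggmap
# require a Google Maps API key for source = "google". If get_map() errors out,
# register a key first -- the key below is a placeholder, not a real credential:
# register_google(key = "YOUR_API_KEY")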

# Join core_data and depthseries_data
core_depth_data <- left_join(core_data, depthseries_data, by = NULL)

# Select coordinate and organic matter columns from core_depth_data into a new dataset
core_coordinates <- select(core_depth_data, core_longitude, core_latitude, fraction_organic_matter)


# Add Google Maps image of CONUS to plot
CONUS_map <- get_map(location = c(lon = -96.5795, lat = 39.8283), # center map on CONUS
    color = "color",
    source = "google",
    maptype = "terrain-background",
    zoom = 4) # set the zoom to include all data
 
ggmap(CONUS_map,
    extent = "device",
    ylab = "Latitude",
    xlab = "Longitude") +

# Add core lat/long coordinates as points, colored by fraction organic matter
geom_point(data = core_coordinates, aes(x = core_longitude, y = core_latitude, color = fraction_organic_matter), alpha = 1, size = 1) +
  scale_color_gradientn("Fraction Organic Matter\n", 
                        colors = c("#EFEC35", "#E82323"), na.value = "#1a1a1a") + # Provide ggplot2 with color gradient
       ggtitle("Distribution of Sampled Cores across CONUS") + # Give the plot a title
         xlab("Longitude") + # Change x axis title
         ylab("Latitude") # Change y axis title

## Filter to specific regions for map insets

# Add inset of Coastal Louisiana
LA_map <- get_map(location = c(lon = -91.0795, lat = 30.2283),
    color = "bw",
    source = "google",
    maptype = "satellite",
    zoom = 7)
 
ggmap(LA_map,
    extent = "device",
    ylab = "Latitude",
    xlab = "Longitude") +

# Add core lat/long coordinates as points, colored by fraction organic matter
geom_point(data = core_coordinates, aes(x = core_longitude, y = core_latitude, color = fraction_organic_matter), alpha = 1, size = 2) +
  scale_color_gradientn("Fraction Organic Matter\n", 
                        colors = c("#EFEC35", "#E82323"), na.value = "#1a1a1a") # Provide ggplot2 with color gradient

# Add inset of Mid-Atlantic Seaboard
mid_Atlantic_map <- get_map(location = c(lon = -75.095, lat = 39.2283),
    color = "bw",
    source = "google",
    maptype = "satellite",
    zoom = 7)
 
ggmap(mid_Atlantic_map,
    extent = "device",
    ylab = "Latitude",
    xlab = "Longitude") +

# Add core lat/long coordinates as points, colored by fraction organic matter
geom_point(data = core_coordinates, aes(x = core_longitude, y = core_latitude, color = fraction_organic_matter), alpha = 1, size = 3) +
  scale_color_gradientn("Fraction Organic Matter\n", 
                        colors = c("#EFEC35", "#E82323"), na.value = "#1a1a1a") # Provide ggplot2 with color gradient

When combined, these maps can tell a lot about soil carbon across the contiguous United States.

 


Challenge Exercises

 

The following function calculates a depth weighted average of a soil variable. That means if a core has fraction organic matter measured in 1 cm depth increments and you want to know the averages for 0 to 5, 5 to 25, 25 to 50, and 50 to 100 cm, all you need to do is input the original depth intervals and organic matter data, as well as your target depth intervals, and the function will output the depth weighted averages.

# Generate some sample data
sample_carbon_density <- rnorm(100, 0.027, 0.013)
sample_carbon_density[sample_carbon_density < 0] <- 0
sample_depth_min <- 0:99
sample_depth_max <- 1:100


calculate_depth_weighted_average <- function(soil_variable = sample_carbon_density, # Inputs include the soil variable you want a weighted avg. of
                                             depth_min = sample_depth_min, # The real depth minimums
                                             depth_max = sample_depth_max, # The real depth maximums
                                             target_depth_min = 0, # The targeted depth interval for the weighted average
                                             target_depth_max = 50.2)
  {
  
  weights <- c() # Empty vector
  for (i in 1:length(depth_min)) { # Iterate through real depth intervals
    # Get the length of overlap between this sampling interval and the target interval
    interval <- min(depth_max[i], target_depth_max) - max(depth_min[i], target_depth_min)
    if (interval > 0) { # If positive value add it to the vector
      weights <- c(weights, interval)
    } else {
      weights <- c(weights, 0) # If negative add 0
    }
  }
  
  if (! all(weights == 0)) { # If there is some overlap between the sampling intervals and target intervals
    normalized_weights <- weights / sum(weights) # Calculate normalized weights
    weighted_avg <- sum(soil_variable * normalized_weights) # Calculate a weighted averaged
  } else { # If there is no overlap
    weighted_avg <- NA  # Return a no data value
  }
 
  return(weighted_avg) # Output is the depth weighted average of the soil variable 
  
  }
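
To see the function in action, run it on the sample data generated above (your value will differ, since the sample data are random):

# Depth weighted average of the sample data over a 0 to 50.2 cm target interval
calculate_depth_weighted_average(soil_variable = sample_carbon_density,
                                 depth_min = sample_depth_min,
                                 depth_max = sample_depth_max,
                                 target_depth_min = 0,
                                 target_depth_max = 50.2)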

Your challenge is to use this function and what you know about using mutate, group_by and summarize, filter, and plotting in ggplot to:

  1. Calculate organic matter density from bulk density and fraction organic matter.

Struggle builds character, but if you need a hint or want to check your answer, take a look at the solution below.

depthseries_data_with_organic_matter_density <- depthseries_data %>%
  mutate(organic_matter_density = dry_bulk_density * fraction_organic_matter)


 

  2. For each core with a length greater than or equal to 50 cm, calculate the depth weighted average carbon density.
# There are several methods by which we can narrow our selection of cores to just those that we want:

# Solution 1
# Create a new dataset that contains the maximum depth sampled in each core
depthseries_max_depths <- depthseries_data_with_carbon_density %>%
  group_by(core_id) %>%
  summarize(max_depth = max(depth_max))

depthseries_data_long_cores <- depthseries_data_with_carbon_density %>%
  # Join our depthseries data to the dataset that contains maximum depths
  left_join(depthseries_max_depths) %>%
  # Keep only cores at least 50 cm long
  filter(max_depth >= 50)

# Another method is even simpler:

# Solution 2
depthseries_data_long_cores <- depthseries_data_with_carbon_density %>%
  group_by(core_id) %>%
  filter(max(depth_max) >= 50)

# Now apply the depth weighted average function to each core
depthseries_data_long_cores_with_depth_weighted_average <- depthseries_data_long_cores %>%
  group_by(core_id) %>%
  summarize(depth_weighted_avg = calculate_depth_weighted_average(soil_variable = carbon_density,
                                                                  depth_min = depth_min, # Each core's actual sampling depths
                                                                  depth_max = depth_max,
                                                                  target_depth_min = 0,
                                                                  target_depth_max = 894)) # Target the full profile (down to 894 cm)


 

  3. Plot a histogram of carbon density from 0 to 5 cm.

depthseries_data_shallow <- depthseries_data_with_carbon_density %>%
  group_by(core_id) %>%
  filter(depth_max <= 5) # Keep only the 0 to 5 cm samples

depth_weighted_average_plot <- ggplot(depthseries_data_shallow, aes(carbon_density)) +
  geom_histogram(bins = 100)

depth_weighted_average_plot


 

Additional Resources

 

There is loads more that can be accomplished with our data using the tidyverse. It’s easy to get overwhelmed by the many operations available, so the RStudio cheat sheets are immensely helpful.

We realize that learning a new coding language is a lot to take on: it’s a marathon, not a sprint. To help with that process, there are many additional resources online that can help with an introduction to R and RStudio.

 

Now you’re ready to take advantage of the full potential of your data!


 

Tell us how we did:

Any thoughts on this tutorial? Send them along to CoastalCarbon@si.edu; we appreciate the feedback.

 
