Join and Filter Data

We will now learn a few more dplyr commands that will help us build a map of soil cores.

Sometimes it is useful to combine the contents of one dataset with another; say, to look at how the fraction of organic carbon varies with geographic location. dplyr provides a suite of commands that do this in different ways, all of which follow the “_join()” naming format. For our purposes, left_join() works:

# left_join keeps every row of the first given dataset and attaches the columns of matching rows from the
# second given dataset, matching on a common key (e.g. core_id)
core_depth_data <- left_join(core_data, depthseries_data, by = NULL)
## Joining, by = c("study_id", "core_id")
head(core_depth_data)
## # A tibble: 6 x 16
##   study_id  core_id core_latitude core_longitude position_code core_elevation salinity_code
##   <chr>     <chr>           <dbl>          <dbl> <chr>                  <dbl> <chr>        
## 1 Boyd_2012 BBRC_1           39.4          -75.7 a1                        NA Bra          
## 2 Boyd_2012 BBRC_1           39.4          -75.7 a1                        NA Bra          
## 3 Boyd_2012 BBRC_1           39.4          -75.7 a1                        NA Bra          
## 4 Boyd_2012 BBRC_1           39.4          -75.7 a1                        NA Bra          
## 5 Boyd_2012 BBRC_1           39.4          -75.7 a1                        NA Bra          
## 6 Boyd_2012 BBRC_1           39.4          -75.7 a1                        NA Bra          
## # ... with 9 more variables: vegetation_code <chr>, inundation_class <chr>, core_length_flag <chr>,
## #   depth_min <int>, depth_max <int>, dry_bulk_density <dbl>, fraction_organic_matter <dbl>,
## #   fraction_carbon <dbl>, fraction_carbon_type <chr>
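
The repeated BBRC_1 rows above arise because one core matches several depth intervals. As a minimal sketch, assuming two made-up toy tibbles (`cores` and `depths` are hypothetical stand-ins, not the tutorial data), the same behavior can be reproduced on a small scale, along with how left_join() treats a core that has no depth records:

```r
library(dplyr)

# Hypothetical toy data standing in for core_data and depthseries_data
cores <- tibble(core_id = c("A", "B", "C"),
                core_latitude = c(39.4, 48.0, 29.9))
depths <- tibble(core_id = c("A", "A", "B"),
                 depth_min = c(0, 10, 0),
                 depth_max = c(10, 20, 10))

# Every row of `cores` is kept; core "A" is repeated once per matching depth
# interval, and core "C" gets NA in the depth columns
joined <- left_join(cores, depths, by = "core_id")
print(joined)

# For comparison, inner_join() keeps only cores that have depth records
matched_only <- inner_join(cores, depths, by = "core_id")
print(matched_only)
```

Which join is appropriate depends on whether cores without depth records should remain in the result.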

Sometimes the two datasets you wish to join have different names for the same variable (e.g., “name” and “Name”, or “temp” and “C”). In that case, you can specify which variables to match using the “by” argument [ex: left_join(..., by = c("name" = "Name"))]. Because our datasets use the same names for their shared columns, we didn’t need to do this, so the “by” argument is left as NULL above, and dplyr joins on every column name the two datasets share (here study_id and core_id, as the message reports). [NULL is the default value for “by” in all _join() commands; we included it explicitly to illustrate the point.]
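
To make the “by” argument concrete, here is a small sketch using two invented tables (`sites` and `notes` and their contents are hypothetical, chosen only to mirror the “name” / “Name” example):

```r
library(dplyr)

# Hypothetical tables whose key columns are named differently
sites <- tibble(name = c("north", "south"), temp = c(12.5, 18.1))
notes <- tibble(Name = c("north", "south"), observer = c("Ana", "Raj"))

# Match `name` in the first table against `Name` in the second
site_notes <- left_join(sites, notes, by = c("name" = "Name"))
print(site_notes)
```

The joined table keeps the key column under the first table's name (`name`).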

Now that our core data and depth series data are both in the same tibble, we can perform new operations on them. To explore particular parts of the data, we often are interested in subsetting to rows with a common characteristic. The filter() operation allows for this to be done easily.

# Filter to just cores with coordinates from a high quality source
core_data_HQ_coord <- filter(core_data, position_code == 'a1')
# We can use logical operators, such as | (OR), in our filter operation
core_data_HQ_coord <- filter(core_data, position_code == 'a1' | position_code == 'a2')
# And we can filter by more than one criterion: conditions separated by a comma must all be true (AND)
core_data_HQ_coord_estuarine <- filter(core_data, position_code == 'a1' | position_code == 'a2',
                                       salinity_code == 'Est')

head(core_data_HQ_coord_estuarine)
## # A tibble: 6 x 10
##   study_id          core_id core_latitude core_longitude position_code core_elevation salinity_code
##   <chr>             <chr>           <dbl>          <dbl> <chr>                  <dbl> <chr>        
## 1 Crooks_et_al_2013 MA_A             48.0          -122. a1                      6.41 Est          
## 2 Crooks_et_al_2013 MA_B             48.0          -122. a1                      6.54 Est          
## 3 Crooks_et_al_2013 NE_A             48.0          -122. a1                      7.86 Est          
## 4 Crooks_et_al_2013 NE_B             48.0          -122. a1                     NA    Est          
## 5 Crooks_et_al_2013 QM_A             48.0          -122. a1                      6.12 Est          
## 6 Crooks_et_al_2013 QM_B             48.0          -122. a1                      5.99 Est          
## # ... with 3 more variables: vegetation_code <chr>, inundation_class <chr>, core_length_flag <chr>

This allows us to select observations that fulfill particular requirements; in this case, we wanted only estuarine cores with high-quality georeferencing, that is, cores with ‘a1’ or ‘a2’ as their position_code and ‘Est’ as their salinity_code.
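
When a column is compared against several allowed values, the chain of | comparisons can also be written with R's %in% operator. The sketch below assumes a tiny invented tibble (`toy_cores` is hypothetical, not the tutorial data) and checks that both spellings select the same rows:

```r
library(dplyr)

# Hypothetical toy version of core_data
toy_cores <- tibble(
  core_id = c("c1", "c2", "c3", "c4"),
  position_code = c("a1", "a2", "b", "a1"),
  salinity_code = c("Est", "Bra", "Est", "Est")
)

# Spelled out with | (OR); the comma between conditions acts as AND
with_or <- filter(toy_cores,
                  position_code == "a1" | position_code == "a2",
                  salinity_code == "Est")

# The same filter written with %in%
with_in <- filter(toy_cores,
                  position_code %in% c("a1", "a2"),
                  salinity_code == "Est")

print(with_in)
```

The %in% form tends to stay readable as the list of accepted codes grows.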
