Join and Filter Data
We will now learn a few more dplyr commands that will lead us to developing a map of soil cores.
Sometimes it is useful to combine the contents of one dataset with those of another; say, to look at how the fraction of organic carbon varies with geographic location. dplyr has a suite of commands that do this in different ways, all of which are named in “_join()” format. For our purposes, left_join() works:
# left_join keeps every row of the first given dataset and adds the columns of the second given dataset
# wherever rows match on a common key (e.g. core_id)
core_depth_data <- left_join(core_data, depthseries_data, by = NULL)
## Joining, by = c("study_id", "core_id")
head(core_depth_data)
## # A tibble: 6 x 16
## study_id core_id core_latitude core_longitude position_code core_elevation salinity_code
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr>
## 1 Boyd_2012 BBRC_1 39.4 -75.7 a1 NA Bra
## 2 Boyd_2012 BBRC_1 39.4 -75.7 a1 NA Bra
## 3 Boyd_2012 BBRC_1 39.4 -75.7 a1 NA Bra
## 4 Boyd_2012 BBRC_1 39.4 -75.7 a1 NA Bra
## 5 Boyd_2012 BBRC_1 39.4 -75.7 a1 NA Bra
## 6 Boyd_2012 BBRC_1 39.4 -75.7 a1 NA Bra
## # ... with 9 more variables: vegetation_code <chr>, inundation_class <chr>, core_length_flag <chr>,
## # depth_min <int>, depth_max <int>, dry_bulk_density <dbl>, fraction_organic_matter <dbl>,
## # fraction_carbon <dbl>, fraction_carbon_type <chr>
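To see how left_join() treats rows with and without a match, here is a minimal sketch on two small made-up tables. The `cores` and `depths` tibbles and all of their values are hypothetical and stand in for core_data and depthseries_data only for illustration.

```r
library(dplyr)

# Hypothetical toy tables (not the tutorial's data)
cores <- tibble(
  core_id       = c("A", "B", "C"),
  salinity_code = c("Est", "Bra", "Est")
)
depths <- tibble(
  core_id   = c("A", "A", "B"),
  depth_min = c(0, 10, 0)
)

# Every row of `cores` is kept; matching rows of `depths` are attached.
# Core "A" has two depth rows, so it appears twice in the result.
# Core "C" has no depth rows, so it is kept with NA in depth_min.
joined <- left_join(cores, depths, by = "core_id")
print(joined)
```

This is why core_depth_data above repeats the same core-level values on several rows: each core is matched to every one of its depth increments.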
Sometimes two datasets you wish to join have different names for the same variable (e.g., “name” and “Name”, or “temp” and “C”). If this is the case, you can specify which variables to match using the “by” argument, for example left_join(..., by = c("name" = "Name")). Because our data has uniform terminology and syntax, we didn’t need to do this, so the “by” argument is left as NULL above. NULL is the default value for “by” in all _join() commands, in which case the join uses every variable name the two datasets have in common (here study_id and core_id, as the message in the output reports); we included it only to illustrate the point.
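As a small illustration of joining on differently named keys, the sketch below uses two made-up tables. The `sites` and `notes` tibbles, their columns, and their values are hypothetical.

```r
library(dplyr)

# Hypothetical toy tables whose key columns differ only in capitalisation
sites <- tibble(name = c("north", "south"), temp = c(12.5, 15.1))
notes <- tibble(Name = c("north", "south"), observer = c("AB", "CD"))

# Match sites$name against notes$Name; the key keeps the name used
# in the first dataset ("name") in the result
site_notes <- left_join(sites, notes, by = c("name" = "Name"))
print(site_notes)
```

The left-hand side of each pair in “by” names the column in the first dataset and the right-hand side names the column in the second.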
Now that our core data and depth series data are both in the same tibble, we can perform new operations on them. To explore particular parts of the data, we are often interested in subsetting to rows with a common characteristic. The filter() operation makes this easy.
# Filter to just cores with coordinates from a high quality source
core_data_HQ_coord <- filter(core_data, position_code == 'a1')
# We can use logical operators, such as | (or), in our filter operation
core_data_HQ_coord <- filter(core_data, position_code == 'a1' | position_code == 'a2')
# And, we can filter by more than one criterion; conditions separated by commas must all be true
core_data_HQ_coord_estuarine <- filter(core_data, position_code == 'a1' | position_code == 'a2',
                                       salinity_code == 'Est')
head(core_data_HQ_coord_estuarine)
## # A tibble: 6 x 10
## study_id core_id core_latitude core_longitude position_code core_elevation salinity_code
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr>
## 1 Crooks_et_al_2013 MA_A 48.0 -122. a1 6.41 Est
## 2 Crooks_et_al_2013 MA_B 48.0 -122. a1 6.54 Est
## 3 Crooks_et_al_2013 NE_A 48.0 -122. a1 7.86 Est
## 4 Crooks_et_al_2013 NE_B 48.0 -122. a1 NA Est
## 5 Crooks_et_al_2013 QM_A 48.0 -122. a1 6.12 Est
## 6 Crooks_et_al_2013 QM_B 48.0 -122. a1 5.99 Est
## # ... with 3 more variables: vegetation_code <chr>, inundation_class <chr>, core_length_flag <chr>
This allows us to select observations that fulfill particular requirements; in this case, we wanted only estuarine cores with high-quality georeferencing, that is, cores with ‘a1’ or ‘a2’ as their position_code and ‘Est’ as their salinity_code.
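When one variable is allowed to take several values, the chain of | comparisons can also be written with the %in% operator. The sketch below checks on a made-up table that both forms select the same rows; the `toy_cores` tibble and its values, including the position code 'b', are hypothetical.

```r
library(dplyr)

# Hypothetical toy table (not the tutorial's data)
toy_cores <- tibble(
  core_id       = c("c1", "c2", "c3", "c4"),
  position_code = c("a1", "a2", "b", "a1"),
  salinity_code = c("Est", "Est", "Est", "Bra")
)

# Form used in the tutorial: | for "or", comma for "and"
with_or <- filter(toy_cores, position_code == 'a1' | position_code == 'a2',
                  salinity_code == 'Est')

# Equivalent form: %in% tests membership in a set of allowed values
with_in <- filter(toy_cores, position_code %in% c('a1', 'a2'),
                  salinity_code == 'Est')

print(with_in)
```

Both calls keep cores c1 and c2: c3 is dropped for its position code and c4 for its salinity code.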