Organize Observations
Organizing rows by the values of a particular column is a common intermediate step for analysis and plot generation, and is also helpful for quickly viewing your data in different arrangements. One method we can use is group_by(). This operation groups observations that have the same value of a given column ("core_id"). Note that it does not physically reorder the rows (arrange() does that); it records which group each row belongs to:
# Group data by core_id with group_by
depthseries_data_grouped <- group_by(depthseries_data_select, core_id)
head(depthseries_data_grouped)
## # A tibble: 6 x 4
## core_id depth_min depth_max organic_matter_density
## <chr> <int> <int> <dbl>
## 1 BBRC_1 0 2 0.0784
## 2 BBRC_1 4 6 0.0943
## 3 BBRC_1 8 10 0.108
## 4 BBRC_1 12 14 0.096
## 5 BBRC_1 16 18 0.0972
## 6 BBRC_1 20 22 0.0944
tail(depthseries_data_grouped)
## # A tibble: 6 x 4
## core_id depth_min depth_max organic_matter_density
## <chr> <int> <int> <dbl>
## 1 GL 0 2 0.0910
## 2 GL 8 10 0.0174
## 3 HH 0 2 0.0447
## 4 HH 8 10 0.0344
## 5 HL 0 2 0.0811
## 6 HL 8 10 0.0177
It didn't seem like it did anything, did it? The rows appear in the same order as before, with the observations from each core (BBRC_1, BBRC_2, NCBD_1, etc.) already next to one another. That's OK. The group_by() operation isn't really useful on its own, but it can be very powerful when paired with other operations. Here's one: if you want to condense your data to a single row that summarizes one attribute, you can do so with the summarize() operation.
# What is the average organic matter density of all observations?
depthseries_data_avg_organic_matter_density <- summarize(
  depthseries_data_select,  # Define dataset
  # New column that has one entry, the mean organic matter density
  organic_matter_density_avg = mean(organic_matter_density,
                                    na.rm = TRUE)  # Remove NA values before summarizing
)
depthseries_data_avg_organic_matter_density
## # A tibble: 1 x 1
## organic_matter_density_avg
## <dbl>
## 1 0.0679
This is the average organic matter density for all of our data. Now what if we wanted to calculate the average organic matter density for each core? Although group_by() and summarize() aren't that interesting by themselves, they are really effective when used in sequence:
# What is the average organic matter density of each core, for all cores?
depthseries_data_avg_OMD_by_core <- depthseries_data_select %>%
  group_by(core_id) %>%
  summarize(organic_matter_density_avg = mean(organic_matter_density, na.rm = TRUE))
depthseries_data_avg_OMD_by_core
## # A tibble: 1,534 x 2
## core_id organic_matter_density_avg
## <chr> <dbl>
## 1 AH 0.109
## 2 AL 0.0930
## 3 Alley_Pond 0.0734
## 4 Altamaha_River_Site_7_Levee NaN
## 5 Altamaha_River_Site_7_Plain NaN
## 6 Altamaha_River_Site_8_Levee NaN
## 7 Altamaha_River_Site_8_Plain NaN
## 8 Altamaha_River_Site_9_Levee NaN
## 9 Altamaha_River_Site_9_Plain NaN
## 10 AND 0.0690
## # ... with 1,524 more rows
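Using these two operations with the %>% notation, we were able to find the mean organic matter density for each of the cores in our dataset. Cores that show NaN have no non-missing organic matter density values: once na.rm = TRUE drops the NAs, there is nothing left to average.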
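The same pattern can be tried on a table small enough to check by hand. The sketch below is illustrative only: toy_depths and its values are made up, and core "C" is given only missing densities on purpose so that the NaN case appears.

```r
library(dplyr)

# Hypothetical toy data: core "C" has no recorded densities
toy_depths <- tibble(
  core_id = c("A", "A", "B", "B", "C"),
  organic_matter_density = c(0.10, 0.20, 0.05, NA, NA)
)

# One row per core, holding the mean of its non-missing densities
toy_avg_by_core <- toy_depths %>%
  group_by(core_id) %>%
  summarize(organic_matter_density_avg = mean(organic_matter_density, na.rm = TRUE))

toy_avg_by_core
# Core A averages 0.15, core B averages 0.05 (its NA is dropped),
# and core C is NaN because no values remain after dropping NAs
```

If the NaN rows are not wanted, one option is to drop them afterwards with filter(!is.nan(organic_matter_density_avg)).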