Organize Observations

Grouping rows by the values of a column is a common intermediary step for analysis and plot generation. One of the methods we can use to do so is group_by(). This operation marks observations that have the same value of a given column ("core_id") as belonging to the same group. It does not reorder the rows (arrange() is the function for sorting); it only records the grouping so that later operations are applied per group:

# Group data by core_id with group_by()
depthseries_data_grouped <- group_by(depthseries_data_select, core_id)
head(depthseries_data_grouped)

## # A tibble: 6 x 4
##   core_id depth_min depth_max organic_matter_density
##   <chr>       <int>     <int>                  <dbl>
## 1 BBRC_1          0         2                 0.0784
## 2 BBRC_1          4         6                 0.0943
## 3 BBRC_1          8        10                 0.108 
## 4 BBRC_1         12        14                 0.096 
## 5 BBRC_1         16        18                 0.0972
## 6 BBRC_1         20        22                 0.0944

tail(depthseries_data_grouped)

## # A tibble: 6 x 4
##   core_id depth_min depth_max organic_matter_density
##   <chr>       <int>     <int>                  <dbl>
## 1 GL              0         2                 0.0910
## 2 GL              8        10                 0.0174
## 3 HH              0         2                 0.0447
## 4 HH              8        10                 0.0344
## 5 HL              0         2                 0.0811
## 6 HL              8        10                 0.0177
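
One way to check what group_by() changed is to look at the grouping metadata rather than at the row order. The following is a minimal sketch on a small made-up tibble; toy_cores and its values are hypothetical and are not part of the dataset used on this page. It assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data, deliberately out of order (not from the dataset above)
toy_cores <- tibble(
  core_id   = c("B", "A", "B", "A"),
  depth_min = c(0L, 0L, 4L, 4L)
)

toy_grouped <- group_by(toy_cores, core_id)

# group_by() leaves the row order untouched...
toy_grouped$core_id      # "B" "A" "B" "A"

# ...but records core_id as the grouping variable
group_vars(toy_grouped)  # "core_id"
n_groups(toy_grouped)    # 2

# arrange() is the verb that actually reorders rows
toy_sorted <- arrange(toy_cores, core_id, depth_min)
toy_sorted$core_id       # "A" "A" "B" "B"
```

In the depth series data the rows happen to be ordered by core_id already, which is why the grouped and ungrouped tables look alike.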

The group_by() call didn't seem to do anything, did it? The rows are in the same order they were in originally (BBRC_1, BBRC_2, NCBD_1, etc.), because group_by() only records which rows belong to which core_id. That's OK. The group_by() operation isn't really useful on its own, but it can be very powerful when paired with other operations. Here's one: if you want to condense your data to a single row that summarizes one attribute, you can do so with the summarize() operation.

# What is the average organic matter density of all observations?
depthseries_data_avg_organic_matter_density <- summarize(
  depthseries_data_select, # Define dataset
  # New column that has one entry, the mean organic matter density
  organic_matter_density_avg = mean(organic_matter_density,
                                    na.rm = TRUE) # Remove NA values before summarizing
)
depthseries_data_avg_organic_matter_density

## # A tibble: 1 x 1
##   organic_matter_density_avg
##                        <dbl>
## 1                     0.0679

This is the average organic matter density for all of our data. Now what if we wanted to calculate the average organic matter density for each core? Although group_by() and summarize() aren't that interesting by themselves, they are really effective when used in sequence:

# What is the average organic matter density of each core, for all cores
depthseries_data_avg_OMD_by_core <- depthseries_data_select %>%
  group_by(core_id) %>%
  summarize(organic_matter_density_avg = mean(organic_matter_density, na.rm = TRUE))
depthseries_data_avg_OMD_by_core

## # A tibble: 1,534 x 2
##    core_id                     organic_matter_density_avg
##    <chr>                                            <dbl>
##  1 AH                                              0.109 
##  2 AL                                              0.0930
##  3 Alley_Pond                                      0.0734
##  4 Altamaha_River_Site_7_Levee                   NaN     
##  5 Altamaha_River_Site_7_Plain                   NaN     
##  6 Altamaha_River_Site_8_Levee                   NaN     
##  7 Altamaha_River_Site_8_Plain                   NaN     
##  8 Altamaha_River_Site_9_Levee                   NaN     
##  9 Altamaha_River_Site_9_Plain                   NaN     
## 10 AND                                             0.0690
## # ... with 1,524 more rows
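
For readers new to the %>% notation, it may help to see that a pipeline is only another way of writing nested function calls. The sketch below compares the two forms on a made-up tibble; toy_depthseries and its values are hypothetical, chosen only for illustration, and the dplyr package is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data (values are made up for illustration)
toy_depthseries <- tibble(
  core_id = c("A", "A", "B", "B"),
  organic_matter_density = c(0.10, 0.20, 0.30, 0.50)
)

# Piped form: read top to bottom
avg_piped <- toy_depthseries %>%
  group_by(core_id) %>%
  summarize(organic_matter_density_avg = mean(organic_matter_density, na.rm = TRUE))

# Nested form: the same computation, read from the inside out
avg_nested <- summarize(
  group_by(toy_depthseries, core_id),
  organic_matter_density_avg = mean(organic_matter_density, na.rm = TRUE)
)

# Both forms produce the same per-core averages
identical(avg_piped$organic_matter_density_avg,
          avg_nested$organic_matter_density_avg)
```

The piped form tends to be easier to read once more than two operations are chained, which is the main reason it is used here.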

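Using these two operations with the %>% (pipe) notation, we were able to find the mean organic matter density for each of the cores in our dataset. Cores that show NaN have no non-missing organic_matter_density values: once na.rm = TRUE drops the NAs, there is nothing left to average.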
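
When a per-core summary is going to feed later analysis, it can be useful to carry a count of usable measurements next to each mean. This is a minimal sketch of one way to do that, not part of the original workflow; toy_depthseries, n_measured, and toy_summary_measured are hypothetical names, and the values are made up.

```r
library(dplyr)

# Hypothetical toy data: core "B" has no usable measurements
toy_depthseries <- tibble(
  core_id = c("A", "A", "B", "B"),
  organic_matter_density = c(0.10, NA, NA, NA)
)

toy_summary <- toy_depthseries %>%
  group_by(core_id) %>%
  summarize(
    # Number of non-missing values per core
    n_measured = sum(!is.na(organic_matter_density)),
    # Mean of the non-missing values; NaN when a core has none
    organic_matter_density_avg = mean(organic_matter_density, na.rm = TRUE)
  )
toy_summary

# Optionally keep only cores with at least one measurement
toy_summary_measured <- filter(toy_summary, n_measured > 0)
toy_summary_measured
```

Whether to drop such cores or keep them as NaN depends on the analysis; keeping the count makes either choice explicit.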
