Data science fundamentals 02: Transform and explore (Jul 25, 19)

We will work with HANES data set.

In the spirit of the data transformation section in R for Data Science, we will look into dplyr() functions.

As usual, we will first import the curated HANES data set in RStudio:

  # Load the package RCurl
  library(RCurl)
  # Import the HANES data set from GitHub; break the string into two for readability
  # (Please note this readability aspect very carefully)
  URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
  URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
  # Paste it to constitute a single URL 
  URL <- paste(URL_text_1, URL_text_2, sep="")
  HANES <- read.csv(text = getURL(URL))
  # Rename the GENDER factor for identification
  HANES$GENDER <- factor(HANES$GENDER, labels=c("M", "F"))
  # Rename the AGEGROUP factor for identification
  HANES$AGEGROUP <- factor(HANES$AGEGROUP, labels=c("20-39", "40-59", "60+"))
  # Rename the HSQ_1 factor for identification
  HANES$HSQ_1 <- factor(HANES$HSQ_1, labels=c("Excellent","Very Good", "Good", "Fair", "Poor"))
  # Rename the DX_DBTS as a factor
  HANES$DX_DBTS <- factor(HANES$DX_DBTS, labels=c("DIAB", "DIAB NO_DX", "NO DIAB"))
  # Omit all NA from the data frame
  HANES <- na.omit(HANES)
  # Observe the structure
  str(HANES)

## Warning: package 'RCurl' was built under R version 3.5.2

## 'data.frame':    1112 obs. of  23 variables:
##  $ KEY              : Factor w/ 1527 levels "133370A","133370B",..: 28 43 44 53 55 70 84 90 100 107 ...
##  $ GENDER           : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SPAGE            : int  29 28 27 24 30 26 31 32 34 32 ...
##  $ AGEGROUP         : Factor w/ 3 levels "20-39","40-59",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HSQ_1            : Factor w/ 5 levels "Excellent","Very Good",..: 2 2 2 1 1 3 1 2 1 3 ...
##  $ UCREATININE      : int  105 53 314 105 163 150 46 36 177 156 ...
##  $ UALBUMIN         : num  0.707 1 8 4 3 2 2 0.707 4 3 ...
##  $ UACR             : num  0.00673 2 3 4 2 ...
##  $ MERCURYU         : num  0.37 0.106 0.487 2.205 0.979 ...
##  $ DX_DBTS          : Factor w/ 3 levels "DIAB","DIAB NO_DX",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ A1C              : num  5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 5.2 ...
##  $ CADMIUM          : num  0.2412 0.1732 0.0644 0.0929 0.1202 ...
##  $ LEAD             : num  1.454 1.019 0.863 1.243 0.612 ...
##  $ MERCURYTOTALBLOOD: num  2.34 2.57 1.32 14.66 2.13 ...
##  $ HDL              : int  42 51 42 61 52 50 57 56 42 44 ...
##  $ CHOLESTEROLTOTAL : int  184 157 145 206 120 155 156 235 156 120 ...
##  $ GLUCOSESI        : num  4.61 4.77 5.16 5 5.11 ...
##  $ CREATININESI     : num  74.3 73 80 84.9 66 ...
##  $ CREATININE       : num  0.84 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 1.09 ...
##  $ TRIGLYCERIDE     : int  156 43 108 65 51 29 31 220 82 35 ...
##  $ GLUCOSE          : int  83 86 93 90 92 85 72 87 96 92 ...
##  $ COTININE         : num  31.5918 0.0635 0.035 0.0514 0.035 ...
##  $ LDLESTIMATE      : int  111 97 81 132 58 99 93 135 98 69 ...
##  - attr(*, "na.action")= 'omit' Named int  2 15 16 24 26 28 33 34 35 39 ...
##   ..- attr(*, "names")= chr  "2" "15" "16" "24" ...

We will see more about tibbles in the wrangle section but we will make a short note of it now.

A tibble, or tbl_df, is a modern reimagining of the data frame, keeping what time has proven to be effective, and throwing out what is not.

Tibbles do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces one to confront problems earlier.

Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.

To convert a data frame to a tibble, we can use the function as.tibble(df) where df is a data frame.

We will convert the HANES data frame into tibble and use the tibble from now on.

  # Load the tidyverse library
  library(tidyverse)
  # Convert HANES data frame into a tibble and observe it
  HANES_TIB <- as.tibble(HANES)
  HANES_TIB

## # A tibble: 1,112 x 23
##    KEY   GENDER SPAGE AGEGROUP HSQ_1 UCREATININE UALBUMIN    UACR MERCURYU
##    <fct> <fct>  <int> <fct>    <fct>       <int>    <dbl>   <dbl>    <dbl>
##  1 1340… M         29 20-39    Very…         105    0.707 0.00673    0.370
##  2 1344… M         28 20-39    Very…          53    1     2          0.106
##  3 1344… M         27 20-39    Very…         314    8     3          0.487
##  4 1346… M         24 20-39    Exce…         105    4     4          2.21 
##  5 1346… M         30 20-39    Exce…         163    3     2          0.979
##  6 1352… M         26 20-39    Good          150    2     1          1.48 
##  7 1354… M         31 20-39    Exce…          46    2     4          0.106
##  8 1357… M         32 20-39    Very…          36    0.707 0.0196     0.238
##  9 1360… M         34 20-39    Exce…         177    4     2          2.30 
## 10 1362… M         32 20-39    Good          156    3     2          1.51 
## # … with 1,102 more rows, and 14 more variables: DX_DBTS <fct>, A1C <dbl>,
## #   CADMIUM <dbl>, LEAD <dbl>, MERCURYTOTALBLOOD <dbl>, HDL <int>,
## #   CHOLESTEROLTOTAL <int>, GLUCOSESI <dbl>, CREATININESI <dbl>,
## #   CREATININE <dbl>, TRIGLYCERIDE <int>, GLUCOSE <int>, COTININE <dbl>,
## #   LDLESTIMATE <int>

Tranformation basics are six key dplyr library functions that will allow us to solve vast majority of data manipulation challenges:

We can pick observations by their values using the function filter().
Reorder the rows applying arrange() operation to a data set.
We can choose variables by their names using select().
Create new variables through mutate().
Find the summary using summarise() function.
And the most important of all which is the group_by() operation which can be used with any of the above five.

The `filter()` function

The filter function allows us to pick a subset of information (or rows) from our dataset.

  # Pick all the records of patients with age == 45 
  # and HDL value of 50
  filter(HANES_TIB, SPAGE == 45, HDL == 50)

## # A tibble: 2 x 23
##   KEY   GENDER SPAGE AGEGROUP HSQ_1 UCREATININE UALBUMIN  UACR MERCURYU
##   <fct> <fct>  <int> <fct>    <fct>       <int>    <dbl> <dbl>    <dbl>
## 1 3291… M         45 40-59    Good           31        2     6    0.182
## 2 3292… F         45 40-59    Good          150        5     3    0.509
## # … with 14 more variables: DX_DBTS <fct>, A1C <dbl>, CADMIUM <dbl>,
## #   LEAD <dbl>, MERCURYTOTALBLOOD <dbl>, HDL <int>,
## #   CHOLESTEROLTOTAL <int>, GLUCOSESI <dbl>, CREATININESI <dbl>,
## #   CREATININE <dbl>, TRIGLYCERIDE <int>, GLUCOSE <int>, COTININE <dbl>,
## #   LDLESTIMATE <int>

Note: All dplyr functions write the result to a new data frame. If we want to store the data, we need to assign the result to a variable.

We can also use comparison and logical operators. For instance,

  # Pick all the records of patients who are between 45 and 46 and HDL value of 50
  filter(HANES_TIB, SPAGE >= 45 & SPAGE <= 46, HDL == 50)

## # A tibble: 3 x 23
##   KEY   GENDER SPAGE AGEGROUP HSQ_1 UCREATININE UALBUMIN  UACR MERCURYU
##   <fct> <fct>  <int> <fct>    <fct>       <int>    <dbl> <dbl>    <dbl>
## 1 3291… M         45 40-59    Good           31        2     6    0.182
## 2 5501… F         46 40-59    Good           54        3     6    0.386
## 3 3292… F         45 40-59    Good          150        5     3    0.509
## # … with 14 more variables: DX_DBTS <fct>, A1C <dbl>, CADMIUM <dbl>,
## #   LEAD <dbl>, MERCURYTOTALBLOOD <dbl>, HDL <int>,
## #   CHOLESTEROLTOTAL <int>, GLUCOSESI <dbl>, CREATININESI <dbl>,
## #   CREATININE <dbl>, TRIGLYCERIDE <int>, GLUCOSE <int>, COTININE <dbl>,
## #   LDLESTIMATE <int>

will list all the records for patients with age between \(45\) and \(46\) and who have the HDL value of \(50\).

Note: filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values.

Classwork/Homework:

Find all patients who:

Are females and in between ages \(45\) and \(60\)
Are in excellent or good health status
Who have diabetes but not yet diagnosed
Have total cholesterol between \(150\) and \(200\)
Have glucose levels greater than \(75\)
Who had poor health status but with more than median HDL values
High content of mercury (more than median) in their urine

How many patients have a missing UACR value in the HANES dataset?

The `arrange()` function

arrange() works similar to filter function except that instead of selecting rows, it changes their order.

It takes a data frame and a set of column names (or more complicated expressions) to order by.

If we provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

  # Arrange the data frame arranging GENDER and AGE with descending order of the variable UCREATININE
  a <- arrange(HANES, GENDER, SPAGE, desc(UCREATININE))
  atib <- as.tibble(a)
  atib

## # A tibble: 1,112 x 23
##    KEY   GENDER SPAGE AGEGROUP HSQ_1 UCREATININE UALBUMIN    UACR MERCURYU
##    <fct> <fct>  <int> <fct>    <fct>       <int>    <dbl>   <dbl>    <dbl>
##  1 1343… M         20 20-39    Exce…         612   26     4.00e+0    6.23 
##  2 5500… M         20 20-39    Very…         217   12     6.00e+0    0.166
##  3 2206… M         20 20-39    Very…         198   15     8.00e+0    0.318
##  4 3331… M         20 20-39    Good          153    5     3.00e+0    0.653
##  5 4435… M         20 20-39    Very…         148    1     1.00e+0    0.717
##  6 1403… M         20 20-39    Very…         132    3     2.00e+0    0.106
##  7 4477… M         20 20-39    Exce…         131    3     2.00e+0    0.649
##  8 2239… M         21 20-39    Exce…         269    1     3.70e-1    0.502
##  9 2239… M         21 20-39    Very…         205    0.707 3.45e-3    1.51 
## 10 3287… M         21 20-39    Very…         144   25     1.74e+1    0.448
## # … with 1,102 more rows, and 14 more variables: DX_DBTS <fct>, A1C <dbl>,
## #   CADMIUM <dbl>, LEAD <dbl>, MERCURYTOTALBLOOD <dbl>, HDL <int>,
## #   CHOLESTEROLTOTAL <int>, GLUCOSESI <dbl>, CREATININESI <dbl>,
## #   CREATININE <dbl>, TRIGLYCERIDE <int>, GLUCOSE <int>, COTININE <dbl>,
## #   LDLESTIMATE <int>

Note: Missing values are always sorted at the end.

To view the last few entries of the data frame/tibble, one can use the tail() function.

  tail(atib)

## # A tibble: 6 x 23
##   KEY   GENDER SPAGE AGEGROUP HSQ_1 UCREATININE UALBUMIN    UACR MERCURYU
##   <fct> <fct>  <int> <fct>    <fct>       <int>    <dbl>   <dbl>    <dbl>
## 1 3329… F         88 60+      Fair          107   11     10         0.106
## 2 2242… F         88 60+      Fair           89    8      9         0.106
## 3 4465… F         88 60+      Fair           60    0.707  0.0118    0.504
## 4 1384… F         89 60+      Good          181    7      3.87      1.44 
## 5 4427… F         90 60+      Good          309   59     19         0.184
## 6 1409… F         92 60+      Good           73   15     20         0.106
## # … with 14 more variables: DX_DBTS <fct>, A1C <dbl>, CADMIUM <dbl>,
## #   LEAD <dbl>, MERCURYTOTALBLOOD <dbl>, HDL <int>,
## #   CHOLESTEROLTOTAL <int>, GLUCOSESI <dbl>, CREATININESI <dbl>,
## #   CREATININE <dbl>, TRIGLYCERIDE <int>, GLUCOSE <int>, COTININE <dbl>,
## #   LDLESTIMATE <int>

Classwork/Homework:

Sort HANES to find patients with highest cholesterol. Do the same to find patients with least cholesterol.
Sort HANES to find pateints with high A1C.

The `select()` function

The orginal HANES data set has \(704\) variables/columns.

It is common for real world data sets to be large and we may want to quickly zoom into the variables of interest.

The select() function allows us to do that. Here is a code to select only A1C and GLUCOSE variables:

  # Select only the A1C and GLUCOSE variables from the HANES data
  s <- select(HANES, A1C, GLUCOSE)
  as.tibble(s)

## # A tibble: 1,112 x 2
##      A1C GLUCOSE
##    <dbl>   <int>
##  1   5        83
##  2   5.2      86
##  3   4.8      93
##  4   5.1      90
##  5   4.3      92
##  6   5.2      85
##  7   4.8      72
##  8   5.2      87
##  9   4.8      96
## 10   5.2      92
## # … with 1,102 more rows

An useful helper arument is the everything() function that allows keeping all the variables and pushing the variables of interest in the front.

  # Push the A1C and GLUCOSE variables to the front keeping everything
  s <- select(HANES, A1C, GLUCOSE, everything())
  as.tibble(s)

## # A tibble: 1,112 x 23
##      A1C GLUCOSE KEY   GENDER SPAGE AGEGROUP HSQ_1 UCREATININE UALBUMIN
##    <dbl>   <int> <fct> <fct>  <int> <fct>    <fct>       <int>    <dbl>
##  1   5        83 1340… M         29 20-39    Very…         105    0.707
##  2   5.2      86 1344… M         28 20-39    Very…          53    1    
##  3   4.8      93 1344… M         27 20-39    Very…         314    8    
##  4   5.1      90 1346… M         24 20-39    Exce…         105    4    
##  5   4.3      92 1346… M         30 20-39    Exce…         163    3    
##  6   5.2      85 1352… M         26 20-39    Good          150    2    
##  7   4.8      72 1354… M         31 20-39    Exce…          46    2    
##  8   5.2      87 1357… M         32 20-39    Very…          36    0.707
##  9   4.8      96 1360… M         34 20-39    Exce…         177    4    
## 10   5.2      92 1362… M         32 20-39    Good          156    3    
## # … with 1,102 more rows, and 14 more variables: UACR <dbl>,
## #   MERCURYU <dbl>, DX_DBTS <fct>, CADMIUM <dbl>, LEAD <dbl>,
## #   MERCURYTOTALBLOOD <dbl>, HDL <int>, CHOLESTEROLTOTAL <int>,
## #   GLUCOSESI <dbl>, CREATININESI <dbl>, CREATININE <dbl>,
## #   TRIGLYCERIDE <int>, COTININE <dbl>, LDLESTIMATE <int>

The `mutate()` function

We can add new columns to the data set using the mutate() function.

For example, this code makes a new variable that is the sum of HDL and LDL estimate, stores it in a variable “H_L” to find the ratio between cholesterol and this new variable, H_L:

   # Select only the variables LDLESTIMATE,HDL,CHOLESTEROLTOTAL,TRIGLYCERIDE
   HANES_sub <- select(HANES,LDLESTIMATE,HDL,CHOLESTEROLTOTAL,TRIGLYCERIDE)
   # Add a new variable H_L that is a sum of HDL and LDL
   m <- mutate(HANES_sub, H_L = HDL+LDLESTIMATE, ratio=CHOLESTEROLTOTAL/H_L)
   as.tibble(m)

## # A tibble: 1,112 x 6
##    LDLESTIMATE   HDL CHOLESTEROLTOTAL TRIGLYCERIDE   H_L ratio
##          <int> <int>            <int>        <int> <int> <dbl>
##  1         111    42              184          156   153  1.20
##  2          97    51              157           43   148  1.06
##  3          81    42              145          108   123  1.18
##  4         132    61              206           65   193  1.07
##  5          58    52              120           51   110  1.09
##  6          99    50              155           29   149  1.04
##  7          93    57              156           31   150  1.04
##  8         135    56              235          220   191  1.23
##  9          98    42              156           82   140  1.11
## 10          69    44              120           35   113  1.06
## # … with 1,102 more rows

The transmute() function retains only the created variables.

   # Select only the variables LDLESTIMATE,HDL,CHOLESTEROLTOTAL,TRIGLYCERIDE
   HANES_sub <- select(HANES,LDLESTIMATE,HDL,CHOLESTEROLTOTAL,TRIGLYCERIDE)
   # Add a new variable H_L that is a sum of HDL and LDL
   m <- transmute(HANES_sub, H_L = HDL+LDLESTIMATE, ratio=CHOLESTEROLTOTAL/H_L)
   as.tibble(m)

## # A tibble: 1,112 x 2
##      H_L ratio
##    <int> <dbl>
##  1   153  1.20
##  2   148  1.06
##  3   123  1.18
##  4   193  1.07
##  5   110  1.09
##  6   149  1.04
##  7   150  1.04
##  8   191  1.23
##  9   140  1.11
## 10   113  1.06
## # … with 1,102 more rows

There are several useful functions for data manipulation such as lag(), lead(), cumsum(), min_rank() etc. and wrappers around them.

One of them is the top_n() function that is made out of the filter() and min_rank() functions that would list the top \(n\) values from the data.

For example, here is a code to select the top \(5\) values in the above ratio.

  # Select the top 5 ratios in the above transmuted data frame m
  top_5_ratio <-  top_n(m, 5, ratio)
  top_5_ratio

##   H_L    ratio
## 1 121 1.570248
## 2 141 1.588652
## 3 107 1.654206
## 4 121 1.586777
## 5  85 1.752941

The `group_by()` and `summarise()` functions

The group_by() is an extremely useful function that facililates grouped analysis and along with summarise() they can provide vital statistics for a group.

Let us say we want to find the mean cholesterol level for each gender, the group_by() and summarise() operation can be extremely handy.

  # Group the HANES data by GENDER 
  HANES_by_gender <- group_by(HANES, GENDER)
  # Find the mean cholesterol level for the two GENDERS
  summarise(HANES_by_gender, 
            mean_chol = mean(CHOLESTEROLTOTAL),
            sd_chol = sd(CHOLESTEROLTOTAL))

## # A tibble: 2 x 3
##   GENDER mean_chol sd_chol
##   <fct>      <dbl>   <dbl>
## 1 M           184.    37.8
## 2 F           187.    37.6

We find the mean cholesterol levels for females are little higher than males.

The summarise() has a useful function called the count() that will count the number of items appearing in a group.

For example, here is an analysis that shows the relationship between HDL and LDL for each age, grouped by the number of patients with the particular age.

  # First group the patients by age
  by_AGE <- group_by(HANES, SPAGE)
  # Summarize by counting the number of people in that age, and finding the mean HDL and LDL values
  DL <- summarise(by_AGE,
                   count = n(),
                   HDL = mean(HDL),
                   LDL = mean(LDLESTIMATE))
  # We remove the groups with five people or less
  DL <- filter(DL, count > 5)
  # Plot this data, with point for count estimates and smoothing it by line
  ggplot(data = DL, mapping = aes(x = HDL, y = LDL)) +
    geom_point(aes(size = count, color = SPAGE), alpha = 1/2) +
    geom_smooth(se = FALSE)

  # First group the patients by age
  by_AGE <- group_by(HANES, SPAGE)
  # Summarize by counting the number of people in that age, and finding the mean HDL and LDL values
  DL <- summarise(by_AGE,
                   count = n(),
                   HDL = mean(HDL),
                   LDL = mean(LDLESTIMATE))
  # We remove the groups with five people or less
  DL <- filter(DL, count > 5)
  # Plot this data, with point for count estimates and smoothing it by line
  ggplot(data = DL, mapping = aes(x = HDL, y = LDL)) +
    geom_point(aes(size = count, color = SPAGE), alpha = 1/2) +
    geom_smooth(se = FALSE)

The big disadvantage in the above code is the necessity to re-type the variables again and again, which can be eliminated through the use of a special symbol, %>%, the pipe operator.

As the analysis becomes complicated, piping becomes extremely useful. The following code is the piped version of the above code producing the exact same output.

  # Pipe the HANES data to the group_by function, grouping by age
  DL <- HANES %>% group_by(SPAGE) %>% 
  # and pipe this to summarize function
  summarise(count = n(),
            HDL = mean(HDL),
            LDL = mean(LDLESTIMATE)) %>% 
  # And pipe this result to filter the count < 5
  filter(count > 5)

  # Pipe the HANES data to the group_by function, grouping by age
  HANES %>% group_by(SPAGE) %>% 
  # and pipe this to summarize function
  summarise(count = n(),
            HDL = mean(HDL),
            LDL = mean(LDLESTIMATE)) %>% 
  # And pipe this result to filter the count < 5
  filter(count > 5) -> DL

and then,

# Plot this data, with point for count estimates and smoothing it by line
  ggplot(data = DL, mapping = aes(x = HDL, y = LDL)) +
    geom_point(aes(size = count, color = SPAGE), alpha = 1/2) +
    geom_smooth(se = FALSE)

Piping the data directly to the ggplot function:

  # Pipe the HANES data to the group_by function, grouping by age
  HANES %>% group_by(SPAGE) %>% 
  # and pipe this to summarize function
  summarise(count = n(),
            HDL = mean(HDL),
            LDL = mean(LDLESTIMATE)) %>% 
  # And pipe this result to filter the count < 5
  filter(count > 5) %>% 
  # Plot this data, with point for count estimates and smoothing it by line
  ggplot(mapping = aes(x = HDL, y = LDL)) +
    geom_point(aes(size = count, color = SPAGE), alpha = 1/3) +
    geom_smooth(se = FALSE)

Missing values

The use of the function na.rm() inside summarise can be extremely useful as this example shows:

(Note: Since we already removed ‘NA’ from our dataset, we import the new data with NA in it)

  # Load the package RCurl
  library(RCurl)
  # Import the HANES data set from GitHub; break the string into two for readability
  # (Please note this readability aspect very carefully)
  URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
  URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
  # Paste it to constitute a single URL 
  URL <- paste(URL_text_1, URL_text_2, sep="")
  HANES_with_NA <- read.csv(text = getURL(URL))

Without removing NA

  # Group by AGE and GENDER and
  HANES_with_NA %>% 
  group_by(SPAGE, GENDER) %>% 
  # summarize the mean of cholesterol for each group without removing NA
  summarise(mean = mean(CHOLESTEROLTOTAL))

## # A tibble: 143 x 3
## # Groups:   SPAGE [75]
##    SPAGE GENDER  mean
##    <int>  <int> <dbl>
##  1    20      1   NA 
##  2    20      2   NA 
##  3    21      1   NA 
##  4    21      2   NA 
##  5    22      1  153.
##  6    22      2   NA 
##  7    23      1   NA 
##  8    23      2   NA 
##  9    24      1   NA 
## 10    24      2   NA 
## # … with 133 more rows

Removing NA

  # Group by AGE and GENDER and 
  HANES_with_NA %>% 
  group_by(SPAGE, GENDER) %>% 
  # summarize the mean of cholesterol for each group by removing NA
  summarise(mean = mean(CHOLESTEROLTOTAL, na.rm=TRUE))

## # A tibble: 143 x 3
## # Groups:   SPAGE [75]
##    SPAGE GENDER  mean
##    <int>  <int> <dbl>
##  1    20      1  165.
##  2    20      2  157.
##  3    21      1  152.
##  4    21      2  154.
##  5    22      1  153.
##  6    22      2  167.
##  7    23      1  166.
##  8    23      2  170.
##  9    24      1  169.
## 10    24      2  174.
## # … with 133 more rows

We can also filter through the missing values first -

  # Filter the data and get no NA values 
  HANES_with_NA %>% filter(!is.na(CHOLESTEROLTOTAL)) %>%
  # group by AGE and GENDER 
  group_by(SPAGE, GENDER) %>% 
  # and find the mean of cholesterol for each group
  summarise(mean = mean(CHOLESTEROLTOTAL))

## # A tibble: 140 x 3
## # Groups:   SPAGE [75]
##    SPAGE GENDER  mean
##    <int>  <int> <dbl>
##  1    20      1  165.
##  2    20      2  157.
##  3    21      1  152.
##  4    21      2  154.
##  5    22      1  153.
##  6    22      2  167.
##  7    23      1  166.
##  8    23      2  170.
##  9    24      1  169.
## 10    24      2  174.
## # … with 130 more rows

Here are some examples of useful summary functions:

Example 1: Finding the mean of CHOLESTEROLTOTAL and the mean of CHOLESTEROLTOTAL if GLUCOSE >100 for each age group.

  DL <- HANES %>% 
    # For each AGEGROUP,
    group_by(AGEGROUP) %>%
    # find the mean of CHOLESTEROLTOTAL and the mean of CHOLESTEROLTOTAL when GLUCOSE content is > 100
    summarise(count = n(),
              H = mean(CHOLESTEROLTOTAL),
              H_glu = mean(CHOLESTEROLTOTAL[GLUCOSE>100]))
  # And print it
  DL

## # A tibble: 3 x 4
##   AGEGROUP count     H H_glu
##   <fct>    <int> <dbl> <dbl>
## 1 20-39      498  176.  183.
## 2 40-59      391  194.  189.
## 3 60+        223  192.  186.

Example 2: Finding the standard deviation of CHOLESTEROLTOTAL by age group

  DL <- HANES %>% 
  # For each AGEGROUP,
  group_by(AGEGROUP) %>% 
  # sumamarise the standard deviation of CHOLESTEROLTOTAL 
  # and arrange in descending order
  summarise(CHO_sd = sd(CHOLESTEROLTOTAL)) %>% 
  arrange(desc(CHO_sd))
  # and print it
  DL

## # A tibble: 3 x 2
##   AGEGROUP CHO_sd
##   <fct>     <dbl>
## 1 60+        41.8
## 2 40-59      36.7
## 3 20-39      34.2

Example 3: Finding largest and smallest CHOLESTEROLTOTAL for each diabetes diagnosis group.

  DL <- HANES %>% 
  # For each DX_DBTS,
  group_by(DX_DBTS) %>% 
  # find the largest and smallest CHOLESTEROLTOTAL value
  summarise(Lartgest_CHO = max(CHOLESTEROLTOTAL), 
            Smallest_CHO = min(CHOLESTEROLTOTAL)) 
  # and print it
  DL

## # A tibble: 3 x 3
##   DX_DBTS    Lartgest_CHO Smallest_CHO
##   <fct>             <int>        <int>
## 1 DIAB                315          105
## 2 DIAB NO_DX          246          132
## 3 NO DIAB             335           95

Example 4: Finding largest and smallest CHOLESTEROLTOTAL for each diabetes diagnosis group and by age group.

  DL <- HANES %>% 
  # For each DX_DBTS and AGEGROUP,
  group_by(DX_DBTS, AGEGROUP) %>% 
  # find the largest and smallest CHOLESTEROLTOTAL value
  summarise(Lartgest_CHO = max(CHOLESTEROLTOTAL), 
            Smallest_CHO = min(CHOLESTEROLTOTAL)) 
  # and print it
  DL

## # A tibble: 9 x 4
## # Groups:   DX_DBTS [3]
##   DX_DBTS    AGEGROUP Lartgest_CHO Smallest_CHO
##   <fct>      <fct>           <int>        <int>
## 1 DIAB       20-39             228          138
## 2 DIAB       40-59             315          112
## 3 DIAB       60+               304          105
## 4 DIAB NO_DX 20-39             221          157
## 5 DIAB NO_DX 40-59             238          141
## 6 DIAB NO_DX 60+               246          132
## 7 NO DIAB    20-39             308           95
## 8 NO DIAB    40-59             335           98
## 9 NO DIAB    60+               313           97

Grouped transformations

Grouping variables is extremely useful when used with the summarise() function, but it can also be used with other operations such as mutate() and filter().

Examples:

Among the diabetes diagnostic within age groups, find the members (top 3 ranks) with high glucose levels.

  # Group by diabetes status and age group,
   HANES %>% group_by(DX_DBTS, AGEGROUP) %>% 
  # filter for the top 3 ranks
   filter(rank(desc(GLUCOSE)) < 3) %>% 
  # select only the relevant variables
  select(DX_DBTS, AGEGROUP, GLUCOSE) %>% 
  # and arrange in descending order of glucose levels
  arrange(desc(GLUCOSE))

## # A tibble: 19 x 3
## # Groups:   DX_DBTS, AGEGROUP [9]
##    DX_DBTS    AGEGROUP GLUCOSE
##    <fct>      <fct>      <int>
##  1 DIAB       60+          490
##  2 DIAB       40-59        320
##  3 DIAB       40-59        297
##  4 DIAB       60+          281
##  5 DIAB NO_DX 40-59        219
##  6 DIAB NO_DX 40-59        214
##  7 DIAB NO_DX 40-59        214
##  8 DIAB       20-39        212
##  9 DIAB       20-39        186
## 10 DIAB NO_DX 60+          178
## 11 DIAB NO_DX 60+          171
## 12 NO DIAB    40-59        159
## 13 NO DIAB    20-39        149
## 14 NO DIAB    60+          147
## 15 DIAB NO_DX 20-39        140
## 16 NO DIAB    60+          136
## 17 NO DIAB    40-59        125
## 18 NO DIAB    20-39        120
## 19 DIAB NO_DX 20-39         87

Omit age groups (SPAGE) with low number of observations (say, \(30\)) and standardise A1C to compute per group metric.

  # Group by SPAGE and filter groups that contain more than 30 individuals
  HANES %>% group_by(SPAGE) %>% filter(n() > 30) %>% 

  # Filter for all records with A1C variable > 0 to avoid NaNs,
  filter(A1C > 0) %>%
  # standardise A1C dividing by the total sum of A1C in the group
  mutate(standardised_A1C = A1C / sum(A1C)) %>%
  # and seelct only A1C and standardised A1C
  select(A1C, standardised_A1C)

## # A tibble: 226 x 3
## # Groups:   SPAGE [7]
##    SPAGE   A1C standardised_A1C
##    <int> <dbl>            <dbl>
##  1    29   5             0.0320
##  2    28   5.2           0.0312
##  3    27   4.8           0.0301
##  4    24   5.1           0.0324
##  5    30   4.3           0.0243
##  6    30   5.3           0.0299
##  7    28   4.7           0.0282
##  8    30   4.7           0.0265
##  9    23   4.9           0.0291
## 10    30   5.2           0.0293
## # … with 216 more rows

Selected materials and references

R for Data Science - Data Transformation Part Data Transformation Cheat Sheet Data frames Vs. Tibbles

Data science fundamentals 02: Transform and explore (Jul 25, 19)

The filter() function

The arrange() function

The select() function

The mutate() function

The group_by() and summarise() functions