R fundamentals 02: Basic graphics in R (Jul 15, 19)


In this part of the lecture we will learn about graphics in R.


Graphics

R has very strong graphical capabilities - this is the primary reason why both industries and academics are interested.

  • Creates plots with code
  • Replication and modification is easy
  • Reproducibility!
  • graphics package loaded by default produces great plots
  • Excellent packages like ggplot2, ggvis and lattice

What are packages?

Packages are extensions of R functionality, adding a new set of functions that are tailored to handle tasks with a common purpose. When a new package is loaded, it often happens that some of the new functions have the same name of currently loaded ones. When happens, we get a “conflict event” and R warns you and list all commands that are going to mask previously loaded ones.

Packages are loaded using the library() function.

When a package is being loaded, it can load other packages that it depends on.

Sometimes we want to use a function from a specific package just once, at this case, instead of loading the whole package to the memory, we can refer to it directly by using the :: operator. For example: Hmisc::cut2()


graphics package

This package is part of the default list of loaded packages when you start R. It has many functions. Primarily plot() and hist() provide essential functionalities.

The plot() function is generic, which means:

  1. Different inputs gives different plots
  2. Can plot several things like vectors, linear models, kernel densities etc.

Before we see how the plot function works, we will first import a public health data set. We will work with HANES data set which is New York City’s Health and Nutrition survey data set. For more info about HANES, click here.

  # If needed, install RCurl package, then load the package
  # install.packages("RCurl")
  library(RCurl)
## Warning: package 'RCurl' was built under R version 3.5.2
## Loading required package: bitops
  # Import the HANES data set from GitHub; break the string into two for readability
  # (Please note this readability aspect very carefully)
  URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
  URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
  # Paste it to constitute a single URL 
  URL <- paste(URL_text_1, URL_text_2, sep="")
  HANES <- read.csv(text = getURL(URL))

We now observe the structure of the data.

  # Observe the structure
  str(HANES)
## 'data.frame':    1527 obs. of  23 variables:
##  $ KEY              : Factor w/ 1527 levels "133370A","133370B",..: 28 32 43 44 53 55 70 84 90 100 ...
##  $ GENDER           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ SPAGE            : int  29 27 28 27 24 30 26 31 32 34 ...
##  $ AGEGROUP         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ HSQ_1            : int  2 2 2 2 1 1 3 1 2 1 ...
##  $ UCREATININE      : int  105 296 53 314 105 163 150 46 36 177 ...
##  $ UALBUMIN         : num  0.707 18 1 8 4 3 2 2 0.707 4 ...
##  $ UACR             : num  0.00673 6 2 3 4 ...
##  $ MERCURYU         : num  0.37 NA 0.106 0.487 2.205 ...
##  $ DX_DBTS          : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ A1C              : num  5 5.5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 ...
##  $ CADMIUM          : num  0.2412 0.4336 0.1732 0.0644 0.0929 ...
##  $ LEAD             : num  1.454 0.694 1.019 0.863 1.243 ...
##  $ MERCURYTOTALBLOOD: num  2.34 3.11 2.57 1.32 14.66 ...
##  $ HDL              : int  42 52 51 42 61 52 50 57 56 42 ...
##  $ CHOLESTEROLTOTAL : int  184 117 157 145 206 120 155 156 235 156 ...
##  $ GLUCOSESI        : num  4.61 4.5 4.77 5.16 5 ...
##  $ CREATININESI     : num  74.3 80 73 80 84.9 ...
##  $ CREATININE       : num  0.84 0.91 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 ...
##  $ TRIGLYCERIDE     : int  156 63 43 108 65 51 29 31 220 82 ...
##  $ GLUCOSE          : int  83 81 86 93 90 92 85 72 87 96 ...
##  $ COTININE         : num  31.5918 57.6882 0.0635 0.035 0.0514 ...
##  $ LDLESTIMATE      : int  111 52 97 81 132 58 99 93 135 98 ...

Note that GENDER, AGEGROUP and HSQ_1 are integers but in fact they should be factors! So, we need to convert them to factors.

  # Convert them to factors
  HANES$GENDER <- as.factor(HANES$GENDER)
  HANES$AGEGROUP <- as.factor(HANES$AGEGROUP)
  HANES$HSQ_1 <- as.factor(HANES$HSQ_1)
  
  # Now observe the structure
  str(HANES)
## 'data.frame':    1527 obs. of  23 variables:
##  $ KEY              : Factor w/ 1527 levels "133370A","133370B",..: 28 32 43 44 53 55 70 84 90 100 ...
##  $ GENDER           : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SPAGE            : int  29 27 28 27 24 30 26 31 32 34 ...
##  $ AGEGROUP         : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ HSQ_1            : Factor w/ 5 levels "1","2","3","4",..: 2 2 2 2 1 1 3 1 2 1 ...
##  $ UCREATININE      : int  105 296 53 314 105 163 150 46 36 177 ...
##  $ UALBUMIN         : num  0.707 18 1 8 4 3 2 2 0.707 4 ...
##  $ UACR             : num  0.00673 6 2 3 4 ...
##  $ MERCURYU         : num  0.37 NA 0.106 0.487 2.205 ...
##  $ DX_DBTS          : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ A1C              : num  5 5.5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 ...
##  $ CADMIUM          : num  0.2412 0.4336 0.1732 0.0644 0.0929 ...
##  $ LEAD             : num  1.454 0.694 1.019 0.863 1.243 ...
##  $ MERCURYTOTALBLOOD: num  2.34 3.11 2.57 1.32 14.66 ...
##  $ HDL              : int  42 52 51 42 61 52 50 57 56 42 ...
##  $ CHOLESTEROLTOTAL : int  184 117 157 145 206 120 155 156 235 156 ...
##  $ GLUCOSESI        : num  4.61 4.5 4.77 5.16 5 ...
##  $ CREATININESI     : num  74.3 80 73 80 84.9 ...
##  $ CREATININE       : num  0.84 0.91 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 ...
##  $ TRIGLYCERIDE     : int  156 63 43 108 65 51 29 31 220 82 ...
##  $ GLUCOSE          : int  83 81 86 93 90 92 85 72 87 96 ...
##  $ COTININE         : num  31.5918 57.6882 0.0635 0.035 0.0514 ...
##  $ LDLESTIMATE      : int  111 52 97 81 132 58 99 93 135 98 ...

Let’s plot a categorical variable, for instance gender.

  # Plot the factor gender
  plot(HANES$GENDER)


Classwork/Homework:

  1. Is the above plot informative?
  2. What will you do to make it more informative?


Let’s now plot a numerical variable.

  # Plot a numerical variable
  plot(HANES$A1C)

Of course, we can plot two numerical variables:

  # Plot two numerical variables 
  # A1C - Hemoglobin percentage, UACR - Urine Albumin/Creatinine Ratio
  plot(HANES$A1C, HANES$UACR)

Note that R autamatically renders them as a scatter plot and set the axes scale based on the range of the variables:

min(HANES$A1C, na.rm = T); max(HANES$A1C, na.rm = T)
## [1] 3
## [1] 13.4

min(HANES$UACR, na.rm = T); max(HANES$UACR, na.rm = T)
## [1] 0.002412969
## [1] 5327

For the purpose of learning Rmarkdown, have a look at the output of same code above, this time we used the option: results=‘hold’

min(HANES$A1C, na.rm = T); max(HANES$A1C, na.rm = T)

min(HANES$UACR, na.rm = T); max(HANES$UACR, na.rm = T)
## [1] 3
## [1] 13.4
## [1] 0.002412969
## [1] 5327


However, this plot is uninformative as the data is unevenly scattered. One can scale the data using the “ylim” argument:

  # Plot two numerical variables with appropriate scaling
  plot(HANES$A1C, HANES$UACR, ylim=c(0, 10))

Although the scaling is okay now, the relationship is extremely complicated.

One of the transformations that helps us to understand relationships between the variables is the log() function.

We can apply logrithm to both variables -

  # Transform the data using the log function and plot the result
  plot(log(HANES$A1C), log(HANES$UACR))

We note that there are two different clusters of patients - one with low UACR values and another with high UACR values, both corresponding to a mean \(log(A1C)\) of about \(1.7\).

We can also plot two categorical variables. Let us plot GENDER and AGEGROUP factors.

Lets change the texts to render something more informative (based on the HANES codebook):

  # Rename the GENDER factor for identification 
  HANES$GENDER <- factor(HANES$GENDER, labels=c("M","F"))
  # Rename the AGEGROUP factor for identification
  HANES$AGEGROUP <- factor(HANES$AGEGROUP, labels=c("20-39","40-59","60+"))

  # Plot GENDER vs AGEGROUP
  plot(HANES$GENDER, HANES$AGEGROUP)

Note that R already prints proportion as it displays the plots. The first element is the \(x\)-axis and the second element is the \(y\)-axis.

Now, let’s switch the order:

# Swap AGEGROUP vs GENDER
  plot(HANES$AGEGROUP, HANES$GENDER)

Next, let’s explore the hist() function. hist() is a short form for histogram.
The hist() function:

Here is an example to find the distribution of A1C variable for the male population.

First select only the male population:

  # Form a logical vector consisting only the MALE gender
  HANES_MALE <- HANES$GENDER == "M"
  # Select only the records for the male population
  MALES_DF <- HANES[HANES_MALE,]

Now, let’s make an histogram for the above selected male population:

  # Make an historgam
  hist(MALES_DF$A1C)

Observe that the Glycohemoglobin percentage lies between \(5-6\) for most of the men (the mode).

Note that R has also chosen the number of bins automatically.

You can increase (or decrease) the number of bins using the “breaks” argument.

There are other cool tools like barplot(), boxplot(), pairs() in the graphics package.

The plot system allows to add different plots one on top of the other.

For example, on top of the histogram, let’s add a vertical line represents the mean of the distribution

# Make an historgam
  hist(MALES_DF$A1C)
# Add a vertical line, supplying the x-axis value
  abline(v = mean(MALES_DF$A1C, na.rm = T), col="red")



Classwork/Homework:

  1. Find the distribution of A1C for the female population in the above data set. Are they different?
  2. Add vertical lines that indicate the boundaries of the standard error of the mean.
  3. Find the distribution of A1C for three age groups in the above data set. Is there a difference?
  4. Try to find the distribution of one more numeric variable (other than A1C) for the three age-groups.
  5. Try some plots with a higher number of bins in the above exercise, what happens?


Customizing plots

How does this plot look?

  # Plot LDL values vs HDL values
  plot(HANES$LDL, HANES$HDL)

compared to this -

  # Plot GLUCOSE vs GLUCOSESI with parameters
  plot(HANES$GLUCOSE, HANES$GLUCOSESI, 
       xlab= "Plasma Glucose [mg/dL]", ylab = expression(paste("Blood Glucose SI units [", mu, "mole/L]")), 
       main = "Plasma vs Blood Glucose", type = "o", col="blue")



Classwork/Homework: Check the Hmisc::label() function. In accordance to the graph above, think how one can leverage this function to save some typing when plotting several graphs with the same variable? Give an example.



To do good data science, it certainly not only helps to know correlations between the variables (in the above figure, we know blood glucose levels and plasma glucose levels are the same), but how we present the data matters!

Some plot function characteristics:



Classwork/Homework: Change the type to “l” and report the plot type.



Graphical parameters are not maintained throughout session. If you want to maintain graphical parameters, use the par() function. For example,

  # Set the graphical parameter par's so that color red is held
  par(col="red")
  # Plot LDL vs HDL
  plot(HANES$LDL, HANES$HDL)

# Now make another plot:
  # This time Hemoglobin vs HDL
  plot(HANES$A1C, HANES$HDL)

Tip: As our commands become more and more complex and ask for more and more arguments, the specification of the dataset name again and again becomes onerous. To save some typing we have the function with(). Here is an example:

  # Set the graphical parameter par's so that color red is held
  par(col="red")
  # Plot LDL vs HDL
  with(HANES, plot(LDL, HDL), xlab = label(LDL))

# Now make another plot:
  # This time Hemoglobin vs HDL
  with(HANES, plot(A1C, HDL), ylab = label(HDL))


More graphical parameters:


Multiple graphs

So far we saw single plots of data, with no combinations and layers. It may be good to plot several. We can use “mfrow” with the par() function.

  # Set the par function with mfrow to 2x2 "grid"
  par(mfrow = c(2,2))
  # Plot LDL vs HDL
  plot(HANES$LDL, HANES$HDL)
  # Plot A1C vs HDL
  plot(HANES$A1C, HANES$HDL)
  # Plot GLUCOSE vs HDL
  plot(HANES$GLUCOSE, HANES$HDL)
  # Plot CHOLESTEROLTOTAL vs HDL
  plot(HANES$CHOLESTEROLTOTAL, HANES$HDL)



Classwork/Homework: Do the above exercise with “mfcol” argument. How does it plot?


To reset the plot to 1 figure, one can use par(mfrow = c(1,1)), that will get us back to normal.


The layout() function

Facilitates more complex plot arrangements.

  # Create a grid on how our figures should appear
  grid <- matrix(c(1,1,2,3), nrow=2, ncol=2, byrow=TRUE)
  # Pass it to the layout function
  layout(grid)
  # Plot LDL vs HDL
  plot(HANES$LDL, HANES$HDL)
  # Plot GLUCOSE vs HDL
  plot(HANES$GLUCOSE, HANES$HDL)
  # Plot CHOLESTEROLTOTAL vs HDL
  plot(HANES$CHOLESTEROLTOTAL, HANES$HDL)

  # Reset the layout
  layout(1)


Tip: Resetting everytime might be too tedious. A trick is to assign the old setting to an object and reuse it when necessary:

  # Assign the old parameters to an object
  old_parameters <- par()
  # Change to new parameters
  par(col="red")
  plot(HANES$LDL, HANES$HDL)

  # Reset to old parameters
  par(old_parameters)
  # Test the original settings
  plot(HANES$LDL, HANES$HDL)

Stacking graphical elements

Stacking graphical elements is a great way of adding more information to the plots:

  # Plot A1C vs GLUCOSESI
  plot(HANES$A1C, HANES$GLUCOSESI, xlim=c(6,8), ylim=c(3,10))
  # Using linear fit model. 
  # Note: `lm()` function will return a vector of coefficients for the fit
  lm_glucose_SI <- lm(HANES$A1C ~ HANES$GLUCOSESI)
  # Stack the linear model on top of the plot with line width 2 (specified by `lwd` argument)
  abline(coef(lm_glucose_SI), lwd = 2)



Classwork/Homework: Make a plot and add elements through the functions points(), lines(), segments() and text().