## 'data.frame': 1112 obs. of 23 variables:
## $ KEY : Factor w/ 1527 levels "133370A","133370B",..: 28 43 44 53 55 70 84 90 100 107 ...
## $ GENDER : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...
## $ SPAGE : int 29 28 27 24 30 26 31 32 34 32 ...
## $ AGEGROUP : Factor w/ 3 levels "20-39","40-59",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ HSQ_1 : Factor w/ 5 levels "Excellent","Very Good",..: 2 2 2 1 1 3 1 2 1 3 ...
## $ UCREATININE : int 105 53 314 105 163 150 46 36 177 156 ...
## $ UALBUMIN : num 0.707 1 8 4 3 2 2 0.707 4 3 ...
## $ UACR : num 0.00673 2 3 4 2 ...
## $ MERCURYU : num 0.37 0.106 0.487 2.205 0.979 ...
## $ DX_DBTS : Factor w/ 3 levels "DIAB","DIAB NO_DX",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ A1C : num 5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 5.2 ...
## $ CADMIUM : num 0.2412 0.1732 0.0644 0.0929 0.1202 ...
## $ LEAD : num 1.454 1.019 0.863 1.243 0.612 ...
## $ MERCURYTOTALBLOOD: num 2.34 2.57 1.32 14.66 2.13 ...
## $ HDL : int 42 51 42 61 52 50 57 56 42 44 ...
## $ CHOLESTEROLTOTAL : int 184 157 145 206 120 155 156 235 156 120 ...
## $ GLUCOSESI : num 4.61 4.77 5.16 5 5.11 ...
## $ CREATININESI : num 74.3 73 80 84.9 66 ...
## $ CREATININE : num 0.84 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 1.09 ...
## $ TRIGLYCERIDE : int 156 43 108 65 51 29 31 220 82 35 ...
## $ GLUCOSE : int 83 86 93 90 92 85 72 87 96 92 ...
## $ COTININE : num 31.5918 0.0635 0.035 0.0514 0.035 ...
## $ LDLESTIMATE : int 111 97 81 132 58 99 93 135 98 69 ...
## - attr(*, "na.action")= 'omit' Named int 2 15 16 24 26 28 33 34 35 39 ...
## ..- attr(*, "names")= chr "2" "15" "16" "24" ...
If you want to download the full HANES dataset, use the download.file() function:
download.file(url = URL, destfile = "HANES.original.csv")
Comma Separated Values (CSV) files are among the most common file formats. We can read a .csv file directly into a tibble through the function readr::read_csv(), which is a modified version of the base function read.csv().
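As a sketch (my own example, not from the original notes), the downloaded file could then be read straight into a tibble with readr:
# Read the previously downloaded csv file into a tibble
# (assumes the download.file() call above has been run)
library(readr)
HANES_tibble <- read_csv("HANES.original.csv")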
# Load the package RCurl for getURL()
library(RCurl)
# Read the HANES original csv file
URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
URL_text_2 <- "/master/Datasets/HANES/HANES.original.csv"
# Paste it to constitute a single URL
URL <- paste(URL_text_1, URL_text_2, sep="")
HANES_original <- read.csv(text = getURL(URL), stringsAsFactors = FALSE)
When the data contains character vectors, parsing of these vectors can be done through the parse_*() functions:
# Parse a logical vector
str(parse_logical(c("TRUE", "FALSE", "NA")))
## logi [1:3] TRUE FALSE NA
# Try parsing as integers and flag any non-integer values
# DMQ_5 (In what country was SP born?) is a character (factor) variable
str(parse_integer(HANES_original$DMQ_5))
## int [1:1527] 10 10 10 10 10 10 66 10 10 10 ...
## - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 8 obs. of 4 variables:
## ..$ row : int [1:8] 176 383 624 696 937 1040 1331 1486
## ..$ col : int [1:8] NA NA NA NA NA NA NA NA
## ..$ expected: chr [1:8] "an integer" "an integer" "an integer" "an integer" ...
## ..$ actual : chr [1:8] "R" "R" "DN" "R" ...
# Display parse_* findings/problems as a table
problems(parse_integer(HANES_original$DMQ_5))
## # A tibble: 8 x 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 176 NA an integer R
## 2 383 NA an integer R
## 3 624 NA an integer DN
## 4 696 NA an integer R
## 5 937 NA an integer DN
## 6 1040 NA an integer R
## 7 1331 NA an integer DN
## 8 1486 NA an integer DN
# Parse a date vector
str(parse_date(c("2010-01-01", "1979-10-14")))
## Date[1:2], format: "2010-01-01" "1979-10-14"
# Regroup several cases under new categories (using the forcats package)
fct_x <- as.character(3:10)
fct_x## [1] "3" "4" "5" "6" "7" "8" "9" "10"
fct_x <- factor(fct_x)
fct_x <- fct_collapse(fct_x, '7 or more' = as.character(7:10))
fct_x
## [1] 3 4 5 6 7 or more 7 or more 7 or more
## [8] 7 or more
## Levels: 7 or more 3 4 5 6
Parsing different number formats is important because numbers are written in various styles:
People write numbers differently in different parts of the world. For example, some countries use . between the integer and fractional parts of a real number, while others use ,.
Numbers are often surrounded by other characters that provide some context, like “$1000” or “10%”.
Numbers often contain “grouping” characters to make them easier to read, like “1,000,000”, and these grouping characters vary around the world.
The first problem is solved by the locale argument of the parse_double() function, the second by the parse_number() function, and the third by combining locale and parse_number().
# Parse a number of the format 1,63 where `,` indicates the decimal mark
parse_double("1,63", locale = locale(decimal_mark = ","))
# Just parse the number for $120 and 60%
parse_number("$120")
parse_number("60%")
# Parse the grouping as used in Europe
parse_number("123.834.297", locale = locale(grouping_mark = "."))
# or as used in Switzerland
parse_number("123'834'297", locale = locale(grouping_mark = "'"))
# Parse grouping and decimal marks as used in Colombia and other Latin American countries
parse_number("123.834,297", locale = locale(grouping_mark = ".", decimal_mark = ","))
## [1] 1.63
## [1] 120
## [1] 60
## [1] 123834297
## [1] 123834297
## [1] 123834.3
Parsing date and time is very important in health related data sets. For example, the MIMIC3 data has tons of data sets with time information.
There are three main functions for parsing dates and times.
parse_datetime() expects an ISO8601 date-time. ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.
parse_date() expects a four digit year, a - or /, the month, a - or /, then the day
parse_time() expects the hour, :, minutes, optionally : and seconds, and an optional am/pm specifier
# Parse datetime
parse_datetime("2010-10-01T2010")## [1] "2010-10-01 20:10:00 UTC"
# Parse date
parse_date("2010-10-01")## [1] "2010-10-01"
# Parse time
library(hms)
parse_time("01:10 am")## 01:10:00
If the defaults don't work, there are several format options:
Year: %Y (4 digits), %y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.
Month: %m (2 digits), %b (abbreviated name, like “Jan”), %B (full name, “January”).
Day: %d (2 digits), %e (optional leading space).
Time: %H 0-23 hour, %I 0-12, must be used with %p, %p AM/PM indicator, %M minutes, %S integer seconds, %OS real seconds, %Z Time zone (as name, e.g. America/Chicago).
Non-digits: %. skips one non-digit character, %* skips any number of non-digits.
As an example: (all outputs are in the format yyyy-mm-dd)
# Parse date in month/day/year format
parse_date("01/02/15", "%m/%d/%y")
# Parse date in day/month/year format
parse_date("01/02/2015", "%d/%m/%Y")
# Parse date in year/month/date format
parse_date("01/02/15", "%y/%m/%d")
# We can even set the `locale` option to specify different built in languages for the date
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))## [1] "2015-01-02"
## [1] "2015-02-01"
## [1] "2001-02-15"
## [1] "2015-01-01"
Classwork/Homework: Read sections 11.4 and 11.5 (Parsing a file and Writing to a file) in the book R for Data Science.
Dealing with dates and times can be a very tough job. For example, consider the MIMIC3 admissions table which has several time variables.
Note: MIMIC3 says “dates were shifted into the future by a random offset for each individual patient in a consistent manner to preserve intervals, resulting in stays which occur sometime between the years 2100 and 2200.” so don’t freak out looking at the dates.
If we import the dataset without stringsAsFactors = FALSE, the time columns are read as factors and the parse_date() function fails to process them.
# Load the package RCurl
library(RCurl)
# Import the admissions data set in MIMIC3 from GitHub;
URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
URL_text_2 <- "/master/Datasets/MIMIC3/admissions.csv"
URL <- paste(URL_text_1,URL_text_2, sep="")
MIMIC3_ADM <- read.csv(text=getURL(URL), stringsAsFactors = FALSE)
# Observe the structure
str(MIMIC3_ADM)
# Parse the admission time using `parse_date()` function
head(parse_date(MIMIC3_ADM$admittime, "%Y-%m-%d"))
## 'data.frame': 5770 obs. of 19 variables:
## $ row_id : int 83 84 85 86 87 88 89 90 91 92 ...
## $ subject_id : int 82 83 84 84 85 85 86 87 88 89 ...
## $ hadm_id : int 110641 158569 120969 166401 116630 112077 190243 190659 123010 188646 ...
## $ admittime : chr "2150-06-24" "2142-04-01" "2196-02-02" "2196-04-14" ...
## $ dischtime : chr "2150-06-29 15:00:00" "2142-04-08 14:46:00" "2196-02-04 17:48:00" "2196-04-17 13:42:00" ...
## $ deathtime : chr NA NA NA "2196-04-17 13:42:00" ...
## $ admission_type : chr "NEWBORN" "URGENT" "ELECTIVE" "EMERGENCY" ...
## $ admission_location : chr "PHYS REFERRAL/NORMAL DELI" "TRANSFER FROM HOSP/EXTRAM" "PHYS REFERRAL/NORMAL DELI" "EMERGENCY ROOM ADMIT" ...
## $ discharge_location : chr "HOME" "HOME HEALTH CARE" "HOME" "DEAD/EXPIRED" ...
## $ insurance : chr "Private" "Medicare" "Private" "Private" ...
## $ language : chr NA NA NA NA ...
## $ religion : chr "UNOBTAINABLE" "UNOBTAINABLE" "OTHER" "OTHER" ...
## $ marital_status : chr NA "MARRIED" "MARRIED" "MARRIED" ...
## $ ethnicity : chr "OTHER" "UNKNOWN/NOT SPECIFIED" "WHITE" "WHITE" ...
## $ edregtime : chr NA NA NA "2196-04-13 22:23:00" ...
## $ edouttime : chr NA NA NA "2196-04-14 04:31:00" ...
## $ diagnosis : chr "NEWBORN" "CAROTID STENOSIS" "MEDIAL PARIETAL TUMOR/SDA" "GLIOBLASTOMA,NAUSEA" ...
## $ hospital_expire_flag: int 0 0 0 1 0 0 0 0 0 0 ...
## $ has_chartevents_data: int 1 1 0 1 1 1 1 1 1 1 ...
## [1] "2150-06-24" "2142-04-01" "2196-02-02" "2196-04-14" "2162-03-02"
## [6] "2167-07-25"
Another approach is to use the helpers provided by the library lubridate.
They automatically work out the format once we specify the order of the components.
To use them, we have to identify the order in which year, month, and day appear in our data, then arrange “y”, “m”, and “d” in the same order.
That gives the name of the lubridate function that will parse our date.
To use this, we need to load the lubridate library.
# Load the lubridate library
library(lubridate)
# Since admission time has the format year/month/day, we use ymd()
MIMIC3_ADM$admittime <- ymd(MIMIC3_ADM$admittime)
# View the format
str(MIMIC3_ADM)
# Print the first few dates
head(MIMIC3_ADM$admittime)
## 'data.frame': 5770 obs. of 19 variables:
## $ row_id : int 83 84 85 86 87 88 89 90 91 92 ...
## $ subject_id : int 82 83 84 84 85 85 86 87 88 89 ...
## $ hadm_id : int 110641 158569 120969 166401 116630 112077 190243 190659 123010 188646 ...
## $ admittime : Date, format: "2150-06-24" "2142-04-01" ...
## $ dischtime : chr "2150-06-29 15:00:00" "2142-04-08 14:46:00" "2196-02-04 17:48:00" "2196-04-17 13:42:00" ...
## $ deathtime : chr NA NA NA "2196-04-17 13:42:00" ...
## $ admission_type : chr "NEWBORN" "URGENT" "ELECTIVE" "EMERGENCY" ...
## $ admission_location : chr "PHYS REFERRAL/NORMAL DELI" "TRANSFER FROM HOSP/EXTRAM" "PHYS REFERRAL/NORMAL DELI" "EMERGENCY ROOM ADMIT" ...
## $ discharge_location : chr "HOME" "HOME HEALTH CARE" "HOME" "DEAD/EXPIRED" ...
## $ insurance : chr "Private" "Medicare" "Private" "Private" ...
## $ language : chr NA NA NA NA ...
## $ religion : chr "UNOBTAINABLE" "UNOBTAINABLE" "OTHER" "OTHER" ...
## $ marital_status : chr NA "MARRIED" "MARRIED" "MARRIED" ...
## $ ethnicity : chr "OTHER" "UNKNOWN/NOT SPECIFIED" "WHITE" "WHITE" ...
## $ edregtime : chr NA NA NA "2196-04-13 22:23:00" ...
## $ edouttime : chr NA NA NA "2196-04-14 04:31:00" ...
## $ diagnosis : chr "NEWBORN" "CAROTID STENOSIS" "MEDIAL PARIETAL TUMOR/SDA" "GLIOBLASTOMA,NAUSEA" ...
## $ hospital_expire_flag: int 0 0 0 1 0 0 0 0 0 0 ...
## $ has_chartevents_data: int 1 1 0 1 1 1 1 1 1 1 ...
## [1] "2150-06-24" "2142-04-01" "2196-02-02" "2196-04-14" "2162-03-02"
## [6] "2167-07-25"
We also see that the discharge time has both date and time components, so we can use the parse_datetime() function to convert it from character to a date-time.
# Load the lubridate library
library(lubridate)
# Since discharge time has both date and time attributes we may parse using datetime function
MIMIC3_ADM$dischtime <- parse_datetime(MIMIC3_ADM$dischtime)
# View the format
str(MIMIC3_ADM)
# Print the first few dates and times
head(MIMIC3_ADM$dischtime)
## 'data.frame': 5770 obs. of 19 variables:
## $ row_id : int 83 84 85 86 87 88 89 90 91 92 ...
## $ subject_id : int 82 83 84 84 85 85 86 87 88 89 ...
## $ hadm_id : int 110641 158569 120969 166401 116630 112077 190243 190659 123010 188646 ...
## $ admittime : Date, format: "2150-06-24" "2142-04-01" ...
## $ dischtime : POSIXct, format: "2150-06-29 15:00:00" "2142-04-08 14:46:00" ...
## $ deathtime : chr NA NA NA "2196-04-17 13:42:00" ...
## $ admission_type : chr "NEWBORN" "URGENT" "ELECTIVE" "EMERGENCY" ...
## $ admission_location : chr "PHYS REFERRAL/NORMAL DELI" "TRANSFER FROM HOSP/EXTRAM" "PHYS REFERRAL/NORMAL DELI" "EMERGENCY ROOM ADMIT" ...
## $ discharge_location : chr "HOME" "HOME HEALTH CARE" "HOME" "DEAD/EXPIRED" ...
## $ insurance : chr "Private" "Medicare" "Private" "Private" ...
## $ language : chr NA NA NA NA ...
## $ religion : chr "UNOBTAINABLE" "UNOBTAINABLE" "OTHER" "OTHER" ...
## $ marital_status : chr NA "MARRIED" "MARRIED" "MARRIED" ...
## $ ethnicity : chr "OTHER" "UNKNOWN/NOT SPECIFIED" "WHITE" "WHITE" ...
## $ edregtime : chr NA NA NA "2196-04-13 22:23:00" ...
## $ edouttime : chr NA NA NA "2196-04-14 04:31:00" ...
## $ diagnosis : chr "NEWBORN" "CAROTID STENOSIS" "MEDIAL PARIETAL TUMOR/SDA" "GLIOBLASTOMA,NAUSEA" ...
## $ hospital_expire_flag: int 0 0 0 1 0 0 0 0 0 0 ...
## $ has_chartevents_data: int 1 1 0 1 1 1 1 1 1 1 ...
## [1] "2150-06-29 15:00:00 UTC" "2142-04-08 14:46:00 UTC"
## [3] "2196-02-04 17:48:00 UTC" "2196-04-17 13:42:00 UTC"
## [5] "2162-03-10 13:15:00 UTC" "2167-07-30 15:24:00 UTC"
Sometimes we may just want to work with the dates when a variable is in date-time format. We can use the as_date() function to extract just the date.
# Load the lubridate library
library(lubridate)
# Since discharge time has both date and time attributes we use as_date() to extract the date
MIMIC3_ADM$dischtime <- as_date(MIMIC3_ADM$dischtime)
# We can then select only the variables of interest
MIMIC3_adm_dis <- select(MIMIC3_ADM, hadm_id, admittime, dischtime)
as.tibble(MIMIC3_adm_dis)
## # A tibble: 5,770 x 3
## hadm_id admittime dischtime
## <int> <date> <date>
## 1 110641 2150-06-24 2150-06-29
## 2 158569 2142-04-01 2142-04-08
## 3 120969 2196-02-02 2196-02-04
## 4 166401 2196-04-14 2196-04-17
## 5 116630 2162-03-02 2162-03-10
## 6 112077 2167-07-25 2167-07-30
## 7 190243 2146-04-06 2146-04-10
## 8 190659 2191-02-25 2191-04-25
## 9 123010 2111-08-29 2111-09-03
## 10 188646 2185-06-17 2185-06-21
## # … with 5,760 more rows
Classwork/Homework: Read section 16.3 Date-time components in the book R for Data Science.
As we noted above, the MIMIC3 dates are shifted into the future (2100-2200) by a random offset.
To make sense of them, it may be worth converting the dates to an earlier range, such as 1900-2000. To do this, we have to shift the dates systematically.
We can make use of time spans.
Time spans come in three important classes:
durations, which represent an exact number of seconds.
periods, which represent human units like weeks and months.
intervals, which represent a starting and ending point.
In R, when we subtract two dates, we get a difftime object:
# Load the lubridate library
library(lubridate)
# Subtract discharge time minus admission time
MIMIC3_adm_dis <- MIMIC3_adm_dis %>%
mutate(time_spent = ymd(MIMIC3_adm_dis$dischtime) - ymd(MIMIC3_adm_dis$admittime))
# And display as tibble
as.tibble(MIMIC3_adm_dis)
## # A tibble: 5,770 x 4
## hadm_id admittime dischtime time_spent
## <int> <date> <date> <drtn>
## 1 110641 2150-06-24 2150-06-29 5 days
## 2 158569 2142-04-01 2142-04-08 7 days
## 3 120969 2196-02-02 2196-02-04 2 days
## 4 166401 2196-04-14 2196-04-17 3 days
## 5 116630 2162-03-02 2162-03-10 8 days
## 6 112077 2167-07-25 2167-07-30 5 days
## 7 190243 2146-04-06 2146-04-10 4 days
## 8 190659 2191-02-25 2191-04-25 59 days
## 9 123010 2111-08-29 2111-09-03 5 days
## 10 188646 2185-06-17 2185-06-21 4 days
## # … with 5,760 more rows
While this may be useful to find features such as the distribution of the time spent by the patients in hospital:
# And plot the frequency distribution of the time spent by the patients
MIMIC3_adm_dis %>%
ggplot(aes(time_spent)) +
geom_freqpoly(binwidth = 1)
A difftime can be a little painful to work with when we are interested in durations.
The library lubridate provides an alternative which always uses seconds: the duration.
We can convert our variable into duration by using the function as.duration().
# Convert the time spent as duration (in seconds)
MIMIC3_adm_dis <- MIMIC3_adm_dis %>%
mutate(time_spent_in_seconds = as.duration(time_spent))
as.tibble(MIMIC3_adm_dis)
## # A tibble: 5,770 x 5
## hadm_id admittime dischtime time_spent time_spent_in_seconds
## <int> <date> <date> <drtn> <Duration>
## 1 110641 2150-06-24 2150-06-29 5 days 432000s (~5 days)
## 2 158569 2142-04-01 2142-04-08 7 days 604800s (~1 weeks)
## 3 120969 2196-02-02 2196-02-04 2 days 172800s (~2 days)
## 4 166401 2196-04-14 2196-04-17 3 days 259200s (~3 days)
## 5 116630 2162-03-02 2162-03-10 8 days 691200s (~1.14 weeks)
## 6 112077 2167-07-25 2167-07-30 5 days 432000s (~5 days)
## 7 190243 2146-04-06 2146-04-10 4 days 345600s (~4 days)
## 8 190659 2191-02-25 2191-04-25 59 days 5097600s (~8.43 weeks)
## 9 123010 2111-08-29 2111-09-03 5 days 432000s (~5 days)
## 10 188646 2185-06-17 2185-06-21 4 days 345600s (~4 days)
## # … with 5,760 more rows
Durations come with a bunch of convenient constructors:
# List the duration of 15s
dseconds(15)
# List the duration in 10 minutes
dminutes(10)
# List the duration in 12 hrs and 24 hours
dhours(c(12, 24))
# List the duration from 0 to 5 days
ddays(0:5)
# List the duration in 3 weeks
dweeks(3)
# List the duration in a year
dyears(1)
## [1] "15s"
## [1] "600s (~10 minutes)"
## [1] "43200s (~12 hours)" "86400s (~1 days)"
## [1] "0s" "86400s (~1 days)" "172800s (~2 days)"
## [4] "259200s (~3 days)" "345600s (~4 days)" "432000s (~5 days)"
## [1] "1814400s (~3 weeks)"
## [1] "31536000s (~52.14 weeks)"
These constructors can help shift our dates by a 200-year offset, as sketched below. Alternatively, we can use periods.
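As a sketch (my own example, not run in the original notes), the shift could be done with a duration; durations count exact seconds, so leap days make this slightly different from subtracting 200 calendar years, which is why the notes use periods for this shift further below.
# Shift the admission dates back by a duration of roughly 200 years
head(MIMIC3_adm_dis$admittime - dyears(200))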
Periods are time spans that don't have a fixed length in seconds; instead they work with "human" units, like days and months.
That allows them to work in a more intuitive way. Like durations, periods can be created with a number of friendly constructor functions.
# Make 15s
seconds(15)
# Make 10m
minutes(10)
# Make 12 hrs and 24 hrs
hours(c(12, 24))
# Make 7 days
days(7)
# Make 1-6 months
months(1:6)
# Make 3 weeks
weeks(3)
# Make 1 year
years(1)
## [1] "15S"
## [1] "10M 0S"
## [1] "12H 0M 0S" "24H 0M 0S"
## [1] "7d 0H 0M 0S"
## [1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S"
## [5] "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"
## [1] "21d 0H 0M 0S"
## [1] "1y 0m 0d 0H 0M 0S"
You can add and multiply periods, and add them to dates. Thus we can subtract a 200-year period from our admission and discharge times:
# Makes times compatible with our times subtracting 200 years
MIMIC3_adm_dis <- MIMIC3_adm_dis %>%
mutate(admission_time_minus_200 = ymd(admittime)-years(200),
dischtime_time_minus_200 = ymd(dischtime)-years(200)) %>%
select(admission_time_minus_200, dischtime_time_minus_200, everything())
as.tibble(MIMIC3_adm_dis)
## # A tibble: 5,770 x 7
## admission_time_… dischtime_time_… time_spent hadm_id admittime
## <date> <date> <drtn> <int> <date>
## 1 1950-06-24 1950-06-29 5 days 110641 2150-06-24
## 2 1942-04-01 1942-04-08 7 days 158569 2142-04-01
## 3 1996-02-02 1996-02-04 2 days 120969 2196-02-02
## 4 1996-04-14 1996-04-17 3 days 166401 2196-04-14
## 5 1962-03-02 1962-03-10 8 days 116630 2162-03-02
## 6 1967-07-25 1967-07-30 5 days 112077 2167-07-25
## 7 1946-04-06 1946-04-10 4 days 190243 2146-04-06
## 8 1991-02-25 1991-04-25 59 days 190659 2191-02-25
## 9 1911-08-29 1911-09-03 5 days 123010 2111-08-29
## 10 1985-06-17 1985-06-21 4 days 188646 2185-06-17
## # … with 5,760 more rows, and 2 more variables: dischtime <date>,
## # time_spent_in_seconds <Duration>
What should years(1) / days(1) return?
Is it 365 or 366? Some years are leap years, and periods are human-interpretable units, unlike durations.
R will throw a warning if we try to perform this operation.
# Find years(1) / days(1)
years(1) / days(1)
## estimate only: convert to intervals for accuracy
## [1] 365.25
Thus we have to use an interval.
An interval is a duration with an explicit starting point: that makes it precise so you can determine exactly how long it is:
# Find years(1) / days(1) by converting to interval
next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)
## [1] 366
# And the same calculation for the following year
in2years <- today() + years(2)
(next_year %--% in2years) / ddays(1)
## [1] 365
Intervals are defined by using the %--% operator.
Here is another example:
Below we define two dates in the US Eastern time zone. The start day is March 11, 2017 at 5:21 AM. The end day is March 12, 2017 at the same time. Note that Daylight Savings began on March 12 at 2:00 AM.
start <- mdy_hm("3-11-2017 5:21", tz = "US/Eastern")
end <- mdy_hm("3-12-2017 5:21", tz = "US/Eastern")
# Since we’re dealing with elapsed time between two dates,
# let’s start with Intervals. We can define an Interval using the %--% operator.
time.interval <- start %--% end
time.interval
## [1] 2017-03-11 05:21:00 EST--2017-03-12 05:21:00 EDT
Notice how Intervals print: they show the beginning date and the end date. Also notice how the time zone changes from EST to EDT, indicating that Daylight Savings has started.
str(time.interval)
## Formal class 'Interval' [package "lubridate"] with 3 slots
## ..@ .Data: num 82800
## ..@ start: POSIXct[1:1], format: "2017-03-11 05:21:00"
## ..@ tzone: chr "US/Eastern"
If we look at the structure of an Interval object we see it contains elapsed time in seconds, 82800, and the start date.
To create a Duration between these two dates, we can use the as.duration function.
time.duration <- as.duration(time.interval)
time.duration
## [1] "82800s (~23 hours)"
Notice a Duration object prints the elapsed time in seconds as well as something a little friendlier to read, in this case hours. Because Daylight Savings went into effect at 2:00 AM during the interval, an hour was skipped. Thus the duration between these two time points is only 23 hours.
If we look at the structure of a Duration object we see it just contains elapsed time in seconds.
str(time.duration)
## Formal class 'Duration' [package "lubridate"] with 1 slot
## ..@ .Data: num 82800
We can create a Period from an Interval using the as.period function.
time.period <- as.period(time.interval)
time.period
## [1] "1d 0H 0M 0S"
A Period prints elapsed time as integers in the form of years, months, weeks, days and so on. Notice this Period is 1 day long. While only 23 hours have technically elapsed since the start date, according to our clock one day has elapsed.
Classwork/Homework: Read section 16.4 Time spans in the book R for Data Science.
The same kind of data can be present in multiple columns that together denote a single variable.
To handle such situations we can use the gather() function from the tidyr package. The purpose of this function is to aggregate such data into a new, single pair of variables. To accomplish this, we need three parameters:
The set of columns that represent sub-classification/sub-variables.
The name of a single variable that represents a larger class. This is called a key.
The values that are present in the cells corresponding to the sub-variables.
Here is an example taken from the gather() help page:
stocks <- tibble(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
stocks
## # A tibble: 10 x 4
## time X Y Z
## <date> <dbl> <dbl> <dbl>
## 1 2009-01-01 0.139 2.04 -0.0138
## 2 2009-01-02 -0.808 2.11 -6.38
## 3 2009-01-03 0.337 -0.503 4.15
## 4 2009-01-04 -0.182 0.125 -1.38
## 5 2009-01-05 -2.34 -1.93 1.83
## 6 2009-01-06 -0.0377 1.49 3.29
## 7 2009-01-07 -0.723 -0.174 10.3
## 8 2009-01-08 -0.235 0.714 -4.15
## 9 2009-01-09 0.300 2.76 6.66
## 10 2009-01-10 -0.934 1.46 -4.40
# Gather all stocks data under one column:'price' and an added new factor:'stock'
# '-time' means to drop the 'time' variable from the list of variables to be processed.
# See the ... option from the help page.
stocks %>% gather(stock, price, -time) %>% mutate(stock = factor(stock)) %>%
print() -> meltedStocks
## # A tibble: 30 x 3
## time stock price
## <date> <fct> <dbl>
## 1 2009-01-01 X 0.139
## 2 2009-01-02 X -0.808
## 3 2009-01-03 X 0.337
## 4 2009-01-04 X -0.182
## 5 2009-01-05 X -2.34
## 6 2009-01-06 X -0.0377
## 7 2009-01-07 X -0.723
## 8 2009-01-08 X -0.235
## 9 2009-01-09 X 0.300
## 10 2009-01-10 X -0.934
## # … with 20 more rows
# Show that now 'stock' serves as a factor
levels(meltedStocks$stock)
## [1] "X" "Y" "Z"
Now, let's consider the MIMIC3 admissions table. There are two variables - edregtime and edouttime (emergency department registration time and emergency department out time, respectively) - that represent one larger variable: the time stamp in the emergency department.
# Select subject_id, hadm_id, edregtime and edouttime variables
MIMIC3_ADM_tibble <- MIMIC3_ADM %>% select(subject_id, hadm_id, edregtime, edouttime) %>% na.omit()
as.tibble(MIMIC3_ADM_tibble)
## # A tibble: 3,065 x 4
## subject_id hadm_id edregtime edouttime
## <int> <int> <chr> <chr>
## 1 84 166401 2196-04-13 22:23:00 2196-04-14 04:31:00
## 2 85 112077 2167-07-25 16:37:00 2167-07-25 20:46:00
## 3 88 123010 2111-08-29 01:44:00 2111-08-29 02:28:00
## 4 91 121205 2177-04-22 21:02:00 2177-04-23 04:03:00
## 5 94 183686 2176-02-25 10:35:00 2176-02-25 18:14:00
## 6 94 140037 2176-09-02 09:54:00 2176-09-02 17:56:00
## 7 95 160891 2157-12-25 12:24:00 2157-12-25 16:56:00
## 8 96 170324 2156-03-31 13:54:00 2156-03-31 14:45:00
## 9 101 175533 2196-09-26 12:50:00 2196-09-26 18:37:00
## 10 103 133550 2144-08-30 17:12:00 2144-08-31 00:39:00
## # … with 3,055 more rows
Let’s see how to use gather() using the MIMIC3 data set:
The set of columns that represent sub-classification are the columns: edregtime and edouttime
The key here is the ‘time stamp in emergency’. Let’s call this variable EdTimeStamp.
The values here are the time stamps when registering and when checking out of the emergency department. Let's call this variable time.
Once we have these parameters, we can gather data:
# Gather edregtime and edouttime, to form variable "EdTimeStamp" with
# value "time"
MIMIC3_ADM_gather <- MIMIC3_ADM_tibble %>%
gather("edregtime", "edouttime", key = "EdTimeStamp", value = "time")
# Showing table head
as.tibble(head(MIMIC3_ADM_gather, n=4))
## # A tibble: 4 x 4
## subject_id hadm_id EdTimeStamp time
## <int> <int> <chr> <chr>
## 1 84 166401 edregtime 2196-04-13 22:23:00
## 2 85 112077 edregtime 2167-07-25 16:37:00
## 3 88 123010 edregtime 2111-08-29 01:44:00
## 4 91 121205 edregtime 2177-04-22 21:02:00
# Showing table tail
as.tibble(tail(MIMIC3_ADM_gather, n=4))
## # A tibble: 4 x 4
## subject_id hadm_id EdTimeStamp time
## <int> <int> <chr> <chr>
## 1 90521 195402 edouttime 2197-07-08 13:45:00
## 2 90535 151788 edouttime 2137-11-19 22:12:00
## 3 90538 105088 edouttime 2146-03-10 11:47:00
## 4 99781 147562 edouttime 2133-08-01 20:38:00
# An example of the duplication in the number of rows for each subject_id (here for id 88)
MIMIC3_ADM_gather %>% filter(subject_id == 88)
## subject_id hadm_id EdTimeStamp time
## 1 88 123010 edregtime 2111-08-29 01:44:00
## 2 88 123010 edouttime 2111-08-29 02:28:00
Classwork/Homework: Find the number of times each patient was registered in emergency department.
Spreading is the opposite of gathering.
We use it when an observation is scattered across multiple rows.
For example, in the above data set, the patient denoted by subject_id can have several hospital admission ids hadm_id, depending on how many times they were admitted to the hospital.
To spread this data, we first analyse the representation in a similar way to gather(). This time, however, we only need two parameters:
The column that contains the key column. Here, it’s subject_id.
The column that contains multiple values for the key column, the values column. Here it’s hadm_id.
Once we have these two parameters we can spread the data:
# Spread each patient (subject_id) for the hospital admission id - hadm_id
MIMIC3_ADM_spread <- MIMIC3_ADM_tibble %>% head() %>%
spread(key = subject_id, value = hadm_id) %>%
select(`edregtime`, `edouttime`, '94', everything())
as.tibble(MIMIC3_ADM_spread)
## # A tibble: 6 x 7
## edregtime edouttime `94` `84` `85` `88` `91`
## <chr> <chr> <int> <int> <int> <int> <int>
## 1 2111-08-29 01:44:00 2111-08-29 02:28:… NA NA NA 123010 NA
## 2 2167-07-25 16:37:00 2167-07-25 20:46:… NA NA 112077 NA NA
## 3 2176-02-25 10:35:00 2176-02-25 18:14:… 183686 NA NA NA NA
## 4 2176-09-02 09:54:00 2176-09-02 17:56:… 140037 NA NA NA NA
## 5 2177-04-22 21:02:00 2177-04-23 04:03:… NA NA NA NA 121205
## 6 2196-04-13 22:23:00 2196-04-14 04:31:… NA 166401 NA NA NA
We immediately see that patient no. 94 had two admissions in the hospital as emergency visits.
separate() pulls apart one column into multiple columns, splitting wherever a separator character appears.
For example, consider the admission_location variable. It is made of 9 possible values:
# Separate admission_location information
MIMIC3_ADM_LOC_SEP <- MIMIC3_ADM %>% select(admission_location) %>%
separate(admission_location, into = c("Info1", "Info2"), sep = "/", convert = FALSE)
as.tibble(MIMIC3_ADM_LOC_SEP)
## # A tibble: 5,770 x 2
## Info1 Info2
## <chr> <chr>
## 1 PHYS REFERRAL NORMAL DELI
## 2 TRANSFER FROM HOSP EXTRAM
## 3 PHYS REFERRAL NORMAL DELI
## 4 EMERGENCY ROOM ADMIT <NA>
## 5 CLINIC REFERRAL PREMATURE
## 6 CLINIC REFERRAL PREMATURE
## 7 PHYS REFERRAL NORMAL DELI
## 8 PHYS REFERRAL NORMAL DELI
## 9 EMERGENCY ROOM ADMIT <NA>
## 10 PHYS REFERRAL NORMAL DELI
## # … with 5,760 more rows
By default separate() returns character vectors even when we separate numerical variables. To retain the numerical type, we have to set the convert = TRUE argument.
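As a sketch with a made-up column (my own example, not part of the MIMIC3 data), compare the result with and without convert:
# Without convert the pieces stay character; with convert = TRUE they become integers
ratio_df <- tibble(ratio = c("3/4", "5/8"))
ratio_df %>% separate(ratio, into = c("numerator", "denominator"), sep = "/")
ratio_df %>% separate(ratio, into = c("numerator", "denominator"), sep = "/", convert = TRUE)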
unite() is the inverse of separate(): it combines multiple columns into a single column.
For example, in the above data set we can unite the hospital admission id hadm_id and the admission type admission_type to tie a reason to why the patient was admitted. By default unite() uses an underscore as the separator.
This can be overridden with the sep argument, as in the example below.
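As a sketch (my own, not run in the original notes), the default underscore separator would produce values like "110641_NEWBORN":
# Unite with the default "_" separator
MIMIC3_ADM %>%
  select(hadm_id, admission_type) %>%
  unite(reason, hadm_id, admission_type) %>%
  head()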
# Unite hadm_id and admission_type
MIMIC3_ADM_LOC_UNITE <- MIMIC3_ADM %>% select(subject_id, hadm_id, admission_type) %>%
unite(reason, hadm_id, admission_type, sep=":")
as.tibble(MIMIC3_ADM_LOC_UNITE)
## # A tibble: 5,770 x 2
## subject_id reason
## <int> <chr>
## 1 82 110641:NEWBORN
## 2 83 158569:URGENT
## 3 84 120969:ELECTIVE
## 4 84 166401:EMERGENCY
## 5 85 116630:EMERGENCY
## 6 85 112077:EMERGENCY
## 7 86 190243:ELECTIVE
## 8 87 190659:NEWBORN
## 9 88 123010:EMERGENCY
## 10 89 188646:NEWBORN
## # … with 5,760 more rows
Missing values can be explicit (flagged with NA) or implicit (simply not present in the data).
Consider this data:
# Make a tibble reporting the data on people affected with influenza
influenza <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
no_sick = c(800, 450, 377, NA, 900, 1300, 1257)
)
influenza
## # A tibble: 7 x 3
## year qtr no_sick
## <dbl> <dbl> <dbl>
## 1 2015 1 800
## 2 2015 2 450
## 3 2015 3 377
## 4 2015 4 NA
## 5 2016 2 900
## 6 2016 3 1300
## 7 2016 4 1257
There are two missing values in this dataset:
The number of sick people for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be contains NA.
The number of sick people for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
There are several ways we can deal with the missing values.
# Spread the data to make the implicit absence explicit
influenza %>%
spread(year, no_sick)
## # A tibble: 4 x 3
## qtr `2015` `2016`
## <dbl> <dbl> <dbl>
## 1 1 800 NA
## 2 2 450 900
## 3 3 377 1300
## 4 4 NA 1257
We can use na.rm = TRUE in gather() to turn explicit missing values implicit:
# Spread the data to make the implicit absence explicit and
# gather with na.rm = TRUE to turn explicit missing values implicit
influenza %>%
spread(year, no_sick) %>%
gather(year, no_sick, `2015`:`2016`, na.rm = TRUE)
## # A tibble: 6 x 3
## qtr year no_sick
## <dbl> <chr> <dbl>
## 1 1 2015 800
## 2 2 2015 450
## 3 3 2015 377
## 4 2 2016 900
## 5 3 2016 1300
## 6 4 2016 1257
Another option is complete(), which takes a set of columns and finds all unique combinations. It then ensures the original dataset contains all those combinations, filling in explicit NAs where necessary:
# Complete the data by finding all the combinations and filling in NA
influenza %>%
complete(year, qtr)
## # A tibble: 8 x 3
## year qtr no_sick
## <dbl> <dbl> <dbl>
## 1 2015 1 800
## 2 2015 2 450
## 3 2015 3 377
## 4 2015 4 NA
## 5 2016 1 NA
## 6 2016 2 900
## 7 2016 3 1300
## 8 2016 4 1257
Finally, fill() takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward):
# Fill NA with the most recent value found - 377 in this case
influenza %>%
fill(no_sick)
## # A tibble: 7 x 3
## year qtr no_sick
## <dbl> <dbl> <dbl>
## 1 2015 1 800
## 2 2015 2 450
## 3 2015 3 377
## 4 2015 4 377
## 5 2016 2 900
## 6 2016 3 1300
## 7 2016 4 1257
The following case study illustrates the typical process involved in tidying up real world data.
The data comes from the 2014 World Health Organization Global Tuberculosis Report, which can be downloaded from WHO TB Data.
The data is a part of tidyverse package, so you don’t have to download it.
Querying the data gives the following:
# Query WHO TB data
who
## # A tibble: 7,240 x 60
## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 Afghan… AF AFG 1980 NA NA NA
## 2 Afghan… AF AFG 1981 NA NA NA
## 3 Afghan… AF AFG 1982 NA NA NA
## 4 Afghan… AF AFG 1983 NA NA NA
## 5 Afghan… AF AFG 1984 NA NA NA
## 6 Afghan… AF AFG 1985 NA NA NA
## 7 Afghan… AF AFG 1986 NA NA NA
## 8 Afghan… AF AFG 1987 NA NA NA
## 9 Afghan… AF AFG 1988 NA NA NA
## 10 Afghan… AF AFG 1989 NA NA NA
## # … with 7,230 more rows, and 53 more variables: new_sp_m3544 <int>,
## # new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
## # new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
## # new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
## # new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
## # new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
## # new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
## # new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
## # new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
## # new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
## # new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
## # new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
## # new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
## # new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>,
## # newrel_m1524 <int>, newrel_m2534 <int>, newrel_m3544 <int>,
## # newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>,
## # newrel_f014 <int>, newrel_f1524 <int>, newrel_f2534 <int>,
## # newrel_f3544 <int>, newrel_f4554 <int>, newrel_f5564 <int>,
## # newrel_f65 <int>
It contains redundant columns, odd variable codes, and many missing values.
The variables - country, iso2, and iso3 redundantly specify the country.
We don’t know what the variables new_sp_m014, new_ep_m014, new_ep_f014 mean.
Therefore, we look them up in the given dictionary.
Note: This saves the .csv dictionary file to your computer; you will need to open it, for example with Excel.
iso2 and iso3 are ISO standard country/territory codes. The variables that start with new denote new cases.
In particular,
The next two letters describe the type of TB:
rel stands for cases of relapse
ep stands for cases of extrapulmonary TB
sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
sp stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).
The remaining numbers give the age group. The dataset groups cases into seven age groups:
014 = 0 – 14 years old
1524 = 15 – 24 years old
2534 = 25 – 34 years old
3544 = 35 – 44 years old
4554 = 45 – 54 years old
5564 = 55 – 64 years old
65 = 65 or older
Thus we need to separate these components. But first, since each of these variables contains a count of new cases, a natural way to organize them is to gather the variables.
We can make a key variable "group" and a value variable "case-count" to reflect this gathering, removing the missing values (NA):
# Gather "new_" variables, removing NA, grouping them up and counting cases
who_gathered <- who %>%
gather(new_sp_m014:newrel_f65, key = "group", value = "case-count", na.rm = TRUE)
as.tibble(who_gathered)
## # A tibble: 76,046 x 6
## country iso2 iso3 year group `case-count`
## <chr> <chr> <chr> <int> <chr> <int>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1998 new_sp_m014 30
## 3 Afghanistan AF AFG 1999 new_sp_m014 8
## 4 Afghanistan AF AFG 2000 new_sp_m014 52
## 5 Afghanistan AF AFG 2001 new_sp_m014 129
## 6 Afghanistan AF AFG 2002 new_sp_m014 90
## 7 Afghanistan AF AFG 2003 new_sp_m014 127
## 8 Afghanistan AF AFG 2004 new_sp_m014 139
## 9 Afghanistan AF AFG 2005 new_sp_m014 151
## 10 Afghanistan AF AFG 2006 new_sp_m014 193
## # … with 76,036 more rows
Also observe that for rel there is no underscore after new, whereas for ep (and the other types) there is. We need to fix this to get consistent variable names:
# Replace newrel by new_rel
who_rel_replaced <- who_gathered %>%
mutate(group = stringr::str_replace(group, "newrel", "new_rel"))
as.tibble(who_rel_replaced)
## # A tibble: 76,046 x 6
## country iso2 iso3 year group `case-count`
## <chr> <chr> <chr> <int> <chr> <int>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1998 new_sp_m014 30
## 3 Afghanistan AF AFG 1999 new_sp_m014 8
## 4 Afghanistan AF AFG 2000 new_sp_m014 52
## 5 Afghanistan AF AFG 2001 new_sp_m014 129
## 6 Afghanistan AF AFG 2002 new_sp_m014 90
## 7 Afghanistan AF AFG 2003 new_sp_m014 127
## 8 Afghanistan AF AFG 2004 new_sp_m014 139
## 9 Afghanistan AF AFG 2005 new_sp_m014 151
## 10 Afghanistan AF AFG 2006 new_sp_m014 193
## # … with 76,036 more rows
We can now separate the variables to reflect the type of TB, the gender, and the age group. First we separate by underscore:
# First we separate by underscore
who_seperated_by_underscore <- who_rel_replaced %>%
separate(group, c("new", "type", "sexage"), sep = "_")
as.tibble(who_seperated_by_underscore)
## # A tibble: 76,046 x 8
## country iso2 iso3 year new type sexage `case-count`
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan AF AFG 1997 new sp m014 0
## 2 Afghanistan AF AFG 1998 new sp m014 30
## 3 Afghanistan AF AFG 1999 new sp m014 8
## 4 Afghanistan AF AFG 2000 new sp m014 52
## 5 Afghanistan AF AFG 2001 new sp m014 129
## 6 Afghanistan AF AFG 2002 new sp m014 90
## 7 Afghanistan AF AFG 2003 new sp m014 127
## 8 Afghanistan AF AFG 2004 new sp m014 139
## 9 Afghanistan AF AFG 2005 new sp m014 151
## 10 Afghanistan AF AFG 2006 new sp m014 193
## # … with 76,036 more rows
And then by the gender and age group, which are concatenated, so we split after the first character with the sep = 1 argument:
# Then we separate gender and age using "sep = 1"
who_seperated_by_gender_age <- who_seperated_by_underscore %>%
separate(sexage, c("sex", "age"), sep = 1)
as.tibble(who_seperated_by_gender_age)
## # A tibble: 76,046 x 9
## country iso2 iso3 year new type sex age `case-count`
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <int>
## 1 Afghanistan AF AFG 1997 new sp m 014 0
## 2 Afghanistan AF AFG 1998 new sp m 014 30
## 3 Afghanistan AF AFG 1999 new sp m 014 8
## 4 Afghanistan AF AFG 2000 new sp m 014 52
## 5 Afghanistan AF AFG 2001 new sp m 014 129
## 6 Afghanistan AF AFG 2002 new sp m 014 90
## 7 Afghanistan AF AFG 2003 new sp m 014 127
## 8 Afghanistan AF AFG 2004 new sp m 014 139
## 9 Afghanistan AF AFG 2005 new sp m 014 151
## 10 Afghanistan AF AFG 2006 new sp m 014 193
## # … with 76,036 more rows
We can then drop the unwanted variables, such as new, iso2, and iso3. After this, our data is tidy:
# Drop "new" and "iso" variables and tidy the data
tidy_who <- who_seperated_by_gender_age %>%
select(-new, -iso2, -iso3)
as.tibble(tidy_who)
## # A tibble: 76,046 x 6
## country year type sex age `case-count`
## <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1998 sp m 014 30
## 3 Afghanistan 1999 sp m 014 8
## 4 Afghanistan 2000 sp m 014 52
## 5 Afghanistan 2001 sp m 014 129
## 6 Afghanistan 2002 sp m 014 90
## 7 Afghanistan 2003 sp m 014 127
## 8 Afghanistan 2004 sp m 014 139
## 9 Afghanistan 2005 sp m 014 151
## 10 Afghanistan 2006 sp m 014 193
## # … with 76,036 more rows
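As a usage sketch (my own example, not from the original notes), the tidy form makes summaries straightforward:
# Total number of new cases per country and year in the tidied data
tidy_who %>%
  group_by(country, year) %>%
  summarise(total_cases = sum(`case-count`))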
Usually data in the real world comes in different tables.
Multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.
Combining data from multiple tables is an important part of the wrangling process. Usually these combinations are facilitated by joins.
There are three families of joins that are designed to work with relational data:
Mutating joins: these joins add new variables to one data frame from matching observations in the other table.
Filtering joins: these filter observations from one data frame based on whether or not they match an observation in the other table.
Set operations: these operations treat observations as if they were set elements.
Variables that connect two tables are called keys. There are several types of keys:
A primary key uniquely identifies an observation in its own table.
A foreign key uniquely identifies an observation in another table.
A surrogate key uniquely identifies an observation in the absence of a primary key. They can be created, for instance using mutate() and row_number().
Note: A primary key can also be a foreign key.
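As a sketch (my own example; the admissions table already has row_id as a primary key here), a surrogate key could be created like this:
# Create a surrogate key with mutate() and row_number()
MIMIC3_ADM %>%
  mutate(surrogate_key = row_number()) %>%
  select(surrogate_key, everything()) %>%
  head()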
Once we have a candidate key variable, one way to verify it is to count() it and look for entries where n is greater than one:
# Verify if hadm_id is a primary key
MIMIC3_ADM %>%
count(hadm_id) %>%
filter(n > 1)
## # A tibble: 0 x 2
## # … with 2 variables: hadm_id <int>, n <int>
# Verify if subject_id is a primary key
is_subject_id_key <- MIMIC3_ADM %>%
count(subject_id) %>%
filter(n > 1)
as.tibble(is_subject_id_key)
## # A tibble: 729 x 2
## subject_id n
## <int> <int>
## 1 84 2
## 2 85 2
## 3 94 2
## 4 103 2
## 5 105 2
## 6 107 3
## 7 109 8
## 8 426 2
## 9 433 4
## 10 434 2
## # … with 719 more rows
To describe the joins, we will work with the input events data set from MIMIC3. A meta-level description of the table is given as:
Inputs and outputs are extremely useful when studying intensive care unit patients. Inputs are any fluids which have been administered to the patient: such as oral or tube feedings or intravenous solutions containing medications.
A description of input events table is given here.
# Load the package RCurl
library(RCurl)
# Import the input events data set in MIMIC3 from GitHub;
URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
URL_text_2 <- "/master/Datasets/MIMIC3/inputevents_cv.csv"
URL <- paste(URL_text_1,URL_text_2, sep="")
MIMIC3_INE <- read.csv(text=getURL(URL))
# Observe the structure
as.tibble(MIMIC3_INE)
## # A tibble: 19,369 x 22
## row_id subject_id hadm_id icustay_id charttime itemid amount amountuom
## <int> <int> <int> <int> <fct> <int> <dbl> <fct>
## 1 50508 15239 134445 208022 2186-07-… 30056 60 ml
## 2 50509 15239 134445 208022 2186-07-… 30056 120 ml
## 3 50510 15239 134445 208022 2186-07-… 30056 120 ml
## 4 50511 15239 134445 208022 2186-07-… 30056 400 ml
## 5 50512 15239 134445 208022 2186-07-… 30056 240 ml
## 6 50513 15239 134445 208022 2186-07-… 30056 240 ml
## 7 50514 15239 134445 208022 2186-07-… 30056 100 ml
## 8 50515 15239 134445 208022 2186-07-… 30056 100 ml
## 9 50516 15239 134445 208022 2186-07-… 30056 60 ml
## 10 50517 15239 134445 208022 2186-07-… 30056 100 ml
## # … with 19,359 more rows, and 14 more variables: rate <dbl>,
## # rateuom <fct>, storetime <fct>, cgid <int>, orderid <int>,
## # linkorderid <int>, stopped <fct>, newbottle <int>,
## # originalamount <dbl>, originalamountuom <fct>, originalroute <fct>,
## # originalrate <dbl>, originalrateuom <fct>, originalsite <fct>
A mutating inner join keeps only the observations whose keys appear in both tables.
The following code computes the inner join of the admissions and input events (cv) tables:
# Inner join of admissions table and input events (cv) table using the key "hadm_id"
MIMIC_ADM_INE_inner_join <- MIMIC3_ADM %>%
inner_join(MIMIC3_INE, by = "hadm_id")
as.tibble(MIMIC_ADM_INE_inner_join)
## # A tibble: 1,805 x 40
## row_id.x subject_id.x hadm_id admittime dischtime deathtime
## <int> <int> <int> <date> <date> <chr>
## 1 93 91 121205 2177-04-23 2177-05-10 2177-05-…
## 2 94 92 142807 2122-12-13 2123-03-04 <NA>
## 3 94 92 142807 2122-12-13 2123-03-04 <NA>
## 4 94 92 142807 2122-12-13 2123-03-04 <NA>
## 5 94 92 142807 2122-12-13 2123-03-04 <NA>
## 6 94 92 142807 2122-12-13 2123-03-04 <NA>
## 7 94 92 142807 2122-12-13 2123-03-04 <NA>
## 8 94 92 142807 2122-12-13 2123-03-04 <NA>
## 9 94 92 142807 2122-12-13 2123-03-04 <NA>
## 10 94 92 142807 2122-12-13 2123-03-04 <NA>
## # … with 1,795 more rows, and 34 more variables: admission_type <chr>,
## # admission_location <chr>, discharge_location <chr>, insurance <chr>,
## # language <chr>, religion <chr>, marital_status <chr>, ethnicity <chr>,
## # edregtime <chr>, edouttime <chr>, diagnosis <chr>,
## # hospital_expire_flag <int>, has_chartevents_data <int>,
## # row_id.y <int>, subject_id.y <int>, icustay_id <int>, charttime <fct>,
## # itemid <int>, amount <dbl>, amountuom <fct>, rate <dbl>,
## # rateuom <fct>, storetime <fct>, cgid <int>, orderid <int>,
## # linkorderid <int>, stopped <fct>, newbottle <int>,
## # originalamount <dbl>, originalamountuom <fct>, originalroute <fct>,
## # originalrate <dbl>, originalrateuom <fct>, originalsite <fct>
There are three types of outer joins: left, right, and full.
Left join of the admissions and input events tables:
# Left join admissions table and input events (cv) table using the key "hadm_id"
MIMIC_ADM_INE_left_join <- MIMIC3_ADM %>%
left_join(MIMIC3_INE, by = "hadm_id")
as.tibble(MIMIC_ADM_INE_left_join)
## # A tibble: 7,179 x 40
## row_id.x subject_id.x hadm_id admittime dischtime deathtime
## <int> <int> <int> <date> <date> <chr>
## 1 83 82 110641 2150-06-24 2150-06-29 <NA>
## 2 84 83 158569 2142-04-01 2142-04-08 <NA>
## 3 85 84 120969 2196-02-02 2196-02-04 <NA>
## 4 86 84 166401 2196-04-14 2196-04-17 2196-04-…
## 5 87 85 116630 2162-03-02 2162-03-10 <NA>
## 6 88 85 112077 2167-07-25 2167-07-30 <NA>
## 7 89 86 190243 2146-04-06 2146-04-10 <NA>
## 8 90 87 190659 2191-02-25 2191-04-25 <NA>
## 9 91 88 123010 2111-08-29 2111-09-03 <NA>
## 10 92 89 188646 2185-06-17 2185-06-21 <NA>
## # … with 7,169 more rows, and 34 more variables: admission_type <chr>,
## # admission_location <chr>, discharge_location <chr>, insurance <chr>,
## # language <chr>, religion <chr>, marital_status <chr>, ethnicity <chr>,
## # edregtime <chr>, edouttime <chr>, diagnosis <chr>,
## # hospital_expire_flag <int>, has_chartevents_data <int>,
## # row_id.y <int>, subject_id.y <int>, icustay_id <int>, charttime <fct>,
## # itemid <int>, amount <dbl>, amountuom <fct>, rate <dbl>,
## # rateuom <fct>, storetime <fct>, cgid <int>, orderid <int>,
## # linkorderid <int>, stopped <fct>, newbottle <int>,
## # originalamount <dbl>, originalamountuom <fct>, originalroute <fct>,
## # originalrate <dbl>, originalrateuom <fct>, originalsite <fct>
Right join of the admissions and input events tables:
# Right join admissions table and input events (cv) table using the key "hadm_id"
MIMIC_ADM_INE_right_join <- MIMIC3_ADM %>%
right_join(MIMIC3_INE, by = "hadm_id")
as.tibble(MIMIC_ADM_INE_right_join)
## # A tibble: 19,369 x 40
## row_id.x subject_id.x hadm_id admittime dischtime deathtime
## <int> <int> <int> <date> <date> <chr>
## 1 NA NA 134445 NA NA <NA>
## 2 NA NA 134445 NA NA <NA>
## 3 NA NA 134445 NA NA <NA>
## 4 NA NA 134445 NA NA <NA>
## 5 NA NA 134445 NA NA <NA>
## 6 NA NA 134445 NA NA <NA>
## 7 NA NA 134445 NA NA <NA>
## 8 NA NA 134445 NA NA <NA>
## 9 NA NA 134445 NA NA <NA>
## 10 NA NA 134445 NA NA <NA>
## # … with 19,359 more rows, and 34 more variables: admission_type <chr>,
## # admission_location <chr>, discharge_location <chr>, insurance <chr>,
## # language <chr>, religion <chr>, marital_status <chr>, ethnicity <chr>,
## # edregtime <chr>, edouttime <chr>, diagnosis <chr>,
## # hospital_expire_flag <int>, has_chartevents_data <int>,
## # row_id.y <int>, subject_id.y <int>, icustay_id <int>, charttime <fct>,
## # itemid <int>, amount <dbl>, amountuom <fct>, rate <dbl>,
## # rateuom <fct>, storetime <fct>, cgid <int>, orderid <int>,
## # linkorderid <int>, stopped <fct>, newbottle <int>,
## # originalamount <dbl>, originalamountuom <fct>, originalroute <fct>,
## # originalrate <dbl>, originalrateuom <fct>, originalsite <fct>
Full join of the admissions and input events tables:
# Full join admissions table and input events (cv) table using the key "hadm_id"
MIMIC_ADM_INE_full_join <- MIMIC3_ADM %>%
full_join(MIMIC3_INE, by = "hadm_id")
as.tibble(MIMIC_ADM_INE_full_join)
## # A tibble: 24,743 x 40
## row_id.x subject_id.x hadm_id admittime dischtime deathtime
## <int> <int> <int> <date> <date> <chr>
## 1 83 82 110641 2150-06-24 2150-06-29 <NA>
## 2 84 83 158569 2142-04-01 2142-04-08 <NA>
## 3 85 84 120969 2196-02-02 2196-02-04 <NA>
## 4 86 84 166401 2196-04-14 2196-04-17 2196-04-…
## 5 87 85 116630 2162-03-02 2162-03-10 <NA>
## 6 88 85 112077 2167-07-25 2167-07-30 <NA>
## 7 89 86 190243 2146-04-06 2146-04-10 <NA>
## 8 90 87 190659 2191-02-25 2191-04-25 <NA>
## 9 91 88 123010 2111-08-29 2111-09-03 <NA>
## 10 92 89 188646 2185-06-17 2185-06-21 <NA>
## # … with 24,733 more rows, and 34 more variables: admission_type <chr>,
## # admission_location <chr>, discharge_location <chr>, insurance <chr>,
## # language <chr>, religion <chr>, marital_status <chr>, ethnicity <chr>,
## # edregtime <chr>, edouttime <chr>, diagnosis <chr>,
## # hospital_expire_flag <int>, has_chartevents_data <int>,
## # row_id.y <int>, subject_id.y <int>, icustay_id <int>, charttime <fct>,
## # itemid <int>, amount <dbl>, amountuom <fct>, rate <dbl>,
## # rateuom <fct>, storetime <fct>, cgid <int>, orderid <int>,
## # linkorderid <int>, stopped <fct>, newbottle <int>,
## # originalamount <dbl>, originalamountuom <fct>, originalroute <fct>,
## # originalrate <dbl>, originalrateuom <fct>, originalsite <fct>
Note: By default, by = NULL, which will use all common variables that appear in both tables as keys. This is called natural join. We can also use by = c("a" = "b"). This will match variable a in the first table to variable b in the second table.
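As a small sketch with hypothetical tables (my own example, not from MIMIC3), a named key matches columns that have different names in the two tables:
# Two toy tables where the admission id column is named differently
df_a <- tibble(hadm_id = c(1, 2), x = c("a", "b"))
df_b <- tibble(adm_id = c(1, 3), y = c("c", "d"))
# Match hadm_id in df_a to adm_id in df_b
df_a %>% inner_join(df_b, by = c("hadm_id" = "adm_id"))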
semi_join(table1, table2, by = "k") would keep all observations in table 1 that have a match in table 2 identified by the key “k”.
Thus, in our tables:
# Semi join admissions table and input events (cv) table using the key "hadm_id"
MIMIC_ADM_INE_semi_join <- MIMIC3_ADM %>%
semi_join(MIMIC3_INE, by = "hadm_id")
as.tibble(MIMIC_ADM_INE_semi_join)
## # A tibble: 396 x 19
## row_id subject_id hadm_id admittime dischtime deathtime admission_type
## <int> <int> <int> <date> <date> <chr> <chr>
## 1 93 91 121205 2177-04-23 2177-05-10 2177-05-… EMERGENCY
## 2 94 92 142807 2122-12-13 2123-03-04 <NA> NEWBORN
## 3 95 93 160481 2128-03-17 2128-06-17 <NA> NEWBORN
## 4 96 94 183686 2176-02-25 2176-02-29 <NA> EMERGENCY
## 5 99 96 170324 2156-03-31 2156-04-29 <NA> EMERGENCY
## 6 104 101 175533 2196-09-26 2196-10-12 2196-10-… EMERGENCY
## 7 548 427 101274 2170-03-26 2170-04-01 <NA> EMERGENCY
## 8 549 428 132541 2123-10-25 2123-12-20 <NA> NEWBORN
## 9 556 433 163523 2164-08-13 2164-08-17 2164-08-… EMERGENCY
## 10 558 434 175451 2101-08-31 2101-09-16 <NA> EMERGENCY
## # … with 386 more rows, and 12 more variables: admission_location <chr>,
## # discharge_location <chr>, insurance <chr>, language <chr>,
## # religion <chr>, marital_status <chr>, ethnicity <chr>,
## # edregtime <chr>, edouttime <chr>, diagnosis <chr>,
## # hospital_expire_flag <int>, has_chartevents_data <int>
This keeps only the observations in the admissions table whose hadm_id also appears in the input events (cv) table.
Semi joins are particularly useful when we have filtered or summarised one table and want to analyze the related observations in the other table.
The opposite of semi_join() is anti_join(table1, table2, by = "k"), which retains only the observations in table 1 that do not have a match in table 2, identified by the key "k".
In our data tables, this would pull out all the entries in admissions table where hospital admissions id (hadm_id) is not present in the input events (cv) table.
# Anti join admissions table and input events (cv) table using the key "hadm_id"
MIMIC_ADM_INE_anti_join <- MIMIC3_ADM %>%
anti_join(MIMIC3_INE, by = "hadm_id")
as.tibble(MIMIC_ADM_INE_anti_join)
## # A tibble: 5,374 x 19
## row_id subject_id hadm_id admittime dischtime deathtime admission_type
## <int> <int> <int> <date> <date> <chr> <chr>
## 1 83 82 110641 2150-06-24 2150-06-29 <NA> NEWBORN
## 2 84 83 158569 2142-04-01 2142-04-08 <NA> URGENT
## 3 85 84 120969 2196-02-02 2196-02-04 <NA> ELECTIVE
## 4 86 84 166401 2196-04-14 2196-04-17 2196-04-… EMERGENCY
## 5 87 85 116630 2162-03-02 2162-03-10 <NA> EMERGENCY
## 6 88 85 112077 2167-07-25 2167-07-30 <NA> EMERGENCY
## 7 89 86 190243 2146-04-06 2146-04-10 <NA> ELECTIVE
## 8 90 87 190659 2191-02-25 2191-04-25 <NA> NEWBORN
## 9 91 88 123010 2111-08-29 2111-09-03 <NA> EMERGENCY
## 10 92 89 188646 2185-06-17 2185-06-21 <NA> NEWBORN
## # … with 5,364 more rows, and 12 more variables: admission_location <chr>,
## # discharge_location <chr>, insurance <chr>, language <chr>,
## # religion <chr>, marital_status <chr>, ethnicity <chr>,
## # edregtime <chr>, edouttime <chr>, diagnosis <chr>,
## # hospital_expire_flag <int>, has_chartevents_data <int>
Anti-joins are useful for diagnosing join mismatches. In the above example, we might be interested to know how many patients don’t have input event information.
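For instance, a minimal sketch (my own, assuming each distinct subject_id in the anti-join result is a patient with no input event information):
# Count the distinct patients in the anti-join result
MIMIC_ADM_INE_anti_join %>%
  summarise(n_patients = n_distinct(subject_id))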
When joining tables, start by identifying the variables that form the primary key in each table. This is usually done by understanding the data set, not empirically by looking for a combination of variables that gives a unique identifier.
If we just look for variables without thinking about what they mean, we might find a combination that is unique to our data, but the relationship might not hold in general.
In all of our examples above, we used the hospital admission id (hadm_id) as our key.
This yields statistics and analysis per admission, but not per patient (subject_id).
So if we want to derive patient-level statistics, we may want to combine, say, subject_id with hadm_id to form a primary key.
A combination of variables can therefore sometimes serve as a key and capture the statistics of interest.
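As a quick sketch (my own, not in the original notes), we can check whether the combination uniquely identifies observations:
# If this returns zero rows, subject_id + hadm_id uniquely identifies each observation
MIMIC3_ADM %>%
  count(subject_id, hadm_id) %>%
  filter(n > 1)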
Also check that none of the variables in the primary key are missing: if a value is missing, then it can't identify an observation!
Finally, check that the foreign keys match primary keys in the other table. The best way to do this is with an anti_join().
It's common for keys not to match because of data entry errors.
Fixing these is often a lot of work.
Classwork/Homework: Read and work on sections 14 and 15 (Strings and Factors) in the book R for Data Science.
References: R for Data Science (Wrangle part); Working with dates and time; Data Transformation Cheat Sheet.