The TA for this course is Lizabeth Katsnelson. Please email lk2513@nyu.edu with any questions about homework or classwork.
Office hours: Tuesday, 3:30pm to 4:30pm, or by appointment
Official Course Name: Programming for Data Analysis (BMIN-GA 1005/BMSC-GA 4486)
Meeting Schedule: Every Tuesday, 2:00pm - 3:30pm, and every Friday, 2:30pm - 4:00pm, from Jul. 9 through Aug. 16
Meeting Location:
On Tuesdays, we will meet at the New Science Building (SB) 435 E 30th St, Room 113
On Fridays, we will meet at the Translational Research Building (TRB) 227 E 30th St, Room 718

By the end of this course, the student will exhibit an in-depth understanding of data science and analysis methods as well as proficiency in R. The student will produce a portfolio of data analysis projects from the course that demonstrates mastery of analysis and visualization methods, and will be equipped to analyze biomedical and genomic data sets. Another main objective of the course is to communicate statistical results correctly and effectively.
This course is designed to help students learn the R programming language for data science. We will study a wide range of topics, including handling and querying databases, exploratory and confirmatory analysis, and visualization in R. We will closely follow the book R for Data Science; however, the emphasis will be on working with **biomedical data** rather than the datasets used in the textbook.
This course does not have any pre-requisites.
Late/missed work: You must adhere to the due dates for all required submissions. If you miss a deadline, then you will not get credit for that assignment/post.
Incompletes: No "Incompletes" will be assigned for this course unless we are at the very end of the course and you have an emergency.
Responding to Messages: I will check e-mail daily during the week, and I will respond to course-related questions within 48 hours.
Announcements: I will make announcements throughout the semester by e-mail.
Make sure that your email address is updated; otherwise you may miss important emails from me.
Safeguards: Always back up your work in a safe place (an electronic file with a backup is recommended) and make a hard copy. Do not wait until the last minute to do your work. Allow time for deadlines.
Plagiarism: Plagiarism, the presentation of someone else's words or ideas as your own, is a serious offense and will not be tolerated in this class. The first time you plagiarize someone else's work, you will receive a zero for that assignment. The second time you plagiarize, you will fail the course with a notation of academic dishonesty on your official record.
Programming Assignments (40%)
Directed Insights (25%)
Final Project (35%)
1. R for Data Science by Garrett Grolemund & Hadley Wickham (available online)
2. R in Action by Robert I. Kabacoff
3. Several online tutorials (search for "R tutorial" on Google and follow the leads)
You need to document and demonstrate all aspects of data science foundations discussed in the class.
1. Correctly apply tools and techniques of data preparation and wrangling
    a. Handling missing data, joining or other transformations, removing outliers, etc.
    b. Gathering and spreading data (if needed)
2. Use Exploratory Data Analysis and `dplyr` transformation methods to identify structure and correlations in the data
3. Formulate questions and possible ways of analysis and visualization
    a. Identify appropriate visualization methods for analysis of your data set
    b. Choose the right geoms for the questions at hand
4. Correctly interpret results of analysis (clinical/biological significance)
    a. Demonstrate domain specific knowledge of clinical data
    b. Propose a hypothesis based on visualization and results
    c. Compare the usefulness of the obtained results/conclusions
5. Formulate appropriate plans for validation, further analysis, or to collect additional data needed.
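
As a rough illustration of items 1 and 2 above, here is a minimal sketch of missing-data handling, reshaping, and a `dplyr` summary. The `patients` table, its column names, and its values are made up for this example and are not taken from the course data sets. Median imputation is only one possible choice, and `pivot_longer()` is the current `tidyr` name for gathering.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not a course data set): one row per patient,
# two measurements stored in wide form, with some missing values.
patients <- tibble(
  id      = 1:6,
  group   = c("control", "control", "control", "treated", "treated", "treated"),
  glucose = c(90, 110, NA, 130, 150, 140),
  sbp     = c(120, 118, 125, 140, NA, 138)
)

# Item 1a: handle missing data. Here each NA is replaced by the column median;
# dropping incomplete rows is another option.
imputed <- patients %>%
  mutate(across(c(glucose, sbp),
                ~ if_else(is.na(.x), median(.x, na.rm = TRUE), .x)))

# Item 1b: gather the two measurement columns into long form.
long <- imputed %>%
  pivot_longer(c(glucose, sbp), names_to = "measure", values_to = "value")

# Item 2: use dplyr transformations to look at structure by group and measure.
group_means <- long %>%
  group_by(group, measure) %>%
  summarise(mean_value = mean(value), n = n(), .groups = "drop")

print(group_means)
```

The long form produced in the second step is also the shape that most plotting geoms expect, so the same table can feed the visualization work in item 3.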
| Session | Materials | | |
| --- | --- | --- | --- |
| Introduction to the course | Link | | |
| R fundamentals 01: Elementary data types (July 9), Dr. Lieber | Presentation | HTML | |
| R fundamentals 02: Advanced data types and graphics (July 12), Dr. Lieber | Presentation | HTML | Rmd |
| R fundamentals 03: Basic graphics and Rmd (July 16), Dr. Lieber | Presentation | HTML | Rmd |
| R fundamentals 04: Elements of programming (July 19), Dr. Lieber | Presentation | HTML | Rmd |
| Data science fundamentals 01: Visualize and explore (July 22), Dr. Lieber | Presentation | HTML | Rmd |
| Data science fundamentals 02: Transform and explore (July 26), Dr. Kannan | Presentation | HTML | Rmd |
| Data science fundamentals 03: Wrangle data (July 30), Dr. Lieber | Presentation | HTML | Rmd |
| Data science fundamentals 04: Exploratory data analysis (Aug. 2), Dr. Kannan | Presentation | Rmd | |
| Data science fundamentals 05: Basic inference and linear regression (Aug. 6), Dr. Kannan | Presentation | Rmd | |
| Data science fundamentals 07: Advanced modeling, Dr. Kannan ( ) | Presentation | Link | |
| Data science workshop, Dr. Kannan ( ) | | | |
| Project presentation 01 ( ) | | | |
| Project presentation 02 ( ) | | | |
Homework #01
Homework #02
Homework #03
Homework #04
The final project is easy to state: obtain directed insights on data sets of your choice (given below) based on the Explore, Wrangle, Model, Program, and Communicate paradigm.
You are advised to become familiar with the HANES and MIMIC3 data sets (see below) and their formats right away. The first step in becoming a good data scientist is getting to know the data you are handling. The more familiar you are with the data, the better the patterns you can decipher.
You will be continuously assessed to make sure you are progressing towards your final submission. Please see the project page for more information.
| Data set | Formats | |
| --- | --- | --- |
| New York City Health and Nutrition Examination Survey (HANES) | Original (SAS format) | CSV |
| New York City Health and Nutrition Examination Survey (HANES), Curated | CSV | |
| IPUMS HEALTH SURVEYS (NHIS) | Link | |
| National Center for Health Statistics | Link | |
| MIMIC3, Fluid input events, CareVue (0.1%) | CSV | |
| MIMIC3, Fluid input events, MetaVision (0.1%) | CSV | |
| MIMIC3, Chart events 1 (0.01%) | CSV | |
| MIMIC3, Prescriptions (0.1%) | CSV | |
| MIMIC3, Other data sets | Link | |