Getting Organized to Do Data Science

Christopher Skovron & Jon Atwell Northwestern University

June 20, 2018

From last time: getting help

  • StackOverflow is a great resource
  • May have to adjudicate between responses
  • May have to adapt answers to your specific case

Use built-in help

?lm
?read.csv
?dplyr::rename

Built-in help will

  • Tell you how functions work, to varying degrees of helpfulness
  • Often provide some examples of how functions work
  • Often still leave you clueless and needing to Google around some more

Not-so-useful help

The help file for grep is not very useful for novices

Not-so-useful help

The help file for grep is not very useful for novices

Some builtin help is more useful

The amce function does one thing and is well-documented.

Some builtin help is more useful

The amce function does one thing and is well-documented.

Some builtin help is more useful

The amce function does one thing and is well-documented.

Some builtin help is more useful

The amce function does one thing and is well-documented.

Asking for help

Get organized

Your audiences, in decreasing order

  • Your future self
  • Your collaborators
  • Replicators

Good organizational principles

One script per task

  • In general, each scipt should do a specific high-level task
  • Figuring out where to break up scripts is more of an art than a science, but some rules of thumb
  • If you are at a point in your workflow where it would make sense to save something - a dataset, a model object, a few figures, etc, that is often a good time for a new script

Mind the order

  • Complex tasks will require multiple scripts
  • Use organizing tools like RStudio’s Projects
  • Number scripts in order: 0_clean_data.R, 1_merge_census_data.R, 3_fit_multilevel models.R

Automate everything that can be automated

  • Saves yourself time
  • Prevents mistakes
  • Sometimes comes at the cost of more time up front and code readability

DON’T REPEAT YOURSELF

  • Inefficient
  • Invites mistakes
  • Makes even simple tweaks to code take forever to implement
  • Ask yourself, “could this code do this task 100 times?”

Don’t copy and paste

  • Call things from where they are stored
  • Typically, this will be on your disk or on some remote/cloud disk you can access from your machine
  • If your code calls for the same task to be repeated, find ways to automate

Set up lists you can iterate over

  • If you’re doing the same task over and over, you want to iterate over a list
  • Strategies for doing this depend on task and language
  • for loops
  • apply functions

Understand your file structure

Paths

  • Paths are how your program finds what it needs to
  • You shouldn’t be clicking on datasets or files to load them

Modern IDES can autocomplete paths

Setting a working directory

  • Each script should have a “working directory”
  • This is where your program will search for files referenced in the script, write any files you create using the script, etc.
  • R:
setwd("/Users/cskovron/Dropbox/Research/ncs-constituent-eval/analysis")
# on Mac is equivalent to
setwd("~/Dropbox/Research/ncs-constituent-eval/analysis")
  • Python:
import os
path="~/Dropbox/Research/ncs-constituent-eval/analysis"
os.chdir(path)
  • Command line:
cd "~/Dropbox/Research/ncs-constituent-eval/analysis"

In R, convention is now to use RStudio “Projects”

  • This improves reproducibility but is a little more advanced, so I encourage you to look at it on your own if you are an R specialist

Setting relative paths - Unix-based systems including Mac

  • You can set paths at levels outside your working directory
  • ~ Your home directory (on my Mac, /Users/cskovron/)
  • . The current directory (./images/ is a subfolder of the working directory called images)
  • .. The parent of the current directory (the directory the working directory is in)

Don’t change your working directory to save to a subfolder

  • Just do:
write.csv(some.data.to.save, "./data-subfolder/data-filename.csv")

Paths don’t play nicely between Windows and Unix systems

Version control

  • Simplest definition – lets you revert to the past
  • Track changes made by collaborators
  • Document your contributions to code

Version control options

Tidy data principles

Tidy data has

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

Tidy data does not have

  • Column headers are values, not variable names.
  • Multiple variables are stored in one column.
  • Variables are stored in both rows and columns.
  • Multiple types of observational units are stored in the same table.
  • A single observational unit is stored in multiple tables.

Long vs wide data

via Sean Anderson

Relational data

  • Some data has structures more complex than simple tables
  • For example, Netflix has a database where each user has a table of movies they’ve watched and a separate table for each movie of the users who have watched it
  • This is “relational data”
  • It’s often, but not always, big
  • Requires special tools, usually SQL

Checking up on your data cleaning

  • glance at your data:
    • View() (but be careful!)
    • summary() (be careful on big datasets)
    • head() and tail()
    • tibble::glimpse()
    • is.na() and sum(is.na())

Set yourself up to do things iteratively

  • In one paper, I do the same analysis for many survey items
  • To automate this, I made a helper file called issues.names.titles.csv

Tricks for doing things iteratively

  • Use paste() and paste0() to help write captions and labels
  • Can select columns using variables: data[, issue], if issue is a character vector, selects just that column. Loop over issues

Where to work? Development environments

  • Let your software do some of the work for you
  • RStudio projects
  • RStudio gets new features every day that help you stay organized
  • RMarkdown allows you to integrate R code and writing to produce reports in HTML and PDF, slides, etc. Very flexible and extendable

Get organized to learn in the future