Computing Essentials for Social Scientists: Day One
Jon Atwell & Christopher Skovron
June 18, 2018
- Program Requirements
- What is data science? What do we hope to get out of this?
- Python vs R
- What you need to know to succeed with data science
Let’s get to know one another
- Michigan PhD, 2017 - Sociology
- Studies the emergence of culture and norms
- Expertise: Python, text methods, agent-based modeling, experiments
- Michigan PhD, 2017 - Political Science
- Studies public opinion and what people believe about it
- Expertise: R, statistical modeling, causal inference
We expect from you
- Sustained commitment to integrating data science methods in your research
- Be a resource for your colleagues, both students and faculty
- Be flexible, open to struggling and learning new things about yourself and your research
- Three sessions this week
- At least four additional sessions. Intro R and Python count as two sessions.
- Attending the afternoon practice sessions for workshops is optional but encouraged.
- Participate in future data science programming put on by the Data Science Initiative
- Jon and I will be mentors
- We aren’t here to debug but will help you strategize and find resources
- Research Computing Services is another source of support
What is data science?
- the analysis of data ¯\_(ツ)_/¯
- Naur, “Datalogy” (1974)
- A proposal that statistics rename itself “data science” (1997)
- Today’s use comes via computer science (Classification Society conf. 1996)
What does it mean in practice today?
Everything and Nothing, IOHO
What do LIDAR and Tweets have in common?
- Only data scientists!
Ok, sophisticated processing and analysis algorithms too
Ok, what does it mean in the academy?
- New data sources and types
- Non-regression-based quantitative empirical analytical approaches.
Why are we here?
Business got more serious about analytics
Northwestern Data Science Initiative definition
- New analytic methods
- New sources
- Defined relative to what has historically been typical in disciplines
New data sources
- Text, tweets
- Experiments conducted online and in the field, at scale
- More technical data used to analyze social behavior: satellite lights
- A plethora of new text methods
- Machine learning methods
- Fine-grained measurement (geolocations, etc)
- Surveys are bigger and more frequent
- People leave more traces of themselves online
- We can observe social behavior using new technologies
Beware the hype
- People have incentives to play up how fancy their data and methods are
- Keep a critical eye on whether people are being honest about why they chose the data and methods they did
- Don’t become a data science supremacist
- OLS is often just as good and orders of magnitude simpler than fancy methods!
- It’s important to have a good bullshit detector when working in this area
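To make the point about OLS concrete, here is a minimal sketch of a bivariate OLS fit using only the Python standard library. The data are simulated, and the function name `ols_fit` and the chosen coefficients are illustrative assumptions rather than workshop material.

```python
import random


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The slope is the covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Simulated data (an assumption for illustration): y = 1 + 2x + noise.
rng = random.Random(42)
x = [rng.uniform(0, 10) for _ in range(500)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

intercept, slope = ols_fit(x, y)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

The whole estimator fits in about ten lines and every step can be checked by hand, which is much of what "simpler" buys you.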
Pitfalls for inference
- As social scientists, we need to remember that using new data science tools doesn’t remove our obligation to do good research design
- Bigger data or more computationally intensive approaches also present new challenges for transparency
- Just because you have the full population doesn’t mean you can speak to causality, etc
- Big data doesn’t always mean your findings are generalizable
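The point about full-population data and causality can be illustrated with a small simulation. This is a sketch under stated assumptions: the variables `z`, `x`, and `y` and their relationships are invented for illustration, and `x` is constructed to have no effect on `y` at all.

```python
import random


def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def residualize(v, z):
    """Return the part of v that is not linearly explained by z."""
    b = slope(z, v)
    a = mean(v) - b * mean(z)
    return [vi - (a + b * zi) for vi, zi in zip(v, z)]


rng = random.Random(0)
n = 20_000  # pretend this is the entire population

z = [rng.gauss(0, 1) for _ in range(n)]       # confounder (often unobserved in practice)
x = [zi + rng.gauss(0, 1) for zi in z]        # "treatment", driven by z
y = [2.0 * zi + rng.gauss(0, 1) for zi in z]  # outcome, driven by z only

# Naive regression of y on x picks up the shared dependence on z.
naive = slope(x, y)
# Partialling z out of both variables recovers the (zero) effect of x.
adjusted = slope(residualize(x, z), residualize(y, z))
print(f"naive slope = {naive:.2f}, slope adjusting for z = {adjusted:.2f}")
```

Observing everyone does not help here: the naive slope is far from zero because of the confounder, not because of sampling error, so more data only makes the wrong answer more precise.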
CESS students were chosen because
- You already show creativity and promise in your research projects and ideas
- Many of these projects could benefit from data science approaches
- You don’t already know all the things we are going to teach
Working together is key
- CESS is a community of students facing similar challenges and opportunities
- Working collaboratively on programming requires some vulnerability, but community is essential
- You should work on getting comfortable with asking peers for help
- You should share what you know and tricks you pick up with peers and with faculty
Python vs R
Where do you fit in DS?
or, What you need (and don’t need) to know to thrive using computational tools
or or, Relax and focus on what you need
Part 1: The professional angle
Academic research or industry?
- You aren’t effectively preparing for the latter here
- Transition is getting harder
- Both can be great, or terrible
- Breiman, 2001: “Two Cultures”
Supervised vs. Unsupervised
- Supervised => You know values of dependent variable(s)
- Unsupervised => You don’t
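The distinction above can be sketched on one simulated dataset. In this illustrative example (the data, the labels `"a"` and `"b"`, and the helper names are all assumptions), the supervised step uses the known labels, while the unsupervised step runs a two-cluster k-means that never sees them.

```python
import random

rng = random.Random(1)

# Simulated one-dimensional data from two groups (an illustrative assumption).
values = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(6, 1) for _ in range(100)]
labels = ["a"] * 100 + ["b"] * 100


def mean(xs):
    return sum(xs) / len(xs)


# Supervised: labels are known, so learn one mean per label and
# classify each point by the nearest class mean.
class_means = {
    lab: mean([v for v, point_label in zip(values, labels) if point_label == lab])
    for lab in ("a", "b")
}


def classify(v):
    return min(class_means, key=lambda lab: abs(v - class_means[lab]))


accuracy = mean([classify(v) == lab for v, lab in zip(values, labels)])


# Unsupervised: labels are ignored; two-cluster k-means has to
# discover the groups from the values alone.
def two_means(xs, iterations=20):
    centers = [min(xs), max(xs)]
    for _ in range(iterations):
        clusters = [[], []]
        for v in xs:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[nearest].append(v)
        # Keep the old center if a cluster ever ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


centers = two_means(values)
print(f"supervised accuracy = {accuracy:.2f}")
print(f"cluster centers found without labels = {centers[0]:.2f}, {centers[1]:.2f}")
```

Note that the supervised model can be scored against the known labels, whereas the clustering result has no built-in answer key to check against.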
Specified model vs blackbox
- Orthogonal to supervised vs. unsupervised
- High predictive power
- accommodates a greater diversity of generating mechanisms than regression
- easily overfitted
- can be blackbox
- Lack ground truth
- define probabilistic model/generative process
- harder to overfit
- can be blackbox
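One way to read the "easily overfitted" bullet is that a flexible model can fit its training data perfectly and still predict new data worse than a simple one. The sketch below compares a straight-line fit with a one-nearest-neighbor predictor that memorizes the training set; the simulated data and all function names are illustrative assumptions.

```python
import random

rng = random.Random(7)


def simulate(n):
    """Draw n points from y = 3 + 0.5 * x + noise (an illustrative assumption)."""
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [3.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
    return x, y


def fit_line(x, y):
    """Specified model: ordinary least squares with one predictor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    intercept = my - slope * mx
    return lambda v: intercept + slope * v


def fit_nearest_neighbor(x, y):
    """Flexible 'memorizing' model: predict the y of the closest training x."""
    pairs = list(zip(x, y))
    return lambda v: min(pairs, key=lambda p: abs(p[0] - v))[1]


def mse(predict, x, y):
    return sum((predict(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


x_train, y_train = simulate(400)
x_test, y_test = simulate(400)

line = fit_line(x_train, y_train)
neighbor = fit_nearest_neighbor(x_train, y_train)

for name, model in [("line", line), ("nearest neighbor", neighbor)]:
    train_error = mse(model, x_train, y_train)
    test_error = mse(model, x_test, y_test)
    print(f"{name}: train MSE = {train_error:.2f}, test MSE = {test_error:.2f}")
```

The nearest-neighbor model has zero training error because each training point is its own nearest neighbor, so the training fit says nothing about how it performs on fresh data.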
Do you need supervised models?
- Regression models, of course!
- Classification? umm, probably not
Do you need unsupervised models?
- text as data!
- clustering, yes
Part 3: The good (academic) programmer angle
No one knows it all, so relax
- Be comfortable in that
- But don’t stop learning!
Wrong emphasis: Efficiency
Why? (more on Day 2)
- clocktime vs human scale
- but complexity matters.
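A rough illustration of why complexity still matters even when micro-optimizing does not: the two functions below give the same answer, but one does work proportional to the product of the list lengths and the other to their sum. The respondent IDs and function names are hypothetical, and timings will vary by machine.

```python
import random
import time

rng = random.Random(3)

# Hypothetical respondent IDs from two survey waves.
wave_one = rng.sample(range(100_000), 3_000)
wave_two = rng.sample(range(100_000), 3_000)


def overlap_nested(a, b):
    """Count shared IDs by scanning list b for every ID in a.

    Work grows roughly like len(a) * len(b).
    """
    return sum(1 for item in a if item in b)


def overlap_with_set(a, b):
    """Count shared IDs using constant-time set lookups.

    Work grows roughly like len(a) + len(b).
    """
    b_set = set(b)
    return sum(1 for item in a if item in b_set)


start = time.perf_counter()
slow_count = overlap_nested(wave_one, wave_two)
slow_seconds = time.perf_counter() - start

start = time.perf_counter()
fast_count = overlap_with_set(wave_one, wave_two)
fast_seconds = time.perf_counter() - start

print(f"list scan: {slow_count} shared IDs in {slow_seconds:.4f} s")
print(f"set lookup: {fast_count} shared IDs in {fast_seconds:.4f} s")
```

At this size either version finishes quickly enough that your own time dominates. The gap widens fast as the lists grow, which is when choosing the better algorithm pays off.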
Wrong emphasis: Succinctness
- Compactness often isn’t good
- (See the translation process)
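As a sketch of why compactness is not automatically a virtue, the two versions below compute the same per-party averages. The survey rows and the function name `mean_approval_by_party` are hypothetical, made up for this illustration.

```python
# Hypothetical survey responses: (party, approval on a 0-100 scale, or None if missing).
responses = [
    ("dem", 62), ("rep", 35), ("dem", None), ("ind", 50),
    ("rep", 41), ("dem", 70), ("ind", None), ("rep", 38),
]

# Compact: a single expression, but the reader has to unpack it mentally.
compact = {p: sum(a for q, a in responses if q == p and a is not None) / sum(1 for q, a in responses if q == p and a is not None) for p in {q for q, _ in responses}}


# Expressive: more lines, but each step has a name and a purpose.
def mean_approval_by_party(rows):
    """Average approval per party, skipping missing answers."""
    answered = [(party, approval) for party, approval in rows if approval is not None]
    totals = {}
    counts = {}
    for party, approval in answered:
        totals[party] = totals.get(party, 0) + approval
        counts[party] = counts.get(party, 0) + 1
    return {party: totals[party] / counts[party] for party in totals}


expressive = mean_approval_by_party(responses)
print(expressive)
```

Both produce identical results, but only the second makes it obvious how missing answers are handled, which is the kind of decision a reader of your analysis needs to see.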
Correct emphasis: Scientific question
- lead with questions, not methods
Correct emphasis: Reproducibility
- organized so others can reproduce the result
- but working code is not the same as ...
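One small, concrete habit that supports reproducibility is making any randomness explicit. The sketch below is an assumption-laden illustration (the data, the seed value, and the function name are invented): it passes a seed into a bootstrap routine so that rerunning the analysis yields the same numbers.

```python
import random
import statistics


def bootstrap_mean_interval(data, seed, draws=1000):
    """Percentile bootstrap interval for the mean, reproducible via `seed`."""
    rng = random.Random(seed)  # a local generator avoids hidden global state
    means = []
    for _ in range(draws):
        resample = [rng.choice(data) for _ in data]
        means.append(statistics.fmean(resample))
    means.sort()
    # Integer arithmetic keeps the percentile indices exact.
    lower = means[draws * 25 // 1000]
    upper = means[draws * 975 // 1000 - 1]
    return lower, upper


# Hypothetical measurements, for illustration only.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 2.5]

first_run = bootstrap_mean_interval(data, seed=2018)
second_run = bootstrap_mean_interval(data, seed=2018)
print(f"first run:  {first_run}")
print(f"second run: {second_run}")
print(f"identical:  {first_run == second_run}")
```

Without a fixed seed, two runs of the same script would report slightly different intervals, and a reader would have no way to tell whether a discrepancy came from the code or from chance.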
Correct emphasis: Replicability
- logic of code is clear
- via comments and other documentation
- via properly expressive code