- Huge N
- Not created for science*
- Always-on
- Nonreactive
- Incomplete
- Inaccessible
- Nonrepresentative
- Drifting
- Algorithmically confounded
- Dirty
- Sensitive
Big Data is indeed big, but is correlation enough?
- The whole of the sampling frame is not the population, unless you are doing a "science of X"
- Systematic bias instead of sampling bias
- P-hacking, meet N-hacking: 6:06 mark
- More about new types of data, or richness of context
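
The "systematic bias instead of sampling bias" point can be illustrated with a minimal Python sketch. Everything in it is an assumption chosen for illustration: a true share of 30%, people with the trait being three times as likely to end up in the data, and the hypothetical helper `estimate_share`. Under those assumptions, a larger N narrows the noise around the estimate but does not move it toward the truth.

```python
import random


def estimate_share(n, biased, seed=0):
    """Estimate the share of people with a trait from n collected records.

    Illustrative assumptions (not from the lecture):
    - 30% of the population has the trait.
    - If `biased` is True, people without the trait are only one third
      as likely to appear in the data (e.g., lighter platform users).
    """
    rng = random.Random(seed)
    sample = []
    while len(sample) < n:
        has_trait = rng.random() < 0.30
        # Probability that this person's record is captured at all.
        keep_prob = 1.0 if (has_trait or not biased) else 1 / 3
        if rng.random() < keep_prob:
            sample.append(1 if has_trait else 0)
    return sum(sample) / n


if __name__ == "__main__":
    for n in (100, 10_000, 200_000):
        print(n, estimate_share(n, biased=False), estimate_share(n, biased=True))
```

With these assumed numbers, the biased estimate settles near 0.3 / (0.3 + 0.7/3) = 0.5625 rather than 0.30, no matter how large N gets.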
Not created for science
- “Digital exhaust”
- Weak link between construct and measures
Always-on
- Old: HS diploma to job at 28
- New: Location every 15 minutes
- But: When is this informative? Are we answering different questions?
Nonreactive
- Old: Generosity in lab studies (reactivity/Hawthorne effect)
- New: People don’t know they are being observed, or observation is now normalized
- But: Have you been on Instagram?
Incomplete: If only I had:
Inaccessible
- Private companies own it
- Governments collect it but don’t share
Nonrepresentative
- Who actually tweets anyway?
- But a nonrandom sample can still be very useful!
Drifting
- Population
- Behavior
- System
Algorithmically confounded
- Unique experiences (N treatment groups)
- Recommenders
- Matthew effect/increasing returns/preferential attachment
- Action triggers
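
The Matthew effect/preferential attachment item can be sketched as a small simulation. This is only an illustration under stated assumptions: 50 items, 5,000 clicks, and a hypothetical `simulate_clicks` function in which a stand-in "recommender" shows items in proportion to the clicks they already have. It is not a model of any real platform.

```python
import random


def simulate_clicks(n_items=50, n_clicks=5000, rich_get_richer=True, seed=1):
    """Return final click counts per item, sorted from most to least clicked.

    Illustrative assumptions (not from the lecture):
    - Every item starts with one click.
    - If `rich_get_richer` is True, the next click goes to an item with
      probability proportional to its current clicks (preferential
      attachment). Otherwise every item is equally likely.
    """
    rng = random.Random(seed)
    clicks = [1] * n_items
    items = range(n_items)
    for _ in range(n_clicks):
        weights = clicks if rich_get_richer else [1] * n_items
        chosen = rng.choices(items, weights=weights, k=1)[0]
        clicks[chosen] += 1
    return sorted(clicks, reverse=True)


if __name__ == "__main__":
    rich = simulate_clicks(rich_get_richer=True)
    flat = simulate_clicks(rich_get_richer=False)
    print("top 5 with feedback:   ", rich[:5])
    print("top 5 without feedback:", flat[:5])
```

In this toy setup the items are identical, so any concentration of clicks at the top comes from the feedback rule and not from differences in quality. That is one sense in which observed behavior can be algorithmically confounded.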
Bigger question: Is social life now algorithmically confounded?
Lack of treatment controls is one thing, but repeated exposures compound into substantial change.
Social science is moving toward a lab or collaborative model