I will be brief: No, it is not.
“Data Scientist,” you see, is a job title. An attractive title. A title that really sparkles on a business card. A title for cool and interesting work.
But not every job title has a corresponding academic discipline.
Case in point: management consulting. Seemingly half of my college classmates went into this field. But their B.A.’s were in political science, French, mathematics, and so on. “Helping other organizations with strategic decisions” has no specific disciplinary tradition, no distinctive set of analytical tools, no well-defined object of inquiry—in short, there’s no “there” there.
Mastering an academic discipline, as opposed to completing a vocational degree, means becoming an expert in some object of human interest: markets (economics), life (biology), literature (English), abstract argumentation (philosophy) or really abstract argumentation (mathematics). The management consultant is, by design, a generalist. An expert in nothing.
Which brings us back to our original question. What is a data scientist an expert in?
Data science’s expertise cannot be the tools of data analysis. Those tools emerge from Statistics and Computer Science. For expert-level depth and rigor in understanding such tools, you go to the makers, not the users.
Nor can data science claim expertise in the subjects of data analysis. The data sets belong to public health, marketing, supply chains, medicine, baseball, finance, and so on. For subject-matter expertise, you’d go to subject matter experts.
I’ve got nothing against Data Science. My own M.S. is in Data Analytics. But there’s no denying it: the data scientist applies tools (in which they’re not expert) to content areas (in which they’re not expert). Interesting work, but not a standalone academic discipline.
When I squint, I see an area of expertise waiting to be claimed. Big questions lurk—some currently housed in other departments, and some still awaiting satisfactory answers—all orbiting around a basic theme: what happens when humans use data to understand the world.
A first batch of questions might focus on how we turn the world into data; that is, the promise and perils of quantification. The flavor here is less hardcore STEM, and more “philosophy of social science.”
- What is data? What does it capture (and fail to capture) about our messy reality? (Import “Epistemology.”)
- What simplified model of the world is embedded in any particular choice of data? How do we inspect these models and unpack their assumptions?
- What risks do we face from random error? (Import “Statistics.”)
- What risks do we face from the myriad forms of nonrandom error, from volunteer bias to survivorship bias to Goodhart’s Law?
- In what ways does gathering data change reality, from shifting incentives to increasing transparency to Seeing Like a State-style bulldozing of complexity?
- How do we gather useful data? (Import “Social Science Research Methods.”)
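To make one of these perils concrete — survivorship bias, from the second batch of questions above — here is a toy simulation. The numbers (a startup-returns distribution, a solvency cutoff) are invented for illustration, not drawn from any real dataset:

```python
import random

random.seed(0)

# Toy population: 10,000 firms with true annual returns drawn
# from a distribution centered at -5% (most lose money).
returns = [random.gauss(-0.05, 0.30) for _ in range(10_000)]

# Nonrandom error: we only observe "survivors" -- firms whose
# return cleared some hypothetical solvency threshold.
survivors = [r for r in returns if r > -0.20]

true_mean = sum(returns) / len(returns)
observed_mean = sum(survivors) / len(survivors)

print(f"true mean return:     {true_mean:+.3f}")
print(f"observed (survivors): {observed_mean:+.3f}")
# The observed mean is biased upward: the dataset itself encodes
# a simplified model of the world ("firms that still exist").
```

No amount of sample size fixes this; the bias lives in which rows exist at all, which is exactly why it belongs to the "how we turn the world into data" batch rather than to statistics proper.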
Then, a second batch of questions might focus on the various computations we perform with data, from the simple to the sophisticated.
If anything is distinctive about the data scientist’s approach here (and I’m not 100% sure anything is), perhaps it’s the need to keep one eye on the reality from which the data emerged, and the other on the human interpreters for whom it is destined.
- How do we clean data without damaging it?
- What are the best practices (and common pitfalls) in consolidating data sets?
- [this item reserved for all of the million models that data scientists should know about, i.e., 90% of data science as currently taught and practiced]
- [this item reserved for a special focus on neural networks, because in practice they supplant 999,997 of the million models mentioned above]
- We all know “garbage in, garbage out.” But how do we think systematically about the vastly more common situation of “mostly good stuff, but with a little bit of garbage smell” in? In other words, how do errors propagate through data analysis?
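A minimal sketch of that last question — how a whiff of garbage propagates differently through different summaries. The scenario (a 0.3% unit mix-up in sensor readings) is a made-up example:

```python
import statistics

# 1,000 readings drifting from ~50 to ~60, plus three corrupted
# entries -- a hypothetical unit mix-up (grams, not kilograms).
clean = [50.0 + 0.01 * i for i in range(1000)]
garbage = [50_000.0, 48_000.0, 51_000.0]   # 0.3% of the data
data = clean + garbage

# The mean dutifully propagates the garbage smell...
print(f"mean:   {statistics.mean(data):.1f}")    # pulled far above 55
# ...while the median barely notices it.
print(f"median: {statistics.median(data):.1f}")  # still near 55
```

The point isn’t “always use the median”; it’s that every downstream computation has its own sensitivity to contamination, and reasoning about that sensitivity is a question the discipline could own.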
The third and final batch of questions concerns the human interpretation and understanding of data; that is, how data becomes knowledge. I grandiosely envision something akin to the CS area of Human-Computer Interaction (HCI). Call it “Human-Data Interaction,” if you like.
- How can data science’s various models be presented to laypeople?
- How does human vision work, and what implications does this have for the design of data visualizations? (Import “Edward Tufte.”)
- How (in words and pictures) can we best communicate uncertainty and potential error?
- What are the best practices for creating interactive data visualizations? (When is interactivity useful? When is it not?)
- What are people prone to miss when interpreting data? How do we call their attention to overlooked features, or prompt their curiosity?
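As one small, concrete instance of the uncertainty-communication question: a bootstrap interval rendered as a sentence rather than a bare point estimate. The data, the 95% level, and the phrasing are all illustrative assumptions, not a recommended standard:

```python
import random
import statistics

random.seed(1)

# A small sample -- say, daily conversion rates from 40 days.
sample = [random.gauss(0.12, 0.03) for _ in range(40)]

# Bootstrap: resample with replacement to estimate uncertainty
# in the sample mean, without distributional formulas.
boot_means = sorted(
    statistics.mean(random.choices(sample, k=len(sample)))
    for _ in range(2000)
)
lo = boot_means[int(0.025 * len(boot_means))]
hi = boot_means[int(0.975 * len(boot_means))]

# Words, not just numbers: one candidate way to say it.
print(f"Estimated rate: {statistics.mean(sample):.1%}; "
      f"values between {lo:.1%} and {hi:.1%} are consistent "
      f"with this data (95% bootstrap interval).")
```

Whether laypeople actually interpret such a sentence correctly is itself an open empirical question — which is rather the point of this third batch.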
These, perhaps, are the seeds of Data Science as a Liberal Art.
Now, are these questions rich, specific, and coherent enough? Or is the resulting “discipline” prone to feel like a high-school-level mishmash of coding boot camp and Philosophy 101? In short: if Data Science completes these labors, will it belong at the academic table?
I honestly don’t know. But if nascent Data Science departments want to lay claim to the mantle of “liberal art,” this is where I’d encourage them to begin.