Building a Culture of Data Science

September 1, 2022

Introduction

This post is another excerpt from my conversation with Chad Heilig, Associate Director for Data Science at the U.S. Centers for Disease Control and Prevention (CDC). Last time, I excerpted our discussion of our 2013 paper on the AERO graph, which focused on strategic decision-making within an anti-tuberculosis research program. This time, I am excerpting our discussion relating to Chad’s views about creating a “progressive culture for data”.

To provide some additional background for this discussion: In June 2022, Chad posted a “declaration” to LinkedIn calling for “a progressive culture for data in public health”. The declaration is not long, and I strongly recommend giving it a read before reading the conversation, since it will provide some helpful context for the discussion below.

(Disclaimer: Chad's views are his own and do not necessarily reflect the official position of CDC.)

Conversation

Spencer Phillips Hey (SPH): In your Declaration, you are calling for a “progressive culture for data in public health,” and the principles you articulate as core to this culture all seem compelling to me. But as I read the document, I can’t help but feel a sense of frustration, which I can only assume is rooted in your experience working in the government. Can you speak about your experience and why you felt the need to write this?

[Screen capture of Chad's Declaration post on LinkedIn]

Chad Heilig (CH): Sure. So around 2019, there had been lots of discussions—including at the top of the executive branch under the [Trump] administration—about data as "a valid, valuable asset," and I couldn't find a straight answer on what that meant. I think I finally got one person to say, “Well, maybe it's about interoperability.” But interoperability is not an end in itself.

Eventually, I just asserted that data have value because they help us learn things about the world. (Influenced by a friend and colleague, I would later add “and do things with what is learned”.) Learning things about the world, to me, is a good in itself. It's not the ultimate good, but it is a good. And I think that so often when we talk about data, this not only gets lost, but we wind up devaluing learning itself. Data gets talked about and treated as an asset that is separable from the humans who may learn and do things with it. I think this is a serious mistake—and it’s a mistake that I’ve seen frequently.

So this is why when I talk about data and data science, I foreground the humans who actually do the data science and the culture that supports them.

SPH: One of your principles for a progressive data culture is a dedication to “learning with data through its full life cycle”. Can you say more about that? What does that look like in practice?

CH: I credit my thinking about the life cycle to Roger Peng and Elizabeth Matsui, who articulate 5 core activities for the art of data science.

Let's say I know a method for analyzing images, like some deep learning methods do. That means I can ask questions whose answers depend on being able to analyze images. If I'm not aware of methods for analyzing images, then I'm not prone to ask questions for which analyzing images would give me some sort of point of entry into those data.

So this is the beginning of the data life cycle. A progressive culture for data at this stage involves cultivating learners who know how to pose, as I put it, “rich questions about the world, amenable to rich methods”.

And then at the other end of the life cycle of data, if I understand enough about what's happening in deep learning for images, whether it's convolutional or recurrent neural network architectures, I don't actually have to know all the fine details. But I need to know enough that I can convey accurately to a layperson what is happening with that model.

A progressive culture of data at this later stage involves being able to translate what I’ve found to others in the relevant community. Or as I put it in the Declaration: “to make answers [to rich questions] transparent, accessible, and enduring”.

SPH: I take it that a “regressive” culture for data is one in which decision-makers aren’t aware of the rich methods available, and so they don’t even think to ask themselves certain questions. Or they don’t understand the methods and results well enough to translate them into language that the broader community can understand or appreciate, so much of the value in the data is lost.

What other kinds of dangers do you see in failing to build this progressive culture for data? Or what worries you about institutions that are attracted to “data science” as more of a buzzword without this appreciation for the cultural requirements?

CH: One thing that worries me is a hunger for squeezing out optimal performance in and of itself, not grounded in the larger context of performance in practice.

Suppose you have a surveillance system with a particular condition as the target, and you want to estimate the prevalence of that condition in the population. Suppose you use a machine learning method to estimate the prevalence and you compare that to the previous method, which was experts reviewing compiled sources of information.

So you have the reference standard, which is based on experts, and you have this machine learning method that tries to replicate what experts do. Let's say when two experts review the data, they agree with each other 91% of the time, and this new machine learning model agrees with experts 88% of the time.

That sounds great, right? Except two important things don’t immediately follow. First, it doesn't immediately follow that we should now start looking for ways of substituting the machine learning model in place of the expert approach. Having a high-performing model is good, but it doesn't automatically imply what we do next.

Second, it doesn't immediately follow that we should continue looking for models that are better and better at achieving the performance that we think is the goal—trying to squeeze out that 3% difference (between 91% and 88%), say.

We could instead interpret this 88% accurate machine learning approach as showing us that, yes, in fact, it is possible to use automated methods that come close to human methods. Further, we could interpret this result as giving us permission to consider less intensive semiautomated methods.

For example, suppose now I come along and I say, okay, let's consider instead that I just look for any one of these three words or variations on those words in the compiled sources of information. Suppose that this approach gives us 85% correspondence with the experts. If I had started there, I might say, “Oh, that sucks. 85% means we're getting 15% of them wrong.” But now I also know two things: first, the best we could hope to do is 91% and, second, more sophisticated automated models approach that but don't get all the way there.

So instead, we shift the question to: How can I use this simple, heuristic-driven approach whose performance looks like it's going to aid us even if it doesn't substitute for the expert approach?
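To make the arithmetic in this example concrete, here is a minimal sketch, in Python with entirely hypothetical labels, of how such agreement rates might be computed. The records, labels, and resulting percentages below are illustrative assumptions, not data from any CDC surveillance system; the point is only that the expert-vs-expert agreement rate sets the practical ceiling against which the automated methods are judged.

    # Minimal sketch with hypothetical labels: 1 = condition present, 0 = absent.
    expert_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # first expert reviewer
    expert_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]  # second expert reviewer
    ml_model = [1, 0, 0, 1, 0, 0, 1, 0, 1, 1]  # hypothetical machine learning classifier
    keyword  = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]  # hypothetical keyword-matching heuristic

    def agreement(a, b):
        """Fraction of records on which two sets of labels agree."""
        return sum(x == y for x, y in zip(a, b)) / len(a)

    # The expert-vs-expert rate is the practical ceiling; the automated
    # methods are judged against it rather than against a perfect 100%.
    print(f"experts vs. each other: {agreement(expert_a, expert_b):.0%}")
    print(f"ML model vs. experts:   {agreement(ml_model, expert_a):.0%}")
    print(f"keyword vs. experts:    {agreement(keyword, expert_a):.0%}")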

This is an example of where I think the general direction and impulse in machine learning as a field can be misguided. But to be clear: I believe we should be thinking—both inside and outside CDC—about machine learning as expanding the set of tools we have for figuring out things in data, whether they're going to help us only learn things about the world or also help us do things with what we learn.

And, as you know, machine learning is often criticized for not giving you easy or straightforward points of entry into explanation or interpretation. But I like to flip that around: the thing that appears to give you explanation or interpretation could be misleading, whether it comes from machine learning, a familiar statistical approach, or some other method.

SPH: Right! Whatever method you are using, good judgment is still going to be required for interpretation and explanation. So this, I take it, leads us back to your emphasis on culture. Because new methods or technologies in data science are appearing all the time, and to keep up with fast movement in the field, you need practitioners and learners who have the skills (both technical and non-technical, as you write in your Declaration) to do this.

CH: Yes, it comes back to the culture. I really believe that a hierarchical organization (like CDC) would be much better served by intentionally respecting and engaging the contributions of the people who directly do things with data.

Sometimes we see managers who are reluctant to entertain unfamiliar methods or unfamiliar software. Along those lines, CDC has been primarily a SAS (www.sas.com) shop for decades. And until the last few years, R (www.r-project.org) has not been very welcome, even if R could help you answer questions better in a variety of ways.

More than that, I think there has to be a way of rewarding learners for actually learning. And that's hard to do, especially if you're on a production schedule. If you have to keep turning out reports or whatever, then it's difficult to support your staff spending time that isn't contributing directly to producing reports.

Or worse, you have staff who question orthodox methods or tools, and you, as their supervisor or manager, have to figure out a way to handle that respectfully, whether you agree with them or disagree with them.

One of the risks of doing what I'm saying—engaging with the people who work directly with data—is having them say things that we're not prepared to act on. And I don't mean act on in the public health way. I mean act on in the way of challenging orthodoxy.

I think this is probably true of machine learning and other methods as well, because the methods are unfamiliar to many and provoke some questioning from traditionalist statisticians who might be skeptical of them. Again, I'm not an evangelist for machine learning. I'm an evangelist for using the tool that helps you address the issue, to answer the question.

And that’s ultimately the central point of my declaration: In the public health sector, we are constantly challenged “to stretch modest resources, to anticipate and respond to threats, and to promote population health”. To meet this challenge, I believe we must re-think our understanding of data and data science, and recognize the need to build a culture of learning that more seriously engages and supports the people who are doing the data science.

Conclusion

Thanks again to Chad for making the time to talk and helping to put together these excerpts!
