In this month's "Discussing Data Science" episode, I talk with Eni Mustafaraj, an associate professor of computer science at Wellesley College and one of the co-founders of the data science program at Wellesley.

You can watch the video below or on YouTube. But if you'd prefer to read, keep scrolling. The complete transcript (edited for length and clarity) is below.


Spencer Hey (SH): Hello, my name is Spencer Hey. I'm the co-founder and Chief Science Officer at Prism Bio, and welcome to "Discussing Data Science".

My guest today is Dr. Eni Mustafaraj. She is an Associate Professor of Computer Science at Wellesley College, where she has taught courses on a variety of topics in computer science, including artificial intelligence, data analytics and visualization, data and text mining, among many others.

Furthermore, she's also one of the co-founders of the data science program at Wellesley.

Dr. Mustafaraj's research focuses on web-based socio-technical systems. She is particularly interested in understanding bias and credibility in online information sources.

Eni, thank you so much for joining me today.

Eni Mustafaraj (EM): Thank you so much for inviting me, Spencer. It's such a pleasure to be here with you today.

SH: I'd love to start with a definition of 'socio-technical systems'. What does that term mean to you?

EM: Let me explain with an example: Think of the transportation system--we take the car, the train, or the bus every day. That's an example of a socio-technical system because, of course, there are technologies like cars, trains, and buses. There's the infrastructure--the roads, the railways, and the bridges. But it's also us, the humans. We are driving these vehicles, but we are also designing the infrastructure, the institutions, the rules, and the organization. So, our transportation system is a complex socio-technical system, and we have many such systems in our societies.

However, in the past few decades, we have been building online socio-technical systems. One of the largest ones out there is the internet, or the web. The internet is like the physical infrastructure, the technologies, and the web is a combination of both the software that runs on top of the internet and everything else that we, humans, have created in order to communicate with one another. This, to me, is an interesting, ever-evolving system. Just like we are trying to understand and design better socio-technical systems broadly in our society, researchers and scientists are trying to do the same with the web.

Google and Our Shared Information Environment

SH: Let's delve a little deeper into that. I know you've done a considerable amount of work exploring Google's search results--how they work and how they inform people's... I suppose I want to say 'opinions', but maybe it's not just opinions. It seems like Google's search could be shaping a person's whole perspective on the world. Could you please tell us more about your work in this area?

EM: When we think about the web, it serves several roles. One of them is as a communication technology, especially when we consider social media and social networking. But it's also an information environment that we turn to whenever we have information needs. Nowadays, we often use the term 'Google it' as part of our common vocabulary. Therefore, it's crucial to understand what kind of information we find on the web, especially since this search process is mediated through search engine algorithms.

I believe that we should all care about the online information environment, about the kind of information that we find there. We should care about whether it's harmful or truthful, or who the sources producing this information are. We need mechanisms and processes to observe and monitor what's happening in this information environment, where Google is such a significant player. It's essentially our gateway to access this information.

Part of my research for many years now, since 2008, has been to audit search results, especially in the context of political elections, but also significant events like Supreme Court cases, etc. The aim is to scrutinize what Google is doing and what other actors in this information environment are doing--how they behave, how they adapt to this algorithmically curated environment.

You can learn a lot by observing how Google organizes the search result page. If you were to search for something like 'COVID', you'd remember how, in the past months and years, Google Search provided an entire dashboard where you could see all kinds of statistics, national and local news, and information from health authorities. This happened because it was such a major event. Similarly, during elections, you see that change.

Since the web comprises thousands of online news sources and other information sources, they all learn to adapt to this algorithmic environment, trying to make their content legible to the algorithms so that their content shows up at the top. There's an entire search engine optimization industry that helps various online sources get to the top of the search results.

The process I use in my research to capture these aspects is through auditing. This involves large scale, longitudinal searches on Google. For example, in one study my colleagues and I conducted a few years ago, we collected data for an entire year for everyone who was a candidate during the presidential election primaries, for both Democrats and Republicans. Our aim was to see what was happening in the information environment. As I said, you learn a lot about what Google's algorithms prefer in terms of which news sources and topics they prioritize.

SH: Could you share some insights about what you learned from these studies?

EM: We learned a couple of things from these studies.

One of the major findings is something we call 'source concentration'. It turns out that Google relies on certain news sources and shows them much more frequently than others. These are mainstream sources like NPR, New York Times, Washington Post, CNN, NBC. While it's true these are the largest news organizations out there producing a substantial volume of content, Google still prioritizes these sources over others, even if their content is "stale"--let's say from three months ago.

Another major finding is the diversity in the news ecosystem. There are thousands of sources out there, and some researchers have developed ways to assign political orientation to these sources. When we looked at the distribution of political orientation among news sources in one of our studies, it was actually quite balanced. Most news sources were either center-left or center-right, with fewer on the far-left or far-right. However, it's interesting to note that not all these news sources produce news at the same rate, and this also depends on the topic.

Although there are many sources that are mostly center-right, like the Wall Street Journal, they don't produce as much content. And we found that these sources are somewhat fading, as readers interested in a right-leaning perspective tend to gravitate towards sources such as Fox News, which is further to the right. So when you look at the distribution of actual news content and articles (in terms of volume), you see a more bimodal distribution. Slightly left of center, you have the New York Times, Washington Post, and NBC, and then there's another group where Fox News, the Daily Caller, and Breitbart fall.

What was fascinating was when we collected data about Trump and Biden. Most of the content about Trump was being created by the center-left sources, the mainstream ones. But most of the content about Biden, at least in 2019 when Biden was still just a Democratic candidate, was coming from places like Breitbart and Fox News. They were actively trying to create a narrative around Biden as someone who was elderly and perhaps not the right choice. But it's fascinating to see these trends in the data regarding the news being produced.
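To make the two findings above concrete, here is a minimal Python sketch of how 'source concentration' and the gap between a per-source and a per-article distribution could be measured. It is not the method used in the studies discussed here: the data, the source names ("Source A" and so on), the leaning labels, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy data: one (source, leaning) pair per article that
# appeared in a set of search results. All values are invented.
RESULTS = (
    [("Source A", "center-left")] * 6
    + [("Source B", "center-left")] * 1
    + [("Source C", "center-right")] * 1
    + [("Source D", "center-right")] * 1
    + [("Source E", "right")] * 6
    + [("Source F", "left")] * 1
)


def top_k_share(results, k):
    """Fraction of all displayed articles that come from the k most frequent sources."""
    counts = Counter(source for source, _ in results)
    top = counts.most_common(k)
    return sum(n for _, n in top) / len(results)


def leaning_by_source(results):
    """Count each distinct source once, grouped by its leaning."""
    unique_sources = set(results)
    return Counter(leaning for _, leaning in unique_sources)


def leaning_by_article(results):
    """Count every article, so prolific sources weigh more."""
    return Counter(leaning for _, leaning in results)


if __name__ == "__main__":
    print("Share of articles from the top 2 sources:", top_k_share(RESULTS, 2))
    print("Leaning counted per source: ", dict(leaning_by_source(RESULTS)))
    print("Leaning counted per article:", dict(leaning_by_article(RESULTS)))
```

In this toy data, the count per source is spread fairly evenly around the center, while the count per article piles up in two groups, because two prolific sources account for most of the volume. That is the kind of contrast described above, shown here on made-up numbers only.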

Information, Bias, and Ethics

SH: Google could potentially read your results--and hopefully they have--and then acknowledge that while the distribution of news sources is one thing, the producers and the audience display a bimodal pattern of interest rather than a normal one. If we now place ourselves in Google's shoes, so to speak, is there something that we should do or change? In other words, does Google have an obligation to intervene and try to "correct" this distribution?

EM: In my view, Google does indeed strive to do the right thing by drawing from a large and diverse range of sources.

SH: They're aiming to sample from a more balanced or "normal" distribution when presenting search results?

EM: Yes, that's correct. Google aims to sample from a well-rounded assortment of sources, but there's a limit to how much control it has over the amount and nature of the content those sources produce. It's worth noting that this is all set against the backdrop of the declining news industry. The concentration of news sources--where everyone gravitates towards outlets like the New York Times or CNN--is a growing concern, leaving local and regional newspapers unable to compete in an ecosystem dominated by such powerful players.

Ironically, Google and Facebook, by controlling the online advertising industry, have been part of the reason why local and regional journalism is in decline, having absorbed the advertising revenue that these outlets used to rely on. So, while Google is making an effort to offer balanced content, it's also aware of the need to monitor potential abuses of the prime 'slot' it provides.

For instance, we've seen outlets like Breitbart continuously update or recycle their stories to remain fresh but biased towards their perspective. Over time, Google took measures to restrict the number of news sources displayed, effectively "disappearing" Breitbart from the top stories. However, such decisions often lack transparency. It took Breitbart weeks, with the aid of a search engine optimization company, to realize they were no longer appearing in the search results for certain topics.

Google's task is challenging, especially considering the backlash it received during the Trump years. It was repeatedly accused of being left-leaning or anti-conservative, when the reality was more nuanced. Outlets like Fox News or the New York Post were among the most frequently displayed sources because Google strives to maintain balance in the news sources it presents.

In one of our papers, we theorize that Google seems to be trying to revive the fairness doctrine--a policy that required news outlets to fairly represent various viewpoints. This approach appears to be reflected in Google's algorithms, which aim to balance sources from a variety of political perspectives in each search result.

Sampling Opinion vs. Sampling Truth

SH: I understand that in the realm of politics, there's a certain amount of opinion or principle. And we can argue that for a healthy political environment, these views should have an opportunity to express themselves. However, there's the next layer which, as I know you understand, involves the truth, or the facts about the world. Here it seems that multiple opinions shouldn't really matter. It's about communicating the state of the world or the reality of how things are.

In the context of truth (rather than opinion), how do you think about these kinds of biases? If Google is trying to sample from these different sources as people are asking questions, how should they approach it? To tie it back to what you were discussing earlier about COVID information, people are Googling about COVID vaccines. If you are the search engine, would you sample from a variety of sources across the political spectrum to answer this question? Does that make sense?

EM: That's a great point. For COVID, Google and all the other platforms, once they realized the severity of it, tried to label everything. For instance, the first notification would direct you to the CDC. You still see this on YouTube, where all COVID-related videos have a note below them saying, "For the latest news and information, go to the CDC."

Google tried to do the same. It was showing news from all the sources that were creating news, but its ranking prioritized some of these sources more than others. If the New York Times was constantly writing about COVID, which it was, then you would always see the New York Times, CNN, and Fox News. When something becomes big news, you can be sure that the big news organizations are covering it. So, it's easy for Google, because these news outlets have the highest ranking and will always appear first. Occasionally, Google will show some other sources, but if something becomes big news, the chances of seeing other sources are small.

Google learned its lesson a few years ago. You might remember around 2016, after the election, there was a discussion about who won the popular vote. Trump didn't win the popular vote; Clinton did. But a journalist searched, "Did Trump win the popular vote?" and a completely random blog appeared claiming Trump had won. This got a lot of press because journalists were just realizing the power of Facebook and Google. They were saying, "Wait a second, what these companies are doing seems so important, we have to keep an eye on them." Google became very careful about which news organizations made that list after that.

Going back to your question, these are mostly news organizations that might have a political bias but are generally committed to the truth, whatever they define as truth. These aren't completely unreliable news sources; they just have their political slant towards certain events, which we know can often be problematic. But these are not news sources that spread outright lies.

What's the Deal with ChatGPT?

SH: Let's transition to discussing ChatGPT, the new large language model chatbot from OpenAI that's generating buzz. I would love to hear your perspective. What excites you about this new technology? What do you feel people often misunderstand?

EM: There are a few things to note. Firstly, ChatGPT is a remarkable achievement in computer science. People often question why these tools are created, but consider that all computer scientists learn about the Turing Test during their education. Alan Turing, the father of modern computer science, proposed this test which focuses on whether machines can think. Artificial Intelligence, a branch of computer science, has been tackling this question for decades. We now have technology that can pass the Turing Test and perform amazing feats which, until recently, we believed only humans could do.

However, this is happening in a context where computer science isn't just an intellectual exercise. It is clear that computers are one of the most influential technologies humans have invented. It's pervasive; we cannot run our societies without computers and computation. AI is a way to make or outsource decisions, something we've been doing for decades. People often forget that Google is a form of AI, built by students and researchers. The co-founders of Google were trained by Terry Winograd, one of the fathers of AI who created an NLP system back in the 60s and 70s.

Google is an AI technology, but we've had time to adjust to its capabilities and limitations. Google has also been using large language models, like BERT and others, as part of their search functionality for years. The goal is to improve the understanding of query phrases. Initially, these large language models were designed to do just that--move beyond keyword searching towards understanding semantic meaning.

OpenAI has chosen to create technology that is open and accessible. On the one hand, this is good because it has brought the technology into public discourse, and a healthy dose of concern is necessary with all new technologies, especially those that spread quickly. On the other hand, OpenAI quickly realized that achieving these advanced models requires a significant amount of capital. This has led to large investments and the technology is now no longer as open.

The issues with these new technologies, especially those that are more advanced than others, include who gets access and the unknown extent to which they can be used. The public needs to understand that the developers didn't intend for ChatGPT or GPT-4 to pass the bar exam, or to provide instructions for building a bomb or renovating a house. Those capabilities are emergent properties of these large models that can predict the next word.

SH: It's amazing to see the power of a system that essentially just completes sentences.

EM: Indeed, it is remarkable to witness the evolution from models that could simply understand language to those that can generate language in a manner that surpasses imagination. One important point we often overlook is that these large language models are trained on all human input available on the web--all our conversations and knowledge. We have willingly and freely contributed this knowledge over decades without compensation.

Companies like OpenAI and Microsoft have taken this knowledge to build these amazing systems. However, the fear is how these models will be used both by the companies that own them and the speculators who see an opportunity in an as-yet-unregulated environment. Early adopters of these technologies often seek to benefit from environments where there are still no regulations. Remember the first years of email, when spam was rampant, or the web pre-Google, where search engines showed mostly spam results? It took significant effort to build tools to prevent spam and scams.

We must question how and how quickly we will adapt to this new technology. There is a possibility that the quality of the web will worsen before it improves. Humans may become reluctant to generate content for free on the internet, while others will flood the web with AI-generated content. We might shift towards closed communities, leading to a more fractured and decentralized web. We don't know yet, but there's going to be significant change in response to the introduction of these new models.

What are the biggest data gaps in the field?

SH: Let's move on to the last part of the podcast where we ask our three questions.

First question: What do you see as the biggest data gaps in the field? Are there places where an important question or issue lacks sufficient data, or the data needs to be created?

EM: I'll stick with the topic of the web since that's my area of expertise. Our devices and applications collect a vast amount of raw data--every scroll, mouse movement, or keystroke is recorded. This data is often stored in large data dumps owned by companies like Google and Facebook. But, for us, the users of these technologies, there's little value derived from that data.

There's a shift happening where devices, for example, the iPad, will tell you if you spent more or less time on it this week. But the information is vague and high level. I would love to see the establishment of a data layer on top of the raw data--aggregated, actionable data that is focused on the user.

If we're spending so much time online, it would be beneficial to know exactly what we're doing with that time. Right now, after spending two hours online, it's hard to determine where that time went. It's as though there's an addictive quality to the technology. The first step to combating this would be to have better data that helps us understand how humans are behaving and using technology.

However, this isn't in the best interest of companies like TikTok or YouTube. They don't want you to be aware of how much time you're spending on their platforms. The solution might come from somewhere else, likely open-source software or plugins that can help us track our usage. These tools should be more user-friendly and embedded in our devices.

In my opinion, the biggest data gap is in understanding our online behavior. It's an intentional gap, but hopefully, it will change.

What excites you the most about the future of research?

SH: Second question: What excites you the most about the future of research?

EM: You're likely familiar with Tim Berners-Lee, the inventor of the World Wide Web. In recent years, Berners-Lee has proposed a new idea about a decentralized web where each of us owns our own data. The technology is called Data Pods, and he even created a company a few years ago to support it.

Basically, the idea is about having more control over our data and what we do on the web. This new model won't rely on centralized repositories like Google and Facebook, which extract and then use our data in surveillance capitalism. Instead, we would have numerous small providers which we trust, where we know our data is secure. We could then provide permissions for applications to use our data as needed.

At least in Europe, it seems there is some action towards this. I heard from a colleague that the Flanders government in Belgium, a regional government representing over six million people, has teamed up with Berners-Lee's company to see how they can create an ecosystem for this data. It's very early, and of course, it will need time to gain traction, but I think it's a very exciting direction for research.

Wave the magic wand...

SH: Third question: If you could wave a magic wand, what would you change about the web?

EM: I would change the financing model. If we look at other media, such as television and radio, they had two competing financing models from the beginning. For instance, the BBC in the UK was publicly funded, with the idea that it belongs to the public and is supported by tax money. In the United States, on the other hand, the big media corporations like NBC or CBS were created with private investment and funded by advertising.

The web, however, has never had a public-funding model. We don't even know what it means for the government to invest in search engines or social networks. While we have public libraries, we don't have public online platforms. It would have been interesting to see how the online ecosystem would have looked with different financing models.

Initially, the ethos of the web was about everything being free and open. However, it didn't anticipate how capitalism would eventually take over and transform it into an engine for making money, often at the expense of everyone's well-being. Although there has been some shift with increased subscription models, it would have been interesting to see what public funding could have done for our information ecosystem.


SH: Eni, thank you so much. I really appreciate you taking the time to discuss these topics. It has been truly fascinating. Is there anything else you'd like to share?

EM: Absolutely. If there's one thing I would love to emphasize, it's that if we view the online world as our shared environment, we should treat it with the same care and responsibility as we do our natural environment. Although we haven't done a great job in the past in taking care of our physical environment, we can learn from our failures and apply that knowledge to the online space. We all have a collective responsibility to make the information environment a safe and healthy place. Finding ways to contribute to and uphold the safety and well-being of our online community is a goal worth striving for.