Biases and errors in our tools: how do you cope? Reflections of a newcomer to textual analysis.


Image via sisssou, cc-by 2.0

On February 22, 2014, I had the pleasure of moderating a panel at the NYU Humanities Initiative entitled Using Digital Tools in the Classroom and in Research. (Here are the videos of three of the four panelists’ presentations and the Q&A.)

My first question to the panelists related to biases and errors in the technology we use and how we as scholars and teachers deal with them:

We know that technology is not neutral; it contains within it the biases or limits of the person, people, or cultures who created it. One example is the Google Books corpus: because of the libraries involved in the book scanning, I assume that the Google Books corpus has much better representation of Western writing than that of other cultures. If you don’t have this in mind when using Google’s Ngram Viewer, you could get wildly incorrect results if you assume that “all the books are in Google.” Biases are also inherent in computer algorithms that process data. How do you account for this bias in your own work and how do you teach others to beware?

The next day I continued this conversation with the panelists via email, sharing my own example of technological error in a text analysis tool. Here was my question restated:

In thinking about last night’s conversation it occurred to me that I may have posed one of my questions in an awkward way and I wanted to follow up with you a bit via email. This relates to my first question, about the biases (or just plain errors) inherent in the technology we use and how we need to beware of and question them. The example I gave — Google Books — was probably not a good one. Here’s another example:
In Summer 2013, I went to DHSI to take a text analysis workshop. I brought with me digitized volumes of a library journal from the 1930s and volumes of the same journal from the 2000s. I tried a method called Dunning log-likelihood. This method is used to highlight differences in word frequencies in two sets of texts. In the workshop we were using the SEASR-Meandre Workbench, which has a built-in Dunning log-likelihood tool. Basically you stick your two texts in and the tool spits out the words that are overrepresented in one text compared to the other. So I did just that. And the results showed that in the 1930s articles, the word “e-books” was used more frequently than in the articles from the 2000s.[1] What?!?
If I hadn’t been thinking critically about the tools themselves [and hadn’t understood my field and the corpus as well as I did], I might have taken this result at face value and gone on to draw terribly erroneous research conclusions based on it. Fortunately I knew enough about my subject matter that the error popped out at me right away, and instead of interrogating my corpus, I started interrogating the tool (and the tool’s developers, who were right there).
Hence my question about scholars being careful to deeply question the tools and methods they are using, as well as the data they are analyzing. Your answers to my question focused much more on the data [understandable, since the example I gave related to the Google Books corpus]. But how do you cope with the possibility that the tools you are using may be biased (or error prone) and that the “black box” is sending you down the wrong research highway?

My example is an egregious one; no one in their right mind would believe that librarians talked more about e-books in the 1930s than in the 2000s. However, if the mistake had been less obvious, you can see how I might have been lured into taking the results at face value.
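
For readers who want to see what a Dunning-style comparison actually computes, here is a minimal sketch in Python. It is not the SEASR-Meandre implementation, which I have not seen; it uses the simplified two-term form of the log-likelihood statistic that is common in corpus linguistics, and the function names and the toy sentences are my own inventions for illustration. The point is that the score alone does not tell you which corpus overuses a word, so the sketch reports the direction separately. That direction is exactly what looked wrong in my “e-books” result.

```python
import math
from collections import Counter


def log_likelihood(count_a, total_a, count_b, total_b):
    """Two-term log-likelihood (G2) for one word in corpus A vs. corpus B.

    count_* is how often the word occurs; total_* is the corpus size in tokens.
    This is a simplified form, assumed here for illustration only.
    """
    combined = count_a + count_b
    # Expected counts if the word were used at the same rate in both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def compare(tokens_a, tokens_b):
    """Return (word, score, overused_in) tuples, highest score first."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    rows = []
    for word in set(freq_a) | set(freq_b):
        a, b = freq_a[word], freq_b[word]
        rate_a, rate_b = a / total_a, b / total_b
        if rate_a > rate_b:
            overused_in = "A"
        elif rate_b > rate_a:
            overused_in = "B"
        else:
            overused_in = "same"
        rows.append((word, log_likelihood(a, total_a, b, total_b), overused_in))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy data standing in for the two sets of journal volumes (made up).
tokens_1930s = "the library lends books and the library keeps a card catalog".split()
tokens_2000s = "the library lends e-books and the library licenses e-books online".split()

for word, score, overused_in in compare(tokens_1930s, tokens_2000s)[:3]:
    print(f"{word}\t{score:.2f}\toverused in {overused_in}")
```

Here corpus A is the 1930s text and corpus B is the 2000s text. A tool that reported “e-books” as overused in A would be flipping the direction somewhere, which is the kind of error a hand check like this can surface.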

I know digital humanists, like librarians and archivists, are keenly aware, and wary, of the inherent biases in the data itself: in the archival “silences,” in text corpora, in the data’s structure and organization (selection, presentation, navigation, markup, etc.). Poorly prepared corpora can lead you down a research rabbit hole. For example, a corpus of 17th century texts that also includes the 19th-20th century prefaces to those texts is going to give you some pretty distorted text analysis results.

But here’s my problem as a technically proficient scholar with limited programming skill: how can we peer into and scrutinize the black boxes that are the analytical tools processing our data? The same question holds for text corpora or any other data held in black box systems that preclude examination; in addition to the Google Books example, the HTRC comes to mind. And, assuming you can peer inside, without the specific programming skills to evaluate how a tool was built and look for biases or errors in the code, how can we be sure the tools we’re using aren’t giving us erroneous results? I realize that I had a particularly perplexing experience at DHSI. But as I embark on my first text analysis project, I wonder how I can trust that the tools are not going to do me wrong.

Here are several ways I’ve thought of to evaluate the effectiveness of tools and methods:

  1. Know your corpus: understand what is and isn’t in your data set and how your results may be biased. Does your corpus include ahistorical introductions to texts from another period? Do the texts include end-of-line hyphenation that will leave words out of your analysis? How much OCR error is there? Is there metadata and is it accurate? Will any of these issues have more or less of an effect on your results if your data set is very large or quite small? If you don’t know how to answer these questions, find out. If you can’t find out (e.g., if you can’t examine the corpus), maybe you should be really, really careful about scrutinizing your results and limiting how much emphasis you put on them. As an example of how results can be skewed, here’s an article on a number of these problems with Google’s corpus that will affect results in the Ngram Viewer.
  2. Don’t rely on distant reading (or what Matthew Jockers calls macroanalysis) to replace the kind of insight you glean from close reading. This probably sounds really dumb to people who’ve been doing this kind of analysis for a while. But for newcomers to these methods, the seduction of the interesting results we get from massive data analysis may overpower our critical analysis of the methods themselves. In his article On Distant Reading and Macroanalysis, Matthew Jockers says: “It is the exact interplay between the macro and micro scale that promises a new, enhanced, and perhaps even better understanding of the literary record. The two approaches work in tandem and inform each other. Human interpretation of the ‘data,’ whether it be mined at the macro or micro level, remains essential.”
  3. Duplicate your analyses using multiple tools that do the same or similar things. For example, I should run Dunning log-likelihood analyses in several tools and compare the results to see if they hold up or if something looks fishy.
  4. Follow the leaders: See what tools those who do this kind of scholarship are using and talk to colleagues about their preferred tools and methods. Bamboo DiRT provides a registry of digital research tools used by digital humanists. Which tools are “tried and true”? What are the pitfalls to watch out for? Are there different methods and tools for doing the kinds of analyses you want to accomplish? If so, which ones are better for your project? While digital humanists are a very friendly and welcoming lot, it’s best to see what information is on the Internet before cold calling someone with a question that might be easily answered through online searching. Digital Humanities Questions & Answers is a great site for this: mine the archive for information on your question, and if you don’t find anything, ask a question there. That’s what it’s for! Lisa Spiro’s Getting Started in the Digital Humanities includes great suggestions for finding a community.
  5. To the extent that you can, learn how the tools and methods you are using work “under the hood.” Read about how the tools work to see if they produce the kinds of results that you are looking for. If a tool or method hasn’t been adopted by others, find out why.
  6. Partner with experts: we’re fortunate at NYU to have a Data Services department in the library to which we can turn for help with R and other data analysis tools. If you don’t have such a unit on campus, you could seek out someone with the skills in another department (computer science, statistics, etc.) who might help you get started with your research or partner with you on your project. What incentive would motivate a stats expert or computer programmer to partner with a humanist on a research project? Engage them with interesting computational questions. Collin Jennings spoke about this collaboration at the NYU event mentioned above (from 12:50-14:55 in this video). Lisa Spiro’s Getting Started in the Digital Humanities also has a section on finding collaborators.
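
Item 3 above is the easiest of these to try without much programming. As a rough sketch, suppose each tool lets you export its list of distinctive words along with the corpus each word is supposedly overrepresented in. That export format is an assumption on my part (I don’t know what any particular tool offers), and the function name and sample words below are made up for illustration. A few lines of Python can then flag the words on which two tools disagree:

```python
def compare_tool_outputs(tool_one, tool_two):
    """Compare two {word: corpus_label} mappings exported from different tools.

    Each mapping says which corpus a tool claims the word is overrepresented in.
    The export format is hypothetical; adapt it to whatever your tools produce.
    """
    shared = set(tool_one) & set(tool_two)
    return {
        # Words both tools list, but assign to different corpora: look here first.
        "disagree": sorted(w for w in shared if tool_one[w] != tool_two[w]),
        # Words only one tool considers distinctive.
        "only_in_first": sorted(set(tool_one) - shared),
        "only_in_second": sorted(set(tool_two) - shared),
    }


# Made-up results from two hypothetical tools run on the same pair of corpora.
first_tool = {"e-books": "1930s", "catalog": "1930s", "database": "2000s"}
second_tool = {"e-books": "2000s", "catalog": "1930s", "online": "2000s"}

report = compare_tool_outputs(first_tool, second_tool)
for label, words in report.items():
    print(label, words)
```

A disagreement doesn’t tell you which tool is right, only where to start reading closely and asking questions.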

These are just the ideas I came up with after my disconcerting experience with the SEASR toolkit. I’d be very happy to hear counterexamples, dissenting opinions, and other ideas about how to be more skeptical and think more critically about DH tools and methods.

  1. The Meandre developers confirmed the problem, so I wasn’t doing it wrong.