Transcribing Multilingual Documents in the Digital Age

Last month, I visited the University of Houston as part of the Recovering the US Hispanic Literary Heritage’s new US Latina/o Digital Humanities (#usLdh) Incubator. The scholars in this community have been working at the forefront of scholarly editing for decades by editing, publishing, and digitizing documents that are as multilingual and transnational as the United States.

In doing this work, these scholars have been forced to confront a challenge that is often sidelined in the discussion of digitization. Digitization has brought renewed attention to the difficulties of transcribing and encoding historical texts. But this work has been largely driven by Anglophone, monolingual projects.

As a result, Anglophone, monolingual textuality is hardwired into the systems and processes that we use to transcribe the historical record.

Elvia Arroyo-Ramírez has spoken brilliantly about this in the context of archival processing. We face similar problems when transcribing and encoding texts.

What does it mean for us as digital practitioners to use tools that were designed for a different cultural context? At the #usLdh Incubator, I started to answer this question by tracing a brief history of multilingual transcription. I also spoke about digital transcription options, and some of the challenges of using these tools on multilingual texts.

I’ll summarize some of the key points here, and my slides are online.

Copying Past

Transcription is always transformative, imposing new contexts onto old documents, but transformation is not necessarily deformation or defamation. Transcription can also revive or reveal a text.

The Codex Mendoza

Take the Codex Mendoza, a document believed to have been inscribed by Francisco Gualpuyogualcal and Juan González in New Spain in the 1540s. The first part of the Codex Mendoza is likely a transcription of a Pre-Columbian text, now lost. That text was then transcribed using alphabetic annotations written in Nahuatl, and translated into Spanish. [1]

Page from the Codex Mendoza shows a young man getting his education (top) and a wedding (bottom), with alphabetic Nahuatl transcription. The edge of the Spanish transcription on the opposite page is also visible.

Page from the Codex Mendoza. Image from the Bodleian Library.

Both the pictographic and alphabetic inscriptions are transcriptive. The former moves the text from one page to another; the latter, from one form to another. Both have the potential to remain faithful to an earlier copy.

But only the pictographic document can be a facsimile transcription, representing not just the meaning but its form on the page. And only the alphabetic Nahuatl would have looked like language to the royal Spanish recipients of the copied text.

The original document is lost, but we know that both Gualpuyogualcal and González changed the text as they transcribed it. The pictographic text was produced using European paper and ink, and uses a style of shading introduced by the Spanish. And the Nahuatl transcriptions are sparse and sometimes inaccurate representations of the pictographic text.

Is the alphabetic text sparse because González couldn’t fluently read the pictographic copy? Were the pen and ink chosen to impress the Spanish royalty? In transcription, we see both strategic transformation and misrepresentation.

El Título de Santa María Ixhuatán

Take another example, an alphabetic Nahuatl manuscript known as El título de Santa María Ixhuatán. The historian Margarita Cossich Vielman and linguist Sergio Romero believe that this is partially an alphabetic transcription of a lienzo, a much older document written in pictographic script. [2]

Photograph of a page from the título de Santa María Ixhuatán.

El Título de Santa María Ixhuatán. Image from the Nahuatl/Nawat Project.

But Cossich and Romero think it was written by someone who was not fully literate in pictographic Nahuatl. Where the earlier text used logograms (signs representing a word or phrase) and syllabograms (signs representing syllables), the scribe read them as literal depictions of the objects they pictured.

So in the Título, he writes of a place called Teohuanhuaco, “drawing of a tree, drawing of a priest.” But Cossich and Romero argue that the name was more likely the figurative Teowakwawnawako, “place beside the divine tree,” or Teokwawko, “place of the divine tree.”

Here, we see a scribe using transcription to revive an otherwise lost text. But we also see how his imperfect copying had an impact on our ability to associate this text with geographic locations or to compare it to other histories.

Copying Present

Literacy and legibility informed the ways that colonial texts were copied in the past. These factors continue to impact our ability to consume the texts of the past, and they can have real implications when we begin to treat transcribed texts as historical fact.

While not all transcriptions are political, in the cases described here, shifts in literacy and legibility are directly tied to colonial systems designed to shift power away from indigenous communities.

When we transcribe digitized texts, we enter a similarly interpretive field, and it is worth being attentive to the ways that our transcription processes engage with structures of power.

Many digital project managers use automatic transcription tools to make their scanned documents more searchable. But as I have written elsewhere, automatic transcription tools are not always ‘literate’ in historical orthographies or in languages other than English, which can undermine our ability to treat these texts as reliable objects of historical study. This is especially consequential when we are using transcriptions for corpus analytics.
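To make this concrete, here is a minimal sketch (a hypothetical example, not drawn from any particular OCR engine) of how a single recognition error, dropping a diacritic, silently merges distinct words and skews corpus statistics:

```python
from collections import Counter

# A toy "correct" transcription of a Spanish word list.
# Note that año (year) and ano are different words.
original = ["año", "año", "ano", "señor", "senor"]

# Simulate an OCR engine that cannot produce ñ: every ñ becomes n.
ocr_output = [w.replace("ñ", "n") for w in original]

# In the faithful transcription, "año" and "ano" are counted separately.
print(Counter(original))   # año: 2, ano: 1, señor: 1, senor: 1

# In the OCR output, the distinction is gone: "ano" absorbs "año".
print(Counter(ocr_output)) # ano: 3, senor: 2
```

Any frequency analysis, keyword search, or topic model built on the second list is now measuring an artifact of the tool, not a fact about the text.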

The same is true in the case of crowd-sourced transcription. The manual transcription of documents in languages other than English can be more costly, because it requires more specialized knowledge. There is also an extra emotional cost to working with colonial documents that inscribe everyday processes of power and violence.

And what about the tools that we use to inscribe these texts? Most programming languages, command-line operations, and web interfaces for transcription are written in English, and many are not configured to handle diacritics or special characters. Again, this creates an additional barrier to editing multilingual documents, one that is the product of systemic bias rather than any inherent difficulty. As Brook Lillehaugen has explained in the case of the Ticha project, it is unreasonable to ask Zapotec-speaking collaborators to learn English so they can work in TEI.
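Even tools that nominally accept diacritics can fail in subtle ways. A sketch of one common pitfall: the same accented character can be encoded in Unicode in two different ways, and a naive string comparison (the kind many transcription interfaces perform when searching or deduplicating) will treat visually identical words as different:

```python
import unicodedata

# 'José' written with a single precomposed é (U+00E9)...
composed = "Jos\u00e9"
# ...and 'José' written as plain 'e' plus a combining acute accent (U+0301).
decomposed = "Jose\u0301"

# Visually identical on screen, but a naive comparison fails:
print(composed == decomposed)  # False

# Normalizing both strings to the same Unicode form (here NFC) fixes it:
nfc = unicodedata.normalize("NFC", decomposed)
print(composed == nfc)         # True
```

Tools built and tested only on English text rarely hit this problem, so they rarely normalize; transcribers working in Spanish, Nahuatl, or Zapotec inherit the resulting failures.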

Knowledge of English should not be a prerequisite for the digital editing of historical texts.


Copying Future

What can we do to address these transcription challenges? One thing is to ensure that scholars working with non-English texts (including those working outside the Latin alphabet) are at the table during the development of both automatic and crowdsourced transcription tools.

This has been prioritized by teams like FromThePage, a service for designing crowdsourced transcription projects. It is also the focus of groups like the Historical and Multilingual OCR project at Northeastern University, which is trying to set the agenda for the automatic transcription of multilingual texts.

Another priority must be to think seriously about labor and citation practices. In the case of colonial transcription, historians have been working to recover the names of the indigenous scribes who wrote the documents, recovering their intellectual contribution to the historical record.

We have the same responsibility in the case of digitization. Digitization is intellectual labor, often done by students and staff in contingent positions. Highlighting this contribution helps ensure that workers know their labor is valued, and it better reflects the reality of collaborative digitization work.

It also helps our institutions understand this work, so we can make a case for better financial compensation.

It is only by consistently and publicly acknowledging the value of this work that we will be able to ensure that it is taken seriously by funding institutions, by hiring committees, and in cases of tenure and promotion.

And that is a necessary step in ensuring that the digitization of the historical record remains, at its foundation, intellectually sound and culturally sensitive.


Hannah Alpert-Abrams is a CLIR Postdoctoral Fellow in Data Curation and Latin American Studies at the University of Texas at Austin. Find her online at

[1] See Frances Berdan and Patricia Rieff Anawalt, The Codex Mendoza. University of California Press (1992). There is plenty to say about the Spanish translations of this text, too, but I don’t address it here.

[2] Margarita Cossich Vielman and Sergio Romero. “Lienzos prehispánicos y el título de Santa María Ixhuatán, Guatemala.” Asociación para el Fomento de los Estudios Históricos en Centroamérica, 2016. Online.