Charles Cooney, Glenn Roe, and Mark Olsen, “The Notion of the Textbase: Design and Use of Textbases in the Humanities”
Our goal in this essay is twofold. First, we define and describe a selection of humanities textbases, paying particular attention to the design principles that underlie their structure and inform their use. Then, keeping the collections we are most familiar with in mind—those built for text analysis—we outline the scholarly approaches to textbases that we have historically supported at ARTFL (Project for American and French Research on the Treasury of the French Language) and that will continue to inform our decisions as we make use of new algorithm-based analytic tools. Essentially, we argue that traditional modes of humanistic scholarship must be at the forefront of our minds as we build and improve on standard text-analysis tools like PhiloLogic, the word-search-and-retrieval software developed at ARTFL to query its textbases. The text-mining and machine-learning applications we have begun implementing will not entirely replace traditional philological approaches to text analysis. Instead, as textbases continue to grow, we believe these new tools can offer necessary alternatives to scholars beyond simple word search, allowing them to discover and explore unseen connections among texts, trace the evolution of ideas over large collections and historical periods, or identify the contextual and intertextual relations of individual texts to any number of other works.
What Are Textbases?
Textbase, or textual database, is a term that denotes a coherent collection of semi- or unstructured digital documents.1 These documents, as textual artifacts, can come from literature, periodicals, historical or philosophical writings, legislative proceedings, or any other realm that produces written discourse. The data in these textbase collections differs significantly from the more structured information, such as bibliographic or machine-readable cataloging (MARC) records—those used in online library catalogs, for instance—and other indexes generally associated with the field of information retrieval.
Textbases all cohere in some manner. Unlike their cousins, large repositories of digitized texts like Project Gutenberg or Google Books, textbases exist as corpora of documents assembled around some specific unifying principle. A textbase might gather together thematically or generically similar documents, such as a collection of nineteenth-century American newspapers or Chadwyck-Healey’s English Poetry database, which is made up of nearly 40,000 poems written by English poets across a span of 1,300 years. The ARTFL Project’s primary online offering, FRANTEXT, consists of more than 3,500 texts of various types, all in French. Built initially to provide a large body of word examples for the French dictionary Le trésor de la langue française (TLF), this textbase includes a range of genres, from novels to theatrical works, essays, scientific treatises, and correspondence. Or, as The Digital Karl Barth Library, from Alexander Street Press, illustrates, a textbase might represent the entirety of a single author’s writings, attempting to replicate canonical, published versions of texts as closely as possible. This brief list of examples hints at the fact that there are many varieties of textbases that aim to satisfy a range of scholarly desires and that the intellectual aims of textbases are always linked to some fundamental technical considerations. The end use of a textbase will often determine its design, in terms of both the individual documents selected for the corpus and the style and degree of document encoding.
A rough comparison of two fields that create and make use of textbases, electronic publishing and digital humanities, can demonstrate the technical considerations involved in the formation of a textbase. The primary goal of electronic publishing is to make texts available in a digitized format for access through the Internet. As a result, much of the encoding in individual documents serves the basic purposes of demarcating bibliographic information, formatting text, or segmenting documents for ease of online navigation. Most documents, such as scholarly articles or even new literary works, are born-digital.
Textbases in the digital humanities—those we are most familiar with at ARTFL and make our main focus in this essay—are built to enable text-centered scholarly research. In the humanities, scholars are primarily concerned with the specifics of language and meaning in context, or what is in the works. These collections tend to represent specific linguistic or national traditions, genres, or other characteristics reflecting disciplinary concerns and scholarly expertise. In contrast to the born-digital documents of electronic publishing, textbases in the digital humanities are generally retrospective collections built with an emphasis on canonical works in particular print traditions.
In this context, the encoding in documents is meant to enrich the scholar’s engagement with the collection. At the document level, texts are partitioned into component sections, such as chapters and paragraphs, following specific schemes, scholarly traditions, or approaches. The metadata of these textbases—that is, the bibliographic or structural information associated with each text—can describe individual documents, document sections, and even individual words in far greater detail than in many other digital collections. Yet this rough comparison between electronic publishing and digital humanities is a touch simplistic. To some degree, most textbases are hybrids of the publishing and scholarly models, with varying amounts of concern for the display and representation of text and the capability to support computer-aided research. The aforementioned Karl Barth collection, for example, produced by Alexander Street Press in association with the print publisher Theologischer Verlag Zürich, contains electronic versions of the print volumes of the theologian’s work. These digitized texts are intended to be faithful representations of the original print editions. And yet they have also been carefully encoded in standard TEI (Text Encoding Initiative) notation to enable scholars to search Barth’s writings for individual biblical references by book, chapter, and verse, for example.2
The varied approaches to text encoding often reveal intellectual biases about the purpose of textbases and the ways in which computers and computer-assisted text analysis can best help scholarly activity in the humanities. Computers can serve as text and media viewers and can also be used to perform calculations on large bodies of data at incredibly high speeds.3 At a basic level, favoring one of these attributes over the other will influence all the subsequent decisions concerning textbase design, encoding, and almost anything else. Over the past two decades, several different text-markup schemes have come into being, built either on a minimal amount of markup for display purposes or on some higher level of encoding to enable more-complex search and retrieval functions. Much software had to be written to handle these new forms of textual data. Although this essay is not the place to enter into detail about text markup, software development, or the often-contentious confrontation between the two, we give a sketch of encoding styles and textbase design as a way to draw out the practices they seek to serve. On the minimalist end of the spectrum, Project Gutenberg produces texts in what it calls Plain Vanilla ASCII. These texts, free to be assembled into textbases or studied individually, contain essentially no encoding, so that they can be easily downloaded, read, and used with the most basic and readily available digital tools. As the project site states, “99% of the hardware and software a person is likely to run into can read and search these files” (Hart). Easy access and readability are the order of the day for Project Gutenberg. Conversely, texts used in digital humanities corpora, because they are intended to support scholarly research, require some level of encoding, whether they use older schemes such as 1980s SGML (standard generalized markup language) or the currently popular TEI-XML standard.
The idea behind encoding is to associate various kinds of metadata with individual texts or to demarcate particular aspects of those texts so that the user can refine search parameters and manipulate sections of the text during research. Texts either contain bibliographic information or have it associated with them. Texts almost always also contain other metadata that partitions them into discrete objects, often replicating the sections of a printed version of the document. Beyond this abstraction of the internal structure, texts can be encoded effectively with as much metadata as the scholar designing the textbase chooses. The encoding guide produced by the Brown University Women Writers Project describes encoding as “a way of formalizing and externalizing the structures in a text; a way of adding further information to the text that interests us; a meta-text that comments on, interprets, or extends the meaning [of] a text” (“Introduction”). What and how much the scholar wants to get out of the digitized text determine what is put into the encoding scheme.
We thus present three varieties of textbase design as they pertain to encoding—textbases with heavily encoded internal TEI-XML notation, those that rely on relational databases such as MySQL to manage their metadata, and those with only a minimal level of markup or encoding. These distinctions are not arbitrary, however, since each variety intends to support a different kind of engagement with texts: the first is motivated by a close attention to the texts and an effort to preserve them as historical documents; the second, a more sociological approach to understanding texts and their creation and dissemination; and the third, a philological and intellectual-historical approach that examines word use, both synchronically and diachronically, over large-scale collections. We focus particular attention on this last variety, not only because we build textbases and software for this approach but also because our research falls under this category.
Varieties of Textbases Today
Transcription and Textuality
Textbases devoted to text preservation and representation on a large scale generally seek to promote research on texts as literary or cultural artifacts. Such textbases have deep roots in electronic publication. Projects building these corpora want to make rare archival materials more accessible or offer electronic critical editions. Yet these textbases have historically tended to rely on rich document encoding because, at their core, they mean to make available for research particular aspects of individual texts. We should point out that, since 1995, this approach to textbase design has evolved as digital humanists have debated what is technically possible and even desirable in electronic text collections.4 Nevertheless, scholars compiling critical editions, as well as archivists, have been attracted to digitization in part because of the kinds of metadata and amount of annotation that can be inserted into electronic versions of texts. The editors of the Brown University Women Writers Project noted the role of encoding for such activities over a decade ago:
[T]ext encoding—and particularly standards like SGML and the TEI—makes it possible to create large electronic resources of previously inaccessible material, such as rare archival texts by women authors. At the same time, it makes possible an integration of responsible editing practice with the new technologies of distribution and access, such as the internet and the World Wide Web. (Flanders, “Text Encoding”)
The editors intended their editing and encoding practice to reflect their “commitment to preserving the integrity of the text as an object which circulated in the culture of a particular historical moment” (Flanders, “Women Writers Project”). For this theory of textbase design, internal encoding was often seen as the key to making electronic texts as alive as they could be, with all the advantages that the digital medium could offer. Accordingly, TEI headers—the supplemental data placed at the beginning of each document—contain high-level metadata about the texts, their authorship, publication information, and so on. Low-level particularities of texts—whether variants or typographic errors—can also be demarcated, thus creating an abstract, self-contained internal structure that can potentially represent multiple facets of a single text.
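The two levels of encoding described here, header metadata and low-level textual particularities, can be illustrated with a minimal sketch. It assumes a heavily simplified, namespace-free document whose content is invented for illustration; the element names (teiHeader, titleStmt, sic) follow TEI conventions, but a real TEI file would carry a namespace and a much fuller header. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented, simplified TEI-style document (no namespace, placeholder content).
SAMPLE = """<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Example Poem</title>
        <author>Jane Example</author>
      </titleStmt>
      <publicationStmt><date>1650</date></publicationStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="poem">
        <l>The <sic>quikc</sic> brown fox</l>
      </div>
    </body>
  </text>
</TEI>"""


def read_header(xml_string):
    """Return the high-level metadata stored in the header."""
    header = ET.fromstring(xml_string).find("teiHeader")
    return {
        "title": header.findtext("fileDesc/titleStmt/title"),
        "author": header.findtext("fileDesc/titleStmt/author"),
        "date": header.findtext("fileDesc/publicationStmt/date"),
    }


def typographic_errors(xml_string):
    """Return every reading the encoder marked as an apparent error."""
    return [el.text for el in ET.fromstring(xml_string).iter("sic")]


print(read_header(SAMPLE))
print(typographic_errors(SAMPLE))
```

Once such features are tagged, retrieving them is trivial; the open question raised in the following paragraphs is whether the search software built on top of the markup actually exposes them to researchers.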
In the mid-1990s, before hyperlinks on the Web became ubiquitous, markup was considered an innovative means of creating hypertexts out of texts. The aim was to build wide-ranging electronic critical editions by hand. Document encoding was intended to allow digitized texts literally to point beyond themselves to secondary versions and multiple sources. The critical edition textbase would thus mimic the way texts seem to exist in the real world, in relation to other versions of themselves and to entirely separate works. John Lavagnino wrote at the time:
What a number of scholars have imagined a hypertext edition would be, then, is a system that would store both electronic texts and images of all the versions of the works in question, and offer the ability to display parallel texts of any two versions, as either images or electronic texts. Every hypertext edition in progress will do more than this, but this is the core of them all. (“Reading”)
A textbase in this scheme is a locus of research constructed around a single document, whose ancillary data and hyperlinks replicate its real-world context electronically. New data could continually be added to the textbase to enrich the sense of the original text. The textbase in this context, which “needn’t ever stop growing and changing” (Lavagnino, “Reading”), would thus have a life span similar to that of the text itself.
One such critical edition textbase that collects and makes accessible a wide range of materials is the Rossetti Archive.5 This archive, as the home page states, “facilitates the scholarly study of Dante Gabriel Rossetti,” bringing together his complete textual and pictorial works and other contextual materials. In addition to digitized versions of texts and manuscripts, which contain some degree of editorial matter, the archive includes “high-quality digital images of every surviving documentary state of DGR’s works.” First and foremost, this archive, like all archives, centralizes research materials. Digitization makes looking at those materials easy for anyone with Web access. Metadata makes it searchable: “[a]ll documents are encoded for structured search and analysis.” In this case, “structured search” means that a user can search and retrieve documents by title, genre, and date; search documents for names of people in roles such as printer, author, publisher, and so on; and execute word searches. These search capabilities, driven by standard metadata and named-entity tags, suggest that the editors, if they have tried to annotate low-level textual quirks, have not actually rendered them searchable for the moment. In this context, document images provide the only direct means of seeing the original text and its variants. These few limitations of the Rossetti Archive thus reflect some of the general limitations of this type of textbase.
Textbases developed with the idea of replicating textual materiality have often had difficulty meeting the analytic needs of general humanities research. Even with the effort to standardize encoding schemes such as the TEI, tagging options continue to grow as scholars mark up more and more textual idiosyncrasies. To take advantage of the rich document encoding for research purposes, systems designed explicitly to handle XML are needed. But many of these tools, such as the Lucene search and indexing engine, are still not able to effect full-text word search and analysis with any degree of sophistication. The Rossetti Archive, for example, runs on a modified version of the Lucene platform. Searching the XML name tags and metadata in this collection works rather smoothly. But the word-search feature of this software cannot find two specific words separated by a particular number of words or in any particular proximity to each other, cannot generate and display collocations, and seems unable to delimit word search by document metadata.6 In the end, large amounts of internal tagging may serve little more than display purposes. Lavagnino noted intellectual disjunctions of this sort at a time when digital humanities scholars were just beginning to develop systems to query electronic corpora. He complained, “it is striking how many proposals for hypertext editions fail to mention even the rather ordinary function of text searching, although, mundane as it is, it is one of the most valuable things that can be done with electronic texts” (“Reading”). The danger here is that textbase design can sometimes meet the needs of scholarship in only a circumscribed way.
A digitized text can be filled with a host of metadata and critical annotation, but the researcher might not actually be able to do much with it, let alone with the text itself. Julia Flanders describes the difficulty: “The challenge posed by projects like the Rossetti Archive is how to capture bibliographic codes and textual materiality in ways which can represent them usefully to readers: not simply as visible cues but as data which can give one leverage on the text” (“Electronic Textual Editing”). Projects such as these are potentially valuable resources for scholars and students, but the issue is whether the technology built into them, and on which they are built, does “give leverage on the text.” Despite their benefits, textbases designed with textuality and representation in mind have not necessarily been able to provide a satisfactory platform for studying the broad meanings and implications of texts and text collections on a large scale. As Lavagnino points out, “in order to use the TEI approach you need also to believe in transcription” (“Electronic Textual Editing”). Indeed, the notions of transcription and textuality tend to become secondary considerations for word-based modes of digital research.
The two basic varieties of textbases on the spectrum of textbase types—one focusing on data retrieval, the other on representation—are committed to word searching but support different kinds of research. The following examples are designed to emphasize data retrieval. In the interest of full disclosure, we have designed and built both these types of textbases around the word-search-and-retrieval engine PhiloLogic, developed at ARTFL. And while PhiloLogic is based on traditional philological approaches to textuality, as the software’s name suggests, this association only tells part of the story when considering these particular kinds of mixed-mode and text-analysis textbases.
Word searching can be paired with multiple levels of metadata to provide the functional framework of textbases designed for both text analysis and document retrieval. Collections built using this model allow researchers to find texts and study word use, for example, by sociological or historical parameters. If textbases geared toward text representation tend to rely on heavy internal encoding, this style of textbase uses lightly encoded documents and manages metadata externally, through relational databases like MySQL. Here, word search is the core technical functionality. But the associated metadata in this mixed model provide the leverage for refining word searches and permitting complex document retrieval tasks.7 Individual texts generally have a minimum of internal encoding beyond object hierarchies, which allow for the contextualization of word-search results from book level down to paragraph level. For studying texts and word use through the prism of context, researchers realize the payoff from the relational side of these systems. Word-search results across the entire corpus can be delimited by any combination of values in the bibliographic metadata (e.g., author, title, date, genre). Beyond the bibliography, the distinguishing feature of these textbases is the variety of supplemental indexing data that can be knit together to enable complex document searching with a range of other related data.
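The pairing of a word index with externally managed metadata can be sketched in a few lines. What follows is an illustration under stated assumptions, not a description of how PhiloLogic or any Alexander Street Press product is implemented: it uses Python’s built-in sqlite3 module in place of MySQL, and the table layout, function names, and sample documents are all invented for the example.

```python
import re
import sqlite3


def build_textbase(docs):
    """Load (author, title, year, genre, text) tuples into an in-memory database.

    Hypothetical schema: one table of bibliographic metadata and one table
    recording every word occurrence with its position in the document.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, author TEXT, "
        "title TEXT, year INTEGER, genre TEXT)"
    )
    conn.execute("CREATE TABLE words (doc_id INTEGER, position INTEGER, word TEXT)")
    for author, title, year, genre, text in docs:
        cur = conn.execute(
            "INSERT INTO documents (author, title, year, genre) VALUES (?, ?, ?, ?)",
            (author, title, year, genre),
        )
        tokens = re.findall(r"\w+", text.lower())
        conn.executemany(
            "INSERT INTO words VALUES (?, ?, ?)",
            [(cur.lastrowid, i, w) for i, w in enumerate(tokens)],
        )
    return conn


def search(conn, word, genre=None, year_from=None, year_to=None):
    """Find occurrences of a word, optionally delimited by metadata values."""
    sql = (
        "SELECT d.title, w.position FROM words w "
        "JOIN documents d ON d.id = w.doc_id WHERE w.word = ?"
    )
    params = [word.lower()]
    if genre is not None:
        sql += " AND d.genre = ?"
        params.append(genre)
    if year_from is not None:
        sql += " AND d.year >= ?"
        params.append(year_from)
    if year_to is not None:
        sql += " AND d.year <= ?"
        params.append(year_to)
    sql += " ORDER BY d.id, w.position"
    return conn.execute(sql, params).fetchall()


# Invented sample documents.
DOCS = [
    ("A. Author", "Letters", 1862, "letter", "The battle was long. The battle ended."),
    ("B. Writer", "A Play", 1901, "drama", "No battle here, only words."),
]
conn = build_textbase(DOCS)
print(search(conn, "battle", year_to=1900))
```

Supplemental tables of the kind described for mixed-mode textbases would enter such a design as further tables joined to the documents table, so that a word search could be delimited by, say, a battle or an author’s occupation as easily as by genre or date.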
In collaboration with the electronic publisher Alexander Street Press, ARTFL has developed a number of mixed-mode textbases that feature abundant metadata. The subject matter of these textbases includes historical letters and diary collections as well as corpora of dramatic works, film scripts, and general works of literature. The rich metadata provide different perspectives into texts and enable document retrieval over a range of assigned fields (e.g., title, author, date). In addition to standard kinds of bibliographic metadata for texts—author, date, and genre—the editors of each textbase selected other metadata fields to extend direct descriptions of the text, such as the social context of composition; occupation or social status of the author; names of battles, tribes, flora, or fauna described; characters in drama; or theater and performance company information (see Olsen, “Rich Textual Metadata”). These textbases also make use of extensive document-indexing schemes—by subject, geographic area, historical context, or personal events, among other controlled descriptions of a document’s contents. Word searching can be delimited and documents found by any of these metadata values. Supplemental tables help create a sense of context for the documents in this sort of textbase. Tables of chronological events, for example, may have data or descriptions regarding battles or encounters with native tribes, and tables of authors often include number of children or number of marriages, military rank, or age at the time of death. These tables can contain heterogeneous data types that might not typically be considered textual metadata, such as winners and losers of Civil War battles, commanding generals, and even the number of combatants and losses. Such supplemental tables often link to lists of individual documents that are related to the extended metadata.
As the products offered by Alexander Street Press illustrate, the mixed mode of textbase design can serve the demands of both text research and electronic publishing. Like the extra materials in archival textbases, the ancillary tables provide external resources that enrich the study of texts in the collection. But because the texts are meant to serve searching rather than replication, they do not rely as heavily on internal encoding. For certain textbases that seek explicitly to present a version of a printed text, documents have links to page images. For example, texts in the collections of Reformation-era religious writings include digitized versions of archival editions.8 Trying to encode the many peculiarities of these hard copies would be dauntingly expensive and would offer little payback in terms of functionality; furthermore, page images allow researchers to get a better sense of the original document’s appearance than extensive tagging ever could.
Textbases for Textual Analysis
The final type of textbase in this brief survey is the variety we build and design at ARTFL for computer-assisted textual analysis. Like the Alexander Street Press products, our collections run under the current implementation of the PhiloLogic software package.9 The main technical difference between these closely related kinds of textbases is that ours have much less metadata. This simple design distinction perhaps reveals a deeper divergence in intellectual underpinnings and ideas concerning how textbases are intended to be used. The textbases we create and the main software engine we have built support a customary mode of research: the study of word use informed by the long tradition of philology and historical semantics. Because we support text- and word-centered research, the metadata in ARTFL-style corpora is drawn from text headers and basic internal document encoding so that it is directly associated with individual texts and text objects. There are no supplemental tables because these textbases do not seek to provide the same sort of rich contextualization for individual texts or the collection as a whole as mixed-mode textbases do. To take Lavagnino’s comment entirely out of context, our textbases are intended to fulfill the needs of “mundane” kinds of humanities research, but on a massive scale and quickly.
ARTFL’s main corpus, FRANTEXT, was designed to provide researchers of the French language and scholars of French literature a collection with which to study a wide range of word usage over time. The textbase consists of more than 3,500 documents, ranging from classic works of French literature to various kinds of nonfiction prose and technical writing. Genres include novels, verse, theater, journalism, essays, correspondence, and treatises. Subjects include literary criticism, biology, history, economics, and philosophy. In most cases standard scholarly editions were used in converting the text into machine-readable form, and the data contains page references to these editions. The number, variety, and historical range of texts in this textbase allow researchers to go well beyond a narrow focus on a single work or author. The collection permits both rapid exploration of single texts and intertextual research of a kind virtually impossible without the aid of a computer.
Researchers can study word or concept use across this collection with varying degrees of sophistication by using PhiloLogic’s many search and reporting options. Searches may be run for a single word, a word root, a word with wildcard characters, a prefix, a suffix, or a list of words created by the user. For example, one can search for “tradition.*” in texts published during the seventeenth century, sorting the results by words to the left and then right of the keyword (as shown in fig. 1).10 The resulting report gives the user a quick sense of the commonplace phrases or most closely related terms. These terms can then be combined in further searches, such as “tradition.* OR coûtume.*,” and the results can be sorted similarly. Through a combined search, users can examine differences in the usage of tradition and custom, either in particular time periods or over the entire collection. In many cases a researcher will wish to investigate not merely the occurrences of single words or lists of words but also the context in which these words occur in texts. Thus PhiloLogic allows the user to search for logical combinations of words and word lists. Figure 2 shows a KWIC (keyword in context) report of “tradition.* OR coûtume.* AND nation.* OR local.* OR provin.* OR patrie.* OR pays OR villag.*” within three words. In this example, the Boolean operator AND combines a series of geographic expressions with the original terms, tradition and custom. As seen in the screen images, PhiloLogic allows the user to navigate to larger contexts for close reading as required.11 For researchers interested in examining word use in literary or linguistic contexts, results can be displayed line by line, with the search word highlighted or centered. Users may also hyperlink to the full context of any result, examining many sentences or paragraphs around the target of the search.
For reference purposes, PhiloLogic displays the bibliographic information and page number for each keyword occurrence.
Figure 2. KWIC report of “tradition.* OR coûtume.* AND nation.* OR local.* OR provin.* OR patrie.* OR pays OR villag.*” within three words. This report can be used to examine the localization of tradition and custom (ARTFL-FRANTEXT).
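The combined search behind figure 2 can be approximated with a short sketch. It is only an illustration of the general technique (wildcard matching on word forms plus a proximity window), not PhiloLogic’s own algorithm: the function names are hypothetical, the sample sentences are invented, and the window here ignores sentence boundaries.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (accents preserved)."""
    return re.findall(r"\w+", text.lower())


def matches_any(token, patterns):
    """True if the whole token matches at least one regular-expression pattern."""
    return any(re.fullmatch(p, token) for p in patterns)


def proximity_kwic(text, first, second, within=3, width=4):
    """Keyword-in-context lines for (first terms) AND (second terms).

    A token matching any pattern in `first` is reported when a token matching
    any pattern in `second` occurs at most `within` words to its left or right.
    Each result is a (left context, keyword, right context) tuple.
    """
    tokens = tokenize(text)
    lines = []
    for i, tok in enumerate(tokens):
        if not matches_any(tok, first):
            continue
        neighbours = tokens[max(0, i - within):i] + tokens[i + 1:i + 1 + within]
        if any(matches_any(t, second) for t in neighbours):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text, for illustration only.
TEXT = (
    "Les traditions du pays sont anciennes. "
    "La coûtume de ce village diffère beaucoup. "
    "Ailleurs, une tradition reste sans rapport."
)
FIRST = ["tradition.*", "coûtume.*"]
SECOND = ["nation.*", "local.*", "provin.*", "patrie.*", "pays", "villag.*"]

for left, keyword, right in proximity_kwic(TEXT, FIRST, SECOND):
    print(f"{left:>30} [{keyword}] {right}")
```

The third sentence contains a form of tradition but no geographic term within three words, so it is excluded from the report.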
For broader-scoped analysis of word use, textbase search results can be sorted according to their frequency of occurrence by title, author, and year. Frequencies can be displayed raw or computed per ten thousand words. Figure 3 shows the occurrences of “tradition.*” sorted by its relative frequency by author. Similarly, relative frequencies of terms can be broken down over time periods, as shown in figure 4. Collocation tables can also be generated to give the user a sense of a word’s most common collocates to the left and right (see fig. 5). A very useful approach is the comparison of collocation tables between time periods or authors, allowing users to get a sense of how word use and context change over time or between authors. The user can choose to filter out the most common terms in a corpus to eliminate overfrequent collocations and also determine the number of words, between one and ten, that separate the collocated terms. For the user to get a grasp of how words actually function in sentences, search results can also be displayed by the word’s location in a particular clause, a feature that we call the Theme-Rheme display (see fig. 6).12 The term’s location can be at the front of a clause only; at the front and end of a clause only; or at the front, middle, and end of a clause. Word positions are calculated by looking at the number of words in a clause and determining within what percentage of clause length the word falls: the front of a clause is the first 35% of words, the end of a clause is the last 10%, and the middle of a clause is the remaining 55%.
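The three calculations described in this paragraph (frequency per ten thousand words, left and right collocates, and clause position) can each be sketched briefly. The 35% and 10% thresholds come from the description above; everything else, including the function names, the exact rounding of positions, and the sample phrase, is an assumption made for illustration and should not be read as PhiloLogic’s implementation.

```python
from collections import Counter


def per_ten_thousand(hits, total_words):
    """Relative frequency: occurrences per 10,000 words of text."""
    return 10000 * hits / total_words


def collocates(tokens, keyword, span=5, stopwords=frozenset()):
    """Count words within `span` positions to the left and right of a keyword.

    Words listed in `stopwords` are skipped, mirroring the option to filter
    out the most common terms in a corpus.
    """
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        left.update(t for t in tokens[max(0, i - span):i] if t not in stopwords)
        right.update(t for t in tokens[i + 1:i + 1 + span] if t not in stopwords)
    return left, right


def clause_zone(index, clause_length, front=0.35, end=0.10):
    """Classify a word (0-based index) as front, middle, or end of its clause."""
    position = (index + 1) / clause_length  # assumed convention: 1-based share
    if position <= front:
        return "front"
    if position > 1 - end:
        return "end"
    return "middle"


# Invented sample phrase, already tokenized and lowercased.
TOKENS = "la tradition de l église et la tradition des pères".split()
STOP = frozenset({"la", "de", "l", "et", "des"})
left, right = collocates(TOKENS, "tradition", span=3, stopwords=STOP)
print(left.most_common(), right.most_common())
print(per_ten_thousand(25, 50000))
print(clause_zone(0, 20), clause_zone(10, 20), clause_zone(19, 20))
```

Comparing two collocation tables, for two periods or two authors, then amounts to comparing the counters returned for each subset of the corpus.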
Our example of tradition illustrates how a textbase like FRANTEXT, running under PhiloLogic, can enable research on the evolution of word use and cultural concepts. We note that the concept of tradition changed dramatically in France from the seventeenth to the twentieth centuries. Derived from the Latin traditio, a form of handing property down from one generation to another, the meanings and associations of this term have developed differently in English and French. In the early modern period, French uses of tradition were clearly related to the authority of the Catholic Church; the most frequent collocates include saint(es), église, écriture, pères, apostolique, juïfs, and apostres. Indeed, the notion of tradition was so closely related to the authority of the church that secular or non-Catholic traditions were given a distinct word, traditive, which was defined in the first editions of the Dictionnaire de l’Académie Française as having the same senses as tradition, but never in conjunction with religion. The authority of tradition was by and large maintained during the eighteenth century, when it was recast as a form of knowledge that could be verified, most remarkably by the philosophes. Adjectival (traditionel, traditionelle) and adverbial (traditionnellement) forms also began to appear, suggesting that tradition moves from an identifiable object to a descriptive characteristic. In the nineteenth century, debates regarding the authority of tradition recede, and the term takes on a distinctly cultural meaning. The most frequent nineteenth-century collocates for tradition in FRANTEXT include the following:
- ¶ 34 Leave a comment on paragraph 34 0
- religieuses, religieuse, chrétienne, église, culte, croyances
- populaire, peuples, famille, coutumes, locale, souvenirs, moeurs
- lois, france, pays, autorité, française, nationales
- histoire, universelle, historiques, humanité, humaine
Not surprisingly, discussion of tradition appears mainly in works of sociology and history. By the twentieth century, tradition as a French concept had three distinct foci:
- National: française, France, français, nationale, révolutionnaire, politique
- Intellectual: philosophique, art, philosophie, musique, intellectuelle, transcendantale, grecque, littéraire
- Religious: catholique, chrétienne, religion, saint, église, foi
¶ 38 Leave a comment on paragraph 38 0 In addition to supporting such historical research on word use, PhiloLogic permits close analysis of word use in a textbase’s individual works. When a textbase is built under PhiloLogic, certain kinds of internal annotation are extracted from texts that allow highly refined word searching. Dramatic texts, for example, can be tagged to enable analysis of individual characters’ speeches. The University of Chicago’s collection Perseus under PhiloLogic has this kind of search capability. Tags demarcating each character’s speech in a text can contain the character’s name and perhaps even the character’s gender. The individual speech acts are thus associated with character name and any other metadata. The user can then execute a word search across the entire collection but filter results by that low-level metadata (character name), along with any document-level metadata that might help differentiate character names that reoccur in multiple texts. Among the Greek texts, a user can search the character speeches of Medea, or Μήδεια for the Greek-language texts, by entering her name in a character-subdivision-search field and specific terms in the word-search field. A search for the term “ἐγὼ” and the character Μήδεια generates thirteen instances where Medea, in the Euripides play of the same name, refers to herself in the first person. PhiloLogic thus mimics the way XML engines permit text searching within discrete document objects but takes an entirely different approach to the problem. The underlying advantage here is the degree of flexibility and speed that PhiloLogic brings to text search and analysis.
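The idea of filtering a word search by low-level metadata such as speaker name can be illustrated with a small sketch. It is not PhiloLogic's data model or API: the `Speech` record, the `search` function, and the placeholder play titles and words are all hypothetical, chosen only to show how speech-level and document-level fields combine in a single query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speech:
    """One tagged speech: document-level title plus speaker-level metadata."""
    title: str
    speaker: str
    words: tuple


def search(speeches, word, speaker=None, title=None):
    """Return (title, speaker, position) for each hit, filtered by metadata."""
    hits = []
    for speech in speeches:
        if speaker is not None and speech.speaker != speaker:
            continue
        if title is not None and speech.title != title:
            continue
        for position, token in enumerate(speech.words):
            if token == word:
                hits.append((speech.title, speech.speaker, position))
    return hits


if __name__ == "__main__":
    # Placeholder data, invented for illustration only.
    corpus = [
        Speech("Play A", "Medea", ("i", "am", "wronged", "i", "say")),
        Speech("Play A", "Jason", ("i", "meant", "no", "harm")),
        Speech("Play B", "Medea", ("i", "return")),
    ]
    print(search(corpus, "i", speaker="Medea"))
    print(search(corpus, "i", speaker="Medea", title="Play A"))
```

The optional `title` argument stands in for the document-level metadata that disambiguates a character name recurring across several texts.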
Such uses of ARTFL-style textbases center on word search and retrieval. This word-centered mode of research has been, and always will be, vital for humanities scholars.13 However, as text-digitization efforts multiply and textbases grow exponentially, scholars will need new kinds of tools that can help them navigate collections and discover how the texts in them relate to one another. Machine learning, text mining, similarity algorithms, and document clustering offer significant promise to help identify meaningful linguistic patterns across very large collections of texts. These new tools will serve as heuristic aids to scholars, facilitating a methodological transition from words to works. As we at ARTFL conduct experiments with such tools, we believe that this developing functionality must still be able to satisfy certain traditional scholarly principles. For one, though it will be possible to conduct textual research on ever-larger scales, textual scholars will need to be able to navigate from that broad, general view to particular textual instances. Access to individual texts and text sections will remain essential because, after all, textual scholars study texts. We are also using these tools to try to answer somewhat customary questions concerning relations between texts, intellectual filiations, and so on. What is new is the method of discovery and evidence gathering to support such claims. Computers can only help researchers arrive at insights; researchers must still judge the worth of result sets. Moreover, these new tools will help users make worthwhile discoveries only if humanities textbases are built with cohesion and thought. The final examples we give here illustrate ways we and other groups are conducting algorithm-based research on humanities textbases.
From Words to Works
¶ 40 Leave a comment on paragraph 40 0 One of the greatest, and most recent, challenges in the design and implementation of textbases in the humanities stems from the extraordinary increase in the size and coverage of available digital collections. In the 1990s, collections of primary texts contained perhaps several thousand encoded documents, frequently reflecting the most important works of a national or cultural tradition. Today, large-scale digitization efforts, from Google Books to major collaborative projects such as EEBO-TCP, have happily resulted in collections of a far greater scope and breadth.14 Digital collections have also begun to include significant runs of newspapers and periodicals, popular literature, legal and political documentation, as well as more personal manuscript materials such as letters and diaries. At the simplest level, searching these large textbases can point researchers to the particular documents they might wish to consult. Yet traditional text-analysis tools typically associated with textbases in the humanities, such as PhiloLogic, tend to be overwhelmed by the new and massive amounts of data and results. Searching for even relatively rare terms in modern textbases can generate many thousands of occurrences. No matter how nicely they are presented in concordances or summarized in other kinds of reports, results can be difficult for a user to grasp. In addition, word-search engines are not able to tell users what particular texts or passages of texts are about. Studies of individual terms or concepts are certainly useful, but it is difficult to generalize from these small details of usage to the broad issues that interest many researchers in the humanities, such as the role of gender or ethnicity in writing, the identification of general discursive topics, and intertextuality. For such studies, different kinds of tools are needed.
¶ 41 Leave a comment on paragraph 41 0 Researchers and digital humanities projects have begun to adapt techniques in text or data mining and machine learning, approaches developed primarily in computer and information science, for use on humanities textbases. From Franco Moretti’s method of “distant reading” to the collaborative and multi-institutional digital environment developed by the MONK project, digital humanities scholars are beginning to turn their attention increasingly to data mining and other machine-learning techniques. Underlying these efforts is the assumption that the machine can help discover meaningful patterns of word usage or metadata that may help summarize or give a more general perspective of the materials contained in a collection—or of an entire literary genre. Our experiments in the automatic discovery of textual patterns revolve around three related machine-learning techniques, which we call predictive, comparative, and clustering or similarity.15 All three approaches rely on the computer to identify potentially interesting patterns, with certain degrees of flexibility.16 The examples below illustrate the potential of machine-learning approaches to enhance linguistic research using textbases, as well as to help scholars navigate and discover connections between texts in large-scale digital collections.
Predictive and Comparative Classification
¶ 42 Leave a comment on paragraph 42 0 Predictive and comparative classification are both forms of supervised machine learning, wherein the systems attempt to identify patterns of features (words, groups of words, or other characteristics) that are associated with a set of predetermined classes. These classifiers are designed to distinguish documents by categories. But the means by which they do so can also help researchers get a statistically informed sense of authors’ stylistic choices and word use.
¶ 43 Leave a comment on paragraph 43 0 Predictive machine learning is perhaps the most familiar to many people, since modern e-mail systems already use this technology to distinguish spam (unwanted junk mail and solicitations) from desired e-mail. These systems take input data representing two or more classes of documents labeled by human beings and attempt to build a statistical model that will predict the classes found in an unseen set of documents. In the problem of spam / not spam, the algorithm learns from human labeling (often interactively in many e-mail systems) that certain features, such as the word “Viagra” or phrases like “unique opportunity,” are often associated with spam. Incoming e-mails are evaluated against the features associated with previously identified spam, judged to be either spam or not spam, and then processed accordingly.
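A predictive classifier of the kind described here can be sketched with a small multinomial naive Bayes model. Naive Bayes is used only because it is compact and widely taught; the essay does not say which algorithm any given e-mail system uses, and the function names and the four training messages below are invented for the illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Build per-class word counts from (label, tokens) pairs labeled by people."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for label, tokens in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def predict(model, tokens):
    """Return the most probable class for an unseen document."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(doc_counts[label] / total_docs)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Toy training set, invented for illustration.
    training = [
        ("spam", ["unique", "opportunity", "viagra"]),
        ("spam", ["viagra", "offer"]),
        ("not spam", ["meeting", "tomorrow"]),
        ("not spam", ["seminar", "reading", "tomorrow"]),
    ]
    model = train(training)
    print(predict(model, ["viagra", "opportunity"]))
    print(predict(model, ["meeting", "tomorrow"]))
```

The same two functions accept any number of classes, which matters for the more complex classification schemes discussed next.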
¶ 44 Leave a comment on paragraph 44 0 Such simple binary distinctions are uncommon in humanistic research. Rather, one can have numerous complex classification schemes associated with various textbases. In some of our recent work, we have examined the classification of human knowledge found in the eighteenth-century Encyclopédie of Denis Diderot and Jean Le Rond d’Alembert. Using predictive machine-learning techniques, we first attempted to categorize automatically the Encyclopédie’s unclassified articles—some 15,000 of the 75,000 articles were left unlabeled. In further experiments, this same predictive model was applied to other eighteenth-century documents for the sake of epistemological comparison (see Horton, Morrissey, Olsen, Roe, and Voyer). As a kind of supervised learning scheme, predictive classification models are thus built on data derived from contemporary classification systems (such as the Encyclopédie’s editorial classes of knowledge) that can then be used to identify the subject matter of documents from the same time period and language. The importance of using epistemological models derived from contemporary texts for classification purposes should be emphasized here, so that scholars can guard against imposing their own modern terminologies and organizational principles on historical collections.
In humanities research we rarely need to predict classes, since we may already know all the relevant bibliographic classifications in a given textbase. The gender of authors in a large collection of documents is usually identified by human beings and integrated as metadata associated with an author’s work or collection of works. Comparative machine-learning techniques can help scholars test the utility of a textbase’s given classifications for analytic purposes. In recent work, we used several textbases of American and French literature to examine the gender of playwrights and of their characters. The point was to see whether algorithms could find stylistic features in texts that reveal differences in the way men and women write and in the way authors depict male and female characters. We trained the classifier on documents divided by the author’s gender and then used this model to reclassify the same data set.17 The classification results show us how accurately the system can identify documents (or speeches) by the author’s gender, the features (words, lemmas, groups of words known as n-grams, etc.) that are most distinctive of each class, and the instances (documents, characters) that are misclassified. Applied to a large textbase of African American drama, we found that algorithms can identify authors’ and characters’ gender with an accuracy of between 70% and 80%. These results are in line with similar studies on the same subject (see Argamon, Cooney, Horton, Olsen, and Stein; Argamon, Goulain, Horton, and Olsen).
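The comparative procedure (train on part of an already-labeled collection, reclassify the held-out part, then inspect accuracy and misclassified instances) can be sketched as follows. The classifier is a small naive Bayes model chosen for brevity, not necessarily the one used in the studies cited; the function names, the seven toy "poems," and their period labels are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (doc_id, label, tokens). Returns a naive Bayes model."""
    counts, priors, vocabulary = defaultdict(Counter), Counter(), set()
    for _, label, tokens in docs:
        priors[label] += 1
        counts[label].update(tokens)
        vocabulary.update(tokens)
    return counts, priors, vocabulary


def predict(model, tokens):
    counts, priors, vocabulary = model
    total = sum(priors.values())
    scores = {}
    for label in priors:
        denominator = sum(counts[label].values()) + len(vocabulary)
        scores[label] = math.log(priors[label] / total) + sum(
            math.log((counts[label][t] + 1) / denominator)
            for t in tokens if t in vocabulary)
    return max(scores, key=scores.get)


def cross_validate(docs, k):
    """Hold out each of k folds in turn; return accuracy and the errors."""
    folds = [docs[i::k] for i in range(k)]
    errors, correct = [], 0
    for i, held_out in enumerate(folds):
        training = [d for j, fold in enumerate(folds) if j != i for d in fold]
        model = train(training)
        for doc_id, label, tokens in held_out:
            guess = predict(model, tokens)
            if guess == label:
                correct += 1
            else:
                errors.append((doc_id, label, guess))
    return correct / len(docs), errors


if __name__ == "__main__":
    # Invented toy data: p7 is labeled "18c" but uses "19c" vocabulary.
    poems = [
        ("p1", "18c", ["raison", "vertu", "nature"]),
        ("p2", "19c", ["ame", "reve", "nuit"]),
        ("p3", "18c", ["vertu", "raison", "gloire"]),
        ("p4", "19c", ["nuit", "ame", "larmes"]),
        ("p5", "18c", ["nature", "gloire", "raison"]),
        ("p6", "19c", ["reve", "larmes", "nuit"]),
        ("p7", "18c", ["ame", "nuit", "reve"]),
    ]
    accuracy, errors = cross_validate(poems, k=len(poems))
    print(round(accuracy, 2), errors)
```

The list of errors is often the more interesting output: it names the documents whose vocabulary sits closer to another class than to the one assigned by the metadata.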
¶ 46 Leave a comment on paragraph 46 0 Classifier failures and anomalies stemming from comparative classifications may be particularly interesting to humanities scholars who are curious to find out why particular texts or authors deviate from norms. Classifying the ARTFL collection of French poetry by time period (pre- and post-1800), for example, resulted in an accuracy rate of classification of over 95%. The most significant errors came from the classification of all five volumes of André Chénier’s works, which the algorithm claimed to belong to the nineteenth rather than the eighteenth century. This result brings to mind Victor Hugo’s assertion that Chénier was one of the first modern (i.e., Romantic) French poets.18 We also found several interesting classification errors when comparing American and non-American playwrights, particularly those that resulted from an incorrect assignment of nationality by the editors. The plays of the Montserratian Edgar White were all correctly classified as non-American save for one, When Night Turn Day, which is set in the Bronx. Similarly, the American author Joseph Walker was correctly classified in nine out of ten cases, except for The Lion Is a Soul Brother, a play set in “rural/jungle West Africa” (a classification category from Alexander Street Press’s Black Drama database). Examining misclassified instances in a generally successful classification task may thus serve to provoke new questions about the traits of given texts that make them unique when compared with similar texts. These errors in classification can, in many cases, lead scholars to uncover interesting outliers at the margins of an otherwise known classification scheme.
Clustering and Similarity
¶ 47 Leave a comment on paragraph 47 0 Clustering systems, another broad class of learning algorithms that fall under the category of unsupervised machine learning, have vast potential to be heuristic aids in large textbases. These algorithms work by identifying documents or parts of documents that are most similar to each other, instead of beginning from a set of preidentified classes. The clusters of similar documents can therefore aid in the navigation of large textbases by automatically identifying broad discursive topics. We have recently used a clustering algorithm to add information on the topics identified in thousands of unclassified nineteenth-century newspaper articles, allowing users to find articles about justice, commerce, or health and hygiene. Because of the groupings these machine-learning tools generate, the results of these applications can be embedded in the general structure of a textbase to expand user interaction with documents. Humanities textbases can have suggestion links, as on commercial Web sites like Amazon or Netflix. Moreover, textbase administrators can use these sorts of systems to suggest documents that may be similar to a passage an individual user is examining or to find similar passages that are drawn from other documents.
¶ 48 Leave a comment on paragraph 48 0 There are many statistical approaches to similarity and a variety of ways in which to use the information generated by these tools. Many search schemes are based on mathematical models such as vector space. This model measures the similarity of a search query (typically input by the user) to each instance in a textbase, selecting those that are most similar. Measuring document similarity can lead to many interesting applications beyond simple search and retrieval. Users may identify documents that are most similar as a way to discover unexpected and unforeseen connections. In the Encyclopédie, for example, the article “Gnomonique” is calculated as highly similar to a diverse range of articles dealing with the town of Woolsthorpe (“Wolstrope” in French), clock making, another town called Tylehurst, and Saturn and several other planets. In this case, the article on gnomonique—from gnomon (γνώμων), the pin or raised element of a sundial—describes various types of sundials that depend on the movement of celestial bodies. The modern geography article on Wolstrope, the birthplace of Isaac Newton, is in fact a biography of Newton, with long discussions of his mathematics, mechanics, and astronomy. Similarly, the entry on Tylehurst, the birthplace of William Lloyd, includes a long exposition of his work on the history of the celestial calendar. When the degree of similarity is calculated, this relative measurement can then draw attention to potentially related passages and documents across multiple domains and text categories.
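The vector-space model can be illustrated with cosine similarity over raw word counts. This is a deliberately simplified sketch: production systems usually weight terms (for example with tf-idf), and the article names and word lists below are invented stand-ins, not the text of the Encyclopédie.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_tokens, documents):
    """Score every document against the query, most similar first."""
    query = Counter(query_tokens)
    scored = [(cosine(query, Counter(tokens)), doc_id)
              for doc_id, tokens in documents.items()]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Invented toy articles for illustration.
    articles = {
        "gnomonique": ["cadran", "soleil", "astre", "ombre"],
        "horlogerie": ["horloge", "cadran", "roue"],
        "botanique": ["plante", "fleur"],
    }
    for score, name in rank(["cadran", "soleil"], articles):
        print(f"{name}: {score:.3f}")
```

Passing the full word list of one article as the "query" turns the same function into a document-to-document similarity measure of the kind that links “Gnomonique” to articles in other categories.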
Measuring similarities between documents or parts of documents can help track the relatedness of works. In the case of already-classified documents, such as the Encyclopédie, one can use measures of similarities between articles to establish a nearest-neighbor classification scheme, a kind of predictive classifier based on more direct measures of similarity. This approach may be better suited to many humanistic applications than other similarity measures, since it is more sensitive to smaller classes and functions well with complex or less orderly ontologies, or classification schemes. Another technique, topic modeling, which clusters first and classifies later, also works well in the humanities. With this approach, an algorithm groups the most similar articles into one of a set number of groups and extracts the features (e.g., words) that are most characteristic of that group. After examining these features and the contents of each group, the user can assign each cluster a label that will convey some sense of the overall content of that cluster. Such post hoc classification may be more sensitive to the specifics of a collection but may not be as generalizable to cross-database searching, since the clusters will be specific to that collection. Finally, measures of similarities may be applied across documents for analytic purposes. In recent work, we measured the similarity of the articles in the Encyclopédie against articles in the Jesuit Dictionnaire de Trévoux, one of the Encyclopédie’s predecessors and its chief intellectual rival, and found that a little over 5% of all the articles in the Encyclopédie were “borrowed” from the Jesuit dictionary (see Allen et al.).
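A nearest-neighbor classification scheme of the sort mentioned above can be sketched by letting the most similar labeled articles vote on the class of an unlabeled one. The similarity measure (cosine over raw counts), the value of `k`, and the toy articles and class names are assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbor_label(tokens, labeled, k=3):
    """Assign the majority class among the k most similar labeled articles."""
    target = Counter(tokens)
    scored = sorted(((cosine(target, Counter(words)), label)
                     for label, words in labeled), reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Invented toy articles with editorial classes, for illustration only.
    labeled = [
        ("Astronomie", ["planete", "orbite", "astre"]),
        ("Astronomie", ["astre", "lune", "orbite"]),
        ("Horlogerie", ["horloge", "roue", "ressort"]),
        ("Horlogerie", ["pendule", "roue", "horloge"]),
    ]
    print(nearest_neighbor_label(["orbite", "astre", "saturne"], labeled))
```

Because each decision depends only on a handful of neighbors, a class with few members can still attract new articles, which is the sensitivity to smaller classes noted above.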
Text Alignment—Discovering Intertextuality and Similarity
¶ 50 Leave a comment on paragraph 50 0 The alignment of similar passages, a technically simple function compared with the approaches examined above, can quite effectively find borrowings and other forms of acknowledged or unacknowledged reuse of texts.19 We have found that techniques borrowed from the field of bioinformatics to identify similar parts of DNA can also be used to identify similar passages in textbases. This approach is based on the identification of matching clusters of n-grams—sequences of several words or lemmas with function words removed. Thus Rousseau’s famous declaration in Du contrat social, “L’homme est né libre, et partout il est dans les fers. Tel se croit le maître des autres, qui ne laisse pas d’être plus esclave qu’eux,” rendered as trigrams (n-grams with n = 3), with short and function words removed and accents and case flattened, would look like this:
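The preprocessing just described (flatten case and accents, drop short and function words, then take overlapping three-word sequences) can be sketched as follows. The minimum word length of four and the small stoplist are assumptions made for the illustration, so the trigrams it prints may differ from those produced by the authors' own system.

```python
import re
import unicodedata

# Assumed, illustrative stoplist; a real system would use a fuller list.
FUNCTION_WORDS = frozenset({"dans", "etre", "plus", "pour", "avec", "sans"})

ROUSSEAU = ("L’homme est né libre, et partout il est dans les fers. "
            "Tel se croit le maître des autres, qui ne laisse pas "
            "d’être plus esclave qu’eux,")


def flatten(text: str) -> str:
    """Lowercase the text and strip accents (combining marks)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def trigrams(text: str, min_length: int = 4):
    """Return overlapping trigrams of the content words, joined by underscores."""
    words = [w for w in re.findall(r"[a-z]+", flatten(text))
             if len(w) >= min_length and w not in FUNCTION_WORDS]
    return ["_".join(words[i:i + 3]) for i in range(len(words) - 2)]


if __name__ == "__main__":
    for gram in trigrams(ROUSSEAU):
        print(gram)
```

Under these assumptions the sentence reduces to nine content words and therefore seven trigrams, beginning with homme_libre_partout.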
By setting various parameters, the alignment algorithm can bridge variations, insertions, deletions, and data errors. The result is a very flexible notion of what constitutes textual similarity. As an example, our sequence-alignment system identified the following passages as similar despite the considerable insertions, word-order differences, and orthographic variations:
¶ 53 Leave a comment on paragraph 53 0 She locks her lily fingers one in one. “Fondling,” she saith, “since I have hemmed thee here Within the circuit of this ivory pale, I’ll be a park, and thou shalt be my deer; Feed where thou wilt, on mountain or in dale: Graze on my lips; and if those hills be dry, Stray lower, where the pleasant fountains lie.” Within this limit is relief enough. . . .
—Shakespeare, Venus and Adonis (1593)
Dra. I pray you sir help vs to the speech of your master.
Pre. Ile be a parke, and thou shalt be my Deere: He is very busie in his study. Feed where thou wilt, in mountaine or on dale. Stay a while he will come out anon. Graze on my lips, and when those mounts are drie, Stray lower where the pleasant fountaines lie. Go thy way thou best booke in the world.
Ve. I pray you sir, what booke doe you read?
—Markham, The Dumbe Knight: A Historicall Comedy . . . (1608)
¶ 60 Leave a comment on paragraph 60 0 We have found that this highly generalizable technique works well on very large textbases, in various languages, and for several different kinds of problems. The most obvious application is the identification of borrowed or shared passages. The French revolutionary theorist Jean-Paul Marat, for example, appears to have borrowed the following passage without attribution in Les chaînes de l’esclavage (1792):
¶ 61 Leave a comment on paragraph 61 0 prévaut d’un silence qu’il empêche de rompre, ou des irrégularités qu’il fait commettre, pour supposer en sa faveur l’aveu de ceux que la crainte fait taire, et pour punir ceux qui osent parler
¶ 63 Leave a comment on paragraph 63 0 prévaut du silence qu’il les empêche de rompre, ou des irrégularités qu’il leur a fait commettre, pour supposer en sa faveur le vœu de ceux que la crainte a fait taire, ou punir ceux qui osent parler
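The matching step can be illustrated on the two passages just quoted by listing the trigrams they share. This sketch covers only exact trigram matching: it does not implement the gap-bridging parameters of a full sequence aligner, and the stoplist and minimum word length are assumptions made for the example.

```python
import re
import unicodedata

# Assumed, illustrative stoplist of function words.
STOPLIST = frozenset({"pour", "leur", "ceux", "dans", "plus"})

PASSAGE_A = ("prévaut d’un silence qu’il empêche de rompre, ou des "
             "irrégularités qu’il fait commettre, pour supposer en sa faveur "
             "l’aveu de ceux que la crainte fait taire, et pour punir ceux "
             "qui osent parler")
PASSAGE_B = ("prévaut du silence qu’il les empêche de rompre, ou des "
             "irrégularités qu’il leur a fait commettre, pour supposer en sa "
             "faveur le vœu de ceux que la crainte a fait taire, ou punir "
             "ceux qui osent parler")


def flatten(text: str) -> str:
    """Lowercase the text and strip accents (combining marks)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def trigrams(text: str, min_length: int = 4):
    """Overlapping trigrams of content words, short and stoplisted words removed."""
    words = [w for w in re.findall(r"[a-z]+", flatten(text))
             if len(w) >= min_length and w not in STOPLIST]
    return ["_".join(words[i:i + 3]) for i in range(len(words) - 2)]


def shared_trigrams(first: str, second: str):
    """Trigrams of `first`, in order, that also occur in `second`."""
    in_second = set(trigrams(second))
    return [gram for gram in trigrams(first) if gram in in_second]


if __name__ == "__main__":
    matches = shared_trigrams(PASSAGE_A, PASSAGE_B)
    print(len(matches), "of", len(trigrams(PASSAGE_A)), "trigrams shared")
    print(matches)
```

The trigrams that fail to match cluster around the one substituted phrase (l’aveu versus le vœu); an aligner that tolerates a small gap between matching runs would join the runs on either side into a single aligned passage.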
¶ 64 Leave a comment on paragraph 64 0 We have also applied this technique quite effectively to uncorrected OCR (optical character recognition) data sets and have even used contemporary translations—such as eighteenth-century French translations of David Hume’s Essays—as a means to isolate cross-linguistic similarities.
¶ 65 Leave a comment on paragraph 65 0 Similar passage identification is a general approach that can be tailored to enhance a broad variety of digital humanities applications and textbases. We believe that scholars can use similarity tools to discover shared passages, borrowings, plagiarisms, and other forms of text recycling, whether alluded to in the source data or not. Indeed, even in cases where source documents have clear indicators of a borrowing or a direct citation, this approach can significantly improve the manner in which these relations are linked from text to text. Instead of parsing a reference and link using citation data or outside referencing schemes—which can be highly variable, inconsistent, and typically keyed on page numbers or other arbitrary attributes—hypertextual links can be identified and contextualized using the alignment techniques outlined above.
¶ 66 Leave a comment on paragraph 66 0 Because of the rapid expansion in size and coverage of modern textbases, the examination of new kinds of algorithms and techniques that allow researchers to work effectively with these resources has become a necessity. Machine-learning and text-mining techniques that use the computer as a pattern-detection device can be used in a wide variety of ways. As we have outlined above, machine learning can be used to understand and assign classification systems, to support comparisons of many hundreds or thousands of documents at once, and to identify different kinds of similarities over very large collections or across multiple textbases. While much of this recent work in digital humanities is still experimental, there is considerable effort to bring these kinds of advanced approaches into newer generations of textbases. Searching, navigating, and analyzing emergent large-scale textbases will increasingly be mediated by machine intelligence, either explicitly by user instruction or selection or implicitly through the design of the systems.
¶ 67 Leave a comment on paragraph 67 0 We have outlined approaches to building textbases as well as the various kinds of research that they support. It is clear that textbases will continue to expand in scope and coverage as the availability of digitized texts grows exponentially. We expect, however, that textbases will always reflect the domain-specific expertise of humanities scholars and textbase administrators. Curatorial projects will continue to build textbases that are very similar to those that exist today. They will therefore require specialized language or period teams, specific encoding, and dedicated technical support. Services like Google Books, HathiTrust Digital Library, and similar efforts will provide access to huge heterogeneous collections, but these database aggregators do not provide sufficient tools or narrow enough collections of texts to support the needs of most humanities researchers. In the future, new textbase systems will allow scholars to conduct more focused research by querying, cohesively, distinct text collections across the entire network. This kind of cross-textbase searching will begin to include text mining and machine learning, which will enhance the researcher’s ability to make sense of result sets from these federated collections. As a result, this distributed model of textbase design, enhanced by machine-learning algorithms, will require continued collaboration between humanists, computer scientists, and researchers in fields as far-flung as bioinformatics. Textbase design has evolved in the past several decades to address modern technologies and both traditional and experimental critical concerns. It is precisely this dialectic between technological progress and critical inquiry that will continue to drive textbase design in the years to come, reflecting the persistent needs and innovations of both fields.
¶ 68 Leave a comment on paragraph 68 0 1. The term textbase came into being in the 1990s and is to some degree outmoded. Textual database and simply database are just as commonly used today. For the purposes of this essay, however, we chose to keep the term textbase in play, since it successfully captures the text-based databases we are describing.
¶ 69 Leave a comment on paragraph 69 0 2. The Text Encoding Initiative (TEI) is a consortium that develops and maintains a standard for the representation and encoding methods of text in digital form. Over the years, the TEI encoding guidelines have become the most widely accepted standard for text encoding in the digital humanities.
¶ 70 Leave a comment on paragraph 70 0 3. Buzzetti and McGann, for example, write about the ability of computers to “simulate” textual phenomena. At ARTFL, we tend to be more interested in harnessing computational power for textual research.
¶ 71 Leave a comment on paragraph 71 0 4. Flanders sketches the basics of this evolution in the introduction to her essay on the Women Writers Project (“Electronic Textual Editing”). Today, heavy encoding as a means of representing the original text has given way to page images.
¶ 72 Leave a comment on paragraph 72 0 5. For a sense of the archive’s theoretical underpinnings, see Buzzetti and McGann (McGann is the editor of the Rossetti Archive). They state, “One advantage digitization has over paper-based instruments comes not from the computer’s modeling powers, but from its greater capacity for simulating phenomena—in this case, bibliographical and socio-textual phenomena.”
¶ 73 Leave a comment on paragraph 73 0 6. For example, word search and structured searching—that is, search that relies on metadata—cannot be combined in the Rossetti Archive. Note, however, that Lucene offers a great deal of promise for research that is not based on word search and retrieval. We will soon begin using its ranked-relevancy retrieval feature for experiments on collections of newspapers.
¶ 74 Leave a comment on paragraph 74 0 7. The reason for employing a mixed mode, from the standpoint of functionality, is partly practical: the technologies driving relational databases and word search are both well developed. Joining them allows for flexible, fast, and efficient research over a large body of electronic texts.
¶ 75 Leave a comment on paragraph 75 0 8. The Digital Library of Classic Protestant Texts and The Digital Library of the Catholic Reformation are accessible only by subscription. See the index page for religious products on the Alexander Street Press site.
¶ 76 Leave a comment on paragraph 76 0 9. Alexander Street Press does not offer all PhiloLogic’s standard text-analysis tools to their subscribers. Their products allow word-search refinement by the means we have discussed, but not by those we implement for more rigorous linguistic analysis.
10. In regular expressions, the period (.) is the wildcard for any single character. Combined with the asterisk operator (.*), it matches any sequence of zero or more characters. So our example “tradition.*” would return “tradition, traditionem, traditions, traditionelle, traditionalisme, traditionellement,” and so on. These examples come from Olsen’s research on the evolving conception of tradition in French (“Handing It Down”).
¶ 81 Leave a comment on paragraph 81 0 14. EEBO-TCP (Early English Books Online–Text Creation Partnership) began in 1999 as a cooperative relationship between ProQuest, the university libraries of Michigan and Oxford, and the Council on Library and Information Resources (CLIR) to convert 25,000 books from ProQuest’s EEBO image product into fully searchable, TEI-compliant SGML/XML texts.
¶ 84 Leave a comment on paragraph 84 0 17. We effected multiple iterations of training on one part of the data (for male and female) and then predicted on the “unseen” portion, a process commonly known as cross-validation.
¶ 85 Leave a comment on paragraph 85 0 18. In an 1820 article on Lamartine’s Méditations poétiques, Hugo described Chénier as “un romantique parmi les classiques” (“a Romantic among the classicists”; qtd. in Estève 35; our trans.).
Allen, Timothy, Charles Cooney, Stéphane Douard, Russell Horton, Robert Morrissey, Mark Olsen, Glenn Roe, and Robert Voyer. “Plundering Philosophers: Identifying Sources of the Encyclopédie.” Journal of the Association for History and Computing 13.1 (2010): n. pag. Web. 29 Aug. 2012. <http://hdl.handle.net/2027/spo.3310410.0013.107>.
Argamon, Shlomo, Charles Cooney, Russell Horton, Mark Olsen, and Sterling Stein. “Gender, Race, and Nationality in Black Drama, 1850–2000: Mining Differences in Language Use in Authors and Their Characters.” Digital Humanities Quarterly 3.2 (2009): n. pag. Web. 1 Dec. 2009. <http://www.digitalhumanities.org/dhq/vol/3/2/000043/000043.html>.
Argamon, Shlomo, Jean-Baptiste Goulain, Russell Horton, and Mark Olsen. “Vive la Différence! Text Mining Gender Difference in French Literature.” Digital Humanities Quarterly 3.2 (2009): n. pag. Web. 1 Dec. 2009. <http://www.digitalhumanities.org/dhq/vol/3/2/000042/000042.html>.
Buzzetti, Dino, and Jerome McGann. “Electronic Textual Editing: Critical Editing in a Digital Horizon.” Text Encoding Initiative. TEI, n.d. Web. 1 Dec. 2009. <http://www.tei-c.org/About/Archive_new/ETE/Preview/mcgann.xml>.
Estève, Edmond. Études de littérature préromantique. Paris: Champion, 1923. Print.
Flanders, Julia. “Electronic Textual Editing: The Women Writers Project: A Digital Anthology.” Text Encoding Initiative. TEI, n.d. Web. 1 Dec. 2009. <http://www.tei-c.org/About/Archive_new/ETE/Preview/flanders.xml>.
———. “Text Encoding.” Editorial Methodology and the Electronic Text. Brown University Women Writers Project. Brown U, 1996. Web. 29 Aug. 2012. <http://www.wwp.brown.edu/research/publications/presentations/NASSR1996/TextEncoding.html>.
———. “Women Writers Project.” Editorial Methodology and the Electronic Text. Brown University Women Writers Project. Brown U, 1996. Web. 29 Aug. 2012. <http://www.wwp.brown.edu/research/publications/presentations/NASSR1996/Argument.html>.
Halliday, M. A. K., and Christian M. I. M. Matthiessen. An Introduction to Functional Grammar. London: Arnold, 2004. Print.
Hart, Michael. “The History and Philosophy of Project Gutenberg.” Project Gutenberg. Project Gutenberg, 1992. Web. 1 Dec. 2009. <http://www.gutenberg.org/wiki/Gutenberg:The_History_and_Philosophy_of_Project_Gutenberg_by_Michael_Hart>.
Horton, Russell, Robert Morrissey, Mark Olsen, Glenn Roe, and Robert Voyer. “Mining Eighteenth Century Ontologies: Machine Learning and Knowledge Classification in the Encyclopédie.” Digital Humanities Quarterly 3.2 (2009): n. pag. Web. 1 Dec. 2009. <http://www.digitalhumanities.org/dhq/vol/3/2/000044/000044.html>.
Horton, Russell, Mark Olsen, and Glenn Roe. “Something Borrowed: Sequence Alignment and the Identification of Similar Passages in Large Text Collections.” Digital Studies / Le champ numérique 2.1 (2010): n. pag. Web. 1 Jan. 2012. <http://www.digitalstudies.org/ojs/index.php/digital_studies/article/view/190/235>.
“Introduction.” Encoding Guide for Early Printed Books. Brown University Women Writers Project. Brown U, 2007. Web. 1 Dec. 2009. <http://www.wwp.brown.edu/research/publications/guide/html/introduction.html>.
Lavagnino, John. “Electronic Textual Editing: When Not to Use TEI.” Text Encoding Initiative. TEI, n.d. Web. 1 Dec. 2009. <http://www.tei-c.org/About/Archive_new/ETE/Preview/lavagnino.xml>.
———. “Reading, Scholarship, and Hypertext Editions.” TEXT: Transactions of the Society for Textual Scholarship 8 (1995): 109–24. Scholarly Technology Group. Web. 1 Dec. 2009. <http://www.stg.brown.edu/resources/stg/monographs/rshe.html>.
Moretti, Franco. Graphs, Maps, Trees: Abstract Models for a Literary History. New York: Verso, 2005. Print.
Olsen, Mark. “Handing It Down: A Survey of the Use of Tradition in French and English from the Sixteenth to the Twentieth Centuries.” ALLC/ACH Conference. U of Tübingen. 25 July 2002. Address.
———. “Rich Textual Metadata: Implementation and Theory.” ALLC/ACH Conference. U of Tübingen. 25 July 2002. Web. 30 Aug. 2012. <http://docs.google.com/View?id=ddj2s2rb_47dwqtmvhq>.
Unsworth, John. “Scholarly Primitives: What Methods Do Humanities Researchers Have in Common, and How Might Our Tools Reflect This?” Symposium on Humanities Computing: Formal Methods, Experimental Practice. King’s College, London. 13 May 2000. University of Illinois. Web. 1 Dec. 2009. <http://people.lis.illinois.edu/~unsworth/Kings.5-00/primitives.html>.