2017-09-07

Thoughts on transcribing historical documents.

When transcribing historical documents, there are a number of potential end goals.

1) A strict transcription. The goal is to maintain all the vagaries of the original document and just make it more readable by using a modern typeface. Next to accessing the original document, this is the most accurate presentation. It's also the hardest to produce: you have to look carefully at each character of the original, fight the urge to say, "Oh, it's that word," and make sure that what you enter is what was actually there. You can't trust the results of your first pass through the document; you have to let it sit a while and then make a second, and maybe even a third, pass. Then there's formatting the results. You can just have a text document, or you can try to make it look as much like the original in formatting as possible. The latter is harder to do, but better for those who can access the original, or images of it, as it makes it easier to place the transcription side by side with the original and look back and forth between them.

2) A transcription with spellings regularized for the language of the time. If you are not concerned with the spelling variations inside the original, but do want to read it in the original language, this is best suited to your purpose. Again, you can produce a straight text document, or you can attempt to make the transcription match the original in layout.

3) A transcription/translation into the modern version of the original language. You have to be very careful here to ensure that you capture the meaning in context of each word. Word meanings change over time, and the word used in the original may no longer have that meaning in the modern language, so you have to either replace it with the modern word that most closely matches the original intent or provide a gloss of the meaning of the word at the time the document was written.
The first two methods presume a researcher who is familiar with the original language and with the word meanings at the time the original document was created. The third is for those interested in the intellectual content of the original without having to understand the changes in the language. The first two methods involve no interpretation and no need to really grasp the intended message of the author; it's just typesetting. Well, somewhat more than typesetting if you are working with handwritten documents: you have to be able to read the original script, sometimes that's very difficult, and it isn't made any better if all you have to work with is a scan of the original.

With the third method, you have to understand what the author was trying to say, so you can translate it for them into modern language, reflecting the changes in word meanings. This is much more intellectually stimulating, and at this point you are becoming an editor, as you try to change the text as little as possible while creating a modern-language version. Punctuation changes. Changes in word meanings require substituting the closest modern word that provides the original meaning in the context of the surrounding words. You're not aiming at a total recasting; as much as possible, you want to maintain the original phrasing. You need to be a scholar of the subject the author was writing about, so you can comprehend what he was trying to say and make the changes to the modern language while altering his phrasing and meaning as little as possible. You need to understand the subject both as it was understood when the author wrote the document and as it is now practiced, so that the changes in vocabulary remain true to the original intent while becoming more accessible to the modern practitioner of the subject.
Not all that much recognition is given to the individual who performs the first two types of transcription; it's strictly character recognition in the first, and that plus spelling regularization in the second. In the third, there is interpretation involved, and that interpretation will be debated. But the desire is still for a document that reflects the original author's style and phrasing, following the conventions of the author's time, with as little change as possible while remaining true to the intent of the original words. While some words are changed, it should still read as a period piece, not a modern document. It should read as if only the language had changed, not the writing conventions; it should remain faithful in style as well as meaning. This produces a document of use to those who are not interested in the language of the time the original was written, but are interested in how the information was presented at that time. It retains the intellectual property of the author.

Anything beyond the third is a modern interpretation, a retelling rather than a reformatting. You are creating a derivative work, a modern work based upon the intellectual content of a historical document but written using modern conventions. This is not what I do; I don't understand the subject matter well enough to recast it in modern form.

What I am now trying to do is create distinct documents reflecting the goals outlined above. First, I'm working from documents that use modern European alphabets. While I have access to fonts for Futhark, etc., that's not my primary area of interest, and I have to be interested in the subject matter, or there's no way I'd put up with the drudgery and monotony of this process.
In theory, I start by producing a character-by-character transliteration from the historical typeface to a modern typeface. I generally use Times New Roman, as it's the typeface we are most used to reading, although I'm considering switching to Georgia. The really tricky bit is retaining the original specialized non-alphanumeric symbols. This really comes into play with Elizabethan printed materials, which use a special symbol to represent "on" at the end of a word. It's not a character of any alphabet, it's a specialized printer's space-saving symbol, and there is no Unicode code point for it; it is close enough to ♁ (U+2641) that I've decided to use that in its stead, with a note explaining the substitution. I'm also using U+0361 to join ct, producing c͡t. U+0113 (ē) for “em” and “en” and U+014D (ō) for “on” appear to be exact matches.

After I've finished the first version, the character-by-character transliteration into a modern typeface, and verified that it's accurate, I save that as a master copy and create a copy from it to use for the next step, which is producing a document formatted to match the original document. This has its own tricky bits. The fancy woodblock/engraving/illuminated initial letters are beyond my ability to reproduce except by creating an image from the scan of the original and inserting it into my document. This holds true for other illustrations/artwork. LibreOffice is not the best program for doing this formatting, but it's what I have to work with; I have time I can devote to this activity, but I can't invest very much money.

Again, once I've finished this document, I save a master copy of it and move on to the next step, which is producing a document with regularized period spellings. For this I return to a copy of the first master document, before formatting and the introduction of images. The trick here is to determine what the standard period spellings are. Where possible I consult contemporaneous dictionaries to see what the opinion of the time was; the more contemporary dictionaries I can consult, the more confident I am in the spelling I decide to use. I temper this by checking whether there are any authoritative modern works covering the contemporary spellings; I know there are modern Anglo-Saxon dictionaries, and I suspect that there are modern Elizabethan English dictionaries.
I'm not going to go against what modern scholarship has determined unless I think they are all way off base, and that's not very likely. The intent in producing this regularized spelling document is to present what they would have produced if they had computers with spell checkers in the language of their time. Alternatively, software is available which can determine the frequency of words within a document. Running the original transcription through such software would enable me to determine which spellings the author of the document most favored, and change the other spellings to match. This may not jibe with what modern scholarship has determined to be the societal consensus, but it would produce a normalized spelling closer to the intent of the author. As part of the normalization process, the printer's special symbols are transformed back to the text they represent.

As a bonus, I'm producing glossaries of words, individuals, places and events mentioned in the documents. What was common knowledge amongst the intended audience may be unknown to the modern reader; if I had to look it up, it goes in the glossary, and if I think I knew about it only due to specialized knowledge, it goes in the glossary. These glossaries are appended to the end of the document. Depending upon the margins, and if the original text already did this, I might insert text boxes in the margins adjacent to the first appearance of archaic words or word meanings to present their current meanings, as an alternative to replacing them with a modern equivalent. If the original text contains notes presented this way, I'll need to find a way of clearly differentiating my notations from the author's notations, to prevent confusion as to who is providing the information. Using a radically different font springs to mind; clearly there would need to be a note concerning this.
The idea of glossing word meanings adjacent to the first occurrence of the word could be used in the modern spelling document, as a means of avoiding changing the text of the document via the replacement of archaic words with their modern equivalents. 

I'm not the only one doing this. Not by far!

There are currently a number of transcription projects ongoing in academia.

The Text Creation Partnership has transcribed a ton of documents from ProQuest's Early English Books Online, Gale Cengage's Eighteenth Century Collections Online, and Readex's Evans Early American Imprints, all of which are restricted access services. ECCO-TCP (Eighteenth Century Collections Online) texts are available to anyone. EEBO-TCP (Early English Books Online) has two parts: the first contains approximately 25,000 books and is available to anyone, while the second, consisting of 35,000 books, is only available to TCP partner organizations. Evans-TCP (Evans Early American Imprint Collection) is available to anyone.

While TCP's main page doesn't go into detail, they do say these are normalized texts, and a quick scan of the word index for EEBO-TCP and browsing the titles for ECCO-TCP and Evans-TCP seems to confirm this. The frequency of variant spellings is nowhere near as great in EEBO-TCP as I would expect based upon the two Elizabethan fencing manuals that I have examined in depth. Counts of everie 3811, everye 118, every 419924 just scream that the spelling has been normalized. Counts of publique 19417 and public 3171 confirm normalizing to period practice.

Given the large number of individuals doing the transcription and creating metadata over a long period of time, the metadata is not standardized. You have to try a variety of terms when searching the metadata to ensure you find all the texts related to your subject; they weren't working from a standardized thesaurus of terms with clear definitions. It is clear they didn't get the library cataloger community involved. I'm not really in a position to throw stones, as I haven't been referring to either the Sears or the LC subject heading works; I have a copy of Sears, but I don't own a copy of LC.

Visualizing English Print is a project that is taking the TCP and similar files and making them more amenable to textual analysis using specialized software. Certain sacrifices had to be made to enable this, which makes their output of no use to those researching period printing practices. All text is stripped to bare ASCII; no umlauts, apostrophes, italics, etc. No attempt is made to preserve document formatting, other than maintaining the same line breaks as their source files. As part of removing punctuation, words were standardized; to wit, fashiond and fashion'd were both changed to fashioned. So some, but not all, spelling variants have been removed from their SimpleText output. It will be interesting to see what people do with the result of their efforts.
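
To give a feel for what that kind of stripping does to a text, here is a rough sketch in Python. It is emphatically not Visualizing English Print's actual pipeline, which I haven't examined; the function names and the one-entry expansion table are assumptions made up for the illustration.

```python
import re
import unicodedata

# Hypothetical one-entry table; a real project would need a far larger list.
EXPANSIONS = {"fashiond": "fashioned"}

def to_bare_ascii(text: str) -> str:
    """Drop accents and any other non-ASCII characters, then drop apostrophes."""
    # NFKD splits an accented letter into a base letter plus a combining mark;
    # encoding to ASCII with errors="ignore" then discards the mark.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.replace("'", "")

def standardize(text: str) -> str:
    """Strip to ASCII, then expand known contracted forms word by word."""
    stripped = to_bare_ascii(text)
    # Substituting in place (rather than splitting and rejoining) keeps line breaks.
    return re.sub(
        r"[A-Za-z]+",
        lambda m: EXPANSIONS.get(m.group(0).lower(), m.group(0)),
        stripped,
    )

print(standardize("a na\u00efve but well fashion'd\nand fashiond blade"))
```

Both fashion'd and fashiond come out as fashioned, the accent on the ï is lost, and the line break survives, which is the trade-off described above: easier to count words, useless for studying how the page was printed.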

Smithsonian Digital Volunteers is a project of the Smithsonian Institution to coordinate the digital transcription of a whole slew of documents either in their possession or in the possession of institutions that have joined with them in this project. As they are constantly creating new images of text items in their collections, this is a very long-term project. It started in June 2013 and, according to their page, currently has 9085 volunteers.

Citizen Archivist is a similar project of the National Archives and Records Administration.

Manuscript Transcription Projects is a list of projects similar to Early Modern Manuscripts Online (EMMO); EMMO is a Folger Library project, and the Manuscript Transcription Projects link page is maintained by the Folger Library.

FromThePage appears to be a transcription crowdsourcing service provider, where individuals and institutions pay them monthly fees to host their projects, and volunteer transcriptionists log in to do the actual transcription. Their fees for hosting projects seem reasonable, and this allows individuals/institutions to have crowdsourced transcription projects without having to set up all the software/hardware interfaces themselves. Clearly, since they charge for this, once a given transcription project is completed the project owner may choose to remove the project from their site and store it elsewhere, which may or may not include making it accessible through the web.

Papers of the War Department, 1784-1800 is a crowdsourced transcription project of the Roy Rosenzweig Center for History and New Media (RRCHNM), which in turn is a project of the Department of History and Art History at George Mason University. There are a number of projects that the RRCHNM has been involved with, which they provide links to. They have also developed some useful Open Source software for use in this type of activity.

There are many more such projects out there; these are merely those from the first page of a Google search on document transcription projects.

If getting involved in this activity intrigues you, determine what your preferred subject matter is. If you want to work with established collections, start looking for relevant projects. Otherwise, do as I'm doing: track down .pdfs or other format scans of relevant documents, transcribe them, and place them on your Academia web page, the Internet Archive, and the files section of pertinent Facebook groups that you belong to. Of course, given the source material being out of copyright (which it had better be if you don't have the permission of the copyright holder), you could always attempt to make some extra money by selling your completed transcription project via the various marketplaces. I'm making the results of my labours freely available, because so much of what I'm able to do these days is a result of others making materials freely available; turnabout is fair play.

Since there were a number of professional transcription sites included in the results of my Google search, there is the option of branching out as a Transcriptionist For Hire once you have developed your skills via volunteering with a crowdsourced transcription project. That works just fine by me; it's in the spirit of the Works Progress Administration projects during the Great Depression, where the US Government put people to work on various projects to give them income and teach them practical skills which they could then put to use in the private sector. Mini rant: They should never have shut down the Works Progress Administration; it was successful in all of its goals. The American Association of Electronic Reporters and Transcribers can provide you with information on learning how to do this and getting certified.

I could go on (and on and on) but I think this is enough for now on this topic.

Post this Puppy!
