2017-09-17

The Son of More Thoughts on Transcription

In my first post I discussed a philosophy of Transcription. In my second, the creation of a master font document to assist in character recognition, and the concept of initially transcribing into a font that matches the source document, for ease of comparing your initial transcription with the source to see if the individual characters match. In this post I'm going to talk about getting access to your source document.

There is one assumption I'm making, and that is that you are using a desktop computer for this purpose. I can envision using a laptop, but anything without a physical keyboard distinct from the display is right out.

Your source document will come in one of three basic forms. 1) Digitized images of the original, 2) a hard copy of the original; this may be a physical book, a photocopy of the document, or, if you are fortunate enough to be working with the owner of the original document, the original document itself. In the case of working with the original document itself, odds are very good that you will be doing this where they store it, and unless they are providing you with access to a work station, you will be using a laptop. 3) Sound recordings. Sound recordings are a whole nother kettle of fish, if they aren't a sound file, because you will need to have the equipment to play back the media they are recorded on. Well, even if they are a sound file, they may be on an outdated storage format, such as floppy disks, and in an outdated file format. In which case you would need access to a computer of the appropriate vintage, with the appropriate audio software. As time passes, this is going to become harder and harder to do; I no longer possess a computer with floppy drives of any kind that still works, and it's been quite some time since I had access to anything capable of running a pre-Windows 95 program. Anyway, if you are dealing with sound recordings that aren't digitized audio files, you will need the appropriate equipment to play them. I'm not going to go into what all this might entail, at least not in this post, just take my word for it that finding the equipment to playback non-digitized audio recordings may be quite the adventure, if it doesn't come provided with access to the sound recordings themselves. 
However, you would be surprised what equipment is still available, if you hunt around a bit; the online marketplace has made obtaining obsolescent equipment much easier, as individuals who couldn't quite bear to just throw their old equipment away now have a means of finding it a new home, and those who made a business out of obtaining obsolescent equipment from those wanting to get rid of it (heck, sometimes they even got paid to take it away!) for resale to those who needed that equipment to access obsolescent media now have it much better when it comes to outreach to their prospective customers. And, there are those who make a business out of converting audio between different storage media; for a price, you send them your outdated media, they'll send you back the contents on current media. This holds true for all data types, not just audio; if you are willing to let them retain a copy of the converted data, and distribute it as they wish (including selling copies), they might be willing to arrange a lower price, but it would need to be something marketable that isn't under someone else's copyright.

Digitized images of the original: In short, a computer data file. Hopefully, this will have been created recently enough that it is in a current file format, on current storage media. If it's in a non-current file format, you will need to obtain either conversion software, so you can convert it to a modern file type, or software capable of displaying the contents of that file type. If it's not on a current storage medium, we're back to the problem outlined with audio recordings, of needing to obtain the equipment necessary to read the storage media and file type. For my purposes in this post, I'm going to pretend that your source image is in a current file format, stored on modern equipment, such that you can view it on your main computer's monitor. In some cases you may be allowed to download the images to your own storage media; in other cases the source site may not allow downloading (and may have installed scripts that keep a mouse right-click from pulling up a context menu), and you will need to keep an active browser window open to their site. Of course, their not allowing you to download a copy of the image should raise the question of whether you have their permission to create a transcript of the document. If it is a unique document, you really need to contact them to seek their permission to create a transcript from it; while the original document may be out of copyright, odds are real good that the image they won't let you download is in copyright, and modifying the image, which includes transcribing the contents, requires their permission. In writing. One can argue fair use for transcribing a small portion of the information contained in the image, enough for a quote in another document, but a complete transcription is right out without their permission. If they allow you to download the image, but require permission to use the image in a publication, you will still need to contact them about distributing your transcription in any form.
If it is not a unique document, things get a little bit iffy. But only a little bit. Sure the original is not unique, but do you have physical access to any of the other physical copies? Has anyone else made images of one of those copies available without constraints placed upon their use? If the answer to those questions is No, then you still need to get their permission. If the answer to either of those questions is Yes, then that's what you need to do to access the document if you don't want to contact the image producer about producing a transcript from their image.

[Note: A bit tardy, but I've just emailed the Lord Collection to request permission to make transcriptions from their .pdfs. As with my article on Link Rot, I must practice what I preach.][2017 09 18: Got an email back, it's cool with them. Yay!]

There are online repositories of digitized documents that make their holdings available without constraint, other than not selling what you obtain from them. Derivative works are your call, but there need to be substantive changes made, such as transcribing them into a modern typeface, annotating them, or translating them into another language: things that take considerable time and effort, such that you have a real claim on the resulting document. Google Books, the Internet Archive, any agency of the United States Government, in general any State Government agency, and Project Gutenberg, to name a few.

Accessing the original document in hard copy.

If it is a published work, now out of copyright, and you own a copy of it in hard copy, you are set, good to go. I would recommend investing in a good document holder, appropriate to the hard copy format, to hold the document open and well displayed while you work from it.

If you do not own a copy of the work, you may be able to borrow a copy via your local library; while they may not have a copy themselves, they could try to borrow it from another library that does, through InterLibrary Loan (ILL). There is a caveat to this, and that is, the less common the item, the less likely that anyone who still has it will lend it out. I worked in the Bibliographic and Interlibrary Loan Center of the Chicago Public Library for three years, I know whereof I speak.

If you don't own a copy, and can't borrow a copy, you will have to go to where a copy is kept. First, you have to find out where a copy is held. For published works, OCLC WorldCat is the best place to start for holdings within the USA, as it is drawn from the cataloging database that OCLC maintains of materials for which they have bibliographic records, and they are the major, although not the only, cataloging database service provider in North America. Outside of North America their coverage is not very good. OCLC has been in operation since 1967, and by now, most libraries in North America have substantially completed their retrospective conversion projects; retrospective conversion is a fancy term for taking the information from your physical card catalog and converting it into information in an electronic database, typically available via the library's online catalog. Pretty much, the only things that haven't been converted are items unique to a given collection, where they haven't been able to afford the time of an original item cataloger to create the bibliographic record. Original cataloging is a lot harder than copy cataloging; copy catalogers have to be very careful, but what they are doing is searching the existing cataloging records for one which matches the physical description of the item in their collection; if they find one, they attach their holdings code to the record, download the record for use in their online catalog, and proceed on to the next item. If they can't find a matching record, they record that fact in a local record of some kind, and move on to the next item. The record of items for which a matching bibliographic record wasn't found will then be accessed by an original item cataloger, when they can afford to hire one; note that point, when they can afford to hire one. Pretty much all libraries of any size have a copy cataloger on staff, to handle their ongoing acquisitions. 
It may not be a dedicated copy cataloger, but someone who does it as part of their duties; my sister, when she was the Children's Librarian in Klamath Falls, Oregon, did the copy cataloging for the Children's Library as part of her duties. But original cataloging is much more time consuming, and requires a very analytical, detail oriented mind set; they have to create a bibliographic record that accurately describes the item in their possession such that it is clear what they have, and how the edition of the document in their possession differs from all other editions of that document. Having worked in ILL for three years in one of the largest public library systems in North America, I have a much better understanding of just how important that is than I did previously. Different editions are just that, different. They differ in formatting of the information contained, the actual information contained in the work can differ between different editions; like, duh, why else would they call it a different edition? Different printings of the same edition can vary in appearance. There are all sorts of reasons why a researcher will need access to not just a specific work, but a specific printing of a specific edition. If you are looking at travelling thousands of miles to do your research, you want to be certain before you pack your bags that the copy of the item held by the repository you are going to visit matches the item you are seeking to research. So good, detailed, anal retentive original cataloging is not a luxury, it is mandatory, and people capable of that quality of work cost. Collections greater than a certain size, who have funding adequate to their needs, can afford original catalogers. Smaller collections, and specialized collections, may not be able to afford to have an original item cataloger on staff permanently. 
What they do is 1) hope their item isn't as unique as they fear, and a cataloging record will be input by another institution that matches the item in their collection, and 2) seek outside funding in addition to their normal funding to hire a project cataloger, someone who will focus all their efforts on cataloging the items unique to their collection, for the duration of their funding. They don't always call these individuals catalogers, sometimes they are called archivists; archivists focus on non-published items such as personal and corporate papers and records, but the basic concept is the same, the creation of entry points to the holdings of the library/archive, such that researchers can become aware of what they have that is unique to that collection, so people will use the materials and justify the expense of preserving them; researchers are also a revenue source, while publicly funded repositories are usually free to access in person, privately funded collections frequently charge for admission to their collections, as a means of supplementing their usually inadequate funding; they are also more likely to charge publication fees for use of the information unique to their collection in publications, said fees generally on a sliding scale based upon expected number of individuals who will access that publication.

And with that last, I've advanced to unique items. Items that are unique to a given collection, because few if any copies were made. While WorldCat's coverage in this area is improving, that's damning with faint praise. This is where you need to have some reason to think that a given collection would have resources relating to your research, before you can search their holdings information. Thankfully, as these collections are able to obtain funding for inventorying their unique holdings, more and more information about these collections is becoming available via web searches. Also, there are a growing number of organizations such as Archives West, which acts as a portal to the specialized collections of a great many institutions in the Greater Pacific Northwest, allowing you to use their front end search software to search a number of specialized collections at once. One caveat: due to the variety of materials in these collections, they don't all use the same terminology in their collection descriptions, so you need to try a number of searches using terms tangential to each other to maximize the chances of finding that they have materials related to your research.

Hm. Shifted from transcription to research. Well, looking for a collection that holds a copy of the fairly unique item you want to transcribe is research. And, I have to admit, that's how I've tracked down the items I've been transcribing, searching on the web for items related to my area of interest; I didn't start out looking for Vincentio Saviolo his Practise in Two Bookes, I was looking for historical fencing manuals, and stumbled across the Raymond J. Lord Collection by purest chance. It was only afterwards that I located the various HEMA link repositories that directed there. I mean, the University of Massachusetts does not immediately spring to mind as an institution which would have a collection of historical European fighting manuals. Once you find out about their academic programs, not so surprising.

Well, I did, and didn't, cover what I intended to in this post. It certainly isn't what I'd been thinking about earlier today, which was the physical layout of your transcription area. But it did cover something important; before you can transcribe, you need to have something to transcribe.

It's past time for lunch.

Post this Puppy!

Edit 2017 09 18: Permission received from the Lord Collection to make the transcriptions from their .pdfs.

More thoughts on transcription of documents in odd fonts

What I've done so far, is transcribe directly to a modern font. I just realized, while trying to make out the letters in a German Blackletter volume, that what you should really do is this: go through your font library to find the font that most closely matches the font you are transcribing from, and use that for your first transcription. If the fonts are a good match, you will be able to tell by comparing your transcription to the original document whether you have correctly identified each letter, because if you haven't they won't look the same.

For this to work, you first need to create a document that shows you the complete extended Latin alphabet in each of your fonts. Do this the easy way, by first doing this with your standard font, the one you have as your default in your text editor. Make sure you enter the official font name first, before entering the alphabet. In this case when I say alphabet I mean the extended alpha-numeric character set: you want every Latin letter your font will support, then all the numbers, and punctuation symbols; all the punctuation symbols, not just the ones used in modern English. Got that done? Good. Drop down a couple of lines, and still using your standard font, enter the name of the first font that shows up when you select the pull down font menu in your text editor. Now, if you have a lot of fonts, you might find that the program crashes when scrolling through the pull down menu for font selection; LibreOffice does. To avoid that, don't use the pull down menu, go to Format→Character; this should present you with a long list of font names that you can scroll through, which won't cause your program to crash. Now drop down a couple of lines, and type in the name of the next font in that menu. Repeat until all of your font names have been entered. Save your document, you don't want to have to repeat this. Now, go back up to where you had typed the entire extended Latin alpha-numeric character set, plus punctuation marks, and copy that. Paste it following each font name. Save your document frequently, you don't want to have to do this over again. Now, once you have that done, go to each font name, select all the characters except the font name (you want to be able to read that easily), and apply that font to the selection. Do this for all the font names. Make sure you save your document frequently. Voila!
You now have a master document containing a list of all your fonts, such that you can compare them to the document you are getting ready to transcribe, and select the one that most closely matches that document.
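
If you would rather not do all that copying and pasting by hand, the same master sheet can be roughed out with a short script. What follows is only a sketch, not the procedure described above: it assumes Python 3, it writes an HTML page for viewing in a browser rather than a word processor document, the function names (sample_characters, font_sheet) are made up for this example, and you have to supply the list of font names yourself. It also only covers the ASCII punctuation marks plus the accented letters from U+00C0 to U+017F, so add any further symbols your source uses.

```python
import html
import string

# One heading (in the browser's default font) plus one sample paragraph per font.
# Assumes font names contain no quote characters.
ROW = ('<h2>{name}</h2>\n'
       '<p style="font-family: \'{name}\'; font-size: 18pt">{sample}</p>')

PAGE_TOP = ('<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
            '<title>Master font sheet</title></head>\n<body>\n')
PAGE_END = '\n</body></html>\n'


def sample_characters():
    """Letters, digits and ASCII punctuation, plus accented Latin letters."""
    basic = (string.ascii_uppercase + string.ascii_lowercase
             + string.digits + string.punctuation)
    # U+00C0..U+017F covers Latin-1 letters and Latin Extended-A, including
    # the long s; isalpha() drops the multiplication and division signs.
    extended = "".join(chr(cp) for cp in range(0x00C0, 0x0180)
                       if chr(cp).isalpha())
    return basic + extended


def font_sheet(font_names):
    """Return an HTML page showing the sample text once per font name."""
    sample = html.escape(sample_characters())
    rows = [ROW.format(name=html.escape(name), sample=sample)
            for name in font_names]
    return PAGE_TOP + "\n".join(rows) + PAGE_END


if __name__ == "__main__":
    # Placeholder list: replace with the fonts installed on your machine.
    fonts = ["Times New Roman", "Georgia"]
    with open("master_fonts.html", "w", encoding="utf-8") as out:
        out.write(font_sheet(fonts))
```

Open master_fonts.html in a browser and hold it next to your source image. A font that isn't actually installed will silently fall back to the browser default, so a sample that looks suspiciously ordinary probably means the name was misspelled.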

Now do your transcription, using that font. I know, the end goal is to have the text in something easier to read. That's the end goal, right now your goal is to be certain you have chosen the correct character to match that in the original document. When you are all done with the transcription, and have gone through the verification process to insure that you have, indeed, chosen the correct character in each case, then and only then, but wait, first save your document, and open a copy of it; you don't want to lose your hard work (this should become instinctive after a while). Once you have opened the copy, select all the text, and apply the font you want to have the document in. There. Done. You have your transcribed document in an easy to read modern font. Save the document; this is the basis for all of your future text manipulations.

Now you can proceed in the process described in my previous post.

Post this Puppy!

2017-09-15

Link Rot

Link Rot: The condition of HTML links going bad due to changes in the destination site's url tree.

Link Rot happens. Link Rot Deniers lie through their teeth. OK, I don't really think there are Link Rot Deniers, but there are definitely those who don't check their posted links for Link Rot as often as they should.

I spent six hours yesterday preparing an errata sheet for a web site I stumbled across; don't ask why I did this, I'm not totally sure why myself. And that was just for a quarter of the categories based off of one page of their site. Most of their link pages hadn't been updated since 2011. I had 24 corrections for them; in a couple of cases I couldn't find a current site to replace the dead link, but most of them I was able to provide the current url.

In the process of tracking down current urls, I found outdated links on two other sites referring to the url I was trying to update. So when I found the current url, I informed them as well. I'm not going to name names here, but one of the sites is run by a chap who sends out notices about its existence to the major mailing lists of that interest group on a monthly basis. He had links to GeoCities in his list. GeoCities shut down all operations outside of Japan in 2009, for crying out loud!

The url that led me to his site, the one I was looking for a replacement for, well, the other site knew it was bad, so they had provided a link to an Internet Archive backup of it. Which was a good temporary fix, except, as they noted, it was a music lyric/sound file site, and the .midi files hadn't been grabbed by the Wayback Machine. However, they had enough information about the purpose of the site, which hadn't been provided by the first site, the one that started all of this, that I was then able to find the current site for the organization that all three sites had bad urls for. So there are three sites which, hopefully, will shortly have active links to that organization again, and one site that will, hopefully, have 24 links corrected shortly.

Now, I will admit that I got a bit snarky in one of my emails, pointing out that GeoCities had shut down operations in 2009, which was pretty common knowledge, so there was no excuse for still having a link to a GeoCities site on his link list.

This morning, getting up somewhat later than usual (I finished all of that activity at nearly 2:00 AM), I decided that if I was going to get snarky about other people's Link Rot, maybe I should look at the links in my Blog postings. So I did. Got side tracked a couple of times, but all of my blog posts are now up to date in regard to referring urls. And I updated product availability and price information as well, as annotations, leaving the original information intact, except for turning off invalid product links. With only a couple of exceptions, I changed from linking directly to a sub page to linking just to the home page, and then providing the information needed to use the home page search engine to find the proper sub page. Those exceptions were for sites where, as far as I could tell, they hadn't changed their directory tree schema; they might have added and removed pages, but they hadn't changed the url of an existing, retained, page. It's sad just how few such stable web sites I found. While some url changes are perfectly understandable, such as when your domain owner goes out of business, others reflected the realization that they hadn't put proper effort into their initial web site structure development. Still others reflected organizational changes that required site structure changes to be able to function in a reasonable manner. Anyway, when I found such a stable web site, one that hadn't changed its url structure since I posted links to it in 2008, I sent them messages letting them know how much this was appreciated, and commending their initial web site design initiative for being so successful.
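
For anyone who would rather not click through every link by hand, a first pass can be automated. This is a minimal sketch, assuming Python 3 and nothing beyond its standard library; it is not how I did my own check, the function names (check_link, rotten_links) are invented for the example, and the opener parameter exists only so the code can be tried out without touching the network. It can tell you that a url no longer answers; it cannot tell you that a url now answers with the wrong content, so anything it passes still deserves an occasional look.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def check_link(url, timeout=10, opener=urlopen):
    """Return the HTTP status code for url, or None if nothing answered."""
    try:
        # HEAD asks for headers only. A few servers refuse it, so a 405
        # result means "recheck by hand", not necessarily "dead".
        request = Request(url, method="HEAD",
                          headers={"User-Agent": "link-rot-check/0.1"})
        with opener(request, timeout=timeout) as response:
            # Redirects are followed, so a page that moved but still
            # redirects properly is reported as live.
            return response.status
    except HTTPError as err:
        return err.code   # the server answered with an error, e.g. 404
    except (OSError, ValueError):
        return None       # no such host, refused, timed out, or malformed url


def rotten_links(urls, opener=urlopen):
    """Return (url, status) pairs for links that failed or did not answer."""
    report = []
    for url in urls:
        status = check_link(url, opener=opener)
        if status is None or status >= 400:
            report.append((url, status))
    return report
```

Feed rotten_links the list of urls from your links page and you get back the ones that need attention, each with its status code, or None where the site didn't respond at all.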

I'm not staying up as late as yesterday.

Post this Puppy!

2017-09-07

Thoughts on transcribing historical documents.

When transcribing historical documents, there are a number of potential end goals. 1) a strict transcription: the goal is to maintain all the vagaries of the original document, just make it more readable by using modern typefaces. Next to accessing the original document, this is the most accurate presentation; it's also the hardest to produce, as you have to really look at each character of the original document carefully, and have to fight the urge to say, "Oh, it's that word," and make sure that what you enter is what was actually there. You can't trust the results of your first pass through the document, you have to let it sit a while and then make a second, and maybe even a third, pass through the document. Then there's formatting the results. You can just have a text document, or you can try to make it look as much like the original in formatting as possible; this is harder to do, but better for those who are able to access the original, or images of the original, as it makes it easier to place the transcription side by side with the original, and be able to look back and forth between them. 2) a transcription with regularized spellings for the language at that time: If you are not concerned with the spelling variations inside the original, but do want to read it in the original language, this is best suited to your purpose. Again, you can produce a straight text document, or you can attempt to make the transcription match the original in layout. 3) Transcribing/translating to the modern version of the original language. You have to be very careful here, to insure that you capture the meaning in context of each word; word meanings change over time, the word used in the original may no longer have that meaning in the modern language, so you have to replace it with the modern word that most closely matches the original intent where word meanings have changed; either that, or provide a gloss of the meaning of the word at the time the document was written. 
The previously mentioned methods presume a researcher who is familiar with the original language, and the word meanings at the time the original document was created. This third method is for those interested in the intellectual content of the original without having to understand the changes in the language. The previous methods have no interpretation involved, no need to really grasp the intended message of the author, it's just typesetting; well, somewhat more than typesetting if you are working with handwritten documents, you have to be able to read the original script, and sometimes that's very difficult; this isn't made any better if all you have to work with is a scan of the original. Here, you have to understand what the author was trying to say, so you can translate it for them into modern language, reflecting the changes in word meanings. This is much more intellectually stimulating for the editor (at this point you are becoming an editor), as you try to change the text as little as possible while trying to create a modern language version. Punctuation changes. Changes in word meanings require the substitution of the closest modern word that provides the original meaning in the context of the surrounding words; you're not aiming at a total recasting, as much as is possible you want to maintain the original phrasing. You need to be a scholar of the subject the author was writing about, so you can comprehend what he was trying to say, so you can make the changes to the modern language while changing his phrasing and meaning as little as possible. You need to understand the subject both as it was understood when the author wrote the document, and as it is now practiced, so that the changes in vocabulary remain true to the original intent while becoming more accessible to the modern practitioner of the subject.
Not all that much recognition is given to the individual who performs the first two types of transcription, it's strictly character recognition in the first, that and spelling regularization in the second. Here, there is interpretation involved, and that interpretation will be debated. But the desire is still for a document that would reflect the original author's style and phrasing, following the conventions of the author's time, with as little change as possible while remaining true to the intent of the original words. While some words are changed, it should still read as a period piece, not a modern document. It should read as if only the language had changed, not the writing conventions; it should remain faithful in style as well as meaning. This produces a document of use to those interested in period practices who are not interested in the language of the time the original was written, but are interested in how the information was presented at the time the original document was created. It retains the intellectual property of the author. Anything beyond the third is a modern interpretation, a retelling rather than reformatting. You are creating a derivative work, a modern work based upon the intellectual content of a historical document but written using modern conventions. This is not what I do, I don't understand the subject matter well enough to recast it in modern form. What I am now trying to do is create distinct documents reflecting the goals outlined above. First, I'm working from documents that use modern European alphabets; while I have access to fonts for Futhark, etc., that's not my primary area of interest, and I have to be interested in the subject matter, or there's no way I'd put up with the drudgery and monotony of this process. 
In theory, I start by producing a character by character transliteration from the historical typeface to a modern typeface. I generally use Times New Roman, as it's the typeface we are most used to reading, although I'm considering switching to using Georgia. The really tricky bit is attempting to retain the original specialized non-alpha-numeric symbols; this really comes into play with Elizabethan printed materials, where they will use a symbol to represent "on" at the end of a word; it's not a character of any alphabet, it's a specialized printer's space saving symbol, and there is no Unicode code point for it; it is close enough to a ♁ (U+2641) that I've decided to use that in its stead, with a note explaining the substitution. I'm also using U+0361 to join ct to produce c͡t; ē (U+0113) for “em” and “en” and ō (U+014D) for “on” appear to be exact matches. After I've finished the first version, the character by character transposition into a modern typeface, and verified that it's accurate, I save that as a master copy and create a copy from it to use for the next step, which is producing a document formatted to match the original document. This has its own tricky bits. The fancy woodblock/engraving/illuminated initial letters are beyond my ability to reproduce except by creating an image from the scan of the original and inserting it into my document. This holds true for other illustrations/artwork. LibreOffice is not the best program for doing this formatting, but it's what I have to work with; I have time I can devote to this activity, but I can't invest very much money. Again, once I've finished this document, I save a master copy of it, and move on to the next step, which is producing a document with regularized period spellings. For this I return to a copy of the first master document, pre formatting and image introduction. The trick here is to determine what the standard period spellings are. Where possible I consult contemporaneous dictionaries, to see what was the opinion of the time; the larger the number of contemporary dictionaries I can consult, the more confident I am as to the spelling I determine to use. I temper this by checking to see if there are any authoritative modern works covering the contemporary spellings; I know there are modern Anglo-Saxon dictionaries, I suspect that there are modern Elizabethan English dictionaries.
I'm not going to go against what modern scholarship has determined unless I think they are all way off base, and that's not very likely. The intent in producing this regularized spelling document is to present what they would have produced if they had computers with spell checkers in the language of their time. Alternatively, software is available which can determine the frequency of words within a document; running the original transcription through said software would enable me to determine which spellings the author of the document most favored, and change the other spellings to match; this may not jibe with what modern scholarship has determined to be the societal consensus, but would produce a normalized spelling closer to the intent of the author. As part of the normalization process the printer's special symbols are transformed back to the text they represent. As a bonus, I'm producing glossaries to words, individuals, places and events mentioned in the documents; what was common knowledge amongst the intended audience may be unknown to the modern reader; if I had to look it up, it goes in the glossary, if I think I knew about it due to specialized knowledge, it goes in the glossary. These glossaries are appended to the end of the document. Depending upon the margins, and if the original text already did this, I might insert text boxes in the margins adjacent to the first appearance of archaic words or word meanings to present their current meanings, as an alternative to replacing them with a modern equivalent; if the original text contains notes presented this way I'll need to find a way of clearly differentiating my notations from the author's notations, to prevent confusion as to who is providing the information; using a radically different font springs to mind, clearly there would need to be a note concerning this.
The idea of glossing word meanings adjacent to the first occurrence of the word could be used in the modern spelling document, as a means of avoiding changing the text of the document via the replacement of archaic words with their modern equivalents. 
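
The word-frequency approach to normalizing toward the author's own preferences can be sketched in a few lines of Python. This is only an illustration, not any particular existing program: the function names, the sample sentence, and the hand-picked list of variants are all assumptions made for the example, and deciding which spellings count as variants of the same word still has to be done by a person.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count every word form in the text, ignoring case."""
    # [^\W\d_]+ matches runs of letters only (no digits or underscores).
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)


def favored_spelling(freqs, variants):
    """Return whichever of the given variants occurs most often."""
    return max(variants, key=lambda word: freqs[word])


# Hypothetical sample text, invented for this example.
sample = "Everie man must learne, and everie day he shall; every houre, everie minute."
freqs = word_frequencies(sample)

# The variant list is supplied by hand; the code only does the counting.
print(favored_spelling(freqs, ["everie", "everye", "every"]))
```

The choice made this way reflects only the one document it was counted from, which is exactly the point: it normalizes toward the author's habits rather than toward the consensus of the period.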

I'm not the only one doing this. Not by far!

There are currently a number of transcription projects ongoing in Academia.

The Text Creation Partnership has transcribed a ton of documents from ProQuest's Early English Books Online, Gale Cengage's Eighteenth Century Collections Online, and Readex's Evans Early American Imprints, all of which are restricted access services. The ECCO-TCP (Eighteenth Century Collections Online) texts are available to anyone. EEBO-TCP (Early English Books Online) has two parts: the first contains approximately 25,000 books, available to anyone, while the second part, consisting of 35,000 books, is only available to TCP partner organizations. Evans-TCP (Evans Early American Imprint Collection) is available to anyone. While TCP's main page doesn't go into detail, they do say these are normalized texts, and a quick scan of the word index for EEBO-TCP and browsing the titles for ECCO-TCP and Evans-TCP seems to confirm this; the frequency of variant spellings is nowhere near as great in EEBO-TCP as would be expected based upon the two Elizabethan Fencing Manuals that I have examined in depth. Counts of everie 3811, everye 118, every 419924 just scream that the spelling has been normalized. Counts of publique 19417 and public 3171 confirm normalizing to period practice. Given the large number of individuals doing the transcription and creating metadata over a long period of time, the metadata is not standardized; you have to try a variety of terms if searching the metadata, to ensure you find all the texts related to your subject, because they weren't working from a standardized thesaurus of terms with clear definitions. It is clear they didn't get the Library Cataloger community involved. I'm not really in a position to throw stones, as I haven't been referring to either the Sears or the LC subject heading works; I have a copy of Sears, I don't own a copy of LC.

Visualizing English Print is a project that is taking the TCP and similar files and making them more amenable to textual analysis using specialized software. Certain sacrifices had to be made to enable this, which makes their output of no use to those researching period printing practices. All text is stripped to bare ASCII; no umlauts, apostrophes, italics, etc. No attempt is made to preserve document formatting, other than maintaining the same line breaks as their source files. As part of removing punctuation, words were standardized; to wit, fashiond and fashion'd were both changed to fashioned. So some, not all, spelling variants have been removed from their SimpleText output. It will be interesting to see what people do with the result of their efforts.

Smithsonian Digital Volunteers is a project of the Smithsonian Institution to coordinate the digital transcription of a whole slew of documents either in their possession or in the possession of institutions who have joined with them in this project. As they are constantly creating new images of text items in their collections, this is a very long term project. It started in June 2013, and according to their page, currently has 9085 volunteers.

Citizen Archivist is a similar project of the National Archives and Records Administration.

Manuscript Transcription Projects is a list of projects similar to Early Modern Manuscripts Online (EMMO); EMMO is a Folger Library project, and the Manuscript Transcription Projects link page is maintained by the Folger Library.

FromThePage appears to be a transcription crowdsourcing service provider, where individuals and institutions pay them monthly fees to host their projects, and volunteer transcriptionists log in to do the actual transcription. Their fees for hosting projects seem reasonable, and this allows individuals/institutions to have crowdsourced transcription projects without having to set up all the software/hardware interfaces themselves. Clearly, since they charge for this, once a given transcription project is completed the project owner may choose to remove the project from their site and store it elsewhere, which may or may not include making it accessible through the web.

Papers of the War Department, 1784-1800 is a crowdsourced transcription project of the Roy Rosenzweig Center for History and New Media (RRCHNM), which in turn is a project of the Department of History and Art History at George Mason University. There are a number of projects that the RRCHNM has been involved with, which they provide links to. They have also developed some useful Open Source software for use in this type of activity.

There are many more such projects out there; these are merely those from the first page of a Google search on document transcription projects.

If getting involved in this activity intrigues you, determine what your preferred subject matter is and start looking for relevant projects, if you want to work with established collections, or do as I'm doing, which is tracking down .pdfs or other format scans of relevant documents, transcribing them, placing them on my Academia web page and the Internet Archive, and the files section of pertinent Facebook groups that I belong to. Of course, given the source material being out of copyright (which it had better be if you don't have the permission of the copyright holder), you could always attempt to make some extra money by selling your completed transcription project via the various marketplaces. I'm making the results of my labours freely available, because so much of what I'm able to do these days is a result of others making materials freely available; turnabout is fair play.

Since there were a number of professional transcription sites included in the results of my Google search, there is the option of branching out as a Transcriptionist For Hire once you have developed your skills via volunteering with a crowdsourced transcription project; that works just fine by me, it's in the spirit of the Works Progress Administration projects during the Great Depression, where the US Government put people to work on various projects to give them income and teach them practical skills which they could then put to use in the private sector. Mini rant: They should never have shut down the Works Progress Administration, it was successful in all of its goals. The American Association of Electronic Reporters and Transcribers can provide you with information on learning how to do this and getting certified.

I could go on (and on and on) but I think this is enough for now on this topic.

Post this Puppy!

2017-08-02

What I've been doing: Creation of omnibus eStory volumes.

I read Internet Fiction. A lot of it. It's what I spend most of my time doing these days. I expect to keep doing this for a long time. The thing to note about Internet Fiction is that you need an active Internet connection when you are reading. A lot of the host sites allow you to download the stories for off-line reading, but for the free sites, with a few exceptions, the files are text files that aren't very pretty. In fact, some are downright ugly. But if you don't have an active connection, they're better than nothing.

Looking ahead, I can see the time coming where I'm in a care facility. It may not come to that, depending upon how my health deteriorates, but I'd be wise to plan for it in advance. What I need to plan for is being in a care facility because of physical problems, but while my mind is still working. I don't think many care facilities provide Internet access for those in their care. At least, that's the assumption I'm making. It's unlikely I'd have a desktop computer in a care facility, but a laptop or tablet seems reasonable. So, if I'm going to have anything to read, I'll need to have amassed a collection of files, preferably eBooks, of the stories I enjoy reading. While some of the authors who post on the Internet have gone to the effort to repackage their stories for sale via Nook, Kindle, Smashwords, Lulu, and a growing plethora of related sites, many have not. Which means that for many of the stories that I enjoy reading, if I want something other than raw text or a downloaded web page, I have to create it myself. There are eBooks out there about this, I even have two of them in my collection, although I have to admit I haven't read them. There are discussions of this subject in various of the online forums frequented by authors; I've casually monitored the discussions in the Authors section of the Stories Online Forum. Mostly I've followed the examples of eBooks that I've purchased. The thing about the discussions on the author forums is, they pretty much assume you've already got a clean complete document in .docx, .rtf, or .odt, or some other accepted standard more advanced than .txt. They don't talk about what to do if you're starting from .txt files downloaded with lots of line feeds to keep the lines short, such as those available from Project Gutenberg or FictionMania. 
They don't talk about starting from downloaded web pages, with all the nasty .html artifacts that can make a standard word processor choke, and which can cause problems with the more basic (read: free) .html editors. So I've had to do a lot of learning by trial and error, sometimes ending up with such a mess that I deleted the working document and started over from the original downloaded file; one thing I learned very early is that you don't edit the original file, you make a copy first and edit that.

.txt files have no formatting. No italics, no bold, no underlining, nothing. Sometimes authors use non-alphanumeric characters, such as - _ /\[]() to indicate formatting; this started with Usenet and BBS posts, where it was the only way. The Usenet/BBS crowd developed a fairly standard definition for what these non-alphanumeric characters were intended to convey, but if you didn't grow up on Usenet it's not intuitive. If you start with a file that has formatting indicated in this manner, you have a big job ahead of you. First, you have to import the file into your favorite document editor that supports modern WYSIWYG formatting. Then you have to determine what the author intended with the symbols he used, apply the WYSIWYG formatting, and remove the characters used to imply that formatting. While there may be document editors out there that allow you to search for text surrounded by certain characters, and then replace those characters with modern formatting, I haven't come across them. So if you start with a document like that, it's going to be very labor intensive to bring it up to modern standards, as you will have to go through it character by character. And you will also need to look for foreign language words with non-English characters. I'm constantly replacing deja-vu with déjà vu, for example. And then you have to go through, line by line, adding a space to the end of each line and deleting the line feed so that paragraphs flow together as one unit; in the early days of Usenet and BBSs, lines didn't wrap around the screen, they ran off the end, and you had to manually insert line feeds into the document to keep the lines from running off the screen. Modern technology handles wrapping text just fine, and the display width is much greater, so a sentence that might take three lines with line feeds may take just one line with them removed. 
What I do now, when I come across such a file, is do an Internet search to see if it's been reposted with this reformatting already done. A good example of this is the stories by The Professor, which were originally posted at FictionMania, without formatting. In 2010 PS obtained permission from The Professor to repost many of his stories at BigCloset TopShelf. The reposted stories had The Professor's intended formatting. They were also .html documents. This was good and bad. Good in that the character formatting had been done. Bad, because .html documents have frames and all sorts of other stuff that mess up non-.html documents in text processors. BigCloset allows you to create a printer friendly document that doesn't have all the site advertising sidebars and menus, etc. that their web pages are cluttered with. You can download that page, which gives you a much nicer document to start with in an .html editor. But I quickly discovered that weird shit happens when editing .html files; doing something that seems completely innocuous will cause a section of text to suddenly change font and font treatments for no reason I can fathom, and prove to be beyond the undo function to handle. So I gave up on editing .html files as the path to nifty eBooks. I had to, I was getting too angry and frustrated. The next thing I tried was to copy/paste the entire text of the printer friendly file in one fell swoop into a LibreOffice Writer document. This worked, after a fashion, but introduced some .html formatting elements, such as frames, into the document. These elements in some manner interfere with some of LibreOffice's formatting tools; I kept finding myself unable to insert horizontal lines between sections of text to indicate breaks in action, instead of the short lengths of dashes that had been used. And I really wanted those horizontal lines, they look much nicer than short runs of dashes. 
What I'm now doing, which is somewhat time consumptive and repetitive, is cutting/pasting text from within an individual frame; this way there are no .html artifacts to interfere with my document editor. It takes time, but is still so much faster than starting from a .txt file that it isn't funny.
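
For the raw .txt case, the two most tedious chores described above (rejoining hard-wrapped lines and turning marker characters into real formatting) can at least be roughed out with a short script. The following is a minimal sketch under stated assumptions, not a finished converter: it assumes that blank lines separate paragraphs, that `*word*` means bold and `_word_` means italics (only one of several Usenet conventions, so check what the author actually meant), and it writes simple HTML tags on the guess that the result will be opened in a word processor that imports .html. All function names are made up for the example.

```python
import html
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines; blank lines are assumed to separate paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [" ".join(line.strip() for line in p.splitlines()) for p in paragraphs]


def markers_to_html(paragraph):
    """Turn *text* into bold and _text_ into italics (an assumed convention)."""
    # Escape &, < and > first so the story text cannot be mistaken for tags.
    out = html.escape(paragraph, quote=False)
    # The lookarounds skip markers inside words, e.g. the underscore in a_b.
    out = re.sub(r"(?<!\w)\*(\S(?:.*?\S)?)\*(?!\w)", r"<b>\1</b>", out)
    out = re.sub(r"(?<!\w)_(\S(?:.*?\S)?)_(?!\w)", r"<i>\1</i>", out)
    return out


def txt_to_html(text):
    """Convert a whole hard-wrapped text into one <p> element per paragraph."""
    return "\n".join(f"<p>{markers_to_html(p)}</p>" for p in unwrap_paragraphs(text))


# Hypothetical sample, invented for this example.
sample = "She had a sense of\n_deja vu_ as the door\nopened.\n\nIt was *not* a dream.\n"
print(txt_to_html(sample))
```

A script like this only handles the mechanical part. Anything the pattern doesn't anticipate, such as markers that span a paragraph break, a different marker convention, or foreign words missing their accents, still needs a human read-through, and it should be run on a copy, never on the original download.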

Formatting aside, stories are posted in different ways. Sometimes the entire document is posted at once, sometimes it is posted in sections. Depending upon the host site, files may be limited in size, with larger files having to be broken down into parts. If posted via a mailing list, short stories may be posted complete, but longer works will be split up. If the author's mailing list has a host site with file storage capabilities, he may store the complete story as a single document at that site, and that single document will be what gets posted at other sites that can handle files that size. That's how Morpheus does things. Usually. He used to post his stories as serial emails to his Yahoo! Group, then post the complete story as a single file at BigCloset. Recently he's posted them at BigCloset at the same time he's sent them to his mailing list, and not posted a complete doc at BigCloset when done. The Academy was the last story in his Were universe that he posted at BigCloset as one file. The next, Touching the Moon, was posted in 62 parts! If I'm creating eBooks to read in a care facility, I don't want to have 62 eBooks to read one novel, that's just not going to happen. FictionMania readers were lucky, he posted it as one document there, but since it was done as a .txt file, it has no formatting and lots of line feeds. Touching the Moon was a straightforward cut/paste of 62 text blocks into an .odt doc; .odt is LibreOffice's default document format. However, I don't have an .odt document of Touching the Moon. Rather, I have one .odt doc of all the Were universe stories to date. With a cover page, a title page, a table of contents with internal links to each story, and at the beginning of each story, right under the title of the story, links to the files at FictionMania and BigCloset. And an About the Author section at the end, with a link to the copy of a chat session interview with him stored at FictionMania.

I've created a number of documents like that. And using Calibre, an eBook management/conversion program, I've created eBooks from those documents in a number of file types, for ease of reading. I've also had the thought that after all the effort involved in creating these documents, it would be nice if it benefited more than just myself.  Since I don't own the rights to the stories, I can't distribute them without the permission of the author. The author may prefer to handle distribution themselves; while I did the packaging, I consider the documents their property to utilize as they see fit. If they want to sell copies, fine by me. If they want to make them freely available, well, that's pretty cool.

I've only contacted one author about this so far. With very positive results. With the permission of The Professor, I've uploaded eBook versions of The Complete Ovid Stories to the Internet Archive. While I haven't looked into what would be involved in making them available through the Nook and Kindle storefronts as free eBooks, I have The Professor's permission to do so, it's just a matter of working with those sites to make it clear that while it is not my intellectual property, I have been authorized to act as The Professor's agent in placing copies in the wild.

I've got to say I feel pretty good about this. While the majority of Internet Fiction is drek, Sturgeon's Law holding true, there's some pretty good stuff that risks getting lost when the host site closes, as happened when EWP went under; in that case we were fortunate that the Internet Archive's Wayback Machine had archived the site, and that someone checked while we still remembered the URL of EWP, since the Wayback Machine indexes by URL. Unlike the print publishing industry, where using your legal name as the author is the norm, Internet Fiction is almost entirely published under pseudonyms. The heirs to print industry authors generally know that so and so is an author, and what he's published, and can take action to keep those items in print, so that they get the revenue. The vast majority of times, Internet Fiction authors' relatives have no clue that they write Internet Fiction, nor how to obtain access to their accounts; I was fortunate that twenty years after the last Ovid story was posted, The Professor was still monitoring the message board at FictionMania, and answered my message asking if anyone knew how to contact The Professor, if he was still alive. The last person I knew to be in contact with him, PS, in 2010, hadn't posted at BigCloset since 2013, and the last contact Angharad had with PS had been several years ago, when he was in ill health. In the print community, publishers generally find out when their authors die. On the Internet, unless someone in contact with them outside the Internet finds out and posts the information, an individual could be dead for decades and no one would know it, they'd just know it had been a while since they'd been heard from. Without knowing legal names, you can't search for obituaries or go through the Social Security Death Index. 
This can sometimes be disastrous for an Internet community, when the person managing the web hosting dies and the first anyone knows is when the site is shut down for non-payment of maintenance fees. BigCloset, The Crystal Hall, and Stardust all had that start to happen to them, when Bob Arnold died. He'd not only handled hosting those web sites, the server's physical location was his home. When the power was turned off, the sites went black. In this case, his family knew what he had been involved with, and approved, which is pretty incredible since the primary genre posted to those three sites is Transgender Fiction; Bob wasn't Trans himself, but had an interest in Transformation and Gender-Bender fiction, which heavily overlaps TG fiction. The admins at BigCloset were able to contact Bob's family, and arranged for the power to go back on, and then purchased and relocated the servers. Stardust was Bob's baby, and is being maintained in his memory. The Crystal Hall has since set up shop on its own, but maintains close ties with BigCloset. But if Bob's family hadn't approved of what he was doing, and if Erin and the other admins at BigCloset hadn't known how to contact them, all three sites would have been lost forever, along with any stories not backed up elsewhere. BigCloset has set up a corporation to administer the site, so there won't be one key individual whose loss will bring it down. BigCloset also has a memorial wall, listing the names of those members they know have died. There is a forum thread at Beyond The Far Horizon dedicated to information on the status of authors, but I don't know what arrangements Gina Marie Wylie has made for maintenance of the site when she becomes unable to do so; she's already had to change the domain type in the URL because someone sniped the domain renewal on her. 
Stories Online, and its sister sites, Fine Stories and SciFi Stories, are managed by World Literature Publishing Company, but as far as I know that organization is wholly owned by Lazeez Jiddan, and I don't know what arrangements he's made for their continuation when he's no longer up to it; he's a very hands-on sysadmin, with lots of hand coding of the site infrastructure, and it would be very difficult for someone to come in cold and keep it going.

Potentially, I could be making master documents and talking with authors about getting them archived for a very long time. I'll keep making the documents since they meet a need that I have. And I'll keep offering them to the authors because it would be criminal, in my mind, to keep the results of that effort to myself.

This isn't the first time I've done something like this. At my Academia site are stored .pdf files of

Di Grassi his true Arte of Defence modernized v1 2

Vincentio Saviolo, His Practise in two books, modernized typeface, annotated vocabulary

which I produced several years ago. The copies available were all unmodified images of the original publications, and man, were they hard to read. So I ran them through OCR, corrected all the OCR errors, annotated them, and created new .pdf files, and posted them to Academia and spread the word through the SCA Rapier community, and also the HEMA community. I didn't update the spelling, they're a strict transcription formatted to match the original. The one major alteration was replacing the illustrations from the Di Grassi English edition with those from the original Italian edition, which were much better illustrations, which I did at the suggestion of one of the HEMA types, who provided the URL for the images. I should probably get them uploaded to the Internet Archive as well. I need to modify them anyway, my email contact information in them is now incorrect. Addendum, 8/29/2017: OK, I hadn't looked at the text of the fencing manuals since I created them in 2013, so I was in error. I did standardize, and modernize, the spelling, and in some cases, the words themselves, substituting modern equivalents where the intended meaning was no longer what the word means in Modern English. I'm currently creating a non-normalized EModE version of Saviolo, and plan to then create a normalized version, which will make them of use to researchers who want them in the original language. I need to revise the modernized text, as I've found places where I misread the original text the first time through, and I want to rethink EModE/Modern word equivalencies.

2017-07-28

Yandex Image Search and Google Image Search

Been a long time since I last posted. Can't say it'll happen more frequently, but this is a start.

https://yandex.com/images/ https://www.google.com/imghp

Both allow you to do a standard text description search. Both allow you to search for an image based upon one for which you have a known URL. Both allow you to upload an image to search. It's when you get to the results of the search that things differ.

I'm going to use the following test image which I uploaded from my computer; in this case I know the precise URL where I found it, although Calibre renamed it when using it to add a cover to the .rtf version of the book in my possession. Incidentally, the book is well worth reading.

I know, a pretty plebeian image, but since I don't have this blog set up behind a 21+ firewall, my choices are limited. If I did, I've got an image that I downloaded at least ten years ago, where I didn't record where I found it, didn't know the name of the model, who took the photo, or where it was initially published; I didn't know a thing about it other than that I thought the model was good looking. Google couldn't find anything like it, but Yandex found the precise image at a bunch of sites, including a site with an entry on the model containing six photo shoot collections of around 90 images each. I now know the model's first name, and went from three images to way too many, but know nothing much about her since the site Yandex found didn't provide that information; the site is natively in Russian, but has a drop down list of other languages to display in, including English.

Moving right along...

First, the Google search. Rocinante cover, Google Image Search results 
Second, the Yandex search. Rocinante cover, Yandex Image Search results 

This was, perhaps, too easy an item to find. I may have to try this again with something more obscure.

Google didn't find the same resolution image, so it didn't declare a winner. Its best guess as to the identity of the image was spot on. The first site it listed was the actual source site. The related images were alternate cover images for the book, which indicates Google searched for related images by their best guess title, rather than items which featured things that looked like the submitted image. The first four sites listed as having matching images did, indeed, have matching images, while the final two sites were completely bogus. Only one of the sites which had a matching image was not owned by Wes Boyd; that site looked to be of interest to me, and I've now subscribed to their mailing list.

Yandex was much more confident about saying it had a match. It immediately offered a list of different resolutions for the image, with links to those images; this is something Yandex does for every image you select from those displayed in their search results, and I find this very useful. The related images section didn't come up with the alternate covers of the book, but instead images of aircraft similar to the one on the cover. This indicates their related image search is based upon an analysis of the submitted image to determine the main topic of the image, rather than the item the submitted image had been linked to. This is an important difference, and should be borne in mind when deciding which search engine to use. All six of the sites listed as having matching images did. All six sites are owned by Wes Boyd. Google didn't find as many sites owned by Wes Boyd, but did find the image at someone else's site.

Neither found the entry at LibraryThing. Goodreads had the book listed, but showed one of the alternate covers; it was the only Wes Boyd book they listed where that was the case. FictionDB had the alternate cover. The Google Books entry didn't show. A whole bunch of others didn't show, including Nook and Kindle eBook stores.

Now to try again.
This is an image of the map of Middle Earth included with one of the hardcover editions of The Lord of the Rings published by Allen & Unwin lo these many years agone. 


Google, again, wasn't sure about its identification, but its best guess was pretty good. The two sites they list before showing related images were sites I already knew about as primo Middle Earth fan projects. The related images were spot on, all being similar maps of Middle Earth. Google then goes on with a bazillion hits for sites with matching images, I mean pages upon pages upon pages, leading off with five articles on the find of a copy of the map hand annotated by J.R.R. Tolkien himself. And where possible, a small thumbnail of the image at that site appears to the left of the listing.

Yandex, again, was sure of its identification, and offered a variety of resolutions for the image. The related images were spot on. The sites listed as having matching images aren't organized the way Google's are, which may be good or bad; after all, the first five sites Google listed had basically the same information, while Yandex leads off with a Korean language Middle Earth fan site rich with maps. Of course, I didn't know it was Korean, and the translating software used by Chrome doesn't tell you what language is being translated from, which is a grievous lack, and the translated site didn't have anything saying it was based in Korea, except that in the About page it did list a problem at one time with the Palgong Port interface, and a search on Palgong determined that it was in South Korea. Yandex also includes a thumbnail of the image as part of each site listing, and continuing their focus on resolution, lists the resolution of the image at the bottom of the thumbnail.

Google is very good at finding information in your language, and geographically close by. This is because of all the information they collect about you, as the Internet Conspiracy Theorists rant about all the time; I think it's cool, I generally get better results because of it. But there are times when that isn't what you want. It wasn't until the end of the thirteenth page of results that Google listed a non-English language site; Yandex led off with one. However, Google did have those pages upon pages of sites, while Yandex only lists forty-three sites. And both of my example search objects were non-obscure; as I related at the beginning, I had an obscure Adult Model image in my collection that Google didn't have a clue about, that Yandex, given their far more aggressive delving into former Soviet countries' resources, found.

If your interest lies in finding different resolution images, foreign language resources, or obscure Adult Model image information, Yandex is definitely the search engine to use. If you want localized information stick with Google, that's where they put their focus. There are other image search providers out there, but I haven't tried them out; it could be well worth your time checking them out, as I suspect each has differing strengths and weaknesses, and with proper investigation you would be able to select the best search engine for your specific research project. I know I'll be switching back and forth between Google and Yandex, just like when I'm looking for used books that were published in Scandinavia I search Antikvariat.net rather than AbeBooks, you choose the proper tool for the task at hand.

2013-04-29

Brocade Covered Tourney Chest










This chest was made by my father, James Mead [SCA: James Addison the Lame], lo these many years ago; approximately 30 years ago, I'd guess. It is not based upon any historical designs that I'm aware of, but strove to fit into the medievalesque ambiance of the general SCA encampment.

The top and bottom are 1/4" plywood panelboard, with smooth finished backsides. The four sides are 1/4" pegboard. Each of the four sides is covered with brocade fabric, curtain fabric if I remember correctly, from Goodwill.

After the fabric was attached, the front and back were framed with 1x1 finished wood. Same with the top and bottom. The handles are backed on the inside by steel plates, massively strengthening the pegboard. The two sides have wider 1x hardwood framing. The front and back fit between the two sides. The bottom fits over the four sides, as does the top. This chest has an internal division, and a removable tray in the left partition. The top is attached by two hinges, with a strip of leather to prevent the top from opening too far. Other than the hinges, everything was fastened together with glue and panel nails.

This chest is surprisingly sturdy, given the materials it is made from. Lightweight, too.