2017-11-02

Guaconics: a new field for socioeconomic study

I seriously doubt that this term existed before it came to me while I was half asleep this morning. I have no clue if anyone has started focusing on research in this area, although I have read articles in The Atlantic which directly impinge upon it.

Guaconics is the study of the socioeconomic interrelationships of Internet Access and Social Media. Short for Group Union Access, it studies the impact of 1) How you usually access the Internet (do you own your access to the Internet, or use access points such as the public library? If you own your access methods, do you predominantly access using mobile computing, or a desktop/laptop? Within those, which OS family do you predominantly belong to: iOS, Android, OS X, Windows, Linux? What browser do you use the most?) 2) Which Social Media providers do you have accounts with, and within that, which ones are you most active on? 3) If you have a Blog, which provider is it with? 4) Which Information Aggregators do you use? 5) Which Original Article publishers do you most rely upon? 6) Which Satire sites do you frequent, and do you realize they are satire? When you see something shared from a Satire site, do you recognize that it's satire, or do you think it's factual? 7) Do you check the sources of information you come across on the Internet as to whether they are impartial, or if they are putting an ideological spin into the information they provide? Do you access the sources for the information they provide, to see if they are providing a valid interpretation of the source documents? Do you investigate the validity of the source documents?

What got me started thinking about this was 1) An article in The Atlantic which discussed the impact that Facebook's like/share algorithms, which determine what shows up in your feed, had upon the 2016 Presidential Election, and 2) theSkimm only providing access to their added content via an iOS app, The Week only providing an iOS digital subscription option, and Instagram only providing full functionality to mobile users; you can't post pictures or download pictures from a desktop, only from mobile platforms.

I'm a late joiner, when it comes to Social Media. I just tried to determine when I joined Facebook, and found that this wasn't information they provided as part of your account information, but I know it was after I moved to Tacoma, WA in 2012; I only joined Twitter this year. I started this Blog in 2008, but after an initial period of high activity, for a great many years I posted little or nothing at all; Blogger is a less useful platform for Blogging than some of the competitors, such as Wordpress, but I'm reluctant to change platforms. I came late to cellphones, first obtaining one in 2001, when I transferred my father's cellphone account to my name, and didn't get a smartphone until a couple of years ago, when I was forced to upgrade my existing phone because its communication interface was no longer supported, and saw that I could get an Android smartphone contract for only $5.00/month more than a non-smartphone. To date, I've gone online via my smartphone fewer than five times, and those were all cases where I was blocked from accessing the Internet via my Windows 10 Desktop, and needed to contact Microsoft Support. Yes, I'm an aberration within modern society. That was already clear, since I've only posted one picture of a cat on social media; the number of cat pictures I just saw while looking through Instagram was significant, from many different sources. I installed iTunes this year, and have yet to add anything to it.

Anyway. The article in The Atlantic made me aware of something I'd been peripherally aware of, but it hadn't really sunk in. There are a number of Facebook friends where many posts show in my feed, and a number where I no longer see any posts. I'd just assumed it indicated that, like myself, they didn't post that often. Not so. Your Facebook feed is filtered based upon what you like/share, to increase the number of like/shares; this ties in with Facebook's revenue source, as it allows for closely targeted advertising. So Facebook did this purely from a revenue generating perspective, and hadn't really considered what impact it had upon people interacting with people of differing viewpoints, and how people access information sources. Neither, really, had anyone else, until the aftermath of the 2016 Presidential Election Campaign, when people were trying to determine how the Left had so misjudged what the results would be.

[Now we move away from Guaconics, to commentary on our current society. As usual, I can't stay on track to save my life.]

Not entirely true, as it becomes clear that certain Conservative PACs had looked into this, and had seen just how much it could multiply the impact of their advertising expenses; they could zero in on those most likely to respond positively to their message, while keeping their opposition completely in the dark about the fact that they were doing any advertising at all. Not to say that the Left wasn't doing the very same thing, but nowhere near as well; the Left is nowhere near as unified as the Right, it appears.

All segments of the Left seem to have something in common, the belief that you can effect social change through legislation; it is true that you can change how people behave in public based upon fear of punishment, but that is coercion, not a change in how they really view the world; anyway, while all segments of the Left share this belief, they are fragmented as to which of the social changes they would cause via legislation is the most important.

The Right, although having some disparate philosophies and ideologies of their own, is united in their viewing social change forced by legislation as infringing upon their personal liberties guaranteed under the Constitution, and was able to unite behind a candidate who promised to repeal as many regulations as possible; that there might be real negative impacts upon them as a result wasn't as important as that the legislation had been enacted without their consent; the Left focuses on the negative results of unrestricted personal liberties, in that an individual's freely made actions can negatively impact others, while the Right focuses on not wanting their actions restricted any more than absolutely necessary for the functioning of a society.

This is not to say that there aren't those grouped within the Right who would legislate social change themselves if they had the power to do so, enacting legislation to institutionalize things based upon their own strongly held beliefs; they just don't view legislation based upon their interpretations of Holy Scripture as limiting individuals' personal liberties guaranteed under the Constitution, as God's Laws supersede Man's Laws.

Both sides are blind to the irony of its being acceptable to enact legislation forcing others to act outwardly in accordance with their own beliefs while resisting legislation that would force them to outwardly act in accordance with someone else's beliefs.

And, there are those who would deny being part of the Right, who support their legislative actions to repeal regulations because they oppose regulations in general, as an unconstitutional restriction of personal liberties, but oppose legislation by the Right to enact regulations, for that very same reason. They don't want anyone enacting regulations to restrict freedom of choice, that would impose others' beliefs upon anyone. They have a far rosier view of human nature than I have, in regard to the society they believe this would result in; I've read some Libertarian SF, and it flatly contradicts the historical record of what life was like prior to the regulations they oppose being enacted.

Which gets to where I stand in regard to these issues. Which is to say, with none of these groups, while in some ways with all of them.

I will not argue as to whether anyone has the right to impose their beliefs upon others. I will merely state that any examination of the history of our species shows that once you get a group of any appreciable size someone ends up being in charge, given dominance by those who support that individual's agenda. Many different methods of selecting the individual(s) in charge have been tried, with varying results as to their competency. The period of time a given individual is supposed to be in charge has varied, from a few scant months to life appointments. In all cases the individual purportedly in charge has only remained in that position so long as they have had the tacit approval of the populace as a whole, and the active support of the economic leaders and the civil/military law enforcement structures; while it helps to have the support of the various political organizations, given the active support of the economic magnates and the armed forces, that can be dispensed with. Tacit approval of the populace as a whole translates into their not being in armed revolt; other resistance is ultimately futile given the support of the Captains of Industry (or their period equivalents) and the Armed Forces, and given solid enough backing by the military, even the Captains of Industry can be dispensed with. I admit this is a very dark perspective on our history, but I also believe it is accurate. It is a reality that only those interested in having power over others will seek to attain a position which gives them power over others; no one aspires to management who doesn't desire the authority to see that things are done the way they think they should be done. The greater the amount of authority required to implement their vision of how they think things should be, the higher in the power structure they will seek to be.
This applies equally to those whose vision is simply that they want to be the one giving all the orders, as to those who have a grand vision of how society should be structured for the benefit of all of its constituents; they all have a vision of how things should be that requires them to obtain a position of dominance to effect. As a result, power over others accrues to those who seek to have power over others, and can obtain the support of the established apparatus for selecting those in positions of authority; a military backed coup is an established method of selecting those in a position of authority, and is probably more faithful to the origins of civilized society than anyone is comfortable acknowledging. It definitely is a more accurate description of the founding of the United States of America than you will find in the history books; it speaks very highly of the leadership of the American Revolutionary forces that they actively sought to establish a system of checks and balances upon themselves and their successors to prevent the development of a situation such as they had found themselves in, where there was sufficient popular support for the armed overthrow of the established government that it was successfully attempted. As it was, their original vision was nowhere near so broad as has been attributed to them in the years following. Yes, they established an elected representative leadership; read up on who initially was granted the franchise, and they won't seem quite so enlightened as they are made to appear. Trust me on this (DON'T! DO go read up on this!), if those same criteria were in place today, adjusted to reflect inflation, the vast majority of the current electorate would not be enfranchised.

On military support for those at the top of the structure. So long as the military is composed of a representative sampling of the population as a whole, it is unlikely to support the violent suppression of established civil liberties, if those civil liberties have the support of the population as a whole. If it ever comes to pass that the military is composed primarily of the adherents of one faction within society, it will become amenable to suppressing the civil liberties of those that faction disagrees with. This, in and of itself, is an argument for a universal draft of all eligible citizens to serve time within the military. An all volunteer military is much more amenable to the long term cooption of its leadership positions by members of one faction than a military constituted upon universal conscription; while it can be, and has been, accomplished in both situations, it is much more difficult to do with universal conscription, provided there is a means for promotion from the lowest ranks to the highest, such that it doesn't develop an effective caste system, where upper leadership is drawn from a pool that preselects for a desire to be in a position of military leadership and can afford the cost of attending a military academy. This is not to deny that service families have been the backbone of every military within the history of our race, but where you have service families, their values can, and most likely will, diverge from those of mainstream society; really, they have, or there wouldn't be distinct service families as opposed to the rest of the citizenry; just how those values differ from the majority populace is a crucial datum regarding their willingness to cooperate with an overthrow of the established civil authority, either by cooperating with those at the top not stepping down when they are supposed to, or assisting someone else in supplanting those currently at the top.
Universal conscription is also a more effective method of exposing your citizenry to those of different economic, educational, and cultural groups than any other, provided that military units are not made up of homogeneous groups. By having your citizens serve alongside a representative sampling of the population as a whole they are given an exposure to those outside of their self-selecting socioeconomic polity, with, hopefully, beneficial results in regard to how they judge those members of the other socioeconomic sectors of society at large; of course, they may just have their pre-existing beliefs confirmed, but at least then they'll have some solid basis for their beliefs based upon personal experience in a semi-level playing field. At the very least, it should show them that there are hard workers and slackers within all segments of society. And the disdain for this service expressed by a significant number in the liberal left has to stick in the craw of those who volunteer for military service, for whatever reason. While it is very arguable that a significant percentage of the military actions we've been involved in since the Second World War didn't turn out all that well, that is not, ultimately, attributable to the military, but to their civilian overseers: the rules of engagement forced upon them, the sub-contracting to private firms of things that should have been left to the military, the various constraints upon their actions that were purely political rather than operationally necessary, and changes to military hardware dictated by the economic benefit to a given Congressman's home district rather than what represents the best value in actual conflict. You do the best that you can, but it's hard to win a three legged race when you don't have a partner and you're an amputee.

On preferential treatment as a means of righting past socioeconomic wrongs. It will only be successful to the extent that those individuals who are the recipients of that preferential treatment demonstrate that they can do the job. Case in point, speaking from my personal experience with the negative results of preferential treatment established for this purpose. When I worked at the Chicago Public Library, the actual process of ordering the items to be purchased for our collections was outsourced; in other words, we still made all the decisions in regard to what was to be ordered, but the actual process of contacting the various publishers and book jobbers to order, track, and pay for the items once received, was outsourced. When the Chicago City Council approved doing this, they added the stipulation that it had to be contracted to a minority-owned business within Chicago. Well, there weren't any minority-owned businesses within Chicago with experience in this area, and we never successfully ordered a single item via the firm that got the bid; this was in the 1990s, the Internet was in its infancy. They didn't answer the telephone number we were given to contact them, they didn't respond to our written letters. So, we relied upon the fall-back policy that thankfully had been approved; after three attempts on different days to contact them, we could go ahead and process the order ourselves. This is probably one of the worst examples that could be provided of the negative aspects of preferential treatment, in that someone who proved themselves unable to do the job was given the job purely because of legislated social engineering. If they prove themselves capable of fulfilling the requirements of the job on a day to day basis, I have no problems with preferential treatment as a route to eliminating the impact of past discrimination.
Speaking purely as a manager, I'd prefer to get the applicant who will do the best job possible out of those applying for a given position, but so long as they discharge the listed job responsibilities, I'll go along with preferential treatment as a method of helping people out of an economic situation forced upon their ancestors, where it negatively impacted their ability to get the background necessary to do the job; given the discrimination my father experienced, and in his case it was based upon his having his right leg amputated below the knee, I recognize the reality of what discrimination can do; dad was refused a promotion because he was an amputee, and was then asked to train the individual who got the job, because he was the best qualified individual to train him in the responsibilities of the job. They said so to his face. It was an accounting firm. How does his missing the lower part of a leg impact his ability to do accounting, or to supervise others? Obviously it didn't, since they wanted him to train the individual who got the job. This was in the 1950s; today the ADA would prevent them from being that open about why they made their decision; it still happens, but they have to be much more subtle about it. Dad wasn't having any of it, he quit then and there. Fortunately, there was a market for experienced accountants at the time, so he wasn't unemployed for long. What drove dad to where he quit wasn't that someone else got the job, but that a less qualified person got the job based upon non-rational discrimination, and they acknowledged that this was the case to his face, and then added insult to injury by asking him to train the person who actually got the job.
My experience, and my father's experience, is why my support for preferential treatment as social engineering has the caveat that they still have to be capable of doing the job; it does no one any good to be hired over someone else and then end up being fired because you can't actually fulfill the responsibilities of the position, and I have to say, again speaking as a former manager, that it's a lot harder to fire someone from within a group being granted preferential treatment than it is to fire someone not in such a group, because of the level of documentation required to withstand charges of discrimination; it's not enough to document that the specific individual isn't doing the job, you have to document how everyone in equivalent positions is doing to show that, in fact, they are the one doing the poorest job of all, and that it falls below the minimum acceptable performance levels, and that no one who is being retained falls below those levels; in an ideal world this level of documentation would exist as a matter of course, but the reality is that no one can afford the time and effort involved, no one is staffed at that level. I've seen it successfully done once, at the Chicago Public Library. In that case it had a positive impact upon another worker who had been hired for similar reasons, but who was discharging his responsibilities in good order; his self-esteem went way up, because he truly internalized that if he wasn't doing the job, he'd have been fired; seeing this other chap get fired for not doing his job was the beginning of an incredible change as he realized that he really was valued for what he was doing, and he started actively improving a lot of things in his life. It was amazing to watch. It's been twenty years since I've had any contact with him, and I've forgotten his last name, but it's stuck in my mind how his realizing that he'd held onto his job through merit really turned his world view around.
Which does point out the major problem with preferential treatment, no matter who is getting that preferential treatment: how the nagging question of whether you really measure up has to wear at those who know they have received it and aren't sure they are truly worthy of the position they hold; of course, in general, the only ones that's going to bother are those who do measure up in the truly important things.

The other thing that impacts my feelings on the above matter is this. I'm a retired librarian. I'm a white male. For a very long time, in libraries, you had a greater number of female librarians than male librarians, yet the Library Director was always male. And he might not even have an MLS. Case in point, again at the Chicago Public Library, there was an individual who the City Powers That Be wanted as Library Director, who didn't have an MLS. Now, this wasn't going to fly, because Illinois State Law requires Public Library Directors to have an MLS. So they hired someone else as Library Director, and created an assistant position for the guy they really wanted in the job, and gave him all the actual responsibilities; needless to say, paying for two top administrative positions rather than one loused up the Library budget big time. The guy was an ass. He didn't have the background for the job, and really fucked things up. I'm not going to name him, anyone really interested can do the research and find out. I will name his immediate successor, because she's now the Librarian of Congress, and did an absolutely bang-up job during the time she was in charge at CPL; it was only for a couple of years, then she moved on to bigger and better things, but Carla Hayden is one sharp cookie, who impressed the hell out of me. She's the first female Librarian of Congress. She's the first African-American Librarian of Congress. She's the first Librarian of Congress in over sixty years to actually come from a background of working in libraries. Those of us who have had the experience of working for her, however indirectly (hey, I was an L2, which is the lowest supervisory Librarian position at CPL, I never had any direct interaction with her) are very, very pleased at this, because we all think she's done a great job.
And if you look into her background, she's worked her way from being a front line Children's Librarian to where she is now on merit; while there may have been some preference at times, she always excelled at the job once she had it. Yes, she was appointed while Barack Obama was President, and she is friends with the Obamas, but she was also best qualified for the job amongst those being considered. Major caveat, she was able to do all of it on merit because her parents were well to do and she could afford to go to the good universities. So in her case, all being part of a "preference" group did was wipe out discrimination due to her being female and Black; once given a level playing field, she did it all on her own abilities. Which, ideally, is what preferential consideration as part of social engineering is supposed to do; give people the opportunity to show they've got what it takes while attempting to undo the impact of generations of deliberate discrimination, which, when it works, helps to prove that it was indeed discrimination rather than the discriminated-against group's innate lack of ability in that field, since, when given the chance, they got the job done.

Oy. I've been working on this for hours. As usual, it has wandered very far astray from where I started.

2017-10-28

A modest proposal for Copyright Revision

I just had an idea concerning copyright. I don't know if anyone would be happy with it, but here goes.

So long as the holder of the intellectual property rights keeps the item in print in a current format at a price such that it is accessible to the general public, they maintain their copyright. Forever. Hardcopy always counts as a current format. However, if they are not prepared to produce a quality digital version, or whatever is the current technological standard, and someone who is prepared to do so contacts them, they are required to negotiate a fair price for the format specific rights to the item concerned. If the entity producing the format specific edition ceases to provide it, the format specific rights, including the master copies for those versions, revert to the current intellectual property rights owner for the source item, unless said owner has, through failure to keep the item in print as specified below, lost their copyright, in which case the format specific edition becomes public domain; it does not become public domain so long as the holder of the rights to the format specific edition keeps it in print as specified below, even if the intellectual property rights holder loses their copyright through failure to maintain in print status. Similar clauses hold true in regard to foreign language editions, and adaptations into other formats not specifically mentioned or existent at this time. In short, so long as the intellectual property rights holder actively seeks to make a return on their investment, it remains theirs, provided they make arrangements with those prepared to transform the item into other formats that the intellectual property rights holder is not themselves interested in marketing the item in. If it remains available to the public, at a price that is considered fair given production costs and the need for a reasonable profit, and they are making available general purpose editions, not just collectors editions, they retain copyright.

If they allow a five year period to pass with it out of print, they lose their copyright, permanently and irrevocably. It's gone, not theirs anymore, no saving throw, do not pass go, do not collect $200.00. It's now in the public domain. Permanently. That it is available in a format that they have contracted the rights away does not count as their keeping it in print, they have to be actively involved in keeping it before the public for sale. Actions taken on their behalf by their legal representatives if they are no longer functioning well enough to be actively involved do count towards their active participation.
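For those who think better in code, the lapse rule above can be sketched in a few lines of Python. This is only an illustration of the five year rule as I read it; the function name `copyright_status`, the use of a single "out of print since" date, and treating five years as five calendar years are all assumptions of the sketch, not part of the proposal.

```python
from datetime import date

LAPSE_YEARS = 5  # the five year out-of-print period from the proposal


def copyright_status(out_of_print_since, today):
    """Return "retained" or "public domain" for a single item.

    out_of_print_since: the date the rights holder stopped keeping the
    item in print, or None if they are currently keeping it in print.
    (Hypothetical inputs, chosen for illustration only.)
    """
    if out_of_print_since is None:
        # Still actively kept in print by the rights holder.
        return "retained"
    lapse_year = out_of_print_since.year + LAPSE_YEARS
    try:
        deadline = out_of_print_since.replace(year=lapse_year)
    except ValueError:
        # Assumption: Feb 29 falls back to Feb 28 in a non-leap year.
        deadline = out_of_print_since.replace(year=lapse_year, day=28)
    return "public domain" if today >= deadline else "retained"
```

The sketch only looks at one out-of-print spell. Since the loss is permanent, a real version would also have to remember that a lapse had already happened, rather than recomputing from the latest date.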

Once in the public domain, anyone is permitted to produce it for sale, provided that a certain quality level is maintained; public domain does not mean trash is acceptable. Quality has to, at a minimum, match that of the general purpose editions produced by the intellectual property rights holder, in regards to formatting, legibility, viewability, listenability, etc., as appropriate depending upon the type of item. In other words, uncorrected OCR scans are not acceptable, they have to be proofed and formatted to match the original as closely as required to fulfill the purpose the original was created to meet. If illustrated in the original, the new version must be illustrated in a functionally identical fashion; functionally identical does not preclude different illustrations, but they must be identifiably the same subject matter, and where intended to convey instructional information, must be consistent with the original, with the exception that if the item is being updated to reflect changes in practice in the field involved, the illustrations must be updated to match the other revisions where the original illustrations would now provide misinformation concerning current practice; clearly, a facsimile reproduction is not a revision to current practice, and any changes to illustrations should reflect the content of the original.

The five year period for allowing lapses of in print status is, in my mind, very reasonable; if they can't scrape up the funds to keep it in print after that long a break, they aren't going to. The major problems with current copyright are 1) items just not being produced at all, yet still having copyright protection preventing anyone else from making them available, and 2) the demand for the item in formats compatible with the original in other media types not being met. Followed very closely by 3) not being accessible to the general public at a reasonable price; I've had it up to here with items only being available in proprietary versions accessible only to those within Academia or a particular trade, or items where the publisher and author are both down for the count, no one has a clue who the current intellectual rights holder is or how to contact them, or where they just haven't bothered to keep the item in print yet refuse to license it to others at a reasonable price. If you want to maintain copyright, you have to make it available to anyone who requests access at a reasonable price; if you don't, your copyright is void, but so long as you do, you retain copyright; if Walt Disney, Inc., wants to maintain their copyright, they have to keep the item available, if they do that, they keep their copyright. For an informational database, such as EEBO 2, charging a per item access fee is acceptable, provided it is within reason, so long as it is within the economic reach of those interested in the item; if this necessitates a sliding scale of access fees, so be it, provided the scale used passes review. If an article is cited in a source which is legally available, then access to that article by those reading the citing item is mandated.
Since the reason for copyright is to ensure the intellectual property rights holder a return on their investment, charging to access their intellectual property is not only allowed, but strongly encouraged; allowing free access declares it to be public domain, thank you for your donation to society; an exception would be granted to this if free access is granted based upon set criteria, such as being a current student, or retired on fixed or limited income, or having a net income less than x. Making a stripped down or older version free does not void your copyright on the most recent complete version, as they are not functionally the same.

There is no minimum number of copies to be sold in any given time period; so long as it is available for purchase/access, at a reasonable price, you retain copyright; so long as there is no unmet demand for the item, to make it clear, you retain copyright if it is kept available for any demand that may develop; if you keep it in print, but no one is buying, and it's not because you have priced it out of their range, demand has clearly been met, so you maintain copyright; clearly, if this goes on for a while, and you can't foresee demand picking back up within a certain period of time, you might want to consider allowing your copyright to lapse so you don't have to maintain access no one is using; again, copyright is to ensure a return on your investment, when it costs more to maintain copyright than you actually receive via access charges, why maintain your copyright?

If you initially make the item available for free, thus putting it in the public domain, but at a later date produce a "value added" edition, copyright is established for the value added edition only. So, yes, if you come out with a collector's edition, the collector's edition will be copyright to you, even if the general edition's copyright has lapsed or been willingly released. Just for so long as you keep it in print, of course.

Obviously, this would have some tricky implementation stuff. Defining what was a reasonable price for the access granted to the item would be a source of much contention, but since the idea of copyright is to ensure a return on your investment, while at the same time making the item available to those who have a use for it, a compromise between maximizing the return on your investment and access to all who have need of your creation is mandated; no price gouging, while at the same time, a non-negligible amount over your costs of production and marketing/distribution. But I really think something like this would make a reasonable compromise between the current system of copyright covering items that have been out of print for decades, and the desire on the part of intellectual rights holders to maintain copyright for as long as possible; those who exercise those rights, keep them, those who don't, lose them; exercising the rights means making the item available to those interested in it at a price they can afford; it's arguable that when debating what someone can afford, their spending habits should be reviewed to see if they are being good stewards of their resources; if they are not, why should the intellectual property rights holder be made to suffer? And yes, intellectual property rights can be sold, or inherited, or otherwise transferred, so long as access is maintained to any who have an interest in the item.

Kinda radical, kinda conservative. That's me, in a nutshell.

2017-09-17

The Son of More Thoughts on Transcription

In my first post I discussed a philosophy of Transcription. In my second, the creation of a master font document to assist in character recognition, and the concept of initially transcribing into a font that matches the source document, for ease of comparing your initial transcription with the source to see if the individual characters match. In this post I'm going to talk about getting access to your source document.

There is one assumption I'm making, and that is that you are using a desktop computer for this purpose. I can envision using a laptop, but anything without a physical keyboard distinct from the display is right out.

Your source document will come in one of three basic forms. 1) Digitized images of the original, 2) a hard copy of the original; this may be a physical book, a photocopy of the document, or, if you are fortunate enough to be working with the owner of the original document, the original document itself. In the case of working with the original document itself, odds are very good that you will be doing this where they store it, and unless they are providing you with access to a work station, you will be using a laptop. 3) Sound recordings. Sound recordings are a whole nother kettle of fish, if they aren't a sound file, because you will need to have the equipment to play back the media they are recorded on. Well, even if they are a sound file, they may be on an outdated storage format, such as floppy disks, and in an outdated file format. In which case you would need access to a computer of the appropriate vintage, with the appropriate audio software. As time passes, this is going to become harder and harder to do; I no longer possess a computer with floppy drives of any kind that still works, and it's been quite some time since I had access to anything capable of running a pre-Windows 95 program. Anyway, if you are dealing with sound recordings that aren't digitized audio files, you will need the appropriate equipment to play them. I'm not going to go into what all this might entail, at least not in this post, just take my word for it that finding the equipment to playback non-digitized audio recordings may be quite the adventure, if it doesn't come provided with access to the sound recordings themselves. 
However, you would be surprised what equipment is still available, if you hunt around a bit; the online marketplace has made obtaining obsolescent equipment much easier, as individuals who couldn't quite bear to just throw their old equipment away now have a means of finding it a new home, and those who made a business out of obtaining obsolescent equipment from those wanting to get rid of it (heck, sometimes they even got paid to take it away!) for resale to those who needed that equipment to access obsolescent media now have it much better when it comes to outreach to their prospective customers. And, there are those who make a business out of converting audio between different storage media; for a price, you send them your outdated media, they'll send you back the contents on current media. This holds true for all data types, not just audio; if you are willing to let them retain a copy of the converted data, and distribute it as they wish (including selling copies), they might be willing to arrange a lower price, but it would need to be something marketable that isn't under someone else's copyright.

Digitized images of the original: In short, a computer data file. Hopefully, this will have been created recently enough that it is in a current file format, and on current storage media. If it's in a non-current file format, you will need to either obtain conversion software so you can convert it to a modern file type, or software capable of displaying the contents of that file type. If it's not on current storage media, we're back to the problem outlined with audio recordings, of needing to obtain the equipment necessary to read the storage media and file type. For my purposes in this post, I'm going to pretend that your source image is in a current file format, stored on modern equipment, such that you can view it on your main computer's monitor. In some cases you may be allowed to download the images to your own storage media, in other cases the source site may not allow downloading (and may have installed the appropriate scripts to disable mouse right-clicks from pulling up a context menu), and you will need to keep an active browser window open to their site. Of course, their not allowing you to download a copy of the image should raise the question of whether you have their permission to create a transcript of the document. If it is a unique document, you really need to contact them to seek their permission to create a transcript from it; while the original document may be out of copyright, odds are real good that the image they won't let you download is in copyright, and modifying the image, which includes transcribing the contents, requires their permission. In writing. One can argue fair use for transcribing a small portion of the information contained in the image, enough for a quote in another document, but a complete transcription is right out without their permission. If they allow you to download the image, but require permission to use the image in a publication, you will still need to contact them about distributing your transcription in any form. 
If it is not a unique document, things get a little bit iffy. But only a little bit. Sure the original is not unique, but do you have physical access to any of the other physical copies? Has anyone else made images of one of those copies available without constraints placed upon their use? If the answer to those questions is No, then you still need to get their permission. If the answer to either of those questions is Yes, then that's what you need to do to access the document if you don't want to contact the image producer about producing a transcript from their image.

[Note: A bit tardy, but I've just emailed the Lord Collection to request permission to make transcriptions from their .pdfs. As with my article on Link Rot, I must practice what I preach.][2017 09 18: Got an email back, it's cool with them. Yay!]

There are online repositories of digitized documents that make their holdings available without constraint, other than not selling what you obtain from them; derivative works, your call, but there needs to be substantive changes made, such as transcribing them into a modern typeface, annotating them, translating them into another language, things that take considerable time and effort, such that you have a real claim on the resulting document. Google Books, the Internet Archive, any agency of the United States Government, in general any State Government agency, Project Gutenberg, to name a few.

Accessing the original document in hard copy.

If it is a published work, now out of copyright, and you own a copy of it in hard copy, you are set, good to go. I would recommend investing in a good document holder, appropriate to the hard copy format, to hold the document open and well displayed while you work from it.

If you do not own a copy of the work, you may be able to borrow a copy via your local library; while they may not have a copy themselves, they could try to borrow it from another library that does, through InterLibrary Loan (ILL). There is a caveat to this, and that is, the less common the item, the less likely that anyone who still has it will lend it out. I worked in the Bibliographic and Interlibrary Loan Center of the Chicago Public Library for three years, I know whereof I speak.

If you don't own a copy, and can't borrow a copy, you will have to go to where a copy is kept. First, you have to find out where a copy is held. For published works, OCLC WorldCat is the best place to start for holdings within the USA, as it is drawn from the cataloging database that OCLC maintains of materials for which they have bibliographic records, and they are the major, although not the only, cataloging database service provider in North America. Outside of North America their coverage is not very good. OCLC has been in operation since 1967, and by now, most libraries in North America have substantially completed their retrospective conversion projects; retrospective conversion is a fancy term for taking the information from your physical card catalog and converting it into information in an electronic database, typically available via the library's online catalog. Pretty much, the only things that haven't been converted are items unique to a given collection, where they haven't been able to afford the time of an original item cataloger to create the bibliographic record. Original cataloging is a lot harder than copy cataloging; copy catalogers have to be very careful, but what they are doing is searching the existing cataloging records for one which matches the physical description of the item in their collection; if they find one, they attach their holdings code to the record, download the record for use in their online catalog, and proceed on to the next item. If they can't find a matching record, they record that fact in a local record of some kind, and move on to the next item. The record of items for which a matching bibliographic record wasn't found will then be accessed by an original item cataloger, when they can afford to hire one; note that point, when they can afford to hire one. Pretty much all libraries of any size have a copy cataloger on staff, to handle their ongoing acquisitions. 
It may not be a dedicated copy cataloger, but someone who does it as part of their duties; my sister, when she was the Children's Librarian in Klamath Falls, Oregon, did the copy cataloging for the Children's Library as part of her duties. But original cataloging is much more time consuming, and requires a very analytical, detail oriented mind set; they have to create a bibliographic record that accurately describes the item in their possession such that it is clear what they have, and how the edition of the document in their possession differs from all other editions of that document. Having worked in ILL for three years in one of the largest public library systems in North America, I have a much better understanding of just how important that is than I did previously. Different editions are just that, different. They differ in formatting of the information contained, the actual information contained in the work can differ between different editions; like, duh, why else would they call it a different edition? Different printings of the same edition can vary in appearance. There are all sorts of reasons why a researcher will need access to not just a specific work, but a specific printing of a specific edition. If you are looking at travelling thousands of miles to do your research, you want to be certain before you pack your bags that the copy of the item held by the repository you are going to visit matches the item you are seeking to research. So good, detailed, anal retentive original cataloging is not a luxury, it is mandatory, and people capable of that quality of work cost. Collections greater than a certain size, who have funding adequate to their needs, can afford original catalogers. Smaller collections, and specialized collections, may not be able to afford to have an original item cataloger on staff permanently. 
What they do is 1) hope their item isn't as unique as they fear, and a cataloging record will be input by another institution that matches the item in their collection, and 2) seek outside funding in addition to their normal funding to hire a project cataloger, someone who will focus all their efforts on cataloging the items unique to their collection, for the duration of their funding. They don't always call these individuals catalogers, sometimes they are called archivists; archivists focus on non-published items such as personal and corporate papers and records, but the basic concept is the same, the creation of entry points to the holdings of the library/archive, such that researchers can become aware of what they have that is unique to that collection, so people will use the materials and justify the expense of preserving them; researchers are also a revenue source, while publicly funded repositories are usually free to access in person, privately funded collections frequently charge for admission to their collections, as a means of supplementing their usually inadequate funding; they are also more likely to charge publication fees for use of the information unique to their collection in publications, said fees generally on a sliding scale based upon expected number of individuals who will access that publication.

And with that last, I've advanced to unique items. Items that are unique to a given collection, because few if any copies were made. While WorldCat's coverage in this area is improving, that's damning with faint praise. This is where you need to have some reason to think that a given collection would have resources relating to your research, before you can search their holdings information. Thankfully, as these collections are able to obtain funding for inventorying of their unique holdings, more and more information about these collections is becoming available via web searches. Also, there are a growing number of organizations such as Archives West, which acts as a portal to the specialized collections of a great many collections in the Greater Pacific Northwest, allowing you to use their front end search software to search a number of specialized collections at once; caveat, due to the variety of materials in these collections, they don't all use the same terminology in their collection descriptions, so you need to try a number of searches using terms tangential to each other to maximize the chances of finding that they have materials related to your research.

Hm. Shifted from transcription to research. Well, looking for a collection that holds a copy of the fairly unique item you want to transcribe is research. And, I have to admit, that's how I've tracked down the items I've been transcribing, searching on the web for items related to my area of interest; I didn't start out looking for Vincentio Saviolo his Practise in Two Bookes, I was looking for historical fencing manuals, and stumbled across the Raymond J. Lord Collection by purest chance. It was only afterwards that I located the various HEMA link repositories that directed there. I mean, the University of Massachusetts does not immediately spring to mind as an institution which would have a collection of historical European fighting manuals. Once you find out about their academic programs, not so surprising.

Well, I did, and didn't, cover what I intended to in this post. It certainly isn't what I'd been thinking about earlier today, which was the physical layout of your transcription area. But it did cover something important; before you can transcribe, you need to have something to transcribe.

It's past time for lunch.

Post this Puppy!

Edit 2017 09 18: Permission received from the Lord Collection to make the transcriptions from their .pdfs.

More thoughts on transcription of documents in odd fonts

What I've done so far, is transcribe directly to a modern font. I just realized, while trying to make out the letters in a German Blackletter volume, that what you should really do is this: go through your font library to find the font that most closely matches the font you are transcribing from, and use that for your first transcription. If the fonts are a good match, you will be able to tell by comparing your transcription to the original document whether you have correctly identified each letter, because if you haven't they won't look the same.

For this, you need an easy way to look at all of your fonts. Doesn't come with Windows. But, there is a software solution. High-Logic produces a couple of font related programs. The one you want to get is called MainType. MainType only has one download, so that's the one you want. There are three license levels available for the MainType software: 1) Free, which limits the number of fonts that you can have it manage to 2500, 2) Standard, which ups the number of fonts to 10000, and 3) Professional, which has no limits on the total number of fonts, but will only display 50000 fonts at a time; if you have more than that, organize them into families, and assign tags, and you can then pull up just the ones you want to look at. There are some other nice things that the standard and professional licenses provide, but nothing that you need at this time, so when you start the program, always select Free version; it will ask you every time you start the program, but hey, they are trying to sell this software to make a living. They really aren't asking much for the standard and professional versions. If you have that many fonts you are doing this professionally.

MainType will merrily go through and index all of the fonts on your computer. If any have been corrupted, it will let you know, and offer to fix the situation; to do that, you would need the Professional license. Not needed. It will list the fonts that have gone bad, and what you need to do is bring up your favorite file search utility (I use Everything, available from voidtools; it's free, and does a very good job of locating files on your computer.), and enter the file name of the affected font(s); not the name of the font, but the name of the file, which will be at the far right of the info on bad fonts. Once you have located the font file, delete it. Do this with all the corrupt font files. You might think you can avoid searching for them this way, since they generally reside in the Windows Fonts directory, but you will find that if you use File Explorer to go to that directory, it brings up the Windows font manager, which will only display active fonts; the font files you are looking for are not active, because they have been corrupted. The Windows font manager just will not show you any files in that directory except active fonts, and you can't bypass it when accessing that directory with File Explorer. So you have to use an alternative file search utility, and delete from its listing of files. Anyway, once that is done, MainType will not bother you about them again. After MainType finishes indexing all your files, it will list them in alphabetical order in a scrollable list, with the font name written in its font. Select a font by clicking on it. On the right of the MainType main window there is a window which shows all the characters supported by that font, arranged in Unicode order inside Unicode groups. You can scroll down this display, and see what the characters are that are supported by the font, and what they look like. 
Using this display, you can go through the fonts installed on your computer and see which is the closest match to the font used in the document you are considering transcribing. If none of them seem close enough, time to go on a font hunt online. Now that I know about it, the first place I'd start is with Typewolf's site. Typewolf is into fonts, big time. He does it for a living. His site has reviews of an incredible number of fonts, and many recommendations for free fonts if you cannot afford, or don't need, the commercial fonts. He also has a lot to say about the various font sites, which are worth your time, and which aren't. So I'd start there when looking for a new font. I'll assume that, working with his advice, you succeed in tracking down an acceptable font in regard to matching the font used on your document.
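If you would rather not install a separate search utility just to track down the corrupt font files, a short script can do the same lookup by file name. What follows is only a rough sketch, not anything MainType or Windows provides: the function name find_files_by_name is made up for illustration, and the folder C:\Windows\Fonts in the usage note is an assumption about where your fonts live. It only lists matching files; deleting them is left to you.

```python
from pathlib import Path


def find_files_by_name(root, file_names):
    """Return every file under root whose name matches one of file_names.

    Matching ignores upper/lower case, since font file names are often
    reported in a different case than they are stored on disk.
    """
    wanted = {name.lower() for name in file_names}
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.name.lower() in wanted
    )
```

To use it, pass the folder to search and the file names reported for the bad fonts, for example find_files_by_name(r"C:\Windows\Fonts", ["badfont.ttf"]), where badfont.ttf is a placeholder name. Look over the list it returns before you delete anything.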

Now do your transcription, using that font. I know, the end goal is to have the text in something easier to read. That's the end goal; right now your goal is to be certain you have chosen the correct character to match that in the original document. When you are all done with the transcription, and have gone through the verification process to ensure that you have, indeed, chosen the correct character in each case, then and only then, but wait, first save your document, and open a copy of it; you don't want to lose your hard work (this should become instinctive after a while). Once you have opened the copy, select all the text, and apply the font you want to have the document in; well, first verify, using MainType, that it supports all the characters needed for your document. There. Done. You have your transcribed document in an easy to read modern font. Save the document; this is the basis for all of your future text manipulations.
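Checking by eye that the target font supports every character in a long transcription gets tedious, so here is a minimal sketch of how the check could be automated. It assumes you already have the set of Unicode code points the font supports (read off MainType's character display, or obtained some other way); the function name missing_characters and the sample text are made up for illustration.

```python
import unicodedata


def missing_characters(text, supported_codepoints):
    """List the distinct non-space characters in text that the font lacks.

    supported_codepoints is a set of integers, one per Unicode code point
    the font can display. Each result is a (code point, name) pair.
    """
    needed = {ch for ch in text if not ch.isspace()}
    missing = sorted(ch for ch in needed if ord(ch) not in supported_codepoints)
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")) for ch in missing]


# Hypothetical coverage: a font that only has the printable ASCII characters.
printable_ascii = set(range(0x21, 0x7F))
report = missing_characters("the ac\u0361tion of the hand\u2641", printable_ascii)
```

An empty list means the font covers everything in the transcription; anything it does list is a character you would lose, or see replaced by a fallback glyph, after changing fonts.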

Now you can proceed in the process described in my previous post.

Post this Puppy!

Edit: 2017 10 12: Removed lengthy description of how to create a master sheet of font characters, replacing it with how to get MainType, and why. Added info on Typewolf.

2017-09-15

Link Rot

Link Rot: The condition of HTML links going bad due to changes in the destination site's url tree.

Link Rot happens. Link Rot Deniers lie through their teeth. OK, I don't really think there are Link Rot Deniers, but there are definitely those who don't check their posted links for Link Rot as often as they should.

I spent six hours yesterday preparing an errata sheet for a web site I stumbled across; don't ask why I did this, I'm not totally sure why myself. And that was just for a quarter of the categories based off of one page of their site. Most of their link pages hadn't been updated since 2011. I had 24 corrections for them; in a couple of cases I couldn't find a current site to replace the dead link, but most of them I was able to provide the current url.

In the process of tracking down current urls, I found outdated links on two other sites referring to the url I was trying to update. So when I found the current url, I informed them as well. I'm not going to name names here, but one of the sites is run by a chap who sends out notices about its existence to the major mailing lists of that interest group on a monthly basis. He had links to GeoCities in his list. GeoCities shut down all operations outside of Japan in 2009, for crying out loud!

The url that led me to his site, the one I was looking for a replacement to, well, the other site knew it was bad, so they had provided a link to an Internet Archive backup of it. Which was a good temporary fix, except, as they noted, it was a music lyric/sound file site, and the .midi files hadn't been grabbed by the Wayback Machine. However, they had enough information about the purpose of the site, which hadn't been provided by the first site, the one that started all of this, that I was then able to find the current site for the organization that all three sites had bad urls for. So there are three sites which, hopefully, will shortly have active links to that organization again, and one site that will, hopefully, have 24 links corrected shortly.

Now, I will admit that I got a bit snarky in one of my emails, pointing out that GeoCities had shut down operations in 2009, which was pretty common knowledge, so there was no excuse for still having a link to a GeoCities site on his link list.

This morning, getting up somewhat later than usual (I finished all of that activity at nearly 2:00 AM), I decided that if I was going to get snarky about other people's Link Rot, maybe I should look at the links in my Blog postings. So I did. Got side tracked a couple of times, but all of my blog posts are now up to date in regard to referring urls. And I updated product availability and price information as well, as annotations, leaving the original information intact, except for turning off invalid product links. With only a couple of exceptions, I changed from linking directly to a sub page to linking just to the home page, and then providing the information needed to use the home page search engine to find the proper sub page. Those exceptions were for sites where, as far as I could tell, they hadn't changed their directory tree schema; they might have added and removed pages, but they hadn't changed the url of an existing, retained, page. It's sad just how few such stable web sites I found. While some url changes are perfectly understandable, such as when your domain owner goes out of business, others reflected the realization that they hadn't put proper effort into their initial web site structure development. Still others reflected organizational changes that required site structure changes to be able to function in a reasonable manner. Anyway, when I found such a stable web site, one that hadn't changed its url structure since I posted links to it in 2008, I sent them messages letting them know how much this was appreciated, and commending their initial web site design initiative for being so successful.
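Going through every link by hand is what ate those six hours, so here is a rough sketch of how the first pass could be automated. It is an illustration only, not what I actually did: all of the function names are made up, it only looks at absolute http/https links in anchor tags, and some servers refuse the lightweight HEAD request it uses, so anything it flags still needs a look by a human.

```python
import urllib.error
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the absolute http/https targets of <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith(("http://", "https://")):
                    self.links.append(value)


def extract_links(html_text):
    """Return the links found in a page, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


def fetch_status(url, timeout=10):
    """Return the HTTP status for url, or None if the site can't be reached."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def find_rotten_links(html_text, status_of=fetch_status):
    """Return (url, status) for every link that is unreachable or answers 400+."""
    rotten = []
    for url in extract_links(html_text):
        status = status_of(url)
        if status is None or status >= 400:
            rotten.append((url, status))
    return rotten


def home_page(url):
    """Reduce a deep link to the site's home page."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"
```

Feed find_rotten_links the HTML of a links page and it returns the candidates for an errata sheet. The home_page helper matches the choice described above of linking to a site's home page rather than to a sub page that may move. Note that a link which still answers but now lands on an unrelated site will not be caught this way.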

I'm not staying up as late as yesterday.

Post this Puppy!

2017-09-07

Thoughts on transcribing historical documents.

When transcribing historical documents, there are a number of potential end goals. 1) a strict transcription: the goal is to maintain all the vagaries of the original document, just make it more readable by using modern typefaces. Next to accessing the original document, this is the most accurate presentation; it's also the hardest to produce, as you have to really look at each character of the original document carefully, and have to fight the urge to say, "Oh, it's that word," and make sure that what you enter is what was actually there. You can't trust the results of your first pass through the document, you have to let it sit a while and then make a second, and maybe even a third, pass through the document. Then there's formatting the results. You can just have a text document, or you can try to make it look as much like the original in formatting as possible; this is harder to do, but better for those who are able to access the original, or images of the original, as it makes it easier to place the transcription side by side with the original, and be able to look back and forth between them. 2) a transcription with regularized spellings for the language at that time: If you are not concerned with the spelling variations inside the original, but do want to read it in the original language, this is best suited to your purpose. Again, you can produce a straight text document, or you can attempt to make the transcription match the original in layout. 3) Transcribing/translating to the modern version of the original language. You have to be very careful here, to insure that you capture the meaning in context of each word; word meanings change over time, the word used in the original may no longer have that meaning in the modern language, so you have to replace it with the modern word that most closely matches the original intent where word meanings have changed; either that, or provide a gloss of the meaning of the word at the time the document was written. 
The previously mentioned methods presume a researcher who is familiar with the original language, and the word meanings at the time the original document was created. This is for those interested in the intellectual content of the original without having to understand the changes in the language. The previous methods have no interpretation involved, no need to really grasp the intended message of the author, it's just typesetting; well, somewhat more than typesetting if you are working with handwritten documents, you have to be able to read the original script, and sometimes that's very difficult; this isn't made any better if all you have to work with is a scan of the original. Here, you have to understand what the author was trying to say, so you can translate it for them into modern language, reflecting the changes in word meanings. This is much more intellectually stimulating for the editor, at this point you are becoming an editor, as you try to change the text as little as possible while trying to create a modern language version. Punctuation changes. Changes in word meanings require the substitution of the closest modern word that provides the original meaning in the context of the surrounding words; you're not aiming at a total recasting, as much as is possible you want to maintain the original phrasing. You need to be a scholar of the subject the author was writing about, so you can comprehend what he was trying to say, so you can make the changes to the modern language while changing his phrasing and meaning as little as possible. You need to understand the subject both as it was understood when the author wrote the document, and as it is now practiced, so that the changes in vocabulary remain true to the original intent while becoming more accessible to the modern practitioner of the subject. 
Not all that much recognition is given to the individual who performs the first two types of transcription, it's strictly character recognition in the first, that and spelling regularization in the second. Here, there is interpretation involved, and that interpretation will be debated. But the desire is still for a document that would reflect the original author's style and phrasing, following the conventions of the author's time, with as little change as possible while remaining true to the intent of the original words. While some words are changed, it should still read as a period piece, not a modern document. It should read as if only the language had changed, not the writing conventions; it should remain faithful in style as well as meaning. This produces a document of use to those interested in period practices who are not interested in the language of the time the original was written, but are interested in how the information was presented at the time the original document was created. It retains the intellectual property of the author. Anything beyond the third is a modern interpretation, a retelling rather than reformatting. You are creating a derivative work, a modern work based upon the intellectual content of a historical document but written using modern conventions. This is not what I do, I don't understand the subject matter well enough to recast it in modern form. What I am now trying to do is create distinct documents reflecting the goals outlined above. First, I'm working from documents that use modern European alphabets; while I have access to fonts for Futhark, etc., that's not my primary area of interest, and I have to be interested in the subject matter, or there's no way I'd put up with the drudgery and monotony of this process. 
In theory, I start by producing a character by character transliteration from the historical typeface to a modern typeface. I generally use Times New Roman, as it's the typeface we are most used to reading, although I'm considering switching to Georgia. The really tricky bit is attempting to retain the original specialized non-alphanumeric symbols. This really comes into play with Elizabethan printed materials, where they will use a special symbol to represent "on" at the end of a word; it's not a character of any alphabet, it's a specialized printer's space saving symbol, and there is no Unicode code point for it. It is close enough to ♁ (U+2641) that I've decided to use that in its stead, with a note explaining the substitution. I'm also using U+0361 to join ct to produce c͡t; ē (U+0113) for “em” and “en” and ō (U+014D) for “on” appear to be exact matches. After I've finished the first version, the character by character transposition into a modern typeface, and verified that it's accurate, I save that as a master copy and create a copy from it to use for the next step, which is producing a document formatted to match the original document. This has its own tricky bits. The fancy woodblock/engraving/illuminated initial letters are beyond my ability to reproduce except by creating an image from the scan of the original and inserting it into my document. This holds true for other illustrations/artwork. LibreOffice is not the best program for doing this formatting, but it's what I have to work with; I have time I can devote to this activity, but I can't invest very much money. Again, once I've finished this document, I save a master copy of it, and move on to the next step, which is producing a document with regularized period spellings. For this I return to a copy of the first master document, pre formatting and image introduction. The trick here is to determine what the standard period spellings are. Where possible I consult contemporaneous dictionaries, to see what was the opinion of the time; the larger the number of contemporary dictionaries I can consult, the more confident I am as to the spelling I determine to use. I temper this by checking to see if there are any authoritative modern works covering the contemporary spellings; I know there are modern Anglo-Saxon dictionaries, and I suspect that there are modern Elizabethan English dictionaries.
I'm not going to go against what modern scholarship has determined unless I think they are all way off base, and that's not very likely. The intent in producing this regularized spelling document is to present what they would have produced if they had computers with spell checkers in the language of their time. Alternatively, software is available which can determine the frequency of words within a document; running the original transcription through said software would enable me to determine which spellings the author of the document most favored, and change the other spellings to match. This may not jibe with what modern scholarship has determined to be the societal consensus, but would produce a normalized spelling closer to the intent of the author. As part of the normalization process the printer's special symbols are transformed back to the text they represent. As a bonus, I'm producing glossaries of words, individuals, places and events mentioned in the documents; what was common knowledge amongst the intended audience may be unknown to the modern reader. If I had to look it up, it goes in the glossary; if I think I knew about it due to specialized knowledge, it goes in the glossary. These glossaries are appended to the end of the document. Depending upon the margins, and if the original text already did this, I might insert text boxes in the margins adjacent to the first appearance of archaic words or word meanings to present their current meanings, as an alternative to replacing them with a modern equivalent. If the original text contains notes presented this way I'll need to find a way of clearly differentiating my notations from the author's notations, to prevent confusion as to who is providing the information; using a radically different font springs to mind, and clearly there would need to be a note concerning this.
The idea of glossing word meanings adjacent to the first occurrence of the word could be used in the modern spelling document, as a means of avoiding changing the text of the document via the replacement of archaic words with their modern equivalents. 
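
The word-frequency approach to regularizing spelling can be sketched in a few lines of Python. This is only an illustration, not the software I had in mind: the variant groups in VARIANTS and the sample sentence are made up, the function names are my own, and for a real document the groups would have to be built by hand from the transcription.

```python
import re
from collections import Counter

# Hypothetical groups of spellings that stand for the same word.
VARIANTS = [
    {"every", "everie", "everye"},
    {"public", "publique"},
]

WORD = re.compile(r"[^\W\d_]+")  # runs of letters, including accented ones


def word_counts(text):
    """Count each word in the text, ignoring case."""
    return Counter(WORD.findall(text.lower()))


def preferred_spellings(text, variant_groups):
    """Map every variant to the spelling the author used most often."""
    counts = word_counts(text)
    mapping = {}
    for group in variant_groups:
        # sorted() makes ties resolve the same way on every run
        best = max(sorted(group), key=lambda w: counts[w])
        for word in group:
            mapping[word] = best
    return mapping


def normalize(text, mapping):
    """Replace each known variant, keeping a leading capital letter."""
    def swap(match):
        word = match.group(0)
        new = mapping.get(word.lower())
        if new is None:
            return word  # not a known variant, leave untouched
        return new.capitalize() if word[0].isupper() else new
    return WORD.sub(swap, text)


sample = "Everie man knoweth that everie publique matter is not every mans matter."
mapping = preferred_spellings(sample, VARIANTS)
print(normalize(sample, mapping))
```

Because the choice is driven by the author's own usage, the result may differ from what the period dictionaries say, which is exactly the trade-off described above.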

I'm not the only one doing this. Not by far!

There are currently a number of transcription projects ongoing in Academia.

The Text Creation Partnership has transcribed a ton of documents from ProQuest's Early English Books Online, Gale Cengage's Eighteenth Century Collections Online, and Readex's Evans Early American Imprints, all of which are restricted access services. The ECCO-TCP (Eighteenth Century Collections Online) texts are available to anyone. EEBO-TCP (Early English Books Online) has two parts: the first contains approximately 25,000 books, available to anyone, while the second part, consisting of 35,000 books, is only available to TCP partner organizations. Evans-TCP (Evans Early American Imprint Collection) is available to anyone. While TCP's main page doesn't go into detail, they do say these are normalized texts, and a quick scan of the word index for EEBO-TCP and browsing the titles for ECCO-TCP and Evans-TCP seems to confirm this; the frequency of variant spellings is nowhere near as great in EEBO-TCP as would be indicated based upon the two Elizabethan Fencing Manuals that I have examined in depth. The counts everie 3811, everye 118, every 419924 just scream that the spelling has been normalized. The counts publique 19417 and public 3171 confirm normalizing to period practice. Given the large number of individuals doing the transcription and creating metadata over a long period of time, the metadata is not standardized; you have to try a variety of terms if searching the metadata, to ensure you find all the texts related to your subject, since they weren't working from a standardized thesaurus of terms with clear definitions. It is clear they didn't get the Library Cataloger community involved. I'm not really in a position to throw stones, as I haven't been referring to either Sears' or LC's subject heading works; I have a copy of Sears, I don't own a copy of LC.

Visualizing English Print is a project that is taking the TCP and similar files and making them more amenable to textual analysis using specialized software. Certain sacrifices had to be made to enable this, which makes their output of no use to those researching period printing practices. All text is stripped to bare ASCII; no umlauts, apostrophes, italics, etc. No attempt is made to preserve document formatting, other than maintaining the same line breaks as their source files. As part of removing punctuation, words were standardized; to wit, fashiond and fashion'd were both changed to fashioned. So some, not all, spelling variants have been removed from their SimpleText output. It will be interesting to see what people do with the result of their efforts.

Smithsonian Digital Volunteers is a project of the Smithsonian Institution to coordinate the digital transcription of a whole slew of documents either in their possession or in the possession of institutions who have joined with them in this project. As they are constantly creating new images of text items in their collections, this is a very long term project. It started in June 2013, and according to their page, currently has 9085 volunteers.

Citizen Archivist is a similar project of the National Archives and Records Administration.

Manuscript Transcription Projects is a list of projects similar to Early Modern Manuscripts Online (EMMO); EMMO is a Folger Library project, and the Manuscript Transcription Projects link page is maintained by the Folger Library.

FromThePage appears to be a transcription crowdsourcing service provider, where individuals and institutions pay them monthly fees to host their projects, and volunteer transcriptionists log in to do the actual transcription. Their fees for hosting projects seem reasonable, and this allows individuals/institutions to have crowdsourced transcription projects without having to set up all the software/hardware interfaces themselves. Clearly, since they charge for this, once a given transcription project is completed the project owner may choose to remove the project from their site and store it elsewhere, which may or may not include making it accessible through the web.

Papers of the War Department, 1784-1800 is a crowdsourced transcription project of the Roy Rosenzweig Center for History and New Media (RRCHNM), which in turn is a project of the Department of History and Art History at George Mason University. There are a number of projects that the RRCHNM has been involved with, which they provide links to. They have also developed some useful Open Source software for use in this type of activity.

There are many more such projects out there; these are merely those from the first page of a Google search on document transcription projects.

If getting involved in this activity intrigues you, determine what your preferred subject matter is and start looking for relevant projects, if you want to work with established collections, or do as I'm doing, which is tracking down .pdfs or other format scans of relevant documents, transcribing them, placing them on my Academia web page and the Internet Archive, and the files section of pertinent Facebook groups that I belong to. Of course, given the source material being out of copyright (which it had better be if you don't have the permission of the copyright holder), you could always attempt to make some extra money by selling your completed transcription project via the various marketplaces. I'm making the results of my labours freely available, because so much of what I'm able to do these days is a result of others making materials freely available; turnabout is fair play.

Since there were a number of professional transcription sites included in the results of my Google search, there is the option of branching out as a Transcriptionist For Hire once you have developed your skills via volunteering with a crowdsourced transcription project. That works just fine by me; it's in the spirit of the Works Progress Administration projects during the Great Depression, where the US Government put people to work on various projects to give them income and teach them practical skills which they could then put to use in the private sector. Mini rant: They should never have shut down the Works Progress Administration, it was successful in all of its goals. The American Association of Electronic Reporters and Transcribers can provide you with information on learning how to do this and getting certified.

I could go on (and on and on) but I think this is enough for now on this topic.

Post this Puppy!

2017-08-02

What I've been doing: Creation of omnibus eStory volumes.

I read Internet Fiction. A lot of it. It's what I spend most of my time doing these days. I expect to keep doing this for a long time. The thing to note about Internet Fiction is that you need an active Internet connection when you are reading. A lot of the host sites allow you to download the stories for off-line reading, but for the free sites, with a few exceptions, the files are text files that aren't very pretty. In fact, some are downright ugly. But if you don't have an active connection, they're better than nothing.

Looking ahead, I can see the time coming where I'm in a care facility. It may not come to that, depending upon how my health deteriorates, but I'd be wise to plan for it in advance. What I need to plan for is being in a care facility because of physical problems, but while my mind is still working. I don't think many care facilities provide Internet access for those in their care. At least, that's the assumption I'm making. It's unlikely I'd have a desktop computer in a care facility, but a laptop or tablet seems reasonable. So, if I'm going to have anything to read, I'll need to have amassed a collection of files, preferably eBooks, of the stories I enjoy reading. While some of the authors who post on the Internet have gone to the effort to repackage their stories for sale via Nook, Kindle, Smashwords, Lulu, and a growing plethora of related sites, many have not. Which means that for many of the stories that I enjoy reading, if I want something other than raw text or a downloaded web page, I have to create it myself. There are eBooks out there about this; I even have two of them in my collection, although I have to admit I haven't read them. There are discussions of this subject in various of the online forums frequented by authors; I've casually monitored the discussions in the Authors section of the Stories Online Forum. Mostly I've followed the examples of eBooks that I've purchased. The thing about the discussions on the author forums is, they pretty much assume you've already got a clean complete document in .docx, .rtf, or .odt, or some other accepted standard more advanced than .txt. They don't talk about what to do if you're starting from downloaded .txt files with lots of line feeds to keep the lines short, such as those available from Project Gutenberg or FictionMania.
They don't talk about starting from downloaded web pages, with all the nasty .html artifacts that can make a standard word processor choke, and which can cause problems with the more basic (read: free) .html editors. So I've had to do a lot of learning by trial and error, sometimes ending up with such a mess that I deleted the working document and started over from the original downloaded file; one thing I learned very early is that you don't edit the original file, you make a copy first and edit that.

.txt files have no formatting. No italics, no bold, no underlining, nothing. Sometimes authors use non-alphanumeric characters, such as - _ /\[]() to indicate formatting; this started with Usenet and BBS posts, where it was the only way. The Usenet/BBS crowd developed a fairly standard definition for what these non-alphanumeric characters were intended to convey, but if you didn't grow up on Usenet it's not intuitive. If you start with a file that has formatting indicated in this manner, you have a big job ahead of you. First, you have to import the file into your favorite document editor that supports modern WYSIWYG formatting. Then you have to determine what the author intended with the symbols he used, apply the WYSIWYG formatting, and remove the characters used to imply that formatting. While there may be document editors out there that allow you to search for text surrounded by certain characters, and then replace those characters with modern formatting, I haven't come across them. So if you start with a document like that, it's going to be very labor intensive to bring it up to modern standards, as you will have to go through it character by character. And you will also need to look for foreign language words with non-English characters. I'm constantly replacing deja-vu with déjà vu, for example. And then you have to go through, line by line, adding a space to the end of each line and deleting the line feed so that paragraphs flow together as one unit. In the early days of Usenet and BBSs, lines didn't wrap around the screen, they ran off the end, and you had to manually insert line feeds into the document to keep the lines from running off the screen. Modern technology handles wrapping text just fine, and the display width is much greater, so a sentence that might take three lines with line feeds may take just one line with them removed.
What I do now, when I come across such a file, is do an Internet search to see if it's been reposted with this reformatting already done. A good example of this is the stories by The Professor, which were originally posted at FictionMania, without formatting. In 2010 PS obtained permission from The Professor to repost many of his stories at BigCloset TopShelf. The reposted stories had The Professor's intended formatting. They were also .html documents. This was good and bad. Good in that the character formatting had been done. Bad, because .html documents have frames and all sorts of other stuff that mess up non-.html documents in text processors. BigCloset allows you to create a printer friendly document that doesn't have all the site advertising sidebars and menus, etc. that their web pages are cluttered with. You can download that page, which gives you a much nicer document to start with in an .html editor. But I quickly discovered that weird shit happens when editing .html files: doing something that seems completely innocuous will cause a section of text to suddenly change font and font treatments for no reason I can fathom, and prove to be beyond the undo function to handle. So I gave up on editing .html files as the path to nifty eBooks. I had to, I was getting too angry and frustrated. The next thing I tried was to copy/paste the entire text of the printer friendly file in one fell swoop into a LibreOffice Writer document. This worked, after a fashion, but introduced some .html formatting elements, such as frames, into the document. These elements in some manner interfere with some of LibreOffice's formatting tools; I kept finding myself unable to insert horizontal lines between sections of text to indicate breaks in action, instead of the short lengths of dashes that had been used. And I really wanted those horizontal lines, they look much nicer than short runs of dashes.
What I'm now doing, which is somewhat time consuming and repetitive, is cutting/pasting text from within an individual frame; this way there are no .html artifacts to interfere with my document editor. It takes time, but is still so much faster than starting from a .txt file that it isn't funny.
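
For anyone comfortable with a little scripting, the .txt cleanup steps described earlier (rejoining hard-wrapped lines, converting the marker characters, and fixing accented words) can be roughed out in Python. Treat this as a sketch under assumptions, not a finished tool: I'm assuming *asterisks* mean bold and _underscores_ mean italics, which is common but not universal, that blank lines separate paragraphs, and that the output is simple HTML tags to be imported into a word processor. The function names and the ACCENT_FIXES table are made up for the example.

```python
import re

# Hypothetical table of plain-ASCII spellings and their accented forms.
ACCENT_FIXES = {"deja-vu": "déjà vu", "deja vu": "déjà vu"}


def unwrap_paragraphs(text):
    """Join hard-wrapped lines; blank lines are taken as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [" ".join(line.strip() for line in p.splitlines()) for p in paragraphs]


def mark_formatting(paragraph):
    """Turn *bold* and _italic_ markers into simple HTML tags."""
    paragraph = re.sub(r"\*([^\s*](?:[^*]*[^\s*])?)\*", r"<b>\1</b>", paragraph)
    # The lookarounds keep names_with_underscores from being mangled.
    paragraph = re.sub(
        r"(?<!\w)_([^\s_](?:[^_]*[^\s_])?)_(?!\w)", r"<i>\1</i>", paragraph
    )
    return paragraph


def clean(text):
    """Unwrap, fix accents, and convert markers for every paragraph."""
    result = []
    for paragraph in unwrap_paragraphs(text):
        for plain, accented in ACCENT_FIXES.items():
            paragraph = paragraph.replace(plain, accented)
        result.append(mark_formatting(paragraph))
    return result


raw = (
    "It was _really_ late when the\n"
    "feeling of deja-vu hit me.\n"
    "\n"
    "*Stop* right there,\n"
    "she said.\n"
)
for paragraph in clean(raw):
    print("<p>" + paragraph + "</p>")
```

The result still needs a read-through, since authors were not consistent about what their markers meant, but it takes care of the line-by-line drudgery.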

Formatting aside, stories are posted in different ways. Sometimes the entire document is posted at once, sometimes it is posted in sections. Depending upon the host site, files may be limited in size, with larger files having to be broken down into parts. If posted via a mailing list, short stories may be posted complete, but longer works will be split up. If the author's mailing list has a host site with file storage capabilities, he may store the complete story as a single document at that site, and that single document will be what gets posted at other sites that can handle files that size. That's how Morpheus does things. Usually. He used to post his stories as serial emails to his Yahoo! Group, then post the complete story as a single file at BigCloset. Recently he's posted them at BigCloset at the same time he's sent them to his mailing list, and not posted a complete doc at BigCloset when done. The Academy was the last story in his Were universe that he posted at BigCloset as one file. The next, Touching the Moon, was posted in 62 parts! If I'm creating eBooks to read in a care facility, I don't want to have 62 eBooks to read one novel; that's just not going to happen. FictionMania readers were lucky, he posted it as one document there, but since it was done as a .txt file, it had no formatting and lots of line feeds. Touching the Moon was a straightforward cut/paste of 62 text blocks into an .odt doc; .odt is LibreOffice's default document format. However, I don't have an .odt document of Touching the Moon. Rather, I have one .odt doc of all the Were universe stories to date. With a cover page, a title page, a table of contents with internal links to each story, and at the beginning of each story, right under the title of the story, links to the files at FictionMania and BigCloset. And an About the Author section at the end, with a link to the copy of a chat session interview with him stored at FictionMania.

I've created a number of documents like that. And using Calibre, an eBook management/conversion program, I've created eBooks from those documents in a number of file types, for ease of reading. I've also had the thought that after all the effort involved in creating these documents, it would be nice if it benefited more than just myself. Since I don't own the rights to the stories, I can't distribute them without the permission of the author. The author may prefer to handle distribution themselves; while I did the packaging, I consider the documents their property to utilize as they see fit. If they want to sell copies, fine by me. If they want to make them freely available, well, that's pretty cool.
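
Calibre also ships with a command-line tool, ebook-convert, which takes an input file and an output file and works out the formats from the file extensions. Here is a minimal sketch of converting one master document into several eBook types from Python, assuming Calibre is installed and ebook-convert is on the PATH. The file name were_universe.odt and the list of formats are just examples, and the function names are my own.

```python
import shutil
import subprocess
from pathlib import Path

FORMATS = ["epub", "mobi", "azw3"]  # example targets, adjust to taste


def conversion_commands(source, formats=FORMATS):
    """Build one ebook-convert command per target format."""
    src = Path(source)
    return [
        ["ebook-convert", str(src), str(src.with_suffix("." + fmt))]
        for fmt in formats
    ]


def convert_all(source, formats=FORMATS):
    """Run the conversions; stops at the first one that fails."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre installed?")
    for command in conversion_commands(source, formats):
        subprocess.run(command, check=True)


# Show the commands without running them.
for command in conversion_commands("were_universe.odt"):
    print(" ".join(command))
```

Calling convert_all("were_universe.odt") would then write the converted files next to the master document. I'd still open each result in a reader to check the table of contents and cover before calling it done.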

I've only contacted one author about this so far. With very positive results. With the permission of The Professor, I've uploaded eBook versions of The Complete Ovid Stories to the Internet Archive. While I haven't looked into what would be involved in making them available through the Nook and Kindle storefronts as free eBooks, I have The Professor's permission to do so, it's just a matter of working with those sites to make it clear that while it is not my intellectual property, I have been authorized to act as The Professor's agent in placing copies in the wild.

I've got to say I feel pretty good about this. While the majority of Internet Fiction is drek, Sturgeon's Law holding true, there's some pretty good stuff that risks getting lost when the host site closes, as happened when EWP went under; in that case we were fortunate that the Internet Archive's Wayback Machine had archived the site, and that someone checked while we still remembered the URL of EWP, since the Wayback Machine indexes by URL. Unlike the print publishing industry, where using your legal name as the author is the norm, Internet Fiction is almost entirely published under pseudonyms. The heirs to print industry authors generally know that so and so is an author, and what he's published, and can take action to keep those items in print, so that they get the revenue. The vast majority of times, Internet Fiction authors' relatives have no clue that they write Internet Fiction, nor how to obtain access to their accounts. I was fortunate that twenty years after the last Ovid story was posted, The Professor was still monitoring the message board at FictionMania, and answered my message asking if anyone knew how to contact The Professor, if he was still alive. The last person I knew to be in contact with him, PS, in 2010, hadn't posted at BigCloset since 2013, and the last contact Angharad had with PS had been several years ago, when he was in ill health. In the print community, publishers generally find out when their authors die. On the Internet, unless someone in contact with them outside the Internet finds out and posts the information, an individual could be dead for decades and no one would know it; they'd just know it had been a while since they'd been heard from. Without knowing legal names, you can't search for obituaries or go through the Social Security Death Index.
This can sometimes be disastrous for an Internet community, when the person managing the web hosting dies and the first anyone knows is when the site is shut down for non-payment of maintenance fees. BigCloset, The Crystal Hall, and Stardust all had that start to happen to them, when Bob Arnold died. He'd not only handled hosting those web sites, the server's physical location was his home. When the power was turned off, the sites went black. In this case, his family knew what he had been involved with, and approved, which is pretty incredible since the primary genre posted to those three sites is Transgender Fiction; Bob wasn't Trans himself, but had an interest in Transformation and Gender-Bender fiction, which heavily overlaps TG fiction. The admins at BigCloset were able to contact Bob's family, and arranged for the power to go back on, and then purchased and relocated the servers. Stardust was Bob's baby, and is being maintained in his memory. The Crystal Hall has since set up shop on its own, but maintains close ties with BigCloset. But if Bob's family hadn't approved of what he was doing, and if Erin and the other admins at BigCloset hadn't known how to contact them, all three sites would have been lost forever, along with any stories not backed up elsewhere. BigCloset has set up a corporation to administer the site, so there won't be one key individual whose loss will bring it down. BigCloset also has a memorial wall, listing the names of those members who they know have died. There is a forum thread at Beyond The Far Horizon dedicated to information on the status of authors, but I don't know what arrangements Gina Marie Wylie has made for maintenance of the site when she becomes unable to do so; she's already had to change the domain type in the URL because someone sniped the domain renewal on her.
Stories Online, and its sister sites, Fine Stories and SciFi Stories, are managed by World Literature Publishing Company, but as far as I know that organization is wholly owned by Lazeez Jiddan, and I don't know what arrangements he's made for their continuation when he's no longer up to it. He's a very hands-on sysadmin, with lots of hand coding of the site infrastructure; it would be very difficult for someone to come in cold and keep it going.

Potentially, I could be making master documents and talking with authors about getting them archived for a very long time. I'll keep making the documents since they meet a need that I have. And I'll keep offering them to the authors because it would be criminal, in my mind, to keep the results of that effort to myself.

This isn't the first time I've done something like this. At my Academia site are stored .pdf files of

Di Grassi his true Arte of Defence modernized v1 2

Vincentio Saviolo, His Practise in two books, modernized typeface, annotated vocabulary

which I produced several years ago. The copies available were all unmodified images of the original publications, and man, were they hard to read. So I ran them through OCR, corrected all the OCR errors, annotated them, and created new .pdf files, and posted them to Academia and spread the word through the SCA Rapier community, and also the HEMA community. I didn't update the spelling, they're a strict transcription formatted to match the original. The one major alteration was replacing the illustrations from the Di Grassi English edition with those from the original Italian edition, which were much better illustrations, which I did at the suggestion of one of the HEMA types, who provided the URL for the images. I should probably get them uploaded to the Internet Archive as well. I need to modify them anyway, my email contact information in them is now incorrect.

Addendum, 8/29/2017: OK, I hadn't looked at the text of the fencing manuals since I created them in 2013, so I was in error. I did standardize, and modernize, the spelling, and in some cases, the words themselves, substituting modern equivalents where the intended meaning was no longer what the word means in Modern English. I'm currently creating a non-normalized EModE version of Saviolo, and plan to then create a normalized version, which will make them of use to researchers who want them in the original language. I need to revise the modernized text, as I've found places where I misread the original text the first time through, and I want to rethink EModE/Modern word equivalencies.

2017-07-28

Yandex Image Search and Google Image Search

Been a long time since I last posted. Can't say it'll happen more frequently, but this is a start.

https://yandex.com/images/ https://www.google.com/imghp

Both allow you to do a standard text description search. Both allow you to search for an image based upon one for which you have a known URL. Both allow you to upload an image to search. It's when you get to the results of the search that things differ.

I'm going to use the following test image which I uploaded from my computer; in this case I know the precise URL where I found it, although Calibre renamed it when using it to add a cover to the .rtf version of the book in my possession. Incidentally, the book is well worth reading.

I know, a pretty plebeian image, but since I don't have this blog set up behind a 21+ firewall, my choices are limited. If they weren't, I'd use an image that I downloaded at least ten years ago. I didn't record where I found it, and didn't know the name of the model, who took the photo, or where it was initially published; I didn't know a thing about it other than that I thought the model was good looking. Google couldn't find anything like it, but Yandex found the precise image at a bunch of sites, including a site with an entry on the model containing six photo shoot collections of around 90 images each. I now know the model's first name, and went from three images to way too many, but know nothing much about her since the site Yandex found didn't provide that information. The site is natively in Russian, but has a drop down list of other languages to display in, including English.

Moving right along...

First, the Google search. Rocinante cover, Google Image Search results 
Second, the Yandex search. Rocinante cover, Yandex Image Search results 

This was, perhaps, too easy an item to find. I may have to try this again with something more obscure.

Google didn't find the same resolution image, so it didn't declare a winner. Its best guess as to the identity of the image was spot on. The first site it listed was the actual source site. The related images were alternate cover images for the book, which indicates Google searched for related images by their best guess title, rather than items which featured things that looked like the submitted image. The first four sites listed as having matching images did, indeed, have matching images, while the final two sites were completely bogus. Only one of the sites which had a matching image was not owned by Wes Boyd; that site looked to be of interest to me, and I've now subscribed to their mailing list.

Yandex was much more confident about saying it had a match. It immediately offered a list of different resolutions for the image, with links to those images; this is something Yandex does for every image you select from those displayed in their search results, and I find this very useful. The related images section didn't come up with the alternate covers of the book, but instead images of aircraft similar to the one on the cover. This indicates their related image search is based upon an analysis of the submitted image to determine the main topic of the image, rather than the item the submitted image had been linked to. This is an important difference, and should be borne in mind when deciding which search engine to use. All six of the sites listed as having matching images did. All six sites are owned by Wes Boyd. Google didn't find as many sites owned by Wes Boyd, but did find the image at someone else's site.

Neither found the entry at LibraryThing. Goodreads had the book listed, but showed one of the alternate covers; it was the only Wes Boyd book they listed where that was the case. FictionDB had the alternate cover. The Google Books entry didn't show. A whole bunch of others didn't show, including Nook and Kindle eBook stores.

Now to try again.
This is an image of the map of Middle Earth included with one of the hardcover editions of The Lord of the Rings published by Allen & Unwin lo these many years agone. 


Google, again, wasn't sure about its identification, but its best guess was pretty good. The two sites they list before showing related images were sites I already knew about as primo Middle Earth fan projects. The related images were spot on, all being similar maps of Middle Earth. Google then goes on with a bazillion hits for sites with matching images, I mean pages upon pages upon pages, leading off with five articles on the find of a copy of the map hand annotated by J.R.R. Tolkien himself. And where possible, a small thumbnail of the image at that site appears to the left of the listing.

Yandex, again, was sure of its identification, and offered a variety of resolutions for the image. The related images were spot on. The sites listed as having matching images aren't organized the way Google's are, which may be good or bad; after all, the first five sites Google listed had basically the same information, while Yandex leads off with a Korean language Middle Earth fan site rich with maps. Of course, I didn't know it was Korean; the translating software used by Chrome doesn't tell you what language is being translated from, which is a grievous lack, and the translated site didn't have anything saying it was based in Korea. The only clue was that the About page listed a problem at one time with the Palgong Port interface, and a search on Palgong determined that it was in South Korea. Yandex also includes a thumbnail of the image as part of each site listing and, continuing their focus on resolution, lists the resolution of the image at the bottom of the thumbnail.

Google is very good at finding information in your language, and geographically close by. This is because of all the information they collect about you, as the Internet Conspiracy Theorists rant about all the time; I think it's cool, since I generally get better results because of it. But there are times when that isn't what you want. It wasn't until the end of the thirteenth page of results that Google listed a non-English language site; Yandex led off with one. However, Google did have those pages upon pages of sites, while Yandex only lists forty-three sites. And both of my example search objects were non-obscure; as I related at the beginning, I had an obscure Adult Model image in my collection that Google didn't have a clue about, and that Yandex, given their far more aggressive delving into former Soviet countries' resources, found.

If your interest lies in finding different resolution images, foreign language resources, or obscure Adult Model image information, Yandex is definitely the search engine to use. If you want localized information, stick with Google; that's where they put their focus. There are other image search providers out there, but I haven't tried them out; it could be well worth your time checking them out, as I suspect each has differing strengths and weaknesses, and with proper investigation you would be able to select the best search engine for your specific research project. I know I'll be switching back and forth between Google and Yandex, just like when I'm looking for used books that were published in Scandinavia I search Antikvariat.net rather than AbeBooks; you choose the proper tool for the task at hand.

Update: 2017 10 24

Google really has a problem due to its localization process. It won't forget about your most recent searches, and where you found useful information. So if you start a new search, which has nothing to do with your previous search, it hits all the wrong web sites first. At least that seems to be what happens when trying to ID images of Europeans found on Asian web sites; since you had just been visiting Asian web sites, that's where it starts looking, and since that's where you found the images, why, there you go, success! Except that there isn't any identification information there; if there had been, I wouldn't be doing an image search in the first place, I'd be doing a text search based upon the ID. And if Google doesn't make a solid ID the first time through, it bases its broader search for similar images upon the text found in those first web pages. Fine and dandy if the page is for a narrow subject, but if it's a general page, then only general terms will be provided. Such as the image search where the term Google insisted upon adding as its text criterion was the word "girl"; not standing, sitting, lying down, in a chair, leaning against a wall, wearing a business suit, wearing a bikini, wearing nothing at all, just "girl", so that's what was brought up as "similar" images: images of lots of very different girls. Or where it insisted upon a Portuguese search term, since it was a Portuguese site where it found a matching image. Funny thing, images don't group based upon the language of their originating country, not if they have been around for any length of time, and if the first sites where a match is found aren't in the same language as the one in which the image was created and first posted, the foreign language search terms will actually decrease the likelihood of finding a proper ID. You use search terms in the language which has the most information concerning your subject of research.
This is precisely why I'm creating a glossary of search terms in various languages, attempting to find equivalent phrases so I can use the descriptive phrase appropriate to the language which has the most information on the subject; when researching the Venice-Ottoman Wars of the mid-1500s, English is _not_ the most productive language to search in. Determining the foreign language term used for a location in the mid-1500s is not an easy task, especially when starting with the English term of the period, which doesn't match the Modern English term.

Yandex, on the other hand, doesn't localize the search, so if the model is European and you found the image on a Japanese site, Yandex will start with European sites, because that's where it will find the greatest number of hits. If Yandex doesn't ID the image right off, it analyzes the image itself, not the text on the websites where the image is found, and pulls up images that look like the image submitted; the images will have the same stance, similar background, similar clothes (or lack thereof), and even tend toward the same hair color. In other words, images that really do look like the image submitted. After its first try, Google insists upon adding text search terms to the image search, based upon the websites where it found matches with the submitted image; Yandex analyzes the image itself, and tries to match the image, without adding search terms. Guess which is more successful if you are trying to find more images of a specific item, and didn't stumble upon a site dedicated to that item the first time around? Yandex, hands down, since it really does pull up similar images that aren't exact matches, regardless of any accompanying text.

Now, when Google does make a correct ID, it tends to put a lot more information onto the screen than Yandex does; with Yandex, you have to actually visit the relevant websites, while if Google can find a matching Wikipedia page, it will abstract the basic information and show it in the upper right quadrant of the screen.

But Google's pulling text search terms from the pages where the image was found helps to explain why the first seven or so hits for the Tolkien map image were all near-identical articles in the popular press about the British museum obtaining a copy of that map hand annotated by J.R.R. Tolkien himself; it grabbed the text from the first hit it found, and that got the best match from the other papers getting their articles from the same news service. So after the first hit, the next six were useless duplicates, and they weren't actually the results of an image search, but of the image search converted to a text search.