2017-08-02

What I've been doing: Creation of omnibus eStory volumes.

I read Internet Fiction. A lot of it. It's what I spend most of my time doing these days. I expect to keep doing this for a long time. The thing to note about Internet Fiction is that you need an active Internet connection when you are reading. A lot of the host sites allow you to download the stories for off-line reading, but on the free sites, with a few exceptions, the files are text files that aren't very pretty. In fact, some are downright ugly. But if you don't have an active connection, they're better than nothing.

Looking ahead, I can see the time coming where I'm in a care facility. It may not come to that, depending upon how my health deteriorates, but I'd be wise to plan for it in advance. What I need to plan for is being in a care facility because of physical problems, but while my mind is still working. I don't think many care facilities provide Internet access for those in their care. At least, that's the assumption I'm making. It's unlikely I'd have a desktop computer in a care facility, but a laptop or tablet seems reasonable. So, if I'm going to have anything to read, I'll need to have amassed a collection of files, preferably eBooks, of the stories I enjoy reading. While some of the authors who post on the Internet have gone to the effort to repackage their stories for sale via Nook, Kindle, Smashwords, Lulu, and a growing plethora of related sites, many have not. Which means that for many of the stories that I enjoy reading, if I want something other than raw text or a downloaded web page, I have to create it myself. There are eBooks out there about this; I even have two of them in my collection, although I have to admit I haven't read them. There are discussions of this subject in various online forums frequented by authors; I've casually monitored the discussions in the Authors section of the Stories Online Forum. Mostly I've followed the examples of eBooks that I've purchased. The thing about the discussions on the author forums is, they pretty much assume you've already got a clean complete document in .docx, .rtf, or .odt, or some other accepted standard more advanced than .txt. They don't talk about what to do if you're starting from .txt files downloaded with lots of line feeds to keep the lines short, such as those available from Project Gutenberg or FictionMania.
They don't talk about starting from downloaded web pages, with all the nasty .html artifacts that can make a standard word processor choke, and which can cause problems with the more basic (read: free) .html editors. So I've had to do a lot of learning by trial and error, sometimes ending up with such a mess that I deleted the working document and started over from the original downloaded file; one thing I learned very early is that you don't edit the original file, you make a copy first and edit that.

.txt files have no formatting. No italics, no bold, no underlining, nothing. Sometimes authors use non-alphanumeric characters, such as - _ /\[]() to indicate formatting; this started with Usenet and BBS posts, where it was the only way. The Usenet/BBS crowd developed a fairly standard definition for what these non-alphanumeric characters were intended to convey, but if you didn't grow up on Usenet it's not intuitive. If you start with a file that has formatting indicated in this manner, you have a big job ahead of you. First, you have to import the file into your favorite document editor that supports modern WYSIWYG formatting. Then you have to determine what the author intended with the symbols he used, apply the WYSIWYG formatting, and remove the characters used to imply that formatting. While there may be document editors out there that allow you to search for text surrounded by certain characters, and then replace those characters with modern formatting, I haven't come across them. So if you start with a document like that, it's going to be very labor intensive to bring it up to modern standards, as you will have to go through it character by character. You will also need to look for foreign language words with non-English characters; I'm constantly replacing deja-vu with déjà vu, for example. And then you have to go through, line by line, adding a space to the end of each line and deleting the line feed so that paragraphs flow together as one unit. In the early days of Usenet and BBSs, lines didn't wrap around the screen, they ran off the end, and you had to manually insert line feeds into the document to keep the lines from running off the screen. Modern technology handles wrapping text just fine, and the display width is much greater, so a sentence that might take three lines with line feeds may take just one line with them removed.
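
For anyone comfortable with a little scripting, the two most tedious chores described above (removing the line feeds and converting the Usenet-style markers) can be roughed out in Python. This is only a sketch under stated assumptions: it assumes blank lines separate paragraphs, that *word* means bold and _word_ means italic (authors vary, so check a few samples by eye first), and it writes simple .html paragraphs rather than a word processor document. The function names are made up for illustration.

```python
import html
import re

# ASSUMED marker conventions; individual authors may have meant something else.
#   *text* -> bold      _text_ -> italic
MARKERS = [
    (re.compile(r"(?<!\w)\*(\S(?:.*?\S)?)\*(?!\w)"), r"<b>\1</b>"),
    (re.compile(r"(?<!\w)_(\S(?:.*?\S)?)_(?!\w)"), r"<i>\1</i>"),
]


def unwrap_paragraphs(text):
    """Join hard-wrapped lines into paragraphs; blank lines end a paragraph."""
    blocks = re.split(r"\n\s*\n", text.replace("\r\n", "\n"))
    paragraphs = []
    for block in blocks:
        lines = [line.strip() for line in block.split("\n")]
        flowed = " ".join(line for line in lines if line)
        if flowed:
            paragraphs.append(flowed)
    return paragraphs


def apply_markers(paragraph):
    """Replace the assumed plain-text markers with HTML bold/italic tags."""
    for pattern, replacement in MARKERS:
        paragraph = pattern.sub(replacement, paragraph)
    return paragraph


def text_to_html(text):
    """Turn hard-wrapped, marker-formatted text into simple <p> paragraphs."""
    paragraphs = unwrap_paragraphs(text)
    # Escape &, < and > first so stray symbols in the story don't become tags.
    return "\n".join(
        "<p>" + apply_markers(html.escape(p, quote=False)) + "</p>"
        for p in paragraphs
    )


if __name__ == "__main__":
    sample = (
        "It was a dark and\nstormy night, and she\nwas *very* tired.\n\n"
        "He felt _deja vu_ again.\n"
    )
    print(text_to_html(sample))
```

A script like this won't catch everything (it knows nothing about déjà vu, for one), so the result still needs a read-through, and as always it should be run on a copy, never the original download.
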
What I do now, when I come across such a file, is do an Internet search to see if it's been reposted with this reformatting already done. A good example of this is the stories by The Professor, which were originally posted at FictionMania, without formatting. In 2010 PS obtained permission from The Professor to repost many of his stories at BigCloset TopShelf. The reposted stories had The Professor's intended formatting. They were also .html documents. This was good and bad. Good in that the character formatting had been done. Bad, because .html documents have frames and all sorts of other stuff that mess up non-.html documents in text processors. BigCloset allows you to create a printer friendly document that doesn't have all the site advertising, sidebars, menus, etc. that their web pages are cluttered with. You can download that page, which gives you a much nicer document to start with in an .html editor. But I quickly discovered that weird shit happens when editing .html files; doing something that seems completely innocuous will cause a section of text to suddenly change font and font treatments for no reason I can fathom, and prove to be beyond the undo function to handle. So I gave up on editing .html files as the path to nifty eBooks. I had to, I was getting too angry and frustrated. The next thing I tried was to copy/paste the entire text of the printer friendly file in one fell swoop into a LibreOffice Writer document. This worked, after a fashion, but introduced some .html formatting elements, such as frames, into the document. These elements in some manner interfere with some of LibreOffice's formatting tools; I kept finding myself unable to insert horizontal lines between sections of text to indicate breaks in action, in place of the short lengths of dashes that had been used. And I really wanted those horizontal lines, they look much nicer than short runs of dashes.
What I'm now doing, which is somewhat time consuming and repetitive, is cutting/pasting text from within an individual frame; this way there are no .html artifacts to interfere with my document editor. It takes time, but is still so much faster than starting from a .txt file that it isn't funny.
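
The same idea, keeping the text and its italics and bold while leaving the frames and other page furniture behind, can also be scripted. What follows is a minimal sketch using only Python's standard html.parser module, not a description of any tool mentioned above. The class and function names are hypothetical, and the choice of which tags count as paragraph breaks and which inline tags to keep is an assumption that would need adjusting for a particular site's pages.

```python
import html
from html.parser import HTMLParser


class StoryCleaner(HTMLParser):
    """Keep paragraph text plus bold/italic; drop every other tag."""

    BLOCK_TAGS = {"p", "div", "br", "hr", "li", "h1", "h2", "h3", "h4"}
    KEEP_INLINE = {"i", "b", "em", "strong"}   # assumed worth keeping
    SKIP_TAGS = {"script", "style"}            # contents discarded entirely

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = []
        self._skip_depth = 0

    def flush(self):
        """Close off the current paragraph, collapsing runs of whitespace."""
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.flush()
        elif tag in self.KEEP_INLINE and not self._skip_depth:
            self._buffer.append("<" + tag + ">")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.flush()
        elif tag in self.KEEP_INLINE and not self._skip_depth:
            self._buffer.append("</" + tag + ">")

    def handle_data(self, data):
        if not self._skip_depth:
            # Re-escape &, < and > so the output is still valid .html.
            self._buffer.append(html.escape(data, quote=False))


def clean_story_html(html_source):
    """Return the story as bare <p> paragraphs with only <i>/<b> style tags."""
    cleaner = StoryCleaner()
    cleaner.feed(html_source)
    cleaner.close()
    cleaner.flush()
    return "\n".join("<p>" + p + "</p>" for p in cleaner.paragraphs)


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style></head><body>"
        "<div class='sidebar'><p>First  line\nof text.</p>"
        "<p>Second &amp; <i>last</i>.</p></div></body></html>"
    )
    print(clean_story_html(page))
```

The output is deliberately plain: fonts, colors, and layout containers are all discarded, so section breaks and the like would still have to be added by hand in the document editor.
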

Formatting aside, stories are posted in different ways. Sometimes the entire document is posted at once, sometimes it is posted in sections. Depending upon the host site, files may be limited in size, with larger files having to be broken down into parts. If posted via a mailing list, short stories may be posted complete, but longer works will be split up. If the author's mailing list has a host site with file storage capabilities, he may store the complete story as a single document at that site, and that single document will be what gets posted at other sites that can handle files that size. That's how Morpheus does things. Usually. He used to post his stories as serial emails to his Yahoo! Group, then post the complete story as a single file at BigCloset. Recently he's posted them at BigCloset at the same time he's sent them to his mailing list, and not posted a complete doc at BigCloset when done. The Academy was the last story in his Were universe that he posted at BigCloset as one file. The next, Touching the Moon, was posted in 62 parts! If I'm creating eBooks to read in a care facility, I don't want to have 62 eBooks to read one novel; that's just not going to happen. FictionMania readers were lucky, he posted it as one document there, but since it was done as a .txt file, it has no formatting and lots of line feeds. Touching the Moon was a straightforward cut/paste of 62 text blocks into an .odt doc; .odt is LibreOffice's default document format. However, I don't have an .odt document of Touching the Moon. Rather, I have one .odt doc of all the Were universe stories to date. With a cover page, a title page, a table of contents with internal links to each story, and at the beginning of each story, right under the title of the story, links to the files at FictionMania and BigCloset. And an About the Author section at the end, with a link to the copy of a chat session interview with him stored at FictionMania.

I've created a number of documents like that. And using Calibre, an eBook management/conversion program, I've created eBooks from those documents in a number of file types, for ease of reading. I've also had the thought that after all the effort involved in creating these documents, it would be nice if it benefited more than just myself.  Since I don't own the rights to the stories, I can't distribute them without the permission of the author. The author may prefer to handle distribution themselves; while I did the packaging, I consider the documents their property to utilize as they see fit. If they want to sell copies, fine by me. If they want to make them freely available, well, that's pretty cool.
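
Calibre also ships a command-line converter, ebook-convert, so producing several file types from one master document can be scripted. The sketch below is an illustration rather than a record of the exact steps used here: the list of output formats, the file name, and the function names are all assumptions, and it expects ebook-convert to be on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_commands(source, formats=("epub", "mobi", "pdf"),
                           title=None, author=None):
    """Build one ebook-convert command per output format (formats are assumed)."""
    source = Path(source)
    commands = []
    for fmt in formats:
        target = source.with_suffix("." + fmt)
        cmd = ["ebook-convert", str(source), str(target)]
        if title:
            cmd += ["--title", title]      # metadata option of ebook-convert
        if author:
            cmd += ["--authors", author]   # metadata option of ebook-convert
        commands.append(cmd)
    return commands


def convert_all(source, **kwargs):
    """Run every conversion, stopping at the first failure."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre installed and on PATH?")
    for cmd in build_convert_commands(source, **kwargs):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file name; print the commands instead of running them.
    for cmd in build_convert_commands("were_universe.odt", title="Were Universe"):
        print(" ".join(cmd))
```

Keeping the command building separate from the running makes it easy to look over what will happen before anything touches the files.
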

I've only contacted one author about this so far. With very positive results. With the permission of The Professor, I've uploaded eBook versions of The Complete Ovid Stories to the Internet Archive. While I haven't looked into what would be involved in making them available through the Nook and Kindle storefronts as free eBooks, I have The Professor's permission to do so; it's just a matter of working with those sites to make it clear that while it is not my intellectual property, I have been authorized to act as The Professor's agent in placing copies in the wild.

I've got to say I feel pretty good about this. While the majority of Internet Fiction is drek, Sturgeon's Law holding true, there's some pretty good stuff that risks getting lost when the host site closes, as happened when EWP went under; in that case we were fortunate that the Internet Archive's Wayback Machine had archived the site, and that someone checked while we still remembered the URL of EWP, since the Wayback Machine indexes by URL. Unlike the print publishing industry, where using your legal name as the author is the norm, Internet Fiction is almost entirely published under pseudonyms. The heirs to print industry authors generally know that so and so is an author, and what he's published, and can take action to keep those items in print, so that they get the revenue. The vast majority of the time, Internet Fiction authors' relatives have no clue that they write Internet Fiction, nor how to obtain access to their accounts; I was fortunate that twenty years after the last Ovid story was posted, The Professor was still monitoring the message board at FictionMania, and answered my message asking if anyone knew how to contact The Professor, if he was still alive. The last person I knew to be in contact with him, PS, in 2010, hadn't posted at BigCloset since 2013, and the last contact Angharad had with PS had been several years ago, when he was in ill health. In the print community, publishers generally find out when their authors die. On the Internet, unless someone in contact with them outside the Internet finds out and posts the information, an individual could be dead for decades and no one would know it, they'd just know it had been a while since they'd been heard from. Without knowing legal names, you can't search for obituaries or go through the Social Security Death Index.
This can sometimes be disastrous for an Internet community, when the person managing the web hosting dies and the first anyone knows is when the site is shut down for non-payment of maintenance fees. BigCloset, The Crystal Hall, and Stardust all had that start to happen to them, when Bob Arnold died. He'd not only handled hosting those web sites, the server's physical location was his home. When the power was turned off, the sites went black. In this case, his family knew what he had been involved with, and approved, which is pretty incredible since the primary genre posted to those three sites is Transgender Fiction; Bob wasn't Trans himself, but had an interest in Transformation and Gender-Bender fiction, which heavily overlaps TG fiction. The admins at BigCloset were able to contact Bob's family, and arranged for the power to go back on, and then purchased and relocated the servers. Stardust was Bob's baby, and is being maintained in his memory. The Crystal Hall has since set up shop on its own, but maintains close ties with BigCloset. But if Bob's family hadn't approved of what he was doing, and if Erin and the other admins at BigCloset hadn't known how to contact them, all three sites would have been lost forever, along with any stories not backed up elsewhere. BigCloset has set up a corporation to administer the site, so there won't be one key individual whose loss will bring it down. BigCloset also has a memorial wall, where are listed the names of those members who they know have died. There is a forum thread at Beyond The Far Horizon dedicated to information on the status of authors, but I don't know what arrangements Gina Marie Wylie has made for maintenance of the site when she becomes unable to do so; she's already had to change the domain type in the URL because someone sniped the domain renewal on her.
Stories Online, and its sister sites, Fine Stories and SciFi Stories, are managed by World Literature Publishing Company, but as far as I know that organization is wholly owned by Lazeez Jiddan, and I don't know what arrangements he's made for their continuation when he's no longer up to it; he's a very hands-on sysadmin, with lots of hand coding of the site infrastructure, and it would be very difficult for someone to come in cold and keep it going.

Potentially, I could be making master documents and talking with authors about getting them archived for a very long time. I'll keep making the documents since they meet a need that I have. And I'll keep offering them to the authors because it would be criminal, in my mind, to keep the results of that effort to myself.

This isn't the first time I've done something like this. At my Academia site are stored .pdf files of

Di Grassi his true Arte of Defence modernized v1 2

Vincentio Saviolo, His Practise in two books, modernized typeface, annotated vocabulary

which I produced several years ago. The copies available were all unmodified images of the original publications, and man, were they hard to read. So I ran them through OCR, corrected all the OCR errors, annotated them, and created new .pdf files, and posted them to Academia and spread the word through the SCA Rapier community, and also the HEMA community. I didn't update the spelling, they're a strict transcription formatted to match the original. The one major alteration was replacing the illustrations from the Di Grassi English edition with those from the original Italian edition, which were much better illustrations, which I did at the suggestion of one of the HEMA types, who provided the URL for the images. I should probably get them uploaded to the Internet Archive as well. I need to modify them anyway, my email contact information in them is now incorrect.

2017-07-28

Yandex Image Search and Google Image Search

Been a long time since I last posted. Can't say it'll happen more frequently, but this is a start.

https://yandex.com/images/ https://www.google.com/imghp

Both allow you to do a standard text description search. Both allow you to search for an image based upon one for which you have a known URL. Both allow you to upload an image to search. It's when you get to the results of the search that things differ.

I'm going to use the following test image which I uploaded from my computer; in this case I know the precise URL where I found it, although Calibre renamed it when using it to add a cover to the .rtf version of the book in my possession. Incidentally, the book is well worth reading.

I know, a pretty plebeian image, but since I don't have this blog set up behind a 21+ firewall, my choices are limited. If I did, I've got an image that I downloaded at least ten years ago, where I didn't record where I found it, didn't know the name of the model, who took the photo, or where it was initially published; I didn't know a thing about it other than that I thought the model was good looking. Google couldn't find anything like it, but Yandex found the precise image at a bunch of sites, including a site with an entry on the model containing six photo shoot collections of around 90 images each. I now know the model's first name, and went from three images to way too many, but know nothing much about her since the site Yandex found didn't provide that information; the site is natively in Russian, but has a drop down list of other languages to display in, including English.

Moving right along...

First, the Google search. Rocinante cover, Google Image Search results 
Second, the Yandex search. Rocinante cover, Yandex Image Search results 

This was, perhaps, too easy an item to find. I may have to try this again with something more obscure.

Google didn't find the same resolution image, so it didn't declare a winner. Its best guess as to the identity of the image was spot on. The first site it listed was the actual source site. The related images were alternate cover images for the book, which indicates Google searched for related images by their best guess title, rather than items which featured things that looked like the submitted image. The first four sites listed as having matching images did, indeed, have matching images, while the final two sites were completely bogus. Only one of the sites which had a matching image was not owned by Wes Boyd; that site looked to be of interest to me, and I've now subscribed to their mailing list.

Yandex was much more confident about saying it had a match. It immediately offered a list of different resolutions for the image, with links to those images; this is something Yandex does for every image you select from those displayed in their search results, and I find this very useful. The related images section didn't come up with the alternate covers of the book, but instead images of aircraft similar to the one on the cover. This indicates their related image search is based upon an analysis of the submitted image to determine the main topic of the image, rather than the item the submitted image had been linked to. This is an important difference, and should be borne in mind when deciding which search engine to use. All six of the sites listed as having matching images did. All six sites are owned by Wes Boyd. Google didn't find as many sites owned by Wes Boyd, but did find the image at someone else's site.

Neither found the entry at LibraryThing. Goodreads had the book listed, but showed one of the alternate covers; it was the only Wes Boyd book they listed where that was the case. FictionDB had the alternate cover. The Google Books entry didn't show. A whole bunch of others didn't show, including the Nook and Kindle eBook stores.

Now to try again.
This is an image of the map of Middle Earth included with one of the hardcover editions of The Lord of the Rings published by Allen & Unwin lo these many years agone. 


Google, again, wasn't sure about its identification, but its best guess was pretty good. The two sites they list before showing related images were sites I already knew about as primo Middle Earth fan projects. The related images were spot on, all being similar maps of Middle Earth. Google then goes on with a bazillion hits for sites with matching images, I mean pages upon pages upon pages, leading off with five articles on the find of a copy of the map hand annotated by J.R.R. Tolkien himself. And where possible, a small thumbnail of the image at that site appears to the left of the listing.

Yandex, again, was sure of its identification, and offered a variety of resolutions for the image. The related images were spot on. The sites listed as having matching images aren't organized the way Google's are, which may be good or bad; after all, the first five sites Google listed had basically the same information, while Yandex leads off with a Korean language Middle Earth fan site rich with maps. Of course, I didn't know it was Korean, and the translating software used by Chrome doesn't tell you what language is being translated from, which is a grievous lack, and the translated site didn't have anything saying it was based in Korea, except that in the About page it did list a problem at one time with the Palgong Port interface, and a search on Palgong determined that it was in South Korea. Yandex also includes a thumbnail of the image as part of each site listing, and continuing their focus on resolution, lists the resolution of the image at the bottom of the thumbnail.

Google is very good at finding information in your language, and geographically close by. This is because of all the information they collect about you, as the Internet Conspiracy Theorists rant about all the time; I think it's cool, I generally get better results because of it. But there are times when that isn't what you want. It wasn't until the end of the thirteenth page of results that Google listed a non-English language site; Yandex led off with one. However, Google did have those pages upon pages of sites, while Yandex only lists forty-three sites. And both of my example search objects were non-obscure; as I related at the beginning, I had an obscure Adult Model image in my collection that Google didn't have a clue about, that Yandex, given their far more aggressive delving into former Soviet countries' resources, found.

If your interest lies in finding different resolution images, foreign language resources, or obscure Adult Model image information, Yandex is definitely the search engine to use. If you want localized information stick with Google, that's where they put their focus. There are other image search providers out there, but I haven't tried them out; it could be well worth your time checking them out, as I suspect each has differing strengths and weaknesses, and with proper investigation you would be able to select the best search engine for your specific research project. I know I'll be switching back and forth between Google and Yandex, just like when I'm looking for used books that were published in Scandinavia I search Antikvariat.net rather than AbeBooks; you choose the proper tool for the task at hand.