May 2011 – Temet Nosce

How to give a new life to books which are out of copyright!

Here is a short summary of the Free Software tools that I have found useful for converting hard copies into readable/searchable formats in GNU/Linux!

Typically, making a soft copy from a hard copy involves the following steps:

Step 1:
Scan the hard copy using a scanner or a camera. This step generates image files, typically .tiff, .png or .jpeg. Some scanning programs also have an option to generate a .pdf directly.
Basically, at this stage you have all the data. If you compress the folder into a comic book reader format such as .cbr or .cbz, you are good to go. But for a more professional touch, read on.
The main step is to scan the book properly. Some do’s and don’ts:
Align the pages to the sides of the scanner.
If the book is small, scan two pages at once.
If the book is too large, adjust the scan area in the image preview so that only one page is scanned.
If these steps are done properly, there is little left to do in the second step, and we can jump directly to Step 3.
Preferably scan in grayscale, unless there are colored images in the text. This will help reduce the final size of the file.
Scan at a minimum of 300 dpi. This is the optimum level that I have arrived at after trial and error with different resolutions, their final results and the time taken for each scan. Of course, this can differ depending on what you are scanning. Many people scan at 600 dpi, but I am happy with 300 dpi. Note: the 300 dpi images can be upscaled to 600 dpi in Scan Tailor.
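To get a feel for what the resolution choice means in pixels, here is a small sketch of the arithmetic (pixels = millimetres / 25.4 x dpi). The helper name px_at_dpi and the A4 page size are my own illustrative assumptions; nothing here comes from the scanner software.

```shell
# px_at_dpi is a hypothetical helper: pixels along one edge for a length
# given in millimetres and a scan resolution in dpi (1 inch = 25.4 mm).
px_at_dpi() {
    mm=$1
    dpi=$2
    # Integer arithmetic, rounded to the nearest pixel.
    echo $(( ($mm * $dpi * 10 + 127) / 254 ))
}

# An A4 page (210 mm x 297 mm) at 300 dpi and at 600 dpi:
echo "300 dpi: $(px_at_dpi 210 300) x $(px_at_dpi 297 300) pixels"
echo "600 dpi: $(px_at_dpi 210 600) x $(px_at_dpi 297 600) pixels"
```

Doubling the resolution quadruples the number of pixels per page, which goes a long way towards explaining why 600 dpi scans take longer and produce bigger files.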
First of all, the scanning itself. Most scanners come with an installation disk for M$-Windows or Mac-OSX, but for GNU/Linux there seems to be no ‘installation disk’. The Xsane package supports quite a few scanners, which are detected and ready for use as soon as you plug them in.
The list of scanners supported by SANE, the backend that Xsane uses, can be found here:
http://www.sane-project.org/sane-mfgs.html
When we bought our scanner we had to search this list to get the compatible scanner.
What is the problem with the manufacturers? Why do they not want to sell more to people who are using Free Software?
If your scanner is not in the list, then you might have to do some R&D before your scanner is up and running, as I had to do for my old HP 2400 Scanjet at home.
Once your scanner is up and running, scan the images preferably in .tiff format, as they can then be processed and compressed without much loss of quality. This again I have found by trial and error.
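If you prefer the command line to the Xsane window, SANE also ships the scanimage program. The following is a rough sketch, assuming your backend accepts the common --resolution and --mode options (these vary from scanner to scanner). The scan_page helper and the page-NNN.tiff naming are my own illustrative choices. Setting DRY_RUN=1 only prints the command.

```shell
# scan_page is a hypothetical helper: scan page number $1 into page-NNN.tiff.
# --resolution and --mode are backend options; check `scanimage --help`
# with your scanner attached to see what it actually accepts.
scan_page() {
    out=$(printf 'page-%03d.tiff' "$1")
    if [ -n "${DRY_RUN:-}" ]; then
        # Dry run: show what would be executed without touching the scanner.
        echo "scanimage --format=tiff --resolution 300 --mode Gray > $out"
    else
        scanimage --format=tiff --resolution 300 --mode Gray > "$out"
    fi
}

# Possible use: place page 1 on the glass, then run
#   scan_page 1
```

Zero-padded page numbers keep the files in the right order for the later steps.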
Step 2:
Crop the files and rotate them to remove unwanted white space or accidental entries of adjoining pages from the images that were obtained. When the pages are scanned as two pages in one image, we may need to separate the pages.
Initially I did it manually, and it was the second most boring part after the scanning. But I have found a wonderful tool for this work.
ImageMagick provides a set of tools which work like magic on images, hence the name I guess 🙂
This is one of the best tools for batch processing image files.
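As an illustration of the kind of batch work ImageMagick can do, here is a minimal sketch that cuts two-page scans down the middle with the -crop option of convert. It assumes the gutter sits at the centre of the image. The names run and split_spread are hypothetical helpers, and scans/ and pages/ are placeholder directories. Setting DRY_RUN=1 prints the command instead of running it.

```shell
# Run a command, or only print it when DRY_RUN is set.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$@"; else "$@"; fi; }

# split_spread is a hypothetical helper: cut the two-page scan $1 into halves.
# Writes <name>-0.tiff (left half) and <name>-1.tiff (right half) into $2.
split_spread() {
    src=$1
    outdir=$2
    base=$(basename "$src" .tiff)
    # -crop 50%x100% tiles the image into two halves; +repage resets the
    # canvas offsets so each half is a normal standalone image.
    run convert "$src" -crop 50%x100% +repage "$outdir/$base-%d.tiff"
}

# Possible use over a whole folder:
#   mkdir -p pages
#   for f in scans/*.tiff; do split_spread "$f" pages; done
```

A fixed centre split is crude compared with what a dedicated tool can detect automatically, so treat it as a starting point only.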
Then I found the dream tool that I was looking for.
It is called Scan Tailor and, as the name suggests, it is meant for processing scanned images.
Scan Tailor can be found at http://scantailor.sourceforge.net/ or directly from Ubuntu Software Centre.
Step by step, Scan Tailor cleans up relatively unclean images and creates amazingly good output files.
There are a total of 6 steps in Scan Tailor which produce the desired output.
You have to choose the folder in which your scanned images are. Scan Tailor creates a directory called out in the same folder by default. The steps are as follows:

Change the Orientation: This enables one to change the orientation of all the files in the directory. This is a good option in case you have scanned the book in a different orientation.
Split Pages: This step determines whether the scans are single-page scans, single-page scans with some marginal text from the facing page, or two-page scans. Most of the time the auto detection works well with single-page and two-page scans. But it is a good idea to check manually whether all the pages have been divided correctly, so that it does not create problems later. If you find that a page has been divided incorrectly, you can slide the margin to correct it. In case of two-page scans, the two pages are shown with a semitransparent blue or red layer on top of them. After looking at all the pages, we commit the result.
Deskew: After the pages have been split, we need to straighten them for better alignment of the text. In my experience the automatic deskew mostly works fine, but it is still a good idea to check the pages manually in case something is missed.
Select Content: This is the step that I have found the most useful in Scan Tailor. Here you can select the portion of the page that will appear in the final output, so you can say goodbye to all the dark lines that inevitably come as part of scanning. Some library marks can also be removed easily in this step. The auto option works well when the text is in a nice box shape, but it may also leave wide areas open. The box can be reshaped the way we want. If you want a blank page, remove the content box by right-clicking on it.
Page Layout: Here one can set the dimensions of the output page and how the content will be placed on each page.
Output: Produces the final output with all the above changes.

The output is stored in a directory called out in the same folder. The original images are not changed, so in case you want some changes or something goes wrong, you can always go back to the original files. The images are also numbered.
So we now have clean pages of the same size from the scanned pages.
Update: The latest Scan Tailor has an image de-warping facility.

Step 3:

Collate the files processed in Step 2 into one single PDF. For this I have used the convert command from ImageMagick.
Typical syntax is like this:

convert *.tiff output.pdf

This command will take all the .tiff files in the current directory and collate them into a PDF named output.pdf.
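One thing to watch with the *.tiff glob: the shell expands it in plain alphabetical order, so page-10.tiff lands before page-2.tiff unless the numbers are zero-padded. Here is a minimal sketch of one way around it, assuming GNU sort (its -V option compares embedded numbers numerically). The natural_order helper and the page-N.tiff names are illustrative, not part of any tool.

```shell
# natural_order is a hypothetical helper: sort file names read from stdin so
# that embedded numbers compare numerically. Relies on GNU sort's -V option.
natural_order() {
    sort -V
}

# Quick check with made-up names; page-2 should come before page-10.
printf '%s\n' page-10.tiff page-2.tiff page-1.tiff | natural_order

# Possible use with convert (breaks on file names that contain spaces):
#   convert $(ls *.tiff | natural_order) output.pdf
```

If your files are already zero-padded (page-001.tiff, page-002.tiff and so on), the plain glob is fine and this is not needed.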

pdftk is another tool that can join PDF files:
http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
Alternative to Step 3
Another alternative is to use gscan2pdf for joining the image files into a PDF and doing the OCR, which can use tesseract or cuneiform. gscan2pdf is also able to scan files and stitch them into a PDF, but I would recommend that you use Scan Tailor as one of the intermediate steps.
gscan2pdf also gives you an option for editing the files if, for example, you want to remove some marks from the images. For this it opens the image in GIMP.

Step 4:
OCR the PDF file.
Now this is again tricky. I could not find a good application which would OCR the PDF file and embed the resulting text in it. But I have found a hack at the following link which seems to work fine 🙂
http://blog.konradvoelkel.de/2010/01/linux-ocr-and-pdf-problem-solved/
The hack is a bash script which does the required work.
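Newer Tesseract releases (3.03 onward, as far as I know) can also write a searchable PDF themselves, with the recognised text placed invisibly over the page image. The sketch below is not the script from the link above. It assumes such a Tesseract version plus pdftk are installed. The names run, ocr_page and join_pdfs are hypothetical helpers, and the file names are placeholders. Setting DRY_RUN=1 only prints the commands.

```shell
# Run a command, or only print it when DRY_RUN is set.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$@"; else "$@"; fi; }

# ocr_page is a hypothetical helper: OCR one image into a searchable PDF.
# `tesseract IMAGE OUTBASE -l LANG pdf` writes OUTBASE.pdf next to the image.
ocr_page() {
    img=$1
    run tesseract "$img" "${img%.*}" -l eng pdf
}

# join_pdfs is a hypothetical helper: join_pdfs OUT.pdf IN1.pdf IN2.pdf ...
join_pdfs() {
    out=$1
    shift
    run pdftk "$@" cat output "$out"
}

# Possible use, one PDF per page and then one book:
#   for f in out/*.tiff; do ocr_page "$f"; done
#   join_pdfs book.pdf out/*.pdf
```

The -l eng part selects the English language data; swap in the code for the language of your book, provided the matching Tesseract language pack is installed.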
Alternative to Step 4
gscan2pdf can do OCR for you using cuneiform or tesseract as backends. The end result is searchable text, but it does not sit on top of the image, as it would in a vector PDF; instead it is embedded on the page as a “note” at the top-left-hand corner.
Step 5:
Optimize the PDF file generated in Step 4.
There is a Nautilus shell script, which I found at the link below, that does the optimization.
http://www.webupd8.org/2010/11/download-compress-pdf-12-nautilus.html
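To the best of my knowledge, scripts like this are front-ends to Ghostscript, so the same kind of optimization can be tried by hand. Here is a minimal sketch, assuming gs is installed. The names run and shrink_pdf are hypothetical, and the /ebook preset is only one of Ghostscript's presets, alongside /screen, /printer and /prepress.

```shell
# Run a command, or only print it when DRY_RUN is set.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$@"; else "$@"; fi; }

# shrink_pdf is a hypothetical helper: shrink_pdf INPUT.pdf OUTPUT.pdf
# Rewrites the PDF through Ghostscript's pdfwrite device with the /ebook
# preset, which downsamples images to reduce the file size.
shrink_pdf() {
    src=$1
    dst=$2
    run gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
        -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$dst" "$src"
}

# Possible use:
#   shrink_pdf book.pdf book-small.pdf
```

Since the presets downsample the page images, compare the result with the original before deleting anything.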
Step 6:
In case you want to convert the .pdf to .djvu, there is a one-step solution for that too:

pdf2djvu -o output.djvu input.pdf

The tips and tricks here are by no means complete or the best, but this is what I have found to be useful. Some professional and non-free software can do all of this, but the point of writing this article was to make a list of Free and Open Source Software for this purpose.
Comments and suggestions are welcome!
