scanning

In this post we will see how to make a time-lapse animation of something which changes over time, with a scanner. Most probably you have seen some amazing time -lapse photography of different objects. Common examples include the ever changing skyscape, blooming flowers, metamorphosing insects etc. I wanted to do a similar stuff, but due to my lethargy and other reasons I did not. Though the cameras have intervelometer, and I have used it once to take photos of a lunar eclipse, (moon changing position which I was supposed to merge later, but never did), and wanted to do the same with a blooming of a flower. But as Ghalib has said, they were one of those हज़ारों ख्वाहिशें…
The roots of the idea what follows are germinated long back, when I had a scanner. It was a basic HP 3200 scanner. That time I did not have a digital camera, (c. 2002-2003), but then I used the scanner as a camera. I had this project lined up for making collages of different cereals. Though I got a few good images from botanical samples (a dried fern below) as well and also fractals from a sheet of rusting iron. Then, I sort of forget about it.

Coming to now, I saw some amazing works of art done by scanning flowers. I remembered what had been done a few years back and combined this with the amazing time-lapse sequences that I had seen , the germ began can we combine the two?
http://vimeo.com/22439234
Can we make the scanner, make scans at regular intervals, and make a animation from the resulting images. Scanning images with a scanner would solve problem of uniform lighting, for which you may require an artificial light setup. So began the task to make this possible. One obvious and most easy way to do this is to scan the images manually, lets say every 15 minutes. In this case you setup the scanner, and just press the scan button. Though this is possible, but its not how the computers should be used. In this case we are working for the computer, let us think of making the computer do work for us. In comes shell scripting to our rescue. The support for scanners in GNU/Linux is due to the SANE (Scanner Access Now Easy) Project. the GUI for the SANE is the xsane, which we have talked about in a previous post on scanning books and scanimage is the terminal option for the sane project.
The rough idea for the project is this :
1. Use scanimage to acquire images
2. Use some script to make this done at regular time intervals.
3. Once the images are with use, combine them to make a time-lapse movie
For the script part, crontab is what is mostly used for scheduling tasks, which you want to be repeated at regular intervals. So the project then became of combining crontab and scanimage. Scanimage has a mode called --batch in which you can specify the number of images that you want to scan and also provides you with renaming options. Some people have already made bash scripts for ADF (Automatic Document Feeders), you can see the example here. But there seems to be no option for delay between the scans, which is precisely what we wanted. To approach it in another way is to introduce the the scanimage command in a shell script, which would be in a loop for the required number of images and you use the sleep command for the desired time intervals, this approach does not need the crontab for its operation. But with I decided to proceed with the crontab approach.
The first thing that was needed was to get a hang of the scanimage options. So if your scanner is already supported by SANE, then you are good to go.
$scanimage -L
This will list out the devices available for scanning. In my case the scanner is Canon Lide 110, which took some efforts to get detected. For knowing how to install this scanner, if it is not automatically supported on your GNU/Linux system, please see here.
In my case it lists out something like this:
device `genesys:libusb:002:007' is a Canon LiDE 110 flatbed scanner
If there are more than one devices attached to the system the -L option will show you that. Now coming to the scan, in the scanimage programme, we have many options which control various parameters of the scanned picture. For example we can set the size, dpi, colour mode, output type, contrast etc. For a complete set of options you can go here or just type man scanimage at the terminal. We will be using very limited options for this project, namely the x, y size, mode, format, and the resolution in dpi.
Lets see what the following command does:
$scanimage -d genesys:libusb:002:006 -x 216 -y 300 --resolution 600dpi --mode Color --format=tiff>output.tiff
-d option specifies the device to be used, if there is nothing specified, scanimage takes the first device in the list which you get with -L option.
-x 216 and -y 300 options specify the size of the final image. If for example you give 500 for both x and y, scanimage will tell us that maximum x and y are these and will use those values. Adjusting these two values you will be able to ‘select’ the area to be scanned. In the above example the entire scan area is used.
--resolution option is straight forward , it sets the resolution of the image, here we have set it to 600dpi.
--mode option specifies the colour space of the output, it can be Color, Gray or Lineart
--format option chooses the output of the format, here we have chosen tiff, by default it is .pnm .
The > character tells scanimage that the scan should be output to a file called “output.tiff”, by default this will in the directory from where the command is run. For example if your command is run from the /home/user/ directory, the output.tiff will be placed there.
With these commands we are almost done with the scanimage part of the project. With this much code, we can manually scan the images every 15 minutes. But in this case it will rewrite the existing image. So what we need to do is to make sure that the filename for each scan is different. In the --batch mode scanimage takes care of this by itself, but since we are not using the batch mode we need to do something about it.
What we basically need is a counter, which should be appended to the final line of the above code.
For example let us have a variable n, we start with n=1, and each time a scan happens this variable should increment by 1. And in the output, we use this variable in the file name.
For example, filename = out$n.tiff:
n =1 | filename = out1.tiff
n = n + 1
n = 2 | filename = out2.tiff
n = n + 1
n = 3 and so on…
We can have this variable within the script only, but since we are planning to use crontab, each time the script gets called, the variable will be initialized, and it will not do the function we intend it to do. For this we need to store variable outside the script, from where it will be called and will be written into. Some googling and landed on this site, which was very helpful to attain what I wanted to. Author says that he hasn’t found any use for the script, but I have 🙂 As explained in the site above this script is basically a counter, which creates a file nvalue. starting from n=0, values are written in this file, and each time the script is executed, this file with n=n+1 is updated.
So what I did is appended the above scanimage code to the ncounter script and the result looks something like this:
#!/bin/bash nfilename="./nvalue" n=0 touch $nfilename . $nfilename n=$(expr $n + 1) echo "n=$n" > $nfilename scanimage -x 80 -y 60 --resolution 600dpi --mode Color --format=tiff>out"$n".tiff
What this will attain is that every time this script is run, it will create a separate output file, depending on the value of n. We put these lines of code in a file and call it time-lapse.sh
Now to run this file we need to make it executable, for this use:
$chmod +x time-lapse.sh
and to run the script:
$./time-lapse.sh
If everything is right, you will get a file named out1.tiff as output, running the script again you will have out2.tiff as the output. Thus we have attained what we had wanted. Now everytime the script runs we get a new file, which was desired. With this the scanimage part is done, and now we come to the part where we are scheduling the scans. For this we use the crontab, which is a powerful tool for scheduling jobs. Some good and basic tutorials for crontab can be found here and here.
To edit your crontab use:
$crontab -e
If you are using crontab for the first time, it will ask for editor of choice which has nano, vi and emacs. For me emacs is the editor of choice.
So to run scans every 15 minutes my crontab looks like this:
# m h dom mon dow command */15 * * * * /home/yourusername/time-lapse.sh
And I had tough time when nothing was happenning with crontab. Though the script was running correctly in the terminal. So finally the tip of adding in the cronfile
SHELL=/bin/bash
solved the problem. But it took me some effort to land up on exact cause of the problem and in many places there were sermons on setting PATH and other things in the script but, I did not understand what they meant.
Okay, so far so good. Once you put this script in the crontab and keep the scanner connected, it will produce scans every 15 minutes. If you are scanning in colour at high resolution, make sure you have enough free disk space.
Once the scans have run for the time that you want them , lets say 3 days. You will have a bunch of files which are the time lapse images. For this we use the ffmpeg and ImageMagick to help us out.

How to give a new life to books which are out of copyright!

Here is a short summary of the Free Software tools that I have found useful for converting hard copies into readable/searchable formats in GNU/Linux!

Typically the making a soft-copy from a hard-copy involves following steps:

Step1:
Scan the Hard copy using a scanner / camera. This step generates image files
typically .tiff, .png or .jpeg. Some scanning programs also have option of directly generating to .pdf
Basically at this stage you have all the data, if you compress the folder into a comic book reader format .cbr or .cbz format you are good to go. But for a more professional touch read on. The main step to scan the books properly. Some do’s and dont’s
Align the pages to the sides of the scanner.
If the book is small size scan 2 pages at once.
If the book is too large adjust the scan in the image preview side so that only one page is scanned.
If these steps are done properly there is a little that we have to do in the second step. And we can directly jump to Step 3.
Preferably scan in the ~~binary~~ grayscale form, unless there are colored images in the text. This will help reduce the final size of the file.
Scan at minimum 300 dpi, this is the optimum level that I have come to after trials and errors with different resolutions, their final results and the time taken for each scan. Of course this can differ depending on what is that you are scanning. Many people do the scanning at 600 dpi, but I am happy at 300 dpi. Note: The 300 dpi images can be upscaled in scan-tailor to 600 dpi.
First of all for the scanning itself. Most of the scanners come with an installation disk for M$-Windows or Mac-OSX. But for GNU/Linux there seems to be no ‘installation disk’. The Xsane package allows quite a few scanners which are detected and are ready for use as soon as you plug them in.
The list of the scanners which are supported by Xsane can be found here:
http://www.sane-project.org/sane-mfgs.html
When we bought our scanner we had to search this list to get the compatible scanner.
What is the problem with the manufacturers, why do they not want to sell more, to people who are using Free Software?
If your scanner is not in the list, then you might have to do some R&D before your scanner is up and running like I had to do for my old HP 2400 Scanjet at my home.
Once your scanner is up and running. You scan the images preferably in .tiff format as they can be processed and compressed without much loss of quality. This again I have found by trial and error.
Step2:
Crop the files and rotate them to remove unwanted white spaces or
accidental entries of adjoining pages from the images that were obtained. When the pages are scanned as 2 pages in one image, we may need to separate the pages.
Initially I did it manually, it was the second most boring part after the scanning. But I have found a very wonderful tool for this work.
Imagemagick provides a set of tools which work like magick in images, hence the name I guess 🙂
This is one of the best tools for batch processing image files.
Then I found out the dream tool that I was looking for.
The is called Scan-Tailor, as the name suggests it is meant for processing of scanned images.
Scan Tailor can be found at http://scantailor.sourceforge.net/ or directly from Ubuntu Software Centre.
Step by step scan tailor cleans and creates amazingly good output files from relatively unclean images.
There are a total of 6 steps in scan-tailor which produces the desired output.
You have to choose the folder in which your scanned images are. Scan-tailor produces a directory called out in the same folder by default. The steps are as follows

Change the Orientation: This enables one to change the orientation of all the files in the directory. This is good option in case you have scanned the book in a different orientation.
Split Pages: This step will tell whether the scans that we have made are single page scans, single page with some marginal text from other page or two page scans. Most of the times the auto detection works well with single page and two page scans. But it is a good idea to check manually whether all the pages have been divided correctly, so that it does not create problems later. If you find that a page has been divided incorrectly then we can slide the margin to correct it. In case of two page scans the two pages are shown with a semitransparent blue or red layer on top of them. After looking at all the pages we commit the result.
Deskew: After the pages have been split we need to change the orientation for better alignment of the text. Here in my experience most of the auto-orientation works fine. But still it is a good idea to check manually the pages, in case something is missed.
Select Content: This is the one step that I have found as the most useful one in the scan-tailor. Here you can select the portion of the text that will appear in the final output. So that you can say goodbye to all the dark lines that come inevitably as part of scanning. Also some library marks can be removed easily by this step. The auto option works well when the text is in nice box shape, but it may leave wide areas open also. The box shape can be changed the way we want. If you want a blank page, remove the content box, by right clicking on the box.
Page Layout: Here one can set the dimensions for the output page and how each page content will be on the page.
Output: Produces the final output with all the above changes.

The output is stored in a directory called Out in the same folder. The original images are not changed, so that in case you want some changes or something goes wrong we can always go back to the original files. Also numbering of the images is done.
So we have cleaned pages of same size from the scanned pages.
Update: The latest scantailor has image -de-warping facility. See the amazing thing at work here:

Step 3:

Collate the processed files in Step 2 to one single PDF. For this I have used the convert command.
Typical synatax is like this

convert *.tiff output.pdf

This command will take all the .tiff files in the given directory and collate these files into a pdf named output.pdf

http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
Alternative to Step 3
Another alternative is to use gscan2pdf for joining the image files into pdf and doing the OCR which can be tesseract or cunieform. gscan2pdf is also able to scan files and stich them into pdf , but I would recommend that you use scantailor as one of the intermediate steps.
Also using gscan2pdf gives you an option for editing the files, if, for example, you might want to remove some marks from the images. For this it opens the image in GIMP.

Step 4:
OCR the PDF file.
Now this is again tricky, I could not find a good application which would OCR the pdf file and embed the resulting text on the pdf file. But I have found a hack on the following link which seems to work fine 🙂
http://blog.konradvoelkel.de/2010/01/linux-ocr-and-pdf-problem-solved/
The hack is a bash script which does the required work.
Alternate
gscan2pdf can do OCR for you using cunieform or tesseract as backends. The end result is a searchable text, but it does not sit on the image, as it would happen in a vector pdf, but is embedded on the page as “note” at the top-left-hand corner.
Step 5:
Optimize the PDF file generated in Step 4.
Here there is a nautilus shell script which I have found in the link below which does optimization.
http://www.webupd8.org/2010/11/download-compress-pdf-12-nautilus.html
Step 6:
In case you want to convert the .pdf to .djvu there is one step solution for that also

pdf2djvu -o output.djvu input.pdf

The tips and tricks here are by no means complete or the best. But this is what I have found to be useful. Some of the professional and non-free softwares can do all these, but the point of writing this article was to make a list of Free and Open Source Softwares for this purpose.
Comments and suggestions are welcome!

Temet Nosce

Know Thyself Too…

Time Lapse Film Using Scanner

Free Software Tools for scanning and making e-books