Suppose you have an ebook or an article in PDF format that has unfortunately not been cleaned up. By not cleaned up we mean one of the following:
- Single-page scan: edge darkening, pages not aligned (that is, the text is rotated differently from page to page), differing page sizes, library and use marks, etc.
- 2-in-1 scan: two pages scanned together, a dark band along the central spine, pages not rotated properly, edge and wear marks, library marks, etc.
In this case we cannot use tools like scantailor on the file directly, because they work on images. So we first need to extract the images from the PDF file and then process them. One can extract the images one by one and process them, but there is a better way.
First we split the PDF file into single-page PDFs using the versatile pdftk.
To do this, type the following in the terminal:
$ pdftk file.pdf burst
It will create as many PDF files as there are pages, with names like pg_0001.pdf, pg_0002.pdf etc.
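A small variation keeps the burst pages apart from the original file: the burst operation of pdftk accepts an output pattern, so the pages can be written into a directory of their own. The sketch below assumes a directory called pages and a wrapper function called burst_pdf; both names are only illustrative and are not part of pdftk.

```shell
# Sketch: burst a PDF into its own directory.
# burst_pdf and the default directory "pages" are hypothetical names.
burst_pdf() {
    input=$1
    outdir=${2:-pages}
    mkdir -p "$outdir" || return 1
    # %04d is replaced by the page number, as in the default pg_%04d.pdf
    pdftk "$input" burst output "$outdir/pg_%04d.pdf"
}

# Usage (not run here): burst_pdf file.pdf pages
```

With the pages in a separate directory, later batch commands cannot pick up file.pdf by mistake.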
The next task is to convert these PDFs to images. For this we use the convert command (part of ImageMagick), but we don’t want to convert the files one by one with
$ convert pg_0001.pdf pg_0001.tiff
This is not very practical for a large number of files; we want to do it in one go. So we do the following
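$ for i in pg_*.pdf;
do
convert -density 600 "$i" "${i%.pdf}.tiff";
done
Let's see what these commands do:
pg_*.pdf
is a shell pattern that expands to a list of all files in the directory whose names start with pg_ and end in .pdf, which are exactly the pages that pdftk produced. Filtering the output of ls with grep pdf would also give a list, but it would include the original file.pdf and any other name containing pdf, and it breaks on filenames with spaces.
On this list we can do a lot of operations, as on any other list.
for i in pg_*.pdf
takes each member of this list in turn and treats it as the variable i
and for each member we
do
the following
convert -density 600 "$i" "${i%.pdf}.tiff"
where ${i%.pdf} is the filename with the .pdf ending removed, so that each image gets a .tiff ending instead, and after this is over the task is
done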
The -density option sets the resolution in dpi at which each PDF page is rendered; above it is set to 600. The output images take their names from the input PDF files.
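For a long book the conversion can take a while, so it helps if the loop can be re-run without redoing finished pages. The following is a minimal sketch of that idea, assuming the pages are named pg_*.pdf as pdftk names them by default; the function name convert_pages is made up for this example.

```shell
# Sketch: convert every burst page to TIFF, skipping pages already done.
# convert_pages is a hypothetical helper; the dpi defaults to 600.
convert_pages() {
    dpi=${1:-600}
    for pdf in pg_*.pdf; do
        [ -e "$pdf" ] || continue      # the pattern matched no files
        tiff=${pdf%.pdf}.tiff          # pg_0001.pdf -> pg_0001.tiff
        [ -e "$tiff" ] && continue     # converted in an earlier run
        convert -density "$dpi" "$pdf" "$tiff" || return 1
    done
}

# Usage (not run here): convert_pages 300
```

Because the pattern is pg_*.pdf, the original file.pdf in the same directory is left alone.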
Now we can happily run scantailor on these images to clean them up!
PS:
If you have a djvu file instead of a PDF, there is another approach.
Step 1
Convert the djvu file into a multipage TIFF file using the ddjvu command.
$ ddjvu -format=tiff -verbose -quality=uncompressed input_file.djvu output_file.tif
This command produces a TIFF file with the same resolution as the original djvu file.
Step 2
Once the multipage TIFF file exists, split it into its individual pages with the tiffsplit command.
$ tiffsplit output_file.tif
And we are done. tiffsplit names the pages xaaa.tif, xaab.tif and so on, and we can now happily run scantailor on these TIFF files.
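The two djvu steps can be chained together. This is only a sketch: the function name djvu_to_pages and the default prefix page_ are assumptions made for the example, while the ddjvu options are the ones used above. tiffsplit accepts an optional prefix as its second argument, which replaces the default x in the output names.

```shell
# Sketch: djvu -> multipage TIFF -> one TIFF per page.
# djvu_to_pages and the prefix "page_" are hypothetical names.
djvu_to_pages() {
    input=$1
    prefix=${2:-page_}
    multipage=${input%.djvu}.tif       # book.djvu -> book.tif
    ddjvu -format=tiff -quality=uncompressed "$input" "$multipage" || return 1
    # Pages come out as ${prefix}aaa.tif, ${prefix}aab.tif, ...
    tiffsplit "$multipage" "$prefix"
}

# Usage (not run here): djvu_to_pages book.djvu page_
```

The multipage file is kept next to the input, so it can be deleted by hand once the split pages look right.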