Remaking ebooks from existing PDF and DjVu files

Suppose you have an ebook or an article in PDF format which, unfortunately, is not cleaned. By not cleaned we mean

  • Single-page scan: edge darkening, pages not aligned (that is, the text is rotated differently from page to page), differing page sizes, library and use marks, etc.
  • 2-in-1 scan: two pages scanned together, a dark band along the central spine, pages not rotated properly, edge and wear marks, library marks, etc.

In this case we cannot run tools like scantailor on the file directly, because scantailor works on images and not on PDFs. So we first need to extract the pages from the PDF file as images and then process those images. One can extract the images one by one and process them, but there is a better way.

First we split the PDF file into single-page PDFs using the very versatile pdftk.

For this, type in the terminal

$ pdftk file.pdf burst

It will create as many PDF files as there are pages, with names like pg_0001.pdf, pg_0002.pdf, etc., along with a small report file named doc_data.txt.
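If you would rather keep the single-page files out of the directory that holds the original, pdftk accepts a printf-style name pattern after the word output. The following is only a sketch: burst_into and the PDFTK variable are names made up here for illustration (PDFTK lets you swap in echo for a dry run), and the directory name is whatever you choose.

```shell
# Sketch: burst a PDF into its own working directory.
# burst_into and PDFTK are illustrative names, not part of pdftk.
burst_into() {
    src=$1
    outdir=$2
    # Create the target directory if it does not exist yet.
    mkdir -p "$outdir" || return 1
    # %04d is filled in by pdftk with the page number: pg_0001.pdf, pg_0002.pdf, ...
    "${PDFTK:-pdftk}" "$src" burst output "$outdir/pg_%04d.pdf"
}
```

For example, burst_into file.pdf pages puts all the single-page PDFs in a directory called pages.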

The next task is to convert these PDFs to images. For this we use the convert command from ImageMagick. We could convert the files one by one with

convert pg_0001.pdf pg_0001.tiff

but this is not practical for a large number of files; we want to do it all in one go. So we do the following

$ for i in pg_*.pdf
do
convert -density 600 "$i" "${i%.pdf}.tiff"
done

Let's see what these commands do:

pg_*.pdf

is a shell pattern that expands to the list of all files in the current directory whose names start with pg_ and end with .pdf, that is, exactly the single-page files that pdftk created. The original file.pdf is left out, so it does not get converted again.

for i in pg_*.pdf

takes each member of this list in turn and stores its name in the variable i

and for each member we

do

the following

convert -density 600 "$i" "${i%.pdf}.tiff"

Here "$i" is the name of the current PDF file, and ${i%.pdf} is the same name with the trailing .pdf removed, so pg_0001.pdf becomes pg_0001.tiff. The quotes keep file names that contain spaces intact.

and after this is over the task is

done

The -density option sets the resolution in dpi at which each page is rendered; above it is set to 600. The output images have the same names as the input PDF files, with the extension changed to .tiff.
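Converting a long book at 600 dpi takes a while, so it can be handy to be able to stop and restart the job. The sketch below wraps the same loop in a function that skips pages which already have an image. The names convert_pages, CONVERT and DPI are assumptions made for this example, not ImageMagick features; setting CONVERT=echo shows what would be run without running it.

```shell
# Sketch: convert every pg_*.pdf in a directory to TIFF, skipping finished pages.
# convert_pages, CONVERT and DPI are illustrative names, not part of ImageMagick.
convert_pages() {
    dir=${1:-.}
    dpi=${DPI:-600}
    for pdf in "$dir"/pg_*.pdf; do
        # If nothing matches, the pattern stays as literal text; skip it.
        [ -e "$pdf" ] || continue
        tiff=${pdf%.pdf}.tiff
        # Do not redo pages that already have an image.
        if [ -e "$tiff" ]; then
            echo "skipping $pdf"
            continue
        fi
        "${CONVERT:-convert}" -density "$dpi" "$pdf" "$tiff" || return 1
    done
}
```

Running convert_pages with no argument works on the current directory; DPI=300 convert_pages pages would render the files in the pages directory at 300 dpi instead.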

Now we can happily run scantailor on these images to clean them up!

PS:

If you have a djvu file instead of a PDF, we take another approach.

Step 1

Convert the djvu file into a multipage tif file using the ddjvu command.

$ ddjvu -format=tiff -verbose -quality=uncompressed input_file.djvu output_file.tif

This gives us a TIFF file with the same resolution as the original djvu file.

Step 2

Once we have the multipage tif file, it can be split into its individual pages with the tiffsplit command.

$ tiffsplit output_file.tif

This writes one file per page, named xaaa.tif, xaab.tif and so on.

And we are done. Now we can happily run scantailor on these tiff files.
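By default tiffsplit names its output xaaa.tif, xaab.tif and so on, which is not very readable. If you prefer numbered page names, something along these lines should do it. This is a sketch only: number_pages and the page_0001.tif naming scheme are choices made up for this example.

```shell
# Sketch: rename tiffsplit output (xaaa.tif, xaab.tif, ...) to page_0001.tif, ...
# number_pages is an illustrative helper name, not part of libtiff.
number_pages() {
    dir=${1:-.}
    n=1
    # The shell expands the pattern in sorted order, which matches page order.
    for f in "$dir"/x???.tif; do
        # If nothing matches, the pattern stays as literal text; skip it.
        [ -e "$f" ] || continue
        mv -- "$f" "$dir/$(printf 'page_%04d.tif' "$n")"
        n=$((n + 1))
    done
}
```

Run number_pages in the directory where tiffsplit wrote its files, or pass that directory as the argument.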

 
