Automatic mass scanning of documents tutorial
Using a variety of packages, we’ll show you how to put together a script-driven customisable document scanning solution
We’re going to show you how to set up a mass document scanning system. To accomplish this, we’ll make use of a variety of Linux tools. The advantage of this approach is that you can tailor the process to match your specific requirements. This gives you a system that can handle big jobs and that is open to a lot of customisation.
We’ll be looking at two possible end products: a PDF file in which each page is a scan of an original page, and a text file that contains the textual content of the original pages. The content of the text file is searchable and we cover a couple of ways of making it into a PDF file.
This tutorial is modular. For example, if you are dealing with a set of pre-scanned images, you can skip the initial steps and move straight on to OCRing them or converting them into a PDF file. By the same token, if you prefer to use a GUI tool for some parts of the process, there’s nothing to stop you. That said, we’ve tried to make every part of the process scriptable for complete automation.
A Linux box
Step 01 Install SANE
Install SANE (Scanner Access Now Easy) using your package manager. If you have installed SANE and still have difficulty accessing your scanner, there may be a manufacturer-specific SANE back-end for it. In that case, Google for it.
Step 02 Find scanner
Check that SANE can work with your scanner by typing scanimage -L at the command line. If your scanner is supported, the text output will include a device name. The bit you need is the first element before the colons.
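For scripting, it can be handy to capture the device name in a variable rather than copying it by hand. The sketch below assumes that your version of scanimage supports the -f (formatted device list) option, where %d stands for the device name and %n for a newline. The function name first_scanner_device is our own invention.

```shell
# Print the device name of the first scanner that SANE reports.
# Assumes scanimage accepts -f with the %d (device name) and %n (newline) placeholders.
first_scanner_device() {
    scanimage -f '%d%n' | head -n 1
}

# Example use (uncomment once a scanner is attached):
# device=$(first_scanner_device)
# echo "Using scanner: $device"
```

If more than one scanner is attached, this simply picks the first one listed, so check the full scanimage -L output before relying on it.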
Step 03 Scan image
Do a quick test scan of something by typing scanimage -d [device name] > test.pnm. This will scan one page using default settings. Open the resulting file in a graphics viewer to check it. Add the -v option to troubleshoot if there are problems.
Step 04 Refine the scanimage options
If you plan to OCR the text, or colour isn’t needed, a resolution of 300 DPI in black and white is a typical choice, not least because it keeps the file sizes down. For example: scanimage -d [device name] --format=tiff --mode Lineart --resolution 300 > [filename].
Step 05 Create scanning script
We’ll make our first script out of the command-line string that carries out the scanning. Using a text editor, create a file called scan.sh. Add #!/bin/bash as the first line. Add the scanner line that you used as the second line. Save the file. Make the script executable by typing chmod +x scan.sh at the command line. Run it by typing ./scan.sh into a terminal. As we go on, create extra scripts like this at useful stages. You can chain these together into one long script or invoke the different stages separately.
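Put together, scan.sh might look something like the sketch below. Here the device name and output filename are taken from the command line, which is our own addition. The Lineart mode and 300 DPI resolution are assumptions too, because the option values a scanner accepts depend on its back-end (scanimage -d [device name] --help lists them).

```shell
#!/bin/bash
# scan.sh: scan a single page to a TIFF file.
# Usage: ./scan.sh [device name] [output file]

scan_page() {
    local device=$1
    local outfile=$2
    # Lineart and 300 DPI are assumed values; adjust them to suit your scanner.
    scanimage -d "$device" --format=tiff --mode Lineart --resolution 300 > "$outfile"
}

# Only scan when both arguments are supplied.
if [ "$#" -eq 2 ]; then
    scan_page "$1" "$2"
fi
```

Wrapping the command in a function makes it easier to reuse when you chain the stages together later.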
Step 06 Calculate the crop
Don’t worry about rotating the documents at this stage. Install GIMP using your package manager. Open the scanned document in GIMP and select the crop tool from the tool palette. Stretch the crop area over the valid part of the document and then make a note of the crop size and position in the crop dialog. Make the appropriate notes about page sizes if you intend to split facing pages into separate pages. Don’t carry out the crop in GIMP because we’ll be doing that from the command line in a minute.
Step 07 Install ImageMagick
Use the package manager to install ImageMagick on your system. We interact with this image-processing tool with the convert command. We can use it to rotate and crop the image. We can also use it to split pages. Note that convert uses a single dash in front of options.
Step 08 Crop the page
Use the parameters you arrived at in GIMP to carry out the crop. Type convert [image name] +repage -crop [width]x[height]+[x offset]+[y offset] [output name]. So, for instance, convert page1.png +repage -crop 2244x3113+1+1 page1_crop.png would crop a 2244 by 3113 pixel rectangle starting 1 pixel in from the top and left edges.
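The geometry string is easy to get wrong, so a small helper can build it from the four numbers noted in GIMP. This is only a sketch: crop_geometry and crop_page are hypothetical names of our own, and crop_page simply wraps the convert call from this step.

```shell
# Build an ImageMagick geometry string: [width]x[height]+[x offset]+[y offset].
crop_geometry() {
    printf '%sx%s+%s+%s\n' "$1" "$2" "$3" "$4"
}

# Crop one image. Arguments: input, output, width, height, x offset, y offset.
crop_page() {
    convert "$1" +repage -crop "$(crop_geometry "$3" "$4" "$5" "$6")" "$2"
}

# Example use (the numbers come from the crop dialog in GIMP):
# crop_page page1.png page1_crop.png 2244 3113 1 1
```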
Step 09 Rotate the page
If you had to scan the pages on their sides, use ImageMagick to rotate them. convert [input name] -rotate 90 [output name] should do the trick.
Step 10 Split facing pages
As before, make use of GIMP’s crop feature to work out the exact dimensions for the crop. convert page1.tiff +repage -crop 2233x1579+0+1529 page1_a.tiff followed by convert page1.tiff +repage -crop 2233x1546+0+0 page1_b.tiff creates two separate files from two facing pages.
Step 11 Make pre-processing script
A pre-processing script can tie these steps together: it creates a directory called splits, rotates each scan and then splits it into two pages, which are numbered in sequence. At the end, it deletes the rotated files. Save the script and then chmod +x it.
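One way such a script could look is sketched below. It is not a definitive version: the two crop geometries are placeholder values that you should replace with the figures you noted in GIMP, and the page_*.tiff input naming and the function name preprocess_scans are assumptions of ours.

```shell
#!/bin/bash
# preprocess.sh: rotate each scan, then split it into two numbered pages.
# The crop geometries are placeholders; replace them with your own figures.

LEFT_CROP="1579x2233+0+0"
RIGHT_CROP="1546x2233+1579+0"

preprocess_scans() {
    local n=1 scan rotated
    mkdir -p splits
    for scan in page_*.tiff; do
        [ -e "$scan" ] || continue          # no matching files: nothing to do
        rotated="rotated_$scan"
        convert "$scan" -rotate 90 "$rotated"
        # Two output pages per scan, numbered in sequence.
        convert "$rotated" +repage -crop "$LEFT_CROP" "$(printf 'splits/page_%03d.tiff' "$n")"
        convert "$rotated" +repage -crop "$RIGHT_CROP" "$(printf 'splits/page_%03d.tiff' "$((n + 1))")"
        n=$((n + 2))
        rm -f "$rotated"                    # delete the intermediate rotated file
    done
}

# Run from the directory that holds the scans (uncomment to use):
# preprocess_scans
```

The original scans are left untouched, so you can adjust the geometries and run the script again if the split is slightly off.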
Step 12 Scanner batch mode
Use the --batch family of options for multiple pages. If you don’t have a document feeder, add the --batch-prompt option to be prompted between each scan. In addition, add --batch=./page_%03d.tiff to give filenames that begin with ‘page_’ and end with a page number zero-padded to three digits, for example page_001.tiff.
Step 13 Pre-crop in scanner
You may be able to crop the page in the scanner, which leads to smaller files and faster operation because the scanner head doesn’t have to travel as far. Use the crop tool in GIMP to calculate the size and offset that you need by switching the units from px to mm in the dialog. If the information obtained in this way proves inaccurate, consider going low-tech and using a ruler. On the scanimage command line, the format of the extra flags is -l [left edge] -x [width] -t [top edge] -y [height].
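If you only have pixel values to hand, the conversion to millimetres is simple arithmetic: divide by the scan resolution and multiply by 25.4. The helper below is a sketch of that calculation. The name px_to_mm is our own, and it assumes awk is installed, as it is on virtually every Linux system.

```shell
# Convert a pixel measurement to millimetres for a given scan resolution.
# Arguments: pixels, DPI. There are 25.4 mm to the inch.
px_to_mm() {
    awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%.1f\n", px * 25.4 / dpi }'
}

# Example: a 2244 x 3113 pixel area, 1 pixel in from the top and left, at 300 DPI.
# scanimage -d [device name] -l "$(px_to_mm 1 300)" -t "$(px_to_mm 1 300)" \
#     -x "$(px_to_mm 2244 300)" -y "$(px_to_mm 3113 300)" > [filename]
```

For instance, px_to_mm 600 300 prints 50.8, because 600 pixels at 300 DPI is two inches.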
Step 14 Double-sided documents
If you have to scan double-sided documents, use the batch options as before, but add the --batch-double option to increment the page numbers by two. On the second pass, for the other sides, do the same again, but add --batch-start=2 so that the reverse sides take the even page numbers.
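The two passes can be wrapped up as in the following sketch. The function name scan_double_sided is hypothetical, and the example assumes that you present the reverse sides in the same order as the fronts. Drop --batch-prompt if your scanner has a document feeder.

```shell
# Scan the front sides, then the reverse sides, of a double-sided document.
# Argument: SANE device name.
scan_double_sided() {
    local device=$1
    # First pass: odd page numbers (1, 3, 5, ...).
    scanimage -d "$device" --format=tiff --batch=page_%03d.tiff \
        --batch-prompt --batch-double
    # Second pass: even page numbers (2, 4, 6, ...).
    scanimage -d "$device" --format=tiff --batch=page_%03d.tiff \
        --batch-prompt --batch-double --batch-start=2
}

# Example use:
# scan_double_sided [device name]
```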
Step 15 Convert scans to PDF
You can use ImageMagick to convert a directory full of scanned images into a PDF ‘book’. The command convert *.tiff output.pdf will create a multi-page document. If you need to insert a header page, name it page000.tiff and place it in the directory so that it sorts ahead of the numbered pages.
Step 16 OCR the text with Tesseract
Try the OCR engine on a test page. To do this, type tesseract [name of input file] [output file]. Don’t add a file extension to the output filename as this will be added by Tesseract. Note that Tesseract can detect multi-column text and facing pages.
Step 17 OCR batch processing
Use the following Bash code to OCR a directory full of scanned pages: for i in *.tiff; do tesseract "$i" "outtext$i"; done. The quotes keep filenames that contain spaces intact. The end result is a set of text files. Join them together with cat outtext*.txt > [output text file].
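As a script stage, the loop and the join might be combined as in this sketch. The function name ocr_directory and the outtext_ prefix are our own choices, and the loop strips the .tiff extension from each output name, which is a small departure from the one-liner.

```shell
# OCR every TIFF in the current directory and join the results.
# Argument: name of the combined text file.
ocr_directory() {
    local combined=$1 image
    for image in *.tiff; do
        [ -e "$image" ] || continue
        # Tesseract appends .txt itself, so pass the name without an extension.
        tesseract "$image" "outtext_${image%.tiff}"
    done
    # Zero-padded page numbers keep the files in page order.
    cat outtext_*.txt > "$combined"
}

# Example use:
# ocr_directory book.txt
```

Choose a combined filename that doesn’t begin with outtext_, otherwise cat would try to read the file it is writing.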
Step 18 Format text
By default, Tesseract will insert line breaks in the same places as they occurred in the source text. You can reflow the text file with the following command: fmt -u [input file] > [output file].
Step 19 Edit the text in LibreOffice
Simply cut and paste the output from the previous steps into LibreOffice Writer. At this stage, you can take control of the editing and manually edit things like section headers. You can even insert images from the original document.
Step 20 Export PDF file from LibreOffice
LibreOffice has built-in facilities for the creation of PDF files. When you have finalised the layout and formatting, go to File>Export as PDF. From here, click on Export and give the document a name.
Step 21 Scriptable PDF generation
Install the tools iconv, ps2pdf and enscript using the package manager. On most distributions, ps2pdf is part of the Ghostscript package and iconv comes with the C library, so it is probably installed already. Prepare the text file by typing iconv -f UTF-8 -t ISO-8859-1 -c [input textfile] > [output textfile]. Type enscript [textfile] -p [output postscript file.ps]. Convert the PostScript into a PDF by typing ps2pdf [.ps file].
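The three commands chain together naturally, as in the sketch below. The function name text_to_pdf and the intermediate filenames are assumptions of ours; the commands themselves are the ones from this step.

```shell
# Turn a UTF-8 text file into a PDF via PostScript.
# Arguments: input text file, output PDF file.
text_to_pdf() {
    local base=${2%.pdf}
    # Convert to ISO-8859-1; -c drops characters that cannot be converted.
    iconv -f UTF-8 -t ISO-8859-1 -c "$1" > "$base.latin1.txt"
    enscript -p "$base.ps" "$base.latin1.txt"
    ps2pdf "$base.ps" "$2"
    rm -f "$base.latin1.txt" "$base.ps"    # tidy up the intermediate files
}

# Example use:
# text_to_pdf book.txt book.pdf
```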