Textractor by skvark

Welcome to the Textractor documentation

Textractor is an OCR (optical character recognition) application. It extracts text from images so that you can edit or save the text on a digital device.

Textractor is built on two third-party libraries. The recognition itself is performed by Google's Tesseract OCR engine. General image manipulation is done with the Leptonica image processing library before the image is passed to Tesseract for recognition.

General Information

Upon first use, Textractor will prompt you to choose a language file. The selected file is then downloaded and installed on your device. After that you can start using Textractor. If you need more languages, just select and download them. Once a language file is downloaded, it does not need to be downloaded again.

Question: Why do I need to download the files separately?

Answer: Jolla recommends, and common sense agrees, that developers should not ship massive amounts of data files with an application. The application would have been over 200 MB in size if all the language files had been embedded in the .rpm package.

Supported image formats

Textractor supports all the image formats that Sailfish OS supports. Images must have 32 bits per pixel. If an image does not have 32 bits per pixel, the preprocessing step is skipped and the results may be poor. However, images that have already been preprocessed will work fine: Textractor outputs a 1 bpp (binarized) image after the preprocessing step and saves it to the gallery. The preprocessed image is always overwritten when a new recognition is started.

Cropping the image

After you have selected or taken an image, Textractor shows a cropping page. By dragging the four corners to the desired positions you can select a smaller area of the image. The area doesn't need to be rectangular: the corner points can form any quadrilateral, because perspective correction is applied to the selected area afterwards. Textractor tries to correct the distortion in the area (the camera's optical axis is rarely perpendicular to the target when an image is taken) according to the selected corner point locations.
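As an illustration of what perspective correction from four corner points involves, the sketch below computes a projective transform (homography) that maps the corners of an output rectangle onto four dragged corner points. It is not Textractor's actual code. The function names (solve_linear, homography, apply) and the sample coordinates are hypothetical, and the corner order is assumed to be the same in both lists.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("corner points are degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src: Sequence[Point], dst: Sequence[Point]) -> List[float]:
    """Return [a, b, c, d, e, f, g, h] so that each src point maps to its dst point.

    The mapping is u = (a*x + b*y + c) / (g*x + h*y + 1),
                   v = (d*x + e*y + f) / (g*x + h*y + 1).
    """
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -x * u, -y * u])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -x * v, -y * v])
        rhs.append(v)
    return solve_linear(rows, rhs)


def apply(h: Sequence[float], point: Point) -> Point:
    """Map one point through the homography h."""
    x, y = point
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)


# Corners of the corrected output rectangle and hypothetical dragged corners,
# both listed as top-left, top-right, bottom-right, bottom-left.
rect = [(0.0, 0.0), (400.0, 0.0), (400.0, 300.0), (0.0, 300.0)]
quad = [(35.0, 20.0), (380.0, 48.0), (410.0, 290.0), (12.0, 260.0)]
h = homography(rect, quad)
```

To build the corrected image, each output pixel (x, y) would be looked up in the photo at apply(h, (x, y)). Mapping from the output back into the source avoids holes in the result.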

If you don't want to crop the image, just proceed to analyze.

Settings

Textractor has some adjustable parameters that can improve the recognition results if adjusted correctly. The default values are chosen so that Textractor should work reasonably well for most images. Most of the settings are rather advanced, and I recommend reading the instructions below if you want to adjust them.

Postprocessing

Minimum word confidence: Textractor filters the results after recognition based on this value. A higher confidence means that a word is more likely to be correct. A value of 10-20 might help to filter out some invalid results.
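The filtering idea can be sketched in a few lines. This is only an illustration: the function name filter_by_confidence and the sample data are hypothetical, and both the 0-100 confidence scale and the "keep if confidence is at least the minimum" rule are assumptions about how the setting is applied.

```python
from typing import List, Tuple


def filter_by_confidence(words: List[Tuple[str, float]],
                         min_confidence: float) -> List[str]:
    """Keep only the words whose confidence is at least min_confidence."""
    return [text for text, conf in words if conf >= min_confidence]


# Hypothetical recognizer output: (word, confidence on a 0-100 scale).
recognized = [("Hello", 92.0), ("w0rld", 14.5), ("~|", 3.2)]
kept = filter_by_confidence(recognized, 15)
```

With a minimum of 15, the two low-confidence fragments are dropped and only "Hello" remains. Setting the value too high would start to remove correctly recognized words as well.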

Preprocessing

These settings control the background normalization and Otsu thresholding used in preprocessing. If you change them, you can always restore the default values from the pulley menu.

Leptonica notes (other explanations are from Leptonica sources too):

Otsu binarization attempts to split the image into two roughly equal sets of pixels, and it does a very poor job when there are large amounts of dark background. By doing a background normalization first, to get the background near 255, we remove this problem. Then we use a modified Otsu to estimate the best global threshold on the normalized image.
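For readers unfamiliar with Otsu thresholding, here is a minimal sketch of the plain textbook method on 8-bit grayscale values. It is not Leptonica's modified variant and not Textractor's code. The function name otsu_threshold and the sample pixel values are hypothetical.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the threshold t (0-255) that maximizes the between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    dark_count = 0   # number of pixels with value <= t
    dark_sum = 0     # sum of their values
    for t in range(256):
        dark_count += hist[t]
        dark_sum += t * hist[t]
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue
        dark_mean = dark_sum / dark_count
        light_mean = (sum_all - dark_sum) / light_count
        # Between-class variance, up to a constant factor of 1 / total**2.
        score = dark_count * light_count * (dark_mean - light_mean) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


# Hypothetical page: dark text pixels around 30, light background around 220.
sample = [30] * 50 + [220] * 50
threshold = otsu_threshold(sample)
text_pixels = [p for p in sample if p <= threshold]
```

A single global threshold like this is what struggles when a large part of the image is dark background, which is why the normalization step is applied first.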

Tile size in pixels: The dimensions of the pixel tile give the amount by which the map is reduced in size from the input image.
Threshold for determining foreground: The threshold is used to binarize the input image, in order to locate the foreground components. If this is set too low, some actual foreground may be used to determine the maps; if set too high, there may not be enough background to determine the map values accurately. Typically, it's better to err by setting the threshold too high.
Min threshold on counts in a tile: This is a minimum count of pixels in a tile for which a background reading is made, in order for that pixel in the map to be valid. This number should perhaps be at least 1/3 the size of the tile.
Target bg value for the normalized image: A target background value for the normalized image. This should be at least 128. If set too close to 255, some clipping will occur in the result.
Smoothing factor: Input for smoothing the map. Each low-pass filter kernel dimension is 2 * (smoothing factor) + 1, so a value of 0 means no smoothing. A value of 1 or 2 is recommended.
Otsu score fraction: The scorefract is the fraction of the maximum Otsu score, which is used to determine the range over which the histogram minimum is searched.
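To show how the tile size, foreground threshold, minimum count and target background value relate to each other, the following is a simplified sketch of tile-based background normalization on a grayscale image stored as a list of rows. It is not Leptonica's implementation. The function names are hypothetical, invalid tiles are simply given the average of the valid ones (an assumption made for brevity), and the map is not smoothed; kernel_size only evaluates the 2 * (smoothing factor) + 1 formula from the list above.

```python
from typing import List, Optional

Image = List[List[int]]


def background_map(img: Image, tile: int, fg_thresh: int,
                   min_count: int) -> List[List[Optional[float]]]:
    """Build one background estimate per tile, or None if too few pixels qualify."""
    height, width = len(img), len(img[0])
    rows = (height + tile - 1) // tile
    cols = (width + tile - 1) // tile
    result: List[List[Optional[float]]] = []
    for ty in range(rows):
        row: List[Optional[float]] = []
        for tx in range(cols):
            # Pixels at or above the threshold are treated as background.
            vals = [img[y][x]
                    for y in range(ty * tile, min((ty + 1) * tile, height))
                    for x in range(tx * tile, min((tx + 1) * tile, width))
                    if img[y][x] >= fg_thresh]
            valid = len(vals) > 0 and len(vals) >= min_count
            row.append(sum(vals) / len(vals) if valid else None)
        result.append(row)
    return result


def normalize_background(img: Image, tile: int, fg_thresh: int,
                         min_count: int, bg_target: int) -> Image:
    """Scale every pixel so that its tile's background lands near bg_target."""
    bg_map = background_map(img, tile, fg_thresh, min_count)
    known = [v for row in bg_map for v in row if v is not None]
    # Simplification: invalid tiles fall back to the mean of the valid tiles.
    fallback = sum(known) / len(known) if known else float(bg_target)
    out: Image = []
    for y, line in enumerate(img):
        out_line = []
        for x, p in enumerate(line):
            bg = bg_map[y // tile][x // tile]
            bg = max(fallback if bg is None else bg, 1.0)
            # Values above 255 are clipped, as noted for the target bg value.
            out_line.append(min(255, round(p * bg_target / bg)))
        out.append(out_line)
    return out


def kernel_size(smoothing_factor: int) -> int:
    """Low-pass filter kernel dimension for a given smoothing factor."""
    return 2 * smoothing_factor + 1


# Hypothetical 4x4 image: darker background on the left, one text pixel (10).
page = [[100, 100, 200, 200],
        [100, 10, 200, 200],
        [100, 100, 200, 200],
        [100, 100, 200, 200]]
flat = normalize_background(page, tile=2, fg_thresh=60, min_count=1, bg_target=200)
```

In this example both halves of the background end up at 200 while the text pixel stays dark (20), so a single global threshold can separate them afterwards.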

Hints for better results

Taking a Good Picture

To get the best results you should follow a couple of simple guidelines when taking pictures:

Check that the lighting conditions are good. There should be no visible shadows or reflections in the image.
Check that the color of the background is light and that there are no complex textures in it. The background can also be dark: just make sure the text is white or some other light color.

Reasons for slow processing

Processing will be very slow and the results will be poor if the quality of the picture is bad. Some examples of bad-quality pictures:

Underexposure or overexposure
Distorted (e.g. the text is not straight) or blurred (the camera moved during the shot) image
There is a complex image or texture behind the text to be recognized
A hand or some other object cast a shadow on the picture
There are reflections in the picture

About the PDF analysis

Running OCR on PDF files is mainly intended for files that have been created, for example, by a scanner. In such files the text is actually stored as an image and can't be copied without OCR. However, this feature also works for PDF files that have been created with a text editor.

Authors and Contributors

This application has been developed by @skvark. The icon and Jolla store header image have been created by @fercen.

Bug Reports and Feature Requests

Bugs and feature requests can be submitted to the repository's issue tracker.

Textractor

OCR application for Sailfish OS. Based on Google's Tesseract OCR engine and Leptonica image processing library.