Welcome to the Textractor documentation

Textractor is an OCR (optical character recognition) application. It extracts text from images so that you can edit or save the text on a digital device.

Textractor is built on two third-party libraries. The recognition itself is performed by Google's Tesseract OCR engine. Before an image is passed to Tesseract, some general image manipulation is done with the Leptonica image processing library.

General Information

Upon first use, Textractor will prompt you to choose a language file. The selected file is then downloaded and installed on your device. After that you can start using Textractor. If you need more languages, just select and download them. Once a language file is downloaded, it does not need to be downloaded again.

Question: Why do I need to download the language files separately?

Answer: Both Jolla and common sense recommend that developers don't ship massive amounts of data files with an application. If all the language files were embedded in the .rpm package, the application size would be over 200 MB.

Supported image formats

Textractor supports all the image formats that Sailfish OS supports. Images must have 32 bits per pixel (bpp). If an image does not have 32 bpp, the preprocessing step is skipped and the results may be poor. However, images that have already been preprocessed will work fine: after the preprocessing step, Textractor outputs a 1 bpp (binarized) image and saves it to the gallery. The preprocessed image is overwritten every time a new recognition is started.

Cropping the image

After you have selected or taken an image, Textractor shows a cropping page. By dragging the four corners to the desired positions you can select a smaller area of the image. The area doesn't need to be rectangular: the corner points can form any quadrilateral, because perspective correction is applied to the selected area afterwards. When a picture is taken, the camera's optical axis is rarely perpendicular to the target, so Textractor tries to correct the resulting distortion based on the locations of the selected corner points.

If you don't want to crop the image, proceed straight to the analysis.
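The perspective correction described above can be illustrated with a small sketch. It is not Textractor's actual implementation (which presumably relies on Leptonica); it only shows the general idea of computing a projective transform (homography) that maps four dragged corner points onto an upright rectangle. The function names, the corner coordinates and the 300 x 400 target size are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a*x = b with Gauss-Jordan elimination and partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("corner points are degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src: Sequence[Point], dst: Sequence[Point]) -> List[float]:
    """Return the 8 coefficients of the transform mapping src[i] to dst[i]."""
    a: List[List[float]] = []
    b: List[float] = []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear equations per point pair (one for u, one for v).
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    return solve_linear(a, b)


def warp_point(h: Sequence[float], p: Point) -> Point:
    """Apply the transform to a single point."""
    x, y = p
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)


# Hypothetical corner points dragged by the user (clockwise from top left)
corners = [(12.0, 8.0), (310.0, 30.0), (290.0, 420.0), (25.0, 400.0)]
# Upright rectangle the selection should be mapped to (size is an assumption)
target = [(0.0, 0.0), (300.0, 0.0), (300.0, 400.0), (0.0, 400.0)]

h = homography(corners, target)
print([warp_point(h, c) for c in corners])
```

A real implementation would use the inverse mapping to resample every pixel of the output image. The sketch only transforms individual points.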

Settings

Textractor has some adjustable parameters that can improve the recognition results when tuned correctly. The default values are chosen so that Textractor should work reasonably well for most images. Most of the settings are rather advanced, so I recommend reading the instructions below before adjusting them.

Postprocessing

Minimum word confidence: After recognition, Textractor filters the results based on this value. A higher confidence means that a word is more likely to be correct, and words below the minimum are discarded. A value of 10-20 might help to filter out some invalid results.
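As a rough illustration of this kind of filtering, here is a minimal sketch. It assumes that each recognized word comes with a confidence between 0 and 100 and that words at or above the minimum are kept. The function name, the sample data and the exact comparison are hypothetical and not taken from Textractor's source.

```python
from typing import List, Tuple


def filter_words(words: List[Tuple[str, float]], min_confidence: float) -> List[str]:
    """Keep only the words whose confidence is at least min_confidence."""
    return [text for text, confidence in words if confidence >= min_confidence]


# Hypothetical recognizer output: (word, confidence on a 0-100 scale)
results = [("Hello", 91.5), ("w0rld", 42.0), ("~;", 7.3)]

print(filter_words(results, 15))
```

With a low minimum only obvious noise is dropped. A high minimum would also remove words that are merely uncertain, which is why a modest value is suggested above.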

Preprocessing

These settings control the background normalization and Otsu thresholding used in preprocessing. If you change them, you can always restore the default values from the pulley menu.

Leptonica notes (other explanations are from Leptonica sources too):

Otsu binarization attempts to split the image into two roughly equal sets of pixels, and it does a very poor job when there are large amounts of dark background. By doing a background normalization first, to get the background near 255, we remove this problem. Then we use a modified Otsu to estimate the best global threshold on the normalized image.
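The effect described in the note can be reproduced with a toy example. The sketch below uses the plain textbook Otsu method, not Leptonica's modified version, together with a very simplified normalization that assumes a per-pixel background estimate is already available. The function names and the pixel values are made up for the illustration.

```python
from typing import List


def normalize_background(pixels: List[int], background: List[int]) -> List[int]:
    """Scale each pixel so that its local background estimate maps to 255."""
    return [min(255, round(p * 255 / max(b, 1)))
            for p, b in zip(pixels, background)]


def otsu_threshold(pixels: List[int]) -> int:
    """Return the threshold t that maximizes the between-class variance.

    Pixels <= t form the dark class, pixels > t the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class
    sum0 = 0    # sum of gray values in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


# Toy scanline: the left half is well lit, the right half lies in shadow.
pixels = [240, 60, 240, 240, 120, 30, 120, 120]
background = [240, 240, 240, 240, 120, 120, 120, 120]  # assumed estimate

normalized = normalize_background(pixels, background)
threshold = otsu_threshold(normalized)
is_text = [p <= threshold for p in normalized]
print(normalized, threshold, is_text)
```

In this toy data a global threshold computed on the raw pixels would put the shaded background in the same class as the text, whereas after normalization only the two dark text pixels end up below the threshold.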

Hints for better results

Taking a Good Picture

To get the best results you should follow a couple of simple guidelines when taking pictures:

Reasons for slow processing

Processing will be very slow and the results unreliable if the quality of the picture is poor. Some examples of poor-quality pictures:

About the PDF analysis

Running OCR on PDF files is mainly intended for files that have been created, for example, by a scanner. In such files the text is actually stored as an image and can't be copied without OCR. However, the feature also works for PDF files that have been created with a text editor.

Authors and Contributors

This application has been developed by @skvark. The icon and Jolla store header image have been created by @fercen.

Bug Reports and Feature Requests

Bugs and feature requests can be submitted through the repository's issue tracker.