Package 

Class TessBaseAPI


  • 
    public class TessBaseAPI
    
                        

    Java interface for the Tesseract OCR engine. Does not implement all available JNI methods, but does implement enough to be useful. Comments are adapted from original Tesseract source.

    • Constructor Summary

      Constructors 
      Constructor Description
      TessBaseAPI() Constructs an instance of TessBaseAPI.
      TessBaseAPI(TessBaseAPI.ProgressNotifier progressNotifier) Constructs an instance of TessBaseAPI with a callback method for receiving progress updates during OCR.
    • Method Summary

      Modifier and Type Method Description
      boolean init(String datapath, String language) Initializes the Tesseract engine with a specified language model.
      boolean init(String datapath, String language, int ocrEngineMode) Initializes the Tesseract engine with the specified language model(s).
      boolean init(String datapath, String language, int ocrEngineMode, Map<String, String> config) Initializes the Tesseract engine with the specified language model(s).
      String getInitLanguagesAsString() Returns the languages string used in the last valid initialization. If the last initialization specified "deu+hin" then that will be returned.
      void clear() Frees up recognition results and any stored image data, without actually freeing any recognition data that would be time-consuming to reload. Afterwards, you must call SetImage or SetRectangle before doing any Recognize or Get* operation.
      void recycle() Closes down Tesseract and frees up all memory.
      String getVariable(String var) Get the value of an internal "parameter" as a string, if it exists.
      boolean setVariable(String var, String value) Set the value of an internal "parameter".
      int getPageSegMode() Return the current page segmentation mode.
      void setPageSegMode(int mode) Sets the page segmentation mode.
      void setDebug(boolean enabled) Sets debug mode.
      void setRectangle(Rect rect) Restricts recognition to a sub-rectangle of the image.
      void setRectangle(int left, int top, int width, int height) Restricts recognition to a sub-rectangle of the image.
      void setImage(File file) Provides an image for Tesseract to recognize.
      void setImage(Bitmap bmp) Provides an image for Tesseract to recognize.
      void setImage(Pix image) Provides a Leptonica pix format image for Tesseract to recognize.
      void setImage(Array<byte> imagedata, int width, int height, int bpp, int bpl) Provides an image for Tesseract to recognize.
      String getUTF8Text() The recognized text is returned as a String which is coded as UTF8.
      String getConfidentText(int minConfidence, int level) Returns text in which items at the given iterator level (symbol, word, line, paragraph, block) whose confidence is lower than the given threshold are filtered out.
      int meanConfidence() Returns the (average) confidence value between 0 and 100.
      Array<int> wordConfidences() Returns all word confidences (between 0 and 100) in an array.
      Pix getThresholdedImage() Get a copy of the internal thresholded image from Tesseract.
      Pixa getRegions() Returns the result of page layout analysis as a Pixa, in reading order.
      Pixa getTextlines() Returns the textlines as a Pixa.
      Pixa getStrips() Get textlines and strips of image regions as a Pixa, in reading order.
      Pixa getWords() Get the words as a Pixa, in reading order.
      Pixa getConnectedComponents() Gets the individual connected (text) components (created after the page segmentation step, but before recognition) as a Pixa, in reading order.
      ResultIterator getResultIterator() Get a reading-order iterator to the results of LayoutAnalysis and/or Recognize.
      String getHOCRText(int page) Make an HTML-formatted string with hOCR markup from the internal data structures.
      void setInputName(String name) Set the name of the input file.
      void setOutputName(String name) Set the name of the bonus output files.
      void readConfigFile(String filename) Read a "config" file containing a set of variable, value pairs.
      String getBoxText(int page) The recognized text is returned as coded in the same format as a UTF8 box file used in training.
      String getVersion() Returns the version identifier as a string.
      String getLibraryFlavor() Returns flavor of the library.
      void stop() Cancel recognition started by getHOCRText.
      boolean beginDocument(TessPdfRenderer tessPdfRenderer, String title) Starts a new document.
      boolean beginDocument(TessPdfRenderer tessPdfRenderer) Starts a new document with no title.
      boolean endDocument(TessPdfRenderer tessPdfRenderer) Finishes the document and finalizes the output data. Invalid if beginDocument not yet called.
      boolean addPageToDocument(Pix imageToProcess, String imageToWrite, TessPdfRenderer tessPdfRenderer) Adds the given data to the opened document (if any).
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • TessBaseAPI

        TessBaseAPI()
        Constructs an instance of TessBaseAPI.
      • TessBaseAPI

        TessBaseAPI(TessBaseAPI.ProgressNotifier progressNotifier)
        Constructs an instance of TessBaseAPI with a callback method for receiving progress updates during OCR.
        Parameters:
        progressNotifier - Callback to receive progress notifications
    • Method Detail

      • init

         boolean init(String datapath, String language)

        Initializes the Tesseract engine with a specified language model. Returns true on success.

        Instances are now mostly thread-safe and totally independent, but some global parameters remain. Basically it is safe to use multiple TessBaseAPIs in different threads in parallel, UNLESS you use SetVariable on some of the Params in classify and textord. If you do, then the effect will be to change it for all your instances.

        The datapath must be the name of the parent directory of tessdata and must end in / . Any name after the last / will be stripped. The language is (usually) an ISO 639-3 string; null defaults to eng. It is entirely safe (and eventually will be efficient too) to call Init multiple times on the same instance to change language, or just to reset the classifier.

        The language may be a string of the form [~]<lang>[+[~]<lang>]* indicating that multiple languages are to be loaded. E.g. hin+eng will load Hindi and English. Languages may specify internally that they want to be loaded with one or more other languages, so the ~ sign is available to override that. E.g. if hin were set to load eng by default, then hin+~eng would force loading only hin. The number of loaded languages is limited only by memory, with the caveat that loading additional languages will impact both speed and accuracy, as there is more work to do to decide on the applicable language, and there is more chance of hallucinating incorrect words.

        WARNING: On changing languages, all Tesseract parameters are reset back to their default values. (Which may vary between languages.)

        Parameters:
        datapath - the parent directory of tessdata, ending in a forward slash
        language - an ISO 639-3 string representing the language(s)
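
        The datapath and language rules described above can be captured in two small helpers. This is a minimal sketch, not part of the library: InitArgs, normalizeDatapath and languageSpec are hypothetical names introduced for illustration. Only the trailing-slash rule and the + / ~ syntax come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of TessBaseAPI.
final class InitArgs {
    private InitArgs() {}

    // Appends '/' when it is missing, so that the last directory name
    // is not treated as a trailing name to be stripped.
    static String normalizeDatapath(String parentOfTessdata) {
        return parentOfTessdata.endsWith("/") ? parentOfTessdata : parentOfTessdata + "/";
    }

    // Joins language codes with '+'. Codes in `suppressed` are added
    // with a '~' prefix, e.g. (["hin"], ["eng"]) gives "hin+~eng".
    static String languageSpec(List<String> load, List<String> suppressed) {
        List<String> parts = new ArrayList<>(load);
        for (String code : suppressed) {
            parts.add("~" + code);
        }
        return String.join("+", parts);
    }
}
```

        The two results would then be passed as the datapath and language arguments of init.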
      • init

         boolean init(String datapath, String language, int ocrEngineMode)

        Initializes the Tesseract engine with the specified language model(s). Returns true on success.

        Parameters:
        datapath - the parent directory of tessdata, ending in a forward slash
        language - an ISO 639-3 string representing the language(s)
        ocrEngineMode - the OCR engine mode to be set
      • init

         boolean init(String datapath, String language, int ocrEngineMode, Map<String, String> config)

        Initializes the Tesseract engine with the specified language model(s). Returns true on success.

        Parameters:
        datapath - the parent directory of tessdata, ending in a forward slash
        language - an ISO 639-3 string representing the language(s)
        ocrEngineMode - the OCR engine mode to be set
        config - variables to be set at initialization; can be empty
      • getInitLanguagesAsString

         String getInitLanguagesAsString()

        Returns the languages string used in the last valid initialization. If the last initialization specified "deu+hin" then that will be returned. If hin loaded eng automatically as well, then that will not be included in this list.

      • clear

         void clear()

        Frees up recognition results and any stored image data, without actually freeing any recognition data that would be time-consuming to reload. Afterwards, you must call SetImage or SetRectangle before doing any Recognize or Get* operation.

      • recycle

         void recycle()

        Closes down Tesseract and frees up all memory. No other methods may be used on this instance afterwards.

      • getVariable

         String getVariable(String var)

        Get the value of an internal "parameter" as a string, if it exists.

        Boolean variables are returned as "0" (false) or "1" (true).

        Parameters:
        var - name of the variable
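
        Because boolean parameters come back as the strings "0" and "1", callers usually want a small conversion step. The sketch below is illustrative only: ParamValues and asBoolean are hypothetical names, and it assumes (the text above does not state it) that a parameter which does not exist is reported as null.

```java
// Hypothetical helper for interpreting getVariable results;
// not part of TessBaseAPI.
final class ParamValues {
    private ParamValues() {}

    // Maps "1" to TRUE and "0" to FALSE. Returns null for a null input
    // (assumed here to mean "no such parameter") and for any other string.
    static Boolean asBoolean(String raw) {
        if ("1".equals(raw)) {
            return Boolean.TRUE;
        }
        if ("0".equals(raw)) {
            return Boolean.FALSE;
        }
        return null;
    }
}
```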
      • setVariable

         boolean setVariable(String var, String value)

        Set the value of an internal "parameter."

        Supply the name of the parameter and the value as a string, just as you would in a config file.

        Returns false if the name lookup failed.

        E.g. setVariable("tessedit_char_blacklist", "xyz"); to ignore x, y and z.

        Or setVariable("classify_bln_numeric_mode", "1"); to set numeric-only mode.

        Note: Must be called after init(). Only works for non-init variables.

        Parameters:
        var - name of the variable
        value - value to set
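
        Since setVariable reports a failed name lookup through its return value, it can be useful to apply several parameters in one pass and collect the names that were rejected. The following is a sketch under stated assumptions: VariableBatch and its nested Setter interface are hypothetical and exist only so the example is self-contained. A method reference such as api::setVariable has the matching shape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of TessBaseAPI.
final class VariableBatch {
    private VariableBatch() {}

    // Same shape as boolean setVariable(String var, String value).
    interface Setter {
        boolean set(String name, String value);
    }

    // Applies every name/value pair through `setter` and returns the
    // names for which the setter returned false.
    static List<String> applyAll(Map<String, String> vars, Setter setter) {
        List<String> rejected = new ArrayList<>();
        for (Map.Entry<String, String> entry : vars.entrySet()) {
            if (!setter.set(entry.getKey(), entry.getValue())) {
                rejected.add(entry.getKey());
            }
        }
        return rejected;
    }
}
```

        A non-empty result points to misspelled names or to parameters that cannot be set this way.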
      • getPageSegMode

         int getPageSegMode()

        Return the current page segmentation mode.

      • setPageSegMode

         void setPageSegMode(int mode)

        Sets the page segmentation mode. Defaults to PSM_SINGLE_BLOCK. This controls how much processing the OCR engine will perform before recognizing text.

        The mode can also be modified by readConfigFile or setVariable("tessedit_pageseg_mode", mode as string).

        Parameters:
        mode - the PageSegMode to set
      • setDebug

         void setDebug(boolean enabled)

        Sets debug mode. This controls how much information is displayed in the log during recognition.

        Parameters:
        enabled - true to enable debugging mode
      • setRectangle

         void setRectangle(Rect rect)

        Restricts recognition to a sub-rectangle of the image. Call after SetImage. Each SetRectangle clears the recognition results so multiple rectangles can be recognized with the same image.

        Parameters:
        rect - the bounding rectangle
      • setRectangle

         void setRectangle(int left, int top, int width, int height)

        Restricts recognition to a sub-rectangle of the image. Call after SetImage. Each SetRectangle clears the recognition results so multiple rectangles can be recognized with the same image.

        Parameters:
        left - the left bound
        top - the top bound
        width - the width of the bounding box
        height - the height of the bounding box
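
        This overload takes a width and a height, not right and bottom edges, which is easy to get wrong when the region comes from edge coordinates. The sketch below converts edges to left/top/width/height. Region and fromEdges are hypothetical names, and clamping to the image bounds is a design choice of this example, not documented behaviour of setRectangle.

```java
// Hypothetical value type, not part of TessBaseAPI.
final class Region {
    final int left;
    final int top;
    final int width;
    final int height;

    Region(int left, int top, int width, int height) {
        this.left = left;
        this.top = top;
        this.width = width;
        this.height = height;
    }

    // Converts edge coordinates (right and bottom exclusive) into
    // left/top/width/height, clamped to the image bounds.
    static Region fromEdges(int left, int top, int right, int bottom,
                            int imageWidth, int imageHeight) {
        int l = Math.max(0, Math.min(left, imageWidth));
        int t = Math.max(0, Math.min(top, imageHeight));
        int r = Math.max(l, Math.min(right, imageWidth));
        int b = Math.max(t, Math.min(bottom, imageHeight));
        return new Region(l, t, r - l, b - t);
    }
}
```

        The four fields map directly onto the left, top, width and height arguments.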
      • setImage

        @WorkerThread() void setImage(File file)

        Provides an image for Tesseract to recognize. Copies the image buffer. The source image may be destroyed immediately after SetImage is called. SetImage clears all recognition results, and sets the rectangle to the full image, so it may be followed immediately by a GetUTF8Text, and it will automatically perform recognition.

        Parameters:
        file - absolute path to the image file
      • setImage

        @WorkerThread() void setImage(Bitmap bmp)

        Provides an image for Tesseract to recognize. Copies the image buffer. The source image may be destroyed immediately after SetImage is called. SetImage clears all recognition results, and sets the rectangle to the full image, so it may be followed immediately by a GetUTF8Text, and it will automatically perform recognition.

        Parameters:
        bmp - bitmap representation of the image
      • setImage

        @WorkerThread() void setImage(Pix image)

        Provides a Leptonica pix format image for Tesseract to recognize. Clones the pix object. The source image may be destroyed immediately after SetImage is called, but its contents may not be modified.

        Parameters:
        image - Leptonica pix representation of the image
      • setImage

        @WorkerThread() void setImage(Array<byte> imagedata, int width, int height, int bpp, int bpl)

        Provides an image for Tesseract to recognize. Copies the image buffer. The source image may be destroyed immediately after SetImage is called. SetImage clears all recognition results, and sets the rectangle to the full image, so it may be followed immediately by a GetUTF8Text, and it will automatically perform recognition.

        Parameters:
        imagedata - byte representation of the image
        width - image width
        height - image height
        bpp - bytes per pixel
        bpl - bytes per line
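
        The relationship between width, bpp and bpl can be checked before handing a raw buffer to this overload. This is a minimal sketch: RawImage and its methods are hypothetical, and the bpl formula assumes rows are stored back to back with no padding, which the text above does not guarantee for every image source.

```java
// Hypothetical helper, not part of TessBaseAPI.
final class RawImage {
    private RawImage() {}

    // Bytes per line for tightly packed rows (no row padding assumed).
    // Throws ArithmeticException if the product overflows an int.
    static int packedBytesPerLine(int width, int bpp) {
        return Math.multiplyExact(width, bpp);
    }

    // True when the buffer holds at least `height` rows of `bpl` bytes.
    static boolean fits(byte[] imagedata, int height, int bpl) {
        if (height < 0 || bpl < 0) {
            return false;
        }
        return (long) height * bpl <= imagedata.length;
    }
}
```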
      • getUTF8Text

        @WorkerThread() String getUTF8Text()

        The recognized text is returned as a String which is coded as UTF8. This is a blocking operation that cannot be interrupted with stop. To interrupt a recognition task with stop, call getHOCRText before calling this function.

      • getConfidentText

        @NonNull() String getConfidentText(int minConfidence, int level)

        Returns text in which items at the given iterator level (symbol, word, line, paragraph, block) whose confidence is lower than the given threshold are filtered out.

        Important: Recognition results must already be available when calling this method. This means getUTF8Text or getHOCRText needs to be called before this.

        Parameters:
        minConfidence - Minimal confidence threshold (0-100) to accept the item.
        level - Target level to iterate over and check the minimal confidence on.
      • meanConfidence

         int meanConfidence()

        Returns the (average) confidence value between 0 and 100.

      • wordConfidences

         Array<int> wordConfidences()

        Returns all word confidences (between 0 and 100) in an array.

        The number of confidences should correspond to the number of space-delimited words in GetUTF8Text().
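
        The correspondence between words and confidences can be used to pair the two results. The sketch below is an illustration, not library code: WordFilter and confidentWords are hypothetical names. It splits on any whitespace (an assumption, since recognized text typically contains line breaks as well as spaces) and only pairs as many entries as both inputs provide, because the counts "should" match but are not guaranteed to.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of TessBaseAPI.
final class WordFilter {
    private WordFilter() {}

    // Pairs the i-th whitespace-delimited word with the i-th confidence
    // and keeps the words whose confidence is at least minConfidence.
    static List<String> confidentWords(String text, int[] confidences, int minConfidence) {
        String trimmed = text.trim();
        String[] words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        int count = Math.min(words.length, confidences.length);
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            if (confidences[i] >= minConfidence) {
                kept.add(words[i]);
            }
        }
        return kept;
    }
}
```

        The text and the array would come from getUTF8Text and wordConfidences on the same recognition result.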

      • getThresholdedImage

         Pix getThresholdedImage()

        Get a copy of the internal thresholded image from Tesseract.

        Caller takes ownership of the Pix and must recycle() it. May be called any time after setImage.

      • getRegions

         Pixa getRegions()

        Returns the result of page layout analysis as a Pixa, in reading order.

        Can be called before or after Recognize.

      • getTextlines

         Pixa getTextlines()

        Returns the textlines as a Pixa. Textlines are extracted from the thresholded image.

        Can be called before or after Recognize. Block IDs are not returned. Paragraph IDs are not returned.

      • getStrips

         Pixa getStrips()

        Get textlines and strips of image regions as a Pixa, in reading order.

        Enables downstream handling of non-rectangular regions. Can be called before or after Recognize. Block IDs are not returned.

      • getWords

         Pixa getWords()

        Get the words as a Pixa, in reading order.

        Can be called before or after Recognize.

      • getConnectedComponents

         Pixa getConnectedComponents()

        Gets the individual connected (text) components (created after the page segmentation step, but before recognition) as a Pixa, in reading order.

        Can be called before or after Recognize. Note: the caller is responsible for calling recycle() on the returned Pixa.

      • getResultIterator

         ResultIterator getResultIterator()

        Get a reading-order iterator to the results of LayoutAnalysis and/or Recognize. The returned iterator must be deleted after use.

      • getHOCRText

        @WorkerThread() String getHOCRText(int page)

        Make an HTML-formatted string with hOCR markup from the internal data structures. Interruptible by stop.

        Parameters:
        page - the page index; 0-based, but it will appear in the output as 1-based.
      • setInputName

         void setInputName(String name)

        Set the name of the input file. Needed for training and reading a UNLV zone file.

        Parameters:
        name - input file name
      • setOutputName

         void setOutputName(String name)

        Set the name of the bonus output files. Needed only for debugging.

        Parameters:
        name - output file name
      • readConfigFile

         void readConfigFile(String filename)

        Read a "config" file containing a set of variable, value pairs.

        Searches the standard places: tessdata/configs, tessdata/tessconfigs. Note: only non-init params will be set.

        Parameters:
        filename - the configuration filename, without the path
      • getBoxText

         String getBoxText(int page)

        The recognized text is returned as coded in the same format as a UTF8 box file used in training.

        Constructs coordinates in the original image - not just the rectangle.

        Parameters:
        page - a 0-based page index that will appear in the box file.
      • beginDocument

         boolean beginDocument(TessPdfRenderer tessPdfRenderer, String title)

        Starts a new document. This clears the contents of the output data.

        Caller is responsible for escaping the provided title.

        Parameters:
        tessPdfRenderer - the renderer instance to use
        title - a title to be used in the document metadata
      • beginDocument

         boolean beginDocument(TessPdfRenderer tessPdfRenderer)

        Starts a new document with no title.

        Parameters:
        tessPdfRenderer - the renderer instance to use
      • endDocument

         boolean endDocument(TessPdfRenderer tessPdfRenderer)

        Finishes the document and finalizes the output data. Invalid if beginDocument not yet called.

        Parameters:
        tessPdfRenderer - the renderer instance to use
      • addPageToDocument

         boolean addPageToDocument(Pix imageToProcess, String imageToWrite, TessPdfRenderer tessPdfRenderer)

        Adds the given data to the opened document (if any).

        Parameters:
        imageToProcess - image to be used for OCR
        imageToWrite - path to image to be written into resulting document
        tessPdfRenderer - the renderer instance to use
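
        The document methods are meant to be called in a fixed order: beginDocument, then addPageToDocument once per page, then endDocument. The sketch below shows only that ordering. PageSink and PdfJob are hypothetical stand-ins written so the example is self-contained. In real code the three PageSink methods would delegate to beginDocument, addPageToDocument and endDocument with a TessPdfRenderer, and each page would also need its Pix image.

```java
import java.util.List;

// Hypothetical abstraction over the three document calls;
// not part of TessBaseAPI.
interface PageSink {
    boolean begin(String title);
    boolean addPage(String imagePath);
    boolean end();
}

// Hypothetical driver that enforces the begin / add / end ordering.
final class PdfJob {
    private PdfJob() {}

    // Returns the number of pages added, or -1 if the document could
    // not be started. end() runs only after a successful begin().
    static int writeAll(PageSink sink, String title, List<String> imagePaths) {
        if (!sink.begin(title)) {
            return -1;
        }
        int added = 0;
        try {
            for (String path : imagePaths) {
                if (sink.addPage(path)) {
                    added++;
                }
            }
        } finally {
            sink.end();
        }
        return added;
    }
}
```

        Putting end() in a finally block keeps the document from being left open if adding a page throws.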