-
public class TessBaseAPI
Java interface for the Tesseract OCR engine. Does not implement all available JNI methods, but does implement enough to be useful. Comments are adapted from original Tesseract source.
-
-
Nested Class Summary
public final class TessBaseAPI.PageSegMode: Page segmentation mode.
public @interface TessBaseAPI.OcrEngineMode
public final class TessBaseAPI.PageIteratorLevel: Elements of the page hierarchy, used in ResultIterator to provide functions that operate on each level without having to have 5x as many functions. NOTE: At present RIL_PARA and RIL_BLOCK are equivalent as there is no paragraph internally yet.
public interface TessBaseAPI.ProgressNotifier: Interface that may be implemented by the calling object in order to receive progress callbacks during OCR. Progress callbacks are available when getHOCRText is used.
public class TessBaseAPI.ProgressValues: Represents values indicating recognition progress and status.
-
Field Summary
public final static String VAR_CHAR_WHITELIST
public final static String VAR_CHAR_BLACKLIST
public final static String VAR_SAVE_BLOB_CHOICES
public final static String VAR_TRUE
public final static String VAR_FALSE
public final static int OEM_TESSERACT_ONLY
public final static int OEM_LSTM_ONLY
public final static int OEM_TESSERACT_LSTM_COMBINED
public final static int OEM_DEFAULT
-
Constructor Summary
TessBaseAPI(): Constructs an instance of TessBaseAPI.
TessBaseAPI(TessBaseAPI.ProgressNotifier progressNotifier): Constructs an instance of TessBaseAPI with a callback method for receiving progress updates during OCR.
-
Method Summary
boolean init(String datapath, String language): Initializes the Tesseract engine with a specified language model.
boolean init(String datapath, String language, int ocrEngineMode): Initializes the Tesseract engine with the specified language model(s).
boolean init(String datapath, String language, int ocrEngineMode, Map<String, String> config): Initializes the Tesseract engine with the specified language model(s).
String getInitLanguagesAsString(): Returns the languages string used in the last valid initialization. If the last initialization specified "deu+hin" then that will be returned.
void clear(): Frees up recognition results and any stored image data, without actually freeing any recognition data that would be time-consuming to reload. Afterwards, you must call SetImage or SetRectangle before doing any Recognize or Get* operation.
void recycle(): Closes down Tesseract and frees up all memory.
String getVariable(String var): Get the value of an internal "parameter" as a string, if it exists.
boolean setVariable(String var, String value): Set the value of an internal "parameter."
int getPageSegMode(): Return the current page segmentation mode.
void setPageSegMode(int mode): Sets the page segmentation mode.
void setDebug(boolean enabled): Sets debug mode.
void setRectangle(Rect rect): Restricts recognition to a sub-rectangle of the image.
void setRectangle(int left, int top, int width, int height): Restricts recognition to a sub-rectangle of the image.
void setImage(File file): Provides an image for Tesseract to recognize.
void setImage(Bitmap bmp): Provides an image for Tesseract to recognize.
void setImage(Pix image): Provides a Leptonica pix format image for Tesseract to recognize.
void setImage(Array<byte> imagedata, int width, int height, int bpp, int bpl): Provides an image for Tesseract to recognize.
String getUTF8Text(): The recognized text is returned as a String which is coded as UTF8.
String getConfidentText(int minConfidence, int level): Returns text in which items on the given iterator level (symbol, word, line, paragraph, block) whose confidence is lower than the given threshold are filtered out.
int meanConfidence(): Returns the (average) confidence value between 0 and 100.
Array<int> wordConfidences(): Returns all word confidences (between 0 and 100) in an array.
Pix getThresholdedImage(): Get a copy of the internal thresholded image from Tesseract.
Pixa getRegions(): Returns the result of page layout analysis as a Pixa, in reading order.
Pixa getTextlines(): Returns the textlines as a Pixa.
Pixa getStrips(): Get textlines and strips of image regions as a Pixa, in reading order.
Pixa getWords(): Get the words as a Pixa, in reading order.
Pixa getConnectedComponents(): Gets the individual connected (text) components (created after the page segmentation step, but before recognition) as a Pixa, in reading order.
ResultIterator getResultIterator(): Get a reading-order iterator to the results of LayoutAnalysis and/or Recognize.
String getHOCRText(int page): Make an HTML-formatted string with hOCR markup from the internal data structures.
void setInputName(String name): Set the name of the input file.
void setOutputName(String name): Set the name of the bonus output files.
void readConfigFile(String filename): Read a "config" file containing a set of variable, value pairs.
String getBoxText(int page): The recognized text is returned as coded in the same format as a UTF8 box file used in training.
String getVersion(): Returns the version identifier as a string.
String getLibraryFlavor(): Returns flavor of the library.
void stop(): Cancel recognition started by getHOCRText.
boolean beginDocument(TessPdfRenderer tessPdfRenderer, String title): Starts a new document.
boolean beginDocument(TessPdfRenderer tessPdfRenderer): Starts a new document with no title.
boolean endDocument(TessPdfRenderer tessPdfRenderer): Finishes the document and finalizes the output data. Invalid if beginDocument not yet called.
boolean addPageToDocument(Pix imageToProcess, String imageToWrite, TessPdfRenderer tessPdfRenderer): Adds the given data to the opened document (if any).
-
Constructor Detail
-
TessBaseAPI
TessBaseAPI()
Constructs an instance of TessBaseAPI.
-
TessBaseAPI
TessBaseAPI(TessBaseAPI.ProgressNotifier progressNotifier)
Constructs an instance of TessBaseAPI with a callback method for receiving progress updates during OCR.
- Parameters:
progressNotifier - Callback to receive progress notifications
-
-
Method Detail
-
init
boolean init(String datapath, String language)
Initializes the Tesseract engine with a specified language model. Returns true on success.
Instances are now mostly thread-safe and totally independent, but some global parameters remain. Basically it is safe to use multiple TessBaseAPIs in different threads in parallel, UNLESS you use SetVariable on some of the Params in classify and textord. If you do, then the effect will be to change it for all your instances.
The datapath must be the name of the parent directory of tessdata and must end in / . Any name after the last / will be stripped. The language is (usually) an ISO 639-3 string; null will default to eng.
It is entirely safe (and eventually will be efficient too) to call Init multiple times on the same instance to change language, or just to reset the classifier.
The language may be a string of the form [~]<lang>[+[~]<lang>]* indicating that multiple languages are to be loaded. E.g. hin+eng will load Hindi and English. Languages may specify internally that they want to be loaded with one or more other languages, so the ~ sign is available to override that. E.g. if hin were set to load eng by default, then hin+~eng would force loading only hin. The number of loaded languages is limited only by memory, with the caveat that loading additional languages will impact both speed and accuracy, as there is more work to do to decide on the applicable language, and there is more chance of hallucinating incorrect words.
WARNING: On changing languages, all Tesseract parameters are reset back to their default values (which may vary between languages).
- Parameters:
datapath - the parent directory of tessdata ending in a forward slash
language - an ISO 639-3 string representing the language(s)
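The datapath rule above (parent directory of tessdata, ending in a forward slash) is easy to get wrong, so a small pre-flight helper may be useful. The following is only a sketch: the class `TessDataPaths` and its methods are hypothetical and not part of this API, and the `<language>.traineddata` file name follows the usual Tesseract convention, which is assumed here rather than stated on this page.

```java
import java.io.File;

/** Hypothetical helper for preparing the datapath argument; not part of TessBaseAPI. */
final class TessDataPaths {
    private TessDataPaths() {}

    /** Appends a forward slash if the given parent directory does not already end in one. */
    static String withTrailingSlash(String parentDir) {
        return parentDir.endsWith("/") ? parentDir : parentDir + "/";
    }

    /**
     * Builds the location where the model for one language would be expected,
     * assuming the conventional layout datapath/tessdata/language.traineddata.
     */
    static File trainedDataFile(String datapath, String language) {
        return new File(withTrailingSlash(datapath) + "tessdata", language + ".traineddata");
    }
}
```

A caller could check `trainedDataFile(dir, "eng").exists()` before passing `withTrailingSlash(dir)` and `"eng"` to init, which may make a `false` return value easier to diagnose.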
-
init
boolean init(String datapath, String language, int ocrEngineMode)
Initializes the Tesseract engine with the specified language model(s). Returns true on success.
- Parameters:
datapath - the parent directory of tessdata ending in a forward slash
language - an ISO 639-3 string representing the language(s)
ocrEngineMode - the OCR engine mode to be set
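When several language models are involved, the language argument combines codes with + and marks codes that should not be auto-loaded with ~, as in hin+eng or hin+~eng. A minimal sketch of assembling such a string follows; the class `LanguageSpec` is hypothetical and only performs string concatenation, it does not validate the codes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical builder for the language argument; not part of TessBaseAPI. */
final class LanguageSpec {
    private LanguageSpec() {}

    /**
     * Joins the languages to load with "+", then appends each language to
     * suppress with a "~" prefix, e.g. load=[hin], suppress=[eng] gives "hin+~eng".
     */
    static String of(List<String> load, List<String> suppress) {
        List<String> parts = new ArrayList<>(load);
        for (String lang : suppress) {
            parts.add("~" + lang);
        }
        return String.join("+", parts);
    }
}
```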
-
init
boolean init(String datapath, String language, int ocrEngineMode, Map<String, String> config)
Initializes the Tesseract engine with the specified language model(s). Returns true on success.
- Parameters:
datapath - the parent directory of tessdata ending in a forward slash
language - an ISO 639-3 string representing the language(s)
ocrEngineMode - the OCR engine mode to be set
config - variables to be set at initialization; can be empty
-
getInitLanguagesAsString
String getInitLanguagesAsString()
Returns the languages string used in the last valid initialization. If the last initialization specified "deu+hin" then that will be returned. If hin loaded eng automatically as well, then that will not be included in this list.
-
clear
void clear()
Frees up recognition results and any stored image data, without actually freeing any recognition data that would be time-consuming to reload. Afterwards, you must call SetImage or SetRectangle before doing any Recognize or Get* operation.
-
recycle
void recycle()
Closes down Tesseract and frees up all memory. No other methods may be called on this instance afterwards.
-
getVariable
String getVariable(String var)
Get the value of an internal "parameter" as a string, if it exists.
Boolean variables are returned as "0" (false) or "1" (true).
- Parameters:
var - name of the variable
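Because boolean parameters come back as the strings "0" and "1", callers usually need a small conversion step. The sketch below shows one way to do it; `TessVariables` is a hypothetical helper, and treating a `null` input as "variable not found" is an assumption about how a missing variable is reported, not something this page confirms.

```java
/** Hypothetical helper for interpreting getVariable results; not part of TessBaseAPI. */
final class TessVariables {
    private TessVariables() {}

    /**
     * Maps "1" to TRUE and "0" to FALSE. A null input is passed through as null
     * (assumed here to mean the variable does not exist); anything else is rejected.
     */
    static Boolean parseBoolean(String raw) {
        if (raw == null) {
            return null;
        }
        switch (raw) {
            case "1":
                return Boolean.TRUE;
            case "0":
                return Boolean.FALSE;
            default:
                throw new IllegalArgumentException("Not a boolean parameter value: " + raw);
        }
    }
}
```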
-
setVariable
boolean setVariable(String var, String value)
Set the value of an internal "parameter."
Supply the name of the parameter and the value as a string, just as you would in a config file.
Returns false if the name lookup failed.
E.g. setVariable("tessedit_char_blacklist", "xyz"); to ignore x, y and z, or setVariable("classify_bln_numeric_mode", "1"); to set numeric-only mode.
Note: Must be called after init(). Only works for non-init variables.
- Parameters:
var - name of the variable
value - value to set
-
getPageSegMode
int getPageSegMode()
Return the current page segmentation mode.
-
setPageSegMode
void setPageSegMode(int mode)
Sets the page segmentation mode. Defaults to PSM_SINGLE_BLOCK. This controls how much processing the OCR engine will perform before recognizing text.
The mode can also be modified by readConfigFile or setVariable("tessedit_pageseg_mode", mode as string).
- Parameters:
mode - the PageSegMode to set
-
setDebug
void setDebug(boolean enabled)
Sets debug mode. This controls how much information is displayed in the log during recognition.
- Parameters:
enabled - true to enable debugging mode
-
setRectangle
void setRectangle(Rect rect)
Restricts recognition to a sub-rectangle of the image. Call after SetImage. Each SetRectangle clears the recognition results so multiple rectangles can be recognized with the same image.
- Parameters:
rect - the bounding rectangle
-
setRectangle
void setRectangle(int left, int top, int width, int height)
Restricts recognition to a sub-rectangle of the image. Call after SetImage. Each SetRectangle clears the recognition results so multiple rectangles can be recognized with the same image.
- Parameters:
left - the left bound
top - the top bound
width - the width of the bounding box
height - the height of the bounding box
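This overload takes a width and height rather than right and bottom edges, which is a common source of off-by-region mistakes when the region comes from edge coordinates. The following sketch converts edges into the four arguments and clamps them to the image; `RectangleArgs` is a hypothetical helper, and clamping is a design choice made here, not documented behaviour of setRectangle.

```java
/** Hypothetical helper for preparing setRectangle arguments; not part of TessBaseAPI. */
final class RectangleArgs {
    private RectangleArgs() {}

    /**
     * Converts edge coordinates into {left, top, width, height}, clamping every
     * edge to the image so the resulting box never extends past its bounds.
     */
    static int[] fromEdges(int left, int top, int right, int bottom,
                           int imageWidth, int imageHeight) {
        int l = Math.max(0, Math.min(left, imageWidth));
        int t = Math.max(0, Math.min(top, imageHeight));
        int r = Math.max(l, Math.min(right, imageWidth));
        int b = Math.max(t, Math.min(bottom, imageHeight));
        return new int[] {l, t, r - l, b - t};
    }
}
```

The returned array can then be spread into `setRectangle(a[0], a[1], a[2], a[3])`.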
-
setImage
@WorkerThread() void setImage(File file)
Provides an image for Tesseract to recognize. Copies the image buffer. The source image may be destroyed immediately after SetImage is called. SetImage clears all recognition results, and sets the rectangle to the full image, so it may be followed immediately by a GetUTF8Text, and it will automatically perform recognition.
- Parameters:
file - absolute path to the image file
-
setImage
@WorkerThread() void setImage(Bitmap bmp)
Provides an image for Tesseract to recognize. Copies the image buffer. The source image may be destroyed immediately after SetImage is called. SetImage clears all recognition results, and sets the rectangle to the full image, so it may be followed immediately by a GetUTF8Text, and it will automatically perform recognition.
- Parameters:
bmp - bitmap representation of the image
-
setImage
@WorkerThread() void setImage(Pix image)
Provides a Leptonica pix format image for Tesseract to recognize. Clones the pix object. The source image may be destroyed immediately after SetImage is called, but its contents may not be modified.
- Parameters:
image - Leptonica pix representation of the image
-
setImage
@WorkerThread() void setImage(Array<byte> imagedata, int width, int height, int bpp, int bpl)
Provides an image for Tesseract to recognize. Copies the image buffer. The source image may be destroyed immediately after SetImage is called. SetImage clears all recognition results, and sets the rectangle to the full image, so it may be followed immediately by a GetUTF8Text, and it will automatically perform recognition.
- Parameters:
imagedata - byte representation of the image
width - image width
height - image height
bpp - bytes per pixel
bpl - bytes per line
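For a raw buffer, bpl and bpp have to agree with the actual memory layout. The sketch below assumes a tightly packed buffer with no row padding, in which case bytes per line is simply width times bytes per pixel; `RawImageLayout` is a hypothetical helper, and buffers with padded rows would need the real stride instead.

```java
/** Hypothetical helper for checking raw image arguments; not part of TessBaseAPI. */
final class RawImageLayout {
    private RawImageLayout() {}

    /** Bytes per line for a tightly packed buffer (assumes no row padding). */
    static int bytesPerLine(int width, int bytesPerPixel) {
        return Math.multiplyExact(width, bytesPerPixel);
    }

    /** Returns true if the buffer is large enough to hold height rows of bpl bytes each. */
    static boolean fits(byte[] imagedata, int height, int bpl) {
        return imagedata.length >= (long) height * bpl;
    }
}
```

Running `fits` before calling setImage is a cheap way to catch a mismatched stride before the buffer reaches native code.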
-
getUTF8Text
@WorkerThread() String getUTF8Text()
The recognized text is returned as a String which is coded as UTF8. This is a blocking operation that cannot be interrupted by stop. To be able to interrupt a recognition task with stop, call getHOCRText before calling this function.
-
getConfidentText
@NonNull() String getConfidentText(int minConfidence, int level)
Returns text in which items on the given iterator level (symbol, word, line, paragraph, block) whose confidence is lower than the given threshold are filtered out.
Important: Recognition results must already be available when calling this method. This means getUTF8Text or getHOCRText needs to be called before this.
- Parameters:
minConfidence - Minimal confidence threshold (0-100) to accept the item.
level - Target level to iterate over and check the minimal confidence on.
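The filtering rule can be illustrated on plain arrays, independent of the engine. This is only a sketch of the semantics described above, not the library's implementation: `ConfidenceFilter` is hypothetical, and joining the surviving items with a single space is an assumption that suits word-level items but not necessarily other levels.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of confidence filtering; not part of TessBaseAPI. */
final class ConfidenceFilter {
    private ConfidenceFilter() {}

    /**
     * Keeps items whose confidence is at least minConfidence and drops the rest,
     * preserving the original order. Items are joined with a single space.
     */
    static String keepConfident(String[] items, int[] confidences, int minConfidence) {
        if (items.length != confidences.length) {
            throw new IllegalArgumentException("items and confidences differ in length");
        }
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < items.length; i++) {
            if (confidences[i] >= minConfidence) {
                kept.add(items[i]);
            }
        }
        return String.join(" ", kept);
    }
}
```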
-
meanConfidence
int meanConfidence()
Returns the (average) confidence value between 0 and 100.
-
wordConfidences
Array<int> wordConfidences()
Returns all word confidences (between 0 and 100) in an array.
The number of confidences should correspond to the number of space-delimited words in GetUTF8Text().
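Since the confidences line up with the space-delimited words of the recognized text, the two results can be paired up for display or post-processing. A minimal sketch follows; `WordConfidencePairs` is a hypothetical helper, it splits on any whitespace, and because the counts only "should" correspond it pairs up to the shorter of the two lengths rather than failing.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper pairing recognized words with confidences; not part of TessBaseAPI. */
final class WordConfidencePairs {
    private WordConfidencePairs() {}

    /** Pairs the i-th whitespace-delimited word of text with the i-th confidence. */
    static List<Map.Entry<String, Integer>> pair(String text, int[] confidences) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return pairs;
        }
        String[] words = trimmed.split("\\s+");
        // Guard against a count mismatch by stopping at the shorter length.
        int count = Math.min(words.length, confidences.length);
        for (int i = 0; i < count; i++) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(words[i], confidences[i]));
        }
        return pairs;
    }
}
```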
-
getThresholdedImage
Pix getThresholdedImage()
Get a copy of the internal thresholded image from Tesseract.
Caller takes ownership of the Pix and must recycle() it. May be called any time after setImage.
-
getRegions
Pixa getRegions()
Returns the result of page layout analysis as a Pixa, in reading order.
Can be called before or after Recognize.
-
getTextlines
Pixa getTextlines()
Returns the textlines as a Pixa. Textlines are extracted from the thresholded image.
Can be called before or after Recognize. Block IDs are not returned. Paragraph IDs are not returned.
-
getStrips
Pixa getStrips()
Get textlines and strips of image regions as a Pixa, in reading order.
Enables downstream handling of non-rectangular regions. Can be called before or after Recognize. Block IDs are not returned.
-
getWords
Pixa getWords()
Get the words as a Pixa, in reading order.
Can be called before or after Recognize.
-
getConnectedComponents
Pixa getConnectedComponents()
Gets the individual connected (text) components (created after the page segmentation step, but before recognition) as a Pixa, in reading order.
Can be called before or after Recognize. Note: the caller is responsible for calling recycle() on the returned Pixa.
-
getResultIterator
ResultIterator getResultIterator()
Get a reading-order iterator to the results of LayoutAnalysis and/or Recognize. The returned iterator must be deleted after use.
-
getHOCRText
@WorkerThread() String getHOCRText(int page)
Make an HTML-formatted string with hOCR markup from the internal data structures. Interruptible by stop.
- Parameters:
page - the page index; it is 0-based but will appear in the output as 1-based.
-
setInputName
void setInputName(String name)
Set the name of the input file. Needed for training and reading a UNLV zone file.
- Parameters:
name - input file name
-
setOutputName
void setOutputName(String name)
Set the name of the bonus output files. Needed only for debugging.
- Parameters:
name - output file name
-
readConfigFile
void readConfigFile(String filename)
Read a "config" file containing a set of variable, value pairs.
Searches the standard places: tessdata/configs, tessdata/tessconfigs. Note: only non-init params will be set.
- Parameters:
filename - the configuration filename, without the path
-
getBoxText
String getBoxText(int page)
The recognized text is returned as coded in the same format as a UTF8 box file used in training.
Constructs coordinates in the original image - not just the rectangle.
- Parameters:
page - a 0-based page index that will appear in the box file.
-
getVersion
String getVersion()
Returns the version identifier as a string.
-
getLibraryFlavor
String getLibraryFlavor()
Returns flavor of the library.
-
stop
void stop()
Cancel recognition started by getHOCRText.
-
beginDocument
boolean beginDocument(TessPdfRenderer tessPdfRenderer, String title)
Starts a new document. This clears the contents of the output data.
Caller is responsible for escaping the provided title.
- Parameters:
tessPdfRenderer - the renderer instance to use
title - a title to be used in the document metadata
-
beginDocument
boolean beginDocument(TessPdfRenderer tessPdfRenderer)
Starts a new document with no title.
- Parameters:
tessPdfRenderer - the renderer instance to use
-
endDocument
boolean endDocument(TessPdfRenderer tessPdfRenderer)
Finishes the document and finalizes the output data. Invalid if beginDocument has not yet been called.
- Parameters:
tessPdfRenderer - the renderer instance to use
-
addPageToDocument
boolean addPageToDocument(Pix imageToProcess, String imageToWrite, TessPdfRenderer tessPdfRenderer)
Adds the given data to the opened document (if any).
- Parameters:
imageToProcess - image to be used for OCR
imageToWrite - path to image to be written into resulting document
tessPdfRenderer - the renderer instance to use
-