Package net.sf.okapi.common
Interface ISegmenter
-
- All Known Implementing Classes:
SRXSegmenter
public interface ISegmenter
Common methods to provide segmentation facility to extracted content.
-
Method Summary
Modifier and Type, Method: Description
- int computeSegments(String text): Calculates the segmentation of a given plain text string.
- int computeSegments(TextContainer container): Calculates the segmentation of a given TextContainer object.
- LocaleId getLanguage(): Gets the language used to apply the rules.
- Range getNextSegmentRange(TextContainer container): Computes the range of the next segment for a given TextContainer object.
- List<Range> getRanges(): Gets the list of all segment ranges calculated when calling computeSegments(String) or computeSegments(TextContainer).
- List<Integer> getSplitPositions(): Gets the list of all the split positions in the text that was last segmented.
- boolean includeEndCodes(): Indicates if end codes should be included (see SRX implementation notes).
- boolean includeIsolatedCodes(): Indicates if isolated codes should be included (see SRX implementation notes).
- boolean includeStartCodes(): Indicates if start codes should be included (see SRX implementation notes).
- boolean oneSegmentIncludesAll(): Indicates whether, when a text contains a single segment, that segment should include the whole text (no trimming of spaces or codes on the left or right).
- void reset(): Resets the options to their defaults, and the compiled rules to nothing.
- boolean segmentSubFlows(): Indicates if sub-flows must be segmented.
- void setIncludeEndCodes(boolean includeEndCodes)
- void setIncludeIsolatedCodes(boolean includeIsolatedCodes)
- void setIncludeStartCodes(boolean includeStartCodes)
- void setLanguage(LocaleId locale): Sets the locale used to apply the rules.
- void setOneSegmentIncludesAll(boolean oneSegmentIncludesAll)
- void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS): Sets the options for this segmenter.
- void setSegmentSubFlows(boolean segmentSubFlows)
- void setTreatIsolatedCodesAsWhitespace(boolean treatIsolatedCodesAsWhitespace)
- void setTrimCodes(boolean trimCodes)
- void setTrimLeadingWS(boolean trimLeadingWS)
- void setTrimTrailingWS(boolean trimTrailingWS)
- boolean treatIsolatedCodesAsWhitespace(): Indicates if the segmenter should treat each isolated code as a single whitespace character (U+0020) when applying segmentation.
- boolean trimLeadingWhitespaces(): Indicates if leading white-spaces should be left outside the segments.
- boolean trimTrailingWhitespaces(): Indicates if trailing white-spaces should be left outside the segments.
-
Method Detail
-
computeSegments
int computeSegments(String text)
Calculates the segmentation of a given plain text string.
- Parameters:
text - plain text to segment.
- Returns:
- the number of segments calculated.
-
computeSegments
int computeSegments(TextContainer container)
Calculates the segmentation of a given TextContainer object. If the content is already segmented, it is un-segmented automatically before being processed.
- Parameters:
container - the object to segment.
- Returns:
- the number of segments calculated.
-
getNextSegmentRange
Range getNextSegmentRange(TextContainer container)
Computes the range of the next segment for a given TextContainer object. The next segment is searched from the first character after the last segment marker found in the container.
- Parameters:
container - the text container where to look for the next segment.
- Returns:
- a range corresponding to the start and end position of the found segment, or null if no more segments are found.
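The null return value lends itself to a simple loop. The following is a minimal sketch of that calling pattern only: it works on a plain String with an explicit offset instead of a TextContainer, and the class NextRangeSketch, its methods, and the "period followed by a space" break rule are all hypothetical stand-ins, not part of this interface.

```java
import java.util.ArrayList;
import java.util.List;

final class NextRangeSketch {
    /**
     * Hypothetical stand-in for getNextSegmentRange: returns {start, end}
     * (end exclusive) of the next sentence starting at 'from', or null
     * when nothing is left to segment.
     */
    static int[] nextRange(String text, int from) {
        if (from >= text.length()) {
            return null; // no more segments
        }
        for (int i = from; i + 1 < text.length(); i++) {
            if (text.charAt(i) == '.' && text.charAt(i + 1) == ' ') {
                return new int[] {from, i + 1}; // break right after the period
            }
        }
        return new int[] {from, text.length()}; // last segment runs to the end
    }

    /** Collects every segment by calling nextRange until it returns null. */
    static List<String> collect(String text) {
        List<String> segments = new ArrayList<>();
        int from = 0; // plays the role of "first character after the last segment marker"
        int[] range;
        while ((range = nextRange(text, from)) != null) {
            segments.add(text.substring(range[0], range[1]));
            from = range[1];
        }
        return segments;
    }
}
```

With a real implementation, the caller would presumably record each found segment in the container before asking for the next one, since the search starts after the last segment marker.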
-
getSplitPositions
List<Integer> getSplitPositions()
Gets the list of all the split positions in the text that was last segmented. You must call computeSegments(TextContainer) or computeSegments(String) before calling this method. A split position is the first character position of a new segment.
IMPORTANT: The positions returned here do NOT take into account the options for trimming leading and trailing white-spaces.
- Returns:
- A list of integers where each value is a split position in the coded text that was segmented.
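To make the IMPORTANT note concrete, here is a small self-contained sketch of how raw split positions could differ from trimmed segment boundaries. The class SplitPositionSketch and its toRanges method are hypothetical and are not part of Okapi; the trimming logic is an assumption about what "left outside the segments" means, not the library's actual algorithm.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitPositionSketch {
    /**
     * Turns raw split positions into {start, end} pairs (end exclusive),
     * optionally leaving leading/trailing white-space outside each segment.
     */
    static List<int[]> toRanges(String text, List<Integer> splits,
                                boolean trimLeadingWS, boolean trimTrailingWS) {
        List<int[]> ranges = new ArrayList<>();
        List<Integer> bounds = new ArrayList<>(splits);
        bounds.add(text.length()); // the last segment runs to the end of the text
        int start = 0;
        for (int bound : bounds) {
            int s = start;
            int e = bound;
            if (trimLeadingWS) {
                while (s < e && Character.isWhitespace(text.charAt(s))) {
                    s++;
                }
            }
            if (trimTrailingWS) {
                while (e > s && Character.isWhitespace(text.charAt(e - 1))) {
                    e--;
                }
            }
            if (s < e) {
                ranges.add(new int[] {s, e}); // skip pieces that are only white-space
            }
            start = bound; // the raw split position is unaffected by trimming
        }
        return ranges;
    }
}
```

For the text "One.  Two." with a single split position at 4, this sketch yields the pairs {0, 4} and {6, 10} when both trimming flags are set, and {0, 4} and {4, 10} when they are not. The split position itself stays 4 in both cases.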
-
getRanges
List<Range> getRanges()
Gets the list of all segment ranges calculated when calling computeSegments(String) or computeSegments(TextContainer).
- Returns:
- the list of all segment ranges. Each range is stored in a Range object, where start is the start of the range and end is its end. Returns null if no ranges have been defined yet.
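The contract described here (compute first, then read the ranges, with null before any computation) can be illustrated with a toy class. This is a sketch under stated assumptions: ToySegmenter and ToyRange are hypothetical stand-ins, not SRXSegmenter or Okapi's Range, and the break rule (a '.', '!' or '?' followed by a space) is invented for the example. Whether the real Range treats end as exclusive is not stated in this page; the sketch simply picks that convention.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a range: a start and an end offset. */
final class ToyRange {
    final int start; // inclusive
    final int end;   // exclusive (a choice made for this sketch)

    ToyRange(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

/** Toy segmenter: breaks after '.', '!' or '?' when followed by a space. */
final class ToySegmenter {
    private List<ToyRange> ranges; // stays null until computeSegments has run

    /** Computes the segments of a plain text string and returns their count. */
    int computeSegments(String text) {
        ranges = new ArrayList<>();
        int start = 0;
        for (int i = 0; i + 1 < text.length(); i++) {
            char c = text.charAt(i);
            boolean terminator = c == '.' || c == '!' || c == '?';
            if (terminator && text.charAt(i + 1) == ' ') {
                ranges.add(new ToyRange(start, i + 1)); // ends right after the terminator
                start = i + 1;                          // next segment starts at the space
            }
        }
        if (start < text.length()) {
            ranges.add(new ToyRange(start, text.length())); // trailing segment
        }
        return ranges.size();
    }

    /** Returns the ranges of the last computation, or null if none has run. */
    List<ToyRange> getRanges() {
        return ranges;
    }
}
```

Callers of such an API would typically guard against the null case before iterating over the list.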
-
getLanguage
LocaleId getLanguage()
Gets the language used to apply the rules.
- Returns:
- the language code used to apply the rules, or null, if none has been specified.
-
includeEndCodes
boolean includeEndCodes()
Indicates if end codes should be included (see SRX implementation notes).
- Returns:
- true if they should be included, false otherwise.
-
includeIsolatedCodes
boolean includeIsolatedCodes()
Indicates if isolated codes should be included (see SRX implementation notes).
- Returns:
- true if they should be included, false otherwise.
-
includeStartCodes
boolean includeStartCodes()
Indicates if start codes should be included (see SRX implementation notes).
- Returns:
- true if they should be included, false otherwise.
-
reset
void reset()
Resets the options to their defaults, and the compiled rules to nothing.
-
segmentSubFlows
boolean segmentSubFlows()
Indicates if sub-flows must be segmented.
- Returns:
- true if sub-flows must be segmented, false otherwise.
-
trimLeadingWhitespaces
boolean trimLeadingWhitespaces()
Indicates if leading white-spaces should be left outside the segments.
- Returns:
- true if the leading white-spaces should be trimmed.
-
trimTrailingWhitespaces
boolean trimTrailingWhitespaces()
Indicates if trailing white-spaces should be left outside the segments.
- Returns:
- true if the trailing white-spaces should be trimmed.
-
oneSegmentIncludesAll
boolean oneSegmentIncludesAll()
Indicates whether, when a text contains a single segment, that segment should include the whole text (no trimming of spaces or codes on the left or right).
- Returns:
- true if a text with a single segment should include the whole text.
-
treatIsolatedCodesAsWhitespace
boolean treatIsolatedCodesAsWhitespace()
Indicates if the segmenter should treat each isolated code as a single whitespace character (U+0020) when applying segmentation.
- Returns:
- true if the segmenter should treat isolated codes as whitespace.
-
setLanguage
void setLanguage(LocaleId locale)
Sets the locale used to apply the rules.
- Parameters:
locale - code of the language to use to apply the rules.
-
setIncludeEndCodes
void setIncludeEndCodes(boolean includeEndCodes)
-
setIncludeIsolatedCodes
void setIncludeIsolatedCodes(boolean includeIsolatedCodes)
-
setIncludeStartCodes
void setIncludeStartCodes(boolean includeStartCodes)
-
setOneSegmentIncludesAll
void setOneSegmentIncludesAll(boolean oneSegmentIncludesAll)
-
setOptions
void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS)
Sets the options for this segmenter.
- Parameters:
segmentSubFlows - true to segment sub-flows, false to not segment them.
includeStartCodes - true to include start codes just before a break in the 'left' segment, false to put them in the next segment.
includeEndCodes - true to include end codes just before a break in the 'left' segment, false to put them in the next segment.
includeIsolatedCodes - true to include isolated codes just before a break in the 'left' segment, false to put them in the next segment.
oneSegmentIncludesAll - true to include everything in a segment when it is the only segment in the text.
trimLeadingWS - true to trim leading white-spaces from the segments, false to keep them.
trimTrailingWS - true to trim trailing white-spaces from the segments, false to keep them.
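Seven positional booleans are easy to transpose at the call site. One possible way to guard against that is to name each flag once in a small holder and pass them in the documented order. This is a sketch only: OptionsTarget is a hypothetical one-method interface that copies the setOptions signature above so the example compiles without the library, and SegmenterOptionsSketch is likewise hypothetical. The field values carry no claim about the segmenter's actual defaults.

```java
/** Hypothetical subset of the interface: only the setOptions signature. */
interface OptionsTarget {
    void setOptions(boolean segmentSubFlows, boolean includeStartCodes,
                    boolean includeEndCodes, boolean includeIsolatedCodes,
                    boolean oneSegmentIncludesAll, boolean trimLeadingWS,
                    boolean trimTrailingWS);
}

/** Hypothetical holder that names each flag and applies them in order. */
final class SegmenterOptionsSketch {
    // All fields start as false; this is Java's default, not the library's.
    boolean segmentSubFlows;
    boolean includeStartCodes;
    boolean includeEndCodes;
    boolean includeIsolatedCodes;
    boolean oneSegmentIncludesAll;
    boolean trimLeadingWS;
    boolean trimTrailingWS;

    /** Passes the flags to the target in the order the signature declares. */
    void applyTo(OptionsTarget target) {
        target.setOptions(segmentSubFlows, includeStartCodes, includeEndCodes,
                includeIsolatedCodes, oneSegmentIncludesAll,
                trimLeadingWS, trimTrailingWS);
    }
}
```

With the real interface, the same idea would amount to calling segmenter.setOptions(...) with named local variables rather than bare true/false literals.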
-
setSegmentSubFlows
void setSegmentSubFlows(boolean segmentSubFlows)
-
setTrimCodes
void setTrimCodes(boolean trimCodes)
-
setTrimLeadingWS
void setTrimLeadingWS(boolean trimLeadingWS)
-
setTrimTrailingWS
void setTrimTrailingWS(boolean trimTrailingWS)
-
setTreatIsolatedCodesAsWhitespace
void setTreatIsolatedCodesAsWhitespace(boolean treatIsolatedCodesAsWhitespace)