public class HyphenationTree extends TernaryTree implements PatternConsumer
Nested classes inherited from class TernaryTree: TernaryTree.Iterator

| Modifier and Type | Field and Description |
|---|---|
| protected TernaryTree | classmap: This map stores the character classes |
| protected Map<String,List> | stoplist: This map stores hyphenation exceptions |
| protected ByteVector | vspace: value space: stores the interletter values |

Fields inherited from class TernaryTree: BLOCK_SIZE, eq, freenode, hi, kv, length, lo, root, sc

| Constructor and Description |
|---|
| HyphenationTree() |


| Modifier and Type | Method and Description |
|---|---|
| void | addClass(String chargroup): Add a character class to the tree. |
| void | addException(String word, ArrayList hyphenatedword): Add an exception to the tree. |
| void | addPattern(String pattern, String ivalue): Add a pattern to the tree. |
| String | findPattern(String pat) |
| protected byte[] | getValues(int k) |
| protected int | hstrcmp(char[] s, int si, char[] t, int ti): String compare, returns 0 if equal or t is a substring of s. |
| Hyphenation | hyphenate(char[] w, int offset, int len, int remainCharCount, int pushCharCount): Hyphenate word and return an array of hyphenation points. |
| Hyphenation | hyphenate(String word, int remainCharCount, int pushCharCount): Hyphenate word and return a Hyphenation object. |
| void | loadSimplePatterns(InputStream stream) |
| protected int | packValues(String values): Packs the values by storing them in 4 bits, two values into a byte. Values range from 0 to 9. |
| void | printStats() |
| protected void | searchPatterns(char[] word, int index, byte[] il): Search for all possible partial matches of word starting at index and update interletter values. |
| protected String | unpackValues(int k) |

protected ByteVector vspace
protected TernaryTree classmap
protected int packValues(String values)
Parameters:
values - a string of digits from '0' to '9' representing the interletter values.
protected String unpackValues(int k)
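As an illustration of the 4-bit packing that packValues describes, here is a minimal sketch that stores two digits per byte and reads them back. The class NibblePackingSketch and its pack and unpack helpers are hypothetical and are not part of HyphenationTree. The real methods exchange an int (packValues returns one, unpackValues takes one) whose meaning is not documented on this page, and the nibble layout actually used inside vspace may differ from the one assumed here.

```java
/** Hypothetical illustration only; not the HyphenationTree implementation. */
final class NibblePackingSketch {

    /** Packs digits '0'..'9' into bytes, two per byte, first digit in the high nibble (assumed layout). */
    static byte[] pack(String digits) {
        byte[] out = new byte[(digits.length() + 1) / 2];
        for (int i = 0; i < digits.length(); i++) {
            char c = digits.charAt(i);
            if (c < '0' || c > '9') {
                throw new IllegalArgumentException("not a digit: " + c);
            }
            int v = c - '0';
            if ((i & 1) == 0) {
                out[i / 2] = (byte) (v << 4);   // even position: high nibble
            } else {
                out[i / 2] |= (byte) v;         // odd position: low nibble
            }
        }
        return out;
    }

    /** Reads back the first 'count' digits from a packed array. */
    static String unpack(byte[] packed, int count) {
        StringBuilder sb = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            int b = packed[i / 2] & 0xFF;
            int v = ((i & 1) == 0) ? (b >>> 4) : (b & 0x0F);
            sb.append((char) ('0' + v));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] packed = pack("00300");
        System.out.println(packed.length + " bytes, round trip: " + unpack(packed, 5));
    }
}
```

Because the sketch passes the digit count to unpack explicitly, it needs no terminator. A real implementation that only receives an index would need some other way to find the end of the values.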
public void loadSimplePatterns(InputStream stream)
protected int hstrcmp(char[] s,
int si,
char[] t,
int ti)
Parameters:
s - The first String to compare
si - The index to start at on String s
t - The second String to compare
ti - The index to start at on String t
protected byte[] getValues(int k)
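The comparison that hstrcmp is documented to perform ("returns 0 if equal or t is a substring of s") can be sketched as follows. This is a hypothetical reading rather than the library's code. It assumes that both arrays are terminated by a zero char, as the searchPatterns documentation states for its word argument, and it interprets "substring" as "t matches s starting at si". The class name HstrcmpSketch is invented for the example.

```java
/** Hypothetical illustration only; not the HyphenationTree implementation. */
final class HstrcmpSketch {

    /**
     * Compares s from index si with t from index ti.
     * Both arrays are assumed to end with a 0 char; without it the loop would run off the array.
     * Returns 0 if the two are equal or if t ends while still matching s.
     */
    static int compare(char[] s, int si, char[] t, int ti) {
        while (s[si] == t[ti]) {
            if (s[si] == 0) {
                return 0;           // both strings ended together: equal
            }
            si++;
            ti++;
        }
        if (t[ti] == 0) {
            return 0;               // t is exhausted: t matches the start of s
        }
        return s[si] - t[ti];       // first differing characters decide the order
    }

    public static void main(String[] args) {
        char[] word = "hyphen\0".toCharArray();
        char[] pat = "hy\0".toCharArray();
        System.out.println(compare(word, 0, pat, 0));
    }
}
```

Note that the result is asymmetric under this reading: a short t that matches the start of s compares as 0, while swapping the arguments does not.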
protected void searchPatterns(char[] word,
int index,
byte[] il)
Search for all possible partial matches of word starting at index and update interletter values. In other words, it does something like:
    for (int i = 0; i < patterns.length; i++) {
        if (word.substring(index).startsWith(patterns[i])) {
            update_interletter_values(patterns[i]);
        }
    }
However, it does so efficiently because the patterns are stored in a ternary tree. In fact, this is the whole purpose of having the tree: doing this search without having to test every single pattern. The number of patterns for languages such as English ranges from 4000 to 10000, so doing thousands of string comparisons for each word to hyphenate would be really slow without the tree. The tradeoff is memory, but using a ternary tree instead of a trie almost halves the memory used by Lout or TeX. It is also faster than using a hash table.
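To make the idea concrete, the following minimal sketch shows how a ternary search tree can report every stored pattern that matches a word at a given index in a single walk, instead of testing each pattern in turn. The class PrefixMatchSketch is hypothetical. It borrows the names lo, eq and hi from the inherited TernaryTree fields, but how TernaryTree actually lays out its nodes is not shown on this page, and the sketch returns the matched patterns rather than updating interletter values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the TernaryTree or HyphenationTree implementation. */
final class PrefixMatchSketch {

    /** One tree node: lo/hi hold smaller/larger characters, eq continues the current key. */
    private static final class Node {
        final char ch;
        Node lo, eq, hi;
        boolean endsPattern;
        Node(char ch) { this.ch = ch; }
    }

    private Node root;

    /** Stores a non-empty pattern in the tree. */
    void insert(String pattern) {
        if (pattern.isEmpty()) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        root = insert(root, pattern, 0);
    }

    private Node insert(Node n, String key, int i) {
        char c = key.charAt(i);
        if (n == null) {
            n = new Node(c);
        }
        if (c < n.ch) {
            n.lo = insert(n.lo, key, i);
        } else if (c > n.ch) {
            n.hi = insert(n.hi, key, i);
        } else if (i + 1 < key.length()) {
            n.eq = insert(n.eq, key, i + 1);
        } else {
            n.endsPattern = true;
        }
        return n;
    }

    /** Returns every stored pattern that word.substring(index) starts with, shortest first. */
    List<String> matchesAt(String word, int index) {
        List<String> found = new ArrayList<>();
        Node n = root;
        int i = index;
        while (n != null && i < word.length()) {
            char c = word.charAt(i);
            if (c < n.ch) {
                n = n.lo;
            } else if (c > n.ch) {
                n = n.hi;
            } else {
                if (n.endsPattern) {
                    found.add(word.substring(index, i + 1));
                }
                n = n.eq;   // character consumed: continue with the next one
                i++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        PrefixMatchSketch tree = new PrefixMatchSketch();
        for (String p : new String[] {"he", "hen", "hena", "x"}) {
            tree.insert(p);
        }
        System.out.println(tree.matchesAt("hyphenation", 3));
    }
}
```

The walk visits at most one path through the tree, so its cost depends on the length of the remaining word and the branching along that path rather than on the total number of stored patterns.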
Parameters:
word - null terminated word to match
index - start index from word
il - interletter values array to update
public Hyphenation hyphenate(String word, int remainCharCount, int pushCharCount)
Parameters:
word - the word to be hyphenated
remainCharCount - Minimum number of characters allowed before the hyphenation point.
pushCharCount - Minimum number of characters allowed after the hyphenation point.
Returns:
a Hyphenation object representing the hyphenated word or null if word is not hyphenated.
public Hyphenation hyphenate(char[] w, int offset, int len, int remainCharCount, int pushCharCount)
Parameters:
w - char array that contains the word
offset - Offset to first character in word
len - Length of word
remainCharCount - Minimum number of characters allowed before the hyphenation point.
pushCharCount - Minimum number of characters allowed after the hyphenation point.
Returns:
a Hyphenation object representing the hyphenated word or null if word is not hyphenated.
public void addClass(String chargroup)
Add a character class to the tree. Used by SimplePatternParser as a callback to add character classes. Character classes define the valid word characters for hyphenation. If a word contains a character not defined in any of the classes, it is not hyphenated. A class also defines a way to normalize the characters in order to compare them with the stored patterns. Pattern files usually use only lower case characters; in this case a class for the letter 'a', for example, should be defined as "aA", the first character being the normalization char.
Specified by: addClass in interface PatternConsumer
Parameters:
chargroup - character group
public void addException(String word, ArrayList hyphenatedword)
Add an exception to the tree. Used by the SimplePatternParser class as a callback to store the hyphenation exceptions.
Specified by: addException in interface PatternConsumer
Parameters:
word - normalized word
hyphenatedword - a vector of alternating strings and hyphen objects.
public void addPattern(String pattern, String ivalue)
Add a pattern to the tree. Used by the SimplePatternParser class as a callback.
Specified by: addPattern in interface PatternConsumer
Parameters:
pattern - the hyphenation pattern
ivalue - interletter weight values indicating the desirability and priority of hyphenating at a given point within the pattern. It should contain only digit characters ('0' to '9').
public void printStats()
Overrides: printStats in class TernaryTree
Copyright © 2024. All rights reserved.