Package ai.djl.huggingface.tokenizers
Class HuggingFaceTokenizer
java.lang.Object
ai.djl.util.NativeResource<Long>
ai.djl.huggingface.tokenizers.HuggingFaceTokenizer
- All Implemented Interfaces:
ai.djl.modality.nlp.preprocess.TextProcessor, ai.djl.modality.nlp.preprocess.Tokenizer, AutoCloseable
public final class HuggingFaceTokenizer
extends ai.djl.util.NativeResource<Long>
implements ai.djl.modality.nlp.preprocess.Tokenizer
HuggingFaceTokenizer is a Huggingface tokenizer implementation of the Tokenizer
interface that converts sentences into tokens.
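As a quick orientation, a minimal round trip with this class might look like the following sketch. The model name "bert-base-uncased" is only an illustrative choice; any tokenizer available on the Huggingface hub works, and loading it requires the DJL tokenizers dependency and network access.

```java
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class QuickStart {
    public static void main(String[] args) {
        // HuggingFaceTokenizer wraps a native handle and implements AutoCloseable,
        // so it should always be used in try-with-resources or closed explicitly.
        try (HuggingFaceTokenizer tokenizer =
                HuggingFaceTokenizer.newInstance("bert-base-uncased")) {
            Encoding encoding = tokenizer.encode("Hello, world!");
            long[] ids = encoding.getIds();          // token ids of the encoded sentence
            String decoded = tokenizer.decode(ids);  // map the ids back to text
            System.out.println(decoded);
        }
    }
}
```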
Nested Class Summary
static final class HuggingFaceTokenizer.Builder
The builder for creating a Huggingface tokenizer. -
Field Summary
Fields inherited from class ai.djl.util.NativeResource
handle -
Method Summary
String[] batchDecode(long[][] batchIds)
Returns the decoded Strings from the input batch ids.
String[] batchDecode(long[][] batchIds, boolean skipSpecialTokens)
Returns the decoded Strings from the input batch ids.
Encoding[] batchEncode(ai.djl.util.PairList<String, String> inputs)
Returns the Encoding of the input text pairs in batch.
Encoding[] batchEncode(ai.djl.util.PairList<String, String> inputs, boolean addSpecialTokens, boolean withOverflowingTokens)
Returns the Encoding of the input text pairs in batch.
Encoding[] batchEncode(String[] inputs)
Returns the Encoding of the input sentences in batch.
Encoding[] batchEncode(String[] inputs, boolean addSpecialTokens, boolean withOverflowingTokens)
Returns the Encoding of the input sentences in batch.
Encoding[] batchEncode(List<String> inputs)
Returns the Encoding of the input sentences in batch.
Encoding[] batchEncode(List<String> inputs, boolean addSpecialTokens, boolean withOverflowingTokens)
Returns the Encoding of the input sentences in batch.
static HuggingFaceTokenizer.Builder builder()
Creates a builder to build a HuggingFaceTokenizer.
static HuggingFaceTokenizer.Builder builder(Map<String, ?> arguments)
Creates a builder to build a HuggingFaceTokenizer.
String buildSentence(List<String> tokens)
void close()
String decode(long[] ids)
Returns the decoded String from the input ids.
String decode(long[] ids, boolean skipSpecialTokens)
Returns the decoded String from the input ids.
Encoding encode(String text)
Returns the Encoding of the input sentence.
Encoding encode(String text, boolean addSpecialTokens, boolean withOverflowingTokens)
Returns the Encoding of the input sentence.
Encoding encode(String text, String textPair)
Returns the Encoding of the input sentence pair.
Encoding encode(String text, String textPair, boolean addSpecialTokens, boolean withOverflowingTokens)
Returns the Encoding of the input sentence pair.
Encoding encode(List<String> inputs)
Returns the Encoding of the input sentences.
Encoding encode(List<String> inputs, boolean addSpecialTokens, boolean withOverflowingTokens)
Returns the Encoding of the input sentences.
Encoding encode(String[] inputs)
Returns the Encoding of the input sentences.
Encoding encode(String[] inputs, boolean addSpecialTokens, boolean withOverflowingTokens)
Returns the Encoding of the input sentences.
protected void finalize()
int getMaxLength()
Returns the max token length.
getPadding()
Returns the padding policy.
int getPadToMultipleOf()
Returns the padToMultipleOf for padding.
int getStride()
Returns the stride to use in overflow overlap when truncating sequences longer than the model supports.
getTruncation()
Returns the truncation policy.
getVersion()
Returns the version of the Huggingface tokenizer.
static HuggingFaceTokenizer newInstance(InputStream is, Map<String, String> options)
Creates a pre-trained HuggingFaceTokenizer instance from an InputStream.
static HuggingFaceTokenizer newInstance(String name)
Creates a pre-trained HuggingFaceTokenizer instance from the Huggingface hub.
static HuggingFaceTokenizer newInstance(String identifier, Map<String, String> options)
Creates a pre-trained HuggingFaceTokenizer instance from the Huggingface hub.
static HuggingFaceTokenizer newInstance(Path modelPath)
Creates a pre-trained HuggingFaceTokenizer instance from existing models.
static HuggingFaceTokenizer newInstance(Path vocab, Path merges, Map<String, String> options)
Creates a pre-trained BPE HuggingFaceTokenizer instance from existing models.
static HuggingFaceTokenizer newInstance(Path modelPath, Map<String, String> options)
Creates a pre-trained HuggingFaceTokenizer instance from existing models.
List<String> tokenize(String sentence)
Methods inherited from class ai.djl.util.NativeResource
getHandle, getUid, isReleased, onClose
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface ai.djl.modality.nlp.preprocess.Tokenizer
preprocess
-
Method Details
-
newInstance
public static HuggingFaceTokenizer newInstance(String name)
Creates a pre-trained HuggingFaceTokenizer instance from the Huggingface hub.
Parameters:
name - the name of the Huggingface tokenizer
Returns:
a HuggingFaceTokenizer instance
-
newInstance
public static HuggingFaceTokenizer newInstance(String identifier, Map<String, String> options)
Creates a pre-trained HuggingFaceTokenizer instance from the Huggingface hub.
Parameters:
identifier - the identifier of the Huggingface tokenizer
options - tokenizer options
Returns:
a HuggingFaceTokenizer instance
-
newInstance
public static HuggingFaceTokenizer newInstance(Path modelPath) throws IOException
Creates a pre-trained HuggingFaceTokenizer instance from existing models.
Parameters:
modelPath - the directory or file path of the model location
Returns:
a HuggingFaceTokenizer instance
Throws:
IOException - when an IO operation fails in loading a resource
-
newInstance
public static HuggingFaceTokenizer newInstance(Path modelPath, Map<String, String> options) throws IOException
Creates a pre-trained HuggingFaceTokenizer instance from existing models.
Parameters:
modelPath - the directory or file path of the model location
options - tokenizer options
Returns:
a HuggingFaceTokenizer instance
Throws:
IOException - when an IO operation fails in loading a resource
-
newInstance
public static HuggingFaceTokenizer newInstance(Path vocab, Path merges, Map<String, String> options) throws IOException
Creates a pre-trained BPE HuggingFaceTokenizer instance from existing models.
Parameters:
vocab - the BPE vocabulary file
merges - the BPE merges file
options - tokenizer options
Returns:
a HuggingFaceTokenizer instance
Throws:
IOException - when an IO operation fails in loading a resource
-
newInstance
public static HuggingFaceTokenizer newInstance(InputStream is, Map<String, String> options) throws IOException
Creates a pre-trained HuggingFaceTokenizer instance from an InputStream.
Parameters:
is - the InputStream of the tokenizer
options - tokenizer options
Returns:
a HuggingFaceTokenizer instance
Throws:
IOException - when an IO operation fails in loading a resource
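The newInstance overloads above can be exercised as in this sketch. The hub identifier, local path, and option values are illustrative assumptions, not bundled resources; the option keys shown ("maxLength", "truncation") are assumed to be supported tokenizer options.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class LoadingExamples {
    public static void main(String[] args) throws IOException {
        // From the Huggingface hub, with tokenizer options.
        Map<String, String> options = Map.of("maxLength", "512", "truncation", "true");
        try (HuggingFaceTokenizer fromHub =
                HuggingFaceTokenizer.newInstance("bert-base-uncased", options)) {
            // use fromHub ...
        }

        // From a local model directory (hypothetical path); this overload
        // declares IOException for failed resource loading.
        Path modelPath = Paths.get("/opt/models/my-tokenizer");
        try (HuggingFaceTokenizer fromPath = HuggingFaceTokenizer.newInstance(modelPath)) {
            // use fromPath ...
        }
    }
}
```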
-
getVersion
Returns the version of the Huggingface tokenizer.
Returns:
the version number of the Huggingface tokenizer
-
tokenize
- Specified by:
tokenize in interface ai.djl.modality.nlp.preprocess.Tokenizer
-
buildSentence
- Specified by:
buildSentence in interface ai.djl.modality.nlp.preprocess.Tokenizer
-
close
public void close()
Specified by:
close in interface AutoCloseable
Overrides:
close in class ai.djl.util.NativeResource<Long>
-
encode
public Encoding encode(String text, boolean addSpecialTokens, boolean withOverflowingTokens)
Returns the Encoding of the input sentence.
Parameters:
text - the input sentence
addSpecialTokens - whether to encode the sequence with special tokens relative to the model
withOverflowingTokens - whether to return overflowing tokens
Returns:
the Encoding of the input sentence
-
encode
public Encoding encode(String text)
Returns the Encoding of the input sentence.
Parameters:
text - the input sentence
Returns:
the Encoding of the input sentence
-
encode
public Encoding encode(String text, String textPair, boolean addSpecialTokens, boolean withOverflowingTokens)
Returns the Encoding of the input sentence pair.
Parameters:
text - the input sentence
textPair - the second input sentence
addSpecialTokens - whether to encode the sequence with special tokens relative to the model
withOverflowingTokens - whether to return overflowing tokens
Returns:
the Encoding of the input sentence pair
-
encode
public Encoding encode(String text, String textPair)
Returns the Encoding of the input sentence pair.
Parameters:
text - the input sentence
textPair - the second input sentence
Returns:
the Encoding of the input sentence pair
-
encode
public Encoding encode(List<String> inputs, boolean addSpecialTokens, boolean withOverflowingTokens)
Returns the Encoding of the input sentences.
Parameters:
inputs - the input sentences
addSpecialTokens - whether to encode the sequence with special tokens relative to the model
withOverflowingTokens - whether to return overflowing tokens
Returns:
the Encoding of the input sentences
-
encode
public Encoding encode(List<String> inputs)
Returns the Encoding of the input sentences.
Parameters:
inputs - the input sentences
Returns:
the Encoding of the input sentences
-
encode
public Encoding encode(String[] inputs, boolean addSpecialTokens, boolean withOverflowingTokens)
Returns the Encoding of the input sentences.
Parameters:
inputs - the input sentences
addSpecialTokens - whether to encode the sequence with special tokens relative to the model
withOverflowingTokens - whether to return overflowing tokens
Returns:
the Encoding of the input sentences
-
encode
public Encoding encode(String[] inputs)
Returns the Encoding of the input sentences.
Parameters:
inputs - the input sentences
Returns:
the Encoding of the input sentences
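The encode overloads above can be sketched as follows. The model name is illustrative, and the Encoding accessors used here (getTokens, getTypeIds) are assumed to exist on the returned Encoding; they are not documented on this page.

```java
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class EncodeExamples {
    public static void main(String[] args) {
        try (HuggingFaceTokenizer tokenizer =
                HuggingFaceTokenizer.newInstance("bert-base-uncased")) {
            // Single sentence with default encoding behavior.
            Encoding single = tokenizer.encode("DJL is easy to use.");
            System.out.println(single.getTokens().length);

            // Sentence pair (e.g. question + context), explicitly disabling
            // special tokens and overflowing tokens via the boolean flags.
            Encoding pair = tokenizer.encode(
                    "What is DJL?", "DJL is a deep learning library.",
                    false, false);
            long[] typeIds = pair.getTypeIds(); // segment ids for the two sentences
            System.out.println(typeIds.length);
        }
    }
}
```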
-
batchEncode
public Encoding[] batchEncode(List<String> inputs, boolean addSpecialTokens, boolean withOverflowingTokens)
Returns the Encoding of the input sentences in batch.
Parameters:
inputs - the batch of input sentences
addSpecialTokens - whether to encode the sequence with special tokens relative to the model
withOverflowingTokens - whether to return overflowing tokens
Returns:
the Encoding of the input sentences in batch
-
batchEncode
public Encoding[] batchEncode(List<String> inputs)
Returns the Encoding of the input sentences in batch.
Parameters:
inputs - the batch of input sentences
Returns:
the Encoding of the input sentences in batch
-
batchEncode
public Encoding[] batchEncode(String[] inputs, boolean addSpecialTokens, boolean withOverflowingTokens)
Returns the Encoding of the input sentences in batch.
Parameters:
inputs - the batch of input sentences
addSpecialTokens - whether to encode the sequence with special tokens relative to the model
withOverflowingTokens - whether to return overflowing tokens
Returns:
the Encoding of the input sentences in batch
-
batchEncode
public Encoding[] batchEncode(String[] inputs)
Returns the Encoding of the input sentences in batch.
Parameters:
inputs - the batch of input sentences
Returns:
the Encoding of the input sentences in batch
-
batchEncode
public Encoding[] batchEncode(ai.djl.util.PairList<String, String> inputs, boolean addSpecialTokens, boolean withOverflowingTokens)
Returns the Encoding of the input text pairs in batch.
Parameters:
inputs - the batch of input text pairs
addSpecialTokens - whether to encode the sequence with special tokens relative to the model
withOverflowingTokens - whether to return overflowing tokens
Returns:
the Encoding of the input text pairs in batch
-
batchEncode
public Encoding[] batchEncode(ai.djl.util.PairList<String, String> inputs)
Returns the Encoding of the input text pairs in batch.
Parameters:
inputs - the batch of input text pairs
Returns:
the Encoding of the input text pairs in batch
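Batch encoding of plain sentences and of text pairs might look like this sketch; the model name and example sentences are illustrative, and PairList.add(key, value) is assumed from the ai.djl.util.PairList API.

```java
import java.util.Arrays;
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import ai.djl.util.PairList;

public class BatchEncodeExamples {
    public static void main(String[] args) {
        try (HuggingFaceTokenizer tokenizer =
                HuggingFaceTokenizer.newInstance("bert-base-uncased")) {
            // A batch of independent sentences, one Encoding per input.
            Encoding[] batch = tokenizer.batchEncode(
                    Arrays.asList("First sentence.", "Second sentence."));
            System.out.println(batch.length);

            // A batch of text pairs, e.g. premise/hypothesis for NLI.
            PairList<String, String> pairs = new PairList<>();
            pairs.add("A man is eating.", "A person is having a meal.");
            Encoding[] pairBatch = tokenizer.batchEncode(pairs);
            System.out.println(pairBatch.length);
        }
    }
}
```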
-
decode
public String decode(long[] ids, boolean skipSpecialTokens)
Returns the decoded String from the input ids.
Parameters:
ids - the input ids
skipSpecialTokens - whether to remove special tokens in the decoding
Returns:
the decoded String from the input ids
-
decode
public String decode(long[] ids)
Returns the decoded String from the input ids.
Parameters:
ids - the input ids
Returns:
the decoded String from the input ids
-
batchDecode
public String[] batchDecode(long[][] batchIds, boolean skipSpecialTokens)
Returns the decoded Strings from the input batch ids.
Parameters:
batchIds - the batch of id sequences to decode
skipSpecialTokens - whether to remove special tokens in the decoding
Returns:
the decoded Strings from the input batch ids
-
batchDecode
public String[] batchDecode(long[][] batchIds)
Returns the decoded Strings from the input batch ids.
Parameters:
batchIds - the batch of id sequences to decode
Returns:
the decoded Strings from the input batch ids
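Decoding single and batched id sequences can be sketched as follows; the model name is illustrative, and Encoding.getIds() is assumed as the accessor for the encoded ids.

```java
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class DecodeExamples {
    public static void main(String[] args) {
        try (HuggingFaceTokenizer tokenizer =
                HuggingFaceTokenizer.newInstance("bert-base-uncased")) {
            Encoding encoding = tokenizer.encode("Hello, world!");
            long[] ids = encoding.getIds();

            // Decode with the single-argument overload's default handling.
            String decoded = tokenizer.decode(ids);
            // Explicitly strip special tokens (e.g. [CLS]/[SEP] for BERT models).
            String clean = tokenizer.decode(ids, true);
            System.out.println(clean);

            // Decode several id sequences in one call.
            String[] texts = tokenizer.batchDecode(new long[][] {ids, ids}, true);
            System.out.println(texts.length);
        }
    }
}
```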
-
getTruncation
Returns the truncation policy.
Returns:
the truncation policy
-
getPadding
Returns the padding policy.
Returns:
the padding policy
-
getMaxLength
public int getMaxLength()
Returns the max token length.
Returns:
the max token length
-
getStride
public int getStride()
Returns the stride to use in overflow overlap when truncating sequences longer than the model supports.
Returns:
the stride to use in overflow overlap when truncating sequences longer than the model supports
-
getPadToMultipleOf
public int getPadToMultipleOf()
Returns the padToMultipleOf for padding.
Returns:
the padToMultipleOf for padding
-
builder
public static HuggingFaceTokenizer.Builder builder()
Creates a builder to build a HuggingFaceTokenizer.
Returns:
a new builder
-
builder
public static HuggingFaceTokenizer.Builder builder(Map<String, ?> arguments)
Creates a builder to build a HuggingFaceTokenizer.
Parameters:
arguments - the model's arguments
Returns:
a new builder
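A configuration-heavy setup is usually easier through the builder. The builder method names used below (optTokenizerName, optMaxLength, optTruncation, optPadding) are assumptions about HuggingFaceTokenizer.Builder and may vary across DJL versions; the model name is illustrative.

```java
import java.io.IOException;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class BuilderExample {
    public static void main(String[] args) throws IOException {
        // Builder method names here are assumed, not confirmed by this page.
        try (HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.builder()
                .optTokenizerName("bert-base-uncased") // hub identifier (illustrative)
                .optMaxLength(128)                     // truncate/pad to 128 tokens
                .optTruncation(true)
                .optPadding(true)
                .build()) {
            // getMaxLength() should reflect the configured limit.
            System.out.println(tokenizer.getMaxLength());
        }
    }
}
```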
-
finalize
-