Class HuggingFaceTokenizer

java.lang.Object
ai.djl.util.NativeResource<Long>
ai.djl.huggingface.tokenizers.HuggingFaceTokenizer
All Implemented Interfaces:
ai.djl.modality.nlp.preprocess.TextProcessor, ai.djl.modality.nlp.preprocess.Tokenizer, AutoCloseable

public final class HuggingFaceTokenizer extends ai.djl.util.NativeResource<Long> implements ai.djl.modality.nlp.preprocess.Tokenizer
HuggingFaceTokenizer is a Hugging Face tokenizer implementation of the Tokenizer interface that converts sentences into tokens.
  • Method Details

    • newInstance

      public static HuggingFaceTokenizer newInstance(String name)
      Creates a pre-trained HuggingFaceTokenizer instance from the Hugging Face Hub.
      Parameters:
      name - the name of the huggingface tokenizer
      Returns:
      a HuggingFaceTokenizer instance
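      For example, a tokenizer can be loaded directly by model id (a minimal sketch; "bert-base-uncased" is an illustrative model id, and network access plus the DJL tokenizers extension on the classpath are assumed):

```java
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class HubExample {
    public static void main(String[] args) {
        // Downloads the tokenizer for the given model id from the Hugging Face Hub.
        // HuggingFaceTokenizer is AutoCloseable, so try-with-resources releases
        // the underlying native resource.
        try (HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("bert-base-uncased")) {
            System.out.println(tokenizer.tokenize("Hello, world!"));
        }
    }
}
```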
    • newInstance

      public static HuggingFaceTokenizer newInstance(String identifier, Map<String,String> options)
      Creates a pre-trained HuggingFaceTokenizer instance from the Hugging Face Hub.
      Parameters:
      identifier - the identifier of the huggingface tokenizer
      options - tokenizer options
      Returns:
      a HuggingFaceTokenizer instance
    • newInstance

      public static HuggingFaceTokenizer newInstance(Path modelPath) throws IOException
      Creates a pre-trained HuggingFaceTokenizer instance from an existing model.
      Parameters:
      modelPath - the directory or file path of the model location
      Returns:
      a HuggingFaceTokenizer instance
      Throws:
      IOException - when IO operation fails in loading a resource
    • newInstance

      public static HuggingFaceTokenizer newInstance(Path modelPath, Map<String,String> options) throws IOException
      Creates a pre-trained HuggingFaceTokenizer instance from an existing model.
      Parameters:
      modelPath - the directory or file path of the model location
      options - tokenizer options
      Returns:
      a HuggingFaceTokenizer instance
      Throws:
      IOException - when IO operation fails in loading a resource
    • newInstance

      public static HuggingFaceTokenizer newInstance(Path vocab, Path merges, Map<String,String> options) throws IOException
      Creates a pre-trained BPE HuggingFaceTokenizer instance from existing vocabulary and merges files.
      Parameters:
      vocab - the BPE vocabulary file
      merges - the BPE merges file
      options - tokenizer options
      Returns:
      a HuggingFaceTokenizer instance
      Throws:
      IOException - when IO operation fails in loading a resource
    • newInstance

      public static HuggingFaceTokenizer newInstance(InputStream is, Map<String,String> options) throws IOException
      Creates a pre-trained HuggingFaceTokenizer instance from an InputStream.
      Parameters:
      is - the InputStream to load the tokenizer from
      options - tokenizer options
      Returns:
      a HuggingFaceTokenizer instance
      Throws:
      IOException - when IO operation fails in loading a resource
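      The path-based overloads above can be sketched as follows (the file path is a placeholder, and the option keys shown are assumed to mirror the builder settings; verify them against the DJL version in use):

```java
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Map;

public class LocalExample {
    public static void main(String[] args) throws IOException {
        // Hypothetical option keys controlling truncation behavior; the path
        // below is a placeholder for a locally saved tokenizer.json file.
        Map<String, String> options = Map.of("maxLength", "128", "truncation", "true");
        try (HuggingFaceTokenizer tokenizer =
                HuggingFaceTokenizer.newInstance(Paths.get("/path/to/tokenizer.json"), options)) {
            System.out.println(tokenizer.tokenize("Loaded from a local file"));
        }
    }
}
```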
    • getVersion

      public String getVersion()
      Returns the version of the Huggingface tokenizer.
      Returns:
      the version number of the Huggingface tokenizer
    • tokenize

      public List<String> tokenize(String sentence)
      Specified by:
      tokenize in interface ai.djl.modality.nlp.preprocess.Tokenizer
    • buildSentence

      public String buildSentence(List<String> tokens)
      Specified by:
      buildSentence in interface ai.djl.modality.nlp.preprocess.Tokenizer
    • close

      public void close()
      Specified by:
      close in interface AutoCloseable
      Overrides:
      close in class ai.djl.util.NativeResource<Long>
    • encode

      public Encoding encode(String text, boolean addSpecialTokens, boolean withOverflowingTokens)
      Returns the Encoding of the input sentence.
      Parameters:
      text - the input sentence
      addSpecialTokens - whether to encode the sequence with the model's special tokens
      withOverflowingTokens - whether to return overflowing tokens
      Returns:
      the Encoding of the input sentence
    • encode

      public Encoding encode(String text)
      Returns the Encoding of the input sentence.
      Parameters:
      text - the input sentence
      Returns:
      the Encoding of the input sentence
    • encode

      public Encoding encode(String text, String textPair, boolean addSpecialTokens, boolean withOverflowingTokens)
      Returns the Encoding of the input sentence.
      Parameters:
      text - the input sentence
      textPair - the second input sentence
      addSpecialTokens - whether to encode the sequence with the model's special tokens
      withOverflowingTokens - whether to return overflowing tokens
      Returns:
      the Encoding of the input sentence
    • encode

      public Encoding encode(String text, String textPair)
      Returns the Encoding of the input sentence.
      Parameters:
      text - the input sentence
      textPair - the second input sentence
      Returns:
      the Encoding of the input sentence
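      Putting the single-sentence and sentence-pair overloads together (a sketch; the model id and input strings are illustrative, and the Encoding accessors shown are assumed from the companion Encoding class):

```java
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class EncodeExample {
    public static void main(String[] args) {
        try (HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("bert-base-uncased")) {
            // Single sentence: ids, tokens and attention mask come from the Encoding.
            Encoding single = tokenizer.encode("DJL is easy to use.");
            long[] ids = single.getIds();
            String[] tokens = single.getTokens();
            long[] attentionMask = single.getAttentionMask();

            // Sentence pair, e.g. for question answering or NLI; the two texts
            // are joined into one sequence with the model's separator tokens.
            Encoding pair = tokenizer.encode("What is DJL?", "DJL is a deep learning library.");
        }
    }
}
```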
    • encode

      public Encoding encode(List<String> inputs, boolean addSpecialTokens, boolean withOverflowingTokens)
      Returns the Encoding of the input sentences.
      Parameters:
      inputs - the input sentences
      addSpecialTokens - whether to encode the sequences with the model's special tokens
      withOverflowingTokens - whether to return overflowing tokens
      Returns:
      the Encoding of the input sentences
    • encode

      public Encoding encode(List<String> inputs)
      Returns the Encoding of the input sentences.
      Parameters:
      inputs - the input sentences
      Returns:
      the Encoding of the input sentences
    • encode

      public Encoding encode(String[] inputs, boolean addSpecialTokens, boolean withOverflowingTokens)
      Returns the Encoding of the input sentences.
      Parameters:
      inputs - the input sentences
      addSpecialTokens - whether to encode the sequences with the model's special tokens
      withOverflowingTokens - whether to return overflowing tokens
      Returns:
      the Encoding of the input sentences
    • encode

      public Encoding encode(String[] inputs)
      Returns the Encoding of the input sentences.
      Parameters:
      inputs - the input sentences
      Returns:
      the Encoding of the input sentences
    • batchEncode

      public Encoding[] batchEncode(List<String> inputs, boolean addSpecialTokens, boolean withOverflowingTokens)
      Returns the Encodings of the input sentences in batch.
      Parameters:
      inputs - the batch of input sentences
      addSpecialTokens - whether to encode the sequences with the model's special tokens
      withOverflowingTokens - whether to return overflowing tokens
      Returns:
      the Encodings of the input sentences in batch
    • batchEncode

      public Encoding[] batchEncode(List<String> inputs)
      Returns the Encodings of the input sentences in batch.
      Parameters:
      inputs - the batch of input sentences
      Returns:
      the Encodings of the input sentences in batch
    • batchEncode

      public Encoding[] batchEncode(String[] inputs, boolean addSpecialTokens, boolean withOverflowingTokens)
      Returns the Encodings of the input sentences in batch.
      Parameters:
      inputs - the batch of input sentences
      addSpecialTokens - whether to encode the sequences with the model's special tokens
      withOverflowingTokens - whether to return overflowing tokens
      Returns:
      the Encodings of the input sentences in batch
    • batchEncode

      public Encoding[] batchEncode(String[] inputs)
      Returns the Encodings of the input sentences in batch.
      Parameters:
      inputs - the batch of input sentences
      Returns:
      the Encodings of the input sentences in batch
    • batchEncode

      public Encoding[] batchEncode(ai.djl.util.PairList<String,String> inputs, boolean addSpecialTokens, boolean withOverflowingTokens)
      Returns the Encodings of the input text pairs in batch.
      Parameters:
      inputs - the batch of input text pairs
      addSpecialTokens - whether to encode the sequences with the model's special tokens
      withOverflowingTokens - whether to return overflowing tokens
      Returns:
      the Encodings of the input text pairs in batch
    • batchEncode

      public Encoding[] batchEncode(ai.djl.util.PairList<String,String> inputs)
      Returns the Encodings of the input text pairs in batch.
      Parameters:
      inputs - the batch of input text pairs
      Returns:
      the Encodings of the input text pairs in batch
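      The batch overloads can be sketched as follows (the model id and inputs are illustrative; PairList usage is assumed to follow DJL's standard add(key, value) pattern):

```java
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import ai.djl.util.PairList;
import java.util.Arrays;
import java.util.List;

public class BatchExample {
    public static void main(String[] args) {
        try (HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("bert-base-uncased")) {
            // One Encoding per input sentence.
            List<String> inputs = Arrays.asList("The first sentence.", "The second sentence.");
            Encoding[] encodings = tokenizer.batchEncode(inputs);

            // Text pairs via PairList, e.g. premise/hypothesis pairs for NLI.
            PairList<String, String> pairs = new PairList<>();
            pairs.add("It is raining.", "The ground is wet.");
            Encoding[] pairEncodings = tokenizer.batchEncode(pairs);
        }
    }
}
```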
    • decode

      public String decode(long[] ids, boolean skipSpecialTokens)
      Returns the decoded String from the input ids.
      Parameters:
      ids - the input ids
      skipSpecialTokens - whether to remove special tokens in the decoding
      Returns:
      the decoded String from the input ids
    • decode

      public String decode(long[] ids)
      Returns the decoded String from the input ids.
      Parameters:
      ids - the input ids
      Returns:
      the decoded String from the input ids
    • batchDecode

      public String[] batchDecode(long[][] batchIds, boolean skipSpecialTokens)
      Returns the decoded Strings from the input batch ids.
      Parameters:
      batchIds - the batch of id sequences to decode
      skipSpecialTokens - whether to remove special tokens in the decoding
      Returns:
      the decoded Strings from the input batch ids
    • batchDecode

      public String[] batchDecode(long[][] batchIds)
      Returns the decoded Strings from the input batch ids.
      Parameters:
      batchIds - the batch of id sequences to decode
      Returns:
      the decoded Strings from the input batch ids
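      An encode/decode round trip illustrates the skipSpecialTokens flag (a sketch; the model id is illustrative, and the exact special tokens depend on the model):

```java
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class DecodeExample {
    public static void main(String[] args) {
        try (HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("bert-base-uncased")) {
            Encoding encoding = tokenizer.encode("Hello world");
            long[] ids = encoding.getIds();

            // Keep special tokens (e.g. [CLS]/[SEP] for BERT-style models) ...
            String withSpecial = tokenizer.decode(ids, false);
            // ... or strip them to recover text close to the original input.
            String plain = tokenizer.decode(ids, true);
        }
    }
}
```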
    • getTruncation

      public String getTruncation()
      Returns the truncation policy.
      Returns:
      the truncation policy
    • getPadding

      public String getPadding()
      Returns the padding policy.
      Returns:
      the padding policy
    • getMaxLength

      public int getMaxLength()
      Returns the max token length.
      Returns:
      the max token length
    • getStride

      public int getStride()
      Returns the stride to use in overflow overlap when truncating sequences longer than the model supports.
      Returns:
      the stride to use in overflow overlap when truncating sequences longer than the model supports
    • getPadToMultipleOf

      public int getPadToMultipleOf()
      Returns the padToMultipleOf for padding.
      Returns:
      the padToMultipleOf for padding
    • builder

      public static HuggingFaceTokenizer.Builder builder()
      Creates a builder to build a HuggingFaceTokenizer.
      Returns:
      a new builder
    • builder

      public static HuggingFaceTokenizer.Builder builder(Map<String,?> arguments)
      Creates a builder to build a HuggingFaceTokenizer.
      Parameters:
      arguments - the model's arguments
      Returns:
      a new builder
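      A builder-based setup might look like this (a sketch: the opt* setter names shown follow DJL's builder convention and the model id is illustrative; verify the exact setters against the Builder javadoc for the DJL version in use):

```java
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import java.io.IOException;

public class BuilderExample {
    public static void main(String[] args) throws IOException {
        // Configure truncation and padding up front instead of passing an
        // options map to newInstance.
        try (HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.builder()
                .optTokenizerName("bert-base-uncased")
                .optMaxLength(128)
                .optTruncation(true)
                .optPadding(true)
                .build()) {
            System.out.println(tokenizer.getMaxLength());
        }
    }
}
```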
    • finalize

      protected void finalize() throws Throwable
      Overrides:
      finalize in class Object
      Throws:
      Throwable