Package org.apache.tika.parser.html
Class JSoupParser
- java.lang.Object
  - org.apache.tika.parser.AbstractEncodingDetectorParser
    - org.apache.tika.parser.html.JSoupParser
- All Implemented Interfaces:
Serializable, org.apache.tika.parser.Parser
public class JSoupParser extends org.apache.tika.parser.AbstractEncodingDetectorParser
HTML parser. Uses JSoup to turn the input document into HTML SAX events, then post-processes those events to produce the XHTML and metadata expected by Tika clients.
- See Also:
- Serialized Form
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
- static Charset DEFAULT_CHARSET
-
Constructor Summary
Constructors (Constructor, Description):
- JSoupParser()
- JSoupParser(org.apache.tika.detect.EncodingDetector encodingDetector)
-
Method Summary
Methods (Modifier and Type, Method, Description):
- protected org.apache.tika.detect.EncodingDetector getEncodingDetector(org.apache.tika.parser.ParseContext parseContext): Look for an EncodingDetector in the ParseContext.
- Set<org.apache.tika.mime.MediaType> getSupportedTypes(org.apache.tika.parser.ParseContext context)
- boolean isExtractScripts()
- void parse(InputStream stream, ContentHandler handler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext context)
- void parseString(String html, ContentHandler handler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext context)
- void setExtractScripts(boolean extractScripts): Whether or not to extract contents in script entities.
-
-
-
Field Detail
-
DEFAULT_CHARSET
public static final Charset DEFAULT_CHARSET
-
-
Method Detail
-
getSupportedTypes
public Set<org.apache.tika.mime.MediaType> getSupportedTypes(org.apache.tika.parser.ParseContext context)
-
isExtractScripts
public boolean isExtractScripts()
-
setExtractScripts
@Field public void setExtractScripts(boolean extractScripts)
Whether or not to extract contents in script entities. Default is false.
- Parameters:
extractScripts - whether to extract contents in script entities
-
parse
public void parse(InputStream stream, ContentHandler handler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext context) throws IOException, SAXException, org.apache.tika.exception.TikaException
- Throws:
IOException
SAXException
org.apache.tika.exception.TikaException
-
parseString
public void parseString(String html, ContentHandler handler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext context) throws SAXException
- Throws:
SAXException
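Both parse and parseString deliver their output to the supplied ContentHandler as SAX events. The following is a minimal sketch of such a handler, using only the JDK's org.xml.sax classes. The class name TextCollectingHandler and its simulate helper are hypothetical and not part of Tika. The events are fired by hand here to stand in for the parser, and the XHTML namespace and "p" element names are assumptions about what the parser would emit.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Collects character data from SAX events. Hypothetical helper, not part of Tika. */
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Nothing to record on opening tags in this sketch.
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // End each paragraph with a newline so block boundaries survive.
        if ("p".equals(localName)) {
            text.append('\n');
        }
    }

    String getText() {
        return text.toString();
    }

    /** Fires the events for one p element per argument, standing in for the parser. */
    static String simulate(String... paragraphs) {
        final String xhtml = "http://www.w3.org/1999/xhtml"; // assumed namespace
        TextCollectingHandler handler = new TextCollectingHandler();
        for (String p : paragraphs) {
            handler.startElement(xhtml, "p", "p", new AttributesImpl());
            char[] chars = p.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement(xhtml, "p", "p");
        }
        return handler.getText();
    }

    public static void main(String[] args) {
        // With Tika on the classpath, the handler would instead be passed to the parser, e.g.
        //   new JSoupParser().parseString(html, handler, new Metadata(), new ParseContext());
        System.out.print(simulate("Hello", "World"));
    }
}
```

In practice a ready-made handler such as Tika's BodyContentHandler would likely be a better starting point. The hand-written version only shows which callbacks a client receives.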
-
getEncodingDetector
protected org.apache.tika.detect.EncodingDetector getEncodingDetector(org.apache.tika.parser.ParseContext parseContext)
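Look for an EncodingDetector in the ParseContext. If it hasn't been passed in, use the original EncodingDetector from initialization.
- Overrides:
getEncodingDetector in class org.apache.tika.parser.AbstractEncodingDetectorParser
- Parameters:
parseContext - the ParseContext to search for an EncodingDetector
- Returns:
the EncodingDetector found in the ParseContext, or the original EncodingDetector from initialization if none was passed in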
-
-
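The lookup-with-fallback behaviour described for getEncodingDetector can be sketched with JDK-only stand-ins. Everything in this block is hypothetical: Detector stands in for org.apache.tika.detect.EncodingDetector, Context for ParseContext (modelled as a map from type to instance), and DetectorLookupSketch for the parser. It illustrates the pattern only and is not Tika's implementation.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class DetectorLookupSketch {
    /** Hypothetical stand-in for org.apache.tika.detect.EncodingDetector. */
    interface Detector {
        Charset detect(byte[] input);
    }

    /** Hypothetical stand-in for ParseContext: a map from type to instance. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast returns null when no entry is present.
            return key.cast(entries.get(key));
        }
    }

    // The detector supplied at construction time ("the original from initialization").
    private final Detector original;

    DetectorLookupSketch(Detector original) {
        this.original = original;
    }

    /** Prefer a detector passed in through the context; otherwise use the original. */
    Detector getDetector(Context context) {
        Detector fromContext = context.get(Detector.class);
        return fromContext != null ? fromContext : original;
    }

    public static void main(String[] args) {
        Detector utf8 = input -> StandardCharsets.UTF_8;
        Detector latin1 = input -> StandardCharsets.ISO_8859_1;
        DetectorLookupSketch parser = new DetectorLookupSketch(utf8);

        Context empty = new Context();
        Context overriding = new Context();
        overriding.set(Detector.class, latin1);

        System.out.println(parser.getDetector(empty).detect(new byte[0]));      // UTF-8
        System.out.println(parser.getDetector(overriding).detect(new byte[0])); // ISO-8859-1
    }
}
```

With this pattern a caller can override the detector for a single parse call without reconfiguring the parser instance.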