Package org.apache.nutch.parse
Interface Parser
- All Superinterfaces:
Configurable, Pluggable
- All Known Implementing Classes:
ExtParser, FeedParser, HtmlParser, JSParseFilter, TikaParser, ZipParser
A parser for content generated by a Protocol implementation. This interface is
implemented by extensions. Nutch's core contains no page parsing code.
Field Summary
Fields
X_POINT_ID - The name of the extension point.
Method Summary
Methods
getParse - This method parses the given content and returns a map of <key, parse> pairs.
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Details
X_POINT_ID
The name of the extension point.
Method Details
getParse
This method parses the given content and returns a map of <key, parse> pairs.
Parse instances will be persisted under the given key.
Note: Meta-redirects should be followed only when they are coming from the original URL. That is:
Assume fetcher is in parsing mode and is currently processing foo.bar.com/redirect.html. If this URL contains a meta redirect to another URL, fetcher should only follow the redirect if the map contains an entry of the form <"foo.bar.com/redirect.html", Parse with a ParseStatus indicating the redirect>.
- Parameters:
- c - Content to be parsed
- Returns:
- a map containing <key, parse> pairs
- Since:
- NUTCH-443
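
The meta-redirect rule described for getParse can be illustrated with a minimal sketch. It deliberately does not use the Nutch classes: SimpleParse is a hypothetical stand-in for a Parse together with its ParseStatus, a plain Map stands in for the returned <key, parse> pairs, and redirectToFollow is an illustrative helper, not part of this interface. The only point shown is that the caller looks up the entry stored under the URL it is currently processing and ignores redirects recorded under any other key.

```java
import java.util.HashMap;
import java.util.Map;

public class MetaRedirectSketch {

    /** Hypothetical stand-in for a parse and its status; not a Nutch class. */
    public static final class SimpleParse {
        final boolean redirect;       // true if the status indicates a redirect
        final String redirectTarget;  // target URL, or null when there is no redirect

        public SimpleParse(boolean redirect, String redirectTarget) {
            this.redirect = redirect;
            this.redirectTarget = redirectTarget;
        }
    }

    /**
     * Returns the redirect target to follow, or null if none should be followed.
     * Only the entry keyed by the URL currently being processed is consulted.
     */
    public static String redirectToFollow(Map<String, SimpleParse> parses, String fetchedUrl) {
        SimpleParse parse = parses.get(fetchedUrl);
        if (parse == null || !parse.redirect) {
            return null;
        }
        return parse.redirectTarget;
    }

    public static void main(String[] args) {
        Map<String, SimpleParse> parses = new HashMap<>();
        // Entry for the original URL, with a status indicating a redirect.
        parses.put("foo.bar.com/redirect.html",
                new SimpleParse(true, "foo.bar.com/target.html"));
        // Entry stored under a different key; its redirect must not be followed
        // while processing foo.bar.com/redirect.html.
        parses.put("foo.bar.com/embedded.html",
                new SimpleParse(true, "foo.bar.com/elsewhere.html"));

        System.out.println(redirectToFollow(parses, "foo.bar.com/redirect.html"));
        System.out.println(redirectToFollow(parses, "foo.bar.com/other.html"));
    }
}
```

In this sketch the first call returns the target recorded for the original URL, and the second returns null because the map holds no entry under the URL being processed.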