Interface Parser

All Superinterfaces:
Configurable, Pluggable
All Known Implementing Classes:
ExtParser, FeedParser, HtmlParser, JSParseFilter, TikaParser, ZipParser

public interface Parser extends Pluggable, Configurable
A parser for content generated by a Protocol implementation. This interface is implemented by extensions. Nutch's core contains no page parsing code.
  • Field Summary

    Fields
    Modifier and Type      Field         Description
    static final String    X_POINT_ID    The name of the extension point.
  • Method Summary

    Modifier and Type    Method                 Description
    ParseResult          getParse(Content c)    This method parses the given content and returns a map of <key, parse> pairs.

    Methods inherited from interface org.apache.hadoop.conf.Configurable

    getConf, setConf
  • Field Details

    • X_POINT_ID

      static final String X_POINT_ID
      The name of the extension point.
  • Method Details

    • getParse

      ParseResult getParse(Content c)

      This method parses the given content and returns a map of <key, parse> pairs. Parse instances will be persisted under the given key.

      Note: Meta-redirects should be followed only when they come from the original URL. For example:
      Assume the fetcher is in parsing mode and is currently processing foo.bar.com/redirect.html. If this page contains a meta-redirect to another URL, the fetcher should follow the redirect only if the map contains an entry of the form <"foo.bar.com/redirect.html", Parse with a ParseStatus indicating the redirect>.

      Parameters:
      c - Content to be parsed
      Returns:
      a map containing <key, parse> pairs
      Since:
      NUTCH-443
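
The meta-redirect rule in the getParse note can be illustrated with a minimal, self-contained sketch. It does not use the Nutch classes: Status, ParseEntry, redirectToFollow, and the target URLs are hypothetical stand-ins invented for this example, and a plain java.util.Map stands in for the <key, parse> pairs that getParse returns. The sketch only shows the lookup logic: a redirect is followed when the entry stored under the original URL itself carries a redirect status, and ignored when it appears only under some other key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetaRedirectSketch {

    /** Hypothetical stand-in for a parse status; this is not Nutch's ParseStatus. */
    enum Status { SUCCESS, REDIRECT }

    /** Hypothetical stand-in for a parse: a status plus an optional redirect target. */
    static final class ParseEntry {
        final Status status;
        final String redirectTarget; // null unless status is REDIRECT

        ParseEntry(Status status, String redirectTarget) {
            this.status = status;
            this.redirectTarget = redirectTarget;
        }
    }

    /**
     * Returns the redirect target to follow, or null when no redirect should
     * be followed. Only the entry keyed by the original URL is consulted, so
     * redirects reported under any other key are ignored.
     */
    static String redirectToFollow(String originalUrl, Map<String, ParseEntry> parseResult) {
        ParseEntry entry = parseResult.get(originalUrl);
        if (entry != null && entry.status == Status.REDIRECT) {
            return entry.redirectTarget;
        }
        return null;
    }

    public static void main(String[] args) {
        String original = "foo.bar.com/redirect.html";

        // Case 1: the redirect is recorded under the original URL, so it is followed.
        Map<String, ParseEntry> fromOriginal = new LinkedHashMap<>();
        fromOriginal.put(original, new ParseEntry(Status.REDIRECT, "foo.bar.com/target.html"));

        // Case 2: the redirect is recorded only under a different key, so it is ignored.
        Map<String, ParseEntry> fromOtherKey = new LinkedHashMap<>();
        fromOtherKey.put(original, new ParseEntry(Status.SUCCESS, null));
        fromOtherKey.put("foo.bar.com/other.html", new ParseEntry(Status.REDIRECT, "foo.bar.com/elsewhere.html"));

        System.out.println(redirectToFollow(original, fromOriginal)); // prints foo.bar.com/target.html
        System.out.println(redirectToFollow(original, fromOtherKey)); // prints null
    }
}
```

In a real plugin the map would be the ParseResult returned by getParse and the status check would go through ParseStatus; the exact calls for that are not shown in this page, so they are left out of the sketch.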