How can I write a custom link extractor in Scrapy (Python)?

I want to write my own custom Scrapy link extractor.

The Scrapy documentation says it has two built-in extractors:

http://doc.scrapy.org/en/latest/topics/link-extractors.html

But I haven't seen any code example showing how to implement a custom link extractor. Can someone give an example of writing one?

Best answer

Here is an example of a custom link extractor:

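# Imports assume legacy Scrapy 0.x (scrapy.contrib) on Python 2; adjust the paths for your version.
from urlparse import urljoin

from w3lib.html import remove_entities, remove_tags, replace_escape_chars
from scrapy.contrib.linkextractors.regex import clean_link, linkre
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.link import Link

class RCP_RegexLinkExtractor(SgmlLinkExtractor):
    """High-performance link extractor"""

    def _extract_links(self, response_text, response_url, response_encoding, base_url=None):
        if base_url is None:
            base_url = urljoin(response_url, self.base_url) if self.base_url else response_url

        # Decode each href, strip entities, and resolve it against the base URL.
        clean_url = lambda u: urljoin(base_url, remove_entities(clean_link(u.decode(response_encoding))))
        # Decode the anchor text and strip tags and escape characters.
        clean_text = lambda t: replace_escape_chars(remove_tags(t.decode(response_encoding))).strip()

        # linkre yields (url, rest_of_tag, text) tuples; the set drops duplicate links.
        links_text = linkre.findall(response_text)
        urlstext = set([(clean_url(url), clean_text(text)) for url, _, text in links_text])

        return [Link(url, text) for url, text in urlstext]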
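To see what the extractor above is doing without needing Scrapy installed, here is a rough standalone sketch of the same idea in Python 3: find anchors with a regex, clean up the URL and the text, and de-duplicate. The names `LINK_RE`, `TAG_RE` and `extract_links`, the simplified pattern, and the example.com URLs are all my own assumptions for illustration; this is not Scrapy's actual `linkre` or its API.

```python
import re
from html import unescape
from urllib.parse import urljoin

# Simplified, hypothetical pattern (not Scrapy's linkre): captures the href
# value and the raw inner HTML of each <a> element.
LINK_RE = re.compile(
    r"""<a\s[^>]*?href\s*=\s*["']?([^"'\s>]+)["']?[^>]*>(.*?)</a>""",
    re.DOTALL | re.IGNORECASE,
)
TAG_RE = re.compile(r"<[^>]+>")


def extract_links(html, base_url):
    """Return a sorted list of unique (absolute_url, text) pairs found in html."""
    found = set()
    for href, inner in LINK_RE.findall(html):
        # Decode entities such as &amp; and resolve relative URLs.
        url = urljoin(base_url, unescape(href.strip()))
        # Drop nested tags, decode entities, and collapse whitespace.
        text = " ".join(unescape(TAG_RE.sub("", inner)).split())
        found.add((url, text))
    return sorted(found)


sample = (
    '<p><a href="/polls/oh.html">Ohio <b>polls</b></a> and '
    '<a class="x" href="va.html?a=1&amp;b=2">Virginia</a> and '
    '<a href="/polls/oh.html">Ohio <b>polls</b></a></p>'
)
links = extract_links(sample, "http://example.com/epolls/index.html")
for url, text in links:
    print(url, "->", text)
```

The duplicated Ohio link is collapsed by the set, which is the same trick the `set([...])` in `_extract_links` relies on. A regex is faster than a full HTML parser but will miss or mangle unusual markup, so treat it as a trade-off rather than a drop-in replacement.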

Usage

rules = (
    Rule(
        RCP_RegexLinkExtractor(
            allow=(r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html",),
            # Regex explanation:
            #     [a-z]{2} - matches a two-letter state abbreviation
            #     [a-z]+   - matches a state name
            #     [0-9]{4} - matches a four-digit unique webpage identifier

            allow_domains=('realclearpolitics.com',),
        ),
        callback='parseStatePolls',
        # follow=None,  # default; with a callback set, links are not followed
        process_links='processLinks',
        process_request='processRequest',
    ),
)
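If you want to sanity-check the `allow` pattern above before running the spider, you can try it on a few URLs with plain `re`. As far as I know Scrapy applies `allow` patterns with search semantics (a match anywhere in the URL), so `re.search` is a reasonable stand-in. The URLs below are hypothetical ones I made up to exercise each part of the pattern, not pages I have verified.

```python
import re

# Same pattern as the `allow` argument in the rule above.
ALLOW = re.compile(
    r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html"
)

# Hypothetical URLs, invented for illustration only.
candidates = [
    # two-letter state, state name, four-digit id: should match
    "http://www.realclearpolitics.com/epolls/2012/president/oh/ohio_romney_vs_obama-1860.html",
    # state segment is four letters instead of two: should not match
    "http://www.realclearpolitics.com/epolls/2012/president/ohio/ohio_romney_vs_obama-1860.html",
    # id has only three digits: should not match
    "http://www.realclearpolitics.com/epolls/2012/president/oh/ohio_romney_vs_obama-186.html",
    # different race entirely: should not match
    "http://www.realclearpolitics.com/epolls/2012/senate/oh/ohio_brown_vs_mandel-2100.html",
]

matched = [url for url in candidates if ALLOW.search(url)]
for url in matched:
    print("would be extracted:", url)
```

Only the first URL gets through, so the rule's callback would only ever see per-state Romney vs. Obama pages.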

For the full project, have a look at https://github.com/jtfairbank/RCP-Poll-Scraper