I use CrawlSpider with the following Rules to automatically follow the pagination links and crawl the movie information from Douban Top 250:
rules = (
    Rule(LinkExtractor(restrict_xpaths='//span[@class="next"]/a'),
         callback="parse_item", follow=True),
)
Because the information I want to crawl sits directly on the list pages, I don't need to follow each movie's detail URL.
But a problem arises. Even though callback sets the handler, the LinkExtractor only invokes the callback on the pages it reaches via extracted links, i.e. from the second page onward, so the content of the first page is never parsed.
I have searched for other solutions online, but most of them use two or more Rules (they need to follow deeper URLs). The problem can be solved by writing the pagination code manually with a basic Spider, but can it be solved with CrawlSpider? That approach looks a little more elegant.