I recently read Learning Scrapy, which describes a crawler that automatically turns pages and crawls the items on each page. The book says that Scrapy uses a last-in, first-out (LIFO) queue.
Suppose there are 30 items on each page and start_url is set to the first page. My understanding of LIFO is that the first item out should be the bottom item of the last page, but when I run the routine the first item processed is actually the last item of the first page. In fact, the overall order is page one, then page two, and so on; only within each page are the items processed from last to first.
That per-page order looks right, but I think the overall order of the results should start from the last page: once no next link can be extracted on the last page, why doesn't the crawler go straight to item_selector, extract the item links on the last page, and hand them to parse_item for processing? Why is the first page handled first?
Is there a problem in my understanding of yield that leads to this misunderstanding? I hope to get your help.
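To check the yield part of my question in isolation, here is a tiny standalone experiment using plain Python generators only (nothing Scrapy-specific; all names are made up):

def make_requests():
    print("about to yield A")
    yield "A"
    print("about to yield B")
    yield "B"

g = make_requests()   # nothing is printed yet: the generator body has not run
print(next(g))        # runs up to the first yield: prints "about to yield A", then "A"
print(next(g))        # resumes after the first yield: prints "about to yield B", then "B"

So yield by itself only hands values out lazily, one at a time, when something asks for them; it does not decide the processing order. The original routine from the book: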
import urlparse

from scrapy.http import Request

def parse(self, response):
    # Get the next index URLs and yield Requests
    next_selector = response.xpath('//*[contains(@class,"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))

    # Get item URLs and yield Requests
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item)
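To make the order I observe concrete, here is a toy reproduction of my mental model, assuming the scheduler is just a plain Python list used as a LIFO stack and using two made-up pages (this is only a sketch, not Scrapy's real scheduler, which is what I am asking about):

# Two fake index pages; names and contents are made up for illustration.
pages = {
    "page1": {"next": "page2", "items": ["p1-item1", "p1-item2", "p1-item3"]},
    "page2": {"next": None,    "items": ["p2-item1", "p2-item2", "p2-item3"]},
}

def fake_parse(url):
    # Mimics parse() above: the next-page request is yielded first,
    # then one request per item on the page.
    page = pages[url]
    if page["next"]:
        yield ("index", page["next"])
    for item in page["items"]:
        yield ("item", item)

stack = [("index", "page1")]      # start_url
while stack:
    kind, target = stack.pop()    # LIFO: most recently pushed request first
    if kind == "index":
        # A page's requests can only be pushed after that page has been
        # "downloaded" and parsed, i.e. after its own request is popped.
        stack.extend(fake_parse(target))
    else:
        print("parse_item: " + target)

This toy run prints p1-item3, p1-item2, p1-item1, then p2-item3, p2-item2, p2-item1, which matches what I see: page 2's requests do not exist on the stack until page 1's response has been parsed, so LIFO can only reverse the order within a page, not across pages. If that is how Scrapy actually behaves, then my mistake was assuming all requests from all pages are queued before any response is processed; I would appreciate confirmation.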