Using Scrapy-Redis for distributed crawling: how can the scheduling pool be kept full enough for multiple machines to crawl at the same time, and why does it empty so easily?

  1. Question: the project uses the RedisCrawlSpider template for two-way crawling: one Rule follows next-page links (horizontal crawling) and another Rule follows detail-page links (vertical crawling). In practice, even with multiple machines running together, the next page is only crawled after the current page and its detail pages have been crawled. When the website has anti-crawler measures, the slowdown is even worse, and the benefit of distributed crawling is barely visible.
  2. Idea: by experimenting, I found that request links pushed to name:start_urls in Redis while the crawl is running are also scheduled. So if enough request links are added to name:start_urls up front, the scheduling pool has enough work to hand out, which should keep machines from sitting idle while they wait for requests.
  3. Practice: I separated out the horizontal page-number URLs, generated the URL of every listing page, and stored them in name:start_urls. Sure enough, once the pool held enough requests for all hosts, the crawl ran at full efficiency.
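The seeding step in item 3 can be sketched roughly as follows. This is a minimal illustration, not the asker's actual program: the helper names `page_urls` and `seed_start_urls`, the URL template, and the page range are all hypothetical, and it assumes name:start_urls is a Redis list (the scrapy-redis default) and that `client` is a redis-py style object with an `lpush` method.

```python
from typing import Iterable, Iterator


def page_urls(template: str, first: int, last: int) -> Iterator[str]:
    """Yield one listing-page URL per page number (inclusive range)."""
    for page in range(first, last + 1):
        yield template.format(page=page)


def seed_start_urls(client, key: str, urls: Iterable[str], batch_size: int = 500) -> int:
    """Push URLs onto the Redis list `key` in batches; return how many were pushed."""
    batch, total = [], 0
    for url in urls:
        batch.append(url)
        if len(batch) >= batch_size:
            client.lpush(key, *batch)  # one round trip per batch
            total += len(batch)
            batch = []
    if batch:  # push whatever is left after the last full batch
        client.lpush(key, *batch)
        total += len(batch)
    return total


# Example usage (needs a running Redis server and the redis package):
#   import redis
#   client = redis.Redis(host="localhost", port=6379)
#   urls = page_urls("https://example.com/list?page={page}", 1, 1000)
#   seed_start_urls(client, "name:start_urls", urls)
```

Pushing in batches keeps the number of round trips to Redis low when there are many pages. The script can also be run again while the spiders are active, since links added during the crawl are picked up as well.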

However, I wonder whether anyone else has run into this problem. If so, how did you solve it? Is there a better solution? This approach needs a separate program to generate the URLs and store them in name:start_urls, which doesn't feel very elegant or convenient, although it is the best solution I could think of.

May.12,2022

By default, Scrapy uses LIFO queues to store pending requests, which means it crawls in depth-first order. Depth-first order is more convenient in most cases. If you want to crawl in breadth-first order, you can set the following settings:
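For reference, a sketch of what such a settings.py fragment might look like. The first group is the breadth-first configuration given in the Scrapy FAQ for plain Scrapy. The second group is an assumption about this project: when the scrapy-redis scheduler is enabled, the shared queue in Redis is selected with SCHEDULER_QUEUE_CLASS, so a FIFO queue there would be the closest equivalent. Whether this alone keeps every machine busy has not been verified here.

```python
# settings.py (sketch)

# Plain Scrapy: breadth-first order, as listed in the Scrapy FAQ.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"

# Assumption: the project uses the scrapy-redis scheduler. In that case
# requests live in a shared Redis queue, and FIFO order is chosen here.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.FifoQueue"
```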

  reference documentation  
