- question : the project uses the RedisCrawlSpider template from scrapy-redis to achieve two-way crawling, that is, one Rule handles horizontal crawling of next-page URLs, and another Rule handles vertical crawling of detail-page URLs (a sketch of this setup follows this list). The effect under distributed crawling is that even when multiple machines run together, the next page is only crawled after the current page and all of its detail pages have been crawled. When the website has anti-crawler measures, you can imagine the result: the efficiency of distributed crawling is basically not realized.
- idea : later, I found by experimenting that request links added to name:start_urls in the Redis database while the crawl is running also get scheduled. So the fix is to first add enough request links to name:start_urls that the scheduling pool is large enough to keep every machine supplied with work, which should avoid some machines sitting idle waiting for scheduling.
- practice : so I split the horizontal page-number URL generation out of the spider and stored the URL of every generated page in name:start_urls (see the seeding sketch below). Sure enough, once the hosts had a large enough scheduling pool to allocate from, the crawling efficiency was fully realized.
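
For reference, here is a minimal sketch of the two-Rule setup described in the question. The spider name, `redis_key`, and link-extractor patterns are hypothetical placeholders, not the actual project code:

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider


class ExampleSpider(RedisCrawlSpider):
    name = 'example'
    redis_key = 'example:start_urls'  # the name:start_urls list in Redis

    rules = (
        # vertical Rule: extract detail-page links and parse them
        Rule(LinkExtractor(allow=r'/detail/\d+'), callback='parse_item'),
        # horizontal Rule: extract next-page links and keep following them
        Rule(LinkExtractor(allow=r'/list\?page=\d+'), follow=True),
    )

    def parse_item(self, response):
        # hypothetical field extraction
        yield {'url': response.url, 'title': response.css('title::text').get()}
```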
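
And here is a minimal sketch of the separate seeding program from the practice step, assuming the listing pages follow a simple `?page=N` pattern and the key matches the spider's `redis_key` above (both hypothetical):

```python
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

# push every listing-page URL up front so the scheduling pool is large
# enough for all machines to draw work from immediately
for page in range(1, 1001):
    r.lpush('example:start_urls', f'http://www.example.com/list?page={page}')
```

Once the pool is seeded this way, every machine can pop a listing page independently instead of waiting for a single pagination chain to advance page by page.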
But I wonder whether you have encountered this kind of problem. If so, what was your solution? Is there a better one? Because this approach also requires a separate program to generate URLs and store them in name:start_urls, it doesn't feel very elegant or convenient, although it is the best solution I could come up with.