Using Scrapy-Redis for distributed crawling: how do you gracefully keep the scheduling pool full enough for multiple machines to crawl at the same time, and why does the pool empty so easily?


Question: The project uses the RedisCrawlSpider template for two-way crawling: one Rule follows the horizontal "next page" URLs, and another Rule follows the vertical detail-page URLs. The result is that even with multiple machines running together, the next page is crawled only after the current page and all of its detail pages have been crawled. When the website also has anti-crawler measures, the result is predictably slow, and the efficiency gain from distributed crawling barely shows.
Idea: Later, by experimenting, I found that request links added to name:start_urls in the Redis database while the crawl is running are also scheduled and requested. So the approach is to add enough request links to name:start_urls up front, so that the scheduling pool always has enough work to hand out and no machine sits idle waiting to be scheduled.
Practice: So I split out the page-number URLs that were originally crawled horizontally, generated the URL of every listing page, and stored them in name:start_urls. Sure enough, once the hosts had a large enough scheduling pool to draw from, the crawling efficiency showed fully.
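
The seeding step described above can be sketched as a small standalone script. This is only an illustration under stated assumptions, not the asker's actual program: the URL template, the page range, the connection details, and the helper names `page_urls` and `seed_start_urls` are all hypothetical, and it assumes the spider reads `name:start_urls` as a Redis list of plain URL strings (some scrapy-redis versions can be configured otherwise).

```python
from typing import Iterable, Iterator


def page_urls(template: str, first: int, last: int) -> Iterator[str]:
    """Yield one listing-page URL per page number in [first, last]."""
    for page in range(first, last + 1):
        yield template.format(page=page)


def seed_start_urls(client, key: str, urls: Iterable[str], batch_size: int = 500) -> int:
    """Push URLs onto the Redis list `key` in batches; return how many were pushed.

    `client` is any object with a redis-py style `lpush(key, *values)` method.
    """
    batch, pushed = [], 0
    for url in urls:
        batch.append(url)
        if len(batch) >= batch_size:
            client.lpush(key, *batch)
            pushed += len(batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        client.lpush(key, *batch)
        pushed += len(batch)
    return pushed


def main() -> None:
    import redis  # redis-py; assumed to be installed

    # Assumed connection details and a hypothetical URL template.
    client = redis.Redis(host="localhost", port=6379)
    urls = page_urls("https://example.com/list?page={page}", 1, 1000)
    # "name:start_urls" is the key named in the question.
    print(seed_start_urls(client, "name:start_urls", urls))
```

Calling `main()` against a real Redis server fills the list once, before or during the crawl. Batching keeps each `lpush` call small, which matters only when the number of pages is large.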

However, I wonder whether others have run into this problem. If so, what was your solution? Is there a better one? This approach needs a separate program to generate the URLs and store them in name:start_urls, which does not feel very elegant or convenient, although it is the best solution I have come up with so far.

Scrapy python-crawler

May.12,2022

By default, Scrapy uses LIFO queues to store pending requests, which in a nutshell means it crawls depth-first. Depth-first order is more convenient in most cases. If you want to crawl in breadth-first order instead, you can switch the scheduler to FIFO queues through the project settings; see the reference documentation.
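
For reference, the Scrapy FAQ lists the settings below for breadth-first order, which is presumably what this answer refers to. The commented-out scrapy-redis line is an assumption added for this question's setup and should be checked against the installed scrapy-redis version.

```python
# settings.py (sketch)

# Settings listed in the Scrapy FAQ for breadth-first crawl order.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"

# Assumption: with the scrapy-redis scheduler, pending requests live in Redis,
# so the queue type is chosen with SCHEDULER_QUEUE_CLASS rather than the two
# queue settings above. The class path is taken from recent scrapy-redis
# releases; verify it against the version in use.
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.FifoQueue"
```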

  reference documentation


										