goal: retry the current request with a fresh proxy IP whenever the request fails or a CAPTCHA page is returned, and keep retrying until it succeeds, so as to reduce data loss while crawling.
question: I'm not sure my approach is correct. At present the middleware does detect the CAPTCHA, it does repeat the request, and the repeated request does carry a new proxy IP,
but the repeated request still gets the CAPTCHA page back. Is something missing from the middleware?
(A proxy IP pool and a random request User-Agent are already set up.)
The RetryMiddleware subclass code is as follows:
```python
import random
import time

from scrapy.downloadermiddlewares.retry import RetryMiddleware


# retry request
class LocalRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        if request.meta.get("dont_retry", False):
            return response
        print("response body:", response.body)
        # check whether the page contains the CAPTCHA image
        img = response.xpath('//img[@src="/Account/ValidateImage"]')
        print(img)
        if img:
            print("CAPTCHA detected, retrying")
            time.sleep(random.choice(range(6)))
            print("proxy ip:", request.meta.get("proxy"))
            return self._retry(request, "CAPTCHA page returned", spider) or response
        return response

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) and not request.meta.get("dont_retry", False):
            # drop the failed proxy from the pool
            # self.delete_proxy(request.meta.get("proxy", False))
            time.sleep(random.randint(3, 5))
            print("exception caught, retrying")
            return self._retry(request, exception, spider)
```
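For what it's worth, one common reason the retried request still hits the CAPTCHA is that `RetryMiddleware._retry` builds the retry as a copy of the original request, including its `meta`, so the old `proxy` value is carried over unless some other middleware overwrites it on the way out. A minimal sketch of forcing a different proxy onto the retry copy, using a simplified stand-in for `scrapy.Request` and a hypothetical `PROXY_POOL` list (neither is from the original code):

```python
import copy
import random

# Simplified stand-in for scrapy.Request, just enough to show the meta-copy issue.
class Request:
    def __init__(self, url, meta=None, dont_filter=False):
        self.url = url
        self.meta = meta or {}
        self.dont_filter = dont_filter

    def copy(self):
        # Like Scrapy, copying a request also copies its meta dict,
        # so the failed proxy survives into the retried request by default.
        return Request(self.url, copy.deepcopy(self.meta), self.dont_filter)

# Hypothetical proxy pool; replace with your own proxy source.
PROXY_POOL = ["http://1.1.1.1:8080", "http://2.2.2.2:8080", "http://3.3.3.3:8080"]

def retry_with_fresh_proxy(request):
    """Build the retry copy and force a proxy different from the failed one."""
    retryreq = request.copy()
    retryreq.dont_filter = True  # let the retry pass the duplicate filter
    old_proxy = request.meta.get("proxy")
    candidates = [p for p in PROXY_POOL if p != old_proxy]
    retryreq.meta["proxy"] = random.choice(candidates)
    return retryreq

req = Request("http://example.com/list", meta={"proxy": "http://1.1.1.1:8080"})
retry = retry_with_fresh_proxy(req)
print(retry.meta["proxy"] != req.meta["proxy"])
```

In the real middleware the same idea would go right before (or instead of) the `self._retry(...)` call: take the request returned by `_retry`, replace its `meta["proxy"]`, and return that. Also worth checking: if the target site ties the CAPTCHA to a session cookie, the copied cookies can re-trigger it even with a new IP.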