Scrapy RetryMiddleware: retried requests carry new headers and a new proxy IP but still get a CAPTCHA

Goal: when a request fails because of its IP, or when a CAPTCHA is returned, re-issue the current request until it succeeds, so that less data is missed during the crawl.
Question: I'm not sure my approach is correct. At the moment, when the middleware detects a CAPTCHA it does retry the request, and the retry does carry a new IP,
but the retried request still gets the CAPTCHA back. Is something missing from the middleware methods?

(A proxy IP and a random User-Agent are set.)

The RetryMiddleware code is as follows:

import random
import time

from scrapy.downloadermiddlewares.retry import RetryMiddleware


# Retry request
class LocalRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        if request.meta.get("dont_retry", False):
            return response
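        print("response body:", response.body)
        # Look for the CAPTCHA image in the returned page
        img = response.xpath('//img[@src="/Account/ValidateImage"]')
        print(img)
        if img:
            print("CAPTCHA found, retrying")
            # Note: time.sleep() blocks Scrapy's event loop while it waits
            time.sleep(random.choice(range(6)))
            print("proxy of the failed request:", request.meta.get("proxy"))

            # Pass a short reason string; it is used in logs and stats keys
            return self._retry(request, "captcha", spider) or response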
        return response



    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) and not request.meta.get("dont_retry", False):
            # Optionally delete the failed proxy (disabled)
            # self.delete_proxy(request.meta.get("proxy", False))
            time.sleep(random.randint(3, 5))
            print("exception, retrying")

            return self._retry(request, exception, spider)
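
For a subclass like the one above to take the place of the built-in retry middleware, it would typically be registered in settings.py with the stock one disabled. A minimal sketch, assuming the class lives in a hypothetical module myproject.middlewares; 550 is the default priority of the built-in RetryMiddleware:

```python
# settings.py (sketch; "myproject.middlewares" is a hypothetical module path)
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in retry middleware so requests are not retried twice
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
    # Register the custom subclass at the built-in middleware's default priority
    "myproject.middlewares.LocalRetryMiddleware": 550,
}
```

A request returned from process_response is rescheduled and passes through every downloader middleware's process_request again, which is why the retry can pick up a fresh proxy and User-Agent.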
            
Mar.23,2021

I don't know whether the CAPTCHA you hit is reached through a 302 redirect. If it is, the likely cause is that the request being retried is the redirected one, whose URL is the CAPTCHA page itself, so every retry returns the CAPTCHA page again. With Scrapy's default middleware order, RedirectMiddleware (600) handles the response before RetryMiddleware (550), so process_response here would receive the already-redirected request.

This is only a guess, for reference.


response.request.meta.get('redirect_urls') should contain the URLs the request passed through before the redirect; the first entry is the original URL.
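
The idea of retrying the original URL rather than the CAPTCHA page can be sketched as a small helper. This is only an illustration under assumptions: original_url is a hypothetical name, plain dicts stand in for request.meta, and the example.com URLs are placeholders.

```python
def original_url(meta, current_url):
    """Return the first URL in meta['redirect_urls'], else current_url."""
    # RedirectMiddleware records the URLs a request went through in this list;
    # the first entry is the URL that was requested before any redirect.
    redirect_urls = meta.get("redirect_urls") or []
    return redirect_urls[0] if redirect_urls else current_url


# Plain dicts stand in for request.meta in this sketch.
redirected_meta = {"redirect_urls": ["https://example.com/list?page=3"]}
print(original_url(redirected_meta, "https://example.com/Account/Validate"))
print(original_url({}, "https://example.com/list?page=3"))
```

Inside process_response, one possible use would be to build the retry from request.replace(url=original_url(request.meta, request.url)) and pass that request to self._retry instead of the redirected one. Whether this gets past the CAPTCHA still depends on the site; it only makes sure the retry targets the page that was originally wanted.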
