Hello!
Questions:
- I can inspect the headers of the current request through response.request.headers (the User-Agent is randomized), but I also want to check whether the configured proxy IP is actually in effect. How can I find out which IP was used for the current request?
- Generally speaking, if both the User-Agent and the IP address change, the CAPTCHA page should not appear, right? I considered cookies as the cause, but the crawler is still stopped by a CAPTCHA even when it runs without cookies, so I suspect the proxy IP is not actually being applied.
    # Middleware for the proxy API
    import urllib.request

    class ProxyAPIMiddleware(object):
        def process_request(self, request, spider):
            # "ipurl" is a placeholder for the proxy API URL
            req = urllib.request.Request("ipurl")
            response = urllib.request.urlopen(req)
            ip = "http://%s" % str(response.read(), "utf-8")  # ip + port
            request.meta["proxy"] = ip  # set the proxy ip
            print(request.meta["proxy"])  # the ip returned by the API is assigned via request.meta["proxy"] = ip
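One way to see which proxy each response was fetched with is to log request.meta["proxy"] when the response comes back. The following is only a minimal sketch: the class name ProxyLogMiddleware and the "proxy_used" meta key are hypothetical, and the middleware would still have to be enabled in DOWNLOADER_MIDDLEWARES alongside ProxyAPIMiddleware.

```python
import logging

logger = logging.getLogger(__name__)


class ProxyLogMiddleware:
    """Hypothetical downloader middleware that logs the proxy set on each request."""

    def process_response(self, request, response, spider):
        # request.meta["proxy"] holds whatever an earlier middleware
        # (for example ProxyAPIMiddleware) assigned; None means no proxy was set.
        proxy = request.meta.get("proxy")
        logger.info(
            "%s fetched via proxy %s (status %s)",
            request.url, proxy, response.status,
        )
        # "proxy_used" is a made-up key, kept so callbacks can read it later.
        request.meta["proxy_used"] = proxy
        return response
```

Inside a spider callback the same value should also be readable as response.meta.get("proxy"), since response.meta refers to the meta of the request that produced the response. Note that this only shows which proxy was configured for the request, not which IP the target site actually saw.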
Runtime output:
.
.
.
.
2018-06-23 15:57:29 [scrapy.middleware] INFO: Enabled spider middlewares:
["scrapy.spidermiddlewares.httperror.HttpErrorMiddleware",
"scrapy.spidermiddlewares.offsite.OffsiteMiddleware",
"scrapy.spidermiddlewares.referer.RefererMiddleware",
"scrapy.spidermiddlewares.urllength.UrlLengthMiddleware",
"scrapy.spidermiddlewares.depth.DepthMiddleware"]
=================
2018-06-23 15:57:30 [scrapy.middleware] INFO: Enabled item pipelines:
["soopat_patent.pipelines.SoopatPatentPipeline"]
2018-06-23 15:57:30 [scrapy.core.engine] INFO: Spider opened
2018-06-23 15:57:30 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-06-23 15:57:30 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1
ip: http://122.230.248.127:4523
http://122.230.248.127:4523
2018-06-23 15:57:33 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.soopat.com/> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.>]
User-Agent: Mozilla/5.0 (compatible; WOW64; MSIE 10.0; Windows NT 6.2)
ip: http://60.172.68.112:4507
http://60.172.68.112:4507
2018-06-23 15:57:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.soopat.com/> (referer: http://www.soopat.com/)
=========================
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10
ip: http://140.255.4.142:4523
http://140.255.4.142:4523
2018-06-23 15:57:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET ....)
...
[]
list index out of range
2018-06-23 15:57:48 [scrapy.core.engine] INFO: Closing spider (finished)
2018-06-23 15:57:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
.
.
.
The Scrapy crawler worked normally at first and the data was stored correctly, but when I ran it again the next day it was blocked by the CAPTCHA right away.
I am new to crawlers and would appreciate any advice. Thank you.
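Since the block only appears on a later run, one thing worth ruling out is that the proxy returned by the API is dead or not being used at all. A rough sketch of such a check outside Scrapy, using only urllib: it sends a request through the proxy to an IP echo service and returns the address that service reports. The helper names and the echo URL http://httpbin.org/ip are assumptions; any endpoint that returns JSON of the form {"origin": "..."} would work the same way.

```python
import http.client
import json
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that routes HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def visible_ip(proxy_url, echo_url="http://httpbin.org/ip", timeout=10):
    """Return the IP the echo service sees, or None if the request fails.

    echo_url is an assumption, not part of the original setup.
    """
    opener = build_proxy_opener(proxy_url)
    try:
        with opener.open(echo_url, timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
            return payload.get("origin")
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError and timeouts; ValueError covers bad JSON.
        return None
```

If visible_ip("http://122.230.248.127:4523") returns the proxy's address rather than your own, the proxy is alive and forwarding traffic; a None result suggests it has expired or is unreachable.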