Problem description
There are 6000 URLs. A Celery task starts at 12:00, generates the URLs, and sends the queue to two servers to crawl. I use a downloader middleware to get 10 proxy IPs at a time to make the requests. After 100 URLs are done, the crawl moves on to the next batch of 100 URLs in the queue, but why does it not read the new IPs? As it is, the whole run of 6000 URLs keeps using the same 10 IPs it read the first time. Currently I read a text file containing the IPs for every request inside the process_request function, and a scheduled replacement of that file guarantees it only ever holds 10 IPs, so each batch of 100 requests should pick randomly from those 10. But the remaining requests in the queue never read the new IPs again.
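For reference, here is a minimal sketch of the kind of middleware described above, assuming a file path and format (one ip:port per line) that are not in the original question. The key point is where the file is read: if the list were loaded once in __init__ (or from_crawler) instead of in process_request, it would be cached for the whole run, which would match the behavior you are seeing.

```python
import random

class FileProxyMiddleware:
    """Sketch of a downloader middleware that re-reads the proxy
    file on every request, so a freshly replaced file takes effect
    for the next batch of URLs."""

    PROXY_FILE = "/etc/crawler/proxies.txt"  # assumed path, one ip:port per line

    def process_request(self, request, spider):
        # Re-reading here (not in __init__) is what picks up new IPs;
        # a list loaded once at startup never changes afterwards.
        with open(self.PROXY_FILE) as f:
            proxies = [line.strip() for line in f if line.strip()]
        if proxies:
            request.meta["proxy"] = "http://" + random.choice(proxies)
```

If the middleware already looks like this and still uses stale IPs, it is worth checking whether both crawl servers actually receive the updated file, since each server reads its own local copy.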
I read the IPs from a text file because I can control the file and replace it on a schedule so that it only ever holds 10 IPs. If I skipped the file and called the IP interface directly for every request instead, I would need a huge number of IPs: one round of 6000 URLs would need at least 6000 of them. I want one round of 6000 URLs to use far fewer, picking up a fresh set of 10 IPs each time the next batch of URLs starts, but it does not seem to pick them up. The IPs in the text file are still being replaced on schedule, yet Scrapy reads them once and never reads them again.
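For completeness, a sketch of the scheduled replacement side, assuming a hypothetical proxy API URL (the real interface is not in the question). Writing to a temporary file and renaming it means a request in flight either sees the old 10 IPs or the new 10, never a half-written file:

```python
import os
import tempfile
import requests

PROXY_API = "http://example.com/get?num=10"   # hypothetical endpoint returning one ip:port per line
PROXY_FILE = "/etc/crawler/proxies.txt"       # same path the middleware reads

def refresh_proxies():
    # Fetch 10 fresh IPs from the (assumed) interface.
    ips = requests.get(PROXY_API, timeout=10).text.strip().splitlines()
    # Atomic replacement: write the new list to a temp file in the
    # same directory, then rename it over the old file in one step.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(PROXY_FILE))
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(ips) + "\n")
    os.replace(tmp, PROXY_FILE)
```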
2 servers, Celery + RabbitMQ + Python + Scrapy crawler framework