I want to crawl a website with about 1 billion records. The URL pattern is http://xxx.com/id=xx; I request each URL, extract the data, and store it in a database. The id parameter in the URL is predictable, ranging from 0 to 1000000000, so I can generate all 1 billion URLs directly:
def start_requests(self):
    for i in range(0, 1000000000):
        yield Request(f"http://xxx.com/id={i}", callback=self.parse)
But this only runs on one machine, and it is far too slow. I intend to switch to scrapy-redis, but I have the following questions. When I move to scrapy-redis and use 20 machines:
1. Do the master and the slaves have separate responsibilities? That is, is the master responsible for generating URLs and pushing them into Redis, while the slaves pop URLs from Redis and consume them? If not, how does it actually work? (See the master/slave sketch after this list.)
2. Since I want to store the crawl results in a database, does every slave have to connect to the database? Can each slave write its data to the database itself? I feel that sending the data back to the master and having the master write it to the database would waste a lot of time and bandwidth. (See the pipeline sketch after this list.)
3. From the articles I have read online, there seems to be no strict distinction between master and slave: every node just asks Redis whether some other node has already fetched a given id before requesting it, which seems to waste a lot of time. Doesn't scrapy-redis support the division of labor I describe in question 1? (See the settings sketch after this list.)
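To make question 1 concrete, here is roughly what I imagine the master doing: a plain script (not a spider) that pushes all the URLs into the Redis list that scrapy-redis spiders read from. This is only a sketch of my understanding; the host name "master", the spider name "myspider", and the key "myspider:start_urls" are assumptions (the key follows scrapy-redis's default "<spider_name>:start_urls" pattern):

import redis

r = redis.Redis(host="master", port=6379)
pipe = r.pipeline()
for i in range(0, 1000000000):
    # queue the URL; batch the pushes so we don't make 1 billion round trips
    pipe.lpush("myspider:start_urls", f"http://xxx.com/id={i}")
    if i % 10000 == 0:
        pipe.execute()
pipe.execute()
# NOTE: 1 billion URLs would take tens of GB of Redis memory, so in
# practice I would probably push in chunks as the queue drains.

And on each of the 20 slaves, the spider would inherit from scrapy-redis's RedisSpider, which pops URLs from that key instead of using start_urls; the parsing below is a placeholder:

from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = "myspider"
    redis_key = "myspider:start_urls"  # the list the master pushes into

    def parse(self, response):
        # placeholder extraction; the real fields depend on the page
        yield {"id": response.url.split("id=")[-1], "data": response.text}

Is this division of labor how scrapy-redis is meant to be used?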
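For question 2, what I have in mind on each slave is an ordinary item pipeline that writes straight to a shared database, skipping the master entirely. This is a minimal sketch assuming MySQL via pymysql; the host "db-server", the credentials, and the items table are all hypothetical:

import pymysql

class DirectDbPipeline:
    def open_spider(self, spider):
        # each slave opens its own connection to the shared database
        self.conn = pymysql.connect(host="db-server", user="crawler",
                                    password="secret", database="crawl")

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        with self.conn.cursor() as cur:
            cur.execute("INSERT INTO items (id, data) VALUES (%s, %s)",
                        (item["id"], item["data"]))
        self.conn.commit()
        return item

Is there any reason this would not work, or any reason the results should go back through the master?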
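For question 3, my understanding is that the duplicate check happens in Redis through settings like these, shared by every slave (REDIS_URL pointing at the master's Redis; taken from the scrapy-redis documentation as I understand it):

# settings.py on every slave
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep the queue and dupefilter in Redis between runs
REDIS_URL = "redis://master:6379"

Since my ids are generated once and never repeat, is this dupefilter round trip avoidable?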
Those are my three questions for now. Thank you.