I have written a crawler. In the insert function of my Mongo pipeline class I added a judgment statement to avoid loading the same crawled data twice, but judging from the actual runs it does not work. Please take a look at the code first:

    import pymongo

    class MongoPipeLine():
        def __init__(self):
            # SETTINGS is my own config object, defined elsewhere
            self.client = pymongo.MongoClient(SETTINGS.MONGO_URI, connect=False)
            self.db = self.client[SETTINGS.MONGO_DB]
            self.collection = SETTINGS.COLLECTION

        def insert(self, data):
            # intended duplicate check: skip the insert if data already exists
            if self.db[self.collection].find(data) == 1:
                print("Data has been existed")
            else:
                self.db[self.collection].insert(data)

        def close(self):
            self.client.close()
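To help diagnose this, here is a minimal standalone check (the URI, database, and collection names are placeholders, not my real settings) to inspect what the comparison in insert actually evaluates to:

    import pymongo

    # Placeholder connection values, just for this isolated check.
    client = pymongo.MongoClient("mongodb://localhost:27017", connect=False)
    coll = client["test_db"]["test_collection"]

    doc = {"author": "Isaiah Rustad"}   # example document shape from my data
    result = coll.find(doc)
    print(type(result))                 # what does find() actually return?
    print(result == 1, result == 0)     # what do my comparisons evaluate to?
    print(coll.count_documents(doc))    # actual number of matches (pymongo 3.7+)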
Scheduling function:

    spider = Spider()
    mongo = MongoPipeLine()
    image = ImagePipeLine()

    def run(i):
        for page in range(i, i + 20):
            response = spider.get_page(page)
            data_list = spider.parse(response)
            for data in data_list:
                mongo.insert(data)
                image.download(data)

    if __name__ == "__main__":
        pool = Pool(20)
        pool.map(run, [i * 20 for i in range(10)])
        pool.close()
        pool.join()
        mongo.close()
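As a side note on the work split (plain arithmetic, nothing assumed): each task handles 20 consecutive pages and the ten start values passed to pool.map do not overlap, so within a single run the duplicates cannot come from overlapping page ranges:

    starts = [i * 20 for i in range(10)]   # [0, 20, 40, ..., 180]
    for s in starts:
        print(f"task starting at {s} covers pages {s}..{s + 19}")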
Partial output from a run:

    Isaiah Rustad Already download
    Annie Spratt Already download
    Alex Kalinin Already download
    Jakob Owens Already download
    Emily Henry Already download
- You can see from the output that although some images had already been downloaded and the corresponding entries were already saved in MongoDB, the console never prints "Data has been existed". Evidently the print branch of the judgment statement is never executed.
- Strangely, after I changed the judgment statement to "if self.db[self.collection].find(data) == 0" (spelled out after this list), the output is the same as before, and there is still no "Data has been existed".
- Are there any low-level mistakes I haven't considered?
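Spelled out, the variant from the second point above looks like this (branches unchanged; only the comparison changed):

    def insert(self, data):
        if self.db[self.collection].find(data) == 0:   # second attempt: compare to 0
            print("Data has been existed")
        else:
            self.db[self.collection].insert(data)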
Thanks in advance, and I wish you all a happy National Day!