How do you manually determine in pyspider whether a URL has already been crawled?

When crawling a list page, I want to stop crawling the list as soon as I hit articles that have already been crawled. Otherwise, how do I know whether to continue to the next page?

Dec 28, 2021

Write each crawled URL to a record, such as a pickle file. Load it and check against it whenever you need to.
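For instance, a minimal sketch of that pickle approach (the file name and helper names here are illustrative, not from the answer):

import os
import pickle

SEEN_FILE = 'crawled_urls.pkl'  # illustrative record file


def load_seen():
    # load the set of crawled URLs, or start fresh if no record exists yet
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE, 'rb') as f:
            return pickle.load(f)
    return set()


def save_seen(seen):
    # persist the set back to disk after crawling
    with open(SEEN_FILE, 'wb') as f:
        pickle.dump(seen, f)


seen = load_seen()
url = 'https://example.com/article/1'
if url not in seen:
    # crawl the page here, then record the URL
    seen.add(url)
    save_seen(seen)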


You can compute the MD5 of each crawled URL, store the digest in Redis, and check whether it is already there whenever you crawl. Redis is very fast, so you don't have to worry about efficiency.

import hashlib
import redis

# assumed local Redis instance; adjust host/port/db for your setup
conn = redis.Redis(host='localhost', port=6379, db=0)


def md_url(url):
    # fingerprint the URL with MD5
    md5 = hashlib.md5()
    md5.update(url.encode('utf-8'))
    return md5.hexdigest()


def exist_or_not(finger_print):
    # SADD returns 0 if the fingerprint is already in the set (seen before),
    # 1 if it was newly added (i.e. not crawled yet)
    ex = conn.sadd('urls', finger_print)
    return ex
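
A quick usage sketch tying this back to the question's stop-on-seen logic for list pages (the URLs are illustrative):

# walk a list page and stop once an already-crawled article appears
for url in ['https://example.com/a/1', 'https://example.com/a/2']:
    if exist_or_not(md_url(url)) == 0:
        # fingerprint was already in the set: this article was crawled before
        break
    # the fingerprint was new; crawl the article here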