How does pyspider manually determine whether a url has been crawled?

When crawling a list page, I want to stop crawling the list as soon as I find articles that have already been crawled, and otherwise continue to the next page. How can I do this?

Python pyspider

Dec.28,2021

Write each crawled URL to a persistent record, such as a pickle file. Load the record and check against it whenever you need to know whether a URL has been crawled.
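A minimal sketch of the pickle approach, assuming the record is a plain Python set of URLs. The file name `seen_urls.pkl` and the helper names `load_seen` and `save_seen` are hypothetical, chosen only for illustration.

```python
import os
import pickle

SEEN_FILE = "seen_urls.pkl"  # hypothetical file name


def load_seen(path=SEEN_FILE):
    """Return the set of crawled URLs, or an empty set if no record exists yet."""
    if not os.path.exists(path):
        return set()
    with open(path, "rb") as f:
        return pickle.load(f)


def save_seen(seen, path=SEEN_FILE):
    """Write the set of crawled URLs back to disk."""
    with open(path, "wb") as f:
        pickle.dump(seen, f)
```

Load the set once at startup, test `url in seen` before crawling, add the URL after a successful crawl, and call `save_seen` periodically. This is simple, but the whole set lives in memory and the file is rewritten on every save, so it suits small crawls best.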

You can compute the MD5 hash of each crawled URL, store it in a Redis set, and check whether it is already there before crawling. Redis set operations are very fast, so the lookup cost is not a concern.

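import hashlib
import redis

# Assumes a Redis server on the default localhost:6379; adjust as needed.
conn = redis.Redis()


def md_url(url):
    """Return the MD5 hex digest of a URL, used as its fingerprint."""
    md5 = hashlib.md5()
    md5.update(url.encode('utf-8'))
    return md5.hexdigest()


def exist_or_not(finger_print):
    # sadd returns 1 if the fingerprint was newly added (URL not crawled yet)
    # and 0 if it was already in the set (URL already crawled).
    ex = conn.sadd('urls', finger_print)
    return ex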
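To connect this back to the question, here is a rough sketch of the stop-on-seen logic for a list page. It assumes the list shows the newest articles first, so the first already-crawled article means everything after it is old. The helper `new_links_on_page` is hypothetical, and an in-memory set stands in for the Redis set so the example runs without a server. Like `sadd`, the stand-in `is_new` records the fingerprint at the moment it is checked.

```python
import hashlib


def md_url(url):
    """Return the MD5 hex digest of a URL, used as its fingerprint."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def new_links_on_page(article_urls, is_new):
    """Collect article URLs until the first one that was already crawled.

    Returns (new_urls, hit_seen). hit_seen is True when an already-crawled
    article was found, meaning the next list page should not be fetched.
    """
    new_urls = []
    for url in article_urls:
        if not is_new(md_url(url)):
            return new_urls, True
        new_urls.append(url)
    return new_urls, False


# In-memory stand-in for the Redis set (assumption, for illustration only).
seen = set()


def is_new(finger_print):
    """Return True and record the fingerprint if it has not been seen before."""
    if finger_print in seen:
        return False
    seen.add(finger_print)
    return True
```

In a pyspider list-page callback, you would call `self.crawl` for each URL in `new_urls`, and call `self.crawl` on the next-page link only when `hit_seen` is False. With Redis, `is_new` could simply return `conn.sadd('urls', finger_print) == 1`.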


