A crawler I wrote about half a year ago had been working fine, but recently it stopped working and started raising an HTTP 500 error. The site it crawls, http://xilin123.cn/, still opens normally in a browser. When I open the developer tools, I can see that the Status Code really is 500, and that is what makes my program fail.
Strange: a page that returns 500 can still be viewed normally? A normal page should return 200!
Have they put some anti-crawler measure in place?
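To see what the server actually sends back, I can run a quick probe like this (just a sketch; the "Mozilla/5.0" User-Agent is my own placeholder, and whether the headers reveal anything is a guess):

import http.client

# fetch the front page and print the raw status line and content type
conn = http.client.HTTPConnection("xilin123.cn", 80, timeout=10)
conn.request("GET", "/", headers={"User-Agent": "Mozilla/5.0"})
resp = conn.getresponse()
print(resp.status, resp.reason)          # shows the 500 even though the page renders
print(resp.getheader("Content-Type"))
conn.close()

And here is the downloader part of my crawler: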
import urllib.request as urllib2  # Python 3: urllib.request, aliased to keep the old urllib2 name
import random

class HtmlDownloader(object):
    def download(self, url):
        if url is None:
            return None
        # rotate through a few browser User-Agent strings
        ua_list = [
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
            "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
            "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
            "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        ]
        user_agent = random.choice(ua_list)
        request = urllib2.Request(url=url)
        request.add_header("User-Agent", user_agent)
        response = urllib2.urlopen(request)  # raises urllib.error.HTTPError on a 500 response
        html = response.read()
        page = html.decode("GBK")            # the site serves GBK-encoded pages
        return page
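As far as I can tell, urllib.request.urlopen() raises urllib.error.HTTPError for any non-2xx status, even when the server sends a complete HTML body along with the 500. Since HTTPError is itself file-like, one workaround I'm considering is reading the body from the exception object (a minimal sketch, assuming the 500 response really does carry the page):

import urllib.request
import urllib.error

url = "http://xilin123.cn/"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
try:
    body = urllib.request.urlopen(req).read()
except urllib.error.HTTPError as e:
    print("status:", e.code)  # 500, but the response body is still attached
    body = e.read()           # the exception object is file-like
print(body.decode("GBK", errors="replace")[:200])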
Is that a reasonable fix, or what should I do?