Problem description
On CSDN, after logging in as a member, clicking the download button normally downloads the file via a URL like this:
a url
The crawler used to be able to fetch the file from that URL, but CSDN seems to have recently taken some countermeasure: requesting the same URL from the crawler no longer downloads the file and returns a 404 page instead.
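For context, one common reason a URL works in a logged-in browser but returns 404 to a script is that the server inspects request headers such as User-Agent or Referer. A minimal sketch of sending browser-like headers with a requests session (the header values below are assumptions for illustration, not taken from the original code):

```python
import requests

# Hypothetical sketch: attach browser-like headers to the whole session,
# so every subsequent session.get() carries them.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://download.csdn.net/",  # hypothetical referring page
})
# resp = session.get(remote_url)  # then inspect resp.status_code / resp.headers
```

Whether this particular 404 is caused by headers is only a guess; comparing the browser's request (via developer tools) with the crawler's request would confirm it.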
Related code
The Python requests library is used. This is the download function of the crawler class:
import os
import re
import requests
from bs4 import BeautifulSoup

def download(self, remote_url, local_dir):
    # 1. make sure the session is logged in
    if not self.__is_logined:
        self.__login()
    # count this download attempt
    self.download_count += 1
    count = 0
    while count < 3:
        count += 1
        # 2. fetch the page and extract the real download URL from the VIP button
        html_text = self.__session.get(remote_url).text
        html = BeautifulSoup(html_text, "html5lib")
        real_url = html.find("a", id="vip_btn").attrs["href"]
        # 3. request the file itself; stream=True so the body is not read into memory at once
        source = self.__session.get(real_url, stream=True)
        # 3.1 parse the filename out of the Content-Disposition header
        filename = re.findall(r".*\"(.*)\"$", source.headers.get("Content-Disposition", "\"None\""))[0]
        if filename == "None":
            continue
        filename = re.sub(r"\s", "_", filename)
        # 3.2 create the target directory if it does not exist yet
        if not os.path.exists(local_dir):
            os.makedirs(local_dir)
        _local_path = os.path.join(local_dir, filename)
        # 3.3 write the response to disk in chunks
        with open(_local_path, "wb") as local_file:
            for file_buffer in source.iter_content(chunk_size=512):
                if file_buffer:
                    local_file.write(file_buffer)
        return _local_path
    return None
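As a side note, the filename parsing in step 3.1 can be checked in isolation; the header value below is made up for illustration, not a real CSDN response:

```python
import re

header = 'attachment; filename="my file.zip"'  # sample Content-Disposition header
filename = re.findall(r".*\"(.*)\"$", header)[0]
filename = re.sub(r"\s", "_", filename)        # replace whitespace, as in the code above
# filename == "my_file.zip"
```

So the filename extraction itself works; the 404 happens earlier, when requesting the URL.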
Running the code above just returns the 404 page. How can the crawler fetch the file correctly? Any advice would be appreciated.