I have just started using the concurrent.futures library and want to write a simple multi-threaded script for downloading videos. Each page URL on the target website lists 20+ videos, and visiting a video's URL gives you its direct link. Part of my rough code is as follows:
from requests_html import HTML, HTMLSession
import concurrent.futures
import logging
...
def get_src(hdPageurl):
    # Fetch the video page and return (title, direct video link).
    s = HTMLSession()
    r = s.get(hdPageurl["url"], headers=headers, timeout=5)
    doc = HTML(html=r.text)
    return hdPageurl["title"], doc.find("source")[0].attrs["src"]
def crawl_page(url):  # [{"title": xxx}, {"src": xxx}, ...]
    s = HTMLSession()
    with s.get(url, headers=headers, timeout=5) as r:
        doc = HTML(html=r.text)
    hdPageUrl = []
    for i in doc.find(".hd-video"):
        title = i.find('a[target="blank"]')[0].find("img")[0].attrs["title"]
        if not any(keyword in title for keyword in filterTitle):
            url = i.find('a[target="blank"]')[0].attrs["href"]
            hdPageUrl.append({"title": title,
                              "url": url})
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as thread:
        for title, src in thread.map(get_src, hdPageUrl, timeout=10):
            thread.submit(download, {"title": title, "src": src})
def download(hdVideo):  # {"title": xxx, "src": xxx}
    logging.info("Downloading: " + hdVideo["title"] + " " + hdVideo["src"])
    try:
        title = hdVideo["title"]
        src = hdVideo["src"]
        s = HTMLSession()
        with s.get(src, headers=headers, stream=True) as r:
            with open(title + ".mp4", "wb") as v:
                for chunk in r.iter_content(chunk_size=1024):
                    if chunk:
                        v.write(chunk)
        logging.info("Downloaded: " + title)
    except Exception:
        logging.exception("Failed to Download.")
if __name__ == "__main__":
    logging.info("New Task Begin. ")
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(crawl_page, urls, timeout=10)  # url -> src
The problem now is that the log output is not interleaved as tasks progress: I get a burst of "Downloading:" lines, followed by a burst of "Downloaded:" lines. It seems that all the videos on one page have to finish downloading before the next page can start. If there are too many URLs, the program gets stuck. The script also never exits on its own; I have to kill the process manually.
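To make the symptom concrete, here is a stripped-down sketch with no network access that has the same shape as crawl_page. The names fake_download, fake_crawl_page, record and events are made up for illustration, and time.sleep stands in for the real download. As far as I understand, leaving the with block calls shutdown(wait=True) on the pool, which would explain why a page is only finished once all of its downloads are done:

```python
import concurrent.futures
import threading
import time

events = []                      # ordered record of what happened
events_lock = threading.Lock()


def record(message):
    # Append under a lock so the order of entries is well defined.
    with events_lock:
        events.append(message)


def fake_download(name):
    # Hypothetical stand-in for download(); sleeps instead of fetching a video.
    record(f"Downloading: {name}")
    time.sleep(0.05)
    record(f"Downloaded: {name}")


def fake_crawl_page(page):
    # Same shape as crawl_page(): a private pool created per page.
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
        for i in range(3):
            pool.submit(fake_download, f"{page}-video{i}")
    # Leaving the with block waits for every submitted task, so this line
    # runs only after all downloads of this page have finished.
    record(f"page done: {page}")


fake_crawl_page("page1")
for line in events:
    print(line)
```

With 20 workers and only a few videos per page, every download starts almost at once, which would match the burst of "Downloading:" lines I see.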
I would like to know whether this problem can be solved using concurrent.futures alone. It would probably be convenient to use threading and queue, but I'd like to try concurrent.futures.
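The direction I am considering is a single shared pool instead of one pool per page, submitting each download as soon as its direct link is resolved and collecting results with as_completed. This is only a sketch under assumptions, not tested against the real site: resolve_src and save_video are hypothetical stand-ins for get_src and download, the example.invalid URL is a placeholder, and max_workers=8 is an arbitrary choice.

```python
import concurrent.futures


def resolve_src(item):
    # Hypothetical stand-in for get_src(): returns (title, direct link).
    return item["title"], "https://example.invalid/" + item["title"] + ".mp4"


def save_video(title, src):
    # Hypothetical stand-in for download(); returns the title when done.
    return title


def run(items, max_workers=8):
    finished = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Stage 1: resolve every direct link on the shared pool.
        resolving = [pool.submit(resolve_src, item) for item in items]
        downloading = []
        for fut in concurrent.futures.as_completed(resolving):
            title, src = fut.result()   # re-raises any error from the worker
            # Stage 2: queue the download as soon as its link is known.
            downloading.append(pool.submit(save_video, title, src))
        # Collect downloads in the order they complete.
        for fut in concurrent.futures.as_completed(downloading):
            finished.append(fut.result())
    return finished


items = [{"title": f"video{i}"} for i in range(5)]
done = run(items)
print(sorted(done))
```

Calling result() on each future should also surface exceptions that a map() call whose iterator is never consumed would silently drop, which might help explain why the script hangs.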
Thank you.