Target page: https://www.w3cschool.cn/code.
This is the function that fetches the HTML:
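import requests

def getHtml(url):
    # Fetch the page and return its body as text
    resp = requests.get(url)  # renamed from "re" to avoid shadowing the re module
    return resp.text

index = getHtml(url)  # url is the target page given above
index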
This is the function that parses the HTML:
from bs4 import BeautifulSoup

def parseHtml(html):
    # Parse the html argument (the original parsed the global index instead)
    soup = BeautifulSoup(html, "html.parser")
    # soup
    lessonList = soup.find("div", class_="codecamplist-catalog").find_all("a")
    return lessonList

lessonList = parseHtml(index)
lessonList
The resulting lessonList is a bs4.element.ResultSet:
[<a href="//www.w3cschool.cn/codecamp/say-hello-to-html-element.html" title="Say Hello to HTML Element">
<i class="icon-codecamp-list icon-codecamp-option"></i>
HTML</a>,
<a href="//www.w3cschool.cn/codecamp/headline-with-the-h2-element.html" title="Headline with the h2 Element">
<i class="icon-codecamp-list icon-codecamp-option"></i>
HTML h2</a>,
<a href="//www.w3cschool.cn/codecamp/inform-with-the-paragraph-element.html" title="Inform with the Paragraph Element">
<i class="icon-codecamp-list icon-codecamp-option"></i>
HTML p</a>,
<a href="//www.w3cschool.cn/codecamp/uncomment-html.html" title="Uncomment HTML">
<i class="icon-codecamp-list icon-codecamp-option"></i>
HTML</a>]
How do I parse data in this format?
The goal is to save each link and its title to a CSV file.
With the data in Tag format I can only get the first item, and using the find_all method reports an error.
def getLesson(lessonList):
    for i in lessonList:
        lesson = {}
        try:
            lesson["title"] = i.find("a")["href"].lstrip("//")
            lesson["name"] = i.find("a")["title"]
        except:
            print("error")
    return lesson

getLesson(lessonList)
# lessonList = soup.find_all("div", class_="codecamplist-catalog")
#     .find_all("a")
Result:
{"name": "Say Hello to HTML Element",
"title": "www.w3cschool.cn/codecamp/say-hello-to-html-element.html"}
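
One possible way to reach the CSV goal, as a minimal sketch rather than a definitive fix. It assumes lessonList is the ResultSet returned by parseHtml, so every item is already an <a> Tag: its attributes can be read directly, whereas i.find("a") searches for a nested <a> and returns None. A ResultSet is a list of Tags and has no find_all of its own, which is the likely source of the error mentioned above. The names get_lessons, save_lessons and lessons.csv are hypothetical.

```python
import csv

def get_lessons(lesson_list):
    """Collect one dict per <a> Tag (hypothetical helper)."""
    lessons = []
    for a in lesson_list:
        # Each item is already an <a> Tag, so read its attributes directly
        href = a.get("href")
        title = a.get("title")
        if not href or not title:
            continue  # skip anchors missing either attribute
        # Same keys as in the post: "name" is the title text, "title" is the link.
        # lstrip("/") drops the leading "//" of the protocol-relative URL.
        lessons.append({"name": title, "title": href.lstrip("/")})
    return lessons  # one list holding every lesson

def save_lessons(lessons, path="lessons.csv"):
    """Write the collected dicts to a CSV file (path is an assumed default)."""
    # newline="" prevents blank lines between rows on Windows
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "title"])
        writer.writeheader()
        writer.writerows(lessons)
```

With the lessonList from parseHtml, calling save_lessons(get_lessons(lessonList)) should write a header line followed by one row per lesson.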