When a crawler hits a front-end page made up of nothing but <p> tags, how do you extract the content you want?
Main problem: the front-end code of the page is a mess; everything is a <p> tag, so extracting the content with a Python crawler is painful and BeautifulSoup4 is very hard to point at the right elements. How should I handle this kind of situation? Any guidance would be appreciated.
URL: http://eshu.100xuexi.com/uplo.
my code:
import requests
from bs4 import BeautifulSoup

chapterurl = "http://eshu.100xuexi.com/uploads/ebook/e512edf6fac442fbafa2d23e8f2c8c22/mobile/epub/OEBPS/chap9.html"
response = requests.get(chapterurl)
print(response.status_code)
response.encoding = response.apparent_encoding
res = response.text
soup = BeautifulSoup(res, "lxml")
# the chapter title
chap = soup.find(class_="TocHref").get_text()
print(chap)
# every block with class "TiXing"
TiXings = soup.findAll(class_="TiXing")
for TiXing in TiXings:
    TiXing = TiXing.get_text().strip()
    print(TiXing)
Thanks again to all the experts!
parse it with regular expressions
Personally, I think the only option is to find the pattern. Grab all the <p> tags directly, slice the list into the multiple-choice and analysis-question sections, and extract by position according to the tag pattern.
Take the multiple-choice questions as an example: each single-choice and multiple-choice question consists of 8 <p> tags, and the questions are separated by an empty <p class="PSplit"> tag, so you can code against that directly:
# helper: split a list into n segments of (roughly) equal length
def seg_list(l, n):
    """Split the list l into n segments of roughly equal length.

    :param l: the list to split
    :param n: the number of segments
    :return: a list of n sub-lists
    """
    if len(l) < n:
        raise Exception('the list must contain at least %s items!' % (n,))
    new_list = []
    for i in range(n):
        new_list.append([])
    segment_num = 0
    remainder = 0
    segpoint = len(l) // n  # items per segment
    for num, key in enumerate(l, 1):
        if segment_num < n:
            new_list[segment_num].append(key)
            if num % segpoint == 0:
                segment_num += 1
        else:
            # spread any leftover items over the first segments
            new_list[remainder].append(key)
            remainder += 1
    return new_list
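For example, a quick sanity check (the values here are illustrative, not from the original post):

print(seg_list(list(range(6)), 3))  # -> [[0, 1], [2, 3], [4, 5]]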
Then write the crawler, organize the output format, and extract the multiple-choice questions:
import requests
from bs4 import BeautifulSoup

url = 'http://eshu.100xuexi.com/uploads/ebook/e512edf6fac442fbafa2d23e8f2c8c22/mobile/epub/OEBPS/chap9.html'
res = requests.get(url)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text, 'lxml')
total_list = soup.select('p')
mcp = total_list[2:299]  # the multiple-choice section of the page
mcp.insert(8, None)  # the first question is missing its <p class="PSplit"> separator, so pad it
mcp.pop(144)  # one question has 9 p tags instead of 8, drop the extra one
result_list = seg_list(mcp, 33)  # 33 questions, 9 elements each
for i in result_list:
    question = i[0].text
    choice_A = i[1].text
    choice_B = i[2].text
    choice_C = i[3].text
    choice_D = i[4].text
    answer = i[5].text
    test_point = i[6].text
    analyze = i[7].text
    print([question, choice_A, choice_B, choice_C, choice_D, answer, test_point, analyze])
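If you'd rather not hard-code the slice indices, you can also split on the empty <p class="PSplit"> separators themselves. A sketch, reusing the soup object from above (note you would still have to patch up the two irregular blocks mentioned in the comments):

blocks, current = [], []
for p in soup.select('p'):
    if 'PSplit' in (p.get('class') or []):  # an empty separator tag
        if current:
            blocks.append(current)
        current = []
    else:
        current.append(p)
if current:
    blocks.append(current)  # don't drop the last block
print(len(blocks))  # number of question blocks found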
As for the analysis questions, there is a similar pattern to follow.
Your problem is really just converting an HTML page into an Excel table, which you can do directly with an online converter, without a crawler at all.
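If you do want to build the table yourself in Python, a minimal sketch, assuming pandas and openpyxl are installed and reusing the result_list from the crawler answer above:

import pandas as pd

# one row per question, with the same eight fields extracted above
rows = [[p.text.strip() for p in q[:8]] for q in result_list]
cols = ['question', 'A', 'B', 'C', 'D', 'answer', 'test_point', 'analysis']
pd.DataFrame(rows, columns=cols).to_excel('chap9.xlsx', index=False)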
import re

# grab the content of every P tag (html here is the requests response for the page)
content = re.findall('<p.*?>(.*?)</p>', html.content.decode('utf-8'), re.S)
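End to end, that approach looks something like this (a sketch, using the same chapter URL as above):

import re
import requests

url = 'http://eshu.100xuexi.com/uploads/ebook/e512edf6fac442fbafa2d23e8f2c8c22/mobile/epub/OEBPS/chap9.html'
html = requests.get(url)
# capture the inner text of every <p> tag; re.S lets '.' span newlines
paragraphs = re.findall('<p.*?>(.*?)</p>', html.content.decode('utf-8'), re.S)
# strip any tags nested inside the captured text, e.g. <span> or <b>
paragraphs = [re.sub('<.*?>', '', p).strip() for p in paragraphs]
print(paragraphs[:10])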