The crawler encountered a special situation.

URL: https://www.lagou.com/gongsi/


I want to extract the content under the tag <div class="item_manager_content">, but the first item is not wrapped in <p> while all the others are. How should I deal with this situation?

Dec.04,2021

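First, crawl every item the same way, as if there were no <p>. Then, assuming the crawled fragment is stored in content, strip the wrapper if it is there:

    # Remove a leading <p> (3 characters) and a trailing </p> (4 characters)
    if content.startswith('<p>'):
        content = content[3:]
    if content.endswith('</p>'):
        content = content[:-4]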
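The prefix and suffix check above can be wrapped in a small helper so that every fragment goes through the same cleanup. This is only a sketch: the function name strip_outer_p and the sample fragments are made up for illustration, and it assumes the wrapper is a bare <p> with no attributes.

```python
def strip_outer_p(content: str) -> str:
    """Remove one wrapping <p>...</p> pair, if present, and trim whitespace."""
    content = content.strip()
    if content.startswith("<p>"):
        content = content[len("<p>"):]
    if content.endswith("</p>"):
        content = content[:-len("</p>")]
    return content.strip()


# Hypothetical fragments: the first has no <p> wrapper, the others do.
fragments = [
    "Founder and CEO",
    "<p>Ten years in search engineering</p>",
    "  <p>Previously led the mobile team</p>\n",
]
cleaned = [strip_outer_p(f) for f in fragments]
print(cleaned)
```

A fragment that opens with something like <p class="..."> would pass through unchanged, so a real page may need a parser rather than string slicing.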

Irregular markup like this is a real nuisance. I recommend parsing it with BeautifulSoup using the html5lib parser. It has the best fault tolerance, although it is slower.
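A minimal sketch of the parser-based approach, assuming the beautifulsoup4 and html5lib packages are installed. The HTML string is an invented stand-in for the real page, and falling back to the built-in html.parser when html5lib is missing is my own addition, not part of the answer.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invented sample: the first paragraph is bare text, the rest sit inside <p>.
html = (
    '<div class="item_manager_content">'
    "First paragraph, no wrapper."
    "<p>Second paragraph.</p>"
    "<p>Third paragraph.</p>"
    "</div>"
)

try:
    soup = BeautifulSoup(html, "html5lib")
except FeatureNotFound:
    # html5lib is not installed; use the parser bundled with Python instead.
    soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", class_="item_manager_content")

# stripped_strings yields every non-empty text node under the div,
# whether or not it is wrapped in <p>.
texts = list(div.stripped_strings)
print(texts)
```

Because stripped_strings walks text nodes, a paragraph that contains inline tags such as <b> would be split into several strings. In that case, iterating over the children of the div and calling get_text() on each one may be a better fit.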


Grab every item the same way, ignoring the <p>, and then remove the outer <p> and </p> if they are present.