In a real Scrapy project, do you always use the XPath support that comes with the framework, or do you sometimes re-instantiate a parser with etree.HTML where appropriate?

I ask because the target website's GET request returns JSON, so if I want to use XPath on the HTML string stored in one of its fields, I can't use response.xpath (or maybe there is another way that I don't know of). Instead, I have to pull that field out of response.text and build a new XPath parser over it. Is this the correct way to handle it in a real project?

Scrapy web-crawler python

Jun.27,2022

Generally speaking, Scrapy's built-in XPath and CSS selectors are sufficient, and no other HTML/XHTML parser, such as etree or bs4, is needed.

For JSON content, you can call json.loads() directly to parse it, for example:

import json

# response.text replaces the deprecated response.body_as_unicode()
js = json.loads(response.text)
js['xxx']

Scrapy was also expected to gain a .json() method on responses (similar to the requests library); Scrapy 2.2 and later do provide response.json().

Reference

https://docs.scrapy.org/en/la...
https://github.com/scrapy/scr...

The HTML fragments obtained from the JSON can be wrapped in a Selector from scrapy.selector and then parsed with XPath and CSS selectors:

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'

You can also use BeautifulSoup, lxml, pyquery, and other libraries.
