Problem description
I am using the Scrapy framework to crawl release dates, directors, and other information for some movies on Douban, and I found that the XPath of the release date or director differs from film to film.
For example, the XPath of the release date on https://movie.douban.com/subj. is //*[@id="info"]/span[10],
while on https://movie.douban.com/subj. it is //*[@id="info"]/span[9]. The span index differs between the two pages, so the same expression extracts the field on some pages and misses it on others.
Apart from XPath, what other selector syntax could solve this problem?
Thanks in advance!
Related code
import json

import scrapy
from scrapy.http import Request

from MovieSpider.items import MoviespiderItem


class MovieSpider(scrapy.Spider):
    name = "MovieSpider"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/j/search_subjects?type=movie&tag=%E5%86%B7%E9%97%A8%E4%BD%B3%E7%89%87&sort=rank&page_limit=20&page_start=0"]

    def parse(self, response):
        # The start URL returns JSON; each entry in "subjects" has a detail-page URL.
        data = json.loads(response.text)
        for subject in data["subjects"]:
            yield Request(subject["url"], callback=self.parse_detail)

    def parse_detail(self, response):
        item = MoviespiderItem()
        # Single quotes outside, double quotes inside, so the XPath strings parse.
        item["title"] = response.xpath('//*[@id="content"]/h1/span[1]/text()').extract_first()
        item["score"] = response.xpath('//*[@id="interest_sectl"]/div[1]/div[2]/strong/text()').extract_first()
        # Positional index: span[10] is the release date on some pages only.
        item["year"] = response.xpath('//*[@id="info"]/span[10]/text()').extract_first()
        item["author"] = response.xpath('//*[@id="info"]/span[1]/span[2]/a/text()').extract_first()
        yield item
The main problem is the XPath for the release date. Thank you!
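
One possible direction, offered as a sketch rather than a confirmed fix: select the element by an attribute that identifies the field instead of by its position among the spans. The example assumes the detail pages mark the release date with property="v:initialReleaseDate" and the director link with rel="v:directedBy"; those attribute names should be checked against the real page source. So that it runs without Scrapy or network access, it uses the standard library's xml.etree.ElementTree on two made-up, simplified fragments (PAGE_A and PAGE_B are hypothetical, not real Douban markup).

```python
import xml.etree.ElementTree as ET

# Two made-up, simplified stand-ins for the "info" block of two detail pages.
# PAGE_B has fewer spans before the release date, so a fixed positional
# index cannot point at the release date on both fragments.
PAGE_A = """
<div id="info">
  <span><span class="pl">Director</span>: <span class="attrs"><a rel="v:directedBy">Director A</a></span></span>
  <span class="pl">Genre:</span>
  <span property="v:genre">Drama</span>
  <span class="pl">Release date:</span>
  <span property="v:initialReleaseDate">2010-01-01</span>
</div>
"""
PAGE_B = """
<div id="info">
  <span><span class="pl">Director</span>: <span class="attrs"><a rel="v:directedBy">Director B</a></span></span>
  <span class="pl">Release date:</span>
  <span property="v:initialReleaseDate">2012-05-20</span>
</div>
"""

# Assumed attribute names; verify them against the real page source.
RELEASE_DATE = ".//span[@property='v:initialReleaseDate']"
DIRECTOR = ".//a[@rel='v:directedBy']"


def extract_info(html_fragment):
    """Return release dates and directors selected by attribute, not position."""
    root = ET.fromstring(html_fragment)
    return {
        "year": [el.text for el in root.findall(RELEASE_DATE)],
        "author": [el.text for el in root.findall(DIRECTOR)],
    }


info_a = extract_info(PAGE_A)
info_b = extract_info(PAGE_B)
print(info_a)
print(info_b)
```

If the assumption about the attributes holds, the equivalent call inside parse_detail would be response.xpath('//span[@property="v:initialReleaseDate"]/text()').extract(), which returns every listed date regardless of how many spans precede it. As for syntax other than XPath, Scrapy also accepts CSS selectors, so response.css('span[property="v:initialReleaseDate"]::text').extract() expresses the same selection.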