Problem description
I am using Scrapy to crawl the first 30 pages of the Sina home site by keyword. The only way I can reach the later pages is to simulate a browser clicking "next page", so following the official documentation I put the Selenium operations in middlewares.py. The middleware gets the response and passes it to the spider for parsing, but the spider only ever parses the first page; the subsequent pages are never parsed.
Related code
Clicking "next page" does not change the URL. I need to record the page number for each piece of content, but every time only the first page gets parsed. The code is as follows:
middlewares.py
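from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By

class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        if spider.name == "sina":
            driver = webdriver.Chrome()
            driver.get(request.url)
            # The link text inside contains() is missing from the original post.
            next_page = driver.find_element(By.XPATH, '//a[contains(text(), "")]')
            next_page.click()
            # page_source is a str, so HtmlResponse needs an explicit encoding.
            return HtmlResponse(request.url, body=driver.page_source, encoding="utf-8", request=request)
        else:
            # Returning None lets Scrapy download the request normally.
            return None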
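One thing worth noting is that the middleware above clicks "next page" exactly once, whichever page the spider wanted. A minimal sketch of clicking a variable number of times follows, assuming the target page number travels with the request (for example in `request.meta["page"]`, a hypothetical key). `advance_to_page` and `next_xpath` are illustrative names, not part of Scrapy or Selenium; the function relies only on the driver having `find_element(by, value)` and `page_source`, and a real crawler would also have to wait for each page to load before clicking again.

```python
def advance_to_page(driver, page, next_xpath):
    """Click the "next page" link (page - 1) times and return the final HTML.

    `driver` is any object with Selenium's `find_element(by, value)` method
    and a `page_source` attribute; `page` is 1-based.
    """
    for _ in range(page - 1):
        # "xpath" is the locator strategy string behind selenium's By.XPATH.
        # The link is located again on every pass because the previous
        # element may be stale once the page content has changed.
        driver.find_element("xpath", next_xpath).click()
        # Assumption: the new page is fully loaded at this point. Real code
        # would add an explicit wait here before the next iteration.
    return driver.page_source
```

Inside `process_request`, the result could be used as the `body` of the `HtmlResponse`, for example `advance_to_page(driver, request.meta["page"], xpath)`.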
spider.py
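import scrapy

class Sina(scrapy.Spider):
    name = "sina"  # matched by the spider.name check in SeleniumMiddleware

    def start_requests(self):
        keyword_list = ["a", "b", "c"]
        max_page = 30
        for k in keyword_list:
            for p in range(1, max_page + 1):
                # base_url is defined elsewhere and is not shown in the post.
                # p is not part of the URL, so every iteration yields the same URL.
                url = base_url.format(k)
                yield scrapy.Request(url=url, callback=self.parse)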
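Because `base_url.format(k)` ignores `p`, the 30 requests for each keyword share one URL, and Scrapy's default duplicate filter drops a request whose URL has already been seen unless `dont_filter=True` is set. A hedged sketch of building distinguishable requests is shown below. It produces plain dicts so that it runs without Scrapy installed; `build_request_kwargs`, the `keyword` and `page` meta keys, and the example.com URL are all placeholders rather than anything from the original code.

```python
def build_request_kwargs(base_url, keywords, max_page):
    """Yield keyword arguments for one scrapy.Request per (keyword, page)."""
    for keyword in keywords:
        for page in range(1, max_page + 1):
            yield {
                "url": base_url.format(keyword),
                # Carry the page number so the middleware and parse() can read it.
                "meta": {"keyword": keyword, "page": page},
                # The URL is identical across pages, so skip the duplicate filter.
                "dont_filter": True,
            }


# Placeholder URL pattern; the real base_url is not shown in the post.
planned = list(
    build_request_kwargs("https://example.com/search?q={}", ["a", "b", "c"], 30)
)
```

In `start_requests`, each dict could be expanded as `scrapy.Request(callback=self.parse, **kwargs)`, and `parse` could then read the page number back from `response.meta["page"]`.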