Problem description
I am crawling a list of Amazon products with Scrapy and saving the data into MongoDB.
The spider crawls the first page and passes the next-page link to Request. I can extract the next-page link in the Scrapy shell,
but only the first page of data ends up in the database.
After the first page is crawled, I can see that the next-page link is found, but no data is crawled from it.
Platform, versions, and what I have tried
Linux, MongoDB.
I tried adding dont_filter=True to the Request,
but it didn't help, and the spider crawled some irrelevant pages.
Related code
spider.py
from scrapy import Request, Spider
from amazon.items import AmazonItem

class AmazonSpider(Spider):
    name = "book"
    allowed_domains = ["amazon.com"]
    start_urls = ["https://www.amazon.com/s/ref=lp_2649512011_il_ti_movies-tv?rh=n%3A2625373011%2Cn%3A%212625374011%2Cn%3A2649512011&ie=UTF8&qid=1533351160&lo=movies-tv"]

    def parse(self, response):
        result = response.xpath('//div[@id="mainResults"]/ul/li')
        # print(result)
        for it in result:
            item = AmazonItem()
            item["title"] = it.css("h2::text").extract_first()
            item["price"] = it.css(".a-link-normal.a-text-normal .a-offscreen::text").extract_first()
            yield item
        next_page = response.css('#bottomBar #pagn #pagnNextLink::attr(href)').extract_first()
        # response.urljoin takes a single relative URL, not (base, link)
        url = response.urljoin(next_page)
        yield Request(url=url, callback=self.parse, dont_filter=True)
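For what it's worth, Scrapy's response.urljoin takes a single relative URL and joins it against response.url; passing two arguments raises a TypeError before the next-page Request is ever yielded, which would match the symptom of the first page crawling fine and nothing following. The joining behaviour can be sketched with the standard-library function (the next-page href below is a hypothetical example, not taken from the real page):

```python
from urllib.parse import urljoin

# response.urljoin(next_page) is equivalent to
# urljoin(response.url, next_page) with a single relative link.
base = "https://www.amazon.com/s/ref=lp_2649512011_il_ti_movies-tv"
next_page = "/s?ie=UTF8&page=2"  # hypothetical next-page href

print(urljoin(base, next_page))
# https://www.amazon.com/s?ie=UTF8&page=2
```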
pipelines.py:
import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DB")
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # insert_one is the current pymongo API; insert is deprecated
        self.db[item.collection].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
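For the pipeline to run at all, it has to be enabled in settings.py, and the two settings it reads in from_crawler must be defined. A minimal sketch; the module path and connection values are assumptions based on the amazon package name above, not taken from the actual project:

```python
# settings.py (sketch; adjust the values to your environment)
ITEM_PIPELINES = {
    "amazon.pipelines.MongoPipeline": 300,  # assumed module path
}
MONGO_URI = "mongodb://localhost:27017"  # assumed local MongoDB
MONGO_DB = "amazon"                      # assumed database name
```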
What result do you expect? What error do you actually see?
I expect parse to be called back after the next-page URL is passed into Request, so the spider crawls the content of the next page.
On the command line I can see that after the first page is crawled, the next-page link is found, but no data is crawled from it.
When I copy the link into a browser, it opens pages 2, 3, 4, and so on,
but I don't know why the data is not being crawled.
I would appreciate any advice. Thank you very much!