The crawler cannot download images after switching to a custom Scrapy Pipeline class that inherits from ImagesPipeline
I am using a Python 3.7 environment and the Scrapy crawler framework to crawl and download images from a web page. Downloading works normally with the built-in ImagesPipeline, but after switching to a custom Pipeline class the image link addresses are only printed on the command line and no images are saved to the local disk.
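For reference, a minimal sketch of the built-in setup that worked (the exact values here are my assumption; only these two settings are needed to enable the stock pipeline):

settings.py (built-in pipeline, for comparison)

ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
IMAGES_STORE = r"D:\Pic"  # directory where downloaded images are stored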
Related code:
items.py

import scrapy

class ImageItem(scrapy.Item):
    # image_names = scrapy.Field()
    # fold = scrapy.Field()
    # image_paths = scrapy.Field()
    image_urls = scrapy.Field()
pipelines.py

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class PicPipeline(ImagesPipeline):
    def process_item(self, item, spider):
        return item

    def get_media_requests(self, item, info):
        # request every collected image URL for download
        for image_url in item["image_urls"]:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # collect the storage paths of successfully downloaded images
        image_paths = [x["path"] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item["image_paths"] = image_paths
        return item
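For comparison, the example subclass in the Scrapy documentation overrides only get_media_requests and item_completed and does not define process_item at all; a sketch of that documented pattern:

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):
    # no process_item override here: the inherited
    # MediaPipeline.process_item is what schedules the downloads
    def get_media_requests(self, item, info):
        for image_url in item["image_urls"]:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x["path"] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item["image_paths"] = image_paths
        return item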
settings.py

ITEM_PIPELINES = {
    "Pic.pipelines.PicPipeline": 300,
}
IMAGE_STORES = "D:\Pic"
IMAGES_URLS_FIELD = "image_urls"
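As an aside, the setting name documented by Scrapy for the image store directory is IMAGES_STORE, and Windows paths are safer written as raw strings:

IMAGES_STORE = r"D:\Pic"  # documented spelling; raw string avoids backslash escape issues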
PicSpider.py

import scrapy
from Pic.items import ImageItem

class PicSpider(scrapy.Spider):
    name = "picspider"
    allowed_domains = ["guokr.com"]
    start_urls = ["https://www.guokr.com/"]

    def parse(self, response):
        # grab the src of every <img> tag on the page
        images = response.xpath("//img/@src").extract()
        item = ImageItem()
        item["image_urls"] = images
        yield item
I just want to write a small demo to practice crawling images. The built-in ImagesPipeline downloads them normally, but with the custom Pipeline nothing is downloaded; only the image link addresses are printed on the command line. Please advise.
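For context on what a successful run should produce: when ImagesPipeline works, it saves each image under the store directory in a full/ subfolder, with the file named by the SHA-1 hash of its URL, roughly like this (the hash placeholder is illustrative):

D:\Pic\
    full\
        <sha1-of-image-url>.jpg

Nothing like this appears after running the custom pipeline.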
The command-line output is as follows:
2019-02-19 12:11:06 [scrapy.core.engine] INFO: Spider opened
2019-02-19 12:11:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-02-19 12:11:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-02-19 12:11:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.guokr.com/robots.txt> (referer: None)
2019-02-19 12:11:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.guokr.com/> (referer: None)
2019-02-19 12:11:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.guokr.com/>
{"image_urls": ["https://3-im.guokr.com/vXVhDq_6nindVo2LqIloosK-2bHrkYpU8DEXP75DpnZKAQAA6wAAAEpQ.jpg",
"https://2-im.guokr.com/hD7RoVC8IpQGnc2humofXMGyex-iSZH1VDaWLq2VWCE2BAAA7gMAAEpQ.jpg?imageView2/1/w/330/h/235",
"https://2-im.guokr.com/AU-Q8pTYY_OffTqWyKfXTC5NV0RmarK_QJ9m6A6_7qhKAQAA6wAAAEpQ.jpg",
"https://1-im.guokr.com/IIlEodManGB8jos3eP7KcrMhu3l8dtG6F5nrJczcrTiAAwAAUwIAAEpQ.jpg?imageView2/1/w/330/h/235",
"https://1-im.guokr.com/klfXUFzwXV_jz42yk497oZ-RkLAJEc03spAKMg9AeIw4BAAADQMAAEpQ.jpg?imageView2/1/w/330/h/235",
"https://1-im.guokr.com/BZ7R7bpcrwjOyFJ5kajc0tVHlOF8BUyEs3IpWB0l6Q4sAgAA2AEAAEpQ.jpg?imageView2/1/w/135/h/90",
"https://1-im.guokr.com/1CJgQkib1ePSCpLBARUhOyMdf6THL2BGrkDj6WDc5eiGAQAABAEAAEpQ.jpg?imageView2/1/w/135/h/90",
"https://1-im.guokr.com/4prMeIXxsaF2y6OTfpCB2IiI7udvwK8f_lsTcqbFcaeHAAAAWgAAAEpQ.jpg",
"https://1-im.guokr.com/WPrAHjwbKwXNYqiYZgkaYEyh9i2R8zm9noog_AxfpHiaAgAAvAEAAEpQ.jpg?imageView2/1/w/135/h/90",
"https://2-im.guokr.com/TNpsKxaaNGuIDTJWTpy2P5wfji_oG66rHUWGa8L7zFhKAQAAtQAAAFBO.png?imageView2/1/w/135/h/90",
"https://2-im.guokr.com/gLbC7ix6NWlx3bz6ihFyOxsl_fWqwtB554NswEOmACFKAQAA8AAAAEpQ.jpg?imageView2/1/w/135/h/90",
"https://2-im.guokr.com/Rx9MyfI6hndQBTyoGWvfOyb469BZ7ruf0w0k7V0aJ1pKAQAA6wAAAEpQ.jpg?imageView2/1/w/135/h/90",
"https://2-im.guokr.com/-OmYOzUa0Nhm9vKimCFn2c2ZR9pHmgxqMiMxijD5KwkLAQAAngAAAFBO.png?imageView2/1/w/135/h/90",
"https://1-im.guokr.com/mysULQspmaLPEMu-MQFZGHwaccTPPs9msjtLrYoDtGcsAQAAagAAAEpQ.jpg",
"https://2-im.guokr.com/fSvqlLJ6wcRv8cCCc5Ehm5pgqZWg7TyiLZdEba34NTKgAAAAoAAAAEpQ.jpg?imageView2/1/w/48/h/48",
"https://3-im.guokr.com/F9IifzSeB9OoKKIP-_2i3SnWHnUceIpmGyOMuwgRvgGgAAAAoAAAAEpQ.jpg?imageView2/1/w/48/h/48",
"https://sslstatic.guokr.com/skin/imgs/dimensions-code.jpg?v=unknown",
"https://3-im.guokr.com/0Al5wQUv5IAuo87evbERy190Y83ENmP9OpIs8Stm2lMUAAAAFAAAAFBO.png"]}
2019-02-19 12:11:06 [scrapy.core.engine] INFO: Closing spider (finished)
2019-02-19 12:11:06 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{"downloader/request_bytes": 434,
"downloader/request_count": 2,
"downloader/request_method_count/GET": 2,
"downloader/response_bytes": 12316,
"downloader/response_count": 2,
"downloader/response_status_count/200": 2,
"finish_reason": "finished",
"finish_time": datetime.datetime(2019, 2, 19, 4, 11, 6, 755334),
"item_scraped_count": 1,
"log_count/DEBUG": 3,
"log_count/INFO": 9,
"response_received_count": 2,
"robotstxt/request_count": 1,
"robotstxt/response_count": 1,
"robotstxt/response_status_count/200": 1,
"scheduler/dequeued": 1,
"scheduler/dequeued/memory": 1,
"scheduler/enqueued": 1,
"scheduler/enqueued/memory": 1,
"start_time": datetime.datetime(2019, 2, 19, 4, 11, 6, 142378)}
2019-02-19 12:11:06 [scrapy.core.engine] INFO: Spider closed (finished)