crawler file:

import scrapy
from xtzx.items import XtzxItem

class LessonSpider(scrapy.Spider):
    name = "lesson"
    allowed_domains = ["xuetangx.com"]
    start_urls = ["http://www.xuetangx.com/courses?credential=0&page_type=0&cid=118&process=0&org=0&course_mode=0&page=2"]

    def parse(self, response):
        item = XtzxItem()
        item["title"] = response.xpath('//div[@class="fl list_inner_right cf"]/div[@class="coursename"]/a/h2[@class="coursetitle"]/text()').extract()
        # item["school"] = response.xpath('//div[@class="fl name"]/ul/li/span/text()').extract()
        # item["stu"] = response.xpath('//div[@class="fl name"]/li/span[@class="ri-tag fl"]/text()').extract()
        yield item
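One thing to double-check in the selectors above: nesting unescaped double quotes inside a double-quoted Python string is a SyntaxError, so the XPath expressions need single outer quotes (or escaped inner quotes). A minimal sketch of the two equivalent forms, using a shortened version of the title XPath:

```python
# A double-quoted Python string cannot contain unescaped double quotes,
# so wrap the XPath in single quotes...
xpath_title = '//div[@class="coursename"]/a/h2[@class="coursetitle"]/text()'

# ...or escape the inner quotes; both produce the same string.
xpath_title_escaped = "//div[@class=\"coursename\"]/a/h2[@class=\"coursetitle\"]/text()"

print(xpath_title == xpath_title_escaped)  # → True
```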
---
pipelines:

class XtzxPipeline(object):
    def process_item(self, item, spider):
        print(item["title"][0])
        # print(item["school"][0])
        # print(item["stu"][0])
        return item
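Note that the log below shows "Enabled item pipelines: []", so this pipeline never runs and its print() calls produce nothing — the item output you see comes from Scrapy's own DEBUG log, which also explains why --nolog prints nothing at all. A minimal sketch of enabling the pipeline, assuming the project package is named xtzx as in the imports above:

```python
# settings.py (config fragment) — register the pipeline so process_item()
# actually runs; the number (0-1000) is its order among pipelines.
ITEM_PIPELINES = {
    "xtzx.pipelines.XtzxPipeline": 300,
}
```

After this change the log should report the pipeline under "Enabled item pipelines", and the print() output will appear even with --nolog.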
---
2018-04-28 14:43:09 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: xtzx)
2018-04-28 14:43:09 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.5.4 (v3.5.4:3f56838, Aug 8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.16299-SP0
2018-04-28 14:43:09 [scrapy.crawler] INFO: Overridden settings: {"NEWSPIDER_MODULE": "xtzx.spiders", "SPIDER_MODULES": ["xtzx.spiders"], "BOT_NAME": "xtzx"}
2018-04-28 14:43:09 [scrapy.middleware] INFO: Enabled extensions:
[" scrapy.extensions.telnet.TelnetConsole",
"scrapy.extensions.corestats.CoreStats",
" scrapy.extensions.logstats.LogStats"]
2018-04-28 14:43:09 [scrapy.middleware] INFO: Enabled downloadermiddlewares:
["scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware",
" scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware",
"scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware",
" scrapy.downloadermiddlewares.useragent.UserAgentMiddleware",
"scrapy.downloadermiddlewares.retry.RetryMiddleware",
" scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware",
"scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware",
" scrapy.downloadermiddlewares.redirect.RedirectMiddleware",
"scrapy.downloadermiddlewares.cookies.CookiesMiddleware",
" scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware",
"scrapy.downloadermiddlewares.stats.DownloaderStats"]
2018-04-28 14:43:09 [scrapy.middleware] INFO: Enabled spidermiddlewares:
[" scrapy.spidermiddlewares.httperror.HttpErrorMiddleware",
"scrapy.spidermiddlewares.offsite.OffsiteMiddleware",
" scrapy.spidermiddlewares.referer.RefererMiddleware",
"scrapy.spidermiddlewares.urllength.UrlLengthMiddleware",
" scrapy.spidermiddlewares.depth.DepthMiddleware"]
2018-04-28 14:43:09 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-28 14:43:09 [scrapy.core.engine] INFO: Spider opened
2018-04-28 14:43:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), Scraped 0 items (at 0 items/min)
2018-04-28 14:43:09 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-28 14:43:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xuetangx.com/cours.;page_type=0&cid=118&process=0&org=0&course_mode=0&page=2> (referer: None)
2018-04-28 14:43:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.xuetangx.com/cours.;page_type=0&cid=118&process=0&org=0&course_mode=0&page=2>
That is the page being crawled, and what I got was:

{"title": ["Accounting principles (Spring 2018)",
 "",
 "2018",
 "()",
 "2018",
 "102:",
 "102:2018",
 " 2018",
 "",
 "2018"]}
---
2018-04-28 14:43:10 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-28 14:43:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{"downloader/request_bytes": 292,
" downloader/request_count": 1,
"downloader/request_method_count/GET": 1,
" downloader/response_bytes": 27101,
"downloader/response_count": 1,
" downloader/response_status_count/200": 1,
"finish_reason":" finished",
"finish_time": datetime.datetime (2018, 4, 28, 6, 43, 10, 470916),
" item_scraped_count": 1,
"log_count/DEBUG": 3,
" log_count/INFO": 7,
"response_received_count": 1,
" scheduler/dequeued": 1,
"scheduler/dequeued/memory": 1,
" scheduler/enqueued": 1,
"scheduler/enqueued/memory": 1,
start_time": datetime.datetime (2018, 4, 28, 6, 43, 9, 860924)}
2018-04-28 14:43:10 [scrapy.core.engine] INFO: Spider closed (finished)
Running scrapy crawl lesson in cmd produces the results above,
but running scrapy crawl lesson --nolog gives no output at all.
Besides, title should be a list, right?
In addition, I asked another question just now and a teacher answered it, but I haven't responded to it yet. Is there any time limit on accepting the answer?
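(On the list question: yes — text() can match many nodes, so Scrapy's .extract() always returns a list; index it with item["title"][0], or use .extract_first() for a single value. A stdlib-only sketch of the same idea, using a simplified, made-up page fragment rather than the real xuetangx markup:)

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the course-list page.
html = """<div>
  <h2 class="coursetitle">Accounting principles (Spring 2018)</h2>
  <h2 class="coursetitle">Data structures (2018)</h2>
</div>"""

root = ET.fromstring(html)
# Selecting by class matches every such node on the page, hence a list:
titles = [h2.text for h2 in root.findall(".//h2[@class='coursetitle']")]

print(titles)     # → ['Accounting principles (Spring 2018)', 'Data structures (2018)']
print(titles[0])  # → Accounting principles (Spring 2018)
```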