When the crawler starts, it is redirected to an error page. What should I do?
http://www.gzcc.gov.cn/data/l.
The crawler's error log is:
Open the web page in a browser and look at the request headers it sends, then configure the same headers in your crawler.
This kind of anti-crawling measure is easy to get around: if you just want the page source, disguising your request headers is enough.
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-CN,zh;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Cookie:ASP.NET_SessionId=bqtygl55xovvgp45ajwmuj45
DNT:1
Host:www.gzcc.gov.cn
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0
Of course, in practice we don't need to fill in all of these parameters. Take the requests library as an example (Python):

import requests

# disguising the User-Agent is usually enough to get the normal page;
# url is the page you want to fetch
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'}
resp = requests.get(url=url, headers=header)

That is all it takes to disguise the request headers.
After crawling once with a scrapy-redis spider, I cannot crawl again. If I change the spider's name I can crawl again, and if I change back to the original name the deduplication mechanism kicks in again. Although you...
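A common workaround (a sketch, assuming scrapy-redis's default key layout, where request fingerprints live in a Redis set named "<spider_name>:dupefilter"; 'myspider' and the connection parameters are placeholders) is to delete that set before re-crawling:

import redis

# point this at the Redis instance scrapy-redis uses (host/port/db are assumptions)
r = redis.StrictRedis(host='localhost', port=6379, db=0)

# scrapy-redis keeps request fingerprints in "<spider_name>:dupefilter" by default;
# deleting the set resets deduplication without renaming the spider
r.delete('myspider:dupefilter')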
Because I originally wanted to fetch 100 proxy IPs at a time and put them in the proxy pool, but the proxies are unstable and cannot provide support for long, I gave up the idea of fetching 100 IPs at once. 5 request ...
Because Scrapy's own FilesPipeline names downloads by the hash code of the URL, I want to customize my own FilesPipeline to rename the files. After googling for a while, I found everyone saying: inherit the FilesPipeline class and then override get_m...
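For the renaming itself, overriding file_path is usually enough. A minimal sketch (the class name and path scheme here are illustrative, not the asker's code):

import os
from scrapy.pipelines.files import FilesPipeline

class RenamingFilesPipeline(FilesPipeline):
    # file_path decides where a downloaded file is stored; the stock
    # implementation returns 'full/<sha1-of-url>'. Returning the URL's
    # basename keeps the original filename instead.
    # (Signature as of Scrapy 1.x; newer versions also pass item=.)
    def file_path(self, request, response=None, info=None):
        return 'files/' + os.path.basename(request.url)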
When scrapy crawls images from a web page, a class inheriting ImagesPipeline is customized in the pipelines file, but the custom pipeline is never executed when the program runs, and items do not pass through it. The following is the custom pipeline clas...
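The most common cause (a guess, since the class itself is truncated above) is that the custom pipeline was never registered in settings, so Scrapy keeps using the stock one:

# settings.py -- the dotted path is a placeholder for your own class
ITEM_PIPELINES = {
    'myproject.pipelines.MyImagesPipeline': 300,
}
IMAGES_STORE = 'images'  # ImagesPipeline disables itself if no store path is set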
Constructing a POST request with scrapy's FormRequest object, where the formdata parameter is a dictionary: the dictionary has only one key/value pair, and the value is a list. How do I send it as the POST content? I have tried several methods, all of which fall shor...
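Two hedged sketches, depending on what the server expects (the spider, url, and the 'ids' field are placeholders): FormRequest urlencodes a list value as repeated key=value pairs, and if the endpoint wants a JSON array instead, you can build the body yourself:

import json
import scrapy

class PostDemoSpider(scrapy.Spider):
    name = 'post_demo'  # hypothetical spider

    def start_requests(self):
        url = 'http://example.com/api'  # placeholder endpoint
        # 1) repeated form fields: ids=1&ids=2&ids=3
        yield scrapy.FormRequest(url, formdata={'ids': ['1', '2', '3']},
                                 callback=self.parse)
        # 2) a JSON body, for endpoints that expect application/json
        yield scrapy.Request(url, method='POST',
                             body=json.dumps({'ids': [1, 2, 3]}),
                             headers={'Content-Type': 'application/json'},
                             callback=self.parse)

    def parse(self, response):
        pass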
For example, given the following data: <p id="a">data. I just want to keep the text "data". Is there a quick way to do this? ...
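A quick sketch with Scrapy's Selector (assuming the snippet is well-formed HTML once closed):

from scrapy import Selector

sel = Selector(text='<p id="a">data</p>')
# text() selects just the text node, dropping the surrounding tag
print(sel.xpath('//p[@id="a"]/text()').extract_first())  # prints: data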
I set the crawler to run every 6 hours, and it does. The problem is that it also runs immediately at startup, and only then executes every 6 hours. How do I stop it from running at startup? ! @web Oh, it's all right. Jus...
When I crawl pages with scrapy, I find that only one page is requested at a time, but posts on the official site and Baidu say that concurrency can be controlled through CONCURRENT_REQUESTS. I tried it, but it didn't work. CONCURRENT_...
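Two things worth checking (a sketch of the relevant settings; the values are illustrative): the per-domain cap and DOWNLOAD_DELAY also throttle requests, and concurrency can never exceed the number of requests the spider has actually scheduled, so yielding only one "next page" request per parse is effectively serial no matter what the settings say:

# settings.py -- values are illustrative
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # the per-domain cap applies as well
DOWNLOAD_DELAY = 0                   # any nonzero delay serializes same-site requests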
Pagination only collects the last piece of data on each page. What is wrong with it? The code is as follows: import sys sys.path.append('..') from scrapy.linkextractors.sgml import SgmlLinkExtractor from scrapy.spiders import CrawlSpider, Rule from items import ...
System: Ubuntu 16.04, python3.6, twisted-15.2.1, Scrapy 1.5.0, installed in a virtual environment. The following message appears when creating a Scrapy project: (pyvirSpider) root@ubuntu:myScrapy# scrapy startproject test Traceback (most recent...
Executing after entering the project, the error shows scrapy: command not found, but scrapy by itself can be run, and the scrapy crawl test crawler command can also be executed alone; only the scheduled command produces scrapy: command not found ...
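"command not found" only in the scheduled job usually means cron's minimal PATH does not include the directory scrapy is installed in; using absolute paths is the usual fix (both paths below are assumptions, check yours with `which scrapy`):

# crontab entry -- binary and project paths are placeholders
0 */6 * * * cd /home/user/myScrapy && /usr/local/bin/scrapy crawl test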
The website only shows 100 pages. How do I collect the data after page 101? ...
After crawling the navigation pages, I follow the URLs to crawl deeper, and then write the combined results to xlsx. # -*- coding: utf-8 -*- from lagou.items import LagouItem import scrapy class LaGouSpider (...
http: house.njhouse.com.cn r. The site's pagination links are displayed as "#". Can I still use CrawlSpider, and if so, how should the rules for this site be written? I wrote these rules, which don't work: rules = [ Rule (LinkExtractor (allow= ( rent houselist ...
scrapy.Request cannot enter the callback. The code is as follows: def isIdentifyingCode(self, response): # pass def get_identifying_code(self, headers): # # return scrapy.Req...
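A frequent cause (a guess, since the code is truncated) is that the request is silently dropped by the duplicate filter or by allowed_domains, or that the Request returned from a helper method is never yielded back to the engine from a callback. Passing dont_filter=True rules out the first; a sketch (self.code_url is a placeholder):

# inside the spider; names mirror the snippet above
def get_identifying_code(self, headers):
    return scrapy.Request(self.code_url,
                          headers=headers,
                          callback=self.isIdentifyingCode,
                          dont_filter=True)  # bypass the duplicate-request filter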
I open two scrapy tasks at the same time, then push a start_url into redis, but only one scrapy task, A, is running; when A is stopped, task B begins to crawl. The reason seems to be that requests are not saved in redis while...
There is no paging information in the source code of the page; how do I get it with XPath? http: fwzl.hffd.gov.cn house. Everything on the next page can be found in the source code, but the information in the figure below is not available, which makes me unable ...
When collecting, the crawler always gets stuck for more than 30 minutes and then reports "took longer than 180.0 seconds". I am looking for a general solution ...
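"took longer than 180.0 seconds" is the downloader hitting Scrapy's default DOWNLOAD_TIMEOUT; lowering it and bounding retries makes the crawl fail fast instead of hanging (values are illustrative):

# settings.py -- values are illustrative
DOWNLOAD_TIMEOUT = 30  # default is 180 s; fail fast on stalled connections
RETRY_ENABLED = True
RETRY_TIMES = 2        # bound the retries instead of stalling for half an hour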
Company computer, domain-joined, Win10 system: when there are many retries during collection, part of the data keeps being retried and collection cannot continue; the reason is unknown. It has nothing to do with proxy availab...
^C 2018-04-27 10:47:58 [scrapy.crawler] INFO: Received SIG_SETMASK, shutting down gracefully. Send again to force ^C 2018-04-27 10:47:58 [scrapy.crawler] INFO: Received SIG_SETMASK twice, forcing unclean shutdown. The crawler often gets stuck and occasionally prom...