-
How do I reset or empty the data of scrapy-redis's dupefilter?
after you have crawled with scrapy-redis spiders, you cannot crawl the same URLs again. If you change the name of the spider, you can crawl again, and if you change back to the original name, the deduplication mechanism kicks in again. Although you...
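A common way to reset the dedup state is to delete the spider's fingerprint key in Redis. A minimal sketch, assuming the default scrapy-redis key layout and a spider named "myspider" (both are assumptions to adapt):

```shell
# Default scrapy-redis key names are "<spidername>:dupefilter" etc.
redis-cli del "myspider:dupefilter"   # request fingerprints (the dedup set)
redis-cli del "myspider:requests"     # pending request queue
```

After the keys are gone, the spider treats every URL as unseen again.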
-
When Scrapy calls a paid proxy API, a proxy can only be fetched once every 5 seconds. Where do I need to set this?
because I originally wanted to fetch 100 IPs at a time and put them in a proxy pool, but the proxies are unstable and cannot be relied on for long, I gave up the idea of fetching 100 IPs at once.
...
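One way to respect the 5-second limit is to fetch lazily and cache: call the paid API only when at least five seconds have passed, and reuse the last proxy otherwise. A minimal stdlib sketch; the `fetch_func` callable standing in for the real API call is an assumption:

```python
import time

class RateLimitedProxyFetcher:
    """Call the paid proxy API at most once every `min_interval` seconds,
    reusing the last proxy in between."""

    def __init__(self, fetch_func, min_interval=5.0):
        self.fetch_func = fetch_func      # hypothetical: your API call
        self.min_interval = min_interval
        self._last_call = None
        self._cached = None

    def get_proxy(self):
        now = time.monotonic()
        if self._last_call is None or now - self._last_call >= self.min_interval:
            self._cached = self.fetch_func()   # hit the API
            self._last_call = now
        return self._cached                    # otherwise reuse last proxy
```

Calling `get_proxy()` from a downloader middleware's `process_request` would then never hit the API faster than allowed.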
-
A Scrapy question about customizing my own FilesPipeline to rename downloaded files
because Scrapy's own FilesPipeline names files by the hash of the download URL, I want to customize my own FilesPipeline to rename the file. So I googled for a while and found everyone saying: inherit the FilesPipeline class and then override the get_m...
-
The Scrapy ImagesPipeline class is not executed.
when Scrapy crawls images from a web page, a class inheriting ImagesPipeline is defined in the pipelines file, but the custom pipeline is not executed after running the program, and items do not pass through it
the following is the custom pipeline
clas...
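The most common cause is that the custom pipeline was never enabled in settings.py, so Scrapy simply never calls it. A settings sketch; the dotted path and the priority number are assumptions to adapt:

```python
# settings.py
ITEM_PIPELINES = {
    "myproject.pipelines.MyImagesPipeline": 300,  # hypothetical dotted path
}
# ImagesPipeline subclasses are skipped entirely unless IMAGES_STORE is set
IMAGES_STORE = "images"
```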
-
The formdata parameter of scrapy.FormRequest: if you want to pass in a dictionary whose value is a list, what do you do?
constructing a POST request using Scrapy's FormRequest object, where the formdata parameter is a dictionary with only one key-value pair and the value is a list, how do I send it as the POST content? Several methods have been tried, all of which fall shor...
-
How to clean up unwanted HTML attributes in crawled data
for example, for the following data
<p id="a">data
I just want to keep
data
is there a quick way to do this?
...
-
Running a crawler on a schedule
I set the crawler to run every 6 hours, and it does. The problem is that it runs immediately each time it starts, and then executes every 6 hours after that. How do I stop it from running immediately at startup?
@web
Oh, it's all right. Jus...
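If the scheduler is a simple loop, the fix is to sleep before the first run instead of after it. A minimal stdlib sketch; the `max_runs` parameter is only there to make the loop finite:

```python
import time

def run_every(interval, job, skip_first=True, max_runs=None):
    """Call job() every `interval` seconds.
    With skip_first=True the first call also waits one full interval."""
    runs = 0
    while max_runs is None or runs < max_runs:
        if skip_first or runs > 0:
            time.sleep(interval)   # wait before the first run too
        job()
        runs += 1
```

With `run_every(6 * 3600, start_crawl)` the crawler would not fire until six hours after the process starts.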
-
Can Scrapy only request one page at a time?
when I crawl pages with Scrapy, I find that it only requests one page at a time, but posts on the official website and Baidu say that concurrency can be controlled through CONCURRENT_REQUESTS. I tried it, but it didn't work. CONCURRENT_...
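One thing worth checking: if each next-page request is only yielded from the previous page's callback, the crawl is inherently sequential no matter what CONCURRENT_REQUESTS says; yielding all known page URLs up front is what lets concurrency kick in. Settings worth double-checking (values are illustrative):

```python
# settings.py
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 16
DOWNLOAD_DELAY = 0  # any nonzero delay serializes requests to a single domain
```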
-
Scrapy pagination only collects the last item on each page.
Pagination only collects the last item on each page. What is wrong with it? Code below ~
import sys
sys.path.append("..")
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from items import ...
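A frequent cause of "only the last item per page" is a single item object created outside the loop and mutated on every row, so every stored reference ends up pointing at the last row's data. A plain-Python illustration of the bug and the fix:

```python
def parse_rows_buggy(rows):
    # BUG: one shared dict, mutated on every iteration
    item = {}
    results = []
    for row in rows:
        item["title"] = row
        results.append(item)   # appends the same object each time
    return results

def parse_rows_fixed(rows):
    results = []
    for row in rows:
        item = {"title": row}  # fresh item per row
        results.append(item)
    return results
```

The same applies to Scrapy Item instances: create the Item inside the loop, one per yield.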
-
An error is reported when creating a new Scrapy project: No module named 'twisted.persisted'.
system: Ubuntu 16.04, Python 3.6, Twisted 15.2.1; Scrapy 1.5.0 is also installed in the virtual environment
the following message appears when creating a Scrapy project:
(pyvirSpider) root@ubuntu: myScrapy# scrapy startproject test Traceback (most recent...
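A missing `twisted.persisted` submodule usually points at a broken or badly outdated Twisted install (15.2.1 is quite old for Scrapy 1.5.0). Reinstalling and upgrading inside the virtualenv is a reasonable first step:

```shell
pip install --upgrade --force-reinstall twisted
pip install --upgrade scrapy
```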
-
A scheduled Scrapy task under CentOS cannot be executed
executing it after entering the project directory, the error shows scrapy: command not found, but the scrapy crawl test crawler command can be executed on its own; only the scheduled command produces scrapy: command not found
...
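cron runs with a minimal PATH, which is why `scrapy` resolves in an interactive shell but not in the crontab. Using absolute paths usually fixes it; both paths below are placeholders (`which scrapy` prints the real one):

```shell
# crontab entry: every 6 hours, with absolute paths so cron can find scrapy
0 */6 * * * cd /path/to/project && /usr/local/bin/scrapy crawl test >> /tmp/scrapy.log 2>&1
```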
-
The website only displays 100 pages. How do I collect the data after page 101?
the website only shows 100 pages of results. How do I collect the data after page 101 ...
-
Asking about a Python Scrapy deep-crawling problem.
after crawling the navigation pages, I want to continue crawling the URLs found there in more depth, and then write the combined results to an xlsx file
# -*- coding: utf-8 -*-
from lagou.items import LagouItem
import scrapy
class LaGouSpider(...
-
The pagination links of the website are all displayed as "#". Can I still use CrawlSpider?
http://house.njhouse.com.cn r. The website's pagination links are displayed as "#"; can CrawlSpider still be used? If it can, how should the rules for this site be written? What I wrote does not work
rules = [ Rule (LinkExtractor (allow= ( rent houselist ...
-
scrapy.Request cannot enter the callback
scrapy.Request does not enter the callback; the code is as follows:
def isIdentifyingCode(self, response):
    #
    pass

def get_identifying_code(self, headers):
    #
    #
    return scrapy.Req...
-
Multiple scrapy-redis instances cannot crawl at the same time
I open two scrapy tasks at the same time, then push a start_url into redis, but only scrapy task A runs; when A is stopped, task B begins to crawl.
the reason seems to be that requests are not saved in redis while...
-
There is no pagination information in the page source. How do I get the xpath for the next page?
There is no pagination information in the page source. How do I get the xpath? http://fwzl.hffd.gov.cn house. The links on the next page can all be found in the source code, but the information in the figure below is not, which makes me unable ...
-
Scrapy regularly reports "took longer than 180.0 seconds"
when collecting, it always gets stuck for more than 30 minutes, and then reports "took longer than 180.0 seconds".
seeking a general solution
...
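The 180.0 seconds is Scrapy's default DOWNLOAD_TIMEOUT; a general mitigation is to fail faster and let the retry middleware reissue the request instead of hanging. The values below are illustrative:

```python
# settings.py
DOWNLOAD_TIMEOUT = 60   # default is 180; time out sooner instead of hanging
RETRY_ENABLED = True
RETRY_TIMES = 2         # retry timed-out requests a couple of times
```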
-
Are there restrictions on Win10 that affect Scrapy crawlers?
a company computer, joined to a domain, running Windows 10; when there are many retries during collection, part of the data is collected and then it keeps retrying and cannot continue, for unknown reasons. It has nothing to do with proxy availab...
-
What to do when Scrapy encounters Received SIG_SETMASK?
^C2018-04-27 10:47:58 [scrapy.crawler] INFO: Received SIG_SETMASK, shutting down gracefully. Send again to force ^C2018-04-27 10:47:58 [scrapy.crawler] INFO: Received SIG_SETMASK twice, forcing unclean shutdown
it often gets stuck and occasionally prom...