-
How to remove unwanted HTML attributes from crawled data
For example, given the following data:
<p id="a">data
I only want to keep:
data
Is there a quick way to do this?
...
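A minimal sketch of one way to approach this using only the Python standard library. It assumes the goal is to keep the tags and text but drop every attribute, so that <p id="a">data</p> becomes <p>data</p>. The class and function names are illustrative and not part of pyspider.

```python
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-emit tags and text while discarding all tag attributes.

    Comments, doctypes and processing instructions are dropped because
    the default handlers for them do nothing.
    """

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(f"<{tag}>")  # attrs is deliberately ignored

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        self.parts.append(f"<{tag}/>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_attributes(html):
    """Return html with every attribute removed from every tag."""
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_attributes('<p id="a">data</p>'))  # prints <p>data</p>
```

If only some attributes should go, handle_starttag could filter attrs against an allow-list instead of ignoring it entirely.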
-
Pyspider cross-project send_message has no effect
First project:
self.send_message("DETAIL", {"url": href}, url="msg %s" % href)
Second project, named "DETAIL":
@every(minutes=7 * 60)
def on_start(self):
    pass

@config(priority=3)
def on_message(self, project, msg):
    self....
-
Pyspider reports an error when running detail_page
index_page displays its results after the first run, but an error is reported as soon as detail_page is run.
...
-
Pyspider pkg_resources.DistributionNotFound: wsgidav
The pyspider installation reported success, but a pkg_resources.DistributionNotFound: wsgidav error occurs at run time.
[root@localhost ~]# pip install pyspider
Collecting pyspider
Downloading https://files.pythonhosted.org/packages/df/...
-
How to get the data-bgimage attribute value from a pyspider crawl result
<a href="testtese" target="_blank" data-bgimage="testtese"></a>
The a tag fetched by the crawler has href, target, data-bgimage and other attributes, which can be obtained with this.attr.href and this.at...
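A hyphenated name such as data-bgimage cannot be written after a dot, because Python parses the hyphen as subtraction, so the attribute has to be looked up by its name as a string. The sketch below illustrates that idea with the standard library only; the class and function names are made up for the example. In PyQuery, which pyspider exposes through response.doc, the string-based form would be .attr("data-bgimage"), to the best of my knowledge.

```python
from html.parser import HTMLParser


class AnchorAttributes(HTMLParser):
    """Collect the attributes of every <a> start tag as dictionaries."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors.append(dict(attrs))


def anchor_attributes(html):
    """Return one attribute dictionary per <a> tag found in html."""
    parser = AnchorAttributes()
    parser.feed(html)
    parser.close()
    return parser.anchors


sample = '<a href="testtese" target="_blank" data-bgimage="testtese"></a>'
first = anchor_attributes(sample)[0]
print(first["href"])          # prints testtese
print(first["data-bgimage"])  # key lookup works for hyphenated names
```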
-
How can pyspider crawl pages that follow a regular URL pattern and return JSON content?
For example, there are 10 URLs: http://www.baidu.com?userid=1, http://www.baidu.com?userid=2, http://www.baidu.com?userid=3, ..., http://www.baidu.com?userid=10
The content of each page is:
{
"data": {
"1": {
"...
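A rough sketch of the two pieces involved: generating the numbered URLs and reading the "data" object out of a JSON body. The ?userid=N query layout and the helper names are assumptions, and the sample body is hypothetical because the real payload is cut off above. Inside a pyspider handler each URL would be handed to self.crawl and the parsed body read from response.json. The version below sticks to the standard library so that it runs on its own.

```python
import json

BASE_URL = "http://www.baidu.com"  # host taken from the question


def build_urls(first=1, last=10):
    """Return the URLs for userid=first .. userid=last (inclusive)."""
    return [f"{BASE_URL}?userid={i}" for i in range(first, last + 1)]


def extract_data(body):
    """Parse a JSON body and return its "data" object (empty dict if absent)."""
    payload = json.loads(body)
    return payload.get("data", {})


urls = build_urls()
print(len(urls))  # prints 10
print(urls[0])    # prints http://www.baidu.com?userid=1

# Hypothetical body; only the outer "data" / "1" keys come from the question.
sample_body = '{"data": {"1": {"name": "example"}}}'
print(extract_data(sample_body)["1"])
```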
-
After pyspider starts running, the log shows an encoding error from tornado_fetcher.py
Everything works with the default taskdb and projectdb. After switching to MySQL storage, this exception is thrown ....
-
Pyspider script debugs correctly, but scheduled runs produce no results
1. I wrote a pyspider script. It debugs and runs without errors and inserts into the database, but after the first successful automatic run it never succeeds again. Every status message says success, yet no data is inserted. The cod...
-
Problems deploying pyspider with MySQL in Docker by following the tutorial
I executed the command: docker run --name scheduler -d --link mysql:mysql --link rabbitmq:rabbitmq binux/pyspider:latest scheduler
In the end there was a problem deploying the webui. I checked the scheduler log with docker logs scheduler: the ...
-
How can the webui of pyspider running on a CentOS 7.2 server be opened through the public IP?
The config is written like this:
{
  "scheduler": {
    "xmlrpc-host": "0.0.0.0",
    "delete-time"...
-
Pyspider gets no data from a page that uses lazy loading
I am using pyspider to fetch the popular variety show column of the Mango TV page (div.mg-main ul > li.v-item). Because the page uses lazy loading, the specific information cannot be fetched. How can I make the page load this part of the content and then get the ...
-
The pyspider task restarts, but the result shows None
I am asking for advice. I don't quite understand why the error reported on the terminal is None, and I don't know what it has to do with on_result.
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-05-22 15:22:51
#...
-
Pyspider on_message method does not return a result
I use the send_message and on_message methods to handle cases where a single page yields multiple results, and I plan to override the on_result method for further processing. However, the msg returned by the on_message method is not ...
-
Calling phantomjs from pyspider to render a page fails with "no response from phantomjs", status code 599
I use pyspider to call phantomjs to render the page and get the error "no response from phantomjs", status code 599. phantomjs works from the terminal, but the error is reported as soon as pyspider calls it, and both pyspider and phantomjs search for the late...
-
How to clear duplicate entries from the pyspider scheduler queue
Clicking RUN in the console reports: [E 180704 09:49:46 scheduler:1223] 1062 (23000): Duplicate entry 'on_start' for key 'PRIMARY'
mysql.connector.errors.IntegrityError: 1062 (23000): Duplicate entry 'on_start' for key 'PRIMARY'
norm...
-
Pyspider can't handle Tmall International at all.
header requests pyspider fetch_type="js" URL > 1024
phantomjs restart fetch_error fetch_error
...
-
Does pyspider support a MongoDB cluster as taskdb?
Problem description
I am crawling answers from a site similar to Zhihu. Because there are so many answers, response.save is used to keep the results crawled so far.
Because the site cannot be crawled too fast, the task may not be completed in time,
so ...
-
What to do when pyspider keeps hanging and projects disappear on the server?
Environment: CentOS 7, pyspider.
1. It runs in the background with nohup pyspider all > pyspider.log 2>&1 & and occasionally hangs.
2. pyspider.log gives no reason for it.
3. Projects written earlier disappear after pyspider is restarted. What can be done? ...
-
What causes the pyspider processor:202 and tornado_fetcher:212 errors, and how can they be fixed?
Problem description
When there are many pyspider projects, it always gets stuck and cannot run tasks automatically.
Background of the problem and what you have tried
It is not possible to add more than one processor f...
-
Pyspider inserts only one record into the MySQL database
pyspider is started with a config file.
As a result, only one record was crawled.
...