Scrapy: handling different item types with different pipelines

Problem description

How do I route different items to different pipelines?

Background and what I have tried

A Scrapy project contains several spiders, and each spider yields its own item type. The answers I found online all check the type of the received item and then run different code inside one pipeline. What I want instead is to call the corresponding pipeline after the check, so that, for example, item A is handed to pipeline A.

Related code

import json

from items import AspiderItem, BspiderItem, CspiderItem


class myspiderPipeline(object):
    def __init__(self):
        pass

    def process_item(self, item, spider):
        # Branch on the item type; each branch must return the item.
        if isinstance(item, AspiderItem):
            return item  # placeholder: A-specific handling would go here
        elif isinstance(item, BspiderItem):
            return item
        elif isinstance(item, CspiderItem):
            print(item)
            return item


class AspiderPipeline(myspiderPipeline):
    def __init__(self):
        # Text mode with an explicit encoding, because json.dumps returns str.
        self.file = open("myadata.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Write one JSON object per line.
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(content)
        return item

    def close_spider(self, spider):
        self.file.close()


class BspiderPipeline(myspiderPipeline):
    pass

What result do you expect? What error do you actually see?

I want to know how to call the corresponding pipeline once the item type has been determined. Should I instantiate the corresponding class and call its process_item method myself? If so, will that class's other methods, such as close_spider, still be executed automatically?

Scrapy

Oct.10,2021

My understanding:

The key feature of a pipeline is that one item can be processed by several pipelines, step by step, in the order configured in settings.py.

At each step, a pipeline either modifies the item (for example, duplicate checking or repairing bad data) or acts on the item's data (for example, one pipeline writes the item to a log, another writes it to a database, and another sends it over HTTP).
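As a rough sketch of that configuration, a settings.py fragment could look like the one below. The module path myproject.pipelines and the three class names are hypothetical and only illustrate the ordering: Scrapy runs the enabled pipelines in ascending order of the integer value.

```python
# settings.py (fragment). The module path and class names are hypothetical.
# Every item passes through the enabled pipelines in ascending order of value.
ITEM_PIPELINES = {
    "myproject.pipelines.DedupePipeline": 100,      # runs first: drop duplicates
    "myproject.pipelines.CleanupPipeline": 200,     # then repair bad fields
    "myproject.pipelines.JsonWriterPipeline": 300,  # finally write to disk
}
```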

If an item needs only one operation, it is enough to use isinstance in a single pipeline to determine the item type and call the matching member method.
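One possible way to organize that single pipeline is a small dispatch table that maps each item class to a member method. The sketch below is only an illustration: the two item classes are plain dict subclasses standing in for the asker's items, and DispatchPipeline, handle_a, handle_b and the path parameter are names assumed here, not anything taken from the question.

```python
import json


class AspiderItem(dict):
    """Hypothetical stand-in for the asker's AspiderItem."""


class BspiderItem(dict):
    """Hypothetical stand-in for the asker's BspiderItem."""


class DispatchPipeline:
    """Routes each item to a member method chosen by the item's class."""

    def __init__(self, path="myadata.json"):
        # `path` is an assumption added so the output file can be changed.
        self.path = path
        self.file = None
        # Map item class -> bound handler method.
        self.handlers = {
            AspiderItem: self.handle_a,
            BspiderItem: self.handle_b,
        }

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Exact-type lookup; unknown item types fall through untouched.
        handler = self.handlers.get(type(item))
        if handler is not None:
            handler(item)
        return item  # always return the item so later pipelines still see it

    def handle_a(self, item):
        # A items are written as one JSON object per line.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")

    def handle_b(self, item):
        pass  # nothing to do for B items in this sketch
```

The lookup uses the exact class, so subclasses of an item type would need their own entry; an isinstance chain avoids that at the cost of a longer process_item.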

The code in the question can be rewritten as:

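import json

from items import AspiderItem, BspiderItem, CspiderItem


class myspiderPipeline(object):
    def __init__(self):
        # Text mode with an explicit encoding, because json.dumps returns str.
        self.file = open('myadata.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        if isinstance(item, AspiderItem):
            # Only A items are written to the file, one JSON object per line.
            content = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.file.write(content)
            return item
        elif isinstance(item, BspiderItem):
            return item
        elif isinstance(item, CspiderItem):
            print(item)
            return item

    def close_spider(self, spider):
        self.file.close()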
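If separate pipeline classes are preferred, one common arrangement is to enable every pipeline in ITEM_PIPELINES and let each one ignore items that are not its own. Scrapy calls open_spider and close_spider on the pipeline objects it creates from ITEM_PIPELINES; an instance you construct yourself inside another pipeline does not receive those calls unless you forward them. The sketch below uses plain dict subclasses as stand-ins for the asker's items, and the path parameter and the seen counter are assumptions added for illustration.

```python
import json


class AspiderItem(dict):
    """Hypothetical stand-in for the asker's AspiderItem."""


class BspiderItem(dict):
    """Hypothetical stand-in for the asker's BspiderItem."""


class AspiderPipeline:
    """Handles only AspiderItem; every other item passes through untouched."""

    def __init__(self, path="myadata.json"):
        # `path` is an assumption added so the output file can be changed.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        if isinstance(item, AspiderItem):
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to the next enabled pipeline


class BspiderPipeline:
    """Handles only BspiderItem; here it merely counts them."""

    def __init__(self):
        self.seen = 0

    def process_item(self, item, spider):
        if isinstance(item, BspiderItem):
            self.seen += 1
        return item
```

Both classes would then be listed in ITEM_PIPELINES. Every item still travels through both pipelines, but each pipeline acts only on its own type.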
