How can Scrapy control the startup schedule, priority, crawl frequency and other settings for multiple target websites, and how can it display each site's crawl completion status and completion time?

I want to collect some data from the web, and Scrapy is the framework most often recommended online. I have read the official documentation and some articles, but a few points still confuse me, so I want to sort out my understanding. I am a beginner and some of this is only my own idea; if anything is wrong, please point it out so we can discuss it.
The Scrapy framework feels like the shell of a crawler solution. Out of the box it only provides a simple way of using it, and you handle more situations by adding middleware or changing the configuration. I think this is a good way to cope with the complex and growing crawler field.
A crawler mainly has to solve three things:
1. Writing the Spider, which traverses all pages and data based on an analysis of the website structure and pages;
2. Writing the Item Pipeline, which cleans, de-duplicates, restructures and saves the crawled data;
3. Setting the startup schedule, priority, crawl frequency, etc., and displaying each site's crawl completion status and completion time.

Spider writing process:
Give the initial URL. After the downloader has fetched the page, analyze its structure. For example, when crawling the list on the home page: parse the pagination data to build the URLs of the other list pages and send requests for them, save the list data to an Item, then parse out and request the detail page URLs, and save the detail page data to the Item once each detail page has been fetched.
(My understanding: should the list page and the detail page both be implemented in one Spider? That is, all crawling for one website goes into one Spider, with the detail page handled by a separate function and the list-page function simply called repeatedly.)

Item Pipeline (data storage) writing process:
The Items collected by the Spider are saved to the database. Before saving, the data is cleaned, for example by de-duplicating and standardizing the data format. Saving may mean storing all collected data, storing only newly collected (incremental) data, or updating existing records.
(Does an Item only define the structure of the saved data? Is the Pipeline the way the data is stored, and should the cleaning and saving methods go in the Pipeline? How do Item and Pipeline relate?)
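
As a rough illustration of how Item and Pipeline relate: the Item is just the data record (here a plain dict), and the Pipeline is a class whose `process_item` method receives every Item the Spider yields. The sketch below cleans, de-duplicates and saves in one pipeline. The table layout, the `url`/`title` field names and the database path are assumptions for illustration only.

```python
import sqlite3

try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch also runs without Scrapy installed
    class DropItem(Exception):
        pass


class SQLitePipeline:
    """Clean each item, drop duplicates, and save the rest to SQLite."""

    def __init__(self, db_path="items.db"):  # hypothetical default path
        self.db_path = db_path

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
        )

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Cleaning: collapse runs of whitespace in the title.
        title = " ".join((item.get("title") or "").split())
        try:
            self.conn.execute(
                "INSERT INTO pages (url, title) VALUES (?, ?)", (item["url"], title)
            )
        except sqlite3.IntegrityError:
            # De-duplication: this URL is already stored, so discard the item.
            raise DropItem(f"duplicate url: {item['url']}")
        item["title"] = title
        return item
```

A pipeline like this is switched on through the `ITEM_PIPELINES` setting, where the key is the import path of the class (which depends on your own project layout) and the value is its order number.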

Setting the startup schedule, priority and crawl frequency, and displaying completion status and completion time:
Can a Spider only be started through scrapy command-line arguments? Does that mean I can only write shell or Python programs and run them through commands? Where are the startup schedule, priority and crawl frequency set, and how do I get the completion status and completion time for a website?
For example, how do I lower the priority of a request that is retried after a failure? From reading the documentation, I understand that Scrapy does not officially provide a web-based management platform. What does ScrapyD do? I learned that SpiderKeeper is a web system written in Python for managing Scrapy tasks. Are there any similar projects written in PHP?
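
For the frequency and retry-priority part of the question, a partial sketch: these are handled by Scrapy settings rather than by a separate platform. The setting names below are real Scrapy settings, but the values are only illustrative assumptions, and `crawl_command` is a hypothetical helper that builds a `scrapy crawl` command line (for a cron job or `subprocess.run`) using the `-s NAME=VALUE` option. Scrapy can also be started from inside a Python script through `scrapy.crawler.CrawlerProcess`.

```python
# Real Scrapy setting names; the values are illustrative, not recommendations.
CRAWL_SETTINGS = {
    "DOWNLOAD_DELAY": 2.0,                # seconds between requests to the same site
    "CONCURRENT_REQUESTS_PER_DOMAIN": 4,  # cap on parallel requests per domain
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server response times
    "RETRY_TIMES": 3,                     # retries per failed request, after the first try
    "RETRY_PRIORITY_ADJUST": -1,          # negative: retried requests get lower priority
}


def crawl_command(spider_name, settings):
    """Build the argv for `scrapy crawl`, overriding settings with -s NAME=VALUE."""
    cmd = ["scrapy", "crawl", spider_name]
    for key, value in settings.items():
        cmd += ["-s", f"{key}={value}"]
    return cmd
```

The same dict could instead be assigned to a spider's `custom_settings` attribute so each website gets its own frequency. The priority of an individual request is set with the `priority` argument of `scrapy.Request` (higher values are scheduled earlier). As far as I know, the startup schedule itself is not something Scrapy provides, so it has to come from outside, for example from cron.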

I have written a lot. It would be great if I could find open source projects to refer to. Experts who are familiar with crawlers are welcome to help answer.

Mar.14,2021

Combine the official documentation with the source code and you should be able to find the answer.


Questions about how to operate Scrapy can be answered by working through the examples linked from the documentation on the official website. Earlier, I wanted to string together the entire Scrapy crawl process to understand how the framework works, mainly focusing on "setting the startup schedule, priority, crawl frequency and other settings of the target websites, and displaying each site's crawl completion status and completion time". Later, while reading articles online, I found that the official site offers ScrapyD as a solution. Once the service is running, it exposes HTTP interfaces for scheduling tasks and monitoring task status. You can build your own UI that calls the interfaces provided by ScrapyD. In addition, there are several ready-made frameworks online for reference: SpiderKeeper, Tiktok, django-dynamic-scraper.
reference website: https://www.cnblogs.com/zhong.
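
To show what calling the ScrapyD interfaces can look like, here is a minimal sketch using only the Python standard library. It relies on ScrapyD's `schedule.json` and `listjobs.json` endpoints; the host, project name and spider name are assumptions, and the helper function names are my own, not part of ScrapyD.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SCRAPYD_URL = "http://localhost:6800"  # 6800 is ScrapyD's default port; host is assumed


def schedule_request(project, spider, priority=0.0):
    """Build the POST request that asks ScrapyD to queue one spider run."""
    data = urlencode({"project": project, "spider": spider, "priority": priority})
    return Request(f"{SCRAPYD_URL}/schedule.json", data=data.encode(), method="POST")


def listjobs_url(project):
    """Build the GET URL that lists pending, running and finished jobs."""
    return f"{SCRAPYD_URL}/listjobs.json?" + urlencode({"project": project})


def summarize_jobs(payload):
    """Flatten a listjobs.json response into (spider, state, end_time) rows."""
    rows = []
    for state in ("pending", "running", "finished"):
        for job in payload.get(state, []):
            # Only finished jobs carry an end_time; others yield None.
            rows.append((job["spider"], state, job.get("end_time")))
    return rows


def call(request_or_url):
    """Send the request to ScrapyD and decode the JSON reply."""
    with urlopen(request_or_url, timeout=10) as resp:
        return json.load(resp)
```

With a ScrapyD server running, `call(schedule_request("myproject", "example_site"))` would queue a run, and `summarize_jobs(call(listjobs_url("myproject")))` would give the status and completion time of each job. A cron entry that runs the first call on a timer is one simple way to get a startup schedule.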

If you want to learn more about Scrapy, first get a general understanding of which problems Scrapy solves by itself, which ones need help from other libraries, and what work you have to do yourself.
