I want to collect some data from the web, and the Scrapy framework came recommended. I have read the official documentation and some articles online, but a few points still confuse me, so I want to sort out my learning notes here. I am a beginner and some of this is just my own thinking; please point out anything that is wrong, and let's discuss it together.
Scrapy feels like the skeleton of a crawling solution: officially it only provides a basic way of using it, and you handle more complex situations through middleware or configuration changes. I think that is a good approach for a field as complex and fast-growing as crawling.
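For example, my rough understanding is that most of this extension happens in settings.py; a minimal sketch of what I mean (the middleware path myproject.middlewares.CustomRetryMiddleware is a placeholder I made up):

```python
# settings.py -- extending Scrapy through configuration rather than by
# changing the framework itself. The middleware path below is a placeholder.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CustomRetryMiddleware": 543,
}

# Built-in behaviour can also be tuned through settings, for example:
DOWNLOAD_DELAY = 1.0       # wait 1 second between requests
CONCURRENT_REQUESTS = 8    # limit parallel requests
```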
A crawler mainly has to solve three things:
1. Writing the Spider: analyzing the site structure and its pages so that all pages and data can be traversed.
2. Writing the Item Pipeline: cleaning, deduplicating, restructuring, and saving the crawled data.
3. Scheduling: setting each site's start cycle, priority, crawl frequency, and so on, and showing each site's collection status and completion time.
The process of writing a Spider:
Give it the initial URL; once the downloader finishes, analyze the structure of the page. For example: collect the list on the home page, parse the pagination to build the paginated URLs and send requests for those new pages, save the list data into Items, then parse out the detail-page URLs, send those requests, and save the detail-page data into Items once they are crawled.
(My understanding: the list page and the detail page should both go into one Spider, i.e. everything collected from one website lives in one Spider. The detail page gets its own callback function, and the list collection just calls itself repeatedly for each page. Is that right? A rough sketch of what I mean follows below.)
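Here is the sketch, a single Spider covering both list and detail pages (the URL and CSS selectors are placeholders, not a working crawler):

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    """One spider per site: list pages and detail pages handled together."""
    name = "example"
    start_urls = ["https://example.com/list?page=1"]  # placeholder URL

    def parse(self, response):
        # List page: follow every link to a detail page.
        for href in response.css("ul.items a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_detail)

        # Pagination: follow the "next page" link, which calls parse() again.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        # Detail page: yield the data so the pipeline can handle it.
        yield {
            "title": response.css("h1::text").get(),
            "url": response.url,
        }
```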
The process of writing the data-storage Item Pipeline:
The Items collected by the Spider are saved to the database. Before saving, the data is cleaned, e.g. deduplicated and normalized to a standard format. The cases to cover are: saving all collected data, saving incrementally collected data, and saving updated data.
(Does the Item just define the structure of the data to be saved, and the Pipeline is where the data actually gets stored? Should both the cleaning and the saving logic go into the Pipeline? How exactly do Item and Pipeline relate to each other? My current guess is sketched below; is this roughly the intended division of work?)
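The sketch (class names and fields are made up, and the actual database write is left as a comment): the Item only declares the fields, while the Pipeline receives each Item, cleans and deduplicates it, and then saves it.

```python
import scrapy
from scrapy.exceptions import DropItem


class ArticleItem(scrapy.Item):
    # The Item only declares which fields the collected data has.
    title = scrapy.Field()
    url = scrapy.Field()


class DedupAndSavePipeline:
    """Clean and deduplicate each item, then save it."""

    def open_spider(self, spider):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # Deduplicate by URL.
        if item["url"] in self.seen_urls:
            raise DropItem(f"duplicate item: {item['url']}")
        self.seen_urls.add(item["url"])

        # Normalize fields before saving (placeholder cleaning step).
        item["title"] = (item["title"] or "").strip()

        # The actual write to a database would go here.
        spider.logger.info("saving item: %s", item["url"])
        return item

# The pipeline still has to be enabled in settings.py, e.g.
# ITEM_PIPELINES = {"myproject.pipelines.DedupAndSavePipeline": 300}
```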
Setting each site's start cycle, priority, and crawl frequency, and getting each site's completion status and completion time:
Can a Spider only be started through the scrapy command and its parameters? Do I have to write a shell or Python script and then run it from the command line? Where do I set the start cycle, priority, and crawl frequency, and how do I get a site's completion status and completion time?
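What I have found so far: besides the scrapy crawl command, the docs describe running a spider from a plain Python script with CrawlerProcess, roughly like this (a sketch; "example" is just the placeholder spider name from above), and such a script could then be scheduled with cron or similar:

```python
# run_spider.py -- starting a spider from a script instead of "scrapy crawl".
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("example")   # spider name as registered in the project
process.start()            # blocks until the crawl finishes
```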
For example, how would I lower the priority of a site that keeps failing so it is retried less often? From reading the documentation, my understanding is that the official project does not ship a web-based management platform. What exactly does Scrapyd do? I learned that SpiderKeeper is a web system written in Python for managing Scrapy jobs; are there any similar projects written in PHP?
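From what I gather, Scrapyd runs as a service and exposes an HTTP API for deploying projects and scheduling spiders, which tools like SpiderKeeper build on. A sketch of how I think it would be used (assuming Scrapyd is running on localhost:6800 and a project called myproject has already been deployed; using the requests library):

```python
import requests

# Schedule a crawl job for a deployed project/spider.
resp = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "myproject", "spider": "example"},
)
print(resp.json())  # should contain a "jobid" on success

# List jobs; finished entries carry start/end times, which looks like one way
# to get the completion status and time I was asking about.
jobs = requests.get(
    "http://localhost:6800/listjobs.json",
    params={"project": "myproject"},
).json()
print(jobs)
```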
I have written a lot; it would help if I could find some open-source projects to refer to. Anyone familiar with crawling and data collection, please help answer these questions.