【Flowchart Mode】How to use the "Extract Data" component
Abstract: This tutorial introduces the features and application scenarios of the "Extract Data" component.
In the “What is a behavior component” tutorial, we introduced the functions and usage of the various behavior components in ScrapeStorm's Flowchart Mode. This article focuses on one of them, the “Extract Data” component.
The “Extract Data” component extracts data from a web page. It can be used alone or together with a “Loop” component or a “Judgment” component. Used alone, it is suited to extracting data from a single page. Used together with a “Loop” or “Judgment” component, it is suited to extracting data from all pages.
The settings of this component include the field list and the extraction range. The specific settings are as follows:
1. Set the field list
(1) Add a field
Click the “Field List” setting on the “Extract Data” component, then click the “Add Field” button in the lower left corner of the pop-up list box and select “Click page add field” or “Add special field”. The special fields available are “Page URL, Page title, Convert to pdf, and Scraping time”.
ⅰ. Add page field
A page field is a normal field displayed on the page. After choosing to add a page field, you can select the element directly on the page or set its XPath manually in the “Element XPath” list.
ⅱ. Add special fields
Special fields are fields that cannot be added by selecting an element on the page: “Page URL, Page title, Convert to pdf, and Scraping time”. The system has already configured these fields, so you can use them directly without any setup.
(2) Set the fields
Click the icon to the right of the field to modify the field name.
To process the data in a field, click “Modify Data”. For example, you can keep only the numbers or the email address in the field, replace text in the field with new text, trim blanks, or apply a custom regular expression.
For page elements selected by the user, the default extraction method is “Extract Text”. If this method does not meet your needs, you can set the field's extraction method manually. The options are as follows:
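The kinds of field processing listed above can be sketched in Python as follows. This is only an illustration of what each operation does to a value, not ScrapeStorm's own implementation; the function names, the sample string, and the regular expressions are all assumptions made for this example.

```python
import re

def extract_number(text):
    """Return the first integer or decimal number in the text, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return match.group(0) if match else None

def extract_email(text):
    """Return the first thing that looks like an email address, or None."""
    match = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
    return match.group(0) if match else None

def replace_text(text, old, new):
    """Replace every occurrence of old with new."""
    return text.replace(old, new)

def trim_blanks(text):
    """Remove leading and trailing whitespace."""
    return text.strip()

# Hypothetical raw field value, as it might come off a page.
raw = "  Price: 19.99 USD, contact sales@example.com  "

print(extract_number(raw))
print(extract_email(raw))
print(replace_text(trim_blanks(raw), "USD", "dollars"))
```

Each helper is a small pure function, so the operations can be chained in any order, much as several processing rules can be applied to one field.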
Extract Text: Suitable for ordinary text data.
Extract innerHTML: Suitable for extracting the HTML inside the element, not including the element's own tag.
Extract outerHTML: Suitable for extracting the HTML of the element, including the element's own tag.
Extract link URL: Suitable for extracting links.
Extract image URL: Suitable for extracting the URLs of pictures.
Tips: HTML is a language used to describe web pages. It is mainly used to control the display and appearance of data. HTML documents are also called web pages.
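To make the difference between the extraction methods concrete, here is a minimal Python sketch that applies each one to a small, well-formed snippet using the standard library's xml.etree.ElementTree. The snippet, the URLs, and the helper names are hypothetical, and this is not how ScrapeStorm is implemented; it only shows what each method would be expected to return.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for part of a page.
snippet = (
    '<div class="item">'
    '<a href="https://example.com/p/1">Blue <b>Mug</b> 350 ml</a>'
    '<img src="https://example.com/img/1.png" />'
    '</div>'
)
root = ET.fromstring(snippet)
link = root.find("a")
image = root.find("img")

def extract_text(el):
    """Text only: all text inside the element, with tags dropped."""
    return "".join(el.itertext()).strip()

def extract_inner_html(el):
    """Markup inside the element, without the element's own tag."""
    parts = [el.text or ""]
    parts += [ET.tostring(child, encoding="unicode") for child in el]
    return "".join(parts)

def extract_outer_html(el):
    """Markup of the element, including the element's own tag."""
    return ET.tostring(el, encoding="unicode")

print(extract_text(link))        # text
print(extract_inner_html(link))  # innerHTML
print(extract_outer_html(link))  # outerHTML
print(link.get("href"))          # link URL
print(image.get("src"))          # image URL
```

For the link element, the text is the visible words only, innerHTML keeps the nested bold tag, and outerHTML additionally wraps the result in the link's own tag and attributes.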
After you click the “Click the page to add a field” button, click the element you want to extract on the page, and the XPath of the element is generated automatically.
To modify the XPath of a field, click the button to the right of the element XPath and then click an element on the page to generate the XPath automatically, or edit the XPath manually.
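When editing an XPath manually, it helps to see how an expression maps to elements. The sketch below uses Python's standard xml.etree.ElementTree, which supports only a limited subset of XPath, on a hypothetical page. The markup, class names, and expressions are assumptions for illustration, and the XPath that the software generates for a real page may look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical page markup (well-formed so the standard parser accepts it).
page = ET.fromstring("""
<html><body>
  <h1 id="title">Catalog</h1>
  <ul>
    <li class="product"><span class="name">Mug</span><span class="price">$8</span></li>
    <li class="product"><span class="name">Plate</span><span class="price">$12</span></li>
  </ul>
</body></html>
""")

# An XPath that identifies one element by its attribute.
title = page.find(".//h1[@id='title']").text

# An XPath that uses position: the price inside the first list item.
first_price = page.find(".//li[1]/span[@class='price']").text

# A broader XPath that matches the same kind of element in every item.
all_prices = [el.text for el in page.findall(".//span[@class='price']")]

print(title)
print(first_price)
print(all_prices)
```

A position-based expression such as the second one breaks when the page layout changes, while an attribute-based expression tends to survive, which is a common reason to edit a generated XPath by hand.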
(3) Delete the field
Right-click the field and click “Delete Field”. To delete all fields, hold “Ctrl” or “Shift” to select them all, then click “Delete Field”.
2. Extraction range
When the “Extract Data” component is used alone, the extraction range is usually set to “Extract from the global page”.
When the “Extract Data” component is used in conjunction with a “Loop” component or a “Judgment” component, it is usually set to “Extract from current loop”.
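The difference between the two extraction ranges can be pictured with a short Python sketch. It assumes that extracting from the global page means looking a field up against the whole document, and that extracting from the current loop means looking it up relative to the item the loop is currently on. The markup and function names are hypothetical, and this is not ScrapeStorm's own code.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one heading and a repeated list of products.
PAGE = """
<html><body>
  <h1>Catalog</h1>
  <ul>
    <li class="product"><span class="name">Mug</span><span class="price">$8</span></li>
    <li class="product"><span class="name">Plate</span><span class="price">$12</span></li>
  </ul>
</body></html>
"""

def extract_from_global_page(page):
    """Look fields up against the whole page: yields a single record."""
    return {"heading": page.find(".//h1").text}

def extract_from_current_loop(page):
    """Look fields up relative to each looped item: one record per item."""
    rows = []
    for item in page.findall(".//li[@class='product']"):
        rows.append({
            "name": item.find("span[@class='name']").text,
            "price": item.find("span[@class='price']").text,
        })
    return rows

def global_lookup_inside_loop(page):
    """The pitfall: a page-wide lookup inside a loop repeats the first match."""
    names = []
    for _ in page.findall(".//li[@class='product']"):
        names.append(page.find(".//span[@class='name']").text)
    return names

page = ET.fromstring(PAGE)
print(extract_from_global_page(page))
print(extract_from_current_loop(page))
print(global_lookup_inside_loop(page))
```

The last function shows why the range matters inside a loop: a lookup that ignores the current item returns the first product's name on every iteration.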
Tips: The “Extract Data” component must be followed by a “Save Data” component. Otherwise, the data cannot be saved and the collection task will not work properly.