【Flowchart Mode】How to use the "Extract Data" component
Abstract: This tutorial introduces the functions and application scenarios of the "Extract Data" component. No programming is needed; all operations are visual.
In the “What is a behavior component” tutorial, we introduced the functions and usage of the various behavior components in the ScrapeStorm Flowchart Mode. This article focuses on one of those behavior components: “Extract Data”.
The “Extract Data” component is used to extract data from a web page. It can be used alone or in conjunction with a “Loop” component or a “Judgment” component. Used alone, it is suitable for extracting data from a single page. Used together with those components, it is suitable for extracting data from all pages.
The settings of this component include the field list and the extraction range. The specific settings are as follows:
There are two ways to merge fields.
(1) Click a field that needs to be merged, right-click it and select “Merge”, then select the fields you want to merge in the page.
(2) Hold Ctrl or Shift to select multiple fields, then right-click and select “Merge”. This method is suitable for merging several fields at once.
3. Select in page
If you want to modify the content extracted in the field, or add a new field to set the extraction content, you need to click “Select in page”, and then extract the required data in the web page.
4. Edit Xpath
XPath is a path query language that uses path expressions to locate the data we need in a web page. Users with a programming background can use this feature to set a new XPath for a field.
Click here to learn more about XPath.
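To make the idea of a path expression more concrete, the sketch below runs a few XPath queries with Python's standard library. The sample markup, class names, and variable names are invented for illustration and do not come from ScrapeStorm. Note also that `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup, so this approximates what an expression typed into the XPath editor does rather than reproducing it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for a product page.
PAGE = """
<html>
  <body>
    <div class="item">
      <h2 class="title">Blue Kettle</h2>
      <span class="price">$24.99</span>
    </div>
    <div class="item">
      <h2 class="title">Red Toaster</h2>
      <span class="price">$39.50</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Each path expression describes where a field lives in the document tree.
titles = [el.text for el in root.findall(".//div[@class='item']/h2")]
prices = [el.text for el in root.findall(".//span[@class='price']")]

# A positional predicate selects a single element: the second item only.
second_title = root.find("./body/div[2]/h2").text

print(titles)
print(prices)
print(second_title)
```

The first two expressions match every item on the page, which is the typical shape of a field's XPath; the third shows how a position index narrows the match to one element.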
5. Extract Type
Different kinds of data require different extract types. When you set a new field, its extract type defaults to text.
In general, when you select new data, ScrapeStorm automatically determines the extract type for the field, so you don’t need to set it yourself. However, if it makes the wrong choice, you can set the extract type of the field manually.
Extract text: Suitable for ordinary text data.
Extract innerHTML: Suitable for extracting the HTML inside the selected element, not including the element’s own tags.
Extract outerHTML: Suitable for extracting the HTML of the selected element, including the element’s own tags.
Extract link URL: Suitable for extracting links.
Extract image URL: Suitable for extracting images.
Tips: HTML is a markup language used to describe web pages. It mainly defines the structure and display of the content on a page. HTML documents are also called web pages.
Click here to learn more about HTML.
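The difference between the extract types can be sketched in a few lines of Python. The snippet, the helper names `outer_html` and `inner_html`, and the keys of `row` are assumptions made for this illustration; they are not ScrapeStorm's implementation, and ScrapeStorm may normalize URLs differently from the raw attribute values shown here.

```python
import copy
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, well-formed snippet standing in for one element on a page.
SNIPPET = (
    '<div class="card">'
    '<a href="/item/42">Blue <b>Kettle</b></a>'
    '<img src="/img/42.png" />'
    '</div>'
)

card = ET.fromstring(SNIPPET)
link = card.find("a")
image = card.find("img")


def outer_html(el):
    """Markup of the element, including its own opening and closing tags."""
    clone = copy.copy(el)
    clone.tail = None  # drop any text that follows the element
    return ET.tostring(clone, encoding="unicode", method="html")


def inner_html(el):
    """Markup inside the element, excluding its own tags."""
    children = "".join(
        ET.tostring(child, encoding="unicode", method="html") for child in el
    )
    return escape(el.text or "", quote=False) + children


row = {
    "text": "".join(link.itertext()),  # Extract text
    "innerHTML": inner_html(link),     # Extract innerHTML
    "outerHTML": outer_html(link),     # Extract outerHTML
    "link_url": link.get("href"),      # Extract link URL
    "image_url": image.get("src"),     # Extract image URL
}

for name, value in row.items():
    print(f"{name}: {value}")
```

For the same link element, the text value keeps only the visible words, innerHTML keeps the nested `<b>` tag, and outerHTML additionally keeps the surrounding `<a>` tag and its attributes.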
6. Modify data
Sometimes we need to process the content of the extracted fields. For example, you may want to keep only the numbers or email addresses in a field, replace text in a field with new text, remove blank characters at the beginning and the end, or apply a regular expression of your own. In these cases, you can click “Modify Data”.
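For readers who want to see what these operations amount to, here is a minimal Python sketch of the same kinds of processing. The sample value, the patterns, and the variable names are assumptions chosen for illustration; the rules that ScrapeStorm applies internally may differ.

```python
import re

# Hypothetical raw field value with leading and trailing blanks.
raw = "  Contact: sales@example.com, price 1,299 USD (was 1,499)  "

# Clear the blank characters at the beginning and the end.
trimmed = raw.strip()

# Keep only the numbers (digits with optional thousands separators).
numbers = re.findall(r"\d[\d,]*", trimmed)

# Keep only the email addresses (a deliberately simple pattern).
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", trimmed)

# Replace text in the field with new text.
replaced = trimmed.replace("USD", "US dollars")

# A custom regular expression: capture the number that follows "price".
match = re.search(r"price\s+([\d,]+)", trimmed)
current_price = match.group(1) if match else None

print(numbers, emails, current_price)
print(replaced)
```

Each step takes the field's text and returns a new value, so the steps can be chained in whatever order the data requires.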
7. Special value
In the data scraping process, some users need to scrape some special fields, such as scraping time, page source code, current page title, current page URL, etc.
These values cannot be selected directly in the web page, so you can use “Special Value” to set the field instead. You can create a new field and change it to a special field, or change an existing field to a special field.
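Conceptually, a special value is a page-level value that is attached to each scraped row instead of being selected inside the page. The sketch below illustrates that idea only; the helper `add_special_values`, its parameters, and the field names are hypothetical and are not part of ScrapeStorm.

```python
from datetime import datetime, timezone


def add_special_values(row, page_url, page_title, page_source, scraped_at=None):
    """Return a copy of row with page-level values added (hypothetical helper)."""
    scraped_at = scraped_at or datetime.now(timezone.utc)
    return {
        **row,
        "scraping_time": scraped_at.isoformat(timespec="seconds"),
        "page_title": page_title,
        "page_url": page_url,
        "page_source": page_source,
    }


# Example with made-up page data and a fixed timestamp for reproducibility.
row = add_special_values(
    {"title": "Blue Kettle"},
    page_url="https://example.com/item/42",
    page_title="Blue Kettle | Example Shop",
    page_source="<html>...</html>",
    scraped_at=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
print(row["scraping_time"], row["page_url"])
```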
8. Delete column
You can right-click a field and select “Delete”, or hold Ctrl or Shift to select multiple fields and delete them together.
If you do not need the fields that the system recognized automatically, you can click “Clear” to remove them all and then set up the fields you need.
10. Add field
If you want to add a new field, click “Add Field” in the upper right corner, right-click the newly added field, click “Select in Page”, and then select the required data on the page.
11. Data Filters
Click here to learn more about Data Filters.