【Smart Mode】How to configure the extracted fields
Abstract: In Smart Mode, ScrapeStorm automatically identifies the URL and extracts fields from the page. If the system sets too many fields, or you have other requirements, you can configure the extracted fields yourself. This tutorial shows you how to set them.
In Smart Mode, ScrapeStorm automatically identifies the URL and sets the extraction fields. These fields are the system's default choice of what the user needs to extract.
If the fields extracted by the system do not meet your needs, or you need to extract some new fields, right-click a field and choose a setting from the menu that appears.
Each setting is described below:
1. Scrape Into
Smart Mode generally extracts the content of the list page. If you need to extract the content of both the list page and the detail page, you can use the “Scrape Into” function.
“Scrape Into” is covered in detail in another tutorial; click here to learn more about “Scrape Into”.
2. Rename the field
3. Select in page
If you want to change the content a field extracts, or set the content for a newly added field, click “Select in Page” and then select the required data in the web page.
4. Edit column XPath
XPath is a query language that uses path expressions to locate the data we need in a web page. Users with a programming background can use this feature to set a new XPath for the field.
Click here to learn more about XPath.
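To make the idea of a path expression concrete, here is a minimal sketch in Python. It does not show how ScrapeStorm evaluates XPath internally; it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup. The page fragment, the class names, and the two expressions are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment; the class names are invented.
PAGE = """
<html><body>
  <ul>
    <li class="item"><a href="/p/1">Red mug</a><span class="price">$4.50</span></li>
    <li class="item"><a href="/p/2">Blue mug</a><span class="price">$5.00</span></li>
  </ul>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# One path expression per column, similar in spirit to a column XPath.
TITLE_XPATH = ".//li[@class='item']/a"
PRICE_XPATH = ".//li[@class='item']/span[@class='price']"

# Each expression selects one element per list item, in document order.
titles = [el.text for el in root.findall(TITLE_XPATH)]
prices = [el.text for el in root.findall(PRICE_XPATH)]

for title, price in zip(titles, prices):
    print(title, price)
```

Changing only the expression (for example, pointing the price column at a different `class`) changes what the column extracts, which is the effect of editing a column's XPath.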
5. Extract Type
Different kinds of data require different value attributes. When you set a new field, its value defaults to text.
In general, when you select new data, ScrapeStorm automatically determines the field attribute, so you do not need to set it. However, if it judges incorrectly, you can set the value attribute of the field yourself.
Extract text: Suitable for ordinary text data.
Extract innerHTML: Suitable for extracting the HTML inside the selected element, not including the element's own tag.
Extract outerHTML: Suitable for extracting the HTML of the selected element, including the element's own tag.
Extract link URL: Suitable for extracting links.
Extract image URL: Suitable for extracting images.
Tips: HTML is a language used to describe web pages. It is mainly used to control the display and appearance of data. HTML documents are also called web pages.
Click here to learn more about HTML.
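The difference between the extract types is easiest to see on one small element. The sketch below is only an illustration in Python using the standard library; the snippet, the URLs, and the helper names (`extract_text`, `extract_inner_html`, `extract_outer_html`) are assumptions made for this example and are not part of ScrapeStorm.

```python
import xml.etree.ElementTree as ET

# Hypothetical element; the URLs are placeholders.
SNIPPET = (
    '<div class="card">'
    '<a href="https://example.com/item/1">View <b>item</b></a>'
    '<img src="https://example.com/img/1.png" />'
    '</div>'
)

card = ET.fromstring(SNIPPET)
link = card.find("a")
image = card.find("img")


def extract_text(el):
    """Visible text only, with all tags removed."""
    return "".join(el.itertext()).strip()


def extract_inner_html(el):
    """Markup inside the element, without the element's own tag."""
    parts = [el.text or ""]
    for child in el:
        parts.append(ET.tostring(child, encoding="unicode", method="html"))
    return "".join(parts)


def extract_outer_html(el):
    """Markup of the element, including its own tag."""
    return ET.tostring(el, encoding="unicode", method="html")


row = {
    "text": extract_text(link),
    "innerHTML": extract_inner_html(link),
    "outerHTML": extract_outer_html(link),
    "link URL": link.get("href"),   # value of the href attribute
    "image URL": image.get("src"),  # value of the src attribute
}

for name, value in row.items():
    print(f"{name}: {value}")
```

The same selected element yields five different values, which is why a wrongly detected extract type can produce unexpected data even when the selection itself is correct.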
6. Modify data
Sometimes the content of an extracted field needs further processing. For example, you may need only the numbers or the email address in a field, want to replace some text with new text, want to remove blank characters at the beginning and end, or want to apply your own regular expression. In these cases, click “Modify Data”.
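The kinds of processing listed above can be sketched in a few lines of Python. This is an illustration of the operations themselves, not of how “Modify Data” is implemented; the sample value and the regular expressions are assumptions chosen for the example, and the email pattern is deliberately simple.

```python
import re

# Hypothetical raw field value with leading and trailing blanks.
raw = "  Contact: sales@example.com, Price: 1,299 USD  "

# Remove blank characters at the beginning and the end.
trimmed = raw.strip()

# Keep only the email address (a simple pattern, not a full validator).
email_match = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", trimmed)
email = email_match.group(0) if email_match else ""

# Keep only the digits, joined together.
digits = "".join(re.findall(r"\d+", trimmed))

# Replace one piece of text with another.
replaced = trimmed.replace("USD", "dollars")

print(repr(trimmed))
print(email)
print(digits)
print(replaced)
```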
7. Not Null
Because each record is different, some of the fields you set may come back blank for some records. If you want only complete data and do not need records with these blank fields, set the field to “Not Null”. Then, whenever that field is blank, the whole record is skipped.
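The effect of “Not Null” on the result set can be sketched as a simple filter. The Python below is a hypothetical illustration: the sample rows and the `drop_blank` helper are invented, and treating whitespace-only values as blank is an assumption of this sketch rather than documented ScrapeStorm behavior.

```python
# Hypothetical scraped rows; two of them have a blank price.
rows = [
    {"title": "Red mug", "price": "$4.50"},
    {"title": "Blue mug", "price": ""},
    {"title": "Green mug", "price": "   "},
    {"title": "Black mug", "price": "$6.00"},
]


def drop_blank(rows, not_null_fields):
    """Keep only rows where every listed field has non-blank content."""
    kept = []
    for row in rows:
        # Assumption: a missing, empty, or whitespace-only value counts as blank.
        if all((row.get(field) or "").strip() for field in not_null_fields):
            kept.append(row)
    return kept


complete = drop_blank(rows, ["price"])
for row in complete:
    print(row["title"], row["price"])
```

Note that the whole row is dropped, not just the blank cell, so marking several fields “Not Null” can shrink the result set considerably.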
8. Special value
During scraping, some users need fields that describe the scrape itself, such as the scraping time, the page source code, the current page title, or the current page URL.
These values cannot be selected directly in the web page, so use “Special Value” to set them. You can either create a new field and change it to a special field, or change an existing field to a special field.
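Conceptually, a special value is metadata attached to each record rather than content selected from the page. The following Python sketch shows that idea only; the function name `add_special_values`, the column names, and the sample URL and title are assumptions made for this example and do not reflect ScrapeStorm's actual column names.

```python
from datetime import datetime, timezone


def add_special_values(row, page_url, page_title, scraped_at=None):
    """Return a copy of the row with page-level metadata columns added."""
    scraped_at = scraped_at or datetime.now(timezone.utc)
    enriched = dict(row)  # leave the original row untouched
    enriched["scrape_time"] = scraped_at.isoformat()
    enriched["page_url"] = page_url
    enriched["page_title"] = page_title
    return enriched


# Hypothetical record and page details.
record = {"title": "Red mug", "price": "$4.50"}
fixed_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
result = add_special_values(
    record,
    page_url="https://example.com/mugs?page=1",
    page_title="Mugs - Example Shop",
    scraped_at=fixed_time,
)
print(result)
```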
9. Delete the column
10. Clear fields
If you do not need the fields that the system recognized automatically, use “Clear Fields” to remove them, and then set the fields you need.
11. Add a field or change the content of an existing field
If you want to add a new field, click “Add Field” in the upper right corner, right-click the newly added field, click “Select in Page”, and select the required data in the page.
If you want to change the extracted content of an existing field, select that field, click “Select in Page”, and then select the required data.