
【Getting Started】Basic terminology

2018-09-24 23:47:32

Abstract: This article explains some basic terms that appear in ScrapeStorm.

Here are some terms you will encounter in ScrapeStorm.

Scraping Rule:

A scraping rule is a script that records the specific settings of a ScrapeStorm extraction task, so that the task can be exported and imported. After importing an existing rule, you can modify it, or run data extraction automatically with the configured settings unchanged.

 

XPath:

This is a path query language: a way to locate the data we need in a web page using a path expression.

The following introduction is from Wikipedia; please click here for more details.

XPath (XML Path Language) is a query language for selecting nodes from an XML document. In addition, XPath may be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document. XPath was defined by the World Wide Web Consortium (W3C).

If you want to learn more, please click here to view the tutorials on W3Schools.
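As a rough illustration of how a path expression locates data, the sketch below uses Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath. The sample document and its element names (`catalog`, `product`, `name`, `price`) are made up for this example and have nothing to do with ScrapeStorm's own rule format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are invented for illustration.
SAMPLE = """
<catalog>
  <product id="p1"><name>Kettle</name><price>25.00</price></product>
  <product id="p2"><name>Toaster</name><price>40.00</price></product>
</catalog>
"""

root = ET.fromstring(SAMPLE)

# ".//product/name" selects every <name> element that is a child of a <product>.
names = [node.text for node in root.findall(".//product/name")]

# "[@id='p2']" is a predicate: it keeps only the <product> whose id attribute is "p2".
toaster_price = root.find(".//product[@id='p2']/price").text

print(names)          # ['Kettle', 'Toaster']
print(toaster_price)  # 40.00
```

The same idea applies to web pages: the path describes where in the document tree the wanted data sits, and the query returns the matching nodes.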

 

HTML:

This is a language used to describe web pages; it mainly controls how content is structured and displayed. HTML documents are also called web pages.

The following introduction is from Wikipedia; please click here for more details.

Hypertext Markup Language (HTML) is the standard markup language for creating web pages and web applications. With Cascading Style Sheets (CSS) and JavaScript, it forms a triad of cornerstone technologies for the World Wide Web.

Web browsers receive HTML documents from a web server or from local storage and render the documents into multimedia web pages. HTML describes the structure of a web page semantically and originally included cues for the appearance of the document.

If you want to learn more, please click here to view the tutorials on W3Schools.
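To make the idea of HTML as structured markup concrete, here is a minimal sketch that reads a small HTML string with Python's standard `html.parser` module and collects the link targets. The page content and the `LinkCollector` class name are assumptions made for this example only.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Hypothetical page used only for illustration.
PAGE = (
    "<html><body><h1>Products</h1>"
    '<a href="/kettle">Kettle</a><a href="/toaster">Toaster</a>'
    "</body></html>"
)

collector = LinkCollector()
collector.feed(PAGE)
print(collector.links)  # ['/kettle', '/toaster']
```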

 

URL:

A URL is the address of a web page or other resource on the web.

The following introduction is from Wikipedia; please click here for more details.

A Uniform Resource Locator (URL), colloquially termed a web address, is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it. A URL is a specific type of Uniform Resource Identifier (URI), although many people use the two terms interchangeably. URLs occur most commonly to reference web pages (http), but are also used for file transfer (ftp), email (mailto), database access (JDBC), and many other applications.
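The parts of a URL can be seen by splitting one apart. The sketch below uses Python's standard `urllib.parse` module; the address itself is a made-up example on the reserved `example.com` domain.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical address used only for illustration.
url = "https://www.example.com:8080/products/list?page=2&sort=price#top"

parts = urlsplit(url)
query = parse_qs(parts.query)  # query string as a dict of lists

print(parts.scheme)    # https
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /products/list
print(query)           # {'page': ['2'], 'sort': ['price']}
print(parts.fragment)  # top
```

The scheme names the retrieval mechanism (here `https`), while the host, port, and path together identify where the resource lives.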

 

Cookie:

A cookie is a small piece of data that a website's server stores on your computer (for example, text you entered on the site, such as a username or password, and other records of your activity) so that the server can recognize your computer on later visits.

The following introduction is from Wikipedia; please click here for more details.

An HTTP cookie (also called web cookie, Internet cookie, browser cookie, or simply cookie) is a small piece of data sent from a website and stored on the user’s computer by the user’s web browser while the user is browsing. Cookies were designed to be a reliable mechanism for websites to remember stateful information (such as items added in the shopping cart in an online store) or to record the user’s browsing activity (including clicking particular buttons, logging in, or recording which pages were visited in the past). They can also be used to remember arbitrary pieces of information that the user previously entered into form fields such as names, addresses, passwords, and credit card numbers.
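For a sense of what a cookie looks like in practice, the sketch below parses a cookie string of the kind a server might send, using Python's standard `http.cookies` module. The cookie name `session_id` and its value are invented for this example.

```python
from http.cookies import SimpleCookie

# Hypothetical cookie string; the name and value are made up for illustration.
header = "session_id=abc123; Path=/; Max-Age=3600"

cookie = SimpleCookie()
cookie.load(header)

morsel = cookie["session_id"]
print(morsel.value)       # abc123
print(morsel["path"])     # /
print(morsel["max-age"])  # 3600
```

The browser sends the stored name and value back on later requests, which is how the server recognizes a returning visitor.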

 

Regular expression:

A regular expression is a pattern that describes the text to match; it is used to extract and replace data during collection.

The following introduction is from Wikipedia; please click here for more details.

A regular expression, regex or regexp (sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that define a search pattern. Usually this pattern is then used by string searching algorithms for “find” or “find and replace” operations on strings, or for input validation.
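The two uses mentioned above, extracting and replacing, can be sketched with Python's standard `re` module. The sample text and patterns are assumptions for illustration; regular expression syntax differs slightly between tools, so a pattern written for one may need adjusting in another.

```python
import re

# Hypothetical scraped text used only for illustration.
text = "Kettle: $25.00, Toaster: $40.00"

# Extract: capture the digits after each dollar sign.
prices = re.findall(r"\$(\d+\.\d{2})", text)

# Replace: swap every dollar sign for a currency code.
cleaned = re.sub(r"\$", "USD ", text)

print(prices)   # ['25.00', '40.00']
print(cleaned)  # Kettle: USD 25.00, Toaster: USD 40.00
```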