How to Scrape News from HuffPost
Abstract: This tutorial explains in detail how to scrape news details from HuffPost using ScrapeStorm's smart mode.
In this article, we will show you how to scrape the news details of the HuffPost World News section using ScrapeStorm’s “Smart mode”.
Introduction to the scraping tool
ScrapeStorm is a new generation of web scraping tool based on artificial intelligence technology. It is the first scraper to support the Windows, Mac, and Linux operating systems.
Introduction to the scraping object
HuffPost is an American news and opinion website and blog that has localized and international editions. The site offers news, satire, blogs, and original content and covers politics, business, entertainment, environment, technology, popular media, lifestyle, culture, comedy, healthy living, women’s interests, and local news.
Official website: https://www.huffingtonpost.com/
Scraping fields: title, title_link, headline, image, author, time, abstract, label
Function point directory
Preview of the scraped result
1. Download and install ScrapeStorm, then register and log in
(1) Open the ScrapeStorm official website, download and install the latest version.
(2) Click Register/Login to register a new account and then log in to ScrapeStorm.
Tips: You can use this web scraping software without registering, but tasks created under the anonymous account will be lost when you switch to a registered account, so we recommend registering before you start.
2. Create a task
(1) Copy the URL of the HuffPost World News section
Click here to learn more about how to enter the URL correctly.
(2) Create a new smart mode task
You can create a new scraping task directly on the software, or you can create a task by importing rules.
Click here to learn how to import and export scraping rules.
3. Configure the scraping rules
(1) Manually select
If you are not satisfied with the automatically recognized data, or the recognition quality is poor, you can manually select the list on the page.
(2) Set the fields
Click here to learn how to configure the extracted fields.
ⅰ. Add fields
ⅱ. Rename the fields
Right click on the data and select “Rename” to modify the field name.
(3) Scrape into the detail page
Select a column that contains URL links and click “Scrape Into”. The software will open the detail page, where you can add the required fields.
Click here to learn how to extract the list page plus the detail page.
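For readers who want to see the idea behind "list page plus detail page" outside the tool, here is a minimal Python sketch. It is not how ScrapeStorm works internally. The markup in `LIST_HTML`, the `DETAILS` mapping, and the helper names `LinkCollector` and `merge_detail` are all hypothetical, and real HuffPost pages will be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect title and title_link pairs from anchor tags on a list page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rows = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the list page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open anchor.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.rows.append({"title": title, "title_link": self._href})
            self._href = None


def merge_detail(rows, details):
    """Attach detail-page fields to each list row, keyed by title_link."""
    return [{**row, **details.get(row["title_link"], {})} for row in rows]


# Hypothetical list-page markup, used here instead of a live request.
LIST_HTML = """
<div><a href="/entry/story-one">Story one</a></div>
<div><a href="/entry/story-two">Story two</a></div>
"""

collector = LinkCollector("https://www.huffingtonpost.com/")
collector.feed(LIST_HTML)

# Stand-in for fields that would be read from each detail page.
DETAILS = {
    "https://www.huffingtonpost.com/entry/story-one": {"author": "A. Writer"},
}

records = merge_detail(collector.rows, DETAILS)
```

Each list row keeps its own fields and gains whatever was found on its detail page. Rows without a matching detail entry are left unchanged.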
4. Set up and start the scraping task
(1) Running and Anti-block settings
Click “Setting” and set the waiting time based on how quickly the web page opens. You can check “Block Images” and “Block Ads”. Leave the anti-block settings at the system defaults, then click “Save”.
Click here to learn more about how to configure the scraping task.
P.S. “Block Images” reduces the load time and speeds up the scraping process. It does not affect the scraping or downloading of images.
(2) Start scraping data
Premium Plan and above users can use “Scheduled job” and “Sync to Database”. If you want to download images, you can check “Download images while running”. Then click “Start”.
Click here to learn about scheduled jobs.
Click here to learn about syncing to a database.
Click here to learn about downloading images.
(3) Wait a moment, and you will see the data being scraped.
5. Export and view the data
(1) Click “Export” to download your data.
(2) Choose the format to export according to your needs.
ScrapeStorm provides a variety of local export formats, such as Excel, CSV, HTML, TXT, or a database. Professional Plan and above users can also post directly to WordPress.
Click here to learn more about how to view the extraction results and clear the extracted data.
Click here to learn more about how to export the result of extraction.
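Once the data is exported as CSV, it can be inspected with a short script. The sketch below is an illustration only. It assumes the CSV header uses the field names listed in this tutorial, which depends on how the fields were named in the task. `load_export` is a hypothetical helper, and `SAMPLE` is made-up data standing in for a real export file.

```python
import csv
import io

# Field names taken from the list in this tutorial. The header row of a
# real export depends on how the fields were renamed in the task.
FIELDS = ["title", "title_link", "headline", "image",
          "author", "time", "abstract", "label"]


def load_export(handle):
    """Read exported rows, filling missing fields with an empty string."""
    reader = csv.DictReader(handle)
    return [
        {name: (row.get(name) or "").strip() for name in FIELDS}
        for row in reader
    ]


# Made-up sample standing in for an exported file.
SAMPLE = io.StringIO(
    "title,title_link,author\n"
    "Story one,https://www.huffingtonpost.com/entry/story-one,A. Writer\n"
)

rows = load_export(SAMPLE)
for row in rows:
    print(row["title"], row["author"])
```

To read a real export, pass an open file instead of `SAMPLE`, for example `open("export.csv", newline="", encoding="utf-8")`. Both the file name and the encoding here are assumptions, so adjust them to match your export.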