It's no secret (anymore) that data is precious.

Structured data is particularly valuable for companies and professionals in almost any industry, as it offers them the raw material for making predictions, identifying trends, and shaping their target audience.

One method for getting hold of structured data that has been growing in popularity over the past ten years is web scraping.

In this article, we'll explore two distinct approaches to web scraping:

  • First, we'll demonstrate how to extract specific data from a website using custom JavaScript code to directly interact with the HTML structure.
  • Then, we'll dive into n8n, a powerful low-code tool that simplifies web scraping and offers seamless integration with other services like ChatGPT, Google Sheets, and Microsoft Excel.

By comparing these two methods, you'll better understand their strengths and weaknesses, enabling you to select the best tool for your specific scraping needs.

As a bonus, we'll also explore how to combine n8n with AI's power to extract data and gain deeper insights from it.

What is web scraping?

Web scraping (also called web harvesting) is the process of automatically extracting and collecting data from a website. This approach is useful if you need to get structured data from a website without an API or if the API offers limited access to the data.

In practical terms, let's say you want to have a spreadsheet with data about the top 50 best-selling products in an online store, including product name, price, and availability.

One option is to manually copy-paste the data from the website into the spreadsheet – but this would be incredibly time-consuming and error-prone. Alternatively, you can automatically collect this data using web scraping.

The legality of web scraping is nuanced. While the act itself isn't inherently illegal, some websites explicitly forbid it in their terms of service. Others may allow it but impose restrictions on the type and amount of data you can collect.

Before starting the scraping process, check the terms and conditions and privacy policy of the website you wish to fetch data from. If scraping isn't mentioned there, look for the website's robots.txt file. This file contains instructions about which areas of the website are allowed or disallowed for scraping and crawling. For example, to check the robots.txt file for IMDb, go to https://imdb.com/robots.txt.

It's crucial to prioritize the ToS over the robots.txt file when assessing the legality of scraping a particular website. The robots.txt file should be viewed as a courtesy guideline rather than a strict legal document. Scraping activities that violate a website's terms of service can lead to bans, or even legal action, especially if the scraped data is used for commercial purposes.

How does web scraping work?

Essentially, the process of scraping data from websites involves 5 steps:

  1. Select the URL (website) you want to scrape.
  2. Make a request to the URL. The server responds to the request and returns the data as HTML.
  3. Select the data you want to extract from the webpage.
  4. Run the code to extract the selected data.
  5. Export the data in a readable format (for example, as a CSV file).
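To make these five steps concrete, here is a minimal sketch of the whole cycle in Node.js (version 18 or later, which ships with a global fetch). The URL and the heading selector are placeholders, and a real scraper would typically use an HTML parser or a browser automation library rather than a regular expression:

const fs = require("fs");

// 1. Select the URL you want to scrape (placeholder)
const url = "https://example.com";

const scrape = async () => {
  // 2. Make a request; the server responds with HTML
  const response = await fetch(url);
  const html = await response.text();

  // 3. + 4. Select and extract the data (here: the text of the first h1 element)
  const match = html.match(/<h1[^>]*>(.*?)<\/h1>/i);
  const heading = match ? match[1] : "";

  // 5. Export the data in a readable format (a one-row CSV file)
  fs.writeFileSync("data.csv", `heading\n"${heading}"\n`);
};

scrape();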

How to do web scraping?

There are different options for web scraping, from browser extensions to no-code tools and custom code in different programming languages.

While browser extensions and other simpler tools offer a convenient entry point for basic web scraping tasks, they fall short when extensive data manipulation or integration is needed.

This section will focus on more advanced web scraping techniques that provide greater flexibility and control.

Web scraping with custom code

If you're comfortable with coding, you can leverage libraries specifically designed for web scraping across various programming languages. Popular choices include:

  • Python: Beautiful Soup, Scrapy, Selenium
  • R: rvest
  • JavaScript: Puppeteer

These libraries offer a wealth of tools for fetching web pages, parsing HTML and XML, and extracting the precise data you need. JavaScript, being native to web browsers, can directly interact with web pages and HTML elements, making it a convenient choice for certain scraping tasks.

The main advantage of using custom code and libraries is the flexibility and control it gives developers. You can tailor your scraper to target specific data, handle complex website structures, and implement custom logic for data cleaning and transformation. Additionally, most web scraping libraries are open-source and free.

Web scraping with n8n workflows

While JavaScript provides a solid foundation for web scraping, managing larger projects, scheduling tasks, or integrating data into other systems can present challenges.

For a more streamlined and scalable approach, n8n offers a user-friendly, no-code alternative. Let's explore how n8n empowers you to automate and enhance your web scraping workflows through a visual interface, all while handling the underlying boilerplate code that would otherwise be required.

n8n combines the power of libraries with the ease of browser extensions – and it's free to use. You can build automated workflows using core nodes like HTTP Request and HTML Extract to scrape data from websites and save it in a spreadsheet.

Moreover, you can extend the workflow however you like, for example, by emailing the spreadsheet, inserting the data into a database, or analyzing and visualizing it on a dashboard.

Let’s take a closer look at both methods for web scraping!

How to scrape data from a website with JavaScript?

Let's create a basic web scraper that monitors the price of an item on Amazon using NodeJS and Puppeteer.

Prerequisites

  • Install NodeJS
  • Initialize a NodeJS project with npm init -y
  • Install Puppeteer with npm install puppeteer

Step 1: Identify the target elements

Before writing code, let’s inspect the product page on Amazon. Right-click on the price element and select "Inspect" (or "Inspect Element") in your browser to view its HTML structure. Here, we need to pinpoint the unique class name or ID that identifies the price element. In this case, we are going to target the element with the class a-price-whole.

Inspect the webpage and identify the elements you want to extract the data from

Step 2: Import Puppeteer and define the target URL

First, we import Puppeteer and define the URL of the page we want to scrape:

const puppeteer = require("puppeteer");
const { KnownDevices } = require("puppeteer");
const iPhone = KnownDevices['iPhone 13'];

const targetURL = 'https://www.amazon.com/your-product';

KnownDevices will be used to emulate a specific device, in this case an iPhone 13.

Step 3: Initialize Puppeteer and get the page content

For this, we create a function called checkAmazonPrice that launches a headless browser, instructs Puppeteer to emulate an iPhone, and navigates to the URL of the product we want to monitor.

const checkAmazonPrice = async () => {
  // Launch the headless browser
  const browser = await puppeteer.launch();

  // Create a page
  const page = await browser.newPage();
  await page.emulate(iPhone);

  // Go to your site
  await page.goto(targetURL);

  console.log('Page loaded!');
};

Step 4: Identify the price element and get the price

To debug our scraper, we can take a screenshot of the page. We then wait for the element with the class a-price-whole to load on the page, and then read its value (the innerText).

const checkAmazonPrice = async () => {
  // …
  // Screenshot for testing purposes
  await page.screenshot({ path: 'screenshot.png' });

  const element = await page.waitForSelector('.a-price-whole');
  const elementPrice = await element.evaluate(el => el.innerText);
  const price = parseFloat(elementPrice);
  console.log('Price:', price);

  // Close browser
  await browser.close();
};

We then convert the value to a floating point number using parseFloat and log the price to the console.

Finally, we close the browser to avoid memory leaks.

We can then call the checkAmazonPrice function. The script should now look like this:

JavaScript function that checks Amazon product pricing
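For reference, here is the complete script, assembled from the snippets above (remember to replace the placeholder product URL with the page you actually want to monitor):

const puppeteer = require("puppeteer");
const { KnownDevices } = require("puppeteer");
const iPhone = KnownDevices['iPhone 13'];

const targetURL = 'https://www.amazon.com/your-product';

const checkAmazonPrice = async () => {
  // Launch the headless browser and open a new page emulating an iPhone
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.emulate(iPhone);

  // Go to the product page
  await page.goto(targetURL);
  console.log('Page loaded!');

  // Screenshot for testing purposes
  await page.screenshot({ path: 'screenshot.png' });

  // Wait for the price element and read its text
  const element = await page.waitForSelector('.a-price-whole');
  const elementPrice = await element.evaluate(el => el.innerText);
  const price = parseFloat(elementPrice);
  console.log('Price:', price);

  // Close browser
  await browser.close();
};

checkAmazonPrice();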

We can then execute this script by running node index.js and observe the price of the product logged to the console.

Testing the JavaScript function

Step 5: Schedule the price check periodically

We can then modify the web scraping script to run periodically, in this case every hour:

// run immediately then once every hour

checkAmazonPrice();
setInterval(checkAmazonPrice, 1000 * 60 * 60);

Step 6: Monitor any structural changes

Amazon's website structure might change over time. These changes can affect the way your code interacts with the website, potentially causing it to break. Therefore, it's essential to stay updated with any structural changes to ensure your code continues to work as intended.

To avoid overloading Amazon's servers, be mindful of your requests and space them out appropriately. Most websites employ protective measures against excessive traffic, and sending a large number of requests in a short period can get your requests blocked.
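If you extend the scraper to check several product pages, a simple way to space out the requests is to pause between them. Here is a small sketch that assumes checkAmazonPrice has been adapted to accept the product URL as a parameter:

// Helper that resolves after the given number of milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const checkAllPrices = async (urls) => {
  for (const url of urls) {
    // Assumes checkAmazonPrice(url) navigates to the given product page
    await checkAmazonPrice(url);
    // Wait a few seconds before the next request to avoid overloading the server
    await sleep(3000 + Math.random() * 2000);
  }
};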

While JavaScript offers a direct way to scrape data from websites, handling tasks beyond basic extraction can quickly become complex and time-consuming.

But don't worry, if you're looking for a more streamlined and efficient approach, we'll introduce you to web scraping with n8n in the next section. This versatile tool simplifies web scraping through a visual interface, allowing you to build powerful workflows by connecting nodes and actions, all while handling the underlying boilerplate code that would otherwise be required.

How to automatically scrape data from a website with n8n?

We'll demonstrate how to use n8n to scrape a website, trigger actions like sending notification emails, and store the extracted data in Google Sheets or Microsoft Excel.

Let's build a web scraper with low-code using n8n!

This workflow scrapes information about books from the fictional online book store http://books.toscrape.com, generates a CSV file, sends it as an email attachment, and stores the data in Google Sheets and Microsoft Excel 365.

Workflow for scraping data from a website and storing it into a spreadsheet

Prerequisites

To follow along, you'll need:

  • An n8n instance (Cloud or self-hosted)
  • Credentials for the services used in the workflow: Gmail, Google Sheets, and Microsoft Excel 365

Step 1: Get the website data

First, you need to get data from the website in HTML format. To do this, use the HTTP Request node, setting GET as the Request Method and http://books.toscrape.com as the URL. Next, configure the response: select String from the Response Format dropdown and set the Property Name to whatever you like. In this case, we named it data.

You can then execute the node, and the result should look like this:

Configuration of HTTP Request node

Step 2: Extract data from the website

Now that you have the HTML code of the website, you need to extract the data you want (in this case, all the books on the page). To do this, we can add an HTML Extract node and configure it with the following:

  • Source Data: JSON
  • JSON Property: data (This is the property name set in the HTTP Request node)

Extraction Values:

  • Key: books
  • CSS Selector: .row > li
  • Return Value: HTML
  • Return Array enabled (since we will be getting multiple items)

You can obtain the CSS selector by inspecting the page, just like we did in the JavaScript section of this tutorial. n8n supports any CSS selectors and combinators; in this case, .row > li selects all li elements that are direct children of an element with the class row.

When you execute the node, the result should look like this:

Configuration of HTML Extract node

We then need to process each element returned by the previous step, one by one. To do this, we use the Split Out node. Set the Fields To Split Out to books and leave the rest as default.

When you execute the node, the result should look like this:

Configuration of Split Out node

Now that you have a list of 20 items (books), you need to extract the title and price from their HTML. To do this, we will use another HTML node with the following parameters:

  • Source Data: JSON
  • JSON Property: books

Extraction Values:

  • Key: title
  • CSS Selector: h3
  • Return Value: Text
  • Key: price
  • CSS Selector: article > div.product_price > p.price_color
  • Return Value: Text
💡 As you can see, we can extract multiple values in the same HTML node. To do this, click the Add Value button under the Extraction Values section.

When you execute the node, the result should look like this:

Configuration of HTML Extract node for each individual book

Step 3: Sort the data

Let’s sort the books by their price in descending order. To do this, we can use the Sort node. Leave the Type as Simple, and under Fields To Sort By set the Field Name to price and the Order to Descending.

When you execute the node, the result should look like this:

Configuration of Sort node for sorting books by price


Step 4: Write the scraped data to a spreadsheet

Now that you've scraped the data in a structured way, you'll want to export it as a CSV file. To do this, use the Convert to File node. From the Operation dropdown select Convert to CSV, and set Put Output File in Field to data.

When you execute the node, the result should look like this:

Configuration of Convert to CSV File node

And it’s as simple as that! You now have a CSV file that you can download or, even better, send by email.

Step 5: Send the data set via email

To send the CSV file as an email attachment, we can use the Gmail node with the following parameters:

  • Resource: Message
  • Operation: Send
  • To: email address you want to send to
  • Subject: bookstore csv (or any other subject line you'd like)
  • Message: Hey, here's the scraped data from the online bookstore!

And for Options / Attachments set:

  • Attached Field Name: data (this is the name of the binary property set in the Convert to File node)

When you execute the node, the result should look like this:

Configuration of Gmail node

Now, just check your inbox – you should receive an email with the CSV file!

Step 6: Save the data to Google Sheets or Microsoft Excel

Another option is to store the scraped data in a Google Sheet. You can do this using the Google Sheets node, set up like so:

  • Credential to connect with: your Google Cloud account credentials.
  • Resource: Sheet Within Document
  • Operation: Append Row
  • Document: pick a document from the dropdown list. Alternatively you can select a sheet by URL or ID.
  • Sheet: pick the sheet from the dropdown list. Alternatively, you can select a sheet by URL, ID, or name.
  • Mapping Column Mode: use Map Automatically for most cases.
Configuration of Google Sheets node

After executing the node, you should see all the books and their prices in the Google Sheet.

Google Sheet containing scraped data

If you prefer working with Microsoft Excel, n8n seamlessly integrates with it as well: simply use the Microsoft Excel node in place of the Google Sheets node.

This requires a bit of extra setup: first, you need to set up Microsoft Azure credentials with Microsoft Graph permissions.

Next, create an Excel workbook and grab its ID from the URL. The URL should look something like https://onedrive.live.com/edit?id=XXXXXXXXXXXXXXXX&. The ID is located between id= and &.
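If you prefer not to pick the ID out of the URL by hand, you can also parse it from the query string. A quick sketch, using the placeholder URL from above:

// Extract the workbook ID from the OneDrive URL's query string
const oneDriveUrl = 'https://onedrive.live.com/edit?id=XXXXXXXXXXXXXXXX&';
const workbookId = new URL(oneDriveUrl).searchParams.get('id');
console.log(workbookId); // XXXXXXXXXXXXXXXX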

Unlike Google Sheets, where we didn't need any particular format, here we need to create the column headers first; otherwise, appending to the sheet doesn't work. In this case, we add title and price as column headers. The empty workbook should look something like this:

Excel Sheet structure

Next, we need to configure the Microsoft Excel node like so:

  • Resource: Sheet
  • Operation: Append
  • Workbook (By ID): Workbook ID
  • Sheet: select a sheet from the list
  • Data Mode: Auto-Map Input Data to Columns

Configuration of Microsoft Excel node

After running the node, you should see the data appended to the Excel sheet:

Excel sheet containing scraped data


Bonus: Can ChatGPT scrape the web?

While n8n's web scraping capabilities are impressive on their own, its integration with AI tools like ChatGPT opens up a world of possibilities.

In this section, we'll explore a workflow that scrapes data and leverages AI to summarize the content.

Here is an overview of a workflow that scrapes and summarizes web pages with AI:

Web scraping with ChatGPT workflow overview

Step 1: Fetch the page content

Similarly to the previous example, we start by fetching a web page, in this case a page that lists the essays we want to summarize. First, we use the HTTP Request node to grab the contents of the page, like so:

Configuration of the HTTP Request node

Step 2: Get the first 3 essays

Then we use the HTML Extract node to parse the resulting HTML and grab the links to the essays using the table table a CSS selector.

Configuration of the HTML Extract node

Similar to the previous workflow, we use the Split Out node to process the essays one by one. Set the Fields To Split Out to essay and leave the rest as default.

Configuration of the Split Out node

Then, using the Limit node, we can grab just the first 3 essays:

Configuration of the Limit node

Step 3: Extract the essay text and title

The previous node gives us just the page URL of each essay. Using another HTTP Request node we can then access the page and extract the content there:

Use HTTP Request node to fetch contents of each essay

After we get the content from each of the 3 essays, we want to extract the title of each essay. This information will be useful when composing the final result. For this, we will use another HTML Extract node:

Extracting the title of each essay with the HTML Extract node

Step 4: Send the text to ChatGPT for summarization

This is where AI magic comes in. The Summarization Chain node, a specialized node in n8n, simplifies the process of using AI language models like ChatGPT to generate summaries. We'll connect this node to our workflow, feeding it the extracted essay text. The Summarization Chain node takes care of the complex communication with the AI model, returning a concise summary for each essay.

Here, we can leave everything as default:

Configuration of the AI Summarization Chain node

The Summarization Chain node requires an AI model. You can use the OpenAI integration by setting up the OpenAI credentials and then picking a Model from the list. In this case we opted for gpt-4-turbo:

Configuration of the OpenAI credentials and AI Model

Step 5: Join the title and the summarization

Finally, we'll use the Merge node to combine the original essay titles with their corresponding AI-generated summaries. The result is a neatly structured dataset, ready for further processing or export.

Configuration of the Merge node

Wrap Up

Now you know the basics of web scraping: what it is, how it works, and what to remember when scraping websites. You also learned how to scrape data from a website using JavaScript and an n8n low-code workflow.

With n8n, there’s no need to write lines of code anymore to pull the data you need. You can use n8n to automate your HTML web scraping and send the data to the right places. Moreover, you can add conditional logic, custom code, and connect other integrations to set up workflows that boost your productivity.

What's next?

Ready to take your automation skills to the next level? Here's what you can do next: