Perform Web scraping in Node.js using a scraping browser

Photo by Conny Schneider / Unsplash

Scraping a website can be helpful for various reasons, such as:

  • Performing data analysis on the retrieved data to make business decisions.
  • Doing market research about the competition.
  • Automating tasks that depend on updates to a website page.
  • Retrieving data that isn't exposed through a public API.
  • Etc.

It can also extract information that is not easily accessible or available through other means.

However, it is essential to be aware of the legal and ethical implications of scraping, especially when it comes to scraping protected websites.

Scrape protected websites

To protect themselves from harmful activities such as DDoS attacks, bot spamming, and abusive crawling, many websites put protections in place, such as bot detection and captcha verification.

The rise of Single Page Applications (SPA) built with technologies such as React, Angular, Vue.js, Svelte, SolidJS, etc., also makes many websites dependent on a browser with JavaScript enabled to render their content.

The consequence is that people who use Web scraping moderately or for legitimate reasons have difficulty scraping websites that fall into these categories.

Having a scraping tool that can simulate a real browser, solve captcha challenges, change the IP address per scraping session, and more is paramount to providing a better experience to developers.

The Bright Data Scraping Browser is the right tool for this job, and in this post, we will see how to use it to bypass the website protections that prevent developers from scraping website data.

Scraping Browser - Automated Browser for Scraping
Scraping Browser is a GUI browser designed for web scraping. An automated browser that is experienced as a headless browser, interacting with the Puppeteer or Playwright API.

Create an account

The first step is creating an account on https://brightdata.com; select the preferred registration method.

Sign up from the Bright Data homepage.

On the next page, sign up with Google or with a professional email address by filling out the registration form and clicking the "Create Account" button.

Sign up with a professional email address.

On the next page, set a password and click the "Sign up" button.

Set a password for the registered account.

You will receive an email to verify your account; check your spam folder if it isn't in your inbox. Click on the link, and you will be redirected to the user dashboard page.

The Bright Data user dashboard page.

The registration is successful, and you are on a free trial subscription. A pay-as-you-go plan allows you to pay only for what you use.

Create a Scraping browser proxy

To use the Scraping Browser, we must create a proxy, which provides the URL of the browser instance that will be used to perform Web scraping.

From the Bright Data user dashboard page, go to the page related to Web scraping products: https://brightdata.com/cp/zones. Scroll to the "Proxy Products" section. Locate the Scraping Browser and click on the "Get Started" button.

Browse the Bright Data proxy products list.

On the next page, provide a proxy name and validate.

Add a new Scraping Browser proxy.

A modal will request the proxy creation confirmation; click "Yes" to confirm. You will be redirected to the proxy page view.

View the Scraping Browser proxy page.

On this page, you can see the proxy host, the username, and the password. We will use this information later in the Node.js project.

Set up the Node.js project

To set up the project with Node.js and TypeScript, we will use the starter project we built in this tutorial.

Run the commands below to clone the project from GitHub and run it locally.


git clone https://github.com/tericcabrel/node-ts-starter.git node-scraping-browser

cd node-scraping-browser

yarn install

yarn start

You will get the following output:

Set up and run the Node.js project locally.

Install an HTTP Node.js library

To perform Web scraping, we need an HTTP client library to retrieve the page content. Axios is among the most popular; run the command below to install it:


yarn add axios

We are ready to scrape data from our Node.js project.

Scrape a website that requires JavaScript to be enabled

SoundCloud is an audio distribution platform used by millions of users. Each month, they publish a list of the 50 most-streamed tracks in the world.

The SoundCloud top 50 all music genres list.

On this page, we want to scrape the data about every song on the list. Create a file ./src/scrape-soundclound.ts and add the code below:


import axios from 'axios';

(async () => {
  const url = 'https://soundcloud.com/discover/sets/charts-top:all-music';

  const response = await axios.get(url);

  console.log(response.data);

  console.log('Scraping done!');
})();

We use Axios to retrieve the page content and print it in the console. Run the command below to execute the file:


yarn ts-node ./src/scrape-soundclound.ts

If we look at the output, we see a message indicating JavaScript is disabled.

SoundCloud page content when JavaScript is disabled.

To fix this issue, we must use a tool that simulates a real browser so that the JavaScript detection on SoundCloud will pass.

We will use the Scraping Browser proxy we created earlier. It integrates perfectly with the Puppeteer core library, which provides a feature to connect to a remote browser, so it will not download a Chromium instance as the full Puppeteer package does.

Playwright is another great tool for this task; read how to use Playwright with Node.js if you are interested.

We will use Puppeteer in this tutorial, so let's install the core package:


yarn add puppeteer-core

Load the scraping browser credentials in the application

To connect to our remote Scraping Browser created earlier in the dashboard, we need the following information: the hostname, the port, the username, and the password. This sensitive information should not be hardcoded in the application but instead loaded from an environment configuration file.

The package Dotenv helps us manage it easily; let's install it:


yarn add dotenv

Create a file named .env and add the content below:


PROXY_HOST=<your_scraping_browser_host>
PROXY_PORT=<your_scraping_browser_port>
PROXY_USERNAME=<your_scraping_browser_username>
PROXY_PASSWORD=<your_scraping_browser_password>

Don't forget to add this file name to the .gitignore file to exclude it from version control.
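If you want to fail early when one of these variables is missing, here is a minimal sketch of a validation helper; the file name ./src/env.ts and the function getProxyConfig are hypothetical and not part of the starter project.


// ./src/env.ts (hypothetical helper, not part of the starter project)
import dotenv from 'dotenv';

dotenv.config();

export type ProxyConfig = {
  host: string;
  port: string;
  username: string;
  password: string;
};

// Throws at startup instead of failing later with a cryptic connection error.
export const getProxyConfig = (): ProxyConfig => {
  const { PROXY_HOST, PROXY_PORT, PROXY_USERNAME, PROXY_PASSWORD } = process.env;

  if (!PROXY_HOST || !PROXY_PORT || !PROXY_USERNAME || !PROXY_PASSWORD) {
    throw new Error('Missing PROXY_HOST, PROXY_PORT, PROXY_USERNAME, or PROXY_PASSWORD in the .env file');
  }

  return { host: PROXY_HOST, port: PROXY_PORT, username: PROXY_USERNAME, password: PROXY_PASSWORD };
};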

Scrape SoundCloud using the Scraping browser

Replace the content of the file ./src/scrape-soundclound.ts with the code below:


import dotenv from 'dotenv';
import puppeteer from 'puppeteer-core';

dotenv.config();

(async () => {
  const auth = `${process.env.PROXY_USERNAME}:${process.env.PROXY_PASSWORD}`;
  const browserURL = `wss://${auth}@${process.env.PROXY_HOST}:${process.env.PROXY_PORT}`;

  let browser;

  try {
    browser = await puppeteer.connect({ browserWSEndpoint: browserURL });
    const page = await browser.newPage();

    page.setDefaultNavigationTimeout(2 * 60 * 1000);

    await page.goto('https://soundcloud.com/discover/sets/charts-top:all-music');

    await page.waitForSelector('.trackItem__numberWrapper', { visible: true });

    const html = await page.content();

    console.log(html);
  } catch (e) {
    console.error('run failed', e);
  } finally {
    await browser?.close();
  }
})();

Execute the file with the command yarn ts-node ./src/scrape-soundclound.ts, and this time we get the content of the page as the screenshot below shows:

Content of the SoundCloud page retrieved using the Scraping Browser proxy.

One line of code I want to highlight is the following:


await page.waitForSelector('.trackItem__numberWrapper', { visible: true });

This tells Puppeteer to wait until the element with the class name trackItem__numberWrapper is visible before we read the page content. This is useful for SPA pages where there is a time gap between the initially loaded content and the content the user can interact with.
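If the selector never appears (for example, after SoundCloud changes its markup), the wait eventually throws; here is a small sketch of bounding it explicitly with the timeout option, where the 30-second value is an arbitrary choice:


// Bound the wait explicitly instead of relying on the default timeout
await page.waitForSelector('.trackItem__numberWrapper', {
  visible: true,
  timeout: 30_000, // 30 seconds; adjust to your needs
});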

You can now use an HTML parser library such as Cheerio to extract the content; I show how to do that in the blog post below.

Learn how to do Web scraping in Node.js - Wikipedia case
In this tutorial, we will see how to perform Web scraping on a website using Node.js and TypeScript. We will store the data in MongoDB.
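As a quick illustration, here is a minimal sketch of extracting the track titles from the HTML we just retrieved, using Cheerio (installed with yarn add cheerio); the .trackItem__trackTitle selector is an assumption about SoundCloud's markup and may need adjusting.


import * as cheerio from 'cheerio';

// `html` is the string returned by page.content() in the script above.
export const extractTrackTitles = (html: string): string[] => {
  const $ = cheerio.load(html);

  // The selector is an assumption; inspect the page in your browser's dev tools
  // and adjust it if SoundCloud's class names differ.
  return $('.trackItem__trackTitle')
    .map((_, element) => $(element).text().trim())
    .get();
};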

Scrape a website with a ReCaptcha login

Zillow is a popular online real estate marketplace that provides information about homes for sale, apartments for rent, and home values. One of the most visited pages is the one showing homes for sale.

The Zillow page showing homes for sale.

If we try to scrape this page with Axios using the code below:


import axios from 'axios';

(async () => {
  const response = await axios.get('https://www.zillow.com/homes/for_sale');

  console.log(response.data);
})();

We get the following output, where we can see the website asking us to verify we aren't a bot by solving a ReCaptcha.

Cannot get the Zillow content because of the ReCaptcha verification.

Let's create a new file called scrape-zillow.ts in the folder src and add the code below that uses our Scraping browser proxy:


import dotenv from 'dotenv';
import puppeteer from 'puppeteer-core';

dotenv.config();

(async () => {
  const auth = `${process.env.PROXY_USERNAME}:${process.env.PROXY_PASSWORD}`;
  const browserURL = `wss://${auth}@${process.env.PROXY_HOST}:${process.env.PROXY_PORT}`;

  let browser;

  try {
    browser = await puppeteer.connect({ browserWSEndpoint: browserURL });
    const page = await browser.newPage();

    page.setDefaultNavigationTimeout(2 * 60 * 1000);

    await page.goto('https://www.zillow.com/homes/for_sale');

    const html = await page.content();

    console.log(html);
  } catch (e) {
    console.error('run failed', e);
  } finally {
    await browser?.close();
  }
})();

Run the file with the following command: yarn ts-node ./src/scrape-zillow.ts.

We can see the page's content is now retrieved successfully, meaning the Scraping Browser doesn't just simulate a real browser but also performs the ReCaptcha validation for us.

HTML content of the Zillow page retrieved by the Scraping Browser.

Wrap up

The Scraping Browser product from Bright Data is a great tool for performing Web scraping on websites with advanced protections such as bot detection, ReCaptcha verification, DDoS protection through a CDN, etc.

You can create an instance of a real browser and then connect to it using the Puppeteer core library to perform Web scraping. You can create many instances and use them simultaneously.
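For example, here is a minimal sketch of scraping several pages in parallel, each through its own connection to the Scraping Browser; the URLs are placeholders.


import dotenv from 'dotenv';
import puppeteer from 'puppeteer-core';

dotenv.config();

const auth = `${process.env.PROXY_USERNAME}:${process.env.PROXY_PASSWORD}`;
const browserURL = `wss://${auth}@${process.env.PROXY_HOST}:${process.env.PROXY_PORT}`;

// Scrape one URL through its own connection to the Scraping Browser
const scrape = async (url: string): Promise<string> => {
  const browser = await puppeteer.connect({ browserWSEndpoint: browserURL });

  try {
    const page = await browser.newPage();
    await page.goto(url);

    return await page.content();
  } finally {
    await browser.close();
  }
};

(async () => {
  // Placeholder URLs; replace them with the pages you want to scrape
  const urls = ['https://example.com/page-1', 'https://example.com/page-2'];

  const pages = await Promise.all(urls.map((url) => scrape(url)));

  console.log(pages.map((html) => html.length));
})();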

The Scraping Browser handles the ReCaptcha validation, the IP rotation, the SPA page loading, and much more for you.

The Pricing plan allows you to pay only for what you use, which is good for cost efficiency and optimization.

You can find the source code in the GitHub repository.

Follow me on Twitter or subscribe to my newsletter to avoid missing the upcoming posts and the tips and tricks I occasionally share.