Perform Web scraping in Node.js using a scraping browser
Scraping a website can be helpful for various reasons, such as:
- Performing data analysis on the retrieved data to make business decisions.
- Doing market research about the competition.
- Automating tasks that depend on updates to a website's data.
- Retrieving data that isn't exposed through a public API.
It can also extract information that is not easily accessible or available through other means.
However, it is essential to be aware of the legal and ethical implications of scraping, especially when it comes to scraping protected websites.
Scrape protected websites
To protect themselves from harmful activity such as DDoS attacks, bot spamming, and abusive crawling, many websites put defenses in place, such as bot detection and CAPTCHA verification.
The rise of building websites as Single Page Applications (SPAs) using technologies such as React, Angular, Vue.js, Svelte, and SolidJS also makes a website highly dependent on a browser that runs JavaScript.
The consequence is that people who scrape moderately or for legitimate reasons have difficulty scraping websites that fall into these categories.
Having a better scraping tool that can simulate a real browser, solve CAPTCHA challenges, change the IP address per scraping session, and more is paramount to providing a better experience to developers.
The Bright Data Scraping Browser is the right tool for this job, and in this post, we will see how to use it to bypass the website protections that prevent developers from scraping website data.
Create an account
The first step is creating an account on https://brightdata.com; select the preferred registration method.
On the next page, sign up with Google or with a professional email address by filling out the registration form and clicking the "Create Account" button.
On the next page, set a password and click the "Sign up" button.
You will receive an email to verify your account; check your spam folder if it isn't in your inbox. Click on the link, and you will be redirected to the user dashboard page.
The registration is successful, and you are on a free trial subscription. A pay-as-you-go plan allows you to pay only for what you use.
Create a Scraping browser proxy
To use the Scraping Browser, we must create a proxy, which gives us the URL of the remote browser instance that will be used to perform Web scraping.
From the Bright Data user dashboard page, go to the page related to Web scraping products: https://brightdata.com/cp/zones. Scroll to the "Proxy Products" section. Locate the Scraping Browser and click on the "Get Started" button.
On the next page, provide a proxy name and validate.
A modal will request the proxy creation confirmation; click "Yes" to confirm. You will be redirected to the proxy page view.
On this page, you can see the proxy host, the username, and the password. We will use this information later in the Node.js project.
Set up the Node.js project
To set up the project with Node.js and TypeScript, we will use the starter project we built in this tutorial.
Run the command below to clone the project from GitHub and run it locally.
git clone https://github.com/tericcabrel/node-ts-starter.git node-scraping-browser
cd node-scraping-browser
yarn install
yarn start
You will get the following output:
Install an HTTP client library for Node.js
To perform Web scraping, we need an HTTP client library to retrieve the page content. Axios is among the most popular; run the command below to install it:
yarn add axios
We are ready to scrape data from our Node.js project.
Scrape a website requiring JavaScript enabled
SoundCloud is an audio distribution platform used by millions of users. Each month, they publish a list of the 50 most-streamed songs in the world.
On this page, we want to scrape the data about every song on the list. Create a file ./src/scrape-soundcloud.ts
and add the code below:
import axios from 'axios';

(async () => {
  const url = 'https://soundcloud.com/discover/sets/charts-top:all-music';
  const response = await axios.get(url);

  console.log(response.data);
  console.log('Scraping done!');
})();
We use Axios to retrieve the page content and print it in the console; run the file with the command below:
yarn ts-node ./src/scrape-soundcloud.ts
If we look at the output, we see a message indicating JavaScript is disabled.
To fix this issue, we must use a tool that simulates a real browser so that the JavaScript detection on SoundCloud will pass.
We will use the Scraping Browser proxy we created earlier. It integrates perfectly with the Puppeteer core library, which provides a feature to connect to a remote browser, so it will not download a Chromium binary the way the full Puppeteer package does.
Playwright is another great tool for this task; read how to use Playwright with Node.js if you are interested.
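For reference, here is a minimal sketch of what connecting to a remote browser endpoint could look like with Playwright, assuming playwright-core is installed and using the same credentials we will load from environment variables in the next section; treat it as a sketch rather than the path this tutorial follows:

import { chromium } from 'playwright-core';

(async () => {
  // Build the WebSocket endpoint from the proxy credentials (assumed to be the
  // same host, port, username, and password we use with Puppeteer below).
  const auth = `${process.env.PROXY_USERNAME}:${process.env.PROXY_PASSWORD}`;
  const browser = await chromium.connectOverCDP(`wss://${auth}@${process.env.PROXY_HOST}:${process.env.PROXY_PORT}`);

  const page = await browser.newPage();
  await page.goto('https://soundcloud.com/discover/sets/charts-top:all-music');
  console.log(await page.content());

  await browser.close();
})();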
We will use Puppeteer in this tutorial, so let's install the core package:
yarn add puppeteer-core
Load the scraping browser credentials in the application
To connect to our remote Scraping Browser created earlier in the dashboard, we need the following information: the hostname, the port, the username, and the password. This sensitive information should not be hardcoded in the application but instead loaded from an environment configuration file.
The dotenv package helps us manage it easily; let's install it:
yarn add dotenv
Create a file named .env
and add the content below:
PROXY_HOST=<your_scraping_browser_host>
PROXY_PORT=<your_scraping_browser_port>
PROXY_USERNAME=<your_scraping_browser_username>
PROXY_PASSWORD=<your_scraping_browser_password>
Don't forget to add this file to the .gitignore file to exclude it from version control.
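Since process.env values are typed as string | undefined in TypeScript, it can also help to fail fast when a variable is missing. Below is a small, optional helper; the requireEnv name and the src/env.ts location are my own suggestions, not part of any library:

// src/env.ts (hypothetical helper file)
import dotenv from 'dotenv';

dotenv.config();

// Throw at startup instead of failing later with a malformed browser URL.
export const requireEnv = (name: string): string => {
  const value = process.env[name];

  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }

  return value;
};

You could then write const host = requireEnv('PROXY_HOST'); in the scripts below instead of reading process.env.PROXY_HOST directly.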
Scrape SoundCloud using the Scraping browser
Replace the content of the file ./src/scrape-soundcloud.ts
with the code below:
import dotenv from 'dotenv';
import puppeteer from 'puppeteer-core';

dotenv.config();

(async () => {
  // Build the WebSocket endpoint of the remote Scraping Browser from the credentials.
  const auth = `${process.env.PROXY_USERNAME}:${process.env.PROXY_PASSWORD}`;
  const browserURL = `wss://${auth}@${process.env.PROXY_HOST}:${process.env.PROXY_PORT}`;

  let browser;

  try {
    // Connect to the remote browser instead of launching a local Chromium instance.
    browser = await puppeteer.connect({ browserWSEndpoint: browserURL });

    const page = await browser.newPage();
    page.setDefaultNavigationTimeout(2 * 60 * 1000);

    await page.goto('https://soundcloud.com/discover/sets/charts-top:all-music');

    // Wait until the track list is rendered before reading the page content.
    await page.waitForSelector('.trackItem__numberWrapper', { visible: true });

    const html = await page.content();

    console.log(html);
  } catch (e) {
    console.error('run failed', e);
  } finally {
    await browser?.close();
  }
})();
Execute the file with the command yarn ts-node ./src/scrape-soundcloud.ts
, and this time we get the content of the page as the screenshot below shows:
One line of code I want to highlight is the following:
await page.waitForSelector('.trackItem__numberWrapper', { visible: true });
This tells Puppeteer to wait until the element with the class name trackItem__numberWrapper
is visible before returning the page content. This is useful for SPA pages where there is a time gap between the first content loaded and the content the user can interact with.
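If the selector never appears, waitForSelector eventually throws a timeout error (after 30 seconds by default). You can tune how long it waits, or rely on network activity instead; both variants below use standard Puppeteer options, with the values chosen as examples:

// Wait up to 60 seconds for the element instead of the default 30.
await page.waitForSelector('.trackItem__numberWrapper', { visible: true, timeout: 60_000 });

// Alternatively, consider the navigation done once the network has been
// mostly idle, which works well for many SPA pages.
await page.goto('https://soundcloud.com/discover/sets/charts-top:all-music', {
  waitUntil: 'networkidle2',
});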
You can now use an HTML parser library such as Cheerio to extract the content; I show how to do that in the blog post below.
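As a quick illustration, here is a minimal sketch of extracting the track titles with Cheerio (after running yarn add cheerio). The .trackItem__trackTitle selector and the extractTrackTitles helper name are assumptions made for the example, so inspect the real markup and adjust the selector accordingly:

import * as cheerio from 'cheerio';

// `html` is the string returned by page.content() in the previous script.
export const extractTrackTitles = (html: string): string[] => {
  const $ = cheerio.load(html);

  // NOTE: this selector is an assumption; verify it against the page's markup.
  return $('.trackItem__trackTitle')
    .map((_, element) => $(element).text().trim())
    .get();
};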
Scrape a website with a ReCaptcha login
Zillow is a popular online real estate marketplace that provides information about homes for sale, apartments for rent, and home values. One of the most visited pages is the one showing homes for sale.
If we try to scrape this page with Axios using the code below:
import axios from 'axios';

(async () => {
  const response = await axios.get('https://www.zillow.com/homes/for_sale');

  console.log(response.data);
})();
We get the following output, where we can see the website asking us to verify we aren't a bot by validating a ReCaptcha.
Let's create a new file called scrape-zillow.ts
in the folder src
and add the code below that uses our Scraping browser proxy:
import dotenv from 'dotenv';
import puppeteer from 'puppeteer-core';

dotenv.config();

(async () => {
  const auth = `${process.env.PROXY_USERNAME}:${process.env.PROXY_PASSWORD}`;
  const browserURL = `wss://${auth}@${process.env.PROXY_HOST}:${process.env.PROXY_PORT}`;

  let browser;

  try {
    browser = await puppeteer.connect({ browserWSEndpoint: browserURL });

    const page = await browser.newPage();
    page.setDefaultNavigationTimeout(2 * 60 * 1000);

    await page.goto('https://www.zillow.com/homes/for_sale');

    const html = await page.content();

    console.log(html);
  } catch (e) {
    console.error('run failed', e);
  } finally {
    await browser?.close();
  }
})();
Run the file with the following command: yarn ts-node ./src/scrape-zillow.ts.
We can see the page's content is now retrieved successfully, meaning the Scraping Browser doesn't just simulate a real browser but also performs the ReCaptcha validation for us.
Wrap up
The Scraping Browser product from Bright Data is a great tool to perform Web scraping on websites with advanced protections such as bot detection, ReCaptcha verification, and CDN-based DDoS protection.
You can create an instance of a real browser and then connect to it using the Puppeteer core library to perform Web scraping. You can create many instances and use them simultaneously.
The Scraping Browser handles the ReCaptcha validation, the IP rotation, the SPA page loading, and much more for you.
The pricing plan allows you to pay only for what you use, which is good for cost efficiency and optimization.
You can find the source code in the GitHub repository.
Follow me on Twitter or subscribe to my newsletter to avoid missing the upcoming posts and the tips and tricks I occasionally share.