
Today, many companies expose APIs that are consumed by applications built by other companies or developers. The goal is to help them build more features and to give them flexibility. Sometimes, there is no API that exposes the data a feature of your application needs, yet the data is available on a website (the company's own or another). In this case, we can use Web scraping to retrieve it.

In this tutorial, we will see how to do that with Node.js. As a use case, I recently needed data about all programming languages, but I didn't find an API that provides it.

What we will use

  • Axios: Get the HTML content of a page through its URL.
  • Cheerio: Parse the HTML content to retrieve the data needed.
  • Mongoose: Store the extracted data in a MongoDB database.
  • Express: Create an endpoint that returns the languages stored in the database in JSON format.

Set up the project

To start, we will use the boilerplate for Node.js projects that we built in this tutorial. The branch express-mongo has Express and Mongoose already installed, so we can focus on the web scraping part.

git clone https://github.com/tericcabrel/node-ts-starter.git -b express-mongo node-web-scraping

cd node-web-scraping

cp .env.example .env

nano .env
# Enter database credentials for your local environment, save and exit

yarn install

yarn start

Now that we have a working project, let's continue by installing the libraries needed for web scraping.

yarn add axios cheerio
yarn add -D @types/cheerio

Note: There is no need to install type definitions for Axios; they are included in the library.

Scrape the content

We will use Axios to get the HTML content of the page. For that, let's create a file called scraper.ts inside the folder src, then add the code below:

import axios from 'axios';

const PAGE_URL = 'https://en.wikipedia.org/wiki/Timeline_of_programming_languages';

const scraper = async () => {
  const response = await axios.get(PAGE_URL);

  console.log(response.data);
};

(async () => {
  await scraper();
})();

As you can see, getting the content of the page is very straightforward. Let's run this code to see the output:

ts-node src/scraper.ts

We got the output below:
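(Truncated and illustrative; the exact markup changes as the page evolves.)

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Timeline of programming languages - Wikipedia</title>
...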

It is a huge, hard-to-read chunk of HTML, and this is where Cheerio will help us select the data we need.

Retrieve the data from the page content

For Cheerio to get the data, we need to provide the selector of the HTML element that holds the data we want. The only way to find it is to analyze the page structure, but the page is huge and full of data we don't need. So, the first step is to define which data we want from the page.

The picture below shows which data we want to retrieve from the page.

Now that we have identified the data we want, it can be translated into the TypeScript type below.

type ProgrammingLanguage = {
  yearCategory: string;
  year: number;
  name: string;
  author: string;
  predecessors: string[];
};

Beware of edge cases in the data type

One essential thing is to analyze the data to make sure the type you selected is right. Previously, we defined the year of creation as a number, but if we pay attention, there is a problem with this type:

  • Some programming languages were released over a range of years.
  • The release year of some programming languages isn't confirmed.

Let's update our type to handle these cases:

type ProgrammingLanguage = {
  yearCategory: string;
  year: number[];
  yearConfirmed: boolean;
  name: string;
  author: string;
  predecessors: string[];
};
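As an illustration, here is how the two edge cases map to this type (the values below are examples, not guaranteed rows from the page):

// Released over a range of years: raw value "1943–45"
const plankalkul: ProgrammingLanguage = {
  yearCategory: 'Pre-1950',
  year: [1943, 1945],
  yearConfirmed: true,
  name: 'Plankalkül (concept)',
  author: 'Konrad Zuse',
  predecessors: ['—'],
};

// Release year not confirmed: raw value "1948?" (hypothetical entry)
const unconfirmedLanguage: ProgrammingLanguage = {
  yearCategory: 'Pre-1950',
  year: [1948],
  yearConfirmed: false,
  name: 'Some language',
  author: 'Unknown',
  predecessors: [],
};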

Find the selectors to retrieve the data

We now know the part of the page we want to retrieve. Let's analyze the page structure to find our selector:

From the picture, we can notice a pattern:

The year category is inside an <h2> tag. The next tag is a <table>, where the data we want lives inside the <tbody> tag, in the following order:

  • Year: the first column
  • Name: the second column
  • Author: the third column
  • Predecessors: the fourth column
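To make this pattern concrete, here is a minimal, runnable sketch that applies it to simplified markup (illustrative HTML, not the exact Wikipedia structure):

import * as cheerio from 'cheerio';

// Simplified markup following the pattern described above
const sample = `
  <h2><span>Pre-1950</span></h2>
  <table>
    <tbody>
      <tr><td>1943–45</td><td>Plankalkül</td><td>Konrad Zuse</td><td>—</td></tr>
    </tbody>
  </table>`;

const $ = cheerio.load(sample);
const header = $('h2').first();

console.log(header.children('span').text()); // Pre-1950
console.log(header.next('table').find('td').eq(1).text()); // Plankalkül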

Now we have everything we need to retrieve our data with Cheerio. Update scraper.ts with the code below:

import axios from 'axios';
import * as cheerio from 'cheerio';

const PAGE_URL = 'https://en.wikipedia.org/wiki/Timeline_of_programming_languages';

type ProgrammingLanguage = {
  yearCategory: string;
  year: number[];
  yearConfirmed: boolean;
  name: string;
  author: string;
  predecessors: string[];
};

// Parse the raw "Year" value into an array of years:
// "1964" -> [1964], "1943–45" -> [1943, 1945], "1948?" -> [1948]
const formatYear = (input: string) => {
  // Years in a range are separated by an en dash (–)
  const array = input.split('–');

  if (array.length < 2) {
    // Single year: keep the first four characters to drop a trailing "?"
    return [+input.substr(0, 4)];
  }

  // A shortened range end (e.g., "45") gets the century of the start year
  return [+array[0], +(array[1].length < 4 ? `${array[0].substr(0, 2)}${array[1]}` : array[1])];
};

const retrieveData = (content: string) => {
  const $ = cheerio.load(content);

  const headers = $('body h2');

  const languages: ProgrammingLanguage[] = [];

  for (let i = 0; i < headers.length; i++) {
    const header = headers.eq(i);
    const table = header.next('table');

    // Skip headings that are not immediately followed by a table
    if (!table.is('table')) {
      continue;
    }

    const yearCategory = header.children('span').first().text();
    const tableRows = table.children('tbody').children('tr');

    for (let j = 0; j < tableRows.length; j++) {
      const rowColumns = tableRows.eq(j).children('td');
      const name = rowColumns.eq(1).text().replace('\n', '');

      // Skip rows without a name (e.g., the table's header row)
      if (!name) {
        continue;
      }

      const language: ProgrammingLanguage = {
        author: rowColumns.eq(2).text().replace('\n', ''),
        name,
        predecessors: rowColumns
          .eq(3)
          .text()
          .split(',')
          .map((value) => value.trim()),
        year: formatYear(rowColumns.eq(0).text()),
        yearConfirmed: !rowColumns.eq(0).text().endsWith('?'),
        yearCategory,
      };

      languages.push(language);
    }
  }

  return languages;
};

const scraper = async () => {
  const response = await axios.get(PAGE_URL);

  const languages = retrieveData(response.data);

  console.log(languages);
};

(async () => {
  await scraper();
})();

Run the code to see the result:
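The console output should look something like this (abridged, with illustrative values):

[
  {
    author: 'Konrad Zuse',
    name: 'Plankalkül (concept)',
    predecessors: [ '—' ],
    year: [ 1943, 1945 ],
    yearConfirmed: true,
    yearCategory: 'Pre-1950'
  },
  ... more items
]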

Save the data in the database

Now that we have retrieved our data, we can save it in MongoDB. For that, we need to create the model; check my tutorial to see how to create a model for MongoDB.

Create a folder called models, then create a file language.ts inside. Add the code below:

import mongoose, { Model, Schema, Document } from 'mongoose';

type LanguageDocument = Document & {
  yearCategory: string;
  year: number[];
  yearConfirmed: boolean;
  name: string;
  author: string;
  predecessors: string[];
};

const languageSchema = new Schema(
  {
    name: {
      type: Schema.Types.String,
      required: true,
      index: true,
    },
    yearCategory: {
      type: Schema.Types.String,
      required: true,
      index: true,
    },
    year: {
      type: [Schema.Types.Number],
      required: true,
    },
    yearConfirmed: {
      type: Schema.Types.Boolean,
      required: true,
    },
    author: {
      type: Schema.Types.String,
    },
    predecessors: {
      type: [Schema.Types.String],
      required: true,
    },
  },
  {
    collection: 'languages',
    timestamps: true,
  },
);

const Language: Model<LanguageDocument> = mongoose.model<LanguageDocument>('Language', languageSchema);

export { Language, LanguageDocument };

Now, let's update our scraper method to insert the languages into the database.

const scraper = async () => {
  const response = await axios.get(PAGE_URL);

  const languages = retrieveData(response.data);

  await connectToDatabase();

  const insertPromises = languages.map(async (language) => {
    const isPresent = await Language.exists({ name: language.name });

    if (!isPresent) {
      await Language.create(language);
    }
  });

  await Promise.all(insertPromises);

  console.log('Data inserted successfully!');
};
Update the scraper() method inside the file scraper.ts
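Note: scraper.ts now needs two more imports at the top: the Language model (e.g., import { Language } from './models/language';) and the connectToDatabase() helper that the boilerplate provides (the exact import path depends on the boilerplate's structure).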

Run the code, wait for the execution to complete, then check your database to verify that the data has been inserted as expected.

Note: The unique constraint is not applied to the language name because some programming languages share the same name, such as Short Code. I don't know if it is a mistake, but for now, in Wikipedia we trust 🙌.
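For reference, here is what enforcing uniqueness at the schema level would look like; we deliberately don't use it in our model because of those duplicates:

import { Schema } from 'mongoose';

// For reference only: a unique index on the name field
const uniqueNameSchema = new Schema({
  name: {
    type: Schema.Types.String,
    required: true,
    unique: true, // would reject the second "Short Code" entry
  },
});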

Create an endpoint to retrieve data

The final part is to create a route that returns the languages stored in the database. Let's add the code below inside the file index.ts (remember to import the Language model there as well).

app.get('/languages', async (req, res) => {
  const languages = await Language.find().sort({ name: 1 }).exec();

  return res.json({ data: languages });
});

We retrieve all the data, ordered by name in ascending order.

Start the application with yarn start, then navigate to http://localhost:4500/languages in your browser.
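You can also query the endpoint from a terminal; the response shape (abridged) looks like this:

curl http://localhost:4500/languages
# { "data": [ { "name": "...", "year": [ ... ], "author": "...", ... } ] }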

Yesss 🎉

Caveats on Web scraping

  • Always check whether there is already an API that provides the data you need, to avoid unnecessary work.
  • The code that retrieves the data is tightly coupled to the HTML structure of the page, meaning that if the structure changes, you have to update your code.
  • Some companies forbid scraping their website, so before doing it, always check that you are allowed to.

Going Further

We have seen how to scrape data from a website with static content, but this method will not work on a website with dynamic content (SPA). In this case, Puppeteer is a good tool for the job.
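As a quick preview, here is a minimal sketch (assuming Puppeteer is installed with yarn add puppeteer) of how to get the fully rendered HTML of a dynamic page; the result can then be passed to cheerio.load() as we did above:

import puppeteer from 'puppeteer';

const scrapeDynamicPage = async (url: string): Promise<string> => {
  // Launch a headless browser and open a new tab
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Wait until network activity settles so dynamic content is rendered
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Grab the rendered HTML, then clean up
  const html = await page.content();
  await browser.close();

  return html;
};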

Also, check this link to see some problems you can face while doing Web scraping and how to avoid them.

Find the final source code of this tutorial here.

I hope you found it interesting, and see you in the next tutorial 😉.