
Today, many companies expose APIs that are consumed by applications built by other companies or developers. The goal is to help them build more features and to give them flexibility. Sometimes, there is no API that exposes the data a feature of your application needs, yet the data is available on a website (the company's own or another). In this case, we can use Web scraping to retrieve it.

In this tutorial, we will see how to do that with Node.js. As a use case, I recently needed data about all programming languages, but I didn't find an API that provides it.

What we will use

  • Axios: Get the HTML content of a page through its URL.
  • Cheerio: Parse the HTML content to retrieve the data needed.
  • Mongoose: Store the extracted data in a MongoDB database.
  • Express: Create an endpoint that returns the languages stored in the database in JSON format.

Set up the project

To start, we will use the boilerplate for Node.js projects that we built in this tutorial. The branch express-mongo has Express and Mongoose already installed, so we can focus on the web scraping part.

git clone https://github.com/tericcabrel/node-ts-starter.git -b express-mongo node-web-scraping

cd node-web-scraping

cp .env.example .env

nano .env
# Enter database credentials for your local environment, save and exit

yarn install

yarn start

Now that we have a working project, let's continue by installing the libraries needed for web scraping.

yarn add axios cheerio
yarn add -D @types/cheerio

Note: There is no need to install type definitions for Axios; they are included in the library.

Scrape the content

We will use Axios to get the HTML content of the page. For that, let's create a file called scraper.ts inside the folder src, then add the code below:

import axios from 'axios';

const PAGE_URL = 'https://en.wikipedia.org/wiki/Timeline_of_programming_languages';

const scraper = async () => {
  const response = await axios.get(PAGE_URL);

  console.log(response.data);
};

(async () => {
  await scraper();
})();

As you can see, getting the content of the page is very straightforward. Let's run this code to see the output:

ts-node src/scraper.ts

We got the output below:
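(Truncated and illustrative; the exact markup changes as the page evolves.)

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Timeline of programming languages - Wikipedia</title>
...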

It is a huge, hard-to-read chunk of HTML, and this is where Cheerio will help us select the data we need.

Retrieve the data from the page content

For Cheerio to get the data, we need to provide the selector of the HTML element that holds the data we want. The only way to find it is to analyze the page structure, but the page is huge and full of data we don't need. So, the first step is to define which data we want from the page.

The picture below shows which data we want to retrieve from the page.

Now that we have identified the data we want, it can be translated into the TypeScript type below.

type ProgrammingLanguage = {
  yearCategory: string;
  year: number;
  name: string;
  author: string;
  predecessors: string[];
};

Beware of edge cases in the data type

One essential thing is to analyze the data to make sure the type you selected is right. Previously, we defined the year of creation as a number, but if we pay attention, there is a problem with this type:

  • Some programming languages were released over a range of years.
  • The release year of some programming languages isn't confirmed.

Let's update our type to handle these cases:

type ProgrammingLanguage = {
  yearCategory: string;
  year: number[];
  yearConfirmed: boolean;
  name: string;
  author: string;
  predecessors: string[];
};
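As an illustration, here is how the two edge cases map to this type (the values below are examples, not guaranteed rows from the page):

// Released over a range of years: raw value "1943–45"
const plankalkul: ProgrammingLanguage = {
  yearCategory: 'Pre-1950',
  year: [1943, 1945],
  yearConfirmed: true,
  name: 'Plankalkül (concept)',
  author: 'Konrad Zuse',
  predecessors: ['—'],
};

// Release year not confirmed: raw value "1948?" (hypothetical entry)
const unconfirmedLanguage: ProgrammingLanguage = {
  yearCategory: 'Pre-1950',
  year: [1948],
  yearConfirmed: false,
  name: 'Some language',
  author: 'Unknown',
  predecessors: [],
};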

Find the selectors to retrieve the data

We now know the part of the page we want to retrieve. Let's analyze the page structure to find our selector:

From the picture, we can notice a pattern:

The year category is inside an <h2> tag. The next tag is a <table>, where the data we want lives inside the <tbody> tag, in the following order:

  • Year: the first column
  • Name: the second column
  • Author: the third column
  • Predecessors: the fourth column
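To make this pattern concrete, here is a minimal, runnable sketch that applies it to simplified markup (illustrative HTML, not the exact Wikipedia structure):

import * as cheerio from 'cheerio';

// Simplified markup following the pattern described above
const sample = `
  <h2><span>Pre-1950</span></h2>
  <table>
    <tbody>
      <tr><td>1943–45</td><td>Plankalkül</td><td>Konrad Zuse</td><td>—</td></tr>
    </tbody>
  </table>`;

const $ = cheerio.load(sample);
const header = $('h2').first();

console.log(header.children('span').text()); // Pre-1950
console.log(header.next('table').find('td').eq(1).text()); // Plankalkül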

Now we have everything we need to retrieve our data with Cheerio. Update scraper.ts with the code below:

import axios from 'axios';
import * as cheerio from 'cheerio';

const PAGE_URL = 'https://en.wikipedia.org/wiki/Timeline_of_programming_languages';

type ProgrammingLanguage = {
  yearCategory: string;
  year: number[];
  yearConfirmed: boolean;
  name: string;
  author: string;
  predecessors: string[];
};

// Parse the raw "Year" value into an array of years:
// "1964" -> [1964], "1943–45" -> [1943, 1945], "1948?" -> [1948]
const formatYear = (input: string) => {
  // Years in a range are separated by an en dash (–)
  const array = input.split('–');

  if (array.length < 2) {
    // Single year: keep the first four characters to drop a trailing "?"
    return [+input.substr(0, 4)];
  }

  // A shortened range end (e.g., "45") gets the century of the start year
  return [+array[0], +(array[1].length < 4 ? `${array[0].substr(0, 2)}${array[1]}` : array[1])];
};

const retrieveData = (content: string) => {
  const $ = cheerio.load(content);

  const headers = $('body h2');

  const languages: ProgrammingLanguage[] = [];

  for (let i = 0; i < headers.length; i++) {
    const header = headers.eq(i);
    const table = header.next('table');

    // Skip headings that are not immediately followed by a table
    if (!table.is('table')) {
      continue;
    }

    const yearCategory = header.children('span').first().text();
    const tableRows = table.children('tbody').children('tr');

    for (let j = 0; j < tableRows.length; j++) {
      const rowColumns = tableRows.eq(j).children('td');
      const name = rowColumns.eq(1).text().replace('\n', '');

      // Skip rows without a name (e.g., the table's header row)
      if (!name) {
        continue;
      }

      const language: ProgrammingLanguage = {
        author: rowColumns.eq(2).text().replace('\n', ''),
        name,
        predecessors: rowColumns
          .eq(3)
          .text()
          .split(',')
          .map((value) => value.trim()),
        year: formatYear(rowColumns.eq(0).text()),
        yearConfirmed: !rowColumns.eq(0).text().endsWith('?'),
        yearCategory,
      };

      languages.push(language);
    }
  }

  return languages;
};

const scraper = async () => {
  const response = await axios.get(PAGE_URL);

  const languages = retrieveData(response.data);

  console.log(languages);
};

(async () => {
  await scraper();
})();

Run the code to see the result:
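The console output should look something like this (abridged, with illustrative values):

[
  {
    author: 'Konrad Zuse',
    name: 'Plankalkül (concept)',
    predecessors: [ '—' ],
    year: [ 1943, 1945 ],
    yearConfirmed: true,
    yearCategory: 'Pre-1950'
  },
  ... more items
]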

Save the data in the database

Now that we have retrieved our data, we can save it in MongoDB. For that, we need to create the model; check my tutorial to see how to create a model for MongoDB.

Create a folder called models, then create a file language.ts inside. Add the code below:

import mongoose, { Model, Schema, Document } from 'mongoose';

type LanguageDocument = Document & {
  yearCategory: string;
  year: number[];
  yearConfirmed: boolean;
  name: string;
  author: string;
  predecessors: string[];
};

const languageSchema = new Schema(
  {
    name: {
      type: Schema.Types.String,
      required: true,
      index: true,
    },
    yearCategory: {
      type: Schema.Types.String,
      required: true,
      index: true,
    },
    year: {
      type: [Schema.Types.Number],
      required: true,
    },
    yearConfirmed: {
      type: Schema.Types.Boolean,
      required: true,
    },
    author: {
      type: Schema.Types.String,
    },
    predecessors: {
      type: [Schema.Types.String],
      required: true,
    },
  },
  {
    collection: 'languages',
    timestamps: true,
  },
);

const Language: Model<LanguageDocument> = mongoose.model<LanguageDocument>('Language', languageSchema);

export { Language, LanguageDocument };

Now, let's update our scraper method to insert the languages into the database.

const scraper = async () => {
  const response = await axios.get(PAGE_URL);

  const languages = retrieveData(response.data);

  await connectToDatabase();

  const insertPromises = languages.map(async (language) => {
    const isPresent = await Language.exists({ name: language.name });

    if (!isPresent) {
      await Language.create(language);
    }
  });

  await Promise.all(insertPromises);

  console.log('Data inserted successfully!');
};
Update the scraper() method inside the file scraper.ts
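Note: scraper.ts now needs two more imports at the top: the Language model (e.g., import { Language } from './models/language';) and the connectToDatabase() helper that the boilerplate provides (the exact import path depends on the boilerplate's structure).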

Run the code, wait for the execution to complete, then check your database to verify that the data has been inserted as expected.

Note: The unique constraint is not applied to the language name because some programming languages share the same name, such as Short Code. I don't know if it is a mistake, but for now, in Wikipedia we trust 🙌.
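For reference, here is what enforcing uniqueness at the schema level would look like; we deliberately don't use it in our model because of those duplicates:

import { Schema } from 'mongoose';

// For reference only: a unique index on the name field
const uniqueNameSchema = new Schema({
  name: {
    type: Schema.Types.String,
    required: true,
    unique: true, // would reject the second "Short Code" entry
  },
});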

Create an endpoint to retrieve data

The final part is to create a route that returns the languages stored in the database. Let's add the code below inside the file index.ts (remember to import the Language model there as well).

app.get('/languages', async (req, res) => {
  const languages = await Language.find().sort({ name: 1 }).exec();

  return res.json({ data: languages });
});

We retrieve all the data, ordered by name in ascending order.

Start the application with yarn start, then navigate to http://localhost:4500/languages in your browser.
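You can also query the endpoint from a terminal; the response shape (abridged) looks like this:

curl http://localhost:4500/languages
# { "data": [ { "name": "...", "year": [ ... ], "author": "...", ... } ] }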

Yesss 🎉

Caveats on Web scraping

  • Always check whether there is already an API that provides the data you need, to avoid unnecessary work.
  • The code that retrieves the data is tightly coupled to the HTML structure of the page, meaning that if the structure changes, you have to update your code.
  • Some companies forbid scraping their website, so before doing it, always check that you are allowed to.

Going Further

We have seen how to scrape data from a website with static content, but this method will not work on a website with dynamic content (SPA). In this case, Puppeteer is a good tool for the job.
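As a quick preview, here is a minimal sketch (assuming Puppeteer is installed with yarn add puppeteer) of how to get the fully rendered HTML of a dynamic page; the result can then be passed to cheerio.load() as we did above:

import puppeteer from 'puppeteer';

const scrapeDynamicPage = async (url: string): Promise<string> => {
  // Launch a headless browser and open a new tab
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Wait until network activity settles so dynamic content is rendered
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Grab the rendered HTML, then clean up
  const html = await page.content();
  await browser.close();

  return html;
};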

Also, check this link to see some problems you can face while doing Web scraping and how to avoid them.

Find the final source code of this tutorial here.

I hope you found it interesting, and see you in the next tutorial 😉.