Using Web scraping in Node.js to build an API to browse programming languages
Today, many companies expose public APIs that are consumed by applications built by other companies or developers. The goal is to let consumers build more features on top of their systems and to give them more flexibility.
Sometimes no API exposes the data a feature of your application needs, yet the data is available on a website (the company's or another one). In this case, we can use Web scraping to retrieve it.
In this tutorial, we will see how to do that with Node.js. As a use case: I recently needed data on all programming languages but couldn't find an API that provides it, so I built one.
Libraries to do Web scraping
Web scraping is a technique that consists of fetching the content of a web page and then extracting data from it. With Node.js, we will use the following libraries to show how to do Web scraping:
- Axios: Get the HTML content of a page through the URL.
- Cheerio: Parse the HTML content to retrieve the data needed.
- Mongoose: Save the data extracted into a MongoDB database.
- Express: Create an endpoint that returns languages stored in the database in a JSON format.
Prerequisites
You need these tools installed on your computer to follow this tutorial.
- Node.js 16+ - Download link
- NPM or Yarn - I will use Yarn
- Docker (optional)
We need Docker to run a MongoDB container; you can skip it if MongoDB is already installed on your computer. Run the command below to start a container from the Mongo image and publish its port 27017 on the host so the application can reach it:
docker run -d --rm --name scraping-db -p 27017:27017 -e MONGO_INITDB_ROOT_USERNAME=root -e MONGO_INITDB_ROOT_PASSWORD=secret mongo:6.0
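For reference, the credentials passed to the container end up in the connection string used by the application. The sketch below shows one way to assemble it. The helper name `buildMongoUri`, the database name `scraping`, and the host and port are assumptions made for illustration; they are not taken from the boilerplate, and the sketch assumes the container's port 27017 is reachable on localhost. The root user created through `MONGO_INITDB_ROOT_USERNAME` lives in the `admin` database, which is why `authSource=admin` is part of the URI.

```typescript
// Hypothetical helper: builds a MongoDB connection string from its parts.
type MongoConfig = {
  host: string;
  port: number;
  user: string;
  password: string;
  database: string;
};

const buildMongoUri = ({ host, port, user, password, database }: MongoConfig): string => {
  // Percent-encode the credentials in case they contain characters such as "@" or ":".
  const credentials = `${encodeURIComponent(user)}:${encodeURIComponent(password)}`;

  // The root user is created in the "admin" database, hence authSource=admin.
  return `mongodb://${credentials}@${host}:${port}/${database}?authSource=admin`;
};

const uri = buildMongoUri({
  host: 'localhost',
  port: 27017,
  user: 'root',
  password: 'secret',
  database: 'scraping', // assumed database name
});

console.log(uri); // mongodb://root:secret@localhost:27017/scraping?authSource=admin
```

How the boilerplate reads these values from the .env file may differ; check .env.example for the variable names it actually expects.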
Set up the project
To start, we will use a boilerplate for Node.js projects that we built in a previous tutorial. The branch express-mongo has Express and Mongoose already installed, so we can directly implement the Web scraping.
git clone https://github.com/tericcabrel/node-ts-starter.git -b express-mongo node-web-scraping
cd node-web-scraping
cp .env.example .env
nano .env
# Enter database credentials for your local environment, save and exit
yarn install
yarn start
Now that we have a working project, let's continue by installing the libraries for Web scraping.
yarn add axios cheerio
yarn add -D @types/cheerio
There is no need to install type definitions for Axios because they are included in the library.
Scrape the content
We will use Axios to get the HTML content of the page. The page to scrape is Wikipedia's timeline of programming languages, which lists the languages created from the beginning until today.
Let's create a file called scraper.ts inside the folder src, then add the code below:
import axios from 'axios';

const PAGE_URL = 'https://en.wikipedia.org/wiki/Timeline_of_programming_languages';

const scraper = async () => {
  const response = await axios.get(PAGE_URL);

  console.log(response.data);
};

(async () => {
  await scraper();
})();
As you see, getting the content of the page is very straightforward. Run this code with the command below:
ts-node src/scraper.ts
We get the following output:
The output is a huge block of unreadable HTML that cannot be parsed without the appropriate tool; this is where Cheerio helps us parse it and select the data we need.
Retrieve the data from the page content
For Cheerio to get the data, we need to provide a selector that targets the elements holding the data we want. The only way to find it is to analyze the page structure, but the page is huge and contains a lot of data we don't need.
So, the first step is to define which data we want from the page. The picture below shows which data we want to retrieve from the page.
Now that we have identified the data we want to extract, the TypeScript data structure representing it can be the following:
type ProgrammingLanguage = {
  yearCategory: string;
  year: number;
  name: string;
  author: string;
  predecessors: string[];
};
Beware of edge cases on the data type.
One essential step is to analyze the data to make sure the types you selected are right. Previously, we defined the year of creation as a number, but if we pay attention, this type doesn't fit every row:
- The first line highlighted represents a programming language that has been released over a range of years.
- The second line highlighted indicates that the release year of a programming language isn't confirmed.
Let's update our data structure to handle these cases:
type ProgrammingLanguage = {
  yearCategory: string;
  year: number[];
  yearConfirmed: boolean;
  name: string;
  author: string;
  predecessors: string[];
};
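To make the two edge cases concrete, here is how such rows could be represented with the updated type. The values are illustrative: the first entry follows the Plankalkül row of the page as I read it, and the second one is a made-up entry (`ExampleLang`) standing in for a row whose year cell would read "1948?".

```typescript
type ProgrammingLanguage = {
  yearCategory: string;
  year: number[];
  yearConfirmed: boolean;
  name: string;
  author: string;
  predecessors: string[];
};

// A language released over a range of years: both bounds are stored.
const rangeExample: ProgrammingLanguage = {
  yearCategory: 'Pre-1950',
  year: [1943, 1945],
  yearConfirmed: true,
  name: 'Plankalkül',
  author: 'Konrad Zuse',
  predecessors: [],
};

// Hypothetical entry whose release year is not confirmed ("1948?" on the page).
const unconfirmedExample: ProgrammingLanguage = {
  yearCategory: 'Pre-1950',
  year: [1948],
  yearConfirmed: false,
  name: 'ExampleLang',
  author: 'Unknown',
  predecessors: [],
};

console.log(rangeExample.year, unconfirmedExample.yearConfirmed);
```

Storing the year as an array keeps a single field for both cases: one element for a single year, two elements for a range.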
Find a selector to retrieve data
We now know the part of the page we want to retrieve. Let's analyze the page structure to find our selector:
From the picture, we can guess a pattern:
The year category is inside an <h2> tag. The next tag is a <table> where the data we want is inside the <tbody> tag, in the following order:
- Year: the first column
- Name: the second column
- Author: the third column
- Predecessors: the fourth column
Now we have everything we need to retrieve our data using Cheerio. Update the file scraper.ts with the code below:
import axios from 'axios';
import * as cheerio from 'cheerio';

const PAGE_URL = 'https://en.wikipedia.org/wiki/Timeline_of_programming_languages';

type ProgrammingLanguage = {
  yearCategory: string;
  year: number[];
  yearConfirmed: boolean;
  name: string;
  author: string;
  predecessors: string[];
};

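const formatYear = (input: string) => {
  // Remove surrounding whitespace and the trailing "?" that marks an unconfirmed year.
  const array = input.trim().replace(/\?$/, '').split('–');

  if (array.length < 2) {
    return [+array[0].slice(0, 4)];
  }

  // A range such as "1943–45" abbreviates the end year: prepend the century of the start year.
  return [+array[0], +(array[1].length < 4 ? `${array[0].slice(0, 2)}${array[1]}` : array[1])];
};
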
const retrieveData = (content: string) => {
  const $ = cheerio.load(content);
  const headers = $('body h2');
  const languages: ProgrammingLanguage[] = [];

  for (let i = 0; i < headers.length; i++) {
    const header = headers.eq(i);
    // Only keep the headers that are immediately followed by a table.
    const table = header.next('table');

    if (!table.is('table')) {
      continue;
    }

    const yearCategory = header.children('span').first().text();
    const tableRows = table.children('tbody').children('tr');

    for (let j = 0; j < tableRows.length; j++) {
      const rowColumns = tableRows.eq(j).children('td');
      const name = rowColumns.eq(1).text().trim();

      // Header rows have no <td> cells, so the name is empty: skip them.
      if (!name) {
        continue;
      }

      // Trim first: the cell text ends with a line break, which would hide the trailing "?".
      const yearText = rowColumns.eq(0).text().trim();

      const language: ProgrammingLanguage = {
        author: rowColumns.eq(2).text().trim(),
        name,
        predecessors: rowColumns
          .eq(3)
          .text()
          .split(',')
          .map((value) => value.trim()),
        year: formatYear(yearText),
        yearConfirmed: !yearText.endsWith('?'),
        yearCategory,
      };

      languages.push(language);
    }
  }

  return languages;
};

const scraper = async () => {
  const response = await axios.get(PAGE_URL);
  const languages = retrieveData(response.data);

  console.log(languages);
};

(async () => {
  await scraper();
})();
Run the code again to see the list of extracted languages printed in the console.
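The year column is the trickiest one to clean up, so it is worth trying its rules in isolation, without fetching the page. The sketch below is a stand-alone variant of that logic; the name `parseYear` and its return shape are illustrative and not part of the scraper file. It assumes ranges use an en dash and that an unconfirmed year ends with "?".

```typescript
// Stand-alone variant of the year clean-up rules, usable without any HTTP request.
const parseYear = (cell: string): { year: number[]; yearConfirmed: boolean } => {
  // Cell text usually ends with a line break, so trim before inspecting the last character.
  const text = cell.trim();
  const yearConfirmed = !text.endsWith('?');

  // Drop the trailing "?" and split on the en dash used for ranges.
  const parts = text.replace(/\?$/, '').split('–');
  const start = parts[0] ?? '';
  const end = parts[1];

  if (end === undefined) {
    return { year: [Number(start.slice(0, 4))], yearConfirmed };
  }

  // "1943–45" abbreviates the end year: reuse the century of the start year.
  const fullEnd = end.length < 4 ? `${start.slice(0, 2)}${end}` : end;

  return { year: [Number(start), Number(fullEnd)], yearConfirmed };
};

console.log(parseYear('1943–45\n')); // year [1943, 1945], confirmed
console.log(parseYear('1948?\n')); // year [1948], not confirmed
console.log(parseYear('1957\n')); // year [1957], confirmed
```

Keeping this logic in a pure function makes it easy to add a new case when the page introduces a format the scraper has not seen yet.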
Save data in the database
Now that we have retrieved the data, we can save it in MongoDB. For that, we need to create a model; check out my tutorial on the topic to see how to define a model for MongoDB.
Create a folder called models, then create a file language.ts inside. Add the code below:
import mongoose, { Model, Schema, Document } from 'mongoose';

type LanguageDocument = Document & {
  yearCategory: string;
  year: number[];
  yearConfirmed: boolean;
  name: string;
  author: string;
  predecessors: string[];
};

const languageSchema = new Schema(
  {
    name: {
      type: Schema.Types.String,
      required: true,
      index: true,
    },
    yearCategory: {
      type: Schema.Types.String,
      required: true,
      index: true,
    },
    year: {
      type: [Schema.Types.Number],
      required: true,
    },
    yearConfirmed: {
      type: Schema.Types.Boolean,
      required: true,
    },
    author: {
      type: Schema.Types.String,
    },
    predecessors: {
      type: [Schema.Types.String],
      required: true,
    },
  },
  {
    collection: 'languages',
    timestamps: true,
  },
);

const Language: Model<LanguageDocument> = mongoose.model<LanguageDocument>('Language', languageSchema);

export { Language, LanguageDocument };
Let's update our scraper() function to insert the languages into the database.
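One possible shape for that step is sketched below. It is not necessarily the code of the repository: `saveLanguages`, `LanguageStore` and the in-memory store are hypothetical names introduced so the logic can be tried without a database. In the project, the store could be backed by the Mongoose model through its `deleteMany` and `insertMany` methods, assuming the database connection has already been opened.

```typescript
type ProgrammingLanguage = {
  yearCategory: string;
  year: number[];
  yearConfirmed: boolean;
  name: string;
  author: string;
  predecessors: string[];
};

// Minimal contract needed for persistence (hypothetical). With Mongoose it could be:
// { deleteMany: (f) => Language.deleteMany(f), insertMany: (d) => Language.insertMany(d) }
type LanguageStore = {
  deleteMany: (filter: Record<string, unknown>) => PromiseLike<unknown>;
  insertMany: (documents: ProgrammingLanguage[]) => PromiseLike<unknown>;
};

const saveLanguages = async (languages: ProgrammingLanguage[], store: LanguageStore): Promise<number> => {
  if (languages.length === 0) {
    // Nothing was scraped: keep the existing data rather than wiping the collection.
    return 0;
  }

  // Clear previous results so that running the scraper twice does not duplicate rows.
  await store.deleteMany({});
  await store.insertMany(languages);

  return languages.length;
};

// In-memory stand-in used to try the function without a database.
const createMemoryStore = () => {
  let rows: ProgrammingLanguage[] = [];
  const store: LanguageStore = {
    deleteMany: async () => {
      rows = [];
    },
    insertMany: async (documents) => {
      rows = [...rows, ...documents];
    },
  };

  return { store, count: () => rows.length };
};

const memory = createMemoryStore();
const sample: ProgrammingLanguage = {
  yearCategory: 'Pre-1950',
  year: [1943, 1945],
  yearConfirmed: true,
  name: 'Plankalkül',
  author: 'Konrad Zuse',
  predecessors: [],
};

// Saving the same result twice leaves a single row in the store.
const demo = saveLanguages([sample], memory.store)
  .then(() => saveLanguages([sample], memory.store))
  .then(() => {
    console.log(memory.count());
    return memory.count();
  });
```

Inside scraper(), the call would come right after retrieveData(), in place of the console.log. Replacing the whole collection is a design choice made for this sketch; checking whether each language already exists before inserting it is another option.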
Run the code, wait for the execution to complete, then check out your database to ensure the data have been inserted as expected.
Note: The unique constraint is not applied to the language name because some programming languages share the same name, such as Short Code. I don't know if it is a mistake, but for now, in Wikipedia, we trust.
Create an endpoint to retrieve data
The final part is to create a route /languages to retrieve the programming languages stored in the database. Let's add the code below to the file index.ts.
app.get('/languages', async (req, res) => {
  const languages = await Language.find().sort({ name: 1 }).exec();

  return res.json({ data: languages });
});
We retrieve all the data sorted by name in ascending order.
Start the application by running the command yarn start, then navigate to http://localhost:4500/languages in your browser.
Caveats on Web scraping
- Always check if there is already an API that provides the data you need to avoid spending many hours building a new API.
- The code to retrieve data is highly coupled to the HTML structure of the page, meaning that if the structure changes, you have to update your code.
- Some companies can forbid the scraping of their website, so before doing it, always check first if you are allowed to do that.
- Some websites have enhanced security against Web scraping, such as captcha validation. I wrote a post on how you can bypass these advanced protections.
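The coupling to the HTML structure can at least be made visible: a small guard that runs after the extraction can fail loudly when the result looks wrong, instead of silently storing bad data. The sketch below is one way to do it; `assertScrapeLooksValid` is a hypothetical helper, and the threshold of 50 rows is an arbitrary assumption to adjust to the page being scraped.

```typescript
// Minimal shape of a scraped row needed for the check.
type ScrapedLanguage = { name: string; year: number[] };

// Hypothetical guard: throws when the result suggests the page layout has changed.
const assertScrapeLooksValid = (languages: ScrapedLanguage[], minimumExpected = 50): void => {
  if (languages.length < minimumExpected) {
    throw new Error(`Expected at least ${minimumExpected} languages, got ${languages.length}`);
  }

  // A row is suspicious when its name is empty or one of its years failed to parse.
  const invalid = languages.filter(
    (language) =>
      language.name.trim() === '' ||
      language.year.length === 0 ||
      language.year.some((value) => Number.isNaN(value)),
  );

  if (invalid.length > 0) {
    throw new Error(`${invalid.length} rows have an empty name or an unparsable year`);
  }
};

// A plausible result passes silently.
const healthy = Array.from({ length: 60 }, (_, index) => ({
  name: `Language ${index}`,
  year: [1950 + index],
}));
assertScrapeLooksValid(healthy);

// An empty result, typical of a selector that no longer matches, is rejected.
let message = '';
try {
  assertScrapeLooksValid([]);
} catch (error) {
  message = (error as Error).message;
}

console.log(message); // Expected at least 50 languages, got 0
```

Calling such a guard before writing to the database means a layout change shows up as an error rather than as missing or corrupted data.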
Wrap up
We have seen how to scrape data from a website with static content, but this method will not work on a website with dynamic content (SPA). In this case, Puppeteer is a suitable tool, and there are also various Web scraper APIs that provide a working solution out of the box.
Also, check out this link to see some problems you can face while doing Web scraping and how to avoid them.
You can find the source code on the GitHub repository.
Follow me on Twitter or subscribe to my newsletter to avoid missing the upcoming posts and the tips and tricks I occasionally share.