An Introduction to Web Scraping with Node JS – codeburst




What is web scraping?

Web scraping is extracting data from a website so that you can work with it programmatically.

Warnings.

Web scraping is against most websites' terms of service. Your IP address may be banned from a website if you scrape too frequently or maliciously.
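Since scraping too frequently is exactly what gets IP addresses banned, one common courtesy is to pause between requests. Here is a minimal sketch of my own (the helper names and the 500 ms default are illustrative assumptions, not values from any library):

```javascript
// A minimal throttling sketch: fetch URLs one at a time, pausing
// between requests so we don't hammer the target site.
function delay(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function politeScrape(urls, fetchFn, waitMs = 500) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchFn(url)); // one request at a time
    await delay(waitMs);              // pause before the next request
  }
  return results;
}
```

In a real scraper, fetchFn would be something like the rp(options) call we build below; here it is just a placeholder parameter.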

What will we need?

For this project we’ll be using Node.js. If you’re not familiar with Node, check out my 3 Best Node.JS Courses.

We’ll also be using two open-source npm modules to make today’s task a little easier: request-promise (with its peer dependency, request) and cheerio.

Project Setup.

Create a new project directory with an index.js file. We’ll need to install and require our dependencies. Open up your command line, and install and save: request, request-promise, and cheerio:

npm install --save request request-promise cheerio

Then, require the dependencies at the top of your index.js file:

const rp = require('request-promise');

const cheerio = require('cheerio');

Setting up the Request

The options object needs to do two things:

1. Pass in the URL we want to scrape.
2. Tell Cheerio to load the returned HTML so that we can use it.

Here’s what that looks like:

const options = {
  uri: `https://www.yourURLhere.com`,
  transform: function (body) {
    return cheerio.load(body);
  }
};

The uri key is simply the website we want to scrape.

The transform key tells request-promise to take the returned body and load it into Cheerio before returning it to us.
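To make the role of transform concrete, here is a toy stand-in of my own (an illustrative sketch, not request-promise's actual internals): when a transform is supplied, the promise resolves with transform(body) instead of the raw body.

```javascript
// Toy model of the transform option (illustrative, not the real
// library code): resolve with transform(body) if one was supplied,
// otherwise resolve with the raw body.
function toyRequest(options) {
  const fakeBody = '<html><body>hello</body></html>'; // pretend HTTP response
  return Promise.resolve(
    options.transform ? options.transform(fakeBody) : fakeBody
  );
}

toyRequest({
  uri: 'https://www.yourURLhere.com',
  // stand-in for cheerio.load(body):
  transform: (body) => body.toUpperCase(),
}).then((result) => console.log(result));
// logs the transformed body, just as the real option hands you a
// loaded Cheerio object instead of a raw HTML string
```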

Awesome. We’ve successfully set up our HTTP request options! Here’s what your code should look like so far:

const rp = require('request-promise');
const cheerio = require('cheerio');

const options = {
  uri: `https://www.yourURLhere.com`,
  transform: function (body) {
    return cheerio.load(body);
  }
};

Make the Request

Now that the options are taken care of, we can actually make our request. The boilerplate in the documentation for that looks like this:

rp(OPTIONS)
  .then(function (data) {
    // REQUEST SUCCEEDED: DO SOMETHING
  })
  .catch(function (err) {
    // REQUEST FAILED: ERROR OF SOME KIND
  });

We pass in our options object, then wait to see if our request succeeds or fails. Either way, we do something with the returned data.
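If the then/catch flow is new to you, here is a self-contained sketch with a fake request (no network, no Cheerio; the names are illustrative) showing which branch runs:

```javascript
// fakeRequest stands in for rp(options) so the then/catch flow can be
// seen without a real HTTP call.
function fakeRequest(shouldFail) {
  return new Promise((resolve, reject) => {
    if (shouldFail) {
      reject(new Error('REQUEST FAILED'));
    } else {
      resolve('<html><body>Hello</body></html>');
    }
  });
}

fakeRequest(false)
  .then((data) => {
    console.log('succeeded with:', data); // runs on success
  })
  .catch((err) => {
    console.log('failed with:', err.message); // runs on failure
  });
```

Only one of the two handlers fires for a given request: then on a resolved promise, catch on a rejected one.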

Knowing what the documentation says to do, let's create our own version:

rp(options)
  .then(($) => {
    console.log($);
  })
  .catch((err) => {
    console.log(err);
  });

The code is pretty similar. The big difference is I’ve used arrow functions. I’ve also logged out the returned data from our HTTP request. We’re going to test to make sure everything is working so far.

Replace the uri in the options object with the website you want to scrape. Then, open up your console and type:

node index.js

// LOGS THE FOLLOWING:

{ [Function: initialize]
  fn:
   initialize {
     constructor: [Circular],
     _originalRoot:
      { type: 'root',
        name: 'root',
        namespace: 'http://www.w3.org/1999/xhtml',
        attribs: {},
        ...

If you don’t see an error, then everything is working so far — and you just made your first scrape!

Having fun? Want to learn how to build more cool stuff with Node? Check out my 3 Best Node JS Courses

Here is the full code of our boilerplate:

const rp = require('request-promise');
const cheerio = require('cheerio');

const options = {
  uri: `https://www.yourURLhere.com`,
  transform: function (body) {
    return cheerio.load(body);
  }
};

rp(options)
  .then(($) => {
    console.log($);
  })
  .catch((err) => {
    console.log(err);
  });

Using the Data

What good is our web scraper if it doesn’t actually return any useful data? This is where the fun begins.

There are numerous things you can do with Cheerio to extract the data that you want. First and foremost, Cheerio’s selector implementation is nearly identical to jQuery’s. So if you know jQuery, this will be a breeze. If not, don’t worry, I’ll show you.

The selector method allows you to traverse and select elements in the document. You can get data and set data using a selector. Imagine we have the following HTML in the website we want to scrape:

<ul id="cities">
  <li class="large">New York</li>
  <li id="medium">Portland</li>
  <li class="small">Salem</li>
</ul>

We can select IDs using (#), classes using (.), and attributes using the [attribute=value] syntax:

$('.large').text()

// New York

$('#medium').text()

// Portland

$('li[class=small]').html()

// Salem
We can also loop over multiple elements. Let's collect every city into an array with the following code:

const cities = [];

$('li').each(function(i, elem) {
  cities[i] = $(this).text();
});

// cities is now [ 'New York', 'Portland', 'Salem' ]

Finding

Imagine we have two lists on our web site:

<ul id="cities">
  <li class="large">New York</li>
  <li id="c-medium">Portland</li>
  <li class="small">Salem</li>
</ul>

<ul id="towns">
  <li class="large">Bend</li>
  <li id="t-medium">Hood River</li>
  <li class="small">Madras</li>
</ul>

We can select each list using its ID, then find the small city/town within each list:

$('#cities').find('.small').text()

// Salem

$('#towns').find('.small').text()

// Madras

find() searches all descendant DOM elements, not just the immediate children shown in this example.

Children

children() is similar to find(). The difference is that children() only searches the immediate children of the selected element:

$('#cities').children('#c-medium').text();

// Portland
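To see why find() and children() behave differently, here is a hand-rolled sketch (not Cheerio's implementation; the node shape is an assumption for illustration): find walks every descendant, while children stops at the first level.

```javascript
// Simplified DOM-like tree: a ul whose first li contains a nested
// span.small, plus a direct li.small child.
const tree = {
  tag: 'ul', cls: '', children: [
    { tag: 'li', cls: 'large', children: [
      { tag: 'span', cls: 'small', children: [] } // nested descendant
    ] },
    { tag: 'li', cls: 'small', children: [] }
  ]
};

// find: recursive search of every descendant
function find(node, cls) {
  let matches = [];
  for (const child of node.children) {
    if (child.cls === cls) matches.push(child);
    matches = matches.concat(find(child, cls));
  }
  return matches;
}

// children: one level deep only
function children(node, cls) {
  return node.children.filter((child) => child.cls === cls);
}

console.log(find(tree, 'small').length);     // 2 — includes the nested span
console.log(children(tree, 'small').length); // 1 — the immediate li only
```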

We can also return the html of a given element. Note that html() returns the inner HTML of the first matched element, while Cheerio's static $.html() helper renders the element's own markup:

$('#towns').children('.large').text()

// Bend

$('#towns').children('.large').html()

// Bend

$.html($('#towns').children('.large'))

// <li class="large">Bend</li>
Additional Methods

There are more methods than I can count, and the documentation for all of them is available here.

Chrome Developer Tools

Don't forget, the Chrome Developer Tools are your friend. In Google Chrome, you can easily find element, class, and ID names using: CTRL + SHIFT + C

Hover over an element on the page and its element name and class name are shown in real time!

    Go forth and scrape!

    Thanks for reading! You should have the tools necessary now to go forth and scrape. Select a website, and see what data you’re able to extract.

    As always, I’ll be in the comments if you get stuck or need help with anything.

    I publish a few articles and tutorials each week, please consider entering your email here if you’d like to be added to my once-weekly email list.

    If tutorials like this interest you and you want to learn more, check out my 3 Best Node JS Courses

