Web scraping with R using rvest & RSelenium

walkthrough
R
Author

Mark Druffel

Published

August 8, 2023

Intro

I had a web scraping walkthrough on my old blog where I scraped Airbnb to find listings with king size beds, but Airbnb made major updates to their site and the post wouldn’t render when I overhauled my website, so I no longer have it. While doing some research recently I found myself wanting that guide, so I started writing a new one as I worked - I’m coming back to it now to finish it up as a post. I’m not going to cover the basic web development skills you’ll need because I coincidentally cover most of that in the prerequisites for my post on building a blogdown site. So without further delay, let’s get to scraping!

Scraping

R has a number of tools for scraping, but I typically only use rvest, RSelenium, and polite. The rvest library is the main one to know because it provides tools to extract (scrape) data from web pages. Often that’s all you need to web scrape. RSelenium allows you to drive web browsers from R, which is necessary when scraping some sites. Finally, polite implements a framework for using proper decorum when web scraping. I’ll very briefly cover polite in this post, but not extensively for reasons that will become clear later on.
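
Before moving on, here’s a minimal sketch of the polite workflow so you can see where it fits (the user agent string is just a placeholder, and the rest of this post uses read_html() directly): bow() introduces your scraper to the host and checks its robots.txt rules, and scrape() then fetches the page within those rules.

Code
library(polite)
# bow() negotiates with the host (checks robots.txt, sets a crawl delay)
session <- bow(
  "https://cran.r-project.org/web/packages/available_packages_by_date.html",
  user_agent = "my-example-scraper"
)
# scrape() fetches the page respecting that agreement; the result can be
# parsed with rvest just like the output of read_html()
scrape(session)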

Static Site

When scraping, as is true with many things done in code, it’s easiest to start small. Scraping static sites that don’t rely heavily on JavaScript is significantly less complicated than scraping dynamic sites that do. With that in mind, we’ll start with a very simple static site, the CRAN packages page, because it’s structured in a way that’s ideal for scraping, so it’s straightforward.

Parsing HTML

The rvest package has a suite of tools for parsing the HTML document, which is the core functionality required to scrape. The first thing to do when scraping a page is to figure out what we want to scrape and determine its HTML structure. We can do this using the browser’s web developer tools, but most of it can also be done inside RStudio. It probably goes without saying if you look at the CRAN packages page, but I’d like to scrape the packages table and make it a data frame.

To inspect the page, we first read it in with read_html(). This reads the HTML page as is into our R session for processing. We can see that the cran_packages_html object is a list of length two and both objects inside the list are external pointers. In other words, the cran_packages_html document is not held in the active R session; rather, it holds pointers which direct R to the documents created by libxml2, which are stored in RAM (at least this is my rough understanding of how it works). For more information, Bob Rudis provided a very detailed response about scraping which touches on this point, but the takeaway should be that this object does not contain the data from the HTML page - just pointers!

Code
library(tidyverse) 
library(rvest)
cran_packages_html <- read_html("https://cran.r-project.org/web/packages/available_packages_by_date.html")
str(cran_packages_html)
List of 2
 $ node:<externalptr> 
 $ doc :<externalptr> 
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

An aside: if you open cran_packages_html in the viewer and try to inspect one of the pointers, you’ll get the error could not find function "xml_child". That’s because rvest depends on xml2, but does not attach it to the search path.

You can simply load xml2 to fix the issue.

Code
library(xml2)
xml_child(cran_packages_html, 2)
{html_node}
<body lang="en">
[1] <div class="container">\n<h1>Available CRAN Packages By Date of Publicati ...

The rvest package has a suite of functions for parsing the HTML document, starting with functions that help us understand its structure, including html_children() & html_name(). We can use html_children() to climb down the page and html_name() to see the tag names of the HTML elements we want to parse. For this page, we use html_children() to see that the page has a <head> and a <body>, which is pretty standard. We’ll want to scrape the <body> because that’s where the content of the page will be.

Code
cran_packages_html |> 
  html_children() |> 
  html_name()
[1] "head" "body"

To further parse the <body>, we’ll use html_element() to clip off the rest of the HTML document and look inside <body>. Within <body>, we can see there’s just a <div>.

Code
cran_packages_html |> 
  html_element("body") |> 
  html_children() |> 
  html_name()
[1] "div"

We can continue the process with the <div>, where we see an <h1> and a <table>. It’s fairly obvious we’ll want the <table>, not the <h1>, but just to illustrate, if we look within <h1> we’ll see no nodes exist beneath it.

Code
cran_packages_html |> 
  html_element("body") |> 
  html_element("div") |> 
  html_element("h1") |> 
  html_children() |> 
  html_name()
character(0)

That doesn’t mean <h1> has no data, it just means no HTML element is a child of <h1>. Since <h1> is a tag used for heading text, we can use html_text() to extract the actual text inside. That isn’t particularly useful in this case, but html_text() often is.

Code
cran_packages_html |> 
  html_element("body") |> 
  html_element("h1") |> 
  html_text() 
[1] "Available CRAN Packages By Date of Publication"

If we use html_element("table"), we can see it contains the data we’re looking for, but there’s a bit of HTML junk we’ll need to clean up for our data frame.

Code
cran_packages_html |>  
  html_element("body") |> 
  html_element("div") |> 
  html_element("table") 
{html_node}
<table border="1">
 [1] <tr>\n<th> Date </th> <th> Package </th> <th> Title </th> </tr>\n
 [2] <tr>\n<td> 2023-08-10 </td> <td> <a href="../../web/packages/DrugSim2DR/ ...
 [3] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/actxps/inde ...
 [4] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/AgroR/index ...
 [5] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/aplot/index ...
 [6] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/av/index.ht ...
 [7] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/basemodels/ ...
 [8] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/bayesPop/in ...
 [9] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/Bayesrel/in ...
[10] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/beanz/index ...
[11] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/bookdown/in ...
[12] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/bruceR/inde ...
[13] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/canvasXpres ...
[14] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/CARBayes/in ...
[15] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/clinDR/inde ...
[16] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/cmm/index.h ...
[17] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/complexlm/i ...
[18] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/CPC/index.h ...
[19] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/dfoliatR/in ...
[20] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/DR.SC/index ...
...

In the code above, we walked down the whole HTML tree body > div > table. The html_element() function will pick up HTML tags without needing the exact path, which is very convenient but can lead to unexpected results. The code below leads to the same result, but only because this page has a single HTML table. If it had several, html_element() would only pick up the first one whether that was our intent or not (the sketch after the next code block shows the plural html_elements(), which returns every match). This point is very important to understand for more complicated web pages.

Code
cran_packages_html |>   
  # Skipped <body> & <div>
  html_element("table") 
{html_node}
<table border="1">
 [1] <tr>\n<th> Date </th> <th> Package </th> <th> Title </th> </tr>\n
 [2] <tr>\n<td> 2023-08-10 </td> <td> <a href="../../web/packages/DrugSim2DR/ ...
 [3] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/actxps/inde ...
 [4] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/AgroR/index ...
 [5] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/aplot/index ...
 [6] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/av/index.ht ...
 [7] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/basemodels/ ...
 [8] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/bayesPop/in ...
 [9] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/Bayesrel/in ...
[10] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/beanz/index ...
[11] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/bookdown/in ...
[12] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/bruceR/inde ...
[13] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/canvasXpres ...
[14] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/CARBayes/in ...
[15] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/clinDR/inde ...
[16] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/cmm/index.h ...
[17] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/complexlm/i ...
[18] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/CPC/index.h ...
[19] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/dfoliatR/in ...
[20] <tr>\n<td> 2023-08-09 </td> <td> <a href="../../web/packages/DR.SC/index ...
...
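
As mentioned above, html_elements() (the plural form) returns every matching node rather than just the first. A minimal sketch on this page, which only has the one table:

Code
# html_elements() returns a node set of every match
tables <- cran_packages_html |> 
  html_elements("table")
# Only one table here, but on a multi-table page we could index
# the one we want, e.g. tables[[2]]
length(tables)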

Fortunately, rvest has a handy html_table() function that’s specifically for HTML tables and automatically coerces them into a list of tibbles. I used bind_rows() to coerce the list to a tibble. As you can see below, we end up with a table of packages with a date, package name, and title.

Code
library(tidyverse) 
library(reactable)
cran_packages_df <- cran_packages_html |> 
  html_table() |> 
  bind_rows()

cran_packages_df |> 
  reactable(
    searchable = TRUE, 
    paginationType = "jump",  
    showPageSizeOptions = TRUE,
    pageSizeOptions = c(5, 10, 50, 100),
    defaultPageSize = 5)
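
One last bit of tidying worth considering: html_table() likely leaves the Date column as character (a hedged assumption on my part), so converting it to a proper Date class makes sorting and filtering easier. A minimal sketch:

Code
cran_packages_df |> 
  # as.Date() is a no-op if the column already parsed as a Date
  mutate(Date = as.Date(Date)) |> 
  arrange(desc(Date))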