Web Scraping in Rust With Scraper and Selenium (Plus Best Practices)

February 22, 2024
Written by
Elijah Asaolu
Contributor
Opinions expressed by Twilio contributors are their own

Rust excels in multiple areas — particularly in its performance, safety, and concurrency capabilities. These strengths make it an increasingly popular choice for web scraping, a technique for extracting data from websites.

In this tutorial, we'll explore how to scrape web content with Rust using a variety of methods, including the Reqwest and Scraper libraries, and how to work with Selenium WebDrivers in Rust. We’ll learn interactively by building an application that scrapes recent data from a news website, mimics a Google search and scrapes its content, and also scrapes the most recent questions posted on Stack Overflow.

Prerequisites

To follow along with this tutorial, you'll need:

  • A recent installation of Rust and Cargo
  • Basic familiarity with Rust syntax
  • Google Chrome (or another Selenium-supported browser) for the WebDriver examples

What is web scraping?

Web scraping is a powerful technique used for programmatically collecting information from websites. This process involves extracting data from web pages, allowing for the automated collection of large amounts of data that would be impractical to gather manually. At its core, web scraping requires a thorough understanding of the markup structure of the web resource in question. This knowledge tells us where and how the data we need is organized within the webpage.

For instance, if our objective is to scrape the titles and meta-descriptions that appear when we search for "rust" on Google, our first step is to perform this search manually. Then, we carefully inspect the markup of the search results page to locate the specific HTML elements that contain the titles and meta descriptions.

Screenshot of a Google search results page for 'rust' in inspect mode, with an arrow pointing to the title and description of the first search result.

Once we've identified the necessary HTML elements and their structure, as in the screenshot above, we can write the script to navigate to the webpage programmatically, parse the HTML content, and extract the information we've identified as our target.

Web scraping can range from simple tasks, like extracting text from a webpage, to more complex scenarios that involve navigating through multiple pages, handling login forms, or even dealing with JavaScript-rendered content.

Web scraping vs. data fetching

While often confused, web scraping and data fetching are distinct processes. Data fetching involves sending requests to HTTP endpoints, typically APIs (REST/GraphQL), where the data is directly accessible and structured for easy integration into your program.

In contrast, web scraping starts by fetching the content of a web page with an HTTP client and then parsing the page's markup to extract the needed data. This might involve a markup parsing library to decipher HTML or XML structures. Alternatively, tools like headless browsers (e.g., Selenium) can automate interactions with more complex web pages, such as those requiring navigation or dealing with dynamically loaded content.
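To make the contrast concrete, here's a minimal data-fetching sketch using Reqwest (which we set up in the next section). It assumes Reqwest's json feature and Serde's derive feature are enabled, and the endpoint URL and JSON shape are hypothetical placeholders; the point is that an API already returns structured data, so no markup parsing is needed.

use serde::Deserialize;

// Hypothetical shape of the JSON returned by the endpoint.
#[derive(Deserialize, Debug)]
struct Post {
    id: u32,
    title: String,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The endpoint URL is a placeholder for any JSON API.
    let posts: Vec<Post> = reqwest::get("https://example.com/api/posts")
        .await?
        .json()
        .await?;

    println!("Fetched {} posts", posts.len());
    Ok(())
}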

Scraping web content using Rust, Reqwest, and Scraper

Reqwest is a highly popular HTTP client library in Rust's ecosystem, renowned for its comprehensive data fetching capabilities. It simplifies sending HTTP requests and managing responses, making it suitable for web scraping tasks.

Conversely, Scraper specializes in parsing and analyzing HTML content. It efficiently navigates through markup, enabling us to extract specific data points from a web page's structure.

Leveraging these tools, our web scraping workflow in Rust will involve:

  • Utilizing Reqwest to fetch web page content
  • Employing Scraper to parse the fetched content and extract the necessary information

Let’s start by creating a new Rust application and changing into the new project directory with the following commands:

cargo new rust-scrape
cd rust-scrape

The cargo new command initializes a new Rust project, generating a Cargo.toml file for dependency management and a src/main.rs file containing a basic "Hello, World!" program.
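The key files in the generated project look like this:

rust-scrape/
├── Cargo.toml
└── src/
    └── main.rs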

To effectively scrape web content, we need to install a few dependencies: the Reqwest and Scraper crates for fetching and parsing web data. Additionally, since our application will be asynchronous, the Tokio library is required to create an async runtime. Install these dependencies using:

cargo add reqwest scraper tokio

Then, update the dependencies section in Cargo.toml to match the following:

[dependencies]
reqwest = "0.11.23"
scraper = "0.18.1"
tokio = { version = "1.35.1", features = ["rt-multi-thread", "macros"] }

With these setups in place, we're now equipped to dive into web scraping with Rust. Let's get started!

Extracting tech news from OSnews

To demonstrate web scraping with Rust, let's extract the top five most recent articles posted on OSnews.com.

Screenshot of the OSnews.com index page, showing the latest two articles

Upon inspecting the OSnews page's markup, we find that our target data is contained within the div#content_box. Inside this div, each article is wrapped in an <article> tag with the class latestPost excerpt. The article's title sits in an <a> tag nested inside an <h1> with the class title front-view-title, and that same <a> tag's href holds the article's URL. The author's name is found within a span with the class theauthor, specifically in an <a> tag inside this span. The date posted is in a span with the class thetime.

With this understanding of the page structure, we can tailor our scraping script to extract these specific details. Open your src/main.rs file and insert the following Rust code:

use reqwest;
use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch the OSnews homepage and read the response body as text.
    let url = "https://www.osnews.com";
    let resp = reqwest::get(url).await?.text().await?;

    // Parse the HTML and define selectors for the elements we want to extract.
    let fragment = Html::parse_fragment(&resp);
    let article_selector = Selector::parse("div#content_box article").unwrap();
    let title_selector = Selector::parse("h1.title a").unwrap();
    let author_selector = Selector::parse("span.theauthor a").unwrap();
    let date_selector = Selector::parse("span.thetime").unwrap();

    // Iterate over the articles, stopping after the five most recent.
    for (i, article) in fragment.select(&article_selector).enumerate() {
        if i >= 5 {
            break;
        }

        // Title and author come from the inner HTML of their elements.
        let title = article.select(&title_selector).next().unwrap().inner_html();
        let author = article
            .select(&author_selector)
            .next()
            .unwrap()
            .inner_html();
        // The date is collected from the element's text nodes.
        let date = article
            .select(&date_selector)
            .next()
            .unwrap()
            .text()
            .collect::<Vec<_>>()
            .join("");
        // The article URL comes from the href attribute of the title link.
        let url = article
            .select(&title_selector)
            .next()
            .unwrap()
            .value()
            .attr("href")
            .unwrap();

        println!("Article Title: {}", title);
        println!("Posted by: {}", author);
        println!("Posted on: {}", date);
        println!("Read more: {}", url);
        println!("_____");
    }

    Ok(())
}

This code begins by making a GET request to the OSnews homepage using the Reqwest library. After receiving the response, we use the Scraper library's Html::parse_fragment() method to parse the HTML content and define selectors for the article titles, author names, dates, and URLs with Selector::parse().

The script then iterates over the articles inside div#content_box. For each article, it extracts and prints the title, author, date, and URL, and the loop index is used to stop after the five most recent articles.

Start the application by running the command below.

cargo run

You should see the extracted data displayed in your console, similar to the image below.

Terminal output displaying the top 5 most recent posts on osnews.com

For a detailed exploration of the extensive markup parsing methods provided by the Scraper library, visit the official documentation.
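For instance, here's a small, self-contained sketch that parses an inline HTML string (rather than a live page) to show a few more of Scraper's methods, such as reading text nodes and element attributes:

use scraper::{Html, Selector};

fn main() {
    // A small inline HTML snippet used purely for illustration.
    let html = r#"<ul>
        <li><a href="/first" class="link">First item</a></li>
        <li><a href="/second" class="link">Second item</a></li>
    </ul>"#;

    let document = Html::parse_fragment(html);
    let link_selector = Selector::parse("li a.link").unwrap();

    for link in document.select(&link_selector) {
        // text() yields the element's text nodes; attr() reads an attribute value.
        let label = link.text().collect::<String>();
        let href = link.value().attr("href").unwrap_or("");
        println!("{} -> {}", label, href);
    }
}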

Limitations of Reqwest and Scraper for web scraping

While the Reqwest and Scraper libraries are highly effective for standard web scraping tasks, they encounter limitations with websites where content is dynamically loaded after the initial page load. In such cases, these tools can only capture the content present in the initial page load, missing any data loaded afterward.

Additionally, there are scenarios where accessing the required data necessitates completing an authentication process. Reqwest and Scraper alone cannot handle such situations effectively. In these instances, headless browser solutions like Selenium are more suitable.

Web scraping with Rust, Selenium WebDriver, and Thirtyfour

Selenium WebDriver is a prominent tool for web testing and automation. It allows you to operate a fully automated browser, replicating user interactions with web pages. This capability is crucial for scraping dynamic content, where data is loaded in response to user actions or asynchronously after the initial page load.

A compatible framework is required to integrate Selenium WebDriver with a programming language. For Rust, this framework is Thirtyfour. Thirtyfour is a comprehensive library designed specifically for Rust, enabling seamless integration with Selenium. It provides a range of functionality to interact with and control a browser, making it an invaluable asset for complex web scraping tasks in the Rust environment.

Configure Selenium WebDrivers for Rust

To utilize Selenium WebDrivers for web scraping in Rust, the initial step is to install a web driver compatible with your desired browser. In this example, we're using the Chrome browser. Begin by checking your Chrome version in the browser settings. Download the corresponding Chrome WebDriver from the official ChromeDriver download page based on this version.
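Once downloaded, start the driver from your terminal (assuming the chromedriver executable is on your PATH). The --port flag is optional here, since 9515 is ChromeDriver's default port:

chromedriver --port=9515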

You should see a confirmation that the driver is listening on a port (9515 by default, which is the port the examples below connect to). Next, add the Thirtyfour crate to your Rust project with the command below:

cargo add thirtyfour

With the web driver and Thirtyfour installed, you're now ready to automate browser processes in Rust.

Mimic Google search

Let's begin with a simple task: visiting google.com, searching for the keyword "rust", and scraping its contents. Open your src/main.rs file and replace its contents with the following code:

use thirtyfour::prelude::*;
use thirtyfour::Key;
use tokio;

#[tokio::main]
async fn main() -> WebDriverResult<()> {
    let caps = DesiredCapabilities::chrome();
    let driver = WebDriver::new("http://localhost:9515", caps).await?;

    // Navigate to Google.
    driver.goto("https://www.google.com").await?;

    let search_box = driver.find(By::Name("q")).await?;
    search_box.send_keys("rust").await?;
    search_box.send_keys(Key::Enter.to_string()).await?;
    tokio::time::sleep(std::time::Duration::from_secs(5)).await;

    let page_source = driver.source().await?;
    println!("{}", page_source);
    driver.quit().await?;

    Ok(())
}

In the code, we use the Thirtyfour crate for Rust to automate a web browser. We start by creating browser capabilities with DesiredCapabilities::chrome(), which can be adapted for other browsers such as Edge using DesiredCapabilities::edge(). We then initiate a WebDriver to interact with a browser instance listening on http://localhost:9515.

The script navigates to Google's homepage, searches for the input box (identified by name="q"), and enters the keyword "rust". It simulates an Enter key press to execute the search and waits briefly to allow the results page to load.

After displaying the search results, the script prints the page's HTML source to the console. Finally, it closes the browser window and ends the WebDriver session.

Start this program by running the following command:

cargo run

You should see a browser window open automatically, mimic the search as instructed, and print the result to the console.

However, in practice, we do not want to see the browser open and perform all the automated activities; we simply want to see the result. To accomplish this, add the "headless" parameter to launch our browser in headless mode and optionally deactivate the GPU to optimize performance, as demonstrated below.

. . .
async fn main() -> WebDriverResult<()> {
    let mut caps = DesiredCapabilities::chrome();

    caps.add_chrome_arg("headless")?;
    caps.add_chrome_arg("disable-gpu")?;

    let driver = WebDriver::new("http://localhost:9515", caps).await?;
    
    . . .

With this argument added, the browser window won’t be visible when performing the predefined instructions.

Extract questions from Stack Overflow using WebDriver

Let’s try out another example and extract the top five questions from Stack Overflow's Newest Questions page. Our focus is on capturing each question's title, excerpt, author, and complete URL. Upon inspecting the page, we find that the necessary data is contained within a div#questions tag, and each question is wrapped in a div with the class s-post-summary, as shown below.

Screenshot of stackoverflow.com/questions in inspect mode, with a red-outlined box illustrating where the question markup resides.

To implement this, update your src/main.rs file with the following code:

use thirtyfour::prelude::*;
use tokio;

#[tokio::main]
async fn main() -> WebDriverResult<()> {
    // Set up Chrome WebDriver
    let mut caps = DesiredCapabilities::chrome();
    caps.add_chrome_arg("headless")?;
    let driver = WebDriver::new("http://localhost:9515", caps).await?;

    // Navigate to Stack Overflow's questions page
    driver.goto("https://stackoverflow.com/questions").await?;

    // Wait for the page to load
    tokio::time::sleep(std::time::Duration::from_secs(5)).await;

    // Locate the questions container
    let questions = driver
        .find_all(By::Css("div#questions .s-post-summary"))
        .await?;

    for question in questions.iter().take(5) {
        // Extract question details
        let title_element = question.find(By::Css("h3 .s-link")).await?;
        let title = title_element.text().await?;
        let url_suffix = title_element.attr("href").await?.unwrap_or_default();
        let url = format!("https://stackoverflow.com{}", url_suffix);
        let excerpt = question
            .find(By::Css(".s-post-summary--content-excerpt"))
            .await?
            .text()
            .await?;
        let author = question
            .find(By::Css(".s-user-card--link a"))
            .await?
            .text()
            .await?;

        println!("Question: {}", title);
        println!("Excerpt: {}", excerpt);
        println!("Posted by: {}", author);
        println!("Read more: {}", url);
        println!("_____");
    }

    driver.quit().await?;

    Ok(())
}

In the code above, we use Selenium WebDriver in headless mode to navigate to Stack Overflow's questions page. After waiting for the page's content to load, we use CSS selectors to extract the first five questions' titles, excerpts, authors, and URLs. Each question's details are then printed, with 'https://stackoverflow.com' prepended to the extracted href to form a complete link.

To run this script and see the output, use the command:

cargo run

Running this script will display the top five most recent questions from Stack Overflow in your console, similar to the output below.

Terminal output displaying the top 5 most recent questions posted on Stack Overflow

It's worth mentioning that Selenium WebDriver's capabilities go well beyond what we've seen in this example. Selenium can automate almost any web browser interaction, including complex tasks such as navigating between pages, filling out forms, mimicking mouse movements, and managing cookies. Learn more about Selenium and Thirtyfour in their respective documentation.
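As a small taste of those capabilities, here's a minimal sketch of a few other Thirtyfour interactions; the URL and form field names below are hypothetical placeholders, not a real site:

use thirtyfour::prelude::*;

#[tokio::main]
async fn main() -> WebDriverResult<()> {
    let caps = DesiredCapabilities::chrome();
    let driver = WebDriver::new("http://localhost:9515", caps).await?;

    // Visit a (hypothetical) login page.
    driver.goto("https://example.com/login").await?;

    // Fill out the form fields and submit them.
    driver.find(By::Name("username")).await?.send_keys("demo-user").await?;
    driver.find(By::Name("password")).await?.send_keys("demo-pass").await?;
    driver.find(By::Css("button[type='submit']")).await?.click().await?;

    // Navigate back to the previous page.
    driver.back().await?;

    driver.quit().await?;
    Ok(())
}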

Web scraping best practices

Web scraping goes beyond collecting data; it is an activity with legal, ethical, and technical implications.

The legality of web scraping depends on several factors: the data's availability, adherence to a website's terms of service, and your purpose for using the data. Typically, scraping publicly accessible information for personal, non-commercial use is acceptable. However, commercial usage or violating terms of service can lead to legal issues. It's also critical to respect the guidelines in a website's robots.txt file and to identify your scraper with a descriptive user-agent string for transparency.
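For example, Reqwest lets you set a descriptive user-agent on the client you scrape with; the agent name and contact address below are placeholders:

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Identify the scraper with a descriptive user-agent string (placeholder values).
    let client = reqwest::Client::builder()
        .user_agent("rust-scrape-tutorial/0.1 (contact: you@example.com)")
        .build()?;

    let body = client.get("https://www.osnews.com").send().await?.text().await?;
    println!("Fetched {} bytes", body.len());
    Ok(())
}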

Technical best practices

Efficient web scraping blends smart strategies with appropriate tools. Headless browsers, while powerful, are resource-intensive and should be used sparingly. They're most effective for pages heavy in JavaScript or dynamic content.

Use official website APIs whenever possible, such as those that Coinbase and Binance provide for their market data. These APIs are generally more stable and resource-efficient compared to traditional scraping methods. The sites scraped in this article (OSnews, Google, and Stack Overflow) were used for educational purposes only, to highlight the practical application of scraping techniques.

Understanding a website's architecture is key to effective scraping. This knowledge guides you in making targeted requests, reducing unnecessary server load, and optimizing the scraping process. Furthermore, using efficient parsing libraries for data extraction is crucial. These libraries streamline the process, ensuring quick and accurate data retrieval.

That's how to web scrape with Rust using Scraper and Selenium

Throughout this article, we've explored different web scraping methods with Rust. We went from basic HTTP fetching and HTML parsing with the Reqwest and Scraper libraries to automating full browser sessions with Selenium WebDriver and Thirtyfour. We then covered best practices to help keep your web scraping legal, ethical, and efficient.

For more hands-on experience, you can find the complete code used in this article on GitHub.

Elijah Asaolu is a technical writer and software engineer. He frequently enjoys writing technical articles to share his skills and experience with other developers.