Get notified of new magazine issues using web scraping and SMS with C# .NET

October 10, 2022
Written by
Volkan Paksoy
Contributor
Opinions expressed by Twilio contributors are their own

As a Raspberry Pi fan, I like to read The MagPi Magazine, which is freely available as PDFs. The problem is that I tend to forget to download it every month, so I decided to automate the process. If Raspberry Pi is not your thing, you should be able to modify the demo application to work for any periodical publication that offers free downloads.

Prerequisites

You'll need the following things in this tutorial:

Project Overview

First, let’s understand what the demo intends to achieve. The components involved and the workflow look like this:

Diagram showing the components and actors involved in the project.
  1. The worker service reads a database to get the latest issue it sent a notification for.
  2. The worker service fetches the website for the magazine and gets the latest issue number. Then, it compares the latest issue number in the database to the latest issue number on the website. If the numbers are equal, it means there is no new issue. If the latest issue number on the website is greater, then there is a new issue. If there is no new issue, the worker service goes to sleep. If there is a new issue, it gets the cover image and the direct link URLs from the magazine’s website.
  3. The worker service calls Twilio API to send an SMS/MMS message.
  4. Twilio sends the message to the user.
  5. The worker service updates its database with the latest issue to avoid duplicate messages.
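The decision in step 2 comes down to a single comparison. The following is a minimal sketch of that check, assuming a hypothetical helper class named IssueCheck that is not part of the demo project:

```csharp
// Hypothetical helper, not part of the demo project: it isolates the
// comparison described in step 2 of the workflow.
public static class IssueCheck
{
    // Returns true only when the website shows a higher issue number
    // than the one stored in the database.
    public static bool HasNewIssue(int lastNotifiedIssue, int latestIssueOnWebsite)
    {
        return latestIssueOnWebsite > lastNotifiedIssue;
    }
}
```

For example, IssueCheck.HasNewIssue(120, 121) returns true, while IssueCheck.HasNewIssue(121, 121) returns false, in which case the worker service goes back to sleep.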

Project Implementation

Let’s start by creating the worker service by running the following commands:

mkdir MagazineTracker
cd MagazineTracker
dotnet new worker

Create the Data Layer

First, let’s look into the data layer. The only piece of information that needs to be stored is the latest issue number that the application processed.

Create a folder inside your project named Data. Then, create a file named LatestMagazineIssue.cs that contains a model class for your data. Add the following code:

namespace MagazineTracker.Data;

public class LatestMagazineIssue
{
    public int IssueNumber { get; set; }
}

Then, create a new file IMagazineIssueRepository.cs in the Data folder that holds a repository interface to outline the data operations you’re going to use. Add the following code to the file:

namespace MagazineTracker.Data;

public interface IMagazineIssueRepository
{
    Task<LatestMagazineIssue> GetLatestIssue();
    Task SaveLatestIssue(int latestIssueNumber);
}

The next step is to decide how to store the data. The requirements of this project are very straightforward, so you don’t need a full-fledged database; a simple JSON file will suffice. Go ahead and create a JSON file named db.json under the Data directory. Update its contents as shown below:

{
  "IssueNumber": 0
}

Then, create another file named JsonMagazineIssueRepository.cs in the Data folder. It will contain the repository implementation that reads and writes the JSON file and implements the previous interface. Update the code as shown below:

using System.Text.Json;
using Microsoft.Extensions.Options;

namespace MagazineTracker.Data;

public class JsonMagazineIssueRepository : IMagazineIssueRepository
{
    private readonly DatabaseSettings _databaseSettings;

    public JsonMagazineIssueRepository(IOptions<DatabaseSettings> databaseSettings)
    {
        _databaseSettings = databaseSettings.Value;
    }
    
    public async Task<LatestMagazineIssue> GetLatestIssue()
    {
        var dbAsJson = await File.ReadAllTextAsync(_databaseSettings.JsonFilePath);
        var latestIssue = JsonSerializer.Deserialize<LatestMagazineIssue>(dbAsJson);
        return latestIssue;
    }

    public async Task SaveLatestIssue(int latestIssueNumber)
    {
        var dbAsJson = await File.ReadAllTextAsync(_databaseSettings.JsonFilePath);
        var latestIssue = JsonSerializer.Deserialize<LatestMagazineIssue>(dbAsJson);
        latestIssue.IssueNumber = latestIssueNumber;
     
        dbAsJson = JsonSerializer.Serialize(latestIssue);
        await File.WriteAllTextAsync(_databaseSettings.JsonFilePath, dbAsJson);
    }
}
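One detail worth checking in the repository above is how System.Text.Json maps JSON keys to C# properties. With default options, the key must match the property name (IssueNumber), and a key that matches no property is silently ignored. The sketch below illustrates this behaviour; the JsonRoundTrip class is a hypothetical helper for illustration only.

```csharp
using System.Text.Json;

// Same shape as the model class in the Data folder.
public class LatestMagazineIssue
{
    public int IssueNumber { get; set; }
}

// Hypothetical helper, for illustration only.
public static class JsonRoundTrip
{
    // Serializes with default options, so the JSON key is the property name.
    public static string ToJson(int issueNumber) =>
        JsonSerializer.Serialize(new LatestMagazineIssue { IssueNumber = issueNumber });

    // Reads the issue number back. Keys that match no property are ignored,
    // which leaves IssueNumber at its default value of 0.
    public static int FromJson(string json) =>
        JsonSerializer.Deserialize<LatestMagazineIssue>(json)?.IssueNumber ?? 0;
}
```

Here, JsonRoundTrip.ToJson(120) produces {"IssueNumber":120}, and reading a document whose only key is spelled differently yields 0 rather than an error.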

The JsonMagazineIssueRepository only needs one parameter: the path to the JSON file. You can encapsulate it in a simple class. Create DatabaseSettings.cs under the Data directory with the following code:

namespace MagazineTracker.Data;

public class DatabaseSettings
{
    public string JsonFilePath { get; set; }
}

Then update your appsettings.json file so that it looks like this:

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.Hosting.Lifetime": "Information"
    }
  },
  "DatabaseSettings": {
    "JsonFilePath": "./Data/db.json"
  }
}

Finally, for this stage, update Program.cs as shown below:

using MagazineTracker;
using MagazineTracker.Data;

IHost host = Host.CreateDefaultBuilder(args)
    .ConfigureServices((hostBuilderContext, services) =>
    {
        services.AddHostedService<Worker>();
        services.AddTransient<IMagazineIssueRepository, JsonMagazineIssueRepository>();
        services.Configure<DatabaseSettings>(hostBuilderContext.Configuration.GetSection("DatabaseSettings"));
    })
    .Build();

// await host.RunAsync();

var repo = host.Services.GetRequiredService<IMagazineIssueRepository>();
await repo.SaveLatestIssue(120);

var latestIssue = await repo.GetLatestIssue();
Console.WriteLine(latestIssue.IssueNumber);

Lines 7 to 9 register your services with their concrete implementations in the DI container. After the host is built, the IMagazineIssueRepository service is retrieved from the container, used to save issue number 120, and then used to read the latest issue back and print it to the console.

The await host.RunAsync() call on line 13 is commented out temporarily to make the implementation/debugging phase easier. You don’t need to worry about scheduling yet; that will come later. For now, run the application with the following command:

dotnet run

And confirm your output looks like this:

120

Now that you have a working data layer, move on to the next section, where you will do some HTML parsing.

Parse the HTML of the Magazine Page

You need to get three things from the magazine website:

  1. The latest issue number
  2. The URL of the magazine (PDF or other formats)
  3. The URL of the cover image (Optional)

Every magazine tracker will work differently, but you can combine the requirements above in a single interface so that all the trackers can be used in the same way.

Create IMagazineTrackerService.cs for the interface and update its code as shown below:

namespace MagazineTracker;

public interface IMagazineTrackerService
{
    Task<int> GetLatestIssueNumber();
    Task<string> GetLatestIssueCoverUrl();
    Task<string> GetIssuePdfUrl(int issueNumber);
}

All your trackers must implement the IMagazineTrackerService interface.

Now, implement your first tracker by creating a file MagPiTrackerService.cs with the following dummy implementation:

namespace MagazineTracker;

public class MagPiTrackerService : IMagazineTrackerService
{
    public async Task<int> GetLatestIssueNumber()
    {
        throw new NotImplementedException();
    }

    public async Task<string> GetLatestIssueCoverUrl()
    {
        throw new NotImplementedException();
    }

    public async Task<string> GetIssuePdfUrl(int issueNumber)
    {
        throw new NotImplementedException();
    }
}

To do the HTML parsing, you will use a library called AngleSharp. It makes the whole process a lot easier, and it can be added to your project via NuGet by running:

dotnet add package AngleSharp

Now, take a look at where to find the latest issue number. The easiest way to find the latest issue number is by going to the issues page, which looks like this at the time of this writing:

The MagPi Magazine issues page showing the latest issue

Look at the source of the page (right-click and choose Show/View Page Source, depending on your browser). If you search for the phrase “The MagPi issue 121 out now” (replace the number with the one you see on your screen), you should find the relevant area, which looks something like this:


<div class="c-slice c-slice--white">
  <div class="o-container">
    <section class="c-latest-issue">
      <div class="c-latest-issue__cover">
        <a href="/issues/121">
          <img alt="The MagPi issue 121 cover" class="c-latest-issue__image" src="https://magpi.raspberrypi.com/storage/…/MagPi121_COVER_STORE.jpg" />
        </a>
      </div>

      <div class="c-latest-issue__description">
        <h1 class="o-type-display">
          <a class="c-link" href="/issues/121">The MagPi issue 121 out now!</a>
        </h1>
…

This page contains the latest issue number and a URL of the cover image. To parse this page, update the MagPiTrackerService code as shown below:

using AngleSharp;

namespace MagazineTracker;

public class MagPiTrackerService : IMagazineTrackerService
{
    private const string MagpiRootUrl = "https://magpi.raspberrypi.com";
    
    public async Task<int> GetLatestIssueNumber()
    {
        var config = Configuration.Default.WithDefaultLoader();
        var context = BrowsingContext.New(config);
        var document = await context.OpenAsync($"{MagpiRootUrl}/issues/");
        var latestCoverLinkSelector = ".c-latest-issue > .c-latest-issue__cover > a";
        var latestCoverLink = document.QuerySelector(latestCoverLinkSelector);
        var rawLink = latestCoverLink.Attributes.GetNamedItem("href").Value;
        return int.Parse(rawLink.Substring(rawLink.LastIndexOf('/') + 1));
    }

    public async Task<string> GetLatestIssueCoverUrl()
    {
        var config = Configuration.Default.WithDefaultLoader();
        var context = BrowsingContext.New(config);
        var document = await context.OpenAsync($"{MagpiRootUrl}/issues/");
        var latestCoverImageSelector = ".c-latest-issue > .c-latest-issue__cover > a > img";
        var latestCoverImage = document.QuerySelector(latestCoverImageSelector);
        var latestCoverImageUrl = latestCoverImage.Attributes.GetNamedItem("src").Value;
        return latestCoverImageUrl;
    }

    public async Task<string> GetIssuePdfUrl(int issueNumber)
    {
        throw new NotImplementedException();
    }
}

After loading the page with AngleSharp, you write a CSS selector to get the element you’re interested in. In this example, the latest issue number is obtained from the href attribute of the anchor element by parsing the number that follows the last ‘/’ character.

Similarly, the cover URL is parsed from the src attribute of the img element.
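The href parsing step can also be isolated into a small helper that does not throw when the link has an unexpected shape. This is a minimal sketch, assuming a hypothetical IssueLinkParser class that is not part of the demo project:

```csharp
// Hypothetical helper, not part of the demo project.
public static class IssueLinkParser
{
    // Extracts the number after the last '/' in a link such as "/issues/121".
    // Returns false instead of throwing when the link has an unexpected shape.
    public static bool TryParseIssueNumber(string href, out int issueNumber)
    {
        issueNumber = 0;
        if (string.IsNullOrWhiteSpace(href))
        {
            return false;
        }

        // Ignore a trailing slash so "/issues/121/" is handled as well.
        var trimmed = href.TrimEnd('/');
        var lastSegment = trimmed.Substring(trimmed.LastIndexOf('/') + 1);
        return int.TryParse(lastSegment, out issueNumber);
    }
}
```

With this helper, a tracker could log a warning and skip the run when the page layout changes, rather than failing with a FormatException from int.Parse.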

Even though both pieces of information are obtained from the same page, they were implemented as separate methods. This might look repetitive, but the reason for this is to accommodate other trackers. Having both the issue number and cover URL on the same page may not be the case for other magazines, so if you combine them into a single method, you might have issues later on with other trackers.

To test the latest version, update the Program.cs file as shown below:

using MagazineTracker;
using MagazineTracker.Data;

IHost host = Host.CreateDefaultBuilder(args)
    .ConfigureServices((hostBuilderContext, services) =>
    {
        services.AddHostedService<Worker>();
        services.AddTransient<IMagazineIssueRepository, JsonMagazineIssueRepository>();
        services.AddTransient<IMagazineTrackerService, MagPiTrackerService>();
        services.Configure<DatabaseSettings>(hostBuilderContext.Configuration.GetSection("DatabaseSettings"));
    })
    .Build();

// await host.RunAsync();

var repo = host.Services.GetRequiredService<IMagazineIssueRepository>();
var tracker = host.Services.GetRequiredService<IMagazineTrackerService>();

var latestProcessedIssue = await repo.GetLatestIssue();
var latestIssueNumber = await tracker.GetLatestIssueNumber();
if (latestIssueNumber > latestProcessedIssue.IssueNumber)
{
    Console.WriteLine($"New issue detected: {latestIssueNumber}");
    var coverUrl = await tracker.GetLatestIssueCoverUrl();
    Console.WriteLine($"Cover URL: {coverUrl}");
}

Now the IMagazineTrackerService is also configured as a service and retrieved from the service provider. Then tracker.GetLatestIssueNumber and tracker.GetLatestIssueCoverUrl are used to scrape the data and print it.

Run the application, and you should see an output that looks like this:

New issue detected: 121
Cover URL: https://magpi.raspberrypi.com/storage/…/MagPi121_COVER_STORE.jpg

The third and final piece of information you need is the link to the PDF file. If you click on the “Download Free PDF” link, you get redirected to https://magpi.raspberrypi.com/issues/121/pdf, which looks like this:

The MagPi Magazine donation page

I'd strongly recommend that everybody consider donating. This is a great magazine with professional quality, and it's full of valuable knowledge about everything Raspberry Pi.

If you click on the "No thanks, take me to the free PDF" link, you get redirected to https://magpi.raspberrypi.com/issues/121/pdf/download, and your download starts automatically. This is done by placing a hidden iframe on the page and setting its src attribute to the URL of the PDF.

If you look at the source code of the download page and search for “iframe”, you should find the relevant code, which looks like this:

  <main>
        <iframe src="/downloads/…/MagPi121.pdf" class="u-hidden"></iframe>

To parse this URL, update the MagPiTrackerService.GetIssuePdfUrl method as shown below:

public async Task<string> GetIssuePdfUrl(int issueNumber)
{
    // The download page embeds the PDF in a hidden iframe.
    var issueUrl = $"{MagpiRootUrl}/issues/{issueNumber}/pdf/download";

    var config = AngleSharp.Configuration.Default.WithDefaultLoader();
    var context = BrowsingContext.New(config);
    var document = await context.OpenAsync(issueUrl);
    var iframe = document.QuerySelector("iframe");
    var iframeSrc = iframe.Attributes.GetNamedItem("src").Value;

    // The src is relative to the site root, so prepend the root URL.
    return $"{MagpiRootUrl}/{iframeSrc.TrimStart('/')}";
}
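The method above joins the root URL and the iframe src by hand. As an alternative, the System.Uri class can resolve a relative src against a base URL, and it also copes with a src that is already absolute. The sketch below assumes a hypothetical UrlHelper class; the path used in the usage note is made up for illustration.

```csharp
using System;

// Hypothetical helper, not part of the demo project.
public static class UrlHelper
{
    // Resolves src against rootUrl. If src is already an absolute URL,
    // it is returned as-is.
    public static string ToAbsoluteUrl(string rootUrl, string src)
    {
        var baseUri = new Uri(rootUrl);
        return new Uri(baseUri, src).AbsoluteUri;
    }
}
```

For instance, UrlHelper.ToAbsoluteUrl("https://magpi.raspberrypi.com", "/downloads/example/MagPi121.pdf") returns https://magpi.raspberrypi.com/downloads/example/MagPi121.pdf.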

Update the test code in Program.cs to test only the latest change:

…
var latestProcessedIssue = await repo.GetLatestIssue();
var latestIssueNumber = await tracker.GetLatestIssueNumber();
if (latestIssueNumber > latestProcessedIssue.IssueNumber)
{
    Console.WriteLine($"New issue detected: {latestIssueNumber}");
    var pdfUrl = await tracker.GetIssuePdfUrl(latestIssueNumber);
    Console.WriteLine($"PDF URL: {pdfUrl}");
}

Run the application and confirm you can see the same URL you saw in the download page source:

New issue detected: 121
PDF URL: https://magpi.raspberrypi.com/downloads/…/MagPi121.pdf

Set up Twilio to Send SMS Notifications

Before implementing the actual notification mechanism, create a new interface to ensure all notification channels work the same. Create a file named INotificationService.cs and update its code like this:

namespace MagazineTracker;

public interface INotificationService
{
    Task SendNewIssueNotification(int issueNumber, string coverUrl, string mediaUrl);
}

In the demo project, you will implement SMS/MMS notifications using Twilio Programmable SMS.

Now that you have all the information, you need to pass it to Twilio so that you can get SMS notifications on your mobile device. To achieve this, first add the Twilio SDK to your project by running:

dotnet add package Twilio

You will need your Account SID and Auth Token to be able to talk to the Twilio API. You can find both of these on the welcome page in the account info section when you log in to the Twilio Console:

Account info section on the welcome page showing Account SID and Auth Token

To store these values, you can use environment variables or a vault service, but for local development, you can use dotnet user secrets. First, you need to initialize user secrets by running:

dotnet user-secrets init

Then, create two new user secrets called Twilio:AccountSid and Twilio:AuthToken and set the values:

dotnet user-secrets set Twilio:AccountSid {YOUR TWILIO ACCOUNT SID}
dotnet user-secrets set Twilio:AuthToken {YOUR TWILIO AUTH TOKEN}

Create a new file called SmsService.cs and add the following code:

using Microsoft.Extensions.Options;
using Twilio.Rest.Api.V2010.Account;
using Twilio.Types;

namespace MagazineTracker;

public class SmsService : INotificationService
{
    private readonly SmsSettings _smsSettings;

    public SmsService(IOptions<SmsSettings> smsSettings)
    {
        _smsSettings = smsSettings.Value;
    }
    
    public async Task SendNewIssueNotification(int issueNumber, string coverUrl, string mediaUrl)
    {
        await MessageResource.CreateAsync(
            body: $"Here's the latest issue (#{issueNumber}) of The MagPi Magazine: {mediaUrl}",
            from: new PhoneNumber(_smsSettings.FromPhoneNumber),
            to: new PhoneNumber(_smsSettings.ToPhoneNumber),
            mediaUrl: string.IsNullOrEmpty(coverUrl) ? null : new []
            {
                new Uri(coverUrl)
            }.ToList()
        );
    }
}

The SMS message needs to be sent from your Twilio phone number (which you can find right below the Account SID and Auth Token on the Twilio Console welcome page).

The reason the code checks whether or not coverUrl has a value is that some Twilio Phone Numbers don’t support MMS. For example, Twilio Phone Numbers from the United Kingdom (UK) do not support MMS, so my UK number could only send plain SMS. If you are not able to send MMS messages, simply pass an empty string as the cover URL, so that setting the coverUrl in your worker service looks like this:

var coverUrl = String.Empty;

Alternatively, you can create a boolean setting such as includeCoverUrl to manage this behaviour.
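A boolean setting like that could be applied where the media list is built. The following is a minimal sketch, assuming a hypothetical MediaUrlBuilder helper and an includeCoverUrl flag that you would read from your own settings; neither exists in the demo project.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the demo project.
public static class MediaUrlBuilder
{
    // Returns null when no media should be attached (plain SMS),
    // otherwise a single-item list holding the cover image URL.
    public static List<Uri>? Build(string? coverUrl, bool includeCoverUrl)
    {
        if (!includeCoverUrl || string.IsNullOrEmpty(coverUrl))
        {
            return null;
        }

        return new List<Uri> { new Uri(coverUrl) };
    }
}
```

The result could then be passed as the mediaUrl argument when creating the message, which keeps the MMS decision in one place instead of in the worker service.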

To store both from and to phone numbers, update appsettings.json like this:


{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.Hosting.Lifetime": "Information"
    }
  },
  "DatabaseSettings": {
    "JsonFilePath": "./Data/db.json"
  },
  "SmsSettings": {
    "FromPhoneNumber": "{YOUR TWILIO PHONE NUMBER}",
    "ToPhoneNumber": "{YOUR ACTUAL PHONE NUMBER}"
  }
}

Create a file called SmsSettings.cs with the following class:

namespace MagazineTracker;

public class SmsSettings
{
    public string FromPhoneNumber { get; set; }
    public string ToPhoneNumber { get; set; }
}

Finally, update Program.cs to reflect these changes:

using MagazineTracker;
using MagazineTracker.Data;
using Twilio;

IHost host = Host.CreateDefaultBuilder(args)
    .ConfigureServices((hostBuilderContext, services) =>
    {
        services.AddHostedService<Worker>();
        services.AddTransient<IMagazineIssueRepository, JsonMagazineIssueRepository>();
        services.AddTransient<IMagazineTrackerService, MagPiTrackerService>();
        services.AddTransient<INotificationService, SmsService>();
        services.Configure<DatabaseSettings>(hostBuilderContext.Configuration.GetSection("DatabaseSettings"));
        services.Configure<SmsSettings>(hostBuilderContext.Configuration.GetSection("SmsSettings"));
        
        var accountSid = hostBuilderContext.Configuration["Twilio:AccountSid"];
        var authToken = hostBuilderContext.Configuration["Twilio:AuthToken"];
        TwilioClient.Init(accountSid, authToken);
    })
    .Build();

// await host.RunAsync();

var repo = host.Services.GetRequiredService<IMagazineIssueRepository>();
var tracker = host.Services.GetRequiredService<IMagazineTrackerService>();
var notificationService = host.Services.GetRequiredService<INotificationService>();

var latestProcessedIssue = await repo.GetLatestIssue();
var latestIssueNumber = await tracker.GetLatestIssueNumber();
if (latestIssueNumber > latestProcessedIssue.IssueNumber)
{
    Console.WriteLine($"New issue detected: {latestIssueNumber}");
    var coverUrl = await tracker.GetLatestIssueCoverUrl();
    var pdfUrl = await tracker.GetIssuePdfUrl(latestIssueNumber);
    await notificationService.SendNewIssueNotification(latestIssueNumber, coverUrl, pdfUrl);
    await repo.SaveLatestIssue(latestIssueNumber);
}

Sending a message via WhatsApp works the same way, except you can only use a sandbox environment unless your account is approved. The sandbox session expires after 3 days, so it’s not a great fit for continuous notifications, but if your account is approved already, you can use SmsService without any modifications. All you have to do is replace the “from phone number” with “whatsapp:+xxxxxxxxxxx”, where xxxxxxxxxxx is the number provided to you by Twilio. Also, prefix the “to phone number” with “whatsapp:”.
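If you want to switch between SMS and WhatsApp without editing both numbers by hand, the prefix can be applied in code. This is a minimal sketch, assuming a hypothetical WhatsAppAddress helper that is not part of the demo project:

```csharp
using System;

// Hypothetical helper, not part of the demo project.
public static class WhatsAppAddress
{
    private const string Prefix = "whatsapp:";

    // Adds the "whatsapp:" prefix unless the number already carries it.
    public static string From(string phoneNumber)
    {
        return phoneNumber.StartsWith(Prefix, StringComparison.OrdinalIgnoreCase)
            ? phoneNumber
            : Prefix + phoneNumber;
    }
}
```

Calling WhatsAppAddress.From on both the “from” and “to” numbers before constructing the PhoneNumber objects would leave the rest of SmsService unchanged.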

Time to test the final version (which also updates the database with the latest issue number). Run the application, and you should receive an SMS/MMS on your phone.

My UK Twilio Phone Number doesn’t support MMS. If I try to set the coverUrl to the image URL, I get the following exception:

Twilio.Exceptions.ApiException: Number: +44xxxxxxxxxx has not been enabled for MMS

So I set the coverUrl to an empty string, as discussed previously, and the SMS I receive on my phone looks like this:

Phone screenshot showing an SMS message with text and a link that says Tap to Load Preview

And when I tap on the link, I get this:

SMS showing the cover image after tapping the preview link

To test the MMS feature, I purchased a US Twilio Phone Number and sent the same message with the actual coverUrl. In other words, I reverted the code to its original version: var coverUrl = await _magazineTrackerService.GetLatestIssueCoverUrl();

When I send the message from the US phone number, I get this message:

MMS sent from a US Twilio Phone Number showing the text, link to the PDF and a shortened link to the cover image

It shows the text, the full URL to the PDF and a shortened URL of the cover image.

In my case, I prefer the original message. Depending on your phone, carrier and the messaging app you use, your experience may vary. I’d recommend playing around with splitting up the notification into multiple messages, such as sending the text in one message and the cover image in another or sending text, cover image, and URL all in different messages. Try it out and decide which format you like the most.

Schedule the Worker Service

You have a working application, but it only functions when you run it manually. To automate the process, move the code into the Worker class in Worker.cs, as shown below:

using MagazineTracker.Data;

namespace MagazineTracker;

public class Worker : BackgroundService
{
    private readonly ILogger<Worker> _logger;
    private readonly IMagazineIssueRepository _magazineIssueRepository;
    private readonly IMagazineTrackerService _magazineTrackerService;
    private readonly INotificationService _notificationService;

    public Worker(ILogger<Worker> logger, IMagazineIssueRepository magazineIssueRepository, IMagazineTrackerService magazineTrackerService, INotificationService notificationService)
    {
        _logger = logger;
        _magazineIssueRepository = magazineIssueRepository;
        _magazineTrackerService = magazineTrackerService;
        _notificationService = notificationService;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            _logger.LogInformation("Worker running at: {time}", DateTimeOffset.Now);
            
            var latestProcessedIssue = await _magazineIssueRepository.GetLatestIssue();
            var latestIssueNumber = await _magazineTrackerService.GetLatestIssueNumber();
            if (latestIssueNumber > latestProcessedIssue.IssueNumber)
            {
                _logger.LogInformation("New issue detected: {latestIssueNumber}", latestIssueNumber);
                var coverUrl = await _magazineTrackerService.GetLatestIssueCoverUrl();
                var pdfUrl = await _magazineTrackerService.GetIssuePdfUrl(latestIssueNumber);
                await _notificationService.SendNewIssueNotification(latestIssueNumber, coverUrl, pdfUrl);
                await _magazineIssueRepository.SaveLatestIssue(latestIssueNumber);
            }
            else
            {
                _logger.LogInformation("No new issue is detected.");
            }
            
            await Task.Delay(TimeSpan.FromHours(1), stoppingToken); // Run hourly
        }
    }
}

This way, you can remove all the previous test code and initializations and Program.cs becomes very concise:

using MagazineTracker;
using MagazineTracker.Data;
using Twilio;

IHost host = Host.CreateDefaultBuilder(args)
    .ConfigureServices((hostBuilderContext, services) =>
    {
        services.AddHostedService<Worker>();
        services.AddTransient<IMagazineIssueRepository, JsonMagazineIssueRepository>();
        services.AddTransient<IMagazineTrackerService, MagPiTrackerService>();
        services.AddTransient<INotificationService, SmsService>();
        services.Configure<DatabaseSettings>(hostBuilderContext.Configuration.GetSection("DatabaseSettings"));
        services.Configure<SmsSettings>(hostBuilderContext.Configuration.GetSection("SmsSettings"));
        
        var accountSid = hostBuilderContext.Configuration["Twilio:AccountSid"];
        var authToken = hostBuilderContext.Configuration["Twilio:AuthToken"];
        TwilioClient.Init(accountSid, authToken);
    })
    .Build();

await host.RunAsync();

Now reset the database to a value lower than the latest issue number and run the application again. You should receive an SMS/MMS, and your database should be updated with the latest issue number. The service then waits for 1 hour and runs the check again. You can, of course, change how often it checks for new issues by changing the delay.
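If you would rather change the interval without recompiling, the delay could come from configuration. The sketch below assumes a hypothetical WorkerSettings class with a CheckIntervalMinutes property and a hypothetical WorkerSchedule helper; neither is part of the demo project.

```csharp
using System;

// Hypothetical settings class; the demo project uses a fixed delay instead.
public class WorkerSettings
{
    public int CheckIntervalMinutes { get; set; } = 60;
}

// Hypothetical helper, not part of the demo project.
public static class WorkerSchedule
{
    // Converts the configured minutes to a TimeSpan, falling back to
    // one hour when the value is zero or negative.
    public static TimeSpan GetDelay(WorkerSettings settings)
    {
        var minutes = settings.CheckIntervalMinutes > 0 ? settings.CheckIntervalMinutes : 60;
        return TimeSpan.FromMinutes(minutes);
    }
}
```

In the worker loop, the wait would then become await Task.Delay(WorkerSchedule.GetDelay(settings), stoppingToken);, with settings bound from appsettings.json in the same way as DatabaseSettings.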

Conclusion

My favorite projects are the ones that I develop to solve a real problem of mine. This one was a small issue, but I like the idea of automating something that I’d otherwise forget. Even though there is only one implementation of a magazine tracker service, you can adapt the existing code for your favorite publication. As long as you add a new class that implements the same interface, you can replace the registration code in Program.cs and your application will start fetching that magazine. The same goes for the notification: you can replace SMS/MMS with WhatsApp, or with email using SendGrid.
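As an illustration of swapping the notification channel, here is a minimal sketch of an alternative INotificationService that writes to the console instead of sending a message, which can be handy for dry runs. The ConsoleNotificationService class is hypothetical and not part of the demo project; the interface is repeated only to keep the example self-contained.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Same signature as the interface defined earlier in the tutorial.
public interface INotificationService
{
    Task SendNewIssueNotification(int issueNumber, string coverUrl, string mediaUrl);
}

// Hypothetical implementation that writes the notification to a TextWriter
// (the console by default) instead of sending an SMS/MMS.
public class ConsoleNotificationService : INotificationService
{
    private readonly TextWriter _output;

    public ConsoleNotificationService() : this(Console.Out)
    {
    }

    public ConsoleNotificationService(TextWriter output)
    {
        _output = output;
    }

    public Task SendNewIssueNotification(int issueNumber, string coverUrl, string mediaUrl)
    {
        _output.WriteLine($"New issue #{issueNumber}: {mediaUrl}");
        if (!string.IsNullOrEmpty(coverUrl))
        {
            _output.WriteLine($"Cover: {coverUrl}");
        }

        return Task.CompletedTask;
    }
}
```

To try it, you would register it in Program.cs with services.AddTransient<INotificationService, ConsoleNotificationService>(); in place of the SmsService registration.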


Volkan Paksoy is a software developer with more than 15 years of experience, focusing mainly on C# and AWS. He’s a home lab and self-hosting fan who loves to spend his personal time developing hobby projects with Raspberry Pi, Arduino, LEGO and everything in-between. You can follow his personal blogs on software development at devpower.co.uk and cloudinternals.net.