Web Scraping posts

  • By Sam Agnew
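    Web Scraping and Parsing HTML with Python and Beautiful Soup

    This post was originally published as a Japanese translation of Sam Agnew's English article.

    The internet is overflowing with data. But when that data is not offered through a REST API, it becomes difficult to access programmatically. With Python tools like Beautiful Soup, you can scrape and parse data directly from web pages and use it in your projects and applications.

    This post shows how to scrape MIDI data from the internet. A previous blog post showed how to create classic Nintendo game music by training a neural network with Magenta. That implementation requires MIDI music from old Nintendo games. This time, we'll use Beautiful Soup to get MIDI data from the Video Game Music Archive.

    Setting up the project and its dependencies

    First, make sure you have an up-to-date version of Python 3 and pip installed. Also, create and activate a virtual environment before installing any dependencies.

    You need to install the Requests library, which makes the HTTP requests that fetch data from web pages, and Beautiful Soup, which parses the HTML.

    Once the virtual environment is activated, run the following command in your terminal: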

    pip install requests==2.22. …
  • By Matt Nikonorov
    How to Scrape Websites With PHP Using Goutte

    For many PHP-based applications involving data collection or data analysis, PHP scripts need to scrape data from external web pages. This is especially true if the web source you want to interact with doesn't provide an API, or provides one that you don't want to pay for.

    Web scraping is usually performed with Node.js or Python. However, when you need to scrape data and pass it to the frontend, using Node.js or Python complicates the process of getting the scraped data onto a web page.

    This is where Goutte makes life easier. Instead of relying on Node.js or Python scripts to scrape data from the web and display it on the frontend by passing it to a PHP script, with Goutte, you can scrape data from the web directly inside of your PHP script. …

  • By Sam Agnew
    Automated Headless Browser Scripting in Node.js with Playwright

    Hello and thanks for reading! This blog post is a translation of Automated Headless Browser scripting in Node.js with Playwright. While we improve our translation processes, we'd appreciate your feedback at help@twilio.com if you notice anything that was translated incorrectly. We thank helpful contributors with Twilio swag :)

    Sometimes the data we need is available online, but not through a public API. Web scraping can help in such cases, but only if the data is statically available on a web page. Fortunately for developers, most tasks they perform manually in the browser can be automated with Playwright. Playwright is a Node library, developed by the same team as Puppeteer, that offers a high-level API for automating tasks in various browsers.

    I'll now show how we can use Playwright to interact with web pages programmatically. In this example we use the tool Native …

  • By Sam Agnew
    Web Scraping and Parsing HTML in Node.js with jsdom

    Hello and thanks for reading! This blog post is a translation of Web Scraping and Parsing HTML in Node.js with jsdom. While we improve our translation processes, we'd appreciate your feedback at help@twilio.com if you notice anything that was translated incorrectly. We thank helpful contributors with Twilio swag :)

    The internet holds a wide variety of data that we can use as we please. Programmatic access to this data is often difficult, however, unless it is provided through a dedicated REST API. But with a Node.js tool like jsdom, we can scrape and parse this data directly from web pages to use in our projects and applications.

    One example is MIDI data that we need in order to train a neural network to generate music in the classic Nintendo style. For that, we first need MIDI files with music from old Nintendo games. With jsdom …

  • By Sam Agnew
    Web Scraping and Parsing HTML in Python with Beautiful Soup

    The internet offers an incredible diversity of information meant for human consumption. But this data is often difficult to access programmatically if it does not come in the form of a dedicated REST API. With Python tools like Beautiful Soup, you can scrape and parse web pages, then use that data in your projects.

    For example: how do you retrieve MIDI data from the internet to train a neural network with Magenta that can generate retro Nintendo music?

    For this we need a set of MIDI music from old Nintendo games. Beautiful Soup lets us get this data from the Video Game Music Archive.

    Getting started and installing dependencies

    Before continuing, make sure you have an up-to-date version of Python 3 and pip installed. Create and activate a virtual environment before installing any dependencies. …

  • By Luís Leão
    Web Scraping with Python and Beautiful Soup

    The internet has an incredible variety of information for human consumption, but this data is often difficult to access programmatically if it does not come in the form of a dedicated REST API. With Python tools like Beautiful Soup, you can scrape and process this data directly from web pages to use in your projects and applications.

    As an example, we'll scrape MIDI files from the internet to train a neural network with Magenta to generate classic Nintendo music and sounds. To do this, we'll need a set of MIDI music from old Nintendo games. Using Beautiful Soup, we can get this data from the Video Game Music Archive.

    Getting started and setting up dependencies

    Before we continue, you'll need to make sure you have an up-to-date version of Python 3 and pip installed. Make sure to …

  • By Sam Agnew
    4 Tools for Extracting Data in Node.js

    Sometimes the data you need is available online, but not through a REST API. Fortunately for JavaScript developers, a variety of tools are available in Node.js for extracting and parsing data directly from websites to use in your projects and applications.

    Let's look at 4 of these libraries to see how they work and how they differ.

    Make sure you have up-to-date versions of Node.js (at least 12.0.0) and npm installed on your machine. Run the following terminal command in the directory where you want your code to live:

    npm init --yes
    

    For some of these applications, we'll use the Got library to make HTTP calls, so install it with the following command in the …

  • By Sam Agnew
    4 Tools for Web Scraping in Node.js

    Sometimes the data you need is available online, but not through a dedicated REST API. Luckily for JavaScript developers, there are a variety of tools available in Node.js for scraping and parsing data directly from websites to use in your projects and applications.

    Let's walk through 4 of these libraries to see how they work and how they compare to each other.

    Make sure you have up-to-date versions of Node.js (at least 12.0.0) and npm installed on your machine. Run the following terminal command in the directory where you want your code to live:

    npm init --yes
    

    For some of these applications, we'll be using the Got library for making HTTP requests, so install that with this command in the same directory:

    npm install got@11.0.2
    

    Let's try finding all of the links to unique MIDI files on this web page from the Video Game Music Archive with a …

  • By Sam Agnew
    Automated Headless Browser scripting in Node.js with Playwright

    Sometimes the data you need is available online, but not through a public API. Web scraping can be useful in these situations, but only if this data can be accessed statically on a web page. Fortunately for developers everywhere, most things that you can do manually in the browser can be done using Playwright, a Node library built by the same team that made Puppeteer which provides a high-level API for automating various browsers.

    Let's walk through how to use Playwright to interact with web pages programmatically. In this example we'll use the Native Land Digital tool, an awesome project built to teach people more about their local indigenous history. In this case, an API does exist, but it only takes location data in the form of geo-coordinates rather than a more user-friendly address. We'll write code to programmatically type an address and figure out which Native land corresponds …

  • By Sam Agnew
    Web Scraping and Parsing HTML in Node.js with jsdom

    The internet has a wide variety of information for human consumption. But this data is often difficult to access programmatically if it doesn't come in the form of a dedicated REST API. With Node.js tools like jsdom, you can scrape and parse this data directly from web pages to use for your projects and applications.

    Let's use the example of needing MIDI data to train a neural network that can generate classic Nintendo-sounding music. In order to do this, we'll need a set of MIDI music from old Nintendo games. Using jsdom we can scrape this data from the Video Game Music Archive.

    Getting started and setting up dependencies

    Before moving on, make sure you have up-to-date versions of Node.js and npm installed.

    Navigate to the directory where you want this code to live and run the following …

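Several of the posts above share one task: collecting links to MIDI files from a web page. As a rough sketch of that idea, and not the code from any of the posts, the Python snippet below uses the standard library's html.parser in place of Beautiful Soup or jsdom so that it runs without installing anything. The sample HTML, the example.com base URL, and the names MidiLinkParser and find_midi_links are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MidiLinkParser(HTMLParser):
    """Collects the href of every <a> tag that points at a .mid file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without an href and links that are not MIDI files.
        if href and href.lower().endswith(".mid"):
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:  # keep only the first occurrence
                self.links.append(absolute)


def find_midi_links(html, base_url):
    """Return unique absolute URLs of MIDI links, in document order."""
    parser = MidiLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical sample page; a real script would download the HTML first.
SAMPLE_HTML = """
<html><body>
  <a href="/music/nes/Zelda_Overworld.mid">Overworld</a>
  <a href="/music/nes/Zelda_Overworld.mid">Overworld (duplicate link)</a>
  <a href="Mario_Theme.MID">Theme</a>
  <a href="/about.html">About</a>
  <a name="top">No href here</a>
</body></html>
"""

links = find_midi_links(SAMPLE_HTML, "https://example.com/archive/")
for link in links:
    print(link)
```

In a real script, the HTML string would come from an HTTP response (for example, one fetched with Requests) rather than from an inline sample, and a library such as Beautiful Soup could replace the hand-written parser class.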