Build a web scraper with Symfony

Saeed
9 min read · Jun 24, 2021

For the last few years, scraping websites and gathering information from different sources has become one of the industry’s primary tools. With a scraper, companies can collect data from various websites and feed it to their AI platforms to improve their algorithms and results.

That’s why I decided to create a tutorial on how to build a web scraper in Symfony that fetches data from different websites. The goal of this project is a dynamic, extendable web scraper that works with most websites out there.

Structure

What we want to achieve here is a system that scrapes and gathers news from different sources. Our approach should be extendable, meaning we should be able to fetch data from a new source easily, without changing or writing much code.

The data points we are interested in on a news website are:

  • website title
  • website URL
  • post title
  • post URL
  • post image (if any)
  • post description or body
  • post date
  • post author

All of these elements are accessible through the website’s HTML source code. The good news is that almost every website has these elements but renders them differently. For instance, one website uses the `<article>` tag to mark each article or post, while another uses `<section>`. This problem is fairly easy to solve as long as we have access to the website’s HTML source. However, some websites use React, Angular, or Vue.js to render their content; these are known as SPAs, or single-page applications. If we look at the source of these websites, we see there is almost no HTML, and everything is rendered by JavaScript. To access a SPA’s markup, we first have to find a way to execute the JavaScript on the page.

Installation

Create a new project with Symfony:

symfony new web-scraper
cd web-scraper

Then install the following packages, which we’ll need throughout this article:

composer require symfony/maker-bundle --dev
composer require symfony/orm-pack

To be able to render SPA applications, we are going to use the symfony/panther package. Under the hood, it drives a headless Chrome/Firefox browser to execute the JavaScript and hand us the resulting HTML. Install the package:

composer require symfony/panther

The bdi package is responsible for detecting, installing, and verifying the browser drivers for us.

Run these commands:

composer require --dev dbrekelmans/bdi
vendor/bin/bdi detect drivers
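
Before moving on, you can sanity-check that Panther and the driver are wired up with a small standalone script (a minimal sketch; the URL and the h1 selector are placeholders):

<?php
// smoke-test.php, run with: php smoke-test.php
require __DIR__.'/vendor/autoload.php';

use Symfony\Component\Panther\Client;

// Panther picks up the driver that bdi installed into the drivers/ directory.
$client = Client::createChromeClient();

// The headless browser loads the page, executes its JavaScript,
// and gives us back the rendered DOM as a crawler.
$crawler = $client->request('GET', 'https://example.com');

echo $crawler->filter('h1')->text(), PHP_EOL;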

Development

Let’s start development by defining our interface.

SourceInterface

To make our project extendable, we need an interface to make sure every new source (website) we want to scrape follows the same rules. For this purpose, create a file called SourceInterface.php under the src/Scraper/Contracts directory:

<?php

namespace App\Scraper\Contracts;

interface SourceInterface
{
    public function getUrl(): string;
    public function getName(): string;
    public function getWrapperSelector(): string;
    public function getTitleSelector(): string;
    public function getDescSelector(): string;
    public function getDateSelector(): string;
    public function getLinkSelector(): string;
    public function getImageSelector(): string;
}

This interface dictates that every source we are going to scrape must implement these methods, each of which returns the CSS selector for a particular part of the web page. We’ll see in a bit how this interface helps us.

Create a new source

Create a folder under the src directory and call it Sources. Inside this directory, create a file Coindesk.php and add the following content to it:

<?php

namespace App\Sources;

use App\Scraper\Contracts\SourceInterface;

class Coindesk implements SourceInterface
{
    public function getUrl(): string
    {
        return 'https://www.coindesk.com/news';
    }

    public function getName(): string
    {
        return 'Coindesk';
    }

    public function getWrapperSelector(): string
    {
        return 'section.list-body .list-item-wrapper';
    }

    public function getTitleSelector(): string
    {
        return 'a h4.heading';
    }

    public function getDescSelector(): string
    {
        return 'a p.card-text';
    }

    public function getDateSelector(): string
    {
        return 'time.time';
    }

    public function getLinkSelector(): string
    {
        return 'div.text-content a:nth-child(2)';
    }

    public function getImageSelector(): string
    {
        return 'img.list-img';
    }
}

The getUrl() method returns the URL of the page we are going to crawl, and getName() returns the website’s name. The other methods return the CSS selectors used to access particular elements on the page. Notice that the class implements SourceInterface.

Post

Before we write our scraper, we need one more class. In order to store the scraped data in the database, create a Post.php file under the src/Entity directory:

<?php

namespace App\Entity;

use App\Repository\PostRepository;
use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Entity(repositoryClass=PostRepository::class)
 */
class Post implements \JsonSerializable
{
    /**
     * @ORM\Id
     * @ORM\GeneratedValue
     * @ORM\Column(type="integer")
     */
    private $id;

    /**
     * @ORM\Column(type="string")
     */
    private string $title;

    /**
     * @ORM\Column(type="string")
     */
    private string $description;

    /**
     * @ORM\Column(type="string")
     */
    private string $url;

    /**
     * @ORM\Column(type="datetime")
     */
    private \DateTime $dateTime;

    /**
     * @ORM\Column(type="string")
     */
    private string $author;

    /**
     * @ORM\Column(type="string")
     */
    private string $image;

    public function getId(): ?int
    {
        return $this->id;
    }

    public function getTitle(): string
    {
        return $this->title;
    }

    public function setTitle(string $title): void
    {
        $this->title = $title;
    }

    public function getDescription(): string
    {
        return $this->description;
    }

    public function setDescription(string $description): void
    {
        $this->description = $description;
    }

    public function getUrl(): string
    {
        return $this->url;
    }

    public function setUrl(string $url): void
    {
        $this->url = $url;
    }

    public function getDateTime(): \DateTime
    {
        return $this->dateTime;
    }

    public function setDateTime(\DateTime $dateTime): void
    {
        $this->dateTime = $dateTime;
    }

    public function getAuthor(): string
    {
        return $this->author;
    }

    public function setAuthor(string $author): void
    {
        $this->author = $author;
    }

    public function getImage(): string
    {
        return $this->image;
    }

    public function setImage(string $image): void
    {
        $this->image = $image;
    }

    public function jsonSerialize()
    {
        return [
            'title' => $this->getTitle(),
            'url' => $this->getUrl(),
            'desc' => $this->getDescription(),
            'date' => $this->getDateTime(),
            'image' => $this->getImage(),
        ];
    }
}

Notice that we implemented the JsonSerializable interface, which helps us convert a Post entity to JSON. You could use the built-in Symfony serializer if you wish, but for now the JsonSerializable interface does the job in our case.
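
Since Post implements \JsonSerializable, json_encode() will call jsonSerialize() automatically. A quick illustration (the values are made up):

$post = new Post();
$post->setTitle('Example title');
$post->setUrl('https://example.com/post');
$post->setDescription('Example description');
$post->setDateTime(new \DateTime('2021-06-24'));
$post->setImage('');

// json_encode() uses jsonSerialize() under the hood:
echo json_encode($post);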

Scraper

The Scraper class is where we crawl a source and fetch its data. Here we have to deal with different scenarios and build a flexible scraper that works with different input formats. For example, one website may format dates as “YY-MM-DD” while another uses “MM dd Y”; our job is to fetch the data from the selected tag and then normalize it.

Now create a file Scraper.php under the src/Scraper directory and add the following content to it:

<?php

namespace App\Scraper;

use App\Entity\Post;
use App\Scraper\Contracts\SourceInterface;
use Doctrine\Common\Collections\ArrayCollection;
use Doctrine\Common\Collections\Collection;
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\Panther\Client;

class Scraper
{
    public function scrap(SourceInterface $source): Collection
    {
        $collection = [];
        $client = Client::createChromeClient();
        $crawler = $client->request('GET', $source->getUrl());
        $crawler->filter($source->getWrapperSelector())->each(function (Crawler $c) use ($source, &$collection) {
            // Wrappers without a link are usually ads; skip them.
            if (!$c->filter($source->getLinkSelector())->count()) {
                return;
            }

            $post = new Post();

            // Find and filter the title.
            $title = $c->filter($source->getTitleSelector())->text();
            $post->setTitle($title);

            // Some websites use the datetime attribute of the <time> tag to store the
            // full date and time. We first check whether this attribute exists;
            // otherwise we fall back to the text inside the tag.
            $dateTime = $c->filter($source->getDateSelector())->attr('datetime');
            if (!$dateTime) {
                $dateTime = $c->filter($source->getDateSelector())->text();
            }
            $dateTime = $this->cleanupDate($dateTime);
            $post->setDateTime($dateTime);

            $link = $c->filter($source->getLinkSelector())->attr('href');
            $post->setUrl($link);

            $desc = $c->filter($source->getDescSelector())->text();
            $post->setDescription($desc);

            // The image is optional, so fall back to an empty string.
            $image = $c->filter($source->getImageSelector());
            $post->setImage($image->count() ? (string) $image->attr('src') : '');

            $collection[] = $post;
        });

        return new ArrayCollection($collection);
    }

    /**
     * In order to make DateTime work, we need to clean up the input.
     *
     * @throws \Exception
     */
    private function cleanupDate(string $dateTime): \DateTime
    {
        $dateTime = str_replace(['(', ')', 'UTC', 'at', '|'], '', $dateTime);

        return new \DateTime($dateTime);
    }
}

Here is our simple yet powerful scraper. With a few lines of code, we can scrape any source we want, as long as the CSS selectors we’ve chosen are correct.
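
To make cleanupDate() concrete, here is what it does to a hypothetical raw value (the input string is made up for illustration):

// A hypothetical raw value scraped from a <time> tag:
$raw = '(Jun 24, 2021 at 10:30) | UTC';

// str_replace(['(', ')', 'UTC', 'at', '|'], '', $raw) leaves "Jun 24, 2021  10:30",
// which DateTime parses without complaint:
$date = new \DateTime(str_replace(['(', ')', 'UTC', 'at', '|'], '', $raw));
echo $date->format('Y-m-d H:i'); // 2021-06-24 10:30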

Create a UI for adding new sources

So far, we can scrape a website by creating a new source class under the Sources directory. But as you may have noticed, this method is not flexible: every time we want to add a new source, we need to create a class and specify its selectors. To solve this problem, we can store all sources in the database and add or modify them through an HTML form.

Source entity

Create an entity, call it Source, and implement the SourceInterface:

<?php

namespace App\Entity;

use App\Repository\SourceRepository;
use App\Scraper\Contracts\SourceInterface;
use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Entity(repositoryClass=SourceRepository::class)
 */
class Source implements SourceInterface
{
    /**
     * @ORM\Id
     * @ORM\GeneratedValue
     * @ORM\Column(type="integer")
     */
    private $id;

    /**
     * @ORM\Column(type="string")
     */
    private string $url;

    /**
     * @ORM\Column(type="string")
     */
    private string $name;

    /**
     * @ORM\Column(type="string")
     */
    private string $wrapperSelector;

    /**
     * @ORM\Column(type="string")
     */
    private string $titleSelector;

    /**
     * @ORM\Column(type="string")
     */
    private string $descSelector;

    /**
     * @ORM\Column(type="string")
     */
    private string $linkSelector;

    /**
     * @ORM\Column(type="string")
     */
    private string $dateSelector;

    /**
     * @ORM\Column(type="string")
     */
    private string $imageSelector;

    /**
     * @return mixed
     */
    public function getId()
    {
        return $this->id;
    }

    /**
     * @param mixed $id
     */
    public function setId($id): void
    {
        $this->id = $id;
    }

    public function getUrl(): string
    {
        return $this->url;
    }

    public function setUrl(string $url): void
    {
        $this->url = $url;
    }

    public function getName(): string
    {
        return $this->name;
    }

    public function setName(string $name): void
    {
        $this->name = $name;
    }

    public function getWrapperSelector(): string
    {
        return $this->wrapperSelector;
    }

    public function setWrapperSelector(string $wrapperSelector): void
    {
        $this->wrapperSelector = $wrapperSelector;
    }

    public function getTitleSelector(): string
    {
        return $this->titleSelector;
    }

    public function setTitleSelector(string $titleSelector): void
    {
        $this->titleSelector = $titleSelector;
    }

    public function getDescSelector(): string
    {
        return $this->descSelector;
    }

    public function setDescSelector(string $descSelector): void
    {
        $this->descSelector = $descSelector;
    }

    public function getLinkSelector(): string
    {
        return $this->linkSelector;
    }

    public function setLinkSelector(string $linkSelector): void
    {
        $this->linkSelector = $linkSelector;
    }

    public function getDateSelector(): string
    {
        return $this->dateSelector;
    }

    public function setDateSelector(string $dateSelector): void
    {
        $this->dateSelector = $dateSelector;
    }

    public function getImageSelector(): string
    {
        return $this->imageSelector;
    }

    public function setImageSelector(string $imageSelector): void
    {
        $this->imageSelector = $imageSelector;
    }
}

Then run the following commands to create the table:

php bin/console make:migration
php bin/console doctrine:migrations:migrate
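
These commands assume a working database connection. If you have not configured one yet, point DATABASE_URL in your .env file at your database first (the credentials and database name below are placeholders):

# .env
DATABASE_URL="mysql://db_user:db_password@127.0.0.1:3306/web_scraper?serverVersion=8.0"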

Install EasyAdmin

EasyAdmin is a helpful package that creates a nice-looking admin panel with CRUD operations for our entities, without us writing any code.

composer require easycorp/easyadmin-bundle:2.x

After installation, open the config/packages/easy_admin.yaml file and put the following content in it:

easy_admin:
    entities:
        - App\Entity\Source

EasyAdmin looks at this file and creates CRUD operations for the specified entities.

If everything goes well, after starting the web server (symfony serve) and browsing to http://localhost:8000/admin, you should see the following page:

EasyAdmin

Go ahead and click on the Add Source button and create a new source:

Create a new source

Now create a source by entering the selectors we used for Coindesk.com and click the Save changes button.

HomeController

Now it’s time to see if our crawler works. Create a controller and call it HomeController:

<?php

namespace App\Controller;

use App\Entity\Source;
use App\Scraper\Scraper;
use Symfony\Bundle\FrameworkBundle\Controller\AbstractController;
use Symfony\Component\Routing\Annotation\Route;

class HomeController extends AbstractController
{
    private Scraper $scraper;

    public function __construct(Scraper $scraper)
    {
        $this->scraper = $scraper;
    }

    /**
     * @Route("/fetch/{id}", name="fetch")
     */
    public function fetch(Source $source)
    {
        $posts = $this->scraper->scrap($source);

        return $this->json($posts->toArray());
    }
}

There are two things to notice here. First, we took advantage of autowiring by type-hinting the Scraper class in the constructor. Second, we used a ParamConverter to convert the given id to the corresponding record in our database.

Remember that to use the ParamConverter, you should install the framework-extra-bundle package:

composer require sensio/framework-extra-bundle
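
For context, this is roughly the code the ParamConverter saves us from writing by hand (a sketch only, assuming the SourceRepository generated alongside the entity; you don’t need to add it):

/**
 * @Route("/fetch/{id}", name="fetch")
 */
public function fetch(int $id, SourceRepository $repository)
{
    // Without the ParamConverter, we would look up the record ourselves:
    $source = $repository->find($id);
    if (!$source) {
        throw $this->createNotFoundException('No source found for id '.$id);
    }

    $posts = $this->scraper->scrap($source);

    return $this->json($posts->toArray());
}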

Now everything is set up. Run the Symfony web server and browse to http://localhost:8000/fetch/1. You should see something like the following content:
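
The exact values depend on what Coindesk is publishing at the time, but the shape follows Post::jsonSerialize(), roughly (illustrative values only):

[
    {
        "title": "Some post title",
        "url": "https://www.coindesk.com/some-post",
        "desc": "A short description of the post...",
        "date": {
            "date": "2021-06-24 10:30:00.000000",
            "timezone_type": 3,
            "timezone": "UTC"
        },
        "image": "https://www.coindesk.com/some-image.jpg"
    }
]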

You can see how we can scrape a website simply by adding a new source and specifying some CSS selectors. There are two ways to do this: first, by creating a class under the Sources directory that implements SourceInterface; second, by creating a new record in the sources table in our database. There are other features you can add, such as a command to scrape the sources and a cron job to run that command on a schedule, as sketched below. You could also save the scraped data into the posts table, making sure not to insert duplicate records.
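
As a starting point, a console command wrapping the scraper could look like this (a sketch; the command name and fetching sources through SourceRepository are assumptions):

<?php

namespace App\Command;

use App\Repository\SourceRepository;
use App\Scraper\Scraper;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

class ScrapeCommand extends Command
{
    protected static $defaultName = 'app:scrape';

    private Scraper $scraper;
    private SourceRepository $sources;

    public function __construct(Scraper $scraper, SourceRepository $sources)
    {
        parent::__construct();
        $this->scraper = $scraper;
        $this->sources = $sources;
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        // Scrape every source stored in the database.
        foreach ($this->sources->findAll() as $source) {
            $posts = $this->scraper->scrap($source);
            $output->writeln(sprintf('%s: %d posts', $source->getName(), count($posts)));

            // Persisting the posts and skipping duplicates would go here.
        }

        return Command::SUCCESS;
    }
}

A cron entry such as 0 * * * * php /path/to/project/bin/console app:scrape would then run it every hour.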

Summary

Nowadays, collecting data from different sources helps us analyze the market, the news, or social media. The aim of this article was a scraper that collects data from different sources without us writing or modifying any code. There are other things that need to be done to have a truly full-featured scraper, such as saving the data into the database, filtering out duplicate records, adding a cron job, and so on.

Saeed

My ideas, thoughts, and tutorials about life, internet, and programming. https://github.com/smoqadam