Build a web scraper with Symfony

For the last few years, scraping websites and gathering information from different sources has become one of the industry’s primary tools. With a scraper, companies can collect data from various websites and feed it to their AI platforms to enhance their algorithms and results.

That’s why I decided to create a tutorial on how to make a web scraper in Symfony to fetch data from different websites. This project aims to have a dynamic and extendable web scraper that works with most websites out there.

Structure

The data we are interested in on a news website are:

  • website title
  • website URL
  • post title
  • post URL
  • post image (if any)
  • post description or body
  • post date
  • post author

All of these elements are accessible through the website’s HTML source code. The good news is that almost every website has these elements; the catch is that each website renders them differently. For instance, one website uses the `<article>` tag to mark each article or post, while another uses `<section>`. This is fairly easy to handle as long as we have access to the website’s HTML source. However, some websites use React, Angular, or Vue.js to render their content; these are known as SPAs, or single-page applications. If we look at the source of such a website, we’d see almost no HTML content, because everything is rendered by JavaScript in the browser. To access a SPA website’s content, we have to find a way to execute that JavaScript and capture the resulting HTML.
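
To make the difference concrete, here is a hypothetical markup sketch (the element names and classes are made up for illustration): a server-rendered page ships its content in the HTML, while a SPA initially ships little more than an empty root element.

<!-- Server-rendered: the post data is right in the source -->
<article class="list-item">
  <h4 class="heading">Post title</h4>
  <p class="card-text">Post description...</p>
  <time datetime="2021-03-01T10:00:00Z">Mar 1, 2021</time>
</article>

<!-- SPA: no content until the JavaScript runs -->
<div id="root"></div>
<script src="/bundle.js"></script>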

Installation

symfony new web-scraper
cd web-scraper

Then install the following packages as we need them throughout this article:

composer require maker
composer require orm

To be able to render SPA websites, we are going to use the symfony/panther package. It uses a headless Chrome/Gecko browser under the hood to execute the JavaScript and produce the final HTML. Install the package:

composer require symfony/panther

The bdi package is responsible for installing and verifying the browser drivers for us.

Run these commands:

composer require --dev dbrekelmans/bdi
vendor/bin/bdi detect drivers

Development

SourceInterface

<?php

namespace App\Scraper\Contracts;

interface SourceInterface
{
    public function getUrl(): string;

    public function getName(): string;

    public function getWrapperSelector(): string;

    public function getTitleSelector(): string;

    public function getDescSelector(): string;

    public function getDateSelector(): string;

    public function getLinkSelector(): string;

    public function getImageSelector(): string;
}

This interface indicates that each source we are going to scrape must implement these methods, each of which returns the CSS selector for a particular part of the web page. We’ll see in a bit how this interface is going to help us.

Create a new source

<?php

namespace App\Sources;

use App\Scraper\Contracts\SourceInterface;

class Coindesk implements SourceInterface
{
    public function getUrl(): string
    {
        return 'https://www.coindesk.com/news';
    }

    public function getName(): string
    {
        return 'Coindesk';
    }

    public function getWrapperSelector(): string
    {
        return 'section.list-body .list-item-wrapper';
    }

    public function getTitleSelector(): string
    {
        return 'a h4.heading';
    }

    public function getDescSelector(): string
    {
        return 'a p.card-text';
    }

    public function getDateSelector(): string
    {
        return 'time.time';
    }

    public function getLinkSelector(): string
    {
        return 'div.text-content a:nth-child(2)';
    }

    public function getImageSelector(): string
    {
        return 'img.list-img';
    }
}

The getUrl() method returns the URL of the page we are going to crawl, and getName() returns the website’s name. The other methods return the CSS selectors used to access particular elements on the page. Notice that the class implements SourceInterface.

Post

<?php

namespace App\Entity;

use App\Repository\PostRepository;
use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Entity(repositoryClass=PostRepository::class)
 */
class Post implements \JsonSerializable
{
    /**
     * @ORM\Id
     * @ORM\GeneratedValue
     * @ORM\Column(type="integer")
     */
    private $id;

    /**
     * @ORM\Column(type="string")
     */
    private string $title;

    /**
     * @ORM\Column(type="string")
     */
    private string $description;

    /**
     * @ORM\Column(type="string")
     */
    private string $url;

    /**
     * @ORM\Column(type="datetime")
     */
    private \DateTime $dateTime;

    /**
     * @ORM\Column(type="string")
     */
    private string $author;

    /**
     * @ORM\Column(type="string")
     */
    private string $image;

    public function getId(): ?int
    {
        return $this->id;
    }

    public function getTitle(): string
    {
        return $this->title;
    }

    public function setTitle(string $title): void
    {
        $this->title = $title;
    }

    public function getDescription(): string
    {
        return $this->description;
    }

    public function setDescription(string $description): void
    {
        $this->description = $description;
    }

    public function getUrl(): string
    {
        return $this->url;
    }

    public function setUrl(string $url): void
    {
        $this->url = $url;
    }

    public function getDateTime(): \DateTime
    {
        return $this->dateTime;
    }

    public function setDateTime(\DateTime $dateTime): void
    {
        $this->dateTime = $dateTime;
    }

    public function getAuthor(): string
    {
        return $this->author;
    }

    public function setAuthor(string $author): void
    {
        $this->author = $author;
    }

    public function getImage(): string
    {
        return $this->image;
    }

    public function setImage(string $image): void
    {
        $this->image = $image;
    }

    public function jsonSerialize()
    {
        return [
            'title' => $this->getTitle(),
            'url' => $this->getUrl(),
            'desc' => $this->getDescription(),
            'date' => $this->getDateTime(),
            'image' => $this->getImage(),
        ];
    }
}

Notice that we implemented the \JsonSerializable interface, which lets us convert a Post entity to JSON. You could use the built-in Symfony serializer if you wish, but the \JsonSerializable interface will do the job in our case.
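
For instance, json_encode() picks up jsonSerialize() automatically. Here is a quick sketch (the values are made up for illustration):

$post = new Post();
$post->setTitle('A sample post');
$post->setUrl('https://example.com/a-sample-post');
$post->setDescription('A short description');
$post->setDateTime(new \DateTime('2021-03-01 10:00'));
$post->setImage('https://example.com/image.jpg');

// json_encode() calls jsonSerialize() under the hood and
// prints the array it returns as a JSON object
echo json_encode($post);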

Scraper

Now create a file src/Scraper/Scraper.php and add the following content to it:

<?php

namespace App\Scraper;

use App\Entity\Post;
use App\Scraper\Contracts\SourceInterface;
use Doctrine\Common\Collections\ArrayCollection;
use Doctrine\Common\Collections\Collection;
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\Panther\Client;

class Scraper
{
    public function scrap(SourceInterface $source): Collection
    {
        $collection = [];
        $client = Client::createChromeClient();
        $crawler = $client->request('GET', $source->getUrl());
        $crawler->filter($source->getWrapperSelector())->each(function (Crawler $c) use ($source, &$collection) {
            // This check usually bypasses the ads: skip wrappers without a link
            if (!$c->filter($source->getLinkSelector())->count()) {
                return;
            }

            $post = new Post();

            // Find and filter the title
            $title = $c->filter($source->getTitleSelector())->text();
            $post->setTitle($title);

            // Some websites use the datetime attribute of the <time> tag to store
            // the full date and time. We first check whether this attribute exists;
            // otherwise we fall back to the text inside the tag.
            $dateTime = $c->filter($source->getDateSelector())->attr('datetime');
            if (!$dateTime) {
                $dateTime = $c->filter($source->getDateSelector())->text();
            }
            $dateTime = $this->cleanupDate($dateTime);
            $post->setDateTime($dateTime);

            $link = $c->filter($source->getLinkSelector())->attr('href');
            $post->setUrl($link);

            $desc = $c->filter($source->getDescSelector())->text();
            $post->setDescription($desc);

            // Not every post has an image, so fall back to an empty string
            $image = $c->filter($source->getImageSelector())->count()
                ? ($c->filter($source->getImageSelector())->attr('src') ?? '')
                : '';
            $post->setImage($image);

            $collection[] = $post;
        });

        return new ArrayCollection($collection);
    }

    /**
     * In order to make DateTime work, we need to clean up the input.
     *
     * @throws \Exception
     */
    private function cleanupDate(string $dateTime): \DateTime
    {
        $dateTime = str_replace(['(', ')', 'UTC', 'at', '|'], '', $dateTime);

        return new \DateTime($dateTime);
    }
}

Here is our simple, yet powerful scraper. With a few lines of code, we can scrape any source we want, as long as the CSS selectors we’ve chosen are correct. The cleanupDate() helper strips decorations such as parentheses, "UTC", "at", and "|", so that a string like "Mar 1, 2021 at 10:00 (UTC)" can be parsed by DateTime.
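
To see it in action, you can feed it the Coindesk class we created earlier. A minimal sketch (for example in a quick test script):

use App\Scraper\Scraper;
use App\Sources\Coindesk;

$scraper = new Scraper();
$posts = $scraper->scrap(new Coindesk());

// Print the scraped titles
foreach ($posts as $post) {
    echo $post->getTitle(), PHP_EOL;
}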

Create a UI for adding new sources

Source entity

<?php

namespace App\Entity;

use App\Repository\SourceRepository;
use App\Scraper\Contracts\SourceInterface;
use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Entity(repositoryClass=SourceRepository::class)
 */
class Source implements SourceInterface
{
    /**
     * @ORM\Id
     * @ORM\GeneratedValue
     * @ORM\Column(type="integer")
     */
    private $id;

    /**
     * @ORM\Column(type="string")
     */
    private string $url;

    /**
     * @ORM\Column(type="string")
     */
    private string $name;

    /**
     * @ORM\Column(type="string")
     */
    private string $wrapperSelector;

    /**
     * @ORM\Column(type="string")
     */
    private string $titleSelector;

    /**
     * @ORM\Column(type="string")
     */
    private string $descSelector;

    /**
     * @ORM\Column(type="string")
     */
    private string $linkSelector;

    /**
     * @ORM\Column(type="string")
     */
    private string $dateSelector;

    /**
     * @ORM\Column(type="string")
     */
    private string $imageSelector;

    /**
     * @return mixed
     */
    public function getId()
    {
        return $this->id;
    }

    /**
     * @param mixed $id
     */
    public function setId($id): void
    {
        $this->id = $id;
    }

    public function getUrl(): string
    {
        return $this->url;
    }

    public function setUrl(string $url): void
    {
        $this->url = $url;
    }

    public function getName(): string
    {
        return $this->name;
    }

    public function setName(string $name): void
    {
        $this->name = $name;
    }

    public function getWrapperSelector(): string
    {
        return $this->wrapperSelector;
    }

    public function setWrapperSelector(string $wrapperSelector): void
    {
        $this->wrapperSelector = $wrapperSelector;
    }

    public function getTitleSelector(): string
    {
        return $this->titleSelector;
    }

    public function setTitleSelector(string $titleSelector): void
    {
        $this->titleSelector = $titleSelector;
    }

    public function getDescSelector(): string
    {
        return $this->descSelector;
    }

    public function setDescSelector(string $descSelector): void
    {
        $this->descSelector = $descSelector;
    }

    public function getLinkSelector(): string
    {
        return $this->linkSelector;
    }

    public function setLinkSelector(string $linkSelector): void
    {
        $this->linkSelector = $linkSelector;
    }

    public function getDateSelector(): string
    {
        return $this->dateSelector;
    }

    public function setDateSelector(string $dateSelector): void
    {
        $this->dateSelector = $dateSelector;
    }

    public function getImageSelector(): string
    {
        return $this->imageSelector;
    }

    public function setImageSelector(string $imageSelector): void
    {
        $this->imageSelector = $imageSelector;
    }
}

Then run the following commands to create the tables:

php bin/console make:migration
php bin/console doctrine:migrations:migrate

Install EasyAdmin

composer require easycorp/easyadmin-bundle:2.x

After installation, open the config/packages/easy_admin.yaml file and put the following content in it:

easy_admin:
    entities:
        - App\Entity\Source

EasyAdmin looks at this file and creates CRUD operations for the specified entities.

If everything goes well, after browsing to http://localhost:8000/admin, you should see the EasyAdmin dashboard.

Go ahead and click the Add Source button, then create a new source by entering the selectors we used for Coindesk.com and clicking the Save changes button.

HomeController

<?php

namespace App\Controller;

use App\Entity\Source;
use App\Scraper\Scraper;
use Symfony\Bundle\FrameworkBundle\Controller\AbstractController;
use Symfony\Component\Routing\Annotation\Route;

class HomeController extends AbstractController
{
    private Scraper $scraper;

    public function __construct(Scraper $scraper)
    {
        $this->scraper = $scraper;
    }

    /**
     * @Route("/fetch/{id}", name="fetch")
     */
    public function fetch(Source $source)
    {
        $posts = $this->scraper->scrap($source);

        return $this->json($posts->toArray());
    }
}

There are two things to notice here. First, we took advantage of autowiring by type-hinting the Scraper class in the constructor. Second, we used param conversion to turn the given id into the corresponding Source record from our database.

Remember that to use param conversion, you need to install the framework-extra-bundle package:

composer require sensio/framework-extra-bundle
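
For clarity, here is roughly what that conversion would look like if you did it by hand in HomeController, using the SourceRepository generated alongside the entity. This is a hedged sketch; the ParamConverter handles all of this for you:

use App\Repository\SourceRepository;

/**
 * @Route("/fetch/{id}", name="fetch")
 */
public function fetch(int $id, SourceRepository $repository)
{
    // Look the entity up by its primary key and return a 404 if it is
    // missing, which is what the ParamConverter does automatically
    $source = $repository->find($id);
    if (!$source) {
        throw $this->createNotFoundException('Source not found');
    }

    $posts = $this->scraper->scrap($source);

    return $this->json($posts->toArray());
}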

Now everything is set up. Run the Symfony web server and browse to http://localhost:8000/fetch/1. You should see the scraped posts rendered as JSON.

You can see how we can scrape a website simply by adding a new source and specifying some CSS selectors. There are two ways to do this: first, by creating a class under the Sources directory that implements SourceInterface; second, by creating a new record in the sources table in our database. There are other features you could add, such as a console command to scrape the sources (see the sketch below) and a cron job to run that command on a schedule. You could also save the scraped data into the posts table, taking care not to insert duplicate records.
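
As a starting point, here is a minimal sketch of such a command. The class name and the findAll() call on SourceRepository are my own choices for illustration, not part of the article’s code:

<?php

namespace App\Command;

use App\Repository\SourceRepository;
use App\Scraper\Scraper;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

class ScrapeSourcesCommand extends Command
{
    protected static $defaultName = 'app:scrape-sources';

    private Scraper $scraper;
    private SourceRepository $sources;

    public function __construct(Scraper $scraper, SourceRepository $sources)
    {
        parent::__construct();
        $this->scraper = $scraper;
        $this->sources = $sources;
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        // Scrape every source stored in the database and report the counts
        foreach ($this->sources->findAll() as $source) {
            $posts = $this->scraper->scrap($source);
            $output->writeln(sprintf('%s: %d posts', $source->getName(), count($posts)));
        }

        return Command::SUCCESS;
    }
}

A cron entry such as 0 * * * * php /path/to/project/bin/console app:scrape-sources would then run it every hour.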

Summary

In this article, we built a dynamic, extendable web scraper in Symfony. Thanks to symfony/panther, it can handle single-page applications as well as plain HTML websites, and new sources can be added either as classes implementing SourceInterface or as database records managed through an EasyAdmin UI.