Scraping with Pydantic and Scrapy

by James Malcolm

One of the primary purposes of web scraping is to collate data from websites. This data is often transformed and stored in a database. This post will walk you through the three main components of a solid web scraping solution using Scrapy, Pydantic, and SQLModel.

Let's start by introducing the three Python packages we'll be using.

What is Scrapy?

The most common route into web scraping with Python is the combination of Requests and BeautifulSoup. Requests + BeautifulSoup works really well for simple crawling applications and has a relatively low learning curve compared to Scrapy.

When crawling many websites or doing an exhaustive crawl of a single website, Scrapy becomes the best fit. Scrapy runs asynchronously, meaning website requests can run concurrently, and it has built-in link-following features for navigating through webpages.
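
To give a sense of what that looks like, here is a minimal spider sketch based on the quotes.toscrape.com example from the official Scrapy tutorial (the site and selectors come from that tutorial, not from this project):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract data from the current page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Built-in link following: queue up the next page with the same callback
        yield from response.follow_all(css="li.next a", callback=self.parse)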

What is Pydantic?

Pydantic is a Python package that helps ensure the format of data is consistent. It is powered by type hints and enforces data types, which helps ensure that data is exactly how you expect it to be.

If we look at our pipeline from webpage to SQL table, Pydantic helps when we've extracted a webpage and want to ensure the format of the data is consistent.

On top of ensuring the format of data, Pydantic also enables custom validators to check the accuracy of the data we get, e.g. that URLs begin with "https://".
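
As a sketch of what such a validator might look like (the model and field names below are purely illustrative, and the decorator shown is Pydantic v2's field_validator; on Pydantic v1 the equivalent is @validator):

from pydantic import BaseModel, field_validator


class Page(BaseModel):
    url: str
    title: str

    @field_validator("url")
    @classmethod
    def url_must_be_https(cls, value: str) -> str:
        # Reject any URL that isn't served over HTTPS
        if not value.startswith("https://"):
            raise ValueError("url must begin with https://")
        return value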

What is SQLModel?

SQLModel is a Python package built on top of the popular SQLAlchemy. It is designed to simplify the process of interacting with databases from Python applications.

It comes from the creator of FastAPI and is itself built on Pydantic. The main use of SQLModel within this project is to minimise rewriting code, since we're already using Pydantic.
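
As a quick illustration of that overlap (the model and database URL below are examples, not part of the project), a single SQLModel class serves as both a Pydantic model and a database table definition:

from typing import Optional

from sqlmodel import Field, SQLModel, create_engine


class Quote(SQLModel, table=True):
    # table=True makes this class a database table as well as a Pydantic model
    id: Optional[int] = Field(default=None, primary_key=True)
    text: str
    author: str


# Example engine pointing at a local SQLite file
engine = create_engine("sqlite:///quotes.db")
SQLModel.metadata.create_all(engine)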

Let's see how all three of these packages fit together by using a simple example.

Creating an initial spider

We'll follow the Scrapy tutorial closely, calling out the changes needed for this workflow.

Since building the spider and scraping the data isn't the focus of this post, the Scrapy tutorial covers those details. You start a new project by running the following command in your terminal:

scrapy startproject projectname

This command will create a directory projectname containing the following files:

projectname/
    scrapy.cfg            # deploy configuration file

    projectname/          # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

For this post, we are mainly interested in the items.py and pipelines.py files.

Scrapy + Pydantic Items

As mentioned in the introduction, scraping is about taking unstructured data and making it structured - this is the goal of Scrapy Items.

Pydantic helps validate the data types and structure we get from our spider to ensure consistency.

We can define a Scrapy Item like this:

import scrapy


class ScrapeItem(scrapy.Item):
    field_one = scrapy.Field()

# Or alternatively, we can use dataclasses

from dataclasses import dataclass

@dataclass
class CustomItem:
    one_field: str

To make it a Pydantic object, we change the syntax slightly so that our item class inherits from BaseModel.

from pydantic import BaseModel


class ScrapeItem(BaseModel):
    field_one: str

There are no other changes needed! We are now successfully using Pydantic.

Pipelines + SQLModel

Now we get to the fun part: actually saving our data. To do this, we'll turn to Scrapy's built-in item pipelines. Pipelines are great for cleansing data, checking for duplicates, and saving data.

There are a few gotchas in this process, which we'll handle as we go.

Scrapy pipelines.py

open_spider and close_spider

The first is the pair open_spider and close_spider. We want to create and close our database connection in these two methods.

from sqlmodel import create_engine, Session


class ItemPipeline:
    def open_spider(self, spider):
        # connection_details is assumed to hold your database URL,
        # e.g. {"url": "sqlite:///scrape.db"}
        self.engine = create_engine(**connection_details)
        self.session = Session(self.engine)

    def close_spider(self, spider):
        self.session.close()

We only ever want to create and close the database connection in these two methods, as we want to keep the connection open while the pipeline is processing items.

Below, we add items using the process_item method, but we want to persist the connection between items. If we don't, then due to the concurrency Scrapy provides, we'll quickly find ourselves trying to add items to an already closed connection.
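
One detail worth calling out: the database tables need to exist before the first item arrives. With SQLModel, open_spider is a natural place to create them. A sketch, with the connection details again left as a placeholder:

from sqlmodel import SQLModel, Session, create_engine


class ItemPipeline:
    def open_spider(self, spider):
        self.engine = create_engine(**connection_details)
        # Create any missing tables for models defined with table=True
        SQLModel.metadata.create_all(self.engine)
        self.session = Session(self.engine)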

process_item

Let's now dig into process_item. It is called for every item, and it must return an item or raise a DropItem exception.

In this method, we can check whether the item we've received is the one we care about with if isinstance(item, ScrapeItem). For our current use case this is redundant, as we only have one item type. However, in larger scraping applications you may have multiple items.

Since item is now the SQLModel (and therefore Pydantic) object we defined in items.py, all we need to do is call session.add(item) and session.commit() to add the item to our session and commit it to the database.

from sqlmodel import create_engine, Session
from items import ScrapeItem


class ItemPipeline:
    def open_spider(self, spider):
        ...

    def process_item(self, item, spider):
        if isinstance(item, ScrapeItem):
            # Add the validated item to the open session and write it to the database
            self.session.add(item)
            self.session.commit()
        # Always return the item so any later pipelines still receive it
        return item

    def close_spider(self, spider):
        ...

The ease of adding items to our database shows the value of using SQLModel and Pydantic within a scraping pipeline. Note that we don't use SQLModel's with block here, as that would close the session after each item.
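
As noted above, the item here is a SQLModel as well as a Pydantic model, which is what lets session.add(item) work directly. A sketch of what that items.py definition might look like (the id primary key below is an assumption added for illustration):

from typing import Optional

from sqlmodel import Field, SQLModel


class ScrapeItem(SQLModel, table=True):
    # table=True turns this Pydantic-style model into a database table as well
    id: Optional[int] = Field(default=None, primary_key=True)
    field_one: str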

There we have it: a simple Scrapy spider taking advantage of Pydantic and SQLModel to get high-quality data. Of course, there are many more features to explore, such as Pydantic's validators, but this provides the bones for connecting these three wonderful packages together.

If you need help with your web scraping, we're always willing to lend a hand. We have deep experience scraping the most complex websites and can help with projects of any size. Feel free to reach out anytime for a no-obligation chat.
