Setting Up A Sample Scrapy Pipeline For CouchDB

As I was surprised that there is not much (recent) information out there on using CouchDB (which appears to me to be a far underestimated DB) with Scrapy, this post demonstrates the basic steps it takes to set up a Scrapy pipeline for CouchDB that directly imports all scraped items via the python-cloudant CouchDB client.

The following entries are necessary in the Scrapy settings.py file:

import os

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "sample.pipelines.CouchDBPipeline": 300,  # value depends on whether there are other pipelines
}

COUCHDB_URI = "http://" + os.getenv("HOST_ADDRESS") + ":5984"
COUCHDB_DB = "sample"
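One caveat with the settings above: os.getenv returns None when the variable is unset, and concatenating None to a string raises a TypeError. A minimal sketch with a fallback default (the "localhost" default is only an assumption for this sketch):

```python
import os

# os.getenv returns None when HOST_ADDRESS is unset, which would make the
# string concatenation in settings.py raise a TypeError; a fallback default
# avoids that ("localhost" is only an assumption here)
host = os.getenv("HOST_ADDRESS", "localhost")
COUCHDB_URI = "http://" + host + ":5984"
print(COUCHDB_URI)
```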

The pipelines.py file is where you do most of the DB-related code; it is mostly adapted from Scrapy's MongoDB example:

import logging
import uuid

from cloudant.client import CouchDB
from itemadapter import ItemAdapter


class CouchDBPipeline:

    def __init__(self, couchdb_uri, couchdb_db):
        self.couchdb_uri = couchdb_uri
        self.couchdb_db = couchdb_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            couchdb_uri=crawler.settings.get("COUCHDB_URI"),
            couchdb_db=crawler.settings.get("COUCHDB_DB", "default"),
        )

    def open_spider(self, spider):
        # credentials are hardcoded here for brevity only
        self.client = CouchDB("admin", "password", url=self.couchdb_uri, connect=True)
        self.db = self.client[self.couchdb_db]

    def close_spider(self, spider):
        self.client.disconnect()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # only if you want to do some cleanup ... can also happen earlier
        if adapter["name"]:
            adapter["name"] = list(set(map(str.strip, filter(None, adapter["name"]))))
        if adapter["year"] and adapter["year"][0]:
            adapter["year"] = list(set([int(adapter["year"][0])]))
        if adapter["tags"]:
            adapter["tags"] = list(set(map(str.lower, adapter["tags"])))
        # ... and so on

        # Either:
        # if it is NOT a partitioned db, you don't need to specify an '_id'
        # (CouchDB generates one) and can store the item just as it is:
        # self.db.create_document(adapter.asdict())
        # return item

        # Or:
        # if it IS a partitioned db, make an '_id' with a suitable partition
        # key (here the db/project name 'sample'): everything before the first
        # colon is considered the partition key by CouchDB
        doc = adapter.asdict()
        try:
            doc["_id"] = f"sample:{adapter['name'][0]}:{adapter['year'][0]}:{uuid.uuid4()}"
        except IndexError as e:
            logging.error(f'{{"action": "Generate ID", "key": ["name", "year"], "error": "{e}"}}')
            doc["_id"] = f"sample:na:na:{uuid.uuid4()}"

        self.db.create_document(doc)
        return item
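The cleanup and the partitioned-'_id' logic can be tried out without Scrapy or a running CouchDB; here is a sketch that uses a plain dict in place of the ItemAdapter (the sample values are made up):

```python
import uuid

# stand-in for the scraped item: a plain dict instead of an ItemAdapter,
# so this runs without Scrapy or CouchDB (values are made up)
adapter = {
    "name": [" Homer S. ", "Homer S.", ""],
    "year": ["1989"],
    "tags": ["Cartoon", "TV"],
}

# same cleanup as in process_item above
if adapter["name"]:
    adapter["name"] = list(set(map(str.strip, filter(None, adapter["name"]))))
if adapter["year"] and adapter["year"][0]:
    adapter["year"] = list(set([int(adapter["year"][0])]))
if adapter["tags"]:
    adapter["tags"] = list(set(map(str.lower, adapter["tags"])))

# partitioned-db '_id': everything before the first colon is the partition key
doc_id = f"sample:{adapter['name'][0]}:{adapter['year'][0]}:{uuid.uuid4()}"
print(doc_id)
```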

Scrapy uses so-called items for scraped content; the following is based mostly on here and here:

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sample:
    name: Optional[List[str]] = field(default=None)  # the itemloader collects values as lists
    year: Optional[List[int]] = field(default=None)
    tags: Optional[List[str]] = field(default=None)
    # ... and so on
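Since Scrapy's ItemLoader collects every field as a list of raw values, the item in practice holds lists. A stdlib-only sketch of such a dataclass item (the SampleItem name and the values are made up; asdict yields a plain dict, ready to be stored as a CouchDB document):

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional

# stand-in for the Sample item above (assumption: fields hold the lists
# that Scrapy's ItemLoader collects)
@dataclass
class SampleItem:
    name: Optional[List[str]] = field(default=None)
    year: Optional[List[int]] = field(default=None)
    tags: Optional[List[str]] = field(default=None)

item = SampleItem(name=["homer s."], year=[1989], tags=["cartoon"])
doc = asdict(item)  # plain dict, ready for CouchDB
print(doc)
```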

This is only an excerpt and only one (very simplified) way to do it; it is mostly based on the Scrapy tutorial and on Scrapy's (advanced) itemloaders:

import scrapy
from scrapy.loader import ItemLoader

from sample.items import Sample


class SampleSpider(scrapy.Spider):
    name = "sample"

    def start_requests(self):
        # ...
        ...

    def parse(self, response, sample_site=None):
        for selector in response.xpath(sample_site.selector):
            l = ItemLoader(item=Sample(), response=response, selector=selector)
            for xpath in sample_site.xpaths:
                l.add_xpath(xpath.field, xpath.xpath)  # field name attribute is assumed here
            yield l.load_item()


As mentioned above, this is only a (basic) example without much further explanation; more information can be found in the links below.

Try out CouchDB ... it is worth it, especially if your data is somehow tree-structured/hierarchically structured. Don't be confused by e.g. the JavaScript-based views with map & reduce: there is also /db/_find, which does not require views (although they are very helpful) and which should feel familiar if you know e.g. the Elasticsearch query language/HTTP API.

This is a sample query which you can POST via HTTP, or you can open CouchDB's web-based interface Fauxton and run it as a Mango Query there:

# e.g.
{
   "selector": {
      "_id": {
         "$gt": null
      },
      "name": {
         "$eq": [
            "homer s."
         ]
      }
   }
}

# or
{
   "selector": {
      "_id": {
         "$gt": null
      },
      "tags": {
         "$all": [
            ...
         ]
      },
      "year": {
         "$gte": [
            ...
         ]
      }
   }
}
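For completeness, such a query can also be built in Python and POSTed to the /{db}/_find endpoint; a sketch of the first query above (host and credentials are placeholders, and the POST itself is left commented out since it needs a running CouchDB):

```python
import json

# the first Mango query above, built as a Python dict
query = {
    "selector": {
        "_id": {"$gt": None},          # serializes to "$gt": null
        "name": {"$eq": ["homer s."]},
    }
}

body = json.dumps(query)
print(body)

# it would then be POSTed to CouchDB's _find endpoint, e.g. with requests
# (placeholder host/credentials; requires a running CouchDB):
# requests.post("http://localhost:5984/sample/_find",
#               auth=("admin", "password"),
#               headers={"Content-Type": "application/json"},
#               data=body)
```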

Further Information:



CouchDB Python:

CouchDB Search:

"Old" Scrapy Pipeline Examples On Github: