Setting Up A Sample Scrapy Pipeline For CouchDB
I was surprised that there is not much (recent) information out there on using CouchDB (which strikes me as a far underrated database) with Scrapy, so this post demonstrates the basic steps it takes to set up a Scrapy pipeline for CouchDB that imports all scraped items directly via the python-cloudant CouchDB client.
Settings.py
The following entries are necessary in the Scrapy settings.py file (note that the URI line uses os.getenv, so settings.py also needs an import os at the top):
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
"sample.pipelines.CouchDBPipeline": 300, # value depends if there other pipelines
}
COUCHDB_URI = "http://" + os.getenv("HOST_ADDRESS") + ":5984"
COUCHDB_DB = "sample"
Pipelines.py
The pipelines.py file is where most of the db-related code lives; it is mostly adapted from Scrapy's MongoDB example:
import logging
import os
import uuid

from cloudant.client import CouchDB
from itemadapter import ItemAdapter
class CouchDBPipeline:
    def __init__(self, couchdb_uri, couchdb_db):
        self.couchdb_uri = couchdb_uri
        self.couchdb_db = couchdb_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            couchdb_uri=crawler.settings.get("COUCHDB_URI"),
            couchdb_db=crawler.settings.get("COUCHDB_DB", "default"),
        )

    def open_spider(self, spider):
        # use the URI configured in settings.py instead of rebuilding it here
        self.client = CouchDB("admin", "password", url=self.couchdb_uri, connect=True)
        self.db = self.client[self.couchdb_db]

    def close_spider(self, spider):
        self.client.disconnect()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # only if you want to do some cleanup ... this can also happen earlier
        if adapter.get("name"):
            adapter["name"] = list(set(map(str.strip, filter(None, adapter["name"]))))
        if adapter.get("year") and adapter["year"][0]:
            adapter["year"] = list(set([int(adapter["year"][0])]))
        if adapter.get("tags"):
            adapter["tags"] = list(set(map(str.lower, adapter["tags"])))
        # ... and so on

        # Either: if it is NOT a partitioned db (where you don't need to
        # specify a partition key), you can store the item just as it is:
        #
        #     self.db.create_document(adapter.asdict())
        #     return item

        # Or: if it IS a partitioned db, build an '_id' with a suitable
        # partition key (here the db/project name 'sample') -- CouchDB treats
        # everything before the first colon as the partition key
        doc = adapter.asdict()
        try:
            doc["_id"] = f"sample:{adapter['name'][0]}:{adapter['year'][0]}:{uuid.uuid4()}"
        except (IndexError, TypeError) as e:  # empty list or missing field
            logging.error(f'{{"action": "Generate ID", "key": ["name", "year"], "value": [{adapter.get("name")}, {adapter.get("year")}], "error": "{e}"}}')
            doc["_id"] = f"sample:na:na:{uuid.uuid4()}"
        self.db.create_document(doc)
        return item
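To make the partitioned '_id' scheme concrete, here is a small self-contained sketch (the build_id helper and the sample values are illustrative, not part of the pipeline above) showing how such an id is built and how the partition key is read back out of it:

```python
import uuid

def build_id(partition, name, year):
    # CouchDB treats everything before the first colon as the partition key
    return f"{partition}:{name}:{year}:{uuid.uuid4()}"

doc_id = build_id("sample", "homer s.", 1989)
partition_key = doc_id.split(":", 1)[0]
print(partition_key)  # -> sample
```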
Items.py
Scrapy uses so-called items for scraped content; the following items.py is based mostly on the Scrapy documentation on items and on declaring item types (see the links below):
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Sample:
    name: Optional[str] = field(default=None)
    year: Optional[int] = field(default=None)
    tags: Optional[List[str]] = field(default=None)
# ... and so on
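Since these are plain dataclasses, they can be tried out without Scrapy at all; a quick sketch (field names from the class above, the values are made up):

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional

@dataclass
class Sample:
    name: Optional[str] = field(default=None)
    year: Optional[int] = field(default=None)
    tags: Optional[List[str]] = field(default=None)

item = Sample(name="homer s.", year=1989, tags=["series", "comedy"])
# asdict gives exactly the kind of plain dict that ends up in CouchDB
print(asdict(item))
```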
Sample_spider.py
This is only an excerpt showing one (very simplified) way to do it; it is based mostly on the Scrapy tutorial and on Scrapy's (advanced) item loaders:
import scrapy
from scrapy.loader import ItemLoader

from sample.items import Sample

class SampleSpider(scrapy.Spider):
    name = "sample"

    def start_requests(self):
        # ...
        ...

    def parse(self, response, sample_site=None):
        for selector in response.xpath(sample_site.selector):
            loader = ItemLoader(item=Sample(), response=response, selector=selector)
            for xpath in sample_site.xpaths:
                loader.add_xpath(xpath.name, xpath.xpath)
            yield loader.load_item()
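The loader loop above boils down to mapping a list of (field name, path) pairs onto each selected node. Stripped of Scrapy, the idea can be sketched with the standard library's limited XPath support (the XML snippet and the paths are made up for illustration):

```python
import xml.etree.ElementTree as ET

xml = """
<items>
  <item><name> homer s. </name><year>1989</year></item>
  <item><name>marge b.</name><year>1989</year></item>
</items>
"""

# (field name, path) pairs, analogous to sample_site.xpaths above
paths = [("name", "name"), ("year", "year")]

root = ET.fromstring(xml)
items = []
for node in root.findall("item"):      # analogous to response.xpath(...)
    item = {}
    for field, path in paths:          # analogous to loader.add_xpath(...)
        el = node.find(path)
        item[field] = el.text.strip() if el is not None and el.text else None
    items.append(item)

print(items)
```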
Summary
As mentioned above, this is only a basic example without much further explanation; more information can be found in the links below.
Try out CouchDB ... it is worth it, especially if your data is somehow tree-structured or hierarchical. Don't be put off by the JavaScript-based views
with map & reduce: there is also db/_find
, which does not require views (although they are very helpful) and which should feel familiar if you know e.g. the Elasticsearch query language/HTTP API.
These are sample queries which you can POST
via HTTP
to db/_find, or run as a Mango Query in CouchDB's web-based interface Fauxton:
# e.g.
{
"selector": {
"_id": {
"$gt": null
},
"name": {
"$eq": [
"homer s."
]
}
}
}
# or
{
"selector": {
"_id": {
"$gt": null
},
"tags": {
"$all": [
"series",
"comedy",
"animation"
]
},
"year": {
"$gte": [
1989
]
}
}
}
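Such a selector is just JSON, so it is also easy to build and POST from Python using only the standard library; a sketch (credentials, host, and db name are placeholders -- with python-cloudant you would typically use its query helpers instead):

```python
import json
import urllib.request

selector = {
    "selector": {
        "_id": {"$gt": None},  # None is serialized as JSON null
        "tags": {"$all": ["series", "comedy", "animation"]},
    },
}

body = json.dumps(selector).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:5984/sample/_find",  # host and db name are placeholders
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would run the query against a live CouchDB
print(body.decode())
```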
Further Information:
Scrapy:
https://scrapy.org
https://docs.scrapy.org/en/latest/topics/items.html
https://docs.scrapy.org/en/latest/topics/items.html#topics-items-declaring
https://docs.scrapy.org/en/latest/topics/item-pipeline.html
https://docs.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-mongodb
https://docs.scrapy.org/en/latest/topics/loaders.html
https://docs.scrapy.org/en/latest/topics/loaders.html#working-with-dataclass-items
CouchDB:
https://couchdb.apache.org
https://couchdb.apache.org/fauxton-visual-guide/index.html
https://docs.couchdb.org/en/latest/ddocs/views/intro.html
https://docs.couchdb.org/en/latest/api/database/find.html
CouchDB Python:
https://github.com/cloudant/python-cloudant
https://python-cloudant.readthedocs.io/en/latest/getting_started.html
"Old" Scrapy Pipeline Examples On Github:
https://github.com/noplay/scrapy-couchdb
https://github.com/martinsbalodis/scrapy-couchdb