Setting Up A Sample Scrapy Pipeline For CouchDB
I was surprised that there is not much (recent) information out there on using CouchDB (which strikes me as a far underrated database) with Scrapy, so this post demonstrates the basic steps it takes to set up a Scrapy pipeline for CouchDB that imports all scraped items directly via the python-cloudant CouchDB client.
Settings.py
The following entries are necessary in the Scrapy settings.py file (note that the URI line uses os.getenv, so settings.py also needs an import os at the top):
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
"sample.pipelines.CouchDBPipeline": 300, # value depends if there other pipelines
}
COUCHDB_URI = "http://" + os.getenv("HOST_ADDRESS") + ":5984"
COUCHDB_DB = "sample"
Pipelines.py
The pipelines.py file is where most of the db-related code lives; it is mostly adapted from Scrapy's MongoDB example:
import logging
import os
import uuid

from cloudant.client import CouchDB
from itemadapter import ItemAdapter
class CouchDBPipeline:
    def __init__(self, couchdb_uri, couchdb_db):
        self.couchdb_uri = couchdb_uri
        self.couchdb_db = couchdb_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            couchdb_uri=crawler.settings.get("COUCHDB_URI"),
            couchdb_db=crawler.settings.get("COUCHDB_DB", "default"),
        )

    def open_spider(self, spider):
        # use the URI configured in settings.py instead of rebuilding it here
        self.client = CouchDB("admin", "password", url=self.couchdb_uri, connect=True)
        self.db = self.client[self.couchdb_db]

    def close_spider(self, spider):
        self.client.disconnect()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # only if you want to do some cleanup ... this can also happen earlier
        if adapter.get("name"):
            adapter["name"] = list(set(map(str.strip, filter(None, adapter["name"]))))
        if adapter.get("year") and adapter["year"][0]:
            adapter["year"] = list(set([int(adapter["year"][0])]))
        if adapter.get("tags"):
            adapter["tags"] = list(set(map(str.lower, adapter["tags"])))
        # ... and so on

        # Either: if it is NOT a partitioned db (where you don't need to
        # specify a partition key), you can store the item just as it is:
        #
        #     self.db.create_document(adapter.asdict())
        #     return item

        # Or: if it IS a partitioned db, build an '_id' with a suitable
        # partition key (here the db/project name 'sample') -- CouchDB treats
        # everything before the first colon as the partition key
        doc = adapter.asdict()
        try:
            doc["_id"] = f"sample:{adapter['name'][0]}:{adapter['year'][0]}:{uuid.uuid4()}"
        except (IndexError, TypeError) as e:  # empty list or missing field
            logging.error(f'{{"action": "Generate ID", "key": ["name", "year"], "value": [{adapter.get("name")}, {adapter.get("year")}], "error": "{e}"}}')
            doc["_id"] = f"sample:na:na:{uuid.uuid4()}"
        self.db.create_document(doc)
        return item
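To make the partitioned '_id' scheme concrete, here is a small self-contained sketch (the build_id helper and the sample values are illustrative, not part of the pipeline above) showing how such an id is built and how the partition key is read back out of it:

```python
import uuid

def build_id(partition, name, year):
    # CouchDB treats everything before the first colon as the partition key
    return f"{partition}:{name}:{year}:{uuid.uuid4()}"

doc_id = build_id("sample", "homer s.", 1989)
partition_key = doc_id.split(":", 1)[0]
print(partition_key)  # -> sample
```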
Items.py
Scrapy uses so-called items for scraped content; the following items.py is based mostly on the Scrapy documentation on items and on declaring item types (see the links below):
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Sample:
    name: Optional[str] = field(default=None)
    year: Optional[int] = field(default=None)
    tags: Optional[List[str]] = field(default=None)
# ... and so on
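Since these are plain dataclasses, they can be tried out without Scrapy at all; a quick sketch (field names from the class above, the values are made up):

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional

@dataclass
class Sample:
    name: Optional[str] = field(default=None)
    year: Optional[int] = field(default=None)
    tags: Optional[List[str]] = field(default=None)

item = Sample(name="homer s.", year=1989, tags=["series", "comedy"])
# asdict gives exactly the kind of plain dict that ends up in CouchDB
print(asdict(item))
```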
Sample_spider.py
This is only an excerpt showing one (very simplified) way to do it; it is based mostly on the Scrapy tutorial and on Scrapy's (advanced) item loaders:
import scrapy
from scrapy.loader import ItemLoader

from sample.items import Sample

class SampleSpider(scrapy.Spider):
    name = "sample"

    def start_requests(self):
        # ...
        ...

    def parse(self, response, sample_site=None):
        for selector in response.xpath(sample_site.selector):
            loader = ItemLoader(item=Sample(), response=response, selector=selector)
            for xpath in sample_site.xpaths:
                loader.add_xpath(xpath.name, xpath.xpath)
            yield loader.load_item()
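The loader loop above boils down to mapping a list of (field name, path) pairs onto each selected node. Stripped of Scrapy, the idea can be sketched with the standard library's limited XPath support (the XML snippet and the paths are made up for illustration):

```python
import xml.etree.ElementTree as ET

xml = """
<items>
  <item><name> homer s. </name><year>1989</year></item>
  <item><name>marge b.</name><year>1989</year></item>
</items>
"""

# (field name, path) pairs, analogous to sample_site.xpaths above
paths = [("name", "name"), ("year", "year")]

root = ET.fromstring(xml)
items = []
for node in root.findall("item"):      # analogous to response.xpath(...)
    item = {}
    for field, path in paths:          # analogous to loader.add_xpath(...)
        el = node.find(path)
        item[field] = el.text.strip() if el is not None and el.text else None
    items.append(item)

print(items)
```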
Summary
As mentioned above, this is only a basic example without much further explanation; more information can be found in the links below.
Try out CouchDB ... it is worth it, especially if your data is somehow tree-structured or hierarchical. Don't be put off by the JavaScript-based views
with map & reduce: there is also db/_find
, which does not require views (although they are very helpful) and which should feel familiar if you know e.g. the Elasticsearch query language/HTTP API.
These are sample queries which you can POST
via HTTP
to db/_find, or run as a Mango Query in CouchDB's web-based interface Fauxton:
# e.g.
{
"selector": {
"_id": {
"$gt": null
},
"name": {
"$eq": [
"homer s."
]
}
}
}
# or
{
"selector": {
"_id": {
"$gt": null
},
"tags": {
"$all": [
"series",
"comedy",
"animation"
]
},
"year": {
"$gte": [
1989
]
}
}
}
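Such a selector is just JSON, so it is also easy to build and POST from Python using only the standard library; a sketch (credentials, host, and db name are placeholders -- with python-cloudant you would typically use its query helpers instead):

```python
import json
import urllib.request

selector = {
    "selector": {
        "_id": {"$gt": None},  # None is serialized as JSON null
        "tags": {"$all": ["series", "comedy", "animation"]},
    },
}

body = json.dumps(selector).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:5984/sample/_find",  # host and db name are placeholders
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would run the query against a live CouchDB
print(body.decode())
```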
Further Information:
Scrapy:
https://scrapy.org
https://docs.scrapy.org/en/latest/topics/items.html
https://docs.scrapy.org/en/latest/topics/items.html#topics-items-declaring
https://docs.scrapy.org/en/latest/topics/item-pipeline.html
https://docs.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-mongodb
https://docs.scrapy.org/en/latest/topics/loaders.html
https://docs.scrapy.org/en/latest/topics/loaders.html#working-with-dataclass-items
CouchDB:
https://couchdb.apache.org
https://couchdb.apache.org/fauxton-visual-guide/index.html
https://docs.couchdb.org/en/latest/ddocs/views/intro.html
https://docs.couchdb.org/en/latest/api/database/find.html
CouchDB Python:
https://github.com/cloudant/python-cloudant
https://python-cloudant.readthedocs.io/en/latest/getting_started.html
"Old" Scrapy Pipeline Examples On Github:
https://github.com/noplay/scrapy-couchdb
https://github.com/martinsbalodis/scrapy-couchdb