Suppose I am scraping data and some of the scraped fields are "" (empty, meaning no value), and I don't want a row with "" in it. How can I do that?
example:

field1      field2     field3
my place    blurred    trying
house       fan
door        mouse      hat
What I want is for my program to skip writing the entire second row to the CSV, because field3 is empty.
You can write and configure an Item Pipeline, following the instructions from [the scrapy docs], and drop items with a test on their values.
Add this to your pipeline.py file:
from scrapy.exceptions import DropItem

class DropIfEmptyFieldPipeline(object):

    def process_item(self, item, spider):
        # to test whether only "job_id" is empty, change to:
        # if not item["job_id"]:
        if not all(item.values()):
            raise DropItem()
        else:
            return item
And set this in your settings.py (adapt to your project's name):

ITEM_PIPELINES = {
    'myproject.pipeline.DropIfEmptyFieldPipeline': 300,
}
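As a quick sanity check (the dicts below just mirror the example rows from the question; they are not part of the original answer), all(item.values()) is falsy for the row with the empty field3, so that row gets dropped while complete rows pass through:

# hypothetical plain dicts standing in for scraped items, built from the question's example rows
row_complete = {"field1": "my place", "field2": "blurred", "field3": "trying"}
row_with_gap = {"field1": "house", "field2": "fan", "field3": ""}

print(all(row_complete.values()))  # True  -> process_item returns the item and it reaches the CSV
print(all(row_with_gap.values()))  # False -> DropItem() is raised and the row is skipped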
Edit, after the OP's comment about testing for "Nurse":
from scrapy.exceptions import DropItem
import re

class DropIfEmptyFieldPipeline(object):

    # case-insensitive search for the string "nurse"
    REGEX_NURSE = re.compile(r'nurse', re.IGNORECASE)

    def process_item(self, item, spider):
        # use .search() and not .match() to test for a substring match
        if not self.REGEX_NURSE.search(item["job_id"]):
            raise DropItem()
        else:
            return item
Related
I have a Scrapy spider that scrapes details from a website. It works well for fixed Item fields. It also extracts dynamic fields from the website, but it does not add all of the extracted dynamic fields to the output CSV file.
To export records to CSV, I'm using CsvItemExporter.
Below is the Item class for the dynamic fields:
class MortgageInfoItem(scrapy.Item):

    def __setitem__(self, key, value):
        if key not in self.fields:
            if "date" in key:
                self.fields[key] = scrapy.Field(serializer=serialize_date)
            else:
                self.fields[key] = scrapy.Field(serializer=serialize_text)
        self._values[key] = value
The fields may vary for each record. For example, the first record has owner1 and owner2, while the next record has owner1, owner2 and owner3, and so on.
So in the end I need a CSV file that has columns for all the owner information (e.g. owner1, owner2, owner3, ...).
Below is the pipeline class for the CSV exporter:
class MultiCSVItemPipeline(object):

    CSVDir = 'output' + settings.DIRECTORY
    file_name = "MortGage_info"
    max_columns = 0

    def __init__(self):
        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)

    def spider_opened(self, spider):
        self.file = open(self.CSVDir + self.file_name + '.csv', 'w+b')
        self.exporters = CsvItemExporter(self.file)
        self.exporters.start_exporting()

    def spider_closed(self, spider):
        self.exporters.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporters.export_item(item)
        return item
Please help me export all the fields to the CSV file using CsvItemExporter. Thanks in advance.
I'm trying to build a function (clean_keyboard) to use in my extended ItemLoader class.
It should filter and clean data in the extended Item class when 'category' == 'Notebook'.
I have tested it without the filter for 'Notebook' (if ProductItem['category'] == 'Notebook':) and the processor/method works fine without it. But after inserting this piece of code for filtering, I get the TypeError mentioned in the title. See the code below.
### processor method for cleaning data with the ItemLoader; the Item and ItemLoader classes are extended below
def clean_keyboard(pattern):
    keyboard_dict = {'deutsch': 'DE', 'US-QWERTY': 'US', '': 'DE'}
    if ProductItem['category'] == 'Notebook':  # <-- TypeError when adding the category filter; without it, it works fine
        if pattern in keyboard_dict:
            return keyboard_dict[pattern]
        else:
            return pattern
class ProductItem(scrapy.Item):
    category = scrapy.Field()
    keyboard = scrapy.Field()

class SpiderItemLoader(ItemLoader):
    default_item_class = ProductItem
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()
    keyboard_out = MapCompose(clean_keyboard)
### parse method in the spider to get the data, using the extended SpiderItemLoader class
def parse_item(self, response):
    l = SpiderItemLoader(response=response)
    l.add_xpath('keyboard', '//*[@class="short-description"]/p/strong[text()="keyboard"]/following-sibling::text()')
    l.add_xpath('category', '//*[@class="short-description"]/p/strong[text()="category"]/following-sibling::text()')
    return l.load_item()
As Daniel commented, the failing line makes no sense. You want to inspect the 'category' field of the item being processed, but your clean_keyboard function has no access to it. ProductItem is a class shared by all items, not a specific item.
Item loader processors have no access to the item, only to the specific field they are processing.
I recommend that you use an item pipeline instead of an item loader processor to implement the logic of your clean_keyboard function.
Indeed, thank you both for helping me understand this class-access issue (ItemLoader vs. Item Pipeline).
Since I do have access to the whole item in an Item Pipeline, I was able to solve the filtering there, accessing the other item field. See my tested code with the solution below.
# Configure item pipelines in settings.py
ITEM_PIPELINES = {
    'tutorial.pipelines.DataCleaningPipeline': 300,
}
# Pipeline in pipelines.py
class DataCleaningPipeline(object):

    def process_item(self, item, spider):
        keyboard_dict = {'deutsch': 'DE', 'US-QWERTY': 'US', '': 'DE', 'QWERTZ': 'DE'}
        dict_key = item.get('keyboard')
        category = item.get('category')
        if 'Notebook' in category and dict_key in keyboard_dict:
            item['keyboard'] = keyboard_dict[dict_key]
            return item
        else:
            return item
I'm scraping reviews from MOOCs like this one.
From there I'm getting all the course details (5 fields) and another 6 fields from each review itself.
This is the code I have for the course details:
def parse_reviews(self, response):
    l = ItemLoader(item=MoocsItem(), response=response)
    l.add_xpath('course_title', '//*[@class="course-header-ng__main-info__name__title"]//text()')
    l.add_xpath('course_description', '//*[@class="course-info__description"]//p/text()')
    l.add_xpath('course_instructors', '//*[@class="course-info__instructors__names"]//text()')
    l.add_xpath('course_key_concepts', '//*[@class="key-concepts__labels"]//text()')
    l.add_value('course_link', response.url)
    return l.load_item()
Now I want to include the review details, another 5 fields for each review.
Since the course data is common to all the reviews, I want to store it in a different file and use the course name/id to relate the data afterwards.
This is the code I have for the review items:
for review in response.xpath('//*[@class="review-body"]'):
    review_body = review.xpath('.//div[@class="review-body__content"]//text()').extract()
    course_stage = review.xpath('.//*[@class="review-body-info__course-stage--completed"]//text()').extract()
    user_name = review.xpath('.//*[@class="review-body__username"]//text()').extract()
    review_date = review.xpath('.//*[@itemprop="datePublished"]/@datetime').extract()
    score = review.xpath('.//*[@class="sr-only"]//text()').extract()
I tried a temporary solution, returning all the items in each case, but it is not working either:
def parse_reviews(self, response):
    # print response.body
    l = ItemLoader(item=MoocsItem(), response=response)
    # l = MyItemLoader(selector=response)
    l.add_xpath('course_title', '//*[@class="course-header-ng__main-info__name__title"]//text()')
    l.add_xpath('course_description', '//*[@class="course-info__description"]//p/text()')
    l.add_xpath('course_instructors', '//*[@class="course-info__instructors__names"]//text()')
    l.add_xpath('course_key_concepts', '//*[@class="key-concepts__labels"]//text()')
    l.add_value('course_link', response.url)
    for review in response.xpath('//*[@class="review-body"]'):
        l.add_xpath('review_body', './/div[@class="review-body__content"]//text()')
        l.add_xpath('course_stage', './/*[@class="review-body-info__course-stage--completed"]//text()')
        l.add_xpath('user_name', './/*[@class="review-body__username"]//text()')
        l.add_xpath('review_date', './/*[@itemprop="datePublished"]/@datetime')
        l.add_xpath('score', './/*[@class="sr-only"]//text()')
    yield l.load_item()
The output file for that script is corrupted: cells are displaced and the field sizes are not correct.
EDIT:
I want to have two output files:
The first one containing:
course_title,course_description,course_instructors,course_key_concepts,course_link
And the second one with:
course_title,review_body,course_stage,user_name,review_date,score
The issue is that you are mixing everything into a single item, which is not the right way to do it. You should create two item types: MoocsItem and MoocsReviewItem.
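For example, in items.py the two item classes could look something like this (a sketch inferred from the field names used below, not code from the original answer):

# sketch only: field names taken from the desired CSV columns listed in the question's EDIT
import scrapy

class MoocsItem(scrapy.Item):
    course_title = scrapy.Field()
    course_description = scrapy.Field()
    course_instructors = scrapy.Field()
    course_key_concepts = scrapy.Field()
    course_link = scrapy.Field()

class MoocsReviewItem(scrapy.Item):
    course_title = scrapy.Field()
    review_body = scrapy.Field()
    course_stage = scrapy.Field()
    user_name = scrapy.Field()
    review_date = scrapy.Field()
    score = scrapy.Field()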
And then update the code as below:
def parse_reviews(self, response):
    # print response.body
    l = ItemLoader(item=MoocsItem(), response=response)
    l.add_xpath('course_title', '//*[@class="course-header-ng__main-info__name__title"]//text()')
    l.add_xpath('course_description', '//*[@class="course-info__description"]//p/text()')
    l.add_xpath('course_instructors', '//*[@class="course-info__instructors__names"]//text()')
    l.add_xpath('course_key_concepts', '//*[@class="key-concepts__labels"]//text()')
    l.add_value('course_link', response.url)
    item = l.load_item()
    for review in response.xpath('//*[@class="review-body"]'):
        r = ItemLoader(item=MoocsReviewItem(), response=response, selector=review)
        r.add_value('course_title', item['course_title'])
        r.add_xpath('review_body', './/div[@class="review-body__content"]//text()')
        r.add_xpath('course_stage', './/*[@class="review-body-info__course-stage--completed"]//text()')
        r.add_xpath('user_name', './/*[@class="review-body__username"]//text()')
        r.add_xpath('review_date', './/*[@itemprop="datePublished"]/@datetime')
        r.add_xpath('score', './/*[@class="sr-only"]//text()')
        yield r.load_item()
    yield item
Now what you want is for different item types to go into different CSV files, which is what the following SO thread answers:
How can scrapy export items to separate csv files per item
I have not tested the below, but the code will look something like this:
from scrapy.exporters import CsvItemExporter
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

def item_type(item):
    return type(item).__name__.replace('Item', '').lower()  # TeamItem => team

class MultiCSVItemPipeline(object):

    SaveTypes = ['moocs', 'moocsreview']

    def __init__(self):
        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)

    def spider_opened(self, spider):
        self.files = dict([(name, open(CSVDir + name + '.csv', 'w+b')) for name in self.SaveTypes])
        self.exporters = dict([(name, CsvItemExporter(self.files[name])) for name in self.SaveTypes])
        [e.start_exporting() for e in self.exporters.values()]

    def spider_closed(self, spider):
        [e.finish_exporting() for e in self.exporters.values()]
        [f.close() for f in self.files.values()]

    def process_item(self, item, spider):
        what = item_type(item)
        if what in set(self.SaveTypes):
            self.exporters[what].export_item(item)
        return item
You need to make sure ITEM_PIPELINES is updated to use this MultiCSVItemPipeline class:
ITEM_PIPELINES = {
    'mybot.pipelines.MultiCSVItemPipeline': 300,
}
I've had a bit of help on here, and my code pretty much works. The only issue is that, in the process of generating the XML, it wraps the content in "value" tags when I don't want it to. According to the docs, this is due to the following:
Unless overridden in the serialize_field() method, multi-valued fields are exported by serializing each value inside a <value> element. This is for convenience, as multi-valued fields are very common.
This is my output:
<?xml version="1.0" encoding="UTF-8"?>
<items>
  <item>
    <body>
      <value>Don't forget me this weekend!</value>
    </body>
    <to>
      <value>Tove</value>
    </to>
    <who>
      <value>Jani</value>
    </who>
    <heading>
      <value>Reminder</value>
    </heading>
  </item>
</items>
What I send to the XML exporter seems to be this, so I don't know why it thinks it's multi-valued:
{'body': [u"Don't forget me this weekend!"],
'heading': [u'Reminder'],
'to': [u'Tove'],
'who': [u'Jani']}
pipeline.py
from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter

class XmlExportPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
spider.py
from scrapy.contrib.spiders import XMLFeedSpider
from crawler.items import CrawlerItem

class SiteSpider(XMLFeedSpider):
    name = 'site'
    allowed_domains = ['www.w3schools.com']
    start_urls = ['http://www.w3schools.com/xml/note.xml']
    itertag = 'note'

    def parse_node(self, response, selector):
        item = CrawlerItem()
        item['to'] = selector.xpath('//to/text()').extract()
        item['who'] = selector.xpath('//from/text()').extract()
        item['heading'] = selector.xpath('//heading/text()').extract()
        item['body'] = selector.xpath('//body/text()').extract()
        return item
Any help would be really appreciated. I just want the same output without the redundant tags.
The extract() method always returns a list of values, even if there is only a single value as a result, for example [4] or [3, 4, 5]; if nothing matches, it returns an empty list [].
To avoid this, if you know there is only one value, you can select it like:
item['to'] = selector.xpath('//to/text()').extract()[0]
Note:
Be aware that this can raise an IndexError if extract() returns an empty list and you try to index into it. In such uncertain cases, this is a good trick to use:
item['to'] = (selector.xpath('...').extract() or [''])[0]
Or you could write your custom function to get the first element:
def extract_first(selector, default=None):
    val = selector.extract()
    return val[0] if val else default
This way you can have a default value in case your desired value is not found:
item['to'] = extract_first(selector.xpath(...))               # first value or None
item['to'] = extract_first(selector.xpath(...), 'not-found')  # first value or 'not-found'
The above answer is correct regarding why this is happening, but I'd like to add that there is now out of the box support for this, and no need to write a helper method.
item['to'] = selector.xpath('//to/text()').extract_first()
and
item['to'] = selector.xpath('//to/text()').extract_first(default='spam')
I added a pipeline, which I found in a Stack Overflow answer, to a sample project.
It is:
import csv

from craiglist_sample import settings

def write_to_csv(item):
    writer = csv.writer(open(settings.csv_file_path, 'a'), lineterminator='\n')
    writer.writerow([item[key] for key in item.keys()])

class WriteToCsv(object):

    def process_item(self, item, spider):
        write_to_csv(item)
        return item
It writes correctly to a CSV file. Then I changed it to this:
import csv
import sys
from craiglist_sample import settings
import datetime
import PyRSS2Gen

def write_to_csv(item):
    rss = PyRSS2Gen.RSS2(
        title="Andrew's PyRSS2Gen feed",
        link="http://www.dalkescientific.com/Python/PyRSS2Gen.html",
        description="The latest news about PyRSS2Gen, a "
                    "Python library for generating RSS2 feeds",
        lastBuildDate=datetime.datetime.now(),
        items=[
            PyRSS2Gen.RSSItem(
                title=str(item['title']),
                link=str(item['link']),
                description="Dalke Scientific today announced PyRSS2Gen-0.0, "
                            "a library for generating RSS feeds for Python. ",
                guid=PyRSS2Gen.Guid("http://www.dalkescientific.com/news/"
                                    "030906-PyRSS2Gen.html"),
                pubDate=datetime.datetime(2003, 9, 6, 21, 31)),
        ])
    rss.write_xml(open("pyrss2gen.xml", "w"))

class WriteToCsv(object):

    def process_item(self, item, spider):
        write_to_csv(item)
        return item
But the problem is that it writes only the last entry to the XML file. How can I fix this? Do I need to add a new line for each entry?
items.py is:
class CraiglistSampleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    link = Field()
Use "a" to append; you are overwriting each time by using "w", so you only get the last piece of data:
rss.write_xml(open("pyrss2gen.xml", "a"))
If you look at the original code, you can see that it also uses "a", not "w".
You might want to use with when opening files, or at least close them.
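A minimal sketch of that suggestion (my wording, not the answerer's code): replace the last line of write_to_csv so the file is opened in append mode inside a with block and closed automatically:

    # open in append mode ("a") so each scraped item adds an entry instead of
    # overwriting the file, and use "with" so the file is always closed
    with open("pyrss2gen.xml", "a") as f:
        rss.write_xml(f)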