Scrapy Data Flow and Items and Item Loaders

I am looking at the Architecture Overview page in the Scrapy documentation, but I still have a few questions regarding data and/or control flow.
Scrapy Architecture
Default File Structure of Scrapy Projects
scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class MyprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
which, I'm assuming, becomes
import scrapy
class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
so that errors are thrown when trying to populate undeclared fields of Product instances
>>> product = Product(name='Desktop PC', price=1000)
>>> product['lala'] = 'test'
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Question #1
Where, when, and how does our crawler become aware of items.py if we have created class CrowdfundingItem in items.py?
Is this done in...
__init__.py?
my_crawler.py?
def __init__() of mycrawler.py?
settings.py?
pipelines.py?
def __init__(self, dbpool) of pipelines.py?
somewhere else?
Question #2
Once I have declared an item such as Product, how do I then store the data by creating instances of Product in a context similar to the one below?
import scrapy
from scrapy.spiders import CrawlSpider
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
class MycrawlerSpider(CrawlSpider):
    name = 'mycrawler'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/']
    def parse(self, response):
        options = Options()
        options.add_argument('-headless')
        browser = webdriver.Firefox(firefox_options=options)
        browser.get(self.start_urls[0])
        elements = browser.find_elements_by_xpath('//section')
        count = 0
        for ele in elements:
            # Search relative to the current element, not the whole page
            name = ele.find_element_by_xpath('./div[@id="name"]').text
            price = ele.find_element_by_xpath('./div[@id="price"]').text
            # If I am not sure how many items there will be,
            # and hence I cannot declare them explicitly,
            # how would I go about creating named instances of Product?
            # Obviously the code below will not work, but how can you accomplish this?
            count += 1
            varName + count = Product(name=name, price=price)
        ...
Lastly, say we forego naming the Product instances altogether, and instead simply create unnamed instances.
for ele in elements:
    name = ele.find_element_by_xpath('./div[@id="name"]').text
    price = ele.find_element_by_xpath('./div[@id="price"]').text
    Product(name=name, price=price)
If such instances are indeed stored somewhere, where are they stored? By creating instances this way, would it be impossible to access them?

Using an Item is optional; they're just a convenient way to declare your data model and apply validation. You can also use a plain dict instead.
If you do choose to use Item, you will need to import it for use in the spider. It's not discovered automatically. In your case, assuming the package is named myproject as in the layout above:
from myproject.items import CrowdfundingItem
As a spider runs the parse method on each page, you can load the extracted data into your Item or dict. Once it's loaded, yield it, which passes it back to the Scrapy engine for processing downstream, in pipelines or exporters. This is how Scrapy enables "storage" of the data you scrape. An instance that is created but never yielded is simply discarded.
For example:
yield Product(name='Desktop PC', price=1000) # uses Item
yield {'name':'Desktop PC', 'price':1000} # plain dict
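To address the naming concern from the question: no per-item variable names are needed, because each item is handed over as soon as it is yielded. Here is a minimal sketch using plain dicts and only the standard library. `parse_rows` and the sample `rows` are hypothetical stand-ins for a spider callback and the scraped elements, and the `list(...)` call stands in for the Scrapy engine consuming the generator.

```python
def parse_rows(rows):
    """Hypothetical stand-in for a spider's parse(): yield one item per row."""
    for name, price in rows:
        # No named variable per item: build it and hand it over immediately.
        yield {'name': name, 'price': price}


# Made-up sample data standing in for scraped elements.
rows = [('Desktop PC', 1000), ('Laptop', 1500)]

# In a real crawl the Scrapy engine iterates over parse(); list() plays that role here.
items = list(parse_rows(rows))
print(items)
```

The same loop shape works with `Product(name=name, price=price)` in place of the dict.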

Related

Item serializers don't work. Function never gets called

I'm trying to use the serializer attribute in an Item, just like the example in the documentation:
https://docs.scrapy.org/en/latest/topics/exporters.html#declaring-a-serializer-in-the-field
The spider runs without any errors, but the serialization doesn't happen, and the print in the function never prints either. It's as if the function remove_pound is never called.
import scrapy
def remove_pound(value):
    print('Am I a joke to you?')
    return value.replace('£', '')
class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field(serializer=remove_pound)
class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = ['https://books.toscrape.com/']
    def parse(self, response):
        books = response.xpath('//ol/li')
        for i in books:
            yield BookItem(
                title=i.xpath('article/h3/a/text()').get(),
                price=i.xpath('article/div/p[@class="price_color"]/text()').get(),
            )
Am I using it wrong?
PS.: I know there are other ways to do it, I just want to learn to use this way.

Original answer: the only reason it doesn't work is that your XPath expression is not right. You need to use a relative XPath:
price=i.xpath('./article/div/p[@class="price_color"]/text()').get()
Update: It's not the XPath. The serializer is applied only by item exporters. As the documentation puts it:
you can customize how each field value is serialized before it is
passed to the serialization library.
So if you run the command scrapy crawl bookspider -o BookSpider.csv, you'll get correctly serialized output.
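Why the function is never called during the crawl itself can be illustrated with a simplified stand-in for an exporter, written with the standard library only. This is not Scrapy's implementation: `FIELDS`, `export_csv`, and the sample item are hypothetical. The sketch only mirrors the idea that the `serializer` stored in the field metadata is looked up and applied when an item is written out, not when it is created.

```python
import csv
import io


def remove_pound(value):
    return value.replace('£', '')


# Hypothetical field metadata, loosely mirroring scrapy.Field(serializer=...).
FIELDS = {'title': {}, 'price': {'serializer': remove_pound}}


def export_csv(items):
    """Write items as CSV, applying each field's serializer at export time."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(list(FIELDS))  # header row: the field names
    for item in items:
        row = []
        for name, meta in FIELDS.items():
            # Fall back to str() when a field declares no serializer.
            serializer = meta.get('serializer', str)
            row.append(serializer(item[name]))
        writer.writerow(row)
    return buffer.getvalue()


item = {'title': 'A Light in the Attic', 'price': '£51.77'}
output = export_csv([item])
print(item['price'])  # the stored value is untouched
print(output)         # only the exported text has the pound sign removed
```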

scrapy custom output processor

I'm using the scrapy framework for a web scraping project but I can't seem to figure out how to get a custom output processor to work.
I have an item class like so:
class Item(scrapy.Item):
    ad_type = scrapy.Field()
then my parse function looks something like this. I have 2 scraped strings which I am adding to the ad_type. I want my output processor function to assign tags based on what is scraped from these 2 xpaths.
def parse(self, response):
    l = ItemLoader(item=Item(), selector=listing)
    l.add_xpath('ad_type', '(.//div/@class)[1]')
    l.add_xpath('ad_type', '(.//div[contains(@class, "brand")]/@class)[1]')
    yield l.load_item()
How do I get my output processor function to access the 2 xpath scraped strings that I have added to ad_type? The scrapy docs give this example but I can't get it to work.
def lowercase_processor(self, values):
    for v in values:
        yield v.lower()
class MyItemLoader(ItemLoader):
    name_in = lowercase_processor

You have named your loader MyItemLoader, but your spider uses ItemLoader (probably scrapy's).
If you update your code to use the custom loader, you should get the result you want.
I would also recommend not naming your item class Item, since that could be confusing.
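As for reaching both scraped strings: an output processor is called with the list of all values collected for its field, so the results of both `add_xpath` calls arrive together. Here is a minimal sketch of such a function in plain Python. `assign_tag` and the tag names are hypothetical, and attaching it as `ad_type_out` on the custom loader class is an assumption to verify against the Scrapy version in use.

```python
def assign_tag(values):
    """Hypothetical output processor: map collected class strings to one tag."""
    # `values` is the list of everything collected for the field,
    # e.g. the results of both add_xpath('ad_type', ...) calls.
    classes = ' '.join(values).lower()
    if 'brand' in classes:
        return 'branded'
    if 'premium' in classes:
        return 'premium'
    return 'standard'


# Made-up class attribute values standing in for the two XPath results.
print(assign_tag(['listing premium', 'brand-logo']))
print(assign_tag(['listing']))
```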

Python Scrapy not outputting to csv file

What am I doing wrong in this script so that it doesn't output a CSV file with the data? I am running it with scrapy runspider yellowpages.py -o items.csv, and all I get is a blank CSV file. I have followed various suggestions here and watched YouTube videos, but I still cannot figure out what I am doing incorrectly.
# -*- coding: utf-8 -*-
import scrapy
import requests
search = "Plumbers"
location = "Hammond, LA"
url = "https://www.yellowpages.com/search"
q = {'search_terms': search, 'geo_location_terms': location}
page = requests.get(url, params=q)
page = page.url
items = ()
class YellowpagesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['yellowpages.com']
    start_urls = [page]
    def parse(self, response):
        self.log("I just visited: " + response.url)
        items = response.css('a[class=business-name]::attr(href)')
        for item in items:
            print(item)

Here is a simple spider that runs without a project.
I added comments to the code to make it easier to understand. The spider collects all result blocks on all pages for a given pair of parameters, a service and a location, passed as -a servise=... and -a location=....
To run it in your case, use:
scrapy runspider yellowpages.py -a servise="Plumbers" -a location="Hammond, LA" -o Hammondsplumbers.csv
The code will also work with any queries. For example:
scrapy runspider yellowpages.py -a servise="Doctors" -a location="California, MD" -o MDDoctors.json
etc...
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.exceptions import CloseSpider
class YellowpagesSpider(scrapy.Spider):
    name = 'yellowpages'
    allowed_domains = ['yellowpages.com']
    start_urls = ['https://www.yellowpages.com/']
    # We can use any pair servise + location in our request
    def __init__(self, servise=None, location=None, *args, **kwargs):
        # Let the base Spider handle its own arguments first
        super().__init__(*args, **kwargs)
        self.servise = servise
        self.location = location
    def parse(self, response):
        # If "servise" and "location" are defined
        if self.servise and self.location:
            # Create the search phrase using "servise" and "location"
            search_url = 'search?search_terms={}&geo_location_terms={}'.format(self.servise, self.location)
            # Send request with url "yellowpages.com" + "search_url", then call parse_result
            yield Request(url=response.urljoin(search_url), callback=self.parse_result)
        else:
            # Else close our spider
            # You can add a default value if you want.
            self.logger.warning('=== Please use keys -a servise="service_name" -a location="location" ===')
            raise CloseSpider()
    def parse_result(self, response):
        # All blocks without ad posts
        posts = response.xpath('//div[@class="search-results organic"]//div[@class="v-card"]')
        for post in posts:
            yield {
                'title': post.xpath('.//span[@itemprop="name"]/text()').extract_first(),
                'url': response.urljoin(post.xpath('.//a[@class="business-name"]/@href').extract_first()),
            }
        next_page = response.xpath('//a[@class="next ajax-page"]/@href').extract_first()
        # If we have a next page url
        if next_page:
            # Send request with url "yellowpages.com" + "next_page", then call parse_result
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse_result)

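for item in items:
    print(item)
Use yield instead of print there. Each item in that loop is a Selector, which Scrapy does not accept as scraped output, so yield a dict holding the extracted value:
for item in items:
    yield {'url': item.get()}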

On inspection of your code, I notice a number of problems:
First, you initialize items to a tuple, when it should be a list: items = [].
You should change your name property to reflect the name you want on your crawler so you can use it like so: scrapy crawl my_crawler where name = "my_crawler".
start_urls is supposed to contain strings, not Request objects. You should change the entry from page to the exact search string you want to use. If you have a number of search strings and want to iterate over them, I would suggest using a middleware.
When you try to extract the data from CSS, you're forgetting to call getall() (or its older alias extract()), which transforms the selector results into string data that you can use.
Also, you shouldn't be redirecting to the standard output stream because a lot of logging goes there and it'll make your output file really messy. Instead, you should extract the responses into items using loaders.
Finally, you're probably missing the appropriate settings from your settings.py file. You can find the relevant documentation here.
FEED_FORMAT = "csv"
FEED_EXPORT_FIELDS = ["Field 1", "Field 2", "Field 3"]

Real-time Scraper | Complicated Issue

I have an up-and-running web scraper; the tickers are listed in a separate Excel document. I am using ScrapingHub's API because it is accessible anywhere, which is very convenient. I want to write code that will update and scrape based on what is listed in the Excel sheet.
With my Excel list, how can I have my code update automatically (i.e., I add MSFT to my Excel sheet and the code then includes MSFT)?
Additionally, is there any way to have it deploy automatically?
--==Spider Code==--
**tickers appended in each link (search criteria)
import scrapy
import collections
from collections import OrderedDict
from scrapy.spiders import XMLFeedSpider
from tickers.items import tickersItem
class Spider(XMLFeedSpider):
    name = "NewsScraper"
    allowed_domains = ["yahoo.com"]
    start_urls = (
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=ABIO,ACFN,AEMD,AEZS,AITB',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=BGMD,BIOA',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=CANF,CBIO,CCCR',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=DRIO,DRWI,DXTR,ENCR',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=GNMX,GNUS,GPL,HIPP,HSGX',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=MBOT,MBVX',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=NBY,NNVC,NTRP',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=PGRX,PLXP',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=SANW,SBOT,SCON,SCYX',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=UNXL,UQM,URRE',
    )
    itertag = 'item'
    def parse_node(self, response, node):
        item = collections.OrderedDict()
        item['Title'] = node.xpath(
            'title/text()').extract_first()
        item['PublishDate'] = node.xpath(
            'pubDate/text()').extract_first()
        item['Description'] = node.xpath(
            'description/text()').extract_first()
        item['Link'] = node.xpath(
            'link/text()').extract_first()
        return item

How can I use the fields_to_export attribute in BaseItemExporter to order my Scrapy CSV data?

I have made a simple Scrapy spider that I use from the command line to export my data in CSV format, but the order of the fields seems random. How can I order the CSV fields in my output?
I use the following command line to get CSV data:
scrapy crawl somwehere -o items.csv -t csv
According to this Scrapy documentation, I should be able to use the fields_to_export attribute of the BaseItemExporter class to control the order. But I am clueless how to use this as I have not found any simple example to follow.
Please Note: This question is very similar to THIS one. However, that question is over 2 years old, doesn't address the many recent changes to Scrapy, and doesn't provide a satisfactory answer, as it requires hacking one or both of:
contrib/exporter/__init__.py
contrib/feedexport.py
to address some previous issues, that seem to have already been resolved...
Many thanks in advance.

To use such an exporter, you need to create your own item pipeline that will process your spider output. Assuming a simple case where you want all spider output in one file, this is the pipeline you should use (pipelines.py):
from scrapy import signals
from scrapy.exporters import CsvItemExporter  # formerly scrapy.contrib.exporter
class CSVPipeline(object):
    def __init__(self):
        self.files = {}
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline
    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        # Placeholder names: list the fields to export here; order is important
        self.exporter.fields_to_export = ['field_1', 'field_2']
        self.exporter.start_exporting()
    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Of course you need to remember to add this pipeline in your configuration file (settings.py):
ITEM_PIPELINES = {'myproject.pipelines.CSVPipeline': 300 }

You can now specify settings in the spider itself.
https://doc.scrapy.org/en/latest/topics/settings.html#settings-per-spider
To set the field order for exported feeds, set FEED_EXPORT_FIELDS.
https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields
The spider below dumps all links on a website (written against Scrapy 1.4.0):
import scrapy
from scrapy.http import HtmlResponse
class DumplinksSpider(scrapy.Spider):
    name = 'dumplinks'
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/']
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["page", "page_ix", "text", "url"],
    }
    def parse(self, response):
        if not isinstance(response, HtmlResponse):
            return
        a_selectors = response.xpath('//a')
        for i, a_selector in enumerate(a_selectors):
            text = a_selector.xpath('normalize-space(text())').extract_first()
            url = a_selector.xpath('@href').extract_first()
            yield {
                'page_ix': i + 1,
                'page': response.url,
                'text': text,
                'url': url,
            }
            yield response.follow(url, callback=self.parse)  # see allowed_domains
Run with this command:
scrapy crawl dumplinks --loglevel=INFO -o links.csv
Fields in links.csv are ordered as specified by FEED_EXPORT_FIELDS.
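The effect of a fixed field list can be reproduced with the standard library's `csv.DictWriter`, whose `fieldnames` argument plays a role comparable to `FEED_EXPORT_FIELDS`. This is an analogy rather than Scrapy's exporter code, and the sample row is made up.

```python
import csv
import io

# Same order as FEED_EXPORT_FIELDS in the spider above.
FIELD_ORDER = ['page', 'page_ix', 'text', 'url']

# The dict is built in a different key order, as in the spider's yield.
row = {
    'page_ix': 1,
    'page': 'http://www.example.com/',
    'text': 'More information...',
    'url': 'https://www.iana.org/domains/example',
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELD_ORDER, lineterminator='\n')
writer.writeheader()   # columns follow FIELD_ORDER, not the dict's key order
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```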

I found a pretty simple way to solve this issue. The above answers I would still say are more correct, but this is a quick fix. It turns out scrapy pulls the items in alphabetical order. Capitals are also important. So, an item beginning with 'A' will be pulled first, then 'B', 'C', etc, followed by 'a', 'b', 'c'. I have a project going right now where the header names are not extremely important, but I did need the UPC to be the first header for input into another program. I have the following item class:
class ItemInfo(scrapy.Item):
    item = scrapy.Field()
    price = scrapy.Field()
    A_UPC = scrapy.Field()
    ID = scrapy.Field()
    time = scrapy.Field()
My CSV file outputs with the headers (in order): A_UPC, ID, item, price, time
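The observed order matches plain code-point sorting, in which uppercase ASCII letters precede lowercase ones. Here is a quick check in plain Python, under the assumption that the field names were simply sorted. Whether a given Scrapy version orders fields this way is not confirmed here.

```python
field_names = ['item', 'price', 'A_UPC', 'ID', 'time']

# sorted() compares strings by code point, so 'A' and 'I' sort before 'i', 'p', 't'.
ordered = sorted(field_names)
print(ordered)
```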
