Pipeline for item not JSON serializable - python

I am trying to write the output of a scraped XML file to JSON. The scrape fails due to an item not being serializable.
From this question it is advised that you need to build a pipeline; an answer was not provided there as it was out of scope for that question: SO scrapy serializer.
So, referring to the Scrapy docs, they illustrate an example; however, the docs then advise not to use it:
The purpose of JsonWriterPipeline is just to introduce how to write
item pipelines. If you really want to store all scraped items into a
JSON file you should use the Feed exports.
If I go to Feed exports, this is shown:
JSON
FEED_FORMAT: json
Exporter used: JsonItemExporter
See this warning if you’re using JSON with large feeds.
My issue still remains, as that (as I understand it) is for executing from the command line, like so:
scrapy runspider myxml.py -o ~/items.json -t json
However, this creates the error I was aiming to use a pipeline to solve.
TypeError: <bound method SelectorList.extract of [<Selector xpath='.//@venue' data=u'Royal Randwick'>]> is not JSON serializable
How do I create the json pipeline to rectify the json serialize error?
This is my code.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from scrapy.selector import XmlXPathSelector
from conv_xml.items import ConvXmlItem

# https://stackoverflow.com/a/27391649/461887
import json


class MyxmlSpider(scrapy.Spider):
    name = "myxml"
    start_urls = (
        ["file:///home/sayth/Downloads/20160123RAND0.xml"]
    )

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//meeting')
        items = []
        for site in sites:
            item = ConvXmlItem()
            item['venue'] = site.xpath('.//@venue').extract
            item['name'] = site.xpath('.//race/@id').extract()
            item['url'] = site.xpath('.//race/@number').extract()
            item['description'] = site.xpath('.//race/@distance').extract()
            items.append(item)
        return items


# class JsonWriterPipeline(object):
#
#     def __init__(self):
#         self.file = open('items.jl', 'wb')
#
#     def process_item(self, item, spider):
#         line = json.dumps(dict(item)) + "\n"
#         self.file.write(line)
#         return item

The problem is here:
item['venue'] = site.xpath('.//@venue').extract
You've just forgotten to call extract. Replace it with:
item['venue'] = site.xpath('.//@venue').extract()
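If you still want a pipeline as well (instead of relying on feed exports), here is a minimal sketch adapted from the docs' JsonWriterPipeline; the conv_xml.pipelines module path is an assumption based on your project name, and it only serializes cleanly once every field is a plain list or string, i.e. once extract() is actually called:

# pipelines.py -- minimal JSON-lines writer, adapted from the Scrapy docs example
import json


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) is JSON serializable only when every value is a plain type
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

Then enable it in settings.py (the order value 300 is arbitrary):

ITEM_PIPELINES = {'conv_xml.pipelines.JsonWriterPipeline': 300}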

Related

Soundcloud Scrapy Spider

I'm trying to build a Scrapy Spider to parse the artist and track info from SoundCloud.
Using the developer tools in Firefox I've determined that an API call can be made which returns a JSON object that converts to a Python dictionary. This API call needs an artist ID, and as far as I can tell these IDs have been auto-incremented. This means I don't need to crawl the site, and can just have a list of starting URLs that make the initial API call and then parse the pages that follow from that. I believe this should make me more friendly to the site?
From the returned response the artists' URL can be obtained, and visiting and parsing this URL will give more information about the artist.
From the artists' URL we can visit their tracks and scrape a list of tracks alongside the tracks' attributes.
I think the issues I'm having stem from not understanding Scrapy's framework...
If I directly put the artists' URLs in start_urls, Scrapy passes a scrapy.http.response.html.HtmlResponse object to parse_artist. This allows me to extract the data I need (I didn't include all the code to parse the page, to keep the code snippet shorter). However, if I pass that same object to the same function from the function parse_api_call, it results in an error...
I cannot understand why this is, and any help would be appreciated.
Side Note:
The initial API call grabs tracks from the artist, and the offset and limit can be changed and the function called recursively to collect the tracks. This, however, has proven unreliable, and even when it doesn't result in an error that terminates the program, it doesn't get a full list of tracks from the artist.
Here's the current code:
"""
Scrapes SoundCloud websites for artists and tracks
"""
import json
import scrapy
from ..items import TrackItem, ArtistItem
from scrapy.spiders.crawl import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
class SoundCloudBot(scrapy.Spider):
name = 'soundcloudBot'
allowed_domains = ['soundcloud.com']
start_urls = [
'https://api-v2.soundcloud.com/users/7436630/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en',
'https://api-v2.soundcloud.com/users/4803918/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en',
'https://api-v2.soundcloud.com/users/17364233/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en',
'https://api-v2.soundcloud.com/users/19697240/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en',
'https://api-v2.soundcloud.com/users/5949564/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en'
]
# This is added for testing purposes. When these links are added directly to the
# start_urls the code runs as expected, when these links are extracted using parse_api_call
# is when problems arise
# start_urls.extend([
# 'https://soundcloud.com/futureisnow',
# 'https://soundcloud.com/bigsean-1',
# 'https://soundcloud.com/defjam',
# 'https://soundcloud.com/ymcmbofficial',
# 'https://soundcloud.com/walefolarin',
# # 'https://soundcloud.com/futureisnow/tracks',
# # 'https://soundcloud.com/bigsean-1/tracks',
# # 'https://soundcloud.com/defjam/tracks',
# # 'https://soundcloud.com/ymcmbofficial/tracks',
# # 'https://soundcloud.com/walefolarin/tracks'
# ])
def parse(self, response):
url = response.url
if url[:35] == 'https://api-v2.soundcloud.com/users':
self.parse_api_call(response)
# 'https://soundcloud.com/{artist}'
elif url.replace('https://soundcloud.com', '').count('/') == 1: # One starting forward slash for artist folder
self.parse_artist(response)
# 'https://soundcloud.com/{artist}/{track}'
elif url.replace('https://soundcloud.com', '').count('/') == 2 and url[-6:] == 'tracks':
self.parse_tracks(response)
def parse_api_call(self, response):
data = json.loads(response.text)
artistItem = ArtistItem()
first_track = data['collection'][0]
artist_info = first_track.get('user')
artist_id = artist_info.get('id')
artist_url = artist_info.get('permalink_url')
artist_name = artist_info.get('username')
artistItem['artist_id'] = artist_id
artistItem['username'] = artist_name
artistItem['url'] = artist_url
artist_response = scrapy.http.response.html.HtmlResponse(artist_url)
self.parse_artist(artist_response)
# Once the pipelines are written this will be changed to yeild
return artistItem
def parse_artist(self, response):
# This prints out <class 'scrapy.http.response.html.HtmlResponse'>
# It doesn't matter if start_urls get extend with artists' URLS or not
print(type(response))
data = response.css('script::text').extract()
# This prints out a full HTML response if the function is called directly
# With scrapy, or an empty list if called from parse_api_call
print(data)
track_response = scrapy.http.response.html.HtmlResponse(f'{response.url}/tracks')
self.parse_tracks(track_response)
def parse_tracks(self, response):
pass
You have to use Request(url) to get data from a new URL. But you can't execute it like a normal function and get the result at once: you have to return Request() or yield Request(), and Scrapy puts it in a queue to fetch the data later.
After it gets the data, it uses the parse() method to parse the response. But you can set your own callback in the request:
Request(url, self.parse_artist)
In parse_artist() you will not have access to the data you got in the previous function, so you have to send it along with the request using meta, i.e.
Request(artistItem['url'], self.parse_artist, meta={'item': artistItem})
Full working code follows. You can put it all in one file and run it without creating a project.
It also saves the result in output.csv.
import scrapy
from scrapy.http import Request
import json


class MySpider(scrapy.Spider):

    name = 'myspider'
    allowed_domains = ['soundcloud.com']
    start_urls = [
        'https://api-v2.soundcloud.com/users/7436630/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en',
        'https://api-v2.soundcloud.com/users/4803918/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en',
        'https://api-v2.soundcloud.com/users/17364233/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en',
        'https://api-v2.soundcloud.com/users/19697240/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en',
        'https://api-v2.soundcloud.com/users/5949564/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en'
    ]

    def parse(self, response):
        data = json.loads(response.text)

        if len(data['collection']) > 0:
            artist_info = data['collection'][0]['user']

            artistItem = {
                'artist_id': artist_info.get('id'),
                'username': artist_info.get('username'),
                'url': artist_info.get('permalink_url'),
            }

            print('>>>', artistItem['url'])

            # make a request to artistItem['url'],
            # parse the response in parse_artist,
            # and send artistItem to parse_artist
            return Request(artistItem['url'], self.parse_artist, meta={'item': artistItem})
        else:
            print("ERROR: no collections in data")

    def parse_artist(self, response):
        artistItem = response.meta['item']

        data = response.css('script::text').extract()
        # add data to artistItem
        #print(data)
        artistItem['new data'] = 'some new data'

        #print('>>>', response.urljoin('tracks'))
        print('>>>', response.url + '/tracks')

        # make a request to response.url + '/tracks',
        # parse the response in parse_tracks,
        # and send artistItem to parse_tracks
        return Request(response.url + '/tracks', self.parse_tracks, meta={'item': artistItem})

    def parse_tracks(self, response):
        artistItem = response.meta['item']
        artistItem['tracks'] = 'some tracks'

        # send to the CSV file
        return artistItem


# ------------------------------------------------------------------------------
# run it without creating a project
# ------------------------------------------------------------------------------

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0',
    # save the feed as CSV, JSON or XML
    'FEED_FORMAT': 'csv',      # csv, json, xml
    'FEED_URI': 'output.csv',
})
c.crawl(MySpider)
c.start()
output.csv
artist_id,username,url,new data,tracks
17364233,Def Jam Recordings,https://soundcloud.com/defjam,some new data,some tracks
4803918,Big Sean,https://soundcloud.com/bigsean-1,some new data,some tracks
19697240,YMCMB-Official,https://soundcloud.com/ymcmbofficial,some new data,some tracks
5949564,WALE,https://soundcloud.com/walefolarin,some new data,some tracks

simple scrapy based on https://github.com/scrapy/quotesbot/blob/master/quotesbot/spiders/toscrape-xpath.py not passing data using yield Request

My code, based on examples that I searched, did not seem to function as intended, so I decided to use a working model found on GitHub: https://github.com/scrapy/quotesbot/blob/master/quotesbot/spiders/toscrape-xpath.py
I then modified it slightly to showcase what I am running into. The code below works great as intended, but my ultimate goal is to pass the scraped data from the first "parse" to a second "parse2" function so that I can combine data from two different pages. For now I wanted to start very simple so I can follow what is happening, hence the heavily stripped code below.
# -*- coding: utf-8 -*-
import scrapy
from quotesbot.items import MyItems
from scrapy import Request


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        item = MyItems()
        for quote in response.xpath('//div[@class="quote"]'):
            item['tinfo'] = quote.xpath('./span[@class="text"]/text()').extract_first()
            yield item
but then when I modify the code as below:
# -*- coding: utf-8 -*-
import scrapy
from quotesbot.items import MyItems
from scrapy import Request


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        item = MyItems()
        for quote in response.xpath('//div[@class="quote"]'):
            item['tinfo'] = quote.xpath('./span[@class="text"]/text()').extract_first()
            yield Request("http://quotes.toscrape.com/",
                          callback=self.parse2, meta={'item': item})

    def parse2(self, response):
        item = response.meta['item']
        yield item
I only have one item scraped and it says the rest are duplicates. It also looks like "parse2" is not even read at all. I have played with the indentation and the brackets thinking I am missing something simple, but without much success. I have looked at many examples to see if I can make sense of what the issue could be, but I still am not able to make it work. I am sure it's a very simple issue for the gurus out there, so: help, somebody!
Also, my items.py file looks like below. I think those two files, items.py and toscrape-xpath.py, are the only ones in action as far as I can tell, since I am quite new to all this.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QuotesbotItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class MyItems(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    tinfo = scrapy.Field()
    pass
Thank you very much for any and all help you can provide.
# -*- coding: utf-8 -*-
import scrapy
from quotesbot.items import MyItems
from scrapy import Request


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        item = MyItems()
        for quote in response.xpath('//div[@class="quote"]'):
            item = {'tinfo': quote.xpath('./span[@class="text"]/text()').extract_first()}
            yield response.follow('http://quotes.toscrape.com', self.parse_2,
                                  meta={'item': item})

    def parse_2(self, response):
        print("almost there")
        item = response.meta['item']
        yield item
Your spider logic is very confusing:
def parse(self, response):
    for quote in response.xpath('//div[@class="quote"]'):
        yield Request("http://quotes.toscrape.com/",
                      callback=self.parse2, meta={'item': item})
For every quote you find on quotes.toscrape.com you schedule another request to the same webpage?
What happens is that these newly scheduled requests get filtered out by Scrapy's duplicate request filter.
Maybe you should just yield the item right there:
def parse(self, response):
    for quote in response.xpath('//div[@class="quote"]'):
        item = MyItems()
        item['tinfo'] = quote.xpath('./span[@class="text"]/text()').extract_first()
        yield item
To illustrate why your current crawler does next to nothing: every request you schedule points at http://quotes.toscrape.com/, a page that has already been fetched, so Scrapy's duplicate filter drops the requests before parse2 ever runs.
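If you really do need a second request per quote, it has to go to a distinct URL (for example the quote's author page), otherwise you must set dont_filter=True. A minimal sketch, assuming the standard quotes.toscrape.com markup; author_born is a hypothetical extra field you would have to add to MyItems:

def parse(self, response):
    for quote in response.xpath('//div[@class="quote"]'):
        item = MyItems()
        item['tinfo'] = quote.xpath('./span[@class="text"]/text()').extract_first()
        # each quote links to a distinct /author/<name> page, so these
        # requests are not dropped by the duplicate filter
        author_url = quote.xpath('.//a/@href').extract_first()
        yield response.follow(author_url, callback=self.parse2, meta={'item': item})

def parse2(self, response):
    item = response.meta['item']
    # enrich the item with data from the second page, then yield it
    item['author_born'] = response.xpath('//span[@class="author-born-date"]/text()').extract_first()
    yield item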

How to use Scrapy sitemap spider on sites with text sitemaps?

I tried using a generic scrapy.Spider to follow links, but it didn't work, so I hit upon the idea of simplifying the process by accessing sitemap.txt instead, but that didn't work either!
I wrote a simple example (to help me understand the algorithm) of a spider to follow the sitemap specified on my site: https://legion-216909.appspot.com/sitemap.txt It is meant to navigate the URLs specified on the sitemap, print them out to screen and output the results into a links.txt file. The code:
import scrapy
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    name = "spyder_PAGE"
    sitemap_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        print(response.url)
        return response.url
I ran the above spider with scrapy crawl spyder_PAGE > links.txt, but that returned an empty text file. I have gone through the Scrapy docs multiple times, but there is something missing. Where am I going wrong?
SitemapSpider is expecting an XML sitemap format, causing the spider to exit with this error:
[scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://legion-216909.appspot.com/sitemap.txt>
Since your sitemap.txt file is just a simple list of URLs, it would be easier to just split them with a string method.
For example:
from scrapy import Spider, Request


class MySpider(Spider):
    name = "spyder_PAGE"
    start_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        links = response.text.split('\n')
        for link in links:
            # yield a request to get this link
            print(link)

# https://legion-216909.appspot.com/index.html
# https://legion-216909.appspot.com/content.htm
# https://legion-216909.appspot.com/Dataset/module_4_literature/Unit_1/.DS_Store
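To actually crawl those links instead of just printing them, a minimal sketch (parse_page and the title XPath are placeholders for whatever you want to extract):

from scrapy import Spider, Request


class MySpider(Spider):
    name = "spyder_PAGE"
    start_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        # one URL per line in the text sitemap; skip blank lines
        for link in response.text.splitlines():
            if link.strip():
                yield Request(link.strip(), callback=self.parse_page)

    def parse_page(self, response):
        # placeholder: extract whatever you need from each page
        yield {'url': response.url, 'title': response.xpath('//title/text()').extract_first()}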
You only need to override _parse_sitemap(self, response) from SitemapSpider with the following:
from scrapy import Request
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):

    sitemap_urls = [...]
    sitemap_rules = [...]

    def _parse_sitemap(self, response):
        # yield a request for each url in the txt file that matches your filters
        urls = response.text.splitlines()
        it = self.sitemap_filter(urls)
        for loc in it:
            for r, c in self._cbs:
                if r.search(loc):
                    yield Request(loc, callback=c)
                    break
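For reference, self._cbs is built from sitemap_rules, so a concrete configuration for the text sitemap above might look like the sketch below (the parse_page callback name is an assumption; an empty pattern matches every URL):

from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    name = "spyder_PAGE"
    sitemap_urls = ['https://legion-216909.appspot.com/sitemap.txt']
    # each rule is (pattern, callback name); '' matches every URL
    sitemap_rules = [('', 'parse_page')]

    # _parse_sitemap overridden as shown above

    def parse_page(self, response):
        print(response.url)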

How to write scraped data into a CSV file in Scrapy?

I am trying to scrape a website by extracting the sub-links and their titles, and then saving the extracted titles and their associated links into a CSV file. When I run the following code, the CSV file is created but it is empty. Any help?
My Spider.py file looks like this:
from scrapy import cmdline
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class HyperLinksSpider(CrawlSpider):
    name = "linksSpy"
    allowed_domains = ["some_website"]
    start_urls = ["some_website"]
    rules = (Rule(LinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        items = []
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item = ExtractlinksItem()
            for sel in response.xpath('//tr/td/a'):
                item['title'] = sel.xpath('/text()').extract()
                item['link'] = sel.xpath('/@href').extract()
            items.append(item)
        return items

cmdline.execute("scrapy crawl linksSpy".split())
My pipelines.py is:
import csv


class ExtractlinksPipeline(object):

    def __init__(self):
        self.csvwriter = csv.writer(open('Links.csv', 'wb'))

    def process_item(self, item, spider):
        self.csvwriter.writerow((item['title'][0]), item['link'][0])
        return item
My items.py is:
import scrapy


class ExtractlinksItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()
    pass
I have also changed my settings.py:
ITEM_PIPELINES = {'extractLinks.pipelines.ExtractlinksPipeline': 1}
To output all data, Scrapy has an inbuilt feature called Feed Exports.
To put it shortly, all you need are two settings in your settings.py file: FEED_FORMAT, the format in which the feed should be saved (in your case csv), and FEED_URI, the location where the feed should be saved, e.g. ~/my_feed.csv.
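A minimal sketch of those two settings (the file name is just an example):

# settings.py
FEED_FORMAT = 'csv'
FEED_URI = 'Links.csv'   # or an absolute path such as '/home/user/my_feed.csv'

Equivalently, you can pass them on the command line: scrapy crawl linksSpy -o Links.csv -t csv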
My related answer covers it in greater detail with a use case:
https://stackoverflow.com/a/41473241/3737009

Scrapy - Get index of item being parsed?

I'm trying to load some XPATH rules from a database using Scrapy.
The code I've written so far works fine, however after some debugging I've realised that Scrapy is parsing each item asynchronously, meaning I have no control over the order of which item is being parsed.
What I want to do is figure out which item from the list is currently being parsed when it hits the parse() function, so I can reference that index to the rows in my database and acquire the correct XPath query. The way I'm currently doing this is by using a variable called item_index and incrementing it after each item iteration. Now I realise this is not enough, and I'm hoping there's some internal functionality that could help me achieve this.
Does anyone know the proper way of keeping track of this? I've looked through the documentation but couldn't find any info about it. I've also looked at the Scrapy source code, but I can't seem to figure out how the list of URLs actually gets stored.
Here's my code to explain my problem further:
# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from dirbot.items import Product
from dirbot.database import DatabaseConnection

# Create a database connection object so we can execute queries
connection = DatabaseConnection()


class DmozSpider(Spider):
    name = "dmoz"
    start_urls = []
    item_index = 0

    # Query for all products sold by a merchant
    rows = connection.query("SELECT * FROM products_merchant WHERE 1=1")

    def start_requests(self):
        for row in self.rows:
            yield self.make_requests_from_url(row["product_url"])

    def parse(self, response):
        sel = Selector(response)
        item = Product()
        item['product_id'] = self.rows[self.item_index]['product_id']
        item['merchant_id'] = self.rows[self.item_index]['merchant_id']
        item['price'] = sel.xpath(self.rows[self.item_index]['xpath_rule']).extract()
        self.item_index += 1
        return item
Any guidance would be greatly appreciated!
Thanks
Here's the solution I came up with, just in case anyone needs it.
As @toothrot suggested, you need to pass the information along with each request (via Request.meta, here by overloading make_requests_from_url) so you can access it in the callback.
Hope this helps someone.
# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from dirbot.items import Product
from dirbot.database import DatabaseConnection

# Create a database connection object so we can execute queries
connection = DatabaseConnection()


class DmozSpider(Spider):
    name = "dmoz"
    start_urls = []

    # Query for all products sold by a merchant
    rows = connection.query("SELECT * FROM products_merchant WHERE 1=1")

    def start_requests(self):
        for indx, row in enumerate(self.rows):
            self.start_urls.append(row["product_url"])
            yield self.make_requests_from_url(row["product_url"], {'index': indx})

    def make_requests_from_url(self, url, meta):
        return Request(url, callback=self.parse, dont_filter=True, meta=meta)

    def parse(self, response):
        item_index = response.meta['index']
        sel = Selector(response)
        item = Product()
        item['product_id'] = self.rows[item_index]['product_id']
        item['merchant_id'] = self.rows[item_index]['merchant_id']
        item['price'] = sel.xpath(self.rows[item_index]['xpath_rule']).extract()
        return item
You can pass the index (or the row id from the database) along with the request using Request.meta. It's a dictionary you can access from Response.meta in your handler.
For example, when you're building your request:
Request(url, callback=self.some_handler, meta={'row_id': row['id']})
Using a counter like you've attempted won't work because you can't guarantee the order in which the responses are handled.
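In the handler you read it back from response.meta; a minimal sketch (some_handler and the lookup are placeholders following the example above):

def some_handler(self, response):
    row_id = response.meta['row_id']   # the value attached when the Request was built
    # use row_id to look up the matching row and its XPath rule, e.g.:
    # row = self.rows_by_id[row_id]
    # price = response.xpath(row['xpath_rule']).extract()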
