How to make Scrapy XmlFeed Spider Faster - python

The XML feed I'm scraping has around a thousand items. I'm wondering if there is a way to split the load or use another method to significantly reduce the run time. It currently takes two minutes to iterate over all the XML within the link below. Any suggestions or advice are greatly appreciated.
Example: https://www.cityblueshop.com/sitemap_products_1.xml
from scrapy.spiders import XMLFeedSpider
from learning.items import TestItem


class MySpider(XMLFeedSpider):
    name = 'testing'
    allowed_domains = ['www.cityblueshop.com']
    start_urls = ['https://www.cityblueshop.com/sitemap_products_1.xml']
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    iterator = 'xml'

    def parse_node(self, response, node):
        item = TestItem()
        item['url'] = node.xpath('.//n:loc/text()').extract()
        return item
The run time is two minutes for all items. Is there any way to make it quicker using Scrapy?

I tested the following spider locally:
from scrapy.spiders import XMLFeedSpider


class MySpider(XMLFeedSpider):
    name = 'testing'
    allowed_domains = ['www.cityblueshop.com']
    start_urls = ['https://www.cityblueshop.com/sitemap_products_1.xml']
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    iterator = 'xml'

    def parse_node(self, response, node):
        yield {'url': node.xpath('.//n:loc/text()').get()}
It takes less than 3 seconds to run, including Scrapy core startup and everything.
Please ensure that the time is not spent somewhere else, e.g. in the learning module from which you import your item subclass.

Try increasing CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP; see for example: https://doc.scrapy.org/en/latest/topics/settings.html#concurrent-requests-per-domain
But remember that, besides higher speed, this can also lower your success rate: more 429 responses, bans, and so on.
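For instance, a minimal settings.py sketch (the values below are illustrative assumptions, not recommendations; tune them against what the target site tolerates):

# settings.py -- illustrative values only
CONCURRENT_REQUESTS = 32             # default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # default is 8
CONCURRENT_REQUESTS_PER_IP = 0       # 0 keeps the per-IP limit disabled (per-domain limit applies)
DOWNLOAD_DELAY = 0                   # no artificial delay between requests

Note that this particular spider has only one start URL, so concurrency mostly starts to matter once you follow the individual product pages.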

Related

First Python Scrapy Web Scraper Not Working

I took the Data Camp Web Scraping with Python course and am trying to run the 'capstone' web scraper in my own environment (the course takes place in a special in-browser environment). The code is intended to scrape the titles and descriptions of courses from the Data Camp webpage.
I've spent a good deal of time tinkering here and there, and at this point I'm hoping that the community can help me out.
The code I am trying to run is:
# Import scrapy
import scrapy

# Import the CrawlerProcess
from scrapy.crawler import CrawlerProcess


# Create the Spider class
class YourSpider(scrapy.Spider):
    name = 'yourspider'

    # start_requests method
    def start_requests(self):
        yield scrapy.Request(url= https://www.datacamp.com, callback = self.parse)

    def parse(self, response):
        # Parser, Maybe this is where my issue lies
        crs_titles = response.xpath('//h4[contains(@class,"block__title")]/text()').extract()
        crs_descrs = response.xpath('//p[contains(@class,"block__description")]/text()').extract()
        for crs_title, crs_descr in zip(crs_titles, crs_descrs):
            dc_dict[crs_title] = crs_descr


# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(YourSpider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)
I get the following output:
C:\Users*\PycharmProjects\TestScrape\venv\Scripts\python.exe C:/Users/*/PycharmProjects/TestScrape/main.py
  File "C:\Users******\PycharmProjects\TestScrape\main.py", line 20
    yield scrapy.Request(url=https://www.datacamp.com, callback=self.parse1)
                                 ^
SyntaxError: invalid syntax

Process finished with exit code 1
I notice that the parse method in line 20 remains grey in my PyCharm window. Maybe I am missing something important in the parse method?
Any help in getting the code to run would be greatly appreciated!
Thank you,
-WolfHawk
The error message is triggered by the following line:
yield scrapy.Request(url= https://www.datacamp.com, callback = self.parse)
The url argument must be a string, and string literals are written with ' or " at the beginning and the end.
Try this:
yield scrapy.Request(url='https://www.datacamp.com', callback = self.parse)
If this is your full code, you are also missing the function previewCourses. Check if it is provided to you or write it yourself with something like this:
def previewCourses(dict_to_print):
    for key, value in dict_to_print.items():
        print(key, value)
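Putting both fixes together, a corrected sketch of the full script (the XPath class names come from the question; DataCamp's markup may have changed since the course, so treat them as assumptions):

import scrapy
from scrapy.crawler import CrawlerProcess

# Initialize the dictionary outside of the Spider class
dc_dict = dict()


def previewCourses(dict_to_print):
    # Print a preview of the scraped courses
    for key, value in dict_to_print.items():
        print(key, value)


class YourSpider(scrapy.Spider):
    name = 'yourspider'

    def start_requests(self):
        # The URL is now a quoted string literal
        yield scrapy.Request(url='https://www.datacamp.com', callback=self.parse)

    def parse(self, response):
        # Class names below are assumptions based on the question's XPaths
        crs_titles = response.xpath('//h4[contains(@class,"block__title")]/text()').extract()
        crs_descrs = response.xpath('//p[contains(@class,"block__description")]/text()').extract()
        for crs_title, crs_descr in zip(crs_titles, crs_descrs):
            dc_dict[crs_title] = crs_descr


# Run the spider and then print the preview
process = CrawlerProcess()
process.crawl(YourSpider)
process.start()
previewCourses(dc_dict)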

Item is losing name while yielding in Python / Scrapy

In Scrapy 2.4.x on Python 3.8.x I am yielding an item in order to save some stats to a DB. The scraper has another item that gets yielded as well.
While the name of the item class ("StatsItem") is visible in the spider itself, it is lost inside the pipeline class. I am using the name of the item to decide which method to call:
in scraper.py:
import scrapy
from crawler.items import StatsItem, OtherItem


class demo(scrapy.Spider):
    def parse_item(self, response):
        stats = StatsItem()
        stats['results'] = 10
        yield stats
        print(type(stats).__name__)
        # Output: StatsItem
        print(stats)
        # Output: {'results': 10}
in pipeline.py
import scrapy
from crawler.items import StatsItem, OtherItem


class mysql_pipeline(object):
    def process_item(self, item, spider):
        print(type(item).__name__)
        # Output: NoneType
        if isinstance(item, StatsItem):
            self.save_stats(item, spider)
        elif isinstance(item, OtherItem):
            # call other method
            pass
        return item
The output of print in the spider is "StatsItem", while it is "NoneType" within the pipeline, so the method save_stats() never gets called.
I am pretty new to Python, so there might be a better way of doing this. There is no error message or exception that I am aware of. Any help is greatly appreciated.
You can't use yield outside of a function imo.
I was finally able to locate the problem. The particular crawler was nearly identical to all the other ones that did not have this issue, with one exception: I was setting the item pipeline via custom settings:
custom_settings.update({
    'ITEM_PIPELINES': {
        'crawler.pipelines.mysql_pipeline': 301,
    }
})
Removing this fixed the issue.
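For reference, when a per-spider pipeline really is needed, the documented pattern is to declare custom_settings as a class attribute on the spider, so it is already in place when the crawler is created; a minimal sketch reusing the names from the question:

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'

    # Evaluated at class-definition time, before the crawler reads the settings
    custom_settings = {
        'ITEM_PIPELINES': {
            'crawler.pipelines.mysql_pipeline': 301,
        },
    }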

Real-time Scraper | Complicated Issue

I have an up-and-running web scraper; the tickers are listed in a separate Excel document. I am using ScrapingHub's API because it is accessible anywhere, which is a big convenience. I want the code to update itself and scrape based on whatever is listed in the Excel sheet.
With my Excel list, how can I have my code update automatically (i.e. I add MSFT to my Excel sheet, and the code picks up MSFT as well)?
Additionally, is there any way to have it deploy automatically?
--==Spider Code==--
**tickers appended in each link (search criteria)
import scrapy
import collections
from collections import OrderedDict
from scrapy.spiders import XMLFeedSpider
from tickers.items import tickersItem


class Spider(XMLFeedSpider):
    name = "NewsScraper"
    allowed_domains = ["yahoo.com"]
    start_urls = (
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=ABIO,ACFN,AEMD,AEZS,AITB',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=BGMD,BIOA',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=CANF,CBIO,CCCR',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=DRIO,DRWI,DXTR,ENCR',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=GNMX,GNUS,GPL,HIPP,HSGX',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=MBOT,MBVX',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=NBY,NNVC,NTRP',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=PGRX,PLXP',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=SANW,SBOT,SCON,SCYX',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=UNXL,UQM,URRE',
    )
    itertag = 'item'

    def parse_node(self, response, node):
        item = collections.OrderedDict()
        item['Title'] = node.xpath('title/text()').extract_first()
        item['PublishDate'] = node.xpath('pubDate/text()').extract_first()
        item['Description'] = node.xpath('description/text()').extract_first()
        item['Link'] = node.xpath('link/text()').extract_first()
        return item
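One way to let the spreadsheet drive the spider (a sketch, not a drop-in answer): read the ticker list when the spider starts and build the feed URLs from it. The sketch assumes openpyxl is installed, the workbook is called tickers.xlsx, and the tickers sit one per row in the first column of the first sheet:

import scrapy
from openpyxl import load_workbook
from scrapy.spiders import XMLFeedSpider


def load_tickers(path='tickers.xlsx'):
    # One ticker per row, first column of the first sheet (assumption)
    ws = load_workbook(path, read_only=True).active
    return [row[0] for row in ws.iter_rows(values_only=True) if row[0]]


class Spider(XMLFeedSpider):
    name = "NewsScraper"
    allowed_domains = ["yahoo.com"]
    itertag = 'item'

    def start_requests(self):
        tickers = load_tickers()
        # Batch tickers so each feed URL stays short, mirroring the hand-written list above
        for i in range(0, len(tickers), 5):
            url = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s='
                   + ','.join(tickers[i:i + 5]))
            yield scrapy.Request(url)

    def parse_node(self, response, node):
        # Same idea as the original parse_node, trimmed for brevity
        yield {
            'Title': node.xpath('title/text()').extract_first(),
            'PublishDate': node.xpath('pubDate/text()').extract_first(),
            'Description': node.xpath('description/text()').extract_first(),
            'Link': node.xpath('link/text()').extract_first(),
        }

With this in place, adding MSFT to the spreadsheet is enough; the next run picks it up. For deployment, re-running shub deploy pushes the updated project to Scrapy Cloud, but note that a job running there cannot read a file on your local machine, so the spreadsheet has to be packaged with the project or fetched from a URL at start-up.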

How to store cumulated data after Scrapy has finished working with every URL?

I'm trying to store some data after Scrapy has finished working (i.e. after it has processed every URL I asked it to).
Every time Scrapy parses a result (through the parse function in the spider class), I append some information to an existing global object in the class itself. I would like to access that object at the end, and if possible do everything from a Python script. Here's my spider code:
from scrapy.spider import Spider
from scrapy.selector import Selector
from nltk.corpus import stopwords
from newsScrapy.items import NewsscrapyItem


class newsScrapySpider(Spider):
    name = "newsScrapy"
    start_urls = []

    global wordMatrix
    wordMatrix = {}

    global prefix
    prefix = "http://www.nytimes.com/indexes/"
    sufix = "/todayspaper/index.html"

    for year in range(2000, 2015):
        for month in range(1, 13):
            for day in range(1, 32):
                if month < 10 and day < 10:
                    start_urls.append(prefix + str(year) + "/" + "0" + str(month) + "/" + "0" + str(day))
                elif month < 10 and day > 9:
                    start_urls.append(prefix + str(year) + "/" + "0" + str(month) + "/" + str(day))
                elif month > 9 and day < 10:
                    start_urls.append(prefix + str(year) + "/" + str(month) + "/" + "0" + str(day))
                else:
                    start_urls.append(prefix + str(year) + "/" + str(month) + "/" + str(day))

    def parse(self, response):
        sel = Selector(response)
        items = []
        text = sel.xpath('//body//text()').re('(\w+)')
        item = NewsscrapyItem()
        item['body'] = text
        item['date'] = response.url.strip(prefix)
        items.append(item)
        for word in item['body']:
            word = word.strip(' ').strip(',').strip('\n')
            word = word.lower()
            if word not in stopwords.words('english'):
                if (word, item['date']) in wordMatrix:
                    wordMatrix[word, item['date']] += 1
                else:
                    wordMatrix[word, item['date']] = 1
        # print wordMatrix
        return items
The idea would be to access the wordMatrix variable after the end of the scraping (once all the data has been collected) and to do it from another Python script (for plotting, for example).
Thanks a lot!
Together with your existing imports:
try:
    import cPickle as pickle
except ImportError:
    import pickle
And then just before return items:
with open('/path/to/file/wordMatrix.data', 'wb') as f:
    pickle.dump(wordMatrix, f)
In another script you can load this data with:
try:
    import cPickle as pickle
except ImportError:
    import pickle

with open('/path/to/file/wordMatrix.data', 'rb') as f:
    wordMatrix = pickle.load(f)
Pickling is the process of serializing and deserializing any Python object. There are two implementations in the Python standard library: pickle is pure Python, and cPickle is written in C and therefore much faster. The unusual import code tries to import the faster one, but some implementations (for instance IronPython) lack cPickle, in which case the pure-Python module is imported instead. Both modules do exactly the same thing and share the same interface.
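Keep in mind that parse() runs once per response, so the dump above rewrites the file for every URL. A minimal sketch of writing the matrix only once, using the closed() hook that Scrapy calls when the spider finishes:

try:
    import cPickle as pickle
except ImportError:
    import pickle


class newsScrapySpider(Spider):
    # ... same attributes and parse() method as above ...

    def closed(self, reason):
        # Runs once at the end of the crawl; persist the accumulated matrix here
        with open('/path/to/file/wordMatrix.data', 'wb') as f:
            pickle.dump(wordMatrix, f)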

How can I use the fields_to_export attribute in BaseItemExporter to order my Scrapy CSV data?

I have made a simple Scrapy spider that I use from the command line to export my data into CSV format, but the order of the data seems random. How can I order the CSV fields in my output?
I use the following command line to get CSV data:
scrapy crawl somwehere -o items.csv -t csv
According to this Scrapy documentation, I should be able to use the fields_to_export attribute of the BaseItemExporter class to control the order. But I am clueless how to use this as I have not found any simple example to follow.
Please note: This question is very similar to THIS one. However, that question is over 2 years old, does not address the many recent changes to Scrapy, and does not provide a satisfactory answer, as it requires hacking one or both of:
contrib/exporter/__init__.py
contrib/feedexport.py
to address some previous issues that seem to have already been resolved...
Many thanks in advance.
To use such an exporter you need to create your own item pipeline that will process your spider output. Assuming you have a simple case and want all the spider output in one file, this is the pipeline you should use (pipelines.py):
from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter


class CSVPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = [...]  # list of field names to export; order is important
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Of course you need to remember to add this pipeline in your configuration file (settings.py):
ITEM_PIPELINES = {'myproject.pipelines.CSVPipeline': 300 }
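For example, borrowing the field names from the item class in the last answer below, the placeholder assignment inside spider_opened would become:

self.exporter.fields_to_export = ['A_UPC', 'ID', 'item', 'price', 'time']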
You can now specify settings in the spider itself.
https://doc.scrapy.org/en/latest/topics/settings.html#settings-per-spider
To set the field order for exported feeds, set FEED_EXPORT_FIELDS.
https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields
The spider below dumps all links on a website (written against Scrapy 1.4.0):
import scrapy
from scrapy.http import HtmlResponse


class DumplinksSpider(scrapy.Spider):
    name = 'dumplinks'
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/']
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["page", "page_ix", "text", "url"],
    }

    def parse(self, response):
        if not isinstance(response, HtmlResponse):
            return
        a_selectors = response.xpath('//a')
        for i, a_selector in enumerate(a_selectors):
            text = a_selector.xpath('normalize-space(text())').extract_first()
            url = a_selector.xpath('@href').extract_first()
            yield {
                'page_ix': i + 1,
                'page': response.url,
                'text': text,
                'url': url,
            }
            yield response.follow(url, callback=self.parse)  # see allowed_domains
Run with this command:
scrapy crawl dumplinks --loglevel=INFO -o links.csv
Fields in links.csv are ordered as specified by FEED_EXPORT_FIELDS.
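If you would rather not hard-code the order in each spider, the same setting can live in the project's settings.py and applies to every exported feed:

# settings.py -- field order for all exported feeds
FEED_EXPORT_FIELDS = ["page", "page_ix", "text", "url"]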
I found a pretty simple way to solve this issue. The answers above are still more correct, but this is a quick fix. It turns out Scrapy exports the fields in alphabetical order, and capitalization matters: a field beginning with 'A' will come first, then 'B', 'C', etc., followed by 'a', 'b', 'c'. I have a project going right now where the header names are not extremely important, but I did need the UPC to be the first header for input into another program. I have the following item class:
class ItemInfo(scrapy.Item):
    item = scrapy.Field()
    price = scrapy.Field()
    A_UPC = scrapy.Field()
    ID = scrapy.Field()
    time = scrapy.Field()
My CSV file outputs with the headers (in order): A_UPC, ID, item, price, time
