I have an up-and-running web scraper; the tickers are listed in a separate Excel document. I am using ScrapingHub's API because it is accessible from anywhere, which is a big convenience. I want the code to update itself and scrape whatever is listed on the Excel sheet.
With my Excel list, how can I have my code update automatically (i.e. I add MSFT to my Excel sheet and the spider picks up MSFT)?
Additionally, is there any way to have it automatically deploy?
--==Spider Code==--
** tickers are appended to each link as the search criteria
import scrapy
import collections
from collections import OrderedDict
from scrapy.spiders import XMLFeedSpider
from tickers.items import tickersItem

class Spider(XMLFeedSpider):
    name = "NewsScraper"
    allowed_domains = ["yahoo.com"]
    start_urls = (
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=ABIO,ACFN,AEMD,AEZS,AITB',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=BGMD,BIOA',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=CANF,CBIO,CCCR',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=DRIO,DRWI,DXTR,ENCR',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=GNMX,GNUS,GPL,HIPP,HSGX',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=MBOT,MBVX',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=NBY,NNVC,NTRP',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=PGRX,PLXP',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=SANW,SBOT,SCON,SCYX',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=UNXL,UQM,URRE',
    )
    itertag = 'item'

    def parse_node(self, response, node):
        # Each <item> node in the RSS feed becomes one ordered record
        item = collections.OrderedDict()
        item['Title'] = node.xpath('title/text()').extract_first()
        item['PublishDate'] = node.xpath('pubDate/text()').extract_first()
        item['Description'] = node.xpath('description/text()').extract_first()
        item['Link'] = node.xpath('link/text()').extract_first()
        return item
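One way to keep start_urls in sync with the spreadsheet is to read the tickers when the spider starts and build the feed URLs from them, so adding MSFT to the sheet is picked up on the next run without editing the code. A minimal sketch, assuming a workbook named tickers.xlsx with one ticker per row in the first column and the openpyxl package installed (both assumptions of mine, not part of the original setup); the chunk size of 5 mirrors the hand-built URLs above:
from openpyxl import load_workbook
from scrapy.spiders import XMLFeedSpider

def read_tickers(path='tickers.xlsx'):
    # Assumes one ticker per row in the first column of the first sheet.
    wb = load_workbook(path, read_only=True)
    ws = wb.active
    return [row[0] for row in ws.iter_rows(values_only=True) if row[0]]

def build_start_urls(tickers, chunk_size=5):
    # Yahoo's RSS endpoint accepts several comma-separated tickers per request.
    base = 'https://feeds.finance.yahoo.com/rss/2.0/headline?s={}'
    return [base.format(','.join(tickers[i:i + chunk_size]))
            for i in range(0, len(tickers), chunk_size)]

class Spider(XMLFeedSpider):
    name = "NewsScraper"
    allowed_domains = ["yahoo.com"]
    start_urls = build_start_urls(read_tickers())
    itertag = 'item'
    # parse_node stays exactly as above
As for automatic deployment, the simplest route is usually to ship the spreadsheet with the project and redeploy (for example with shub deploy) whenever it changes, or to host the sheet somewhere the spider can download it at start-up; since the deployed project is a static package, ScrapingHub will not pick up spreadsheet changes on its own.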
Related
The XML feed I'm scraping has around a thousand items. I'm wondering if there is a way to split the load or some other method to significantly reduce the run time. It currently takes two minutes to iterate over all the XML within the link below. Any suggestions or advice are greatly appreciated.
Example: https://www.cityblueshop.com/sitemap_products_1.xml
from scrapy.spiders import XMLFeedSpider
from learning.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'testing'
    allowed_domains = ['www.cityblueshop.com']
    start_urls = ['https://www.cityblueshop.com/sitemap_products_1.xml']
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    iterator = 'xml'

    def parse_node(self, response, node):
        item = TestItem()
        item['url'] = node.xpath('.//n:loc/text()').extract()
        return item
Two-minute run time for all items. Is there any way to make it quicker using Scrapy?
I tested the following spider locally:
from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):
    name = 'testing'
    allowed_domains = ['www.cityblueshop.com']
    start_urls = ['https://www.cityblueshop.com/sitemap_products_1.xml']
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    iterator = 'xml'

    def parse_node(self, response, node):
        yield {'url': node.xpath('.//n:loc/text()').get()}
It takes less than 3 seconds to run, including Scrapy core startup and everything.
Please ensure that the time is not spent somewhere else, e.g. in the learning module from which you import your item subclass.
Try increasing CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP; see https://doc.scrapy.org/en/latest/topics/settings.html#concurrent-requests-per-domain
But remember that, besides higher speed, this can also lead to a lower success rate: many 429 responses, bans, and so on.
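For reference, a minimal sketch of what raising those limits might look like in settings.py (the numbers are arbitrary illustrations, not recommendations):
# settings.py - higher concurrency, illustrative values only
CONCURRENT_REQUESTS = 64             # total concurrent requests (Scrapy default: 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 32  # per-domain cap (default: 8)
CONCURRENT_REQUESTS_PER_IP = 0       # 0 leaves the per-IP limit disabled (the default)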
What am I doing wrong with this script that it's not outputting a CSV file with the data? I am running it with scrapy runspider yellowpages.py -o items.csv, but nothing comes out except a blank CSV file. I have followed different answers here and also watched YouTube tutorials trying to figure out where the mistake is, and I still cannot see what I am doing wrong.
# -*- coding: utf-8 -*-
import scrapy
import requests

search = "Plumbers"
location = "Hammond, LA"
url = "https://www.yellowpages.com/search"
q = {'search_terms': search, 'geo_location_terms': location}
page = requests.get(url, params=q)
page = page.url
items = ()

class YellowpagesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['yellowpages.com']
    start_urls = [page]

    def parse(self, response):
        self.log("I just visited: " + response.url)
        items = response.css('a[class=business-name]::attr(href)')
        for item in items:
            print(item)
A simple spider, without a project.
Use my code; I wrote comments to make it easier to understand. This spider looks for all result blocks on all pages for a given pair of parameters, "service" and "location". To run it in your case, use:
scrapy runspider yellowpages.py -a servise="Plumbers" -a location="Hammond, LA" -o Hammondsplumbers.csv
The code will also work with any other query. For example:
scrapy runspider yellowpages.py -a servise="Doctors" -a location="California, MD" -o MDDoctors.json
etc...
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.exceptions import CloseSpider

class YellowpagesSpider(scrapy.Spider):
    name = 'yellowpages'
    allowed_domains = ['yellowpages.com']
    start_urls = ['https://www.yellowpages.com/']

    # We can use any pair of "servise" + "location" in our request
    def __init__(self, servise=None, location=None):
        self.servise = servise
        self.location = location

    def parse(self, response):
        # If "servise" and "location" are defined
        if self.servise and self.location:
            # Build the search URL from "servise" and "location"
            search_url = 'search?search_terms={}&geo_location_terms={}'.format(self.servise, self.location)
            # Send a request to "yellowpages.com" + "search_url", then call parse_result
            yield Request(url=response.urljoin(search_url), callback=self.parse_result)
        else:
            # Otherwise close the spider
            # You can add a default value if you want.
            self.logger.warning('=== Please use keys -a servise="service_name" -a location="location" ===')
            raise CloseSpider()

    def parse_result(self, response):
        # All result blocks, excluding ad posts
        posts = response.xpath('//div[@class="search-results organic"]//div[@class="v-card"]')
        for post in posts:
            yield {
                'title': post.xpath('.//span[@itemprop="name"]/text()').extract_first(),
                'url': response.urljoin(post.xpath('.//a[@class="business-name"]/@href').extract_first()),
            }
        next_page = response.xpath('//a[@class="next ajax-page"]/@href').extract_first()
        # If there is a next page
        if next_page:
            # Send a request to "yellowpages.com" + "next_page", then call parse_result again
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse_result)
Here:
for item in items:
    print(item)
put yield instead of print:
for item in items:
    yield item
On inspection of your code, I notice a number of problems:
First, you initialize items to a tuple, when it should be a list: items = [].
You should change your name property to the name you actually want for your crawler, so you can run it with scrapy crawl my_crawler where name = "my_crawler".
start_urls is supposed to contain strings, not Request objects. You should change the entry from page to the exact search URL you want to use. If you have a number of search URLs and want to iterate over them, I would suggest using a middleware.
When you try to extract the data from CSS, you're forgetting to call getall() (or extract()), which actually turns your selectors into string data you can use.
Also, you shouldn't print to the standard output stream, because a lot of logging goes there and it will make your output really messy. Instead, you should extract the responses into items, for example using item loaders.
Finally, you're probably missing the appropriate settings in your settings.py file. You can find the relevant documentation in the Scrapy feed exports docs.
FEED_FORMAT = "csv"
FEED_EXPORT_FIELDS = ["Field 1", "Field 2", "Field 3"]
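Putting those points together, a rough sketch of how the corrected spider could look (the CSS selector is the one from the question; the pre-encoded search URL and the 'url' field name are my own choices):
import scrapy

class YellowpagesSpider(scrapy.Spider):
    # name is what you pass to "scrapy crawl"
    name = 'my_crawler'
    allowed_domains = ['yellowpages.com']

    # start_urls holds plain URL strings
    start_urls = [
        'https://www.yellowpages.com/search'
        '?search_terms=Plumbers&geo_location_terms=Hammond%2C+LA'
    ]

    def parse(self, response):
        # getall() (extract() on older Scrapy) turns the selectors into strings
        for href in response.css('a[class=business-name]::attr(href)').getall():
            # yield the data instead of printing it so the CSV exporter receives it
            yield {'url': response.urljoin(href)}
Running scrapy crawl my_crawler -o items.csv (or scrapy runspider with -o) should then write populated rows instead of an empty file.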
I am looking at the Architecture Overview page in the Scrapy documentation, but I still have a few questions regarding data and/or control flow.
Scrapy Architecture
Default File Structure of Scrapy Projects
scrapy.cfg
myproject/
__init__.py
items.py
middlewares.py
pipelines.py
settings.py
spiders/
__init__.py
spider1.py
spider2.py
...
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy

class MyprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
which, I'm assuming, becomes
import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
so that an error is raised when trying to populate an undeclared field of a Product instance:
>>> product = Product(name='Desktop PC', price=1000)
>>> product['lala'] = 'test'
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Question #1
Where, when, and how does our crawler become aware of items.py if we have created class CrowdfundingItem in items.py?
Is this done in...
__init__.py?
my_crawler.py?
def __init__() of mycrawler.py?
settings.py?
pipelines.py?
def __init__(self, dbpool) of pipelines.py?
somewhere else?
Question #2
Once I have declared an item such as Product, how do I then store the data by creating instances of Product in a context similar to the one below?
import scrapy
from scrapy.spiders import CrawlSpider
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

class MycrawlerSpider(CrawlSpider):
    name = 'mycrawler'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/']

    def parse(self, response):
        options = Options()
        options.add_argument('-headless')
        browser = webdriver.Firefox(firefox_options=options)
        browser.get(self.start_urls[0])
        elements = browser.find_elements_by_xpath('//section')
        count = 0
        for ele in elements:
            name = browser.find_element_by_xpath('./div[@id="name"]').text
            price = browser.find_element_by_xpath('./div[@id="price"]').text
            # If I am not sure how many items there will be,
            # and hence I cannot declare them explicitly,
            # how would I go about creating named instances of Product?
            # Obviously the code below will not work, but how can you accomplish this?
            count += 1
            varName + count = Product(name=name, price=price)
            ...
Lastly, say we forego naming the Product instances altogether, and instead simply create unnamed instances.
for ele in elements:
    name = browser.find_element_by_xpath('./div[@id="name"]').text
    price = browser.find_element_by_xpath('./div[@id="price"]').text
    Product(name=name, price=price)
If such instances are indeed stored somewhere, where are they stored? By creating instances this way, would it be impossible to access them?
Using Items is optional; they're just a convenient way to declare your data model and apply validation. You can also use a plain dict instead.
If you do choose to use an Item, you will need to import it for use in the spider; it is not discovered automatically. In your case:
from myproject.items import CrowdfundingItem
As the spider runs the parse method on each page, you can load the extracted data into your Item or dict. Once it's loaded, yield it, which passes it back to the Scrapy engine for processing downstream, in pipelines or exporters. This is how Scrapy enables "storage" of the data you scrape.
For example:
yield Product(name='Desktop PC', price=1000) # uses Item
yield {'name':'Desktop PC', 'price':1000} # plain dict
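A slightly fuller sketch of that flow (the URL and selectors below are placeholders I made up, not taken from the question):
import scrapy
from myproject.items import Product  # the Item declared in items.py

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']  # placeholder URL

    def parse(self, response):
        # One Product per listing block; the CSS selectors are illustrative only.
        for listing in response.css('div.listing'):
            yield Product(
                name=listing.css('.name::text').get(),
                price=listing.css('.price::text').get(),
            )
Every yielded instance is handed back to the engine and flows through whatever pipelines and feed exporters are enabled; the spider keeps no named references to them. That also answers the last part of the question: instances that are created but never yielded are not stored anywhere by Scrapy and are simply discarded.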
I'm trying to store some data after Scrapy has finished working (i.e. after it has processed every URL I asked it to).
Every time Scrapy parses a result (through the parse function in the spider class), I append some information to an existing global object in the class itself. I would like to access that object at the end, and if possible do everything from a Python script. Here's my spider code:
from scrapy.spider import Spider
from scrapy.selector import Selector
from nltk.corpus import stopwords
from newsScrapy.items import NewsscrapyItem

class newsScrapySpider(Spider):
    name = "newsScrapy"
    start_urls = []

    global wordMatrix
    wordMatrix = {}

    global prefix
    prefix = "http://www.nytimes.com/indexes/"
    sufix = "/todayspaper/index.html"
    for year in range(2000, 2015):
        for month in range(1, 13):
            for day in range(1, 32):
                if month < 10 and day < 10:
                    start_urls.append(prefix + str(year) + "/" + "0" + str(month) + "/" + "0" + str(day))
                elif month < 10 and day > 9:
                    start_urls.append(prefix + str(year) + "/" + "0" + str(month) + "/" + str(day))
                elif month > 9 and day < 10:
                    start_urls.append(prefix + str(year) + "/" + str(month) + "/" + "0" + str(day))
                else:
                    start_urls.append(prefix + str(year) + "/" + str(month) + "/" + str(day))

    def parse(self, response):
        sel = Selector(response)
        items = []
        text = sel.xpath('//body//text()').re(r'(\w+)')
        item = NewsscrapyItem()
        item['body'] = text
        item['date'] = response.url.strip(prefix)
        items.append(item)
        for word in item['body']:
            word = word.strip(' ').strip(',').strip('\n')
            word = word.lower()
            if word not in stopwords.words('english'):
                if (word, item['date']) in wordMatrix:
                    wordMatrix[word, item['date']] += 1
                else:
                    wordMatrix[word, item['date']] = 1
        # print wordMatrix
        return items
The idea is to access the wordMatrix variable after the scraping ends (once all the data has been collected), and to do it from another Python script (for plotting, for example).
Thanks a lot!
Together with your existing imports:
try:
    import cPickle as pickle
except ImportError:
    import pickle
And then just before return items:
with open('/path/to/file/wordMatrix.data', 'wb') as f:
    pickle.dump(wordMatrix, f)
In another script you can load this data with:
try:
    import cPickle as pickle
except ImportError:
    import pickle

with open('/path/to/file/wordMatrix.data', 'rb') as f:
    wordMatrix = pickle.load(f)
Pickling is the process of serializing and deserializing arbitrary Python objects. There are two implementations in the Python standard library: pickle is pure Python, while cPickle is written in C and thus much faster. The unusual import code tries the faster one first, but some interpreters (for instance IronPython) lack cPickle, in which case the pure-Python module is imported instead. Both modules do exactly the same thing and share the same interface.
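If you would rather write the file only once, after the whole crawl, one variation (a sketch of mine, not part of the original answer, reusing the pickle import and the global wordMatrix from the code above) is to do the dump in the spider's closed() method, which Scrapy calls when the spider finishes:
class newsScrapySpider(Spider):
    # ... same attributes and parse() as above ...

    def closed(self, reason):
        # Called once by Scrapy when the spider finishes; write the
        # accumulated matrix a single time instead of on every parse call.
        with open('/path/to/file/wordMatrix.data', 'wb') as f:
            pickle.dump(wordMatrix, f)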
I have made a simple Scrapy spider that I use from the command line to export my data into CSV format, but the order of the fields seems random. How can I order the CSV fields in my output?
I use the following command line to get CSV data:
scrapy crawl somwehere -o items.csv -t csv
According to this Scrapy documentation, I should be able to use the fields_to_export attribute of the BaseItemExporter class to control the order. But I am clueless about how to use it, as I have not found any simple example to follow.
Please note: this question is very similar to THIS one. However, that question is over two years old, doesn't address the many recent changes to Scrapy, and doesn't provide a satisfactory answer, as it requires hacking one or both of:
contrib/exporter/__init__.py
contrib/feedexport.py
to address some previous issues that seem to have already been resolved...
Many thanks in advance.
To use such an exporter, you need to create your own item pipeline that processes your spider's output. Assuming you have the simple case and want all spider output in one file, this is the pipeline you should use (pipelines.py):
from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter

class CSVPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        # list with the names of the fields to export - the order is important
        self.exporter.fields_to_export = [...]
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Of course you need to remember to add this pipeline in your configuration file (settings.py):
ITEM_PIPELINES = {'myproject.pipelines.CSVPipeline': 300 }
You can now specify settings in the spider itself.
https://doc.scrapy.org/en/latest/topics/settings.html#settings-per-spider
To set the field order for exported feeds, set FEED_EXPORT_FIELDS.
https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields
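For example, a minimal per-spider sketch (the spider name, URL, and field names are placeholders of mine):
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'                      # placeholder name
    start_urls = ['https://example.com/']  # placeholder URL

    # Per-spider settings; FEED_EXPORT_FIELDS fixes the CSV column order.
    custom_settings = {
        'FEED_EXPORT_FIELDS': ['field_1', 'field_2', 'field_3'],
    }

    def parse(self, response):
        # Yield dicts (or Items) whose keys match the field list above.
        yield {'field_1': '...', 'field_2': '...', 'field_3': '...'}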
The spider below dumps all links on a website (written against Scrapy 1.4.0):
import scrapy
from scrapy.http import HtmlResponse

class DumplinksSpider(scrapy.Spider):
    name = 'dumplinks'
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/']
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["page", "page_ix", "text", "url"],
    }

    def parse(self, response):
        if not isinstance(response, HtmlResponse):
            return
        a_selectors = response.xpath('//a')
        for i, a_selector in enumerate(a_selectors):
            text = a_selector.xpath('normalize-space(text())').extract_first()
            url = a_selector.xpath('@href').extract_first()
            yield {
                'page_ix': i + 1,
                'page': response.url,
                'text': text,
                'url': url,
            }
            yield response.follow(url, callback=self.parse)  # see allowed_domains
Run with this command:
scrapy crawl dumplinks --loglevel=INFO -o links.csv
Fields in links.csv are ordered as specified by FEED_EXPORT_FIELDS.
I found a pretty simple way to solve this issue. The answers above are still more correct, I'd say, but this is a quick fix. It turns out Scrapy exports the fields in alphabetical order, and capitalization matters: a field beginning with 'A' is exported first, then 'B', 'C', and so on, followed by 'a', 'b', 'c'. In my current project the header names are not terribly important, but I did need the UPC to be the first column for input into another program. I have the following item class:
class ItemInfo(scrapy.Item):
    item = scrapy.Field()
    price = scrapy.Field()
    A_UPC = scrapy.Field()
    ID = scrapy.Field()
    time = scrapy.Field()
My CSV file outputs with the headers (in order): A_UPC, ID, item, price, time