Python, Passing data in Scrapy

How do I actually pass data into parse for my spider, let's say the variables name or temp?
class CSpider(scrapy.Spider):
    name = "s1"
    allowed_domains = ["abc.com"]
    temp = ""

    start_urls = [
        url.strip() for url in lists
    ]

    def parse(self, response):
        # How do I pass data in here, e.g. name or temp?
        pass

If you are defining the temp variable as a class-level variable, you can access it via self.temp.
If this is something you want to pass in from the command line, see the following topics (a short sketch follows them):
How to give URL to scrapy for crawling?
Scrapy : How to pass list of arguments through command prompt to spider?
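For completeness, here is a minimal sketch of the command-line route (the argument name temp is taken from the question; Scrapy passes every -a name=value pair to the spider's constructor as a keyword argument):

import scrapy

class CSpider(scrapy.Spider):
    name = "s1"
    allowed_domains = ["abc.com"]

    def __init__(self, temp="", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # filled in from the command line: scrapy crawl s1 -a temp=something
        self.temp = temp

    def parse(self, response):
        self.logger.info("temp is %s", self.temp)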

As alecxe answered, you can use attributes (class-level variables) to make variables or constants accessible anywhere in your class, or you can add a parameter to your parse method (a function of the class) if you want to be able to pass in values that come from outside the class.
I'll try to give an example of your code with both solutions.
Using an attribute:
class CSpider(scrapy.Spider):
    name = "s1"
    allowed_domains = ["abc.com"]
    temp = ""
    # Here is our attribute (a class-level variable)
    number_of_days_in_a_week = 7

    start_urls = [
        url.strip() for url in lists
    ]

    def parse(self, response):
        # It is now used in the method via self
        print(f"In a week, there are {self.number_of_days_in_a_week} days.")
If you need to, here is how to pass it as another argument:
class CSpider(scrapy.Spider):
    name = "s1"
    allowed_domains = ["abc.com"]
    temp = ""

    start_urls = [
        url.strip() for url in lists
    ]

    def parse(self, what_you_want_to_pass_in):
        print(f"In a week, there are {what_you_want_to_pass_in} days.")

# We create an instance of the spider
spider1 = CSpider()
# Then we use its method with an argument
spider1.parse(7)
Note that in the second example I removed the response argument from your parse method because it made it easier to show how the arguments are passed. Still, if you consider the entire Scrapy framework, you can certainly feed external values in using this approach.
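Within Scrapy itself, the usual way to hand extra data to parse is to attach it to the request rather than calling parse by hand. A minimal sketch, assuming Scrapy 1.7+ for cb_kwargs and the same lists variable as in the question:

import scrapy

class CSpider(scrapy.Spider):
    name = "s1"
    allowed_domains = ["abc.com"]

    def start_requests(self):
        for url in lists:
            # every key in cb_kwargs becomes an extra keyword argument of the callback
            yield scrapy.Request(url.strip(), callback=self.parse, cb_kwargs={"temp": "some value"})

    def parse(self, response, temp):
        self.logger.info("temp for %s is %s", response.url, temp)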

Related

Item serializers don't work. Function never gets called

I'm trying to use the serializer attribute in an Item, just like the example in the documentation:
https://docs.scrapy.org/en/latest/topics/exporters.html#declaring-a-serializer-in-the-field
The spider works without any errors, but the serialization doesn't happen, and the print in the function never prints either. It's as if the function remove_pound is never called.
import scrapy

def remove_pound(value):
    print('Am I a joke to you?')
    return value.replace('£', '')

class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field(serializer=remove_pound)

class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        books = response.xpath('//ol/li')
        for i in books:
            yield BookItem(
                title=i.xpath('article/h3/a/text()').get(),
                price=i.xpath('article/div/p[@class="price_color"]/text()').get(),
            )
Am I using it wrong?
PS: I know there are other ways to do it; I just want to learn to use it this way.
The only reason it doesn't work is that your XPath expression is not right. You need to use a relative XPath:
price=i.xpath('./article/div/p[@class="price_color"]/text()').get()
Update: it's not the XPath. The serialization works only for item exporters:
you can customize how each field value is serialized before it is passed to the serialization library.
So if you run the command scrapy crawl bookspider -o BookSpider.csv, you'll get correct (serialized) output.
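To see the serializer fire without the command line, you could feed one item to a CsvItemExporter by hand. This is only a sketch to show that the serializer belongs to the export step; the title and price are sample values from books.toscrape.com:

from scrapy.exporters import CsvItemExporter

with open('books.csv', 'wb') as f:
    exporter = CsvItemExporter(f)
    exporter.start_exporting()
    # remove_pound is called here, not when the item is yielded in parse()
    exporter.export_item(BookItem(title='A Light in the Attic', price='£51.77'))
    exporter.finish_exporting()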

route results from yield to a file

I have the following Python script using Scrapy:
import scrapy

class ChemSpider(scrapy.Spider):
    name = "site"

    def start_requests(self):
        urls = [
            'https://www.site.com.au'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        category_links = response.css('li').xpath('a/@href').getall()
        category_links_filtered = [x for x in category_links if 'shop-online' in x]  # remove non-category links
        category_links_filtered = list(dict.fromkeys(category_links_filtered))  # remove duplicates
        for category_link in category_links_filtered:
            if "medicines" in category_link:
                next_page = response.urljoin(category_link) + '?size=10'
                self.log(next_page)
                yield scrapy.Request(next_page, callback=self.parse_subcategories)

    def parse_subcategories(self, response):
        for product in response.css('div.Product'):
            yield {
                'category_link': response.url,
                'product_name': product.css('img::attr(alt)').get(),
                'product_price': product.css('span.Price::text').get().replace('\n', '')
            }
My solution will run multiple instances of this script, each scraping a different subset of information from different 'categories'. I know you can run Scrapy from the command line and output to a JSON file, but I want to write the output to a file from within the function, so that each instance writes to a different file. Being a beginner with Python, I'm not sure where to go with my script. I need to get the output of the yield into a file while the script is executing. How do I achieve this? There will be hundreds of rows scraped, and I'm not familiar enough with how yield works to understand how to 'return' from it a set of data (or a list) that can then be written to the file.
You are looking to append to a file. But since file writing is an I/O operation, you need to lock the file against being written to by other processes while one process is writing.
The easiest way to achieve this is to write to separate files with random names in a directory and concatenate them all afterwards with another process.
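A sketch of that concatenation step, assuming each instance wrote a .jl (JSON lines) file into a directory named parts/ (both names are just placeholders):

import glob

with open('combined.jl', 'w') as merged:
    for path in sorted(glob.glob('parts/*.jl')):
        with open(path) as part:
            merged.write(part.read())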
First, let me suggest some changes to your code. If you want to remove duplicates, you could use a set like this:
category_links_filtered = (x for x in category_links if 'shop-online' in x) # remove non category links
category_links_filtered = set(category_links_filtered) # remove duplicates
Note that I'm also changing the [ to ( to make a generator instead of a list and save some memory. You can read more about generators here: https://www.python-course.eu/python3_generators.php
OK, then the solution to your problem is an Item Pipeline (https://docs.scrapy.org/en/latest/topics/item-pipeline.html), which performs some action on every item yielded from your parse_subcategories function. You add a class to your pipelines.py file and enable that pipeline in settings.py. That is:
In settings.py:
ITEM_PIPELINES = {
    'YOURBOTNAME.pipelines.CategoriesPipeline': 300,  # the number here is the pipeline's priority; don't worry and just leave it
}
In pipelines.py:
import json
from urllib.parse import urlparse  # library to parse urls (on Python 2: from urlparse import urlparse)

class CategoriesPipeline(object):
    # This class dynamically saves the data depending on the category name
    # obtained from the url or from an attribute

    def open_spider(self, spider):
        if hasattr(spider, 'filename'):
            # the filename is an attribute set by -a filename=somefilename
            filename = spider.filename
        else:
            # you could also set the name dynamically from the start url,
            # if you set -a start_url=https://www.site.com.au/category-name
            try:
                filename = urlparse(spider.start_url).path[1:]  # this returns 'category-name'
            except AttributeError:
                spider.crawler.engine.close_spider(spider, reason='no start url')  # this should not happen
        self.file = open(filename + '.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
In spiders/YOURBOTNAME.py modify this:
class ChemSpider(scrapy.Spider):
    name = "site"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if not hasattr(self, 'start_url'):
            raise ValueError('no start url')  # we need a start url passed as -a start_url=...
        # see why this works on https://docs.scrapy.org/en/latest/intro/tutorial.html#a-shortcut-for-creating-requests
        self.start_urls = [self.start_url]

    def parse(self, response):
        ...
Then you start your crawl with this command: scrapy crawl site -a start_url=https://www.site.com.au/category-name, and you can optionally add -a filename=somename.
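For the multiple-instances scenario in the question, each run can then point at its own category and output file, for example (the category URLs here are placeholders):

scrapy crawl site -a start_url=https://www.site.com.au/shop-online/medicines -a filename=medicines
scrapy crawl site -a start_url=https://www.site.com.au/shop-online/vitamins -a filename=vitamins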

(Python, Scrapy) Taking data from txt file into Scrapy spider

I am new to Python and Scrapy. I have a project. In the spider there is code like this:
class MySpider(BaseSpider):
    name = "project"
    allowed_domains = ["domain.com"]
    start_urls = ["https://domain.com/%d" % i for i in range(12308128, 12308148)]
I want to take the range numbers 12308128 and 12308148 from a txt file (or csv file).
Let's say it's numbers.txt, containing two lines:
12308128
12308148
How can I import these numbers into my spider? Another process will change these numbers in the txt file periodically, and my spider should pick up the new numbers and run.
Thank you.
You can override the start_urls logic in the spider's start_requests() method:
class Myspider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # read file data
        with open('filename', 'r') as f:
            start, end = f.read().split('\n', 1)
        # make range and urls with your numbers
        range_ = (int(start.strip()), int(end.strip()))
        start_urls = ["https://domain.com/%d" % i for i in range(*range_)]
        for url in start_urls:
            yield scrapy.Request(url)
This spider will open the file, read the numbers, create the starting urls, iterate through them and schedule a request for each one of them.
Default start_requests() method looks something like:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url)
So you can see what we're doing here by overriding it.
You can pass any parameter to the spider's constructor through the command line using the -a option of the scrapy crawl command, for example:
scrapy crawl spider -a inputfile=filename.txt
then use it like this:
import scrapy
from scrapy.exceptions import CloseSpider

class MySpider(scrapy.Spider):
    name = 'spider'

    def __init__(self, *args, **kwargs):
        self.infile = kwargs.pop('inputfile', None)
        super().__init__(*args, **kwargs)

    def start_requests(self):
        if self.infile is None:
            raise CloseSpider('No filename')
        # process the file here; its name is in self.infile
Or you can just pass the start and end values in a similar way, like this:
scrapy crawl spider -a start=10000 -a end=20000
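A sketch of that variant (the spider name and URL pattern are taken from the question; start and end arrive as strings, so they need converting):

import scrapy
from scrapy.exceptions import CloseSpider

class MySpider(scrapy.Spider):
    name = 'spider'

    def __init__(self, start=None, end=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start = start
        self.end = end

    def start_requests(self):
        if self.start is None or self.end is None:
            raise CloseSpider('start and end are required')
        for i in range(int(self.start), int(self.end)):
            yield scrapy.Request("https://domain.com/%d" % i)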
I believe you need to read the file and pass the values into your url string:
datacont = open('numbers.txt')
Start_Range = datacont.readline().strip()
End_Range = datacont.readline().strip()
print(Start_Range)
print(End_Range)

Scrapy - NameError: global name 'base_search_url' is not defined

I am trying to use a variable defined in my Scrapy spider class, but I get NameError: global name 'base_search_url' is not defined.
class MySpider(scrapy.Spider):
    name = "mine"
    allowed_domains = ["www.example.com"]
    base_url = "https://www.example.com"
    start_date = "2011-01-01"
    today = datetime.date.today().strftime("%Y-%m-%d")
    base_search_url = 'https://www.example.com/?city={}&startDate={}&endDate={}&page=1'
    city_codes = ['on', 'bc', 'ab']
    start_urls = (base_search_url.format(city_code, start_date, today) for city_code in city_codes)
I tried to use self.base_search_url instead, but it was no use. Does anyone know how to solve it?
FYI, I use Python 2.7.
Solved! I ended up solving it by using the __init__() method:
def __init__(self):
    self.start_urls = (self.base_search_url.format(city_code, self.start_date, self.today) for city_code in self.city_codes)
From the docs:
start_urls: a list of URLs where the Spider will begin to crawl from. The first pages downloaded will be those listed here. The subsequent URLs will be generated successively from data contained in the start URLs.
start_urls is a list.
Solve it by setting it in the __init__ method:
def __init__(self):
    self.start_urls = []
    for city_code in self.city_codes:
        self.start_urls.append(self.base_search_url.format(city_code, self.start_date, self.today))
Or in the class declaration (as you show in your question):
start_urls = []
for city_code in city_codes:
    start_urls.append(base_search_url.format(city_code, start_date, today))
Note
Make sure you add correct urls, starting with http:// or https://.
Python has only four scopes (LEGB: Local, Enclosing, Global, Built-in). The class body's local scope and the generator expression's local scope are not nested functions, so the class body does not act as an Enclosing scope for the expression; they are two separate local scopes that cannot access each other's names. A short sketch follows the list of solutions below.
3 solutions:
1. global base_search_url
2. def __init__(self) ...
3. start_urls = ('https://www.example.com/?city={}&startDate={}&endDate={}&page=1'.format ... )
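A small standalone sketch of that scoping rule (nothing Scrapy-specific; the names are made up for illustration):

class Demo(object):
    base = "https://www.example.com/?city={}&page=1"
    cities = ['on', 'bc', 'ab']

    # NameError: the generator expression runs in its own local scope, and the
    # class body is not an enclosing scope, so `base` is not visible inside it
    # urls = (base.format(c) for c in cities)

    def __init__(self):
        # works: inside a method the class attributes are reachable via self
        self.urls = [self.base.format(c) for c in self.cities]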

How can I use the fields_to_export attribute in BaseItemExporter to order my Scrapy CSV data?

I have made a simple Scrapy spider that I use from the command line to export my data into the CSV format, but the order of the data seems random. How can I order the CSV fields in my output?
I use the following command line to get CSV data:
scrapy crawl somewhere -o items.csv -t csv
According to this Scrapy documentation, I should be able to use the fields_to_export attribute of the BaseItemExporter class to control the order. But I am clueless how to use this as I have not found any simple example to follow.
Please note: this question is very similar to THIS one. However, that question is over 2 years old, doesn't address the many recent changes to Scrapy, and doesn't provide a satisfactory answer, as it requires hacking one or both of:
contrib/exporter/__init__.py
contrib/feedexport.py
to address some previous issues, that seem to have already been resolved...
Many thanks in advance.
To use such an exporter you need to create your own item pipeline that will process your spider output. Assuming that you have a simple case and want all the spider output in one file, this is the pipeline you should use (pipelines.py):
from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter

class CSVPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = [...]  # list of field names to export - order is important
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Of course you need to remember to add this pipeline in your configuration file (settings.py):
ITEM_PIPELINES = {'myproject.pipelines.CSVPipeline': 300 }
You can now specify settings in the spider itself.
https://doc.scrapy.org/en/latest/topics/settings.html#settings-per-spider
To set the field order for exported feeds, set FEED_EXPORT_FIELDS.
https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields
The spider below dumps all links on a website (written against Scrapy 1.4.0):
import scrapy
from scrapy.http import HtmlResponse

class DumplinksSpider(scrapy.Spider):
    name = 'dumplinks'
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/']
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["page", "page_ix", "text", "url"],
    }

    def parse(self, response):
        if not isinstance(response, HtmlResponse):
            return
        a_selectors = response.xpath('//a')
        for i, a_selector in enumerate(a_selectors):
            text = a_selector.xpath('normalize-space(text())').extract_first()
            url = a_selector.xpath('@href').extract_first()
            yield {
                'page_ix': i + 1,
                'page': response.url,
                'text': text,
                'url': url,
            }
            yield response.follow(url, callback=self.parse)  # see allowed_domains
Run with this command:
scrapy crawl dumplinks --loglevel=INFO -o links.csv
Fields in links.csv are ordered as specified by FEED_EXPORT_FIELDS.
I found a pretty simple way to solve this issue. The answers above are still more correct, I would say, but this is a quick fix. It turns out Scrapy exports the fields in alphabetical order, and capitalization matters: a field beginning with 'A' will be output first, then 'B', 'C', etc., followed by 'a', 'b', 'c'. I have a project going right now where the header names are not extremely important, but I did need the UPC to be the first header for input into another program. I have the following item class:
class ItemInfo(scrapy.Item):
    item = scrapy.Field()
    price = scrapy.Field()
    A_UPC = scrapy.Field()
    ID = scrapy.Field()
    time = scrapy.Field()
My CSV file outputs with the headers (in order): A_UPC, ID, item, price, time
