I am trying to reference a class-level variable from inside a Scrapy spider class, but I get NameError: global name 'base_search_url' is not defined.
class MySpider(scrapy.Spider):
    name = "mine"
    allowed_domains = ["www.example.com"]
    base_url = "https://www.example.com"
    start_date = "2011-01-01"
    today = datetime.date.today().strftime("%Y-%m-%d")
    base_search_url = 'https://www.example.com/?city={}&startDate={}&endDate={}&page=1'
    city_codes = ['on', 'bc', 'ab']
    start_urls = (base_search_url.format(city_code, start_date, today) for city_code in city_codes)
I tried using self.base_search_url instead, but it made no difference. Does anyone know how to solve this?
FYI, I'm using Python 2.7.
Solved! I ended up solving it by building start_urls in the __init__() method:
def __init__(self):
    self.start_urls = (self.base_search_url.format(city_code, self.start_date, self.today)
                       for city_code in self.city_codes)
From the docs:
start_urls: a list of URLs where the Spider will begin to crawl from.
The first pages downloaded will be those listed here. The subsequent
URLs will be generated successively from data contained in the start
URLs.
start_urls is a list. Solve it by setting it in the __init__ method:
def __init__(self):
    self.start_urls = []
    self.start_urls.extend(
        self.base_search_url.format(city_code, self.start_date, self.today)
        for city_code in self.city_codes
    )
Or in the class declaration (as you show in your question), using a list comprehension, which in Python 2 can still see the other class-level names:
start_urls = [
    base_search_url.format(city_code, start_date, today)
    for city_code in city_codes
]
Note: make sure you add correct URLs starting with http:// or https://.
Python has only four scopes, LEGB (Local, Enclosing, Global, Built-in). The local scope of the class body and the local scope of the generator expression are not nested functions, so neither forms an enclosing scope for the other. They are two separate local scopes that cannot access each other's names.
3 solutions:
1. Make base_search_url a module-level global variable (global lookup will then find it inside the generator expression).
2. Build start_urls in def __init__(self), as shown above.
3. Inline the literal so the generator expression needs no class-level name: start_urls = ('https://www.example.com/?city={}&startDate={}&endDate={}&page=1'.format ... )
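For illustration, here is a minimal sketch of option 3 (not the poster's exact code): every name used inside the generator expression body is either inlined or a module-level global, so nothing has to be looked up in the class scope. city_codes is safe either way, because the outermost iterable of a generator expression is evaluated eagerly in the class scope.
import datetime
import scrapy

class MySpider(scrapy.Spider):
    name = "mine"
    allowed_domains = ["www.example.com"]
    city_codes = ['on', 'bc', 'ab']
    # The URL literal is inlined and datetime is a module-level global,
    # so the generator body no longer needs any class-level name.
    start_urls = (
        'https://www.example.com/?city={}&startDate=2011-01-01&endDate={}&page=1'.format(
            city_code, datetime.date.today().strftime("%Y-%m-%d"))
        for city_code in city_codes
    )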
I'm trying to use the serializer attribute in an Item, just like the example in the documentation:
https://docs.scrapy.org/en/latest/topics/exporters.html#declaring-a-serializer-in-the-field
The spider runs without any errors, but the serialization doesn't happen, and the print in the function doesn't print either. It's as if the function remove_pound is never called.
import scrapy

def remove_pound(value):
    print('Am I a joke to you?')
    return value.replace('£', '')

class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field(serializer=remove_pound)

class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        books = response.xpath('//ol/li')
        for i in books:
            yield BookItem(
                title=i.xpath('article/h3/a/text()').get(),
                price=i.xpath('article/div/p[@class="price_color"]/text()').get(),
            )
Am I using it wrong?
P.S.: I know there are other ways to do it; I just want to learn how to use this one.
The only reason it doesn't work is that your XPath expression is not right. You need to use a relative XPath:
price=i.xpath('./article/div/p[@class="price_color"]/text()').get()
Update: it's not the XPath. The serialization works only with item exporters:
you can customize how each field value is serialized before it is
passed to the serialization library.
So if you run this command scrapy crawl bookspider -o BookSpider.csv you'll get a correct (serialized) output.
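As a rough sketch of where the serializer actually runs (using the BookItem defined above; the file name and item values here are just illustrative), feeding an item through an exporter by hand shows that remove_pound is only called at export time:
from scrapy.exporters import CsvItemExporter

item = BookItem(title='A Light in the Attic', price='£51.77')
print(item['price'])  # still '£51.77' -- creating the item does not serialize it

with open('books.csv', 'wb') as f:
    exporter = CsvItemExporter(f)
    exporter.start_exporting()
    exporter.export_item(item)  # remove_pound() is called here
    exporter.finish_exporting()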
The XML feed I'm scraping has around a thousand items. I'm wondering if there is a way to split the load or otherwise significantly reduce the run time. It currently takes two minutes to iterate over all the XML at the link below. Any suggestions or advice are greatly appreciated.
Example: https://www.cityblueshop.com/sitemap_products_1.xml
from scrapy.spiders import XMLFeedSpider
from learning.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'testing'
    allowed_domains = ['www.cityblueshop.com']
    start_urls = ['https://www.cityblueshop.com/sitemap_products_1.xml']
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    iterator = 'xml'

    def parse_node(self, response, node):
        item = TestItem()
        item['url'] = node.xpath('.//n:loc/text()').extract()
        return item
Two minute run time for all items. Any ways to make it quicker using Scrapy?
I tested the following spider locally:
from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):
    name = 'testing'
    allowed_domains = ['www.cityblueshop.com']
    start_urls = ['https://www.cityblueshop.com/sitemap_products_1.xml']
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    iterator = 'xml'

    def parse_node(self, response, node):
        yield {'url': node.xpath('.//n:loc/text()').get()}
It takes less than 3 seconds to run, including Scrapy core startup and everything.
Please ensure that the time is not spent somewhere else, e.g. in the learning module from which you import your item subclass.
Try to increase CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, CONCURRENT_REQUESTS_PER_IP, for example: https://doc.scrapy.org/en/latest/topics/settings.html#concurrent-requests-per-domain
But remember that, besides higher speed, it can lead to a lower success rate: many 429 responses, bans, etc.
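A minimal sketch of what that might look like in a project's settings.py (the numbers are arbitrary examples, not recommendations):
# settings.py -- illustrative values only; tune them for the target site
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16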
I am new to Python and Scrapy. I have a project, and in the spider there is code like this:
class MySpider(BaseSpider):
    name = "project"
    allowed_domains = ["domain.com"]
    start_urls = ["https://domain.com/%d" % i for i in range(12308128, 12308148)]
I want to take the range boundaries 12308128 and 12308148 from a txt file (or a CSV file).
Let's say it's numbers.txt, containing these two lines:
12308128
12308148
How can I import these numbers into my spider? Another process will change the numbers in the txt file periodically, and my spider should pick up the new numbers and run.
Thank you.
You can override the start_urls logic in spider's start_requests() method:
class Myspider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # read file data
        with open('filename', 'r') as f:
            start, end = f.read().split('\n', 1)
        # make the range and urls from your numbers
        range_ = (int(start.strip()), int(end.strip()))
        start_urls = ["https://domain.com/%d" % i for i in range(*range_)]
        for url in start_urls:
            yield scrapy.Request(url)
This spider will open up file, read the numbers, create starting urls, iterate through them and schedule a request for each one of them.
The default start_requests() method looks something like this:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url)
So you can see what we're doing here by overriding it.
You can pass any parameter to a spider's constructor from the command line using the -a option of the scrapy crawl command, for example:
scrapy crawl spider -a inputfile=filename.txt
then use it like this:
from scrapy.exceptions import CloseSpider

class MySpider(scrapy.Spider):
    name = 'spider'

    def __init__(self, *args, **kwargs):
        self.infile = kwargs.pop('inputfile', None)
        super(MySpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        if self.infile is None:
            raise CloseSpider('No filename')
        # process the file; its name is in self.infile
Or you can just pass start and end values in a similar way:
scrapy crawl spider -a start=10000 -a end=20000
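A minimal sketch of a spider consuming those two arguments might look like this (the attribute names start_id and end_id are just illustrative; Scrapy passes -a values to the constructor as strings):
import scrapy

class MySpider(scrapy.Spider):
    name = 'spider'

    def __init__(self, start=None, end=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_id = int(start)
        self.end_id = int(end)

    def start_requests(self):
        for i in range(self.start_id, self.end_id):
            yield scrapy.Request("https://domain.com/%d" % i)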
I believe you need to read the file and pass the values into your URL string, for example:
with open('numbers.txt') as datacont:
    start_range = int(datacont.readline())
    end_range = int(datacont.readline())
print(start_range)
print(end_range)
How do I actually pass data into parse for my spider, let's say the variables name or temp?
class CSpider(scrapy.Spider):
    name = "s1"
    allowed_domains = ["abc.com"]
    temp = ""

    start_urls = [
        url.strip() for url in lists
    ]

    def parse(self, response):
        # How do I pass data in here, e.g. name or temp?
        pass
If you are defining the temp variable as a class-level variable, you can access it via self.temp.
If this is something you want to pass in from the command line, see the following topics:
How to give URL to scrapy for crawling?
Scrapy : How to pass list of arguments through command prompt to spider?
As alecxe answered, you can use attributes (class-level variables) to make variables or constants accessible anywhere in your class, or you can add a parameter to your parse method if you want to be able to pass in values that come from outside the class.
Here is an example of your code with both solutions.
Using an attribute:
class CSpider(scrapy.Spider):
    name = "s1"
    allowed_domains = ["abc.com"]
    temp = ""
    # Here is our attribute (a class-level variable, so no self. prefix here)
    number_of_days_in_a_week = 7

    start_urls = [
        url.strip() for url in lists
    ]

    def parse(self, response):
        # It is now used in the method
        print(f"In a week, there are {self.number_of_days_in_a_week} days.")
If you need to, here is how to pass it as another argument:
class CSpider(scrapy.Spider):
    name = "s1"
    allowed_domains = ["abc.com"]
    temp = ""

    start_urls = [
        url.strip() for url in lists
    ]

    def parse(self, what_you_want_to_pass_in):
        print(f"In a week, there are {what_you_want_to_pass_in} days.")

# We create an instance of the spider
spider1 = CSpider()
# Then we use its method with an argument
spider1.parse(7)
Note that in the second example I removed the response argument from your parse method because it made it easier to show how arguments are passed. Still, within the full Scrapy framework, you can certainly feed in external values using this approach.
I'm trying to get this code working, but it keeps raising the error in the title, and I don't get it. The function url is defined before the get_media function, and the same kind of call works fine with other functions I've written, but Python says otherwise. I've looked at answers to similar questions, but I cannot understand any of them, because they are built around those posters' more complicated code and offer no proper explanation of how the problem arises.
def url(path):
    if path.find("?") != -1:
        pre = "&"
    else:
        pre = "?"
    return protocol + "://" + host + base_path + path + pre + "access_token=" + access_token

def get_media(insta_id, max_id=None):
    insta_id = str(insta_id)
    path = url("/users/%s/media/recent/")  # ERROR COMES UP HERE
    if max_id is not None:
        path = path + "&max_id=%s" % max_id
    url = urllib.request.urlopen(path)
    url = url.read().decode("utf-8")
    url = json.loads(url)
    return url
Any help appreciated. Tell me if you need more code to work with.
You assign to a local variable called "url" later in your function. Because of that, Python treats every reference to "url" within that function as local. But of course you haven't defined that local variable yet, hence the error.
Use a different name for the local "url" variable. (It's never a URL anyway, so you should definitely use a better name.)
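A minimal sketch of that rename (assuming the surrounding module already defines url(), protocol, host, etc. and imports urllib.request and json; the new variable names are just illustrative, and the %s substitution is added on the assumption that insta_id was meant to go there):
def get_media(insta_id, max_id=None):
    insta_id = str(insta_id)
    # A different name, so the global function url() is no longer shadowed
    request_url = url("/users/%s/media/recent/" % insta_id)
    if max_id is not None:
        request_url = request_url + "&max_id=%s" % max_id
    response = urllib.request.urlopen(request_url)
    data = json.loads(response.read().decode("utf-8"))
    return data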
You just need to tell Python that this is a global variable inside the function:
url = ""  # <------ 1. declare url outside the function

def get_media():
    global url  # <------ 2. "global" tells Python that url here is the global variable
    # ....
    url = "my text"

get_media()
print(url)  # will display "my text"