Simple inheritance issue - Python

I have a problem with inheritance. It is probably a beginner's mistake. I made two Scrapy spiders:
from scrapy.spiders import SitemapSpider

class SchemaorgSpider(SitemapSpider):
    name = 'schemaorg'

    def parse(self, response):
        print "parse"
        ...
and
from schemaorg import SchemaorgSpider

class SchemaorgSpider_two(SchemaorgSpider):
    name = 'schemaorg_two'
    sitemap_urls = [
        urltoparse
    ]
    sitemap_rules = [('/stuff/', 'parse_fromrule')]

    def parse_fromrule(self, response):
        print "parsefromrule"
        self.parse(response)
I am basically defining all the logic in the parent's parse method and then using it in all child classes. When I run my second spider, I see only "parsefromrule" and not "parse". This looks to me like "inheritance 101" but it does not work.
What's wrong with it?
Edit: a test without Scrapy that works:
class a(object):
    def aa(self):
        print "hello"

class b(a):
    def bb(self):
        self.aa()

class c(b):
    def cc(self):
        self.aa()

hello = c()
hello.cc()
hello.bb()
hello.aa()
I see all 3 "hello". I am confused why it doesn't work with Scrapy.
Edit 2: if I put self.blabla(response) instead of self.parse(response), I get an error, so the call really is resolving to an existing (inherited) method.
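(For reference, one common cause of this exact symptom, offered as a hedged sketch rather than a confirmed diagnosis: if the parent parse contains a yield, it is a generator function, so self.parse(response) only creates a generator object and never executes the body unless the result is iterated or returned to Scrapy.)

# Hypothetical illustration of the generator pitfall, outside Scrapy:
def parse(response):
    print("parse")             # only runs when the generator is consumed
    yield response

parse("page")                  # prints nothing: the body has not started
for item in parse("page"):     # prints "parse": iterating runs the body
    pass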

Related

Scrapy callback function in another file

I am using Scrapy with Python to scrape several websites.
I have many spiders with a structure like this:
import library as lib

class Spider(Spider):
    ...

    def parse(self, response):
        yield FormRequest(..., callback=lib.parse_after_filtering_results1)
        yield FormRequest(..., callback=lib.parse_after_filtering_results2)

    def parse_after_filtering_results1(self, response):
        return results

    def parse_after_filtering_results2(self, response):
        ...  # (doesn't return anything)
I would like to know if there is any way I can put the last two functions, which are used as callbacks, in another module that is common to all my spiders (so that if I modify it, all of them change). I know they are methods of the class, but is there any way I could put them in another file?
I have tried declaring the functions in my library.py file, but my problem is how to pass them the two parameters they need (self, response).
Create a base class to contain those common functions. Then your real spiders can inherit from that. For example, if all your spiders extend Spider then you can do the following:
spiders/basespider.py:
from scrapy import Spider

class BaseSpider(Spider):
    # Do not give it a name so that it does not show up in the spiders list.
    # This contains only common functions.

    def parse_after_filtering_results1(self, response):
        # ...
        pass

    def parse_after_filtering_results2(self, response):
        # ...
        pass
spiders/realspider.py:
from .basespider import BaseSpider

class RealSpider(BaseSpider):
    # ...

    def parse(self, response):
        yield FormRequest(..., callback=self.parse_after_filtering_results1)
        yield FormRequest(..., callback=self.parse_after_filtering_results2)
If you have different types of spiders you can create different base classes. Or your base class can be a plain object (not Spider) and then you can use it as a mixin.
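A minimal sketch of that mixin variant, assuming a mixin class name of my own choosing (it is not part of the original answer):

from scrapy import Spider

class FilteringCallbacksMixin(object):
    # A plain object, not a Spider, so it has no name and is never run on its own.
    def parse_after_filtering_results1(self, response):
        # common parsing logic shared by all spiders
        pass

    def parse_after_filtering_results2(self, response):
        pass

class RealSpider(FilteringCallbacksMixin, Spider):
    # The mixin comes first so its methods are found before Spider's in the MRO.
    name = 'real'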

Python package structure with base classes

I am wondering if there is a way to do what I am trying, best explained with an example:
Contents of a.py:
class A(object):
    def run(self):
        print('Original')

class Runner(object):
    def run(self):
        a = A()
        a.run()
Contents of b.py:
import a

class A(a.A):
    def run(self):
        # Do something project-specific
        print('new class')

class Runner(a.Runner):
    def other_fcn_to_do_things(self):
        pass
Basically, I have a file with some base classes that I would like to use for a few different projects. What I would like is for b.Runner.run() to use the class A in b.py, without needing to override the run method. In the example above, I would like the following code
import b
r = b.Runner()
print(r.run())
to print "new class". Is there any way to do that?
This seems a little convoluted. The Runner classes are probably unnecessary, unless there's something else more complex going on that was left out of your example. If you're set on not overriding the original run(), you could call it in another method in B. Please take a look at this post and this post on super().
It would probably make more sense to do something like this:
a.py:
class A(object):
    def run(self):
        # stuff
        print('Original')
b.py:
import a

class B(a.A):
    def run(self):
        return super(B, self).run()
        # can also do: return a.A.run(self)

    def run_more(self):
        super(B, self).run()
        # other stuff
        print('new class')
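A quick usage sketch of the snippet above (the behaviour is inferred from the two files, not stated in the original answer):

import b

obj = b.B()
obj.run()        # prints 'Original', delegated to a.A.run via super()
obj.run_more()   # prints 'Original' and then 'new class'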

Access attribute of one class from another class in Python

I'm playing with OOP (the OOP concept is totally new for me) in Python 3 and trying to access an attribute (a list) of one class from another class. Obviously I am doing something wrong, but I don't understand what.
from urllib import request
from bs4 import BeautifulSoup

class getUrl(object):
    def __init__(self):
        self.appList = []
        self.page = None

    def getPage(self, url):
        url = request.urlopen(url)
        self.page = url.read()
        url.close()

    def parsePage(self):
        soup = BeautifulSoup(self.page)
        for link in soup.find_all("a"):
            self.appList.append(link.get('href'))
        return (self.appList)

class getApp(object):
    def __init__(self):
        pass

    def selectApp(self):
        for i in getUrl.appList():
            return print(i)

a = getUrl()
a.getPage("http://somepage/page")
a.parsePage()
b = getApp()
b.selectApp()
And I get:
AttributeError: type object 'getUrl' has no attribute 'appList'
Your code seems to confuse classes with functions. Normally a function name is a verb (e.g. getUrl) because it represents an action. A class name is usually a noun, because it represents a class of objects rather than actions. For example, the following is closer to how I would expect to see classes being used:
from urllib import request
from bs4 import BeautifulSoup

class Webpage(object):
    def __init__(self, url):
        self.app_list = []
        url = request.urlopen(url)
        self.page = url.read()

    def parse(self):
        soup = BeautifulSoup(self.page)
        for link in soup.find_all("a"):
            self.app_list.append(link.get('href'))
        return self.app_list

class App(object):
    def __init__(self, webpage, number):
        self.webpage = webpage
        self.link = webpage.app_list[number]

my_webpage = Webpage("http://somepage/page")
my_webpage.parse()
selected_app = App(my_webpage, 1)
print(selected_app.link)
Note that we usually make an instance of a class (e.g. my_webpage) then access methods and properties of the instance rather than of the class itself. I don't know what you intend to do with the links found on the page, so it is not clear if these need their own class (App) or not.
You need to pass in the getUrl() instance; the attributes are not present on the class itself:
class getApp(object):
    def __init__(self):
        pass

    def selectApp(self, geturl_object):
        for i in geturl_object.appList:
            print(i)
(note the removed return as well; print() returns None and you'd exit the loop early).
and
b = getApp()
b.selectApp(a)
appList is an instance attribute of the getUrl class, so you can only access it on an instance (object) of getUrl. The problem is here:
class getApp(object):
    def __init__(self):
        pass

    def selectApp(self):
        for i in getUrl.appList():
            return print(i)
Look at getUrl.appList(). Here you call the class, not an object. You might also want to look at the return print(i) statement.
Use requests instead of urllib; it is more convenient.

How to present a class as a function?

As it was unclear earlier I am posting this scenario:
class Scraper:
    def __init__(self, url):
        self.start_page = url

    def parse_html(self):
        pass

    def get_all_links(self):
        pass

    def run(self):
        # parse html, get all links, parse them and when done...
        return links
Now in a task queue like rq
from rq import Queue
from worker import conn
q = Queue(connection=conn)
result = q.enqueue(what_function, 'http://stackoverflow.com')
I want to know what this what_function would be. I remembered that Django does something similar with their CBVs, so I used that analogy, but it wasn't so clear.
I have a class like
class A:
    def run(self, arg):
        # do something
        pass
I need to pass this to a task queue, so I can do something like
a = A()
b = a.run
# q is the queue object
q.enqueue(b,some_arg)
I'd like to know what other ways there are to do this. For example, Django does it in their class-based views:
class YourListView(ListView):
    # code for your view
    pass
which is eventually passed as a function
your_view = YourListView.as_view()
How is it done?
Edit: to elaborate, Django's class-based views are converted to functions because the argument in the URL pattern expects a function. Similarly, you might have a function which accepts the following arguments:
def task_queue(callback_function, *parameters):
    # add to queue and return result when done
    ...
but the functionality of callback_function might be mostly implemented in a class, which has a run() method via which the process is run.
I think you're describing a classmethod:
class MyClass(object):
    @classmethod
    def as_view(cls):
        '''method intended to be called on the class, not an instance'''
        return cls(instantiation, args)
which could be used like this:
call_later = MyClass.as_view
and later called:
call_later()
Most frequently, class methods are used to instantiate a new instance, for example, dict's fromkeys classmethod:
dict.fromkeys(['foo', 'bar'])
returns a new dict instance:
{'foo': None, 'bar': None}
Update
In your example,
result = q.enqueue(what_function, 'http://stackoverflow.com')
you want to know what could go there as what_function. I saw a very similar example on the RQ home page; that has to be your own implementation. It needs to be something your code can call, and it is only going to be called with that argument once, so if you use a class, your __init__ should look more like this if you want Scraper to be your what_function replacement:
class Scraper:
    def __init__(self, url):
        self.start_page = url
        self.run()
    # etc...
If you want to use a class method, that might look like this:
class Scraper:
    def __init__(self, url):
        self.start_page = url

    def parse_html(self):
        pass

    def get_all_links(self):
        pass

    @classmethod
    def run(cls, url):
        instance = cls(url)
        # parse html, get all links, parse them and when done...
        return links
And then your what_function would be Scraper.run.
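A short usage sketch of that classmethod variant, reusing the q and conn names from the question's own rq snippet:

from rq import Queue
from worker import conn

q = Queue(connection=conn)
# Scraper.run is a plain callable bound to the class, so it can be enqueued directly
result = q.enqueue(Scraper.run, 'http://stackoverflow.com')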

Where and how to print scrapy stats upon spider_closed signal (finish)?

I have a working BaseSpider on Scrapy 0.20.0, but I'm trying to collect the number of website URLs found and print it as INFO when the spider is finished (closed). The problem is that I am not able to print this simple integer variable at the end of the session; any print statement in the parse() or parse_item() functions is printed too early, long before the end.
I also looked at this question, but it seems somewhat outdated and it is unclear how to use it properly, i.e. where to put it (myspider.py, pipelines.py, etc.).
Right now my spider code is something like:
class MySpider(BaseSpider):
    ...
    foundWebsites = 0
    ...

    def parse(self, response):
        ...
        print "Found %d websites in this session.\n\n" % (self.foundWebsites)

    def parse_item(self, response):
        ...
        if item['website']:
            self.foundWebsites += 1
        ...
And this is obviously not working as intended. Any better and simpler ideas?
The first answer referred to works, and there is no need to add anything else to pipelines.py. Just add "that answer" to your spider code like this:
# To use "spider_closed" we also need:
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

class MySpider(BaseSpider):
    ...
    foundWebsites = 0
    ...

    def parse(self, response):
        ...

    def parse_item(self, response):
        ...
        if item['website']:
            self.foundWebsites += 1
        ...

    def __init__(self):
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        if spider is not self:
            return
        print "Found %d websites in this session.\n\n" % (self.foundWebsites)
