What triggers the from_crawler classmethod?

I'm using Scrapy and I have the following functioning pipeline class:
class DynamicSQLlitePipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Here, you get whatever value was passed through the "table" parameter
        docket = getattr(crawler.spider, "docket")
        return cls(docket)

    def __init__(self, docket):
        try:
            db_path = "sqlite:///" + settings.SETTINGS_PATH + "\\data.db"
            db = dataset.connect(db_path)
            table_name = docket[0:3]  # FIRST 3 LETTERS
            self.my_table = db[table_name]
        except Exception:
            # traceback.exec_print()
            pass

    def process_item(self, item, spider):
        try:
            test = dict(item)
            self.my_table.insert(test)
            print('INSERTED')
        except IntegrityError:
            print('THIS IS A DUP')
In my spider I have:
custom_settings = {
    'ITEM_PIPELINES': {
        'myproject.pipelines.DynamicSQLlitePipeline': 600,
    }
}
From a recent question I was pointed to What is the 'cls' variable used for in Python classes?
If I understand correctly, in order for the pipeline object to be instantiated (using the __init__ function), it requires a docket number. The docket number only becomes available once the from_crawler class method is run. But what triggers the from_crawler method? Again, the code is working.

The caller of a classmethod has to have a reference to the class. They may just access it by name, like this:
DynamicSQLlitePipeline.from_crawler(crawler)
… or:
sqlitepipeline.DynamicSQLlitePipeline.from_crawler(crawler)
Or maybe you pass the class object to someone, and they store it and use it later like this:
pipelines[i].from_crawler(crawler)
In Scrapy, the usual way to register a set of pipelines with the framework, according to the docs, is like this:
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
(Also see the Extensions user guide, which explains how this fits into a Scrapy project.)
Presumably you've done something similar in code you haven't shown us, putting something like 'sqlscraper.pipelines.DynamicSQLlitePipeline' in that dict. At some point, Scrapy goes through that dict, sorts the entries by value, and instantiates each pipeline. (Because it has the name of the class as a string, instead of the class object itself, this is a little trickier, but the details aren't relevant here.)
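Concretely, a rough sketch of that machinery might look like the following. This is not Scrapy's actual source; build_pipelines and load_object are illustrative names, but the cls.from_crawler(crawler) call at the end is exactly what "triggers" your classmethod:
import importlib

def load_object(path):
    # split 'myproject.pipelines.PricePipeline' into module path and class name
    module_path, _, class_name = path.rpartition('.')
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

def build_pipelines(item_pipelines, crawler):
    # lower value = earlier in the chain, per the Scrapy docs
    pipelines = []
    for path in sorted(item_pipelines, key=item_pipelines.get):
        cls = load_object(path)
        if hasattr(cls, 'from_crawler'):
            # this call is what invokes your classmethod
            pipelines.append(cls.from_crawler(crawler))
        else:
            pipelines.append(cls())
    return pipelines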


I'm getting an invalid syntax error with my class

I've been learning Python for a while now, and I really want to start using OOP, but I'm having trouble understanding it. Please can you tell me where I'm going wrong with my class?
class Savecookies():
    driver = webdriver.Firefox()

    def __init__(self, site, url):
        self.site = site
        self.url = url

    def twitter(driver, self.site, self.url):

if __name__ == '__main__':
    cooks = Savecookies('twitter', 'https://twitter.com/')
My error:
  File "twitter_test2.py", line 26
    def twitter(driver, self.site, self.url):
                            ^
SyntaxError: invalid syntax
def twitter(driver, self.site, self.url):
What’s that?
First of all, methods need a body. Otherwise they are incomplete. The simplest body would be to just do pass (i.e. do nothing). But you probably want to add actual stuff in there.
Second, your arguments make no sense at all. The first argument of a method is self, and then you specify which other arguments you want the method to accept. Argument names need to be valid variable names, so you cannot have a dot in there. And if you want the method to access self.site and self.url, you can just do that without passing them in (since you have access to self). In your case, you already have the site and url on the Savecookies object, so you probably want something like this:
def twitter(self, driver):
    # Do something useful here
    print(self.site, self.url)
    print(driver)
In case twitter is an instance method, change:
def twitter(driver, self.site, self.url):
to:
def twitter(self):
    # now do something with the params
    print(self.driver, self.site, self.url)
Basically, let it access the site and url instance attributes set by __init__, and let it access the driver class attribute set on the class.
Both kinds of attributes can be reached through self; there is no need to pass them in again as parameters.
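A minimal, self-contained version of that idea (with the webdriver swapped for a plain string so the sketch runs on its own):
class Savecookies:
    driver = 'firefox-driver'   # class attribute, shared by all instances

    def __init__(self, site, url):
        self.site = site        # instance attributes, set per object
        self.url = url

    def twitter(self):
        # both kinds of attribute are reachable through self
        print(self.driver, self.site, self.url)

cooks = Savecookies('twitter', 'https://twitter.com/')
cooks.twitter()   # prints: firefox-driver twitter https://twitter.com/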

How to present a class as a function?

As it was unclear earlier I am posting this scenario:
class Scraper:
    def __init__(self, url):
        self.start_page = url

    def parse_html(self):
        pass

    def get_all_links(self):
        pass

    def run(self):
        # parse html, get all links, parse them and when done...
        return links
Now, in a task queue like rq:
from rq import Queue
from worker import conn

q = Queue(connection=conn)
result = q.enqueue(what_function, 'http://stackoverflow.com')
I want to know what this what_function would be. I remembered that Django does something similar with its CBVs, so I used that analogy, but it wasn't so clear.
I have a class like:
class A:
    def run(self, arg):
        # do something
I need to pass this to a task queue, so I can do something like:
a = A()
b = a.run
# q is the queue object
q.enqueue(b, some_arg)
I'd like to know what other methods there are to do this. For example, Django does it in its class-based views:
class YourListView(ListView):
    # code for your view
which is eventually passed as a function:
your_view = YourListView.as_view()
How is it done?
Edit: to elaborate, Django's class-based views are converted to functions because the URL pattern function expects a function as its argument. Similarly, you might have a function which accepts the following arguments:
task_queue(callback_function, *parameters):
    # add to queue and return result when done
but the functionality of callback_function might be implemented mostly in a class, which has a run() method via which the process is run.
I think you're describing a classmethod:
class MyClass(object):
    @classmethod
    def as_view(cls):
        '''method intended to be called on the class, not an instance'''
        return cls(instantiation, args)
which could be used like this:
call_later = MyClass.as_view
and later called:
call_later()
Most frequently, class methods are used to instantiate a new instance, for example, dict's fromkeys classmethod:
dict.fromkeys(['foo', 'bar'])
returns a new dict instance:
{'foo': None, 'bar': None}
Update
In your example,
result = q.enqueue(what_function, 'http://stackoverflow.com')
you want to know what what_function could be. I saw a very similar example on the RQ home page. That's got to be your own implementation; it's going to be something you can call with your code. It's only going to be called with that argument once, so if using a class, your __init__ should look more like this, if you want to use Scraper as your what_function replacement:
class Scraper:
    def __init__(self, url):
        self.start_page = url
        self.run()
    # etc...
If you want to use a class method, that might look like this:
class Scraper:
    def __init__(self, url):
        self.start_page = url

    def parse_html(self):
        pass

    def get_all_links(self):
        pass

    @classmethod
    def run(cls, url):
        instance = cls(url)
        # parse html, get all links, parse them and when done...
        return links
And then your what_function would be Scraper.run.
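So, in the question's own terms, the enqueue call would be a sketch like this (assuming the worker module and Redis connection from the question):
from rq import Queue
from worker import conn

q = Queue(connection=conn)
# RQ accepts any callable plus its arguments
result = q.enqueue(Scraper.run, 'http://stackoverflow.com')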

Methods on descriptors

I'm trying to implement a wrapper around a redis database that does some bookkeeping, and I thought about using descriptors. I have an object with a bunch of fields: frames, failures, etc., and I need to be able to get, set, and increment the field as needed. I've tried to implement an Int-Like descriptor:
class IntType(object):
    def __get__(self, instance, owner):
        # issue a GET database command
        return db.get(my_val)

    def __set__(self, instance, val):
        # issue a SET database command
        db.set(instance.name, val)

    def increment(self, instance, count):
        # issue an INCRBY database command
        db.hincrby(instance.name, count)

class Stream:
    _prefix = 'stream'
    frames = IntType()
    failures = IntType()
    uuid = StringType()

s = Stream()
s.frames.increment(1)  # 'float' object has no attribute 'increment'
It seems like I can't access the increment() method in my descriptor. I can't have increment be defined on the object that __get__ returns, as that would require an additional db query when all I want to do is increment! I also don't want increment() on the Stream class, because later on, when I want to have additional fields like strings or sets in Stream, I'd need to type-check the heck out of everything.
Does this work?
class Stream:
    _prefix = 'stream'

    def __init__(self):
        self.frames = IntType()
        self.failures = IntType()
        self.uuid = StringType()
Why not define the magic method __iadd__ as well as __get__ and __set__? This will allow you to do normal addition with assignment on the class. It will also mean you can treat the increment separately from the get, and thereby minimise the database accesses.
So change:
def increment(self, instance, count):
    # issue an INCRBY database command
    db.hincrby(instance.name, count)
to:
def __iadd__(self, other):
    # your code goes here
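A hedged sketch of how that could look (not the answerer's code; IntProxy is an illustrative name, and it assumes, as the question's own __set__ does, that instances carry a name attribute). Note the wrinkle: after s.frames += 1, Python hands whatever __iadd__ returned back to __set__, so __set__ has to recognise the proxy and skip the redundant SET:
class IntProxy(object):
    def __init__(self, name):
        self.name = name

    def __iadd__(self, count):
        db.hincrby(self.name, count)   # one INCRBY, no prior GET
        return self                    # the += protocol reassigns this

class IntType(object):
    def __get__(self, instance, owner):
        return IntProxy(instance.name)

    def __set__(self, instance, val):
        if isinstance(val, IntProxy):  # came back from +=, already written
            return
        db.set(instance.name, val)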
Try this:
class IntType(object):
    def __get__(self, instance, owner):
        # the returned wrapper closes over `instance`, so its methods can
        # reach the owning object's attributes without another lookup
        class IntValue():
            def increment(self, count):
                # issue an INCRBY database command
                db.hincrby(instance.name, count)

            def getValue(self):
                # issue a GET database command
                return db.get(instance.name)
        return IntValue()

    def __set__(self, instance, val):
        # issue a SET database command
        db.set(instance.name, val)
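Usage would then look like this (again assuming, as the question's __set__ already does, that Stream instances have a name attribute identifying the database key):
s = Stream()
s.frames.increment(1)       # single INCRBY, no read round-trip
s.frames = 10               # plain SET via __set__
print(s.frames.getValue())  # GET only when the value is actually needed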

Python - Better to have multiple methods or lots of optional parameters?

I have a class which makes requests to a remote API. I'd like to be able to reduce the number of calls I'm making. Some of the methods in my class make the same API calls (but for different reasons), so I'd like the ability for them to 'share' a cached API response.
I'm not entirely sure if it's more Pythonic to use optional parameters or to use multiple methods, as the methods have some required parameters if they are making an API call.
Here are the approaches as I see them; which do you think is best?
class A:
    def a_method(self, item_id, cached_item_api_response=None):
        """ Seems awkward having to supply item_id even
        if cached_item_api_response is given
        """
        api_response = None
        if cached_item_api_response:
            api_response = cached_item_api_response
        else:
            api_response = ...  # make api call using item_id
        ...  # do stuff
Or this:
class B:
    def a_method(self, item_id=None, cached_api_response=None):
        """ Seems awkward as it makes no sense NOT to supply EITHER
        item_id or cached_api_response
        """
        api_response = None
        if cached_api_response:
            api_response = cached_api_response
        elif item_id:
            api_response = ...  # make api call using item_id
        else:
            ...  # ERROR
        ...  # do stuff
Or is this more appropriate?
class C:
    """Seems even more awkward to have different method calls"""

    def a_method(self, item_id):
        api_response = ...  # make api call using item_id
        self.api_response_logic(api_response)

    def b_method(self, cached_api_response):
        self.api_response_logic(cached_api_response)

    def api_response_logic(self, api_response):
        ...  # do stuff
Normally, when writing a method, one could argue that a method / object should do one thing and do it well. If your method gets more and more parameters which require more and more ifs in your code, that probably means your code is doing more than one thing, especially if those parameters trigger totally different behavior. Instead, maybe the same behavior could be produced by having different classes that override methods.
Maybe you could use something like:
class BaseClass(object):
    def a_method(self, item_id):
        response = lookup_response(item_id)
        return response

class CachingClass(BaseClass):
    def a_method(self, item_id):
        if item_id in cache:
            return item_from_cache
        return super(CachingClass, self).a_method(item_id)

    def uncached_method(self, item_id):
        return super(CachingClass, self).a_method(item_id)
That way you can split the logic of how to lookup the response and the caching while also making it flexible for the user of the API to decide if they want the caching capabilities or not.
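For example, the caller picks the behaviour simply by picking the class (a sketch, given the placeholder lookup and cache names above):
client = CachingClass()
client.a_method(42)         # served from the cache when possible
client.uncached_method(42)  # always performs the lookup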
There is nothing wrong with the method used in your class B. To make it more obvious at a glance that you actually need to include either item_id or cached_api_response, I would put the error checking first:
class B:
    def a_method(self, item_id=None, cached_api_response=None):
        """Requires either item_id or cached_api_response"""
        if not ((item_id == None) ^ (cached_api_response == None)):
            ...  # error
        # or, if you want to allow both,
        if (item_id == None) and (cached_api_response == None):
            ...  # error

        # you don't actually have to do this on one line
        # also don't use it if cached_api_response can evaluate to 'False'
        api_response = cached_api_response or make_api_call(item_id)  # make_api_call stands in for your real call
        ...  # do stuff
Ultimately this is a judgement that must be made for each situation. I would ask myself, which of these two more closely fits:
Two completely different algorithms or actions, with completely different semantics, even though they may be passed similar information
A single conceptual idea, with consistent semantics, but with nuance based on input
If the first is closest, go with separate methods. If the second is closest, go with optional arguments. You might even implement a single method by testing the type of the argument(s) to avoid passing additional arguments.
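A sketch of that last idea, with the hypothetical ApiResponse and make_api_call names standing in for whatever the real API client provides:
class A:
    def a_method(self, source):
        if isinstance(source, ApiResponse):
            api_response = source                 # an already-fetched response
        else:
            api_response = make_api_call(source)  # treat source as an item_id
        ...  # do stuff with api_response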
This is an OO anti-pattern.
class API_Connection(object):
    def do_something_with_api_response(self, response):
        ...

    def do_something_else_with_api_response(self, response):
        ...
You have two methods on an instance and you're passing state between them explicitly? Why are these methods and not bare functions in a module?
Instead, think about using encapsulation to help you by having the instance of the class own the api response.
For example:
class API_Connection(object):
    def __init__(self, api_url):
        self._url = api_url
        self._cached_response = None

    @property
    def response(self):
        """Actually use the _url and get the response when needed."""
        if self._cached_response is None:
            # actually calculate self._cached_response by making our
            # remote call, etc.
            self._cached_response = self._get_api_response(self._url)
        return self._cached_response

    def _get_api_response(self, api_url):
        """Make the request and return the api's response"""

    def do_something_with_api_response(self):
        # just use self.response
        do_something(self.response)

    def do_something_else_with_api_response(self):
        # just use self.response
        do_something_else(self.response)
You have caching and any method which needs this response can run in any order without making multiple api requests because the first method that needs self.response will calculate it and every other will use the cached value. Hopefully it's easy to imagine extending this with multiple URLs or RPC calls. If you have a need for a lot of methods that cache their return values like response above then you should look into a memoization decorator for your methods.
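As one concrete option (my suggestion, not something the answer names): on Python 3.8+, functools.cached_property is a ready-made memoizing descriptor that does the same per-instance caching as the hand-rolled property above:
from functools import cached_property

class API_Connection:
    def __init__(self, api_url):
        self._url = api_url

    @cached_property
    def response(self):
        # computed on first access, then stored on the instance
        return self._get_api_response(self._url)

    def _get_api_response(self, url):
        ...  # make the remote call here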
The cached response should be saved in the instance, not passed around like a bag of Skittles -- what if you dropped it?
Is item_id unique per instance, or can an instance make queries for more than one? If it can have more than one, I'd go with something like this:
class A(object):
    def __init__(self):
        self._cache = dict()

    def a_method(self, item_id):
        """Gets api_response from cache (cache may have to get a current response).
        """
        api_response = self._get_cached_response(item_id)
        ...  # do stuff

    def b_method(self, item_id):
        """'nother method (just for show)
        """
        api_response = self._get_cached_response(item_id)
        ...  # do other stuff

    def _get_cached_response(self, item_id):
        if item_id in self._cache:
            return self._cache[item_id]
        response = self._cache[item_id] = api_call(item_id, ...)
        return response

    def refresh_response(self, item_id):
        if item_id in self._cache:
            del self._cache[item_id]
        self._get_cached_response(item_id)
And if you may have to get the most current info about item_id, you can have a refresh_response method.

Simple python plugin system

I'm writing a parser for an internal xml-based metadata format in python. I need to provide different classes for handling different tags. There will be a need for a rather big collection of handlers, so I've envisioned it as a simple plugin system. What I want to do is simply load every class in a package, and register it with my parser.
My current attempt looks like this:
(Handlers is the package containing the handlers, each handler has a static member tags, which is a tuple of strings)
class MetadataParser:
    def __init__(self):
        # ...
        self.handlers = {}
        self.currentHandler = None
        for handler in dir(Handlers):  # Make a list of all symbols exported by Handlers
            if handler[-7:] == 'Handler':  # and for each of those ending in "Handler"
                handlerMod = my_import('MetadataLoader.Handlers.' + handler)
                self.registerHandler(handlerMod, handlerMod.tags)  # register them for their tags
    # ...

    def registerHandler(self, handler, tags):
        """ Register a handler class for each xml tag in a given list of tags """
        if not isSequenceType(tags):
            tags = (tags,)  # Sanity check, make sure the tag-list is indeed a list
        for tag in tags:
            self.handlers[tag] = handler
However, this does not work. I get the error AttributeError: 'module' object has no attribute 'tags'
What am I doing wrong?
Probably one of your handlerMod modules does not contain any tags variable.
First off, apologies for the poorly formatted/incorrect code.
Also, thanks for looking at it. However, the culprit was, as so often, between the chair and the keyboard. I confused myself by having classes and modules of the same name. The result of my_import (which I now realize I didn't even mention where it comes from... it's from SO: link) is a module named, for instance, areaHandler. I want the class, also named areaHandler. So I merely had to pick out the class with eval('Handlers.' + handler + '.' + handler).
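(As a side note, the same lookup can be done without eval; this is just the standard getattr equivalent, using the names from the question:)
# equivalent to eval('Handlers.' + handler + '.' + handler)
handler_cls = getattr(handlerMod, handler)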
Again, thanks for your time, and sorry about the bandwidth
I suggest you read the example and explanation on this page, which explains how to write a plug-in architecture.
A simple and completely extensible implementation via the extend_me library.
Code could look like:
from extend_me import ExtensibleByHash

# create meta class
tagMeta = ExtensibleByHash._('Tag', hashattr='name')

# create base class for all tags
class BaseTag(object):
    __metaclass__ = tagMeta

    def __init__(self, tag):
        self.tag = tag

    def process(self, *args, **kwargs):
        raise NotImplementedError()

# create classes for all required tags
class BodyTag(BaseTag):
    class Meta:
        name = 'body'

    def process(self, *args, **kwargs):
        pass  # do processing

class HeadTag(BaseTag):
    class Meta:
        name = 'head'

    def process(self, *args, **kwargs):
        pass  # do some processing here

# implement other tags in this way
# ...

# process tags
def process_tags(tags):
    res_tags = []
    for tag in tags:
        cls = tagMeta.get_class(tag)  # get the correct class for each tag
        res_tags.append(cls(tag))     # and add its instance to the result
    return res_tags
For more information, look at the documentation or the code.
This lib is used in the OpenERP / Odoo RPC lib.
