I'm new to Python and I don't understand the purpose of the list() call in this piece of code:
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
The words() method already returns a list of tokenized words from a string, and I don't see any difference between that and
documents = [(movie_reviews.words(fileid), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
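One quick, hedged way to check what words() actually returns (NLTK corpus readers generally hand back a lazy corpus view rather than a plain list, though the exact class can vary by version):
from nltk.corpus import movie_reviews

fileid = movie_reviews.fileids()[0]
print(type(movie_reviews.words(fileid)))        # typically a corpus view class, not list
print(type(list(movie_reviews.words(fileid))))  # <class 'list'>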
There are three possibilities:
It is a mistake; no call to list() is required.
The interface only guarantees that the method returns an iterable, which could be a list, set, iterator, generator, etc. The specific movie_reviews.words() may return a list today, but that could change in future versions, or differ in other classes with a similar interface (a parent, a child, or simply an unrelated class with the same interface).
Whether this is the case should be stated explicitly in the documentation, or could be gleaned from the inheritance hierarchy.
The method performs some sort of memoization, keeping a copy of the returned list. Good practice would be to copy the cached list inside the method, but perhaps it returns a shared list object instead.
If the method returns a reference to a shared list object, then it is a good idea to call list() in order to create a new list object. Without the copy operation, any change to the list by one side (inside the method vs. through the documents variable) would confuse the other side. If you change the list through the documents variable, then calling movie_reviews.words(fileid) with the same fileid may return the wrong value.
In general, although this is bad design, it does happen in real code. I once had to debug such an issue in live code. In the case of memoization, it is usually better to return an immutable type such as a tuple instead of a list, which guarantees both speed and safety.
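As a hedged illustration of the third possibility (the CachedReader class below is made up for the example; it is not NLTK code), consider a method that hands out its internal cached list directly:
class CachedReader:
    """Hypothetical reader that memoizes its token list and returns it directly."""
    def __init__(self, tokens):
        self._cache = list(tokens)

    def words(self):
        return self._cache   # shared object, not a copy

reader = CachedReader(["plot", "two", "teen", "couples"])

safe = list(reader.words())   # defensive copy: mutating it cannot touch the cache
safe.append("extra")
print(reader.words())         # ['plot', 'two', 'teen', 'couples']

unsafe = reader.words()       # same object as the internal cache
unsafe.append("oops")
print(reader.words())         # ['plot', 'two', 'teen', 'couples', 'oops']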
Related
I just don't understand how an example in a book can list a parameter 'other' which is never introduced. When the function is called, does Python automatically understand it to be the other elements of the class?
See the example:
def get_neighbors(self, others, radius, angle):
    """Return the list of neighbors within the given radius and angle."""
    boids = []
    for other in others:
        if other is self:
            continue
        offset = other.pos - self.pos
        # if not in range, skip it
        if offset.mag > radius:
            continue
        # if not within viewing angle, skip it
        if self.vel.diff_angle(offset) > angle:
            continue
        # otherwise add it to the list
        boids.append(other)
    return boids
Nowhere else in the code is there a mention of 'other'.
Thanks, just trying to understand the mechanisms.
Updated answer, in response to comment
Python doesn't have any special behavior for a method parameter named "others", or for any of the other parameters in your example.
Most likely the book you're reading simply didn't explain (yet) how that function will be invoked. It's also possible that the book made a mistake (in which case, perhaps you should find a better book!).
Original answer (for posterity)
The name other is declared by the for statement:
for other in others:
...
From the Python documentation for the for statement:
The suite is then executed once for each item provided by the iterator, in the order of ascending indices. Each item in turn is assigned to the target list using the standard rules for assignments, and then the suite is executed.
Here, "the iterator" is derived from the list others, and "the target list" is simply the variable other. So on each iteration through the loop, the other variable is assigned ("using the standard rules for assignments") the next value from the list.
The docstring for that method should include the list of arguments and explain the expected type of each (I am planning to update this code soon, and I will improve the documentation).
In this case, others should be a list (or other sequence) of objects that have an attribute named pos (probably the same type as other).
Note that there is nothing special about the name 'others'.
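A minimal, self-contained illustration (unrelated to the boids code) that the loop variable name is entirely up to you:
others = ["alice", "bob", "carol"]   # any iterable works here
for other in others:                 # 'other' is created and assigned by the for statement
    print(other)

# renaming the loop variable changes nothing:
for neighbor in others:
    print(neighbor)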
I have a bunch of File objects, and a bunch of Folder objects. Each folder has a list of files. Now, sometimes I'd like to lookup which folder a certain file is in. I don't want to traverse over all folders and files, so I create a lookup dict file -> folder.
folder = Folder()
myfile = File()
folder_lookup = {}
# This is pseudocode, I don't actually reach into the Folder
# object, but have an appropriate method
folder.files.append(myfile)
folder_lookup[myfile] = folder
Now, the problem is, the files are mutable objects. My application is built around this fact: I change properties on them, and the GUI is notified and updated accordingly. Of course you can't put mutable objects in dicts. So what I tried first was to generate a hash based on the current content, basically:
def __hash__(self):
    return hash((self.title, ...))
This didn't work of course, because when the object's contents changed, its hash (and thus its identity) changed, and everything got messed up. What I need is an object that keeps its identity although its contents change. I tried various things, like making __hash__ return id(self), overriding __eq__, and so on, but never found a satisfying solution. One complication is that the whole construction should be picklable, which means I'd have to store the id on creation, since it could change when pickling, I guess.
So I basically want to use the identity of an object (not its state) to quickly look up data related to the object. I've actually found a really nice pythonic workaround for my problem, which I might post shortly, but I'd like to see if someone else comes up with a solution.
I felt dirty writing this. Just put folder as an attribute on the file.
class dodgy(list):
    def __init__(self, title):
        super().__init__()
        self.title = title
        # a throwaway class object whose (identity-based) hash stands in for this instance
        self.store = type("store", (object,), {"blanket": self})

    def __hash__(self):
        return hash(self.store)

innocent_d = {}
dodge_1 = dodgy("dodge_1")
dodge_2 = dodgy("dodge_2")
innocent_d[dodge_1] = dodge_1.title
innocent_d[dodge_2] = dodge_2.title
print(innocent_d[dodge_1])
dodge_1.extend(range(5))
dodge_1.title = "oh no"
print(innocent_d[dodge_1])   # lookup still works after mutating the list
OK, everybody noticed the extremely obvious workaround (that took me some days to come up with): just put an attribute on File that tells you which folder it is in. (Don't worry, that is also what I did.)
But it turns out that I was working under wrong assumptions. You are not supposed to use mutable objects as keys, but that doesn't mean you can't (diabolical laughter)! The default implementation of __hash__ returns a unique value, probably derived from the object's address, that remains constant over time. And the default __eq__ follows the same notion of object identity.
So you can put mutable objects in a dict, and they work as expected (if you expect equality based on identity, not on value).
See also: I'm able to use a mutable object as a dictionary key in python. Is this not disallowed?
I was having problems because I was pickling/unpickling the objects, which of course changed the hashes. One could generate a unique ID in the constructor, and use that for equality and deriving a hash to overcome this.
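A minimal sketch of that idea (the class and attribute names here are made up, not the actual application code): assign each object a unique token at construction time and base __hash__ and __eq__ on it, so the key survives pickling.
import uuid

class File:
    def __init__(self, title):
        self.title = title               # mutable state, free to change later
        self._key = uuid.uuid4().hex     # fixed at creation, pickled along with the object

    def __hash__(self):
        return hash(self._key)

    def __eq__(self, other):
        return isinstance(other, File) and self._key == other._key
An unpickled copy then hashes and compares equal to the original, which the default id()-based behaviour cannot guarantee.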
(For the curious, as to why such a "lookup based on instance identity" dict might be necessary: I've been experimenting with a kind of "object database". You have pure Python objects, put them in lists/containers, and can define indexes on attributes for faster lookup, complex queries and so on. For foreign keys (1:n relationships) I can just use containers, but for the backlink I have to come up with something clever if I don't want to modify the objects on the n side.)
In a Django UserProfile model, I am trying to do the following:
class UserProfile(models.Model):
    ...
    def recent_activity(self):
        followed_users = Followed.objects.filter(user=self.id).order_by("-when")
        ...
        items = [self]
        for f in followed_users:
            items.append(f)
        ...
        sorted(items, key=lambda item: item.getDate)
        return items[:7]
The code is supposed to return a user's "recent activity" by pulling together a number of different actions they can do and then sorting them by date.
Each model implements a "getDate" method which returns the date to sort by.
The code:
Gets a QuerySet of users that the user has "followed" (I am using a table called Followed to store the fact that user x follows user y).
Creates a list containing the UserProfile object in question.
Adds each Followed object to the list.
Sorts the list by the last edit date.
Returns the most recent n items (7 in the above).
I thought this was a neat way of implementing it. But no matter what I do, Python seems to refuse to correctly sort that list. It sorts every type of recent activity correctly except for the UserProfile, which stays wherever I originally put it in the list after calling sorted. It doesn't fail in any way; it just doesn't return a correctly sorted list!
I've probably missed something simple... but I would expect Python either to complain that an object's method cannot be used to sort a list containing the object itself, or to sort the list correctly. I was a bit surprised that it did neither.
Can anyone tell me where I've gone astray?
Thanks.
The issue is that sorted does not sort in-place.
You have probably read that list.sort() sorts in place. sorted(), however, returns a new sorted list, but you are not assigning it to anything, so it just gets thrown away. Either use sort(), or assign the result to something.
(Plus, you'll need to actually call getDate, as Marcin points out.)
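Putting both fixes together, the relevant lines would look something like this (a sketch, keeping the names from the question):
# keep the result of sorted(), and call getDate() so the key is an actual date
items = sorted(items, key=lambda item: item.getDate())
return items[:7]

# or, sorting in place:
# items.sort(key=lambda item: item.getDate())
# return items[:7]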
(1) The problem is that item.getDate will return the function (or rather, bound method) object getDate, and not a date. Change that to item.getDate(), or use key=Followed.getDate if self is an instance of Followed.
(2) There is no need to use a loop to convert a queryset to a list. Just use:
items = [self] + list(followed_users)
(This does create an intermediate list. If you wanted to avoid that, you could use itertools.chain to create a single iterable, since you can turn any iterable into a list.)
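For example, a sketch using the names from the question:
from itertools import chain

# one pass over self plus the queryset, no intermediate concatenated list
items = sorted(chain([self], followed_users), key=lambda item: item.getDate())
return items[:7]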
Of what use is id() in real-world programming? I have always thought this function is there just for academic purposes. Where would I actually use it in programming?
I have been programming applications in Python for some time now, but I have never encountered any "need" for using id(). Could someone throw some light on its real world usage?
It can be used for creating a dictionary of metadata about objects:
For example:
someobj = int(1)
somemetadata = "The type is an int"
data = {id(someobj):somemetadata}
Now if I come across this object somewhere else, I can find out in O(1) time whether metadata about it exists (instead of looping and comparing with is).
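The lookup itself is then just a dict access keyed on the object's id:
# elsewhere in the program, given the same (still alive) object
if id(someobj) in data:
    print(data[id(someobj)])   # "The type is an int"
# note: ids can be reused after an object is garbage collected,
# so this only works while someobj is kept alive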
I use id() frequently when writing temporary files to disk. It's a very lightweight way of getting a pseudo-random number.
Let's say that during data processing I come up with some intermediate results that I want to save off for later use. I simply create a file name using the pertinent object's id.
fileName = "temp_results_" + str(id(self)).
Although there are many other ways of creating unique file names, this is my favorite. In CPython, the id is the memory address of the object. Thus, if multiple objects are instantiated, I'm guaranteed to never have a naming collision. That's all for the cost of 1 address lookup. The other methods that I'm aware of for getting a unique string are much more intense.
A concrete example would be a word-processing application where each open document is an object. I could periodically save progress to disk with multiple files open using this naming convention.
Anywhere where one might conceivably need id() one can use either is or a weakref instead. So, no need for it in real-world code.
The only time I've found id() useful outside of debugging or answering questions on comp.lang.python is with a WeakValueDictionary, that is a dictionary which holds a weak reference to the values and drops any key when the last reference to that value disappears.
Sometimes you want to be able to access a group (or all) of the live instances of a class without extending the lifetime of those instances and in that case a weak mapping with id(instance) as key and instance as value can be useful.
However, I don't think I've had to do this very often, and if I had to do it again today then I'd probably just use a WeakSet (but I'm pretty sure that didn't exist last time I wanted this).
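A minimal sketch of that pattern (the Widget class is hypothetical):
import weakref

class Widget:
    _instances = weakref.WeakValueDictionary()   # id(instance) -> instance

    def __init__(self, name):
        self.name = name
        Widget._instances[id(self)] = self       # does not keep the instance alive

    @classmethod
    def live_instances(cls):
        return list(cls._instances.values())

a, b = Widget("a"), Widget("b")
print(len(Widget.live_instances()))   # 2
del a                                 # in CPython the entry for 'a' is dropped right away
print(len(Widget.live_instances()))   # 1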
In one program I used it to compute the intersection of lists of non-hashables, like:
def intersection(*lists):
    from functools import reduce   # reduce is no longer a builtin in Python 3
    from operator import and_

    id_row_row = {}   # id(row): row
    key_id_row = {}   # key: set of id(row)
    for key, rows in enumerate(lists):
        key_id_row[key] = set()
        for row in rows:
            id_row_row[id(row)] = row
            key_id_row[key].add(id(row))

    def intersect(sets):
        if len(sets) > 0:
            return reduce(and_, sets)
        else:
            return set()

    seq = [id_row_row[id_row] for id_row in intersect(key_id_row.values())]
    return seq
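A hypothetical usage, with unhashable dict rows shared by reference across the lists:
a, b, c = {"id": 1}, {"id": 2}, {"id": 3}
common = intersection([a, b, c], [b, c], [c, b, a])
print(common)   # the rows b and c (in no particular order)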
I'm trying to use Beaker's caching library but I can't get it working.
Here's my test code.
class IndexHandler():
    @cache.cache('search_func', expire=300)
    def get_results(self, query):
        results = get_results(query)
        return results

    def get(self, query):
        results = self.get_results(query)
        return render_index(results=results)
I've tried the examples in Beaker's documentation but all I see is
<type 'exceptions.TypeError'> at /
can't pickle generator objects
Clearly I'm missing something but I couldn't find a solution.
By the way this problem occurs if the cache type is set to "file".
If you configure beaker to save to the filesystem, you can easily see that each argument is being pickled as well. Example:
tp3
sS'tags <myapp.controllers.tags.TagsController object at 0x103363c10> <MySQLdb.cursors.Cursor object at 0x103363dd0> apple'
p4
Notice that the cache "key" contains more than just my keyword, "apple": it also includes instance-specific information. This is pretty bad, especially because 'self' won't be the same across invocations. The cache will result in a miss every single time (and will fill up with useless keys).
The method with the cache decorator should only take arguments that correspond to whatever "key" you have in mind. To paraphrase: say you want to store the fact that "John" corresponds to the value 555-1212, and you want to cache this. Your function should not take anything except a string as an argument. Any arguments you pass in should stay constant from invocation to invocation, so something like 'self' would be bad.
One easy way to make this work is to inline the function so that you don't need to pass anything else beyond the key. For example:
def index(self):
    # some code here
    # suppose 'place' is a string that you're using as a key. maybe
    # you're caching a description for cities and 'place' would be "New York"
    # in one instance

    @cache_region('long_term', 'place_desc')
    def getDescriptionForPlace(place):
        # perform expensive operation here
        description = ...
        return description

    # this will either fetch the data or just load it from the cache
    description = getDescriptionForPlace(place)
Your cache file should resemble the following. Notice that only 'place_desc' and 'John' were saved as a key.
tp3
sS'place_desc John'
p4
I see that the beaker docs do not mention this explicitly, but, clearly, the decorating function must pickle the arguments it's called with (to use as part of the key into the cache, to check if the entry is present, and to add it later otherwise) -- and generator objects are not picklable, as the error message is telling you. This implies that query is a generator object, of course.
What you should be doing in order to use beaker (or any other kind of cache) is to pass around, instead of a query generator object, the (picklable) parameters from which that query can be built -- strings, numbers, dicts, lists, tuples, etc., composed in any way that is easy for you to arrange and easy to build the query from "just in time" only within the function body of get_results. This way, the arguments will be picklable and caching will work.
If convenient, you could build a simple pickleable class whose instances "stand for" queries, emulating whatever initialization and parameter-setting you require, and performing the just-in-time instantiation only when some method requiring an actual query object is called. But that's just a "convenience" idea, and does not alter the underlying concept as explained in the previous paragraph.
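As a rough sketch of that approach (build_query and run_query are hypothetical helpers, not part of beaker; the decorator is the one from the question):
@cache.cache('search_func', expire=300)
def get_results(query_text, limit=20):
    # query_text and limit are plain, picklable values that form the cache key;
    # the actual query object is built just in time, inside the cached function
    query = build_query(query_text, limit)   # hypothetical helper
    return list(run_query(query))            # materialize results so the cached value can be pickled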
Try return list(results) instead of return results and see if it helps.
The beaker file cache needs to be able to pickle both cache keys and values; most iterators and generators are unpickleable.