In general, Python sets don't seem to be designed for retrieving items by key. That's obviously what dictionaries are for. But is there any way, given a key, to retrieve from a set the stored instance that is equal to that key?
Again, I know this is exactly what dictionaries are for, but as far as I can see, there are legitimate reasons to want to do this with a set. Suppose you have a class defined something like:
class Person:
    def __init__(self, firstname, lastname, age):
        self.firstname = firstname
        self.lastname = lastname
        self.age = age
Now, suppose I am going to be creating a large number of Person objects, and each time I create a Person object I need to make sure it is not a duplicate of a previous Person object. A Person is considered a duplicate of another Person if they have the same firstname, regardless of other instance variables. So naturally the obvious thing to do is insert all Person objects into a set, and define a __hash__ and __eq__ method so that Person objects are compared by their firstname.
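For illustration, the compare-by-firstname idea might look like this (a minimal sketch; the method bodies are my reading of what the question describes):

```python
class Person:
    def __init__(self, firstname, lastname, age):
        self.firstname = firstname
        self.lastname = lastname
        self.age = age

    def __hash__(self):
        # Hash only on firstname so duplicates collide in the set
        return hash(self.firstname)

    def __eq__(self, other):
        # Two Persons are "duplicates" if their firstnames match
        return self.firstname == other.firstname

people = set()
people.add(Person("Alice", "Smith", 30))
print(Person("Alice", "Jones", 25) in people)  # True: same firstname
```

Note that adding a second "Alice" to the set is a no-op: the set keeps the instance that was inserted first.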
An alternate option would be to create a dictionary of Person objects, and use a separately created firstname string as the key. The drawback here is that I'd be duplicating the firstname string. This isn't really a problem in most cases, but what if I have 10,000,000 Person objects? The redundant string storage could really start adding up in terms of memory usage.
But if two Person objects compare equally, I need to be able to retrieve the original object so that the additional instance variables (aside from firstname) can be merged in a way required by the business logic. Which brings me back to my problem: I need some way to retrieve instances from a set.
Is there any way to do this? Or is using a dictionary the only real option here?
I'd definitely use a dictionary here. Reusing the firstname instance variable as a dictionary key won't copy it -- the dictionary will simply use the same object. I doubt a dictionary will use significantly more memory than a set.
To actually save memory, add a __slots__ attribute to your classes. This will prevent each of your 10,000,000 instances from having a __dict__ attribute, which will save far more memory than the potential overhead of a dict over a set.
Edit: Some numbers to back my claims. I defined a stupid example class storing pairs of random strings:
import random

def rand_str():
    return str.join("", (chr(random.randrange(97, 123))
                         for i in range(random.randrange(3, 16))))

class A(object):
    def __init__(self):
        self.x = rand_str()
        self.y = rand_str()

    def __hash__(self):
        return hash(self.x)

    def __eq__(self, other):
        return self.x == other.x
The amount of memory used by a set of 1,000,000 instances of this class
random.seed(42)
s = set(A() for i in xrange(1000000))
is 240 MB on my machine. If I add
__slots__ = ("x", "y")
to the class, this goes down to 112 MB. If I store the same data in a dictionary
def key_value():
    a = A()
    return a.x, a
random.seed(42)
d = dict(key_value() for i in xrange(1000000))
this uses 249 MB without __slots__ and 121 MB with __slots__.
Yes, you can do this: A set can be iterated over. But note that this is an O(n) operation as opposed to the O(1) operation of the dict.
So you have to trade off speed versus memory; this is a classic. I personally would optimize for speed here (i.e. use the dictionary), since memory won't run short so quickly with only 10,000,000 objects, and using dictionaries is really easy.
As for additional memory consumption for the firstname string: Since strings are immutable in Python, assigning the firstname attribute as a key will not create a new string, but just copy the reference.
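The O(n) set lookup described above might be sketched like this (the Person class and the find helper are my own illustration, assuming comparison by firstname as in the question):

```python
class Person:
    def __init__(self, firstname, lastname=None, age=None):
        self.firstname, self.lastname, self.age = firstname, lastname, age

    def __hash__(self):
        return hash(self.firstname)

    def __eq__(self, other):
        return self.firstname == other.firstname

people = {Person("Alice", "Smith", 30), Person("Bob", "Brown", 40)}

def find(s, key):
    # Linear scan: return the stored instance equal to `key`, or None
    return next((item for item in s if item == key), None)

original = find(people, Person("Alice"))
print(original.lastname)  # Smith
```

This works, but every lookup walks the whole set, which is exactly the O(n) cost the answer warns about.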
I think you'll have the answer here:
Moving Beyond Factories in Python
I'm working with some LDAP data in python (I'm not great at Python) and trying to organize a class object to hold the LDAP variables. Since it's LDAP data, the end result will be many copies of the same data structure (per user) collected in an iterable list.
I started with hard-coded attribute names for the __slots__, which seemed to be working, but as things progressed I realized those LDAP attribute names ought to be immutable constants of some sort to minimize hard-coded text and typos. I assigned variables to the __slots__ entries, but it seems this is not such a workable plan:
AttributeError: 'LDAP_USER' object has no attribute 'ATTR_GIVEN_NAME'
Now that I think about it, I'm not actually creating immutable "constants" with the ATTR_ definitions so those values could theoretically be changed during runtime. I can see why Python might be having a problem with this design.
What is a better way to reduce the usage of hard coded text in the code while maintaining a class object which can be instantiated?
ATTR_DN = 'dn'
ATTR_GIVEN_NAME = 'givenName'
ATTR_SN = 'sn'
ATTR_EMP_TYPE = 'employeeType'

class LDAP_USER(object):
    __slots__ = ATTR_GIVEN_NAME, ATTR_DN, ATTR_SN, ATTR_EMP_TYPE

user = LDAP_USER()
user.ATTR_GIVEN_NAME = "milton"
user.ATTR_SN = "waddams"
user.ATTR_EMP_TYPE = "collating"
print("user is " + user.ATTR_GIVEN_NAME)
With __slots__ defined as [ATTR_GIVEN_NAME, ATTR_DN], the attributes should be referenced using user.givenName and user.dn, since those are the string values in __slots__.
If you want to actually reference the attribute as user.ATTR_GIVEN_NAME then that should be the value in the __slots__ array. You can then add a mapping routine to convert the object attributes to LDAP fields when performing LDAP operations with the object.
Referencing a property not in __slots__ will generate an error, so typos will be caught at runtime.
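One way to sketch that suggestion (the attribute names and the mapping dict here are my own illustration, not from the question):

```python
# Python-side attribute names go in __slots__; a separate map
# ties each of them to the corresponding LDAP field name.
LDAP_FIELD_MAP = {
    "given_name": "givenName",
    "sn": "sn",
    "dn": "dn",
    "emp_type": "employeeType",
}

class LDAPUser(object):
    __slots__ = tuple(LDAP_FIELD_MAP)

    def to_ldap(self):
        # Convert the attributes that were actually set
        # into an LDAP-style dict for LDAP operations.
        return {ldap: getattr(self, attr)
                for attr, ldap in LDAP_FIELD_MAP.items()
                if hasattr(self, attr)}

user = LDAPUser()
user.given_name = "milton"
user.sn = "waddams"
print(user.to_ldap())  # {'givenName': 'milton', 'sn': 'waddams'}
```

A typo such as `user.given_nmae = ...` now raises AttributeError immediately, because the name is not in __slots__, while the hard-coded LDAP strings live in exactly one place.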
I have read a lot of discussion about dynamic variables in Python, and most of it says they are something you generally shouldn't use. However, I have a case where I really can't think of a way to avoid them (hence why I am here).
I am currently reading in a text file that stores information about the members of a store. Each member has their own information: phone number, email, points in their account, etc. I want to create a class whose objects store this information. Don't worry, this is just part of an assignment and they are not real people. Here is the sample code:
class Member:
    def __init__(self, name, phoneNumber, email, points):
        self.name = name
        self.phoneNumber = phoneNumber
        self.email = email
        self.points = points
        self.totalPointsSpent = 0

    # There are methods below that return calculated results, like total
    # points spent or ranking. I will not show those here as they are irrelevant.
Each member in the file will be read in and turned into an object. For example, if there are 5 members in the file, five objects will be created, and I want to name them member1, member2, etc. However, this will be confusing to access, as it will be hard to tell them apart, so I add them to a dictionary with their memberID as the key:
dictionary[memberID] = member1
dictionary[memberID] = member2  # and so on
which would result in a dictionary that looks something like this:
dictionary = {'jk1234':member1,'gh5678':member2,...#etc}
An interesting thing about Python is that a dictionary value does not need to be a plain value; it can be an object. This is something new to me, coming from Java.
However, here is the first problem. I do not know how many members there are in a file; or rather, the number of members varies from file to file. If the number is 5, I need 5 variables; if there are 8, I need 8, and so on. I thought of using a while loop, with code looking something like this:
a = len(#the number of members in the file)
i = 0
while i <= a:
    member + i = Member(#pass in the information)
but the '+' operator only combines strings, not identifiers, so this cannot work (I think).
Many solutions I read indicated that I should use a dictionary or a list in such a case. However, since my variables point to class objects, I could not think of a way to use a list/dictionary as the implementation.
My current solution is to use a tuple to store the members information, and do the calculations elsewhere (i.e. I took out the class methods and defined them as functions) so that the dictionary looks like this:
dictionary = {'jk1234': (name, phoneNumber, email, points),'gh5678':(name, phoneNumber, email, points),...#etc}
However, given the amount of information I need to pass in, this is less than ideal. My current algorithm works, but I want to optimize it and make it more encapsulated.
If you made it this far, I appreciate it, and I would like to know if there is a better approach to this problem. Please excuse my less than a year of Python experience.
EDIT: Okay, I just discovered something that might seem basic to experienced Python programmers: you can construct the values inside the dictionary literal instead of naming a variable first. For my problem, that would be:
dictionary = {'jk1234': Member(name, phoneNumber, email, points),'gh5678':Member(name, phoneNumber, email, points),...#etc}
#and they are retrievable:
dictionary['jk1234'].get_score() #get_score is a getter function inside the class
And it would return the proper value. This seems to be a good solution. However, I would still love to hear other ways to think about this problem
It looks like you are on the right track with your update.
It's not clear to me if your MemberID is a value for each member, but if it is, you can use it in a loop creating the objects
d = {}
for member in members_file:
    d[MemberID] = Member(name, phoneNumber, email, points)
This, of course, assumes that MemberID is unique, otherwise you will overwrite existing entries.
You could also use a list instead
l = []
for member in members_file:
    l.append(Member(name, phoneNumber, email, points))
Further, you could also do the above with list/dict comprehensions which are basically just condensed for-loops.
Dict-comprehension
d = {MemberID: Member(name, phoneNumber, email, points) for member in members_file}
List-comprehension
l = [Member(name, phoneNumber, email, points) for member in members_file]
Whether a list or a dictionary makes more sense, is up to your use-case.
Also note that the above assumes various things, e.g. on what form you get the data from the file, since you did not provide that information.
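To make the loop concrete, here is one possible end-to-end sketch. The file format is my own assumption (one comma-separated member per line, with the memberID first), since the question did not specify it:

```python
import io

class Member:
    def __init__(self, name, phoneNumber, email, points):
        self.name = name
        self.phoneNumber = phoneNumber
        self.email = email
        self.points = points

# Hypothetical file contents: memberID,name,phone,email,points per line
members_file = io.StringIO(
    "jk1234,John,555-0101,john@example.com,120\n"
    "gh5678,Grace,555-0102,grace@example.com,80\n"
)

d = {}
for line in members_file:
    memberID, name, phone, email, points = line.strip().split(",")
    d[memberID] = Member(name, phone, email, int(points))

print(d["jk1234"].points)  # 120
```

In a real program you would replace the StringIO with `open(filename)` and adapt the split to whatever delimiter the file actually uses.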
You can create a dictionary with a for loop:
d = {}
for i in the_members_in_the_file:
    d[i] = Member(name, phoneNumber, email, points)
You will get your expected dictionary.
I have, for example, the following class structure.
class Company(object):
    Companycount = 0
    _registry = {}

    def __init__(self, name):
        Company.Companycount += 1
        self._registry[Company.Companycount] = [self]
        self.name = name
k = Company("a firm")
b = Company("another firm")
Whenever I need the objects I can access them by using
Company._registry
which gives out a dictionary of all instances.
Do I need reasonable names for my objects since the name of the company is a class attribute, and I can iterate over Company._registry?
When loading the data from the database does it matter what the name of the instance (here k and b) is? Or can I just use arbitrary strings?
Both your Company._registry and the names k and b are just references to your actual instances. Neither play any role in what you'd store in the database.
Python's object model has all objects living on a big heap, and your code interacts with the objects via such references. You can make as many references as you like, and objects automatically are deleted when there are no references left. See the excellent Facts and myths about Python names and values article by Ned Batchelder.
You need to decide, for yourself, if the Company._registry structure needs to have names or not. Iteration over a list is slow if you already have a name for a company you wanted to access, but a dictionary gives you instant access.
If you are going to use an ORM, then you don't really need that structure anyway. Leave it to the ORM to help you find your objects, or give you a sequence of all objects to iterate over. I recommend using SQLAlchemy for this.
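As a sketch of the "dictionary with names" variant mentioned above (keying the registry by company name instead of a running counter is my own illustration):

```python
class Company(object):
    _registry = {}

    def __init__(self, name):
        self.name = name
        # Key the registry by name: O(1) lookup, no counter needed
        Company._registry[name] = self

k = Company("a firm")
b = Company("another firm")

print(Company._registry["a firm"] is k)  # True
```

The variable names k and b are now purely local conveniences; any code that needs an instance can fetch it from the registry by name, which is the point the answer makes about references.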
The names don't matter, but if you are going to initialize a lot of objects you should still organize access to them in some reasonable way.
I have a ZODB installation where I have to organize several million objects of about a handful of different types. I have a generic container class Table, which contains BTrees to index objects by attributes or combinations of these attributes. Data consistency is quite essential, and so I want to enforce, that the indices are automatically updated, when I write to any of the attributes, which are covered by the indexing. So a simple obj.a = x should be sufficient to calculate all new dependent index entries, check if there are any collisions, and finally write the indices and the value.
In general, I'd be happy to use a library for that, so I was looking at repoze.catalog and IndexedCatalog, but was not really happy with that. IndexedCatalog seems dead for quite a while, and not providing the kind of consistency for changes to the objects. repoze.catalog seems to be more used and active, but also not providing this kind of consistency, as far as I understand. If I missed something here, I'd love to hear about it and prefer reusing over reinventing.
So, as I see it, besides trying to find a library for the problem, I'd have to intercept write access to the data-object attributes with descriptors and let the Table class do the magic of changing the indices. For that, the descriptor instances have to know which Table instances they have to talk to. The current implementation goes something like this:
class DatabaseElement(Persistent):
    name = Property(constant_parameters)
    ...

class Property(object):
    ...
    def __set__(self, obj, val):
        # the descriptor protocol passes (instance, value);
        # the attribute name is stored on the descriptor itself
        val = self.check_value(val)
        setattr(obj, '_' + self.name, val)
When these DatabaseElement classes are generated, the database and the objects within are not yet created. So as mentioned in this nice answer, I'd probably have to create some singleton lookup mechanism, to find the Table objects, without handing them to Property as an instantiation argument. Is there a more elegant way? Persisting the descriptors itself? Any suggestions and best-practice examples welcome!
So I finally figured it out myself. The solution comes in three parts; no ugly singleton required. Table provides the logic to check for collisions, DatabaseElement gets the ability to look up the responsible Table without ugly workarounds, and Property takes care that the indices are updated before any indexed values are written. Here are some snippets; the main clue is the table lookup in DatabaseElement, which I also didn't see documented anywhere. Nice extra: it not only verifies writes to single values, I can also check for changes to several indexed values in one go.
class Table(PersistentMapping):
    ....
    def update_indices(self, inst, updated_values_dict):
        changed_indices_keys = self._check_collision(inst, updated_values_dict)
        original_keys = [inst.key(index) for index, tmp_key in changed_indices_keys]
        for (index, tmp_key), key in zip(changed_indices_keys, original_keys):
            self[index][tmp_key] = inst
            try:
                del self[index][key]
            except KeyError:
                pass

class DatabaseElement(Persistent):
    ....
    @property
    def _table(self):
        return self._p_jar and self._p_jar.root()[self.__class__.__name__]

    def _update_indices(self, update_dict, verify=True):
        if verify:
            update_dict = dict((key, getattr(type(self), key).verify(val))
                               for key, val in update_dict.items()
                               if key in self._key_properties)
        if not update_dict:
            return
        table = self._table
        table and table.update_indices(self, update_dict)

class Property(object):
    ....
    def __set__(self, obj, val):
        validated_val = self.validator(obj, self.name, val)
        if self.indexed:
            obj._update_indices({self.name: val}, verify=False)
        setattr(obj, self.hidden_name, validated_val)
I'm currently uniquifying a list of objects based on their name attribute by creating a dict of objects with the name value as the dict key, like this:
obj_dict = dict()
for obj in obj_list:
    if obj.name not in obj_dict:
        obj_dict[obj.name] = obj
new_obj_list = list(obj_dict.values())
And I was wondering if there was a quicker or more pythonic way to do this.
If two objects with the same name should always be considered identical, you could implement __eq__ and __hash__ accordingly. Then your solution would be as easy as storing all your objects in a set():
new_obj_list = list(set(obj_list))
Converting the set back to a list is probably not even necessary, since the order is lost anyway; unless you need to do something that only works with a list but not with a set, just keep using the set.
And if you need ordering:
oset = set()
new_obj_list = []
for o in obj_list:
    if o not in oset:
        oset.add(o)
        new_obj_list.append(o)
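Putting the __eq__/__hash__ suggestion together (the Obj class here is a hypothetical stand-in for whatever class the question is de-duplicating):

```python
class Obj:
    def __init__(self, name, data=None):
        self.name = name
        self.data = data

    def __hash__(self):
        return hash(self.name)

    def __eq__(self, other):
        return self.name == other.name

obj_list = [Obj("a", 1), Obj("b", 2), Obj("a", 3)]

# Building a set drops duplicates by name; the first-inserted
# instance of each name is the one the set keeps.
new_obj_list = list(set(obj_list))
print(sorted(o.name for o in new_obj_list))  # ['a', 'b']
```

Because adding an element that is already present leaves a set unchanged, the first occurrence wins, which matches the behaviour of the dict-based loop in the question.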
I would also have gone the set approach, but I figured you may want to be able to look up by name. Here's another approach that doesn't require amending the class (albeit a bit more expensive). Note that you could pass further sort parameters to sorted to order by other priorities (such as age or gender).
from operator import attrgetter
from itertools import groupby

unique_by_name = {key: next(group)
                  for key, group in groupby(sorted(obj_list, key=attrgetter('name')),
                                            key=attrgetter('name'))}
There are also unique_everseen / unique_justseen recipes in the itertools documentation.
Otherwise, for different ordering requirements, expose them as different class methods (or one method called 'sort' that takes a known order and calls an internal function).