Are there reasons to use get/put methods instead of item access? - python

I find that I have recently been implementing Mapping interfaces on classes which on the surface fit the model (they are essentially just key-value stores with no more meta-data), but underneath they are sometimes quite complex.
Here are a couple of examples of increasing severity:
An object which wraps another mapping converting all objects to strings when set.
An object which uses a local database as a back-end to store the key-value pairs.
An object which makes HTTP requests to remote servers to get/set data.
Let's suppose all of these examples seamlessly implement the Mapping interface, and the only indication that there is something fishy going on is that item access could potentially take a few seconds, and an item may not be retrievable in the same form as stored (if at all). I'm perfectly content with something like the first example, pretty okay with the second, but I'm getting kinda uncomfortable with the last.
The question is, is there a line at which the API for these models should not use item access, even though the underlying structure may feel like it fits on the surface?
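For concreteness, the first example could be little more than a thin MutableMapping wrapper; here is a rough sketch (the class and attribute names are just made up for illustration, using the Python 3 import path):
from collections.abc import MutableMapping

class StringifyingDict(MutableMapping):
    """Wraps another mapping and coerces every value to a string on set."""

    def __init__(self, backing=None):
        self._backing = {} if backing is None else backing

    def __setitem__(self, key, value):
        self._backing[key] = str(value)   # values come back as strings

    def __getitem__(self, key):
        return self._backing[key]

    def __delitem__(self, key):
        del self._backing[key]

    def __iter__(self):
        return iter(self._backing)

    def __len__(self):
        return len(self._backing)
The database-backed and HTTP-backed versions would expose the same interface; only the bodies of these five methods change, and with them the latency and failure modes.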

It sounds like you are describing the standard anydbm module semantics. And just as anydbm can raise the exception anydbm.error, so too could your subclass raise derivatives like MyDbmTimeoutError as needed. Whether you implement it with dictionary operations or function calls, the caller will still have to contend with exceptions anyway (e.g. KeyError, NameError).
I think the fact that arbitrary "tied" hashes exist in Python 2 and 3.x is decent justification for calling this a reasonable approach. Indeed, I've been looking for (and designing in my head) more complex bindings than simple key ⇒ value mappings, without a heavy ORM SQL layer in between.
added: The more I think about it, the more Pythonic tied dictionaries seem to be. A key ⇒ value collection is a dictionary. Whether it lives in core or on disk or across the network is an implementation detail that is best abstracted away. The only substantive differences are increased latency and possible unavailability; however, on a virtual-memory OS, what looks like "core" may actually be paged out and carry higher latency than RAM, and on a multiprocessing OS "core" can become unavailable too. So these are differences in degree only, not kind.

From a strictly philosophical point of view, I don't think there is a line you can cross with this. If some tool provides the needed functionality but its API is different, adapt away. The only time you shouldn't do this is if the adapted-to API simply is not expressive enough to manipulate the adapted component in the way that is needed.
I wouldn't hesitate to adapt a database into a dict, because that's a great way to manipulate collections, and it's already compatible with a heck of a lot of other code. If I find that my particular application must call the database connection's begin(), commit(), and rollback() methods to work right, then a dict won't do, since dicts don't have transaction semantics.

Related

How does Python provide a maintainable way to pass data structures around in a system?

I am new to dynamic languages in general, and I have discovered that languages like Python prefer simple data structures, like dictionaries, for sending data between parts of a system (across functions, modules, etc).
In the C# world, when two parts of a system communicate, the developer defines a class (possibly one that implements an interface) that contains properties (like a Person class with a Name, Birth date, etc.) where the sender in the system instantiates the class and assigns values to the properties. The receiver then accesses these properties. The class is called a DTO and it is "well-defined" and explicit. If I remove a property from the DTO's class, the compiler will instantly warn me of all parts of the code that use that DTO and attempt to access what is now a non-existent property. I know exactly what has broken in my codebase.
In Python, functions that produce data (senders) create implicit DTOs by building up dictionaries and returning them. Coming from a compiled world, this scares me. I immediately think of the scenario of a large code base where a function producing a dictionary has the name of a key changed (or a key is removed altogether) and boom, tons of potential KeyErrors begin to crop up as pieces of the code base that work with that dictionary and expect a key are no longer able to access the data they were expecting. Without unit testing, the developer would have no reliable way of knowing where these errors will appear.
Maybe I misunderstand altogether. Are dictionaries a best practice tool for passing data around? If so, how do developers solve this kind of problem? How are implicit data structures and the functions that use them maintained? How do I become less afraid of what seems like a huge amount of uncertainty?
Coming from a compiled world, this scares me. I immediately think of the scenario of a large code base where a function producing a dictionary has the name of a key changed (or a key is removed altogether) and boom, tons of potential KeyErrors begin to crop up as pieces of the code base that work with that dictionary and expect a key are no longer able to access the data they were expecting.
I would just like to highlight this part of your question, because I feel this is the main point you are trying to understand.
Python's development philosophy is a bit different: since objects can mutate without throwing errors (for example, you can add properties to instances without having them declared in the class), a common programming practice in Python is EAFP:
EAFP
Easier to ask for forgiveness than permission. This common Python coding style assumes the existence of valid keys or attributes and catches exceptions if the assumption proves false. This clean and fast style is characterized by the presence of many try and except statements. The technique contrasts with the LBYL style common to many other languages such as C.
The LBYL referred to in the quote above is "Look Before You Leap":
LBYL
Look before you leap. This coding style explicitly tests for pre-conditions before making calls or lookups. This style contrasts with the EAFP approach and is characterized by the presence of many if statements.
In a multi-threaded environment, the LBYL approach can risk introducing a race condition between “the looking” and “the leaping”. For example, the code if key in mapping: return mapping[key] can fail if another thread removes key from mapping after the test, but before the lookup. This issue can be solved with locks or by using the EAFP approach.
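To make the contrast concrete, here is a tiny illustrative sketch (the mapping and key are just placeholders):
mapping = {'name': 'Alice'}

# LBYL: test first, then look up (racy if another thread can mutate `mapping`).
if 'name' in mapping:
    print(mapping['name'])

# EAFP: just attempt the lookup and handle the failure.
try:
    print(mapping['name'])
except KeyError:
    print('no name available')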
So I would say this is a bit of the norm; in Python you expect objects to behave well and handle themselves with grace (mainly by throwing lots of exceptions). Traditional "object hiding" and "interface contracts" are not what Python is about. As with learning anything else, you have to acclimate to the programming environment and its rules.
The other part of your question:
Are dictionaries a best practice tool for passing data around? If so, how do developers solve this kind of problem?
The answer here depends on your problem domain. If your problem domain does not lend itself to custom objects, then you can pass around any kind of container (lists, tuples, dictionaries). If, however, what you have to pass around is decorated data ("rich" data), then objects are the better fit, at the cost of your code becoming littered with classes that don't define behavior but rather properties of things.
Oh, by the way - this business of getting keys and raising KeyError is already solved, as Python dictionaries have a get method, which can return a default value (it returns the sentinel None object by default) when a key doesn't exist:
>>> d = {'a': 'b'}
>>> d['b']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'b'
>>> d.get('b')  # returns None, the only object the interactive interpreter does not print
>>> d.get('b', 'default')
'default'
When using Python for a large project, automated testing is a must; otherwise you would never dare to do any serious refactoring, and the code base would rot in no time because every change would try to touch as little as possible, leading to bad solutions (simply because you'd be too scared to implement the correct solution instead).
Indeed the above is true even with C++ or, as it often happens with large projects, with mixed-languages solutions.
No more than a few hours ago, for example, I had to make a branch for a four-line bugfix (one of the lines is a single brace) for a specific customer, because the trunk has evolved too far from the version he has in production, and the guy in charge of the release process told me his use cases have not yet been covered with manual testing in the current version, so I cannot upgrade his installation.
The compiler can tell you something, but it cannot provide any confidence in stability after a refactoring if the software is complex. The idea that if some piece of code compiles then it's correct is bogus (possibly with the exception of hello_world.cpp).
That said, you normally don't use dictionaries in Python for everything, unless you really care about the dynamic aspect (but in that case the code doesn't access the dictionary with a literal key). If your Python code has a lot of d["foo"] instead of d[k] when using dicts, then I'd say there is a smell of a design problem.
I don't think passing dictionaries is the only way to pass structured data across parts of a system. I've seen lots of people use classes for that. Actually namedtuple is a good fit for that as well.
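For example, a namedtuple gives you an explicit, named structure with almost no ceremony (the Person fields below are just an illustration):
from collections import namedtuple

# An explicit, lightweight "DTO": attribute access instead of string keys,
# so a typo like person.nmae fails loudly with AttributeError.
Person = namedtuple('Person', ['name', 'birth_date'])

def load_person():
    return Person(name='Ada Lovelace', birth_date='1815-12-10')

person = load_person()
print(person.name, person.birth_date)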
Without unit testing, the developer would have no reliable way of knowing where these errors will appear.
Now why would you not write unit tests?
In Python you don't rely on a compiler to catch your errors. If you really need static checking of your code, you can use one of several static analysis tools out there (see this question)

is it a good practice to store data in memory in a django application?

I am writing a reusable Django application that returns a JSON result for a jQuery UI autocomplete.
Currently I am storing the class/function for getting the result in a dictionary, with a unique key for each class/function.
When a request comes in, I select the corresponding class/function from the dict and return the output.
My question is whether the above is best practice, or whether there are other tricks to obtain the same result.
Sample GIST : https://gist.github.com/ajumell/5483685
You seem to be talking about a form of memoization.
This is OK, as long as you don't rely on that result being in the dictionary. This is because the memory will be local to each process, and you can't guarantee subsequent requests being handled by the same process. But if you have a fallback where you generate the result, this is a perfectly good optimization.
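A sketch of that pattern, with a stand-in generate_result() as the fallback (both names are hypothetical):
_cache = {}  # per-process; a different worker process may not have this entry yet

def generate_result(key):
    # Stand-in for the expensive work (DB query, rendering, ...).
    return key.upper()

def get_result(key):
    try:
        return _cache[key]               # fast path: already memoized in this process
    except KeyError:
        result = generate_result(key)    # fallback: regenerate the value
        _cache[key] = result
        return result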
That's a very general question. It primarily depends on the structure of your code: the way your classes and models are defined and the dynamics of the application.
Second, it is important to take into account the resources of the server where your application is running: how much memory and how much disk space you have available, so you can judge what would work better for the application.
Last but not least, it's important to consider how much work it takes to put all these resources in memory. Memory is volatile, so if your application restarts you will have to instantiate all the classes again, and maybe that is too much work.
To sum up: as an optimization it is a very good choice to keep objects that are queried often in memory (that's what caching is all about), but you have to take all of the above into account.
Storing a series of functions in a dictionary and conditionally selecting one based on the request is a perfectly acceptable way to handle it.
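A minimal sketch of that dispatch pattern (the handler names and data are invented for illustration):
def autocomplete_cities(term):
    cities = ('Cairo', 'Calgary', 'Canberra')
    return [c for c in cities if c.lower().startswith(term.lower())]

def autocomplete_users(term):
    users = ('alice', 'ajumell', 'bob')
    return [u for u in users if u.startswith(term.lower())]

# Registry keyed by a unique name, as described in the question.
HANDLERS = {
    'cities': autocomplete_cities,
    'users': autocomplete_users,
}

def handle_request(kind, term):
    handler = HANDLERS[kind]   # select the function for this request
    return handler(term)

print(handle_request('cities', 'ca'))   # ['Cairo', 'Calgary', 'Canberra']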
If you would like a more specific answer it would be very helpful to post your actual code. And secondly, this might be better suited to codereview.stackexchange

Examples of use for PickledObjectField (django-picklefield)?

Surfing the web and reading about Django dev best practices, the advice I keep finding is to use pickled model fields with extreme caution.
But in a real life example, where would you use a PickledObjectField, to solve what specific problems?
We have a system of social-network "backends" which do some generic stuff like "post message", "get status", "get friends", etc. The link between each backend class and a user is a Django model, which keeps the user, the backend name, and the credentials. Now imagine how many auth systems there are: OAuth, plain passwords, Facebook's obscure JS stuff, etc. This is where JSONField shines: we keep all backend-specific auth data in a dictionary on this model, stored in the DB as JSON, and we can put anything into it, no problem.
You would use it to store... almost-arbitrary Python objects. In general there's little reason to use it; JSON is safer and more portable.
You can definitely substitute a PickledObjectField with JSON and some extra logic to create an object out of the JSON. At the end of the day, your use case, whether you choose a PickledObjectField or JSON+logic, is serializing a Python object into your database. If you can trust the data in the Python object, and know that it will always be serializable, you can reasonably use the PickledObjectField.
In my case (I don't use Django's ORM, but this should still apply), I have a couple of different object types that can go into my PickledObjectField, and their definitions are constantly mutating. Rather than constantly updating my JSON parsing logic to create an object out of JSON values, I simply use a PickledObjectField to store the different objects, and then later retrieve them in perfectly usable form (calling their functions).
Caveat: if you store an object via PickledObjectField, then change the object's definition, and then retrieve the object, the old object may have trouble fitting into the new object's definition (depending on what you changed).
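As a rough sketch of what that looks like on a model (assuming django-picklefield is installed; the model and field names are invented):
from django.db import models
from picklefield.fields import PickledObjectField

class SocialBackendLink(models.Model):
    user = models.ForeignKey('auth.User', on_delete=models.CASCADE)
    backend_name = models.CharField(max_length=64)
    # Arbitrary, backend-specific Python object (dict, token object, ...),
    # pickled into a single database column.
    credentials = PickledObjectField()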
The problems to be solved are the efficiency and the convenience of defining and handling a complex object consisting of many parts.
You can turn each part type into a Model and connect them via ForeignKeys.
Or you can turn each part type into a class, dictionary, list, tuple, enum or what have you, to your liking, and use PickledObjectField to store and retrieve the whole beast in one step.
That approach makes sense if you will never manipulate parts individually, only the complex object as a whole.
Real life example
In my application there are RQdef objects that represent essentially a type with a certain basic structure (if you are curious what they mean, look here).
RQdefs consist of several Aspects and some fixed attributes.
Aspects consist of one or more Facets and some fixed attributes.
Facets consist of two or more Levels and some fixed attributes.
Levels consist of a few fixed attributes.
Overall, a typical RQdef will have about 20-40 parts.
An RQdef is always completely constructed in a single step before it is stored in the database and it is henceforth never modified, only read (but read frequently).
PickledObjectField is more convenient and much more efficient for this purpose than would be a set of four models and 20-40 objects for each RQdef.

Should properties do nontrivial initialization?

I have an object that is basically a Python implementation of an Oracle sequence. For a variety of reasons, we have to get the nextval of an Oracle sequence, count up manually when determining primary keys, then update the sequence once the records have been inserted.
So here's the steps my object does:
Construct an object, with a key_generator attribute initially set to None.
Get the first value from the database, passing it to an itertools.count.
Return keys from that generator using a property next_key.
I'm a little bit unsure about where to do step 2. I can think of three possibilities:
Skip step 1 and do step 2 in the constructor. I find this evil because I tend to dislike doing this kind of initialization in a constructor.
Make next_key get the starting key from the database the first time it is called. I find this evil because properties are typically assumed to be trivial.
Make next_key into a get_next_key method. I dislike this because properties just seem more natural here.
Which is the lesser of 3 evils? I'm leaning towards #2, because only the first call to this property will result in a database query.
I think your doubts come from PEP-8:
Note 3: Avoid using properties for computationally expensive operations; the attribute notation makes the caller believe that access is (relatively) cheap.
Adherence to a standard behavior is usually quite a good idea, and this would be a reason to scrap solution #2.
However, if you feel the interface is better with property than a method, then I would simply document that the first call is more expensive, and go with that (solution #2).
In the end, recommendations are meant to be interpreted.
I agree that attribute access and everything that looks like it (i.e. properties in the Python context) should be fairly trivial. If a property is going to perform a potentially costly operation, use a method to make this explicit. I recommend a name like "fetch_XYZ" or "retrieve_XYZ", since "get_XYZ" is used in some languages (e.g. Java) as a convention for simple attribute access, is quite generic, and does not sound "costly" either.
A good guideline is: If your property could throw an exception that is not due to a programming error, it should be a method. For example, throwing a (hypothetical) DatabaseConnectionError from a property is bad, while throwing an ObjectStateError would be okay.
Also, if I understood you correctly, you want to return the next key whenever the next_key property is accessed. I strongly recommend against having side-effects (apart from caching, cheap lazy initialization, etc.) in your properties. Properties (and attributes for that matter) should be idempotent.
I've decided that the key smell in the solution I'm proposing is that the property I was creating contained the word "next" in it. Thus, instead of making a next_key property, I've decided to turn my DatabaseIntrospector class into a KeyCounter class and implement the iterator protocol (i.e. make a plain old next method that returns the next key).
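Roughly along these lines; a sketch with the database call stubbed out (the sequence name and connection handling are placeholders, and it uses the Python 3 spelling __next__):
import itertools

class KeyCounter:
    """Hands out sequential primary keys, starting from the sequence's next value."""

    def __init__(self, connection):
        self._connection = connection
        self._counter = None                 # lazily initialized on first use

    def _fetch_start_value(self):
        # Stand-in for the real query against the Oracle sequence.
        cursor = self._connection.cursor()
        cursor.execute("SELECT my_seq.nextval FROM dual")
        return cursor.fetchone()[0]

    def __iter__(self):
        return self

    def __next__(self):
        if self._counter is None:
            self._counter = itertools.count(self._fetch_start_value())
        return next(self._counter)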

Python Disk-Based Dictionary

I was running some dynamic programming code (trying to brute-force disprove the Collatz conjecture =P) and I was using a dict to store the lengths of the chains I had already computed. Obviously, it ran out of memory at some point. Is there any easy way to use some variant of a dict which will page parts of itself out to disk when it runs out of room? Obviously it will be slower than an in-memory dict, and it will probably end up eating my hard drive space, but this could apply to other problems that are not so futile.
I realized that a disk-based dictionary is pretty much a database, so I manually implemented one using sqlite3, but I didn't do it in any smart way and had it look up every element in the DB one at a time... it was about 300x slower.
Is the smartest way to just create my own set of dicts, keeping only one in memory at a time, and paging them out in some efficient manner?
The 3rd party shove module is also worth taking a look at. It's very similar to shelve in that it is a simple dict-like object; however, it can store to various backends (such as file, SVN, and S3), provides optional compression, and is even threadsafe. It's a very handy module.
from shove import Shove
mem_store = Shove()                   # in-memory store
file_store = Shove('file://mystore')  # file-backed store
file_store['key'] = 'value'
Hash-on-disk is generally addressed with Berkeley DB or something similar - several options are listed in the Python Data Persistence documentation. You can front it with an in-memory cache, but I'd test against native performance first; with operating system caching in place it might come out about the same.
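For instance, the standard library's dbm module (anydbm in Python 2) already behaves like a persistent, string-keyed dict backed by a hash-on-disk database; a quick sketch (the filename is arbitrary, and note that values come back as bytes):
import dbm

with dbm.open('chain_lengths', 'c') as db:   # 'c' creates the file if needed
    db['27'] = '111'       # written to disk, not held in memory
    print(db['27'])        # b'111'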
The shelve module may do it; at any rate, it should be simple to test. Instead of:
self.lengths = {}
do:
import shelve
self.lengths = shelve.open('lengths.shelf')
The only catch is that keys to shelves must be strings, so you'll have to replace
self.lengths[indx]
with
self.lengths[str(indx)]
(I'm assuming your keys are just integers, as per your comment to Charles Duffy's post)
There's no built-in caching in memory, but your operating system may do that for you anyway.
[actually, that's not quite true: you can pass the argument 'writeback=True' on creation. The intent of this is to make sure storing lists and other mutable things in the shelf works correctly. But a side-effect is that the whole dictionary is cached in memory. Since this caused problems for you, it's probably not a good idea :-) ]
Last time I was facing a problem like this, I rewrote to use SQLite rather than a dict, and had a massive performance increase. That performance increase was at least partially on account of the database's indexing capabilities; depending on your algorithms, YMMV.
A thin wrapper that does SQLite queries in __getitem__ and __setitem__ isn't much code to write.
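For example, something like this sketch (the table, column, and file names are arbitrary), which keeps lookups indexed via the primary key:
import sqlite3

class SQLiteDict:
    """A minimal dict-like wrapper around a single-table SQLite database."""

    def __init__(self, path='lengths.db'):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            'CREATE TABLE IF NOT EXISTS kv (key INTEGER PRIMARY KEY, value INTEGER)')

    def __setitem__(self, key, value):
        self._conn.execute(
            'INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)', (key, value))
        self._conn.commit()   # committing per write is simple but slow; batch if needed

    def __getitem__(self, key):
        row = self._conn.execute(
            'SELECT value FROM kv WHERE key = ?', (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __contains__(self, key):
        return self._conn.execute(
            'SELECT 1 FROM kv WHERE key = ?', (key,)).fetchone() is not None
Committing on every assignment keeps it safe but slow; batching writes and committing periodically is usually where the real speedup comes from.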
With a little bit of thought it seems like you could get the shelve module to do what you want.
I've read you think shelve is too slow and you tried to hack your own dict using sqlite.
Someone else did this too:
http://sebsauvage.net/python/snyppets/index.html#dbdict
It seems pretty efficient (and sebsauvage is a pretty good coder). Maybe you could give it a try?
You should bring more than one item at a time if there's some heuristic to know which are the most likely items to be retrieved next, and don't forget the indexes like Charles mentions.
For simple use cases, sqlitedict can help. However, when you have much more complex databases you might want to try one of the more upvoted answers.
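A quick sketch of how sqlitedict is used (assuming pip install sqlitedict; the filename is arbitrary, and keys are stored as strings):
from sqlitedict import SqliteDict

lengths = SqliteDict('lengths.sqlite', autocommit=True)  # dict-like, persisted to disk
lengths['27'] = 111
print(lengths['27'])
lengths.close()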
It isn't exactly a dictionary, but the vaex module provides incredibly fast dataframe loading and lookup that is lazy-loading so it keeps everything on disk until it is needed and only loads the required slices into memory.
https://vaex.io/docs/tutorial.html#Getting-your-data-in
