Pythonic Django object reuse

I've been racking my brain on this for the last few weeks and I just can't seem to understand it. I'm hoping you folks here can give me some clarity.
A LITTLE BACKGROUND
I've built an API to help serve a large website and, like all of us, am trying to keep the API as efficient as possible. Part of this efficiency is to NOT create an object that contains custom business logic over and over again (example: a service class) as requests are made. To give some personal background, I come from the Java world, so I'm used to using IoC or DI to handle object creation and injection into my classes, ensuring classes are NOT created over and over on a per-request basis.
WHAT I'VE READ
While looking at many Python IoC and DI posts I've become rather confused about how best to create a given class without having to worry about the server getting overloaded with too many objects based on the number of requests it may be handling.
Some people say an IoC or DI container really isn't needed. But as I run my Django app I find that unless I construct the object I want globally (top of file) for views.py to use later, rather than within each view class or def within views.py, I run the risk of creating multiple instances of the same class, which from what I understand would cause memory bloat on the server.
So what's the right, Pythonic way to keep objects from being built over and over? Should I invest in using IoC / DI or not? Can I safely rely on setting up my service.py files to just contain defs instead of classes that contain defs? Is the garbage collector just THAT efficient, so I don't even have to worry about it?
I've purposely not placed any code in this post since this seems like a general question, but I can provide a few code examples if that helps.
Thanks
From a confused engineer who wants to be as Pythonic as possible

You come from a background where everything needs to be a class. I've programmed web apps in Java too, and sometimes it's harder to unlearn old things than to learn new things; I understand.
In Python / Django you wouldn't make anything a class unless you need many instances and need to keep state.
For a service that's hardly the case, and you'll notice that in Java-like web apps some services are made singletons, which is just a workaround and a rather big anti-pattern in Python.
Pythonic
Python is flexible enough so that a "services class" isn't required, you'd just have a Python module (e.g. services.py) with a number of functions, emphasis on being a function that takes in something, returns something, in a completely stateless fashion.
# services.py
# This is a module: it doesn't keep any state within.
# It may read and write to the DB, do some processing, etc., but it doesn't remember things.
from .models import Score

def get_scores(student_id):
    return Score.objects.filter(student=student_id)

# views.py
# Receives HTTP requests.
from django.shortcuts import render

from . import services

def view_scores(request, student_id):
    scores = services.get_scores(student_id)
    # e.g. use the scores queryset in a template and return an HTML page
    return render(request, 'scores.html', {'scores': scores})
Notice how if you need to swap out the service, you'll just be swapping out a single Python module (just a file really), so Pythonistas hardly bother with explicit interfaces and other abstractions.
Memory
Now, per each Django worker process, you'd have that one services module, which is used over and over for all requests that come in; when the Score queryset is no longer referenced in memory, it will be cleaned up.
I saw your other post, and, well, instantiating a ScoreService object for each request, or keeping an instance of it in the global scope, is just unnecessary; the above example does the job with one module in memory and doesn't need us to be clever about it.
And if you did need to keep state between several requests, keeping it in live instances of ScoreService would be a bad idea anyway, because every user might need their own instance; that's not viable (too many live objects keeping context). Not to mention that such an instance is only accessible from the same process unless you have some sharing mechanism in place.
Keep state in a datastore
In case you want to keep state between requests, you'd keep the state in a datastore; when the next request comes in, you hit the services module again to get the context back from the datastore, pick up where you left off, do your business, and return your HTTP response. Unused things will then get garbage collected.
The emphasis is on keeping things stateless: any given HTTP request can be processed by any given Django process, and all state objects are garbage collected after the response is returned and they go out of scope.
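As a rough sketch of this datastore idea, here is one possible shape using Django's low-level cache API (just one possible datastore; build_step_context and advance_step are hypothetical service functions, not anything from the question):
# A sketch only: assumes Django's cache framework is configured (LocMemCache,
# memcached, Redis, ...); the service helpers below are hypothetical.
from django.core.cache import cache
from django.http import JsonResponse

from . import services

def wizard_step(request, session_key):
    context = cache.get(session_key)                         # get the context back from the datastore
    if context is None:
        context = services.build_step_context()              # hypothetical: start fresh
    context = services.advance_step(context, request.POST)   # hypothetical: do your business
    cache.set(session_key, context, timeout=600)             # stash state for the next request
    return JsonResponse(context)                             # respond; locals are garbage collected afterwards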
This may not be the fastest request/response cycle we can pull off, but it's scalable as hell.
Look at some major web apps written in Django
I suggest you look at some open source Django projects and look at how they're organized, you'll see a lot of the things you're busting your brains with, Djangonauts just don't bother with.

Related

How to Design a Cache that Forgets (almost only) on Memory Pressure in Python?

Let's say I have a URL class in Python. The URL class represents a URL and offers a method to download its contents. It also caches that content to speed up subsequent calls.
import requests
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class URL:
    url: str
    _cache: Optional[bytes] = None

    def get_content(self) -> bytes:
        if self._cache is None:
            # object.__setattr__ is needed because the dataclass is frozen
            object.__setattr__(self, "_cache", requests.get(self.url).content)
        return self._cache
This code has worked fine so far. Now I am asked to download huge numbers of URLs (for sane reasons). To prevent the possibility of misuse, I want to support use cases where every instance of URL stays alive until all URLs are downloaded.
Having huge numbers of URL instances alive and cached will lead to memory exhaustion. Now I am wondering how I could design a cache that forgets only when there is memory pressure.
I considered following alternatives:
A WeakValueDictionary forgets as soon as the last strong reference is dropped. This is not helpful here, as it leads either to the current situation or to disabling the cache.
An LRUCache requires deciding on the capacity up front. However, the cache I need is only helpful if it caches as many elements as possible; whether I cache 100 or 1,000 out of 100,000 is not important.
tl;dr: How could I implement a cache in Python that holds weak references and drops them only rarely, under memory pressure?
Updates:
I have no clear criteria in mind. I expect that others have developed good solutions I just do not know of. In Java, I suspect SoftReferences would be an acceptable solution.
MisterMyagi found a good wording:
Say the URL cache would evict an item if some unrelated, numeric computation needs memory?
I want to keep elements as long as possible, but free them when any other code in the same Python process needs the memory.
Maybe a solution would be to drop instances of URL only via the garbage collector. Then I could try to configure the garbage collector accordingly. But maybe someone has come up with a more clever idea already, so I can avoid reinventing the wheel.
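For what it's worth, here is a minimal sketch (not from the original post) of the "evict only on memory pressure" idea, assuming the third-party psutil package is available; the high-water mark and LRU eviction policy are illustrative choices, not the answer to the question:
# Sketch: a dict-backed cache that evicts its least-recently-used entries
# once system memory usage crosses a high-water mark.
from collections import OrderedDict

import psutil

class MemoryPressureCache:
    def __init__(self, high_watermark=90.0):
        self._data = OrderedDict()
        self._high_watermark = high_watermark    # percent of RAM in use

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)          # treat access as "recently used"
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        # Evict oldest entries while the system is under memory pressure.
        while len(self._data) > 1 and psutil.virtual_memory().percent > self._high_watermark:
            self._data.popitem(last=False)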
I found a nice solution for the concrete problem I was facing. Unfortunately, this is not a solution to the concrete question, which is concerned only with a cache. Thus, I will not accept my answer. However, I hope it will be inspiring for others.
In my application, there is a storage backend which will call url.get_content(). Using the storage backend looks roughly like this:
storage = ConcreteStorageBackend()
list_of_urls = [URL(url_str) for url_str in list_of_url_strings]
for url in list_of_urls:
    storage.add_url(url)
storage.sync()  # triggers an internal loop that calls get_content() on all URLs
It is easy to see that when list_of_urls is huge, caching get_content() can cause memory problems. The solution here is to replace (or manipulate) the URL objects such that the new URL objects retrieve the data from the storage backend.
Replacing them would be cleaner. The usage could be something like list_of_urls = storage.sync(), where storage.sync() returns new URL instances.
Overwriting URL.get_content() with a new function would allow the user to remain completely ignorant of the performance considerations here.
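A sketch of what the "replace the URL objects" variant could look like (StorageBackedURL and the storage.get_content / storage.store_content methods are assumptions about the backend, not code from my application):
# Sketch only: storage.sync() could hand back wrappers like this one.
import requests

class StorageBackedURL:
    def __init__(self, url, storage):
        self.url = url
        self._storage = storage

    def get_content(self) -> bytes:
        cached = self._storage.get_content(self.url)       # assumed backend lookup
        if cached is not None:
            return cached
        content = requests.get(self.url).content            # fall back to downloading
        self._storage.store_content(self.url, content)      # assumed backend write
        return content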
In both cases, caching means avoiding downloading the content again and instead obtaining it from the storage backend. For this to pay off, getting data from the storage backend has to be faster than downloading the content again, but I assume this is the case quite often.
Caching by means of a storage backend leads to another cool benefit in many cases: the OS kernel itself takes care of the cache. I assume that most storage backends will eventually write to disk, e.g. to a file or to a database. At least the Linux kernel caches disk IO in free RAM. Thus, as long as your RAM is sufficiently big, you'll get most of the url.get_content() calls at the speed of RAM. However, the cache does not block other, more important memory allocations, e.g. for some unrelated numeric computation as mentioned in one comment.
Again, I am aware that this does not exactly match the question above, as it relies on a storage backend and does not fall back to downloading the data from the internet again. I hope this answer is helpful for others anyway.

How are the objects in the classic single responsibility principle example supposed to communicate?

I'm facing potentially a refactoring project at work and am having a little bit of trouble grasping the single responsibility principle example that shows up on most websites. It is the one regarding separating the connection and send/receive methods of a modem into two different objects. The project is in Python, by the way, but I don't think it is a language-specific issue.
Currently I'm working to break up a 1300 line web service driver class that somebody created (arbitrarily split into two classes but they are essentially one). On the level of responsibility I understand I need to break the connectivity, configuration, and XML manipulation responsibilities into separate classes. Right now all is handled by the class using string manipulations and the httplib.HTTPConnection object to handle the request.
So according to this example I would have a class to handle only the http connection, and a class to transfer that data across that connection, but how would these communicate? If I require a connection to be passed in when constructing the data transfer class, does that re-couple the classes? I'm just having trouble grasping how this transfer class actually accesses the connection that has been made.
With a class that huge (> 1000 lines of code), you have more to worry about than only the SRP or the DIP. I have (or rather, "I fight") classes of similar size, and from my experience you have to write unit tests where possible. Carefully refactor (very carefully!). Automated testing is your friend, be it unit testing as mentioned, or regression testing, integration testing, acceptance testing, or whatever you are able to execute automatically. Then refactor. And then run the tests. Refactor again. Test. Refactor. Test.
There is a very good book that describes this process: Michael Feathers's "Working Effectively with Legacy Code". Read it.
For example, draw a picture that shows the dependencies of all methods and members of this class. That might help you identify different "areas" of responsibility.
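To illustrate the constructor-injection question raised above (a sketch, not part of the original answer; http.client is the Python 3 name for httplib, and XmlTransfer is a made-up class): passing the connection in couples the transfer class only to the small interface it actually uses, so the two stay separately testable.
# Sketch: the transfer class receives a connection instead of creating one,
# so it only depends on request()/getresponse(), not on a concrete class.
from http.client import HTTPConnection

class XmlTransfer:
    def __init__(self, connection):
        self._connection = connection    # injected, not constructed here

    def send(self, path, xml_payload):
        self._connection.request("POST", path, body=xml_payload,
                                 headers={"Content-Type": "application/xml"})
        return self._connection.getresponse().read()

# The caller (or a small factory) decides which connection to construct:
transfer = XmlTransfer(HTTPConnection("example.com"))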

Shared Memory for Python Classes

So as I was learning about classes in Python, I was taught that class-level attributes are shared among all instances of a given class. That's something I don't think I've seen in any other language before.
As a result, I could have multiple instances of, say, a DB class pulling DB data and dumping it into a class-level attribute. Then any instance that needed any of that data would have access to it without having to go to a cache or saved file to get it.
Currently I'm debugging an analytics class that grabs DB data via an inefficient means, one I'm trying to make much faster. Right now it takes several minutes for the DB data to load. And the format I've chosen for the data, with ndarrays and such, doesn't want to save to a file via numpy.save (I don't remember the error right now). Every time I make a little tweak, the data is lost and I have to wait several minutes for it to reload.
So the thought occurred to me that I could create a simple class for holding that data. A class I wouldn't need to alter that could run under a separate iPython console (I'm using Anaconda, Python 2.7, and Spyder). That way I could link the analytics class to the shared data class in the init of the analytics class. Something like this:
def __init__(self):
    self.__shared_data = SharedData()
    self.__analytics_data_1 = self.__shared_data['analytics_data_1']
The idea would be that I would then write to self.__analytics_data_1 inside the analytics class methods, and it would automatically get updated in the shared data class. I would have an iPython console open to do nothing more than hold an instance of the shared data class during debugging. That way, when I have to reload and reinstantiate the analytics class, it just gets whatever data I've already captured. Obviously, if there is an issue with the data itself then it will need to be removed manually.
Obviously, I wouldn't want to use the same shared data class for every tool I built, but that's simple to get around. The reason I mention all of this is that I couldn't find a recipe for something like this online. There seem to be plenty of recipes out there in general, so I'm thinking the lack of one might be a sign this is a bad idea.
I have attempted to achieve something similar via Memcache on a PHP project. However, in that project I had a lot of reads and writes, and it appeared that the code was causing some form of write collision, so data wasn't getting updated. (And Memcache doesn't guarantee the data will even be there.) Preventing this write collision meant a lot of extra code and extra processing time, and ultimately the code got to be too slow to be useful. I thought I might attempt it again with this shared memory in Python, as well as using the shared memory for debugging purposes.
Thoughts? Warnings?
If you want to share data across your class instances, you could do:
class SharedDataClass(object):
    shared_data = SharedData()
This SharedData instance will be shared between your class instances unless you override it in the constructor.
Careful: the GIL may affect performance!
Sharing data "between consoles" or between processes is a different story. They're separate processes, and, of course, class attributes are not shared across them. In that case you need IPC (it could be the filesystem, a database, a process with an open socket to connect to, etc.).

is it a good practice to store data in memory in a django application?

I am writing a reusable Django application that returns a JSON result for a jQuery UI autocomplete.
Currently I am storing the class/function for getting the result in a dictionary, with a unique key for each class/function.
When a request comes in, I select the corresponding class/function from the dict and return the output.
My question is whether the above is best practice, or whether there are other tricks to obtain the same result.
Sample GIST : https://gist.github.com/ajumell/5483685
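To make the dispatch pattern described above concrete (a sketch only; the real code is in the linked gist, and the source functions and keys here are made up):
# Sketch: a module-level dict maps a key from the request to the function
# that produces the autocomplete results.
from django.http import JsonResponse

def city_names(term):
    return ["Boston", "Berlin", "Bangalore"]    # stand-in for a real lookup

def country_names(term):
    return ["Brazil", "Belgium"]                # stand-in for a real lookup

AUTOCOMPLETE_SOURCES = {
    "city": city_names,
    "country": country_names,
}

def autocomplete(request, source):
    func = AUTOCOMPLETE_SOURCES.get(source)
    results = func(request.GET.get("term", "")) if func else []
    return JsonResponse(results, safe=False)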
You seem to be talking about a form of memoization.
This is OK, as long as you don't rely on that result being in the dictionary, because the memory is local to each process and you can't guarantee that subsequent requests are handled by the same process. But if you have a fallback where you generate the result, this is a perfectly good optimization.
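A minimal sketch of that memoization-with-fallback idea (expensive_lookup is a hypothetical stand-in for whatever generates the result):
# Per-process memoization: the dict may be empty in another worker process,
# so there is always a fallback that regenerates the result.
_results = {}

def get_result(key):
    if key not in _results:
        _results[key] = expensive_lookup(key)   # hypothetical: regenerate on a miss
    return _results[key]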
That's a very general question. It primarily depends on the infrastructure of your code, the way your classes and models are defined, and the dynamics of the application.
Second, it is important to take into account the resources of the server where your application is running: how much memory and disk space you have available, so you can decide what would be better for the application.
Last but not least, it's important to consider how much work it takes to put all these resources into memory. Memory is volatile, so if your application restarts you'll have to instantiate all the classes again, and maybe that is too much work.
Summing up, keeping objects that are queried often in memory is a very good optimization (that's what caching is all about), but you have to take all of the above into account.
Storing a series of functions in a dictionary and conditionally selecting one based on the request is a perfectly acceptable way to handle it.
If you would like a more specific answer it would be very helpful to post your actual code. And secondly, this might be better suited to codereview.stackexchange

Write custom translation allocation in Pyramid

Pyramid uses gettext *.po files for translations, a very good and stable way to internationalize an application. Its one disadvantage is that it cannot be changed from the app itself. I need some way to give a normal user the ability to change the translations on their own. Django allows changes to the file directly and restarts the whole app after the change; I do not have that freedom, because the changes will be quite frequent.
Since I could not find any package that would help me with the task, I decided to override the Localizer. My idea is based on using a translation domain like Zope projects use, making the Localizer search for a registered domain and, if none is found, fall back to the default translation strategy.
The problem is that I could not find a good way to place a custom translation solution into the Localizer itself. All I could think of is to reimplement the get_localizer method and rewrite the whole Localizer. But there are several things that would need to be copy-pasted, such as interpolation of mappings and other tweaks related to translation strings.
I don't know how much you have in there, but I did something similar a while ago (and will have to do it once again). The implementation is pretty simple.
If you can be sure that all calls are handled through _() or something similar, you can provide your own function. It would look something like this:
def _(val):
    # look up a user-provided translation first
    row = db.translations.find_one({'key': val, 'locale': request.locale_name})
    if row:
        return row['value']
    return real_gettext(val)
This is pretty simple... Then you need something that will dump the database into the file.
But I guess overriding the Localizer makes more sense. I did it a long time ago, and overriding the function was easier than searching through the code.
The plus side of the Localizer is that it will work everywhere. A monkey patch is pretty cool, but it's also pretty ugly. If I had to do it again, I'd provide my own Localizer that loads from a database first and otherwise falls back to its own value. The reason for the database is that if someone stops the server and the file hasn't been modified, you won't see the results.
If the DB is more than you need, then the Localizer is good enough and you can update the file on every change. When the server gets restarted it will load the new files... You'll have to compile the catalog first.
