Model to create threaded comments on Google App Engine with Python

I am trying to write a comments model that can be threaded (with no limit on the depth of child threads). How do I do this in App Engine, and what is the fastest way to read all the comments?
I am trying to do this in a scalable way so that App Engine's new pricing does not kill my startup :)

It depends on how deeply nested you expect threads to get and whether you want to optimize for reading or writing. If we assume that threads are normally quite shallow and that you want to optimize reading all subcomments of a comment across all levels, I think you should store each comment as a separate entity and then put a reference to it in a list on the parent, the parent's parent, and so on.
That way, getting all subcomments is always just one call, but writing a new comment is a bit slower, since you have to modify all of its ancestors.
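A minimal sketch of that approach using ndb might look like the following; the Comment model and the descendant_keys property are names assumed here for illustration, not anything App Engine prescribes:

from google.appengine.ext import ndb

class Comment(ndb.Model):
    author = ndb.StringProperty()
    body = ndb.TextProperty()
    parent_key = ndb.KeyProperty(kind='Comment')  # direct parent, None for a top-level comment
    descendant_keys = ndb.KeyProperty(kind='Comment', repeated=True)  # every comment below this one

def add_comment(parent_key, author, body):
    comment_key = Comment(author=author, body=body, parent_key=parent_key).put()
    # Walk up the chain of parents and register the new comment on each ancestor.
    ancestor_key = parent_key
    while ancestor_key is not None:
        ancestor = ancestor_key.get()
        ancestor.descendant_keys.append(comment_key)
        ancestor.put()
        ancestor_key = ancestor.parent_key
    return comment_key

def get_all_subcomments(comment_key):
    # One get for the comment plus one batched get for all of its descendants.
    return ndb.get_multi(comment_key.get().descendant_keys)

In a real application the ancestor updates should happen in a transaction (or a task) so that a failed write doesn't leave the lists inconsistent.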

If you need a fast way to read all comments, keep in the comment model not only a reference to the parent comment but also a reference to the whole topic. Then you will be able to get all comments for a topic with Comment.query(Comment.topic == 12334).fetch().
If you are worried about datastore usage costs, use NDB (http://code.google.com/p/appengine-ndb-experiment/). It helps reduce the amount of data stored and the number of queries, and it will shortly replace the current db module.
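As a sketch of that model, with the query syntax spelled out (the property names are illustrative):

from google.appengine.ext import ndb

class Comment(ndb.Model):
    body = ndb.TextProperty()
    parent_comment = ndb.KeyProperty(kind='Comment')  # None for a top-level comment
    topic = ndb.IntegerProperty()                     # id of the topic/thread this comment belongs to

# One flat query returns every comment in the topic, regardless of nesting depth;
# the tree can then be rebuilt in memory by following parent_comment.
comments = Comment.query(Comment.topic == 12334).fetch()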

Related

Django - Best way to create snapshots of objects

I am currently working on a Django 2+ project involving a blockchain, and I want to make copies of some of my object's states into that blockchain.
Basically, I have a model (say "contract") that has a list of several "signature" objects.
I want to make a snapshot of that contract, with the signatures. What I am basically doing is taking the contract at some point in time (when it's created for example) and building a JSON from it.
My problem is: I want to update that snapshot anytime a signature is added/updated/deleted, and each time the contract is modified.
The intuitive solution would be to override each "delete", "create" and "update" of every model involved in that snapshot, and pray that all of them are implemented right and that I didn't forget any. But I think this is not scalable at all, and hard to debug and maintain.
I have thought of a solution that might be more centralized: using a periodical job to get the last update date of my object, compare it to the date of my snapshot, and update the snapshot if necessary.
However with that solution, I can identify changes when objects are modified or created, but not when they are deleted.
So, this is my big question mark: how, with Django, can you identify deletions in relationships, without any prior context, just by looking at the current state of the database? Is there a Django module to record deleted objects? What are your thoughts on my issue?
As I understand your problem, I think what you need is Django's signals framework: signals are sent when model instances are saved or deleted through the ORM, and when one is received (and all the desired conditions are met) you can execute whatever commands you need in your application (or even against the database).
This is the most recent documentation:
https://docs.djangoproject.com/en/3.1/topics/signals/
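As a hedged sketch of how receivers could keep a snapshot current (Contract, Signature and update_snapshot are assumptions about your code, not part of Django):

# signals.py
from django.db.models.signals import post_save, post_delete
from django.dispatch import receiver

from .models import Contract, Signature   # assumed models
from .snapshots import update_snapshot    # assumed helper that rebuilds the JSON snapshot

@receiver(post_save, sender=Signature)
@receiver(post_delete, sender=Signature)
def signature_changed(sender, instance, **kwargs):
    # Rebuild the snapshot of the contract this signature belongs to.
    update_snapshot(instance.contract)

@receiver(post_save, sender=Contract)
def contract_changed(sender, instance, **kwargs):
    update_snapshot(instance)

One caveat: signals fire only for ORM operations; queryset.update() and raw SQL bypass them.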

Ndb strong consistency and frequent writes

I'm trying to achieve strong consistency with ndb using python.
It looks like I'm missing something, because my reads do not behave as if they were strongly consistent.
The query is:
links = Link.query(ancestor=lead_key).filter(
    Link.last_status == None).fetch(keys_only=True)
if links:
    do_action()
The key structure is:
Lead root (generic key) -> Lead -> Website (one per lead) -> Link
I have many tasks that are executed concurrently using TaskQueue, and this query is performed at the end of every task. Sometimes I get a "too much contention" exception when updating the last_status field, but I deal with it using retries. Can that break strong consistency?
The expected behavior is having do_action() called when there are no links left with last_status equal to None. The actual behavior is inconsistent: sometimes do_action() is called twice and sometimes not called at all.
Using an ancestor key to get strong consistency has a limitation: you're limited to one update per second per entity group. One way to work around this is to shard the entity groups. Sharding Counters describes the technique. It's an old article, but as far as I know the advice is still sound.
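The ndb version of that technique looks roughly like the following; the counter model and the shard count are illustrative, taken from the pattern in the article rather than from your code:

import random
from google.appengine.ext import ndb

NUM_SHARDS = 20  # more shards = more write throughput, slower reads

class CounterShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(counter_name):
    # Pick one of N shards at random so writes are spread over N entity groups,
    # instead of hammering a single group at more than ~1 write per second.
    shard_id = '%s-%d' % (counter_name, random.randint(0, NUM_SHARDS - 1))
    shard = CounterShard.get_by_id(shard_id)
    if shard is None:
        shard = CounterShard(id=shard_id)
    shard.count += 1
    shard.put()

def get_count(counter_name):
    # Reading means summing over all the shards.
    keys = [ndb.Key(CounterShard, '%s-%d' % (counter_name, i)) for i in range(NUM_SHARDS)]
    return sum(s.count for s in ndb.get_multi(keys) if s is not None)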
Adding to Dave's answer, which covers the first thing to check:
One thing which isn't well documented and can be a bit surprising is that the contention can be caused by read operations as well, not only by the write ones.
Whenever a transaction starts, the entity groups being accessed (by read or write ops, it doesn't matter) are marked as such. The "too much contention" error indicates that too many parallel transactions are simultaneously trying to access the same entity group. It can happen even if none of the transactions actually attempts to write!
Note: this contention is NOT emulated by the development server; it can only be seen when deployed on GAE, with the real datastore!
What can add to the confusion is the automatic retrying of transactions, which can happen after both actual write conflicts and plain access contention. These retries may appear to the end user as suspicious repeated execution of some code paths, which I suspect could explain your reports of do_action() being called twice.
Usually when you run into such problems you have to revisit your data structures and/or the way you're accessing them (your transactions). In addition to solutions that maintain strong consistency (which can be quite expensive), you may want to re-check whether strong consistency is actually a must. In some cases it's added as a blanket requirement just because it appears to simplify things. From my experience it doesn't :)
There is nothing in your sample that ensures that your code is only called once.
For the moment, I am going to assume that your "do_action" function does something to the Link entities, specifically that it sets the "last_status" property.
If you do not perform the query and the write to the Link Entity inside a transaction, then it is possible for two different requests (task queue tasks) to get results back from the query, then both write their new value to the Link entity (with the last write overwriting the previous value).
Remember that even if you do use a transaction, you don't know until the transaction is successfully completed that nobody else tried to perform a write. This is important if you are trying to do something external to datastore (for example, making a http request to an external system), as you may see http requests from transactions that would eventually fail with a concurrent modification exception.
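A hedged sketch of that last point, reusing the query from the question (Link and do_action come from your code; the transactional wrapper and the function name are the suggestion):

from google.appengine.ext import ndb

@ndb.transactional
def check_links_and_act(lead_key):
    # The ancestor query and any writes inside do_action() share one entity group,
    # so two concurrent tasks cannot both observe the same state and both proceed.
    links = Link.query(ancestor=lead_key).filter(
        Link.last_status == None).fetch(keys_only=True)
    if links:
        # Keep side effects datastore-only here: external HTTP calls may still
        # run for transactions that later fail and are retried.
        do_action()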

Pythonic Django object reuse

I've been racking my brain on this for the last few weeks and I just can't seem to understand it. I'm hoping you folks here can give me some clarity.
A LITTLE BACKGROUND
I've built an API to help serve a large website and, like all of us, I'm trying to keep the API as efficient as possible. Part of this efficiency is NOT creating an object that contains custom business logic (for example, a service class) over and over again as requests are made. For some personal background, I come from the Java world, so I'm used to using IoC or DI to handle object creation and inject objects into my classes, ensuring classes are NOT created over and over on a per-request basis.
WHAT I'VE READ
While reading many Python IoC and DI posts, I've become rather confused about the best way to create a given class without having to worry about the server getting overloaded with too many objects based on the number of requests it may be handling.
Some people say an IoC container or DI really isn't needed. But as I run my Django app, I find that unless I construct the object I want globally (at the top of the file) for views.py to use later, rather than inside each view class or def within views.py, I run the risk of creating multiple instances of the same class, which from what I understand would cause memory bloat on the server.
So what's the right, Pythonic way to keep objects from being built over and over? Should I invest in an IoC / DI setup or not? Can I safely rely on my service.py files just containing defs instead of classes that contain defs? Is the garbage collector just THAT efficient, so I don't even have to worry about it?
I've purposely not placed any code in this post since this seems like a general question, but I can provide a few code examples if that helps.
Thanks
From a confused engineer who wants to be as Pythonic as possible
You come from a background where everything needs to be a class. I've programmed web apps in Java too, and sometimes it's harder to unlearn old things than to learn new things; I understand.
In Python / Django you wouldn't make anything a class unless you need many instances and need to keep state.
For a service that's hardly the case, and you'll notice that in Java-like web apps some services are made singletons, which is just a workaround for this and a rather big anti-pattern in Python.
Pythonic
Python is flexible enough that a "service class" isn't required; you'd just have a Python module (e.g. services.py) with a number of functions, the emphasis being on each one taking something in and returning something, in a completely stateless fashion.
# services.py
# This is a module: it keeps no state of its own. It may read from and write
# to the DB and do some processing, but it doesn't remember anything.
from .models import Score

def get_scores(student_id):
    return Score.objects.filter(student=student_id)

# views.py
# Receives HTTP requests.
from . import services

def view_scores(request, student_id):
    scores = services.get_scores(student_id)
    # e.g. use the scores queryset in a template and return an HTML page
Notice how if you need to swap out the service, you'll just be swapping out a single Python module (just a file really), so Pythonistas hardly bother with explicit interfaces and other abstractions.
Memory
Now, per each "Django worker process", you'd have that one services module, used over and over for all requests that come in; when the Score queryset is no longer referenced in memory, it gets cleaned up.
I saw your other post, and, well, instantiating a ScoreService object for each request, or keeping an instance of it in the global scope, is just unnecessary; the example above does the job with one module in memory and doesn't need us to be clever about it.
And if you did need to keep state between several requests, keeping it in live instances of ScoreService would be a bad idea anyway, because now every user might need their own instance; that's not viable (too many live objects keeping context). Not to mention that such an instance is only accessible from within the same process unless you have some sharing mechanism in place.
Keep state in a datastore
If you do want to keep state between requests, you'd keep it in a datastore; when a request comes in, you hit the services module again to get the context back from the datastore, pick up where you left off, do your business, and return your HTTP response; then the unused objects are garbage collected.
The emphasis is on keeping things stateless, where any given HTTP request can be processed by any given Django process, and all state objects are garbage collected after the response is returned and the objects go out of scope.
This may not be the fastest request/response cycle we can pull off, but it's scalable as hell.
Look at some major web apps written in Django
I suggest you look at some open-source Django projects and see how they're organized; you'll find that a lot of the things you're busting your brains over, Djangonauts just don't bother with.

Is fetch() better than list(Model.all().run()) for returning a list from a datastore query?

Using Google App Engine Python 2.7 Query Class -
I need to produce a list of results that I pass to my django template. There are two ways I've found to do this.
Use fetch(); however, the docs say that fetch should almost never be used: https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_fetch
Use run() and then wrap it in list(), thereby creating a list object.
Is one preferable to the other in terms of memory usage? Is there another way I could be doing this?
The key here is why fetch "should almost never be used". The documentation says that fetch gets all the results, and therefore has to keep all of them in memory at the same time. If the data you get is big, you will need lots of memory.
You say you can wrap run inside list. Sure, you can do that, but you will hit exactly the same problem—list will force all the elements into memory. So, this solution is actually discouraged on the same basis as using fetch.
Now, you could say: so what should I do? The answer is: in most cases you can deal with elements of your data one by one, without keeping them all in memory at the same time. For example, if all you need is to put the result data into a django template, and you know that it will be used at most once in your template, then the django template will happily take any iterator—so you can pass the run call result directly without wrapping it into list.
Similarly, if you need to do some processing, for example go over the results to find the element with the highest price or ranking, or whatever, you can just iterate over the result of run.
But if your usage requires having all the elements in memory (e.g. your Django template uses the data from the query several times), then you have a case where fetch or list(run(…)) actually makes sense. In the end, this is just the typical trade-off: if your application needs an algorithm that requires all the data in memory, you pay for it with memory. So you can either redesign your algorithms and usage to work with an iterator, or use fetch and pay for it with longer processing times and higher memory usage. Google, of course, encourages you to do the first. And this is what "should almost never be used" actually means.
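A small sketch of the two styles, using the old db API from the question (the Item model and its property are placeholders):

from google.appengine.ext import db

class Item(db.Model):
    price = db.FloatProperty()

# Lazy iteration with run(): results arrive in batches and are never all
# held in memory at once; fine for one-pass work like finding a maximum.
best = None
for item in Item.all().run():
    if best is None or item.price > best.price:
        best = item

# fetch(): everything is loaded into one list, which only pays off when the
# data is reused several times (for example in a template).
items = Item.all().fetch(1000)
context = {'items': items, 'count': len(items)}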

is it a good practice to store data in memory in a django application?

I am writing a reusable Django application that returns JSON results for a jQuery UI autocomplete.
Currently I am storing the class/function that produces the result in a dictionary, with a unique key for each class/function.
When a request comes in, I select the corresponding class/function from the dict and return its output.
My question is whether the above is the best practice, or whether there are other tricks to obtain the same result.
Sample GIST : https://gist.github.com/ajumell/5483685
You seem to be talking about a form of memoization.
This is OK, as long as you don't rely on that result being in the dictionary. This is because the memory will be local to each process, and you can't guarantee subsequent requests being handled by the same process. But if you have a fallback where you generate the result, this is a perfectly good optimization.
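A hedged sketch of that fallback pattern (build_autocomplete_source is a hypothetical stand-in for whatever generates your result):

# Module-level cache: it is local to each worker process, so treat it purely
# as an optimization, never as the source of truth.
_SOURCES = {}

def get_source(key):
    if key not in _SOURCES:
        # Fallback: regenerate if this process hasn't built the result yet;
        # another process (or a restarted one) will not share this dictionary.
        _SOURCES[key] = build_autocomplete_source(key)
    return _SOURCES[key]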
That's a very general question. It primarily depends on the structure of your code: the way your classes and models are defined and the dynamics of the application.
Second, it's important to take into account the resources of the server where your application is running: how much memory and how much disk space are available, so you can weigh what would be better for the application.
Last but not least, it's important to consider how much work it takes to put all these resources in memory. Memory is volatile, so if your application restarts you'll have to instantiate all the classes again, and maybe that's too much work.
To sum up, keeping objects that are queried often in memory is a very good optimization (that's what caching is all about), but you have to take all of the above into account.
Storing a series of functions in a dictionary and conditionally selecting one based on the request is a perfectly acceptable way to handle it.
If you would like a more specific answer, it would be very helpful to post your actual code. Also, this might be better suited to codereview.stackexchange.com.
