I'm not sure about the best way to approach this. Say I've got a "widget" table with fields 'id', 'name', 'size', 'color', with 10,000 rows.
When I load a webpage I will often need to look up hundreds of widgets (by id) and return one or more of the associated fields.
Once I have a database session established, is it best practice to do something like:
thiswidget = session.query(Widget).filter(Widget.id == X).first()
each time I need a piece of data, or should I grab all the data up front once, say like this:
widgetsdict = {}
for widget in session.query(Widget):
    widgetsdict[widget.id] = (widget.name, widget.size, widget.color)
Then each time I need to look something up, just do:
thiswidget = widgetsdict[X]
The first method is far simpler, but is it a good idea to keep asking the database over and over?
You should employ caching to prevent hitting the database too many times.
Redis or memcached are typically used for this purpose. They both run as separate server processes (on the application server or a dedicated cache host) that your code can call to save and retrieve data. You will need to set up the server and the relevant Python client library.
The code you write in Python should do the following:
Check the cache for a key.
If nothing is returned, query the database.
Store the result in the cache and set a reasonable expiry.
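A minimal sketch of that pattern with the redis-py client (the key scheme, expiry, and helper name are illustrative; Widget and session are taken from the question above):
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def get_widget_fields(session, widget_id):
    key = "widget:%s" % widget_id                  # hypothetical key scheme
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                  # cache hit: skip the database
    widget = session.query(Widget).filter(Widget.id == widget_id).first()
    data = [widget.name, widget.size, widget.color]
    cache.set(key, json.dumps(data), ex=300)       # store with a 5-minute expiry
    return data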
Caching Software
redis
memcached
I am trying to understand the best way to deal with Sessions.
I am scraping data from a website (using scrapy, though that's somewhat irrelevant here) and, when I encounter new data, saving it to my database, or, when it matches something that already exists, updating some values.
My issue is that each webpage I am scraping concerns only one entry in my table, and may need an update in some other table.
For instance, let's assume I am scraping cars and some features (gearbox type, engine type, owner, ...) are within linked tables.
At the moment, I am doing something like:
def parse_cars_page(self, response):  # standard scrapy callback; response holds the scraped html data
    # do some stuff here to find the car and its parameters, among which its gearbox and motor types
    engine = create_engine("connection_string")
    with Session(engine) as session:
        stmt_cars = select(model.Cars).where(model.Cars.id == car.id)
        current_car = session.execute(stmt_cars).scalar_one_or_none()
        if current_car is not None:
            pass  # update some data
        else:
            current_car = model.Cars(id=car.id)  # the car is new, so create it before attaching related rows
            session.add(current_car)
            stmt_motor = select(model.Motors).where(model.Motors.type == car.motor_type)
            current_motor = session.execute(stmt_motor).scalar_one_or_none()
            if current_motor is None:
                current_motor = model.Motors(create_it_using_my_data)
            current_car.motor = current_motor
            # then do the same for the gearbox, owner, .........
        session.commit()
This works fine. But it's slow, very slow. The whole with statement takes about a minute to execute.
As the info is contained within a single page, and I am opening thousands of them, getting the data is tedious and takes an enormous amount of time.
I tried finding best practices in the SQLAlchemy documentation but I can't find what I am looking for, and I understand that keeping a Session open and sharing it across my whole app isn't a good idea.
My app is the only thing that will update the data in the database. Other processes may read it but not write to it.
Is there a way to open a Session, copy a snapshot of the database into memory, update that snapshot while the Session is closed, and only open a Session for synchronization once I have generated X new items?
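A minimal sketch of that batching idea, under the assumption that one engine is created once and reused, and that a buffer of scraped cars is flushed every BATCH_SIZE items (the buffer, threshold, and helper names are illustrative, not part of the original code; model is the module with the ORM classes, as in the question):
from sqlalchemy import create_engine, select
from sqlalchemy.orm import Session

engine = create_engine("connection_string")  # created once, reused for every flush
pending_cars = []                            # in-memory buffer of scraped car data
BATCH_SIZE = 100                             # illustrative threshold

def queue_car(car):
    pending_cars.append(car)
    if len(pending_cars) >= BATCH_SIZE:
        flush_cars()

def flush_cars():
    # open one Session, synchronize the whole batch, then close it again
    with Session(engine) as session:
        for car in pending_cars:
            stmt = select(model.Cars).where(model.Cars.id == car.id)
            current_car = session.execute(stmt).scalar_one_or_none()
            if current_car is None:
                current_car = model.Cars(id=car.id)
                session.add(current_car)
            # update fields / related rows here as in the original code
        session.commit()
    pending_cars.clear()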
I'm looking to post new records on a user-triggered basis (i.e. workflow). I've spent the last couple of days researching the best way to approach this and so far I've come up with the following ideas:
(1) Utilize Django signals to check for conditions on a field change, and then post data originating from my Django app.
(2) Utilize JS/AJAX on the front-end to post data to the app based upon a user changing certain fields.
(3) Utilize a prebuilt workflow app like http://viewflow.io/, again based upon changes triggers by users.
Of the three above options, is there a best practice? Are there any other options I'm not considering for how to take this workflow based approach to post new records?
The second approach, monitoring the changes on the front end and then calling a backend view to update the database, would be the better one: doing the processing on the backend (or anywhere else server-side) puts the load on the server, which would slow down the site, whereas the second approach is more of a client-side solution and keeps the server relieved.
I do not think there will be data loss; you are just trying to monitor a change, and as soon as it changes your view will update the database. You can also use cookies or sessions to keep appending values to a list and update the database when the site closes. Django can also raise HTTP errors, so put proper try/except conditions around that case as well. Anyway, I think cookies would be a good approach.
For anyone that finds this post: I ended up deciding to take the Signals route. Essentially I'm utilizing Signals to track when users change a field, and based on the field that changes I'm performing certain actions on the database.
For testing purposes this has been working well. When I reach production with this project I'll try to update this post with any challenges I run into.
Example:
from django.db.models.signals import pre_save
from django.dispatch import receiver

@receiver(pre_save, sender=subTaskChecklist)
def do_something_if_changed(sender, instance, **kwargs):
    try:
        obj = sender.objects.get(pk=instance.pk)  # define obj as the "old", before-change values
    except sender.DoesNotExist:
        pass  # the instance is new, so there is no previous value to compare
    else:
        previous_Value = obj.FieldToTrack
        new_Value = instance.FieldToTrack  # instance represents the "new", after-change object
        DoSomethingWithChangedField(new_Value)
So I have a site that, on a per-user basis, is expected to query a very large database and flip through the results. Due to the number of entries returned, I run the query once (which takes some time...), store the result in a global, and let folks iterate through the results (or download them) as they want.
Of course, this isn't scalable, as the globals are shared across sessions. What is the correct way to do this in Django? I looked at session management, but I always ran into the "xyz is not serializable to JSON" issue. Do I look into how I do this correctly using sessions, or is there another preferred way to do this?
If the user is flipping through the results, you probably don't want to pull back and render any more than you have to. Most SQL dialects have TOP and LIMIT clauses that will let you pull back a limited range of results, as long as your data is ordered consistently. Django's Pagination classes are a nice abstraction of this on top of Django Model classes: https://docs.djangoproject.com/en/dev/topics/pagination/
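As a rough sketch of the Paginator approach (the Widget model, page size, and template name are placeholders, not from the question):
from django.core.paginator import Paginator
from django.shortcuts import render

def result_list(request):
    queryset = Widget.objects.order_by("id")   # consistent ordering keeps pages stable
    paginator = Paginator(queryset, 25)        # 25 results per page (placeholder)
    page = paginator.get_page(request.GET.get("page"))
    return render(request, "results.html", {"page": page})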
I would be careful of storing large amounts of data in user sessions, as it won't scale as your number of users grows, and user sessions can stay around for a while after the user has left the site. If you're set on this option, make sure you read about clearing the expired sessions. Django doesn't do it for you:
https://docs.djangoproject.com/en/1.7/topics/http/sessions/#clearing-the-session-store
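(For reference, Django ships a clearsessions management command, typically run from a cron job as python manage.py clearsessions, to purge expired sessions.)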
I'm developing a web application using Flask. I have two approaches to returning pages for a user's request.
Load the requested data from the database, then return it.
Load the whole database into a Python dictionary at initialization and return the related page when requested. (The whole database is not too big.)
I'm curious which approach will have better performance?
Of course it will be faster to get data from a cache that is stored in memory. But you've got to be sure that the amount of data won't get too large, and that you're updating your cache every time you update the database. Depending on your exact goal you may choose a Python dict, a cache (like memcached), or something else, such as tries.
There's also a "middle" way for this. You can store in memory not the whole records from the database, but just the correspondence between the search params in a request and the ids of the matching records. That way, when a user makes a request, you quickly look up the ids of the records needed and query your database by id, which is pretty fast.
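A minimal sketch of that middle way (the Record model, its category and name columns, the route, and the connection string are all assumptions for illustration):
from flask import Flask, jsonify
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

app = Flask(__name__)
engine = create_engine("sqlite:///app.db")        # placeholder connection string

# Built once at startup: maps a search parameter to the ids of the matching rows.
# Record is a hypothetical mapped class with id, category, and name columns.
search_index = {}
with Session(engine) as session:
    for record_id, category in session.query(Record.id, Record.category):
        search_index.setdefault(category, []).append(record_id)

@app.route("/records/<category>")
def records_by_category(category):
    ids = search_index.get(category, [])
    with Session(engine) as session:
        # fetch only the matching rows, by primary key, which is fast
        rows = session.query(Record).filter(Record.id.in_(ids)).all()
    return jsonify([{"id": r.id, "name": r.name} for r in rows])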
I was just looking over EveryBlock's source code and I noticed this code in the alerts/models.py code:
def _get_user(self):
    if not hasattr(self, '_user_cache'):
        from ebpub.accounts.models import User
        try:
            self._user_cache = User.objects.get(id=self.user_id)
        except User.DoesNotExist:
            self._user_cache = None
    return self._user_cache
user = property(_get_user)
I've noticed this pattern around a bunch, but I don't quite understand the use. Is the whole idea to make sure that when accessing the FK on self (self = alert object), you only grab the user object once from the db? Why wouldn't you just rely upon the db caching and Django's ForeignKey() field? I noticed that the model definition only holds the user id and not a foreign key field:
class EmailAlert(models.Model):
    user_id = models.IntegerField()
    ...
Any insights would be appreciated.
I don't know why this is an IntegerField; it looks like it definitely should be a ForeignKey(User) field--you lose things like select_related() because of that, too.
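For illustration, a sketch of what the ForeignKey version might look like (the on_delete argument and the query at the end are just examples, not EveryBlock's code):
from django.db import models
from ebpub.accounts.models import User

class EmailAlert(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE)

# One query fetches the alert and its user together, so alert.user
# never triggers an extra round-trip.
alert = EmailAlert.objects.select_related("user").get(pk=1)
print(alert.user.email)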
As to the caching, many databases don't cache results--they (or rather, the OS) will cache the data on disk needed to get the result, so looking it up a second time should be faster than the first, but it'll still take work.
It also still takes a database round-trip to look it up. In my experience with Django, an item lookup can take around 0.5 to 1 ms for an SQL command to a local PostgreSQL server, plus the sometimes nontrivial overhead of the QuerySet. 1 ms is a lot if you don't need it; do that a few times and you can turn a 30 ms request into a 35 ms request.
If your SQL server isn't local and you actually have network round-trips to deal with, the numbers get bigger.
Finally, people generally expect accessing a property to be fast; when they're complex enough to cause SQL queries, caching the result is generally a good idea.
Although databases do cache things internally, there's still an overhead in going back to the db every time you want to check the value of a related field - setting up the query within Django, the network latency in connecting to the db and returning the data over the network, instantiating the object in Django, etc. If you know the data hasn't changed in the meantime - and within the context of a single web request you probably don't care if it has - it makes much more sense to get the data once and cache it, rather than querying it every single time.
One of the applications I work on has an extremely complex home page containing a huge amount of data. Previously it was carrying out over 400 db queries to render. I've refactored it now so it 'only' uses 80, using very similar techniques to the one you've posted, and you'd better believe that it gives a massive performance boost.