I have done lots of searching but have been unable to find a satisfactory answer on the most efficient way to achieve the following.
Say my App contains a list of Products. At the end of every day an external service is called that returns another list of Products from a master data source.
If the list of Products from master data contains any Products not in my App, add the Product to the App.
If the Product in the master data is already in my App, and no changes have been made, do nothing.
If the Product in the master data is already in my App, but some data has changed (the Product's name for instance), update the Product.
If a Product is available in my App, but no longer in the master data source, flag it as "Unavailable" in the App.
At the moment, I do a loop on each list, looping through the other list for each Product:
For each Product in the master data list, loop through the Products in the App, and update as needed. If no Product was found, then add the Product to the App.
Then, for each Product in the App, loop through the Products in the master data list, and if not found, flag as "Unavailable" in the App.
I'm wondering if there is a more efficient method to achieve this? Or any algorithms or patterns that are relevant here?
In each case the Products are represented by objects in a Python list.
First of all, I'd suggest using dicts with the Product code (or name, or whatever is unique) as the key and the Product object as the value. Each lookup then drops from an O(n) scan of a list to an average O(1) hash lookup, which on a thousand entries should make your loops faster by a factor of 100 or more.
Then, especially for the second pass, it's worth converting the dicts' keys to sets and looping over the difference, as in:
for key in set(appDict.keys()).difference(masterDict.keys()):
    # update unavailable Product data
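Putting both suggestions together, here is a minimal sketch (Python 3) of the whole sync pass. It assumes each Product exposes a unique code attribute and an unavailable flag, and that add_to_app and update_in_app stand in for whatever persistence calls your App actually uses (all of these names are placeholders):

def sync(app_products, master_products):
    # Index both lists by a unique key so each lookup is O(1) instead of O(n).
    app_by_code = {p.code: p for p in app_products}
    master_by_code = {p.code: p for p in master_products}

    for code, master_product in master_by_code.items():
        app_product = app_by_code.get(code)
        if app_product is None:
            add_to_app(master_product)            # new Product: add it
        elif app_product != master_product:       # assumes a meaningful __eq__
            update_in_app(app_product, master_product)
        # otherwise identical: do nothing

    # Products in the App but gone from master data: flag as "Unavailable".
    for code in app_by_code.keys() - master_by_code.keys():
        app_by_code[code].unavailable = True

This is a single pass over each list plus constant-time lookups, so it is O(n + m) overall instead of O(n * m).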
I have a Django app with tasks stored in a database, laid out like a kanban board; each record has a column and an order for where it sits in the list.
The problem is that one column can hold hundreds or thousands of records. When a card changes column, based on where the user places it, the column is updated and the order is recalculated and set for every record in that column.
Is there a more efficient way than a heavy process updating thousands of records to set order?
I would suggest using Django's F expressions to update all the objects in a queryset with an arithmetic rule, something like:
Model.objects.filter(…).update(order=F('order') - 1)
https://docs.djangoproject.com/en/4.0/ref/models/expressions/
A more complex but smart answer: https://stackoverflow.com/a/72894011/10992051
You will want to use bulk_update to more efficiently update your records en masse:
https://docs.djangoproject.com/en/4.0/ref/models/querysets/#bulk-update
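Here is a minimal sketch of both ideas for a card move, assuming a hypothetical Task model with column and order fields:

from django.db.models import F

def move_card(task, new_column, new_order):
    # Make room in one UPDATE: push every card at or below the target
    # position down by one, instead of calling save() per row.
    Task.objects.filter(column=new_column, order__gte=new_order) \
                .update(order=F('order') + 1)
    task.column = new_column
    task.order = new_order
    task.save(update_fields=['column', 'order'])

def renumber_column(column):
    # If you really must rewrite a whole column's ordering, batch the
    # writes with bulk_update rather than saving each task individually.
    tasks = list(Task.objects.filter(column=column).order_by('order'))
    for i, t in enumerate(tasks):
        t.order = i
    Task.objects.bulk_update(tasks, ['order'])

The F() version never pulls the rows into Python at all, so it stays fast no matter how many records the column holds.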
I am working on making a relatively simple inventory system in which data is stored and updated with MySQL with Python connected to it. When adding to stock, an end user would input values into an interface and associate a purchase ticket number with that transaction. A log would then indicate that purchase ticket x added units to stock.
In a single purchase ticket, several items in stock may be increased, i.e. several rows within the stock table will need to be updated per purchase ticket.
However, I am having trouble conceptualizing an efficient way of updating multiple rows while still associating the purchase ticket number with the transaction. I was going to use a simple UPDATE statement, but can't figure out how to link the ticket number.
I was considering making a table for purchase tickets, but figured it would be more efficient to just increment stock with UPDATEs alone; it appears I was wrong. I was going to use something like:
UPDATE stock SET count = count + x WHERE id = y;
Where x is how much the stock is being incremented by, and y is the specific product's unique ID.
TL;DR is there any efficient way to update multiple rows in a single column while also associating a user-inputted number with that transaction?
Don't update. Updates are slow because the transaction has to seek out each row to be updated. Just append to the end of the table. Then use the OUTPUT clause so you have access to what got inserted. Finally, you can join to those results, which will include the primary keys, and log those keys in another table.
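MySQL has no OUTPUT clause, but the same append-only idea translates directly: record each ticket line as a row in a movements table and derive the current stock by summing. A rough sketch in Python, with all table, column, and connection details hypothetical:

import mysql.connector

conn = mysql.connector.connect(user='app', password='secret', database='inventory')
cur = conn.cursor()

def receive_stock(ticket_no, items):
    # items is a list of (product_id, quantity) pairs from one purchase ticket.
    # One movement row per item keeps the ticket number attached to each change.
    cur.executemany(
        "INSERT INTO stock_movements (ticket_no, product_id, quantity) "
        "VALUES (%s, %s, %s)",
        [(ticket_no, pid, qty) for pid, qty in items],
    )
    conn.commit()

# Current stock per product is just the sum of its movements:
#   SELECT product_id, SUM(quantity) FROM stock_movements GROUP BY product_id;

Because every row carries ticket_no, the movements table doubles as the transaction log the question asks for.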
Suppose I have the following GQL database:
class Signatories(db.Model):
    name = db.StringProperty()
    event = db.StringProperty()
This database holds information regarding events that people have signed up for. Say I have the following entries in the database, in the format (name, event): (Bob, TestEvent), (Bob, TestEvent2), (Fred, TestEvent), (John, TestEvent).
But the dilemma here is that I cannot just aggregate all of Bob's events into one entity, because I'd like to query for all the people signed up for a specific event, and I'd also like to add such entries without having to manually update an existing entity every single time.
How could I count the number of distinct strings given by a GQL Query in Python (in my example, I am specifically trying to see how many people are currently signed up for events)?
I have tried using the old mcount = db.GqlQuery("SELECT name FROM Signatories").count(), however this of course returns the total number of strings in the list, regardless of the uniqueness of each string.
I have also tried using count = len(member), where member = db.GqlQuery("SELECT name FROM Signatories"), but unfortunately this just raises an error.
You can't, at least not directly. (By the way, you don't have a "GQL database"; GQL is just the query language over the App Engine datastore.)
If you have a small number of items, fetch them all into memory and use a set operation to produce the unique collection, then count it.
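A minimal sketch of that in-memory approach, using the old db API from the question:

from google.appengine.ext import db

# Fetch the entities (capped here at the old 1000-result limit) and
# collapse the duplicate names with a set.
signatories = db.GqlQuery("SELECT * FROM Signatories").fetch(1000)
distinct_names = set(s.name for s in signatories)
count = len(distinct_names)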
If you have larger numbers of entities that make in-memory filtering and counting problematic, then your strategy will be to aggregate the count as you create them,
e.g.
create a separate entity each time you record a signup, using the pair of strings as the key. That way there will only ever be one entity in the datastore representing that specific pair, and you can then do a straight count.
However, as you get large numbers of these entities, you will need to do some additional work to count them, because a single query.count() will become too expensive. At that point you need to start looking at counting strategies for the datastore.
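A sketch of the pair-key idea, with a hypothetical Signup kind whose key_name encodes the (name, event) pair so that saving the same pair twice just overwrites one entity:

from google.appengine.ext import db

class Signup(db.Model):
    name = db.StringProperty()
    event = db.StringProperty()

def record_signup(name, event):
    # key_name makes the pair unique: re-saving the same pair is idempotent.
    Signup(key_name='%s|%s' % (name, event), name=name, event=event).put()

# Distinct (name, event) pairs is now a plain count (until the number of
# entities grows large enough that you need sharded counters instead).
total = Signup.all().count()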
It sounds like an odd one, but it's a really simple idea. I'm trying to make a simple Flickr for a website I'm building. This specific problem comes up when I want to show a single photo (from my Photo model) on the page, but I also want to show the image before it in the stream and the image after it.
If I were only sorting these streams by date, or was only sorting by ID, that might be simpler... But I'm not. I want to allow the user to sort and filter by a whole variety of methods. The sorting is simple. I've done that and I have a result-set, containing 0-many Photos.
If I want a single Photo, I start off with that filtered/sorted/etc stream. From it I need to get the current Photo, the Photo before it and the Photo after it.
Here's what I'm looking at, at the moment.
prev = None
next = None
photo = None
for i in range(filtered_queryset.count()):
    if filtered_queryset[i].pk == desired_pk:
        if i > 0:
            prev = filtered_queryset[i - 1]
        if i < filtered_queryset.count() - 1:
            next = filtered_queryset[i + 1]
        photo = filtered_queryset[i]
        break
It just seems disgustingly messy. And inefficient. Oh my lord, so inefficient. Can anybody improve on it though?
Django queries are late-binding (lazy), so it would be nice to make use of that, though I guess that might be impossible given my horrible restrictions.
Edit: it occurs to me that I can just chuck in some SQL to re-filter the queryset. If there's a way of selecting something with its two (or one, or zero) closest neighbours in SQL, I'd love to know!
You could try the following:
Evaluate the filtered/sorted queryset and get the list of photo ids, which you hold in the session. These ids all match the filter/sort criteria.
Keep the current index into this list in the session too, and update it when the user moves to the previous/next photo. Use this index to get the prev/current/next ids to use in showing the photos.
When the filtering/sorting criteria change, re-evaluate the list and set the current index to a suitable value (e.g. 0 for the first photo in the new list).
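A sketch of that session approach in a Django view; build_filtered_queryset, the session key, the Photo import, and the template name are all hypothetical:

from django.shortcuts import render

def photo_detail(request, photo_pk):
    ids = request.session.get('photo_ids')
    if ids is None:
        # Evaluate the filtered/sorted queryset once and keep only the pks.
        ids = list(build_filtered_queryset(request).values_list('pk', flat=True))
        request.session['photo_ids'] = ids

    i = ids.index(int(photo_pk))    # ValueError here means the list is stale
    prev_pk = ids[i - 1] if i > 0 else None
    next_pk = ids[i + 1] if i < len(ids) - 1 else None
    photo = Photo.objects.get(pk=ids[i])
    return render(request, 'photo_detail.html',
                  {'photo': photo, 'prev_pk': prev_pk, 'next_pk': next_pk})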
I see the following possibilities:
Your URL query parameters contain the sort/filtering information and some kind of 'item number', which is the item number within your filtered queryset. This is the simple case - previous and next are item number minus one and plus one respectively (plus some bounds checking)
You want the URL to be a permalink containing the photo primary key (or some unique ID). In this case, you are presumably storing the sorting/filtering in one of the following places:
in the URL as query parameters. In this case you don't have true permalinks, and so you may as well stick the item number in the URL as well, getting you back to option 1.
hidden fields in the page, and using POSTs for links instead of normal links. In this case, stick the item number in the hidden fields as well.
session data/cookies. This will break if the user has two tabs open with different sorts/filtering applied, but that might be a limitation you don't mind - after all, you have envisaged that they will probably just be using one tab and clicking through the list. In this case, store the item number in the session as well. You might be able to do something clever to "namespace" the item number for the case where they have multiple tabs open.
In short, store the item number wherever you are storing the filtering/sorting information.
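For example, option 1 might look like the following view, where the query parameter n and the helper that applies the sort/filter parameters are made-up names:

from django.http import Http404
from django.shortcuts import render

def photo_by_position(request):
    qs = build_filtered_queryset(request)      # applies the sort/filter params
    n = int(request.GET.get('n', 0))
    total = qs.count()
    if total == 0:
        raise Http404("no photos match the current filter")
    n = max(0, min(n, total - 1))              # the bounds checking

    photo = qs[n]                              # translates to LIMIT/OFFSET
    prev_n = n - 1 if n > 0 else None
    next_n = n + 1 if n < total - 1 else None
    return render(request, 'photo_detail.html',
                  {'photo': photo, 'prev_n': prev_n, 'next_n': next_n})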
I have a model, below, and I would like to get all the distinct area values. The SQL equivalent is SELECT DISTINCT area FROM tutorials.
class Tutorials(db.Model):
    path = db.StringProperty()
    area = db.StringProperty()
    sub_area = db.StringProperty()
    title = db.StringProperty()
    content = db.BlobProperty()
    rating = db.RatingProperty()
    publishedDate = db.DateTimeProperty()
    published = db.BooleanProperty()
I know that in Python I can do
>>> a = ['google.com', 'livejournal.com', 'livejournal.com', 'google.com', 'stackoverflow.com']
>>> b = set(a)
>>> b
set(['livejournal.com', 'google.com', 'stackoverflow.com'])
But that would require me to move the area values out of the query into another list and then run set() against that list (which sounds very inefficient), and if a distinct value first appears at position 1001 in the datastore I wouldn't see it because of the fetch limit of 1000.
I would like to get all the distinct values of area in my datastore to dump it to the screen as links.
Datastore cannot do this for you in a single query. A datastore request always returns a consecutive block of results from an index, and an index always consists of all the entities of a given type, sorted according to whatever orders are specified. There's no way for the query to skip items just because one field has duplicate values.
One option is to restructure your data. For example, introduce a new entity type representing an "area". On adding a Tutorial, create the corresponding "area" entity if it doesn't already exist; on deleting a Tutorial, delete the corresponding "area" entity if no Tutorials remain with the same "area". If each area stored a count of the Tutorials in it, this might not be too onerous (although keeping things consistent with transactions etc. would actually be quite fiddly). I expect the entity's key could be based on the area string itself, meaning that you can always do key lookups rather than queries to get area entities.
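A sketch of that restructuring, with a hypothetical Area kind keyed on the area string itself (a real version would wrap the count updates in a transaction, as noted above):

from google.appengine.ext import db

class Area(db.Model):
    # The key_name is the area string, so at most one entity exists per area.
    tutorial_count = db.IntegerProperty(default=0)

def add_tutorial(tutorial):
    area = Area.get_or_insert(tutorial.area)   # key lookup, not a query
    area.tutorial_count += 1
    area.put()
    tutorial.put()

# Listing all distinct areas is now a simple scan over Area entities:
areas = [a.key().name() for a in Area.all()]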
Another option is to use a queued task or cron job to periodically build a list of all areas, accumulating it over multiple requests if need be, and put the result either in the datastore or in memcache. That would of course mean the list of areas might be temporarily out of date at times (or, if there are constant changes, it might never be entirely up to date), which may or may not be acceptable to you.
Finally, if there are likely to be very few areas compared with tutorials, you could do it on the fly by requesting the first Tutorial (sorted by area), then requesting the first Tutorial whose area is greater than the first one's, and so on. But this requires one request per distinct area, so it is unlikely to be fast.
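A sketch of that on-the-fly walk, one query per distinct area:

from google.appengine.ext import db

def distinct_areas():
    areas = []
    t = db.GqlQuery("SELECT * FROM Tutorials ORDER BY area LIMIT 1").get()
    while t is not None:
        areas.append(t.area)
        # Jump past every Tutorial that shares the current area.
        t = db.GqlQuery("SELECT * FROM Tutorials WHERE area > :1 "
                        "ORDER BY area LIMIT 1", t.area).get()
    return areas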
The DISTINCT keyword was introduced in release 1.7.4.
This has been asked before, and the conclusion was that using sets is fine.