Django Rest Framework large number of queries for nested relationship - python

My problem
I have a large number of nested model serializers (5 levels deep) via one-to-many relationships. The total query time is not that high according to Django Debug Toolbar - maybe 100ms - but it takes about 6s of CPU time to run because the database is hit many hundreds of times per request.
The problem is, 3 of those tables are huge, with millions of rows. As such, whenever I use prefetch_related for anything it takes far longer to execute, sometimes minutes. select_related isn't an option, as the relationships are all one-to-many.
Imagine something like a pedometer that takes steps, and also ties multiple environment readings and accelerometer readings to a step.
Example Schema
Person
    id
    name
Pedometer
    id
    name
    owner (Person)
Step
    id
    info
    timestamp
    pedometer (Pedometer)
Environment_Readings
    id
    pressure
    step (Step)
Accelerometer_Readings
    id
    force
    step (Step)
What I want to achieve
I want to select all people, then an array of their pedometers (often more than one), then an array of all steps they have taken in the last minute, and for each step an array of environment readings and an array of accelerometer readings.
I am not seeing a way to prefetch Environment_Readings or Accelerometer_Readings, as neither has a timestamp. The only relations that seem prefetchable are pedometer and step, and for some reason prefetching pedometer seems to slow things down.
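For reference, a rough sketch of the kind of filtered prefetch being attempted - the imports, model names and default related accessor names (pedometer_set, step_set, environment_readings_set, accelerometer_readings_set) are assumptions based on the example schema, not the actual code:

from datetime import timedelta

from django.db.models import Prefetch
from django.utils import timezone

from myapp.models import Person, Step  # app layout assumed

# Limit the prefetched steps to the last minute; the nested reading
# prefetches then only fetch readings belonging to those steps.
one_minute_ago = timezone.now() - timedelta(minutes=1)
recent_steps = Step.objects.filter(timestamp__gte=one_minute_ago)

people = Person.objects.prefetch_related(
    'pedometer_set',
    Prefetch('pedometer_set__step_set', queryset=recent_steps),
    'pedometer_set__step_set__environment_readings_set',
    'pedometer_set__step_set__accelerometer_readings_set',
)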
Any ideas?

Related

What is a viable approach for loading a large quantity of data originating in csv files?

Background
Ultimately I'm trying to load a large quantity of data into GBQ. Records exist for the majority of seconds across many dates.
I want to have a table for each day's data so I may query the set using the syntax:
_TABLE_SUFFIX BETWEEN '20170403' AND '20170408'...
Some of the columns will be nested, and the sets of columns currently originate from a large number of CSV files, each between 30 and 150 MB in size.
Approach
My approach to processing the files is to:
1) Create tables with correct schema
2) Fill each table with a record for each second in preparation for cycling through data and running UPDATE queries
3) Cycle through csv files and load data
All this is intended to be driven by the GBQ client library for python.
Step (2) may seem odd, but I think it's needed to simplify the loading, as it lets me guarantee that UPDATE DML will work in all cases.
I do realise that (3) has little chance of working given the issues with (2), but I have run a similar operation at a slightly smaller scale, which leads me to believe that my approach might not be entirely hopeless at this larger scale.
I'm also left wondering whether this is the most appropriate method for loading data of this nature; some of the options presented in the GBQ documentation seem like overkill for a one-time load, i.e. I would not want to learn Apache Beam unless I know I have to, given that my chief concern is analysing the dataset. That said, I'm not beyond learning something like Apache Beam if I know it's the only way of completing such a load, as it could come in handy in future.
Problem that I've run into
In order to run (2) I have three nested for loops cycling through each hour, minute and second, constructing a long string to go into the INSERT query:
(TIMESTAMP '{timestamp}', TIMESTAMP '2017-04-03 00:00:00', ... + ~86k times for each day)
GBQ doesn't allow you to run such a long query: there is a limit, from memory around 250k characters (could be wrong). For that reason the length of the query is tested at each iteration, and if it exceeds a number slightly lower than 250k characters the INSERT query is executed.
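Roughly, the loop looks like the sketch below - the table name, the exact character cutoff and the flush step are placeholders rather than the real script:

from datetime import datetime

MAX_QUERY_CHARS = 240000  # stay slightly under the (remembered) ~250k limit
day = datetime(2017, 4, 3)
table = 'project.test_dataset.table_20170403'

values = []
chars = 0
for hour in range(24):
    for minute in range(60):
        for second in range(60):
            ts = day.replace(hour=hour, minute=minute, second=second)
            chunk = "(TIMESTAMP '{}')".format(ts.strftime('%Y-%m-%d %H:%M:%S'))
            if chars + len(chunk) > MAX_QUERY_CHARS:
                sql = "INSERT INTO `{}` (timestamp) VALUES {}".format(
                    table, ", ".join(values))
                # run sql with the client code shown further down, then reset
                values, chars = [], 0
            values.append(chunk)
            chars += len(chunk) + 2  # + 2 for the joining ", "
# any remaining values are flushed in one final INSERT after the loops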
Although this worked for a similar type of operation involving more columns, when I run it this time the process terminates when GBQ returns:
"google.cloud.exceptions.BadRequest: 400 Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.."
When I reduce the size of the strings and run smaller queries this error is returned less frequently, but it does not disappear altogether. Unfortunately, at this reduced query size the script would take too long to complete to serve as a feasible option for step (2): it takes >600 seconds, and the entire data set would take in excess of 20 days to run.
I had thought that changing the query mode to 'BATCH' might overcome this, but it's been pointed out to me that this won't help. I'm not yet clear why, but I'm happy to take someone's word for it. Hence I've detailed the problem further, as it seems likely that my approach is wrong.
This is what I see as the relevant code block:
import uuid

from google.cloud import bigquery
from google.cloud.bigquery import Client

... ...

bqc = Client()
dataset = bqc.dataset('test_dataset')

# 'sql' is the long INSERT string built above; run it as an async query job
job = bqc.run_async_query(str(uuid.uuid4()), sql)
job.use_legacy_sql = False  # use standard SQL
job.begin()
What would be the best approach to loading this dataset? Is there a way of getting my current approach to work?
Query
INSERT INTO `project.test_dataset.table_20170403` (timestamp) VALUES (TIMESTAMP '2017-04-03 00:00:00'), ... for ~7000 distinct values
Update
It seems as if the GBQ client library has had a not insignificant upgrade since I last downloaded it: Client.run_async_query() appears to have been replaced with Client.query(). I'll experiment with the new code and report back.
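For what it's worth, the equivalent call with the newer client API would presumably look something like this (assuming google-cloud-bigquery 0.28+ and the same sql string as above):

from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query(sql)  # starts the job asynchronously
query_job.result()             # blocks until the job finishes (raises on error)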

time series array database for python

We have a fairly big db about our traffic.
We have over 1000 trains daily, for which we gather data from different sources either on a per-minute or per-event basis (a train has arrived at a station, a car or loco composition has changed, etc.).
The DB currently has over 100M rows in MSSQL.
It turns out that 90% of the data is redundant; only the time, the station the train passed and the distance change.
Insert/update and reasonably simple queries are fine.
But when it comes to making statistical queries (like how many km a specific train/car/loco ran in a given period of time), both the response time and the query complexity become important (a response is needed within a 1-2 second range).
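For concreteness, a typical statistical query of this kind, run from Python against the existing MSSQL database (table and column names are assumptions, not the real schema):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mssql+pyodbc://...')  # existing MSSQL connection

sql = """
    SELECT train_id, SUM(distance_km) AS total_km
    FROM train_positions
    WHERE event_time >= '2017-01-01' AND event_time < '2017-02-01'
    GROUP BY train_id
"""
totals = pd.read_sql(sql, engine)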
What kind of DB/storage solution could I use for this kind of queries?
We have a python (Flask) front end for reporting so a DB solution having a python interface is a kind of must.
I've considered PyTables/pyhdf5, but I have some concerns about reliability (I cannot guarantee that only one process will write the file, so as per the documentation the risk of data corruption is high), and I cannot afford to lose the data.
Side note: I'm quite comfortable with DB optimization, so I know the limits of the relational approach fairly well.
Any ideas?

Pandas-ipython, how to create new data frames with drill down capabilities

I think my problem is simple but I've made a long post in the interest of being thorough.
I need to visualize some data, but first I need to perform some calculations that seem too cumbersome in Tableau (am I hated if I say Tableau sucks?).
My general problem is how to output the data with my calculations in a nice format that can be visualized either in Tableau or something else, so it needs to hang on to a lot of information.
My data set is a number of fields associated to usage of an application by user id. So there are potentially multiple entries for each user id and each entry (record) has information in columns such as time they began using app, end time, price they paid, whether they were on wifi, and other attributes (dimensions).
I have one year of data and want to do things like calculate average/total of duration/price paid in app over each month and over the full year of each user (remember each user will appear multiple times-each time they sign in).
I know some basics, like appending a column which subtracts start time from end time to get time spent. My Python is fully functional, but my data-handling skills are amateur.
My question is: say I want the following attributes (measures) calculated, all per user id - average price, total price, max/min price, median price, average duration, total duration, max/min duration, median duration, and number of times logged in (i.e. number of instances of the id) - all on a per-month and per-year basis. I know that I could calculate each of these things, but what is the best way to store them for use in a visualization?
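For concreteness, computing those per-user, per-month measures might look like the sketch below - the column names (user_id, start_time, end_time, price) are assumptions taken from the description, not the real dataset:

import pandas as pd

df = pd.read_csv('usage.csv', parse_dates=['start_time', 'end_time'])
df['duration_hrs'] = (df['end_time'] - df['start_time']).dt.total_seconds() / 3600
df['month'] = df['start_time'].dt.to_period('M')

monthly = df.groupby(['user_id', 'month']).agg(
    avg_price=('price', 'mean'),
    total_price=('price', 'sum'),
    median_price=('price', 'median'),
    avg_duration=('duration_hrs', 'mean'),
    total_duration=('duration_hrs', 'sum'),
    median_duration=('duration_hrs', 'median'),
    logins=('price', 'count'),
).reset_index()
# the same groupby on ['user_id'] alone (or on the year) gives the yearly version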
For context, I may want to visualize the group of users who paid on average more than $8 and were in the app for a total of more than 3 hours (up to this point a simple new table can be created with the info), but if I want it in terms of what shows they watched and whether they were on wifi (other attributes in the original data set), and I want to see it broken down monthly, it seems like having my new table of calculations won't cut it.
Would it then be best to create a yearly table and a table for each month, for a total of 13 tables, each of which contains the user ids over that time period with all the original information, and then append a column for each calculation (if the calc is an avg then I enter the same value for each instance of an id)?
I searched and found that maybe the plyr functionality in R would be useful, but I am very familiar with Python and IPython. All I need is a nice data set with all this info that can then be exported into visualization software, unless you can also suggest visualization tools in IPython :)
Any help is much appreciated, I'm so hoping it makes sense to do this in python as tableau is just painful for the calculation side of things....please help :)
It sounds like you want to run a database query like this:
SELECT user, show, month, wifi, SUM(time_in_pp)
FROM app_usage  -- table name assumed; use your usage table here
GROUP BY user, show, month, wifi
HAVING SUM(time_in_pp) > 3
Put it into a database and run your queries using the pandas SQL interface or ordinary Python queries. Presumably you would index your database table on these columns.
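A minimal sketch of that route via the pandas SQL interface, using SQLite as a stand-in database (table and column names are the same assumptions as above):

import sqlite3
import pandas as pd

con = sqlite3.connect('usage.db')
query = """
    SELECT user, show, month, wifi, SUM(time_in_pp) AS total_time
    FROM app_usage
    GROUP BY user, show, month, wifi
    HAVING SUM(time_in_pp) > 3
"""
result = pd.read_sql_query(query, con)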

Global leaderboard in Google App Engine

I want to build a backend for a mobile game that includes a "real-time" global leaderboard for all players, for events that last a certain number of days, using Google App Engine (Python).
A typical usage would be as follows:
- User starts and finishes a combat, acquiring points (2-5 mins for a combat)
- Points are accumulated in the player's account for the duration of the event.
- Player can check the leaderboard anytime.
- Leaderboard will return top 10 players, along with 5 players just above and below the player's score.
Now, there is no real constraint on the real-time aspect; the board could be updated anywhere from every 30 seconds to every hour. I would like it to be as "fast" as possible without costing too much.
Since I'm not very familiar with GAE, this is the solution I've thought of:
Each Player entity has a event_points attribute
Using a Cron job, at a regular interval, a query is made to the datastore for all players whose score is not zero. The query is sorted.
The cron job then iterates through the query results, writing back the rank in each Player entity.
When I think of this solution, it feels very "brute force".
The problem with this solution lies with the cost of reads and writes for all entities.
If we end up with 50K active users, this would mean a sorted query of 50K+1 reads and 50K+1 writes at regular intervals, which could be very expensive (depending on the interval).
I know that memcache can be a way to prevent some reads and some writes, but if some entities are not in memcache, does it make sense to query it at all?
Also, I've read that memcache can be flushed at any time anyway, so unless there is a way to "back it up" cheaply, it seems like a dangerous use, since the data is relatively important.
Is there a simpler way to solve this problem?
You don't need 50,000 reads or 50,000 writes. The solution is to set a sorting order on your points property. Every time you update it, the datastore will update its order automatically, which means that you don't need a rank property in addition to the points property. And you don't need a cron job, accordingly.
Then, when you need to retrieve a leaderboard, you run two queries: one for 6 entities with a number of points greater than or equal to your user's, and a second for 6 entities with a number of points less than or equal. Merge the results, and this is what you want to show to your user.
As for your top 10 query, you may want to put its results in Memcache with an expiration time of, say, 5 minutes. When you need it, you first check Memcache. If not found, run a query and update the Memcache.
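A minimal sketch of that Memcache-backed top 10 (the Player model, cache key and 5-minute expiry here are illustrative assumptions):

from google.appengine.api import memcache
from google.appengine.ext import ndb

class Player(ndb.Model):  # assumed model with the points property
    points = ndb.IntegerProperty(default=0)

def top_ten():
    top = memcache.get('leaderboard_top10')
    if top is None:
        top = Player.query().order(-Player.points).fetch(10)
        memcache.set('leaderboard_top10', top, time=300)  # expire after ~5 minutes
    return top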
EDIT:
To clarify the query part: you need to set the right combination of a sort order and an inequality filter to get the results that you want. According to the App Engine documentation, the query is performed in the following order:
1) Identifies the index corresponding to the query's kind, filter properties, filter operators, and sort orders.
2) Scans from the beginning of the index to the first entity that meets all of the query's filter conditions.
3) Continues scanning the index, returning each entity in turn, until it encounters an entity that does not meet the filter conditions, or reaches the end of the index, or has collected the maximum number of results requested by the query.
Therefore, you need to combine ASCENDING order with GREATER_THAN_OR_EQUAL filter for one query, and DESCENDING order with LESS_THAN_OR_EQUAL filter for the other query. In both cases you set the limit on the results to retrieve at 6.
One more note: you set a limit at 6 entities, because both queries will return the user itself. You can add another filter (userId NOT_EQUAL to your user's id), but I would not recommend it - the cost is not worth the savings. Obviously, you cannot use GREATER_THAN/LESS_THAN filters for points, because many users may have the same number of points.
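A sketch of those two queries using the ndb client library (the Player model and its points property are assumptions, not the poster's code):

from google.appengine.ext import ndb

class Player(ndb.Model):
    points = ndb.IntegerProperty(default=0)

def leaderboard_around(my_points):
    # ASCENDING order with a >= filter: the 6 nearest scores at or above mine
    above = (Player.query(Player.points >= my_points)
             .order(Player.points)
             .fetch(6))
    # DESCENDING order with a <= filter: the 6 nearest scores at or below mine
    below = (Player.query(Player.points <= my_points)
             .order(-Player.points)
             .fetch(6))
    # Merge, dedupe (the current user appears in both), and sort for display
    merged = {p.key: p for p in above + below}
    return sorted(merged.values(), key=lambda p: p.points, reverse=True)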
Here is a Google Developers article explaining a similar problem and a solution using the Google Code Jam ranking library. Further help and extensions to this library can be discussed in the Google Groups forum.
The library basically creates an N-ary tree, with each node containing the count of the scores in a particular range. The score ranges are subdivided all the way down to the leaf nodes, where each leaf holds a single score. A tree traversal (O(log n)) can then be used to find the number of players with a score higher than a given score - that is the player's rank. It also suggests aggregating the score-submission requests in a pull task queue and then processing them in a batch in a background thread on a backend.
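As a minimal illustration of the counting idea (not the library itself), a Fenwick tree over a bounded integer score range supports the same O(log n) rank lookups:

class ScoreCounter(object):
    def __init__(self, max_score):
        self.size = max_score + 1
        self.tree = [0] * (self.size + 1)  # 1-based Fenwick tree

    def add_score(self, score):
        i = score + 1
        while i <= self.size:
            self.tree[i] += 1
            i += i & -i

    def count_at_most(self, score):
        # number of recorded scores <= score, in O(log n)
        i, total = score + 1, 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def rank(self, score, total_players):
        # rank = 1 + number of players with a strictly higher score
        return 1 + (total_players - self.count_at_most(score))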
Whether this is simpler or not is debatable.
I have assumed that ranking is not just a matter of ordering an accumulation of points (in which case that's just a simple query), but that it involves other factors beyond the current score.
I would consider writing out an Event record for each update of points for a User (effectively a queue). Tasks then run collecting all the current Event records; in addition you maintain a set of records representing the top of the leaderboard. Adjust this set of records based on the incoming Event records, and discard Event records once processed. This limits your reads and writes to only the active events in a small time window. The leaderboard could probably be a single entity, fetched by key and cached.
I assume you may have different ranking schemes, like a current active rank (for the last 7 days) vs. an all-time rank (i.e. players who haven't played for a while won't have a good current rank).
As for players viewing their own rank, you can do that with two simple queries, Players.query(Players.score > somescore).fetch(5) and Players.query(Players.score < somescore).fetch(5); this shouldn't cost too much and you could cache the results.

Real time update of relative leaderboard for each user among friends

I've been working on a feature of my application to implement a leaderboard - basically stack-ranking users according to their score. I'm currently tracking the score on an individual basis. My thought is that this leaderboard should be relative instead of absolute, i.e. instead of showing the top 10 highest-scoring users across the site, it's a top 10 among a user's friend network. This seems better because everyone has a chance to be #1 in their own network, and there is a form of friendly competition for those who are interested in this sort of thing. I'm already storing the score for each user, so the challenge is how to compute the rank of that score in real time in an efficient way. I'm using Google App Engine, so there are some benefits and limitations (e.g., IN [array] queries perform a sub-query for every element of the array and are also limited to 30 elements per statement).
For example
1st Jack 100
2nd John 50
Here are the approaches I came up with, but they all seem inefficient, and I thought this community could come up with something more elegant. My sense is that any solution will likely be done with a cron job, and that I will store a daily rank and list order to optimize read operations, but it would be cool if there were something more lightweight and real-time.
1. Pull the list of all users of the site ordered by score.
For each user pick their friends out of that list and create new rankings.
Store the rank and list order.
Update daily.
Cons - If I get a lot of users this will take forever
2a. For each user pick their friends and for each friend pick score.
Sort that list.
Store the rank and list order.
Update daily.
Record the last position of each user so that the pre-existing list can be used for re-ordering for the next update in order to make it more efficient (may save sorting time)
2b. Same as above, except only compute the rank and list order for people whose profiles have been viewed in the last day.
Cons - rank is only up to date for the 2nd person that views the profile
If writes are very rare compared to reads (a key assumption in most key-value stores, and not just in those;-), then you might prefer to take a time hit when you need to update scores (a write) rather than when you get the relative leaderboards (a read). Specifically, when a user's score changes, queue up tasks for each of their friends to update their "relative leaderboards", and keep those leaderboards as list attributes (which do keep order!-) suitably sorted (yep, the latter's a denormalization -- it's often necessary to denormalize, i.e., duplicate information appropriately, to exploit key-value stores at their best!-).
Of course you'll also update the relative leaderboards when a friendship (user to user connection) disappears or appears, but those should (I imagine) be even rarer than score updates;-).
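A rough sketch of that fan-out-on-write idea using the deferred task queue API (the model and property names are assumptions, and the per-friend rebuild is deliberately naive):

from google.appengine.ext import deferred, ndb

class UserScore(ndb.Model):
    score = ndb.IntegerProperty(default=0)
    friend_ids = ndb.StringProperty(repeated=True)
    # denormalized, pre-sorted friend leaderboard (a list attribute keeps order)
    leaderboard = ndb.StringProperty(repeated=True)

def on_score_change(user_id):
    # fan out: each friend's relative leaderboard is rebuilt in its own task
    user = UserScore.get_by_id(user_id)
    for friend_id in user.friend_ids:
        deferred.defer(rebuild_leaderboard, friend_id)

def rebuild_leaderboard(user_id):
    user = UserScore.get_by_id(user_id)
    friends = ndb.get_multi([ndb.Key(UserScore, fid) for fid in user.friend_ids])
    ranked = sorted((f for f in friends if f), key=lambda f: f.score, reverse=True)
    user.leaderboard = [f.key.id() for f in ranked]
    user.put()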
If writes are pretty frequent, since you don't need perfectly precise up-to-the-second info (i.e., it's not financials/accounting stuff;-), you still have many viable approaches to try.
E.g., big score changes (rarer) might trigger the relative-leaderboards recomputes, while smaller ones (more frequent) get stashed away and only applied once in a while "when you get around to it". It's hard to be more specific without ballpark numbers about frequency of updates of various magnitude, typical network-friendship cluster sizes, etc, etc. I know, like everybody else, you want a perfect approach that applies no matter how different the sizes and frequencies in question... but, you just won't find one!-)
There is a python library available for storing rankings:
http://code.google.com/p/google-app-engine-ranklist/
