Django field for storing occurrence rates?

I have a Download model. I would like to store how often each instance is being downloaded.
I could just store a count, but this seems brutally simplistic: old files will seem more popular. I could mean-average against the upload date, but this makes newer files almost always come out better. I don't necessarily want to rank things, but I'd like to show meaningful data to the people managing the files, something like:
Last 24 hours
Last 7 days
Last 30 days
Last 365 days
Total
Is there a field, method, or some sort of generic add-on that I can add to my model to store rate so that I have counts for something like that?
Edit: It occurs to me that doing "last x" means having to store every download with a datetime until it falls off the end of the window (after which it only counts toward the total). I'm open to more efficient, slightly compromised solutions that involve less churn.

You have not stated how many records/downloads you will be handling, but a simple way of doing this is to create a DownloadCount model. Maybe something like:
from django.db import models

class DownloadCount(models.Model):
    date = models.DateField(auto_now_add=True)
    # on_delete is required on current Django versions
    item_downloaded = models.ForeignKey('Download', on_delete=models.CASCADE)
You will then be able to produce metrics based off of the date field.
Something like:
DownloadCount.objects.filter(
    date__range=(start_date, end_date),
    item_downloaded=<Download instance>
).count()
Obviously it would be highly advantageous to cache such results.
If you wanted to make the results a little more accurate you could also add an IPAddressField (GenericIPAddressField on current Django) to the model, making DownloadCount records unique based on IP address and date.
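For the windows listed in the question, a minimal sketch of a helper built on the DownloadCount model above might look like this (note that with a DateField the "last 24 hours" bucket is really "since yesterday"; switch to a DateTimeField if you need true hour-level windows):

from datetime import timedelta

from django.utils import timezone


def download_stats(download):
    # Counts of DownloadCount rows for several trailing windows, plus the total.
    now = timezone.now()
    qs = DownloadCount.objects.filter(item_downloaded=download)
    windows = {
        'last_24_hours': now - timedelta(hours=24),
        'last_7_days': now - timedelta(days=7),
        'last_30_days': now - timedelta(days=30),
        'last_365_days': now - timedelta(days=365),
    }
    stats = {name: qs.filter(date__gte=cutoff.date()).count()
             for name, cutoff in windows.items()}
    stats['total'] = qs.count()
    return stats

Cache the returned dict if these numbers are shown on a busy page.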

Related

Pandas-ipython, how to create new data frames with drill down capabilities

I think my problem is simple but I've made a long post in the interest of being thorough.
I need to visualize some data but first I need to perform some calculations that seem too cumbersome in Tableau (am I hated if I say tableau sucks!)
I have a general problem with how to output data with my calculations in a nice format that can be visualized either in Tableau or something else so it needs to hang on to a lot of information.
My data set is a number of fields associated to usage of an application by user id. So there are potentially multiple entries for each user id and each entry (record) has information in columns such as time they began using app, end time, price they paid, whether they were on wifi, and other attributes (dimensions).
I have one year of data and want to do things like calculate the average/total of duration/price paid in the app over each month and over the full year for each user (remember each user will appear multiple times, once each time they sign in).
I know some basics, like appending a column which subtracts start time from end time to get time spent and my python is fully functional but my data capabilities are amateur.
My question is, say I want the following attributes (measures) calculated (all per user id): average price, total price, max/min price, median price, average duration, total duration, max/min duration, median duration, and number of times logged in (so number of instances of id) and all on a per month and per year basis. I know that I could calculate each of these things but what is the best way to store them for use in a visualization?
For context, I may want to visualize the group of users who paid on average more than 8$ and were in the app a total of more than 3 hours (to this point a simple new table can be created with the info) but if I want it in terms of what shows they watched and whether they were on wifi (other attributes in the original data set) and I want to see it broken down monthly, it seems like having my new table of calculations won't cut it.
Would it then be best to create a yearly table and a table for each month for a total of 13 tables each of which contain the user id's over that time period with all the original information and then append a column for each calculation (if the calc is an avg then I enter the same value for each instance of an id)?
I searched and found that maybe the plyr functionality in R would be useful but I am very familiar with python and using ipython. All I need is a nice data set with all this info that can then be exported into a visualization software unless you can also suggest visualization tools in ipython :)
Any help is much appreciated, I'm so hoping it makes sense to do this in python as tableau is just painful for the calculation side of things....please help :)
It sounds like you want to run a database query like this:
SELECT user, show, month, wifi, sum(time_in_pp)
FROM usage_log   -- placeholder name for whatever table holds the raw records
GROUP BY user, show, month, wifi
HAVING sum(time_in_pp) > 3
Put it into a database and run your queries using the pandas SQL interface or ordinary Python queries. Presumably you would index your database table on these columns.
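If the data is already in a DataFrame, a rough pandas equivalent of that query looks like the sketch below; the column names (start_time, end_time, user, show, month, wifi) are assumptions, not taken from the original post:

import pandas as pd

df = pd.read_csv('usage.csv', parse_dates=['start_time', 'end_time'])

# Appended column: hours spent in the app per record
df['time_in_app'] = (df['end_time'] - df['start_time']).dt.total_seconds() / 3600

# Total hours per (user, show, month, wifi), keeping only groups over 3 hours
totals = (
    df.groupby(['user', 'show', 'month', 'wifi'])['time_in_app']
      .sum()
      .reset_index()
)
heavy_users = totals[totals['time_in_app'] > 3]

The same groupby can produce the other measures (mean/median/max/min of price or duration) by swapping .sum() for .agg(['mean', 'median', 'max', 'min']).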

Get the latest entries django

I currently have a function that writes one to four entries into a database every 12 hours. When certain conditions are met, the function is called again to write another 1-4 entries based on the previous ones. Since time isn't the only factor, I have to check whether the conditions are met, and because the entries are all in the same table I have to differentiate them by the time they were posted to the database (there is a DateTimeField in the model).
How could I achieve this? Is there a function built into Django that I just couldn't find, or would I have to look at a rather more complicated solution?
As a sketch, I'd expect something like this:
latest = []
allData = myManyToManyField.objects.get(externalId=2)
for data in allData:
    if data.Timestamp.checkIfLatest():  # checkIfLatest returns True/False
        latest.append(data)
Or, even better, something like this (although I don't think that's implemented):
latest = myManyToManyField.objects.get.latest.filter(externalId=2)
The Django documentation is very good, especially with regard to querysets and the model layer; it's usually the first place you should look. It sounds like you want .latest(), but it's hard to tell with your requirements regarding conditions.
latest_entry = m2mfield.objects.latest('mydatefield')

if latest_entry.somefield:
    # do something
Or perhaps you wanted:
latest_entry = m2mfield.objects.filter(somefield=True).latest('mydatefield')
You might also be interested in order_by(), which will order the rows according to a field you specify. You could then iterate over the resulting queryset until you find the entry that matches your condition.
But without more information on what these conditions are, it's hard to be more specific.
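As a rough sketch of the order_by() route (the field and condition names here are assumptions carried over from the snippets above):

# Newest first, then take the first row that also satisfies the extra condition.
candidates = m2mfield.objects.filter(externalId=2).order_by('-mydatefield')
latest_matching = next((entry for entry in candidates if entry.somefield), None)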
It's just a thought: you could keep an epoch-time field (the current time of the entry) in the database as a primary key, then compare it with the previous entries to differentiate them.

store a calculated value in the datastore or just calculate it on the fly

I'm writing an app in Python for Google App Engine where each user can submit a post, and each post has a ranking determined by its votes and comment count. The ranking is just a simple calculation based on these two parameters. I am wondering whether I should store this value in the datastore (and take up space there) or simply calculate it every time I need it. Just FYI, the posts will be sorted by ranking, so that needs to be taken into account.
I am mostly thinking about efficiency, trying to balance saving datastore space against saving read/write quota.
I would think it would be better to simply store it but then I need to recalculate and rewrite the ranking value every time anyone votes or comments on the post.
Any input would be great.
What about storing the ranking as a property on the post? That would make sense for querying/sorting, wouldn't it?
If you store the ranking at the same time (meaning in the same entity) as you store the vote/comment counts, then the only increase in write cost would be for the index (OK, the initial write cost too, but that is what, 2? Very small anyway).
You need to do a database operation every time anyone votes or comments on the post anyway, right? How else could you track votes/comments?
Actually, I imagine you will end up using text search to find data in the posts. If so, I would look into storing the ranking as a property in the search index and using it to rank matching results.
Don't we need to consider how you are selecting the posts to display? Is ranking by votes and comments the only criterion?
Caching is most useful when the calculation is expensive. If the calculation is simple and cheap, you might as well recalculate as needed.
If you're depending on keeping a running vote count in an entity, then you either have to be willing to lose an occasional vote, or you have to use transactions. If you use transactions, you're rate limited as to how many transactions you can do per second. (See the doc on transactions and entity groups). If you're liable to have a high volume of votes, rate limiting can be a problem.
For a low rate of votes, keeping a count in an entity might work fine. But if you have any significant peaks in voting rate, storing separate Vote entities that periodically get rolled up into a cached count, perhaps adjusted by (possibly unreliable) incremental counts kept in memcache, might work better for you.
It really depends on what you want to optimize for. If you're trying to minimize disk writes by keeping a vote count cached non-transactionally, you risk losing votes.
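As a very rough sketch of the "separate Vote entities rolled up into a cached count" idea, using the ndb client library; the model names, the ranking formula, and the memcache key are all assumptions for illustration:

from google.appengine.api import memcache
from google.appengine.ext import ndb


class Post(ndb.Model):
    vote_count = ndb.IntegerProperty(default=0)
    comment_count = ndb.IntegerProperty(default=0)
    ranking = ndb.FloatProperty(default=0.0)   # stored so queries can sort on it


class Vote(ndb.Model):
    post = ndb.KeyProperty(kind=Post)
    created = ndb.DateTimeProperty(auto_now_add=True)


def record_vote(post_key):
    # Cheap write: one small Vote entity plus an optimistic memcache bump.
    Vote(post=post_key).put()
    memcache.incr('pending_votes:%s' % post_key.id(), initial_value=0)


def roll_up(post_key):
    # Periodic task: fold the accumulated votes into the stored, indexed ranking.
    post = post_key.get()
    post.vote_count = Vote.query(Vote.post == post_key).count()
    post.ranking = post.vote_count + 0.5 * post.comment_count  # placeholder formula
    post.put()
    memcache.delete('pending_votes:%s' % post_key.id())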

Redis: find all objects older than

I store several properties of objects in hashes, among other things something like a "creation date". There are several hashes in the DB.
So, my question is: how can I find all objects older than, say, a week? Can you suggest an algorithm faster than O(n) (the naive implementation)?
Thanks,
Oles
My initial thought would be to store the data elsewhere, like a relational database, or possibly in a sorted set (zset).
If you had continuous data (meaning it was consistently recorded at regular time intervals), then you could store the hash key as the member and the date (as an integer timestamp) as the score. Then you could do a ZRANK for a particular date, and use ZREVRANGE to query from the first rank down to the value you get from ZRANK.
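A minimal sketch with redis-py, using ZRANGEBYSCORE directly instead of the ZRANK/ZREVRANGE combination above (the key name is made up for illustration):

import time

import redis

r = redis.Redis()

def track_creation(object_key, created_ts=None):
    # One sorted set whose score is the creation timestamp of each object.
    r.zadd('objects:by_created', {object_key: created_ts or time.time()})

def objects_older_than(seconds):
    # O(log N + M), where M is the number of matching members.
    cutoff = time.time() - seconds
    return r.zrangebyscore('objects:by_created', '-inf', cutoff)

old_keys = objects_older_than(7 * 24 * 3600)   # everything older than a week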

Efficiently determining if a business is open or not based on store hours

Given a time (eg. currently 4:24pm on Tuesday), I'd like to be able to select all businesses that are currently open out of a set of businesses.
I have the open and close times for every business for every day of the week
Let's assume a business can open/close only on 00, 15, 30, 45 minute marks of each hour
I'm assuming the same schedule each week.
I am most interested in being able to quickly look up a set of businesses that is open at a certain time, not the space requirements of the data.
Mind you, some may open at 11pm one day and close at 1am the next day.
Holidays don't matter - I will handle these separately
What's the most efficient way to store these open/close times such that with a single time/day-of-week tuple I can speedily figure out which businesses are open?
I am using Python, SOLR and mysql. I'd like to be able to do the querying in SOLR. But frankly, I'm open to any suggestions and alternatives.
If you are willing to just look at a single week at a time, you can canonicalize all opening/closing times to be set numbers of minutes since the start of the week, say Sunday 00:00. For each store, you create a number of tuples of the form [startTime, endTime, storeId]. (For hours that span Sunday midnight, you'd have to create two tuples, one running to the end of the week and one starting at the beginning of the week.) This set of tuples would be indexed (say, with a tree you pre-process) on both startTime and endTime. The tuples shouldn't be that large: there are only ~10k minutes in a week, which fits in 2 bytes. This structure would sit comfortably in a MySQL table with appropriate indexes, and would be very resilient to constant insertions and deletions of records as information changes. Your query would simply be "select storeId where startTime <= time and endTime >= time", where time is the canonicalized minutes since midnight on Sunday.
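A small sketch of that canonicalization (Python's weekday() counts Monday as 0, so it is shifted by one here to make Sunday the start of the week; the table and column names in the SQL are assumptions):

from datetime import datetime

def minutes_since_sunday(dt):
    day = (dt.weekday() + 1) % 7              # Sunday=0 ... Saturday=6
    return day * 24 * 60 + dt.hour * 60 + dt.minute

now = minutes_since_sunday(datetime.now())
sql = 'SELECT storeId FROM store_hours WHERE startTime <= %s AND endTime >= %s'
# cursor.execute(sql, (now, now))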
If information doesn't change very often, and you want lookups to be very fast, you could solve every possible query up front and cache the results. For instance, there are only 672 quarter-hour periods in a week. With a list of businesses, each of which has a list of opening and closing times like Brandon Rhodes's solution, you could simply iterate through every 15-minute period in the week, figure out who's open, then store the answer in a lookup table or in-memory list.
The bitmap field mentioned by another respondent would be incredibly efficient, but gets messy if you want to be able to handle half-hour or quarter-hour times, since you have to grow the number of bits and redesign the field each time you encounter a new resolution that you have to match.
I would instead try storing the values as datetimes inside a list:
openclosings = [ open1, close1, open2, close2, ... ]
Then, I would use Python's "bisect_right()" function in its built-in "bisect" module to find, in fast O(log n) time, where in that list your query time "fits". Then, look at the index that is returned. If it is an even number (0, 2, 4...) then the time lies between one of the "closed" times and the next "open" time, so the shop is closed then. If, instead, the bisection index is an odd number (1, 3, 5...) then the time has landed between an opening and a closing time, and the shop is open.
Not as fast as bitmaps, but you don't have to worry about resolution, and I can't think of another O(log n) solution that's as elegant.
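A small self-contained sketch of that bisect approach (the datetimes below are made-up example hours for a single week):

from bisect import bisect_right
from datetime import datetime

# Flat, sorted list: [open1, close1, open2, close2, ...]
openclosings = [
    datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 17, 0),   # Mon 9-5
    datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 17, 0),   # Tue 9-5
]

def is_open(when):
    # Odd insertion index -> we are between an opening and the following closing.
    return bisect_right(openclosings, when) % 2 == 1

print(is_open(datetime(2024, 1, 2, 16, 24)))   # True  (Tuesday 4:24pm)
print(is_open(datetime(2024, 1, 1, 20, 0)))    # False (Monday 8pm)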
You say you're using SOLR, don't care about storage, and want the lookups to be fast. Then instead of storing open/close tuples, index an entry for every open block of time at the level of granularity you need (15 mins). For the encoding itself, you could use just cumulative hours:minutes.
For example, a store open from 4-5 pm on Monday, would have indexed values added for [40:00, 40:15, 40:30, 40:45]. A query at 4:24 pm on Monday would be normalized to 40:15, and therefore match that store document.
This may seem inefficient at first glance, but it's a relatively small constant penalty for indexing speed and space. And makes the searches as fast as possible.
Sorry I don't have an easy answer, but I can tell you that as the manager of a development team at a company in the late 90's we were tasked with solving this very problem and it was HARD.
It's not the weekly hours that are tough; those can be handled with a relatively small bitmask (168 bits = 1 per hour of the week). The trick is the businesses which are closed every other Tuesday.
Starting with a bitmask then moving on to an exceptions field is the best solution I've ever seen.
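A tiny sketch of the 168-bit weekly bitmask idea (one bit per hour; which day starts the week is an arbitrary choice, Monday here). As noted earlier in the thread, finer resolution forces more bits (672 for quarter-hours), and recurring exceptions still need a separate field:

def hour_of_week(day, hour):
    # day: 0=Monday ... 6=Sunday
    return day * 24 + hour

def set_open(mask, day, hour):
    return mask | (1 << hour_of_week(day, hour))

def is_open(mask, day, hour):
    return bool(mask & (1 << hour_of_week(day, hour)))

mask = 0
for h in range(9, 17):                 # open Tuesday 9am-5pm
    mask = set_open(mask, 1, h)

print(is_open(mask, 1, 16))            # True  (Tuesday, 4pm hour)
print(is_open(mask, 1, 20))            # False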
In your Solr index, instead of indexing each business as one document with hours, index every "retail session" for every business during the course of a week.
For example if Joe's coffee is open Mon-Sat 6am-9pm and closed on Sunday, you would index six distinct documents, each with two indexed fields, "open" and "close". If your units are 15 minute intervals, then the values can range from 0 to 7*24*4. Assuming you have a unique ID for each business, store this in each document so you can map the sessions to businesses.
Then you can simply do a range search in Solr:
open:[* TO N] AND close:[N+1 TO *]
where N is the index of the 15-minute interval that the current time falls into, counting from the start of the week (say Monday 00:00). For example, if it's 10:10 AM on Wednesday, that is interval (2*24 + 10)*4 = 232, so your query would be:
open:[* TO 232] AND close:[233 TO *]
aka "find a session that starts at or before 10:00am Wed and ends at or after 10:15am Wed"
If you want to include other criteria in your search, such as location or products, you will need to index this with each session document as well. This is a bit redundant, but if your index is not huge, it shouldn't be a problem.
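A quick sketch of computing N under the Monday-00:00 convention used above (the query string simply follows the format in the answer):

from datetime import datetime

def interval_index(dt):
    # 15-minute interval within the week, Monday 00:00 = interval 0.
    return dt.weekday() * 96 + dt.hour * 4 + dt.minute // 15

now = datetime(2024, 1, 3, 10, 10)                 # a Wednesday, 10:10 AM
n = interval_index(now)                            # 232
query = 'open:[* TO {0}] AND close:[{1} TO *]'.format(n, n + 1)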
If you can control your data well, I see a simple solution, similar to @Sebastian's. Follow the advice of creating the tuples, except create them in the form [time=startTime, storeId] and [time=endTime, storeId], then sort these in a list. To find out if a store is open, simply do a query like:
select storeId
from table
where time <= '#1'
group by storeId
having count(storeId) % 2 = 1
To optimize this, you could build a lookup table: for each time t, store the set of stores open at t, plus the openings/closings between t and t+1 (for whatever granularity of t you choose).
However, this has the drawback of being harder to maintain (overlapping openings/closings need to be merged into a longer open-close period).
Have you looked at how many unique open/close time combinations there are? If there are not that many, make a reference table of the unique combinations and store the index of the appropriate entry against each business. Then you only have to search the reference table and then find the businesses with those indices.
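A quick sketch of that deduplication idea in plain Python (the schedule representation and the is_open check are borrowed from the bisect answer above; everything else is made up for illustration):

# Map each distinct weekly schedule to a small id, and each business to that id.
schedules = {}            # schedule tuple -> schedule_id
business_schedule = {}    # business_id   -> schedule_id

def register(business_id, openclosings):
    sid = schedules.setdefault(tuple(openclosings), len(schedules))
    business_schedule[business_id] = sid

def open_businesses(is_open, when):
    # Evaluate each unique schedule once, then fan the result out to businesses.
    open_ids = {sid for sched, sid in schedules.items() if is_open(sched, when)}
    return [b for b, sid in business_schedule.items() if sid in open_ids]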
