Evenly Distributed Slice of Django QuerySet (or Python list)

Evenly Distributed Slice of Django QuerySet (or Python list) - python

I'd like to get some values from a database and slice "n" of them, say to draw on a graph.
I'm starting with a Django QuerySet, but I don't think this is something I can reduce just via the ORM, so I'm fine with getting a list of values and then using whatever libs in Python are available to get a sample of the data.
So if I have a dataset of a thousand items, I'd like to be able to grab an evenly distributed range of non-random samples from that dataset, always including the start and end points and then evenly distributing the elements in between.
For instance, if:
data = [x for x in xrange(0, 777)]
and I wanted ten of them, how would I get a list back not of every 10th item, but exactly ten list items distributed evenly over the total number of elements in the list?
I'm trying:
number_of_results = 10
step = len(data) / number_of_results
data[::step]
But I'm hoping there's a more efficient way (and also a way that keeps the end point and returns exactly number_of_results items, even if the step between the items can't be exactly even).

There might be more efficient ways of doing it, but this is just a thought:
qs_ids = list(ModelA.objects.order_by('id').values_list('id', flat=True))
number_of_results = 10
step = len(qs_ids) / number_of_results
ids = data[::step]
qs = ModelA.objects.filter(id__in=ids).order_by('id')

Related

Get 10 random elements in Django template [duplicate]

How do I get two distinct random records using Django? I've seen questions about how to get one but I need to get two random records and they must differ.

The order_by('?')[:2] solution suggested by other answers is actually an extraordinarily bad thing to do for tables that have large numbers of rows. It results in an ORDER BY RAND() SQL query. As an example, here's how mysql handles that (the situation is not much different for other databases). Imagine your table has one billion rows:
To accomplish ORDER BY RAND(), it needs a RAND() column to sort on.
To do that, it needs a new table (the existing table has no such column).
To do that, mysql creates a new, temporary table with the new columns and copies the existing ONE BILLION ROWS OF DATA into it.
As it does so, it does as you asked, and runs rand() for every row to fill in that value. Yes, you've instructed mysql to GENERATE ONE BILLION RANDOM NUMBERS. That takes a while. :)
A few hours/days later, when it's done it now has to sort it. Yes, you've instructed mysql to SORT THIS ONE BILLION ROW, WORST-CASE-ORDERED TABLE (worst-case because the sort key is random).
A few days/weeks later, when that's done, it faithfully grabs the two measly rows you actually needed and returns them for you. Nice job. ;)
Note: just for a little extra gravy, be aware that mysql will initially try to create that temp table in RAM. When that's exhausted, it puts everything on hold to copy the whole thing to disk, so you get that extra knife-twist of an I/O bottleneck for nearly the entire process.
Doubters should look at the generated query to confirm that it's ORDER BY RAND() then Google for "order by rand()" (with the quotes).
A much better solution is to trade that one really expensive query for three cheap ones (limit/offset instead of ORDER BY RAND()):
import random
last = MyModel.objects.count() - 1
index1 = random.randint(0, last)
# Here's one simple way to keep even distribution for
# index2 while still gauranteeing not to match index1.
index2 = random.randint(0, last - 1)
if index2 == index1: index2 = last
# This syntax will generate "OFFSET=indexN LIMIT=1" queries
# so each returns a single record with no extraneous data.
MyObj1 = MyModel.objects.all()[index1]
MyObj2 = MyModel.objects.all()[index2]

If you specify the random operator in the ORM I'm pretty sure it will give you two distinct random results won't it?
MyModel.objects.order_by('?')[:2] # 2 random results.

For the future readers.
Get the the list of ids of all records:
my_ids = MyModel.objects.values_list('id', flat=True)
my_ids = list(my_ids)
Then pick n random ids from all of the above ids:
n = 2
rand_ids = random.sample(my_ids, n)
And get records for these ids:
random_records = MyModel.objects.filter(id__in=rand_ids)

Object.objects.order_by('?')[:2]
This would return two random-ordered records. You can add
distinct()
if there are records with the same value in your dataset.

About sampling n random values from a sequence, the random lib could be used,
random.Random().sample(range(0,last),2)
will fetch 2 random samples from among the sequence elements, 0 to last-1

from django.db import models
from random import randint
from django.db.models.aggregates import Count
class ProductManager(models.Manager):
def random(self, count=5):
index = randint(0, self.aggregate(count=Count('id'))['count'] - count)
return self.all()[index:index + count]
You can get different number of objects.

class ModelName(models.Model):
# Define model fields etc
#classmethod
def get_random(cls, n=2):
"""Returns a number of random objects. Pass number when calling"""
import random
n = int(n) # Number of objects to return
last = cls.objects.count() - 1
selection = random.sample(range(0, last), n)
selected_objects = []
for each in selection:
selected_objects.append(cls.objects.all()[each])
return selected_objects

Keeping count of values available from among multiple sets

I have the following situation:
I am generating n combinations of size 3 from, made from n values. Each kth combination [0...n] is pulled from a pool of values, located in the kth index of a list of n sets. Each value can appear 3 times. So if I have 10 values, then I have a list of size 10. Each index holds a set of values 0-10.
So, it seems to me that a good way to do this is to have something keeping count of all the available values from among all the sets. So, if a value is rare(lets say there is only 1 left), if I had a structure where I could look up the rarest value, and have the structure tell me which index it was located in, then it would make generating the possible combinations much easier.
How could I do this? What about one structure to keep count of elements, and a dictionary to keep track of list indices that contain the value?
edit: I guess I should put in that a specific problem I am looking to solve here, is how to update the set for every index of the list (or whatever other structures i end up using), so that when I use a value 3 times, it is made unavailable for every other combination.
Thank you.
Another edit
It seems that this may be a little too abstract to be asking for solutions when it's hard to understand what I am even asking for. I will come back with some code soon, please check back in 1.5-2 hours if you are interested.

how to update the set for every index of the list (or whatever other structures i end up using), so that when I use a value 3 times, it is made unavailable for every other combination.
I assume you want to sample the values truly randomly, right? What if you put 3 of each value into a list, shuffle it with random.shuffle, and then just keep popping values from the end of the list when you're building your combination? If I'm understanding your problem right, here's example code:
from random import shuffle
valid_values = [i for i in range(10)] # the valid values are 0 through 9 in my example, update accordingly for yours
vals = 3*valid_values # I have 3 of each valid value
shuffle(vals) # randomly shuffle them
while len(vals) != 0:
combination = (vals.pop(), vals.pop(), vals.pop()) # combinations are 3 values?
print(combination)
EDIT: Updated code based on the added information that you have sets of values (but this still assumes you can use more than one value from a given set):
from random import shuffle
my_sets_of_vals = [......] # list of sets
valid_values = list()
for i in range(my_sets_of_vals):
for val in my_sets_of_vals[i]:
valid_values.append((i,val)) # this can probably be done in list comprehension but I forgot the syntax
vals = 3*valid_values # I have 3 of each valid value
shuffle(vals) # randomly shuffle them
while len(vals) != 0:
combination = (vals.pop()[1], vals.pop()[1], vals.pop()[1]) # combinations are 3 values?
print(combination)

Based on the edit you could make an object for each value. It could hold the number of times you have used the element and the element itself. When you find you have used an element three times, remove it from the list

Efficient Python way to process two huge files?

I am working on a problem where I have to find if a number falls within a certain range. However, the problem is complicated due to the fact that the files I am dealing with have hundreds of thousands of lines.
Below I try to explain the problem in as simple a language as possible.
Here is a brief description of my input files :
File Ranges.txt has some ranges whose min and max are tab separated.
10 20
30 40
60 70
This can have about 10,000,000 such lines with ranges.
NOTE: The ranges never overlap.
File Numbers.txt has a list of numbers and some values associated with each number.
12 0.34
22 0.14
34 0.79
37 0.87
And so on. Again there are hundreds of thousands of such lines with numbers and their associated values.
What I wish to do is take every number from Numbers.txt and check if it falls within any of the ranges in Ranges.txt.
For all such numbers that fall within a range, I have to get a mean of their associated values (ie a mean per range).
For eg. in the example above in Numbers.txt, there are two numbers 34 and 37 that fall within the range 30-40 in Ranges.txt, so for the range 30-40 I have to calculate the mean of the associated values of 34 and 37. (i.e mean of 0.79 and 0.87), which is 0.82
My final output file should be the Ranges.txt but with the mean of the associated values of all numbers falling within each range. Something like :
Output.txt
10 20 <mean>
30 40 0.82
60 70 <mean>
and so on.
Would appreciate any help and ideas on how this can be written efficiently in Python.

Obviously you need to run each line from Numbers.txt against each line from Ranges.txt.
You could just iterate over Numbers.txt, and, for each line, iterate over Ranges.txt. But this will take forever, reading the whole Ranges.txt file millions of times.
You could read both of them into memory, but that will take a lot of storage, and it means you won't be able to do any processing until you've finished reading and preprocessing both files.
So, what you want to do is read Ranges.txt into memory once and store it as, say, a list of pairs of ints instead, but read Numbers.txt lazily, iterating over the list for each number.
This kind of thing comes up all the time. In general, you want to make the bigger collection into the outer loop, and make it as lazy as possible, while the smaller collection goes into the inner loop, and is pre-processed to make it as fast as possible. But if the bigger collection can be preprocessed more efficiently (and you have enough memory to store it!), reverse that.
And speaking of preprocessing, you can do a lot better than just reading into a list of pairs of ints. If you sorted Ranges.txt, you could find the closest range without going over by bisecting then just check that (18 steps), instead of checking each range exhaustively (100000 steps).
This is a bit of a pain with the stdlib, because it's easy to make off-by-one errors when using bisect, but there are plenty of ActiveState recipes to make it easier (including one linked from the official docs), not to mention third-party modules like blist or bintrees that give you a sorted collection in a simple OO interface.
So, something like this pseudocode:
with open('ranges.txt') as f:
ranges = sorted([map(int, line.split()) for line in f])
range_values = {}
with open('numbers.txt') as f:
rows = (map(int, line.split()) for line in f)
for number, value in rows:
use the sorted ranges to find the appropriate range (if any)
range_values.setdefault(range, []).append(value)
with open('output.txt') as f:
for r, values in range_values.items():
mean = sum(values) / len(values)
f.write('{} {} {}\n'.format(r[0], r[1], mean))
By the way, if the parsing turns out to be any more complicated than just calling split on each line, I'd suggest using the csv module… but it looks like that won't be a problem here.
What if you can't fit Ranges.txt into memory, but can fit Numbers.txt? Well, you can sort that, then iterate over Ranges.txt, find all of the matches in the sorted numbers, and write the results out for that range.
This is a bit more complicated, because it you have to bisect_left and bisect_right and iterate everything in between. But that's the only way in which it's any harder. (And here, a third-party class will help even more. For example, with a bintrees.FastRBTree as your sorted collection, it's just sorted_number_tree[low:high].)
If the ranges can overlap, you need to be a bit smarter—you have to find the closest range without going over the start, and the closest range without going under the end, and check everything in between. But the main trick there is the exact same one used for the last version. The only other trick is to keep two copies of ranges, one sorted by the start value and one by the end, and you'll need to have one of them be a map to indices in the other instead of just a plain list.

The naive approach would be to read Numbers.txt into some structure in number order, then read each line of Ranges, us a binary search to find the lowest number in the range, and the read through the numbers higher than that to find all those within the range, so that you can produce the corresponding line of output.
I assume the problem is that you can't have all of Numbers in memory.
So you could do the problem in phases, where each phase reads a portion of Numbers in, then goes through the process outlined above, but using an annotated version of Ranges, where each line includes the COUNT of the values so far that has produced that mean, and will write a similarly annotated version.
Obviously, the initial pass will not have an annotated version of Ranges, and the final pass will not produce one.

It looks like your data in both the files are already sorted. If not, first sort them by an external tool or using Python.
Then, you can go through the two files in parallel. You read a number from the Numbers.txt file, and see if it is in a range in Ranges.txt file, reading as many lines from that file as needed to answer that question. Then read the next number from Numbers.txt, and repeat. The idea is similar to merging two sorted arrays, and should run in O(n+m) time, n and m are the sizes of the files. If you need to sort the files, the run time is O(n lg(n) + m lg(m)). Here is a quick program I wrote to implement this:
import sys
from collections import Counter
class Gen(object):
__slots__ = ('rfp', 'nfp', 'mn', 'mx', 'num', 'd', 'n')
def __init__(self, ranges_filename, numbers_filename):
self.d = Counter() # sum of elements keyed by range
self.n = Counter() # number of elements keyed by range
self.rfp = open(ranges_filename)
self.nfp = open(numbers_filename)
# Read the first number and the first range values
self.num = float(self.nfp.next()) # Current number
self.mn, self.mx = [int(x) for x in self.rfp.next().split()] # Current range
def go(self):
while True:
if self.mx < self.num:
try:
self.mn, self.mx = [int(x) for x in self.rfp.next().split()]
except StopIteration:
break
else:
if self.mn <= self.num <= self.mx:
self.d[(self.mn, self.mx)] += self.num
self.n[(self.mn, self.mx)] += 1
try:
self.num = float(self.nfp.next())
except StopIteration:
break
self.nfp.close()
self.rfp.close()
return self.d, self.n
def run(ranges_filename, numbers_filename):
r = Gen(ranges_filename, numbers_filename)
d, n = r.go()
for mn, mx in sorted(d):
s, N = d[(mn, mx)], n[(mn, mx)]
if s:
av = s/N
else:
av = 0
sys.stdout.write('%d %d %.3f\n' % (mn, mx, av))
On files with 10,000,000 numbers in each of the files, the above runs in about 1.5 minute on my computer without the output part.

Program running too slow! : Suggest Algorithmic/Implementation Optimization

I have a huge python list(A) of lists. The length of list A is around 90,000. Each inner list contain around 700 tuples of (datetime.date,string). Now, I am analyzing this data. What I am doing is I am taking a window of size x in inner lists where- x = len(inner list) * (some fraction <= 1) and I am saving each ordered pair (a,b) where a occurs before b in that window (actually the innerlists are sorted wrt time). I am moving this window upto the last element adding one element at a time from one end and removing from other which takes O(window-size)time as I am considering the new tuples only. My code:
for i in xrange(window_size):
j = i+1;
while j<window_size:
check_and_update(cur, my_list[i][1], my_list[j][1],log);
j=j+1
i=1;
while i<=len(my_list)-window_size:
j=i;
k=i+window_size-1;
while j<k:
check_and_update(cur, my_list[j][1], my_list[k][1],log);
j+=1
i += 1
Here cur is actually a sqlite3 database cursor,my_list is a list containing the tuples and I iterate this code for all the lists in A and log is a opened logfile. In method check_and_update() I am looking up my database to find the tuple if exists or else I insert it, along with its total number of occurrence so far. Code:
def check_and_update(cur,start,end,log):
t = str(start)+":"+ str(end)
cur.execute("INSERT OR REPLACE INTO Extra (tuple,count)\
VALUES ( ? , coalesce((SELECT count +1 from Extra WHERE tuple = ?),1))",[t,t])
As expected this number of tuples is HUGE and I have previously experimented with dictionary which eats up the memory quite fast. So, I resorted to SQLite3, but now it is too slow. I have tried indexing but with no help. Probably the my program is spending way to much time querying and updating the database. Do you have any optimization ideas for this problem? Probably changing the algorithm or some different approach/tools. Thank you!
Edit: My goal here is to find the total number of tuples of strings that occur within the window grouped by the number of different innerlists they occur in. I extract this information with this query:
for i in range(1,size+1):
cur.execute('select * from Extra where count = ?',str(i))
#other stuff
For Example ( I am ignoring the date entries and will write them as 'dt'):
My_list = [
[ ( dt,'user1') , (dt, 'user2'), (dt, 'user3') ]
[ ( dt,'user3') , (dt, 'user4')]
[ ( dt,'user2') , (dt, 'user3'), (dt,'user1') ]
]
here if I take fraction = 1 then, results:
only 1 occurrence in window: 5 (user 1-2,1-3,3-4,2-1,3-1)
only 2 occurrence in window: 2 (user 2-3)

Let me get this straight.
You have up to about 22 billion potential tuples (for 90000 lists, any of 700, any of the following entries, on average 350) which might be less depending on the window size. You want to find, but number of inner lists that they appear in, how many tuples there are.
Data of this size has to live on disk. The rule for data that lives on disk due to size is, "Never randomly access, instead generate and then sort."
So I'd suggest that you write out each tuple to a log file, one tuple per line. Sort that file. Now all instances of any given tuple are in one place. Then run through the file, and for each tuple emit the count of how many times it appears in (that is how many inner lists it is in). Sort that second file. Now run through that file, and you can extract how many tuples appeared 1x, 2x, 3x, etc.
If you have multiple machines, it is easy to convert this into a MapReduce. (Which is morally the same approach, but you get to parallelize a lot of stuff.)

Apache Hadoop is one of the MapReduce implementations that is suited for this kind of problem:

3 questions about appengine indexes

I got the following situation
class M(db.Model):
a = db.ReferenceProperty(A)
x = db.ReferenceProperty(X)
y = db.ReferenceProperty(Y)
z = db.ReferenceProperty(Z)
items = db.StringListProperty()
date = db.DateTimeProperty()
I want to make queries that filter on (a), (x, y or z) and (items), ordered by date i.e.
mm = M.all().filter('a =', a1).filter('x =', x1).filter('items =', i).order('-date')
There will never be a query with filter on x and y at the same time, for example.
So, my questions are:
1) How many (and which) indexes should I create?
2) How many 'strings' can I add on items? (I'd like to add in the order of thousands)
3) How many index records will I have on a single "M" if there are 1000 items?
I don't quite yet understand this index stuff, and is killing me. Your help will be very appreciated :)

This article explains indexes/exploding indexes quite well, and it actually fits your example: https://developers.google.com/appengine/docs/python/datastore/queries#Big_Entities_and_Exploding_Indexes
Your biggest issue will be the fact that you will probably run into the 5000 indexes per entity limit with thousands of items. If you take an index for a, x, items (1000 items), date: |a||x||items|*|date| == 1*1*1000*1 == 1000.
If you have 5001 entries in items, the put() will fail with the appropriate exception.
From the example you provided, whether you filter on x, y or anything else seems irrelevant, as there is only 1 of that property, and therefore you do not run the chance of an exploding index. 1*1 == 1.
Now, if you had two list properties, you would want to make sure that they are indexed separately, otherwise you'd get an exploding index. For example, if you had 2 list properties with 100 items each, that would produce 100*100 indexes, unless you split them, which would then result in only 200 (assuming all other properties were non-lists).

For the criteria you have given, you only need to create three compound indexes: a,x,items,-date, a,y,items,-date, a,z,items,-date. Note that a list property creates an index entry for each property in the list.
There is a limit of total 5000 index entries per entity. If you only have three compound indexes, than it's 5000/3 = 1666 (capped at 1000 for a single list property).
In the case of three compound indexes only, 3*1000 = 3000.
NOTE: the above assumes you do not have built-in indexes per property (= properties are save as unindexed). Otherwise you need to account for built-in indexes at 2N, where N in number of single properties (2 is for asc, desc). In your case this would be 2*(5 + no_items), since items is a list property and every entry creates an index entry.

See also https://developers.google.com/appengine/articles/indexselection, which describes App Engine's (relatively recent) improved query planning capabilities. Basically, you can reduce the number of index entries needed to: (number of filters + 1) * (number of orders).
Though, as the article discusses, there can be reasons that you might still use compound indexes-- essentially, there is a time/space tradeoff.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.