What is causing inefficiency when parsing this QuerySet into tuples? - python

In a Django app, I am attempting to parse a QuerySet, representing individual time-series values x from n sensors, into tuples (t, x1, x2 ... xn), and from there into a JSON object in the format specified by Google Charts here: https://developers.google.com/chart/interactive/docs/gallery/linechart
None values are used as placeholders if no value was logged for a given timestamp from a particular sensor.
The page load time is significant for a QuerySet with ~6500 rows (~3 seconds, run locally).
It's significantly longer on the server:
http://54.162.202.222/pulogger/simpleview/?device=test
Profiling indicates that 99.9% of the time is spent in _winapi.WaitForSingleObject (which I can't interpret), and manual profiling with a timer indicates that the server-side culprit is the while loop that iterates over the QuerySet and groups the values into tuples (the outer loop in the code below).
Results are as follows:
basic gets (took 5ms)
queried data (took 0ms)
split data by sensor (took 981ms)
prepared column labels/types (took 0ms)
prepared json (took 27ms)
created context (took 0ms)
For the sake of completeness, the timing function is as follows:
def print_elapsed_time(ref_datetime, description):
    print('{} (took {}ms)'.format(description, floor((datetime.now()-ref_datetime).microseconds/1000)))
    return datetime.now()
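As an aside, timedelta.microseconds only carries the sub-second part of an interval, so spans of a second or more would be under-reported (every interval above is below one second, so those figures stand). A variant using total_seconds(), sketched here with the same signature and the same datetime/floor imports assumed, avoids that:
def print_elapsed_time(ref_datetime, description):
    # total_seconds() counts the whole interval, not just the microsecond component
    elapsed_ms = floor((datetime.now() - ref_datetime).total_seconds() * 1000)
    print('{} (took {}ms)'.format(description, elapsed_ms))
    return datetime.now()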
The code performing the processing and generating the view is as follows:
def simpleview(request):
    time_marker = datetime.now()

    device_name = request.GET['device']
    device = Datalogger.objects.get(device_name=device_name)
    sensors = Sensor.objects.filter(datalogger=device).order_by('pk')
    sensor_count = len(sensors)  # should be no worse than count() since already-evaluated and cached. todo: confirm

    # assign each sensor an index for the tuples (zero is used for time/x-axis)
    sensor_indices = {}
    for idx, sensor in enumerate(sensors, start=1):
        sensor_indices.update({sensor.sensor_name: idx})

    time_marker = print_elapsed_time(time_marker, 'basic gets')

    # process data into timestamp-grouped tuples accessible by sensor-index ([0] is timestamp)
    raw_data = SensorDatum.objects.filter(sensor__datalogger__device_name=device_name).order_by('timestamp', 'sensor')
    data = []
    data_idx = 0

    time_marker = print_elapsed_time(time_marker, 'queried data')

    while data_idx < len(raw_data):
        row_list = [raw_data[data_idx].timestamp]
        row_list.extend([None]*sensor_count)
        row_idx = 1
        while data_idx < len(raw_data) and raw_data[data_idx].timestamp == row_list[0]:
            row_idx = sensor_indices.get(raw_data[data_idx].sensor.sensor_name)
            row_list[row_idx] = raw_data[data_idx].value
            data_idx += 1
        data.append(tuple(row_list))

    time_marker = print_elapsed_time(time_marker, 'split data by sensor')

    column_labels = ['Time']
    column_types = ["datetime"]
    for sensor in sensors:
        column_labels.append(sensor.sensor_name)
        column_types.append("number")

    time_marker = print_elapsed_time(time_marker, 'prepared column labels/types')

    gchart_json = prepare_data_for_gchart(column_labels, column_types, data)

    time_marker = print_elapsed_time(time_marker, 'prepared json')

    context = {
        'device': device_name,
        'sensor_count': sensor_count,
        'sensor_indices': sensor_indices,
        'gchart_json': gchart_json,
    }

    time_marker = print_elapsed_time(time_marker, 'created context')

    return render(request, 'pulogger/simpleTimeSeriesView.html', context)
I'm new to python, so I expect that there's a poor choice of operation/collection I've used somewhere. Unless I'm blind, it should run in O(n).
Obviously this isn't the whole problem since it only accounts for a part of the apparent load-time, but I figure this is a good place to start.

The "queried data" section is taking 0ms because that section is constructing the query, not executing your query against the database.
The query is being executed when it gets to this line: while data_idx < len(raw_data):, because to calculate the length of the iterable it must evaluate it.
So it may not be the loop that's taking most of the time, it's probably the query execution and evaluation. You can evaluate the query before the main loop by wrapping the queryset in a list(), this will allow your time_marker to display how long the query is actually taking to execute.
Do you need the queryset evaluated to a model? Alternatively you could use .values() or .values_list() to return an actual list of values, which skips serializing the query results into Model objects. By doing this you also avoid having to return all the columns from the database, you only return the ones you need.
You could potential remove the table join in this query SensorDatum.objects.filter(sensor__datalogger__device_name=device_name).order_by('timestamp', 'sensor') by denormalizing your schema (if possible) to have the device_name field on the sensor.
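A rough sketch of the values_list() route, reusing the model and field names from the question (the explicit list() call is what forces the query to run at that point):
raw_data = list(
    SensorDatum.objects
    .filter(sensor__datalogger__device_name=device_name)
    .order_by('timestamp', 'sensor')
    .values_list('timestamp', 'sensor__sensor_name', 'value')
)
time_marker = print_elapsed_time(time_marker, 'queried data')  # now reflects the actual query time
# each row is now a plain (timestamp, sensor_name, value) tuple; the grouping loop would index
# into these tuples instead of touching raw_data[data_idx].sensor, so no further queries are triggered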

You have queries running inside your loop: each access to raw_data[data_idx].sensor hits the database again for the related object. You can use select_related to fetch and cache the related objects up front.
Example:
raw_data = SensorDatum.objects.filter(
    sensor__datalogger__device_name=device_name
).order_by(
    'timestamp',
    'sensor'
).select_related('sensor')  # this will fetch and cache sensor objects and will prevent further db queries in the loop
Ref: select_related Django 2.1 Docs
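A quick way to confirm the effect, assuming DEBUG=True so Django records executed queries, is to compare the query count with and without select_related:
from django.db import connection, reset_queries

reset_queries()
for datum in raw_data:
    _ = datum.sensor.sensor_name
print(len(connection.queries))  # 1 with select_related('sensor'); roughly one extra query per row without it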

Related

Pyneo Inserts Limited Amount of Edges

For a university project I am using Neo4j together with Flask and pyneo for a shift-scheduling algorithm. On saving the scheduled shifts to Neo4j I realized that relationships go missing: of 330, only 91 get inserted.
Printing them before/after inserting shows they are all in the list to be inserted, and I also moved the transaction around to check whether that changes the result.
I have the following structure:
(w:Worker)-[r:works_during]->(s:Shift), with r.day, r.month and r.year set as parameters on the relationship, and multiple connections between each worker and each shift, which can then be filtered via the relationship.
My code looks like the following:
header = df.columns.tolist()
header.remove("index")
header.remove("worker")
tuplelist = []

for index, row in df.iterrows():
    for i in header:
        worker = self.driver.nodes.match("Worker", id=int(row["worker"])).first()
        if row[i] == 1:
            # Shifts are in the format {day}_{shift_of_day}
            shift_id = str(i).split("_")[1]
            shift_day = str(i).split("_")[0]
            shift = self.driver.nodes.match("Shift", id=int(shift_id)).first()
            rel = Relationship(worker, "works_during", shift)
            rel["day"] = int(shift_day)
            rel["month"] = int(month)
            rel["year"] = int(year)
            tuplelist.append(rel)

print(len(tuplelist))

for i in tuplelist:
    connection = self.driver.begin()
    connection.create(i)
    connection.commit()
Is there any special behaviour in pyneo which I need to be aware of that could cause this issue?
Pyneo allows just one relationship of a given type between node A and node B.
If multiple relationships of the same type (even with different attributes) are needed, it is necessary to use plain Cypher queries, as pyneo will merge these edges into a single edge.
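A minimal sketch of that Cypher route, assuming the same Worker/Shift id properties as above and that the driver object exposes a run() method for parameterised Cypher:
cypher = (
    "MATCH (w:Worker {id: $worker_id}), (s:Shift {id: $shift_id}) "
    "CREATE (w)-[r:works_during {day: $day, month: $month, year: $year}]->(s)"
)
self.driver.run(cypher, worker_id=int(row["worker"]), shift_id=int(shift_id),
                day=int(shift_day), month=int(month), year=int(year))
# CREATE (unlike MERGE) adds a new relationship every time, so edges of the same type
# with different day/month/year attributes are all kept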

Update Query if data of 2 columns is equal to a particular string

My table contains user query data. I generate a hashed string by doing the following:
queries = Query.objects.values('id', 'name')
# creating a bytes string using the ID, NAME columns and a string "yes" (this string could be anything, I've considered yes as an example)
data = (str(query['id']) + str(query['name']) + "yes").encode()
link_hash = hashlib.pbkdf2_hmac('sha256', data, b"satisfaction", 100000)
link_hash_string = binascii.hexlify(link_hash).decode()
I've sent this hash string via email, embedded in a link which is checked when the user visits it. My current method of checking whether the hash (taken from the GET parameter in the link) matches some row in the table is like this:
queries = Query.objects.values('id', 'name')

# I've set replyHash as a string here as an example, it is generated by the code written above, but the hash will be got from the GET parameter in the link
replyHash = "269e1b3de97b10cd28126209860391938a829ef23b2f674f79c1436fd1ea38e4"

# Currently iterating through all the queries and checking each of the query
for query in queries:
    data = (str(query['id']) + str(query['name']) + "yes").encode()
    link_hash = hashlib.pbkdf2_hmac('sha256', data, b"satisfaction", 100000)
    link_hash_string = binascii.hexlify(link_hash).decode()
    if replyHash == link_hash_string:
        print("It exists, valid hash")
        query['name'] = "BooBoo"
        query.save()
        break
The problem with this approach is that if I have a large table with thousands of rows, this method will take a lot of time. Is there an approach using annotation or aggregation or something else which will perform the same action in less time?
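One possible alternative, sketched under the assumption that the schema can gain a stored hash column (a hypothetical link_hash field populated when the email link is generated), is to pay the pbkdf2_hmac cost once per link and check it later with a single indexed lookup instead of re-hashing every row on each request:
match = Query.objects.filter(link_hash=replyHash).first()
if match is not None:
    match.name = "BooBoo"
    match.save()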

Bulk create Django with unique sequences or values per record?

I have what is essentially a table which is a pool of available codes/sequences for unique keys when I create records elsewhere in the DB.
Right now I run a transaction where I might grab 5000 codes out of an available pool of 1 billion codes using the slice operator [:code_count] where code_count == 5000.
This works fine, but then for every insert, I have to run through each code and insert it into the record manually when I use the code.
Is there a better way?
Example code (omitting other attributes for each new_item that are similar to all new_items):
code_count = 5000
pool_cds = CodePool.objects.filter(free_indicator=True)[:code_count]

for pool_cd in pool_cds:
    new_item = Item.objects.create(
        pool_cd=pool_cd.unique_code,
    )
    new_item.save()

cursor = connection.cursor()
update_sql = 'update CodePool set free_ind=%s where pool_cd.id in %s'

instance_param = ()
# Create ridiculously long list of params (5000 items)
for pool_cd in pool_cds:
    instance_param = instance_param + (pool_cd.id,)

params = [False, instance_param]
rows = cursor.execute(update_sql, params)
As I understand it, this would work:
code_count = 5000
pool_cds = CodePool.objects.filter(free_indicator=True)[:code_count]

ids = []
for pool_cd in pool_cds:
    Item.objects.create(pool_cd=pool_cd.unique_code)
    ids += [pool_cd.id]

CodePool.objects.filter(id__in=ids).update(free_ind=False)
By the way, if you create an object using the queryset method create(), you don't need to call save(). See docs.
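If Item needs no per-row save() logic, a bulk_create sketch (field names assumed to match the question) can batch the inserts as well, issuing one statement, or a few batches, instead of 5000 individual ones:
items = []
ids = []
for pool_cd in pool_cds:
    items.append(Item(pool_cd=pool_cd.unique_code))
    ids.append(pool_cd.id)

Item.objects.bulk_create(items)
CodePool.objects.filter(id__in=ids).update(free_ind=False)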

How to improve performance when loading large data from the database and converting to JSON

I am writing a Django application which deals with financial data processing.
I have to load a large amount of data (more than 1,000,000 records) from a MySQL table and convert the records to JSON in a Django view, as follows:
trades = MtgoxTrade.objects.all()
data = []
for trade in trades:
    js = dict()
    js['time'] = trade.time
    js['price'] = trade.price
    js['amount'] = trade.amount
    js['type'] = trade.type
    data.append(js)
return data
The problem is that the for loop is very slow (it takes more than 9 seconds for 200,000 records). Is there any efficient way to convert DB records to JSON in Python?
Update: I have run the code from Mike Housky's answer in my environment (ActivePython 2.7, Win7) with the following code change and results:
def create_data(n):
    from api.models import MtgoxTrade
    result = MtgoxTrade.objects.all()
    return result
Build ............ 0.330999851227
For loop ......... 7.98400020599
List Comp. ....... 0.457000017166
Ratio ............ 0.0572394796312
For loop 2 ....... 0.381999969482
Ratio ............ 0.047845686326
You will find that the for loop takes about 8 seconds! And if I comment out the for loop, then the list comprehension takes about the same time instead:
Times:
Build ............ 0.343000173569
List Comp. ....... 7.57099986076
For loop 2 ....... 0.375999927521
My new question is whether the for loop touches the database, but I did not see any DB access in the log. So strange!
Here are several tips/things to try.
Since you need to make a JSON string from the queryset eventually, use Django's built-in serializers:
from django.core import serializers

data = serializers.serialize("json",
                             MtgoxTrade.objects.all(),
                             fields=('time', 'price', 'amount', 'type'))
You can make serialization faster by using ujson or simplejson modules. See SERIALIZATION_MODULES setting.
Also, instead of getting all the field values from the record, be explicit and get only what you need to serialize:
MtgoxTrade.objects.all().values('time','price','amount','type')
Also, you may want to use iterator() method of a queryset:
...For a QuerySet which returns a large number of objects that you
only need to access once, this can result in better performance and a
significant reduction in memory...
Also, you can split your huge queryset into batches, see: Batch querysets.
Also see:
Why is iterating through a large Django QuerySet consuming massive amounts of memory?
Memory efficient Django Queryset Iterator
django: control json serialization
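Putting the .values() and iterator() tips together, a rough sketch using plain json plus Django's DjangoJSONEncoder (which handles datetime and Decimal values) could look like this:
import json
from django.core.serializers.json import DjangoJSONEncoder

# fetch only the needed columns as dicts and stream them with iterator(),
# so the full table is never materialised as model instances
rows = MtgoxTrade.objects.values('time', 'price', 'amount', 'type').iterator()
data = json.dumps(list(rows), cls=DjangoJSONEncoder)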
You can use a list comprehension as that prevents many dict() and append() calls:
trades = MtgoxTrade.objects.all()
data = [{'time': trade.time, 'price': trade.price, 'amount': trade.amount, 'type': trade.type}
        for trade in trades]
return data
Function calls are expensive in Python so you should aim to avoid them in slow loops.
This answer is in support of Simeon Visser's observation. I ran the following code:
import gc, random, time

if "xrange" not in dir(__builtins__):
    xrange = range

class DataObject(object):
    def __init__(self, time, price, amount, type):
        self.time = time
        self.price = price
        self.amount = amount
        self.type = type

def create_data(n):
    result = []
    for index in xrange(n):
        s = str(index)
        result.append(DataObject("T"+s, "P"+s, "A"+s, "ty"+s))
    return result

def convert1(trades):
    data = []
    for trade in trades:
        js = dict()
        js['time'] = trade.time
        js['price'] = trade.price
        js['amount'] = trade.amount
        js['type'] = trade.type
        data.append(js)
    return data

def convert2(trades):
    data = [{'time': trade.time, 'price': trade.price, 'amount': trade.amount, 'type': trade.type}
            for trade in trades]
    return data

def convert3(trades):
    ndata = len(trades)
    data = ndata*[None]
    for index in xrange(ndata):
        t = trades[index]
        js = dict()
        js['time'] = t.time
        js['price'] = t.price
        js['amount'] = t.amount
        js['type'] = t.type
        #js = {"time" : t.time, "price" : t.price, "amount" : t.amount, "type" : t.type}
    return data

def main(n=1000000):
    t0s = time.time()
    trades = create_data(n)
    t0f = time.time()
    t0 = t0f - t0s

    gc.disable()

    t1s = time.time()
    jtrades1 = convert1(trades)
    t1f = time.time()
    t1 = t1f - t1s

    t2s = time.time()
    jtrades2 = convert2(trades)
    t2f = time.time()
    t2 = t2f - t2s

    t3s = time.time()
    jtrades3 = convert3(trades)
    t3f = time.time()
    t3 = t3f - t3s

    gc.enable()

    print("Times:")
    print(" Build ............ " + str(t0))
    print(" For loop ......... " + str(t1))
    print(" List Comp. ....... " + str(t2))
    print(" Ratio ............ " + str(t2/t1))
    print(" For loop 2 ....... " + str(t3))
    print(" Ratio ............ " + str(t3/t1))

main()
Results on Win7, Core 2 Duo 3.0GHz:
Python 2.7.3:
Times:
Build ............ 2.95600008965
For loop ......... 0.699999809265
List Comp. ....... 0.512000083923
Ratio ............ 0.731428890618
For loop 2 ....... 0.609999895096
Ratio ............ 0.871428659011
Python 3.3.0:
Times:
Build ............ 3.4320058822631836
For loop ......... 1.0200011730194092
List Comp. ....... 0.7500009536743164
Ratio ............ 0.7352942070195492
For loop 2 ....... 0.9500019550323486
Ratio ............ 0.9313733946208623
Those vary a bit, even with GC disabled (much more variance with GC enabled, but about the same results). The third conversion timing shows that a fair-sized chunk of the saved time comes from not calling .append() a million times.
Ignore the "For loop 2" times. This version has a bug and I am out of time to fix it for now.
First you have to check whether the performance loss happens while fetching the data from the database or inside the loop.
There is no real option that will give you a significant speedup here, not even the list comprehension noted above.
However, there is a huge difference in performance between Python 2 and 3.
A simple benchmark showed me that the for loop is roughly 2.5 times faster with Python 3.3 (using something like the following):
import time

ts = time.time()
data = list()
for i in range(1000000):
    d = {}
    d['a'] = 1
    d['b'] = 2
    d['c'] = 3
    d['d'] = 4
    d['a'] = 5
    data.append(d)
print(time.time() - ts)
/opt/python-3.3.0/bin/python3 foo2.py
0.5906929969787598
python2.6 foo2.py
1.74390792847
python2.7 foo2.py
0.673550128937
You will also note that there is a significant performance difference between Python 2.6 and 2.7.
I think it's worth trying a raw query against the database, because a Model adds a lot of extra boilerplate around its fields (I believe the fields are properties) and, as previously mentioned, function calls are expensive. See the documentation; there is an example at the bottom of the page that uses dictfetchall, which seems like the thing you are after.
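Roughly, that documentation example adapted here would look like the following sketch (the table name api_mtgoxtrade is an assumption based on the app and model names above):
from django.db import connection

def dictfetchall(cursor):
    # helper from the Django docs: return all rows from a cursor as a list of dicts
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

cursor = connection.cursor()
cursor.execute("SELECT time, price, amount, type FROM api_mtgoxtrade")
data = dictfetchall(cursor)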
You might want to look into the values method. It will return an iterable of dicts instead of model objects, so you don't have to create a lot of intermediate data structures. Your code could be reduced to
return MtgoxTrade.objects.values('time', 'price', 'amount', 'type')

Simple example of retrieving 500 items from dynamodb using Python

Looking for a simple example of retrieving 500 items from DynamoDB while minimizing the number of queries. I know there's a "multiget" function that would let me break this up into chunks of 50 queries, but I'm not sure how to do this.
I'm starting with a list of 500 keys. I'm then thinking of writing a function that takes this list of keys, breaks it up into "chunks," retrieves the values, stitches them back together, and returns a dict of 500 key-value pairs.
Or is there a better way to do this?
As a corollary, how would I "sort" the items afterwards?
Depending on your schema, there are two ways of efficiently retrieving your 500 items.
1. Items are under the same hash_key, using a range_key:
Use the query method with the hash_key.
You may ask to sort the range_keys A-Z or Z-A.
2. Items are on "random" keys:
You said it: use the BatchGetItem method.
Good news: the limit is actually 100 items per request or 1 MB max.
You will have to sort the results on the Python side.
On the practical side, since you use Python, I highly recommend the Boto library for low-level access or the dynamodb-mapper library for higher-level access (disclaimer: I am one of the core devs of dynamodb-mapper).
Sadly, neither of these libraries provides an easy way to wrap the batch_get operation. By contrast, there is a generator for scan and for query which 'pretends' you get everything in a single query.
In order to get optimal results with the batch query, I recommend this workflow:
Submit a batch with all of your 500 items.
Store the results in your dicts.
Re-submit with the UnprocessedKeys as many times as needed.
Sort the results on the Python side.
Quick example
I assume you have created a table "MyTable" with a single hash_key
import boto

# Helper function. This is more or less the code
# I added to the develop branch
def resubmit(batch, prev):
    # Empty (re-use) the batch
    del batch[:]

    # The batch answer contains the list of
    # unprocessed keys grouped by tables
    if 'UnprocessedKeys' in prev:
        unprocessed = prev['UnprocessedKeys']
    else:
        return None

    # Load the unprocessed keys
    for table_name, table_req in unprocessed.iteritems():
        table_keys = table_req['Keys']
        table = batch.layer2.get_table(table_name)

        keys = []
        for key in table_keys:
            h = key['HashKeyElement']
            r = None
            if 'RangeKeyElement' in key:
                r = key['RangeKeyElement']
            keys.append((h, r))

        attributes_to_get = None
        if 'AttributesToGet' in table_req:
            attributes_to_get = table_req['AttributesToGet']

        batch.add_batch(table, keys, attributes_to_get=attributes_to_get)

    return batch.submit()

# Main
db = boto.connect_dynamodb()
table = db.get_table('MyTable')
batch = db.new_batch_list()

keys = range(100)  # Get items from 0 to 99
batch.add_batch(table, keys)

res = batch.submit()
while res:
    print res  # Do some useful work here
    res = resubmit(batch, res)

# The END
EDIT:
I've added a resubmit() function to BatchList in the Boto develop branch. It greatly simplifies the workflow:
Add all of your requested keys to BatchList.
submit()
resubmit() as long as it does not return None.
This should be available in the next release.
