I have a DataFrame in which a column of codes ("D Code") grows significantly every day, and these codes have to be converted into object descriptions, for which I am doing something like the following:
product = []
beacon = []
count = []
c_start = time.time()
for i, v in df["D Code"].iteritems():
    product.append(Product.objects.get(short_code=v[:2]).description)  # how to optimize this?
    beacon.append("RFID")
    count.append(v[-5:])
c_end = time.time()
print("D Code loop time ", c_end - c_start)
Initially, when there were fewer rows, it ran in no time, but as the data grew, making a separate database call for every code became too slow. Is there a more efficient Django method to loop over a list and get the values?
The df['D Code'] column looks something like this:
['TRRFF.1T22AD0029',
'TRRFF.1T22AD0041',
'TRRFF.1T22AD0009',
'TRRFF.1T22AD0032',
'TRRFF.1T22AD0028',
'TRRFF.1T22AD0026',
'TRRFF.1T22AD0040',
'HTRFF.1T22AD0003',
'TRRFF.1T22AD0048',
'PPRFP.1T22AD0017',
'TRRFF.1T22AD0047',
'TRRFF.1T22AD0005',
'TRRFF.1T22AD0033',
'TRRFF.1T22AD0024',
'TRRFF.1T22AD0042'],
You can create a lookup dict with just one query. Then use that dict to find your description.
description_dict = {}
for p in Product.objects.values('short_code', 'description'):
    description_dict[p['short_code']] = p['description']

for i, v in df["D Code"].iteritems():
    product.append(description_dict[v[:2]])
    ...
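Equivalently, the lookup can be built in a single pass with a dict comprehension (same single query, just more compact):

description_dict = {
    p['short_code']: p['description']
    for p in Product.objects.values('short_code', 'description')
}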
I am looking for a way to visualize, for lack of a better word, the "density" or "heatmap" of some synthetic time series I have created.
I have a loop that creates a list holding the values of one time series. I don't think it matters, but just in case, here is the code of what's going on. This is a Markov process, so for each i, which represents the hour, I create a new value depending on the previous i and state:
for x in range(10000):
    start_h = 0
    start_s = 1
    generated_values_list = []
    for i in range(start_h, 120):
        if i >= 24:
            i = i % 24
        print(str(start_s) + " | " + str(i))
        pot_value_list = GMM_vals_container_workingdays_spring["State: " + str(start_s) + ", hour: " + str(i)]
        if len(pot_value_list) > 50:
            actual_value = random.choice(pot_value_list)
            #cdf, gmm_x, gmm = GMM_erstellen(pot_value_list, 50)
            #actual_value = gmm.sample()[0][0][0]
            #print("made by GMM")
        else:
            actual_value = random.choice(pot_value_list)
            #print("made not by GMM")
        generated_values_list.append(actual_value)
        probabilities_next_state = TPMs_WD[i][start_s - 1]
        next_state = random.choices(states, weights=probabilities_next_state)
        start_s = next_state[0]
    plt.plot(generated_values_list)
But, I think, the only part that matters is this:
for x in range(10000):
    # some code that creates generated_values_list
    plt.plot(generated_values_list)
This creates, as expected, a picture like this:
It is not clear from this which paths are the most common, so I would like frequently hit values to be drawn more colorfully while infrequent values stay rather grey.
I think the seaborn library has something for that, but I don't seem to understand the docs.
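For what it's worth, one common trick for this kind of density effect (a sketch of my own, not from the original post) is to draw every series with a very low alpha, so that frequently traversed paths accumulate into stronger color while rare ones stay faint:

import matplotlib.pyplot as plt

for x in range(10000):
    # ... code that creates generated_values_list ...
    plt.plot(generated_values_list, color="tab:blue", alpha=0.01, linewidth=0.5)
plt.show()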
I have a for loop that cycles through and creates 3 data frames and stores them in a dictionary. From each of these data frames, I would like to be able to create another dictionary, but I can't figure out how to do this.
Here is the repetitive code without the loop:
Trad = allreports2[allreports2['Trad'].notna()]
Alti = allreports2[allreports2['Alti'].notna()]
Alto = allreports2[allreports2['Alto'].notna()]
Trad_dict = dict(zip(Trad.State, Trad.Position))
Alti_dict = dict(zip(Alti.State, Alti.Position))
Alto_dict = dict(zip(Alto.State, Alto.Position))
As stated earlier, I understand how to make the 3 dataframes by storing them in a dictionary, and I understand what needs to go on the right side of the equals sign in the second statement in the for loop, but not what goes on the left side (denoted below as XXXXXXXXX).
Routes = ['Trad', 'Alti', 'Alto']
dfd = {}
for route in Routes:
    dfd[route] = allreports2[allreports2[route].notna()]
    XXXXXXXXX = dict(zip(dfd[route].State, dfd[route].Position))
(Please note: I am very new to Python and teaching myself so apologies in advance!)
This compromises readability somewhat, but it should work.
Routes = ['Trad', 'Alti', 'Alto']
dfd, output = {}, {}  # two separate dicts
for route in Routes:
    dfd[route] = allreports2[allreports2[route].notna()]
    output[route] = dict(zip(dfd[route].State, dfd[route].Position))
Trad_dict, Alti_dict, Alto_dict = list(output.values())  # unpack the three dicts
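One caveat on that final unpack (my note, not part of the original answer): it relies on dict insertion order matching Routes, which holds in Python 3.7+. Indexing by key is more explicit:

Trad_dict, Alti_dict, Alto_dict = (output[r] for r in Routes)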
Reference: How can I get list of values from dict?
I have a ticker that grabs current information of multiple elements and adds it to a list in the format: trade_list.append([[trade_id, results]]).
Say we're tracking trade_ids 4555, 5555, and 23232: the trade_list will keep ticking away, adding their results to the list, and I then want to find the averages of their results individually.
The code works as such:
# find accounts
for a in accounts:
    # find open trades of the account
    for t in range(len(trades)):
        # do some math
        trades_list.append([trade_id, result])
        avernum = 0
        average = []
        for r in range(len(trades_list)):
            average.append(trades_list[r][1])  # the value attached to the trade_id
            avernum += 1
        results = float(sum(average) / avernum)
        results_list.append([[trade_id, results]])
This fills out really quickly. This is after two ticks:
print(results_list)
[[[53471, 28.36432]], [[53477, 31.67835]], [[53474, 32.27664]], [[52232, 1908.30604]], [[52241, 350.4758]], [[53471, 28.36432]], [[53477, 31.67835]], [[53474, 32.27664]], [[52232, 1908.30604]], [[52241, 350.4758]]]
These averages will move and change very quickly. I want to use results_list to track and watch them, then compare previous averages to current ones.
Thinking:
for r in range(len(results_list)):
    if results_list[r][0] == trade_id:
        restick.append(results_list[r][1])
resnum = len(restick)
if resnum > 1 and restick[resnum - 1] > restick[resnum - 2]:  # latest vs. previous
    # do fancy things
Here is some short code that does what I think you have described, although I might have misunderstood. You basically do exactly what you say: select everything that has a certain trade_id and return its average:
TID_INDEX = 0
DATA_INDEX = 1

def id_average(t_id, arr):
    filt_arr = [i[DATA_INDEX] for i in arr if i[TID_INDEX] == t_id]
    return sum(filt_arr) / len(filt_arr)
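For example (note this assumes flat [trade_id, result] pairs, one level less nesting than the [[...]] shape shown in the question):

trades = [[53471, 28.36432], [53477, 31.67835], [53471, 30.0]]
print(id_average(53471, trades))  # (28.36432 + 30.0) / 2 = 29.18216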
I need some help getting my brain around designing an (efficient) Markov chain in Spark (via Python). I've written it as best I could, but the code I came up with doesn't scale. Basically, for the various map stages I wrote custom functions, and they work fine for sequences of a couple thousand, but when we get to 20,000+ (and I've got some up to 800k) things slow to a crawl.
For those of you not familiar with Markov models, this is the gist of it.
This is my data. I've got the actual data (no header) in an RDD at this point.
ID, SEQ
500, HNL, LNH, MLH, HML
We look at sequences in tuples, so
(HNL, LNH), (LNH,MLH), etc..
And I need to get to this point, where I return a dictionary (for each row of data) that I then serialize and store in an in-memory database.
{500:
    {'HNLLNH': 0.333,
     'LNHMLH': 0.333,
     'MLHHML': 0.333,
     'LNHHNL': 0.000,
     etc..
    }
}
So in essence, each sequence is combined with the next (HNL, LNH becomes 'HNLLNH'); then, for all possible transitions (combinations of sequences), we count their occurrences and divide by the total number of transitions (3 in this case) to get their frequency of occurrence.
There were 3 transitions above, and one of those was HNLLNH, so for HNLLNH: 1/3 = 0.333.
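For concreteness, a minimal pure-Python sketch of that computation for one row (the Spark version is what's being asked for below):

from collections import Counter

seq = ["HNL", "LNH", "MLH", "HML"]
pairs = ["".join(p) for p in zip(seq, seq[1:])]  # ['HNLLNH', 'LNHMLH', 'MLHHML']
n = len(pairs)                                   # 3 transitions
freqs = {k: v / n for k, v in Counter(pairs).items()}
# {'HNLLNH': 0.333..., 'LNHMLH': 0.333..., 'MLHHML': 0.333...}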
As a side note, and I'm not sure if it's relevant, the values for each position in a sequence are limited: 1st position (H/M/L), 2nd position (M/L), 3rd position (H/M/L).
What my code previously did was collect() the RDD and map it a couple of times using functions I wrote. Those functions first turned the string into a list, then merged list[1] with list[2], then list[2] with list[3], then list[3] with list[4], etc., so I ended up with something like this:
[HNLLNH],[LNHMLH],[MHLHML], etc..
Then the next function created a dictionary out of that list, using each list item as a key, counting the total occurrences of that key in the full list, and dividing by len(list) to get the frequency. I then wrapped that dictionary in another dictionary, along with its ID number (resulting in the 2nd code block above).
Like I said, this worked well for small-ish sequences, but not so well for lists with a length of 100k+.
Also, keep in mind, this is just one row of data. I have to perform this operation on anywhere from 10-20k rows of data, with rows of data varying between lengths of 500-800,000 sequences per row.
Any suggestions on how I can write pyspark code (using the API map/reduce/agg/etc.. functions) to do this efficiently?
EDIT
Code as follows. It probably makes sense to start at the bottom. Please keep in mind I'm learning this (Python and Spark) as I go, and I don't do this for a living, so my coding standards are not great.
def f(x):
    # Custom RDD map function
    # Combines two separate transactions
    # into a single transition state
    cust_id = x[0]
    trans = ','.join(x[1])
    y = trans.split(",")
    s = ''
    for i in range(len(y) - 1):
        s = s + str(y[i] + str(y[i+1])) + ","
    return str(cust_id + ',' + s[:-1])

def g(x):
    # Custom RDD map function
    # Calculates the transition state probabilities
    # by adding up state-transition occurrences
    # and dividing by total transitions
    cust_id = str(x.split(",")[0])
    trans = x.split(",")[1:]
    temp_list = []
    middle = int((len(trans[0]) + 1) / 2)
    for i in trans:
        temp_list.append((''.join(i)[:middle], ''.join(i)[middle:]))
    state_trans = {}
    for i in temp_list:
        state_trans[i] = temp_list.count(i) / (len(temp_list))
    my_dict = {}
    my_dict[cust_id] = state_trans
    return my_dict

def gen_tsm_dict_spark(lines):
    # Takes RDD/string input with format CUST_ID(or)PROFILE_ID,SEQ,SEQ,SEQ....
    # Returns RDD of dict with CUST_ID and tsm per customer
    # i.e. {cust_id : { ('NLN', 'LNN') : 0.33, ('HPN', 'NPN') : 0.66}
    # creates a tuple ([cust/profile_id], [SEQ,SEQ,SEQ])
    cust_trans = lines.map(lambda s: (s.split(",")[0], s.split(",")[1:]))
    with_seq = cust_trans.map(f)
    full_tsm_dict = with_seq.map(g)
    return full_tsm_dict
def main():
    result = gen_tsm_dict_spark(my_rdd)
    # Insert into DB
    for x in result.collect():
        for k, v in x.iteritems():
            db_insert(k, v)
You can try something like the code below. It depends heavily on toolz, but if you prefer to avoid external dependencies you can easily replace it with some standard Python libraries.
from __future__ import division

from collections import Counter
from itertools import product
from toolz.curried import sliding_window, map, pipe, concat
from toolz.dicttoolz import merge

# Generate all possible transitions
defaults = sc.broadcast(dict(map(
    lambda x: ("".join(concat(x)), 0.0),
    product(product("HNL", "NL", "HNL"), repeat=2))))

rdd = sc.parallelize(["500, HNL, LNH, NLH, HNL", "600, HNN, NNN, NNN, HNN, LNH"])

def process(line):
    """
    >>> process("000, HHH, LLL, NNN")
    ('000', {'LLLNNN': 0.5, 'HHHLLL': 0.5})
    """
    bits = line.split(", ")
    transactions = bits[1:]
    n = len(transactions) - 1
    frequencies = pipe(
        sliding_window(2, transactions),  # Get all transitions
        map(lambda p: "".join(p)),        # Join strings
        Counter,                          # Count
        lambda cnt: {k: v / n for (k, v) in cnt.items()}  # Get frequencies
    )
    return bits[0], frequencies

def store_partition(iter):
    for (k, v) in iter:
        db_insert(k, merge([defaults.value, v]))

rdd.map(process).foreachPartition(store_partition)
Since you know all possible transitions I would recommend using a sparse representation and ignore zeros. Moreover you can replace dictionaries with sparse vectors to reduce memory footprint.
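A minimal sketch of that idea (my illustration; the transition-to-index mapping is an assumption, reusing the broadcast defaults from above):

from pyspark.mllib.linalg import SparseVector

# Fix an ordering of all possible transitions once
index_of = {t: i for i, t in enumerate(sorted(defaults.value))}
size = len(index_of)

def to_sparse(freqs):
    # Keep only the non-zero transition frequencies
    return SparseVector(size, {index_of[k]: v for k, v in freqs.items()})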
You can achieve this result using pure PySpark; I did it that way myself.
To create the frequencies, let's say you have already gotten this far, and these are your input RDDs:
ID, SEQ
500, [HNL, LNH, MLH, HML ...]
and to get frequencies for pairs like (HNL, LNH), (LNH, MLH), ...
inputRDD.map(lambda kv: get_frequencies(kv[1])).flatMap(lambda x: x) \
    .reduceByKey(lambda v1, v2: v1 + v2)
def get_frequencies(states_list):
    """
    :param states_list: a list of customer states.
    :return: state frequencies list.
    """
    rest = []
    tuples_list = []
    for idx in range(0, len(states_list)):
        if idx + 1 < len(states_list):
            tuples_list.append((states_list[idx], states_list[idx+1]))
    unique = set(tuples_list)
    for value in unique:
        rest.append((value, tuples_list.count(value)))
    return rest
and you will get results like
((HNL, LNH), 98), ((LNH, MLH), 458), ...
After this you may convert the result RDDs into DataFrames, or you can insert directly into the DB using the RDD's mapPartitions.
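For that last step, a minimal sketch of the direct insert via mapPartitions (my illustration; db_insert is the helper assumed in the question):

def insert_partition(rows):
    # rows is an iterator of ((state1, state2), count) tuples
    for key, count in rows:
        db_insert(key, count)
    return iter([])  # mapPartitions must return an iterable

result_rdd.mapPartitions(insert_partition).count()  # an action forces the lazy insert to run

foreachPartition avoids the dummy return value if you don't need a result back.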
I am writing a Django application which deals with financial data processing.
I have to load a large amount of data (more than 1,000,000 records) from a MySQL table and convert the records to JSON data in Django views, as follows:
trades = MtgoxTrade.objects.all()
data = []
for trade in trades:
    js = dict()
    js['time'] = trade.time
    js['price'] = trade.price
    js['amount'] = trade.amount
    js['type'] = trade.type
    data.append(js)
return data
The problem is that the for loop is very slow (it takes more than 9 seconds for 200,000 records). Is there any efficient way to convert DB records to JSON-format data in Python?
Updated: I have run the code from Mike Housky's answer in my environment (ActivePython 2.7, Win7), with the code changed as follows, and got these results:
def create_data(n):
    from api.models import MtgoxTrade
    result = MtgoxTrade.objects.all()
    return result
Build ............ 0.330999851227
For loop ......... 7.98400020599
List Comp. ....... 0.457000017166
Ratio ............ 0.0572394796312
For loop 2 ....... 0.381999969482
Ratio ............ 0.047845686326
You will find the for loop takes about 8 seconds! And if I comment out the for loop, then the list comprehension also takes about that long:
Times:
Build ............ 0.343000173569
List Comp. ....... 7.57099986076
For loop 2 ....... 0.375999927521
My new question is whether the for loop touches the database. But I did not see any DB access log. So strange!
Here are several tips/things to try.
Since you need to make a JSON-string from the queryset eventually, use django's built-in serializers:
from django.core import serializers
data = serializers.serialize("json",
                             MtgoxTrade.objects.all(),
                             fields=('time','price','amount','type'))
You can make serialization faster by using the ujson or simplejson modules. See the SERIALIZATION_MODULES setting.
Also, instead of getting all the field values from the record, be explicit and get only what you need to serialize:
MtgoxTrade.objects.all().values('time','price','amount','type')
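Combining the two ideas, a quick sketch (default=str is just a simple way to cope with datetime/Decimal fields; ujson or simplejson would slot in similarly):

import json

rows = list(MtgoxTrade.objects.values('time', 'price', 'amount', 'type'))
payload = json.dumps(rows, default=str)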
Also, you may want to use the iterator() method of a queryset:
...For a QuerySet which returns a large number of objects that you
only need to access once, this can result in better performance and a
significant reduction in memory...
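For example, a sketch combining iterator() with the list comprehension approach from the other answer:

data = [{'time': t.time, 'price': t.price, 'amount': t.amount, 'type': t.type}
        for t in MtgoxTrade.objects.all().iterator()]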
Also, you can split your huge queryset into batches, see: Batch querysets.
Also see:
Why is iterating through a large Django QuerySet consuming massive amounts of memory?
Memory efficient Django Queryset Iterator
django: control json serialization
You can use a list comprehension as that prevents many dict() and append() calls:
trades = MtgoxTrade.objects.all()
data = [{'time': trade.time, 'price': trade.price, 'amount': trade.amount, 'type': trade.type}
        for trade in trades]
return data
Function calls are expensive in Python so you should aim to avoid them in slow loops.
This answer is in support of Simeon Visser's observation. I ran the following code:
import gc, random, time

if "xrange" not in dir(__builtins__):
    xrange = range

class DataObject(object):
    def __init__(self, time, price, amount, type):
        self.time = time
        self.price = price
        self.amount = amount
        self.type = type

def create_data(n):
    result = []
    for index in xrange(n):
        s = str(index)
        result.append(DataObject("T"+s, "P"+s, "A"+s, "ty"+s))
    return result

def convert1(trades):
    data = []
    for trade in trades:
        js = dict()
        js['time'] = trade.time
        js['price'] = trade.price
        js['amount'] = trade.amount
        js['type'] = trade.type
        data.append(js)
    return data

def convert2(trades):
    data = [{'time': trade.time, 'price': trade.price, 'amount': trade.amount, 'type': trade.type}
            for trade in trades]
    return data

def convert3(trades):
    ndata = len(trades)
    data = ndata*[None]
    for index in xrange(ndata):
        t = trades[index]
        js = dict()
        js['time'] = t.time
        js['price'] = t.price
        js['amount'] = t.amount
        js['type'] = t.type
        #js = {"time" : t.time, "price" : t.price, "amount" : t.amount, "type" : t.type}
    return data

def main(n=1000000):
    t0s = time.time()
    trades = create_data(n)
    t0f = time.time()
    t0 = t0f - t0s

    gc.disable()

    t1s = time.time()
    jtrades1 = convert1(trades)
    t1f = time.time()
    t1 = t1f - t1s

    t2s = time.time()
    jtrades2 = convert2(trades)
    t2f = time.time()
    t2 = t2f - t2s

    t3s = time.time()
    jtrades3 = convert3(trades)
    t3f = time.time()
    t3 = t3f - t3s

    gc.enable()

    print("Times:")
    print("  Build ............ " + str(t0))
    print("  For loop ......... " + str(t1))
    print("  List Comp. ....... " + str(t2))
    print("  Ratio ............ " + str(t2/t1))
    print("  For loop 2 ....... " + str(t3))
    print("  Ratio ............ " + str(t3/t1))

main()
Results on Win7, Core 2 Duo 3.0GHz:
Python 2.7.3:
Times:
Build ............ 2.95600008965
For loop ......... 0.699999809265
List Comp. ....... 0.512000083923
Ratio ............ 0.731428890618
For loop 2 ....... 0.609999895096
Ratio ............ 0.871428659011
Python 3.3.0:
Times:
Build ............ 3.4320058822631836
For loop ......... 1.0200011730194092
List Comp. ....... 0.7500009536743164
Ratio ............ 0.7352942070195492
For loop 2 ....... 0.9500019550323486
Ratio ............ 0.9313733946208623
Those vary a bit, even with GC disabled (much more variance with GC enabled, but about the same results). The third conversion timing shows that a fair-sized chunk of the saved time comes from not calling .append() a million times.
Ignore the "For loop 2" times. This version has a bug and I am out of time to fix it for now.
First you have to check if the performance loss happens while fetching the data from the database or inside the loop.
There is no real option for giving you a significant speedup - also not using a list comprehension as noticed above.
However there is a huge difference in performance between Python 2 and 3.
A simple benchmark showed me that the for loop is roughly 2.5 times faster with Python 3.3 (using a simple benchmark like the following):
import time

ts = time.time()
data = list()
for i in range(1000000):
    d = {}
    d['a'] = 1
    d['b'] = 2
    d['c'] = 3
    d['d'] = 4
    d['a'] = 5
    data.append(d)
print(time.time() - ts)
/opt/python-3.3.0/bin/python3 foo2.py
0.5906929969787598
python2.6 foo2.py
1.74390792847
python2.7 foo2.py
0.673550128937
You will also note that there is a significant performance difference between Python 2.6 and 2.7.
I think it's worth trying a raw query against the database, because a Model adds a lot of extra boilerplate to fields (I believe fields are properties) and, as previously mentioned, function calls are expensive. See the documentation; there is an example at the bottom of the page that uses dictfetchall, which seems like what you are after.
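A sketch of that approach (the table name api_mtgoxtrade is a guess based on the api app in the question; adjust it to your schema):

from django.db import connection

def dictfetchall(cursor):
    # Return all rows from a cursor as a list of dicts (adapted from the Django docs)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

with connection.cursor() as cursor:
    cursor.execute("SELECT time, price, amount, type FROM api_mtgoxtrade")
    data = dictfetchall(cursor)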
You might want to look into the values method. It will return an iterable of dicts instead of model objects, so you don't have to create a lot of intermediate data structures. Your code could be reduced to
return MtgoxTrade.objects.values('time', 'price', 'amount', 'type')