Aggregate and calculate median of arrayfield in django queryset - python

I'm wondering if this is possible in a more efficient way.
I have a dataset in PostgreSQL that is structured like this:
Year, Sitename, Array (length = 4500)
For example:
1982, DANC, array([2,3,4,5,6,7,...])
1982, ANCH, array([5,6,4,3,5,7,...])
1983, DANC, array([3,3,4,6,3,6,...])
1983, ANCH, array([8,8,5,4,3,2,...])
What I want to do is add up the arrays element-wise (across rows) for each year, e.g.:
1982 1982 1982
DANC ANCH TOT
2 5 7
3 6 9
4 4 8
5 3 8
6 5 11
7 7 14
... ... ...
My Django model looks like this:
class Abundance(models.Model):
    abundance_id = models.AutoField(primary_key=True)
    site = models.ForeignKey('Site')
    season = models.SmallIntegerField()
    samples = ArrayField(models.DecimalField(blank=True, decimal_places=3, max_digits=30))

    def __unicode__(self):
        return self.site
The following code in my Views.py works:
import numpy as np
import bottleneck as bn
...
def testview(request):
    s = ["ACUN","BRDM"]
    quants = []
    medians = []
    for yr in range(1982,2015):
        # one query per year: fetch the sample arrays of the selected sites
        X = Abundance.objects.values_list('samples').filter(site__site_id__in = s).filter(season = yr)
        h = np.matrix(np.array(X,dtype=float))
        i = h.sum(axis=0)  # element-wise sum across sites
        m = bn.median(i)
        up = np.percentile(i,95)
        down = np.percentile(i,5)
        qlist = [yr, round(down,3), round(up,3)]
        mlist = [yr, round(m,3)]
        quants.append(qlist)
        medians.append(mlist)
    return JsonResponse({'quants':quants, 'medians':medians})
However, the above code is very slow - especially when drawing many sites. I have tried playing with .aggregate() but I've not found a good solution.
Thanks in advance

You can probably use .aggregate() to push some of the load down to Postgres, but I think one of the bigger speed problems here is the Decimal field. It offers the highest precision, but it's also one of the more expensive types for Python to move in and out of.
That said, I'm not sure there's a quick way to get the percentiles out of the DB call, but the sums and medians you can easily push down to the DB via the Django ORM. The others (percentiles, etc.) can probably be pushed down as well, but you'll be delving into custom aggregates for Django (https://docs.djangoproject.com/en/1.9/ref/models/expressions/#creating-your-own-aggregate-functions). If you're going to go that far, it might be worth checking out something like aldjemy (https://github.com/Deepwalker/aldjemy/) and converting the entire query over to SQLAlchemy so you have maximum control over it.
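Short of pushing everything into the database, one cheaper step might be to fetch all years in a single query and convert each Decimal to float only once. The sketch below is only an illustration of that idea: it assumes the rows arrive as (season, samples) pairs (for example from values_list('season', 'samples')), uses plain Python in place of NumPy and bottleneck, and the helper names are made up for this example.

```python
from decimal import Decimal
from statistics import median


def sum_samples_by_year(rows):
    """Add the sample arrays element-wise for each year.

    rows: iterable of (season, samples) pairs (assumed shape of the query result).
    """
    totals = {}
    for season, samples in rows:
        values = [float(v) for v in samples]  # convert Decimal to float once
        if season in totals:
            totals[season] = [a + b for a, b in zip(totals[season], values)]
        else:
            totals[season] = values
    return totals


def percentile(ordered, q):
    """Percentile of an already sorted list, using linear interpolation."""
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def summarize(rows):
    """Build the same 'quants' and 'medians' lists as the view in the question."""
    quants, medians = [], []
    for season, total in sorted(sum_samples_by_year(rows).items()):
        ordered = sorted(total)
        quants.append([season,
                       round(percentile(ordered, 5), 3),
                       round(percentile(ordered, 95), 3)])
        medians.append([season, round(median(ordered), 3)])
    return {'quants': quants, 'medians': medians}


# The four example rows from the question, truncated to six samples each.
rows = [
    (1982, [Decimal(v) for v in (2, 3, 4, 5, 6, 7)]),
    (1982, [Decimal(v) for v in (5, 6, 4, 3, 5, 7)]),
    (1983, [Decimal(v) for v in (3, 3, 4, 6, 3, 6)]),
    (1983, [Decimal(v) for v in (8, 8, 5, 4, 3, 2)]),
]
result = summarize(rows)
```

Whether this is fast enough for 4500-element arrays and many sites would need measuring; the point is only that the per-year loop of queries is not required to get the same numbers.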

Related

How to filter Django objects based on value returned by a method?

I have a Django object with a method get_volume_sum(self) that returns a float, and I would like to query the top n objects with the highest value. How can I do that?
For example I could do a loop like this but I would like a more elegant solution.
vol = []
obj = []
for m in Market.objects.filter(**args): # 3000 objects
    sum = m.get_volume_sum()
    vol.append(sum)
    obj.append(m.id)
top = search_top_n(obj, vol, n)
And this is what the method looks like:
# return sum of volume over last hours
def get_volume_sum(self, hours):
    return Candle.objects.filter(market=self,
                                 dt__gte=timezone.now()-timedelta(hours=hours)
                                 ).aggregate(models.Sum('vo'))
From what I see here even with Python there isn't a single line solution.
You should not filter with the method: this will result in an N+1 problem. For 3'000 Market objects, it will generate an additional 3'000 queries to obtain the volumes.
You can do this in bulk with a .annotate(…) [Django-doc]:
from django.db.models import Sum
hours = 12 # some value for hours
Market.objects.filter(
    **args,
    candle__dt__gte=timezone.now()-timedelta(hours=hours),
).annotate(
    candle_vol=Sum('candle__vo')
).order_by('-candle_vol')
Here there is however a small caveat: if there is no related Candle, then these Markets will be filtered out. We can prevent that by allowing also Markets without Candles with:
from django.db.models import Q, Sum
hours = 12 # some value for hours
Market.objects.filter(
    Q(candle__dt__gte=timezone.now()-timedelta(hours=hours)) |
    Q(candle=None),
    **args
).annotate(
    candle_vol=Sum('candle__vo')
).order_by('-candle_vol')
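For completeness: if the volumes are already in Python lists, as in the question's loop that fills obj and vol, the undefined search_top_n helper could be sketched with heapq.nlargest from the standard library. This is only a guess at what that helper was meant to do, and the database-side query is still the better option for 3'000 markets.

```python
import heapq


def search_top_n(ids, volumes, n):
    """Return the n (id, volume) pairs with the largest volume, highest first."""
    return heapq.nlargest(n, zip(ids, volumes), key=lambda pair: pair[1])


# Made-up ids and volumes, purely for illustration.
top = search_top_n([1, 2, 3, 4], [10.0, 50.0, 20.0, 40.0], 2)
```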

How to make a model in django without predefined attributes?

I am a little bit confused about Django models, as the title says. I also cannot find any posts about this on Google.
Let's say in the normal way, we will have a model like this:
class something(models.Model):
    fName = models.TextField()
    lName = models.TextField()
    ......
    lastThings = models.TextField()
However, I don't want a model like this. I want a model with no predefined attributes. In other words, I can put anything into this model. My thought is: can I use a loop or something similar to create such a model?
class someModel(models.Model):
    for i in numberOfModelField:
        field[j] = i
        j+=1
This is table A to read:
A B C
1 2 3
2 3 4
This is table B to read:
A B C D E F G G G
1 2 3 4 5 6 7 8 9
...............
4 5 3 2 4 5 6 4 3
And so on: different kinds of tables can be read.
That way, I would have a model that fits any case. I am not sure whether this is clear enough for you to understand my confusion. Thank you.
To expand on my comment (put as an answer so I can format the code decently).
class something(models.Model):
    sheet_name = models.TextField()
    row = models.TextField()
    col = models.TextField()
    cell_value = models.TextField()

    class Meta:
        # unique_together takes field names as strings
        unique_together = [['sheet_name', 'row', 'col']]
Once you have the values in this format you can do what you want with them. If you know the first row is always headers you could define a header table keyed on sheet_name and col, and map them to header_name as well, or you could just take them from this table.
There's probably better ways of handling this, and I'm still not sure of your use case. If this is loading data temporarily to use in other processes, then this should be fine. If it's to populate some new database for use indefinitely, then you need to spend more time defining the actual tables, though this process would be OK as an intermediate staging area just to get the data out of excel.
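To make the staging idea concrete, here is a rough sketch of how a table could be flattened into one record per cell, with the header row kept as a separate mapping keyed on sheet_name and col. It is plain Python with dictionaries standing in for model instances; the function name and the choice of 1-based positional row and col values are assumptions for illustration.

```python
def table_to_cells(sheet_name, rows):
    """Flatten a table (header row first) into one record per cell.

    Columns are identified by position, so repeated header names
    (like the three G columns in table B) do not collide.
    """
    header, *body = rows
    headers = {(sheet_name, str(col)): name
               for col, name in enumerate(header, start=1)}
    cells = [
        {'sheet_name': sheet_name, 'row': str(r), 'col': str(c),
         'cell_value': str(value)}
        for r, row in enumerate(body, start=1)
        for c, value in enumerate(row, start=1)
    ]
    return headers, cells


# Table A from the question.
table_a = [['A', 'B', 'C'], [1, 2, 3], [2, 3, 4]]
headers, cells = table_to_cells('table_a', table_a)
```

Each dictionary in cells has the same four keys as the model fields, so it could be passed to the model constructor as keyword arguments.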

Calculating Variable Cash-flow IRR in Python (pandas)

I have a DataFrame of unpredictable cashflows and unpredictable period lengths, and I need to generate a backward-looking IRR.
Doing it in Excel is pretty straightforward using the solver, wondering if there's a good way to pull it off in Python. (I think I could leverage openpyxl to get solver to work in excel from python, but that feels unnecessarily cumbersome).
The problem is pretty straightforward:
NPV of Cash Flow = ((cash_flow)/(1+IRR)^years_ago)
GOAL: Find IRR where SUM(NPV) = 0
My dataframe looks something like this:
cash_flow |years_ago
-----------------------
-3.60837e+06 |4.09167
31462 |4.09167
1.05956e+06 |3.63333
-1.32718e+06 |3.28056
-4.46554e+06 |3.03889
It seems as though other IRR calculators (such as numpy.irr) assume strict period cutoffs (every 3 months, 1 year, etc), which won't work. The other option seems to be the iterative route, where I continually guess, check, and iterate, but that feels like the wrong way to tackle this. Ideally, I'm looking for something that would do this:
irr = calc_irr((cash_flow1,years_ago1),(cash_flow2,years_ago2),etc)
EDIT: Here is the code I'm running the problem from. I have a list of transactions, and I've chosen to create temporary tables by id.
for id in df_tran.id.unique():
    temp_df = df_tran[df_tran.id == id]
    cash_flow = temp_df.cash_flows.values
    years = temp_df.years.values
    print(id, cash_flow)
    print(years)
    #irr_calc = irr(cfs=cash_flow, yrs=years,x0=0.100000)
    #print(sid, irr_calc)
where df_tran (which temp_df is based on) looks like:
cash_flow |years |id
0 -3.60837e+06 4.09167 978237
1 31462 4.09167 978237
4 1.05956e+06 3.63333 978237
6 -1.32718e+06 3.28056 978237
8 -4.46554e+06 3.03889 978237
10 -3.16163e+06 2.81944 978237
12 -5.07288e+06 2.58889 978237
14 268833 2.46667 978237
17 -4.74703e+06 1.79167 978237
20 -964987 1.40556 978237
22 -142920 1.12222 978237
24 163894 0.947222 978237
26 -2.2064e+06 0.655556 978237
27 1.23804e+06 0.566667 978237
29 180655 0.430556 978237
30 -85297 0.336111 978237
34 -2.3529e+07 0.758333 1329483
36 21935 0.636111 1329483
38 -3.55067e+06 0.366667 1329483
41 -4e+06 4.14167 1365051
temp_df looks identical to df_tran, except it only holds transactions for a single id.
You can use scipy.optimize.fsolve:
Return the roots of the (non-linear) equations defined by func(x) = 0
given a starting estimate.
First define the function that will be the func parameter to fsolve. This is NPV as a result of your IRR, cash flows, and years. (Vectorize with NumPy.)
import numpy as np
def npv(irr, cfs, yrs):
    return np.sum(cfs / (1. + irr) ** yrs)
An example:
cash_flow = np.array([-2., .5, .75, 1.35])
years = np.arange(4)
# A guess
print(npv(irr=0.10, cfs=cash_flow, yrs=years))
0.0886551465064
Now to use fsolve:
from scipy.optimize import fsolve
def irr(cfs, yrs, x0):
    # fsolve returns a length-1 array; np.asscalar is gone from recent NumPy releases
    return float(fsolve(npv, x0=x0, args=(cfs, yrs))[0])
Your IRR is:
print(irr(cfs=cash_flow, yrs=years, x0=0.10))
0.12129650313214262
And you can confirm that this gets you to a 0 NPV:
res = irr(cfs=cash_flow, yrs=years, x0=0.10)
print(np.allclose(npv(res, cash_flow, years), 0.))
True
All code together:
import numpy as np
from scipy.optimize import fsolve
def npv(irr, cfs, yrs):
    return np.sum(cfs / (1. + irr) ** yrs)
def irr(cfs, yrs, x0, **kwargs):
    # fsolve returns a length-1 array; np.asscalar is gone from recent NumPy releases
    return float(fsolve(npv, x0=x0, args=(cfs, yrs), **kwargs)[0])
To make this compatible with your pandas example, just use
cash_flow = df.cash_flow.values
years = df.years_ago.values
Update: the values in your question seem a bit nonsensical (your IRR is going to be some astronomical number if it even exists) but here is how you'd run:
cash_flow = np.array([-3.60837e+06, 31462, 1.05956e+06, -1.32718e+06, -4.46554e+06])
years_ago = np.array([4.09167, 4.09167, 3.63333, 3.28056, 3.03889])
print(irr(cash_flow, years_ago, x0=0.10, maxfev=10000))
1.3977721900669127e+82
Second update: there are a couple minor typos in your code, and your actual flows of $ and timing work out to nonsensical IRRs, but here's what you're looking to do, below. For instance, notice you have one id with one single negative transaction, a negatively infinite IRR.
for i, df in df_tran.groupby('id'):
    cash_flow = df.cash_flow.values
    years = df.years.values
    print('id:', i, 'irr:', irr(cash_flow, years, x0=0.))
id: 978237 irr: 347.8254979851405
id: 1329483 irr: 3.2921314448062817e+114
id: 1365051 irr: 1.0444951674872467e+25
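Since fsolve will happily return a huge number when no sensible root exists, it may help to check first whether the NPV changes sign over a plausible range of rates. The sketch below swaps in plain bisection (not fsolve, and not what the answer above uses) with the same NPV formula from the question; the function name irr_bisect and the default bracket of -99% to 1000% are arbitrary choices for illustration.

```python
def npv(rate, cash_flows, years_ago):
    """Sum of cash_flow / (1 + rate) ** years_ago, as defined in the question."""
    return sum(cf / (1.0 + rate) ** t for cf, t in zip(cash_flows, years_ago))


def irr_bisect(cash_flows, years_ago, lo=-0.99, hi=10.0, tol=1e-12, max_iter=200):
    """Return a rate in [lo, hi] where NPV is zero, or None if none is bracketed."""
    f_lo = npv(lo, cash_flows, years_ago)
    f_hi = npv(hi, cash_flows, years_ago)
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    if f_lo * f_hi > 0:
        return None  # same sign at both ends: no root is bracketed
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        f_mid = npv(mid, cash_flows, years_ago)
        if f_mid == 0.0 or (hi - lo) / 2.0 < tol:
            return mid
        if f_lo * f_mid < 0:
            hi = mid                 # root lies in the lower half
        else:
            lo, f_lo = mid, f_mid    # root lies in the upper half
    return (lo + hi) / 2.0


# The small example from the answer, and the single-transaction id from the question.
rate = irr_bisect([-2.0, 0.5, 0.75, 1.35], [0, 1, 2, 3])
no_rate = irr_bisect([-4e6], [4.14167])
```

For the small example this lands on the same rate as fsolve (about 0.1213), and for the id with a single negative transaction it returns None instead of an astronomically large value. Bisection only finds one root inside the bracket, so cash flows with several sign changes still need care.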

Inexpensive way to add time series intensity in python pandas dataframe

I am trying to sum (and plot) a total from functions which change states at different times using Python's Pandas.DataFrame. For example:
Suppose we have 3 people whose states can be a) holding nothing, b) holding a 5 pound weight, and c) holding a 10 pound weight. Over time, these people pick weights up and put them down. I want to plot the total amount of weight being held. So, given:
My brute force attempt:
import pandas as ps
import math
import numpy as np
person1=[3,0,10,10,10,10,10]
person2=[4,0,20,20,25,25,40]
person3=[5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['count','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDfNoCount=allPeopleDf[['start1', 'end1', 'start2', 'end2', 'start3','end3']]
uniqueTimes=sorted(ps.unique(allPeopleDfNoCount.values.ravel()))
possibleStates=[-1,0,1,2] #extra state 0 for initialization
stateData={}
comboStates={}
#initialize dict to add up all of the stateData
for time in uniqueTimes:
    comboStates[time]=0.0
allPeopleDf['track']=-1
allPeopleDf['status']=-1
numberState=len(possibleStates)
starti=-1
endi=0
startState=0
for i in range(3):
    starti=starti+2
    print starti
    endi=endi+2
    for time in uniqueTimes:
        def helper(row):
            start=row[starti]
            end=row[endi]
            track=row[7]
            if start <= time and time < end:
                return possibleStates[i+1]
            else:
                return possibleStates[0]
        def trackHelp(row):
            status=row[8]
            track=row[7]
            if track<=status:
                return status
            else:
                return track
        def Multiplier(row):
            x=row[8]
            if x==0:
                return 0.0*row[0]
            if x==1:
                return 5.0*row[0]
            if x==2:
                return 10.0*row[0]
            if x==-1:#numeric place holder for non-contributing
                return 0.0*row[0]
        allPeopleDf['status']=allPeopleDf.apply(helper,axis=1)
        allPeopleDf['track']=allPeopleDf.apply(trackHelp,axis=1)
        stateData[time]=allPeopleDf.apply(Multiplier,axis=1).sum()
    for k,v in stateData.iteritems():
        comboStates[k]=comboStates.get(k,0)+v
print allPeopleDf
print stateData
print comboStates
Plots of weight being held over time might look like the following:
And the sum of the intensities over time might look like the black line in the following:
with the black line defined with the Cartesian points: (0,0 lbs),(5,0 lbs),(5,5 lbs),(15,5 lbs),(15,10 lbs),(20,10 lbs),(20,15 lbs),(25,15 lbs),(25,20 lbs),(40,20 lbs). However, I'm flexible and don't necessarily need to define the combined intensity line as a set of Cartesian points. The unique times can be found with:
print list(set(uniqueTimes).intersection(allNoCountT[1].values.ravel())).sort()
but I can't come up with a slick way of getting the corresponding intensity values.
I started out with a very ugly function to break apart each "person's" graph so that all people had start and stop times (albeit many stop and start times without state change) at the same time, and then I could add up all the "chunks" of time. This was cumbersome; there has to be a slick pandas way of handling this. If anyone can offer a suggestion or point me to another SO like that I might have missed, I'd appreciate the help!
In case my simplified example isn't clear, another might be plotting the intensity of sound coming from a piano: there are many notes being played for different durations with different intensities. I would like the sum of intensity coming from the piano over time. While my example is simplistic, I need a solution that is more on the scale of a piano song: thousands of discrete intensity levels per key, and many keys contributing over the course of a song.
Edit--Implementation of mgab's provided solution:
import pandas as ps
import math
import numpy as np
person1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
person2=['person2',4,0,20,20,25,25,40]
person3=['person3',5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['id','intensity','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDf=ps.melt(allPeopleDf,id_vars=['intensity','id'])
allPeopleDf.columns=['intensity','id','timeid','time']
df=ps.DataFrame(allPeopleDf).drop('timeid',1)
df[df.id=='person1'].drop('id',1) #easier to visualize one id for check
df['increment']=df.groupby('id')['intensity'].transform( lambda x: x.sub(x.shift(), fill_value= 0 ))
TypeError: unsupported operand type(s) for -: 'str' and 'int'
End Edit
Going for the piano keys example, let's assume you have three keys, with 30 levels of intensity.
I would try to keep the data in this format:
import pandas as pd
df = pd.DataFrame([[10,'A',5],
[10,'B',7],
[13,'C',10],
[15,'A',15],
[20,'A',7],
[23,'C',0]], columns=["time", "key", "intensity"])
time key intensity
0 10 A 5
1 10 B 7
2 13 C 10
3 15 A 15
4 20 A 7
5 23 C 0
where you record every change in intensity of any of the keys. From here you can already get the Cartesian coordinates for each individual key as (time,intensity) pairs
df[df.key=="A"].drop(columns='key')
time intensity
0 10 5
3 15 15
4 20 7
Then, you can easily create a new column increment that will indicate the change in intensity that occurred for that key at that time point (intensity indicates just the new value of intensity)
df["increment"]=df.groupby("key")["intensity"].transform(
    lambda x: x.sub(x.shift(), fill_value= 0 ))
df
time key intensity increment
0 10 A 5 5
1 10 B 7 7
2 13 C 10 10
3 15 A 15 10
4 20 A 7 -8
5 23 C 0 -10
And then, using this new column, you can generate the (time, total_intensity) pairs to use as Cartesian coordinates
df.groupby("time")["increment"].sum().cumsum()
time
10 12
13 22
15 32
20 24
23 14
dtype: int64
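The same increment-then-cumulative-sum idea can be written without pandas, which may make the logic easier to follow. This is just an illustrative sketch over (time, key, intensity) records; the function name total_intensity is made up for the example.

```python
from collections import defaultdict


def total_intensity(events):
    """Return (time, total_intensity) pairs from (time, key, intensity) records.

    Each record gives the new intensity of one key at that time.
    """
    current = {}                  # last known intensity per key
    change_at = defaultdict(int)  # net change of the total at each time point
    for time, key, intensity in sorted(events, key=lambda e: e[0]):
        change_at[time] += intensity - current.get(key, 0)
        current[key] = intensity
    total = 0
    result = []
    for time in sorted(change_at):
        total += change_at[time]  # running sum of the increments
        result.append((time, total))
    return result


# Same six records as the DataFrame above.
events = [(10, 'A', 5), (10, 'B', 7), (13, 'C', 10),
          (15, 'A', 15), (20, 'A', 7), (23, 'C', 0)]
totals = total_intensity(events)
```

With these records, totals holds the same five (time, total) pairs as the cumulative sum printed above.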
EDIT: applying the specific data presented in question
Assuming the data comes as a list of values, starting with the element id (person/piano key), then a factor multiplying the measured weight/intensities for this element, and then pairs of time values indicating the start and end of a series of known states (weight being carried/intensity being emitted). Not sure if I got the data format right. From your question:
data1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
data2=['person2',4,0,20,20,25,25,40]
data3=['person3',5,0,5,5,15,15,40]
And if we know the weight/intensity of each one of the states, we can define:
known_states = [5, 10, 15]
DF_columns = ["time", "id", "intensity"]
Then, the easiest way I came up with to load the data includes this function:
import pandas as pd
def read_data(data, states, columns):
    id = data[0]
    factor = data[1]
    reshaped_data = []
    for i in range(len(states)):
        j = 2+2*i  # index of the start time of state i; its end time is at j+1
        if not data[j] == data[j+1]:
            reshaped_data.append([data[j], id, factor*states[i]])
            reshaped_data.append([data[j+1], id, -1*factor*states[i]])
    return pd.DataFrame(reshaped_data, columns=columns)
Notice that the if not data[j] == data[j+1]: avoids loading data to the dataframe when start and end times for a given state are equal (seems uninformative, and wouldn't appear in your plots anyway). But take it out if you still want these entries.
Then, you load the data:
# DataFrame.append is gone from recent pandas releases, so concatenate instead
df = pd.concat([read_data(data1, known_states, DF_columns),
                read_data(data2, known_states, DF_columns),
                read_data(data3, known_states, DF_columns)],
               ignore_index=True)
# and so on...
And then you're right at the beginning of this answer (substituting 'key' by 'id' and the ids, of course)
Appears to be what .sum() is for:
In [10]:
allPeopleDf.sum()
Out[10]:
aStart 0
aEnd 35
bStart 35
bEnd 50
cStart 50
cEnd 90
dtype: int32

Pandas bottleneck, quicker way of slicing?

On an 8 core, 14GB instance, a job similar to this one took ~ 2 weeks to complete and cost a chunk of change, hence any help with a speed up will be greatly appreciated.
I have an SQL table with ~ 6.6 million rows, two columns, integers in each. The integers denote pandas data-frame locations (bear with me, populating these frame locations is purely to take off some processing time, really not making a dent though):
The integers go up to 26000, and for every integer we look forward in ranges 5-250:
it_starts it_ends
1 5
2 6
...
25,996 26000
...
...
1 6
2 7
...
25,995 26000
...
...
1 7
2 8
...
25,994 26000
If that's not clear enough, the tables were populated with something like this:
chunks = range(5,250)
for chunk_size in chunks:
    x = np.array(range(1,len(df)-chunk_size))
    y = [k+chunk_size for k in x]
    rng_tup = zip(x,y)
    #do table inserts
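For reference, the same (start, end) pairs can be produced lazily by a generator that mirrors the population loop above. This is only a sketch of that loop, with iter_ranges and n_rows as made-up names; whether skipping the table actually saves meaningful time here has not been measured.

```python
def iter_ranges(n_rows, chunk_sizes=range(5, 250)):
    """Yield (start, end) pairs in the same order as the population loop."""
    for chunk_size in chunk_sizes:
        for start in range(1, n_rows - chunk_size):
            yield start, start + chunk_size


# Tiny example: a frame with 10 rows only admits chunk sizes 5 to 8.
pairs = list(iter_ranges(10))
```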
I use this table, as I have said, to take slices from a pandas dataframe, with the following:
rngs = c.execute("SELECT it_starts,it_ends FROM rngtab").fetchall()
for k in rngs:
    it_sts=k[0]
    it_end=k[1]
    fms = df_frame[it_sts:it_end]
Where I have used the following pandas code for 'df_frame', and db is the database in question:
with db:
    sqla = "SELECT ctime, Date, Time, high, low FROM quote_meta"
    df = psql.read_sql(sqla, db)
    df_frame = df.iloc[:,[0,1,2,3,4]]
Hence putting it all together for clarity:
import sqlite3
import pandas.io.sql as psql
import pandas as pd
db = sqlite3.connect("HIST.db")
c = db.cursor()
c.execute("PRAGMA synchronous = OFF")
c.execute("PRAGMA journal_mode = OFF")
with db:
    sqla = "SELECT ctime, Date, Time, high, low FROM quote_meta"
    df = psql.read_sql(sqla, db)
    df_frame = df.iloc[:,[0,1,2,3,4]]
rngs = c.execute("SELECT it_starts,it_ends FROM rngtab").fetchall()
for k in rngs:
    it_sts=k[0]
    it_end=k[1]
    fms = df_frame[it_sts:it_end]
    for i in range(0,len(fms)):
        #perform trivial (ish) operations on slice,
        #trivial compared to the overall time that is.
        pass
So as you probably guessed, 'df_frame[it_sts:it_end]' is causing a massive bottleneck as it needs to create ~ 6m slices (*40 separate databases in total). Hence I think it's wise to invest a little time here in asking the question: before I throw more money at it, am I making any cardinal errors here? Is there anything anyone can suggest as a speed up? Thanks.
