Create Excel-like SUMIFS in Pandas - python

I recently learned about pandas and was happy to see its analytics functionality. I am trying to convert Excel array functions into the Pandas equivalent to automate spreadsheets that I have created for the creation of performance attribution reports. In this example, I created a new column in Excel based on conditions within other columns:
={SUMIFS($F$10:$F$4518,$A$10:$A$4518,$C$4,$B$10:$B$4518,0,$C$10:$C$4518," ",$D$10:$D$4518,$D10,$E$10:$E$4518,$E10)}
The formula is summing up the values in the "F" array (security weights) based on certain conditions. "A" array (portfolio ID) is a certain number, "B" array (security id) is zero, "C" array (group description) is " ", "D" array (start date) is the date of the row that I am on, and "E" array (end date) is the date of the row that I am on.
In pandas, I am using a DataFrame. Creating a new column on the dataframe with the first three conditions is straightforward, but I am having difficulty with the last two conditions.
reportAggregateDF['PORT_WEIGHT'] = reportAggregateDF['SEC_WEIGHT_RATE'][
    (reportAggregateDF['PORT_ID'] == portID) &
    (reportAggregateDF['SEC_ID'] == 0) &
    (reportAggregateDF['GROUP_LIST'] == " ") &
    (reportAggregateDF['START_DATE'] == reportAggregateDF['START_DATE'].ix[:]) &
    (reportAggregateDF['END_DATE'] == reportAggregateDF['END_DATE'].ix[:])].sum()
Obviously the .ix[:] in the last two conditions is not doing anything for me, but is there a way to make the sum conditional on the row that I am on without looping? My goal is to not do any loops, but instead use purely vector operations.

You want to use the apply function and a lambda:
>>> df
       A  B    C    D     E
0  mitfx  0  200  300  0.25
1     gs  1  150  320  0.35
2    duk  1    5    2  0.45
3    bmo  1  145   65  0.65
Let's say I want to sum column C times E but only if column B == 1 and D is greater than 5:
df['matches'] = df.apply(lambda x: x['C'] * x['E'] if x['B'] == 1 and x['D'] > 5 else 0, axis=1)
df.matches.sum()
It might be cleaner to split this into two steps:
df_subset = df[(df.B == 1) & (df.D > 5)]
df_subset.apply(lambda x: x.C * x.E, axis=1).sum()
or simply to use multiplication for speed:
df_subset = df[(df.B == 1) & (df.D > 5)]
print(sum(df_subset.C * df_subset.E))
You are absolutely right to want to do this problem without loops.
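A further vectorized variant (a side note, not part of the answer above): since boolean masks behave as 1/0 in arithmetic, the filter can be folded directly into the product:
# Boolean mask times values: rows failing the conditions contribute 0 to the sum.
total = (df.C * df.E * ((df.B == 1) & (df.D > 5))).sum()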

I'm sure there is a better way, but this did it in a loop:
for idx, eachRecord in reportAggregateDF.iterrows():
    reportAggregateDF.loc[idx, 'PORT_WEIGHT'] = reportAggregateDF['SEC_WEIGHT_RATE'][
        (reportAggregateDF['PORT_ID'] == portID) &
        (reportAggregateDF['SEC_ID'] == 0) &
        (reportAggregateDF['GROUP_LIST'] == " ") &
        (reportAggregateDF['START_DATE'] == reportAggregateDF['START_DATE'].loc[idx]) &
        (reportAggregateDF['END_DATE'] == reportAggregateDF['END_DATE'].loc[idx])].sum()
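A loop-free version is also possible. The sketch below assumes the column names and the portID variable from the question: filter once on the three fixed conditions, sum SEC_WEIGHT_RATE per (START_DATE, END_DATE) pair, then look those sums up for every row's own dates, which is what the SUMIFS does row by row:
import pandas as pd

# Fixed conditions (the same for every row).
mask = ((reportAggregateDF['PORT_ID'] == portID) &
        (reportAggregateDF['SEC_ID'] == 0) &
        (reportAggregateDF['GROUP_LIST'] == " "))

# Sum of weights for each (start date, end date) pair among the matching rows.
sums = (reportAggregateDF[mask]
        .groupby(['START_DATE', 'END_DATE'])['SEC_WEIGHT_RATE']
        .sum())

# Broadcast the group sums back onto every row via its own dates.
keys = pd.MultiIndex.from_frame(reportAggregateDF[['START_DATE', 'END_DATE']])
reportAggregateDF['PORT_WEIGHT'] = sums.reindex(keys).fillna(0).to_numpy()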

Related

Pandas groupby - Apply conditions on specific groups

I have to implement a pandas groupby operation which is more difficult than the usual simple aggregates I do. The table I'm working with has the following structure:
  category  price
0        A     89
1        A     58
2      ...    ...
3        B     75
4        B    120
5      ...    ...
6        C     90
7        C    199
8      ...    ...
As shown above, my example DataFrame consists of 3 categories A, B, and C (the real DataFrame I'm working on has ~1000 categories). We will assume that category A has 20 rows and categories B and C have more than 100 rows. These are denoted by the 3 dots (...).
I would like to calculate the average price of each category with the following conditions:
If the number of elements in the category is greater than 100 (i.e., B and C in this example), then the average should be calculated while excluding values that are more than 3 standard deviations away from the mean within that category.
Otherwise, for the categories that have fewer than 100 elements (i.e., A in this example), the average should be calculated on the entire group, without any exclusion criteria.
Calculating the average price for each category without any condition on the groups is straightforward: df.groupby("category").agg({"price": "mean"}), but I'm stuck with the extra conditions here.
I also always try to provide a reproducible example while asking questions here but I don't know how to properly write one for this problem with fake data. I hope this format is still ok.
Maybe you can do it like this?
df.groupby('category')['price'].apply(
    lambda x: np.mean(x) if len(x) <= 100
    else np.mean(x[(x >= np.mean(x) - 3*np.std(x))
                   & (x <= np.mean(x) + 3*np.std(x))]))
Or without NumPy (though the NumPy version usually runs faster):
df.groupby('category')['price'].apply(
    lambda x: x.mean() if len(x) <= 100
    else x[(x >= x.mean() - 3*x.std())
           & (x <= x.mean() + 3*x.std())].mean())
I'm not sure if you will be able to do all of this at once. Try to break down the steps, like this:
Identify the number of elements per category:
df_elements = df.groupby('category').agg({'price':'count'}).reset_index()
df_elements.rename({'price':'n_elements'}, inplace=True,axis=1)
Identify if the number of elements is less than 100 or greater than 100 and then perform the appropriate average calculation:
aux = []
for cat in df_elements.category.unique():
    if df_elements[df_elements.category == cat]['n_elements'].iloc[0] < 100:
        df_aux = df[df.category == cat].groupby('category').agg({'price': 'mean'})
        aux.append(df_aux.reset_index())
    else:
        std_cat = df[df.category == cat]['price'].std()
        mean_cat = df[df.category == cat]['price'].mean()
        th = 3 * std_cat
        df_cut = df[(df.category == cat) & (df.price <= mean_cat + th) & (df.price >= mean_cat - th)]
        df_aux = df_cut.groupby('category').agg({'price': 'mean'})
        aux.append(df_aux.reset_index())

final_df = pd.concat(aux, axis=0)
final_df.rename({'price': 'avg_price'}, axis=1, inplace=True)
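The question also mentions not knowing how to build a reproducible example for this. A small sketch (the category sizes and price distribution are made up here) that generates fake data with one small and two large categories, suitable for testing either answer:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Arbitrary sizes: 'A' stays under 100 rows, 'B' and 'C' go over it.
sizes = {'A': 20, 'B': 150, 'C': 300}
df = pd.concat(
    [pd.DataFrame({'category': cat, 'price': rng.normal(100, 25, n)})
     for cat, n in sizes.items()],
    ignore_index=True)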

Pandas for loop with 2 parameters and changing column values

I have 2 columns in a dataframe, one named "day_test" and one named "Temp Column". Some of my values in Temp Column are negative, and I want them to be 1 or 2. I've made a for loop with 2 if statements:
for (i, j) in zip(df['day_test'].astype(int), df['Temp Column'].astype(int)):
    if i == 2 and j < 0:
        j = 2
    if i == 1 and j < 0:
        j = 1
I tried printing j so I know the loops are working properly, but the values that I want to change in the dataframe are staying negative.
Thanks
Your code doesn't change the values inside the dataframe; it only rebinds the temporary variable j on each iteration.
One way to do it is this:
df['day_test'] = df['day_test'].astype(int)
df['Temp Column'] = df['Temp Column'].astype(int)
df.loc[(df['day_test']==1) & (df['Temp Column']<0),'Temp Column'] = 1
df.loc[(df['day_test']==2) & (df['Temp Column']<0),'Temp Column'] = 2
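Another option (a sketch, not part of the answer above) is Series.mask, which replaces values wherever a condition holds and leaves the rest untouched:
day = df['day_test'].astype(int)
temp = df['Temp Column'].astype(int)

# Replace negative temps with 1 or 2 depending on day_test; keep everything else.
df['Temp Column'] = (temp.mask((day == 1) & (temp < 0), 1)
                         .mask((day == 2) & (temp < 0), 2))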

Is there a way to format cells in a data frame based on conditions on multiple columns?

I have a dataframe with a few columns (one boolean and one numeric). I want to apply conditional formatting using pandas styling, since I am going to output my dataframe as HTML in an email, based on the following conditions: 1. the boolean column is Y and 2. the numeric column is > 0.
For example,
col1 col2
Y 15
N 0
Y 0
N 40
Y 20
In the example above, I want to highlight the first and last row since they meet those conditions.
Yes, there is a way. Use lambda expressions to apply the conditions and the dropna() function to exclude None/NaN values:
df["col2"] = df["col2"].apply(lambda x:x if x > 0 else None)
df["col1"] = df["col1"].apply(lambda x:x if x == 'Y' else None)
df.dropna()
I used the following and it worked:
def highlight_color(s):
    if s.Col2 > 0 and s.Col1 == "N":
        return ['background-color: red'] * 7
    else:
        return ['background-color: white'] * 7

df.style.apply(highlight_color, axis=1).render()
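The hard-coded * 7 presumably matches the number of columns in that author's real dataframe. A sketch adapted to the example's col1/col2 names and conditions, sized to however many columns the frame actually has:
def highlight_row(s):
    # s is one row; return one CSS string per column in that row.
    hit = s['col1'] == 'Y' and s['col2'] > 0
    color = 'background-color: red' if hit else 'background-color: white'
    return [color] * len(s)

html = df.style.apply(highlight_row, axis=1).to_html()  # .render() on older pandas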

Replacing value based on conditional pandas

How do you replace the value of a cell in a dataframe based on a condition applied to the entire data frame, not just one column? I have tried to use df.where, but this doesn't work as planned:
df = df.where(operator.and_(df > (-1 * .2), df < 0),0)
df = df.where(df > 0 , df * 1.2)
Basically, what I'm trying to do here is replace all values between -0.2 and 0 with zero across all columns in my dataframe, and multiply all values greater than zero by 1.2.
You've misunderstood the way pandas.where works: it keeps the values of the original object where the condition is true and replaces them otherwise, so you need to reverse your logic:
df = df.where((df <= (-1 * .2)) | (df >= 0), 0)
df = df.where(df <= 0 , df * 1.2)
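A minimal demo of that keep/replace behaviour (the numbers here are made up):
import pandas as pd

df = pd.DataFrame({'x': [-0.3, -0.1, 0.0, 0.5]})

# Keep values outside (-0.2, 0); replace the rest with 0.
df = df.where((df <= -0.2) | (df >= 0), 0)

# Keep non-positive values; multiply positive values by 1.2.
df = df.where(df <= 0, df * 1.2)

print(df)
#      x
# 0 -0.3
# 1  0.0
# 2  0.0
# 3  0.6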
where allows you to have a one-line solution, which is great. I prefer to use a mask like so.
idx = (df < 0) & (df >= -0.2)
df[idx] = 0
I prefer breaking this into two lines because, using this method, it is easier to read. You could force this onto a single line as well.
df[(df < 0) & (df >= -0.2)] = 0
Just another option.

pandas: Is it possible to filter a dataframe with arbitrarily long boolean criteria?

If you know exactly how you want to filter a dataframe, the solution is trivial:
df[(df.A == 1) & (df.B == 1)]
But what if you are accepting user input and do not know beforehand how many criteria the user wants to use? For example, the user wants a filtered data frame where columns [A, B, C] == 1. Is it possible to do something like:
def filterIt(*args, value):
    return df[(df.*args == value)]
so if the user calls filterIt(A, B, C, value=1), it returns:
df[(df.A == 1) & (df.B == 1) & (df.C == 1)]
I think the most elegant way to do this is using df.query(), where you can build up a string with all your conditions, e.g.:
import pandas as pd
import numpy as np

cols = {}
for col in ('A', 'B', 'C', 'D', 'E'):
    cols[col] = np.random.randint(1, 5, 20)
df = pd.DataFrame(cols)

def filter_df(df, filter_cols, value):
    conditions = []
    for col in filter_cols:
        conditions.append('{c} == {v}'.format(c=col, v=value))
    query_expr = ' and '.join(conditions)
    print('querying with: {q}'.format(q=query_expr))
    return df.query(query_expr)
Example output (your results may differ due to the randomly generated data):
filter_df(df, ['A', 'B'], 1)
querying with: A == 1 and B == 1
    A  B  C  D  E
6   1  1  1  2  1
11  1  1  2  3  4
Here is another approach. It's cleaner, more performant, and has the advantage that columns can be empty (in which case the entire data frame is returned).
def filter(df, value, *columns):
    return df.loc[df.loc[:, list(columns)].eq(value).all(axis=1)]
Explanation
values = df.loc[:, list(columns)] selects only the columns we are interested in.
masks = values.eq(value) gives a boolean data frame indicating equality with the target value.
mask = masks.all(axis=1) applies an AND across columns (returning an index mask). Note that you can use masks.any(axis=1) for an OR.
return df.loc[mask] applies index mask to the data frame.
Demo
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 2, (100, 3)), columns=list('ABC'))
# both columns
assert np.all(filter(df, 1, 'A', 'B') == df[(df.A == 1) & (df.B == 1)])
# no columns
assert np.all(filter(df, 1) == df)
# different values per column
assert np.all(filter(df, [1, 0], 'A', 'B') == df[(df.A == 1) & (df.B == 0)])
Alternative
For a small number of columns (< 5), the following solution, based on steven's answer, is more performant than the above, although less flexible. As-is, it will not work for an empty columns set, and will not work using different values per column.
from functools import reduce
from operator import and_

def filter(df, value, *columns):
    return df.loc[reduce(and_, (df[column] == value for column in columns))]
Retrieving a Series object by key (df[column]) is significantly faster than constructing a DataFrame object around a subset of columns (df.loc[:, columns]).
In [4]: %timeit df['A'] == 1
100 loops, best of 3: 17.3 ms per loop
In [5]: %timeit df.loc[:, ['A']] == 1
10 loops, best of 3: 48.6 ms per loop
Nevertheless, this speedup becomes negligible when dealing with a larger number of columns. The bottleneck becomes ANDing the masks together, for which reduce(and_, ...) is far slower than the Pandas builtin all(axis=1).
Thanks for the help guys. I came up with something similar to Marius after finding out about df.query():
def makeQuery(cols, equivalence=True, *args):
    operator = ' == ' if equivalence else ' != '
    query = ''
    for arg in args:
        for col in cols:
            query = query + "({}{}{})".format(col, operator, arg) + ' & '
    return query[:-3]

query = makeQuery(['A', 'B', 'C'], False, 1, 2)
Contents of query is a string:
(A != 1) & (B != 1) & (C != 1) & (A != 2) & (B != 2) & (C != 2)
that can be passed to df.query(query)
This is pretty messy but it seems to work.
import operator
from functools import reduce

import pandas as pd

def filterIt(value, args):
    stuff = [getattr(b, thing) == value for thing in args]
    return reduce(operator.and_, stuff)

a = {'A': [1, 2, 3], 'B': [2, 2, 2], 'C': [3, 2, 1]}
b = pd.DataFrame(a)

filterIt(2, ['A', 'B', 'C'])
0 False
1 True
2 False
dtype: bool
(b.A == 2) & (b.B == 2) & (b.C ==2)
0 False
1 True
2 False
dtype: bool
