I have to implement a pandas groupby operation which is more difficult than the usual simple aggregates I do. The table I'm working with has the following structure:
  category  price
0        A     89
1        A     58
2      ...    ...
3        B     75
4        B    120
5      ...    ...
6        C     90
7        C    199
8      ...    ...
As shown above, my example DataFrame has 3 categories A, B, and C (the real DataFrame I'm working with has ~1000 categories). Assume that category A has 20 rows and that categories B and C each have more than 100 rows; the omitted rows are denoted by the three dots (...).
I would like to calculate the average price of each category with the following conditions:
If a category has more than 100 elements (i.e., B and C in this example), the average should be calculated after excluding values that are more than 3 standard deviations away from that category's mean.
Otherwise (categories with 100 elements or fewer, i.e., A in this example), the average should be calculated over the entire group, without any exclusion criteria.
Calculating the average price for each category without any condition on the groups is straightforward: df.groupby("category").agg({"price": "mean"}), but I'm stuck on the extra conditions here.
I always try to provide a reproducible example when asking questions here; the best I could come up with for this problem is the fake-data snippet below. I hope this format is still OK.
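The row counts and the price range are invented, just to mirror the shape shown above:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the example is reproducible
sizes = {"A": 20, "B": 150, "C": 200}  # invented row counts per category
df = pd.concat(
    [pd.DataFrame({"category": cat, "price": rng.integers(50, 200, size=n)})
     for cat, n in sizes.items()],
    ignore_index=True,
)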
Maybe you can do it like this?
import numpy as np

df.groupby('category')['price'].apply(
    lambda x: np.mean(x) if len(x) <= 100
    else np.mean(x[(x >= np.mean(x) - 3*np.std(x))
                   & (x <= np.mean(x) + 3*np.std(x))]))
Or without numpy (the numpy version is usually a bit faster; note also that np.std defaults to ddof=0 while pandas' .std() uses ddof=1, so the two cutoffs can differ slightly):
df.groupby('category')['price'].apply(
    lambda x: x.mean() if len(x) <= 100
    else x[(x >= x.mean() - 3*x.std())
           & (x <= x.mean() + 3*x.std())].mean())
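Either way, the result is a Series indexed by category. If a plain DataFrame with a named column is more convenient, a small follow-up step could look like this (the avg_price name is just a suggestion, and the condition is written in an equivalent compact form):

avg = df.groupby('category')['price'].apply(
    lambda x: x.mean() if len(x) <= 100
    else x[(x - x.mean()).abs() <= 3 * x.std()].mean())
result = avg.rename('avg_price').reset_index()  # columns: category, avg_price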
I'm not sure you can do all of this at once. Try breaking it down into steps, like this:
Identify the number of elements per category:
df_elements = df.groupby('category').agg({'price': 'count'}).reset_index()
df_elements.rename({'price': 'n_elements'}, axis=1, inplace=True)
Check whether each category has fewer than 100 elements, then perform the appropriate average calculation:
aux = []
for cat in df_elements.category.unique():
    # number of rows in this category (grab the scalar, not a Series)
    n_elements = df_elements.loc[df_elements.category == cat, 'n_elements'].iloc[0]
    if n_elements < 100:
        df_aux = df[df.category == cat].groupby('category').agg({'price': 'mean'})
        aux.append(df_aux.reset_index())
    else:
        std_cat = df[df.category == cat]['price'].std()
        mean_cat = df[df.category == cat]['price'].mean()
        th = 3 * std_cat
        df_cut = df[(df.category == cat)
                    & (df.price <= mean_cat + th)
                    & (df.price >= mean_cat - th)]
        df_aux = df_cut.groupby('category').agg({'price': 'mean'})
        aux.append(df_aux.reset_index())

final_df = pd.concat(aux, axis=0)
final_df.rename({'price': 'avg_price'}, axis=1, inplace=True)
I have 2 columns in a DataFrame, one named "day_test" and one named "Temp Column". Some of the values in Temp Column are negative, and I want to replace them with 1 or 2 depending on day_test. I've made a for loop with 2 if statements:
for i, j in zip(df['day_test'].astype(int), df['Temp Column'].astype(int)):
    if i == 2 and j < 0:
        j = 2
    if i == 1 and j < 0:
        j = 1
I tried printing j so I know the loops are working properly, but the values that I want to change in the dataframe are staying negative.
Thanks
Your code doesn't change the values inside the DataFrame; it only rebinds the loop variable j, which holds a copy of the value.
One way to do it is this:
df['day_test'] = df['day_test'].astype(int)
df['Temp Column'] = df['Temp Column'].astype(int)
df.loc[(df['day_test']==1) & (df['Temp Column']<0),'Temp Column'] = 1
df.loc[(df['day_test']==2) & (df['Temp Column']<0),'Temp Column'] = 2
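If day_test only ever holds the values 1 and 2, as in the question, the two assignments can also be collapsed into a single aligned assignment; just a sketch of the same idea:

# wherever Temp Column is negative, take the value from day_test (aligned by index)
df.loc[df['Temp Column'] < 0, 'Temp Column'] = df['day_test']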
I have a DataFrame with a few columns (one boolean-like and one numeric). I am going to output the DataFrame as HTML in an email, so I want to apply conditional formatting with pandas styling and highlight rows based on the following conditions: 1. the boolean column = Y, and 2. the numeric column > 0.
For example,
col1  col2
   Y    15
   N     0
   Y     0
   N    40
   Y    20
In the example above, I want to highlight the first and last row since they meet those conditions.
Yes, there is a way. Use lambda expressions to apply the conditions and dropna() to exclude the None/NaN values (note that dropna() returns a new frame, so assign the result):
df["col2"] = df["col2"].apply(lambda x: x if x > 0 else None)
df["col1"] = df["col1"].apply(lambda x: x if x == 'Y' else None)
df = df.dropna()
I used the following and it worked:
def highlight_color(s):
    if s.Col2 > 0 and s.Col1 == "N":
        return ['background-color: red'] * 7
    else:
        return ['background-color: white'] * 7

df.style.apply(highlight_color, axis=1).render()
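A slightly more general variant of the same idea (keeping the Col1/Col2 names from the snippet above) avoids hard-coding the column count, since the * 7 has to match the number of columns in the frame:

def highlight_color(s):
    # s is one row; the returned list must contain one CSS string per column
    color = 'red' if s.Col2 > 0 and s.Col1 == "N" else 'white'
    return ['background-color: {}'.format(color)] * len(s)

# Styler.to_html() on newer pandas; .render() on older versions
html = df.style.apply(highlight_color, axis=1).to_html()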
How do you replace values in a DataFrame based on a condition applied to the entire DataFrame, not just a single column? I have tried to use df.where, but it doesn't work as planned:
import operator

df = df.where(operator.and_(df > (-1 * .2), df < 0), 0)
df = df.where(df > 0, df * 1.2)
Basically, what I'm trying to do here is replace all values between -0.2 and 0 with zero across all columns of my DataFrame, and multiply all values greater than zero by 1.2.
You've misunderstood how pandas where works: it keeps the values of the original object where the condition is True and replaces them otherwise, so you can simply reverse your logic:
df = df.where((df <= -0.2) | (df >= 0), 0)
df = df.where(df <= 0, df * 1.2)
where allows you to have a one-line solution, which is great. I prefer to use a mask like so.
idx = (df < 0) & (df >= -0.2)
df[idx] = 0
I prefer breaking this into two lines because it is easier to read, but you could force it onto a single line as well:
df[(df < 0) & (df >= -0.2)] = 0
Just another option.
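For completeness, the multiply-by-1.2 half of the question can be written in the same mask style; a quick sketch:

pos = df > 0             # mask of the values to scale
df[pos] = df[pos] * 1.2  # only the positive entries are touched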
If you know exactly how you want to filter a dataframe, the solution is trivial:
df[(df.A == 1) & (df.B == 1)]
But what if you are accepting user input and do not know beforehand how many criteria the user wants to use? For example, the user wants a filtered data frame where columns [A, B, C] == 1. Is it possible to do something like:
def filterIt(*args, value):
    return df[(df.*args == value)]
so if the user calls filterIt(A, B, C, value=1), it returns:
df[(df.A == 1) & (df.B == 1) & (df.C == 1)]
I think the most elegant way to do this is using df.query(), where you can build up a string with all your conditions, e.g.:
import pandas as pd
import numpy as np

cols = {}
for col in ('A', 'B', 'C', 'D', 'E'):
    cols[col] = np.random.randint(1, 5, 20)
df = pd.DataFrame(cols)

def filter_df(df, filter_cols, value):
    conditions = []
    for col in filter_cols:
        conditions.append('{c} == {v}'.format(c=col, v=value))
    query_expr = ' and '.join(conditions)
    print('querying with: {q}'.format(q=query_expr))
    return df.query(query_expr)
Example output (your results may differ due to the randomly generated data):
filter_df(df, ['A', 'B'], 1)
querying with: A == 1 and B == 1
    A  B  C  D  E
6   1  1  1  2  1
11  1  1  2  3  4
Here is another approach. It's cleaner, more performant, and has the advantage that columns can be empty (in which case the entire data frame is returned).
def filter(df, value, *columns):
    return df.loc[df.loc[:, columns].eq(value).all(axis=1)]
Explanation
values = df.loc[:, columns] selects only the columns we are interested in.
masks = values.eq(value) gives a boolean data frame indicating equality with the target value.
mask = masks.all(axis=1) applies an AND across columns (returning an index mask). Note that you can use masks.any(axis=1) for an OR.
return df.loc[mask] applies index mask to the data frame.
Demo
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 2, (100, 3)), columns=list('ABC'))
# both columns
assert np.all(filter(df, 1, 'A', 'B') == df[(df.A == 1) & (df.B == 1)])
# no columns
assert np.all(filter(df, 1) == df)
# different values per column
assert np.all(filter(df, [1, 0], 'A', 'B') == df[(df.A == 1) & (df.B == 0)])
Alternative
For a small number of columns (< 5), the following solution, based on steven's answer, is more performant than the above, although less flexible. As-is, it will not work for an empty columns set, and will not work using different values per column.
from functools import reduce
from operator import and_

def filter(df, value, *columns):
    return df.loc[reduce(and_, (df[column] == value for column in columns))]
Retrieving a Series object by key (df[column]) is significantly faster than constructing a DataFrame object around a subset of columns (df.loc[:, columns]).
In [4]: %timeit df['A'] == 1
100 loops, best of 3: 17.3 ms per loop
In [5]: %timeit df.loc[:, ['A']] == 1
10 loops, best of 3: 48.6 ms per loop
Nevertheless, this speedup becomes negligible when dealing with a larger number of columns. The bottleneck becomes ANDing the masks together, for which reduce(and_, ...) is far slower than the Pandas builtin all(axis=1).
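If you want to see where that crossover happens on your own data, a rough timing sketch could look like the following; the column count and sizes are arbitrary, and the actual numbers will depend on your pandas version and hardware:

import timeit
from functools import reduce
from operator import and_

import numpy as np
import pandas as pd

# arbitrary toy data: 100k rows, 20 binary columns
df = pd.DataFrame(np.random.randint(0, 2, (100000, 20)),
                  columns=['c{}'.format(i) for i in range(20)])
cols = list(df.columns)

t_reduce = timeit.timeit(lambda: df.loc[reduce(and_, (df[c] == 1 for c in cols))], number=50)
t_all = timeit.timeit(lambda: df.loc[df.loc[:, cols].eq(1).all(axis=1)], number=50)
print('reduce(and_): {:.3f}s, all(axis=1): {:.3f}s'.format(t_reduce, t_all))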
Thanks for the help, guys. I came up with something similar to Marius's answer after finding out about df.query():
def makeQuery(cols, equivalence=True, *args):
    operator = ' == ' if equivalence else ' != '
    query = ''
    for arg in args:
        for col in cols:
            query = query + "({}{}{})".format(col, operator, arg) + ' & '
    return query[:-3]

query = makeQuery(['A', 'B', 'C'], False, 1, 2)
The contents of query is a string:
(A != 1) & (B != 1) & (C != 1) & (A != 2) & (B != 2) & (C != 2)
that can be passed to df.query(query)
This is pretty messy but it seems to work.
import operator
from functools import reduce
import pandas as pd

def filterIt(value, args):
    # note: relies on the global DataFrame b defined below
    stuff = [getattr(b, thing) == value for thing in args]
    return reduce(operator.and_, stuff)

a = {'A': [1, 2, 3], 'B': [2, 2, 2], 'C': [3, 2, 1]}
b = pd.DataFrame(a)

filterIt(2, ['A', 'B', 'C'])
0 False
1 True
2 False
dtype: bool
(b.A == 2) & (b.B == 2) & (b.C ==2)
0 False
1 True
2 False
dtype: bool