Pandas groupby quantile values - python

I tried to calculate specific quantile values from a data frame, as shown in the code below. There was no problem when I calculated them on separate lines.
When I attempt to run the last two lines, I get the following error:
AttributeError: 'SeriesGroupBy' object has no attribute 'quantile(0.25)'
How can I fix this?
import pandas as pd
df = pd.DataFrame(
{
'x': [0, 1, 0, 1, 0, 1, 0, 1],
'y': [7, 6, 5, 4, 3, 2, 1, 0],
'number': [25000, 35000, 45000, 50000, 60000, 70000, 65000, 36000]
}
)
f = {'number': ['median', 'std', 'quantile']}
df1 = df.groupby('x').agg(f)
df.groupby('x').quantile(0.25)
df.groupby('x').quantile(0.75)
# code below with problem:
f = {'number': ['median', 'std', 'quantile(0.25)', 'quantile(0.75)']}
df1 = df.groupby('x').agg(f)

I prefer def functions
def q1(x):
    return x.quantile(0.25)
def q3(x):
    return x.quantile(0.75)
f = {'number': ['median', 'std', q1, q3]}
df1 = df.groupby('x').agg(f)
df1
Out[1643]:
   number
   median           std     q1     q3
x
0   52500  17969.882211  40000  61250
1   43000  16337.584481  35750  55000

@WeNYoBen's answer is great. It has one limitation, though: you need to write a new function for every quantile, which becomes tedious and unpythonic as the number of quantiles grows. A better approach is to use a function that creates the quantile function and renames it appropriately.
def rename(newname):
    def decorator(f):
        f.__name__ = newname
        return f
    return decorator
def q_at(y):
    @rename(f'q{y:0.2f}')
    def q(x):
        return x.quantile(y)
    return q
f = {'number': ['median', 'std', q_at(0.25), q_at(0.75)]}
df1 = df.groupby('x').agg(f)
df1
Out[]:
   number
   median           std  q0.25  q0.75
x
0   52500  17969.882211  40000  61250
1   43000  16337.584481  35750  55000
The rename decorator gives each returned function its own name, so that the pandas agg function can cope with several functions produced by the same factory (otherwise every quantile result would end up in a column named q).
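When only a few quantiles per group are needed, another option is to skip custom functions altogether. The sketch below assumes the same data as in the question (only the x and number columns are kept) and relies on the grouped quantile method accepting a list of probabilities. The variable name quartiles is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [0, 1, 0, 1, 0, 1, 0, 1],
    'number': [25000, 35000, 45000, 50000, 60000, 70000, 65000, 36000],
})

# quantile() with a list returns a Series indexed by (x, quantile);
# unstack() moves the quantile level into the columns.
quartiles = df.groupby('x')['number'].quantile([0.25, 0.75]).unstack()
print(quartiles)
```

The resulting columns are labelled with the probabilities themselves (0.25 and 0.75), so they may need renaming before being joined to other aggregates.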

There's a nice way if you want to give names to aggregated columns:
df.groupby('x').agg(
    q1_foo=pd.NamedAgg('number', q1),
    q2_foo=pd.NamedAgg('number', q2)
)
where q1 and q2 are functions and df is the original (unaggregated) data frame.
Or even simpler:
df.groupby('x').agg(
    q1_foo=('number', q1),
    q2_foo=('number', q2)
)
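For reference, here is a self-contained sketch of named aggregation applied to the data from the question. It assumes pandas 0.25 or later, where the keyword form of agg is available. The output column names (median, q1, q3) and the helper functions are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [0, 1, 0, 1, 0, 1, 0, 1],
    'number': [25000, 35000, 45000, 50000, 60000, 70000, 65000, 36000],
})

def q1(s):
    # First quartile of one group's values
    return s.quantile(0.25)

def q3(s):
    # Third quartile of one group's values
    return s.quantile(0.75)

# Each keyword becomes an output column: name=(source column, aggregation)
summary = df.groupby('x').agg(
    median=('number', 'median'),
    q1=('number', q1),
    q3=('number', q3),
)
print(summary)
```

Because the column names come from the keywords, the functions do not need distinct __name__ values in this form.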

Related

Pandas apply with a function as a keyword

How can I pass an optional function as a keyword into another function called with pd.apply?
For example, sometimes the dataframe I am working with has z in it:
df = pd.DataFrame(
{
"x": [1, 2, 3, 4],
"y": [100, 200, 300, 400],
"z": [0.1, 0.2, 0.3, 0.4],
}
)
df
This is easy because I can just use my function (much simplified herein of course):
def myfunc(dataframe):
    a = dataframe["z"] * 2
    return pd.Series(
        {
            "a": a,
        }
    )
To get my final dataframe with a:
output = df.apply(
myfunc,
axis=1,
)
df = df.join(output)
df
In the case where z is not in my starting dataframe, I need to call a function to calculate it. I can call several different functions to calculate it, and I want to be able to name a specific function. Here, I want to use the function computeZ.
def computeZ(dataframe):
    results = dataframe["x"] / 20
    return results
df = pd.DataFrame(
{
"x": [1, 2, 3, 4],
"y": [100, 200, 300, 400],
}
)
df
Here is the problem: I am trying this below, but I get TypeError: myfunc2() takes 1 positional argument but 2 were given. What is the correct way to do this? For various reasons, I need to use pd.apply rather than working in numpy arrays or using anonymous functions.
def myfunc2(dataframe, **kwds):
    if analysis:
        res = globals()[analysis](dataframe["x"])
        a = res * 2
    else:
        a = dataframe["z"] * 2
    return pd.Series(
        {
            "a": a,
        }
    )
output = df.apply(
myfunc2,
axis=1,
args=("computeZ",),
)
It looks like you need a second positional parameter in the myfunc2 definition. With axis=1, the first argument passed is the row itself, followed by your positional parameter:
def myfunc2(row, p2):
    return p2(row)
The positional argument should then be the function object itself rather than its name as a string (computeZ, not "computeZ"):
output = df.apply(
myfunc2,
axis=1,
args=(computeZ,)
)
output
0 0.05
1 0.10
2 0.15
3 0.20
Or you could do it using kwargs:
def myfunc2(row, **kwargs):
    return kwargs['your_func'](row)
output = df.apply(
myfunc2,
axis=1,
your_func=computeZ
)
output
0 0.05
1 0.10
2 0.15
3 0.20
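Since the question asks for an optional function, the two cases can be combined with a keyword parameter that defaults to None. This is a minimal sketch, not the asker's real code: the names double_z, z_func and compute_z are hypothetical, and the frames are cut down to two rows.

```python
import pandas as pd

def compute_z(row):
    # Hypothetical stand-in for computeZ: derive z from x
    return row["x"] / 20

def double_z(row, z_func=None):
    # Use the existing z column unless a function is supplied
    z = row["z"] if z_func is None else z_func(row)
    return pd.Series({"a": z * 2})

with_z = pd.DataFrame({"x": [1, 2], "z": [0.1, 0.2]})
without_z = pd.DataFrame({"x": [1, 2]})

# Extra keyword arguments given to apply are forwarded to the function
out_existing = with_z.apply(double_z, axis=1)
out_computed = without_z.apply(double_z, axis=1, z_func=compute_z)
```

Passing the function object directly also avoids the globals() lookup by name.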

Remove matching values from two columns in Pandas

I have the following data set generated from the equation
df.loc[:,51] = [1217.0, -20.0, 13970.0, -74]
I dropped the negative values (specific values) and got this
df.loc[:,52] = [1217.0, 0, 13970.0, 0]
Now I am trying to get another column with the dropped values
df.loc[:,53] = df.drop_duplicates(subset=[df.loc[:,51], df.loc[:,52]])
I want this result.
The values that are dropped
df.loc[:,53] = [0,-20, 0,-74]
But I got the following error
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Try numpy.where
df.loc[:,53] = np.where(df.loc[:,51] == df.loc[:,52], 0, df.loc[:,51])
Here, I've done it with a sample data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'rating': [4, -4, 3.5, -15, 5]
})
df.loc[(df['rating'] < 0 ), 'new_col'] = 0
df.loc[(df['rating'] > 0 ), 'new_col'] = df['rating']
df['dropped'] = np.where(df['rating'] == df['new_col'], 0, df['rating'])
df
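The original error appears to come from passing Series objects to subset, which expects column labels. As an alternative to comparing two columns, both results can be derived directly from the source values with Series.where. The sketch below uses the four values from the question; the names ratings, kept and dropped are illustrative.

```python
import pandas as pd

ratings = pd.Series([1217.0, -20.0, 13970.0, -74.0])

# where() keeps values where the condition holds and substitutes 0 elsewhere
kept = ratings.where(ratings >= 0, 0)      # negatives replaced by 0
dropped = ratings.where(ratings < 0, 0)    # only the negatives survive

print(kept.tolist())
print(dropped.tolist())
```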

Group by and apply custom function in pandas data frame

I have a df as below:
I would like to group by id and flag and create a new column in the df which is the result of: [sum(value1)/sum(value2)] * 12. Therefore I will need the result to be:
I have created a function:
`def calculation(value1, value2):
    result = (value1/value2) * 12
    return(result)`
Could you advise which is the best way to apply this function along with the grouping, in order to get the desired output?
Many thanks
The following code should work.
import pandas as pd
df = pd.DataFrame({"id" : [1,1,2,2],"flag":["A","B","B","A"],"value1":[520,200,400,410],"value2":[12,5,11,2]})
def calculation(value1, value2):
    result = (value1/value2) * 12
    return(result)
df.groupby(['id','flag']).apply(lambda x: calculation(x['value1'],x['value2'])).astype(int)
You just have to use the following for groupby and apply.
df.groupby(['id','flag']).apply(lambda x: calculation(x['value1'],x['value2'])).astype(int)
Here's a solution using apply and a lambda function.
import pandas as pd
df = pd.DataFrame([
[1, 'A', 520, 12],
[1, 'B', 200, 5],
[2, 'B', 400, 11],
[2, 'A', 410, 2]],
columns=['id', 'flag', 'value1', 'value2'])
df.groupby(['id', 'flag']).apply(lambda x: (12 * x['value1']) / x['value2'])
If you want to use the function calculation above then just call the apply method like this.
df.groupby(['id', 'flag']).apply(lambda x: calculation(x['value1'], x['value2']))
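Note that the snippets above divide row by row, which matches the requested sum(value1)/sum(value2) only when every group has a single row. A minimal sketch of the summed version, written back as a new column, could use transform. The extra (1, 'A') row is made up here so that one group has more than one row.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "flag": ["A", "A", "B", "B", "A"],
    "value1": [520, 80, 200, 400, 410],   # the 80 row is an invented example
    "value2": [12, 3, 5, 11, 2],
})

grouped = df.groupby(["id", "flag"])

# transform("sum") broadcasts each group's total back onto its rows
total1 = grouped["value1"].transform("sum")
total2 = grouped["value2"].transform("sum")
df["result"] = total1 / total2 * 12
print(df)
```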

How to apply a function using two Series as input with the output being a DataFrame of the function result of each combination of arguments?

I have two Series, each containing variables that I want to use in a function. I want to apply the function for each combination of variables with the resulting output being a DataFrame of the calculated values, the index will be the index from one Series and the columns will be the index of the other Series.
I have tried searching for an answer to a similar problem - I'm sure there's one out there but I'm not sure how to describe it for search engines.
I've solved the problem by creating a function using for loops, so you can understand the logic. I want to know if there is a more efficient operation to do this without using for loops.
From what I've read, I'm imagining some combination of a list comprehension with zipped columns to calculate the values, which is then reshaped into the DataFrame but I can't solve it this way.
Here's the code to reproduce the problem and current solution.
import pandas as pd
bands = pd.Series({'A': 5, 'B': 17, 'C': 9, 'D': 34}, name='band')
values = pd.Series({'Jan': 1, 'Feb': 1.02, 'Mar': 1.05, 'Apr': 1.12}, name='values')
# Here is an unused function as an example
myfunc = lambda x, y: x * (1 + 1/y)
def func1(values, bands):
    # Initialise empty DataFrame
    df = pd.DataFrame(index=bands.index,
                      columns=values.index)
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for month, month_val in values.items():
        for band, band_val in bands.items():
            df.at[band, month] = band_val * (1/month_val - 1)
    return df
outcome = func1(values, bands)
You could use numpy.outer for this:
import numpy as np
import pandas as pd
bands = pd.Series({'A': 5, 'B': 17, 'C': 9, 'D': 34}, name='band')
values = pd.Series({'Jan': 1, 'Feb': 1.02, 'Mar': 1.05, 'Apr': 1.12}, name='values')
outcome = pd.DataFrame(np.outer(bands, ((1 / values) - 1)),
index=bands.index,
columns=values.index)
[out]
   Jan       Feb       Mar       Apr
A  0.0 -0.098039 -0.238095 -0.535714
B  0.0 -0.333333 -0.809524 -1.821429
C  0.0 -0.176471 -0.428571 -0.964286
D  0.0 -0.666667 -1.619048 -3.642857
As a function:
def myFunc(ser1, ser2):
    result = pd.DataFrame(np.outer(ser1, ((1 / ser2) - 1)),
                          index=ser1.index,
                          columns=ser2.index)
    return result
myFunc(bands, values)
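numpy.outer only covers formulas that factor into a product. For an arbitrary two-argument function built from NumPy-compatible arithmetic, broadcasting a column vector against a row vector gives the same kind of grid. This is a sketch under that assumption; the helper name pairwise is hypothetical.

```python
import numpy as np
import pandas as pd

bands = pd.Series({'A': 5, 'B': 17, 'C': 9, 'D': 34}, name='band')
values = pd.Series({'Jan': 1, 'Feb': 1.02, 'Mar': 1.05, 'Apr': 1.12}, name='values')

def pairwise(func, rows, cols):
    # Shape (n, 1) against shape (1, m) broadcasts to an (n, m) grid
    grid = func(rows.to_numpy()[:, None], cols.to_numpy()[None, :])
    return pd.DataFrame(grid, index=rows.index, columns=cols.index)

outcome = pairwise(lambda band, value: band * (1 / value - 1), bands, values)
print(outcome)
```

The function passed in must work element-wise on arrays; one that needs scalar inputs would still require a loop or numpy.vectorize.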

Transforming pandas Dataframe into dictionary via function taking column inputs

I have the following pandas Dataframe:
dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'], 'amount': [3, 4, 5, 1, 2, 1], 'front':[21889611, 36357723, 196312, 11, 42, 1992], 'back':[21973805, 36403870, 277500, 19, 120, 3210]}
df1 = pd.DataFrame(dict1)
print(df1)
file amount front back
0 filename2 3 21889611 21973805
1 filename2 4 36357723 36403870
2 filename3 5 196312 277500
3 filename4 1 11 19
4 filename4 2 42 120
5 filename3 1 1992 3210
My task is to take N random draws between front and back, whereby N is equal to the value in amount. Parse this into a dictionary.
Doing this on a row-by-row basis is easy for me to understand:
e.g. row 1
import numpy as np
random_draws = np.random.choice(np.arange(21889611, 21973805+1), 3)
e.g. row 2
random_draws = np.random.choice(np.arange(36357723, 36403870+1), 4)
Normally with pandas, users could define this as a function and use something like
def func(front, back, amount):
    return np.random.choice(np.arange(front, back+1), amount)
df["new_column"].apply(func)
but the result of my function is an array of varying size.
My second problem is that I would like the output to be a dictionary, of the format
{file: [random_draw_results], file: [random_draw_results], file: [random_draw_results], ...}
For the above example df1, the function should output this dictionary (given the draws):
final_dict = {"filename2": [21927457, 21966814, 21898538, 36392840, 36375560, 36384078, 36366833],
              "filename3": [212143, 239725, 240959, 197359, 276948, 3199],
              "filename4": [100, 83, 15]}
Passing axis=1 to apply makes it operate row by row.
The function then reads the columns it needs from each row and returns a list.
Finally, we combine the lists per file, either with a groupby or with a defaultdict as shown below:
dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'], 'amount': [3, 4, 5, 1, 2, 1], 'front':[21889611, 36357723, 196312, 11, 42, 1992], 'back':[21973805, 36403870, 277500, 19, 120, 3210]}
import numpy as np
import pandas as pd
def func(x):
    return np.random.choice(np.arange(x.front, x.back+1), x.amount).tolist()
df1 = pd.DataFrame(dict1)
df1["new_column"] = df1.apply(func, axis=1)
df1.groupby('file')['new_column'].apply(sum).to_dict()
Returns:
{'filename2': [21891765,
21904680,
21914414,
36398355,
36358161,
36387670,
36369443],
'filename3': [240766, 217580, 217581, 274396, 241413, 2488],
'filename4': [18, 96, 107]}
A second alternative (in a few small timings I ran, it appeared to be about as fast):
from collections import defaultdict
d = defaultdict(list)
for k, v in df1.set_index('file')['new_column'].items():
    d[k].extend(v)
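For wide ranges, building np.arange(front, back+1) for every row allocates a large temporary array. One possible variation, sketched here as an assumption rather than as part of the answer above, draws the integers directly with a NumPy Generator. The seed and the names rng and draws are illustrative, and sampling is with replacement, as with the default np.random.choice.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'],
    'amount': [3, 4, 5, 1, 2, 1],
    'front': [21889611, 36357723, 196312, 11, 42, 1992],
    'back': [21973805, 36403870, 277500, 19, 120, 3210],
})

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
draws = defaultdict(list)

for row in df1.itertuples(index=False):
    # endpoint=True makes the upper bound inclusive, like arange(front, back + 1)
    sample = rng.integers(row.front, row.back, size=row.amount, endpoint=True)
    draws[row.file].extend(sample.tolist())

final_dict = dict(draws)
print(final_dict)
```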
