Group by and apply custom function in pandas data frame - python

I have a df as below:
I would like to group by id and flag and create a new column in the df which is the result of: [sum(value1)/sum(value2)] * 12. Therefore I will need the result to be:
I have created a function:
`def calculation (value1, value2):
result = (value1/value2) * 12
return(result)`
Could you advise which is the best way to apply this function along with the grouping, in order to get the desired output?
Many thanks

The following code should work.
import pandas as pd
df = pd.DataFrame({"id" : [1,1,2,2],"flag":["A","B","B","A"],"value1":[520,200,400,410],"value2":[12,5,11,2]})
def calculation(value1, value2):
result = (value1/value2) * 12
return(result)
df.groupby(['id','flag']).apply(lambda x: calculation(x['value1'],x['value2'])).astype(int)
You just have to use the following for groupby and apply.
df.groupby(['id','flag']).apply(lambda x: calculation(x['value1'],x['value2'])).astype(int)

Here's a solution using apply and a lambda function.
import pandas as pd
df = pd.DataFrame([
[1, 'A', 520, 12],
[1, 'B', 200, 5],
[2, 'B', 400, 11],
[2, 'A', 410, 2]],
columns=['id', 'flag', 'value1', 'value2'])
df.groupby(['id', 'flag']).apply(lambda x: (12 * x['value1']) / x['value2'])
If you want to use the function calculation above then just call the apply method like this.
df.groupby(['id', 'flag']).apply(lambda x: calculation(x['value1'], x['value2']))

Related

How to run variable of one function to another in python?

I am a beginner. I have two function:
1st creating dataframes and some print statement
2nd is downloading the dataframes to csv in colab.
I want to download all dataframes by the df_name.
code:
def fun1():
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
d2 = {'col1': [-5, -6], 'col2': [-7, -8]}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d2)
print('info', df.info())
print('info', df2.info())
return df, df2
def fun2(df):
from google.colab import files
name1 = 'positive.csv'
name2 = 'negative.csv'
df.to_csv(name1)
df2.to_csv(name2)
files.download(name1)
files.download(name2)
fun2(df) #looking something like this that download my df, func2 should read my df and df2 from fun1()
I tried:
class tom:
def fun1(self):
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
d2 = {'col1': [-5, -6], 'col2': [-7, -8]}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d2)
print('info', df.info())
print('info', df2.info())
self.df= df
self.df2 = df2
return df, df2
def fun2(self):
df,df2 = fun1()
from google.colab import files
name1 = 'positive.csv'
name2 = 'negative.csv'
df.to_csv(name1)
df2.to_csv(name2)
return files.download(name1) ,files.download(name2)
tom().fun2() #it download files but shows print of fun1 as well which I don't want.
looking for something like
tom().fun2(dataframe_name) #it just download the files nothing else
set permanent variables directly in the class if its not gonna change and
define fun just for actions.
class s:
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
d2 = {'col1': [-5, -6], 'col2': [-7, -8]}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d2)
name1 = 'positive.csv'
name2 = 'negative.csv'
df.to_csv(name1)
df2.to_csv(name2)
def f():
print('info', df.info())
print('info', df2.info())
def fun(x):
from google.colab import files
return files.download(x)
run
s.f() --it will print value only
s.fun(s.name1) --it will just download the file
Maybe you can save the data you need in a class variable or create another function, that keeps the data from the first function you need the value (lets call it A) and then pass A to the second function as an argument.

How to apply a function using two Series as input with the output being a DataFrame of the function result of each combination of arguments?

I have two Series, each containing variables that I want to use in a function. I want to apply the function for each combination of variables with the resulting output being a DataFrame of the calculated values, the index will be the index from one Series and the columns will be the index of the other Series.
I have tried searching for an answer to a similar problem - I'm sure there's one out there but I'm not sure how to describe it for search engines.
I've solved the problem by creating a function using for loops, so you can understand the logic. I want to know if there is a more efficient operation to do this without using for loops.
From what I've read, I'm imagining some combination of a list comprehension with zipped columns to calculate the values, which is then reshaped into the DataFrame but I can't solve it this way.
Here's the code to reproduce the problem and current solution.
import pandas as pd
bands = pd.Series({'A': 5, 'B': 17, 'C': 9, 'D': 34}, name='band')
values = pd.Series({'Jan': 1, 'Feb': 1.02, 'Mar': 1.05, 'Apr': 1.12}, name='values')
# Here is an unused function as an example
myfunc = lambda x, y: x * (1 + 1/y)
def func1(values, bands):
# Initialise empty DataFrame
df = pd.DataFrame(index=bands.index,
columns=values.index)
for month, month_val in values.iteritems():
for band, band_val in bands.iteritems():
df.at[band, month] = band_val * (1/month_val - 1)
return df
outcome = func1(values, bands)
You could use numpy.outer for this:
import numpy as np
import pandas as pd
bands = pd.Series({'A': 5, 'B': 17, 'C': 9, 'D': 34}, name='band')
values = pd.Series({'Jan': 1, 'Feb': 1.02, 'Mar': 1.05, 'Apr': 1.12}, name='values')
outcome = pd.DataFrame(np.outer(bands, ((1 / values) - 1)),
index=bands.index,
columns=values.index)
[out]
Jan Feb Mar Apr
A 0.0 -0.098039 -0.238095 -0.535714
B 0.0 -0.333333 -0.809524 -1.821429
C 0.0 -0.176471 -0.428571 -0.964286
D 0.0 -0.666667 -1.619048 -3.642857
As a function:
def myFunc(ser1, ser2):
result = pd.DataFrame(np.outer(ser1, ((1 / ser2) - 1)),
index=ser1.index,
columns=ser2.index)
return result
myFunc(bands, values)

Quicker way to iterate pandas dataframe and apply a conditional function

Summary
I am trying to iterate over a large dataframe. Identify unique groups based on several columns, apply the mean to another column based on how many are in the group. My current approach is very slow when iterating over a large dataset and applying the average function across many columns. Is there a way I can do this more efficiently?
Example
Here's a example of the problem. I want to find unique combinations of ['A', 'B', 'C']. For each unique combination, I want the value of column ['D'] / number of rows in the group.
Edit:
Resulting dataframe should preserve the duplicated groups. But with edited column 'D'
import pandas as pd
import numpy as np
import datetime
def time_mean_rows():
# Generate some random data
A = np.random.randint(0, 5, 1000)
B = np.random.randint(0, 5, 1000)
C = np.random.randint(0, 5, 1000)
D = np.random.randint(0, 10, 1000)
# init dataframe
df = pd.DataFrame(data=[A, B, C, D]).T
df.columns = ['A', 'B', 'C', 'D']
tstart = datetime.datetime.now()
# Get unique combinations of A, B, C
unique_groups = df[['A', 'B', 'C']].drop_duplicates().reset_index()
# Iterate unique groups
normalised_solutions = []
for idx, row in unique_groups.iterrows():
# Subset dataframe to the unique group
sub_df = df[
(df['A'] == row['A']) &
(df['B'] == row['B']) &
(df['C'] == row['C'])
]
# If more than one solution, get mean of column D
num_solutions = len(sub_df)
if num_solutions > 1:
sub_df.loc[:, 'D'] = sub_df.loc[:,'D'].values.sum(axis=0) / num_solutions
normalised_solutions.append(sub_df)
# Concatenate results
res = pd.concat(normalised_solutions)
tend = datetime.datetime.now()
time_elapsed = (tstart - tend).seconds
print(time_elapsed)
I know the section causing slowdown is when num_solutions > 1. How can I do this more efficiently
Hm, why don't you use groupby?
df_res = df.groupby(['A', 'B', 'C'])['D'].mean().reset_index()
This is a complement to AT_asks's answer which only gave the first part of the solution.
Once we have df.groupby(['A', 'B', 'C'])['D'].mean() we can use it to change the value of the column 'D' in a copy of the original dataframe provided we use a dataframe sharing same index. The global solution is then:
res = df.set_index(['A', 'B', 'C']).assign(
D=df.groupby(['A', 'B', 'C'])['D'].mean()).reset_index()
This will contains same rows (even if a different order that the res dataframe from OP's question.
Here's a solution I found
Using groupby as suggested by AT, then merging back to the original df and dropping the original ['D', 'E'] columns. Nice speedup!
def time_mean_rows():
# Generate some random data
np.random.seed(seed=42)
A = np.random.randint(0, 10, 10000)
B = np.random.randint(0, 10, 10000)
C = np.random.randint(0, 10, 10000)
D = np.random.randint(0, 10, 10000)
E = np.random.randint(0, 10, 10000)
# init dataframe
df = pd.DataFrame(data=[A, B, C, D, E]).T
df.columns = ['A', 'B', 'C', 'D', 'E']
tstart_grpby = timer()
cols = ['D', 'E']
group_df = df.groupby(['A', 'B', 'C'])[cols].mean().reset_index()
# Merge df
df = pd.merge(df, group_df, how='left', on=['A', 'B', 'C'], suffixes=('_left', ''))
# Get left columns (have not been normalised) and drop
drop_cols = [x for x in df.columns if x.endswith('_left')]
df.drop(drop_cols, inplace=True, axis='columns')
tend_grpby = timer()
time_elapsed_grpby = timedelta(seconds=tend_grpby-tstart_grpby).total_seconds()
print(time_elapsed_grpby)

Transforming pandas Dataframe into dictionary via function taking column inputs

I have the following pandas Dataframe:
dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'], 'amount': [3, 4, 5, 1, 2, 1], 'front':[21889611, 36357723, 196312, 11, 42, 1992], 'back':[21973805, 36403870, 277500, 19, 120, 3210]}
df1 = pd.DataFrame(dict1)
print(df1)
file amount front back
0 filename2 3 21889611 21973805
1 filename2 4 36357723 36403870
2 filename3 5 196312 277500
3 filename4 1 11 19
4 filename4 2 42 120
5 filename3 1 1992 3210
My task is to take N random draws between front and back, whereby N is equal to the value in amount. Parse this into a dictionary.
To do this on an row-by-row basis is easy for me to understand:
e.g. row 1
import numpy as np
random_draws = np.random.choice(np.arange(21889611, 21973805+1), 3)
e.g. row 2
random_draws = np.random.choice(np.arange(36357723, 36403870+1), 4)
Normally with pandas, users could define this as a function and use something like
def func(front, back, amount):
return np.random.choice(np.arange(front, back+1), amount)
df["new_column"].apply(func)
but the result of my function is an array of varying size.
My second problem is that I would like the output to be a dictionary, of the format
{file: [random_draw_results], file: [random_draw_results], file: [random_draw_results], ...}
For the above example df1, the function should output this dictionary (given the draws):
final_dict = {"filename2": [21927457, 21966814, 21898538, 36392840, 36375560, 36384078, 36366833],
"filename3": 212143, 239725, 240959, 197359, 276948, 3199],
"filename4": [100, 83, 15]}
We can pass axis=1 to operate over rows when using apply.
We then need to tell what columns to use and we return a list.
We then either perform some form of groupby or we could use defaultdict as shown below:
dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'], 'amount': [3, 4, 5, 1, 2, 1], 'front':[21889611, 36357723, 196312, 11, 42, 1992], 'back':[21973805, 36403870, 277500, 19, 120, 3210]}
import numpy as np
import pandas as pd
def func(x):
return np.random.choice(np.arange(x.front, x.back+1), x.amount).tolist()
df1 = pd.DataFrame(dict1)
df1["new_column"] = df1.apply(func, axis=1)
df1.groupby('file')['new_column'].apply(sum).to_dict()
Returns:
{'filename2': [21891765,
21904680,
21914414,
36398355,
36358161,
36387670,
36369443],
'filename3': [240766, 217580, 217581, 274396, 241413, 2488],
'filename4': [18, 96, 107]}
Alt2 would be to use (and by some small timings I ran it looks like it runs as fast):
from collections import defaultdict
d = defaultdict(list)
for k,v in df1.set_index('file')['new_column'].items():
d[k].extend(v)

Pandas groupby quantile values

I tried to calculate specific quantile values from a data frame, as shown in the code below. There was no problem when calculate it in separate lines.
When attempting to run last 2 lines, I get the following error:
AttributeError: 'SeriesGroupBy' object has no attribute 'quantile(0.25)'
How can I fix this?
import pandas as pd
df = pd.DataFrame(
{
'x': [0, 1, 0, 1, 0, 1, 0, 1],
'y': [7, 6, 5, 4, 3, 2, 1, 0],
'number': [25000, 35000, 45000, 50000, 60000, 70000, 65000, 36000]
}
)
f = {'number': ['median', 'std', 'quantile']}
df1 = df.groupby('x').agg(f)
df.groupby('x').quantile(0.25)
df.groupby('x').quantile(0.75)
# code below with problem:
f = {'number': ['median', 'std', 'quantile(0.25)', 'quantile(0.75)']}
df1 = df.groupby('x').agg(f)
I prefer def functions
def q1(x):
return x.quantile(0.25)
def q3(x):
return x.quantile(0.75)
f = {'number': ['median', 'std', q1, q3]}
df1 = df.groupby('x').agg(f)
df1
Out[1643]:
number
median std q1 q3
x
0 52500 17969.882211 40000 61250
1 43000 16337.584481 35750 55000
#WeNYoBen's answer is great. There is one limitation though, and that lies with the fact that one needs to create a new function for every quantile. This can be a very unpythonic exercise if the number of quantiles become large. A better approach is to use a function to create a function, and to rename that function appropriately.
def rename(newname):
def decorator(f):
f.__name__ = newname
return f
return decorator
def q_at(y):
#rename(f'q{y:0.2f}')
def q(x):
return x.quantile(y)
return q
f = {'number': ['median', 'std', q_at(0.25) ,q_at(0.75)]}
df1 = df.groupby('x').agg(f)
df1
Out[]:
number
median std q0.25 q0.75
x
0 52500 17969.882211 40000 61250
1 43000 16337.584481 35750 55000
The rename decorator renames the function so that the pandas agg function can deal with the reuse of the quantile function returned (otherwise all quantiles results end up in columns that are named q).
There's a nice way if you want to give names to aggregated columns:
df1.groupby('x').agg(
q1_foo=pd.NamedAgg('number', q1),
q2_foo=pd.NamedAgg('number', q2)
)
where q1 and q2 are functions.
Or even simpler:
df1.groupby('x').agg(
q1_foo=('number', q1),
q2_foo=('number', q2)
)

Categories