Groupby and Named aggregation | Optimize dataframe generation in Pandas

Groupby and Named aggregation | Optimize dataframe generation in Pandas - python

I have a dataframe in Pandas with some columns, something like this:
data = {
'CODIGO_SINIESTRO': [10476434, 10476434, 4482524, 4482524, 4486110],
'CONDICION': ['PASAJERO', 'CONDUCTOR', 'MOTOCICLISTA', 'CICLISTA', 'PEATON'],
'EDAD': [62.0, 29.0, 26.0, 47.0, 33.0],
'SEXO': ['MASCULINO', 'FEMENINO', 'FEMENINO', 'MASCULINO', 'FEMENINO']
}
df = pd.DataFrame(data)
Output:
CODIGO_SINIESTRO CONDICION EDAD SEXO
0 10476434 PASAJERO 62.0 MASCULINO
1 10476434 CONDUCTOR 29.0 MASCULINO
2 4482524 MOTOCICLISTA 26.0 MASCULINO
3 4482524 CICLISTA 47.0 MASCULINO
4 4486110 PEATON 33.0 FEMENINO
So, I want to create another dataframe grouped by 'CODIGO_SINIESTRO' column, and I want the following columns like result:
'CODIGO_SINIESTRO': Id of the row.
'PROMEDIO_EDAD': This column will store edad mean.
'CANTIDAD_HOMBRES': This column will store masculine counts based on 'SEXO' column.
'CANTIDAD_HOMBRES': This column will store femenine counts based on 'SEXO' column.
Finally I want five extra columns named equal to the four values possibles of 'CONDICION' column, this values will store 1 if value exist or 0 if not.
So, I wrote this solution and working as expect, however I have many rows in my dataset (150k+) and the solution is slow (5 minutes). This is my code:
df_final = df.groupby(['CODIGO_SINIESTRO']).agg(
CANTIDAD_HOMBRES=pd.NamedAgg(column='SEXO', aggfunc=lambda x: (x=='MASCULINO').sum()),
CANTIDAD_MUJERES=pd.NamedAgg(column='SEXO', aggfunc=lambda x: (x=='FEMENINO').sum()),
PROMEDIO_EDAD=pd.NamedAgg(column='EDAD', aggfunc=np.mean),
MOTOCICLISTA=pd.NamedAgg(column='CONDICION', aggfunc=lambda x: (x=='MOTOCICLISTA').any().astype(int)),
CONDUCTOR=pd.NamedAgg(column='CONDICION', aggfunc=lambda x: (x=='CONDUCTOR').any().astype(int)),
PEATON=pd.NamedAgg(column='CONDICION', aggfunc=lambda x: (x=='PEATON').any().astype(int)),
CICLISTA=pd.NamedAgg(column='CONDICION', aggfunc=lambda x: (x=='CICLISTA').any().astype(int)),
PASAJERO=pd.NamedAgg(column='CONDICION', aggfunc=lambda x: (x=='PASAJERO').any().astype(int))
).reset_index()
Output:
CODIGO_SINIESTRO CANTIDAD_HOMBRES CANTIDAD_MUJERES PROMEDIO_EDAD ...
0 4482524 1 1 36.5
1 4486110 0 1 33.0
2 10476434 1 1 45.5
... MOTOCICLISTA CONDUCTOR PEATON CICLISTA PASAJERO
1 0 0 1 0
0 0 1 0 0
0 1 0 0 1
How can I optimize this solution?, Are there other ways for resolving that?
Thank you.

Pre-aggregating with vectorized methods should be much more efficient (it turns out it was 100x faster):
df['PROMEDIO_EDAD']= df.groupby('CODIGO_SINIESTRO')['EDAD'].transform(np.mean)
df['CANTIDAD_HOMBRES'] = np.where(df['SEXO'] == 'MASCULINO', 1, 0)
df['CANTIDAD_MUJERES'] = np.where(df['SEXO'] == 'FEMENINO', 1, 0)
for col in df['CONDICION'].unique():
df[col] = np.where(df['CONDICION'] == col, 1, 0)
df = df.groupby(['CODIGO_SINIESTRO', 'PROMEDIO_EDAD']).sum().reset_index().drop('EDAD', axis=1)
df.iloc[:,2:] = (df.iloc[:,2:] > 0).astype(int)
df
Out[1]:
CODIGO_SINIESTRO PROMEDIO_EDAD CANTIDAD_HOMBRES CANTIDAD_MUJERES \
0 4482524 36.5 1 1
1 4486110 33.0 0 1
2 10476434 45.5 1 1
PASAJERO CONDUCTOR MOTOCICLISTA CICLISTA PEATON
0 0 0 1 1 0
1 0 0 0 0 1
2 1 1 0 0 0

Related

Update dataframe column based on another dataframe column without for loop

I have two dataframes df1 and df2.
df1:
id val
1 25
2 40
3 78
df2:
id val
2 8
1 5
Now I want to do something like df1['val'] = df1['val']/df2['val'] for matching id. I can do that by iterating over all df2 rows as df2 is a subset of df1 so it may be missing some values, which I want to keep unchanged. This is what I have right now:
for row in df2.iterrows():
df1.loc[df1['id']==row[1]['id'], 'val'] /= row[1]['val']
df1:
id val
1 5
2 5
3 78
How can I achieve the same without using for loop to improve speed?

Use Series.map with Series.div:
df1['val'] = df1['val'].div(df1['id'].map(df2.set_index('id')['val']), fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0
Solution with merge with left join:
df1['val'] = df1['val'].div(df1.merge(df2, on='id', how='left')['val_y'], fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0

Updating values from different dataframe on a certain id value [duplicate]

Note:for simplicity's sake, i'm using a toy example, because copy/pasting dataframes is difficult in stack overflow (please let me know if there's an easy way to do this).
Is there a way to merge the values from one dataframe onto another without getting the _X, _Y columns? I'd like the values on one column to replace all zero values of another column.
df1:
Name Nonprofit Business Education
X 1 1 0
Y 0 1 0 <- Y and Z have zero values for Nonprofit and Educ
Z 0 0 0
Y 0 1 0
df2:
Name Nonprofit Education
Y 1 1 <- this df has the correct values.
Z 1 1
pd.merge(df1, df2, on='Name', how='outer')
Name Nonprofit_X Business Education_X Nonprofit_Y Education_Y
Y 1 1 1 1 1
Y 1 1 1 1 1
X 1 1 0 nan nan
Z 1 1 1 1 1
In a previous post, I tried combine_First and dropna(), but these don't do the job.
I want to replace zeros in df1 with the values in df2.
Furthermore, I want all rows with the same Names to be changed according to df2.
Name Nonprofit Business Education
Y 1 1 1
Y 1 1 1
X 1 1 0
Z 1 0 1
(need to clarify: The value in 'Business' column where name = Z should 0.)
My existing solution does the following:
I subset based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.
pubunis_df = df2
sdf = df1
regex = str_to_regex(', '.join(pubunis_df.ORGS))
pubunis = searchnamesre(sdf, 'ORGS', regex)
sdf.ix[pubunis.index, ['Education', 'Public']] = 1
searchnamesre(sdf, 'ORGS', regex)

Attention: In latest version of pandas, both answers above doesn't work anymore:
KSD's answer will raise error:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,0,0]],columns=["Name","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1]],columns=["Name","Nonprofit", "Education"])
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name),['Nonprofit', 'Education']].values
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values
Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)
and EdChum's answer will give us the wrong result:
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]
df1
Out[852]:
Name Nonprofit Business Education
0 X 1.0 1 0.0
1 Y 1.0 1 1.0
2 Z NaN 0 NaN
3 Y NaN 1 NaN
Well, it will work safely only if values in column 'Name' are unique and are sorted in both data frames.
Here is my answer:
Way 1:
df1 = df1.merge(df2,on='Name',how="left")
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Business_y'] = df1['Business_y'].fillna(df1['Business_x'])
df1.drop(["Business_x","Nonprofit_x"],inplace=True,axis=1)
df1.rename(columns={'Business_y':'Business','Nonprofit_y':'Nonprofit'},inplace=True)
Way 2:
df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)
df1.reset_index(inplace=True)
More guide about update.. The columns names of both data frames need to set index are not necessary same before 'update'. You could try 'Name1' and 'Name2'. Also, it works even if other unnecessary row in df2, which won't update df1. In other words, df2 doesn't need to be the super set of df1.
Example:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1],
['U',1,3]],columns=["Name2","Nonprofit", "Education"])
df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')
df1.update(df2)
result:
Nonprofit Business Education
Name1
X 1.0 1 0.0
Y 1.0 1 1.0
Z 1.0 0 1.0
Y 1.0 1 1.0

Use the boolean mask from isin to filter the df and assign the desired row values from the rhs df:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]

In [27]:
This is the correct one.
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
The above will work only when all rows in df1 exists in df . In other words df should be super set of df1
Incase if you have some non matching rows to df in df1,you should follow below
In other words df is not superset of df1 :
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] =
df1.loc[df1.Name.isin(df.Name),['Nonprofit', 'Education']].values

df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()

groupby a column and find the count of this column and other column and find the quotient of the two

The df I am working with is:
rank
response
1
1
2
1
3
0
2
0
1
0
2
1
null
1
my desired output:
rank
response_count
count_of_the_rank
response_rate
1
1
2
0.5
2
2
3
0.66
3
0
1
0
null
1
1
1
response rate is calculated as response_count/count_of_the_rank
I want to have a function that will produce this data frame and store in a csv given:
df
columne
this is my attempt without a function:
it works but the quotient is calculated outside, is it possible to do it inside of the agg?
also no csv
rank_df = df.groupby(['rank']).agg(
count_of_the_rank=('rank', 'count'),
response_count=('response', 'sum'))
rank_df['group_target_rate'] = rank_df['response_count']/rank_df['count_of_the_rank']
this is trying with a function but it doesnt work:
def target_rate_analysis(df, column):
new_df = df.groupby([column]).agg(
response_count=('response', 'sum'),
'count_of_the' + column=(column, 'count'),
response_count=('response', 'mean'))
return new_df

Use groupby and then aggregate (for the response_rate you can use "mean"):
df_out = df.groupby("rank", as_index=False).agg(
response_count=("response", "sum"),
count_of_the_rank=("response", "size"),
response_rate=("response", "mean"),
)
print(df_out)
Prints:
rank response_count count_of_the_rank response_rate
0 1 1 2 0.500000
1 2 2 3 0.666667
2 3 0 1 0.000000
EDIT: As a function:
def analysis(df, column):
return df.groupby("rank", as_index=False).agg(
**{
"{}_count".format(column): (column, "sum"),
"{}_count_of_the_rank".format(column): (column, "size"),
"{}_rate".format(column): (column, "mean"),
}
)
print(analysis(df, "response"))
Prints:
rank response_count response_count_of_the_rank response_rate
0 1 1 2 0.500000
1 2 2 3 0.666667
2 3 0 1 0.000000

Pandas - get_dummies with value from another column

I have a dataframe like below. The column Mfr Number is a categorical data type. I'd like to preform get_dummies or one hot encoding on it, but instead of filling in the new column with a 1 if it's from that row, I want it to fill in the value from the quantity column. All the other new 'dummies' should remain a 0 on that row. Is this possible?
Datetime Mfr Number quantity
0 2016-03-15 07:02:00 MWS0460MB 1
1 2016-03-15 07:03:00 TM-120-6X 3
2 2016-03-15 08:33:00 40.50699.0095 5
3 2016-03-15 08:42:00 40.50699.0100 1
4 2016-03-15 08:46:00 CXS-04T098-00-0703R-1025 10

Do it in two steps:
dummies = pd.get_dummies(df['Mfr Number'])
dummies.values[dummies != 0] = df['Quantity']

Check with str.get_dummies and mul
df.Number.str.get_dummies().mul(df.quantity,0)
40.50699.0095 40.50699.0100 ... MWS0460MB TM-120-6X
0 0 0 ... 1 0
1 0 0 ... 0 3
2 5 0 ... 0 0
3 0 1 ... 0 0
4 0 0 ... 0 0
[5 rows x 5 columns]

df = pd.get_dummies(df, columns = ['Mfr Number'])
for col in df.columns[2:]:
df[col] = df[col]*df['quantity']

Python Pandas: Counting the frequency of a specific value in each row of dataframe?

I have a dataframe df:
domain country out1 out2 out3
oranjeslag.nl NL 1 0 NaN
pietervaartjes.nl NL 1 1 0
andreaputting.com.au AU NaN 1 0
michaelcardillo.com US 0 0 NaN
I would like to define two columns sum_0 and sum_1 and count the number of 0s and 1s in columns (out1,out2,out3),per row. So expected results would be:
domain country out1 out2 out3 sum_0 sum_1
oranjeslag.nl NL 1 0 NaN 1 1
pietervaartjes.nl NL 1 1 0 1 2
andreaputting.com.au AU NaN 1 0 1 1
michaelcardillo.com US 0 0 NaN 2 0
I have this code for counting the number of 1s, but I do not know how to count the number of 0s.
df['sum_1'] = df[['out_1','out_2','out_3']].sum(axis=1)
Can anybody help?

You can call sum for each condition, the 1 condition is simple just a straight sum on axis=1, for the second you can compare the df against 0 value and then call sum as before:
In [102]:
df['sum_1'] = df[['out1','out2','out3']].sum(axis=1)
df['sum_0'] = (df[['out1','out2','out3']] == 0).sum(axis=1)
df
Out[102]:
domain country out1 out2 out3 sum_0 sum_1
0 oranjeslag.nl NL 1 0 NaN 1 1
1 pietervaartjes.nl NL 1 1 0 1 2
2 andreaputting.com.au AU NaN 1 0 1 1
3 michaelcardillo.com US 0 0 NaN 2 0

I would do :
df["sum_0"] = df.apply(lambda row: sum(row[0:3]==0) ,axis=1)

maybe pandas changed the bahaviour since 2015 but now the problem with sum is, that when you try to use this code for values > 1, it produces actual sum of these values, not their count (which is what I understood from question and also was looking for)
df['sum_0'] = df[df == 0].count(axis=1)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Groupby and Named aggregation | Optimize dataframe generation in Pandas - python

Related

Update dataframe column based on another dataframe column without for loop

Updating values from different dataframe on a certain id value [duplicate]

groupby a column and find the count of this column and other column and find the quotient of the two

Pandas - get_dummies with value from another column

Python Pandas: Counting the frequency of a specific value in each row of dataframe?

Categories

Resources