I have two PySpark DataFrames df1 and df2. They have the same column names but may have different numbers of rows, and some (wpk, ipk) combinations may exist in only one of the DataFrames.
df1 =
wpk ipk num
1 2 23.4
1 3 45.5
2 1 0.0
df2 =
wpk ipk num
1 1 12.0
1 3 40.0
2 1 50.0
I want to obtain a new DataFrame df that is the result of an outer join of df1 and df2. df should have the same columns, but num should be the maximum of the values from df1 and df2.
The expected result is this one:
wpk ipk num
1 1 12.0
1 2 23.4
1 3 45.5
2 1 50.0
I'm unsure if this is suitable for your problem; however, this is how I would achieve the specified result in pandas (DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead):
import pandas as pd
df3 = pd.concat([df1, df2]).groupby(['wpk', 'ipk'], as_index=False)['num'].max()
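Since the question is actually about PySpark, here is a hedged sketch of a native Spark equivalent: union the two frames and keep the max num per key. It assumes both frames have exactly the columns wpk, ipk, num, and Spark 2.3+ for unionByName:
from pyspark.sql import functions as F

# Stack the rows of both frames, then keep the max num per (wpk, ipk) key;
# keys present in only one frame survive, since union just appends rows.
df = (df1.unionByName(df2)
         .groupBy('wpk', 'ipk')
         .agg(F.max('num').alias('num')))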
I have two dataframes df1 and df2.
df1:
id val
1 25
2 40
3 78
df2:
id val
2 8
1 5
Now I want to do something like df1['val'] = df1['val'] / df2['val'] for matching id. df2 is a subset of df1, so some ids may be missing from df2, and those values should stay unchanged. I can do that by iterating over the df2 rows. This is what I have right now:
for row in df2.iterrows():
    df1.loc[df1['id'] == row[1]['id'], 'val'] /= row[1]['val']
which gives the desired df1:
id val
1 5
2 5
3 78
How can I achieve the same without a for loop, to improve speed?
Use Series.map with Series.div:
df1['val'] = df1['val'].div(df1['id'].map(df2.set_index('id')['val']), fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0
Solution with merge with a left join (note this relies on df1 having a default RangeIndex, since div aligns the merged column back to df1 by that index):
df1['val'] = df1['val'].div(df1.merge(df2, on='id', how='left')['val_y'], fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0
I have a pandas dataframe that I am using to create 2 additional dataframes. After creating the two dataframes, I want to merge them back with the original dataframe while retaining the original's row count. Is there an easier way of doing it?
The original dataframe, the two derived dataframes, and the expected final output were shown as images (not reproduced here).
When I try to do it, I am either getting double the number of rows or half the number of rows.
After reformatting the third dataframe, you can merge them one by one:
>>> df3a = df3.rename(columns={'Column4': 'Column2'}).drop_duplicates('Column2')
>>> df1.merge(df2, on='Column2', how='outer') \
...     .merge(df3a, on='Column2', how='outer')
Column1 Column2 Column3 Column5 Column6
0 p eeee 3.0 7 7
1 q dddd 6.0 6 6
2 s bbbb 4.0 4 4
3 t aaaa 1.0 3 3
4 u ssss 4.0 2 3
5 v rrrr 2.0 1 1
6 NaN cccc NaN 5 5
I have 2 dataframes of different sizes with related data that I'd like to merge efficiently:
master_df = pd.DataFrame({'kpi_1': [1,2,3,4]},
index=['dn1_app1_bar.com',
'dn1_app2_bar.com',
'dn2_app1_foo.com',
'dn2_app2_foo.com'])
guard_df = pd.DataFrame({'kpi_2': [1,2],
'kpi_3': [10,20]},
index=['dn1_bar.com', 'dn2_foo.com'])
master_df:
kpi_1
dn1_app1_bar.com 1
dn1_app2_bar.com 2
dn2_app1_foo.com 3
dn2_app2_foo.com 4
guard_df:
kpi_2 kpi_3
dn1_bar.com 1 10
dn2_foo.com 2 20
I want to get a dataframe where the values from the guard_df row indexed with <group>_<name> are "propagated" to all master_df rows matching <group>_.*_<name>.
Expected result:
kpi_1 kpi_2 kpi_3
dn1_app1_bar.com 1 1.0 10.0
dn1_app2_bar.com 2 1.0 10.0
dn2_app1_foo.com 3 2.0 20.0
dn2_app2_foo.com 4 2.0 20.0
What I've managed so far is the following basic approach:
def eval_base_dn(dn):
    chunks = dn.split('_')
    return '_'.join((chunks[0], chunks[2]))

for dn in master_df.index:
    for col in guard_df.columns:
        master_df.loc[dn, col] = guard_df.loc[eval_base_dn(dn), col]
but I'm looking for some more performant way to "broadcast" the values and merge the dataframes.
With pandas 0.25+ you can pass an array (here, the transformed index) to the left_on parameter of merge with a left join (regex=True is spelled out, since newer pandas no longer defaults to regex replacement):
master_df = master_df.merge(guard_df,
                            left_on=master_df.index.str.replace('_.+_', '_', regex=True),
                            right_index=True,
                            how='left')
print (master_df)
kpi_1 kpi_2 kpi_3
dn1_app1_bar.com 1 1 10
dn1_app2_bar.com 2 1 10
dn2_app1_foo.com 3 2 20
dn2_app2_foo.com 4 2 20
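To see the join key that str.replace builds, here it is run against the original index (before the merge):
>>> master_df.index.str.replace('_.+_', '_', regex=True)
Index(['dn1_bar.com', 'dn1_bar.com', 'dn2_foo.com', 'dn2_foo.com'], dtype='object')
Each key matches a guard_df index entry, so every master_df row picks up its group's kpi_2 and kpi_3.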
Try this one:
>>> (pd.merge(master_df.assign(guard_df_id=master_df.index.str.split('_')
...                                                .map(lambda x: '{0}_{1}'.format(x[0], x[-1]))),
...           guard_df, left_on='guard_df_id', right_index=True)
...    .drop(['guard_df_id'], axis=1))
kpi_1 kpi_2 kpi_3
dn1_app1_bar.com 1 1 10
dn1_app2_bar.com 2 1 10
dn2_app1_foo.com 3 2 20
dn2_app2_foo.com 4 2 20
I have two DataFrames and I would like to take the median of one column grouped by a set of two other columns from dataframe A and then merge the calculated median into dataframe B. Let me explain it using the example below:
I have two DataFrames which look like
# DataFrame 1
pu_c do_c fare
0 0 5 10
1 0 5 20
2 1 1 3
# DataFrame 2
pu_c do_c
0 0 3
1 0 5
2 1 1
I would like to take the median of fare grouped by pu_c and do_c using:
a = df1.groupby(['pu_c', 'do_c']).median()['fare']
which will result in:
pu_c  do_c
0     5       15.0
1     1        3.0
Name: fare, dtype: float64
Now I want to merge the median fare calculated in a from df1 into another dataframe such as df2. I know how to do it with for loops and messy code; I am wondering if there is an efficient way to do it using pandas' merge or concat functions.
My desired output in this example is
pu_c do_c median_fare
0 0 3 NaN (or whatever)
1 0 5 15
2 1 1 3
Note: to reproduce my dataframes use:
import pandas as pd
pu_c = [0, 0, 1]
do_c = [5, 5, 1]
do_c2 = [3, 5, 1]
fare = [10, 20, 3]
df1 = pd.DataFrame({'pu_c': pu_c, 'do_c': do_c, 'fare': fare})
df2 = pd.DataFrame({'pu_c': pu_c, 'do_c': do_c2})
Turn a into a dataframe and name the values median_fare using a.to_frame('median_fare'), reset the index, then do an outer merge with df2. It will automatically merge on the two columns in common (do_c and pu_c):
df2.merge(a.to_frame('median_fare').reset_index(), how='outer')
do_c pu_c median_fare
0 3 0 NaN
1 5 0 15.0
2 1 1 3.0
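Alternatively, pandas (0.24+) can merge a named Series straight against its index, which skips the to_frame/reset_index step; a sketch with the same a:
df2.merge(a.rename('median_fare'), left_on=['pu_c', 'do_c'], right_index=True, how='left')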
I'm surely missing something simple here. Trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.
>df_may
id quantity attr_1 attr_2
0 1 20 0 1
1 2 23 1 1
2 3 19 1 1
3 4 19 0 0
>df_jun
id quantity attr_1 attr_3
0 5 8 1 0
1 6 13 0 1
2 7 20 1 1
3 8 25 1 1
I've tried joining with an outer join:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")
But that yields:
Left data columns not unique: Index([....
I've also specified a single column to join on (on="id", e.g.), but that suffixes every shared column except id, like attr_1_x / attr_1_y, which is not ideal. I've also passed the entire list of columns (there are many) to on:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))
Which yields:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, attr_3 populated where possible, NaN where they don't show up. This seems like a pretty typical workflow for data munging, but I'm stuck.
I think in this case concat is what you want:
In [12]:
pd.concat([df_may, df_jun], axis=0, ignore_index=True)
Out[12]:
attr_1 attr_2 attr_3 id quantity
0 0 1 NaN 1 20
1 1 1 NaN 2 23
2 1 1 NaN 3 19
3 0 0 NaN 4 19
4 1 NaN 0 5 8
5 0 NaN 1 6 13
6 1 NaN 1 7 20
7 1 NaN 1 8 25
By passing axis=0 here you are stacking the dfs on top of each other, which I believe is what you want, producing NaN values where a column is absent from its respective df.
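For what it's worth, on recent pandas versions the plain outer merge from the question also runs without the "columns not unique" error: with no on specified it joins on all shared columns, giving the same stacked result (column order aside):
df_may.merge(df_jun, how='outer')  # joins on the shared id, quantity and attr_1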
The accepted answer will break if there are duplicate headers:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
For example, here A has three trial columns, which prevents concat:
A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
# id trial trial trial
# 0 3 1 4 1
B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
# id trial
# 0 5 9
# 1 2 6
pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
To fix this, deduplicate the column names before concat:
parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})
for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns)
pd.concat([A, B], ignore_index=True)
# id trial trial.1 trial.2
# 0 3 1 4 1
# 1 5 9 NaN NaN
# 2 2 6 NaN NaN
Or as a one-liner but less readable:
pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)
Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({})
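Both snippets above lean on pandas internals, which can move again in future releases. If you'd rather avoid that, a small hand-rolled dedup does the same job; a minimal sketch (dedup_columns is my own helper, not a pandas API), assuming the same A and B as above:
from collections import Counter
import pandas as pd

def dedup_columns(cols):
    # Append .1, .2, ... to repeated names, mimicking pandas' mangling.
    seen = Counter()
    out = []
    for c in cols:
        out.append('{}.{}'.format(c, seen[c]) if seen[c] else c)
        seen[c] += 1
    return out

for df in [A, B]:
    df.columns = dedup_columns(df.columns)

pd.concat([A, B], ignore_index=True)
# id trial trial.1 trial.2 -- same result as above
This naive version can itself collide if a column literally named trial.1 already exists, but it covers the common case.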
I had this problem today; concat, append, and merge all failed for me, so I got around it by adding a sequentially numbered helper column and then doing an outer join:
helper = 1
for i in df1.index:
    df1.loc[i, 'helper'] = helper
    helper = helper + 1
for i in df2.index:
    df2.loc[i, 'helper'] = helper
    helper = helper + 1
df1.merge(df2, on='helper', how='outer')
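The row-by-row loop gets slow on large frames; the same helper column can be built in one shot (a sketch of the same idea, the numbers just need to be unique across both frames):
df1['helper'] = range(1, len(df1) + 1)
df2['helper'] = range(len(df1) + 1, len(df1) + len(df2) + 1)
df1.merge(df2, on='helper', how='outer')
Since no helper value appears in both frames, the outer join simply stacks the rows, which is effectively what pd.concat does.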