Update dataframe column based on another dataframe column without for loop - python

I have two dataframes df1 and df2.
df1:
id val
1 25
2 40
3 78
df2:
id val
2 8
1 5
Now I want to do something like df1['val'] = df1['val']/df2['val'] for matching id. I can do that by iterating over all df2 rows; df2 is a subset of df1, so it may be missing some ids, and the df1 values for those ids should stay unchanged. This is what I have right now:
for row in df2.iterrows():
    df1.loc[df1['id'] == row[1]['id'], 'val'] /= row[1]['val']
df1 after the loop:
id val
1 5
2 5
3 78
How can I achieve the same without using for loop to improve speed?

Use Series.map with Series.div:
df1['val'] = df1['val'].div(df1['id'].map(df2.set_index('id')['val']), fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0
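To see what the map step produces, here is a minimal sketch built from the sample data above (the df1/df2 construction and dtypes are assumptions):
# Sketch: reconstruct the sample frames and inspect the intermediate mapping.
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'val': [25, 40, 78]})
df2 = pd.DataFrame({'id': [2, 1], 'val': [8, 5]})

mapped = df1['id'].map(df2.set_index('id')['val'])  # df2's val aligned to df1's rows by id
print(mapped)
# 0    5.0
# 1    8.0
# 2    NaN   <- id 3 has no match; fill_value=1 in div() turns this into "divide by 1"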
Solution with merge with left join:
df1['val'] = df1['val'].div(df1.merge(df2, on='id', how='left')['val_y'], fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0
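The val_y column name comes from pandas suffixing the overlapping val columns during the merge; reusing the df1/df2 from the sketch above, the intermediate looks like this:
print(df1.merge(df2, on='id', how='left'))
#    id  val_x  val_y
# 0   1     25    5.0
# 1   2     40    8.0
# 2   3     78    NaN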

Related

How to combine numeric columns in pandas dataframe with NaN?

I have a dataframe with this format:
ID measurement_1 measurement_2
0 3 NaN
1 NaN 5
2 NaN 7
3 NaN NaN
I want to combine to:
ID measurement measurement_type
0 3 1
1 5 2
2 7 2
For each row there will be a value in either the measurement_1 or the measurement_2 column, but not in both; the other column will be NaN.
In some rows both columns will be NaN.
I want to add a column for the measurement type (depending on which column has the value) and take the actual value out of both columns, and remove the rows that have NaN in both columns.
Is there an easy way of doing this?
Thanks!
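For reference, the answers below can be tried against a reconstruction of the example like this (a sketch; the exact dtypes are an assumption):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2, 3],
                   'measurement_1': [3, np.nan, np.nan, np.nan],
                   'measurement_2': [np.nan, 5, 7, np.nan]})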
Use DataFrame.stack to reshape the dataframe, then reset_index, and use DataFrame.assign to build the measurement_type column by applying Series.str.split + Series.str[-1] to level_1:
df1 = (
    df.set_index('ID').stack().reset_index(name='measurement')
      .assign(measurement_type=lambda x: x.pop('level_1').str.split('_').str[-1])
)
Result:
print(df1)
ID measurement measurement_type
0 0 3.0 1
1 1 5.0 2
2 2 7.0 2
Maybe combine_first could help?
import numpy as np
df["measurement"] = df["measurement_1"].combine_first(df["measurement_2"])
df["measurement_type"] = np.where(df["measurement_1"].notnull(), 1, 2)
# drop rows where both measurements were NaN, then drop the original columns
df = df.dropna(subset=["measurement"]).drop(["measurement_1", "measurement_2"], axis=1)
ID measurement measurement_type
0 0 3 1
1 1 5 2
2 2 7 2
Set a threshold with dropna(thresh=2) to drop any row that has more than one NaN, then use df.assign to fillna() measurement_1 with measurement_2 and apply np.where on measurement_2 to get the type:
df = (df.dropna(thresh=2)
        .assign(measurement=lambda d: d.measurement_1.fillna(d.measurement_2),
                measurement_type=lambda d: np.where(d.measurement_2.isna(), 1, 2))
        .drop(columns=['measurement_1', 'measurement_2']))
ID measurement measurement_type
0 0 3 1
1 1 5 2
2 2 7 2
You could use pandas melt:
(
    df.melt("ID", var_name="measurement_type", value_name="measurement")
      .dropna()
      .assign(measurement_type=lambda x: x.measurement_type.str[-1])
      .iloc[:, [0, -1, 1]]
      .astype("int8")
)
or wide to long:
(
    pd.wide_to_long(df, stubnames="measurement", i="ID",
                    j="measurement_type", sep="_")
      .dropna()
      .reset_index()
      .astype("int8")
      .iloc[:, [0, -1, 1]]
)
ID measurement measurement_type
0 0 3 1
1 1 5 2
2 2 7 2

How to create pairs of column names based on a condition?

I have the following DataFrame df:
df =
min(arc) max(arc) min(gbm)_p1 max(gbm)_p1
1 10 2 5
0 11 1 6
How can I calculate the difference between pairs of max and min columns?
Expected result:
diff(arc) diff(gbm)_p1
9 3
11 5
I assume that apply(lambda x: ...) should be used to calculate the differences row-wise, but how can I create pairs of columns? In my case, I should only calculate the difference between columns that have the same name, e.g. ...(arc) or ...(gbm)_p1. Please notice that min and max prefixes always appear at the beginning of the column names.
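For reference, a sketch of how the example frame could be constructed to try the answers below (column order and dtypes are assumptions):
import pandas as pd

df = pd.DataFrame({'min(arc)': [1, 0], 'max(arc)': [10, 11],
                   'min(gbm)_p1': [2, 1], 'max(gbm)_p1': [5, 6]})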
The idea is to select each group of columns with DataFrame.filter using a regex where ^ matches the start of the string, then rename the columns so both frames share the same column names and can be subtracted:
df1 = df.filter(regex='^min').rename(columns= lambda x: x.replace('min','diff'))
df2 = df.filter(regex='^max').rename(columns= lambda x: x.replace('max','diff'))
df = df2.sub(df1)
print (df)
diff(arc) diff(gbm)_p1
0 9 3
1 11 5
EDIT:
print (df)
id min(arc) max(arc) min(gbm)_p1 max(gbm)_p1
0 123 1 10 2 5
1 546 0 11 1 6
df1 = df.filter(regex='^min').rename(columns= lambda x: x.replace('min','diff'))
df2 = df.filter(regex='^max').rename(columns= lambda x: x.replace('max','diff'))
df = df[['id']].join(df2.sub(df1))
print (df)
id diff(arc) diff(gbm)_p1
0 123 9 3
1 546 11 5

Merging dataframes with different dimensions and related data

I have 2 dataframes with different size with related data to be merged in an efficient way:
master_df = pd.DataFrame({'kpi_1': [1, 2, 3, 4]},
                         index=['dn1_app1_bar.com',
                                'dn1_app2_bar.com',
                                'dn2_app1_foo.com',
                                'dn2_app2_foo.com'])
guard_df = pd.DataFrame({'kpi_2': [1, 2],
                         'kpi_3': [10, 20]},
                        index=['dn1_bar.com', 'dn2_foo.com'])
master_df:
kpi_1
dn1_app1_bar.com 1
dn1_app2_bar.com 2
dn2_app1_foo.com 3
dn2_app2_foo.com 4
guard_df:
kpi_2 kpi_3
dn1_bar.com 1 10
dn2_foo.com 2 20
I want to get a dataframe where the values from the guard_df row indexed with <group>_<name> are "propagated" to all master_df rows matching <group>_.*_<name>.
Expected result:
kpi_1 kpi_2 kpi_3
dn1_app1_bar.com 1 1.0 10.0
dn1_app2_bar.com 2 1.0 10.0
dn2_app1_foo.com 3 2.0 20.0
dn2_app2_foo.com 4 2.0 20.0
What I've managed so far is the following basic approach:
def eval_base_dn(dn):
    chunks = dn.split('_')
    return '_'.join((chunks[0], chunks[2]))

for dn in master_df.index:
    for col in guard_df.columns:
        master_df.loc[dn, col] = guard_df.loc[eval_base_dn(dn), col]
but I'm looking for some more performant way to "broadcast" the values and merge the dataframes.
With pandas 0.25+ it is possible to pass an array, here the transformed index, to the left_on parameter of merge with a left join:
master_df = master_df.merge(guard_df,
                            left_on=master_df.index.str.replace('_.+_', '_'),
                            right_index=True,
                            how='left')
print (master_df)
kpi_1 kpi_2 kpi_3
dn1_app1_bar.com 1 1 10
dn1_app2_bar.com 2 1 10
dn2_app1_foo.com 3 2 20
dn2_app2_foo.com 4 2 20
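To see the join key that left_on receives, the index transformation can be inspected on its own; note that newer pandas versions no longer treat the str.replace pattern as a regex by default, so regex=True is spelled out here:
print(master_df.index.str.replace('_.+_', '_', regex=True))
# Index(['dn1_bar.com', 'dn1_bar.com', 'dn2_foo.com', 'dn2_foo.com'], dtype='object')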
Try this one:
>>> (pd.merge(master_df.assign(guard_df_id=master_df.index.str.split("_")
...                            .map(lambda x: "{0}_{1}".format(x[0], x[-1]))),
...           guard_df, left_on="guard_df_id", right_index=True)
...    .drop(["guard_df_id"], axis=1))
kpi_1 kpi_2 kpi_3
dn1_app1_bar.com 1 1 10
dn1_app2_bar.com 2 1 10
dn2_app1_foo.com 3 2 20
dn2_app2_foo.com 4 2 20

Selecting columns from two dataframes according to another column

I have 2 dataframes: one of them contains some general information about football players, and the second contains other information like winning matches for each player. They both have an "id" column; however, they are not the same length.
What I want to do is create a new dataframe which contains 2 columns: "x" from the first dataframe and "y" from the second dataframe, ONLY where the "id" column contains the same value in both dataframes. Thus, I can match the "x" and "y" values which belong to the same person.
I tried to do it using the concat function:
pd.concat([firstdataframe['x'], seconddataframe['y']], axis=1, keys=['x', 'y'])
But I couldn't figure out how to apply the condition that the "id" must be equal in both dataframes.
It seems you need merge with the default inner join; note that each value in the id columns has to be unique:
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
Sample:
df1 = pd.DataFrame({'id':[1,2,3],'x':[4,3,8]})
print (df1)
id x
0 1 4
1 2 3
2 3 8
df2 = pd.DataFrame({'id':[1,2],'y':[7,0]})
print (df2)
id y
0 1 7
1 2 0
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
print (df)
id x y
0 1 4 7
1 2 3 0
A solution with concat is possible, but a bit more complicated, because it needs to join on the indexes with an inner join:
df = (pd.concat([df1.set_index('id')['x'],
                 df2.set_index('id')['y']], axis=1, join='inner')
        .reset_index())
print (df)
id x y
0 1 4 7
1 2 3 0
EDIT:
If the ids are not unique, the duplicates create all combinations and the output dataframe is expanded:
df1 = pd.DataFrame({'id':[1,2,3],'x':[4,3,8]})
print (df1)
id x
0 1 4
1 2 3
2 3 8
df2 = pd.DataFrame({'id':[1,2,1,1],'y':[7,0,4,2]})
print (df2)
id y
0 1 7
1 2 0
2 1 4
3 1 2
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
print (df)
id x y
0 1 4 7
1 1 4 4
2 1 4 2
3 2 3 0
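If you prefer the merge to fail loudly instead of silently expanding when ids repeat, merge's validate parameter can be used; a sketch:
# Raises pandas.errors.MergeError when df2['id'] contains duplicates.
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id', validate='one_to_one')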

Pandas merge two dataframes with different columns

I'm surely missing something simple here. Trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.
>df_may
id quantity attr_1 attr_2
0 1 20 0 1
1 2 23 1 1
2 3 19 1 1
3 4 19 0 0
>df_jun
id quantity attr_1 attr_3
0 5 8 1 0
1 6 13 0 1
2 7 20 1 1
3 8 25 1 1
I've tried joining with an outer join:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")
But that yields:
Left data columns not unique: Index([....
I've also specified a single column to join on (on = "id", e.g.), but that duplicates all columns except id like attr_1_x, attr_1_y, which is not ideal. I've also passed the entire list of columns (there are many) to on:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))
Which yields:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, attr_3 populated where possible, NaN where they don't show up. This seems like a pretty typical workflow for data munging, but I'm stuck.
I think in this case concat is what you want:
In [12]:
pd.concat([df_may, df_jun], axis=0, ignore_index=True)
Out[12]:
attr_1 attr_2 attr_3 id quantity
0 0 1 NaN 1 20
1 1 1 NaN 2 23
2 1 1 NaN 3 19
3 0 0 NaN 4 19
4 1 NaN 0 5 8
5 0 NaN 1 6 13
6 1 NaN 1 7 20
7 1 NaN 1 8 25
By passing axis=0 here you are stacking the dfs on top of each other, which I believe is what you want, producing NaN values where columns are absent from their respective dfs.
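Older pandas versions sorted the combined columns alphabetically, as in the output above; if the original order matters, a sketch for setting it explicitly after the concat:
out = pd.concat([df_may, df_jun], axis=0, ignore_index=True)
out = out.reindex(columns=['id', 'quantity', 'attr_1', 'attr_2', 'attr_3'])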
The accepted answer will break if there are duplicate headers:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
For example, here A has 3x trial columns, which prevents concat:
A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
# id trial trial trial
# 0 3 1 4 1
B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
# id trial
# 0 5 9
# 1 2 6
pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
To fix this, deduplicate the column names before concat:
parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})
for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns)
pd.concat([A, B], ignore_index=True)
# id trial trial.1 trial.2
# 0 3 1 4 1
# 1 5 9 NaN NaN
# 2 2 6 NaN NaN
Or as a one-liner but less readable:
pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)
Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({})
I had this problem today; neither concat, append nor merge worked for me directly, so I got around it by adding a sequentially numbered helper column and then doing an outer join:
helper = 1
for i in df1.index:
    df1.loc[i, 'helper'] = helper
    helper = helper + 1
for i in df2.index:
    df2.loc[i, 'helper'] = helper
    helper = helper + 1
df1.merge(df2, on='helper', how='outer')
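The same helper-column idea can be written without the Python loops; a sketch assuming the df1/df2 names used above:
import numpy as np

df1['helper'] = np.arange(len(df1))
df2['helper'] = np.arange(len(df1), len(df1) + len(df2))  # non-overlapping keys, so the outer join simply stacks the rows
result = df1.merge(df2, on='helper', how='outer').drop(columns='helper')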
