Updating values from a different dataframe on a certain id value [duplicate] - python

Note: for simplicity's sake, I'm using a toy example, because copy/pasting dataframes is difficult in Stack Overflow (please let me know if there's an easy way to do this).
Is there a way to merge the values from one dataframe onto another without getting the _x, _y suffix columns? I'd like the values in one column to replace all zero values of another column.
df1:
Name  Nonprofit  Business  Education
X     1          1         0
Y     0          1         0   <- Y and Z have zero values for Nonprofit and Education
Z     0          0         0
Y     0          1         0
df2:
Name  Nonprofit  Education
Y     1          1   <- this df has the correct values
Z     1          1
pd.merge(df1, df2, on='Name', how='outer')
Name  Nonprofit_x  Business  Education_x  Nonprofit_y  Education_y
Y     0            1         0            1            1
Y     0            1         0            1            1
X     1            1         0            NaN          NaN
Z     0            0         0            1            1
In a previous post, I tried combine_first and dropna(), but these don't do the job.
I want to replace zeros in df1 with the values in df2.
Furthermore, I want all rows with the same Names to be changed according to df2.
Name  Nonprofit  Business  Education
Y     1          1         1
Y     1          1         1
X     1          1         0
Z     1          0         1
(To clarify: the value in the 'Business' column where Name = Z should stay 0.)
My existing solution does the following:
I subset based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.
pubunis_df = df2
sdf = df1
# str_to_regex and searchnamesre are my own helper functions (not shown here)
regex = str_to_regex(', '.join(pubunis_df.ORGS))
pubunis = searchnamesre(sdf, 'ORGS', regex)
sdf.loc[pubunis.index, ['Education', 'Public']] = 1  # .loc instead of the deprecated .ix
searchnamesre(sdf, 'ORGS', regex)  # re-run the search to verify the assignment
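For context, here is a minimal self-contained version of that subset-and-replace idea using plain isin matching instead of my custom regex helpers; the column names follow the toy frames above rather than my real 'ORGS'/'Public' data:
import pandas as pd

df1 = pd.DataFrame([["X",1,1,0], ["Y",0,1,0], ["Z",0,0,0], ["Y",0,1,0]],
                   columns=["Name","Nonprofit","Business","Education"])
df2 = pd.DataFrame([["Y",1,1], ["Z",1,1]], columns=["Name","Nonprofit","Education"])

# flag rows of df1 whose Name appears in df2, then hard-code the corrected value
mask = df1['Name'].isin(df2['Name'])
df1.loc[mask, ['Nonprofit', 'Education']] = 1
It only works here because the correct values all happen to be 1; it doesn't actually carry values over from df2, which is what I'm hoping a cleaner approach can do.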

Attention: in recent versions of pandas, neither of the two answers below works anymore.
KSD's answer raises an error, because df1 has three rows whose names match df2 (Y appears twice) while df2 supplies only two rows of values:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,0,0]],columns=["Name","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1]],columns=["Name","Nonprofit", "Education"])
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name),['Nonprofit', 'Education']].values
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values
Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)
and EdChum's answer will give us the wrong result:
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]
df1
Out[852]:
Name Nonprofit Business Education
0 X 1.0 1 0.0
1 Y 1.0 1 1.0
2 Z NaN 0 NaN
3 Y NaN 1 NaN
Well, it will work safely only if the values in the 'Name' column are unique and sorted identically in both data frames.
Here is my answer:
Way 1:
df1 = df1.merge(df2, on='Name', how='left')
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Education_y'] = df1['Education_y'].fillna(df1['Education_x'])
df1.drop(['Nonprofit_x', 'Education_x'], inplace=True, axis=1)
df1.rename(columns={'Nonprofit_y': 'Nonprofit', 'Education_y': 'Education'}, inplace=True)
(Only the shared columns get the _x/_y suffixes; 'Business' exists only in df1, so there is no 'Business_x' or 'Business_y'.)
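With the corrected suffix names, and rebuilding df1/df2 fresh as defined above, this reproduces the desired frame for the toy data (Nonprofit and Education come back as floats because fillna ran on columns containing NaN; exact display may vary by pandas version):
  Name  Business  Nonprofit  Education
0    X         1        1.0        0.0
1    Y         1        1.0        1.0
2    Z         0        1.0        1.0
3    Y         0        1.0        1.0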
Way 2:
df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)  # aligns on the index and overwrites df1 in place
df1.reset_index(inplace=True)
More notes about update: the columns used to set the index don't need to have the same name in both data frames; 'Name1' and 'Name2' work just as well. It also tolerates extra rows in df2 whose names don't exist in df1; they simply update nothing. In other words, df2 doesn't have to be restricted to the names that appear in df1.
Example:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1],
['U',1,3]],columns=["Name2","Nonprofit", "Education"])
df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')
df1.update(df2)
result:
Nonprofit Business Education
Name1
X 1.0 1 0.0
Y 1.0 1 1.0
Z 1.0 0 1.0
Y 1.0 1 1.0
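If you want to overwrite only the zero cells and leave every other value in df1 untouched (the question's literal ask), here is a duplicate-safe sketch using map and mask; it assumes the flat frames with a 'Name' column, i.e. before any set_index:
s = df2.set_index('Name')
for col in ['Nonprofit', 'Education']:
    fix = df1['Name'].map(s[col])  # per-row lookup; duplicate names in df1 are fine
    # replace a cell only where it is 0 and df2 actually has a value for that name
    df1[col] = df1[col].mask(df1[col].eq(0) & fix.notna(), fix)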

Use the boolean mask from isin to filter the df and assign the desired row values from the rhs df:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]

This is the correct one.
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
The above will work only when every row of df1 has a matching row in df; in other words, df's names should be a superset of df1's. If df1 contains rows whose names don't appear in df (df is not a superset of df1), filter the right-hand side as well:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = \
    df1.loc[df1.Name.isin(df.Name), ['Nonprofit', 'Education']].values
Either way, the .values assignment lines up rows purely by position, so both selections must yield the same number of rows in the same Name order; duplicated names on one side only will raise the shape-mismatch ValueError shown earlier.

combine_first aligns both frames on the Name index; df2 is the caller, so its values win wherever both frames have an entry, and df1 fills in everything df2 lacks (including the whole Business column):
df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()
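One wrinkle worth knowing: because combine_first aligns the two frames, the columns may come back alphabetically sorted (Business, Education, Nonprofit), so you may want to restore the original order afterwards; a sketch:
out = df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()
out = out[['Name', 'Nonprofit', 'Business', 'Education']]  # restore the original column order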

Related

Merging two dataframes with an added condition

I have two dataframes that I need to merge as follows:
Df1
Name Type Speed
a x 1
a y 0
a z 1
Df2
Type Fast Slow
x 2 3
y 3 5
z 4 6
Df3 - DESIRED OUTCOME
Name Type Speed Time
a x 1 2
a y 0 5
a z 1 4
So basically I need to create a new 'Time' column that takes its values from either the 'Fast' or the 'Slow' column, based on the 'Speed' column and the object's 'Type'. I have literally no idea how to do this, so any help would be much appreciated! Thanks in advance. Apologies for the confusing explanation.
Use merge + np.where for a more succinct solution; pop hands the 'Fast' and 'Slow' columns to np.where and removes them from the merged frame in the same step:
import numpy as np

v = df1.merge(df2, on=['Type'])
v['Time'] = np.where(v['Speed'], v.pop('Fast'), v.pop('Slow'))
Name Type Speed Time
0 a x 1 2
1 a y 0 5
2 a z 1 4
Use melt to reshape first, then map to encode Speed correctly, and finally merge with a left join:
df = df2.melt('Type', var_name='Speed', value_name='Time')
df['Speed'] = df['Speed'].map({'Fast':1, 'Slow':0})
print (df)
Type Speed Time
0 x 1 2
1 y 1 3
2 z 1 4
3 x 0 3
4 y 0 5
5 z 0 6
df3 = df1.merge(df, how='left', on=['Type','Speed'])
print (df3)
Name Type Speed Time
0 a x 1 2
1 a y 0 5
2 a z 1 4
If performance is important, merge is not necessary: map with Series created by set_index, combined with numpy.where. df1['Speed'] contains only 0 and 1, so it is processed like False and True:
s1 = df2.set_index('Type')['Fast']
s2 = df2.set_index('Type')['Slow']
df1['Time'] = np.where(df1['Speed'], df1['Type'].map(s1), df1['Type'].map(s2))
print (df1)
Name Type Speed Time
0 a x 1 2
1 a y 0 5
2 a z 1 4

Creating a pandas column conditional to another columns values

I'm trying to create a class column in a pandas dataframe, conditional on another column's values. The value should be 1 if the other column's value at row i+1 is greater than its value at row i, and 0 otherwise.
For example:
column1 column2
5 1
6 0
3 0
2 1
4 0
How do I create column2 by iterating through column1?
You can use the diff method on the first column with a period of -1, then check whether the result is less than zero to create the second column.
import pandas as pd
df = pd.DataFrame({'c1': [5,6,3,2,4]})
df['c2'] = (df.c1.diff(-1) < 0).astype(int)
df
# returns:
c1 c2
0 5 1
1 6 0
2 3 0
3 2 1
4 4 0
You can also use shift. Performance is almost the same as diff, but diff seems to be a little faster.
df = pd.DataFrame({'column1': [5,6,3,2,4]})
df['column2'] = (df['column1'] < df['column1'].shift(-1)).astype(int)
print(df)
column1 column2
0 5 1
1 6 0
2 3 0
3 2 1
4 4 0
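If you want to check the diff-versus-shift performance claim on your own machine, here is a quick sketch; exact numbers will vary with pandas version and data size:
import numpy as np
import pandas as pd
from timeit import timeit

df = pd.DataFrame({'c1': np.random.randint(0, 100, 100000)})
t_diff = timeit(lambda: (df.c1.diff(-1) < 0).astype(int), number=100)
t_shift = timeit(lambda: (df.c1 < df.c1.shift(-1)).astype(int), number=100)
print('diff: %.3fs  shift: %.3fs' % (t_diff, t_shift))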

Selecting columns from two dataframes according to another column

I have two dataframes: one contains some general information about football players, and the second contains other information, like matches won, for each player. They both have an "id" column; however, they are not the same length.
What I want to do is create a new dataframe with two columns: "x" from the first dataframe and "y" from the second, ONLY where the "id" column holds the same value in both dataframes. That way I can match up the "x" and "y" values that belong to the same person.
I tried to do it using the concat function:
pd.concat([firstdataframe['x'], seconddataframe['y']], axis=1, keys=['x', 'y'])
But I couldn't figure out how to apply the condition that "id" be equal in both dataframes.
It seems you need merge with the default inner join; note that each value in the id columns has to be unique:
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
Sample:
df1 = pd.DataFrame({'id':[1,2,3],'x':[4,3,8]})
print (df1)
id x
0 1 4
1 2 3
2 3 8
df2 = pd.DataFrame({'id':[1,2],'y':[7,0]})
print (df2)
id y
0 1 7
1 2 0
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
print (df)
id x y
0 1 4 7
1 2 3 0
A solution with concat is possible, but a bit complicated, because it needs to join on the indexes with an inner join:
df = pd.concat([df1.set_index('id')['x'],
                df2.set_index('id')['y']], axis=1, join='inner').reset_index()
print (df)
id x y
0 1 4 7
1 2 3 0
EDIT:
If the ids are not unique, duplicates create all combinations and the output dataframe is expanded:
df1 = pd.DataFrame({'id':[1,2,3],'x':[4,3,8]})
print (df1)
id x
0 1 4
1 2 3
2 3 8
df2 = pd.DataFrame({'id':[1,2,1,1],'y':[7,0,4,2]})
print (df2)
id y
0 1 7
1 2 0
2 1 4
3 1 2
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
print (df)
id x y
0 1 4 7
1 1 4 4
2 1 4 2
3 2 3 0
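If the expansion is unwanted rather than expected, merge can be told to fail loudly: the validate argument (available since pandas 0.21) raises a MergeError when the keys are not unique:
# raises pandas.errors.MergeError, because id 1 repeats in df2
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id', validate='one_to_one')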

Drop Rows by Multiple Column Criteria in DataFrame

I have a pandas dataframe from which I'm trying to drop rows based on a criterion across selected columns. If the values in these selected columns are all zero, the rows should be dropped. Here is an example.
import pandas as pd
t = pd.DataFrame({'a':[1,0,0,2],'b':[1,2,0,0],'c':[1,2,3,4]})
a b c
0 1 1 1
1 0 2 2
2 0 0 3
3 2 0 4
I would like to try something like:
cols_of_interest = ['a','b'] #Drop rows if zero in all these columns
t = t[t[cols_of_interest]!=0]
This doesn't drop the rows, so I tried:
t = t.drop(t[t[cols_of_interest]==0].index)
And all rows are dropped.
What I would like to end up with is:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
Here the 3rd row (index 2) was dropped because it took the value 0 in BOTH of the columns of interest, not just one.
Your problem here is that you assigned the result of your boolean condition: t = t[t[cols_of_interest]!=0] overwrites your original df, setting every cell where the condition is not met to NaN. What you want to do is generate the boolean mask, then drop the NaN rows with thresh=1 (so a row survives if it has at least one non-NaN value), and finally use loc with the resulting index to get the desired df:
In [124]:
cols_of_interest = ['a','b']
t.loc[t[t[cols_of_interest]!=0].dropna(thresh=1).index]
Out[124]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
EDIT
As pointed out by @DSM, you can achieve this simply by using any with axis=1 to test the condition, and use the result to index into your df:
In [125]:
t[(t[cols_of_interest] != 0).any(axis=1)]
Out[125]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
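The same condition can also be written the way the requirement is worded, i.e. drop rows where ALL the columns of interest are 0; this is equivalent by De Morgan's law:
t[~(t[cols_of_interest] == 0).all(axis=1)]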

Filling cells with conditional column means

Consider the following DataFrame:
df2 = pd.DataFrame({
    'VAR_1': [1,1,1,3,3],
    'GROUP': [1,1,1,2,2],
})
My goal is to create a separate column "GROUP_MEAN" which holds the arithmetic mean of column "VAR_1", computed per the row's value in "GROUP".
GROUP VAR_1 GROUP_MEAN
0 1 1 Mean Value GROUP = 1
1 1 1 Mean Value GROUP = 1
2 1 1 Mean Value GROUP = 1
3 2 3 Mean Value GROUP = 2
4 2 3 Mean Value GROUP = 2
I can easily access the overall mean:
df2['GROUP_MEAN'] = df2['VAR_1'].mean()
How do I go about making this conditional on another column's value?
I think this is a perfect use case for transform:
>>> df2 = pd.DataFrame({'VAR_1' : [1,2,3,4,5], 'GROUP': [1,1,1,2,2]})
>>> df2["GROUP_MEAN"] = df2.groupby('GROUP')['VAR_1'].transform('mean')
>>> df2
GROUP VAR_1 GROUP_MEAN
0 1 1 2.0
1 1 2 2.0
2 1 3 2.0
3 2 4 4.5
4 2 5 4.5
[5 rows x 3 columns]
Typically you use transform when you want to broadcast the result across all entries of the group.
Assuming that the actual data frame has columns in addition to VAR_1:
ts = df2.groupby('GROUP')['VAR_1'].aggregate('mean')  # 'mean' avoids the numpy import that np.mean needs
df2['GROUP_MEAN'] = ts[df2.GROUP].values
Alternatively, the last line could also be:
df2 = df2.join(ts, on='GROUP', rsuffix='_MEAN')
(Note that the joined column is then named 'VAR_1_MEAN', because rsuffix applies to the overlapping column name.)
