I have a large Python script that builds two dataframes, A and B. At the end, I want to fill in dataframe A with the values from dataframe B while keeping the columns of dataframe A, but it is not working.
Dataframe A is like this
A B C D
1 ab
2 bc
3 cd
Dataframe B:
A BB CC
1 C 10
2 C 11
3 D 12
My output must be:
new dataframe
A B C D
1 ab 10
2 bc 11
3 cd 12
But my output is
A B C D
1 ab
2 bc
3 cd
Why is it not filling in the values of dataframe B?
My command is
dfnew = dfB.pivot_table(index='A', columns='BB', values='CC').reindex(index=dfA.index, columns=dfA.columns).fillna(dfA)
I think you need to set_index by the A column so the data aligns, then fillna or combine_first, and last reset_index:
dfA = pd.DataFrame({'A':[1,2,3], 'B':['ab','bc','cd'], 'C':[np.nan] * 3,'D':[np.nan] * 3})
print (dfA)
A B C D
0 1 ab NaN NaN
1 2 bc NaN NaN
2 3 cd NaN NaN
dfB = pd.DataFrame({'A':[1,2,3], 'BB':['C','C','D'], 'CC':[10,11,12]})
print (dfB)
A BB CC
0 1 C 10
1 2 C 11
2 3 D 12
df = dfB.pivot_table(index='A', columns='BB', values='CC')
print (df)
BB C D
A
1 10.0 NaN
2 11.0 NaN
3 NaN 12.0
dfA = dfA.set_index('A').fillna(df).reset_index()
#dfA = dfA.set_index('A').combine_first(df).reset_index()
print (dfA)
A B C D
0 1 ab 10.0 NaN
1 2 bc 11.0 NaN
2 3 cd NaN 12.0
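DataFrame.update is another way to do the same alignment step; it writes the non-NaN values of the pivoted frame into dfA in place (note that, unlike fillna, it mutates the frame). A minimal sketch with the same data:

```python
import numpy as np
import pandas as pd

dfA = pd.DataFrame({'A': [1, 2, 3], 'B': ['ab', 'bc', 'cd'],
                    'C': [np.nan] * 3, 'D': [np.nan] * 3})
dfB = pd.DataFrame({'A': [1, 2, 3], 'BB': ['C', 'C', 'D'], 'CC': [10, 11, 12]})

# pivot dfB so its values line up with dfA's C and D columns
df = dfB.pivot_table(index='A', columns='BB', values='CC')

# update fills dfA in place wherever df has non-NaN values
dfA = dfA.set_index('A')
dfA.update(df)
dfA = dfA.reset_index()
print(dfA)
```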
I have a Pandas dataframe with the following structure:
A B C
a b 1
a b 2
a b 3
c d 7
c d 8
c d 5
c d 6
c d 3
e b 4
e b 3
e b 2
e b 1
And I will like to transform it into this:
A B C1 C2 C3 C4 C5
a b 1 2 3 NAN NAN
c d 7 8 5 6 3
e b 4 3 2 1 NAN
In other words, something like groupby A and B and expand C into different columns.
Knowing that the length of each group is different.
C is already ordered
Shorter groups can have NAN or NULL values (empty), it does not matter.
Use GroupBy.cumcount and Series.add with 1 to start numbering the new columns from 1 onwards, then pass the result to DataFrame.pivot and use DataFrame.add_prefix to rename the columns (C1, C2, C3, etc...). Finally, use DataFrame.rename_axis to remove the columns' original axis name ('g') and DataFrame.reset_index to turn the MultiIndex back into the columns A, B:
df['g'] = df.groupby(['A','B']).cumcount().add(1)
df = df.pivot(['A','B'], 'g', 'C').add_prefix('C').rename_axis(columns=None).reset_index()
print (df)
A B C1 C2 C3 C4 C5
0 a b 1.0 2.0 3.0 NaN NaN
1 c d 7.0 8.0 5.0 6.0 3.0
2 e b 4.0 3.0 2.0 1.0 NaN
Because NaN is float by default, if you need the columns to have an integer dtype, add DataFrame.astype with Int64:
df['g'] = df.groupby(['A','B']).cumcount().add(1)
df = (df.pivot(['A','B'], 'g', 'C')
.add_prefix('C')
.astype('Int64')
.rename_axis(columns=None)
.reset_index())
print (df)
A B C1 C2 C3 C4 C5
0 a b 1 2 3 <NA> <NA>
1 c d 7 8 5 6 3
2 e b 4 3 2 1 <NA>
EDIT: If there is a maximum of N new columns to add, the A,B pairs can repeat across output rows. In that case, it is necessary to add helper groups g1, g2 with integer and modulo division, adding a new level to the index:
N = 4
g = df.groupby(['A','B']).cumcount()
df['g1'], df['g2'] = g // N, (g % N) + 1
df = (df.pivot(['A','B','g1'], 'g2', 'C')
.add_prefix('C')
.droplevel(-1)
.rename_axis(columns=None)
.reset_index())
print (df)
A B C1 C2 C3 C4
0 a b 1.0 2.0 3.0 NaN
1 c d 7.0 8.0 5.0 6.0
2 c d 3.0 NaN NaN NaN
3 e b 4.0 3.0 2.0 1.0
(df1.astype({'C': str})
    .groupby(['A', 'B'])
    .agg(','.join).C
    .str.split(',', expand=True)
    .add_prefix('C')
    .reset_index())
A B C0 C1 C2 C3 C4
0 a b 1 2 3 None None
1 c d 7 8 5 6 3
2 e b 4 3 2 1 None
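Note the join/split route leaves the expanded columns as strings (with None for the gaps) and numbers them from C0. If numeric columns starting at C1 are wanted, a possible cleanup, assuming the same df1 as above:

```python
import pandas as pd

# same shape of data as in the question (unequal group lengths, C ordered)
df1 = pd.DataFrame({
    'A': ['a'] * 3 + ['c'] * 5 + ['e'] * 4,
    'B': ['b'] * 3 + ['d'] * 5 + ['b'] * 4,
    'C': [1, 2, 3, 7, 8, 5, 6, 3, 4, 3, 2, 1],
})

out = (df1.astype({'C': str})
          .groupby(['A', 'B'])
          .agg(','.join).C
          .str.split(',', expand=True)
          .add_prefix('C')
          .reset_index())

# shift the labels so columns start at C1 (5 value columns in this data),
# then convert the strings back to nullable integers
out = out.rename(columns={f'C{i}': f'C{i + 1}' for i in range(5)})
value_cols = [c for c in out.columns if c.startswith('C')]
out[value_cols] = out[value_cols].apply(pd.to_numeric).astype('Int64')
print(out)
```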
The accepted solution but avoiding the deprecation warning:
N = 3
g = df_grouped.groupby(['A','B']).cumcount()
df_grouped['g1'], df_grouped['g2'] = g // N, (g % N) + 1
df_grouped = (df_grouped.pivot(index=['A','B','g1'], columns='g2', values='C')
.add_prefix('C_')
.astype('Int64')
.droplevel(-1)
.rename_axis(columns=None)
.reset_index())
I have 2 dataframes, which I want to combine as follows:
df1 = pd.DataFrame({"a": [1,2], "b":['A','B'], "c":[3,2]})
df2 = pd.DataFrame({"a": [1,1,1, 2,2,2, 3, 4], "b":['A','A','A','B','B', 'B','C','D'], "c":[3, None,None,2,None,None,None,None]})
Output:
a b c
1 A 3.0
1 A NaN
1 A NaN
2 B 2.0
2 B NaN
2 B NaN
I had an earlier version of this question that only involved df2 and was solved with
df.groupby(['a','b']).filter(lambda g: any(~g['c'].isna()))
but now I need to run it only for rows that appear in df1 (df2 contains the rows from df1 plus some extra rows, which I do not want included).
Thanks!
You can turn on the indicator in merge:
out = df2.merge(df1,indicator=True,how='outer',on=['a','b'])
Out[91]:
a b c_x c_y _merge
0 1 A 3.0 3.0 both
1 1 A NaN 3.0 both
2 1 A NaN 3.0 both
3 2 B 2.0 2.0 both
4 2 B NaN 2.0 both
5 2 B NaN 2.0 both
6 3 C NaN NaN left_only
7 4 D NaN NaN left_only
out = out[out['_merge']=='both']
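To get back to the three-column shape asked for, the merge helpers can be dropped afterwards; a self-contained sketch of the full pipeline:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": ['A', 'B'], "c": [3, 2]})
df2 = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 4],
                    "b": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
                    "c": [3, None, None, 2, None, None, None, None]})

out = df2.merge(df1, indicator=True, how='outer', on=['a', 'b'])
out = out[out['_merge'] == 'both']

# keep df2's original c column (suffixed c_x) and drop the merge helpers
out = (out.drop(columns=['c_y', '_merge'])
          .rename(columns={'c_x': 'c'})
          .reset_index(drop=True))
print(out)
```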
IIUC, you could merge:
out = df2.merge(df1[['a','b']])
or you could use chained isin:
out1 = df2[df2['a'].isin(df1['a']) & df2['b'].isin(df1['b'])]
Output:
a b c
0 1 A 3.0
1 1 A NaN
2 1 A NaN
3 2 B 2.0
4 2 B NaN
5 2 B NaN
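One caveat with the chained isin version: it checks each column independently, so a row like (1, 'B') would pass whenever 1 appears anywhere in df1['a'] and 'B' anywhere in df1['b'], even if the pair (1, 'B') never occurs in df1. A pairwise check via MultiIndex.isin avoids that; a sketch with the same frames:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": ['A', 'B'], "c": [3, 2]})
df2 = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 4],
                    "b": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
                    "c": [3, None, None, 2, None, None, None, None]})

# build (a, b) tuples and test membership as pairs, not per column
pairs = pd.MultiIndex.from_frame(df2[['a', 'b']])
keep = pairs.isin(pd.MultiIndex.from_frame(df1[['a', 'b']]))
out = df2[keep]
print(out)
```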
I have three dataframes df1, df2, and df3, which are defined as follows
df1 =
A B C
0 1 a a1
1 2 b b2
2 3 c c3
3 4 d d4
4 5 e e5
5 6 f f6
df2 =
A B C
0 1 a X
1 2 b Y
2 3 c Z
df3 =
A B C
3 4 d P
4 5 e Q
5 6 f R
I have defined a Primary Key list PK = ["A","B"].
Now, I take a fourth dataframe df4 as df4 = df1.sample(n=2), which gives something like
df4 =
A B C
4 5 e e5
1 2 b b2
Now, I want to select the rows from df2 and df3 that match the primary key values of df4.
For example, in this case I need to get the row with
index = 4 from df3, and
index = 1 from df2.
If possible I need to get a dataframe as follows:
df =
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
4 5 e e5 5 e Q
1 2 b b2 2 b Y
Any ideas on how to work this out will be very helpful.
Use two consecutive DataFrame.merge operations, with DataFrame.add_suffix on the right dataframe, to left-merge df4 with df2 and df3; finally use DataFrame.fillna to replace the missing values with an empty string:
df = (
df4.merge(df2.add_suffix('(df2)'), left_on=['A', 'B'], right_on=['A(df2)', 'B(df2)'], how='left')
.merge(df3.add_suffix('(df3)'), left_on=['A', 'B'], right_on=['A(df3)', 'B(df3)'], how='left')
.fillna('')
)
Result:
# print(df)
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
0 5 e e5 5 e Q
1 2 b b2 2 b Y
Here's how I would do it on the entire data set. If you want to sample first, just update the merge statements at the end by replacing df1 with df4, or just take a sample of t.
PK = ["A","B"]
df2 = pd.concat([df2,df2], axis=1)
df2.columns=['A','B','C','A(df2)', 'B(df2)', 'C(df2)']
df2.drop(columns=['C'], inplace=True)
df3 = pd.concat([df3,df3], axis=1)
df3.columns=['A','B','C','A(df3)', 'B(df3)', 'C(df3)']
df3.drop(columns=['C'], inplace=True)
t = df1.merge(df2, on=PK, how='left')
t = t.merge(df3, on=PK, how='left')
Output
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
0 1 a a1 1.0 a X NaN NaN NaN
1 2 b b2 2.0 b Y NaN NaN NaN
2 3 c c3 3.0 c Z NaN NaN NaN
3 4 d d4 NaN NaN NaN 4.0 d P
4 5 e e5 NaN NaN NaN 5.0 e Q
5 6 f f6 NaN NaN NaN 6.0 f R
How can I replace specific row-wise duplicate cells in selected columns without dropping rows (preferably without looping through the rows)?
Basically, I want to keep the first value and replace the remaining duplicates in a row with NAN.
For example:
df_example = pd.DataFrame({'A':['a' , 'b', 'c'], 'B':['a', 'f', 'c'],'C':[1,2,3]})
df_example.head()
Original:
A B C
0 a a 1
1 b f 2
2 c c 3
Expected output:
A B C
0 a nan 1
1 b f 2
2 c nan 3
A bit more complicated example is as follows:
Original:
A B C D
0 a 1 a 1
1 b 2 f 5
2 c 3 c 3
Expected output:
A B C D
0 a 1 nan nan
1 b 2 f 5
2 c 3 nan nan
Use DataFrame.mask with Series.duplicated applied per row via DataFrame.apply:
df_example = df_example.mask(df_example.apply(lambda x: x.duplicated(), axis=1))
print (df_example)
A B C
0 a NaN 1
1 b f 2
2 c NaN 3
With new data:
df_example = df_example.mask(df_example.apply(lambda x: x.duplicated(), axis=1))
print (df_example)
A B C D
0 a 1 NaN NaN
1 b 2 f 5.0
2 c 3 NaN NaN
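Since the question asks about selected columns, the same mask can be restricted to a subset so that other columns stay untouched; cols here is a hypothetical selection, using the second example's data:

```python
import pandas as pd

df_example = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [1, 2, 3],
                           'C': ['a', 'f', 'c'], 'D': [1, 5, 3]})

cols = ['A', 'C']  # hypothetical subset to deduplicate; B and D are left alone
df_example[cols] = df_example[cols].mask(
    df_example[cols].apply(lambda x: x.duplicated(), axis=1))
print(df_example)
```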
How can I replace particular values in a DataFrame? For example, in the dataframe below I want to replace the values in the rows starting with [AA, CB, EZ], and the replacement value is ''.
df = pandas.DataFrame({'A': ['AA','BB','CB','DD','EZ'],'B':[6,7,8,9,10],'C':[11,12,13,14,15]})
print (df)
A B C
0 AA 6 11
1 BB 7 12
2 CB 8 13
3 DD 9 14
4 EZ 10 15
Expected output:
A B C
0 AA
1 BB 7 12
2 CB
3 DD 9 14
4 EZ
You can replace the values selected by a boolean mask with empty strings, but you get mixed types (strings alongside numbers), and some functions will then fail:
mask = df['A'].str.startswith(('AA','CB','EZ'))
df.loc[mask, ['B', 'C']] = ''
print (df)
A B C
0 AA
1 BB 7 12
2 CB
3 DD 9 14
4 EZ
Better is to replace the values with NaNs:
df.loc[mask, ['B', 'C']] = np.nan
print (df)
A B C
0 AA NaN NaN
1 BB 7.0 12.0
2 CB NaN NaN
3 DD 9.0 14.0
4 EZ NaN NaN
Another solution:
df[['B', 'C']] = df[['B', 'C']].mask(mask)
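If the float upcast caused by NaN is unwanted, converting to the nullable Int64 dtype before masking keeps the remaining values as integers (the masked cells become <NA>); a sketch with the same frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['AA', 'BB', 'CB', 'DD', 'EZ'],
                   'B': [6, 7, 8, 9, 10],
                   'C': [11, 12, 13, 14, 15]})

mask = df['A'].str.startswith(('AA', 'CB', 'EZ'))

# nullable integers can hold <NA> without falling back to float
df[['B', 'C']] = df[['B', 'C']].astype('Int64').mask(mask)
print(df)
```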