Python pandas dataframe apply result of function to multiple columns where NaN

I have a dataframe with three columns and a function that calculates the values of columns y and z from the value of column x. I only need to calculate the values where they are missing (NaN).
import numpy as np
import pandas as pd
def calculate(x):
    return 1, 2
df = pd.DataFrame({'x':['a', 'b', 'c', 'd', 'e', 'f'], 'y':[np.nan, np.nan, np.nan, 'a1', 'b2', 'c3'], 'z':[np.nan, np.nan, np.nan, 'a2', 'b1', 'c4']})
x y z
0 a NaN NaN
1 b NaN NaN
2 c NaN NaN
3 d a1 a2
4 e b2 b1
5 f c3 c4
mask = (df.isnull().any(axis=1))
df[['y', 'z']] = df[mask].apply(calculate, axis=1, result_type='expand')
However, I get the following result, although I only apply to the masked set. Unsure what I'm doing wrong.
x y z
0 a 1.0 2.0
1 b 1.0 2.0
2 c 1.0 2.0
3 d NaN NaN
4 e NaN NaN
5 f NaN NaN
If the mask is inverted I get the following result:
df[['y', 'z']] = df[~mask].apply(calculate, axis=1, result_type='expand')
x y z
0 a NaN NaN
1 b NaN NaN
2 c NaN NaN
3 d 1.0 2.0
4 e 1.0 2.0
5 f 1.0 2.0
Expected result:
x y z
0 a 1.0 2.0
1 b 1.0 2.0
2 c 1.0 2.0
3 d a1 a2
4 e b2 b1
5 f c3 c4

You can calculate the values for the full dataframe, rename the resulting columns with set_axis, and then use fillna so that only the missing cells are filled:
out = (df.fillna(df.apply(calculate, axis=1, result_type='expand')
         .set_axis(['y','z'], axis=1)))
print(out)
x y z
0 a 1 2
1 b 1 2
2 c 1 2
3 d a1 a2
4 e b2 b1
5 f c3 c4
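The fillna approach above can be sketched as a self-contained script. This is a minimal sketch, not a definitive implementation: `calculate` is the placeholder from the question, and the frame is built with `dtype=object` (an assumption added here) so that strings and numbers can share a column regardless of the pandas version.

```python
import numpy as np
import pandas as pd

def calculate(row):
    # Placeholder from the question: always returns the pair (1, 2).
    return 1, 2

# dtype=object is an assumption of this sketch so mixed values are allowed.
df = pd.DataFrame({
    'x': ['a', 'b', 'c', 'd', 'e', 'f'],
    'y': [np.nan, np.nan, np.nan, 'a1', 'b2', 'c3'],
    'z': [np.nan, np.nan, np.nan, 'a2', 'b1', 'c4'],
}, dtype=object)

# Compute candidate values for every row and label them like the targets.
computed = df.apply(calculate, axis=1, result_type='expand')
computed.columns = ['y', 'z']

# fillna aligns on index and column labels, so only NaN cells are replaced.
out = df.fillna(computed)
print(out)
```

Note that this calls `calculate` for every row, including rows whose values are already present, which may matter if the function is expensive.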

Try:
df.loc[mask,["y","z"]] = pd.DataFrame(df.loc[mask].apply(calculate, axis=1).to_list(), index=df[mask].index, columns = ["y","z"])
print(df)
x y z
0 a 1 2
1 b 1 2
2 c 1 2
3 d a1 a2
4 e b2 b1
5 f c3 c4
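The attempt in the question fails because `df[['y', 'z']] = ...` replaces both columns entirely: the right-hand side is aligned on the index, and rows that are absent from it become NaN. Restricting the assignment target with `.loc` avoids that. A minimal self-contained sketch, assuming the placeholder `calculate` from the question and an object-dtype frame (an assumption made here so that strings and numbers can be mixed):

```python
import numpy as np
import pandas as pd

def calculate(row):
    # Placeholder from the question: always returns the pair (1, 2).
    return 1, 2

df = pd.DataFrame({
    'x': ['a', 'b', 'c', 'd', 'e', 'f'],
    'y': [np.nan, np.nan, np.nan, 'a1', 'b2', 'c3'],
    'z': [np.nan, np.nan, np.nan, 'a2', 'b1', 'c4'],
}, dtype=object)

# Rows where y or z is missing.
mask = df[['y', 'z']].isnull().any(axis=1)

# Run the function only on the masked rows.
computed = df.loc[mask].apply(calculate, axis=1, result_type='expand')
computed.columns = ['y', 'z']

# Assign only into the masked rows; all other rows are left untouched.
df.loc[mask, ['y', 'z']] = computed
print(df)
```

Unlike the fillna variant, the function is only evaluated for the rows that actually need it.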

Related

Pandas groupby two columns and expand the third

I have a Pandas dataframe with the following structure:
A B C
a b 1
a b 2
a b 3
c d 7
c d 8
c d 5
c d 6
c d 3
e b 4
e b 3
e b 2
e b 1
I would like to transform it into this:
A B C1 C2 C3 C4 C5
a b 1 2 3 NAN NAN
c d 7 8 5 6 3
e b 4 3 2 1 NAN
In other words, something like groupby A and B and expand C into different columns.
Knowing that the length of each group is different.
C is already ordered
Shorter groups can have NAN or NULL values (empty), it does not matter.

Use GroupBy.cumcount and Series.add with 1 to number the rows within each group starting from 1, then pass this counter to DataFrame.pivot and use DataFrame.add_prefix to name the new columns (C1, C2, C3, etc.). Finally, use DataFrame.rename_axis to remove the name of the columns axis ('g') and DataFrame.reset_index to turn the MultiIndex levels A, B back into columns:
df['g'] = df.groupby(['A','B']).cumcount().add(1)
df = df.pivot(index=['A','B'], columns='g', values='C').add_prefix('C').rename_axis(columns=None).reset_index()
print (df)
A B C1 C2 C3 C4 C5
0 a b 1.0 2.0 3.0 NaN NaN
1 c d 7.0 8.0 5.0 6.0 3.0
2 e b 4.0 3.0 2.0 1.0 NaN
Because NaN is a float by default, the new columns are floats. If you need integer columns, add DataFrame.astype with the nullable Int64 dtype:
df['g'] = df.groupby(['A','B']).cumcount().add(1)
df = (df.pivot(index=['A','B'], columns='g', values='C')
        .add_prefix('C')
        .astype('Int64')
        .rename_axis(columns=None)
        .reset_index())
print (df)
A B C1 C2 C3 C4 C5
0 a b 1 2 3 <NA> <NA>
1 c d 7 8 5 6 3
2 e b 4 3 2 1 <NA>
EDIT: If there is a maximum of N new columns per row, the A, B pairs will be duplicated across rows. It is therefore necessary to add helper groups g1 and g2, built with integer and modulo division, which adds a new level to the index:
N = 4
g = df.groupby(['A','B']).cumcount()
df['g1'], df['g2'] = g // N, (g % N) + 1
df = (df.pivot(index=['A','B','g1'], columns='g2', values='C')
        .add_prefix('C')
        .droplevel(-1)
        .rename_axis(columns=None)
        .reset_index())
print (df)
A B C1 C2 C3 C4
0 a b 1.0 2.0 3.0 NaN
1 c d 7.0 8.0 5.0 6.0
2 c d 3.0 NaN NaN NaN
3 e b 4.0 3.0 2.0 1.0
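The basic cumcount-and-pivot steps from this answer can be sketched as a self-contained script. The data is the sample from the question; the name `wide` is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a'] * 3 + ['c'] * 5 + ['e'] * 4,
    'B': ['b'] * 3 + ['d'] * 5 + ['b'] * 4,
    'C': [1, 2, 3, 7, 8, 5, 6, 3, 4, 3, 2, 1],
})

# Position of each row within its (A, B) group, starting at 1.
df['g'] = df.groupby(['A', 'B']).cumcount().add(1)

wide = (df.pivot(index=['A', 'B'], columns='g', values='C')
          .add_prefix('C')            # column labels 1..5 become C1..C5
          .astype('Int64')            # nullable integers keep missing cells as <NA>
          .rename_axis(columns=None)  # drop the leftover 'g' axis name
          .reset_index())
print(wide)
```

The counter column `g` is what makes each (A, B, g) combination unique, which is the precondition for pivot to succeed.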

df1.astype({'C':str}).groupby([*'AB'])\
.agg(','.join).C.str.split(',',expand=True)\
.add_prefix('C').reset_index()
A B C0 C1 C2 C3 C4
0 a b 1 2 3 None None
1 c d 7 8 5 6 3
2 e b 4 3 2 1 None

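A variant of the accepted solution with N = 3, the nullable Int64 dtype, and a C_ prefix. Passing index, columns, and values to pivot as keywords avoids the deprecation warning that positional arguments trigger in later pandas 1.x releases; pandas 2.0 and later accept keywords only: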
N = 3
g = df_grouped.groupby(['A','B']).cumcount()
df_grouped['g1'], df_grouped['g2'] = g // N, (g % N) + 1
df_grouped = (df_grouped.pivot(index=['A','B','g1'], columns='g2', values='C')
.add_prefix('C_')
.astype('Int64')
.droplevel(-1)
.rename_axis(columns=None)
.reset_index())
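To see what the cap of N columns does to the sample data, here is a minimal runnable sketch. The frame is rebuilt from the question; `df_grouped` is the name used in the answer above and `wide` is a name introduced here.

```python
import pandas as pd

df_grouped = pd.DataFrame({
    'A': ['a'] * 3 + ['c'] * 5 + ['e'] * 4,
    'B': ['b'] * 3 + ['d'] * 5 + ['b'] * 4,
    'C': [1, 2, 3, 7, 8, 5, 6, 3, 4, 3, 2, 1],
})

N = 3  # maximum number of C_ columns per output row
g = df_grouped.groupby(['A', 'B']).cumcount()
# g1: which chunk of N a row falls into; g2: position inside that chunk (1..N).
df_grouped['g1'], df_grouped['g2'] = g // N, (g % N) + 1

wide = (df_grouped.pivot(index=['A', 'B', 'g1'], columns='g2', values='C')
                  .add_prefix('C_')
                  .astype('Int64')
                  .droplevel(-1)              # drop the helper level g1
                  .rename_axis(columns=None)
                  .reset_index())
print(wide)
```

Groups longer than N spill over into additional rows with the same A, B values, so the (c, d) and (e, b) groups each occupy two rows here.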

combine and group rows from 2 dfs

I have 2 dfs, which I want to combine as the following:
df1 = pd.DataFrame({"a": [1,2], "b":['A','B'], "c":[3,2]})
df2 = pd.DataFrame({"a": [1,1,1, 2,2,2, 3, 4], "b":['A','A','A','B','B', 'B','C','D'], "c":[3, None,None,2,None,None,None,None]})
Output:
a b c
1 A 3.0
1 A NaN
1 A NaN
2 B 2.0
2 B NaN
2 B NaN
I had an earlier version of this question that only involved df2 and was solved with
df.groupby(['a','b']).filter(lambda g: any(~g['c'].isna()))
but now I need to run it only for rows that appear in df1 (df2 contains the rows from df1 plus some extra rows, which I do not want included).
Thanks!

You can turn on the indicator in merge and then keep only the rows found in both frames:
out = df2.merge(df1,indicator=True,how='outer',on=['a','b'])
print(out)
a b c_x c_y _merge
0 1 A 3.0 3.0 both
1 1 A NaN 3.0 both
2 1 A NaN 3.0 both
3 2 B 2.0 2.0 both
4 2 B NaN 2.0 both
5 2 B NaN 2.0 both
6 3 C NaN NaN left_only
7 4 D NaN NaN left_only
out = out[out['_merge']=='both']

IIUC, you could merge:
out = df2.merge(df1[['a','b']])
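or you could use chained isin (note that this checks each column independently, so it can also keep rows whose (a, b) combination does not occur in df1; it works here because a and b vary together):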
out1 = df2[df2['a'].isin(df1['a']) & df2['b'].isin(df1['b'])]
Output:
a b c
0 1 A 3.0
1 1 A NaN
2 1 A NaN
3 2 B 2.0
4 2 B NaN
5 2 B NaN
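A self-contained sketch of filtering df2 by the (a, b) pairs present in df1. The second variant, which tests whole pairs through a MultiIndex, is an addition for illustration and not part of the answers above; `keys`, `pairs`, and `out2` are names introduced here.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": ['A', 'B'], "c": [3, 2]})
df2 = pd.DataFrame({
    "a": [1, 1, 1, 2, 2, 2, 3, 4],
    "b": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
    "c": [3, None, None, 2, None, None, None, None],
})

# Variant 1: inner merge on the key columns only (the result gets a fresh index).
keys = df1[['a', 'b']].drop_duplicates()
out = df2.merge(keys, on=['a', 'b'], how='inner')

# Variant 2: test (a, b) pairs as a unit and keep df2's original index.
pairs = pd.MultiIndex.from_frame(df1[['a', 'b']])
out2 = df2[pd.MultiIndex.from_frame(df2[['a', 'b']]).isin(pairs)]

print(out)
print(out2)
```

Dropping duplicate keys before the merge prevents rows of df2 from being repeated if df1 ever contains the same pair twice.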

How to compare values of certain columns of one dataframe with the values of same set of columns in another dataframe?

I have three dataframes df1, df2, and df3, which are defined as follows
df1 =
A B C
0 1 a a1
1 2 b b2
2 3 c c3
3 4 d d4
4 5 e e5
5 6 f f6
df2 =
A B C
0 1 a X
1 2 b Y
2 3 c Z
df3 =
A B C
3 4 d P
4 5 e Q
5 6 f R
I have defined a Primary Key list PK = ["A","B"].
Now, I take a fourth dataframe df4 as df4 = df1.sample(n=2), which gives something like
df4 =
A B C
4 5 e e5
1 2 b b2
Now, I want to select the rows from df2 and df3 that match the values of the primary keys of df4.
For example, in this case,
I need to get row with
index = 4 from df3,
index = 1 from df2.
If possible I need to get a dataframe as follows:
df =
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
4 5 e e5 5 e Q
1 2 b b2 2 b Y
Any ideas on how to work this out will be very helpful.

Use two consecutive DataFrame.merge operations, applying DataFrame.add_suffix to the right dataframe each time, to left-merge df4 with df2 and df3. Finally, use DataFrame.fillna to replace the missing values with empty strings:
df = (
df4.merge(df2.add_suffix('(df2)'), left_on=['A', 'B'], right_on=['A(df2)', 'B(df2)'], how='left')
.merge(df3.add_suffix('(df3)'), left_on=['A', 'B'], right_on=['A(df3)', 'B(df3)'], how='left')
.fillna('')
)
Result:
# print(df)
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
0 5 e e5 5 e Q
1 2 b b2 2 b Y
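The two suffixed left merges can be sketched as a runnable script. Assumptions made here: the frames are rebuilt from the tables in the question, df4 uses fixed rows instead of a random sample so the result is reproducible, `attach` is a hypothetical helper, and the final fillna step is left out so that missing matches stay visible as NaN.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6],
                    'B': list('abcdef'),
                    'C': ['a1', 'b2', 'c3', 'd4', 'e5', 'f6']})
df2 = df1.iloc[:3].assign(C=['X', 'Y', 'Z'])
df3 = df1.iloc[3:].assign(C=['P', 'Q', 'R'])

PK = ['A', 'B']
df4 = df1.loc[[4, 1]]  # stands in for df1.sample(n=2)

def attach(left, right, tag):
    # Hypothetical helper: suffix every column of `right`, then left-merge on the keys.
    renamed = right.add_suffix(f'({tag})')
    return left.merge(renamed, left_on=PK,
                      right_on=[f'{key}({tag})' for key in PK], how='left')

out = attach(attach(df4, df2, 'df2'), df3, 'df3')
print(out)
```

Because both merges use how='left', every row of df4 is kept and the result gets a fresh 0..n-1 index in df4's row order.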

Here's how I would do it on the entire data set. If you want to sample first, either replace df1 with df4 in the merge statements at the end or take a sample of t:
PK = ["A","B"]
df2 = pd.concat([df2,df2], axis=1)
df2.columns=['A','B','C','A(df2)', 'B(df2)', 'C(df2)']
df2.drop(columns=['C'], inplace=True)
df3 = pd.concat([df3,df3], axis=1)
df3.columns=['A','B','C','A(df3)', 'B(df3)', 'C(df3)']
df3.drop(columns=['C'], inplace=True)
t = df1.merge(df2, on=PK, how='left')
t = t.merge(df3, on=PK, how='left')
Output
A B C A(df2) B(df2) C(df2) A(df3) B(df3) C(df3)
0 1 a a1 1.0 a X NaN NaN NaN
1 2 b b2 2.0 b Y NaN NaN NaN
2 3 c c3 3.0 c Z NaN NaN NaN
3 4 d d4 NaN NaN NaN 4.0 d P
4 5 e e5 NaN NaN NaN 5.0 e Q
5 6 f f6 NaN NaN NaN 6.0 f R

Replace column and extend index in DataFrame

I have DataFrame x and I would like to replace one column with Series y
x = DataFrame([[1,2],[3,4]], columns=['C1','C2'], index=['a','b'])
C1 C2
a 1 2
b 3 4
y = Series([5,6,7], index=['a','b','c'])
a 5
b 6
c 7
Simple replacement works fine but keeps original index of DataFrame
x['C1'] = y
C1 C2
a 5 2
b 6 4
I need the union of the indices of x and y. One solution is to reindex before the replacement:
x = x.reindex(x.index.union(y.index), copy=False)
x['C1'] = y
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN
Is there a simpler way?

combine_first
Turn y into a DataFrame first with to_frame
y.to_frame('C1').combine_first(x)
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN
align and assign
Use align to... align the indices
x, y = x.align(y, axis=0)
x.assign(C1=y)
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN

You can try using join:
pd.DataFrame(y,columns=['C1']).join(x[['C2']])
Output:
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN

Similar to your solution but more succinct: use reindex, then assign:
res = x.reindex(x.index.union(y.index)).assign(C1=y)
print(res)
C1 C2
a 5 2.0
b 6 4.0
c 7 NaN
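For reference, a self-contained sketch of the reindex-then-assign approach next to combine_first. The name `alt` is introduced here; everything else comes from the question.

```python
import pandas as pd

x = pd.DataFrame([[1, 2], [3, 4]], columns=['C1', 'C2'], index=['a', 'b'])
y = pd.Series([5, 6, 7], index=['a', 'b', 'c'])

# Extend x to the union of both indices, then overwrite C1 with y.
res = x.reindex(x.index.union(y.index)).assign(C1=y)

# Alternative: values from y take priority, x fills in the remaining cells.
alt = y.to_frame('C1').combine_first(x)

print(res)
print(alt)
```

Both variants leave x itself unchanged and return a new frame whose index is the union of the two indices.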

You can use concat but you will have to fix the column names, i.e.
import pandas as pd
pd.concat([x.loc[:, 'C2'], y], axis = 1)
which gives,
C2 0
a 2.0 5
b 4.0 6
c NaN 7

Why pandas unstack is throwing an error?

I am trying to unstack two columns :
cols = res.columns[:31]
res[cols] = res[cols].ffill()
res = res.set_index(cols + [31])[32].unstack().reset_index().rename_axis(None, 1)
But I am getting an error :
TypeError: can only perform ops with scalar values
What should I do to avoid it?
My original problem : LINK

I think you need to convert the columns to a list:
cols = res.columns[:31].tolist()
EDIT:
Index contains duplicate entries, cannot reshape
means there are duplicates. The first 6 columns create the new index and the 7th column (label 6) creates the new columns, so the new DataFrame cannot be built when the 8th column (label 7) holds 2 values for the same index and column:
0 1 2 3 4 5 6 7
0 xx s 1 d f df f 54
1 xx s 1 d f df f g4
New DataFrame:
index = xx s 1 d f df
column = f
value = 54
index = xx s 1 d f df
column = f
value = g4
So one solution is to aggregate. The values here are strings, so use .apply(', '.join):
index = xx s 1 d f df
column = f
value = 54, g4
Or remove the duplicates, keeping the first or last value of each set of duplicated rows, with drop_duplicates:
index = xx s 1 d f df
column = f
value = 54
index = xx s 1 d f df
column = f
value = g4
res = pd.DataFrame({0: ['xx',np.nan,np.nan,np.nan,'ds', np.nan, np.nan, np.nan, np.nan, 'as'],
1: ['s',np.nan,np.nan,np.nan,'a', np.nan, np.nan, np.nan, np.nan, 't'],
2: ['1',np.nan,np.nan,np.nan,'s', np.nan, np.nan, np.nan, np.nan, 'r'],
3: ['d',np.nan, np.nan, np.nan,'d', np.nan, np.nan, np.nan, np.nan, 'a'],
4: ['f',np.nan, np.nan, np.nan,'f', np.nan, np.nan, np.nan, np.nan, '2'],
5: ['df',np.nan,np.nan,np.nan,'ds',np.nan, np.nan, np.nan, np.nan, 'ds'],
6: ['f','f', 'x', 'r', 'f', 'd', 's', '1', '3', 'k'],
7: ['54','g4', 'r4', '43', '64', '43', 'se', 'gf', 's3', 's4']})
cols = res.columns[:6].tolist()
res[cols] = res[cols].ffill()
print (res)
0 1 2 3 4 5 6 7
0 xx s 1 d f df f 54
1 xx s 1 d f df f g4
2 xx s 1 d f df x r4
3 xx s 1 d f df r 43
4 ds a s d f ds f 64
5 ds a s d f ds d 43
6 ds a s d f ds s se
7 ds a s d f ds 1 gf
8 ds a s d f ds 3 s3
9 as t r a 2 ds k s4
res = res.groupby(cols + [6])[7].apply(', '.join).unstack().reset_index().rename_axis(None, axis=1)
print (res)
0 1 2 3 4 5 1 3 d f k r s x
0 as t r a 2 ds NaN NaN NaN NaN s4 NaN NaN NaN
1 ds a s d f ds gf s3 43 64 NaN NaN se NaN
2 xx s 1 d f df NaN NaN NaN 54, g4 NaN 43 NaN r4 <-54, g4
Another solution is to remove the duplicates:
res = res.drop_duplicates(cols + [6])
res = res.set_index(cols + [6])[7].unstack().reset_index().rename_axis(None, axis=1)
print (res)
0 1 2 3 4 5 1 3 d f k r s x
0 as t r a 2 ds NaN NaN NaN NaN s4 NaN NaN NaN
1 ds a s d f ds gf s3 43 64 NaN NaN se NaN
2 xx s 1 d f df NaN NaN NaN 54 NaN 43 NaN r4 <- 54
res = res.drop_duplicates(cols + [6], keep='last')
res = res.set_index(cols + [6])[7].unstack().reset_index().rename_axis(None, axis=1)
print (res)
0 1 2 3 4 5 1 3 d f k r s x
0 as t r a 2 ds NaN NaN NaN NaN s4 NaN NaN NaN
1 ds a s d f ds gf s3 43 64 NaN NaN se NaN
2 xx s 1 d f df NaN NaN NaN g4 NaN 43 NaN r4 <- g4
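The duplicate-entry problem and both remedies can be reproduced on a much smaller frame. This is a sketch under stated assumptions: the column names `k1`, `k2`, `col`, and `val` are made up here to stand in for the numeric labels above.

```python
import pandas as pd

res = pd.DataFrame({
    'k1':  ['xx', 'xx', 'xx', 'ds'],
    'k2':  ['s', 's', 's', 'a'],
    'col': ['f', 'f', 'x', 'f'],
    'val': ['54', 'g4', 'r4', '64'],
})
keys = ['k1', 'k2']

# (xx, s, f) occurs twice, so a plain unstack cannot place both values.
try:
    res.set_index(keys + ['col'])['val'].unstack()
    raised = False
except ValueError:
    raised = True

# Remedy 1: aggregate the duplicated values into one string per cell.
joined = (res.groupby(keys + ['col'])['val'].agg(', '.join)
             .unstack().reset_index().rename_axis(None, axis=1))

# Remedy 2: keep only the first row of each duplicated key combination.
first = (res.drop_duplicates(keys + ['col'])
            .set_index(keys + ['col'])['val']
            .unstack().reset_index().rename_axis(None, axis=1))

print(joined)
print(first)
```

Which remedy fits depends on whether the duplicated values carry information that must be preserved (aggregate) or are redundant (drop).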
