I'm trying to replace null values in dataframe d using dataframe f.
d and f are linked by EGI. In d, EGI is a column and is not unique. In f, EGI is unique and is this dataframe's index.
For each row in d, I need the null values in that row to be filled ('masked') from the row in f with the corresponding EGI.
Sample data:
d = pd.DataFrame({'EGI':['a1','b2','a1','d4'],'A': ['x', np.nan, 'z', 'e'], 'B': [pd.NaT, 6, 7, 9], 'C': [2, 1, None, 9], 'D': [2, None, np.nan, None]})
EGI A B C D
0 a1 x NaT 2.0 2.0
1 b2 NaN 6 1.0 NaN
2 a1 z 7 NaN NaN
3 d4 e 9 9.0 NaN
f = pd.DataFrame({'B': [5, 8, 9], 'A': ['w', 'y', np.nan], 'D': [None, np.nan, 8], 'test': [5, 8, 9]}, index=['b2', 'a1', 'c3'])
B A D test
b2 5 w NaN 5
a1 8 y NaN 8
c3 9 NaN 8.0 9
Expected output:
EGI A B C D
0 a1 x 8 2.0 2.0
1 b2 w 6 1.0 NaN
2 a1 z 7 NaN NaN
3 d4 e 9 9.0 NaN
What I tried:
m = d.isnull()
m.index = d['EGI'].tolist()
m = m.drop(['EGI'], axis = 1)
d.mask(m, f)
EGI A B C D
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
If dataframes d and f have matching row indexes, we can just fillna:
d.fillna(f)
But in OP's example the indexes do not match up (d has a 0-3 RangeIndex while f is indexed by EGI). That is also why the mask attempt above comes back all NaN: mask aligns m and f against d on both index and columns, and nothing lines up. We just need to align them first.
One-liner
Use set_index and reindex to align both frames on EGI, fillna, then reset_index to restore EGI as a column:
d.set_index('EGI').fillna(f.reindex(d.EGI)).reset_index()
# EGI A B C D
# 0 a1 x 8.0 2.0 2.0
# 1 b2 w 6 1.0 NaN
# 2 a1 z 7 NaN NaN
# 3 d4 e 9 9.0 NaN
Step-by-step
Use set_index to set d's index to EGI:
d = d.set_index('EGI')
# A B C D
# EGI
# a1 x NaT 2.0 2.0
# b2 NaN 6 1.0 NaN
# a1 z 7 NaN NaN
# d4 e 9 9.0 NaN
Use reindex to align f's index to d's index:
f = f.reindex(d.index)
# B A D test
# EGI
# a1 8.0 y NaN 8.0
# b2 5.0 w NaN 5.0
# a1 8.0 y NaN 8.0
# d4 NaN NaN NaN NaN
Use fillna to fill d's NaNs with f, then reset_index to restore EGI as a column:
d.fillna(f).reset_index()
# EGI A B C D
# 0 a1 x 8.0 2.0 2.0
# 1 b2 w 6 1.0 NaN
# 2 a1 z 7 NaN NaN
# 3 d4 e 9 9.0 NaN
Note that the column indexes of d and f are not aligned and do not need to be. We only need to align the row indexes, and fillna will handle the rest.
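As a quick sanity check, here is a self-contained sketch (just the sample frames from the question; the assert is my addition) showing that f's extra test column is ignored, because fillna only fills columns that already exist in d:
import numpy as np
import pandas as pd

d = pd.DataFrame({'EGI': ['a1', 'b2', 'a1', 'd4'],
                  'A': ['x', np.nan, 'z', 'e'],
                  'B': [pd.NaT, 6, 7, 9],
                  'C': [2, 1, None, 9],
                  'D': [2, None, np.nan, None]})
f = pd.DataFrame({'B': [5, 8, 9], 'A': ['w', 'y', np.nan],
                  'D': [None, np.nan, 8], 'test': [5, 8, 9]},
                 index=['b2', 'a1', 'c3'])

out = d.set_index('EGI').fillna(f.reindex(d.EGI)).reset_index()
assert list(out.columns) == ['EGI', 'A', 'B', 'C', 'D']  # no 'test' column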
Related
I have a dataframe where I need to pull the non-null values out into new columns.
There may be more than 5 columns, but each row has no more than 2 non-null values.
df1 = pd.DataFrame({'A': [np.nan, np.nan, 'c', np.nan, np.nan, np.nan],
                    'B': [np.nan, np.nan, np.nan, 'a', np.nan, 'e'],
                    'C': [np.nan, 'b', np.nan, 'f', np.nan, np.nan],
                    'D': [np.nan, np.nan, 'd', np.nan, np.nan, np.nan],
                    'E': ['a', np.nan, np.nan, np.nan, np.nan, 'a']})
A B C D E
0 NaN NaN NaN NaN a
1 NaN NaN b NaN NaN
2 c NaN NaN d NaN
3 NaN a f NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN e NaN NaN a
My expected output: generate 4 new columns, Other_1, Other_1_name, Other_2, and Other_2_name. A non-null value goes to Other_1 or Other_2, and the name of the column it came from goes to Other_1_name or Other_2_name. If a row has no non-null values, leave all 4 new columns NaN.
A B C D E Other_1 Other_1_name Other_2 Other_2_name
0 NaN NaN NaN NaN a a E NaN NaN
1 NaN NaN b NaN NaN b C NaN NaN
2 c NaN NaN d NaN c A d D
3 NaN a f NaN NaN a B f C
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN e NaN NaN a e B a E
Use DataFrame.melt to unpivot, drop the missing values with DataFrame.dropna, then add a counter column with GroupBy.cumcount and reshape with DataFrame.unstack:
df2 = df1.melt(ignore_index=False, var_name='name', value_name='val').dropna()[['val', 'name']]
g = df2.groupby(level=0).cumcount().add(1)
df2 = df2.set_index(g, append=True).unstack().sort_index(level=1, axis=1, sort_remaining=False)
df2.columns = df2.columns.map(lambda x: f'Other_{x[1]}_{x[0]}')
print(df2)
Other_1_val Other_1_name Other_2_val Other_2_name
0 a E NaN NaN
1 b C NaN NaN
2 c A d D
3 a B f C
5 e B a E
Last, join back to the original:
df = df1.join(df2)
print(df)
A B C D E Other_1_val Other_1_name Other_2_val Other_2_name
0 NaN NaN NaN NaN a a E NaN NaN
1 NaN NaN b NaN NaN b C NaN NaN
2 c NaN NaN d NaN c A d D
3 NaN a f NaN NaN a B f C
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN e NaN NaN a e B a E
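For comparison, a compact per-row alternative (a sketch of mine, not part of the answer above; pack_row is a hypothetical helper) builds the same columns with apply and dropna:
def pack_row(row):
    # keep the non-null cells and label them Other_{i}_val / Other_{i}_name
    out = {}
    for i, (col, val) in enumerate(row.dropna().items(), start=1):
        out[f'Other_{i}_val'] = val
        out[f'Other_{i}_name'] = col
    return pd.Series(out, dtype='object')

df = df1.join(df1.apply(pack_row, axis=1))
Rows with no non-null values return an empty Series and come out all-NaN after apply aligns the columns. The melt/unstack approach avoids Python-level row iteration, so prefer it for large frames.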
I have the following pandas dataframe with the index on the left:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
A17 a b 1 AUG) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
nn6 c d 2 POS) e f 2 Hi) NaN NaN NaN NaN NaN NaN NaN
AZV NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
JFK a b 4 UUI) c v 8 Yo) t f 9 po) NaN NaN NaN
I'm looking to re-shape it to:
0 1 2 3
A17 a b 1 AUG)
nn6 c d 2 POS)
nn6 e f 2 Hi)
AZV NaN NaN NaN NaN
JFK a b 4 UUI)
JFK c v 8 Yo)
JFK t f 9 po)
I have tried reshape() and used itertools to iterate over the columns but still can't seem to get it.
Basically, each time a ) is encountered then break to a new line.
The real table has over 150 columns.
Thanks.
Another option that doesn't require iterating over rows (which can be very slow when there are many) is the following:
In [1]: df
Out[1]:
0 1 2 3 4 5 6 7
A17 a b 1 AUG) NaN NaN NaN NaN
nn6 c d 2 POS) e f 2 HI)
AVZ NaN NaN NaN NaN NaN NaN NaN
In [2]: joined = df.apply(lambda x: ' '.join([str(xi) for xi in x]), axis=1)
In [4]: split = joined.str.split(')', expand=True).reset_index(drop=False).melt(id_vars='index')
In [6]: split.drop('variable', axis=1, inplace=True)
In [7]: split
Out[7]:
index value
0 A17 a b 1 AUG
1 nn6 c d 2 POS
2 AVZ nan nan nan nan nan nan nan
3 A17 nan nan nan nan
4 nn6 e f 2 HI
5 AVZ None
6 A17 None
7 nn6
8 AVZ None
In [8]: sel = split['value'].str.strip().str.len() > 0
In [9]: split = split.loc[sel, :]
In [9]: split
Out[9]:
index value
0 A17 a b 1 AUG
1 nn6 c d 2 POS
2 AVZ nan nan nan nan nan nan nan
3 A17 nan nan nan nan
4 nn6 e f 2 HI
In [10]: out = split['value'].str.strip().str.split(' ', expand=True)
In [11]: out.index = split['index']
In [12]: out
Out[12]:
0 1 2 3 4 5 6
index
A17 a b 1 AUG None None None
nn6 c d 2 POS None None None
AVZ nan nan nan nan nan nan nan
A17 nan nan nan nan None None None
nn6 e f 2 HI None None None
and then it's a matter of dropping the 4th to 6th columns, which is simple (a sketch follows below).
I added some of the output so that you can see what's happening in each step.
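For completeness, the final column drop could look like this (a small sketch, assuming out is the frame from In [12]; the replace also turns the stringified 'nan' cells back into real NaNs):
import numpy as np

out = out.iloc[:, :4]             # keep only columns 0-3
out = out.replace('nan', np.nan)  # the ' '.join step stringified the original NaNs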
I think an efficient way to concatenate the values would be to chunk the dataframe's columns into 4 roughly equal parts and re-concatenate them along the index.
The only wrinkle is the column names, which we can dynamically rename inside the concat call.
import numpy as np
lst = np.array_split([i for i in range(len(df.columns))],4)
[array([0, 1, 2, 3]),
array([4, 5, 6, 7]),
array([ 8, 9, 10, 11]),
array([12, 13, 14])]
dfs = pd.concat(
    [df.iloc[:, i].rename(columns=dict(zip(df.iloc[:, i].columns, range(4))))
     for i in lst]
).dropna(how='all')
print(dfs)
0 1 2 3
A17 a b 1.0 AUG)
nn6 c d 2.0 POS)
JFK a b 4.0 UUI)
nn6 e f 2.0 Hi)
JFK c v 8.0 Yo)
JFK t f 9.0 po)
The only difference here is that you're missing the all-NaN AZV row from your desired output, since dropna removed it.
We can do a union with combine_first to get the delta between the two dataframes (df.iloc[:, :0] keeps the full index but no columns):
dfs = dfs.combine_first(df.iloc[:,:0])
print(dfs)
0 1 2 3
A17 a b 1.0 AUG)
AZV NaN NaN NaN NaN
JFK a b 4.0 UUI)
JFK c v 8.0 Yo)
JFK t f 9.0 po)
nn6 c d 2.0 POS)
nn6 e f 2.0 Hi)
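If the alphabetical re-sort that combine_first applies is unwanted, one small tweak (my addition, not part of the answer) restores the original label order; .loc with a list of labels returns every matching row per label, in list order, so the duplicated labels are fine:
dfs = dfs.loc[df.index.unique()]  # ['A17', 'nn6', 'AZV', 'JFK']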
There are other options, like slicing columns and appending, but this is pretty straightforward.
output = []
for index, row in df.iterrows():
    r = row.dropna().values
    if len(r) <= 4:
        output.append([index, *r])
    else:
        # assumes the number of non-null values in a row is a multiple of 4
        for x in np.reshape(r, (len(r) // 4, 4)):
            output.append([index, *x])
pd.DataFrame(output).set_index(0)
How do you fill only the groups inside a dataframe that are not entirely null?
In the dataframe below, only the groups with df.A == 'b' and df.A == 'c' should get filled.
df
A B
0 a NaN
1 a NaN
2 a NaN
3 a NaN
4 b 4.0
5 b NaN
6 b 6.0
7 b 6.0
8 c 7.0
9 c NaN
10 c NaN
Was thinking something like:
if set(df[df.A==(need help here)].B.values) == {np.nan}:.
We can do a groupby and fill within each group; the all-null group 'a' comes back unchanged:
df['B'] = df.groupby('A')['B'].apply(lambda x: x.ffill().bfill())
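An equivalent spelling (my variant, not the original answer) uses transform, which assigns back by index; the all-NaN group stays NaN because ffill/bfill have nothing to propagate:
df['B'] = df.groupby('A')['B'].transform(lambda x: x.ffill().bfill())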
Get the index labels that are not completely null, then forward-fill/back-fill on those labels:
df = df.set_index("A")
# get the index labels whose B entries are not completely null
ind = df.loc[df.groupby("A").B.transform(lambda x: x.notna())].index.unique()
# ffill/bfill runs across the whole selection; that is safe here because
# each selected group starts with a non-null value
df.loc[ind] = df.loc[ind].ffill().bfill()
print(df)
B
A
a NaN
a NaN
a NaN
a NaN
b 4.0
b 4.0
b 6.0
b 6.0
c 7.0
c 7.0
c 7.0
I have the following dataframe:
a = [1,2,3,4,5,6,7,8]
x1 = ['j','j','j','k','k','k','k','k']
df = pd.DataFrame({'a': a,'b':x1})
print(df)
a b
1 j
2 j
3 j
4 k
5 k
6 k
7 k
8 k
I am trying to get the sum of the "a" values over the next n rows within each group in column "b", and store it in new columns (for n ranging from 1 to 4).
Essentially I want to end up with four new columns c1, c2, c3, and c4 such that c1 has sum of "next 1" a's, c2 has sum of "next 2" a's, c3 has sum of "next 3" a's and c4 has sum of "next 4" a's.
Therefore, my desired output is:
a b c1 c2 c3 c4
1 j 2.0 5.0 NaN NaN
2 j 3.0 NaN NaN NaN
3 j NaN NaN NaN NaN
4 k 5.0 11.0 18.0 26.0
5 k 6.0 13.0 21.0 NaN
6 k 7.0 15.0 NaN NaN
7 k 8.0 NaN NaN NaN
8 k NaN NaN NaN NaN
I looked for solutions and best I can think of is something like:
for x in range(1,5):
    df[x] = df.groupby(['b'])a[::-1].rolling(x+1).sum()[::-1] - a
but this syntax throws errors.
If possible, can you also share how to implement this if I need to group by more than one field? I will really appreciate any help.
Thanks.
Your example dataframe doesn't match your expected output, so let's go with the latter.
I think you can combine a rolling sum with a shift:
for x in range(1, 5):
    c = pd.Series(df.groupby("b")["a"].rolling(x).sum().values, index=df.index)
    df[f"c{x}"] = c.groupby(df["b"]).shift(-x)
gives me
In [302]: df
Out[302]:
a b c1 c2 c3 c4
0 1 j 2.0 5.0 NaN NaN
1 2 j 3.0 NaN NaN NaN
2 3 j NaN NaN NaN NaN
3 4 k 5.0 11.0 18.0 26.0
4 5 k 6.0 13.0 21.0 NaN
5 6 k 7.0 15.0 NaN NaN
6 7 k 8.0 NaN NaN NaN
7 8 k NaN NaN NaN NaN
If you really want to have multiple keys, you can use a list of keys, but we have to rearrange the call a little:
keys = ["b", "b2"]
for x in range(1, 5):
    c = pd.Series(df.groupby(keys)["a"].rolling(x).sum().values, index=df.index)
    df[f"c{x}"] = c.groupby([df[k] for k in keys]).shift(-x)
or
keys = ["b", "b2"]
for x in range(1, 5):
    c = pd.Series(df.groupby(keys)["a"].rolling(x).sum().values, index=df.index)
    df[f"c{x}"] = df.assign(tmp=c).groupby(keys)["tmp"].shift(-x)
gives me
In [409]: df
Out[409]:
a b b2 c1 c2 c3 c4
0 1 j j 2.0 5.0 NaN NaN
1 2 j j 3.0 NaN NaN NaN
2 3 j j NaN NaN NaN NaN
3 4 k k 5.0 NaN NaN NaN
4 5 k k NaN NaN NaN NaN
5 6 k l 7.0 15.0 NaN NaN
6 7 k l 8.0 NaN NaN NaN
7 8 k l NaN NaN NaN NaN
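A side note on the pd.Series(..., index=df.index) wrapper above (my reading, not spelled out in the answer): groupby(...).rolling(...) returns a Series indexed by (group key, original index), so the wrapper strips the group level to let the later assignment align. An equivalent sketch that drops the level explicitly:
for x in range(1, 5):
    roll = df.groupby("b")["a"].rolling(x).sum().reset_index(level=0, drop=True)
    df[f"c{x}"] = roll.groupby(df["b"]).shift(-x)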
Consider this df:
import pandas as pd, numpy as np
df = pd.DataFrame.from_dict({'id': ['A', 'B', 'A', 'C', 'D', 'B', 'C'],
                             'val': [1, 2, -3, 1, 5, 6, -2],
                             'stuff': ['12', '23232', '13', '1234', '3235', '3236', '732323']})
Question: how do I produce a table with as many columns as there are unique values of id ({A, B, C, D}) and as many rows as df, where, for example, the column corresponding to id == A holds the values:
1, np.nan, -2, np.nan, np.nan, np.nan, np.nan
(that is, the result of df.groupby('id')['val'].cumsum() aligned on the index of df)?
UMMM pivot (the positional pd.pivot(df.index, df.id, df.val) form only worked in older pandas; current versions want keyword arguments):
df.pivot(columns='id', values='val').cumsum()
id A B C D
0 1.0 NaN NaN NaN
1 NaN 2.0 NaN NaN
2 -2.0 NaN NaN NaN
3 NaN NaN 1.0 NaN
4 NaN NaN NaN 5.0
5 NaN 8.0 NaN NaN
6 NaN NaN -1.0 NaN
One way via a dictionary comprehension and pd.DataFrame.where:
res = pd.DataFrame({i: df['val'].where(df['id'].eq(i)).cumsum() for i in df['id'].unique()})
print(res)
A B C D
0 1.0 NaN NaN NaN
1 NaN 2.0 NaN NaN
2 -2.0 NaN NaN NaN
3 NaN NaN 1.0 NaN
4 NaN NaN NaN 5.0
5 NaN 8.0 NaN NaN
6 NaN NaN -1.0 NaN
For a small number of groups, you may find this method efficient:
df = pd.concat([df]*1000, ignore_index=True)
def piv_transform(df):
    return df.pivot(columns='id', values='val').cumsum()

def dict_transform(df):
    return pd.DataFrame({i: df['val'].where(df['id'].eq(i)).cumsum() for i in df['id'].unique()})
%timeit piv_transform(df) # 17.5 ms
%timeit dict_transform(df) # 8.1 ms
Certainly cleaner answers have been supplied - see pivot.
df1 = pd.DataFrame([df.id == x for x in df.id.unique()]).T.mul(df.groupby('id')['val'].cumsum(), axis=0)
df1.columns = df.id.unique()
df1.mask(df1 == 0)  # set the masked-out zeros to NaN; applymap(lambda x: np.nan if x == 0 else x) also works but is deprecated in recent pandas
A B C D
0 1.0 NaN NaN NaN
1 NaN 2.0 NaN NaN
2 -2.0 NaN NaN NaN
3 NaN NaN 1.0 NaN
4 NaN NaN NaN 5.0
5 NaN 8.0 NaN NaN
6 NaN NaN -1.0 NaN
Short and simple:
df.pivot(columns='id', values='val').cumsum()