How to replace subset of pandas dataframe with on other series - python

I think this is a trivial question, but i just cant make it work.
d = { 'one': pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd']),
'two': pd.Series([np.nan,6,np.nan,8], index=['a', 'b', 'c', 'd']),
'three': pd.Series([10,20,30,np.nan], index = ['a', 'b', 'c', 'd'])}
​
df = pd.DataFrame(d)
df
one three two
a 1 10.0 NaN
b 2 20.0 6.0
c 3 30.0 NaN
d 4 NaN 8.0
My serires:
​fill = pd.Series([30,60])
I'd like to replace a specific column, let it be 'two'. With my Series called fill, where the column 'two' meets a condition: is Nan. Canyou help me with that?
My desired result:
df
one three two
a 1 10.0 30
b 2 20.0 6.0
c 3 30.0 60
d 4 NaN 8.0

I think you need loc with isnull for replace numpy array created from fill by Series.values:
df.loc[df.two.isnull(), 'two'] = fill.values
print (df)
one three two
a 1 10.0 30.0
b 2 20.0 6.0
c 3 30.0 60.0
d 4 NaN 8.0

Related

Pandas - shifting a rolling sum after grouping spills over to following groups

I might be doing something wrong, but I was trying to calculate a rolling average (let's use sum instead in this example for simplicity) after grouping the dataframe. Until here it all works well, but when I apply a shift I'm finding the values spill over to the group below. See example below:
import pandas as pd
df = pd.DataFrame({'X': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'Y': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
grouped_df = df.groupby(by='X')['Y'].rolling(window=2, min_periods=2).sum().shift(periods=1)
print(grouped_df)
Expected result:
X
A 0 NaN
1 NaN
2 3.0
B 3 NaN
4 NaN
5 3.0
C 6 NaN
7 NaN
8 3.0
Result I actually get:
X
A 0 NaN
1 NaN
2 3.0
B 3 5.0
4 NaN
5 3.0
C 6 5.0
7 NaN
8 3.0
You can see the result of A2 gets passed to B3 and the result of B5 to C6. I'm not sure this is the intended behaviour and I'm doing something wrong or there is some bug in pandas?
Thanks
The problem is that
df.groupby(by='X')['Y'].rolling(window=2, min_periods=2).sum()
returns a new series, then when you chain with shift(), you shift the series as a whole, not within the group.
You need another groupby to shift within the group:
grouped_df = (df.groupby(by='X')['Y'].rolling(window=2, min_periods=2).sum()
.groupby(level=0).shift(periods=1)
)
Or use groupby.transform:
grouped_df = (df.groupby('X')['Y']
.transform(lambda x: x.rolling(window=2, min_periods=2)
.sum().shift(periods=1))
)
Output:
X
A 0 NaN
1 NaN
2 3.0
B 3 NaN
4 NaN
5 3.0
C 6 NaN
7 NaN
8 3.0
Name: Y, dtype: float64

Join/merge dataframes and preserve the row-order

I work in python and pandas.
Let's suppose that I have the following two dataframes df_1 and df_2 (INPUT):
# df1
A B C
0 2 8 6
1 5 2 5
2 3 4 9
3 5 1 1
# df2
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
I want to process it to join/merge them to get a new dataframe which looks like that (EXPECTED OUTPUT):
A B C
0 2 7 NaN
1 5 1 1
2 3 3 NaN
3 5 0 NaN
So basically it is a right-merge/join but with preserving the order of the original right dataframe.
However, if I do this:
df_2 = df_1.merge(df_2[['A', 'B']], on=['A', 'B'], how='right')
then I get this:
A B C
0 5 1 1.0
1 2 7 NaN
2 3 3 NaN
3 5 0 NaN
So I get the right rows joined/merged but the output dataframe does not have the same row-order as the original right dataframe.
How can I do the join/merge and preserve the row-order too?
The code to create the original dataframes is the following:
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
columns = ['A', 'B', 'C']
data_2 = [[2, 5, 3, 5], [7, 1, 3, 0], [np.nan, np.nan, np.nan, np.nan]]
data_2 = np.array(data_2).T
df_2 = pd.DataFrame(data=data_2, columns=columns)
I think that by using either .join() or .update() I could get what I want but to start with I am quite surprised that .merge() does not do this very simple thing too.
I think it is bug.
Possible solution with left join:
df_2 = df_2.merge(df_1, on=['A', 'B'], how='left', suffixes=('_','')).drop('C_', axis=1)
print (df_2)
A B C
0 2.0 7.0 NaN
1 5.0 1.0 1.0
2 3.0 3.0 NaN
3 5.0 0.0 NaN
You can play with index between the both dataframe
print(df)
# A B C
# 0 5 1 1.0
# 1 2 7 NaN
# 2 3 3 NaN
# 3 5 0 NaN
df = df.set_index('B')
df = df.reindex(index=df_2['B'])
df = df.reset_index()
df = df[['A', 'B', 'C']]
print(df)
# A B C
# 0 2 7.0 NaN
# 1 5 1.0 1.0
# 2 3 3.0 NaN
# 3 5 0.0 NaN
Source
One quick way is:
df_2=df_2.set_index(['A','B'])
temp = df_1.set_index(['A','B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
As I discuss above with #jezrael above and if I am not missing something, if you do not need both the columns C from the original dataframes and you need only the column C with the matching values then .update() is the quickest way since you do not have to drop the columns that you do not need.

Merging groups with a one dataframe after a groupby

I tried to answer this question by a group-level merging. The below is a slightly modified version of the same question, but I need the output by a group-level merging.
Here are the input dataframes:
df = pd.DataFrame({ "group":[1,1,1 ,2,2],
"cat": ['a', 'b', 'c', 'a', 'c'] ,
"value": range(5),
"value2": np.array(range(5))* 2})
df
cat group value value2
a 1 0 0
b 1 1 2
c 1 2 4
a 2 3 6
c 2 4 8
categories = ['a', 'b', 'c', 'd']
categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
print(categories)
cat
0 a
1 b
2 c
3 d
Here's the expected output:
cat group value value2
a 1 0 0
b 1 1 2
c 1 2 4
d NA NA NA
a 2 3 6
c 2 4 8
b NA NA NA
d NA NA NA
Question:
I can achieve what I want by a for loop. Is there a pandas way to do that though?
(I need to perform an outer join between categories and each group of the groupby result of df.groupby('group'))
grouped = df.groupby('group')
merged_list = []
for g in grouped:
merged = pd.merge(categories, g[1], how = 'outer', on='cat')
merged_list.append(merged)
out = pd.concat(merged_list)
I think groupby + merge here is only overcomplicated way for this.
So faster is use reindex by MultiIndex:
mux = pd.MultiIndex.from_product([df['group'].unique(), categories], names=('group','cat'))
df = df.set_index(['group','cat']).reindex(mux).swaplevel(0,1).reset_index()
#add missing values to group column
df['group'] = df['group'].mask(df['value'].isnull())
print (df)
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d NaN NaN NaN
4 a 2.0 3.0 6.0
5 b NaN NaN NaN
6 c 2.0 4.0 8.0
7 d NaN NaN NaN
Possible solution:
df = df.groupby('group', group_keys=False)
.apply(lambda x: pd.merge(categories, x, how = 'outer', on='cat'))
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d NaN NaN NaN
0 a 2.0 3.0 6.0
1 b NaN NaN NaN
2 c 2.0 4.0 8.0
3 d NaN NaN NaN
Timings:
np.random.seed(123)
N = 1000000
L = list('abcd') #235,94.1,156ms
df = pd.DataFrame({'cat': np.random.choice(L, N, p=(0.002,0.002,0.005, 0.991)),
'group':np.random.randint(10000,size=N),
'value':np.random.randint(1000,size=N),
'value2':np.random.randint(5000,size=N)})
df = df.sort_values(['group','cat']).drop_duplicates(['group','cat']).reset_index(drop=True)
print (df.head(10))
categories = ['a', 'b', 'c', 'd']
def jez1(df):
mux = pd.MultiIndex.from_product([df['group'].unique(), categories], names=('group','cat'))
df = df.set_index(['group','cat']).reindex(mux, fill_value=0).swaplevel(0,1).reset_index()
df['group'] = df['group'].mask(df['value'].isnull())
return df
def jez2(df):
grouped = df.groupby('group')
categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
return grouped.apply(lambda x: pd.merge(categories, x, how = 'outer', on='cat'))
def coldspeed(df):
grouped = df.groupby('group')
categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
return pd.concat([g[1].merge(categories, how='outer', on='cat') for g in grouped])
def akilat90(df):
grouped = df.groupby('group')
categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
merged_list = []
for g in grouped:
merged = pd.merge(categories, g[1], how = 'outer', on='cat')
merged['group'].fillna(merged['group'].mode()[0],inplace=True) # replace the `group` column's `NA`s by mode
merged.fillna(0, inplace=True)
merged_list.append(merged)
return pd.concat(merged_list)
In [471]: %timeit jez1(df)
100 loops, best of 3: 12 ms per loop
In [472]: %timeit jez2(df)
1 loop, best of 3: 14.5 s per loop
In [473]: %timeit coldspeed(df)
1 loop, best of 3: 19.4 s per loop
In [474]: %timeit akilat90(df)
1 loop, best of 3: 22.3 s per loop
To actually answer your question, no - you can only merge 2 dataframes at a time (I'm not aware of multi-way merges in pandas). You cannot avoid the loop, but you certainly can make your code a little neater.
pd.concat([g[1].merge(categories, how='outer', on='cat') for g in grouped])
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d NaN NaN NaN
0 a 2.0 3.0 6.0
1 c 2.0 4.0 8.0
2 b NaN NaN NaN
3 d NaN NaN NaN

How to merge/combine columns in pandas?

I have a (example-) dataframe with 4 columns:
data = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
'B': [42, 52, np.nan, np.nan, np.nan, np.nan],
'C': [np.nan, np.nan, 31, 2, np.nan, np.nan],
'D': [np.nan, np.nan, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
A B C D
0 a 42.0 NaN NaN
1 b 52.0 NaN NaN
2 c NaN 31.0 NaN
3 d NaN 2.0 NaN
4 e NaN NaN 62.0
5 f NaN NaN 70.0
I would now like to merge/combine columns B, C, and D to a new column E like in this example:
data2 = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
'E': [42, 52, 31, 2, 62, 70]}
df2 = pd.DataFrame(data2, columns = ['A', 'E'])
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
I found a quite similar question here but this adds the merged colums B, C, and D at the end of column A:
0 a
1 b
2 c
3 d
4 e
5 f
6 42
7 52
8 31
9 2
10 62
11 70
dtype: object
Thanks for help.
Option 1
Using assign and drop
In [644]: cols = ['B', 'C', 'D']
In [645]: df.assign(E=df[cols].sum(1)).drop(cols, 1)
Out[645]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
Option 2
Using assignment and drop
In [648]: df['E'] = df[cols].sum(1)
In [649]: df = df.drop(cols, 1)
In [650]: df
Out[650]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
Option 3 Lately, I like the 3rd option.
Using groupby
In [660]: df.groupby(np.where(df.columns == 'A', 'A', 'E'), axis=1).first() #or sum max min
Out[660]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
In [661]: df.columns == 'A'
Out[661]: array([ True, False, False, False], dtype=bool)
In [662]: np.where(df.columns == 'A', 'A', 'E')
Out[662]:
array(['A', 'E', 'E', 'E'],
dtype='|S1')
The question as written asks for merge/combine as opposed to sum, so posting this to help folks who find this answer looking for help on coalescing with combine_first, which can be a bit tricky.
df2 = pd.concat([df["A"],
df["B"].combine_first(df["C"]).combine_first(df["D"])],
axis=1)
df2.rename(columns={"B":"E"}, inplace=True)
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
What's so tricky about that? in this case there's no problem - but let's say you were pulling the B, C and D values from different dataframes, in which the a,b,c,d,e,f labels were present, but not necessarily in the same order. combine_first() aligns on the index, so you'd need to tack a set_index() on to each of your df references.
df2 = pd.concat([df.set_index("A", drop=False)["A"],
df.set_index("A")["B"]\
.combine_first(df.set_index("A")["C"])\
.combine_first(df.set_index("A")["D"]).astype(int)],
axis=1).reset_index(drop=True)
df2.rename(columns={"B":"E"}, inplace=True)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
Use difference for columns names without A and then get sum or max:
cols = df.columns.difference(['A'])
df['E'] = df[cols].sum(axis=1).astype(int)
# df['E'] = df[cols].max(axis=1).astype(int)
df = df.drop(cols, axis=1)
print (df)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
If multiple values per rows:
data = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
'B': [42, 52, np.nan, np.nan, np.nan, np.nan],
'C': [np.nan, np.nan, 31, 2, np.nan, np.nan],
'D': [10, np.nan, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
print (df)
A B C D
0 a 42.0 NaN 10.0
1 b 52.0 NaN NaN
2 c NaN 31.0 NaN
3 d NaN 2.0 NaN
4 e NaN NaN 62.0
5 f NaN NaN 70.0
cols = df.columns.difference(['A'])
df['E'] = df[cols].apply(lambda x: ', '.join(x.dropna().astype(int).astype(str)), 1)
df = df.drop(cols, axis=1)
print (df)
A E
0 a 42, 10
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
You can also use ffill with iloc:
df['E'] = df.iloc[:, 1:].ffill(1).iloc[:, -1].astype(int)
df = df.iloc[:, [0, -1]]
print(df)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
Zero's third option using groupby requires a numpy import and only handles one column outside the set of columns to collapse, while jpp's answer using ffill requires you know how columns are ordered. Here's a solution that has no extra dependencies, takes an arbitrary input dataframe, and only collapses columns if all rows in those columns are single-valued:
import pandas as pd
data = [{'A':'a', 'B':42, 'messy':'z'},
{'A':'b', 'B':52, 'messy':'y'},
{'A':'c', 'C':31},
{'A':'d', 'C':2, 'messy':'w'},
{'A':'e', 'D':62, 'messy':'v'},
{'A':'f', 'D':70, 'messy':['z']}]
df = pd.DataFrame(data)
cols = ['B', 'C', 'D']
new_col = 'E'
if df[cols].apply(lambda x: len(x.notna().value_counts()) == 1, axis=1).all():
df[new_col] = df[cols].ffill(axis=1).dropna(axis=1)
df2 = df.drop(columns=cols)
print(df, '\n\n', df2)
Output:
A B messy C D
0 a 42.0 z NaN NaN
1 b 52.0 y NaN NaN
2 c NaN NaN 31.0 NaN
3 d NaN w 2.0 NaN
4 e NaN v NaN 62.0
5 f NaN [z] NaN 70.0
A messy E
0 a z 42.0
1 b y 52.0
2 c NaN 31.0
3 d w 2.0
4 e v 62.0
5 f [z] 70.0

Retaining only the Column Rows from chosen DF in the calculation results

This is a follow up question on: How to treat NaN or non aligned values as 1s or 0s in multiplying pandas DataFrames
I have the following data:
df1 = pd.DataFrame({"x":[1, 2, 3, 4, 5],
"y":[3, 4, 5, 6, 7]},
index=['a', 'b', 'c', 'd', 'e'])
df2 = pd.DataFrame({"y":[1, NaN, 3, 4, 5],
"z":[3, 4, 5, 6, 7]},
index=['b', 'c', 'd', 'e', 'f'])
I want to get the multiplication of df1 and df2 with all data in df2 retained is there is no corresponding entry in df1 but only rows and columns in df2.
E.g.
print (df1.mul(df2).fillna(df2))
or
print (df1.mul(df2).combine_first(df2))
gives:
x y z
a NaN NaN NaN
b NaN 4.0 3.0
c NaN NaN 4.0
d NaN 18.0 5.0
e NaN 28.0 6.0
f NaN 5.0 7.0
But I want to arrive at:
y z
b 4.0 3.0
c NaN 4.0
d 18.0 5.0
e 28.0 6.0
f 5.0 7.0
NB:
there can be legal NaN, Inf, -Inf values.
columns / rows may not always be to the left or right / top or bottom of the resulting DF, though in the above example this is the case.
I believe the easiest way would be to get the intersection of the index and columns, like this:
In [1142]: c = df1.columns & df2.columns
In [1143]: i = df1.index & df2.index
Now, just index and multiply with df.loc:
In [1145]: df2.loc[i, c] *= df1.loc[i, c]; df2
Out[1145]:
y z
b 4.0 3
c NaN 4
d 18.0 5
e 28.0 6
f 5.0 7

Categories