This is a follow-up question to: How to treat NaN or non-aligned values as 1s or 0s when multiplying pandas DataFrames
I have the following data:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                    "y": [3, 4, 5, 6, 7]},
                   index=['a', 'b', 'c', 'd', 'e'])
df2 = pd.DataFrame({"y": [1, np.nan, 3, 4, 5],
                    "z": [3, 4, 5, 6, 7]},
                   index=['b', 'c', 'd', 'e', 'f'])
I want to get the product of df1 and df2, with every value in df2 retained where there is no corresponding entry in df1, and with only the rows and columns of df2 in the result.
E.g.
print (df1.mul(df2).fillna(df2))
or
print (df1.mul(df2).combine_first(df2))
gives:
x y z
a NaN NaN NaN
b NaN 4.0 3.0
c NaN NaN 4.0
d NaN 18.0 5.0
e NaN 28.0 6.0
f NaN 5.0 7.0
But I want to arrive at:
y z
b 4.0 3.0
c NaN 4.0
d 18.0 5.0
e 28.0 6.0
f 5.0 7.0
NB:
there can be legitimate NaN, Inf, and -Inf values.
the extra columns/rows will not always be at the left/right or top/bottom of the resulting DataFrame, though in the example above they are.
I believe the easiest way would be to get the intersection of the index and columns, like this:
In [1142]: c = df1.columns.intersection(df2.columns)  # `&` on Index objects is deprecated
In [1143]: i = df1.index.intersection(df2.index)
Now, just index and multiply with df.loc:
In [1145]: df2.loc[i, c] *= df1.loc[i, c]; df2
Out[1145]:
y z
b 4.0 3
c NaN 4
d 18.0 5
e 28.0 6
f 5.0 7
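If you prefer not to mutate df2 in place, here is a minimal copy-based sketch of the same idea (my addition, assuming the df1 and df2 defined in the question):

out = df2.copy()  # preserves df2's shape and row/column order
c = df1.columns.intersection(df2.columns)
i = df1.index.intersection(df2.index)
out.loc[i, c] = out.loc[i, c] * df1.loc[i, c]  # multiply only where both frames overlap
print(out)

Legitimate NaN values survive, because only cells present in both frames are touched, and NaN times anything stays NaN.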
I have a DataFrame where I am looking to fill in values in a column based on their grouping. I only want to fill in the values (by propagating non-NaN values using ffill and bfill) if there is only one unique value in the column to be filled; otherwise, it should be left as is. My code below has a sample dataset where I try to do this, but I get an error.
Code:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6],
                   "B": ['a', 'a', np.nan, 'b', 'b', 'c', np.nan, 'd', np.nan, 'e', 'e', np.nan, 'h', 'h'],
                   "C": [5.0, np.nan, 4.0, 4.0, np.nan, 9.0, np.nan, np.nan, 9.0, 8.0, np.nan, 2.0, np.nan, np.nan]})
col_to_groupby = "A"
col_to_modify = "B"
group = df.groupby(col_to_groupby)
modified = group[group[col_to_modify].nunique() == 1].transform(lambda x: x.ffill().bfill())
df.update(modified)
Error:
KeyError: 'Columns not found: False, True'
Original dataset:
A B C
0 1 a 5.0
1 1 a NaN
2 2 NaN 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 NaN NaN
Desired result:
A B C
0 1 a 5.0
1 1 a NaN
2 2 b 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 h NaN
The above is the desired result because
row index 2 is in group 2, which only has 1 unique value in column B ("b"), so it is changed.
row indices 6 and 8 are in group 3, but there are 2 unique values in column B ("c" and "d"), so they are unaltered.
row index 11 is in group 5, which has no data in column B to propagate.
row index 13 is in group 6, which only has 1 unique value in column B ("h"), so it is changed.
The error occurs because group[...] is being indexed with a boolean Series, so pandas looks for columns literally named False and True. One option is to add the condition in groupby.apply:
df[col_to_modify] = df.groupby(col_to_groupby)[col_to_modify].apply(lambda x: x.ffill().bfill() if x.nunique()==1 else x)
Another option is to use groupby + transform('nunique') + eq to create a boolean mask of the groups with a single unique value; then fill those rows with groupby + transform('first') (first skips NaN) using where:
g = df.groupby(col_to_groupby)[col_to_modify]
df[col_to_modify] = g.transform('first').where(g.transform('nunique').eq(1), df[col_to_modify])
Output:
A B C
0 1 a 5.0
1 1 a NaN
2 2 b 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 h NaN
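If you need the same guarded fill for several columns, here is a small reusable sketch (my addition; fill_if_unique is a hypothetical helper name, assuming the df above):

def fill_if_unique(df, group_col, target_col):
    # 'first' skips NaN, so when nunique == 1 it is the group's single non-NaN value
    g = df.groupby(group_col)[target_col]
    return g.transform('first').where(g.transform('nunique').eq(1), df[target_col])

df[col_to_modify] = fill_if_unique(df, col_to_groupby, col_to_modify)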
I'm trying to reverse this but I can't figure out how.
I'm starting from
>>> d = {'col1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'], 'col2': [1, 2, 3, 4, 5, 6, 7, 7]}
>>> df = pd.DataFrame(data=d)
>>> df
col1 col2
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 C 7
7 C 7
And I want to obtain:
col1 new_1 new_2 new_3
0 A 1 2 3
1 B 4 5 6
2 C 7 7 empty
where there are new_x columns based on max number of times a col1 item is repeated.
It seems to be a pretty standard transpose, but I can't find a solution.
Sorry if this is a duplicate.
Thanks,
Sirius
It's not a one-liner but maybe a bit simpler / easier to follow.
First, aggregate to one lists column:
df_ = pd.DataFrame(df.groupby('col1').col2.agg(list))
which gives
col2
col1
A [1, 2, 3]
B [4, 5, 6]
C [7, 7]
Then, build a new DataFrame from these lists:
df2 = (pd.DataFrame(df_.col2.tolist(), index=df_.index).add_prefix('new_')
.reset_index())
which gives
col1 new_0 new_1 new_2
0 A 1 2 3.0
1 B 4 5 6.0
2 C 7 7 NaN
Please note that:
I interpreted empty as an empty cell, not the string 'empty'.
NaN is a float, which is why pandas cast the values in that column to floats.
Use .cumcount() and .unstack() after setting your indices.
cumcount() here groups by your target column and applies a sequential count along the index, which lets us unstack() it and create your new pivoted structure.
The rest of the code shapes your target dataframe; you could also do this with pivot (shown below) or crosstab (see the sketch after the pivot output).
df1 = df.set_index([df.groupby('col1').cumcount() + 1,
                    df['col1']]).drop(columns='col1')\
        .unstack(0)\
        .droplevel(0, axis=1)\
        .add_prefix('new_')\
        .fillna('empty')\
        .reset_index()
Or with pivot:
(df.assign(k=df.groupby("col1").cumcount() + 1)
   .pivot(index="col1", columns="k", values="col2")
   .add_prefix("new_")
   .fillna("empty")
   .reset_index())
col1 new_1 new_2 new_3
0 A 1.0 2.0 3.0
1 B 4.0 5.0 6.0
2 C 7.0 7.0 empty
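The crosstab variant mentioned earlier would look roughly like this (my sketch, not part of the original answer):

pd.crosstab(df['col1'],
            df.groupby('col1').cumcount() + 1,  # sequential position within each group
            values=df['col2'],
            aggfunc='first').add_prefix('new_').reset_index()

Missing positions come out as NaN, which you can fillna('empty') as above.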
d = {'col1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'], 'col2': [1, 2, 3, 4, 5, 6, 7, 7]}
df = pd.DataFrame(data=d)
print(df)
print(df.pivot_table(index='col1',columns=df.index, values='col2').fillna(0))
output:
0 1 2 3 4 5 6 7
col1
A 1.0 2.0 3.0 0.0 0.0 0.0 0.0 0.0
B 0.0 0.0 0.0 4.0 5.0 6.0 0.0 0.0
C 0.0 0.0 0.0 0.0 0.0 0.0 7.0 7.0
I might be doing something wrong, but I was trying to calculate a rolling average (let's use sum instead in this example for simplicity) after grouping the dataframe. Up to that point everything works well, but when I apply a shift I find the values spill over into the group below. See the example below:
import pandas as pd
df = pd.DataFrame({'X': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'Y': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
grouped_df = df.groupby(by='X')['Y'].rolling(window=2, min_periods=2).sum().shift(periods=1)
print(grouped_df)
Expected result:
X
A 0 NaN
1 NaN
2 3.0
B 3 NaN
4 NaN
5 3.0
C 6 NaN
7 NaN
8 3.0
Result I actually get:
X
A 0 NaN
1 NaN
2 3.0
B 3 5.0
4 NaN
5 3.0
C 6 5.0
7 NaN
8 3.0
You can see the result of A2 gets passed to B3 and the result of B5 to C6. Is this the intended behaviour and I'm doing something wrong, or is this a bug in pandas?
Thanks
The problem is that
df.groupby(by='X')['Y'].rolling(window=2, min_periods=2).sum()
returns a new Series; when you then chain .shift(), you shift the Series as a whole, not within each group.
You need another groupby to shift within each group:
grouped_df = (df.groupby(by='X')['Y'].rolling(window=2, min_periods=2).sum()
.groupby(level=0).shift(periods=1)
)
Or use groupby.transform:
grouped_df = (df.groupby('X')['Y']
.transform(lambda x: x.rolling(window=2, min_periods=2)
.sum().shift(periods=1))
)
Output:
X
A 0 NaN
1 NaN
2 3.0
B 3 NaN
4 NaN
5 3.0
C 6 NaN
7 NaN
8 3.0
Name: Y, dtype: float64
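A quick sanity check (my addition, not from the original answer) that both approaches agree element-wise, NaNs included:

a = (df.groupby(by='X')['Y'].rolling(window=2, min_periods=2).sum()
       .groupby(level=0).shift(periods=1))
b = (df.groupby('X')['Y']
       .transform(lambda x: x.rolling(window=2, min_periods=2).sum().shift(periods=1)))
print(a.droplevel(0).equals(b))  # True; equals() treats aligned NaNs as equal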
I work in Python with pandas.
Let's suppose that I have the following two dataframes df_1 and df_2 (INPUT):
# df1
A B C
0 2 8 6
1 5 2 5
2 3 4 9
3 5 1 1
# df2
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
I want to join/merge them to get a new dataframe which looks like this (EXPECTED OUTPUT):
A B C
0 2 7 NaN
1 5 1 1
2 3 3 NaN
3 5 0 NaN
So basically it is a right-merge/join but with preserving the order of the original right dataframe.
However, if I do this:
df_2 = df_1.merge(df_2[['A', 'B']], on=['A', 'B'], how='right')
then I get this:
A B C
0 5 1 1.0
1 2 7 NaN
2 3 3 NaN
3 5 0 NaN
So I get the right rows joined/merged but the output dataframe does not have the same row-order as the original right dataframe.
How can I do the join/merge and preserve the row-order too?
The code to create the original dataframes is the following:
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
columns = ['A', 'B', 'C']
data_2 = [[2, 5, 3, 5], [7, 1, 3, 0], [np.nan, np.nan, np.nan, np.nan]]
data_2 = np.array(data_2).T
df_2 = pd.DataFrame(data=data_2, columns=columns)
I think that by using either .join() or .update() I could get what I want but to start with I am quite surprised that .merge() does not do this very simple thing too.
I think it is a bug.
Possible solution with a left join, which preserves the row order of the left dataframe:
df_2 = df_2.merge(df_1, on=['A', 'B'], how='left', suffixes=('_','')).drop('C_', axis=1)
print (df_2)
A B C
0 2.0 7.0 NaN
1 5.0 1.0 1.0
2 3.0 3.0 NaN
3 5.0 0.0 NaN
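A variant sketch of the same idea (my addition): carry the right frame's positional index through a right merge and restore the order afterwards:

out = (df_1.merge(df_2[['A', 'B']].reset_index(), on=['A', 'B'], how='right')
           .sort_values('index')          # original row order of df_2
           .set_index('index')
           .rename_axis(None))
print(out)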
You can play with the index shared by the two dataframes (note this relies on the values in B being unique, since reindex requires a unique index):
print(df)
# A B C
# 0 5 1 1.0
# 1 2 7 NaN
# 2 3 3 NaN
# 3 5 0 NaN
df = df.set_index('B')
df = df.reindex(index=df_2['B'])
df = df.reset_index()
df = df[['A', 'B', 'C']]
print(df)
# A B C
# 0 2 7.0 NaN
# 1 5 1.0 1.0
# 2 3 3.0 NaN
# 3 5 0.0 NaN
One quick way is:
df_2=df_2.set_index(['A','B'])
temp = df_1.set_index(['A','B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
As I discussed with @jezrael above, and if I am not missing something: if you do not need both C columns from the original dataframes but only the column C with the matching values, then .update() is the quickest way, since you do not have to drop the columns you do not need.
I think this is a trivial question, but I just can't make it work.
import numpy as np
import pandas as pd

d = {'one': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'two': pd.Series([np.nan, 6, np.nan, 8], index=['a', 'b', 'c', 'd']),
     'three': pd.Series([10, 20, 30, np.nan], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df
one three two
a 1 10.0 NaN
b 2 20.0 6.0
c 3 30.0 NaN
d 4 NaN 8.0
My Series:
fill = pd.Series([30,60])
I'd like to replace values in a specific column, say 'two', with the values from my Series called fill wherever column 'two' meets a condition: being NaN. Can you help me with that?
My desired result:
df
one three two
a 1 10.0 30
b 2 20.0 6.0
c 3 30.0 60
d 4 NaN 8.0
I think you need loc with isnull, assigning the numpy array created from fill via Series.values:
df.loc[df.two.isnull(), 'two'] = fill.values
print (df)
one three two
a 1 10.0 30.0
b 2 20.0 6.0
c 3 30.0 60.0
d 4 NaN 8.0
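One caveat worth noting (my addition, not part of the original answer): the assignment only works when fill supplies exactly one value per NaN, in order. A guarded sketch:

mask = df['two'].isnull()
assert mask.sum() == len(fill), "fill must supply exactly one value per NaN"
df.loc[mask, 'two'] = fill.to_numpy()  # .to_numpy() is the modern spelling of .values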