I tried to answer this question with a group-level merge. Below is a slightly modified version of the same question, where I need the output produced by a group-level merge.
Here are the input dataframes:
df = pd.DataFrame({ "group":[1,1,1 ,2,2],
"cat": ['a', 'b', 'c', 'a', 'c'] ,
"value": range(5),
"value2": np.array(range(5))* 2})
df
cat group value value2
a 1 0 0
b 1 1 2
c 1 2 4
a 2 3 6
c 2 4 8
categories = ['a', 'b', 'c', 'd']                                   # as a plain list
categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])    # and as a one-column frame
print(categories)
cat
0 a
1 b
2 c
3 d
Here's the expected output:
cat group value value2
a 1 0 0
b 1 1 2
c 1 2 4
d NA NA NA
a 2 3 6
c 2 4 8
b NA NA NA
d NA NA NA
Question:
I can achieve what I want with a for loop. Is there a more pandas-like way to do it, though?
(I need to perform an outer join between categories and each group of the groupby result of df.groupby('group'))
grouped = df.groupby('group')
merged_list = []
for g in grouped:
    merged = pd.merge(categories, g[1], how='outer', on='cat')
    merged_list.append(merged)
out = pd.concat(merged_list)
I think groupby + merge is an overcomplicated approach here.
A faster option is to reindex with a MultiIndex:
# here `categories` is the plain list ['a', 'b', 'c', 'd']
mux = pd.MultiIndex.from_product([df['group'].unique(), categories],
                                 names=('group', 'cat'))
df = df.set_index(['group', 'cat']).reindex(mux).swaplevel(0, 1).reset_index()
# blank out 'group' for the rows that the reindex added
df['group'] = df['group'].mask(df['value'].isnull())
print(df)
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d NaN NaN NaN
4 a 2.0 3.0 6.0
5 b NaN NaN NaN
6 c 2.0 4.0 8.0
7 d NaN NaN NaN
Possible solution with groupby + apply, using the one-column categories frame (chain .reset_index(drop=True) afterwards if you prefer a clean RangeIndex):
df = (df.groupby('group', group_keys=False)
        .apply(lambda x: pd.merge(categories, x, how='outer', on='cat')))
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d NaN NaN NaN
0 a 2.0 3.0 6.0
1 b NaN NaN NaN
2 c 2.0 4.0 8.0
3 d NaN NaN NaN
Timings:
np.random.seed(123)
N = 1000000
L = list('abcd')
df = pd.DataFrame({'cat': np.random.choice(L, N, p=(0.002, 0.002, 0.005, 0.991)),
                   'group': np.random.randint(10000, size=N),
                   'value': np.random.randint(1000, size=N),
                   'value2': np.random.randint(5000, size=N)})
df = df.sort_values(['group', 'cat']).drop_duplicates(['group', 'cat']).reset_index(drop=True)
print(df.head(10))

categories = ['a', 'b', 'c', 'd']
def jez1(df):
    mux = pd.MultiIndex.from_product([df['group'].unique(), categories],
                                     names=('group', 'cat'))
    df = df.set_index(['group', 'cat']).reindex(mux, fill_value=0).swaplevel(0, 1).reset_index()
    df['group'] = df['group'].mask(df['value'].isnull())
    return df

def jez2(df):
    grouped = df.groupby('group')
    categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
    return grouped.apply(lambda x: pd.merge(categories, x, how='outer', on='cat'))

def coldspeed(df):
    grouped = df.groupby('group')
    categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
    return pd.concat([g[1].merge(categories, how='outer', on='cat') for g in grouped])

def akilat90(df):
    grouped = df.groupby('group')
    categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
    merged_list = []
    for g in grouped:
        merged = pd.merge(categories, g[1], how='outer', on='cat')
        # replace the `group` column's NAs by the group's mode
        merged['group'].fillna(merged['group'].mode()[0], inplace=True)
        merged.fillna(0, inplace=True)
        merged_list.append(merged)
    return pd.concat(merged_list)
In [471]: %timeit jez1(df)
100 loops, best of 3: 12 ms per loop
In [472]: %timeit jez2(df)
1 loop, best of 3: 14.5 s per loop
In [473]: %timeit coldspeed(df)
1 loop, best of 3: 19.4 s per loop
In [474]: %timeit akilat90(df)
1 loop, best of 3: 22.3 s per loop
To actually answer your question: no, you can only merge two dataframes at a time (I'm not aware of multi-way merges in pandas). You cannot avoid the loop, but you can certainly make your code a little neater:
pd.concat([g[1].merge(categories, how='outer', on='cat') for g in grouped])
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d NaN NaN NaN
0 a 2.0 3.0 6.0
1 c 2.0 4.0 8.0
2 b NaN NaN NaN
3 d NaN NaN NaN
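As an aside on the two-at-a-time point: when you genuinely need to fold several dataframes into a single merge result, the usual idiom is functools.reduce over pd.merge. This is a generic sketch with made-up frames, not specific to this question:
from functools import reduce
import pandas as pd

# three illustrative frames sharing a 'cat' key
frames = [pd.DataFrame({'cat': ['a', 'b'], 'x': [1, 2]}),
          pd.DataFrame({'cat': ['a', 'c'], 'y': [3, 4]}),
          pd.DataFrame({'cat': ['b', 'c'], 'z': [5, 6]})]

# still merges two frames at a time, just without writing the loop by hand
merged = reduce(lambda left, right: pd.merge(left, right, on='cat', how='outer'), frames)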
I might be doing something wrong, but I was trying to calculate a rolling average (let's use sum instead in this example, for simplicity) after grouping the dataframe. Up to that point it all works well, but when I apply a shift I find the values spill over into the group below. See the example:
import pandas as pd
df = pd.DataFrame({'X': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'Y': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
grouped_df = df.groupby(by='X')['Y'].rolling(window=2, min_periods=2).sum().shift(periods=1)
print(grouped_df)
Expected result:
X
A 0 NaN
1 NaN
2 3.0
B 3 NaN
4 NaN
5 3.0
C 6 NaN
7 NaN
8 3.0
Result I actually get:
X
A 0 NaN
1 NaN
2 3.0
B 3 5.0
4 NaN
5 3.0
C 6 5.0
7 NaN
8 3.0
You can see the result of A2 gets passed to B3 and the result of B5 to C6. I'm not sure whether this is the intended behaviour and I'm doing something wrong, or whether this is a bug in pandas.
Thanks
The problem is that
df.groupby(by='X')['Y'].rolling(window=2, min_periods=2).sum()
returns a new Series; when you then chain .shift(), you shift that Series as a whole, not within each group.
You need another groupby to shift within the group:
grouped_df = (df.groupby(by='X')['Y']
                .rolling(window=2, min_periods=2).sum()
                .groupby(level=0).shift(periods=1))
Or use groupby.transform:
grouped_df = (df.groupby('X')['Y']
                .transform(lambda x: x.rolling(window=2, min_periods=2)
                                      .sum().shift(periods=1)))
Output (from the first version; the transform version gives the same values on df's original flat index):
X
A 0 NaN
1 NaN
2 3.0
B 3 NaN
4 NaN
5 3.0
C 6 NaN
7 NaN
8 3.0
Name: Y, dtype: float64
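If you want the result of the first approach back as a column of the original frame, the extra group level has to be dropped so the index lines up with df again. A small sketch of my own (the Y_rolled column name is just illustrative):
rolled = (df.groupby('X')['Y']
            .rolling(window=2, min_periods=2).sum()
            .groupby(level=0).shift(periods=1))

# drop the 'X' level so the remaining index matches df's original index
df['Y_rolled'] = rolled.reset_index(level=0, drop=True)
print(df)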
I work in Python with pandas.
Let's suppose that I have the following two dataframes df_1 and df_2 (INPUT):
# df1
A B C
0 2 8 6
1 5 2 5
2 3 4 9
3 5 1 1
# df2
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
I want to process it to join/merge them to get a new dataframe which looks like that (EXPECTED OUTPUT):
A B C
0 2 7 NaN
1 5 1 1
2 3 3 NaN
3 5 0 NaN
So basically it is a right-merge/join but with preserving the order of the original right dataframe.
However, if I do this:
df_2 = df_1.merge(df_2[['A', 'B']], on=['A', 'B'], how='right')
then I get this:
A B C
0 5 1 1.0
1 2 7 NaN
2 3 3 NaN
3 5 0 NaN
So I get the right rows joined/merged but the output dataframe does not have the same row-order as the original right dataframe.
How can I do the join/merge and preserve the row-order too?
The code to create the original dataframes is the following:
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
columns = ['A', 'B', 'C']
data_2 = [[2, 5, 3, 5], [7, 1, 3, 0], [np.nan, np.nan, np.nan, np.nan]]
data_2 = np.array(data_2).T
df_2 = pd.DataFrame(data=data_2, columns=columns)
I think that by using either .join() or .update() I could get what I want, but to start with I am quite surprised that .merge() does not handle this very simple case.
I think it is a bug.
Possible solution with left join:
df_2 = df_2.merge(df_1, on=['A', 'B'], how='left', suffixes=('_','')).drop('C_', axis=1)
print (df_2)
A B C
0 2.0 7.0 NaN
1 5.0 1.0 1.0
2 3.0 3.0 NaN
3 5.0 0.0 NaN
You can align on the index between the two dataframes (this assumes the values in B are unique):
print(df)
# A B C
# 0 5 1 1.0
# 1 2 7 NaN
# 2 3 3 NaN
# 3 5 0 NaN
df = df.set_index('B')
df = df.reindex(index=df_2['B'])
df = df.reset_index()
df = df[['A', 'B', 'C']]
print(df)
# A B C
# 0 2 7.0 NaN
# 1 5 1.0 1.0
# 2 3 3.0 NaN
# 3 5 0.0 NaN
One quick way is:
df_2=df_2.set_index(['A','B'])
temp = df_1.set_index(['A','B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
As discussed with @jezrael above, and if I am not missing something: if you do not need both C columns from the original dataframes but only the column C with the matching values, then .update() is the quickest way, since you do not have to drop the columns you do not need.
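For completeness, the .join() route the question mentions can also be sketched; this is my own sketch, not part of the answers above. DataFrame.join with on= matches those columns against the other frame's index and keeps the left frame's row order:
# cast df_1's keys to float so they line up with df_2's float keys before joining
right = df_1.astype({'A': 'float64', 'B': 'float64'}).set_index(['A', 'B'])['C']
df_2 = df_2.drop(columns='C').join(right, on=['A', 'B'])
print(df_2)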
I need to create a column that computes the difference between another column's elements:
Column A    Computed Column
10          (blank)    # nothing to compute for the first record
9           1          # = 10 - 9
7           2          # = 9 - 7
4           3          # = 7 - 4
I am assuming this needs a lambda function, but I am not sure how to reference the elements in 'Column A'.
Any help/direction you can provide would be great - thanks!
You can do it by shifting the column.
import pandas as pd
dict1 = {'A': [10,9,7,4]}
df = pd.DataFrame.from_dict(dict1)
df['Computed'] = df['A'].shift() - df['A']
print(df)
giving
A Computed
0 10 NaN
1 9 1.0
2 7 2.0
3 4 3.0
EDIT: the OP extended the requirement to multiple columns.
dict1 = {'A': [10, 9, 7, 4], 'B': [10, 9, 7, 4], 'C': [10, 9, 7, 4]}
df = pd.DataFrame.from_dict(dict1)

columns_to_update = ['A', 'B']
for col in columns_to_update:
    df['Computed' + col] = df[col].shift() - df[col]
print(df)
By using the columns_to_update, you can choose the columns you want.
A B C ComputedA ComputedB
0 10 10 10 NaN NaN
1 9 9 9 1.0 1.0
2 7 7 7 2.0 2.0
3 4 4 4 3.0 3.0
Use diff: diff(-1) computes current minus next, and shifting the result down one row leaves each row holding previous minus current.
df = pd.DataFrame(data=[10,9,7,4], columns=['A'])
df['B'] = df.A.diff(-1).shift(1)
Output:
df
Out[140]:
A B
0 10 NaN
1 9 1.0
2 7 2.0
3 4 3.0
I would just do:
df = pd.DataFrame(data=[10,9,7,4], columns=['A'])
df['B'] = abs(df['A'].diff())
The reason for abs() is that diff() computes current - previous whereas you want previous - current; the two differ only in sign, so taking the absolute value gives the desired result for a column that only decreases, like this one. diff() is already built into the Series class, so no extra machinery is needed.
To demonstrate this on two inputs:
import pandas as pd
df = pd.DataFrame(data=[10,9,7,4], columns=['A'])
df['B'] = abs(df['A'].diff())
>>> df
# Output
A B
0 10 NaN
1 9 1.0
2 7 2.0
3 4 3.0
df2 = pd.DataFrame(data=[10,4,7,9], columns=['A'])
df2['B'] = abs(df2['A'].diff())
>>> df2
# Output
A B
0 10 NaN
1 4 6.0
2 7 3.0
3 9 2.0
To improve on #cosmic_inquiry's solution, multiply diff() by -1 instead of taking the absolute value; this keeps the sign of previous - current, as the df2 output below shows:
import pandas as pd
df = pd.DataFrame(data=[10,9,7,4], columns=['A'])
df2 = pd.DataFrame(data=[10,4,7,9], columns=['A'])
df['B'] = df['A'].diff() * -1
df2['B'] = df2['A'].diff() * -1
>>> df
# Output:
A B
0 10 NaN
1 9 1.0
2 7 2.0
3 4 3.0
>>> df2
# Output:
A B
0 10 NaN
1 4 6.0
2 7 -3.0
3 9 -2.0
I have an example dataframe with 4 columns:
import numpy as np
import pandas as pd

data = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
        'B': [42, 52, np.nan, np.nan, np.nan, np.nan],
        'C': [np.nan, np.nan, 31, 2, np.nan, np.nan],
        'D': [np.nan, np.nan, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
A B C D
0 a 42.0 NaN NaN
1 b 52.0 NaN NaN
2 c NaN 31.0 NaN
3 d NaN 2.0 NaN
4 e NaN NaN 62.0
5 f NaN NaN 70.0
I would now like to merge/combine columns B, C, and D to a new column E like in this example:
data2 = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
'E': [42, 52, 31, 2, 62, 70]}
df2 = pd.DataFrame(data2, columns = ['A', 'E'])
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
I found quite a similar question here, but that approach appends the merged columns B, C, and D below column A:
0 a
1 b
2 c
3 d
4 e
5 f
6 42
7 52
8 31
9 2
10 62
11 70
dtype: object
Thanks for help.
Option 1
Using assign and drop
In [644]: cols = ['B', 'C', 'D']
In [645]: df.assign(E=df[cols].sum(1)).drop(cols, 1)
Out[645]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
Option 2
Using assignment and drop
In [648]: df['E'] = df[cols].sum(1)
In [649]: df = df.drop(cols, 1)
In [650]: df
Out[650]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
Option 3 (lately, this is the option I like best)
Using groupby
In [660]: df.groupby(np.where(df.columns == 'A', 'A', 'E'), axis=1).first() #or sum max min
Out[660]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
In [661]: df.columns == 'A'
Out[661]: array([ True, False, False, False], dtype=bool)
In [662]: np.where(df.columns == 'A', 'A', 'E')
Out[662]:
array(['A', 'E', 'E', 'E'],
dtype='|S1')
The question as written asks for a merge/combine rather than a sum, so I'm posting this to help folks who land here looking for how to coalesce with combine_first, which can be a bit tricky.
df2 = pd.concat([df["A"],
df["B"].combine_first(df["C"]).combine_first(df["D"])],
axis=1)
df2.rename(columns={"B":"E"}, inplace=True)
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
What's so tricky about that? In this case there's no problem, but say you were pulling the B, C and D values from different dataframes in which the a, b, c, d, e, f labels were present but not necessarily in the same order. combine_first() aligns on the index, so you'd need to tack a set_index() onto each of your df references.
df2 = pd.concat([df.set_index("A", drop=False)["A"],
                 df.set_index("A")["B"]
                   .combine_first(df.set_index("A")["C"])
                   .combine_first(df.set_index("A")["D"])
                   .astype(int)],
                axis=1).reset_index(drop=True)
df2.rename(columns={"B":"E"}, inplace=True)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
Use difference to get the column names other than A, and then take the sum or max:
cols = df.columns.difference(['A'])
df['E'] = df[cols].sum(axis=1).astype(int)
# df['E'] = df[cols].max(axis=1).astype(int)
df = df.drop(cols, axis=1)
print (df)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
If there can be multiple values per row:
data = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
'B': [42, 52, np.nan, np.nan, np.nan, np.nan],
'C': [np.nan, np.nan, 31, 2, np.nan, np.nan],
'D': [10, np.nan, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
print (df)
A B C D
0 a 42.0 NaN 10.0
1 b 52.0 NaN NaN
2 c NaN 31.0 NaN
3 d NaN 2.0 NaN
4 e NaN NaN 62.0
5 f NaN NaN 70.0
cols = df.columns.difference(['A'])
df['E'] = df[cols].apply(lambda x: ', '.join(x.dropna().astype(int).astype(str)), 1)
df = df.drop(cols, axis=1)
print (df)
A E
0 a 42, 10
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
You can also use ffill with iloc:
df['E'] = df.iloc[:, 1:].ffill(1).iloc[:, -1].astype(int)
df = df.iloc[:, [0, -1]]
print(df)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
Zero's third option using groupby requires a numpy import and only handles one column outside the set of columns to collapse, while jpp's answer using ffill requires that you know how the columns are ordered. Here's a solution that has no extra dependencies, takes an arbitrary input dataframe, and only collapses the columns if every row holds exactly one value across them:
import pandas as pd

data = [{'A': 'a', 'B': 42, 'messy': 'z'},
        {'A': 'b', 'B': 52, 'messy': 'y'},
        {'A': 'c', 'C': 31},
        {'A': 'd', 'C': 2, 'messy': 'w'},
        {'A': 'e', 'D': 62, 'messy': 'v'},
        {'A': 'f', 'D': 70, 'messy': ['z']}]
df = pd.DataFrame(data)

cols = ['B', 'C', 'D']
new_col = 'E'
# collapse only if every row holds exactly one non-NaN value across cols
if df[cols].apply(lambda x: x.notna().sum() == 1, axis=1).all():
    df2 = df.drop(columns=cols)
    df2[new_col] = df[cols].ffill(axis=1).dropna(axis=1)
print(df, '\n\n', df2)
Output:
A B messy C D
0 a 42.0 z NaN NaN
1 b 52.0 y NaN NaN
2 c NaN NaN 31.0 NaN
3 d NaN w 2.0 NaN
4 e NaN v NaN 62.0
5 f NaN [z] NaN 70.0
A messy E
0 a z 42.0
1 b y 52.0
2 c NaN 31.0
3 d w 2.0
4 e v 62.0
5 f [z] 70.0
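As a closing aside on the original data from the question, where each row holds exactly one value across B, C and D, a back-fill across the row is another way to coalesce (my own sketch, not one of the posted answers):
cols = ['B', 'C', 'D']
# back-fill along each row, then take the first of those columns:
# whatever non-NaN value the row holds ends up in front
df['E'] = df[cols].bfill(axis=1).iloc[:, 0].astype(int)
df = df.drop(columns=cols)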
I think this is a trivial question, but I just can't make it work.
import numpy as np
import pandas as pd

d = {'one': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'two': pd.Series([np.nan, 6, np.nan, 8], index=['a', 'b', 'c', 'd']),
     'three': pd.Series([10, 20, 30, np.nan], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df
one three two
a 1 10.0 NaN
b 2 20.0 6.0
c 3 30.0 NaN
d 4 NaN 8.0
My Series:
fill = pd.Series([30,60])
I'd like to fill a specific column, let it be 'two', with my Series called fill wherever the column 'two' meets a condition: it is NaN. Can you help me with that?
My desired result:
df
one three two
a 1 10.0 30
b 2 20.0 6.0
c 3 30.0 60
d 4 NaN 8.0
I think you need loc with isnull, assigning the numpy array obtained from fill via Series.values:
df.loc[df.two.isnull(), 'two'] = fill.values
print (df)
one three two
a 1 10.0 30.0
b 2 20.0 6.0
c 3 30.0 60.0
d 4 NaN 8.0
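An alternative worth knowing (a sketch of my own, not part of the answer above): build a Series aligned to the NaN positions and let fillna do the assignment by index, rather than relying on positional assignment through .values. It makes the same assumption as above, namely that fill has one value per NaN in 'two':
# index the fill values by the labels of the rows where 'two' is NaN
fill_aligned = pd.Series(fill.values, index=df.index[df['two'].isnull()])
df['two'] = df['two'].fillna(fill_aligned)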