How to merge/combine columns in pandas? - python

I have an example dataframe with 4 columns:
import numpy as np
import pandas as pd

data = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
        'B': [42, 52, np.nan, np.nan, np.nan, np.nan],
        'C': [np.nan, np.nan, 31, 2, np.nan, np.nan],
        'D': [np.nan, np.nan, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
A B C D
0 a 42.0 NaN NaN
1 b 52.0 NaN NaN
2 c NaN 31.0 NaN
3 d NaN 2.0 NaN
4 e NaN NaN 62.0
5 f NaN NaN 70.0
I would now like to merge/combine columns B, C, and D into a new column E, as in this example:
data2 = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
         'E': [42, 52, 31, 2, 62, 70]}
df2 = pd.DataFrame(data2, columns=['A', 'E'])
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
I found a quite similar question here, but it appends the merged columns B, C, and D below column A:
0 a
1 b
2 c
3 d
4 e
5 f
6 42
7 52
8 31
9 2
10 62
11 70
dtype: object
Thanks for your help.

Option 1
Using assign and drop
In [644]: cols = ['B', 'C', 'D']
In [645]: df.assign(E=df[cols].sum(axis=1)).drop(cols, axis=1)
Out[645]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
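One caveat worth adding (my note, not part of the original answer): sum() turns a row that is NaN in all of B, C, and D into 0.0. If you would rather keep NaN for such rows, min_count helps. A sketch, continuing the session above:

# sketch: min_count=1 yields NaN (instead of 0.0) when a row has no values at all
df.assign(E=df[cols].sum(axis=1, min_count=1)).drop(cols, axis=1)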
Option 2
Using assignment and drop
In [648]: df['E'] = df[cols].sum(axis=1)
In [649]: df = df.drop(cols, axis=1)
In [650]: df
Out[650]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
Option 3
Using groupby (lately, this is my favorite)
In [660]: df.groupby(np.where(df.columns == 'A', 'A', 'E'), axis=1).first()  # or sum, max, min
Out[660]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
In [661]: df.columns == 'A'
Out[661]: array([ True, False, False, False], dtype=bool)
In [662]: np.where(df.columns == 'A', 'A', 'E')
Out[662]: array(['A', 'E', 'E', 'E'], dtype='|S1')
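A note for newer pandas (my addition): the axis=1 form of groupby is deprecated in recent versions. A sketch of an equivalent without it, assuming the original df and cols from above:

# back-fill across each row, then take the first column, i.e. the first
# non-NaN value among B, C, D (a sketch, not part of the original answer)
out = df[['A']].assign(E=df[cols].bfill(axis=1).iloc[:, 0])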

The question as written asks for merge/combine as opposed to sum, so I'm posting this to help folks who find this answer while looking for help coalescing with combine_first, which can be a bit tricky.
df2 = pd.concat([df["A"],
df["B"].combine_first(df["C"]).combine_first(df["D"])],
axis=1)
df2.rename(columns={"B":"E"}, inplace=True)
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
What's so tricky about that? In this case there's no problem, but let's say you were pulling the B, C and D values from different dataframes, in which the a, b, c, d, e, f labels were present but not necessarily in the same order. combine_first() aligns on the index, so you'd need to tack a set_index() onto each of your df references.
df2 = pd.concat([df.set_index("A", drop=False)["A"],
                 df.set_index("A")["B"]
                   .combine_first(df.set_index("A")["C"])
                   .combine_first(df.set_index("A")["D"]).astype(int)],
                axis=1).reset_index(drop=True)
df2.rename(columns={"B":"E"}, inplace=True)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
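If there are many columns to coalesce, chaining combine_first by hand gets tedious; a reduce-based sketch (my generalization, not from the original answer) does the same thing:

from functools import reduce

# coalesce left to right: the first non-NaN among B, C, D wins
df2 = pd.concat([df["A"],
                 reduce(lambda s, t: s.combine_first(t),
                        [df[c] for c in ["B", "C", "D"]]).rename("E")],
                axis=1)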

Use difference to get the column names without A, and then take the sum or max:
cols = df.columns.difference(['A'])
df['E'] = df[cols].sum(axis=1).astype(int)
# df['E'] = df[cols].max(axis=1).astype(int)
df = df.drop(cols, axis=1)
print (df)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
If there can be multiple values per row:
data = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
        'B': [42, 52, np.nan, np.nan, np.nan, np.nan],
        'C': [np.nan, np.nan, 31, 2, np.nan, np.nan],
        'D': [10, np.nan, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
print (df)
A B C D
0 a 42.0 NaN 10.0
1 b 52.0 NaN NaN
2 c NaN 31.0 NaN
3 d NaN 2.0 NaN
4 e NaN NaN 62.0
5 f NaN NaN 70.0
cols = df.columns.difference(['A'])
df['E'] = df[cols].apply(lambda x: ', '.join(x.dropna().astype(int).astype(str)), axis=1)
df = df.drop(cols, axis=1)
print (df)
A E
0 a 42, 10
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70

You can also use ffill with iloc:
df['E'] = df.iloc[:, 1:].ffill(axis=1).iloc[:, -1].astype(int)
df = df.iloc[:, [0, -1]]
print(df)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70

Zero's third option using groupby requires a numpy import and only handles one column outside the set of columns to collapse, while jpp's answer using ffill requires that you know how the columns are ordered. Here's a solution that has no extra dependencies, takes an arbitrary input dataframe, and only collapses the columns if all rows in those columns are single-valued:
import pandas as pd

data = [{'A': 'a', 'B': 42, 'messy': 'z'},
        {'A': 'b', 'B': 52, 'messy': 'y'},
        {'A': 'c', 'C': 31},
        {'A': 'd', 'C': 2, 'messy': 'w'},
        {'A': 'e', 'D': 62, 'messy': 'v'},
        {'A': 'f', 'D': 70, 'messy': ['z']}]
df = pd.DataFrame(data)
cols = ['B', 'C', 'D']
new_col = 'E'
# collapse only if every row has exactly one non-NaN value across cols
if df[cols].notna().sum(axis=1).eq(1).all():
    # after a row-wise ffill, the last column holds that single value
    df2 = df.assign(**{new_col: df[cols].ffill(axis=1).iloc[:, -1]}).drop(columns=cols)
print(df, '\n\n', df2)
Output:
A B messy C D
0 a 42.0 z NaN NaN
1 b 52.0 y NaN NaN
2 c NaN NaN 31.0 NaN
3 d NaN w 2.0 NaN
4 e NaN v NaN 62.0
5 f NaN [z] NaN 70.0
A messy E
0 a z 42.0
1 b y 52.0
2 c NaN 31.0
3 d w 2.0
4 e v 62.0
5 f [z] 70.0

Related

When the dataframe has duplicate columns, it seems that the fillna function cannot work correctly with the dict parameter

I find that after using pd.concat() to concatenate two dataframes with the same column name, df.fillna() does not work correctly with the dict parameter specifying which value to use for each column.
I don't know why. Is something wrong with my understanding?
import numpy as np
import pandas as pd

a1 = pd.DataFrame({'a': [1, 2, 3]})
a2 = pd.DataFrame({'a': [1, 2, 3]})
b = pd.DataFrame({'b': [np.nan, 20, 30]})
c = pd.DataFrame({'c': [40, np.nan, 60]})
x = pd.concat([a1, a2, b, c], axis=1)
print(x)
x = x.fillna({'b':10, 'c': 50})
print(x)
Initial dataframe:
a a b c
0 1 1 NaN 40.0
1 2 2 20.0 NaN
2 3 3 30.0 60.0
Data is unchanged after df.fillna():
a a b c
0 1 1 NaN 40.0
1 2 2 20.0 NaN
2 3 3 30.0 60.0
As mentioned in the comments, there's a problem assigning values to a dataframe in the presence of duplicate column names.
However, you can use this workaround:
for col, val in {'b': 10, 'c': 50}.items():
    new_col = x[col].fillna(val)
    idx = int(x.columns.get_loc(col))
    x = x.drop(col, axis=1)
    x.insert(loc=idx, column=col, value=new_col)
print(x)
Result:
a a b c
0 1 1 10.0 40.0
1 2 2 20.0 50.0
2 3 3 30.0 60.0
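An alternative sketch (my addition, assuming the columns you need to fill, 'b' and 'c' here, are themselves uniquely named): fill positionally via iloc, which sidesteps label-based assignment against the duplicated 'a' columns:

for col, val in {'b': 10, 'c': 50}.items():
    i = x.columns.get_loc(col)               # an int, since 'b'/'c' are unique
    x.iloc[:, i] = x.iloc[:, i].fillna(val)  # positional set avoids dup labels
print(x)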

Why is pandas unstack throwing an error?

I am trying to unstack two columns:
cols = res.columns[:31]
res[cols] = res[cols].ffill()
res = res.set_index(cols + [31])[32].unstack().reset_index().rename_axis(None, axis=1)
But I am getting an error:
TypeError: can only perform ops with scalar values
What should I do to avoid it?
My original problem: LINK
I think you need to convert the columns to a list, so that cols + [31] is plain list concatenation rather than an element-wise operation on an Index:
cols = res.columns[:31].tolist()
EDIT:
Index contains duplicate entries, cannot reshape
means there are duplicates, here across the first 6 columns plus the 7th, so it is impossible to create the new DataFrame: the first 6 columns form the new index, the 7th column forms the new columns, and the 8th column supplies the values, which leaves 2 values for a single index/column cell:
0 1 2 3 4 5 6 7
0 xx s 1 d f df f 54
1 xx s 1 d f df f g4
New DataFrame:
index = xx s 1 d f df
column = f
value = 54
index = xx s 1 d f df
column = f
value = g4
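A minimal reproduction sketch of the failure (a hypothetical two-row frame mirroring the schematic above):

import pandas as pd

small = pd.DataFrame({'idx': ['x', 'x'], 'col': ['f', 'f'], 'val': ['54', 'g4']})
# both rows map to index 'x' and column 'f', so unstack has nowhere to put two values
small.set_index(['idx', 'col'])['val'].unstack()
# ValueError: Index contains duplicate entries, cannot reshape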
So the solution is to aggregate; here we are working with strings, so we need .apply(', '.join):
index = xx s 1 d f df
column = f
value = 54, g4
Or remove the duplicates, keeping the first or last of the duplicated rows, with drop_duplicates:
index = xx s 1 d f df
column = f
value = 54
index = xx s 1 d f df
column = f
value = g4
res = pd.DataFrame({0: ['xx', np.nan, np.nan, np.nan, 'ds', np.nan, np.nan, np.nan, np.nan, 'as'],
                    1: ['s', np.nan, np.nan, np.nan, 'a', np.nan, np.nan, np.nan, np.nan, 't'],
                    2: ['1', np.nan, np.nan, np.nan, 's', np.nan, np.nan, np.nan, np.nan, 'r'],
                    3: ['d', np.nan, np.nan, np.nan, 'd', np.nan, np.nan, np.nan, np.nan, 'a'],
                    4: ['f', np.nan, np.nan, np.nan, 'f', np.nan, np.nan, np.nan, np.nan, '2'],
                    5: ['df', np.nan, np.nan, np.nan, 'ds', np.nan, np.nan, np.nan, np.nan, 'ds'],
                    6: ['f', 'f', 'x', 'r', 'f', 'd', 's', '1', '3', 'k'],
                    7: ['54', 'g4', 'r4', '43', '64', '43', 'se', 'gf', 's3', 's4']})
cols = res.columns[:6].tolist()
res[cols] = res[cols].ffill()
print (res)
0 1 2 3 4 5 6 7
0 xx s 1 d f df f 54
1 xx s 1 d f df f g4
2 xx s 1 d f df x r4
3 xx s 1 d f df r 43
4 ds a s d f ds f 64
5 ds a s d f ds d 43
6 ds a s d f ds s se
7 ds a s d f ds 1 gf
8 ds a s d f ds 3 s3
9 as t r a 2 ds k s4
res = res.groupby(cols + [6])[7].apply(', '.join).unstack().reset_index().rename_axis(None, axis=1)
print (res)
0 1 2 3 4 5 1 3 d f k r s x
0 as t r a 2 ds NaN NaN NaN NaN s4 NaN NaN NaN
1 ds a s d f ds gf s3 43 64 NaN NaN se NaN
2 xx s 1 d f df NaN NaN NaN 54, g4 NaN 43 NaN r4 <-54, g4
Another solution is remove duplicates:
res = res.drop_duplicates(cols + [6])
res = res.set_index(cols + [6])[7].unstack().reset_index().rename_axis(None, axis=1)
print (res)
0 1 2 3 4 5 1 3 d f k r s x
0 as t r a 2 ds NaN NaN NaN NaN s4 NaN NaN NaN
1 ds a s d f ds gf s3 43 64 NaN NaN se NaN
2 xx s 1 d f df NaN NaN NaN 54 NaN 43 NaN r4 <- 54
res = res.drop_duplicates(cols + [6], keep='last')
res = res.set_index(cols + [6])[7].unstack().reset_index().rename_axis(None, axis=1)
print (res)
0 1 2 3 4 5 1 3 d f k r s x
0 as t r a 2 ds NaN NaN NaN NaN s4 NaN NaN NaN
1 ds a s d f ds gf s3 43 64 NaN NaN se NaN
2 xx s 1 d f df NaN NaN NaN g4 NaN 43 NaN r4 <- g4

Merging groups with another dataframe after a groupby

I tried to answer this question with a group-level merge. Below is a slightly modified version of the same question, where I need the output to come from a group-level merge.
Here are the input dataframes:
df = pd.DataFrame({ "group":[1,1,1 ,2,2],
"cat": ['a', 'b', 'c', 'a', 'c'] ,
"value": range(5),
"value2": np.array(range(5))* 2})
df
cat group value value2
a 1 0 0
b 1 1 2
c 1 2 4
a 2 3 6
c 2 4 8
categories = ['a', 'b', 'c', 'd']  # as a list
categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])  # as a DataFrame, for merging
print(categories)
cat
0 a
1 b
2 c
3 d
Here's the expected output:
cat group value value2
a 1 0 0
b 1 1 2
c 1 2 4
d NA NA NA
a 2 3 6
c 2 4 8
b NA NA NA
d NA NA NA
Question:
I can achieve what I want by a for loop. Is there a pandas way to do that though?
(I need to perform an outer join between categories and each group of the groupby result of df.groupby('group'))
grouped = df.groupby('group')
merged_list = []
for g in grouped:
    merged = pd.merge(categories, g[1], how='outer', on='cat')
    merged_list.append(merged)
out = pd.concat(merged_list)
I think groupby + merge is only an overcomplicated way to do this here.
It is faster to use reindex with a MultiIndex:
mux = pd.MultiIndex.from_product([df['group'].unique(), categories['cat']], names=('group', 'cat'))
df = df.set_index(['group', 'cat']).reindex(mux).swaplevel(0, 1).reset_index()
# set group to NaN for the filler rows added by reindex
df['group'] = df['group'].mask(df['value'].isnull())
print (df)
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d NaN NaN NaN
4 a 2.0 3.0 6.0
5 b NaN NaN NaN
6 c 2.0 4.0 8.0
7 d NaN NaN NaN
Possible solution:
df = (df.groupby('group', group_keys=False)
        .apply(lambda x: pd.merge(categories, x, how='outer', on='cat')))
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d NaN NaN NaN
0 a 2.0 3.0 6.0
1 b NaN NaN NaN
2 c 2.0 4.0 8.0
3 d NaN NaN NaN
Timings:
np.random.seed(123)
N = 1000000
L = list('abcd')
df = pd.DataFrame({'cat': np.random.choice(L, N, p=(0.002, 0.002, 0.005, 0.991)),
                   'group': np.random.randint(10000, size=N),
                   'value': np.random.randint(1000, size=N),
                   'value2': np.random.randint(5000, size=N)})
df = df.sort_values(['group','cat']).drop_duplicates(['group','cat']).reset_index(drop=True)
print (df.head(10))
categories = ['a', 'b', 'c', 'd']
def jez1(df):
    mux = pd.MultiIndex.from_product([df['group'].unique(), categories], names=('group', 'cat'))
    df = df.set_index(['group', 'cat']).reindex(mux, fill_value=0).swaplevel(0, 1).reset_index()
    df['group'] = df['group'].mask(df['value'].isnull())
    return df
def jez2(df):
    grouped = df.groupby('group')
    categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
    return grouped.apply(lambda x: pd.merge(categories, x, how='outer', on='cat'))
def coldspeed(df):
    grouped = df.groupby('group')
    categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
    return pd.concat([g[1].merge(categories, how='outer', on='cat') for g in grouped])
def akilat90(df):
    grouped = df.groupby('group')
    categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
    merged_list = []
    for g in grouped:
        merged = pd.merge(categories, g[1], how='outer', on='cat')
        # replace the `group` column's NaNs by the mode
        merged['group'].fillna(merged['group'].mode()[0], inplace=True)
        merged.fillna(0, inplace=True)
        merged_list.append(merged)
    return pd.concat(merged_list)
In [471]: %timeit jez1(df)
100 loops, best of 3: 12 ms per loop
In [472]: %timeit jez2(df)
1 loop, best of 3: 14.5 s per loop
In [473]: %timeit coldspeed(df)
1 loop, best of 3: 19.4 s per loop
In [474]: %timeit akilat90(df)
1 loop, best of 3: 22.3 s per loop
To actually answer your question, no - you can only merge 2 dataframes at a time (I'm not aware of multi-way merges in pandas). You cannot avoid the loop, but you certainly can make your code a little neater.
pd.concat([g[1].merge(categories, how='outer', on='cat') for g in grouped])
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d NaN NaN NaN
0 a 2.0 3.0 6.0
1 c 2.0 4.0 8.0
2 b NaN NaN NaN
3 d NaN NaN NaN

How to replace a subset of a pandas dataframe with another series

I think this is a trivial question, but I just can't make it work.
d = {'one': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'two': pd.Series([np.nan, 6, np.nan, 8], index=['a', 'b', 'c', 'd']),
     'three': pd.Series([10, 20, 30, np.nan], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
df
one three two
a 1 10.0 NaN
b 2 20.0 6.0
c 3 30.0 NaN
d 4 NaN 8.0
My series:
fill = pd.Series([30, 60])
I'd like to replace values in a specific column, say 'two', with my Series called fill, wherever the column 'two' meets a condition: it is NaN. Can you help me with that?
My desired result:
df
one three two
a 1 10.0 30
b 2 20.0 6.0
c 3 30.0 60
d 4 NaN 8.0
I think you need loc with isnull, assigning the numpy array created from fill via Series.values:
df.loc[df.two.isnull(), 'two'] = fill.values
print (df)
one three two
a 1 10.0 30.0
b 2 20.0 6.0
c 3 30.0 60.0
d 4 NaN 8.0
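One thing to watch (my note, not from the original answer): this assignment pairs fill's values with the NaNs purely by position, and it requires the counts to match exactly. A defensive sketch:

# sketch: guard the order-based assignment before writing
mask = df['two'].isnull()
assert mask.sum() == len(fill), "fill must supply exactly one value per NaN"
df.loc[mask, 'two'] = fill.values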

python pandas: pivot_table silently drops indices with nans

Is there an option not to drop the indices with NaN in them? I think silently dropping these rows from the pivot will at some point cause someone serious pain.
import pandas
import numpy
a = [['a', 'b', 12, 12, 12], ['a', numpy.nan, 12.3, 233., 12], ['b', 'a', 123.23, 123, 1], ['a', 'b', 1, 1, 1.]]
df = pandas.DataFrame(a, columns=['a', 'b', 'c', 'd', 'e'])
df_pivot = df.pivot_table(index=['a', 'b'], values=['c', 'd', 'e'], aggfunc=sum)
print(df)
print(df_pivot)
Output:
a b c d e
0 a b 12.00 12 12
1 a NaN 12.30 233 12
2 b a 123.23 123 1
3 a b 1.00 1 1
c d e
a b
a b 13.00 13 13
b a 123.23 123 1
This is currently not supported, see this issue for the enhancement: https://github.com/pydata/pandas/issues/3729.
A workaround is to fill the index with a dummy value, pivot, and then replace:
In [28]: df = df.reset_index()
In [29]: df['b'] = df['b'].fillna('dummy')
In [30]: df['dummy'] = np.nan
In [31]: df
Out[31]:
a b c d e dummy
0 a b 12.00 12 12 NaN
1 a dummy 12.30 233 12 NaN
2 b a 123.23 123 1 NaN
3 a b 1.00 1 1 NaN
In [32]: df.pivot_table(index=['a', 'b'], values=['c', 'd', 'e'], aggfunc=sum)
Out[32]:
c d e
a b
a b 13.00 13 13
dummy 12.30 233 12
b a 123.23 123 1
In [33]: df.pivot_table(index=['a', 'b'], values=['c', 'd', 'e'], aggfunc=sum).reset_index().replace('dummy',np.nan).set_index(['a','b'])
Out[33]:
c d e
a b
a b 13.00 13 13
NaN 12.30 233 12
b a 123.23 123 1
The option dropna=False is now supported by pivot_table (note that the old rows= keyword has since been renamed to index=):
df.pivot_table(index=['a', 'b'], values=['c', 'd', 'e'], aggfunc=sum, dropna=False)
