Pandas boolean slicing different results for same command - python

Why does Example 1 give back NaN, while Example 2 doesn't?
Example 1:
import numpy as np
from pandas import DataFrame

data = DataFrame(np.arange(0, 16).reshape(4, 4),
                 index=[list('abcd')],
                 columns=[list('retz')])
data[data['t'] > 5]
r e t z
a NaN NaN NaN NaN
b NaN NaN 6.0 NaN
c NaN NaN 10.0 NaN
d NaN NaN 14.0 NaN
Example 2:
data2 = DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data2[data2['three'] > 5]
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

Your first DataFrame has a MultiIndex:
data.axes
> [MultiIndex(levels=[['a', 'b', 'c', 'd']],
labels=[[0, 1, 2, 3]]), MultiIndex(levels=[['e', 'r', 't', 'z']],
labels=[[1, 0, 2, 3]])]
Whereas your second doesn't:
data2.axes
> [Index(['Ohio', 'Colorado', 'Utah', 'New York'], dtype='object'),
Index(['one', 'two', 'three', 'four'], dtype='object')]
It's because you've wrapped list('retz') in another list, so it's interpreted as [['r', 'e', 't', 'z']] and pandas builds a MultiIndex from it. If you want just a single-level index, get rid of the outer brackets:
data = DataFrame(np.arange(0, 16).reshape(4, 4),
                 index=list('abcd'),
                 columns=list('retz'))
data[data['t'] > 5]
> r e t z
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
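For context: with MultiIndex columns, data['t'] can come back as a one-column DataFrame rather than a Series, and indexing a frame with a boolean DataFrame masks cell by cell instead of filtering rows, which is where the NaNs in Example 1 come from. A minimal sketch of the two behaviors on an ordinary frame:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('retz'))

# A boolean Series selects whole rows:
print(df[df['t'] > 5])

# A boolean DataFrame masks cell by cell, equivalent to df.where(mask),
# filling NaN wherever the mask is False or missing:
print(df[df > 5])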

Pandas fill in group if condition is met

I have a DataFrame where I am looking to fill in values in a column based on their grouping. I only want to fill in the values (by propagating non-NaN values using ffill and bfill) if there is only one unique value in the column to be filled; otherwise, it should be left as is. My code below has a sample dataset where I try to do this, but I get an error.
Code:
df = pd.DataFrame({"A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6],
"B": ['a', 'a', np.nan, 'b', 'b', 'c', np.nan, 'd', np.nan, 'e', 'e', np.nan, 'h', 'h'],
"C": [5.0, np.nan, 4.0, 4.0, np.nan, 9.0, np.nan, np.nan, 9.0, 8.0, np.nan, 2.0, np.nan, np.nan]})
col_to_groupby = "A"
col_to_modify = "B"
group = df.groupby(col_to_groupby)
modified = group[group[col_to_modify].nunique() == 1].transform(lambda x: x.ffill().bfill())
df.update(modified)
Error:
KeyError: 'Columns not found: False, True'
Original dataset:
A B C
0 1 a 5.0
1 1 a NaN
2 2 NaN 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 NaN NaN
Desired result:
A B C
0 1 a 5.0
1 1 a NaN
2 2 b 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 h NaN
The above is the desired result because
row index 2 is in group 2, which only has 1 unique value in column B ("b"), so it is changed.
row indices 6 and 8 are in group 3, but there are 2 unique values in column B ("c" and "d"), so they are unaltered.
row index 11 is in group 5, but that group has no data in column B to propagate.
row index 13 is in group 6, which only has 1 unique value in column B ("h"), so it is changed.
One option is to add a condition in groupby.apply:
df[col_to_modify] = df.groupby(col_to_groupby)[col_to_modify].apply(lambda x: x.ffill().bfill() if x.nunique()==1 else x)
Another option is to use groupby + transform('nunique') + eq to build a boolean mask of the groups with a single unique value, then fill those rows with groupby + transform('first') (first skips NaN) via where:
g = df.groupby(col_to_groupby)[col_to_modify]
df[col_to_modify] = g.transform('first').where(g.transform('nunique').eq(1), df[col_to_modify])
Output:
A B C
0 1 a 5.0
1 1 a NaN
2 2 b 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 h NaN
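As an aside, here's a minimal sketch of where the KeyError in the question comes from: indexing a GroupBy with square brackets selects columns from each group, so the boolean Series produced by nunique() == 1 is effectively treated as a list of column labels, and its values False and True are the 'columns' pandas cannot find:

import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": ['a', 'a', 'b', 'c']})
group = df.groupby("A")

# The mask is indexed by the group keys, not by the original rows:
mask = group["B"].nunique() == 1
print(mask)

# Indexing the GroupBy with it asks for columns named False and True:
# group[mask]  # -> KeyError: 'Columns not found: False, True'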

A different result between concat and np.r_ when combining dataframe slices [duplicate]

This question already has answers here: pandas concat generates nan values (3 answers)
Let's assume the following DataFrame:
import pandas as pd
import numpy as np

df = pd.DataFrame({'group1': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
                   'group2': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
                   'group3': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
                   'group4': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
                   'group5': ['C', 'C', 'C', 'C', 'C', 'E', 'E', 'E'],
                   'group6': ['C', 'C', 'C', 'C', 'C', 'E', 'E', 'E'],
                   'group7': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
                   'time': [-6, -5, -4, -3, -2, -6, -3, -4],
                   'col': [1, 2, 3, 4, 5, 6, 7, 8]})
Now, I only wish to select certain slices from the dataframe and the first method I apply is concat:
a=df.iloc[:,0:2]
b=df.iloc[:,6:8]
df1=pd.concat([a,b],sort=False)
df1
The output I get from this code is the following
group1 group2 group7 time
0 A A NaN NaN
1 A A NaN NaN
2 A A NaN NaN
3 A A NaN NaN
4 A A NaN NaN
5 A A NaN NaN
6 A A NaN NaN
7 A A NaN NaN
0 NaN NaN A -6.0
1 NaN NaN A -5.0
2 NaN NaN A -4.0
3 NaN NaN A -3.0
4 NaN NaN A -2.0
5 NaN NaN A -6.0
6 NaN NaN A -3.0
7 NaN NaN A -4.0
It seems to be an odd result. But if I try with np.r_
df.iloc[:, np.r_[0:2, 6:8]]
The output is the correct one...
group1 group2 group7 time
0 A A A -6
1 A A A -5
2 A A A -4
3 A A A -3
4 A A A -2
5 A A A -6
6 A A A -3
7 A A A -4
Is there a more efficient way with concat to fix the output? And is np.r_ the best way to combine slices of DataFrames? If so, why?
Use axis=1. By default, concat stacks along axis=0 (the rows), so the two slices are appended vertically and each one's missing columns are filled with NaN. Concatenating along axis=1 aligns them side by side on the index instead:
a=df.iloc[:,0:2]
b=df.iloc[:,6:8]
df1=pd.concat([a,b],sort=False, axis=1)
group1 group2 group7 time
0 A A A -6
1 A A A -5
2 A A A -4
3 A A A -3
4 A A A -2
5 A A A -6
6 A A A -3
7 A A A -4
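For completeness, a small sketch of two other column-wise ways to combine slices of the same frame, join and np.r_, which give the same result (the frame and column names below are made up for the demo):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(32).reshape(4, 8),
                  columns=[f'c{i}' for i in range(8)])

# join aligns on the index, so the two slices line up row by row:
left = df.iloc[:, 0:2].join(df.iloc[:, 6:8])

# np.r_ builds one positional selection, avoiding the intermediate frames:
both = df.iloc[:, np.r_[0:2, 6:8]]

print(left.equals(both))  # True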

How to merge/combine columns in pandas?

I have an example DataFrame with 4 columns:
import numpy as np
import pandas as pd

data = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
        'B': [42, 52, np.nan, np.nan, np.nan, np.nan],
        'C': [np.nan, np.nan, 31, 2, np.nan, np.nan],
        'D': [np.nan, np.nan, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
A B C D
0 a 42.0 NaN NaN
1 b 52.0 NaN NaN
2 c NaN 31.0 NaN
3 d NaN 2.0 NaN
4 e NaN NaN 62.0
5 f NaN NaN 70.0
I would now like to merge/combine columns B, C, and D into a new column E, like in this example:
data2 = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
'E': [42, 52, 31, 2, 62, 70]}
df2 = pd.DataFrame(data2, columns = ['A', 'E'])
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
I found quite a similar question here, but that approach appends the merged columns B, C, and D below column A:
0 a
1 b
2 c
3 d
4 e
5 f
6 42
7 52
8 31
9 2
10 62
11 70
dtype: object
Thanks for any help.
Option 1
Using assign and drop
In [644]: cols = ['B', 'C', 'D']
In [645]: df.assign(E=df[cols].sum(axis=1)).drop(columns=cols)
Out[645]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
Option 2
Using assignment and drop
In [648]: df['E'] = df[cols].sum(axis=1)
In [649]: df = df.drop(columns=cols)
In [650]: df
Out[650]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
Option 3 (lately, this is the one I like best)
Using groupby
In [660]: df.groupby(np.where(df.columns == 'A', 'A', 'E'), axis=1).first()  # or sum, max, min
Out[660]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
In [661]: df.columns == 'A'
Out[661]: array([ True, False, False, False], dtype=bool)
In [662]: np.where(df.columns == 'A', 'A', 'E')
Out[662]:
array(['A', 'E', 'E', 'E'],
dtype='|S1')
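One caveat on Option 3: groupby(..., axis=1) is deprecated in recent pandas (2.1+). A sketch of an equivalent, assuming the same df, that transposes, groups the rows by the same label array, and transposes back:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'),
                   'B': [42, 52, np.nan, np.nan, np.nan, np.nan],
                   'C': [np.nan, np.nan, 31, 2, np.nan, np.nan],
                   'D': [np.nan, np.nan, np.nan, np.nan, 62, 70]})

# Group the transposed frame's rows (the original columns) by the label
# array and take the first non-null value per group, then transpose back:
labels = np.where(df.columns == 'A', 'A', 'E')
print(df.T.groupby(labels).first().T)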
The question as written asks for merge/combine rather than sum, so I'm posting this to help folks who land here looking to coalesce columns with combine_first, which can be a bit tricky.
df2 = pd.concat([df["A"],
df["B"].combine_first(df["C"]).combine_first(df["D"])],
axis=1)
df2.rename(columns={"B":"E"}, inplace=True)
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
What's so tricky about that? In this case, nothing. But suppose you were pulling the B, C, and D values from different dataframes in which the a, b, c, d, e, f labels were present, but not necessarily in the same order. combine_first() aligns on the index, so you'd need to tack a set_index() onto each of your df references.
df2 = pd.concat([df.set_index("A", drop=False)["A"],
df.set_index("A")["B"]\
.combine_first(df.set_index("A")["C"])\
.combine_first(df.set_index("A")["D"]).astype(int)],
axis=1).reset_index(drop=True)
df2.rename(columns={"B":"E"}, inplace=True)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
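If there are more than a couple of columns to coalesce, chaining combine_first calls gets verbose; a sketch that folds them with functools.reduce instead (reusing the question's df, with the fold order B, then C, then D chosen to match the chained version):

from functools import reduce

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'B': [42, 52, np.nan, np.nan, np.nan, np.nan],
                   'C': [np.nan, np.nan, 31, 2, np.nan, np.nan],
                   'D': [np.nan, np.nan, np.nan, np.nan, 62, 70]})

# Fold combine_first across the columns: for each row the first
# non-null value wins, scanning B, then C, then D.
df['E'] = reduce(lambda acc, c: acc.combine_first(df[c]), ['C', 'D'], df['B'])
print(df[['A', 'E']])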
Use difference to get the column names without A, then take the sum or max:
cols = df.columns.difference(['A'])
df['E'] = df[cols].sum(axis=1).astype(int)
# df['E'] = df[cols].max(axis=1).astype(int)
df = df.drop(cols, axis=1)
print (df)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
If there can be multiple values per row:
data = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
'B': [42, 52, np.nan, np.nan, np.nan, np.nan],
'C': [np.nan, np.nan, 31, 2, np.nan, np.nan],
'D': [10, np.nan, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
print (df)
A B C D
0 a 42.0 NaN 10.0
1 b 52.0 NaN NaN
2 c NaN 31.0 NaN
3 d NaN 2.0 NaN
4 e NaN NaN 62.0
5 f NaN NaN 70.0
cols = df.columns.difference(['A'])
df['E'] = df[cols].apply(lambda x: ', '.join(x.dropna().astype(int).astype(str)), axis=1)
df = df.drop(cols, axis=1)
print (df)
A E
0 a 42, 10
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
You can also use ffill with iloc:
df['E'] = df.iloc[:, 1:].ffill(axis=1).iloc[:, -1].astype(int)
df = df.iloc[:, [0, -1]]
print(df)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
Zero's third option using groupby requires a numpy import and only handles one column outside the set of columns to collapse, while jpp's answer using ffill requires that you know how the columns are ordered. Here's a solution with no extra dependencies that takes an arbitrary input DataFrame and only collapses the columns if every row in those columns holds a single value:
import pandas as pd

data = [{'A': 'a', 'B': 42, 'messy': 'z'},
        {'A': 'b', 'B': 52, 'messy': 'y'},
        {'A': 'c', 'C': 31},
        {'A': 'd', 'C': 2, 'messy': 'w'},
        {'A': 'e', 'D': 62, 'messy': 'v'},
        {'A': 'f', 'D': 70, 'messy': ['z']}]
df = pd.DataFrame(data)
cols = ['B', 'C', 'D']
new_col = 'E'
# Collapse only if every row holds exactly one non-null value across cols;
# after a row-wise ffill, that value always lands in the last column.
if df[cols].notna().sum(axis=1).eq(1).all():
    df2 = df.drop(columns=cols)
    df2[new_col] = df[cols].ffill(axis=1).iloc[:, -1]
print(df, '\n\n', df2)
Output:
A B messy C D
0 a 42.0 z NaN NaN
1 b 52.0 y NaN NaN
2 c NaN NaN 31.0 NaN
3 d NaN w 2.0 NaN
4 e NaN v NaN 62.0
5 f NaN [z] NaN 70.0
A messy E
0 a z 42.0
1 b y 52.0
2 c NaN 31.0
3 d w 2.0
4 e v 62.0
5 f [z] 70.0

How to replace a subset of a pandas dataframe with another series

I think this is a trivial question, but I just can't make it work.
import numpy as np
import pandas as pd

d = {'one': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'two': pd.Series([np.nan, 6, np.nan, 8], index=['a', 'b', 'c', 'd']),
     'three': pd.Series([10, 20, 30, np.nan], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
df
one three two
a 1 10.0 NaN
b 2 20.0 6.0
c 3 30.0 NaN
d 4 NaN 8.0
My series:
fill = pd.Series([30, 60])
I'd like to replace values in a specific column, let it be 'two', with my Series called fill wherever the column 'two' meets a condition: it is NaN. Can you help me with that?
My desired result:
df
one three two
a 1 10.0 30
b 2 20.0 6.0
c 3 30.0 60
d 4 NaN 8.0
I think you need loc with isnull, assigning the NumPy array created from fill via Series.values:
df.loc[df.two.isnull(), 'two'] = fill.values
print (df)
one three two
a 1 10.0 30.0
b 2 20.0 6.0
c 3 30.0 60.0
d 4 NaN 8.0
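One caveat worth noting: assigning fill.values is positional, so the array length must exactly match the number of NaNs being replaced. A sketch, with a label-aligned alternative (the index ['a', 'c'] in the commented line is chosen by hand for this example):

import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4],
                   'two': [np.nan, 6, np.nan, 8],
                   'three': [10, 20, 30, np.nan]},
                  index=list('abcd'))

# Positional: len(fill) must equal the number of NaNs in 'two'.
fill = pd.Series([30, 60])
df.loc[df['two'].isnull(), 'two'] = fill.values

# Label-aligned alternative: index fill by the target rows and let
# fillna match on the index instead of on position:
# df['two'] = df['two'].fillna(pd.Series([30, 60], index=['a', 'c']))

print(df)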

python pandas: pivot_table silently drops indices with nans

Is there an option not to drop the indices with NaN in them? I think silently dropping these rows from the pivot will at some point cause someone serious pain.
import pandas
import numpy
a = [['a', 'b', 12, 12, 12], ['a', numpy.nan, 12.3, 233., 12], ['b', 'a', 123.23, 123, 1], ['a', 'b', 1, 1, 1.]]
df = pandas.DataFrame(a, columns=['a', 'b', 'c', 'd', 'e'])
df_pivot = df.pivot_table(index=['a', 'b'], values=['c', 'd', 'e'], aggfunc=sum)
print(df)
print(df_pivot)
Output:
a b c d e
0 a b 12.00 12 12
1 a NaN 12.30 233 12
2 b a 123.23 123 1
3 a b 1.00 1 1
c d e
a b
a b 13.00 13 13
b a 123.23 123 1
This is currently not supported, see this issue for the enhancement: https://github.com/pydata/pandas/issues/3729.
A workaround is to fill the NaN in the index with a dummy value, pivot, and then replace the dummy with NaN again:
In [28]: df = df.reset_index()
In [29]: df['b'] = df['b'].fillna('dummy')
In [30]: df['dummy'] = np.nan
In [31]: df
Out[31]:
a b c d e dummy
0 a b 12.00 12 12 NaN
1 a dummy 12.30 233 12 NaN
2 b a 123.23 123 1 NaN
3 a b 1.00 1 1 NaN
In [32]: df.pivot_table(index=['a', 'b'], values=['c', 'd', 'e'], aggfunc=sum)
Out[32]:
c d e
a b
a b 13.00 13 13
dummy 12.30 233 12
b a 123.23 123 1
In [33]: df.pivot_table(index=['a', 'b'], values=['c', 'd', 'e'], aggfunc=sum).reset_index().replace('dummy',np.nan).set_index(['a','b'])
Out[33]:
c d e
a b
a b 13.00 13 13
NaN 12.30 233 12
b a 123.23 123 1
Currently the option dropna=False is supported by pivot_table (note that the old rows= keyword is index= in current pandas):
df.pivot_table(index=['a', 'b'], values=['c', 'd', 'e'], aggfunc=sum, dropna=False)
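Since pandas 1.1, groupby also accepts dropna=False, so a sketch of an equivalent aggregation that keeps the NaN keys:

import numpy as np
import pandas as pd

a = [['a', 'b', 12, 12, 12], ['a', np.nan, 12.3, 233., 12],
     ['b', 'a', 123.23, 123, 1], ['a', 'b', 1, 1, 1.]]
df = pd.DataFrame(a, columns=['a', 'b', 'c', 'd', 'e'])

# dropna=False keeps the group whose key contains NaN in column 'b':
print(df.groupby(['a', 'b'], dropna=False)[['c', 'd', 'e']].sum())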
