Pandas: combine two columns if one is empty - python

I have a table:
A   B    C
x   1    NA
y   NA   4
z   2    NA
p   NA   5
t   6    7
I want to create a new column D which should combine columns B and C if one of the columns is empty (NA):
A   B    C    D
x   1    NA   1
y   NA   4    4
z   2    NA   2
p   NA   5    5
t   6    7    error
In case both columns contain a value, it should return the text 'error' inside the cell.

You could first calculate a mask of the rows where both values are present, and then fill the NA values of, let's say, column B with the values from column C. Using the mask calculated in the first step, simply assign NA where needed.
error_mask = df['B'].notna() & df['C'].notna()
df['D'] = df['B'].fillna(df['C'])
df.loc[error_mask, 'D'] = pd.NA
df
   A     B     C     D
0  x     1  <NA>     1
1  y  <NA>     4     4
2  z     2  <NA>     2
3  p  <NA>     5     5
4  t     6     7  <NA>
OR
df['D'] = df['D'].astype(str)
df.loc[error_mask, 'D'] = 'error'
I would advise against assigning the string 'error' where both values are present, since that would make the whole D column an object dtype.
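A minimal sketch of that dtype effect, using np.nan as the NA marker and hypothetical data mirroring the question:
import pandas as pd
import numpy as np

# hypothetical frame mirroring the question's B/C columns
df = pd.DataFrame({'B': [1, np.nan, 2, np.nan, 6],
                   'C': [np.nan, 4, np.nan, 5, 7]})

d_numeric = df['B'].fillna(df['C'])  # stays float64
d_error = d_numeric.mask(df['B'].notna() & df['C'].notna(), 'error')

print(d_numeric.dtype)  # float64
print(d_error.dtype)    # object -- the string 'error' forces an object column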

There are several ways to achieve this.
Using fillna and mask:
df['D'] = df['B'].fillna(df['C']).mask(df['B'].notna()&df['C'].notna(), 'error')
Or numpy.select:
m1 = df['B'].notna()
m2 = df['C'].notna()
df['D'] = np.select([m1&m2, m1], ['error', df['B']], df['C'])
Output:
A B C D
0 x 1.0 NaN 1.0
1 y NaN 4.0 4.0
2 z 2.0 NaN 2.0
3 p NaN 5.0 5.0
4 t 6.0 7.0 error

Adding to the previous answer, you can address this with a series of .apply() methods paired with lambda functions.
Consider the dataframe that you presented, with np.nan as the NA values:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'B': [1, np.nan, 2, np.nan, 6],
    'C': [np.nan, 4, np.nan, 5, 7]})
First generate a list of the elements from the series in question:
df['D'] = df.apply(lambda x: list(x), axis=1)
This will net you a pd.Series whose elements are lists of values, e.g. [1.0, nan] for the first row. Next, remove all np.nan elements by exploiting the fact that np.nan != np.nan in numpy (see also this answer: How can I remove Nan from list Python/NumPy):
df['E'] = df['D'].apply(lambda x: [i for i in x if i == i])
Finally, create the error by filtering based on length.
df['F'] = df['E'].apply(lambda x: x[0] if len(x) == 1 else 'error')
The resulting dataframe works like this:
B C D E F
0 1.0 NaN [1.0, nan] [1.0] 1.0
1 NaN 4.0 [nan, 4.0] [4.0] 4.0
2 2.0 NaN [2.0, nan] [2.0] 2.0
3 NaN 5.0 [nan, 5.0] [5.0] 5.0
4 6.0 7.0 [6.0, 7.0] [6.0, 7.0] error
Of course you could chain all this together in a not-so-pythonic, yet single-line answer:
a = df.apply(lambda x: list(x), axis=1).apply(lambda x: [i for i in x if i == i]).apply(lambda x: x[0] if len(x) == 1 else 'error')

Have a look at the function combine_first:
df['C'].combine_first(df['B']).mask(df['B'].notna() & df['C'].notna(), 'error')
Output:
0 1.0
1 4.0
2 2.0
3 5.0
4 error
Name: C, dtype: object
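To store the result as the new column D, a minimal follow-up sketch (not part of the original answer):
df['D'] = df['C'].combine_first(df['B']).mask(df['B'].notna() & df['C'].notna(), 'error')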

Related

Keep only the 1st non-null value in each row (and replace others with NaN)

I have the following dataframe:
a b
0 3.0 10.0
1 2.0 9.0
2 NaN 8.0
For each row, I need to drop (and replace with NaN) all values, excluding the first non-null one.
This is the expected output:
a b
0 3.0 NaN
1 2.0 NaN
2 NaN 8.0
I know that using the justify function I can identify the first n non-null values, but I need to keep the same structure of the original dataframe.
One way to go would be:
import pandas as pd

data = {'a': {0: 3.0, 1: 2.0, 2: None}, 'b': {0: 10.0, 1: 9.0, 2: 8.0}}
df = pd.DataFrame(data)

def keep_first_valid(x):
    first_valid = x.first_valid_index()
    return x.mask(x.index != first_valid)

df = df.apply(lambda x: keep_first_valid(x), axis=1)
df
a b
0 3.0 NaN
1 2.0 NaN
2 NaN 8.0
So, the first x passed to the function would be pd.Series([3.0, 10.0], index=['a','b']).
Inside the function, first_valid = x.first_valid_index() will store 'a'; see df.first_valid_index.
Finally, x.mask(...) returns pd.Series([3.0, None], index=['a','b']), which apply assigns back to the df.
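A quick sketch of that first-row behaviour, reusing the names above:
import pandas as pd

row = pd.Series([3.0, 10.0], index=['a', 'b'])
first_valid = row.first_valid_index()       # 'a'
print(row.mask(row.index != first_valid))
# a    3.0
# b    NaN
# dtype: float64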
try this:
f = df.copy()
f[:] = f.columns  # a frame of the same shape, holding the column name in every cell
fv_idx = df.apply(pd.Series.first_valid_index, axis=1).values[:, None]  # first valid column label per row
res = df.where(f == fv_idx)  # keep only the cell whose column label matches that first valid label
print(res)
>>>
a b
0 3.0 NaN
1 2.0 NaN
2 NaN 8.0

Pandas - combine two columns

I have 2 columns, which we'll call x and y. I want to create a new column called xy:
x   y   xy
1       1
2       2
    4   4
    8   8
There shouldn't be any conflicting values, but if there are, y takes precedence. If it makes the solution easier, you can assume that x will always be NaN where y has a value.
It could be quite simple if your example is accurate:
df = df.fillna(0)  # if the blanks are NaN you will need this line first
df['xy'] = df['x'] + df['y']
Note that your column type right now is string, not numeric any more, so coerce first:
df = df.apply(lambda x : pd.to_numeric(x, errors='coerce'))
df['xy'] = df.sum(1)
More
df['xy'] = df[['x','y']].astype(str).apply(''.join, 1)
#df[['x','y']].astype(str).apply(''.join,1)
Out[655]:
0 1.0
1 2.0
2
3 4.0
4 8.0
dtype: object
You can also use NumPy:
import pandas as pd, numpy as np
df = pd.DataFrame({'x': [1, 2, np.nan, np.nan],
                   'y': [np.nan, np.nan, 4, 8]})
arr = df.values
# this works because each row has exactly one non-NaN value
df['xy'] = arr[~np.isnan(arr)].astype(int)
print(df)
x y xy
0 1.0 NaN 1
1 2.0 NaN 2
2 NaN 4.0 4
3 NaN 8.0 8

Get list of DataFrame column names for non-float columns

I am trying to get a list of column names from a DataFrame corresponding to columns that aren't of type float. Right now I have
categorical = (df.dtypes.values != np.dtype('float64'))
which gives me a boolean array indicating which columns are not float, but this is not exactly what I'm looking for. Specifically, I would like a list of the column names that correspond to the True values in my boolean array.
Use boolean indexing with df.columns:
categorical = df.columns[(df.dtypes.values != np.dtype('float64'))]
Or get difference of columns selected by select_dtypes:
categorical = df.columns.difference(df.select_dtypes('float64').columns)
Sample:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7.,8,9,4,2,3],
                   'D':[1,3,5.,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 7.0 1.0 5 a
1 b 5 8.0 3.0 3 a
2 c 4 9.0 5.0 6 a
3 d 5 4.0 7.0 9 b
4 e 5 2.0 1.0 2 b
5 f 4 3.0 0.0 4 b
categorical = df.columns.difference(df.select_dtypes('float64').columns)
print (categorical)
Index(['A', 'B', 'E', 'F'], dtype='object')

Group rows in list and transpose pandas

I have found the question "grouping rows in list in pandas groupby" and need to go a step further.
The output required by that question was:
A [1,2]
B [5,5,4]
C [6]
What I'm trying to achieve is:
A  B  C
1  5  6
2  5
   4
I have tried using:
grouped=dataSet.groupby('Column1')
df = grouped.aggregate(lambda x: list(x))
The output I'm stuck with is:
df.T
Column1 A B C
[1,2] [5,5,4] [6]
I think here there is no need to use columns of lists.
You can achieve your result using a simple dictionary comprehension over the groups generated by groupby:
out = pd.concat({key: group['b'].reset_index(drop=True)
                 for key, group in df.groupby('a')},
                axis=1)
which gives the desired output:
out
Out[59]:
A B C
0 1.0 5 6.0
1 2.0 5 NaN
2 NaN 4 NaN
I believe you need to create a DataFrame with the constructor:
df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
s = df.groupby('a')['b'].apply(list)
df = pd.DataFrame(s.values.tolist(), index=s.index).T
print (df)
a A B C
0 1.0 5.0 6.0
1 2.0 5.0 NaN
2 NaN 4.0 NaN

Concatenate multiple pandas series efficiently

I understand that I can use combine_first to merge two series:
series1 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
series2 = pd.Series([1,2,3,4,5],index=['f','g','h','i','j'])
series3 = pd.Series([1,2,3,4,5],index=['k','l','m','n','o'])
Combine1 = series1.combine_first(series2)
print(Combine1)
Output:
a 1.0
b 2.0
c 3.0
d 4.0
e 5.0
f 1.0
g 2.0
h 3.0
i 4.0
j 5.0
dtype: float64
What if I need to merge 3 or more series?
I understand that using the following code: print(series1 + series2 + series3) yields:
a NaN
b NaN
c NaN
d NaN
e NaN
f NaN
...
dtype: float64
Can I merge multiple series efficiently without using combine_first multiple times?
Thanks
Combine Series with Non-Overlapping Indexes
To combine series vertically, use pd.concat.
# Setup
series_list = [
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('fghij')),
    pd.Series(range(1, 6), index=list('klmno'))
]
pd.concat(series_list)
a 1
b 2
c 3
d 4
e 5
f 1
g 2
h 3
i 4
j 5
k 1
l 2
m 3
n 4
o 5
dtype: int64
Combine with Overlapping Indexes
series_list = [
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('kbmdf'))
]
If the Series have overlapping indices, you can either combine (add) the values for each index key,
pd.concat(series_list, axis=1, sort=False).sum(axis=1)
a 2.0
b 6.0
c 6.0
d 12.0
e 10.0
k 1.0
m 3.0
f 5.0
dtype: float64
Alternatively, just drop duplicate values on the index if you only want to keep the first/last value where there are duplicates.
res = pd.concat(series_list, axis=0)
# keep first value
res[~res.index.duplicated(keep='first')]
# keep last value
res[~res.index.duplicated(keep='last')]
Presuming that you want the behavior of combine_first, prioritizing the values of the series in order (which is what combine_first is meant for), you can succinctly make multiple calls to it with functools.reduce and a lambda expression.
from functools import reduce
l_series = [series1, series2, series3]
reduce(lambda s1, s2: s1.combine_first(s2), l_series)
Of course if the indices are unique as in your current example, you can simply use pd.concat instead.
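For instance, with the list defined above (all indices unique), a minimal sketch:
pd.concat(l_series)  # plain vertical concatenation, no prioritization needed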
Demo
series1 = pd.Series(list(range(5)),index=['a','b','c','d','e'])
series2 = pd.Series(list(range(5, 10)),index=['a','g','h','i','j'])
series3 = pd.Series(list(range(10, 15)),index=['k','b','m','c','o'])
from functools import reduce
l_series = [series1, series2, series3]
print(reduce(lambda s1, s2: s1.combine_first(s2), l_series))
# a 0.0
# b 1.0
# c 2.0
# d 3.0
# e 4.0
# g 6.0
# h 7.0
# i 8.0
# j 9.0
# k 10.0
# m 12.0
# o 14.0
# dtype: float64
I agree with what #codespeed has pointed out in his answer.
I think it will depend on your needs. If the series indexes are confirmed to have no overlap, concat is the better option (as in the original question, where there is no index overlap).
If there is index overlap, you need to consider how to handle the overlap and which values should be overwritten (as in the example provided by codespeed: if indexes map to different values, be careful with combine_first).
For example (note that series3 is the same as series1, and series4 is the same as series2):
import pandas as pd
import numpy as np
series1 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
series2 = pd.Series([2,3,4,4,5],index=['a','b','c','i','j'])
series3 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
series4 = pd.Series([2,3,4,4,5],index=['a','b','c','i','j'])
print(series1.combine_first(series2))
a 1.0
b 2.0
c 3.0
d 4.0
e 5.0
i 4.0
j 5.0
dtype: float64
print(series4.combine_first(series3))
a 2.0
b 3.0
c 4.0
d 4.0
e 5.0
i 4.0
j 5.0
dtype: float64
You would use combine_first if you want one series's values prioritized over the other. It's usually used to fill the missing values in the first series. I am not sure what the expected output is in your example, but it looks like you can use concat:
pd.concat([series1, series2, series3])
You get
a 1
b 2
c 3
d 4
e 5
f 1
g 2
h 3
i 4
j 5
k 1
l 2
m 3
n 4
o 5
