Replacing empty values with NaN in object/categorical variables - python

So I've searched SO for this and found a bunch of useful threads on how to replace empty values with NaN. However, I can't get any of them to work on my DataFrame.
I've used:
df.replace('', np.NaN)
df3 = df.applymap(lambda x: np.nan if x == '' else x)
and even:
df.iloc[:,86:350] = df.iloc[:,86:350].apply(lambda x: x.str.strip()).replace('', np.nan)
and the code runs without error, but when I look in my DataFrame I still have b'' values instead of NaN. Any ideas on what I am missing?
I'm sorry for not giving code to reproduce this; I don't know how to, and I suspect it's specific to my DataFrame, which I imported from SPSS. These values were string variables in SPSS, if that helps.

You were close with your second try:
df = df.applymap(lambda x: np.NaN if not x else x)
To show that both '' and b'' are falsy, so that not x evaluates to True for both:
l = ['', b'']
for x in l:
    if x:
        print('Not empty')
    else:
        print('Empty')
>>> Empty
>>> Empty
Sample:
from pandas import DataFrame
from numpy import NaN
df = DataFrame([[1, 2, ''], ['', b'', 3], [4, 5, b'']])
print(df)
# Output
   0    1    2
0  1    2
1     b''    3
2  4    5  b''
df2 = df.applymap(lambda x: NaN if not x else x)
print(df2)
# Output
     0    1    2
0    1    2  NaN
1  NaN  NaN    3
2    4    5  NaN
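As an aside, the first attempt with replace also works here, provided both sentinels are listed and the result is assigned back (replace is not in-place by default). A minimal sketch under the same setup:
import numpy as np
# Replace both the empty string and the empty bytes value in one call,
# and assign the result back, since replace returns a new DataFrame.
df3 = df.replace(['', b''], np.nan)
print(df3)  # same result as df2 above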

Related

pandas combine columns without null keep string values

I want to combine columns, skipping nulls, and keep the string values.
Example data:
a,b,c
123.jpg,213.jpg,987.jpg
,159.jpg,
Here is my code:
cols = ['a','b','c']
df['combine_columns'] = df[cols].stack().groupby(level=0).agg(','.join)
print(df)
And the result:
a,b,c,combine_columns
123.jpg,213.jpg,987.jpg,"123.jpg,213.jpg,987.jpg"
,159.jpg,,159.jpg
But I want something like this:
a,b,c,combine_columns
123.jpg,213.jpg,987.jpg,""123.jpg","213.jpg","987.jpg""
,159.jpg,,"159.jpg"
How can I do this?
You can use apply with a list comprehension and pandas.notna as filter:
df['combine_columns'] = df.apply(lambda x: ','.join([e for e in x if pd.notna(e)]),
                                 axis=1)
output:
         a        b        c          combine_columns
0  123.jpg  213.jpg  987.jpg  123.jpg,213.jpg,987.jpg
1      NaN  159.jpg      NaN                  159.jpg
Adding the extra " around the string:
df['combine_columns'] = df.apply(lambda x: '"%s"' % ','.join([e for e in x if pd.notna(e)]),
                                 axis=1)
output:
         a        b        c            combine_columns
0  123.jpg  213.jpg  987.jpg  "123.jpg,213.jpg,987.jpg"
1      NaN  159.jpg      NaN                  "159.jpg"
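If the desired output is really each value quoted individually, as shown in the question, a small variant of the same idea should do it (a sketch, same assumptions as above):
# Quote each non-null value before joining, e.g. "123.jpg","213.jpg","987.jpg"
df['combine_columns'] = df[['a', 'b', 'c']].apply(
    lambda x: ','.join('"%s"' % e for e in x if pd.notna(e)),
    axis=1)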

Having trouble replacing empty strings with NaN using Pandas.DataFrame.replace()

I have a pandas dataframe which has some observations with empty strings which I want to replace with NaN (np.nan).
I am successfully replacing most of these empty strings using
df.replace(r'\s+',np.nan,regex=True).replace('',np.nan)
But I am still finding empty strings. For example, when I run
sub_df = df[df['OBJECT_COL'] == '']
sub_df.replace(r'\s+', np.nan, regex = True)
print(sub_df['OBJECT_COL'] == '')
The output is all True.
Is there a different method I should be trying? Is there a way to read the encoding of these cells such that perhaps my .replace() is not effective because the encoding is weird?
A few other alternatives.
sub_df.replace(r'^\s+$', np.nan, regex=True)
OR, to replace an empty string and records with only spaces
sub_df.replace(r'^\s*$', np.nan, regex=True)
Alternative:
using apply() with a lambda function.
sub_df.apply(lambda x: x.str.strip()).replace('', np.nan)
An example illustration:
>>> import numpy as np
>>> import pandas as pd
Example DataFrame having empty strings and whitespaces..
>>> sub_df
        col_A
0
1
2   somevalue
3  othervalue
4
Solutions applied for the different conditions:
1) Best solution, handles both cases:
>>> sub_df.replace(r'\s+', np.nan, regex=True).replace('', np.nan)
        col_A
0         NaN
1         NaN
2   somevalue
3  othervalue
4         NaN
2) This works only partially: it catches whitespace-only strings but not the empty string:
>>> sub_df.replace(r'^\s+$', np.nan, regex=True)
        col_A
0
1         NaN
2   somevalue
3  othervalue
4         NaN
3) This works for both conditions:
>>> sub_df.replace(r'^\s*$', np.nan, regex=True)
        col_A
0         NaN
1         NaN
2   somevalue
3  othervalue
4         NaN
4) This also works for both conditions:
>>> sub_df.apply(lambda x: x.str.strip()).replace('', np.nan)
        col_A
0         NaN
1         NaN
2   somevalue
3  othervalue
4         NaN
pd.Series.replace does not work in-place by default. You need to specify inplace=True explicitly:
sub_df.replace(r'\s+', np.nan, regex=True, inplace=True)
Or, alternatively, assign back to sub_df:
sub_df = sub_df.replace(r'\s+', np.nan, regex=True)
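Note also that sub_df = df[df['OBJECT_COL'] == ''] creates a filtered copy, so even an in-place replace on sub_df will not change df itself (and may raise a SettingWithCopyWarning). If the goal is to clean df, it is safer to operate on it directly; a minimal sketch:
# Replace empty and whitespace-only strings in the original frame,
# assigning the result back instead of mutating a filtered copy.
df['OBJECT_COL'] = df['OBJECT_COL'].replace(r'^\s*$', np.nan, regex=True)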
Try np.where:
df['OBJECT_COL'] = np.where(df['OBJECT_COL'] == '', np.nan, df['OBJECT_COL'])
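A quick self-contained illustration of the np.where approach (the column values here are made up for demonstration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'OBJECT_COL': ['a', '', 'b', '']})
# Where the condition is True take np.nan, otherwise keep the original value.
df['OBJECT_COL'] = np.where(df['OBJECT_COL'] == '', np.nan, df['OBJECT_COL'])
print(df)
#   OBJECT_COL
# 0          a
# 1        NaN
# 2          b
# 3        NaN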

how to remove 0's from a string without impacting other cells in pandas data frame?

I have a data frame which has "0"s in it and looks as below:
df = pd.DataFrame({
    'WARNING': ['4402,43527,0,7628,54337', 4402, 0, 0, '0,1234,56437,76252', 0, 3602],
    'FAILED': [0, 0, '5555,6753,0', '4572,0,8764,8753', 9876, 0, '0,4579,7514']
})
I want to remove the zeroes from the strings that contain multiple values, so that the resulting df looks like this:
df = pd.DataFrame({
    'WARNING': ['4402,43527,7628,54337', 4402, 0, 0, '1234,56437,76252', 0, 3602],
    'FAILED': [0, 0, '5555,6753', '4572,8764,8753', 9876, 0, '4579,7514']
})
However, cells that hold an individual 0 should remain intact. How do I achieve this?
df = pd.DataFrame({
    'WARNING': ['0,0786,1230,01234,0', 4402, 0, 0, '0,1234,56437,76252', 0, 3602],
    'FAILED': [0, 0, '5555,6753,0', '4572,0,8764,8753', 9876, 0, '0,4579,7514']
})
# str.strip('0,|,0') strips any of the characters {'0', ',', '|'} from both
# ends of each string; non-string cells become NaN under the .str accessor.
df.apply(lambda x: x.str.strip('0,|,0')).replace(",0,", ",")
Output:
            WARNING            FAILED
0    786,1230,01234               NaN
1               NaN               NaN
2               NaN         5555,6753
3               NaN  4572,0,8764,8753
4  1234,56437,76252               NaN
5               NaN               NaN
6               NaN         4579,7514
I would solve it with a list comprehension.
In [1]: df.apply(lambda col: col.astype(str).apply(lambda x: ','.join([y for y in x.split(',') if y != '0']) if ',' in x else x), axis=0)
Out[1]:
           FAILED                WARNING
0               0  4402,43527,7628,54337
1               0                   4402
2       5555,6753                      0
3  4572,8764,8753                      0
4            9876       1234,56437,76252
5               0                      0
6       4579,7514                   3602
Breaking it down:
Iterate over all columns with df.apply(lambda col: ..., axis=0)
Convert each column's values to strings with col.astype(str)
Apply a function to each "cell" of col with .apply(lambda x: ...)
The lambda function first checks whether ',' exists in x, and otherwise returns the original value of x
If ',' is in x, it splits x by ',', which creates a list of substrings y
It keeps only those with y != '0'
It joins everything back together with ','.join(...)
You can use a regex with a negative lookbehind to replace 0, only when it is not preceded by another digit.
import re
df.applymap(lambda x: re.sub(r'(?<![0-9])0,', '', str(x)))
                 WARNING          FAILED
0  4402,43527,7628,54337               0
1                   4402               0
2                      0     5555,6753,0
3                      0  4572,8764,8753
4       1234,56437,76252            9876
5                      0               0
6                   3602       4579,7514
For the test case W-B points out:
s = '0,0999,9990,999'
re.sub(r'(?<![0-9])0,', '', s)
#'0999,9990,999'
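Note that the lookbehind pattern leaves a trailing ',0' untouched (see '5555,6753,0' in the output above). If trailing zeros should go as well, one possible extension is to also match ',0' at a right boundary; a sketch, not verified against every edge case:
import re

# Remove '0' items at the start ('0,'), in the middle (',0,' via either
# alternative) or at the end (',0'), while leaving digits inside larger
# numbers such as '9990' or '0123' alone.
def drop_zero_items(x):
    return re.sub(r'(?<![0-9])0,|,0(?![0-9])', '', str(x))

print(drop_zero_items('5555,6753,0'))  # 5555,6753
print(drop_zero_items('0,0999,9990'))  # 0999,9990
print(drop_zero_items('0'))            # 0 (standalone zero kept)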

Pandas apply & map to every element of every column

How to apply a custom function to every element of every column if its the value is not null?
Lets say I have a data frame of 10 columns, out of which I want to apply a lower() function to every element of just 4 columns if pd.notnull(x), else just keep none as value.
I tried to use like this,
s.apply(lambda x: change_to_lowercase(x), axis = 1)
def change_to_lowercase(s):
    s['A'] = s['A'].map(lambda x: x.lower() if pd.notnull(x) else x)
    s['B'] = s['B'].map(lambda x: x.lower() if pd.notnull(x) else x)
    s['C'] = s['C'].map(lambda x: x.lower() if pd.notnull(x) else x)
    s['D'] = s['D'].map(lambda x: x.lower() if pd.notnull(x) else x)
    return s
But my columns have mixed datatypes (the NaNs are floats, the rest are unicode), so this throws an error:
float has no attribute map.
How to get rid of this error?
I think you need DataFrame.applymap, because it works elementwise:
L = [[1.5, 'Test', np.nan, 2], ['Test', np.nan, 2, 'TEST'], ['Test', np.nan, 1.5, 2]]
df = pd.DataFrame(L, columns=list('abcd'))
print(df)
      a     b    c     d
0   1.5  Test  NaN     2
1  Test   NaN  2.0  TEST
2  Test   NaN  1.5     2
cols = ['a','b']
# for python 2 change str to basestring
df[cols] = df[cols].applymap(lambda x: x.lower() if isinstance(x, str) else x)
print(df)
      a     b    c     d
0   1.5  test  NaN     2
1  test   NaN  2.0  TEST
2  test   NaN  1.5     2
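On recent pandas (2.1 and later) applymap is deprecated in favor of the elementwise DataFrame.map; the same idea, assuming the behavior is otherwise identical:
# pandas >= 2.1: DataFrame.map is the elementwise replacement for applymap
df[cols] = df[cols].map(lambda x: x.lower() if isinstance(x, str) else x)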
You are trying to map a Series, and then inside the lambda you take the entire row.
You should also check for integers, floats, etc. that have no .lower() method. So the best approach, in my opinion, is to check whether the value is a string, not just whether it is notnull.
This works:
s = pd.DataFrame([{'A': 1.5, 'B': "Test", 'C': np.nan, 'D': 2}])
s
     A     B   C  D
0  1.5  Test NaN  2
s1 = s.apply(lambda x: x[0].lower() if isinstance(x[0], basestring) else x[0]).copy()
s1
A     1.5
B    test
C     NaN
D       2
dtype: object
For python 3, check for a string with isinstance(x[0], str).
To be able to select columns:
s1 = pd.DataFrame()
columns = ["A", "B"]
for column in columns:
    s1[column] = s[column].apply(lambda x: x.lower() if isinstance(x, str) else x).copy()
s1
     A     B
0  1.5  test

How to print rows if values appear in any column of pandas dataframe

I would like to print all rows of a dataframe where I find the value '-' in any of the columns. Can someone please explain a way that is better than those described below?
This Q&A already explains how to do so by using boolean indexing but each column needs to be declared separately:
print df.ix[df['A'].isin(['-']) | df['B'].isin(['-']) | df['C'].isin(['-'])]
I tried the following but I get an error 'Cannot index with multidimensional key':
df.ix[df[df.columns.values].isin(['-'])]
So I used this code but I'm not happy with the separate printing for each column tested because it is harder to work with and can print the same row more than once:
import pandas as pd
d = {'A': [1,2,3], 'B': [4,'-',6], 'C': [7,8,'-']}
df = pd.DataFrame(d)
for i in range(len(d.keys())):
    temp = df.ix[df.iloc[:,i].isin(['-'])]
    if temp.shape[0] > 0:
        print temp
Output looks like this:
   A  B  C
1  2  -  8

[1 rows x 3 columns]

   A  B  C
2  3  6  -

[1 rows x 3 columns]
Thanks for your advice.
Alternatively, you could do something like df[df.isin(["-"]).any(axis=1)], e.g.
>>> df = pd.DataFrame({'A': [1,2,3], 'B': ['-','-',6], 'C': [7,8,9]})
>>> df.isin(["-"]).any(axis=1)
0     True
1     True
2    False
dtype: bool
>>> df[df.isin(["-"]).any(axis=1)]
   A  B  C
0  1  -  7
1  2  -  8
(Note I changed the frame a bit so I wouldn't get the axes wrong.)
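An equivalent spelling that skips building the list, assuming the same exact-match semantics are wanted:
>>> df[df.eq('-').any(axis=1)]
   A  B  C
0  1  -  7
1  2  -  8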
You can do:
>>> idx = df.apply(lambda ts: any(ts == '-'), axis=1)
>>> df[idx]
   A  B  C
1  2  -  8
2  3  6  -
or
lambda ts: '-' in ts.values
note that 'in' looks into the index, not the values, so you need .values
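A short demonstration of that pitfall (values made up for illustration):
import pandas as pd

ts = pd.Series(['-', 'x'], index=['a', 'b'])
print('-' in ts)         # False: 'in' checks the index labels ('a', 'b')
print('-' in ts.values)  # True: checks the actual values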
