I have a pandas dataframe which has some observations with empty strings which I want to replace with NaN (np.nan).
I am successfully replacing most of these empty strings using
df.replace(r'\s+',np.nan,regex=True).replace('',np.nan)
But I am still finding empty strings. For example, when I run
sub_df = df[df['OBJECT_COL'] == '']
sub_df.replace(r'\s+', np.nan, regex = True)
print(sub_df['OBJECT_COL'] == '')
The output is all True.
Is there a different method I should be trying? Is there a way to inspect the contents of these cells, in case my .replace() is not effective because the encoding is unusual?
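One way to check what those cells actually contain is to print their repr(), which exposes hidden characters such as non-breaking spaces or byte strings that a plain .replace('', np.nan) will not match. A minimal diagnostic sketch, reusing the column name from the question:
# Show the raw representation of every distinct blank-looking value
blanks = df.loc[df['OBJECT_COL'].astype(str).str.strip() == '', 'OBJECT_COL']
print(blanks.map(repr).unique())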
Other alternatives:
sub_df.replace(r'^\s+$', np.nan, regex=True)
Or, to replace both empty strings and records containing only spaces:
sub_df.replace(r'^\s*$', np.nan, regex=True)
Alternative:
Using apply() with a lambda function:
sub_df.apply(lambda x: x.str.strip()).replace('', np.nan)
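One caveat, an addition not from the original answer: .str.strip() raises an AttributeError on non-string columns, so if the frame mixes dtypes you can restrict the lambda to the object columns. A sketch:
import numpy as np

obj_cols = sub_df.select_dtypes(include='object').columns  # string-like columns only
sub_df[obj_cols] = sub_df[obj_cols].apply(lambda s: s.str.strip()).replace('', np.nan)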
Example illustration:
>>> import numpy as np
>>> import pandas as pd
An example DataFrame containing empty strings and whitespace-only values:
>>> sub_df
col_A
0
1
2 somevalue
3 othervalue
4
Solutions applied for the different conditions:
Best Solution:
1)
>>> sub_df.replace(r'\s+',np.nan,regex=True).replace('',np.nan)
col_A
0 NaN
1 NaN
2 somevalue
3 othervalue
4 NaN
2) This only works partially; \s+ requires at least one whitespace character, so a truly empty string is left untouched:
>>> sub_df.replace(r'^\s+$', np.nan, regex=True)
col_A
0
1 NaN
2 somevalue
3 othervalue
4 NaN
3) This works for both conditions, because \s* also matches the empty string:
>>> sub_df.replace(r'^\s*$', np.nan, regex=True)
col_A
0 NaN
1 NaN
2 somevalue
3 othervalue
4 NaN
4) This also works for both conditions.
>>> sub_df.apply(lambda x: x.str.strip()).replace('', np.nan)
col_A
0 NaN
1 NaN
2 somevalue
3 othervalue
4 NaN
pd.Series.replace does not work in-place by default. You need to specify inplace=True explicitly:
sub_df.replace(r'\s+', np.nan, regex=True, inplace=True)
Or, alternatively, assign back to sub_df:
sub_df = sub_df.replace(r'\s+', np.nan, regex=True)
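A minimal sketch of the difference, with a made-up frame using the column name from the question:
import numpy as np
import pandas as pd

sub_df = pd.DataFrame({'OBJECT_COL': ['', '  ', 'value']})

sub_df.replace(r'\s+', np.nan, regex=True)   # result is discarded; sub_df is unchanged
print((sub_df['OBJECT_COL'] == '').any())    # True

sub_df = sub_df.replace(r'\s+', np.nan, regex=True).replace('', np.nan)
print((sub_df['OBJECT_COL'] == '').any())    # False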
Try np.where:
df['OBJECT_COL'] = np.where(df['OBJECT_COL'] == '', np.nan, df['OBJECT_COL'])
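A small self-contained example of the np.where approach (data invented for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'OBJECT_COL': ['', 'value', '']})
df['OBJECT_COL'] = np.where(df['OBJECT_COL'] == '', np.nan, df['OBJECT_COL'])
print(df)
#   OBJECT_COL
# 0        NaN
# 1      value
# 2        NaN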
Related question:
I have the following mapping
mapping = {'sum12':2, 'sum6':1,
'avg12':2, 'avg6':1,
'diff':3, 'mean':4}
and I have a dataframe with variables like
var1 var2
0 abc_sum12 mean_jkl
1 pqr_sum6 pqr_avg6
2 diff_xyz qwerty
If any of the substrings are present in a string in the dataframe, I want to replace it with the corresponding value. If no substring is present, I want to replace it with np.nan. At present, the only solution I can think of is going through every row, checking whether any of the substrings is present in each string, and replacing it with the number corresponding to that substring. Is there a better way to do it?
The output in the end would be
var1 var2
0 2 4.0
1 1 1.0
2 3 NaN
I believe if you replace using regex, it will partially match and give you the result you want. The only exception is the qwerty value, which will remain unchanged. If you then coerce the entire df to numeric, it will return NaN for that value or any other non-numeric one.
import pandas as pd
mapping = {'sum12':2, 'sum6':1,
'avg12':2, 'avg6':1,
'diff':3, 'mean':4}
df = pd.DataFrame({'var1': ['abc_sum12', 'pqr_sum6', 'diff_xyz'],
'var2': ['mean_jkl', 'pqr_avg6', 'qwerty']})
df = df.replace(mapping, regex=True).apply(lambda x: pd.to_numeric(x, errors='coerce'))
print(df)
output
var1 var2
0 2 4.0
1 1 1.0
2 3 NaN
Another approach: build an all-NaN float frame with the same shape and labels as df, then fill it in per column and mapping key:
import numpy as np
import pandas as pd

mapping = {'sum12': 2, 'sum6': 1,
           'avg12': 2, 'avg6': 1,
           'diff': 3, 'mean': 4}
df = pd.DataFrame(
    {'var1': {0: 'abc_sum12', 1: 'pqr_sum6', 2: 'diff_xyz'},
     'var2': {0: 'mean_jkl', 1: 'pqr_avg6', 2: 'qwerty'}})

df_new = df.copy()          # same shape, index and columns as df
df_new[:] = np.nan
df_new = df_new.astype('float')
for name, col in df.items():
    for key, val in mapping.items():
        df_new.loc[col.str.contains(key), name] = val
The resulting dataframe df_new:
var1 var2
0 2.0 4.0
1 1.0 1.0
2 3.0 NaN
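A more vectorised variant of the same idea, not from the original answers: build a single alternation pattern from the keys, extract the first match in each cell, and map it through the dictionary. This assumes no key should lose to a longer key listed after it in the pattern (here sum12 already precedes sum6):
# Continuing with df and mapping from the example above
pattern = '(' + '|'.join(mapping.keys()) + ')'   # '(sum12|sum6|avg12|avg6|diff|mean)'
df_new = df.apply(lambda col: col.str.extract(pattern, expand=False).map(mapping))
print(df_new)
#    var1  var2
# 0     2   4.0
# 1     1   1.0
# 2     3   NaN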
I am trying to split a column into two columns based on a delimiter. The column presently has text separated by '-'. Some of the values in the column are NaN, so when I run the code below, I get the following error message: ValueError: Columns must be same length as key.
I don't want to delete the NaN values, but am not sure how to skip them so that this splitting works.
The code I have right now is:
df[['A','B']] = df['A'].str.split('-',expand=True)
Your code handles the NaN values fine, but you have to pass n=1 to str.split:
Suppose this dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['hello-world', np.nan, 'raise-an-exception']})
print(df)
# Output:
A
0 hello-world
1 NaN
2 raise-an-exception
Reproducible error:
df[['A', 'B']] = df['A'].str.split('-', expand=True)
print(df)
# Output:
...
ValueError: Columns must be same length as key
Use n=1:
df[['A', 'B']] = df['A'].str.split('-', n=1, expand=True)
print(df)
# Output:
A B
0 hello world
1 NaN NaN
2 raise an-exception
An alternative is to generate more columns:
df1 = df['A'].str.split('-', expand=True)
df1.columns = df1.columns.map(lambda x: chr(x+65))  # rename columns 0,1,2 to 'A','B','C'
print(df1)
# Output:
A B C
0 hello world None
1 NaN NaN NaN
2 raise an exception
Or filter the NaN rows out with loc, capping the split at two pieces with n=1 and assigning through .values so pandas does not try to align the split result's 0/1 column labels with 'A'/'B':
df['B'] = np.nan  # the target column must exist before the masked assignment
df.loc[df['A'].notna(), ['A', 'B']] = df.loc[df['A'].notna(), 'A'].str.split('-', n=1, expand=True).values
My dataframes look like this:
df1
id c1
1 abc
2 def
3 ghi
df2
id set1
1 [123,456]
2 [789]
When I left-join df1 and df2 with final_df = df1.merge(df2, how='left'), I get
final_df
id c1 set1
1 abc [123,456]
2 def [789]
3 ghi NaN
I'm using the code below to replace NaN with an empty list []:
for row in final_df.loc[final_df.set1.isnull(), 'set1'].index:
final_df.at[row, 'set1'] = []
The issue is that when df2 is an empty dataframe, this raises
ValueError: setting an array element with a sequence.
PS: I'm using pandas 0.23.4 version
Pandas is not designed to be used with series of lists. You lose all vectorised functionality and any manipulations on such series involve inefficient, Python-level loops.
One work-around is to define a series of empty lists:
res = df1.merge(df2, how='left')
empty = pd.Series([[] for _ in range(len(res.index))], index=res.index)
res['set1'] = res['set1'].fillna(empty)
print(res)
id c1 set1
0 1 abc [123, 456]
1 2 def [789]
2 3 ghi []
A better idea at this point, if viable, is to split your lists into separate series:
res = res.join(pd.DataFrame(res.pop('set1').values.tolist()))
print(res)
id c1 0 1
0 1 abc 123.0 456.0
1 2 def 789.0 NaN
2 3 ghi NaN NaN
This is not ideal, but it will get the work done:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1,'abc'],[2,'def'],[3,'ghi']], columns=['id', 'c1'])
df2 = pd.DataFrame([[1,[123,456]],[2,[789]]], columns=['id', 'set1'])
df = pd.merge(df1, df2, how='left', on='id')
df['set1'].fillna(0, inplace=True)                            # temporary sentinel for missing rows
df['set1'] = df['set1'].apply(lambda x: [] if x == 0 else x)  # swap the sentinel for an empty list
print(df)
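A variant that avoids the 0 sentinel entirely (my suggestion, not from the original answer) is to replace anything that is not already a list. This also works when df2 is empty and the whole column comes back NaN:
df['set1'] = df['set1'].apply(lambda x: x if isinstance(x, list) else [])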
So I've searched SO for this and found a bunch of useful threads on how to replace empty values with NaN. However I can't get any of them to work on my DataFrame.
I've used:
df.replace('', np.NaN)
df3 = df.applymap(lambda x: np.nan if x == '' else x)
and even:
df.iloc[:,86:350] = df.iloc[:,86:350].apply(lambda x: x.str.strip()).replace('', np.nan)
and the code runs without error, but when I look in my dataframe I still have b'' values instead of NaN. Any ideas on what I am missing?
I'm sorry for not giving the code to reproduce this as I don't know how to do that as I suspect it's specific to my dataframe which I imported from SPSS and these values were string variables in SPSS if that helps.
You were close with your second try:
df = df.applymap(lambda x: np.NaN if not x else x)
To show that both '' and b'' evaluate to False in a conditional (so not x is True for each):
l = ['', b'']
for x in l:
    if x:
        print('Not empty')
    else:
        print('Empty')
>>> Empty
>>> Empty
Sample:
from pandas import DataFrame
from numpy import NaN
df = DataFrame([[1,2,''], ['',b'',3], [4, 5, b'']])
print (df)
# Output
0 1 2
0 1 2
1 b'' 3
2 4 5 b''
df2 = df.applymap(lambda x: NaN if not x else x)
print (df2)
# Output
0 1 2
0 1 2 NaN
1 NaN NaN 3
2 4 5 NaN
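Side note for current pandas: DataFrame.applymap was renamed to DataFrame.map in pandas 2.1, so on recent versions the same idea reads:
df2 = df.map(lambda x: NaN if not x else x)  # pandas >= 2.1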
UPDATE: This is no longer an issue since at least pandas version 0.18.1. Concatenating empty series doesn't drop them anymore so this question is out of date.
I want to create a pandas dataframe from a list of series using .concat. The problem is that when one of the series is empty it doesn't get included in the resulting dataframe but this makes the dataframe be the wrong dimensions when I then try to rename its columns with a multi-index.
UPDATE: Here's an example...
import pandas as pd
sers1 = pd.Series()
sers2 = pd.Series(['a', 'b', 'c'])
df1 = pd.concat([sers1, sers2], axis=1)
This produces the following dataframe:
>>> df1
0 a
1 b
2 c
dtype: object
But I want it to produce something like this:
>>> df2
0 1
0 NaN a
1 NaN b
2 NaN c
It does this if I put a single NaN value anywhere in sers1, but it seems like this should happen automatically even if some of my series are totally empty.
Passing an argument for levels will do the trick. Here's an example. First, the wrong way:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
df = pd.concat(list_of_series, axis=1)
Which produces this:
>>> df
0
0 1
1 2
2 3
But if we add some labels to the levels argument, it will include all the empty series too:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
labels = range(len(list_of_series))
df = pd.concat(list_of_series, levels=labels, axis=1)
Which produces the desired dataframe:
>>> df
0 1 2
0 NaN 1 NaN
1 NaN 2 NaN
2 NaN 3 NaN