How to skip NaN values when splitting up a column - python

I am trying to split a column into two columns based on a delimiter. The column presently has text separated by '-'. Some of the values in the column are NaN, so when I run the code below, I get the following error message: ValueError: Columns must be same length as key.
I don't want to delete the NaN values, but am not sure how to skip them so that this splitting works.
The code I have right now is:
df[['A','B']] = df['A'].str.split('-',expand=True)

Your code works fine with NaN values; you just have to pass n=1 to str.split:
Suppose this dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['hello-world', np.nan, 'raise-an-exception']})
print(df)
# Output:
                    A
0         hello-world
1                 NaN
2  raise-an-exception
Reproducible error:
df[['A', 'B']] = df['A'].str.split('-', expand=True)
print(df)
# Output:
...
ValueError: Columns must be same length as key
Use n=1 so each value is split at most once; this guarantees exactly two result columns even though 'raise-an-exception' contains two delimiters:
df[['A', 'B']] = df['A'].str.split('-', n=1, expand=True)
print(df)
# Output:
       A             B
0  hello         world
1    NaN           NaN
2  raise  an-exception
An alternative is to generate more columns:
df1 = df['A'].str.split('-', expand=True)
df1.columns = df1.columns.map(lambda x: chr(x + 65))  # rename 0, 1, 2 -> 'A', 'B', 'C'
print(df1)
# Output:
       A      B          C
0  hello  world       None
1    NaN    NaN        NaN
2  raise     an  exception

Maybe filter them out with loc:
df.loc[df['A'].notna(), ['A','B']] = df.loc[df['A'].notna(), 'A'].str.split('-',expand=True)
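As written, that one-liner can silently fill the masked rows with NaN, because .loc assignment aligns on column labels and the split result has columns 0 and 1 rather than A and B. A minimal working sketch on the sample frame above (my adjustments: n=1 to cap the split at two parts, .values to bypass label alignment, and B created up front so the column already exists):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['hello-world', np.nan, 'raise-an-exception']})
mask = df['A'].notna()          # rows that actually contain text
df['B'] = np.nan                # create the target column up front
df.loc[mask, ['A', 'B']] = df.loc[mask, 'A'].str.split('-', n=1, expand=True).values
print(df)
#        A             B
# 0  hello         world
# 1    NaN           NaN
# 2  raise  an-exception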

Related

pandas combine columns without null keep string values

I want to combine columns without null and keep string values.
Example data:
a,b,c
123.jpg,213.jpg,987.jpg
,159.jpg,
Here is my code:
cols = ['a', 'b', 'c']
df['combine_columns'] = df[cols].stack().groupby(level=0).agg(','.join)
print(df)
And the result:
a,b,c,combine_columns
123.jpg,213.jpg,987.jpg,"123.jpg,213.jpg,987.jpg"
,159.jpg,,159.jpg
But I want something like this:
a,b,c,combine_columns
123.jpg,213.jpg,987.jpg,""123.jpg","213.jpg","987.jpg""
,159.jpg,,"159.jpg"
How can I do this?
You can use apply with a list comprehension and pandas.notna as a filter:
df['combine_columns'] = df.apply(lambda x: ','.join([e for e in x if pd.notna(e)]),
                                 axis=1)
output:
         a        b        c          combine_columns
0  123.jpg  213.jpg  987.jpg  123.jpg,213.jpg,987.jpg
1      NaN  159.jpg      NaN                  159.jpg
To add the extra " around the string:
df['combine_columns'] = df.apply(lambda x: '"%s"' % ','.join([e for e in x if pd.notna(e)]),
                                 axis=1)
output:
         a        b        c            combine_columns
0  123.jpg  213.jpg  987.jpg  "123.jpg,213.jpg,987.jpg"
1      NaN  159.jpg      NaN                  "159.jpg"
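For reference, the stack-based line from the question works too; a variant of it with an explicit .dropna() so the NaN handling doesn't depend on the pandas version:

cols = ['a', 'b', 'c']
# .dropna() makes the NaN handling explicit (older stack drops NaN by default,
# very recent pandas keeps NA values when stacking)
df['combine_columns'] = df[cols].stack().dropna().groupby(level=0).agg(','.join)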

Having trouble replacing empty strings with NaN using pandas.DataFrame.replace()

I have a pandas dataframe which has some observations with empty strings which I want to replace with NaN (np.nan).
I am successfully replacing most of these empty strings using
df.replace(r'\s+',np.nan,regex=True).replace('',np.nan)
But I am still finding empty strings. For example, when I run
sub_df = df[df['OBJECT_COL'] == '']
sub_df.replace(r'\s+', np.nan, regex = True)
print(sub_df['OBJECT_COL'] == '')
The output is all True.
Is there a different method I should be trying? Is there a way to read the encoding of these cells such that perhaps my .replace() is not effective because the encoding is weird?
Other alternatives:
sub_df.replace(r'^\s+$', np.nan, regex=True)
Or, to replace both empty strings and records containing only spaces:
sub_df.replace(r'^\s*$', np.nan, regex=True)
Alternative:
using apply() with a lambda function.
sub_df.apply(lambda x: x.str.strip()).replace('', np.nan)
Example illustration:
>>> import numpy as np
>>> import pandas as pd
Example DataFrame containing empty strings and whitespace:
>>> sub_df
        col_A
0
1
2   somevalue
3  othervalue
4
Solutions applied for the different conditions:
1) Best solution, works for both cases:
>>> sub_df.replace(r'\s+', np.nan, regex=True).replace('', np.nan)
        col_A
0         NaN
1         NaN
2   somevalue
3  othervalue
4         NaN
2) This works only partially; it does not cover the empty-string case:
>>> sub_df.replace(r'^\s+$', np.nan, regex=True)
        col_A
0
1         NaN
2   somevalue
3  othervalue
4         NaN
3) This works for both conditions:
>>> sub_df.replace(r'^\s*$', np.nan, regex=True)
        col_A
0         NaN
1         NaN
2   somevalue
3  othervalue
4         NaN
4) This also works for both conditions:
>>> sub_df.apply(lambda x: x.str.strip()).replace('', np.nan)
        col_A
0         NaN
1         NaN
2   somevalue
3  othervalue
4         NaN
pd.DataFrame.replace does not work in place by default. You need to specify inplace=True explicitly:
sub_df.replace(r'\s+', np.nan, regex=True, inplace=True)
Or, alternatively, assign back to sub_df:
sub_df = sub_df.replace(r'\s+', np.nan, regex=True)
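A quick self-contained check of the difference (a toy frame; the column name is just for illustration):

import numpy as np
import pandas as pd

sub_df = pd.DataFrame({'OBJECT_COL': ['', 'x', '  ']})

sub_df.replace(r'^\s*$', np.nan, regex=True)           # result is discarded
print((sub_df['OBJECT_COL'] == '').any())              # True  -- frame unchanged

sub_df = sub_df.replace(r'^\s*$', np.nan, regex=True)  # assign the result back
print((sub_df['OBJECT_COL'] == '').any())              # False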
Try np.where:
df['OBJECT_COL'] = np.where(df['OBJECT_COL'] == '', np.nan, df['OBJECT_COL'])
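np.where returns a plain NumPy array, so this rebuilds the column element-wise; a small demo with illustrative data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'OBJECT_COL': ['', 'abc', '']})
df['OBJECT_COL'] = np.where(df['OBJECT_COL'] == '', np.nan, df['OBJECT_COL'])
print(df)
#   OBJECT_COL
# 0        NaN
# 1        abc
# 2        NaN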

Why does concat Series to DataFrame with index matching columns not work?

I want to append a Series to a DataFrame where the Series's index matches the DataFrame's columns, using pd.concat, but it surprises me:
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series(data=[1,2], index=['a', 'b'], name=1)
pd.concat([df, sr], axis=0)
Out[11]:
     a    b    0
a  NaN  NaN  1.0
b  NaN  NaN  2.0
What I expected is of course:
df.append(sr)
Out[14]:
   a  b
1  1  2
It really surprises me that pd.concat is not index-column aware. So is it true that if I want to concat a Series as a new row to a DataFrame, I can only use df.append?
You need a DataFrame from the Series, via to_frame and a transpose. (Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the concat route is also the forward-compatible one.)
a = pd.concat([df, sr.to_frame(1).T])
print(a)
   a  b
1  1  2
Detail:
print(sr.to_frame(1).T)
   a  b
1  1  2
Or use setting with enlargement:
df.loc[1] = sr
print(df)
   a  b
1  1  2
"df.loc[1] = sr" will drop the column if it isn't in df
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series({'a': 1, 'b': 2, 'c': 3})
df.loc[1] = sr
df will be:
   a  b
1  1  2
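If the extra keys should be kept instead, one option (my sketch, not from the original answer) is to widen the frame's columns before assigning:

df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series({'a': 1, 'b': 2, 'c': 3})
df = df.reindex(columns=df.columns.union(sr.index))  # adds column 'c'
df.loc[1] = sr
print(df)
#    a  b  c
# 1  1  2  3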

str error when replacing values in pandas dataframe

My code scrapes information from a website and puts it into a dataframe. But I'm not certain why the order of the operations gives rise to the error: AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
Basically, the data scraped has over 20 rows and 10 columns.
Some values are within parentheses, e.g. (2,333), and I want to change them to -2333.
Some values are the string n.a., which I want to change to numpy.nan.
Some values are - and I want to change them to numpy.nan too.
Doesn't Work
for final_df, engine_name in zip((df_foo, df_bar, df_far),
                                 ('engine_foo', 'engine_bar', 'engine_far')):
    # Replacing necessary items for final clean up
    final_df.replace('-', numpy.nan, inplace=True)
    final_df.replace('n.a.', numpy.nan, inplace=True)
    for i in final_df.columns:
        final_df[i] = final_df[i].str.replace(')', '')
        final_df[i] = final_df[i].str.replace(',', '')
        final_df[i] = final_df[i].str.replace('(', '-')
    # Appending Code to dataframe
    final_df = final_df.T
    final_df.insert(loc=0, column='Code', value=some_code)
    # Raises: AttributeError: Can only use .str accessor with string values,
    # which use np.object_ dtype in pandas
Works
for final_df, engine_name in zip((df_foo, df_bar, df_far),
                                 ('engine_foo', 'engine_bar', 'engine_far')):
    # Replacing necessary items for final clean up
    for i in final_df.columns:
        final_df[i] = final_df[i].str.replace(')', '')
        final_df[i] = final_df[i].str.replace(',', '')
        final_df[i] = final_df[i].str.replace('(', '-')
    final_df.replace('-', numpy.nan, inplace=True)
    final_df.replace('n.a.', numpy.nan, inplace=True)
    # Appending Code to dataframe
    final_df = final_df.T
    final_df.insert(loc=0, column='Code', value=some_code)
    # This doesn't raise any errors and returns what I want.
Any thoughts on why this happens?
For me, a double replace works: first with regex=True to replace substrings, then a second pass for whole values:
np.random.seed(23)
df = pd.DataFrame(np.random.choice(['(2,333)', 'n.a.', '-', 2.34], size=(3, 3)),
                  columns=list('ABC'))
print(df)
      A     B        C
0  2.34     -  (2,333)
1  n.a.     -  (2,333)
2  2.34  n.a.  (2,333)
df1 = df.replace(['\(', '\)', '\,'], ['-', '', ''], regex=True).replace(['-', 'n.a.'], np.nan)
print(df1)
      A    B      C
0  2.34  NaN  -2333
1   NaN  NaN  -2333
2  2.34  NaN  -2333
df1 = df.replace(['-', 'n.a.'], np.nan).replace(['\(', '\)', '\,'], ['-', '', ''], regex=True)
print(df1)
      A    B      C
0  2.34  NaN  -2333
1   NaN  NaN  -2333
2  2.34  NaN  -2333
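A possible follow-up step (my addition, not part of the original answer): after this cleanup the surviving values are still strings, so convert them to numbers explicitly:

df1 = df1.apply(pd.to_numeric, errors='coerce')  # e.g. '-2333' -> -2333; unparseable -> NaN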
EDIT:
Your error means you are calling str.replace on a non-string column (here, after the first replace, column B is all NaN, so its dtype is float64):
df1 = (df.apply(lambda x: x.str.replace('\(', '-')
                           .str.replace('\)', '')
                           .str.replace(',', ''))
         .replace(['-', 'n.a.'], np.nan))
print(df1)
      A    B      C
0  2.34  NaN  -2333
1   NaN  NaN  -2333
2  2.34  NaN  -2333
df1 = (df.replace(['-', 'n.a.'], np.nan)
         .apply(lambda x: x.str.replace('\(', '-')
                           .str.replace('\)', '')
                           .str.replace(',', '')))
print(df1)
AttributeError: ('Can only use .str accessor with string values, which use np.object_ dtype in pandas', 'occurred at index B')
The dtype of column B after the first replace is float64:
df1 = df.replace(['-', 'n.a.'], np.nan)
print(df1)
      A   B        C
0  2.34 NaN  (2,333)
1   NaN NaN  (2,333)
2  2.34 NaN  (2,333)
print(df1.dtypes)
A     object
B    float64
C     object
dtype: object
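If the NaN replace has to run first for some reason, one defensive option (my sketch, mirroring the loop from the question; regex=False assumes a reasonably recent pandas) is to skip non-string columns before touching .str, which avoids the AttributeError regardless of order:

for i in final_df.columns:
    if final_df[i].dtype == object:  # .str only works on string/object columns
        final_df[i] = (final_df[i].str.replace(')', '', regex=False)
                                  .str.replace(',', '', regex=False)
                                  .str.replace('(', '-', regex=False))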

Include empty series when creating a pandas dataframe with .concat

UPDATE: This is no longer an issue since at least pandas version 0.18.1. Concatenating empty series doesn't drop them anymore so this question is out of date.
I want to create a pandas dataframe from a list of series using .concat. The problem is that when one of the series is empty it doesn't get included in the resulting dataframe, which gives the dataframe the wrong dimensions when I then try to rename its columns with a multi-index.
UPDATE: Here's an example...
import pandas as pd
sers1 = pd.Series()
sers2 = pd.Series(['a', 'b', 'c'])
df1 = pd.concat([sers1, sers2], axis=1)
This produces the following dataframe:
>>> df1
0    a
1    b
2    c
dtype: object
But I want it to produce something like this:
>>> df2
     0  1
0  NaN  a
1  NaN  b
2  NaN  c
It does this if I put a single NaN value anywhere in sers1, but it seems like this should happen automatically even when some of my series are totally empty.
Passing an argument for levels will do the trick. Here's an example. First, the wrong way:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
df = pd.concat(list_of_series, axis=1)
Which produces this:
>>> df
   0
0  1
1  2
2  3
But if we add some labels to the levels argument, it will include all the empty series too:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
labels = range(len(list_of_series))
df = pd.concat(list_of_series, levels=labels, axis=1)
Which produces the desired dataframe:
>>> df
    0  1   2
0 NaN  1 NaN
1 NaN  2 NaN
2 NaN  3 NaN
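Per the UPDATE at the top, on pandas >= 0.18.1 none of this is needed; a quick self-contained check (giving the empty series an explicit dtype to avoid a warning on newer pandas versions):

import pandas as pd

ser1 = pd.Series(dtype=float)  # explicitly typed empty series
ser2 = pd.Series([1, 2, 3])
df = pd.concat([ser1, ser2, ser1], axis=1)
print(df)
#     0  1   2
# 0 NaN  1 NaN
# 1 NaN  2 NaN
# 2 NaN  3 NaN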
