If substring in string, replace string with number - python

I have the following mapping
mapping = {'sum12': 2, 'sum6': 1,
           'avg12': 2, 'avg6': 1,
           'diff': 3, 'mean': 4}
and I have a dataframe with variables like
        var1      var2
0  abc_sum12  mean_jkl
1   pqr_sum6  pqr_avg6
2   diff_xyz    qwerty
If any of the substrings are present in a string in the dataframe, I want to replace the string with the corresponding value. If no substring is present, I want to replace it with np.nan. At present, the only solution I can think of is going through every row, checking whether any of the substrings is present in each string, and replacing it with the number corresponding to that substring. Is there a better way to do it?
The output in the end would be
   var1  var2
0     2   4.0
1     1   1.0
2     3   NaN

I believe that if you replace using regex, it will partial-match and give you the result you want. The only exception is the qwerty value, which will remain unchanged. If you then coerce the entire DataFrame to numeric, that value (or any other non-numeric string) becomes NaN.
import pandas as pd

mapping = {'sum12': 2, 'sum6': 1,
           'avg12': 2, 'avg6': 1,
           'diff': 3, 'mean': 4}
df = pd.DataFrame({'var1': ['abc_sum12', 'pqr_sum6', 'diff_xyz'],
                   'var2': ['mean_jkl', 'pqr_avg6', 'qwerty']})

# regex=True matches the mapping keys as substrings; because the replacement
# values are not strings, the whole cell is replaced. to_numeric(errors='coerce')
# then turns anything still non-numeric (like 'qwerty') into NaN.
df = df.replace(mapping, regex=True).apply(lambda x: pd.to_numeric(x, errors='coerce'))
print(df)
Output:
   var1  var2
0     2   4.0
1     1   1.0
2     3   NaN
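A different take, not from the answer above: starting again from the original string df, build one alternation pattern from the mapping keys, extract the first match, and map it. A sketch, assuming the keys contain no regex metacharacters (and noting that if one key were a prefix of another, the longer key would need to come first in the alternation):
pattern = '(' + '|'.join(mapping) + ')'
df = df.apply(lambda col: col.str.extract(pattern, expand=False).map(mapping))
print(df)
#    var1  var2
# 0     2   4.0
# 1     1   1.0
# 2     3   NaN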

Another approach:
import numpy as np
import pandas as pd

mapping = {'sum12': 2, 'sum6': 1,
           'avg12': 2, 'avg6': 1,
           'diff': 3, 'mean': 4}
df = pd.DataFrame(
    {'var1': {0: 'abc_sum12', 1: 'pqr_sum6', 2: 'diff_xyz'},
     'var2': {0: 'mean_jkl', 1: 'pqr_avg6', 2: 'qwerty'}})

# Start from an all-NaN float frame with the same shape and labels as df.
df_new = pd.DataFrame(np.nan, index=df.index, columns=df.columns)
for name, col in df.items():
    for key, val in mapping.items():
        # .loc avoids chained-assignment pitfalls.
        df_new.loc[col.str.contains(key), name] = val
The resulting dataframe df_new:
   var1  var2
0   2.0   4.0
1   1.0   1.0
2   3.0   NaN
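One caveat with the loop above (an observation, not part of the original answer): str.contains treats key as a regular expression, so mapping keys containing regex metacharacters would need escaping, or regex=False in the loop body:
df_new.loc[col.str.contains(key, regex=False), name] = val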

How to skip NaN values when splitting up a column

I am trying to split a column into two columns based on a delimiter. The column presently has text separated by a '-'. Some of the values in the column are NaN, so when I run the code below, I get the following error message: ValueError: Columns must be same length as key.
I don't want to delete the NaN values, but am not sure how to skip them so that this splitting works.
The code I have right now is:
df[['A','B']] = df['A'].str.split('-',expand=True)
Your code works fine with NaN values; you just have to pass n=1 to str.split:
Suppose this dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['hello-world', np.nan, 'raise-an-exception']})
print(df)
# Output:
                    A
0         hello-world
1                 NaN
2  raise-an-exception
Reproducible error:
df[['A', 'B']] = df['A'].str.split('-', expand=True)
print(df)
# Output:
...
ValueError: Columns must be same length as key
Use n=1:
df[['A', 'B']] = df['A'].str.split('-', n=1, expand=True)
print(df)
# Output:
       A             B
0  hello         world
1    NaN           NaN
2  raise  an-exception
An alternative is to generate as many columns as the longest split needs:
df1 = df['A'].str.split('-', expand=True)
df1.columns = df1.columns.map(lambda x: chr(x + 65))  # 0 -> 'A', 1 -> 'B', ...
print(df1)
# Output:
       A      B          C
0  hello  world       None
1    NaN    NaN        NaN
2  raise     an  exception
Maybe filter them out with loc (keeping n=1 so the split always yields exactly two columns):
df.loc[df['A'].notna(), ['A', 'B']] = df.loc[df['A'].notna(), 'A'].str.split('-', n=1, expand=True)
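A runnable sketch of that loc idea (assumptions: B is pre-created first, since assigning through .loc with a list that includes a brand-new column can raise a KeyError, and .values sidesteps column-label alignment):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['hello-world', np.nan, 'raise-an-exception']})
df['B'] = np.nan                      # pre-create the target column
mask = df['A'].notna()
df.loc[mask, ['A', 'B']] = df.loc[mask, 'A'].str.split('-', n=1, expand=True).values
print(df)
#        A             B
# 0  hello         world
# 1    NaN           NaN
# 2  raise  an-exception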

Adding a new column to a pandas DataFrame when the new column is longer than the index

I'm having trouble adding a new column to a pandas DataFrame when the new column is longer than the existing index.
The data looks like this:
import pandas as pd

df = pd.DataFrame(
    {
        "bar": ["A", "B", "C"],
        "zoo": [1, 2, 3],
    })
So the length of this df's index is 3.
Next I want to add a new column; the code may go one of these two ways:
df["new_col"] = [1,2,3,4]
This raises an error: Length of values does not match length of index.
Or:
df["new_col"] = pd.Series([1,2,3,4])
This just puts the values [1, 2, 3] in df (values beyond the existing index are silently dropped).
What I want is for the extra value to be kept, extending the index and filling the old columns with NaN.
Is there a better way?
Looking forward to your answer, thanks!
Use DataFrame.join after renaming the Series, with a right join:
# if not default index
# df = df.reset_index(drop=True)
df = df.join(pd.Series([1, 2, 3, 4]).rename('new_col'), how='right')
print(df)
   bar  zoo  new_col
0    A  1.0        1
1    B  2.0        2
2    C  3.0        3
3  NaN  NaN        4
Another idea is to reindex df by the new Series' index:
s = pd.Series([1, 2, 3, 4])
df = df.reindex(s.index)
df["new_col"] = s
print(df)
   bar  zoo  new_col
0    A  1.0        1
1    B  2.0        2
2    C  3.0        3
3  NaN  NaN        4
The same as a one-liner:
s = pd.Series([1, 2, 3, 4])
df = df.reindex(s.index).assign(new_col=s)
You can also use pd.concat along axis=1:
df = pd.DataFrame(
    {
        "bar": ["A", "B", "C"],
        "zoo": [1, 2, 3],
    })
new_col = pd.Series([1, 2, 3, 4])
df = pd.concat([df, new_col], axis=1)
print(df)
   bar  zoo  0
0    A  1.0  1
1    B  2.0  2
2    C  3.0  3
3  NaN  NaN  4
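If you'd rather the concatenated column keep a proper name instead of 0, rename the Series first (same idea, small tweak):
df = pd.concat([df, new_col.rename('new_col')], axis=1)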

Having trouble replacing empty strings with NaN using pandas.DataFrame.replace()

I have a pandas dataframe which has some observations with empty strings which I want to replace with NaN (np.nan).
I am successfully replacing most of these empty strings using
df.replace(r'\s+',np.nan,regex=True).replace('',np.nan)
But I am still finding empty strings. For example, when I run
sub_df = df[df['OBJECT_COL'] == '']
sub_df.replace(r'\s+', np.nan, regex = True)
print(sub_df['OBJECT_COL'] == '')
The output all returns True
Is there a different method I should be trying? Is there a way to inspect the encoding of these cells, in case my .replace() is failing because the encoding is unusual?
Some alternatives:
sub_df.replace(r'^\s+$', np.nan, regex=True)
Or, to replace both empty strings and whitespace-only records:
sub_df.replace(r'^\s*$', np.nan, regex=True)
Alternative: using apply() with a lambda:
sub_df.apply(lambda x: x.str.strip()).replace('', np.nan)
An example illustration:
>>> import numpy as np
>>> import pandas as pd
An example DataFrame with empty strings and whitespace:
>>> sub_df
        col_A
0
1
2   somevalue
3  othervalue
4
Solutions for the different conditions:
1) Best solution; handles both whitespace and empty strings:
>>> sub_df.replace(r'\s+',np.nan,regex=True).replace('',np.nan)
        col_A
0         NaN
1         NaN
2   somevalue
3  othervalue
4         NaN
2) This works, but only partially; it misses the pure empty string (row 0):
>>> sub_df.replace(r'^\s+$', np.nan, regex=True)
        col_A
0
1         NaN
2   somevalue
3  othervalue
4         NaN
3) This works for both conditions:
>>> sub_df.replace(r'^\s*$', np.nan, regex=True)
        col_A
0         NaN
1         NaN
2   somevalue
3  othervalue
4         NaN
4) This also works for both conditions:
>>> sub_df.apply(lambda x: x.str.strip()).replace('', np.nan)
        col_A
0         NaN
1         NaN
2   somevalue
3  othervalue
4         NaN
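One caveat with the apply/strip approach, not covered above: the .str accessor raises on non-string columns. A hedged sketch that limits the strip to object-dtype columns first:
str_cols = sub_df.select_dtypes(include='object').columns
sub_df[str_cols] = sub_df[str_cols].apply(lambda s: s.str.strip()).replace('', np.nan)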
pd.Series.replace does not work in-place by default. You need to specify inplace=True explicitly:
sub_df.replace(r'\s+', np.nan, regex=True, inplace=True)
Or, alternatively, assign back to sub_df:
sub_df = sub_df.replace(r'\s+', np.nan, regex=True)
Try np.where:
df['OBJECT_COL'] = np.where(df['OBJECT_COL'] == '', np.nan, df['OBJECT_COL'])
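A minimal demonstration (note that this catches only exact empty strings, not whitespace-only cells):
import numpy as np
import pandas as pd

df = pd.DataFrame({'OBJECT_COL': ['a', '', 'b']})
df['OBJECT_COL'] = np.where(df['OBJECT_COL'] == '', np.nan, df['OBJECT_COL'])
print(df)
#   OBJECT_COL
# 0          a
# 1        NaN
# 2          b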

None in if condition, how to handle missing data?

If the value of age is missing, I want to create a variable with the value 1. Instead, everything in the Value column comes out as None.
import numpy as np
import pandas as pd

raw_data1 = {'id': [1, 2, 3, 5],
             'age': [0, np.nan, 10, 2]}
df1 = pd.DataFrame(raw_data1, columns=['id', 'age'])

def my_test(b):
    if b is None:
        return 1

df1['Value'] = df1.apply(lambda row: my_test(row['age']), axis=1)
How can I implement it? I know there are several ways, but I would like to focus on using a function (def my_test etc.).
If I understood you correctly, you can use:
df1['value'] = np.where(df1['age'].isnull(), 1, '')
Output:
   id   age value
0   1   0.0
1   2   NaN     1
2   3  10.0
3   5   2.0
You can use row.get('age') instead of row['age'].
get() returns None if 'age' is not present in the row.
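A quick check of that behavior (inside apply(axis=1), row is a Series, and Series.get falls back to None for a missing label):
>>> row = pd.Series({'id': 1})  # no 'age' label
>>> print(row.get('age'))
None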
Do this instead (note the bracket assignment; attribute assignment like df1.value = ... does not create a new column):
>>> df1['value'] = df1['age'].isna().astype(int)
>>> df1
   id   age  value
0   1   0.0      0
1   2   NaN      1
2   3  10.0      0
3   5   2.0      0
You can use map for this
df1['Value'] = df1['age'].map(lambda x : 1 if np.isnan(x) else np.nan)
If you want to make use of your function, you can use map like this
def my_test(b):
    if np.isnan(b):
        return 1
    else:
        return np.nan

df1['Value'] = df1['age'].map(my_test)

python pandas Ignore Nan in integer comparisons

I am trying to create dummy variables based on integer comparisons in series where NaN is common. A > comparison raises errors if there are any NaN values, but I want the comparison to return NaN. I understand that I could use fillna() to replace NaN with a value that I know will be false, but I would hope there is a more elegant way. I would need to change the value in fillna() if I used less-than, or if the variable could be positive or negative, and that is one more opportunity to create errors. Is there any way to make 30 < NaN evaluate to NaN?
To be clear, I want this:
df['var_dummy'] = df[df['var'] >= 30].astype('int')
to return a null if var is null, 1 if it is 30+, and 0 otherwise. Currently I get ValueError: cannot reindex from a duplicate axis.
Here's a way:
s1 = pd.Series([1, 3, 4, 2, np.nan, 5, np.nan, 7])
s2 = pd.Series([2, 1, 5, 5, np.nan, np.nan, 2, np.nan])

(s1 < s2).mask(s1.isnull() | s2.isnull(), np.nan)
Out:
0    1.0
1    0.0
2    1.0
3    1.0
4    NaN
5    NaN
6    NaN
7    NaN
dtype: float64
This masks the boolean array returned from (s1 < s2) wherever either input is NaN, returning NaN there instead. Since a boolean array cannot hold NaNs, the result is cast to float.
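Adapting that to the question's single column (a sketch, assuming the column is named 'var'; the explicit astype(float) guards against the bool-to-object upcast on newer pandas versions):
import numpy as np
import pandas as pd

df = pd.DataFrame({'var': [12, 34, np.nan, 45]})
# True/False where var is present, NaN where it is missing.
df['var_dummy'] = (df['var'] >= 30).mask(df['var'].isnull()).astype(float)
print(df)
#     var  var_dummy
# 0  12.0        0.0
# 1  34.0        1.0
# 2   NaN        NaN
# 3  45.0        1.0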
Solution 1:
df['var_dummy'] = 1 * df.loc[~pd.isnull(df['var']), 'var'].ge(30)
Solution 2:
df['var_dummy'] = df['var'].apply(lambda x: np.nan if x != x else 1 * (x >= 30))
x != x is True only for NaN, so it is equivalent to math.isnan(x).
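A quick demonstration of that property:
>>> import math
>>> x = float('nan')
>>> x != x  # NaN is the only value not equal to itself
True
>>> math.isnan(x)
True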
You can use the notna() method. Here is an example:
import pandas as pd

list1 = [12, 34, -4, None, 45]
list2 = ['a', 'b', 'c', 'd', 'e']

# Calling the DataFrame constructor on the above lists
df = pd.DataFrame(list(zip(list1, list2)), columns=['var1', 'letter'])

# Assigning the new dummy variable:
df['var_dummy'] = df['var1'][df['var1'].notna()] >= 30
# or equivalently: df['var_dummy'] = df.var1[df.var1.notna()] >= 30
df
Will produce the output below:
   var1 letter var_dummy
0  12.0      a     False
1  34.0      b      True
2  -4.0      c     False
3   NaN      d       NaN
4  45.0      e      True
So the new dummy variable has NaN value for the original variable's NaN rows.
The only thing that does not match your request is that the dummy variable takes False and True values instead of 0 and 1, but you can easily reassign the values.
One thing you cannot change, however, is that the new dummy variable has to be of float type, because it contains NaN, which is itself a special float value.
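For completeness, an addition beyond the answer above (assuming pandas >= 1.0): the nullable 'boolean' dtype can hold missing values without the float upcast:
# pandas >= 1.0 assumed: 'boolean' is the nullable BooleanDtype, which can
# hold missing values alongside True/False without upcasting to float.
df['var_dummy'] = (df['var1'] >= 30).mask(df['var1'].isna()).astype('boolean')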
More information about NaN floats can be found here:
How can I check for NaN values?
and here:
https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b
