Python pandas replace function not working with escaped characters

I've already looked at half a dozen SO questions on the Python 3 pandas replace function, and none of them apply to this situation. I have the text \" present in some data, and I need to eliminate ONLY the backslash. Toy code:
import pandas as pd
df = pd.DataFrame(columns=['a'])
df.loc[0] = ['Replace \\"']
df
with output
a
0 Replace \"
My goal is to rewrite df so that it looks like this:
a
0 Replace "
None of the following work:
df.replace('\\"', '"', regex=True)
df.replace('\\"', '\"', regex=True)
df.replace('\\\"', '\"', regex=True)
df.replace('\\\"', '\"', regex=True)
df.replace(r'\"', r'"', regex=True)
df.replace({'\\"':'"'}, regex=True)
df.replace({r'\"':r'"'}, regex=True)
df.replace(to_replace=r'\"', value=r'"', regex=True)
df.replace(to_replace=r'\"', value=r'"', regex=False)
I can't search just for the backslash, because I have legitimate backslashes elsewhere in the data that I don't want to replace.
Thanks for your time!

You can use apply:
In [2596]: df.apply(lambda x: x.str.replace(r'\\"', r'"', regex=True))
Out[2596]:
a
0 Replace "
If there's only one column in question, you can also do this, which will be a little more performant:
In [2614]: df['a'].str.replace(r'\\"', r'"', regex=True)
Out[2614]:
0 Replace "
Name: a, dtype: object
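Why none of the attempts in the question change anything: every to_replace literal there reduces to the two characters \" which, as a regex, is just an escaped quote, so each call replaces " with " and nothing visible happens. The raw string r'\\"' is three characters and, as a regex, matches a literal backslash followed by a quote. A quick sketch of the escaping in plain Python (no pandas involved):
import re
s = 'Replace \\"'              # the cell really contains: Replace \"
print(re.sub('\\"', '"', s))   # pattern collapses to \" -> matches only the quote; output unchanged
print(re.sub(r'\\"', '"', s))  # pattern stays \\" -> backslash then quote; prints: Replace "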

Try
df.a.str.replace('\\', '', regex=False)
result:
0 Replace "
For the whole data frame you can use:
for col in df:
    df[col] = df[col].str.replace(r'\\', '', regex=True)
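Note that this strips every backslash in the column, which conflicts with the question's requirement to keep legitimate backslashes elsewhere in the data. A sketch (assuming plain string columns) that anchors the pattern to the quote, so only a backslash immediately before a double quote is removed, and that covers the whole frame in one call:
import pandas as pd
df = pd.DataFrame({'a': ['Replace \\"', 'keep C:\\temp']})
# \\" matches a literal backslash followed by a double quote;
# only that backslash is dropped, all other backslashes survive.
df = df.replace(r'\\"', '"', regex=True)
print(df)
#               a
# 0    Replace "
# 1  keep C:\temp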

Related

Replacing strings with special characters in pandas dataframe

I'm looking to replace strings inside DataFrame columns; these strings contain special characters.
I tried the following, but nothing changed in the DataFrame:
data = {'col1': ["series ${z_mask0}", "series ${z_mask1}", "series ${z_mask2}"]}
df = pd.DataFrame(data)
print(df)
old_values = ["${z_mask0}", "${z_mask1}", "${z_mask2}"]
new_values = ["${z_00}", "${z_01}", "${z_02}"]
df = df.replace(old_values, new_values, regex=True)
print(df)
The intended output is:
['series ${z_00}', 'series ${z_01}', 'series ${z_02}']
You need to escape the $ character using \ in the old_values list:
old_values = ["\${z_mask0}", "\${z_mask1}", "\${z_mask2}"]
The above should be enough. Here is all the code:
old_values = ["\${z_mask0}", "\${z_mask1}", "\${z_mask2}"]
new_values = ["${z_00}", "${z_01}", "${z_02}"]
df = df.replace(old_values, new_values, regex=True)
print(df)
Output:
col1
0 series ${z_00}
1 series ${z_01}
2 series ${z_02}
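A variant that avoids hand-escaping every entry (a sketch using the standard library's re.escape, not part of the original answer):
import re
import pandas as pd
df = pd.DataFrame({'col1': ["series ${z_mask0}", "series ${z_mask1}", "series ${z_mask2}"]})
old_values = ["${z_mask0}", "${z_mask1}", "${z_mask2}"]
new_values = ["${z_00}", "${z_01}", "${z_02}"]
# re.escape backslash-escapes every regex metacharacter ($, {, }) for us.
df = df.replace([re.escape(v) for v in old_values], new_values, regex=True)
print(df)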
Have you tried escaping the RegEx characters in old_values? There are some in there that may be interfering with your result ("{", "}", and "$"). Note that a raw-string prefix alone is not enough here: raw strings only change Python's escape handling, not the regex engine's, so the metacharacters still need a backslash (or re.escape, as above).

Pandas: replace string with special characters

I have a dataframe (see below) with two columns that contain either a list of patients or an empty list (like this: ['']). I want to remove the empty lists.
What I have:

Homozygous_list       heterozygous_list
[Patient1,Patient2]   ['']
['']                  [Patient1]

What I want:

Homozygous_list       heterozygous_list
[Patient1,Patient2]
                      [Patient1]
I tried several things like:
variants["Homozygous_list"].replace("['']","", regex=True, inplace=True)
or
variants["Homozygous_list"].replace("\[''\]","", regex=True, inplace=True)
or
variants["Homozygous_list"] = variants["Homozygous_list"].replace("['']","", regex=True)
etc., but nothing seems to work.
If you really have lists of strings, use applymap:
df = df.applymap(lambda x: '' if x==[''] else x) # or pd.NA in place of ''
output:
        Homozygous_list heterozygous_list
0  [Patient1, Patient2]
1                              [Patient1]
used input:
df = pd.DataFrame({'Homozygous_list': [['Patient1', 'Patient2'], ['']],
                   'heterozygous_list': [[''], ['Patient1']]})
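On pandas 2.1 and later, applymap is deprecated in favor of the element-wise DataFrame.map; the same one-liner, assuming that version:
df = df.map(lambda x: '' if x == [''] else x)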

Python dataframe : strip part of string, on each column row, if it is in specific format [duplicate]

I have read some pricing data into a pandas dataframe; the values appear as:
$40,000*
$40000 conditions attached
I want to strip it down to just the numeric values.
I know I can loop through, apply the regex
[0-9]+
to each field, and then join the resulting list back together, but is there a non-loopy way?
Thanks
You could use Series.str.replace:
import pandas as pd
df = pd.DataFrame(['$40,000*','$40000 conditions attached'], columns=['P'])
print(df)
# P
# 0 $40,000*
# 1 $40000 conditions attached
df['P'] = df['P'].str.replace(r'\D+', '', regex=True).astype('int')
print(df)
yields
P
0 40000
1 40000
since \D matches any character that is not a decimal digit.
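One caveat (an observation, not from the answer above): \D+ also removes a decimal point, so fractional prices collapse into a single integer. A one-line check:
import re
print(re.sub(r'\D+', '', '$40,000.32*'))  # 4000032 -- the '.' is gone
The next answer keeps the separators for exactly this reason.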
You could use pandas' replace method; you may also want to keep the thousands separator ',' and the decimal separator '.'
import pandas as pd
df = pd.DataFrame(['$40,000.32*','$40000 conditions attached'], columns=['pricing'])
df['pricing'].replace(to_replace="\$([0-9,\.]+).*", value=r"\1", regex=True, inplace=True)
print(df)
pricing
0 40,000.32
1 40000
You could remove all the non-digits using re.sub():
import re

value = re.sub(r"[^0-9]+", "", value)
regex101 demo
You don't need regex for this. This should work (note that convert_objects has been removed from pandas; pd.to_numeric is the modern equivalent):
df['col'] = pd.to_numeric(df['col'].astype(str), errors='coerce')
In case anyone is still reading this: I was working on a similar problem and needed to replace an entire column of pandas data using a regex I had worked out with re.sub.
To apply it to the entire column, here's the code.
import re

# add_map holds the replacement rules for the strings in the pd df.
add_map = dict([
    ("AV", "Avenue"),
    ("BV", "Boulevard"),
    ("BP", "Bypass"),
    ("BY", "Bypass"),
    ("CL", "Circle"),
    ("DR", "Drive"),
    ("LA", "Lane"),
    ("PY", "Parkway"),
    ("RD", "Road"),
    ("ST", "Street"),
    ("WY", "Way"),
    ("TR", "Trail"),
])

obj = data_909['Address'].copy()  # data_909['Address'] contains the original addresses
for k, v in add_map.items():  # apply each rule from the dict
    rule1 = r"(\b)(%s)(\b)" % k  # replace k only when it stands alone (see \b)
    rule2 = lambda m: add_map.get(m.group(), m.group())  # return the expansion for the matched abbreviation, or the match itself
    obj = obj.str.replace(rule1, rule2, regex=True, flags=re.IGNORECASE)  # IGNORECASE so lowercase abbreviations match too
data_909['Address_n'] = obj  # store it!
Hope this helps anyone searching for the problem I had. Cheers
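A single-pass variant (a sketch, not from the original answer) builds one alternation from the keys, so the column is scanned once instead of once per rule:
import re

pattern = r"\b(%s)\b" % "|".join(map(re.escape, add_map))
data_909['Address_n'] = data_909['Address'].str.replace(
    pattern,
    lambda m: add_map.get(m.group(1).upper(), m.group(1)),  # .upper() because IGNORECASE may match lowercase
    regex=True,
    flags=re.IGNORECASE,
)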

regexp match in pandas

I want to execute a regexp match on a dataframe column in order to modify the content of the column.
For example, given this dataframe:
import pandas as pd
df = pd.DataFrame([['abra'], ['charmender'], ['goku']],
                  columns=['Name'])
print(df.head())
I want to execute the following regex match:
CASE
WHEN REGEXP_MATCH(Landing Page,'abra') THEN "kadabra"
WHEN REGEXP_MATCH(Landing Page,'charmender') THEN "charmeleon"
ELSE "Unknown" END
My solution is the following:
df.loc[df['Name'].str.contains("abra", na=False), 'Name'] = "kadabra"
df.loc[df['Name'].str.contains("charmender", na=False), 'Name'] = "charmeleon"
df.head()
It works, but I do not know if there is a better way of doing it.
Moreover, I have to rewrite all the regex cases line by line in Python. Is there a way to execute the regex directly in Pandas?
Are you looking for map:
df['Name'] = df['Name'].map({'abra':'kadabra','charmender':'charmeleon'})
Output:
Name
0 kadabra
1 charmeleon
2 NaN
Update: For partial matches:
df = pd.DataFrame([['this abra'], ['charmender'], ['goku']],
                  columns=['Name'])
replaces = {'abra':'kadabra','charmender':'charmeleon'}
df['Name'] = df['Name'].str.extract(fr"\b({'|'.join(replaces.keys())})\b")[0].map(replaces)
And you get the same output (with a different dataframe).
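To reproduce the ELSE "Unknown" branch of the CASE expression from the question, a fillna after the map (a small sketch on the exact-match version) covers the rows that match nothing:
df['Name'] = df['Name'].map({'abra': 'kadabra', 'charmender': 'charmeleon'}).fillna('Unknown')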
