Replacing strings with special characters in pandas dataframe - python

I'm looking to replace strings inside DataFrame columns, and these strings contain special characters.
I tried the following, but nothing changed in the DataFrame:
import pandas as pd

data = {'col1': ["series ${z_mask0}", "series ${z_mask1}", "series ${z_mask2}"]}
df = pd.DataFrame(data)
print(df)
old_values = ["${z_mask0}", "${z_mask1}", "${z_mask2}"]
new_values = ["${z_00}", "${z_01}", "${z_02}"]
df = df.replace(old_values, new_values, regex=True)
print(df)
The intended output is:
['series ${z_00}', 'series ${z_01}', 'series ${z_02}']

You need to escape the $ character with a backslash in the old_values list (use raw strings so Python doesn't mangle the backslash):
old_values = [r"\${z_mask0}", r"\${z_mask1}", r"\${z_mask2}"]
The above should be enough. Here is all the code:
old_values = ["\${z_mask0}", "\${z_mask1}", "\${z_mask2}"]
new_values = ["${z_00}", "${z_01}", "${z_02}"]
df = df.replace(old_values, new_values, regex=True)
print(df)
Output:
col1
0 series ${z_00}
1 series ${z_01}
2 series ${z_02}

Have you tried using raw strings for the old_values? There are some regex metacharacters in there that may be interfering with your result ("{", "}", and "$"). Raw strings stop Python from eating backslashes, but you still need to escape the "$" for the regex engine. Try this instead:
old_values = [r"\${z_mask0}", r"\${z_mask1}", r"\${z_mask2}"]
Note the r before each string and the backslash before each $.
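If the literal strings vary or are numerous, here is a sketch that builds the escaped patterns programmatically with re.escape instead of escaping by hand (assuming the same df as in the question; re.escape backslash-escapes every regex metacharacter for you):
import re
import pandas as pd

data = {'col1': ["series ${z_mask0}", "series ${z_mask1}", "series ${z_mask2}"]}
df = pd.DataFrame(data)

old_values = ["${z_mask0}", "${z_mask1}", "${z_mask2}"]
new_values = ["${z_00}", "${z_01}", "${z_02}"]

# re.escape turns "${z_mask0}" into "\$\{z_mask0\}", so each string
# matches itself literally under regex=True
patterns = [re.escape(v) for v in old_values]
df = df.replace(patterns, new_values, regex=True)
print(df)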

Related

Pandas: replace string with special characters

I have a dataframe (see below) with two columns, each containing either a list of patients or an empty list (like this: ['']). I want to remove the empty lists.
What I have:

Homozygous_list       heterozygous_list
[Patient1,Patient2]   ['']
['']                  [Patient1]

What I want:

Homozygous_list       heterozygous_list
[Patient1,Patient2]
                      [Patient1]
I tried several things like:
variants["Homozygous_list"].replace("['']","", regex=True, inplace=True)
or
variants["Homozygous_list"].replace("\[''\]","", regex=True, inplace=True)
or
variants["Homozygous_list"] = variants["Homozygous_list"].replace("['']","", regex=True)
etc., but nothing seems to work.
If you really have lists of strings, use applymap:
df = df.applymap(lambda x: '' if x==[''] else x) # or pd.NA in place of ''
output:
        Homozygous_list heterozygous_list
0  [Patient1, Patient2]
1                            [Patient1]
used input:
df = pd.DataFrame({'Homozygous_list': [['Patient1','Patient2'], ['']],
                   'heterozygous_list': [[''], ['Patient1']]})
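As an aside, DataFrame.applymap was deprecated in pandas 2.1 in favor of DataFrame.map, so on recent pandas the same element-wise replacement would look like this (same toy frame as above):
import pandas as pd

df = pd.DataFrame({'Homozygous_list': [['Patient1','Patient2'], ['']],
                   'heterozygous_list': [[''], ['Patient1']]})
# element-wise: turn [''] cells into empty strings, leave everything else alone
df = df.map(lambda x: '' if x == [''] else x)
print(df)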

How to drop columns that end with a specific wildcard string?

I have a series of columns:
COLUMNS = contract_number, award_date_x, publication_date_x, award_date_y, publication_date_y, award_date, publication_date
I would like to drop all of the 'publication_date' columns that end with '_[a-z]', so that my final result looks like this:
COLUMNS = contract_number, award_date, award_date_x, award_date_y, publication_date
I have tried the following with no luck:
df_merge=df_merge.drop(c for c in df_merge.columns if c.str.contains('publication_date_[a+z]$'))
Thanks
Try this,
import re

columns = df_merge.columns.tolist()  # getting all columns
for col in columns:
    if re.match(r"publication_date_[a-z]$", col):  # regex for your match case
        df_merge.drop([col], axis=1, inplace=True)  # if the regex matches, remove the column
df_merge.head()  # filtered dataframe
Alternatively, filtering the names with a plain list comprehension:
lis = ["publication_date_x", "publication_date", "publication_date_x_y_y", "hello"]
new_list = [x for x in lis if not x.startswith('publication_date_')]
The output will be:
new_list: ["publication_date", "hello"]
If you want to use str.contains, one way is to make the list of columns a Series:
series_cols = pd.Series(df_merge.columns)
bool_series_cols = series_cols.str.contains('publication_date_[a-z]$')
df_merge.drop([c for c, bool_c in zip(series_cols, bool_series_cols) if bool_c], axis=1, inplace=True)
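Note that the columns Index supports the .str accessor directly, so the intermediate Series is not strictly required; a shorter sketch of the same idea:
# boolean mask over the column names, then keep the non-matching columns
mask = df_merge.columns.str.contains(r'publication_date_[a-z]$')
df_merge = df_merge.loc[:, ~mask]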

regexp match in pandas

I want to run a regexp match on a dataframe column in order to modify the content of the column.
For example, given this dataframe:
import pandas as pd
df = pd.DataFrame([['abra'], ['charmender'], ['goku']],
                  columns=['Name'])
print(df.head())
I want to execute the following regex match:
CASE
WHEN REGEXP_MATCH(Landing Page,'abra') THEN "kadabra"
WHEN REGEXP_MATCH(Landing Page,'charmender') THEN "charmaleon"
ELSE "Unknown" END
My solution is the following:
df.loc[df['Name'].str.contains("abra", na=False), 'Name'] = "kadabra"
df.loc[df['Name'].str.contains("charmender", na=False), 'Name'] = "charmeleon"
df.head()
It works, but I do not know if there is a better way of doing it.
Moreover, I have to rewrite all the regex cases line by line in Python. Is there a way to execute the regex directly in Pandas?
Are you looking for map:
df['Name'] = df['Name'].map({'abra':'kadabra','charmender':'charmeleon'})
Output:
Name
0 kadabra
1 charmeleon
2 NaN
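Note that map returns NaN for any value missing from the dict (here 'goku'). If you want to mirror the ELSE "Unknown" branch of the original expression, one option is to chain fillna:
df['Name'] = df['Name'].map({'abra': 'kadabra', 'charmender': 'charmeleon'}).fillna('Unknown')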
Update: For partial matches:
df = pd.DataFrame([['this abra'], ['charmender'], ['goku']],
                  columns=['Name'])
replaces = {'abra':'kadabra','charmender':'charmeleon'}
df['Name'] = df['Name'].str.extract(fr"\b({'|'.join(replaces.keys())})\b")[0].map(replaces)
And you get the same output (with the different input dataframe).
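Another sketch that maps more literally onto the CASE WHEN / ELSE structure uses numpy.select with str.contains conditions (assumes numpy is installed; column name as in the question):
import numpy as np

conditions = [df['Name'].str.contains('abra', na=False),
              df['Name'].str.contains('charmender', na=False)]
choices = ['kadabra', 'charmeleon']
# the first matching condition wins; rows matching nothing get the default
df['Name'] = np.select(conditions, choices, default='Unknown')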

Python pandas replace function not working with escaped characters

I've already looked at half a dozen SO questions on the Python 3 pandas replace function, and none of them apply to this situation. I have the text \" present in some data, and I need to eliminate ONLY the backslash. Toy code:
import pandas as pd
df = pd.DataFrame(columns=['a'])
df.loc[0] = ['Replace \\"']
df
with output
a
0 Replace \"
My goal is to rewrite df so that it looks like this:
a
0 Replace "
None of the following work:
df.replace('\\"', '"', regex=True)
df.replace('\\"', '\"', regex=True)
df.replace('\\\"', '\"', regex=True)
df.replace('\\\"', '\"', regex=True)
df.replace(r'\"', r'"', regex=True)
df.replace({'\\"':'"'}, regex=True)
df.replace({r'\"':r'"'}, regex=True)
df.replace(to_replace=r'\"', value=r'"', regex=True)
df.replace(to_replace=r'\"', value=r'"', regex=False)
I can't search just for the backslash, because I have legitimate backslashes elsewhere in the data that I don't want to replace.
Thanks for your time!
You can use apply:
In [2596]: df.apply(lambda x: x.str.replace(r'\\"', r'"', regex=True))
Out[2596]:
a
0 Replace "
If there's only one column in question, you can also do this, which will be a little more performant:
In [2614]: df['a'].str.replace(r'\\"', r'"', regex=True)
Out[2614]:
0 Replace "
Name: a, dtype: object
Try
df.a.str.replace('\\', '', regex=False)
(note that this strips every backslash, so it will also hit the legitimate backslashes mentioned in the question)
result:
0 Replace "
For the whole data frame you can use:
for col in df:
    df[col] = df[col].str.replace('\\', '', regex=False)
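For completeness, a minimal end-to-end check (assuming pandas >= 2.0, where Series.str.replace defaults to regex=False, so the two-character sequence \" can be replaced literally while lone backslashes elsewhere survive):
import pandas as pd

df = pd.DataFrame({'a': ['Replace \\"', 'keep\\this']})
# literal (non-regex) replacement: the two characters \" become "
df['a'] = df['a'].str.replace('\\"', '"', regex=False)
print(df)
#            a
# 0  Replace "
# 1  keep\this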

I'm trying in pandas to get the nth space-separated value after splitting a column by a delimiter

I created a dataframe df from csv data that looks like:
col_1,col_2
001,JOHN VARCHAR(11) NOT NULL^RANDY VARCHAR(2) NOT NULL^MICHAEL VARCHAR(105) NOT NULL^DATE STRING
002,Danny VARCHAR(87)^EDWARD VARCHAR(4) NOT NULL^ROB VARCHAR(73) NOT NULL
I'm trying to take each ^-delimited segment of col_2 and keep its second space-delimited value, producing col_3 as in the df below:
col_1,col_2,col_3
001,JOHN VARCHAR(11) NOT NULL^RANDY VARCHAR(2) NOT NULL^MICHAEL VARCHAR(105) NOT NULL^DATE STRING,VARCHAR(11)^VARCHAR(2)^VARCHAR(105)^STRING
002,Danny VARCHAR(87)^EDWARD VARCHAR(4) NOT NULL^ROB VARCHAR(73) NOT NULL,VARCHAR(87)^VARCHAR(4)^VARCHAR(73)
I'm using the line below, but it doesn't get the 2nd space-delimited value per segment:
df['col_3'] = df['col_2'].map(lambda v: v.split(' ')[1])
It might not answer your question directly, but I think the question is related to how to explode a list inside a pandas dataframe.
df["col_2"].str.split("^", expand=True).stack().reset_index()
You're on the right path: split the value of col2 on the ^ character, take the second space-delimited token of each segment, join them back with ^, and assign the result to col3:
import pandas as pd

data = {'col1': ['001', '002'],
        'col2': ['JOHN VARCHAR(11) NOT NULL^RANDY VARCHAR(2) NOT NULL^MICHAEL VARCHAR(105) NOT NULL^DATE STRING',
                 'Danny VARCHAR(87)^EDWARD VARCHAR(4) NOT NULL^ROB VARCHAR(73) NOT NULL']}
df = pd.DataFrame.from_dict(data)
# split each row on ^, take the second space-delimited token of every
# segment, then join the tokens back together with ^
df['col3'] = list(map(lambda x: '^'.join([col.split(' ')[1] for col in x]),
                      df.col2.str.split('^')))
Results
0 VARCHAR(11)^VARCHAR(2)^VARCHAR(105)^STRING
1 VARCHAR(87)^VARCHAR(4)^VARCHAR(73)
Name: col3, dtype: object
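The same result can be had with a plain list comprehension over the raw strings (same df as above), which some may find easier to read:
# each ^-separated segment contributes its second space-delimited token
df['col3'] = ['^'.join(seg.split(' ')[1] for seg in row.split('^'))
              for row in df['col2']]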
