I'm looking to replace strings inside DataFrame columns; these strings contain special characters.
I tried the following, but nothing changed in the DataFrame:
import pandas as pd

data = {'col1': ["series ${z_mask0}", "series ${z_mask1}", "series ${z_mask2}"]}
df = pd.DataFrame(data)
print(df)
old_values = ["${z_mask0}", "${z_mask1}", "${z_mask2}"]
new_values = ["${z_00}", "${z_01}", "${z_02}"]
df = df.replace(old_values, new_values, regex=True)
print(df)
The intended output is:
['series ${z_00}', 'series ${z_01}', 'series ${z_02}']
You need to escape the $ character using \ in the old_values list, since $ is the end-of-string anchor in regex (a raw string keeps Python from eating the backslash):
old_values = [r"\${z_mask0}", r"\${z_mask1}", r"\${z_mask2}"]
The above should be enough. Here is the full code:
old_values = [r"\${z_mask0}", r"\${z_mask1}", r"\${z_mask2}"]
new_values = ["${z_00}", "${z_01}", "${z_02}"]
df = df.replace(old_values, new_values, regex=True)
print(df)
Output:
             col1
0  series ${z_00}
1  series ${z_01}
2  series ${z_02}
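If the literal strings come from elsewhere, or you'd rather not escape each one by hand, you can let re.escape build the patterns for you. A minimal sketch, assuming the same df as above:
import re
import pandas as pd

df = pd.DataFrame({'col1': ["series ${z_mask0}", "series ${z_mask1}", "series ${z_mask2}"]})
literals = ["${z_mask0}", "${z_mask1}", "${z_mask2}"]
new_values = ["${z_00}", "${z_01}", "${z_02}"]

# re.escape backslash-escapes every regex metacharacter ($, {, } here),
# so the patterns always stay in sync with the literal strings
df = df.replace([re.escape(s) for s in literals], new_values, regex=True)
print(df)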
Have you tried escaping the RegEx characters in old_values? There is a metacharacter in there that interferes with your result: "$" anchors a pattern to the end of the string ("{" and "}" are harmless here, since they don't form a valid quantifier). Raw strings alone don't neutralize it, because they only change how Python parses backslashes, so combine them with an escape:
old_values = [r"\${z_mask0}", r"\${z_mask1}", r"\${z_mask2}"]
Note the r before each string, which keeps the backslash intact.
I have a dataframe (see below) in which I have two columns containing either a list of patients or an empty list (like this: ['']). I want to remove the empty lists.
What I have:
Homozygous_list       heterozygous_list
[Patient1,Patient2]   ['']
['']                  [Patient1]
What I want:
Homozygous_list       heterozygous_list
[Patient1,Patient2]
                      [Patient1]
I tried several things like:
variants["Homozygous_list"].replace("['']","", regex=True, inplace=True)
or
variants["Homozygous_list"].replace("\[''\]","", regex=True, inplace=True)
or
variants["Homozygous_list"] = variants["Homozygous_list"].replace("['']","", regex=True)
etc., but nothing seems to work.
If you really have lists of strings, use applymap:
df = df.applymap(lambda x: '' if x==[''] else x) # or pd.NA in place of ''
output:
        Homozygous_list heterozygous_list
0  [Patient1, Patient2]
1                             [Patient1]
used input:
df = pd.DataFrame({'Homozygous_list': [['Patient1', 'Patient2'], ['']],
                   'heterozygous_list': [[''], ['Patient1']]})
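Note that applymap was renamed to DataFrame.map in pandas 2.1, so on recent versions the equivalent call (same input as above) is:
# Same element-wise idea; DataFrame.map replaced applymap in pandas 2.1
df = df.map(lambda x: '' if x == [''] else x)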
I have a series of columns:
COLUMNS = contract_number, award_date_x, publication_date_x, award_date_y, publication_date_y, award_date, publication_date
I would like to drop all of the 'publication_date' columns that end with '_[a-z]', so that my final result would look like this:
COLUMNS = contract_number, award_date, award_date_x, award_date_y, publication_date
I have tried the following with no luck:
df_merge=df_merge.drop(c for c in df_merge.columns if c.str.contains('publication_date_[a+z]$'))
Thanks
Try this:
import re

columns = df_merge.columns.tolist()  # all column names
for col in columns:
    if re.match(r"publication_date_[a-z]$", col):  # regex for the match case
        df_merge.drop([col], axis=1, inplace=True)  # drop the matching column
df_merge.head()  # filtered dataframe
A startswith check also works, if you don't mind that it also drops longer suffixes like 'publication_date_x_y_y':
lis = ["publication_date_x", "publication_date", "publication_date_x_y_y", "hello"]
new_list = [x for x in lis if not x.startswith('publication_date_')]
The output will be:
new_list: ["publication_date", "hello"]
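To run the same filter against the dataframe itself, a small sketch (assuming the df_merge from the question):
# Keep only the columns whose names survive the startswith filter
keep = [c for c in df_merge.columns if not c.startswith('publication_date_')]
df_merge = df_merge[keep]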
If you want to use str.contains, one option is to wrap the columns in a Series:
series_cols = pd.Series(df_merge.columns)
bool_series_cols = series_cols.str.contains('publication_date_[a-z]$')
df_merge.drop([c for c, bool_c in zip(series_cols, bool_series_cols) if bool_c], axis=1, inplace=True)
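In fact the columns Index supports the .str accessor directly, so the Series detour can be skipped. A one-line sketch of the same drop:
# Boolean-mask the columns with the same regex; ~ inverts the matches
df_merge = df_merge.loc[:, ~df_merge.columns.str.contains(r'publication_date_[a-z]$')]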
I want to execute a regexp match on a dataframe column in order to modify the content of the column.
For example, given this dataframe:
import pandas as pd
df = pd.DataFrame([['abra'], ['charmender'], ['goku']], columns=['Name'])
print(df.head())
I want to execute the following regex match:
CASE
WHEN REGEXP_MATCH(Landing Page,'abra') THEN "kadabra"
WHEN REGEXP_MATCH(Landing Page,'charmender') THEN "charmaleon"
ELSE "Unknown" END
My solution is the following:
df.loc[df['Name'].str.contains("abra", na=False), 'Name'] = "kadabra"
df.loc[df['Name'].str.contains("charmender", na=False), 'Name'] = "charmeleon"
df.head()
It works but I do not know if there is a better way of doing it.
Moreover, I have to rewrite all the regex cases line by line in Python. Is there a way to execute the regex directly in Pandas?
Are you looking for map?
df['Name'] = df['Name'].map({'abra':'kadabra','charmender':'charmeleon'})
Output:
         Name
0     kadabra
1  charmeleon
2         NaN
Update: For partial matches:
df = pd.DataFrame([['this abra'], ['charmender'], ['goku']], columns=['Name'])
replaces = {'abra': 'kadabra', 'charmender': 'charmeleon'}
df['Name'] = df['Name'].str.extract(fr"\b({'|'.join(replaces.keys())})\b")[0].map(replaces)
And you get the same output (for this different dataframe).
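If you also want the ELSE "Unknown" branch of the SQL, note that map leaves unmatched rows as NaN, so you can fill them afterwards; a small sketch of the exact-match case:
# fillna supplies the default for names map didn't match
df['Name'] = df['Name'].map({'abra': 'kadabra', 'charmender': 'charmeleon'}).fillna('Unknown')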
I've already looked at half a dozen SO questions on the Python 3 pandas replace function, and none of them apply to this situation. I have the text \" present in some data, and I need to eliminate ONLY the backslash. Toy code:
import pandas as pd
df = pd.DataFrame(columns=['a'])
df.loc[0] = ['Replace \\"']
df
with output
            a
0  Replace \"
My goal is to rewrite df so that it looks like this:
           a
0  Replace "
None of the following work:
df.replace('\\"', '"', regex=True)
df.replace('\\"', '\"', regex=True)
df.replace('\\\"', '\"', regex=True)
df.replace('\\\"', '\"', regex=True)
df.replace(r'\"', r'"', regex=True)
df.replace({'\\"':'"'}, regex=True)
df.replace({r'\"':r'"'}, regex=True)
df.replace(to_replace=r'\"', value=r'"', regex=True)
df.replace(to_replace=r'\"', value=r'"', regex=False)
I can't search just for the backslash, because I have legitimate backslashes elsewhere in the data that I don't want to replace.
Thanks for your time!
You can use apply (passing regex=True explicitly, since Series.str.replace defaults to regex=False as of pandas 2.0):
In [2596]: df.apply(lambda x: x.str.replace(r'\\"', r'"', regex=True))
Out[2596]:
a
0 Replace "
If there's only one column in question, you can also do this, which will be a little more performant:
In [2614]: df['a'].str.replace(r'\\"', r'"', regex=True)
Out[2614]:
0 Replace "
Name: a, dtype: object
Try (note this strips every backslash in the column, so only use it where the other backslashes don't matter):
df.a.str.replace('\\', '')
result:
0 Replace "
For the whole data frame you can use (again with regex=True, since on pandas 2.0+ the raw pattern would otherwise be taken as two literal backslashes):
for col in df:
    df[col] = df[col].str.replace(r'\\', '', regex=True)
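For what it's worth, the df.replace attempts from the question were close; with regex=True the pattern just needs the backslash doubled inside a raw string. A minimal sketch:
# r'\\"' is regex for: one literal backslash followed by a quote, so only
# that backslash is removed and legitimate backslashes elsewhere survive
df = df.replace(r'\\"', '"', regex=True)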
I created a dataframe df from csv data looking like:
col_1,col_2
001,JOHN VARCHAR(11) NOT NULL^RANDY VARCHAR(2) NOT NULL^MICHAEL VARCHAR(105) NOT NULL^DATE STRING
002,Danny VARCHAR(87)^EDWARD VARCHAR(4) NOT NULL^ROB VARCHAR(73) NOT NULL
I'm trying to get the second space-delimited value of each segment by splitting col_2 on the ^ delimiter, producing the df below:
col_1,col_2,col_3
001,JOHN VARCHAR(11) NOT NULL^RANDY VARCHAR(2) NOT NULL^MICHAEL VARCHAR(105) NOT NULL^DATE STRING,VARCHAR(11)^VARCHAR(2)^VARCHAR(105)^STRING
002,Danny VARCHAR(87)^EDWARD VARCHAR(4) NOT NULL^ROB VARCHAR(73) NOT NULL,VARCHAR(87)^VARCHAR(4)^VARCHAR(73)
I'm using the below, but I'm unable to get the 2nd space-delimited value of each segment:
df['col_3'] = df['col_2'].map(lambda v: v.split(' ')[1])
It might not answer your question directly, but I think the question is related to how to explode a list inside a pandas dataframe:
df["col_2"].str.split("^", expand=True).stack().reset_index()
You're on the right path: you can split the value of col2 on the ^ character, take the second space-delimited token of each segment, join them back with ^, and assign the result to col3, as such:
import pandas as pd

data = {'col1': ['001', '002'],
        'col2': ['JOHN VARCHAR(11) NOT NULL^RANDY VARCHAR(2) NOT NULL^MICHAEL VARCHAR(105) NOT NULL^DATE STRING',
                 'Danny VARCHAR(87)^EDWARD VARCHAR(4) NOT NULL^ROB VARCHAR(73) NOT NULL']}
df = pd.DataFrame.from_dict(data)
# For each row: split on ^, take the 2nd space-delimited token of each
# segment, then join the tokens back with ^
df['col3'] = list(map(lambda x: '^'.join([col.split(' ')[1] for col in x]), df.col2.str.split('^')))
Results
0 VARCHAR(11)^VARCHAR(2)^VARCHAR(105)^STRING
1 VARCHAR(87)^VARCHAR(4)^VARCHAR(73)
Name: col3, dtype: object
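The same idea may read more clearly as a named function applied row-wise; a sketch with this answer's column names:
def second_tokens(value):
    # 2nd space-delimited token of each ^-delimited segment, re-joined with ^
    return '^'.join(part.split(' ')[1] for part in value.split('^'))

df['col3'] = df['col2'].apply(second_tokens)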