Pandas: replace string with special characters - python

I have a dataframe (see below) with two columns that contain either a list of patients or an empty list (like ['']). I want to remove the empty lists.
What I have:
Homozygous_list        heterozygous_list
[Patient1,Patient2]    ['']
['']                   [Patient1]
What I want:
Homozygous_list        heterozygous_list
[Patient1,Patient2]
                       [Patient1]
I tried several things like:
variants["Homozygous_list"].replace("['']","", regex=True, inplace=True)
or
variants["Homozygous_list"].replace("\[''\]","", regex=True, inplace=True)
or
variants["Homozygous_list"] = variants["Homozygous_list"].replace("['']","", regex=True)
etc., but nothing seems to work.

If you really have lists of strings (actual Python list objects, not their string representations), replace won't match them; compare each cell against [''] with applymap:
df = df.applymap(lambda x: '' if x==[''] else x) # or pd.NA in place of ''
output:
        Homozygous_list heterozygous_list
0  [Patient1, Patient2]
1                             [Patient1]
used input:
df = pd.DataFrame({'Homozygous_list': [['Patient1','Patient2'], ['']],
                   'heterozygous_list': [[''], ['Patient1']]})
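Note that on recent pandas (2.1+), DataFrame.applymap is deprecated in favor of DataFrame.map, which takes the same elementwise function:
df = df.map(lambda x: '' if x == [''] else x)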

Related

Replacing strings with special characters in pandas dataframe

I'm looking to replace strings inside DataFrame columns; these strings contain special characters.
I tried the following, but nothing changed in the DataFrame:
data = {'col1': ["series ${z_mask0}", "series ${z_mask1}", "series ${z_mask2}"]}
df = pd.DataFrame(data)
print(df)
old_values = ["${z_mask0}", "${z_mask1}", "${z_mask2}"]
new_values = ["${z_00}", "${z_01}", "${z_02}"]
df = df.replace(old_values, new_values, regex=True)
print(df)
The intended output is:
['series ${z_00}', 'series ${z_01}', 'series ${z_02}']
You need to escape the $ character using \ in the old_values list:
old_values = ["\${z_mask0}", "\${z_mask1}", "\${z_mask2}"]
The above should be enough. Here is all the code:
old_values = ["\${z_mask0}", "\${z_mask1}", "\${z_mask2}"]
new_values = ["${z_00}", "${z_01}", "${z_02}"]
df = df.replace(old_values, new_values, regex=True)
print(df)
Output:
             col1
0  series ${z_00}
1  series ${z_01}
2  series ${z_02}
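If you'd rather not escape each pattern by hand, re.escape can build the escaped patterns for you; a minimal sketch of the same replacement:
import re
import pandas as pd

df = pd.DataFrame({'col1': ["series ${z_mask0}", "series ${z_mask1}", "series ${z_mask2}"]})
old_values = ["${z_mask0}", "${z_mask1}", "${z_mask2}"]
new_values = ["${z_00}", "${z_01}", "${z_02}"]
# re.escape backslash-escapes every regex metacharacter ($, {, }) in each pattern
df = df.replace([re.escape(v) for v in old_values], new_values, regex=True)
print(df)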
Have you tried using raw strings for the old_values? There are regex metacharacters in there ("{", "}", and especially "$") that interfere with the match. A raw string alone does not neutralize them, though; you still need the backslash escape, and the r prefix just keeps Python from treating it as a string escape sequence:
old_values = [r"\${z_mask0}", r"\${z_mask1}", r"\${z_mask2}"]
Note the r before each string and the \ escaping the $.

How to remove newline characters from string in pandas python

For some reason, none of the solutions previously posted about this seem to answer my question.
I am reading in an Excel workbook with 150+ sheets. I am looping through them and preparing the data to be concatenated together (doing things like deleting unneeded/blank columns and transforming some data). However, for some reason, I cannot get rid of any of the newline characters, no matter what I try. Here are some variations that I've tried so you can see what DIDN'T work.
import pandas as pd
import os
os.chdir(r'C:\Users\agray\Downloads')
sheets_dict = pd.read_excel('2022_Advanced_Control.xlsx', sheet_name=None)
df_list = list(sheets_dict.values())
df_list_clean = []
The top part stays the same; this loop portion is what changes.
for df in df_list:
    df.columns = [c.replace(' ', '_') for c in df.columns]
    df.drop(df.columns.difference(['Prescription_Drug_Name','Drug_Tier', 'Drug_Notes']), 1, inplace=True)
    df.drop(df.tail(3).index, inplace=True)
    df.loc[:, 'Prescription_Drug_Name'] = df.loc[:, 'Prescription_Drug_Name'].replace("\n", "", inplace=True)
    df_list_clean.append(df)
This gives me a column that has nothing but blank values.
Here's another way I tried
for df in df_list:
    df.columns = [c.replace(' ', '_') for c in df.columns]
    df.drop(df.columns.difference(['Prescription_Drug_Name','Drug_Tier', 'Drug_Notes']), 1, inplace=True)
    df.drop(df.tail(3).index, inplace=True)
    df['Prescription_Drug_Name'] = df['Prescription_Drug_Name'].replace(r'\n','', regex=True, inplace=True)
    df_list_clean.append(df)
This version is only applying to a copy, so none of the changes it says it's making are actually being made to my df. Any ideas how to get rid of all these "\n" characters in my column? Thanks!
Use str.replace():
df['Prescription_Drug_Name'] = df['Prescription_Drug_Name'].str.replace(r'\n', '', regex=True)
I always advise against inplace=True. Make an explicit copy where you mean to.
"This version is only applying to a copy, so none of the changes... being made to my df." Why don't you clone your data explicitly, like this:
for df in df_list:
    clean = df.copy()
    clean.columns = [c.replace(' ', '_') for c in df.columns]
    clean = clean.drop(df.columns.difference(['Prescription_Drug_Name','Drug_Tier', 'Drug_Notes']), axis=1)
    # drop last three rows
    clean = clean.iloc[:-3]
    # modify column, remove `inplace` here
    clean['Prescription_Drug_Name'] = clean['Prescription_Drug_Name'].replace(r'\n', '', regex=True)
    df_list_clean.append(clean)
That being said, all of the above can be chained, so you can do something like this:
for df in df_list:
    clean = (df.rename(columns=lambda x: x.replace(' ', '_'))
               .reindex(['Prescription_Drug_Name', 'Drug_Tier', 'Drug_Notes'], axis=1)  # select only the columns
               .dropna(axis=0, how='all')
               .iloc[:-3]
               .assign(Prescription_Drug_Name=lambda x:
                       x['Prescription_Drug_Name'].replace(r'\n', '', regex=True)))
    df_list_clean.append(clean)
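A quick sanity check of that chain on toy data (the column names come from the question; the values are made up):
import pandas as pd

df = pd.DataFrame({'Prescription Drug Name': ['DrugA\nXR', 'DrugB', 'x', 'y', 'z'],
                   'Drug Tier': [1, 2, None, None, None],
                   'Drug Notes': ['', 'note', '', '', '']})
clean = (df.rename(columns=lambda x: x.replace(' ', '_'))
           .reindex(['Prescription_Drug_Name', 'Drug_Tier', 'Drug_Notes'], axis=1)
           .iloc[:-3]
           .assign(Prescription_Drug_Name=lambda x:
                   x['Prescription_Drug_Name'].replace(r'\n', '', regex=True)))
print(clean)  # two rows remain, newline stripped from 'DrugA\nXR'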

Python pandas replace function not working with escaped characters

I've already looked at half a dozen SO questions on the Python 3 pandas replace function, and none of them apply to this situation. I have the text \" present in some data, and I need to eliminate ONLY the backslash. Toy code:
import pandas as pd
df = pd.DataFrame(columns=['a'])
df.loc[0] = ['Replace \\"']
df
with output
            a
0  Replace \"
My goal is to rewrite df so that it looks like this:
           a
0  Replace "
None of the following work:
df.replace('\\"', '"', regex=True)
df.replace('\\"', '\"', regex=True)
df.replace('\\\"', '\"', regex=True)
df.replace(r'\"', r'"', regex=True)
df.replace({'\\"':'"'}, regex=True)
df.replace({r'\"':r'"'}, regex=True)
df.replace(to_replace=r'\"', value=r'"', regex=True)
df.replace(to_replace=r'\"', value=r'"', regex=False)
I can't search just for the backslash, because I have legitimate backslashes elsewhere in the data that I don't want to replace.
Thanks for your time!
You can use apply:
In [2596]: df.apply(lambda x: x.str.replace(r'\\"', r'"', regex=True))
Out[2596]:
           a
0  Replace "
If there's only one column in question, you can also do this, which will be a little more performant:
In [2614]: df['a'].str.replace(r'\\"', r'"', regex=True)
Out[2614]:
0    Replace "
Name: a, dtype: object
Try (note that this removes every backslash in the column):
df.a.str.replace('\\', '', regex=False)
result:
0    Replace "
For the whole data frame you can use:
for col in df:
    df[col] = df[col].str.replace(r'\\', '', regex=True)
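Since the asker has legitimate backslashes elsewhere in the data, a pattern tied to the quote is safer; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'a': ['Replace \\"', 'keep C:\\temp']})
# remove a backslash only when it immediately precedes a double quote
df['a'] = df['a'].str.replace(r'\\(?=")', '', regex=True)
print(df)  # row 0: Replace "    row 1: keep C:\temp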

pandas to_csv splits some rows into 2 lines

I have a problem with pandas.to_csv. The pandas dataframe works correctly, and pd.to_excel works well too, but when I try to use .to_csv, some rows are split in two (I see it in WordPad and Excel).
For example:
line 1: provincia;comune;Ragione sociale;indirizzo;civico;destinazione;sup_coperta
line2: AR;CHIUSI DELLA VERNA;ex sacci;LOC. CORSALONE STRADA REGIONALE
line3: 71;;SITO DISMESSO;
my code:
toscana.to_csv("toscana.csv", index=False, encoding="utf-8", sep=";")
EDIT: I added some lines with the problem (thx to all for the comments!).
How can I remove line breaks in values? I found a \r in a cell that was split across 2 csv lines:
Out[17]: u'IMPIANTI SPORTIVI: CIRCOLO CULTURALE RICREATIVO \rPESTELLO'
I solved it with
def replace(x):
    if type(x) == str or type(x) == unicode:
        x = x.replace('\r', '')
    else:
        x = x[0].replace('\r', '')
    return x

toscana["indirizzo"] = toscana["indirizzo"].map(lambda x: x.replace('"', ''))
toscana["indirizzo"] = toscana["indirizzo"].map(lambda x: replace(x))
toscana["Ragione sociale"] = toscana["Ragione sociale"].map(lambda x: x.replace('"', ''))
toscana["Ragione sociale"] = toscana["Ragione sociale"].map(lambda x: replace(x))
Is there a smarter method to do it?
You can use the pandas replace method to achieve this rather than creating a new function.
Pandas Replace Method
It accepts regex, so you can combine expressions with | for "or". In the example we use regex=True, and adding inplace=True changes the values in place without adding or removing any data from the table. r"\\t|\\n|\\r" matches the literal two-character sequences \t, \n and \r, while "\t|\n|\r" matches actual tab, newline and carriage-return characters; both are replaced with an empty string:
df.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["",""], regex=True, inplace=True)
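Applied to the frame from the question before exporting (assuming toscana is the dataframe being written), that would look something like:
# strip both literal "\t"/"\n"/"\r" sequences and real control characters
toscana.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["", ""], regex=True, inplace=True)
toscana.to_csv("toscana.csv", index=False, encoding="utf-8", sep=";")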

Pythonic/efficient way to strip whitespace from every Pandas Data frame cell that has a stringlike object in it

I'm reading a CSV file into a DataFrame. I need to strip whitespace from all the stringlike cells, leaving the other cells unchanged in Python 2.7.
Here is what I'm doing:
def remove_whitespace(x):
    if isinstance(x, basestring):
        return x.strip()
    else:
        return x

my_data = my_data.applymap(remove_whitespace)
Is there a better or more Pandas-idiomatic way to do this?
Is there a more efficient way (perhaps by doing things column wise)?
I've tried searching for a definitive answer, but most questions on this topic seem to be how to strip whitespace from the column names themselves, or presume the cells are all strings.
Stumbled onto this question while looking for a quick and minimalistic snippet I could use. Had to assemble one myself from posts above. Maybe someone will find it useful:
data_frame_trimmed = data_frame.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
You could use pandas' Series.str.strip() method to do this quickly for each string-like column:
>>> data = pd.DataFrame({'values': [' ABC ', ' DEF', ' GHI ']})
>>> data
   values
0   ABC  
1    DEF
2   GHI  
>>> data['values'].str.strip()
0    ABC
1    DEF
2    GHI
Name: values, dtype: object
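Note that the result has to be assigned back; str.strip() returns a new Series rather than modifying the dataframe in place:
data['values'] = data['values'].str.strip()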
We want to:
- apply our function to each element in the dataframe: use applymap;
- use type(x) == str (versus x.dtype == 'object'), because pandas labels columns of mixed datatypes as object (an object column may contain int and/or str);
- maintain the datatype of each element (we don't want to convert everything to str and then strip whitespace).
Therefore, I've found the following to be the easiest:
df.applymap(lambda x: x.strip() if type(x) == str else x)
When you call pandas.read_csv, you can use a regular expression that matches zero or more spaces followed by a comma followed by zero or more spaces as the delimiter.
For example, here's "data.csv":
In [19]: !cat data.csv
1.5, aaa, bbb , ddd , 10 , XXX
2.5, eee, fff , ggg, 20 , YYY
(The first line ends with three spaces after XXX, while the second line ends at the last Y.)
The following uses pandas.read_csv() to read the files, with the regular expression ' *, *' as the delimiter. (Using a regular expression as the delimiter is only available in the "python" engine of read_csv().)
In [20]: import pandas as pd
In [21]: df = pd.read_csv('data.csv', header=None, delimiter=' *, *', engine='python')
In [22]: df
Out[22]:
     0    1    2    3   4    5
0  1.5  aaa  bbb  ddd  10  XXX
1  2.5  eee  fff  ggg  20  YYY
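A raw-string variant of the same idea, matching any run of whitespace around the comma (r'\s' also covers tabs), is sketched below:
df = pd.read_csv('data.csv', header=None, sep=r'\s*,\s*', engine='python')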
The data['values'].str.strip() answer above did not work for me until I realized that str.strip() works on a Series and returns a new Series rather than modifying the dataframe, so the result has to be assigned back. Thus: convert the dataframe column into a Series, strip the whitespace, and replace the converted column back into the dataframe. Below is the example code.
import pandas as pd

data = pd.DataFrame({'values': [' ABC ', ' DEF', ' GHI ']})
print('-----')
print(data)
# calling str.strip() alone does not change the dataframe
data['values'].str.strip()
print('-----')
print(data)
# assign the stripped Series back into the dataframe
new = data['values'].str.strip()
data['values'] = new
print('-----')
print(data)
Here is a column-wise solution with pandas apply:
import numpy as np

def strip_obj(col):
    if col.dtypes == object:
        return (col.astype(str)
                   .str.strip()
                   .replace({'nan': np.nan}))
    return col

df = df.apply(strip_obj, axis=0)
This will convert the values in object-type columns to strings, so take care with mixed-type columns. For example, if your column holds zip codes 20001 (an int) and ' 21110 ' (a string), you will end up with the strings '20001' and '21110'.
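A quick illustration of that caveat, reusing strip_obj from above on made-up data:
import pandas as pd

df = pd.DataFrame({'zip': [20001, ' 21110 ']})
out = df.apply(strip_obj, axis=0)
print(out['zip'].tolist())  # ['20001', '21110'] -- the int became a string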
This worked for me - applies it to the whole dataframe:
def panda_strip(x):
    r = []
    for y in x:
        if isinstance(y, str):
            y = y.strip()
        r.append(y)
    return pd.Series(r)

df = df.apply(lambda x: panda_strip(x))
I found the following code useful and something that would likely help others. This snippet lets you delete spaces in a single column as well as in the entire DataFrame, depending on your use case.
import pandas as pd

def remove_whitespace(x):
    try:
        # remove spaces inside and outside of the string
        x = "".join(x.split())
    except Exception:
        pass
    return x

# apply remove_whitespace to a single column only
df.orderId = df.orderId.apply(remove_whitespace)
print(df)

# apply remove_whitespace to the entire dataframe
df = df.applymap(remove_whitespace)
print(df)
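Worth noting: "".join(x.split()) collapses interior whitespace too, so multi-word values get fused. A quick check with made-up data, reusing remove_whitespace from above:
df = pd.DataFrame({'orderId': [' A 001 ', 'B002 ']})
df.orderId = df.orderId.apply(remove_whitespace)
print(df.orderId.tolist())  # ['A001', 'B002']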
