I have a csv that has a column called 'ra'.
This is the first 'ra' value the csv has: 8570.0 - I will use it as an example.
I need to remove '.0'.
So I've tried:
dtypes = {
    'ra': 'str',
}
df['ra_csv'] = pd.DataFrame({'ra_csv': df['ra']}).replace('.0', '', regex=True).astype(str)
This code returns '85' instead of '8570'. It's replacing all the 0s, and somehow removed the number '7' as well.
How can I make it return '8570'? Thanks.
Option 1: use to_numeric to first convert the data to numeric type, then convert to int:
df['ra_csv'] = pd.to_numeric(df['ra_csv']).astype(int)
Option 2: use str.replace to strip the decimal point and everything after it:
df['ra_csv'] = df['ra_csv'].str.replace(r'\..*', '', regex=True)
You get:
  ra_csv
0   8570
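For a quick end-to-end check, here is a minimal runnable sketch of both options (the second sample value '123.0' is made up for illustration):
import pandas as pd

df = pd.DataFrame({'ra': ['8570.0', '123.0']})

# Option 1: numeric round-trip
df['ra_csv'] = pd.to_numeric(df['ra']).astype(int)

# Option 2: strip the decimal point and everything after it
df['ra_csv'] = df['ra'].str.replace(r'\..*', '', regex=True)

print(df['ra_csv'].tolist())  # ['8570', '123']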
The regex pattern .0 has two matches in your string '8570.0', because . matches any character: first '70', then '.0'.
Since you are using df.replace, setting regex=False wouldn't help either, because it then checks for exact matches only.
From the df.replace docs:
str: string exactly matching to_replace will be replaced with value
Possible fixes: either fix your regex, or use pd.Series.str.replace.
Fixing your regex:
df.replace(r'\.0', '', regex=True)
Using str.replace:
df['ra'].str.replace('.0', '', regex=False)
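A quick demonstration of both fixes on the question's value (a minimal sketch):
import pandas as pd

s = pd.Series(['8570.0'])
print(s.replace(r'\.0', '', regex=True).tolist())     # ['8570']
print(s.str.replace('.0', '', regex=False).tolist())  # ['8570']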
I aim to write a function to apply to an entire dataframe: each column is checked to see if it contains the currency symbol '$', which should be removed.
Surprisingly, a case like:
import pandas as pd
dates = pd.date_range(start='2021-01-01', end='2021-01-10').strftime('%d-%m-%Y')
print(dates)
output:
Index(['01-01-2021', '02-01-2021', '03-01-2021', '04-01-2021', '05-01-2021', '06-01-2021', '07-01-2021', '08-01-2021', '09-01-2021', '10-01-2021'], dtype='object')
But when I do:
dates.str.contains('$').all()
It returns True. Why???
.contains uses regex by default, not just a raw string. And $ means the end of the line in regex (intuitively or not, all strings have "the end"). To check for the literal symbol '$' you need to escape it:
dates.str.contains(r'\$').all()
Or you can use the regex=False argument of .contains():
dates.str.contains('$', regex=False).all()
Both options return False.
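Coming back to the original goal of stripping a literal '$' from an entire dataframe, a minimal sketch (the column names and values here are made up):
import pandas as pd

df = pd.DataFrame({'price': ['$10', '$25'], 'note': ['ok', '$fee']})
df = df.replace(r'\$', '', regex=True)  # escaped, so it matches a literal '$'
print(df)
#   price note
# 0    10   ok
# 1    25  fee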
I'm trying to replace special characters in a data frame with unaccented or different ones.
I can replace one with
df['col_name'] = df.col_name.str.replace('?','j')
this turned the '?' into 'j' - but I can't seem to figure out how to change more than one.
I have a list of special characters that I want to change. I've tried using a dictionary but it doesn't seem to work
the_reps = {'?','j'}
df1 = df.replace(the_reps, regex = True)
this gave me the error "nothing to replace at position 0"
EDIT:
this is what worked - although it is probably not that pretty:
df[col]=df.col.str.replace('old char','new char')
df[col]=df.col.str.replace('old char','new char')
df[col]=df.col.str.replace('old char','new char')
df[col]=df.col.str.replace('old char','new char')...
for each one.
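Note that {'?','j'} as written is a set literal, not a dict; a dict needs key: value pairs. The dict form can work with Series.replace, as long as each key is escaped so regex metacharacters like '?' are taken literally. A minimal sketch (the characters and column name are made up):
import re
import pandas as pd

the_reps = {'?': 'j', '~': '-'}
# escape each old character so '?' is treated as a literal, not a regex quantifier
pattern_map = {re.escape(old): new for old, new in the_reps.items()}

df = pd.DataFrame({'col_name': ['a?b', 'c~d']})
df['col_name'] = df['col_name'].replace(pattern_map, regex=True)
print(df['col_name'].tolist())  # ['ajb', 'c-d']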
import re
s=re.sub("[_list of special characters_]","",_your string goes here_)
print(s)
An example:
str="Hello$#& Python3$"
import re
s=re.sub("[$#&]","",str)
print (s)
#Output:Hello Python3
Explanation:
s = re.sub("[$#&]", "", text)
Pattern to be replaced → "[$#&]"
[] is used to indicate a set of characters
[$#&] → will match either $ or # or &
The replacement string is given as an empty string
If these characters are found in the string, they'll be replaced with an empty string
You can use Series.replace with a dictionary:
# d = {'actual character': 'replacement', ...}
df.columns = df.columns.to_series().replace(d, regex=True)
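A minimal sketch of that column-rename idea (the names and characters here are made up):
import pandas as pd

d = {'é': 'e', 'ñ': 'n'}
df = pd.DataFrame(columns=['café', 'niño'])
df.columns = df.columns.to_series().replace(d, regex=True)
print(df.columns.tolist())  # ['cafe', 'nino']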
Try this:
import re
my_str = "hello Fayzan-Bhatti Ho~!w"
my_new_string = re.sub(r'[^a-zA-Z0-9 \n.]', '', my_str)
print(my_new_string)
Output: hello FayzanBhatti How
I have a column that has null/missing values written as strings such as 'There is no classification', 'unkown: there is no accurate classification', and other variants. I would like to replace all of these values with None.
I have tried this but it isn't working:
df['Fourth level classification'] = df['Fourth level classification'].replace(
to_replace=r'.*[Tt]here is no .*', value=None, regex=True
)
Furthermore, how can I make the entire to_replace string case-insensitive, so that it would also match 'tHere is NO cLaSsification', etc.?
You can try this. Note that pd.isna is a function, not a replacement value; use np.nan instead, and the (?i) inline flag makes the pattern case-insensitive without permanently lowercasing the column:
import numpy as np
df['Fourth level classification'] = df['Fourth level classification'].replace(
    r'(?i).*there is no.*', np.nan, regex=True)
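A quick check on the question's sample values (a minimal sketch):
import numpy as np
import pandas as pd

s = pd.Series(['There is no classification',
               'tHere is NO cLaSsification',
               'class A'])
print(s.replace(r'(?i).*there is no.*', np.nan, regex=True).tolist())
# [nan, nan, 'class A']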
I have a column in my dataframe like this:
range
"(2,30)"
"(50,290)"
"(400,1000)"
...
and I want to replace the , comma with - dash. I'm currently using this method but nothing is changed.
org_info_exc['range'].replace(',', '-', inplace=True)
Can anybody help?
Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df
range
0 (2-30)
1 (50-290)
EDIT: so if we look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
from the docs we see this description:
str or regex: str: string exactly matching to_replace will be replaced
with value
So because the str values do not match exactly, no replacement occurs. Compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0 (2,30)
1 -
Name: range, dtype: object
here we get an exact match on the second row and the replacement occurs.
For anyone else arriving here from Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column):
Pandas has a built-in replace method available on a dataframe object.
df.replace(',', '-', regex=True)
Source: Docs
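For instance, a minimal sketch with two string columns (the data here is made up):
import pandas as pd

df = pd.DataFrame({'a': ['(2,30)', '(50,290)'], 'b': ['1,5', '2,6']})
print(df.replace(',', '-', regex=True))
#          a    b
# 0   (2-30)  1-5
# 1 (50-290)  2-6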
If you only need to replace characters in one specific column, and somehow regex=True and inplace=True both failed, I think this way will work:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
The lambda here acts like a small function applied once per entry, much like a for loop over the column.
x represents each entry in the current column.
The only thing you need to do is to change the "column_name", "characters_need_to_replace" and "new_characters".
Replace all commas with underscore in the column names:
data.columns = data.columns.str.replace(',', '_', regex=False)
In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:
import re
chars_to_remove = ['.', '-', '(', ')']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'
df['string_col'] = df['string_col'].str.replace(regular_expression, '', regex=True)
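And a short usage sketch of that pattern (the sample values are made up):
import re
import pandas as pd

chars_to_remove = ['.', '-', '(', ')']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'

s = pd.Series(['(12.3)', 'a-b.c'])
print(s.str.replace(regular_expression, '', regex=True).tolist())
# ['123', 'abc']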
Similar to the answer by Nancy K, this works for me:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
(Inside apply, x is a plain string, so it is x.replace rather than the .str accessor.)
If you want to remove two or more elements from a string, example the characters '$' and ',' :
Column_Name
===========
$100,000
$1,100,000
... then use:
data.Column_Name.str.replace("[$,]", "", regex=True)
=> ['100000', '1100000']
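Put together as a runnable sketch, with an optional cast to int afterwards (the frame is rebuilt from the sample above):
import pandas as pd

data = pd.DataFrame({'Column_Name': ['$100,000', '$1,100,000']})
cleaned = data.Column_Name.str.replace('[$,]', '', regex=True)
print(cleaned.tolist())              # ['100000', '1100000']
print(cleaned.astype(int).tolist())  # [100000, 1100000]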
I have a similar question to this one: Pandas DataFrame: remove unwanted parts from strings in a column.
So I used:
temp_dataframe['PPI'] = temp_dataframe['PPI'].map(lambda x: x.lstrip('PPI/'))
Most of the items start with 'PPI/', but not all. It seems that when an item without the 'PPI/' prefix is encountered, I get this error:
AttributeError: 'float' object has no attribute 'lstrip'
Am I missing something here?
Use replace:
temp_dataframe['PPI'].replace('PPI/', '', regex=True, inplace=True)
or str.replace:
temp_dataframe['PPI'] = temp_dataframe['PPI'].str.replace('PPI/', '')
Use vectorised str.lstrip:
temp_dataframe['PPI'] = temp_dataframe['PPI'].str.lstrip('PPI/')
It looks like you may have missing values, so you should mask those out or replace them:
temp_dataframe['PPI'].fillna('', inplace=True)
or
temp_dataframe.loc[temp_dataframe['PPI'].notnull(), 'PPI'] = temp_dataframe['PPI'].str.lstrip('PPI/')
Maybe a better method is to filter using str.startswith, then use split and access the string after the prefix you want to remove:
temp_dataframe.loc[temp_dataframe['PPI'].str.startswith('PPI/'), 'PPI'] = temp_dataframe['PPI'].str.split('PPI/').str[1]
As #JonClements pointed out, lstrip strips any of the characters in the set 'PPI/' from the left, rather than removing the literal prefix, which is what you're after.
Update:
Another method is to pass a regex pattern that looks for the optional prefix and extracts all characters after it:
temp_dataframe['PPI'].str.extract(r'(?:PPI/)?(.*)', expand=False)
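A minimal sketch of that pattern on mixed values, including a missing one:
import numpy as np
import pandas as pd

s = pd.Series(['PPI/123', '456', np.nan])
print(s.str.extract(r'(?:PPI/)?(.*)', expand=False).tolist())
# ['123', '456', nan]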