I have a column that has null/missing values written as strings such as 'There is no classification', 'unkown: there is no accurate classification', and other variants. I would like to replace all of these values with None.
I have tried this but it isn't working:
df['Fourth level classification'] = df['Fourth level classification'].replace(
to_replace=r'.*[Tt]here is no .*', value=None, regex=True
)
Furthermore, how can I make the entire to_replace pattern case-insensitive, so that it would also match 'tHere is NO cLaSsification', etc.?
You can try this. The catch is that value=None does not mean "replace with missing": pandas interprets it as "no replacement value given", so the call doesn't do what you expect. Replace with np.nan instead, and add the inline flag (?i) to make the pattern case-insensitive:
import numpy as np
df['Fourth level classification'] = df['Fourth level classification'].replace(
    to_replace=r'(?i).*there is no.*', value=np.nan, regex=True
)
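As a quick check, a minimal self-contained sketch with made-up values (the column name is taken from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Fourth level classification': [
    'There is no classification',
    'unkown: there is no accurate classification',
    'tHere is NO cLaSsification',
    'valid label',
]})
df['Fourth level classification'] = df['Fourth level classification'].replace(
    r'(?i).*there is no.*', np.nan, regex=True
)
print(df['Fourth level classification'].tolist())
# [nan, nan, nan, 'valid label'] -> all three variants become missing values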
I aim to write a function to apply to an entire dataframe: each column is checked to see whether it contains the currency symbol '$', and if so, the symbol is removed.
Surprisingly, a case like:
import pandas as pd
dates = pd.date_range(start='2021-01-01', end='2021-01-10').strftime('%d-%m-%Y')
print(dates)
output:
Index(['01-01-2021', '02-01-2021', '03-01-2021', '04-01-2021', '05-01-2021', '06-01-2021', '07-01-2021', '08-01-2021', '09-01-2021', '10-01-2021'], dtype='object')
But when I do:
dates.str.contains('$').all()
It returns True. Why???
.contains uses regex by default, not a raw string. In regex, $ means the end of the line, and (intuitively or not) every string has an end. To match the literal symbol '$' you need to escape it, using a raw string so the backslash survives:
dates.str.contains(r'\$').all()
Or you can use the regex=False argument of .contains():
dates.str.contains('$', regex=False).all()
Both options return False.
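To see why the unescaped pattern matches everything, here is a minimal sketch using the re module directly (made-up strings):
import re
print(bool(re.search('$', 'anything')))    # True: '$' matches the end-of-string position, which every string has
print(bool(re.search(r'\$', 'anything')))  # False: no literal '$' in the string
print(bool(re.search(r'\$', 'cost: $5')))  # True: a literal '$' is present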
How do I go about removing an empty string or at least having regex ignore it?
I have some data that looks like this
EIV (5.11 gCO₂/t·nm)
I'm trying to extract the numbers only. I have done the following:
df['new column'] = df['column containing that value'].str.extract(r'((\d+.\d*)|(\d+)|(\.\d+)|(\d+[eE][+]?\d*)?)').astype('float')
since the numbers can be floats or integers, and I think there's one exponent, 4E+1.
However, when I run it I get the error in the title, which I presume is caused by an empty string.
What am I missing here to allow the code to run?
Try this:
import re
c = "EIV (5.11 gCO₂/t·nm)"
x = re.findall(r"[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?", c)
print(x)
Will give
['5.11']
The problem is not only the number of capturing groups, but also the fact that the last alternative in your regex is optional (note the ? right after it). Since Series.str.extract returns the first match, your regex matches and returns the empty string at the start of the string whenever the real match is not at the string start position.
It is best to use the well-known single alternative patterns to match any numbers with a single capturing group, e.g.
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
See Example Regexes to Match Common Programming Language Constructs.
Pandas test:
import pandas as pd
df = pd.DataFrame({'col':['EIV (5.11 gCO₂/t·nm)', 'EIV (5.11E+12 gCO₂/t·nm)']})
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
# => 0
# 0 5.110000e+00
# 1 5.110000e+12
There are also quite a few other such regex variations at Parsing scientific notation sensibly?, and you may also use r"([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", r"(-?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)", r"([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)", etc.
If your column consists of data in the same format as you have posted (EIV (5.11 gCO₂/t·nm)), then this will work:
import pandas as pd
df['new_exctracted_column'] = df['column containing that value'].str.extract(r'(\d+(?:\.\d+)?)')
df
5.11
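One caveat worth noting, shown as a minimal sketch: this simpler pattern drops the exponent in scientific-notation values like the 4E+1 mentioned in the question.
import pandas as pd

s = pd.Series(['EIV (5.11 gCO₂/t·nm)', 'EIV (4E+1 gCO₂/t·nm)'])
print(s.str.extract(r'(\d+(?:\.\d+)?)'))
#       0
# 0  5.11
# 1     4   <- 'E+1' is lost; use one of the exponent-aware patterns above instead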
I have a csv that has a column called 'ra'.
This is the first 'ra' value the csv has: 8570.0 - I will use it as an example.
I need to remove '.0'.
So I've tried:
dtypes = {
'ra': 'str',
}
df['ra_csv'] = pd.DataFrame({'ra_csv': df['ra']}).replace('.0', '', regex=True).astype(str)
This code returns '85' instead of '8570'. It's replacing all the 0s and somehow removed the number '7' as well.
How can I make it return '8570'? Thanks.
Option 1: use to_numeric to first convert the data to a numeric type, then convert to int:
df['ra_csv'] = pd.to_numeric(df['ra_csv']).astype(int)
Option 2: use str.replace to strip everything from the decimal point on (note the raw string and regex=True):
df['ra_csv'] = df['ra_csv'].str.replace(r'\..*', '', regex=True)
You get
ra_csv
0 8570
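A self-contained sketch of both options with made-up values:
import pandas as pd

df = pd.DataFrame({'ra_csv': ['8570.0', '123.0']})
print(pd.to_numeric(df['ra_csv']).astype(int).tolist())            # [8570, 123]
print(df['ra_csv'].str.replace(r'\..*', '', regex=True).tolist())  # ['8570', '123']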
The regex pattern .0 has two matches in your string '8570.0', because . matches any character: '70' and '.0'.
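You can see both matches with a minimal sketch using the re module:
import re
print(re.findall('.0', '8570.0'))  # ['70', '.0']; removing both leaves '85'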
Since you are using df.replace, setting regex=False wouldn't help either, because it then checks for exact matches only.
From docs df.replace:
str: string exactly matching to_replace will be replaced with value
Possible fixes are either fix your regex or use pd.Series.str.replace
Fixing your regex
df.replace(r'\.0', '', regex=True)
Using str.replace
df['ra'].str.replace('.0', '', regex=False)
I am trying to use the following code to make replacements in a pandas dataframe:
replacerscompanya = {',':'','.':'','-':'','ltd':'limited','&':'and'}
df1['CompanyA'] = df1['CompanyA'].replace(replacerscompanya)
replacersaddress1a = {',':'','.':'','-':'','ltd':'limited','&':'and', r'\brd\b':'road'}
df1['Address1A'] = df1['Address1A'].replace(replacersaddress1a)
replacersaddress2a = {',':'','.':'','-':'','ltd':'limited','&':'and', r'\brd\b':'road'}
df1['Address2A'] = df1['Address2A'].replace(replacersaddress2a)
It does not give me an error, but when I check the dataframe, no replacements have been made.
I had previously just used a number of lines like the one below to achieve the same result, but I was hoping to create something a bit simpler to adjust.
df1['CompanyA'] = df1['CompanyA'].str.replace('.','')
Any ideas as to what is going on here?
Thanks!
Escape . in the dictionary because it is a special regex character, and add the parameter regex=True, both for substring replacement and for replacing by regex:
replacerscompanya = {',': '', r'\.': '', '-': '', 'ltd': 'limited', '&': 'and'}
df1['CompanyA'] = df1['CompanyA'].replace(replacerscompanya, regex=True)
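A minimal, self-contained sketch with made-up company names:
import pandas as pd

df1 = pd.DataFrame({'CompanyA': ['Acme ltd.', 'Foo & Bar - Baz']})
replacerscompanya = {',': '', r'\.': '', '-': '', 'ltd': 'limited', '&': 'and'}
df1['CompanyA'] = df1['CompanyA'].replace(replacerscompanya, regex=True)
print(df1['CompanyA'].tolist())
# ['Acme limited', 'Foo and Bar  Baz']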
I have a similar question to this one: Pandas DataFrame: remove unwanted parts from strings in a column.
So I used:
temp_dataframe['PPI'] = temp_dataframe['PPI'].map(lambda x: x.lstrip('PPI/'))
Most of the items start with 'PPI/', but not all. It seems that when an item without the 'PPI/' prefix is encountered, I get this error:
AttributeError: 'float' object has no attribute 'lstrip'
Am I missing something here?
use replace:
temp_dataframe['PPI'].replace('PPI/', '', regex=True, inplace=True)
or str.replace (which is not in-place, so assign the result back):
temp_dataframe['PPI'] = temp_dataframe['PPI'].str.replace('PPI/', '')
use vectorised str.lstrip:
temp_dataframe['PPI'] = temp_dataframe['PPI'].str.lstrip('PPI/')
it looks like you may have missing values so you should mask those out or replace them:
temp_dataframe['PPI'].fillna('', inplace=True)
or
temp_dataframe.loc[temp_dataframe['PPI'].notnull(), 'PPI'] = temp_dataframe['PPI'].str.lstrip('PPI/')
maybe a better method is to filter using str.startswith, then use split and access the string after the prefix you want to remove:
temp_dataframe.loc[temp_dataframe['PPI'].str.startswith('PPI/'), 'PPI'] = temp_dataframe['PPI'].str.split('PPI/').str[1]
As @JonClements pointed out, lstrip strips any leading characters that appear in the given set, rather than removing the exact prefix, which is what you're after.
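A minimal sketch of that pitfall with a made-up value:
print('PPI/IPP123'.lstrip('PPI/'))
# '123': every leading character in the set {'P', 'I', '/'} is stripped, not just the 'PPI/' prefix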
update
Another method is to pass a regex pattern that looks for the optional prefix and extracts all characters after it:
temp_dataframe['PPI'].str.extract('(?:PPI/)?(.*)', expand=False)
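A self-contained sketch with made-up values, including a missing one:
import pandas as pd

temp_dataframe = pd.DataFrame({'PPI': ['PPI/1234', '5678', None]})
temp_dataframe['PPI'] = temp_dataframe['PPI'].str.extract(r'(?:PPI/)?(.*)', expand=False)
print(temp_dataframe['PPI'].tolist())
# ['1234', '5678', nan]; missing values pass through as NaN instead of raising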