Strange pandas behaviour: character is found where it does not exist - Python

I aim to write a function to apply to an entire dataframe: each column is checked for the currency symbol '$', which should be removed wherever it occurs.
Surprisingly, a case like:
import pandas as pd
dates = pd.date_range(start='2021-01-01', end='2021-01-10').strftime('%d-%m-%Y')
print(dates)
output:
Index(['01-01-2021', '02-01-2021', '03-01-2021', '04-01-2021', '05-01-2021', '06-01-2021', '07-01-2021', '08-01-2021', '09-01-2021', '10-01-2021'], dtype='object')
But when I do:
dates.str.contains('$').all()
It returns True. Why???

.contains uses regex by default, not a raw string, and in regex $ means the end of the line (intuitively or not, every string has an end, so every row matches). To check for the literal symbol '$' you need to escape it:
dates.str.contains(r'\$').all()
Or you can use regex=False argument of the .contains():
dates.str.contains('$', regex=False).all()
Both options return False.
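For the broader goal in the question (strip '$' from every column of a dataframe), here is a minimal sketch on made-up data, using regex=False so no escaping is needed:
import pandas as pd

df = pd.DataFrame({'price': ['$10', '$2.50'], 'qty': [1, 2]})

def strip_dollar(col):
    # only object (string) columns have usable .str methods
    if col.dtype == object:
        return col.str.replace('$', '', regex=False)
    return col

df = df.apply(strip_dollar)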


Splitting latlong into lat and long

I'm trying to remove the square brackets from a list of latlongs called latlng, which is held in a dataframe (df) in the format shown below:
latlng
[-1.4253128, 52.9015902]
I've got this but it returns NaN:
df['latitude'] = df['latlng'].str.replace('[','')
and produces this warning:
<ipython-input-59-34b588ef7f4b>:1: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
df['latitude'] = df['latlng'].str.replace('[','')
It seems that if I write the file to a csv and then use that file, the above works. But ideally, I want it all in one script.
Any help is appreciated!
I think you meant to write df['latitude'] = str(df['latlng']).replace('[','').
Yet, this only removes the opening bracket and will not give you lat or long on their own.
You might want to use regex to split the string, something on the lines such as
import re
splitted = re.match(r"\[(-?\d+\.\d+),\s*(-?\d+\.\d+)\]", "[-1.4253128, 52.9015902]")
print(splitted[1])
print(splitted[2])
Use indices:
df['latitude'] = df['latlng'].str[0]
df['longitude'] = df['latlng'].str[1]
Even though the values are objects (lists) rather than strings, the .str[] accessor still lets you index into them.
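If latlng is instead stored as a plain string like '[-1.4253128, 52.9015902]', here is a minimal sketch (column names assumed from the question) that pulls both numbers out in one pass with str.extract:
import pandas as pd

df = pd.DataFrame({'latlng': ['[-1.4253128, 52.9015902]']})

# two capture groups -> a two-column DataFrame (columns 0 and 1)
coords = df['latlng'].str.extract(r'\[(-?\d+\.\d+),\s*(-?\d+\.\d+)\]')
df['latitude'] = coords[0].astype(float)
df['longitude'] = coords[1].astype(float)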

ValueError: could not convert string to float: " " (empty string?)

How do I go about removing an empty string or at least having regex ignore it?
I have some data that looks like this
EIV (5.11 gCO₂/t·nm)
I'm trying to extract the numbers only. I have done the following:
df['new column'] = df['column containing that value'].str.extract(r'((\d+.\d*)|(\d+)|(\.\d+)|(\d+[eE][+]?\d*)?)').astype('float')
since the numbers can be floats or integers, and I think there's one exponent, 4E+1.
However when I run it I then get the error as in title which I presume is an empty string.
What am I missing here to allow the code to run?
Try this
import re
c = "EIV (5.11 gCO₂/t·nm)"
x = re.findall(r"[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?", c)
print(x)
Will give
['5.11']
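Applied to the whole column, the vectorized equivalent is Series.str.findall; a sketch, with the column names assumed from the question:
import pandas as pd

df = pd.DataFrame({'column containing that value': ['EIV (5.11 gCO₂/t·nm)']})

# str.findall returns a list of matches per row; .str[0] takes the first
df['new column'] = df['column containing that value'].str.findall(
    r"[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?"
).str[0].astype(float)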
The problem is not only the number of groups, but the fact that the last alternative in your regex is optional (note the ? right after it). Since Series.str.extract returns the first match, and a fully optional pattern also matches an empty string, your regex matches and returns the empty string at the very start of the input whenever the number is not at the start position.
It is best to use the well-known single alternative patterns to match any numbers with a single capturing group, e.g.
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
See Example Regexes to Match Common Programming Language Constructs.
Pandas test:
import pandas as pd
df = pd.DataFrame({'col':['EIV (5.11 gCO₂/t·nm)', 'EIV (5.11E+12 gCO₂/t·nm)']})
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
# =>           0
# 0  5.110000e+00
# 1  5.110000e+12
There are also quite a lot of other such regex variations at Parsing scientific notation sensibly?, and you may also use r"([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", r"(-?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)", r"([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)", etc.
If your column consists of data in the same format as you posted (EIV (5.11 gCO₂/t·nm)), then this will work:
import pandas as pd
df['new_extracted_column'] = df['column containing that value'].str.extract(r'(\d+(?:\.\d+)?)')
For the sample string this extracts 5.11.
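One caveat, sketched on a made-up column: if a pattern has optional groups that can capture an empty string (as the one in the question does), astype(float) raises the very ValueError from the title; pd.to_numeric with errors='coerce' degrades such captures to NaN instead:
import pandas as pd

df = pd.DataFrame({'col': ['EIV (5.11 gCO₂/t·nm)', 'no number here']})

# expand=False returns a Series; rows without a match become NaN
extracted = df['col'].str.extract(r'(\d+(?:\.\d+)?)', expand=False)

# errors='coerce' turns empty or unmatched captures into NaN
# instead of raising ValueError like astype(float) would on ''
df['value'] = pd.to_numeric(extracted, errors='coerce')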

Pandas replace using regex

I have a column that has null/missing values written as strings such as 'There is no classification', 'unkown: there is no accurate classification', and other variants. I would like to replace all of these values with None.
I have tried this but it isn't working:
df['Fourth level classification'] = df['Fourth level classification'].replace(
to_replace=r'.*[Tt]here is no .*', value=None, regex=True
)
Furthermore, how can I make the entire to_replace string case-insensitive, so that it would also match 'tHere is NO cLaSsification', etc.?
You can try this:
import numpy as np

df['Fourth level classification'] = df['Fourth level classification'].replace(
    r'(?i).*there is no.*', np.nan, regex=True
)
The inline (?i) flag makes the match case-insensitive, so there is no need to lowercase first. Passing value=None is what made your version fail: pandas then treats replace as a fill operation rather than inserting missing values, so pass np.nan explicitly.
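A quick check of the fixed replace on made-up data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Fourth level classification': [
    'There is no classification',
    'tHere is NO cLaSsification',
    'Class A',
]})

df['Fourth level classification'] = df['Fourth level classification'].replace(
    r'(?i).*there is no.*', np.nan, regex=True
)
print(df)  # the first two rows become NaN; 'Class A' is untouched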

Extracting numerical value with special characters from a string but removing other occurrences of those characters

I am using Python and pandas and have a DataFrame column
that contains a string. I want to keep the float number within the string and get rid of '- .' at the end of the float (string).
So far I have been using a regular expression below to get rid of characters and brackets from the original string but it leaves '-' and '.' from the non-numeric part of the string in place.
Example input string :
14,513.045Non-compliant with installation req.
When I try to modify it this is what I get:
14,513.045- . (example of positive number string)
I also want to be able to parse negative numbers, such as:
-234.670
The first - in the string is for negative float number. I would like to keep the first - and first . but get rid of the subsequent ones - the ones which do not belong to the number.
This is the code that I tried to use to achieve that:
dataframe3['single_chainage2'] = dataframe3['single_chainage'].str.replace(r"[a-zA-Z*()]",'')
But it leaves me with 14,513.045- .
I saw no way of doing the above using pandas alone and saw that regex was the recommended way.
You don't need replace; you can use Series.str.extract instead to get the string you need.
In [1]: import pandas as pd
In [2]: ser = pd.Series(["14,513.045Non-compliant with installation req.", "14,513.045- .", "-234.670"])
In [3]: pat = r'^(?P<num>-?(\d+,)*\d+(\.\d+)?)'
In [5]: ser.str.extract(pat)['num']
Out[5]:
0    14,513.045
1    14,513.045
2      -234.670
Name: num, dtype: object
A named group (num in this example) lets you select the extracted column by name.
And if you need to convert it to a numeric dtype:
In [7]: ser.str.extract(pat)['num'].str.replace(',', '').astype(float)
Out[7]:
0    14513.045
1    14513.045
2     -234.670
Name: num, dtype: float64
Rather than removing the characters that you don't want, just specify a pattern for what you do want and extract it. That should be much less error-prone.
You want to extract a number, positive or negative, possibly floating point:
import re
number_match = re.search(r"[+-]?(\d+,?)*(\.\d+)?", 'Your string.')
number = number_match.group(0)
Testing the code above:
test_string_positive='14,513.045Non-compliant with installation req.'
test_string_negative='-234.670Non-compliant with installation req.'
In [1]: test = re.search(r"[+-]?(\d+,?)*(\.\d+)?", test_string_positive)
In [2]: test.group(0)
Out[2]: '14,513.045'
In [3]: test = re.search(r"[+-]?(\d+,?)*(\.\d+)?", test_string_negative)
In [4]: test.group(0)
Out[4]: '-234.670'
With this solution you don't want to do replace but rather just assign the value of the regex match.
number_match = re.search(r"[+-]?(\d+,?)*(\.\d+)?", <YOUR_STRING>)
number = number_match.group(0)
dataframe3['single_chainage2'] = number
I split that into 3 lines to show how it logically follows. Hopefully, that makes sense.
You should substitute <YOUR_STRING> with a string representation of your data. As for how to get a string value out of a Pandas DataFrame, this question might have some answers to that. I'm not sure how your DataFrame actually looks, but something like df['single_chainage'][0] should work; indexing in Pandas returns pandas-specific objects, so to get the raw string you have to retrieve it explicitly. For a per-row version, see the sketch below.
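The snippet above assigns a single scalar to the whole column; here is a minimal sketch of a per-row version using Series.apply (the data and column names are made up to mirror the question):
import re
import pandas as pd

dataframe3 = pd.DataFrame({'single_chainage': [
    '14,513.045Non-compliant with installation req.',
    '-234.670',
]})

pattern = re.compile(r"[+-]?(\d+,?)*(\.\d+)?")

def first_number(text):
    # search returns the leftmost match; with this all-optional pattern
    # the match can be empty, so fall back to None for rows with no digits
    match = pattern.search(text)
    return match.group(0) or None

dataframe3['single_chainage2'] = dataframe3['single_chainage'].apply(first_number)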

Removing characters from a string in pandas

I have a similar question to this one: Pandas DataFrame: remove unwanted parts from strings in a column.
So I used:
temp_dataframe['PPI'] = temp_dataframe['PPI'].map(lambda x: x.lstrip('PPI/'))
Most of the items start with 'PPI/' but not all. It seems that when an item without the 'PPI/' prefix is encountered, I get this error:
AttributeError: 'float' object has no attribute 'lstrip'
Am I missing something here?
use replace:
temp_dataframe['PPI'].replace('PPI/', '', regex=True, inplace=True)
or string.replace:
temp_dataframe['PPI'].str.replace('PPI/','')
use vectorised str.lstrip:
temp_dataframe['PPI'] = temp_dataframe['PPI'].str.lstrip('PPI/')
it looks like you may have missing values so you should mask those out or replace them:
temp_dataframe['PPI'].fillna('', inplace=True)
or
temp_dataframe.loc[temp_dataframe['PPI'].notnull(), 'PPI'] = temp_dataframe['PPI'].str.lstrip('PPI/')
maybe a better method is to filter using str.startswith and use split and access the string after the prefix you want to remove:
temp_dataframe.loc[temp_dataframe['PPI'].str.startswith('PPI/'), 'PPI'] = temp_dataframe['PPI'].str.split('PPI/').str[1]
As @JonClements pointed out, lstrip strips any of the given characters from the left rather than removing the exact prefix, which is what you're after.
update
Another method is to pass a regex pattern that looks for the optional prefix and extracts all characters after it:
temp_dataframe['PPI'].str.extract('(?:PPI/)?(.*)', expand=False)
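On pandas 1.4+ there is also Series.str.removeprefix, which strips the exact leading string only where it is present; a minimal sketch on made-up data:
import pandas as pd

temp_dataframe = pd.DataFrame({'PPI': ['PPI/12345', '67890', None]})

# rows without the prefix (and missing values) pass through untouched
temp_dataframe['PPI'] = temp_dataframe['PPI'].str.removeprefix('PPI/')
print(temp_dataframe)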
