How do I go about removing an empty string or at least having regex ignore it?
I have some data that looks like this
EIV (5.11 gCO₂/t·nm)
I'm trying to extract the numbers only. I have done the following:
df['new column'] = df['column containing that value'].str.extract(r'((\d+.\d*)|(\d+)|(\.\d+)|(\d+[eE][+]?\d*)?)').astype('float')
since the numbers can be floats or integers, and I think there's one in exponent notation, 4E+1.
However, when I run it I get the error in the title, which I presume is caused by an empty string.
What am I missing here to allow the code to run?
Try this
import re
c = "EIV (5.11 gCO₂/t·nm)"
x = re.findall(r"[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?", c)
print(x)
Will give
['5.11']
The problem is not only the number of groups, but also the fact that the last alternative in your regex is optional (see the ? added right after it, and your regex demo). Since Series.str.extract returns the first match, your regex matches and returns the empty string at the start of the string whenever the number is not at the string start position.
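To see the empty match concretely, here is a minimal sketch using the standard re module instead of pandas (an assumption made to keep it self-contained). The name fixed_pattern and the choice to simply drop the trailing ? are illustrative only, not the recommended final pattern.

```python
import re

text = "EIV (5.11 gCO₂/t·nm)"

# The pattern from the question: the last alternative is optional,
# so the whole pattern can match the empty string.
original_pattern = r'((\d+.\d*)|(\d+)|(\.\d+)|(\d+[eE][+]?\d*)?)'
first = re.search(original_pattern, text)
print(repr(first.group(0)), first.span())  # empty match at position 0

# Converting that empty string is what raises the error.
try:
    float(first.group(0))
except ValueError as exc:
    print("ValueError:", exc)

# Illustrative fix: drop the trailing ? so an empty match is impossible.
# (The unescaped . is kept from the question and still matches any character.)
fixed_pattern = r'((\d+.\d*)|(\d+)|(\.\d+)|(\d+[eE][+]?\d*))'
second = re.search(fixed_pattern, text)
print(second.group(0))
```

Once no alternative can match the empty string, the first match is the number itself.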
It is best to use the well-known single alternative patterns to match any numbers with a single capturing group, e.g.
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
See Example Regexes to Match Common Programming Language Constructs.
Pandas test:
import pandas as pd
df = pd.DataFrame({'col':['EIV (5.11 gCO₂/t·nm)', 'EIV (5.11E+12 gCO₂/t·nm)']})
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
# => 0
# 0 5.110000e+00
# 1 5.110000e+12
There are also quite a lot of other such regex variations at Parsing scientific notation sensibly?, and you may also use r"([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", r"(-?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)", r"([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)", etc.
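These variations do not behave identically on every input. The sketch below runs the three patterns quoted above against a few sample strings with the standard re module. The labels A, B and C are arbitrary, and the 4E+1 and .5 samples are made-up strings in the same format as the question's data.

```python
import re

# The three alternative patterns quoted above (outer capturing group dropped).
patterns = {
    "A": r"[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?",
    "B": r"-?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?",
    "C": r"[+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?",
}

samples = [
    "EIV (5.11 gCO₂/t·nm)",
    "EIV (4E+1 gCO₂/t·nm)",
    "EIV (.5 gCO₂/t·nm)",
]

results = {}
for name, pattern in patterns.items():
    # Every sample contains a digit, so re.search always finds a match here.
    results[name] = [re.search(pattern, s).group(0) for s in samples]
    print(name, results[name])
```

Only pattern A keeps the leading dot in .5; B and C require a digit before the decimal point and return 5 for that sample.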
If your column consists of data in the same format as you have posted (EIV (5.11 gCO₂/t·nm)), then this will work:
import pandas as pd
df['new_exctracted_column'] = df['column containing that value'].str.extract(r'(\d+(?:\.\d+)?)')
df
5.11
Related
I aim to write a function to apply to an entire dataframe. Each column is checked to see if it contains the currency symbol '$' and remove it.
Surprisingly, a case like:
import pandas as pd
dates = pd.date_range(start='2021-01-01', end='2021-01-10').strftime('%d-%m-%Y')
print(dates)
output:
Index(['01-01-2021', '02-01-2021', '03-01-2021', '04-01-2021', '05-01-2021', '06-01-2021', '07-01-2021', '08-01-2021', '09-01-2021', '10-01-2021'], dtype='object')
But when I do:
dates.str.contains('$').all()
It returns True. Why???
.str.contains treats the pattern as a regex by default, not as a literal string. In a regex, $ matches the end of the string, and every string has an end, so it matches everything. To check for the literal symbol "$" you need to escape it:
dates.str.contains(r'\$').all()
Or you can pass the regex=False argument to .contains():
dates.str.contains('$', regex=False).all()
Both options return False.
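The same behaviour can be reproduced without pandas. This is a small sketch with the standard re module; the price string is a hypothetical sample added for contrast.

```python
import re

date = "01-01-2021"
price = "$12.50"  # hypothetical sample containing a literal dollar sign

# Unescaped $ is an anchor: it matches the empty string at the end of any string.
print(bool(re.search("$", date)))             # True

# Escaped, it only matches a literal dollar sign.
print(bool(re.search(r"\$", date)))           # False
print(bool(re.search(r"\$", price)))          # True

# re.escape builds the escaped pattern for you.
print(bool(re.search(re.escape("$"), date)))  # False

# A plain substring test avoids regex entirely (the analogue of regex=False).
print("$" in date)                            # False
```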
I just started with Python and need the following. I have this string:
1184-7380501-2023-183229
What I need is to trim this string and keep only the first three characters after the first hyphen. It should look like this:
1184-738
How can I do this?
s = "1184-7380501-2023-183229"
print(s[:8])
Or perhaps
import re
pattern = re.compile(r'^\d+-...')
m = pattern.search(s)
print(m[0])
which accommodates variable length numeric prefixes.
You could (you can do this a lot of different ways) use partition() and join()...
"".join([token[:3] if idx == 2 else token for idx, token in enumerate("1184-7380501-2023-183229".partition("-"))])
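A more readable variant of the partition() idea is to unpack the three parts directly. This is only a sketch; the function name keep_after_first_hyphen and its keep parameter are made up for illustration.

```python
def keep_after_first_hyphen(s, keep=3):
    """Keep everything up to the first hyphen plus `keep` characters after it."""
    head, sep, tail = s.partition("-")
    # If there is no hyphen, sep and tail are empty and s is returned unchanged.
    return head + sep + tail[:keep]

print(keep_after_first_hyphen("1184-7380501-2023-183229"))  # 1184-738
print(keep_after_first_hyphen("98-7380501-2023"))           # 98-738
```

Like the regex version, this does not depend on the length of the numeric prefix.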
I am new to Python and have been using regex for matching. I have a string str = "vans>=20.09 and stands == 'four'" and I want the values after the comparison operators. The pattern I wrote extracts integers and strings fine, but it does not extract float values. What is the best pattern so that the regex extracts all kinds of values (int, float, strings)?
My code:
import re
str = "vans>=20.09 and stands == 'four'"
rx = re.compile(r"""(?P<key>\w+)\s*[<>=]+\s*'?(?P<value>\w+)'?""")
result = {m.group('key'): m.group('value') for m in rx.finditer(str)}
which gives:
{'vans': '20', 'stands': 'four'}
Expected Output:
{'vans': '20.09', 'stands': 'four'}
You can extend the second \w group with an \. to include dots.
rx = re.compile(r"""(?P<key>\w+)\s*[<>=]+\s*'?(?P<value>[\w\.]+)'?""")
This should work, but note that strings like 12.34.56 will also be matched as a value.
There is also a problem in identifying the comparison operators. The following covers the two-character operators (==, !=, <=, >=); a bare < or > will not match. One more caveat: for numbers with no digit following the decimal point, only the part before the decimal point will be selected.
rx = re.compile(r"""(?P<key>\w+)\s*[<>!=]=\s*'?(?P<value>(\w|[+-]?\d*(\.\d)?\d*)+)'?""")
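As a possible alternative, the value can be split into two explicit cases, a quoted string or a number, which also makes it easy to accept a bare < or >. This is a sketch under assumptions: the group names op, text and num and the extra axles<3 clause are made up for illustration, and values are assumed to be either single-quoted strings or plain ints/floats.

```python
import re

rx = re.compile(
    r"""(?P<key>\w+)\s*
        (?P<op>[<>!=]=|[<>])\s*          # two-character operators first, then bare < or >
        (?:'(?P<text>[^']*)'             # a single-quoted string ...
          |(?P<num>[+-]?\d+(?:\.\d+)?))  # ... or an int/float
    """,
    re.VERBOSE,
)

expr = "vans>=20.09 and stands == 'four' and axles<3"

result = {}
for m in rx.finditer(expr):
    # Exactly one of the two value groups takes part in each match.
    value = m.group("text") if m.group("text") is not None else m.group("num")
    result[m.group("key")] = value

print(result)  # {'vans': '20.09', 'stands': 'four', 'axles': '3'}
```

Keeping the two value types in separate groups means a value like 12.34.56 is no longer swallowed whole; only 12.34 would be captured.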
I am using Python and pandas and have a DataFrame column that contains a string. I want to keep the float number within the string and get rid of the '- .' at the end of the float (string).
So far I have been using a regular expression below to get rid of characters and brackets from the original string but it leaves '-' and '.' from the non-numeric part of the string in place.
Example input string :
14,513.045Non-compliant with installation req.
When I try to modify it this is what I get:
14,513.045- . (example of positive number string)
I also want to be able to parse negative numbers, such as:
-234.670
The first - in the string is for negative float number. I would like to keep the first - and first . but get rid of the subsequent ones - the ones which do not belong to the number.
This is the code that I tried to use to achieve that:
dataframe3['single_chainage2'] = dataframe3['single_chainage'].str.replace(r"[a-zA-Z*()]",'')
But it leaves me with 14,513.045- .
I saw no way of doing the above using pandas alone and saw that regex was the recommended way.
You don't need to replace; I think you can use Series.str.extract instead to get the string you need.
In [1]: import pandas as pd
In [2]: ser = pd.Series(["14,513.045Non-compliant with installation req.", "14,513.045- .", "-234.670"])
In [3]: pat = r'^(?P<num>-?(\d+,)*\d+(\.\d+)?)'
In [5]: ser.str.extract(pat)['num']
Out[5]:
0 14,513.045
1 14,513.045
2 -234.670
Name: num, dtype: object
The named group in the regex pattern (num in this example) gives the extracted column its name, which is why it can be selected with ['num'].
If you need to convert it to a numeric dtype:
In [7]: ser.str.extract(pat)['num'].str.replace(',', '').astype(float)
Out[7]:
0 14513.045
1 14513.045
2 -234.670
Name: num, dtype: float64
Rather than removing the characters that you don't want, just specify a pattern that you want to find and extract it. It should be much less error prone.
You want to extract a positive or negative number that can be floating point:
import re
number_match = re.search(r"[+-]?(\d+,?)*(\.\d+)?", 'Your string.')
number = number_match.group(0)
Testing the code above:
test_string_positive='14,513.045Non-compliant with installation req.'
test_string_negative='-234.670Non-compliant with installation req.'
In [1]: test=re.search(r"[+-]?(\d+,?)*(\.\d+)?",test_string_positive)
In [2]: test.group(0)
Out[2]: '14,513.045'
In [3]: test=re.search(r"[+-]?(\d+,?)*(\.\d+)?",test_string_negative)
In [4]: test.group(0)
Out[4]: '-234.670'
With this solution you don't want to do replace but rather just assign the value of the regex match.
number_match = re.search(r"[+-]?(\d+,?)*(\.\d+)?", <YOUR_STRING>)
number = number_match.group(0)
dataframe3['single_chainage2'] = number
I split that into 3 lines to show you how it logically follows. Hopefully, that makes sense.
You should substitute the value of <YOUR_STRING> with a string representation of data. As for how to get a string value out of a Pandas DataFrame, this question might have some answers to that. I'm not sure about how your DataFrame actually looks but I guess something like df['single_chainage'][0] should work. Basically if you index in Pandas, it returns some Pandas specific info and if you want to retrieve just the string itself you have to specify that explicitly.
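One detail worth knowing: every part of this pattern is optional, so it can match the empty string, and re.search then returns an empty match at position 0 for any string that does not start with a number. The sketch below wraps the pattern in a per-row helper that returns None in that case. The names NUMBER and leading_number and the "n/a" row are made up for illustration; with pandas, such a helper could presumably be passed to Series.apply on the column.

```python
import re

# Same pattern as above, with non-capturing groups.
NUMBER = re.compile(r"[+-]?(?:\d+,?)*(?:\.\d+)?")

def leading_number(text):
    """Return the number at the start of text as a float, or None if there is none."""
    # match() never returns None here because the pattern can match the empty string.
    matched = NUMBER.match(text).group(0)
    if not any(ch.isdigit() for ch in matched):
        return None
    # Drop thousands separators before converting.
    return float(matched.replace(",", ""))

rows = [
    "14,513.045Non-compliant with installation req.",
    "14,513.045- .",
    "-234.670",
    "n/a",
]
values = [leading_number(row) for row in rows]
print(values)  # [14513.045, 14513.045, -234.67, None]
```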
I have a string with a lot of occurrences of a single pattern like
a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
and I have another string like
b = 'rerTTTytu'
I want to substitute the entire second string having as a reference the 'QQQ' and the 'TTT', and I want to find in this case 3 different results:
'ererTTTytuohnQQQjkhjhnmQQQlkj'
'eresQQQutnrerTTTytujhnmQQQlkj'
'eresQQQutnohnQQQjkhjrerTTTytu'
I've tried using re.sub
re.sub('\w{3}QQQ\w{3}' ,b,a)
but I obtain only the first one, and I don't know how to get the other two solutions.
Edit: As you requested, the two characters surrounding 'QQQ' will be replaced as well now.
I don't know if this is the most elegant or simplest solution for the problem, but it works:
import re
# Find all occurrences of ??QQQ?? in a - where ? is any non-whitespace character
matches = [x.start() for x in re.finditer(r'\S{2}QQQ\S{2}', a)]
# Replace each ??QQQ?? with b
results = [a[:idx] + re.sub(r'\S{2}QQQ\S{2}', b, a[idx:], 1) for idx in matches]
print(results)
Output
['errerTTTytunohnQQQjkhjhnmQQQlkj',
'eresQQQutnorerTTTytuhjhnmQQQlkj',
'eresQQQutnohnQQQjkhjhrerTTTytuj']
Since you didn't specify the output format, I just put it in a list.
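A variant that skips the second re.sub call is to slice around each match directly. This is a sketch, not the answer's method: the function name replace_each and its width parameter are made up, and the default of three surrounding word characters is taken from the pattern in the question, which reproduces the three results listed there.

```python
import re

a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
b = 'rerTTTytu'

def replace_each(text, replacement, width=3):
    """Return one string per match, each with only that match replaced."""
    pattern = re.compile(r'\w{%d}QQQ\w{%d}' % (width, width))
    # Slicing with the match span avoids re.sub interpreting backslashes in replacement.
    return [text[:m.start()] + replacement + text[m.end():]
            for m in pattern.finditer(text)]

results = replace_each(a, b)
for r in results:
    print(r)
```

Note that finditer only yields non-overlapping matches, so occurrences closer together than the chosen width would not each get their own result.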