Extracting Numbers from Formatted Strings with Unusual Delimiters in Python - python

How can I get the numbers in a formatted string like the following in Python? It has a mixed combination of delimiters such as tab, parenthesis, cm, space, and #.
I used the following code but it does not split the numbers correctly.
s = "1.0000e+036 (1.2365e-004,6.3265e+003cm) (2.3659e-002, 2.3659e-002#)"
parts = re.split('\s|(?<!\d)[,.](?!\d)', s)
print(parts)
['1.0000e+036', '(1.2365e-004,6.3265e+003cm)', '(2.3659e-002,', '2.3659e-002#)']
I am trying to extract:
[1.0000e+036, 1.2365e-004, 6.3265e+003, 2.3659e-002, 2.3659e-002]
Could someone kindly help?
Update:
I tried the following regular expression, which fails on numbers with a positive exponent:
s = "1.0000e+036 (1.2365e-004,6.3265e+003cm) (2.3659e-002, 2.3659e-002#)"
match_number = re.compile('-?\ *[0-9]+\.?[0-9]*(?:[Ee]\ *-?\ *[0-9]+)?')
final_list = [float(x) for x in re.findall(match_number, s)]
print(final_list)
[1.0, 36.0, 0.00012365, 6.3265, 3.0, 0.023659, 0.023659]
As can be seen, the first number is 1e36 which was parsed as two numbers 1.0 and 36.0.

You don't need to treat those items as delimiters. Rather, all you appear to need is a regex that extracts every float in the line (including exponential / engineering notation) and simply ignores the remaining characters. Comprehensive regular expressions for numbers are readily available online with a simple search.
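As a minimal sketch of that approach, the pattern below is one common way to write a float matcher, not the only one. The name FLOAT_RE is made up for this example. A single findall call pulls out every number and skips the surrounding delimiters:

```python
import re

# Optional sign, then digits with an optional fraction (or a bare fraction),
# then an optional exponent whose own sign may be + or -.
FLOAT_RE = re.compile(r'[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?')

s = "1.0000e+036 (1.2365e-004,6.3265e+003cm) (2.3659e-002, 2.3659e-002#)"
numbers = [float(x) for x in FLOAT_RE.findall(s)]
print(numbers)
# [1e+36, 0.00012365, 6326.5, 0.023659, 0.023659]
```

The attempt in the question split 1.0000e+036 in two because its exponent part only allowed an optional minus sign; allowing [-+]? there is the key difference.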

Related

How to get the float values after particular symbol using regex in python?

I am new to Python and have been using regex for matching. I have a string str = "vans>=20.09 and stands == 'four'" and I want the values after the comparison operators. The pattern I wrote extracts int and string values fine, but it does not extract float values. What is the best pattern so that the regex extracts all kinds of values (int, float, strings)?
My code:
import re
str = "vans>=20.09 and stands == 'four'"
rx = re.compile(r"""(?P<key>\w+)\s*[<>=]+\s*'?(?P<value>\w+)'?""")
result = {m.group('key'): m.group('value') for m in rx.finditer(str)}
which gives:
{'vans': '20', 'stands': 'four'}
Expected Output:
{'vans': '20.09', 'stands': 'four'}
You can extend the second \w group with an \. to include dots.
rx = re.compile(r"""(?P<key>\w+)\s*[<>=]+\s*'?(?P<value>[\w\.]+)'?""")
This should work, but note that strings like 12.34.56 will also be matched as a value.
There is a problem in identifying the comparison operators as well. The following pattern requires a two-character operator (==, !=, <=, >=), so a bare < or > will not match. There is also a caveat: for numbers with no digit following the decimal point, only the part before the decimal point is selected.
rx = re.compile(r"""(?P<key>\w+)\s*[<>!=]=\s*'?(?P<value>(\w|[+-]?\d*(\.\d)?\d*)+)'?""")
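For comparison, here is a hedged sketch of a stricter variant. It assumes values are either plain numbers or single words, optionally quoted, and it also accepts a bare < or >. The variable names are illustrative only; the sample string is the one from the question.

```python
import re

text = "vans>=20.09 and stands == 'four'"

rx = re.compile(
    r"""(?P<key>\w+)\s*
        (?:[<>!=]=|[<>])\s*                    # ==, !=, <=, >= or a bare < or >
        '?(?P<value>[+-]?\d+(?:\.\d+)?|\w+)'?  # try a number first, then a plain word
    """,
    re.VERBOSE,
)

result = {m.group("key"): m.group("value") for m in rx.finditer(text)}
print(result)
# {'vans': '20.09', 'stands': 'four'}
```

Putting the numeric alternative before \w+ matters: \w+ alone would stop at the decimal point and return '20'.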

Not getting intended results with "either/or" character in python regex

I'm trying to match some fairly simple text but am having trouble with the "|" character. The text is:
"TF0876 some text Y N 2.31 - 0.01\n TF9788 more text N Y - 2.3 -\n TF1626"
and I want to extract two items using re.findall:
"TF0876 some text for Y N 2.31" and
"TF9788 more text N Y -"
The code I thought would work is:
mat = re.compile(r"TF\d{4}.*?[Y|N] [Y|N] [-|\d\.\d*]",flags=re.DOTALL)
test2 = re.findall(mat,text)
print(test2)
However, this gives me the following list:
['TF0876 some text for Y N 2', 'TF9788 more text N Y -']
For some reason, the first match stops at the "2" rather than the "2.31" that I want. If I type 2.31 literally instead of \d\.\d*, it still matches only up to the "2". In fact, whatever I type, I only seem to get one character from either side of the "|". I don't understand this; the regex HOWTO says that the expression Crow|Servo will match "Crow" or "Servo", but nothing smaller (such as "Cro"). In my case the opposite seems to be happening, so I clearly don't understand something and would be grateful for help.
Thanks.
The problem lies in your compiled pattern; try changing it to:
mat = re.compile(r"TF\d{4}.*?[YN] [YN] [-\d\.]*",flags=re.DOTALL)
You do not need the "|" inside "[]". Square brackets already define a character class, a set of single characters of which any one may match, so a "|" inside them is just a literal pipe.
A second option is to use groups, with "()" in place of "[]", depending on what you want to match exactly. Both will work on your example text.
The problem is that you are using brackets [] instead of parentheses () to separate subgroups. Try this:
import re
text = "TF0876 some text Y N 2.31 - 0.01\n TF9788 more text N Y - 2.3 -\n TF1626"
mat = re.compile(r"TF\d{4}.*?(?:Y|N) (?:Y|N) (?:-|\d\.\d*)",flags=re.DOTALL)
test2 = re.findall(mat, text)
print(test2)
# ['TF0876 some text Y N 2.31', 'TF9788 more text N Y -']
Here the ?: bits are just so subgroups are not captured. Note that (?:Y|N) is basically the same as simply [YN].
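The difference between a character class and a group can be checked directly. This is a small illustrative sketch; the variable names are made up for the example:

```python
import re

# Inside [...], "|" is an ordinary character, so [Y|N] matches "Y", "|" or "N".
in_class = re.findall(r"[Y|N]", "Y|N")
print(in_class)        # ['Y', '|', 'N']

# A character class always consumes exactly one character per match.
one_char = re.findall(r"[-|\d\.\d*]", "2.31")
print(one_char)        # ['2', '.', '3', '1']

# Alternation between whole sub-patterns needs a (non-capturing) group.
alternation = re.findall(r"(?:-|\d\.\d*)", "2.31 -")
print(alternation)     # ['2.31', '-']
```

This is why the original pattern stopped after "2": the final [...] could only ever match one character.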

Parsing String by regular expression in python

How can I parse this string in python?
Input String:
someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data
to this
Output array:
['someplace','2018:6:18:0','25.0114','95.2818','2.71164','66.8962','Entire grid contents are set to missing data']
I have already tried with split(' ') but as it is not clear how many spaces are between the sub-strings and inside the last sub-string there may be spaces so this doesn't work.
I need the regular expression.
If you do not provide a separator, Python's str.split(sep=None, maxsplit=-1) treats consecutive whitespace as a single separator and splits on it. You can limit the number of splits by providing a maxsplit value:
data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
spl = data.split(None, 6) # no separator given; do at most 6 splits
print(spl)
Output:
['someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164',
'66.8962', 'Entire grid contents are set to missing data']
This works as long as the first text field does not contain any whitespace.
If the first text field may contain whitespace, you can use or refine this regex solution:
import re
reg = re.findall(r"([^\d]+?) +?([\d:]+) +?([\d.]+) +?([\d.]+) +?([\d.]+) +?([\d.]+) +(.*)$",data)[0]
print(reg)
Output:
('someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164', '66.8962', 'Entire grid contents are set to missing data')
Use a tool such as https://regex101.com to check the regex against your other data.
[A-Z]{1}[a-zA-Z ]{15,45}|[\w|:|.]+
You can test it here https://pythex.org/
Modify 15,45 according to your needs.
Maxsplit works with re.split(), too:
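import re
text = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
re.split(r"\s+", text, maxsplit=6)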
Out:
['someplace',
'2018:6:18:0',
'25.0114',
'95.2818',
'2.71164',
'66.8962',
'Entire grid contents are set to missing data']
EDIT:
If the first and last text parts don't contain digits, we don't need maxsplit and do not have to rely on number of parts with consecutive spaces:
re.split(r"\s+(?=\d)|(?<=\d)\s+", s) # s is the input line
We cut the string where a space is followed by a digit or vice versa using lookahead and lookbehind.
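A short sketch of the lookaround split on the sample line from the question; the names line and parts are chosen for this example. It relies on the stated assumption that neither free-text field contains a digit next to whitespace:

```python
import re

line = ("someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 "
        "Entire grid contents are set to missing data")

# Split on whitespace that is followed by a digit, or preceded by one.
# Whitespace between two words has a digit on neither side and is kept.
parts = re.split(r"\s+(?=\d)|(?<=\d)\s+", line)
print(len(parts))   # 7
print(parts[-1])    # Entire grid contents are set to missing data
```

If the trailing text could itself contain numbers (for example "set to 0 missing"), this split would cut it into extra pieces, so maxsplit is the safer choice in that case.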
It is hard to answer your question as the requirements are not very precise. I think I would split the line with the split() function and then join adjacent items that contain no digits. Here is a snippet that works with your single sample:
def containsNumbers(s):
    return any(c.isdigit() for c in s)

data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
lst = data.split()
lst2 = []
i = 0
agg = ''
while i < len(lst):
    if containsNumbers(lst[i]):
        if agg != '':
            lst2.append(agg)
            agg = ''
        lst2.append(lst[i])
    else:
        agg += ' '+lst[i]
        agg = agg.strip()
        if i == len(lst) - 1:
            lst2.append(agg)
    i += 1
print(lst2)

Suggestion for python regex and selecting columns [duplicate]

In a file with 3, 4 or X columns separated by a variable number of spaces on each line, how can I select the first 2 columns of each row with a regex?
My files consist of : IP [SPACES] Subnet_Mask [SPACES] NEXT_HOP_IP [NEW LINE]
All rows use that format. How can I extract only the first 2 columns? (IP & Subnet mask)
Here is an example on which to try your regex:
10.97.96.0 10.97.97.128 47.73.1.0
47.73.4.128 47.73.7.6 47.73.8.0
47.73.15.0 47.73.40.0 47.73.41.0
85.205.9.164 85.205.14.44 172.17.103.0
172.17.103.8 172.17.103.48 172.17.103.56
172.17.103.96 172.17.103.100 172.17.103.136
172.17.103.140 172.17.104.44 172.17.105.28
172.17.105.32 172.17.105.220 172.17.105.224
Don't mind the specific IPs. I know the second column is not formed of valid address masks. It's just an example.
I already tried:
(?P<IP_ADD>\s*[1-9][0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})(?P<space>\s*)(?P<MASK>[1-9][0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}(\s+|\D*))
But it doesn't quite work...
With a regular expression:
If you want the first 2 columns of each row, whatever they contain and however much space separates them, you can use \S (matches anything but whitespace) together with [ \t]+ (spaces and tabs only). Anchoring with ^ under re.MULTILINE, and avoiding \s between the columns, keeps a match from running across a newline into the next row:
import re
lines = """
10.97.96.0 10.97.97.128 47.73.1.0
47.73.4.128 47.73.7.6 47.73.8.0
47.73.15.0 47.73.40.0 47.73.41.0
85.205.9.164 85.205.14.44 172.17.103.0
172.17.103.8 172.17.103.48 172.17.103.56
172.17.103.96 172.17.103.100 172.17.103.136
172.17.103.140 172.17.104.44 172.17.105.28
172.17.105.32 172.17.105.220 172.17.105.224
"""
regex = re.compile(r'^[ \t]*(\S+)[ \t]+(\S+)', re.MULTILINE)
regex.findall(lines)
Result:
[('10.97.96.0', '10.97.97.128'),
('47.73.4.128', '47.73.7.6'),
('47.73.15.0', '47.73.40.0'),
('85.205.9.164', '85.205.14.44'),
('172.17.103.8', '172.17.103.48'),
('172.17.103.96', '172.17.103.100'),
('172.17.103.140', '172.17.104.44'),
('172.17.105.32', '172.17.105.220')]
Without a regular expression
If you don't want to use a regex but still need to handle multiple spaces, you could also do:
while '  ' in lines: # notice the two-space string
    lines = lines.replace('  ', ' ')
columns = [line.split(' ')[:2] for line in lines.split('\n') if line]
Pros and cons:
The advantage of using a regex is that it also parses the data properly if the separators include tabs, which is not the case with the second solution.
On the other hand, regular expressions require more computation than simple string splitting, which could make a difference on very large data sets.
One liner it is:
[s.split()[:2] for s in string.split('\n')]
Example
string = """10.97.96.0 10.97.97.128 47.73.1.0
47.73.4.128 47.73.7.6 47.73.8.0
47.73.15.0 47.73.40.0 47.73.41.0
85.205.9.164 85.205.14.44 172.17.103.0
172.17.103.8 172.17.103.48 172.17.103.56
172.17.103.96 172.17.103.100 172.17.103.136
172.17.103.140 172.17.104.44 172.17.105.28
172.17.105.32 172.17.105.220 172.17.105.224"""
print([s.split()[:2] for s in string.split('\n')])
Outputs
[['10.97.96.0', '10.97.97.128'],
['47.73.4.128', '47.73.7.6'],
['47.73.15.0', '47.73.40.0'],
['85.205.9.164', '85.205.14.44'],
['172.17.103.8', '172.17.103.48'],
['172.17.103.96', '172.17.103.100'],
['172.17.103.140', '172.17.104.44'],
['172.17.105.32', '172.17.105.220']]
Since you need "some sort of one-liner", there are many ways that do not involve Python.
Maybe:
| awk '{print $1,$2}'
with anything that produces your input on stdout.
Edited to perform space match with any number of spaces.
You can accomplish this with Python regular expressions like this, as an option, if you know you want the first 2 space-separated values.
A good regex cheat sheet will also help you find shortcuts: token classes such as word characters, whitespace, and digits have shorthand escapes (\w, \s, \d).
import re
line = "10.97.96.0 10.97.97.128 47.73.1.0"
result = re.split(r"\s+", line)[0:2]
result
['10.97.96.0', '10.97.97.128']

splitting string in Python (2.7)

I have a string such as the one below:
26 (passengers:22 crew:4)
or
32 (passengers:? crew: ?)
What I'm looking to do is split up the string so that just the numbers representing the number of passengers and crew are extracted. If it's a question mark, I'd like it to be replaced by a "".
I'm aware I can use string.replace("?", "") to replace the ?, but how do I go about extracting the numeric characters for crew or passengers respectively? The numbers may vary from two digits to three, so I can't slice the last few characters off the string or slice at a specific interval.
Thanks in advance
A regular expression to match those would be:
r'\(\s*passengers:\s*(\d{1,3}|\?)\s+crew:\s*(\d{1,3}|\?)\s*\)'
with some extra whitespace tolerance thrown in.
Results:
>>> import re
>>> numbers = re.compile(r'\(\s*passengers:\s*(\d{1,3}|\?)\s+crew:\s*(\d{1,3}|\?)\s*\)')
>>> numbers.search('26 (passengers:22 crew:4)').groups()
('22', '4')
>>> numbers.search('32 (passengers:? crew: ?)').groups()
('?', '?')
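To also turn a question mark into an empty string, as the question asks, the captured groups can be post-processed. This is a minimal sketch; the names COUNTS_RE and parse_counts are made up for illustration, and the pattern allows a single space before "crew:".

```python
import re

COUNTS_RE = re.compile(
    r'\(\s*passengers:\s*(\d{1,3}|\?)\s+crew:\s*(\d{1,3}|\?)\s*\)'
)

def parse_counts(text):
    """Return (passengers, crew) as strings, with '?' replaced by ''."""
    match = COUNTS_RE.search(text)
    if match is None:
        return None  # the line does not have the expected shape
    return tuple("" if value == "?" else value for value in match.groups())

print(parse_counts('26 (passengers:22 crew:4)'))   # ('22', '4')
print(parse_counts('32 (passengers:? crew: ?)'))   # ('', '')
```

Returning None for a line that does not match avoids the AttributeError that calling .groups() on a failed search would raise.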
