Python regex tokenizer for simple expression - python

I wrote this regex that splits the expression 'res=3+x_sum*11' into lexemes:
import re
print(re.findall(r'(\w+)(=)(\d+)(\*|\+)(\w+)(\*|\+)(\d+)', 'res=3+x_sum*11'))
My output looks like this:
[('res', '=', '3', '+', 'x_sum', '*', '11')]
But I want re.findall to return a list of the lexemes and their tokens, so that each lexeme is in its own group. That output should look like this:
[('', 'res', ''), ('', '', '='), ('3', '', ''), ('', '', '+'),
('', 'x_sum', ''), ('', '', '*'), ('11', '', '')]
How do I get re.findall to return output like that?

You may tokenize the string using
re.findall(r'(\d+)|([^\W\d]+)|(\W)', s)
See the regex demo. Note that re.findall returns a list of tuples when the pattern contains more than one capturing group. The pattern above contains 3 capturing groups, so each tuple contains 3 elements, and only the element for the group that matched is non-empty: 1+ digits, 1+ letters/underscores, or a non-word char.
More details
(\d+) - Capturing group 1: 1+ digits
| - or
([^\W\d]+) - Capturing group 2: 1+ chars other than non-word and digit chars (letters or underscores)
| - or
(\W) - Capturing group 3: a non-word char.
See Python demo:
import re
rx = r"(\d+)|([^\W\d]+)|(\W)"
s = "res=3+x_sum*11"
print(re.findall(rx, s))
# => [('', 'res', ''), ('', '', '='), ('3', '', ''), ('', '', '+'), ('', 'x_sum', ''), ('', '', '*'), ('11', '', '')]
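Since the goal is a list of lexemes together with their token types, named groups may be easier to consume than tuples padded with empty strings. The following is a minimal sketch, assuming the hypothetical token names NUMBER, IDENT and OP (they are not part of the answer above); the three alternatives mirror the pattern shown there.

```python
import re

# Hypothetical token names; the alternatives mirror the pattern above.
TOKEN_RE = re.compile(r"(?P<NUMBER>\d+)|(?P<IDENT>[^\W\d]+)|(?P<OP>\W)")


def tokenize(text):
    """Return (token_type, lexeme) pairs for every match in text."""
    # m.lastgroup is the name of the named group that matched
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


print(tokenize("res=3+x_sum*11"))
# => [('IDENT', 'res'), ('OP', '='), ('NUMBER', '3'), ('OP', '+'), ('IDENT', 'x_sum'), ('OP', '*'), ('NUMBER', '11')]
```

Because only one top-level alternative can match at a time, lastgroup identifies the token type directly and no empty strings need to be filtered out.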

Related

Python Regex to look for ALL special characters in a keyboard [duplicate]

I'm trying to split strings every time I encounter a punctuation mark or a number, such as:
toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.sub('[0123456789,.?:;~!##$%^&*()]', ' \1',toSplit).split()
The desired output would be:
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
However, the code above (although it properly splits where it's supposed to) removes all the numbers and punctuation marks.
Any clarification would be greatly appreciated.
Use re.split with a capturing group so that the delimiters are kept in the result:
import re
toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.split('([0-9,.?:;~!##$%^&*()])', toSplit)
result
Output:
['I', '2', 'eat', '!', 'Apples', '2', '', '2', 'becauseilike', '?', 'Them']
If you want to keep runs of repeated digits or punctuation together as one token, add +:
result = re.split('([0-9,.?:;~!##$%^&*()]+)', toSplit)
result
Output:
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
You may tokenize strings like yours into digits, letters, and other chars (anything that is not whitespace, a letter or a digit) using
re.findall(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)
Here,
\d+ - 1+ digits
(?:[^\w\s]|_)+ - 1+ chars that are either _ or neither word nor whitespace chars
[^\W\d_]+ - any 1+ Unicode letters.
See the regex demo.
The matching approach is more flexible than splitting, as it also allows tokenizing more complex structures. Say you also want to tokenize decimal (float, double...) numbers: you just need to use \d+(?:\.\d+)? instead of \d+:
re.findall(r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)
             ^^^^^^^^^^^^^
See this regex demo.
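To make the difference concrete, here is a small sketch comparing the two patterns on a hypothetical input that contains a decimal number (the sample string and the constant names are assumptions for illustration):

```python
import re

INT_ONLY = r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+'
WITH_DECIMALS = r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+'

sample = 'area=3.14*r2'  # hypothetical input, not from the question

print(re.findall(INT_ONLY, sample))
# => ['area', '=', '3', '.', '14', '*', 'r', '2']
print(re.findall(WITH_DECIMALS, sample))
# => ['area', '=', '3.14', '*', 'r', '2']
```

With the integer-only pattern the dot is treated as punctuation and splits the number; the optional (?:\.\d+)? group keeps it together.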
Use re.split to split wherever a run of letters is found:
>>> import re
>>> re.split(r'([A-Za-z]+)', toSplit)
['', 'I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them', '']
>>>
>>> ' '.join(re.split(r'([A-Za-z]+)', toSplit)).split()
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
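As an alternative to the join/split round trip, the empty strings can be dropped with a list comprehension. This is only a sketch of the same idea; unlike ' '.join(...).split(), it would also keep whitespace tokens if the input contained any.

```python
import re

toSplit = 'I2eat!Apples22becauseilike?Them'

parts = re.split(r'([A-Za-z]+)', toSplit)
tokens = [part for part in parts if part]  # drop empty strings only
print(tokens)
# => ['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
```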

Parse equation to list of tuples in Python

I want to parse equations and get a list of tuples.
For example, when I enter
2x = 4+3y,
I want to get
[('', '2', 'x', '='), ('','4','',''), ('+','3','y','')]
This is my regex so far:
([+-]*)([0-9]+)([a-z]*)([<=>]*)
It works fine for the above query but it does not capture equations like
2 = x +3y, (where x does not have any coefficient)
How do I capture that?
How about this regex?
(\d*)(\w*) *(=) *(\d*)(\w*) *[+\-*/] *(\d*)(\w*)
It separates all operands and operators (and inside operands it also splits number and variable). Note that | inside a character class is a literal pipe, so [+\-*/] is enough to match a single operator.
For testing a regex I normally use https://regex101.com/, where you can build the pattern with live feedback.
If you change the quantifier on the coefficient from + (one or more) to * (zero or more) then you should get the result you are after. You will also get an empty string match due to all the quantifiers now being * but you can filter out that match.
>>> import re
>>> e1 = "2x=4+3y"
>>> e2 = "2=x+3y"
>>> re.findall("([+-]*)([0-9]*)([a-z]*)([<=>]*)", e1)
[('', '2', 'x', '='), ('', '4', '', ''), ('+', '3', 'y', ''), ('', '', '', '')]
>>> re.findall("([+-]*)([0-9]*)([a-z]*)([<=>]*)", e2)
[('', '2', '', '='), ('', '', 'x', ''), ('+', '3', 'y', ''), ('', '', '', '')]
Note: whilst this solves your direct question this is not a good approach to parsing infix equations.
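The filtering step mentioned above can be sketched as follows. The helper name parse_terms is hypothetical, and removing spaces before matching is an assumption made so that inputs such as "2 = x +3y" work with the same pattern.

```python
import re

TERM_RE = re.compile(r"([+-]*)([0-9]*)([a-z]*)([<=>]*)")


def parse_terms(equation):
    """Return the non-empty (sign, coefficient, variable, relation) tuples."""
    compact = equation.replace(" ", "")  # assumption: spaces carry no meaning
    # any(t) is False only for the all-empty tuple produced by the empty match
    return [t for t in TERM_RE.findall(compact) if any(t)]


print(parse_terms("2x = 4+3y"))
# => [('', '2', 'x', '='), ('', '4', '', ''), ('+', '3', 'y', '')]
print(parse_terms("2 = x +3y"))
# => [('', '2', '', '='), ('', '', 'x', ''), ('+', '3', 'y', '')]
```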

Partitioning a string with multiple delimiters

I know partition() exists, but it only takes one separator. I'm trying to partition around several different values.
For example, say I wanted to partition around the symbols in a string:
input: "function():"
output: ["function", "(", ")", ":"]
I can't seem to find an efficient way to handle variable amounts of partitioning.
You can use re.findall with an alternation pattern that matches either a word or a non-space character:
re.findall(r'\w+|\S', s)
so that given s = 'function():', this returns:
['function', '(', ')', ':']
You could re.split by \W and use (...) to keep the delimiters, then remove empty parts.
>>> import re
>>> s = "function(): return foo + 3"
>>> [s for s in re.split(r"(\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3']
Note that this will split after every special character; if you want to keep certain groups of special characters together, e.g. == or <=, list them as alternatives before \W so that they are tried first.
>>> s = "function(): return foo + 3 == 42"
>>> [s for s in re.split(r"(\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3', '=', '=', '42']
>>> [s for s in re.split(r"(==|!=|<=|\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3', '==', '42']
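The same ordering idea carries over to the re.findall approach: multi-character operators go before the single-character fallback. A minimal sketch, where the operator list (including >=) is an assumption and would need to be extended for a real language:

```python
import re

s = "function(): return foo + 3 == 42"

# Alternatives are tried left to right, so '==' wins over a single '='.
tokens = re.findall(r"==|!=|<=|>=|\w+|\S", s)
print(tokens)
# => ['function', '(', ')', ':', 'return', 'foo', '+', '3', '==', '42']
```

Since findall never returns whitespace or empty strings here, no filtering step is needed.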

Regex to parse equation

I'm trying to parse an equation such as
5x>=7-5y+4z
into a list of tuples with python:
[('', '5', 'x', '>='), ('', '7', '', ''), ('-', '5', 'y', ''), ('+', '4', 'z', '')]
I've managed to write a pattern (pattern = "[+-]?\d*[a-z]?[><=]*") to break the equation into groups, but I have no idea how to make it return tuples.
Any help appreciated...
I think you want this:
import re
pattern = re.compile(r'([+-]?)([0-9]+)([a-z]?)([><]?=?)')
print(pattern.findall('5x>=7-5y+4z'))
# => [('', '5', 'x', '>='), ('', '7', '', ''), ('-', '5', 'y', ''), ('+', '4', 'z', '')]
Each match of the regex is returned by re.findall as a tuple containing one string per capturing group, in the order the groups appear in the pattern.
I took some liberties with the interpretation of the actual regex, since I'm not sure what the expected output for other cases would be (for example, would there be a 0x term?)
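The effect of the capturing groups can be seen by running the same pattern with and without them. This is a small sketch for illustration; the variable names are assumptions.

```python
import re

equation = '5x>=7-5y+4z'

# No capturing groups: findall returns the whole match as a string.
without_groups = re.findall(r'[+-]?[0-9]+[a-z]?[><]?=?', equation)
print(without_groups)
# => ['5x>=', '7', '-5y', '+4z']

# Four capturing groups: findall returns one 4-tuple per match.
with_groups = re.findall(r'([+-]?)([0-9]+)([a-z]?)([><]?=?)', equation)
print(with_groups)
# => [('', '5', 'x', '>='), ('', '7', '', ''), ('-', '5', 'y', ''), ('+', '4', 'z', '')]
```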

How to force regex stop when hits a 'character' and continue from the start again

import re
match = re.findall(r'(a)(?:.*?(b)|.*?)(?:.*?(c)|.*?)(d)?',
'axxxbxd,axxbxxcd,axxxxxd,axcxxx')
print (match)
output: [('a', 'b', 'c', 'd'), ('a', '', 'c', '')]
I want output as below:
[('a','b','','d'),('a','b','c','d'),('a','','','d'),('a','','c','')]
Each list starts with 'a' and has 4 items from the string separated by comma respectively.
If you want to obtain several matches from a delimited string, either split the string with the delimiters first and run your regex, or replace the . with the [^<YOUR_DELIMITING_CHARS>] (paying attention to \, ^, ] and - that must be escaped). Also note that you can get rid of redundancy in the pattern using optional non-capturing groups.
Note that I assume that a, b and c are placeholders and the real life values can be both single and multicharacter values.
import re
s = 'axxxbxd,axxbxxcd,axxxxxd,axcxxx'
r = r'(a)(?:.*?(b))?(?:.*?(c))?(d)?'
print([re.findall(r, x) for x in s.split(',')])
# => [[('a', 'b', '', '')], [('a', 'b', 'c', 'd')], [('a', '', '', '')], [('a', '', 'c', '')]]
See the Python demo.
If your delimiters are non-word chars, split with \W instead; the result is the same for this input:
print([re.findall(r, x) for x in re.split(r'\W', s)])
# => [[('a', 'b', '', '')], [('a', 'b', 'c', 'd')], [('a', '', '', '')], [('a', '', 'c', '')]]
If the strings can contain line breaks, pass re.DOTALL modifier to the re.findall calls.
Pattern details
(a) - Group 1 capturing a
(?:.*?(b))? - an optional non-capturing group matching a sequence of:
    .*? - any char (other than line break chars if the re.S / re.DOTALL modifier is not used), zero or more occurrences, but as few as possible
    (b) - Group 2: a b value
(?:.*?(c))? - an optional non-capturing group matching a sequence of:
    .*? - any char (other than line break chars if the re.S / re.DOTALL modifier is not used), zero or more occurrences, but as few as possible
    (c) - Group 3: a c value
(d)? - Group 4 (optional): a d.
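The other option mentioned above, replacing . with a negated character class, can be sketched like this. It assumes the comma is the only delimiter, and it also searches for d lazily (an assumption that differs from the pattern explained above), so a d that does not directly follow b or c is still captured.

```python
import re

s = 'axxxbxd,axxbxxcd,axxxxxd,axcxxx'

# [^,] cannot cross the comma delimiter, so each match stays within one chunk.
# Assumption: d is wrapped in the same optional lazy group as b and c.
r = r'(a)(?:[^,]*?(b))?(?:[^,]*?(c))?(?:[^,]*?(d))?'

print(re.findall(r, s))
# => [('a', 'b', '', 'd'), ('a', 'b', 'c', 'd'), ('a', '', '', 'd'), ('a', '', 'c', '')]
```

For this input the result is a single flat list of tuples, one per comma-separated chunk, without splitting the string first.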
Since the sequence a... b... c... d always appears in that order, a straightforward approach is enough: split the string and check each chunk for each value:
import re
s = 'axxxbxd,xxbxxcxxd,xxbxxxd|axcxxx'  # extended example
result = []
for seq in re.split(r'\W', s):  # split on any non-word character
    # keep each value if it occurs in the chunk, otherwise an empty string
    result.append([c if c in seq else '' for c in ('a', 'b', 'c', 'd')])
print(result)
The output:
[['a', 'b', '', 'd'], ['', 'b', 'c', 'd'], ['', 'b', '', 'd'], ['a', '', 'c', '']]
