Splitting string with regex and re.findall

I want to match any number of digits, decimal points, and the letter e, or ONE CHARACTER in this list of characters when it occurs in a string:
+ - % ^ * / ( )
then I want to break the subject string into a list containing each individual match.
I have the following regex, which I'm fairly certain does this correctly: ([0-9.e]+|[\^\*\/\%\+\-\(\)]) I even tested it on regex101.com, and it matches the way I want it to.
However, when I run re.findall() on the string (5+2)*5, it returns the following list:
['(', '5', '+', '2', u')*', '5']
What is wrong with my regex?
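For reference, here is a minimal check of the pattern exactly as quoted above, assuming Python 3 and a raw string literal (the name TOKEN is only illustrative). Under those assumptions it splits the sample one operator at a time, so the u')*' item in the reported output suggests that the pattern actually run differed from the one shown. That is an inference, not something the question confirms.

```python
import re

# Pattern copied verbatim from the question, written as a raw string.
TOKEN = re.compile(r"([0-9.e]+|[\^\*\/\%\+\-\(\)])")

# One capturing group, so findall returns a flat list of strings.
tokens = TOKEN.findall("(5+2)*5")
print(tokens)  # ['(', '5', '+', '2', ')', '*', '5']
```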

Related

regex expression to match multiple digits on a single line

I'm trying to match multiple numbers even when they are on the same line.
This is what I have so far:
import re
lines = '''
1 2
3
4
G3434343 '''
x = re.findall('^\d{1,2}',lines,re.MULTILINE)
print(x)
the output I am getting is:
[1,3,4]
The output I want to get is:
[1,2,3,4]
I'm not sure what to try next. Any ideas? Keep in mind the numbers can also be two digits. Here's another input example:
8 9
11
15
G54444
the output for the above should be
[8,9,11,15]

Your pattern uses ^, which anchors the match to the beginning of a line (of every line, with re.MULTILINE). Remove it to also find, e.g., the "2".
Then you'll also get the "34"s from the "G3434343", which you don't seem to want. So you need to tell the regex that there must be a word boundary before and after the digit(s) by using \b. (Adjust this as needed; it's not clear from your examples how you decide which digits you want.)
However, in an ordinary string literal \b is interpreted as a backspace character, so nothing matches. You therefore need to make the pattern a raw string with a leading r, or escape every backslash with another backslash.
import re
lines = '''
1 2
3
4
G3434343 '''
x = re.findall(r'\b\d{1,2}\b', lines, re.MULTILINE)
print(x)
This prints:
['1', '2', '3', '4']
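The raw-string point can be checked directly. This is a small sketch with illustrative variable names; the pattern is built from pieces only to make the difference between the two literals visible.

```python
import re

text = "1 2\n3\n4\nG3434343"

backspace = '\b'   # ordinary literal: a single backspace character (U+0008)
boundary = r'\b'   # raw literal: a backslash followed by 'b', i.e. a word boundary

# With backspace characters around the digits, nothing in the text matches.
without_raw = re.findall(backspace + r'\d{1,2}' + backspace, text)

# With real word boundaries, the standalone numbers are found.
with_raw = re.findall(boundary + r'\d{1,2}' + boundary, text)

print(len(backspace), len(boundary))  # 1 2
print(without_raw)                    # []
print(with_raw)                       # ['1', '2', '3', '4']
```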

You can try with lookarounds:
(?<=\s)\d{1,2}(?=\s)
Then map every match in each input to an integer (here lines is a list holding the two input strings):
[list(map(int, re.findall(r'(?<=\s)\d{1,2}(?=\s)', line))) for line in lines]
Output:
[[1,2,3,4], [8,9,11,15]]
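A self-contained version of the lookaround approach might look like the sketch below. The names first, second and inputs are illustrative, and the second input is assumed to end with trailing whitespace like the first. Note that these lookarounds require a whitespace character on both sides, so a number at the very start or end of the string would be missed.

```python
import re

first = '''
1 2
3
4
G3434343 '''

second = '''
8 9
11
15
G54444 '''

inputs = [first, second]

# A number counts only if whitespace sits immediately before and after it.
pattern = re.compile(r'(?<=\s)\d{1,2}(?=\s)')

result = [list(map(int, pattern.findall(block))) for block in inputs]
print(result)  # [[1, 2, 3, 4], [8, 9, 11, 15]]
```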

Why does my regular expression return tuples for every character in a string? [duplicate]

I am making a simple project for my math class in which I want to verify if a given function body (string) only contains the allowed expressions (digits, basic trigonometry, +, -, *, /).
I am using regular expressions with the re.findall method.
My current code:
import re
def valid_expression(exp) -> bool:
    # remove white spaces
    exp = exp.replace(" ", "")
    # characters to search for
    chars = r"(cos)|(sin)|(tan)|[\d+/*x)(-]"
    z = re.findall(chars, exp)
    return "".join(z) == exp
However, when I test this on any expression, re.findall(chars, exp) returns a list of tuples with 3 empty strings, ('', '', ''), for every character in the string, unless there is a trig function, in which case it returns a tuple with the trig function and two empty strings.
Ex: cos(x) -> [('cos', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
I don't understand why it does this. I have tested the regular expression on regexr.com and it works fine. I get that it uses JavaScript, but normally there should be no difference, right?
Thank you for any explanation and/or fix.

Short answer: If the result you want is ['cos', '(', 'x', ')'], you need something like
r'(cos|sin|tan|[-+/*x()]|\d+)':
>>> re.findall(r'(cos|sin|tan|[-+/*x()]|\d+)', "cos(x)")
['cos', '(', 'x', ')']
From the documentation for findall:
The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
For 'cos(x)', you start with ('cos', '', '') because cos matched, but neither sin nor tan matched. For each of (, x, and ), none of the three capture groups matched, although the bracket expression did. Since it isn't inside a capture group, anything it matches isn't included in your output.
As an aside, [\d+/*x)(-] doesn't match a multidigit integer as a single token. Inside [...], \d still stands for any digit in Python's re module, but + is a literal plus sign, not a quantifier. As a result, the class matches exactly one character at a time, which is one of the following:
any digit (\d)
+
/
*
x
)
(
-
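The findall behaviour quoted from the documentation can be reproduced with a short sketch. The variable names are illustrative; the patterns are the asker's, grouped in three different ways.

```python
import re

text = "cos(x)"

# Three capturing groups: one 3-tuple per match, with '' for groups that did not take part.
three_groups = re.findall(r"(cos)|(sin)|(tan)|[\d+/*x)(-]", text)

# Exactly one capturing group around everything: a flat list of strings.
one_group = re.findall(r"(cos|sin|tan|[\d+/*x)(-])", text)

# No capturing groups at all: also a flat list, holding the whole matches.
no_groups = re.findall(r"cos|sin|tan|[\d+/*x)(-]", text)

print(three_groups)  # [('cos', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
print(one_group)     # ['cos', '(', 'x', ')']
print(no_groups)     # ['cos', '(', 'x', ')']
```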

Your regex has three groups (expressions in parentheses), so you get tuples with three items. You also get four results, one for each substring that matches your regex: the first for 'cos', the second for '(', the third for 'x', and the last for ')'. But the last part of your regex isn't marked as a group, so those matches don't appear in the tuples. If you change your regex to r"(cos)|(sin)|(tan)|([\d+/*x)(-])", you will get tuples with four items, and every tuple will have exactly one non-empty item.
Unfortunately, this change doesn't help you verify that there are no prohibited lexemes. It only explains what's going on.
I would suggest converting your regex to a negative form: instead of checking that the string contains some allowed lexemes, check that it contains nothing except allowed ones. I guess this should work for simple cases. But I'm afraid that for more sophisticated expressions you will have to use something other than a regex.
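One way to read the "negative form" suggestion is to delete every allowed lexeme and see whether anything is left over. The sketch below assumes that reading; ALLOWED and has_only_allowed are hypothetical names, not part of the original answer.

```python
import re

# Every allowed lexeme: the three trig names, or one digit/operator/x/parenthesis.
ALLOWED = re.compile(r"cos|sin|tan|[\d+/*x()-]")

def has_only_allowed(exp: str) -> bool:
    """Return True if nothing remains after removing all allowed lexemes."""
    exp = exp.replace(" ", "")
    leftover = ALLOWED.sub("", exp)
    return leftover == ""

print(has_only_allowed("cos(x) + 12*x"))  # True
print(has_only_allowed("cos(y)"))         # False, 'y' is left over
```

This only checks the vocabulary, not the syntax: a string such as "+)(" passes, which is the kind of case where something other than a regex is needed.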

findall returns tuples because your regular expression has capturing groups. To make a group non-capturing, add ?: after the opening parenthesis:
r"(?:cos)|(?:sin)|(?:tan)|[\d+/*x)(-]"
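As a sketch of how this fits into the asker's function: with non-capturing groups, findall returns plain strings, so the join-and-compare check works as intended. Only the pattern and the variable name tokens differ from the question's code.

```python
import re

def valid_expression(exp: str) -> bool:
    # remove white spaces
    exp = exp.replace(" ", "")
    # (?:...) groups do not capture, so findall yields whole matches as strings
    chars = r"(?:cos)|(?:sin)|(?:tan)|[\d+/*x)(-]"
    tokens = re.findall(chars, exp)
    # any character that was skipped makes the joined tokens shorter than exp
    return "".join(tokens) == exp

print(valid_expression("2*sin(x) - 3"))  # True
print(valid_expression("cos(y)"))        # False
```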

Regex breaking with preceding characters

I am trying to parse a phone number from a group of strings by compiling this regex:
exp = re.compile(r'(\+\d|)(([^0-9\s]|)\d\d\d([^0-9\s]|)([^0-9\s]|)\d+([^0-9\s]|)\d+)')
This successfully matches with a line like "+1(123)-456-7890". However, if I add anything in front of it, like "P: +1(123)-456-7890" it does not match. I tested on Regex websites but can't figure this out at all.

You might consider using re.search (which scans through the string) instead of re.match, which only matches at the beginning of the string. Alternatively, you could add a .* to the start of the pattern.
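The difference can be seen with the asker's own pattern. This sketch assumes the original code called match on the compiled pattern, which the question does not show.

```python
import re

exp = re.compile(r'(\+\d|)(([^0-9\s]|)\d\d\d([^0-9\s]|)([^0-9\s]|)\d+([^0-9\s]|)\d+)')
line = "P: +1(123)-456-7890"

# match only tries at index 0, where "P: " gets in the way.
at_start = exp.match(line)

# search tries every starting position until one succeeds.
anywhere = exp.search(line)

print(at_start)           # None
print(anywhere.group(0))  # +1(123)-456-7890
```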

Your regex will return the following result:
[('+1', '(123)-456-7890', '(', ')', '-', '-')]
If the format is fixed, you can use something like
phone = re.compile(r"\+\d\(\d+\)-\d+-\d+")
\d - matches a digit.
+ - one or more occurrences.
\+ - matches a literal "+".
\( - matches a literal "(".
text = "P: +1(123)-456-7890"
phone.findall(text)
Output:
['+1(123)-456-7890']

How can I get the number of groups to vary depending on the number of lines?

I have this regex: ^:([^:]+):([^:]*) which works as in this regex101 link.
Now, in Python, I have this:
def get_data():
    data = read_mt_file()
    match_fields = re.compile('^:([^:]+):([^:]*)', re.MULTILINE)
    fields = re.findall(match_fields, data)
    return fields
Which, for a file containing the data from regex101, returns:
[('1', 'text\ntext\n\n'), ('20', 'text\n\n'), ('21', 'text\ntext\ntext\n\n'), ('22', ' \n\n'), ('25', 'aa\naa\naaaaa')]
Now, this is OK, but I want to change the regex so that the number of groups varies with the number of lines. Meaning:
for the first field, I currently get two groups:
1
text\ntext\n\n
I'd like to get instead:
1
((text\n), (text\n\n)) <-- these should somehow be in the same group but separated, each in its own subgroup. I need to know that they both belong to one field but are separate lines.
So, in Python, the desired result for that file would be:
[('1', '(text\n), (text\n\n)'), ('20', 'text\n\n'), ('21', '(text\n), (text\n), (text\n\n)'), ('22', ' \n\n'), ('25', '(aa\n), (aa\n), (aaaaa)')]
Is this possible with regex? Could this be achieved with some nice string manipulation instead ?

To do what you want, you'd need another regex.
This is because a repeated capturing group only keeps the last item it matched:
>>> re.match(r'(\d)+', '12345').groups()
('5',)
Instead of using one regex, you'll need to use two:
the one that you are using at the moment, and then one to match all the 'sub-groups', using, say, re.findall.
You can get these sub-groups by simply matching anything that isn't a \n and then any number of \n.
So you could use a regex such as [^\n]+\n*:
>>> re.findall(r'[^\n]+\n*', 'text\ntext')
['text\n', 'text']
>>> re.findall(r'[^\n]+\n*', 'text\ntext\n\n')
['text\n', 'text\n\n']
>>> re.findall(r'[^\n]+\n*', '')
[]
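Putting the two regexes together could look like the sketch below. FIELD, LINE, split_fields and the sample string are illustrative names and data, not part of the original answer; the nested tuple is one possible way to keep the lines of a field together.

```python
import re

FIELD = re.compile(r'^:([^:]+):([^:]*)', re.MULTILINE)  # the asker's pattern
LINE = re.compile(r'[^\n]+\n*')                          # the sub-group pattern

def split_fields(data):
    """Return (tag, (line, line, ...)) pairs, one per field."""
    return [(tag, tuple(LINE.findall(body))) for tag, body in FIELD.findall(data)]

sample = ":1:text\ntext\n\n:20:text\n\n:25:aa\naa\naaaaa"
fields = split_fields(sample)
print(fields)
# [('1', ('text\n', 'text\n\n')), ('20', ('text\n\n',)), ('25', ('aa\n', 'aa\n', 'aaaaa'))]
```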

You may use a simple trick: after getting the matches with your regex, run a .+\n* regex over the Group 2 value:
import re
p = re.compile(r'^:([^:]+):([^:]+)', re.MULTILINE)
s = ":1:text\ntext\n\n:20:text\n\n:21:text\ntext\ntext\n\n:22: \n\n:25:aa\naa\naaaaa"
print([[x.group(1)] + re.findall(r".+\n*", x.group(2)) for x in p.finditer(s)])
Here,
p.finditer(s) finds all matches in the string using your regex
[x.group(1)] - a list created from the first group contents
re.findall(r".+\n*", x.group(2)) - fetches individual lines from Group 2 contents (with trailing newlines, 0 or more)
[] + re.findall - combining the lists into 1.
Result is
[['1', 'text\n', 'text\n\n'], ['20', 'text\n\n'], ['21', 'text\n', 'text\n', 'text\n\n'], ['22', ' \n\n'], ['25', 'aa\n', 'aa\n', 'aaaaa']]
Another approach: match all the substrings with your pattern and then use a re.sub to add ), ( between the lines ending with optional newlines:
[(x, "({})".format(re.sub(r".+(?!\n*$)\n+", r"\g<0>), (", y))) for x, y in p.findall(s)]
Result:
[('1', '(text\n), (text\n\n)'), ('20', '(text\n\n)'), ('21', '(text\n), (text\n), (text\n\n)'), ('22', '( \n\n)'), ('25', '(aa\n), (aa\n), (aaaaa)')]
Here:
p.findall(s) - grabs all the matches in the form of a list of tuples containing your capture group contents using your regex
(x, "({})".format(re.sub(r".+(?!\n*$)\n+", r"\g<0>), (", y))) - creates a tuple from Group 1 contents and Group 2 contents that are a bit modified with the re.sub the way described below
.+(?!\n*$)\n+ - pattern that matches 1+ characters other than newline and then 1+ newline symbols if they are not at the end of the string. If they are at the end of the string, there will be no replacement made (to avoid , () at the end). The \g<0> in the replacement string is re-inserting the whole match back into the resulting string and appends ), ( to it.

Python regex findall alternation behavior

I'm using Python 2.7.6. I can't understand the following result from re.findall:
>>> re.findall('\d|\(\d,\d\)', '(6,7)')
['(6,7)']
I expected the above to return ['6', '7'], because according to the documentation:
'|'
A|B, where A and B can be arbitrary REs, creates a regular
expression that will match either A or B. An arbitrary number of REs
can be separated by the '|' in this way. This can be used inside
groups (see below) as well. As the target string is scanned, REs
separated by '|' are tried from left to right. When one pattern
completely matches, that branch is accepted. This means that once A
matches, B will not be tested further, even if it would produce a
longer overall match. In other words, the '|' operator is never
greedy. To match a literal '|', use \|, or enclose it inside a
character class, as in [|].
Thanks for your help

As mentioned in the documentation:
This means that once A matches, B will not be tested further, even if it would produce a longer overall match.
So in this case the regex engine can't match \d at the start, because your string starts with ( and not a digit. It therefore matches the second alternative, \(\d,\d\), which consumes the whole string. But if your string started with a digit, it would match \d:
>>> re.findall('\d|\d,\d\)', '6,7)')
['6', '7']
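A short sketch of the same behaviour, using raw strings and illustrative variable names. The last two lines show that the order of the alternatives matters only when both can match at the same position.

```python
import re

# At index 0, \d fails on "(", so the second alternative is tried and consumes everything.
whole = re.findall(r'\d|\(\d,\d\)', '(6,7)')

# To get the digits individually, match digits only.
digits = re.findall(r'\d', '(6,7)')

# Both alternatives can match at index 0 here; the leftmost one wins.
short_first = re.findall(r'\d|\d,\d', '6,7')
long_first = re.findall(r'\d,\d|\d', '6,7')

print(whole)        # ['(6,7)']
print(digits)       # ['6', '7']
print(short_first)  # ['6', '7']
print(long_first)   # ['6,7']
```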
