Why do spaces in regular expressions relative to '+' cause issues? [duplicate]

This question already has answers here:
Finding Plus Sign in Regular Expression
(7 answers)
Closed 2 years ago.
In my Python code, I used re.compile() to check whether a given phrase exists.
PATTERNS = {
    re.compile(r'[\w\s] + total+ [\w\s] + cases'): data.get_total_cases,
    re.compile(r'[\w\s] + total cases'): data.get_total_cases,
    re.compile(r'[\w\s] + total + [\w\s] + deaths'): data.get_total_deaths,
    re.compile(r'[\w\s] + total deaths'): data.get_total_deaths
}
This did not work as expected. I couldn't find anything wrong. Finally, I removed spaces after every character set [\w\s] because it was the only visible difference between my code and original code that I had referenced.
PATTERNS = {
    re.compile(r'[\w\s]+ total+ [\w\s]+ cases'): data.get_total_cases,
    re.compile(r'[\w\s]+ total cases'): data.get_total_cases,
    re.compile(r'[\w\s]+ total+ [\w\s]+ deaths'): data.get_total_deaths,
    re.compile(r'[\w\s]+ total deaths'): data.get_total_deaths
}
Now the code works and all patterns are matched successfully. But I still can't figure out why these spaces caused the issue.

The + symbol in a regular expression means "one or more of the preceding element".
So a space followed by + means "one or more spaces", and [\w\s]+ means "one or more word characters (letters, digits, underscore) or whitespace characters". In your first version, [\w\s] + therefore matched exactly one such character followed by one or more spaces, which is not what you intended.
If you want to match a pattern like 10 total + 10 cases, with + as a literal character, you need to escape the + sign. A raw string (r before the string) keeps backslashes literal in the string, so they can be used for escaping in the regex pattern.
re.compile(r"[\w\s]+ total \+ [\w\s]+ cases")
Notice the \+ which means "literally a + sign" rather than "one or more of".
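To see the effect of the stray spaces directly, here is a minimal sketch comparing the two pattern styles. The sample sentences are made up for illustration and are not from the original code.

```python
import re

# Sample inputs, invented for illustration.
text = "how many total cases"
literal_plus_text = "10 total + 10 cases"

# Space before '+': the class matches exactly ONE character,
# then ' +' demands one or more spaces, then another space follows.
spaced = re.compile(r"[\w\s] + total cases")
# No space: '+' quantifies the character class itself.
unspaced = re.compile(r"[\w\s]+ total cases")
# Escaped '+': matches a literal plus sign in the input.
escaped = re.compile(r"[\w\s]+ total \+ [\w\s]+ cases")

spaced_match = spaced.search(text)
unspaced_match = unspaced.search(text)
escaped_match = escaped.search(literal_plus_text)

print(spaced_match)            # None: the text has only single spaces
print(unspaced_match.group())  # how many total cases
print(escaped_match.group())   # 10 total + 10 cases
```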

Related

Limiting phone numbers, regex starts with only a character "+" [duplicate]

This question already has answers here:
Checking whole string with a regex
(5 answers)
Closed 2 years ago.
I'm trying to limit an input of phone numbers to:
1-16 digits OR
A single "+" followed by 1-16 digits.
This is my code
import re
txt = "+++854"
x = re.search(str("([\+]{0,1}[0-9]{3,16})|([0-9]{3,16})"), txt)
###^\+[0-9]{1,16}|[0-9]{1,16}", txt) #startswith +, followed by numbers.
if x:
    print("YES! We have a match!")
else:
    print("No match")
# That's a match
Yet it yields a match. I also tried "^+{0,1}[0-9]{1,16}|[0-9]{1,16}", but although it works on https://regex101.com/r/aP0qH2/4, it doesn't work in my code the way I think it should.

re.search searches for "the first location where the regular expression pattern produces a match" and returns the resulting match object. In the string "+++854", the substring "+854" matches.
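To match the whole string, use re.fullmatch. Note that re.match only anchors the pattern at the start of the string, so trailing characters would still be accepted. The documentation has a section about the difference between re.match and re.search.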

pattern = r"\+?[0-9]{1,16}"  # optional +, then 1 to 16 digits; apply it to the whole string
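A minimal sketch of whole-string validation, assuming the limits stated in the question (an optional single + followed by 1 to 16 digits). The helper name is_valid_phone is hypothetical; re.fullmatch is part of the standard re module.

```python
import re

# Optional single '+', then 1 to 16 digits (limits taken from the question).
PHONE = re.compile(r"\+?[0-9]{1,16}")

def is_valid_phone(text):
    """Return True only if the ENTIRE string fits the pattern."""
    return PHONE.fullmatch(text) is not None

# search() is satisfied by any matching substring, fullmatch() is not.
substring_hit = PHONE.search("+++854").group()

print(is_valid_phone("+++854"))  # False
print(is_valid_phone("+854"))    # True
print(is_valid_phone("854"))     # True
print(substring_hit)             # +854
```

Using fullmatch keeps the pattern itself free of ^ and $ anchors, which makes it harmless to reuse with search elsewhere.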

bad character range :-' at position 6 exception in python [duplicate]

This question already has an answer here:
Python regex bad character range.
(1 answer)
Closed 2 years ago.
I am comparing two strings but excluding the punctuation marks in both.
Here is my code snippet:
punctuation = r"[.?!,;:-']"
string1 = re.sub(punctuation, r"", string1)
string2 = re.sub(punctuation, r"", string2)
After running this code I get the following exception:
bad character range :-' at position 6
How to get rid of this exception? What's the meaning of "bad character range"?

- has a special meaning inside [] in a regular expression pattern. For example, [A-Z] matches the ASCII uppercase letters (from A to Z), so if you need a literal - you have to escape it, i.e.
punctuation = r"[.?!,;:\-']"
I also want to point out regex101.com, which is useful for testing regular expression patterns.

A - inside a character class [...] is used to denote a range of characters, for example: [0-9] would be equivalent to [0123456789].
Here, the :-' would mean any character between : and '. However, if you look up the character numbers, you see that they are in the wrong order for that to be a valid range:
>>> ord(":")
58
>>> ord("'")
39
In the opposite order '-: (inside the []) it would be a valid character range.
In any case, it is not what you want. You want the - to be interpreted as a literal - character.
There are two ways to achieve this. Either:
escape the - by writing \-
or put the - as the first or last character inside the [], e.g. r"[.?!,;:'-]"
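Both fixes can be checked side by side. This is a small sketch; the sample sentence is invented for illustration.

```python
import re

# Both spellings treat '-' as a literal hyphen.
escaped = re.compile(r"[.?!,;:\-']")
trailing = re.compile(r"[.?!,;:'-]")

sample = "Well-known: isn't it?"  # invented sample text

cleaned_escaped = escaped.sub("", sample)
cleaned_trailing = trailing.sub("", sample)
print(cleaned_escaped)   # Wellknown isnt it
print(cleaned_trailing)  # Wellknown isnt it

# The original class is rejected when the pattern is compiled,
# because ':' (58) comes after "'" (39) in code point order.
range_error = None
try:
    re.compile(r"[.?!,;:-']")
except re.error as exc:
    range_error = exc
print("compile failed:", range_error is not None)
```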

Not sure about how /?(.+) works in my regex [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 1 year ago.
What is the difference between:
(.+?)
and
(.*?)
when I use it in my php preg_match regex?

They are called quantifiers.
* 0 or more of the preceding expression
+ 1 or more of the preceding expression
By default a quantifier is greedy, which means it matches as many characters as possible.
The ? after a quantifier makes this quantifier "ungreedy" (lazy), meaning it will match as few characters as possible.
Example greedy/ungreedy
For example, on the string "abab":
a.*b will match "abab" (preg_match_all will return one match, the "abab")
while a.*?b will match only the starting "ab" (preg_match_all will return two matches, each "ab")
You can test your regexes online, e.g. on Regexr.
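The same greedy versus lazy behaviour can be reproduced in Python. This sketch uses re.findall as a rough stand-in for preg_match_all, which is an assumption of convenience and not a claim that the two functions are identical.

```python
import re

text = "abab"

# Greedy: '.*' grabs as much as it can, so there is a single long match.
greedy = re.findall(r"a.*b", text)
# Lazy: '.*?' stops at the first 'b' it can, so there are two short matches.
lazy = re.findall(r"a.*?b", text)

print(greedy)  # ['abab']
print(lazy)    # ['ab', 'ab']
```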

The first (+) is one or more characters. The second (*) is zero or more characters. Both are non-greedy (?) and match anything (.).

In RegEx, {i,f} means "between i and f matches". Let's take a look at the following examples:
{3,7} means between 3 and 7 matches
{,10} means up to 10 matches with no lower limit (i.e. the lower limit is 0); not every engine accepts an omitted lower bound (Python's re does, while older PCRE versions treat {,10} as literal text), so {0,10} is the portable spelling
{3,} means at least 3 matches with no upper limit (i.e. the upper limit is infinity)
{,} means no upper or lower limit for the number of matches (i.e. the lower limit is 0 and the upper limit is infinity); the same portability caveat applies, and {0,} is the portable spelling
{5} means exactly 5 matches
Like most languages, RegEx has shorthands:
+ is the shorthand for {1,}
* is the shorthand for {0,}
? is the shorthand for {0,1}
This means + requires at least 1 match, * accepts any number of matches including none, and ? accepts either zero matches or exactly one.
Credit: Codecademy.com
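The shorthand equivalences can be verified with Python's re module. This is a rough sketch that uses the explicit lower bound 0, since that spelling is accepted by the widest range of engines; the sample string is arbitrary.

```python
import re

text = "aaaa"

# (shorthand, explicit-brace form) pairs that should behave the same.
pairs = [
    (r"a+", r"a{1,}"),
    (r"a*", r"a{0,}"),
    (r"a?", r"a{0,1}"),
]

# Compare the full list of matches for each pair.
equivalent = [
    re.findall(shorthand, text) == re.findall(explicit, text)
    for shorthand, explicit in pairs
]
print(equivalent)

# {5} means exactly five repetitions, no more and no fewer.
five = re.fullmatch(r"a{5}", "aaaaa")
four = re.fullmatch(r"a{5}", "aaaa")
print(five is not None, four is not None)
```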

+ matches at least one character
* matches any number (including 0) of characters
The ? indicates a lazy expression, so it will match as few characters as possible.

A + matches one or more instances of the preceding pattern. A * matches zero or more instances of the preceding pattern.
So basically, if you use a + there must be at least one instance of the pattern, if you use * it will still match if there are no instances of it.

Consider the following string to match:
ab
The pattern (ab.*) matches, and the capture group contains ab.
The pattern (ab.+) does not match, because .+ needs at least one more character after ab.
But if you change the string to the following, (ab.+) matches and captures aba:
aba
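The two cases above can be run as-is in Python. A minimal sketch, with the dictionary name captures chosen only for illustration:

```python
import re

captures = {}
for subject in ("ab", "aba"):
    star = re.search(r"(ab.*)", subject)
    plus = re.search(r"(ab.+)", subject)
    # Store what each pattern captured, or None when it did not match.
    captures[subject] = (
        star.group(1) if star else None,
        plus.group(1) if plus else None,
    )

print(captures["ab"])   # ('ab', None)
print(captures["aba"])  # ('aba', 'aba')
```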

+ requires at least one occurrence; * allows zero as well.

A star is very similar to a plus, the only difference is that while the plus matches 1 or more of the preceding character/group, the star matches 0 or more.

I think the previous answers fail to highlight a simple example.
Say we have an array:
numbers = [5, 15]
The regex ^[0-9]+ matches both 5 and 15, because each contains at least one digit.
^[0-9]* also matches both 5 and 15, but in addition it matches an empty string, or the empty prefix of a string that has no leading digits at all. The difference is that the + operator requires at least one occurrence of the preceding expression, while * allows zero occurrences.
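Running both patterns against a few sample values shows what each one actually accepts. In this sketch the numbers are treated as strings, and the empty string and "abc" are added as extra assumed test cases.

```python
import re

samples = ["5", "15", "", "abc"]

# '+' needs at least one digit at the start of the string.
plus_hits = [s for s in samples if re.match(r"^[0-9]+", s)]
# '*' is satisfied by zero digits, so it succeeds on every string.
star_hits = [s for s in samples if re.match(r"^[0-9]*", s)]

print(plus_hits)  # ['5', '15']
print(star_hits)  # ['5', '15', '', 'abc']
```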

Python 3.5: Formatting a String with Spaces

I have seen questions similar to this, yet none that address this particular issue. I have a calculator expression using +, -, *, or / operators, and I want to standardize it so that anything someone enters will be homogenous with how my program wants it...
My program wants a string of the format " 10 - 7 * 5 / 2 + 3 ", with the spaces before and after, and in-between each value. I want to take anything someone enters such as "10-7*5/2+3" or " 10- 7*5/2 + 3 ", and make it into the first format I specified.
My first idea was to convert the string to a list, then join with spaces in-between and concatenate the spaces on the front and end, but the clear problem with that is that the '10' gets split into '1' and '0' and comes out as '1 0' after joining.
s = s.replace(" ", "")
if s[0] == "-":
    s = "0" + s
else:
    s = s
s = " " + " ".join(list(s)) + " "
I was thinking that RegEx might help, but I'm not entirely sure how to put that into action. The main stumbling block for me is keeping the '10' and other multi-digit numbers from splitting apart into their individual digits when I do this.
I'm in python 3.5.

Solution
Here is one idea, assuming you're only ever dealing with very simple calculator expressions (i.e. digits and operators). If you also have other possible elements, you'd just have to adjust the regex.
Use a regex to extract the relevant pieces, ignoring whitespace, and then re-compose them using a join.
import re
def compose(expr):
    # a group consists of a digit sequence OR an operator
    elems = re.findall(r'(\d+|[+\-*/])', expr)
    # puts a single space between all groups and one before and after
    return ' ' + ' '.join(elems) + ' '
compose('10- 7*5/2 + 3')
# ' 10 - 7 * 5 / 2 + 3 '
compose('10-7*5/2+3')
# ' 10 - 7 * 5 / 2 + 3 '
Detailed Regex Explanation
The meat of the re.findall call is the regular expression: r'(\d+|[+\-*/])'
The first bit: \d means match one digit. + means match one or more of the preceding expression. So together \d+ means match one or more digits in a row.
The second bit: [...] is the character-set notation. It means match any one of the characters in the set. Inside a set, + and * lose their special meaning and need no escape, while - is escaped so that it is not read as a range. Forward slash is not special either. Commas inside a set are literal characters, not separators, so none are used here. So [+\-*/] means match any one of +, -, *, /.
The | in between the two regexes is your standard OR operator. So match either the first expression OR the second one. And parentheses are group notation in regexes; with a single group, re.findall returns what that group captured.
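One detail worth checking is how commas behave inside a character set: they are literal members of the set, not separators. The sketch below compares a set written with commas against one without; the input string is invented for illustration.

```python
import re

# Commas inside [...] are part of the set, so ',' itself is matched.
with_commas = re.compile(r"\d+|[\+,\-,\*,/]")
# Without commas, only the four operators are in the set.
without_commas = re.compile(r"\d+|[+\-*/]")

expr = "10, 7*5"  # invented input containing a stray comma

tokens_with = with_commas.findall(expr)
tokens_without = without_commas.findall(expr)

print(tokens_with)     # ['10', ',', '7', '*', '5']
print(tokens_without)  # ['10', '7', '*', '5']
```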

I'd suggest taking a simple and easy approach; remove all spaces and then go through the string character by character, adding spaces before and after each operator symbol.
Anything with two operators in a row is going to be invalid syntax anyway, so you can leave that to your existing calculator code to throw errors on.
sanitised_string = ""
for char in unformatted_string_without_spaces:
    if char in some_list_of_operators_you_made:
        sanitised_string += " " + char + " "
    else:
        sanitised_string += char
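A runnable sketch of this character-by-character approach might look like the following. The function name space_operators and the OPERATORS constant are hypothetical, and the leading and trailing spaces are added here only to match the format requested in the question.

```python
OPERATORS = "+-*/"  # assumed operator set, taken from the question

def space_operators(expression):
    """Strip all spaces, then pad every operator with one space per side."""
    compact = expression.replace(" ", "")
    pieces = []
    for char in compact:
        if char in OPERATORS:
            pieces.append(" " + char + " ")
        else:
            pieces.append(char)  # digits stay glued together
    # Outer spaces match the target format from the question.
    return " " + "".join(pieces) + " "

print(repr(space_operators("10- 7*5/2 + 3")))  # ' 10 - 7 * 5 / 2 + 3 '
print(repr(space_operators("10-7*5/2+3")))     # ' 10 - 7 * 5 / 2 + 3 '
```

Collecting the pieces in a list and joining once avoids repeated string concatenation, though for short calculator input either style works.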

Just like @fukanchik suggested, this is usually done in reverse, as in breaking the input string down into its basic components, and then re-assembling it again as you like.
I'd say you are on the right track using RegEx, as it's perfect for parsing this kind of input (perfect as in you don't need to write a more advanced parser). For this, just define all your symbols as little regexes:
lexeme_regexes = [r"\+", "-", r"\*", "/", r"\d+"]
and then assemble a big regex that you can use for "walking" your input string:
regex = re.compile("|".join(lexeme_regexes))
lexemes = regex.findall("10 - 7 * 5 / 2 + 3")
To get to your normalized form, just assemble it again:
normalized = " ".join(lexemes)
This example doesn't ensure that all operators are seamlessly split by whitespace, though; that will need some more effort.
