Trouble with a very simple regex - python

I am using python to try to write some simple code that looks through strings with regular expressions and finds things. In this string:
and the next nothing is 44827
I want my regex to return just the numbers.
I have set up my python program like this:
buf = "and the next nothing is 44827"
number = re.search("[0-9]*", buf)
print buf
print number.group()
What number.group() returns is an empty string. However, when the regex is:
number = re.search("[0-9]+", buf)
The full number (44827) is properly extracted. What am I missing here?

The problem is that [0-9]* matches zero or more digits, so it is more than happy to match to a zero-length string.
Meanwhile, [0-9]+ matches one or more digits, so it needs to see at least one number in order to catch.
you might want to use findall and handle the case in which you have multiple numbers per line.

Your first regex matches the empty string before the letter "a", so it stops there. Your second doesn't, so it keeps trying.

It's because the first attempt matches an empty string - you're asking it for "0 or more digits" - so the first match is empty at the beginning of the string. When you ask for "one or more digits", the first match starts at the first '4', and continues from there until the end of the number.

See for yourself.
[0-9]* http://regexr.com?30je4
[0-9]+ http://regexr.com?30je7
Hint :
* matches 0-or-more times
+ matches 1-or-more times
Obviously, the first case has more precedence over the second. And the regex engine has NO problem at all, to not match anything. :-)

Related

Need a specific explanation of part of a regex code

I'm developing a calculator program in Python, and need to remove leading zeros from numbers so that calculations work as expected. For example, if the user enters "02+03" into the calculator, the result should return 5. In order to remove these leading zeroes in-front of digits, I asked a question on here and got the following answer.
self.answer = eval(re.sub(r"((?<=^)|(?<=[^\.\d]))0+(\d+)", r"\1\2", self.equation.get()))
I fully understand how the positive lookbehind to the beginning of the string and lookbehind to the non digit, non period character works. What I'm confused about is where in this regex code can I find the replacement for the matched patterns?
I found this online when researching regex expressions.
result = re.sub(pattern, repl, string, count=0, flags=0)
Where is the "repl" in the regex code above? If possible, could somebody please help to explain what the r"\1\2" is used for in this regex also?
Thanks for your help! :)
The "repl" part of the regex is this component:
r"\1\2"
In the "find" part of the regex, group capturing is taking place (ordinarily indicated by "()" characters around content, although this can be overridden by specific arguments).
In python regex, the syntax used to indicate a reference to a positional captured group (sometimes called a "backreference") is "\n" (where "n" is a digit refering to the position of the group in the "find" part of the regex).
So, this regex is returning a string in which the overall content is being replaced specifically by parts of the input string matched by numbered groups.
Note: I don't believe the "\1" part of the "repl" is actually required. I think:
r"\2"
...would work just as well.
Further reading: https://www.regular-expressions.info/brackets.html
Firstly, repl includes what you are about to replace.
To understand \1\2 you need to know what capture grouping is.
Check this video out for basics of Group capturing.
Here , since your regex splits every match it finds into groups which are 1,2... so on. This is so because of the parenthesis () you have placed in the regex.
$1 , $2 or \1,\2 can be used to refer to them.
In this case: The regex is replacing all numbers after the leading 0 (which is caught by group 2) with itself.
Note: \1 is not necessary. works fine without it.
See example:
>>> import re
>>> s='awd232frr2cr23'
>>> re.sub('\d',' ',s)
'awd frr cr '
>>>
Explanation:
As it is, '\d' is for integer so removes them and replaces with repl (in this case ' ').

How to match digits only after a particular string, stop matching if non-digit is found - Python 27

I have huge string like this dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2
I want to get the number after ludocid, only consecutive numbers.
I have tried this regex (ludocid).*(?=\d+\d+) and many more but no luck.
You can try ludocid=(\d+):
s = "dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2"
import re
re.findall(r"ludocid=(\d+)", s)
# ['15878284988193842600']
You can use this regex:
ludocid\D*(\d+)
RegEx Demo
This will match literal ludocid followed by 0 or more non-digits and then it will match 1 or more digits in captured group #1
Code:
>>> s = 'dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2'
>>> print re.search(r'ludocid\D*(\d+)', s).group(1)
15878284988193842600
It looks like you just threw a bunch of regex bits together... Let's work through that.
First, this is the correct regex: ludocid.(\d+)
(You would want to use it with re.search instead of re.match, by the way. Match requires the regex to match the entire string.)
But let's look at yours and see what went wrong and how we can get to the correct regex.
(ludocid).*(?=\d+\d+)
Imagine a regex as a function. You pass it the right things, and it gives you the appropriate result. When you wrap things in parentheses, you're saying "Find this and give it back to me." You don't need the ludocid given back to you, I'm guessing... so remove those paren.
ludocid.*(?=\d+\d+)
Now you've got a .*. This is dangerous in regular expressions because it literally says "Grab as many of anything as you possibly can!" Often I use the non-greedy version (.*?), but in this case it looks like we're just expecting a single extra character there. If you know the literal character you can use that, but to be safe I'll leave it as ., which says "Grab any one character."
ludocid.(?=\d+\d+)
Now let's go inside the parentheses. You've got \d+\d+, which says "Find a sequence of one or more digits, and then find another sequence of one or more digits." This equates to "Find a sequence of two or more digits." I don't think this is what you wanted (it's not how you described the problem, anyway), so let's reduce that:
ludocid.(?=\d+)
Okay, great. Now... what is (?=...) for? It's called a lookahead assertion. It says "If you find this string, match things in front of it." The example given in the Python 2.7 documentation is:
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
Essentially this means that your regex will never return the digits. Instead, it looks to see if digits exist, and then it returns things from the rest of the regex. Remove the lookahead assertion and we're there:
ludocid.(\d+)
When you use this with re.search, you'll get the group you want:
>>> s = "dsdasdludocid=15878284988193842600#lrd=0x3be04dcc5b5ac513:0xdc5b0011ebb625a8,2"
>>> import re
>>> re.search(r"ludocid.(\d+)", s).group(1)
'15878284988193842600'
To match only the digits that follow, stopping at the first non-numeric char, try a positive look behind:
(?<=ludocid=)(\d+)
So:
re.findall(r"(?<=ludocid=)(\d+)", s)
The positive look behind will look for what you want, and only match if it is preceded by the 'flag' string.
**Note: **You may need to escape that second = sign like this: (?<=ludocid\=)(\d+)

How to make regex that matches a number with commas for every three digits?

I am a beginner in Python and in regular expressions and now I try to deal with one exercise, that sound like that:
How would you write a regex that matches a number with commas for
every three digits? It must match the following:
'42'
'1,234'
'6,368,745'
but not the following:
'12,34,567' (which has only two digits between the commas)
'1234' (which lacks commas)
I thought it would be easy, but I've already spent several hours and still don't have write answer. And even the answer, that was in book with this exercise, doesn't work at all (the pattern in the book is ^\d{1,3}(,\d{3})*$)
Thank you in advance!
The answer in your book seems correct for me. It works on the test cases you have given also.
(^\d{1,3}(,\d{3})*$)
The '^' symbol tells to search for integers at the start of the line. d{1,3} tells that there should be at least one integer but not more than 3 so ;
1234,123
will not work.
(,\d{3})*$
This expression tells that there should be one comma followed by three integers at the end of the line as many as there are.
Maybe the answer you are looking for is this:
(^\d+(,\d{3})*$)
Which matches a number with commas for every three digits without limiting the number being larger than 3 digits long before the comma.
You can go with this (which is a slightly improved version of what the book specifies):
^\d{1,3}(?:,\d{3})*$
Demo on Regex101
I got it to work by putting the stuff between the carrot and the dollar in parentheses like so: re.compile(r'^(\d{1,3}(,\d{3})*)$')
but I find this regex pretty useless, because you can't use it to find these numbers in a document because the string has to begin and end with the exact phrase.
#This program is to validate the regular expression for this scenerio.
#Any properly formattes number (w/Commas) will match.
#Parsing through a document for this regex is beyond my capability at this time.
print('Type a number with commas')
sentence = input()
import re
pattern = re.compile(r'\d{1,3}(,\d{3})*')
matches = pattern.match(sentence)
if matches.group(0) != sentence:
#Checks to see if the input value
#does NOT match the pattern.
print ('Does Not Match the Regular Expression!')
else:
print(matches.group(0)+ ' matches the pattern.')
#If the values match it will state verification.
The Simple answer is :
^\d{1,2}(,\d{3})*$
^\d{1,2} - should start with a number and matches 1 or 2 digits.
(,\d{3})*$ - once ',' is passed it requires 3 digits.
Works for all the scenarios in the book.
test your scenarios on https://pythex.org/
I also went down the rabbit hole trying to write a regex that is a solution to the question in the book. The question in the book does not assume that each line is such a number, that is, there might be multiple such numbers in the same line and there might some kind of quotation marks around the number (similar to the question text). On the other hand, the solution provided in the book makes those assumptions: (^\d{1,3}(,\d{3})*$)
I tried to use the question text as input and ended up with the following pattern, which is way too complicated:
r'''(
(?:(?<=\s)|(?<=[\'"])|(?<=^))
\d{1,3}
(?:,\d{3})*
(?:(?=\s)|(?=[\'"])|(?=$))
)'''
(?:(?<=\s)|(?<=[\'"])|(?<=^)) is a non-capturing group that allows
the number to start after \s characters, ', ", or the start of the text.
(?:,\d{3})* is a non-capturing group to avoid capturing, for example, 123 in 12,123.
(?:(?=\s)|(?=[\'"])|(?=$)) is a non-capturing group that allows
the number to end before \s characters, ', ", or the end of the text (no newline case).
Obviously you could extend the list of allowed characters around the number.

Python Regex: Question mark (?) doesn't match in middle of string

I bumped into the problem while playing around in Python: when I create a random string, let's say "test 1981", the following Python call returns with an empty string.
>>> re.search('\d?', "test 1981").group()
''
I was wondering why this is. I was reading through some other posts, and it seems that it has to do with greedy vs. non-greedy operators. Is it that the '?' checks to see if the first value is a digit, and if it's not, it takes the easier, quicker path and just outputs nothing?
Any clarification would help. Thanks!
Your pattern matches a digit or the empty string. It starts at the first character and tries to match a digit, what it is doing next is trying to match the alternative, means the empty string, voilà a match is found before the first character.
I think you expected it to move on and try to match on the next character, but that is not done, first it tries to match what the quantifier allows on the first position. And that is 0 or one digit.
The use of the optional quantifier makes only sense in combination with a required part, say you want a digit followed by an optional one:
>>> re.search('\d\d?', "test 1981").group()
'19'
Otherwise your pattern is always true.
Regex
\d?
simply means that it should optionally (?) match single digit (\d).
If you use something like this, it will work as you expect (match single digit anywhere in the string):
\d
re.search('\d?', "test 1981").group() greedily matches the first match of the pattern (0 or 1 digits) it can find. In this case that's zero digits. Note that re.search('\d?', "1981 test").group() actually matches the string '1' at the beginning of the string. What you're probably looking for here is re.search('\d+', "test 1981").group(), which finds the whole string 1981 no matter where it is.

Python regular expression to match # followed by 0-7 followed by ##

I would like to intercept string starting with \*#\*
followed by a number between 0 and 7
and ending with: ##
so something like \*#\*0##
but I could not find a regex for this
Assuming you want to allow only one # before and two after, I'd do it like this:
r'^(\#{1}([0-7])\#{2})'
It's important to note that Alex's regex will also match things like
###7######
########1###
which may or may not matter.
My regex above matches a string starting with #[0-7]## and ignores the end of the string. You could tack a $ onto the end if you wanted it to match only if that's the entire line.
The first backreference gives you the entire #<number>## string and the second backreference gives you the number inside the #.
None of the above examples are taking into account the *#*
^\*#\*[0-7]##$
Pass : *#*7##
Fail : *#*22324324##
Fail : *#3232#
The ^ character will match the start of the string, \* will match a single asterisk, the # characters do not need to be escape in this example, and finally the [0-7] will only match a single character between 0 and 7.
r'\#[0-7]\#\#'
The regular expression should be like ^#[0-7]##$
As I understand the question, the simplest regular expression you need is:
rex= re.compile(r'^\*#\*([0-7])##$')
The {1} constructs are redundant.
After doing rex.match (or rex.search, but it's not necessary here), .group(1) of the match object contains the digit given.
EDIT: The whole matched string is always available as match.group(0). If all you need is the complete string, drop any parentheses in the regular expression:
rex= re.compile(r'^\*#\*[0-7]##$')

Categories