import re

s = "XX.1.1. Accidents"
pattern = re.compile(r'\d|[a-zA-Z]\.\s([a-zA-Z]\S+)')
match = pattern.search(s)
if match:
    print(match.group(1))
The output is None. However, I think it should have been "Accidents". Can someone tell me why?
Your | is the problem. Since the alternation isn't enclosed in a group, the pattern matches either \d alone or all of [a-zA-Z]\.\s([a-zA-Z]\S+). The search returns the leftmost match, which here is the single digit 1 matched by the \d branch. The capturing group never takes part in that match, so group(1) is None.
If you use (?:\d|[a-zA-Z])\.\s([a-zA-Z]\S+), it'll work properly and you'll receive Accidents.
Alternatively, you can put the whole first part in a character class (square brackets), then match the whitespace, and then capture the word that follows:
pattern = re.compile(r'[a-zA-Z\d.]*\s([a-zA-Z]\S+)')
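To make the difference concrete, here is a small sketch that runs the original pattern and both suggested fixes against the string from the question. The variable names are made up for illustration.

```python
import re

s = "XX.1.1. Accidents"

# Original pattern: the | splits the WHOLE pattern into two branches.
original = re.compile(r'\d|[a-zA-Z]\.\s([a-zA-Z]\S+)')
m = original.search(s)
original_match = m.group(0)   # text matched by the \d branch
original_group = m.group(1)   # the group lives in the other branch

# Fix 1: limit the alternation with a non-capturing group.
grouped = re.compile(r'(?:\d|[a-zA-Z])\.\s([a-zA-Z]\S+)')
grouped_result = grouped.search(s).group(1)

# Fix 2: describe the prefix with a character class instead.
char_class = re.compile(r'[a-zA-Z\d.]*\s([a-zA-Z]\S+)')
char_class_result = char_class.search(s).group(1)

print(repr(original_match), original_group)   # '1' None
print(grouped_result, char_class_result)      # Accidents Accidents
```

Printing group(0) alongside group(1) is a handy way to see which branch of an alternation actually matched.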
I'm developing a calculator program in Python, and need to remove leading zeros from numbers so that calculations work as expected. For example, if the user enters "02+03" into the calculator, the result should be 5. In order to remove these leading zeros in front of digits, I asked a question on here and got the following answer.
self.answer = eval(re.sub(r"((?<=^)|(?<=[^\.\d]))0+(\d+)", r"\1\2", self.equation.get()))
I fully understand how the positive lookbehind to the beginning of the string and lookbehind to the non digit, non period character works. What I'm confused about is where in this regex code can I find the replacement for the matched patterns?
I found this online when researching regex expressions.
result = re.sub(pattern, repl, string, count=0, flags=0)
Where is the "repl" in the regex code above? If possible, could somebody please help to explain what the r"\1\2" is used for in this regex also?
Thanks for your help! :)
The "repl" part of the regex is this component:
r"\1\2"
In the "find" part of the regex, group capturing is taking place (ordinarily indicated by "()" characters around content; a group that opens with "(?:" is non-capturing).
In Python regex, the syntax used to indicate a reference to a positional captured group (sometimes called a "backreference") is a backslash followed by the group number, such as "\1" or "\2" (the number refers to the position of the group in the "find" part of the regex).
So, re.sub returns a string in which each match is replaced specifically by the parts of the input string captured by the numbered groups.
Note: I don't believe the "\1" part of the "repl" is actually required, since group 1 contains only zero-width lookbehinds and so always captures an empty string. I think:
r"\2"
...would work just as well.
Further reading: https://www.regular-expressions.info/brackets.html
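As a rough illustration of the point about backreferences, the sketch below applies the pattern from the question with both replacement strings. The helper name strip_leading_zeros and the sample inputs are assumptions for the example, and eval is left out on purpose.

```python
import re

# Pattern copied from the question.
PATTERN = r"((?<=^)|(?<=[^\.\d]))0+(\d+)"

def strip_leading_zeros(expr, repl):
    """Replace every match of PATTERN in expr with repl (hypothetical helper)."""
    return re.sub(PATTERN, repl, expr)

# Group 1 only wraps lookbehinds, so it always captures an empty string.
with_both = strip_leading_zeros("02+03", r"\1\2")
only_two = strip_leading_zeros("02+03", r"\2")

# Zeros inside a number or after a decimal point are left alone.
mixed = strip_leading_zeros("007*100+0.05", r"\2")

print(with_both, only_two, mixed)   # 2+3 2+3 7*100+0.05
```

Both replacement strings give the same result here, which is consistent with the note that "\1" is not needed.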
Firstly, repl is the text that each match gets replaced with.
To understand \1\2 you need to know what capture grouping is.
Check this video out for basics of Group capturing.
Here, your regex splits every match it finds into groups numbered 1, 2, and so on, because of the parentheses () you have placed in the regex.
In Python, \1, \2 (or \g<1>, \g<2>) can be used to refer to them; some other regex flavors use $1, $2.
In this case: the regex replaces each match (the leading zeros plus the digits after them) with the digits alone, which are caught by group 2.
Note: \1 is not necessary; it works fine without it.
See example:
>>> import re
>>> s = 'awd232frr2cr23'
>>> re.sub(r'\d', ' ', s)
'awd   frr cr  '
Explanation:
Here r'\d' matches any single digit, so each digit is removed and replaced with repl (in this case ' ').
I have this expression
:([^"]*) \(([^"]*)\)
and this text
:chkpf_uid ("{4astr-hn389-918ks}")
:"#cert" ("false")
I'm trying to match it so that on the first line I'll get these groups:
chkpf_uid
{4astr-hn389-918ks}
and on the second, I'll get these:
#cert
false
I want to avoid getting the quotes.
I can't seem to understand why the expression I use won't match these, especially since it does match if I switch the ([^"]*) to a (.*).
With ([^"]*): won't match.
With (.*): does match, but with quotes.
This is using the re module in Python 2.7.
Sidenote: your input may require a specific parser to handle, especially if it may have escape sequences.
Answering the question itself, remember that a regex is processed from left to right sequentially, and the string is processed the same here. A match is returned if the pattern matches a portion/whole string (depending on the method used).
If there are quotation marks in the string and your pattern has nothing that can match those quotes, the match fails and nothing is returned.
A possible solution is to add the quotes as optional subpatterns:
:"?([^"]*)"? \("?([^"]*)"?\)
 ^^       ^^   ^^       ^^
See the regex demo
The parts you need are captured into groups, and the quotes, present or not, are matched outside the groups, so re.findall does not return them.
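A quick sketch of the two patterns side by side, using the two sample lines from the question joined into one string. The names strict and tolerant are just labels for this example.

```python
import re

text = ':chkpf_uid ("{4astr-hn389-918ks}")\n:"#cert" ("false")'

# Pattern from the question: nothing in it is allowed to match a quote.
strict = r':([^"]*) \(([^"]*)\)'

# Pattern from the answer: optional quotes sit OUTSIDE the groups.
tolerant = r':"?([^"]*)"? \("?([^"]*)"?\)'

strict_pairs = re.findall(strict, text)
tolerant_pairs = re.findall(tolerant, text)

print(strict_pairs)     # []
print(tolerant_pairs)   # [('chkpf_uid', '{4astr-hn389-918ks}'), ('#cert', 'false')]
```

The strict pattern fails on both lines because each one has a quote somewhere between the colon and the closing parenthesis that the pattern has no way to consume.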
UPDATED
I want to find a string within a big text
..."img good img two_apple.txt"
I want to extract two_apple.txt from the text, but the name can change to one_apple, three_apple, and so on.
When I try to use lookbehinds, it matches text all the way from the beginning.
You are misusing lookarounds. It looks like you don't even NEED a lookaround:
pattern = r'src="images/(.+?\.png)"'
should work for you. As my comment suggests though, using regex is not recommended for parsing HTML/XML style documents, but you do you.
EDIT - accommodate your edit:
Now that I understand your problem more, I can see why you would want to use a look-around. However, since you are looking for a file name, you know there aren't going to be any spaces in the name, so you can just ensure that your capturing token does not include spaces:
pattern = r'src="img (\w+?\.png)"'
                    ^ ensure there is a space HERE because of how your text is
\w - \w is equivalent to [a-zA-Z0-9_] (any letters, numbers or underscore)
This stops the match from starting at the first 'img ' string that pops up, because the capture group cannot contain any spaces.
By using \w, I am assuming you are only expecting letters, digits and underscores. To include anything else, make your own character class with [any characters you want to capture in here].
" ([^ ]+_apple\.txt)"
Starts with a space, ends with _apple.txt. The middle bit is anything-except-a-space which stops it matching "good img two". Parentheses to capture the bit you care about.
Try it here: https://regex101.com/r/wO7lG3/2
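For what it's worth, here is a minimal sketch of that last pattern in Python, without the surrounding quotes. The function name find_apple_file and the second sample string are invented for the example.

```python
import re

# A space, then a run of non-space characters ending in _apple.txt.
APPLE_FILE = re.compile(r' ([^ ]+_apple\.txt)')

def find_apple_file(text):
    """Return the captured file name, or None if nothing matches."""
    m = APPLE_FILE.search(text)
    return m.group(1) if m else None

first = find_apple_file('..."img good img two_apple.txt"')
second = find_apple_file('..."img good img three_apple.txt"')
missing = find_apple_file('..."img good img pear.txt"')

print(first, second, missing)   # two_apple.txt three_apple.txt None
```

Because [^ ]+ cannot cross a space, the match cannot begin at the earlier "img good" part of the text.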
I am using Python 2.7 and have a question with regards to regular expressions. My string would be something like this...
"SecurityGroup:Pub HDP SG"
"SecurityGroup:Group-Name"
"SecurityGroup:TestName"
My regular expression looks something like below
[^S^e^c^r^i^t^y^G^r^o^u^p^:].*
The above seems to work but I have the feeling it is not very efficient and also if the string has the word "group" in it, that will fail as well...
What I am looking for is the output should find anything after the colon (:). I also thought I can do something like using group 2 as my match... but the problem with that is, if there are spaces in the name then I won't be able to get the correct name.
(SecurityGroup):(\w{1,})
Why not just do
security_string.split(':')[1]
to grab the second part of the string, after the colon?
You could use lookbehind:
pattern = re.compile(r"(?<=SecurityGroup:)(.*)")
matches = re.findall(pattern, your_string)
Breaking it down:
(?<=              # positive lookbehind. Matches things preceded by the following group
  SecurityGroup:  # pattern you want your matches preceded by
)                 # end positive lookbehind
(                 # start matching group
  .*              # any number of characters
)                 # end matching group
When tested on the string "something something SecurityGroup:stuff and stuff" it returns matches = ['stuff and stuff'].
Edit:
As mentioned in a comment, pattern = re.compile(r"SecurityGroup:(.*)") accomplishes the same thing. In this case you are matching the string "SecurityGroup:" followed by anything, but only returning the stuff that follows. This is probably more clear than my original example using lookbehind.
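Here is a short sketch comparing the plain-prefix regex with the split approach on the three sample strings from the question. The function names are made up, and passing a maxsplit of 1 to split is my own addition so that a colon inside the name would not cut it short.

```python
import re

samples = [
    "SecurityGroup:Pub HDP SG",
    "SecurityGroup:Group-Name",
    "SecurityGroup:TestName",
]

def name_via_regex(s):
    """Capture everything after the literal prefix, or None if absent."""
    m = re.search(r"SecurityGroup:(.*)", s)
    return m.group(1) if m else None

def name_via_split(s):
    """Split once on the first colon and keep the right-hand side."""
    return s.split(":", 1)[1]

regex_names = [name_via_regex(s) for s in samples]
split_names = [name_via_split(s) for s in samples]

print(regex_names)   # ['Pub HDP SG', 'Group-Name', 'TestName']
```

Both approaches keep the spaces in "Pub HDP SG", which the (\w{1,}) attempt from the question could not do.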
Maybe this:
([^:"]+[^\s](?="))
Regex live here.
I'm trying to implement Pig Latin with Python.
I want to match strings that begin with a consonant or "qu" (regardless of case) in order to find the first letters, so at first I was doing:
first_letters = re.findall(r"^[^aeiou]+|^[qQ][uU]", "qualification")
It didn't work (it finds only "q"), so I figured that I had to add the q to the first character class:
first_letters = re.findall(r"^[^aeiouq]+|^[qQ][uU]", "qualification")
so that works (it finds "qu" and not only "q") !
But playing around I found myself with this :
first_letters = re.findall(r"{^[^aeiou]+}|{^[qQ][uU]}", "qualification")
that didn't work because it is the same as the first expression I tried I think.
But finally this also worked :
first_letters = re.findall(r"{^[^aeiou]+}|(^[qQ][uU])", "qualification")
and I don't know why.
Can someone tell me why?
You should put qu before [^aeiou], because otherwise the "q" gets consumed by the character class and the qu alternative is never tried. Besides that, [Qq][Uu] is not needed; just provide the case-insensitive flag:
first_letters = re.findall(r"^(qu|[^aeiou]+)", "qualification", re.I)
Given that you're probably going to match the rest of the string as well, this would be more practical:
start, rest = re.findall(r"^(qu|[^aeiou]+)?(.+)", word, re.I)[0]
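Building on that, a minimal Pig Latin sketch might look like the following. The function name pig_latin, the choice to append plain "ay" to vowel-initial words, and the assumption that the input is a single non-empty word are all mine, not part of the question.

```python
import re

# Leading "qu" or a run of consonants (optional), then the rest of the word.
LEADING = re.compile(r"^(qu|[^aeiou]+)?(.+)", re.I)

def pig_latin(word):
    """Move the leading consonant cluster (or "qu") to the end and add "ay".

    Assumes word is a single non-empty word.
    """
    start, rest = LEADING.findall(word)[0]
    return rest + start + "ay"

print(pig_latin("qualification"))   # alificationquay
print(pig_latin("boogie"))          # oogiebay
print(pig_latin("apple"))           # appleay
```

When the optional first group does not take part in the match, re.findall reports it as an empty string, so vowel-initial words pass through unchanged apart from the suffix.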
Reverse the order of the rules:
>>> re.findall(r"^[qQ][uU]|^[^aeiou]+", "qualification")
['qu']
>>> re.findall(r"^[qQ][uU]|^[^aeiou]+", "boogie")
['b']
>>> re.findall(r"^[qQ][uU]|^[^aeiou]+", "blogie")
['bl']
In your first case, the first regex ^[^aeiou]+ matches the q. In the second case, since you've added q to the first part, the regex engine examines the second expression and matches qu.
In your other cases, I don't think the first expression does what you think it does (i.e. the ^ character inside the braces), so it's the second expression which matches again.
The first part of your 3rd and 4th patterns, {^[^aeiou]+}, is trying to match a literal curly brace { followed by a start-of-line, followed by one or more non-vowel characters, followed by a literal closing curly brace }. A start-of-line can never occur directly after a literal {, with or without re.MULTILINE, so that part is technically valid but unable to match any input.
The | runs left to right, and stops at the first success. So, that's why you found only q with the first expression, and qu with the second.
Not sure what your final regex does, particularly with regard to the {} expression. The part after the | will match in qualification, though. Perhaps that's what you are seeing.
You might find the re.I (ignore case) flag useful.
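To tie the explanations together, this sketch runs the four patterns from the question against the same word. The dictionary keys are only labels for this example.

```python
import re

word = "qualification"

results = {
    # Character class first: it consumes "q" and stops at the vowel "u".
    "first": re.findall(r"^[^aeiou]+|^[qQ][uU]", word),
    # "q" excluded from the class, so the second branch gets its turn.
    "second": re.findall(r"^[^aeiouq]+|^[qQ][uU]", word),
    # Braces are literal characters here, so neither branch can match.
    "third": re.findall(r"{^[^aeiou]+}|{^[qQ][uU]}", word),
    # Only the parenthesised branch can match; findall returns its group.
    "fourth": re.findall(r"{^[^aeiou]+}|(^[qQ][uU])", word),
}

for label, found in results.items():
    print(label, found)
# first ['q']
# second ['qu']
# third []
# fourth ['qu']
```

In the fourth pattern the braces contribute nothing; replacing that branch with anything that cannot match would give the same result.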