Python regex find everything between substring and first space [duplicate] - python

This question already has answers here:
Python non-greedy regexes
(7 answers)
Closed 2 years ago.
I have a string string = 'radios label="Does the command/question above meet the Rules to the left?" name="tq_utt_test" validates="required" gold="true" aggregation="agg"' and I want to extract the value of the name attribute. In this case I want "tq_utt_test", because it is the substring inside name.
I tried re.findall('name=(.*)\s', string), which I thought would capture everything after name= and before the first space. Instead it returned "tq_utt_test" validates="required" gold="true", so it seems to capture everything between name= and the last space rather than the first.
Is there a way to tweak this regex so that it returns everything after name= and before the first space?

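I would just use re.findall(r'name=([^ ]*)\s', string). The original .* is greedy: it runs to the end of the string and backtracks only as far as the last space. [^ ]* cannot cross a space, so the match stops at the first one. Note that the captured value still includes the surrounding double quotes.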
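The difference between the patterns can be checked with a short sketch. The variable names are illustrative and not from the question; the last pattern, which stops at the closing quote instead of the space, is an assumption about what is ultimately wanted (the value without its quotes).

```python
import re

s = ('radios label="Does the command/question above meet the Rules to the left?" '
     'name="tq_utt_test" validates="required" gold="true" aggregation="agg"')

# Greedy: .* grabs as much as possible, then backs off to the LAST whitespace.
greedy = re.findall(r'name=(.*)\s', s)

# Negated class: [^ ]* cannot cross a space, so it stops at the FIRST one.
first_space = re.findall(r'name=([^ ]*)\s', s)

# Non-greedy quantifier: .*? expands only as far as needed to reach whitespace.
lazy = re.findall(r'name=(.*?)\s', s)

# Assumption: if the quotes are unwanted, capture up to the closing quote instead.
unquoted = re.findall(r'name="([^"]*)"', s)

print(greedy)       # ['"tq_utt_test" validates="required" gold="true"']
print(first_space)  # ['"tq_utt_test"']
print(lazy)         # ['"tq_utt_test"']
print(unquoted)     # ['tq_utt_test']
```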

Related

python regex filter out exact string [duplicate]

This question already has answers here:
Regular expression to match a line that doesn't contain a word
(34 answers)
Closed 3 years ago.
I'm trying to write a regex that filters out matches if they contain "plex" in them.
plex-release -> should not match
my-release -> should match
potato -> should match
Been playing with pythex and came up with this one that works partially:
(?![plex])(\w+)[-_](release|version)$
However this also messes with any other values containing the letter "p".
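I'm trying to come up with a regex that excludes only values containing the exact string "plex", with the letters in that order, rather than any single letter from it.
Yes, you can do it using this regex. The negative lookahead (?!plex) is checked before every character, so the pattern matches only lines in which "plex" never appears. In your attempt, [plex] is a character class, so (?![plex]) rejects any single p, l, e or x at that position.
^((?!plex).)*$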
Source : Regular expression to match a line that doesn't contain a word
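As a rough check of the pattern above against the three sample values from the question (the list name and the filtering loop are illustrative, not part of the original answer):

```python
import re

# Each character is consumed only if "plex" does not start at that position.
no_plex = re.compile(r'^((?!plex).)*$')

names = ['plex-release', 'my-release', 'potato']
kept = [n for n in names if no_plex.match(n)]

print(kept)  # ['my-release', 'potato']
```

"potato" is kept even though it contains the letter "p", because the lookahead tests for the whole sequence "plex" rather than for individual letters.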

Why does regex with “|” (or/alternation) match differently when order is switched? [duplicate]

This question already has answers here:
Why doesn't regular expression alternation (A|B) match as per doc?
(3 answers)
Closed 3 years ago.
I want to clarify a doubt about Python regular expressions.
import re
stri="Item3. Super Market ListsItem4"
# 1st print
print(re.sub(r'(Item[0-9]|Item[0-9]\.)', "", stri))
# 2nd print
print(re.sub(r'(Item[0-9]\.|Item[0-9])', "", stri))
In stri, I need to remove "Item3." and "Item4".
output -
'. Super Market Lists'
' Super Market Lists'
My question is this: I used the OR (|) operator with the same two alternatives in both patterns.
The 1st print statement did not remove the dot (.) from the string. In the 2nd print statement I swapped the order of the alternatives, and this time the dot was removed. Why does this happen?
Thank you
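It happens because the regex engine tries the alternatives from left to right and stops at the first one that matches; it does not look for the longest match.
In the first pattern, Item[0-9] already matches "Item3", so the engine never tries the right-hand alternative and the dot is left behind. Putting the longer alternative first, as in the second pattern, removes the dot as well.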
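The ordering effect can be reproduced with the string from the question. The third pattern, with an optional dot, is an alternative sketch that is not in the original answer; it avoids depending on the order of alternatives at all.

```python
import re

stri = "Item3. Super Market ListsItem4"

# Shorter alternative first: "Item3" matches immediately, the dot is never consumed.
short_first = re.sub(r'Item[0-9]|Item[0-9]\.', '', stri)

# Longer alternative first: "Item3." is tried (and matches) before "Item3".
long_first = re.sub(r'Item[0-9]\.|Item[0-9]', '', stri)

# Alternative sketch: one branch with an optional trailing dot.
optional_dot = re.sub(r'Item[0-9]\.?', '', stri)

print(repr(short_first))   # '. Super Market Lists'
print(repr(long_first))    # ' Super Market Lists'
print(repr(optional_dot))  # ' Super Market Lists'
```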

Is there a reason that .str.split() will not work with '$_'? [duplicate]

This question already has answers here:
Split Strings into words with multiple word boundary delimiters
(31 answers)
Closed 3 years ago.
I am trying to split a string using .str.split('$_'), but this is not working.
Other combinations like 'W_' or '$' work fine, but not '$_'. I also tried .str.replace('$'), which also does not work.
Initial string is '$WA:G_COUNTRY'
using
ClearanceUnq['Clearance'].str.split('$_')
results in [$WA:G_COUNTRY]
no split....
whereas
ClearanceUnq['Clearance'].str.split('$')
results in [, WA:G_COUNTRY]
as expected
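This is because it only splits where a $ is immediately followed by a _, and that sequence does not occur in your string. (If the pattern is interpreted as a regular expression instead, $ is an end-of-string anchor, so '$_' can never match at all.) To split on either character, use a character class such as [$_].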
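A minimal sketch with plain Python strings, assuming the goal is to split on either $ or _ (the question does not state this explicitly). The variable names are illustrative. The same character class could presumably be passed to the pandas .str.split accessor, but how that accessor interprets the pattern (literal or regex) depends on the pandas version, so that part is left out here.

```python
import re

value = '$WA:G_COUNTRY'

# Literal two-character separator: "$_" never occurs, so nothing is split.
literal_pair = value.split('$_')

# Literal "$": splits once, leaving an empty string before the leading "$".
literal_dollar = value.split('$')

# Character class: split on "$" OR "_" ("$" is a literal inside [...]).
either_char = re.split(r'[$_]', value)

print(literal_pair)    # ['$WA:G_COUNTRY']
print(literal_dollar)  # ['', 'WA:G_COUNTRY']
print(either_char)     # ['', 'WA:G', 'COUNTRY']
```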

re.findall return separate non-overlapping results [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
What do 'lazy' and 'greedy' mean in the context of regular expressions?
(13 answers)
Closed 4 years ago.
I am new to Python and I am struggling a bit with regular expressions. If I have an input like this:
text = '<tag>xyz</tag>\n<tag>abc</tag>'
Is it possible to get an output list with elements like:
matches = ['<tag>xyz</tag>', '<tag>abc</tag>']
Right now I am using the following regex
matches = re.findall(r"<tag>[\w\W]*</tag>", text)
But instead of a list with two elements I am getting only one element with the whole input string like:
matches = ['<tag>xyz</tag>\n<tag>abc</tag>']
Could someone please guide me?
Thank you.
You just need to make your quantifier non-greedy, so that it stops at the first closing tag instead of the last one.
Change this regex,
<tag>[\w\W]*</tag>
to
<tag>[\w\W]*?</tag>
import re
text = '<tag>xyz</tag>\n<tag>abc</tag>'
matches = re.findall(r"<tag>[\w\W]*?</tag>", text)
print(matches)
Prints,
['<tag>xyz</tag>', '<tag>abc</tag>']

Finding latex class name using python regex [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 4 years ago.
I want to use python regex to find the document class in a latex document.
A latex file contains \documentclass{myclass} somewhere near the top. I want to find myclass using regex.
This is what I've tried so far:
latex_text = "blank /documentclass{myclass} words, more text /documentclassdoc{11} more words"
s=re.search(r'/documentclass{(?P<class_name>.*)}', latex_text)
It matches: myclass} words, more text /documentclassdoc{11
How can I change it to match only myclass? It should also stop searching after it finds a match, as the document can get quite long.
I know the file should only have one documentclass, but I want to handle the case where there is more than 1 as well.
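import re
latex_text = "blank /documentclass{myclass} words, more text /documentclassdoc{11} more words"
# .*? is non-greedy, so it stops at the first closing brace; re.search returns only the first match.
# group(1) is the captured class name; group() would return the whole match, "/documentclass{myclass}".
print(re.search(r'/documentclass\{(.*?)\}', latex_text).group(1))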
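For a real LaTeX file the command starts with a backslash rather than the forward slash used in the test string. The following is a sketch under that assumption; the helper name find_document_class is hypothetical, and the optional [...] part is an added assumption to cover forms such as \documentclass[12pt]{article}. re.search stops at the first match, and the helper returns None when there is no match instead of raising an error.

```python
import re

# \\documentclass  : literal backslash + command name
# (?:\[[^\]]*\])?  : optional [options] block (assumption, not in the question)
# \{([^}]*)\}      : class name, stopping at the first closing brace
DOCCLASS_RE = re.compile(r'\\documentclass(?:\[[^\]]*\])?\{(?P<class_name>[^}]*)\}')

def find_document_class(latex_text):
    """Return the first document class name, or None if there is none."""
    match = DOCCLASS_RE.search(latex_text)
    return match.group('class_name') if match else None

sample = r"blank \documentclass{myclass} words, more text \documentclassdoc{11} more words"
print(find_document_class(sample))  # myclass
```

Using [^}]* instead of .*? gives the same result here and avoids backtracking, which may matter slightly on long documents.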
