make str.contains more explicit [duplicate] - python

This question already has answers here:
How to match a whole word with a regular expression?
(4 answers)
Closed 3 years ago.
I'm looking for a cell in a data frame that contains the word "Scan". Unfortunately, there is also a word "Scan-Steuerung" which I would like to ignore.
How can I do this in python?
Is it also possible to get the index of this cell?
Edit: It would be sufficient if I could read these two rows separately. At the moment, I use:
line = df[df["Name:"].str.contains("Scan")]
and when I print the result, I get both rows at once.

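Use regex word boundaries (\b) so that "Scan" is matched only as a whole word. Note that \b also matches between "n" and "-", so \bScan\b alone would still match "Scan-Steuerung"; a negative lookahead (?!-) excludes it.
Ex:
df["Col"].str.contains(r"\bScan\b(?!-)")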
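To see how the word boundary behaves around a hyphen, here is a minimal sketch using only the re module. The list of names is hypothetical sample data standing in for the "Name:" column; the same pattern string can be passed to str.contains.

```python
import re

# Hypothetical cell values standing in for the "Name:" column.
names = ["Scan", "Scan-Steuerung", "Scanner", "Full Scan 2"]

plain = re.compile(r"\bScan\b")         # word boundaries only
strict = re.compile(r"\bScan\b(?!-)")   # additionally reject a following hyphen

# Collect the positions of the entries each pattern matches.
plain_hits = [i for i, name in enumerate(names) if plain.search(name)]
strict_hits = [i for i, name in enumerate(names) if strict.search(name)]

print(plain_hits)   # includes "Scan-Steuerung", because "-" is a non-word character
print(strict_hits)  # the hyphenated entry is excluded
```

In pandas, the boolean mask returned by str.contains can be used the same way: df[mask].index gives the index labels of the matching rows.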


Python regex find everything between substring and first space [duplicate]

This question already has answers here:
Python non-greedy regexes
(7 answers)
Closed 2 years ago.
I have a string string = 'radios label="Does the command/question above meet the Rules to the left?" name="tq_utt_test" validates="required" gold="true" aggregation="agg"' and I want to extract the value of the name attribute. In this case I want to extract "tq_utt_test".
I've tried the regex re.findall('name=(.*)\s', string), which I thought would extract everything after the substring name= and before the first space. But it actually returned "tq_utt_test" validates="required" gold="true". So it seems to return everything between name= and the last space, instead of everything between name= and the first space.
Is there a way to tweak this regex so that it returns everything after name= and before the first space?
The .* is greedy, so it matches as much as possible before the final \s. I would use a negated character class, which cannot run past the first space: re.findall(r'name=([^ ]*)\s', string)
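As a rough comparison, the sketch below runs the greedy pattern from the question next to the negated character class and two other common variants. The last pattern, which drops the surrounding quotes, is an addition for illustration and not part of the original answer.

```python
import re

string = ('radios label="Does the command/question above meet the Rules to the left?" '
          'name="tq_utt_test" validates="required" gold="true" aggregation="agg"')

greedy = re.findall(r'name=(.*)\s', string)        # .* runs up to the last whitespace
lazy = re.findall(r'name=(.*?)\s', string)         # .*? stops at the first whitespace
negated = re.findall(r'name=([^ ]*)\s', string)    # [^ ]* cannot cross a space
unquoted = re.findall(r'name="([^"]*)"', string)   # capture only the text inside the quotes

print(greedy)
print(lazy)
print(negated)
print(unquoted)
```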

Is there a reason that .str.split() will not work with '$_'? [duplicate]

This question already has answers here:
Split Strings into words with multiple word boundary delimiters
(31 answers)
Closed 3 years ago.
I am trying to split a string using .str.split('$_'), but this is not working.
Other separators like 'W_' or '$' work fine, but not '$_'. I also tried .str.replace('$'), which also does not work.
The initial string is '$WA:G_COUNTRY'. Using
ClearanceUnq['Clearance'].str.split('$_')
results in [$WA:G_COUNTRY], with no split, whereas
ClearanceUnq['Clearance'].str.split('$')
results in [, WA:G_COUNTRY], as expected.
This is because split('$_') looks for the exact two-character sequence "$_", that is, a $ immediately followed by a _, and that sequence does not occur in your string.
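A small sketch with plain Python strings shows the difference. The re.split call at the end assumes the goal is to split on either character, which the question does not state explicitly.

```python
import re

s = '$WA:G_COUNTRY'

whole = s.split('$_')            # "$_" never occurs as a unit, so nothing is split
dollar = s.split('$')            # splits at the leading "$"
either = re.split(r'[$_]', s)    # character class: split on "$" or on "_"

print(whole)
print(dollar)
print(either)
```

Inside a character class, $ is a literal character and needs no escaping.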

re.findall return separate non-overlapping results [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
What do 'lazy' and 'greedy' mean in the context of regular expressions?
(13 answers)
Closed 4 years ago.
I am new to Python and I am struggling a bit with regular expressions. If I have an input like this:
text = '<tag>xyz</tag>\n<tag>abc</tag>'
Is it possible to get an output list with elements like:
matches = ['<tag>xyz</tag>', '<tag>abc</tag>']
Right now I am using the following regex
matches = re.findall(r"<tag>[\w\W]*</tag>", text)
But instead of a list with two elements I am getting only one element with the whole input string like:
matches = ['<tag>xyz</tag>\n<tag>abc</tag>']
Could someone please guide me?
Thank you.
You just need to make the quantifier non-greedy (lazy) by adding a ? after the *.
Change this regex,
<tag>[\w\W]*</tag>
to
<tag>[\w\W]*?</tag>
import re
text = '<tag>xyz</tag>\n<tag>abc</tag>'
matches = re.findall(r"<tag>[\w\W]*?</tag>", text)
print(matches)
Prints,
['<tag>xyz</tag>', '<tag>abc</tag>']
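For a side-by-side view, this sketch runs the greedy and lazy patterns on the same input. The third variant, which uses . together with the re.DOTALL flag instead of [\w\W], is an alternative spelling added here for illustration.

```python
import re

text = '<tag>xyz</tag>\n<tag>abc</tag>'

greedy = re.findall(r"<tag>[\w\W]*</tag>", text)          # one match spanning both tags
lazy = re.findall(r"<tag>[\w\W]*?</tag>", text)           # one match per tag
dotall = re.findall(r"<tag>.*?</tag>", text, re.DOTALL)   # . also matches "\n" with DOTALL

print(greedy)
print(lazy)
print(dotall)
```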

Date regex in a sentence [duplicate]

This question already has answers here:
How to match a whole word with a regular expression?
(4 answers)
Closed 4 years ago.
I'm trying to use the date regex from this post:
^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]|(?:Jan|Mar|May|Jul|Aug|Oct|Dec)))\1|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2]|(?:Jan|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)(?:0?2|(?:Feb))\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9]|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep))|(?:1[0-2]|(?:Oct|Nov|Dec)))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
However, I want to find all matches that are also wrapped around white spaces.
For example in this sentence:
I went to Disney World on 11/11/1989 and once more on 12/12/2009
I want to get back:
11/11/1989
12/12/2009
How do I accomplish this? I'm using Python3 regex module if it matters.
If you want to tweak the regex you linked so that it works inside a sentence like that, change all three ^ and all three $ anchors to word boundaries (\b) instead:
\b(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]|(?:Jan|Mar|May|Jul|Aug|Oct|Dec)))\1|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2]|(?:Jan|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})\b|\b(?:29(\/|-|\.)(?:0?2|(?:Feb))\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))\b|\b(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9]|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep))|(?:1[0-2]|(?:Oct|Nov|Dec)))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})\b
https://regex101.com/r/WX5Itv/1
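In Python, one thing to watch for is that this pattern contains capturing groups for the separators, so re.findall returns tuples of those groups rather than the dates. The sketch below uses the pattern with every ^ and $ replaced by \b, and uses finditer with group(0) to get the full matches; the name DATE_RE is just a label chosen here.

```python
import re

# The linked date regex with every ^ and $ anchor replaced by \b,
# split across adjacent raw strings for readability.
DATE_RE = re.compile(
    r"\b(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]|(?:Jan|Mar|May|Jul|Aug|Oct|Dec)))\1"
    r"|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2]|(?:Jan|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\2))"
    r"(?:(?:1[6-9]|[2-9]\d)?\d{2})\b"
    r"|\b(?:29(\/|-|\.)(?:0?2|(?:Feb))\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])"
    r"|(?:(?:16|[2468][048]|[3579][26])00))))\b"
    r"|\b(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9]|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep))"
    r"|(?:1[0-2]|(?:Oct|Nov|Dec)))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})\b"
)

sentence = "I went to Disney World on 11/11/1989 and once more on 12/12/2009"

# group(0) is the whole match; findall would return the separator groups instead.
dates = [m.group(0) for m in DATE_RE.finditer(sentence)]
for date in dates:
    print(date)
```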

Finding latex class name using python regex [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 4 years ago.
I want to use python regex to find the document class in a latex document.
A latex file contains \documentclass{myclass} somewhere near the top. I want to find myclass using regex.
This is what I've tried so far:
latex_text = "blank /documentclass{myclass} words, more text /documentclassdoc{11} more words"
s=re.search(r'/documentclass{(?P<class_name>.*)}', latex_text)
It matches: myclass} words, more text /documentclassdoc{11
How can I change it to only match myclass? It should also stop searching after it finds a match, as the document can get quite long.
I know the file should only have one documentclass, but I want to handle the case where there is more than 1 as well.
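import re
latex_text = "blank /documentclass{myclass} words, more text /documentclassdoc{11} more words"
# .*? is non-greedy, so it stops at the first closing brace;
# group(1) returns only the captured class name, and re.search stops at the first match
print(re.search(r'/documentclass\{(.*?)\}', latex_text).group(1))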
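Real LaTeX files use a backslash rather than a forward slash, and often pass options in square brackets. The following is a minimal sketch under those assumptions; the sample text, the optional-argument handling, and the name DOCCLASS_RE are illustrative and not from the original question.

```python
import re

# Assumed sample text; a raw string keeps the backslashes literal.
latex_text = r"blank \documentclass[12pt]{myclass} words, more text \documentclassdoc{11} more words"

# \\documentclass   literal backslash plus command name
# (?:\[[^\]]*\])?   optional [options] block
# \{([^}]*)\}       class name: everything up to the first closing brace
DOCCLASS_RE = re.compile(r"\\documentclass(?:\[[^\]]*\])?\{(?P<class_name>[^}]*)\}")

match = DOCCLASS_RE.search(latex_text)   # search() stops at the first match
class_name = match.group("class_name") if match else None

# finditer() covers the case of more than one \documentclass in the file.
all_classes = [m.group("class_name") for m in DOCCLASS_RE.finditer(latex_text)]

print(class_name)
print(all_classes)
```

Because the pattern requires { or [ directly after \documentclass, a different command such as \documentclassdoc{11} is not matched.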
