Python Regex to detect underscore between letters - python

How do I make a regex in python that returns a string with all underscores between lowercase letters?
For example, it should detect and return: 'aa_bb_cc' , 'swd_qq' , 'hello_there_friend'
But it should not return these: 'aA_bb' , 'aa_' , '_ddQ' , 'aa_baa_2cs'
My code is ([a-z]+_[a-z]+)+ , but it returns only one underscore. It should return all underscores separated by lowercase letters.
For example, when I pass the string "aab_cbbbc_vv", it returns only 'aab_cbbbc' instead of 'aab_cbbbc_vv'
Thank you

Your regex is almost correct. If you change it to:
^([a-z]+)(_[a-z]+)+$
It works as expected.
^ - matches the beginning of the string
$ - the end of the string
You need these anchors so that the strings you don't want matched do not produce partial matches.
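As a quick check of the anchored pattern, here is a minimal sketch that runs it against the sample strings from the question. The names wanted, unwanted and matched are only illustrative.

```python
import re

# Anchored pattern from the answer above: the whole string must consist of
# lowercase runs joined by single underscores.
pattern = re.compile(r"^([a-z]+)(_[a-z]+)+$")

wanted = ["aa_bb_cc", "swd_qq", "hello_there_friend"]
unwanted = ["aA_bb", "aa_", "_ddQ", "aa_baa_2cs"]

# Keep only the strings the pattern accepts.
matched = [s for s in wanted + unwanted if pattern.search(s)]
print(matched)  # ['aa_bb_cc', 'swd_qq', 'hello_there_friend']
```

Because of the anchors this only works when each candidate is tested as a separate string, not when searching inside a longer text.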

Try this code:
import re
s = "aa_bb_cc swd_qq hello_there_friend aA_bb aa_ _ddQ aa_baa_2cs"
print(re.findall(r"[a-z][a-z_]+\_[a-z]+",s))
The output should be (note that 'aa_baa' is a partial match taken from 'aa_baa_2cs'):
['aa_bb_cc', 'swd_qq', 'hello_there_friend', 'aa_baa']

The reason that you get only results with 1 underscore for your example data is that ([a-z]+_[a-z]+)+ repeats a match of [a-z]+, then an underscore and then again [a-z]+.
Anchored to the whole string, that can match a_b or a_bc_d (as a_b followed by c_d), but a_b_c only gets a partial match, because every iteration requires at least one a-z character before each _.
You could update your pattern to:
\b[a-z]+(?:_[a-z]+)+\b
Explanation
\b A word boundary
[a-z]+ Match 1+ chars a-z
(?:_[a-z]+)+ Repeat 1+ times matching _ and 1+ chars a-z
\b A word boundary
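A small sketch of how the word-boundary pattern behaves with re.findall on a single line of text. The sample text combines the strings from the question, and the variable names are only illustrative.

```python
import re

# Word-boundary version: no anchors, so it can scan a longer text.
pattern = re.compile(r"\b[a-z]+(?:_[a-z]+)+\b")

text = "aa_bb_cc swd_qq hello_there_friend aA_bb aa_ _ddQ aa_baa_2cs aab_cbbbc_vv"

# The group is non-capturing, so findall returns the whole matches.
found = pattern.findall(text)
print(found)  # ['aa_bb_cc', 'swd_qq', 'hello_there_friend', 'aab_cbbbc_vv']
```

Since _ counts as a word character, \b does not match between a letter and an underscore, which is why 'aa_' and 'aa_baa_2cs' yield no partial matches here.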

Related

Regex to extract first 5 digit+character from last hyphen

I am trying to extract the first 5 characters/digits after the last hyphen.
Here is the example
String -- X008-TGa19-ER751QF7
Output -- X008-TGa19-ER751
String -- X002-KF13-ER782cPU80
Output -- X002-KF13-ER782
My attempt -- I managed to take the element after the last hyphen -- (\w+)[^-.]*$
But how do I take only its first 5 characters and then return the entire value as the output, as shown in the example?
You can optionally repeat a - and 1+ word chars from the start of the string. Then match the last - and match 5 word chars.
^\w+(?:-\w+)*-\w{5}
^ Start of string
\w+ Match 1+ word chars
(?:-\w+)* Optionally repeat - and 1+ word chars
-\w{5} Match - and 5 word chars
import re
regex = r"^\w+(?:-\w+)*-\w{5}"
s = ("X008-TGa19-ER751QF7\n"
"X002-KF13-ER782cPU80")
print(re.findall(regex, s, re.MULTILINE))
Output
['X008-TGa19-ER751', 'X002-KF13-ER782']
Note that \w can also match _.
If there can also be other characters in the string, to get the first 5 digits or characters except _ after the last hyphen, you can match word characters without an underscore using a negated character class [^\W_]
Repeat that 5 times while asserting that there are no more hyphens to the right.
^.*-[^\W_]{5}(?=[^-]*$)
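A minimal sketch of this variant in Python. The first two samples come from the question; the third sample and the helper name first_five_after_last_hyphen are made up here to show a string that contains other characters.

```python
import re

# Greedy .* runs up to a hyphen, then 5 letters/digits (no underscore),
# and the lookahead asserts that no further hyphen follows.
pattern = re.compile(r"^.*-[^\W_]{5}(?=[^-]*$)")

def first_five_after_last_hyphen(value):
    """Return the matched prefix, or None when the pattern does not apply."""
    m = pattern.search(value)
    return m.group(0) if m else None

print(first_five_after_last_hyphen("X008-TGa19-ER751QF7"))   # X008-TGa19-ER751
print(first_five_after_last_hyphen("X002-KF13-ER782cPU80"))  # X002-KF13-ER782
print(first_five_after_last_hyphen("a b.c-XY-12345xyz"))     # a b.c-XY-12345
```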
(\w+-\w+-\w{5}) seems to capture what you're asking for.
Example:
https://regex101.com/r/PcPSim/1
If you are open to a non-regex solution, you can use this, which is based on splitting, slicing and joining the strings:
>>> my_str = "X008-TGa19-ER751QF7"
>>> '-'.join(s[:5] for s in my_str.split('-'))
'X008-TGa19-ER751'
Here I am splitting the string on the hyphen -, slicing each sub-string to at most five characters, and joining the pieces back with str.join() to get the string in your desired format.
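Note that the one-liner truncates every hyphen-separated part, not only the last one. The sketch below compares it with a variant based on str.rpartition that shortens only the final part. That variant is not from the answer, and both function names are hypothetical.

```python
def truncate_every_part(value, width=5):
    """The approach from the answer: cut each '-' separated part to width chars."""
    return "-".join(part[:width] for part in value.split("-"))

def truncate_last_part(value, width=5):
    """Assumed alternative: keep everything before the last '-' untouched."""
    head, sep, tail = value.rpartition("-")
    return head + sep + tail[:width]

print(truncate_every_part("X008-TGa19-ER751QF7"))     # X008-TGa19-ER751
print(truncate_last_part("X008-TGa19-ER751QF7"))      # X008-TGa19-ER751
print(truncate_every_part("LONGPREFIX-AB-12345678"))  # LONGP-AB-12345
print(truncate_last_part("LONGPREFIX-AB-12345678"))   # LONGPREFIX-AB-12345
```

For the strings in the question both give the same result, because the earlier parts are already five characters or shorter.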
^(.*-[^-]{5})[^-]*$
Capture group 1 is what you need
https://regex101.com/r/SYz9i5/1
Explanation
^(.*-[^-]{5})[^-]*$
^ Start of line
( Capture group 1 start
.* Any number of any character
- hyphen
[^-]{5} 5 non-hyphen characters
) Capture group 1 end
[^-]* Any number of non-hyphen characters
$ End of line
Another, simpler one is
^(.*-.{5}).*$
This should be quite straightforward.
It relies on the greedy behaviour of the first .*, which tries to match as much as possible, so the - it stops at will be the last one with at least 5 characters following it.
https://regex101.com/r/CFqgeF/1/
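The two capture-group patterns agree on the sample data but differ when fewer than 5 characters follow the last hyphen. A minimal sketch follows; the names strict, simple and group_one, and the short sample "X008-TGa19-ER7", are made up for illustration.

```python
import re

strict = re.compile(r"^(.*-[^-]{5})[^-]*$")  # 5 chars must follow the LAST hyphen
simple = re.compile(r"^(.*-.{5}).*$")        # last hyphen that has 5 chars after it

def group_one(pattern, value):
    """Return capture group 1, or None if the pattern does not match."""
    m = pattern.match(value)
    return m.group(1) if m else None

print(group_one(strict, "X008-TGa19-ER751QF7"))  # X008-TGa19-ER751
print(group_one(simple, "X008-TGa19-ER751QF7"))  # X008-TGa19-ER751

# Only 3 characters after the last hyphen:
print(group_one(strict, "X008-TGa19-ER7"))  # None
print(group_one(simple, "X008-TGa19-ER7"))  # X008-TGa19
```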

does Regex omit part of a string if it has already been matched?

Python 3.8.2
The task at hand is simple: to match lowercase characters separated by a single underscore. So the pattern could be r"[a-z]+_[a-z]+"
My issue is that I expected re.findall() to pair up all of the following:
"ash_tonic_transit_so_kern_err_looo_"
Instead of pairing all the words around each underscore ('ash_tonic', 'tonic_transit', 'transit_so', etc.) I get three pairs: ['ash_tonic', 'transit_so', 'kern_err']
Does Python's re omit part of the string once a match has been found instead of running the search again?
import re

def match_lower(s):
    patternRegex = re.compile(r'[a-z]+_[a-z]+')
    mo = patternRegex.findall(s)
    return mo

print(match_lower('ash_tonic_transit_so_kern_err_looo_'))
You could use a positive lookahead with a capturing group to get the matches, and start the match asserting what is directly to the left is not a char a-z using a negative lookbehind.
Use re.findall which will return the values from the capturing group.
(?<![a-z])(?=([a-z]+_[a-z]+))
Explanation
(?<![a-z]) Negative lookbehind, assert what is directly to the left is not a char a-z
(?= Positive lookahead, assert what on the right is
([a-z]+_[a-z]+) Capture group 1, match 1+ chars a-z _ 1+ chars a-z
) Close lookahead
import re
regex = r"(?<![a-z])(?=([a-z]+_[a-z]+))"
test_str = "ash_tonic_transit_so_kern_err_looo_"
print(re.findall(regex, test_str))
Output
['ash_tonic', 'tonic_transit', 'transit_so', 'so_kern', 'kern_err', 'err_looo']
This is explicitly mentioned in the documentation of re.findall:
Return all non-overlapping matches of pattern in string, as a list of strings.
For instance, 'ash_tonic' and 'tonic_transit' overlap, so they won't be considered two distinct matches.
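To see the non-overlapping behaviour, it can help to print the span of each match with re.finditer. This is only an illustrative sketch; the variable names are not from the question.

```python
import re

text = "ash_tonic_transit_so_kern_err_looo_"

# Each match consumes its characters; the scan resumes where the match ended.
spans = [(m.group(0), m.span()) for m in re.finditer(r"[a-z]+_[a-z]+", text)]
for word, span in spans:
    print(word, span)
# ash_tonic (0, 9)
# transit_so (10, 20)
# kern_err (21, 29)
```

After 'ash_tonic' ends at index 9, 'tonic' has already been consumed, so the next match can only start at 'transit'.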

Regex that not ending with smaller case

I am creating a regex that matches words of at least 3 characters that do not end with a lowercase letter.
import re
re.findall(r'(\w{3,})(?![a-z])\b','I am tyinG a mixed charAv case VOW')
My output is
['tyinG', 'mixed', 'charAv', 'case', 'VOW']
My expected output is
['tyinG', 'VOW']
I get the proper output when I use re.findall(r'(\w{3,})(?<![a-z])\b','I am tyinG a mixed charAv case VOW')
When I tried it on je.im, my first regex, the one without the <, gave the correct result.
What is the relevance of < here?
The first pattern (\w{3,})(?![a-z])\b does not give you the expected result because the pattern is first matching 3+ word chars and then asserts using a negative lookahead (?! that what is directly on the right is not a lowercase char a-z.
That assertion will always be true here, because any lowercase a-z chars have already been consumed by the greedy \w{3,}, so the next character cannot be one.
The second pattern (\w{3,})(?<![a-z])\b does give you the right result as it first tries to match 3 or more word chars and after that asserts using a negative lookbehind (?<! what is directly to the left is not a lowercase char a-z.
If you want to use a lookaround, you can make the pattern a bit more efficient by making use of a word boundary at the beginning.
At the end of the pattern place the negative lookbehind after the word boundary to first anchor it and then do the assertion.
\b\w{3,}\b(?<![a-z])
Note that you can omit the capturing group if you want the single match only.
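A short sketch comparing the lookahead and lookbehind versions on the sentence from the question. The variable names are only illustrative.

```python
import re

text = "I am tyinG a mixed charAv case VOW"

# Lookahead: checks the character AFTER the word, which is never a-z here.
with_lookahead = re.findall(r"(\w{3,})(?![a-z])\b", text)

# Lookbehind after the word boundary: checks the LAST character of the word.
with_lookbehind = re.findall(r"\b\w{3,}\b(?<![a-z])", text)

print(with_lookahead)   # ['tyinG', 'mixed', 'charAv', 'case', 'VOW']
print(with_lookbehind)  # ['tyinG', 'VOW']
```

Keep in mind that \w also matches digits and _, so a word ending in either of those would pass the lookbehind check as well.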

Python regex: removing all special characters and numbers NOT attached to words

I am trying to remove all special characters and numbers in python, except numbers that are directly attached to words.
I have succeeded in doing this for all special characters and numbers, whether attached to words or not. How can I do it in such a way that numbers attached to words are not matched?
Here's what I did:
import regex as re
string = "win32 backdoor guid:64664646 DNS-lookup h0lla"
re.findall(r'[^\p{P}\p{S}\s\d]+', string.lower())
The output I get is
win backdoor guid DNS lookup h lla
But I want to get:
win32 backdoor guid DNS lookup h0lla
demo: https://regex101.com/r/x4HrGo/1
To match alphanumeric strings or only letter words you may use the following pattern with re:
import re
# ...
re.findall(r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+', text.lower())
Details
(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*- either 1+ letters followed with a digit, or 1+ digits followed with a letter, and then 0+ letters/digits
| - or
[^\W\d_]+ - any 1+ Unicode letters
NOTE It is equivalent to \d*[^\W\d_][^\W_]* pattern posted by PJProudhon, that matches any 1+ alphanumeric character chunks with at least 1 letter in them.
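Applied to the string from the question with the standard re module, the pattern can be sketched like this. The variable names are illustrative only.

```python
import re

# Alphanumeric chunks that mix letters and digits, or plain letter words.
pattern = re.compile(r"(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+")

text = "win32 backdoor guid:64664646 DNS-lookup h0lla"

# The group is non-capturing, so findall returns the whole matches.
tokens = pattern.findall(text.lower())
print(tokens)  # ['win32', 'backdoor', 'guid', 'dns', 'lookup', 'h0lla']
```

The standalone number 64664646 is skipped because neither alternative can match a run of digits without a letter.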
You could give a try to \b\d*[^\W\d_][^\W_]*\b
Decomposition:
\b # word boundary
\d* # zero or more digits
[^\W\d_] # one alphabetic character
[^\W_]* # zero or more alphanumeric characters
\b # word boundary
For beginners:
[^\W] is a typical double-negated construct. It matches any character that is not a non-word character, so on its own it is the same as \w (\W is the negation of \w, which matches any alphanumeric character plus _ - common equivalent [a-zA-Z0-9_]).
It proves useful here for composing:
Any alphanumeric character = [^\W_] matches any character which is not non-[alphanumeric or _] and is not _.
Any alphabetic character = [^\W\d_] matches any character which is not non-[alphanumeric or _] and is not a digit (\d) and is not _.
Edit:
When _ is also considered a word delimiter, just drop the word boundaries (they do not trigger next to _, because _ is a word character) and use \d*[^\W\d_][^\W_]*.
Default greediness of star operator will ensure all relevant characters are actually matched.
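The difference between the two variants shows up on input containing an underscore. A minimal sketch follows; the sample text is made up for illustration.

```python
import re

with_boundaries = re.compile(r"\b\d*[^\W\d_][^\W_]*\b")
without_boundaries = re.compile(r"\d*[^\W\d_][^\W_]*")

text = "win32 guid:64664646 h0lla 01ex foo_bar"

# \b does not fire between a letter and '_', so 'foo_bar' yields nothing here.
print(with_boundaries.findall(text))
# ['win32', 'guid', 'h0lla', '01ex']

# Without boundaries, '_' simply acts as a separator.
print(without_boundaries.findall(text))
# ['win32', 'guid', 'h0lla', '01ex', 'foo', 'bar']
```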
Try this RegEx instead:
([A-Za-z]+(\d)*[A-Za-z]*)
You can expand it from here, for example by swapping the * and + on the first and last sets to capture strings like "win32" and "01ex" equally.

Regex to match a string with 2 capital letters only

I want to write a regex which will match a string only if the string consists of two capital letters.
I tried [A-Z]{2}, [A-Z]{2, 2} and [A-Z][A-Z], but these also match a string such as 'CAS', while I want a match only when the string is exactly two capital letters, like 'CA'.
You could use anchors:
^[A-Z]{2}$
^ matches the beginning of the string, while $ matches its end.
Note in your attempts, you used [A-Z]{2, 2} which should actually be [A-Z]{2,2} (without space) to mean the same thing as the others.
You need to add word boundaries,
\b[A-Z]{2}\b
Explanation:
\b Matches between a word character and a non-word character.
[A-Z]{2} Matches exactly two capital letters.
\b Matches between a word character and a non-word character.
You could try:
\b[A-Z]{2}\b
\b matches a word boundary.
Try:
^[A-Z][A-Z]$
Just added start and end points for the string.
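The anchored and word-boundary answers behave differently when the two capitals sit inside a longer text. A small sketch in Python follows; the candidate strings are made up for illustration.

```python
import re

candidates = ["CA", "CAS", "ca", "C", "in CA today"]

# No anchors: any two consecutive capitals anywhere in the string.
unanchored = [s for s in candidates if re.search(r"[A-Z]{2}", s)]

# Anchors: the whole string must be exactly two capitals.
anchored = [s for s in candidates if re.search(r"^[A-Z]{2}$", s)]

# Word boundaries: a standalone two-capital word anywhere in the string.
bounded = [s for s in candidates if re.search(r"\b[A-Z]{2}\b", s)]

print(unanchored)  # ['CA', 'CAS', 'in CA today']
print(anchored)    # ['CA']
print(bounded)     # ['CA', 'in CA today']
```

In Python, re.fullmatch(r"[A-Z]{2}", s) is another way to require that the entire string matches.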
