Find a lowercase letter surrounded by three uppercase letters - python

I have a string with a mix of uppercase and lowercase letters. I want to find every lowercase letter that is surrounded by 3 uppercase letters and extract it from the string.
For instance, given ZZZaZZZ, I want to extract the a.
I have written a script that is able to extract ZZZaZZZ but not the a alone. I know I need to use nested regex expressions to do this, but I cannot wrap my mind around how to implement it. The following is what I have:
import re
if __name__ == "__main__":
    # open the file and read its contents
    eqfile = open("string.txt")
    gibberish = eqfile.read()
    eqfile.close()
    r = re.compile("[A-Z]{3}[a-z][A-Z]{3}")
    print(r.findall(gibberish))
EDIT:
Thanks for the answers guys! I guess I should have been more specific. I need to find the lowercase letter that is surrounded by three uppercase letters that are exactly the same, such as in my example ZZZaZZZ.

You are so close! Read about the group() methods of match objects. For example, if your script ended with
r = re.compile("[A-Z]{3}([a-z])[A-Z]{3}")
# search scans the whole string; match would only try the very start
print(r.search(gibberish).group(1))
then you'd capture the desired character inside the first group.
To address the new constraint of matching repeated letters, you can use backreferences:
r = re.compile(r'([A-Z])\1{2}(?P<middle>[a-z])\1{3}')
m = r.search(gibberish)
if m is not None:
    print(m.group('middle'))
That reads like:
Match a letter A-Z and remember it.
Match two occurrences of the first letter found.
Match your lowercase letter and store it in the group named middle.
Match three more consecutive instances of the first letter found.
If a match was found, print the value of the middle group.
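The steps above can be sketched as follows, using finditer to collect every such letter rather than only the first. The helper name find_middles and the sample string are made up for illustration and are not part of the original question.

```python
import re

# One uppercase letter, two more copies of it, a lowercase letter
# (named "middle"), then three more copies of the same uppercase letter.
PATTERN = re.compile(r'([A-Z])\1{2}(?P<middle>[a-z])\1{3}')


def find_middles(text):
    """Return every lowercase letter flanked by three identical capitals."""
    return [m.group('middle') for m in PATTERN.finditer(text)]


# Hypothetical sample: ZZZaZZZ and QQQbQQQ qualify, ABCdABC does not.
sample = "xxZZZaZZZyyQQQbQQQzzABCdABC"
print(find_middles(sample))
```

Each match consumes its trailing capitals, so in a string like ZZZaZZZbZZZ only the a is reported.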

r = re.compile("(?<=[A-Z]{3})[a-z](?=[A-Z]{3})")
(?<=...) indicates a positive lookbehind and (?=...) is a positive lookahead.
From the re module documentation:
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
(?<=...)
Matches if the current position in the string is preceded by a match for ... that ends at the current position.
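A small sketch of how the two assertions behave with findall, assuming a made-up sample string. Because the neighbours are only checked and not consumed, the result holds just the lowercase letters, and adjacent occurrences that share capitals are both found. Note that this form accepts any three capitals and does not enforce the "exactly the same letter" constraint from the edit.

```python
import re

# Lookbehind and lookahead check the neighbours without consuming them,
# so findall returns only the lowercase letters themselves.
lookaround = re.compile(r'(?<=[A-Z]{3})[a-z](?=[A-Z]{3})')

# Hypothetical sample: d and h share the capitals EFG between them.
sample = "ABCdEFGhIJK and ZZZaZZZ, but not ABcDEF"
result = lookaround.findall(sample)
print(result)
```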

You need to capture the part of the string you are interested in with parentheses, and then access it with the match object's group() method:
r = re.compile("[A-Z]{3}([a-z])[A-Z]{3}")
# search looks anywhere in the string; match only tries position 0
m = r.search(gibberish)
if m:
    print("Match! Middle letter was " + m.group(1))
else:
    print("No match.")

Related

How to restrict a list of strings to a pattern with regex?

I tried to compose a regex pattern and validate multiple strings with it. The pattern looks fine to me according to the regex documentation, but for some reason some invalid strings are not rejected. Can anyone point out my mistake?
Test use case
This is a test case for one input string:
import re
usr_pat = r"^\$\w+_src_username_\w+$"
u_name='$ini_src_username_cdc_char4ec_pits'
m = re.match(usr_pat, u_name, re.M)
if m:
    print("Valid username:", m.group())
else:
    print("ERROR: Invalid user_name:\n", u_name)
I expect this to return an error, because the input string must start with a $ sign, then one string \w+, then _, then src, then _, then username, then _, and then end with exactly one string \w+. This is how I composed my pattern and tried to validate the different input strings, but for some reason they are not parsed correctly. Did I miss something here?
desired output
These are the valid and invalid inputs:
valid:
$ini_src_usrname_ajkc2e
$ini_src_password_ajkc2e
$ini_src_conn_url_ajkc2e
invalid:
$ini_src_usrname_ajkc2e_chan4
$ini_src_password_ajkc2e_tst1
$ini_smi_src_conn_url_ajkc2e_tst2
ini_smi_src_conn_url_ajkc2e_tst2
$ini_src_usrname_ajkc2e_chan4_jpn3
According to the regex documentation, r"^\$\w+_src_username_\w+$" should capture the logic I want, but it does not work for all my test cases. What did I miss? Thanks.
The \w character class also matches underscores and numbers:
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
(https://docs.python.org/3/library/re.html#regular-expression-syntax).
So the final \w+ matches the entirety of cdc_char4ec_pits, underscores included.
I think you are looking for [a-zA-Z0-9] which will not match underscores.
usr_pat = r"^\$[a-zA-Z0-9]+_src_username_[a-zA-Z0-9]+$"
\w+
First, \w matches a single character that is one of:
1- a letter from a to z or from A to Z
OR
2- a digit from 0 to 9
OR
3- an underscore (_)
Second, the plus sign (+) after \w means the previous token is matched between one and unlimited times.
So if my regex pattern is: r"^\$\w+$"
It would match the string: '$ini_src_username_cdc_char4ec_pits'
1- ^\$ matches the dollar sign at the beginning of the string.
2- \w+ first matches the character i of the word ini and, because of the + sign, goes on to match the n and the second i. The underscore after ini is matched as well, because \w matches an underscore and not just a digit or a letter. The word src is matched, then the underscore after it, then the word username, and so on until the whole string has been matched.
You mentioned the word "string". If you mean letters and digits only, such as "bla123", "123455" or "BLAbla", then you can use something like [a-zA-Z0-9]+ instead of \w+.
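To see the difference side by side, here is a minimal sketch that runs the original \w+ pattern and the [a-zA-Z0-9]+ variant over a few names. The candidate strings are made up for illustration (they follow the question's naming style but use username so that they fit the pattern).

```python
import re

# The question's pattern: \w+ also swallows underscores.
loose = re.compile(r"^\$\w+_src_username_\w+$")
# The suggested fix: each token may contain letters and digits only.
strict = re.compile(r"^\$[a-zA-Z0-9]+_src_username_[a-zA-Z0-9]+$")

# Hypothetical test names, not taken from the question.
candidates = [
    "$ini_src_username_ajkc2e",            # one trailing token
    "$ini_src_username_cdc_char4ec_pits",  # extra trailing tokens
    "$ini_smi_src_username_ajkc2e",        # extra leading token
    "ini_src_username_ajkc2e",             # missing the leading $
]

results = {}
for name in candidates:
    results[name] = (bool(loose.match(name)), bool(strict.match(name)))
    print(name, results[name])
```

Only the first name passes the strict pattern, while the loose pattern accepts the first three.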

Regex for validating barcodes

I'm new to regexes. I'm having a hard time understanding everything. I wanted to write a not-so-simple program about e-mails but then decided on barcodes. The barcode is valid if it:
Is surrounded on each side by a "#" followed by one or more "#" (that is, at least "##")
Is at least 6 characters long (not counting the surrounding "#" characters)
Starts with a capital letter
Contains only letters (lower and upper case) and digits
Ends with a capital letter
I tried a couple of things and achieved absolutely nothing. I even watched a detailed explanation of regexes but still can't come up with anything.
Sample input: ##GoodCodE## would be valid but #Invalid_CodE# / ##InvalidTry## would not.
(##+)([A-Z][A-Za-z0-9]{4,}[A-Z])(?:##+)
Thank you for all the help! ?= didn't include the ##+ in the match, so I replaced it with ?:, which apparently does.
You could use the following expression: ^##+[A-Z][A-Za-z0-9]{4,}[A-Z]##+$
Broken down it means:
^ requires beginning of string (line)
# match a # character
#+ match one or more of # characters
[A-Z] match an uppercase letter (counts as 1st of the 6)
[A-Za-z0-9]{4,} match 4 or more upper/lowercase letters and digits
[A-Z] match an uppercase letter (counts as last of the 6)
# match a # character
#+ match one or more of # characters
$ requires end of string (line)
Try this one:
import re
def check_re(inval):
    """
    >>> check_re('##GoodCodE##')
    True
    >>> check_re('#Invalid_CodE#/##InvalidTry##')
    False
    >>> check_re('##IdE##')
    False
    """
    bar_re = re.compile(r'^##+[A-Z]([A-Za-z0-9]){4,}[A-Z]##+$')
    m = re.match(bar_re, inval)
    return m is not None
if __name__ == '__main__':
    import doctest
    doctest.testmod()
See the explanation in Alain's answer.

Matching a number in a file with Python

I have about 15,000 files I need to parse which could contain one or more strings/numbers from a list I have. I need to separate the files with matching strings.
Given a string: 3423423987, it could appear independently as "3423423987", or as "3423423987_1" or "3423423987_1a", "3423423987-1a", but it could also be "2133423423987". However, I only want to detect the matching sequence where it is not a part of another number, only when it has a suffix of some sort.
So 3423423987_1 is acceptable, but 13423423987 is not.
I'm having trouble with regex, haven't used it much to be honest.
Simply speaking, if I simulate this with a list of possible positives and negatives, I should get 7 hits, for the given list. I would like to extract the text till the end of the word, so that I can record that later.
Here's my code:
def check_text_for_string(text_to_parse, string_to_find):
    import re
    matches = []
    pattern = r"%s_?[^0-9,a-z,A-Z]\W"%string_to_find
    return re.findall(pattern, text_to_parse)
if __name__ =="__main__":
    import re
    word_to_match = "3423423987"
    possible_word_list = [
        "3423423987_1 the cake is a lie", #Match
        "3423423987sdgg call me Ishmael", #Not a match
        "3423423987 please sir, can I have some more?", #Match
        "3423423987", #Match
        "3423423987 ", #Match
        "3423423987\t", #Match
        "adsgsdzgxdzg adsgsdag\t3423423987\t", #Match
        "1233423423987", #Not a match
        "A3423423987", #Not a match
        "3423423987-1a\t", #Match
        "3423423987.0", #Not a match
        "342342398743635645" #Not a match
    ]
    print("%d words in sample list."%len(possible_word_list))
    print("Only 7 should match.")
    matches = check_text_for_string("\n".join(possible_word_list), word_to_match)
    print("%d matched."%len(matches))
    print(matches)
But clearly, this is wrong. Could someone help me out here?
It seems you just want to make sure the number is not matched as part of a, say, float number. You then need to use lookarounds, a lookbehind and a lookahead to disallow dots with digits before and after.
(?<!\d\.)(?:\b|_)3423423987(?:\b|_)(?!\.\d)
See the regex demo
To also match the "prefixes" (or, better call them "suffixes" here), you need to add something like \S* (zero or more non-whitespaces) or (?:[_-]\w+)? (an optional sequence of a - or _ followed with 1+ word chars) at the end of the pattern.
Details:
(?<!\d\.) - fail the match if we have a digit and a dot before the current position
(?:\b|_) - either a word boundary or a _ (we need it as _ is a word char)
3423423987 - the search string
(?:\b|_) - ibid
(?!\.\d) - fail the match if a dot + digit is right after the current position.
So, use
pattern = r"(?<!\d\.)(?:\b|_)%s(?:\b|_)(?!\.\d)"%string_to_find
See the Python demo
If there can be floats like Text with .3423423987 float value, you will need to also add another lookbehind (?<!\.) after the first one: (?<!\d\.)(?<!\.)(?:\b|_)3423423987(?:\b|_)(?!\.\d)
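As a rough sketch of the suggestion above, the pattern can be combined with a trailing \S* so that each hit includes its suffix up to the next whitespace. The function name find_ids is hypothetical, and wrapping the search string in re.escape is an extra precaution added here, not part of the original answer. The sample lines are a subset of the question's list.

```python
import re


def find_ids(text, number):
    """Return each standalone occurrence of `number`, with any suffix."""
    # Lookarounds reject float-like contexts; \S* extends the match
    # to the end of the whitespace-delimited token.
    pattern = (r"(?<!\d\.)(?:\b|_)%s(?:\b|_)(?!\.\d)\S*"
               % re.escape(number))
    return re.findall(pattern, text)


lines = [
    "3423423987_1 the cake is a lie",  # match, with suffix _1
    "3423423987sdgg call me Ishmael",  # letters glued on: no match
    "1233423423987",                   # part of a longer number: no match
    "3423423987-1a\t",                 # match, with suffix -1a
    "3423423987.0",                    # looks like a float: no match
    "3423423987",                      # match
]
found = find_ids("\n".join(lines), "3423423987")
print(found)
```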
You can use this pattern:
(?:\b|^)3423423987(?!\.)(?=\b|_|$)
(?:\b|^) asserts that no other word character (digit, letter or underscore) sits directly to the left
(?!\.) asserts the number isn't followed by a dot
(?=\b|_|$) asserts the number is followed by a non word character, an underscore or nothing

Python Regex: password must contain at least one uppercase letter and number

I am doing form validation for a password using Python and Flask. The password needs to contain at least one uppercase letter and at least one number.
My current failed attempt...
re.compile(r'^[A-Z\d]$')
We can use the pattern '\d.*[A-Z]|[A-Z].*\d' to search for entries that have at least one capital letter and one number. Logically speaking there are only two ways that a capital letter and a number can appear in a string. Either the letter comes first and the number after or the number first and the letter after.
The pipe | indicates 'OR', so we will look at each side separately. \d.*[A-Z] matches a number that is followed by a capital letter, [A-Z].*\d matches any capital letter that is followed by a number.
words = ['Password1', 'password2', 'passwordthree', 'P4', 'mypassworD1!!!', '898*(*^$^#%&#abcdef']
for x in words:
    print(re.search(r'\d.*[A-Z]|[A-Z].*\d', x))
#<_sre.SRE_Match object at 0x00000000088146B0>
#None
#None
#<_sre.SRE_Match object at 0x00000000088146B0>
#<_sre.SRE_Match object at 0x00000000088146B0>
#None
Another option is to use a lookahead.
^(?=.*?[A-Z]).*\d
See demo at regex101
The lookahead at the ^ start position checks that an [A-Z] occurs somewhere ahead. If so, .*\d then matches up to a digit.
To match a string that contains at least one digit, use:
.*[0-9].*
A similar regex can be used to check for upper/lower case.
Regular expressions don't have an AND operator, so it's pretty hard to write a regex that matches valid passwords, when validity is defined by something AND something else AND something else.
But, regular expressions do have an OR operator, so just apply DeMorgan's theorem, and write a regex that matches invalid passwords:
anything with no numbers OR anything with no uppercase
So:
^([^0-9]*|[^A-Z]*)$
If anything matches that, then it's an invalid password.
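A minimal sketch of that inverted check, assuming a hypothetical helper named is_valid: a password is accepted exactly when the "invalid" pattern fails to match.

```python
import re

# Matches INVALID passwords: no digit anywhere, or no uppercase anywhere.
invalid = re.compile(r'^([^0-9]*|[^A-Z]*)$')


def is_valid(password):
    """Valid means the invalid-password pattern does not match."""
    return invalid.match(password) is None


# Hypothetical sample passwords.
for candidate in ['Password1', 'password2', 'PASSWORD', 'P4', '']:
    print(repr(candidate), is_valid(candidate))
```

The empty string is rejected too, since it trivially contains neither a digit nor an uppercase letter.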
I think this will work for you:
^(?=.*[A-Z])(?=.*\d).*$
Using two lookahead assertions built on negated character classes:
^(?=[^A-Z]*[A-Z])(?=[^0-9]*[0-9])
The above asserts that, from the current input position (the start of the string), there are 0 or more non-capital characters followed by a capital letter, and also 0 or more non-digits followed by a digit.
import re
tests = [
'x1bcAd',
'xAbcd1d',
'abcde',
'1234',
'AAAA'
]
for test in tests:
    m = re.match(r'^(?=[^A-Z]*[A-Z])(?=[^0-9]*[0-9])', test)
    print(test, 'Passed' if m else 'Failed')
Prints:
x1bcAd Passed
xAbcd1d Passed
abcde Failed
1234 Failed
AAAA Failed

Regex expression for: 4 letter words using 'a' to 'e' in lower case

I can manage to get the four-letter words from a-z using ^[a-z]{4}$, but I am not sure how to require that there is an a and an e in the word. I've tried this, but it only gets the words with ae on the end: ^[a-z]{2}[a][e]$
import re
import sys
import time
pattern = '^[a-z]{4}$[a][e]'
#c = ^[^a][a]{2}
regexp = re.compile(pattern)
inFile = open('words.txt','r')
outFile = open('exercise04.log','w')
for line in inFile:
    match = regexp.search(line)
    if match:
        time.sleep(0.1)
        print(line)
        outFile.write(line)
inFile.close()
outFile.close()
Example output from ^[a-z]{2}[a][e]$
alae
blae
brae
frae
spae
thae
twae
I'm looking for random words such as
akes
aejs
soae
skea
esao
You need to use lookahead to check for a line which contains both a and e
^(?=.*?a)(?=.*?e)[a-z]{4}$
DEMO
Explanation:
^ Start of a line.
(?=.*?a) Positive lookahead asserts that there must be a letter a present in that particular line.
(?=.*?e) Positive lookahead asserts that there must be a letter e present in that particular line. Lookarounds don't consume any characters; they only assert whether a match is possible or not.
[a-z]{4} Exactly four lowercase letters.
$ End of the line anchor.
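The breakdown above can be tried out with a short sketch. The word list is made up for illustration: it mixes the question's desired examples with a few words that should be rejected.

```python
import re

# Two lookaheads require an a and an e somewhere in the line;
# [a-z]{4} between the anchors then requires exactly four lowercase letters.
both_letters = re.compile(r'^(?=.*?a)(?=.*?e)[a-z]{4}$')

# Hypothetical list: the last four should be rejected.
words = ['akes', 'aejs', 'soae', 'skea', 'esao',
         'aaaa', 'eels', 'cakes', 'Akes']
matches = [w for w in words if both_letters.search(w)]
print(matches)
```

When reading lines from a file, the trailing newline should not get in the way, because $ also matches just before a newline at the end of the string.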
If the problem is: "Find words of exactly four letters, in which there exists at least one a, and at least one e, in any order", one (faster than regexp, possibly) way to do this is to propose exactly those three questions.
My Python is, um, all but non-existent, but:
if len(word) == 4 and "a" in word and "e" in word:
seems to be a bit less difficult to understand than a regex.
A few problems with your original regex '[a-z]{4}$[a][e]'
The [a-z] character set has a quantifier of 4 following it, meaning that it will match 4 characters and you were trying to match 2 characters following that.
The '$' precedes the other characters you want to match, and the '$' in a regular expression means the end of the line.
If you merely want to match ae at the end you can simply use [a-z]{2}ae for a string-literal match.
I generally use word boundaries instead of ^ and $ for the beginning and end of the line, as the word may have whitespace preceding it. Combine that with a positive lookahead for 'ae':
\b(?=.*?ae)[a-z]{4}\b
