I'm new to regexes and am having a hard time understanding them. I wanted to write a not-so-simple program about e-mails but then decided on barcodes. The barcode is valid if it:
Is surrounded with a "#" followed by one or more "#"
Is at least 6 characters long (without the surrounding "#" or "#")
Starts with a capital letter
Contains only letters (lower and upper case) and digits
Ends with a capital letter
I tried a couple of things and achieved absolutely nothing. I even watched a detailed explanation of regexes but still can't come up with anything.
Sample input: ##GoodCodE## would be valid but #Invalid_CodE# / ##InvalidTry## would not.
(##+)([A-Z][A-Za-z0-9]{4,}[A-Z])(?:##+) Thank you for all the help! ?= didn't include the ##+ so I replaced it with ?: which does apparently.
You could use the following expression: ^##+[A-Z][A-Za-z0-9]{4,}[A-Z]##+$
Broken down it means:
^ requires beginning of string (line)
# match a # character
#+ match one or more of # characters
[A-Z] match an uppercase letter (counts as 1st of the 6)
[A-Za-z0-9]{4,} match 4 or more of upper/lowercase letters and digits
[A-Z] match an uppercase letter (counts as last of the 6)
# match a # character
#+ match one or more of # characters
$ requires end of string (line)
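The breakdown above can also be written into the pattern itself with Python's re.VERBOSE flag. This is only a sketch, not part of the original answer; the names BARCODE_RE and is_valid_barcode are made up for illustration. One thing to watch: in verbose mode a bare # starts a comment, so the literal # is written as [#] here, and ##+ is spelled as the equivalent [#]{2,}.

```python
import re

# Same rule as ^##+[A-Z][A-Za-z0-9]{4,}[A-Z]##+$, spelled out piece by piece.
# In re.VERBOSE mode an unescaped "#" begins a comment, so the literal
# character is written inside a character class as [#].
BARCODE_RE = re.compile(
    r"""
    ^                  # beginning of the string
    [#]{2,}            # two or more hash characters (same as ##+)
    [A-Z]              # an uppercase letter (1st of the 6)
    [A-Za-z0-9]{4,}    # four or more letters or digits
    [A-Z]              # an uppercase letter (last of the 6)
    [#]{2,}            # two or more hash characters (same as ##+)
    $                  # end of the string
    """,
    re.VERBOSE,
)


def is_valid_barcode(text):
    """Return True if text satisfies the barcode rules from the question."""
    return BARCODE_RE.match(text) is not None


for sample in ["##GoodCodE##", "#Invalid_CodE#", "##InvalidTry##"]:
    print(sample, is_valid_barcode(sample))
```

Only the first sample is accepted: the second has a single # on each side and contains an underscore, and the third ends with a lowercase letter.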
Try this one:
import re
def check_re(inval):
    """
    >>> check_re('##GoodCodE##')
    True
    >>> check_re('#Invalid_CodE#/##InvalidTry##')
    False
    >>> check_re('##IdE##')
    False
    """
    bar_re = re.compile(r'^##+[A-Z]([A-Za-z0-9]){4,}[A-Z]##+$')
    m = bar_re.match(inval)
    return m is not None
if __name__ == '__main__':
    import doctest
    doctest.testmod()
See the explanation in Alain's answer.
I tried to compose a pattern with regex and use it to validate multiple strings. My pattern seems fine according to the regex documentation, but for some reason some invalid strings are not rejected. Can anyone point out my mistake?
test use case
this is test use case for one input string:
import re
usr_pat = r"^\$\w+_src_username_\w+$"
u_name='$ini_src_username_cdc_char4ec_pits'
m = re.match(usr_pat, u_name, re.M)
if m:
    print("Valid username:", m.group())
else:
    print("ERROR: Invalid user_name:\n", u_name)
I am expecting this to report an error, because the input string must start with a $ sign, followed by one string \w+, then _, then src, then _, then username, then _, and must end with exactly one string \w+. This is how I composed my pattern, and I tried to validate different input strings with it, but for some reason they are not parsed correctly. Did I miss something here?
desired output
this is valid and invalid input:
valid:
$ini_src_usrname_ajkc2e
$ini_src_password_ajkc2e
$ini_src_conn_url_ajkc2e
invalid:
$ini_src_usrname_ajkc2e_chan4
$ini_src_password_ajkc2e_tst1
$ini_smi_src_conn_url_ajkc2e_tst2
ini_smi_src_conn_url_ajkc2e_tst2
$ini_src_usrname_ajkc2e_chan4_jpn3
According to the regex documentation, r"^\$\w+_src_username_\w+$" should capture the logic that I want, but it does not work for all of my test cases. What did I miss here? Thanks.
The \w character class also matches underscores and numbers:
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
(https://docs.python.org/3/library/re.html#regular-expression-syntax).
So the final \w+ matches the entirety of cdc_char4ec_pits, underscores included.
I think you are looking for [a-zA-Z0-9] which will not match underscores.
usr_pat = r"^\$[a-zA-Z0-9]+_src_username_[a-zA-Z0-9]+$"
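A quick way to see the difference is to run both patterns over a few inputs. This is a minimal sketch; the sample strings are hypothetical variations on the one in the question (chosen so that they contain the literal username), and the names LOOSE and STRICT are only for illustration.

```python
import re

# \w also matches "_", so it can swallow several underscore-separated parts.
LOOSE = re.compile(r"^\$\w+_src_username_\w+$")
# [a-zA-Z0-9] excludes "_", so each part is exactly one alphanumeric run.
STRICT = re.compile(r"^\$[a-zA-Z0-9]+_src_username_[a-zA-Z0-9]+$")

samples = [
    "$ini_src_username_ajkc2e",            # one trailing part
    "$ini_src_username_cdc_char4ec_pits",  # three trailing parts
    "$ini_smi_src_username_ajkc2e",        # two leading parts
]

for s in samples:
    # Columns: input, accepted by LOOSE, accepted by STRICT
    print(s, bool(LOOSE.match(s)), bool(STRICT.match(s)))
```

The loose pattern accepts all three strings, while the strict one accepts only the first.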
\w+
First: \w matches a single character that is:
1- a letter from a to z, or from A to Z
OR
2- a digit from 0 to 9
OR
3- an underscore (_)
Second: the plus (+) sign after \w means the previous token is matched between one and unlimited times.
So if my regex pattern is: r"^\$\w+$"
It would match the string: '$ini_src_username_cdc_char4ec_pits'
1- The ^\$ will match the dollar sign at the beginning of the string $
2- \w+ first matches the character i of the word ini and, because of the + sign, continues to match the n and the second i. The underscore after ini is matched as well, because \w matches an underscore, not just a letter or a digit. In the same way the word src, the underscore after it, the word username and everything that follows are matched, so the whole string matches.
You mentioned the word "string", if you mean letters and numbers such as : "bla123", "123455" or "BLAbla", then you can use something like [a-zA-Z0-9]+ instead of \w+.
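The walk-through above can be checked in a few lines. This is just an illustrative sketch using the string from the question; the variable names are arbitrary.

```python
import re

s = "$ini_src_username_cdc_char4ec_pits"

# \w+ keeps consuming letters, digits and underscores, so it reaches the end.
whole = re.match(r"^\$\w+$", s)
print(whole.group())

# Without "_" in the character class, the same text splits into separate runs.
parts = re.findall(r"[a-zA-Z0-9]+", s)
print(parts)
```

The first call returns the entire input, while the second returns the six underscore-separated pieces.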
I have about 15,000 files I need to parse which could contain one or more strings/numbers from a list I have. I need to separate the files with matching strings.
Given a string: 3423423987, it could appear independently as "3423423987", or as "3423423987_1" or "3423423987_1a", "3423423987-1a", but it could also be "2133423423987". However, I only want to detect the matching sequence where it is not a part of another number, only when it has a suffix of some sort.
So 3423423987_1 is acceptable, but 13423423987 is not.
I'm having trouble with regex, haven't used it much to be honest.
Simply speaking, if I simulate this with a list of possible positives and negatives, I should get 7 hits, for the given list. I would like to extract the text till the end of the word, so that I can record that later.
Here's my code:
def check_text_for_string(text_to_parse, string_to_find):
    import re
    matches = []
    pattern = r"%s_?[^0-9,a-z,A-Z]\W"%string_to_find
    return re.findall(pattern, text_to_parse)
if __name__ =="__main__":
    import re
    word_to_match = "3423423987"
    possible_word_list = [
        "3423423987_1 the cake is a lie", #Match
        "3423423987sdgg call me Ishmael", #Not a match
        "3423423987 please sir, can I have some more?", #Match
        "3423423987", #Match
        "3423423987 ", #Match
        "3423423987\t", #Match
        "adsgsdzgxdzg adsgsdag\t3423423987\t", #Match
        "1233423423987", #Not a match
        "A3423423987", #Not a match
        "3423423987-1a\t", #Match
        "3423423987.0", #Not a match
        "342342398743635645" #Not a match
    ]
    print("%d words in sample list."%len(possible_word_list))
    print("Only 7 should match.")
    matches = check_text_for_string("\n".join(possible_word_list), word_to_match)
    print("%d matched."%len(matches))
    print(matches)
But clearly, this is wrong. Could someone help me out here?
It seems you just want to make sure the number is not matched as part of a, say, float number. You then need to use lookarounds, a lookbehind and a lookahead to disallow dots with digits before and after.
(?<!\d\.)(?:\b|_)3423423987(?:\b|_)(?!\.\d)
To also match the "prefixes" (or, better call them "suffixes" here), you need to add something like \S* (zero or more non-whitespaces) or (?:[_-]\w+)? (an optional sequence of a - or _ followed with 1+ word chars) at the end of the pattern.
Details:
(?<!\d\.) - fail the match if we have a digit and a dot before the current position
(?:\b|_) - either a word boundary or a _ (we need it as _ is a word char)
3423423987 - the search string
(?:\b|_) - ibid
(?!\.\d) - fail the match if a dot + digit is right after the current position.
So, use
pattern = r"(?<!\d\.)(?:\b|_)%s(?:\b|_)(?!\.\d)"%string_to_find
If there can be floats like Text with .3423423987 float value, you will need to also add another lookbehind (?<!\.) after the first one: (?<!\d\.)(?<!\.)(?:\b|_)3423423987(?:\b|_)(?!\.\d)
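To check the lookaround pattern against the sample list from the question, something like the following could be used. It is a sketch under a few assumptions: the helper name find_id is made up, the search string is passed through re.escape (not in the original answer, but harmless for digits), and \S* is appended as suggested above so the suffix is returned too.

```python
import re


def find_id(text, string_to_find):
    """Return the matched id plus any non-whitespace suffix, or None."""
    pattern = (
        r"(?<!\d\.)(?:\b|_)%s(?:\b|_)(?!\.\d)\S*" % re.escape(string_to_find)
    )
    m = re.search(pattern, text)
    return m.group() if m else None


samples = [
    "3423423987_1 the cake is a lie",                # match
    "3423423987sdgg call me Ishmael",                # no match
    "3423423987 please sir, can I have some more?",  # match
    "3423423987",                                    # match
    "3423423987 ",                                   # match
    "3423423987\t",                                  # match
    "adsgsdzgxdzg adsgsdag\t3423423987\t",           # match
    "1233423423987",                                 # no match
    "A3423423987",                                   # no match
    "3423423987-1a\t",                               # match
    "3423423987.0",                                  # no match
    "342342398743635645",                            # no match
]

hits = [find_id(line, "3423423987") for line in samples]
matched = [h for h in hits if h is not None]
print("%d matched." % len(matched))
print(matched)
```

Each line is searched on its own here rather than joined with newlines, which keeps it easy to see which input produced which hit.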
You can use this pattern:
(?:\b|^)3423423987(?!\.)(?=\b|_|$)
(?:\b|^) asserts that there are no other numbers to the left
(?!\.) asserts the number isn't followed by a dot
(?=\b|_|$) asserts the number is followed by a non word character, an underscore or nothing
I can manage to get the four-letter words from a-z using ^[a-z]{4}$, but I am not sure how to require that there is an a and an e in the word. I've tried this, but it only gets the words with ae at the end: ^[a-z]{2}[a][e]$
import re
import sys
import time
pattern = '^[a-z]{4}$[a][e]'
#c = ^[^a][a]{2}
regexp = re.compile(pattern)
inFile = open('words.txt','r')
outFile = open('exercise04.log','w')
for line in inFile:
    match = regexp.search(line)
    if match:
        time.sleep(0.1)
        print(line)
        outFile.write(line)
inFile.close()
outFile.close()
Example output from ^[a-z]{2}[a][e]$
alae
blae
brae
frae
spae
thae
twae
I'm looking for random words such as
akes
aejs
soae
skea
esao
You need to use lookahead to check for a line which contains both a and e
^(?=.*?a)(?=.*?e)[a-z]{4}$
Explanation:
^ Start of a line.
(?=.*?a) Positive lookahead asserts that there must be a letter a present in that particular line.
(?=.*?e) Positive lookahead asserts that there must be a letter e present in that particular line. Lookarounds usually won't match any characters but it only asserts whether a match is possible or not.
[a-z]{4} Exactly four lowercase letters.
$ End of the line anchor.
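As a rough illustration of the lookahead pattern in use, the sketch below filters an in-memory list instead of reading words.txt; the list contents and the names FOUR_WITH_A_AND_E and select_words are assumptions made for the example.

```python
import re

# Two lookaheads check for "a" and "e" anywhere in the line; the rest of
# the pattern then requires exactly four lowercase letters.
FOUR_WITH_A_AND_E = re.compile(r"^(?=.*?a)(?=.*?e)[a-z]{4}$")


def select_words(words):
    """Keep only the four-letter lowercase words containing both a and e."""
    return [w for w in words if FOUR_WITH_A_AND_E.search(w.strip())]


candidates = ["akes", "aejs", "soae", "skea", "esao",
              "abcd", "tree", "cakes", "Akes"]
print(select_words(candidates))
```

The last four candidates are rejected: abcd has no e, tree has no a, cakes is five letters long, and Akes contains an uppercase letter.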
If the problem is: "Find words of exactly four letters, in which there exists at least one a, and at least one e, in any order", one (faster than regexp, possibly) way to do this is to propose exactly those three questions.
My Python is, um, all but non-existent, but:
if len(word) == 4 and "a" in word and "e" in word:
seems to be a bit less difficult to understand than a regex.
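Wrapped in a function, the three checks might look like this. It is only a sketch, and the name has_a_and_e is made up. Unlike the regex, it does not verify that every character is a lowercase letter, so something like "a1e2" would also pass.

```python
def has_a_and_e(word):
    """True for four-character words containing at least one a and one e."""
    return len(word) == 4 and "a" in word and "e" in word


words = ["akes", "skea", "abcd", "cakes"]
print([w for w in words if has_a_and_e(w)])
```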
A few problems with your original regex '[a-z]{4}$[a][e]'
The [a-z] character set has a quantifier of 4 following it, meaning that it will match 4 characters and you were trying to match 2 characters following that.
The '$' precedes the other characters you want to match, and the '$' in a regular expression means the end of the line.
If you merely want to match ae at the end you can simply use [a-z]{2}ae for a string-literal match.
I generically use word boundaries instead of ^ and $ for beginning and end of line as the word may have whitespace preceding it. Combine that with a positive lookahead for 'ae':
\b(?=.*?ae)[a-z]{4}\b
I use a piece of code to read a website, scrape some information, pass it to Google and print some directions.
I'm having an issue with some of the information: the site I use sometimes adds a # followed by 3 random numbers, then a / and another 3 numbers, e.g. #037/100
How can I use Python to ignore this "#037/100" string?
I currently use
for i, part in enumerate(list(addr_p)):
    if '#' in part:
        del addr_p[i]
        break
to remove the # if found but I'm not sure how to do it for the random numbers
Any ideas ?
If you find yourself wanting to remove "three digits followed by a forward slash followed by three digits" from a string s, you could do
import re
s = "this is a string #123/234 with other stuff"
t = re.sub(r'#\d{3}\/\d{3}', '', s)
print(t)
Result (two spaces remain where the token was removed):
'this is a string  with other stuff'
Explanation:
# - literal character '#'
\d{3} - exactly three digits
\/ - forward slash (the backslash is harmless but not required in Python, where / has no special meaning in a pattern)
\d{3} - exactly three digits
And the whole thing that matches the above (if it's present) is replaced with '' - i.e. "removed".
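Applied to a list of address parts like the addr_p in the question, the substitution could be sketched as follows. The address data, TAG_RE and clean_parts are all hypothetical; the real shape of addr_p is not shown in the question.

```python
import re

# "#" + three digits + "/" + three digits, as described above.
TAG_RE = re.compile(r"#\d{3}/\d{3}")


def clean_parts(parts):
    """Strip the tag from every part and drop parts that become empty."""
    cleaned = []
    for part in parts:
        # Remove the tag, then collapse the whitespace it leaves behind.
        text = " ".join(TAG_RE.sub("", part).split())
        if text:
            cleaned.append(text)
    return cleaned


addr_p = ["12 Main St", "#037/100", "Unit 4 #123/234 Rear", "Springfield"]
print(clean_parts(addr_p))
```

Building a new list avoids deleting from addr_p while iterating over it, and it handles more than one tagged part.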
import re
re.sub('#[0-9]+\/[0-9]+$', '', addr_p[i])
I'm no wizard with regular expressions, but I'd imagine you could do something like this.
You could even handle '#' in the regexp as well.
If the format is always the same, then you could check if the line starts with a #, then set the string to itself without the first 8 characters.
if part[0:1] == '#':
    part = part[8:]
If the first character is a #, it sets the string to itself from index 8 (just past the 8-character "#037/100" token) to the end.
I'd double your problems and match against a regular expression for this.
import re
regex = re.compile(r'([\w\s]+)#\d+\/\d+([\w\s]+)')
m = regex.match('This is a string with a #123/987 in it')
if m:
    s = m.group(1) + m.group(2)
    print(s)
A more concise way:
import re
s = "this is a string #123/234 with other stuff"
t = re.sub(r'#\S+', '', s)
print(t)
I have a string with a mix of uppercase and lowercase letters. I want to find every lowercase letter that is surrounded by 3 uppercase letters and extract it from the string.
For instance ZZZaZZZ I want to extract the a in the previous string.
I have written a script that is able to extract ZZZaZZZ but not the a alone. I know I need to use nested regex expressions to do this but I can not wrap my mind on how to implement this. The following is what I have:
import re
if __name__ == "__main__":
    # open the file
    eqfile = open("string.txt")
    gibberish = eqfile.read()
    eqfile.close()
    r = re.compile("[A-Z]{3}[a-z][A-Z]{3}")
    print(r.findall(gibberish))
EDIT:
Thanks for the answers guys! I guess I should have been more specific. I need to find the lowercase letter that is surrounded by three uppercase letters that are exactly the same, such as in my example ZZZaZZZ.
You are so close! Read about the .group* methods of MatchObjects. For example, if your script ended with
r = re.compile("[A-Z]{3}([a-z])[A-Z]{3}")
print(r.search(gibberish).group(1))
then you'd capture the desired character inside the first group. (search is used here because match only tries the pattern at the very start of the string.)
To address the new constraint of matching repeated letters, you can use backreferences:
r = re.compile(r'([A-Z])\1{2}(?P<middle>[a-z])\1{3}')
m = r.search(gibberish)
if m is not None:
    print(m.group('middle'))
That reads like:
Match a letter A-Z and remember it.
Match two occurrences of the first letter found.
Match your lowercase letter and store it in the group named middle.
Match three more consecutive instances of the first letter found.
If a match was found, print the value of the middle group.
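To collect every such letter rather than only the first, the same backreference pattern can be combined with finditer. This is a minimal sketch; the sample text and the names SAME_LETTER and middles are invented for the example.

```python
import re

# Group 1 remembers the uppercase letter; \1 forces the same letter again.
SAME_LETTER = re.compile(r"([A-Z])\1{2}(?P<middle>[a-z])\1{3}")


def middles(text):
    """Return each lowercase letter framed by three identical capitals."""
    return [m.group("middle") for m in SAME_LETTER.finditer(text)]


print(middles("xxZZZaZZZyy QQQbQQQ ABCdEFG ZZZcYYY"))
```

ABCdEFG is skipped because its capitals are not identical, and ZZZcYYY because the letters on the two sides differ.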
r = re.compile("(?<=[A-Z]{3})[a-z](?=[A-Z]{3})")
(?<=...) indicates a positive lookbehind and (?=...) is a positive lookahead.
module re
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
(?<=...)
Matches if the current position in the string is preceded by a match for ... that ends at the current position.
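Because lookarounds do not consume the surrounding capitals, the same capitals can frame two neighbouring letters. The sketch below compares this with the capturing-group pattern on a made-up input; the names LOOKAROUND and CONSUMING are only for illustration.

```python
import re

# Lookbehind and lookahead only assert; the match itself is one letter.
LOOKAROUND = re.compile(r"(?<=[A-Z]{3})[a-z](?=[A-Z]{3})")
# The capturing version consumes the capitals on both sides.
CONSUMING = re.compile(r"[A-Z]{3}([a-z])[A-Z]{3}")

text = "ZZZaZZZbZZZ xxAbCxx"

print(LOOKAROUND.findall(text))
print(CONSUMING.findall(text))
```

The lookaround version finds both a and b, while the consuming version finds only a, since the ZZZ between the two letters has already been used up by the first match.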
You need to capture the part of the string you are interested in with parentheses, and then access it with re.MatchObject#group:
r = re.compile("[A-Z]{3}([a-z])[A-Z]{3}")
m = r.search(gibberish)
if m:
    print("Match! Middle letter was " + m.group(1))
else:
    print("No match.")