I am sure this is simple but I can't see or possibly can't find a solution.
Suppose I have a string, for example something--abcd--something and I want to find abcc in the string. I want to allow for one mismatch, meaning that the output of the below code should be True.
my_string='something--abcd--something'
substring = 'abcc'
if substring in my_string:
    print('True')
else:
    print('False')
I know that the substring is not in my_string, but I want to allow one mismatched character so that the output is True.
How can I achieve that?
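There are certainly finer ways to do it, but one solution is to search with regexes in which one of the characters is replaced by a dot (or by '\w' if the replacement should be a word character, that is, a letter, digit, or underscore). This assumes the target contains no regex metacharacters; otherwise escape each piece with re.escape first.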
We use a generator to lazily generate the regexes by replacing one of the letters each time, then check if any of these regexes match:
import re
def with_one_dot(s):
    for i in range(len(s)):
        yield s[:i] + '.' + s[i+1:]

def match_all_but_one(string, target):
    return any(re.search(fuzzy_target, string) for fuzzy_target in with_one_dot(target))

def find_fuzzy(string, target):
    """Return the start index of the fuzzy match, -1 if not found."""
    for fuzzy_target in with_one_dot(target):
        m = re.search(fuzzy_target, string)
        if m:
            return m.start()
    return -1
my_string = 'something--abcd--something'
print(match_all_but_one(my_string, 'abcc')) # 1 difference
# True
print(find_fuzzy(my_string, 'abcc'))
# 11
print(match_all_but_one(my_string,'abbb')) # 2 differences
# False
print(find_fuzzy(my_string, 'abbb'))
# -1
The with_one_dot(s) generator yields s with one letter replaced by a dot on each iteration:
for reg in with_one_dot('abcd'):
    print(reg)
outputs:
.bcd
a.cd
ab.d
abc.
Each of these strings is used as a regex and tested on my_string. The dot . in a regex means 'match anything', so it allows any symbol instead of the original letter.
any returns True as soon as one of these regexes matches, and False if none does.
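For comparison, here is a minimal regex-free sketch of the same idea: slide a window of the target's length over the string and count differing positions. The function name find_with_mismatches and the max_mismatches parameter are hypothetical, not part of the answer above. Unlike the regex version, it does not care about regex metacharacters in the target, and it always reports the leftmost acceptable position.

```python
def find_with_mismatches(string, target, max_mismatches=1):
    """Return the first index where target matches with at most
    max_mismatches differing characters, or -1 if there is none."""
    n = len(target)
    for start in range(len(string) - n + 1):
        window = string[start:start + n]
        # Count positions where the window and the target disagree.
        mismatches = sum(a != b for a, b in zip(window, target))
        if mismatches <= max_mismatches:
            return start
    return -1

my_string = 'something--abcd--something'
print(find_with_mismatches(my_string, 'abcc'))  # 11 (1 difference)
print(find_with_mismatches(my_string, 'abbb'))  # -1 (2 differences)
```

Raising max_mismatches loosens the match without generating more patterns, which is where the regex approach gets unwieldy.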
Related
I am trying, but struggling, to write an algorithm that checks if a substring exists in a piece of text. The piece of text can include punctuation, but only alphanumeric characters should be considered when searching for the substring.
I would like to return the start and end index of the substring. It is guaranteed that the substring exists in the text. However, these indexes should also account for punctuation that was ignored in the search.
For example, for the text BD ACA;B_ 1E and the substring AB1, the algorithm should return 5 and 11 as the start and end index. (text[5:11] -> A;B_ 1 == AB1 with punctuation removed.)
This is the best I have done so far.
def search(text, sub):
    print(text, sub)
    if not sub:
        return True
    for i, char in enumerate(text):
        if char == sub[0]:
            return search(text[1:], sub[1:])
        else:
            return search(text[1:], sub)
result = search("BD ACA;B_ 1E", "AB1")
print(result)
I am printing out the two strings at the entry point to the function just to see how it progresses.
If the first n characters of the substring are found but the rest are not, the search doesn't restart for the entire substring. This is what I was attempting to do with the for loop, but I can't get it working. As a result, it returns True if all characters of the substring simply exist somewhere, in order, in the text.
I have just tried to initially get the search working successfully, I haven't even tried to track and return the start and end indexes yet.
You can use a regular expression for that. [\W_]* will match any non-alphanumeric character sequence, so you could alternate the letters of the search string with that pattern:
import re
def search(text, sub):
    match = re.search(r"[\W_]*".join(sub), text)
    return match and match.span(0)

text = "BD ACA;B_ 1E"
span = search(text, "AB1")
if span:
    start, end = span
    print(start, end)
    print(text[start:end])
There is a string function isalnum() that can check for alphanumeric chars.
t = 'BD ACA;B_ 1'
s = 'AB1'
def is_in_st(t, s):
    start = t.rfind(s[0])      # -1 if the first character is missing
    end = t.rfind(s[-1]) + 1   # 0 if the last character is missing
    if start == -1 or end == 0:
        return False
    return s == ''.join([c for c in t[start:end] if c.isalnum()])
is_in_st(t, s)
True
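If the start and end indexes are needed as well, one regex-free option is to remember where each alphanumeric character came from. The sketch below assumes str.isalnum() is the right definition of "alphanumeric" for this task; the name search_ignoring_punctuation is made up for illustration.

```python
def search_ignoring_punctuation(text, sub):
    """Return (start, end) so that text[start:end], stripped of
    non-alphanumeric characters, equals sub; None if not found."""
    # Keep (original_index, char) pairs for alphanumeric characters only.
    kept = [(i, c) for i, c in enumerate(text) if c.isalnum()]
    cleaned = ''.join(c for _, c in kept)
    pos = cleaned.find(sub)
    if not sub or pos == -1:
        return None
    start = kept[pos][0]
    end = kept[pos + len(sub) - 1][0] + 1  # one past the last matched character
    return start, end

text = "BD ACA;B_ 1E"
print(search_ignoring_punctuation(text, "AB1"))  # (5, 11)
print(text[5:11])                                # A;B_ 1
```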
I've been able to isolate the list (or string) of characters I want excluded from a user entered string. But I don't see how to then remove all these unwanted characters. After I do this, I think I can try joining the user string so it all becomes one alphabet input like the instructions say.
Instructions:
Remove all non-alpha characters
Write a program that removes all non-alpha characters from the given input.
For example, if the input is:
-Hello, 1 world$!
the output should be:
Helloworld
My code:
userEntered = input()
makeList = userEntered.split()
def split(userEntered):
    return list(userEntered)

if userEntered.isalnum() == False:
    for i in userEntered:
        if i.isalpha() == False:
            #answer = userEntered[slice(userEntered.index(i))]
            reference = split(userEntered)
            excludeThis = i
            print(excludeThis)
When I print excludeThis, I get this as my output:
-
,
1
$
!
So I think I might be on the right track. I need to figure out how to get these characters out of the user input. Any help is appreciated.
Loop over the input string. If the character is alphabetic, add it to the result string.
userEntered = input()
result = ''
for char in userEntered:
    if char.isalpha():
        result += char
print(result)
This can also be done with a regular expression:
import re
userEntered = input()
result = re.sub(r'[^a-z]', '', userEntered, flags=re.I)
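The regexp [^a-z] matches any character that is not in the range a-z. The re.I flag makes it case-insensitive, so uppercase letters are kept too. Every match is replaced with an empty string, which removes it. Unlike isalpha(), this treats accented letters such as é as non-alphabetic.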
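The two approaches agree on the exercise input but can differ on non-ASCII text. A small sketch to show the difference (the helper names keep_alpha_unicode and keep_alpha_ascii are made up for this example):

```python
import re

def keep_alpha_unicode(text):
    # str.isalpha() is True for any Unicode letter, including accented ones.
    return ''.join(ch for ch in text if ch.isalpha())

def keep_alpha_ascii(text):
    # [^a-z] with re.I keeps the ASCII letters (plus a handful of Unicode
    # characters that case-fold to them, unless re.ASCII is also given).
    return re.sub(r'[^a-z]', '', text, flags=re.I)

print(keep_alpha_unicode('-Hello, 1 world$!'))  # Helloworld
print(keep_alpha_ascii('-Hello, 1 world$!'))    # Helloworld
print(keep_alpha_unicode('café 42'))            # café
print(keep_alpha_ascii('café 42'))              # caf
```

Which behaviour is wanted depends on what the exercise means by "alpha"; for plain English input the two are interchangeable.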
There are basically two main parts to this: distinguish alpha from non-alpha, and get a string with only the former. If isalpha() is satisfactory for the former, then that leaves the latter. My understanding is that the solution considered most Pythonic is to join a comprehension. It would look like this:
''.join(char for char in userEntered if char.isalpha())
BTW, there are several places in the code where you are making it more complicated than it needs to be. In Python, you can iterate over strings, so there's no need to convert userEntered to a list. isalnum() checks whether the string is all alphanumeric, so it's rather irrelevant (alphanumeric includes digits). You shouldn't ever compare a boolean to True or False, just use the boolean. So, for instance, if i.isalpha() == False: can be simplified to just if not i.isalpha():.
I want to validate a PAN card number whose first 5 characters are letters, the next 4 are digits, and the last character is a letter again. I can't use isalnum() because I want to check this specific order too, not just verify whether it contains both numbers and letters.
Here is a snippet of my code:
def validate_PAN(pan):
    for i in pan:
        pan.isalpha(pan[0:4])==True:
            return 1
        pan.isdigit(pan[5:9])==True:
            return 1
        pan.isalpha(pan[9])==True:
            return 1
        else:
            return 0
This obviously returns an error since it is wrong. How can I fix this?
Just do string slicing and check
s[:5].isalpha()
pan[0:4] - This checks only the first 4 characters, not 5.
s[m:n] - This slices the string from index m up to, but not including, index n.
Mistake in your code
pan.isalpha(pan[0:4])==True
This gives you the error because isalpha() doesn't accept any arguments and there is no if before it.
You must use - if pan[:5].isalpha():
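Putting the slicing checks together, a complete validator might look like the sketch below. The name validate_pan and the ASCII-only restriction are assumptions, not something the question states; without the isascii() check (Python 3.7+), isalpha() and isdigit() would also accept non-ASCII letters and digits.

```python
def validate_pan(pan):
    """Return True if pan is 5 letters, then 4 digits, then 1 letter."""
    return (
        len(pan) == 10
        and pan.isascii()        # assumption: only ASCII characters are valid
        and pan[:5].isalpha()    # characters 0..4 are letters
        and pan[5:9].isdigit()   # characters 5..8 are digits
        and pan[9].isalpha()     # character 9 is a letter
    )

print(validate_pan('ABCDE1111E'))  # True
print(validate_pan('ABC1111DEF'))  # False
```

The length check comes first so that pan[9] can never raise an IndexError on short input.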
You can use a regular expression for simplicity's sake
import re
PAN_1 = 'ABCDE1111E'
PAN_2 = 'ABC1111DEF'
def is_valid_PAN(PAN_number):
    # fullmatch requires the whole string to fit the pattern, not just a prefix
    return bool(re.fullmatch(r'[a-z]{5}\d{4}[a-z]', PAN_number, re.IGNORECASE))
print(is_valid_PAN(PAN_1)) #True
print(is_valid_PAN(PAN_2)) #False
Regular expressions are a good fit for this.
import re
# Pattern for matching a PAN number
pattern = r'\b[A-Z]{5}[0-9]{4}[A-Z]\b'
# compile the pattern for better performance with repetitive matches
pobject = re.compile(pattern)
pan_number = "AXXMP1234Z"
result = pobject.match(pan_number)
if result:
    print("Matched for PAN:", result.group(0))
else:
    print("Failed")
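One detail worth checking with any PAN regex: re.match anchors only at the start of the string, so trailing characters are accepted unless the pattern ends with $ or re.fullmatch is used. A short sketch of the difference (PAN_RE and candidate are hypothetical names and values):

```python
import re

PAN_RE = re.compile(r'[A-Z]{5}[0-9]{4}[A-Z]')

candidate = 'ABCDE1111EXYZ'  # hypothetical input: a valid PAN plus extra letters

print(bool(PAN_RE.match(candidate)))         # True: only the start is anchored
print(bool(PAN_RE.fullmatch(candidate)))     # False: the whole string must match
print(bool(PAN_RE.fullmatch('ABCDE1111E')))  # True
```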
I would like to match an optional "x" and/or "y" (in either order) followed by "b" in a string, where each of the two characters may appear at most once. For example:
import re

def findpat(texts, pat):
    for t in texts:
        if re.search(pat, t):
            print(re.search(pat, t).group())
        else:
            print(None)
pat = re.compile(r'[xy]*?b')
text = ['xyb', 'xb', 'yb', 'yxb','b', 'xyxb']
findpat(text, pat)
# it prints
# xyb
# xb
# yb
# yxb
# b
# xyxb
For the last one, my desired output is "yxb".
How should I modify my regex? Many thanks
You may use the following approach: match and capture two groups, ([xy]*)(b). Then, once a match is found, check whether every character in the Group 1 value is unique, that is, whether its length equals the number of distinct characters in it. If not, remove characters from the start of the group value until no character is repeated.
Something like:
def findpat(texts, pat):
    for t in texts:
        m = re.search(pat, t)                 # Find a match
        if m:
            res = m.group(1)
            while len(set(res)) < len(res):   # While Group 1 still has a repeated char,
                res = res[1:]                 # truncate from the left
            print("{}{}".format(res, m.group(2)))  # Print the result
        else:
            print(None)                       # Else, no match

pat = re.compile(r'([xy]*)(b)')
text = ['xyb', 'xb', 'yb', 'yxb','b', 'xyxb']
findpat(text, pat)
# prints xyb, xb, yb, yxb, b, yxb (one per line)
You can use this pattern
r'(x?y?|yx)b'
To break down, the interesting part x?y?|yx will match:
empty string
only x
only y
xy
and on the alternative branch, yx
As a piece of advice, when you aren't very comfortable with regex and the number of scenarios is small, you could simply brute-force the pattern. It's ugly, but it makes clear what your cases are:
r'b|xb|yb|xyb|yxb'
Part 2.
For a generic solution that does the same for any set of characters instead of just {x, y}, the following regex style can be used:
r'(?=[^x]*x?[^x]*b)(?=[^y]*y?[^y]*b)(?=[^z]*z?[^z]*b)[xyz]*b'
I'll explain it a bit:
By using lookaheads you advance the regex cursor and for each position, you just "look ahead" and see if what follows respects a certain condition. By using this technique, you may combine several conditions into a single regex.
For a given cursor position, we test that each character from our set appears at most once between that position and the target b character. We do this with the pattern [^x]*x?[^x]* followed by b, which means: match any number of non-x characters, then at most one x, then any number of non-x characters.
Once the test conditions are met, we start advancing the cursor and matching all the characters from our needed set, until we find a b. At this point we are guaranteed that we won't match any duplicates, because we performed our lookahead tests.
Note: I strongly suspect that this has poor performance, because it does backtracking. You should only use it for small test strings.
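Writing one lookahead per character by hand gets tedious, so the pattern can be generated. This is only a sketch: build_unique_pattern is a hypothetical helper, and it assumes single-character entries in chars and a terminator that is not itself in chars.

```python
import re

def build_unique_pattern(chars, terminator):
    """Regex for a run of `chars`, each used at most once, then `terminator`."""
    t = re.escape(terminator)
    # One lookahead per character: at most one occurrence before the terminator.
    lookaheads = ''.join(
        '(?=[^{c}]*{c}?[^{c}]*{t})'.format(c=re.escape(c), t=t) for c in chars
    )
    char_class = '[{}]'.format(''.join(re.escape(c) for c in chars))
    return re.compile(lookaheads + char_class + '*' + t)

pat = build_unique_pattern('xy', 'b')
print(pat.pattern)  # (?=[^x]*x?[^x]*b)(?=[^y]*y?[^y]*b)[xy]*b
for text in ['xyb', 'xb', 'yb', 'yxb', 'b', 'xyxb']:
    print(pat.search(text).group())  # xyb, xb, yb, yxb, b, yxb
```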
Well, the regexp that literally passes your test cases is:
pat = re.compile(r'(x|y|xy|yx)?b$')
where the "$" anchors the match at the end of the string, so only a suffix of the input can match.
However it's a little more tricky to use the regexp mechanism(s) to ensure that only one matching character from the set is used ...
From Wiktor Stribiżew's comment & demo, I got my answer.
pat = re.compile(r'([xy]?)(?:(?!\1)[xy])?b')
Thank you all!
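As a quick sanity check, the alternation pattern from the earlier answer and the backreference pattern above can be run side by side on the question's inputs. The dictionary keys are just labels chosen for this sketch.

```python
import re

texts = ['xyb', 'xb', 'yb', 'yxb', 'b', 'xyxb']
patterns = {
    'alternation': re.compile(r'(x?y?|yx)b'),
    'backreference': re.compile(r'([xy]?)(?:(?!\1)[xy])?b'),
}

# Collect the whole-match text for every input under each pattern.
results = {
    name: [pat.search(t).group() for t in texts]
    for name, pat in patterns.items()
}
for name, found in results.items():
    print(name, found)
# Both lines show ['xyb', 'xb', 'yb', 'yxb', 'b', 'yxb']
```

In the backreference version, (?!\1) stops the second character from repeating the first one, which is what rules out xx and yy.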
I was trying to create a regular expression that allows any number of consonants, or any number of vowels, or a mix in which any number of consonants at the beginning is followed ONLY by any number of vowels; no consonant should be allowed after the vowels. The examples should make this clearer:
The following cases should pass:
TR, EE, TREE, Y, BY.
But the following should not pass the expression :
TROUBLE, OATS, TREES, IVY, TROUBLES, PRIVATE, OATEN, ORRERY.
So generally it can be visualized as : [C] [V]
C - Consonants
V - Vowels
[ ] - where the square brackets denote arbitrary presence of their contents.
This is as far as I got:
import re
def find_m(word):
    if re.match("[^aeiou]*?[aeiou]*?",word):
        print("PASS")
    else:
        print("FAIL")
find_m("tr")
find_m("ee")
find_m("tree")
find_m("y")
find_m("by")
find_m("trouble")
find_m("oats")
find_m("trees")
find_m("ivy")
find_m("aaabbbbaaa")
But it passes for all the cases. I need a correct expression that gives the desired results.
All you need to do is add another anchor, $, at the end of your regex:
if re.match("[^aeiou]*[aeiou]*$",word):
$ anchors the regex at the end of the string, so nothing is allowed after the vowels.
Note
You can drop the non-greedy ? from the regex: once the pattern is anchored with $, lazy and greedy quantifiers accept exactly the same strings.
The regex matches empty strings as well.
A mandatory part of the string can be specified by replacing the * with +. For example, if the input must contain vowels, then the regex must be
if re.match("[^aeiou]*[aeiou]+$",word):
Or you can use a lookahead to ensure that the string is non-empty:
if re.match("^(?=.+)[^aeiou]*[aeiou]*$",word):
Test
$ cat test.py
import re
def find_m(word):
    if re.match("[^aeiou]*[aeiou]*$",word):
        print("PASS")
    else:
        print("FAIL")
find_m("tr")
find_m("ee")
find_m("tree")
find_m("y")
find_m("by")
find_m("trouble")
find_m("oats")
find_m("trees")
find_m("ivy")
find_m("aaabbbbaaa")
$ python test.py
PASS
PASS
PASS
PASS
PASS
FAIL
FAIL
FAIL
FAIL
FAIL
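Note that [^aeiou] also matches digits, punctuation, and uppercase letters, so the pattern above is only a consonant check for lowercase alphabetic input. A stricter sketch, assuming "y" counts as a consonant (as the BY and IVY examples suggest) and that empty input should fail; the names CV_RE and is_consonants_then_vowels are hypothetical:

```python
import re

# Assumption: "y" is a consonant; the class lists every letter except a, e, i, o, u.
CV_RE = re.compile(r'[b-df-hj-np-tv-z]*[aeiou]*')

def is_consonants_then_vowels(word):
    # Lowercase first so "TREE" and "tree" are treated alike; reject empty input.
    return bool(word) and CV_RE.fullmatch(word.lower()) is not None

for word in ['TR', 'EE', 'TREE', 'Y', 'BY']:
    print(word, is_consonants_then_vowels(word))   # all True
for word in ['TROUBLE', 'OATS', 'TREES', 'IVY', 'ORRERY']:
    print(word, is_consonants_then_vowels(word))   # all False
```

re.fullmatch replaces the trailing $ here and also rejects a trailing newline, which $ on its own would let through.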
Fill in the code to check if the text passed contains the vowels a, e and i, with exactly one occurrence of any other character in between.
import re
def check_aei(text):
    # "." matches exactly one character (other than a newline)
    result = re.search(r"a.e.i", text)
    return result is not None
print(check_aei("academia")) # True
print(check_aei("aerial")) # False
print(check_aei("paramedic")) # True