Pattern match with character/numeric pattern in Python [duplicate] - python

I'm trying to write a Python function that follows the pattern below. Essentially a pattern-matching algorithm is required.
isSubstring(pattern, word) -> bool
A) isSubstring("b1tecat", "bytecat") -> True
B) isSubstring("b2ecat", "bytecat") -> True
C) isSubstring("b5cat", "bytecat") -> False
D) isSubstring("b2tecat", "bytecat") -> False
E) isSubstring("bytecat", "bytecat") -> True
F) isSubstring("2", "be") -> True
G) isSubstring("2bbbb", "b") -> False
The code below is a basic solution that works for case (E) above, but it does nothing to account for numbers in the pattern. I have searched LeetCode, HackerRank, GeeksforGeeks, etc., but can't find a decent solution.
def isSubstring(substring, string):
    len_substring = len(substring)
    len_string = len(string)
    for i in range(len_string - len_substring + 1):
        j = 0
        while j < len_substring:
            if string[i+j] != substring[j]:
                break
            j += 1
        if j == len_substring:
            return True
    return False
How can I account for the numbers in the pattern?

You can use an index into the second string, and when the character(s) from the first string are numeric, evaluate that number as an integer and advance the index by that number. I assume that if the left string looks like "a21b", it means the second string should have 21 characters between "a" and "b". To easily identify consecutive digits, I would suggest the regular expression \d+|\D to split the first string into its individual parts:
import re
def isSubstring(substring, string):
    tokens = re.findall(r"\d+|\D", substring)
    for i in range(len(string)):
        for ch in tokens:
            if i >= len(string):
                return False
            elif ch.isdigit():
                i += int(ch)
            elif ch != string[i]:
                break
            else:
                i += 1
        else:
            return True
    return False
It will however be easier to rely on regular expressions themselves, as follows:
Convert the first string to a regular expression itself, and then see if the second string matches it:
import re
def isSubstring(substring, string):
    regex = re.sub(r"(\d+)", r".{\1}", re.escape(substring))
    return re.search(regex, string) is not None
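One way to gain confidence in the regex conversion is to run it against the seven cases from the question. The following is only a minimal sketch: it re-declares the regex version under the hypothetical name `matches_pattern` so that the block is self-contained, and `CASES` and `failing_cases` are illustrative helpers, not part of the original answer.

```python
import re

def matches_pattern(pattern, word):
    # Turn every run of digits N into ".{N}" (exactly N arbitrary characters);
    # re.escape keeps the remaining characters literal.
    regex = re.sub(r"(\d+)", r".{\1}", re.escape(pattern))
    return re.search(regex, word) is not None

# Cases A-G from the question: (pattern, word, expected result).
CASES = [
    ("b1tecat", "bytecat", True),
    ("b2ecat", "bytecat", True),
    ("b5cat", "bytecat", False),
    ("b2tecat", "bytecat", False),
    ("bytecat", "bytecat", True),
    ("2", "be", True),
    ("2bbbb", "b", False),
]

def failing_cases(func):
    # Return the cases for which func disagrees with the expected result.
    return [case for case in CASES if func(case[0], case[1]) != case[2]]

print(failing_cases(matches_pattern))
```

An empty list means every case from the question behaves as specified. Note that `.` does not match a newline by default, so this sketch assumes single-line words.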

Building on @trincot's answer here, you can build a regex that matches what you are searching for.
For example, "b2tecat" is really just r"b..tecat"; feed that to re.search and it will tell you whether the pattern occurs in your string.
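import re
def isSubstring(substring, string):
    regex = ""
    for ch in re.findall(r"\d+|\D", substring):
        if ch.isdigit():
            # a number N stands for N arbitrary characters
            regex += "." * int(ch)
        else:
            # escape literals so that e.g. "." or "*" match themselves
            regex += re.escape(ch)
    # search the unmodified string; only the pattern needs escaping
    return re.search(regex, string) is not None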
Side note: there are more "classical" ways to solve this problem, known as pattern matching with "don't cares", for example using the FFT.

Related

ValueError when validating data

I am trying to write a function that returns True when the value of hcl meets the required specification (given in the multi-line comment inside the function). The first thing I wanted to check was whether the length of the value is correct (it should be # plus 6 other characters); once that worked, I would check whether all the characters are in a-f or 0-9. That was my idea for solving the problem, but unfortunately I get a
ValueError: substring not found
when the second element of the list is passed to the function, and I don't understand why. (As always, you have some reasoning, and when there is a mistake in it you can't find it, because to you everything 'should work'.)
def check_hcl(line):
    '''
    a # followed by exactly six characters 0-9 or a-f.
    '''
    print(line[line.index(':')+1], len(line[line.index(':')+2:]))
    if line[line.index(':')+1] != '#' or len(line[line.index(':')+2:]) != 6:
        return False
    else:
        return True
list = ['hcl:#866857','#52a9af','#cfa07d','7d3b0c','#cc0362','#a9784']
#false #false
for i in list:
    print(check_hcl(i))
You can use the match() method from the built-in re module:
import re
def check_hcl(line):
    if re.match("(.*?)#[a-f0-9]{6}", line):
        return True
    return False
list = ['hcl:#866857','#52a9af','#cfa07d','7d3b0c','#cc0362','#a9784']
for i in list:
    print(check_hcl(i))
Output:
True
True
True
False
True
False
Explanation:
The pattern (.*?)#[a-f0-9]{6} can be broken down to 3 parts:
(.*?) matches anything of any length, including substrings of length 0.
# matches a '#'.
[a-f0-9]{6} matches a substring of characters a to f and numbers 0 to 9 of length 6.
Credits to @Ian.
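Two details may be worth spelling out. The original ValueError comes from line.index(':'), which raises when the string contains no ':' (as in '#52a9af'). Also, re.match only anchors the start of the string, so the pattern above accepts trailing characters such as a seventh hex digit. If "exactly six" is meant strictly, something like the sketch below might be closer; the optional "hcl:" prefix and the names `HCL_RE` and `check_hcl_strict` are assumptions made for illustration.

```python
import re

# Assumption: a value is either "#xxxxxx" or "hcl:#xxxxxx", nothing else.
HCL_RE = re.compile(r"(?:hcl:)?#[0-9a-f]{6}")

def check_hcl_strict(line):
    # fullmatch requires the whole string to match, so extra characters
    # before or after the colour code are rejected.
    return HCL_RE.fullmatch(line) is not None

values = ['hcl:#866857', '#52a9af', '#cfa07d', '7d3b0c', '#cc0362', '#a9784']
for value in values:
    print(value, check_hcl_strict(value))
```

For the six sample values this gives the same answers as the re.match version; the two only differ on inputs such as '#8668570'.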

Upper case between a char python

I need help: how do I uppercase the text between two stars, like this?
INPUT: "S*t*a*r*s are every*where*"
OUTPUT: "STaRs are everyWHERE"
My code is here:
def trans(s):
    x = ""
    a = False
    for j in range(len(s)):
        if s[j] == "*" or a:
            a = True
        if a:
            x += s[j].upper()
        else:
            x += s[j]
    return "".join(x.replace("*",""))
The problem is I don't know where in the loop to set a back to False. Right now it just sees a * and makes everything after it uppercase.
Note: The other answers do a fine job of showing you how to fix your code. Here's another way to do it, which you may find easier once you learn regular expressions.
You could use the re.sub function.
>>> import re
>>> s = "S*t*a*r*s are every*where*"
>>> re.sub(r'\*([^*]*)\*', lambda m: m.group(1).upper(), s)
'STaRs are everyWHERE'
In a regex, * is a special metacharacter that repeats the previous token zero or more times. In order to match a literal *, you need to use \* in the regex.
So the regex \*([^*]*)\* matches every pair of * blocks, i.e. (*t*, *r*, *where*), and the characters in between (those inside the * block) are captured by group 1.
For every match, re.sub replaces the matched *..* block with string-inside-*.upper(), i.e. it applies upper() to the string inside the stars and returns the result as the replacement string.
You need to toggle your state; each time you find a * you invert the state of your function so that you can switch between uppercasing and lowercasing as you traverse the text.
You can most easily do this with not; not a would return True if it was False and vice-versa:
def trans(s):
    x = ""
    a = False
    for j in range(len(s)):
        if s[j] == "*":
            a = not a # change state; false to true and true to false
            continue # no need to add the star to the output
        if a:
            x += s[j].upper()
        else:
            x += s[j]
    return x
Each time you find a * character, a is toggled; by using continue at that time you also prevent the * character being added to the output, so the replace() can be avoided altogether. The ''.join() call on a string produces just the same string again, it is not needed in this case.
You don't need a range() here, you could just loop over s directly. You could use better names too:
def trans(string):
    result = ""
    do_upper = False
    for character in string:
        if character == "*":
            do_upper = not do_upper # change state; false to true and true to false
            continue # no need to add the star to the output
        result += character.upper() if do_upper else character
    return result
Demo:
>>> def trans(string):
...     result = ""
...     do_upper = False
...     for character in string:
...         if character == "*":
...             do_upper = not do_upper # change state; false to true and true to false
...             continue # no need to add the star to the output
...         result += character.upper() if do_upper else character
...     return result
...
>>> trans('S*t*a*r*s are every*where*')
'STaRs are everyWHERE'
Think of it like this. Whenever you see a *, you need to alternate between upper and original cases. So, implement the same in the code, like this
def trans(s):
    x, flag = "", False
    # You can iterate the string object with `for`
    for char in s:
        # The current character is a `*`
        if char == "*":
            # flip the flag everytime you see a `*`.
            flag = not flag
            # Skip further processing, as the current character is `*`
            continue
        if flag:
            # If the flag is Truthy, we need to uppercase the string
            x += char.upper()
        else:
            # Otherwise add the character as it is to the result.
            x += char
    # no need to `join` and `replace`, as we already skipped `*`. So just return.
    return x
a should toggle between True and False, but you only ever set it to True. Also, iterate directly over the characters of the string instead of over an index, and use more descriptive variable names. The join is unnecessary, and the replace is not needed if you skip the '*' right away:
def trans(text):
    result = ""
    upper = False
    for char in text:
        if char == "*":
            upper = not upper
        elif upper:
            result += char.upper()
        else:
            result += char
    return result
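The toggling idea used in the answers above can also be expressed without an explicit flag: after splitting on "*", the pieces at odd positions are exactly the ones that sit between an opening and a closing star. This is only a sketch of that alternative, and the name `trans_split` is made up for illustration.

```python
def trans_split(text):
    # Splitting on "*" yields alternating outside/inside pieces:
    # even indexes are outside the stars, odd indexes are inside.
    parts = text.split("*")
    return "".join(
        part.upper() if index % 2 else part
        for index, part in enumerate(parts)
    )

print(trans_split("S*t*a*r*s are every*where*"))
```

As with the flag-based loops, an unmatched trailing star simply uppercases the rest of the string.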

Emulate Python str.find(substring) using iteration but not built-in functions

How can I find the position of a substring in a string without using str.find() in Python? How should I loop it?
def find substring(string,substring):
    for i in xrange(len(string)):
        if string[i]==substring[0]:
            print i
        else: print false
For example, when string = "ATACGTG" and substring = "ACGT", it should return 2. I want to understand how str.find() works
You can use Boyer-Moore or Knuth-Morris-Pratt. Both create tables to precalculate faster moves on each miss. The B-M page has a python implementation. And both pages refer to other string-searching algorithms.
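To make the "precalculated table" idea concrete, here is a minimal sketch of Knuth-Morris-Pratt in Python 3. It is not how CPython's str.find is implemented; the function name `kmp_find` and the choice to return 0 for an empty pattern (mirroring str.find) are assumptions of this example.

```python
def kmp_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if pattern == "":
        return 0  # assumption: behave like str.find for an empty pattern

    # failure[i] is the length of the longest proper prefix of
    # pattern[:i + 1] that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    # Scan the text once; on a mismatch, fall back using the table
    # instead of restarting the comparison from scratch.
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

print(kmp_find("ATACGTG", "ACGT"))
```

The table lets the search run in time proportional to len(text) + len(pattern), whereas the naive sliding comparison can take time proportional to their product in the worst case.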
I can't think of a way to do it without any built-in functions at all.
I can:
def find_substring(string, substring):
    def starts_with(string, substring):
        while True:
            if substring == '':
                return True
            if string == '' or string[0] != substring[0]:
                return False
            string, substring = string[1:], substring[1:]
    n = 0
    while string != '' and substring != '':
        if starts_with(string, substring):
            return n
        string = string[1:]
        n += 1
    return -1
print(find_substring('ATACGTG', 'ACGT'))
I.e. avoiding built-ins len(), range(), etc. By not using built-in len() we lose some efficiency in that we could have finished sooner. The OP specified iteration, which the above uses, but the recursive variant is a bit more compact:
def find_substring(string, substring, n=0):
    def starts_with(string, substring):
        if substring == '':
            return True
        if string == '' or string[0] != substring[0]:
            return False
        return starts_with(string[1:], substring[1:])
    if string == '' or substring == '':
        return -1
    if starts_with(string, substring):
        return n
    return find_substring(string[1:], substring, n + 1)
print(find_substring('ATACGTG', 'ACGT'))
Under the constraint of not using find, you can use str.index instead, which raises a ValueError if the substring is not found:
def find_substring(a_string, substring):
    try:
        print(a_string.index(substring))
    except ValueError:
        print('Not Found')
and usage:
>>> find_substring('foo bar baz', 'bar')
4
>>> find_substring('foo bar baz', 'quux')
Not Found
If you must loop, you can do this, which slides along the string and, on a matching first character, checks whether the rest of the string starts with the substring:
def find_substring(a_string, substring):
    for i, c in enumerate(a_string):
        if c == substring[0] and a_string[i:].startswith(substring):
            print(i)
            return
    else:
        print(False)
To do it with no string methods:
def find_substring(a_string, substring):
    for i in range(len(a_string)):
        if a_string[i] == substring[0] and a_string[i:i+len(substring)] == substring:
            print(i)
            return
    else:
        print(False)
I can't think of a way to do it without any built-in functions at all.

Search for a pattern in a string in python

Question: I am very new to python so please bear with me. This is a homework assignment that I need some help with.
So, for the matchPat function, I need to write a function that will take two arguments, str1 and str2, and return a Boolean indicating whether str1 is in str2. But I have to use an asterisk as a wild card in str1. The * can only be used in str1 and it will represent one or more characters that I need to ignore. Examples of matchPat are as follow:
matchPat ( 'a*t*r', 'anteaters' ) : True
matchPat ( 'a*t*r', 'albatross' ) : True
matchPat ( 'a*t*r', 'artist' ) : False
My current matchPat function can tell whether the characters of str1 are in str2 but I don't really know how I could tell python (by using the * as a wild card) to look for 'a' (the first letter) and after it finds a, skip the next 0 or more characters until it finds the next letter(which would be 't' in the example) and so on.
def matchPat(str1,str2):
    ## str(*)==str(=>1)
    if str1=='':
        return True
    elif str2=='':
        return False
    elif str1[0]==str2[0]:
        return matchPat(str1[2],str2[len(str1)-1])
    else: return True
Python strings have the in operator; you can check if str1 is a substring of str2 using str1 in str2.
You can split a string into a list of substrings based on a token. "a*b*c".split("*") is ["a","b","c"].
You can find the offset of next occurrence of a substring in a string using the string's find method.
So the problem of wildcard matching becomes:
split the pattern into the parts that were separated by asterisks
for each part of the pattern
can we find this after the previous part's location?
You are going to have to cope with corner cases like patterns that start or end with an asterisk, or that have two asterisks beside each other, and so on. Good luck!
There is a find() method of strings that searches for a substring from a particular point, returning either its index (if found) or -1 if not found. The index() method is similar but raises an exception if the target string is not found.
I'd suggest that you first split the pattern string on "*". This will give you a list of chunks to look for. Set the starting position to zero, and for each element in the list of chunks, do a find() or index() from the current position.
If you find the current chunk then work out from its starting position and length where to start searching for the next chunk and update the starting position. If you find all the chunks then the target string matches the pattern. If any chunk is missing then the pattern search should fail.
Since this is homework I am hoping that gives you enough of an idea to move on.
The basic idea here is to compare each character in str1 and str2, and if char in str1 is "*", find that character in str2 which is the character next to the "*" in str1.
Assuming that you are not going to use any function, (except find(), which can be implemented easily), this is the hard way (the code is straight-forward but messy, and I've commented wherever possible)-
def matchPat(str1, str2):
    index1 = 0
    index2 = 0
    while index1 < len(str1):
        c = str1[index1]
        # Check if str2 has run its course.
        if index2 >= len(str2):
            # This needs to be checked, assuming matchPat("*", "") to be true
            if(len(str2) == 0 and str1 == "*"):
                return True
            return False
        # If c is not "*", then it's a normal comparison.
        if c != "*":
            if c != str2[index2]:
                return False
            index2 += 1
        # If c is "*", then you need to advance in str1,
        # search for the next value in str2,
        # and update index2
        else:
            index1 += 1
            if(index1 == len(str1)):
                return True
            c = str1[index1]
            # Search for the character in str2
            i = str2.find(c, index2)
            # If the search fails, return False
            if(i == -1):
                return False
            index2 = i + 1
        index1 += 1
    return True
OUTPUT -
print matchPat("abcde", "abcd")
#False
print matchPat("a", "")
#False
print matchPat("", "a")
#True
print matchPat("", "")
#True
print matchPat("abc", "abc")
#True
print matchPat("ab*cd", "abacacd")
#False
print matchPat("ab*cd", "abaascd")
#True
print matchPat ('a*t*r', 'anteater')
#True
print matchPat ('a*t*r', 'albatross')
#True
print matchPat ('a*t*r', 'artist')
#False
Without giving you the complete answer, first, split the str1 string into a list of strings on the '*' character. I usually call str1 the "needle" and str2 the "haystack", since you are looking for the needle in the haystack.
needles = needle.split('*')
Next, have a counter (which I will call i) start at 0. You will always be looking at haystack[i:] for the next string in needles.
In pseudocode, it'll look like this:
needles = needle.split('*')
i = 0
loop through all strings in needles:
    if current needle not in haystack[i:], return false
    increment i to just after the occurrence of the current needle in haystack (use the find() string method or write your own function to handle this)
return true
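The pseudocode above can be turned into a short function. The sketch below assumes that * stands for zero or more characters (which is what the 'albatross' example in the question requires) and that the pattern may match anywhere inside the haystack; the name `match_pat` is only illustrative.

```python
def match_pat(needle, haystack):
    position = 0  # where the search for the next chunk may start
    for chunk in needle.split("*"):
        found = haystack.find(chunk, position)
        if found == -1:
            return False  # this chunk does not occur after the previous one
        # Continue just after this chunk. An empty chunk (leading, trailing
        # or doubled asterisk) is found at `position` and changes nothing.
        position = found + len(chunk)
    return True

print(match_pat('a*t*r', 'anteaters'))
print(match_pat('a*t*r', 'albatross'))
print(match_pat('a*t*r', 'artist'))
```

Taking the earliest occurrence of each chunk is safe here because it leaves the largest possible remainder of the haystack for the chunks that follow.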
Are you allowed to use regular expressions? If so, the function you're looking for already exists in the re.search function. In a regex, .* plays the role of your wildcard (a . on its own would match exactly one character):
import re
bool(re.search('a.*t.*r', 'anteaters')) # True
bool(re.search('a.*t.*r', 'artist' )) # False
And if asterisks are a strict necessity, you can use regular expressions for that, too:
newstr = re.sub(r'\*', '.*', 'a*t*r') # Replace * with .*
bool(re.search(newstr, 'anteaters')) # Search using the new string
If regular expressions aren't allowed, the simplest way to do that would be to look at substrings of the second string that are the same length as the first string, and compare the two. Something like this:
def matchpat(str1, str2):
    if len(str1) > len(str2): return False #Can't match if the first string is longer
    for i in range(0, len(str2)-len(str1)+1):
        substring = str2[i:i+len(str1)] # create substring of same length as first string
        for j in range(0, len(str1)):
            matched = False # assume False until match is found
            if str1[j] != '*' and str1[j] != substring[j]: # check each character
                break
            matched = True
        if matched == True: break # we don't need to keep searching if we've found a match
    return matched

How to replace the Nth appearance of a needle in a haystack? (Python)

I am trying to replace the Nth appearance of a needle in a haystack. I want to do this simply via re.sub(), but cannot seem to come up with an appropriate regex to solve this. I am trying to adapt: http://docstore.mik.ua/orelly/perl/cookbook/ch06_06.htm but am failing at spanning multilines, I suppose.
My current method is an iterative approach that finds the position of each occurrence from the beginning after each mutation. This is pretty inefficient and I would like to get some input. Thanks!
I think you mean re.sub. You could pass a function and keep track of how often it was called so far:
def replaceNthWith(n, replacement):
    def replace(match, c=[0]):
        c[0] += 1
        return replacement if c[0] == n else match.group(0)
    return replace
Usage:
re.sub(pattern, replaceNthWith(n, replacement), str)
But this approach feels a bit hacky, maybe there are more elegant ways.
Something like this regex should help you. Though I'm not sure how efficient it is:
# N=3
re.sub(
    r'^((?:.*?mytexttoreplace){2}.*?)mytexttoreplace',
    r'\1yourreplacementtext.',  # raw string, so \1 is a backreference to group 1
    'mystring',
    flags=re.DOTALL
)
The DOTALL flag is important.
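As a rough illustration of why DOTALL matters, the same regex can be built for an arbitrary N and applied to a multi-line string. This is a sketch under assumptions: the needle is treated as literal text (hence re.escape), n is at least 1, and `replace_nth_regex` is a made-up name.

```python
import re

def replace_nth_regex(needle, replacement, text, n):
    literal = re.escape(needle)
    # Group 1 lazily consumes everything up to and including the first
    # n - 1 occurrences; the final `literal` is then the n-th occurrence.
    pattern = r'^((?:.*?%s){%d}.*?)%s' % (literal, n - 1, literal)
    # A function replacement avoids backslash processing in `replacement`.
    return re.sub(pattern, lambda m: m.group(1) + replacement, text,
                  count=1, flags=re.DOTALL)

# With DOTALL, "." also matches "\n", so occurrences on earlier lines count.
print(replace_nth_regex("ab", "X", "ab ab\nab ab", 3))
```

Without re.DOTALL the lazy `.*?` could not cross the newline, and the third occurrence on the second line would never be reached.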
I've been struggling for a while with this, but I found a solution that I think is pretty pythonic:
>>> import re
>>> def nth_matcher(n, replacement):
...     def alternate(n):
...         i = 0
...         while True:
...             i += 1
...             yield i % n == 0
...     gen = alternate(n)
...     def match(m):
...         replace = next(gen)
...         if replace:
...             return replacement
...         else:
...             return m.group(0)
...     return match
...
>>> re.sub("([0-9])", nth_matcher(3, "X"), "1234567890")
'12X45X78X0'
EDIT: the matcher consists of two parts:
the alternate(n) function. This returns a generator that yields an infinite sequence of True/False values, where every nth value is True. Think of it like list(alternate(3)) == [False, False, True, False, False, True, False, ...].
The match(m) function. This is the function that gets passed to re.sub: it gets the next value from alternate(n) (via next(gen)) and, if it's True, replaces the matched value; otherwise it keeps it unchanged (replaces it with itself).
I hope this is clear enough. If my explanation is hazy, please say so and I'll improve it.
Could you do it using re.finditer together with the match objects' start() and end() methods? (re.findall returns plain strings, which carry no positions.)
Find all occurrences of the pattern in the string with re.finditer, get the indices of the Nth occurrence with .start()/.end(), and build a new string with the replacement value using those indices.
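A minimal sketch of that idea follows. It assumes the replacement is a literal string (no \1-style templates) and that n counts from 1; the function name `replace_nth` is hypothetical.

```python
import re

def replace_nth(pattern, replacement, text, n, flags=0):
    # re.finditer yields match objects lazily, in left-to-right order.
    for count, match in enumerate(re.finditer(pattern, text, flags), start=1):
        if count == n:
            # Splice the replacement in place of the n-th match.
            return text[:match.start()] + replacement + text[match.end():]
    return text  # fewer than n matches: leave the text unchanged

print(replace_nth(r"[0-9]", "X", "1234567890", 3))
```

Because the loop stops at the n-th match, the text is scanned only once and only as far as needed, which addresses the inefficiency described in the question.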
If the pattern ("needle") or replacement is a complex regular expression, you can't assume anything. The function "nth_occurrence_sub" is what I came up with as a more general solution:
def nth_match_end(pattern, string, n, flags):
    for i, match_object in enumerate(re.finditer(pattern, string, flags)):
        if i + 1 == n:
            return match_object.end()
def nth_occurrence_sub(pattern, repl, string, n=0, flags=0):
    max_n = len(re.findall(pattern, string, flags))
    if abs(n) > max_n or n == 0:
        return string
    if n < 0:
        n = max_n + n + 1
    sub_n_times = re.sub(pattern, repl, string, count=n, flags=flags)
    if n == 1:
        return sub_n_times
    nm1_end = nth_match_end(pattern, string, n - 1, flags)
    sub_nm1_times = re.sub(pattern, repl, string, count=n - 1, flags=flags)
    sub_nm1_change = sub_nm1_times[:-1 * len(string[nm1_end:])]
    components = [
        string[:nm1_end],
        sub_n_times[len(sub_nm1_change):]
    ]
    return ''.join(components)
I have a similar function I wrote to do this. I was trying to replicate SQL REGEXP_REPLACE() functionality. I ended up with:
def sql_regexp_replace( txt, pattern, replacement='', position=1, occurrence=0, regexp_modifier='c'):
    class ReplWrapper(object):
        def __init__(self, replacement, occurrence):
            self.count = 0
            self.replacement = replacement
            self.occurrence = occurrence
        def repl(self, match):
            self.count += 1
            if self.occurrence == 0 or self.occurrence == self.count:
                return match.expand(self.replacement)
            else:
                try:
                    return match.group(0)
                except IndexError:
                    return match.group(0)
    occurrence = 0 if occurrence < 0 else occurrence
    flags = regexp_flags(regexp_modifier)
    rx = re.compile(pattern, flags)
    replw = ReplWrapper(replacement, occurrence)
    return txt[0:position-1] + rx.sub(replw.repl, txt[position-1:])
One important note that I haven't seen mentioned is that you need to return match.expand() otherwise it won't expand the \1 templates properly and will treat them as literals.
If you want this to work you'll need to handle the flags differently (or take it from my github, it's simple to implement and you can dummy it for a test by setting it to 0 and ignoring my call to regexp_flags()).
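The point about match.expand() can be seen in isolation with two small replacement functions. This is a sketch for illustration only; `literal_repl` and `expanded_repl` are made-up names.

```python
import re

def literal_repl(match):
    # The return value of a replacement function is used as-is,
    # so the backslash template is NOT interpreted.
    return r"<\1>"

def expanded_repl(match):
    # match.expand() performs the same template substitution that
    # re.sub applies when the replacement is given as a string.
    return match.expand(r"<\1>")

print(re.sub(r"(\d)", literal_repl, "a1b2"))
print(re.sub(r"(\d)", expanded_repl, "a1b2"))
```

The first call leaves the text `\1` in the output, while the second inserts the captured digit.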
