Search for a pattern in a string in python

Question: I am very new to python so please bear with me. This is a homework assignment that I need some help with.
So, for the matchPat function, I need to write a function that will take two arguments, str1 and str2, and return a Boolean indicating whether str1 is in str2. But I have to use an asterisk as a wildcard in str1. The * can only be used in str1 and it will represent zero or more characters that I need to ignore. Examples of matchPat are as follows:
matchPat ( 'a*t*r', 'anteaters' ) : True
matchPat ( 'a*t*r', 'albatross' ) : True
matchPat ( 'a*t*r', 'artist' ) : False
My current matchPat function can tell whether the characters of str1 are in str2 but I don't really know how I could tell python (by using the * as a wild card) to look for 'a' (the first letter) and after it finds a, skip the next 0 or more characters until it finds the next letter(which would be 't' in the example) and so on.
def matchPat(str1,str2):
    ## str(*)==str(=>1)
    if str1=='':
        return True
    elif str2=='':
        return False
    elif str1[0]==str2[0]:
        return matchPat(str1[2],str2[len(str1)-1])
    else: return True

Python strings have the in operator; you can check if str1 is a substring of str2 using str1 in str2.
You can split a string into a list of substrings based on a token. "a*b*c".split("*") is ["a","b","c"].
You can find the offset of the next occurrence of a substring in a string using the string's find method.
So the problem of wildcard matching becomes:
split the pattern into the parts that were separated by asterisks
for each part of the pattern:
    can we find this part after the previous part's location?
You will have to cope with corner cases such as patterns that start or end with an asterisk, or that have two asterisks next to each other. Good luck!

There is a find() method of strings that searches for a substring from a particular point, returning either its index (if found) or -1 if not found. The index() method is similar but raises an exception if the target string is not found.
I'd suggest that you first split the pattern string on "*". This will give you a list of chunks to look for. Set the starting position to zero, and for each element in the list of chunks, do a find() or index() from the current position.
If you find the current chunk then work out from its starting position and length where to start searching for the next chunk and update the starting position. If you find all the chunks then the target string matches the pattern. If any chunk is missing then the pattern search should fail.
Since this is homework I am hoping that gives you enough of an idea to move on.
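A minimal sketch of the split-and-find approach described above, assuming that "*" stands for zero or more characters and that the pattern may begin anywhere in the target string. The name match_pat and its parameter names are placeholders, not part of the assignment.

```python
def match_pat(pattern, text):
    """Return True if the chunks of pattern occur in text, in order."""
    pos = 0  # where the search for the next chunk starts
    for chunk in pattern.split("*"):
        found = text.find(chunk, pos)
        if found == -1:
            return False  # this chunk does not occur after the previous one
        pos = found + len(chunk)  # continue just past the chunk we found
    return True


print(match_pat("a*t*r", "anteaters"))  # True
print(match_pat("a*t*r", "albatross"))  # True
print(match_pat("a*t*r", "artist"))     # False
```

Empty chunks (from a leading, trailing, or doubled asterisk) need no special handling here, because find() with an empty string succeeds at the current position.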

The basic idea here is to walk through str1 and str2 character by character; when the character in str1 is "*", search str2 for the character that follows the "*" in str1.
Assuming that you are not going to use any function (except find(), which can be implemented easily), this is the hard way. The code is straightforward but messy, and I've commented wherever possible. Note that this version requires the match to begin at the first character of str2:
def matchPat(str1, str2):
    index1 = 0
    index2 = 0
    while index1 < len(str1):
        c = str1[index1]
        # Check if str2 has run its course.
        if index2 >= len(str2):
            # This needs to be checked, assuming matchPat("*", "") to be true
            if len(str2) == 0 and str1 == "*":
                return True
            return False
        # If c is not "*", then it's a normal comparison.
        if c != "*":
            if c != str2[index2]:
                return False
            index2 += 1
        # If c is "*", then you need to advance in str1,
        # search for the next value in str2,
        # and update index2
        else:
            index1 += 1
            if index1 == len(str1):
                return True
            c = str1[index1]
            # Search for the character in str2
            i = str2.find(c, index2)
            # If the search fails, return False
            if i == -1:
                return False
            index2 = i + 1
        index1 += 1
    return True
OUTPUT -
print(matchPat("abcde", "abcd"))
#False
print(matchPat("a", ""))
#False
print(matchPat("", "a"))
#True
print(matchPat("", ""))
#True
print(matchPat("abc", "abc"))
#True
print(matchPat("ab*cd", "abacacd"))
#False
print(matchPat("ab*cd", "abaascd"))
#True
print(matchPat('a*t*r', 'anteater'))
#True
print(matchPat('a*t*r', 'albatross'))
#True
print(matchPat('a*t*r', 'artist'))
#False

Without giving you the complete answer, first, split the str1 string into a list of strings on the '*' character. I usually call str1 the "needle" and str2 the "haystack", since you are looking for the needle in the haystack.
needles = needle.split('*')
Next, have a counter (which I will call i) start at 0. You will always be looking at haystack[i:] for the next string in needles.
In pseudocode, it'll look like this:
needles = needle.split('*')
i = 0
loop through all strings in needles:
    if current needle not in haystack[i:], return false
    increment i to just after the occurrence of the current needle in haystack (use the find() string method or write your own function to handle this)
return true

Are you allowed to use regular expressions? If so, the function you're looking for already exists as re.search. In a regular expression, ".*" matches zero or more arbitrary characters:
import re
bool(re.search('a.*t.*r', 'anteaters')) # True
bool(re.search('a.*t.*r', 'artist' )) # False
And if asterisks are a strict necessity, you can use regular expressions for that, too:
newstr = re.sub(r'\*', '.*', 'a*t*r') # Replace * with .*
bool(re.search(newstr, 'anteaters')) # Search using the new string
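A hedged sketch of the same translation that also copes with patterns containing regex metacharacters such as "." or "+". It assumes "*" means zero or more characters; wildcard_to_regex and match_pat_re are hypothetical helper names.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a '*' wildcard pattern into a regular expression string."""
    # Escape every literal chunk so that characters such as '.' are not
    # treated as regex syntax, then join the chunks with '.*'.
    return ".*".join(re.escape(part) for part in pattern.split("*"))


def match_pat_re(pattern, text):
    """Return True if the wildcard pattern occurs anywhere in text."""
    # DOTALL lets '.*' span newlines as well.
    return re.search(wildcard_to_regex(pattern), text, flags=re.DOTALL) is not None


print(wildcard_to_regex("a*t*r"))           # a.*t.*r
print(match_pat_re("a*t*r", "anteaters"))   # True
print(match_pat_re("a*t*r", "artist"))      # False
```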
If regular expressions aren't allowed, the simplest way to do that would be to look at substrings of the second string that are the same length as the first string, and compare the two. Note that this treats each * as exactly one character. Something like this:
def matchpat(str1, str2):
    if len(str1) > len(str2):
        return False  # Can't match if the first string is longer
    for i in range(len(str2) - len(str1) + 1):
        substring = str2[i:i + len(str1)]  # substring of the same length as the first string
        # every non-wildcard character must equal the character at the same offset
        if all(p == '*' or p == c for p, c in zip(str1, substring)):
            return True  # we don't need to keep searching if we've found a match
    return False

Related

How can I count the spaces and symbols in a string in python?

I'm trying to create a function that can determine whether a word or sentence is an anagram. I've come this far, but I can't figure out how to tell my function to handle special characters such as '!' or '?', or spaces in the string. Right now, the function will read spaces and symbols and return an anagram as False. Here's the code
def is_anagram(string_a, string_b):
    string_a.lower()
    string_b.lower()
    if len(string_a) != len(string_b):
        return False
    char_times_a = dict()
    char_times_b = dict()
    for i in range(len(string_a)):
        if string_a[i] not in char_times_a.keys():
            char_times_a[string_a[i]] = 0
        else:
            char_times_a[string_a[i]] += 1
        if string_b[i] not in char_times_b.keys():
            char_times_b[string_b[i]] = 0
        else:
            char_times_b[string_b[i]] += 1
    return char_times_a == char_times_b
is_anagram('scar', 'cars')
True
is_anagram('Tom Marvolo Riddle', 'I am Lord Voldemort')
False
That last call should return True, because it is an anagram.

Change the first two lines of the function from:
string_a.lower()
string_b.lower()
To:
string_a = string_a.lower().replace(' ', '')
string_b = string_b.lower().replace(' ', '')
You need to assign the result back, because lower() returns a new string rather than modifying the original in place, and you also need to remove the spaces.
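A small illustration of why the assignment matters: Python strings are immutable, so lower() and replace() return new strings. The variable names here are only for demonstration.

```python
name = "Tom Riddle"
name.lower()        # the lowered copy is discarded; name itself is untouched
unchanged = name

# Keeping the result requires an assignment.
cleaned = name.lower().replace(" ", "")

print(unchanged)  # Tom Riddle
print(cleaned)    # tomriddle
```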

Your problem is that your function treats whitespace as part of the set of characters alongside the letters. This might be what you want, but then your second example is indeed not an anagram, because the two strings contain a different number of spaces.
Specifically, these lines return False for your example:
if len(string_a) != len(string_b):
    return False
But even if you removed them, your function would still count the spaces, and the characters are not lowercased either, so it would return False either way.

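You can write your function like this:
def is_anagram(string_a, string_b):
    return set(string_a.lower()) == set(string_b.lower()) and len(string_a) == len(string_b)
or
def is_anagram(string_a, string_b):
    return sorted(string_a.lower()) == sorted(string_b.lower())
Note that the set-based version only compares which characters occur, not how often, so it can report false positives such as 'aab' and 'abb'; the sorted version does not have this problem. Neither version ignores spaces or punctuation.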

You can use this method to remove all instances of special characters and white spaces from a string. The isalnum method returns True for alphanumeric characters. The lower method was included to prevent an error if your function is case sensitive.
string_a = ''.join(filter(str.isalnum, string_a)).lower()
string_b = ''.join(filter(str.isalnum, string_b)).lower()
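Putting the pieces together, a possible sketch that ignores case, spaces, and punctuation and compares character counts with collections.Counter. The inner helper letter_counts is a hypothetical name, and treating only alphanumeric characters as significant is an assumption about what the asker wants.

```python
from collections import Counter


def is_anagram(string_a, string_b):
    """Compare two strings as anagrams, ignoring case and non-alphanumerics."""

    def letter_counts(s):
        # Count each alphanumeric character once it has been lowercased.
        return Counter(ch for ch in s.lower() if ch.isalnum())

    return letter_counts(string_a) == letter_counts(string_b)


print(is_anagram("scar", "cars"))                               # True
print(is_anagram("Tom Marvolo Riddle", "I am Lord Voldemort"))  # True
print(is_anagram("aab", "abb"))                                 # False
```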

Pattern match with character/numeric pattern in Python [duplicate]

I'm trying to write a Python function that follows the pattern below. Essentially a pattern-matching algorithm is required.
isSubstring(pattern, word) -> bool
A) isSubstring("b1tecat", "bytecat") -> True
B) isSubstring("b2ecat", "bytecat") -> True
C) isSubstring("b5cat", "bytecat") -> False
D) isSubstring("b2tecat", "bytecat") -> False
E) isSubstring("bytecat", "bytecat") -> True
F) isSubstring("2", "be") -> True
G) isSubstring("2bbbb", "b") -> False
The code below is the basic solution that works for the (E) case from above, but obviously it does nothing to account for numbers in the pattern. Have searched leetcode, hackerrank, geeksforgeeks, etc, but can't find a decent solution.
def isSubstring(substring, string):
    len_substring = len(substring)
    len_string = len(string)
    for i in range(len_string - len_substring + 1):
        j = 0
        while j < len_substring:
            if string[i+j] != substring[j]:
                break
            j += 1
        if j == len_substring:
            return True
    return False
How can I account for the numbers in the pattern?

You can use an index into the second string, and when the character(s) from the first string are numeric, evaluate that number as an integer and advance the index by that amount. I assume that if the left string looks like "a21b", it means the second string should have 21 characters between "a" and "b". To easily identify consecutive digits, I would suggest a regular expression \d+|\D to split up the first string into its individual parts:
import re
def isSubstring(substring, string):
    tokens = re.findall(r"\d+|\D", substring)
    for i in range(len(string)):
        for ch in tokens:
            if i >= len(string):
                return False
            elif ch.isdigit():
                i += int(ch)
            elif ch != string[i]:
                break
            else:
                i += 1
        else:
            return True
    return False
It will, however, be easier to rely on regular expressions themselves: convert the first string into a regular expression, and then see if the second string matches it:
import re
def isSubstring(substring, string):
    regex = re.sub(r"(\d+)", r".{\1}", re.escape(substring))
    return re.search(regex, string) is not None
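To see how the regex conversion behaves on cases (A) to (G) from the question, here is a small self-contained check. The function is restated under the hypothetical name is_substring, and the cases list is simply the question's examples.

```python
import re


def is_substring(pattern, word):
    # Each run of digits N becomes ".{N}", i.e. exactly N arbitrary characters.
    regex = re.sub(r"(\d+)", r".{\1}", re.escape(pattern))
    return re.search(regex, word) is not None


cases = [
    ("b1tecat", "bytecat", True),   # A
    ("b2ecat", "bytecat", True),    # B
    ("b5cat", "bytecat", False),    # C
    ("b2tecat", "bytecat", False),  # D
    ("bytecat", "bytecat", True),   # E
    ("2", "be", True),              # F
    ("2bbbb", "b", False),          # G
]

results = [is_substring(pattern, word) for pattern, word, _ in cases]
for (pattern, word, expected), got in zip(cases, results):
    print(pattern, word, got, "ok" if got == expected else "MISMATCH")
```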

Building on @trincot's answer here, you can use the same tokenization to build a regex that matches what you are searching for.
For example: "b2tecat" is really just r"b..tecat"; feed that to re.search and you will find out whether your pattern occurs in the string.
import re
def isSubstring(substring, string):
    regex = r""
    for ch in re.findall(r"\d+|\D", substring):
        if ch.isdigit():
            regex += "." * int(ch)
        else:
            regex += re.escape(ch)  # escape the literal pattern character, not the searched string
    return re.search(regex, string) is not None
Side note: there are more "classical" ways to solve this problem, known as pattern matching with "don't cares", for example using an FFT.

ValueError when validating data

I am trying to write a function that returns True when the value of hcl meets the required specification (given in the docstring of the function). The first thing I wanted to check was whether the length of the value is correct (it should be # plus 6 other characters); once that works, I would check whether all the characters are in a-f or 0-9. That was my idea for solving this problem, but unfortunately I get a
ValueError: substring not found
(when the second element of the list is passed to the function), which I don't understand. (As always, you follow your own reasoning, and when there is a mistake you can't find it, because to you everything 'should work'.)
def check_hcl(line):
    '''
    a # followed by exactly six characters 0-9 or a-f.
    '''
    print(line[line.index(':')+1], len(line[line.index(':')+2:]))
    if line[line.index(':')+1] != '#' or len(line[line.index(':')+2:]) != 6:
        return False
    else:
        return True
list = ['hcl:#866857','#52a9af','#cfa07d','7d3b0c','#cc0362','#a9784']
#false #false
for i in list:
    print(check_hcl(i))

You can use the match() function from the built-in re module:
import re
def check_hcl(line):
    if re.match(r"(.*?)#[a-f0-9]{6}$", line):
        return True
    return False
list = ['hcl:#866857','#52a9af','#cfa07d','7d3b0c','#cc0362','#a9784']
for i in list:
    print(check_hcl(i))
Output:
True
True
True
False
True
False
Explanation:
The pattern (.*?)#[a-f0-9]{6}$ can be broken down into 4 parts:
(.*?) matches anything of any length, including substrings of length 0.
# matches a '#'.
[a-f0-9]{6} matches exactly 6 characters, each either a letter a to f or a digit 0 to 9.
$ anchors the match at the end of the string, so a seventh character after the six is rejected.
Credits to @Ian.
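As for the ValueError itself: line.index(':') raises it for every element that contains no ':' (the second element, '#52a9af', is the first such case). A sketch closer to the asker's original plan (check the length, then check the characters) that avoids the exception by using str.partition; HEX_DIGITS is a name introduced here for illustration, and treating elements without a "hcl:" prefix as bare values is an assumption based on the expected output.

```python
HEX_DIGITS = set("0123456789abcdef")


def check_hcl(line):
    """Return True for '#' followed by exactly six characters 0-9 or a-f."""
    # partition never raises: when ':' is absent, sep is '' and head holds
    # the whole line, so the line itself is treated as the value.
    head, sep, tail = line.partition(":")
    value = tail if sep else head
    return (
        len(value) == 7
        and value[0] == "#"
        and all(ch in HEX_DIGITS for ch in value[1:])
    )


values = ['hcl:#866857', '#52a9af', '#cfa07d', '7d3b0c', '#cc0362', '#a9784']
print([check_hcl(v) for v in values])
# [True, True, True, False, True, False]
```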

How would I detect duplicate elements of a string from another string in python?

So how would I go about finding the characters of one string that are duplicated relative to another string in Python, ideally with a quick fix of one or two lines?
for example,
str1 = "abccde"
str2 = "abcde"
# gets me c
Through the use of str2, finding there was a duplicate element in str1, so detecting that str1 has a duplicate of an element in str2. Not sure if there's a way through .count to do that, like str1.count(str2) or something.
I'm using this contextually for my hangman assignment and I'm a beginner coder, so we are using mostly built-in functions and the basics for the assignments, and there's a piece of my code within my loop that will keep printing because it dings the double letters.
Ex. hello, grinding, concoction.
So I pretty much made a "used" string, and I am trying to compare that to my correct letters list, and the guesses are 'appended' so I can avoid that.
note: they will be inputted, so I won't be able to say or just hardcode the letter c if that makes sense.
Thank you!

Using set with str.count:
def find_dup(str1, str2):
    return [i for i in set(str1) if str1.count(i) > 1 and i in set(str2)]
Output:
find_dup("abccde", "abcde")
# ['c']
find_dup("abcdeffghi" , "aaaaaabbbbbbcccccddeeeeefffffggghhiii") # from comment
# ['f']
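A variant of the same idea using collections.Counter, which counts every character in one pass instead of calling str.count once per distinct character. Sorting the result is an added choice here so that the output order is deterministic; the function name is reused from the answer above only for comparison.

```python
from collections import Counter


def find_dup(str1, str2):
    """Characters that occur more than once in str1 and also appear in str2."""
    counts = Counter(str1)
    allowed = set(str2)
    return sorted(ch for ch, n in counts.items() if n > 1 and ch in allowed)


print(find_dup("abccde", "abcde"))  # ['c']
print(find_dup("hello", "helo"))    # ['l']
```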

My guess is that maybe you're trying to write a method similar to:
def duplicate_string(str1: str, str2: str) -> str:
    str2_set = set(str2)
    if len(str2_set) != len(str2):
        raise ValueError(f'{str2} has duplicate!')
    output = ''
    for char in str1:
        if char in str2_set:
            str2_set.remove(char)
        else:
            output += char
    return output
str1 = "abccccde"
str2 = "abcde"
print(duplicate_string(str1, str2))
Output
ccc
Here, we first raise an error if str2 itself has a duplicate. Then we loop through str1 and either remove the character from str2_set or append the duplicate to the output string.

You are basically looking for a diff function between the two strings. Adapting this beautiful answer:
import difflib
cases=[('abcccde', 'abcde')]
for a,b in cases:
    print('{} => {}'.format(a,b))
    for i,s in enumerate(difflib.ndiff(a, b)):
        if s[0]==' ': continue
        elif s[0]=='-':
            print(u'The second string is missing the "{}" in position {} of the first string'.format(s[-1],i))
        elif s[0]=='+':
            print(u'The first string is missing the "{}" in position {} of the second string'.format(s[-1],i))
    print()
Output
abcccde => abcde
The second string is missing the "c" in position 3 of the first string
The second string is missing the "c" in position 4 of the first string

Find index of last occurrence of a substring in a string

I want to find the position (or index) of the last occurrence of a certain substring in given input string str.
For example, suppose the input string is str = 'hello' and the substring is target = 'l', then it should output 3.
How can I do this?

Use .rfind():
>>> s = 'hello'
>>> s.rfind('l')
3
Also don't use str as variable name or you'll shadow the built-in str().

You can use rfind() or rindex()
>>> s = 'Hello StackOverflow Hi everybody'
>>> print( s.rfind('H') )
20
>>> print( s.rindex('H') )
20
>>> print( s.rfind('other') )
-1
>>> print( s.rindex('other') )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: substring not found
The difference shows when the substring is not found: rfind() returns -1, while rindex() raises a ValueError exception.
If you do not want to check the rfind() return value for -1, you may prefer rindex(), which provides an understandable error message. Otherwise you may spend minutes searching for where the unexpected value -1 is coming from within your code...
Example: Search of last newline character
>>> txt = '''first line
... second line
... third line'''
>>> txt.rfind('\n')
22
>>> txt.rindex('\n')
22
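The two behaviours described above can be wrapped as follows. This is only a sketch; last_index_or_none and last_index_strict are hypothetical helper names, and returning None for "not found" is one possible convention.

```python
def last_index_or_none(s, target):
    """Use rfind() and translate its -1 sentinel into None."""
    pos = s.rfind(target)
    return None if pos == -1 else pos


def last_index_strict(s, target):
    """Use rindex() and re-raise with a message naming both strings."""
    try:
        return s.rindex(target)
    except ValueError:
        raise ValueError(f"{target!r} does not occur in {s!r}") from None


print(last_index_or_none("hello", "l"))  # 3
print(last_index_or_none("hello", "z"))  # None
print(last_index_strict("hello", "l"))   # 3
```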

Use the str.rindex method.
>>> 'hello'.rindex('l')
3
>>> 'hello'.index('l')
2

Not trying to resurrect an inactive post, but since this hasn't been posted yet...
(This is how I did it before finding this question)
s = "hello"
target = "l"
last_pos = len(s) - 1 - s[::-1].index(target)
Explanation: When you're searching for the last occurrence, really you're searching for the first occurrence in the reversed string. Knowing this, I did s[::-1] (which returns a reversed string), and then indexed the target from there. Then I did len(s) - 1 - the index found because we want the index in the unreversed (i.e. original) string.
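Watch out, though! If target is more than one character, you probably won't find it in the reversed string, and the position found in the reversed string marks the end of the occurrence rather than its start. To fix this, use last_pos = len(s) - len(target) - s[::-1].index(target[::-1]), which searches for a reversed version of target and subtracts the target's length.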
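A quick sketch that checks the reversed-string idea against rfind() for single-character and multi-character targets. To obtain the start index of the last occurrence, the length of the target has to be subtracted; last_pos_reversed is a hypothetical name.

```python
def last_pos_reversed(s, target):
    """Start index of the last occurrence of target, via the reversed string."""
    # In the reversed string, the first hit corresponds to the END of the
    # last occurrence in the original, hence the subtraction of len(target).
    rev_index = s[::-1].index(target[::-1])
    return len(s) - len(target) - rev_index


for s, target in [("hello", "l"), ("hello", "ll"), ("abcabc", "bc")]:
    print(s, target, last_pos_reversed(s, target), s.rfind(target))
```

Like index(), this raises ValueError when the target does not occur.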

Try this:
s = 'hello plombier pantin'
print (s.find('p'))
6
print (s.index('p'))
6
print (s.rindex('p'))
15
print (s.rfind('p'))
15

For this case both rfind() and rindex() string methods can be used, both will return the highest index in the string where the substring is found like below.
test_string = 'hello'
target = 'l'
print(test_string.rfind(target))
print(test_string.rindex(target))
One thing to keep in mind when using rindex(): it raises a ValueError (substring not found) if the target is not found within the searched string, whereas rfind() just returns -1.

The more_itertools library offers tools for finding indices of all characters or all substrings.
Given
import more_itertools as mit
s = "hello"
pred = lambda x: x == "l"
Code
Characters
Now there is the rlocate tool available:
next(mit.rlocate(s, pred))
# 3
A complementary tool is locate:
list(mit.locate(s, pred))[-1]
# 3
mit.last(mit.locate(s, pred))
# 3
Substrings
There is also a window_size parameter available for locating the leading item of several items:
s = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
substring = "chuck"
pred = lambda *args: args == tuple(substring)
next(mit.rlocate(s, pred=pred, window_size=len(substring)))
# 59

Python String rindex() Method
Description
Python string method rindex() returns the last index where the substring str is found, or raises an exception if no such index exists, optionally restricting the search to string[beg:end].
Syntax
Following is the syntax for rindex() method −
str.rindex(str, beg=0, end=len(string))
Parameters
str − This specifies the string to be searched for.
beg − This is the starting index; by default it is 0.
end − This is the ending index; by default it is equal to the length of the string.
Return Value
This method returns the last index if found; otherwise it raises an exception.
Example
The following example shows the usage of rindex() method.
#!/usr/bin/python
str1 = "this is string example....wow!!!"
str2 = "is"
print(str1.rindex(str2))
print(str1.index(str2))
When we run above program, it produces following result −
5
2
Ref: Python String rindex() Method
- Tutorialspoint

If you don't want to use rfind, then this will do the trick:
def find_last(s, t):
    last_pos = -1
    while True:
        pos = s.find(t, last_pos + 1)
        if pos == -1:
            return last_pos
        else:
            last_pos = pos

# Last occurrence of a character in a string without using built-in search methods
text = input("Enter a string : ")
char = input("Enter a character to search in string : ")
last_index = -1  # -1 means "not found"; 0 is a valid index
for i in range(len(text)):
    if text[i] == char:
        last_index = i
if last_index == -1:
    print("Entered character ", char, " is not present in string")
else:
    print("Character ", char, " last occurred at index : ", last_index)

You can use the rindex() method to get the index of the last occurrence of a character in a string:
s="hellloooloo"
b='l'
print(s.rindex(b))

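s = "Hello, World"
target = 'l'
print(s.rfind(target) + 1)  # 1-based position (11); drop the "+ 1" for the 0-based index
or
s = "Hello, World"
found = False
target = 'l'
for i, ch in enumerate(s[::-1]):
    if target == ch:
        found = True
        break
if found:
    print(len(s) - i)  # 1-based position (11); use len(s) - 1 - i for the 0-based index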
