I'm OCRing some text from two different sources. They can each make mistakes in different places, where they won't recognize a letter/group of letters. If they don't recognize something, it's replaced with a ?. For example, if the word is Roflcopter, one source might return Ro?copter, while another, Roflcop?er. I'd like a function that returns whether two matches might be equivalent, allowing for multiple ?s. Example:
match("Ro?copter", "Roflcop?er") --> True
match("Ro?copter", "Roflcopter") --> True
match("Roflcopter", "Roflcop?er") --> True
match("Ro?co?er", "Roflcop?er") --> True
So far I can match one OCR with a perfect one by using regular expressions:
>>> import re
>>> def match(tn1, tn2):
...     tn1re = tn1.replace("?", ".{0,4}")
...     tn2re = tn2.replace("?", ".{0,4}")
...     return bool(re.match(tn1re, tn2) or re.match(tn2re, tn1))
>>> match("Roflcopter", "Roflcop?er")
True
>>> match("R??lcopter", "Roflcopter")
True
But this doesn't work when they both have ?s in different places:
>>> match("R??lcopter", "Roflcop?er")
False
Well, as long as each ? corresponds to exactly one character, I can suggest a method that is both fast and compact.
def match(str1, str2):
    if len(str1) != len(str2): return False
    for index, ch1 in enumerate(str1):
        ch2 = str2[index]
        if ch1 == '?' or ch2 == '?': continue
        if ch1 != ch2: return False
    return True
>>> match("Roflcopter", "Roflcop?er")
True
>>> match("R??lcopter", "Roflcopter")
True
>>> match("R??lcopter", "Roflcop?er")
True
Edit: Part B), brain-fart free now.
def sets_match(set1, set2):
    return any(match(str1, str2) for str1 in set1 for str2 in set2)
>>> s1 = set(['a?', 'fg'])
>>> s2 = set(['?x'])
>>> sets_match(s1, s2) # 'a?' matches '?x'
True
Thanks to Hamish Grubijan for this idea. Every ? in my ocr'd names can be anywhere from 0 to 3 letters. What I do is expand each string to a list of possible expansions:
>>> list(expQuestions("?flcopt?"))
['flcopt', 'flcopt#', 'flcopt##', 'flcopt###', '#flcopt', '#flcopt#', '#flcopt##', '#flcopt###', '##flcopt', '##flcopt#', '##flcopt##', '##flcopt###', '###flcopt', '###flcopt#', '###flcopt##', '###flcopt###']
then I expand both and use his matching function, which I called matchats:
def matchOCR(l, r):
    for expl in expQuestions(l):
        for expr in expQuestions(r):
            if matchats(expl, expr):
                return True
    return False
Works as desired:
>>> matchOCR("Ro?co?er", "?flcopt?")
True
>>> matchOCR("Ro?co?er", "?flcopt?z")
False
>>> matchOCR("Ro?co?er", "?flc?pt?")
True
>>> matchOCR("Ro?co?e?", "?flc?pt?")
True
The matching function:
def matchats(l, r):
    """Match two strings with # representing exactly 1 char"""
    if len(l) != len(r): return False
    for i, c1 in enumerate(l):
        c2 = r[i]
        if c1 == "#" or c2 == "#": continue
        if c1 != c2: return False
    return True
and the expanding function, where cartesian_product does just that:
def expQuestions(s):
    """For OCR w/ a questionmark in them, expand questions with
    #s for all possibilities"""
    numqs = s.count("?")
    blah = list(s)
    for expqs in cartesian_product([(0,1,2,3)]*numqs):
        newblah = blah[:]
        qi = 0
        for i,c in enumerate(newblah):
            if newblah[i] == '?':
                newblah[i] = '#'*expqs[qi]
                qi += 1
        yield "".join(newblah)
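The post does not show cartesian_product. A minimal sketch, assuming it is meant to behave like the standard library's itertools.product, could look like the following. The snake_case names and the max_len parameter are mine, not the original author's.

```python
from itertools import product

def exp_questions(s, max_len=3):
    """Yield every expansion of s where each '?' becomes 0..max_len '#' characters."""
    parts = s.split("?")  # literal text between the question marks
    for counts in product(range(max_len + 1), repeat=len(parts) - 1):
        out = [parts[0]]
        for n, literal in zip(counts, parts[1:]):
            out.append("#" * n + literal)
        yield "".join(out)

def match_hashes(l, r):
    """Match two strings where '#' stands for exactly one character."""
    return len(l) == len(r) and all(
        c1 == c2 or "#" in (c1, c2) for c1, c2 in zip(l, r)
    )

def match_ocr(l, r):
    """True if any expansion of l matches any expansion of r."""
    return any(
        match_hashes(el, er)
        for el in exp_questions(l)
        for er in exp_questions(r)
    )

print(match_ocr("Ro?co?er", "?flcopt?"))   # True
print(match_ocr("Ro?co?er", "?flcopt?z"))  # False
```

The number of expansions grows as 4 to the power of the number of ?s in each string, so this is only practical for short strings with a few unknowns.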
Using the Levenshtein distance may be useful. It gives a value for how similar the strings are to each other, and it works even if they are different lengths. The linked page has some pseudocode to get you started.
You'll end up with something like this:
>>> match("Roflcopter", "Roflcop?er")
1
>>> match("R??lcopter", "Roflcopter")
2
>>> match("R?lcopter", "Roflcop?er")
3
So you could have a maximum threshold below which you say they may match.
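As a rough sketch of that idea, here is the textbook dynamic-programming edit distance. It treats ? as an ordinary character, and the threshold of 3 in may_match is purely an assumption for illustration, not a value from the answer.

```python
def levenshtein(a, b):
    """Classic edit distance: minimum number of inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]

def may_match(a, b, threshold=3):
    """Hypothetical helper: treat strings within `threshold` edits as possible matches."""
    return levenshtein(a, b) <= threshold

print(levenshtein("Roflcopter", "Roflcop?er"))  # 1
print(levenshtein("R??lcopter", "Roflcopter"))  # 2
print(levenshtein("R?lcopter", "Roflcop?er"))   # 3
```

Note that a plain distance threshold also accepts strings that differ in recognized letters, not only in ?s, so it is looser than the wildcard approaches.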
This might not be the most Pythonic of options, but if a ? is allowed to match any number of characters, then the following backtracking search does the trick:
def match(a,b):
    def matcher(i,j):
        if i == len(a) and j == len(b):
            return True
        elif i < len(a) and a[i] == '?' \
          or j < len(b) and b[j] == '?':
            return i < len(a) and matcher(i+1,j) \
                or j < len(b) and matcher(i,j+1)
        elif i == len(a) or j == len(b):
            return False
        else:
            return a[i] == b[j] and matcher(i+1,j+1)
    return matcher(0,0)
This may be adapted to be more stringent in what to match. Also, to save stack space, the final case (i+1,j+1) may be transformed into a non-recursive solution.
Edit: some more clarification in response to the reactions below. This is an adaptation of a naive matching algorithm for simplified regexes/NFAs (see Kernighan's contrib to Beautiful Code, O'Reilly 2007 or Jurafsky & Martin, Speech and Language Processing, Prentice Hall 2009).
How it works: the matcher function recursively walks through both strings/patterns, starting at (0,0). It succeeds when it reaches the end of both strings (len(a),len(b)); it fails when it encounters two unequal characters or the end of one string while there are still characters to match in the other string.
When matcher encounters a variable (?) in either string (say a), it can do two things: either skip over the variable (matching zero characters), or skip over the next character in b but keep pointing to the variable in a, allowing it to match more characters.
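Plain backtracking can revisit the same (i, j) pair many times when both strings contain several ?s. One way to bound the work, sketched here as an assumption rather than as part of the original answer, is to memoize matcher with functools.lru_cache so each pair is evaluated at most once.

```python
from functools import lru_cache

def match(a, b):
    @lru_cache(maxsize=None)
    def matcher(i, j):
        if i == len(a) and j == len(b):
            return True  # both strings fully consumed
        a_wild = i < len(a) and a[i] == "?"
        b_wild = j < len(b) and b[j] == "?"
        if a_wild or b_wild:
            # Advance in a, or advance in b; one of the two steps
            # skips the wildcard, the other lets it absorb a character.
            return (i < len(a) and matcher(i + 1, j)) or \
                   (j < len(b) and matcher(i, j + 1))
        if i == len(a) or j == len(b):
            return False  # one string ended while the other has literals left
        return a[i] == b[j] and matcher(i + 1, j + 1)
    return matcher(0, 0)

print(match("Ro?copter", "Roflcop?er"))   # True
print(match("R??lcopter", "Roflcop?er"))  # True
print(match("abc", "abd"))                # False
```

With the cache, the number of distinct calls is at most (len(a) + 1) * (len(b) + 1). The recursion depth is still proportional to the combined string length.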
So I have been trying to solve the Easy questions on Leetcode, and so far I don't understand most of the answers I find on the internet. I tried working on the Isomorphic Strings problem (here: https://leetcode.com/problems/isomorphic-strings/description/)
and I came up with the following code:
def isIso(a,b):
    if(len(a) != len(b)):
        return False
    x=[a.count(char1) for char1 in a]
    y=[b.count(char1) for char1 in b]
    return x==y
string1 = input("Input string1..")
string2 = input("Input string2..")
print(isIso(string1,string2))
Now I understand that this may be the most stupid code you have seen all day, but that is kind of my point. I'd like to know why this is wrong (and where) and how I should develop it further.
If I understand the problem correctly, because a character can map to itself, it's just a case of seeing if the character counts for the two words are the same.
So egg and add are isomorphic as they have character counts of (1,2). Similarly paper and title have counts of (1,1,1,2).
foo and bar aren't isomorphic as the counts are (1,2) and (1,1,1) respectively.
To see if the character counts are the same we'll need to sort them.
So:
from collections import Counter
def is_isomorphic(a,b):
    a_counts = list(Counter(a).values())
    a_counts.sort()
    b_counts = list(Counter(b).values())
    b_counts.sort()
    if a_counts == b_counts:
        return True
    return False
Your code is failing because here:
x=[a.count(char1) for char1 in a]
You count the occurrence of each character in the string for each character in the string. So a word like 'odd' won't have counts of (1,2), it'll have (1,2,2) as you count d twice!
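One caveat is worth checking before relying on counts alone: equal sorted counts appear to be necessary but not sufficient, because they ignore where the characters occur. The sketch below compares the count-based check with a position-aware one; both function names are illustrative.

```python
from collections import Counter

def same_counts(a, b):
    """Count-based check: compares sorted character frequencies only."""
    return sorted(Counter(a).values()) == sorted(Counter(b).values())

def is_isomorphic(a, b):
    """Position-aware check: every character must pair with exactly one partner."""
    return len(a) == len(b) and len(set(a)) == len(set(b)) == len(set(zip(a, b)))

print(same_counts("aab", "abb"))    # True
print(is_isomorphic("aab", "abb"))  # False
print(is_isomorphic("egg", "add"))  # True
```

In 'aab' versus 'abb' the first 'a' maps to 'a' and the second 'a' maps to 'b', so no consistent mapping exists even though both words have counts of (1,2).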
You can use two dicts to keep track of the mapping of each character in a to b, and the mapping of each character in b to a while you iterate through a, and if there's any violation in a corresponding character, return False; otherwise return True in the end.
def isIso(a, b):
    m = {} # mapping of each character in a to b
    r = {} # mapping of each character in b to a
    for i, c in enumerate(a):
        if c in m:
            if b[i] != m[c]:
                return False
        else:
            m[c] = b[i]
        if b[i] in r:
            if c != r[b[i]]:
                return False
        else:
            r[b[i]] = c
    return True
So that:
print(isIso('egg', 'add'))
print(isIso('foo', 'bar'))
print(isIso('paper', 'title'))
print(isIso('paper', 'tttle')) # to test reverse mapping
would output:
True
False
True
False
I tried it with a dictionary, and it resulted in a 72ms runtime.
Here's my code:
def isIsomorphic(s: str, t: str) -> bool:
    my_dict = {}
    if len(s) != len(t):
        return False
    else:
        for i in range(len(s)):
            if s[i] in my_dict.keys():
                if my_dict[s[i]] == t[i]:
                    pass
                else:
                    return False
            else:
                if t[i] in my_dict.values():
                    return False
                else:
                    my_dict[s[i]] = t[i]
        return True
There are many ways to solve the isomorphic strings problem in Python, for example with a dictionary, a set, or str.translate. Here is a set-based way:
from itertools import zip_longest
def isomorph(a, b):
    return len(set(a)) == len(set(b)) == len(set(zip_longest(a, b)))
Here is a second way to do it:
def isomorph(a, b):
    return [a.index(x) for x in a] == [b.index(y) for y in b]
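To see why the index-based version works, it can help to print the intermediate lists. In this small sketch (the helper name is made up for illustration) each character is replaced by the index of its first occurrence, so two strings are isomorphic exactly when these signatures are equal.

```python
def first_index_signature(s):
    """Replace each character by the index of its first occurrence in s."""
    return [s.index(ch) for ch in s]

print(first_index_signature("paper"))  # [0, 1, 0, 3, 4]
print(first_index_signature("title"))  # [0, 1, 0, 3, 4]
print(first_index_signature("foo"))    # [0, 1, 1]
print(first_index_signature("bar"))    # [0, 1, 2]
```

Because str.index scans from the start each time, this is quadratic in the string length, which is fine for short inputs.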
If I have this:
a='abcdefghij'
b='de'
Then this finds b in a:
b in a => True
Is there a way of doing a similar thing with lists?
Like this:
a=list('abcdefghij')
b=list('de')
b in a => False
The 'False' result is understandable - because it's rightly looking for an element 'de', rather than (what I happen to want it to do) 'd' followed by 'e'.
This works, I know:
a=['a', 'b', 'c', ['d', 'e'], 'f', 'g', 'h']
b=list('de')
b in a => True
I can crunch the data to get what I want - but is there a short Pythonic way of doing this?
To clarify: I need to preserve ordering here (b=['e','d'], should return False).
And if it helps, what I have is a list of lists: these lists represent all possible paths (a list of visited nodes) from node-1 to node-x in a directed graph. I want to 'factor' out common paths in any longer paths (so, looking for all irreducible 'atomic' paths which constitute all the longer paths).
Related
Best Way To Determine if a Sequence is in another sequence in Python
I suspect there are more pythonic ways of doing it, but at least it gets the job done:
l=list('abcdefgh')
pat=list('de')
print(pat in l) # Prints False
print(any(l[i:i+len(pat)]==pat for i in range(len(l)-len(pat)+1)))
Don't know if this is very pythonic, but I would do it in this way:
def is_sublist(a, b):
    if not a: return True
    if not b: return False
    return b[:len(a)] == a or is_sublist(a, b[1:])
A shorter solution is offered in this discussion, but it suffers from the same problem as the set-based solutions: it doesn't consider the order of elements.
UPDATE:
Inspired by MAK, I introduced a more concise and clear version of my code.
UPDATE:
There are performance concerns about this method, due to list copying in slices. Also, as it is recursive, you can hit the recursion limit for long lists. To eliminate copying, you can use NumPy slices, which create views, not copies. If you run into performance or recursion limit issues, you should use a solution without recursion.
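A non-recursive variant that also avoids slice copies could be sketched as follows. The function name is hypothetical, and it compares element by element instead of building sublists.

```python
def is_sublist_iter(sub, seq):
    """True if sub occurs as a contiguous, ordered run inside seq."""
    n, m = len(sub), len(seq)
    for start in range(m - n + 1):
        # all() stops at the first mismatch, and no slice is created
        if all(seq[start + k] == sub[k] for k in range(n)):
            return True
    return False

print(is_sublist_iter(list("de"), list("abcdefghij")))  # True
print(is_sublist_iter(list("ed"), list("abcdefghij")))  # False
```

It is still O(len(sub) * len(seq)) in the worst case, but it uses constant extra memory and cannot hit the recursion limit.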
I think this will be faster. It uses the C implementation of list.index to search for the first element, and goes on from there.
def find_sublist(sub, bigger):
    if not bigger:
        return -1
    if not sub:
        return 0
    first, rest = sub[0], sub[1:]
    pos = 0
    try:
        while True:
            pos = bigger.index(first, pos) + 1
            if not rest or bigger[pos:pos+len(rest)] == rest:
                return pos - 1  # pos points just past `first`, so step back to the start
    except ValueError:
        return -1
data = list('abcdfghdesdkflksdkeeddefaksda')
print(find_sublist(list('def'), data))
Note that this returns the position of the sublist in the list, not just True or False. If you want just a bool you could use this:
def is_sublist(sub, bigger):
    return find_sublist(sub, bigger) >= 0
I timed the accepted solution, my earlier solution and a new one with an index. The one with the index is clearly best.
EDIT: I timed nosklo's solution, it's even much better than what I came up with. :)
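def is_sublist_index(a, b):
    # Caveat: there is no real backtracking here, so patterns with a repeated
    # prefix can be missed, e.g. [1, 1, 2] inside [1, 1, 1, 2].
    if not a:
        return True
    index = 0
    for elem in b:
        if elem == a[index]:
            index += 1
            if index == len(a):
                return True
        elif elem == a[0]:
            index = 1
        else:
            index = 0
    return False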
def is_sublist(a, b):
    return str(a)[1:-1] in str(b)[1:-1]
def is_sublist_copylist(a, b):
    if a == []: return True
    if b == []: return False
    return b[:len(a)] == a or is_sublist_copylist(a, b[1:])
from timeit import Timer
print Timer('is_sublist([99999], range(100000))', setup='from __main__ import is_sublist').timeit(number=100)
print Timer('is_sublist_copylist([99999], range(100000))', setup='from __main__ import is_sublist_copylist').timeit(number=100)
print Timer('is_sublist_index([99999], range(100000))', setup='from __main__ import is_sublist_index').timeit(number=100)
print Timer('sublist_nosklo([99999], range(100000))', setup='from __main__ import sublist_nosklo').timeit(number=100)
Output in seconds:
4.51677298546
4.5824368
1.87861895561
0.357429027557
So, if you aren't concerned about the order the subset appears, you can do:
a=list('abcdefghij')
b=list('de')
set(b).issubset(set(a))
True
Edit after your clarification: if you need to preserve order, and the list is indeed made of characters as in your question, you can use:
''.join(a).find(''.join(b)) >= 0
This should work with any pair of lists, preserving the order.
It checks whether b is a sublist of a:
def is_sublist(b,a):
    if len(b) > len(a):
        return False
    if a == b:
        return True
    i = 0
    while i <= len(a) - len(b):
        if a[i] == b[0]:
            flag = True
            j = 1
            while i+j < len(a) and j < len(b):
                if a[i+j] != b[j]:
                    flag = False
                j += 1
            if flag:
                return True
        i += 1
    return False
>>>''.join(b) in ''.join(a)
True
Not sure how complex your application is, but for pattern matching in lists, pyparsing is very smart and easy to use.
Use the lists' string representation and remove the square braces. :)
def is_sublist(a, b):
    return str(a)[1:-1] in str(b)
EDIT: Right, there are false positives ... e.g. is_sublist([1], [11]). Crappy answer. :)
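The false positive is easy to reproduce. The sketch below shows it next to an element-wise window comparison, which is one possible alternative (not the answer author's code); the function names are illustrative.

```python
def is_sublist_str(a, b):
    # The string-representation trick: substring test on the printed lists.
    return str(a)[1:-1] in str(b)

def is_sublist_windows(a, b):
    # Element-wise alternative: compare a against every window of b.
    n = len(a)
    return any(b[i:i + n] == a for i in range(len(b) - n + 1))

print(is_sublist_str([1], [11]))                 # True (false positive)
print(is_sublist_windows([1], [11]))             # False
print(is_sublist_windows([2, 3], [1, 2, 3, 4]))  # True
```

The string version reports a match because '1' is a substring of '[11]', even though the element 1 is not in the list.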
What is the shortest way to check if a given string has the same characters?
For example if you have name = 'aaaaa' or surname = 'bbbb' or underscores = '___' or p = '++++', how do you check to know the characters are the same?
An option is to check whether the set of its characters has length 1:
>>> len(set("aaaa")) == 1
True
Or with all(), this could be faster if the strings are very long and it's rare that they are all the same character (but then the regex is good too):
>>> s = "aaaaa"
>>> s0 = s[0]
>>> all(c == s0 for c in s[1:])
True
You can use regex for this:
import re
p = re.compile(r'^(.)\1*$')  # the ur'' prefix is a syntax error in Python 3
re.search(p, "aaaa") # returns a match object
re.search(p, "bbbb") # returns a match object
re.search(p, "aaab") # returns None
Here's an explanation of what this regex pattern means: https://regexper.com/#%5E(.)%5C1*%24
Also possible:
s = "aaaaa"
s.count(s[0]) == len(s)
compare = name == len(name) * name[0]
if compare:
    pass  # all characters are the same
else:
    pass  # not all characters are the same
Here are a couple of ways.
def all_match0(s):
    head, tail = s[0], s[1:]
    return tail == head * len(tail)
def all_match1(s):
    head, tail = s[0], s[1:]
    return all(c == head for c in tail)
all_match = all_match0
data = [
    'aaaaa',
    'bbbb',
    '___',
    '++++',
    'q',
    'aaaaaz',
    'bbbBb',
    '_---',
]
for s in data:
    print(s, all_match(s))
output
aaaaa True
bbbb True
___ True
++++ True
q True
aaaaaz False
bbbBb False
_--- False
all_match0 will be faster unless the string is very long, because its testing loop runs at C speed, but it uses more RAM because it constructs a duplicate string. For very long strings, the time taken to construct the duplicate string becomes significant, and of course it can't do any testing until it creates that duplicate string.
all_match1 should only be slightly slower, even for short strings, and because it stops testing as soon as it finds a mismatch it may even be faster than all_match0, if the mismatch occurs early enough in the string.
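To check these claims on your own machine, a small timeit harness along these lines may help. The test strings and repetition count are arbitrary assumptions, and the timings will vary by system, so no expected numbers are shown.

```python
from timeit import timeit

def all_match0(s):
    head, tail = s[0], s[1:]
    return tail == head * len(tail)

def all_match1(s):
    head, tail = s[0], s[1:]
    return all(c == head for c in tail)

# Hypothetical inputs: a uniform string and one with an early mismatch.
uniform = "a" * 10_000
early_mismatch = "ab" + "a" * 10_000

for name, s in [("uniform", uniform), ("early mismatch", early_mismatch)]:
    t0 = timeit(lambda: all_match0(s), number=100)
    t1 = timeit(lambda: all_match1(s), number=100)
    print(f"{name}: all_match0={t0:.4f}s  all_match1={t1:.4f}s")
```

Both functions raise IndexError on an empty string, because they read s[0] first.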
Try using Counter from collections (high-performance container datatypes).
>>> from collections import Counter
>>> s = 'aaaaaaaaa'
>>> c = Counter(s)
>>> len(c) == 1
True
If you check the code below, I used for loops to check whether, in a set of words, one word is the suffix of another.
My question is, how can I replace the double for loop? The guy who wrote the task mentioned that there is a solution using algorithms (not sure what that is :/ )
def checkio(words):
    if len(words) == 1: return False
    else:
        for w1 in words:
            for w2 in words:
                if w1 == w2:
                    continue
                elif w1.endswith(w2) or w2.endswith(w1): return True
                else: return False
print(checkio({"abc","cba","ba","a","c"})) # prints True in Komodo
print(checkio({"walk", "duckwalk"})) # prints True
Second question:
It appears that the current function doesn't work in every environment.
Can someone point out what I did wrong? It works in my Komodo IDE but won't work on the checkio website.
Here is a link to the task: http://www.checkio.org/mission/end-of-other/
Let Python generate all combinations to be checked:
import itertools
def checkio(data):
    return any((x.endswith(y) or y.endswith(x)) for x, y in itertools.combinations(data, 2))
And let Python test it:
assert checkio({"abc","cba","ba","a","c"}) == True
assert checkio({"walk", "duckwalk"}) == True
assert checkio({"aaa", "bbb"}) == False
Here is a for loop version using itertools.combinations():
import itertools
def checkio(words):
    for w1, w2 in itertools.combinations(words, 2):
        if w1.endswith(w2) or w2.endswith(w1):
            return True
    return False
print(checkio({"abc","cba","ba","a","c"})) # prints True in Komodo only :/
print(checkio({"walk", "duckwalk"})) # prints True
print(checkio({"a", "foo", "bar"})) # prints False
Giving:
True
True
False
If you print each iteration you will see how the combinations() function works, so for the last example you will see it try the following:
a - foo
a - bar
foo - bar
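A quick way to see those pairs for yourself is the snippet below. It uses a list instead of a set so that the order is predictable; with a set, the same pairs appear but possibly in a different order.

```python
from itertools import combinations

words = ["a", "foo", "bar"]  # a list, so the iteration order is fixed
for w1, w2 in combinations(words, 2):
    print(w1, "-", w2)
```

Each unordered pair is produced exactly once, which is why the w1 == w2 check from the nested-loop version is no longer needed.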
Thanks guys, all of your comments and replies helped me look at it differently. I think this code is a lot less bulky, is clearer, and doesn't require a module import:
def checkio(words):
    for w1 in words:
        for w2 in words:
            if w1 != w2 and (w1.endswith(w2) or w2.endswith(w1)):
                return True
    return False
Can use intersections and comprehensions:
def checkio(words):
    for w in words:
        ends = {w[i:] for i in range(1,len(w))}
        if len(words & ends) > 0:
            return True
    return False
Output:
>>> checkio({"walk", "duckwalk"})
True
>>> checkio({"walk", "duckbill"})
False
The way it works is as follows. Suppose that words contains the word 'scared'. When w is 'scared', the set of slices ends becomes {'cared', 'ared', 'red', 'ed', 'd'}. & is Python's set intersection operator. If any word is common to both words and ends, e.g. 'red', this intersection will be nonempty, hence len(words & ends) > 0 will be True, which is then returned as the function value. If the code loops through all words without encountering any for which len(words & ends) > 0, there are no examples of one word being a suffix of another in the set, and False is returned.
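The intermediate values for 'scared' can be inspected directly. The word set here is a made-up example.

```python
words = {"scared", "red", "walk"}  # hypothetical input set
w = "scared"
ends = {w[i:] for i in range(1, len(w))}  # all proper suffixes of w
print(sorted(ends))   # ['ared', 'cared', 'd', 'ed', 'red']
print(words & ends)   # {'red'}
```

Starting the range at 1 excludes the word itself, so a word never counts as its own suffix.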
Python's str.endswith() will do the work.
Sample script:
>>> a = 'hello'
>>> a.endswith('llo')
True
>>> a.endswith('ello')
True
>>> a.endswith('o')
True
>>> a.endswith('lo')
True
>>> a.endswith('ell')
False
Wrapping to a function:
import itertools
def checkio(words):
    words = [ w for w, s in itertools.product(words, words) if w != s and ( w.endswith(s) or s.endswith(w) ) ]
    return False if len(words) == 0 else True
Sample output:
checkio( {"hello", "lo", "he"} ) => True
checkio( {"hello", "la", "hellow", "cow"} ) => False
checkio( {"walk", "duckwalk"} ) => True
checkio( {"one"} ) => False
checkio( {"helicopter", "li", "he"} ) => False
This is a function in a larger program that solves a sudoku puzzle. At this point, I would like the function to return False if there is more than one occurrence of a number, unless the number is zero. What am I missing to achieve this?
l is a list of numbers:
l = [1,0,0,2,3,0,0,8,0]
def alldifferent1D(l):
    for i in range(len(l)):
        if l.count(l[i])>1 and l[i] != 0: #does this do it?
            return False
    return True
Assuming the list is length 9, you can ignore the inefficiency of using count here (Using a helper datastructure - Counter etc probably takes longer than running .count() a few times). You can write the expression to say they are all different more naturally as:
def alldifferent1D(L):
    return all(L.count(x) <= 1 for x in L if x != 0)
This also saves calling count() for all the 0's
>>> from collections import Counter
>>> def all_different(xs):
...     return len(set(Counter(filter(None, xs)).values()) - set([1])) == 0
Tests:
>>> all_different([])
True
>>> all_different([0,0,0])
True
>>> all_different([0,0,1,2,3])
True
>>> all_different([1])
True
>>> all_different([1,2])
True
>>> all_different([0,2,0,1,2,3])
False
>>> all_different([2,2])
False
>>> all_different([1,2,3,2,2,3])
False
So we can break this down into two problems:
Getting rid of the zeros, since we don't care about them.
Checking if there are any duplicate numbers.
Stripping the zeros is easy enough:
filter(lambda a: a != 0, x)
And we can check for duplicates by comparing the length of the list with the length of its set (which keeps only one of each element):
if len(x) == len(set(x)):
    return True
return False
Making these into functions we have:
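def remove_zeros(x):
    # Build a list (not a lazy filter object) so len() works on the result
    return [a for a in x if a != 0]
def duplicates(x):
    # Despite the name, this returns True when x has NO duplicates
    return len(x) == len(set(x))
def alldifferent1D(x):
    return duplicates(remove_zeros(x))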
One way to avoid searching for every entry in every position is to:
def alldifferent1D(l):
    flags = (len(l)+1)*[False]  # assumes every value is between 0 and len(l)
    for cell in l:
        if cell>0:
            if flags[cell]:
                return False
            flags[cell] = True
    return True
The flags list has a True at index k if the value k has been seen before in the list.
I'm sure you could speed this up with list comprehension and an all() or any() test, but this worked well enough for me.
PS: The first intro didn't survive my edit, but this is from a Sudoku solver I wrote years ago. (Python 2.4 or 2.5 iirc)
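The same "seen before" idea can also be sketched with a set instead of a flags list, which removes the assumption that values fit within the list length. The function name here is illustrative.

```python
def all_different(cells):
    """True if no non-zero value appears more than once."""
    seen = set()
    for cell in cells:
        if cell == 0:
            continue  # zeros mark empty cells and may repeat
        if cell in seen:
            return False
        seen.add(cell)
    return True

print(all_different([1, 0, 0, 2, 3, 0, 0, 8, 0]))  # True
print(all_different([0, 2, 0, 1, 2, 3]))           # False
```

Like the flags version, it makes a single pass and stops at the first repeated value.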