Using hashing to find a repeated substring inside a string - python

Given the problem: Find a repeated substring in a string, is it possible to use hashing? I want to create a dictionary with the substrings as keys and the number of repeated instances as values. Here is what I have so far. I am getting an error because I am using a substring as a key for the dictionary. Can anyone spot my mistake? Thank you!!!
def findsubs(str):
    d={}
    for i in range(len(str)-1):
        for j in range(i+2, len(str)-2):
            if d[str[i:j]]>1:
                return str[i:j]
            else:
                d[str[i:j]] = d[str[i:j]] +1
    return 0
print findsubs("abcbc")

The general idea should work. The problem is that looking up a key that isn't in the dictionary raises a KeyError, so you have to check whether the key exists before the lookup and initialize it if it doesn't:
def findsubs(str):
    d={}
    for i in range(len(str)-1):
        for j in range(i+2, len(str)-2):
            if str[i:j] not in d:
                d[str[i:j]] = 0
            if d[str[i:j]]>1:
                return str[i:j]
            else:
                d[str[i:j]] = d[str[i:j]] +1
    return 0
Note that instead of if str[i:j] not in d: d[str[i:j]] = 0, you can do d.setdefault(str[i:j], 0), which sets the value to 0 if the key isn't in the dict, and leaves it unchanged if it is.
A few more comments though:
You should return None, not 0, if you don't find anything.
You shouldn't call a variable str since that shadows the built-in str type.
You want to iterate j until the end of the string.
As written, it'll only return a substring once it's been found 3 times. You can rewrite it using a set of previously found substrings instead:
def findsubs(s):
    found = set()
    for i in range(len(s)-1):
        for j in range(i+2, len(s)+1):
            substr = s[i:j]
            if substr in found:
                return substr
            found.add(substr)
    return None
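A quick way to check the set-based version is to run it on the string from the question. This is only a sketch in Python 3; the name find_repeated and the min_len parameter are my own additions, not part of the original code.

```python
def find_repeated(s, min_len=2):
    """Return the first substring of at least min_len characters seen twice, or None."""
    seen = set()
    for i in range(len(s) - min_len + 1):
        for j in range(i + min_len, len(s) + 1):
            sub = s[i:j]
            if sub in seen:
                return sub
            seen.add(sub)
    return None

print(find_repeated("abcbc"))  # bc
print(find_repeated("abcd"))   # None
```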

You were almost there.
def findsubs(instr):
    d={}
    for i in range(len(instr)):
        for j in range(i+2, len(instr)+1):
            print instr[i:j]
            d[instr[i:j]] = d.get(instr[i:j],0) + 1
    return d
instr = 'abcdbcab'
print instr
print findsubs('abcdbcab')
This will work. I added a print inside the loop for debugging purposes; remove it after you test it.
The result is the dict with the substring counts, as you asked for :)
{'abcd': 1, 'ab': 2, 'cdb': 1, 'dbc': 1, 'cdbcab': 1, 'cd': 1, 'abc': 1, 'cdbc': 1, 'bcab': 1, 'abcdbc': 1, 'ca': 1,
'dbca': 1, 'bc': 2, 'dbcab': 1, 'db': 1, 'cab': 1, 'bcdbcab': 1, 'bcdbc': 1, 'abcdbca': 1, 'cdbca': 1, 'abcdbcab': 1,
'bcdb': 1, 'bcd': 1, 'abcdb': 1, 'bca': 1, 'bcdbca': 1}
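If you only care about the substrings that actually repeat, you can filter the counts afterwards. A minimal Python 3 sketch, assuming substrings of at least two characters as in the code above; count_substrings and min_len are names I made up for illustration.

```python
def count_substrings(s, min_len=2):
    """Count every substring of s that has at least min_len characters."""
    counts = {}
    for i in range(len(s)):
        for j in range(i + min_len, len(s) + 1):
            sub = s[i:j]
            counts[sub] = counts.get(sub, 0) + 1
    return counts

counts = count_substrings("abcdbcab")
# Keep only the substrings that occur more than once.
repeated = {sub: n for sub, n in counts.items() if n > 1}
print(repeated)  # {'ab': 2, 'bc': 2}
```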

Related

Return a string value based on the values in the list in Python

After generating a random list with 0s and 1s
decision = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0]
I want to generate another list which contains 'pass' where the values in decision are 1, and 'fail' where the values are 0
['fail', 'fail', 'pass', 'pass', 'pass', 'fail', 'fail', 'pass',....'fail']
I tried list comprehension using,
newlist = ["pass" for k in decision if k == 0]
but I could not think of a way to integrate else condition if k==1.
Please help.
Use the condition in the value part of the comprehension:
newlist = ["pass" if k == 1 else "fail" for k in decision]
Alternatively, in case you have more values create a dictionary:
res_dict = {
    0 : "Equal",
    1 : "Higher",
    -1 : "Lower",
}
newlist = [res_dict.get(x) for x in decision]
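Applied to the pass/fail case from the question, the dictionary approach might look like the sketch below. The name labels is my own; the decision list is the one from the question.

```python
decision = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0]

# Map each possible value to its label, then look every element up.
labels = {0: "fail", 1: "pass"}
newlist = [labels[k] for k in decision]

print(newlist[:5])  # ['fail', 'fail', 'pass', 'pass', 'pass']
```

Indexing with labels[k] raises a KeyError on an unexpected value, whereas .get(k) would silently give None; which one you want depends on how much you trust the input.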
I know my answer is not what you want, but I believe it will be easier if you just use True or False. Here is the code:
decision = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0]
result = [d == 1 for d in decision] # 1 becomes True and 0 becomes False
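# mylist is the input list (decision in the question)
counter = 0
otherlist = [None] * len(mylist)  # pre-size the list so index assignment works
for element in mylist:
    if element == 0:
        otherlist[counter] = "fail"
    else:
        otherlist[counter] = "pass"
    counter += 1
It doesn't use a comprehension, but it'll do the trick. Hope this helps.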
A simpler option is to append to the list:
otherlist = []
for element in mylist:
    if element == 0:
        otherlist.append("fail")
    else:
        otherlist.append("pass")
You can also allow 0s to represent False and 1s to represent True:
otherlist = []
for element in mylist:
    if element == 0:
        otherlist.append(False)
    else:
        otherlist.append(True)

Incrementing a dictionary across multiple strings - not using any import functions

I am trying to increment a dictionary in Python 3, counting the number of occurrences of certain characters in a string.
I cannot use any import/count function to do so, and must iterate through one character at a time. All other queries on here seem to use an import functionality to answer the questions.
So far I have a dictionary that doesn't count duplicate characters, nor does it update when presented with a new string:
def IncrementalCount(counts, line, charList):
    counts={}
    for letter in charList:
        if letter in line:
            if letter in counts.keys():
                counts[letter] += 1
            else:
                counts[letter] = 1
        if letter not in line:
            counts[letter] = 0
    return counts
counts = {}
counts= IncrementalCount(counts,"{hello!{}","}{#")
print(counts)
counts= IncrementalCount(counts,"#Goodbye!}","}{#!#")
print(counts)
current result
{'}': 1, '{': 1, '#': 0}
{'}': 1, '{': 0, '#': 1, '!': 1, '#': 0}
desired result
{'}': 1, '{': 2, '#': 0}
{'}': 2, '{': 2, '#': 1, '!': 1, '#': 0}
Any help would be really appreciated on what edits I need to make. I don't understand why my "counts[letter] +=1" doesn't count duplicate entries.
You iterate over every letter and test whether the letter is in line, but you never count how many times it occurs there, so each letter contributes at most 1. The function also rebinds counts to an empty dict on every call, so nothing carries over from one string to the next.
Nevertheless, you make things too complex. We can construct a dictionary by using line.count(..) instead and write this as a dictionary comprehension:
def IncrementalCount(line, charList):
    return { letter: line.count(letter) for letter in charList }
This then produces:
>>> IncrementalCount("{hello!{}","}{#")
{'}': 1, '{': 2, '#': 0}
>>> IncrementalCount("#Goodbye!}","}{#!#")
{'}': 1, '{': 0, '#': 1, '!': 1}
In case we wish to increment an existing dictionary, we can use:
def IncrementalCount(counts, line, charList):
    for letter in charList:
        counts[letter] += line.count(letter)
Or, in case it is possible that not all keys are present, we can use for instance a defaultdict (which is usually a more compact and efficient way), or we can initialize missing keys with setdefault:
def IncrementalCount(counts, line, charList):
    for letter in charList:
        counts.setdefault(letter, 0)
        counts[letter] += line.count(letter)
Or we can use .get(..) to retrieve it with a default value, but usually a defaultdict is a better design decision here.
N.B.: usually function names in Python are all lower case and underscores (_) are used for spaces, so here it would be Pythonic to name it incremental_count.
N.B.: there exist more efficient ways to count: right now we iterate through the entire string for every character.
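For what it's worth, here is a rough sketch of the defaultdict idea that walks through the line only once and does not call count. The names incremental_count, char_list and wanted are mine. It also treats a character listed twice in char_list as a single key, which is an assumption about what the asker wants.

```python
from collections import defaultdict

def incremental_count(counts, line, char_list):
    """Add the occurrences of the characters in char_list found in line to counts."""
    wanted = set(char_list)       # duplicates in char_list collapse to one key
    for letter in wanted:
        counts[letter] += 0       # make sure every requested key exists
    for ch in line:               # single pass over the line
        if ch in wanted:
            counts[ch] += 1
    return counts

counts = defaultdict(int)
incremental_count(counts, "{hello!{}", "}{#")
incremental_count(counts, "#Goodbye!}", "}{#!#")
print(dict(counts))
```

After both calls the totals should be 2 for '}', 2 for '{', 1 for '#' and 1 for '!'.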
Try this:
def IncrementalCount(counts, line, charList):
    for letter in charList:
        counts[letter]=counts.get(letter,0)+line.count(letter)
    return counts
counts = {}
counts= IncrementalCount(counts,"{hello!{}","}{#")
print(counts)
counts= IncrementalCount(counts,"#Goodbye!}","}{#!#")
print(counts)
Output:
{'}': 1, '{': 2, '#': 0}
{'}': 2, '{': 2, '#': 2, '!': 1}
Note that '#' appears twice in the second charList, so it is counted twice in that call.

Word frequency in a string without spaces and with special characters?

Let's say I have the following string:
"hello&^uevfehello!`.<hellohow*howdhAreyou"
How would I go about counting the frequency of english words that are substrings of it? In this case I would want a result such as:
{'hello': 3, 'how': 2, 'are': 1, 'you': 1}
I searched previous questions which were similar to this one, but I couldn't really find anything that works. A close solution seemed to be using regular expressions, but it didn't work either. It might be because I was implementing it wrong, since I'm not familiar with how it actually works.
How to find the count of a word in a string?
it's the last answer
from collections import *
import re
Counter(re.findall(r"[\w']+", text.lower()))
I also tried creating a very bad function that iterates through every single possible arrangement of consecutive letters in the string (up to a max of 8 letters or so). The problem with doing that is
1) it's way longer than it should be and
2) it adds extra words. ex: if "hello" was in the string, "hell" would also be found.
I'm not very familiar with regex which is probably the right way to do this.
d, w = "hello&^uevfehello!`.<hellohow*howdhAreyou", ["hello","how","are","you"]
import re, collections
pattern = re.compile("|".join(w), flags = re.IGNORECASE)
print collections.Counter(pattern.findall(d))
Output
Counter({'hello': 3, 'how': 2, 'you': 1, 'Are': 1})
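A variation on the same regex idea, sketched for Python 3: it lowercases each match so that 'Are' is counted as 'are', escapes the words, and tries longer words first so that a word which is a prefix of another does not shadow it. The helper name count_words is my own, and the word list is assumed to be known in advance.

```python
import re
from collections import Counter

def count_words(text, words):
    """Count case-insensitive, non-overlapping occurrences of the given words."""
    # Longest first, so the alternation prefers e.g. 'hello' over 'hell'.
    ordered = sorted(words, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in ordered), re.IGNORECASE)
    return Counter(match.lower() for match in pattern.findall(text))

text = "hello&^uevfehello!`.<hellohow*howdhAreyou"
print(count_words(text, ["hello", "how", "are", "you"]))
```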
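from collections import defaultdict
# is_english_word is assumed to be defined elsewhere (e.g. a dictionary lookup)
s = 'hello&^uevfehello!`.<hellohow*howdhAreyou'
word_counts = defaultdict(lambda: 0)
i = 0
while i < len(s):
    # try the longest candidate starting at i first, then shrink it
    j = len(s)
    while j > i:
        if is_english_word(s[i:j]):
            word_counts[s[i:j]] += 1
            break
        j -= 1
    if j == i:
        i += 1
    else:
        i = j
print word_counts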
You need to split the string into chunks of word characters, then generate the substrings of each chunk and check whether each substring is an English word. I have used the English dictionary from the answer in How to check if a word is an English word with Python?
There are some false positives in the result, however, so you may want to use a better dictionary or a custom method to check for the desired words.
import re
import enchant
from collections import defaultdict
# Get all substrings in given string.
def get_substrings(string):
    for i in range(0, len(string)):
        for j in range(i, len(string)):
            yield string[i:j+1]
text = "hello&^uevfehello!`.<hellohow*howdhAreyou"
strings = re.split(r"[^\w']+", text.lower())
# Use english dictionary to check if a word exists.
dictionary = enchant.Dict("en_US")
counts = defaultdict(int)
for s in strings:
    for word in get_substrings(s):
        if (len(word) > 1 and dictionary.check(word)):
            counts[word] += 1
print counts
Output:
defaultdict(<type 'int'>, {'are': 1, 'oho': 1, 'eh': 1, 'ell': 3,
'oh': 1, 'lo': 3, 'll': 3, 'yo': 1, 'how': 2, 'hare': 1, 'ho': 2,
'ow': 2, 'hell': 3, 'you': 1, 'ha': 1, 'hello': 3, 're': 1, 'he': 3})

How to return the number of characters whose frequency is above a threshold

How do I print the number of upper case characters whose frequency is above a threshold (in the tutorial)?
The homework question is:
Your task is to write a function which takes as input a single non-negative number and returns (not print) the number of characters in the tally whose count is strictly greater than the argument of the function. Your function should be called freq_threshold.
My answer is:
mobyDick = "Blah blah A B C A RE."
def freq_threshold(threshold):
    tally = {}
    for char in mobyDick:
        if char in tally:
            tally[char] += 1
        else:
            tally[char] = 1
    for key in tally.keys():
        if key.isupper():
            print tally[key],tally.keys
            if threshold>tally[key]:return threshold
            else:return tally[key]
It doesn't work, but I don't know where it is wrong.
Your task is to return the number of characters that satisfy the condition. You're trying to return the count of occurrences of some character. Try this:
    result = 0
    for key in tally.keys():
        if key.isupper() and tally[key] > threshold:
            result += 1
    return result
You can make this code more Pythonic. I wrote it this way to make it clearer.
The part where you tally up the number of each character is fine:
>>> pprint.pprint ( tally )
{' ': 5,
 '.': 1,
 'A': 2,
 'B': 2,
 'C': 1,
 'E': 1,
 'R': 1,
 'a': 2,
 'b': 1,
 'h': 2,
 'l': 2,
 '\x80': 2,
 '\xe3': 1}
The error is in how you are summarising the tally.
Your assignment asked you to return the number of characters occurring more than n times in the string.
What you are returning is either n or the number of times one particular character occurred.
You instead need to step through your tally of characters and character counts, and count how many characters have frequencies exceeding n.
Do not reinvent the wheel, but use a counter object, e.g.:
>>> from collections import Counter
>>> mobyDick = "Blah blah A B C A RE."
>>> c = Counter(mobyDick)
>>> c
Counter({' ': 6, 'a': 2, 'B': 2, 'h': 2, 'l': 2, 'A': 2, 'C': 1, 'E': 1, '.': 1, 'b': 1, 'R': 1})
from collections import Counter
def freq_threshold(s, n):
    cnt = Counter(s)
    return [i for i in cnt if cnt[i]>n and i.isupper()]
To reinvent the wheel:
def freq_threshold(s, n):
    d = {}
    for i in s:
        d[i] = d.get(i, 0)+1
    return [i for i in d if d[i]>n and i.isupper()]
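Note that the two functions above return the list of matching characters, while the assignment asks for how many there are. A possible sketch that returns the number instead is below. Passing the text in as a parameter is my own assumption (the assignment's function takes only the threshold), and the names are illustrative.

```python
from collections import Counter

def freq_threshold(text, threshold):
    """Return how many upper-case characters occur strictly more than threshold times."""
    tally = Counter(text)
    return sum(1 for ch, n in tally.items() if ch.isupper() and n > threshold)

moby_dick = "Blah blah A B C A RE."
print(freq_threshold(moby_dick, 1))  # 2  (A and B)
print(freq_threshold(moby_dick, 0))  # 5  (A, B, C, R, E)
```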

Python: is index() buggy at all?

I'm working through this thing on pyschools and it has me mystified.
Here's the code:
def convertVector(numbers):
    totes = []
    for i in numbers:
        if i!= 0:
            totes.append((numbers.index(i),i))
    return dict((totes))
It's supposed to take a 'sparse vector' as input (ex: [1, 0, 1 , 0, 2, 0, 1, 0, 0, 1, 0])
and return a dict mapping the index of each non-zero entry to its value.
So a dict with 0: 1, 2: 1, etc., where each key is the index of a non-zero item in the list and each value is the item itself.
So for the example numbers it wants this: {0: 1, 9: 1, 2: 1, 4: 2, 6: 1}
but instead gives me this: {0: 1, 4: 2}. Before it's turned into a dict, it looks like this:
[(0, 1), (0, 1), (4, 2), (0, 1), (0, 1)]
My plan is for i to iterate through numbers, create a tuple of that number and its index, and then turn that into a dict. The code seems straightforward, I'm at a loss.
It just looks to me like numbers.index(i) is not returning the index, but instead returning some other, unexpected number.
Is my understanding of index() defective? Are there known index issues?
Any ideas?
index() only returns the first match:
>>> a = [1,2,3,3]
>>> help(a.index)
Help on built-in function index:
index(...)
    L.index(value, [start, [stop]]) -> integer -- return first index of value.
    Raises ValueError if the value is not present.
If you want both the number and the index, you can take advantage of enumerate:
>>> for i, n in enumerate([10,5,30]):
...     print i,n
...
0 10
1 5
2 30
and modify your code appropriately:
def convertVector(numbers):
    totes = []
    for i, number in enumerate(numbers):
        if number != 0:
            totes.append((i, number))
    return dict((totes))
which produces
>>> convertVector([1, 0, 1 , 0, 2, 0, 1, 0, 0, 1, 0])
{0: 1, 9: 1, 2: 1, 4: 2, 6: 1}
[Although, as someone pointed out though I can't find it now, it'd be easier to write totes = {} and assign to it directly using totes[i] = number than go via a list.]
What you're trying to do could be done in one line:
>>> dict((index,num) for index,num in enumerate(numbers) if num != 0)
{0: 1, 2: 1, 4: 2, 6: 1, 9: 1}
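To see the difference side by side, here is a small Python 3 sketch using the list from the question. The name convert_vector and the dict comprehension are my own phrasing of the same idea.

```python
numbers = [1, 0, 1, 0, 2, 0, 1, 0, 0, 1, 0]

# list.index always reports the FIRST position of an equal value,
# so every 1 maps back to position 0.
first_positions = [numbers.index(n) for n in numbers if n != 0]
print(first_positions)  # [0, 0, 4, 0, 0]

def convert_vector(numbers):
    """Map the index of each non-zero entry to its value."""
    return {i: n for i, n in enumerate(numbers) if n != 0}

print(convert_vector(numbers))  # {0: 1, 2: 1, 4: 2, 6: 1, 9: 1}
```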
Yes, your understanding of list.index is incorrect. It finds the position of the first item in the list which compares equal with the argument.
To get the index of the current item, you want to iterate over the list with enumerate:
for index, item in enumerate(iterable):
    # blah blah
The problem is that .index() looks for the first occurrence of a certain argument. So for your example it always returns 0 if you run it with argument 1.
You could make use of the built-in enumerate function like this:
for index, value in enumerate(numbers):
    if value != 0:
        totes.append((index, value))
Check the documentation for index:
Return the index in the list of the first item whose value is x. It is
an error if there is no such item.
According to this definition, the following code appends, for each non-zero value in numbers, a tuple made of the first position of this value in the whole list and the value itself.
totes = []
for i in numbers:
    if i!= 0:
        totes.append((numbers.index(i),i))
The result in the totes list is correct: [(0, 1), (0, 1), (4, 2), (0, 1), (0, 1)].
When turning it into a dict, again, the result is correct, since for each possible value, you get the position of its first occurrence in the original list.
You would get the result you want using i as the index instead:
result = {}
for i in range(len(numbers)):
    if numbers[i] != 0:
        result[i] = numbers[i]
index() returns the index of the first occurrence of the item in the list. Your list has duplicates which is the cause of your confusion. So index(1) will always return 0. You can't expect it to know which of the many instances of 1 you are looking for.
I would write it like this:
totes = {}
for i, num in enumerate(numbers):
    if num != 0:
        totes[i] = num
and avoid the intermediate list altogether.
Riffing on @DSM:
def convertVector(numbers):
    return dict((i, number) for i, number in enumerate(numbers) if number)
Or, on re-reading, as @Rik Poggi actually suggests.
