Finding most sequences of specified length

I'm trying to write python code that will take a string and a length, and search through the string to tell me which sub-string of that particular length occurs the most, prioritizing the first if there's a tie.
For example, "cadabra abra" 2 should return ab
I tried:
import sys
def main():
    inputstring = str(sys.argv[1])
    length = int(sys.argv[2])
    Analyze(inputstring, length)
def Analyze(inputstring, length):
    count = 0;
    runningcount = -1;
    sequence = ""
    substring = ""
    for i in range(0, len(inputstring)):
        substring = inputstring[i:i+length]
        for j in range(i+length,len(inputstring)):
            #print(runningcount)
            if inputstring[j:j+2] == substring:
                print("runcount++")
                runningcount += 1
        print(runningcount)
        if runningcount > count:
            count = runningcount
            sequence = substring
    print(sequence)
main()
But can't seem to get it to work. I know I'm at least doing something wrong with the counts, but I'm not sure what. This is my first program in Python too, but I think my problem is probably more with the algorithm than the syntax.

Try to use the built-in methods; they will make your life easier. For example:
>>> s = "cadabra abra"
>>> x = 2
>>> l = [s[i:i+x] for i in range(len(s)-x+1)]
>>> l
['ca', 'ad', 'da', 'ab', 'br', 'ra', 'a ', ' a', 'ab', 'br', 'ra']
>>> max(l, key=lambda m:s.count(m))
'ab'
EDIT:
A much simpler syntax, as per Stefan Pochmann's comment:
>>> max(l, key=s.count)
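One caveat worth checking with this approach: str.count counts non-overlapping occurrences, so it can undercount when a substring overlaps itself. The sketch below compares it with a window-by-window count; count_overlapping is a hypothetical helper name, not part of the answer above.

```python
def count_overlapping(s, sub):
    """Count every position in s where sub starts, overlaps included."""
    return sum(1 for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i))

# str.count skips past each match, so overlapping matches are missed
print("aaaa".count("aa"))
print(count_overlapping("aaaa", "aa"))
```

For the example in the question the two counts agree, because "ab" cannot overlap itself.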

import sys
from collections import OrderedDict
def main():
    inputstring = sys.argv[1]
    length = int(sys.argv[2])
    analyze(inputstring, length)
def analyze(inputstring, length):
    d = OrderedDict()
    for i in range(0, len(inputstring) - length + 1):
        substring = inputstring[i:i+length]
        if substring in d:
            d[substring] += 1
        else:
            d[substring] = 1
    maxlength = max(d.values())
    for k,v in d.items():
        if v == maxlength:
            print(k)
            break
main()

Pretty good stab at a solution for a first Python program. As you learn the language, spend some time reading the excellent documentation. It is full of examples and tips.
For example, the standard library includes a Counter class for counting things (obviously) and an OrderedDict class which remembers the order in which keys are entered. But the documentation includes an example that combines the two to make an OrderedCounter, which can be used to solve your problem like this:
from collections import Counter, OrderedDict
class OrderedCounter(Counter, OrderedDict):
    pass
def analyze(s, n):
    substrings = (s[i:i+n] for i in range(len(s)-n+1))
    counts = OrderedCounter(substrings)
    return max(counts.keys(), key=counts.__getitem__)
analyze("cadabra abra", 2)
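A possible simplification, assuming Python 3.7 or later, where plain dicts (and therefore Counter) keep insertion order: max returns the first maximal key it meets, so ties go to the substring seen first. The function name most_frequent_substring is only illustrative.

```python
from collections import Counter

def most_frequent_substring(s, n):
    """Return the most frequent length-n substring, earliest one on ties."""
    # Slide a window of width n over s and count each window once
    counts = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    if not counts:  # n is longer than s, so there are no windows
        return None
    # max() keeps the first key with the highest count (insertion order)
    return max(counts, key=counts.get)

print(most_frequent_substring("cadabra abra", 2))
```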

Related

How to make a python script that gives you every iteration of a string from a pattern

So I'm trying to make a python script that takes a pattern (ex: c**l) where it'll return every iteration of the string (* = any character in the alphabet)...
So, we get something like: caal, cbal, ccal and so forth.
I've tried using the itertools library's product but I haven't been able to make it work properly. So after 2 hours I've decided to turn to Stack Overflow.
Here's my current code. It's not complete since I feel stuck
alphabet = list('abcdefghijklmnopqrstuvwxyz')
wildChar = False
tmp_string = ""
combinations = []
if '*' in pattern:
    wildChar = True
    tmp_string = pattern.replace('*', '', pattern.count('*')+1)
if wildChar:
    tmp = []
    for _ in range(pattern.count('*')):
        tmp.append(list(product(tmp_string, alphabet)))
    for array in tmp:
        for instance in array:
            combinations.append("".join(instance))
    tmp = []
print(combinations)

You could try:
from itertools import product
from string import ascii_lowercase
pattern = "c**l"
repeat = pattern.count("*")
pattern = pattern.replace("*", "{}")
for letters in product(ascii_lowercase, repeat=repeat):
    print(pattern.format(*letters))
Result:
caal
cabl
cacl
...
czxl
czyl
czzl

Use itertools.product
import itertools
import string
s = 'c**l'
l = [c if c != '*' else string.ascii_lowercase for c in s]
out = [''.join(c) for c in itertools.product(*l)]
Output:
>>> out
['caal',
'cabl',
'cacl',
'cadl',
'cael',
'cafl',
'cagl',
'cahl',
'cail',
'cajl'
...
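The same idea can be wrapped in a generator so that results are produced lazily instead of being collected in one list. This is only a sketch; the name expand and its wildcard and alphabet parameters are assumptions made for illustration.

```python
from itertools import product
from string import ascii_lowercase

def expand(pattern, wildcard="*", alphabet=ascii_lowercase):
    """Yield every string obtained by filling each wildcard with a letter."""
    # One pool of choices per position: the whole alphabet for a wildcard,
    # otherwise just the literal character itself
    pools = [alphabet if ch == wildcard else ch for ch in pattern]
    for combo in product(*pools):
        yield "".join(combo)

words = list(expand("c**l"))
print(len(words), words[0], words[-1])
```

With two wildcards and 26 letters this yields 26 * 26 = 676 strings; the count grows exponentially with the number of wildcards, which is why a generator may be preferable to a list.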

Getting All Combinations or Permutations of a Character using terms

I want all distinct arrangements of a word's characters, given how many times each character occurs. Example:
word = 'aan'
result = ['ana', 'naa', 'aan']
terms :
number of character 'a' -> 2
number of character 'n' -> 1

I tried a one-liner solution that gives the result as a list.
You can use permutations from the itertools package to get all of the permutations (not combinations):
from itertools import permutations
word = 'aan'
list(set([ ''.join(list(i)) for i in permutations(word,len(word))]))

If I understand correctly what you want, I would do it like this:
from itertools import permutations
result = set()
for combination in permutations("aan"):
    result.add("".join(combination))

You can use recursion with a generator:
from collections import Counter
def combo(d, c = []):
    if len(c) == len(d):
        yield ''.join(c)
    else:
        _c1, _c2 = Counter(d), Counter(c)
        for i in d:
            if _c2.get(i, 0) < _c1[i]:
                yield from combo(d, c+[i])
word = 'aan'
print(list(set(combo(word))))
Output:
['aan', 'naa', 'ana']
word = 'ain'
print(list(set(combo(word))))
Output:
['ina', 'nia', 'nai', 'ani', 'ian', 'ain']
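All of the approaches above generate duplicate arrangements and then discard them with set. A possible alternative, sketched here under the assumption that sorted output is acceptable, is to backtrack over the character counts so that each distinct arrangement is produced exactly once. The name distinct_permutations is hypothetical.

```python
from collections import Counter

def distinct_permutations(word):
    """Yield each distinct arrangement of word's characters once, in sorted order."""
    counts = Counter(word)  # remaining uses per character
    n = len(word)

    def build(prefix):
        if len(prefix) == n:
            yield "".join(prefix)
            return
        for ch in sorted(counts):
            if counts[ch] > 0:
                counts[ch] -= 1      # use one copy of ch
                prefix.append(ch)
                yield from build(prefix)
                prefix.pop()         # undo the choice
                counts[ch] += 1

    yield from build([])

print(list(distinct_permutations("aan")))
```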

Remove N consecutive repeated characters in a string

I am trying to solve a problem where the user inputs a string say str = "aaabbcc" and an integer n = 2.
So the function is supposed to remove the characters that appear 'n' times in a row from str and output only "aaa".
I tried a couple of approaches but I'm not able to obtain the right output.
Are there any regular expression functions that I could use, or any recursive functions, or just plain old iteration?
Thanks in advance.

Using itertools.groupby
Ex:
from itertools import groupby
s = "aaabbcc"
n = 2
result = ""
for k, v in groupby(s):
    value = list(v)
    if not len(value) == n:
        result += "".join(value)
print(result)
Output:
aaa

You can use itertools.groupby:
>>> s = "aaabbccddddddddddeeeee"
>>> from itertools import groupby
>>> n = 3
>>> groups = (list(values) for _, values in groupby(s))
>>> "".join("".join(v) for v in groups if len(v) < n)
'bbcc'
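Note that the two groupby answers above filter differently: the first drops runs of exactly n characters, the second drops runs of n or more. A small sketch that makes the choice explicit; remove_runs and its at_least flag are hypothetical names introduced here.

```python
from itertools import groupby

def remove_runs(s, n, at_least=False):
    """Drop runs of exactly n equal characters (or n and longer if at_least)."""
    kept = []
    for ch, group in groupby(s):
        length = sum(1 for _ in group)  # size of this consecutive run
        drop = length >= n if at_least else length == n
        if not drop:
            kept.append(ch * length)
    return "".join(kept)

print(remove_runs("aaabbcc", 2))
print(remove_runs("aaabbccddddddddddeeeee", 3, at_least=True))
```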

from collections import Counter
counts = Counter(string)
string = "".join(c for c in string if counts[c] != 2)
Edit: Wait, sorry, I missed "consecutive". This will remove characters that occur exactly two times in the whole string (fitting your example, but not the general case).
Consecutive filter is a bit more complex, but doable - just find the consecutive runs first, then filter out the ones which have length two.
runs = [[string[0], 0]]
for c in string:
    if c == runs[-1][0]:
        runs[-1][1] += 1
    else:
        runs.append([c, 1])
string = "".join(c*length for c,length in runs if length != 2)
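Edit2: As the other answers correctly point out, the first part of this is done natively by groupby. Note that groupby yields each character together with an iterator over its run, so the run length has to be computed:
from itertools import groupby
runs = ((c, len(list(group))) for c, group in groupby(string))
string = "".join(c*length for c,length in runs if length != 2)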

In [15]: some_string = 'aaabbcc'
In [16]: n = 2
In [17]: final_string = ''
In [18]: for k, v in Counter(some_string).items():
    ...:     if v != n:
    ...:         final_string += k * v
    ...:
In [19]: final_string
Out[19]: 'aaa'
You'll need: from collections import Counter

from collections import defaultdict
def fun(string,n):
    dic = defaultdict(int)
    for i in string:
        dic[i]+=1
    check = []
    for i in dic:
        if dic[i]==n:
            check.append(i)
    for i in check:
        del dic[i]
    return dic
string = "aaabbcc"
n = 2
result = fun(string, n)
sol =''
for i in result:
    sol+=i*result[i]
print(sol)
output
aaa

split sentence without space in python (nltk?)

I have a set of concatenated words and I want to split them into arrays.
For example :
split_word("acquirecustomerdata")
=> ['acquire', 'customer', 'data']
I found pyenchant, but it's not available for 64-bit Windows.
Then I tried to split each string into substrings and compare them to WordNet to find an equivalent word.
For example :
from nltk import wordnet as wn
def split_word(self, word):
    result = list()
    while(len(word) > 2):
        i = 1
        found = True
        while(found):
            i = i + 1
            synsets = wn.synsets(word[:i])
            for s in synsets:
                if edit_distance(s.name().split('.')[0], word[:i]) == 0:
                    found = False
                    break;
        result.append(word[:i])
        word = word[i:]
    print(result)
But this solution is not reliable and takes too long.
So I'm looking for your help.
Thank you

Check the Word Segmentation Task from Norvig's work.
from __future__ import division
from collections import Counter
import re, nltk
WORDS = nltk.corpus.brown.words()
COUNTS = Counter(WORDS)
def pdist(counter):
    "Make a probability distribution, given evidence from a Counter."
    N = sum(counter.values())
    return lambda x: counter[x]/N
P = pdist(COUNTS)
def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return product(P(w) for w in words)
def product(nums):
    "Multiply the numbers together. (Like `sum`, but with multiplication.)"
    result = 1
    for x in nums:
        result *= x
    return result
def splits(text, start=0, L=20):
    "Return a list of all (first, rest) pairs; start <= len(first) <= L."
    return [(text[:i], text[i:])
            for i in range(start, min(len(text), L)+1)]
def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text:
        return []
    else:
        candidates = ([first] + segment(rest)
                      for (first, rest) in splits(text, 1))
        return max(candidates, key=Pwords)
print(segment('acquirecustomerdata'))
#['acquire', 'customer', 'data']
For a better solution than this, you can use bigrams/trigrams.
More examples at: Word Segmentation Task
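The recursive segment above re-solves the same suffixes many times, so caching the result per suffix can help on longer inputs. The sketch below shows that idea without needing NLTK. The word counts in TOY_COUNTS and the penalty for unknown words are made-up assumptions standing in for a real corpus, and log probabilities are used to avoid multiplying many tiny numbers.

```python
import math
from functools import lru_cache

# Hypothetical toy counts; a real run would use corpus frequencies instead
TOY_COUNTS = {"acquire": 5, "customer": 8, "custom": 3, "data": 10, "a": 20, "at": 6}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = 20

def log_prob(word):
    """Log10 probability of a word; unknown words get a length-based penalty."""
    if word in TOY_COUNTS:
        return math.log10(TOY_COUNTS[word] / TOTAL)
    return -9.0 - len(word)  # assumed penalty, not derived from data

@lru_cache(maxsize=None)
def best_split(text):
    """Return (score, words) for the most probable segmentation of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, min(len(text), MAX_WORD_LEN) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = best_split(rest)  # cached per suffix
        candidates.append((log_prob(first) + rest_score, (first,) + rest_words))
    return max(candidates, key=lambda c: c[0])

def segment(text):
    return list(best_split(text)[1])

print(segment("acquirecustomerdata"))
```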

There is a library called "wordsegment" that you can use: http://www.grantjenks.com/docs/wordsegment/
pip install wordsegment
import wordsegment
from wordsegment import load, segment
load()
segment("acquirecustomerdata")
Output:
['acquire', 'customer', 'data']

If you have a list of all possible words, you can use something like this:
import re
word_list = ["go", "walk", "run", "jump"] # list of all possible words
pattern = re.compile("|".join("%s" % word for word in word_list))
s = "gowalkrunjump"
result = re.findall(pattern, s)
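Two details may matter with a real word list: regex alternation tries alternatives from left to right, so a short word can shadow a longer one that starts with it, and words may contain characters that are special in regular expressions. A hedged sketch that sorts longer words first and escapes each one; split_known_words is a hypothetical name.

```python
import re

def split_known_words(text, words):
    """Split text into known words, preferring the longest match at each position."""
    # Longest first, so that e.g. "gone" is tried before "go"
    ordered = sorted(words, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in ordered))
    return pattern.findall(text)

print(split_known_words("gowalkrunjump", ["go", "walk", "run", "jump"]))
print(split_known_words("gonewalk", ["go", "gone", "walk"]))
```

Like the original, this silently skips any characters that do not belong to a known word.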

Finding the most frequent character in a string

I found this programming problem while looking at a job posting on SO. I thought it was pretty interesting and as a beginner Python programmer I attempted to tackle it. However I feel my solution is quite...messy...can anyone make any suggestions to optimize it or make it cleaner? I know it's pretty trivial, but I had fun writing it. Note: Python 2.6
The problem:
Write pseudo-code (or actual code) for a function that takes in a string and returns the letter that appears the most in that string.
My attempt:
import string
def find_max_letter_count(word):
    alphabet = string.ascii_lowercase
    dictionary = {}
    for letters in alphabet:
        dictionary[letters] = 0
    for letters in word:
        dictionary[letters] += 1
    dictionary = sorted(dictionary.items(),
                        reverse=True,
                        key=lambda x: x[1])
    for position in range(0, 26):
        print dictionary[position]
        if position != len(dictionary) - 1:
            if dictionary[position + 1][1] < dictionary[position][1]:
                break
find_max_letter_count("helloworld")
Output:
>>>
('l', 3)
Updated example:
find_max_letter_count("balloon")
>>>
('l', 2)
('o', 2)

There are many ways to do this shorter. For example, you can use the Counter class (in Python 2.7 or later):
import collections
s = "helloworld"
print(collections.Counter(s).most_common(1)[0])
If you don't have that, you can do the tally manually (2.5 or later has defaultdict):
d = collections.defaultdict(int)
for c in s:
    d[c] += 1
print(sorted(d.items(), key=lambda x: x[1], reverse=True)[0])
Having said that, there's nothing too terribly wrong with your implementation.

If you are using Python 2.7 or later, you can do this quickly with the collections module.
collections is a high-performance data structures module. Read more at
http://docs.python.org/library/collections.html#counter-objects
>>> from collections import Counter
>>> x = Counter("balloon")
>>> x
Counter({'o': 2, 'a': 1, 'b': 1, 'l': 2, 'n': 1})
>>> x['o']
2

Here is a way to find the most common character using a dictionary
message = "hello world"
d = {}
letters = set(message)
for l in letters:
    d[message.count(l)] = l  # map each count to one character with that count
best = max(d)  # the highest count
print(d[best], best)

Here's a way using a for loop and count()
w = input()
r = 0
s = ""
for i in w:
    p = w.count(i)
    if p > r:
        r = p
        s = i
print(s)

The way I did it uses no counting helpers such as Counter or str.count, only for-loops and if-statements.
def most_common_letter():
    string = str(input())
    letters = set(string)
    if " " in letters: # If you want to count spaces too, ignore this if-statement
        letters.remove(" ")
    max_count = 0
    freq_letter = []
    for letter in letters:
        count = 0
        for char in string:
            if char == letter:
                count += 1
        if count == max_count:
            max_count = count
            freq_letter.append(letter)
        if count > max_count:
            max_count = count
            freq_letter.clear()
            freq_letter.append(letter)
    return freq_letter, max_count
This ensures you get every letter/character that gets used the most, and not just one. It also returns how often it occurs. Hope this helps :)

If you want to have all the characters with the maximum number of counts, then you can do a variation on one of the two ideas proposed so far:
import heapq # Helps finding the n largest counts
import collections
def find_max_counts(sequence):
    """
    Returns an iterator that produces the (element, count)s with the
    highest number of occurrences in the given sequence.
    In addition, the elements are sorted.
    """
    if len(sequence) == 0:
        return
    counter = collections.defaultdict(int)
    for elmt in sequence:
        counter[elmt] += 1
    counts_heap = [
        (-count, elmt) # The largest elmt counts are the smallest elmts
        for (elmt, count) in counter.items()]
    heapq.heapify(counts_heap)
    highest_count = counts_heap[0][0]
    while counts_heap:
        (opp_count, elmt) = heapq.heappop(counts_heap)
        if opp_count != highest_count:
            return  # a plain return ends the generator (raising StopIteration fails on Python 3.7+)
        yield (elmt, -opp_count)
for (letter, count) in find_max_counts('balloon'):
    print((letter, count))
for (word, count) in find_max_counts(['he', 'lkj', 'he', 'll', 'll']):
    print((word, count))
This yields, for instance:
lebigot#weinberg /tmp % python count.py
('l', 2)
('o', 2)
('he', 2)
('ll', 2)
This works with any sequence: words, but also ['hello', 'hello', 'bonjour'], for instance.
The heapq structure is very efficient at finding the smallest elements of a sequence without sorting it completely. On the other hand, since there are not so many letters in the alphabet, you can probably also run through the sorted list of counts until the maximum count is not found anymore, without this incurring any serious speed loss.
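The simpler alternative mentioned in the last sentence might look like the sketch below: sort the counts once with Counter.most_common, then keep taking entries while they still have the top count. The function name all_most_common is an assumption for illustration.

```python
from collections import Counter
from itertools import takewhile

def all_most_common(sequence):
    """Return every (element, count) pair that shares the highest count."""
    ranked = Counter(sequence).most_common()  # sorted by count, highest first
    if not ranked:
        return []
    top = ranked[0][1]
    # Walk the sorted list until the count drops below the maximum
    return list(takewhile(lambda pair: pair[1] == top, ranked))

print(all_most_common("balloon"))
print(all_most_common(["he", "lkj", "he", "ll", "ll"]))
```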

def most_frequent(text):
    frequencies = [(c, text.count(c)) for c in set(text)]
    return max(frequencies, key=lambda x: x[1])[0]
s = 'ABBCCCDDDD'
print(most_frequent(s))
frequencies is a list of tuples of the form (character, count). We apply max to the tuples using the counts as the key and return that tuple's character. In the event of a tie, this solution will pick only one.

I noticed that most of the answers only come back with one item even if there is an equal amount of characters most commonly used. For example "iii 444 yyy 999". There are an equal amount of spaces, i's, 4's, y's, and 9's. The solution should come back with everything, not just the letter i:
from collections import Counter
sentence = "iii 444 yyy 999"
# Returns the first item's value in the list of tuples (i.e. the largest number)
# from Counter().most_common()
largest_count: int = Counter(sentence).most_common()[0][1]
# If the tuple's value is equal to the largest value, append it to the list
most_common_list: list = [(x, y)
                          for x, y in Counter(sentence).items() if y == largest_count]
print(most_common_list)
# RETURNS
[('i', 3), (' ', 3), ('4', 3), ('y', 3), ('9', 3)]

Question :
Most frequent character in a string
The maximum occurring character in an input string
Method 1 :
a = "GiniGinaProtijayi"
d = {}
for ch in a:
    d[ch] = d.get(ch, 0) + 1
# Sort the (character, count) pairs by count, highest first, and take the first pair
chh, max_count = sorted(d.items(), reverse=True, key=lambda kv: kv[1])[0]
print(chh)
print(max_count)
Method 2 :
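a = "GiniGinaProtijayi"
max_count = 0  # renamed from max so the built-in max stays usable
chh = ''
count = [0] * 256  # one slot per code point below 256
for ch in a:
    count[ord(ch)] += 1
for ch in a:
    if count[ord(ch)] > max_count:
        max_count = count[ord(ch)]
        chh = ch
print(chh)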
Method 3 :
import collections
line ='North Calcutta Shyambazaar Soudipta Tabu Roopa Roopi Gina Gini Protijayi Sovabazaar Paikpara Baghbazaar Roopa'
bb = collections.Counter(line).most_common(1)[0][0]
print(bb)
Method 4 :
line =' North Calcutta Shyambazaar Soudipta Tabu Roopa Roopi Gina Gini Protijayi Sovabazaar Paikpara Baghbazaar Roopa'
def mostcommonletter(sentence):
    letters = list(sentence)
    return (max(set(letters),key = letters.count))
print(mostcommonletter(line))

Here are a few things I'd do:
Use collections.defaultdict instead of the dict you initialise manually.
Use built-in functions like sorted and max instead of working it out yourself - it's easier.
Here's my final result:
from collections import defaultdict
def find_max_letter_count(word):
    matches = defaultdict(int) # makes the default value 0
    for char in word:
        matches[char] += 1
    return max(matches.items(), key=lambda x: x[1])
find_max_letter_count('helloworld') == ('l', 3)

If you could not use collections for any reason, I would suggest the following implementation:
s = input()
d = {}
# We iterate through the string, and if we find an element that
# is already in the dict, then we just increment its counter.
for ch in s:
    if ch in d:
        d[ch] += 1
    else:
        d[ch] = 1
# If we are given an empty string, then we just
# print a message that says so.
print(max(d, key=d.get, default='Empty string was given.'))

sentence = "This is a great question made me wanna watch matrix again!"
char_frequency = {}
for char in sentence:
    if char == " ": #to skip spaces
        continue
    elif char in char_frequency:
        char_frequency[char] += 1
    else:
        char_frequency[char] = 1
char_frequency_sorted = sorted(
char_frequency.items(), key=lambda ky: ky[1], reverse=True
)
print(char_frequency_sorted[0]) #output -->('a', 9)

# return the letter with the max frequency.
def maxletter(word:str) -> tuple:
    ''' return the letter with the max occurrence '''
    v = 1
    dic = {}
    for letter in word:
        if letter in dic:
            dic[letter] += 1
        else:
            dic[letter] = v
    for k in dic:
        if dic[k] == max(dic.values()):
            return k, dic[k]
l, n = maxletter("Hello World")
print(l, n)
output: l 3

You may also try something like the following.
from pprint import pprint
sentence = "this is a common interview question"
char_frequency = {}
for char in sentence:
    if char in char_frequency:
        char_frequency[char] += 1
    else:
        char_frequency[char] = 1
pprint(char_frequency, width = 1)
out = sorted(char_frequency.items(),
key = lambda kv : kv[1], reverse = True)
print(out)
print(out[0])

statistics.mode(data)
Return the single most common data point from discrete or nominal data. The mode (when it exists) is the most typical value and serves as a measure of central location.
If there are multiple modes with the same frequency, returns the first one encountered in the data. If the smallest or largest of those is desired instead, use min(multimode(data)) or max(multimode(data)). If the input data is empty, StatisticsError is raised.
import statistics as stat
test = 'This is a test of the fantastic mode super special function ssssssssssssss'
test2 = ['block', 'cheese', 'block']
val = stat.mode(test)
val2 = stat.mode(test2)
print(val, val2)
mode assumes discrete data and returns a single value. This is the standard treatment of the mode as commonly taught in schools:
mode([1, 1, 2, 3, 3, 3, 3, 4])
3
The mode is unique in that it is the only statistic in this package that also applies to nominal (non-numeric) data:
mode(["red", "blue", "blue", "red", "green", "red", "red"])
'red'
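Since strings are sequences of characters, the same functions apply to the original question directly. A brief sketch, assuming Python 3.8 or later (where multimode exists and mode returns the first of several tied modes):

```python
from statistics import mode, multimode

text = "balloon"
print(mode(text))       # the first of the tied modes
print(multimode(text))  # every tied mode, in order of first appearance
```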

Here is how I solved it, considering the possibility of multiple most frequent chars:
sentence = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, \
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut \
enim."
joint_sentence = sentence.replace(" ", "")
frequencies = {}
for letter in joint_sentence:
    frequencies[letter] = frequencies.get(letter, 0) +1
biggest_frequency = frequencies[max(frequencies, key=frequencies.get)]
most_frequent_letters = {key: value for key, value in frequencies.items() if value == biggest_frequency}
print(most_frequent_letters)
Output:
{'e': 12, 'i': 12}

from collections import Counter
# file: path of the file to read
# quant: number of most frequent characters you want
def frequent_letters(file, quant):
    with open(file) as f:  # closes the file once it has been read
        text = f.read()
    return Counter(text).most_common(quant)

# This code returns all characters in a string which have the highest frequency
def find(text):
    y = sorted([[text.count(i), i] for i in set(text)])
    # here, the count of each unique character and the character itself are
    # taken as a list inside y (which is a list), sorted by count (ascending)
    # Eg : for "pradeep", y = [[1,'a'],[1,'d'],[1,'r'],[2,'e'],[2,'p']]
    most_freq = y[len(y)-1][0]
    # the count of the most frequent character is assigned to 'most_freq'
    # ie, most_freq = 2
    x = []
    for j in range(len(y)):
        if y[j][0] == most_freq:
            x.append(y[j])
    # if the 1st element of an inner list == the most frequent
    # character's count, that [count, character] pair is appended to x.
    # eg : "pradeep"
    # x = [[2,'e'],[2,'p']]
    return x
find("pradeep")
