CS50 DNA: STR counter only works most of the time

Here is a mock-up of the function. For a lot of the samples, one or two STRs come back as 1. Can someone help me understand what I am doing wrong?
dnaSamp = input("DNA: ")
strSeq = ["TATC"]  # full list: ["AGATC", "TTTTTTCT", "AATG", "TCTAG", "GATA", "TATC", "GAAA", "TCTG"]
hiScore = [0] * len(strSeq)
for i in range(len(strSeq)):  # cycle through the various STRs
    for j in range(len(dnaSamp) - (len(strSeq)-1)):  # loop over DNA sample
        k = j + len(strSeq[i])  # end index of the STR-sized slice starting at j
        if dnaSamp[j : k] == strSeq[i]:
            counter = 0
            for l in range(len(dnaSamp)):  # on a match, look at the next set
                if dnaSamp[j + (l * len(strSeq[i])) : k + (l * len(strSeq[i]))] == strSeq[i]:
                    counter += 1
                    continue
                break
            if counter > hiScore[i]:
                hiScore[i] = counter  # save highest counter
    print(f"{strSeq[i]} = {hiScore[i]}" )
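One way to check the counts is to compare them against a small independent implementation. The following is only a minimal sketch, not the asker's code: longest_run is a hypothetical helper name, and it assumes the goal is the longest run of back-to-back copies of each STR. It bounds the scan by the length of the STR itself, whereas the mock-up subtracts len(strSeq), which is the number of STRs in the list.

```python
def longest_run(sequence, motif):
    """Return the largest number of back-to-back copies of motif in sequence.

    Hypothetical helper for illustration; not part of the original code.
    """
    if not motif:
        return 0  # an empty motif would otherwise match forever
    step = len(motif)
    best = 0
    # Try every starting position where a full copy of the motif still fits.
    for start in range(len(sequence) - step + 1):
        count = 0
        pos = start
        # Keep stepping forward one motif length while the copies continue.
        while sequence[pos:pos + step] == motif:
            count += 1
            pos += step
        best = max(best, count)
    return best


# "TATC" appears three times in a row, then once more after "GG".
print(longest_run("AATATCTATCTATCGGTATC", "TATC"))  # prints 3
```

If the two implementations disagree on a sample, printing the start position of the best run in each can help narrow down where they diverge.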

Related

count character occurrences in string

I want to find out how often "reindeer" (with its letters in any order) occurs in a random string, and what string is left over after "reindeer" is removed. I need to preserve the order of the leftover string.
For example:
"erindAeer" -> A ("reindeer" occurs 1 time)
"ierndeBeCrerindAeer" -> (2 reindeers, left over is BCA)
I thought of sorting and removing "reindeer", but I need to preserve the order. What's a good way to do this?

Once we know how many times the word repeats, we can remove that many copies of each of its letters with str.replace. Counter is convenient for counting the letters.
from collections import Counter
def leftover(letter_set, string):
    lcount, scount = Counter(letter_set), Counter(string)
    # number of complete copies of letter_set that the string can supply
    repeat = min(scount[l] // lcount[l] for l in lcount)
    for l in lcount:
        # remove the first lcount[l] * repeat occurrences of each letter
        string = string.replace(l, "", lcount[l] * repeat)
    return f"{repeat} {letter_set}, left over is {string}"
print(leftover("reindeer", "ierndeBeCrerindAeer"))
print(leftover("reindeer", "ierndeBeCrerindAeere"))
print(leftover("reindeer", "ierndeBeCrerindAee"))
Output:
2 reindeer, left over is BCA
2 reindeer, left over is BCAe
1 reindeer, left over is BCerindAee

Here is a rather simple approach using collections.Counter:
from collections import Counter
def purge(pattern, string):
    scount, pcount = Counter(string), Counter(pattern)
    cnt = min(scount[x] // pcount[x] for x in pcount)
    scount.subtract(pattern * cnt)
    return cnt, "".join(scount.subtract(c) or c for c in string if scount[c])
>>> purge("reindeer", "ierndeBeCrerindAeer")
(2, 'BCA')

Here is the code in Python:
def find_reindeers(s):
    # rmap: how many times each letter appears in "reindeer"
    rmap = {}
    for x in "reindeer":
        if x not in rmap:
            rmap[x] = 0
        rmap[x] += 1
    # hmap: how many times each of those letters appears in s
    hmap = {key: 0 for key in "reindeer"}
    for x in s:
        if x in "reindeer":
            hmap[x] += 1
    total_occ = min([hmap[x]//rmap[x] for x in "reindeer"])
    # copies of each letter that belong to the counted reindeers
    to_remove = {x: total_occ * rmap[x] for x in rmap}
    left_over = ""
    for x in s:
        if to_remove.get(x, 0) > 0:
            to_remove[x] -= 1  # consume this letter
        else:
            left_over += x  # keep everything else, in order
    return total_occ, left_over
print(find_reindeers("ierndeBeCrerindAeer"))
Output for ierndeBeCrerindAeer:
(2, 'BCA')

You can do it by using a queue:
import queue

word = "reindeer"
given_string = "ierndeBeCrerindAeer"
new_string = ""
counter = 0
tmp = ""
letters = queue.Queue()
# Letters that are not in the word go straight to the leftover string;
# the rest are queued for matching.
for i in given_string:
    if i not in word:
        new_string += i
    else:
        letters.put(i)
x = 0
# Note: this loop assumes the queued letters form whole copies of the word;
# a stray extra letter would be cycled through the queue forever.
while x < len(word):
    while not letters.empty():
        j = letters.get()
        if j == word[x]:
            tmp += j
            break
        else:
            letters.put(j)
    x = x + 1
    if tmp == word:
        counter += 1
        tmp = ""
        x = 0
print(f"The word {word} occurs {counter} times in the string {given_string}.")
print("The left over word is",new_string)
Output will be:
The word reindeer occurs 2 times in the string ierndeBeCrerindAeer.
The left over word is BCA
A queue is convenient here because each letter is taken out once it has been matched, so it is never counted twice.
Hope this answers your question, Thank you!

Python Optimization: Find the most frequent sequence of 4 letters inside a randomly generated 1000-letter string

I'm here to ask for help with my program.
I wrote a program whose purpose is to find the most frequently occurring four-letter substring in a longer string of x letters that has been generated randomly.
For example, if you want to know the most frequent sequence of four letters in 'abcdeabcdef', it's easy to see that it is 'abcd', so the program will return that.
Unfortunately, my program is very slow: it takes 119.7 seconds to analyze all the possibilities and display the results for a string of only 1000 letters.
This is my program right now:
import random
chars = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
string = ''
for _ in range(1000):
    string += str(chars[random.randint(0, 25)])
print(string)
number = []
for ____ in range(0,26):
    print(____)
    for ___ in range(0,26):
        for __ in range(0, 26):
            for _ in range(0, 26):
                test = chars[____] + chars[___] + chars[__] + chars[_]
                print('trying :',test, end = ' ')
                number.append(0)
                for i in range(len(string) -3):
                    if string[i: i+4] == test:
                        number[len(number) -1] += 1
                print('>> finished')
_max = max(number)
for i in range(len(number)-1):
    if number[i] == _max :
        j, k, l, m = i, 0, 0, 0
        while j > 25:
            j -= 26
            k += 1
        while k > 25:
            k -= 26
            l += 1
        while l > 25:
            l -= 26
            m += 1
        Result = chars[m] + chars[l] + chars[k] + chars[j]
        print(str(Result),'occured',_max, 'times' )
I think there are ways to optimize it, but at my level I really don't know how. Maybe the structure itself is not the best. I hope you can help me :D

You only need to loop through your string once to count the 4-letter sequences. You are currently looping over all 26*26*26*26 candidate sequences and rescanning the string for each one. You can use zip to build the 997 four-letter windows of the string, then use Counter to count them:
from collections import Counter
import random
chars = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
s = "".join([chars[random.randint(0, 25)] for _ in range(1000)])
it = zip(s, s[1:], s[2:], s[3:])
counts = Counter(it)
counts.most_common(1)
Edit:
.most_common(x) returns a list of the x most common items with their counts. counts.most_common(1) returns a single-item list holding the tuple of letters and the number of times it occurred, like [(('a', 'b', 'c', 'd'), 2)]. So to get a string, just index into it and join():
''.join(counts.most_common(1)[0][0])

Even with your current approach of iterating through every possible 4-letter combination, you can speed up a lot by keeping a dictionary instead of a list, and testing whether the sequence occurs at all first before trying to count the occurrences:
counts = {}
for a in chars:
    for b in chars:
        for c in chars:
            for d in chars:
                test = a + b + c + d
                print('trying :',test, end = ' ')
                if test in s: # if it occurs at all
                    # then record how often it occurs
                    # (len(s)-3 so that the last 4-letter window is included)
                    counts[test] = sum(1 for i in range(len(s)-3)
                                       if test == s[i:i+4])
The multiple loops can be replaced with itertools.product, though this improves readability rather than performance. (itertools.permutations would not work here, because it never repeats a letter and so would skip sequences such as 'aabc'.)
import itertools
length = 4
for sequence in itertools.product(chars, repeat=length):
    test = "".join(sequence)
    if test in s:
        counts[test] = sum(1 for i in range(len(s)-length+1) if test == s[i:i+length])
You can then display the results like this:
_max = max(counts.values())
for k, v in counts.items():
    if v == _max:
        print(k, "occurred", _max, "times")
Provided that the string is shorter than, or around the same length as, 26**4 characters, it is much faster still to iterate through the string rather than through every combination:
length = 4
counts = {}
for i in range(len(s) - length + 1):  # +1 so the last window is included
    sequence = s[i:i+length]
    if sequence in counts:
        counts[sequence] += 1
    else:
        counts[sequence] = 1
This is equivalent to the Counter approach already suggested.
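For comparison, the single-pass idea can be written with Counter over string slices, which yields strings directly and makes ties visible. This is a minimal sketch; most_common_substrings is a hypothetical name chosen for illustration, and returning every tied sequence is a design choice made here, not something the question asks for.

```python
from collections import Counter


def most_common_substrings(text, length=4):
    """Return (tied most frequent substrings of the given length, their count).

    Hypothetical helper for illustration only.
    """
    # One pass over the string: every window of `length` characters.
    counts = Counter(text[i:i + length] for i in range(len(text) - length + 1))
    if not counts:
        return [], 0  # text is shorter than one window
    best = max(counts.values())
    # Several sequences can share the top count, so report all of them.
    return sorted(seq for seq, n in counts.items() if n == best), best


print(most_common_substrings("abcdeabcdef"))  # prints (['abcd', 'bcde'], 2)
```

Note that in the question's own example string both 'abcd' and 'bcde' occur twice, which is why the sketch returns a list rather than a single winner.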

smallest window contains all the elements in an array

I need to write a function to find the smallest window that contains all the elements in an array. Below is what I have tried:
def function(item):
    x = len(set(item))
    i = 0
    j = len(item) - 1
    result = len(item)
    while i <= j:
        if len(set(item[i + 1: j + 1])) == x:
            result = min(result, len(item[i + 1: j + 1]))
            i += 1
        elif len(set(item[i:j])) == x:
            result = min(result, len(item[i:j]))
            j -= 1
        else:
            return result
    return result
print(function([8,8,8,8,1,2,5,7,8,8,8,8]))
The time complexity is O(N^2). Can someone help me improve it to O(N) or better? Thanks.

You can use the idea from "How to find smallest substring which contains all characters from a given string?" for this specific case and get an O(N) solution.
Keep a counter of how many copies of each unique number are included in the window, and move the end of the window to the right until every unique number is included at least once. Then move the start of the window to the right until one unique number disappears. Then repeat:
from collections import Counter
def smallest_window(items):
    element_counts = Counter()
    n_unique = len(set(items))
    characters_included = 0
    start_enumerator = enumerate(items)
    min_window = len(items)
    for end, element in enumerate(items):
        element_counts[element] += 1
        if element_counts[element] == 1:
            characters_included += 1
        while characters_included == n_unique:
            start, removed_element = next(start_enumerator)
            min_window = min(end-start+1, min_window)
            element_counts[removed_element] -= 1
            if element_counts[removed_element] == 0:
                characters_included -= 1
    return min_window
>>> smallest_window([8,8,8,8,1,2,5,7,8,8,8,8])
5

This problem can be solved as below.
def lengthOfLongestSublist(s):
    result = 0
    # dictionary mapping each item in s to the last index where it was seen
    d = {}
    i = 0
    j = 0
    while j < len(s):
        # if s[j] is already in the dictionary, move the window start
        # to just past its previous index (never moving it backwards)
        if s[j] in d:
            i = max(d[s[j]] + 1, i)
        # compare the current window length with the best one so far
        result = max(result, j - i + 1)
        # store s[j] as key and the index of s[j] as value
        d[s[j]] = j
        j = j + 1
    return result
lengthOfLongestSublist([8,8,8,8,8,5,6,7,8,8,8,8,])
Output: 4
Set up a dictionary that stores each value of the input list as the key and its most recent index as the value: d[s[j]] = j.
In the loop, check whether the current value already exists in the dictionary. If it does, move the start point i to one past that value's previous index.
Update result with the current window length, j - i + 1.
The complexity is O(n).

Eradicate the error - 'bool' obj can't be iterated

The code is intended to print a frequency table for randomly generated discrete data. Here's the code:
from math import log10
from random import randint
N = int(input("Enter number of observations:\n"))
l = [ randint(1,100) for var in range (N) ]
print(l)
l.sort()
print(l)
k = 1 + (3.332*log10(N))
k1 = round(k)
print ("Number of intervals should be = ",k1)
x = N//k1 + 1
print("S.No\t\tIntervals\t\tFrequency")
c = 1 #count
while c <= k:
    a = (c-1)*x
    b = c*x
    count = 0
    for v in range(a,b) in l:
        count += 1
    print(c,"\t\t","{}-{}".format(a,b),"\t\t",count)
    c += 1
This raises the error cited in the title. How can I resolve it?

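The issue is that in for v in range(a,b) in l, Python evaluates range(a,b) in l first. That is a membership test, so it produces a bool, and a bool cannot be iterated over. range(a,b) by itself produces the integers from a to b-1. What you are asking for is for the code to go through l and pick out the numbers that fall in that interval, which looks like this instead:
for v in l:
    if ((v>=a) and (v<b)):
        count += 1
If you really want to use range, and your data are going to stay integers, then it would look like:
for v in l:
    if v in range(int(a),int(b)):
        count += 1
Also, the interval width should come from the range of the values (1 to 100), not from the number of observations, so
x = N//k1 + 1
should be
x = 100//k1 + 1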
Consensus sequence help in python

I am having difficulty getting this scoring function to work. The objective of my program is to make a t x n matrix and find a consensus sequence.
I keep getting an error:
TypeError: 'int' object is not subscriptable
Any help would be appreciated.
def Score(s, i, l, dna):
    t = len(dna) # t = number of dna sequences
    # Step 1: Extract the alignment corresponding to starting positions in s
    alignment = []
    for j in range(0, i):
        alignment.append(dna[j][s[j]:s[j]+l])
    # Step 2: Create the corresponding profile matrix
    profile = [[],[],[],[]] # prepare an empty 4 x l profile matrix first
    for j in range(0, 4):
        profile[j] = [0] * l
    for c in range(0, l): # for each column number c
        for r in range(0, i): # for each row number r in column c
            if alignment[r][c] == 'a':
                profile[0][c] = profile[0][c] + 1
            elif alignment[r][c] == 't':
                profile[1][c] = profile[1][c] + 1
            elif alignment[r][c] == 'g':
                profile[2][c] = profile[2][c] + 1
            else:
                profile[3][c] = profile[3][c] + 1
    # Step 3: Compute the score from the profile matrix
    score = 0
    for c in range(0, l):
        score = score + max([profile[0][c], profile[1][c], profile[2][c], profile[3][c]])
    return score

Is your variable dna a dictionary? If so, use def Score(s, i, l, **dna).
If it is an int variable, you can't index it as dna[j][s[j]:s[j]+l].
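The question does not show how Score is called, so the cause can only be guessed at. The sketch below assumes that s is meant to be a list of integer start positions and dna a list of strings. Under that assumption, passing a single int for s (or a list of ints for dna) would raise exactly this TypeError. The function name score and its parameter names are illustrative, and the column count uses Counter instead of the explicit profile matrix.

```python
from collections import Counter


def score(starts, n_rows, motif_len, dna):
    """Consensus score of the first n_rows sequences (hypothetical rewrite).

    Assumes starts is a list of ints and dna is a list of strings, and that
    every slice is motif_len characters long.
    """
    # Step 1: cut the aligned window out of each sequence.
    alignment = [dna[j][starts[j]:starts[j] + motif_len] for j in range(n_rows)]
    # Steps 2 and 3: for each column, add the count of its most common letter.
    return sum(max(Counter(column).values()) for column in zip(*alignment))


dna = ["acgt", "aagt", "ccgt"]
print(score([0, 0, 0], 3, 4, dna))  # prints 10

try:
    score(0, 3, 4, dna)  # an int where a list of start positions is expected
except TypeError as err:
    print(err)
```

Printing type(s) and type(dna) just before the failing line is a quick way to see which argument is the int.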
