GC content in DNA sequence (Rosalind): how to improve my code? - python

Below is my code for a Rosalind question to calculate GC content without using Biopython.
Can anyone give me some suggestions on how to improve it? For example, I cannot include the last sequence in seq_list inside the for loop and have to append it one more time afterwards.
Also, is there a better way to pair seq_name and GC content so I can easily print out the name of the sequence with the highest GC content?
Thank you very much.
# to open FASTA format sequence file:
s = open('5_GC_content.txt', 'r').readlines()

# to create two lists, one for names, one for sequences
name_list = []
seq_list = []
data = ''  # to put the sequence from several lines together
for line in s:
    line = line.strip()
    for i in line:
        if i == '>':
            name_list.append(line[1:])
            if data:
                seq_list.append(data)
                data = ''
            break
        else:
            line = line.upper()
    if all([k == k.upper() for k in line]):
        data = data + line
seq_list.append(data)  # is there a way to include the last sequence in the for loop?
GC_list = []
for seq in seq_list:
    i = 0
    for k in seq:
        if k == "G" or k == 'C':
            i += 1
    GC_cont = float(i) / len(seq) * 100.0
    GC_list.append(GC_cont)

m = max(GC_list)
print name_list[GC_list.index(m)]  # to find the index of max GC
print "{:0.6f}".format(m)

if all([k==k.upper() for k in line]):

Why don't you just check that line == line.upper()?

i = 0
for k in seq:
    if k == "G" or k == 'C':
        i += 1

can be replaced with

i = sum(1 for k in seq if k in ['G', 'C'])
is there a way to include the last sequence in the for loop?
I think there is no better way to do that.
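As for pairing each name with its GC content: one option (a sketch, not from the answers above, assuming name_list and seq_list are built as in your code) is a dict keyed by name, with max() taking a key function:

gc_by_name = {}
for name, seq in zip(name_list, seq_list):
    # str.count() replaces the explicit character loop
    gc_by_name[name] = (seq.count('G') + seq.count('C')) / float(len(seq)) * 100.0

best = max(gc_by_name, key=gc_by_name.get)  # name with the highest GC content
print best
print "{:0.6f}".format(gc_by_name[best])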

To avoid appending your seq_list a second time, remove:

if all([k==k.upper() for k in line]):
    data=data+line

and add it below line.strip().
The problem you are facing is that data is an empty string the first time you enter the for i in line loop, so if data: is False.
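If you want to avoid the trailing append altogether, one alternative (a sketch of a different parse, not the approach above) is to read the whole file and split on '>', so every record arrives complete:

with open('5_GC_content.txt') as f:
    records = f.read().split('>')[1:]  # drop the empty chunk before the first '>'

name_list = []
seq_list = []
for record in records:
    header, _, seq = record.partition('\n')
    name_list.append(header.strip())
    seq_list.append(seq.replace('\n', '').upper())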

Related

Breaking a list into smaller lists at a point

I've written a function that takes, for example, [["a",1],["b",2],["a",2],["b",3]], where each small list has a letter and a number, and returns [["a",1,2,"b",2,3]].
There is a lot more to this problem, but to make things simple, the next step is to turn this into the form [["a",3],["b",5]]. The second item of each smaller list is the sum of the numbers between the letters, i.e. 1,2 are associated with "a" and 2,3 are associated with "b", as seen in the original list. The number of occurrences of a letter is unlimited.
Another example, to summarize: function([["a",1,3,4,"b",2,2,"c",4,5]]) => [["a",8],["b",4],["c",9]]
Nothing I've written has come close to accomplishing this. This is a kind of bare-bones challenge: no list comprehensions, and nothing can be imported.
This code can help you:
# Assuming a random initial list:
data = [["a",1,3,4,4,2,"b",2,2,3,5,2,3,"c",4,3,5,5]]
# An empty list where the result will be accumulated:
new_data = []
# Variable to accumulate the sum for every letter:
sume = 0
# FOR loop to scan the "data" variable:
for i in data[0]:
    # If i is a string, we assume it's a letter:
    if type(i) == str:
        # Append the accumulated sum:
        new_data.append(sume)
        # Reset the sume variable:
        sume = 0
        # Append the newly read letter:
        new_data.append(i)
    else:
        # Accumulate the sum for the current letter:
        sume += i
# Drop the 0 appended at the start and append the last sum:
new_data = new_data[1:] + [sume]
# Finally, separate the values into pairs with a FOR loop and add them to "new_data2":
new_data2 = []
for i in range(len(new_data)//2):
    pos1 = i*2
    pos2 = pos1 + 1
    new_data2.append([new_data[pos1], new_data[pos2]])
# Print data and new_data2 to verify the results:
print(data)
print(new_data2)
# Pause the script:
input()
This code works as a standalone script, but it can be wrapped in a function to use it the way you are looking for.
It's normally expected that you post your solution first, but it seems that you have tried some things and need help. For future questions, make sure you include your attempt, since it helps us explain why your solution doesn't work and what additional steps you can take to improve it.
Assuming that your list always starts with a letter or str, and all numbers are of type int, you could use a dictionary to do the counting. I have added comments to explain the logic.
def group_consecutive(lst):
    groups = {}
    key = None
    for item in lst:
        # If we found a string, set the key and continue to the next iteration immediately
        if isinstance(item, str):
            key = item
            continue
        # Add item to the counts,
        # using dict.get() to initialize to 0 if the key doesn't exist
        groups[key] = groups.get(key, 0) + item
    # Replacing the list comprehension [[k, v] for k, v in groups.items()]:
    result = []
    for k, v in groups.items():
        result.append([k, v])
    return result
Then you could call the function like this:
>>> group_consecutive(["a",1,3,4,"b",2,2,"c",4,5])
[['a', 8], ['b', 4], ['c', 9]]
A better solution would probably use collections.Counter or collections.defaultdict to do the counting, but since you mentioned no imports then the above solution adheres to that.
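For reference, the defaultdict variant mentioned above might look like this sketch (it needs an import, so it does not satisfy the no-imports constraint):

from collections import defaultdict

def group_consecutive(lst):
    groups = defaultdict(int)  # missing keys start at 0
    key = None
    for item in lst:
        if isinstance(item, str):
            key = item
        else:
            groups[key] += item
    return [[k, v] for k, v in groups.items()]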

How to know the number of names that start with each letter of a txt file using python

I need to know how I can calculate the number of words in the list that start with the letter A, B, C ... Z.
Here is the reading part of the txt file:
#!/usr/bin/python

def main():
    lines = []
    xs = []
    try:
        with open("bin-nombres.txt", 'r') as fp:
            lines = [lines.strip() for lines in fp]
        for i in lines:
            print(i[0])
            xs = counterByLetter(i[0])
        print(xs)
    except EOFError as e:
        print(e)
    finally:
        pass

def counterByLetter(data):
    return [(k, v) for k, v in {v: data.count(v) for v in 'abcdefghijklmnopqrstuvwxyz'}.items()]

if __name__ == "__main__":
    main()
I must calculate the number of words that begin with [A ... Z]. For example:
There are 3 words that start with A.
There are 20 words that start with B.
etc.
Here is the solution to the problem. Thanks to those who helped me!
import string

def main():
    try:
        # this initializes the counter with 0 for each letter
        letter_count = {letter: 0 for letter in string.ascii_lowercase}
        with open("bin-nombres.txt", 'r') as fp:
            for line in fp:
                line = line.strip()
                initial = line[0].lower()
                letter_count[initial] += 1  # and here I increment per word
        # Iterate over the dictionary to get each key and value;
        # along the way the values are summed to get the total number of words.
        size = 0
        for key, value in letter_count.items():
            size += value
            print("There are {} names that start with the letter '{}'.".format(value, key))
        print("Total names in the file: {}".format(size))
    except EOFError as e:
        print(e)

if __name__ == "__main__":
    main()
So, according to your update (one word per line, already alphabetically sorted), something like this should work:
import string

def main():
    try:
        # this initializes our counter with 0 for each letter
        letter_count = {letter: 0 for letter in string.ascii_lowercase}
        with open("words.txt", 'r') as fp:
            for line in fp:
                line = line.strip()
                initial = line[0].lower()
                letter_count[initial] += 1  # and here we increment per word
        print(letter_count)
    except EOFError as e:
        print(e)

if __name__ == "__main__":
    main()
UPDATE:
It's good that you don't just want a ready-made solution, but your code has a few issues and some parts are not very pythonic; that's why I suggested doing it as above. If you really want to go with your solution, you need to fix your counterByLetter function. The problem is that you're not actually storing the results anywhere: you return a freshly computed list for each word. You probably have a word starting with 'z' as the last word of the file, hence the result having 0 as the count for all letters except 'z', which has one. You need to update the value for the current letter in that function, instead of recalculating the whole array every time.
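Following that advice, a minimal sketch of a fixed counterByLetter (a hypothetical rewrite; it changes the signature, so the call site must pass the dict along) updates a running tally instead of rebuilding a whole list:

def counterByLetter(counts, word):
    # bump the tally for this word's first letter
    first = word[0].lower()
    counts[first] = counts.get(first, 0) + 1
    return counts

You would initialize xs = {} before the loop and call xs = counterByLetter(xs, i) inside it.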
Assume that there is a list named list which has 3 elements:

list = ["Geeks", "For", "Triks"]

and an array which has 26 elements:

array = [0, 0, ......, 0, 0, ......, 0, 0]

array[0] represents the number of words that start with A.
..................
array[25] represents the number of words that start with Z.
Then, if list[n][0] starts with A, you need to increment array[0] by 1.
If array[5] == 7, it means that there are 7 words that start with F.
This is the straightforward logic for finding the result.
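Translated into Python, that logic might look like this sketch (assuming every word starts with an ASCII letter):

words = ["Geeks", "For", "Triks"]
counts = [0] * 26  # one slot per letter, A..Z

for w in words:
    counts[ord(w[0].lower()) - ord('a')] += 1

print(counts[5])  # the number of words that start with 'F' or 'f'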
I'd suggest changing your code a bit, like this:
Use collections.defaultdict with int as the value type: with the first letter as the dictionary key, you can increment its value each time there is a match. So:

from collections import defaultdict

Set xs as xs = defaultdict(int) and change the for i in lines: body to:

for i in lines:
    xs[i[0]] += 1

If you print xs at the end of the for loop you'll get something like:

defaultdict(<class 'int'>, {'P': 3, 'G': 2, 'R': 2})

Keys in a dict are case sensitive, so take care to normalize the case, if required.
You don't need an external method to do the counting job.
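Put together with the original main(), the suggestion might look like this sketch (keeping the question's file name, and lowercasing so 'A' and 'a' share a bucket):

from collections import defaultdict

def main():
    xs = defaultdict(int)
    with open("bin-nombres.txt", 'r') as fp:
        for line in fp:
            line = line.strip()
            if line:
                xs[line[0].lower()] += 1  # first letter is the dict key

    for letter, count in sorted(xs.items()):
        print("There are {} words that start with {}.".format(count, letter.upper()))

if __name__ == "__main__":
    main()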

Anagram from large file

I have a file with 10,000 words in it. I wrote a program to find the anagram words in that file, but it takes too long to produce output. For small files the program works well. Please help me optimize the code.
count = 0
i = 0
j = 0
with open('file.txt') as file:
    lines = [i.strip() for i in file]
for i in range(len(lines)):
    for j in range(i):
        if sorted(lines[i]) == sorted(lines[j]):
            #print(lines[i])
            count = count + 1
        j = j + 1
    i = i + 1
print('There are ', count, 'anagram words')
I don't fully understand your code (for example, why do you increment i and j inside the loop?). But the main problem is that you have a nested loop, which makes the runtime of the algorithm O(n^2), i.e. if the file becomes 10 times as large, the execution time will become (approximately) 100 times as long.
So you need a way to avoid that. One possible way is to store the lines in a smarter way, so that you don't have to walk through all lines every time. Then the runtime becomes O(n). In this case you can use the fact that anagrams consist of the same characters (only in a different order). So you can use the "sorted" variant as a key in a dictionary to store all lines that can be made from the same letters in a list under the same dictionary key. There are other possibilities of course, but in this case I think it works out quite nice :-)
So, fully working example code:
#!/usr/bin/env python3
from collections import defaultdict

d = defaultdict(list)
with open('file.txt') as file:
    lines = [line.strip() for line in file]
for line in lines:
    sorted_line = ''.join(sorted(line))
    d[sorted_line].append(line)
anagrams = [d[k] for k in d if len(d[k]) > 1]
# anagrams is a list of lists of lines that are anagrams
# I would say the number of anagrams is:
count = sum(map(len, anagrams))
# ... but in your example you're not counting the first words, only the "duplicates", so:
count -= len(anagrams)
print('There are', count, 'anagram words')
UPDATE
Without duplicates, and without using collections (although I strongly recommend using it):
#!/usr/bin/env python3
d = {}
with open('file.txt') as file:
    lines = [line.strip() for line in file]
lines = set(lines)  # remove duplicates
for line in lines:
    sorted_line = ''.join(sorted(line))
    if sorted_line in d:
        d[sorted_line].append(line)
    else:
        d[sorted_line] = [line]
anagrams = [d[k] for k in d if len(d[k]) > 1]
# anagrams is a list of lists of lines that are anagrams
# I would say the number of anagrams is:
count = sum(map(len, anagrams))
# ... but in your example you're not counting the first words, only the "duplicates", so:
count -= len(anagrams)
print('There are', count, 'anagram words')
Well, it is unclear whether you account for duplicates or not; if you don't, you can remove the duplicates from your list of words, which will spare you a huge amount of runtime in my opinion. You can check for anagrams and then use sum() to get their total number. This should do it:
def get_unique_words(lines):
    unique = []
    for word in " ".join(lines).split(" "):
        if word not in unique:
            unique.append(word)
    return unique

def check_for_anagrams(test_word, words):
    return sum([1 for word in words if (sorted(test_word) == sorted(word) and word != test_word)])

with open('file.txt') as file:
    lines = [line.strip() for line in file]

unique = get_unique_words(lines)
count = sum([check_for_anagrams(word, unique) for word in unique])
print('There are ', count, 'unique anagram words aka', int(count/2), 'unique anagram couples')

Nested for loops with large data set

I have a list of sublists, each of which consists of one or more strings. I am comparing each string in one sublist to every other string in the other sublists, which means writing two for loops. However, my data set is ~5000 sublists, so my program keeps running forever unless I run the code in increments of 500 sublists. How do I change the flow of this program so I can still look at all j values corresponding to each i, yet be able to run the program over all ~5000 sublists? (wn is the WordNet library.)
Here's part of my code:
for i in range(len(somelist)):
    if i == len(somelist)-1:  # if this is the last sublist, do not compare
        break
    title_former = somelist[i]
    for word in title_former:
        singular = wn.morphy(word)  # convert to singular
        if singular == None:
            pass
        elif singular != None:
            newWordSyn = getNewWordSyn(word, singular)
            if not newWordSyn:
                uncounted_words.append(word)
            else:
                for j in range(i+1, len(somelist)):
                    title_latter = somelist[j]
                    for word1 in title_latter:
                        singular1 = wn.morphy(word1)
                        if singular1 == None:
                            uncounted_words.append(word1)
                        elif singular1 != None:
                            newWordSyn1 = getNewWordSyn(word1, singular1)
                            tempSimilarity = newWordSyn.wup_similarity(newWordSyn1)
Example:

Input  = [['space', 'invaders'], ['draw']]
Output = {('space', 'draw'): 0.5, ('invaders', 'draw'): 0.2}

The output is a dictionary mapping each pair of strings (as a tuple) to its similarity value. The above code snippet is not complete.
How about doing a bit of preprocessing instead of doing a bunch of operations over and over? I did not test this, but you get the idea; you need to take anything you can out of the loop.
# Preprocessing:
unencountered_words = []
preprocessed_somelist = []
for sublist in somelist:
    new_sublist = []
    preprocessed_somelist.append(new_sublist)
    for word in sublist:
        temp = wn.morphy(word)
        if temp:
            new_sublist.append(temp)
        else:
            unencountered_words.append(word)

# Nested loops:
for i in range(len(preprocessed_somelist) - 1):  # equivalent to your logic
    for word in preprocessed_somelist[i]:
        for j in range(i+1, len(preprocessed_somelist)):
            for word1 in preprocessed_somelist[j]:
                tempSimilarity = newWordSyn.wup_similarity(newWordSyn1)
You could try something like this, but I doubt it will be faster (and you will probably need to change the distance function):
import itertools

def dist(s1, s2):
    return sum([i != j for i, j in zip(s1, s2)]) + abs(len(s1) - len(s2))

dict([((k, v), dist(k, v)) for k, v in itertools.product(Input1, Input2)])
This is always going to have scaling issues, because you're doing n^2 string comparisons. Julius' optimization is certainly a good starting point.
The next thing you can do is store similarity results so you don't have to compare the same words repeatedly.
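One way to store them (a sketch, not part of either answer; it assumes the synset objects are hashable, which NLTK synsets are) is to memoize a small wrapper with functools.lru_cache:

from functools import lru_cache

@lru_cache(maxsize=None)
def cached_similarity(syn_a, syn_b):
    # computed at most once per ordered pair of synsets
    return syn_a.wup_similarity(syn_b)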
One other optimisation you can make is to store comparisons of words and reuse them when the same words are encountered again.
key = (newWordSyn, newWordSyn1)
if key in prevCompared:
    tempSimilarity = prevCompared[key]
else:
    tempSimilarity = newWordSyn.wup_similarity(newWordSyn1)
    prevCompared[key] = tempSimilarity
    prevCompared[(newWordSyn1, newWordSyn)] = tempSimilarity
This only helps if you see a lot of the same word combinations, but I think wup_similarity is quite expensive.

Reading and Grouping a List of Data in Python

I have been struggling with managing some data. I have data that I have turned into a list of lists; each basic sublist has a structure like the following:
<1x>begins
<2x>value-1
<3x>value-2
<4x>value-3
some indeterminate number of other values
<1y>next observation begins
<2y>value-1
<3y>value-2
<4y>value-3
some indeterminate number of other values
this continues for an indeterminate number of times in each sublist
EDIT: I need to get all the occurrences of <2, <3 & <4 separated out and grouped together. I am creating a new list of lists: [[<2x>value-1, <3x>value-2, <4x>value-3], [<2y>value-1, <3y>value-2, <4y>value-3]]
EDIT: all of the lines that follow <4x> and <4y> (and for that matter <4anyalpha>) have the same type of coding, and I don't know a priori how high the numbers can go; just think of these as SGML tags that are not closed. I used numbers because my fingers were hurting from all the coding I have been doing today.
The solution I have come up with finally is not very pretty
listINeed = []
for sublist in biglist:
    for line in sublist:
        if '<2' in line:
            var2 = line
        if '<3' in line:
            var3 = line
        if '<4' in line:
            var4 = line
            templist = []
            templist.append(var2)
            templist.append(var3)
            templist.append(var4)
            listINeed.append(templist)
            templist = []
            var4 = var2 = var3 = ''
I have looked at ways to clean this up but have not been successful. It works fine; I just saw this as another opportunity to learn more about Python, because I would think this should be processable by a one-line function.
itertools.groupby() can get you by:

import itertools
import operator

itertools.groupby(biglist, operator.itemgetter(2))
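For reference, here is a tiny demonstration of how groupby clusters consecutive items by a key (made-up data, and a startswith key rather than itemgetter):

import itertools

lines = ['<1x>begins', '<2x>value-1', '<3x>value-2',
         '<1y>next', '<2y>value-1']

# group consecutive lines by whether they are a '<1...>' header
for is_header, group in itertools.groupby(lines, key=lambda s: s.startswith('<1')):
    print(list(group))
# [['<1x>begins'], ['<2x>value-1', '<3x>value-2'], ['<1y>next'], ['<2y>value-1']]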
If you want to pick out the second, third, and fourth elements of each sublist, this should work:
listINeed = [sublist[1:4] for sublist in biglist]
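For example, on made-up data shaped like the question's (assuming each sublist starts with its <1...> line):

biglist = [['<1x>begins', '<2x>value-1', '<3x>value-2', '<4x>value-3', 'other'],
           ['<1y>next', '<2y>value-1', '<3y>value-2', '<4y>value-3']]

listINeed = [sublist[1:4] for sublist in biglist]
# [['<2x>value-1', '<3x>value-2', '<4x>value-3'],
#  ['<2y>value-1', '<3y>value-2', '<4y>value-3']]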
You're off to a good start by noticing that your original solution may work but lacks elegance.
You should parse the string in a loop, creating a new variable for each line.
Here's some sample code:

import re

s = """<1x>begins
<2x>value-1
<3x>value-2
<4x>value-3
some indeterminate number of other values
<1y>next observation begins
<2y>value-1
<3y>value-2
<4y>value-3"""

firstMatch = re.compile(r'^<1x')
numMatch = re.compile(r'^<(\d+)')
listIneed = []
templist = None
for line in s.splitlines():
    if firstMatch.match(line):
        if templist is not None:
            listIneed.append(templist)
        templist = [line]
    elif numMatch.match(line):
        #print 'The matching number is %s' % numMatch.match(line).groups(1)
        templist.append(line)
if templist is not None:
    listIneed.append(templist)
print listIneed
If I've understood your question correctly:
import re

def getlines(ori):
    matches = re.finditer(r'(<([1-4])[a-zA-Z]>.*)', ori)
    mainlist = []
    sublist = []
    for sr in matches:
        if int(sr.groups()[1]) == 1:
            if sublist != []:
                mainlist.append(sublist)
                sublist = []
        else:
            sublist.append(sr.groups()[0])
    mainlist.append(sublist)
    return mainlist
...would do the job for you, if you felt like using regular expressions.
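Example usage on input shaped like the question's (the expected output is my reading of the function, not taken from the answer):

ori = """<1x>begins
<2x>value-1
<3x>value-2
<4x>value-3
<1y>next observation begins
<2y>value-1"""

print getlines(ori)
# [['<2x>value-1', '<3x>value-2', '<4x>value-3'], ['<2y>value-1']]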
The version below breaks all of the data down into sublists (not just the <2/<3/<4 lines in each grouping), which might be more useful depending on what else you need to do with the data. Use David's listINeed = [sublist[1:4] for sublist in biglist] to pull out the <2/<3/<4 entries for the specific task above.
import re

def getlines(ori):
    matches = re.finditer(r'(<(\d*)[a-zA-Z]>.*)', ori)
    mainlist = []
    sublist = []
    for sr in matches:
        if int(sr.groups()[1]) == 1:
            print "1 found!"
            if sublist != []:
                mainlist.append(sublist)
                sublist = []
        else:
            sublist.append(sr.groups()[0])
    mainlist.append(sublist)
    return mainlist
