Converting fasta file into list with names and sequences

Goal: Return a list of the names and sequences
def ReadFastaFile(filename):
    fileObj = open(filename, 'r')
    sequences = []
    seqFragments = []
    for line in fileObj:
        if line.startswith('>'):
            if seqFragments:
                sequence = ''.join(seqFragments)
                sequences.append(sequence)
            seqFragments = []
        else:
            seq = line.rstrip()
            seqFragments.append(seq)
    if seqFragments:
        sequence = ''.join(seqFragments)
        sequences.append(sequence)
    fileObj.close()
    return sequences
I want to get a list with both the name and the sequence.
This code gives me a list with only the sequences, because I first thought I would not need the names for what I want to do. I have now realized that it would be good to include the names as well, if possible in dictionary form, like: dict = {'name': sequence}. Does anybody have an idea how to alter the code to achieve this?

This should be pretty straightforward: remember the name from each header line and use it as the dictionary key.
def ReadFastaFile(filename):
    sequences = dict()
    seqFragments = []
    name = None  # name of the record currently being read
    with open(filename, 'r') as fileObj:
        for line in fileObj:
            if line.startswith('>'):
                # a new header: store the record collected so far
                if seqFragments:
                    sequences[name] = ''.join(seqFragments)
                seqFragments = []
                name = line.rstrip()[1:]
            else:
                seqFragments.append(line.rstrip())
    # store the last record in the file
    if seqFragments:
        sequences[name] = ''.join(seqFragments)
    return sequences
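To show the name-to-sequence mapping on concrete data, here is a minimal sketch that parses FASTA lines held in memory. The function name read_fasta_lines and the sample records are hypothetical, and unlike the code above it also keeps records that have a header but no sequence lines.

```python
def read_fasta_lines(lines):
    """Map each record name to its sequence, given an iterable of FASTA lines."""
    fragments = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            name = line[1:]
            fragments[name] = []  # registered even if no sequence lines follow
        elif name is not None:
            fragments[name].append(line)
    return {name: ''.join(parts) for name, parts in fragments.items()}


# Hypothetical sample data standing in for the lines of a FASTA file
example = [">seq1 test", "ATGC", "GGTA", "", ">seq2", "TTAA"]
records = read_fasta_lines(example)
print(records)

# If a list of (name, sequence) pairs is preferred over a dictionary:
names_and_sequences = list(records.items())
```

An open file object can be passed in place of the list, since iterating over a file yields its lines.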

Related

How to increase the speed of CSV data matching?

I have a script that parses two CSV files and compares the first column of one file with the second column of the other. The files are big, so the process takes a while to finish. How can I improve the speed? I tried using yield on lines before the for loop, but then I have to convert lines[1:] to list(lines[1:]), which defeats the purpose.
def pk():
    with open('way/to/first.csv') as csv_file:
        lines = csv_file.readlines()
        full_list = []
        for line in lines[1:]:
            array = line.split(',')
            list_pk = array[0].replace('"', '')
            full_list.append(list_pk)
        return full_list
def fk():
    with open('way/to/second.csv') as csv_file:
        lines = csv_file.readlines()
        full_list = []
        for line in lines[1:]:
            array = line.split(',')
            list_fk = array[1].replace('"', '')
            full_list.append(list_fk)
        return full_list
def res():
    f = fk()
    p = pk()
    for i in f:
        if i not in p:
            raise AssertionError(f'{i} not found')

Try using Python's set difference to find the elements in set A that have no match in set B:
def res():
    fset = set(fk())
    pset = set(pk())
    print('items in F that are missing from P:')
    print(fset - pset)
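The speed-up comes from membership tests: "i not in p" scans the whole list every time, while a set lookup takes roughly constant time on average. The sketch below is one way to combine that with the standard csv module, which also handles the quoted fields. The helper name column_values and the in-memory sample data are assumptions made for illustration, not part of the original script.

```python
import csv
import io


def column_values(csv_file, index):
    """Collect one column of a CSV file into a set, skipping the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    return {row[index] for row in reader if len(row) > index}


# In-memory stand-ins for the two files (hypothetical data)
first = io.StringIO('"pk","name"\n"1","a"\n"2","b"\n')
second = io.StringIO('"id","fk"\n"10","1"\n"11","3"\n')

pks = column_values(first, 0)
fks = column_values(second, 1)
missing = fks - pks  # values in the second file with no match in the first
print(sorted(missing))
```

With real files, open each one with open(path, newline='') and pass the file object to column_values.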

How to print out lines longer than specific length

I have an input file like this:
#sample1
ATGGTTCCAAGGCCTTGGTTAATTGGGGGGTTTTTTTTTTTTTTTTTTT
#sample2
TTGGAACCTTGGCCAATTAAGGGGGGGGGTTTTTTTCCCCCCCCCCCCC
#sample3
GGTTGGTTGGGAATTTGGTTAACCTTTTTAAATTTTTTTTTTTGGGGGG
AATTTTTTTTTTTTTGG
I want to print the records whose sequence has at least a given minimum length. For example, if the minimum length is 66, the output will be:
#sample3
GGTTGGTTGGGAATTTGGTTAACCTTTTTAAATTTTTTTTTTTGGGGGG
AATTTTTTTTTTTTTGG
since only the sequence of sample 3 has a length of at least 66.
Below is my code so far:
fastfile = {}
with open(sys.argv[1]) as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            sequencenumber = line[1:]
            if sequencenumber not in fastfile:
                fastfile[sequencenumber] = []
            continue
        sequence = line
        fastfile[sequencenumber].append(sequence)
output = []
for key, value in fastfile.items():
    if len(value) >= sys.argv[2]:
        output.append(value)
print (output)
sys.argv[1] is the path of the input file and sys.argv[2] is the minimum length.

You want the values of the fastfile dictionary to be strings, not lists, so instead of appending consecutive sequence lines to a running list, you need to concatenate them onto a running string. The minimum length from sys.argv[2] also has to be converted to an int before it is compared with len(value):
import sys
min_length = int(sys.argv[2])  # command-line arguments are strings
fastfile = {}
with open(sys.argv[1]) as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        if line[0] == "#":
            sequencenumber = line[1:]
            if sequencenumber not in fastfile:
                fastfile[sequencenumber] = ""
            continue
        fastfile[sequencenumber] += line
output = []
for key, value in fastfile.items():
    if len(value) >= min_length:
        output.append(value)
print (output)
Or, if you need to keep the strings in a list as you originally did, use "".join(value) to concatenate them, like so:
output = []
for key, value in fastfile.items():
    if len("".join(value)) >= int(sys.argv[2]):
        output.append("".join(value))
print(output)

This looks much simpler:
from sys import argv
with open(argv[1]) as fin:
    text = fin.read()
min_length = int(argv[2])
# each part is "name\nsequence lines..."; the part before the first '#' is empty
parts = text.split('#')[1:]
# keep only the parts whose sequence lines together reach min_length
parts = [p for p in parts if sum(len(i.strip()) for i in p.split('\n')[1:]) >= min_length]
output = ''.join('#' + p for p in parts)
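The approaches above can be combined into one small function that keeps the record names, so the header can be printed along with each sequence as in the desired output. This is only a sketch: the name filter_records and the shortened sample data are hypothetical.

```python
def filter_records(lines, min_length):
    """Return (name, sequence) pairs whose joined sequence has at least min_length characters."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("#"):
            name = line[1:]
            records.setdefault(name, [])
        elif name is not None:
            records[name].append(line)
    joined = ((name, "".join(parts)) for name, parts in records.items())
    return [(name, seq) for name, seq in joined if len(seq) >= min_length]


# Shortened, hypothetical sample in the same layout as the input file
sample = [
    "#sample1", "ATGGTTCC",
    "#sample2", "TTGG",
    "#sample3", "GGTTGG", "AATT",
]
for name, seq in filter_records(sample, 8):
    print("#" + name)
    print(seq)
```

For the real script, the call would be filter_records(open(sys.argv[1]), int(sys.argv[2])).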

Import textfile - list index out of range

infile = open("/Users/name/Downloads/points.txt", "r")
line = infile.readline()
while line != "":
    line = infile.readline()
    wordlist = line.split()
    x_co = float(wordlist[0])
    y_co = float(wordlist[1])
I looked around but didn't find anything helpful for my problem.
I have a .txt file with x (first column) and y (second column) coordinates (see picture).
I want to read every x and y coordinate separately, but when I run my code I always get an error:
x_co = float(wordList[0])
IndexError: list index out of range
Thanks for helping!

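The loop reads a new line after the while test has already run, so the last readline() returns an empty string, split() returns an empty list, and wordlist[0] raises the IndexError (the first line of the file is also skipped). Iterating over the file directly avoids both problems:
filename = "/Users/name/Downloads/points.txt"
with open(filename) as infile:
    for line in infile:
        wordlist = line.split()
        if not wordlist:
            continue  # skip blank lines
        x_co = float(wordlist[0])
        y_co = float(wordlist[1])
The with statement closes the file automatically.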

You can also do it this way:
infile = open("/Users/name/Downloads/points.txt", "r")
for line in infile:
    if line.strip():  # skip blank lines; a bare "\n" is still a non-empty string
        wordlist = line.split()
        x_co = float(wordlist[0])
        y_co = float(wordlist[1])
infile.close()
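Both answers overwrite x_co and y_co on every pass, so only the last point survives the loop. If all coordinates are needed, a sketch along these lines collects them. The function name parse_points and the sample contents are assumptions, since the actual file was only shown as a picture.

```python
import io


def parse_points(lines):
    """Return a list of (x, y) float pairs, skipping blank or incomplete lines."""
    points = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or incomplete line
        points.append((float(fields[0]), float(fields[1])))
    return points


# In-memory stand-in for points.txt (hypothetical contents)
sample = io.StringIO("1.0 2.5\n3.0 4.5\n\n")
points = parse_points(sample)
xs = [x for x, _ in points]
ys = [y for _, y in points]
print(xs, ys)
```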

Dictionaries overwriting in Python

This program takes the grammar rules found in Binary.txt and stores them in a dictionary. The rules are:
N = N D
N = D
D = 0
D = 1
but the current code returns D: D = 1, N: N = D, whereas I want N: N D, N: D, D: 0, D: 1
import sys
import string
#default length of 3
stringLength = 3
#get last argument of command line(file)
filename1 = sys.argv[-1]
#get a length from user
try:
    stringLength = int(input('Length? '))
    filename = input('Filename: ')
except ValueError:
    print("Not a number")
#checks
print(stringLength)
print(filename)
def str2dict(filename="Binary.txt"):
    result = {}
    with open(filename, "r") as grammar:
        #read file
        lines = grammar.readlines()
        count = 0
        #loop through
        for line in lines:
            print(line)
            result[line[0]] = line
            print (result)
    return result
print (str2dict("Binary.txt"))

Firstly, your choice of data structure is wrong. A dictionary in Python is a simple key-to-value mapping. What you want is a map from one key to multiple values. For that you'll need:
from collections import defaultdict
result = defaultdict(list)
Next, where are you splitting on '='? You'll need to do that in order to get the key and value you are looking for:
key, value = line.split('=', 1) # returns a list, which gets unpacked into 2 variables
Putting the two together, you'd go about it in the following way:
def str2dict(filename="Binary.txt"):
    result = defaultdict(list)
    with open(filename, "r") as grammar:
        #loop through the lines of the file
        for line in grammar:
            if '=' not in line:
                continue  # skip blank or malformed lines
            key, value = line.split('=', 1)
            result[key.strip()].append(value.strip())
    return result

Dictionaries, by definition, cannot have duplicate keys. Therefore there can only ever be a single 'D' key. You could, however, store a list of values at that key if you'd like. For example:
from collections import defaultdict
# rest of your code...
result = defaultdict(list) # Use defaultdict so that looking up a missing key creates a new list automatically
with open(filename, "r") as grammar:
    #read file
    lines = grammar.readlines()
    #loop through
    for line in lines:
        print(line)
        result[line[0]].append(line.strip())
        print (result)
return result
This will result in something like:
{"N" : ["N = N D", "N = D"], "D" : ["D = 0", "D = 1"]}
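To make the grouping concrete, here is a small self-contained sketch that applies the defaultdict approach to the four rules from the question. The function name parse_rules is hypothetical, and the rules are passed as a list of lines rather than read from Binary.txt.

```python
from collections import defaultdict


def parse_rules(lines):
    """Group the right-hand sides of 'LHS = RHS' rules by their left-hand side."""
    rules = defaultdict(list)
    for line in lines:
        if "=" not in line:
            continue  # skip blank or malformed lines
        left, right = line.split("=", 1)
        rules[left.strip()].append(right.strip())
    return dict(rules)


grammar = ["N = N D\n", "N = D\n", "D = 0\n", "D = 1\n"]
rules = parse_rules(grammar)
print(rules)  # {'N': ['N D', 'D'], 'D': ['0', '1']}
```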

getting error unhashable type list in my python code

I get this error when I run my Python code. I'm still learning my way around Python and I'm having trouble deciphering what's wrong with the code. The error is "unhashable type: list", and it is reported on lines 54 and 35. I wonder if I'm missing some import. I've checked the code, but I don't see the error.
#!/usr/bin/python
import string
def rotate(str, n):
    inverted = ''
    for i in str:
        #calculating starting point in ascii
        if i.isupper():
            start = ord('A')
        else:
            start = ord('a')
        d = ord(i) - start
        j = chr((d + n) % 26 + start)
        #calculating starting point in ascii(d + n) + start
        inverted += j
    return inverted
'''
making a dictionary out of a file containing all words
'''
def make_dictionary():
    filename = "/home/jorge/words.txt"
    fin = open(filename, 'r')
    dic = dict()
    for line in fin:
        line = line.split()
        dic[line] = line
    return dic
'''
function that rotates a word and find other words
'''
def find_word(word):
    rotated_words = dict() #dictionary for storing rotated words
    for i in range(1, 14):
        rotated = rotate(word, i)
        if rotated in dic:
            print word, rotated, i
if __name__ == "__main__":
    words = make_dictionary()
    for w in words:
        find_word(w)
I wonder if I'm missing some imports?

The error comes from these two lines:
line = line.split()
dic[line] = line
line is a list after the split and, as the error message tells you, lists aren't hashable; dictionary keys must be hashable. The minimal fix is to use an (immutable, hashable) tuple instead:
dic[tuple(line)] = line
Note that dictionary values can be lists; the restriction applies only to keys.

This makes line a list:
line = line.split()
dict keys need to be hashable, and lists are not hashable:
dic[line] = line
In your code it's not clear you need a dict. A set of words would suffice:
def make_set():
    filename = "/home/jorge/words.txt"
    result = set()
    with open(filename, 'r') as fin:
        for line in fin:
            for word in line.split():
                result.add(word)
    return result
Using a set will remove duplicate words.
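With a set in place, the rest of the program could look roughly like the Python 3 sketch below. It passes the word set to the search function explicitly, because find_word in the question refers to a name dic that is not defined in its scope. The function find_rotations and the sample words are hypothetical.

```python
def rotate(word, n):
    """Shift each ASCII letter of word by n places, wrapping around the alphabet."""
    shifted = []
    for ch in word:
        if not (ch.isascii() and ch.isalpha()):
            shifted.append(ch)  # leave anything that is not an ASCII letter unchanged
            continue
        start = ord('A') if ch.isupper() else ord('a')
        shifted.append(chr((ord(ch) - start + n) % 26 + start))
    return ''.join(shifted)


def find_rotations(word, words):
    """Return (rotated_word, shift) pairs for rotations of word that appear in words."""
    return [(rotate(word, i), i) for i in range(1, 14) if rotate(word, i) in words]


# Hypothetical word set standing in for the contents of words.txt
words = {"cheer", "jolly", "melon", "cubed"}
for w in sorted(words):
    for rotated, shift in find_rotations(w, words):
        print(w, rotated, shift)
```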
