Parse multi-fasta file to extract out sequences

I am trying to write a Python script to parse a large FASTA file. I do not want to use Biopython, since I am learning scripting. The script needs to print the accession number, sequence length, and GC content of each sequence to the console. I've been able to extract the accession numbers, but I can't extract the sequences: each sequence is spread over several lines, and because the file is read line by line I can't calculate the sequence length or GC content.
Could anyone help me?
I've tried to group the lines in a list, but that creates multiple lists within a list, and I'm not sure how to join them either.
seq=""
seqcount=0
seqlen=0
gc=0
#prompt user for file name
infile=input("Enter the name of your designated .fasta file: ")
with open(infile, "r") as fasta:
    print("\n")
    print ("Accession Number \t Sequence Length \t GC content (%)")
    for line in fasta:
        line.strip()
        if line[0]==">":
            seqcount+=1 #counts number sequences in file
            accession=line.split("|")[3] #extract accession
            seq=""
        else:
            seq+=line[:-1]
            seqlen=len(seq)
            print(accession, "\t \t", seqlen)
    print("\n")
    print("There are a total of", seqcount, "sequences in this file.")

You were not far from working code:
seq = ""
seqcount = 0
#prompt user for file name
infile = input("Enter the name of your designated .fasta file: ")
def pct_gc(s):
    #percentage of G and C bases, counting upper and lower case
    gc = s.count('G') + s.count('C') + s.count('g') + s.count('c')
    total = len(s)
    return gc*100.0/total
with open(infile, "r") as fasta:
    print("\n")
    print("Accession Number\tSequence Length\tGC content (%)")
    for line in fasta:
        line = line.strip()
        if not line:
            continue #skip blank lines
        if line[0] == ">":
            if seq != "":
                #a new header means the previous record is complete
                print("{}\t{}\t{}".format(accession, len(seq), pct_gc(seq)))
            seqcount += 1 #counts number sequences in file
            accession = line.split("|")[3] #extract accession
            seq = ""
        else:
            seq += line #strip() already removed the newline
    #the last record is not followed by a header, so print it here
    if seq != "":
        print("{}\t{}\t{}".format(accession, len(seq), pct_gc(seq)))
    print("\n")
    print("There are a total of " + str(seqcount) + " sequences in this file.")
Things to look for:
You don't need to update the length in every iteration. Just compute it once the sequence is complete.
str.strip() does not modify the string; it returns a stripped copy, so you have to assign the result. Once the line is stripped, don't also slice off the last character with line[:-1], or you lose the final base of every line.
You know you have read a full sequence when you find the next header and the accumulated sequence is not empty. That is the point at which you must write the output.
The last sequence is not followed by a new accession, so you have to process it separately, after the loop.
Use string formatting or concatenation to build the output. In Python 3, print() with comma-separated arguments joins them with spaces, which adds stray blanks around your tabs; in Python 2, print with a parenthesized, comma-separated list prints a tuple representation.
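The same "emit the record when the next header arrives" idea can be packaged as a generator, which keeps the parsing separate from the printing. This is only a sketch: read_fasta and gc_percent are names made up for this example, the sample records are invented, and the accession is assumed to sit in the fourth "|"-separated field as in the question.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                # a new header closes the previous record
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # the last record has no following header
        yield header, "".join(chunks)

def gc_percent(seq):
    """Return the GC content of seq as a percentage (0.0 for an empty string)."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

# Hypothetical two-record sample standing in for a real file
sample = io.StringIO(
    ">gi|1|gb|ABC123.1| first\nATGC\nGGCC\n>gi|2|gb|XYZ789.2| second\nAATT\n"
)
records = list(read_fasta(sample))
for header, seq in records:
    accession = header.split("|")[3]  # assumes the header layout from the question
    print("{}\t{}\t{:.1f}".format(accession, len(seq), gc_percent(seq)))
```

With a real file you would pass the handle from open(infile) instead of the io.StringIO sample.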

Related

How can i sort order of wordcount with Python?

I am using this code to count the same words in a text file.
filename = input("Enter name of input file: ")
file = open(filename, "r", encoding="utf8")
wordCounter = {}
with open(filename,'r',encoding="utf8") as fh:
    for line in fh:
        # Replacing punctuation characters. Making the string to lower.
        # The split will split the line into a list.
        word_list = line.replace(',','').replace('\'','').replace('.','').replace("'",'').replace('"','').replace('"','').replace('#','').replace('!','').replace('^','').replace('$','').replace('+','').replace('%','').replace('&','').replace('/','').replace('{','').replace('}','').replace('[','').replace(']','').replace('(','').replace(')','').replace('=','').replace('*','').replace('?','').lower().split()
        for word in word_list:
            # Adding the word into the wordCounter dictionary.
            if word not in wordCounter:
                wordCounter[word] = 1
            else:
                # if the word is already in the dictionary update its count.
                wordCounter[word] = wordCounter[word] + 1
print('{:15}{:3}'.format('Word','Count'))
print('-' * 18)
# printing the words and its occurrence.
for word,occurance in wordCounter.items():
    print(word,occurance)
I need the output sorted from the highest count to the lowest. For example:
word 1: 25
word 2: 12
word 3: 5
.
.
.
I also need to accept only a ".txt" file as input. If the user enters anything else, the program must show the error "Write a valid file name".
How can I sort the output and add the error check at the same time?

To print in order, sort the items by count before printing:
for word,occurance in sorted(wordCounter.items(), key=lambda x: x[1], reverse=True):
    print(word,occurance)
To check whether the file name is valid in the way that you want, you can use:
import os
path1 = "path/to/file1.txt"
path2 = "path/to/file2.png"
if not path1.lower().endswith('.txt'):
    print("Write a valid file name")
if not os.path.exists(path1):
    print("File does not exist!")

You can try:
if filename[-4:] != '.txt':
    print('Please input a valid file name')
and repeat the input prompt until the name is valid.
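As a rough sketch of how the two answers fit together, collections.Counter can do the counting and its most_common() method returns the words already sorted from the highest count down. The helper names count_words and is_txt_name are made up for this example, and string.punctuation strips every ASCII punctuation character, which is a little broader than the replace() chain in the question.

```python
from collections import Counter
import string

def count_words(lines):
    """Count words in an iterable of text lines, ignoring case and ASCII punctuation."""
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for line in lines:
        counts.update(line.translate(strip_punct).lower().split())
    return counts

def is_txt_name(filename):
    """Return True if the name ends in .txt (case-insensitive)."""
    return filename.lower().endswith(".txt")

# Hypothetical sample lines standing in for the contents of a file
sample_lines = ["The cat, the dog!", "The end."]
counts = count_words(sample_lines)
for word, n in counts.most_common():  # highest count first
    print("{:15}{:3}".format(word, n))
```

In the real program you would keep asking with input() in a while loop until is_txt_name(filename) is true, then pass the open file handle to count_words.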

Can't import data from files into one line

I'm currently working on a program that my school assigned: writing your own name in ASCII text art, but that was just copying and pasting. I am trying to make it so the user enters their name and it is output as ASCII art. My program currently works, except the letters don't stay on one line.
My code:
name = input("What is your name: ")
splitname = list(name)
for i in range(len(splitname)):
    f=open(splitname[i] + ".txt","r")
    contents = f.read()
    print(contents)
And this is what it outputs:
I would like to get it all onto one line if possible, how would I do so?

The solution is a bit more complicated because the terminal prints line by line, so you need the contents of all the 'letter' files before you can print the first output line.
The idea is to read the first line of the first letter, concatenate it with the first line of the next letter, and so on. Then do the same for the second line, until you have printed all the lines.
I will not provide a complete solution, but I can help fix your code. To start, you have to read only one line of the letter file at a time. You can do this with f.readline() instead of f.read(). Each consecutive call reads the next line of the file, as long as the handle is still open.

To print the ASCII letters one next to the other, you have to split the letter into multiple lines and concatenate all the corresponding lines.
Assuming your ASCII text is made of 8 lines:
name = input("What is your name: ")
splitname = list(name)
# Put the right number of lines of the ASCII letter
letter_height = 8
# This will contain the new lines
# obtained concatenating the lines
# of the single letters
complete_lines = [""] * letter_height
for i in range(len(splitname)):
    # "with" closes each letter file as soon as it has been read
    with open(splitname[i] + ".txt","r") as f:
        contents = f.read()
    # Split the letter in lines
    lines = contents.splitlines()
    # Concatenate the lines
    for j in range(letter_height):
        complete_lines[j] = complete_lines[j] + " " + lines[j]
# Print all the lines
for j in range(letter_height):
    print(complete_lines[j])
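If the letter files do not all have the same height or width, the row-by-row concatenation can be done with itertools.zip_longest and some padding. The sketch below is only an illustration: render_side_by_side is a made-up helper, and the two three-line glyphs are invented stand-ins for the contents of the letter files.

```python
from itertools import zip_longest

def render_side_by_side(glyphs, gap=" "):
    """Join multi-line glyph strings horizontally and return the output lines."""
    blocks = []
    for glyph in glyphs:
        rows = glyph.splitlines()
        width = max((len(row) for row in rows), default=0)
        blocks.append((width, rows))
    lines = []
    # zip_longest walks row 0 of every glyph, then row 1, and so on;
    # shorter glyphs are filled with empty rows
    for row_group in zip_longest(*(rows for _, rows in blocks), fillvalue=""):
        padded = [row.ljust(width) for row, (width, _) in zip(row_group, blocks)]
        lines.append(gap.join(padded).rstrip())
    return lines

# Hypothetical glyphs; in the real program these would come from f.read()
letter_h = "# #\n###\n# #"
letter_i = "###\n # \n###"
for line in render_side_by_side([letter_h, letter_i]):
    print(line)
```

Padding every row to the glyph's width keeps the columns aligned even when a row ends with trailing blanks that were lost in the file.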

Python - How to split a list into two separate lists dynamically

I am using Python-3 and I am reading a text file which can have multiple paragraphs separated by '\n'. I want to split all those paragraphs into a separate list. There can be n number of paragraphs in the input file.
So this split and output list creation should happen dynamically thereby allowing me to view a particular paragraph by just entering the paragraph number as list[2] or list[3], etc....
So far I have tried the below process :
input = open("input.txt", "r") #Reading the input file
lines = input.readlines() #Creating a List with separate sentences
str = '' #Declaring a empty string
for i in range(len(lines)):
    if len(lines[i]) > 2: #If the length of a line is < 2, It means it can be a new paragraph
        str += lines[i]
This method does not store the paragraphs in a new list (I am not sure how to do that). It just drops the lines that contain only '\n' and stores all the other input lines in the str variable. When I tried to display the contents of str, the output was shown as words, but I need sentences.
My code should store all the sentences up to the first occurrence of '\n' in one list, the sentences up to the next occurrence in another list, and so on.
Any ideas on this ?
UPDATE
I found a way to print all the lines that are present until '\n'. But when I try to store them into the list, it is getting stored as letters, not as whole sentences. Below is the code snippet for reference
input = open("input.txt", "r")
lines = input.readlines()
input_ = []
for i in range(len(lines)):
    if len(lines[i]) <= 2:
        for j in range(i):
            input_.append(lines[j]) #This line is storing as letters.
even "input_ += lines" is storing as letters, Not as sentences.
Any idea how to modify this code to get the desired output ?

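Don't forget to call input.close() once you have finished reading, so the file handle is released.
Alternatively, you can use with.
#Using "with" closes the file automatically, so you don't need to write file.close()
with open("input.txt","r") as file:
    file_ = file.read().split("\n")
file_ is now a list with one item per line of the file, so you get one paragraph per item as long as each paragraph sits on a single line. If your paragraphs span several lines and are separated by blank lines, split on "\n\n" instead.
It's as simple as 2 lines.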
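If the paragraphs span several lines and are separated by one or more blank lines, which is what the len(lines[i]) <= 2 test in the question suggests, a small loop can collect each paragraph as its own list of lines. This is a sketch under that assumption; split_paragraphs is a made-up name and the sample text is invented.

```python
def split_paragraphs(text):
    """Return a list of paragraphs, each a list of its non-empty lines."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # a blank line closes the paragraph collected so far
            paragraphs.append(current)
            current = []
    if current:
        # the last paragraph may not be followed by a blank line
        paragraphs.append(current)
    return paragraphs

# Hypothetical sample standing in for the contents of input.txt
sample = "First line.\nSecond line.\n\nNew paragraph.\n\n\nLast one.\n"
paragraphs = split_paragraphs(sample)
print(paragraphs[1])  # second paragraph, as a list of whole lines
```

Appending whole lines (rather than using += with a string) is what keeps the items from being stored letter by letter.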

How to convert a list into float for using the '.join' function?

I have to compress a file into a list of words and list of positions to recreate the original file. My program should also be able to take a compressed file and recreate the full text, including punctuation and capitalization, of the original file. I have everything correct apart from the recreation, using the map function my program can't convert my list of positions into floats because of the '[' as it is a list.
My code is:
text = open("speech.txt")
CharactersUnique = []
ListOfPositions = []
DownLine = False
while True:
    line = text.readline()
    if not line:
        break
    TwoList = line.split()
    for word in TwoList:
        if word not in CharactersUnique:
            CharactersUnique.append(word)
        ListOfPositions.append(CharactersUnique.index(word))
    if not DownLine:
        CharactersUnique.append("\n")
        DownLine = True
    ListOfPositions.append(CharactersUnique.index("\n"))
w = open("List_WordsPos.txt", "w")
for c in CharactersUnique:
    w.write(c)
w.close()
x = open("List_WordsPos.txt", "a")
x.write(str(ListOfPositions))
x.close()
with open("List_WordsPos.txt", "r") as f:
    NewWordsUnique = f.readline()
    f.close()
h = open("List_WordsPos.txt", "r")
lines = h.readlines()
NewListOfPositions = lines[1]
NewListOfPositions = map(float, NewListOfPositions)
print("Recreated Text:\n")
recreation = " " .join(NewWordsUnique[pos] for pos in (NewListOfPositions))
print(recreation)
The error I get is:
Task 3 Code.py", line 42, in <genexpr>
recreation = " " .join(NewWordsUnique[pos] for pos in (NewListOfPositions))
ValueError: could not convert string to float: '['
I am using Python IDLE 3.5 (32-bit). Does anyone have any ideas on how to fix this?

Why do you want to turn the position values in the list into floats? They are list indices, and those must be integers. I suspect this might be an instance of what is called the XY Problem.
I also found your code difficult to understand because you haven't followed PEP 8 - Style Guide for Python Code. In particular, many (although not all) of the variable names are CamelCased, which according to the guidelines should be reserved for class names.
In addition, some of your variables had misleading names, like CharactersUnique, which actually [mostly] contained unique words.
So, one of the first things I did was transform all the CamelCased variables into lowercase underscore-separated words, like camel_case. In several instances I also gave them better names to reflect their actual contents or role. For example, CharactersUnique became unique_words.
The next step was to improve the handling of files by using Python's with statement to ensure they would all be closed automatically at the end of the block. In other cases I consolidated multiple open() calls into one.
After all that I had it almost working, but that's when I discovered a problem with the approach of treating newline "\n" characters as separate words of the input text file. This caused a problem when the file was being recreated by the expression:
" ".join(NewWordsUnique[pos] for pos in (NewListOfPositions))
because it adds a space before and after every "\n" character it encounters, and those spaces aren't in the original file. To work around that, I ended up writing out the for loop that recreates the file instead of using a generator expression, because doing so allows the newline "words" to be handled properly.
At any rate, here's the resulting rewritten (and working) code:
input_filename = "speech.txt"
compressed_filename = "List_WordsPos.txt"
# Two lists to represent contents of input file.
unique_words = ["\n"] # preload with newline "word"
word_positions = []
with open(input_filename, "r") as input_file:
    for line in input_file:
        for word in line.split():
            if word not in unique_words:
                unique_words.append(word)
            word_positions.append(unique_words.index(word))
        word_positions.append(unique_words.index("\n")) # add newline at end of each line
# Write representations of the two data-structures to compressed file.
with open(compressed_filename, "w") as compr_file:
    words_repr = " ".join(repr(word) for word in unique_words)
    compr_file.write(words_repr + "\n")
    positions_repr = " ".join(repr(posn) for posn in word_positions)
    compr_file.write(positions_repr + "\n")
def strip_quotes(word):
    """Strip the first and last characters from the string (assumed to be quotes)."""
    tmp = word[1:-1]
    return tmp if tmp != "\\n" else "\n" # newline "words" are special case
# Recreate input file from data in compressed file.
with open(compressed_filename, "r") as compr_file:
    line = compr_file.readline()
    new_unique_words = list(map(strip_quotes, line.split()))
    line = compr_file.readline()
    new_word_positions = map(int, line.split()) # using int, not float here
words = []
lines = []
for posn in new_word_positions:
    word = new_unique_words[posn]
    if word != "\n":
        words.append(word)
    else:
        lines.append(" ".join(words))
        words = []
print("Recreated Text:\n")
recreation = "\n".join(lines)
print(recreation)
I created my own speech.txt test file from the first paragraph of your question and ran the script on it with these results:
Recreated Text:
I have to compress a file into a list of words and list of positions to recreate
the original file. My program should also be able to take a compressed file and
recreate the full text, including punctuation and capitalization, of the
original file. I have everything correct apart from the recreation, using the
map function my program can't convert my list of positions into floats because
of the '[' as it is a list.

Per your question in the comments:
You will want to split the input on spaces. You will also likely want to use different data structures.
# we'll map the words to a list of positions
all_words = {}
with open("speech.txt") as f:
    data = f.read()
# since we need to be able to re-create the file, we'll want
# line breaks
lines = data.split("\n")
for i, line in enumerate(lines):
    words = line.split(" ")
    for j, word in enumerate(words):
        if word in all_words:
            all_words[word].append((i, j)) # line and pos
        else:
            all_words[word] = [(i, j)]
Note that this does not yield maximum compression, because foo and foo. count as separate words. If you want more compression, you'll have to go character by character. Hopefully you can now use a similar approach to do so if desired.
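To show that a word-to-positions map like this is enough to recreate the text, here is a small round-trip sketch. build_index repeats the loop above as a function and rebuild_text is a made-up helper for the reverse step; both names and the sample text are assumptions of this example, not part of the answer.

```python
def build_index(text):
    """Map each word to the list of (line, position) pairs where it occurs."""
    index = {}
    for i, line in enumerate(text.split("\n")):
        for j, word in enumerate(line.split(" ")):
            index.setdefault(word, []).append((i, j))
    return index

def rebuild_text(index):
    """Recreate the original text from a word -> positions map."""
    placed = {}
    for word, positions in index.items():
        for pos in positions:
            placed[pos] = word
    lines = {}
    # sorting the (line, position) tuples restores the original word order
    for i, j in sorted(placed):
        lines.setdefault(i, []).append(placed[(i, j)])
    n_lines = max(lines) + 1 if lines else 0
    return "\n".join(" ".join(lines.get(i, [])) for i in range(n_lines))

# Hypothetical sample standing in for the contents of speech.txt
sample = "to be or not to be\nthat is the question\n"
index = build_index(sample)
print(rebuild_text(index) == sample)
```

Because the text is split on single spaces and on "\n", empty strings are recorded for blank lines and repeated spaces, which is what makes the round trip exact.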

Search for values in all text files and multiply them by fixed value ? (in PYTHON ?)

Hi friends.
I have a lot of text files, but I want to find only specific lines, and then in these lines find the values at specific positions and multiply them by a fixed value (or one entered as input).
Example text:
1,0,0,0,1,0,0
15.000,15.000,135.000,15.000
7
3,0,0,0,2,0,0
'holep_str',50.000,-15.000,20.000,20.000,0.000
3
3,0,0,100,3,-8,0
58.400,-6.600,'14',4.000,0.000
4
3,0,0,0,3,-8,0
50.000,-15.000,50.000,-15.000
7
3,0,0,0,4,0,0
'holep_str',100.000,-15.000,14.000,14.000,0.000
3
3,0,0,100,5,-8,0
108.400,-6.600,'14',4.000,0.000
And I want to identify and modify only lines with "holep_str" text:
'holep_str',50.000,-15.000,20.000,20.000,0.000
'holep_str',100.000,-15.000,14.000,14.000,0.000
Each line that begins with the string "holep_str" contains two numbers I want to change, the 3rd and 4th values after the label:
20.000 20.000
14.000 14.000
And these can be identified like:
1./ number after 3rd comma on line beginning with "holep_str"
2./ number after 4th comma on line beginning with "holep_str"
RegEx cannot help. Python probably can, but I'm pressed for time and can't get any further with the language...
Can somebody explain how to write this relatively simple code, which finds all lines with the "search string" (= "holep_str") and multiplies the values after the 3rd and 4th comma by FIXVALUE (or an input value, for example "2")?
The code should walk through all files with a given extension (chosen by input, for example txt) in the folder where it is executed, find the values on the matching lines, multiply them, and write them back...
So it looks like - if FIXVALUE = 2:
'holep_str',50.000,-15.000,40.000,40.000,0.000
'holep_str',100.000,-15.000,28.000,28.000,0.000
And whole text looks like then:
1,0,0,0,1,0,0
15.000,15.000,135.000,15.000
7
3,0,0,0,2,0,0
'holep_str',50.000,-15.000,40.000,40.000,0.000
3
3,0,0,100,3,-8,0
58.400,-6.600,'14',4.000,0.000
4
3,0,0,0,3,-8,0
50.000,-15.000,50.000,-15.000
7
3,0,0,0,4,0,0
'holep_str',100.000,-15.000,28.000,28.000,0.000
3
3,0,0,100,5,-8,0
108.400,-6.600,'14',4.000,0.000
Thank You.

# file_path is the path of the file you want to process
with open(file_path) as f:
    lines = f.readlines()
for line in lines:
    if line.startswith(r"'holep_str'"):
        split_line = line.split(',')
        num1 = float(split_line[3])
        num2 = float(split_line[4])
        print(num1, num2)
        # do stuff with num1 and num2
Once you .split() the line with "," as the argument, you get a list. Then you can pick out the values you want by index, which are 3 and 4 in your case. I also convert them to float.
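The "do stuff" step can be sketched as a function that takes one line and returns it with the two values multiplied. scale_line and its parameters are names invented for this example, and formatting the result with three decimals is an assumption based on the sample data in the question.

```python
def scale_line(line, prefix="'holep_str'", factor=2.0, fields=(3, 4)):
    """Multiply the comma-separated fields at the given indexes on matching lines."""
    if not line.startswith(prefix):
        return line  # other lines pass through unchanged
    newline = "\n" if line.endswith("\n") else ""
    parts = line.rstrip("\n").split(",")
    for idx in fields:
        # three decimals, to match the number format in the sample file
        parts[idx] = "{:.3f}".format(float(parts[idx]) * factor)
    return ",".join(parts) + newline

# Lines taken from the example text in the question
print(scale_line("'holep_str',50.000,-15.000,20.000,20.000,0.000\n"), end="")
print(scale_line("3,0,0,0,2,0,0\n"), end="")
```

Writing the file back then comes down to calling scale_line on every line and writing the results to a new file.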

Also final solution - whole program (version: python-3.6.0-amd64):
# import external functions / extensions ...
import os
import glob
# functions definition section
def fnc_walk_through_files(path, file_extension):
    for (dirpath, dirnames, filenames) in os.walk(path):
        for filename in filenames:
            if filename.endswith(file_extension):
                # join with dirpath so files in subfolders get the right path
                yield os.path.join(dirpath, filename)
# some variables for counting
line_count = 0
# Feed data to program by entering them on keyboard
print ("Enter work path (e.g. d:\\test) :")
workPath = input( "> " )
print ("File extension to perform Search-Replace on [spf] :")
fileExt = input( "> " )
print ("Enter multiplier value :")
multiply_value = input( "> " )
print ("Text to search for :")
textToSearch = input( "> " )
# create temporary variable with path and mask for deleting all ".old" files
delPath = os.path.join(workPath, "*.old")
# delete old ".old" files to allow creating backups
for files_to_delete in glob.glob(delPath, recursive=False):
    os.remove(files_to_delete)
# do some needed operations...
print("\r") #enter new line
multiply_value = float(multiply_value) # convert multiplier to float
textToSearch_mod = "\'" + textToSearch # prepend apostrophe to searched text
textToSearch_mod = str(textToSearch_mod) # convert variable to string for later use
# print information line of what will be searched for
print ("This is what will be searched for, to identify right line: ", textToSearch_mod)
print("\r") #enter new line
# walk through all files with specified extension <-- CALLED FUNCTION !!!
for fname in fnc_walk_through_files(workPath, fileExt):
    print("\r") # enter new line
    # print filename of processed file
    print(" Filename processed:", fname )
    # process every file and print out the numbers
    # to be multiplied, located at 3rd and 4th position
    with open(fname, 'r') as f: # opens fname file for reading
        temp_file = open('tempfile','w') # open (create) tempfile for writing
        lines = f.readlines() # read lines from f:
        line_count = 0 # reset counter
        # loop through all lines
        for line in lines:
            # line counter increment
            line_count = line_count + 1
            # if line starts with defined string - it will be processed
            if line.startswith(textToSearch_mod):
                # line will be divided into parts delimited by ","
                split_line = line.split(',')
                # transfer 3rd part to variable 1 and make it float number
                old_num1 = float(split_line[3])
                # transfer 4th part to variable 2 and make it float number
                old_num2 = float(split_line[4])
                # multiply both variables
                new_num1 = old_num1 * multiply_value
                new_num2 = old_num2 * multiply_value
                # change old values to new multiplied values as strings
                # (three decimals, as in the input files)
                split_line[3] = "{:.3f}".format(new_num1)
                split_line[4] = "{:.3f}".format(new_num2)
                # join the line back with the same delimiter "," as used for dividing
                line = ','.join(split_line)
                # print information line on which the searched string occurred
                print ("Changed from old:", old_num1, old_num2, "to new:", new_num1, new_num2, "at line:", line_count)
                # write changed line with multiplied numbers to temporary file
                temp_file.write(line)
            else:
                # write all other unchanged lines to temporary file
                temp_file.write(line)
    # the original file is closed here, so it can be renamed
    # create new name for backup file with adding ".old" to the end of filename
    new_name = fname + '.old'
    # rename original file to new backup name
    os.rename(fname,new_name)
    # close temporary file to enable future operation (in this case rename)
    temp_file.close()
    # rename temporary file to original filename
    os.rename('tempfile',fname)
So, 2 days after asking, with the help of good people, hard study of the language :-D (indentation was my nightmare) and some snippets of code from this site, I have created something that works... :-) I hope it helps other people with a similar question...
At the beginning the idea was clear, but I had no knowledge of the language...
Now everything can be done; the only limit is what one can imagine :-)
I miss GOTO in Python :'( ... I love spaghetti, not spaghetti code, but sometimes it would be good to have some label<--goto jumps... (but this is not the case here...)
