I am trying to write a Python script to parse a large FASTA file. I do not want to use Biopython, since I am learning scripting. The script needs to print the accession number, sequence length, and GC content of each sequence to the console. I've been able to extract the accession numbers, but not the sequences: they are read line by line, which prevents me from calculating the sequence length and GC content.
Could anyone help me?
I've tried to group the lines in a list, but that creates multiple lists within a list, and I'm not sure how to join them either.
seq=""
seqcount=0
seqlen=0
gc=0
#prompt user for file name
infile=input("Enter the name of your designated .fasta file: ")
with open(infile, "r") as fasta:
    print("\n")
    print ("Accession Number \t Sequence Length \t GC content (%)")
    for line in fasta:
        line.strip()
        if line[0]==">":
            seqcount+=1 #counts number sequences in file
            accession=line.split("|")[3] #extract accession
            seq=""
        else:
            seq+=line[:-1]
            seqlen=len(seq)
        print(accession, "\t \t", seqlen)
    print("\n")
    print("There are a total of", seqcount, "sequences in this file.")
You were not far from working code:
seq=""
seqcount=0
#prompt user for file name
infile=input("Enter the name of your designated .fasta file: ")
def pct_gc(s):
    """Return the GC content of s as a percentage (0.0 for an empty sequence)."""
    gc = s.count('G') + s.count('C') + s.count('g') + s.count('c')
    total = len(s)
    return gc*100.0/total if total else 0.0
with open(infile, "r") as fasta:
    print("\n")
    print ("Accession Number\tSequence Length\tGC content (%)")
    for line in fasta:
        line = line.strip()
        if line.startswith(">"):
            if seq != "":
                # a new header means the previous record is complete
                print("{}\t{}\t{}".format(accession, len(seq), pct_gc(seq)))
            seqcount+=1 #counts number sequences in file
            accession=line.split("|")[3] #extract accession
            seq=""
        else:
            seq+=line #line is already stripped, so append all of it
    # the last record has no header after it, so print it here
    if seq != "":
        print("{}\t{}\t{}".format(accession, len(seq), pct_gc(seq)))
    print("\n")
    print("There are a total of " + str(seqcount) + " sequences in this file.")
Things to look for:
You don't need to update the length in every iteration. Just compute it once the sequence is complete.
str.strip() does not modify the string; it returns a new, stripped string, so you have to assign the result.
You know you have read a full sequence when you reach the next header and the current sequence is not empty. That is the point at which you must write the output.
The last sequence is not followed by a new accession line, so you have to process it separately, after the loop.
Use string formatting or concatenation to control the output. If you just pass strings and variables to print() separated by commas, Python 3 inserts a space between them (Python 2's print statement would show a tuple instead), which is why the original output had stray spaces around the tabs.
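If you want to separate the parsing from the printing, the points above can be sketched as a small generator. This is only one possible arrangement, not the only way to do it: the names read_fasta and gc_percent and the two sample records are made up for illustration, and the header layout (accession in the fourth "|"-separated field) is assumed to match your file.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()            # strip() returns a new string
        if not line:
            continue                   # skip blank lines
        if line.startswith(">"):
            if header is not None:     # a new header closes the previous record
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:             # the last record has no header after it
        yield header, "".join(chunks)


def gc_percent(seq):
    """GC content of seq as a percentage; 0.0 for an empty sequence."""
    if not seq:
        return 0.0
    s = seq.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s)


# Hypothetical two-record input, used only for illustration.
sample = [
    ">gi|1|ref|ACC_0001|demo one\n",
    "ATGC\n",
    "GGCC\n",
    ">gi|2|ref|ACC_0002|demo two\n",
    "AATT\n",
]
records = list(read_fasta(sample))
for header, seq in records:
    accession = header.split("|")[3]
    print("{}\t{}\t{:.1f}".format(accession, len(seq), gc_percent(seq)))
```

To run it on a real file, pass the open file object instead of the sample list, since a file is also an iterable of lines.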
I have to compress a file into a list of words and list of positions to recreate the original file. My program should also be able to take a compressed file and recreate the full text, including punctuation and capitalization, of the original file. I have everything correct apart from the recreation, using the map function my program can't convert my list of positions into floats because of the '[' as it is a list.
My code is:
text = open("speech.txt")
CharactersUnique = []
ListOfPositions = []
DownLine = False
while True:
    line = text.readline()
    if not line:
        break
    TwoList = line.split()
    for word in TwoList:
        if word not in CharactersUnique:
            CharactersUnique.append(word)
        ListOfPositions.append(CharactersUnique.index(word))
    if not DownLine:
        CharactersUnique.append("\n")
        DownLine = True
    ListOfPositions.append(CharactersUnique.index("\n"))
w = open("List_WordsPos.txt", "w")
for c in CharactersUnique:
    w.write(c)
w.close()
x = open("List_WordsPos.txt", "a")
x.write(str(ListOfPositions))
x.close()
with open("List_WordsPos.txt", "r") as f:
    NewWordsUnique = f.readline()
    f.close()
h = open("List_WordsPos.txt", "r")
lines = h.readlines()
NewListOfPositions = lines[1]
NewListOfPositions = map(float, NewListOfPositions)
print("Recreated Text:\n")
recreation = " " .join(NewWordsUnique[pos] for pos in (NewListOfPositions))
print(recreation)
The error I get is:
Task 3 Code.py", line 42, in <genexpr>
recreation = " " .join(NewWordsUnique[pos] for pos in (NewListOfPositions))
ValueError: could not convert string to float: '['
I am using Python IDLE 3.5 (32-bit). Does anyone have any ideas on how to fix this?
Why do you want to turn the position values in the list into floats, since they are list indices, and those must be integers? I suspect this might be an instance of what is called the XY Problem.
I also found your code difficult to understand because you haven't followed PEP 8 - Style Guide for Python Code. In particular, many (although not all) of the variable names are CamelCased, which according to the guidelines should be reserved for class names.
In addition some of your variables had misleading names, like CharactersUnique, which actually [mostly] contained unique words.
So, one of the first things I did was transform all the CamelCased variables into lowercase underscore-separated words, like camel_case. In several instances I also gave them better names to reflect their actual contents or role: For example: CharactersUnique became unique_words.
The next step was to improve the handling of files by using Python's with statement to ensure they all would be closed automatically at the end of the block. In other cases I consolidated multiple file open() calls into one.
After all that I had it almost working, but that's when I discovered a problem with the approach of treating newline "\n" characters as separate words of the input text file. This caused a problem when the file was being recreated by the expression:
" ".join(NewWordsUnique[pos] for pos in (NewListOfPositions))
because it adds a space before and after every "\n" character encountered, and those spaces aren't in the original file. To work around that, I ended up writing out the for loop that recreates the file instead of using a generator expression, because doing so allows the newline "words" to be handled properly.
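A tiny demonstration of that spacing problem, using a made-up word table and position list (both hypothetical, not taken from your file):

```python
unique_words = ["\n", "Hello", "world", "again"]  # hypothetical word table
positions = [1, 2, 0, 3, 0]                       # 0 marks the end of a line

# Naive approach: the newline "word" picks up a space on both sides.
naive = " ".join(unique_words[p] for p in positions)

# Line-aware approach: join words with spaces, then join lines with "\n".
lines, current = [], []
for p in positions:
    word = unique_words[p]
    if word == "\n":
        lines.append(" ".join(current))
        current = []
    else:
        current.append(word)
rebuilt = "\n".join(lines)

print(repr(naive))    # 'Hello world \n again \n'
print(repr(rebuilt))  # 'Hello world\nagain'
```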
At any rate, here's the resulting rewritten (and working) code:
input_filename = "speech.txt"
compressed_filename = "List_WordsPos.txt"
# Two lists to represent contents of input file.
unique_words = ["\n"] # preload with newline "word"
word_positions = []
with open(input_filename, "r") as input_file:
    for line in input_file:
        for word in line.split():
            if word not in unique_words:
                unique_words.append(word)
            word_positions.append(unique_words.index(word))
        word_positions.append(unique_words.index("\n")) # add newline at end of each line
# Write representations of the two data-structures to compressed file.
with open(compressed_filename, "w") as compr_file:
    words_repr = " ".join(repr(word) for word in unique_words)
    compr_file.write(words_repr + "\n")
    positions_repr = " ".join(repr(posn) for posn in word_positions)
    compr_file.write(positions_repr + "\n")
def strip_quotes(word):
    """Strip the first and last characters from the string (assumed to be quotes)."""
    tmp = word[1:-1]
    return tmp if tmp != "\\n" else "\n" # newline "words" are special case
# Recreate input file from data in compressed file.
with open(compressed_filename, "r") as compr_file:
    line = compr_file.readline()
    new_unique_words = list(map(strip_quotes, line.split()))
    line = compr_file.readline()
    new_word_positions = map(int, line.split()) # using int, not float here
words = []
lines = []
for posn in new_word_positions:
    word = new_unique_words[posn]
    if word != "\n":
        words.append(word)
    else:
        lines.append(" ".join(words))
        words = []
print("Recreated Text:\n")
recreation = "\n".join(lines)
print(recreation)
I created my own speech.txt test file from the first paragraph of your question and ran the script on it with these results:
Recreated Text:
I have to compress a file into a list of words and list of positions to recreate
the original file. My program should also be able to take a compressed file and
recreate the full text, including punctuation and capitalization, of the
original file. I have everything correct apart from the recreation, using the
map function my program can't convert my list of positions into floats because
of the '[' as it is a list.
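As a side note, the original ValueError came from writing str(ListOfPositions) to the file and then mapping float over the characters of that string, the first of which is '['. One way to sidestep hand-rolled parsing altogether is to let the standard json module serialize both lists. The sketch below is a swapped-in alternative to the repr-based format above, not part of the rewritten script; the function names compress and decompress and the sample text are made up, and everything is kept in memory rather than in a file.

```python
import json


def compress(text):
    """Return (unique_words, positions); index 0 is the newline "word"."""
    words, positions = ["\n"], []
    for line in text.split("\n"):
        for word in line.split():
            if word not in words:
                words.append(word)
            positions.append(words.index(word))
        positions.append(0)            # mark the end of each line
    return words, positions


def decompress(words, positions):
    """Rebuild the text from the two lists."""
    lines, current = [], []
    for p in positions:
        if words[p] == "\n":
            lines.append(" ".join(current))
            current = []
        else:
            current.append(words[p])
    return "\n".join(lines)


text = "the cat saw the dog\nthe dog ran"   # hypothetical sample text
payload = json.dumps(compress(text))        # one JSON string holding both lists
words, positions = json.loads(payload)      # positions come back as ints
restored = decompress(words, positions)
print(restored)
```

The payload string is what you would write to (and read back from) the compressed file. As with the script above, runs of whitespace inside a line collapse to single spaces.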
Per your question in the comments:
You will want to split the input on spaces. You will also likely want to use different data structures.
# we'll map the words to a list of positions
all_words = {}
with open("speech.txt") as f:
    data = f.read()
# since we need to be able to re-create the file, we'll want
# line breaks
lines = data.split("\n")
for i, line in enumerate(lines):
    words = line.split(" ")
    for j, word in enumerate(words):
        if word in all_words:
            all_words[word].append((i, j)) # line and pos
        else:
            all_words[word] = [(i, j)]
Note that this does not yield maximum compression as foo and foo. count as separate words. If you want more compression, you'll have to go character by character. Hopefully now you can use a similar approach to do so if desired.
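For completeness, here is a rough sketch of how the text could be rebuilt from such a word-to-positions mapping. The function names build_index and rebuild and the sample text are hypothetical; build_index is just the loop above wrapped in a function so the round trip can be shown in memory.

```python
def build_index(text):
    """Map each word to a list of (line, position) pairs."""
    index = {}
    for i, line in enumerate(text.split("\n")):
        for j, word in enumerate(line.split(" ")):
            index.setdefault(word, []).append((i, j))
    return index


def rebuild(index):
    """Place every word back at its (line, position) and join the result."""
    placed = {}                                  # line number -> {position: word}
    for word, spots in index.items():
        for i, j in spots:
            placed.setdefault(i, {})[j] = word
    lines = []
    for i in range(max(placed) + 1):
        row = placed.get(i, {})
        lines.append(" ".join(row[j] for j in sorted(row)))
    return "\n".join(lines)


text = "to be or\n\nnot to  be"                  # hypothetical sample text
index = build_index(text)
print(rebuild(index) == text)
```

Because the lines are split on a single space, empty lines and repeated spaces are stored as empty-string "words", so they survive the round trip.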
Hi friends.
I have a lot of files that contain text data. I want to find only specific lines in them, then pick out the values at specific positions in those lines and multiply them by a fixed value (or by a value entered as input).
Example text:
1,0,0,0,1,0,0
15.000,15.000,135.000,15.000
7
3,0,0,0,2,0,0
'holep_str',50.000,-15.000,20.000,20.000,0.000
3
3,0,0,100,3,-8,0
58.400,-6.600,'14',4.000,0.000
4
3,0,0,0,3,-8,0
50.000,-15.000,50.000,-15.000
7
3,0,0,0,4,0,0
'holep_str',100.000,-15.000,14.000,14.000,0.000
3
3,0,0,100,5,-8,0
108.400,-6.600,'14',4.000,0.000
And I want to identify and modify only lines with "holep_str" text:
'holep_str',50.000,-15.000,20.000,20.000,0.000
'holep_str',100.000,-15.000,14.000,14.000,0.000
In each line that begins with the string "holep_str" there are two numbers of interest, the values after the 3rd and 4th commas:
20.000 20.000
14.000 14.000
And these can be identified like:
1./ number after 3rd comma on line beginning with "holep_str"
2./ number after 4th comma on line beginning with "holep_str"
RegEx cannot help here. Python probably can, but I'm pressed for time and can't get any further with the language...
Can somebody explain how to write this relatively simple code, which finds all lines that start with the search string (= "holep_str") and multiplies the values after the 3rd and 4th commas by FIXVALUE (or by an input value, for example "2")?
The code should walk through all files with a given extension (chosen by input, for example txt) in the folder where it is executed, find all the values on the matching lines, multiply them, and write them back...
So, with FIXVALUE = 2, the lines would look like this:
'holep_str',50.000,-15.000,40.000,40.000,0.000
'holep_str',100.000,-15.000,28.000,28.000,0.000
And the whole text would then look like this:
1,0,0,0,1,0,0
15.000,15.000,135.000,15.000
7
3,0,0,0,2,0,0
'holep_str',50.000,-15.000,40.000,40.000,0.000
3
3,0,0,100,3,-8,0
58.400,-6.600,'14',4.000,0.000
4
3,0,0,0,3,-8,0
50.000,-15.000,50.000,-15.000
7
3,0,0,0,4,0,0
'holep_str',100.000,-15.000,28.000,28.000,0.000
3
3,0,0,100,5,-8,0
108.400,-6.600,'14',4.000,0.000
Thank You.
with open(file_path) as f:
    lines = f.readlines()
for line in lines:
    if line.startswith(r"'holep_str'"):
        split_line = line.split(',')
        num1 = float(split_line[3])
        num2 = float(split_line[4])
        print(num1, num2)
        # do stuff with num1 and num2
Once you .split() the line on ",", you get a list. Then you can pick out the values you want by index, which are 3 and 4 in your case (indices start at 0, so these are the values after the 3rd and 4th commas). I also convert them to float.
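Going one step further, the multiply-and-rejoin part could look roughly like the sketch below. The helper name scale_line and its parameters are made up for illustration. The three-decimal formatting is an assumption based on the sample data in the question, and the function leaves every non-matching line untouched.

```python
def scale_line(line, factor, prefix="'holep_str'", fields=(3, 4)):
    """Multiply the given comma-separated fields if the line starts with prefix."""
    if not line.startswith(prefix):
        return line                                # other lines pass through unchanged
    newline = "\n" if line.endswith("\n") else ""
    parts = line.rstrip("\n").split(",")
    for i in fields:
        # keep three decimals, as in the sample data
        parts[i] = "{:.3f}".format(float(parts[i]) * factor)
    return ",".join(parts) + newline


before = "'holep_str',50.000,-15.000,20.000,20.000,0.000\n"
after = scale_line(before, 2)
print(after, end="")
```

Applying scale_line to every line of a file and writing the results out gives the modified file.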
So, here is the final solution as a whole program (version: python-3.6.0-amd64):
# import external functions / extensions ...
import os
import glob
# functions definition section
def fnc_walk_through_files(path, file_extension):
    for (dirpath, dirnames, filenames) in os.walk(path):
        for filename in filenames:
            if filename.endswith(file_extension):
                # join with dirpath (not path) so files in subfolders get the right path
                yield os.path.join(dirpath, filename)
# some variables for counting
line_count = 0
# Feed data to program by entering them on keyboard
print ("Enter work path (e.g. d:\\test) :")
workPath = input( "> " )
print ("File extension to perform Search-Replace on [spf] :")
fileExt = input( "> " )
print ("Enter multiplier value :")
multiply_value = input( "> " )
print ("Text to search for :")
textToSearch = input( "> " )
# create temporary variable with path and mask for deleting all ".old" files
delPath = os.path.join(workPath, "*.old")
# delete old ".old" files to allow creating backups
for files_to_delete in glob.glob(delPath, recursive=False):
    os.remove(files_to_delete)
# do some needed operations...
print("\r") #enter new line
multiply_value = float(multiply_value) # convert multiplier to float
textToSearch_mod = "'" + textToSearch # prepend apostrophe to the searched text
# print information line of what will be searched for
print ("This is what will be searched for, to identify right line: ", textToSearch_mod)
print("\r") #enter new line
# walk through all files with specified extension <-- CALLED FUNCTION !!!
for fname in fnc_walk_through_files(workPath, fileExt):
    print("\r") # enter new line
    # print filename of processed file
    print(" Filename processed:", fname )
    # and process every file and print out the numbers
    # to be multiplied, located at the 3rd and 4th position
    with open(fname, 'r') as f: # opens fname file for reading
        temp_file = open('tempfile','w') # open (create) tempfile for writing
        lines = f.readlines() # read lines from f:
        line_count = 0 # reset counter
        # loop through all lines
        for line in lines:
            # line counter increment
            line_count = line_count + 1
            # if line starts with defined string - it will be processed
            if line.startswith(textToSearch_mod):
                # line will be divided into parts delimited by ","
                split_line = line.split(',')
                # transfer 3rd part to variable 1 and make it float number
                old_num1 = float(split_line[3])
                # transfer 4th part to variable 2 and make it float number
                old_num2 = float(split_line[4])
                # multiply both variables
                new_num1 = old_num1 * multiply_value
                new_num2 = old_num2 * multiply_value
                # change old values to new multiplied values as strings
                # (three decimals, to match the format of the input data)
                split_line[3] = "{:.3f}".format(new_num1)
                split_line[4] = "{:.3f}".format(new_num2)
                # join the line back with the same delimiter "," as used for dividing
                line = ','.join(split_line)
                # print information line saying where the searched string occurred
                print ("Changed from old:", old_num1, old_num2, "to new:", new_num1, new_num2, "at line:", line_count)
                # write changed line with multiplied numbers to temporary file
                temp_file.write(line)
            else:
                # write all other unchanged lines to temporary file
                temp_file.write(line)
    # create new name for backup file with adding ".old" to the end of filename
    new_name = fname + '.old'
    # rename original file to new backup name
    os.rename(fname,new_name)
    # close temporary file to enable future operation (in this case rename)
    temp_file.close()
    # rename temporary file to original filename
    os.rename('tempfile',fname)
So, two days after asking, with the help of good people, some hard study of the language :-D (indentation was my nightmare), and a few snippets of code from this site, I have created something that works... :-) I hope it helps other people with a similar question...
At the beginning the idea was clear, but I had no knowledge of the language...
Now everything can be done; the only limit is what one can imagine :-)
I miss GOTO in Python :'( ... I love spaghetti, not spaghetti code, but sometimes it would be good to have some label<--goto jumps... (but this is not such a case...)
In my code, I want to insert words from a user into a text file. The text file contains placeholder words that must be replaced by the user's input. These are the strings that must be replaced in the file: adjective, plural_noun, noun.
file1 = open('Sample.txt', 'w')
*adjective*,*plural_noun*,*noun*,*verb*,*male_first_name* = [
    line.strip() for line in open('Sample.txt')]
for t in *adjective* :
    print(input("enter an adjective: ", file=file1))
    print(input("enter an plural noun: ", file=file1))
    print(input("enter an verb: ", file=file1))
file1.close()
A little something to get you started...
with open('Sample.txt', 'r') as file1:
    text = file1.read()
while text.find('*replace*') != -1:
    inp = input("enter some text to replace: ")
    text = text.replace('*replace*', inp, 1)
print(text)
If Sample.txt contains This is some text to *replace* and the user input is xyz, this code prints:
This is some text to xyz
Let's step through it bit by bit:
with open('Sample.txt', 'r') as file1: opens the file for reading ('r' means "for reading") and closes it again once the indented block is done.
text = file1.read() reads the content of the file and puts it in the variable text.
while text.find('*replace*') != -1: looks for occurrences of the string *replace* and continues with the indented commands as long as it finds one.
inp = input("enter some text to replace: "), which only runs if there is a remaining occurrence of *replace*, gets user input and puts it in the variable inp (in Python 2 you would use raw_input instead).
text = text.replace('*replace*', inp, 1), which also only runs if there is a remaining occurrence of *replace*, replaces the next occurrence of *replace* with the user input, overwriting the old text.
print(text), which runs once all occurrences of *replace* have been replaced with user input, prints out the new text.
This is not how you would write an efficient programme with lots of different *string* strings, but hopefully it will lead you in the right direction and walking before running is often a good idea.
There is excellent online Python documentation and you can also use the pydoc tool -- e.g. pydoc str.replace from the command line.
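Once you are comfortable with the loop above, one possible next step for handling many different *string* placeholders is re.sub with a replacement function. The sketch below is only an illustration: fill_template, the pattern, the template sentence, and the answers dictionary are all made up, and the dictionary stands in for the user so the example runs without typing anything.

```python
import re

# Matches placeholders such as *adjective* or *plural_noun*.
PLACEHOLDER = re.compile(r"\*(\w+)\*")


def fill_template(text, ask):
    """Replace every *name* placeholder with whatever ask(name) returns."""
    def replacement(match):
        return ask(match.group(1))     # group(1) is the name between the asterisks
    return PLACEHOLDER.sub(replacement, text)


# Hypothetical template and canned answers, used only for illustration.
template = "The *adjective* *noun* chased two *plural_noun*."
answers = {"adjective": "quick", "noun": "fox", "plural_noun": "hens"}
filled = fill_template(template, answers.__getitem__)
print(filled)   # The quick fox chased two hens.
```

To prompt the user instead, pass something like lambda name: input("enter a " + name + ": ") as the ask argument.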