How can I check whether a given file is FASTA? - python

I am writing a program that requires a .fasta file as input at one of its early stages. Right now, I validate the input with this function:
def file_validation(fasta):
    while True:
        try:
            file_name = str(raw_input(fasta))
        except IOError:
            print("Please give the name of the fasta file that exists in the folder!")
            continue
        if not(file_name.endswith(".fasta")):
            print("Please give the name of the file with the .fasta extension!")
        else:
            break
    return file_name
Although this function works fine, there is still room for error: a user could supply a file whose name ends with .fasta but whose content is not FASTA. What could I do to prevent this and let the user know that the .fasta file is corrupted?

Why not just parse the file as if it were FASTA and see whether it breaks?
Using biopython, which silently fails by returning an empty generator on non-FASTA files:
from Bio import SeqIO
my_file = "example.csv"  # Obviously not FASTA
def is_fasta(filename):
    with open(filename, "r") as handle:
        fasta = SeqIO.parse(handle, "fasta")
        return any(fasta)  # False when `fasta` is empty, i.e. wasn't a FASTA file
is_fasta(my_file)
# False
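If installing Biopython is not an option, a rough structural check can be sketched in plain Python. This is not the approach from the answer above, only a minimal alternative under stated assumptions: the name looks_like_fasta and the ALLOWED character set are hypothetical, and the check only confirms that a '>' header comes first and that sequence lines contain letters, '*' or '-'.

```python
import string

# Assumption: sequence lines may contain only letters plus '*' and '-'.
# Tighten this set if you expect, say, nucleotide-only input.
ALLOWED = set(string.ascii_letters + "*-")

def looks_like_fasta(path):
    """Return True if the file has a '>' header first and at least one sequence line."""
    seen_header = False
    seen_sequence = False
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                seen_header = True
            elif not seen_header:
                return False  # data appeared before any header
            elif set(line) <= ALLOWED:
                seen_sequence = True
            else:
                return False  # unexpected characters in a sequence line
    return seen_header and seen_sequence
```

A check like this could run right after the extension test in file_validation, so the user is told about bad content before any later stage touches the file.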

Related

Try/Except not running properly when opening files

I am trying to open a file with this Try/Except block but it is going straight to Except and not opening the file.
I've tried opening multiple different files but they are going directly to not being able to open.
import string
fname = input('Enter a file name: ')
try:
    fhand = open(fname)
except:
    print('File cannot be opened:', fname)
    exit()
counts = dict()
L_N = 0
for line in fhand:
    line = line.rstrip()
    line = line.translate(line.maketrans(' ', ' ', string.punctuation))
    line = line.lower()
    words = line.split()
    L_N += 1
    for word in words:
        if word not in counts:
            counts[word] = [L_N]
        else:
            if L_N not in counts[word]:
                counts[word].append(L_N)
for h in range(len(counts)):
    print(counts)
out_file = open('word_index.txt', 'w')
out_file.write('Text file being analyzed is: ' + str(fname) + '\n\n')
out_file.close()
I would like the output to read a specific file and count the created dictionary
If you are using Python 2.7, make sure you type the filename in quotes ("myfile.txt"), because input() evaluates what you type; in Python 3, quotes are not required.
Make sure you either enter the absolute path to the file, or that the file exists in the directory you are running the Python program from.
For example, if your current working directory is ~/code/ and you enter 'myfile.txt', then 'myfile.txt' must exist in ~/code/.
However, it's best to provide the absolute path to your input file, such as
/home/user/myfile.txt
Then the file will be found no matter which directory you call your script from.
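To see why the open fails, it can help to print the path Python actually resolves and the underlying error instead of using a bare except. The following is a minimal sketch for Python 3; the helper name open_or_explain is hypothetical.

```python
import os

def open_or_explain(fname):
    """Open fname for reading; on failure, print the resolved path and the OS error."""
    # abspath resolves a relative name against the current working directory.
    full_path = os.path.abspath(fname)
    try:
        return open(full_path)
    except OSError as err:  # covers FileNotFoundError, PermissionError, etc.
        print("File cannot be opened:", full_path)
        print("Reason:", err)
        return None
```

If the printed path is not where the file lives, the problem is the working directory rather than the file itself.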

pickle.dump dumps nothing when appending to file

The user may give a bunch of URLs as command-line arguments. All URLs given in the past are serialized with pickle. The script checks all given URLs; if they are unique, they are serialized and appended to a file. At least that's what should be happening. Nothing is being appended. However, when I open the file in write mode, the new, unique URL is written. So what gives? The code is:
def get_new_urls():
    if(len(urls.URLs) != 0): # check if empty
        with open(urlFile, 'rb') as f:
            try:
                cereal = pickle.load(f)
                print(cereal)
                toDump = []
                for arg in urls.URLs:
                    if (arg in cereal):
                        print("Duplicate URL {0} given, ignoring it.".format(arg))
                    else:
                        toDump.append(arg)
            except Exception as e:
                print("Holy bleep something went wrong: {0}".format(e))
        return(toDump)
urlsToDump = get_new_urls()
print(urlsToDump)
# TODO: append new URLs
if(urlsToDump):
    with open(urlFile, 'ab') as f:
        pickle.dump(urlsToDump, f)
# TODO check HTML of each page against the serialized copy
with open(urlFile, 'rb') as f:
    try:
        cereal = pickle.load(f)
        print(cereal)
    except EOFError: # your URL file is empty, bruh
        pass
Pickle writes out the data you give it in a special format, e.g. it will write some header/metadata/etc, to the file you give it.
It is not intended to work this way; concatenating two pickle files doesn't really make sense. To achieve a concatenation of your data, you'd need to first read whatever is in the file into your urlsToDump, then update your urlsToDump with any new data, and then finally dump it out again (overwriting the whole file, not appending).
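The read, update, overwrite cycle described above might look like the sketch below. The function name add_urls and its parameters are hypothetical, and it assumes the file, when present and non-empty, holds a single pickled list.

```python
import os
import pickle

def add_urls(url_file, new_urls):
    """Merge new_urls into the pickled list in url_file and rewrite the whole file."""
    known = []
    # Assumption: an existing, non-empty file contains exactly one pickled list.
    if os.path.exists(url_file) and os.path.getsize(url_file) > 0:
        with open(url_file, "rb") as f:
            known = pickle.load(f)
    added = []
    for url in new_urls:
        if url not in known and url not in added:
            added.append(url)
    # Open in 'wb', not 'ab': the file always holds one complete pickle.
    with open(url_file, "wb") as f:
        pickle.dump(known + added, f)
    return added
```

Because the file is rewritten each time, a single pickle.load always returns every URL seen so far.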
After
with open(urlFile, 'rb') as f:
you need a while loop, to repeatedly unpickle (repeatedly read) from the file until hitting EOF.
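If you do keep appending with 'ab', each pickle.dump call adds a separate pickled object, and one pickle.load only returns the first. A minimal sketch of the read loop, with a hypothetical generator name load_all:

```python
import pickle

def load_all(path):
    """Yield every object that was pickled into path, in the order it was dumped."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return  # no more pickled objects in the file
```

Each yielded item is whatever was passed to one pickle.dump call, so in the question's script it would be one list of URLs per run.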

Generator Reading Log File While Name Changes Python

The goal is to read a log file in real time line by line (standard generator stuff) but the catch is, the file name changes at various intervals. The name change can't be helped (application dictated appended with a time string) and the name is changed when the log file size reaches ~2MB (guesstimate).
My approach was to create a file getter function that got the file (or new file) and then passed that to the generator. I thought that when the file changed names I would get a 'File not found' error, but what my test showed, is that the file name change is prevented entirely as 'another program is using this file'. The name change must be allowed, and this reader code cannot interfere with the application logging process at all.
import os
import time
import fnmatch
directory = '\\foo\\'
def fileGenerator(logFile):
    """ Run a line generator """
    logFile.seek(0,2)
    while True:
        line = logFile.readline()
        if not line:
            time.sleep(0.1)
            continue
        yield line
def fileGetter():
    """ Get the Logging File """
    matchedFiles = []
    for afile in os.listdir(directory):
        if fnmatch.fnmatch(afile,'amc_*.txt'):
            matchedFiles.append(afile)
    if len(matchedFiles)==1:
        #There was exactly one matching file found send it to the generator
        return os.path.join(directory,matchedFiles[0])
    else:
        #There either wasn't a file found or many matching
        #Error out and stop process... critical error
        raise RuntimeError("expected exactly one matching log file")
if __name__ == '__main__':
    filePath = fileGetter()
    try:
        logFile = open(filePath,"r")
    except Exception as e:
        #Catch the file not found and go back to the file path getter
        #Send the file back to the generator
        print(e)
    if logFile:
        loglines = fileGenerator(logFile)
        for line in loglines:
            #handle the line
            print(line, end='')
If you can't hold the file open while waiting for new content to be written to it, I suggest saving the file position you were last at and closing the file before you sleep, and then reopening the file and seeking to that point afterwards. You could also investigate filesystem notification systems if you care about spotting file additions or renames immediately.
import time
def log_reader():
    filename = "does_not_exist"
    filepos = 0
    while True:
        try:
            file = open(filename)
        except FileNotFoundError:
            filename = fileGetter()
            # if renamed files start empty, set filepos to zero here!
            continue
        file.seek(filepos)
        while True:
            line = file.readline()
            if not line:
                filepos = file.tell()
                file.close()
                time.sleep(0.1)  # you may want to test different sleep lengths to avoid FS thrash
                break
            yield line
The opening and closing of the file may stress out your filesystem if you do it too much, so I'd suggest sleeping longer than your previous code did (but you may want to test to see how well your OS handles it if you care about how responsive your log reader is).
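The core of the save-position-and-reopen idea can be isolated into a single polling step, which is easier to test than an endless generator. This is only a sketch: the name read_new_lines is hypothetical, it assumes the log is UTF-8 text, and it does not handle a line that the writer has only partially flushed.

```python
def read_new_lines(path, position=0):
    """Return (lines, new_position) for data appended to path since position."""
    # Binary mode keeps seek/tell offsets as plain byte counts.
    with open(path, "rb") as f:
        f.seek(position)
        raw_lines = f.readlines()
        new_position = f.tell()
    # Assumption: the log is UTF-8; undecodable bytes are replaced, not fatal.
    lines = [raw.decode("utf-8", errors="replace") for raw in raw_lines]
    return lines, new_position
```

The file is open only for the duration of one call, so a caller can sleep between calls without holding a handle on the log.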

Creating new text file in Python?

Is there a way to create a text file without opening it in "w" or "a" mode? For instance, if I try to open a file in "r" mode and the file does not exist, I want a new file to be created when I catch the IOError.
e.g.:
while flag == True:
    try:
        # opening src in a+ mode will allow me to read and append to file
        with open("Class {0} data.txt".format(classNo),"r") as src:
            # list containing all data from file, one line is one item in list
            data = src.readlines()
            for ind,line in enumerate(data):
                if surname.lower() and firstName.lower() in line.lower():
                    # overwrite the relevant item in data with the updated score
                    data[ind] = "{0} {1}\n".format(line.rstrip(),score)
                    rewrite = True
                else:
                    with open("Class {0} data.txt".format(classNo),"a") as src:
                        src.write("{0},{1} : {2}{3} ".format(surname, firstName, score,"\n"))
            if rewrite == True:
                # reopen src in write mode and overwrite all the records with the items in data
                with open("Class {} data.txt".format(classNo),"w") as src:
                    src.writelines(data)
        flag = False
    except IOError:
        print("New data file created")
        # Here I want a new file to be created and assigned to the variable src so when the
        # while loop iterates for the second time the file should successfully open
At the beginning, just check whether the file exists and create it if it doesn't:
import os
filename = "Class {0} data.txt".format(classNo)
if not os.path.isfile(filename):
    open(filename, 'w').close()
From this point on you can assume the file exists, which greatly simplifies your code.
No operating system will allow you to create a file without actually writing to it. You can encapsulate this in a library so that the creation is not visible, but it is impossible to avoid writing to the file system if you really want to modify the file system.
Here is a quick and dirty open replacement which does what you propose.
def open_for_reading_create_if_missing(filename):
    try:
        handle = open(filename, 'r')
    except IOError:
        with open(filename, 'w') as f:
            pass
        handle = open(filename, 'r')
    return handle
It would be better to create the file if it doesn't exist, e.g. something like:
import os
def ensure_file_exists(file_name):
    """ Make sure that a file with the given name exists """
    (the_dir, fname) = os.path.split(file_name)
    if the_dir and not os.path.exists(the_dir):
        os.makedirs(the_dir)  # This may give an exception if the directory cannot be made.
    if not os.path.exists(file_name):
        open(file_name, 'w').close()
You could even have a safe_open function that did something similar prior to opening for read and returning the file handle.
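One way such a safe_open could look is sketched below. The implementation is an assumption rather than code from the answer: it relies on append mode creating a missing file without truncating an existing one, and it requires Python 3 for the exist_ok argument.

```python
import os

def safe_open(file_name):
    """Create file_name (and its parent directory) if missing, then open it for reading."""
    the_dir = os.path.dirname(file_name)
    if the_dir:
        # exist_ok avoids an error when the directory is already there.
        os.makedirs(the_dir, exist_ok=True)
    # Append mode creates the file if needed and leaves existing content alone.
    open(file_name, "a").close()
    return open(file_name, "r")
```

The returned handle works with a with statement, for example with safe_open(path) as src, so the caller does not need to close it by hand.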
The sample code provided in the question is not very clear, especially because it uses several variables that are not defined anywhere. Based on it, though, here is my suggestion: you can create a function that works like touch followed by opening the file, but that is platform agnostic.
def touch_open( filename):
    try:
        connect = open( filename, "r")
    except IOError:
        connect = open( filename, "a")
        connect.close()
        connect = open( filename, "r")
    return connect
This function opens the file if it exists. If the file doesn't exist, it creates a blank file with that name and then opens it. A bonus compared with import os; os.system('touch test.txt') is that it does not spawn a child process in the shell, which makes it faster.
Since it doesn't use the with open(filename) as src syntax, you should either remember to close the connection at the end with connection = touch_open( filename); connection.close() or, preferably, open it in a for loop. Example:
file2open = "test.txt"
for i, row in enumerate( touch_open( file2open)):
    print(i, row, end="")  # print the line number and content
This option should be preferred to data = src.readlines() followed by enumerate( data), found in your code, because it avoids looping twice through the file.

Reading a file and displaying the sum of names within that file

I would like the final code to read a list of names from a text document named 'names.txt', count how many names the file contains, and display that count. The code I have so far was meant to display the sum of the numbers in a text file, but it is close enough to the program I need that I think I can rework it to count the names and display that count instead of the sum.
Here is the code so far:
def main():
    # Initialize an accumulator.
    total = 0.0
    try:
        # Open the file.
        myfile = open('names.txt', 'r')
        # Read and display the file's contents.
        for line in myfile:
            amount = float(line)
            total += amount
        # Close the file.
        myfile.close()
    except IOError:
        print('An error occurred trying to read the file.')
    except ValueError:
        print('Non-numeric data found in the file.')
    except:
        print('An error occurred.')
# Call the main function.
main()
I am still really new to Python programming, so please don't be too harsh on me. If anyone can figure out how to rework this to display the number of names instead of the sum of numbers, I would greatly appreciate it. If this program cannot be reworked, I would be happy to settle for a new solution.
Edit: This is an example of what 'names.txt' will look like:
john
mary
paul
ann
If you just want to count the lines in the file:
# Open the file.
myfile = open('names.txt', 'r')
# Count the lines in the file.
totalLines = len(myfile.readlines())
# Close the file.
myfile.close()
fh = open("file", "r")
print("%d lines" % len(fh.readlines()))
fh.close()
or you could do
fh = open("file", "r")
print("%d words" % len(fh.read().split()))
fh.close()
All this is readily available information that is not hard to find if you put forth some effort...just getting the answers usually results in flunked classes...
Assuming the names in your text file are separated by line breaks:
myfile = open('names.txt', 'r')
lstLines = myfile.read().split('\n')
dict((name,lstLines.count(name)) for name in lstLines)
This creates a dictionary mapping each name to its number of occurrences.
To count the occurrences of a particular name such as 'name1' in the list:
lstLines.count('name1')
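The same per-name tally can be built with collections.Counter from the standard library, which avoids calling count once per name. The sketch below is an alternative to the answer's code, not a copy of it; count_names is a hypothetical helper, and it skips blank lines, so a trailing newline does not produce an empty "name".

```python
from collections import Counter

def count_names(path):
    """Return a Counter mapping each name (one per line) to its number of occurrences."""
    with open(path) as f:
        # Strip whitespace and drop blank lines before counting.
        names = [line.strip() for line in f if line.strip()]
    return Counter(names)
```

With the result stored in counts, sum(counts.values()) is the total number of names and len(counts) is the number of distinct names.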
Assuming names are separated by whitespace:
def main():
    # Initialize an accumulator.
    total = 0
    try:
        # Open the file.
        myfile = open('names.txt', 'r')
        # Count the whitespace-separated names on each line.
        for line in myfile:
            words = line.split()
            total += len(words)
        # Close the file.
        myfile.close()
        # Display the number of names.
        print(total)
    except IOError:
        print('An error occurred trying to read the file.')
    except ValueError:
        print('Non-numeric data found in the file.')
    except:
        print('An error occurred.')
# Call the main function.
main()
Use a with statement to open the file. It closes the file properly even if an exception occurs. You can omit the file mode, since reading is the default.
If each name is on its own line and there are no duplicates:
with open('names.txt') as f:
    number_of_nonblank_lines = sum(1 for line in f if line.strip())
name_count = number_of_nonblank_lines
The task is very simple. Start with fresh code rather than carrying over code that is unused or invalid for this problem.
If all you need is to count lines in a file (like wc -l command) then you could use .count('\n') method:
#!/usr/bin/env python
import sys
from functools import partial
read_chunk = partial(sys.stdin.read, 1 << 15) # or any text file instead of stdin
print(sum(chunk.count('\n') for chunk in iter(read_chunk, '')))
See also, Why is reading lines from stdin much slower in C++ than Python?
