Read user inputted .fasta file and parse using Biopython? - python

I am trying to create a python script where the user can type in their FASTA file and that file will then be parsed using Biopython. I am struggling to get this to work. The script I have thus far is this:
#!/usr/bin/python3
file_name = input("Insert full file name including the fasta extension: ")
with open(file_name, "r") as inf:
    seq = inf.read()

from Bio.SeqIO.FastaIO import SimpleFastaParser

count = 0
total_len = 0
with open(inf) as in_file:
    for title, seq in SimpleFastaParser(in_file):
        count += 1
        total_len += len(seq)
print("%i records with total sequence length %i" % (count, total_len))
I would like the user to be prompted to type in their file name and its extension, and that file should then be parsed with Biopython so that the output above is printed. I also want to be able to send the print output to a log file. Any help would be appreciated.
The purpose of the script is to take a fasta file, parse it, and trim primers. I know there is an easy way to do this entirely with Biopython, but as per instruction Biopython can only be used to parse, not trim. Any insight into this would be appreciated as well.

Firstly, you have two places where you open the fasta file:
One where you store the contents in seq.
Then you call open() on inf, which at that point is a (closed) file handle from the first with block, not a file name.
You may also want to include a check to make sure a valid file path was given.
Also, this is a good case for using argparse:
#!/usr/bin/python3
import argparse
import os
import sys

from Bio import SeqIO


def main(infile):
    # check that the file exists
    if not os.path.isfile(infile):
        print("file not found")
        sys.exit()
    count = 0
    total_len = 0
    for seq_record in SeqIO.parse(infile, "fasta"):
        count += 1
        total_len += len(seq_record.seq)
    print("%i records with total sequence length %i" % (count, total_len))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description='some script to do something with fasta files',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('-in', '--infile', type=str, default=None, required=True,
                        help='The path and file name of the fasta file')
    args = parser.parse_args()
    infile = args.infile
    main(infile)
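Run from the command line, that version would be invoked something like this (the script and fasta file names here are only placeholders):
python3 fasta_stats.py --infile example.fasta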
If you need to use input:
#!/usr/bin/python3
import os
import sys

from Bio import SeqIO

infile = input("Insert full file name including the fasta extension: ")
# remove any white space
infile = infile.strip()
# check that the file exists
if not os.path.isfile(infile):
    print("file not found")
    sys.exit()
count = 0
total_len = 0
for seq_record in SeqIO.parse(infile, "fasta"):
    count += 1
    total_len += len(seq_record.seq)
print("%i records with total sequence length %i" % (count, total_len))

Related

Counting reads and bases from a list of fastq files

I trimmed my Illumina short reads (forward and reverse) with Trimmomatic. Trimmomatic's outputs were paired_1/unpaired_1 and paired_2/unpaired_2 .fastq.gz files. I want to know how big the impact of trimming was by counting the number of reads and bases in each file in my directory. I have made a script to count the number of bases and reads for each file in my directory; however, I have problems in the if __name__=='__main__' block. When I do the for loop I don't know the order in which the files will be processed; how can I make it handle the files in the order I see on the screen? Additionally, I also need help with correcting the script, as I don't get any stdout.
Thank you in advance for your help.
#!/usr/bin/env python
from sys import argv
import os

def get_num_bases(file_content):
    total = []
    for linenumber, line in enumerate(file_content):
        mod = linenumber % 4
        if mod == 0:
            ID = line.strip()[1:]
            #print(ID)
        if mod == 1:
            seq = line.strip()
            counting = 0
            counting += seq.count("T") + seq.count("A") + seq.count("C") + seq.count("G")
            total.append(counting)
    allbases = sum(total)
    print("Number of bases are: ", allbases)

def get_num_reads(file_content):
    total_1 = []
    for line in file_content:
        num_reads = 0
        num_reads += content.count(line)
        total_1.append(num_reads)
    print("Number of reads are: ", sum(total_1)/int(4))

if __name__=='__main__':
    path = os.getcwd()
    dir_files = os.listdir(path)
    list_files = []
    for file in dir_files:
        if file.endswith("fastq.gz"):
            if file not in list_files:
                file_content = open(file, "r").readlines()
                list_files.append(file)
                print("This is the filename: ", file, get_num_bases(file_content), get_num_reads(file_content))

Running python mapreduce for multiple files

I am trying to implement a Python map/reduce over multiple files in a directory, so that it takes a folder and a string as arguments and lists each file with the frequency of that string within that file. The output should look like this:
Filename   Output
--------   ------
x.txt      8
y.txt      12
I have tried to implement it, but when I run it with the command below:
cat /home/habil/Downloads/hadoop_test/*.txt | python mapper.py "AA" | python reducer.py
it gives me "AA 479", which is the combined frequency across all 5 files.
This is my mapper.py
#!/usr/bin/env python
import sys
import textwrap
from os import listdir
from os.path import isfile, join

#Argument of the path
#folderPath = sys.argv[2]
#onlyfiles = [f for f in listdir(sys.argv[2]) if isfile(join(sys.argv[2], f))]

# Get the string sequence from the user
searchWord = sys.argv[1]
# Length of the word
searchWordLength = len(sys.argv[1])

# helper Function
def locations_of_substring(string, substring):
    """Return a list of locations of a substring."""
    substring_length = len(substring)
    def recurse(locations_found, start):
        location = string.find(substring, start)
        if location != -1:
            return recurse(locations_found + [location], location + substring_length)
        else:
            return locations_found
    return recurse([], 0)

#--- get all lines from stdin ---
for line in sys.stdin:
    #--- remove leading and trailing whitespace---
    line = line.strip()
    temp = locations_of_substring(line, searchWord)
    if len(temp) != 0:
        for count in temp:
            print '%s\t%s' % (line[count:count+searchWordLength], "1")
And below is my reducer:
#!/usr/bin/env python
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        continue
    try:
        word2count[word] = word2count[word] + count
    except:
        word2count[word] = count

# write the tuples to stdout
# Note: they are unsorted
for word in word2count.keys():
    print '%s\t%s' % (word, word2count[word])
How can I get the desired result, so that the pipeline runs once for each file in the directory and prints separate results? Any help or hint is appreciated. Thanks in advance.
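Since the shell wildcard in the command above feeds all five files into a single mapper/reducer run, one option is a small driver that pipes each file through the same pair of scripts separately and prints one line per file. A rough sketch (the driver itself and the exact output format are only illustrative; it assumes mapper.py and reducer.py sit in the current directory):
#!/usr/bin/env python
import glob
import os
import subprocess
import sys

word = sys.argv[1]    # the search string, e.g. "AA"
folder = sys.argv[2]  # the directory containing the .txt files

for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
    with open(path) as fh:
        # equivalent of: cat file | python mapper.py WORD | python reducer.py
        mapper = subprocess.Popen(["python", "mapper.py", word],
                                  stdin=fh, stdout=subprocess.PIPE)
        reducer = subprocess.Popen(["python", "reducer.py"],
                                   stdin=mapper.stdout, stdout=subprocess.PIPE)
        mapper.stdout.close()
        output, _ = reducer.communicate()
    # the reducer prints e.g. "AA\t8" for this one file
    print("%s\t%s" % (os.path.basename(path), output.decode().strip()))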

Read set amount of words from multiple text files and save as new files

I have several hundred text files of books (file001.txt, file002.txt, etc.), and I want to read the first 3,000 words from each file and save them as a new file (e.g. file001_first3k.txt, file002_first3k.txt).
I've seen terminal solutions for Mac and Linux (I have both), but they seem to be for displaying to the terminal window and for a set number of characters, not words.
I'm posting this under Python as it seems more likely to have a solution here than in the terminal, and I have some experience with Python.
Hopefully this will get you started; it assumes that it is OK to split on spaces in order to determine the number of words.
import os
import sys

def extract_first_3k_words(directory):
    original_file_suffix = ".txt"
    new_file_suffix = "_first3k.txt"
    filenames = [f for f in os.listdir(directory)
                 if f.endswith(original_file_suffix) and not f.endswith(new_file_suffix)]
    for filename in filenames:
        filepath = os.path.join(directory, filename)
        with open(filepath, "r") as original_file:
            # Get the first 3k words of the file
            num_words = 3000
            file_content = original_file.read()
            words = file_content.split(" ")
            first_3k_words = " ".join(words[:num_words])
        # Write the new file next to the original
        new_filename = filename.replace(original_file_suffix, new_file_suffix)
        new_filepath = os.path.join(directory, new_filename)
        with open(new_filepath, "w") as new_file:
            new_file.write(first_3k_words)
        print("Extracted 3k words from: %s to %s" % (filename, new_filename))

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python file_splitter.py <target_directory>")
        exit()
    directory = sys.argv[1]
    extract_first_3k_words(directory)

Python select a file from a list

I have a folder that contains several log files that I will parse with Python.
I would like to show the list of files contained in the folder, like:
[1] FileName1.log
[2] FileName2.log
The user can then choose the right file by typing its number from the list.
For instance, to parse the file "FileName2.log" the user presses 2.
In my script I can show the list of files, but I don't know how to pick a file from the list by its index.
This is my script
import os
import sys

items = os.listdir("D:/Logs")

fileList = []
for names in items:
    if names.endswith(".log"):
        fileList.append(names)

cnt = 0
for fileName in fileList:
    sys.stdout.write( "[%d] %s\n\r" %(cnt, fileName) )
    cnt = cnt + 1

fileName = raw_input("\n\rSelect log file [0 -" + str(cnt) + " ]: ")
Thanks for the help!
import os
import sys

items = os.listdir("D:/Logs")
fileList = [name for name in items if name.endswith(".log")]

for cnt, fileName in enumerate(fileList, 1):
    sys.stdout.write("[%d] %s\n\r" % (cnt, fileName))

choice = int(input("Select log file [1-%s]: " % cnt))
print(fileList[choice - 1])  # the menu is 1-based, the list is 0-based
Your own version of the code with a few modifications; hope this solves your purpose.
If you have the names in an array like this:
fileList = ['FileName1.log','FileName2.log']
you can pull them out by using their index (remember that arrays are 0-indexed), so fileList[0] would be 'FileName1.log'.
When you ask the user to input a number (e.g. 0, 1, 2) you would then use that number to get the file you want, like this:
fileToRead=fileList[userInput]
If you asked for 1, 2, 3 instead, you would need to use userInput-1 to make sure it is correctly 0-indexed.
Then you open the file you now have:
f=open(fileToRead, 'r')
You can read more about open here.
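Putting those pieces together, a minimal sketch of that approach (variable names are just illustrative, and it assumes Python 2's raw_input as in the question):
fileList = ['FileName1.log', 'FileName2.log']

# show a 1-based menu to the user
for i, name in enumerate(fileList, 1):
    print "[%d] %s" % (i, name)

userInput = int(raw_input("Select log file: "))  # e.g. the user types 2
fileToRead = fileList[userInput - 1]             # shift back to 0-based indexing

with open(fileToRead, 'r') as f:
    contents = f.read()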
If fileList is a list of files and fileName is the user's input converted to an int, you can reference the file the user chose with the following:
fileList[fileName]
import glob
import os

dirpath = r"D:\Logs"  # the directory that contains the log files
prefix = "FileName"

fpaths = glob.glob(os.path.join(dirpath, "{}*.log".format(prefix)))  # get all the log files
# sort the log files by the number after the prefix (basename strips the directory part)
fpaths.sort(key=lambda fpath: int(os.path.basename(fpath).split('.', 1)[0][len(prefix):]))

print("Select a file to view:")
for i, fpath in enumerate(fpaths, 1):
    print("[{}]: {}".format(i, os.path.basename(fpath)))

choice = int(input("Enter a selection number: "))  # assuming valid inputs
choice -= 1  # correcting for python's 0-indexing
print("You have chosen", os.path.basename(fpaths[choice]))
Just add something like this at the end...
sys.stdout.write(fileList[int(fileName)])
Indexing in Python, as in many other languages, starts from 0. Try this:
import os
import sys

items = os.listdir("D:/Logs")

fileList = []
for names in items:
    if names.endswith(".log"):
        fileList.append(names)

cnt = 0
for fileName in fileList:
    sys.stdout.write( "[%d] %s\n\r" %(cnt, fileName) )
    cnt = cnt + 1

fileName = int(raw_input("\n\rSelect log file [0 - " + str(cnt - 1) + "]: "))
print(fileList[fileName])
You need to cast the input from raw_input() to int. Then you can use the obtained number as an index into your list: 0 is the first file, 1 is the second file, etc.

Mafft only creating one file with Python

So I'm working on a project to align a sequence ID with its sequence. I was given a barcode file, which contains a tag for a DNA sequence, e.g. TTAGG. There are several tags (ATTAC, ACCAT, etc.) which then get removed from the sequence file and placed with a seq ID.
Example:
sequence file --> SEQ 01 TTAGGAACCCAAA
barcode file --> TTAGG
the output file I want will remove the barcode and use it to create a new fasta format file.
Example:
testfile.TTAGG which when opened should have
>SEQ01
AACCCAAA
There are several of these files. I want to take each of the files I create and run them through mafft, but when I run my script, mafft only processes one file. The files I mentioned above come out OK, but when mafft runs, it only runs on the last file created.
Here's my script:
#!/usr/bin/python
import sys
import os

fname = sys.argv[1]
barcodefname = sys.argv[2]

barcodefile = open(barcodefname, "r")
for barcode in barcodefile:
    barcode = barcode.strip()
    outfname = "%s.%s" % (fname, barcode)
    outf = open(outfname, "w+")
    handle = open(fname, "r")
    mafftname = outfname + ".mafft"
    for line in handle:
        newline = line.split()
        seq = newline[0]
        brc = newline[1]
        potential_barcode = brc[:len(barcode)]
        if potential_barcode == barcode:
            outseq = brc[len(barcode):]
            barcodeseq = ">%s\n%s\n" % (seq, outseq)
            outf.write(barcodeseq)
    handle.close()
    outf.close()

cmd = "mafft %s > %s" % (outfname, mafftname)
os.system(cmd)

barcodefile.close()
I hope that was clear enough! Please help! I've tried changing my indentation and adjusting when I close the file. Most of the time it won't make the .mafft file at all; sometimes it does but doesn't put anything in it, but mostly it only works on the last file created.
Example:
the beginning of the code creates files such as -
testfile.ATTAC
testfile.AGGAC
testfile.TTAGG
then when it runs mafft it only creates
testfile.TTAGG.mafft (with the correct input)
I have tried closing the outf file and then opening it again, at which point it tells me I'm coercing it.
I've changed the outf file to write-only; that doesn't change anything.
The reason why mafft only aligns the last file is that its execution is outside the loop.
As your code stands, you create an input file name variable (outfname) in each iteration of the loop, but this variable is always overwritten in the next iteration. Therefore, when your code eventually reaches the mafft execution command, the outfname variable will contain the last file name of the loop.
To correct this, simply insert the mafft execution command inside the loop:
#!/usr/bin/python
import sys
import os

fname = sys.argv[1]
barcodefname = sys.argv[2]

barcodefile = open(barcodefname, "r")
for barcode in barcodefile:
    barcode = barcode.strip()
    outfname = "%s.%s" % (fname, barcode)
    outf = open(outfname, "w+")
    handle = open(fname, "r")
    mafftname = outfname + ".mafft"
    for line in handle:
        newline = line.split()
        seq = newline[0]
        brc = newline[1]
        potential_barcode = brc[:len(barcode)]
        if potential_barcode == barcode:
            outseq = brc[len(barcode):]
            barcodeseq = ">%s\n%s\n" % (seq, outseq)
            outf.write(barcodeseq)
    handle.close()
    outf.close()
    cmd = "mafft %s > %s" % (outfname, mafftname)
    os.system(cmd)

barcodefile.close()
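As a side note, the shell redirection in os.system could be swapped for subprocess if quoting ever becomes a concern (for example, if fname contains spaces); a small sketch using the same outfname and mafftname variables from inside the loop:
import subprocess

with open(mafftname, "w") as mafft_out:
    # mafft writes the alignment to standard output, so capture it into the .mafft file
    subprocess.call(["mafft", outfname], stdout=mafft_out)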
