Counting reads and bases from a list of fastq files - python

I trimmed my Illumina short reads (forward and reverse) with Trimmomatic. Its outputs were paired_1/unpaired_1 and paired_2/unpaired_2 .fastq.gz files. I want to measure the impact of trimming by counting the number of reads and bases in each file in my directory. I wrote a script to do this, but I have problems in the if __name__=='__main__' block: the for loop does not go through the files in the order I see them on screen, so how can I make it process them in that order? I also need help correcting the script, because I get no stdout.
Thank you in advance for your help.
#!/usr/bin/env python
from sys import argv
import os

def get_num_bases(file_content):
    total = []
    for linenumber, line in enumerate(file_content):
        mod = linenumber % 4
        if mod == 0:
            ID = line.strip()[1:]
            #print(ID)
        if mod == 1:
            seq = line.strip()
            counting = 0
            counting += seq.count("T") + seq.count("A") + seq.count("C") + seq.count("G")
            total.append(counting)
    allbases = sum(total)
    print("Number of bases are: ", allbases)

def get_num_reads(file_content):
    total_1 = []
    for line in file_content:
        num_reads = 0
        num_reads += content.count(line)
        total_1.append(num_reads)
    print("Number of reads are: ", sum(total_1)/int(4))

if __name__=='__main__':
    path = os.getcwd()
    dir_files = os.listdir(path)
    list_files = []
    for file in dir_files:
        if file.endswith("fastq.gz"):
            if file not in list_files:
                file_content = open(file, "r").readlines()
                list_files.append(file)
                print("This is the filename: ", file, get_num_bases(file_content), get_num_reads(file_content))
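A minimal sketch of one possible fix (Python 3), for reference: the .fastq.gz files are opened with gzip.open instead of open, the counting functions return their totals instead of printing (note that get_num_reads above also refers to an undefined name, content), and sorted() makes the processing order predictable. This sketch assumes standard four-line FASTQ records and, like the original, counts only A/C/G/T bases.
#!/usr/bin/env python
# Sketch only: count reads and bases in every gzipped FASTQ file in the
# current directory, in sorted filename order.
import gzip
import os

def count_bases_and_reads(filename):
    # Return (num_reads, num_bases) for one gzipped FASTQ file.
    num_reads = 0
    num_bases = 0
    with gzip.open(filename, "rt") as handle:  # "rt" reads the gzipped data as text
        for linenumber, line in enumerate(handle):
            if linenumber % 4 == 1:  # the sequence line of each 4-line record
                seq = line.strip()
                num_reads += 1
                num_bases += sum(seq.count(base) for base in "ACGT")
    return num_reads, num_bases

if __name__ == '__main__':
    # sorted() gives a predictable order; os.listdir() alone does not.
    for filename in sorted(os.listdir(os.getcwd())):
        if filename.endswith("fastq.gz"):
            reads, bases = count_bases_and_reads(filename)
            print("This is the filename:", filename, "reads:", reads, "bases:", bases)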

Related

Read user inputted .fasta file and parse using Biopython?

I am trying to create a python script where the user can type in their FASTA file and that file will then be parsed using Biopython. I am struggling to get this to work. The script I have thus far is this:
#!/usr/bin/python3
file_name = input("Insert full file name including the fasta extension: ")
with open(file_name, "r") as inf:
    seq = inf.read()

from Bio.SeqIO.FastaIO import SimpleFastaParser

count = 0
total_len = 0
with open(inf) as in_file:
    for title, seq in SimpleFastaParser(in_file):
        count += 1
        total_len += len(seq)
print("%i records with total sequence length %i" % (count, total_len))
I would like the user to be prompted to type in their file name (including the extension), and that file should then be parsed with Biopython so that the output above is printed. I also want to be able to send the print output to a log file. Any help would be appreciated.
The purpose of the script is to take a FASTA file, parse it, and trim primers. I know there is an easy way to do this entirely with Biopython, but per my instructions Biopython may only be used for parsing, not trimming. Any insight into this would be appreciated as well.
Firstly, you have two places where you open the fasta file:
one where you store the contents in seq,
and then one where you call open(inf), but inf is the file handle from the first open (already read by that point), not a file name.
You may want to include a check to make sure a valid file path was given.
Also, this is a good case for using argparse:
#!/usr/bin/python3
import argparse
from Bio import SeqIO
import os
import sys

def main(infile):
    # check that the file exists
    if not os.path.isfile(infile):
        print("file not found")
        sys.exit()
    count = 0
    total_len = 0
    for seq_record in SeqIO.parse(open(infile), "fasta"):
        count += 1
        total_len += len(seq_record.seq)
    print("%i records with total sequence length %i" % (count, total_len))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='some script to do something with fasta files',
                                     formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('-in', '--infile', type=str, default=None, required=True,
                        help='The path and file name of the fasta file')
    args = parser.parse_args()
    infile = args.infile
    main(infile)
If you need to use input:
#!/usr/bin/python3
from Bio import SeqIO
import os
import sys

infile = input("Insert full file name including the fasta extension: ")
# remove any white space
infile = infile.strip()
# check that the file exists
if not os.path.isfile(infile):
    print("file not found")
    sys.exit()
count = 0
total_len = 0
for seq_record in SeqIO.parse(open(infile), "fasta"):
    count += 1
    total_len += len(seq_record.seq)
print("%i records with total sequence length %i" % (count, total_len))
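The question also asks about sending the print output to a log file, which the code above does not cover. One possible approach (a sketch only; the file name parse_fasta.log is just an example, and count and total_len are assumed to be the variables computed above) is to use the standard logging module alongside the print:
import logging

# Append the same summary line to a log file as well as printing it.
logging.basicConfig(filename="parse_fasta.log",
                    level=logging.INFO,
                    format="%(asctime)s %(message)s")

summary = "%i records with total sequence length %i" % (count, total_len)
print(summary)         # still shown on screen
logging.info(summary)  # also written to parse_fasta.log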

Read set amount of words from multiple text files and save as new files

I have several hundred text files of books (file001.txt, file002.txt etc). I want to read the first 3,000 words from each file and save them as a new file (e.g. file001_first3k.txt, file002_first3k.txt).
I've seen terminal solutions for Mac and Linux (I have both), but they seem to be for displaying to the terminal window and for a set number of characters, not words.
I'm posting this as a Python question since it seems more likely to have a solution here than in the terminal, and I have some experience with Python.
Hopefully this will get you started; it assumes it is OK to split on spaces in order to determine the number of words.
import os
import sys

def extract_first_3k_words(directory):
    original_file_suffix = ".txt"
    new_file_suffix = "_first3k.txt"
    filenames = [f for f in os.listdir(directory)
                 if f.endswith(original_file_suffix) and not f.endswith(new_file_suffix)]
    for filename in filenames:
        with open(os.path.join(directory, filename), "r") as original_file:
            # Get the first 3k words of the file
            num_words = 3000
            file_content = original_file.read()
            words = file_content.split(" ")
            first_3k_words = " ".join(words[:num_words])
        # Write the new file
        new_filename = filename.replace(original_file_suffix, new_file_suffix)
        with open(os.path.join(directory, new_filename), "w") as new_file:
            new_file.write(first_3k_words)
        print "Extracted 3k words from: %s to %s" % (filename, new_filename)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print "Usage: python file_splitter.py <target_directory>"
        exit()
    directory = sys.argv[1]
    extract_first_3k_words(directory)
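One small follow-up on the splitting assumption: book text usually contains newlines and tabs as well as spaces, so splitting on any whitespace may count words more accurately. A possible variant of the relevant two lines (note that re-joining with single spaces flattens the original line breaks):
# Split on any run of whitespace (spaces, tabs, newlines) instead of single spaces.
words = file_content.split()
first_3k_words = " ".join(words[:num_words])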

Python select a file from a list

I have a folder that contains several log files that I will parse with Python.
I would like to show the list of files contained in the folder, like:
[1] FileName1.log
[2] FileName2.log
The user can then choose the right file by typing its list number.
For instance, to parse the file "FileName2.log" the user presses 2.
In my script I can show the list of files, but I don't know how to pick a file from the list by its index.
This is my script:
import os
import sys

items = os.listdir("D:/Logs")

fileList = []
for names in items:
    if names.endswith(".log"):
        fileList.append(names)

cnt = 0
for fileName in fileList:
    sys.stdout.write( "[%d] %s\n\r" %(cnt, fileName) )
    cnt = cnt + 1

fileName = raw_input("\n\rSelect log file [0 -" + str(cnt) + " ]: ")
Thanks for the help!
import os
import sys

items = os.listdir("D:/Logs")
fileList = [name for name in items if name.endswith(".log")]

for cnt, fileName in enumerate(fileList, 1):
    sys.stdout.write("[%d] %s\n\r" % (cnt, fileName))

choice = int(input("Select log file [1-%s]: " % cnt))
print(fileList[choice - 1])  # the menu is 1-based, the list is 0-based
Your own version of the code with a few modifications; hope this solves your purpose.
If you have the names in an array like this:
fileList = ['FileName1.log','FileName2.log']
you can pull them out by using their index (remember that arrays are 0-indexed), so fileList[0] would be 'FileName1.log'.
When you ask the user to input a number (e.g. 0, 1, 2), you then use that number to get the file you want, like this:
fileToRead=fileList[userInput]
If you asked for 1, 2, 3 you would need to use userInput-1 to make sure it is correctly 0-indexed.
Then you open the file you now have:
f=open(fileToRead, 'r')
You can read more about open here.
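Putting those pieces together, a short consolidated sketch (Python 3, with a hypothetical bounds check that the fragments above do not include) could look like this:
# Sketch: show a 1-based menu, read the user's choice, and open the chosen file.
fileList = ['FileName1.log', 'FileName2.log']

for i, name in enumerate(fileList, 1):
    print("[%d] %s" % (i, name))

userInput = int(input("Select log file [1-%d]: " % len(fileList)))
if 1 <= userInput <= len(fileList):
    fileToRead = fileList[userInput - 1]  # subtract 1 because the menu is 1-based
    with open(fileToRead, 'r') as f:
        contents = f.read()
else:
    print("Invalid selection")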
If fileList is a list of files, and fileName holds the user's input converted to an int, you can reference the file the user chose by using the following:
fileList[fileName]
import glob
import os

dirpath = r"D:\Logs"  # the directory that contains the log files
prefix = "FileName"

fpaths = glob.glob(os.path.join(dirpath, "{}*.log".format(prefix)))  # get all the log files
# sort the log files by the number that follows the prefix in the file name
fpaths.sort(key=lambda fpath: int(os.path.basename(fpath).split('.', 1)[0][len(prefix):]))

print("Select a file to view:")
for i, fpath in enumerate(fpaths, 1):
    print("[{}]: {}".format(i, os.path.basename(fpath)))

choice = int(input("Enter a selection number: "))  # assuming valid inputs
choice -= 1  # correcting for python's 0-indexing
print("You have chosen", os.path.basename(fpaths[choice]))
Just add something like this at the end:
sys.stdout.write(fileList[int(fileName)])
Indexing in python as in many other languages starts from 0. Try this:
import os
import sys

items = os.listdir("D:/Logs")

fileList = []
for names in items:
    if names.endswith(".log"):
        fileList.append(names)

cnt = 0
for fileName in fileList:
    sys.stdout.write( "[%d] %s\n\r" %(cnt, fileName) )
    cnt = cnt + 1

fileName = int(raw_input("\n\rSelect log file [0 - " + str(cnt - 1) + "]: "))
print(fileList[fileName])
You need to cast the input from raw_input() to int, and then you can use the obtained number as an index into your list: 0 is the first file, 1 is the second, etc.

Using python script to search in multiple files and outputting an individual file for each one

I am trying to get a program up and running that takes all of the astronomical data files with the .fits extension in a folder, searches each one for specific header information, and writes that information to a text file corresponding to each input file. I am using a while loop, and please forgive me if this code is badly formatted, it is my first time using Python! My main problem is that I can only get the program to read one file before it closes itself.
#!/usr/bin/env python
#This code properly imports all '.fits' files in a specified directory and
#outputs them into a .txt format that allows several headers and their contained
#data to be read.
import copy
import sys
import pyfits
import string
import glob
import os.path
import fnmatch
import numpy as np

DIR = raw_input("Please input a valid directory : ") #-----> This prompts for input from the user to find the '.fits' files
os.chdir(DIR)

initialcheck = 0 #Initiates the global counter for the number of '.fits' files in the specified directory
targetcheck = 0  #Initiates the global counter for the number of files that have been processed

def checkinitial(TD):
    #This counts the number of '.fits' files in your directory
    for files in glob.iglob('*.fits'):
        check = len(glob.glob1(TD, "*.fits"))
        global initialcheck
        initialcheck = check
    if initialcheck == 0:
        print 'There are no .FITS files in this directory! Try Again...'
        sys.exit()
    return initialcheck

def sorter(TD, targcheck, inicheck):
    #This function will call the two counters and compare them until the number of processed
    #files is greater than the number of files in the directory, thereby finishing the loop
    global initialcheck
    inicheck = initialcheck
    global targetcheck
    targcheck = targetcheck
    while targcheck <= inicheck:
        os.walk(TD)
        for allfiles in glob.iglob('*.fits'):
            print allfiles #This prints out the filenames the program is currently processing
            with pyfits.open(allfiles) as HDU:
                #This block outlines all of the search terms in their respective headers. You will need to set the
                #indices below to search in the correct header for the specified term you are looking for; however,
                #no alterations to the header definitions should be made.
                HDU_HD_0 = HDU[0].header
                HDU_HD_1 = HDU[1].header
                #HDU_HD_2 = HDU[2].header -----> Not usually needed, can be activated if data from this header is required
                #HDU_HD_3 = HDU[3].header -----> Use this if the '.fits' file contains a third header (unlikely but possible)
                KeplerIDIndex = HDU_HD_0.index('KEPLERID')
                ChannelIndex = HDU_HD_0.index('SKYGROUP')
                TTYPE1Index = HDU_HD_1.index('TTYPE1')
                TTYPE8Index = HDU_HD_1.index('TTYPE8')
                TTYPE9Index = HDU_HD_1.index('TTYPE9')
                TTYPE11Index = HDU_HD_1.index('TTYPE11')
                TTYPE12Index = HDU_HD_1.index('TTYPE12')
                TTYPE13Index = HDU_HD_1.index('TTYPE13')
                TTYPE14Index = HDU_HD_1.index('TTYPE14')
                TUNIT1Index = HDU_HD_1.index('TUNIT1')
                TUNIT8Index = HDU_HD_1.index('TUNIT8')
                TUNIT9Index = HDU_HD_1.index('TUNIT9')
                TUNIT11Index = HDU_HD_1.index('TUNIT11')
                TUNIT12Index = HDU_HD_1.index('TUNIT12')
                TUNIT13Index = HDU_HD_1.index('TUNIT13')
                TUNIT14Index = HDU_HD_1.index('TUNIT14')
                #The variables below look up the data found at the specified indices above, allowing the data
                #to be found in the numpy array that '.fits' files use
                File_Data_KID = list( HDU_HD_0[i] for i in [KeplerIDIndex])
                File_Data_CHAN = list( HDU_HD_0[i] for i in [ChannelIndex])
                Astro_Data_1 = list( HDU_HD_1[i] for i in [TTYPE1Index])
                Astro_Data_8 = list( HDU_HD_1[i] for i in [TTYPE8Index])
                Astro_Data_9 = list( HDU_HD_1[i] for i in [TTYPE9Index])
                Astro_Data_11 = list( HDU_HD_1[i] for i in [TTYPE11Index])
                Astro_Data_12 = list( HDU_HD_1[i] for i in [TTYPE12Index])
                Astro_Data_13 = list( HDU_HD_1[i] for i in [TTYPE13Index])
                Astro_Data_14 = list( HDU_HD_1[i] for i in [TTYPE14Index])
                Astro_Data_Unit_1 = list( HDU_HD_1[i] for i in [TUNIT1Index])
                Astro_Data_Unit_8 = list( HDU_HD_1[i] for i in [TUNIT8Index])
                Astro_Data_Unit_9 = list( HDU_HD_1[i] for i in [TUNIT9Index])
                Astro_Data_Unit_11 = list( HDU_HD_1[i] for i in [TUNIT11Index])
                Astro_Data_Unit_12 = list( HDU_HD_1[i] for i in [TUNIT12Index])
                Astro_Data_Unit_13 = list( HDU_HD_1[i] for i in [TUNIT13Index])
                Astro_Data_Unit_14 = list( HDU_HD_1[i] for i in [TUNIT14Index])
                HDU.close()
            with open('Processed ' + allfiles + ".txt", "w") as copy:
                targetcheck += 1
                Title1_Format = '{0}-----{1}'.format('Kepler I.D.','Channel')
                Title2_Format = '-{0}--------{1}------------{2}------------{3}------------{4}------------{5}-------------{6}-'.format('TTYPE1','TTYPE8','TTYPE9','TTYPE11','TTYPE12','TTYPE13','TTYPE14')
                File_Format = '{0}--------{1}'.format(File_Data_KID, File_Data_CHAN)
                Astro_Format = '{0}---{1}---{2}---{3}---{4}---{5}---{6}'.format(Astro_Data_1, Astro_Data_8, Astro_Data_9, Astro_Data_11, Astro_Data_12, Astro_Data_13, Astro_Data_14)
                Astro_Format_Units = '{0} {1} {2} {3} {4} {5} {6}'.format(Astro_Data_Unit_1, Astro_Data_Unit_8, Astro_Data_Unit_9, Astro_Data_Unit_11, Astro_Data_Unit_12, Astro_Data_Unit_13, Astro_Data_Unit_14)
                copy.writelines("%s\n" % Title1_Format)
                copy.writelines("%s\n" % File_Format)
                copy.writelines('\n')
                copy.writelines("%s\n" % Title2_Format)
                copy.writelines("%s\n" % Astro_Format)
                copy.writelines('\n')
                copy.writelines("%s\n" % Astro_Format_Units)
                Results = copy
            return Results

checkinitial(DIR)
sorter(DIR, targetcheck, initialcheck)
I think you keep getting confused between a single file and a list of files. Try something like this:
def checkinitial(TD):
    #This counts the number of '.fits' files in your directory
    check = len(glob.glob1(TD,"*.fits"))
    if not check:
        print 'There are no .FITS files in this directory! Try Again...'
        sys.exit()
    return check

def sorter(TD, targcheck, inicheck):
    """This function will call the two counters and compare them until the number of processed
    files is greater than the files in the directory, thereby finishing the loop
    """
    for in_file in glob.iglob(os.path.join(TD,'*.fits')):
        print in_file # This prints out the filenames the program is currently processing
        with pyfits.open(in_file) as HDU:
            # <Process input file HDU here>
            out_file_name = 'Processed_' + os.path.basename(in_file) + ".txt"
            with open(os.path.join(TD, out_file_name), "w") as copy:
                # <Write stuff to your output file copy here>
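For illustration, here is one way the two placeholder comments might be filled in, as a sketch only. It reuses the imports from the question (glob, os.path, pyfits) and only header keywords that appear there (KEPLERID, SKYGROUP, TTYPE1); pyfits headers can be read directly by keyword, so the .index() calls from the original are not needed.
def sorter(TD, targcheck, inicheck):
    # Sketch: pull a few header values by keyword and write them to a per-file text output.
    for in_file in glob.iglob(os.path.join(TD, '*.fits')):
        with pyfits.open(in_file) as HDU:
            primary_header = HDU[0].header
            table_header = HDU[1].header
            kepler_id = primary_header['KEPLERID']
            channel = primary_header['SKYGROUP']
            ttype1 = table_header['TTYPE1']
            out_file_name = 'Processed_' + os.path.basename(in_file) + ".txt"
            with open(os.path.join(TD, out_file_name), "w") as copy:
                copy.write("Kepler I.D.-----Channel\n")
                copy.write("%s--------%s\n\n" % (kepler_id, channel))
                copy.write("TTYPE1: %s\n" % ttype1)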

How to use Python to find a string in a line and change the text n lines after the string

I need to find every instance of "translate" in a text file and replace a value 4 lines after finding the text:
"(many lines)
}
}
translateX xtran
{
keys
{
k 0 0.5678
}
}
(many lines)"
The value 0.5678 needs to become 0. It will always be 4 lines below the "translate" string.
The file has up to about 10,000 lines.
example text file name: 01F.pz2.
I'd also like to cycle through the folder and repeat the process for every file with the pz2 extension (up to 40).
Any help would be appreciated!
Thanks.
I'm not quite sure about the logic for replacing 0.5678 in your file, so I use a function for that; change it to whatever you need, or explain in more detail what you want. The last number in the line? Only a floating-point number?
Try:
import os

dirname = "14432826"
lines_distance = 4

def replace_whatever(line):
    # Put your logic for replacing here
    return line.replace("0.5678", "0")

for filename in filter(lambda x: x.endswith(".pz2") and not x.startswith("m_"), os.listdir(dirname)):
    print filename
    with open(os.path.join(dirname, filename), "r") as f_in, open(os.path.join(dirname, "m_%s" % filename), "w") as f_out:
        replace_tasks = []
        for line in f_in:
            # search marker in line
            if line.strip().startswith("translate"):
                print "Found marker in", line,
                replace_tasks.append(lines_distance)
            # replace if necessary
            if len(replace_tasks) > 0 and replace_tasks[0] == 0:
                del replace_tasks[0]
                print "line to change is", line,
                line_to_write = replace_whatever(line)
            else:
                line_to_write = line
            # Write to output
            f_out.write(line_to_write)
            # decrease counters
            for i, task in enumerate(replace_tasks):
                replace_tasks[i] -= 1
The comments within the code should help with understanding. The main concept is the list replace_tasks, which keeps track of how many lines remain until the next line to modify.
Remarks: Your code sample suggests that the data in your file are structured. It would definitely be safer to read that structure and work with it rather than using a search-and-replace approach on a plain text file.
Thorsten, I renamed my original files to have the .old extension and the following code works:
import os

target_dir = "."

# cycle through files
for path, dirs, files in os.walk(target_dir):
    # file is the file counter
    for file in files:
        # get the filename and extension
        filename, ext = os.path.splitext(file)
        # see if the file is a pz2
        if ext.endswith('.old'):
            # rename the file to "old"
            oldfilename = filename + ".old"
            newfilename = filename + ".pz2"
            old_filepath = os.path.join(path, oldfilename)
            new_filepath = os.path.join(path, newfilename)
            # open the old file for reading
            oldpz2 = open(old_filepath, "r")
            # open the new file for writing
            newpz2 = open(new_filepath, "w")
            # reset changeline
            changeline = 0
            currentline = 0
            # cycle through old lines
            for line in oldpz2:
                currentline = currentline + 1
                if line.strip().startswith("translate"):
                    changeline = currentline + 4
                if currentline == changeline:
                    print >>newpz2, " k 0 0"
                else:
                    print >>newpz2, line
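One small caveat about the working version above: print >>newpz2, line adds its own newline to a line that already ends with one, so the output gains blank lines, and the two file handles are never explicitly closed. A possible tweak for the inner part of the loop, using write() and with-blocks (adjust the leading whitespace in the replacement string to match your file):
with open(old_filepath, "r") as oldpz2, open(new_filepath, "w") as newpz2:
    changeline = 0
    currentline = 0
    for line in oldpz2:
        currentline = currentline + 1
        if line.strip().startswith("translate"):
            changeline = currentline + 4
        if currentline == changeline:
            newpz2.write(" k 0 0\n")  # write() does not add an extra newline
        else:
            newpz2.write(line)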
