So I'm working on a project to align a sequence ID and its code. I was given a barcode file, which contains a tag for a DNA sequence, e.g. TTAGG. There are several tags (ATTAC, ACCAT, etc.), which then get removed from the sequence file and paired with a seq ID.
Example:
sequence file --> SEQ 01 TTAGGAACCCAAA
barcode file --> TTAGG
The output file I want will have the barcode removed from the sequence, and the barcode is used to name a new FASTA-format file.
Example:
testfile.TTAGG which when opened should have
>SEQ01
AACCCAAA
There are several of these files. I want to take each of the files that I create and run them through mafft, but when I run my script, mafft is only applied to one file. The files I mentioned above come out OK, but when mafft runs, it only runs on the last file created.
Here's my script:
#!/usr/bin/python
import sys
import os

fname = sys.argv[1]
barcodefname = sys.argv[2]

barcodefile = open(barcodefname, "r")
for barcode in barcodefile:
    barcode = barcode.strip()
    outfname = "%s.%s" % (fname, barcode)
    outf = open(outfname, "w+")
    handle = open(fname, "r")
    mafftname = outfname + ".mafft"
    for line in handle:
        newline = line.split()
        seq = newline[0]
        brc = newline[1]
        potential_barcode = brc[:len(barcode)]
        if potential_barcode == barcode:
            outseq = brc[len(barcode):]
            barcodeseq = ">%s\n%s\n" % (seq, outseq)
            outf.write(barcodeseq)
    handle.close()
    outf.close()

cmd = "mafft %s > %s" % (outfname, mafftname)
os.system(cmd)
barcodefile.close()
I hope that was clear enough! Please help! I've tried changing my indentation and adjusting when I close the files. Most of the time it won't make the .mafft file at all; sometimes it does but doesn't put anything in it, but mostly it only works on the last file created.
Example:
the beginning of the code creates files such as -
testfile.ATTAC
testfile.AGGAC
testfile.TTAGG
then when it runs mafft it only creates
testfile.TTAGG.mafft (with the correct input)
I have tried closing the outf file and then opening it again, but then it tells me I'm coercing it.
I've changed the outf file to write-only, which doesn't change anything.
The reason mafft only aligns the last file is that its execution is outside the loop.
As your code stands, you create an input file name variable (outfname) in each iteration of the loop, but this variable is overwritten in the next iteration. Therefore, when your code eventually reaches the mafft execution command, outfname contains only the last file name from the loop.
To correct this, simply insert the mafft execution command inside the loop:
#!/usr/bin/python
import sys
import os

fname = sys.argv[1]
barcodefname = sys.argv[2]

barcodefile = open(barcodefname, "r")
for barcode in barcodefile:
    barcode = barcode.strip()
    outfname = "%s.%s" % (fname, barcode)
    outf = open(outfname, "w+")
    handle = open(fname, "r")
    mafftname = outfname + ".mafft"
    for line in handle:
        newline = line.split()
        seq = newline[0]
        brc = newline[1]
        potential_barcode = brc[:len(barcode)]
        if potential_barcode == barcode:
            outseq = brc[len(barcode):]
            barcodeseq = ">%s\n%s\n" % (seq, outseq)
            outf.write(barcodeseq)
    handle.close()
    outf.close()
    cmd = "mafft %s > %s" % (outfname, mafftname)
    os.system(cmd)
barcodefile.close()
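As a side note that is not required for the fix: if you'd rather not build a shell command string, the same step can be done with the subprocess module. This is only a sketch; the run_mafft helper name is mine, and it assumes, as your shell redirect already does, that mafft writes the alignment to standard output:

import subprocess

def run_mafft(infname, mafftname):
    # capture mafft's stdout into the .mafft file instead of using a shell redirect
    with open(mafftname, "w") as mafft_out:
        subprocess.call(["mafft", infname], stdout=mafft_out)

Inside the loop you would then call run_mafft(outfname, mafftname) in place of the os.system(cmd) line.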
I have some issues with my program. I have been trying to come up with a script which compares text files with a master text file and prints out the differences.
Basically, these are network configurations and we need to compare them in bulk to make sure all devices have standard configurations. For example, the script should read each file (file1, file2, etc.) line by line and compare it with the master file (master.txt).
I am able to compare one file at a time; however, when comparing two or more files I get an "index out of range" error.
I want to compare multiple files, probably in the hundreds, so I need to know how to fix this loop. I understand that this could be because the program is trying to read past the end of one of the files.
import difflib
import sys
hosts0 = open("C:\\Users\\p1329760\\Desktop\\Personal\\Python\\Projects\\sample\\master.txt","r")
hosts1 = open("C:\\Users\\p1329760\\Desktop\\Personal\\Python\\Projects\\sample\\file1.txt","r")
hosts2 = open("C:\\Users\\p1329760\\Desktop\\Personal\\Python\\Projects\\sample\\file2.txt","r")
lines1 = hosts0.readlines()
#print(lines11)
with open('output_compare.txt', 'w') as f:
    #global original_stdout
    for i, lines2 in enumerate(hosts1):
        if lines2 != lines1[i]:
            original_stdout = sys.stdout
            sys.stdout = f
            print("line ", i, " in hosts1 is different \n")
            print(lines2)
            sys.stdout = original_stdout
        else:
            pass

with open('output_compare1.txt', 'w') as file:
    for i, lines3 in enumerate(hosts2):
        if lines3 != lines1[i]:
            original_stdout = sys.stdout
            sys.stdout = file
            print("line ", i, " in hosts1 is different \n")
            print(lines3)
            sys.stdout = original_stdout
        else:
            pass
Hi, here is what you could do:
You can have a list of all the file names:
namefile = [....]
And a function which takes a file name:

def compare(filename):
    fileobj = open(filename)
    infile = fileobj.read().split()
    # only compare up to the shorter of the two files to avoid an index error
    for i in range(0, min(len(infile), len(masterin))):
        if infile[i] == masterin[i]:
            pass
        else:
            print(...)
After that you have to open the master file
master = open( "...")
masterin = master.read().split()
After that, a loop and you're done:

for name in namefile:
    compare(name)
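If the "index out of range" comes from one file having more lines than master.txt, it also helps to guard the comparison itself. A minimal sketch of that idea (the file names and report format are placeholders, adapt as needed):

def compare_to_master(master_lines, other_path, report_path):
    # compare line by line; extra lines in either file are reported
    # instead of raising an IndexError
    with open(other_path, "r") as other, open(report_path, "w") as report:
        other_lines = other.readlines()
        for i in range(max(len(master_lines), len(other_lines))):
            master_line = master_lines[i] if i < len(master_lines) else "<missing>\n"
            other_line = other_lines[i] if i < len(other_lines) else "<missing>\n"
            if master_line != other_line:
                report.write("line %d differs: %s" % (i, other_line))

with open("master.txt", "r") as m:
    master_lines = m.readlines()

for i, path in enumerate(["file1.txt", "file2.txt"], start=1):
    compare_to_master(master_lines, path, "output_compare%d.txt" % i)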
I have several hundred text files of books (file001.txt, file002.txt etc), I want to read the first 3,000 words from each file and save it as a new one (eg file001_first3k.txt, file002_first3k.txt).
I've seen terminal solutions for Mac and Linux (I have both), but they seem to be for displaying to the terminal window and for a set number of characters, not words.
Posting this in Python as it seems more likely to have a solution here than for the terminal, and I have some experience with Python.
Hopefully this will get you started; it assumes that it is OK to split on spaces in order to determine the number of words.
import os
import sys

def extract_first_3k_words(directory):
    original_file_suffix = ".txt"
    new_file_suffix = "_first3k.txt"
    filenames = [f for f in os.listdir(directory)
                 if f.endswith(original_file_suffix) and not f.endswith(new_file_suffix)]
    for filename in filenames:
        filepath = os.path.join(directory, filename)
        with open(filepath, "r") as original_file:
            # Get the first 3k words of the file
            num_words = 3000
            file_content = original_file.read()
            words = file_content.split(" ")
            first_3k_words = " ".join(words[:num_words])
        # Write the new file next to the original
        new_filename = filename.replace(original_file_suffix, new_file_suffix)
        with open(os.path.join(directory, new_filename), "w") as new_file:
            new_file.write(first_3k_words)
        print "Extracted 3k words from: %s to %s" % (filename, new_filename)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print "Usage: python file_splitter.py <target_directory>"
        exit()
    directory = sys.argv[1]
    extract_first_3k_words(directory)
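One caveat on counting words this way: split(" ") keeps empty strings for runs of spaces and does not break on newlines, while split() with no argument splits on any whitespace. A quick illustration (a standalone demo snippet, not part of the script above):

text = "one  two\nthree"
print text.split(" ")   # ['one', '', 'two\nthree'] - empty string, newline kept inside a "word"
print text.split()      # ['one', 'two', 'three']   - splits on any whitespace

Depending on how the book files are formatted, the word count from split(" ") may therefore be slightly off.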
Let's say that I have a file that contains different MAC address notations of multiple MAC addresses. I want to replace all the matching notations of one MAC address I have parsed from an argument input. So far my script generates all the notations I need and can loop through the lines of the text and show the lines that have to be changed.
import argparse, sys

parser = argparse.ArgumentParser()
parser.add_argument("-f", "--filename")
parser.add_argument("-m", "--mac_address")
args = parser.parse_args()

mac = args.mac_address              # In this case 00:1a:e8:31:71:7f
colon2 = mac                        # 00:1a:e8:31:71:7f
dot2 = colon2.replace(":", ".")     # 00.1a.e8.31.71.7f
hyphen2 = colon2.replace(":", "-")  # 00-1a-e8-31-71-7f
nosymbol = colon2.replace(":", "")  # 001ae831717f
colon4 = ':'.join(nosymbol[i:i+4] for i in range(0, len(nosymbol), 4))  # 001a:e831:717f
dot4 = colon4.replace(":", ".")     # 001a.e831.717f
hyphen4 = colon4.replace(":", "-")  # 001a-e831-717f

replacethis = [colon2, dot2, hyphen2, dot4, colon4, nosymbol, hyphen4]

with open(args.filename, 'r+') as f:
    text = f.read()
    for line in text.split('\n'):
        for n in replacethis:
            if line.replace(n, mac) != line:
                print line + '\n has to change to: \n' + line.replace(n, mac)
            else:
                continue
If the file would look like this:
fb:76:03:f0:67:01
fb.76.03.f0.67.01
fb-76-03-f0-67-01
001a:e831:727f
001ae831727f
fb76.03f0.6701
001ae831727f
fb76:03f0:6701
001a.e831.727f
fb76-03f0-6701
fb7603f06701
it should change to:
fb:76:03:f0:67:01
fb.76.03.f0.67.01
fb-76-03-f0-67-01
00:1a:e8:31:71:7f
00:1a:e8:31:71:7f
fb76.03f0.6701
00:1a:e8:31:71:7f
fb76:03f0:6701
00:1a:e8:31:71:7f
fb76-03f0-6701
fb7603f06701
I am struggling with writing the new lines containing the changed MAC address notation back to the file, replacing the previous lines.
Is there a way to do this?
A simple way to achieve what you are asking: add a list to store the final values you get, and after that include another 'with open' statement to write them to a new file.
replacethis = [colon2, dot2, hyphen2, dot4, colon4, nosymbol, hyphen4]
final_values = []

with open(args.filename, 'r+') as f:
    text = f.read()
    for line in text.split('\n'):
        new_line = line
        for n in replacethis:
            if line.replace(n, mac) != line:
                new_line = line.replace(n, mac)
                print line + '\n has to change to: \n' + new_line
        final_values.append(new_line)

with open(new_file_name, 'w') as new_f:
    new_f.write('\n'.join(final_values))
Note that if new_file_name is the same as your old file name, you will overwrite the original file.
I hope that answers your question.
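If you would rather rewrite the original file in place instead of writing a second file, a minimal sketch reusing the r+ handle (same mac and replacethis variables as above; an untested sketch, not a drop-in) could be:

with open(args.filename, 'r+') as f:
    text = f.read()
    new_lines = []
    for line in text.split('\n'):
        for n in replacethis:
            line = line.replace(n, mac)
        new_lines.append(line)
    f.seek(0)        # rewind to the start of the file
    f.write('\n'.join(new_lines))
    f.truncate()     # drop any leftover characters from the old content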
This script reads and rewrites all the individual HTML files in a directory. The script iterates over the matches, highlights them, and writes the output. The issue is that after highlighting the last instance of the search item, the script removes all the remaining content after that last instance in the output of each file. Any help here is appreciated.
import os
import sys
import re

source = raw_input("Enter the source files path:")
listfiles = os.listdir(source)

for f in listfiles:
    filepath = os.path.join(source + '\\' + f)
    infile = open(filepath, 'r+')
    source_content = infile.read()
    color = ('red')
    regex = re.compile(r"(\b in \b)|(\b be \b)|(\b by \b)|(\b user \b)|(\bmay\b)|(\bmight\b)|(\bwill\b)|(\b's\b)|(\bdon't\b)|(\bdoesn't\b)|(\bwon't\b)|(\bsupport\b)|(\bcan't\b)|(\bkill\b)|(\betc\b)|(\b NA \b)|(\bfollow\b)|(\bhang\b)|(\bbelow\b)", re.I)
    i = 0; output = ""
    for m in regex.finditer(source_content):
        output += "".join([source_content[i:m.start()],
                           "<strong><span style='color:%s'>" % color[0:],
                           source_content[m.start():m.end()],
                           "</span></strong>"])
        i = m.end()
    outfile = open(filepath, 'w')
    outfile.seek(0, 2)
    outfile.write(output)
    print "\nProcess Completed!\n"
    infile.close()
    outfile.close()

raw_input()
After your for loop is over, you need to include whatever is left after the last match:
...
        i = m.end()
    output += source_content[i:]  # Here's the end of your file
    outfile = open(filepath, 'w')
...
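As an alternative, the manual index bookkeeping can be dropped entirely by letting re.sub do the stitching, since it only rewrites the matches and leaves the rest of the text untouched. A sketch using the regex and color variables from the question (the highlight function name is mine):

def highlight(m):
    # wrap each matched word in the highlighting markup
    return "<strong><span style='color:%s'>%s</span></strong>" % (color, m.group(0))

output = regex.sub(highlight, source_content)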
I need to find every instance of "translate" in a text file and replace a value 4 lines after finding the text:
"(many lines)
}
}
translateX xtran
{
keys
{
k 0 0.5678
}
}
(many lines)"
The value 0.5678 needs to be 0. It will always be 4 lines below the "translate" string.
The file has up to about 10,000 lines.
example text file name: 01F.pz2.
I'd also like to cycle through the folder and repeat the process for every file with the pz2 extension (up to 40).
Any help would be appreciated!
Thanks.
I'm not quite sure about the logic for replacing 0.5678 in your file, so I use a function for that. Change it to whatever you need, or explain in more detail what you want: the last number in the line? Only floating-point numbers?
Try:
import os

dirname = "14432826"
lines_distance = 4

def replace_whatever(line):
    # Put your logic for replacing here
    return line.replace("0.5678", "0")

for filename in filter(lambda x: x.endswith(".pz2") and not x.startswith("m_"), os.listdir(dirname)):
    print filename
    with open(os.path.join(dirname, filename), "r") as f_in, \
         open(os.path.join(dirname, "m_%s" % filename), "w") as f_out:
        replace_tasks = []
        for line in f_in:
            # search marker in line
            if line.strip().startswith("translate"):
                print "Found marker in", line,
                replace_tasks.append(lines_distance)
            # replace if necessary
            if len(replace_tasks) > 0 and replace_tasks[0] == 0:
                del replace_tasks[0]
                print "line to change is", line,
                line_to_write = replace_whatever(line)
            else:
                line_to_write = line
            # Write to output
            f_out.write(line_to_write)
            # decrease counters
            for i, task in enumerate(replace_tasks):
                replace_tasks[i] -= 1
The comments within the code should help with understanding. The main concept is the list replace_tasks, which keeps track of how far away the next line to modify is: when a "translate" marker is found, 4 is appended, the counter is decremented once per following line, and the line where it reaches 0 is the one rewritten.
Remarks: your code sample suggests that the data in your file are structured. It will definitely be safer to read this structure and work on it instead of using a search-and-replace approach on a plain text file.
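For completeness, here is a sketch of what leaning on that structure might look like: a multi-line pattern that targets only the key value inside a translate* block. The pattern and the zero_translate_keys name are my own guesses based on the excerpt in the question, so treat this as an illustration rather than a tested solution:

import re

# match "translateX <name> { keys { k <index> <value>" across lines and
# capture everything up to the value, so only the value itself is replaced
pattern = re.compile(r"(translate\w+[^{]*\{\s*keys\s*\{\s*k\s+\d+\s+)[-\d.eE]+")

def zero_translate_keys(text):
    return pattern.sub(r"\g<1>0", text)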
Thorsten, I renamed my original files to have the .old extension and the following code works:
import os

target_dir = "."

# cycle through files
for path, dirs, files in os.walk(target_dir):
    # file is the file counter
    for file in files:
        # get the filename and extension
        filename, ext = os.path.splitext(file)
        # see if the file is a renamed original
        if ext.endswith('.old'):
            # build the old and new file names
            oldfilename = filename + ".old"
            newfilename = filename + ".pz2"
            old_filepath = os.path.join(path, oldfilename)
            new_filepath = os.path.join(path, newfilename)
            # open the old file for reading
            oldpz2 = open(old_filepath, "r")
            # open the new file for writing
            newpz2 = open(new_filepath, "w")
            # reset changeline
            changeline = 0
            currentline = 0
            # cycle through old lines
            for line in oldpz2:
                currentline = currentline + 1
                if line.strip().startswith("translate"):
                    changeline = currentline + 4
                if currentline == changeline:
                    print >>newpz2, " k 0 0"
                else:
                    # trailing comma: line already ends with a newline
                    print >>newpz2, line,
            # close both files so the output is flushed
            oldpz2.close()
            newpz2.close()