Looping over filenames in Python

I have a few hundred big files (big in terms of line count).
I am trying to write a loop that first reads a big file in the folder,
then creates a folder with the same name as the file it is reading,
and finally slices the file into pieces inside that newly created folder.
The loop should iterate over all the big files present in the folder.
My code is as follows:
import glob
import os

os.chdir("/test code/")
lines_per_file = 106
sf = None
for file in glob.glob("*.TAB"):
    with open(file) as bigfile:
        for lineno, line in enumerate(bigfile):
            if lineno % lines_per_file == 0:
                if sf:
                    sf.close()
                    sf_filename = '/test code/201511_sst/sf_{}.txt'.format(lineno + lines_per_file)
                    sf = open(sf_filename, "w")
                    sf.write(line)
if sf:
    sf.close()
I am getting the following output:
In [35]: runfile('/test code/file_loop_16Jan.py', wdir='/test code')
In [36]:
I need a little guidance with looping over the files so that I can achieve this. I think the absence of an error means I am missing something!
Can anyone help me out?

sf is set to None at the start, so you never enter the "if sf:" block: no output file is ever written anywhere.
Besides, when you close the file you have to set sf to None again, or you'll get an "operation on closed file" error when it is closed a second time.
But that still won't do what you want. You want to split the file, so do this:
if lineno % lines_per_file == 0:
    # new file: close the previous file, if any
    if sf:
        sf.close()
    # open a new file
    sf_filename = '/test code/201511_sst/sf_{}.txt'.format(lineno + lines_per_file)
    sf = open(sf_filename, "w")
# write the line through the current handle
sf.write(line)
The first if is encountered at the start: good. Since sf is None, close() isn't called (for the best).
The code then opens a file with the new filename.
Now the line is written to the new file handle (you have to write one line on every iteration, not only when the modulo matches).
On subsequent iterations, when the modulo matches, the previous file is closed and a new handle is created with a new filename.
Don't forget to close the last file handle when exiting the loop:
if sf:
    sf.close()
I haven't tested it, but the logic is there. Comment if you run into further issues and I'll edit my post.
Aside: another problem is that if there is more than one big *.TAB file, the split files will be overwritten, because lineno is reset for each input file. To avoid that, I would include the input file's basename in the output filename, for instance:
sf_filename = '/test code/201511_sst/{}_sf_{}.txt'.format(os.path.splitext(os.path.basename(file))[0], lineno + lines_per_file)
You could also store the last lineno of the previous file and compute a line offset instead. It's up to you.
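Putting both fixes together (write on every line, and include the input basename in the output name), a minimal sketch might look like the following. The 201511_sst output folder comes from the question; the tiny sample file and the 2-line chunk size are just for demonstration:

```python
import glob
import os

# demo setup: a small sample input and the output folder from the question
os.makedirs("201511_sst", exist_ok=True)
with open("sample.TAB", "w") as f:
    f.write("".join("row {}\n".format(i) for i in range(5)))

lines_per_file = 2  # the question uses 106

for name in glob.glob("*.TAB"):
    base = os.path.splitext(os.path.basename(name))[0]
    sf = None
    with open(name) as bigfile:
        for lineno, line in enumerate(bigfile):
            if lineno % lines_per_file == 0:
                if sf:
                    sf.close()
                sf_filename = os.path.join(
                    "201511_sst", "{}_sf_{}.txt".format(base, lineno + lines_per_file))
                sf = open(sf_filename, "w")
            sf.write(line)  # write every line, not only when the modulo matches
    if sf:
        sf.close()
```

This produces 201511_sst/sample_sf_2.txt, sample_sf_4.txt, and sample_sf_6.txt for the 5-line demo input.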

Since you're already using a with statement for reading files, you can use one for writing as well; that way you don't need to close the file object explicitly. See these links:
https://docs.python.org/2/reference/compound_stmts.html#with
https://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects
You can simply do this:
with open(sf_filename, "w") as sf:
    # read/write file content and do your stuff here
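As a sketch of how that could apply to the splitting task above (the helper name split_file and the part{} filename pattern are illustrative, not from the original post):

```python
from itertools import islice

def split_file(path, lines_per_file, out_pattern):
    """Split path into fixed-size chunks, using a with block for every output file."""
    with open(path) as bigfile:
        chunk_no = 0
        while True:
            chunk = list(islice(bigfile, lines_per_file))
            if not chunk:
                break
            chunk_no += 1
            # the with block closes each output file automatically
            with open(out_pattern.format(chunk_no), "w") as sf:
                sf.writelines(chunk)

# tiny demo: a 5-line input split into 2-line chunks
with open("in.txt", "w") as f:
    f.write("a\nb\nc\nd\ne\n")
split_file("in.txt", 2, "part{}.txt")
```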

Related

How to modify and overwrite large files?

I want to make several modifications to some lines in the file and overwrite the file. I do not want to create a new file with the changes, and since the file is large (hundreds of MB), I don't want to read it all at once in memory.
datfile = 'C:/some_path/text.txt'

with open(datfile) as file:
    for line in file:
        if line.split()[0] == 'TABLE':
            # if this is true, I want to change the second word of the line
            # something like: line.split()[1] = 'new'
Please note that an important part of the problem is that the file is big. There are several solutions on this site that address similar problems but do not account for the size of the files.
Is there a way to do this in python?
You can't replace a portion of a file's contents without rewriting the remainder of the file; this is not a Python limitation. Each byte of a file lives at a fixed location on a disk or in flash memory. If you want to insert text that is shorter or longer than the text it replaces, you will need to move the remainder of the file. If your replacement is longer than the original text, you will probably want to write a new file to avoid overwriting your data.
Given how file I/O works, and the operations you are already performing on the file, making a new file will not be as big of a problem as you think. You are already reading in the entire file line-by-line and parsing the content. Doing a buffered write of the replacement data will not be all that expensive.
from tempfile import NamedTemporaryFile
from os import remove, rename
from os.path import dirname

datfile = 'C:/some_path/text.txt'

try:
    with open(datfile) as file, NamedTemporaryFile(mode='wt', dir=dirname(datfile), delete=False) as output:
        tname = output.name
        for line in file:
            if line.startswith('TABLE'):
                ls = line.split()
                ls[1] = 'new'
                line = ' '.join(ls) + '\n'
            output.write(line)
except:
    remove(tname)
    raise  # clean up the temp file, then let the error propagate
else:
    rename(tname, datfile)
Passing dir=dirname(datfile) to NamedTemporaryFile should guarantee that the final rename does not have to copy the file from one disk to another in most cases. Using delete=False allows you to do the rename if the operation succeeds. The temporary file is deleted by name if any problem occurs, and renamed to the original file otherwise.

Using a for loop to add a new line to a table: python

I am trying to create a .bed file after searching through DNA sequences for two regular expressions. Ideally, I'd like to generate a tab-separated file which contains the sequence description, the start location of the first regex and the end location of the second regex. I know that the regex section works, it's just creating the \t separated file I am struggling with.
I was hoping that I could open/create a file and simply print a new line for each iteration of the for loop that contains this information, like so:
with open("Mimp_hits.bed", "a+") as file_object:
    for line in file_object:
        print(f'{sequence.description}\t{h.start()}\t{h_rc.end()}')
    file_object.close()
But this doesn't seem to work (it creates an empty file). I have also tried using file_object.write, but again that creates an empty file.
This is all of the code I have including searching for the regexes:
import re, sys
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

infile = sys.argv[1]

for sequence in SeqIO.parse(infile, "fasta"):
    hit = re.finditer(r"CAGTGGG..GCAA[TA]AA", str(sequence.seq))
    mimp_length = 400
    for h in hit:
        h_start = h.start()
        hit_rc = re.finditer(r"TT[TA]TTGC..CCCACTG", str(sequence.seq))
        for h_rc in hit_rc:
            h_rc_end = h_rc.end()
            length = h_rc_end - h_start
            if length > 0:
                if length < mimp_length:
                    with open("Mimp_hits.bed", "a+") as file_object:
                        for line in file_object:
                            print(sequence.description, h.start(), h_rc.end())
                        file_object.close()
This is the desired output:
Focub_II5_mimp_1__contig_1.16(656599:656809) 2 208
Focub_II5_mimp_2__contig_1.47(41315:41540) 2 223
Focub_II5_mimp_3__contig_1.65(13656:13882) 2 224
Focub_II5_mimp_4__contig_1.70(61591:61809) 2 216
This is example input:
>Focub_II5_mimp_1__contig_1.16(656599:656809)
TACAGTGGGATGCAAAAAGTATTCGCAGGTGTGTAGAGAGATTTGTTGCTCGGAAGCTAGTTAGGTGTAGCTTGTCAGGTTCTCAGTACCCTATATTACACCGAGATCAGCGGGATAATCTAGTCTCGAGTACATAAGCTAAGTTAAGCTACTAACTAGCGCAGCTGACACAACTTACACACCTGCAAATACTTTTTGCATCCCACTGTA
>Focub_II5_mimp_2__contig_1.47(41315:41540)
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTCTGCCGCTAGCCCATTTTAACAGCTAGAGTGTGTATATTAACCTCACACATAGCTATCTCTTATACTAATTGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTGTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA
>Focub_II5_mimp_3__contig_1.65(13656:13882)
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTTCTGCCGCTAGCCTATTTTAATAGTTAGAGTGTGCATATTAACCTCACACATAGCTATCTTATATACTAATCGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTCTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA
>Focub_II5_mimp_4__contig_1.70(61591:61809)
TACAGTGGGATGCAATAAGTTTGAATGCAGGCTGAAGTACCAGCTGTTGTAATCTAGCTCCTGTATACAACGCTTTAGCTTGATAAAGTAAGCGCTAAGCTGTATCAGGCAAAAGGCTATCCCGATTGGGGTATTGCTACGTAGGGAACTGGTCTTACCTTGGTTAGTCAGTGAATGTGTACTTGAGTTTGGATTCAAACTTATTGCATCCCACTGTA
Is anybody able to help?
Thank you :)
To write a line to a file you would do something like this:
with open("file.txt", "a") as f:
    print("new line", file=f)
If you want it tab-separated you can also pass sep="\t"; this is why Python 3 made print a function, so that you can use the sep, end, file, and flush keyword arguments. :)
Opening a file for appending means the file pointer starts at the end of the file, so writing to it doesn't override any data (it gets appended to the end), and iterating over it (or otherwise reading from it) yields nothing, as if you had already reached the end of the file.
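A quick way to see both behaviours (demo.txt is just a throwaway name for this sketch):

```python
# create a file with known contents
with open("demo.txt", "w") as f:
    f.write("hello\n")

with open("demo.txt", "a+") as f:
    first_read = f.read()   # position starts at the end of the file, so this is empty
    f.seek(0)               # rewind before reading
    second_read = f.read()  # now we get the whole file
```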
So instead of iterating over the lines of the file you would just write the single line to it:
with open("Mimp_hits.bed", "a") as file_object:
    print(sequence.description, h.start(), h_rc.end(), file=file_object)
You could also consider opening the file near the beginning of the loop, since opening it once and writing multiple times is more efficient than opening it many times; the with block also closes the file automatically, so there is no need to do that explicitly.
You are trying to open the file in "a+" mode, and loop over lines from it (which will not find anything because the file is positioned at the end when you do that). In any case, if this is an output file only, then you would open it in "a" mode to append to it.
Probably you just want to open the file once for appending, and inside the with statement, do your main loop, using file_object.write(...) when you want to actually append strings to the file. Note that there is no need for file_object.close() when using this with construct.
with open("Mimp_hits.bed", "a") as file_object:
    for sequence in SeqIO.parse(infile, "fasta"):
        # ... etc. per original code ...
        if length < mimp_length:
            file_object.write("{}\t{}\t{}\n".format(
                sequence.description, h.start(), h_rc.end()))

Python unable to read lines from text file after read()

I have a problem whereby I am trying to first check a text file for the existence of a known string, and based on this, loop over the file and insert a different line.
For some reason, after calling file.read() to check for the test string, the for loop appears not to work. I have tried calling file.seek(0) to get back to the start of the file, but this has not helped.
My current code is as follows:
try:
    f_old = open(text_file)
    f_new = open(text_file + '.new', 'w')
except:
    print 'Unable to open text file!'
    logger.info('Unable to open text file, exiting')
    sys.exit()

wroteOut = False
# first check if file contains a test string
if '<dir>' in f_old.read():
    #f_old.seek(0) # <-- do we need to do this??
    for line in f_old:  # loop thru file
        print line
        if '<test string>' in line:
            line = ' <found the test string!>'
        if '<test string2>' in line:
            line = ' <found test string2!>'
        f_new.write(line)  # write out the line
        wroteOut = True  # set flag so we know it worked

f_new.close()
f_old.close()
You already know the answer:
#f_old.seek(0) # <-- do we need to do this??
Yes, you need to seek back to the start of the file before you can read the contents again.
All file operations work with the current file position. Using file.read() reads all of the file, leaving the current position set to the end of the file. If you wanted to re-read data from the start of the file, a file.seek(0) call is required. The alternatives are to:
Not read the file again: you just read all of the data, so use that information instead. File operations are slow; using the same data from memory is much, much faster:
contents = f_old.read()
if '<dir>' in contents:
    for line in contents.splitlines():
        # ....
Re-open the file. Opening a file in read mode puts the current file position back at the start.
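A small sketch of the first alternative: read once, then reuse the in-memory copy (cfg.txt is a throwaway demo file standing in for the question's text_file):

```python
# demo setup: a file containing the marker the question looks for
with open("cfg.txt", "w") as f:
    f.write("<dir>\nline two\n")

with open("cfg.txt") as f_old:
    contents = f_old.read()   # the position is now at end-of-file
    empty = f_old.read()      # so a second read returns nothing
    lines = []
    if '<dir>' in contents:
        # iterate over the in-memory copy instead of re-reading the file
        lines = contents.splitlines()
```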

Remove line from a text file after read

I have a text file named 1.txt which contains the following:
123456
011111
02222
03333
and I have created a Python script which copies the first line into a file named number.txt in x folders, then copies the second line to the next x folders:
import math

progs = int(raw_input("Folders Number : "))
with open('1.txt', 'r') as f:
    progs2 = f.read().splitlines()
progs3 = int(raw_input("Copy time for each line : "))
for i in xrange(progs):
    splis = int(math.ceil(float(progs) / len(progs2)))
    with open("{0}/number.txt".format(pathname), 'w') as fi:
        fi.write(progs2[i / splis])
I want to edit the code so that it removes each line after copying it to the specified number of folders; for example, once the code has copied the number 123456, I want that line deleted from the file, so that when I run the program again it continues from the second number.
Any ideas about the code?
I'd like to write this as a comment, but I don't have the necessary reputation to do that, so I'll write an answer instead, building on Darren Ringer's answer.
After reading the line, you could close the file and open it again, overwriting it with the old content except for the line you want to remove. That has already been described in this answer:
Deleting a specific line in a file (python)
Another option would be to use in-place filtering, using the same filename for your output, which replaces your old file with the filtered content. This is essentially the same; you just don't have to open and close the file again. This has also already been answered by 1_CR in the following question, and can also be found at https://docs.python.org/ (optional in-place filtering section):
Deleting a line from a text file
Adapted to your case it would look something like this:
import fileinput
import sys, os

os.chdir('/Path/to/your/file')

for line_number, line in enumerate(fileinput.input('1.txt', inplace=1)):
    if line_number == 0:
        pass  # do something with the line; not writing it back removes it
    else:
        sys.stdout.write(line)  # write the remaining lines back to your file
Cheers
You could load all the lines into a list with readlines(), then, when you get a line to work with, simply remove it from the list and write the list back to the file. You will be overwriting the entire file every time (not just removing the data in place), but there is no way I am aware of to do it otherwise while ensuring the file contents stay up to date.
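A minimal sketch of that read-remove-rewrite cycle, using the 1.txt contents from the question:

```python
# demo setup: the file from the question
with open("1.txt", "w") as f:
    f.write("123456\n011111\n02222\n03333\n")

with open("1.txt") as f:
    lines = f.readlines()
current = lines.pop(0)      # the line just copied to the folders
with open("1.txt", "w") as f:
    f.writelines(lines)     # rewrite the file without that line
```

Running the program again would then start from 011111.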

How do I modify the last line of a file?

The last line of my file is:
29-dez,40,
How can I modify that line so that it reads:
29-Dez,40,90,100,50
Note: I don't want to write a new line. I want to take the same line and put new values after 29-Dez,40,
I'm new at python. I'm having a lot of trouble manipulating files and for me every example I look at seems difficult.
Unless the file is huge, you'll probably find it easier to read the entire file into a data structure (which might just be a list of lines), and then modify the data structure in memory, and finally write it back to the file.
On the other hand, maybe your file really is huge, multiple GBs at least. In that case: the last line is probably terminated with a newline character, so if you seek to that position you can overwrite it with the new text for the end of the last line.
So perhaps:
import os

f = open("foo.file", "r+b")  # "r+b" keeps the existing data; "wb" would truncate the file
f.seek(-len(os.linesep), os.SEEK_END)
f.write("new text at end of last line" + os.linesep)
f.close()
(Modulo line endings on different platforms)
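For example, here is a Python 3 sketch that assumes Unix-style "\n" line endings (hence seeking back a single byte) and a throwaway demo file:

```python
import os

# demo file whose last line matches the question
with open("foo.file", "w") as f:
    f.write("header\n29-dez,40,\n")

# "r+b" keeps the existing bytes; "wb" would truncate the file first
with open("foo.file", "r+b") as f:
    f.seek(-1, os.SEEK_END)   # step back over the trailing "\n"
    f.write(b"90,100,50\n")   # overwrite it with the new tail
```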
To expand on what Doug said, in order to read the file contents into a data structure you can use the readlines() method of the file object.
The below code sample reads the file into a list of "lines", edits the last line, then writes it back out to the file:
#!/usr/bin/python

MYFILE = "file.txt"

# read the file into a list of lines
lines = open(MYFILE, 'r').readlines()

# now edit the last line of the list of lines
new_last_line = lines[-1].rstrip() + ",90,100,50"
lines[-1] = new_last_line

# now write the modified list back out to the file
open(MYFILE, 'w').writelines(lines)
If the file is very large then this approach will not work well, because this reads all the file lines into memory each time and writes them back out to the file, which is very inefficient. For a small file however this will work fine.
Rather than working with files directly, you could build a data structure that fits your needs (in the form of a class), with methods to read from and write to the file.
I recently wrote a script to do something very similar to this. It would traverse a project, find all module dependencies and add any missing import statements. I won't clutter this post up with the entire script, but I'll show how I went about modifying my files.
import os
from mmap import mmap

def insert_import(filename, text):
    if len(text) < 1:
        return
    f = open(filename, 'r+')
    m = mmap(f.fileno(), os.path.getsize(filename))
    origSize = m.size()
    m.resize(origSize + len(text))
    pos = 0
    while True:
        l = m.readline()
        if l.startswith(('import', 'from')):
            continue
        else:
            pos = m.tell() - len(l)
            break
    # shift the remainder of the file down, then drop the new text into the gap
    m[pos + len(text):] = m[pos:origSize]
    m[pos:pos + len(text)] = text
    m.close()
    f.close()
Summary: This snippet takes a filename and a blob of text to insert. It finds the last import statement already present, and sticks the text in at that location.
The part I suggest paying most attention to is the use of mmap. It lets you work with files in the same manner you may work with a string. Very handy.
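For instance, a tiny sketch of mmap's string-like interface (mm.txt is a throwaway demo file; a same-length slice assignment avoids the resize dance the script above needs for insertions):

```python
import mmap

# demo file to map
with open("mm.txt", "w") as f:
    f.write("hello world\n")

with open("mm.txt", "r+b") as f:
    m = mmap.mmap(f.fileno(), 0)   # length 0 maps the whole file
    head = m[:5]                   # slice reads, just like a bytestring
    m[:5] = b"HELLO"               # same-length in-place edit
    m.flush()
    m.close()
```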
