Modifying a file in-place inside nested for loops - python

I am iterating over directories and the files inside them, modifying each file in place. I would like the newly modified file to be read back in right after.
Here is my code with descriptive comments:
# go through each directory based on their ids
for id in id_list:
    id_dir = os.path.join(output_dir, id)
    os.chdir(id_dir)
    # go through all files (with a specific extension)
    for filename in glob('*' + ext):
        # modify the file by replacing all new-line characters with an empty space
        with fileinput.FileInput(filename, inplace=True) as f:
            for line in f:
                print(line.replace('\n', ' '), end='')
        # here I would like to read the NEW modified file
        with open(filename) as newf:
            content = newf.read()
As it stands, newf is not the new, modified file, but the original f. I think I understand why that is; however, I have found it difficult to overcome the issue.
I could always do two separate iterations (go through each directory based on its id, go through all files with the specific extension and modify each one, then repeat the whole iteration to read each of them), but I was hoping there was a more efficient way around it. Perhaps it would be possible to restart the second for loop after the modification has taken place and then have the read take place (to at least avoid repeating the outer for loop).
Any ideas/designs on how to achieve the above in a clean and efficient way?

For me it works with this code:
#!/usr/bin/env python3
import os
from glob import glob
import fileinput

id_list = ['1']
output_dir = '.'
ext = '.txt'

# go through each directory based on their ids
for id in id_list:
    id_dir = os.path.join(output_dir, id)
    os.chdir(id_dir)
    # go through all files (with a specific extension)
    for filename in glob('*' + ext):
        # modify the file by replacing all new-line characters with an empty space
        for line in fileinput.FileInput(filename, inplace=True):
            print(line.replace('\n', ' '), end="")
        # here I would like to read the NEW modified file
        with open(filename) as newf:
            content = newf.read()
            print(content)
Notice how I iterate over the lines! Once the FileInput iterator is exhausted, the in-place replacement is finalized, so the open() that follows reads the modified file.

I am not saying that the way you are going about this is incorrect, but I feel that you are overcomplicating it. Here is my super simple solution.
from glob import glob

ext = '.txt'  # e.g. the extension from the question
for filename in glob('*' + ext):
    # instead of trying to modify in place, we read the data in and strip the raw values
    f_in = [x.rstrip() for x in open(filename, 'rb').readlines()]
    with open(filename, 'wb') as f_out:  # we then write the data stream back out
        # extra modification to the data can go here; I just remove the \r and \n and write back out
        for i in f_in:
            f_out.write(i)
    # now there is no need to read the data back in because we already have a static reference to it
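Since f_in is a list, the stripped data is still in memory after the write, so the question's "read the NEW modified file" step reduces to a join (illustrative; b'' matches the bytes just written):
content = b''.join(f_in)  # the same bytes that were just written out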

Related

os.path.basename to outfile

For every input file processed (see code below) I am trying to use "os.path.basename" to write to a new output file - I know I am missing something obvious...?
import os
import glob
import gzip

dbpath = '/home/university/Desktop/test'
for infile in glob.glob(os.path.join(dbpath, 'G[D|E]/????/*.gz')):
    print("current file is: " + infile)

    outfile = os.path.basename('/home/university/Desktop/test/G[D|E]/????/??????.xaa.fastq.gz').rsplit('.xaa.fastq.gz')[0]
    file = open(outfile, 'w+')

    gzsuppl = Chem.ForwardSDMolSupplier(gzip.open(infile))
    for m in gzsuppl:
        if m is None: continue
        # ...etc
    file.close()
print(count)
It is not clear to me how to capture element [0] of the rsplit (i.e. everything upstream of .xaa.fastq.gz) and use it as the basename for the new output file.
Unfortunately it simply writes the new output file as "??????" rather than the actual sequence of 6 letters.
Thanks for any help given.
This seems like it will get everything upstream of the .xaa.fastq.gz in the paths returned from glob() in your sample code:
import os

filepath = '/home/university/Desktop/test/GD /AAML/DEAAML.xaa.fastq.gz'
filepath = os.path.normpath(filepath)  # Changes path separators for Windows.

# This section was adapted from answer https://stackoverflow.com/a/3167684/355230
folders = []
while 1:
    filepath, folder = os.path.split(filepath)
    if folder:
        folders.append(folder)
    else:
        if filepath:
            folders.append(filepath)
        break
folders.reverse()

if len(folders) > 1:
    # The last element of folders should contain the original filename.
    filename_prefix = os.path.basename(folders[-1]).split('.')[0]
    outfile = os.path.join(*(folders[:-1] + [filename_prefix + '.rest_of_filename']))
    print(outfile)  # -> \home\university\Desktop\test\GD \AAML\DEAAML.rest_of_filename
Of course what ends up in outfile isn't the final path plus filename, since I don't know what the remainder of the filename will be and just put a placeholder in (the '.rest_of_filename').
I'm not familiar with the kind of input data you're working with, but here's what I can tell you:
The "something obvious" you're missing is that outfile has no connection to infile. Your outfile line uses the ?????? rather than the actual filename because that's what you ask for. It's glob.glob that turns it into a list of matches.
Here's how I'd write that aspect of the outfile line:
outfile = infile.rsplit('.xaa.fastq.gz', 1)[0]
(The , 1 ensures that it'll never split more than once, no matter how crazy a filename gets. It's just a good habit to get into when using split or rsplit like this.)
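For example, with a pathological name that happens to contain the suffix twice (an illustrative snippet):
name = 'DEAAML.xaa.fastq.gz.xaa.fastq.gz'
print(name.rsplit('.xaa.fastq.gz', 1)[0])  # -> DEAAML.xaa.fastq.gz
print(name.rsplit('.xaa.fastq.gz')[0])     # -> DEAAML (every occurrence splits)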
You're setting yourself up for a bug: the glob pattern can match *.gz files which don't end in .xaa.fastq.gz, which would mean that a random .gz file which happens to wind up in the folder would cause outfile to have the same path as infile, and you'd end up writing to the input file.
There are three solutions to this problem which apply to your use case:
1. Use *.xaa.fastq.gz instead of *.gz in your glob. I don't recommend this, because it's easy for a typo to sneak in and make them different again, which would silently reintroduce the bug.
2. Write your output to a different folder than you took your input from:
outfile = os.path.join(outpath, os.path.relpath(infile, dbpath))
outparent = os.path.dirname(outfile)
if not os.path.exists(outparent):
    os.makedirs(outparent)
3. Add an assert outfile != infile line so the program will die with a meaningful error message in the "this should never actually happen" case, rather than silently doing incorrect things.
The indentation of what you posted could be wrong, but it looks like you're opening a bunch of files, then only closing the last one. My advice is to use this instead, so it's impossible to get that wrong:
with open(outfile, 'w+') as file:
    # put things which use `file` here
The name file is already used by Python itself (it's a builtin in Python 2), and the variable names you chose are unhelpful. I'd rename infile to inpath, outfile to outpath, and file to outfile. That way, you can tell whether each one is a path (i.e. a string) or a Python file object just from the variable name, and there's no risk of accessing file before you (re)define it and getting a very confusing error message.
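Putting those pieces together, a minimal sketch of the corrected loop might look like this (assuming Chem is RDKit's, as the ForwardSDMolSupplier call suggests; the question's glob pattern is kept as-is and the per-molecule processing is left as a placeholder):
import glob
import gzip
import os
from rdkit import Chem  # assumed import for the question's Chem

dbpath = '/home/university/Desktop/test'
for inpath in glob.glob(os.path.join(dbpath, 'G[D|E]/????/*.gz')):
    outpath = inpath.rsplit('.xaa.fastq.gz', 1)[0]
    assert outpath != inpath  # guard against stray .gz files matching the glob
    with open(outpath, 'w+') as outfile:
        for m in Chem.ForwardSDMolSupplier(gzip.open(inpath)):
            if m is None:
                continue
            # ...process m and write to outfile...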

Add content to multiple (txt) files python

I need over 2000 dummy (txt) files for testing a recycle bin function. I've created the txt dummy files with the following code:
list = range(0, 2000)
vulling = list
with open("path/file.txt", "w") as f:
    for s in vulling:
        f.write(str(s) + "\n")

List = open("path/file.txt")
List2 = (s.strip() + ' dummy' for s in List)
for item in List2:
    open('path/%s.txt' % (item,), 'w')
But, since I can't upload empty files, I need to add content to those files. The content can be the same for all of them: for example, add the string "Spam" to every file. What would be the best solution for this?
The easiest thing to do is just to create the files with the content in them that you want to begin with:
import os.path

def create_test_files(target_dir, content, n=2000, template="file_%s.txt"):
    # target_dir must already exist
    for i in range(n):
        path = os.path.join(target_dir, template % i)
        with open(path, 'w') as fh:
            fh.write(content)
        yield path

for file_name in create_test_files("/tmp/example", 'Spam'):
    print(file_name)
This is choosing file names for you, so if you need specific ones, you'll have to change it.
This is really quite fast. The other approach (create then copy) will result in having to read the original file 2000 times. Seeing as we already know the content we want in there, we can save that time.
Note: This solution uses a generator, so unless you force it to iterate (e.g. by putting it in a loop, or passing it to list() or tuple()), it won't generate any files.
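For instance, if you don't need to print the names, forcing the generator with list() is enough (illustrative; /tmp/example must already exist):
paths = list(create_test_files("/tmp/example", 'Spam'))  # creates all 2000 files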

Using Python to search multiple text files for matches to a list of strings

So I am starting from scratch on a program that I haven't really seen replicated anywhere else. I'll describe exactly what I want it to do:
I have a list of strings that looks like this:
12482-2958
02274+2482
23381-3857
..........
I want to take each of these strings and search through a few dozen files (all named wds000.dat, wds005.dat, wds010.dat, etc) for matches. If one of them finds a match, I want to write that string to a new file, so in the end I have a list of strings that had matches.
If I need to be more clear about something, please let me know. Any help on where to start with this would be much appreciated. Thanks guys and gals!
Something like this should work
import os

#### your array ####
myarray = {"12482-2958", "02274+2482", "23381-3857"}

path = os.path.expanduser("path/to/myfile")
newpath = os.path.expanduser("path/to/myResultsFile")
filename = 'matches.data'
newf = open(os.path.join(newpath, filename), "w+")

#### loop through every element in the above array ####
for element in myarray:
    elementstring = ''.join(element)
    #### open the path where all of your .dat files are ####
    files = os.listdir(path)
    for f in files:
        if f.strip().endswith(".dat"):
            openfile = open(os.path.join(path, f))  # text mode, so `in` compares strings
            #### loop through every line in the file comparing the strings ####
            for line in openfile:
                if elementstring in line:
                    newf.write(line)
            openfile.close()
newf.close()
Define a function that takes a path and a string and checks for a match.
You can use: open(), find(), close()
Then just create all the paths in a for loop, check every string against each path with the function, and print to file if needed.
Not explained much... need more explanation?
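A minimal sketch of that idea (the helper and the output filename are illustrative; the wds*.dat pattern comes from the question):
from glob import glob

def file_contains(path, needle):
    # open(), find(), close(), as suggested above
    f = open(path)
    found = f.read().find(needle) != -1
    f.close()
    return found

strings = ['12482-2958', '02274+2482', '23381-3857']
with open('matches.txt', 'w') as out:
    for s in strings:
        if any(file_contains(p, s) for p in glob('wds*.dat')):
            out.write(s + '\n')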
Not so pythonic, and it probably has something to straighten out, but this is pretty much the logic to follow:
from glob import glob

strings = ['12482-2958', ...]  # your strings
output = []

for file in glob('wds*.dat'):
    with open(file) as f:  # text mode, so `subs in line` compares strings
        for line in f.readlines():
            for subs in strings:
                if subs in line:
                    output.append(line)

print(output)
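To end up with the file of matching strings the question asks for, the collected lines can then be written out (the output filename is illustrative):
with open('matches.txt', 'w') as out:
    out.writelines(output)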

Python: read multiple source txt files, copy by criteria into 1 output file

My objective is to read multiple txt source files in a folder (small size), then copy lines selected by criteria into one output txt file.
I can do this with one source file, but I get no output (an empty file) when I try to read multiple files and do the same.
From my SO research I wrote the following code (no output):
import glob
# import re --- taken out as 'overkill'

path = 'C:/Doc/version 1/Input*.txt'  # read source files in this folder with this name format
list_of_files = glob.glob(path)
criteria = ['AB', 'CD', 'EF']  # select lines that start with criteria
#list_of_files = glob.glob('./Input*.txt')

with open("P_out.txt", "a") as f_out:
    for fileName in list_of_files:
        data_list = open(fileName, "r").readlines()
        for line in data_list:
            for letter in criteria:
                if line.startswith(letter):
                    f_out.write('{}\n'.format(line))
Thank you for your help.
@abe and @ppperry: I'd like to particularly thank you for your earlier input.
Problems with your code:
You have two duplicate variables files and list_of_files but only use the latter.
Every time you open a file, you overwrite the variable data_list, which throws away the contents of the previous file read.
When you search the file for matching lines, you use the variable fileName instead of data_list!
Places that could use simplification:
Using the re module is overkill for just finding out whether a string starts with another string. You can use line.startswith(letter).
The errors:
Line #14 should look for lines in data_list, not fileName.
"I can do this with 1 source file, but I have no output (empty) when I try to read multiple files and do the same." Lines 14 through 17 should be indented or else the for loop that iterates over the list_of_files will only loop over the first file.
You did not even use lines 4 and 5, so why include them? They have no effect.
Here is your code fixed, with comments:
import glob
import re

#path = 'C:\Doc\version 1\Output*.txt' # read all source files with this name format
#files = glob.glob(path)
criteria = ['AB', 'CD', 'EF']  # select lines that start with criteria
list_of_files = glob.glob('./Output*.txt')

with open("P_out.txt", "a") as f_out:  # use "a" so you can keep the data from the last Output.txt
    for fileName in list_of_files:
        data_list = open(fileName, "r").readlines()
        # indenting the below will allow you to search through all files
        for line in data_list:  # search data_list, not fileName
            for letter in criteria:
                if re.search(letter, line):
                    f_out.writelines('{}\n'.format(line))
                    # I recommend the \n so that the text does not get
                    # concatenated when moving from file to file

# Really? I promise `with` will not lie to you.
# f_out.close()  # the 'with' construction should close files, yet I make sure they close
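As a further simplification (not part of the fix above), str.startswith also accepts a tuple of prefixes, so the two inner loops can collapse into a single test per line:
import glob

criteria = ('AB', 'CD', 'EF')
with open("P_out.txt", "a") as f_out:
    for fileName in glob.glob('./Output*.txt'):
        with open(fileName) as f_in:
            for line in f_in:
                if line.startswith(criteria):
                    f_out.write('{}\n'.format(line))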
For those who downvoted, why not include a comment to justify your judgment? Everything the OP requested has been satisfied. If you think you can further improve the answer, suggest an edit. Thank you.

Differences between enumerate(fileinput.input(file)) and enumerate(file)

I'm looking for some help with my code, which is right below:
for file in file_name:
    if os.path.isfile(file):
        for line_number, line in enumerate(fileinput.input(file, inplace=1)):
            print(file)
            os.system("pause")
            if line_number == 1:
                line = line.replace('Object', '#Object')
            sys.stdout.write(line)
I want to modify some previously extracted files in order to plot them with matplotlib. To do so, I remove some lines and comment out some others.
My problem is the following:
Using for line_number, line in enumerate(fileinput.input(file, inplace=1)): gives me only 4 out of 5 previously extracted files (even though the file_name list contains 5 references!)
Using for line_number, line in enumerate(file): gives me all 5 previously extracted files, BUT I don't know how to make the modifications to the same file without creating another one...
Do you have any ideas on this issue? Is this normal behaviour?
There are a number of things that might help you.
Firstly, file_name appears to be a list of file names. It might be better named file_names, and then you could use file_name for each one. You have verified that this does hold 5 entries.
The enumerate() function provides both an index and the item for each iteration when looping over a list of items. This saves you having to use a separate counter variable, e.g.
for index, item in enumerate(["item1", "item2", "item3"]):
    print(index, item)
would print:
0 item1
1 item2
2 item3
This is not really required here, though, as you have chosen to use the fileinput library. It is designed to take a list of files and iterate over all of the lines in all of the files in one single loop. As such you need to tweak your approach a bit; assuming your list of files is called file_names, you could write something as follows:
# Keep only files in the file list
file_names = [file_name for file_name in file_names if os.path.isfile(file_name)]

# Iterate over all lines in all files
for line in fileinput.input(file_names, inplace=1):
    if fileinput.filelineno() == 1:
        line = line.replace('Object', '#Object')
    sys.stdout.write(line)
The main point here is that it is better to pre-filter any non-filenames before passing the list to fileinput. I will leave it up to you to fix the output.
fileinput provides a number of functions to help you figure out which file or line number is currently being processed.
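For example, fileinput.filename() and fileinput.filelineno() report on whichever file is currently being read (a small illustrative snippet, run without inplace so the output goes to the console):
import fileinput

for line in fileinput.input(['file_1.txt', 'file_2.txt']):
    # filename() is the current file; filelineno() restarts at 1 in each file
    print(fileinput.filename(), fileinput.filelineno(), line, end='')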
Assuming you're still having trouble, my typical approach is to open a file read-only, read its contents into a variable, close the file, make an edited variable, open the file to write (wiping out original file), and finally write the edited contents.
I like this approach since I can simply change the file_name that gets written out if I want to test my edits without wiping out the original file.
Also, I recommend naming containers using plural nouns, as @Martin Evans suggests.
import os

file_names = ['file_1.txt', 'file_2.txt', 'file_3.txt', 'file_4.txt', 'file_5.txt']
file_names = [x for x in file_names if os.path.isfile(x)]  # see @Martin's answer again

for file_name in file_names:
    # Open read-only and put contents into a list of line strings
    with open(file_name, 'r') as f_in:
        lines = f_in.read().splitlines()

    # Put the lines you want to write out in out_lines
    out_lines = []
    for index_no, line in enumerate(lines):
        if index_no == 1:
            out_lines.append(line.replace('Object', '#Object'))
        # elif ...:  (other edit conditions go here)
        else:
            out_lines.append(line)

    # Uncomment to write to a different file name for edits testing
    # with open(file_name + '.out', 'w') as f_out:
    #     f_out.write('\n'.join(out_lines))

    # Write out the file, clobbering the original
    with open(file_name, 'w') as f_out:
        f_out.write('\n'.join(out_lines))
The downside of this approach is that each file needs to be small enough to fit into memory twice (lines plus out_lines).
Best of luck!
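If that becomes a problem, a line-at-a-time alternative is to stream through a temporary file and swap it in afterwards. Here is a sketch of that idea (edit_in_place is a hypothetical helper, not from the answers above; it assumes atomically replacing the original file is acceptable):
import os
import tempfile

def edit_in_place(path, transform):
    # Stream each line through `transform` into a temp file in the same
    # directory, then swap the temp file in over the original.
    dir_name = os.path.dirname(path) or '.'
    with open(path) as f_in, tempfile.NamedTemporaryFile(
            'w', dir=dir_name, delete=False) as f_out:
        for index_no, line in enumerate(f_in):
            f_out.write(transform(index_no, line))
    os.replace(f_out.name, path)

# Comment out the 'Object' line, as in the example above:
edit_in_place('file_1.txt',
              lambda i, line: line.replace('Object', '#Object') if i == 1 else line)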
