I have some code that opens a text file containing a list of paths to files, like so:
C:/Users/User/Desktop/mini_mouse/1980
C:/Users/User/Desktop/mini_mouse/1982
C:/Users/User/Desktop/mini_mouse/1984
It then opens these files individually, line-by-line, and does some filtering to the files. I then want it to output the result to a completely different folder called:
output_location = 'C:/Users/User/Desktop/test2/'
As it stands, my code writes the result next to the original file, i.e., if it opens the file C:/Users/User/Desktop/mini_mouse/1980, the output ends up in the same folder under the name '1980_filtered'. I would like the output to go into output_location instead. Can anyone see where I am going wrong? Any help would be greatly appreciated! Here is my code:
import os

def main():
    stop_words_path = 'C:/Users/User/Desktop/NLTK-stop-word-list.txt'
    stopwords = get_stop_words_list(stop_words_path)
    output_location = 'C:/Users/User/Desktop/test2/'
    list_file = 'C:/Users/User/Desktop/list_of_files.txt'
    with open(list_file, 'r') as f:
        for file_name in f:
            #print(file_name)
            if file_name.endswith('\n'):
                file_name = file_name[:-1]
            #print(file_name)
            file_path = os.path.join(file_name) # joins the new path of the file to the current file in order to access the file
            filestring = '' # file string which will take all the lines in the file and add them to itself
            with open(file_path, 'r') as f2: # open the file
                print('just opened ' + file_name)
                print('\n')
                for line in f2: # read file line by line
                    x = remove_stop_words(line, stopwords) # remove stop words from line
                    filestring += x # add newly filtered line to the file string
                    filestring += '\n' # Create new line
            new_file_path = os.path.join(output_location, file_name) + '_filtered' # creates a new file of the file that is currently being filtered of stopwords
            with open(new_file_path, 'a') as output_file: # opens output file
                output_file.write(filestring)

if __name__ == "__main__":
    main()
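Assuming you're using Windows (your paths look like a normal Windows filesystem), you can write your pathnames with backslashes, which is the native separator there. Python on Windows also accepts forward slashes, so this is a matter of convention rather than a fix for the output location. I changed the paths for you below. If you use backslashes in an ordinary string literal you have to double them (or use a raw string), because a single backslash is treated as an escape character.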
import os

def main():
    stop_words_path = 'C:\\Users\\User\\Desktop\\NLTK-stop-word-list.txt'
    stopwords = get_stop_words_list(stop_words_path)
    output_location = 'C:\\Users\\User\\Desktop\\test2\\'
    list_file = 'C:\\Users\\User\\Desktop\\list_of_files.txt'
    with open(list_file, 'r') as f:
        for file_name in f:
            #print(file_name)
            if file_name.endswith('\n'):
                file_name = file_name[:-1]
            #print(file_name)
            file_path = os.path.join(file_name) # joins the new path of the file to the current file in order to access the file
            filestring = '' # file string which will take all the lines in the file and add them to itself
            with open(file_path, 'r') as f2: # open the file
                print('just opened ' + file_name)
                print('\n')
                for line in f2: # read file line by line
                    x = remove_stop_words(line, stopwords) # remove stop words from line
                    filestring += x # add newly filtered line to the file string
                    filestring += '\n' # Create new line
            new_file_path = os.path.join(output_location, file_name) + '_filtered' # creates a new file of the file that is currently being filtered of stopwords
            with open(new_file_path, 'a') as output_file: # opens output file
                output_file.write(filestring)

if __name__ == "__main__":
    main()
Based on your code, the issue looks to be in this line:
new_file_path = os.path.join(output_location, file_name) + '_filtered'
In Python's os.path.join() any absolute path (or drive letter in Windows) in the inputs will discard everything before it and restart the join from the new absolute path (or drive letter). Since you're calling file_name directly from list_of_files.txt and you have each path formatted there relative to the C: drive, each call to os.path.join() is dropping output_location and being reset to the original file path.
See Why doesn't os.path.join() work in this case? for a better explanation of this behavior.
When building the output path you could strip the file name, "1980" for instance, from the path "C:/Users/User/Desktop/mini_mouse/1980" and join based on the output_location variable and the isolated file name.
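The behaviour described above can be sketched as follows. This is a minimal illustration, not the asker's full script: it uses ntpath (the Windows flavour of os.path, importable on any platform) so the result does not depend on where it runs, and the helper name build_output_path is made up for this example.

```python
import ntpath  # Windows implementation of os.path; on Windows, os.path is ntpath

output_location = 'C:/Users/User/Desktop/test2/'
file_name = 'C:/Users/User/Desktop/mini_mouse/1980'

# The second argument carries its own drive and root, so the first is discarded.
discarded = ntpath.join(output_location, file_name)


def build_output_path(output_dir, source_path, suffix='_filtered'):
    """Hypothetical helper: keep only the last component of source_path."""
    base_name = ntpath.basename(source_path)  # '1980'
    return ntpath.join(output_dir, base_name + suffix)


fixed = build_output_path(output_location, file_name)
print(discarded)
print(fixed)
```

In the original script the same idea would apply with os.path.basename(file_name) before joining it to output_location.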
Related
I'm creating new files from existing ones in the mdp folder by changing a couple of lines in those files using Python. I need to do this for 1000 files. Can anyone suggest a for loop that reads all the files, changes them, and creates the new ones in one go?
As it stands, I have to change the number following 'md_' in the path by hand, which is tedious because there are 1000 files.
I tried building the path with str() but got a 'could not read file' error.
fin = open("/home/abc/xyz/mdp/md_1.mdp", "rt")
fout = open("/home/abc/xyz/middle/md_1.mdp", "wt")
for line in fin:
    fout.write(line.replace('integrator = md', 'integrator = md-vv'))
fin = open("/home/abc/xyz/middle/md_1.mdp", "rt")
fout = open("/home/abc/xyz/mdb/md_1.mdp", "wt")
for line in fin:
    fout.write(line.replace('dt = 0.001', 'dt = -0.001'))
fin.close()
fout.close()
os.listdir(path) is your friend:
import os

sourcedir = "/home/abc/xyz/mdp"
destdir = "/home/abc/xyz/middle"

for filename in os.listdir(sourcedir):
    if not filename.endswith(".mdp"):
        continue
    source = os.path.join(sourcedir, filename)
    dest = os.path.join(destdir, filename)
    # with open(xxx) as varname makes sure the file(s)
    # will be closed whatever happens in the 'with' block
    # NB text mode is the default, and so is read mode
    with open(source) as fin, open(dest, "w") as fout:
        # python files are iterable... avoids reading
        # the whole file in memory at once
        for line in fin:
            # will only work for those exact strings,
            # you may want to use regexps if number of
            # whitespaces vary etc
            line = line.replace("dt = 0.001", "dt = -0.001")
            line = line.replace(
                'integrator = md',
                'integrator = md-vv'
            )
            fout.write(line)
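If the amount of whitespace around the "=" varies, the regexp route mentioned in the comments could look roughly like this. It is only a sketch: the two patterns assume each setting sits alone on its line as "key = value", and the function name rewrite_line is invented for the example.

```python
import re

# Assumed layout: "key = value" with any amount of whitespace around "=".
# The lookahead (?=\s*$) requires the value to be the last thing on the line,
# which keeps the rewrite from firing twice on an already edited line.
DT_PATTERN = re.compile(r'^(\s*dt\s*=\s*)0\.001(?=\s*$)')
INTEGRATOR_PATTERN = re.compile(r'^(\s*integrator\s*=\s*)md(?=\s*$)')


def rewrite_line(line):
    """Return line with the two settings changed, whitespace preserved."""
    line = DT_PATTERN.sub(r'\g<1>-0.001', line)
    line = INTEGRATOR_PATTERN.sub(r'\g<1>md-vv', line)
    return line


print(rewrite_line("dt         =   0.001\n"), end="")
print(rewrite_line("integrator =   md\n"), end="")
```

Inside the loop above, the two line.replace() calls would then become a single line = rewrite_line(line).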
Assuming you want to edit all files that are located in the mdp folder you could do something like this.
import os

dir = "/home/abc/xyz/mdp/"
for filename in os.listdir(dir):
    with open(dir + filename, "r+") as file:
        text = file.read()
        text = text.replace("dt = 0.001", "dt = -0.001")
        file.seek(0)
        file.write(text)
        file.truncate()
This will go through every file and change it using str.replace().
If there are other files in the mdp folder that you do not want to edit, you could use an if statement to check for the correct file name. Wrap the with open block in something like this:
if filename.startswith("md_"):
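An alternative to the startswith() check is to let a glob pattern do the filtering. The sketch below assumes the files are named md_<number>.mdp as in the question; matching_files is a made-up helper, and the demonstration runs in a throwaway temporary directory rather than the real mdp folder.

```python
import tempfile
from pathlib import Path


def matching_files(folder, pattern="md_*.mdp"):
    """Return the files in folder whose names match pattern, sorted by name."""
    return sorted(p for p in Path(folder).glob(pattern) if p.is_file())


# Small demonstration with invented file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("md_1.mdp", "md_2.mdp", "notes.txt"):
        (Path(tmp) / name).write_text("dt = 0.001\n")
    found = [p.name for p in matching_files(tmp)]

print(found)
```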
I have a folder with multiple files like so:
1980
1981
1982
In each of these files is some text. I want to loop through each of these files and do some operation to each file then save the edited file to another folder and move onto the next file and so on. The result would be that I have the original folder and then another folder with the edited version of each file in it like so:
1980_filtered
1981_filtered
1982_filtered
Is it possible to do this?
Currently I have some code that loops through the files in a folder, does some filtering to each file and then saves all the edits of each file into one massive file. Here is my code:
import os

input_location = 'C:/Users/User/Desktop/mini_mouse'
output_location = 'C:/Users/User/Desktop/filter_mini_mouse/mouse'

for root, dir, files in os.walk(input_location):
    for file in files:
        os.chdir(input_location)
        with open(file, 'r') as f, open('NLTK-stop-word-list', 'r') as f2:
            mouse_file = f.read().split() # reads file and splits it into a list
            stopwords = f2.read().split()
            x = (' '.join(i for i in mouse_file if i.lower() not in (x.lower() for x in stopwords)))
            with open(output_location, 'a') as output_file:
                output_file.write(x)
Any help would be greatly appreciated!
You need to specify what each new file is called. Python has good string formatting methods for this, and your desired file names are easy to build inside the loop:
import os

input_location = 'C:/Users/User/Desktop/mini_mouse'
output_location = 'C:/Users/User/Desktop/filter_mini_mouse/mouse'

for root, dir, files in os.walk(input_location):
    for file in files:
        new_file = "{}_filtered.txt".format(file)
        os.chdir(input_location)
        with open(file, 'r') as f, open('NLTK-stop-word-list', 'r') as f2:
            mouse_file = f.read().split()
            stopwords = f2.read().split()
            x = (' '.join(i for i in mouse_file if i.lower() not in (x.lower() for x in stopwords)))
            with open(output_location+'/'+new_file, 'w') as output_file: # Changed 'append' to 'write'
                output_file.write(x)
If you're on Python 3.6 or later, you can use f-strings:
new_file = f"{file}_filtered.txt"
and
with open(f"{output_location}/{new_file}", 'w') as output_file:
    output_file.write(x)
First of all you should start by opening the NLTK-stop-word-list only once, so I moved it outside of your loops. Second, os.chdir() is redundant, you can use os.path.join() to get your current file path (and to construct your new file path):
import os

input_location = 'C:/Users/User/Desktop/mini_mouse'
output_location = 'C:/Users/User/Desktop/filter_mini_mouse/'
stop_words_path = 'C:/Users/User/Desktop/NLTK-stop-word-list.txt'

# read the stop words once, before the loops; reading inside the loop
# would return an empty string for every file after the first
with open(stop_words_path, 'r') as stop_words:
    stopwords = {w.lower() for w in stop_words.read().split()}

for root, dirs, files in os.walk(input_location):
    for name in files:
        file_path = os.path.join(root, name)
        with open(file_path, 'r') as f:
            mouse_file = f.read().split() # reads file and splits it into a list
        x = ' '.join(i for i in mouse_file if i.lower() not in stopwords)
        new_file_path = os.path.join(output_location, name) + '_filtered'
        with open(new_file_path, 'a') as output_file:
            output_file.write(x)
P.S.: I took the liberty of renaming some of your variables because they shadowed built-in names: 'dir' is a built-in function, and 'file' was a built-in type in Python 2. If you run import builtins; dir(builtins) you'll see the current list.
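Pulling the points from these answers together, a one-output-file-per-input version could be sketched like this. The function name filter_folder and the sample text are invented for illustration, and the demonstration uses a temporary directory in place of the Desktop folders from the question.

```python
import os
import tempfile


def filter_folder(input_dir, output_dir, stopwords):
    """Write '<name>_filtered' into output_dir for every file under input_dir."""
    stopwords = {w.lower() for w in stopwords}
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for root, _dirs, files in os.walk(input_dir):
        for name in files:
            with open(os.path.join(root, name), 'r') as src:
                words = src.read().split()
            kept = ' '.join(w for w in words if w.lower() not in stopwords)
            out_path = os.path.join(output_dir, name + '_filtered')
            # 'w' so that re-running the script does not append duplicates
            with open(out_path, 'w') as dst:
                dst.write(kept)
            written.append(out_path)
    return sorted(written)


# Demonstration with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    in_dir = os.path.join(tmp, 'mini_mouse')
    out_dir = os.path.join(tmp, 'filter_mini_mouse')
    os.makedirs(in_dir)
    with open(os.path.join(in_dir, '1980'), 'w') as fh:
        fh.write('The mouse and the cheese')
    outputs = filter_folder(in_dir, out_dir, ['the', 'and'])
    with open(outputs[0]) as fh:
        result = fh.read()

print(result)
```

Keeping the output folder outside the input folder matters here, since os.walk would otherwise pick up the freshly written files on a later run.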
I want to merge JSON array files into a single file, with a comma appended to the end of each array.
My code currently raises a MemoryError, please help me!
My code:
import os, sys

path = "censored"
dirs = os.listdir(path)
save_list = []
s = ""
for file in dirs:
    save_list.append(file)
for i in range(len(save_list)):
    f = open(path + save_list[i], 'r')
    s += f.read()
    s += s.replace("]", "],")
f.close()
ff = open("a", 'w')
ff.write(s)
ff.close()
print("done")
Error >
Traceback (most recent call last):
File "test.py", line 15, in <module>
s += s.replace("]", "],")
MemoryError
Want result
file "a" in substance
[{a:b}]
file "b" in substance
[{c:d}]
file "c" want result substance
[{a:b}], [{c:d}]
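Based on your code, you're opening every single file one after another and only closing the last one. You should open and close each file as you finish with it. The MemoryError itself most likely comes from s += s.replace("]", "],"): it appends a modified copy of the whole accumulated string to itself, so s at least doubles in size on every iteration.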
You can actually progressively write to the output file instead of storing it as a string in memory and writing it at one shot.
There's actually no need to save all files from dir into save_list and reaccess it in another loop. So you can omit save_list.
Putting everything together, you'll get the following code snippet:
# everything above as follows
for file in dirs:
    curr_file_path = path + file
    curr_file_string = ""
    # using this would close the file automatically
    with open(curr_file_path, 'r') as f:
        raw_file = f.read()
        curr_file_string = raw_file.replace("]", "],")
    # open the output file and set the mode to append ('a') to batch write
    # similarly, this will close the output file after every write
    with open("output file", "a") as out_f:
        out_f.write(curr_file_string)
print("done")
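If the trailing comma after the last array is unwanted (the desired result in the question shows none), one option is to write a separator between files instead of replacing "]". This is only a sketch under that assumption: join_array_files is a made-up name, and the sample files mirror the "a" and "b" files from the question.

```python
import os
import tempfile


def join_array_files(source_dir, output_path, separator=", "):
    """Concatenate every file in source_dir, writing separator between them."""
    names = sorted(
        n for n in os.listdir(source_dir)
        if os.path.isfile(os.path.join(source_dir, n))
    )
    with open(output_path, "w") as out_f:
        for index, name in enumerate(names):
            if index:  # no separator before the first file
                out_f.write(separator)
            # only one input file is held in memory at a time
            with open(os.path.join(source_dir, name), "r") as in_f:
                out_f.write(in_f.read().strip())


# Demonstration with the sample contents from the question.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    os.makedirs(src)
    for name, text in (("a", "[{a:b}]\n"), ("b", "[{c:d}]\n")):
        with open(os.path.join(src, name), "w") as fh:
            fh.write(text)
    merged_path = os.path.join(tmp, "c")
    join_array_files(src, merged_path)
    with open(merged_path) as fh:
        merged = fh.read()

print(merged)
```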
I am trying to get a Python script that will open a few text files, read the content and every time it finds a word from a list, block that out with new text, then write it to a new file, for each file.
Right now, I can get it to write all of the source files to a single file, which is my script below, but I am not sure how to proceed to having a new file for every source file.
import os

KeyWords=["Magic","harry","wand"]
rootdir = "C:\\books"
fileslist = []
##blanks file and preps for new data
fileout = open(rootdir+"\\output\\newfile.txt","w")
print (fileout)
fileout.write("Start of file\n\nLocation of output: "+rootdir+"\\output \n\nFiles that are being Processed:\n\n")
fileout.close()

def sourcelist(fileslist):
    file=open(fileslist,"r")
    fileout=open(rootdir+"\\output\\newfile.txt", "a")
    for line in file:
        if any(word.lower() in line.lower() for word in KeyWords):
            print("Word Found\n\n" + '\t'+line + "\nEnd\n")
            fileout.write("<<<SEARCH TERM FOUND>>>\n\n" + '\t'+line + "\n<<<END OF BLOCK>>>\n")
        else:
            #print('\t'+line) #No need to print the lines with no Key words in
            fileout.write('\t'+line)
    #return #not sure what return does?

for root, dirs, files in os.walk(rootdir):
    dirs.clear()
    for file in files:
        filepath = root + os.sep + file
        if filepath.endswith(".txt"):
            fileslist.append(filepath)
for path in fileslist:
    sourcelist(path)
print("\n".join(fileslist))
with open(rootdir+"\\output\\newfile.txt","a") as output:
    output.write("\n".join(fileslist)+"\n\n\n")
    output.close()
This is a bit tough to answer as a whole, but here's a general approach.
I have the following file structure:
hp_extracts: # directory
hp_parser.py
-- inps/
-- harry_1.txt
-- harry_2.txt
-- outs/
<nothing>
Contents of inps/harry_1.txt:
When Harry got his wand it was Magic
something something magic something
something harry something
Contents of inps/harry_2.txt:
magic something something
something
harry something something
This is the contents of hp_parser.py:
import os

all_files = os.listdir('inps/')
keywords=["magic","harry","wand"]
for file in all_files:
    with open('inps/{}'.format(file)) as infile, open('outs/{}'.format(file), 'w') as outfile:
        for line in infile:
            #print(line)
            for word in line.split():
                if word.lower() in keywords:
                    line = line.replace(word, '<<<SEARCH TERM FOUND>>> {} <<<END OF BLOCK>>>'.format(word))
            outfile.write(line)
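One limitation of splitting on whitespace is that a keyword followed by punctuation ("wand,") is not recognised. A regular expression with word boundaries is one way around that; the sketch below swaps in re.sub for the split-and-replace loop, and mark_keywords is a name invented for the example.

```python
import re

KEYWORDS = ["magic", "harry", "wand"]

# \b anchors the match at word boundaries, so "magical" is left alone
# while "wand," and "Harry's" still match; IGNORECASE covers "Magic".
KEYWORD_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def mark_keywords(line):
    """Wrap every keyword occurrence in the search-term markers."""
    return KEYWORD_PATTERN.sub(
        lambda m: "<<<SEARCH TERM FOUND>>> {} <<<END OF BLOCK>>>".format(m.group(0)),
        line,
    )


marked = mark_keywords("Harry's wand, harry said, was magical.\n")
print(marked, end="")
```

In hp_parser.py the inner "for word in line.split()" loop could then be replaced by line = mark_keywords(line).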
I need to find every instance of "translate" in a text file and replace a value 4 lines after finding the text:
"(many lines)
}
}
translateX xtran
{
keys
{
k 0 0.5678
}
}
(many lines)"
The value 0.5678 needs to be 0. It will always be 4 lines below the "translate" string
The file has up to about 10,000 lines.
example text file name: 01F.pz2.
I'd also like to cycle through the folder and repeat the process for every file with the pz2 extension (up to 40).
Any help would be appreciated!
Thanks.
I'm not quite sure about the logic for replacing 0.5678 in your file, so I put it in a function. Change it to whatever you need, or explain in more detail what you want: the last number in the line? Only floating-point numbers?
Try:
import os

dirname = "14432826"
lines_distance = 4

def replace_whatever(line):
    # Put your logic for replacing here
    return line.replace("0.5678", "0")

for filename in filter(lambda x: x.endswith(".pz2") and not x.startswith("m_"), os.listdir(dirname)):
    print(filename)
    with open(os.path.join(dirname, filename), "r") as f_in, open(os.path.join(dirname, "m_%s" % filename), "w") as f_out:
        replace_tasks = []
        for line in f_in:
            # search marker in line
            if line.strip().startswith("translate"):
                print("Found marker in", line, end="")
                replace_tasks.append(lines_distance)
            # replace if necessary
            if len(replace_tasks) > 0 and replace_tasks[0] == 0:
                del replace_tasks[0]
                print("line to change is", line, end="")
                line_to_write = replace_whatever(line)
            else:
                line_to_write = line
            # Write to output
            f_out.write(line_to_write)
            # decrease counters
            for i, task in enumerate(replace_tasks):
                replace_tasks[i] -= 1
The comments within the code should help understanding. The main concept is the list replace_tasks that keeps record of when the next line to modify will come.
Remark: your sample suggests that the data in your file is structured. It would definitely be safer to parse that structure and work on it than to use a search-and-replace approach on plain text.
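The "remember which line to change" idea can also be written as a small function over a list of lines, which makes it easy to try out on a sample. This is a sketch under assumptions not stated in the question: the target line is taken to have the form "k <frame> <value>" and its last space-separated token is replaced; the name zero_after_marker is made up.

```python
def zero_after_marker(lines, marker="translate", distance=4):
    """Return a copy of lines where the last token of the line that sits
    `distance` lines below each marker line is replaced by "0"."""
    targets = set()
    result = []
    for index, line in enumerate(lines):
        if line.strip().startswith(marker):
            targets.add(index + distance)
        if index in targets:
            newline = "\n" if line.endswith("\n") else ""
            # keep indentation and leading tokens, swap only the last token
            head, _sep, _value = line.rstrip("\n").rpartition(" ")
            line = head + " 0" + newline
        result.append(line)
    return result


sample = [
    "translateX xtran\n",
    "\t{\n",
    "\tkeys\n",
    "\t\t{\n",
    "\t\tk 0 0.5678\n",
    "\t\t}\n",
]
changed = zero_after_marker(sample)
print("".join(changed), end="")
```

For a real .pz2 file, the lines would come from f_in.readlines() and the result would be written back with f_out.writelines(changed).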
Thorsten, I renamed my original files to have the .old extension and the following code works:
import os

target_dir = "."
# cycle through files
for path, dirs, files in os.walk(target_dir):
    for file in files:
        # get the filename and extension
        filename, ext = os.path.splitext(file)
        # only process the renamed originals (.old)
        if ext.endswith('.old'):
            oldfilename = filename + ".old"
            newfilename = filename + ".pz2"
            old_filepath = os.path.join(path, oldfilename)
            new_filepath = os.path.join(path, newfilename)
            # open the old file for reading and the new file for writing;
            # 'with' closes both when the block ends
            with open(old_filepath, "r") as oldpz2, open(new_filepath, "w") as newpz2:
                # reset changeline
                changeline = 0
                currentline = 0
                # cycle through old lines
                for line in oldpz2:
                    currentline = currentline + 1
                    if line.strip().startswith("translate"):
                        changeline = currentline + 4
                    if currentline == changeline:
                        newpz2.write(" k 0 0\n")
                    else:
                        # line already ends with a newline, so write it as-is
                        newpz2.write(line)