The idea is to write N files using N processes.
The data for each output file come from multiple input files, which are stored in a dictionary that maps each output file to a list of its inputs, like this:
dic = {'file1': ['data11.txt', 'data12.txt', ..., 'data1M.txt'],
       'file2': ['data21.txt', 'data22.txt', ..., 'data2M.txt'],
       ...
       'fileN': ['dataN1.txt', 'dataN2.txt', ..., 'dataNM.txt']}
so file1 is data11 + data12 + ... + data1M, and so on.
My code looks like this:
jobs = []
for d in dic:
    outfile = str(d) + "_merged.txt"
    with open(outfile, 'w') as out:
        p = multiprocessing.Process(target=merger.merger, args=(dic[d], name, out))
        jobs.append(p)
        p.start()
        out.close()
and the merger.py looks like this:
def merger(files, name, outfile):
    time.sleep(2)
    sys.stdout.write("Merging %n...\n" % name)
    # the reason for this step is that all the different files have a header
    # but I only need the header from the first file.
    with open(files[0], 'r') as infile:
        for line in infile:
            print "writing to outfile: ", name, line
            outfile.write(line)
    for f in files[1:]:
        with open(f, 'r') as infile:
            next(infile)  # skip first line
            for line in infile:
                outfile.write(line)
    sys.stdout.write("Done with: %s\n" % name)
I do see the file created in the folder it should go to, but it's empty: no header, nothing. I put prints in there to check whether everything was correct, but nothing works.
Help!
Since the worker processes run in parallel with the main process that creates them, the files named out get closed before the workers can write to them. This happens even if you remove out.close(), because of the with statement. Instead, pass each process the filename and let the process open and close the file itself.
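A minimal sketch of that approach (not the original code; the merge_to_path helper and the sample dic entries are just for illustration):

import multiprocessing

def merge_to_path(files, outpath):
    # The worker opens (and closes) its own output file, so nothing is lost to buffering.
    with open(outpath, 'w') as out:
        with open(files[0]) as infile:   # keep the header of the first file
            out.writelines(infile)
        for f in files[1:]:
            with open(f) as infile:
                next(infile)             # skip the header of later files
                out.writelines(infile)

if __name__ == '__main__':
    dic = {'file1': ['data11.txt', 'data12.txt'],   # hypothetical input paths
           'file2': ['data21.txt', 'data22.txt']}
    jobs = []
    for d in dic:
        p = multiprocessing.Process(target=merge_to_path,
                                    args=(dic[d], d + "_merged.txt"))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()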
The problem is that you don't close the file in the child, so internally buffered data is lost. You could move the file open into the child, or wrap the whole thing in a try/finally block to make sure the file closes. A potential advantage of opening in the parent is that you can handle file errors there. I'm not saying it's compelling, just an option.
def merger(files, name, outfile):
    try:
        time.sleep(2)
        sys.stdout.write("Merging %s...\n" % name)
        # the reason for this step is that all the different files have a header
        # but I only need the header from the first file.
        with open(files[0], 'r') as infile:
            for line in infile:
                print "writing to outfile: ", name, line
                outfile.write(line)
        for f in files[1:]:
            with open(f, 'r') as infile:
                next(infile)  # skip first line
                for line in infile:
                    outfile.write(line)
        sys.stdout.write("Done with: %s\n" % name)
    finally:
        outfile.close()
UPDATE
There has been some confusion about parent/child file descriptors and what happens to files in the child. The underlying C library does not flush data to disk if a file is still open when the program exits; the theory is that a properly running program closes everything before exit. Here is an example where the child loses data because it does not close the file.
import multiprocessing as mp
import os
import time

if os.path.exists('mytestfile.txt'):
    os.remove('mytestfile.txt')

def worker(f, do_close=False):
    time.sleep(2)
    print('writing')
    f.write("this is data")
    if do_close:
        print("closing")
        f.close()

print('without close')
f = open('mytestfile.txt', 'w')
p = mp.Process(target=worker, args=(f, False))
p.start()
f.close()
p.join()
print('file data:', open('mytestfile.txt').read())

print('with close')
os.remove('mytestfile.txt')
f = open('mytestfile.txt', 'w')
p = mp.Process(target=worker, args=(f, True))
p.start()
f.close()
p.join()
print('file data:', open('mytestfile.txt').read())
I ran it on Linux and got:
without close
writing
file data:
with close
writing
closing
file data: this is data
Related
I have some issues with my program. I have been trying to come up with a script that compares text files against a master text file and prints out the differences.
Basically, these are network configurations, and we need to compare them in bulk to make sure all devices have standard configurations. For example, the script should read each file (file1, file2, etc.) line by line and compare it with the master file (master.txt).
I am able to compare one file at a time; however, when comparing two or more files I get an "index out of range" error.
I want to compare multiple files, probably hundreds, so I need to know how to fix this loop. I understand that this could be because the program is trying to read a line that doesn't exist in the master file.
import difflib
import sys

hosts0 = open("C:\\Users\\p1329760\\Desktop\\Personal\\Python\\Projects\\sample\\master.txt", "r")
hosts1 = open("C:\\Users\\p1329760\\Desktop\\Personal\\Python\\Projects\\sample\\file1.txt", "r")
hosts2 = open("C:\\Users\\p1329760\\Desktop\\Personal\\Python\\Projects\\sample\\file2.txt", "r")

lines1 = hosts0.readlines()
#print(lines11)

with open('output_compare.txt', 'w') as f:
    #global original_stdout
    for i, lines2 in enumerate(hosts1):
        if lines2 != lines1[i]:
            original_stdout = sys.stdout
            sys.stdout = f
            print("line ", i, " in hosts1 is different \n")
            print(lines2)
            sys.stdout = original_stdout
        else:
            pass

with open('output_compare1.txt', 'w') as file:
    for i, lines3 in enumerate(hosts2):
        if lines3 != lines1[i]:
            original_stdout = sys.stdout
            sys.stdout = file
            print("line ", i, " in hosts1 is different \n")
            print(lines3)
            sys.stdout = original_stdout
        else:
            pass
Hi, here is what you could do:
You can have a list of all the file names:
namefile = [....]
And a function which takes a file name:
def compare(filename):
    fileobj = open(filename)
    infile = fileobj.read().split()
    for i in range(0, len(infile)):
        if infile[i] == masterin[i]:
            pass
        else:
            print(...)
After that you have to open the master file:
master = open("...")
masterin = master.read().split()
After that, loop over the names and you're done:
for i in namefile:
    compare(i)
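Note that this compare still indexes masterin by position, so it can hit the same "index out of range" error when a device file is longer than the master. A minimal sketch that avoids that (assuming a plain line-by-line comparison is what's wanted; the compare_to_master helper and file names are just for illustration):

from itertools import zip_longest

def compare_to_master(filename, master_lines, out):
    # zip_longest pairs lines up even when the files have different lengths,
    # so a longer device file no longer raises IndexError.
    with open(filename) as f:
        for i, (line, master_line) in enumerate(zip_longest(f, master_lines)):
            if line != master_line:
                out.write("line %d in %s is different\n" % (i, filename))
                out.write((line or "<missing line>") + "\n")

with open("master.txt") as m:                 # hypothetical paths
    master_lines = m.readlines()

with open("output_compare.txt", "w") as out:
    for name in ["file1.txt", "file2.txt"]:
        compare_to_master(name, master_lines, out)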
I'm trying to merge multiple files into one file using Python. I've tried several methods, but they all result in the final file missing some lines. The file sizes can vary a lot, so I'd prefer something that does not load the whole file into memory.
My knowledge of this is a bit limited, but I read that it's probably due to write buffering: the file is not written immediately; the data is instead kept momentarily in memory and written to the file later.
I've tried multiple ways to solve this: using shutil.copyfileobj, plain Python read/write, adding a tag to the end of the file, checking the tail of both files, using file.flush followed by os.fsync, and finally adding a few seconds of time.sleep. Everything fails. Could anyone advise on a reliable way to merge files?
Some approaches seem to work fine on my local PC, but when tried on another system (an HPC cluster) the error occurs, so this is rather hard to replicate.
These are all the approaches I've tried so far:
#support functions
def tail(file_path):
    last_line = None
    with open(file_path) as file:
        line = file.readline()
        while line:
            last_line = str(line)
            line = file.readline()
    return last_line

def wait_for_flush(output_file, tail_in):
    c = 0
    while not file_exists(output_file):
        sleep(5)
        c += 1
        if c > 100: raise BrokenConcatenation(output_file)
    tail_out = tail(output_file)
    while tail_out != tail_in:
        while not tail_out:
            sleep(2)
            tail_out = tail(output_file)
            c += 1
            if c > 100: raise BrokenConcatenation(output_file)
        tail_out = tail(output_file)
        c += 1
        sleep(2)
        if c > 100: raise BrokenConcatenation(output_file)

def merge_two_files(file1, file2):
    with open(file1, 'a+') as f1:
        with open(file2) as f2:
            line = f2.readline()
            while line:
                f1.write(line)
                line = f2.readline()
        #forcing disk write
        f1.flush()
        os.fsync(f1)
#main functions
def concat_files(output_file, list_file_paths, stdout_file=None, add_tag=False):
    print('Concatenating files into ', output_file, flush=True, file=stdout_file)
    print(output_file)
    list_files = list(list_file_paths)
    while len(list_files) > 1:
        file1 = list_files.pop(0)
        file2 = list_files.pop(0)
        merge_two_files(file1, file2)
        sleep(1)
        os.remove(file2)
        list_files.append(file1)
    final_file = list_files.pop()
    move_file(final_file, output_file)

def concat_files(output_file, list_file_paths, stdout_file=None, add_tag=False):
    print('Concatenating files into ', output_file, flush=True, file=stdout_file)
    with open(output_file, 'wb', buffering=0) as wfd:
        for f in list_file_paths:
            with open(f, 'rb') as fd:
                shutil.copyfileobj(fd, wfd)
            #forcing disk write
            wfd.flush()
            os.fsync(wfd)
            sleep(2)

def concat_files(output_file, list_file_paths, stdout_file=None, add_tag=False):
    print('Concatenating files into ', output_file, flush=True, file=stdout_file)
    with open(output_file, 'w+') as wfd:
        for f in list_file_paths:
            with open(f) as fd:
                line = fd.readline()
                while line:
                    wfd.write(line)
                    line = fd.readline()
            if add_tag:
                tail_in = '#' + f + '\n'
                wfd.write(tail_in)
            else:
                tail_in = tail(f)
            #forcing disk write
            wfd.flush()
            os.fsync(wfd)
            wait_for_flush(output_file, tail_in)

#resets file whenever we open file, doesnt work
def concat_files(output_file, list_file_paths, stdout_file=None):
    print('Concatenating files into ', output_file, flush=True, file=stdout_file)
    for f in list_file_paths:
        with open(output_file, 'wb') as wfd:
            with open(f, 'rb') as fd:
                shutil.copyfileobj(fd, wfd)
            #forcing disk write
            wfd.flush()
            os.fsync(wfd)

def concat_files(output_file, list_file_paths, stdout_file=None):
    print('Concatenating files into ', output_file, flush=True, file=stdout_file)
    with open(output_file, 'w+') as outfile:
        for f in list_file_paths:
            with open(f) as infile:
                line = infile.readline()
                while line:
                    outfile.write(line)
                    line = infile.readline()
            #forcing disk write
            outfile.flush()
            os.fsync(outfile)

def concat_files(output_file, list_file_paths, stdout_file=None):
    print('Concatenating files into ', output_file, flush=True, file=stdout_file)
    with open(output_file, 'wb') as wfd:
        for f in list_file_paths:
            with open(f, 'rb') as fd:
                shutil.copyfileobj(fd, wfd)
            #forcing disk write
            wfd.flush()
            os.fsync(wfd)
If you don't want to read large files into memory, I would say this should just work:
def concat_files(output_file, list_file_paths):
    print('Concatenating files into', output_file)
    with open(output_file, 'w') as wfd:
        for f in list_file_paths:
            print(f, '...')
            with open(f) as fd:
                for line in fd:
                    wfd.write(line)
            wfd.write(f'eof - {f}\n')  # mod to indicate end of this file
    print('Done.')
This creates output_file as a new file and reads each file from list_file_paths one line at a time, writing it to the new file.
Update: see the "mod to indicate end of this file" comment in the code above.
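For reference, a quick usage sketch (assuming the concat_files above is in scope; the file names here are hypothetical):

parts = ['part1.txt', 'part2.txt', 'part3.txt']   # hypothetical input files
concat_files('merged.txt', parts)

# Quick sanity check: every part's eof marker should appear in the result.
with open('merged.txt') as fh:
    merged = fh.read()
for p in parts:
    assert f'eof - {p}\n' in merged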
I'm reading files from HDFS using Python.
Each file has a header, and I'm trying to merge the files. However, the header of each file also gets merged into the output.
Is there a way to skip the header from the second file onwards?
hadoop = sc._jvm.org.apache.hadoop
conf = hadoop.conf.Configuration()
fs = hadoop.fs.FileSystem.get(conf)
src_dir = "/mnt/test/"
out_stream = fs.create(hadoop.fs.Path(dst_file), overwrite)

files = []
for f in fs.listStatus(hadoop.fs.Path(src_dir)):
    if f.isFile():
        files.append(f.getPath())

for file in files:
    in_stream = fs.open(file)
    hadoop.io.IOUtils.copyBytes(in_stream, out_stream, conf, False)
Currently I have solved the problem with the logic below; however, I would like to know if there is a better and more efficient solution. I appreciate your help.
for idx, file in enumerate(files):
    if debug:
        print("Appending file {} into {}".format(file, dst_file))
    # remove header from the second file
    if idx > 0:
        file_str = ""
        with open('/' + str(file).replace(':', ''), 'r+') as f:
            for idx, line in enumerate(f):
                if idx > 0:
                    file_str = file_str + line
        with open('/' + str(file).replace(':', ''), "w+") as f:
            f.write(file_str)
    in_stream = fs.open(file)  # InputStream object and copy the stream
    try:
        hadoop.io.IOUtils.copyBytes(in_stream, out_stream, conf, False)  # False means don't close out_stream
    finally:
        in_stream.close()
What you are doing now is appending repeatedly to a string. This is a fairly slow process. Why not write directly to the output file as you are reading?
for file_idx, file in enumerate(files):
    with open(...) as out_f, open(...) as in_f:
        for line_num, line in enumerate(in_f):
            if file_idx == 0 or line_num > 0:
                out_f.write(line)
If you can load the file all at once, you can also skip the first line by using readline followed by readlines:
for file_idx, file in enumerate(files):
    with open(...) as out_f, open(...) as in_f:
        if file_idx != 0:
            in_f.readline()
        out_f.writelines(in_f.readlines())
I am writing a CSV file named OUT_FILE. I can see that the file is not immediately created on disk, so I want to wait until the file gets created.
Below is the code that writes the CSV file:
with open(OUT_FILE, 'a') as outputfile:
    with open(INTER_FILE, 'rb') as feed:
        writer = csv.writer(outputfile, delimiter=',', quotechar='"')
        reader = csv.reader(feed, delimiter=',', quotechar='"')
        for row in reader:
            reportable_jurisdiction = row[7]
            if '|' in reportable_jurisdiction:
                row[7] = "|".join(sorted(list(row[7].split('|'))))
                print " reportable Jurisdiction with comma " + reportable_jurisdiction
            else:
                print "reportable Jurisdiction if single " + reportable_jurisdiction
            writer.writerow(row)
        feed.close()
    outputfile.close()
Now I have one file called FEED_FILE, which is actually the input for OUT_FILE, i.e. after writing the data to OUT_FILE, the sizes of OUT_FILE and FEED_FILE should be the same.
For that I have written the code below:
while True:
    try:
        print 'sleeping for 5 seconds'
        time.sleep(5)
        outputfileSize = os.path.getsize(OUT_FILE)
        if outputfileSize == FeedFileSize:
            break
    except OSError:
        print " file not created "
        continue
print " file created !!"
Now I don't know if this is even executing, since there are no errors and the prints don't show up in the output.
Any help?
You can check if the file exists using Python's os module:
import os
import time

def wait_for_file(path):
    timeout = 300
    while timeout:
        if os.path.exists(path):
            # check if your condition of file size having
            # same as feed file size is met
            return
        timeout = timeout - 5
        time.sleep(5)
    raise Exception("File was not created")
I have an input file that looks like this (infile.txt):
a x
b y
c z
I want to implement a program that enables the user to write to STDOUT or to a file, depending on the command:
python mycode.py infile.txt outfile.txt
will write to a file, while
python mycode.py infile.txt #2nd case
will write to STDOUT.
I'm stuck with this code:
import sys
import csv

nof_args = len(sys.argv)
infile = sys.argv[1]
print nof_args
outfile = ''
if nof_args == 3:
    outfile = sys.argv[2]

# for some reason infile is so large
# so we can't save it to data structure (e.g. list) for further processing
with open(infile, 'rU') as tsvfile:
    tabreader = csv.reader(tsvfile, delimiter=' ')
    with open(outfile, 'w') as file:
        for line in tabreader:
            outline = "__".join(line)
            # and more processing
            if nof_args == 3:
                file.write(outline + "\n")
            else:
                print outline
        file.close()
When using the 2nd case it produces:
Traceback (most recent call last):
  File "test.py", line 18, in <module>
    with open(outfile, 'w') as file:
IOError: [Errno 2] No such file or directory: ''
What's a better way to implement this?
You can try this:
import sys

if write_to_file:
    out = open(file_name, 'w')
else:
    out = sys.stdout
# or a one-liner:
# out = open(file_name, 'w') if write_to_file else sys.stdout

for stuff in data:
    out.write(stuff)

out.flush()  # cannot close stdout
# Python deals with open files automatically
You can also use this instead of out.flush():
try:
    out.close()
except AttributeError:
    pass
This looks a bit ugly to me, though, so a plain flush works just as well.
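Applied to the question's argv handling, a minimal sketch might look like this (the csv handling is carried over from the question; the 'rU' mode from the question is only needed on Python 2 and is omitted here):

import csv
import sys

nof_args = len(sys.argv)
infile = sys.argv[1]

# Choose the destination once, then write to it unconditionally.
out = open(sys.argv[2], 'w') if nof_args == 3 else sys.stdout

with open(infile) as tsvfile:
    tabreader = csv.reader(tsvfile, delimiter=' ')
    for line in tabreader:
        outline = "__".join(line)
        # ... more processing ...
        out.write(outline + "\n")

if out is not sys.stdout:
    out.close()
else:
    out.flush()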