How to join multiple sorted files in Python alphabetically?

How can I read multiple CSV input files line by line, compare the current lines character by character, write the line that comes first alphabetically to an output file, and then advance the pointer of the file that held the minimum value, continuing the comparison across all files until every input file is exhausted? Here's some rough planning toward a solution.
buffer = []
for inFile in inFiles:
    f = open(inFile, "r")
    line = next(f)
    buffer.append([line, inFile])
# find the minimum value in buffer alphabetically...
# write it to an output file...
# how do I advance one line in the file with the min value?
# and then continue the line-by-line comparisons in the input files?
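For reference, a minimal sketch of how that plan could be completed by hand with heapq (the function name merge_files and its arguments are illustrative, not from the original post):
import heapq

def merge_files(in_paths, out_path):
    """K-way merge of already-sorted text files, line by line."""
    files = [open(p) for p in in_paths]
    # Heap entries are (current line, file index); the index breaks ties
    # so the heap never has to compare file objects.
    heap = []
    for i, f in enumerate(files):
        line = f.readline()
        if line:
            heapq.heappush(heap, (line, i))
    with open(out_path, 'w') as out:
        while heap:
            line, i = heapq.heappop(heap)  # smallest current line
            out.write(line)
            nxt = files[i].readline()      # advance only that file
            if nxt:
                heapq.heappush(heap, (nxt, i))
    for f in files:
        f.close()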

You can use heapq.merge. In Python 2.x (contextlib.nested was removed in Python 3):
import heapq
import contextlib

files = [open(fn) for fn in inFiles]
with contextlib.nested(*files):
    with open('output', 'w') as f:
        f.writelines(heapq.merge(*files))
In Python 3.x (3.3+):
import heapq
import contextlib

with contextlib.ExitStack() as stack:
    files = [stack.enter_context(open(fn)) for fn in inFiles]
    with open('output', 'w') as f:
        f.writelines(heapq.merge(*files))
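Note that heapq.merge only interleaves its inputs, so each file must already be sorted for the result to be sorted. A quick illustration with in-memory lists:
import heapq

# Each input must already be in order for the merge to be correct.
print(list(heapq.merge(['a\n', 'c\n'], ['b\n', 'd\n'])))
# ['a\n', 'b\n', 'c\n', 'd\n']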

Related

Overwriting lines in text file [duplicate]

How can I insert a string at the beginning of each line in a text file? I have the following code:
f = open('./ampo.txt', 'r+')
with open('./ampo.txt') as infile:
    for line in infile:
        f.insert(0, 'EDF ')
f.close
I get the following error:
'file' object has no attribute 'insert'
Python comes with batteries included:
import fileinput
import sys

for line in fileinput.input(['./ampo.txt'], inplace=True):
    sys.stdout.write('EDF {l}'.format(l=line))
Unlike the solutions already posted, this also preserves file permissions.
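As a side note, inplace=True works by redirecting standard output to the file being rewritten, which is why the sys.stdout.write call lands back in ampo.txt. If you want a safety net, fileinput can also keep a backup of the original (the .bak suffix below is just an example):
import fileinput
import sys

# A copy of the original is kept as ampo.txt.bak before rewriting.
for line in fileinput.input(['./ampo.txt'], inplace=True, backup='.bak'):
    sys.stdout.write('EDF ' + line)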
You can't modify a file in place like that. Files do not support insertion. You have to read it all in and then write it all out again.
You can do this line by line if you wish. But in that case you need to write to a temporary file and then replace the original. So, for small enough files, it is just simpler to do it in one go like this:
with open('./ampo.txt', 'r') as f:
    lines = f.readlines()
lines = ['EDF ' + line for line in lines]
with open('./ampo.txt', 'w') as f:
    f.writelines(lines)
Here's a solution where you write to a temporary file and move it into place. You might prefer this version if the file you are rewriting is very large, since it avoids keeping the contents of the file in memory, as versions that involve .read() or .readlines() will. In addition, if there is any error in reading or writing, your original file will be safe:
from shutil import move
from tempfile import NamedTemporaryFile

filename = './ampo.txt'
tmp = NamedTemporaryFile(delete=False)
tmp.close()  # we only need the name; the file is reopened below
with open(filename) as finput:
    with open(tmp.name, 'w') as ftmp:
        for line in finput:
            ftmp.write('EDF ' + line)
move(tmp.name, filename)
For a file not too big (Python 2, since it writes str to a binary-mode file):
with open('./ampo.txt', 'rb+') as f:
    x = f.read()
    f.seek(0, 0)
    f.writelines(('EDF ', x.replace('\n', '\nEDF ')))
    f.truncate()
Note that, in this case, the f.truncate() is not strictly necessary, because the new content is longer than the old content and therefore overwrites all of it. But it is safer to keep the call anyway: a file does not shrink just because you write less to it, so if the new content were shorter than the original, the trailing bytes of the old content would remain in the file. truncate() cuts the file off at the current position and discards that leftover tail. (Closing a file does not write any explicit EOF marker; the filesystem simply tracks the file's size.)
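A small demonstration of the shrinking case (demo.txt is a hypothetical throwaway file):
with open('demo.txt', 'w') as f:
    f.write('hello world\n')

with open('demo.txt', 'r+') as f:
    f.seek(0)
    f.write('bye\n')
    # Without f.truncate() the file would now contain 'bye\no world\n':
    # the old tail survives because writing does not shrink the file.
    f.truncate()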
For a big file, to avoid putting all the content of the file in RAM at once:
import os

def addsomething(filepath, ss):
    # Build a temporary file name next to the original.
    if filepath.rfind('.') > filepath.rfind(os.sep):
        a, _, c = filepath.rpartition('.')
        tempi = a + 'temp.' + c
    else:
        tempi = filepath + 'temp'
    with open(filepath, 'rb') as f, open(tempi, 'wb') as g:
        g.writelines(ss + line for line in f)
    os.remove(filepath)
    os.rename(tempi, filepath)

addsomething('./ampo.txt', 'WZE')
f = open('./ampo.txt', 'r')
lines = ['EDF ' + l for l in f.readlines()]
f.close()
f = open('./ampo.txt', 'w')
f.writelines(lines)
f.close()

Python - replace the startswith character

I want to replace the first character in each line from the text file.
2 1.510932 0.442072 0.978141 0.872182
5 1.510932 0.442077 0.978141 0.872181
Above is my text file.
import sys
import glob
import os.path

list_of_files = glob.glob('/path/txt/23.txt')
for file_name in list_of_files:
    f = open(file_name, 'r')
    lst = []
    for line in f:
        f = open(file_name, 'w')
        if line.startswith("2 "):
            line = line.replace("2 ", "7")
        f.write(line)
    f.close()
What I want: if a line starts with 2, change that leading 2 into a 7. The problem is that the same digit can appear several times in a line, so if I replace it everywhere and save, everything changes.
Thanks
The proper solution is (pseudo code):
open sourcefile for reading as input
open temporaryfile for writing as output
for each line in input:
    fix the line
    write it to output
close input
close output
replace sourcefile with temporaryfile
We use a temporary file and write as we go, to avoid potential memory errors.
I leave it up to you to translate this to Python (hint: that's quite straightforward).
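For completeness, a minimal sketch of that translation, reusing the path and the rule from the question (the .tmp suffix is illustrative):
import os

src = '/path/txt/23.txt'
tmp = src + '.tmp'
with open(src) as infile, open(tmp, 'w') as outfile:
    for line in infile:
        if line.startswith('2 '):
            # Replace only the leading field, not every '2' in the line.
            line = '7 ' + line[2:]
        outfile.write(line)
os.replace(tmp, src)  # atomically replace the original (Python 3.3+)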
This is one approach.
Ex:
for file_name in list_of_files:
    data = []
    with open(file_name) as infile:
        for line in infile:
            if line.startswith("2 "):  # check the line
                line = " ".join(['7'] + line.split()[1:])  # update the line
            data.append(line.rstrip("\n"))  # strip so newlines aren't doubled below
    with open(file_name, "w") as outfile:  # write back to the file
        for line in data:
            outfile.write(line + "\n")

Python glob gives no result

I have a directory that contains a lot of .csv files, and I am trying to write a script that runs on all the files in the directory while doing the following operation:
Remove the first and last lines from all the csv files
I am running the following code:
import glob

list_of_files = glob.glob('path/to/directory/*.csv')
for file_name in list_of_files:
    fi = open(file_name, 'r')
    fo = open(file_name.replace('csv', 'out'), 'w')  # make a new output file for each input file
    num_of_lines = file_name.read().count('\n')
    file_name.seek(0)
    i = 0
    for line in fi:
        if i != 1 and i != num_of_lines-1:
            fo.write(line)
    fi.close()
    fo.close()
And I run the script using python3 script.py. Though I don't get any error, I don't get any output file either.
There are multiple issues in your code. First of all, you count the number of lines on the filename instead of the file object. The second problem is that you initialize i = 0 and compare against it, but it never changes.
Personally I would just convert the file to a list of "lines", cut off the first and last and write all of them to the new file:
import glob

list_of_files = glob.glob('path/to/directory/*.csv')
for file_name in list_of_files:
    with open(file_name, 'r') as fi:
        with open(file_name.replace('csv', 'out'), 'w') as fo:
            for line in list(fi)[1:-1]:  # all lines except the first and last
                fo.write(line)
Using with open allows you to omit the close calls (they happen implicitly), even if an exception occurs.
In case that still gives no output, you could add a print statement that shows which file is being processed:
print(file_name)  # just inside the for-loop, before any `open` calls
Since you're using python-3.5 you could also use pathlib:
import pathlib

path = pathlib.Path('path/to/directory/')
# make sure it's a valid directory
assert path.is_dir(), "{} is not a valid directory".format(path.absolute())
for file_name in path.glob('*.csv'):
    with file_name.open('r') as fi:
        with pathlib.Path(str(file_name).replace('.csv', '.out')).open('w') as fo:
            for line in list(fi)[1:-1]:  # all lines except the first and last
                fo.write(line)
As Jon Clements pointed out, there is a better way than [1:-1] to exclude the first and last line: a generator function. That way you will definitely reduce the amount of memory used, and it might also improve the overall performance. For example you could use:
import pathlib

def ignore_first_and_last(it):
    """Yield every item except the first and the last."""
    it = iter(it)
    next(it)             # drop the first line
    lastline = next(it)
    for nxtline in it:
        yield lastline   # only yield a line once we know it isn't the last
        lastline = nxtline

path = pathlib.Path('path/to/directory/')
# make sure it's a valid directory
assert path.is_dir(), "{} is not a valid directory".format(path.absolute())
for file_name in path.glob('*.csv'):
    with file_name.open('r') as fi:
        with pathlib.Path(str(file_name).replace('.csv', '.out')).open('w') as fo:
            for line in ignore_first_and_last(fi):  # all lines except the first and last
                fo.write(line)
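One caveat not mentioned in the answer: if a file has fewer than two lines, the two next() calls raise StopIteration, which Python 3.7+ converts into a RuntimeError inside a generator (PEP 479). A guarded variant could look like this:
def ignore_first_and_last(it):
    it = iter(it)
    try:
        next(it)             # drop the first line
        lastline = next(it)
    except StopIteration:
        return               # 0 or 1 lines: nothing to yield
    for nxtline in it:
        yield lastline
        lastline = nxtline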

Appending output of a for loop for Python to a csv file

I have a folder with .txt files in it. My code finds the line count and character count for each of these files and saves the output for all files in a single csv file, LineCount.csv, in a different directory. For some reason the csv file repeats the character and line counts of the last file for every row, even though the print statement produces the correct results for each file.
import glob
import os
import csv

os.chdir('c:/Users/dasa17/Desktop/sample/Upload')
for file in glob.glob("*.txt"):
    chars = lines = 0
    with open(file, 'r') as f:
        for line in f:
            lines += 1
            chars += len(line)
    a = file
    b = lines
    c = chars
    print(a, b, c)

d = open('c:/Users/dasa17/Desktop/sample/Output/LineCount.csv', 'w')
writer = csv.writer(d, lineterminator='\n')
for a in os.listdir('c:/Users/dasa17/Desktop/sample/Upload'):
    writer.writerow((a, b, c))
d.close()
Please check your indentation.
You are looping through each file using for file in glob.glob("*.txt"):
This stores the last result in a, b, and c. It doesn't appear to write anything anywhere.
You then loop through each item using for a in os.listdir('c:/Users/dasa17/Desktop/sample/Upload'):, storing a from this loop (the filename) but keeping the last values of b and c from the initial loop.
I've not run it, but reordering as follows may solve the issue:
import glob
import os
import csv

os.chdir('c:/Users/dasa17/Desktop/sample/Upload')
d = open('c:/Users/dasa17/Desktop/sample/Output/LineCount.csv', 'w')
writer = csv.writer(d, lineterminator='\n')
for file in glob.glob("*.txt"):
    chars = lines = 0
    with open(file, 'r') as f:
        for line in f:
            lines += 1
            chars += len(line)
    a = file
    b = lines
    c = chars
    print(a, b, c)
    writer.writerow((a, b, c))
d.close()
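The same fix reads a little more safely with a with block around the output file, so it is closed even if an error occurs mid-loop (a sketch of the same logic, not part of the original answer):
import glob
import os
import csv

os.chdir('c:/Users/dasa17/Desktop/sample/Upload')
with open('c:/Users/dasa17/Desktop/sample/Output/LineCount.csv', 'w') as d:
    writer = csv.writer(d, lineterminator='\n')
    for name in glob.glob("*.txt"):
        chars = lines = 0
        with open(name, 'r') as f:
            for line in f:
                lines += 1
                chars += len(line)
        writer.writerow((name, lines, chars))  # one row per file, written inside the loop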

join separate files with python

I want to join 100 different files into one.
Example of a file's contents:
example1.txt has this format:
something
something
something
example2.txt has this format:
something
something
something
All 100 files have the same data format, and they share a common name, example1 ... example100: the "example" part is the same and only the number changes.
from itertools import chain

infiles = [open('{}_example.txt'.format(i+1), 'r') for i in xrange(113)]
with open('example.txt', 'w') as fout:
    for lines in chain(*infiles):
        fout.write(lines)
I used this, but the problem is that the first line of the next file gets joined to the last line of the previous file.
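That symptom appears when a file does not end with a newline. A minimal sketch of one way to guard against it, assuming the example1.txt ... example100.txt naming from the question:
with open('Join.txt', 'w') as fout:
    for i in range(1, 101):
        with open('example{}.txt'.format(i)) as fin:
            text = fin.read()
            fout.write(text)
            if text and not text.endswith('\n'):
                fout.write('\n')  # keep files from running together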
If you have 100 files, better to just use an array of files (Python 2 code; note that this pastes the files side by side, joining line i of every file with a separator of your choice):
from itertools import izip_longest

separator = ' '  # whatever should go between the columns
infiles = [open('example{}.txt'.format(i+1), 'r') for i in xrange(100)]
with open('Join.txt', 'w') as fout:
    for lines in izip_longest(*infiles, fillvalue=''):
        lines = [line.rstrip('\n') for line in lines]
        print >> fout, separator.join(lines)
I would open a new file as writable, Join.txt, and then loop through the files with range(1, 101):
join = open('Join.txt', 'w')
for num in range(1, 101):  # range(1, 100) would stop at example99.txt
    with open('example' + str(num) + '.txt', 'r') as f:  # str() is needed to build the name from an int
        for line in f:
            join.write(line)
join.close()
