Python: append multiple files in a given order to one big file

I have up to 8 separate Python processes creating temp files in a shared folder. Then I'd like the controlling process to append all the temp files, in a certain order, into one big file. What's the quickest way of doing this at an OS-agnostic shell level?

Just using simple file IO:
# tempfiles is a list of file handles to your temp files. Order them however you like
f = open("bigfile.txt", "w")
for tempfile in tempfiles:
    f.write(tempfile.read())
That's about as OS agnostic as it gets. It's also fairly simple, and the performance ought to be about as good as using anything else.

I'm not aware of any shell-level command for appending one file to another. But appending at the Python level is easy enough that I'm guessing the Python developers didn't think it necessary to add it to the library.
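That said, shutil.copyfileobj in the standard library comes fairly close: it copies one open file object to another in buffered chunks. A minimal sketch, using placeholder file names:
import shutil

# Append the contents of source.txt onto the end of bigfile.txt in buffered chunks.
# "source.txt" and "bigfile.txt" are placeholder names.
with open("bigfile.txt", "ab") as dst, open("source.txt", "rb") as src:
    shutil.copyfileobj(src, dst)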
The solution depends on the size and structure of the temp files you are appending. If they are all small enough that you don't mind reading each of them into memory, then the answer from Rafe Kettler (repeated below) does the job with the least amount of code.
# tempfiles is an ordered list of temp files (open for reading)
f = open("bigfile.txt", "w")
for tempfile in tempfiles:
    f.write(tempfile.read())
If reading the files fully into memory is not possible or not an appropriate solution, you will want to loop through each file and read it piece-wise. If your temp files contain newline-terminated lines which can be read individually into memory, you might do something like this:
# tempfiles is an ordered list of temp files (open for reading)
f = open("bigfile.txt", "w")
for tempfile in tempfiles:
    for line in tempfile:
        f.write(line)
Alternatively - something which will always work - you may choose a buffer size and just read the file piece-wise, e.g.
# tempfiles is an ordered list of temp files (open for reading)
f = open("bigfile.txt", "w")
for tempfile in tempfiles:
    while True:
        data = tempfile.read(65536)
        if data:
            f.write(data)
        else:
            break
The input/output tutorial has a lot of good info.
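The same chunked loop can also be written with the two-argument form of iter(), which keeps calling read() until it returns the empty-string sentinel. A minimal sketch of that variant, assuming text-mode file objects as above:
from functools import partial

# tempfiles is an ordered list of temp files (open for reading)
f = open("bigfile.txt", "w")
for tempfile in tempfiles:
    for chunk in iter(partial(tempfile.read, 65536), ''):
        f.write(chunk)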

Rafe's answer was lacking proper open/close statements, e.g.
# tempfiles is a list of file names for your temp files. Order them however you like
with open("bigfile.txt", "w") as fo:
    for tempfile in tempfiles:
        with open(tempfile, 'r') as fi:
            fo.write(fi.read())
However, be forewarned that if you want to sort the contents of the bigfile, this method does not catch instances where the last line in one or more of your temp files has a different EOL format, which will cause some strange sort results. In this case, you will want to strip the tempfile lines as you read them, and then write consistent EOL lines to the bigfile (i.e. involving an extra line of code).
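A minimal sketch of that extra step, assuming tempfiles is an ordered list of temp file names and every output line should end in a single '\n':
with open("bigfile.txt", "w") as fo:
    for tempfile in tempfiles:
        with open(tempfile, 'r') as fi:
            for line in fi:
                fo.write(line.rstrip('\r\n') + '\n')  # strip whatever EOL came in, write a consistent one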

I feel a bit silly adding another answer after 8 years and so many others, but I arrived here via the "append to file" title and didn't see the right solution for appending to an existing binary file with buffered read/write.
So here is the basic way to do that:
def append_file_to_file(_from, _to):
    block_size = 1024 * 1024
    with open(_to, "ab") as outfile, open(_from, "rb") as infile:
        while True:
            input_block = infile.read(block_size)
            if not input_block:
                break
            outfile.write(input_block)
Given this building block, you can use:
for filename in ['a.bin', 'b.bin', 'c.bin']:
    append_file_to_file(filename, 'outfile.bin')

import os

filenames = os.listdir("./")
with open("temp.txt", "a") as f2:
    for name in filenames:
        if name == "temp.txt":  # don't append the output file to itself
            continue
        with open(name) as f:
            for line in f:
                f2.write(line)
We can use the above code to read the contents of all the files present in the current directory and store them in temp.txt.

Use fileinput:
import fileinput

with open("bigfile.txt", "w") as big_file:
    with fileinput.input(files=tempfiles) as inputs:
        for line in inputs:
            big_file.write(line)
This is more memory efficient than Rafe Kettler's answer as it doesn't need to read the whole file into memory before writing to big_file.

Try this. It's very fast (much faster than line-by-line, and shouldn't cause VM thrashing for large files), and should run on just about anything, including CPython 2.x, CPython 3.x, PyPy, PyPy3 and Jython. It should also be highly OS-agnostic, and it makes no assumptions about file encodings.
#!/usr/local/cpython-3.4/bin/python3
'''Cat 3 files to one: example code'''
import os

def main():
    '''Main function'''
    input_filenames = ['a', 'b', 'c']
    block_size = 1024 * 1024
    if hasattr(os, 'O_BINARY'):
        o_binary = getattr(os, 'O_BINARY')
    else:
        o_binary = 0
    # O_CREAT added so the output file is created if it does not already exist
    output_file = os.open('output-file', os.O_WRONLY | os.O_CREAT | o_binary)
    for input_filename in input_filenames:
        input_file = os.open(input_filename, os.O_RDONLY | o_binary)
        while True:
            input_block = os.read(input_file, block_size)
            if not input_block:
                break
            os.write(output_file, input_block)
        os.close(input_file)
    os.close(output_file)

main()
There is one (nontrivial) optimization I've left out: It's better to not assume anything about a good blocksize, instead using a bunch of random ones, and slowly backing off the randomization to focus on the good ones (sometimes called "simulated annealing"). But that's a lot more complexity for little actual performance benefit.
You could also make the os.write keep track of its return value and restart partial writes, but that's only really necessary if you're expecting to receive (nonterminal) *ix signals.
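A sketch of what restarting partial writes could look like; write_all is a hypothetical helper, not part of the answer above, and it simply keeps calling os.write() on the unwritten tail until everything has gone out:
import os

def write_all(fd, data):
    '''Keep writing until every byte of data has been written to fd.'''
    view = memoryview(data)
    while view:
        written = os.write(fd, view)  # may write fewer bytes than requested
        view = view[written:]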

In this code, you can indicate the path and name of the input/output files, and it will create the final big file in that path:
import os

dir_name = "Your_Desired_Folder/Goes_Here"  # path
input_files_names = ["File1.txt", "File2.txt", "File3.txt"]  # input files
file_name_out = "Big_File.txt"  # choose a name for the output file
file_output = os.path.join(dir_name, file_name_out)
fout = open(file_output, "w")
for tempfile in input_files_names:
    inputfile = os.path.join(dir_name, tempfile)
    fin = open(inputfile, 'r')
    for line in fin:
        fout.write(line)
    fin.close()
fout.close()

A simple and efficient way to copy data from multiple files into one big file. Before that, you need to rename your files to integers, e.g. 1, 2, 3, 4, ... etc. Code:
# Rename files first
import os

path = 'directory_name'
files = os.listdir(path)
i = 1
for file in files:
    os.rename(os.path.join(path, file), os.path.join(path, str(i) + '.txt'))
    i = i + 1
# Code for copying data from the numbered files into one output file
import os

path = 'directory_name'
with open("output_filename", "a") as fout:
    for i in range(1, 50):
        # %s is your filename; .txt is the file extension
        filename = os.path.join(path, "%s.txt" % i)
        if not os.path.exists(filename):
            continue
        with open(filename, 'r') as f:
            for line in f:
                fout.write(line)

There's also the fileinput module in Python 3, which is perfect for this sort of situation.

I was solving a similar problem: combining multiple files in a folder into one big file, in the same folder, ordered by modification time.
Hints are in the comments in the code block.
from glob import glob
import os

# Folder is where files are stored
# This is also where the big file will be stored
folder = r".\test_folder"
big_filename = r"connected.txt"

# Get all files except the big file and sort by last modified
all_files = glob(folder + "/*")
all_files = [fi for fi in all_files if big_filename not in fi]
all_files.sort(key=os.path.getmtime)

# Get the content of each file and append it to a list
output_big_file = []
for one_file in all_files:
    with open(one_file, "r", encoding="utf-8") as f:
        output_big_file.append(f.read())

# Save the list as a file
save_path = os.path.join(folder, big_filename)
with open(save_path, "w", encoding="utf-8") as f:
    f.write("\n".join(output_big_file))

Just change the target dir.
import os

d = "./output_dir"
filenames = os.listdir(d)
with open(d + '/' + "output.csv", "a") as f2:
    for i in filenames:
        if i == "output.csv":  # don't append the output file to itself
            continue
        with open(d + '/' + i) as f:
            for line in f:
                f2.write(line)

Related

How to read multiple .txt files in a folder and write into a single file using python?

I am trying to read files with the .txt extension and want to append them into a single .txt file. I can read the data, but what is the best way to write it into a single .txt file?
sources = ["list of paths to files you want to write from"]
dest = open("file.txt", "a")
for src in sources:
    source = open(src, "r")
    data = source.readlines()
    for d in data:
        dest.write(d)
    source.close()
dest.close()
If your destination doesn't already exist, you can use "w" (write) mode instead of "a" (append) mode.
Try this.
x.txt:
Python is fun
y.txt:
Hello World. Welcome to my code.
z.txt:
I know that python is popular.
Main Python file:
list_ = ['x.txt', 'y.txt', 'z.txt']
new_list_ = []
for i in list_:
    x = open(i, "r")
    re = x.read()
    new_list_.append(re)
    x.close()
with open('all.txt', "w") as file:
    for line in new_list_:
        file.write(line + "\n")
After you find the filenames, if you have a lot of files you should avoid merging the file contents by repeated string concatenation, because in Python each concatenation costs O(n), which adds up to quadratic time overall. I think the code below demonstrates the full example.
import glob

# get every txt file from the current directory
txt_files = glob.iglob('./*.txt')

def get_file_content(filename):
    content = ''
    with open(filename, 'r') as f:
        content = f.read()
    return content

contents = []
for txt_file in txt_files:
    contents.append(get_file_content(txt_file))

with open('complete_content.txt', 'w') as f:
    f.write(''.join(contents))

Combining multiple csv files into one csv file

I am trying to combine multiple csv files into one, and have tried a number of methods but I am struggling.
I import the data from multiple csv files, and when I compile them together into one csv file, the first few rows get filled out nicely, but then it starts randomly inserting a variable number of blank rows between the data rows, and it never finishes filling out the combined csv file; it just seems to keep having information added to it, which does not make sense to me because I am trying to compile a finite amount of data.
I have already tried writing close statements for the file, and I still get the same result: my designated combined csv file never stops getting data, and the data is randomly spaced throughout the file. I just want a normally compiled csv.
Is there an error in my code? Is there any explanation as to why my csv file is behaving this way?
csv_file_list = glob.glob(Dir + '/*.csv')  # returns the file list
print (csv_file_list)
with open(Avg_Dir + '.csv', 'w') as f:
    wf = csv.writer(f, delimiter=',')
    print (f)
    for files in csv_file_list:
        rd = csv.reader(open(files, 'r'), delimiter=',')
        for row in rd:
            print (row)
            wf.writerow(row)
Your code works for me.
Alternatively, you can merge files as follows:
csv_file_list = glob.glob(Dir + '/*.csv')
with open(Avg_Dir + '.csv', 'w') as wf:
    for file in csv_file_list:
        with open(file) as rf:
            for line in rf:
                if line.strip():  # if line is not empty
                    if not line.endswith("\n"):
                        line += "\n"
                    wf.write(line)
Or, if the files are not too large, you can read each file at once. But in this case all empty lines and headers will be copied:
csv_file_list = glob.glob(Dir + '/*.csv')
with open(Avg_Dir + '.csv', 'w') as wf:
    for file in csv_file_list:
        with open(file) as rf:
            wf.write(rf.read().strip() + "\n")
Consider several adjustments:
• Use the context manager, with, for both the read and the write process. This avoids the need to close() file objects, which you currently do not do on the read objects.
• For the skipped-lines issue: use either the newline='' argument in open() or the lineterminator="\n" argument in csv.writer(). See SO answers for the former and the latter.
• Use os.path.join() to properly concatenate folder and file paths. This method is OS-agnostic, so it accounts for Windows and Unix machines regardless of forward- or backslash path separators.
Adjusted script:
import os
import csv, glob

Dir = r"C:\Path\To\Source"
Avg_Dir = r"C:\Path\To\Destination\Output"

csv_file_list = glob.glob(os.path.join(Dir, '*.csv'))  # returns the file list
print (csv_file_list)

with open(os.path.join(Avg_Dir, 'Output.csv'), 'w', newline='') as f:
    wf = csv.writer(f, lineterminator='\n')
    for files in csv_file_list:
        with open(files, 'r') as r:
            next(r)  # SKIP HEADERS
            rr = csv.reader(r)
            for row in rr:
                wf.writerow(row)

Modifying a file in-place inside nested for loops

I am iterating over directories and the files inside them, modifying each file in place. I would then like to read the newly modified file right after.
Here is my code with descriptive comments:
# go through each directory based on their ids
for id in id_list:
    id_dir = os.path.join(ouput_dir, id)
    os.chdir(id_dir)
    # go through all files (with a specific extension)
    for filename in glob('*' + ext):
        # modify the file by replacing all new-line characters with an empty space
        with fileinput.FileInput(filename, inplace=True) as f:
            for line in f:
                print(line.replace('\n', ' '), end='')
        # here I would like to read the NEW modified file
        with open(filename) as newf:
            content = newf.read()
As it stands, newf is not the new modified file but the original f. I think I understand why that is; however, I found it difficult to overcome the issue.
I can always do 2 separate iterations (go through each directory based on their ids, go through all files with a specific extension and modify them, and then repeat the iteration to read each one), but I was hoping there is a more efficient way around it. Perhaps it would be possible to restart the second for loop after the modification has taken place and then have the read take place (to at least avoid repeating the outer for loop).
Any ideas/designs for achieving the above in a clean and efficient way?
For me it works with this code:
#!/usr/bin/env python3
import os
from glob import glob
import fileinput

id_list = ['1']
ouput_dir = '.'
ext = '.txt'

# go through each directory based on their ids
for id in id_list:
    id_dir = os.path.join(ouput_dir, id)
    os.chdir(id_dir)
    # go through all files (with a specific extension)
    for filename in glob('*' + ext):
        # modify the file by replacing all new-line characters with an empty space
        for line in fileinput.FileInput(filename, inplace=True):
            print(line.replace('\n', ' '), end="")
        # here I would like to read the NEW modified file
        with open(filename) as newf:
            content = newf.read()
        print(content)
notice how I iterate over the lines!
I am not saying that the way you are going about doing this is incorrect but I feel that you are overcomplicating it. Here is my super simple solution.
from glob import glob

for filename in glob('*' + ext):
    # instead of trying to modify in place, we read the data in and strip the raw values
    f_in = (x.rstrip() for x in open(filename, 'rb').readlines())
    with open(filename, 'wb') as f_out:  # we then write the data stream back out
        # extra modification to the data can go here; I just remove the \r and \n and write back out
        for i in f_in:
            f_out.write(i)
    # now there is no need to read the data back in because we already have a static reference to it

Python fastest way to read a large number of small files into memory?

I'm trying to read a few thousand html files stored on disk.
Is there any way to do better than:
for files in os.listdir('.'):
    if files.endswith('.html'):
        with open(files) as f:
            a = f.read()
            # do more stuff
For a similar problem I have used this simple piece of code:
import glob

for file in glob.iglob("*.html"):
    with open(file) as f:
        a = f.read()
iglob doesn't store all the files simultaneously, which is perfect for a huge directory.
Remember to close files after you have finished; the "with-open" construct makes sure of that for you.
Here's some code that's significantly faster than with open(...) as f: f.read()
import os

def read_file_bytes(path: str, size=-1) -> bytes:
    fd = os.open(path, os.O_RDONLY)
    try:
        if size == -1:
            size = os.fstat(fd).st_size
        return os.read(fd, size)
    finally:
        os.close(fd)
If you know the maximum size of the file, pass that in to the size argument so you can avoid the stat call.
Here's some all-around faster code:
for entry in os.scandir('.'):
    if entry.name.endswith('.html'):
        # on Windows entry.stat(follow_symlinks=False) is free, but on Unix it requires a syscall
        file_bytes = read_file_bytes(entry.path, entry.stat(follow_symlinks=False).st_size)
        a = file_bytes.decode()  # if a string is needed rather than bytes

Python copy and rename many small csv files based on selected characters within the files

I'm not a programmer; I'm a pilot who has done just a little bit of scripting in a past life, so I'm completely non-current at this. I have searched the forum and found somewhat similar problems that, with more expertise and time I might be able to adapt to my problem, but I hope I can get closer by asking my own question. I hope my problem is unique enough that those considering answering do not feel their time is wasted, considering my disadvantage. Anyway here is my problem:
Some of my crew members periodically have a need to rename a few hundred to more than 1,000 small csv files based on a specific convention applied to their contents. Not all of the files are used in a given project, but any subset of them could be used, so automation makes a lot of sense here. Currently this is done manually as needed. I can easily move all these files into a single directory for processing, since all their file names are unique as received.
Here are representative excerpts from two example csv files, preceded by their respective file names (As I receive them):
A_13LSAT_2014-04-23_1431.csv:
1,KDAL CURLO RW13L SAT 20140414_0644,SID,N/A,DDI
2,*,RW13L(AER),SAT
3,RW13L(AER),+325123.36,-0965121.20,RW31R(DER),+325031.35,-0965020.95
4,1,1.2,+325123.36,-0965121.20,0.0,+325031.35,-0965020.95,2.0
3,RW31R(DER),+325031.35,-0965020.95,GH13L,+324947.23,-0964929.84
4,1,2.4,+325031.35,-0965020.95,0.0,+324947.23,-0964929.84,2.0
5,TTT,0,0
5,CVE,0,0
A_RROSEE_2014-04-03_1419.csv:
1,KDFW SEEVR STAR RRONY SEEVR 20140403_1340,STAR,N/A,DDI
2,*,RRONY,SEEVR
3,RRONY,+333455.16,-0952530.56,ROWZE,+333233.02,-0954016.52
4,1,12.6,+333455.16,-0952530.56,0.0,+333233.02,-0954016.52,2.0
5,EIC,0,1
5,SLR,0,0
I know these files are not code, but I entered them indented in this post so they would display properly.
The files must be renamed due to the 8.3 limitation of the platform they are used on.
The convention is:
• On the first line, the first two characters of the second word of the second "cell" (which are the 6th and 7th characters of the second cell), and
• on line 2, the first three characters of the third cell, and
• the first three characters of the fourth cell.
The contents and format of the files must remain unaltered. In theory this convention yields unique names for every file so duplication of file names should not be a problem.
The files above would be copied and renamed respectively to:
CURW1SAT.csv
SERROSEE.csv
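(Working through the first example: the second cell of line 1 is "KDAL CURLO RW13L SAT 20140414_0644", whose second word "CURLO" gives "CU"; on line 2 the third cell "RW13L(AER)" gives "RW1" and the fourth cell "SAT" gives "SAT"; hence CURW1SAT.csv.)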
That's it. Just a script that will scan a directory full of these csv files and create renamed copies in the same directory according to the convention I just described, based on their contents. I'm attempting to use ActiveState Python 2.7.7.
Thanks in advance for any consideration.
It's not what you'd call pretty, but neither am I; and it works (and it's simple)
import os
import glob

fileset = set(glob.glob(os.path.basename(os.path.join(".", "*.csv"))))
for filename in fileset:
    with open(filename, "r") as f:
        csv_file = f.readlines()
    out = csv_file[0].split(",")[1].split(" ")[1][:2]
    out += csv_file[1].split(",")[2][:3]
    out += csv_file[1].split(",")[3][:3]
    os.rename(filename, out + ".csv")
Just drop this in the folder with all the csvs to be renamed and run it.
That is indeed not too complicated. Python has out of the box everything you need.
I don't think it's a good idea to rename the files; in case of error (e.g. a name collision) it would make the process dangerous. Copying to another folder is safer.
The code could look like that:
import csv
import os
import os.path
import sys
import shutil

def Process(input_directory, output_directory, filename):
    """This method reads the file named 'filename' in input_directory and copies
    it to output_directory, renaming it."""
    # Read the file and extract the first 2 lines.
    with open(os.path.join(input_directory, filename), 'r') as csv_file:
        reader = csv.reader(csv_file, delimiter=',')
        line1 = reader.next()
        line2 = reader.next()
    line1_second_cell = line1[1]
    # split() separates words by spaces into a list, [1] takes the second.
    second_word = line1_second_cell.split()[1]
    line2_third_cell = line2[2]
    line2_fourth_cell = line2[3]
    # [:2] takes the first two characters from a string.
    new_filename = second_word[:2] + line2_third_cell[:3] + line2_fourth_cell[:3]
    new_filename += '.csv'
    print 'copying', filename, 'to', new_filename
    shutil.copyfile(
        os.path.join(input_directory, filename),
        os.path.join(output_directory, new_filename))

# sys.argv is the list of arguments passed on the command line.
if len(sys.argv) == 3:
    input_directory = sys.argv[1]
    output_directory = sys.argv[2]
    # os.listdir gives all the entries in the directory (including subdirectories).
    for filename in os.listdir(input_directory):
        if filename.endswith(".csv"):
            Process(input_directory, output_directory, filename)
else:
    print "Usage:", sys.argv[0], "source_directory target_directory"
On Windows you can run it in a command line (cmd.exe):
C:\where_your_python_is\python.exe C:\where_your_script_is\renamer.py C:\input C:\output
On Linux it would be a little simpler, as the python binary is on the path:
python /where_your_script_is/renamer.py /input /output
Put this in a script, and when you run it, give it the directory name as an argument on the command line:
import csv
import sys
import os

def rename_csv_file(filename):
    global directory
    with open(filename, 'r') as csv_file:
        newfilename = str()
        rownum = 0
        filereader = csv.reader(csv_file, delimiter=',')
        for row in filereader:
            if rownum == 0:
                newfilename = row[1].split()[1][:2]
            elif rownum == 1:
                newfilename += row[2][:3]
                newfilename += row[3][:3]
                break
            rownum += 1
        newfilename += '.csv'
        newfullpath = os.path.join(directory, newfilename)
        os.rename(filename, newfullpath)

if len(sys.argv) < 2:
    print "Usage: {} directory_name".format(sys.argv[0])
    sys.exit()

directory = sys.argv[1]
csvfiles = [ os.path.join(directory, f) for f in os.listdir(directory) if (os.path.isfile(os.path.join(directory, f)) and f.endswith('.csv')) ]
for f in csvfiles:
    rename_csv_file(f)
This assumes that every csv in your directory needs to be renamed. The code could be more condensed, but I tried to spell it out a bit so you could see what was going on.
import os
import csv
import shutil

# change this to the directory where your csvs are stored
dirname = r'C:\yourdirectory'
os.chdir(dirname)

for item in os.listdir(dirname):  # look through directory contents
    if item.endswith('.csv'):
        f = open(item)
        r = csv.reader(f)
        line1 = r.next()  # get the first line of the csv
        line2 = r.next()  # get the second line of the csv
        f.close()
        name1 = line1[1].split()[1][:2]  # first part of the name: first two chars of the second word
        name2 = line2[2][:3]  # second part
        name3 = line2[3][:3]  # third part
        newname = name1 + name2 + name3 + '.csv'
        shutil.copy2(os.path.join(dirname, item), newname)  # copied csv with new name
