I'm looking for a way to merge multiple JSONL files from one folder using a Python script. Something like the script below, which works for JSON files.
import json
import glob

result = []
for f in glob.glob("*.json"):
    with jsonlines.open(f) as infile:
        result.append(json.load(infile))
with open("merged_file.json", "wb") as outfile:
    json.dump(result, outfile)
Please find below a sample of my JSONL file (only one line):
{"date":"2021-01-02T08:40:11.378000000Z","partitionId":"0","sequenceNumber":"4636458","offset":"1327163410568","iotHubDate":"2021-01-02T08:40:11.258000000Z","iotDeviceId":"text","iotMsg":{"header":{"deviceTokenJwt":"text","msgType":"text","msgOffset":3848,"msgKey":"text","msgCreation":"2021-01-02T09:40:03.961+01:00","appName":"text","appVersion":"text","customerType":"text","customerGroup":"Customer"},"msgData":{"serialNumber":"text","machineComponentTypeId":"text","applicationVersion":"3.1.4","bootloaderVersion":"text","firstConnectionDate":"2018-02-20T10:34:47+01:00","lastConnectionDate":"2020-12-31T12:05:04.113+01:00","counters":[{"type":"DurationCounter","id":"text","value":"text"},{"type":"DurationCounter","id":"text","value":"text"},{"type":"DurationCounter","id":"text","value":"text"},{"type":"IntegerCounter","id":"text","value":2423},{"type":"IntegerCounter","id":"text","value":9914},{"type":"DurationCounter","id":"text","value":"text"},{"type":"IntegerCounter","id":"text","value":976},{"type":"DurationCounter","id":"text","value":"PT0S"},{"type":"IntegerCounter","id":"text","value":28},{"type":"DurationCounter","id":"text","value":"PT0S"},{"type":"DurationCounter","id":"text","value":"PT0S"},{"type":"DurationCounter","id":"text","value":"text"},{"type":"IntegerCounter","id":"text","value":1}],"defects":[{"description":"ProtocolDb.ProtocolIdNotFound","defectLevelId":"Warning","occurrence":3},{"description":"BridgeBus.CrcError","defectLevelId":"Warning","occurrence":1},{"description":"BridgeBus.Disconnected","defectLevelId":"Warning","occurrence":6}],"maintenanceEvents":[{"interventionId":"Other","comment":"text","appearance_display":0,"intervention_date":"2018-11-29T09:52:16.726+01:00","intervention_counterValue":"text","intervention_workerName":"text"},{"interventionId":"Other","comment":"text","appearance_display":0,"intervention_date":"2019-06-04T15:30:15.954+02:00","intervention_counterValue":"text","intervention_workerName":"text"}]}}}
Does anyone know how I can handle loading this?
Since each line in a JSONL file is a complete JSON object, you don't actually need to parse the JSONL files at all in order to merge them into another JSONL file. Instead, merge them by simply concatenating them. However, the caveat here is that the JSONL format does not mandate a newline character at the end of file. You would therefore have to read each line into a buffer to test if a JSONL file ends without a newline character, in which case you would have to explicitly output a newline character in order to separate the first record of the next file:
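import glob

# List the inputs first and skip the output file, which also matches "*.json"
filenames = [f for f in glob.glob("*.json") if f != "merged_file.json"]
with open("merged_file.json", "w") as outfile:
    for filename in filenames:
        with open(filename) as infile:
            line = "\n"  # stays a newline if the file is empty
            for line in infile:
                outfile.write(line)
            # add the missing newline after the last record, if needed
            if not line.endswith('\n'):
                outfile.write('\n')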
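The same idea can be wrapped in a function that takes explicit input paths, which sidesteps the question of whether the output file matches the glob pattern. This is a minimal sketch; the name `merge_jsonl`, its parameters, and the UTF-8 encoding are assumptions, not part of the answer above.

```python
def merge_jsonl(input_paths, output_path):
    """Concatenate JSONL files, ensuring each file ends with a newline."""
    with open(output_path, "w", encoding="utf-8") as outfile:
        for path in input_paths:
            last = "\n"  # treated as newline-terminated if the file is empty
            with open(path, encoding="utf-8") as infile:
                for last in infile:
                    outfile.write(last)
            # JSONL does not require a trailing newline, so add one if missing
            if not last.endswith("\n"):
                outfile.write("\n")
```

No line is parsed, so a malformed record in an input file is copied through unchanged.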
You can update a main dict with every JSON object you load, like this:
import json
import glob

result = {}
for f in glob.glob("*.json"):
    with open(f) as infile:
        for line in infile:  # one JSON object per line
            if line.strip():
                result.update(json.loads(line))  # merge the dicts
with open("merged_file.json", "w") as outfile:
    json.dump(result, outfile)
But this will overwrite any keys that appear in more than one object!
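To see why that matters for JSONL records, which usually share the same keys, here is a small illustration with made-up records (the contents of `lines` are hypothetical). Collecting the parsed objects in a list keeps every record, whereas `dict.update` keeps only the last value seen for each key.

```python
import json

lines = ['{"id": 1, "value": "a"}', '{"id": 2, "value": "b"}']

merged = {}
records = []
for line in lines:
    obj = json.loads(line)
    merged.update(obj)    # later records overwrite earlier keys
    records.append(obj)   # every record is kept

print(merged)        # {'id': 2, 'value': 'b'}
print(len(records))  # 2
```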
I have a few really big .zip files. Each contains 1 huge .csv.
When I try to read it in, I either get a memory error or everything freezes/crashes.
I've tried this:
zf = zipfile.ZipFile('Eve.zip', 'r')
df1 = zf.read('Eve.csv')
but this gives a MemoryError.
I've done some research and tried this:
import zipfile
with zipfile.ZipFile('Events_WE20200308.zip', 'r') as z:
    with z.open('Events_WE20200308.csv') as f:
        for line in f:
            df=pd.DataFrame(f)
            print(line)
but I can't get it into a dataframe.
Any ideas please?
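One possible approach, sketched under the assumption that the rows can be processed one at a time rather than held in memory all at once: open the archive member as a stream and decode it lazily. The function name `iter_zip_csv_rows` and the UTF-8 default are assumptions. If a DataFrame is required, `pd.read_csv` also accepts an open file object together with a `chunksize` argument, which yields a sequence of smaller DataFrames instead of one huge one.

```python
import csv
import io
import zipfile

def iter_zip_csv_rows(zip_path, member, encoding="utf-8"):
    """Yield the rows of a CSV stored inside a zip file, one at a time."""
    with zipfile.ZipFile(zip_path) as z:
        with z.open(member) as raw:
            # Decode the binary stream lazily instead of reading it whole
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            yield from csv.reader(text)
```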
I am trying to read files with a .txt extension and append them into a single .txt file. I can read the data, but what is the best way to write it into a single .txt file?
sources = ["list of paths to files you want to write from"]
dest = open("file.txt", "a")
for src in sources:
    source = open(src, "r")
    data = source.readlines()
    for d in data:
        dest.write(d)
    source.close()
dest.close()
If your destination doesn't already exist, you can use "w" (write) mode instead of "a" (append) mode.
Try this.
x.txt:
Python is fun
y.txt:
Hello World. Welcome to my code.
z.txt:
I know that python is popular.
Main Python file:
list_ = ['x.txt', 'y.txt', 'z.txt']
new_list_ = []
for i in list_:
    with open(i, "r") as x:  # close each input file after reading it
        new_list_.append(x.read())
with open('all.txt', "w") as file:
    for line in new_list_:
        file.write(line + "\n")
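After you find the filenames, if you have a lot of files you should avoid repeated string concatenation when merging file contents: each concatenation in Python can copy the whole string, costing O(n), so concatenating in a loop can grow quadratically. Collecting the pieces in a list and joining them once avoids this. I think the code below demonstrates the full example.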
import glob

# get every txt file from the current directory
txt_files = glob.iglob('./*.txt')

def get_file_content(filename):
    content = ''
    with open(filename, 'r') as f:
        content = f.read()
    return content

contents = []
for txt_file in txt_files:
    contents.append(get_file_content(txt_file))

with open('complete_content.txt', 'w') as f:
    f.write(''.join(contents))
I am trying to append several CSV files into a single CSV file using Python, while adding the file name (or, even better, a sub-string of the file name) as a new variable. All files have headers. The following script merges the files, but does not add the file name as a variable:
import glob
filenames=glob.glob("/filepath/*.csv")
outputfile=open("out.csv","a")
for line in open(str(filenames[1])):
    outputfile.write(line)
for i in range(1,len(filenames)):
    f = open(str(filenames[i]))
    f.next()
    for line in f:
        outputfile.write(line)
outputfile.close()
I was wondering if there are any good suggestions. I have about 25k small CSV files (less than 100 KB each).
You can use Python's csv module to parse the CSV files for you, and to format the output. Example code (untested; output_filename and filenames are assumed to be defined already):
import csv

with open(output_filename, "w", newline="") as outfile:
    writer = None
    for input_filename in filenames:
        with open(input_filename, newline="") as infile:
            reader = csv.DictReader(infile)
            if writer is None:
                # build the header once, from the first file
                field_names = ["Filename"] + reader.fieldnames
                writer = csv.DictWriter(outfile, field_names)
                writer.writeheader()
            for row in reader:
                row["Filename"] = input_filename
                writer.writerow(row)
A few notes:
Always use with to open files. This makes sure they will get closed again when you are done with them. Your code doesn't correctly close the input files.
In Python 3, CSV files should be opened in text mode with newline="" (in Python 2, they should be opened in binary mode).
Indices start at 0 in Python. Your code skips the first file, and includes the lines from the second file twice. If you just want to iterate over a list, you don't need to bother with indices in Python. Simply use for x in my_list instead.
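For the "sub-string of the file name" part of the question, here is a sketch along the same lines, written for Python 3. The function name `merge_csv_with_source`, the `Source` column name, and the choice of the base name without directory or extension as the sub-string are all assumptions for illustration. It also assumes every input file has the same header.

```python
import csv
import os

def merge_csv_with_source(input_paths, output_path, column="Source"):
    """Merge CSV files with identical headers, adding a column that
    holds each row's source file name without directory or extension."""
    writer = None
    with open(output_path, "w", newline="") as outfile:
        for path in input_paths:
            label = os.path.splitext(os.path.basename(path))[0]
            with open(path, newline="") as infile:
                reader = csv.DictReader(infile)
                if reader.fieldnames is None:
                    continue  # empty file: nothing to copy
                if writer is None:
                    # write the header once, taken from the first non-empty file
                    writer = csv.DictWriter(outfile, [column] + list(reader.fieldnames))
                    writer.writeheader()
                for row in reader:
                    row[column] = label
                    writer.writerow(row)
```

Any other rule for deriving the label, such as a slice of the base name, can replace the `os.path.splitext` line.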
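Simple changes will achieve what you want. Each line already ends with a newline, so strip it before appending the new column.
For the header line
outputfile.write(line) -> outputfile.write(line.rstrip('\n') + ',file\n')
and later
outputfile.write(line.rstrip('\n') + ',' + filenames[i] + '\n')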
I have up to 8 separate Python processes creating temp files in a shared folder. Then I'd like the controlling process to append all the temp files in a certain order into one big file. What's the quickest way of doing this at an OS-agnostic shell level?
Just using simple file IO:
# tempfiles is a list of file handles to your temp files. Order them however you like
f = open("bigfile.txt", "w")
for tempfile in tempfiles:
    f.write(tempfile.read())
That's about as OS agnostic as it gets. It's also fairly simple, and the performance ought to be about as good as using anything else.
Not aware of any shell-level commands for appending one file to another. But appending at 'python level' is sufficiently easy that I am guessing python developers did not think it necessary to add it to the library.
The solution depends on the size and structure of the temp files you are appending. If they are all small enough that you don't mind reading each of them into memory, then the answer from Rafe Kettler (copied from his answer and repeated below) does the job with the least amount of code.
# tempfiles is an ordered list of temp files (open for reading)
f = open("bigfile.txt", "w")
for tempfile in tempfiles:
    f.write(tempfile.read())
If reading files fully into memory is not possible or not appropriate, you will want to loop through each file and read it piece-wise. If your temp files contain newline-terminated lines which can be read individually into memory, you might do something like this:
# tempfiles is an ordered list of temp files (open for reading)
f = open("bigfile.txt", "w")
for tempfile in tempfiles:
    for line in tempfile:
        f.write(line)
Alternatively - something which will always work - you may choose a buffer size and just read the file piece-wise, e.g.
# tempfiles is an ordered list of temp files (open for reading)
f = open("bigfile.txt", "w")
for tempfile in tempfiles:
    while True:
        data = tempfile.read(65536)
        if data:
            f.write(data)
        else:
            break
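The standard library's `shutil.copyfileobj` performs this kind of chunked copy loop itself, so the buffered variant can be sketched more compactly. The function name `concatenate`, the use of paths rather than open handles, and the binary mode are assumptions here.

```python
import shutil

def concatenate(input_paths, output_path, block_size=65536):
    """Append each input file to output_path in fixed-size chunks."""
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                # copies block_size bytes at a time until src is exhausted
                shutil.copyfileobj(src, out, block_size)
```

Binary mode copies the bytes unchanged, so no assumption about the files' text encoding is needed.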
The input/output tutorial has a lot of good info.
Rafe's answer was lacking proper open/close statements, e.g.
# tempfiles is a list of paths to your temp files. Order them however you like
with open("bigfile.txt", "w") as fo:
    for tempfile in tempfiles:
        with open(tempfile, 'r') as fi:
            fo.write(fi.read())
However, be forewarned that if you want to sort the contents of the bigfile, this method does not catch instances where the last line in one or more of your temp files has a different EOL format, which will cause some strange sort results. In this case, you will want to strip the tempfile lines as you read them, and then write consistent EOL lines to the bigfile (i.e. involving an extra line of code).
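A sketch of that normalization, assuming text files and `\n` as the target line ending (the function name `concat_normalized` is hypothetical). Opening the inputs with `newline=""` disables newline translation, so the original endings can be stripped explicitly.

```python
def concat_normalized(input_paths, output_path):
    """Concatenate text files, rewriting every line ending as '\\n'."""
    with open(output_path, "w", newline="\n") as out:
        for path in input_paths:
            with open(path, "r", newline="") as src:
                for line in src:
                    # drop whatever EOL the line had, then write a consistent one
                    out.write(line.rstrip("\r\n") + "\n")
```

A final line that lacks any line ending also gets one, so records from adjacent files never run together.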
I feel a bit stupid to add another answer after 8 years and so many answers, but I arrived here by the "append to file" title, and didn't see the right solution for appending to an existing binary file with buffered read/write.
So here is the basic way to do that:
def append_file_to_file(_from, _to):
    block_size = 1024*1024
    with open(_to, "ab") as outfile, open(_from, "rb") as infile:
        while True:
            input_block = infile.read(block_size)
            if not input_block:
                break
            outfile.write(input_block)
Given this building block, you can use:
for filename in ['a.bin','b.bin','c.bin']:
    append_file_to_file(filename, 'outfile.bin')
import os

output_name = "temp.txt"
with open(output_name, "a") as f2:
    for name in os.listdir("./"):
        # skip sub-directories and the output file itself
        if name == output_name or not os.path.isfile(name):
            continue
        with open(name) as f:
            for line in f:
                f2.write(line)
We can use the above code to read the contents of every file in the current directory and store them in temp.txt.
Use fileinput:
import fileinput

with open("bigfile.txt", "w") as big_file:
    with fileinput.input(files=tempfiles) as inputs:
        for line in inputs:
            big_file.write(line)
This is more memory efficient than @RafeKettler's answer, as it doesn't need to read a whole file into memory before writing to big_file.
Try this. It's very fast (much faster than line-by-line, and shouldn't cause VM thrashing for large files), and should run on just about anything, including CPython 2.x, CPython 3.x, PyPy, PyPy3 and Jython. It should also be highly OS-agnostic, and it makes no assumptions about file encodings.
#!/usr/local/cpython-3.4/bin/python3

'''Cat 3 files to one: example code'''

import os

def main():
    '''Main function'''
    input_filenames = ['a', 'b', 'c']
    block_size = 1024 * 1024
    # O_BINARY only exists on some platforms (e.g. Windows)
    if hasattr(os, 'O_BINARY'):
        o_binary = getattr(os, 'O_BINARY')
    else:
        o_binary = 0
    # O_CREAT | O_TRUNC: create the output file if missing, empty it otherwise
    output_file = os.open('output-file', os.O_WRONLY | os.O_CREAT | os.O_TRUNC | o_binary)
    for input_filename in input_filenames:
        input_file = os.open(input_filename, os.O_RDONLY | o_binary)
        while True:
            input_block = os.read(input_file, block_size)
            if not input_block:
                break
            os.write(output_file, input_block)
        os.close(input_file)
    os.close(output_file)

main()
There is one (nontrivial) optimization I've left out: It's better to not assume anything about a good blocksize, instead using a bunch of random ones, and slowly backing off the randomization to focus on the good ones (sometimes called "simulated annealing"). But that's a lot more complexity for little actual performance benefit.
You could also make the os.write keep track of its return value and restart partial writes, but that's only really necessary if you're expecting to receive (nonterminal) *ix signals.
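A sketch of that partial-write handling: `os.write` returns the number of bytes actually written, so a helper can loop until the whole block has gone out. The name `write_all` is hypothetical.

```python
import os

def write_all(fd, data):
    """Write all of data to file descriptor fd, retrying after partial writes."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]  # keep only the bytes not yet written
```

With this helper, a call such as `write_all(output_file, input_block)` would take the place of the bare `os.write` call.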
In this code, you can indicate the path and name of the input/output files, and it will create the final big file in that path:
import os

dir_name = "Your_Desired_Folder/Goes_Here" #path
input_files_names = ["File1.txt", "File2.txt", "File3.txt"] #input files
file_name_out = "Big_File.txt" #choose a name for the output file
file_output = os.path.join(dir_name, file_name_out)
fout = open(file_output, "w")
for tempfile in input_files_names:
    inputfile = os.path.join(dir_name, tempfile)
    fin = open(inputfile, 'r')
    for line in fin:
        fout.write(line)
    fin.close()
fout.close()
A simple and efficient way to copy data from multiple files into one big file. Before that, you need to rename your files to integers, e.g. 1, 2, 3, 4, etc. Code:
# Rename files first
import os

path = 'directory_name'
files = os.listdir(path)
i = 1
for file in files:
    os.rename(os.path.join(path, file), os.path.join(path, str(i)+'.txt'))
    i = i+1

# Copy data from the numbered files, in order, into one output file
with open("output_filename", "a") as fout:
    for i in range(1, len(files) + 1):
        # "%s.txt" % i is the renamed input file; .txt is the file extension
        with open(os.path.join(path, "%s.txt" % i), 'r') as f:
            for line in f:
                fout.write(line)
There's also the fileinput module in the Python standard library, which is well suited to this sort of situation.
I was solving a similar problem: combining multiple files in a folder into one big file in the same folder, ordered by last-modified time.
Hints are in the comments in the code block.
from glob import glob
import os

# Folder is where files are stored
# This is also where the big file will be stored
folder = r".\test_folder"
big_filename = r"connected.txt"

# Get all files except the big file and sort by last modified
all_files = glob(folder + "/*")
all_files = [fi for fi in all_files if big_filename not in fi]
all_files.sort(key=os.path.getmtime)

# Get content of each file and append it to a list
output_big_file = []
for one_file in all_files:
    with open(one_file, "r", encoding="utf-8") as f:
        output_big_file.append(f.read())

# Save list as a file
save_path = os.path.join(folder, big_filename)
with open(save_path, "w", encoding="utf-8") as f:
    f.write("\n".join(output_big_file))
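Just change the target directory:
import os

d = "./output_dir"
output_path = os.path.join(d, "output.csv")
with open(output_path, "a") as f2:
    for name in os.listdir(d):
        path = os.path.join(d, name)
        # skip the output file itself and anything that is not a regular file
        if path == output_path or not os.path.isfile(path):
            continue
        with open(path) as f:
            for line in f:
                f2.write(line)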