I'm looking for a solution to merge multiple JSONL files from one folder using a Python script. Something like the script below, which works for JSON files.
import json
import glob

result = []
for f in glob.glob("*.json"):
    with open(f) as infile:
        result.append(json.load(infile))

with open("merged_file.json", "w") as outfile:
    json.dump(result, outfile)
Please find below a sample of my JSONL file (only one line):
{"date":"2021-01-02T08:40:11.378000000Z","partitionId":"0","sequenceNumber":"4636458","offset":"1327163410568","iotHubDate":"2021-01-02T08:40:11.258000000Z","iotDeviceId":"text","iotMsg":{"header":{"deviceTokenJwt":"text","msgType":"text","msgOffset":3848,"msgKey":"text","msgCreation":"2021-01-02T09:40:03.961+01:00","appName":"text","appVersion":"text","customerType":"text","customerGroup":"Customer"},"msgData":{"serialNumber":"text","machineComponentTypeId":"text","applicationVersion":"3.1.4","bootloaderVersion":"text","firstConnectionDate":"2018-02-20T10:34:47+01:00","lastConnectionDate":"2020-12-31T12:05:04.113+01:00","counters":[{"type":"DurationCounter","id":"text","value":"text"},{"type":"DurationCounter","id":"text","value":"text"},{"type":"DurationCounter","id":"text","value":"text"},{"type":"IntegerCounter","id":"text","value":2423},{"type":"IntegerCounter","id":"text","value":9914},{"type":"DurationCounter","id":"text","value":"text"},{"type":"IntegerCounter","id":"text","value":976},{"type":"DurationCounter","id":"text","value":"PT0S"},{"type":"IntegerCounter","id":"text","value":28},{"type":"DurationCounter","id":"text","value":"PT0S"},{"type":"DurationCounter","id":"text","value":"PT0S"},{"type":"DurationCounter","id":"text","value":"text"},{"type":"IntegerCounter","id":"text","value":1}],"defects":[{"description":"ProtocolDb.ProtocolIdNotFound","defectLevelId":"Warning","occurrence":3},{"description":"BridgeBus.CrcError","defectLevelId":"Warning","occurrence":1},{"description":"BridgeBus.Disconnected","defectLevelId":"Warning","occurrence":6}],"maintenanceEvents":[{"interventionId":"Other","comment":"text","appearance_display":0,"intervention_date":"2018-11-29T09:52:16.726+01:00","intervention_counterValue":"text","intervention_workerName":"text"},{"interventionId":"Other","comment":"text","appearance_display":0,"intervention_date":"2019-06-04T15:30:15.954+02:00","intervention_counterValue":"text","intervention_workerName":"text"}]}}}
Does anyone know how I can handle loading this?
Since each line in a JSONL file is a complete JSON object, you don't actually need to parse the JSONL files at all in order to merge them into another JSONL file; you can merge them by simply concatenating them. The one caveat is that the JSONL format does not mandate a newline character at the end of the file. You therefore have to check, after copying each file's lines, whether the file ended without a newline, and in that case explicitly write one so that the first record of the next file starts on its own line:
with open("merged_file.json", "w") as outfile:
for filename in glob.glob("*.json"):
with open(filename) as infile:
for line in infile:
outfile.write(line)
if not line.endswith('\n'):
outfile.write('\n')
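If you then need to load the merged records back into Python (as the question asks), a minimal sketch of reading the merged JSONL file line by line with json.loads (merged_file.json is the output name used above):

import json

records = []
with open("merged_file.json") as f:
    for line in f:
        line = line.strip()
        if line:  # skip any blank lines
            records.append(json.loads(line))

print(len(records), "records loaded")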
You can update a main dict with every JSON object you load, like this:
import json
import glob
import jsonlines

result = {}
for f in glob.glob("*.json"):
    with jsonlines.open(f) as reader:
        for obj in reader:
            result.update(obj)  # merge the dicts

with open("merged_file.json", "w") as outfile:
    json.dump(result, outfile)
But note that this will overwrite duplicate keys!
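To illustrate that caveat with two records (the field names are borrowed from the sample line above, the values are made up):

result = {}
result.update({"iotDeviceId": "device-A", "offset": "1"})
result.update({"iotDeviceId": "device-B", "offset": "2"})  # same keys, so the first record is lost
print(result)  # {'iotDeviceId': 'device-B', 'offset': '2'}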
Related
What sort of loop would I use? I have tried a for loop, but I cannot get my code to process each file and keep going through all of the files. I have hundreds of files that I need analyzed. I saw somewhere that someone used this code:
import glob
import gzip

ZIPFILES = 'name.gz'
filelist = glob.glob(ZIPFILES)
for gzfile in filelist:
    # print("#Starting " + gzfile)  # if you want to know which file is being processed
    with gzip.open(gzfile, 'r') as f:
        for line in f:
            print(line)
but how do I loop this to keep reading more files?
You can use glob to get a list of all the files in a directory, and then iterate over that list.
for filename in glob.glob('*.gz'):
    with gzip.open(filename, 'r') as f:
        for line in f:
            print(line)
I tried this way but it did not work:
with open("data.json", "a", encoding='utf-8') as f:
json.dump(data, f,ensure_ascii=False, indent=4 )
But this problem occurs: the resulting file is not valid JSON.
#2 I also want to convert the JSON to CSV. Please tell me if this is possible.
Both can be done with pandas
To store json data in a .json file, use pandas.DataFrame.to_json
To save json data in a .csv file, first use pandas.read_json to read the data into a dataframe and then use pandas.DataFrame.to_csv
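For example, a minimal sketch (data.json and data.csv are placeholder file names, and it assumes the JSON holds a flat list of records):

import pandas as pd

# Read a JSON file containing a list of flat records into a DataFrame
df = pd.read_json("data.json")

# Save it as CSV
df.to_csv("data.csv", index=False)

# Or write a DataFrame back out as JSON
df.to_json("data_out.json", orient="records")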
You can't append JSON objects to a file one after another and get valid JSON, because a JSON document must be a single top-level value (for example, one object or one array).
Instead of writing each object individually to the JSON file, you should collect all of the objects into a list, and write the list to the JSON file:
lst = []
for data in ...:
    lst.append(data)

with open("data.json", "w", encoding='utf-8') as f:
    # ^ notice "a" was changed to "w" here
    json.dump(lst, f, ensure_ascii=False, indent=4)
I am trying to combine multiple CSV files into one, and have tried a number of methods, but I am struggling.
I import the data from multiple CSV files, and when I compile them into one file, the first few rows come out nicely, but then a variable number of blank rows starts appearing between the data rows. The combined file also never finishes filling out: it just keeps having data added to it, which does not make sense to me because I am compiling a finite amount of data.
I have already tried writing close statements for the file, and I still get the same result: my designated combined CSV file never stops getting data, and the data is randomly spaced throughout the file. I just want a normally compiled CSV.
Is there an error in my code? Is there any explanation as to why my CSV file is behaving this way?
csv_file_list = glob.glob(Dir + '/*.csv')  # returns the file list
print(csv_file_list)

with open(Avg_Dir + '.csv', 'w') as f:
    wf = csv.writer(f, delimiter=',')
    print(f)

    for files in csv_file_list:
        rd = csv.reader(open(files, 'r'), delimiter=',')
        for row in rd:
            print(row)
            wf.writerow(row)
Your code works for me.
Alternatively, you can merge files as follows:
csv_file_list = glob.glob(Dir + '/*.csv')
with open(Avg_Dir + '.csv', 'w') as wf:
    for file in csv_file_list:
        with open(file) as rf:
            for line in rf:
                if line.strip():  # if line is not empty
                    if not line.endswith("\n"):
                        line += "\n"
                    wf.write(line)
Or, if the files are not too large, you can read each file at once. But in this case all empty lines and headers will be copied:
csv_file_list = glob.glob(Dir + '/*.csv')
with open(Avg_Dir + '.csv', 'w') as wf:
    for file in csv_file_list:
        with open(file) as rf:
            wf.write(rf.read().strip() + "\n")
Consider several adjustments:
Use a context manager (with) for both the read and the write file objects. This removes the need to call close(), which your code never does on the files it reads.
For the skipped-lines issue: pass either newline='' to open() or lineterminator="\n" to csv.writer(). See the SO answers on the former and the latter.
Use os.path.join() to properly concatenate folder and file paths. This method is OS-agnostic, so it handles the forward and back slashes used on Unix and Windows machines.
Adjusted script:
import os
import csv, glob

Dir = r"C:\Path\To\Source"
Avg_Dir = r"C:\Path\To\Destination\Output"

csv_file_list = glob.glob(os.path.join(Dir, '*.csv'))  # returns the file list
print(csv_file_list)

with open(os.path.join(Avg_Dir, 'Output.csv'), 'w', newline='') as f:
    wf = csv.writer(f, lineterminator='\n')

    for files in csv_file_list:
        with open(files, 'r') as r:
            next(r)  # SKIP HEADERS
            rr = csv.reader(r)
            for row in rr:
                wf.writerow(row)
I need to read a CSV file, translate it to uppercase, and store it in another CSV file with Python.
I have this code:
import csv

with open('data_csv.csv', 'r') as f:
    header = next(f).strip().split(',')
    reader = csv.DictReader((l.upper() for l in f), fieldnames=header)
    for line in reader:
        print(line)

with open('test.csv', 'r') as f:
    for line in f:
        print(line)
but I cannot get the right result.
You don't need to read the header separately: if you don't provide a fieldnames argument to DictReader(), the header row is read automatically. Next, don't just print your lines; by that point you have read the whole file and thrown the uppercased rows away without ever writing them anywhere.
Open both the input and output files in the same with statement; you can then write lines directly to the output. There is no need to use the csv module here, because you don't need to parse the rows only to join them back into lines again.
Just loop over the file, uppercase the lines, and write out the result:
with open('data_csv.csv', 'r') as input, open('test.csv', 'w') as output:
    output.writelines(line.upper() for line in input)
I am trying to append several CSV files into a single CSV file using Python, while adding the file name (or, even better, a sub-string of the file name) as a new variable. All files have headers. The following script does the trick of merging the files, but does not address the file-name-as-variable issue:
import glob

filenames = glob.glob("/filepath/*.csv")
outputfile = open("out.csv", "a")

for line in open(str(filenames[1])):
    outputfile.write(line)

for i in range(1, len(filenames)):
    f = open(str(filenames[i]))
    f.next()
    for line in f:
        outputfile.write(line)

outputfile.close()
I was wondering if there are any good suggestions. I have about 25k small CSV files (less than 100 KB each).
You can use Python's csv module to parse the CSV files for you, and to format the output. Example code (untested):
import csv

with open(output_filename, "w", newline="") as outfile:
    writer = None
    for input_filename in filenames:
        with open(input_filename, newline="") as infile:
            reader = csv.DictReader(infile)
            if writer is None:
                field_names = ["Filename"] + reader.fieldnames
                writer = csv.DictWriter(outfile, field_names)
                writer.writeheader()
            for row in reader:
                row["Filename"] = input_filename
                writer.writerow(row)
A few notes:
Always use with to open files. This makes sure they will get closed again when you are done with them. Your code doesn't correctly close the input files.
In Python 3, CSV files should be opened with newline='' so the csv module can handle line endings itself (in Python 2 they should be opened in binary mode).
Indices start at 0 in Python. Your code skips the first file, and includes the lines from the second file twice. If you just want to iterate over a list, you don't need to bother with indices in Python. Simply use for x in my_list instead.
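For instance, with a hypothetical list of file names:

filenames = ["a.csv", "b.csv", "c.csv"]  # example list

# Iterating by index, as in the question, is easy to get wrong:
for i in range(1, len(filenames)):
    print(filenames[i])  # silently skips filenames[0]

# Iterating directly over the list avoids the off-by-one entirely:
for filename in filenames:
    print(filename)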
Simple changes will achieve what you want:
For the header line, change
outputfile.write(line) -> outputfile.write(line.rstrip('\n') + ',file\n')
and for the data lines later,
outputfile.write(line.rstrip('\n') + ',' + filenames[i] + '\n')
(the rstrip keeps the new column from landing after the line's trailing newline).
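Putting those changes together, a minimal sketch of the whole loop (same file pattern as in the question, with the indexing fixed as the other answer suggests; the new column is simply labelled 'file' here):

import glob

filenames = glob.glob("/filepath/*.csv")

with open("out.csv", "w") as outputfile:
    # write the header once, taken from the first file, plus the new column
    with open(filenames[0]) as f:
        header = next(f)
        outputfile.write(header.rstrip('\n') + ',file\n')

    # append the data rows from every file, each tagged with its file name
    for name in filenames:
        with open(name) as f:
            next(f)  # skip this file's header
            for line in f:
                outputfile.write(line.rstrip('\n') + ',' + name + '\n')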