I have multiple files, each containing multiple highly nested JSON rows. The first two rows of one such file look like:
{
"u":"28",
"evv":{
"w":{
"1":400,
"2":{
"i":[{
"l":14,
"c":"7",
"p":"4"
}
]
}
}
}
}
{
"u":"29",
"evv":{
"w":{
"3":400,
"2":{
"i":[{
"c":14,
"y":"7",
"z":"4"
}
]
}
}
}
}
They are actually single-line rows; I pretty-printed them here for readability.
My question is the following:
Is there any simple way, one that doesn't require writing dozens or hundreds of lines of Python specific to my files, to convert all these files to one CSV/Excel file (or multiple, i.e. one per file)? One example would be an external library or script that handles this particular task, regardless of the names of the fields.
The trap is that some elements do not appear in every line. For example, the "i" key has 3 fields (l, c, p) in the first JSON and 3 different ones (c, y, z) in the second. Ideally, the CSV should contain one column per possible field (e.g. evv.w.2.i.l, evv.w.2.i.c, evv.w.2.i.p, evv.w.2.i.y, evv.w.2.i.z), at the risk of having (many) null values per CSV row.
A possible csv output for this example would have the following columns:
u, evv.w.1, evv.w.3, evv.w.2.i.l, evv.w.2.i.c, evv.w.2.i.p, evv.w.2.i.y, evv.w.2.i.z
Any idea/reference is welcome :)
Thanks
No, there is no general-purpose program that does precisely what you ask for.
You can, however, write a Python program that does it.
This program might do what you want. It does not have any code specific to your key names, but it is specific to your file format.
It can take several files on the command line.
Each file is presumed to have one JSON object per line.
It flattens each JSON object, joining nested key labels with ".":
import fileinput
import json
import csv

def flattify(d, key=()):
    # Recursively flatten nested dicts (and lists) into one dict
    # keyed by tuples of nested key names.
    if isinstance(d, list):
        # List elements share the same key path, so a later element
        # overwrites an earlier one if they repeat a field.
        result = {}
        for i in d:
            result.update(flattify(i, key))
        return result
    if isinstance(d, dict):
        result = {}
        for k, v in d.items():
            result.update(flattify(v, key + (k,)))
        return result
    return {key: d}

total = []
for line in fileinput.input():
    if line.strip():
        line = json.loads(line)
        line = flattify(line)
        line = {'.'.join(k): v for k, v in line.items()}
        total.append(line)

# Use the union of all keys as the CSV columns; DictWriter fills
# the fields a row lacks with empty cells.
keys = set()
for d in total:
    keys.update(d)

with open('result.csv', 'w', newline='') as output_file:
    writer = csv.DictWriter(output_file, sorted(keys))
    writer.writeheader()
    writer.writerows(total)
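A minimal invocation, assuming the script above is saved under a name of your choosing (flatten_json.py here is hypothetical):

python3 flatten_json.py file1.jsonl file2.jsonl

fileinput reads every file named on the command line in sequence (or stdin when none is given), so all rows from all files end up in a single result.csv whose columns are the union of every flattened key.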
Please check if this (Python 3) solution works for you.
import json
import csv

with open('test.json') as data_file:
    with open('output.csv', 'w', newline='') as fp:
        writer = csv.writer(fp, delimiter=',')
        for line in data_file:
            data = json.loads(line)
            # .get() yields None for missing keys, which the csv module
            # writes as an empty cell; "2" and "i" are assumed present.
            item = data['evv']['w']['2']['i'][0]
            writer.writerow([data['u'],
                             data['evv']['w'].get('1'),
                             data['evv']['w'].get('3'),
                             item.get('l'), item.get('c'), item.get('p'),
                             item.get('y'), item.get('z')])
test.json
{ "u": "28", "evv": { "w": { "1": 400, "2": { "i": [{ "l": 14, "c": "7", "p": "4" }] } } }}
{"u":"29","evv":{ "w":{ "3":400, "2":{ "i":[{ "c":14, "y":"7", "z":"4" } ] } } }}
output

$ python3 pyprog.py
$ more output.csv
28,400,,14,7,4,,
29,,400,,14,,7,4
I am using the csv module to convert JSON to CSV and store it in a file or print it to stdout.
import sys
import csv

def write_csv(data: list, header: list, path: str = None):
    # data is a list of dicts parsed from JSON
    output_file = open(path, 'w') if path else sys.stdout
    out = csv.writer(output_file)
    out.writerow(header)
    for row in data:
        out.writerow([row[attr] for attr in header])
    if path:
        output_file.close()
I want to store the converted csv to a variable instead of sending it to a file or stdout.
say I want to create a function like this:
def json_to_csv(data:list, header:list):
# convert json data into csv string
return string_csv
NOTE: the format of data is simple:
data is a list of dictionaries mapping strings to strings
[
{
"username":"srbcheema",
"name":"Sarbjit Singh"
},
{
"username":"testing",
"name":"Test, user"
}
]
I want csv output to look like:
username,name
srbcheema,Sarbjit Singh
testing,"Test, user"
Converting JSON to CSV is not a trivial operation. There is also no standardized way to translate between them...
For example
my_json = {
"one": 1,
"two": 2,
"three": {
"nested": "structure"
}
}
Could be represented in a number of ways...
These are all (to my knowledge) valid CSVs that contain all the information from the JSON structure.
data
'{"one": 1, "two": 2, "three": {"nested": "structure"}}'
one,two,three
1,2,'{"nested": "structure"}'
one,two,three__nested
1,2,structure
In essence, you will have to figure out the best translation between the two based on your knowledge of the data. There is no right answer on how to go about this.
I'm relatively new to Python, so there's probably a better way, but this works:
def get_safe_string(string):
    # Quote a field if it contains a comma (minimal CSV escaping;
    # assumes values are strings that contain no quote characters).
    return '"' + string + '"' if "," in string else string

def json_to_csv(data):
    csv_keys = data[0].keys()
    header = ",".join(csv_keys)
    res = [",".join(get_safe_string(row.get(k)) for k in csv_keys) for row in data]
    res.insert(0, header)
    return "\n".join(res)
My generated JSON output is not valid JSON when I check it with JSLint; I get an EOF error.
Here I am using if len(data) != 0: to avoid writing empty [] lists to the final output.json file (it works, but I don't know any other way to avoid it).
import json

with open('output.json', 'a') as jsonFile:
    print(data)
    if len(data) != 0:
        json.dump(data, jsonFile, indent=2)
My input data comes one piece at a time from another function, generated inside a for loop.
Sample "data" coming from that function:
print(data)
[{'product': 'food'}, {'price': '$100'}]
[{'product': 'clothing'}, {'price': '$40'}]
...
Can I append these data and make a JSON file under "Store"? What would be the proper practice? Please suggest.
Sample output from the output.json file:
[
{
"product": "food"
},
{
"price": "$100"
}
][
{
"product": "clothing"
},
{
"price": "$40"
}
]
Try the jsonlines package; you can install it with pip install jsonlines.
jsonlines does not write a comma at the end of each line, so you can read and write exactly the structure you have, without any additional merging or formatting.
import jsonlines

with jsonlines.open('output.jsonl') as reader:
    for obj in reader:
        ...  # do something with obj
Similarly, you can dump with the write method of this module:

with jsonlines.open('output.jsonl', mode='w') as writer:
    writer.write(...)
output.jsonl would look like this (jsonlines writes real JSON, so double quotes):

[{"product": "food"}, {"price": "$100"}]
[{"product": "clothing"}, {"price": "$40"}]
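A minimal round trip with the rows from the question, using write_all, the jsonlines helper that writes an iterable of objects:

import jsonlines

rows = [
    [{"product": "food"}, {"price": "$100"}],
    [{"product": "clothing"}, {"price": "$40"}],
]

# Write one JSON document per line.
with jsonlines.open('output.jsonl', mode='w') as writer:
    writer.write_all(rows)

# Read them back one object at a time.
with jsonlines.open('output.jsonl') as reader:
    for obj in reader:
        print(obj)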
Yes, you can always club them all together and link them to a key named Store, which would make sense as they are all products in the store.
But I think the format below would be much better, as each product in the store has a defined product name along with its price:
{
"Store":[
{
"product":"food",
"price":"$100"
},
{
"product":"clothing",
"price":"$40"
}
]
}
If you do it this way, you need not insert each and every key-value pair into the JSON; instead you can put each product's name and price into a single object and keep appending those objects to the Store list, as in the sketch below.
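A minimal sketch, assuming each incoming row is a two-element list like [{'product': ...}, {'price': ...}] as in the question; the pairs are merged into one object per product and the whole structure is dumped once, so the file contains a single valid JSON document:

import json

store = {"Store": []}

def add_row(row):
    # Merge [{'product': ...}, {'price': ...}] into a single object.
    merged = {}
    for part in row:
        merged.update(part)
    store["Store"].append(merged)

add_row([{'product': 'food'}, {'price': '$100'}])
add_row([{'product': 'clothing'}, {'price': '$40'}])

# Dump once, after all rows have been collected.
with open('output.json', 'w') as fp:
    json.dump(store, fp, indent=2)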
I'm trying to find all json objects in my jsonl file that contain the same identifier value.
So if my data look like:
{
"data": {
"value": 42,
"url": "url.com",
"details": {
"timestamp": "07:32:29",
"identifier": "123ABC"
}
},
"message": "string"
}
I want to find every object that has the same identifier value. The file is too large to load all at once, so instead I check line by line and store just the identifier values. This has the drawback of missing the first object that has a given identifier (i.e., if objects A, B, and C all share an identifier, I would only end up with B and C saved). To find the first occurrence of each identifier, I read through the file a second time to pick up only the first time each duplicate identifier is found. This is where I encounter some problems.
This part works as intended:
import json_lines

identifiers = set()
duplicates = []

with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in identifiers:
            duplicates.append(item)   # every occurrence after the first
        else:
            identifiers.add(ID)

# The identifier values that actually occur more than once.
dup_IDs = {dup["data"]["details"]["identifier"] for dup in duplicates}
But when I read through the file a second time:
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in dup_IDs:
            duplicates.append(item)
            dup_IDs.remove(ID)       # only the first occurrence is wanted
            if len(dup_IDs) == 0:
                break                # nothing left to look for
It runs for ~30 minutes and eventually crashes my computer. I'm assuming (hoping) this is because there's a problem with my code rather than my computer, because the code is easier to fix.
If the file is too large, I'd suggest loading the data into a SQL database and using SQL queries to filter what you need; a sketch of that idea follows.
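A minimal sqlite3 sketch of that suggestion (the items.db file and table name are made up for illustration): store each line's identifier together with its raw JSON, then let SQL group the duplicates.

import gzip
import json
import sqlite3

con = sqlite3.connect('items.db')
con.execute('CREATE TABLE IF NOT EXISTS items (id TEXT, raw TEXT)')

# Load identifier + raw line for every object in the file.
with gzip.open('file.jsonlines.gz', 'rt') as f:
    con.executemany(
        'INSERT INTO items VALUES (?, ?)',
        ((json.loads(line)["data"]["details"]["identifier"], line)
         for line in f if line.strip()))
con.commit()

# All objects whose identifier appears more than once.
rows = con.execute(
    'SELECT raw FROM items WHERE id IN '
    '(SELECT id FROM items GROUP BY id HAVING COUNT(*) > 1)').fetchall()
duplicates = [json.loads(r[0]) for r in rows]

This keeps all the heavy filtering in SQL rather than in Python.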
import json_lines

# First pass: remember, for each ID, the line number of its first
# occurrence; flip it from str to int once the ID repeats.
nb = {}
i = 0
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in nb:
            nb[ID] = int(nb[ID])   # mark the first occurrence as duplicated
        else:
            nb[ID] = str(i)        # first time we see this ID
        i += 1

# Line numbers of the first occurrence of every duplicated ID.
k = set(v for v in nb.values() if isinstance(v, int))
del nb

# Second pass: collect the items at those line numbers.
duplicates = []
i = 0
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        if i in k:
            duplicates.append(item)
        i += 1
print(duplicates)
I am trying to convert an XML document to JSON (a condensed version of the code is provided below).
The issue I am facing is with a tag which can have multiple values (example below). I cannot directly make it a dict, since the key (NAME) can have multiple values. The output generated by the code vs. the expected output is given below.
python script:
import json
mylist = ['"Event" : "BATCHS01-wbstp01"', '"Event" : "BATCHS01-wbstrt01"']
tmpdict = {}
tmpdict['Events'] = mylist
with open('test.json', 'w') as fp:
    json.dump(tmpdict, fp, indent=4, sort_keys=False)
Output Generated:
{
"Events": [
"\"Event\" : \"BATCHS01-wbstp01\"",
"\"Event\" : \"BATCHS01-wbstrt01\""
]
}
Expected Output:
{
"Events": [
{"Event" : "BATCHS01-wbstp01"},
{"Event" : "BATCHS01-wbstrt01"}
]
}
The issue is that your mylist is an array of strings rather than an array of dict objects.
You need to remove the outer quotes to make it:
mylist = [{"Event" : "BATCHS01-wbstp01"}, {"Event" : "BATCHS01-wbstrt01"}]
I don't see why you cannot produce this structure from XML. It's rather simple regardless of whether 'key (NAME) can have multiple values'.
You can salvage your data by first converting it to valid JSON piecewise and then dumping the JSON into a string or a file:
tmpdict = {"Events" : [json.loads('{' + item + '}') for item in mylist]}
json.dumps(tmpdict)
'{"Events": [{"Event": "BATCHS01-wbstp01"}, {"Event": "BATCHS01-wbstrt01"}]}'
You can first convert the XML pieces to dicts like:
tmpdict['Events'] = [json.loads('{%s}' % x) for x in mylist]
Test Code:
import json

mylist = ['"Event" : "BATCHS01-wbstp01"', '"Event" : "BATCHS01-wbstrt01"']
tmpdict = {}
tmpdict['Events'] = [json.loads('{%s}' % x) for x in mylist]

with open('test.json', 'w') as fp:
    json.dump(tmpdict, fp, indent=4, sort_keys=False)
Results:
{
"Events": [
{
"Event": "BATCHS01-wbstp01"
},
{
"Event": "BATCHS01-wbstrt01"
}
]
}
I'm trying to figure out the best way to go about this problem:
I'm reading text lines from a buffer; they eventually form a log that looks something like this:
Some_Information: here there's some information about date and hour
Additional information: log summary #1234:
details {
name: "John Doe"
address: "myAdress"
phone: 01234567
}
information {
age: 30
height: 1.70
weight: 70
}
I would like to get all the fields in this log into a dictionary, which I can later turn into a JSON file. The different sections in the log are not important, so for example if myDictionary is a dictionary variable in Python I would like to have:
> myDictionary['age']
will show me 30.
and the same for all other fields.
Speed is very important here; that's why I would like to go through every line only once while building the dictionary.
My approach would be, for each line that contains a ":" colon, to split the string and put the key and the value into the dictionary.
Is there a better way to do it?
Is there any Python module that would be sufficient?
If more information is needed please let me know.
Edit:
So I've tried something that seems to work best so far.
I am currently reading from a file to simulate reading from the buffer.
My code:
import json
import shlex

newDict = dict()
with open('log.txt') as f:
    for line in f:
        try:
            line = line.replace(" ", "")
            stringSplit = line.split(':')
            key = stringSplit[0]
            value = stringSplit[1]
            value = shlex.split(value)
            newDict[key] = value[0]
        except:
            continue

with open('result.json', 'w') as fp:
    json.dump(newDict, fp)
Resulting in the following .json:
{"name": "JohnDoe", "weight": "70", "Additionalinformation": "logsummary#1234",
"height": "1.70", "phone": "01234567", "address": "myAdress", "age": "30"}
You haven't described exactly what the desired output should be for the sample input, so it's not completely clear what you want done. So I guessed: the following only extracts data values from lines between one containing a '{' and the next line with a '}' in it, and ignores everything else.
It uses the re module to isolate the two parts of each item definition found on such a line, and then uses the ast module to convert the value portion into a valid Python literal (i.e. string, number, tuple, list, dict, bool, or None), falling back to the raw string where that fails.
import ast
import json
import re

pat = re.compile(r"""(?P<key>\w+)\s*:\s*(?P<value>.+)$""")

data_dict = {}
with open('log.txt') as f:   # the old 'rU' mode was removed in Python 3.11
    braces = 0
    for line in (line.strip() for line in f):
        # Track braces first, so a closing '}' actually ends the block.
        if '{' in line:
            braces += 1
        elif '}' in line:
            braces -= 1
        elif braces > 0:
            match = pat.search(line)
            if match:
                key = match.group('key')
                try:
                    # Interpret the value as a Python literal if possible...
                    value = ast.literal_eval(match.group('value'))
                except (ValueError, SyntaxError):
                    # ...otherwise keep the raw string (e.g. 01234567,
                    # whose leading zero is not a valid Python 3 int).
                    value = match.group('value')
                data_dict[key] = value

print(json.dumps(data_dict, indent=4))
Output from your example input:
{
    "name": "John Doe",
    "address": "myAdress",
    "phone": "01234567",
    "age": 30,
    "height": 1.7,
    "weight": 70
}