Parsing a file that looks like JSON to a JSON - python

I'm trying to figure out what the best way is to approach this problem:
I'm reading text lines from a certain buffer that eventually creates a certain log that looks something like this:
Some_Information: here there's some information about date and hour
Additional information: log summary #1234:
details {
name: "John Doe"
address: "myAdress"
phone: 01234567
}
information {
age: 30
height: 1.70
weight: 70
}
I would like to get all the fields in this log into a dictionary that I can later turn into a JSON file. The different sections in the log are not important, so for example, if myDictionary is a dictionary variable in Python, I would like to have:
> myDictionary['age']
will show me 30.
and the same for all other fields.
Speed is very important here, which is why I would like to go through every line only once and put it in a dictionary.
My approach would be: for each line that contains a colon (":"), split the string and store the key and the value in the dictionary.
Is there a better way to do it?
Is there any Python module that would be sufficient?
If more information is needed please let me know.
Edit:
So I've tried something that, to me, looks like it works best so far.
I am currently reading from a file to simulate reading from the buffer.
My code:
import json
import shlex
newDict = dict()
with open('log.txt') as f:
    for line in f:
        try:
            line = line.replace(" ", "")
            stringSplit = line.split(':')
            key = stringSplit[0]
            value = stringSplit[1]
            value = shlex.split(value)
            newDict[key] = value[0]
        except:
            continue
with open('result.json', 'w') as fp:
    json.dump(newDict, fp)
Resulting in the following .json:
{"name": "JohnDoe", "weight": "70", "Additionalinformation": "logsummary#1234",
"height": "1.70", "phone": "01234567", "address": "myAdress", "age": "30"}

You haven't described exactly what the desired output should be from the sample input, so it's not completely clear what you want done. So I guessed and the following only extracts data values from lines following one that contains a '{' until one with a '}' in it is encountered, while ignoring others.
It uses the re module to isolate the two parts of each dictionary item definition found on the line, and then uses the ast module to convert the value portion of that into a valid Python literal (i.e. string, number, tuple, list, dict, bool, and None).
import ast
import json
import re
pat = re.compile(r"""(?P<key>\w+)\s*:\s*(?P<value>.+)$""")
data_dict = {}
with open('log.txt', 'rU') as f:
    braces = 0
    for line in (line.strip() for line in f):
        if braces > 0:
            match = pat.search(line)
            if match and len(match.groups()) == 2:
                key = match.group('key')
                value = ast.literal_eval(match.group('value'))
                data_dict[key] = value
        elif '{' in line:
            braces += 1
        elif '}' in line:
            braces -= 1
        else:
            pass  # ignore line
print(json.dumps(data_dict, indent=4))
Output from your example input:
{
"name": "John Doe",
"weight": 70,
"age": 30,
"height": 1.7,
"phone": 342391,
"address": "myAdress"
}
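The output above comes from Python 2, where 01234567 is read as an octal literal (hence 342391). Under Python 3 that literal is rejected by ast.literal_eval, and recent Python 3 releases no longer accept the 'rU' file mode. Here is a minimal sketch of the same idea for Python 3, assuming that values which are not valid literals should simply be kept as strings. The names parse_log and PAIR are hypothetical, and the brace checks are done first so that the depth counter goes back down after each section.

```python
import ast
import re

# key: value pairs such as `age: 30` or `name: "John Doe"`
PAIR = re.compile(r"^(?P<key>\w+)\s*:\s*(?P<value>.+)$")

def parse_log(lines):
    """Collect key: value pairs that appear between '{' and '}' lines."""
    result = {}
    depth = 0
    for raw in lines:
        line = raw.strip()
        if line.endswith("{"):
            depth += 1          # entering a section such as `details {`
        elif line == "}":
            depth -= 1          # leaving the section
        elif depth > 0:
            match = PAIR.match(line)
            if match:
                text = match.group("value")
                try:
                    value = ast.literal_eval(text)
                except (ValueError, SyntaxError):
                    # Assumption: keep anything that is not a valid
                    # Python 3 literal (e.g. 01234567) as a plain string.
                    value = text
                result[match.group("key")] = value
    return result
```

Because the function accepts any iterable of lines, it can be fed an open file object or lines read from a buffer, and each line is still visited only once.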

Related

convert json to csv and store it in a variable in python

I am using csv module to convert json to csv and store it in a file or print it to stdout.
import csv
import sys
def write_csv(data:list, header:list, path:str=None):
    # data is JSON-style data: a list of dicts
    output_file = open(path, 'w', newline='') if path else sys.stdout
    out = csv.writer(output_file)
    out.writerow(header)
    for row in data:
        out.writerow([row[attr] for attr in header])
    if path: output_file.close()
I want to store the converted csv to a variable instead of sending it to a file or stdout.
say I want to create a function like this:
def json_to_csv(data:list, header:list):
# convert json data into csv string
return string_csv
NOTE: the format of the data is simple.
The data is a list of dictionaries mapping strings to strings:
[
{
"username":"srbcheema",
"name":"Sarbjit Singh"
},
{
"username":"testing",
"name":"Test, user"
}
]
I want csv output to look like:
username,name
srbcheema,Sarbjit Singh
testing,"Test, user"
Converting JSON to CSV is not a trivial operation. There is also no standardized way to translate between them...
For example
my_json = {
"one": 1,
"two": 2,
"three": {
"nested": "structure"
}
}
Could be represented in a number of ways...
These are all (to my knowledge) valid CSVs that contain all the information from the JSON structure.
data
'{"one": 1, "two": 2, "three": {"nested": "structure"}}'
one,two,three
1,2,'{"nested": "structure"}'
one,two,three__nested
1,2,structure
In essence, you will have to figure out the best translation between the two based on your knowledge of the data. There is no right answer on how to go about this.
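As an illustration of the last representation above (one column per leaf, with nested keys joined by "__"), here is a minimal sketch. The function name flatten and the separator are assumptions chosen to match the example, not a standard; lists are left untouched.

```python
def flatten(obj, sep="__", prefix=""):
    """Flatten nested dicts into a single dict with joined key names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the joined name along
            flat.update(flatten(value, sep, name))
        else:
            flat[name] = value
    return flat

my_json = {"one": 1, "two": 2, "three": {"nested": "structure"}}
print(flatten(my_json))
```

The keys of the flattened dict can then serve as the CSV header and its values as the row.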
I'm relatively new to Python so there's probably a better way, but this works:
def get_safe_string(string):
    return '"'+string+'"' if "," in string else string
def json_to_csv(data):
    csv_keys = data[0].keys()
    header = ",".join(csv_keys)
    res = list(",".join(get_safe_string(row.get(k)) for k in csv_keys) for row in data)
    res.insert(0,header)
    return "\n".join(r for r in res)
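Another option is to keep using the csv module and point it at an in-memory text buffer instead of a file. A minimal sketch, assuming the same data and header arguments as in the question; io.StringIO stands in for the file, and csv.writer takes care of quoting fields that contain commas or quotes. The lineterminator="\n" setting is a choice made here so the result matches the desired output; the writer's default is "\r\n".

```python
import csv
import io

def json_to_csv(data, header):
    """Return the rows in `data` as a single CSV-formatted string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for row in data:
        writer.writerow([row[attr] for attr in header])
    return buffer.getvalue()

data = [
    {"username": "srbcheema", "name": "Sarbjit Singh"},
    {"username": "testing", "name": "Test, user"},
]
string_csv = json_to_csv(data, ["username", "name"])
print(string_csv)
```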

Python: finding duplicates in large jsonl file

I'm trying to find all json objects in my jsonl file that contain the same identifier value.
So if my data look like:
{
"data": {
"value": 42,
"url": "url.com",
"details": {
"timestamp": "07:32:29",
"identifier": "123ABC"
}
},
"message": "string"
}
I want to find every object that has the same identifier value. The file is too large to load all at once, so instead I check line by line and store just the identifier values. This has the drawback of missing the first object that has that identifier (ie, if objects A, B, and C all have the same identifier, I would only end up with B and C saved). To find the first occurrence of the identifier, I try reading through the file a second time to pick up only the first time each duplicate identifier is found. This is where I encounter some problems.
This part works as intended:
import gzip
import json_lines
import jsonlines
from itertools import groupby
identifiers=set()
duplicates=[]
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in identifiers:
            duplicates.append(item)
        else:
            identifiers.add(ID)
dup_IDs={dup["data"]["details"]["identifier"] for dup in duplicates}
But when I read through the file a second time:
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in dup_IDs:
            duplicates.append(item)
            dup_IDs.remove(ID)
        else:
            continue
        if len(dup_IDs)==0:
            break
        else:
            continue
It runs for ~30 minutes and eventually crashes my computer. I'm assuming (hoping) this is because there's a problem with my code and not my computer because the code is easier to fix.
If the file is too large, I'd suggest loading the data into a SQL database and using SQL queries to filter out what you need.
import json_lines
duplicates=[]
first_seen = {}    # identifier -> line index of its first occurrence
dup_lines = set()  # line indexes of every object whose identifier repeats
with json_lines.open('file.jsonlines.gz') as f:
    for i, item in enumerate(f):
        ID = item["data"]["details"]["identifier"]
        if ID in first_seen:
            dup_lines.add(first_seen[ID])
            dup_lines.add(i)
        else:
            first_seen[ID] = i
del first_seen
# Second pass: keep only the lines recorded above
with json_lines.open('file.jsonlines.gz') as f:
    for i, item in enumerate(f):
        if i in dup_lines:
            duplicates.append(item)
print(duplicates)
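For comparison, here is a minimal two-pass sketch that uses only the standard library (gzip and json rather than json_lines). It assumes one JSON object per line; find_duplicates and get_id are hypothetical names. The first pass only counts identifiers, so memory use grows with the number of distinct identifiers, not with the size of the objects. The second pass collects every object whose identifier was seen more than once, including the first occurrence.

```python
import gzip
import json
from collections import Counter

def find_duplicates(path, get_id):
    """Return all objects in a gzipped JSON Lines file whose id repeats."""
    # Pass 1: count how often each identifier occurs
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                counts[get_id(json.loads(line))] += 1
    repeated = {key for key, n in counts.items() if n > 1}
    del counts

    # Pass 2: keep every object whose identifier is in the repeated set
    duplicates = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                item = json.loads(line)
                if get_id(item) in repeated:
                    duplicates.append(item)
    return duplicates
```

It could be called as find_duplicates('file.jsonlines.gz', lambda item: item["data"]["details"]["identifier"]). If the duplicates themselves are too numerous to hold in memory, writing them to an output file inside the second loop instead of appending to a list would be the natural variation.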

Python Create a List file and write query

Sorry if my question seems too easy, but I am new to Python and I cannot find a way to solve it.
I have a "dish.py" file which includes some sub-lists
Fruits={"Ap":Apple
"Br":Black Mulberry
"Ch":Black Cherry
}
Meals={"BN":Bean
"MT":Meat
"VG":Vegetable
}
Legumes={"LN":Green Lentil
"P": Pea
"PN":Runner Peanut
}
I want to use the dish.py file in my code so that, in the end, I can run a query against its contents:
with open("/home/user/Py_tut/Cond/dish.py", 'r') as dish:
content = dish.read()
print dish.closed
dm=dict([dish])
nl=[x for x in dm if x[0]=='P']
for x in dm:
x=str(raw_input("Enter word:"))
if x in dm:
print dm[x]
elif x[0]==("P"):
nl.append(x)
print .join( nl)
It may look messy, but:
dm=dict([dish]) I want to create a dictionary for the query
nl=[x for x in dm if x[0]=='P'] I want to collect the words that begin with the letter "P"
Here are my questions:
1. Q: I suppose there is a problem with my dish.py file. How can I reorganize it?
2. Q: How can I apply a query to the file and extract the words that begin with "P"?
Thank you so much in advance
dict() can't load strings:
>>> dict("{'a': 1, 'b': 2}")
Traceback (most recent call last):
File "<pyshell#0>", line 1, in <module>
dict("{'a': 1, 'b': 2}")
ValueError: dictionary update sequence element 0 has length 1; 2 is required
As a sequence it would be ("{", "'", "a", "'", ":",...
Instead I would use the json module, change the dish.py format (changing extension to .json and using JSON syntax) and change the code.
dish.json
{
    "Fruits": {
        "Ap": "Apple",
        "Br": "Black Mulberry",
        "Ch": "Black Cherry"
    },
    "Meals": {
        "BN": "Bean",
        "MT": "Meat",
        "VG": "Vegetable"
    },
    "Legumes": {
        "LN": "Green Lentil",
        "P": "Pea",
        "PN": "Runner Peanut"
    }
}
__init__.py
import json
with open("/home/user/Py_tut/Cond/dish.json", 'r') as dish:
    content = dish.read()
print(dish.closed)
dm = json.loads(content) # loads JSON
nl=[x for x in dm if x[0]=='P']
for x in dm:
    x = input("Enter word:")
    if x in dm:
        print(dm[x])
    elif x.startswith("P"):
        nl.append(x)
print("".join(nl))
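Note that after json.loads the top-level keys are the section names (Fruits, Meals, Legumes), so a query for entries starting with "P" has to look one level down. A minimal sketch, assuming the query is meant to match the short keys such as "P" and "PN"; the JSON is inlined as a string here so the example is self-contained, and entries_starting_with is a hypothetical helper name.

```python
import json

DISH_JSON = """
{
    "Fruits": {"Ap": "Apple", "Br": "Black Mulberry", "Ch": "Black Cherry"},
    "Meals": {"BN": "Bean", "MT": "Meat", "VG": "Vegetable"},
    "Legumes": {"LN": "Green Lentil", "P": "Pea", "PN": "Runner Peanut"}
}
"""

def entries_starting_with(sections, letter):
    """Return {key: value} for every entry whose key starts with `letter`."""
    return {
        key: value
        for entries in sections.values()   # each section is itself a dict
        for key, value in entries.items()
        if key.startswith(letter)
    }

dm = json.loads(DISH_JSON)
print(entries_starting_with(dm, "P"))
```

To match on the full words instead (for example "Pea"), the condition would test value.startswith(letter) rather than the key.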
Q: How can I apply a query to the file and extract the words begin with "P" Thank you so much in advance
Assuming that you want to get every string separated by either a space or a newline and return them in a list, I'd do this:
import re  # regular expression module
def wordsBeginP():
    with open("words.txt") as wordsfile:  # open the file to query
        words = wordsfile.read()
    parsed = re.sub(r"\n", " ", words)  # replace newlines with spaces
    return [i for i in parsed.split(" ") if i.startswith("P")]  # list of words beginning with "P"
So I think you have more than these two issues/questions.
First, if you want to include 'hardcoded' lists, dicts and such, you probably want to import them (import dish), with dish.py in your working directory.
That is, if your data structures in the python file are actually in the correct form:
Fruits={"Ap":'Apple',
"Br":'Black Mulberry',
"Ch":'Black Cherry'
}
Meals={"BN":'Bean',
"MT":'Meat',
"VG":'Vegetable'
}
Legumes={"LN":'Green Lentil',
"P":'Pea',
"PN":'Runner Peanut'
}
Finally, you can search in all the datastructures that were named and included in the file, under the created namespace of the include (which is dish).
import dish
for f in [dish.Fruits,dish.Meals,dish.Legumes]:
    for k,v in f.items():
        if k.startswith('P'):
            print(k,v)
Also interesting for you might be pickling (though there are some caveats).

Multiple jsons to csv

I have multiple files, each containing multiple highly nested json rows. The two first rows of one such file look like:
{
"u":"28",
"evv":{
"w":{
"1":400,
"2":{
"i":[{
"l":14,
"c":"7",
"p":"4"
}
]
}
}
}
}
{
"u":"29",
"evv":{
"w":{
"3":400,
"2":{
"i":[{
"c":14,
"y":"7",
"z":"4"
}
]
}
}
}
}
These are actually single rows; I just wrote them this way for readability.
My question is the following:
Is there any simple way, that doesn't require writing dozens or hundreds of lines of Python specific to my files, to convert all these files to one (or multiple, i.e. one per file) csv/excel? One example would be an external library or script that handles this particular task regardless of the names of the fields.
The trap is that some elements do not appear in each line. For example, for the "i" key, we have 3 fields (l, c, p) in the first json, and 3 in the second one (c, y, z). Ideally, the csv should contain as many columns as possible fields (e.g. evv.w.2.i.l, evv.w.2.i.c, evv.w.2.i.p, evv.w.2.i.y, evv.w.2.i.z) at the risk of having (many) null values per csv row.
A possible csv output for this example would have the following columns:
u, evv.w.1, evv.w.3, evv.w.2.i.l, evv.w.2.i.c, evv.w.2.i.p, evv.w.2.i.y, evv.w.2.i.z
Any idea/reference is welcome :)
Thanks
No, there is no general-purpose program that does precisely what you ask for.
You can, however, write a Python program that does it.
This program might do what you want. It does not have any code specific to your key names, but it is specific to your file format.
It can take several files on the command line.
Each file is presumed to have one JSON object per line.
It flattens the JSON object, joining labels with "."
import fileinput
import json
import csv
def flattify(d, key=()):
    # Lists: merge the flattened items under the same key path
    if isinstance(d, list):
        result = {}
        for i in d:
            result.update(flattify(i, key))
        return result
    # Dicts: extend the key path with each key
    if isinstance(d, dict):
        result = {}
        for k, v in d.items():
            result.update(flattify(v, key + (k,)))
        return result
    return {key: d}
total = []
for line in fileinput.input():
    if line.strip():
        line = json.loads(line)
        line = flattify(line)
        line = {'.'.join(k): v for k, v in line.items()}
        total.append(line)
# Collect every column name seen in any row
keys = set()
for d in total:
    keys.update(d)
with open('result.csv', 'w', newline='') as output_file:
    writer = csv.DictWriter(output_file, sorted(keys))
    writer.writeheader()
    writer.writerows(total)
Please check if this (python3) solution works for you.
import json
import csv
with open('test.json') as data_file:
    with open('output.csv', 'w', newline='') as fp:
        for line in data_file:
            data = json.loads(line)
            output = [[data['u'], data['evv']['w'].get('1'), data['evv']['w'].get('3'),
                       data['evv']['w'].get('2')['i'][0].get('l'), data['evv']['w'].get('2')['i'][0].get('c'),
                       data['evv']['w'].get('2')['i'][0].get('p'), data['evv']['w'].get('2')['i'][0].get('y'),
                       data['evv']['w'].get('2')['i'][0].get('z')]]
            a = csv.writer(fp, delimiter=',')
            a.writerows(output)
test.json
{ "u": "28", "evv": { "w": { "1": 400, "2": { "i": [{ "l": 14, "c": "7", "p": "4" }] } } }}
{"u":"29","evv":{ "w":{ "3":400, "2":{ "i":[{ "c":14, "y":"7", "z":"4" } ] } } }}
output
python3 pyprog.py
dac#dac-Latitude-E7450 ~/P/pyprog> more output.csv
28,400,,14,7,4,,
29,,400,,14,,7,4

Parse Key Value Pairs in Python

So I have a key value file that's similar to JSON's format but it's different enough to not be picked up by the Python JSON parser.
Example:
"Matt"
{
"Location" "New York"
"Age" "22"
"Items"
{
"Banana" "2"
"Apple" "5"
"Cat" "1"
}
}
Is there any easy way to parse this text file and store the values into an array such that I could access the data using a format similar to Matt[Items][Banana]? There is only to be one pair per line and a bracket should denote going down a level and going up a level.
You could use re.sub to 'fix up' your string and then parse it. As long as the format is always either a single quoted string or a pair of quoted strings on each line, you can use that to determine where to place commas and colons.
import re
s = """"Matt"
{
"Location" "New York"
"Age" "22"
"Items"
{
"Banana" "2"
"Apple" "5"
"Cat" "1"
}
}"""
# Put a colon after the first string in every line
s1 = re.sub(r'^\s*(".+?")', r'\1:', s, flags=re.MULTILINE)
# add a comma if the last non-whitespace character in a line is " or }
s2 = re.sub(r'(["}])\s*$', r'\1,', s1, flags=re.MULTILINE)
Once you've done that, you can use ast.literal_eval to turn it into a Python dict. I use that over JSON parsing because it allows for trailing commas, without which the decision of where to put commas becomes a lot more complicated:
import ast
data = ast.literal_eval('{' + s2 + '}')
print(data['Matt']['Items']['Banana'])
# 2
I'm not sure how robust this approach is outside of the example you've posted, but it does support escaped characters and deeper levels of structured data. It's probably not going to be fast enough for large amounts of data.
The approach converts your custom data format to JSON, using a (very) simple parser to add the required colons and commas. The JSON data can then be converted to a native Python dictionary.
import json
# Define the data that needs to be parsed
data = '''
"Matt"
{
"Location" "New \\"York"
"Age" "22"
"Items"
{
"Banana" "2"
"Apple" "5"
"Cat"
{
"foo" "bar"
}
}
}
'''
# Convert the data from custom format to JSON
json_data = ''
# Define parser states
state = 'OUT'
key_or_value = 'KEY'
for c in data:
    # Handle quote characters
    if c == '"':
        json_data += c
        if state == 'IN':
            state = 'OUT'
            if key_or_value == 'KEY':
                key_or_value = 'VALUE'
                json_data += ':'
            elif key_or_value == 'VALUE':
                key_or_value = 'KEY'
                json_data += ','
        else:
            state = 'IN'
    # Handle braces
    elif c == '{':
        if state == 'OUT':
            key_or_value = 'KEY'
        json_data += c
    elif c == '}':
        # Strip trailing comma and add closing brace and comma
        json_data = json_data.rstrip().rstrip(',') + '},'
    # Handle escaped characters
    elif c == '\\':
        state = 'ESCAPED'
        json_data += c
    else:
        json_data += c
# Strip trailing comma
json_data = json_data.rstrip().rstrip(',')
# Wrap the data in braces to form a dictionary
json_data = '{' + json_data + '}'
# Convert from JSON to the native Python
converted_data = json.loads(json_data)
print(converted_data['Matt']['Items']['Banana'])
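A third possibility is to skip the intermediate JSON text and build the nested dictionaries directly. The following is a minimal sketch of that idea, assuming the format consists only of double-quoted strings and braces as in the example. The names TOKEN, tokenize, parse_block and parse_pairs are hypothetical, and backslash escapes inside strings are kept as written, not decoded.

```python
import re

# A double-quoted string (allowing backslash escapes) or a single brace
TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|([{}])')

def tokenize(text):
    """Yield ("str", value) or ("brace", "{" / "}") tuples."""
    for m in TOKEN.finditer(text):
        if m.group(2):
            yield ("brace", m.group(2))
        else:
            yield ("str", m.group(1))

def parse_block(tokens):
    """Read key/value pairs until the matching '}' or the end of input."""
    result = {}
    for kind, value in tokens:
        if kind == "brace":
            if value == "}":
                return result
            raise ValueError("unexpected '{' where a key was expected")
        key = value
        kind, nxt = next(tokens, (None, None))
        if kind == "str":
            result[key] = nxt                    # "key" "value"
        elif kind == "brace" and nxt == "{":
            result[key] = parse_block(tokens)    # "key" { ... }
        else:
            raise ValueError("missing value for key %r" % key)
    return result

def parse_pairs(text):
    return parse_block(tokenize(text))

sample = '''
"Matt"
{
    "Location" "New York"
    "Age" "22"
    "Items"
    {
        "Banana" "2"
        "Apple" "5"
        "Cat" "1"
    }
}
'''
data = parse_pairs(sample)
print(data["Matt"]["Items"]["Banana"])
```

Because the tokenizer ignores everything outside quotes and braces, indentation and line breaks do not matter to this sketch; the trade-off is that it does not enforce the one-pair-per-line rule.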
