How to parse dictionary syntaxed string to dictionary object - python

I have a file whose syntax resembles a dictionary, as follows:
{YEARS:5}
{GROUPS:[1,2]}
{SAVE_FILE:{USE:1,NAME:CustomCalendar.ics}}
{SAVE_ONLINE:{USE:1,NAME:Custom Calendar,EMAIL:an.email#something.com,PASSWORD:AcompLExP#ssw0rd}}
{COURSES:[BTC,CIT,CQN,OSA,PRJCQN,PRJIOT,PILS,SPO,SHS1]}
I would like to find a way to parse each individual line into a dictionary as it is written. The difficulty I have is that some of these lines contain a dictionary as their value.
I am capable of taking the single lines and converting them to actual dictionaries but I am having an issue when working with the other lines.
Here is the code I have so far:
def get_config(filename):
    with open(filename, encoding="utf8") as config:
        years = config.read().split()[0]
        print(parse_line(years))
def parse_line(input_line):
    input_line = input_line.strip("{}")
    input_line = input_line.split(":")
    return {input_line[i]: input_line[i + 1] for i in range(0, len(input_line), 2)}
If at all possible, I'd love to be able to deal with any line within a single function and hopefully deal with more than two nested dictionaries.
Thanks in advance!

If your file contained valid JSON, it would be easy to read the file and convert your data structures to dictionaries.
To give an example, consider having the following line of text in a file text.txt:
{"SAVE_ONLINE":{"USE":1,"NAME":"Custom Calendar","EMAIL":"an.email#something.com","PASSWORD":"AcompLExP#ssw0rd"}}
Note that the only difference is the quotes " around strings.
You can easily parse the line to a dictionary structure with:
import json
with open('text.txt', 'r') as f:
    d = json.loads(f.read())
Output
print(d)
# {'SAVE_ONLINE': {'USE': 1, 'NAME': 'Custom Calendar', 'EMAIL': 'an.email#something.com', 'PASSWORD': 'AcompLExP#ssw0rd'}}
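If the file cannot be changed to valid JSON, a small recursive-descent parser is one way to handle the unquoted format directly. The following is only a minimal sketch: `parse_value` and this version of `parse_line` are hypothetical helpers, and the sketch assumes that bare values never contain `,`, `}` or `]`, that keys never contain `:`, and that every line is well formed.

```python
def parse_value(text, pos):
    """Parse one value starting at text[pos]; return (value, next_pos)."""
    if text[pos] == "{":
        result = {}
        pos += 1
        while text[pos] != "}":
            colon = text.index(":", pos)       # the key ends at the next colon
            key = text[pos:colon].strip()
            result[key], pos = parse_value(text, colon + 1)
            if text[pos] == ",":               # skip the separator between pairs
                pos += 1
        return result, pos + 1                 # step past the closing brace
    if text[pos] == "[":
        items = []
        pos += 1
        while text[pos] != "]":
            item, pos = parse_value(text, pos)
            items.append(item)
            if text[pos] == ",":
                pos += 1
        return items, pos + 1                  # step past the closing bracket
    # Bare scalar: everything up to the next delimiter (assumption, see above).
    end = pos
    while end < len(text) and text[end] not in ",}]":
        end += 1
    token = text[pos:end].strip()
    return (int(token) if token.isdecimal() else token), end


def parse_line(line):
    """Parse one line such as '{YEARS:5}' into a dictionary."""
    value, _ = parse_value(line.strip(), 0)
    return value


print(parse_line("{SAVE_FILE:{USE:1,NAME:CustomCalendar.ics}}"))
# {'SAVE_FILE': {'USE': 1, 'NAME': 'CustomCalendar.ics'}}
```

Because `parse_value` calls itself for every nested `{` or `[`, the nesting depth is not limited to two levels. Malformed input raises `IndexError` or `ValueError` rather than being reported gracefully.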

Related

Accessing items in a dump of dictionary objects in Python

I have a strange dataset from our customer. It is a .json file but inside it looks like below
{"a":"aaa","b":"bbb","text":"hello"}
{"a":"aaa","b":"bbb","text":"hi"}
{"a":"aaa","b":"bbb","text":"hihi"}
As you notice, this is just a dump of dictionary objects. It is neither a list (no [] and no comma separator between objects) nor proper JSON, although the file extension is .json. So I am really confused about how to read this file.
All I care about is reading all the text keys from each of the dictionary objects.
This "strange dataset" is actually an existing format that builds upon JSON, called JSONL.
As @user655321 said, you can parse each line. Here's a more complete example that collects the whole dataset into a list of dicts named dataset:
import json
dataset = []
with open("my_file.json") as file:
    for line in file:
        dataset.append(json.loads(line))
In [51]: [json.loads(i)["text"] for i in open("file.json").readlines()]
Out[51]: ['hello', 'hi', 'hihi']
Use a list comprehension; it's easier.
You can read the file line by line, convert each line to a dictionary, and extract the data you need (the text key in your case).
You can do something as follows:
import json
with open("file.txt") as f:
    for line in f:
        dictionary = json.loads(line)
        print(dictionary["text"])
Since the file is not a single JSON document, you can read the input line by line and deserialize each line independently:
import json
with open('my_file.json') as fh:
    for line in fh:
        json_obj = json.loads(line)
        keys = json_obj.keys() # eg, 'a', 'b', 'text'
        text_val = json_obj['text'] # eg, 'hello', 'hi', or 'hihi'
How about splitting the content on \n and then using json to load each dictionary? Something like:
import json
with open(your_file) as f:
    data = f.read()
my_dicts = []
# splitlines() splits on newlines only; split() would also break on spaces inside a line
for line in data.splitlines():
    my_dicts.append(json.loads(line))
import ast
with open('my_file.json') as fh:
    for line in fh:
        try:
            dict_data = ast.literal_eval(line)
            assert isinstance(dict_data,dict)
            ### Process Dictionary Data here or append to list to convert to list of dicts
        except (SyntaxError, ValueError, AssertionError):
            print('ERROR - {} is not a dictionary'.format(line))
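The line-by-line approach used in the answers above can be wrapped in a small generator. This is a sketch only: `iter_jsonl` is a hypothetical helper name, and skipping blank lines and reporting the line number of a bad record are assumptions about what is convenient, not requirements of the format.

```python
import json


def iter_jsonl(lines):
    """Yield one parsed object per non-blank line of a JSON Lines source."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number} is not valid JSON: {exc}") from exc


# Any iterable of strings works, including an open file object.
sample = [
    '{"a":"aaa","b":"bbb","text":"hello"}\n',
    '\n',
    '{"a":"aaa","b":"bbb","text":"hi"}\n',
]
texts = [record["text"] for record in iter_jsonl(sample)]
print(texts)  # ['hello', 'hi']
```

With a real file, `with open("my_file.json") as fh: texts = [r["text"] for r in iter_jsonl(fh)]` reads one line at a time instead of loading the whole file.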

need help to improve my Python script performance that uses nested loop and json file

I need help with improving my script's execution time.
It does what it is supposed to do:
Reads a file line by line
Matches the line with the content of json file
Writes both the matching lines with the corresponding information from json file into a new txt file
The problem is the execution time: the file has more than 500,000 lines, and the json file contains many more entries.
How can I optimize this script?
import json
import time
start = time.time()
print(start)
JsonFile=open('categories.json')
data = json.load(JsonFile)
Annotated_Data={}
FileList = [line.rstrip('\n') for line in open("FilesNamesID.txt")]
for File in FileList:
    for key, value in data.items():
        if File == key:
            Annotated_Data[key]=(value)
with open('Annotated_Files.txt', 'w') as outfile:
    json.dump(Annotated_Data, outfile, indent=4)
end = time.time()
print(end - start)
There is no need for the nested for loop to look up the File in data. You could replace it with the following code:
for File in FileList:
    if File in data:
        Annotated_Data[File]=data[File]
or with a comprehension:
Annotated_Data = {File: data[File] for File in FileList if File in data}
You can also avoid copying the contents of the whole FilesNamesID.txt to the new list - you are consuming it line by line anyway - but it would be a relatively minor improvement.
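As a rough illustration of both points (a direct dictionary lookup, and consuming the ID file lazily instead of building `FileList` first), here is a sketch. The function name `annotate` and the in-memory sample data are assumptions for the example; in the real script the two inputs would come from `categories.json` and `FilesNamesID.txt`.

```python
import io
import json


def annotate(id_lines, data):
    """Return {name: data[name]} for every line that is a key of data."""
    annotated = {}
    for line in id_lines:           # consumed lazily, one line at a time
        name = line.rstrip("\n")
        if name in data:            # hash lookup instead of scanning every key
            annotated[name] = data[name]
    return annotated


# Stand-ins for the two input files.
data = json.loads('{"file1": "data1", "file2": "data2", "file3": "data3"}')
ids = io.StringIO("file1\nfile3\nfile9\n")

result = annotate(ids, data)
print(result)  # {'file1': 'data1', 'file3': 'data3'}
```

The nested loop compares every line with every key, so its cost grows with the product of the two sizes. The lookup version does one hash probe per line.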
I don't know the exact format of your data, but you could try to speed up your script by using set():
json_data = '''
{
"file1": "data1",
"file2": "data2",
"file3": "data3"
}
'''
filenames_id_txt = '''
file1
file3
'''
import json
data = json.loads(json_data)
lines = [l.strip() for l in filenames_id_txt.splitlines() if l.strip()]
s = set(data.keys())
Annotated_Data = {k: data[k] for k in s.intersection(lines)}
print(json.dumps(Annotated_Data))
Prints:
{"file3": "data3", "file1": "data1"}
EDIT: If I understand your question correctly, you want to find the "intersection" between your JSON data and the lines in your TXT file.
I chose set() to store the JSON keys (a set is a collection of unique elements). A set has very fast methods; one of them is intersection(), which accepts other iterables (e.g. the lines from the TXT file) and returns a new set containing the common elements.
I use this new set to construct a new dictionary and output it as JSON.
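A closely related variant, shown here only as a sketch with made-up sample data: in Python 3 the object returned by `dict.keys()` is already set-like, so it can be intersected with another set using `&` without first copying the keys into a separate `set`.

```python
data = {"file1": "data1", "file2": "data2", "file3": "data3"}
lines = ["file1", "file3", "file9"]

# Keys views support set operations such as & directly.
common = data.keys() & set(lines)

# sorted() only makes the output order predictable; sets are unordered.
annotated = {key: data[key] for key in sorted(common)}
print(annotated)  # {'file1': 'data1', 'file3': 'data3'}
```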

open a .json file with multiple dictionaries

I have a problem that I can't solve with python, it is probably very stupid but I didn't manage to find the solution by myself.
I have a .json file where the results of a simulation are stored. The result is stored as a series of dictionaries like
{"F_t_in_max": 709.1800264942982, "F_t_out_max": 3333.1574129603068, "P_elec_max": 0.87088836042046958, "beta_max": 0.38091242406098391, "r0_max": 187.55175182942901, "r1_max": 1354.8636763521174, " speed ": 8}
{"F_t_in_max": 525.61428305710433, "F_t_out_max": 2965.0538075438467, "P_elec_max": 0.80977406754203796, "beta_max": 0.59471606595464666, "r0_max": 241.25371753877008, "r1_max": 688.61786996066826, " speed ": 9}
{"F_t_in_max": 453.71124051199763, "F_t_out_max": 2630.1763649193008, "P_elec_max": 0.64268078173342935, "beta_max": 1.0352896471221695, "r0_max": 249.32706230502498, "r1_max": 709.11415981343885, " speed ": 10}
I would like to open the file and access the values, for example to plot "r0_max" as a function of "speed", but I can't open it unless it contains only one dictionary.
I use
with open('./results/rigid_wing_opt.json') as data_file:
data = json.load(data_file)
but when the file contains more than one dictionary I get the error
ValueError: Extra data: line 5 column 1 - line 6 column 1 (char 217 - 431)
If your input data is exactly as provided, then you should be able to interpret each individual dictionary using json.loads. If each dictionary is on its own line, then this should be sufficient:
with open('filename', 'r') as handle:
    json_data = [json.loads(line) for line in handle]
I would recommend reading the file line by line and converting each line independently to a dictionary.
You can place each line into a list with the following code:
import ast
# Read all lines into a list
with open(fname) as f:
    content = f.readlines()
# Convert each list item to a dict
content = [ ast.literal_eval( line ) for line in content ]
Or an even shorter version performing the list comprehension on the same line:
import ast
# Read all lines into a list
with open(fname) as f:
    content = [ ast.literal_eval( l ) for l in f.readlines() ]
{...} {...} is not proper JSON. It is two JSON objects separated by a space. Unless you can change the format of the input file to correct this, I'd suggest you try something a little different. If the data is as simple as in your example, then you could do something like this:
import json
import re
with open('filename', 'r') as handle:
    text_data = handle.read()
    # Join adjacent objects with commas and wrap everything in a list
    text_data = '[' + re.sub(r'\}\s*\{', '},{', text_data) + ']'
    json_data = json.loads(text_data)
This should work even if your dictionaries are not on separate lines.
That is not valid JSON. You can't have multiple objects at the top level without wrapping them in a list and inserting commas between them.
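If the file cannot be changed, another option that avoids editing the text with a regular expression is the standard library's `json.JSONDecoder.raw_decode`, which parses one value and reports where it stopped. The sketch below is one possible way to use it; `iter_json_objects` is a hypothetical helper name, and the sample text is an abbreviated version of the data in the question.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value found in text, in order."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Objects separated by a space or by a newline are both handled.
text = (
    '{"r0_max": 187.55175182942901, " speed ": 8} '
    '{"r0_max": 241.25371753877008, " speed ": 9}\n'
    '{"r0_max": 249.32706230502498, " speed ": 10}\n'
)
records = list(iter_json_objects(text))
speeds = [record[" speed "] for record in records]
print(speeds)  # [8, 9, 10]
```

Note that the key in the question's data is `" speed "` with surrounding spaces, so it has to be looked up exactly like that.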

Converting to csv from?

I have got a file with the following lines
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f508f-e7c8-32b8-e044-0003ba298018","municipalityCode":"0766","municipalityName":"Hedensted","streetCode":"0072","streetName":"Værnegården","streetBuildingIdentifier":"13","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"","postCodeIdentifier":"8000","districtName":"Århus","presentationString":"Værnegården 13, 8000 Århus","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(553564 6179299)","x":553564,"y":6179299}]}
I want to transform every line into a csv readable file with headers. Like the following
status,message,data,addressAccessId,municipalityCode,municipalityName,streetCode,streetName,streetBuildingIdentifier,mailDeliverySublocationIdentifier,districtSubDivisionIdentifier,postCodeIdentifier,districtName,presentationString,addressSpecificCount,validCoordinates,geometryWkt,x,y
OK,OK,data:type,addressAccessType,0a3f508f-e7c8-32b8-e044-0003ba298018,0766,Hedensted,0072,Værnegården,13,,,8000,Århus,Værnegården 13, 8000 Århus,1,true,POINT553564 6179299,553564,6179299
How do I accomplish that? Code and explanation are very welcome. So far, this is what I have come up with, based on this example (How can I convert JSON to CSV?):
x = json.loads(x)
f = csv.writer(open('test.csv', 'wb+'))
# Write CSV Header, If you dont need that, remove this line
f.writerow(['status', 'message', 'type', 'addressAccessId', 'municipalityCode','municipalityName','streetCode','streetName','streetBuildingIdentifier','mailDeliverySublocationIdentifier','districtSubDivisionIdentifier','postCodeIdentifier','districtName','presentationString','addressSpecificCount','validCoordinates','geometryWkt','x','y'])
for x in x:
    f.writerow([x['status'],
                x['message'],
                x['data']['type'],
                x['data']['addressAccessId'],
                x['data']['municipalityCode'],
                x['data']['municipalityName'],
                x['data']['streetCode'],
                x['data']['streetName'],
                x['data']['streetBuildingIdentifier'],
                x['data']['mailDeliverySublocationIdentifier'],
                x['data']['districtSubDivisionIdentifier'],
                x['data']['postCodeIdentifier'],
                x['data']['districtName'],
                x['data']['presentationString'],
                x['data']['addressSpecificCount'],
                x['data']['validCoordinates'],
                x['data']['geometryWkt'],
                x['data']['x'],
                x['data']['y']])
I have looked through and tried a lot of other solutions, including DictWriter, replace() and translate() to remove characters, but have not yet been able to transform the line into what I need. The purpose is to be able to select the fields that are output into a new file, and to transform x and y to a new coordinate system. But for now I'm just trying to parse the above line into a csv file. Can anyone offer code and an explanation of their code? Thank you very much for your time.
Below are the first few lines of my addresses.txt
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f5081-e039-32b8-e044-0003ba298018","municipalityCode":"0265","municipalityName":"Roskilde","streetCode":"0831","streetName":"Brønsager","streetBuildingIdentifier":"69","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"Svogerslev","postCodeIdentifier":"4000","districtName":"Roskilde","presentationString":"Brønsager 69, 4000 Roskilde","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(690026 6169309)","x":690026,"y":6169309}]}
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f5089-ecab-32b8-e044-0003ba298018","municipalityCode":"0461","municipalityName":"Odense","streetCode":"9505","streetName":"Vægtens Kvarter","streetBuildingIdentifier":"271","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"Holluf Pile","postCodeIdentifier":"5220","districtName":"Odense SØ","presentationString":"Vægtens Kvarter 271, 5220 Odense SØ","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(592191 6135829)","x":592191,"y":6135829}]}
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f507c-adc3-32b8-e044-0003ba298018","municipalityCode":"0165","municipalityName":"Albertslund","streetCode":"0445","streetName":"Skyttehusene","streetBuildingIdentifier":"33","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"","postCodeIdentifier":"2620","districtName":"Albertslund","presentationString":"Skyttehusene 33, 2620 Albertslund","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(711079 6174741)","x":711079,"y":6174741}]}
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f509c-7f57-32b8-e044-0003ba298018","municipalityCode":"0851","municipalityName":"Aalborg","streetCode":"5205","streetName":"Løvstikkevej","streetBuildingIdentifier":"36","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"","postCodeIdentifier":"9000","districtName":"Aalborg","presentationString":"Løvstikkevej 36, 9000 Aalborg","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(552407 6322490)","x":552407,"y":6322490}]}
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f5098-32a6-32b8-e044-0003ba298018","municipalityCode":"0779","municipalityName":"Skive","streetCode":"0462","streetName":"Landevejen","streetBuildingIdentifier":"52","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"Håsum","postCodeIdentifier":"7860","districtName":"Spøttrup","presentationString":"Landevejen 52, 7860 Spøttrup","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(491515 6269739)","x":491515,"y":6269739}]}
Note that the data key holds a list of dictionaries. x['data']['type'] wouldn't work, but x['data'][0]['type'] would. There might be more than one such dictionary in that list, however. I'll assume you want a CSV row per x['data'] dictionary.
Next, it appears you have a UTF-8 BOM on every line; whatever wrote this was not using UTF-8 encoding correctly. We need to strip this marker, the first 3 characters.
Last, JSON strings are always Unicode data, and you have non-ASCII characters in your data, so you'll have to encode to bytestrings again before passing the data to the CSV writer object.
I'd use csv.DictWriter here, with a pre-defined list of field names:
import codecs
import csv
import json
fields = [
'status', 'message', 'type', 'addressAccessId', 'municipalityCode',
'municipalityName', 'streetCode', 'streetName', 'streetBuildingIdentifier',
'mailDeliverySublocationIdentifier', 'districtSubDivisionIdentifier',
'postCodeIdentifier', 'districtName', 'presentationString', 'addressSpecificCount',
'validCoordinates', 'geometryWkt', 'x', 'y']
with open('test.csv', 'wb') as csvfile, open('jsonfile', 'r') as jsonfile:
    writer = csv.DictWriter(csvfile, fields)
    writer.writeheader()
    for line in jsonfile:
        if line.startswith(codecs.BOM_UTF8):
            line = line[3:]
        entry = json.loads(line)
        for item in entry['data']:
            row = dict(item, status=entry['status'], message=entry['message'])
            row = {k.encode('utf8'): unicode(v).encode('utf8') for k, v in row.iteritems()}
            writer.writerow(row)
The row dictionary is basically a copy of each of the dictionaries in the entry['data'] list, with the status and message keys copied over separately. This makes row a flat dictionary instead.
I also read your input file line by line, as you say that each line contains a separate JSON entry.
Open the output file with csv.DictWriter() and define the output header fields as you specified. Use extrasaction='ignore' and restval='' as options.
For help with processing large files, look at "Opening A large JSON file in Python with no newlines for csv conversion Python 2.6.6", where I had a similar question. Also look at the question that I link to there.
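The answers above target Python 2 (`unicode`, `iteritems`, a file opened in `'wb'` mode). A rough Python 3 equivalent might look like the sketch below, which combines the flattening step with the `extrasaction='ignore'` and `restval=''` options. The function name `jsonl_to_csv`, the shortened field list, and the in-memory sample are assumptions made to keep the example small and runnable.

```python
import csv
import io
import json

# A subset of the fields from the question; columns not listed are ignored.
FIELDS = ["status", "message", "type", "streetName", "postCodeIdentifier", "x", "y"]


def jsonl_to_csv(lines, out, fields=FIELDS):
    """Write one CSV row per dictionary in each line's 'data' list."""
    writer = csv.DictWriter(out, fieldnames=fields,
                            extrasaction="ignore", restval="")
    writer.writeheader()
    for line in lines:
        line = line.lstrip("\ufeff").strip()  # drop a per-line BOM, if any
        if not line:
            continue
        entry = json.loads(line)
        for item in entry["data"]:
            # Flatten: copy the item and add the top-level status and message.
            row = dict(item, status=entry["status"], message=entry["message"])
            writer.writerow(row)


sample = [
    '\ufeff{"status":"OK","message":"OK","data":[{"type":"addressAccessType",'
    '"streetName":"Brønsager","postCodeIdentifier":"4000",'
    '"x":690026,"y":6169309}]}\n'
]
buffer = io.StringIO()
jsonl_to_csv(sample, buffer)
print(buffer.getvalue())
```

With real files, Python 3 handles the encoding at `open()` time, for example `open('addresses.txt', encoding='utf-8')` for input and `open('test.csv', 'w', encoding='utf-8', newline='')` for output, so no manual `encode` calls are needed.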
I built a similar type of system from JSON using appropriate loops.
For example:
def parse_row(currdata):
    outx = {}
    # currdata is defined earlier to point to the x['data'] dictionary
    for eachx in currdata:
        outx[eachx] = currdata[eachx]
    return outx
Here parse_row is a function that takes currdata as its argument and is called with x['data'][row] as the input:
rows = len(x['data'])
for row in range(rows):
    outx = parse_row(x['data'][row])
    # process the row and create output
This should allow you to set up the parsing properly. I cannot copy the actual code into this answer but this should point you to a solution.

I cannot get split to work, what am I doing wrong?

Here is the code for the program that I have written so far. I am trying to calculate the efficiency of NBA players for a class project. When I run the program on a comma-delimited file that contains all the stats, instead of splitting on each comma it creates a list entry containing the entire line of the stat file. I get an index out of range error, or it treats each character as an index point instead of the separate fields. I am new to this, but it seems it should create a list for each line in the file, with the comma-separated fields as the elements of that list, so that I get a list of lists. I hope I have made myself understood.
Here is the code:
def get_data_list (file_name):
    data_file = open(file_name, "r")
    data_list = []
    for line_str in data_file:
        # strip end-of-line, split on commas, and append items to list
        line_str.strip()
        line_str.split(',')
        print(line_str)
        data_list.append(line_str)
    print(data_list)
file_name1 = input("File name: ")
result_list = get_data_list (file_name1)
print(result_list)
I do not see how to post the data file for you to look at and try it with, but any file of numbers that are comma-delimited should work.
If there is a way to post the data file or email to you for you to help me with it I would be happy to do so.
Boliver
Strings are immutable objects, which means you can't change them in place. Any operation on a string therefore returns a new object and leaves the original untouched. Now look at your code:
line_str.strip() # returns a string
line_str.split(',') # returns a list of strings
data_list.append(line_str) # appends original 'line_str' (i.e. the entire line)
You could solve this by:
stripped = line_str.strip()
data = stripped.split(',')
data_list.append(data)
Or by chaining the string operations:
data = line_str.strip().split(',')
data_list.append(data)
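Putting the fix together, a corrected version of the function might look like the sketch below. Two details go beyond the answer and are worth flagging: the function in the question never returns `data_list`, so `result_list` would be `None`, and the helper `parse_rows` is a hypothetical name introduced here so the splitting logic can be tried without a file.

```python
def parse_rows(lines):
    """Split each comma-delimited line into a list of fields."""
    rows = []
    for line_str in lines:
        # strip() and split() return new objects, so keep the results.
        fields = line_str.strip().split(",")
        rows.append(fields)
    return rows


def get_data_list(file_name):
    """Read a comma-delimited file into a list of lists of strings."""
    with open(file_name, "r") as data_file:
        return parse_rows(data_file)


print(parse_rows(["23,10,5\n", "31,4,12\n"]))
# [['23', '10', '5'], ['31', '4', '12']]
```

The fields are still strings at this point; converting them to numbers with `int()` or `float()` would be a separate step.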
