Parse Key Value Pairs in Python - python

So I have a key value file that's similar to JSON's format but it's different enough to not be picked up by the Python JSON parser.
Example:
"Matt"
{
"Location" "New York"
"Age" "22"
"Items"
{
"Banana" "2"
"Apple" "5"
"Cat" "1"
}
}
Is there an easy way to parse this text file and store the values in a nested structure, so that I could access the data using a format similar to Matt[Items][Banana]? There is only ever one pair per line, and a brace denotes going down a level or back up a level.

You could use re.sub to 'fix up' your string and then parse it. As long as the format is always either a single quoted string or a pair of quoted strings on each line, you can use that to determine where to place commas and colons.
import re
s = """"Matt"
{
"Location" "New York"
"Age" "22"
"Items"
{
"Banana" "2"
"Apple" "5"
"Cat" "1"
}
}"""
# Put a colon after the first string in every line
s1 = re.sub(r'^\s*(".+?")', r'\1:', s, flags=re.MULTILINE)
# add a comma if the last non-whitespace character in a line is " or }
s2 = re.sub(r'(["}])\s*$', r'\1,', s1, flags=re.MULTILINE)
Once you've done that, you can use ast.literal_eval to turn it into a Python dict. I use that over JSON parsing because it allows for trailing commas, without which the decision of where to put commas becomes a lot more complicated:
import ast
data = ast.literal_eval('{' + s2 + '}')
print(data['Matt']['Items']['Banana'])
# 2

Not sure how robust this approach is outside of the example you've posted, but it does support escaped characters and deeper levels of structured data. It's probably not going to be fast enough for large amounts of data.
The approach converts your custom data format to JSON using a (very) simple parser to add the required colons and commas. The JSON data can then be converted to a native Python dictionary.
import json
# Define the data that needs to be parsed
data = '''
"Matt"
{
"Location" "New \\"York"
"Age" "22"
"Items"
{
"Banana" "2"
"Apple" "5"
"Cat"
{
"foo" "bar"
}
}
}
'''
# Convert the data from custom format to JSON
json_data = ''
# Define parser states
state = 'OUT'
key_or_value = 'KEY'
for c in data:
    # Handle quote characters
    if c == '"':
        json_data += c
        if state == 'IN':
            state = 'OUT'
            if key_or_value == 'KEY':
                key_or_value = 'VALUE'
                json_data += ':'
            elif key_or_value == 'VALUE':
                key_or_value = 'KEY'
                json_data += ','
        else:
            state = 'IN'
    # Handle braces
    elif c == '{':
        if state == 'OUT':
            key_or_value = 'KEY'
        json_data += c
    elif c == '}':
        # Strip trailing comma and add closing brace and comma
        json_data = json_data.rstrip().rstrip(',') + '},'
    # Handle escaped characters
    elif c == '\\':
        state = 'ESCAPED'
        json_data += c
    else:
        json_data += c
# Strip trailing comma
json_data = json_data.rstrip().rstrip(',')
# Wrap the data in braces to form a dictionary
json_data = '{' + json_data + '}'
# Convert from JSON to the native Python
converted_data = json.loads(json_data)
print(converted_data['Matt']['Items']['Banana'])
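For comparison, the "one pair per line, braces change the level" rule from the question can also be handled with a small line-oriented parser that keeps a stack of open dictionaries. This is only a sketch under the assumptions stated in the question (every brace sits on its own line, and a key on a line by itself is followed by a `{`); the name `parse_kv` is made up for this example, and `shlex.split` is used as a convenient way to split the double-quoted tokens.

```python
import shlex

def parse_kv(text):
    """Parse one-pair-per-line text where '{' opens a level and '}' closes it."""
    root = {}
    stack = [root]        # stack[-1] is the dict currently being filled
    pending_key = None    # a key seen alone on a line, waiting for its '{'
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == '{':
            child = {}
            stack[-1][pending_key] = child
            stack.append(child)
            pending_key = None
        elif line == '}':
            stack.pop()
        else:
            tokens = shlex.split(line)   # splits the double-quoted strings
            if len(tokens) == 1:
                pending_key = tokens[0]
            else:
                stack[-1][tokens[0]] = tokens[1]
    return root

sample = '''"Matt"
{
"Location" "New York"
"Age" "22"
"Items"
{
"Banana" "2"
"Apple" "5"
"Cat" "1"
}
}'''
parsed = parse_kv(sample)
print(parsed['Matt']['Items']['Banana'])
```

All values stay strings, as in the input file; converting "22" to a number would be an extra step.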

Related

Python replace values in unknown structure JSON file

Say that I have a JSON file whose structure is either unknown or may change over time. I want to replace all values of "REPLACE_ME" with a string of my choice in Python.
Everything I have found assumes I know the structure. For example, I can read the JSON in with json.load and walk through the dictionary to do replacements then write it back. This assumes I know Key names, structure, etc.
How can I replace ALL of a given string value in a JSON file with something else?
This function recursively replaces all strings which equal the value original with the value new.
This function works on the Python structure, but of course you can use it on a JSON file by loading the file with json.load first.
It doesn't replace keys in the dictionary - just the values.
def nested_replace( structure, original, new ):
    if type(structure) == list:
        return [nested_replace( item, original, new) for item in structure]
    if type(structure) == dict:
        return {key : nested_replace(value, original, new)
                for key, value in structure.items() }
    if structure == original:
        return new
    else:
        return structure
d = [ 'replace', {'key1': 'replace', 'key2': ['replace', 'don\'t replace'] } ]
new_d = nested_replace(d, 'replace', 'now replaced')
print(new_d)
['now replaced', {'key1': 'now replaced', 'key2': ['now replaced', "don't replace"]}]
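To connect this to the JSON case from the question, here is a minimal sketch that parses JSON text, applies the same recursive replacement, and serializes the result again. The function is repeated so the snippet runs on its own; the sample document in `raw` is invented for illustration, and with a real file you would use json.load and json.dump on file objects instead.

```python
import json

def nested_replace(structure, original, new):
    # Recurse into lists and dicts; replace only values equal to `original`.
    if isinstance(structure, list):
        return [nested_replace(item, original, new) for item in structure]
    if isinstance(structure, dict):
        return {key: nested_replace(value, original, new)
                for key, value in structure.items()}
    return new if structure == original else structure

# Hypothetical JSON text whose structure we pretend not to know.
raw = '{"a": "REPLACE_ME", "b": [{"c": "REPLACE_ME"}, 1], "REPLACE_ME": "x"}'
replaced = nested_replace(json.loads(raw), "REPLACE_ME", "done")
print(json.dumps(replaced))
```

Note that the key named "REPLACE_ME" is left alone, since only values are compared.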
I think there's no big risk if you want to replace any key or value enclosed with quotes (since quotes are escaped in json unless they are part of a string delimiter).
I would dump the structure, perform a str.replace (with double quotes), and parse again:
import json
d = { 'foo': {'bar' : 'hello'}}
d = json.loads(json.dumps(d).replace('"hello"','"hi"'))
print(d)
result:
{'foo': {'bar': 'hi'}}
I wouldn't risk replacing parts of strings or strings without quotes, because it could change other parts of the file. I can't think of an example where replacing a string without double quotes can change something else.
There are "clean" solutions like adapting from Replace value in JSON file for key which can be nested by n levels but is it worth the effort? Depends on your requirements.
Why not modify the file directly instead of treating it as a JSON?
with open('filepath') as f:
    lines = f.readlines()
for line in lines:
    line = line.replace('REPLACE_ME', 'whatever')
    with open('filepath_new', 'a') as f:
        f.write(line)
You could load the JSON file into a dictionary and recurse through that to find the proper values but that's unnecessary muscle flexing.
The best way is to simply treat the file as a string and do the replacements that way.
import json
json_file = 'my_file.json'
with open(json_file) as f:
    file_data = f.read()
file_data = file_data.replace('REPLACE_ME', 'new string')
<...>
with open(json_file, 'w') as f:
    f.write(file_data)
json_data = json.loads(file_data)
From here the file can be re-written and you can continue to use json_data as a dict.
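One thing to keep in mind with the plain-text approach is that it replaces every occurrence of the marker, including keys and substrings of longer values. The following sketch, using an invented sample document, contrasts replacing the bare marker with replacing the quoted token `"REPLACE_ME"`, which only matches strings that equal the marker exactly.

```python
import json

# Hypothetical document: the marker appears in a key, as a whole value,
# and inside a longer value.
raw = '{"REPLACE_ME_count": 2, "a": "REPLACE_ME", "b": "keep REPLACE_ME here?"}'

# Bare replacement: touches the key and the substring as well.
text_result = json.loads(raw.replace('REPLACE_ME', 'X'))

# Quoted replacement: only strings equal to the marker are changed.
# json.dumps('X') produces the properly quoted and escaped JSON string.
quoted_result = json.loads(raw.replace('"REPLACE_ME"', json.dumps('X')))

print(text_result)
print(quoted_result)
```

Using json.dumps for the replacement text also keeps the file valid if the new string contains quotes or backslashes.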
Well, that depends. If you want to replace all the strings equal to "REPLACE_ME" with the same string, you can use this. The for loop goes through all the keys in the dictionary, and each key is used to select its value. If the value is a nested dictionary, the function calls itself on it; if the value is equal to your search string, it is replaced with the string you want.
search_string = "REPLACE_ME"
replacement = "SOME STRING"
test = {"test1":"REPLACE_ME", "test2":"REPLACE_ME", "test3":"REPLACE_ME", "test4":"REPLACE_ME","test5":{"test6":"REPLACE_ME"}}
def replace_nested(test):
    for key,value in test.items():
        if type(value) is dict:
            replace_nested(value)
        else:
            if value==search_string:
                test[key] = replacement
replace_nested(test)
print(test)
To solve this problem in a dynamic way, I chose to use the same JSON file to declare the variables that we want to replace.
Json File :
{
    "properties": {
        "property_1": "value1",
        "property_2": "value2"
    },
    "json_file_content": {
        "key_to_find": "{{property_1}} is my value",
        "dict1":{
            "key_to_find": "{{property_2}} is my other value"
        }
    }
}
Python code (references Replace value in JSON file for key which can be nested by n levels):
import json
from os import path
def fixup(a_dict: dict, k: str, subst_dict: dict) -> None:
    """
    Function inspired by the other answer referenced above.
    Resolves {{name}} placeholders in every value stored under key k, in place.
    """
    for key in a_dict.keys():
        if key == k:
            for s_k, s_v in subst_dict.items():
                a_dict[key] = a_dict[key].replace("{{"+s_k+"}}", s_v)
        elif type(a_dict[key]) is dict:
            fixup(a_dict[key], k, subst_dict)
# ...
file_path = "my/file/path"
if path.exists(file_path):
    with open(file_path, 'rt') as f:
        json_dict = json.load(f)
    fixup(json_dict["json_file_content"], "key_to_find", json_dict["properties"])
    print(json_dict)  # json with variables resolved
else:
    print("file not found")
Hope it helps

Python Convert string to dict

I have a string :
'{tomatoes : 5 , livestock :{cow : 5 , sheep :2 }}'
and would like to convert it to
{
"tomatoes" : "5" ,
"livestock" :"{"cow" : "5" , "sheep" :"2" }"
}
Any ideas ?
This has been settled in 988251
In short; use the python ast library's literal_eval() function.
import ast
my_string = "{'key':'val','key2':2}"
my_dict = ast.literal_eval(my_string)
The problem with your input string is that it's actually not a valid JSON because your keys are not declared as strings, otherwise you could just use the json module to load it and be done with it.
A simple and dirty way to get what you want is to first turn it into a valid JSON by adding quotation marks around everything that's not a whitespace or a syntax character:
source = '{tomatoes : 5 , livestock :{cow : 5 , sheep :2 }}'
output = ""
quoting = False
for char in source:
    if char.isalnum():
        if not quoting:
            output += '"'
            quoting = True
    elif quoting:
        output += '"'
        quoting = False
    output += char
print(output) # {"tomatoes" : "5" , "livestock" :{"cow" : "5" , "sheep" :"2" }}
This gives you a valid JSON so now you can easily parse it to a Python dict using the json module:
import json
parsed = json.loads(output)
# {'livestock': {'sheep': '2', 'cow': '5'}, 'tomatoes': '5'}
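A variation on the same idea, sketched here with a regular expression: quote only the keys (a word followed by a colon) and leave the values untouched, so that numbers are parsed as numbers rather than strings. This assumes keys consist of word characters only and that no value contains a colon; whether numeric values or string values are wanted is a choice this sketch makes, not something the question settles.

```python
import json
import re

source = '{tomatoes : 5 , livestock :{cow : 5 , sheep :2 }}'

# Wrap every word that is directly followed by a colon in double quotes.
quoted_keys = re.sub(r'(\w+)\s*:', r'"\1":', source)
parsed = json.loads(quoted_keys)
print(parsed)
```

If string values such as "5" are required, the character loop above (which quotes every alphanumeric run) is the closer match.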
What you have is a JSON-formatted string which you want to convert to a Python dictionary.
Using the json library:
import json
with open("your file", "r") as f:
    dictionary = json.loads(f.read())
Now dictionary contains the data structure you're looking for.
Here is my answer:
dict_str = '{tomatoes: 5, livestock: {cow: 5, sheep: 2}}'
def dict_from_str(dict_str):
    while True:
        try:
            dict_ = eval(dict_str)
        except NameError as e:
            # the message looks like: name 'tomatoes' is not defined
            key = str(e).split("'")[1]
            dict_str = dict_str.replace(key, "'{}'".format(key))
        else:
            return dict_
print(dict_from_str(dict_str))
My strategy is to convert the dictionary str to a dict by eval. However, I first have to deal with the fact that your dictionary keys are not enclosed in quotes. I do that by evaluating it anyway and catching the error. From the error message, I extract the key that was interpreted as an unknown variable, and enclose it with quotes.

Parsing a file that looks like JSON to a JSON

I'm trying to figure out what is the best way to go about this problem:
I'm reading text lines from a certain buffer that eventually creates a certain log that looks something like this:
Some_Information: here there's some information about date and hour
Additional information: log summary #1234:
details {
name: "John Doe"
address: "myAdress"
phone: 01234567
}
information {
age: 30
height: 1.70
weight: 70
}
I would like to get all the fields in this log into a dictionary which I can later turn into a JSON file. The different sections in the log are not important, so for example if myDictionary is a dictionary variable in Python I would like to have:
> myDictionary['age']
will show me 30.
and the same for all other fields.
Speed is very important here, which is why I would like to go through every line only once and build the dictionary as I go.
My approach would be to split each line that contains a ":" and store the resulting key and value in the dictionary.
Is there a better way to do it?
Is there any Python module that would be sufficient?
If more information is needed please let me know.
Edit:
So I've tried something that, to me, looks to work best so far.
I am currently reading from a file to simulate the reading of the buffer.
My code:
import json
import shlex
newDict = dict()
with open('log.txt') as f:
    for line in f:
        try:
            line = line.replace(" ", "")
            stringSplit = line.split(':')
            key = stringSplit[0]
            value = stringSplit[1]
            value = shlex.split(value)
            newDict[key] = value[0]
        except:
            continue
with open('result.json', 'w') as fp:
    json.dump(newDict, fp)
Resulting in the following .json:
{"name": "JohnDoe", "weight": "70", "Additionalinformation": "logsummary#1234",
"height": "1.70", "phone": "01234567", "address": "myAdress", "age": "30"}
You haven't described exactly what the desired output should be from the sample input, so it's not completely clear what you want done. So I guessed and the following only extracts data values from lines following one that contains a '{' until one with a '}' in it is encountered, while ignoring others.
It uses the re module to isolate the two parts of each dictionary item definition found on the line, and then uses the ast module to convert the value portion of that into a valid Python literal (i.e. string, number, tuple, list, dict, bool, and None).
import ast
import json
import re
pat = re.compile(r"""(?P<key>\w+)\s*:\s*(?P<value>.+)$""")
data_dict = {}
with open('log.txt') as f:
    braces = 0
    for line in (line.strip() for line in f):
        # check for braces first so that a closing brace really ends a section
        if '{' in line:
            braces += 1
        elif '}' in line:
            braces -= 1
        elif braces > 0:
            match = pat.search(line)
            if match:
                key = match.group('key')
                raw = match.group('value')
                try:
                    value = ast.literal_eval(raw)
                except (ValueError, SyntaxError):
                    # e.g. 01234567 is not a valid literal in Python 3; keep the text
                    value = raw
                data_dict[key] = value
        # lines outside of any braces are ignored
print(json.dumps(data_dict, indent=4))
Output from your example input (on Python 3, where the phone number with its leading zero is kept as a string):
{
    "name": "John Doe",
    "address": "myAdress",
    "phone": "01234567",
    "age": 30,
    "height": 1.7,
    "weight": 70
}

How to parse a string that looks like JSON with lots of embedded classes in python?

I have a string that lists the properties of a request event.
My string looks like:
requestBody: {
propertyA = 1
propertyB = 2
propertyC = {
propertyC1 = 1
propertyC2 = 2
}
propertyD = [
{ propertyD1 = { propertyD11 = 1}},
{ propertyD1 = [ {propertyD21 = 1, propertyD22 = 2},
{propertyD21 = 3, propertyD22 = 4}]}
]
}
I have tried to replace the "=" with ":" so that I can put it into a JSON reader in python, but JSON also requires that key and value are stored in string with double quotes and a "," to separate each KV pair. This then became a little complicated to implement. What are some better approaches to parsing this into python dictionary with exactly the same structure (e.g. embedded dictionaries are also preserved)?
Question:
If I were to write a full parser, what's the main pattern that I should tackle? Storing parenthesis in a stack until the parenthesis completes?
This is a nice case for using pyparsing, especially since it adds the issue of recursive structuring.
The short answer is the following parser (processes everything after the leading requestBody :):
from pyparsing import (Suppress, LineEnd, Word, nums, alphas, alphanums,
                       quotedString, Forward, Group, Dict, Optional, delimitedList)
LBRACE,RBRACE,LBRACK,RBRACK,EQ = map(Suppress, "{}[]=")
NL = LineEnd().setName("NL")
# define special delimiter for lists and objects, since they can be
# comma-separated or just newline-separated
list_delim = NL | ','
list_delim.leaveWhitespace()
# use a parse action to convert numeric values to ints or floats at parse time
def convert_number(t):
    try:
        return int(t[0])
    except ValueError:
        return float(t[0])
number = Word(nums, nums+'.').addParseAction(convert_number)
qs = quotedString
# forward-declare value, since it will be defined recursively
obj_value = Forward()
ident = Word(alphas, alphanums+'_')
obj_property = Group(ident + EQ + obj_value)
# use Dict wrapper to auto-define nested properties as key-values
obj = Group(LBRACE + Dict(Optional(delimitedList(obj_property, delim=list_delim))) + RBRACE)
obj_array = Group(LBRACK + Optional(delimitedList(obj, delim=list_delim)) + RBRACK)
# now assign to previously-declared obj_value, using '<<=' operator
obj_value <<= obj_array | obj | number | qs
# parse the data (sample is assumed to hold the text that follows "requestBody:")
res = obj.parseString(sample)[0]
# convert the result to a dict
import pprint
pprint.pprint(res.asDict())
gives
{'propertyA': 1,
'propertyB': 2,
'propertyC': {'propertyC1': 1, 'propertyC2': 2},
'propertyD': {'propertyD1': {'propertyD11': 1},
'propertyD2': {'propertyD21': 3, 'propertyD22': 4}}}
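On the question of what the main pattern of a hand-written parser would be: a common choice is recursive descent, where the call stack plays the role of the bracket stack. The sketch below is not the pyparsing grammar above but a separate, simplified illustration. It assumes values are only numbers, `{...}` objects, or `[...]` lists, that commas are optional separators, and it does no error reporting; the names `TOKEN` and `parse` are made up for this example.

```python
import re

# Tokens: structural characters, identifiers, and integer or decimal numbers.
TOKEN = re.compile(r'[{}\[\]=,]|[A-Za-z_]\w*|\d+(?:\.\d+)?')

def parse(text):
    tokens = TOKEN.findall(text)
    pos = 0

    def value():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == '{':                      # object: key = value pairs
            obj = {}
            while tokens[pos] != '}':
                if tokens[pos] == ',':      # commas are optional separators
                    pos += 1
                    continue
                key = tokens[pos]
                assert tokens[pos + 1] == '=', 'expected "=" after key'
                pos += 2                    # skip the key and '='
                obj[key] = value()          # recursion handles nesting
            pos += 1                        # consume '}'
            return obj
        if tok == '[':                      # list of values
            items = []
            while tokens[pos] != ']':
                if tokens[pos] == ',':
                    pos += 1
                    continue
                items.append(value())
            pos += 1                        # consume ']'
            return items
        return float(tok) if '.' in tok else int(tok)

    return value()

sample = """{
propertyA = 1
propertyC = { propertyC1 = 1
propertyC2 = 2 }
propertyD = [
{ propertyD1 = { propertyD11 = 1}},
{ propertyD1 = [ {propertyD21 = 1, propertyD22 = 2}, {propertyD21 = 3, propertyD22 = 4}]}
]
}"""
result = parse(sample)
print(result['propertyD'][1]['propertyD1'][1]['propertyD22'])
```

Each opening bracket triggers a recursive call and each closing bracket returns from one, so no explicit stack of parentheses is needed.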

Custom format to JSON

How can I convert the following line (not sure what format this is) to JSON format?
[root=Root [key1=value1, key2=value2, key3=Key3 [key3_1=value3_1, key3_2=value3_2, key3_3=Key3_3 [key3_3_1=value3_3_1]], key4=value4]]
where Root, Key3, Key3_3 denote complex elements.
to
{
"root": {
"key1" : "value1",
"key2" : "value2",
"key3" : {
"key3_1" : "value3_1",
"key3_2" : "value3_2",
"key3_3" : {
"key3_3_1" : "value3_3_1"
}
},
"key4" : "value4"
}
}
I am looking for approach and not solution. If you are down-voting this question, Please comment why you are doing so.
Let x be a string with the above serialization.
First, let's replace the occurrences of Root, Key3 and Key3_3 with empty strings:
# the string fragments like "root=Root [" need to be replaced by "root=["
# to achieve this, we match the regex pattern "\w+ ["
# This matches ALL instances in the input string where we have a word bounded by "=" & " [",
# i.e. "Root [", "Key3 [", "Key3_3 [" are all matched, as will any other example you can think of
# where the `word` is composed of letters, numbers or underscores followed
# by a single space character and then "["
# We replace this fragment with "[", (which we will later replace with "{")
# giving us the transformation "root=Root [" => "root=["
import re
o = re.compile(r'\w+ [[]')
y = re.sub(o, '[', x, 0)
Then, let's split the resulting string into words and non-words:
# Here we split the string into two lists, one containing adjacent tokens (nonwords)
# and the other containing the words
# The idea is to split / recombine the source string with quotes around all our words
w = re.compile(r'\W+')
nw = re.compile(r'\w+')
words = w.split(y)[1:-1] # ignore the end elements which are empty.
nonwords = nw.split(y) # list elements are contiguous non-word characters, i.e not a-Z_0-9
struct = '"{}"'.join(nonwords) # format structure of final output with quotes around the word's placeholder.
almost_there = struct.format(*words) # insert words into the string
And finally, replace the square brackets with curly ones, and = with :
jeeson = almost_there.replace(']', '}').replace('=', ':').replace('[', '{')
# '{"root":{"key1":"value1", "key2":"value2", "key3":{"key3_1":"value3_1", "key3_2":"value3_2", "key3_3":{"key3_3_1":"value3_3_1"}}, "key4":"value4"}}'
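The three steps can also be collected into one function and checked by feeding the result to json.loads. This is a condensed sketch of the same idea rather than the exact code above: it quotes the words with a second `re.sub` instead of the split/join trick, and it assumes keys and values consist only of letters, digits and underscores. The name `custom_to_json` is invented for this example.

```python
import json
import re

def custom_to_json(x):
    # "root=Root [" -> "root=[" : drop the type name in front of each "["
    y = re.sub(r'\w+ \[', '[', x)
    # put double quotes around every remaining word
    y = re.sub(r'\w+', lambda m: '"' + m.group(0) + '"', y)
    # square brackets become braces, "=" becomes ":"
    return y.replace('[', '{').replace(']', '}').replace('=', ':')

x = ('[root=Root [key1=value1, key2=value2, key3=Key3 [key3_1=value3_1, '
     'key3_2=value3_2, key3_3=Key3_3 [key3_3_1=value3_3_1]], key4=value4]]')
as_dict = json.loads(custom_to_json(x))
print(as_dict['root']['key3']['key3_3'])
```

Values containing spaces or punctuation would need a real tokenizer instead of `\w+`.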
I had to spend around two hours on this, but I think I have something which would work for all the cases based on the format you provided. If not, I am sure it'll be a minor change. Even though you asked only for the idea, since I coded it up anyway, here's the Python code.
import json
def to_json(cust_str):
    from_index = 0
    left_indices = []
    levels = {}
    level = 0
    for i, char in enumerate(cust_str):
        if char == '[':
            level += 1
            left_indices.append(i)
            if level in levels:
                levels[level] += 1
            else:
                levels[level] = 1
        elif char == ']':
            level -= 1
    level = max(levels.keys())
    value_stack = []
    while True:
        left_index = left_indices.pop()
        right_index = cust_str.find(']', left_index) + 1
        values = {}
        pairs = cust_str[left_index:right_index][1:-1].split(',')
        if levels[level] > 0:
            for pair in pairs:
                pair = pair.split('=')
                values[pair[0].strip()] = pair[1]
        else:
            level -= 1
            for pair in pairs:
                pair = pair.split('=')
                if pair[1][-1] == ' ':
                    values[pair[0].strip()] = value_stack.pop()
                else:
                    values[pair[0].strip()] = pair[1]
        value_stack.append(values)
        levels[level] -= 1
        cust_str = cust_str[:left_index] + cust_str[right_index:]
        if levels[1] == 0:
            return json.dumps(values)
if __name__ == '__main__':
    # Data in custom format
    cust_str = '[root=Root [key1=value1, key2=value2, key3=Key3 [key3_1=value3_1, key3_2=value3_2, key3_3=Key3_3 [key3_3_1=value3_3_1]], key4=value4]]'
    # Data in JSON format
    json_str = to_json(cust_str)
    print(json_str)
The idea is that, we map the number of levels the dicts go to in the custom format and the number of values which are not strings corresponding to those levels. Along with that, we keep track of the indices of the [ character in the given string. We then start from the innermost dict representation by popping the stack containing the [ (left) indices and parse them. As each of them is parsed, we remove them from the string and continue. The rest you can probably read in the code.
I ran it for the data you gave and the result is as follows.
{
"root":{
"key2":"value2",
"key3":{
"key3_2":"value3_2",
"key3_3":{
"key3_3_1":"value3_3_1"
},
"key3_1":"value3_1"
},
"key1":"value1",
"key4":"value4"
}
}
Just to make sure it works for more general cases, I used this custom string.
[root=Root [key1=value1, key2=Key2 [key2_1=value2_1], key3=Key3 [key3_1=value3_1, key3_2=Key3_2 [key3_2_1=value3_2_1], key3_3=Key3_3 [key3_3_1=value3_3_1]], key4=value4]]
And parsed it.
{
"root":{
"key2":{
"key2_1":"value2_1"
},
"key3":{
"key3_2":{
"key3_2_1":"value3_2_1"
},
"key3_3":{
"key3_3_1":"value3_3_1"
},
"key3_1":"value3_1"
},
"key1":"value1",
"key4":"value4"
}
}
Which, as far as I can see, is how it should be parsed. Also, remember, do not strip the values since the logic depends on the whitespace at the end of values which should have the dicts as values (if that makes any sense).
