Merging JSON in Python, alternate to eval()?

Suppose I'm dealing with the following two (or more) JSON strings from a dictionary:
JSONdict['context'] = '{"Context":"{context}","PID":"{PID}"}'
JSONdict['RDFchildren'] = '{"results":[ {"object" :
"info:fedora/book:fullbook"} ,{"object" :
"info:fedora/book:images"} ,{"object" :
"info:fedora/book:HTML"} ,{"object" :
"info:fedora/book:altoXML"} ,{"object" :
"info:fedora/book:thumbs"} ,{"object" :
"info:fedora/book:originals"} ]}'
I would like to create a merged JSON string, with "context" and "RDFchildren" as root-level keys. Something like this:
{"context": {"PID": "wayne:campbellamericansalvage", "Context":
"object_page"}, "RDFchildren": {"results": [{"object":
"info:fedora/book:fullbook"}, {"object":
"info:fedora/book:images"}, {"object":
"info:fedora/book:HTML"}, {"object":
"info:fedora/book:altoXML"}, {"object":
"info:fedora/book:thumbs"}, {"object":
"info:fedora/book:originals"}]}}
The following works, but I'd like to avoid using eval() if possible.
# using eval
JSONevaluated = {}
for each in JSONdict:
    JSONevaluated[each] = eval(JSONdict[each])
JSONpackage = json.dumps(JSONevaluated)
I also got the following approach working, but it feels hackish, and I'm afraid encoding and escaping will become problematic as more realistic metadata comes through:
# iterate through the dictionary, unpack the strings and concatenate
concatList = []
for key in JSONdict:
    tempstring = JSONdict[key][1:-1]  # strips the outer curly braces
    concatList.append(tempstring)
JSONpackage = ",".join(concatList)  # comma-delimits the fragments
JSONpackage = "{" + JSONpackage + "}"  # adds braces back for well-formed JSON
Any thoughts or advice?

You can use json.loads() instead of eval() in your first example.
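A minimal sketch of that change, reusing the JSONdict from the question:
import json

JSONevaluated = {}
for each in JSONdict:
    JSONevaluated[each] = json.loads(JSONdict[each])  # parse each JSON string safely instead of eval()
JSONpackage = json.dumps(JSONevaluated)  # serialize the merged dict back into one JSON string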


Parsing data containing escaped quotes and separators in python

I have data that is structured like this:
1661171420, foo="bar", test="This, is a \"TEST\"", count=5, com="foo, bar=blah"
It always starts with a Unix timestamp, but I can't know how many other fields follow or what they are called.
The goal is to parse this into a dictionary as such:
{"timestamp": 1661171420,
"foo": "bar",
"test": 'This, is a "TEST"',
"count": 5,
"com": "foo, bar=blah"}
I'm having trouble parsing this, especially regarding the escaped quotes and commas in the values.
What would be the best way to parse this correctly, preferably without any third-party modules?
If changing the format of the input data is not an option (JSON would be much easier to handle, but if it comes from an API as you say, then you might be stuck with this), the following should work, assuming the data more or less follows the given structure. Not the cleanest solution, I agree, but it does the job.
import re

# Escaped quotes are temporarily swapped for a placeholder so that the string
# pattern does not stop early on a \" inside a value; they are restored below.
d = r'''1661171420, foo="bar", test="This, is a \"TEST\"", count=5, com="foo, bar=blah", fraction=-0.11'''.replace(r"\"", "'''")

string_pattern = re.compile(r'''(\w+)="([^"]*)"''')
matches = re.finditer(string_pattern, d)

parsed_data = {}
parsed_data['timestamp'] = int(d.partition(", ")[0])  # everything before the first ", "

# key="value" pairs; put the escaped quotes back when storing the value
for match in matches:
    parsed_data[match.group(1)] = match.group(2).replace("'''", "\"")

# bare numeric values such as count=5 or fraction=-0.11
number_pattern = re.compile(r'''(\w+)=([+-]?\d+(?:\.\d+)?)''')
matches = re.finditer(number_pattern, d)
for match in matches:
    try:
        parsed_data[match.group(1)] = int(match.group(2))
    except ValueError:
        parsed_data[match.group(1)] = float(match.group(2))

print(parsed_data)
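For reference, running this on the sample line above should print something close to {'timestamp': 1661171420, 'foo': 'bar', 'test': 'This, is a "TEST"', 'com': 'foo, bar=blah', 'count': 5, 'fraction': -0.11} (key order may vary slightly).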

Replace a string by another string in an "array"

My file so far looks like this:
[
    {
        "asks" : [
            [
                0.00276477,
                NumberInt(9)
            ],
            [
                0.00276478,
                NumberInt(582)
            ]
        ]
    }
]
I would like to replace "NumberInt(9)" with just the digit 9.
What I tried so far looks like this:
json_data=open("test.json").read()
number = re.findall("NumberInt\(([0-9]+)\)", json_data)
Nint = re.findall("(Nu.*)", json_data)
json_data.replace('Nint', 'number')
But it does not replace it in my original file... Does someone have an idea?
Here's how to do it, based on the documentation of re.sub():
with open("test.json") as file:
json_data = file.read()
new_json = re.sub("NumberInt\(([0-9]+)\)", r"\1", json_data)
Note that re.sub() returns a copy of the string, just like the built-in str.replace() method does.
First point: use re.sub() instead of str.replace() here. Also note that Python strings are immutable, so in both cases you have to rebind your variable to the result of the function.
Second point: your file will of course not be updated unless you explicitly do it yourself: you have to write the corrected string back to the file (reopening the file in write mode).
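Putting both points together, a minimal sketch (assuming the file is test.json, as in the question):
import re

with open("test.json") as file:
    json_data = file.read()

# keep only the digits captured inside NumberInt(...)
new_json = re.sub(r"NumberInt\((\d+)\)", r"\1", json_data)

# reopen the file in write mode and write the corrected string back
with open("test.json", "w") as file:
    file.write(new_json)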

Python: extract numbers after a string with regex

I have data like this:
{"address": "1GocfVCWTiRViPtqZetcX4UiCxnKxgTHwK", "balance": 1234, "pending_balance": 0, "paid_out": 0}
I want to extract the number after balance, but it can be anything from 0 to infinity.
So, from the example above, the desired output is:
1234
And, by the way, one more question.
I have got data like this:
{"address": "1GocfVCWTiRViPtqZetcX4UiCxnKxgTHwK", "invoice": "invNrKU2ZFMuAJKUiejyVe3X34ybP9awyWZBfUEdY2dZKxYTB8ajW", "redeem_code": "BTCvQDD9xFYHHDYNi1JYeLY1eEkGFBFB49qojETjLBZ2CVYyPm56B"}
What's the normal way of doing that:
strs = repr(s)
address = s[13:47]
invoice = s[62:115]
redeem_code = s[134:187]
print(address)
print(invoice)
print(redeem_code)
Thx for help.
Don't ever use regexes to parse structured data like this. Once parsed by proper means (json.loads or ast.literal_eval both work here), the data becomes a native Python structure that is trivial to access.
In your case, using json.loads in one line:
import json
print(json.loads('{"address": "1GocfVCWTiRViPtqZetcX4UiCxnKxgTHwK", "balance": 1234, "pending_balance": 0, "paid_out": 0}')["balance"])
result:
1234
(the same method applies to your second question)
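For instance, a quick sketch of the same approach on the second string from the question (parse once, then read the fields by name instead of slicing):
import json

s = '{"address": "1GocfVCWTiRViPtqZetcX4UiCxnKxgTHwK", "invoice": "invNrKU2ZFMuAJKUiejyVe3X34ybP9awyWZBfUEdY2dZKxYTB8ajW", "redeem_code": "BTCvQDD9xFYHHDYNi1JYeLY1eEkGFBFB49qojETjLBZ2CVYyPm56B"}'
d = json.loads(s)
print(d["address"])
print(d["invoice"])
print(d["redeem_code"])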
Actually, what you are showing us is what in Python is called a dictionary.
That is, a set of keys and values.
Look here for more info: https://docs.python.org/3.6/tutorial/datastructures.html#dictionaries
Your dictionary has the following keys and values:
"address" --> "1GocfVCWTiRViPtqZetcX4UiCxnKxgTHwK"
"balance" --> 1234
"pending_balance" --> 0
"paid_out" --> 0
Now if what you have is a dictionary:
d = {"address": "1GocfVCWTiRViPtqZetcX4UiCxnKxgTHwK", "balance": 1234, "pending_balance": 0, "paid_out": 0}
print(d.get('balance'))  # 1234
If, however, what you have is an external file with that information, or you got it from a web service of some sort, then you have a string representation of a dictionary. This is where the json library becomes valuable:
import json
# Assuming you got a string
s = '{"address": "1GocfVCWTiRViPtqZetcX4UiCxnKxgTHwK", "balance": 1234, "pending_balance": 0, "paid_out": 0}'
d = json.loads(s) # <-- converts the string to a dictionary
print(d.get('balance')) #1234
Your data looks like JSON, so the preferred way of dealing with it is to parse it using the json module:
import json
parsed_data = json.loads(data)
balance = parsed_data['balance']
If using regular expressions is a must, you can use the following code:
import re
match = re.search(r'"balance": (\d+)', data)
balance = int(match.group(1))
In this example we use \d+ to match a string of digits and parentheses to create a group. Group 0 is the whole matched string, and group 1 is the first group we created.

decoding JSON data with backslash encoding

I have the following JSON data.
"[
\"msgType\": \"0\",
\"tid\": \"1\",
\"data\": \"[
{
\\\"EventName\\\": \\\"TExceeded\\\",
\\\"Severity\\\": \\\"warn\\\",
\\\"Subject\\\": \\\"Exceeded\\\",
\\\"Message\\\": \\\"tdetails: {
\\\\\\\"Message\\\\\\\": \\\\\\\"my page tooktoolong(2498ms: AT: 5ms,
BT: 1263ms,
CT: 1230ms),
andexceededthresholdof5ms\\\\\\\",
\\\\\\\"Referrer\\\\\\\": \\\\\\\"undefined\\\\\\\",
\\\\\\\"Session\\\\\\\": \\\\\\\"None\\\\\\\",
\\\\\\\"ResponseTime\\\\\\\": 0,
\\\\\\\"StatusCode\\\\\\\": 0,
\\\\\\\"Links\\\\\\\": 215,
\\\\\\\"Images\\\\\\\": 57,
\\\\\\\"Forms\\\\\\\": 2,
\\\\\\\"Platform\\\\\\\": \\\\\\\"Linuxx86_64\\\\\\\",
\\\\\\\"BrowserAppname\\\\\\\": \\\\\\\"Netscape\\\\\\\",
\\\\\\\"AppCodename\\\\\\\": \\\\\\\"Mozilla\\\\\\\",
\\\\\\\"CPUs\\\\\\\": 8,
\\\\\\\"Language\\\\\\\": \\\\\\\"en-GB\\\\\\\",
\\\\\\\"isEvent\\\\\\\": \\\\\\\"true\\\\\\\",
\\\\\\\"PageLatency\\\\\\\": 2498,
\\\\\\\"Threshold\\\\\\\": 5,
\\\\\\\"AT\\\\\\\": 5,
\\\\\\\"BT\\\\\\\": 1263,
\\\\\\\"CT\\\\\\\": 1230
}\\\",
\\\"EventTimestamp\\\": \\\"1432514783269\\\"
}
]\",
\"Timestamp\": \"1432514783269\",
\"AppName\": \"undefined\",
\"Group\": \"UndefinedGroup\"
]"
I want to make this JSON into a single level of wrapping, i.e. I want to remove the nested structure inside and copy that data over to the top-level JSON structure. How can I do this?
If this structure is named json_data,
I want to be able to access
json_data['Platform']
json_data['BrowserAppname']
json_data['Severity']
json_data['msgType']
Basically some kind of rudimentary normalization. What is the easiest way to do this using Python?
A generally unsafe, but probably okay in this case, solution would be:
import json
d = json.loads(json_string.replace('\\', ''))
I'm not sure what happened but this doesn't look like valid JSON.
1. You have some double-quotes escaped once, some twice, some three times, etc.
2. You have key/value pairs inside of a list-like object [].
3. tdetails is missing a trailing quote.
Even if you fix the above, you still have your data list quoted as a multi-line string, which is invalid.
It appears that this "JSON" was constructed by hand, by someone with no knowledge of JSON.
You can try "massaging" the data into JSON with the following:
import re

x = re.sub(r'\\+', '', js_str)   # strip the runs of backslashes
x = re.sub(r'\n', '', x)         # drop the literal newlines
x = '{' + x.strip()[1:-1] + '}'  # swap the outer [ ] for { }
This would make the string almost JSON-like, but you still need to fix point #3.

How to read text file in python with unsuccessful format?

I made a big mistake when I chose the way of dumping my data.
Now I have a text file that consists of
{ "13234134": ["some", "strings", ...]}{"34545345": ["some", "strings", ...]} ...and so on
How can I read it into Python?
Edit:
I have tried json; when I add curly braces manually at the beginning and end of the file, I get "ValueError: Expecting property name:", maybe because the "13234134" string is invalid for JSON. I do not know how to avoid it.
Edit 1:
with open('new_file.txt', 'w') as outfile:
    for index, user_id in enumerate(users):
        json.dump(get_user_tweets(user_id), outfile)
It looks like what you have is an undelimited stream of JSON objects. As if you'd called json.dump over and over on the same file, or ''.join(json.dumps(…) for …). And, in fact, the first one is exactly what you did. :)
So, you're in luck. JSON is a self-delimiting format, which means you can read up to the end of the first JSON object, then read from there up to the end of the next JSON object, and so on. The raw_decode method essentially does the hard part.
There's no stdlib function that wraps it up, and I don't know of any library that does it, but it's actually very easy to do yourself:
import json

def loads_multiple(s):
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(s):
        obj, pos = decoder.raw_decode(s, pos)  # raw_decode returns (object, end_index)
        yield obj
So, instead of doing this:
obj = json.loads(s)
do_stuff_with(obj)
… you do this:
for obj in loads_multiple(s):
    do_stuff_with(obj)
Or, if you want to combine all the objects into one big list:
objs = list(loads_multiple(s))
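As a usage sketch, assuming the objects were dumped into new_file.txt as in the question (do_stuff_with is a placeholder for whatever processing you need):
with open('new_file.txt') as f:
    contents = f.read()

for obj in loads_multiple(contents):
    do_stuff_with(obj)  # each obj is one of the dumped dictionaries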
Consider simply rewriting it into something that is valid JSON. If indeed your bad data only contains the format that you've shown (a series of JSON structures that are not comma-separated), then just add commas and square brackets:
import json

with open('/tmp/sto/junk.csv') as f:
    data = f.read()
print(data)

# add commas between the back-to-back objects and wrap everything in [ ]
s = "[ {} ]".format(data.strip().replace("}{", "},{"))
print(s)

data = json.loads(s)
print(type(data))
Output:
{ "13234134": ["some", "strings"]}{"34545345": ["some", "strings", "like", "this"]}
[ { "13234134": ["some", "strings"]},{"34545345": ["some", "strings", "like", "this"]} ]
<class 'list'>
