I have a large file that contains valid nested JSON on each line. Each JSON object looks like this (the real data is much bigger, so this piece of JSON is shown just for illustration):
{"location":{"town":"Rome","groupe":"Advanced",
"school":{"SchoolGroupe":"TrowMet", "SchoolName":"VeronM"}},
"id":"145",
"Mother":{"MotherName":"Helen","MotherAge":"46"},"NGlobalNote":2,
"Father":{"FatherName":"Peter","FatherAge":"51"},
"Study":[{
"Teacher":["MrCrock","MrDaniel"],
"Field":{"Master1":["Marketing", "Politics", "Philosophy"],
"Master2":["Economics", "Management"], "ExamCode": "1256"}
}],
"season":["summer","spring"]}
I need to parse this file to extract only some key-value pairs from every JSON object, and obtain a dataframe that should look like:
Groupe    Id  MotherName  FatherName  Master2
Advanced  56  Laure       James       Economics, Management
Middle    11  Ann         Nicolas     Web-development
Advanced  6   Helen       Franc       Literature, English Language
I used the .get method suggested to me in another question, but it doesn't work with nested JSON. For instance, if I try:
def extract_data(data):
    """Convert one JSON dict to records for import."""
    dummy = {}
    jfile = json.loads(data.strip())
    return (
        jfile.get('Study', dummy).get('Field', np.nan).get('Master1', np.nan),
        jfile.get('location', dummy).get('groupe', np.nan))
the line jfile.get('Study', dummy).get('Field', np.nan).get('Master1', np.nan) throws an error:
AttributeError: 'list' object has no attribute 'get'
Obviously this happens because the value of "Study" is neither a dictionary nor a list, but valid JSON! How can I deal with this problem? Is there a method that works like .get, but for JSON? I guess there is another option: decode this JSON and then parse it with .get, but the problem is that it sits inside another JSON object, so I have no clue how to decode it!
Data is a valid JSON formatted string. JSON contains four basic elements:
Object: defined with curly braces {}
Array: defined with square brackets []
Value: can be a string, a number, an object, an array, or the literals true, false or null
String: delimited by double quotes and containing Unicode characters or common backslash escapes
Using json.loads will convert the string into a python object recursively. It means that every inner JSON element will be represented as a python object.
Therefore:
jfile.get('Study') ---> python list
To retrieve Field you should iterate over the study list:
jfile = json.loads(data.strip())
study_list = jfile.get('Study', [])  # keep the default the same type as the real value
for item in study_list:
    print(item.get('Field'))
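Putting this together for the columns asked about in the question, here is a minimal sketch. It assumes every line has the layout shown in the question; the helper name extract_row and the choice to join list values with ", " are illustrative and not from the original post.

```python
import json

def extract_row(line):
    """Pull a few nested values out of one JSON line (hypothetical helper)."""
    record = json.loads(line.strip())
    location = record.get('location', {})
    mother = record.get('Mother', {})
    father = record.get('Father', {})
    # 'Study' is a list of dicts, so collect Master2 from every element.
    master2 = []
    for study in record.get('Study', []):
        master2.extend(study.get('Field', {}).get('Master2', []))
    return {
        'Groupe': location.get('groupe'),
        'Id': record.get('id'),
        'MotherName': mother.get('MotherName'),
        'FatherName': father.get('FatherName'),
        'Master2': ', '.join(master2),
    }

# Shortened version of the sample line from the question.
sample = (
    '{"location":{"town":"Rome","groupe":"Advanced"},"id":"145",'
    '"Mother":{"MotherName":"Helen","MotherAge":"46"},'
    '"Father":{"FatherName":"Peter","FatherAge":"51"},'
    '"Study":[{"Teacher":["MrCrock","MrDaniel"],'
    '"Field":{"Master2":["Economics","Management"]}}]}'
)
row = extract_row(sample)
print(row)
```

Every default passed to .get has the same type as the real value ({} for objects, [] for arrays), so a missing key yields None or an empty string instead of an AttributeError. A list of such dicts, one per line of the file, can then be handed to pandas.DataFrame if pandas is available.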
I have pulled JSON data from a URL. The result is a dictionary. How can I transform this dictionary so that each metric is a column and the time is the index for each value?
Thanks in advance
time                      AdrActCnt  BlkCnt  BlkSizeByte
2021-01-28T00:00:00.000Z  1097896.0  145.0   190568423.0
2021-01-29T00:00:00.000Z  1208741.0  152.0   199725189.0
2021-01-29T00:00:00.000Z  1087755.0  136.0   177349536.0
Output:
{"metricData":{"metrics":["AdrActCnt","BlkCnt","BlkSizeByte"],"series":
[{"time":"2021-01-28T00:00:00Z","values":["1097896.0","145.0","190568423.0"]},
{"time":"2021-01-29T00:00:00Z","values":["1208741.0","152.0","199725189.0"]},
{"time":"2021-01-30T00:00:00Z","values":["1087755.0","136.0","177349536.0"]}
You may be looking for a dict comprehension, which is similar to a list comprehension but creates a dictionary instead:
liststuff = [{"time":"2021-01-28T00:00:00.000Z","values":["1097896.0","145.0","190568423.0"]},{"time":"2021-01-29T00:00:00.000Z","values":["1208741.0","152.0","199725189.0"]},{"time":"2021-01-30T00:00:00.000Z","values":["1087755.0","136.0","177349536.0"]}]
dictstuff = {item['time']:item['values'] for item in liststuff}
print(dictstuff)
{'2021-01-28T00:00:00.000Z': ['1097896.0', '145.0', '190568423.0'], '2021-01-29T00:00:00.000Z': ['1208741.0', '152.0', '199725189.0'], '2021-01-30T00:00:00.000Z': ['1087755.0', '136.0', '177349536.0']}
liststuff is your data; it just needed wrapping in [] (I assume that's a typo in the question, since it's not valid JSON without the brackets). If you need help with parsing the string, use json.loads() (from the json module) to turn it into actual Python data:
import json
jsonstuff = '[{"time":"2021-01-28T00:00:00.000Z","values":["1097896.0","145.0","190568423.0"]},{"time":"2021-01-29T00:00:00.000Z","values":["1208741.0","152.0","199725189.0"]},{"time":"2021-01-30T00:00:00.000Z","values":["1087755.0","136.0","177349536.0"]}]'
liststuff = json.loads(jsonstuff)
(here jsonstuff is the string you've downloaded)
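To also get the metric names as column labels, the "metrics" list can be zipped with each "values" list. A minimal sketch, assuming the response has the "metricData" layout from the question with its missing closing brackets restored; the names payload and table are illustrative.

```python
import json

# Assumed: the question's response with its closing brackets restored
# (only the first two series entries are kept here for brevity).
payload = json.loads(
    '{"metricData":{"metrics":["AdrActCnt","BlkCnt","BlkSizeByte"],"series":['
    '{"time":"2021-01-28T00:00:00Z","values":["1097896.0","145.0","190568423.0"]},'
    '{"time":"2021-01-29T00:00:00Z","values":["1208741.0","152.0","199725189.0"]}]}}'
)

metrics = payload["metricData"]["metrics"]
# Map each timestamp to {metric name: numeric value}.
table = {
    item["time"]: dict(zip(metrics, map(float, item["values"])))
    for item in payload["metricData"]["series"]
}
print(table["2021-01-28T00:00:00Z"]["BlkCnt"])
```

If pandas is installed, pandas.DataFrame.from_dict(table, orient="index") should turn this nested dict into a frame with the timestamps as the index and one column per metric.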
I currently have a python dictionary where the keys are strings representing URLs, and the values are also string URLs.
@socketio.on('blacklist', namespace='/update')
def add_to_blacklist(message):
    stored_blacklist.clear()
    for key, val in message.items():
        stored_blacklist[key] = val
    lookup = {'name':'blacklist'}
    resp = patch_internal('blacklist', payload={"sites":stored_blacklist}, **lookup)
However there is an issue with how python is interpreting the insertions. For example:
stored_blacklist["example.com"] = "thankspython.edu"
My desired behavior is that stored_blacklist maps "example.com" to "thankspython.edu" like so:
{'example.com':'thankspython.edu'}
However, printing stored_blacklist gives me this instead:
{'example': {'com': 'thankspython.edu'}}
How could I get the desired behavior where a string with a period character in it could be read as a normal string instead of automatically creating some pseudo-JSON object?
I spent several hours on this, tried everything I found online, pulled some of the hair left on my head...
I have this JSON sent to a Flask webservice I'm writing :
{'jsonArray': '[
{
"nom":"0012345679",
"Start":"2018-08-01",
"Finish":"2018-08-17",
"Statut":"Validee"
},
{
"nom":"0012345679",
"Start":"2018-09-01",
"Finish":"2018-09-10",
"Statut":"Demande envoyée au manager"
},
{
"nom":"0012345681",
"Start":"2018-04-01",
"Finish":"2018-04-08",
"Statut":"Validee"
},
{
"nom":"0012345681",
"Start":"2018-07-01",
"Finish":"2018-07-15",
"Statut":"Validee"
}
]'}
I want to simply loop through the records :
app = Flask(__name__)
@app.route('/graph', methods=['POST'])
def webhook():
    if request.method == 'POST':
        req_data = request.get_json()
        print(req_data) #-> shows JSON that seems to be right
        ##print(type(req_data['jsonArray']))
        #j1 = json.dumps(req_data['jsonArray'])
        #j2 = json.loads(req_data['jsonArray'])
        #data = json.loads(j1)
        #for rec in data:
        #    print(rec) #-> This seems to consider rec as one of the characters of the whole JSON string, and prints every character one by one
        #for key in data:
        #    value = data[key]
        #    print("The key and value are ({}) = ({})".format(key, value)) #-> TypeError: string indices must be integers
        for record in req_data['jsonArray']:
            for attribute, value in record.items(): #-> Gives error 'str' object has no attribute 'items'
                print(attribute, value)
I believe I am lost between JSON object, python dict object, strings, but I don't know what I am missing. I really tried to put the JSON received through json.dumps and json.loads methods, but still nothing. What am I missing ??
I simply want to loop through each record to create another python object that I will feed to a charting library like this :
df = [dict(Task="0012345678", Start='2017-01-01', Finish='2017-02-02', Statut='Complete'),
dict(Task="0012345678", Start='2017-02-15', Finish='2017-03-15', Statut='Incomplete'),
dict(Task="0012345679", Start='2017-01-17', Finish='2017-02-17', Statut='Not Started'),
dict(Task="0012345679", Start='2017-01-17', Finish='2017-02-17', Statut='Complete'),
dict(Task="0012345680", Start='2017-03-10', Finish='2017-03-20', Statut='Not Started'),
dict(Task="0012345680", Start='2017-04-01', Finish='2017-04-20', Statut='Not Started'),
dict(Task="0012345680", Start='2017-05-18', Finish='2017-06-18', Statut='Not Started'),
dict(Task="0012345681", Start='2017-01-14', Finish='2017-03-14', Statut='Complete')]
The whole thing is wrapped in single quotes, meaning it's a string and you need to parse it.
for record in json.loads(req_data['jsonArray']):
Looking at your commented code, you did this:
j1 = json.dumps(req_data['jsonArray'])
data = json.loads(j1)
Using json.dumps on a string is the wrong idea, and moreover json.loads(json.dumps(x)) is just the same as x, so that just got you back where you started, i.e. data was the same thing as req_data['jsonArray'] (a string).
This was the right idea:
j2 = json.loads(req_data['jsonArray'])
but you never used j2.
As you've seen, iterating over a string gives you each character of the string.
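As a rough sketch of the whole round trip outside Flask: req_data below is a hand-built stand-in for what request.get_json() returned in the question (shortened to two records), and the renaming of "nom" to Task is an assumption based on the target structure the question shows.

```python
import json

# Stand-in for request.get_json(): 'jsonArray' holds a JSON *string*,
# not a list, which is why iterating it yields single characters.
req_data = {
    'jsonArray': '[{"nom":"0012345679","Start":"2018-08-01",'
                 '"Finish":"2018-08-17","Statut":"Validee"},'
                 '{"nom":"0012345681","Start":"2018-04-01",'
                 '"Finish":"2018-04-08","Statut":"Validee"}]'
}

records = json.loads(req_data['jsonArray'])  # now a list of dicts

# Rename 'nom' to 'Task' to match the structure the chart expects.
df = [
    dict(Task=rec['nom'], Start=rec['Start'],
         Finish=rec['Finish'], Statut=rec['Statut'])
    for rec in records
]
print(df)
```

The only parsing step is the single json.loads call on the inner string; the outer payload has already been decoded by the time it is indexed.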
I'm new to Spark. I have a dataframe that contains the results of some analysis. I converted that dataframe into JSON so I could display it in a Flask App:
results = result.toJSON().collect()
An example entry in my json file is below. I then tried to run a for loop in order to get specific results:
{"userId":"1","systemId":"30","title":"interest"}
for i in results:
    print i["userId"]
This doesn't work at all and I get errors such as: Python (json) : TypeError: expected string or buffer
I used json.dumps and json.loads and still nothing - I keep on getting errors such as string indices must be integers, as well as the above error.
I then tried this:
print i[0]
This gave me the first character in the json "{" instead of the first line. I don't really know what to do, can anyone tell me where I'm going wrong?
Many Thanks.
If the result of result.toJSON().collect() is a JSON encoded string, then you would use json.loads() to convert it to a dict. The issue you're running into is that when you iterate a dict with a for loop, you're given the keys of the dict. In your for loop, you're treating the key as if it's a dict, when in fact it is just a string. Try this:
# toJSON() turns each row of the DataFrame into a JSON string;
# calling first() on the result will fetch the first row.
results = json.loads(result.toJSON().first())
for key in results:
    print(results[key])

# To decode the entire DataFrame, iterate over the result
# of toJSON()
def print_rows(row):
    data = json.loads(row)
    for key in data:
        print("{key}:{value}".format(key=key, value=data[key]))

results = result.toJSON()
results.foreach(print_rows)
EDIT: The issue is that collect returns a list, not a dict. I've updated the code. Always read the docs.
collect() Return a list that contains all of the elements in this RDD.
Note This method should only be used if the resulting array is
expected to be small, as all the data is loaded into the driver’s
memory.
EDIT2: I can't emphasize enough, always read the docs.
EDIT3: Look here.
>>> import json
>>> df = sqlContext.read.table("n1")
>>> df.show()
+-----+-------+----+---------------+-------+----+
| c1| c2| c3| c4| c5| c6|
+-----+-------+----+---------------+-------+----+
|00001|Content| 1|Content-article| |2018|
|00002|Content|null|Content-article|Content|2015|
+-----+-------+----+---------------+-------+----+
>>> results = df.toJSON().map(lambda j: json.loads(j)).collect()
>>> for i in results: print i["c1"], i["c6"]
...
00001 2018
00002 2015
Here is what worked for me:
df_json = df.toJSON()
for row in df_json.collect():
    # JSON string
    print(row)
    # decoded into a Python dict
    line = json.loads(row)
    print(line[some_key])  # some_key is a placeholder for a real column name
Keep in mind that using .collect() is not advisable on large results, since it pulls all of the distributed data into the driver's memory and defeats the purpose of using DataFrames.
To get an array of python dicts:
results = df.toJSON().map(json.loads).collect()
To get an array of JSON strings:
results = df.toJSON().collect()
To get a JSON string (i.e. a JSON string of an array):
results = df.toPandas().to_json(orient='records')
and using that to get an array of Python dicts:
results = json.loads(df.toPandas().to_json(orient='records'))
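The difference between these shapes can be seen without Spark. In the sketch below, json_rows is a hand-written stand-in for what df.toJSON().collect() returns (a list of JSON strings); no Spark API is called, and the second row is invented for illustration.

```python
import json

# Stand-in for df.toJSON().collect(): one JSON string per row.
json_rows = [
    '{"userId":"1","systemId":"30","title":"interest"}',
    '{"userId":"2","systemId":"31","title":"sport"}',
]

# Indexing a string yields characters, which is why i[0] printed "{".
first_char = json_rows[0][0]

# Decode each row to get a list of dicts that can be indexed by key.
dict_rows = [json.loads(row) for row in json_rows]
user_ids = [row["userId"] for row in dict_rows]

# A single JSON string holding the whole array.
array_string = json.dumps(dict_rows)

print(first_char, user_ids)
```

Which shape is appropriate depends on the consumer: Python code that loops over fields wants the list of dicts, while a template or HTTP response usually wants the single JSON string.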
I have a string where I'd like to grab the "id" number 12079500908. I am trying to use ast.literal_eval but received a ValueError: malformed string. Is there any other way to get the id number from the string below?
doc_request = urllib2.Request("https://api.box.com/2.0/search?query=SEARCHTERMS", headers=doc_headers)
doc_response = urllib2.urlopen(doc_request)
view_doc_response = doc_response.read()
doc_dict=ast.literal_eval(view_doc_response)
Edit
Output:
view_doc_response = '{"total_count":1,"entries":[{"type":"file","id":"12079500908","sequence_id":"1","etag":"1","sha1":"6887169228cab0cfb341059194bc980e1be8ad90","name":"file.pdf","description":"","size":897838,"path_collection":{"total_count":2,"entries":[{"type":"folder","id":"0","sequence_id":null,"etag":null,"name":"All Files"},{"type":"folder","id":"1352745576","sequence_id":"0","etag":"0","name":"Patient Files"}]},"created_at":"2013-12-03T10:23:30-08:00","modified_at":"2013-12-03T11:17:52-08:00","trashed_at":null,"purged_at":null,"content_created_at":"2013-12-03T10:23:30-08:00","content_modified_at":"2013-12-03T11:17:52-08:00","created_by":{"type":"user","id":"20672372","name":"name","login":"email"},"modified_by":{"type":"user","id":"206732372","name":"name","login":"email"},"owned_by":{"type":"user","id":"206737772","name":"name","login":"email"},"shared_link":{"url":"https:\\/\\/www.box.net\\/s\\/ymfslf1phfqiw65bunjg","download_url":"https:\\/\\/www.box.net\\/shared\\/static\\/ymfslf1phfqiw65bunjg.pdf","vanity_url":null,"is_password_enabled":false,"unshared_at":null,"download_count":0,"preview_count":0,"access":"open","permissions":{"can_download":true,"can_preview":true}},"parent":{"type":"folder","id":"1352745576","sequence_id":"0","etag":"0","name":"Patient Files"},"item_status":"active"}],"limit":30,"offset":0}'
calling doc_dict gives:
ValueError: malformed string
ast.literal_eval is for parsing valid Python literal syntax, but what you have is JSON. Valid JSON looks a lot like Python syntax, except that JSON can contain null, true, and false, which are mapped to None, True, and False in Python when passed through a JSON decoder. You can use json.loads for this. The code might look something like this:
import json
doc_dict = json.loads(view_doc_response)
first_id = doc_dict['entries'][0]['id'] # with your data, should be 12079500908
Note that this assumes that you manually added the ... at the end of the string, presumably after shortening it. If that ... is actually in your code as well then you have invalid JSON and you will need to do some processing before it will work.
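The difference between the two parsers is easy to reproduce on a small input. A minimal sketch, where text is a shortened, hand-written stand-in for the API response and not the real payload:

```python
import ast
import json

text = '{"id": "12079500908", "trashed_at": null, "can_download": true}'

# literal_eval only understands Python literals, so the bare names
# null and true are rejected with a ValueError.
try:
    ast.literal_eval(text)
    literal_eval_failed = False
except ValueError:
    literal_eval_failed = True

doc = json.loads(text)  # null -> None, true -> True
print(literal_eval_failed, doc["trashed_at"], doc["can_download"])
```

A response that happens to contain only strings and numbers would parse with either function, which is why the failure can look intermittent; json.loads is the appropriate choice whenever the source is JSON.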