Json file to dictionary - python

I am using the Yelp dataset and I want to parse the review JSON file into a dictionary. I tried loading it into a pandas DataFrame and then building the dictionary from it, but because the file is so big this is time consuming. I want to keep only the user_id and stars values. A line of the JSON file looks like this:
{
    "votes": {"funny": 0, "useful": 2, "cool": 1},
    "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
    "review_id": "15SdjuK7DmYqUAj6rjGowg",
    "stars": 5,
    "date": "2007-05-17",
    "text": "dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
    "type": "review",
    "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
How can I iterate over every 'field' (for lack of a better word)? So far I can only iterate over each line.
EDIT
As requested, the pandas code.
Reading the JSON:
import json
import pandas as pd

with open('yelp_academic_dataset_review.json') as f:
    df = pd.DataFrame(json.loads(line) for line in f)
Creating the dictionary:
dict = {}
for i, row in df.iterrows():
    business_id = row['business_id']
    user_id = row['user_id']
    rating = row['stars']
    key = (business_id, user_id)
    dict[key] = rating

You don't need to read this into a DataFrame. json.load() returns a dictionary. For example:
sample.json
{
"votes": {
"funny": 0,
"useful": 2,
"cool": 1
},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg",
"stars": 5,
"date": "2007-05-17",
"text": "dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
"type": "review",
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
read_json.py
import json

with open('sample.json', 'r') as fh:
    result_dict = json.load(fh)

print(result_dict['user_id'])
print(result_dict['stars'])
output
Xqd0DzHaiyRqVH3WRG7hzg
5
With that output you can easily create a DataFrame.
There are several good discussions about parsing json as a stream on SO, but the gist is it's not possible natively, although some tools seem to attempt it.
In the interest of keeping your code simple and with minimal dependencies, you might see if reading the JSON directly into a dictionary is a sufficient improvement.
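That said, the Yelp review file is newline-delimited (one JSON object per line), which is exactly what the pandas snippet in the question relies on. Assuming that layout, a minimal sketch that skips pandas entirely and keeps only the fields you need might look like this:
import json

ratings = {}
with open('yelp_academic_dataset_review.json') as f:
    for line in f:
        review = json.loads(line)  # each line is one complete JSON object
        key = (review['business_id'], review['user_id'])
        ratings[key] = review['stars']
This avoids building the intermediate DataFrame, so memory use stays at roughly one review at a time.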

Related

Python json pulling list of items

I apologize in advance if this is simple. This is my first go at Python and I've been searching and trying things all day and just haven't been able to figure out how to accomplish what I need.
I am pulling a list of assets from an API. Below is an example of the result of this request (in reality it will return around 50 sensor points).
There is a second request that will pull readings from a specific sensor based on sensorPointId. I need to be able to enter an assetId, and pull the readings from each sensor.
{
"assetId": 1436,
"assetName": "Pharmacy",
"groupId": "104",
"groupName": "West",
"environment": "Freezer",
"lastActivityDate": "2021-01-25T18:54:34.5970000Z",
"tags": [
"Manager: Casey",
"State: Oregon"
],
"sensorPoints": [
{
"sensorPointId": 126,
"sensorPointName": "Top Temperature",
"devices": [
"23004000080793070793",
"74012807612084533500"
]
},
{
"sensorPointId": 129,
"sensorPointName": "Bottom Temperature",
"devices": [
"86004000080793070956"
]
}
]
}
My plan was to go through the list from the first request, make a list of all the sensorPointIds in that asset, and then run the second request for each ID in that list. The problem is that no matter which method I try to pull the individual sensorPointIds, it says the object is not subscriptable, even when looking at a string value. These are all the things I've tried; I'm sure it's something silly I'm missing, but all of these I have seen in examples. I've written the full response to a text file just to make sure I'm getting good data, and that works fine.
r = request...
data = r.json
for sensor in data:
    print(data["sensorpointId"])
or
print(["sensorsPoints"]["sensorPointName"])
these give 'method' object is not iterable
I've also just tried to print a single sensorpointId
print(data["sensorpointId"][0])
print(data["sensorpointName"][0])
print(data["sensorPoints"][0]["sensorpointId"])
all of these give object is not subscriptable
print(r["sensorPoints"][0]["sensorpointName"])
'Response' object is not subscriptable
print(data["sensorPoints"][0]["sensorpointName"])
print(["sensorPoints"][0]["sensorpointName"]
string indices must be integers, not 'str'
I got it!
data = r.json()['sensorPoints']
sensors = []
for d in data:
    sensor = d['sensorPointId']
    sensors.append(sensor)
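With that list of IDs, the second request can then be run once per sensor. The endpoint path below is a placeholder (your API will have its own URL and authentication), so treat this as a sketch of the loop rather than working code for your service:
import requests

readings = {}
for sensor_id in sensors:
    # Hypothetical endpoint -- substitute the real readings URL for your API
    resp = requests.get(f"https://api.example.com/sensorpoints/{sensor_id}/readings")
    if resp.ok:
        readings[sensor_id] = resp.json()  # note: .json() is a method, so call it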

Get n number of documents from a collection using MongoDB/MongoEngine

Hi everyone, I have a document inside a collection like this (ignore the absurdity of the question).
[
{
"tag": "english",
"difficulty": "hard",
"question": "What are alphabets",
"option_1": "98 billion light years",
"option_2": "23.3 trillion light years",
"option_3": "6 minutes",
"option_4": "It is still unknown",
"correct_answer": "option_1",
"id": "5f80befbaaf3c9ce2f4e2fb9"
}
]
There are multiple documents like this one (about 10,000 in total).
I'm trying to write a Python GET endpoint using flask-restful that returns n documents from this collection.
Currently, I'm confused about how to write a MongoEngine query.
This is what I do to get a single document by its id:
def get(self, id):
    questions = Question.objects.get(id=id).to_json()
    return Response(questions,
                    mimetype="application/json",
                    status=200)
For n documents, I'm unable to figure out what to write inside:
def get_n_questions(self, n):
    body = request.get_json(force=True)
    questions = ???
    return Response(questions,
                    mimetype="application/json",
                    status=200)
You can use the limit(n) method (doc) on a queryset. This will let you retrieve the first n documents from the collection.
In your case that would mean:
questions = Question.objects().limit(n).to_json()
You may also be interested in the skip(n) method; this will allow you to do pagination (similar to LIMIT/OFFSET in MySQL, for instance).
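Putting that into the handler from the question, a minimal sketch might look like this (the skip call is optional and only needed if you want offset-style pagination):
def get_n_questions(self, n):
    # First n documents; chain .skip(offset) before .limit(n) for pagination
    questions = Question.objects().limit(n).to_json()
    return Response(questions,
                    mimetype="application/json",
                    status=200)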

Trouble when storing API data in Python list

I'm struggling with JSON data that I get from an API. I've gone through several API URLs to grab my data and stored the results in a list. I then want to pull out every "reputation" field, since that number is all I'm interested in. See my code here:
import json
import requests

f = requests.get('my_api_url')
if(f.ok):
    data = json.loads(f.content)

url_list = []  # the list stores a number of urls that I want to request data from
for items in data:
    url_list.append(items['details_url'])  # grab the urls that I want to enter

total_url = []  # stores all data from all urls here
for index in range(len(url_list)):
    url = requests.get(url_list[index])
    if(url.ok):
        url_data = json.loads(url.content)
        total_url.append(url_data)

print(json.dumps(total_url, indent=2))  # only want to see if it's working
Thus far I'm happy and can enter all urls and get the data. It's in the next step I get trouble. The above code outputs the following json data for me:
[
[
{
"id": 316,
"name": "storabro",
"url": "https://storabro.net",
"customer": true,
"administrator": false,
"reputation": 568
}
],
[
{
"id": 541,
"name": "sega",
"url": "https://wedonthaveanyyet.com",
"customer": true,
"administrator": false,
"reputation": 45
},
{
"id": 90,
"name": "Villa",
"url": "https://brandvillas.co.uk",
"customer": true,
"administrator": false,
"reputation": 6
}
]
]
However, I only want to print out the reputation values, and I cannot get it working. If I instead use print(total_url['reputation']) it fails with "TypeError: list indices must be integers or slices, not str", and if I try:
for s in total_url:
    print(s['reputation'])
I get the same TypeError.
Feels like I've tried everything but I can't find any answers on the web that can help me, but I understand I still have a lot to learn and that my error will be obvious to some people here. It seems very similar to other things I've done with Python, but this time I'm stuck. To clarify, I'm expecting an output similar to: [568, 45, 6]
Perhaps I used the wrong way to do this from the beginning and that's why it's not working all the way for me. Started to code with Python in October and it's still very new to me but I want to learn. Thank you all in advance!
It looks like your total_url is a list of lists, so you might write a function like:
def get_reputations(data):
    for url in data:
        for obj in url:
            print(obj.get('reputation'))

get_reputations(total_url)
# output:
# 568
# 45
# 6
If you'd rather not work with a list of lists in the first place, you can extend the list with each result instead of appending it when constructing total_url, as sketched below.
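A minimal sketch of that variant, assuming the same url_list and response shape as in the question:
import requests

total_url = []
for u in url_list:
    resp = requests.get(u)
    if resp.ok:
        total_url.extend(resp.json())  # add the inner objects themselves, not the nested list

reputations = [obj['reputation'] for obj in total_url]
print(reputations)  # e.g. [568, 45, 6]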
You can also read the raw response yourself and parse it with json.loads:
from urllib.request import urlopen
import json

def get_rep():
    response = urlopen(api_url)  # api_url: the endpoint you are querying
    r = response.read().decode('utf-8')
    r_obj = json.loads(r)
    for item in r_obj['response']:
        print("Reputation: {}".format(item['reputation']))

sub_items generate new columns in JSON to CSV converter with python

I am trying to convert a JSON string to a CSV file which I can work on further in Excel. For that, I am using the following script: https://github.com/vinay20045/json-to-csv
I spent a few hours on it yesterday but could not get it working :(
I have reduced my JSON string to the minimum needed to explain what I mean.
https://pastebin.com/Vjt799Bb
{
"page": 1,
"pages": 2270,
"limit": 10,
"total": 22693,
"items": [
{
"address": {
"city": "cityname first dataset",
"company_name": "companyname first dataset"
},
"amount": 998,
"items": [
{
"description": "first part of first dataset",
"number": "part number of first part of first dataset"
}
],
"number": "number of first dataset",
"service_date": {
"type": "DEFAULT",
"date": "2015-11-18"
},
"vat_option": null
},
{
"address": {
"city": "cityname second dataset",
"company_name": "companyname second dataset"
},
"amount": 998,
"items": [
{
"description": "first part of second dataset",
"number": "part number of first part of second dataset"
},
{
"description": "second part of second dataset",
"number": "part number of second part of second dataset"
}
],
"number": "number of second dataset",
"service_date": {
"type": "DEFAULT",
"date": "2015-11-18"
},
"vat_option": null
}
]
}
I would really appreciate it if you could take a look at it.
The script now delivers the following result:
.dropbox.com/s/165zbfl8wn52syf/scriptresult.jpg?dl=0
(please add www in front to have a link)
What the script now needs to do is following (F3, G4 and so on are cell definitions from the above screenshot):
- copy F3 and G3 to D4 and E4
- remove columns F and G
- copy A3:C3 to A4:C4
- copy F3:I3 to F4:I4
Target CSV will then look like:
.dropbox.com/s/l1wj3ntrlomwmaq/target.jpg?dl=0
(please add www in front to have a link)
So all in all, the "items_items_0" and "items_items_1" columns are the problem: when the JSON data has sub-items, the current script gives them new columns in the header, but I'd like to have them as new rows instead.
Do you see any way I can achieve that? The logic is quite clear to me, but I am an absolute newbie in Python - maybe that's the problem :(
Thank you for your great support!
Cheers,
Tom
I do agree: you're asking about the usage of a specific package without providing the actual code.
I went ahead, made some assumptions, and created a snippet which could help you solve your issue. Instead of using the script you linked, I manually build a list of dictionaries and then use pandas for printing, potential modification, and eventual export. Note: this does not completely solve your problem (since I'm not following it to the fullest extent) – it is rather meant to give you a good start with some of the tools and techniques.
See .ipynb file in this Gist, https://gist.github.com/AndiH/4d4ef85e2dec395a0ae5343c648565eb, the gist of it I'll paste below:
import pandas as pd
import json

with open("input.json") as f:
    rawjson = json.load(f)

data = []
for element in rawjson["items"]:
    data.append({
        "item_address_city": element["address"]["city"],
        "item_address_company_name": element["address"]["company_name"],
        "items_amount": element["amount"]
    })

df = pd.DataFrame(data)
df.head()
df.to_csv("output.csv")

combining two pandas dataframes as single json output

I'm trying to combine two pandas dataframes into a single JSON output.
The JSON output below is the result of this code: df.to_json(orient="split")
{
columns: [],
index: [],
data: [
[
"COMPANY ONE",
"123 HAPPY PLACE",
"GOTHAM CITY",
"NJ",
12345,
"US",
8675309,
"",
"",
"",
"",
""
],
[.....]
]
}
A little background: I get the data from a CSV file, and usually I have to separate the file into two parts, one good and the other bad. I've been using pandas for this process, which is great. So df contains the good data and, say, dfbad contains the bad data.
I used df.to_json(orient = "split") to output the good data, which I really like the structure of it. Now I want to do the same thing for the bad data, same structure, so something like this:
[{good}, {bad}]
I apologize in advance if the example above is not clear.
I tried this:
jsonify(good = df.to_json(orient = "split"), bad = dfbad.to_json(orient = "split"))
But I know this is not going to work, because the good and bad results are turned into strings, which I don't want; I want to be able to access them as JSON.
data_dict = {}
data_dict['bad'] = dfbad.to_dict()
data_dict['good'] = df.to_dict()
return pd.json.dumps(data_dict)
This returns fine as JSON, but not in the structure I want, the way .to_json(orient="split") produces it, unless I customize it.
Can anybody help with this issue, or point me in another direction to solve it?
Thanks in advance!
UPDATE:
I found the solution, here is what I did:
good_json = df.to_json(orient="split")
bad_json = dfbad.to_json(orient="split")
return jsonify(bad = json.loads(bad_json), good = json.loads(good_json))
I added json.loads (you have to import it - import json) and it now returns proper JSON output. If you have other suggestions, please let me know; I'm open to learning more about pandas.
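One possible alternative, if you want to skip the serialize-then-parse round trip: to_dict also accepts orient="split", so you could hand plain dictionaries straight to jsonify. A quick sketch, assuming the same df, dfbad, and Flask's jsonify as above:
# Same "split" layout (columns / index / data), but built as dicts directly
return jsonify(
    good=df.to_dict(orient="split"),
    bad=dfbad.to_dict(orient="split"),
)
Depending on your dtypes (dates, NaN), the to_json route you found may still serialize more cleanly, since pandas handles those conversions itself.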
