sub_items generate new columns in JSON to CSV converter with python - python

I am trying to convert a JSON string to a CSV file that I can work on further in Excel. For that, I am using the following script: https://github.com/vinay20045/json-to-csv
I spent a few hours on it yesterday but could not get it working :(
I reduced my JSON string to the minimum to explain what I mean.
https://pastebin.com/Vjt799Bb
{
  "page": 1,
  "pages": 2270,
  "limit": 10,
  "total": 22693,
  "items": [
    {
      "address": {
        "city": "cityname first dataset",
        "company_name": "companyname first dataset"
      },
      "amount": 998,
      "items": [
        {
          "description": "first part of first dataset",
          "number": "part number of first part of first dataset"
        }
      ],
      "number": "number of first dataset",
      "service_date": {
        "type": "DEFAULT",
        "date": "2015-11-18"
      },
      "vat_option": null
    },
    {
      "address": {
        "city": "cityname second dataset",
        "company_name": "companyname second dataset"
      },
      "amount": 998,
      "items": [
        {
          "description": "first part of second dataset",
          "number": "part number of first part of second dataset"
        },
        {
          "description": "second part of second dataset",
          "number": "part number of second part of second dataset"
        }
      ],
      "number": "number of second dataset",
      "service_date": {
        "type": "DEFAULT",
        "date": "2015-11-18"
      },
      "vat_option": null
    }
  ]
}
I would really appreciate if you could take a look at it.
The script now delivers the following result:
.dropbox.com/s/165zbfl8wn52syf/scriptresult.jpg?dl=0
(please add www in front to have a link)
What the script now needs to do is the following (F3, G4 and so on are cell references from the above screenshot):
- copy F3 and G3 to D4 and E4
- remove columns F and G
- copy A3:C3 to A4:C4
- copy F3:I3 to F4:I4
Target CSV will then look like:
.dropbox.com/s/l1wj3ntrlomwmaq/target.jpg?dl=0
(please add www in front to have a link)
So all in all, the „items_items_0“ and „items_items_1“ columns are the problem: when the JSON data has sub_items, the current script gives them new columns in the header, but I’d like to have them in new rows instead.
Is there a way to achieve that? The logic is quite clear to me, but I am an absolute newbie in Python; maybe that’s the problem :(
Thank you for your great support!
Cheers,
Tom

I do agree: You're asking about the usage of a specific package without providing the actual code.
I went ahead, made some assumptions, and created a snippet which could help you solve your issue. Instead of using the script you linked, I build a list of dictionaries manually and then use Pandas for printing, potential modification, and eventual export. Note: This does not solve your problem (since I don't fully understand it); rather, it aims to give you a good start with some of the tools and techniques.
See the .ipynb file in this Gist, https://gist.github.com/AndiH/4d4ef85e2dec395a0ae5343c648565eb; I'll paste the gist of it below:
import pandas as pd
import json

# Load the JSON document from disk.
with open("input.json") as f:
    rawjson = json.load(f)

# Build one flat dictionary per top-level item.
data = []
for element in rawjson["items"]:
    data.append({
        "item_address_city": element["address"]["city"],
        "item_address_company_name": element["address"]["company_name"],
        "items_amount": element["amount"]
    })

df = pd.DataFrame(data)
print(df.head())  # quick look at the first rows
df.to_csv("output.csv")
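For the one-row-per-sub-item layout asked about in the question, a minimal standard-library sketch could look like the following. It is neither the linked json-to-csv script nor the code from the Gist; the function name `rows_per_sub_item`, the output column names, and the trimmed inline sample are assumptions made for illustration.

```python
import csv
import io

# Trimmed inline sample with the same shape as the JSON in the question.
raw = {
    "items": [
        {
            "address": {"city": "city 1", "company_name": "company 1"},
            "amount": 998,
            "items": [{"description": "part 1a", "number": "pn-1a"}],
            "number": "dataset 1",
            "service_date": {"type": "DEFAULT", "date": "2015-11-18"},
        },
        {
            "address": {"city": "city 2", "company_name": "company 2"},
            "amount": 998,
            "items": [
                {"description": "part 2a", "number": "pn-2a"},
                {"description": "part 2b", "number": "pn-2b"},
            ],
            "number": "dataset 2",
            "service_date": {"type": "DEFAULT", "date": "2015-11-18"},
        },
    ]
}


def rows_per_sub_item(datasets):
    """Yield one flat dict per sub-item, repeating the parent fields."""
    for dataset in datasets:
        parent = {
            "number": dataset["number"],
            "amount": dataset["amount"],
            "address_city": dataset["address"]["city"],
            "address_company_name": dataset["address"]["company_name"],
            "service_date": dataset["service_date"]["date"],
        }
        # A dataset without sub-items still yields one row with empty item cells.
        for sub in dataset.get("items") or [{}]:
            yield {
                **parent,
                "item_description": sub.get("description", ""),
                "item_number": sub.get("number", ""),
            }


rows = list(rows_per_sub_item(raw["items"]))

# Write to an in-memory buffer; use open("output.csv", "w", newline="") for a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the parent fields are copied into every row, the second dataset appears twice, once per part, which matches the "new rows instead of new columns" layout described in the question.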

Related

Using Python jsonpath_ng to Filter Lists of Objects

I had a look at other answers to similar questions, but they don't seem to work for me.
I have a simple requirement: filter a JSON list by a value in the objects of the list.
I.e.
jsonpath_expression = parse("$.balances[?(@.asset=='BTC')].free")
This path works on https://jsonpath.com/ with the following JSON.
{
  "makerCommission": 10,
  "takerCommission": 10,
  "buyerCommission": 0,
  "sellerCommission": 0,
  "canTrade": true,
  "canWithdraw": true,
  "canDeposit": true,
  "brokered": false,
  "accountType": "SPOT",
  "balances": [
    {
      "asset": "BTC",
      "free": "0.06437673",
      "locked": "0.00000000"
    },
    {
      "asset": "LTC",
      "free": "0.00000000",
      "locked": "0.00000000"
    }
  ]
}
When I try in python I get jsonpath_ng.exceptions.JsonPathLexerError: Error on line 1, col 11: Unexpected character: ?
I've tried quite a few variations - which garner various other jsonpath parse errors - based on other articles - this one looked promising and I believe aligns to my attempts.
Any ideas what I am doing wrong?
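One likely cause, offered as a hedged sketch rather than a confirmed diagnosis: the lexer error suggests the default `jsonpath_ng.parse` does not accept the `?` of a filter expression, while the extended parser in `jsonpath_ng.ext` handles filters. The example below tries the extended parser and, if the package is not installed, falls back to a plain list comprehension that selects the same value. The trimmed `account` dictionary is an assumption for illustration.

```python
# Trimmed copy of the JSON from the question.
account = {
    "accountType": "SPOT",
    "balances": [
        {"asset": "BTC", "free": "0.06437673", "locked": "0.00000000"},
        {"asset": "LTC", "free": "0.00000000", "locked": "0.00000000"},
    ],
}

try:
    # Filters such as [?(...)] are handled by the extended parser,
    # not by the plain jsonpath_ng.parse.
    from jsonpath_ng.ext import parse

    expression = parse("$.balances[?(@.asset=='BTC')].free")
    btc_free = [match.value for match in expression.find(account)]
except ImportError:
    # Same selection without the library.
    btc_free = [b["free"] for b in account["balances"] if b["asset"] == "BTC"]

print(btc_free)
```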

Python - Processing the data and analyse using functional programming

This is a straightforward question: how can I use Python to process a log file (consider it a JSON string for now)? Below is the JSON data:
{
  "voltas": {
    "ac": [
      {
        "timestamp": 1590761564,
        "is_connected": true,
        "reconnection_status": "N/A"
      },
      {
        "timestamp": 1590761566,
        "is_connected": true,
        "reconnection_status": "N/A"
      },
      {
        "timestamp": 1590761568,
        "is_connected": false,
        "reconnection_status": "true"
      },
      {
        "timestamp": 1590761570,
        "is_connected": true,
        "reconnection_status": "N/A"
      },
      {
        "timestamp": 1590761572,
        "is_connected": true,
        "reconnection_status": "N/A"
      },
      {
        "timestamp": 1590761574,
        "is_connected": false,
        "reconnection_status": "false"
      },
      {
        "timestamp": 1590761576,
        "is_connected": false,
        "reconnection_status": "true"
      }
    ]
  }
}
Since the question is just regarding how to process the json data, I am skipping the discussion about the data in json. Now, what I need is the analysed data as below.
{
  "voltas": {
    "ac": {
      "number_of_actual_connection_drops": 3,
      "time_interval_between_droppings": [4, 8],
      "number_of_successful_reconnections": 2,
      "number_of_failure_reconnections": 1
    }
  }
}
This is how the data is analysed:
"number_of_actual_connection_drops": Number of items with "is_connected" == false.
"time_interval_between_droppings": This list is populated from the end (each new value is inserted at the beginning). Pick the timestamp of the last item that has "is_connected": false and "reconnection_status": "true". In this case it is the last (7th) item, with timestamp 1590761576. Then find the timestamp of the previous item with "is_connected": false and "reconnection_status": "true". In this case it is the 3rd item, with timestamp 1590761568. The last entry in the list is the difference between these timestamps, 8, so the list is now [8].
The current timestamp is now 1590761568, and there is no earlier item with "is_connected": false and "reconnection_status": "true", so we take the first item's timestamp, 1590761564; the difference is 4. So the list is [4, 8].
"number_of_successful_reconnections": Number of items with "reconnection_status" = "true"
"number_of_failure_reconnections": Number of items with "reconnection_status" = "false"
We can achieve this with for loops and some if conditions. I am interested in doing this using functional programming ways (reduce, map, filter) in python.
For simplification I have mentioned only "ac". There will be many items similar to this. Thanks.
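The rules above can be sketched with `filter`, `map` and `functools.reduce` as follows. This is one possible reading of the description, not a reference solution; the function name `analyse` and the variable names are assumptions, and the sketch assumes the entries are already sorted by timestamp.

```python
from functools import reduce

samples = [
    {"timestamp": 1590761564, "is_connected": True, "reconnection_status": "N/A"},
    {"timestamp": 1590761566, "is_connected": True, "reconnection_status": "N/A"},
    {"timestamp": 1590761568, "is_connected": False, "reconnection_status": "true"},
    {"timestamp": 1590761570, "is_connected": True, "reconnection_status": "N/A"},
    {"timestamp": 1590761572, "is_connected": True, "reconnection_status": "N/A"},
    {"timestamp": 1590761574, "is_connected": False, "reconnection_status": "false"},
    {"timestamp": 1590761576, "is_connected": False, "reconnection_status": "true"},
]


def analyse(entries):
    """Summarise one device's log entries using filter, map and reduce."""
    drops = list(filter(lambda e: not e["is_connected"], entries))

    # Timestamps of drops whose reconnection_status is "true".
    reconnected_at = list(
        map(
            lambda e: e["timestamp"],
            filter(lambda e: e["reconnection_status"] == "true", drops),
        )
    )

    # Start from the first entry's timestamp, then take pairwise differences.
    anchors = [e["timestamp"] for e in entries[:1]] + reconnected_at
    intervals = list(map(lambda pair: pair[1] - pair[0], zip(anchors, anchors[1:])))

    # Count how often each reconnection_status value occurs.
    status_counts = reduce(
        lambda acc, e: {
            **acc,
            e["reconnection_status"]: acc.get(e["reconnection_status"], 0) + 1,
        },
        entries,
        {},
    )

    return {
        "number_of_actual_connection_drops": len(drops),
        "time_interval_between_droppings": intervals,
        "number_of_successful_reconnections": status_counts.get("true", 0),
        "number_of_failure_reconnections": status_counts.get("false", 0),
    }


log = {"voltas": {"ac": samples}}
report = {
    brand: {device: analyse(entries) for device, entries in devices.items()}
    for brand, devices in log.items()
}
print(report)
```

The outer dictionary comprehension applies the same analysis to every device under every brand, so further items similar to "ac" need no extra code.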

Trouble when storing API data in Python list

I'm struggling with my json data that I get from an API. I've gone into several api urls to grab my data, and I've stored it in an empty list. I then want to take out all fields that say "reputation" and I'm only interested in that number. See my code here:
import json
import requests

f = requests.get('my_api_url')
if f.ok:
    data = json.loads(f.content)
    url_list = []  # the list stores a number of urls that I want to request data from
    for items in data:
        url_list.append(items['details_url'])  # grab the urls that I want to enter
    total_url = []  # stores all data from all urls here
    for index in range(len(url_list)):
        url = requests.get(url_list[index])
        if url.ok:
            url_data = json.loads(url.content)
            total_url.append(url_data)
    print(json.dumps(total_url, indent=2))  # only want to see if it's working
Thus far I'm happy and can enter all urls and get the data. It's in the next step I get trouble. The above code outputs the following json data for me:
[
  [
    {
      "id": 316,
      "name": "storabro",
      "url": "https://storabro.net",
      "customer": true,
      "administrator": false,
      "reputation": 568
    }
  ],
  [
    {
      "id": 541,
      "name": "sega",
      "url": "https://wedonthaveanyyet.com",
      "customer": true,
      "administrator": false,
      "reputation": 45
    },
    {
      "id": 90,
      "name": "Villa",
      "url": "https://brandvillas.co.uk",
      "customer": true,
      "administrator": false,
      "reputation": 6
    }
  ]
]
However, I only want to print out the reputation, and I cannot get it working. If I in my code instead use print(total_url['reputation']) it doesn't work and says "TypeError: list indices must be integers or slices, not str", and if I try:
for s in total_url:
    print(s['reputation'])
I get the same TypeError.
Feels like I've tried everything but I can't find any answers on the web that can help me, but I understand I still have a lot to learn and that my error will be obvious to some people here. It seems very similar to other things I've done with Python, but this time I'm stuck. To clarify, I'm expecting an output similar to: [568, 45, 6]
Perhaps I used the wrong way to do this from the beginning and that's why it's not working all the way for me. Started to code with Python in October and it's still very new to me but I want to learn. Thank you all in advance!
It looks like your total_url is a list of lists, so you might write a function like:
def get_reputations(data):
    for url in data:
        for obj in url:
            print(obj.get('reputation'))

get_reputations(total_url)
# output:
# 568
# 45
# 6
If you'd rather not work with a list of lists in the first place, you can call total_url.extend(url_data) instead of total_url.append(url_data) when building total_url, so that each result's objects are added to one flat list.
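To get the list the question expects, a short sketch of both options follows. The inline `total_url` is a trimmed copy of the output shown in the question, and the names `reputations` and `flat` are assumptions for illustration.

```python
total_url = [
    [{"id": 316, "name": "storabro", "reputation": 568}],
    [
        {"id": 541, "name": "sega", "reputation": 45},
        {"id": 90, "name": "Villa", "reputation": 6},
    ],
]

# Option 1: keep the nested structure and flatten while reading.
reputations = [obj["reputation"] for batch in total_url for obj in batch]

# Option 2: build a flat list up front by using extend() instead of append().
flat = []
for batch in total_url:  # each batch stands in for one parsed response body
    flat.extend(batch)
reputations_from_flat = [obj["reputation"] for obj in flat]

print(reputations)
```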
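You can also read the response with urlopen and parse it with json.loads:
import json
from urllib.request import urlopen

def get_rep(api_url):
    # Assumes the API wraps its results in a top-level "response" key.
    response = urlopen(api_url)
    r = response.read().decode('utf-8')
    r_obj = json.loads(r)
    for item in r_obj['response']:
        print("Reputation: {}".format(item['reputation']))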

Json file to dictionary

I am using the yelp dataset and I want to parse the review json file to a dictionary. I tried loading it on a pandas DataFrame and then creating the dictionary, but because the file is too big it is time consuming. I want to keep only the user_id and stars values. A line of the json file looks like this:
{
"votes": {
"funny": 0, "useful": 2, "cool": 1},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05-17",
"text": (
"dr. goldberg offers everything i look for in a general practitioner. "
"he's nice and easy to talk to without being patronizing; he's always on "
"time in seeing his patients; he's affiliated with a top-notch hospital (nyu) "
"which my parents have explained to me is very important in case something "
"happens and you need surgery; and you can get referrals to see specialists "
"without having to see him first. really, what more do you need? i'm "
"sitting here trying to think of any complaints i have about him, but i'm "
"really drawing a blank."
),
"type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
How can I iterate over every 'field' (for lack of a better word)? So far I can only iterate over each line.
EDIT
As requested, the pandas code:
Reading the JSON:
import json
import pandas as pd

with open('yelp_academic_dataset_review.json') as f:
    df = pd.DataFrame(json.loads(line) for line in f)
Creating the dictionary:
ratings = {}  # maps (business_id, user_id) to stars; avoids shadowing the built-in dict
for i, row in df.iterrows():
    business_id = row['business_id']
    user_id = row['user_id']
    rating = row['stars']
    key = (business_id, user_id)
    ratings[key] = rating
You don't need to read this into a DataFrame. json.load() returns a dictionary. For example:
sample.json
{
  "votes": {
    "funny": 0,
    "useful": 2,
    "cool": 1
  },
  "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
  "review_id": "15SdjuK7DmYqUAj6rjGowg",
  "stars": 5,
  "date": "2007-05-17",
  "text": "dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
  "type": "review",
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
read_json.py
import json

with open('sample.json', 'r') as fh:
    result_dict = json.load(fh)

print(result_dict['user_id'])
print(result_dict['stars'])
output
Xqd0DzHaiyRqVH3WRG7hzg
5
With that output you can easily create a DataFrame.
There are several good discussions about parsing JSON as a stream on SO, but the gist is that the standard json module cannot do it natively, although some third-party tools attempt it.
In the interest of keeping your code simple and with minimal dependencies, you might see if reading the JSON directly into a dictionary is a sufficient improvement.
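Since the question describes one JSON object per line, a hedged sketch of building the dictionary without pandas is to call `json.loads` on each line and keep only the needed fields. The temporary file and the two sample records below are stand-ins for `yelp_academic_dataset_review.json`, and the name `ratings` is an assumption.

```python
import json
import os
import tempfile

sample_lines = [
    {"user_id": "u1", "business_id": "b1", "stars": 5, "text": "long review"},
    {"user_id": "u2", "business_id": "b1", "stars": 3, "text": "another review"},
]

# Write a tiny stand-in for the real review file, one JSON object per line.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for record in sample_lines:
        tmp.write(json.dumps(record) + "\n")
    path = tmp.name

ratings = {}
with open(path) as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        review = json.loads(line)  # each line is its own JSON object
        ratings[(review["business_id"], review["user_id"])] = review["stars"]

os.remove(path)
print(ratings)
```

Only one parsed line is held in memory at a time, so the review text and other unused fields are discarded as soon as the key and rating have been stored.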

Python. How do I print a specific value from JSON when there are similar names?

Okay so my problem is that I need to print out one specific value from a json.
I've managed to print out all the values but not the specific one I want.
The json looks like this:
"apple": {
"stuff": 111,
"food": [
{
"money": 4000,
"time": 36,
},
{
"money": 12210,
"time": 94,
It continues like that with money and time.
So my problem is that when I do this:
ourResult = js['apple']['food']
for rs in ourResult:
    print(rs['time'])
I receive all the times. I only want the time that belongs to money: 12210, for example, but I don't know how to do that when there is a colon and a value.
I thank you for all the help in advance.
Well, you already know how to get the value of "time", so just do the same with "money" and check it's equal to 12210.
Edit
for rs in ourResult:
    if rs['money'] == 12210:
        print(rs['time'])
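If only the first matching entry is needed, the same check can be written with `next` and a generator expression. This is a sketch; the `js` dictionary below is a completed version of the truncated fragment from the question and is therefore an assumption, as are the names `match` and `time_value`.

```python
js = {
    "apple": {
        "stuff": 111,
        "food": [
            {"money": 4000, "time": 36},
            {"money": 12210, "time": 94},
        ],
    }
}

# First entry whose "money" equals 12210, or None when nothing matches.
match = next((rs for rs in js["apple"]["food"] if rs["money"] == 12210), None)
time_value = match["time"] if match is not None else None
print(time_value)
```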
