I am looking for an efficient way, in terms of both memory and speed, to parse a large amount of JSON data (on the order of several hundred MB) in Python.
I looked at the ijson package (https://pypi.org/project/ijson/) and experimented with its parse function, but the event-by-event for loop turns out to be very slow when the data I need sits in the middle or at the bottom of the JSON.
Below is a sample with a structure similar to mine. The JSON represents data for several hundred websites, so data.items.website can grow very large, and so can data.items.ids. With a structure that large, I found the ijson event-parser loop very slow when fetching information from the middle or bottom of data.items.website or data.items.ids.
{
"data": {
"items": {
"website": [
[
"abc.com",
[
[
"data1",
{
"type1": 10,
"type2": 11
}
],
[
"data2",
{
"type1": 100,
"type2": 150
}
],
[
"data3",
{
"type1": 40,
"type2": 50
}
]
]
]
]
},
"ids": [
[
"id1",
1
],
[
"id2",
2
]
]
}
}
import ijson

with open('data.json', 'rb') as f:
    parser = ijson.parse(f)
    for prefix, event, value in parser:
        if prefix.startswith('data.items.ids'):
            print(f"prefix {prefix}, event {event}, value {value}")
I am not sure whether I am using ijson correctly, or in the best way to reap its benefits, i.e. getting speed while at the same time keeping memory consumption low when parsing large amounts of JSON.
Could someone help?
I apologize in advance if this is simple. This is my first go at Python and I've been searching and trying things all day and just haven't been able to figure out how to accomplish what I need.
I am pulling a list of assets from an API. Below is an example of the result of this request (in reality it will return 50 sensor points).
There is a second request that will pull readings from a specific sensor based on sensorPointId. I need to be able to enter an assetId, and pull the readings from each sensor.
{
"assetId": 1436,
"assetName": "Pharmacy",
"groupId": "104",
"groupName": "West",
"environment": "Freezer",
"lastActivityDate": "2021-01-25T18:54:34.5970000Z",
"tags": [
"Manager: Casey",
"State: Oregon"
],
"sensorPoints": [
{
"sensorPointId": 126,
"sensorPointName": "Top Temperature",
"devices": [
"23004000080793070793",
"74012807612084533500"
]
},
{
"sensorPointId": 129,
"sensorPointName": "Bottom Temperature",
"devices": [
"86004000080793070956"
]
}
]
}
My plan was to go through the list from the first request, make a list of all the sensorPointIds in that asset, then run the second request for each id on that list. The problem is that no matter which method I try to pull the individual sensorPointIds, it says the object is not subscriptable, even when looking at a string value. Below are all the things I've tried; I'm sure it's something silly I'm missing, but all of these I have seen in examples. I've written the full response to a text file just to make sure I'm getting good data, and that works fine.
r = request...
data = r.json
for sensor in data:
print (data["sensorpointId")
or
print(["sensorsPoints"]["sensorPointName"])
these give 'method' object is not iterable
I've also just tried to print a single sensorpointId
print(data["sensorpointId"][0])
print(data["sensorpointName"][0])
print(data["sensorPoints"][0]["sensorpointId"])
all of these give object is not subscriptable
print(r["sensorPoints"][0]["sensorpointName"])
'Response' object is not subscriptable
print(data["sensorPoints"][0]["sensorpointName"])
print(["sensorPoints"][0]["sensorpointName"]
string indices must be integers, not 'str'
I got it!
data = r.json()['sensorPoints']
sensors = []
for d in data:
sensor = d['sensorPointId']
sensors.append(sensor)
I've had a look at other answers to similar questions, but this doesn't seem to be working for me.
I have a simple requirement to filter a JSON list by a value in the objects of the list.
I.e.
jsonpath_expression = parse("$.balances[?(#.asset=='BTC')].free")
This path works on https://jsonpath.com/ with the following JSON.
{
"makerCommission": 10,
"takerCommission": 10,
"buyerCommission": 0,
"sellerCommission": 0,
"canTrade": true,
"canWithdraw": true,
"canDeposit": true,
"brokered": false,
"accountType": "SPOT",
"balances": [
{
"asset": "BTC",
"free": "0.06437673",
"locked": "0.00000000"
},
{
"asset": "LTC",
"free": "0.00000000",
"locked": "0.00000000"
}
]
}
When I try in python I get jsonpath_ng.exceptions.JsonPathLexerError: Error on line 1, col 11: Unexpected character: ?
I've tried quite a few variations based on other articles, which produce various other jsonpath parse errors. This one looked promising and I believe aligns with my attempts.
Any ideas what I am doing wrong?
I have written a program that works fine with ltcusdt, but it gives this error with, for example, aaveusdt, and I don't understand why. Can someone explain the difference between these two pairs (they have the same number of decimals and similar prices, 100-200), and how and what I should change?
You'll have to provide your code for anyone to figure out anything specific about your error, but basically each pair has its own precision: Bitcoin, at about $50,000, won't take the same number of decimals as Doge, at about $0.15.
With the response you get from the GET /api/v3/exchangeInfo endpoint,
{
"timezone": "UTC",
"serverTime": 1565246363776,
"rateLimits": [
{
//These are defined in the `ENUM definitions` section under `Rate Limiters (rateLimitType)`.
//All limits are optional
}
],
"exchangeFilters": [
//These are the defined filters in the `Filters` section.
//All filters are optional.
],
"symbols": [
{
"symbol": "ETHBTC",
"status": "TRADING",
"baseAsset": "ETH",
"baseAssetPrecision": 8,
"quoteAsset": "BTC",
"quotePrecision": 8,
"quoteAssetPrecision": 8,
"orderTypes": [
"LIMIT",
"LIMIT_MAKER",
"MARKET",
"STOP_LOSS",
"STOP_LOSS_LIMIT",
"TAKE_PROFIT",
"TAKE_PROFIT_LIMIT"
],
"icebergAllowed": true,
"ocoAllowed": true,
"isSpotTradingAllowed": true,
"isMarginTradingAllowed": true,
"filters": [
//These are defined in the Filters section.
//All filters are optional
],
"permissions": [
"SPOT",
"MARGIN"
]
}
]
}
you can get the precision for each symbol from the keys
baseAssetPrecision
quotePrecision
quoteAssetPrecision
baseCommissionPrecision
quoteCommissionPrecision
Also, CCXT is an open-source library that takes care of all the precision handling and automatically trims numbers so that they don't exceed the precision limits.
There's a list of examples in the repo under ccxt/examples, and this is the manual.
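If you want to do it by hand, here is a sketch of indexing the exchangeInfo response by symbol and truncating a quantity to the allowed decimals. The exchange_info dictionary is trimmed to the fields used here, and trim() is a hypothetical helper, not a Binance API; note also that the symbol's filters (e.g. LOT_SIZE) can constrain quantities further than the precision fields alone.

```python
# Trimmed stand-in for the parsed /api/v3/exchangeInfo response.
exchange_info = {
    "symbols": [
        {"symbol": "ETHBTC", "baseAssetPrecision": 8, "quoteAssetPrecision": 8},
        {"symbol": "AAVEUSDT", "baseAssetPrecision": 8, "quoteAssetPrecision": 8},
    ]
}

# Index the symbols once so lookups are O(1).
precisions = {s["symbol"]: s for s in exchange_info["symbols"]}

def trim(quantity: float, precision: int) -> str:
    """Truncate (not round) a quantity to the allowed number of decimals."""
    s = f"{quantity:.{precision + 4}f}"  # over-format, then cut the tail
    whole, frac = s.split(".")
    return f"{whole}.{frac[:precision]}" if precision else whole

qty = trim(0.123456789012, precisions["AAVEUSDT"]["baseAssetPrecision"])
print(qty)
```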
This is a straightforward question: how to use Python to process a log file (consider it a JSON string for now). Below is the JSON data:
{
"voltas": {
"ac": [
{
"timestamp":1590761564,
"is_connected":true,
"reconnection_status":"N/A"
},
{
"timestamp":1590761566,
"is_connected":true,
"reconnection_status":"N/A"
},
{
"timestamp":1590761568,
"is_connected":false,
"reconnection_status":"true"
},
{
"timestamp":1590761570,
"is_connected":true,
"reconnection_status":"N/A"
},
{
"timestamp":1590761572,
"is_connected":true,
"reconnection_status":"N/A"
},
{
"timestamp":1590761574,
"is_connected":false,
"reconnection_status":"false"
},
{
"timestamp":1590761576,
"is_connected":false,
"reconnection_status":"true"
}
]
}
}
Since the question is just about how to process the JSON data, I am skipping the discussion of what the data means. What I need is the analysed data, as below.
{
"voltas": {
"ac": {
"number_of_actual_connection_drops": 3,
"time_interval_between_droppings": [4, 8],
"number_of_successful_reconnections": 2,
"number_of_failure_reconnections": 1
}
}
}
This is how the data is analysed:
"number_of_actual_connection_drops": the number of records with "is_connected": false.
"time_interval_between_droppings": a list populated from the end (new values are prepended). Take the timestamp of the last record with "is_connected": false and "reconnection_status": "true"; here that is the 7th record, timestamp 1590761576. Then find the timestamp of the previous record with "is_connected": false and "reconnection_status": "true"; here that is the 3rd record, timestamp 1590761568. The last item of the list is the difference of these timestamps, 8, so the list is [8].
Now the current timestamp is 1590761568 and there is no earlier record with "is_connected": false and "reconnection_status": "true", so we take the first record's timestamp, 1590761564; the difference is 4, and the list becomes [4, 8].
"number_of_successful_reconnections": the number of records with "reconnection_status": "true".
"number_of_failure_reconnections": the number of records with "reconnection_status": "false".
We can achieve this with for loops and some if conditions, but I am interested in doing it with functional programming tools (reduce, map, filter) in Python.
For simplicity I have mentioned only "ac"; there will be many items similar to it. Thanks.
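The rules above can be sketched with filter and map alone (reduce turns out not to be needed): filter out the drop events, filter those that reconnected, then take pairwise differences over the reconnection timestamps anchored at the first record. The log is inlined here for a self-contained example; in practice you would json.loads() the file contents first.

```python
import json

log = {
    "voltas": {"ac": [
        {"timestamp": 1590761564, "is_connected": True,  "reconnection_status": "N/A"},
        {"timestamp": 1590761566, "is_connected": True,  "reconnection_status": "N/A"},
        {"timestamp": 1590761568, "is_connected": False, "reconnection_status": "true"},
        {"timestamp": 1590761570, "is_connected": True,  "reconnection_status": "N/A"},
        {"timestamp": 1590761572, "is_connected": True,  "reconnection_status": "N/A"},
        {"timestamp": 1590761574, "is_connected": False, "reconnection_status": "false"},
        {"timestamp": 1590761576, "is_connected": False, "reconnection_status": "true"},
    ]}
}
ac = log["voltas"]["ac"]

# filter: the connection-drop events, then the drops that reconnected.
drops = list(filter(lambda e: not e["is_connected"], ac))
reconnected = list(filter(lambda e: e["reconnection_status"] == "true", drops))

# Anchor the interval chain at the first record's timestamp, then take
# pairwise differences between consecutive reconnection timestamps.
ts = [ac[0]["timestamp"]] + list(map(lambda e: e["timestamp"], reconnected))
intervals = list(map(lambda pair: pair[1] - pair[0], zip(ts, ts[1:])))

summary = {"voltas": {"ac": {
    "number_of_actual_connection_drops": len(drops),
    "time_interval_between_droppings": intervals,
    "number_of_successful_reconnections": len(reconnected),
    "number_of_failure_reconnections": sum(
        map(lambda e: e["reconnection_status"] == "false", ac)),
}}}
print(json.dumps(summary, indent=2))
```

For more "ac"-like keys you could wrap the same computation in a dict comprehension over log["voltas"].items().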
I am trying to convert a JSON string to a CSV file which I can then work on further in Excel. For that, I am using the following script: https://github.com/vinay20045/json-to-csv
I spent a few hours on it yesterday but could not get it working :(
I reduced my JSON string to a minimum for the sake of explaining what I mean.
https://pastebin.com/Vjt799Bb
{
"page": 1,
"pages": 2270,
"limit": 10,
"total": 22693,
"items": [
{
"address": {
"city": "cityname first dataset",
"company_name": "companyname first dataset"
},
"amount": 998,
"items": [
{
"description": "first part of first dataset",
"number": "part number of first part of first dataset"
}
],
"number": "number of first dataset",
"service_date": {
"type": "DEFAULT",
"date": "2015-11-18"
},
"vat_option": null
},
{
"address": {
"city": "cityname second dataset",
"company_name": "companyname second dataset"
},
"amount": 998,
"items": [
{
"description": "first part of second dataset",
"number": "part number of first part of second dataset"
},
{
"description": "second part of second dataset",
"number": "part number of second part of second dataset"
}
],
"number": "number of second dataset",
"service_date": {
"type": "DEFAULT",
"date": "2015-11-18"
},
"vat_option": null
}
]
}
I would really appreciate it if you could take a look at it.
The script now delivers the following result:
.dropbox.com/s/165zbfl8wn52syf/scriptresult.jpg?dl=0
(please add www in front to have a link)
What the script needs to do is the following (F3, G4, and so on are cell references in the screenshot above):
- copy F3 and G3 to D4 and E4
- remove columns F and G
- copy A3:C3 to A4:C4
- copy F3:I3 to F4:I4
Target CSV will then look like:
.dropbox.com/s/l1wj3ntrlomwmaq/target.jpg?dl=0
(please add www in front to have a link)
So all in all, the „items_items_0“/„items_items_1“ columns are the problem: whenever the JSON data has sub-items, the current script gives them new columns in the header, but I'd like to have them in new rows instead.
Do you see any way I can achieve that? The logic is quite clear to me, but I am an absolute newbie in Python; maybe that's the problem :(
Thank you for your great support!
Cheers,
Tom
I do agree: you're asking about the usage of a specific package without providing your actual code.
I went ahead, made some assumptions, and created a snippet which could help you solve your issue. Instead of using the script you linked, I manually build a dictionary and then use Pandas for printing, potential modification, and eventual export. Note: this does not solve your problem completely (since I'm not getting it to the fullest extent); rather, it hopes to give you a good start with some of the tools and techniques.
See the .ipynb file in this Gist, https://gist.github.com/AndiH/4d4ef85e2dec395a0ae5343c648565eb, the gist of which I'll paste below:
import pandas as pd
import json

# Read the raw JSON from disk
with open("input.json") as f:
    rawjson = json.load(f)

# Flatten the fields of interest into one flat dictionary per item
data = []
for element in rawjson["items"]:
    data.append({
        "item_address_city": element["address"]["city"],
        "item_address_company_name": element["address"]["company_name"],
        "items_amount": element["amount"]
    })

df = pd.DataFrame(data)
df.head()  # inspect the result (useful in a notebook)
df.to_csv("output.csv")
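For the specific "sub-items as rows" requirement, pandas also ships json_normalize, which can explode the nested items list into one row per sub-item while repeating the parent fields on every row. A sketch against a trimmed, inlined version of the posted input (the field values are placeholders):

```python
import pandas as pd

rawjson = {  # trimmed stand-in for the posted input.json
    "items": [
        {"address": {"city": "c1", "company_name": "co1"},
         "amount": 998, "number": "n1",
         "items": [{"description": "d1", "number": "p1"}]},
        {"address": {"city": "c2", "company_name": "co2"},
         "amount": 998, "number": "n2",
         "items": [{"description": "d2a", "number": "p2a"},
                   {"description": "d2b", "number": "p2b"}]},
    ]
}

# record_path explodes each entry of the nested "items" list into its own
# row; meta repeats the parent fields on every row, which gives the
# "new rows instead of new columns" layout asked for above.
df = pd.json_normalize(
    rawjson["items"],
    record_path="items",
    meta=[["address", "city"], ["address", "company_name"], "amount", "number"],
    record_prefix="item_",
    sep="_",
)
df.to_csv("output.csv", index=False)
```

The second dataset, with two sub-items, yields two rows sharing the same address and amount values.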