Python - Processing and analysing log data using functional programming

This is a straightforward question: how do I use Python to process a log file (consider it a JSON string for now)? Below is the JSON data:
{
"voltas": {
"ac": [
{
"timestamp":1590761564,
"is_connected":true,
"reconnection_status":"N/A"
},
{
"timestamp":1590761566,
"is_connected":true,
"reconnection_status":"N/A"
},
{
"timestamp":1590761568,
"is_connected":false,
"reconnection_status":"true"
},
{
"timestamp":1590761570,
"is_connected":true,
"reconnection_status":"N/A"
},
{
"timestamp":1590761572,
"is_connected":true,
"reconnection_status":"N/A"
},
{
"timestamp":1590761574,
"is_connected":false,
"reconnection_status":"false"
},
{
"timestamp":1590761576,
"is_connected":false,
"reconnection_status":"true"
}
]
}
}
Since the question is just about how to process the JSON data, I am skipping any discussion of the data itself. What I need is the analysed data, as below:
{
"voltas": {
"ac": {
"number_of_actual_connection_drops": 3,
"time_interval_between_droppings": [4, 8],
"number_of_successful_reconnections": 2,
"number_of_failure_reconnections": 1
}
}
}
This is how the data is analysed:
"number_of_actual_connection_drops": the number of entries with "is_connected": false.
"time_interval_between_droppings": a list populated from the end (each new value is prepended). We pick the timestamp of the entry with "is_connected": false and "reconnection_status": "true"; here that is the last (7th) block, with timestamp 1590761576. Then we find the timestamp of the previous block with "is_connected": false and "reconnection_status": "true"; here it is the 3rd block, with timestamp 1590761568. The last item in the list is the difference of these timestamps, 8, so the list is [8].
From timestamp 1590761568 there is no earlier block with "is_connected": false and "reconnection_status": "true", so we take the first entry's timestamp, 1590761564; the difference is 4, so the list becomes [4, 8].
"number_of_successful_reconnections": the number of entries with "reconnection_status": "true".
"number_of_failure_reconnections": the number of entries with "reconnection_status": "false".
We can achieve this with for loops and some if conditions, but I am interested in doing it in a functional style (reduce, map, filter) in Python.
For simplicity I have only shown "ac"; there will be many similar items. Thanks.
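One possible sketch in the functional style asked about, using filter/map/reduce. The interval logic assumes the anchor points are the first reading's timestamp plus every drop whose reconnection succeeded, which reproduces the [4, 8] walkthrough above:

```python
from functools import reduce

# The "ac" readings from the question, as parsed Python dicts.
readings = [
    {"timestamp": 1590761564, "is_connected": True,  "reconnection_status": "N/A"},
    {"timestamp": 1590761566, "is_connected": True,  "reconnection_status": "N/A"},
    {"timestamp": 1590761568, "is_connected": False, "reconnection_status": "true"},
    {"timestamp": 1590761570, "is_connected": True,  "reconnection_status": "N/A"},
    {"timestamp": 1590761572, "is_connected": True,  "reconnection_status": "N/A"},
    {"timestamp": 1590761574, "is_connected": False, "reconnection_status": "false"},
    {"timestamp": 1590761576, "is_connected": False, "reconnection_status": "true"},
]

def analyse(readings):
    # All drops: entries where the connection is down.
    drops = list(filter(lambda r: not r["is_connected"], readings))
    # Timestamps of drops that reconnected successfully.
    ok = list(map(lambda r: r["timestamp"],
                  filter(lambda r: r["reconnection_status"] == "true", drops)))
    # Anchors: first reading plus every successful reconnection; consecutive
    # differences give the intervals described in the question.
    anchors = [readings[0]["timestamp"]] + ok
    intervals = list(map(lambda pair: pair[1] - pair[0],
                         zip(anchors, anchors[1:])))
    return {
        "number_of_actual_connection_drops": len(drops),
        "time_interval_between_droppings": intervals,
        "number_of_successful_reconnections": len(ok),
        "number_of_failure_reconnections": reduce(
            lambda acc, r: acc + (r["reconnection_status"] == "false"), drops, 0),
    }

# Applying per device; with many devices this becomes a dict comprehension
# over each brand's device lists.
result = {"voltas": {"ac": analyse(readings)}}
```

The same `analyse` can be mapped over every device list under every brand, e.g. `{brand: {dev: analyse(rs) for dev, rs in devices.items()} for brand, devices in data.items()}`.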

Related

Using Python jasonpath_ng to Filter Lists of Objects

I had a look at other answers for similar problems, but this doesn't seem to be working for me.
I have a simple requirement to filter a JSON list by a value in the objects of the list.
I.e.
jsonpath_expression = parse("$.balances[?(#.asset=='BTC')].free")
This path works on https://jsonpath.com/ with the following JSON.
{
"makerCommission": 10,
"takerCommission": 10,
"buyerCommission": 0,
"sellerCommission": 0,
"canTrade": true,
"canWithdraw": true,
"canDeposit": true,
"brokered": false,
"accountType": "SPOT",
"balances": [
{
"asset": "BTC",
"free": "0.06437673",
"locked": "0.00000000"
},
{
"asset": "LTC",
"free": "0.00000000",
"locked": "0.00000000"
}
]
}
When I try in python I get jsonpath_ng.exceptions.JsonPathLexerError: Error on line 1, col 11: Unexpected character: ?
I've tried quite a few variations based on other articles, which produce various other jsonpath parse errors; this one looked promising and I believe it aligns with my attempts.
Any ideas what I am doing wrong?
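For what it's worth, the stock `jsonpath_ng.parse` does not understand `?(...)` filter expressions; those live in the extended parser (`from jsonpath_ng.ext import parse`), and the current-element token there is `@`, not `#`. If you would rather skip the extension entirely, a plain comprehension performs the same filter; a sketch against a payload shaped like the question's:

```python
# Hypothetical account payload mirroring the structure in the question.
account = {
    "accountType": "SPOT",
    "balances": [
        {"asset": "BTC", "free": "0.06437673", "locked": "0.00000000"},
        {"asset": "LTC", "free": "0.00000000", "locked": "0.00000000"},
    ],
}

# Equivalent of $.balances[?(@.asset=='BTC')].free without any JSONPath library:
# keep only balances whose asset is BTC, then project the "free" field.
free_btc = [b["free"] for b in account["balances"] if b["asset"] == "BTC"]
```

The comprehension returns a list (here with one element) just as a JSONPath filter would match zero or more nodes.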

Sort a json file in python

I had a list containing a single long string and I wanted to print the output in a particular form, as in
convert list to a particular json in python
but after the conversion the order of the data changed. How can I maintain the same order?
input_data =
[
"21:15-21:30 IllegalAgrumentsException 1,
21:15-21:30 NullPointerException 2,
22:00-22:15 UserNotFoundException 1,
22:15-22:30 NullPointerException 1
....."
]
Code to convert the data into the particular JSON form:
import collections
import re

input_data = input_data[0]  # input_data is a list holding one long string
input_data = re.split(r',\s*', input_data)
output = collections.defaultdict(collections.Counter)
# print(output)
for line in input_data:
    time, error, count = line.split(None, 2)
    output[time][error] += int(count)
print(output)
response = [
    {
        "time": time,
        "logs": [
            {"exception": exception, "count": count}
            for (exception, count) in counter.items()
        ],
    }
    for (time, counter) in output.items()
]
print(response)
My output:
{
"response": [
{
"logs": [
{
"count": 1,
"exception": "UserNotFoundException"
}
],
"time": "22:45-23:00"
},
{
"logs": [
{
"count": 1,
"exception": "NullPointerException"
}
],
"time": "23:00-23:15"
}...
]
}
So the order changed, but I need my data in the same order, i.e. starting from 21:15-21:30 and so on. How can I maintain the same order?
Your timestamps are already sortable, so if you don't care about the order of individual exceptions, you can just do:
for (time, counter) in sorted(output.items())
which sorts lexicographically by the time string (the counters are never compared, because the times are unique keys). An explicit key, sorted(output.items(), key=lambda x: x[0]), does the same; if you wanted, say, times ascending but total counts descending, you could use key=lambda x: (x[0], -sum(x[1].values())).
The data is read into a dictionary, a defaultdict to be precise:
output[time][error] += int(count)
This data structure groups the data by time and by error type, which implies there may be multiple input lines with the same time and the same error type. Once the data is regrouped like that, there is no way to recover the exact original order.
On the other hand, you probably expect the times to be ordered in the input, and even if they are not, you want the output ordered by time, so you just need to do exactly that. Instead of this:
for (time, counter) in output.items()
do this:
for time in sorted(output)
and then get the counter as
counter = output[time]
EDIT: the times are ordered in the input, but the log does not start at 0:00, so sorting by the time string is not correct here. Instead, sort by the order in which the times originally appear.
Therefore, remember the original time order:
time_order = []
for line in input_data:
    time, error, count = line.split(None, 2)
    output[time][error] += int(count)
    time_order.append(time)
Then later sort by it:
for time in sorted(output, key=time_order.index)
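Putting the pieces together, a complete version of the fix might look like this. The sample string is taken from the question; as a small tweak, each time is remembered only once, so `time_order.index` stays cheap:

```python
import collections
import re

input_data = [
    "21:15-21:30 IllegalAgrumentsException 1, "
    "21:15-21:30 NullPointerException 2, "
    "22:00-22:15 UserNotFoundException 1, "
    "22:15-22:30 NullPointerException 1"
]

lines = re.split(r',\s*', input_data[0])
output = collections.defaultdict(collections.Counter)
time_order = []
for line in lines:
    time, error, count = line.split(None, 2)
    output[time][error] += int(count)
    if time not in time_order:  # remember each time at its first appearance only
        time_order.append(time)

# Iterate the grouped data in the original order of appearance.
response = [
    {
        "time": time,
        "logs": [
            {"exception": exception, "count": count}
            for (exception, count) in output[time].items()
        ],
    }
    for time in sorted(output, key=time_order.index)
]
```

`time_order.index` is a linear scan per lookup; for many distinct times, a `{time: position}` dict would be the faster key function.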

Parse large amounts of JSON data in python

I am looking for an efficient way, in terms of memory as well as speed, to parse large amounts of JSON data (on the order of several hundred MB) in Python.
I looked at the ijson package: https://pypi.org/project/ijson/ and experimented with its parse function, but the event-based for loop appears to be very slow if I parse something in the middle or at the bottom of the JSON.
Below is a sample of the JSON structure I have. The JSON represents data for several hundred websites, so data.items.website can grow very large, and so does data.items.ids. With that kind of large structure, I found the ijson event-parser loop to be very slow when fetching information from the middle or bottom of data.items.website or data.items.ids.
{
"data": {
"items": {
"website": [
[
"abc.com",
[
[
"data1",
{
"type1": 10,
"type2": 11
}
],
[
"data2",
{
"type1": 100,
"type2": 150
}
],
[
"data3",
{
"type1": 40,
"type2": 50
}
]
]
]
]
},
"ids": [
[
"id1",
1
],
[
"id2",
2
]
]
}
}
for prefix, event, value in parser:
    if prefix.startswith('data.items.ids'):
        print(f"prefix {prefix}, event {event}, value {value}")
I am not sure if I am using ijson correctly, or in the best way to reap its benefits, so that it is fast while keeping memory consumption low when parsing large amounts of JSON.
Could someone help?

Trouble when storing API data in Python list

I'm struggling with JSON data that I get from an API. I've gone through several API URLs to grab my data and stored it in a list. I then want to take out every field that says "reputation", since I'm only interested in that number. See my code here:
import json
import requests

f = requests.get('my_api_url')
if f.ok:
    data = json.loads(f.content)

url_list = []  # the list stores a number of urls that I want to request data from
for items in data:
    url_list.append(items['details_url'])  # grab the urls that I want to enter

total_url = []  # stores all data from all urls here
for index in range(len(url_list)):
    url = requests.get(url_list[index])
    if url.ok:
        url_data = json.loads(url.content)
        total_url.append(url_data)

print(json.dumps(total_url, indent=2))  # only want to see if it's working
So far so good: I can request all the URLs and get the data. It's the next step where I run into trouble. The above code outputs the following JSON data:
[
[
{
"id": 316,
"name": "storabro",
"url": "https://storabro.net",
"customer": true,
"administrator": false,
"reputation": 568
}
],
[
{
"id": 541,
"name": "sega",
"url": "https://wedonthaveanyyet.com",
"customer": true,
"administrator": false,
"reputation": 45
},
{
"id": 90,
"name": "Villa",
"url": "https://brandvillas.co.uk",
"customer": true,
"administrator": false,
"reputation": 6
}
]
]
However, I only want to print out the reputation, and I cannot get it working. If I instead use print(total_url['reputation']), it fails with "TypeError: list indices must be integers or slices, not str", and if I try:
for s in total_url:
    print(s['reputation'])
I get the same TypeError.
Feels like I've tried everything but I can't find any answers on the web that can help me, but I understand I still have a lot to learn and that my error will be obvious to some people here. It seems very similar to other things I've done with Python, but this time I'm stuck. To clarify, I'm expecting an output similar to: [568, 45, 6]
Perhaps I used the wrong way to do this from the beginning and that's why it's not working all the way for me. Started to code with Python in October and it's still very new to me but I want to learn. Thank you all in advance!
It looks like your total_url is a list of lists, so you might write a function like:
def get_reputations(data):
    for url in data:
        for obj in url:
            print(obj.get('reputation'))

get_reputations(total_url)
# output:
# 568
# 45
# 6
If you'd rather not work with a list of lists in the first place, you can extend the list with each result instead of append when constructing total_url.
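As a sketch of that suggestion (the sample objects are trimmed from the output shown in the question), extend flattens each per-URL list as it arrives, after which a single comprehension collects the numbers:

```python
# Simulated per-URL responses; in the real code each one would come from
# json.loads(requests.get(...).content).
page1 = [{"id": 316, "name": "storabro", "reputation": 568}]
page2 = [{"id": 541, "name": "sega", "reputation": 45},
         {"id": 90, "name": "Villa", "reputation": 6}]

total_url = []
for url_data in (page1, page2):
    total_url.extend(url_data)  # extend, not append: keeps the list flat

# With a flat list, projecting one field is a single comprehension.
reputations = [obj["reputation"] for obj in total_url]
```

This produces the `[568, 45, 6]` shape the question asks for.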
You can also fetch the URL with urllib and parse the body with json.loads:
from urllib.request import urlopen
import json

def get_rep():
    response = urlopen(api_url)  # api_url is your endpoint
    r = response.read().decode('utf-8')
    r_obj = json.loads(r)
    for item in r_obj['response']:  # assumes the payload wraps the objects in a "response" key
        print("Reputation: {}".format(item['reputation']))

sub_items generate new columns in JSON to CSV converter with python

I am trying to convert a JSON string to a CSV file which I can then work on in Excel. For that, I am using the following script: https://github.com/vinay20045/json-to-csv
I spent a few hours on it yesterday but could not get it working :(
I reduced my JSON string to a minimum for the sake of explaining what I mean.
https://pastebin.com/Vjt799Bb
{
"page": 1,
"pages": 2270,
"limit": 10,
"total": 22693,
"items": [
{
"address": {
"city": "cityname first dataset",
"company_name": "companyname first dataset"
},
"amount": 998,
"items": [
{
"description": "first part of first dataset",
"number": "part number of first part of first dataset"
}
],
"number": "number of first dataset",
"service_date": {
"type": "DEFAULT",
"date": "2015-11-18"
},
"vat_option": null
},
{
"address": {
"city": "cityname second dataset",
"company_name": "companyname second dataset"
},
"amount": 998,
"items": [
{
"description": "first part of second dataset",
"number": "part number of first part of second dataset"
},
{
"description": "second part of second dataset",
"number": "part number of second part of second dataset"
}
],
"number": "number of second dataset",
"service_date": {
"type": "DEFAULT",
"date": "2015-11-18"
},
"vat_option": null
}
]
}
I would really appreciate it if you could take a look at it.
The script currently delivers the following result:
.dropbox.com/s/165zbfl8wn52syf/scriptresult.jpg?dl=0
(please add www in front to have a link)
What the script now needs to do is the following (F3, G4 and so on are cell references from the above screenshot):
- copy F3 and G3 to D4 and E4
- remove columns F and G
- copy A3:C3 to A4:C4
- copy F3:I3 to F4:I4
Target CSV will then look like:
.dropbox.com/s/l1wj3ntrlomwmaq/target.jpg?dl=0
(please add www in front to have a link)
So all in all, the "items_items_0" / "items_items_1" columns are the problem: when the JSON data has sub-items, the current script gives them new columns in the header, but I'd like to have them in new rows instead.
Do you see any way I can achieve that? The logic is quite clear to me, but I am an absolute newbie in Python - maybe that's the problem :(
Thank you for your great support!
Cheers,
Tom
I do agree: You're asking about the usage of a specific package without providing the actual code.
I went ahead, made some assumptions, and created a snippet which could help you solve your issue. Instead of using the script you linked, I manually build a dictionary and then use Pandas for printing, potential modification, and eventual export. Note: this does not solve your problem to the fullest extent (since I'm not entirely sure I understand it); rather, it hopes to give you a good start with some of the tools and techniques.
See .ipynb file in this Gist, https://gist.github.com/AndiH/4d4ef85e2dec395a0ae5343c648565eb, the gist of it I'll paste below:
import pandas as pd
import json

with open("input.json") as f:
    rawjson = json.load(f)

data = []
for element in rawjson["items"]:
    data.append({
        "item_address_city": element["address"]["city"],
        "item_address_company_name": element["address"]["company_name"],
        "items_amount": element["amount"]
    })

df = pd.DataFrame(data)
df.head()
df.to_csv("output.csv")
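For the sub-items-as-rows requirement specifically, pandas' own json_normalize may get closer than a hand-rolled loop. The dict below is a trimmed, hypothetical version of the pasted JSON; meta_prefix is needed because both the parent dataset and its sub-items carry a "number" field:

```python
import pandas as pd

# Trimmed stand-in for the JSON from the question.
rawjson = {
    "items": [
        {
            "address": {"city": "city A", "company_name": "company A"},
            "amount": 998,
            "items": [{"description": "part 1 of A", "number": "p1A"}],
            "number": "doc A",
        },
        {
            "address": {"city": "city B", "company_name": "company B"},
            "amount": 998,
            "items": [
                {"description": "part 1 of B", "number": "p1B"},
                {"description": "part 2 of B", "number": "p2B"},
            ],
            "number": "doc B",
        },
    ]
}

# record_path="items" emits one row per sub-item; meta copies the parent
# fields onto every row; meta_prefix avoids the "number" name clash.
df = pd.json_normalize(
    rawjson["items"],
    record_path="items",
    meta=[["address", "city"], ["address", "company_name"], "amount", "number"],
    meta_prefix="doc_",
)
df.to_csv("output.csv", index=False)
```

Each dataset's parent columns are repeated on every one of its sub-item rows, which matches the "copy A3:C3 to A4:C4" steps described above.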
