Convert Excel columns into nested arrays/lists using pandas - Python

I have some data in a CSV file, and I need to convert it into nested lists and arrays in JSON format. Here is a sample of the data:
The desired output is:
{
"topics":[
{
"topicID":1,
"labels":[
{
"phrase":"security level",
"prob":0.3
},
{
"phrase":" hack",
"prob":0.3
},
{
"phrase":"our server lab",
"prob":0.2
},
{
"phrase":" people",
"prob":0.2
},
{
"phrase":" trouble",
"prob":0.2
}
]
},
{
"topicID":2,
"labels":[
{
"phrase":"base3",
"prob":0.4806
}
]
}
]
}
and so on.
I have extracted just five columns to get the topics array:
df.loc[:, ['t_1', 't_2', 't_3', 't_4', 't_5']]
and I have converted the topic columns to an array:
topic_list = df[['t_1', 't_2', 't_3', 't_4', 't_5']].values
but I am clueless about how to append the phrases and the other columns to this array.
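The sample data is not shown, so the following is only a minimal sketch under stated assumptions: each row is one topic, the phrase columns are named t_1, t_2, ... as above, and the probabilities live in hypothetical columns p_1, p_2, ... next to a hypothetical topicID column. It builds the nested structure row by row instead of going through `.values`:

```python
import json

import pandas as pd

# Hypothetical stand-in for the CSV; the real column layout is not shown above.
df = pd.DataFrame({
    "topicID": [1, 2],
    "t_1": ["security level", "base3"],
    "t_2": ["hack", None],
    "p_1": [0.3, 0.4806],   # assumed probability columns
    "p_2": [0.3, None],
})

N_LABELS = 2  # would be 5 for t_1..t_5

topics = []
for row in df.to_dict(orient="records"):
    labels = []
    for i in range(1, N_LABELS + 1):
        phrase = row[f"t_{i}"]
        if pd.isna(phrase):
            continue  # skip empty phrase slots
        labels.append({"phrase": phrase, "prob": float(row[f"p_{i}"])})
    topics.append({"topicID": int(row["topicID"]), "labels": labels})

result = json.dumps({"topics": topics}, indent=2)
print(result)
```

If the real file keeps phrases and probabilities in a different shape, only the inner loop that builds `labels` should need to change.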

Related

convert json with nested dicts into data frame with python

Can someone explain how I can convert the following JSON into a simple data frame with the following headings?
----- sample----
{
"last_scanned_block": 14968718,
"blocks": {
"13965799": {
"0x9603846aff5c425277e483de16179a68dbc739debcc5449ea99e45c9d0924430": {
"165": {
"from": "0x0000000000000000000000000000000000000000",
"to": "0x01f87c337be5636Cd9B3D48F1159768A7e7837A5",
"value": 100000000000000000000000000,
"timestamp": "2022-01-08T16:19:02"
}
}
},
"13965820": {
"0xd4a4122734a522c40504c8b0ab43b9aa40ac821cd9913179b3ae64e5b166fc57": {
"226": {
"from": "0x01f87c337be5636Cd9B3D48F1159768A7e7837A5",
"to": "0xEa3Fa123Eb40CEEaeED390D8d6dE6AF95f044AF7",
"value": 610000000000000000000000,
"timestamp": "2022-01-08T16:25:12"
}
}
},
--- end----
I'd like the df to have the following 8 column headings and values for each row
(value examples for first row)
Last_scanned_block: 14968718
block: 13965799
hex: 0x9603846aff5c425277e483de16179a68dbc739debcc5449ea99e45c9d0924430
number: 165
from: 0x0000000000000000000000000000000000000000
to: 0x01f87c337be5636Cd9B3D48F1159768A7e7837A5
value: 100000000000000000000000000
timestamp: 2022-01-08T16:19:02
Thanks
I would build a new dictionary from the JSON that is passed in. Instead of the nested dictionaries you have above, you want one flat dictionary keyed by your headings, where each entry has the form:
*heading name* : *list of values*
The resulting format should be:
{"Last_scanned_block" : [14968718], "block" : [13965799], "hex" : ["0x9603846aff5c425277e483de16179a68dbc739debcc5449ea99e45c9d0924430"], "number" : [165], "from" : ["0x0000000000000000000000000000000000000000"], "to" : ["0x01f87c337be5636Cd9B3D48F1159768A7e7837A5"], "value": [100000000000000000000000000], "timestamp" : ["2022-01-08T16:19:02"]}
Then every time you read more data you just append it to each respective list in your dictionary.
Once you have your complete dictionary, you would use pandas. So something along the lines of:
import pandas

# d is the flat dictionary shown above (heading name -> list of values)
frame = pandas.DataFrame(data=d)
print(frame)
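As a minimal sketch of the flattening step, assuming the nesting is always block -> hash -> number -> fields as in the sample (the hashes and addresses are shortened here for readability, and the names `rows` and `tx_hash` are just illustrative), you can collect one dictionary per row and let pandas build the frame:

```python
import pandas as pd

# Abbreviated copy of the sample above (hashes and addresses shortened).
data = {
    "last_scanned_block": 14968718,
    "blocks": {
        "13965799": {
            "0x9603": {
                "165": {
                    "from": "0x0000",
                    "to": "0x01f8",
                    "value": 100000000000000000000000000,
                    "timestamp": "2022-01-08T16:19:02",
                }
            }
        },
        "13965820": {
            "0xd4a4": {
                "226": {
                    "from": "0x01f8",
                    "to": "0xEa3F",
                    "value": 610000000000000000000000,
                    "timestamp": "2022-01-08T16:25:12",
                }
            }
        },
    },
}

rows = []
for block, hashes in data["blocks"].items():
    for tx_hash, entries in hashes.items():
        for number, fields in entries.items():
            rows.append({
                "Last_scanned_block": data["last_scanned_block"],
                "block": int(block),
                "hex": tx_hash,
                "number": int(number),
                **fields,  # from, to, value, timestamp
            })

frame = pd.DataFrame(rows)
print(frame)
```

A list of row dictionaries is equivalent to the dictionary-of-lists format above, but it saves you from appending to eight separate lists by hand.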

Split Pandas DataFrame objects into 2 JSON objects

I have a requirement to send JSON data via a REST API call, but the data needs to be split into separate files because there is a file size limit.
I currently collect the data from a HANA database, and split it into 2 files:
connection = dbapi.connect(address='<>', port='<>', user='<>', password='<>')
df = pd.read_sql('''QUERY''', connection)
length_df = len(df)
length_half = math.ceil(len(df) / 2)
if length_df > length_half:
    payload_1 = df[:length_half]
    payload_2 = df[length_half:]
This returns 2 perfectly separated DataFrames, which I need to process as JSON objects.
After this has been done, I iterate over these DataFrames separately and put them into the format that the client requires:
item_data_load_1 = []
for row in payload_1.itertuples(index=False):
    item = {
        "itemId": {
            "primaryId": row[0].lstrip('0')
        },
        "description": {
            "languageCode": "",
            "descriptionType": "",
            "value": row[25]
        },
        "classifications": {
            "itemType": row[7]
        },
        "tradeItemBaseUnitOfMeasure": row[8],
        "demandUnitInformation": {
            "demandUnitName": row[0].lstrip('0'),
            "description": {
                "value": row[25]
            },
            "actionCode": "",
            "demandUnitDetails": {
                "demandUnitBaseUnitOfMeasure": row[8],
                "brandName": row[6],
                "unitSize": row[19],
                "handlingInstruction": {
                    "handlingInstructionCode": ""
                }
            }
        }
    }
    item_data_load_1.append(item)
item_data_load_2 = []
for row in payload_2.itertuples(index=False):
    item = {
        "itemId": {
            "primaryId": row[0].lstrip('0')
        },
        "description": {
            "languageCode": "",
            "descriptionType": "",
            "value": row[25]
        },
        "classifications": {
            "itemType": row[7]
        },
        "tradeItemBaseUnitOfMeasure": row[8],
        "demandUnitInformation": {
            "demandUnitName": row[0].lstrip('0'),
            "description": {
                "value": row[25]
            },
            "actionCode": "",
            "demandUnitDetails": {
                "demandUnitBaseUnitOfMeasure": row[8],
                "brandName": row[6],
                "unitSize": row[19],
                "handlingInstruction": {
                    "handlingInstructionCode": ""
                }
            }
        }
    }
    item_data_load_2.append(item)
I do this for both payloads, and then dump it into a JSON object:
json_data_1 = json.dumps(item_data_load_1)
json_data_2 = json.dumps(item_data_load_2)
However, the first file I produce has the correct number of records in it, but the second file has double that. It looks as if it takes the data from the first payload and then appends the second payload to it (essentially making it one big file).
Row count of 1st dataframe:
44938
Row count of 2nd dataframe:
44938
Row count of full dataframe:
89876
Length of 1st payload:
39139770
Length of 2nd payload:
78279080
This should not occur, though, as it should just produce 2 JSON files with the same record counts.
I cannot share the full script as it has sensitive information in it.
Can someone please explain to me why this is happening?
Thanks in advance.
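The excerpt above is incomplete, so the cause cannot be confirmed from it. One thing worth ruling out is state shared between the two loops, for example a list that is created once and reused. A minimal sketch of one way to rule that out, assuming a hypothetical helper `build_payload` and a two-column stand-in for the query result, is to build each half through a function that creates its own list:

```python
import json
import math

import pandas as pd


def build_payload(frame):
    """Return a fresh list of item dicts for one slice of the DataFrame."""
    items = []  # new list on every call, so nothing carries over between halves
    for row in frame.itertuples(index=False):
        items.append({
            "itemId": {"primaryId": row[0].lstrip("0")},
            "tradeItemBaseUnitOfMeasure": row[1],
        })
    return items


# Hypothetical stand-in for the HANA query result.
df = pd.DataFrame({"item": ["0001", "0002", "0003", "0004"], "uom": ["EA"] * 4})

half = math.ceil(len(df) / 2)
payload_1 = build_payload(df.iloc[:half])
payload_2 = build_payload(df.iloc[half:])

json_data_1 = json.dumps(payload_1)
json_data_2 = json.dumps(payload_2)
print(len(payload_1), len(payload_2))
```

This also removes the duplicated loop body, so any change to the item format only has to be made in one place.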

How to scrape attributes from json values

I am trying to scrape some values from JSON that looks like:
{
"attributes":{
"531":{
"id":"531",
"code":"taille",
"label":"taille",
"options":[
{
"id":"30",
"label":"40",
"is_in":"0"
},
{
"id":"31",
"label":"41",
"is_in":"1"
}
]
}
},
"template":"Helloworld"
}
My issue is that the number 531 is different in each JSON file that I am trying to scrape, and what I am trying to grab from this JSON is the label and is_in values.
What I have tried so far is something like this, but I am stuck and don't know what to do when 531 changes to something else:
import json

getOption = '''{
    "attributes": {
        "531": {
            "id": "531",
            "code": "taille",
            "label": "taille",
            "options": [
                {
                    "id": "30",
                    "label": "40",
                    "is_in": "0"
                },
                {
                    "id": "31",
                    "label": "41",
                    "is_in": "1"
                }
            ]
        }
    },
    "template": "Helloworld"
}'''
for att, values in json.loads(getOption).items():
    print(values)
So how can I scrape the label and is_in values?
I'm not sure if you can have several 531 keys but you can loop through them.
getOption = {
"attributes":{
"531":{
"id":"531",
"code":"taille",
"label":"taille",
"options":[
{
"id":"30",
"label":"40",
"is_in":"0"
},
{
"id":"31",
"label":"41",
"is_in":"1"
}
]
}
},
"template":"Helloworld"
}
attributes = getOption['attributes']
for key in attributes.keys():
    for item in attributes[key]['options']:
        print(item['label'], item['is_in'])
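If the JSON arrives as text rather than as a Python dictionary, it has to be parsed first. The following is a minimal sketch of the same loop starting from a string; the helper name `option_flags` is made up for illustration, and iterating over `.values()` means the changing key (531 here) never has to be known:

```python
import json

raw = '''
{
  "attributes": {
    "531": {
      "id": "531",
      "code": "taille",
      "label": "taille",
      "options": [
        {"id": "30", "label": "40", "is_in": "0"},
        {"id": "31", "label": "41", "is_in": "1"}
      ]
    }
  },
  "template": "Helloworld"
}
'''


def option_flags(document):
    """Collect (label, is_in) pairs without knowing the attribute key."""
    pairs = []
    for attribute in document["attributes"].values():
        for option in attribute["options"]:
            pairs.append((option["label"], option["is_in"]))
    return pairs


pairs = option_flags(json.loads(raw))
print(pairs)  # [('40', '0'), ('41', '1')]
```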

Dictionary length is equal to 3 but when trying to access an index receiving KeyError

I am attempting to parse a json response that looks like this:
{
"links": {
"next": "http://www.neowsapp.com/rest/v1/feed?start_date=2015-09-08&end_date=2015-09-09&detailed=false&api_key=xxx",
"prev": "http://www.neowsapp.com/rest/v1/feed?start_date=2015-09-06&end_date=2015-09-07&detailed=false&api_key=xxx",
"self": "http://www.neowsapp.com/rest/v1/feed?start_date=2015-09-07&end_date=2015-09-08&detailed=false&api_key=xxx"
},
"element_count": 22,
"near_earth_objects": {
"2015-09-08": [
{
"links": {
"self": "http://www.neowsapp.com/rest/v1/neo/3726710?api_key=xxx"
},
"id": "3726710",
"neo_reference_id": "3726710",
"name": "(2015 RC)",
"nasa_jpl_url": "http://ssd.jpl.nasa.gov/sbdb.cgi?sstr=3726710",
"absolute_magnitude_h": 24.3,
"estimated_diameter": {
"kilometers": {
"estimated_diameter_min": 0.0366906138,
"estimated_diameter_max": 0.0820427065
},
"meters": {
"estimated_diameter_min": 36.6906137531,
"estimated_diameter_max": 82.0427064882
},
"miles": {
"estimated_diameter_min": 0.0227984834,
"estimated_diameter_max": 0.0509789586
},
"feet": {
"estimated_diameter_min": 120.3760332259,
"estimated_diameter_max": 269.1689931548
}
},
"is_potentially_hazardous_asteroid": false,
"close_approach_data": [
{
"close_approach_date": "2015-09-08",
"close_approach_date_full": "2015-Sep-08 09:45",
"epoch_date_close_approach": 1441705500000,
"relative_velocity": {
"kilometers_per_second": "19.4850295284",
"kilometers_per_hour": "70146.106302123",
"miles_per_hour": "43586.0625520053"
},
"miss_distance": {
"astronomical": "0.0269230459",
"lunar": "10.4730648551",
"kilometers": "4027630.320552233",
"miles": "2502653.4316094954"
},
"orbiting_body": "Earth"
}
],
"is_sentry_object": false
},
}
I am trying to figure out how to get to the "miss_distance" dictionary values, and I am unable to wrap my head around it.
Here is what I have been able to do so far:
After I get a Response object from requests.get():
response = requests.get(url)
I convert the response object to a JSON object:
data = response.json()  # this returns a dictionary object
I try to parse the first level of the dictionary:
for i in data:
    if i == "near_earth_objects":
        dataset1 = data["near_earth_objects"]["2015-09-08"]
        # this returns the next object, which is of type list
Could someone please explain:
1. How to decipher this response in the first place.
2. How I can move forward in parsing the response object and get to the miss_distance dictionary.
Any pointers/help is appreciated.
Thank you
Your data will have multiple dictionaries for each date, near-earth object, and close approach:
near_earth_objects = data['near_earth_objects']
for date in near_earth_objects:
    objects = near_earth_objects[date]
    for obj in objects:
        close_approach_data = obj['close_approach_data']
        for close_approach in close_approach_data:
            print(close_approach['miss_distance'])
The code below gives you a list of date and miss_distance entries for every object on every date:
import json
raw_json = '''
{
"near_earth_objects": {
"2015-09-08": [
{
"close_approach_data": [
{
"miss_distance": {
"astronomical": "0.0269230459",
"lunar": "10.4730648551",
"kilometers": "4027630.320552233",
"miles": "2502653.4316094954"
},
"orbiting_body": "Earth"
}
]
}
]
}
}
'''
if __name__ == "__main__":
    parsed = json.loads(raw_json)
    # assuming this json includes more than one near_earth_object spread across dates
    near_objects = []
    for date, near_objs in parsed['near_earth_objects'].items():
        for obj in near_objs:
            for appr in obj['close_approach_data']:
                o = {
                    'date': date,
                    'miss_distances': appr['miss_distance']
                }
                near_objects.append(o)
    print(near_objects)
output:
[
{'date': '2015-09-08',
'miss_distances': {
'astronomical': '0.0269230459',
'lunar': '10.4730648551',
'kilometers': '4027630.320552233',
'miles': '2502653.4316094954'
}
}
]
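The distances in this response are strings, so a follow-up step is usually needed before doing arithmetic on them. A minimal sketch of that step, assuming every value under miss_distance parses as a number (the helper name `miss_distance_rows` is made up for illustration), flattens each close approach into one row and converts the values to floats:

```python
import json

raw_json = '''
{
  "near_earth_objects": {
    "2015-09-08": [
      {
        "name": "(2015 RC)",
        "close_approach_data": [
          {
            "miss_distance": {
              "astronomical": "0.0269230459",
              "lunar": "10.4730648551",
              "kilometers": "4027630.320552233",
              "miles": "2502653.4316094954"
            },
            "orbiting_body": "Earth"
          }
        ]
      }
    ]
  }
}
'''


def miss_distance_rows(parsed):
    """Return one flat dict per close approach, with distances as floats."""
    rows = []
    for date, objects in parsed["near_earth_objects"].items():
        for obj in objects:
            for approach in obj["close_approach_data"]:
                row = {"date": date, "name": obj.get("name")}
                for unit, text in approach["miss_distance"].items():
                    row[unit] = float(text)  # the API returns numbers as strings
                rows.append(row)
    return rows


rows = miss_distance_rows(json.loads(raw_json))
print(rows)
```

Flat rows like these can be passed straight to csv.DictWriter or pandas.DataFrame if a table is needed.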

Using Pandas to convert JSON to CSV with specific fields

I am currently trying to convert a JSON file to a CSV file using Pandas.
The code I'm using now converts the whole JSON file to a CSV file:
import pandas as pd
from pandas.io.json import json_normalize

json_data = pd.read_json("out1.json")
df = json_normalize(json_data["events"])
df.to_csv("out.csv")
This is my JSON file:
{
"events": [
{
"raw": "{\"level\": \"INFO\", \"message\": \"Disabled camera with QR scan on by 80801234 at Area A\n\"}",
"logtypes": [
"json"
],
"timestamp": 1537190572023,
"unparsed": null,
"logmsg": "{\"level\": \"INFO\", \"message\": \"Disabled camera with QR scan on by 80801234 at Area A\n\"}",
"id": "c77afb4c-ba7c-11e8-8000-12b233ae723a",
"tags": [
"INFO"
],
"event": {
"json": {
"message": "Disabled camera with QR scan on by 80801234 at Area A\n",
"level": "INFO"
},
"http": {
"clientHost": "116.197.237.29",
"contentType": "text/plain; charset=UTF-8"
}
}
},
{
"raw": "{\"level\": \"INFO\", \"message\": \"Employee number saved successfully.\"}",
"logtypes": [
"json"
],
"timestamp": 1537190528619,
"unparsed": null,
"logmsg": "{\"level\": \"INFO\", \"message\": \"Employee number saved successfully.\"}",
"id": "ad9c0175-ba7c-11e8-803d-12b233ae723a",
"tags": [
"INFO"
],
"event": {
"json": {
"message": "Employee number saved successfully.",
"level": "INFO"
},
"http": {
"clientHost": "116.197.237.29",
"contentType": "text/plain; charset=UTF-8"
}
}
}
]
}
But what I want is just some fields (timestamp, level, message) from the JSON file, not all of them.
I have tried a variety of ways:
df = json_normalize(json_data["timestamp"])  # gives a KeyError on 'timestamp'
df = json_normalize(json_data, 'timestamp', ['event', 'json', ['level', 'message']])  # TypeError: string indices must be integers
Where did I go wrong?
I don't think json_normalize is intended to work on this specific orientation. I could be wrong but from the documentation, it appears that normalization means "Deal with lists within each dictionary".
Assume data is
data = json.load(open('out1.json'))['events']
Look at the first entry
data[0]['timestamp']
1537190572023
json_normalize wants this to be a list
[{'timestamp': 1537190572023}]
Create augmented data2
I don't actually recommend this approach.
If we create data2 accordingly:
data2 = [{**d, **{'timestamp': [{'timestamp': d['timestamp']}]}} for d in data]
We can use json_normalize
json_normalize(
    data2, 'timestamp',
    [['event', 'json', 'level'], ['event', 'json', 'message']]
)
timestamp event.json.level event.json.message
0 1537190572023 INFO Disabled camera with QR scan on by 80801234 a...
1 1537190528619 INFO Employee number saved successfully.
Comprehension
I think it's simpler to just do
pd.DataFrame([
    (d['timestamp'],
     d['event']['json']['level'],
     d['event']['json']['message'])
    for d in data
], columns=['timestamp', 'level', 'message'])
timestamp level message
0 1537190572023 INFO Disabled camera with QR scan on by 80801234 a...
1 1537190528619 INFO Employee number saved successfully.
json_normalize
But without the fancy arguments
json_normalize(data).pipe(
    lambda d: d[['timestamp']].join(
        d.filter(like='event.json')
    )
)
timestamp event.json.level event.json.message
0 1537190572023 INFO Disabled camera with QR scan on by 80801234 a...
1 1537190528619 INFO Employee number saved successfully.
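To get all the way to the CSV the question asks for, here is a minimal sketch that assumes pandas 1.0 or later (where json_normalize is available as pd.json_normalize) and uses a trimmed copy of the two events above. It flattens the events, keeps the three columns, and renames them to the short headings:

```python
import pandas as pd

# Trimmed copy of the two events from the question (unused fields dropped).
events = [
    {
        "timestamp": 1537190572023,
        "event": {"json": {
            "message": "Disabled camera with QR scan on by 80801234 at Area A\n",
            "level": "INFO",
        }},
    },
    {
        "timestamp": 1537190528619,
        "event": {"json": {
            "message": "Employee number saved successfully.",
            "level": "INFO",
        }},
    },
]

flat = pd.json_normalize(events)  # nested keys become dotted column names
df = flat[["timestamp", "event.json.level", "event.json.message"]].rename(
    columns={"event.json.level": "level", "event.json.message": "message"}
)

csv_text = df.to_csv(index=False)  # pass a path instead to write a file
print(csv_text)
```

With the real file, `events` would be `json.load(open('out1.json'))['events']` as in the answer above.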
