How to retrieve data from a JSON file using Python

I'm making API requests to get a JSON file to be parsed and converted into data frames. The JSON file may sometimes have empty fields. I am posting two possible cases: the 1st JSON file has the field I am looking for, and the 2nd JSON file has that field empty.
1st JSON file:
print(resp2)
{
    "entityId": "proc_1234",
    "displayName": "oracle12",
    "firstSeenTms": 1639034760000,
    "lastSeenTms": 1650386100000,
    "properties": {
        "detectedName": "oracle.sysman.gcagent.tmmain.TMMain",
        "bitness": "64",
        "jvmVendor": "IBM",
        "metadata": [
            {
                "key": "COMMAND_LINE_ARGS",
                "value": "/usr/local/oracle/oem/agent12c/agent_13.3.0.0.0"
            },
            {
                "key": "EXE_NAME",
                "value": "java"
            },
            {
                "key": "EXE_PATH",
                "value": "/usr/local/oracle/oem/agent*c/agent_*/oracle_common/jdk/bin/java"
            },
            {
                "key": "JAVA_MAIN_CLASS",
                "value": "oracle.sysman.gcagent.tmmain.TMMain"
            },
            {
                "key": "EXE_PATH",
                "value": "/usr/local/oracle/oem/agent12c/agent_13.3.0.0.0/oracle_common/jdk/bin/java"
            }
        ]
    }
}
2nd JSON file:
print(resp2)
{
    "entityId": "PROCESS_GROUP_INSTANCE-FB8C65551916D57D",
    "displayName": "Windows System",
    "firstSeenTms": 1619147697131,
    "lastSeenTms": 1653404640000,
    "properties": {
        "detectedName": "Windows System",
        "bitness": "32",
        "metadata": [],
        "awsNameTag": "Windows System",
        "softwareTechnologies": [
            {
                "type": "WINDOWS_SYSTEM"
            }
        ],
        "processType": "WINDOWS_SYSTEM"
    }
}
As you can see, "metadata": [] is empty.
I need to extract entityId, detectedName and, if metadata has data, EXE_NAME and EXE_PATH. If the metadata section is empty, I still need to get the entityId and detectedName from this JSON file and form a data frame.
So, I have done this:
# retrieve the detectedName value from the json
det_name = list(resp2.get('properties', 'detectedName').values())[0]
# retrieve EXE_NAME, EXE_PATH and entityId from the json. This part works when the metadata section has data
Procdf = (pd.json_normalize(resp2, record_path=['properties', 'metadata'], meta=['entityId'])
          .drop_duplicates(subset=['key'])
          .query("key in ['EXE_NAME', 'EXE_PATH']")
          .assign(detectedName=det_name)
          .pivot('entityId', 'key', 'value')
          .reset_index())
# Add detectedName to the Procdf data frame
Procdf["detectedName"] = det_name
The code snippet above works when metadata has data. When metadata is empty ([]), I still need to create a data frame with entityId and detectedName, and with EXE_NAME and EXE_PATH empty.
How can I do this? Right now, when metadata is [], I get the error name 'key' is not defined and that JSON is skipped.

Why not create a new dict based on whether there's a value for metadata or not?
Here's an example (this should work with both response types):
import pandas as pd

def find_value(response: dict, key: str) -> str:
    result = []
    try:
        for x in response['properties']['metadata']:
            if x['key'] == key:
                result.append(x['value'])
    except KeyError:
        return ""
    return result[0] if result else ""

def get_values(response: dict) -> dict:
    return {
        "entityId": response['entityId'],
        "displayName": response['displayName'],
        "EXE_NAME": find_value(response, 'EXE_NAME'),
        "EXE_PATH": find_value(response, 'EXE_PATH'),
    }

sample_response = {
    "entityId": "PROCESS_GROUP_INSTANCE-FB8C65551916D57D",
    "displayName": "Windows System",
    "firstSeenTms": 1619147697131,
    "lastSeenTms": 1653404640000,
    "properties": {
        "detectedName": "Windows System",
        "bitness": "32",
        "awsNameTag": "Windows System",
        "metadata": [],
        "softwareTechnologies": [
            {
                "type": "WINDOWS_SYSTEM"
            }
        ],
        "processType": "WINDOWS_SYSTEM"
    }
}

print(pd.json_normalize(get_values(sample_response)))
Sample output for metadata being empty:
entityId displayName EXE_NAME EXE_PATH
0 PROCESS_GROUP_INSTANCE-FB8C65551916D57D Windows System
And one when metadata carries, well, data:
entityId ... EXE_PATH
0 proc_1234 ... /usr/local/oracle/oem/agent*c/agent_*/oracle_c...
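Note that the question asked for detectedName, while the snippet above returns displayName. A variant along the same lines that pulls detectedName from the nested properties dict, using .get() defaults so an empty or partial response still produces a row (a sketch, not the exact code from the answer):

```python
import pandas as pd

def find_value(response: dict, key: str) -> str:
    # Return the first metadata value matching `key`, or "" if metadata is absent/empty.
    for item in response.get('properties', {}).get('metadata', []):
        if item.get('key') == key:
            return item.get('value', '')
    return ''

def get_values(response: dict) -> dict:
    # detectedName lives under "properties", per the question's requirement.
    return {
        'entityId': response.get('entityId', ''),
        'detectedName': response.get('properties', {}).get('detectedName', ''),
        'EXE_NAME': find_value(response, 'EXE_NAME'),
        'EXE_PATH': find_value(response, 'EXE_PATH'),
    }

# Trimmed-down second response from the question (empty metadata).
resp2 = {
    'entityId': 'PROCESS_GROUP_INSTANCE-FB8C65551916D57D',
    'properties': {'detectedName': 'Windows System', 'metadata': []},
}
df = pd.DataFrame([get_values(resp2)])
print(df)
```

With empty metadata this yields one row whose EXE_NAME and EXE_PATH columns are empty strings, which matches what the question asked for.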

Related

Parsing Json extracting key value python

Hi guys, I am trying to extract the same key, but with different values, over a long JSON response, and I keep getting:
KeyError: 'id'
Not sure what I am doing wrong; I am accessing it using a REST API.
This is what I have as a script:
from requests.auth import HTTPBasicAuth
import requests
import json
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
def countries():
    data = requests.get("https://10.24.21.4:8543/api/netim/v1/countries/", verify=False, auth=HTTPBasicAuth("admin", "admin"))
    rep = data.json()
    for cid in rep:
        cid = rep["id"]
        print(cid)

countries()
The response is rather long, but it is like this, you will see "id", and i need the respective values :
{
"items": [
{
"name": "Afghanistan",
"displayName": "Afghanistan",
"meta": {
"type": "COUNTRY"
},
"id": "AF",
"links": {
"self": {
"path": "/api/netim/v1/countries/AF"
}
}
},
{
"name": "Albania",
"displayName": "Albania",
"meta": {
"type": "COUNTRY"
},
"id": "AL",
"links": {
"self": {
"path": "/api/netim/v1/countries/AL"
}
}
},
{
"name": "Algeria",
"displayName": "Algeria",
"meta": {
"type": "COUNTRY"
},
"id": "DZ",
"links": {
"self": {
"path": "/api/netim/v1/countries/DZ"
}
}
},
{
"name": "American Samoa",
"displayName": "American Samoa",
"meta": {
"type
I rewrote your function a little; you should now be able to get all the IDs from the JSON response. I suggest you look into the basics of dictionaries and lists in Python.
from requests.auth import HTTPBasicAuth
import requests
import json
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
def countries():
    data = requests.get("https://10.24.21.4:8543/api/netim/v1/countries/", verify=False, auth=HTTPBasicAuth("admin", "admin"))
    rep = data.json()
    return [elem.get("id", "") for elem in rep['items']]

countries()
Update:
If you wish to extract the value of the "path" key along with the value of the "id" key, you will need a list of dictionaries, where every dictionary corresponds to a single record from the JSON.
The modified function is as follows:
def countries():
    data = requests.get("https://10.24.21.4:8543/api/netim/v1/countries/", verify=False, auth=HTTPBasicAuth("admin", "admin"))
    rep = data.json()
    return [{"id": elem.get("id", ""), "path": elem["links"]["self"]["path"]} for elem in rep['items']]
get() returns a default value when the key is absent from the dictionary, so neither function (the new one nor the previous one) will fail if the JSON response omits the id key.
If you are sure that the value of links will always be available, you can use the above function directly; otherwise you will have to write a custom function that parses the links key and returns an empty string when it is empty in the JSON.
The response is not an array, it's a dictionary.
You want the "items" element of that dictionary:
for cid in rep['items']:
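To see both variants work without hitting the (internal) endpoint from the question, here is the same logic run against a small local stand-in for data.json(); the country entries are trimmed from the question's sample response:

```python
# Local stand-in for data.json() from the question's endpoint.
rep = {
    "items": [
        {"name": "Afghanistan", "id": "AF",
         "links": {"self": {"path": "/api/netim/v1/countries/AF"}}},
        {"name": "Albania", "id": "AL",
         "links": {"self": {"path": "/api/netim/v1/countries/AL"}}},
    ]
}

# List of ids, as in the rewritten countries() function.
ids = [elem.get("id", "") for elem in rep["items"]]

# id/path pairs, as in the updated version.
pairs = [{"id": elem.get("id", ""), "path": elem["links"]["self"]["path"]}
         for elem in rep["items"]]

print(ids)
print(pairs)
```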

CSV file to JSON for nested array generic template using python (for csv to mongodb insert)

I want to create a JSON file from a CSV file using a generic Python script.
I found the hone package on GitHub, but some of the functionality is missing from that code.
csv to json
I want to write a generic CSV-to-JSON template.
[
{
"birth": {
"day": "7",
"month": "May",
"year": "1985"
},
"name": "Bob",
"reference": "TRUE",
"reference name": "Smith"
}
]
Only the above type of JSON is handled.
[
{
"Type": "AwsEc2Instance",
"Id": "i-cafebabe",
"Partition": "aws",
"Region": "us-west-2",
"Tags": {
"billingCode": "Lotus-1-2-3",
"needsPatching": "true"
},
"Details": {
"AwsEc2Instance": {
"Type": "i3.xlarge",
"ImageId": "ami-abcd1234",
"IpV4Addresses": [ "54.194.252.215", "192.168.1.88" ],
"IpV6Addresses": [ "2001:db812341a2b::123" ],
"KeyName": "my_keypair",
"VpcId": "vpc-11112222",
"SubnetId": "subnet-56f5f633",
"LaunchedAt": "2018-05-08T16:46:19.000Z"
}
}
}
]
I want to handle nested arrays [] and objects {}.
I have done something like this before, and the code below can be modified, as I have not seen your dataset.
import json
import pandas as pd

dataframe = pd.read_excel('dataframefilepath', encoding='utf-8', header=0)

'''Adding to list to finally save it as JSON'''
df = []
for (columnName, columnData) in dataframe.iteritems():
    if dataframe.columns.get_loc(columnName) > 0:
        for indata, rwdata in dataframe.iterrows():
            for insav, rwsave in df_to_Save.iterrows():
                if rwdata.Selected_Prediction == rwsave.Selected_Prediction:
                    df_to_Save.loc[insav, 'Value_to_Save'] = rwdata[dataframe.columns.get_loc(columnName)]
        df.append(df_to_Save.set_index('Selected_Prediction').T.to_dict('record'))

'''Saving in JSON format'''
path_to_save = '\\your path'
with open(path_to_save, 'w') as json_file:
    json.dump(df, json_file)
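For a more generic template, one common convention is to encode nesting in the CSV header with dotted column names (e.g. birth.day) and split on the dots when building each record. This sketch assumes that naming convention, which your actual CSV may or may not follow:

```python
import csv
import io
import json

def set_path(obj: dict, dotted: str, value: str) -> None:
    # "birth.day" -> obj["birth"]["day"] = value
    keys = dotted.split(".")
    for k in keys[:-1]:
        obj = obj.setdefault(k, {})
    obj[keys[-1]] = value

# Hypothetical CSV using dotted headers for the nested fields.
csv_text = "name,birth.day,birth.month,birth.year\nBob,7,May,1985\n"

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    record = {}
    for column, value in row.items():
        set_path(record, column, value)
    rows.append(record)

print(json.dumps(rows, indent=2))
```

The same result list can then be passed to pymongo's insert_many for the MongoDB load mentioned in the title.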

Python 3.4 and AVRO: Unable to convert the simple message in AVRO, based on schema?

I am looking to create an Avro message for each record (CSV) with a metadata header, i.e. with a nested schema.
I am using Python 3.4 and have the required modules, i.e. avro-python3, downloaded.
I have the record data in the form of a CSV with its header.
Basically, I have the code for creating the required message and metadata header.
My AVSC file (Sample only):
Schema: {"name": "person","type": "record","fields": [{"name": "address","type": {"type" : "record","name" : "AddressUSRecord","fields" : [{"name": "streetaddress", "type": "string"},{"name": "city", "type":"string"},{"name": "pin", "type":"long"}]}}]}
My record is also created. (Showing the pretty format of record).
For pin: 123.456 (float value)
However, when I try to convert the above record into Avro format, based on the AVSC file mentioned, it fails saying "The datum is not an example of the schema".
Code:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
import csv
import json
# header class to give header data. Just simple assignment
from header import Header
# body class to give body, just simple assignment for now.
from pnlData import PnlData
import os
import sys
if __name__ == "__main__":
    schemaFile = "/path/tardisPnl.avsc"
    outFile = "/path/SampleOutLanding.avro"
    schema = avro.schema.Parse(open(schemaFile, "r").read())
    a = Header()
    a.generateMessageId()  # Simple text generated for now
    a.generateTimestamp()  # Simple number generated for now
    # print(a.__dict__)
    b = PnlData()
    b.generatePnlData()  # Simple value assigned as seen in example
    # print(b.__dict__)
    landingMessage = {}
    landingMessage["header"] = a.__dict__
    landingMessage["pnlData"] = b.__dict__
    # print(json.dumps(landingMessage))
    writer = DataFileWriter(open(outFile, "wb"), DatumWriter(), schema)
    try:
        writer.append(landingMessage)
    except Exception as e:
        print('Error: %s ' % (e))
    writer.close()
I have tried converting the above Avro schema to a JSON schema, then created sample JSON data based on the schema (online tools), to check whether my data object is correct. In fact, I created my record based on sample data generated from the schema.
However, when I try to use them and run the code, it always fails.
I am not very familiar with Avro, so I need to understand what I am missing here. Why are this simple data and schema not working?
I first tried the following simple record and schema (same online sample tool), and it works.
Simple avsc:
{"name": "person","type": "record","fields": [{"name": "firstname", "type": "string"},{"name": "lastname", "type": "string"},{"name": "address","type": {"type" : "record","name" : "AddressUSRecord","fields" : [{"name": "streetaddress", "type": "string"},{"name": "city", "type":"string"}]}}]}
Simple Data (Again pretty printed):
{
"firstname": "ABCDEFGHIJKLMN",
"lastname": "ABCDEFGHIJKLMNOPQRSTUVWXYZAB",
"address": {
"streetaddress": "ABCDEFGHIJKLMN",
"city": "ABCDEFGHIJKLMNO"
}
}
If I create the above dict and pass it (same code, no changes) with the above AVSC file, it works fine.
The only difference between my AVSC and the (simple) sample AVSC is one extra nested attribute, etc. I am unable to find the reason why my slightly more complex data does not work.
The fastavro library has a validate function that can help with this.
Using the data and schema you provided, it would look like this:
schema = {
    "type": "record",
    "name": "SomeName",
    "doc": "This schema contains the metadata fields wrapped in a header field which follows the official SA MessageHeader schema.",
    "fields": [
        {
            "name": "header",
            "type": {
                "type": "record",
                "name": "MessageHeader",
                "fields": [
                    {"name": "messageId", "type": "string"},
                    {"name": "businessId", "type": "string"},
                    {"name": "batchId", "type": "string"},
                    {"name": "sourceSystem", "type": "string"},
                    {"name": "secondarySourceSystem", "type": "string"},
                    {"name": "sourceSystemCreationTimestamp", "type": "long"},
                    {"name": "sentBy", "type": "string"},
                    {"name": "sentTo", "type": "string"},
                    {"name": "messageType", "type": "string"},
                    {"name": "schemaVersion", "type": "string"},
                    {"name": "processing", "type": "string"},
                    {"name": "sourceLocation", "type": "string"}
                ]
            }
        },
        {
            "name": "pnlData",
            "type": {
                "type": "record",
                "name": "pnlDataDetails",
                "fields": [
                    {"name": "granularity", "type": "string"},
                    {"name": "pnl_type", "type": "string"},
                    {"name": "pnl_subtype", "type": "string"},
                    {"name": "date", "type": "int"},
                    {"name": "book", "type": "string"},
                    {"name": "currency", "type": "string"},
                    {"name": "category", "type": "string"},
                    {"name": "subcategory", "type": "string"},
                    {"name": "riskcategory", "type": "string"},
                    {"name": "market_name", "type": "string"},
                    {"name": "risk_order", "type": "string"},
                    {"name": "tenor", "type": "string"},
                    {"name": "product", "type": "string"},
                    {"name": "trade_id", "type": "string"},
                    {"name": "pnl_local", "type": "long"},
                    {"name": "pnl_cde", "type": "long"},
                    {"name": "pnl_status", "type": "string"}
                ]
            }
        }
    ]
}

record = {
    "pnlData": {
        "pnl_cde": 997.8100000024,
        "pnl_status": "locked",
        "granularity": "detailed view",
        "book": "8271",
        "date": 20181130,
        "subcategory": "None",
        "pnl_local": 997.7899999917,
        "pnl_subtype": "Regular",
        "tenor": "None",
        "pnl_type": "Daily",
        "risk_order": "None",
        "market_name": "None",
        "trade_id": "None",
        "category": "None",
        "product": "None",
        "currency": "cad",
        "riskcategory": "None"
    },
    "header": {
        "sentBy": "SYSTEM",
        "businessId": "T1",
        "messageId": "pnl_0001",
        "processing": "RealTime",
        "messageType": "None",
        "sourceLocation": "None",
        "sentTo": "SA",
        "secondarySourceSystem": "None",
        "schemaVersion": "1.6T",
        "sourceSystem": "SYSTEM",
        "sourceSystemCreationTimestamp": 1236472051,
        "batchId": "None"
    }
}
import fastavro
fastavro.validation.validate(record, schema)
The error I get is the following: "SomeName.pnlData.pnlDataDetails.pnl_local is <997.7899999917> of type <class 'float'> expected long"

flask-sqlalchemy dynamically construct query

I have an input json like the following:
{
"page": 2,
"limit": 10,
"order": [
{
"field": "id",
"type": "asc"
},
{
"field": "email",
"type": "desc"
},
...
{
"field": "fieldN",
"type": "desc"
}
],
"filter": [
{
"field": "company_id",
"type": "=",
"value": 1
},
...
{
"field": "counter",
"type": ">",
"value": 5
}
]
}
How do I dynamically construct a sqlalchemy query based on my input JSON if I don't know the number of fields?
Something like this:
User.query.filter(filter.field, filter.type, filter.value).filter(filter.field1, filter.type1, filter.value1)...filter(filter.fieldN, filter.typeN, filter.valueN).order_by("id", "ask").order_by("email", "desc").order_by("x1", "y1")....order_by("fieldN"...."desc").all()
Convert the JSON into a dictionary and retrieve the values.
If your JSON is in a file (say, data.json), the json library will satisfy your needs:
import json

with open("data.json") as f:
    data = json.load(f)

User.query.filter_by(company_id=1).order_by(data["id"], data["ask"]).order_by(data["email"], data["desc"]).all()
If your json is a string (say, json_data):
import json
data = json.loads(json_data)
User.query.filter_by(company_id=1).order_by(data["id"], data["ask"]).order_by(data["email"], data["desc"]).all()
If your json is a request from the python requests library i.e. res = requests.get(...), then res.json() will return a dictionary:
data = res.json()
User.query.filter_by(company_id=1).order_by(data["id"], data["ask"]).order_by(data["email"], data["desc"]).all()
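Note that the snippets above hard-code the filter; to build the query dynamically from the JSON payload itself, a common pattern is to map each filter's "type" string to a comparison operator and resolve field names with getattr on the model. A sketch using plain SQLAlchemy (a hypothetical User model standing in for the real one; with Flask-SQLAlchemy the same .where()/.order_by() chaining applies):

```python
import operator
from sqlalchemy import Column, Integer, String, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class User(Base):
    # Hypothetical model covering the fields used in the question's payload.
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    email = Column(String)
    company_id = Column(Integer)
    counter = Column(Integer)

# Map the payload's filter "type" strings onto comparison operators.
OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}

def build_query(payload: dict):
    stmt = select(User)
    for f in payload.get("filter", []):
        column = getattr(User, f["field"])  # e.g. User.company_id
        stmt = stmt.where(OPS[f["type"]](column, f["value"]))
    for o in payload.get("order", []):
        column = getattr(User, o["field"])
        stmt = stmt.order_by(column.desc() if o["type"] == "desc" else column.asc())
    limit = payload.get("limit", 10)
    return stmt.limit(limit).offset((payload.get("page", 1) - 1) * limit)

payload = {
    "page": 2, "limit": 10,
    "order": [{"field": "id", "type": "asc"}, {"field": "email", "type": "desc"}],
    "filter": [{"field": "company_id", "type": "=", "value": 1},
               {"field": "counter", "type": ">", "value": 5}],
}
print(build_query(payload))
```

Validating the incoming field names against the model's columns before calling getattr would be wise if the payload comes from untrusted clients.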

Accessing nested json objects using python

I am trying to interact with an API and running into issues accessing nested objects. Below is sample json output that I am working with.
{
"results": [
{
"task_id": "22774853-2b2c-49f4-b044-2d053141b635",
"params": {
"type": "host",
"target": "54.243.80.16",
"source": "malware_analysis"
},
"v": "2.0.2",
"status": "success",
"time": 227,
"data": {
"details": {
"as_owner": "Amazon.com, Inc.",
"asn": "14618",
"country": "US",
"detected_urls": [],
"resolutions": [
{
"hostname": "bumbleride.com",
"last_resolved": "2016-09-15 00:00:00"
},
{
"hostname": "chilitechnology.com",
"last_resolved": "2016-09-16 00:00:00"
}
],
"response_code": 1,
"verbose_msg": "IP address in dataset"
},
"match": true
}
}
]
}
The deepest I am able to access is the data portion, which returns too much. Ideally I am just trying to access as_owner, asn, country, detected_urls and resolutions.
When I try to access details / response_code etc., I get a KeyError. My nested JSON goes deeper than in other questions mentioned here, and I have tried that logic.
Below is my current code snippet; any help is appreciated!
import requests
import json

headers = {
    'Content-Type': 'application/json',
}
params = (
    ('wait', 'true'),
)
data = '{"target":{"one":{"type": "ip","target": "54.243.80.16", "sources": ["xxx","xxxxx"]}}}'
r = requests.post('https://fakewebsite:8000/api/services/intel/lookup/jobs', headers=headers, params=params, data=data, auth=('apikey', ''))
parsed_json = json.loads(r.text)
# results = parsed_json["results"]
for item in parsed_json["results"]:
    print(item['data'])
You just need to index correctly into the converted JSON. Then you can easily loop over a list of the keys you want to fetch, since they are all in the "details" dictionary.
import json
raw = '''\
{
"results": [
{
"task_id": "22774853-2b2c-49f4-b044-2d053141b635",
"params": {
"type": "host",
"target": "54.243.80.16",
"source": "malware_analysis"
},
"v": "2.0.2",
"status": "success",
"time": 227,
"data": {
"details": {
"as_owner": "Amazon.com, Inc.",
"asn": "14618",
"country": "US",
"detected_urls": [],
"resolutions": [
{
"hostname": "bumbleride.com",
"last_resolved": "2016-09-15 00:00:00"
},
{
"hostname": "chilitechnology.com",
"last_resolved": "2016-09-16 00:00:00"
}
],
"response_code": 1,
"verbose_msg": "IP address in dataset"
},
"match": true
}
}
]
}
'''
parsed_json = json.loads(raw)
wanted = ['as_owner', 'asn', 'country', 'detected_urls', 'resolutions']
for item in parsed_json["results"]:
    details = item['data']['details']
    for key in wanted:
        print(key, ':', json.dumps(details[key], indent=4))
    # Put a blank line at the end of the details for each item
    print()
output
as_owner : "Amazon.com, Inc."
asn : "14618"
country : "US"
detected_urls : []
resolutions : [
{
"hostname": "bumbleride.com",
"last_resolved": "2016-09-15 00:00:00"
},
{
"hostname": "chilitechnology.com",
"last_resolved": "2016-09-16 00:00:00"
}
]
BTW, when you fetch JSON data using requests, there's no need to use json.loads: you can access the converted JSON using the .json() method of the returned response object instead of its .text attribute.
Here's a more robust version of the main loop of the above code. It simply ignores any missing keys. I didn't post this code earlier because the extra if tests make it slightly less efficient, and I didn't know that keys could be missing.
for item in parsed_json["results"]:
    if 'data' not in item:
        continue
    data = item['data']
    if 'details' not in data:
        continue
    details = data['details']
    for key in wanted:
        if key in details:
            print(key, ':', json.dumps(details[key], indent=4))
    # Put a blank line at the end of the details for each item
    print()
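An alternative to the explicit in checks is to chain dict.get() with empty-dict defaults, which never raises KeyError no matter which level of nesting is missing:

```python
import json

# One entry shaped like the API's "results" list (trimmed for illustration).
item = {"data": {"details": {"asn": "14618", "country": "US"}}}
wanted = ["as_owner", "asn", "country"]

# Chained .get() with {} defaults: no KeyError regardless of which level is absent.
details = item.get("data", {}).get("details", {})
found = {key: details[key] for key in wanted if key in details}
print(json.dumps(found))
```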
