(Python) Reading specific keys from a JSON file

I have a question about how to make this code work.
I have several JSON files from which I must read information and store it in a pandas DataFrame for later export. The JSON files are deeply nested and go something like this:
'TESTS' -> 'MEASUREMENTS' -> SeveralTestNames (for each of those keys there is a key 'Value'). I need to get only the values and save them.
My thought was to get the keys with the test names and then, in a loop, use those names to index into the loaded JSON, but no matter what I try, it doesn't work.
import json

with open(file) as f:
    data = json.load(f)

date = data['INFO']['TIME'][0:10]
time = data['INFO']['TIME'][11:]
t = data['TESTS']['MEASUREMENTS']
type = [*t]
value = []
i = 0
for x in type:
    v = data['TESTS']['MEASUREMENTS'][type[i]]['RESULTS']
    value.append(v)
    i = i + 1
This just gives me 'TypeError: list indices must be integers or slices, not str', but when I remove the last bit with ['RESULTS'], it gives me the keys of the tests, and I need the values from within them.

It seems you're overcomplicating this a bit.
I'm reproducing your JSON from your comments as best I can, since parts of it are not valid JSON.
data = {
    'INFO': {
        'TIME': ' ',
        'VERSION': ' '
    },
    'TESTS': {
        'MEASUREMENTS': {
            'Test1': {
                'RESULTS': {
                    'VALUE': 147799000000
                }
            },
            'Test2': {
                'RESULTS': {
                    'VALUE': 147888278322
                }
            }
        }
    }
}
values = []

# this iterates over each dict within the MEASUREMENTS key;
# dict.values() returns a view that you can iterate over
# of just the values
for measurement in data['TESTS']['MEASUREMENTS'].values():
    values.append(measurement['RESULTS']['VALUE'])

print(values)
In your case it would be:
import json

with open(file) as f:
    data = json.load(f)

values = []
for measurement in data['TESTS']['MEASUREMENTS'].values():
    values.append(measurement['RESULTS']['VALUE'])

print(values)
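Since the end goal is a pandas export, here is a minimal sketch of loading the values into a DataFrame, assuming the structure reproduced above (the column names and output filename are illustrative, not from the original question):

import pandas as pd

# 'data' as loaded above; one row per test name
rows = [
    {'test': name, 'value': measurement['RESULTS']['VALUE']}
    for name, measurement in data['TESTS']['MEASUREMENTS'].items()
]
df = pd.DataFrame(rows)
df.to_csv('values.csv', index=False)  # or df.to_excel(...), etc.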

Related

Turn values from strings to integers in a JSON file (Python)

I'm trying to change values in a JSON file from strings to integers. My issue is that the keys are row numbers, so I can't refer to values by key name (the row numbers keep changing). The values that need changing are within the "sharesTraded" object. Below is my JSON file:
{
    "lastDate": {
        "30": "04/04/2022",
        "31": "04/04/2022",
        "40": "02/03/2022",
        "45": "02/01/2022"
    },
    "transactionType": {
        "30": "Automatic Sell",
        "31": "Automatic Sell",
        "40": "Automatic Sell",
        "45": "Sell"
    },
    "sharesTraded": {
        "30": "29,198",
        "31": "105,901",
        "40": "25,000",
        "45": "1,986"
    }
}
And here is my current code:
import json

data = json.load(open("AAPL22_trades.json"))
dataa = data['sharesTraded']
dataa1 = dataa.values()
data1 = [s.replace(',', '') for s in dataa1]
data1 = [int(i) for i in data1]
open("AAPL22_trades.json", "w").write(json.dumps(data1, indent=4))
However, I need the integer values to replace the string values; instead, my code replaces the entire JSON with just the list of integers. I imagine there is something at the end that I'm missing: the strings have been converted to integers, but writing them back into the JSON structure is what's missing.
You have a couple of problems. First, since you only convert the values to a list, you lose the information about which key is associated with each value. Second, you write that list back to the file, losing all of the other data too.
You could create a new dictionary with the modified values and assign it back to the original.
import json

data = json.load(open("AAPL22_trades.json"))
data['sharesTraded'] = {k: int(v.replace(',', ''))
                        for k, v in data['sharesTraded'].items()}
json.dump(data, open("AAPL22_trades.json", "w"), indent=4)
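As a quick sanity check that the values are now numeric (the sum is just an illustrative use):

total = sum(data['sharesTraded'].values())
print(total)  # 162085 for the sample above (29198 + 105901 + 25000 + 1986)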

Convert python list to JSON by appending a specific key value

I have a Python list like this:
vehicle = ['car', 'bus']
and I want to convert it to JSON formatted like below:
[ { "vehicle": "car" }, { "vehicle": "bus" } ]
I tried converting it to a dictionary too, but couldn't manage it.
Here you go:
import json
vehicle = ['car', 'bus']
res = json.dumps([{'vehicle': v} for v in vehicle])
print(res)
Output:
[{"vehicle": "car"}, {"vehicle": "bus"}]
Suggestion:
I would recommend this instead:
import json
vehicle = ['car', 'bus']
res = json.dumps([{'vehicle': vehicle}])
print(res)
Output:
[{"vehicle": ["car", "bus"]}]

How to access nested attribute without passing parent attribute in pyspark json

I am trying to access the inner attributes of the following json using pyspark:
[
    {
        "432": [
            {
                "atttr1": null,
                "atttr2": "7DG6",
                "id": 432,
                "score": 100
            }
        ]
    },
    {
        "238": [
            {
                "atttr1": null,
                "atttr2": "7SS8",
                "id": 432,
                "score": 100
            }
        ]
    }
]
In the output, I am looking for something like below, in the form of CSV:
atttr1,atttr2,id,score
null,"7DG6",432,100
null,"7SS8",238,100
I understand I can get these details as below, but I don't want to pass 432 or 238 in the lambda expression, since in a bigger json these keys will vary. I want to iterate over all available values.
print(inputDF.rdd.map(lambda x: (x['432'])).first())
print(inputDF.rdd.map(lambda x: (x['238'])).first())
I also tried registering a temp table with the name "test", but it gave an error saying element._id doesn't exist.
inputDF.registerTempTable("test")
srdd2 = spark.sql("select element._id from test limit 1")
Any help will be highly appreciated. I am using Spark 2.4.
Without using pyspark features, you can do it like this:
import json

data = json.loads(json_str)  # or however you're getting the data
columns = 'atttr1 atttr2 id score'.split()
print(','.join(columns))  # headers
for item in data:
    for obj in list(item.values())[0]:  # each list holds only one object
        print(','.join(str(obj[col]) for col in columns))
Output:
atttr1,atttr2,id,score
None,7DG6,432,100
None,7SS8,432,100
Or:
for item in data:
    obj = list(item.values())[0][0]  # the object is the one and only item in the list
    print(','.join(str(obj[col]) for col in columns))
FYI, you can store those rows in a variable, or write them out to CSV instead of (or in addition to) printing them.
And if you're just looking to dump that to csv, see this answer.
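For completeness, here is a minimal sketch of writing the same rows with the csv module instead of printing them (the output filename is illustrative):

import csv
import json

data = json.loads(json_str)  # as above
columns = 'atttr1 atttr2 id score'.split()

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(columns)  # header row
    for item in data:
        for obj in list(item.values())[0]:
            writer.writerow([obj[col] for col in columns])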

How to convert csv to json with multi-level nesting using pandas

I've tried to follow a bunch of answers I've seen on SO, but I'm really stuck here. I'm trying to convert a CSV to JSON.
The JSON schema has multiple levels of nesting and some of the values in the CSV will be shared.
Here's a link to one record in the CSV.
Think of this sample as two different parties attached to one document.
The fields on the document (document_source_id, document_amount, record_date, source_url, document_file_url, document_type__title, apn, situs_county_id, state_code) should not duplicate.
While the fields of each entity are unique.
I've tried to nest these using a complex groupby statement, but am stuck getting the data into my schema.
Here's what I've tried. It doesn't contain all fields because I'm having a difficult time understanding what it all means.
j = (df.groupby(['state_code',
                 'record_date',
                 'situs_county_id',
                 'document_type__title',
                 'document_file_url',
                 'document_amount',
                 'source_url'], as_index=False)
       .apply(lambda x: x[['source_url']].to_dict('r'))
       .reset_index()
       .rename(columns={0: 'metadata', 1: 'parcels'})
       .to_json(orient='records'))
Here's how the sample CSV should look after conversion:
{
    "metadata": {
        "source_url": "https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentDetail?doc_id=2019012901225004",
        "document_file_url": "https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentImageView?doc_id=2019012901225004"
    },
    "state_code": "NY",
    "nested_data": {
        "parcels": [
            {
                "apn": "3972-61",
                "situs_county_id": "36005"
            }
        ],
        "participants": [
            {
                "entity": {
                    "name": "5 AIF WILLOW, LLC",
                    "situs_street": "19800 MACARTHUR BLVD",
                    "situs_city": "IRVINE",
                    "situs_unit": "SUITE 1150",
                    "state_code": "CA",
                    "situs_zip": "92612"
                },
                "participation_type": "Grantee"
            },
            {
                "entity": {
                    "name": "5 ARCH INCOME FUND 2, LLC",
                    "situs_street": "19800 MACARTHUR BLVD",
                    "situs_city": "IRVINE",
                    "situs_unit": "SUITE 1150",
                    "state_code": "CA",
                    "situs_zip": "92612"
                },
                "participation_type": "Grantor"
            }
        ]
    },
    "record_date": "01/31/2019",
    "situs_county_id": "36005",
    "document_source_id": "2019012901225004",
    "document_type__title": "ASSIGNMENT, MORTGAGE"
}
You might need to use the json_normalize function from pandas.io.json:
from pandas.io.json import json_normalize
import csv

li = []
with open('filename.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        li.append(row)
df = json_normalize(li)
Here, we are creating a list of dictionaries from the csv file and building a dataframe with the json_normalize function.
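Side note: for a flat CSV like this, df = pd.read_csv('filename.csv') produces a similar DataFrame in one line (with inferred dtypes rather than all-string values), and in pandas 1.0+ json_normalize is available directly as pd.json_normalize, so the pandas.io.json import is no longer needed.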
Below is one way to export your data:
# all columns used in groupby()
grouped_cols = ['state_code', 'record_date', 'situs_county_id', 'document_source_id',
                'document_type__title', 'source_url', 'document_file_url']

# adjust some column names to map to those in the 'entity' node of the desired JSON
situs_mapping = {
    'street_number_street_name': 'situs_street',
    'city_name': 'situs_city',
    'unit': 'situs_unit',
    'state_code': 'state_code',
    'zipcode_full': 'situs_zip'
}

# define columns used for the 'entity' node
entity_cols = ['name', *situs_mapping.values()]
# for Python 2, use this instead:
# entity_cols = ['name'] + list(situs_mapping.values())

# specify output fields
output_cols = ['metadata', 'state_code', 'nested_data', 'record_date',
               'situs_county_id', 'document_source_id', 'document_type__title']

# define a function to build nested_data
def get_nested_data(d):
    return {
        'parcels': d[['apn', 'situs_county_id']].drop_duplicates().to_dict('r'),
        'participants': d[['entity', 'participation_type']].to_dict('r')
    }

j = (df.rename(columns=situs_mapping)
       .assign(entity=lambda x: x[entity_cols].to_dict('r'))
       .groupby(grouped_cols)
       .apply(get_nested_data)
       .reset_index()
       .rename(columns={0: 'nested_data'})
       .assign(metadata=lambda x: x[['source_url', 'document_file_url']].to_dict('r'))[output_cols]
       .to_json(orient="records")
)
print(j)
Note: if participants contains duplicates and you must run drop_duplicates() as we do on parcels, then the assign(entity) step can be moved into the 'participants' entry of get_nested_data():
    'participants': d[['participation_type', *entity_cols]] \
        .drop_duplicates() \
        .assign(entity=lambda x: x[entity_cols].to_dict('r')) \
        .loc[:, ['entity', 'participation_type']] \
        .to_dict('r')
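One caveat: these snippets use to_dict('r'), an old single-letter shorthand for to_dict('records'); newer pandas versions deprecate and eventually remove the short aliases, so spell out orient='records' there if you're on a recent release.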

Convert python nested JSON-like data to dataframe

My records look like this and I need to write them to a csv file:
my_data={"data":[{"id":"xyz","type":"book","attributes":{"doc_type":"article","action":"cut"}}]}
It looks like json, but the next record starts with "data" and not "data1", which forces me to read each record separately. Then I convert each one to a dict using eval() and iterate through keys and values along a certain path to get to the values I need. Then I generate a list of keys and values based on the keys I need, and pd.DataFrame() converts that list into a dataframe, which I know how to convert to csv. My code that works is below, but I am sure there are better ways to do this; mine scales poorly. Thx.
counter = 1
k = []
v = []
res = []
m = 0
for line in f2:
    jline = eval(line)
    counter += 1
    for items in jline:
        k.append(jline[u'data'][0].keys())
        v.append(jline[u'data'][0].values())
print 'keys are:', k
i = 0
j = 0
while i < 3:
    while j < 3:
        if k[i][j] == u'id':
            res.append(v[i][j])
        j += 1
    i += 1
# res is my result set
del k[:]
del v[:]
Changing my_data to be:
my_data = [{"id": "xyz", "type": "book", "attributes": {"doc_type": "article", "action": "cut"}},   # Data One
           {"id": "xyz2", "type": "book", "attributes": {"doc_type": "article", "action": "cut"}},  # Data Two
           {"id": "xyz3", "type": "book", "attributes": {"doc_type": "article", "action": "cut"}}]  # Data Three
You can dump this directly into a dataframe like so:
mydf = pd.DataFrame(my_data)
It's not clear what your data path would be, but if you are looking for specific combinations of id, type, etc., you could search explicitly:
def find_my_way(data, pattern):
    # pattern = {'id': 'someid', 'type': 'sometype', ...}
    res = []
    for row in data:
        if row.get('id') == pattern.get('id'):
            res.append(row)
    return res  # return the accumulated matches, not just the last row

mydf = pd.DataFrame(find_my_way(my_data, pattern))
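For example, with the my_data list above (the pattern value is illustrative):

pattern = {'id': 'xyz2'}
matches = find_my_way(my_data, pattern)
# matches == [{'id': 'xyz2', 'type': 'book', 'attributes': {...}}]
mydf = pd.DataFrame(matches)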
EDIT:
Without going into how the api works, in pseudo-code you'll want to do something like the following:
my_objects = []
calls = 0
while calls < maximum:
    my_data = call_the_api(params)
    data = my_data.get('data')
    if not data:
        calls += 1
        continue
    # Api calls to single objects usually return a dictionary; calls to grouped
    # objects return lists. This handles both cases.
    if isinstance(data, list):
        my_objects = [*data, *my_objects]
    elif isinstance(data, dict):  # isinstance needs a type here, not {}
        my_objects = [{**data}, *my_objects]

# This will unpack the data responses into a list that you can then load into a
# DataFrame with the attributes from the api as the columns
df = pd.DataFrame(my_objects)
Assuming your data from the api looks like:
"""
{
    "links": {},
    "meta": {},
    "data": {
        "type": "FactivaOrganizationsProfile",
        "id": "Goog",
        "attributes": {
            "key_executives": {
                "source_provider": [
                    {
                        "code": "FACSET",
                        "descriptor": "FactSet Research Systems Inc.",
                        "primary": true
                    }
                ]
            }
        },
        "relationships": {
            "people": {
                "data": {
                    "type": "people",
                    "id": "39961704"
                }
            }
        }
    },
    "included": {}
}
"""
per the documentation, which is why I'm using my_data.get('data').
That should get all of the (unfiltered) data into a DataFrame.
Deferring the DataFrame construction to the very last step is also a bit more memory friendly.
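If you want the nested attributes flattened into columns rather than left as dict-valued cells, a minimal sketch using pandas' json_normalize (the sep choice is just a preference):

import pandas as pd

# my_objects as accumulated above; nested keys become columns such as
# 'attributes_key_executives_source_provider'
df = pd.json_normalize(my_objects, sep='_')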
