Turn values from string to integers in JSON file python - python

I'm trying to change values in a JSON file from strings to integers, my issue is that the keys are row numbers so I can't call by key name (as they will change consistently). The values that need changing are within the "sharesTraded" object. Below is my JSON file:
{
"lastDate": {
"30": "04/04/2022",
"31": "04/04/2022",
"40": "02/03/2022",
"45": "02/01/2022"
},
"transactionType": {
"30": "Automatic Sell",
"31": "Automatic Sell",
"40": "Automatic Sell",
"45": "Sell"
},
"sharesTraded": {
"30": "29,198",
"31": "105,901",
"40": "25,000",
"45": "1,986"
}
}
And here is my current code:
import json
data = json.load(open("AAPL22_trades.json"))
dataa = data['sharesTraded']
dataa1 = dataa.values()
data1 = [s.replace(',', '') for s in dataa1]
data1 = [int(i) for i in data1]
open("AAPL22_trades.json", "w").write(
json.dumps(data1, indent=4))
However, I need the integer values to replace the string values. Instead, my code just replaces the entire JSON with the integers. I imagine there is something extra at the end that I'm missing because the strings have been changed to integers but applying it to the JSON is missing.

You have a couple of problems. First, since you only convert the values to a list, you loose the information about which key is associated with the values. Second, you write that list back to the file, loosing all of the other data too.
You could create a new dictionary with the modified values and assign that back to the original.
import json
data = json.load(open("AAPL22_trades.json"))
data['sharesTraded'] = {k:int(v.replace(',', ''))
for k,v in data['sharesTraded'].items()}
json.dump(data, open("AAPL22_trades.json", "w"), indent=4)

Related

Create JSON from specific lines of data

I have this test data that I am trying to use to create a JSON for just select items. I have the items listed and would just like to output a JSON list with select item.
What I have:
import json
# Data to be written
dictionary ={
"id": "04",
"name": "sunil",
"department": "HR"
}
# Serializing json
x= json.dumps(dictionary,indent=0)
# JSON String
y = json.loads(x)
# Goal is to print:
{
"id": "04",
"name": "sunil"
}
If you don't need to save the department key you can use this:
del y['department']
Then your y variable will print what you wanted:
{"id": "04", "name": "sunil"}
Other ways to solve the same issue
for key in list(y.keys()):
#add all potential keys you want to remain in the final dictionary
if key == "id" or key == "name":
continue
else:
del y[key]
However, iterating over a dictionary is pretty slow. You could assign values to temporary variables and then remake the dictionary like this:
temp_id = y['id']
temp_name = y['name']
y.clear()
y['id'] = temp_id
y['name'] = temp_name
This should be faster that iterating over a dictionary.

(Python) Reading specific keys from json file

I have a question about how I can make this code work.
I have several json files, from which I must read information and store it in a pandas framework to export it later. The json files are pretty branched up and go something like this:
'TESTS' -> 'MEASUREMENTS' -> SeveralTestNames(For each of those keys there is a key 'Value') I need to get only the Values and save them.
My thought was, that I get the keys with the testnames, and then in a loop apply these names to the json.load()-method, but no matter what I try, it doesnt work.
import json
with open(file) as f:
data = json.load(f)
date = data['INFO']['TIME'][0:10]
time = data['INFO']['TIME'][11:]
t = data['TESTS']['MEASUREMENTS']
type = [*t]
value = []
i = 0
for x in type:
v = data['TESTS']['MEASUREMENTS'][type[i]]['RESULTS']
value.append(v)
i = i + 1
This just gives me 'TypeError: list indices must be integers or slices, not str', but when I remove the last bit with the ['RESULTS'], it gives me the keys of the tests, but i need the values from within them.
It seems you're overcomplicating this for yourself a bit.
I'm reproducing your json from your comments as best as I can since parts of it are not valid json.
data = {
'INFO': {
'TIME': ' ',
'VERSION' : ' '
},
'TESTS': {
'MEASUREMENTS': {
'Test1': {
'RESULTS': {
'VALUE': 147799000000
}
},
'Test2': {
'RESULTS': {
'VALUE': 147888278322
}
}
}
}
}
values = []
# this iterates over each dict within the MEASUREMENTS key
# dict.values() returns a view that you can iterate over
# of just the values
for measurement in data['TESTS']['MEASUREMENTS'].values():
values.append(measurement['RESULTS']['VALUE'])
print(values)
In your case it would be:
import json
with open(file) as f:
data = json.load(f)
values = []
for measurement in data['TESTS']['MEASUREMENTS'].values():
values.append(measurement['RESULTS']['VALUE'])
print(values)

How to access nested attribute without passing parent attribute in pyspark json

I am trying to access inner attributes of following json using pyspark
[
{
"432": [
{
"atttr1": null,
"atttr2": "7DG6",
"id":432,
"score": 100
}
]
},
{
"238": [
{
"atttr1": null,
"atttr2": "7SS8",
"id":432,
"score": 100
}
]
}
]
In the output, I am looking for something like below in form of csv
atttr1, atttr2,id,score
null,"7DG6",432,100
null,"7SS8",238,100
I understand I can get these details like below but I don't want to pass 432 or 238 in lambda expression as in bigger json this(italic one) will vary. I want to iterate over all available values.
print(inputDF.rdd.map(lambda x:(x['*432*'])).first())
print(inputDF.rdd.map(lambda x:(x['*238*'])).first())
I also tried registering a temp table with the name "test" but it gave an error with message element._id doesn't exist.
inputDF.registerTempTable("test")
srdd2 = spark.sql("select element._id from test limit 1")
Any help will be highly appreciated. I am using spark 2.4
Without using pyspark features, you can do it like this:
data = json.loads(json_str) # or whatever way you're getting the data
columns = 'atttr1 atttr2 id score'.split()
print(','.join(columns)) # headers
for item in data:
for obj in list(item.values())[0]: # since each list has only one object
print(','.join(str(obj[col]) for col in columns))
Output:
atttr1,atttr2,id,score
None,7DG6,432,100
None,7SS8,432,100
Or
for item in data:
obj = list(item.values())[0][0] # since the object is the one and only item in list
print(','.join(str(obj[col]) for col in columns))
FYI, you can store those in a variable or write it out to csv instead of/and also printing it.
And if you're just looking to dump that to csv, see this answer.

How to convert dynamic nested json into csv?

I have some dynamically generated nested json that I want to convert to a CSV file using python. I am trying to use pandas for this. My question is - is there a way to use this and flatten the json data to put in the csv without knowing the json keys that need flattened in advance? An example of my data is this:
{
"reports": [
{
"name": "report_1",
"details": {
"id": "123",
"more info": "zyx",
"people": [
"person1",
"person2"
]
}
},
{
"name": "report_2",
"details": {
"id": "123",
"more info": "zyx",
"actions": [
"action1",
"action2"
]
}
}
]
}
More nested json objects can be dynamically generated in the "details" section that I do not know about in advance but need to be represented in their own cell in the csv.
For the above example, I'd want the csv to look something like this:
Name, Id, More Info, People_1, People_2, Actions_1, Actions_2
report_1, 123, zxy, person1, person2, ,
report_2, 123, zxy , , , action1 , action2
Here's the code I have:
data = json.loads('{"reports": [{"name": "report_1","details": {"id": "123","more info": "zyx","people": ["person1","person2"]}},{"name": "report_2","details": {"id": "123","more info": "zyx","actions": ["action1","action2"]}}]}')
df = pd.json_normalize(data['reports'])
df.to_csv("test.csv")
And here is the outcome currently:
,name,details.id,details.more info,details.people,details.actions
0,report_1,123,zyx,"['person1', 'person2']",
1,report_2,123,zyx,,"['action1', 'action2']"
I think what your are lookig for is:
https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
If using pandas doesn't work for you, here's the more canonical Python way of doing it.
You're trying to write out a CSV file, and that implicitly means you must write out a header containing all the keys.
The constraint that you don't know the keys in advance means you can't do this in a single pass.
def convert_record_to_flat_dict(record):
# You need to figure out exactly how you want to do this; everything
# should be strings.
record.update(record.pop('details'))
return record
header = {}
rows = [[]] # Leave the header row blank for now.
csv_out = csv.writer(buffer)
for record in report:
record = convert_record_to_flat_dict(record)
for key in record.keys():
if key not in header:
header[key] = len(header)
rows[0].append(key)
row = [''] * len(header)
for key, index in header.items():
row[index] = record.get(key, '')
rows.append(row)
# And you can go back to ensure all rows have the same number of keys:
for row in rows:
row.extend([''] * (len(row) - len(header)))
Now you have a list of lists that's ready to be sent to csv.csvwriter() or the like.
If memory is an issue, another technique is to write out a temporary file and then reprocess it once you know the header.

Convert python nested JSON-like data to dataframe

My records looks like this and I need to write it to a csv file:
my_data={"data":[{"id":"xyz","type":"book","attributes":{"doc_type":"article","action":"cut"}}]}
which looks like json, but the next record starts with "data" and not "data1" which forces me to read each record separately. Then, I convert it to a dict using eval(), to iterate thru keys and values for a certain path to get to the values I need. Then, I generate a list of keys and values based on the keys I need. Then, a pd.dataframe() converts that list into a dataframe which I know how to convert to csv. My code that works is below. But I am sure there are better ways to do this. Mine scales poorly. Thx.
counter=1
k=[]
v=[]
res=[]
m=0
for line in f2:
jline=eval(line)
counter +=1
for items in jline:
k.append(jline[u'data'][0].keys())
v.append(jline[u'data'][0].values())
print 'keys are:', k
i=0
j=0
while i <3 :
while j <3:
if k[i][j]==u'id':
res.append(v[i][j])
j += 1
i += 1
#res is my result set
del k[:]
del v[:]
Changing my_data to be:
my_data = [{"id":"xyz","type":"book","attributes":{"doc_type":"article","action":"cut"}}, # Data One
{"id":"xyz2","type":"book","attributes":{"doc_type":"article","action":"cut"}}, # Data Two
{"id":"xyz3","type":"book","attributes":{"doc_type":"article","action":"cut"}}] # Data Three
You can dump this directly into a dataframe as so:
mydf = pd.DataFrame(my_data)
It's not clear what your data path would be, but if you are looking for specific combinations of id, type, etc. You could explicitly search
def find_my_way(data, pattern):
# pattern = {'id':'someid', 'type':'sometype'...}
res = []
for row in data:
if row.get('id') == pattern.get('id'):
res.append(row)
return row
mydf = pd.DataFrame(find_my_way(mydata, pattern))
EDIT:
Without going into how the api works, in pseudo-code, you'll want to do something like the following:
my_objects = []
calls = 0
while calls < maximum:
my_data = call_the_api(params)
data = my_data.get('data')
if not data:
calls+=1
continue
# Api calls to single objects usually return a dictionary, to group objects they return lists. This handles both cases
if isinstance(data, list):
my_objects = [*data, *my_objects]
elif isinstance(data, {}):
my_objects = [{**data}, *my_objects]
# This will unpack the data response into a list that you can then load into a DataFrame with the attributes from the api as the columns
df = pd.DataFrame(my_objects)
Assuming your data from the api looks like:
"""
{
"links": {},
"meta": {},
"data": {
"type": "FactivaOrganizationsProfile",
"id": "Goog",
"attributes": {
"key_executives": {
"source_provider": [
{
"code": "FACSET",
"descriptor": "FactSet Research Systems Inc.",
"primary": true
}
]
}
},
"relationships": {
"people": {
"data": {
"type": "people",
"id": "39961704"
}
}
}
},
"included": {}
}
"""
per the documentation, which is why I'm using my_data.get('data').
That should get you all of the data (unfiltered) into a DataFrame
Saving the DataFrame for the last bit is a bit more memory friendly

Categories