I have two JSON files with price values; each of the two values lives in its own file. The second one holds the average of other prices (their count is stored in a "sold_number" key). What I'd like is an if statement where: if sold_number > 3, then add the "avg_actual" value to "cote_actual"; else, add the "cote_lv" value.
Sample of the first database:
[{
    "objectID": 10000,
    "cote_lv": 28000
},
{
    "objectID": 10001,
    "cote_lv": 35000
}...]
Sample of the second one:
[{
    "objectID": 10002,
    "avg_actual": 47640,
    "sold_number": 2
},
{
    "objectID": 10001,
    "sold_number": 5,
    "unsold_number": 1,
    "unsold_var": 17
}...]
I expect an output with a "cote_actual" value for each objectID across the two files.
My Python code, which doesn't work:
import json

with open('./output/gmstatsencheresdemo.json', encoding='utf-8') as data_file2, open('./live_files/demo_db_live.json', encoding='utf-8') as data_file:
    data2 = json.loads(data_file2.read())
    data = json.loads(data_file.read())

    for i in data:
        cotelv = i.get('cote_lv')
        i['cote_actual'] = {}
        i['cote_actual'] = cotelv

    for x in data2:
        soldnumber = x.get('sold_number')
        avgactual = x.get('avg_actual')
        x['cote_actual'] = {}
        x['cote_actual'] = avgactual
        if soldnumber >= 3:
            x['cote_actual'] = avgactual
        elif soldnumber <= 3:
            x['cote_actual'] = cotelv
        print(x['cote_actual'])
EDIT: My output is okay when it comes to avg_actual, which is displayed correctly, but not with cotelv: it doesn't loop through all the values and only displays one (9000 in this example).
Output:
9000
107792
9000
125700
9000
I believe the answer to this is found at How can I open multiple files using "with open" in Python?.
However, to cut it short: you can open both files on the same line and then perform your actions inside the "with" block's indentation. The reason your code fails is that anything after the "with" statement that is not indented does not take the opened files into account, much like a loop or a regular "if" statement. Hence, you can open them simultaneously and edit them under the correct indentation, as shown below.
with open('./output/gmstatsencheresdemo.json', encoding='utf-8') as data_file2, open('./live_files/demo_db_live.json', encoding='utf-8') as data_file:
    data2 = json.loads(data_file2.read())
    data = json.loads(data_file.read())
    # Now that they are opened, you can perform the actions through here
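Building on that, here is a minimal sketch of the cote_actual logic itself, assuming the two files should be matched on objectID and that the threshold is sold_number > 3, as described in the question (paths and key names are taken from your post):

import json

with open('./output/gmstatsencheresdemo.json', encoding='utf-8') as data_file2, open('./live_files/demo_db_live.json', encoding='utf-8') as data_file:
    data2 = json.load(data_file2)  # records with sold_number / avg_actual
    data = json.load(data_file)    # records with cote_lv

    # Index the second file's records by objectID so each record can find its match.
    stats_by_id = {record['objectID']: record for record in data2}

    for item in data:
        match = stats_by_id.get(item['objectID'], {})
        if match.get('sold_number', 0) > 3:
            item['cote_actual'] = match.get('avg_actual')
        else:
            item['cote_actual'] = item.get('cote_lv')
        print(item['objectID'], item['cote_actual'])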
Hope that answers your question!
I have some dynamically generated nested JSON that I want to convert to a CSV file using Python. I am trying to use pandas for this. My question is: is there a way to flatten the JSON data for the CSV without knowing in advance which JSON keys need to be flattened? An example of my data is this:
{
    "reports": [
        {
            "name": "report_1",
            "details": {
                "id": "123",
                "more info": "zyx",
                "people": [
                    "person1",
                    "person2"
                ]
            }
        },
        {
            "name": "report_2",
            "details": {
                "id": "123",
                "more info": "zyx",
                "actions": [
                    "action1",
                    "action2"
                ]
            }
        }
    ]
}
More nested json objects can be dynamically generated in the "details" section that I do not know about in advance but need to be represented in their own cell in the csv.
For the above example, I'd want the csv to look something like this:
Name, Id, More Info, People_1, People_2, Actions_1, Actions_2
report_1, 123, zxy, person1, person2, ,
report_2, 123, zxy , , , action1 , action2
Here's the code I have:
data = json.loads('{"reports": [{"name": "report_1","details": {"id": "123","more info": "zyx","people": ["person1","person2"]}},{"name": "report_2","details": {"id": "123","more info": "zyx","actions": ["action1","action2"]}}]}')
df = pd.json_normalize(data['reports'])
df.to_csv("test.csv")
And here is the outcome currently:
,name,details.id,details.more info,details.people,details.actions
0,report_1,123,zyx,"['person1', 'person2']",
1,report_2,123,zyx,,"['action1', 'action2']"
I think what you are looking for is:
https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
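Since you are already using json_normalize, here is a hedged sketch of the remaining step: spreading the list-valued columns into numbered columns as in your example output. Detecting which columns hold lists is my assumption, since the keys are not known in advance (data is the dict you already load with json.loads):

import pandas as pd

df = pd.json_normalize(data['reports'])

# Find the columns whose values are lists (e.g. details.people, details.actions).
list_cols = [c for c in df.columns if df[c].apply(lambda v: isinstance(v, list)).any()]

for col in list_cols:
    expanded = df[col].apply(pd.Series)  # one column per list element
    expanded.columns = [f"{col}_{i + 1}" for i in expanded.columns]
    df = df.drop(columns=[col]).join(expanded)

df.to_csv("test.csv", index=False)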
If using pandas doesn't work for you, here's the more canonical Python way of doing it.
You're trying to write out a CSV file, and that implicitly means you must write out a header containing all the keys.
The constraint that you don't know the keys in advance means you can't do this in a single pass.
def convert_record_to_flat_dict(record):
    # You need to figure out exactly how you want to do this; everything
    # should be strings.
    record.update(record.pop('details'))
    return record

header = {}
rows = [[]]  # Leave the header row blank for now.

for record in data['reports']:
    record = convert_record_to_flat_dict(record)
    for key in record.keys():
        if key not in header:
            header[key] = len(header)
            rows[0].append(key)
    row = [''] * len(header)
    for key, index in header.items():
        row[index] = record.get(key, '')
    rows.append(row)

# And you can go back to ensure all rows have the same number of columns:
for row in rows:
    row.extend([''] * (len(header) - len(row)))
Now you have a list of lists that's ready to be sent to csv.writer() or the like.
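For instance, a minimal sketch of that final write (the output file name is just an example):

import csv

with open('flattened.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)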
If memory is an issue, another technique is to write out a temporary file and then reprocess it once you know the header.
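A rough sketch of that two-pass idea, reusing convert_record_to_flat_dict and data['reports'] from above (the temporary file holds one JSON document per line):

import csv
import json
import tempfile

header = []
with tempfile.TemporaryFile('w+') as tmp:
    # First pass: discover the full header while spooling flat records to disk.
    for record in data['reports']:
        record = convert_record_to_flat_dict(record)
        for key in record:
            if key not in header:
                header.append(key)
        tmp.write(json.dumps(record) + '\n')

    # Second pass: now that the header is known, write the CSV.
    tmp.seek(0)
    with open('test.csv', 'w', newline='') as out:
        writer = csv.DictWriter(out, fieldnames=header, restval='')
        writer.writeheader()
        for line in tmp:
            writer.writerow(json.loads(line))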
I have two CSV files which have a one-to-many relation between them.
main.csv:
"main_id","name"
"1","foobar"
attributes.csv:
"id","main_id","name","value","updated_at"
"100","1","color","red","2020-10-10"
"101","1","shape","square","2020-10-10"
"102","1","size","small","2020-10-10"
I would like to convert this to JSON of this structure:
[
    {
        "main_id": "1",
        "name": "foobar",
        "attributes": [
            {
                "id": "100",
                "name": "color",
                "value": "red",
                "updated_at": "2020-10-10"
            },
            {
                "id": "101",
                "name": "shape",
                "value": "square",
                "updated_at": "2020-10-10"
            },
            {
                "id": "102",
                "name": "size",
                "value": "small",
                "updated_at": "2020-10-10"
            }
        ]
    }
]
I tried using Python and Pandas like:
import pandas

def transform_group(group):
    group.reset_index(inplace=True)
    group.drop('main_id', axis='columns', inplace=True)
    return group.to_dict(orient='records')

main = pandas.read_csv('main.csv')
attributes = pandas.read_csv('attributes.csv', index_col=0)

attributes = attributes.groupby('main_id').apply(transform_group)
attributes.name = "attributes"

main = main.merge(
    right=attributes,
    on='main_id',
    how='left',
    validate='m:1',
    copy=False,
)

main.to_json('out.json', orient='records', indent=2)
It works. But the issue is that it does not seem to scale. When running on my whole dataset, I can load the individual CSV files without problems, but when I try to modify the data structure before calling to_json, memory usage explodes.
So is there a more efficient way to do this transformation? Maybe there is some Pandas feature I am missing? Or is there some other library to use? Moreover, use of apply seems to be pretty slow here.
This is a tough problem and we have all felt your pain.
There are three ways I would attack this problem. First, groupby is slower if you allow pandas to do the break out.
import pandas as pd
import numpy as np
from collections import defaultdict

df = pd.DataFrame({'id': np.random.randint(0, 100, 5000),
                   'name': np.random.randint(0, 100, 5000)})
Now, if you do the standard groupby:
groups = []
for k, rows in df.groupby('id'):
    groups.append(rows)
you will find that
groups = defaultdict(lambda: [])
for id, name in df.values:
    groups[id].append((id, name))
is about 3 times faster.
The second method is to change it to use Dask and its parallelization. A discussion about Dask is at "What is Dask and how is it different from Pandas?".
The third is algorithmic: load up the main file, then go ID by ID, loading only the data for that ID, balancing what sits in memory against what stays on disk, and saving out a partial result as it becomes available.
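Here is a hedged sketch of that third idea, assuming attributes.csv is sorted by main_id so each group is contiguous; it emits one JSON document per line as partial results become available:

import csv
import json
from itertools import groupby

# main.csv is small enough to keep in memory (per the question).
with open('main.csv', newline='') as f:
    main_by_id = {row['main_id']: row for row in csv.DictReader(f)}

# Stream attributes.csv and write each finished record immediately.
with open('attributes.csv', newline='') as f, open('out.jsonl', 'w') as out:
    for main_id, rows in groupby(csv.DictReader(f), key=lambda r: r['main_id']):
        record = dict(main_by_id[main_id])
        record['attributes'] = [
            {k: v for k, v in row.items() if k != 'main_id'} for row in rows
        ]
        out.write(json.dumps(record) + '\n')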
So in my case I was able to load the original tables into memory, but embedding the related rows exploded the size so that it did not fit into memory anymore. So I ended up still using Pandas to load the CSV files, but I then generate rows one by one iteratively and save each row into a separate JSON document. This means I do not keep one large data structure in memory for one large JSON.
Another important realization was to make the related column an index and to sort that index, so that querying it is fast (generally there are duplicate entries in the related column).
I made the following two helper functions:
import json
import sys

import pandas

def get_related_dict(related_table, label):
    assert related_table.index.is_unique
    if pandas.isna(label):
        return None
    row = related_table.loc[label]
    assert isinstance(row, pandas.Series), label
    result = row.to_dict()
    result[related_table.index.name] = label
    return result

def get_related_list(related_table, label):
    # Important to be more performant when selecting non-unique labels.
    assert related_table.index.is_monotonic_increasing
    try:
        # We use this syntax to always get a DataFrame and not a Series when there is only one row matching.
        return related_table.loc[[label], :].to_dict(orient='records')
    except KeyError:
        return []
And then I do:
main = pandas.read_csv('main.csv', index_col=0)
attributes = pandas.read_csv('attributes.csv', index_col=1)
# We sort the index to be more performant when selecting non-unique labels. We use a stable sort.
attributes.sort_index(inplace=True, kind='mergesort')

columns = [main.index.name] + list(main.columns)
for row in main.itertuples(index=True, name=None):
    assert len(columns) == len(row)
    data = dict(zip(columns, row))
    data['attributes'] = get_related_list(attributes, data['main_id'])
    json.dump(data, sys.stdout, indent=2)
    sys.stdout.write("\n")
I have a sizable JSON file and I need to get the index of a certain value inside it. Here's what my JSON file looks like:
data.json
[{...many more elements here...
},
{
    "name": "SQUARED SOS",
    "unified": "1F198",
    "non_qualified": null,
    "docomo": null,
    "au": "E4E8",
    "softbank": null,
    "google": "FEB4F",
    "image": "1f198.png",
    "sheet_x": 0,
    "sheet_y": 28,
    "short_name": "sos",
    "short_names": [
        "sos"
    ],
    "text": null,
    "texts": null,
    "category": "Symbols",
    "sort_order": 167,
    "added_in": "0.6",
    "has_img_apple": true,
    "has_img_google": true,
    "has_img_twitter": true,
    "has_img_facebook": true
},
{...many more elements here...
}]
How can I get the index of the value "FEB4F", whose key is "google", for example?
My only idea was this, but it doesn't work:
print(data.index('FEB4F'))
Your basic data structure is a list, so there's no way to avoid looping over it.
Loop through all the items, keeping track of the current position. If the current item has the desired key/value, print the current position.
position = 0
for item in data:
    if item.get('google') == 'FEB4F':
        print('position is:', position)
        break
    position += 1
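An equivalent shorter variant using enumerate (just another phrasing of the same loop, assuming data is the loaded list):

position = next((i for i, item in enumerate(data) if item.get('google') == 'FEB4F'), None)
print('position is:', position)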
Assuming your data can fit in a table, I recommend using pandas for that. Here is the summary:
Read the data using pandas.read_json
Identify which column to filter
Filter using pandas.DataFrame.loc
i.e.:
import pandas as pd
data = pd.read_json("path_to_json.json")
print(data)
#lets assume you want to filter using the 'unified' column
filtered = data.loc[data['unified'] == 'something']
print(filtered)
Of course the steps would be different depending on the JSON structure
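If you specifically need the position, as in the original question, here is a hedged follow-up, assuming read_json keeps the list order in a default RangeIndex:

positions = data.index[data['google'] == 'FEB4F'].tolist()
print(positions)  # list positions of all matching elements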
I'm pulling data from an API for a weather system. The API returns a single JSON object with sensors broken up into two sub-nodes for each sensor. I'm trying to associate two (or more) sensors with their time-stamps. Unfortunately, not every sensor polls every single time (although they're supposed to).
In effect, I have a JSON object that looks like this:
{
    "sensor_data": {
        "mbar": [{
            "value": 1012,
            "timestamp": "2019-10-31T00:15:00"
        }, {
            "value": 1011,
            "timestamp": "2019-10-31T00:30:00"
        }, {
            "value": 1010,
            "timestamp": "2019-10-31T00:45:00"
        }],
        "temperature": [{
            "value": 10.3,
            "timestamp": "2019-10-31T00:15:00"
        }, {
            "value": 10.2,
            "timestamp": "2019-10-31T00:30:00"
        }, {
            "value": 10.0,
            "timestamp": "2019-10-31T00:45:00"
        }, {
            "value": 9.8,
            "timestamp": "2019-10-31T01:00:00"
        }]
    }
}
This example shows I have one extra temperature reading, and this example is a really small one.
How can I take this data and associate a single reading for each timestamp, gathering as much sensor data as I can pull from matching timestamps? Ultimately, I want to export the data into a CSV file, with each row representing a slice in time from the sensor, to be graphed or further analyzed after.
For lists that are exactly the same length, I have a solution:
sensor_id = '007_OHMSS'
sensor_data = read_json('sensor_data.json')  # wrapper function for open and load json
list_a = sensor_data['mbar']
list_b = sensor_data['temperature']
pair_perfect_sensor_lists(sensor_id, list_a, list_b)

def pair_perfect_sensor_lists(sensor_id, list_a, list_b):
    # in this case, list_a will be mbar, list_b will be temperature
    matches = list()
    if len(list_a) == len(list_b):
        for idx, reading in enumerate(list_a):
            mbar_value = reading['value']
            timestamp = reading['timestamp']
            t_reading = list_b[idx]
            t_time = t_reading['timestamp']
            temp_value = t_reading['value']
            print(t_time == timestamp)
            if t_time == timestamp:
                match = {
                    'sensor_id': sensor_id,
                    'mbar_index': idx,
                    'time_index': idx,
                    'mbar_value': mbar_value,
                    'temp_value': temp_value,
                    'mbar_time': timestamp,
                    'temp_time': t_time,
                }
                print('here is your match:')
                print(match)
                matches.append(match)
            else:
                print("IMPERFECT!")
                print(t_time)
                print(timestamp)
        return matches
    return failure
When there's not a match, I want to skip a reading for the missing sensor (in this case, the last mbar reading) and just do an N/A.
In most cases, the offset is just one node - meaning temp has one extra reading, somewhere in the middle.
I was using the idx index to optimize the speed of the process, so I don't have to loop through the second (or third, or nth) dict to see if the timestamp exists in it, but I know that's not preferred either, because dicts aren't ordered. In this case, it appears every sub-node sensor dict is ordered by timestamp, so I was trying to leverage that convenience.
Is this a common problem? If so, just point me to the terminology. But I've searched already and cannot find a reasonable, efficient answer besides "loop through each sub-dict and look for a match".
Open to any ideas, because I'll have to do this often, and on large (25 MB files or larger, sometimes) JSON objects. The full dump is up and over 300 MB, but I've sliced them up by sensor IDs so they're more manageable.
You can use .get to avoid errors on missing readings and get an output like this:
st = yourjsonabove

mbar = {}
for item in st['sensor_data']['mbar']:
    mbar[item['timestamp']] = item['value']

temperature = {}
for item in st['sensor_data']['temperature']:
    temperature[item['timestamp']] = item['value']

for timestamp in temperature:
    print("Timestamp:", timestamp, "Sensor Reading:", mbar.get(timestamp), "Temperature Reading:", temperature[timestamp])
leading to output:
Timestamp: 2019-10-31T00:15:00 Sensor Reading: 1012 Temperature Reading: 10.3
Timestamp: 2019-10-31T00:30:00 Sensor Reading: 1011 Temperature Reading: 10.2
Timestamp: 2019-10-31T00:45:00 Sensor Reading: 1010 Temperature Reading: 10.0
Timestamp: 2019-10-31T01:00:00 Sensor Reading: None Temperature Reading: 9.8
Does that help?
You could make a dict with timestamp keys of your sensor readings like
mbar = {s['timestamp']:s['value'] for s in sensor_data['mbar']}
temp = {s['timestamp']:s['value'] for s in sensor_data['temperature']}
Now it is easy to compare using the difference of the key sets
mbar.keys() - temp.keys()
temp.keys() - mbar.keys()
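Building on that, a hedged sketch of the CSV export you describe, with one row per timestamp and blank cells where a sensor has no reading (the file name and column labels are just examples):

import csv

all_timestamps = sorted(mbar.keys() | temp.keys())

with open('readings.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['timestamp', 'mbar', 'temperature'])
    for ts in all_timestamps:
        writer.writerow([ts, mbar.get(ts, ''), temp.get(ts, '')])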
My records looks like this and I need to write it to a csv file:
my_data={"data":[{"id":"xyz","type":"book","attributes":{"doc_type":"article","action":"cut"}}]}
which looks like JSON, but the next record starts with "data" again rather than "data1", which forces me to read each record separately. I then convert each record to a dict using eval() so I can iterate through the keys and values along a certain path to get to the values I need. Then I generate a list of keys and values based on the keys I need, and pd.DataFrame() converts that list into a dataframe, which I know how to convert to CSV. My code that works is below, but I am sure there are better ways to do this; mine scales poorly. Thx.
counter = 1
k = []
v = []
res = []
m = 0
for line in f2:
    jline = eval(line)
    counter += 1
    for items in jline:
        k.append(jline[u'data'][0].keys())
        v.append(jline[u'data'][0].values())
print 'keys are:', k
i = 0
j = 0
while i < 3:
    while j < 3:
        if k[i][j] == u'id':
            res.append(v[i][j])
        j += 1
    i += 1
# res is my result set
del k[:]
del v[:]
Changing my_data to be:
my_data = [{"id":"xyz","type":"book","attributes":{"doc_type":"article","action":"cut"}}, # Data One
{"id":"xyz2","type":"book","attributes":{"doc_type":"article","action":"cut"}}, # Data Two
{"id":"xyz3","type":"book","attributes":{"doc_type":"article","action":"cut"}}] # Data Three
You can dump this directly into a dataframe as so:
mydf = pd.DataFrame(my_data)
It's not clear what your data path would be, but if you are looking for specific combinations of id, type, etc., you could explicitly search:
def find_my_way(data, pattern):
    # pattern = {'id': 'someid', 'type': 'sometype', ...}
    res = []
    for row in data:
        if row.get('id') == pattern.get('id'):
            res.append(row)
    return res

mydf = pd.DataFrame(find_my_way(my_data, pattern))
EDIT:
Without going into how the api works, in pseudo-code, you'll want to do something like the following:
my_objects = []
calls = 0
while calls < maximum:
    my_data = call_the_api(params)
    data = my_data.get('data')
    if not data:
        calls += 1
        continue
    # API calls to single objects usually return a dictionary; to group objects they return lists. This handles both cases.
    if isinstance(data, list):
        my_objects = [*data, *my_objects]
    elif isinstance(data, dict):
        my_objects = [{**data}, *my_objects]

# This will unpack the data responses into a list that you can then load into a DataFrame with the attributes from the API as the columns
df = pd.DataFrame(my_objects)
Assuming your data from the api looks like:
"""
{
    "links": {},
    "meta": {},
    "data": {
        "type": "FactivaOrganizationsProfile",
        "id": "Goog",
        "attributes": {
            "key_executives": {
                "source_provider": [
                    {
                        "code": "FACSET",
                        "descriptor": "FactSet Research Systems Inc.",
                        "primary": true
                    }
                ]
            }
        },
        "relationships": {
            "people": {
                "data": {
                    "type": "people",
                    "id": "39961704"
                }
            }
        }
    },
    "included": {}
}
"""
per the documentation, which is why I'm using my_data.get('data').
That should get you all of the data (unfiltered) into a DataFrame.
Saving the DataFrame construction for the last step is a bit more memory-friendly.
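If you also want the nested attributes flattened into their own columns before saving, here is a hedged variant using json_normalize (the file name is just an example):

# json_normalize spreads nested keys such as attributes.doc_type into their own columns.
df = pd.json_normalize(my_objects)
df.to_csv('records.csv', index=False)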