How to compare two dicts and find matching values - Python

I'm pulling data from an API for a weather system. The API returns a single JSON object with sensors broken up into two sub-nodes for each sensor. I'm trying to associate two (or more) sensors with their time-stamps. Unfortunately, not every sensor polls every single time (although they're supposed to).
In effect, I have a JSON object that looks like this:
{
    "sensor_data": {
        "mbar": [{
            "value": 1012,
            "timestamp": "2019-10-31T00:15:00"
        }, {
            "value": 1011,
            "timestamp": "2019-10-31T00:30:00"
        }, {
            "value": 1010,
            "timestamp": "2019-10-31T00:45:00"
        }],
        "temperature": [{
            "value": 10.3,
            "timestamp": "2019-10-31T00:15:00"
        }, {
            "value": 10.2,
            "timestamp": "2019-10-31T00:30:00"
        }, {
            "value": 10.0,
            "timestamp": "2019-10-31T00:45:00"
        }, {
            "value": 9.8,
            "timestamp": "2019-10-31T01:00:00"
        }]
    }
}
This example shows one extra temperature reading, and it is a really small one.
How can I take this data and associate a single reading for each timestamp, gathering as much sensor data as I can pull from matching timestamps? Ultimately, I want to export the data into a CSV file, with each row representing a slice in time from the sensor, to be graphed or further analyzed after.
For lists that are exactly the same length, I have a solution:
sensor_id = '007_OHMSS'
sensor_data = read_json('sensor_data.json') # wrapper function for open and load json
list_a = sensor_data['mbar']
list_b = sensor_data['temperature']
pair_perfect_sensor_lists(sensor_id, list_a, list_b)
def pair_perfect_sensor_lists(sensor_id, list_a, list_b):
    # in this case, list_a will be mbar, list_b will be temperature
    matches = list()
    if len(list_a) == len(list_b):
        for idx, reading in enumerate(list_a):
            mbar_value = reading['value']
            timestamp = reading['timestamp']
            t_reading = list_b[idx]
            t_time = t_reading['timestamp']
            temp_value = t_reading['value']
            print(t_time == timestamp)
            if t_time == timestamp:
                match = {
                    'sensor_id': sensor_id,
                    'mbar_index': idx,
                    'time_index': idx,
                    'mbar_value': mbar_value,
                    'temp_value': temp_value,
                    'mbar_time': timestamp,
                    'temp_time': t_time,
                }
                print('here is your match:')
                print(match)
                matches.append(match)
            else:
                print("IMPERFECT!")
                print(t_time)
                print(timestamp)
        return matches
    return failure  # placeholder: lists of unequal length are not handled yet
When there's not a match, I want to skip a reading for the missing sensor (in this case, the last mbar reading) and just do an N/A.
In most cases, the offset is just one node - meaning temp has one extra reading, somewhere in the middle.
I was using the idx index to optimize the speed of the process, so I don't have to loop through the second (or third, or nth) dict to see if the timestamp exists in it, but I know that's not preferred either, because dicts aren't ordered. In this case, it appears every sub-node sensor dict is ordered by timestamp, so I was trying to leverage that convenience.
Is this a common problem? If so, just point me to the terminology. But I've searched already and cannot find a reasonable, efficient answer besides "loop through each sub-dict and look for a match".
Open to any ideas, because I'll have to do this often, and on large (25 MB files or larger, sometimes) JSON objects. The full dump is up and over 300 MB, but I've sliced them up by sensor IDs so they're more manageable.

You can use .get to avoid KeyErrors for missing timestamps and get an output like this.
st = yourjsonabove  # the parsed JSON object from the question

mbar = {}
for item in st['sensor_data']['mbar']:
    mbar[item['timestamp']] = item['value']

temperature = {}
for item in st['sensor_data']['temperature']:
    temperature[item['timestamp']] = item['value']

for timestamp in temperature:
    print("Timestamp:", timestamp,
          "Sensor Reading:", mbar.get(timestamp),
          "Temperature Reading:", temperature[timestamp])
leading to output:
Timestamp: 2019-10-31T00:15:00 Sensor Reading: 1012 Temperature Reading: 10.3
Timestamp: 2019-10-31T00:30:00 Sensor Reading: 1011 Temperature Reading: 10.2
Timestamp: 2019-10-31T00:45:00 Sensor Reading: 1010 Temperature Reading: 10.0
Timestamp: 2019-10-31T01:00:00 Sensor Reading: None Temperature Reading: 9.8
Does that help?
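If the end goal is the CSV export described in the question, here is a minimal sketch building on the two dicts above (the output file name and the 'N/A' fill value are my own choices):

import csv

# Union of all timestamps seen by either sensor, in chronological order
# (ISO-8601 strings sort correctly as plain text).
all_timestamps = sorted(mbar.keys() | temperature.keys())

with open('merged_readings.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['timestamp', 'mbar', 'temperature'])
    writer.writeheader()
    for ts in all_timestamps:
        writer.writerow({
            'timestamp': ts,
            'mbar': mbar.get(ts, 'N/A'),
            'temperature': temperature.get(ts, 'N/A'),
        })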

You could make a dict with timestamp keys of your sensor readings like
mbar = {s['timestamp']:s['value'] for s in sensor_data['mbar']}
temp = {s['timestamp']:s['value'] for s in sensor_data['temperature']}
Now it is easy to compare using the difference of the key sets
mbar.keys() - temp.keys()
temp.keys() - mbar.keys()
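On the sample data above, those set operations and a merge over the key union would come out roughly like this (the values in the comments are what I would expect, not output copied from the post):

# Timestamps present for one sensor but not the other:
print(temp.keys() - mbar.keys())   # expected: {'2019-10-31T01:00:00'}
print(mbar.keys() - temp.keys())   # expected: set()

# The union of the key sets covers every timestamp; .get() fills gaps with None:
for ts in sorted(mbar.keys() | temp.keys()):
    print(ts, mbar.get(ts), temp.get(ts))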

Related

How to manipulate and slice multi-dimensional JSON data in Python?

I'm trying to set up a convenient system for storing and analyzing data from experiments. For the data files I use the following JSON format:
{
    "sample_id": "",
    "timestamp": "",
    "other_metadata1": "",
    "measurements": {
        "type1": {
            "timestamp": "",
            "other_metadata2": "",
            "data": {
                "parameter1": [1, 2, 3],
                "parameter2": [4, 5, 6]
            }
        },
        "type2": { ... }
    }
}
Now for analyzing many of these files, I want to filter for sample metadata and measurement metadata to get a subset of the data to plot. I wrote a function like this:
def get_subset(data_dict, include_samples={}, include_measurements={}):
    # Start with a copy of all datasets
    subset = copy.deepcopy(data_dict)
    # Include samples if they satisfy certain properties
    for prop, req in include_samples.items():
        subset = {file: sample for file, sample in subset.items() if sample[prop] == req}
    # Filter by measurement properties
    for file, sample in subset.items():
        measurements = sample['measurements'].copy()
        for prop, req in include_measurements.items():
            measurements = [meas for meas in measurements if meas[prop] == req]
        # Replace the measurements list
        sample['measurements'] = measurements
    return subset
While this works, I feel like I'm re-inventing the wheel of something like pandas. I would like more functionality, like dropping all NaN values, excluding based on metadata, etc., all of which is available in pandas. However, my data format is not compatible with its 2D nature.
Any suggestions on how to go about manipulating and slicing such data structures without reinventing a lot of things?
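For illustration, a call to the get_subset function above might look like this (the filter keys and values are hypothetical placeholders based on the metadata fields in the sample JSON):

# Hypothetical filters: keep samples whose 'other_metadata1' equals 'run_A'
# and measurements whose 'other_metadata2' equals 'calibrated'.
subset = get_subset(
    all_data,  # a dict of {filename: sample_dict}, as the function expects
    include_samples={'other_metadata1': 'run_A'},
    include_measurements={'other_metadata2': 'calibrated'},
)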

How to convert dynamic nested json into csv?

I have some dynamically generated nested JSON that I want to convert to a CSV file using Python. I am trying to use pandas for this. My question is: is there a way to flatten the JSON data for the CSV without knowing in advance which JSON keys need to be flattened? An example of my data is this:
{
    "reports": [
        {
            "name": "report_1",
            "details": {
                "id": "123",
                "more info": "zyx",
                "people": [
                    "person1",
                    "person2"
                ]
            }
        },
        {
            "name": "report_2",
            "details": {
                "id": "123",
                "more info": "zyx",
                "actions": [
                    "action1",
                    "action2"
                ]
            }
        }
    ]
}
More nested json objects can be dynamically generated in the "details" section that I do not know about in advance but need to be represented in their own cell in the csv.
For the above example, I'd want the csv to look something like this:
Name, Id, More Info, People_1, People_2, Actions_1, Actions_2
report_1, 123, zyx, person1, person2, ,
report_2, 123, zyx, , , action1, action2
Here's the code I have:
import json
import pandas as pd

data = json.loads('{"reports": [{"name": "report_1","details": {"id": "123","more info": "zyx","people": ["person1","person2"]}},{"name": "report_2","details": {"id": "123","more info": "zyx","actions": ["action1","action2"]}}]}')
df = pd.json_normalize(data['reports'])
df.to_csv("test.csv")
And here is the outcome currently:
,name,details.id,details.more info,details.people,details.actions
0,report_1,123,zyx,"['person1', 'person2']",
1,report_2,123,zyx,,"['action1', 'action2']"
I think what you are looking for is pandas.json_normalize:
https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
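json_normalize alone leaves the list-valued fields (people, actions) as Python lists inside a single cell, as your current output shows. Here is a sketch of one way to expand them into numbered columns afterwards; the column-naming scheme is my own choice:

import pandas as pd

df = pd.json_normalize(data['reports'])  # as in the question's code

# Expand each column whose cells hold lists into numbered columns,
# e.g. 'details.people' -> 'people_1', 'people_2'.
for col in list(df.columns):
    if df[col].apply(lambda v: isinstance(v, list)).any():
        expanded = df[col].apply(pd.Series)
        expanded.columns = [f"{col.split('.')[-1]}_{i + 1}" for i in expanded.columns]
        df = df.drop(columns=[col]).join(expanded)

df.to_csv("test.csv", index=False)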
If using pandas doesn't work for you, here's the more canonical Python way of doing it.
You're trying to write out a CSV file, and that implicitly means you must write out a header containing all the keys.
The constraint that you don't know the keys in advance means you can't do this in a single pass.
def convert_record_to_flat_dict(record):
    # You need to figure out exactly how you want to do this; everything
    # should be strings. Here the nested 'details' dict is merged into the top level.
    record.update(record.pop('details'))
    return record

header = {}   # maps each key to its column index, in order of first appearance
rows = [[]]   # rows[0] is the header row, filled in as new keys turn up

for record in data['reports']:
    record = convert_record_to_flat_dict(record)
    for key in record.keys():
        if key not in header:
            header[key] = len(header)
            rows[0].append(key)
    row = [''] * len(header)
    for key, index in header.items():
        row[index] = record.get(key, '')
    rows.append(row)

# And you can go back to ensure all rows have the same number of columns:
for row in rows:
    row.extend([''] * (len(header) - len(row)))
Now you have a list of lists that's ready to be sent to csv.writer() or the like.
If memory is an issue, another technique is to write out a temporary file and then reprocess it once you know the header.
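For completeness, a minimal sketch of that final write step (the output file name is arbitrary):

import csv

with open('test.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)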

Modify two files in the same script in python

I have two JSON files with price values; each file holds one of the two values. The second one is the average of other prices, whose count is stored in a "sold_number" key. What I'd like is an if statement: if sold_number > 3, add the "avg_actual" value to "cote_actual"; otherwise, add the cote_lv value.
Sample of the first database:
[{
    "objectID": 10000,
    "cote_lv": 28000
},
{
    "objectID": 10001,
    "cote_lv": 35000
}...]
Sample of the second one:
[{
    "objectID": 10002,
    "avg_actual": 47640,
    "sold_number": 2
},
{
    "objectID": 10001,
    "sold_number": 5,
    "unsold_number": 1,
    "unsold_var": 17
}...]
I expect an output with a "cote_actual": value for each objectID in the two files.
My python code that doesn't work:
import json

with open('./output/gmstatsencheresdemo.json', encoding='utf-8') as data_file2, open('./live_files/demo_db_live.json', encoding='utf-8') as data_file:
    data2 = json.loads(data_file2.read())
    data = json.loads(data_file.read())

for i in data:
    cotelv = i.get('cote_lv')
    i['cote_actual'] = {}
    i['cote_actual'] = cotelv

for x in data2:
    soldnumber = x.get('sold_number')
    avgactual = x.get('avg_actual')
    x['cote_actual'] = {}
    x['cote_actual'] = avgactual
    if soldnumber >= 3:
        x['cote_actual'] = avgactual
    elif soldnumber <= 3:
        x['cote_actual'] = cotelv
    print(x['cote_actual'])
EDIT: My output is okay when it comes to avg_actual, which is displayed correctly, but not with cotelv: it doesn't loop through all the values and only displays one (9000 in this example).
Output:
9000
107792
9000
125700
9000
I believe the answer to this can be found at "How can I open multiple files using 'with open' in Python?".
However, to cut it short, you can open both files on the same line and then perform your actions inside the "with" block's indentation. The reason your code fails is that, after the "with", anything that is not indented no longer runs with the opened files in scope, much like a loop or a regular "if" statement. Hence, you can open them simultaneously and edit them under the correct indentation, as shown below.
with open('./output/gmstatsencheresdemo.json', encoding='utf-8') as data_file2, open('./live_files/demo_db_live.json', encoding='utf-8') as data_file:
    data2 = json.loads(data_file2.read())
    data = json.loads(data_file.read())
    # Now that they are opened, you can perform the actions through here
Hope that answers your question!
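Beyond the file handling, the merge logic itself needs the two lists joined on objectID; as written, cotelv only ever holds the last value from the first loop. Here is a sketch of one way to do that join (file paths and key names are taken from the question; treating a missing sold_number as 0 is my own assumption):

import json

with open('./live_files/demo_db_live.json', encoding='utf-8') as f1, \
     open('./output/gmstatsencheresdemo.json', encoding='utf-8') as f2:
    data = json.load(f1)    # records carrying "cote_lv"
    data2 = json.load(f2)   # records carrying "avg_actual" / "sold_number"

# Index the second file by objectID so each record in the first file can
# look up its own statistics instead of reusing the last loop value.
stats_by_id = {rec['objectID']: rec for rec in data2}

for rec in data:
    stats = stats_by_id.get(rec['objectID'], {})
    if stats.get('sold_number', 0) > 3:
        rec['cote_actual'] = stats.get('avg_actual')
    else:
        rec['cote_actual'] = rec.get('cote_lv')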

Filtering pandas dataframe by date to count views for timeline of programs

I need to count viewers by program for a streaming channel from a json logfile.
I identify the programs by their start times (shown in a screenshot in the original post).
So far I have two Dataframes like this:
The first one contains all the timestamps from the logfile
viewers_from_log = pd.read_json('sqllog.json', encoding='UTF-8')
# Convert date string to pandas datetime object:
viewers_from_log['time'] = pd.to_datetime(viewers_from_log['time'])
Source JSON file:
[
    {
        "logid": 191605,
        "time": "0:00:17"
    },
    {
        "logid": 191607,
        "time": "0:00:26"
    },
    {
        "logid": 191611,
        "time": "0:01:20"
    }
]
The second contains the starting times and titles of the programs
programs_start_time = pd.DataFrame.from_dict('programs.json', orient='index')
Source JSON file:
{
    "2019-05-29": [
        {
            "title": "\"Amiről a kövek mesélnek\"",
            "startTime_dt": "2019-05-29T00:00:40Z"
        },
        {
            "title": "Koffer - Kedvcsináló Kul(t)túrák Külföldön",
            "startTime_dt": "2019-05-29T00:22:44Z"
        },
        {
            "title": "Gubancok",
            "startTime_dt": "2019-05-29T00:48:08Z"
        }
    ]
}
So what I need to do is to count the entries / program in the log file and link them to the program titles.
My approach is to slice the log data by each date range from the program data and take the shape, then add a column to the program data with the results:
import pandas as pd

# setup test data
log_data = {'Time': ['2019-05-30 00:00:26', '2019-05-30 00:00:50', '2019-05-30 00:05:50', '2019-05-30 00:23:26']}
log_data = pd.DataFrame(data=log_data)
program_data = {'Time': ['2019-05-30 00:00:00', '2019-05-30 00:22:44'],
                'Program': ['Program 1', 'Program 2']}
program_data = pd.DataFrame(data=program_data)

counts = []
for index, row in program_data.iterrows():
    # get counts on selected range
    try:
        log_range = log_data[(log_data['Time'] > program_data.loc[index].values[0]) & (log_data['Time'] < program_data.loc[index + 1].values[0])]
        counts.append(log_range.shape[0])
    except KeyError:
        # the last programme has no successor, so take everything after its start
        log_range = log_data[log_data['Time'] > program_data.loc[index].values[0]]
        counts.append(log_range.shape[0])

# add additional column with collected counts
program_data['Counts'] = counts
Output:
Time Program Counts
0 2019-05-30 00:00:00 Program 1 3
1 2019-05-30 00:22:44 Program 2 1
A working (but maybe a little quick and dirty) method:
Use the .shift(-1) method on the timestamp column of programs_start_time dataframe, to get an additional column with a name date_end indicating the timestamp of end for each TV program.
Then, for each example_timestamp in the log file, you can query the TV programs dataframe like this: df[(df['date_start'] <= example_timestamp) & (df['date_end'] > example_timestamp)] (make sure you substitute df with your dataframe's name, programs_start_time), which will give you exactly one dataframe row, from which you can extract the name of the TV program.
Hope this helps!
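A minimal sketch of that shift(-1) idea, using made-up sample data in the same shape as the question's (the column names are my own):

import pandas as pd

programs = pd.DataFrame({
    'title': ['Program 1', 'Program 2'],
    'date_start': pd.to_datetime(['2019-05-30 00:00:00', '2019-05-30 00:22:44']),
})
logs = pd.DataFrame({
    'time': pd.to_datetime(['2019-05-30 00:00:26', '2019-05-30 00:05:50', '2019-05-30 00:23:26']),
})

# The end of each programme is the start of the next one; the last programme gets an open end.
programs['date_end'] = programs['date_start'].shift(-1).fillna(pd.Timestamp.max)

# Count the log entries falling inside each programme's interval.
programs['viewers'] = [
    ((logs['time'] >= row.date_start) & (logs['time'] < row.date_end)).sum()
    for row in programs.itertuples()
]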
Solution with histogram, using numpy:
import pandas as pd
import numpy as np

df_p = pd.DataFrame([
    {
        "title": "\"Amiről a kövek mesélnek\"",
        "startTime_dt": "2019-05-29T00:00:40Z"
    },
    {
        "title": "Koffer - Kedvcsináló Kul(t)túrák Külföldön",
        "startTime_dt": "2019-05-29T00:22:44Z"
    },
    {
        "title": "Gubancok",
        "startTime_dt": "2019-05-29T00:48:08Z"
    }
])
df_v = pd.DataFrame([
    {
        "logid": 191605,
        "time": "2019-05-29 0:00:17"
    },
    {
        "logid": 191607,
        "time": "2019-05-29 0:00:26"
    },
    {
        "logid": 191611,
        "time": "2019-05-29 0:01:20"
    }
])

df_p.startTime_dt = pd.to_datetime(df_p.startTime_dt)
df_v.time = pd.to_datetime(df_v.time)

# here's the part where I convert datetime to timestamp in seconds - astype(int) casts it to nanoseconds, hence the // 10**9
programmes_start = df_p.startTime_dt.astype(int).values // 10**9
viewings_starts = df_v.time.astype(int).values // 10**9

# make bins for histogram:
# add zero to the beginning of the array
# add a value that is one hour after the start of the last given programme to the end of the array
programmes_start = np.pad(programmes_start, (1, 1), mode='constant', constant_values=(0, programmes_start.max() + 3600))

histogram = np.histogram(viewings_starts, bins=programmes_start)
print(histogram[0])
# prints [2 1 0 0]
Interpretation: there were 2 log entries before 'Amiről a kövek mesélnek' started, 1 log entry between the starts of 'Amiről a kövek mesélnek' and 'Koffer - Kedvcsináló Kul(t)túrák Külföldön', 0 log entries between the starts of 'Koffer - Kedvcsináló Kul(t)túrák Külföldön' and 'Gubancok', and 0 entries after the start of 'Gubancok'. Which, looking at the data you provided, seems correct :) Hope this helps.
NOTE: I assume that you have the dates of the viewings. You don't have them in the example log file, but they appear in the screenshot, so I assumed that you can compute/get them somehow and added them by hand to the input dict.

Convert python nested JSON-like data to dataframe

My records look like this and I need to write them to a CSV file:
my_data={"data":[{"id":"xyz","type":"book","attributes":{"doc_type":"article","action":"cut"}}]}
which looks like JSON, but the next record starts with "data" and not "data1", which forces me to read each record separately. Then I convert it to a dict using eval() so I can iterate through keys and values along a certain path to get to the values I need. Then I generate a list of keys and values based on the keys I need, and pd.DataFrame() converts that list into a dataframe, which I know how to convert to CSV. My code, which works, is below, but I am sure there are better ways to do this; mine scales poorly. Thanks.
counter = 1
k = []
v = []
res = []
m = 0

for line in f2:
    jline = eval(line)
    counter += 1
    for items in jline:
        k.append(jline[u'data'][0].keys())
        v.append(jline[u'data'][0].values())
print 'keys are:', k

i = 0
j = 0
while i < 3:
    while j < 3:
        if k[i][j] == u'id':
            res.append(v[i][j])
        j += 1
    i += 1
# res is my result set
del k[:]
del v[:]
Changing my_data to be:
my_data = [{"id":"xyz","type":"book","attributes":{"doc_type":"article","action":"cut"}}, # Data One
{"id":"xyz2","type":"book","attributes":{"doc_type":"article","action":"cut"}}, # Data Two
{"id":"xyz3","type":"book","attributes":{"doc_type":"article","action":"cut"}}] # Data Three
You can dump this directly into a dataframe as so:
mydf = pd.DataFrame(my_data)
It's not clear what your data path would be, but if you are looking for specific combinations of id, type, etc., you could explicitly search:
def find_my_way(data, pattern):
    # pattern = {'id': 'someid', 'type': 'sometype', ...}
    res = []
    for row in data:
        if row.get('id') == pattern.get('id'):
            res.append(row)
    return res

mydf = pd.DataFrame(find_my_way(my_data, pattern))
EDIT:
Without going into how the api works, in pseudo-code, you'll want to do something like the following:
my_objects = []
calls = 0
while calls < maximum:
    my_data = call_the_api(params)
    data = my_data.get('data')
    if not data:
        calls += 1
        continue
    # Api calls to single objects usually return a dictionary; for groups of objects they return lists. This handles both cases
    if isinstance(data, list):
        my_objects = [*data, *my_objects]
    elif isinstance(data, dict):
        my_objects = [{**data}, *my_objects]

# This will unpack the data responses into a list that you can then load into a DataFrame with the attributes from the api as the columns
df = pd.DataFrame(my_objects)
Assuming your data from the api looks like:
"""
{
"links": {},
"meta": {},
"data": {
"type": "FactivaOrganizationsProfile",
"id": "Goog",
"attributes": {
"key_executives": {
"source_provider": [
{
"code": "FACSET",
"descriptor": "FactSet Research Systems Inc.",
"primary": true
}
]
}
},
"relationships": {
"people": {
"data": {
"type": "people",
"id": "39961704"
}
}
}
},
"included": {}
}
"""
per the documentation, which is why I'm using my_data.get('data').
That should get you all of the data (unfiltered) into a DataFrame.
Leaving the DataFrame construction for the last step is a bit more memory-friendly.
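If you would rather have the nested attributes as their own columns instead of a single dict-valued column, pd.json_normalize is another option. A sketch, assuming the list-of-records form of my_data shown at the top of this answer (the output file name is arbitrary):

import pandas as pd

# Nested dicts become dotted column names, e.g. 'attributes.doc_type' and 'attributes.action'.
mydf = pd.json_normalize(my_data)
mydf.to_csv('records.csv', index=False)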
