I am trying to convert a list of objects that has been queried using SQLAlchemy.
The issue I am having is that the process takes around 18-20 seconds to loop over, process, and send the data to the frontend. The data is around 5 million rows, which is way too slow to put into production.
Here is an example of what I am using.
test = [
    {"id": 5, "date": "2022-01-01 00:00:00"},
    {"id": 5, "date": "2022-01-01 00:00:00"},
    {"id": 5, "date": "2022-01-01 00:00:00"},
    {"id": 5, "date": "2022-01-01 00:00:00"},
]
test_dict = {}
for i in test:
    if i["id"] not in test_dict:
        test_dict[i["id"]] = []
    test_dict[i["id"]].append(i["date"].isoformat())  # in the real code "date" is a datetime from the query
Expected output should be e.g.:
[
    {5: [date, date, date, date, date]},
    {6: [date]}
]
I totally understand this is not working code and I am not looking to fix it. I just wrote this on the fly, but my main focus is what to do with the for loop to speed the process up.
Thank you everyone for your help.
Thank you everyone for your answers so far.
Providing more info: the data needs to be sent to the frontend, which then renders it on a graph. This data is updated around every minute or so and can also be requested between two time ranges. These time ranges are always a minimum of 35 days, so the rows returned are always a minimum of 5 million or so. 20 seconds for a graph to load is too slow for the end user. The for loop is the cause of this bottleneck, but it would be nice to get it down to around 5 seconds at most.
Thank you
Extra info:
Processing database-side is unfortunately not an option for this. The data must be converted to the correct format inside the API. For example, concatenating the data into the correct format or converting it to JSON during the query isn't an option.
If I understood correctly, you can use pandas DataFrames:
test = [
    {"id": 5, "date": "2022-01-01 00:00:00"},
    {"id": 5, "date": "2022-02-01 00:00:00"},
    {"id": 5, "date": "2022-03-01 00:00:00"},
    {"id": 5, "date": "2022-04-01 00:00:00"},
    {"id": 6, "date": "2022-05-01 00:00:00"},
]
import pandas as pd
df = pd.DataFrame.from_dict(test)
res = df.groupby("id").agg(list)
print(res)
Output:
date
id
5 [2022-01-01 00:00:00, 2022-02-01 00:00:00, 2022-03-01 00:00:00, 2022-04-01 00:00:00]
6 [2022-05-01 00:00:00]
And if you want it as a dict, you can use res.to_dict().
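For example, building on the res frame above, selecting the date column before converting gives roughly the {id: [dates]} shape from the question (a small sketch, not tested against the real data):
# res["date"] is a Series of lists indexed by id, so to_dict() gives {id: [dates, ...]}
id_to_dates = res["date"].to_dict()
print(id_to_dates)
# {5: ['2022-01-01 00:00:00', '2022-02-01 00:00:00', '2022-03-01 00:00:00', '2022-04-01 00:00:00'], 6: ['2022-05-01 00:00:00']}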
You should probably not send 5 million objects to the frontend.
Usually we use pagination, filters, and sorting.
Then, if you really want to do so, the fastest way would probably be to cache your data, for instance by creating and maintaining a JSON file on your server that the clients would download.
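As a rough sketch of that idea (the file name and helper are just placeholders, and the grouping follows the question's format), the API could rebuild the JSON file once per refresh instead of recomputing it on every request:
import json

def refresh_cache(rows, cache_path="graph_cache.json"):
    # rows is assumed to be the iterable of mappings coming from the SQLAlchemy query
    grouped = {}
    for row in rows:
        grouped.setdefault(row["id"], []).append(row["date"].isoformat())
    with open(cache_path, "w") as fh:
        json.dump(grouped, fh)
Clients then download graph_cache.json directly, so the 20-second loop only runs once per refresh rather than once per page load.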
Is there a reasonable way to detect duplicates in a Pandas dataframe when a column contains lists or numpy ndarrays, like the example below? I know I could convert the lists into strings, but the act of converting back and forth feels... wrong. Plus, lists seem more legible and convenient given how I got here (online code) and where I'm going afterwards.
import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)
# Traditional find duplicates
# df[df.duplicated(keep=False)]
# Avoiding pandas duplicated function (question 70596016 solution)
i = [hash(tuple(i.values())) for i in df.to_dict(orient="records")]
j = [i.count(k) > 1 for k in i]
df[j]
Both methods (the latter from this alternative find duplicates answer) result in
TypeError: unhashable type: 'list'.
They would work, of course, if the dataframe looked like this:
df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "recipe": [
            "recipeC",
            "recipeC",
            "recipeD",
            "recipeE",
            "recipeF",
        ],
    }
)
Which made me wonder if something like integer encoding might be reasonable? It's not that different from converting to/from strings, but at least it's legible. Alternatively, suggestions for converting to a single string of ingredients per row directly from the starting dataframe in the code link above would be appreciated (i.e., avoiding lists altogether).
With map and tuple (using the ingredients column from your frame):
out = df[df.assign(ingredients=df['ingredients'].map(tuple)).duplicated(keep=False)]
Out[295]:
   author        date                  ingredients
0  Jefe98  1423112400  [ingredA, ingredB, ingredC]
1  Jefe98  1423112400  [ingredA, ingredB, ingredC]
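If you also want to drop the duplicates rather than just flag them, the same tuple trick works; a sketch using the ingredients column from the question's frame:
# make the list column hashable, deduplicate, then restore lists if preferred
deduped = (
    df.assign(ingredients=df["ingredients"].map(tuple))
      .drop_duplicates()
      .assign(ingredients=lambda d: d["ingredients"].map(list))
)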
Sorry in advance for the lengthy question as I wanted to explain it as detailed as possible.
I used the Azure AutoML to train a model and deployed it as a web service. Now I can access (call) it over the REST endpoint.
I have the following data types for attributes: date (timestamp), number, number, number, number, integer.
I trained the model with the following parameters:
Timestamp interval: 15 min
Forecast Horizon: 4 (I need the forecast every hour for the next hour)
Target rolling window size: 96 (the forecast must be based on the last 24 hours of data)
As I understand it, based on the above, I have to provide the last 4 entries to the model for a correct prediction; otherwise, it will consider there to be a time gap. Am I right? In that case, how could I input 4 instances at a time for a single prediction? The following example is wrong, as it asks for a separate prediction for each instance:
import requests
import json
# URL for the web service
scoring_uri = 'http://xxxxx-xxxxxxx-xxxxxx-xxxxxxx.xxxxx.azurecontainer.io/score'
data = {"data":
[
[
2020-10-04 19:30:00,1.29281,1.29334,1.29334,1.29334,1
],
[
2020-10-04 19:45:00,1.29334,1.29294,1.29294,1.29294,1
],
[
2020-10-04 21:00:00,1.29294,1.29217,1.29334,1.29163,34
],
[
2020-10-04 21:15:00,1.29217,1.29257,1.29301,1.29115,195]
]
}
# Convert to JSON string
input_data = json.dumps(data)
# Set the content type
headers = {'Content-Type': 'application/json'}
# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.text)
The above code is based on the provided Microsoft example https://learn.microsoft.com/en-us/azure/machine-learning/how-to-consume-web-service?tabs=python#call-the-service-python.
I am unable to replicate the provided example with my data. I get the error "SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers" pointing to the date. I assume I need to specify the data type but could not find out how.
I would really appreciate any help or direction. Thank you.
This issue was solved by AlphaLu-0572's answer; adding it here as the answer to close the question:
The service takes data in the form of a serialized pandas DataFrame. In the example below, it will look like this:
import json
import pandas as pd

X_test = pd.DataFrame([
    ['2020-10-04 19:30:00', 1.29281, 1.29334, 1.29334, 1.29334, 1],
    ['2020-10-04 19:45:00', 1.29334, 1.29294, 1.29294, 1.29294, 1],
    ['2020-10-04 21:00:00', 1.29294, 1.29217, 1.29334, 1.29163, 34],
    ['2020-10-04 21:15:00', 1.29217, 1.29257, 1.29301, 1.29115, 195]],
    columns=['date', 'number_1', 'number_2', 'number_3', 'number_4', 'integer']
)

test_sample = json.dumps({'data': X_test.to_dict(orient='records')})
test_sample
Which will result in JSON string as:
{"data": [{"date": "2020-10-04 19:30:00", "number_1": 1.29281, "number_2": 1.29334, "number_3": 1.29334, "number_4": 1.29334, "integer": 1}, {"date": "2020-10-04 19:45:00", "number_1": 1.29334, "number_2": 1.29294, "number_3": 1.29294, "number_4": 1.29294, "integer": 1}, {"date": "2020-10-04 21:00:00", "number_1": 1.29294, "number_2": 1.29217, "number_3": 1.29334, "number_4": 1.29163, "integer": 34}, {"date": "2020-10-04 21:15:00", "number_1": 1.29217, "number_2": 1.29257, "number_3": 1.29301, "number_4": 1.29115, "integer": 195}]}
Please rename the columns to the corresponding columns from the training data set.
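With that JSON string in hand, the request from the question can send it as the body (scoring_uri is the endpoint from the question; this is a sketch, not tested against a live service):
import requests

headers = {'Content-Type': 'application/json'}
resp = requests.post(scoring_uri, data=test_sample, headers=headers)
print(resp.text)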
I'm pulling data from an API for a weather system. The API returns a single JSON object with the readings broken up into sub-nodes, one per sensor. I'm trying to associate two (or more) sensors with their timestamps. Unfortunately, not every sensor polls every single time (although they're supposed to).
In effect, I have a JSON object that looks like this:
{
"sensor_data": {
"mbar": [{
"value": 1012,
"timestamp": "2019-10-31T00:15:00"
}, {
"value": 1011,
"timestamp": "2019-10-31T00:30:00"
}, {
"value": 1010,
"timestamp": "2019-10-31T00:45:00"
}],
"temperature": [{
"value": 10.3,
"timestamp": "2019-10-31T00:15:00"
}, {
"value": 10.2,
"timestamp": "2019-10-31T00:30:00"
}, {
"value": 10.0,
"timestamp": "2019-10-31T00:45:00"
}, {
"value": 9.8,
"timestamp": "2019-10-31T01:00:00"
}]
}
}
This example shows I have one extra temperature reading, and this example is a really small one.
How can I take this data and associate a single reading for each timestamp, gathering as much sensor data as I can pull from matching timestamps? Ultimately, I want to export the data into a CSV file, with each row representing a slice in time from the sensor, to be graphed or further analyzed after.
For lists that are exactly the same length, I have a solution:
sensor_id = '007_OHMSS'
sensor_data = read_json('sensor_data.json')  # wrapper function for open and load json
list_a = sensor_data['mbar']
list_b = sensor_data['temperature']
pair_perfect_sensor_lists(sensor_id, list_a, list_b)
def pair_perfect_sensor_lists(sensor_id, list_a, list_b):
    # in this case, list_a will be mbar, list_b will be temperature
    matches = list()
    if len(list_a) == len(list_b):
        for idx, reading in enumerate(list_a):
            mbar_value = reading['value']
            timestamp = reading['timestamp']
            t_reading = list_b[idx]
            t_time = t_reading['timestamp']
            temp_value = t_reading['value']
            print(t_time == timestamp)
            if t_time == timestamp:
                match = {
                    'sensor_id': sensor_id,
                    'mbar_index': idx,
                    'time_index': idx,
                    'mbar_value': mbar_value,
                    'temp_value': temp_value,
                    'mbar_time': timestamp,
                    'temp_time': t_time,
                }
                print('here is your match:')
                print(match)
                matches.append(match)
            else:
                print("IMPERFECT!")
                print(t_time)
                print(timestamp)
        return matches
    return None  # lists differ in length; no pairing attempted
When there's not a match, I want to skip a reading for the missing sensor (in this case, the last mbar reading) and just do an N/A.
In most cases, the offset is just one node - meaning temp has one extra reading, somewhere in the middle.
I was using the idx index to optimize the speed of the process, so I don't have to loop through the second (or third, or nth) dict to see if the timestamp exists in it, but I know that's not preferred either, because dicts aren't ordered. In this case, it appears every sub-node sensor dict is ordered by timestamp, so I was trying to leverage that convenience.
Is this a common problem? If so, just point me to the terminology. But I've searched already and cannot find a reasonable, efficient answer besides "loop through each sub-dict and look for a match".
Open to any ideas, because I'll have to do this often, and on large (25 MB files or larger, sometimes) JSON objects. The full dump is up and over 300 MB, but I've sliced them up by sensor IDs so they're more manageable.
You can use .get to avoid KeyError exceptions and get an output like this:
st = yourjsonabove  # the parsed JSON object from the question

mbar = {}
for item in st['sensor_data']['mbar']:
    mbar[item['timestamp']] = item['value']

temperature = {}
for item in st['sensor_data']['temperature']:
    temperature[item['timestamp']] = item['value']

for timestamp in temperature:
    print("Timestamp:", timestamp, "Sensor Reading:", mbar.get(timestamp), "Temperature Reading:", temperature[timestamp])
leading to output:
Timestamp: 2019-10-31T00:15:00 Sensor Reading: 1012 Temperature Reading: 10.3
Timestamp: 2019-10-31T00:30:00 Sensor Reading: 1011 Temperature Reading: 10.2
Timestamp: 2019-10-31T00:45:00 Sensor Reading: 1010 Temperature Reading: 10.0
Timestamp: 2019-10-31T01:00:00 Sensor Reading: None Temperature Reading: 9.8
Does that help?
You could make dicts with timestamp keys from your sensor readings, like:
mbar = {s['timestamp']:s['value'] for s in sensor_data['mbar']}
temp = {s['timestamp']:s['value'] for s in sensor_data['temperature']}
Now it is easy to compare using the difference of the key sets
mbar.keys() - temp.keys()
temp.keys() - mbar.keys()
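Building on those dicts, here is a sketch that walks the union of timestamps and writes one CSV row per time slice, leaving a blank where a sensor has no reading (the file name and column headers are just examples):
import csv

all_timestamps = sorted(mbar.keys() | temp.keys())  # union of both key sets, in time order
with open("sensor_readings.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["timestamp", "mbar", "temperature"])
    for ts in all_timestamps:
        writer.writerow([ts, mbar.get(ts), temp.get(ts)])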
I have more than 6,000 XML files that I want to parse and save as CSV (or anything else for storage).
I need to perform a JOIN for each XML file to join them into one big dataframe.
The problem is that the process takes very long and uses too much memory.
I am wondering whether SQL could solve the problem - would it be faster with less memory consumption?
import pandas as pd

def get_data(lst):
    results = pd.DataFrame()
    errors = []
    for data in lst:
        try:
            df = parseXML_Annual(data)  # parses one XML file into a dataframe
            try:
                results = results.join(df, how="outer")
            except:
                results = df
        except:
            errors.append(data)
    return results, errors

results, errors = get_data(lst_result)
As I can see from your sample, the entire XML file relates to the same company. To me it sounds like you need to add this as a new row, not join it as a table. In my understanding you want to have some list of metrics for each company. If so, you can probably just stick with key-value storage. If Python is your primary tool, use a dictionary, and then save it as a JSON file.
In your for loop, just fill a blank dictionary with data from the XML, like this:
report = {
    "apple": {
        'metricSet1': {"m11": 5, "m12": 2, "m13": 3},
        'metricSet2': {"m21": 4, "m22": 5, "m23": 6}
    },
    "google": {
        'metricSet1': {"m11": 1, "m12": 13, "m13": 3},
        'metricSet2': {"m21": 9, "m22": 0, "m23": 11}
    },
    "facebook": {
        'metricSet1': {"m11": 1, "m12": 9, "m13": 9},
        'metricSet2': {"m21": 7, "m22": 2, "m23": 4}
    }
}
When you need to query it or fill some table with data, do something like this:
for k in report.keys():
    row = [
        k,
        report[k]["metricSet1"]["m12"],
        report[k]["metricSet2"]["m22"],
        report[k]["metricSet2"]["m23"]
    ]
    print(row)
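To persist the report between runs you can simply dump it to disk, for example (the file name is arbitrary):
import json

with open("report.json", "w") as fh:  # save the accumulated metrics
    json.dump(report, fh)

with open("report.json") as fh:  # load them back later without re-parsing any XML
    report = json.load(fh)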
If the data structure is not changing (say all these XML files are the same), it would make sense to store it in an SQL database, creating a table for each metric set. If the XML structure may vary, then just keep it as a JSON file, or perhaps in some key-value based database, like MongoDB.
I've got JSON data formatted like this.
{
"website": "http://www.zebrawebworks.com/zebra/bluetavern/day.cfm?&year=2018&month=6&day=29",
"date": "2018-06-29",
"headliner": [
"Delta Ringnecks",
"Flathead String Band"
],
"data": [
"4:00 PM",
"FEE: $0",
"Jug Band Music",
"8:00 PM",
"FEE: $5",
"Old Time Fiddle & Banjoby some young turks!"
]
}
I'm working through a bunch of items that look like this in a for concert in data: loop. On dates where there are two concerts like this, I need to create a new Python dictionary so that each concert is in its own dictionary, like so:
{
"website": "http://www.zebrawebworks.com/zebra/bluetavern/day.cfm?&year=2018&month=6&day=29",
"date": "2018-06-29",
"headliner": "Delta Ringnecks",
"data": [
"4:00 PM",
"FEE: $0",
"Jug Band Music",
]
},
{
"website": "http://www.zebrawebworks.com/zebra/bluetavern/day.cfm?&year=2018&month=6&day=29",
"date": "2018-06-29",
"headliner": "Flathead String Band"
"data": [
"8:00 PM",
"FEE: $5",
"Old Time Fiddle & Banjoby some young turks!"
]
}
Is there a recommended way to do this? I can't change the data inside the for loop itself, right? Because then it would screw up my iteration.
Could I append it to the end of data so that the for loop covers the new dictionaries (I would still need to parse some data after everything is separated)?
Or should I maybe create a new dictionary with the split days, delete the two-concerts-in-one-day objects, and then combine the dictionaries I have left?
I hope this is enough info and that I'm not mixing terminology too much. I'm very new to the json Python module and have been struggling with how to approach this problem efficiently. Thank you.
I suggest you create copies of the dict and store the specific data in each one. For example:
result = []
for pos in range(0, len(original_dict['headliner'])):
    new_dict = original_dict.copy()
    new_dict['data'] = original_dict['data'][pos*3:(pos+1)*3]
    new_dict['headliner'] = original_dict['headliner'][pos]
    result.append(new_dict)

print(result)
You can get a pretty clean version of this using the grouper idiom from the itertools documentation:
In [42]: new_list = [{'website': d['website'], 'date': d['date'], 'headliner': headliner, 'data': list(datarow)}
...: for headliner, datarow in zip(d['headliner'], grouper(d['data'], 3))]
...:
In [43]: new_list
Out[43]:
[{'website': 'http://www.zebrawebworks.com/zebra/bluetavern/day.cfm?&year=2018&month=6&day=29',
'date': '2018-06-29',
'headliner': 'Delta Ringnecks',
'data': ['4:00 PM', 'FEE: $0', 'Jug Band Music']},
{'website': 'http://www.zebrawebworks.com/zebra/bluetavern/day.cfm?&year=2018&month=6&day=29',
'date': '2018-06-29',
'headliner': 'Flathead String Band',
'data': ['8:00 PM',
'FEE: $5',
'Old Time Fiddle & Banjoby some young turks!']}]
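Note that grouper is not a builtin; it is the recipe from the itertools documentation, roughly:
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # Collect data into fixed-length chunks: grouper('ABCDEFG', 3, 'x') -> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)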
Here's the solution I came up with, thanks to the help of nosklo above. Hope it helps someone with a similar problem in the future.
new_concerts = []
for concert in blue_data:
    if len(concert['headliner']) == 2:
        new_concert = concert.copy()
        new_concert['headliner'] = str(concert['headliner'][1])
        concert['headliner'] = str(concert['headliner'][0])
        mid = len(concert['data']) // 2  # integer division so the slice indices are ints
        new_concert['data'] = concert['data'][mid:]
        concert['data'] = concert['data'][0:mid]
        new_concerts.append(new_concert)

blue_data = blue_data + new_concerts