Create a dictionary from list of dictionaries selecting specific values - python

I have a list of dictionaries as below and I'd like to create a dictionary to store specific data from the list.
test_list = [
{
'id':1,
'colour':'Red',
'name':'Apple',
'edible': True,
'price':100
},
{
'id':2,
'colour':'Blue',
'name':'Blueberry',
'edible': True,
'price':200
},
{
'id':3,
'colour':'Yellow',
'name':'Crayon',
'edible': False,
'price':300
}
]
For instance, a new dictionary to store just the {id, name, price} of the various items.
I created several lists:
id_list = []
name_list = []
price_list = []
Then I added the data I want to each list:
for n in test_list:
id_list.append(n['id']
name_list.append(n['name']
price_list.append(n['price']
But I can't figure out how to create a dictionary (or a more appropriate structure?) to store the data in the {id, name, price} format I'd like. Appreciate help!

If you don't have too much data, you can use this nested list/dictionary comprehension:
keys = ['id', 'name', 'price']
result = {k: [x[k] for x in test_list] for k in keys}
That'll give you:
{
'id': [1, 2, 3],
'name': ['Apple', 'Blueberry', 'Crayon'],
'price': [100, 200, 300]
}

I think a list of dictionaries is stille the right data format, so this:
test_list = [
{
'id':1,
'colour':'Red',
'name':'Apple',
'edible': True,
'price':100
},
{
'id':2,
'colour':'Blue',
'name':'Blueberry',
'edible': True,
'price':200
},
{
'id':3,
'colour':'Yellow',
'name':'Crayon',
'edible': False,
'price':300
}
]
keys = ['id', 'name', 'price']
limited = [{k: v for k, v in d.items() if k in keys} for d in test_list]
print(limited)
Result:
[{'id': 1, 'name': 'Apple', 'price': 100}, {'id': 2, 'name': 'Blueberry', 'price': 200}, {'id': 3, 'name': 'Crayon', 'price': 300}]
This is nice, because you can access its parts like limited[1]['price'].
However, your use case is perfect for pandas, if you don't mind using a third party library:
import pandas as pd
test_list = [
{
'id':1,
'colour':'Red',
'name':'Apple',
'edible': True,
'price':100
},
{
'id':2,
'colour':'Blue',
'name':'Blueberry',
'edible': True,
'price':200
},
{
'id':3,
'colour':'Yellow',
'name':'Crayon',
'edible': False,
'price':300
}
]
df = pd.DataFrame(test_list)
print(df['price'][1])
print(df)
The DataFrame is perfect for this stuff and selecting just the columns you need:
keys = ['id', 'name', 'price']
df_limited = df[keys]
print(df_limited)
The reason I'd prefer either to a dictionary of lists is that manipulating the dictionary of lists will get complicated and error prone and accessing a single record means accessing three separate lists - there's not a lot of advantages to that approach except maybe that some operations on lists will be faster, if you access a single attribute more often. But in that case, pandas wins handily.
In the comments you asked "Let's say I had item_names = ['Apple', 'Teddy', 'Crayon'] and I wanted to check if one of those item names was in the df_limited variable or I guess the df_limited['name'] - is there a way to do that, and if it is then print say the price, or manipulate the price?"
There's many ways of course, I recommend looking into some online pandas tutorials, because it's a very popular library and there's excellent documentation and teaching materials online.
However, just to show how easy it would be in both cases, retrieving the matching objects or just the prices for them:
item_names = ['Apple', 'Teddy', 'Crayon']
items = [d for d in test_list if d['name'] in item_names]
print(items)
item_prices = [d['price'] for d in test_list if d['name'] in item_names]
print(item_prices)
items = df[df['name'].isin(item_names)]
print(items)
item_prices = df[df['name'].isin(item_names)]['price']
print(item_prices)
Results:
[{'id': 1, 'colour': 'Red', 'name': 'Apple', 'edible': True, 'price': 100}, {'id': 3, 'colour': 'Yellow', 'name': 'Crayon', 'edible': False, 'price': 300}]
[100, 300]
id name price
0 1 Apple 100
2 3 Crayon 300
0 100
2 300
In the example with the dataframe there's a few things to note. They are using .isin() since using in won't work in the fancy way dataframes allow you to select data df[<some condition on df using df>], but there's fast and easy to use alternatives for all standard operations in pandas. More importantly, you can just do the work on the original df - it already has everything you need in there.
And let's say you wanted to double the prices for these products:
df.loc[df['name'].isin(item_names), 'price'] *= 2
This uses .loc for technical reasons (you can't modify just any view of a dataframe), but that's way too much to get into in this answer - you'll learn looking into pandas. It's pretty clean and simple though, I'm sure you agree. (you could use .loc for the previous example as well)
In this trivial example, both run instantly, but you'll find that pandas performs better for very large datasets. Also, try writing the same examples using the method you requested (as provided in the accepted answer) and you'll find that it's not as elegant, unless you start by zipping everything together again:
item_prices = [p for i, n, p in zip(result.values()) if n in item_names]
Getting out a result that has the same structure as result is way more trickier with more zipping and unpacking involved, or requires you to go over the lists twice.

Related

Updating dictionaries based on huge list of dicts

I have a huge(around 350k elements) list of dictionaries:
lst = [
{'data': 'xxx', 'id': 1456},
{'data': 'yyy', 'id': 24234},
{'data': 'zzz', 'id': 3222},
{'data': 'foo', 'id': 1789},
]
On the other hand I receive dictionaries(around 550k) one by one with missing value(not every dict is missing this value) which I need to update from:
example_dict = {'key': 'x', 'key2': 'y', 'id': 1456, 'data': None}
To:
example_dict = {'key': 'x', 'key2': 'y', 'id': 1456, 'data': 'xxx'}
And I need to take each dict and search withing the list for matching 'id' and update the 'data'. Doing it this way takes ages to process:
if example_dict['data'] is None:
for row in lst:
if row['id'] == example_dict['id']:
example_dict['data'] = row['data']
Is there a way to build a structured chunked data divided to in e.g. 10k values and tell the incoming dict in which chunk to search for the 'id'? Or any other way to to optimize that? Any help is appreciated, take care.
Use a dict instead of searching linearly through the list.
The first important optimization is to remove that linear search through lst, by building a dict indexed on id pointing to the rows
For example, this will be a lot faster than your code, if you have enough RAM to hold all the rows in memory:
row_dict = {row['id']: row for row in lst}
if example_dict['data'] is None:
if example_dict['id'] in row_dict:
example_dict['data'] = row_dict[example_dict['id']]['data']
This improvement will be relevant for you whether you process the rows by chunks of 10k or all at once, since dictionary lookup time is constant instead of linear in the size of lst.
Make your own chunking process
Next you ask "Is there a way to build a structured chunked data divided...". Yes, absolutely. If the data is too big to fit in memory, write a first pass function that divides the input based on id into several temporary files. They could be based on the last two digits of ID if order is irrelevant, or on ranges of ids if you prefer. Do that for both the list of rows, and the dictionaries you receive, and then process each list/dict file pairs on the same ids one at a time, with code like above.
If you have to preserve the order in which you receive the dictionaries, though, this approach will be more difficult to implement.
Some preprocessing of lst list might help a lot. E.g. transform that list of dicts into dictionary, where id would be a key.
To be precise transform lst into such structure:
lst = {
'1456': 'xxx',
'24234': 'yyy',
'3222': 'zzz',
...
}
Then when trying to check your data attributes in example_dict, just access straight to id key in lst as follows:
if example_dict['data'] is None:
example_dict['data'] = lst.get(example_dict['id'])
It should reduce the time complexity from something as quadratic complexity (n*n) to linear complexity (n).
Try creating creating a hash table (in Python, a dict) from lst to speed up the lookup based on 'id':
lst = [
{'data': 'xxx', 'id': 1456},
{'data': 'yyy', 'id': 24234},
{'data': 'zzz', 'id': 3222},
{'data': 'foo', 'id': 1789},
]
example_dict = {'key': 'x', 'key2': 'y', 'id': 1456, 'data': None}
dct ={row['id'] : row for row in lst}
if example_dict['data'] is None:
example_dict['data'] = dct[example_dict['id']]['data']
print(example_dict)
Sample output:
{'key': 'x', 'key2': 'y', 'id': 1456, 'data': 'xxx'}

Fill pandas dataframe within a for loop

I am working with Amazon Rekognition to do some image analysis.
With a symple Python script, I get - at every iteration - a response of this type:
(example for the image of a cat)
{'Labels':
[{'Name': 'Pet', 'Confidence': 96.146484375, 'Instances': [],
'Parents': [{'Name': 'Animal'}]}, {'Name': 'Mammal', 'Confidence': 96.146484375,
'Instances': [], 'Parents': [{'Name': 'Animal'}]},
{'Name': 'Cat', 'Confidence': 96.146484375.....
I got all the attributes I need in a list, that looks like this:
[Pet, Mammal, Cat, Animal, Manx, Abyssinian, Furniture, Kitten, Couch]
Now, I would like to create a dataframe where the elements in the list above appear as columns and the rows take values 0 or 1.
I created a dictionary in which I add the elements in the list, so I get {'Cat': 1}, then I go to add it to the dataframe and I get the following error:
TypeError: Index(...) must be called with a collection of some kind, 'Cat' was passed.
Not only that, but I don't even seem able to add to the same dataframe the information from different images. For example, if I only insert the data in the dataframe (as rows, not columns), I get a series with n rows with the n elements (identified by Amazon Rekognition) of only the last image, i.e. I start from an empty dataframe at each iteration.
The result I would like to get is something like:
Image Human Animal Flowers etc...
Pic1 1 0 0
Pic2 0 0 1
Pic3 1 1 0
For reference, this is the code I am using now (I should add that I am working on a software called KNIME, but this is just Python):
from pandas import DataFrame
import pandas as pd
import boto3
fileName=flow_variables['Path_Arr[1]'] #This is just to tell Amazon the name of the image
bucket= 'mybucket'
client=boto3.client('rekognition', region_name = 'us-east-2')
response = client.detect_labels(Image={'S3Object':
{'Bucket':bucket,'Name':fileName}})
data = [str(response)] # This is what I inserted in the first cell of this question
d= {}
for key, value in response.items():
for el in value:
if isinstance(el,dict):
for k, v in el.items():
if k == "Name":
d[v] = 1
print(d)
df = pd.DataFrame(d, ignore_index=True)
print(df)
output_table = df
I am definitely getting it all wrong both in the for loop and when adding things to my dataframe, but nothing really seems to work!
Sorry for the super long question, hope it was clear! Any ideas?
I do not know if this answers your question completely, because i do not know, what you data can look like, but it's a good step that should help you, i think. I added the same data multiple time, but the way should be clear.
import pandas as pd
response = {'Labels': [{'Name': 'Pet', 'Confidence': 96.146484375, 'Instances': [], 'Parents': [{'Name': 'Animal'}]},
{'Name': 'Cat', 'Confidence': 96.146484375, 'Instances': [{'BoundingBox':
{'Width': 0.6686800122261047,
'Height': 0.9005332589149475,
'Left': 0.27255237102508545,
'Top': 0.03728689253330231},
'Confidence': 96.146484375}],
'Parents': [{'Name': 'Pet'}]
}]}
def handle_new_data(repsonse_data: dict, image_name: str) -> pd.DataFrame:
d = {"Image": image_name}
result = pd.DataFrame()
for key, value in repsonse_data.items():
for el in value:
if isinstance(el, dict):
for k, v in el.items():
if k == "Name":
d[v] = 1
result = result.append(d, ignore_index=True)
return result
df_all = pd.DataFrame()
df_all = df_all.append(handle_new_data(response, "image1"))
df_all = df_all.append(handle_new_data(response, "image2"))
df_all = df_all.append(handle_new_data(response, "image3"))
df_all = df_all.append(handle_new_data(response, "image4"))
df_all.reset_index(inplace=True)
print(df_all)

Pandas json_normalize and JSON flattening error

A panda newbie here that's struggling to understand why I'm unable to completely flatten a JSON I receive from an API. I need a Dataframe with all the data that is returned by the API, however I need all nested data to be expanded and given it's own columns for me to be able to use it.
The JSON I receive is as follows:
[
{
"query":{
"id":"1596487766859-3594dfce3973bc19",
"name":"test"
},
"webPage":{
"inLanguages":[
{
"code":"en"
}
]
},
"product":{
"name":"Test",
"description":"Test2",
"mainImage":"image1.jpg",
"images":[
"image2.jpg",
"image3.jpg"
],
"offers":[
{
"price":"45.0",
"currency":"€"
}
],
"probability":0.9552192
}
}
]
Running pd.json_normalize(data) without any additional parameters shows the nested values price and currency in the product.offers column. When I try to separate these out into their own columns with the following:
pd.json_normalize(data,record_path=['product',meta['product',['offers']]])
I end up with the following error:
f"{js} has non list value {result} for path {spec}. "
Any help would be much appreciated.
I've used this technique a few times
do initial pd.json_normalize() to discover the columns
build meta parameter by inspecting this and the original JSON. NB possible index out of range here
you can only request one list drives record_path param
a few tricks product/images is a list so it gets named 0. rename it
did a Cartesian product to merge two different data frames from breaking down lists. It's not so stable
data = [{'query': {'id': '1596487766859-3594dfce3973bc19', 'name': 'test'},
'webPage': {'inLanguages': [{'code': 'en'}]},
'product': {'name': 'Test',
'description': 'Test2',
'mainImage': 'image1.jpg',
'images': ['image2.jpg', 'image3.jpg'],
'offers': [{'price': '45.0', 'currency': '€'}],
'probability': 0.9552192}}]
# build default to get column names
df = pd.json_normalize(data)
# from column names build the list that gets sent to meta param
mymeta = [[s for s in c.split(".")] for c in df.columns ]
# exclude lists from meta - this will fail
mymeta = [l for l in mymeta if not isinstance(data[0][l[0]][l[1]], list)]
# you can build df from either of the product lists NOT both
df1 = pd.json_normalize(data, record_path=[["product","offers"]], meta=mymeta)
df2 = pd.json_normalize(data, record_path=[["product","images"]], meta=mymeta).rename(columns={0:"image"})
# want them together - you can merge them. note columns heavily overlap so remove most columns from df2
df1.assign(foo=1).merge(
df2.assign(foo=1).drop(columns=[c for c in df2.columns if c!="image"]), on="foo").drop(columns="foo")

How to extract the value of a given key based on the value of another key, from a list of nested (not-always-existing) dictionaries [Python]

I have a list of dictionaries called api_data, where each dictionary has this structure:
{
'location':
{
'indoor': 0,
'exact_location': 0,
'latitude': '45.502',
'altitude': '133.9',
'id': 12780,
'country': 'IT',
'longitude': '9.146'
},
'sampling_rate': None,
'id': 91976363,
'sensordatavalues':
[
{
'value_type': 'P1',
'value': '8.85',
'id': 197572463
},
{
'value_type': 'P2',
'value': '3.95',
'id': 197572466
}
{
'value_type': 'temperature',
'value': '20.80',
'id': 197572625
},
{
'value_type': 'humidity',
'value': '97.70',
'id': 197572626
}
],
'sensor':
{
'id': 24645,
'sensor_type':
{
'name': 'DHT22',
'id': 9,
'manufacturer':
'various'
},
'pin': '7'
},
'timestamp': '2020-04-18 18:37:50'
},
This structure is not complete for each dictionary, meaning that sometimes a dictionary, a list element or a key is missing.
I want to extract the value of a key when the key value of the same dictionary is equal to a certain value.
For example, for dictionary sensordatavalues, I want the value of the key 'value' when 'value_type' is equal to 'P1'.
I have developed this code working with for and if cycles, but I bet it is heavily inefficient.
How can I do it in a quicker and more efficient way?
Please note that sensordatavalues always exists
for sensor in api_data:
sensordatavalues = sensor['sensordatavalues']
# L_sdv = len(sensordatavalues)
for physical_quantity_recorded in sensordatavalues:
if physical_quantity_recorded['value_type'] == 'P1':
PM10_value = physical_quantity_recorded['value']
If you are confident that the value 'P1' is unique to the key you are searching, you can use the 'in' operator with dict.values()
Should be ok to omit this assignment: sensordatavalues = sensor['sensordatavalues']
for sensor in api_data:
for physical_quantity_recorded in sensor['sensordatavalues']:
if 'P1' in physical_quantity_recorded.values():
PM10_value = physical_quantity_recorded['value']
You just need one for loop:
for x in api_data["sensordatavalues"]:
if x["value_type"] == "P1":
print(x["value"])
Output:
8.85
Use dictionary.get() method if the key not exist it will return default value
for physical_quantity_recorded in api_data['sensordatavalues']:
if physical_quantity_recorded.get('value_type', 'default_value') == 'P1':
PM10_value = physical_quantity_recorded.get('value', 'default_value')
this is an alternative: jmespath - allows you to search and filter a nested dict/json :
summary of jmespath ... to access a key, use the . notation, if ur values are in a list, u access it via the [] notation
NB: dict is wrapped in a data variable
import jmespath
#sensordatavalues is a key, so we can access it directly
#the values of sensordatavalues are wrapped in a list
#to access it we pass the bracket(```[]```)
#we are interested in the dict where value_type is P1
#in jmespath, we identify that using the ? mark to precede the filter object
#pass the filter
#and finally access the key we are interested in ... value
expression = jmespath.compile('sensordatavalues[?value_type==`P1`].value')
expression.search(data)
['8.85']

How to extract specific values from a list of dictionaries in python

I have a list of dictionaries like shown below and i would like to extract the partID and the corresponding quantity for a specific orderID using python, but i don't know how to do it.
dataList = [{'orderID': 'D00001', 'customerID': 'C00001', 'partID': 'P00001', 'quantity': 2},
{'orderID': 'D00002', 'customerID': 'C00002', 'partID': 'P00002', 'quantity': 1},
{'orderID': 'D00003', 'customerID': 'C00003', 'partID': 'P00001', 'quantity': 1},
{'orderID': 'D00004', 'customerID': 'C00004', 'partID': 'P00003', 'quantity': 3}]
So for example, when i search my dataList for a specific orderID == 'D00003', i would like to receive both the partID ('P00001'), as well as the corresponding quantity (1) of the specified order. How would you go about this? Any help is much appreciated.
It depends.
You are not going to do that a lot of time, you can just iterate over the list of dictionaries until you find the "correct" one:
search_for_order_id = 'D00001'
for d in dataList:
if d['orderID'] == search_for_order_id:
print(d['partID'], d['quantity'])
break # assuming orderID is unique
Outputs
P00001 2
Since this solution is O(n), if you are going to do this search a lot of times it will add up.
In that case it will be better to transform the data to a dictionary of dictionaries, with orderID being the outer key (again, assuming orderID is unique):
better = {d['orderID']: d for d in dataList}
This is also O(n) but you pay it only once. Any subsequent lookup is an O(1) dictionary lookup:
search_for_order_id = 'D00001'
print(better[search_for_order_id]['partID'], better[search_for_order_id]['quantity'])
Also outputs
P00001 2
I believe you would like to familiarize yourself with the pandas package, which is very useful for data analysis. If these are the kind of problems you're up against, I advise you to take the time and take a tutorial in pandas. It can do a lot, and is very popular.
Your dataList is very similar to a DataFrame structure, so what you're looking for would be as simple as:
import pandas as pd
df = pd.DataFrame(dataList)
df[df['orderID']=='D00003']
You can use this:
results = [[x['orderID'], x['partID'], x['quantity']] for x in dataList]
for i in results:
print(i)
Also,
results = [['Order ID: ' + x['orderID'], 'Part ID: ' + x['partID'],'Quantity:
' + str(x['quantity'])] for x in dataList]
To get the partID you can make use of the filter function.
myData = [{"x": 1, "y": 1}, {"x": 2, "y": 5}]
filtered = filter(lambda item: item["x"] == 1) # Search for an object with x equal to 1
# Get the next item from the filter (the matching item) and get the y property.
print(next(filtered)["y"])
You should be able to apply this to your situation.

Categories