Fill pandas dataframe within a for loop - python

I am working with Amazon Rekognition to do some image analysis.
With a simple Python script, I get - at every iteration - a response of this type:
(example for the image of a cat)
{'Labels':
[{'Name': 'Pet', 'Confidence': 96.146484375, 'Instances': [],
'Parents': [{'Name': 'Animal'}]}, {'Name': 'Mammal', 'Confidence': 96.146484375,
'Instances': [], 'Parents': [{'Name': 'Animal'}]},
{'Name': 'Cat', 'Confidence': 96.146484375.....
I got all the attributes I need in a list, that looks like this:
[Pet, Mammal, Cat, Animal, Manx, Abyssinian, Furniture, Kitten, Couch]
Now, I would like to create a dataframe where the elements in the list above appear as columns and the rows take values 0 or 1.
I created a dictionary in which I add the elements in the list, so I get {'Cat': 1}, then I go to add it to the dataframe and I get the following error:
TypeError: Index(...) must be called with a collection of some kind, 'Cat' was passed.
Not only that, but I don't even seem able to add to the same dataframe the information from different images. For example, if I only insert the data in the dataframe (as rows, not columns), I get a series with n rows with the n elements (identified by Amazon Rekognition) of only the last image, i.e. I start from an empty dataframe at each iteration.
The result I would like to get is something like:
Image  Human  Animal  Flowers  etc...
Pic1   1      0       0
Pic2   0      0       1
Pic3   1      1       0
For reference, this is the code I am using now (I should add that I am working on a software called KNIME, but this is just Python):
from pandas import DataFrame
import pandas as pd
import boto3
fileName = flow_variables['Path_Arr[1]']  # This is just to tell Amazon the name of the image
bucket = 'mybucket'
client = boto3.client('rekognition', region_name='us-east-2')
response = client.detect_labels(Image={'S3Object':
    {'Bucket': bucket, 'Name': fileName}})
data = [str(response)]  # This is what I inserted in the first cell of this question
d = {}
for key, value in response.items():
    for el in value:
        if isinstance(el, dict):
            for k, v in el.items():
                if k == "Name":
                    d[v] = 1
print(d)
df = pd.DataFrame(d, ignore_index=True)
print(df)
output_table = df
I am definitely getting it all wrong both in the for loop and when adding things to my dataframe, but nothing really seems to work!
Sorry for the super long question, hope it was clear! Any ideas?

I do not know if this answers your question completely, because I do not know what your data can look like, but I think it is a good step that should help you. I added the same data multiple times, but the approach should be clear.
import pandas as pd
response = {'Labels': [{'Name': 'Pet', 'Confidence': 96.146484375, 'Instances': [], 'Parents': [{'Name': 'Animal'}]},
{'Name': 'Cat', 'Confidence': 96.146484375, 'Instances': [{'BoundingBox':
{'Width': 0.6686800122261047,
'Height': 0.9005332589149475,
'Left': 0.27255237102508545,
'Top': 0.03728689253330231},
'Confidence': 96.146484375}],
'Parents': [{'Name': 'Pet'}]
}]}
def handle_new_data(response_data: dict, image_name: str) -> pd.DataFrame:
    # Build a one-row frame: the image name plus one column per label name, set to 1.
    d = {"Image": image_name}
    for key, value in response_data.items():
        for el in value:
            if isinstance(el, dict):
                for k, v in el.items():
                    if k == "Name":
                        d[v] = 1
    return pd.DataFrame([d])
# DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate them once.
frames = [handle_new_data(response, name) for name in ("image1", "image2", "image3", "image4")]
df_all = pd.concat(frames, ignore_index=True)
print(df_all)
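To get the 0/1 table from the question when different images return different labels, one option is to collect one dictionary per image and let pandas fill in the gaps. This is only a sketch: the `responses` dictionary and the `label_names` helper are made up for illustration, and the responses are trimmed to the `Labels` key.

```python
import pandas as pd

def label_names(response: dict) -> set:
    """Collect every label name in a response, including parent names (hypothetical helper)."""
    names = set()
    for label in response.get("Labels", []):
        names.add(label["Name"])
        names.update(parent["Name"] for parent in label.get("Parents", []))
    return names

# Hypothetical, trimmed responses keyed by image name.
responses = {
    "Pic1": {"Labels": [{"Name": "Cat", "Confidence": 96.1,
                         "Parents": [{"Name": "Pet"}, {"Name": "Animal"}]}]},
    "Pic2": {"Labels": [{"Name": "Flower", "Confidence": 99.0,
                         "Parents": [{"Name": "Plant"}]}]},
}

# One dict per image; labels missing for an image become NaN, then 0.
rows = [{"Image": image, **{name: 1 for name in sorted(label_names(resp))}}
        for image, resp in responses.items()]
table = pd.DataFrame(rows).fillna(0).set_index("Image").astype(int)
print(table)
```

Building the list of rows first and creating the dataframe once also avoids starting from an empty dataframe on every iteration.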

Related

How to normalize a complex json format in a pandas data frame that is a list of dictionaries

I have a pandas data frame that has one column like this in json format. I am not able to understand how to extract this.
df['completionDetails'][0] gives:
[{'name': 'start', 'time': 1654098788177},
{'name': 'arrival',
'time': 1654099038368,
'location': [-74.2713929, 40.5017297]},
{'name': 'departure',
'time': 1654098843357,
'location': [-74.2802414, 40.5095964]}]
I have tried:
dict_df = pd.DataFrame([ast.literal_eval(i) for i in df['completionDetails'].values])
But it gives me an error. What method can I use for this?
Expected Output:
start_time     arrival_time   arrival_location           departure_time  departure_location
1654098788177  1654099038368  [-74.2713929, 40.5017297]  1654098843357   [-74.2802414, 40.5095964]
IIUC each cell of the completionDetails column is a list of dictionaries.
You can make a dataframe out of each cell and concatenate the dfs:
dict_df = pd.concat([pd.DataFrame(i) for i in df['completionDetails'].values])
Edit:
Following your own edit, this is how you'd get the desired output:
dict_df = pd.concat([pd.DataFrame({f"{x['name']}_{k}": [v]
for x in i for k,v in x.items() if k!='name'}
) for i in df['completionDetails'].values if isinstance(i, list)])
As you can see we're building key names from the name key and other keys to create new dictionaries that will be used to create dataframes (that in turn will be concatenated to each other)
Output:
   start_time     arrival_time   arrival_location           departure_time  departure_location
0  1654098788177  1654099038368  [-74.2713929, 40.5017297]  1654098843357   [-74.2802414, 40.5095964]
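For anyone who wants to try this end to end, here is a small self-contained sketch of the same key-building idea. It assumes each cell is already a Python list of dicts (not a JSON string); the `flatten_events` helper is a name I made up, and it builds one dict per row instead of one dataframe per row.

```python
import pandas as pd

# Sample frame with one cell, copied from the question.
df = pd.DataFrame({"completionDetails": [[
    {"name": "start", "time": 1654098788177},
    {"name": "arrival", "time": 1654099038368,
     "location": [-74.2713929, 40.5017297]},
    {"name": "departure", "time": 1654098843357,
     "location": [-74.2802414, 40.5095964]},
]]})

def flatten_events(events: list) -> dict:
    """Turn [{'name': 'start', 'time': t}, ...] into {'start_time': t, ...} (hypothetical helper)."""
    return {f"{e['name']}_{k}": v
            for e in events
            for k, v in e.items() if k != "name"}

# Skip cells that are not lists (e.g. NaN), as the answer does.
rows = [flatten_events(cell) for cell in df["completionDetails"]
        if isinstance(cell, list)]
flat = pd.DataFrame(rows)
print(flat)
```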

Create a dictionary from list of dictionaries selecting specific values

I have a list of dictionaries as below and I'd like to create a dictionary to store specific data from the list.
test_list = [
{
'id':1,
'colour':'Red',
'name':'Apple',
'edible': True,
'price':100
},
{
'id':2,
'colour':'Blue',
'name':'Blueberry',
'edible': True,
'price':200
},
{
'id':3,
'colour':'Yellow',
'name':'Crayon',
'edible': False,
'price':300
}
]
For instance, a new dictionary to store just the {id, name, price} of the various items.
I created several lists:
id_list = []
name_list = []
price_list = []
Then I added the data I want to each list:
for n in test_list:
    id_list.append(n['id'])
    name_list.append(n['name'])
    price_list.append(n['price'])
But I can't figure out how to create a dictionary (or a more appropriate structure?) to store the data in the {id, name, price} format I'd like. Appreciate help!
If you don't have too much data, you can use this nested list/dictionary comprehension:
keys = ['id', 'name', 'price']
result = {k: [x[k] for x in test_list] for k in keys}
That'll give you:
{
'id': [1, 2, 3],
'name': ['Apple', 'Blueberry', 'Crayon'],
'price': [100, 200, 300]
}
I think a list of dictionaries is still the right data format, so this:
test_list = [
{
'id':1,
'colour':'Red',
'name':'Apple',
'edible': True,
'price':100
},
{
'id':2,
'colour':'Blue',
'name':'Blueberry',
'edible': True,
'price':200
},
{
'id':3,
'colour':'Yellow',
'name':'Crayon',
'edible': False,
'price':300
}
]
keys = ['id', 'name', 'price']
limited = [{k: v for k, v in d.items() if k in keys} for d in test_list]
print(limited)
Result:
[{'id': 1, 'name': 'Apple', 'price': 100}, {'id': 2, 'name': 'Blueberry', 'price': 200}, {'id': 3, 'name': 'Crayon', 'price': 300}]
This is nice, because you can access its parts like limited[1]['price'].
However, your use case is perfect for pandas, if you don't mind using a third party library:
import pandas as pd
test_list = [
{
'id':1,
'colour':'Red',
'name':'Apple',
'edible': True,
'price':100
},
{
'id':2,
'colour':'Blue',
'name':'Blueberry',
'edible': True,
'price':200
},
{
'id':3,
'colour':'Yellow',
'name':'Crayon',
'edible': False,
'price':300
}
]
df = pd.DataFrame(test_list)
print(df['price'][1])
print(df)
The DataFrame is perfect for this stuff and selecting just the columns you need:
keys = ['id', 'name', 'price']
df_limited = df[keys]
print(df_limited)
The reason I'd prefer either to a dictionary of lists is that manipulating the dictionary of lists will get complicated and error prone and accessing a single record means accessing three separate lists - there's not a lot of advantages to that approach except maybe that some operations on lists will be faster, if you access a single attribute more often. But in that case, pandas wins handily.
In the comments you asked "Let's say I had item_names = ['Apple', 'Teddy', 'Crayon'] and I wanted to check if one of those item names was in the df_limited variable or I guess the df_limited['name'] - is there a way to do that, and if it is then print say the price, or manipulate the price?"
There are many ways, of course. I recommend looking into some online pandas tutorials, because it's a very popular library and there's excellent documentation and teaching material online.
However, just to show how easy it would be in both cases, retrieving the matching objects or just the prices for them:
item_names = ['Apple', 'Teddy', 'Crayon']
items = [d for d in test_list if d['name'] in item_names]
print(items)
item_prices = [d['price'] for d in test_list if d['name'] in item_names]
print(item_prices)
items = df[df['name'].isin(item_names)]
print(items)
item_prices = df[df['name'].isin(item_names)]['price']
print(item_prices)
Results:
[{'id': 1, 'colour': 'Red', 'name': 'Apple', 'edible': True, 'price': 100}, {'id': 3, 'colour': 'Yellow', 'name': 'Crayon', 'edible': False, 'price': 300}]
[100, 300]
id name price
0 1 Apple 100
2 3 Crayon 300
0 100
2 300
In the example with the dataframe there are a few things to note. It uses .isin() because the plain in operator won't work inside the selection syntax dataframes offer, df[<some condition on df using df>], but pandas has fast and easy-to-use alternatives for all standard operations. More importantly, you can just do the work on the original df - it already has everything you need in there.
And let's say you wanted to double the prices for these products:
df.loc[df['name'].isin(item_names), 'price'] *= 2
This uses .loc for technical reasons (you can't modify just any view of a dataframe), but that's way too much to get into in this answer - you'll learn looking into pandas. It's pretty clean and simple though, I'm sure you agree. (you could use .loc for the previous example as well)
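As a quick check of what that line does, here is a self-contained sketch using only the three columns that matter here; the `mask` variable is just a name for the boolean Series that `.isin()` returns.

```python
import pandas as pd

df = pd.DataFrame([
    {'id': 1, 'name': 'Apple', 'price': 100},
    {'id': 2, 'name': 'Blueberry', 'price': 200},
    {'id': 3, 'name': 'Crayon', 'price': 300},
])
item_names = ['Apple', 'Teddy', 'Crayon']

# Boolean Series: True for rows whose name is in item_names.
mask = df['name'].isin(item_names)

# .loc[rows, column] selects the matching cells in place, so the
# assignment updates df itself rather than a temporary copy.
df.loc[mask, 'price'] *= 2
print(df)
```

Only the Apple and Crayon rows change; 'Teddy' matches nothing and is silently ignored.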
In this trivial example, both run instantly, but you'll find that pandas performs better for very large datasets. Also, try writing the same examples using the method you requested (as provided in the accepted answer) and you'll find that it's not as elegant, unless you start by zipping everything together again:
item_prices = [p for i, n, p in zip(*result.values()) if n in item_names]
Getting out a result that has the same structure as result is much trickier, with more zipping and unpacking involved, or requires you to go over the lists twice.
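To illustrate that extra zipping and unpacking, here is a rough sketch that filters a dictionary of lists and rebuilds the same structure. It assumes `result` has the shape shown in the accepted answer; `kept` and `filtered` are names chosen for this example.

```python
result = {
    'id': [1, 2, 3],
    'name': ['Apple', 'Blueberry', 'Crayon'],
    'price': [100, 200, 300],
}
item_names = ['Apple', 'Teddy', 'Crayon']

keys = list(result)
name_pos = keys.index('name')

# zip(*...) turns the three parallel lists back into per-item tuples.
kept = [row for row in zip(*result.values()) if row[name_pos] in item_names]

# A second zip(*...) transposes the kept tuples back into columns.
# Note: if nothing matches, this yields an empty dict rather than empty lists.
filtered = {k: list(col) for k, col in zip(keys, zip(*kept))}
print(filtered)
```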

Python list of dictionaries aggregate values

Here is an example input:
[{'name':'susan', 'wins': 1, 'team': 'team1'},
{'name':'jack', 'wins':1, 'team':'team2'},
{'name':'susan', 'wins':1, 'team':'team1'}]
Desired output
[{'name':'susan', 'wins':2, 'team': 'team1'},
{'name':'jack', 'wins':1, 'team':'team2'}]
I have lots of these dictionaries and want to sum only the 'wins' value, grouped by the 'name' value, while keeping the 'team' values.
I've tried to use Counter, but the result was
{'name':'all the names added together',
'wins': 'all the wins added together'
}
I was able to use defaultdict which seemed to work
from collections import defaultdict
result = defaultdict(int)
for d in data:
    result[d['name']] += d['wins']
but the result was something like
{'susan': 2, 'jack':1}
Here it added the values correctly but didn't keep the 'team' key
I guess I'm confused about defaultdict and how it works.
Any help is much appreciated.
Did you consider using pandas?
import pandas as pd
dicts = [
{'name':'susan', 'wins': 1, 'team': 'team1'},
{'name':'jack', 'wins':1, 'team':'team2'},
{'name':'susan', 'wins':1, 'team':'team1'},
]
agg_by = ["name", "team"]
df = pd.DataFrame(dicts)
df = df.groupby(agg_by)["wins"].sum()
df = df.reset_index()
aggregated_dict = df.to_dict("records")
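If you would rather stay with defaultdict, a possible fix for the attempt in the question is to key the totals on both fields you want to keep. This is a sketch that assumes a given name always belongs to the same team; `totals` and `aggregated` are names picked for this example.

```python
from collections import defaultdict

data = [
    {'name': 'susan', 'wins': 1, 'team': 'team1'},
    {'name': 'jack', 'wins': 1, 'team': 'team2'},
    {'name': 'susan', 'wins': 1, 'team': 'team1'},
]

# defaultdict(int) creates a 0 for any key seen for the first time,
# so += works without checking whether the key already exists.
totals = defaultdict(int)
for d in data:
    totals[(d['name'], d['team'])] += d['wins']

# Unpack the (name, team) keys back into dictionaries.
aggregated = [{'name': name, 'wins': wins, 'team': team}
              for (name, team), wins in totals.items()]
print(aggregated)
```

The 'team' value survives because it is part of the key, whereas keying on the name alone throws it away.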

How to normalize the JSON below using pandas in Django

Using the view.py query below, my output looks something like this. As you can see, the choices field contains multiple arrays, and I would like to normalize them so that each choice is listed in sequence. Here is my JSON:
{"pages":[{"name":"page1","title":"SurveyWindow Pvt. Ltd. Customer Feedback","description":"Question marked * are compulsory.",
"elements":[{"type":"radiogroup","name":"question1","title":"Do you like our product? *","isRequired":true,
"choices":[{"value":"Yes","text":"Yes"},{"value":"No","text":"No"}]},{"type":"checkbox","name":"question2","title":"Please Rate Our PM Skill","isRequired":false,"choices":[{"value":"High","text":"High"},{"value":"Low","text":"Low"},{"value":"Medium","text":"Medium"}]},{"type":"radiogroup","name":"question3","title":"Do you like our services? *","isRequired":true,"choices":[{"value":"Yes","text":"Yes"},{"value":"No","text":"No"}]}]}]}
This is my view.py:
jsondata=SurveyMaster.objects.all().filter(survey_id='1H2711202014572740')
q = jsondata.values('survey_json_design')
qs_json = pd.DataFrame.from_records(q)
datatotable = pd.json_normalize(qs_json['survey_json_design'], record_path=['pages','elements'])
qs_json = datatotable.to_html()
Based on your comments and picture here's what I would do to go from the picture to something more SQL-friendly (what you refer to as "normalization"), but keep in mind this might blow up if you don't have sufficient memory.
Create a new list which you'll fill with the new data, then iterate over the pandas table's rows, and then over every item in your list. For every iteration in the inner loop use the data from the row (minus the column you're iterating over). For convenience I added it as the last element.
# Example data
import pandas as pd
df = pd.DataFrame({"choices": [[{"text": "yes", "value": "yes"},
                                {"text": "no", "value": "no"}],
                               [{"ch1": 1, "ch2": 2}, {"ch3": "ch3"}]],
                   "name": ["kostas", "rajesh"]})
data = []
for i, row in df.iterrows():
    # Emit one output row per choice, repeating the other columns of the source row.
    for val in row["choices"]:
        data.append((*row.drop("choices").values, val))
df = pd.DataFrame(data, columns=["names", "choices"])
print(df)
    names                          choices
0  kostas  {'text': 'yes', 'value': 'yes'}
1  kostas    {'text': 'no', 'value': 'no'}
2  rajesh             {'ch1': 1, 'ch2': 2}
3  rajesh                   {'ch3': 'ch3'}
This is where I guess you want to go. All that's left is to just modify the column / variable names with your own data.
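As a possible alternative to the explicit loop, `DataFrame.explode` produces one row per list element in a single call. This is a sketch on the same example data, not a drop-in replacement for your view: it assumes every cell in the choices column is a list, and `exploded` is a name chosen here.

```python
import pandas as pd

df = pd.DataFrame({"choices": [[{"text": "yes", "value": "yes"},
                                {"text": "no", "value": "no"}],
                               [{"ch1": 1, "ch2": 2}, {"ch3": "ch3"}]],
                   "name": ["kostas", "rajesh"]})

# explode repeats the other columns for every element of each list;
# reset_index drops the duplicated index labels it leaves behind.
exploded = df.explode("choices").reset_index(drop=True)
print(exploded)
```

The same memory caveat applies, since every choice still becomes its own row.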

How to iterate through this nested dictionary within a list using for loop

I have a list of nested dictionaries that I want to get specific values and put into a dictionary like this:
vid = [{'a':{'display':'axe', 'desc':'red'}, 'b':{'confidence':'good'}},
{'a':{'display':'book', 'desc':'blue'}, 'b':{'confidence':'poor'}},
{'a':{'display':'apple', 'desc':'green'}, 'b':{'confidence':'good'}}
]
I saw previous questions similar to this, but I still can't get the values such as 'axe' and 'red'. I would like the new dict to have a 'Description', 'Confidence' and other columns with the values from the nested dict.
I have tried this for loop:
new_dict = {}
for x in range(len(vid)):
    for y in vid[x]['a']:
        desc = y['desc']
        new_dict['Description'] = desc
I got many errors but mostly this error:
TypeError: string indices must be integers
Can someone please help solve how to get the values from the nested dictionary?
You don't need to iterate through the keys in the dictionary (the inner for-loop), just access the value you want.
vid = [{'a':{'display':'axe', 'desc':'red'}, 'b':{'confidence':'good'} },
{'a':{'display':'book', 'desc':'blue'}, 'b':{'confidence':'poor'}},
{'a':{'display':'apple', 'desc':'green'}, 'b':{'confidence':'good'}}
]
new_dict = {}
list_of_dicts = []
for x in range(len(vid)):
    desc = vid[x]['a']['desc']
    list_of_dicts.append({'desc': desc})
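To see where the TypeError comes from, it may help to look at what the inner loop actually iterates over. The sketch below reproduces the error on purpose and then reads the values directly; `inner_keys`, `error_message` and `descs` are names made up for this illustration.

```python
vid = [{'a': {'display': 'axe', 'desc': 'red'}, 'b': {'confidence': 'good'}},
       {'a': {'display': 'book', 'desc': 'blue'}, 'b': {'confidence': 'poor'}},
       {'a': {'display': 'apple', 'desc': 'green'}, 'b': {'confidence': 'good'}}]

# Iterating over a dict yields its keys, which are plain strings here.
inner_keys = [y for y in vid[0]['a']]

error_message = None
try:
    inner_keys[0]['desc']  # same as 'display'['desc']
except TypeError as exc:
    error_message = str(exc)

# Indexing straight into the nested dict avoids the problem.
descs = [item['a']['desc'] for item in vid]
print(inner_keys, error_message, descs)
```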
I have found a temporary solution for this. I decided to use the pandas dataframe instead.
import pandas as pd
df = pd.DataFrame(columns=['Desc'])
for x in range(len(vid)):
    desc = vid[x]['a']['desc']
    df.loc[len(df)] = [desc]
Since you want to write this to CSV later, pandas will help you a lot with this problem. Using pandas, you can get the desc values like this:
import pandas as pd
new_dict = {'description': []}
df = pd.DataFrame(vid)
for index, row in df.iterrows():
    # Each cell in column 'a' is still a dict, so index into it directly.
    new_dict['description'].append(row['a']['desc'])
print(df)
                                       a                       b
0      {'display': 'axe', 'desc': 'red'}  {'confidence': 'good'}
1    {'display': 'book', 'desc': 'blue'}  {'confidence': 'poor'}
2  {'display': 'apple', 'desc': 'green'}  {'confidence': 'good'}
This is what the dataframe looks like: a and b are the columns of the dataframe, and your nested dicts are the cell values in each row.
Try using this list comprehension:
d = [{'Description': i['a']['desc'], 'Confidence': i['b']['confidence']} for i in vid]
print(d)
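If the end goal is a table with 'Description' and 'Confidence' columns, `pd.json_normalize` can flatten the nested dicts for you. This is a sketch rather than the only way to do it; the column names in the rename step are my guess at what you want.

```python
import pandas as pd

vid = [{'a': {'display': 'axe', 'desc': 'red'}, 'b': {'confidence': 'good'}},
       {'a': {'display': 'book', 'desc': 'blue'}, 'b': {'confidence': 'poor'}},
       {'a': {'display': 'apple', 'desc': 'green'}, 'b': {'confidence': 'good'}}]

# Nested keys become dotted column names: 'a.display', 'a.desc', 'b.confidence'.
flat = pd.json_normalize(vid)

# Keep and rename only the columns of interest.
table = flat.rename(columns={'a.desc': 'Description',
                             'b.confidence': 'Confidence'})[['Description', 'Confidence']]
print(table)
```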
