Iteration issue to create a nested dictionary - python

My data looks as followed:
Application WorkflowStep
0 WF:ACAA-CR (auto) Manager
1 WF:ACAA-CR (auto) Access Responsible
2 WF:ACAA-CR (auto) Automatic
3 WF:ACAA-CR-AccResp (auto) Manager
4 WF:ACAA-CR-AccResp (auto) Access Responsible
5 WF:ACAA-CR-AccResp (auto) Automatic
6 WF:ACAA-CR-IT-AccResp[AUTO] Group
7 WF:ACAA-CR-IT-AccResp[AUTO] Access Responsible
8 WF:ACAA-CR-IT-AccResp[AUTO] Automatic
Additionally to these two columns I want to add a third column showing the sum of all WorkflowStep's.
The dictionary should look like the following (or similiar):
{'WF:ACAA-CR (auto)':
[{'Workflow': ['Manager', 'Access Responsible','Automatic'], 'Summary': 3}],
'WF:ACAA-CR-AccResp (auto)':
[{'Workflow': ['Manager','Access Responsible','Automatic'], 'Summary': 3}],
'WF:ACAA-CR-IT-AccResp[AUTO]':
[{'Workflow': ['Group','Access Responsible','Automatic'], 'Summary': 3}]
}
My code to create a dictionary out of the two above columns works fine.
for i in range(len(df)):
currentid = df.iloc[i,0]
currentvalue = df.iloc[i,1]
dict.setdefault(currentid, [])
dict[currentid].append(currentvalue)
The code to create the sum of the WorkflowStep is as followed and also works fine:
for key, values in dict.items():
val = values
match = ["Manager", "Access Responsible", "Automatic", "Group"]
c = Counter(val)
sumofvalues = 0
for m in match:
if c[m] == 1:
sumofvalues += 1
My initial idea was to adjust my first code where the initial key is the Application and WorkflowStep, Summary would be sub-dictionaries.
for i in range(len(df)):
currentid = df.iloc[i,0]
currentvalue = df.iloc[i,1]
dict.setdefault(currentid, [])
dict[currentid].append({"Workflow": [currentvalue], "Summary": []})
The result of this is however unsatisfactory because it does not add currentvalue to the already existing Workflow key but recreates them after every iteration.
Example
{'WF:ACAA-CR (auto)': [{'Workflow': ['Manager'], 'Summary': []},
{'Workflow': ['Access Responsible'], 'Summary': []},
{'Workflow': ['Automatic'], 'Summary': []}]
}
How can I create a dictionary similiar to what I wrote above?

IIUC, here's what can help -
val = df.groupby('Application')['WorkflowStep'].unique()
{val.index[i]: [{'WorkflowStep':list(val[i]), 'Summary':len(val[i])}] for i in range(len(val))}
resulting into,
{'WF:ACAA-CR (auto)': [{'WorkflowStep': ['Manager', 'Access Responsible', 'Automatic'], 'Summary': 3}],
'WF:ACAA-CR-AccResp (auto)': [{'WorkflowStep': ['Manager', 'Access Responsible', 'Automatic'], 'Summary': 3}],
'WF:ACAA-CR-IT-AccResp[AUTO]': [{'WorkflowStep': ['Group', 'Access Responsible', 'Automatic'], 'Summary': 3}]}

I think meW's answer is a much better way of doing things, and takes advantage of the neat power of dataframe's but for reference, if you wanted to do it the way you were trying, I think this will work:
# Create the data for testing.
d = {'Application': ["WF:ACAA-CR (auto)", "WF:ACAA-CR (auto)", "WF:ACAA-CR (auto)",
"WF:ACAA-CR-AccResp (auto)", "WF:ACAA-CR-AccResp (auto)", "WF:ACAA-CR-AccResp (auto)"],
'WorkflowStep': ["Manager", "Access Responsible","Automatic","Manager","Access Responsible", "Automatic"]}
df = pd.DataFrame(d)
new_dict = dict()
# Iterate through the rows of the data frame.
for index, row in df.iterrows():
# Get the values for the current row.
current_application_id = row['Application']
current_workflowstep = row['WorkflowStep']
# Set the default values if not already set.
new_dict.setdefault(current_application_id, {'Workflow': [], 'Summary' : 0})
# Add the new values.
new_dict[current_application_id]['Workflow'].append(current_workflowstep)
new_dict[current_application_id]['Summary'] += 1
print(new_dict)
Which gives an output of:
{'WF:ACAA-CR (auto)': {'Workflow': ['Manager', 'Access Responsible', 'Automatic'], 'Summary': 3},
'WF:ACAA-CR-AccResp (auto)': {'Workflow': ['Manager', 'Access Responsible', 'Automatic'], 'Summary': 3}}

Related

Create a dictionary from list of dictionaries selecting specific values

I have a list of dictionaries as below and I'd like to create a dictionary to store specific data from the list.
test_list = [
{
'id':1,
'colour':'Red',
'name':'Apple',
'edible': True,
'price':100
},
{
'id':2,
'colour':'Blue',
'name':'Blueberry',
'edible': True,
'price':200
},
{
'id':3,
'colour':'Yellow',
'name':'Crayon',
'edible': False,
'price':300
}
]
For instance, a new dictionary to store just the {id, name, price} of the various items.
I created several lists:
id_list = []
name_list = []
price_list = []
Then I added the data I want to each list:
for n in test_list:
id_list.append(n['id']
name_list.append(n['name']
price_list.append(n['price']
But I can't figure out how to create a dictionary (or a more appropriate structure?) to store the data in the {id, name, price} format I'd like. Appreciate help!
If you don't have too much data, you can use this nested list/dictionary comprehension:
keys = ['id', 'name', 'price']
result = {k: [x[k] for x in test_list] for k in keys}
That'll give you:
{
'id': [1, 2, 3],
'name': ['Apple', 'Blueberry', 'Crayon'],
'price': [100, 200, 300]
}
I think a list of dictionaries is stille the right data format, so this:
test_list = [
{
'id':1,
'colour':'Red',
'name':'Apple',
'edible': True,
'price':100
},
{
'id':2,
'colour':'Blue',
'name':'Blueberry',
'edible': True,
'price':200
},
{
'id':3,
'colour':'Yellow',
'name':'Crayon',
'edible': False,
'price':300
}
]
keys = ['id', 'name', 'price']
limited = [{k: v for k, v in d.items() if k in keys} for d in test_list]
print(limited)
Result:
[{'id': 1, 'name': 'Apple', 'price': 100}, {'id': 2, 'name': 'Blueberry', 'price': 200}, {'id': 3, 'name': 'Crayon', 'price': 300}]
This is nice, because you can access its parts like limited[1]['price'].
However, your use case is perfect for pandas, if you don't mind using a third party library:
import pandas as pd
test_list = [
{
'id':1,
'colour':'Red',
'name':'Apple',
'edible': True,
'price':100
},
{
'id':2,
'colour':'Blue',
'name':'Blueberry',
'edible': True,
'price':200
},
{
'id':3,
'colour':'Yellow',
'name':'Crayon',
'edible': False,
'price':300
}
]
df = pd.DataFrame(test_list)
print(df['price'][1])
print(df)
The DataFrame is perfect for this stuff and selecting just the columns you need:
keys = ['id', 'name', 'price']
df_limited = df[keys]
print(df_limited)
The reason I'd prefer either to a dictionary of lists is that manipulating the dictionary of lists will get complicated and error prone and accessing a single record means accessing three separate lists - there's not a lot of advantages to that approach except maybe that some operations on lists will be faster, if you access a single attribute more often. But in that case, pandas wins handily.
In the comments you asked "Let's say I had item_names = ['Apple', 'Teddy', 'Crayon'] and I wanted to check if one of those item names was in the df_limited variable or I guess the df_limited['name'] - is there a way to do that, and if it is then print say the price, or manipulate the price?"
There's many ways of course, I recommend looking into some online pandas tutorials, because it's a very popular library and there's excellent documentation and teaching materials online.
However, just to show how easy it would be in both cases, retrieving the matching objects or just the prices for them:
item_names = ['Apple', 'Teddy', 'Crayon']
items = [d for d in test_list if d['name'] in item_names]
print(items)
item_prices = [d['price'] for d in test_list if d['name'] in item_names]
print(item_prices)
items = df[df['name'].isin(item_names)]
print(items)
item_prices = df[df['name'].isin(item_names)]['price']
print(item_prices)
Results:
[{'id': 1, 'colour': 'Red', 'name': 'Apple', 'edible': True, 'price': 100}, {'id': 3, 'colour': 'Yellow', 'name': 'Crayon', 'edible': False, 'price': 300}]
[100, 300]
id name price
0 1 Apple 100
2 3 Crayon 300
0 100
2 300
In the example with the dataframe there's a few things to note. They are using .isin() since using in won't work in the fancy way dataframes allow you to select data df[<some condition on df using df>], but there's fast and easy to use alternatives for all standard operations in pandas. More importantly, you can just do the work on the original df - it already has everything you need in there.
And let's say you wanted to double the prices for these products:
df.loc[df['name'].isin(item_names), 'price'] *= 2
This uses .loc for technical reasons (you can't modify just any view of a dataframe), but that's way too much to get into in this answer - you'll learn looking into pandas. It's pretty clean and simple though, I'm sure you agree. (you could use .loc for the previous example as well)
In this trivial example, both run instantly, but you'll find that pandas performs better for very large datasets. Also, try writing the same examples using the method you requested (as provided in the accepted answer) and you'll find that it's not as elegant, unless you start by zipping everything together again:
item_prices = [p for i, n, p in zip(result.values()) if n in item_names]
Getting out a result that has the same structure as result is way more trickier with more zipping and unpacking involved, or requires you to go over the lists twice.

Python 3.9.5: One dictionary assignment is overwriting multiple keys [BUG?]

I am reading a .csv called courses. Each row corresponds to a course which has an id, a name, and a teacher. They are to be stored in a Dict. An example:
list_courses = {
1: {'id': 1, 'name': 'Biology', 'teacher': 'Mr. D'},
...
}
While iterating the rows using enumerate(file_csv.readlines()) I am performing the following:
list_courses={}
for idx, row in enumerate(file_csv.readlines()):
# Skip blank rows.
if row.isspace(): continue
# If we're using the row, turn it into a list.
row = row.strip().split(",")
# If it's the header row, take note of the header. Use these values for the dictionaries' keys.
# As of 3.7 a Dict remembers the order in which the keys were inserted.
# Since the order is constant, simply load each other row into the corresponding key.
if not idx:
sheet_item = dict.fromkeys(row)
continue
# Loop through the keys in sheet_item. Assign the value found in the row, converting to int where necessary.
for idx, key in enumerate(list(sheet_item)):
sheet_item[key] = int(row[idx].strip()) if key == 'id' or key == 'mark' else row[idx].strip()
# Course list
print("ADDING COURSE WITH ID {} TO THE DICTIONARY:".format(sheet_item['id']))
list_courses[sheet_item['id']] = sheet_item
print("\tADDED: {}".format(sheet_item))
print("\tDICT : {}".format(list_courses))
Thus, the list_courses dictionary is printed after each sheet_item is added to it.
Now comes the issue - when reading in two courses, I expect that list_courses should read:
list_courses = {
1: {'id': 1, 'name': 'Biology', 'teacher': 'Mr. D'},
2: {'id': 2, 'name': 'History', 'teacher': 'Mrs. P'}
}
However, the output of my print statements (substantiated by errors later in my program) is:
ADDING COURSE WITH ID 1 TO THE DICTIONARY:
ADDED: {'id': 1, 'name': 'Biology', 'teacher': 'Mr. D'}
DICT : {1: {'id': 1, 'name': 'Biology', 'teacher': 'Mr. D'}}
ADDING COURSE WITH ID 2 TO THE DICTIONARY:
ADDED: {'id': 2, 'name': 'History', 'teacher': 'Mrs. P'}
DICT : {1: {'id': 2, 'name': 'History', 'teacher': 'Mrs. P'}, 2: {'id': 2, 'name': 'History', 'teacher': 'Mrs. P'}}
Thus, the id with which the sheet_item is being added to courses_list is correct (1 or 2), however the assignment which occurs for the second course appears to be overwriting the value for key 1. I'm not even sure how this is possible. Please let me know your thoughts.
You're using the same dictionary for both the header and all the rows. You never create any new dictionaries after the header. Key assignments are overwriting previous ones, because there are no new dictionaries to write to.
Store the keys in a list, and make a new sheet_item before the for loop:
list_courses={}
keys = None # Let Python know this is defined
for idx, row in enumerate(file_csv.readlines()):
# Skip blank rows.
if row.isspace(): continue
# If we're using the row, turn it into a list.
row = row.strip().split(",")
# If it's the header row, take note of the header. Use these values for the dictionaries' keys.
# As of 3.7 a Dict remembers the order in which the keys were inserted.
# Since the order is constant, simply load each other row into the corresponding key.
if not idx:
keys = row
continue
sheet_item = {}
# Loop through the keys in sheet_item. Assign the value found in the row, converting to int where necessary.
for idx, key in enumerate(keys):
sheet_item[key] = int(row[idx].strip()) if key == 'id' or key == 'mark' else row[idx].strip()
# Course list
print("ADDING COURSE WITH ID {} TO THE DICTIONARY:".format(sheet_item['id']))
list_courses[sheet_item['id']] = sheet_item
print("\tADDED: {}".format(sheet_item))
print("\tDICT : {}".format(list_courses))

how to normalize this below json using panda in django

using this view.py query my output is showing something like this. you can see in choices field there are multiple array so i can normalize in serial wise here is my json
{"pages":[{"name":"page1","title":"SurveyWindow Pvt. Ltd. Customer Feedback","description":"Question marked * are compulsory.",
"elements":[{"type":"radiogroup","name":"question1","title":"Do you like our product? *","isRequired":true,
"choices":[{"value":"Yes","text":"Yes"},{"value":"No","text":"No"}]},{"type":"checkbox","name":"question2","title":"Please Rate Our PM Skill","isRequired":false,"choices":[{"value":"High","text":"High"},{"value":"Low","text":"Low"},{"value":"Medium","text":"Medium"}]},{"type":"radiogroup","name":"question3","title":"Do you like our services? *","isRequired":true,"choices":[{"value":"Yes","text":"Yes"},{"value":"No","text":"No"}]}]}]}
this is my view.py
jsondata=SurveyMaster.objects.all().filter(survey_id='1H2711202014572740')
q = jsondata.values('survey_json_design')
qs_json = pd.DataFrame.from_records(q)
datatotable = pd.json_normalize(qs_json['survey_json_design'], record_path=['pages','elements'])
qs_json = datatotable.to_html()
Based on your comments and picture here's what I would do to go from the picture to something more SQL-friendly (what you refer to as "normalization"), but keep in mind this might blow up if you don't have sufficient memory.
Create a new list which you'll fill with the new data, then iterate over the pandas table's rows, and then over every item in your list. For every iteration in the inner loop use the data from the row (minus the column you're iteration over). For convenience I added it as the last element.
# Example data
df = pd.DataFrame({"choices": [[{"text": "yes", "value": "yes"},
{"text": "no", "value": "no"}],
[{"ch1": 1, "ch2": 2}, {"ch3": "ch3"}]],
"name": ["kostas", "rajesh"]})
data = []
for i, row in df.iterrows():
for val in row["choices"]:
data.append((*row.drop("choices").values, val))
df = pd.DataFrame(data, columns=["names", "choices"])
print(df)
names choices
0 kostas {'text': 'yes', 'value': 'yes'}
1 kostas {'text': 'no', 'value': 'no'}
2 george {'ch1': 1, 'ch2': 2}
3 george {'ch3': 'ch3'}
This is where I guess you want to go. All that's left is to just modify the column / variable names with your own data.

Fill pandas dataframe within a for loop

I am working with Amazon Rekognition to do some image analysis.
With a symple Python script, I get - at every iteration - a response of this type:
(example for the image of a cat)
{'Labels':
[{'Name': 'Pet', 'Confidence': 96.146484375, 'Instances': [],
'Parents': [{'Name': 'Animal'}]}, {'Name': 'Mammal', 'Confidence': 96.146484375,
'Instances': [], 'Parents': [{'Name': 'Animal'}]},
{'Name': 'Cat', 'Confidence': 96.146484375.....
I got all the attributes I need in a list, that looks like this:
[Pet, Mammal, Cat, Animal, Manx, Abyssinian, Furniture, Kitten, Couch]
Now, I would like to create a dataframe where the elements in the list above appear as columns and the rows take values 0 or 1.
I created a dictionary in which I add the elements in the list, so I get {'Cat': 1}, then I go to add it to the dataframe and I get the following error:
TypeError: Index(...) must be called with a collection of some kind, 'Cat' was passed.
Not only that, but I don't even seem able to add to the same dataframe the information from different images. For example, if I only insert the data in the dataframe (as rows, not columns), I get a series with n rows with the n elements (identified by Amazon Rekognition) of only the last image, i.e. I start from an empty dataframe at each iteration.
The result I would like to get is something like:
Image Human Animal Flowers etc...
Pic1 1 0 0
Pic2 0 0 1
Pic3 1 1 0
For reference, this is the code I am using now (I should add that I am working on a software called KNIME, but this is just Python):
from pandas import DataFrame
import pandas as pd
import boto3
fileName=flow_variables['Path_Arr[1]'] #This is just to tell Amazon the name of the image
bucket= 'mybucket'
client=boto3.client('rekognition', region_name = 'us-east-2')
response = client.detect_labels(Image={'S3Object':
{'Bucket':bucket,'Name':fileName}})
data = [str(response)] # This is what I inserted in the first cell of this question
d= {}
for key, value in response.items():
for el in value:
if isinstance(el,dict):
for k, v in el.items():
if k == "Name":
d[v] = 1
print(d)
df = pd.DataFrame(d, ignore_index=True)
print(df)
output_table = df
I am definitely getting it all wrong both in the for loop and when adding things to my dataframe, but nothing really seems to work!
Sorry for the super long question, hope it was clear! Any ideas?
I do not know if this answers your question completely, because i do not know, what you data can look like, but it's a good step that should help you, i think. I added the same data multiple time, but the way should be clear.
import pandas as pd
response = {'Labels': [{'Name': 'Pet', 'Confidence': 96.146484375, 'Instances': [], 'Parents': [{'Name': 'Animal'}]},
{'Name': 'Cat', 'Confidence': 96.146484375, 'Instances': [{'BoundingBox':
{'Width': 0.6686800122261047,
'Height': 0.9005332589149475,
'Left': 0.27255237102508545,
'Top': 0.03728689253330231},
'Confidence': 96.146484375}],
'Parents': [{'Name': 'Pet'}]
}]}
def handle_new_data(repsonse_data: dict, image_name: str) -> pd.DataFrame:
d = {"Image": image_name}
result = pd.DataFrame()
for key, value in repsonse_data.items():
for el in value:
if isinstance(el, dict):
for k, v in el.items():
if k == "Name":
d[v] = 1
result = result.append(d, ignore_index=True)
return result
df_all = pd.DataFrame()
df_all = df_all.append(handle_new_data(response, "image1"))
df_all = df_all.append(handle_new_data(response, "image2"))
df_all = df_all.append(handle_new_data(response, "image3"))
df_all = df_all.append(handle_new_data(response, "image4"))
df_all.reset_index(inplace=True)
print(df_all)

How to extract specific values from a list of dictionaries in python

I have a list of dictionaries like shown below and i would like to extract the partID and the corresponding quantity for a specific orderID using python, but i don't know how to do it.
dataList = [{'orderID': 'D00001', 'customerID': 'C00001', 'partID': 'P00001', 'quantity': 2},
{'orderID': 'D00002', 'customerID': 'C00002', 'partID': 'P00002', 'quantity': 1},
{'orderID': 'D00003', 'customerID': 'C00003', 'partID': 'P00001', 'quantity': 1},
{'orderID': 'D00004', 'customerID': 'C00004', 'partID': 'P00003', 'quantity': 3}]
So for example, when i search my dataList for a specific orderID == 'D00003', i would like to receive both the partID ('P00001'), as well as the corresponding quantity (1) of the specified order. How would you go about this? Any help is much appreciated.
It depends.
You are not going to do that a lot of time, you can just iterate over the list of dictionaries until you find the "correct" one:
search_for_order_id = 'D00001'
for d in dataList:
if d['orderID'] == search_for_order_id:
print(d['partID'], d['quantity'])
break # assuming orderID is unique
Outputs
P00001 2
Since this solution is O(n), if you are going to do this search a lot of times it will add up.
In that case it will be better to transform the data to a dictionary of dictionaries, with orderID being the outer key (again, assuming orderID is unique):
better = {d['orderID']: d for d in dataList}
This is also O(n) but you pay it only once. Any subsequent lookup is an O(1) dictionary lookup:
search_for_order_id = 'D00001'
print(better[search_for_order_id]['partID'], better[search_for_order_id]['quantity'])
Also outputs
P00001 2
I believe you would like to familiarize yourself with the pandas package, which is very useful for data analysis. If these are the kind of problems you're up against, I advise you to take the time and take a tutorial in pandas. It can do a lot, and is very popular.
Your dataList is very similar to a DataFrame structure, so what you're looking for would be as simple as:
import pandas as pd
df = pd.DataFrame(dataList)
df[df['orderID']=='D00003']
You can use this:
results = [[x['orderID'], x['partID'], x['quantity']] for x in dataList]
for i in results:
print(i)
Also,
results = [['Order ID: ' + x['orderID'], 'Part ID: ' + x['partID'],'Quantity:
' + str(x['quantity'])] for x in dataList]
To get the partID you can make use of the filter function.
myData = [{"x": 1, "y": 1}, {"x": 2, "y": 5}]
filtered = filter(lambda item: item["x"] == 1) # Search for an object with x equal to 1
# Get the next item from the filter (the matching item) and get the y property.
print(next(filtered)["y"])
You should be able to apply this to your situation.

Categories