Convert list to nested dict and assign a value - Python

I am trying to convert CSV to JSON. If I encounter CSV headers with the naming convention "columnName1.0.columnName2.0.columnName3", I need to create nested JSON: {columnName1: {columnName2: {columnName3: value}}}.
So far I am able to split the header into a list of sub-column names and create the nested structure, but I am unable to assign a value. Any help?
data = open(str(fileName.strip("'")), 'rb')
reader = csv.DictReader(data, delimiter=',', quotechar='"')
# Get the header
for line in reader:
    for x, y in line.items():
        columns = re.split(r"\.\d\.", x)
        if len(columns) == 1:
            continue
        else:
            print "COLUMNS %s" % columns
            testLine = {}
            for subColumnName in reversed(columns):
                testLine = {subColumnName: testLine}
            # Need to assign value y?
            print "LINE%s" % testLine
Output:
COLUMNS ['experience', 'title']
LINE{'experience': {'title': {}}}
COLUMNS ['experience', 'organization', 'profile_url']
LINE{'experience': {'organization': {'profile_url': {}}}}
COLUMNS ['experience', 'start']
LINE{'experience': {'start': {}}}
COLUMNS ['raw_experience', 'organization', 'profile_url']
LINE{'raw_experience': {'organization': {'profile_url': {}}}}
COLUMNS ['raw_experience', 'end']
LINE{'raw_experience': {'end': {}}}
COLUMNS ['experience', 'organization', 'name']
LINE{'experience': {'organization': {'name': {}}}}

The innermost value is currently {}, the initial value of testLine. Start from the cell value instead (y in your loop):
testLine = y
for subColumnName in reversed(columns):
    testLine = {subColumnName: testLine}
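The answer above builds one nested dict per header. If the goal is a single nested record per CSV row, the per-header results still have to be merged. A minimal sketch of one way to do that, assuming Python 3 and the same `.N.` separator as in the question; the helper name `set_nested` and the sample row values are hypothetical:

```python
import re

def set_nested(target, header, value):
    """Store value in target under the path encoded in header, e.g. 'a.0.b.0.c'."""
    keys = re.split(r"\.\d\.", header)
    node = target
    # Walk (or create) the intermediate dicts, then assign at the last key
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return target

# Hypothetical CSV row, as csv.DictReader would yield it
row = {
    "experience.0.title": "Engineer",
    "experience.0.organization.0.name": "Acme",
}

record = {}
for header, value in row.items():
    set_nested(record, header, value)

print(record)
```

Because `setdefault` reuses an existing intermediate dict, headers that share a prefix (here `experience`) end up under the same parent instead of overwriting each other.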

Related

How to normalize a complex json format in a pandas data frame that is a list of dictionaries

I have a pandas data frame with one column in JSON format, like this. I do not understand how to extract it.
df['completionDetails'][0] gives:
[{'name': 'start', 'time': 1654098788177},
{'name': 'arrival',
'time': 1654099038368,
'location': [-74.2713929, 40.5017297]},
{'name': 'departure',
'time': 1654098843357,
'location': [-74.2802414, 40.5095964]}]
I have tried:
dict_df = pd.DataFrame([ast.literal_eval(i) for i in df['completionDetails'].values])
But it gives me an error. What method can I use for this?
Expected Output:
start_time arrival_time arrival_location departure_time departure_location
1654098788177 1654099038368 [-74.2713929, 40.5017297] 1654098843357 [-74.2802414, 40.5095964]
IIUC each cell of the completionDetails column is a list of dictionaries.
You can make a dataframe out of each cell and concatenate the dfs:
dict_df = pd.concat([pd.DataFrame(i) for i in df['completionDetails'].values])
Edit:
Following your own edit, this is how you'd get the desired output:
dict_df = pd.concat([
    pd.DataFrame({f"{x['name']}_{k}": [v]
                  for x in i for k, v in x.items() if k != 'name'})
    for i in df['completionDetails'].values if isinstance(i, list)
])
As you can see, we build the new keys by combining the value of the name key with each of the other keys. Each resulting dictionary becomes a one-row dataframe, and these dataframes are then concatenated.
Output:
start_time arrival_time arrival_location departure_time departure_location
0 1654098788177 1654099038368 [-74.2713929, 40.5017297] 1654098843357 [-74.2802414, 40.5095964]
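To make the key-building step easier to follow, here is a minimal sketch of the same idea in plain Python, without pandas. The helper name `flatten_events` is hypothetical, and the sample cell is the one shown in the question:

```python
def flatten_events(events):
    """Turn a list of {'name': ..., other keys} dicts into one flat dict."""
    return {
        f"{event['name']}_{key}": value
        for event in events
        for key, value in event.items()
        if key != "name"
    }

# One cell of the completionDetails column, copied from the question
cell = [
    {"name": "start", "time": 1654098788177},
    {"name": "arrival", "time": 1654099038368, "location": [-74.2713929, 40.5017297]},
    {"name": "departure", "time": 1654098843357, "location": [-74.2802414, 40.5095964]},
]

row = flatten_events(cell)
print(list(row))
```

A list of such rows, one per cell, can then be passed to `pd.DataFrame` to get one row per original record.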

How to update values in Python's dictionary?

My dictionary looks like the one below, and I am following this link to update the values under the "Column_Type" key. Basically, I would like to replace the value "String" with "VARCHAR(256)", "Date" with "NUMBER(4,0)", "Int" with "NUMBER", and "Numeric" with "NUMBER". Whenever I run the code below, the values in my dictionary are not updated. My desired output for the updated dictionary is shown below.
Please note: the position of each column type might vary. For example, "String" is currently at position 1, but it might be at position 3 later on.
{'Column_name': ['Name', 'Salary', 'Date', 'Phone'], 'Column_Type': ['String', 'Numeric', 'Date', 'Int']}
Code:
for key1, key2 in my_dict.items():
    if key2 == 'String':
        my_dict[key2] = "VARCHAR(256)"
print(my_dict)
Desired Output:
{'Column_name': ['Name', 'Salary', 'Date', 'Phone'], 'Column_Type': ['VARCHAR(256)', 'NUMBER', 'NUMBER(4,0)', 'NUMBER']}
In your example, the keys are "Column_name" and "Column_Type". There is no key named "String" in your dict. Both values in your dict are lists, so neither of them is equal to the string "String" either.
What you want is to replace a specific value in a list.
Try it like this:
for index, value in enumerate(my_dict["Column_Type"]):
    if value == "String":
        my_dict["Column_Type"][index] = "VARCHAR(256)"
This replaces the value in the list, not the dict. That is what you want.
If you need to replace multiple values, you can use a dict, as @Jeremy suggested:
type_strs = {
    'String': 'VARCHAR(256)',
    'Numeric': 'NUMBER',
    'Date': 'NUMBER(4,0)',
    'Int': 'NUMBER'
}
for index, value in enumerate(my_dict["Column_Type"]):
    my_dict["Column_Type"][index] = type_strs.get(value, value)
Here, the .get() function on a dict returns the value corresponding to the key given by the first argument, or the second argument if no such key exists.
type_strs = {
    'String': 'VARCHAR(256)',
    'Numeric': 'NUMBER',
    'Date': 'NUMBER(4,0)',
    'Int': 'NUMBER'
}
my_dict['Column_Type'] = [type_strs[t] for t in my_dict['Column_Type']]
I would recommend a dictionary instead of if statements for translating the type strings
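The two dictionary-based answers above differ in how they treat a type that has no mapping: `.get(value, value)` leaves it unchanged, while indexing with `[...]` raises a `KeyError`. A short sketch of the difference; the extra type 'Boolean' is a hypothetical unmapped value added for illustration:

```python
type_strs = {
    'String': 'VARCHAR(256)',
    'Numeric': 'NUMBER',
    'Date': 'NUMBER(4,0)',
    'Int': 'NUMBER',
}

# 'Boolean' is not in type_strs (hypothetical example of an unmapped type)
column_types = ['String', 'Numeric', 'Date', 'Int', 'Boolean']

# Lenient lookup: unknown types pass through unchanged
converted = [type_strs.get(t, t) for t in column_types]
print(converted)

# Strict lookup: the first unknown type raises KeyError
try:
    strict = [type_strs[t] for t in column_types]
except KeyError as exc:
    missing = exc.args[0]
    print("no mapping for", missing)
```

Which behavior is preferable depends on whether an unmapped type should be tolerated or treated as an error in your data.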
In the line if key2 == 'String': you are comparing a whole list with a single element of that list.
While you iterate over the dictionary, key2 holds the list ['String', 'Numeric', 'Date', 'Int'], so you need to go through the elements of that list to compare them. You can do that with a for loop.
The program is as follows:
my_dict = {'Column_name': ['Name', 'Salary', 'Date', 'Phone'], 'Column_Type': ['String', 'Numeric', 'Date', 'Int']}
# Track the position of the current element in the list
position = 0
# Iterate over the column types
for i in my_dict['Column_Type']:
    # If the element is equal to the string
    if i == 'String':
        # Replace the element at this position
        my_dict['Column_Type'][position] = "VARCHAR(256)"
    # Advance the position on every iteration, not only on a match
    position += 1
print(my_dict)
Output
{'Column_name': ['Name', 'Salary', 'Date', 'Phone'], 'Column_Type': ['VARCHAR(256)', 'Numeric', 'Date', 'Int']}
You can use dict.update(), which adds or overwrites key-value pairs in a dictionary (lists have no update method).
Example:
# Dictionary of strings to ints
word_freq = {
    "Hello": 56,
    "at": 23,
    "test": 43,
    "this": 43
}
# Adding a new key-value pair
word_freq.update({'before': 23})
print(word_freq)

Fill pandas dataframe within a for loop

I am working with Amazon Rekognition to do some image analysis.
With a simple Python script, I get, at every iteration, a response of this type:
(example for the image of a cat)
{'Labels':
[{'Name': 'Pet', 'Confidence': 96.146484375, 'Instances': [],
'Parents': [{'Name': 'Animal'}]}, {'Name': 'Mammal', 'Confidence': 96.146484375,
'Instances': [], 'Parents': [{'Name': 'Animal'}]},
{'Name': 'Cat', 'Confidence': 96.146484375.....
I got all the attributes I need in a list, that looks like this:
[Pet, Mammal, Cat, Animal, Manx, Abyssinian, Furniture, Kitten, Couch]
Now, I would like to create a dataframe where the elements in the list above appear as columns and the rows take values 0 or 1.
I created a dictionary in which I add the elements in the list, so I get {'Cat': 1}, then I go to add it to the dataframe and I get the following error:
TypeError: Index(...) must be called with a collection of some kind, 'Cat' was passed.
Not only that, but I don't even seem able to add to the same dataframe the information from different images. For example, if I only insert the data in the dataframe (as rows, not columns), I get a series with n rows with the n elements (identified by Amazon Rekognition) of only the last image, i.e. I start from an empty dataframe at each iteration.
The result I would like to get is something like:
Image Human Animal Flowers etc...
Pic1 1 0 0
Pic2 0 0 1
Pic3 1 1 0
For reference, this is the code I am using now (I should add that I am working on a software called KNIME, but this is just Python):
from pandas import DataFrame
import pandas as pd
import boto3
fileName = flow_variables['Path_Arr[1]']  # This is just to tell Amazon the name of the image
bucket = 'mybucket'
client = boto3.client('rekognition', region_name='us-east-2')
response = client.detect_labels(Image={'S3Object':
                                       {'Bucket': bucket, 'Name': fileName}})
data = [str(response)]  # This is what I inserted in the first cell of this question
d = {}
for key, value in response.items():
    for el in value:
        if isinstance(el, dict):
            for k, v in el.items():
                if k == "Name":
                    d[v] = 1
print(d)
df = pd.DataFrame(d, ignore_index=True)
print(df)
output_table = df
I am definitely getting it all wrong both in the for loop and when adding things to my dataframe, but nothing really seems to work!
Sorry for the super long question, hope it was clear! Any ideas?
I do not know if this answers your question completely, because I do not know what your data can look like, but I think it is a good step that should help you. I added the same data multiple times, but the approach should be clear.
import pandas as pd
response = {'Labels': [
    {'Name': 'Pet', 'Confidence': 96.146484375, 'Instances': [], 'Parents': [{'Name': 'Animal'}]},
    {'Name': 'Cat', 'Confidence': 96.146484375,
     'Instances': [{'BoundingBox': {'Width': 0.6686800122261047,
                                    'Height': 0.9005332589149475,
                                    'Left': 0.27255237102508545,
                                    'Top': 0.03728689253330231},
                    'Confidence': 96.146484375}],
     'Parents': [{'Name': 'Pet'}]},
]}
def handle_new_data(response_data: dict, image_name: str) -> pd.DataFrame:
    # One row per image: the image name plus a 1 for every label found
    d = {"Image": image_name}
    for key, value in response_data.items():
        for el in value:
            if isinstance(el, dict):
                for k, v in el.items():
                    if k == "Name":
                        d[v] = 1
    # Return a one-row DataFrame (DataFrame.append was removed in pandas 2.0)
    return pd.DataFrame([d])
# Concatenate the per-image rows; labels missing from an image become NaN
df_all = pd.concat(
    [handle_new_data(response, name) for name in ("image1", "image2", "image3", "image4")],
    ignore_index=True,
)
print(df_all)
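The table requested in the question has a 0 wherever an image lacks a label. A minimal sketch of building such rows in plain Python, assuming each response has a top-level 'Labels' list of dicts with a 'Name' key, as in the sample response; the helper name `label_rows` and the two sample responses are hypothetical:

```python
def label_rows(responses):
    """Return one dict per image with a 0/1 entry for every label seen in any image."""
    # Collect the set of label names found in each image
    found = {
        image: {label["Name"] for label in response.get("Labels", [])}
        for image, response in responses.items()
    }
    # The columns are the union of all labels, sorted for a stable order
    columns = sorted(set().union(*found.values()))
    return [
        {"Image": image, **{name: int(name in labels) for name in columns}}
        for image, labels in found.items()
    ]

# Hypothetical, heavily trimmed responses for two images
responses = {
    "Pic1": {"Labels": [{"Name": "Cat"}, {"Name": "Pet"}]},
    "Pic2": {"Labels": [{"Name": "Flower"}]},
}

rows = label_rows(responses)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` gives one row per image with the labels as 0/1 columns. The key point is that all responses are collected before the table is built, rather than starting from an empty dataframe at each iteration.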

How to iterate through this nested dictionary within a list using for loop

I have a list of nested dictionaries that I want to get specific values and put into a dictionary like this:
vid = [{'a':{'display':'axe', 'desc':'red'}, 'b':{'confidence':'good'}},
{'a':{'display':'book', 'desc':'blue'}, 'b':{'confidence':'poor'}},
{'a':{'display':'apple', 'desc':'green'}, 'b':{'confidence':'good'}}
]
I saw previous questions similar to this, but I still can't get the values such as 'axe' and 'red'. I would like the new dict to have a 'Description', 'Confidence' and other columns with the values from the nested dict.
I have tried this for loop:
new_dict = {}
for x in range(len(vid)):
    for y in vid[x]['a']:
        desc = y['desc']
        new_dict['Description'] = desc
I got many errors but mostly this error:
TypeError: string indices must be integers
Can someone please help solve how to get the values from the nested dictionary?
You don't need to iterate through the keys in the dictionary (the inner for-loop), just access the value you want.
vid = [{'a': {'display': 'axe', 'desc': 'red'}, 'b': {'confidence': 'good'}},
       {'a': {'display': 'book', 'desc': 'blue'}, 'b': {'confidence': 'poor'}},
       {'a': {'display': 'apple', 'desc': 'green'}, 'b': {'confidence': 'good'}}
       ]
new_dict = {}
list_of_dicts = []
for x in range(len(vid)):
    desc = vid[x]['a']['desc']
    list_of_dicts.append({'desc': desc})
I have found a temporary solution for this. I decided to use the pandas dataframe instead.
df = pd.DataFrame(columns=['Desc'])
for x in range(len(vid)):
    desc = vid[x]['a']['desc']
    df.loc[len(df)] = [desc]
Since you want to write this to CSV later, pandas will help you a lot with this problem. Using pandas, you can collect the desc values like this:
import pandas as pd
new_dict = {'description': []}
df = pd.DataFrame(vid)
for index, row in df.iterrows():
    # Append each value; assigning to the key directly would keep only the last row
    new_dict['description'].append(row['a']['desc'])
                                       a                       b
0      {'display': 'axe', 'desc': 'red'}  {'confidence': 'good'}
1    {'display': 'book', 'desc': 'blue'}  {'confidence': 'poor'}
2  {'display': 'apple', 'desc': 'green'}  {'confidence': 'good'}
This is what the dataframe looks like: a and b are its columns, and each of your nested dicts is one row.
Try using this list comprehension:
d = [{'Description': i['a']['desc'], 'Confidence': i['b']['confidence']} for i in vid]
print(d)
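The TypeError in the question comes from the inner loop: iterating over a dict yields its keys, so `y` is a string such as 'display', and `y['desc']` tries to index that string. A short sketch that reproduces the cause and then builds the requested rows; the exact wording of the error message varies between Python versions:

```python
vid = [
    {'a': {'display': 'axe', 'desc': 'red'}, 'b': {'confidence': 'good'}},
    {'a': {'display': 'book', 'desc': 'blue'}, 'b': {'confidence': 'poor'}},
    {'a': {'display': 'apple', 'desc': 'green'}, 'b': {'confidence': 'good'}},
]

# Iterating over a dict yields its keys, so each y is a string
keys_seen = [y for y in vid[0]['a']]
print(keys_seen)

# Indexing a string with another string raises the TypeError from the question
try:
    keys_seen[0]['desc']
except TypeError as exc:
    error = str(exc)
    print(error)

# Accessing the nested values directly avoids the inner loop entirely
rows = [
    {'Description': item['a']['desc'], 'Confidence': item['b']['confidence']}
    for item in vid
]
print(rows)
```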

List of Dictionaries to DataFrame

I have data like this, and I want to write it to a dataframe so that I can convert it directly into a CSV file.
Data =
[ {'event': 'User Clicked', 'properties': {'user_id': '123', 'page_visited': 'contact_us', etc},
{'event': 'User Clicked', 'properties': {'user_id': '456', 'page_visited': 'homepage', etc} , ......
{'event': 'User Clicked', 'properties': {'user_id': '789', 'page_visited': 'restaurant', etc}} ]
This is how I am able to access its values:
for item in list_of_dict_responses:
    print item['event']
    for key, value in item.items():
        if type(value) is dict:
            for k, v in value.items():
                print k, v
I want it in a dataframe where event is a column with the value User Clicked, and properties is another column with sub-columns user_id, page_visited, etc., each holding its respective values.
Flatten the nested dictionaries and then use the DataFrame constructor to create a data frame.
data = [
{'event': 'User Clicked', 'properties': {'user_id': '123', 'page_visited': 'contact_us'}},
{'event': 'User Clicked', 'properties': {'user_id': '456', 'page_visited': 'homepage'}},
{'event': 'User Clicked', 'properties': {'user_id': '789', 'page_visited': 'restaurant'}}
]
The flattened dictionary may be constructed in several ways. Here is one method, using a generator, that is generic and works with arbitrarily deep nested dictionaries (at least until it hits the maximum recursion depth):
def flatten(kv, prefix=[]):
    for k, v in kv.items():
        if isinstance(v, dict):
            yield from flatten(v, prefix + [str(k)])
        else:
            if prefix:
                yield '_'.join(prefix + [str(k)]), v
            else:
                yield str(k), v
Then flatten every record in data with a generator expression and construct the data frame:
pd.DataFrame({k:v for k, v in flatten(kv)} for kv in data)
#Out
event properties_page_visited properties_user_id
0 User Clicked contact_us 123
1 User Clicked homepage 456
2 User Clicked restaurant 789
You have 2 options: either use a MultiIndex for columns, or add a prefix for data in properties. The former, in my opinion, is not appropriate here, since you don't have a "true" hierarchical columnar structure. The second level, for example, would be empty for event.
Implementing the second idea, you can restructure your list of dictionaries before feeding to pd.DataFrame. The syntax {**d1, **d2} is used to combine two dictionaries.
data_transformed = [{**{'event': d['event']},
                     **{f'properties_{k}': v for k, v in d['properties'].items()}}
                    for d in Data]
res = pd.DataFrame(data_transformed)
print(res)
event properties_page_visited properties_user_id
0 User Clicked contact_us 123
1 User Clicked homepage 456
2 User Clicked restaurant 789
This also aids writing to and reading from CSV files, where a MultiIndex can be ambiguous.
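Since the end goal in the question is a CSV file, the prefixed column names can also be written with the standard library alone. A minimal sketch, assuming every record flattens to the same set of keys; the helper name `flatten_record` is hypothetical, and an in-memory buffer stands in for a real file:

```python
import csv
import io

data = [
    {'event': 'User Clicked', 'properties': {'user_id': '123', 'page_visited': 'contact_us'}},
    {'event': 'User Clicked', 'properties': {'user_id': '456', 'page_visited': 'homepage'}},
    {'event': 'User Clicked', 'properties': {'user_id': '789', 'page_visited': 'restaurant'}},
]

def flatten_record(record, sep="_"):
    """Flatten nested dicts into one level, joining key paths with sep."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in flatten_record(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat

rows = [flatten_record(record) for record in data]

# Assumption: all rows share the keys of the first row
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

With pandas already in use, `pd.DataFrame(rows).to_csv(...)` would be the shorter route; the flat, prefixed column names are what keep the header unambiguous in either case.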
