How to normalize the JSON below using pandas in Django (Python)

Using this view.py query, my output looks like the JSON below. As you can see, the choices field contains multiple arrays, which I want to normalize into serial rows. Here is my JSON:
{"pages":[{"name":"page1","title":"SurveyWindow Pvt. Ltd. Customer Feedback","description":"Question marked * are compulsory.",
"elements":[{"type":"radiogroup","name":"question1","title":"Do you like our product? *","isRequired":true,
"choices":[{"value":"Yes","text":"Yes"},{"value":"No","text":"No"}]},{"type":"checkbox","name":"question2","title":"Please Rate Our PM Skill","isRequired":false,"choices":[{"value":"High","text":"High"},{"value":"Low","text":"Low"},{"value":"Medium","text":"Medium"}]},{"type":"radiogroup","name":"question3","title":"Do you like our services? *","isRequired":true,"choices":[{"value":"Yes","text":"Yes"},{"value":"No","text":"No"}]}]}]}
This is my view.py:

jsondata = SurveyMaster.objects.filter(survey_id='1H2711202014572740')
q = jsondata.values('survey_json_design')
qs_json = pd.DataFrame.from_records(q)
datatotable = pd.json_normalize(qs_json['survey_json_design'],
                                record_path=['pages', 'elements'])
qs_json = datatotable.to_html()

Based on your comments and picture, here's what I would do to go from the picture to something more SQL-friendly (what you refer to as "normalization"), but keep in mind this might blow up if you don't have sufficient memory.
Create a new list which you'll fill with the new data, then iterate over the pandas table's rows, and within each row over every item in the list column. For every iteration of the inner loop, use the data from the row (minus the column you're iterating over). For convenience I added it as the last element.
import pandas as pd

# Example data: one column of lists of dicts, one plain column
df = pd.DataFrame({"choices": [[{"text": "yes", "value": "yes"},
                                {"text": "no", "value": "no"}],
                               [{"ch1": 1, "ch2": 2}, {"ch3": "ch3"}]],
                   "name": ["kostas", "george"]})

data = []
for i, row in df.iterrows():
    for val in row["choices"]:
        # One output row per choice; the remaining row values come first
        data.append((*row.drop("choices").values, val))

df = pd.DataFrame(data, columns=["name", "choices"])
print(df)

     name                          choices
0  kostas  {'text': 'yes', 'value': 'yes'}
1  kostas    {'text': 'no', 'value': 'no'}
2  george             {'ch1': 1, 'ch2': 2}
3  george                   {'ch3': 'ch3'}
This is where I guess you want to go. All that's left is to just modify the column / variable names with your own data.
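As a side note, pandas 1.1+ can do the same row expansion in one step with DataFrame.explode; here is a minimal sketch of that alternative, reusing the shape of the example above:

import pandas as pd

df = pd.DataFrame({"choices": [[{"text": "yes", "value": "yes"},
                                {"text": "no", "value": "no"}]],
                   "name": ["kostas"]})

# One row per list element; the other columns are repeated as-is
exploded = df.explode("choices", ignore_index=True)

# Optionally expand each choice dict into its own columns
flat = pd.concat([exploded.drop(columns="choices"),
                  pd.json_normalize(exploded["choices"].tolist())], axis=1)
print(flat)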

Related

(Python) How to only keep specific part of cells in a dataframe

I want to clean up a source column of my dataframe. In the end, I only want to keep the part behind 'name'.
What is the best way to do this?
For example:
row 1, column 1:
{'id': 'rtl-nieuws', 'name': 'RTL Nieuws'}
row 2, column 1:
{'id': 'none', 'name': 'www.ad.nl'}
Desired outcome:
row 1, column 1:
RTL Nieuws
row 2, column 1:
www.ad.nl
Is this what you are trying to do? In the future, please consider providing a working example to start from.
data = pd.DataFrame({
    "id": ["rtl-nieuws", "none"],
    "name": ["RTL Nieuws", "www.ad.nl"]
}, index=[0, 1])

data.drop("id", axis=1)
#          name
# 0  RTL Nieuws
# 1   www.ad.nl
Considering your data seems to be stored as strings that look like dictionaries, you can use ast.literal_eval() to parse each cell and access the value at the 'name' key.
import ast
current_cell = "{'id': 'rtl-nieuws', 'name': 'RTL Nieuws'}"
name = ast.literal_eval(current_cell)['name']
print(name)
>>> RTL Nieuws
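To apply this to the whole column at once, something like the following sketch should work (the column name 'source' is assumed from the question):

import ast
import pandas as pd

df = pd.DataFrame({"source": ["{'id': 'rtl-nieuws', 'name': 'RTL Nieuws'}",
                              "{'id': 'none', 'name': 'www.ad.nl'}"]})

# Parse each string into a dict, then keep only the 'name' value
df["source"] = df["source"].apply(lambda s: ast.literal_eval(s)["name"])
print(df)
#        source
# 0  RTL Nieuws
# 1   www.ad.nl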

Fill pandas dataframe within a for loop

I am working with Amazon Rekognition to do some image analysis.
With a simple Python script, I get - at every iteration - a response of this type:
(example for the image of a cat)
{'Labels':
[{'Name': 'Pet', 'Confidence': 96.146484375, 'Instances': [],
'Parents': [{'Name': 'Animal'}]}, {'Name': 'Mammal', 'Confidence': 96.146484375,
'Instances': [], 'Parents': [{'Name': 'Animal'}]},
{'Name': 'Cat', 'Confidence': 96.146484375.....
I got all the attributes I need in a list, that looks like this:
[Pet, Mammal, Cat, Animal, Manx, Abyssinian, Furniture, Kitten, Couch]
Now, I would like to create a dataframe where the elements in the list above appear as columns and the rows take values 0 or 1.
I created a dictionary in which I add the elements in the list, so I get {'Cat': 1}, then I go to add it to the dataframe and I get the following error:
TypeError: Index(...) must be called with a collection of some kind, 'Cat' was passed.
Not only that, but I don't even seem to be able to add the information from different images to the same dataframe. For example, if I only insert the data in the dataframe (as rows, not columns), I get a series with the n elements (identified by Amazon Rekognition) of only the last image, i.e. I start from an empty dataframe at each iteration.
The result I would like to get is something like:
Image  Human  Animal  Flowers  etc...
Pic1   1      0       0
Pic2   0      0       1
Pic3   1      1       0
For reference, this is the code I am using now (I should add that I am working on a software called KNIME, but this is just Python):
from pandas import DataFrame
import pandas as pd
import boto3

fileName = flow_variables['Path_Arr[1]']  # This is just to tell Amazon the name of the image
bucket = 'mybucket'
client = boto3.client('rekognition', region_name='us-east-2')
response = client.detect_labels(Image={'S3Object':
                                       {'Bucket': bucket, 'Name': fileName}})
data = [str(response)]  # This is what I inserted in the first cell of this question
d = {}
for key, value in response.items():
    for el in value:
        if isinstance(el, dict):
            for k, v in el.items():
                if k == "Name":
                    d[v] = 1
print(d)
df = pd.DataFrame(d, ignore_index=True)
print(df)
output_table = df
I am definitely getting it all wrong both in the for loop and when adding things to my dataframe, but nothing really seems to work!
Sorry for the super long question, hope it was clear! Any ideas?
I do not know if this answers your question completely, because I do not know what your data can look like, but it should be a good step to help you. I added the same data multiple times, but the approach should be clear.
import pandas as pd

response = {'Labels': [{'Name': 'Pet', 'Confidence': 96.146484375, 'Instances': [],
                        'Parents': [{'Name': 'Animal'}]},
                       {'Name': 'Cat', 'Confidence': 96.146484375,
                        'Instances': [{'BoundingBox': {'Width': 0.6686800122261047,
                                                       'Height': 0.9005332589149475,
                                                       'Left': 0.27255237102508545,
                                                       'Top': 0.03728689253330231},
                                       'Confidence': 96.146484375}],
                        'Parents': [{'Name': 'Pet'}]}]}

def handle_new_data(response_data: dict, image_name: str) -> pd.DataFrame:
    # Collect a 1 for every label name found in the response
    d = {"Image": image_name}
    for key, value in response_data.items():
        for el in value:
            if isinstance(el, dict):
                for k, v in el.items():
                    if k == "Name":
                        d[v] = 1
    return pd.DataFrame([d])  # one row per image

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_all = pd.concat([handle_new_data(response, "image1"),
                    handle_new_data(response, "image2"),
                    handle_new_data(response, "image3"),
                    handle_new_data(response, "image4")],
                   ignore_index=True)
print(df_all)
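One detail: labels that never occur for a given image end up as NaN rather than the 0 from your desired table; a fillna after the concatenation takes care of that:

# Labels absent for an image show up as NaN when frames with different
# columns are combined; replace them with 0 to get the 0/1 table
df_all = df_all.fillna(0)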

iterating through csv dataframe to create/assign variables

I have a csv file that will contain a frequently updated (overwritten) dataframe with a few rows of purchase orders, something like this:
uniqueId  item     action  quantity  price
123       widget1  buy     10        99.44
234       widget2  sell    15        19.99
345       widget3  buy     2         999.99
This csv file will be passed to my python code by another program; my code will check for its presence every few minutes. Once it appears, the code will read it. I'm not including the code for that, since that's not the issue.
The idea is to turn this purchase order dataframe into something that I can pass to my (already written) place-the-order code. I want to iterate through each row in order (enumerate?), and assign the values from that row to variables that I use in the order code, then reassign the new values to the same variable for the next row after the order from that row has been placed.
As I understand it, itertuples is probably the way to go for iterating through it, but I'm new enough to Python that I can't figure out the actual mechanism/syntax of using it to do what I want. All my trial-and-error tests for assigning the values to reusable variables result in syntax errors.
I'm having a mental block on what is probably very basic python! I know how to iterate through the rows and print 'em out--plenty of examples out there show me how to do that--but not how to turn the data into something I can use elsewhere. Can someone walk me through an example or two that actually applies to what I'm trying to do?
Like you said, you can quite easily iterate over a dataframe with .itertuples()
Here's how I would go about it (df is your dataframe; for it I used the data from your example):
Code:
for row in df.itertuples():
    print(row)
Output:
Pandas(Index=0, uniqueId=123, item='widget1', action='buy', quantity=10, price=99.44)
Pandas(Index=1, uniqueId=234, item='widget2', action='sell', quantity=15, price=19.99)
Pandas(Index=2, uniqueId=345, item='widget3', action='buy', quantity=2, price=999.99)
If you want to get specific entries of the tuples, you can use their position in the tuple as an index (itertuples yields namedtuples, so attribute access works too, as shown further below):
Code:
for row in df.itertuples():
    uniqueID = row[1]
    print(uniqueID)
Output:
123
234
345
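The same loop with attribute access instead of positional indexing (this works as long as the column names are valid Python identifiers):

for row in df.itertuples():
    # Each row is a namedtuple, so columns are available as attributes
    print(row.uniqueId)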
I'm not sure what the rest of your code looks like. If you have the place-the-order code inside a function, you could just call the function in the for loop after assigning the variables to your liking (it is called place_the_order below, since hyphens aren't allowed in Python identifiers):
for row in df.itertuples():
    uniqueID = row[1]
    item = row[2]
    action = row[3]
    quantity = row[4]
    price = row[5]
    place_the_order(uniqueID, item, action, quantity, price)
(You could even skip assigning the variables and just call place_the_order(row[1], row[2], ...). In my opinion it is more readable to assign the variables.)
If your place-the-order code is not in a function I would recommend using a nested dictionary with the row index as key and a dictionary of the content of the row as value. The row index is easily accessible as it is the first item in the tuple.
content_of_rows = {}
for row in df.itertuples():
    index = row[0]
    uniqueID = row[1]
    item = row[2]
    action = row[3]
    quantity = row[4]
    price = row[5]
    content_of_rows[index] = {"uniqueID": uniqueID, "item": item, "action": action,
                              "quantity": quantity, "price": price}
print(content_of_rows)
Output:
{0: {'uniqueID': 123, 'item': 'widget1', 'action': 'buy', 'quantity': 10, 'price': 99.44},
 1: {'uniqueID': 234, 'item': 'widget2', 'action': 'sell', 'quantity': 15, 'price': 19.99},
 2: {'uniqueID': 345, 'item': 'widget3', 'action': 'buy', 'quantity': 2, 'price': 999.99}}
This way you can't use the same variables for every row, since this is, generally speaking, just a different way of writing the dataframe. You can iterate over dictionaries in much the same way as over tuples, except that instead of numerical indices you use the keys.
for row in content_of_rows:
    # row is the key, so it is 0 in the first iteration, 1 in the second, and so on
    print(content_of_rows[row])
Output:
{'uniqueID': 123, 'item': 'widget1', 'action': 'buy', 'quantity': 10, 'price': 99.44}
{'uniqueID': 234, 'item': 'widget2', 'action': 'sell', 'quantity': 15, 'price': 19.99}
{'uniqueID': 345, 'item': 'widget3', 'action': 'buy', 'quantity': 2, 'price': 999.99}
If you want to get the uniqueID of the rows, you would do something like this:
Code:
for row in content_of_rows:
    print(content_of_rows[row]["uniqueID"])  # just put the second key you're looking for right after the first, also in []
Output:
123
234
345
It's usually best to put different parts of your code into functions, so if you haven't done that already I'd recommend you do so. That way you can use the same variables for each row.
I hope this (kinda long, sorry about that) answer could help you. Greetings from Bavaria!

Add same value to multiple sets of rows. The value changes based on condition

I have a dataframe that is dynamically created.
I create my first set of rows as:
df['tourist_spots'] = pd.Series(<A list of tourist spots in a city>)
To this df I add:
df['city'] = <City Name>
So far so good. A bunch of rows are created with the same city name for multiple tourist spots.
I want to add a new city. So I do:
df['tourist_spots'].append(pd.Series(<new data>))
Now, when I append a new city with:
df['city'].append('new city')
the previously updated city data is gone. It is as if every time the rows are replaced and not appended.
Here's an example of what I want:
Step 1:
df['tourist_spot'] = pd.Series('Golden Gate Bridge' + a bunch of other spots)
For all the rows created by the above data I want:
df['city'] = 'San Francisco'
Step 2:
df['tourist_spot'].append(pd.Series('Times Square' + a bunch of other spots))
For all the rows created by the above data, I want:
df['city'] = 'New York'
How can I achieve this?
Use a dictionary to add rows to your data frame; it is a faster method.
Here is an example.
STEP 1
Create a list of dictionaries:
dict_df = [{'tourist_spots': 'Jones LLC', 'City': 'Boston'},
           {'tourist_spots': 'Alpha Co', 'City': 'Boston'},
           {'tourist_spots': 'Blue Inc', 'City': 'Singapore'}]
STEP 2
Convert the list to a dataframe:
df = pd.DataFrame(dict_df)
STEP 3
Add new entries to the dataframe in dictionary format (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
df = pd.concat([df, pd.DataFrame([{'tourist_spots': 'New_Blue', 'City': 'Singapore'}])],
               ignore_index=True)
References:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_dict.html
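Applied to the city example from the question, the same idea looks roughly like this (a sketch; the spot names are made up):

import pandas as pd

# Build one small frame per city, then stack them
sf = pd.DataFrame({'tourist_spots': ['Golden Gate Bridge', 'Alcatraz']})
sf['city'] = 'San Francisco'

ny = pd.DataFrame({'tourist_spots': ['Times Square', 'Central Park']})
ny['city'] = 'New York'

# Concatenating whole frames keeps the earlier city's rows intact
df = pd.concat([sf, ny], ignore_index=True)
print(df)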

Extract JSON | API | Pandas DataFrame

I am using the Facebook API (v2.10), from which I've extracted the data I need, 95% of which is perfect. My problem is the 'actions' metric, which returns as a dictionary within a list within another dictionary.
At present, all the data is in a DataFrame, however, the 'actions' column is a list of dictionaries that contain each individual action for that day.
{
  "actions": [
    {
      "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
      "value": "7"
    },
    {
      "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
      "value": "3"
    },
    {
      "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
      "value": "144"
    },
    {
      "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
      "value": "34"
    }
  ]
}
All this appears in one cell (row) within the DataFrame.
What is the best way to:
Get the action type, create a new column, and use the "action_type" value as the column name?
List the correct value under this column
It looks like JSON, but when I look at the type, it's a pandas Series (stored as an object).
For those willing to help (thank you, I greatly appreciate it): either point me in the direction of the right material and I will read it and work it out on my own (I'm not entirely sure what to look for), or, if you decide this is an easy problem, explain to me how and why you solved it this way. I don't just want the answer.
I have tried the following (with help from a friend) and it kind of works, but I have issues with it running in my script, i.e. if it runs within a bigger code block, I get the error shown below:
for i in range(df.shape[0]):
    line = df.loc[i, 'Conversions']
    L = ast.literal_eval(line)
    for l in L:
        cid = l['action_type']
        value = l['value']
        df.loc[i, cid] = value
If I save the DF as a csv, call it using pd.read_csv...it executes properly, but not within the script. No idea why.
Error:
ValueError: malformed node or string: [{'value': '1', 'action_type': 'offsite_conversion.custom.xxxxx}]
Any help would be greatly appreciated.
Thanks,
Adrian
You can use json_normalize:
In [11]: d # e.g. dict from json.load OR instead pass the json path to json_normalize
Out[11]:
{'actions': [{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx',
'value': '7'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '3'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '144'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '34'}]}
In [12]: pd.json_normalize(d, record_path="actions")  # pd.io.json.json_normalize in pandas < 1.0
Out[12]:
action_type value
0 offsite_conversion.custom.xxxxxxxxxxx 7
1 offsite_conversion.custom.xxxxxxxxxxx 3
2 offsite_conversion.custom.xxxxxxxxxxx 144
3 offsite_conversion.custom.xxxxxxxxxxx 34
You can use df.join(pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)).
Explanation:
df['Conversions'].tolist() returns a list of dictionaries. This list is then transformed into a DataFrame using pd.DataFrame. Then, you can use the pivot function to pivot the table into the shape that you want.
Lastly, you can join the table with your original DataFrame. Note that this only works if your DataFrame's index is the default (i.e., integers starting from 0). If this is not the case, you can do this instead:
df2 = pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)
for col in df2.columns:
df[col] = df2[col]
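Note that the join/pivot approach assumes one action dict per cell. If each cell holds a list of several actions, as in the JSON above, one way to handle that is sketched below (the short action_type names are made up):

import pandas as pd

df = pd.DataFrame({"Conversions": [[{"action_type": "a", "value": "7"},
                                    {"action_type": "b", "value": "3"}],
                                   [{"action_type": "a", "value": "2"}]]})

# Flatten the list-of-dicts cells, keeping the originating row in the index
flat = df["Conversions"].explode().apply(pd.Series)

# One column per action_type, one row per original row
wide = (flat.reset_index()
            .pivot_table(index="index", columns="action_type",
                         values="value", aggfunc="first"))
print(df.join(wide))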
