Nested Dictionary using Pandas DataFrame - python

I have some data with duplicates that looks like this:
WEBPAGE    ID    VALUE
Webpage 1  ID 1  Value 1
Webpage 1  ID 1  Value 2
Webpage 1  ID 1  Value 3
Webpage 1  ID 2  Value 4
Webpage 1  ID 2  Value 5
Each webpage can have more than 1 ID associated with it and each ID can have more than one value associated with it.
I'd like to ideally have a nested dictionary with lists to handle the multiple IDs and multiple values:
{WEBPAGE: {ID 1: [value 1, value 2, value 3], ID 2: [value 4, value 5]}}
I've tried using to_dict and groupby but I can't seem to find the right syntax to create a nested dictionary with them.

Try:
out = {}
for _, x in df.iterrows():
    out.setdefault(x["WEBPAGE"], {}).setdefault(x["ID"], []).append(x["VALUE"])
print(out)
Prints:
{
    "Webpage 1": {
        "ID 1": ["Value 1", "Value 2", "Value 3"],
        "ID 2": ["Value 4", "Value 5"],
    }
}
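The same nesting can also be built with collections.defaultdict, which avoids the repeated setdefault calls. A minimal sketch, assuming the sample data from the question is loaded into a DataFrame:

```python
import pandas as pd
from collections import defaultdict

# sample data reconstructed from the question's table
df = pd.DataFrame({
    "WEBPAGE": ["Webpage 1"] * 5,
    "ID": ["ID 1", "ID 1", "ID 1", "ID 2", "ID 2"],
    "VALUE": ["Value 1", "Value 2", "Value 3", "Value 4", "Value 5"],
})

# defaultdict creates the inner dict and list on first access
out = defaultdict(lambda: defaultdict(list))
for _, row in df.iterrows():
    out[row["WEBPAGE"]][row["ID"]].append(row["VALUE"])

# convert back to plain dicts for a clean repr
out = {k: dict(v) for k, v in out.items()}
print(out)
```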

For a pandas approach, you just need to use a nested groupby:
d = (df.groupby('WEBPAGE')
       .apply(lambda g: g.groupby('ID')['VALUE'].agg(list).to_dict())
       .to_dict()
     )
Output:
{'Webpage 1': {'ID 1': ['Value 1', 'Value 2', 'Value 3'],
               'ID 2': ['Value 4', 'Value 5']}}
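A self-contained sketch of the nested groupby, with the question's sample data reconstructed inline (note that recent pandas versions may warn about apply operating on the grouping columns; the result is unchanged):

```python
import pandas as pd

# sample data reconstructed from the question's table
df = pd.DataFrame({
    "WEBPAGE": ["Webpage 1"] * 5,
    "ID": ["ID 1", "ID 1", "ID 1", "ID 2", "ID 2"],
    "VALUE": ["Value 1", "Value 2", "Value 3", "Value 4", "Value 5"],
})

# outer groupby on WEBPAGE; inner groupby on ID collects VALUE into lists
d = (df.groupby('WEBPAGE')
       .apply(lambda g: g.groupby('ID')['VALUE'].agg(list).to_dict())
       .to_dict())
print(d)
```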

Another possible solution, using a nested dictionary comprehension:
{x: {y: [z for z in df.VALUE[(df.WEBPAGE == x) & (df.ID == y)]]
     for y in df.ID[df.WEBPAGE == x]} for x in df.WEBPAGE}
Output:
{'Webpage 1': {'ID 1': ['Value 1', 'Value 2', 'Value 3'],
               'ID 2': ['Value 4', 'Value 5']}}

Related

Stock taking items in Python Dataframe into a dictionary

Suppose I have a Python Dataframe:
Column A  Column B
A         Val 1
A         Val 2
B         Val A
B         Val B
B         Val C
B         Val D
I want to stock-take Column B into a dictionary with key = unique values of Column A, as such:
out = { 'A': ['Val 1', 'Val 2'],
        'B': ['Val A', 'Val B', 'Val C', 'Val D'] }
How would I do that?
I tried making a pivot table, but it only allows aggregating Column B; I want the values kept as separate items in a list.
One way using pandas.DataFrame.groupby:
out = df.groupby("Column A")["Column B"].apply(list).to_dict()
Output:
{'A': ['Val 1', 'Val 2'], 'B': ['Val A', 'Val B', 'Val C', 'Val D']}
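Put together as a runnable sketch, with the question's sample data reconstructed inline:

```python
import pandas as pd

# sample data reconstructed from the question's table
df = pd.DataFrame({
    "Column A": ["A", "A", "B", "B", "B", "B"],
    "Column B": ["Val 1", "Val 2", "Val A", "Val B", "Val C", "Val D"],
})

# group rows by Column A and collect the Column B values of each group into a list
out = df.groupby("Column A")["Column B"].apply(list).to_dict()
print(out)
```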

How to combine multiple columns of a pandas Dataframe into one column in JSON format

I have a sample dataframe as follows:
Main Key  Second  Column A  Column B  Column C  Column D  Column E
First     A       Value 1   Value 2   Value 3   Value 4   Value 5
Second    B       Value 6   Value 7   Value 8   Value 9   Value 10
Third     C       Value 11  Value 12  Value 13  Value 14  Value 15
Fourth    D       Value 16  Value 17  Value 18  Value 19  Value 20
I want to make a new column called 'Aggregated Data', where each value in Columns A to E becomes a key-value pair, and all of them are combined into 'Aggregated Data' in JSON format.
The expected output would look like this:
Main Key  Second  Aggregated Data
First     A       {"Column A":"Value 1","Column B":"Value 2","Column C":"Value 3","Column D":"Value 4","Column E":"Value 5"}
Second    B       {"Column A":"Value 6","Column B":"Value 7","Column C":"Value 8","Column D":"Value 9","Column E":"Value 10"}
Third     C       {"Column A":"Value 11","Column B":"Value 12","Column C":"Value 13","Column D":"Value 14","Column E":"Value 15"}
Fourth    D       {"Column A":"Value 16","Column B":"Value 17","Column C":"Value 18","Column D":"Value 19","Column E":"Value 20"}
Any idea how this can be achieved? Thanks
Via an intermediate pandas.DataFrame.to_dict call (with orient='records' to obtain a list like [{column -> value}, …, {column -> value}]):
df[['Main Key', 'Second']].assign(Aggregated_Data=df.set_index(['Main Key', 'Second']).to_dict(orient='records'))
  Main Key Second                                    Aggregated_Data
0    First      A  {'Column A': 'Value 1 ', 'Column B': 'Value 2 ...
1   Second      B  {'Column A': 'Value 6 ', 'Column B': 'Value 7 ...
2    Third      C  {'Column A': 'Value 11 ', 'Column B': 'Value 1...
3   Fourth      D  {'Column A': 'Value 16 ', 'Column B': 'Value 1...
Just skip the first two columns and call to_json:
out = (df[["Main Key", "Second"]]
       .assign(Aggregated_Data=df.iloc[:, 2:]
                                 .apply(lambda x: x.to_json(), axis=1)))
Alternatively, use a dict/list comprehension:
df["Aggregated_Data"] = [{k: v for k, v in zip(df.columns[2:], v)}
                         for v in df.iloc[:, 2:].to_numpy()]
Output:
print(out)
  Main Key Second                                    Aggregated_Data
0    First      A  {"Column A":"Value 1","Column B":"Value 2","Co...
1   Second      B  {"Column A":"Value 6","Column B":"Value 7","Co...
2    Third      C  {"Column A":"Value 11","Column B":"Value 12","...
3   Fourth      D  {"Column A":"Value 16","Column B":"Value 17","...
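A minimal runnable sketch of the to_json approach, using an abbreviated two-row version of the question's data (Columns A and B only, for brevity):

```python
import pandas as pd

# abbreviated sample data based on the question's table
df = pd.DataFrame({
    "Main Key": ["First", "Second"],
    "Second": ["A", "B"],
    "Column A": ["Value 1", "Value 6"],
    "Column B": ["Value 2", "Value 7"],
})

# serialize everything after the first two columns into one JSON string per row
out = (df[["Main Key", "Second"]]
       .assign(Aggregated_Data=df.iloc[:, 2:]
                                 .apply(lambda x: x.to_json(), axis=1)))
print(out)
```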

Transforming my concatenated tuple into a pandas DataFrame [duplicate]

This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
(16 answers)
Closed last year.
I wrote the following code (not sure if this is the best approach); just know that the data is divided into two separate lists, in the correct order: z[0] is the step and z[1] is the list of user ids.
for i,z in enumerate(zip(steps,userids_list)):
print(z)
This results in the following tuple values:
# SAMPLE
(('Step 1 string', [list of userid of that step]),
('Step 2 string', [list of userid of that step]),
('Step 3 string', [list of userid of that step]),
('Step n string', [list of userids of that step]))
My goal is to transform that style of data into the following pandas DataFrame.
Column 1 Column 2
Step 1 User id
Step 1 User id
Step 2 User id
Step 2 User id
Step 3 User id
Step 3 User id
Unfortunately I couldn't find a way to transform the data into what I want. Any ideas on what I could try to do?
explode is perfect for this. Load your data into a dataframe and then explode the column containing the lists:
df = pd.DataFrame({
    'Column 1': Z[0],
    'Column 2': Z[1],
})
df = df.explode('Column 2')
For example:
steps = ['Step 1', 'Step 2', 'Step 3']
user_ids = [
    ['user a', 'user b'],
    ['user a', 'user b', 'user c'],
    ['user c'],
]
df = pd.DataFrame({
    'step': steps,
    'user_id': user_ids,
})
df = df.explode('user_id').reset_index(drop=True)
print(df)
Output:
     step user_id
0  Step 1  user a
1  Step 1  user b
2  Step 2  user a
3  Step 2  user b
4  Step 2  user c
5  Step 3  user c
data = (('Step 1 string', [list of userid of that step]),
        ('Step 2 string', [list of userid of that step]),
        ('Step 3 string', [list of userid of that step]),
        ('Step n string', [list of userids of that step]))
df = pd.DataFrame(data, columns=['Column 1', 'Column 2'])
This should do the job.
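A runnable sketch combining the two answers, with hypothetical stand-in data for the (step, [user ids]) tuples in the question:

```python
import pandas as pd

# hypothetical stand-in for the question's tuple of (step, [user ids]) pairs
data = (('Step 1 string', ['user a', 'user b']),
        ('Step 2 string', ['user c']))

# each tuple becomes one row; Column 2 holds a list per row
df = pd.DataFrame(data, columns=['Column 1', 'Column 2'])

# explode turns each list element into its own row, repeating Column 1
df = df.explode('Column 2').reset_index(drop=True)
print(df)
```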

Create a new column for a dataframe based on a complicated dictionary

I have a complicated dictionary to remap values. How do I achieve this in Python?
cols_to_check = ["ColA", "ColB", "ColC"]
dic_string = "{(Type 1 | Type 2) : Type dual,
               (Type 3 | Type 4) : Type many,
               ELSE: Not listed
              }"
df = pd.DataFrame(
    {
        'ID': ['AB01', 'AB02', 'AB03', 'AB04', 'AB05', 'AB06', 'AB07', 'AB08'],
        'ColA': ["Type 1", "Undef", np.nan, "Undef",
                 "Type 1", "", "", "Undef"],
        'ColB': ["N", "Type 2", "", "",
                 "Y", np.nan, "", "N"],
        'ColC': [np.nan, "Undef", "Type 3", np.nan, "Undef",
                 "Undef", "", "Type 2"]
    })
I can do this if it's a simple dictionary and there is no ELSE in it. Assume I can convert dic_string to the following:
dic = {"Type 1": "Type dual", "Type 2": "Type dual",
       "Type 3": "Type many", "Type 4": "Type many",
       "ELSE": "Not listed"
      }
How can I make the end result look like this, with the new column "Result"? And how do I achieve it without hardcoding the contents of dic?
Use np.select:
dic = {('Type 1', 'Type 2'): 'Type dual',
       ('Type 3', 'Type 4'): 'Type many'}
default = 'Not listed'
condlist = [df[cols_to_check].isin(k).any(axis=1) for k in dic]
choicelist = dic.values()
df['Result'] = np.select(condlist, choicelist, default)
Output:
     ID    ColA    ColB    ColC      Result
0  AB01  Type 1       N     NaN   Type dual
1  AB02   Undef  Type 2   Undef   Type dual
2  AB03     NaN          Type 3   Type many
3  AB04   Undef             NaN  Not listed
4  AB05  Type 1       Y   Undef   Type dual
5  AB06             NaN   Undef  Not listed
6  AB07                          Not listed
7  AB08   Undef       N  Type 2   Type dual
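The np.select answer end to end, as a self-contained sketch using the DataFrame from the question:

```python
import numpy as np
import pandas as pd

cols_to_check = ["ColA", "ColB", "ColC"]
df = pd.DataFrame({
    'ID': ['AB01', 'AB02', 'AB03', 'AB04', 'AB05', 'AB06', 'AB07', 'AB08'],
    'ColA': ["Type 1", "Undef", np.nan, "Undef", "Type 1", "", "", "Undef"],
    'ColB': ["N", "Type 2", "", "", "Y", np.nan, "", "N"],
    'ColC': [np.nan, "Undef", "Type 3", np.nan, "Undef", "Undef", "", "Type 2"],
})

dic = {('Type 1', 'Type 2'): 'Type dual',
       ('Type 3', 'Type 4'): 'Type many'}

# one boolean mask per key: True where any checked column holds one of the key's types
condlist = [df[cols_to_check].isin(k).any(axis=1) for k in dic]
df['Result'] = np.select(condlist, list(dic.values()), default='Not listed')
print(df)
```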

Skip item with more columns when creating Pandas DataFrame

I have a list of lists:
list = [
    ['Row 1', 'Value 1'],
    ['Row 2', 'Value 2'],
    ['Row 3', 'Value 3', 'Value 4']
]
And I have a list for dataframe header:
header_list = ['RowID', 'Value']
If I create the DataFrame using df = pd.DataFrame(list, columns=header_list), then Python will throw an error saying Row 3 has more than 2 columns, which is inconsistent with header_list.
So how can I skip Row 3 when creating the DataFrame? And how do I achieve this "in place", i.e. without building a new list by looping through the original list and appending only the items of length 2?
Thanks for the help!
First change the variable name list to L, because list shadows the built-in Python type.
Then for filter use list comprehension:
L = [['Row 1', 'Value 1'], ['Row 2', 'Value 2'], ['Row 3', 'Value 3', 'Value 4']]

# omit all rows whose length is not 2
df = pd.DataFrame([x for x in L if len(x) == 2], columns=header_list)
print (df)
   RowID    Value
0  Row 1  Value 1
1  Row 2  Value 2

# keep the last 2 values if len != 2
df = pd.DataFrame([x if len(x) == 2 else x[-2:] for x in L], columns=header_list)
print (df)
     RowID    Value
0    Row 1  Value 1
1    Row 2  Value 2
2  Value 3  Value 4
Or:
# keep the first 2 values if len != 2
df = pd.DataFrame([x if len(x) == 2 else x[:2] for x in L], columns=header_list)
print (df)
   RowID    Value
0  Row 1  Value 1
1  Row 2  Value 2
2  Row 3  Value 3
Try the code below:
list1 = [['Row 1', 'Value 1'], ['Row 2', 'Value 2'], ['Row 3', 'Value 3']]
dff = pd.DataFrame(list1)
dff = dff[[x for x in range(len(header_list))]]
dff.columns = header_list
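For the original question (dropping the over-long row entirely), the first answer's filter can be written against the header length rather than a hardcoded 2. A small self-contained sketch:

```python
import pandas as pd

header_list = ['RowID', 'Value']
L = [['Row 1', 'Value 1'],
     ['Row 2', 'Value 2'],
     ['Row 3', 'Value 3', 'Value 4']]

# keep only the rows whose length matches the header, so Row 3 is skipped
df = pd.DataFrame([x for x in L if len(x) == len(header_list)],
                  columns=header_list)
print(df)
```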
