I need some help with labeling data inside a dataframe, based on dynamic conditions.
I have a dataframe:
df3 = pd.DataFrame({
'first_name': ['John', 'John', 'Jane', 'Jane', 'Jane','Marry', 'Victoria', 'Gabriel', 'John'],
'id': [1, 1, 2, 2, 2, 3, 4, 5, 1],
'age': [30, 30, 25, 25, 25, 30, 45, 15, 30],
'group': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'product_type': [1, 1, 2, 1, 2, 1, 2, 1, 2],
'quantity': [10, 15, 10, 10, 15, 30, 30, 10, 10]
})
df3['agemore'] = (df3['age'] > 20)
df3
So I need to take the first person with id=1 and group=0 and label him with group=1 (on all of his rows).
This person appears on 3 rows (indexes 0, 1, 8) and has agemore=True, product_type = 1, 1, 2 and quantity = 10, 15, 10.
The conditions for finding matching persons are based on the product_type, quantity, and agemore columns.
The first taken person's slice:
df6 = df3.loc[lambda df: (df['id'] == 1) & (df['product_type'] == 1), :]
df6
I need to take agemore = True, product_type = 1 (which is on two rows) and the quantities of products of this type (10, 15) as the conditions.
Then I will look for persons who have agemore = True, product_type = 2 (two, since it's a cross-column search; also on two rows) and quantities for product_type = 2 of (10, 15). The matching person has id 2. I must put this person in group 1 as well.
Then take the next person with the lowest id and group=0, take his conditions, look for similar persons, group them together, etc.
The output I would like to have:
df4 = pd.DataFrame({
'first_name': ['John', 'John', 'Jane', 'Jane', 'Jane','Marry', 'Victoria', 'Gabriel', 'John'],
'id': [1, 1, 2, 2, 2, 3, 4, 5, 1],
'age': [30, 30, 25, 25, 25, 30, 45, 15, 30],
'group': [1, 1, 1, 1, 1, 2, 2, 3, 1],
'product_type': [1, 1, 2, 1, 2, 1, 2, 1, 2],
'quantity': [10, 15, 10, 10, 15, 30, 30, 10, 10]
})
df4
Set 2:
import pandas as pd
data = pd.DataFrame({
'first_name': ['John', 'John', 'Jane', 'Jane', 'Jane','Marry', 'Victoria', 'Gabriel', 'John'],
'id': [1, 1, 2, 2, 2, 3, 4, 5, 1],
'age': [30, 30, 25, 25, 25, 30, 45, 15, 3],
'group': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'product_type': [1, 1, 2, 1, 2, 1, 2, 1, 1],
'quantity': [10, 15, 10, 10, 15, 30, 30, 10, 10]
})
data['agemore'] = (data['age'] > 20)
rm1991, thanks for clarifying your question.
From the information provided, I gathered that you are trying to group customers by their behavior and age group. I can also infer that the IDs are assigned to customers when they first make a transaction with you, which means that the higher the ID value, the newer the customer is to the company.
If this is the case, I would suggest you use an unsupervised learning method to cluster the data points by their similarity regarding the product type, quantity purchased, and age group. Have a look at the scikit-learn suite of clustering algorithms for further information.
NB: upon further clarification from rm1991, it seems that product_type is not a "clustering" criterion.
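For illustration only, here is a minimal clustering sketch along those lines, assuming scikit-learn is installed; the per-customer features (quantity statistics plus the agemore flag, with product_type deliberately left out per the NB above) are my own choice, not something given in the question:
# Hypothetical sketch -- the feature set below is an assumption, not from the question.
from sklearn.cluster import KMeans

# One row per customer: quantity statistics and the agemore flag as features.
features = (data.groupby('id')
                .agg(total_qty=('quantity', 'sum'),
                     mean_qty=('quantity', 'mean'),
                     agemore=('agemore', 'max'))
                .astype(float))

km = KMeans(n_clusters=3, n_init=10, random_state=0)
features['cluster'] = km.fit_predict(features)
print(features)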
I have replicated your output using only Pandas logic within a loop, as you can see below:
import pandas as pd
data = pd.DataFrame({
'first_name': ['John', 'John', 'Jane', 'Jane', 'Jane','Marry', 'Victoria', 'Gabriel', 'John'],
'id': [1, 1, 2, 2, 2, 3, 4, 5, 1],
'age': [30, 30, 25, 25, 25, 30, 45, 15, 30],
'group': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'product_type': [1, 1, 2, 1, 2, 1, 2, 1, 2],
'quantity': [10, 15, 10, 10, 15, 30, 30, 10, 10]
})
data['agemore'] = (data['age'] > 20)
group_val = 0
for cust_id in data['id'].unique():
    age_param = list(set(data.loc[data['id'] == cust_id, 'agemore']))
    # Product type removed as per latest requirements
    # product_type_param = list(set(data.loc[data['id'] == cust_id, 'product_type']))
    quantity_param = list(set(data.loc[data['id'] == cust_id, 'quantity']))
    if data.loc[(data['id'] == cust_id) & (data['group'] == 0), :].shape[0] > 0:
        group_val += 1
        data.loc[(data['group'] == 0)
                 & (data['agemore'].isin(age_param))
                 # Product type removed as per latest requirements
                 # & (data['product_type'].isin(product_type_param))
                 & (data['quantity'].isin(quantity_param)), 'group'] = group_val
Now the output does match what you've posted earlier:
first_name id age group product_type quantity agemore
0 John 1 30 1 1 10 True
1 John 1 30 1 1 15 True
2 Jane 2 25 1 2 10 True
3 Jane 2 25 1 1 10 True
4 Jane 2 25 1 2 15 True
5 Marry 3 30 2 1 30 True
6 Victoria 4 45 2 2 30 True
7 Gabriel 5 15 3 1 10 False
8 John 1 30 1 2 10 True
It remains unclear to me why Victoria, with ID = 4, would be assigned to the same group as Marry (ID = 3), given that they have not purchased the same product_type.
I hope this is helpful.
Related
I have two lists
x=[1,2,5,4,3]
y=[4,8,9,2,18]
and a .csv table that looks like the one below.
ID  Age  Group  Name
1   4    3      Sam
2   50   1      Raj
3   18   9      John
My goal is to print a list of the Group values for elements that are in both (x, y) and (ID, Age). For example: since (1, 4) is in (x, y) and also in (ID, Age), the list would contain 3. (3, 18) is a similar case, as it is in both (x, y) and (ID, Age). So my result would be a list of these numbers, [3, 9].
I tried doing result=df[df['ID'].isin(x), df['Age'].isin(y)]['Group'] but this didn't get me anywhere. I am stuck on what to do next. Any help would be appreciated.
What you are trying to do seems to be an AND, which is done with
df[df['ID'].isin(x) & df['Age'].isin(y)]
But with that data
df = pd.DataFrame([{'ID': 1, 'Age': 4, 'Group': 3, 'Name': 'Sam'},
{'ID': 2, 'Age': 50, 'Group': 1, 'Name': 'Raj'},
{'ID': 3, 'Age': 18, 'Group': 9, 'Name': 'John'},
{'ID': 3, 'Age': 19, 'Group': 9, 'Name': 'John'}])
x = [1, 2, 5, 4, 3]
y = [4, 8, 9, 19, 18]
It would also return the (3, 19) row, even though that isn't a pair:
ID Age Group Name
0 1 4 3 Sam
2 3 18 9 John
3 3 19 9 John
You need to check pair by pair per row; here's one way:
pairs = list(zip(x, y))
result = df[pd.Series(zip(df['ID'], df['Age'])).isin(pairs)]['Group']
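Equivalently, here is a small sketch of the same pair-wise match done with a merge (pairs_df is just an illustrative name):
# Build a frame of the valid (ID, Age) pairs and inner-join on both columns.
pairs_df = pd.DataFrame({'ID': x, 'Age': y})
result = df.merge(pairs_df, on=['ID', 'Age'])['Group'].tolist()
print(result)  # [3, 9] for the data above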
I am trying to create this staffing grid to make my admin work easier at work. 'days' contains a week:
days = ['M', 'T', 'W', 'Th', 'F']
days = [0, 1, 1, 1, 1] means s/he works every day except for Mondays.
If the value is 2, that means they work a special shift.
S/he works from start_time to end_time - e.g. wakana works 0600-1400 every day.
S/he works from special_start to special_end on days where the value is 2, e.g. eleonor works 0700-1900 Monday and Friday, and 0700-1500 on Wednesday.
I got Monday down, but I know there is a better way, perhaps using a function, to print all days. I have been playing around with it forever now, but I cannot figure it out. Thank you in advance! I have so much respect for all of you experts!
staffing_data = [
{'name': 'wakana',
'start_time': 6,
'end_time': 14,
'days': [1, 1, 1, 1, 1],
'special_start': None,
'special_end': None},
{'name': 'kate',
'start_time': 11,
'end_time': 21,
'days': [0, 1, 1, 1, 1],
'special_start': None,
'special_end': None},
{'name': 'eleonor',
'start_time': 7,
'end_time': 19,
'days': [1, 0, 2, 0, 1],
'special_start': 7,
'special_end': 15}]
at_7 = 0
at_11 = 0
at_15 = 0
at_19 = 0
for person in staffing_data:
    if person['start_time'] <= 7 and person['end_time'] > 7 and person['days'][0] == 1:
        at_7 += 1
    if person['start_time'] <= 11 and person['end_time'] > 11 and person['days'][0] == 1:
        at_11 += 1
    if person['start_time'] <= 15 and person['end_time'] > 15 and person['days'][0] == 1:
        at_15 += 1
    if person['start_time'] <= 19 and person['end_time'] > 19 and person['days'][0] == 1:
        at_19 += 1
print(f"{at_7} at 7")
print(f"{at_11} at 11")
print(f"{at_15} at 15")
print(f"{at_19} at 19")
#Monday Staffing
#2 at 7
#3 at 11
#1 at 15
#0 at 19
You just need another loop to iterate over the days, and a structure to store the results.
staffing_data = [
{'name': 'wakana',
'start_time': 6,
'end_time': 14,
'days': [1, 1, 1, 1, 1],
'special_start': None,
'special_end': None},
{'name': 'kate',
'start_time': 11,
'end_time': 21,
'days': [0, 1, 1, 1, 1],
'special_start': None,
'special_end': None},
{'name': 'eleonor',
'start_time': 7,
'end_time': 19,
'days': [1, 0, 2, 0, 1],
'special_start': 7,
'special_end': 15}]
days = ['M', 'T', 'W', 'Th', 'F']
#result = [{"at_7":0,"at_11":0,"at_15":0,"at_19":0} for _ in range(len(days))]
result = []
for _ in range(len(days)):
    result.append({"at_7": 0, "at_11": 0, "at_15": 0, "at_19": 0})

for person in staffing_data:
    for day in range(len(days)):
        start = 'start_time'
        end = 'end_time'
        if person['days'][day] == 0:
            continue
        elif person['days'][day] == 2:
            start = 'special_start'
            end = 'special_end'
        if person[start] <= 7 and person[end] > 7:
            result[day]["at_7"] += 1
        if person[start] <= 11 and person[end] > 11:
            result[day]["at_11"] += 1
        if person[start] <= 15 and person[end] > 15:
            result[day]["at_15"] += 1
        if person[start] <= 19 and person[end] > 19:
            result[day]["at_19"] += 1

for i in range(len(days)):
    print(days[i])
    print(f"{result[i]['at_7']} at 7")
    print(f"{result[i]['at_11']} at 11")
    print(f"{result[i]['at_15']} at 15")
    print(f"{result[i]['at_19']} at 19")
    print()
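Since you mentioned a function might be cleaner, here is one possible sketch of the same logic factored into a helper (the staffed_at name is mine):
def staffed_at(staff, day_index, hour):
    # Count how many people are on shift at `hour` on the given day.
    count = 0
    for person in staff:
        flag = person['days'][day_index]
        if flag == 0:  # day off
            continue
        start = person['special_start'] if flag == 2 else person['start_time']
        end = person['special_end'] if flag == 2 else person['end_time']
        if start <= hour < end:
            count += 1
    return count

for i, day in enumerate(days):
    print(day, {h: staffed_at(staffing_data, i, h) for h in (7, 11, 15, 19)})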
I'm trying to use groupby and get the values as a list.
The end df should have "bid" as the index and the scores as a list in the second column (e.g. [85, 58] if two rows have the same "bid").
This is my df:
When I use merged.groupby("bid")['score_y'].apply(list)
I get TypeError: 'Series' objects are mutable, thus they cannot be hashed.
Does anyone know why I'm getting this error?
Edit 1:
This is the datasource: https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores-LIVES-Standard/pyih-qa8i
The df "ins" yields the following where "bid" are the numbers before the "_' in "iid".
My code so far:
ins2018 = ins[ins['year'] == 2018] #.drop(["iid", 'date', 'type', 'timestamp', 'year', 'Missing Score'], axis = 1)
# new = ins2018.loc[ins2018["score"] > 0].sort_values("date").groupby("bid").count()
# new = new.loc[new["iid"] == 2]
# merge = pd.merge(new, ins2018, how = "left", on = "bid").sort_values('date_y')
# merged = merge.loc[merge['score_y'] > 0].drop(['iid_x', 'date_x', 'score_x', 'type_x', 'timestamp_x', 'year_x', 'Missing Score_x', 'iid_y', 'type_y', 'timestamp_y', 'year_y', 'Missing Score_y', "date_y"], axis = 1)
Aggregate a list onto score_y with pandas.DataFrame.aggregate.
Depending on merged, the index may need to be reset.
# reset the index of of merged
merged = merged.reset_index(drop=True)
# groupby bid and aggregate a list onto score_y
merged.groupby('bid').agg({'score_y': list})
Example
import pandas as pd
import numpy as np
import random
np.random.seed(365)
random.seed(365)
rows = 100
data = {'a': np.random.randint(10, size=(rows)),
'groups': [random.choice(['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']) for _ in range(rows)]}
df = pd.DataFrame(data)
# groupby and aggregate a list
dfg = df.groupby('groups').agg({'a': list})
dfg
[out]:
a
groups
1-5 [7, 8, 4, 3, 1, 7, 9, 3, 2, 7, 6, 4, 4, 6]
100-500 [4, 3, 2, 8, 6, 3, 1, 5, 7, 7, 3, 5, 4, 7, 2, 2, 4]
26-100 [4, 2, 2, 9, 5, 3, 1, 0, 7, 9, 7, 7, 9, 9, 9, 7, 0, 0, 4]
500-1000 [2, 8, 0, 7, 6, 6, 8, 4, 6, 2, 2, 5]
6-25 [5, 9, 7, 0, 6, 5, 7, 9, 9, 9, 6, 5, 6, 0, 2, 7, 4, 0, 3, 9, 0, 5, 0, 3]
>1000 [2, 1, 3, 6, 7, 6, 0, 5, 9, 9, 3, 2, 6, 0]
Using data from Restaurant Scores - LIVES Standard
This attempts to follow along with the code in the OP.
import pandas as pd
# load data
ins = pd.read_csv('data/Restaurant_Scores_-_LIVES_Standard.csv')
# convert inspection_date to a datetime format
ins.inspection_date = pd.to_datetime(ins.inspection_date)
# add a year column
ins['year'] = ins.inspection_date.dt.year
# select data for 2018
ins2018 = ins[ins['year'] == 2018]
################################################################
# this is where you run into issues
# new is the counts for every column
# this is what you could have done to get the number of inspection counts
# just count the occurrences of business_id
counts = ins2018.groupby('business_id').agg({'business_id': 'count'}).rename(columns={'business_id': 'inspection_counts'}).reset_index()
# don't do this: get dataframe of counts
# new = ins2018.loc[ins2018["inspection_score"] > 0].sort_values("inspection_date").groupby("business_id").count()
# don't do this: select data
# new = new.loc[new["inspection_id"] == 2].reset_index()
# merge updated
merge = pd.merge(counts, ins2018, how = "left", on = "business_id")
################################################################
# select data again
merged = merge.loc[(merge['inspection_score_y'] > 0) & (merge.inspection_counts >= 2)]
# groupby and aggregate list
mg = merged.groupby('business_id').agg({'inspection_score_y': list})
# display(mg)
inspection_score_y
business_id
31 [96.0, 96.0]
54 [94.0, 94.0]
61 [94.0, 94.0]
66 [98.0, 98.0]
101 [92.0, 92.0]
groupby on ins updated
import pandas as pd
# load data and parse the dates
ins = pd.read_csv('data/Restaurant_Scores_-_LIVES_Standard.csv', parse_dates=['inspection_date'])
# select specific data
data = ins[(ins.inspection_date.dt.year == 2018) & (ins.inspection_score > 0)].dropna().reset_index(drop=True)
# groupby
dg = data.groupby('business_id').agg({'inspection_score': list})
# display(dg)
inspection_score
business_id
54 [94.0, 94.0]
146 [90.0, 81.0, 90.0, 81.0, 90.0, 81.0, 81.0, 81.0]
151 [81.0, 81.0, 81.0, 81.0, 81.0]
155 [90.0, 90.0, 90.0, 90.0]
184 [90.0, 90.0, 90.0, 96.0]
# if you only want results with 2 or more inspections
# get the length of the list, because each score represents an inspection
dg['inspection_count'] = dg.inspection_score.map(len)
# filter for 2 or more; this removes 81 business_id that had less than two inspections
dg = dg[dg.inspection_count >= 2]
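As a side note, the same filter can be done in one step, since .str.len() also works on list-valued columns:
# Equivalent sketch: filter directly on the length of each score list.
dg = dg[dg.inspection_score.str.len() >= 2]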
I have a dataframe like this:
df =
time_id gt_class num_missed_base num_missed_feature num_objects_base num_objects_feature
5G21A6P00L4100023:1566617404450336 CAR 11 4 27 30
5G21A6P00L4100023:1566617404450336 BICYCLE 4 6 27 30
5G21A6P00L4100023:1566617404450336 PERSON 2 3 27 30
5G21A6P00L4100023:1566617404450336 TRUCK 1 0 27 30
5G21A6P00L4100023:1566617428450689 CAR 25 14 60 67
5G21A6P00L4100023:1566617428450689 PERSON 7 6 60 67
5G21A6P00L4100023:1566617515950900 BICYCLE 1 1 59 65
5G21A6P00L4100023:1566617515950900 CAR 20 9 59 65
5G21A6P00L4100023:1566617515950900 PERSON 10 2 59 65
5G21A6P00L4100037:1567169649450046 CAR 8 0 29 32
5G21A6P00L4100037:1567169649450046 PERSON 1 0 29 32
5G21A6P00L4100037:1567169649450046 TRUCK 1 0 29 32
At each time_id it shows how many objects are missed in the base model (num_missed_base), how many are missed in the feature model (num_missed_feature), and how many objects exist at that time in the base and feature models (num_objects_base, num_objects_feature).
I need to draw a scatter plot (using plotly.graph_objs and FigureWidget) of time_id, such that when the user hovers over each point (each point represents a unique time_id) it shows the following for time_id == 5G21A6P00L4100023:1566617404450336:
What should be the hover_text in the code below?
import plotly.graph_objs as go
hover_text = ????
df_agg = df.groupby("time_id").sum().reset_index()
error_trace = go.Scattergl(
x=df_agg["num_missed_base"].tolist(),
y=df_agg["num_missed_feature"].tolist(),
text=hover_text,
mode="markers",
marker=dict(cmax=50, cmin=-50, opacity=0.3),
)
A pandas professional would certainly be able to make the code snippet below a bit more elegant and efficient. But my work-arounds will do the job as well. The main challenge is to turn your source dataframe into a grouped version like this:
time_id gt_class num_missed_base base_str num_missed_feature feature_str
0 5G21A6P00L4100023:1566617404450336 CAR,BICYCLE,PERSON,TRUCK 18 11,4,2,1 13 4,6,3,0
1 5G21A6P00L4100023:1566617428450689 CAR,PERSON 32 25,7 20 14,6
2 5G21A6P00L4100023:1566617515950900 BICYCLE,CAR,PERSON 31 1,20,10 12 1,9,2
3 5G21A6P00L4100037:1567169649450046 CAR,PERSON,TRUCK 10 8,1,1 0 0,0,0
The bad news is that this is not nearly enough. The good news is that the snippet below will handle it all and give you this plot:
What you see here is a plot that groups the associated data for each timestamp so that you can see the sum of, for example, num_missed_feature for all classes, and the number for each underlying class in the hoverinfo. With a little further tweaking I may be able to include the sums as well. But this is all I have time for right now.
Complete code:
import pandas as pd
import re
import plotly.graph_objects as go
smpl = {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
'columns': ['time_id',
'gt_class',
'num_missed_base',
'num_missed_feature',
'num_objects_base',
'num_objects_feature'],
'data': [['5G21A6P00L4100023:1566617404450336', 'CAR', 11, 4, 27, 30],
['5G21A6P00L4100023:1566617404450336', 'BICYCLE', 4, 6, 27, 30],
['5G21A6P00L4100023:1566617404450336', 'PERSON', 2, 3, 27, 30],
['5G21A6P00L4100023:1566617404450336', 'TRUCK', 1, 0, 27, 30],
['5G21A6P00L4100023:1566617428450689', 'CAR', 25, 14, 60, 67],
['5G21A6P00L4100023:1566617428450689', 'PERSON', 7, 6, 60, 67],
['5G21A6P00L4100023:1566617515950900', 'BICYCLE', 1, 1, 59, 65],
['5G21A6P00L4100023:1566617515950900', 'CAR', 20, 9, 59, 65],
['5G21A6P00L4100023:1566617515950900', 'PERSON', 10, 2, 59, 65],
['5G21A6P00L4100037:1567169649450046', 'CAR', 8, 0, 29, 32],
['5G21A6P00L4100037:1567169649450046', 'PERSON', 1, 0, 29, 32],
['5G21A6P00L4100037:1567169649450046', 'TRUCK', 1, 0, 29, 32]]}
df = pd.DataFrame(index=smpl['index'], columns = smpl['columns'], data=smpl['data'])
df['base_str'] = df['num_missed_base'].astype(str)
df['feature_str'] = df['num_missed_feature'].astype(str)
df2=df.groupby(['time_id'], as_index = False).agg({'gt_class': ','.join,
'num_missed_base':sum,
'base_str':','.join,
'num_missed_feature':sum,
'feature_str':','.join,})
col_elem = []
row_elem = []
for i in df2.index:
    gt_class = df2['gt_class'].loc[i].split(',')
    base_str = df2['base_str'].loc[i].split(',')
    for j, elem in enumerate(gt_class):
        new_elem = elem + ": " + base_str[j]
        row_elem.append(new_elem)
    col_elem.append(row_elem)
    row_elem = []
df2['hover']=col_elem
df2['hover'] = df2['hover'].astype(str)
df2['hover2'] = df2['hover'].map(lambda x: x.lstrip('[]').rstrip(']'))
#df2['hover2'].apply(lambda x: x.str.replace(',','.'))
df2['hover2']=df2['hover2'].replace("'",'', regex=True)
df2['hover2']=df2['hover2'].replace(',','<br>', regex=True)
# plotly
fig = go.Figure()
fig.add_traces(go.Scatter(x=df2['num_missed_base'], y=df2['num_missed_feature'],
mode='markers', marker=dict(color='red',
line=dict(color='black', width=1),
size=14),
#hovertext=df2["hover"],
hovertext=df2['hover2'],
hoverinfo="text",
))
fig.update_xaxes(showspikes=True, linecolor='black', title='Base',
spikecolor='black', spikethickness=0.5, spikedash='solid')
fig.update_yaxes(showspikes=True, linecolor='black', title = 'Feature',
spikecolor='black', spikethickness=0.5, spikedash='solid')
fig.update_layout(
paper_bgcolor='rgba(0,0,0,0)',
plot_bgcolor='rgba(0,0,0,0)'
)
fig.show()
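As an aside, here is a sketch of a more direct way to build the same hover strings, without the str()/strip round trip:
# Join "class: value" pairs with <br> directly, one string per grouped row.
df2['hover2'] = [
    '<br>'.join(f'{cls}: {val}' for cls, val in zip(classes.split(','), values.split(',')))
    for classes, values in zip(df2['gt_class'], df2['base_str'])
]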
Based on @vestland's answer I came up with this:
import pandas as pd
import plotly.graph_objects as go
smpl = {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
'columns': ['time_id',
'gt_class',
'num_missed_base',
'num_missed_feature',
'num_objects_base',
'num_objects_feature'],
'data': [['5G21A6P00L4100023:1566617404450336', 'CAR', 11, 4, 27, 30],
['5G21A6P00L4100023:1566617404450336', 'BICYCLE', 4, 6, 27, 30],
['5G21A6P00L4100023:1566617404450336', 'PERSON', 2, 3, 27, 30],
['5G21A6P00L4100023:1566617404450336', 'TRUCK', 1, 0, 27, 30],
['5G21A6P00L4100023:1566617428450689', 'CAR', 25, 14, 60, 67],
['5G21A6P00L4100023:1566617428450689', 'PERSON', 7, 6, 60, 67],
['5G21A6P00L4100023:1566617515950900', 'BICYCLE', 1, 1, 59, 65],
['5G21A6P00L4100023:1566617515950900', 'CAR', 20, 9, 59, 65],
['5G21A6P00L4100023:1566617515950900', 'PERSON', 10, 2, 59, 65],
['5G21A6P00L4100037:1567169649450046', 'CAR', 8, 0, 29, 32],
['5G21A6P00L4100037:1567169649450046', 'PERSON', 1, 0, 29, 32],
['5G21A6P00L4100037:1567169649450046', 'TRUCK', 1, 0, 29, 32]]}
df = pd.DataFrame(index=smpl['index'], columns = smpl['columns'], data=smpl['data'])
def func(row):
    return ','.join(row.tolist())

def multi_column1(row):
    l = []
    for n in row.index:
        x = df.loc[n, 'gt_class']
        y = df.loc[n, 'num_missed_base']
        z = df.loc[n, 'num_missed_feature']
        w = '{} : [base = {}, feature = {}]'.format(x, y, z)
        l.append(w)
    return l

if "hover_text" not in df.columns:
    df.insert(0, "hover_text", range(len(df)))
df = df.groupby('time_id').agg({'gt_class':func, 'num_missed_base': sum, 'num_missed_feature': sum, 'hover_text': multi_column1})
df.reset_index(inplace=True)
df['hover_text'] = df['hover_text'].astype(str)
df['hover_text'] = df['hover_text'].map(lambda x: x.lstrip('[]').rstrip(']'))
df['hover_text'] = df['hover_text'].replace("'",'', regex=True)
df['hover_text'] = df['hover_text'].replace('],',']<br>', regex=True)
# plotly
fig = go.Figure()
fig.add_traces(go.Scatter(x=df['num_missed_base'], y=df['num_missed_feature'],
mode='markers', marker=dict(color='red',
line=dict(color='black', width=1),
size=14),
#hovertext=df2["hover"],
hovertext=df['hover_text'],
hoverinfo="text",
))
fig.update_xaxes(showspikes=True, linecolor='black', title='Base',
spikecolor='black', spikethickness=0.5, spikedash='solid')
fig.update_yaxes(showspikes=True, linecolor='black', title = 'Feature',
spikecolor='black', spikethickness=0.5, spikedash='solid')
fig.update_layout(
paper_bgcolor='rgba(0,0,0,0)',
plot_bgcolor='rgba(0,0,0,0)'
)
fig.show()
Hi, I am fairly new to Python/programming and am having trouble unpacking a nested column in my dataframe.
The df in question looks like this:
The column I am trying to unpack looks like this (in JSON format):
df['id_data'] = [{u'metrics': {u'app_clicks': [6, 28, 13, 28, 43, 45],
u'card_engagements': [6, 28, 13, 28, 43, 45],
u'carousel_swipes': None,
u'clicks': [18, 33, 32, 48, 70, 95],
u'engagements': [25, 68, 46, 79, 119, 152],
u'follows': [0, 4, 1, 1, 1, 5],
u'impressions': [1697, 5887, 3174, 6383, 10250, 12301],
u'likes': [3, 4, 6, 9, 12, 15],
u'poll_card_vote': None,
u'qualified_impressions': None,
u'replies': [0, 0, 0, 0, 0, 1],
u'retweets': [1, 3, 0, 2, 5, 6],
u'tweets_send': None,
u'url_clicks': None},
u'segment': None}]
As you can see, there is a lot going on in this column: a list -> dictionary -> dictionary -> potentially another list. I would like each individual metric (app_clicks, card_engagements, carousel_swipes, etc.) to be its own column. I've tried the following code with no progress.
df['app_clicks'] = df.apply(lambda row: u['app_clicks'] for y in row['id_data'] if y['metricdata'] = 'list', axis=1)
Any thoughts on how I could tackle this?
You should be able to pass the dictionary directly to the dataframe constructor:
foo = pd.DataFrame(df['id_data'][0]['metrics'])
foo.iloc[:3, :4]
app_clicks card_engagements carousel_swipes clicks
0 6 6 None 18
1 28 28 None 33
2 13 13 None 32
Hopefully I am understanding your question correctly and this gets you what you need.
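If you have many rows and want each metric as its own (list-valued) column, a sketch with pd.json_normalize (available at the top level in newer pandas versions) might also work, assuming each id_data cell is a dict shaped like the sample above:
# Flatten the nested dicts for every row; nested keys become dotted column names.
flat = pd.json_normalize(df['id_data'].tolist())
flat.columns = [c.replace('metrics.', '') for c in flat.columns]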
You can use to_json and parse it back with json.loads:
import json
df1 = pd.DataFrame(json.loads(df["id_data"].to_json(orient="records")))
df2 = pd.DataFrame(json.loads(df1["metrics"].to_json(orient="records")))
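From there, if you want the unpacked metric columns back alongside the original frame, a usage sketch (assuming df still has its default RangeIndex, so the rows line up):
# Attach the unpacked metric columns to the remaining original columns.
out = pd.concat([df.drop(columns=['id_data']), df2], axis=1)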