I have a dataset which contains id, datetime, model features, ground truth labels and the predicted probability.
id datetime feature1 feature2 feature3 ... label probability
001 2023-01-01 a1 b3 c1 ... Rejected 0.98
002 2023-01-04 a2 b1 c1 ... Approved 0.28
003 2023-01-04 a1 b2 c1 ... Rejected 0.81
004 2023-01-08 a2 b3 c2 ... Rejected 0.97
005 2023-01-09 a2 b1 c1 ... Approved 0.06
006 2023-01-09 a2 b2 c2 ... Approved 0.06
007 2023-01-10 a1 b1 c2 ... Approved 0.13
008 2023-01-11 a2 b2 c1 ... Approved 0.18
009 2023-01-12 a2 b1 c1 ... Approved 0.16
010 2023-01-12 a1 b1 c2 ... Rejected 0.96
011 2023-01-09 a2 b3 c2 ... Approved 0.16
...
I want to know the AUC of each segment under different features. How can I manipulate the dataset to get these results?
So far I have used the groupby method on the date to get the monthly AUC for all features together.
from sklearn import metrics

def group_auc(x, col_tar, col_scr):
    # AUC for one group of rows: col_tar is the binary target, col_scr the predicted probability
    return metrics.roc_auc_score(x[col_tar], x[col_scr])

def map_y(x):
    # Map the labels to a binary target: Rejected -> 1, Approved -> 0
    if x == 'Rejected':
        return 1
    elif x == 'Approved':
        return 0
    return x

## example
y_name = 'label'
df[y_name] = df[y_name].apply(map_y)

# Remove rows with a missing label
df = df.dropna(subset=[y_name])

# Monthly AUC over all features together
df['Month_Year'] = df['datetime'].dt.to_period('M')
group_data_monthly = (df.groupby('Month_Year')
                        .apply(group_auc, y_name, 'probability')
                        .reset_index()
                        .rename(columns={0: 'AUC'}))
My expected output will be like,
datetime features value AUC
2023-01-01 feature1 a1 0.98
2023-01-01 feature1 a2 ...
2023-01-01 feature1 a3 ...
2023-01-01 feature2 b1 ...
2023-01-01 feature2 b2 ...
2023-01-01 feature2 b3 ...
2023-01-01 feature3 c1 ...
2023-01-01 feature3 c2 ...
2023-01-04 feature1 a1 ...
2023-01-04 feature1 a2 ...
2023-01-04 feature1 a3 ...
2023-01-04 feature2 b1 ...
...
I have also tried to use the stack method to reshape the dataframe, but the script failed because of the huge size of the dataframe.
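One way to get the per-segment AUC without stacking the whole dataframe is to loop over the feature columns and group by month and segment value. This is only a sketch building on the preprocessing above: it assumes label has already been mapped to 0/1 and Month_Year has been added, feature_cols and safe_auc are illustrative names, and segments containing a single class get NaN because AUC is undefined there.

import pandas as pd
from sklearn import metrics

def safe_auc(g, col_tar, col_scr):
    # AUC is undefined when a segment contains only one class
    if g[col_tar].nunique() < 2:
        return float('nan')
    return metrics.roc_auc_score(g[col_tar], g[col_scr])

feature_cols = ['feature1', 'feature2', 'feature3']  # extend with the remaining feature columns

results = []
for col in feature_cols:
    seg = (df.groupby(['Month_Year', col])
             .apply(safe_auc, y_name, 'probability')
             .reset_index(name='AUC')
             .rename(columns={col: 'value'}))
    seg.insert(1, 'features', col)
    results.append(seg)

segment_auc = pd.concat(results, ignore_index=True)
print(segment_auc)  # columns: Month_Year, features, value, AUC

Swapping Month_Year for the raw date column would give per-day segments like the expected output above.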
I have a large pandas dataframe with varying rows and columns, but it looks more or less like:
time id angle ...
0.0 a1 33.67 ...
0.0 b2 35.90 ...
0.0 c3 42.01 ...
0.0 d4 45.00 ...
0.1 a1 12.15 ...
0.1 b2 15.35 ...
0.1 c3 33.12 ...
0.2 a1 65.28 ...
0.2 c3 87.43 ...
0.3 a1 98.85 ...
0.3 c3 100.12 ...
0.4 a1 11.11 ...
0.4 c3 83.22 ...
...
I am trying to aggregate the ids and then find ids that share common time intervals. I have tried using pandas groupby and can easily group by id and get each group's rows. How can I take it a step further and find ids that also share the same timestamps?
Ideally I'd like to return, for a fixed time interval (2-3 seconds), only the ids whose timestamps overlap for that whole interval:
time id angle ...
0.0 a1 33.67 ...
0.1 a1 12.15 ...
0.2 a1 65.28 ...
0.3 a1 98.85 ...
0.0 c3 42.01 ...
0.1 c3 33.12 ...
0.2 c3 87.43 ...
0.3 c3 100.12 ...
Code tried so far:
#create pandas grouped by id
df1 = df.groupby(['id'], as_index=False)
Which outputs:
time id angle ...
(0.0 a1 33.67
...
0.4 a1 11.11)
(0.0 b2 35.90
0.1 b2 15.35)
(0.0 c3 42.01
...
0.4 c3 83.22)
(0.0 d4 45.00)
But I'd like to return only a dataframe where the ids share the same times over a fixed interval, in the above example 0.4 seconds.
Any ideas on a fairly simple way to achieve this with pandas dataframes?
If you need to filter rows by some interval - e.g. here between 0 and 0.4 - and get all ids that overlap, use boolean indexing with Series.between first, then DataFrame.pivot:
df1 = df[df['time'].between(0, 0.4)].pivot(index='time', columns='id', values='angle')
print (df1)
id a1 b2 c3 d4
time
0.0 33.67 35.90 42.01 45.0
0.1 12.15 15.35 33.12 NaN
0.2 65.28 NaN 87.43 NaN
0.3 98.85 NaN 100.12 NaN
0.4 11.11 NaN 83.22 NaN
There are missing values for the non-overlapping ids, so remove columns with any NaNs by DataFrame.dropna, then reshape back to 3 columns with DataFrame.unstack and Series.reset_index:
print (df1.dropna(axis=1))
id a1 c3
time
0.0 33.67 42.01
0.1 12.15 33.12
0.2 65.28 87.43
0.3 98.85 100.12
0.4 11.11 83.22
df2 = df1.dropna(axis=1).unstack().reset_index(name='angle')
print (df2)
id time angle
0 a1 0.0 33.67
1 a1 0.1 12.15
2 a1 0.2 65.28
3 a1 0.3 98.85
4 a1 0.4 11.11
5 c3 0.0 42.01
6 c3 0.1 33.12
7 c3 0.2 87.43
8 c3 0.3 100.12
9 c3 0.4 83.22
There are many ways to define the filter you're asking for:
df.groupby('id').filter(lambda x: len(x) > 4)
# OR
df.groupby('id').filter(lambda x: x['time'].eq(0.4).any())
# OR
df.groupby('id').filter(lambda x: x['time'].max() == 0.4)
Output:
time id angle
0 0.0 a1 33.67
2 0.0 c3 42.01
4 0.1 a1 12.15
6 0.1 c3 33.12
7 0.2 a1 65.28
8 0.2 c3 87.43
9 0.3 a1 98.85
10 0.3 c3 100.12
11 0.4 a1 11.11
12 0.4 c3 83.22
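If the window endpoints need to vary (the question mentions fixed 2-3 second intervals), the same filtering idea can be wrapped in a small helper. This is only a sketch; keep_ids_covering and its step argument are assumptions of mine, not part of either answer above.

import numpy as np

def keep_ids_covering(df, start, end, step=0.1):
    # Keep only the ids whose timestamps cover every step of [start, end]
    required = set(np.round(np.arange(start, end + step / 2, step), 6))
    window = df[df['time'].between(start, end)]
    return window.groupby('id').filter(
        lambda g: required <= set(np.round(g['time'], 6)))

# Reproduces the hard-coded 0-0.4 filters above
print(keep_ids_covering(df, 0.0, 0.4))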
I have the following two dataframes:
df1=
testID Time containsMedia overallScore Difficulty
a1 134 0 0.70 Easy
a2 345 0 0.22 Hard
a3 355 0 1 Easy
a4 444 1 0 Hard
a5 356 1 0.89 Easy
And
df2=
TypeOfTest testID Parameter partialScore
Prep a1 3 0.70
Exam a1 5 0.80
Final a1 6 0.60
Prep a2 3 0.01
Final a2 2 0.90
Prep a3 1 0.79
Exam a3 5 0.11
None a4 5 1
Exam a5 3 0.89
I want:
testID Time containsMedia overallScore Difficulty Prep Exam Final None
a1 134 0 0.70 Easy 0.70 0.80 0.60
a2 345 0 0.22 Hard 0.01 0.90
a3 355 0 1 Easy 0.79 0.11
a4 444 1 0 Hard 1
a5 356 1 0.89 Easy 0.89
Probably with NaNs in the blank spaces.
I tried:
result_test = pd.concat([df1, df2, df2, df2, df2], axis=1)
and it just glues them together. How is it possible to do this? What do I do with the None/NaNs?
Let's try pivot + join with an optional reindex to get the columns in the correct order:
result_df = df1.join(
    df2.pivot(index='testID', columns='TypeOfTest', values='partialScore')
       .reindex(columns=['Prep', 'Exam', 'Final', 'None']),
    on='testID'
)
Without the reindex, the pivoted df2 columns will be in alphabetical order.
result_df:
testID Time containsMedia overallScore Difficulty Prep Exam Final None
0 a1 134 0 0.70 Easy 0.70 0.80 0.6 NaN
1 a2 345 0 0.22 Hard 0.01 NaN 0.9 NaN
2 a3 355 0 1.00 Easy 0.79 0.11 NaN NaN
3 a4 444 1 0.00 Hard NaN NaN NaN 1.0
4 a5 356 1 0.89 Easy NaN 0.89 NaN NaN
fillna can be used to replace NaN with empty strings:
result_df = df1.join(
    df2.pivot(index='testID', columns='TypeOfTest', values='partialScore')
       .reindex(columns=['Prep', 'Exam', 'Final', 'None'])
       .fillna(''),
    on='testID'
)
result_df:
testID Time containsMedia overallScore Difficulty Prep Exam Final None
0 a1 134 0 0.70 Easy 0.7 0.8 0.6
1 a2 345 0 0.22 Hard 0.01 0.9
2 a3 355 0 1.00 Easy 0.79 0.11
3 a4 444 1 0.00 Hard 1.0
4 a5 356 1 0.89 Easy 0.89
I have two dfs
F1_ID  F2_ID  Event_ID  Date
a1     b2     ab4       5/12/21
a2     b3     ab5       5/12/21
b2     a1     ab4       5/12/21
b3     a2     ab5       5/12/21
the second df has a lot more information on it so I am going to show a filtered version of it.
F1_ID  Event_Name  F2_ID  Event_ID  Date    stats  amount  F1_str_total  F2_str_total
a1     Test        b2     ab1       5/8/21   12     41      13            17
a2     Test1       b3     ab2       5/8/21   16     42      12            54
b2     Test        a1     ab1       5/8/21  -12    -41       0             7
b3     Test1       a2     ab2       5/8/21  -16    -42      87            97
I would like to append the rows of df1 to df2 and put None in the missing columns, but I'm not sure how to do this.
Expected Output:
F1_ID  Event_Name  F2_ID  Event_ID  Date     stats  amount  F1_str_total  F2_str_total
a1     Test        b2     ab1       5/8/21    12     41      13            17
a2     Test1       b3     ab2       5/8/21    16     42      12            54
b2     Test        a1     ab1       5/8/21   -12    -41       0             7
b3     Test1       a2     ab2       5/8/21   -16    -42      87            97
a1     None        b2     ab4       5/12/21   None   None    None          None
a2     None        b3     ab5       5/12/21   None   None    None          None
b2     None        a1     ab4       5/12/21   None   None    None          None
b3     None        a2     ab5       5/12/21   None   None    None          None
Simply use pandas.DataFrame.append() (note that DataFrame.append is deprecated since pandas 1.4 and was removed in pandas 2.0, so on newer versions use the pandas.concat() variant further below):
df2 = df2.append(df1, ignore_index=True)
print(df2)
F1_ID Event_Name F2_ID Event_ID Date stats amount F1_str_total \
0 a1 Test b2 ab1 5/8/21 12.0 41.0 13.0
1 a2 Test1 b3 ab2 5/8/21 16.0 42.0 12.0
2 b2 Test a1 ab1 5/8/21 -12.0 -41.0 0.0
3 b3 Test1 a2 ab2 5/8/21 -16.0 -42.0 87.0
4 a1 NaN b2 ab4 5/12/21 NaN NaN NaN
5 a2 NaN b3 ab5 5/12/21 NaN NaN NaN
6 b2 NaN a1 ab4 5/12/21 NaN NaN NaN
7 b3 NaN a2 ab5 5/12/21 NaN NaN NaN
F2_str_total
0 17.0
1 54.0
2 7.0
3 97.0
4 NaN
5 NaN
6 NaN
7 NaN
Or you can use pandas.concat()
df2 = pd.concat([df2, df1], ignore_index=True)
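The appended rows hold NaN rather than the literal None shown in the expected output. If you really need Python None objects, a common recipe is to cast to object dtype first and then replace the missing values; this is a sketch of my own, not part of the answer above, and note that the numeric columns become object dtype.

# Operates on the concatenated df2 from above: cast to object so the cells
# can hold Python None, then swap NaN for None
df2_with_none = df2.astype(object).where(df2.notna(), None)
print(df2_with_none.tail(4))  # the appended rows now show None in the missing columns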
Hey I am struggling with a transformation of a DataFrame:
The initial frame has a format like this:
df = pd.DataFrame({'A':['A1','A1','A1','A1','A1','A2','A2','A2','A2','A3','A3','A3','A3'],
                   'B':['B1','B1','B1','B1','B2','B2','B2','B3','B3','B3','B4','B4','B4'],
                   'C':['C1','C1','C1','C2','C2','C3','C3','C4','C4','C5','C5','C6','C6'],
                   'X':['a','b','c','a','c','a','b','b','c','a','c','a','c'],
                   'Y':[1,4,4,2,4,1,4,3,1,2,3,4,5]})
A B C X Y
A1 B1 C1 a 1
A1 B1 C1 b 4
A1 B1 C1 c 4
A1 B1 C2 a 2
A1 B2 C2 c 4
A2 B2 C3 a 1
A2 B2 C3 b 4
A2 B3 C4 b 3
A2 B3 C4 c 1
A3 B3 C5 a 2
A3 B4 C5 c 3
A3 B4 C6 a 4
A3 B4 C6 c 5
I have some columns in the beginning where I want to apply groupby and then transpose the last two columns:
First df.groupby(['A','B','C','X']).sum()
Y
A B C X
A1 B1 C1 a 1
b 4
c 4
C2 a 2
B2 C2 c 4
A2 B2 C3 a 1
b 4
B3 C4 b 3
c 1
A3 B3 C5 a 2
B4 C5 c 3
C6 a 4
c 5
and then transpose the X/Y columns and add them horizontally.
A B C a b c
A1 B1 C1 1.0 4.0 4.0
A1 B1 C2 2.0 NaN NaN
A1 B2 C2 NaN NaN 4.0
A2 B2 C3 1.0 4.0 NaN
A2 B3 C4 NaN 3.0 1.0
A3 B3 C5 2.0 NaN NaN
A3 B4 C5 NaN NaN 3.0
A3 B4 C6 4.0 NaN 5.0
Not all groupby rows have all values so they need to be filled with something like np.nan.
This question is related to this one here, but it is more complicated and I couldn't figure it out.
Use Series.unstack for reshaping:
df1 = (df.groupby(['A','B','C','X'])['Y'].sum()
         .unstack()
         .reset_index()
         .rename_axis(None, axis=1))
print (df1)
A B C a b c
0 A1 B1 C1 1.0 4.0 4.0
1 A1 B1 C2 2.0 NaN NaN
2 A1 B2 C2 NaN NaN 4.0
3 A2 B2 C3 1.0 4.0 NaN
4 A2 B3 C4 NaN 3.0 1.0
5 A3 B3 C5 2.0 NaN NaN
6 A3 B4 C5 NaN NaN 3.0
7 A3 B4 C6 4.0 NaN 5.0
Alternative with DataFrame.pivot_table:
df1 = (df.pivot_table(index=['A','B','C'],
                      columns='X',
                      values='Y',
                      aggfunc='sum')
         .reset_index()
         .rename_axis(None, axis=1))
print (df1)
A B C a b c
0 A1 B1 C1 1.0 4.0 4.0
1 A1 B1 C2 2.0 NaN NaN
2 A1 B2 C2 NaN NaN 4.0
3 A2 B2 C3 1.0 4.0 NaN
4 A2 B3 C4 NaN 3.0 1.0
5 A3 B3 C5 2.0 NaN NaN
6 A3 B4 C5 NaN NaN 3.0
7 A3 B4 C6 4.0 NaN 5.0
Assuming the following DataFrame:
A B C D E F
0 d1 10 d11 10 d21 10
1 d2 30 d12 30 d22 30
2 d3 40 d13 40 d23 40
3 d4 105 d14 105 NaN NaN
4 d5 10 d15 10 NaN NaN
5 d6 30 NaN NaN NaN NaN
6 d7 40 NaN NaN NaN NaN
7 d8 10 NaN NaN NaN NaN
8 d9 5 NaN NaN NaN NaN
9 d10 10 NaN NaN NaN NaN
How do I merge all the descriptions into a single header row, with each description associated with its respective value?
d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20 d21 d22 d23 d24 d25 d26 d27 d28 d29 d30
0 10 30 40 105 10 30 40 10 5 10 10 30 40 105 10 30 40 10 5 10 10 30 40 105 10 30 40 10 5 10
Take note that some descriptions and values of the original dataframe could be blank (NaN).
I realise I asked something similar before, but after putting it into my code it does not achieve what I need.
We can use pd.concat, iterating over column pairs, i.e.
pairs = list(zip(df.columns,df.columns[1:]))[::2]
# [('A', 'B'), ('C', 'D'), ('E', 'F')]
# Iterate over the pairs: set the first element of each pair as the index and
# rename the value column to 0. Then concat, drop NaNs and transpose.
ndf = pd.concat([df[list(i)].set_index(i[0]).rename(columns={i[1]: 0})
                 for i in pairs], axis=0).dropna().T
d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 \
0 10.0 30.0 40.0 105.0 10.0 30.0 40.0 10.0 5.0 10.0 10.0 30.0
d13 d14 d15 d21 d22 d23
0 40.0 105.0 10.0 10.0 30.0 40.0
import numpy as np

r = np.arange(df.shape[1])
a = r % 2   # 0 marks description columns, 1 marks value columns
b = r // 2  # pair number of each column
df.T.set_index([a, b]).T.stack().set_index(0).T
0 d1 d11 d21 d2 d12 d22 d3 d13 d23 d4 d14 d5 d15 d6 d7 d8 d9 d10
1 10 10 10 30 30 30 40 40 40 105 105 10 10 30 40 10 5 10
For fun:-)
pd.DataFrame(sum([df1.values.tolist() for _, df1 in df.groupby((df.dtypes=='object').cumsum(),axis=1)],[])).dropna().set_index(0).T
0 d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 \
1 10.0 30.0 40.0 105.0 10.0 30.0 40.0 10.0 5.0 10.0 10.0 30.0
0 d13 d14 d15 d21 d22 d23
1 40.0 105.0 10.0 10.0 30.0 40.0