Pandas. Group by index and apply max for column - python

I am trying to group a dataframe in chunks of 3 rows and get the row with the highest value of one column from each group, but the max method applies to all columns. How can I achieve this?
What I do:
In [69]: fr
Out[69]:
ping delta
0 1516190798773 161
1 1516191845372 143
2 1516192904988 144
3 1516193952748 295
4 1516195008033 233
5 1516196049407 252
In [70]: fr.groupby(fr.index / 3).max()
Out[70]:
ping delta
0 1516192904988 161
1 1516196049407 295
Result I want to get:
ping delta
0 1516190798773 161
1 1516193952748 295

If you want the first value in the ping column and the max value in delta:
df = fr.groupby(fr.index // 3).agg({'delta':'max','ping':'first'})
print (df)
ping delta
0 1516190798773 161
1 1516193952748 295
If you want the max value in delta and the entire corresponding rows:
df = fr.loc[fr.groupby(fr.index // 3)['delta'].idxmax()]
print (df)
ping delta
0 1516190798773 161
3 1516193952748 295
A better sample to show the difference:
print (fr)
ping delta
0 1516190798773 161
1 1516191845372 143
2 1516192904988 144
3 1516193952748 233 <-swapped values 233
4 1516195008033 295 <-swapped values 295
5 1516196049407 252
df = fr.groupby(fr.index // 3).agg({'delta':'max','ping':'first'})
print (df)
ping delta
0 1516190798773 161
1 1516193952748 295
df = fr.loc[fr.groupby(fr.index // 3)['delta'].idxmax()]
print (df)
ping delta
0 1516190798773 161
4 1516195008033 295
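In newer pandas versions (0.25+) the same per-column aggregation can also be written with named aggregation, which keeps the output columns explicit; a minimal sketch, assuming fr is the dataframe from the question:
# 'first' for ping, 'max' for delta, grouped in chunks of 3 rows
df = fr.groupby(fr.index // 3).agg(ping=('ping', 'first'), delta=('delta', 'max'))
print (df)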

Related

How to calculate value of column in dataframe based on value/count of other columns in the dataframe in python?

I have a pandas dataframe which has data of 24 hours of the day for a whole month with the following fields:
(df1):- date,hour,mid,rid,percentage,total
I need to create a 2nd dataframe from this dataframe, with the following fields:
(df2) :- date, hour,mid,rid,hour_total
Here hour_total is to be calculated as below:
If, for a combination of (date, mid, rid) from dataframe 1, the count of records where df1.percentage is 0 is 24, then hour_total = df1.total / 24; otherwise hour_total = (df1.percentage / 100) * df1.total.
For example, if dataframe 1 is as below (the count of records for the group of (date, mid, rid) where perc is 0 is 24):
date,hour,mid,rid,perc,total
2019-10-31,0,2, 0,0,3170.87
2019-10-31,1,2,0,0,3170.87
2019-10-31,2,2,0,0,3170.87
2019-10-31,3,2,0,0,3170.87
2019-10-31,4,2,0,0,3170.87
.
.
2019-10-31,23,2,0,0,3170.87
Then dataframe 2 should be: (hour_total = df1.total/24)
date,hour,mid,rid,hour_total
2019-10-31,0,2,0,132.12
2019-10-31,1,4,0,132.12
2019-10-31,2,13,0,132.12
2019-10-31,3,17,0,132.12
2019-10-31,4,7,0,132.12
.
.
2019-10-31,23,27,0,132.12
How can I accomplish this?
You can try the apply function. For example:
import numpy as np
import pandas as pd
from datetime import datetime

a = np.random.randint(100, 200, size=5)
b = np.random.randint(100, 200, size=5)
c = [datetime.now() for x in range(100) if x % 20 == 0]
df1 = pd.DataFrame({'Time': c, "A": a, "B": b})
The above data frame looks like this:
Time A B
0 2019-10-24 20:37:38.907058 158 190
1 2019-10-24 20:37:38.907058 161 127
2 2019-10-24 20:37:38.908056 100 100
3 2019-10-24 20:37:38.908056 163 164
4 2019-10-24 20:37:38.908056 121 159
Now suppose we want to compute a new column whose value depends on the values of the other columns. You can define a function which does this computation:
def func(x):
    # x is one row of the dataframe, so access its values by column label
    t = x['Time']  # Time (not used in the computation)
    a = x['A']
    b = x['B']
    return a + b
And apply this function to the data frame
df1["new_col"] = df1.apply(func, axis=1)
Which would yield the following result.
Time A B new_col
0 2019-10-24 20:37:38.907058 158 190 348
1 2019-10-24 20:37:38.907058 161 127 288
2 2019-10-24 20:37:38.908056 100 100 200
3 2019-10-24 20:37:38.908056 163 164 327
4 2019-10-24 20:37:38.908056 121 159 280
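Coming back to the question's hour_total rule, a minimal sketch of the same idea (column names are taken from the question; this is only an assumption, not tested against the real data) would count the zero-percentage rows per (date, mid, rid) with transform and then choose the formula with np.where instead of a row-wise apply:
import numpy as np

# number of rows with percentage == 0 within each (date, mid, rid) group
zero_cnt = df1['percentage'].eq(0).groupby([df1['date'], df1['mid'], df1['rid']]).transform('sum')

df2 = df1[['date', 'hour', 'mid', 'rid']].copy()
# total/24 when all 24 hours have percentage 0, otherwise percentage/100 * total
df2['hour_total'] = np.where(zero_cnt == 24,
                             df1['total'] / 24,
                             df1['percentage'] / 100 * df1['total'])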

Pandas cumsum + cumcount on multiple columns

Aloha,
I have the following DataFrame
import numpy as np
import pandas as pd

stores = [1, 2, 3, 4, 5]
weeks = [1, 1, 1, 1, 1]
df = pd.DataFrame({'Stores': stores,
                   'Weeks': weeks})
df = pd.concat([df]*53)
df['Weeks'] = df['Weeks'].add(df.groupby('Stores').cumcount())
df['Target'] = np.random.randint(400,600,size=len(df))
df['Actual'] = np.random.randint(350,800,size=len(df))
df['Variance %'] = (df['Target'] - df['Actual']) / df['Target']
df.loc[df['Variance %'] >= 0.01, 'Status'] = 'underTarget'
df.loc[df['Variance %'] <= 0.01, 'Status'] = 'overTarget'
df['Status'] = df['Status'].fillna('atTarget')
df.sort_values(['Stores','Weeks'],inplace=True)
this gives me the following
print(df.head())
Stores Weeks Target Actual Variance % Status
0 1 1 430 605 -0.406977 overTarget
0 1 2 549 701 -0.276867 overTarget
0 1 3 471 509 -0.080679 overTarget
0 1 4 549 378 0.311475 underTarget
0 1 5 569 708 -0.244288 overTarget
0 1 6 574 650 -0.132404 overTarget
0 1 7 466 623 -0.336910 overTarget
Now what I'm trying to do is a cumulative count, per store, of consecutive weeks where they were either over or under target, but which resets when the status changes.
I thought this would be the best way to do this (and many variants of this) but this does not reset the counter.
s = df.groupby(['Stores','Weeks','Status'])['Status'].shift().ne(df['Status'])
df['Count'] = s.groupby(df['Stores']).cumsum()
My logic was to group by the relevant columns and do a != shift comparison to reset the cumsum.
Naturally I've scoured lots of different questions but I can't seem to figure this out. Would anyone be so kind to explain to me what would be the best method to tackle this problem?
I hope everything here is clear and reproducible. Please let me know if you need any additional information.
Expected Output
Stores Weeks Target Actual Variance % Status Count
0 1 1 430 605 -0.406977 overTarget 1
0 1 2 549 701 -0.276867 overTarget 2
0 1 3 471 509 -0.080679 overTarget 3
0 1 4 549 378 0.311475 underTarget 1 # Reset here as status changes
0 1 5 569 708 -0.244288 overTarget 1 # Reset again.
0 1 6 574 650 -0.132404 overTarget 2
0 1 7 466 623 -0.336910 overTarget 3
Try pd.Series.groupby() after creating the key by cumsum:
s = df.groupby('Stores')['Status'].apply(lambda x: x.ne(x.shift()).ne(0).cumsum())
df['Count'] = df.groupby([df.Stores, s]).cumcount() + 1
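A slightly more spelled-out version of the same trick (just a sketch, assuming the df built above): mark the rows where the status changes within a store, turn that into a running block id with cumsum, then cumcount within each (store, block):
# True whenever the status differs from the previous row of the same store
change = df.groupby('Stores')['Status'].transform(lambda x: x.ne(x.shift()))
# running block id per store; it increases by one at every status change
block = change.groupby(df['Stores']).cumsum()
# cumulative count inside each consecutive block, starting at 1
df['Count'] = df.groupby(['Stores', block]).cumcount() + 1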

Pandas: How to print the groupby values

I have a following data set from Table_Record:
Seg_ID Lock_ID Code
111 100 1
222 121 2
333 341 2
444 100 1
555 100 1
666 341 2
777 554 4
888 332 5
I am using this SQL query to find the Seg_IDs where Lock_ID is repeated:
Select Code,Lock_ID,Seg_ID from Table_Record group by Code, Lock_ID;
Seg_ID Lock_ID Code
111 100 1
444 100 1
555 100 1
222 121 2
333 341 2
666 341 2
777 554 4
888 332 5
How can I achieve the same using Pandas?
Expected O/P from Pandas is:
eg.
Seg_ID (111,444,555) has Lock_id (1).
Seg_ID (222,333,666) has Lock_ID (2).
First get all codes by filtering only duplicated values and then filter the original DataFrame by boolean indexing with isin:
codes = df.loc[df.duplicated(['Lock_ID']), 'Code'].unique()
df1 = df[df['Code'].isin(codes)]
print (df1)
Seg_ID Lock_ID Code
0 111 100 1
1 222 121 2
2 333 341 2
3 444 100 1
4 555 100 1
5 666 341 2
Then groupby with f-strings:
for k, v in df1.groupby(['Code'])['Seg_ID']:
    print (f'Seg_ID {tuple(v)} has Code ({k})')
Seg_ID (111, 444, 555) has Code (1)
Seg_ID (222, 333, 666) has Code (2)
If you want the output as a DataFrame, use apply with tuple:
df2 = df1.groupby(['Code'])['Seg_ID'].apply(tuple).reset_index()
print (df2)
Code Seg_ID
0 1 (111, 444, 555)
1 2 (222, 333, 666)
Simply use groupby. As I understand from your code, you'd want:
grouped = df.groupby(['Code', 'Lock_ID'])
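For example, to list the Seg_IDs of every (Code, Lock_ID) group from that grouped object (just an illustration):
print (grouped['Seg_ID'].apply(list))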

Pandas: Get highest value from a column for each unique value in another column

How do I get the highest value in one column for each unique value in another column and return the same dataframe structure back?
Here is a pandas dataframe example:
reg.nr counter value ID2 categ date
1 37367 421 231385 93 A 20.01.2004
2 37368 428 235156 93 B 21.01.2004
3 37369 408 234251 93 C 22.01.2004
4 37372 403 196292 93 D 23.01.2004
5 55523 400 247141 139 E 24.01.2004
6 55575 415 215818 139 F 25.01.2004
7 55576 402 204404 139 A 26.01.2004
8 69940 402 62244 175 B 27.01.2004
9 69941 402 38274 175 C 28.01.2004
10 69942 404 55171 175 D 29.01.2004
11 69943 416 55495 175 E 30.01.2004
12 69944 407 90231 175 F 31.01.2004
13 69945 411 75382 175 A 01.02.2004
14 69948 405 119129 175 B 02.02.2004
I want to return the row with the highest value of column "counter" for each unique value of column "ID2". Afterwards the new pandas dataframe should look like this:
reg.nr counter value ID2 categ date
1 37368 428 235156 93 B 21.01.2004
2 55575 415 215818 139 F 25.01.2004
3 69943 416 55495 175 E 30.01.2004
One way using drop_duplicates
In [332]: df.sort_values('counter', ascending=False).drop_duplicates(['ID2'])
Out[332]:
reg.nr counter value ID2 categ date
2 37368 428 235156 93 B 21.01.2004
11 69943 416 55495 175 E 30.01.2004
6 55575 415 215818 139 F 25.01.2004
For the desired output, you could sort on two columns and reset the index:
In [336]: (df.sort_values(['ID2', 'counter'], ascending=[True, False])
.drop_duplicates(['ID2']).reset_index(drop=True)
)
Out[336]:
reg.nr counter value ID2 categ date
0 37368 428 235156 93 B 21.01.2004
1 55575 415 215818 139 F 25.01.2004
2 69943 416 55495 175 E 30.01.2004
df.loc[df.groupby('ID2')['counter'].idxmax(), :].reset_index()
index reg.nr counter value ID2 categ date
0 2 37368 428 235156 93 B 21.01.2004
1 6 55575 415 215818 139 F 25.01.2004
2 11 69943 416 55495 175 E 30.01.2004
First, you group your dataframe by column ID2. Then you take the counter column and calculate the index of the (first) maximal element of this column in each group. Then you use these indexes to filter your initial dataframe. Finally, you reset the index (if you need it).
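Note that a plain groupby max would only return the counter column and lose the other columns, which is exactly why idxmax (or drop_duplicates) is used to keep the whole rows:
# only the maximal counter per ID2, all other columns are dropped
print (df.groupby('ID2')['counter'].max())
# whole rows of the maximal counter per ID2
print (df.loc[df.groupby('ID2')['counter'].idxmax()])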

Aligning Dataframes based on count on pandas

I am aligning two dataframes which look like the following:
Dataframe 1
Timestamp L_x L_y L_a R_x R_y R_a
2403950 621.3 461.3 313 623.3 461.8 260
2404050 622.5 461.3 312 623.3 462.6 260
2404150 623.1 461.5 311 623.4 464 261
2404250 623.6 461.7 310 623.7 465.4 261
2404350 623.8 461.5 309 623.9 466.1 261
Dataframe 2
This dataframe contains the timestamps at which a particular event occurred.
Timestamp
0 2404030
1 2404050
2 2404250
3 2404266
4 2404282
5 2404298
6 2404314
7 2404330
8 2404350
9 2404382
All timestamps are in milliseconds. As you can see, the first dataframe is resampled to 100 milliseconds. What I want to do is align the two dataframes based on a count, i.e. how many events occur during each particular 100 millisecond bin. For example, in the first 100 millisecond bin of dataframe 1 (2403950 - 2404049), only one event occurs according to the second dataframe, namely the one at 2404030, and so on. The aligned table should look like the following:
Timestamp L_x L_y L_a R_x R_y R_a count
2403950 621.3 461.3 313 623.3 461.8 260 1
2404050 622.5 461.3 312 623.3 462.6 260 1
2404150 623.1 461.5 311 623.4 464 261 0
2404250 623.6 461.7 310 623.7 465.4 261 6
2404350 623.8 461.5 309 623.9 466.1 261 2
Thank you for your help and suggestion.
You want to perform integer division on the timestamp (i.e. a // b), but first need to add 50 to it given your bucketing. Then convert it back into the correct units by multiplying by 100 and subtracting 50.
Now, group on this new index and perform a count.
You then merge these counts to your original dataframe and do some formatting operations to get the data in the desired shape. Make sure to fill NaNs with zero.
df2['idx'] = (df2.Timestamp + 50) // 100 * 100 - 50
counts = df2.groupby('idx').count()
>>> counts
Timestamp
idx
2403950 1
2404050 1
2404250 6
2404350 2
df_new = df.merge(counts, how='left', left_on='Timestamp', right_index=True, suffixes=['', '_'])
columns = list(df_new)
columns[-1] = 'count'
df_new.columns = columns
df_new['count'].fillna(0, inplace=True)
>>> df_new
Timestamp L_x L_y L_a R_x R_y R_a count
0 2403950 621.3 461.3 313 623.3 461.8 260 1
1 2404050 622.5 461.3 312 623.3 462.6 260 1
2 2404150 623.1 461.5 311 623.4 464.0 261 0
3 2404250 623.6 461.7 310 623.7 465.4 261 6
4 2404350 623.8 461.5 309 623.9 466.1 261 2
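A slightly more compact variant of the same idea (a sketch, assuming df and df2 as above) maps the per-bin counts back onto df directly and avoids the merge and column renaming:
# bin start of every event timestamp
bin_start = (df2['Timestamp'] + 50) // 100 * 100 - 50
# events per bin, then mapped onto the bin timestamps of df
df['count'] = df['Timestamp'].map(bin_start.value_counts()).fillna(0).astype(int)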
