I'm writing new code in which a dataframe gets filtered/edited to obtain "stints" for each individual. Using the dataframe below as an example, I'm basically trying to get each person's start/end dates for a given location. Usually I can get started on my own, but I'm stumped as to how to approach this, so if anyone has ideas I would greatly appreciate it.
   Person Location       Date
0     Tom        A   1/1/2021
1     Tom        A   1/2/2021
2     Tom        A   1/3/2021
3     Tom        B   1/4/2021
4     Tom        B   1/5/2021
5     Tom        B   1/6/2021
6     Tom        A   1/7/2021
7     Tom        A   1/8/2021
8     Tom        A   1/9/2021
9     Tom        C  1/10/2021
10    Tom        C  1/11/2021
11    Tom        A  1/12/2021
12    Tom        A  1/13/2021
13    Tom        B  1/14/2021
14    Tom        B  1/15/2021
15   Mark        A   1/1/2021
16   Mark        A   1/2/2021
17   Mark        B   1/3/2021
18   Mark        B   1/4/2021
19   Mark        A   1/5/2021
20   Mark        A   1/6/2021
21   Mark        C   1/7/2021
22   Mark        C   1/8/2021
23   Mark        C   1/9/2021
24   Mark        C  1/10/2021
25   Mark        A  1/11/2021
26   Mark        A  1/12/2021
27   Mark        B  1/13/2021
28   Mark        B  1/14/2021
29   Mark        B  1/15/2021
Expected outcome:
   Person Location  StintNum Start_Date   End_Date
0     Tom        A         1   1/1/2021   1/3/2021
1     Tom        B         2   1/4/2021   1/6/2021
2     Tom        A         3   1/7/2021   1/9/2021
3     Tom        C         4  1/10/2021  1/11/2021
4     Tom        A         5  1/12/2021  1/13/2021
5     Tom        B         6  1/14/2021  1/15/2021
6    Mark        A         1   1/1/2021   1/2/2021
7    Mark        B         2   1/3/2021   1/4/2021
8    Mark        A         3   1/5/2021   1/6/2021
9    Mark        C         4   1/7/2021  1/10/2021
10   Mark        A         5  1/11/2021  1/12/2021
11   Mark        B         6  1/13/2021  1/15/2021
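For reference, here is a reproducible construction of the example frame (a sketch transcribed from the table above; the variable name df is what the answers below assume):

import pandas as pd

df = pd.DataFrame({
    'Person':   ['Tom'] * 15 + ['Mark'] * 15,
    'Location': list('AAABBBAAACCAABB') + list('AABBAACCCCAABBB'),
    'Date':     [f'1/{d}/2021' for d in range(1, 16)] * 2,
})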
IMO, a clean way is to use groupby+agg; this makes it easy to define custom aggregators and is faster than apply:
df['Date'] = pd.to_datetime(df['Date'])

group = df['Location'].ne(df['Location'].shift()).cumsum()

df2 = (df.groupby(['Person', group], as_index=False)
         .agg(Location=('Location', 'first'),
              # the line below is a dummy aggregator that only reserves the column;
              # uncomment it if you want the columns in this exact order
              # StintNum=('Location', lambda x: float('NaN')),
              Start_Date=('Date', 'min'),
              End_Date=('Date', 'max'),
              )
       )

df2['StintNum'] = df2.groupby('Person').cumcount().add(1)
Output:
Person Location StintNum Start_Date End_Date
0 Mark A 1 2021-01-01 2021-01-02
1 Mark B 2 2021-01-03 2021-01-04
2 Mark A 3 2021-01-05 2021-01-06
3 Mark C 4 2021-01-07 2021-01-10
4 Mark A 5 2021-01-11 2021-01-12
5 Mark B 6 2021-01-13 2021-01-15
6 Tom A 1 2021-01-01 2021-01-03
7 Tom B 2 2021-01-04 2021-01-06
8 Tom A 3 2021-01-07 2021-01-09
9 Tom C 4 2021-01-10 2021-01-11
10 Tom A 5 2021-01-12 2021-01-13
11 Tom B 6 2021-01-14 2021-01-15
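The key step is the run identifier: comparing each Location with the previous row (shift) flags the start of every new stint, and cumsum turns those flags into consecutive run ids. A small illustration on a hypothetical series:

s = pd.Series(['A', 'A', 'B', 'B', 'A'])
s.ne(s.shift())           # True at the start of each run: [True, False, True, False, True]
s.ne(s.shift()).cumsum()  # run ids: [1, 1, 2, 2, 3]

Grouping by Person as well ensures that a run never spills from one person into the next, even though the ids are computed over the whole column.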
Try this:
df['Date'] = pd.to_datetime(df['Date'])

runs = df['Location'].ne(df['Location'].shift(1)).cumsum()
new_df = (df.groupby([df['Person'], runs], sort=False)
            .apply(lambda x: pd.Series([x['Date'].min(), x['Date'].max()],
                                       index=['Start_Date', 'End_Date']))
            .reset_index())
new_df['StintNum'] = new_df.groupby('Person').cumcount().add(1)
Output:
>>> new_df
Person Location Start_Date End_Date StintNum
0 Tom 1 2021-01-01 2021-01-03 1
1 Tom 2 2021-01-04 2021-01-06 2
2 Tom 3 2021-01-07 2021-01-09 3
3 Tom 4 2021-01-10 2021-01-11 4
4 Tom 5 2021-01-12 2021-01-13 5
5 Tom 6 2021-01-14 2021-01-15 6
6 Mark 7 2021-01-01 2021-01-02 1
7 Mark 8 2021-01-03 2021-01-04 2
8 Mark 9 2021-01-05 2021-01-06 3
9 Mark 10 2021-01-07 2021-01-10 4
10 Mark 11 2021-01-11 2021-01-12 5
11 Mark 12 2021-01-13 2021-01-15 6
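Note that the Location column in this output holds the run counter from the grouping key, not the location letters. One way to restore the letters (a sketch, not part of the original answer) is to take the first Location of each run in the same group order:

runs = df['Location'].ne(df['Location'].shift()).cumsum()
new_df['Location'] = df.groupby([df['Person'], runs], sort=False)['Location'].first().to_numpy()

This works because sort=False keeps the groups in order of first occurrence, matching new_df's row order.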
Related
I have the following pd.DataFrame:
Index ID Name Date Days
1 1 Josh 5-1-20 10
2 1 Josh 9-1-20 10
3 1 Josh 19-1-20 6
4 2 Mike 1-1-20 10
5 3 George 1-4-20 10
6 4 Rose 1-2-20 10
7 4 Rose 11-5-20 5
8 5 Mark 1-9-20 10
9 6 Joe 1-4-21 10
10 7 Jill 1-1-21 10
I need to make a DataFrame where each ID appears only once; for that, I want to create new columns (Date and Days), enough to cover the ID with the most repetitions (3 in this case).
The desired output is the following DataFrame:
Index  ID  Name    Date 1  Date 2   Date 3   Days1  Days2  Days3
1      1   Josh    5-1-20  9-1-20   19-1-20  10     10     6
2      2   Mike    1-1-20                    10
3      3   George  1-4-20                    10
4      4   Rose    1-2-20  11-5-20           10     5
5      5   Mark    1-9-20                    10
6      6   Joe     1-4-21                    10
7      7   Jill    1-1-21                    10
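For reference, a reproducible construction of the frame above (a sketch; values transcribed from the question, with Index kept as a regular column):

import pandas as pd

df = pd.DataFrame({
    'Index': range(1, 11),
    'ID':    [1, 1, 1, 2, 3, 4, 4, 5, 6, 7],
    'Name':  ['Josh', 'Josh', 'Josh', 'Mike', 'George', 'Rose', 'Rose',
              'Mark', 'Joe', 'Jill'],
    'Date':  ['5-1-20', '9-1-20', '19-1-20', '1-1-20', '1-4-20', '1-2-20',
              '11-5-20', '1-9-20', '1-4-21', '1-1-21'],
    'Days':  [10, 10, 6, 10, 10, 10, 5, 10, 10, 10],
})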
Try:
df_out = df.set_index(['ID','Name',df.groupby('ID').cumcount()+1]).unstack()
df_out.columns = [f'{i} {j}' for i, j in df_out.columns]
df_out.fillna('').reset_index()
Output:
  ID  Name    Index 1  Index 2  Index 3  Date 1  Date 2   Date 3   Days 1  Days 2  Days 3
0  1  Josh    1.0      2.0      3.0      5-1-20  9-1-20   19-1-20  10.0    10.0    6.0
1  2  Mike    4.0                        1-1-20                    10.0
2  3  George  5.0                        1-4-20                    10.0
3  4  Rose    6.0      7.0               1-2-20  11-5-20           10.0    5.0
4  5  Mark    8.0                        1-9-20                    10.0
5  6  Joe     9.0                        1-4-21                    10.0
6  7  Jill    10.0                       1-1-21                    10.0
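The extra Index 1/2/3 columns appear because Index is an ordinary column and gets unstacked along with Date and Days. If they are unwanted, dropping the column first gives exactly the desired shape (a sketch based on the same approach):

df_out = (df.drop(columns='Index')
            .set_index(['ID', 'Name', df.groupby('ID').cumcount() + 1])
            .unstack())
df_out.columns = [f'{i} {j}' for i, j in df_out.columns]
df_out = df_out.fillna('').reset_index()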
Here is a solution using pivot with a helper column:
df2 = (df.assign(col=df.groupby('ID').cumcount().add(1).astype(str))
         .pivot(index=['ID', 'Name'], columns='col', values=['Date', 'Days'])
         .fillna('')
       )
df2.columns = df2.columns.map('_'.join)
df2.reset_index()
Output:
  ID  Name    Date_1  Date_2   Date_3   Days_1  Days_2  Days_3
0  1  Josh    5-1-20  9-1-20   19-1-20  10      10      6
1  2  Mike    1-1-20                    10
2  3  George  1-4-20                    10
3  4  Rose    1-2-20  11-5-20           10      5
4  5  Mark    1-9-20                    10
5  6  Joe     1-4-21                    10
6  7  Jill    1-1-21                    10
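One caveat for both answers: fillna('') mixes empty strings into the numeric Days columns, which turns them into object dtype. If the columns need to stay numeric, skip the fillna or use a nullable dtype instead.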
I have a dataset like this:
Date Runner Group distance [km]
2021-01-01 Joe 1 7
2021-01-02 Jack 1 6
2021-01-03 Jess 1 9
2021-01-01 Paul 2 11
2021-01-02 Peter 2 12
2021-01-02 Sara 3 15
2021-01-03 Sarah 3 10
and I want to calculate the cumulative sum for each group of runners.
Date Runner Group distance [km] cum sum [km]
2021-01-01 Joe 1 7 7
2021-01-02 Jack 1 6 13
2021-01-03 Jess 1 9 22
2021-01-01 Paul 2 11 11
2021-01-02 Peter 2 12 23
2021-01-02 Sara 3 15 15
2021-01-03 Sarah 3 10 25
Unfortunately, I have no idea how to do this and I didn't find the answer anywhere else. Could someone give me a hint?
import pandas as pd

df = pd.DataFrame([['2021-01-01', 'Joe', 1, 7],
                   ['2021-01-02', 'Jack', 1, 6],
                   ['2021-01-03', 'Jess', 1, 9],
                   ['2021-01-01', 'Paul', 2, 11],
                   ['2021-01-02', 'Peter', 2, 12],
                   ['2021-01-02', 'Sara', 3, 15],
                   ['2021-01-03', 'Sarah', 3, 10]],
                  columns=['Date', 'Runner', 'Group', 'distance [km]'])
Try groupby + cumsum:
>>> df['cum sum [km]'] = df.groupby('Group')['distance [km]'].cumsum()
>>> df
Date Runner Group distance [km] cum sum [km]
0 2021-01-01 Joe 1 7 7
1 2021-01-02 Jack 1 6 13
2 2021-01-03 Jess 1 9 22
3 2021-01-01 Paul 2 11 11
4 2021-01-02 Peter 2 12 23
5 2021-01-02 Sara 3 15 15
6 2021-01-03 Sarah 3 10 25
>>>
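One caveat not raised in the thread: cumsum accumulates in row order, so this relies on the frame already being sorted by date within each group. If it were not, sorting first would be needed; a sketch:

# sort so the running total accumulates in chronological order per group
df = df.sort_values(['Group', 'Date'], kind='stable')
df['cum sum [km]'] = df.groupby('Group')['distance [km]'].cumsum()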
I'm having trouble with a pandas groupby issue. I have this dataframe:
Letter name num_exercises
A carl 1
A Lenna 2
A Harry 3
A Joe 4
B Carl 5
B Lenna 3
B Harry 3
B Joe 6
C Carl 6
C Lenna 3
C Harry 4
C Joe 7
And I want to add a column to it, called num_exercises_total, which contains the total sum of num_exercises for each letter. Note that this value must be repeated for every row in the letter group.
The output would be as follows:
Letter name num_exercises num_exercises_total
A carl 1 10
A Lenna 2 10
A Harry 3 10
A Joe 4 10
B Carl 5 17
B Lenna 3 17
B Harry 3 17
B Joe 6 17
C Carl 6 20
C Lenna 3 20
C Harry 4 20
C Joe 7 20
I've tried adding the new column like this:
df['num_exercises_total'] = df.groupby(['Letter'])['num_exercises'].sum()
But it returns NaN for all the rows.
Any help would be highly appreciated.
Thank you very much in advance!
You may want to check transform:
df.groupby(['Letter'])['num_exercises'].transform('sum')
0 10
1 10
2 10
3 10
4 17
5 17
6 17
7 17
8 20
9 20
10 20
11 20
Name: num_exercises, dtype: int64
df['num_of_total'] = df.groupby(['Letter'])['num_exercises'].transform('sum')
Transform works perfectly for this question. WenYoBen is right; I am just putting a slightly different version here.
df['num_of_total'] = df['num_exercises'].groupby(df['Letter']).transform('sum')
>>> df
Letter name num_exercises num_of_total
0 A carl 1 10
1 A Lenna 2 10
2 A Harry 3 10
3 A Joe 4 10
4 B Carl 5 17
5 B Lenna 3 17
6 B Harry 3 17
7 B Joe 6 17
8 C Carl 6 20
9 C Lenna 3 20
10 C Harry 4 20
11 C Joe 7 20
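For what it's worth, the NaN values in the original attempt come from index alignment: groupby(...).sum() returns a Series indexed by Letter, which does not align with the frame's default integer index, so every row gets NaN. transform sidesteps this by returning a result with the original index. An equivalent alternative (a sketch) maps the per-letter sums back explicitly:

df['num_exercises_total'] = df['Letter'].map(
    df.groupby('Letter')['num_exercises'].sum())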
I have a DataFrame and I want to calculate the mean and the variance of the Points column for each person, row by row. There is also a Date column, and the chronological order must be respected when calculating the mean and the variance; the dataframe is already sorted by date. The dates are simply the number of days after the earliest date. For a person's earliest date, the mean is simply the value in the Points column and the variance should be NaN or 0. Then, for the second date, the mean should be the mean of the Points values for this date and the previous one, and so on. Here is my code to generate the dataframe:
import pandas as pd
import numpy as np
data = [["Al", 0, 12], ["Bob", 2, 10], ["Carl", 5, 12], ["Al", 5, 5], ["Bob", 9, 2],
        ["Al", 22, 4], ["Bob", 22, 16], ["Carl", 33, 2], ["Al", 45, 7], ["Bob", 68, 4],
        ["Al", 72, 11], ["Bob", 79, 5]]
df = pd.DataFrame(data, columns=["Name", "Date", "Points"])
print(df)
Name Date Points
0 Al 0 12
1 Bob 2 10
2 Carl 5 12
3 Al 5 5
4 Bob 9 2
5 Al 22 4
6 Bob 22 16
7 Carl 33 2
8 Al 45 7
9 Bob 68 4
10 Al 72 11
11 Bob 79 5
Here is my code to obtain the mean and the variance:
df['Mean'] = df.apply(
    lambda x: df[(df.Name == x.Name) & (df.Date < x.Date)].Points.mean(),
    axis=1)
df['Variance'] = df.apply(
    lambda x: df[(df.Name == x.Name) & (df.Date < x.Date)].Points.var(),
    axis=1)
However, the mean is shifted by one row and the variance by two rows. The dataframe obtained when sorted by Name and Date is:
Name Date Points Mean Variance
0 Al 0 12 NaN NaN
3 Al 5 5 12.000000 NaN
5 Al 22 4 8.500000 24.500000
8 Al 45 7 7.000000 19.000000
10 Al 72 11 7.000000 12.666667
1 Bob 2 10 NaN NaN
4 Bob 9 2 10.000000 NaN
6 Bob 22 16 6.000000 32.000000
9 Bob 68 4 9.333333 49.333333
11 Bob 79 5 8.000000 40.000000
2 Carl 5 12 NaN NaN
7 Carl 33 2 12.000000 NaN
Instead, the dataframe should be as below:
Name Date Points Mean Variance
0 Al 0 12 12 NaN
3 Al 5 5 8.5 24.5
5 Al 22 4 7 19
8 Al 45 7 7 12.67
10 Al 72 11 7.8 ...
1 Bob 2 10 10 NaN
4 Bob 9 2 6 32
6 Bob 22 16 9.33 49.33
9 Bob 68 4 8 40
11 Bob 79 5 7.4 ...
2 Carl 5 12 12 NaN
7 Carl 33 2 7 50
What should I change?
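The shift happens because df.Date < x.Date excludes the current row, so each statistic describes only the strictly earlier rows, while the expected output includes the current row. One way to get it (a sketch, not from the original thread) is an expanding window per person; since the frame is already sorted by date, groupby preserves the chronological order:

# the smallest fix: change < to <= in both lambdas; alternatively,
# an expanding window avoids the quadratic apply altogether
df['Mean'] = df.groupby('Name')['Points'].transform(lambda s: s.expanding().mean())
df['Variance'] = df.groupby('Name')['Points'].transform(lambda s: s.expanding().var())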
I have a date in a list:
[datetime.date(2017, 8, 9)]
I want to replace the values of a dataframe matching that date with zero.
Dataframe:
Date Amplitude Magnitude Peaks Crests
0 2017-06-21 6.953356 1046.656154 4 3
1 2017-06-27 7.015520 1185.221306 5 4
2 2017-06-28 6.947471 908.115055 2 2
3 2017-06-29 6.921587 938.175153 3 3
4 2017-07-02 6.906078 938.273547 3 2
5 2017-07-03 6.898809 955.718452 6 5
6 2017-07-04 6.876283 846.514852 5 5
7 2017-07-26 6.862897 870.610086 6 5
8 2017-07-27 6.846426 824.403786 7 7
9 2017-07-28 6.831949 813.753420 7 7
10 2017-07-29 6.823125 841.245427 4 3
11 2017-07-30 6.816301 846.603427 5 4
12 2017-07-31 6.810133 842.287006 5 4
13 2017-08-01 6.800645 794.167590 3 3
14 2017-08-02 6.793034 801.505774 4 3
15 2017-08-03 6.790814 860.497395 7 6
16 2017-08-04 6.785664 815.055002 4 4
17 2017-08-05 6.782069 829.607640 5 4
18 2017-08-06 6.778176 819.014799 4 3
19 2017-08-07 6.774587 817.624203 5 5
20 2017-08-08 6.771193 815.101641 4 3
21 2017-08-09 6.765695 772.970000 1 1
22 2017-08-10 6.769422 945.207554 1 1
23 2017-08-11 6.773154 952.422598 4 3
24 2017-08-12 6.770926 826.700122 4 4
25 2017-08-13 6.772816 916.046905 5 5
26 2017-08-14 6.771130 834.881662 5 5
27 2017-08-15 6.769183 826.009391 5 5
28 2017-08-16 6.767313 824.650882 5 4
29 2017-08-17 6.765894 832.752100 5 5
30 2017-08-18 6.766861 894.165751 5 5
31 2017-08-19 6.768392 912.200274 4 3
I have tried this:
for x in range(len(all_details)):
    for y in selected_day:
        m = all_details['Date'] > y
        all_details.loc[m, 'Peaks'] = 0
But getting an error:
ValueError: Arrays were different lengths: 32 vs 1
Can anybody suggest the correct way to do it?
Any help would be appreciated.
First, your solution works fine with the sample data.
A faster alternative is to create each mask in a loop and then reduce them with a logical OR (or AND, depending on what you need). This is explained in more detail here.
import datetime
import numpy as np

L = [datetime.date(2017, 8, 9)]
m = np.logical_or.reduce([all_details['Date'] > x for x in L])
all_details.loc[m, 'Peaks'] = 0
In your approach, it is better to compare against only the minimal date from the list:
all_details.loc[all_details['Date'] > min(L), 'Peaks'] = 0
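As an aside, the question says "matching that date" while both snippets compare with >, which zeroes out every later date as well. If the intent is to change only exact matches, an isin mask may be closer (a sketch; assumes the Date column is datetime64, hence the conversion):

dates = pd.to_datetime(L)
all_details.loc[all_details['Date'].isin(dates), 'Peaks'] = 0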