Grouping of a dataframe monthly after calculating the highest daily values - python

I've got a dataframe with two columns: one is a datetime column holding dates, and the other holds quantities. It looks something like this,
Date Quantity
0 2019-01-05 10
1 2019-01-10 15
2 2019-01-22 14
3 2019-02-03 12
4 2019-05-11 25
5 2019-05-21 4
6 2019-07-08 1
7 2019-07-30 15
8 2019-09-05 31
9 2019-09-10 44
10 2019-09-25 8
11 2019-12-09 10
12 2020-04-11 111
13 2020-04-17 5
14 2020-06-05 17
15 2020-06-16 12
16 2020-06-22 14
I want to make another dataframe with two columns: Month/Year and Till Highest. I basically want the highest quantity value seen up to and including each month, grouped by month/year. An example of what I want precisely is,
Month/Year Till Highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
In my case the dataset is vast, and I have readings for almost every day of each month and year in the specified timeline. The dummy dataset here is just an example of what I want.
Please help me with this. Thanks in advance :)

See the annotated code:
(df
 # convert the date to a monthly period (e.g. 2019-01)
 .assign(Date=pd.to_datetime(df['Date']).dt.to_period('M'))
 # max quantity per month (the period becomes the index;
 # note the grouping column itself cannot be aggregated)
 .groupby('Date')
 .agg(**{'Till highest': ('Quantity', 'max')})
 # running (cumulative) max across the months
 .assign(**{'Till highest': lambda d: d['Till highest'].cummax()})
 # turn the period index into a Jan/2019-style column
 .rename_axis('Month/Year')
 .reset_index()
 .assign(**{'Month/Year': lambda d: d['Month/Year'].dt.strftime('%b/%Y')})
)
output:
Month/Year Till highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
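An equivalent, more compact sketch of the same idea: group on the monthly period directly, take the per-month max, then a running max. This reuses the question's sample data.

```python
import pandas as pd

# the example data from the question
df = pd.DataFrame({
    'Date': ['2019-01-05', '2019-01-10', '2019-01-22', '2019-02-03',
             '2019-05-11', '2019-05-21', '2019-07-08', '2019-07-30',
             '2019-09-05', '2019-09-10', '2019-09-25', '2019-12-09',
             '2020-04-11', '2020-04-17', '2020-06-05', '2020-06-16',
             '2020-06-22'],
    'Quantity': [10, 15, 14, 12, 25, 4, 1, 15, 31, 44, 8, 10,
                 111, 5, 17, 12, 14]})

# monthly max, then a cumulative max across the months
monthly = (df.groupby(pd.to_datetime(df['Date']).dt.to_period('M'))['Quantity']
             .max()
             .cummax())
out = pd.DataFrame({'Month/Year': monthly.index.strftime('%b/%Y'),
                    'Till Highest': monthly.to_numpy()})
print(out)
```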

In R you can use cummax:
df = data.frame(
  Date = c("2019-01-05", "2019-01-10", "2019-01-22", "2019-02-03",
           "2019-05-11", "2019-05-21", "2019-07-08", "2019-07-30",
           "2019-09-05", "2019-09-10", "2019-09-25", "2019-12-09",
           "2020-04-11", "2020-04-17", "2020-06-05", "2020-06-16",
           "2020-06-22"),
  Quantity = c(10, 15, 14, 12, 25, 4, 1, 15, 31, 44, 8, 10, 111, 5, 17, 12, 14)
)
data.frame(`Month/Year` = unique(format(as.Date(df$Date), "%b/%Y")),
           `Till Highest` = cummax(tapply(df$Quantity, sub("-..$", "", df$Date), max)),
           check.names = FALSE, row.names = NULL)
Month/Year Till Highest
1 Jan/2019 15
2 Feb/2019 15
3 May/2019 25
4 Jul/2019 25
5 Sep/2019 44
6 Dec/2019 44
7 Apr/2020 111
8 Jun/2020 111

Related

Converting object into time and grouping/summarizing time (H/M/S) into 24 hours

I subsetted a big dataframe, slicing out a single column Start Time of dtype object:
test = taxi_2020['Start Time']
Got a column
0 00:15:00
1 00:15:00
2 00:15:00
3 00:15:00
4 00:15:00
...
4137289 00:00:00
4137290 00:00:00
4137291 00:00:00
4137292 00:00:00
4137293 00:00:00
Name: Start Time, Length: 4137294, dtype: object
Then I grouped and summarized it by count (to the best of my knowledge):
test.value_counts().sort_index().reset_index()
and got two columns
index Start Time
0 00:00:00 24005
1 00:15:00 22815
2 00:30:00 20438
3 00:45:00 19012
4 01:00:00 18082
... ... ...
91 22:45:00 32365
92 23:00:00 31815
93 23:15:00 29582
94 23:30:00 26903
95 23:45:00 24599
Not sure why this index column appeared; I failed to rename it or convert it.
What would I like to see?
My ideal output is time grouped by hour (24h format is OK). The data counts every 15 minutes, so basically combine each run of 4 consecutive rows: 00:15:00 can count toward hour 0, 23:00:00 toward the 23rd hour.
My ideal output:
Hour Rides
0 34000
1 60000
2 30000
3 40000
I would like to create afterwards a simple histogram to show the occurrence by the hour.
Appreciate any help!
IIUC,
#Create a dummy input dataframe
test = pd.DataFrame({'time': pd.date_range('2020-06-01', '2020-06-01 23:59:00', freq='15T').strftime('%H:%M:%S'),
                     'rides': np.random.randint(15000, 28000, 96)})
Let's create a DatetimeIndex from the strings, resample, aggregate with sum, and convert the DatetimeIndex to hours:
test2 = (test.set_index(pd.to_datetime(test['time'], format='%H:%M:%S'))
             .rename_axis('hour')['rides'].resample('H').sum())
test2.index = test2.index.hour
test2.reset_index()
Output:
hour rides
0 0 74241
1 1 87329
2 2 76933
3 3 86208
4 4 88002
5 5 82618
6 6 82188
7 7 81203
8 8 78591
9 9 95592
10 10 99778
11 11 85294
12 12 93931
13 13 80490
14 14 84181
15 15 71786
16 16 90962
17 17 96568
18 18 85646
19 19 88324
20 20 83595
21 21 89284
22 22 72061
23 23 74057
Step by step I found the answer myself.
First I renamed the columns (note the assignment; rename returns a new dataframe):
test2 = test.rename(columns={'index': 'Time', 'Start Time': 'Rides'})
The remaining question was how to summarize by the hour. Extracting the hour brought me closer:
test2['hour'] = pd.to_datetime(test2['Time'], format='%H:%M:%S').dt.hour
Finally, I grouped by the hour value:
test3 = test2.groupby('hour', as_index=False).agg({'Rides': 'sum'})
print(test3)
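The whole self-answer collapses into a few lines. Here is a runnable sketch on dummy 15-minute data like that used in the earlier answer (the random ride counts are illustrative):

```python
import numpy as np
import pandas as pd

# dummy 15-minute data: 96 quarter-hour bins covering one day
test = pd.DataFrame(
    {'Time': pd.date_range('2020-06-01', '2020-06-01 23:45:00',
                           freq='15min').strftime('%H:%M:%S'),
     'Rides': np.random.randint(15000, 28000, 96)})

# extract the hour and sum the four 15-minute bins that fall into it
test['hour'] = pd.to_datetime(test['Time'], format='%H:%M:%S').dt.hour
hourly = test.groupby('hour', as_index=False)['Rides'].sum()
print(hourly)
```

The result has one row per hour, ready for a histogram or bar plot of rides by hour.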

Python Data Frame: How do I work with rows?

I have imported this file as a DataFrame in pandas. The left-most column is time (7:00 am to 9:15 am); rows show traffic volume at an intersection in 15-minute increments. How do I find the peak hour, i.e. the hour with the most volume? To get the hourly volumes, I have to add up 4 rows.
I am a newbie with Python and any help is appreciated.
import pandas as pd
f_path ="C:/Users/reggi/Dropbox/1. 2020/6. Programming Python/Text Files/TMC118txt.txt"
df = pd.read_csv(f_path, index_col=0, sep='\s+')
Sample of the data file below: the first column is time in 15-minute increments, the first row is the traffic count by movement.
NBL NBT NBR SBL SBT SBR EBL EBT EBR WBL WBT WBR
715 8 3 12 1 1 0 4 93 18 36 68 4
730 16 5 20 5 2 1 0 135 12 39 128 3
745 9 1 29 6 2 3 4 169 21 28 163 6
800 10 2 33 4 0 4 4 147 8 34 174 6
815 11 1 30 1 4 3 4 93 10 28 140 8
My approach would be to move the time into a column:
df.reset_index(inplace=True)
Then I would create a new column for the hour and one for the minutes (casting to string first, since the times are read in as integers):
df['hour'] = df['index'].astype(str).str[:-2]
df['minute'] = df['index'].astype(str).str[-2:]
Then you could do a groupby on hour and sum the traffic movements, sort, etc.:
hourly = df.groupby(by='hour').sum(numeric_only=True)
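Putting that together on the sample rows from the question (reading the table from a string instead of the file; "peak hour" here means the clock-hour bin with the largest total, as the question describes):

```python
from io import StringIO

import pandas as pd

data = """\
NBL NBT NBR SBL SBT SBR EBL EBT EBR WBL WBT WBR
715 8 3 12 1 1 0 4 93 18 36 68 4
730 16 5 20 5 2 1 0 135 12 39 128 3
745 9 1 29 6 2 3 4 169 21 28 163 6
800 10 2 33 4 0 4 4 147 8 34 174 6
815 11 1 30 1 4 3 4 93 10 28 140 8
"""
df = pd.read_csv(StringIO(data), index_col=0, sep=r'\s+')

df.reset_index(inplace=True)
# the times are read in as integers such as 715, so cast to string first
df['index'] = df['index'].astype(str)
df['hour'] = df['index'].str[:-2]
df['minute'] = df['index'].str[-2:]

movements = ['NBL', 'NBT', 'NBR', 'SBL', 'SBT', 'SBR',
             'EBL', 'EBT', 'EBR', 'WBL', 'WBT', 'WBR']
hourly = df.groupby('hour')[movements].sum()

# total volume per hour, and the hour with the most volume
totals = hourly.sum(axis=1)
peak = totals.idxmax()
print(hourly, peak)
```

Note that hour "7" only covers 7:15-7:45 in this sample; with a full day of data each hour bin holds four rows.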

Pandas expanding mean after a certain date

I need some help with groupby and expanding mean in python pandas.
I am trying to use pandas' expanding mean together with groupby. In the table below, I want to group by the id column and take an expanding mean ordered by date. The catch is that for January I am not using an expanding mean: think of January as a past month, so I take the overall mean of the value column, grouped by id.
For February and March I want the expanding mean of the value column on top of January. So for 7 Feb and id 1, the 44.75 in the expanding-mean column is the mean of the January values, taken before today's value of 89 occurs. The next value for id 1, on 7-Mar, does include the previous value of 89 from 7 Feb.
So basically my idea is to take the overall mean up to 1 Feb, and then use an expanding mean on top of whatever mean has been calculated up to that date.
id date value count(prior) expanding mean (after feb)
1 1-Jan 28 4 44.75
2 1-Jan 43 3 37.33
3 1-Jan 69 3 57.00
1 2-Jan 31 4 44.75
2 2-Jan 22 3 37.33
1 7-Jan 82 4 44.75
2 7-Jan 47 3 37.33
3 7-Jan 79 3 57.00
1 8-Jan 38 4 44.75
3 8-Jan 23 3 57.00
1 7-Feb 89 4 44.75
2 7-Feb 22 3 37.33
3 7-Feb 80 3 57.00
2 19-Feb 91 4 33.50
3 19-Feb 97 4 62.75
1 7-Mar 48 5 53.60
2 7-Mar 98 5 45.00
3 7-Mar 35 5 69.60
I've given the count column as a reference for how the count increases; it counts everything prior to that date.
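The question has no answer in this excerpt, but the described logic can be sketched: a shifted per-id expanding mean, overridden before the cutoff (1 Feb) by each id's overall pre-cutoff mean. The year and the column names here are assumptions reconstructed from the table.

```python
import pandas as pd

# reconstruction of the question's table (year 2021 is assumed)
df = pd.DataFrame({
    'id':    [1, 2, 3, 1, 2, 1, 2, 3, 1, 3, 1, 2, 3, 2, 3, 1, 2, 3],
    'date':  pd.to_datetime(
        ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02',
         '2021-01-07', '2021-01-07', '2021-01-07', '2021-01-08', '2021-01-08',
         '2021-02-07', '2021-02-07', '2021-02-07', '2021-02-19', '2021-02-19',
         '2021-03-07', '2021-03-07', '2021-03-07']),
    'value': [28, 43, 69, 31, 22, 82, 47, 79, 38, 23,
              89, 22, 80, 91, 97, 48, 98, 35]})

cutoff = pd.Timestamp('2021-02-01')
df = df.sort_values(['id', 'date'])

# expanding mean of strictly prior values within each id
prior_mean = df.groupby('id')['value'].transform(
    lambda s: s.expanding().mean().shift())

# before the cutoff, use the overall pre-cutoff mean per id instead
jan_mean = (df[df['date'] < cutoff].groupby('id')['value']
              .mean().reindex(df['id']).to_numpy())
df['expanding_mean'] = prior_mean.where(df['date'] >= cutoff, jan_mean)
print(df)
```

This reproduces the expanding-mean column of the table: January rows all get the per-id January mean, and later rows get the mean of everything prior to that date.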

Iterating over groups in a dataframe [duplicate]

This question already has answers here:
Looping over groups in a grouped dataframe
(2 answers)
Closed 4 years ago.
The issue I am having is that I want to group the dataframe and then use functions to manipulate the data after it has been grouped. For example, I want to group the data by Date and then iterate through each row in the date groups to pass to a function.
The issue is that groupby seems to create a tuple of the key and then a massive string consisting of all of the rows in the data, making iterating through each row impossible.
When you apply groupby on a dataframe, you don't get rows; you get groups, each itself a dataframe. For example, consider:
df
ID Date Days Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 25 40
2 111 2016-03-01 31 35
3 111 2016-04-01 30 30
4 111 2016-05-01 31 25
5 112 2016-01-01 31 55
6 112 2016-01-02 26 45
7 112 2016-01-03 31 40
8 112 2016-01-04 30 35
9 112 2016-01-05 31 30
for i, g in df.groupby('ID'):
    print(g, '\n')
ID Date Days Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 25 40
2 111 2016-03-01 31 35
3 111 2016-04-01 30 30
4 111 2016-05-01 31 25
ID Date Days Volume/Day
5 112 2016-01-01 31 55
6 112 2016-01-02 26 45
7 112 2016-01-03 31 40
8 112 2016-01-04 30 35
9 112 2016-01-05 31 30
For your case, you should probably look into GroupBy.apply if you want to apply some function to your groups, GroupBy.transform to produce a like-indexed dataframe (see the docs for an explanation), or GroupBy.agg if you want to produce aggregated results.
You'd do something like:
r = df.groupby('Date').apply(your_function)
You'd define your function as:
def your_function(df):
    ...  # operation on df
    return result
If you have problems with the implementation, please open a new question, post your data and your code, and any associated errors/tracebacks. Happy coding.
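For instance, on the example frame above, apply can compute one result per group. The total-volume metric here is made up purely for illustration:

```python
import pandas as pd

# the example data from the answer above
df = pd.DataFrame({
    'ID': [111] * 5 + [112] * 5,
    'Date': ['2016-01-01', '2016-02-01', '2016-03-01', '2016-04-01', '2016-05-01',
             '2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04', '2016-01-05'],
    'Days': [20, 25, 31, 30, 31, 31, 26, 31, 30, 31],
    'Volume/Day': [50, 40, 35, 30, 25, 55, 45, 40, 35, 30]})

def total_volume(g):
    # g is the sub-dataframe for one ID
    return (g['Days'] * g['Volume/Day']).sum()

# selecting the columns first keeps the group key out of the applied frame
r = df.groupby('ID')[['Days', 'Volume/Day']].apply(total_volume)
print(r)
```

The result is a Series indexed by ID, one aggregated value per group.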

Sorting Pandas dataframe data within Groupby groups

I have a large pandas dataframe that can be represented structurally as:
id date status
0 12 2015-05-01 0
1 12 2015-05-22 1
2 12 2015-05-14 1
3 12 2015-05-06 0
4 45 2015-05-03 1
5 45 2015-05-12 1
6 45 2015-05-02 0
7 51 2015-05-05 1
8 51 2015-05-01 0
9 51 2015-05-23 1
10 51 2015-05-17 1
11 51 2015-05-03 0
12 51 2015-05-05 0
13 76 2015-05-04 1
14 76 2015-05-22 1
15 76 2015-05-08 0
And can be created in Python 3.4 using:
tempDF = pd.DataFrame({ 'id': [12,12,12,12,45,45,45,51,51,51,51,51,51,76,76,76],
'date': ['2015-05-01','2015-05-22','2015-05-14','2015-05-06','2015-05-03','2015-05-12','2015-05-02','2015-05-05','2015-05-01','2015-05-23','2015-05-17','2015-05-03','2015-05-05','2015-05-04','2015-05-22','2015-05-08'],
'status': [0,1,1,0,1,1,0,1,0,1,1,0,0,1,1,0]})
tempDF['date'] = pd.to_datetime(tempDF['date'])
I would like to divide the dataframe into groups based on variable 'id', sort within groups based on 'date' and then get the last 'status' value within each group.
So far, I have:
tempGrouped = tempDF.groupby('id')
tempGrouped['status'].last()
which produces:
id
12 0
45 0
51 0
76 0
However, the status should be 1 in each case (the value associated with the latest date). I can't work out how to sort the groups by date before selecting the last value. It's likely I'm a little snow-blind after trying to work this out for a while, so I apologise in advance if the solution is obvious.
you can sort and group like this (sort_values replaced the older DataFrame.sort):
tempDF.sort_values(['id', 'date']).groupby('id')['status'].last()
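Rebuilt end-to-end on the question's data (using sort_values, since DataFrame.sort was removed in pandas 0.20):

```python
import pandas as pd

# the example data from the question
tempDF = pd.DataFrame(
    {'id': [12, 12, 12, 12, 45, 45, 45, 51, 51, 51, 51, 51, 51, 76, 76, 76],
     'date': ['2015-05-01', '2015-05-22', '2015-05-14', '2015-05-06',
              '2015-05-03', '2015-05-12', '2015-05-02', '2015-05-05',
              '2015-05-01', '2015-05-23', '2015-05-17', '2015-05-03',
              '2015-05-05', '2015-05-04', '2015-05-22', '2015-05-08'],
     'status': [0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0]})
tempDF['date'] = pd.to_datetime(tempDF['date'])

# sort the whole frame by id and date first, then take the last status per id
last_status = (tempDF.sort_values(['id', 'date'])
                     .groupby('id')['status'].last())
print(last_status)
```

Each id's latest date carries status 1, so the result is 1 for every group, as expected.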
