I would like to know an easy way to determine which semester (half-year) a day belongs to, displayed in the format 'YYYYSX'; e.g. 2018-01-01 -> 2018S1.
I have a date range, and it's pretty easy to do this for quarters:
import pandas as pd
import datetime
start = datetime.datetime(2018, 1, 1)
end = datetime.datetime(2020, 1, 1)
all_days = pd.date_range(start, end, freq='D')
all_quarters = []
for day in all_days:
    all_quarters.append(str(pd.Period(day, freq='Q')))
However, according to the docs there is no frequency alias for semesters:
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Period.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
I don't necessarily want to use any specific modules.
Any ideas on how to do this in a clean way?
You can do something like this:
import numpy as np

df['sem'] = df.date.dt.year.astype(str) + 'S' + np.where(df.date.dt.quarter.gt(2), 2, 1).astype(str)
Note: the date column needs to be a datetime dtype.
Input
date
0 2019-09-30
1 2019-10-31
2 2019-11-30
3 2019-12-31
4 2020-01-31
5 2020-02-29
6 2020-03-31
7 2020-04-30
8 2020-05-31
9 2020-06-30
Output
date sem
0 2019-09-30 2019S2
1 2019-10-31 2019S2
2 2019-11-30 2019S2
3 2019-12-31 2019S2
4 2020-01-31 2020S1
5 2020-02-29 2020S1
6 2020-03-31 2020S1
7 2020-04-30 2020S1
8 2020-05-31 2020S1
9 2020-06-30 2020S1
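If you'd rather stay close to the original loop and skip numpy, here is a minimal sketch reusing the all_days range from the question (quarters 1-2 map to S1, quarters 3-4 to S2):
# build the 'YYYYSX' label directly from each timestamp's quarter
all_semesters = [f"{day.year}S{1 if day.quarter <= 2 else 2}" for day in all_days]
print(all_semesters[0])  # 2018S1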
I have date-interval data with a "periodicity" column representing how frequently the date interval recurs:
Weekly: same weekdays every week
Biweekly: same weekdays every other week
Monthly: same dates (not weekdays) every month
Moreover, I have a "recurring_until" column specifying when the recurrence should stop.
What I need to accomplish is creating a separate row for each occurrence of each recurring record until its recurring_until has been reached.
I have been trying with various for loops without much success. Here is the sample data:
import pandas as pd
data = {'id': ['1', '2', '3', '4'],
        'from': ['5/31/2020', '6/3/2020', '6/18/2020', '6/10/2020'],
        'to': ['6/5/2020', '6/3/2020', '6/19/2020', '6/10/2020'],
        'periodicity': ['weekly', 'weekly', 'biweekly', 'monthly'],
        'recurring_until': ['7/25/2020', '6/9/2020', '12/30/2020', '7/9/2020']}
df = pd.DataFrame(data)
First of all, preprocess:
df.set_index("id", inplace=True)
for col in ["from", "to", "recurring_until"]:
    df[col] = pd.to_datetime(df[col])
Next, compute all the periodic from dates:
new_from = df.apply(lambda x: pd.date_range(x["from"], x.recurring_until), axis=1)  # generate all days between from and recurring_until
new_from[df.periodicity=="weekly"] = new_from[df.periodicity=="weekly"].apply(lambda x: x[::7])  # slice by week
new_from[df.periodicity=="biweekly"] = new_from[df.periodicity=="biweekly"].apply(lambda x: x[::14])  # slice by fortnight
new_from[df.periodicity=="monthly"] = new_from[df.periodicity=="monthly"].apply(lambda x: x[x.day==x.day[0]])  # select only days equal to the first day
new_from = new_from.explode()  # explode to obtain a series
new_from.name = "from"  # name the series
After this, new_from looks like this:
id
1 2020-05-31
1 2020-06-07
1 2020-06-14
1 2020-06-21
1 2020-06-28
1 2020-07-05
1 2020-07-12
1 2020-07-19
2 2020-06-03
3 2020-06-18
3 2020-07-02
3 2020-07-16
3 2020-07-30
3 2020-08-13
3 2020-08-27
3 2020-09-10
3 2020-09-24
3 2020-10-08
3 2020-10-22
3 2020-11-05
3 2020-11-19
3 2020-12-03
3 2020-12-17
4 2020-06-10
Name: from, dtype: datetime64[ns]
Now let's compute all the periodic to dates:
new_to = new_from+(df.to-df["from"]).loc[new_from.index]
new_to.name = "to"
and new_to looks like this:
id
1 2020-06-05
1 2020-06-12
1 2020-06-19
1 2020-06-26
1 2020-07-03
1 2020-07-10
1 2020-07-17
1 2020-07-24
2 2020-06-03
3 2020-06-19
3 2020-07-03
3 2020-07-17
3 2020-07-31
3 2020-08-14
3 2020-08-28
3 2020-09-11
3 2020-09-25
3 2020-10-09
3 2020-10-23
3 2020-11-06
3 2020-11-20
3 2020-12-04
3 2020-12-18
4 2020-06-10
Name: to, dtype: datetime64[ns]
We can finally concatenate these two series and join them to the initial dataframe:
periodic_df = pd.concat([new_from, new_to], axis=1).join(df[["periodicity", "recurring_until"]]).reset_index()
The result, periodic_df, has one row per occurrence, with the original periodicity and recurring_until carried along.
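For illustration, the first rows of periodic_df, reconstructed from the two series above, look like this:
print(periodic_df.head())
  id       from         to periodicity recurring_until
0  1 2020-05-31 2020-06-05      weekly      2020-07-25
1  1 2020-06-07 2020-06-12      weekly      2020-07-25
2  1 2020-06-14 2020-06-19      weekly      2020-07-25
3  1 2020-06-21 2020-06-26      weekly      2020-07-25
4  1 2020-06-28 2020-07-03      weekly      2020-07-25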
I have a Pandas dataframe that describes arrivals at stations. It has two columns: time and station id.
Example:
time id
0 2019-10-31 23:59:36 22
1 2019-10-31 23:58:23 260
2 2019-10-31 23:54:55 82
3 2019-10-31 23:54:46 82
4 2019-10-31 23:54:42 21
I would like to resample this into five-minute blocks showing the number of arrivals at each station in the time block that starts at the given time, so it should look like this:
time id arrivals
0 2019-10-31 23:55:00 22 1
1 2019-10-31 23:50:00 22 5
2 2019-10-31 23:55:00 82 0
3 2019-10-31 23:25:00 82 325
4 2019-10-31 23:21:00 21 1
How could I use some high-performance function to achieve this?
pandas.DataFrame.resample does not seem to be a possibility, since it requires the index to be a timestamp, and in this case several rows can have the same time.
(df.groupby(['id', pd.Grouper(key='time', freq='5min')])
   .size()
   .to_frame('arrivals')
   .reset_index())
I think it's a horrible solution (couldn't find a better one at the moment), but it more or less gets you where you want:
df.groupby("id").resample("5min", on="time").count()[["id"]].swaplevel(0, 1, axis=0).sort_index(axis=0).set_axis(["arrivals"], axis=1)
Try with groupby and resample:
>>> df.set_index("time").groupby("id").resample("5min").count()
id
id time
21 2019-10-31 23:50:00 1
22 2019-10-31 23:55:00 1
82 2019-10-31 23:50:00 2
260 2019-10-31 23:55:00 1
I have some data which I'm trying to group by "name" first and then resample by "transaction_date".
transaction_date name revenue
01/01/2020 ADIB 30419
01/01/2020 ADIB 1119372
01/01/2020 ADIB 1272170
01/01/2020 ADIB 43822
01/01/2020 ADIB 24199
The issue I have is that writing the groupby/resample in two different ways returns two different results:
1. df.groupby("name").resample("M", on="transaction_date").sum()[['revenue']].head(12)
2. df.groupby("name").resample("M", on="transaction_date").aggregate({'revenue':'sum'}).head(12)
The first method returns the values I'm looking for.
I don't understand why the two methods return different results. Is this a bug?
Result 1
name transaction_date revenue
ADIB 2020-01-31 39170943.0
2020-02-29 48003966.0
2020-03-31 32691641.0
2020-04-30 11979337.0
2020-05-31 35510726.0
2020-06-30 25677857.0
2020-07-31 12437122.0
2020-08-31 4348936.0
2020-09-30 10547188.0
2020-10-31 5287406.0
2020-11-30 4288930.0
2020-12-31 17066105.0
Result 2
name transaction_date revenue
ADIB 2020-01-31 64128331.0
2020-02-29 54450014.0
2020-03-31 45636192.0
2020-04-30 25016777.0
2020-05-31 11941744.0
2020-06-30 15703151.0
2020-07-31 5517526.0
2020-08-31 4092618.0
2020-09-30 4333433.0
2020-10-31 3944117.0
2020-11-30 6528058.0
2020-12-31 5718196.0
Indeed, it's either a bug or extremely strange behavior. Consider the following data:
Input:
date revenue name
0 2020-10-27 0.744045 n_1
1 2020-10-29 0.074852 n_1
2 2020-11-21 0.560182 n_2
3 2020-12-29 0.208616 n_2
4 2020-05-03 0.325044 n_0
gb = df.groupby("name").resample("M", on="date")
gb.aggregate({'revenue':'sum'})
==>
revenue
name date
n_0 2020-12-31 0.325044
n_1 2020-05-31 0.744045
2020-06-30 0.000000
2020-07-31 0.000000
2020-08-31 0.000000
2020-09-30 0.000000
2020-10-31 0.074852
n_2 2020-10-31 0.560182
2020-11-30 0.208616
print(gb.sum()[['revenue']])
==>
revenue
name date
n_0 2020-05-31 0.325044
n_1 2020-10-31 0.818897
n_2 2020-11-30 0.560182
2020-12-31 0.208616
As one can see, it seems that aggregate produces the wrong results. For example, it takes data from Oct and attaches it to May.
Here's an even simpler example:
Data frame:
date revenue name
0 2020-02-24 9 n_1
1 2020-05-12 8 n_2
2 2020-03-28 9 n_2
3 2020-01-14 2 n_0
gb = df.groupby("name").resample("M", on="date")
res1 = gb.sum()[['revenue']]
==>
name date
n_0 2020-01-31 2
n_1 2020-02-29 9
n_2 2020-03-31 9
2020-04-30 0
2020-05-31 8
res2 = gb.aggregate({'revenue':'sum'})
==>
name date
n_0 2020-05-31 2
n_1 2020-01-31 9
n_2 2020-02-29 8
2020-03-31 9
I opened a bug about it: https://github.com/pandas-dev/pandas/issues/35173
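Until the bug is fixed, one way to sidestep groupby(...).resample(...) entirely is to bucket the dates with pd.Grouper inside a single groupby. A minimal sketch (note that unlike resample, this only emits months that actually contain data, so no zero-filled months appear):
# group by name and by month-end buckets of the date column in one step
res = df.groupby(["name", pd.Grouper(key="date", freq="M")])["revenue"].sum()
print(res)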
I'd like to change my dataframe by adding time intervals for every hour during a month.
Original df
money food
0 1 2
1 4 5
2 5 7
Output:
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
...
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
where 2231 = out_rows_number - 1 = month_days_number * hours_per_day * orig_rows_number - 1 = 31 * 24 * 3 - 1
What is the proper way to perform it?
Use a cross join via DataFrame.merge with a helper DataFrame holding all hours of the month, created by date_range:
df1 = pd.DataFrame({'a':1,
'time':pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.assign(a=1).merge(df1, on='a', how='outer').drop('a', axis=1)
print (df)
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
3 1 2 2020-01-01 03:00:00
4 1 2 2020-01-01 04:00:00
... ... ...
2227 5 7 2020-01-31 19:00:00
2228 5 7 2020-01-31 20:00:00
2229 5 7 2020-01-31 21:00:00
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
[2232 rows x 3 columns]
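As a side note, on pandas 1.2+ the helper column a isn't needed, because merge supports how='cross' directly. A minimal sketch:
# every original row paired with every hour of January 2020
times = pd.DataFrame({'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.merge(times, how='cross')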
Data frame 1
id start_date end_date count
1 2018-02-01 2018-02-04 4
1 2018-02-06 2018-02-07 2
2 2018-03-05 2018-03-08 3
2 2018-03-12 2018-03-15 4
Data frame 2
id start end
1 2018-02-01 2018-02-08
2 2018-03-01 2018-03-15
The desired output is:
id start_date end_date count
1 2018-02-01 2018-02-04 4
1 2018-02-06 2018-02-07 3
2 2018-03-05 2018-03-08 7
2 2018-03-12 2018-03-15 4
Explanation for the output:
id 1: data frame 2 covers 2018-02-01 to 2018-02-08. Data frame 1 does not cover 2018-02-08, so the count of the last interval is increased from 2 to 3.
id 2: data frame 2 covers 2018-03-01 to 2018-03-15. Data frame 1 does not include 2018-03-01 to 2018-03-04, so the count of the first interval is increased from 3 to 7.
count is the number of days between start_date and end_date.
How can I get the solution?
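No answer is posted for this one, but based on the rule described above (days of the overall range in data frame 2 that fall before the first interval are added to the first interval's count, days after the last interval to the last interval's count, and gap days in between stay unassigned), a minimal sketch might look like this. adjust_counts is a hypothetical helper, not an existing API; it assumes the date columns are already datetime64 and the intervals are sorted per id:
import pandas as pd

def adjust_counts(df1, df2):
    # attach each id's overall range (start, end) to its intervals
    out = df1.merge(df2, on="id")
    first = out.groupby("id").head(1).index  # first interval of each id
    last = out.groupby("id").tail(1).index   # last interval of each id
    # leading days missing from data frame 1 go to the first interval
    out.loc[first, "count"] += (out.loc[first, "start_date"] - out.loc[first, "start"]).dt.days
    # trailing days missing from data frame 1 go to the last interval
    out.loc[last, "count"] += (out.loc[last, "end"] - out.loc[last, "end_date"]).dt.days
    return out.drop(columns=["start", "end"])
For the sample frames above this reproduces the desired counts (4, 3, 7, 4).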