I have a DataFrame with this kind of data:
Year  Day  Hour  Minute
2017  244  0     0
2017  244  0     1
2017  244  0     2
I want to create a new column on this DataFrame showing the date plus the hour and minute, but I don't know how to convert the day of the year into a month and day and combine everything.
I tried something using pd.to_datetime, like the code below.
line['datetime'] = pd.to_datetime(line['Year'] + line['Day'] + line['Hour'] + line['Minute'], format= '%Y%m%d %H%M')
I would like to have something like this:
Year  Month  Day  Hour  Minute
2017  9      1    0     0
2017  9      1    0     1
2017  9      1    0     2
So in your case do
df['date'] = pd.to_datetime(df.astype(str).agg(' '.join,1),format='%Y %j %H %M')
Out[294]:
0 2017-09-01 00:00:00
1 2017-09-01 00:01:00
2 2017-09-01 00:02:00
dtype: datetime64[ns]
#df['month'] = df['date'].dt.month
#df['day'] = df['date'].dt.day
Try:
s = pd.to_datetime(df['Year'], format='%Y') \
    + pd.to_timedelta(df['Day'] - 1, unit='D')
print(s)
# Output
0 2017-09-01
1 2017-09-01
2 2017-09-01
dtype: datetime64[ns]
Now you can insert your columns:
df.insert(1, 'Month', s.dt.month)
df['Day'] = s.dt.day
print(df)
# Output
Year Month Day Hour Minute
0 2017 9 1 0 0
1 2017 9 1 0 1
2 2017 9 1 0 2
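The two answers above can be combined into one self-contained sketch that also folds in the Hour and Minute columns. It assumes the same column names as the question (Year, Day as day of year, Hour, Minute); the new column names datetime and DayOfMonth are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2017, 2017, 2017],
    "Day": [244, 244, 244],      # day of the year, 1-based
    "Hour": [0, 0, 0],
    "Minute": [0, 1, 2],
})

# January 1st of each year, plus (day of year - 1) days, plus hours and minutes.
df["datetime"] = (
    pd.to_datetime(df["Year"].astype(str), format="%Y")
    + pd.to_timedelta(df["Day"] - 1, unit="D")
    + pd.to_timedelta(df["Hour"], unit="h")
    + pd.to_timedelta(df["Minute"], unit="min")
)

# Calendar month and day of month derived from the full timestamp.
df["Month"] = df["datetime"].dt.month
df["DayOfMonth"] = df["datetime"].dt.day
print(df)
```

Keeping the original Day column untouched and writing the day of the month to a separate column avoids losing the day-of-year value.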
df["Month"]=round(df["Day"]/30+.5).astype(int)
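This creates a new column and populates it by using the Day column to estimate the month (day of year / 30), rounding up by adding .5 and storing it as an integer using astype. Note that this is only an approximation, because months are not all 30 days long: day 31 (January 31), for example, comes out as month 2.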
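To see where the divide-by-30 estimate drifts from the calendar, here is a small comparison against an exact conversion. The sample day numbers and the non-leap year 2017 are assumptions picked for illustration.

```python
import pandas as pd

days = pd.Series([31, 59, 244, 365])  # day-of-year values in a non-leap year

# Estimate from the answer above: day / 30, rounded after adding 0.5.
approx = (days / 30 + 0.5).round().astype(int)

# Exact month: January 1st plus (day - 1) days, then read the calendar month.
exact = (pd.Timestamp("2017-01-01") + pd.to_timedelta(days - 1, unit="D")).dt.month

print(pd.DataFrame({"day": days, "approx": approx, "exact": exact}))
```

The estimate matches for day 244 but gives month 2 for January 31 and month 13 for December 31, so the exact conversion is the safer choice.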
How can I check whether a particular component (e.g. the year or the hour) is present in a datetime column in pandas? For example, say you want to examine the time gap between two rows in hours, but first you need to check that the hour component is present in the datetime value.
desired outcome:
datetime             is hours present
2020-01-21 17:24:00  true
2020-01-22           false
2020-01-23 17:28:00  true
2020-01-24           false
2020-01-24           false
I can't work out exactly what you're asking, so hopefully this overkill helps:
import pandas as pd
import numpy as np
import datetime
# create dummy dataframe with mock data
df = pd.DataFrame({'datetime': ['2021-01-01 00:00:00', '2021-01-02 00:00:00', '2021-01-03 00:00:00', '2021-01-04 00:00:00', '2021-01-05 00:00:00','2022-01-05 00:00:00']})
df['datetime'] = pd.to_datetime(df['datetime'])
df['day_of_week'] = df['datetime'].dt.day_name()
df['day_of_month'] = df['datetime'].dt.day
df['day_of_year'] = df['datetime'].dt.dayofyear
df['week_of_year'] = df['datetime'].dt.isocalendar().week
df['month_of_year'] = df['datetime'].dt.month
df['year'] = df['datetime'].dt.year
df['hour'] = df['datetime'].dt.hour
df['minute'] = df['datetime'].dt.minute
df['second'] = df['datetime'].dt.second
#check whether a particular section (ex: year or day) is present in the DateTime column in pandas
checkYear = df['datetime'].dt.year.isin([2021])
#show time gap between hours for each row and add new column to dataframe
df['time_gap'] = df['datetime'].diff().dt.total_seconds()
print(df)
print(checkYear)
Output:
datetime day_of_week day_of_month day_of_year week_of_year month_of_year year hour minute second time_gap
0 2021-01-01 Friday 1 1 53 1 2021 0 0 0 NaN
1 2021-01-02 Saturday 2 2 53 1 2021 0 0 0 86400.0
2 2021-01-03 Sunday 3 3 53 1 2021 0 0 0 86400.0
3 2021-01-04 Monday 4 4 1 1 2021 0 0 0 86400.0
4 2021-01-05 Tuesday 5 5 1 1 2021 0 0 0 86400.0
5 2022-01-05 Wednesday 5 5 1 1 2022 0 0 0 31536000.0
0 True
1 True
2 True
3 True
4 True
5 False
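For the "is hours present" column in the question, one possible sketch is to test the raw strings before parsing, since a parsed timestamp cannot distinguish a missing time from midnight. This assumes the column still holds the original strings and that a time part always looks like HH:MM; the column name is_hours_present is made up here.

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": [
        "2020-01-21 17:24:00",
        "2020-01-22",
        "2020-01-23 17:28:00",
        "2020-01-24",
        "2020-01-24",
    ]
})

# A time component is assumed to look like "H:MM" or "HH:MM" somewhere in the string.
df["is_hours_present"] = df["datetime"].str.contains(r"\d{1,2}:\d{2}")
print(df)
```

Once the flag exists, the rows can be parsed with pd.to_datetime and the gap computed only where both neighbouring rows are flagged.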
I am new to pandas and have been struggling with this easy task. Given the following dataframe:
Year Qt Value
0 2010 2 17
1 2015 1 11
2 2020 1 86
I want to create another column with the date of quarter end:
Year Qt Value Date
0 2010 2 17 30/06/2010
1 2015 1 11 31/03/2015
2 2020 1 86 31/03/2020
What is the best way of doing this?
You can use pd.to_datetime to convert the 'Year' and 'Qt' columns to a date, and then add an offset to move to the end of the quarter:
pd.to_datetime(df['Year'].astype(str)+'Q'+df['Qt'].astype(str))\
+ pd.tseries.offsets.QuarterEnd()
0 2010-06-30
1 2015-03-31
2 2020-03-31
dtype: datetime64[ns]
I'd use pd.Period and end_time:
df['Date'] = df.apply(lambda x: pd.Period(f"{x['Qt']}Q{x['Year']}" ).end_time.strftime('%d/%m/%Y'), axis=1)
Output:
Year Qt Value Date
0 2010 2 17 30/06/2010
1 2015 1 11 31/03/2015
2 2020 1 86 31/03/2020
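A vectorised variant of the Period idea, avoiding the row-wise apply, might look like the sketch below. It assumes the same Year and Qt columns as the question and builds labels such as "2010Q2" as an intermediate step.

```python
import pandas as pd

df = pd.DataFrame({"Year": [2010, 2015, 2020], "Qt": [2, 1, 1], "Value": [17, 11, 86]})

# Build strings such as "2010Q2" and parse them as quarterly periods.
labels = (df["Year"].astype(str) + "Q" + df["Qt"].astype(str)).tolist()
quarters = pd.PeriodIndex(labels, freq="Q")

# end_time is the last instant of each quarter; strftime keeps only the date part.
df["Date"] = quarters.end_time.strftime("%d/%m/%Y")
print(df)
```

If real datetimes are needed later, quarters.end_time.normalize() keeps them as timestamps instead of strings.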
Assume the below dataframe, df
Start_Date End_Date
0 20201101 20201130
1 20201201 20201231
2 20210101 20210131
3 20210201 20210228
4 20210301 20210331
How can I calculate the time difference between the two date columns in days?
Required Output
Start_Date End_Date Diff_in_Days
0 20201101 20201130
1 20201201 20201231
2 20210101 20210131
3 20210201 20210228
4 20210301 20210331
The first idea is to convert the columns to datetimes, take the difference, and convert the timedeltas to days with Series.dt.days:
df['Diff_in_Days'] = (pd.to_datetime(df['End_Date'], format='%Y%m%d')
.sub(pd.to_datetime(df['Start_Date'], format='%Y%m%d'))
.dt.days)
print (df)
Start_Date End_Date Diff_in_Days
0 20201101 20201130 29
1 20201201 20201231 30
2 20210101 20210131 30
3 20210201 20210228 27
4 20210301 20210331 30
Another solution, better if you will be processing the datetimes later, is to assign the converted columns back and then use the solution above:
df['Start_Date'] = pd.to_datetime(df['Start_Date'], format='%Y%m%d')
df['End_Date'] = pd.to_datetime(df['End_Date'], format='%Y%m%d')
df['Diff_in_Days'] = df['End_Date'].sub(df['Start_Date']).dt.days
print (df)
Start_Date End_Date Diff_in_Days
0 2020-11-01 2020-11-30 29
1 2020-12-01 2020-12-31 30
2 2021-01-01 2021-01-31 30
3 2021-02-01 2021-02-28 27
4 2021-03-01 2021-03-31 30
I have a dataframe with values per day (see df below).
I want to group the "Forecast" field per week but with Monday as the first day of the week.
Currently I can do it via pd.TimeGrouper('W') (see df_final below), but it groups the weeks starting on Sundays.
import pandas as pd
data = [("W1","G1",1234,pd.to_datetime("2015-07-1"),8),
("W1","G1",1234,pd.to_datetime("2015-07-30"),2),
("W1","G1",1234,pd.to_datetime("2015-07-15"),2),
("W1","G1",1234,pd.to_datetime("2015-07-2"),4),
("W1","G2",2345,pd.to_datetime("2015-07-5"),5),
("W1","G2",2345,pd.to_datetime("2015-07-7"),1),
("W1","G2",2345,pd.to_datetime("2015-07-9"),1),
("W1","G2",2345,pd.to_datetime("2015-07-11"),3)]
labels = ["Site","Type","Product","Date","Forecast"]
df = pd.DataFrame(data,columns=labels).set_index(["Site","Type","Product","Date"])
df
Forecast
Site Type Product Date
W1 G1 1234 2015-07-01 8
2015-07-30 2
2015-07-15 2
2015-07-02 4
G2 2345 2015-07-05 5
2015-07-07 1
2015-07-09 1
2015-07-11 3
df_final = (df
.reset_index()
.set_index("Date")
# pd.TimeGrouper was removed from pandas; pd.Grouper(freq='W') is its replacement
.groupby(["Site","Product",pd.Grouper(freq='W')])["Forecast"].sum()
.astype(int)
.reset_index())
df_final["DayOfWeek"] = df_final["Date"].dt.dayofweek
df_final
Site Product Date Forecast DayOfWeek
0 W1 1234 2015-07-05 12 6
1 W1 1234 2015-07-19 2 6
2 W1 1234 2015-08-02 2 6
3 W1 2345 2015-07-05 5 6
4 W1 2345 2015-07-12 5 6
Use W-MON instead of W (see anchored offsets in the pandas documentation):
df_final = (df
.reset_index()
.set_index("Date")
.groupby(["Site","Product",pd.Grouper(freq='W-MON')])["Forecast"].sum()
.astype(int)
.reset_index())
df_final["DayOfWeek"] = df_final["Date"].dt.dayofweek
print (df_final)
Site Product Date Forecast DayOfWeek
0 W1 1234 2015-07-06 12 0
1 W1 1234 2015-07-20 2 0
2 W1 1234 2015-08-03 2 0
3 W1 2345 2015-07-06 5 0
4 W1 2345 2015-07-13 5 0
I have three solutions to this problem as described below. First, I should state that the ex-accepted answer is incorrect. Here is why:
# let's create an example df of length 9, 2020-03-08 is a Sunday
s = pd.DataFrame({'dt':pd.date_range('2020-03-08', periods=9, freq='D'),
'counts':0})
> s
   dt                   counts
0  2020-03-08 00:00:00  0
1  2020-03-09 00:00:00  0
2  2020-03-10 00:00:00  0
3  2020-03-11 00:00:00  0
4  2020-03-12 00:00:00  0
5  2020-03-13 00:00:00  0
6  2020-03-14 00:00:00  0
7  2020-03-15 00:00:00  0
8  2020-03-16 00:00:00  0
These nine days span three Monday-to-Sunday weeks: the weeks of March 2nd, 9th, and 16th. Let's try the accepted answer:
# the accepted answer
> s.groupby(pd.Grouper(key='dt',freq='W-Mon')).count()
dt                   counts
2020-03-09 00:00:00  2
2020-03-16 00:00:00  7
This is wrong because the OP wants to have "Monday as the first day of the week" (not as the last day of the week) in the resulting dataframe. Let's see what we get when we try with freq='W'
> s.groupby(pd.Grouper(key='dt', freq='W')).count()
dt                   counts
2020-03-08 00:00:00  1
2020-03-15 00:00:00  7
2020-03-22 00:00:00  1
This grouper actually grouped as we wanted (Monday to Sunday) but labeled the 'dt' with the END of the week, rather than the start. So, to get what we want, we can move the index by 6 days like:
w = s.groupby(pd.Grouper(key='dt', freq='W')).count()
w.index -= pd.Timedelta(days=6)
or alternatively we can do:
s.groupby(pd.Grouper(key='dt',freq='W-Mon',label='left',closed='left')).count()
A third solution, arguably the most readable one, is to convert dt to a period first, then group, and finally (if needed) convert back to a timestamp:
s.groupby(s.dt.dt.to_period('W'))['counts'].count().to_timestamp()
# a variant of this solution is: s.set_index('dt').to_period('W').groupby(pd.Grouper(freq='W')).count().to_timestamp()
All of these solutions return what the OP asked for:
dt                   counts
2020-03-02 00:00:00  1
2020-03-09 00:00:00  7
2020-03-16 00:00:00  1
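The three approaches above can be checked side by side with a short script. This is a sketch that reuses the example frame s from this answer; the names a, b and c are only labels for the three results.

```python
import pandas as pd

# Nine consecutive days; 2020-03-08 is a Sunday.
s = pd.DataFrame({"dt": pd.date_range("2020-03-08", periods=9, freq="D"), "counts": 0})

# 1) Group by Sunday-ending weeks, then shift the labels back to Monday.
a = s.groupby(pd.Grouper(key="dt", freq="W")).count()
a.index = a.index - pd.Timedelta(days=6)

# 2) Left-closed, left-labelled bins anchored on Monday.
b = s.groupby(pd.Grouper(key="dt", freq="W-MON", label="left", closed="left")).count()

# 3) Weekly periods, converted back to their start timestamps.
c = s.groupby(s["dt"].dt.to_period("W"))["counts"].count().to_timestamp()

print(a, b, c, sep="\n\n")
```

All three label each group with the Monday that starts the week, which is what the question asked for.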
Explanation: for weekly frequencies such as W, both the closed and label kwargs of pd.Grouper default to right (most other frequencies default to left). Setting freq to W (short for W-SUN) works because we want our week to end on Sunday (Sunday included, and g.closed == 'right' handles this). Unfortunately, the pd.Grouper docstring does not show the default values, but you can see them like this:
g = pd.Grouper(key='dt', freq='W')
print(g.closed, g.label)
> right right
I have a DataFrame like this
import datetime
import pandas as pd
df = pd.DataFrame({'Team':['CHI','IND','CHI','CHI','IND','CHI','CHI','IND'],
'Date':[datetime.date(2015,10,27),datetime.date(2015,10,28),datetime.date(2015,10,29),datetime.date(2015,10,30),datetime.date(2015,11,1),datetime.date(2015,11,2),datetime.date(2015,11,4),datetime.date(2015,11,4)]})
I can find the number of rest days between games using this.
df['TeamRest'] = df.groupby('Team')['Date'].diff() - datetime.timedelta(1)
I would also like to add a column to the DataFrame that keeps track of how many games each team has played in the last 5 days.
First convert Date to datetime so it can be used as a DatetimeIndex, which is important for the rolling count with daily frequency:
df.Date = pd.to_datetime(df.Date)
1) calculate the difference in days between games per team:
df['days_between'] = df.groupby('Team')['Date'].diff() - pd.Timedelta(days=1)
2) calculate the rolling count of games for the last 5 days per team:
df['game_count'] = 1
# pd.rolling_count was removed from pandas; a time-based rolling('5D') window is the current equivalent
rolling_games_count = df.set_index('Date').groupby('Team')['game_count'].rolling('5D').count().astype(int).reset_index()
df = df.drop('game_count', axis=1).merge(rolling_games_count, on=['Team', 'Date'], how='left')
to get:
Date Team days_between game_count
0 2015-10-27 CHI NaT 1
1 2015-10-28 IND NaT 1
2 2015-10-29 CHI 1 days 2
3 2015-10-30 CHI 0 days 3
4 2015-11-01 IND 3 days 2
5 2015-11-02 CHI 2 days 3
6 2015-11-04 CHI 1 days 2
7 2015-11-04 IND 2 days 2
If you were to run
from datetime import date
df = pd.DataFrame({'Team':['CHI','IND','CHI','CHI','IND','CHI','CHI','IND'], 'Date': [date(2015,10,27),date(2015,10,28),date(2015,10,29),date(2015,10,30),date(2015,11,1),date(2015,11,2),date(2015,11,4),date(2015,12,10)]})
df['game'] = 1 # initialize a game to count.
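# pd.rolling_count has since been removed from pandas; on current versions the row-based equivalent is .rolling(5, min_periods=1).count()
df['nb_games'] = df.groupby('Team')['game'].apply(pd.rolling_count, 5)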
you get the surprising result (one Date changed to one month later)
Date Team game nb_games
0 2015-10-27 CHI 1 1
2 2015-10-29 CHI 1 2
3 2015-10-30 CHI 1 3
5 2015-11-02 CHI 1 4
6 2015-11-04 CHI 1 5
1 2015-10-28 IND 1 1
4 2015-11-01 IND 1 2
7 2015-12-10 IND 1 3
of nb_games=3 for the later date in December, even though there were no other games during the preceding five days. Unless you convert to datetime and use a time-based window, you only count the last five entries in the DataFrame, so you'll always get five for a team with more than five games played.
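On current pandas versions, the difference between counting the last five rows and counting the last five days can be sketched as below. The column names last_5_rows and last_5_days are made up for this example, and the data is the frame above with the final IND game moved to December.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["CHI", "IND", "CHI", "CHI", "IND", "CHI", "CHI", "IND"],
    "Date": pd.to_datetime([
        "2015-10-27", "2015-10-28", "2015-10-29", "2015-10-30",
        "2015-11-01", "2015-11-02", "2015-11-04", "2015-12-10",
    ]),
})
df["game"] = 1

# Row-based window: the last 5 rows per team, regardless of how far apart they are.
df["last_5_rows"] = df.groupby("Team")["game"].transform(
    lambda g: g.rolling(5, min_periods=1).count()
).astype(int)

# Time-based window: games whose Date falls in the 5 days up to and including each game.
by_time = (
    df.set_index("Date")
      .groupby("Team")["game"]
      .rolling("5D")
      .count()
      .astype(int)
      .rename("last_5_days")
      .reset_index()
)
df = df.merge(by_time, on=["Team", "Date"], how="left")
print(df)
```

For the December game the row-based count is 3, while the time-based count is 1, since only that game itself falls inside the five-day window.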