Combine weekday with hours in Pandas - python

I have a data frame with a weekday column that contains the names of the weekdays and a time column that contains hours on those days. How can I combine these two columns so that the result is also sortable?
I have tried simply concatenating them as strings, but the result is not sortable by weekday and hour.
This is a sample of how the table looks.
weekday    time
Monday     12:00
Monday     13:00
Tuesday    20:00
Friday     10:00
This is what I want to get.
weekday_hours
Monday 12:00
Monday 13:00
Tuesday 20:00
Friday 10:00
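For reproducibility, here is a minimal sketch of the sample frame that the answers below assume (the column names weekday and time are taken from the tables above):
import pandas as pd
df = pd.DataFrame({'weekday': ['Monday', 'Monday', 'Tuesday', 'Friday'],
                   'time': ['12:00', '13:00', '20:00', '10:00']})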

Assuming that df is your initial dataframe:
import json
import pandas as pd

# dump the frame to a list of records, build the combined strings,
# then wrap them into a new DataFrame
datas = json.loads(df.to_json(orient="records"))
final_data = {"weekday_hours": []}
for data in datas:
    final_data["weekday_hours"].append(data['weekday'] + ' ' + data['time'])
final_df = pd.DataFrame(final_data)
final_df
Output:

You first need to create a datetime range covering 7 days at an hourly level to sort by. In a typical data-warehousing setup you would have a calendar and a time dimension holding all the different representations of your date data that you can merge and sort on; this is an adaptation of that methodology.
import pandas as pd
df1 = pd.DataFrame({'date' : pd.date_range('01 Jan 2021', '08 Jan 2021',freq='H')})
df1['str_date'] = df1['date'].dt.strftime('%A %H:%M')
print(df1.head(5))
                 date      str_date
0 2021-01-01 00:00:00  Friday 00:00
1 2021-01-01 01:00:00  Friday 01:00
2 2021-01-01 02:00:00  Friday 02:00
3 2021-01-01 03:00:00  Friday 03:00
4 2021-01-01 04:00:00  Friday 04:00
Then create your column to merge on.
df['str_date'] = df['weekday'] + ' ' + df['time']
df2 = pd.merge(df[['str_date']], df1, on=['str_date'], how='left') \
        .sort_values('date').drop('date', axis=1)
print(df2)
str_date
3 Friday 10:00
0 Monday 12:00
1 Monday 13:00
2 Tuesday 20:00

Based on my understanding of the question, you want a single column, "weekday_hours," but you also want to be able to sort the data based on this column. This is a bit tricky because "Monday" doesn't provide enough information to define a valid datetime. Parsing with pd.to_datetime(df['weekday_hours'], format='%A %H:%M'), for example, will return 1900-01-01 plus the hour and minute when given just a weekday and time, so sorting that column only sorts by time.
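For illustration, a small sketch of that behaviour (the Series is built inline here rather than taken from the question's frame):
import pandas as pd
s = pd.Series(['Monday 12:00', 'Friday 10:00'])
print(pd.to_datetime(s, format='%A %H:%M'))
# expected: 1900-01-01 12:00:00 and 1900-01-01 10:00:00 -- the weekday is
# discarded, so sorting this column orders by time alone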
One workaround is to use dateutil to parse the dates. In lieu of a date, it will return the next date corresponding to the day of the week. For example, today (9 April 2021) dateutil.parser.parse('Friday 10:00') returns datetime.datetime(2021, 4, 9, 10, 0) and dateutil.parser.parse('Monday 10:00') returns datetime.datetime(2021, 4, 12, 10, 0). Therefore, we need to set the "default" date to something corresponding to our "first" day of the week. Here is an example starting with unsorted dates:
import datetime
import dateutil
import pandas as pd
weekdays = ['Friday', 'Monday', 'Monday', 'Tuesday']
times = ['10:00', '13:00', '12:00', '20:00', ]
df = pd.DataFrame({'weekday' : weekdays, 'time' : times})
df2 = pd.DataFrame()
df2['weekday_hours'] = df[['weekday', 'time']].agg(' '.join, axis=1)
amonday = datetime.datetime(2021, 2, 1, 0, 0) # assuming week starts monday
sorter = lambda t: [dateutil.parser.parse(ti, default=amonday) for ti in t]
print(df2.sort_values('weekday_hours', key=sorter))
Produces the output:
weekday_hours
2 Monday 12:00
1 Monday 13:00
3 Tuesday 20:00
0 Friday 10:00
Note there are probably more computationally efficient ways if you are working with a lot of data, but this should illustrate the idea of a sortable weekday/time pair.
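As one illustration of such an alternative (a sketch, not part of the answers above): make the weekday an ordered Categorical so the frame can be sorted without parsing any dates, assuming the times are zero-padded HH:MM strings that sort correctly as text:
import pandas as pd
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']
# ordered Categorical: sorting follows weekday_order instead of alphabetical order
df['weekday'] = pd.Categorical(df['weekday'], categories=weekday_order, ordered=True)
df_sorted = df.sort_values(['weekday', 'time'])
df_sorted['weekday_hours'] = df_sorted['weekday'].astype(str) + ' ' + df_sorted['time']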

Related

Pandas: Get first and last row of the same date and calculate time difference

I have a dataframe with a date column and a time column.
Each row describes some event. I want to calculate the timespan for each distinct day and add it as a new column. The actual calculation is not that important (which units, etc.); I just want to know how I can get the first and last row for each date in order to access the time values.
The dataframe is already sorted by date, and all rows of the same date are also ordered by time.
Minimal example of what I have
import pandas as pd
df = pd.DataFrame({"Date": ["01.01.2020", "01.01.2020", "01.01.2020", "02.02.2022", "02.02.2022"],
                   "Time": ["12:00", "13:00", "14:45", "02:00", "08:00"]})
df
and what I want
EDIT: The duration column should be calculated by
14:45 - 12:00 = 2:45 for the first date and
08:00 - 02:00 = 6:00 for the second date.
I suspect this is possible with the groupby function but I am not sure how exactly to do it.
I hope you will find this helpful.
import pandas as pd

df = pd.DataFrame({"Date": ["01.01.2020", "01.01.2020", "01.01.2020", "02.02.2022", "02.02.2022"],
                   "Time": ["12:00", "13:00", "14:45", "02:00", "08:00"]})
df["Datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"])

# per group, the duration is the span between the earliest and latest datetime
def date_diff(df):
    df["Duration"] = df["Datetime"].max() - df["Datetime"].min()
    return df

df = df.groupby("Date").apply(date_diff)
df = df.drop("Datetime", axis=1)
Output:
Date Time Duration
0 01.01.2020 12:00 0 days 02:45:00
1 01.01.2020 13:00 0 days 02:45:00
2 01.01.2020 14:45 0 days 02:45:00
3 02.02.2022 02:00 0 days 06:00:00
4 02.02.2022 08:00 0 days 06:00:00
You can then do some string styling:
df['Duration'] = df['Duration'].astype(str).map(lambda x: x[7:12])
Output:
Date Time Duration
0 01.01.2020 12:00 02:45
1 01.01.2020 13:00 02:45
2 01.01.2020 14:45 02:45
3 02.02.2022 02:00 06:00
4 02.02.2022 08:00 06:00
Here is one way to do it:
# groupby on Date and find the difference of max and min time in each group,
# format it as HH:MM by extracting the hours and minutes,
# and build a dictionary keyed by Date
d = dict((df.groupby('Date')['Time'].apply(lambda x:
              (pd.to_timedelta(x.max() + ':00') -
               pd.to_timedelta(x.min() + ':00'))
          ).astype(str).str.extract(r'days (..:..)')
         ).reset_index().values)

# map the dictionary and update the duration in the DF
df['duration'] = df['Date'].map(d)
df
Date Time duration
0 01.01.2020 12:00 02:45
1 01.01.2020 13:00 02:45
2 01.01.2020 14:45 02:45
3 02.02.2022 02:00 06:00
4 02.02.2022 08:00 06:00
With the example shown below, you can achieve what you want.
df['Start time'] = df.apply(lambda row: df[df['Date'] == row['Date']]['Time'].max(), axis=1)
df
Update:
import datetime
df['Duration'] = df.apply(
    lambda row: str(datetime.timedelta(
        seconds=(datetime.datetime.strptime(df[df['Date'] == row['Date']]['Time'].max(), '%H:%M')
                 - datetime.datetime.strptime(df[df['Date'] == row['Date']]['Time'].min(), '%H:%M')
                 ).total_seconds())),
    axis=1)
df
You can use:
from datetime import timedelta
import numpy as np
df['xdate'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d.%m.%Y %H:%M')
df['max'] = df.groupby(df['xdate'].dt.date)['xdate'].transform(np.max)  # max datetime per date
df['min'] = df.groupby(df['xdate'].dt.date)['xdate'].transform(np.min)  # min datetime per date
# difference between the max datetime and the time-of-day of the min datetime
df['Duration'] = df[['min', 'max']].apply(
    lambda x: x['max'] - timedelta(hours=x['min'].hour, minutes=x['min'].minute, seconds=x['min'].second),
    axis=1).dt.strftime('%H:%M')
df = df.drop(['xdate', 'min', 'max'], axis=1)
print(df)
'''
Date Time Duration
0 01.01.2020 12:00 02:45
1 01.01.2020 13:00 02:45
2 01.01.2020 14:45 02:45
3 02.02.2022 02:00 06:00
4 02.02.2022 08:00 06:00
'''

How to get a date from year, month, week of month and Day of week in Pandas?

I have a Pandas dataframe which looks like the below.
I want to create a new column that gives the exact date based on the information in all the above columns. The code should look something like this:
df['Date'] = pd.to_datetime(df['Month']+df['WeekOfMonth']+df['DayOfWeek']+df['Year'])
I was able to find a workaround for your case. You will need to define the dictionaries for the months and the days of the week.
month = {"Jan":"01", "Feb":"02", "March":"03", "Apr": "04", "May":"05", "Jun":"06", "Jul":"07", "Aug":"08", "Sep":"09", "Oct":"10", "Nov":"11", "Dec":"12"}
week = {"Monday":1,"Tuesday":2,"Wednesday":3,"Thursday":4,"Friday":5,"Saturday":6,"Sunday":7}
With these dictionaries, the transformation I used with a custom dataframe was:
rows = [["Dec", 5, "Wednesday", "1995"],
        ["Jan", 3, "Wednesday", "2013"]]
df = pd.DataFrame(rows, columns=["Month", "Week", "Weekday", "Year"])
df['Date'] = (df["Year"] + "-" + df["Month"].map(month) + "-"
              + (df["Week"].apply(lambda x: (x - 1) * 7)
                 + df["Weekday"].map(week).apply(int)).apply(str)).astype('datetime64[ns]')
However, you have to be careful. With some of the data you posted as an example, there are dates that exceed the valid date range. For example, for
row = ["Oct",5,"Friday","2018"]
the constructed date is 2018-10-33, which does not exist. I recommend using some logic to filter your data in order to avoid this kind of problem.
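One possible guard (a sketch, not from the original answer) is to build the same date string but convert it with pd.to_datetime(..., errors='coerce'), so impossible combinations such as 2018-10-33 become NaT and can be dropped or inspected:
date_str = (df["Year"] + "-" + df["Month"].map(month) + "-"
            + (df["Week"].apply(lambda x: (x - 1) * 7)
               + df["Weekday"].map(week).apply(int)).apply(str))
df["Date"] = pd.to_datetime(date_str, errors="coerce")  # invalid days become NaT
df = df.dropna(subset=["Date"])  # or flag these rows instead of dropping them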
Let's approach it in 3 steps as follows:
Get the date of month start Month_Start from Year and Month
Calculate the date offsets DateOffset relative to Month_Start from WeekOfMonth and DayOfWeek
Get the actual date Date from Month_Start and DateOffset
Here's the code:
import time

df['Month_Start'] = pd.to_datetime(df['Year'].astype(str) + df['Month'] + '01', format="%Y%b%d")
df['DateOffset'] = ((df['WeekOfMonth'] - 1) * 7
                    + df['DayOfWeek'].map(lambda x: time.strptime(x, '%A').tm_wday)
                    - df['Month_Start'].dt.dayofweek)
df['Date'] = df['Month_Start'] + pd.to_timedelta(df['DateOffset'], unit='D')
Output:
Month WeekOfMonth DayOfWeek Year Month_Start DateOffset Date
0 Dec 5 Wednesday 1995 1995-12-01 26 1995-12-27
1 Jan 3 Wednesday 2013 2013-01-01 15 2013-01-16
2 Oct 5 Friday 2018 2018-10-01 32 2018-11-02
3 Jun 2 Saturday 1980 1980-06-01 6 1980-06-07
4 Jan 5 Monday 1976 1976-01-01 25 1976-01-26
The Date column now contains the dates derived from the information from other columns.
You can remove the working interim columns, if you like, as follows:
df = df.drop(['Month_Start', 'DateOffset'], axis=1)

Calculate the time difference between two hh:mm columns in a pandas dataframe

I am reading some data from a csv file where two of the columns are in hh:mm format. Here is an example:
Start End
11:15 15:00
22:30 2:00
In the above example, the End in the 2nd row happens on the next day. I am trying to get the time difference between these two columns in the most efficient way, as the dataset is huge. Is there a good Pythonic way of doing this? Also, since there is no date and some End values fall on the next day, I get the wrong result when I calculate the diff.
>>> import pandas as pd
>>> df = pd.read_csv(file_path)
>>> pd.to_datetime(df['End'])-pd.to_datetime(df['Start'])
0 0 days 03:45:00
1 0 days 03:00:00
2 -1 days +03:30:00
You can use the (a + x) % x technique with a timedelta of 24h (or 1 day, which is the same):
the + timedelta(hours=24) makes all values positive;
the % timedelta(hours=24) wraps the values that are now above 24h back down by 24h.
from datetime import timedelta

df['duration'] = (pd.to_datetime(df['End']) - pd.to_datetime(df['Start'])
                  + timedelta(hours=24)) % timedelta(hours=24)
Gives
Start End duration
0 11:15 15:00 0 days 03:45:00
1 22:30 2:00 0 days 03:30:00

Creating Columns for Hour of Day and date based on datetime column

How can I create two new columns that have the day only and the hour of day only, based on a column that has a datetime timestamp?
The DF has a column such as:
Timestamp
2019-05-31 21:11:43
2018-11-21 18:01:00
2017-11-21 22:01:04
2020-04-15 11:01:00
2017-04-20 04:00:33
I want two new columns that look like below:
Day | Hour of Day
2019-05-31 21:00
2018-11-21 18:00
2017-11-21 22:00
2020-04-15 11:00
2017-04-20 04:00
I tried something like the below, but it only gives me a number for the hour of day:
df['hour'] = pd.to_datetime(df['Timestamp'], format='%H:%M:%S').dt.hour
where the output would be 9 for 9:32:00, which isn't what I want.
Thanks!
Please try dt.strftime with a format string:
df['hour'] = pd.to_datetime(df['Timestamp']).dt.strftime("%H"+":00")
Following your comments below, let's use df.assign and extract the hour and date separately:
df = df.assign(hour=pd.to_datetime(df['Timestamp']).dt.strftime("%H" + ":00"),
               Day=pd.to_datetime(df['Timestamp']).dt.date)
You could convert time to string and then just select substrings by index.
df = pd.DataFrame({'Timestamp': ['2019-05-31 21:11:43', '2018-11-21 18:01:00',
                                 '2017-11-21 22:01:04', '2020-04-15 11:01:00',
                                 '2017-04-20 04:00:33']})
df['Day'], df['Hour of Day'] = zip(*df.Timestamp.apply(lambda x: [str(x)[:10], str(x)[11:13] + ':00']))

correct way to sum values of second column for all unique values of first column pandas dataframe

I am new to pandas. I have a dataframe that has the days of the week in the first column and a list of values in the second. I wish to sum up the total value for each weekday, so:
day values
0 Thursday 3
1 Thursday 0
2 Friday 0
2 Friday 1
4 Saturday 3
5 Saturday 1
etc...
would become :
day values
0 Thursday 3
1 Friday 1
2 Saturday 4
etc...
Using the approach from summing the number of occurrences per day pandas, I achieved what I wanted (where the original df is called value_frame):
values_on_day = pd.DataFrame(value_frame.groupby(value_frame.day).apply(lambda subf: subf['values'].sum()))
However, the values and the weekdays are stuffed into one cell, so that:
print dict(values_on_day)
equals:
{0: day
Friday 3
Monday 4
Saturday 7
Sunday 22
Thursday 26
Tuesday 2
Wednesday 4
Name: 0, dtype: int64}
I have coded a workaround by converting the columns into dicts, then lists, then back into a dict, and finally back into a df, but obviously this is not the way to do it.
Would you please show me the correct way to get the total values for each day of the week in the original dataframe?
I agree with #Primer. This is the right way to code what you want to do.
I have updated my answer to add an index being the weekday number.
import pandas as pd
import time
df = pd.DataFrame({'day': ['Thursday', 'Thursday', 'Friday', 'Friday', 'Saturday', 'Saturday'], 'values': [3,0,0,1,3,1]})
result = df.groupby('day').sum()
# Reseting the index
result.reset_index(inplace=True)
# Creating a new index as the weekday number for each day
result.index = result['day'].apply(lambda x: time.strptime(x, '%A').tm_wday)
# Renaming the index
result.index.names = ['weekday']
# Sorting by index
result.sort_index(inplace=True)
print(result)
Gives:
day values
weekday
3 Thursday 3
4 Friday 1
5 Saturday 4
