Dividing time intervals with multiple indexes into hourly buckets in Python

Here is the code for the sample data set I have:
import pandas as pd

data = {'ID': [4, 4, 4, 4, 22, 22, 23, 25, 29],
        'Zone': [32, 34, 21, 34, 27, 29, 32, 75, 9],
        'checkin_datetime': ['04-01-2019 13:07', '04-01-2019 13:09', '04-01-2019 14:06', '04-01-2019 14:55',
                             '04-01-2019 20:23', '04-01-2019 21:38', '04-01-2019 21:38', '04-01-2019 23:22',
                             '04-02-2019 01:00'],
        'checkout_datetime': ['04-01-2019 13:09', '04-01-2019 13:12', '04-01-2019 14:07', '04-01-2019 15:06',
                              '04-01-2019 21:32', '04-01-2019 21:42', '04-01-2019 21:45', '04-02-2019 00:23',
                              '04-02-2019 06:15']}
df = pd.DataFrame(data, columns=['ID', 'Zone', 'checkin_datetime', 'checkout_datetime'])
df['checkin_datetime'] = pd.to_datetime(df['checkin_datetime'])
df['checkout_datetime'] = pd.to_datetime(df['checkout_datetime'])
Using this data set, I am trying to create the following data set:
Checked_in_hour ID Zone checked_in_minutes
01-04-2019 13:00 4 32 2
01-04-2019 13:00 4 34 3
01-04-2019 14:00 4 21 1
01-04-2019 14:00 4 34 5
01-04-2019 15:00 4 34 6
01-04-2019 20:00 22 27 37
01-04-2019 20:00 22 27 8
01-04-2019 20:00 22 27 37
01-04-2019 21:00 22 29 4
01-04-2019 21:00 23 32 7
01-04-2019 23:00 25 75 38
02-04-2019 00:00 25 75 24
02-04-2019 01:00 29 9 60
02-04-2019 02:00 29 9 60
02-04-2019 03:00 29 9 60
02-04-2019 04:00 29 9 60
02-04-2019 05:00 29 9 60
02-04-2019 06:00 29 9 16
Checked_in_hour is derived by splitting the span between checkin_datetime and checkout_datetime into hourly buckets, with the minutes then grouped by hour and Zone.
This is the code I have so far; it calculates this at the Checked_in_hour level only, and I still need to add in the Zone variable:
# working logic (pd.date_range replaces the removed DatetimeIndex(start=..., end=...) constructor)
df2 = pd.DataFrame(
    index=pd.date_range(
        start=df['checkin_datetime'].min(),
        end=df['checkout_datetime'].max(), freq='1min'),
    columns=['is_checked_in', 'ID'], data=0)
for index, row in df.iterrows():
    df2.loc[row['checkin_datetime']:row['checkout_datetime'], 'is_checked_in'] = 1
    df2.loc[row['checkin_datetime']:row['checkout_datetime'], 'ID'] = row['ID']
df3 = df2.resample('1H').aggregate({'is_checked_in': 'sum', 'ID': 'max'})

Not sure if this is efficient, but should work.
import pandas as pd
from datetime import timedelta

def group_into_hourly_buckets(df):
    df['duration'] = df['checkout_datetime'] - df['checkin_datetime']
    grouped_data = []
    for idx, row in df.iterrows():
        # total_seconds() is safer than .seconds, which wraps for stays over 24 hours
        dur = int(row['duration'].total_seconds()) // 60
        start_time = row['checkin_datetime']
        hours_ = 0
        while dur > 0:
            _data = {}
            _data['Checked_in_hour'] = start_time.floor('H') + timedelta(hours=hours_)
            time_spent_in_window = min(dur, 60)
            if hours_ == 0:
                # the first bucket may be a partial hour
                time_spent_in_window = min(time_spent_in_window,
                                           ((start_time.ceil('H') - start_time).seconds) // 60)
            _data['checked_in_minutes'] = time_spent_in_window
            _data['ID'] = row['ID']
            _data['Zone'] = row['Zone']
            dur -= time_spent_in_window
            hours_ += 1
            grouped_data.append(_data)
    return pd.DataFrame(grouped_data)
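For cross-checking the bucketing logic, here is a minimal alternative sketch (not the answer's method): it enumerates each minute of a single stay with a minute-level date_range and counts minutes per hour. The helper name `minutes_per_hour` is made up for illustration, and the `inclusive` keyword assumes pandas 1.4 or newer.

```python
import pandas as pd

def minutes_per_hour(checkin, checkout):
    # Enumerate each minute of the stay (left-closed, so the checkout minute
    # itself is excluded), then count how many minutes land in each hour bucket.
    minutes = pd.date_range(checkin, checkout, freq="min", inclusive="left")
    return minutes.floor("h").value_counts().sort_index()

# Row (ID=4, Zone=34): 14:55 -> 15:06 should split into 5 + 6 minutes.
buckets = minutes_per_hour(pd.Timestamp("2019-04-01 14:55"),
                           pd.Timestamp("2019-04-01 15:06"))
```

This trades memory (one row per minute) for simplicity, so it suits sanity checks more than large inputs.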

Related

Use Pandas to update year of a subset of dates in a column

I have a pandas dataframe that consists of a date column spanning one year, with daily data alongside it. I want to update the year of just the rows that pertain to January. I can select the January subset of my dataframe just fine and try to change the year of that subset based on the answer given here, but when I try to update the values of that subset by adding an offset, I get an error.
Setup:
import pandas as pd
df = pd.DataFrame({'Date': pd.date_range(start = "01-01-2023", end = "12-31-2023"), 'data': 25})
Select January subset:
df[df['Date'].dt.month == 1]
This works as expected:
Date data
0 2023-01-01 25
1 2023-01-02 25
2 2023-01-03 25
3 2023-01-04 25
4 2023-01-05 25
5 2023-01-06 25
6 2023-01-07 25
7 2023-01-08 25
8 2023-01-09 25
9 2023-01-10 25
10 2023-01-11 25
11 2023-01-12 25
12 2023-01-13 25
13 2023-01-14 25
14 2023-01-15 25
15 2023-01-16 25
16 2023-01-17 25
17 2023-01-18 25
18 2023-01-19 25
19 2023-01-20 25
20 2023-01-21 25
21 2023-01-22 25
22 2023-01-23 25
23 2023-01-24 25
24 2023-01-25 25
25 2023-01-26 25
26 2023-01-27 25
27 2023-01-28 25
28 2023-01-29 25
29 2023-01-30 25
30 2023-01-31 25
Attempt to change:
df[df['Date'].dt.month == 1] = df[df['Date'].dt.month == 1] + pd.offsets.DateOffset(years=1)
TypeError: Concatenation operation is not implemented for NumPy arrays, use np.concatenate() instead. Please do not rely on this error; it may not be given on all Python implementations.
I've tried a few different variations of this but seem to be having issues changing the subset dataframe data.
You have to select the Date column (solution enhanced by @mozway, thanks):
df.loc[df['Date'].dt.month == 1, 'Date'] += pd.offsets.DateOffset(years=1)
print(df)
# Output
Date data
0 2024-01-01 25
1 2024-01-02 25
2 2024-01-03 25
3 2024-01-04 25
4 2024-01-05 25
.. ... ...
360 2023-12-27 25
361 2023-12-28 25
362 2023-12-29 25
363 2023-12-30 25
364 2023-12-31 25
[365 rows x 2 columns]
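For comparison, the same fix can be sketched with Series.mask instead of .loc assignment; this is an equivalent alternative, not the answer's method, using the question's setup:

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.date_range("2023-01-01", "2023-12-31"), "data": 25})
jan = df["Date"].dt.month == 1
# Replace only the January dates with a copy shifted one year forward;
# all other rows keep their original date.
df["Date"] = df["Date"].mask(jan, df["Date"] + pd.DateOffset(years=1))
```

Both forms avoid the original error because the offset is added to the datetime column alone, never to the whole (mixed-dtype) frame.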

Uneven grid when plotting date in plt

I want to plot a variable by date (day and month). The grid becomes uneven where the month changes. How can I force a uniform grid size in this case?
Data is loaded via Pandas, as DataFrame.
ga =
Reference Organic_search Direct Date
0 0 0 0 2021-11-22
1 0 0 0 2021-11-23
2 0 0 0 2021-11-24
3 0 0 0 2021-11-25
4 0 0 0 2021-11-26
5 0 0 0 2021-11-27
6 0 0 0 2021-11-28
7 42 19 35 2021-11-29
8 69 33 48 2021-11-30
9 107 32 35 2021-12-01
10 62 30 26 2021-12-02
11 20 26 30 2021-12-03
12 22 22 20 2021-12-04
13 40 41 20 2021-12-05
14 14 39 26 2021-12-06
15 18 25 34 2021-12-07
16 8 21 13 2021-12-08
17 11 21 17 2021-12-09
18 23 27 20 2021-12-10
19 46 26 17 2021-12-11
20 29 42 20 2021-12-12
21 122 37 19 2021-12-13
22 97 25 29 2021-12-14
23 288 51 39 2021-12-15
24 96 29 26 2021-12-16
25 51 25 36 2021-12-17
26 23 16 21 2021-12-18
27 47 32 10 2021-12-19
code:
fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(ga.date, ga.reference)
ax.set(xlabel='Data',
       ylabel='Ruch na stronie')
date_form = DateFormatter('%d/%m')
ax.xaxis.set_major_formatter(date_form)
Looking at the added data, I realized why the interval was not constant: the number of days in each month differs.
So I converted the date data into plain strings, which forces the grid spacing to be equal.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

df = pd.read_excel('test.xlsx', index_col=0)
fig, ax = plt.subplots(figsize=(15, 5))
# plot against strings so every tick is evenly spaced
ax.plot(df['Date'].dt.strftime('%d/%m'), df['Reference'])
ax.set(xlabel='Data',
       ylabel='Ruch na stronie')
ax.grid(True)
# set x-axis tick interval
interval = 3
ax.xaxis.set_major_locator(ticker.MultipleLocator(interval))
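Alternatively, the axis can stay a true datetime axis while the tick spacing is fixed with a day locator. A minimal sketch with synthetic stand-in data (the spreadsheet columns from the question are not reproduced here):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

dates = pd.date_range("2021-11-22", "2021-12-19")  # synthetic stand-in data
values = range(len(dates))

fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(dates, values)
ax.xaxis.set_major_locator(mdates.DayLocator(interval=3))  # a tick every 3 days
ax.xaxis.set_major_formatter(mdates.DateFormatter("%d/%m"))
ax.grid(True)
```

This keeps datetime-aware panning and zooming, whereas the string-axis approach loses them.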

Interpolate data and then merge 2 dataframes

I am starting off with Python and using Pandas.
I have 2 CSVs, i.e.
CSV1
Date Col1 Col2
2021-01-01 20 15
2021-01-02 22 12
2021-01-03 30 18
.
.
2021-12-31 125 160
so on and so forth...
CSV2
Start_Date End_Date Sunday Monday Tuesday Wednesday Thursday Friday Saturday
2021-01-01 2021-02-25 15 25 35 45 30 40 55
2021-02-26 2021-05-31 25 30 44 35 50 45 66
.
.
2021-09-01 2021-0-25 44 25 65 54 24 67 38
Desired result
Date Col1 Col2 New_Col3 New_Col4
2021-01-01 20 15 Fri 40
2021-01-02 22 12 Sat 55
2021-01-03 30 18 Sun 15
.
.
2021-12-31 125 160 Fri 67
New_Col3 is the weekday abbreviation of Date
New_Col4 is the cell in CSV2 where the Date falls between Start_Date and End_Date row-wise, and from the corresponding weekday column-wise.
# Convert date columns to datetime
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Start_Date'] = pd.to_datetime(df2['Start_Date'])
df2['End_Date'] = pd.to_datetime(df2['End_Date'])

# Get abbreviated weekday name
df1['New_Col3'] = df1['Date'].apply(lambda x: x.strftime('%a'))

New_Col4 = []
# Iterate over df1: if df1['Date'] falls between df2['Start_Date'] and
# df2['End_Date'], get the value from the matching weekday column
for i in range(len(df1)):
    for j in range(len(df2)):
        if df2.loc[j, 'Start_Date'] <= df1.loc[i, 'Date'] <= df2.loc[j, 'End_Date']:
            day_name = df1.loc[i, 'Date'].strftime('%A')
            New_Col4.append(df2.loc[j, day_name])

# Assign the result to a new column
df1['New_Col4'] = New_Col4
# print(df1)
Date Col1 Col2 New_Col3 New_Col4
0 2021-01-01 20 15 Fri 40
1 2021-01-02 22 12 Sat 55
2 2021-01-03 30 18 Sun 15
3 2021-03-03 40 18 Wed 35
Keys
Construct datetime and interval indexes to enable pd.IntervalIndex.get_indexer(pd.DatetimeIndex) for efficient row-matching. (reference post)
Apply a value retrieval function from df2 on each row of df1 for New_Col4.
With this approach, an explicit double for-loop search can be avoided in row-matching. However, a slow .apply() is still required. Maybe there is a fancy way to combine these two steps, but I will stop here for the time being.
Data
Typo in the last End_Date is changed.
import pandas as pd
import io
df1 = pd.read_csv(io.StringIO("""
Date Col1 Col2
2021-01-01 20 15
2021-01-02 22 12
2021-01-03 30 18
2021-12-31 125 160
"""), sep=r"\s+", engine='python')
df2 = pd.read_csv(io.StringIO("""
Start_Date End_Date Sunday Monday Tuesday Wednesday Thursday Friday Saturday
2021-01-01 2021-02-25 15 25 35 45 30 40 55
2021-02-26 2021-05-31 25 30 44 35 50 45 66
2021-09-01 2022-01-25 44 25 65 54 24 67 38
"""), sep=r"\s+", engine='python')
df1["Date"] = pd.to_datetime(df1["Date"])
df2["Start_Date"] = pd.to_datetime(df2["Start_Date"])
df2["End_Date"] = pd.to_datetime(df2["End_Date"])
Solution
# 1. Get weekday name
df1["day_name"] = df1["Date"].dt.day_name()
df1["New_Col3"] = df1["day_name"].str[:3]

# 2-1. Find the corresponding row in df2
df1.set_index("Date", inplace=True)
idx = pd.IntervalIndex.from_arrays(df2["Start_Date"], df2["End_Date"], closed="both")
df1["df2_row"] = idx.get_indexer(df1.index)

# 2-2. Pick out the value from df2
def f(row):
    """Get (#row, day_name) in df2."""
    return df2[row["day_name"]].iloc[row["df2_row"]]

df1["New_Col4"] = df1.apply(f, axis=1)
Result
print(df1.drop(columns=["day_name", "df2_row"]))
Out[319]:
Col1 Col2 New_Col3 New_Col4
Date
2021-01-01 20 15 Fri 40
2021-01-02 22 12 Sat 55
2021-01-03 30 18 Sun 15
2021-12-31 125 160 Fri 67
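Following up on the closing remark in "Keys": the per-row .apply can likely be replaced by NumPy fancy indexing, combining both steps into one vectorized lookup. A hedged sketch on a trimmed-down copy of the data (only the Friday/Saturday columns are kept here):

```python
import pandas as pd

df1 = pd.DataFrame({"Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-12-31"])})
df2 = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2021-01-01", "2021-02-26", "2021-09-01"]),
    "End_Date":   pd.to_datetime(["2021-02-25", "2021-05-31", "2022-01-25"]),
    "Friday":   [40, 45, 67],
    "Saturday": [55, 66, 38],
})

idx = pd.IntervalIndex.from_arrays(df2["Start_Date"], df2["End_Date"], closed="both")
rows = idx.get_indexer(df1["Date"])                        # row position in df2 per date
cols = df2.columns.get_indexer(df1["Date"].dt.day_name())  # weekday column per date
df1["New_Col4"] = df2.to_numpy()[rows, cols]               # one fancy-indexed lookup, no .apply
```

This assumes the date intervals do not overlap (otherwise get_indexer raises) and that every date falls inside some interval (unmatched dates get position -1, which would silently pick the last row).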

Sorting Columns By Ascending Order

Given this example dataframe,
Date 01012019 01022019 02012019 02022019 03012019 03022019
Period
1 45 21 43 23 32 23
2 42 12 43 11 14 65
3 11 43 24 23 21 12
I would like to sort the dates based on the month (the dates are in ddmmyyyy format). However, the date is a string when I check type(date). I tried to use pd.to_datetime, but it failed with the error month must be in 1..12.
Any advice? Thank you!
Specify format of datetimes in to_datetime and then sort_index:
df.columns = pd.to_datetime(df.columns, format='%d%m%Y')
df = df.sort_index(axis=1)
print (df)
2019-01-01 2019-01-02 2019-01-03 2019-02-01 2019-02-02 2019-02-03
Date
1 45 43 32 21 23 23
2 42 43 14 12 11 65
3 11 24 21 43 23 12
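A minimal, self-contained run of this approach (with a few made-up values and only three columns) for quick verification:

```python
import pandas as pd

df = pd.DataFrame([[45, 21, 43], [42, 12, 43]],
                  columns=["01012019", "03022019", "02012019"])
# Parse the ddmmyyyy column labels, then sort the columns chronologically.
df.columns = pd.to_datetime(df.columns, format="%d%m%Y")
df = df.sort_index(axis=1)
```

The explicit format string is the key step: without it, to_datetime guesses month-first and rejects labels like 03022019 once the "month" exceeds 12.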

Python: how to get the week number in the year for a given day, with Saturday as the first day of the week

I am using python/pandas and want to know how to get the week number in the year for a given day, with Saturday as the first day of the week.
I searched a lot, but every approach I found takes either Monday or Sunday as the first day of the week...
Please help... thanks.
Thanks all! I really appreciate all your quick answers, but I have to apologize for not making my question clear.
I want to know the week number in the year. For example, 2015-08-09 is week 32 with Monday as the first day of the week, but week 33 with Saturday as the first day of the week.
Thanks @Cyphase and everyone. I changed Cyphase's code a bit and it works:
def week_number(start_week_on, date_=None):
    assert 1 <= start_week_on <= 7  # Monday=1, Sunday=7
    if not date_:
        date_ = date.today()
    __, normal_current_week, normal_current_day = date_.isocalendar()
    print(date_, normal_current_week, normal_current_day)
    if normal_current_day >= start_week_on:
        week = normal_current_week + 1
    else:
        week = normal_current_week
    return week
If I understand correctly the following does what you want:
In [101]:
import datetime as dt
import pandas as pd

df = pd.DataFrame({'date': pd.date_range(start=dt.datetime(2015, 8, 9), end=dt.datetime(2015, 9, 1))})
df['week'] = df['date'].dt.week.shift(-2).ffill()
df['orig week'] = df['date'].dt.week
df['day of week'] = df['date'].dt.dayofweek
df
Out[101]:
date week orig week day of week
0 2015-08-09 33 32 6
1 2015-08-10 33 33 0
2 2015-08-11 33 33 1
3 2015-08-12 33 33 2
4 2015-08-13 33 33 3
5 2015-08-14 33 33 4
6 2015-08-15 34 33 5
7 2015-08-16 34 33 6
8 2015-08-17 34 34 0
9 2015-08-18 34 34 1
10 2015-08-19 34 34 2
11 2015-08-20 34 34 3
12 2015-08-21 34 34 4
13 2015-08-22 35 34 5
14 2015-08-23 35 34 6
15 2015-08-24 35 35 0
16 2015-08-25 35 35 1
17 2015-08-26 35 35 2
18 2015-08-27 35 35 3
19 2015-08-28 35 35 4
20 2015-08-29 36 35 5
21 2015-08-30 36 35 6
22 2015-08-31 36 36 0
23 2015-09-01 36 36 1
The above uses dt.week, shifts by 2 rows, and then forward-fills the NaN values (note that Series.dt.week is deprecated in newer pandas in favor of dt.isocalendar().week).
import datetime
datetime.date(2015, 8, 9).isocalendar()[1]
You could just do this:
from datetime import date

def week_number(start_week_on, date_=None):
    assert 0 <= start_week_on <= 6
    if not date_:
        date_ = date.today()
    __, normal_current_week, normal_current_day = date_.isocalendar()
    if normal_current_day >= start_week_on:
        week = normal_current_week
    else:
        week = normal_current_week - 1
    return week

print("Week starts We're in")
for start_week_on in range(7):
    this_week = week_number(start_week_on)
    print(" day {0} week {1}".format(start_week_on, this_week))
Output on day 4 (Thursday):
Week starts We're in
day 0 week 33
day 1 week 33
day 2 week 33
day 3 week 33
day 4 week 33
day 5 week 32
day 6 week 32
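Since Series.dt.week is deprecated in recent pandas, the Saturday-start numbering can also be sketched vectorized from dt.isocalendar(): a day whose ISO weekday is Saturday (6) or Sunday (7) rolls into the next Saturday-started week. Note this simple form does not handle the year boundary (a late-December Saturday would need wrapping to week 1).

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2015-08-09", "2015-08-10", "2015-08-15"]))
iso = dates.dt.isocalendar()  # ISO year / week / day (Mon=1 .. Sun=7)
# Saturday (6) and Sunday (7) belong to the *next* Saturday-started week.
week_sat = iso.week.astype(int) + (iso.day >= 6).astype(int)
```

For 2015-08-09 (a Sunday, ISO week 32) this yields week 33, matching the question's expected numbering.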
