I have a pandas DataFrame and I want to reformat AND order the Date Range column.
This is the df.head():
Numeric Index Origin Movement ID Origin Display Name Destination Movement ID Destination Display Name Date Range Mean Travel Time (Seconds) Range - Lower Bound Travel Time (Seconds) Range - Upper Bound Travel Time (Seconds)
0 0 1074 Traffic Zone 02047 28 Traffic Zone 16024 1/4/2016 - 1/4/2016, Every day, Daily Average 2296 1593 3309
1 1 1074 Traffic Zone 02047 29 Traffic Zone 16025 1/4/2016 - 1/4/2016, Every day, Daily Average 2378 1662 3402
2 2 1074 Traffic Zone 02047 35 Traffic Zone 14080 1/4/2016 - 1/4/2016, Every day, Daily Average 1846 1703 2000
3 3 1074 Traffic Zone 02047 43 Traffic Zone 14072 1/4/2016 - 1/4/2016, Every day, Daily Average 1797 1647 1959
4 4 1074 Traffic Zone 02047 48 Traffic Zone 16027 1/4/2016 - 1/4/2016, Every day, Daily Average 2301 1670 3168
My df['Date Range'] strings are dates from January 2nd 2016 to March 31st 2020 and they are in the following format:
1 1/4/2016 - 1/4/2016, Every day, Daily Average
2 1/4/2016 - 1/4/2016, Every day, Daily Average
3 1/4/2016 - 1/4/2016, Every day, Daily Average
4 1/4/2016 - 1/4/2016, Every day, Daily Average
...
542 1/2/2016 - 1/2/2016, Every day, Daily Average
543 1/2/2016 - 1/2/2016, Every day, Daily Average
544 1/2/2016 - 1/2/2016, Every day, Daily Average
545 1/2/2016 - 1/2/2016, Every day, Daily Average
546 1/2/2016 - 1/2/2016, Every day, Daily Average
How do I transform "1/2/2016 - 1/2/2016, Every day, Daily Average" into "2016-01-02" for every date and order them by date?
Note: The string contains two dates, which are identical in every row; that's why I want to collapse them into a single date.
You can split by the first whitespace, select the first value, and convert to datetime with the format parameter of to_datetime; lastly, if necessary, use DataFrame.sort_values. (Note the format is month-first, `%m/%d/%Y`, since `1/2/2016` should become `2016-01-02`.)
df['Date Range'] = pd.to_datetime(df['Date Range'].str.split().str[0], format='%m/%d/%Y')
df = df.sort_values('Date Range')
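A quick self-contained check of that conversion on two of the sample strings from the question:

```python
import pandas as pd

# Two of the sample strings from the question
df = pd.DataFrame({"Date Range": [
    "1/4/2016 - 1/4/2016, Every day, Daily Average",
    "1/2/2016 - 1/2/2016, Every day, Daily Average",
]})

# Keep the text before the first space and parse it as month/day/year
df["Date Range"] = pd.to_datetime(df["Date Range"].str.split().str[0], format="%m/%d/%Y")
df = df.sort_values("Date Range")
```

After sorting, January 2nd comes before January 4th, as expected.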
I have a timeseries dataframe where there are alerts for some particular rows. The dataframe looks like-
machineID  time              vibration  alerts
1          2023-02-15 11:45  220        1
1          2023-02-15 12:00  221        0
1          2023-02-15 12:15  219        0
1          2023-02-15 12:30  220        1
1          2023-02-16 11:45  220        1
1          2023-02-16 12:00  221        1
1          2023-02-16 12:15  219        0
1          2023-02-16 12:30  220        1
I want to calculate the difference of the alerts column between days. But since the time column is at 15-minute intervals, I don't know how to group over a whole day, i.e., sum the alerts for each day and compare that with the sum of all alerts of the previous day.
In short, I need a way to sum all alerts for each day and subtract the previous day's total. The result should be another dataframe with a date column and a difference-of-alerts column. In this case, the new dataframe will be:
time diff_alerts
2023-02-16 1
since there is a difference of 1 alert on the next day, i.e. 2023-02-16
Group by day with a custom pd.Grouper then sum alerts and finally compute the diff with the previous day:
>>> (df.groupby(pd.Grouper(key='time', freq='D'))['alerts'].sum().diff()
.dropna().rename('diff_alerts').astype(int).reset_index())
time diff_alerts
0 2023-02-16 1
Note: the second line of code is just here to have a clean output.
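Run end-to-end on the eight rows from the question, the chain produces exactly the single-row frame shown above:

```python
import pandas as pd

# The example readings: four 15-minute rows per day over two days
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2023-02-15 11:45", "2023-02-15 12:00", "2023-02-15 12:15", "2023-02-15 12:30",
        "2023-02-16 11:45", "2023-02-16 12:00", "2023-02-16 12:15", "2023-02-16 12:30",
    ]),
    "alerts": [1, 0, 0, 1, 1, 1, 0, 1],
})

# Daily totals are 2 then 3; diff gives NaN then 1; dropna removes the first day
out = (df.groupby(pd.Grouper(key="time", freq="D"))["alerts"].sum().diff()
         .dropna().rename("diff_alerts").astype(int).reset_index())
```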
Here is my example data:
team  sales  month
a     100    1/1/2023
a     200    2/1/2023
b     600    1/1/2023
b     300    2/1/2023
loaded into pandas like so:
mydata = pd.DataFrame([
    ['team', 'sales', 'month'],
    ['a', 100, '1/1/2023'],
    ['a', 200, '2/1/2023'],
    ['b', 600, '1/1/2023'],
    ['b', 300, '2/1/2023'],
])
mydata.columns = mydata.iloc[0]
mydata = mydata[1:]
mydata['month'] = pd.to_datetime(mydata['month'])
My desired outcome for team "a" is this data aggregated by each week as starting on Monday, like this:
team  sales  Monday Week
a     22.58  1/2/2023
a     22.58  1/9/2023
a     22.58  1/16/2023
a     22.58  1/23/2023
a     42.17  1/30/2023
a     50     2/6/2023
a     50     2/13/2023
a     50     2/20/2023
a     14.29  2/27/2023
So the logic on the calculated sales per week is:
$100 of sales in January, so average sales per day is 100/31 = 3.23, * 7 days in a week = $22.58 for each week in January.
February is $200 over 28 days, so ($200/28)*7 = $50 a week in Feb.
The calculation on the week starting 1/30/2023 is a little more complicated. I need to carry the January rate the first 2 days of 1/30 and 1/31, then start summing the Feb rate for the following 5 days in Feb (until 2/5/2023). So it would be 5*(200/28)+2*(100/31) = 42.17
Is there a way to do this in Pandas? I believe the logic that may work is taking each monthly total, decomposing that into daily data with an average rate, then using pandas to aggregate back up to weekly data starting on Monday for each month, but I'm lost trying to chain together the date functions.
I think you have a miscalculation for team A for the week of 1/30/2023. If it had no sales in Feb, its sales for that week would be 3.23 * 2 = 6.46.
Here's one way to do that:
def get_weekly_sales(group: pd.DataFrame) -> pd.DataFrame:
    tmp = (
        # Put month in the index and convert it to a monthly period
        group.set_index("month")[["sales"]]
        .to_period("M")
        # Calculate the average daily sales
        .assign(sales=lambda x: x["sales"] / x.index.days_in_month)
        # Up-sample the dataframe to daily
        .resample("1D")
        .ffill()
        # Sum by week
        .groupby(pd.Grouper(freq="W"))
        .sum()
    )
    # Clean up the index
    tmp.index = tmp.index.to_timestamp().rename("week_starting")
    return tmp

df.groupby("team").apply(get_weekly_sales)
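To sanity-check the decompose-then-resum idea independently, here is a minimal sketch for team a only: spread the two monthly totals over their days, then sum into Monday-start weeks (`W-MON` with `closed='left'`/`label='left'` is one way to get weeks that begin on, and are labelled by, their Monday):

```python
import pandas as pd

# Team a: $100 over January (31 days), $200 over February (28 days)
days = pd.date_range("2023-01-01", "2023-02-28", freq="D")
daily = pd.Series(0.0, index=days)
daily[days.month == 1] = 100 / 31
daily[days.month == 2] = 200 / 28

# Weeks starting Monday, labelled by their Monday
weekly = daily.resample("W-MON", closed="left", label="left").sum()
```

The week of 1/30/2023 comes out to 2 * (100/31) + 5 * (200/28) ≈ 42.17, matching the figure derived in the question.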
Assume a dataframe as follows. I'm looking to add a column to the df dataframe that takes the current row's price and subtracts the close price at the last index 5 minutes prior to the current hour/minute. I've attempted to reference a minute_df, read the current hour/minute, and pull the close price from the minute_df, but have not got a working solution. The df index is datetime64.
For example, at 06:27:12 it should take this row's price minus the close price at the last index from 06:22, as this is 5 minutes prior to 06:27. For each index within the minute 06:27 it should reference this close price for the calculation, until the minute turns to 06:28, when it should subtract the last index at 06:23.
df
TimeStamp Price Q hour min
2022-10-05 05:30:11.344618-05:00 8636 1 5 30
2022-10-05 05:30:12.647597-05:00 8637 1 5 30
2022-10-05 05:30:20.080559-05:00 8637 1 5 30
2022-10-05 05:30:21.267389-05:00 8637 2 5 30
2022-10-05 05:30:21.267952-05:00 8636 1 5 30
minute_df
TimeStamp open high low close
2022-10-05 05:30:00-05:00 8636 8645 8635 8645
2022-10-05 05:31:00-05:00 8645 8647 8637 8638
2022-10-05 05:32:00-05:00 8639 8650 8639 8649
2022-10-05 05:33:00-05:00 8648 8652 8648 8649
Expected output is a column within the df dataframe containing value of the current price - closing price, or the price at the last index 5 minutes prior to current minute. NaN values up until there is sufficient rows to lookback this many periods.
df['price_change']
Not sure if I understand correctly, but here's my attempt.
If TimeStamp is a column
# Remove the seconds and microseconds
floor_ts = df.TimeStamp.dt.floor("min")
# Get last 5 minute timestamp
last_index_5_ts = floor_ts - pd.Timedelta(5, unit="min")
# Create dict from minute_df TimeStamp to close price
ts_to_close_dict = dict(zip(minute_df.TimeStamp, minute_df.close))
close_price_v = last_index_5_ts.map(ts_to_close_dict)
df["price_change"] = df.Price - close_price_v
df
Same code but if TimeStamp is an index
floor_ts = df.index.floor("min")
last_index_5_ts = floor_ts - pd.Timedelta(5, unit="min")
ts_to_close_dict = dict(zip(minute_df.index, minute_df.close))
close_price_v = last_index_5_ts.map(ts_to_close_dict)
df["price_change"] = df.Price - close_price_v
df
A few notes:
I'm not sure what you mean about handling NaN values, but if you need to forward-fill or backward-fill them you can use Series.fillna (or DataFrame.fillna)
Some of the pandas functions used above (like floor) might be missing in older pandas versions
EDIT:
I didn't notice the df already has hour and minute columns. You could use them for calculating floor_ts (though I'm not sure whether that's easier/faster)
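As a self-contained check, a toy version of the column case (the timestamps and prices here are made up):

```python
import pandas as pd

# Made-up ticks and one-minute bars
df = pd.DataFrame({
    "TimeStamp": pd.to_datetime(["2022-10-05 05:35:10", "2022-10-05 05:36:02"]),
    "Price": [8650, 8641],
})
minute_df = pd.DataFrame({
    "TimeStamp": pd.to_datetime(["2022-10-05 05:30:00", "2022-10-05 05:31:00"]),
    "close": [8645, 8638],
})

floor_ts = df.TimeStamp.dt.floor("min")                   # 05:35, 05:36
last_index_5_ts = floor_ts - pd.Timedelta(5, unit="min")  # 05:30, 05:31
ts_to_close = dict(zip(minute_df.TimeStamp, minute_df.close))
df["price_change"] = df.Price - last_index_5_ts.map(ts_to_close)
```

The first tick maps back to the 05:30 close (8645), the second to the 05:31 close (8638).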
I have a DataFrame with a multi-index consisting of (phase, service_group, station, year, period) whose purpose is to return "capacity_required" when all 5 values of the multi-index are specified.
For example, in phase Final, service_group West, station Milton, year 2025, period Peak Hour 1, the capacity_required is 1500.
Currently there are 7 possible periods, two of which are "Off-Peak Hour" and "Shoulder Hour".
I need to add a new period to every instance of the multi-index, called Off-Peak Shoulder, where the new value is defined as the average of Off-Peak Hour and Shoulder Hour.
So far I have the following code:
import pandas as pd
import os
directory = '/Users/mark/PycharmProjects/psrpcl_data'
capacity_required_file = 'Capacity_Requirements.csv'
capacity_required_path = os.path.join(directory, capacity_required_file)
df_capacity_required = pd.read_csv(capacity_required_path, sep=',',
usecols=['phase', 'service_group', 'station', 'year', 'period', 'capacity_required'])
df_capacity_required.set_index(['phase', 'service_group', 'station', 'year'], inplace=True)
df_capacity_required.sort_index(inplace=True)
print(df_capacity_required.head(14))
And the output from the above code is:
period capacity_required
phase service_group station year
Early Barrie Allandale Waterfront Station 2025 AM Peak Period 490
2025 Off-Peak Hour 100
2025 PM Peak Period 520
2025 Peak Hour 2 250
2025 Peak Hour 5 180
2025 Peak Hour 6 180
2025 Shoulder Hour 250
2026 AM Peak Period 520
2026 Off-Peak Hour 50
2026 PM Peak Period 520
2026 Peak Hour 2 260
2026 Peak Hour 5 180
2026 Peak Hour 6 180
2026 Shoulder Hour 250
The above is just the first 14 lines of about 30K lines. This shows you two years' worth of periods. Notice there are 7 periods per year.
I am trying to create a new "period" called "Off-Peak Shoulder" to be added to every single (phase, service_group, station, year) combination which is to be the average of Off-Peak and Shoulder.
The following line correctly calculates the one Off-Peak Shoulder value per index value:
off_peak_shoulder = df_capacity_required.loc[df_capacity_required.period == 'Off-Peak Hour', 'capacity_required'].add(
    df_capacity_required.loc[df_capacity_required.period == 'Shoulder Hour', 'capacity_required']).div(2)
print(off_peak_shoulder)
The above code provides the following (correct) Off-Peak Shoulder series as output:
phase service_group station year
Early Barrie Allandale Waterfront Station 2025 0.0
2026 0.0
2027 0.0
2028 0.0
2029 0.0
...
Initial Union Pearson Express Pearson Station 2023 160.0
2024 160.0
Weston Station 2022 80.0
2023 105.0
2024 105.0
Question: How do I merge/join the off_peak_shoulder series into df_capacity_required to get Off-Peak Shoulder to be one more entry under "period", as shown below?
period capacity_required
phase service_group station year
Early Barrie Allandale Waterfront Station 2025 AM Peak Period 490
2025 Off-Peak Hour 100
2025 PM Peak Period 520
2025 Peak Hour 2 250
2025 Peak Hour 5 180
2025 Peak Hour 6 180
2025 Shoulder Hour 250
2025 Off-Peak Shoulder 175
2026 AM Peak Period 520
2026 Off-Peak Hour 50
2026 PM Peak Period 520
2026 Peak Hour 2 260
2026 Peak Hour 5 180
2026 Peak Hour 6 180
2026 Shoulder Hour 250
2026 Off-Peak Shoulder 150
I slept on the problem and woke up with a solution. I already have the list of values I need, with the correct multi-index set for each value. I was thinking I needed some complex multi-index insertion code, but actually I just needed to put the created DataFrame in the same form as the original DataFrame, and concat the two together.
Here is the code I added. Note the first line is the same as the original code, except I added a call to reset_index.
df_new = df_capacity_required.loc[df_capacity_required.period == 'Off-Peak Hour', 'capacity_required'].add(
    df_capacity_required.loc[df_capacity_required.period == 'Shoulder Hour', 'capacity_required']).div(2).reset_index()
df_new['period'] = 'Off-Peak Shoulder'
df_new.set_index(['phase', 'service_group', 'station', 'year'], inplace=True)
df_capacity_required = pd.concat([df_capacity_required, df_new])
df_capacity_required.sort_index(inplace=True)
print_full(df_capacity_required.head(16))
And that print statement gives the following desired output:
period capacity_required
phase service_group station year
Early Barrie Allandale Waterfront Station 2025 AM Peak Period 490
2025 Off-Peak Hour 100
2025 PM Peak Period 520
2025 Peak Hour 2 250
2025 Peak Hour 5 180
2025 Peak Hour 6 180
2025 Shoulder Hour 250
2025 Off-Peak Shoulder 175
2026 AM Peak Period 520
2026 Off-Peak Hour 50
2026 PM Peak Period 520
2026 Peak Hour 2 260
2026 Peak Hour 5 180
2026 Peak Hour 6 180
2026 Shoulder Hour 250
2026 Off-Peak Shoulder 150
But thanks to everyone who read the question. It is very nice knowing there are people out there on Stack Overflow willing to help when someone gets stuck.
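For readers who land here, the same compute-relabel-concat pattern on a toy single-level index (names simplified from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2025, 2025, 2026, 2026],
    "period": ["Off-Peak Hour", "Shoulder Hour", "Off-Peak Hour", "Shoulder Hour"],
    "capacity_required": [100, 250, 50, 250],
}).set_index("year")

# Average the two periods; index alignment on 'year' does the matching
new = (df.loc[df.period == "Off-Peak Hour", "capacity_required"]
         .add(df.loc[df.period == "Shoulder Hour", "capacity_required"])
         .div(2).reset_index())
new["period"] = "Off-Peak Shoulder"
df = pd.concat([df, new.set_index("year")]).sort_index()
```

Each year gets one extra "Off-Peak Shoulder" row: (100+250)/2 = 175 for 2025 and (50+250)/2 = 150 for 2026.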
I have a df with DateTimeIndex (hourly readings) and light intensity.
Time Light
1/2/2017 18:00 31
1/2/2017 19:00 -5
1/2/2017 20:00 NA
......
......
2/2/2017 05:00 NA
2/2/2017 06:00 20
The issue is that after sunset (6 pm) until sunrise (6 am), the sensor doesn't work and has bad readings. I would like to set any readings in this period to 0.
You can create a mask with these conditions and set the value based on it.
hours = (df.index.to_series().dt.hour) # convert DateTimeIndex to hours
mask = (hours > 6) & (hours < 18)
df.loc[~mask, 'Light'] = 0
You need to go through to_series() to use the .dt accessor; alternatively, a DatetimeIndex exposes the datetime attributes directly, so df.index.hour works as well.
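A runnable sketch with a few made-up hourly readings (note that with this mask the 06:00 row is also zeroed; use `hours >= 6` if the 6 am reading should be kept):

```python
import pandas as pd

# Made-up hourly readings around the night window
idx = pd.to_datetime(["2017-02-01 18:00", "2017-02-01 19:00",
                      "2017-02-02 06:00", "2017-02-02 07:00"])
df = pd.DataFrame({"Light": [31, -5, 20, 25]}, index=idx)

hours = df.index.hour                 # works directly on a DatetimeIndex
mask = (hours > 6) & (hours < 18)     # daytime readings to keep
df.loc[~mask, "Light"] = 0
```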