I have the following dataframe:
timestamp mes
0 2019-01-01 18:15:55.700 1228
1 2019-01-01 18:35:56.872 1402
2 2019-01-01 18:35:56.872 1560
3 2019-01-01 19:04:25.700 1541
4 2019-01-01 19:54:23.150 8754
5 2019-01-02 18:01:00.025 4124
6 2019-01-02 18:17:56.125 9736
7 2019-01-02 18:58:59.799 1597
8 2019-01-02 20:10:15.896 5285
How can I select only the rows where timestamp is between a start_time and an end_time, for all the days in the dataframe? Basically the same role as .between_time(), but here the timestamp column can't be the index since there are repeated values.
Also, this is actually a chunk from pd.read_csv(), and I would have to do this for several million of them. Would it be faster if I used, for example, numpy datetime functionality? I guess I could create a time column from timestamp and build a mask on that, but I'm afraid this would be too slow.
EDIT:
I added more rows and this is the expected result, say for start_time=datetime.time(18), end_time=datetime.time(19):
timestamp mes
0 2019-01-01 18:15:55.700 1228
1 2019-01-01 18:35:56.872 1402
2 2019-01-01 18:35:56.872 1560
5 2019-01-02 18:01:00.025 4124
6 2019-01-02 18:17:56.125 9736
7 2019-01-02 18:58:59.799 1597
My code (works but is slow):
df['time'] = df.timestamp.apply(lambda x: x.time())
mask = (df.time<end) & (df.time>=start)
selected = df.loc[mask]
If you have your column set to datetime:
start = df["timestamp"] >= "2019-01-01 18:15:55.700"
end = df["timestamp"] <= "2019-01-01 18:15:55.896"
between_two_dates = start & end
df.loc[between_two_dates]
Works for me. Just convert timestamp to datetime and set it as the index:
df = pd.DataFrame({'timestamp': ['2019-01-01 18:15:55.700', '2019-01-01 18:17:55.700', '2019-01-01 18:19:55.896'], 'mes': [1228, 1402, 1560]})  # Data
df['timestamp'] = pd.to_datetime(df['timestamp'])  # Coerce timestamp to datetime
df.set_index('timestamp', inplace=True)  # Set timestamp as index
df.between_time('18:16', '20:15')  # Select between times
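If the timestamp column can't stay as the index (the question's data has repeated values), a hedged faster option is to build a temporary DatetimeIndex and use indexer_between_time, which returns integer positions and so copes with duplicates. A minimal sketch, assuming df['timestamp'] is already datetime64:
import datetime
import pandas as pd

idx = pd.DatetimeIndex(df['timestamp'])
# integer positions whose time-of-day falls in [18:00, 19:00); include_end=False
# mirrors the strict upper bound (df.time < end) used in the question's mask
pos = idx.indexer_between_time(datetime.time(18), datetime.time(19), include_end=False)
selected = df.iloc[pos]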
I created a pandas df with columns named start_date and current_date. Both columns have a dtype of datetime64[ns]. What's the best way to find the quantity of business days between the current_date and start_date columns?
I've tried:
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
projects_df['start_date'] = pd.to_datetime(projects_df['start_date'])
projects_df['current_date'] = pd.to_datetime(projects_df['current_date'])
projects_df['days_count'] = len(pd.date_range(start=projects_df['start_date'], end=projects_df['current_date'], freq=us_bd))
I get the following error message:
Cannot convert input....start_date, dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp
I'm using Python version 3.10.4.
pd.date_range's parameters need to be datetimes, not series.
For this reason, we can use df.apply to apply the function to each row.
In addition, pandas has bdate_range which is just date_range with freq defaulting to business days, which is exactly what you need.
Using apply and a lambda function, we can create a new Series calculating business days between each start and current date for each row.
projects_df['start_date'] = pd.to_datetime(projects_df['start_date'])
projects_df['current_date'] = pd.to_datetime(projects_df['current_date'])
projects_df['days_count'] = projects_df.apply(lambda row: len(pd.bdate_range(row['start_date'], row['current_date'])), axis=1)
Using a random sample of 10 date pairs, my output is the following:
start_date current_date bdays
0 2022-01-03 17:08:04 2022-05-20 00:53:46 100
1 2022-04-18 09:43:02 2022-06-10 16:56:16 40
2 2022-09-01 12:02:34 2022-09-25 14:59:29 17
3 2022-04-02 14:24:12 2022-04-24 21:05:55 15
4 2022-01-31 02:15:46 2022-07-02 16:16:02 110
5 2022-08-02 22:05:15 2022-08-17 17:25:10 12
6 2022-03-06 05:30:20 2022-07-04 08:43:00 86
7 2022-01-15 17:01:33 2022-08-09 21:48:41 147
8 2022-06-04 14:47:53 2022-12-12 18:05:58 136
9 2022-02-16 11:52:03 2022-10-18 01:30:58 175
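For very large frames, a hedged vectorized alternative is numpy's np.busday_count, which works on whole arrays at once; note it counts business days in the half-open interval [start, end), so it can differ by one from len(pd.bdate_range(...)), which includes both endpoints. A minimal sketch, assuming the column names from the question and a 2022 holiday span chosen just for illustration:
import numpy as np
from pandas.tseries.holiday import USFederalHolidayCalendar

# explicit holiday list covering the data's span (the 2022 range is an assumption here)
holidays = USFederalHolidayCalendar().holidays('2022-01-01', '2022-12-31')

# np.busday_count expects datetime64[D] arrays and excludes the end date itself
projects_df['days_count'] = np.busday_count(
    projects_df['start_date'].values.astype('datetime64[D]'),
    projects_df['current_date'].values.astype('datetime64[D]'),
    holidays=holidays.values.astype('datetime64[D]'),
)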
If I have 2 different sets of dates:
01/05/2022 - 31/12/2022
01/01/2023 - 31/12/2023
01/05/2022 - 30/09/2022
01/10/2022 - 31/12/2022
01/01/2023 - 31/12/2023
I want to check if both sets of dates above are contiguous within the below range of dates:
Date 1 = 01/05/2022
Date 2 = 31/12/2023
Please suggest a solution.
It seems to me easier to use pandas to check if dates fall into the date range.
Your data is in day, month, year order; in my practice I usually see the sequence year, month, day.
I converted the variables 'Date_1' and 'Date_2' to the desired format, along with the date arrays themselves, which I split into two parts: start and finish. Then I filled the dataframe with these arrays and checked the date range. For clarity I deliberately added one extra row, 2023-01-01 2025-12-31; it is simply filtered out, since it does not meet the condition.
import pandas as pd
from datetime import datetime
Date_1 = '01/05/2022'
Date_2 = '31/12/2023'
Date_1 = datetime.strptime(Date_1, "%d/%m/%Y")
Date_2 = datetime.strptime(Date_2, "%d/%m/%Y")
start = [datetime.strptime(i, "%d/%m/%Y") for i in ['01/05/2022', '01/01/2023', '01/05/2022', '01/10/2022', '01/01/2023', '01/01/2023']]
finish = [datetime.strptime(i, "%d/%m/%Y") for i in ['31/12/2022', '31/12/2023', '30/09/2022', '31/12/2022', '31/12/2023', '31/12/2025']]
df = pd.DataFrame({'start': start, 'finish': finish})
print(df)
print(df[(df['start'] >= Date_1) & (df['finish'] <= Date_2)])
Output print(df)
start finish
0 2022-05-01 2022-12-31
1 2023-01-01 2023-12-31
2 2022-05-01 2022-09-30
3 2022-10-01 2022-12-31
4 2023-01-01 2023-12-31
5 2023-01-01 2025-12-31
Output print(df[(df['start'] >= Date_1) & (df['finish'] <= Date_2)])
start finish
0 2022-05-01 2022-12-31
1 2023-01-01 2023-12-31
2 2022-05-01 2022-09-30
3 2022-10-01 2022-12-31
4 2023-01-01 2023-12-31
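If the goal is specifically to verify that a set of ranges chains together without gaps and exactly covers Date_1 to Date_2, a minimal sketch building on the df, Date_1 and Date_2 defined above; my assumption is that consecutive ranges should start exactly one day after the previous one ends, and that each set of dates is checked separately:
def is_contiguous(ranges, range_start, range_end):
    # sort by start, then require: first start == range_start, last finish == range_end,
    # and every start exactly one day after the previous finish
    r = ranges.sort_values('start').reset_index(drop=True)
    if r.loc[0, 'start'] != range_start or r.loc[len(r) - 1, 'finish'] != range_end:
        return False
    gaps = r['start'].iloc[1:].reset_index(drop=True) - r['finish'].iloc[:-1].reset_index(drop=True)
    return bool((gaps == pd.Timedelta(days=1)).all())

# the first set of dates from the question occupies rows 0 and 1 of df
print(is_contiguous(df.iloc[[0, 1]], Date_1, Date_2))  # True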
I have a DataFrame like this:
date time value
0 2019-04-18 07:00:10 100.8
1 2019-04-18 07:00:20 95.6
2 2019-04-18 07:00:30 87.6
3 2019-04-18 07:00:40 94.2
The DataFrame contains value recorded every 10 seconds for entire year 2019. I need to calculate standard deviation and mean/average of value for each hour of each date, and create two new columns for them. I have tried first separating the hour for each value like:
df["hour"] = df["time"].astype(str).str[:2]
Then I have tried to calculate standard deviation by:
df["std"] = df.groupby("hour").median().index.get_level_values('value').stack().std()
But that doesn't work; could I have some advice on the problem?
We can split the time column on the delimiter :, take the hour component with str[0], then group the dataframe on date along with the hour component and aggregate the value column with mean and std:
hr = df['time'].str.split(':', n=1).str[0]
df.groupby(['date', hr])['value'].agg(['mean', 'std'])
If you want to broadcast the aggregated values to the original dataframe, use transform instead of agg:
g = df.groupby(['date', df['time'].str.split(':', n=1).str[0]])['value']
df['mean'], df['std'] = g.transform('mean'), g.transform('std')
date time value mean std
0 2019-04-18 07:00:10 100.8 94.55 5.434151
1 2019-04-18 07:00:20 95.6 94.55 5.434151
2 2019-04-18 07:00:30 87.6 94.55 5.434151
3 2019-04-18 07:00:40 94.2 94.55 5.434151
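A hedged variation: if the date and time columns are first combined into a real datetime column, the same broadcast can hang off the .dt.hour accessor instead of string splitting; a minimal sketch with the column names from the question:
ts = pd.to_datetime(df['date'] + ' ' + df['time'])  # combine into a datetime64 Series
g = df.groupby([df['date'], ts.dt.hour])['value']
df['mean'], df['std'] = g.transform('mean'), g.transform('std')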
I have synthesized data. Start by generating a true datetime column, groupby() hour, use describe() to get mean & std, then merge() back to the original data frame.
d = pd.date_range("1-Jan-2019", "28-Feb-2019", freq="10S")
df = pd.DataFrame({"datetime":d, "value":np.random.uniform(70,90,len(d))})
df = df.assign(date=df.datetime.dt.strftime("%Y-%m-%d"),
time=df.datetime.dt.strftime("%H:%M:%S"))
# create a datetime column - better than manipulating strings
df["datetime"] = pd.to_datetime(df.date + " " + df.time)
# calc mean & std by hour
dfh = (df.groupby(df.datetime.dt.hour, as_index=False)
.apply(lambda dfa: dfa.describe().T.loc[:,["mean","std"]].reset_index(drop=True))
.droplevel(1)
)
# merge mean & std by hour back
df.merge(dfh, left_on=df.datetime.dt.hour, right_index=True).drop(columns="key_0")
datetime value mean std
0 2019-01-01 00:00:00 86.014209 80.043364 5.777724
1 2019-01-01 00:00:10 77.241141 80.043364 5.777724
2 2019-01-01 00:00:20 71.650739 80.043364 5.777724
3 2019-01-01 00:00:30 71.066332 80.043364 5.777724
4 2019-01-01 00:00:40 77.203291 80.043364 5.777724
... ... ... ... ...
3144955 2019-12-30 23:59:10 89.577237 80.009751 5.773007
3144956 2019-12-30 23:59:20 82.154883 80.009751 5.773007
3144957 2019-12-30 23:59:30 82.131952 80.009751 5.773007
3144958 2019-12-30 23:59:40 85.346724 80.009751 5.773007
3144959 2019-12-30 23:59:50 78.122761 80.009751 5.773007
I have a dataframe like this with two date columns and a quantity column:
start_date end_date qty
1 2018-01-01 2018-01-08 23
2 2018-01-08 2018-01-15 21
3 2018-01-15 2018-01-22 5
4 2018-01-22 2018-01-29 12
I have a second dataframe with just column containing yearly holidays for a couple of years, like this:
holiday
1 2018-01-01
2 2018-01-27
3 2018-12-25
4 2018-12-26
I would like to go through the first dataframe row by row and assign a boolean value to a new column holidays if a date in the second data frame falls between the date values of the first data frame. The result would look like this:
start_date end_date qty holidays
1 2018-01-01 2018-01-08 23 True
2 2018-01-08 2018-01-15 21 False
3 2018-01-15 2018-01-22 5 False
4 2018-01-22 2018-01-29 12 True
When I try to do that with a for loop I get the following error:
ValueError: Can only compare identically-labeled Series objects
An answer would be appreciated.
If you want a fully-vectorized solution, consider using the underlying numpy arrays:
import numpy as np
def holiday_arr(start, end, holidays):
    # reshape so the comparison broadcasts to shape (n_rows, n_holidays)
    start = start.reshape((-1, 1))
    end = end.reshape((-1, 1))
    holidays = holidays.reshape((1, -1))
    # a row is True if any holiday falls inside its [start, end] interval
    result = np.any(
        (start <= holidays) & (holidays <= end),
        axis=1
    )
    return result
If you have your dataframes as above (calling them df1 and df2), you can obtain your desired result by running:
df1["contains_holiday"] = holiday_arr(
df1["start_date"].to_numpy(),
df1["end_date"].to_numpy(),
df2["holiday"].to_numpy()
)
df1 then looks like:
start_date end_date qty contains_holiday
1 2018-01-01 2018-01-08 23 True
2 2018-01-08 2018-01-15 21 False
3 2018-01-15 2018-01-22 5 False
4 2018-01-22 2018-01-29 12 True
try:
def _is_holiday(row, df2):
    return ((df2['holiday'] >= row['start_date']) & (df2['holiday'] <= row['end_date'])).any()

df1.apply(lambda x: _is_holiday(x, df2), axis=1)
I'm not sure why you would want to go row-by-row. But boolean comparisons would be way faster.
df['holiday'] = ((df2.holiday >= df.start_date) & (df2.holiday <= df.end_date))
Time
>>> 1000 loops, best of 3: 1.05 ms per loop
Quoting #hchw's solution (row-by-row):
def _is_holiday(row, df2):
    return ((df2['holiday'] >= row['start_date']) & (df2['holiday'] <= row['end_date'])).any()

df.apply(lambda x: _is_holiday(x, df2), axis=1)
>>> The slowest run took 4.89 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 4.46 ms per loop
Try IntervalIndex.contains with a list comprehension and np.sum:
iix = pd.IntervalIndex.from_arrays(df1.start_date, df1.end_date, closed='both')
df1['holidays'] = np.sum([iix.contains(x) for x in df2.holiday], axis=0) >= 1
Out[812]:
start_date end_date qty holidays
1 2018-01-01 2018-01-08 23 True
2 2018-01-08 2018-01-15 21 False
3 2018-01-15 2018-01-22 5 False
4 2018-01-22 2018-01-29 12 True
Note: I assume the start_date, end_date, and holiday columns are in datetime format. If they are not, you need to convert them before running the above commands, as follows:
df1.start_date = pd.to_datetime(df1.start_date)
df1.end_date = pd.to_datetime(df1.end_date)
df2.holiday = pd.to_datetime(df2.holiday)
I have a dataset of samples covering multiple days, all with a timestamp.
I want to select rows within a specific time window. E.g. all rows that were generated between 1pm and 3 pm every day.
This is a sample of my data in a pandas dataframe:
22 22 2018-04-12T20:14:23Z 2018-04-12T21:14:23Z 0 6370.1
23 23 2018-04-12T21:14:23Z 2018-04-12T21:14:23Z 0 6368.8
24 24 2018-04-12T22:14:22Z 2018-04-13T01:14:23Z 0 6367.4
25 25 2018-04-12T23:14:22Z 2018-04-13T01:14:23Z 0 6365.8
26 26 2018-04-13T00:14:22Z 2018-04-13T01:14:23Z 0 6364.4
27 27 2018-04-13T01:14:22Z 2018-04-13T01:14:23Z 0 6362.7
28 28 2018-04-13T02:14:22Z 2018-04-13T05:14:22Z 0 6361.0
29 29 2018-04-13T03:14:22Z 2018-04-13T05:14:22Z 0 6359.3
.. ... ... ... ... ...
562 562 2018-05-05T08:13:21Z 2018-05-05T09:13:21Z 0 6300.9
563 563 2018-05-05T09:13:21Z 2018-05-05T09:13:21Z 0 6300.7
564 564 2018-05-05T10:13:14Z 2018-05-05T13:13:14Z 0 6300.2
565 565 2018-05-05T11:13:14Z 2018-05-05T13:13:14Z 0 6299.9
566 566 2018-05-05T12:13:14Z 2018-05-05T13:13:14Z 0 6299.6
How do I achieve that? I need to ignore the date and just evaluate the time component. I could traverse the dataframe in a loop and evaluate the datetime in that way, but there must be a simpler way to do that.
I converted the messageDate, which was read as a string, to a datetime by
df["messageDate"]=pd.to_datetime(df["messageDate"])
But after that I got stuck on how to filter on time only.
Any input appreciated.
Datetime columns expose a DatetimeProperties object through the .dt accessor, from which you can extract components such as the hour or datetime.time and filter on them:
import datetime
df = pd.DataFrame(
[
'2018-04-12T12:00:00Z', '2018-04-12T14:00:00Z','2018-04-12T20:00:00Z',
'2018-04-13T12:00:00Z', '2018-04-13T14:00:00Z', '2018-04-13T20:00:00Z'
],
columns=['messageDate']
)
df
messageDate
# 0 2018-04-12 12:00:00
# 1 2018-04-12 14:00:00
# 2 2018-04-12 20:00:00
# 3 2018-04-13 12:00:00
# 4 2018-04-13 14:00:00
# 5 2018-04-13 20:00:00
df["messageDate"] = pd.to_datetime(df["messageDate"])
time_mask = (df['messageDate'].dt.hour >= 13) & \
(df['messageDate'].dt.hour <= 15)
df[time_mask]
# messageDate
# 1 2018-04-12 14:00:00
# 4 2018-04-13 14:00:00
I hope the code is self explanatory. You can always ask questions.
import pandas as pd
# Prepping data for example
dates = pd.date_range('1/1/2018', periods=7, freq='H')
data = {'A' : range(7)}
df = pd.DataFrame(index = dates, data = data)
print(df)
# A
# 2018-01-01 00:00:00 0
# 2018-01-01 01:00:00 1
# 2018-01-01 02:00:00 2
# 2018-01-01 03:00:00 3
# 2018-01-01 04:00:00 4
# 2018-01-01 05:00:00 5
# 2018-01-01 06:00:00 6
# Creating a mask to filter the values we wish to keep or not.
# Here, we use df.index because the index is our datetime.
# If the datetime is a column, you can always say df['column_name']
mask = (df.index > '2018-1-1 01:00:00') & (df.index < '2018-1-1 05:00:00')
print(mask)
# [False False True True True False False]
df_with_good_dates = df.loc[mask]
print(df_with_good_dates)
# A
# 2018-01-01 02:00:00 2
# 2018-01-01 03:00:00 3
# 2018-01-01 04:00:00 4
df = df[(df["messageDate"].apply(lambda x: x.hour) > 13) & (df["messageDate"].apply(lambda x: x.hour) < 15)]
You can use x.minute, x.second similarly.
Try this after ensuring messageDate is indeed in datetime format, as you have done:
df.set_index('messageDate',inplace=True)
choseInd = [ind for ind in df.index if (ind.hour>=13)&(ind.hour<=15)]
df_select = df.loc[choseInd]
You can do the same even without making the datetime column the index, as the answer with apply/lambda shows; it just makes your dataframe 'better looking' if the datetime is your index rather than a numerical one.
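If the window has minute-level boundaries rather than whole hours, a hedged alternative that needs no index change is to compare the .dt.time accessor against datetime.time objects; a minimal sketch:
import datetime

# keep rows whose time-of-day falls in the 13:00-15:00 window, the date is ignored
mask = df['messageDate'].dt.time.between(datetime.time(13, 0), datetime.time(15, 0))
df_window = df[mask]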