So let's say I have a pandas dataframe with some repeated dates:
import pandas as pd
import random
reportDate = pd.date_range('04-01-2010', '09-03-2021',periods = 5000).date
lowPriceMin = [random.randint(10, 20) for x in range(5000)]
df = pd.DataFrame()
df['reportDate'] = reportDate
df['lowPriceMin'] = lowPriceMin
Now I want to get the min value from every week since the starting date. So I will have around 559 (the number of weeks from '04-01-2010' to '09-03-2021') values with the min value from every week.
Try with resample:
df['reportDate'] = pd.to_datetime(df['reportDate'])
>>> df.set_index("reportDate").resample("W").min()
            lowPriceMin
reportDate
2010-01-10           10
2010-01-17           10
2010-01-24           14
2010-01-31           10
2010-02-07           14
...                 ...
2021-02-14           11
2021-02-21           11
2021-02-28           10
2021-03-07           10
2021-03-14           17

[584 rows x 1 columns]
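Note that "W" produces calendar weeks ending on Sunday. If the weeks should instead be counted from the first date in the data, one option is a fixed 7-day bin anchored at the first timestamp. The following is a minimal sketch on a small made-up frame (the sample values are assumptions chosen for illustration, not the data from the question):

```python
import pandas as pd

# Small illustrative frame with a repeated date (values are made up)
df = pd.DataFrame({
    "reportDate": pd.to_datetime(
        ["2010-01-04", "2010-01-05", "2010-01-05", "2010-01-12", "2010-01-20"]
    ),
    "lowPriceMin": [15, 12, 18, 11, 17],
})

# 7-day bins that start at the first timestamp instead of on a calendar week
weekly_min = (
    df.set_index("reportDate")["lowPriceMin"]
      .resample("7D", origin="start")
      .min()
)
print(weekly_min)
```

Each label in the result is the first day of its 7-day window, so the first row is labelled with the starting date itself.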
This post follows on from another one I posted which can be found here:
use groupby() and for loop to count column values with conditions
I am working with the same data again:
import pandas as pd
import numpy as np
from datetime import timedelta
np.random.seed(365)  # seed NumPy's generator, which is what np.random.exponential uses below
#some data
start_date = pd.date_range(start = "2015-01-09", end = "2022-09-11", freq = "6D")
end_date = [start_date + timedelta(days = np.random.exponential(scale = 100)) for start_date in start_date]
df = pd.DataFrame(
{"start_date":start_date,
"end_date":end_date}
)
#randomly remove some end dates
df["end_date"] = df["end_date"].sample(frac = 0.7).reset_index(drop = True)
df["end_date"] = df["end_date"].dt.date.astype("datetime64[ns]")
As in the previous post, I first created a pd.Series with the 1st day of every month in the entire history of the data:
dates = pd.Series(df["start_date"].dt.to_period("M").sort_values(ascending = True).unique()).dt.start_time
What I now want to do is count, for each month in the series, the number of rows in the dataframe where the df["start_date"] value is less than the 1st day of that month and the df["end_date"] value is greater than it.
I would think that I would apply a lambda function or use np.logical_and on the dates series to obtain the output I am after - the logic of which would look something like this:
#only obtain those rows with end dates
inactives = df[df["end_date"].isnull() == False]
dates.apply(
lambda x: (inactives[inactives["start_date"] < x] & inactives[inactives["cancel_date"] > x]).count()
)
or like this:
dates.apply(
lambda x: np.logical_and(
inactives[inactives["start_date"] < x,
inactives[inactives["cancel_date"] > x]]
).sum())
The resulting output would look like this:
month_first  count
2015-01-01      10
2015-02-01      25
2015-03-01      45
Correct, we can use apply with a lambda for this. First, we create the list of first days of each month. We use freq="MS" (month start) to generate the start of every month inside our defined interval.
new_df = pd.DataFrame({"month_first": pd.date_range(start="2015-01-01", end="2022-10-01", freq = "MS")})
This will result in this table:
month_first
0 2015-01-01
1 2015-02-01
2 2015-03-01
3 2015-04-01
4 2015-05-01
.. ...
89 2022-06-01
90 2022-07-01
91 2022-08-01
92 2022-09-01
93 2022-10-01
[94 rows x 1 columns]
Then we apply the lambda function below. For each date in our range, we look at the rows of inactives whose start_date is earlier than that date and whose end_date is later. The & operator combines the two comparisons element-wise, and sum counts the True values.
new_df["count"] = new_df["month_first"].apply(
lambda x: ((inactives["start_date"] < x) & (inactives["end_date"] > x)).sum())
This will result in this table:
month_first count
0 2015-01-01 0
1 2015-02-01 4
2 2015-03-01 9
3 2015-04-01 14
4 2015-05-01 19
.. ... ...
89 2022-06-01 25
90 2022-07-01 22
91 2022-08-01 19
92 2022-09-01 13
93 2022-10-01 13
[94 rows x 2 columns]
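For larger frames, the same count can also be computed without apply by comparing all months against all rows at once with NumPy broadcasting. This is only a sketch of that alternative on a tiny made-up inactives frame (the three rows are assumptions for illustration); it builds a months-by-rows boolean matrix, so memory grows with the product of both sizes.

```python
import numpy as np
import pandas as pd

# Tiny illustrative stand-in for `inactives` (values are made up)
inactives = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-01-09", "2015-01-15", "2015-02-20"]),
    "end_date": pd.to_datetime(["2015-02-10", "2015-03-05", "2015-04-01"]),
})
month_first = pd.date_range("2015-01-01", "2015-04-01", freq="MS")

starts = inactives["start_date"].to_numpy()   # shape (n_rows,)
ends = inactives["end_date"].to_numpy()       # shape (n_rows,)
months = month_first.to_numpy()[:, None]      # shape (n_months, 1)

# Boolean matrix of shape (n_months, n_rows); summing each row gives the count
counts = ((starts < months) & (ends > months)).sum(axis=1)

result = pd.DataFrame({"month_first": month_first, "count": counts})
print(result)
```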
how to convert time to week number
year_start = '2019-05-21'
year_end = '2020-02-22'
How do I get the week number based on the date that I set as first week?
For example 2019-05-21 should be Week 1 instead of 2019-01-01
If you do not have dates outside of year_start/year_end, use isocalendar().week and perform a simple subtraction with modulo:
year_start = pd.to_datetime('2019-05-21')
#year_end = pd.to_datetime('2020-02-22')
df = pd.DataFrame({'date': pd.date_range('2019-05-21', '2020-02-22', freq='30D')})
df['week'] = (df['date'].dt.isocalendar().week.astype(int)-year_start.isocalendar()[1])%52+1
Output:
date week
0 2019-05-21 1
1 2019-06-20 5
2 2019-07-20 9
3 2019-08-19 14
4 2019-09-18 18
5 2019-10-18 22
6 2019-11-17 26
7 2019-12-17 31
8 2020-01-16 35
9 2020-02-15 39
Try the following code.
import numpy as np
import pandas as pd
year_start = '2019-05-21'
year_end = '2020-02-22'
# Create a sample dataframe
df = pd.DataFrame(pd.date_range(year_start, year_end, freq='D'), columns=['date'])
# Add the week number
df['week_number'] = (((df.date.view(np.int64) - pd.to_datetime([year_start]).view(np.int64)) / (1e9 * 60 * 60 * 24) - df.date.dt.day_of_week + 7) // 7 + 1).astype(np.int64)
date        week_number
2019-05-21            1
2019-05-22            1
2019-05-23            1
2019-05-24            1
2019-05-25            1
2019-05-26            1
2019-05-27            2
2019-05-28            2
...                 ...
2020-02-18           40
2020-02-19           40
2020-02-20           40
2020-02-21           40
2020-02-22           40
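The same Monday-aligned numbering can be written with timestamp arithmetic instead of integer views, which may be easier to read. The sketch below is an assumption about the intent of the formula above (weeks roll over on Monday, and the week containing year_start is week 1); the names week_start and first_week_start are introduced here for illustration.

```python
import pandas as pd

year_start = pd.Timestamp("2019-05-21")
dates = pd.Series(
    pd.to_datetime(["2019-05-21", "2019-05-26", "2019-05-27", "2020-02-22"])
)

# Monday of the week each date falls in
week_start = dates - pd.to_timedelta(dates.dt.dayofweek, unit="D")
# Monday of the week that contains year_start
first_week_start = year_start - pd.Timedelta(days=year_start.dayofweek)

# Whole weeks between the two Mondays, shifted so the first week is 1
week_number = (week_start - first_week_start).dt.days // 7 + 1
print(week_number.tolist())
```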
If you just need a function to calculate week no, based on given start and end date:
import pandas as pd
import numpy as np
start_date = "2019-05-21"
end_date = "2020-02-22"
start_datetime = pd.to_datetime(start_date)
end_datetime = pd.to_datetime(end_date)
def get_week_no(date):
    given_datetime = pd.to_datetime(date)
    # if date in range
    if start_datetime <= given_datetime <= end_datetime:
        x = given_datetime - start_datetime
        # adding 1 as it will return 0 for 1st week
        return int(x / np.timedelta64(1, 'W')) + 1
    raise ValueError(f"Date is not in range {start_date} - {end_date}")
print(get_week_no("2019-05-21"))
In the function, we calculate the week number by taking the difference between the given date and the start date in whole weeks, then adding 1.
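If the dates are already in a pandas Series, the same calculation can be applied to the whole column at once. A minimal sketch, assuming weeks are plain 7-day blocks counted from the start date (so they do not necessarily begin on a Monday):

```python
import pandas as pd

start = pd.Timestamp("2019-05-21")
dates = pd.Series(
    pd.to_datetime(["2019-05-21", "2019-05-27", "2019-05-28", "2020-02-22"])
)

# Days since the start date, grouped into 7-day blocks; +1 makes the first block week 1
week_no = (dates - start).dt.days // 7 + 1
print(week_no.tolist())
```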
I have the timeseries dataframe as:
timestamp            signal_value
2017-08-28 00:00:00            10
2017-08-28 00:05:00             3
2017-08-28 00:10:00             5
2017-08-28 00:15:00             5
I am trying to get the average Monthly percentage of the time where "signal_value" is greater than 5. Something like:
Month     metric
January      16%
February      2%
March         8%
April        10%
I tried the following code which gives the result for the whole dataset but how can I summarize it per each month?
total, count = 0, 0
for index, row in df.iterrows():
    total += 1
    if row["signal_value"] >= 5:
        count += 1
print((count/total)*100)
Thank you in advance.
Let us first generate some random data (generate random dates taken from here):
import pandas as pd
import numpy as np
import datetime
def randomtimes(start, end, n):
    frmt = '%d-%m-%Y %H:%M:%S'
    stime = datetime.datetime.strptime(start, frmt)
    etime = datetime.datetime.strptime(end, frmt)
    td = etime - stime
    dtimes = [np.random.random() * td + stime for _ in range(n)]
    return [d.strftime(frmt) for d in dtimes]
# Recreate some fake data
timestamp = randomtimes("01-01-2021 00:00:00", "01-01-2023 00:00:00", 10000)
signal_value = np.random.random(len(timestamp)) * 10
df = pd.DataFrame({"timestamp": timestamp, "signal_value": signal_value})
Now we can transform the timestamp column to pandas timestamps to extract month and year per timestamp:
df.timestamp = pd.to_datetime(df.timestamp)
df["month"] = df.timestamp.dt.month
df["year"] = df.timestamp.dt.year
We generate a boolean column whether signal_value is larger than some threshold (here 5):
df["is_larger5"] = df.signal_value > 5
Finally, we can get the average for every month by using pandas.groupby:
>>> df.groupby(["year", "month"])['is_larger5'].mean()
year month
2021 1 0.509615
2 0.488189
3 0.506024
4 0.519362
5 0.498778
6 0.483709
7 0.498824
8 0.460396
9 0.542918
10 0.463043
11 0.492500
12 0.519789
2022 1 0.481663
2 0.527778
3 0.501139
4 0.527322
5 0.486936
6 0.510638
7 0.483370
8 0.521253
9 0.493639
10 0.495349
11 0.474886
12 0.488372
Name: is_larger5, dtype: float64
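The result above is a fraction per month. To get percentages keyed by a single year-month label, one option is to group by a monthly period instead of separate year and month columns. A small sketch on made-up values (the six rows below are assumptions for illustration):

```python
import pandas as pd

# Illustrative data: four readings in August 2017 and two in September 2017
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-08-28 00:00:00", "2017-08-28 00:05:00",
        "2017-08-28 00:10:00", "2017-08-28 00:15:00",
        "2017-09-01 00:00:00", "2017-09-01 00:05:00",
    ]),
    "signal_value": [10, 3, 5, 5, 7, 2],
})

# True where the signal is strictly greater than the threshold
is_high = df["signal_value"] > 5

# The mean of a boolean column is the share of True values; scale to percent
monthly_pct = is_high.groupby(df["timestamp"].dt.to_period("M")).mean() * 100
print(monthly_pct)
```

Because the comparison is strict, the two readings equal to 5 do not count towards the August share.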
I have a pandas dataframe with a date column.
I'm trying to create a function and apply it to the dataframe to create a column that returns the number of days in the month/year specified.
So far I have:
from calendar import monthrange
def dom(x):
    m = dfs["load_date"].dt.month
    y = dfs["load_date"].dt.year
    monthrange(y,m)
    days = monthrange[1]
    return days
This however does not work when I attempt to apply it to the date column.
Additionally, I would like to be able to identify whether or not it is the current month, and if so return the number of days up to the current date in that month as opposed to days in the entire month.
I am not sure of the best way to do this, all I can think of is to check the month/year against datetime's today and then use a delta
thanks in advance
For pt.1 of your question, you can cast to pd.Period and retrieve days_in_month:
import pandas as pd
# create a sample df:
df = pd.DataFrame({'date': pd.date_range('2020-01', '2021-01', freq='M')})
df['daysinmonths'] = df['date'].apply(lambda t: pd.Period(t, freq='S').days_in_month)
# df['daysinmonths']
# 0 31
# 1 29
# 2 31
# ...
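As an alternative to building a pd.Period per row, datetime columns also expose days_in_month directly through the .dt accessor. A minimal sketch, using a few hand-picked dates (assumed here for illustration) that include a leap-year February:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-31", "2020-02-29", "2021-02-28", "2020-04-30"])
})

# Vectorized lookup of the number of days in each row's month
df["daysinmonths"] = df["date"].dt.days_in_month
print(df)
```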
For pt.2, you can take the timestamp of 'now' and create a boolean mask for your date column, i.e. where its year/month is less than "now". Then calculate the cumsum of the daysinmonth column for the section where the mask returns True. Invert the order of that series to get the days until now.
now = pd.Timestamp('now')
m = df['date'].dt.to_period('M') < now.to_period('M')  # compare whole year-month periods so earlier years are handled correctly
df['daysuntilnow'] = df['daysinmonths'][m].cumsum().iloc[::-1].reset_index(drop=True)
Update after comment: to get the elapsed days per month, you can do
df['dayselapsed'] = df['daysinmonths']
m = (df['date'].dt.year == now.year) & (df['date'].dt.month == now.month)
if m.any():
    df.loc[m, 'dayselapsed'] = now.day
df.loc[df['date'].dt.to_period('M') > now.to_period('M'), 'dayselapsed'] = 0  # months after the current one
output
df
Out[13]:
date daysinmonths daysuntilnow dayselapsed
0 2020-01-31 31 213.0 31
1 2020-02-29 29 182.0 29
2 2020-03-31 31 152.0 31
3 2020-04-30 30 121.0 30
4 2020-05-31 31 91.0 31
5 2020-06-30 30 60.0 30
6 2020-07-31 31 31.0 31
7 2020-08-31 31 NaN 27
8 2020-09-30 30 NaN 0
9 2020-10-31 31 NaN 0
10 2020-11-30 30 NaN 0
11 2020-12-31 31 NaN 0
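The elapsed-days logic can also be expressed in one step by comparing monthly periods, which keeps the three cases (past month, current month, future month) side by side. This is a sketch under the assumption that "now" is fixed at 2020-08-27 so the result is reproducible; in practice it would be pd.Timestamp('now').

```python
import numpy as np
import pandas as pd

now = pd.Timestamp("2020-08-27")  # fixed date, assumed for reproducibility
df = pd.DataFrame({
    "date": pd.to_datetime(["2019-12-31", "2020-07-31", "2020-08-31", "2020-09-30"])
})

month = df["date"].dt.to_period("M")
current = now.to_period("M")

df["dayselapsed"] = np.select(
    [month < current, month == current],   # past months, current month
    [df["date"].dt.days_in_month, now.day],  # full month, days so far
    default=0,                               # future months
)
print(df)
```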
I have a pandas timeline table containing datetime objects and scores:
datetime score
2018-11-23 08:33:02 4
2018-11-24 09:43:30 2
2018-11-25 08:21:34 5
2018-11-26 19:33:01 4
2018-11-23 08:50:40 1
2018-11-23 09:03:10 3
I want to aggregate the score by hour without taking the date into consideration. The desired result is:
08:00:00 10
09:00:00 5
19:00:00 4
So basically I have to remove the day, month, and year, and then group the score by hour.
I tried this command:
monthagg = df['score'].resample('H').sum().to_frame()
This works, but it takes the day, month, and year into consideration. How can I drop the date part and aggregate by hour only?
One possible solution is to use DatetimeIndex.floor to set minutes and seconds to 0, convert the DatetimeIndex to strings with DatetimeIndex.strftime, and then aggregate with sum:
a = df['score'].groupby(df.index.floor('H').strftime('%H:%M:%S')).sum()
#if column datetime
#a = df['score'].groupby(df['datetime'].dt.floor('H').dt.strftime('%H:%M:%S')).sum()
print (a)
08:00:00 10
09:00:00 5
19:00:00 4
Name: score, dtype: int64
Or use DatetimeIndex.hour and aggregate sum:
a = df.groupby(df.index.hour)['score'].sum()
#if column datetime
#a = df.groupby(df['datetime'].dt.hour)['score'].sum()
print (a)
datetime
8 10
9 5
19 4
Name: score, dtype: int64
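Both variants can be tried end to end on the sample data from the question. The sketch below assumes the datetime values live in a column rather than the index, and formats the hour directly with strftime so the labels match the desired "08:00:00" style:

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2018-11-23 08:33:02", "2018-11-24 09:43:30", "2018-11-25 08:21:34",
        "2018-11-26 19:33:01", "2018-11-23 08:50:40", "2018-11-23 09:03:10",
    ]),
    "score": [4, 2, 5, 4, 1, 3],
})

# Keep only the hour of each timestamp, rendered as HH:00:00, and sum per hour
by_hour = df.groupby(df["datetime"].dt.strftime("%H:00:00"))["score"].sum()
print(by_hour)
```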
Setup to generate a frame with datetime objects:
import datetime
import pandas as pd
rows = [datetime.datetime.now() + datetime.timedelta(hours=i) for i in range(100)]
df = pd.DataFrame(rows,columns = ["date"])
You can now add an hour column like this, and then group by it:
df["hour"] = df["date"].dt.hour
df.groupby("hour").size()  # this sample frame has no numeric column to sum, so count rows per hour
import pandas as pd
df = pd.DataFrame({'datetime':['2018-11-23 08:33:02','2018-11-24 09:43:30',
'2018-11-25 08:21:34',
'2018-11-26 19:33:01','2018-11-23 08:50:40',
'2018-11-23 09:03:10'],'score':[4,2,5,4,1,3]})
df['datetime']=pd.to_datetime(df['datetime'], errors='coerce')
df["hour"] = df["datetime"].dt.hour
df.groupby("hour")["score"].sum()  # sum only the score column; datetime values cannot be summed
Output:
8 10
9 5
19 4