Is there a way to calculate the slope and intercept in Python pandas? For example, for the DataFrame below, can we calculate/populate another column that computes y = mx + c?
import pandas as pd
import numpy as np
from datetime import date

date_range = pd.date_range(date(2021,11,7), date.today())
index = date_range
value = np.random.rand(len(index))
historical = pd.DataFrame({'date': date_range, 'Sales' : value})
historical
Out[300]:
date Sales m c y = mx+c
0 2021-11-07 0.210038 --- --- ----
1 2021-11-08 0.918222 --- --- ----
2 2021-11-09 0.202677 --- --- ----
3 2021-11-10 0.620185 --- --- ----
4 2021-11-11 0.299857 --- --- ----
So m and c will be constant for each row here
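A minimal sketch of one way to populate those columns (assuming a single straight-line fit over the whole series, with x measured in days since the first date) is np.polyfit:
x = (historical['date'] - historical['date'].iloc[0]).dt.days  # days since start
m, c = np.polyfit(x, historical['Sales'], 1)  # degree-1 fit: slope, intercept
historical['m'] = m  # constant for every row, as noted above
historical['c'] = c
historical['y = mx+c'] = m * x + c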
Given the following time series (for illustration purposes):
From | Till | Precipitation
2022-01-01 06:00:00 | 2022-01-02 06:00:00 | 0.5
2022-01-02 06:00:00 | 2022-01-03 06:00:00 | 1.2
2022-01-03 06:00:00 | 2022-01-04 06:00:00 | 0.0
2022-01-04 06:00:00 | 2022-01-05 06:00:00 | 1.3
2022-01-05 06:00:00 | 2022-01-06 06:00:00 | 9.8
2022-01-06 06:00:00 | 2022-01-07 06:00:00 | 0.1
I'd like to estimate the daily precipitation between 2022-01-02 00:00:00 and 2022-01-06 00:00:00. We can assume that the rate of precipitation is constant for each given interval in the table.
Doing it manually, I'd assume something like
2022-01-02 00:00:00 | 2022-01-03 00:00:00 | 0.25 * 0.5 + 0.75 * 1.2
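(that is, the first 6 hours of that day fall in the first interval and the remaining 18 in the second: 0.25 * 0.5 + 0.75 * 1.2 = 0.125 + 0.9 = 1.025)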
Note: the real-world data will most likely look much less regular, somewhat like the following (missing intervals can be assumed to be 0.0):
From | Till | Precipitation
2022-01-01 05:45:12 | 2022-01-02 02:11:20 | 0.8
2022-01-03 02:01:59 | 2022-01-04 12:01:00 | 5.4
2022-01-04 06:00:00 | 2022-01-05 06:00:00 | 1.3
2022-01-05 07:10:00 | 2022-01-06 07:10:00 | 9.2
2022-01-06 02:54:00 | 2022-01-07 02:53:59 | 0.1
Maybe there's a library with a general and efficient solution?
If there's no such library, how do I compute the resampled time series in the most efficient way?
Just calculate the period overlaps ... I think this will be plenty fast:
import pandas as pd
import numpy as np
def create_test_data():
    # just a helper to construct a test dataframe
    from_dates = pd.date_range(start='2022-01-01 06:00:00', freq='D', periods=6)
    till_dates = pd.date_range(start='2022-01-02 06:00:00', freq='D', periods=6)
    precip_amounts = [0.5, 1.2, 1, 2, 3, 0.5]
    return pd.DataFrame({'From': from_dates, 'Till': till_dates, 'Precip': precip_amounts})
def get_between(df, start_datetime, end_datetime):
    # all the entries that end (Till) after start_datetime
    # and start (From) before end_datetime
    mask1 = df['Till'] > start_datetime
    mask2 = df['From'] < end_datetime
    return df[mask1 & mask2]
def get_ratio_values(df, start_datetime, end_datetime, debug=True):
    # get the ratios of the period windows
    df2 = get_between(df, start_datetime, end_datetime)  # get only the rows of interest
    precip_values = df2['Precip']
    # overlap from the end time (Till) of each row to the start of our period of interest
    overlap_period1 = df2['Till'] - start_datetime
    # overlap from the end of our period of interest to the start time (From) of each row
    overlap_period2 = end_datetime - df2['From']
    # get the "best" overlap for each row
    best_overlap = np.minimum(overlap_period1, overlap_period2)
    # get the duration of each window
    window_durations = df2['Till'] - df2['From']
    # calculate the ratios of overlap (cannot be greater than 1)
    ratios = np.minimum(1.0, best_overlap / window_durations)
    # calculate the value * the ratio
    ratio_values = ratios * precip_values
    if debug:
        # just some prints for verification
        print("Ratio * value = result")
        print("----------------------")
        print("\n".join(f"{x:0.3f} * {y:0.2f} = {z}" for x, y, z in zip(ratios, precip_values, ratio_values)))
        print("----------------------")
    return ratio_values
start = pd.to_datetime('2022-01-02 00:00:00')
end = pd.to_datetime('2022-01-04 00:00:00')
ratio_vals = get_ratio_values(create_test_data(), start, end)
total_precip = ratio_vals.sum()
print("SUM RESULT =", total_precip)
You could also calculate the ratios only for the first and last matched rows, since anything in the middle will always be 1 (which is probably both simpler and faster):
def get_ratio_values(df, start_datetime, end_datetime):
    # get the ratios of the period windows
    df2 = get_between(df, start_datetime, end_datetime)  # get only the rows of interest
    precip_values = df2['Precip']
    # overlap with first row and duration of first row
    overlap_start = df2.iloc[0]['Till'] - start_datetime
    duration_start = df2.iloc[0]['Till'] - df2.iloc[0]['From']
    # overlap with last row and duration of last row
    overlap_end = end_datetime - df2.iloc[-1]['From']
    duration_end = df2.iloc[-1]['Till'] - df2.iloc[-1]['From']
    # middle rows are always fully covered, so their ratio is 1
    ratios = np.ones(len(df2))
    ratios[0] = overlap_start / duration_start
    ratios[-1] = overlap_end / duration_end  # note: with a single matched row this overwrites ratios[0]
    return ratios * precip_values
How to convert a date to a week number
year_start = '2019-05-21'
year_end = '2020-02-22'
How do I get the week number based on the date that I set as first week?
For example 2019-05-21 should be Week 1 instead of 2019-01-01
If you do not have dates outside of year_start/year_end, use isocalendar().week and perform a simple subtraction with modulo:
import pandas as pd

year_start = pd.to_datetime('2019-05-21')
#year_end = pd.to_datetime('2020-02-22')
df = pd.DataFrame({'date': pd.date_range('2019-05-21', '2020-02-22', freq='30D')})
df['week'] = (df['date'].dt.isocalendar().week.astype(int) - year_start.isocalendar()[1]) % 52 + 1
Output:
date week
0 2019-05-21 1
1 2019-06-20 5
2 2019-07-20 9
3 2019-08-19 14
4 2019-09-18 18
5 2019-10-18 22
6 2019-11-17 26
7 2019-12-17 31
8 2020-01-16 35
9 2020-02-15 39
Try the following code.
import numpy as np
import pandas as pd
year_start = '2019-05-21'
year_end = '2020-02-22'
# Create a sample dataframe
df = pd.DataFrame(pd.date_range(year_start, year_end, freq='D'), columns=['date'])
# Add the week number
df['week_number'] = (((df.date.view(np.int64) - pd.to_datetime([year_start]).view(np.int64)) / (1e9 * 60 * 60 * 24) - df.date.dt.day_of_week + 7) // 7 + 1).astype(np.int64)
date        week_number
2019-05-21            1
2019-05-22            1
2019-05-23            1
2019-05-24            1
2019-05-25            1
2019-05-26            1
2019-05-27            2
2019-05-28            2
...
2020-02-18           40
2020-02-19           40
2020-02-20           40
2020-02-21           40
2020-02-22           40
If you just need a function to calculate the week number based on a given start and end date:
import pandas as pd
import numpy as np
start_date = "2019-05-21"
end_date = "2020-02-22"
start_datetime = pd.to_datetime(start_date)
end_datetime = pd.to_datetime(end_date)
def get_week_no(date):
    given_datetime = pd.to_datetime(date)
    # if date in range
    if start_datetime <= given_datetime <= end_datetime:
        x = given_datetime - start_datetime
        # adding 1 as it would return 0 for the 1st week
        return int(x / np.timedelta64(1, 'W')) + 1
    raise ValueError(f"Date is not in range {start_date} - {end_date}")
print(get_week_no("2019-05-21"))
In the function, we calculate the week number by taking the difference between the given date and the start date in weeks.
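For example, get_week_no("2019-05-28") returns 2, since exactly seven days (one full week) have elapsed since the start date.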
I have a time series DataFrame like this:
timestamp            signal_value
2017-08-28 00:00:00            10
2017-08-28 00:05:00             3
2017-08-28 00:10:00             5
2017-08-28 00:15:00             5
I am trying to get the average monthly percentage of the time where "signal_value" is greater than 5. Something like:
Month     metric
January      16%
February      2%
March         8%
April        10%
I tried the following code, which gives the result for the whole dataset, but how can I summarize it per month?
total, count = 0, 0
for index, row in df.iterrows():
    total += 1
    if row["signal_value"] >= 5:
        count += 1
print((count / total) * 100)
Thank you in advance.
Let us first generate some random data (random-date generation taken from here):
import pandas as pd
import numpy as np
import datetime
def randomtimes(start, end, n):
    frmt = '%d-%m-%Y %H:%M:%S'
    stime = datetime.datetime.strptime(start, frmt)
    etime = datetime.datetime.strptime(end, frmt)
    td = etime - stime
    dtimes = [np.random.random() * td + stime for _ in range(n)]
    return [d.strftime(frmt) for d in dtimes]

# Recreate some fake data
timestamp = randomtimes("01-01-2021 00:00:00", "01-01-2023 00:00:00", 10000)
signal_value = np.random.random(len(timestamp)) * 10
df = pd.DataFrame({"timestamp": timestamp, "signal_value": signal_value})
Now we can transform the timestamp column to pandas timestamps to extract month and year per timestamp:
df.timestamp = pd.to_datetime(df.timestamp)
df["month"] = df.timestamp.dt.month
df["year"] = df.timestamp.dt.year
We generate a boolean column indicating whether signal_value is larger than some threshold (here 5):
df["is_larger5"] = df.signal_value > 5
Finally, we can get the average for every month by using pandas.groupby:
>>> df.groupby(["year", "month"])['is_larger5'].mean()
year month
2021 1 0.509615
2 0.488189
3 0.506024
4 0.519362
5 0.498778
6 0.483709
7 0.498824
8 0.460396
9 0.542918
10 0.463043
11 0.492500
12 0.519789
2022 1 0.481663
2 0.527778
3 0.501139
4 0.527322
5 0.486936
6 0.510638
7 0.483370
8 0.521253
9 0.493639
10 0.495349
11 0.474886
12 0.488372
Name: is_larger5, dtype: float64
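If you want the output shaped like the table in the question (month names and percentages), a possible follow-up, assuming you want to average across years:
import calendar
# month_name() groups alphabetically; reindex to restore calendar order
metric = (df.groupby(df.timestamp.dt.month_name())['is_larger5']
            .mean().mul(100).round(1)
            .reindex(calendar.month_name[1:]))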
I have a JSON array with id, start time, and end time fields. I want to calculate the average time a user is active. Some entries may have only a start time and no end time.
Example data -
data = [{"id":1, "stime":"2020-09-21T06:25:36Z","etime": "2020-09-22T09:25:36Z"},{"id":2, "stime":"2020-09-22T02:24:36Z","etime": "2020-09-23T07:25:36Z"},{"id":3, "stime":"2020-09-20T06:25:36Z","etime": "2020-09-24T09:25:36Z"},{"id":4, "stime":"2020-09-23T06:25:36Z","etime": "2020-09-29T09:25:36Z"}]
My approach: take the difference between start time and end time for each entry, sum all the differences, and divide by the total number of ids.
Sample code:
import datetime
from datetime import timedelta
import dateutil.parser
datetimeFormat = '%Y-%m-%d %H:%M:%S.%f'
date_s_time = '2020-09-21T06:25:36Z'
date_e_time = '2020-09-22T09:25:36Z'
d1 = dateutil.parser.parse(date_s_time)
d2 = dateutil.parser.parse(date_e_time)
diff1 = datetime.datetime.strptime(d2.strftime('%Y-%m-%d %H:%M:%S.%f'), datetimeFormat)\
- datetime.datetime.strptime(d1.strftime('%Y-%m-%d %H:%M:%S.%f'), datetimeFormat)
print("Difference 1:", diff1)
date_s_time2 = '2020-09-20T06:25:36Z'
date_e_time2 = '2020-09-28T02:25:36Z'
d3 = dateutil.parser.parse(date_s_time2)
d4 = dateutil.parser.parse(date_e_time2)
diff2 = datetime.datetime.strptime(d4.strftime('%Y-%m-%d %H:%M:%S.%f'), datetimeFormat)\
- datetime.datetime.strptime(d3.strftime('%Y-%m-%d %H:%M:%S.%f'), datetimeFormat)
print("Difference 2:", diff2)
print("total", diff1+diff2)
print((diff1 + diff2) / 2)
Please suggest a better approach that would be more efficient.
You could use the pandas library.
import pandas as pd
data = [{"id":1, "stime":"2020-09-21T06:25:36Z","etime": "2020-09-22T09:25:36Z"},{"id":1, "stime":"2020-09-22T02:24:36Z","etime": "2020-09-23T07:25:36Z"},{"id":1, "stime":"2020-09-20T06:25:36Z","etime": "2020-09-24T09:25:36Z"},{"id":1, "stime":"2020-09-23T06:25:36Z"}]
(Let's say your last row has no end time)
Now, you can create a Pandas DataFrame using your data
df = pd.DataFrame(data)
df looks like so:
id stime etime
0 1 2020-09-21T06:25:36Z 2020-09-22T09:25:36Z
1 1 2020-09-22T02:24:36Z 2020-09-23T07:25:36Z
2 1 2020-09-20T06:25:36Z 2020-09-24T09:25:36Z
3 1 2020-09-23T06:25:36Z NaN
Now, we want to map the columns stime and etime so that the strings are converted to datetime objects, and fill NaNs with something that makes sense: if no end time exists, could we use the current time?
from datetime import datetime
import dateutil.parser

df = df.fillna(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))
df['etime'] = df['etime'].map(dateutil.parser.parse)
df['stime'] = df['stime'].map(dateutil.parser.parse)
Or, if you want to drop the rows that don't have an etime, just do
df = df.dropna()
Now df becomes:
id stime etime
0 1 2020-09-21 06:25:36+00:00 2020-09-22 09:25:36+00:00
1 1 2020-09-22 02:24:36+00:00 2020-09-23 07:25:36+00:00
2 1 2020-09-20 06:25:36+00:00 2020-09-24 09:25:36+00:00
3 1 2020-09-23 06:25:36+00:00 2020-09-24 20:05:42+00:00
Finally, subtract the two:
df['tdiff'] = df['etime'] - df['stime']
and we get:
id stime etime tdiff
0 1 2020-09-21 06:25:36+00:00 2020-09-22 09:25:36+00:00 1 days 03:00:00
1 1 2020-09-22 02:24:36+00:00 2020-09-23 07:25:36+00:00 1 days 05:01:00
2 1 2020-09-20 06:25:36+00:00 2020-09-24 09:25:36+00:00 4 days 03:00:00
3 1 2020-09-23 06:25:36+00:00 2020-09-24 20:05:42+00:00 1 days 13:40:06
The mean of this column is:
df['tdiff'].mean()
Output: Timedelta('2 days 00:10:16.500000')
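If you need the average per user rather than across all rows, a possible extension (assuming id identifies the user):
df.groupby('id')['tdiff'].mean()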
How do I calculate a rolling cumulative product on a pandas DataFrame?
I have a time series of returns in a pandas DataFrame. How can I calculate a rolling annualized alpha for the relevant columns in the DataFrame? I would normally use Excel and do: =PRODUCT(1+[trailing 12 months])-1
My DataFrame looks like the below (a small portion):
Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 \
2009-08-31 00:00:00 --- --- 0.1489 0.072377
2009-09-30 00:00:00 --- --- 0.0662 0.069608
2009-10-31 00:00:00 --- --- -0.0288 -0.016967
2009-11-30 00:00:00 --- --- -0.0089 0.0009
2009-12-31 00:00:00 --- --- 0.044 0.044388
2010-01-31 00:00:00 --- --- -0.0301 -0.054953
2010-02-28 00:00:00 --- --- -0.0014 0.00821
2010-03-31 00:00:00 --- --- 0.0405 0.049959
2010-04-30 00:00:00 --- --- 0.0396 -0.007146
2010-05-31 00:00:00 --- --- -0.0736 -0.079834
2010-06-30 00:00:00 --- --- -0.0658 -0.028655
2010-07-31 00:00:00 --- --- 0.0535 0.038826
2010-08-31 00:00:00 --- --- -0.0031 -0.013885
2010-09-30 00:00:00 --- --- 0.0503 0.045781
2010-10-31 00:00:00 --- --- 0.0499 0.025335
2010-11-30 00:00:00 --- --- 0.012 -0.007495
I've tried the code below provided for a similar question, but it looks like it doesn't work anymore ...
import pandas as pd
import numpy as np
# your DataFrame; df = ...
pd.rolling_apply(df, 12, lambda x: np.prod(1 + x) - 1)
... and the pages I'm redirected to don't seem relevant.
Ideally, I'd like to reproduce the DataFrame but with 12 month returns, not monthly so I can locate the relevant 12 month return depending on the month.
If I understand correctly, you could try something like the below:
import pandas as pd
import numpy as np
# define a dummy dataframe of gross monthly returns (i.e. 1 + return)
df = pd.DataFrame(1 + np.random.rand(20), columns=['returns'])
# compute 12-month rolling compound returns
df_roll = df.rolling(window=12).apply(np.prod) - 1
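If your columns hold raw monthly returns instead (as in the Excel formula), the modern equivalent of the deprecated pd.rolling_apply call would be something like:
df_roll = df.rolling(window=12).apply(lambda x: np.prod(1 + x)) - 1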