I am using Polars in Python to try to add thirty days to a date.
I run the code and get no errors, but I also get no new dates.
Can anyone see my mistake?
import polars as pl
mydf = pl.DataFrame(
{"start_date": ["2020-01-02", "2020-01-03", "2020-01-04"]})
mydf = mydf.with_column(
pl.col("start_date").str.strptime(pl.Date, "%Y-%m-%d"),
)
# Generate the days above and below
mydf = mydf.with_column(
pl.col('start_date') + pl.duration(days=30).alias('date_plus_delta')
)
mydf = mydf.with_column(
pl.col('start_date') + pl.duration(days=-30).alias('date_minus_delta')
)
print(mydf)
shape: (3, 1)
┌────────────┐
│ start_date │
│ --- │
│ date │
╞════════════╡
│ 2020-01-02 │
│ 2020-01-03 │
│ 2020-01-04 │
└────────────┘
Quick References
The Manual: https://pola-rs.github.io/polars-book/user-guide/howcani/data/timestamps.html
strftime formats: https://docs.rs/chrono/latest/chrono/format/strftime/index.html
SO Answer from a previous Post: How to add a duration to datetime in Python polars
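You need to call .alias on the entire expression pl.col('start_date') + pl.duration(days=30). As written, .alias is applied only to pl.duration(days=30), so the sum keeps the name of its left operand, start_date, and overwrites that column instead of adding a new one. Adding 30 days and then subtracting 30 days leaves start_date looking unchanged, which is why no error appears.
So the correct way would be: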
import polars as pl
mydf = pl.DataFrame({"start_date": ["2020-01-02", "2020-01-03", "2020-01-04"]})
mydf = mydf.with_columns(pl.col("start_date").str.strptime(pl.Date, r"%Y-%m-%d"))
# Generate the days above and below
mydf = mydf.with_columns((pl.col('start_date') + pl.duration(days=30)).alias('date_plus_delta'))
mydf = mydf.with_columns((pl.col('start_date') - pl.duration(days=30)).alias('date_minus_delta'))
print(mydf)
Output
shape: (3, 3)
┌────────────┬─────────────────┬──────────────────┐
│ start_date ┆ date_plus_delta ┆ date_minus_delta │
│ --- ┆ --- ┆ --- │
│ date ┆ date ┆ date │
╞════════════╪═════════════════╪══════════════════╡
│ 2020-01-02 ┆ 2020-02-01 ┆ 2019-12-03 │
│ 2020-01-03 ┆ 2020-02-02 ┆ 2019-12-04 │
│ 2020-01-04 ┆ 2020-02-03 ┆ 2019-12-05 │
└────────────┴─────────────────┴──────────────────┘
Related
Is there a way to calculate the slope and intercept in Python pandas? For example, for the DataFrame below, can we populate additional columns that hold m, c, and y = mx + c?
from datetime import date
import numpy as np
import pandas as pd
date_range = pd.date_range(date(2021,11,7), date.today())
index = date_range
value = np.random.rand(len(index))
historical = pd.DataFrame({'date': date_range, 'Sales' : value})
historical
Out[300]:
date Sales m c y = mx+c
0 2021-11-07 0.210038 --- --- ----
1 2021-11-08 0.918222 --- --- ----
2 2021-11-09 0.202677 --- --- ----
3 2021-11-10 0.620185 --- --- ----
4 2021-11-11 0.299857 --- --- ----
So m and c will be constant for each row here.
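One possible way to fill such columns, sketched here as an assumption and not taken from the thread, is a least-squares line fit with numpy.polyfit. The sketch uses the row position as x (days since the first date, since the data is daily) and made-up Sales values that lie exactly on a line, so the fit is easy to check. The column name fitted is hypothetical.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2021-11-07", periods=5, freq="D")
# Illustrative values only: Sales = 2 * x + 1, where x is the row position.
historical = pd.DataFrame({"date": dates, "Sales": [1.0, 3.0, 5.0, 7.0, 9.0]})

# x is measured in days since the first row.
x = np.arange(len(historical))

# A degree-1 fit returns the slope first, then the intercept.
m, c = np.polyfit(x, historical["Sales"].to_numpy(), deg=1)

historical["m"] = m            # constant on every row
historical["c"] = c            # constant on every row
historical["fitted"] = m * x + c

print(historical)
```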
I have defined a pandas DataFrame as follows:
df_tmp = pd.DataFrame({'EDT': pd.Series(dtype='datetime64[ns]'),
'FSPB': pd.Series(dtype='str'),
'FS_LA': pd.Series(dtype='str'),
'lA': pd.Series(dtype='int'),
'avg': pd.Series(dtype='float64'),
'nw': pd.Series(dtype='float64')})
Is there any way to convert the above into an empty polars DataFrame?
According to the polars docs, polars DataFrames can take a pandas DataFrame in their constructor, so:
import pandas as pd
import polars as pl
df_tmp = pd.DataFrame({'EDT': pd.Series(dtype='datetime64[ns]'),
'FSPB': pd.Series(dtype='str'),
'FS_LA': pd.Series(dtype='str'),
'lA': pd.Series(dtype='int'),
'avg': pd.Series(dtype='float64'),
'nw': pd.Series(dtype='float64')})
df = pl.DataFrame(df_tmp)
should work.
import polars as pl
import pandas as pd
pandas_df = pd.DataFrame({'EDT': pd.Series(dtype='datetime64[ns]'),
'FSPB': pd.Series(dtype='str'),
'FS_LA': pd.Series(dtype='str'),
'lA': pd.Series(dtype='int'),
'avg': pd.Series(dtype='float64'),
'nw': pd.Series(dtype='float64')})
pl.from_pandas(pandas_df)
shape: (0, 6)
┌──────────────┬──────┬───────┬─────┬─────┬─────┐
│ EDT ┆ FSPB ┆ FS_LA ┆ lA ┆ avg ┆ nw │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[ns] ┆ str ┆ str ┆ i64 ┆ f64 ┆ f64 │
╞══════════════╪══════╪═══════╪═════╪═════╪═════╡
└──────────────┴──────┴───────┴─────┴─────┴─────┘
How would I add 1 year to a column?
I've tried using map and apply but I failed miserably.
I also wonder why pl.date() accepts integers while it advertises that it only accepts str or pli.Expr.
A small hack workaround is:
col = pl.col('date').dt
df = df.with_column(pl.when(pl.col(column).is_not_null())
.then(pl.date(col.year() + 1, col.month(), col.day()))
.otherwise(pl.date(col.year() + 1,col.month(), col.day()))
.alias("date"))
but this won't work for months or days. I can't just add a number or I'll get a:
> thread 'thread '<unnamed>' panicked at 'invalid or out-of-range date<unnamed>',
' panicked at '/github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/chrono-0.4.19/src/naive/date.rsinvalid or out-of-range date:', 173:/github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/chrono-0.4.19/src/naive/date.rs51
:note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Most likely because day and month cycle while year goes to infinity.
I could also do this:
df = df.with_column(
pl.when(col.month() == 1)
.then(pl.date(col.year(), 2, col.day()))
.when(col.month() == 2)
.then(pl.date(col.year(), 3, col.day()))
.when(col.month() == 3)
.then(pl.date(col.year(), 4, col.day()))
.when(col.month() == 4)
.then(pl.date(col.year(), 5, col.day()))
.when(col.month() == 5)
.then(pl.date(col.year(), 6, col.day()))
.when(col.month() == 6)
.then(pl.date(col.year(), 7, col.day()))
.when(col.month() == 7)
.then(pl.date(col.year(), 8, col.day()))
.when(col.month() == 8)
.then(pl.date(col.year(), 9, col.day()))
.when(col.month() == 9)
.then(pl.date(col.year(), 10, col.day()))
.when(col.month() == 10)
.then(pl.date(col.year(), 11, col.day()))
.when(col.month() == 11)
.then(pl.date(col.year(), 12, col.day()))
.otherwise(pl.date(col.year() + 1, 1, 1))
.alias("valid_from")
)
Polars supports addition and subtraction with Python's timedelta objects. For units larger than a week, however, things get more complicated, because months have different numbers of days and leap years must be taken into account.
For this, polars has offset_by under the dt namespace.
from datetime import datetime
import polars as pl
(pl.DataFrame({
"dates": pl.date_range(datetime(2000, 1, 1), datetime(2026, 1, 1), "1y")
}).with_columns([
pl.col("dates").dt.offset_by("1y").alias("dates_and_1_yr")
]))
shape: (27, 2)
┌─────────────────────┬─────────────────────┐
│ dates ┆ dates_and_1_yr │
│ --- ┆ --- │
│ datetime[ns] ┆ datetime[ns] │
╞═════════════════════╪═════════════════════╡
│ 2000-01-01 00:00:00 ┆ 2001-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2001-01-01 00:00:00 ┆ 2002-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2002-01-01 00:00:00 ┆ 2003-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2003-01-01 00:00:00 ┆ 2004-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-01 00:00:00 ┆ 2024-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2024-01-01 00:00:00 ┆ 2025-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2025-01-01 00:00:00 ┆ 2026-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2026-01-01 00:00:00 ┆ 2027-01-01 00:00:00 │
└─────────────────────┴─────────────────────┘
You can use polars.apply and dateutil.relativedelta which works for years, months, days and much more, but can be slow for lots of data.
from datetime import date
from dateutil.relativedelta import relativedelta
import polars as pl
df = pl.DataFrame(pl.date_range(date(2019, 1, 1), date(2020, 10, 1), '3mo', name='date'))
df.with_column(pl.col('date').apply(lambda x: x + relativedelta(years=1)))
Update: Since the offset_by method is now also available for months, it should be used whenever possible (see accepted answer). I leave this answer here because the approach can be used for more complicated cases that are not supported by offset_by.
How to calculate rolling cumulative product on Pandas DataFrame.
I have a time series of returns in a pandas DataFrame. How can I calculate a rolling annualized alpha for the relevant columns in the DataFrame? I would normally use Excel and do: =PRODUCT(1+[trailing 12 months])-1
My DataFrame looks like the below (a small portion):
Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 \
2009-08-31 00:00:00 --- --- 0.1489 0.072377
2009-09-30 00:00:00 --- --- 0.0662 0.069608
2009-10-31 00:00:00 --- --- -0.0288 -0.016967
2009-11-30 00:00:00 --- --- -0.0089 0.0009
2009-12-31 00:00:00 --- --- 0.044 0.044388
2010-01-31 00:00:00 --- --- -0.0301 -0.054953
2010-02-28 00:00:00 --- --- -0.0014 0.00821
2010-03-31 00:00:00 --- --- 0.0405 0.049959
2010-04-30 00:00:00 --- --- 0.0396 -0.007146
2010-05-31 00:00:00 --- --- -0.0736 -0.079834
2010-06-30 00:00:00 --- --- -0.0658 -0.028655
2010-07-31 00:00:00 --- --- 0.0535 0.038826
2010-08-31 00:00:00 --- --- -0.0031 -0.013885
2010-09-30 00:00:00 --- --- 0.0503 0.045781
2010-10-31 00:00:00 --- --- 0.0499 0.025335
2010-11-30 00:00:00 --- --- 0.012 -0.007495
I've tried the code below provided for a similar question, but it looks like it doesn't work anymore ...
import pandas as pd
import numpy as np
# your DataFrame; df = ...
pd.rolling_apply(df, 12, lambda x: np.prod(1 + x) - 1)
... and the pages that I'm redirected to don't seem to be relevant.
Ideally, I'd like to reproduce the DataFrame but with 12 month returns, not monthly so I can locate the relevant 12 month return depending on the month.
If I understand correctly, you could try something like the below:
import pandas as pd
import numpy as np
# define a dummy DataFrame of gross monthly returns, i.e. values of (1 + r)
df = pd.DataFrame(1 + np.random.rand(20), columns=['returns'])
# compute 12-month rolling returns: product of the last 12 gross returns, minus 1
df_roll = df.rolling(window=12).apply(np.prod) - 1
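If the DataFrame holds plain monthly returns, as in the question, and not 1 + r, the 1 has to be added before taking the product. This is a minimal sketch with a constant 1% monthly return so the result can be checked by hand. The column name fund is hypothetical, and raw=True is an optional choice that passes NumPy arrays to the function.

```python
import numpy as np
import pandas as pd

# 14 months of a constant 1% net return (illustrative data only).
returns = pd.DataFrame({"fund": [0.01] * 14})

# Equivalent of Excel's =PRODUCT(1+[trailing 12 months])-1.
rolling_12m = (1 + returns).rolling(window=12).apply(np.prod, raw=True) - 1

print(rolling_12m)
```

The first 11 rows are NaN because a full 12-month window is not yet available. From the 12th row on, each value equals 1.01**12 - 1.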
I am using the code below but get an error after pivoting the DataFrame:
dataframe:
name day value time
0 MAC000002 2012-12-16 0.147 09:30:00
1 MAC000002 2012-12-16 0.110 10:00:00
2 MAC000002 2012-12-16 0.736 10:30:00
3 MAC000002 2012-12-16 0.404 11:00:00
4 MAC000003 2012-12-16 0.845 00:30:00
Read in data, and pivot
ddf = dd.read_csv('data.csv')
#I added this, but it didn't fix the error below
ddf.index.name = 'index'
#dask requires string as category type
ddf['name'] = ddf['name'].astype('category')
ddf['name'] =ddf['name'].cat.as_known()
#pivot the table
df = ddf.pivot_table(columns='name', values='value', index='index')
df.head()
#KeyError: 'index'
Expected result (with or without index) - pivot rows to columns without any value modification:
MAC000002 MAC000003 ...
0.147 0.845
0.110 ...
0.736 ...
0.404 ...
Any idea why I am getting a KeyError 'index' and how I can overcome this?
According to the docs for pivot_table, the value of the index kwarg should refer to an existing column. So instead of setting the name of the index, create a column that holds the desired index values:
# ddf.index.name = 'index'
ddf['index'] = ddf.index
Note that this assumes that the index is what you are really pivoting by.
Below is a reproducible snippet:
data = """
| name | day | value | time
0 | MAC000002 | 2012-12-16| 0.147| 09:30:00
1 | MAC000002 | 2012-12-16| 0.110| 10:00:00
2 | MAC000002 | 2012-12-16| 0.736| 10:30:00
3 | MAC000002 | 2012-12-16| 0.404| 11:00:00
4 | MAC000003 | 2012-12-16| 0.845| 00:30:00
"""
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(data), sep='|')
df.columns = [c.strip() for c in df.columns]
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=3)
ddf['index'] = ddf.index
#dask requires string as category type
ddf['name'] = ddf['name'].astype('category')
ddf['name'] =ddf['name'].cat.as_known()
ddf.pivot_table(columns='name', values='value', index='index').compute()
# name MAC000002 MAC000003
# index
# 0 0.147 NaN
# 1 0.110 NaN
# 2 0.736 NaN
# 3 0.404 NaN
# 4 NaN 0.845
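The NaN values appear because every row has its own index value. To get closer to the layout in the question, where each name's values are stacked from the top, one option is to pivot on a per-name row counter. The sketch below shows the idea in plain pandas, not dask, and is an assumption that is not part of the answer above. The helper column row is hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["MAC000002"] * 4 + ["MAC000003"],
        "value": [0.147, 0.110, 0.736, 0.404, 0.845],
    }
)

# Number the rows within each name: 0, 1, 2, 3 for MAC000002 and 0 for MAC000003.
df["row"] = df.groupby("name").cumcount()

# Pivot on the counter so each name's values start at row 0.
wide = df.pivot(index="row", columns="name", values="value")

print(wide)
```

Names with fewer readings are still padded with NaN at the bottom, since the columns of a DataFrame must have equal length.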