How to convert an empty pandas DataFrame into a polars DataFrame

I have defined a pandas DataFrame as follows:
df_tmp = pd.DataFrame({'EDT': pd.Series(dtype='datetime64[ns]'),
                       'FSPB': pd.Series(dtype='str'),
                       'FS_LA': pd.Series(dtype='str'),
                       'lA': pd.Series(dtype='int'),
                       'avg': pd.Series(dtype='float64'),
                       'nw': pd.Series(dtype='float64')})
Is there any way to convert the above into an empty polars DataFrame?

According to the polars docs, the pl.DataFrame constructor accepts a pandas DataFrame, so the following should work:
import pandas as pd
import polars as pl
df_tmp = pd.DataFrame({'EDT': pd.Series(dtype='datetime64[ns]'),
                       'FSPB': pd.Series(dtype='str'),
                       'FS_LA': pd.Series(dtype='str'),
                       'lA': pd.Series(dtype='int'),
                       'avg': pd.Series(dtype='float64'),
                       'nw': pd.Series(dtype='float64')})
df = pl.DataFrame(df_tmp)

import polars as pl
import pandas as pd
pandas_df = pd.DataFrame({'EDT': pd.Series(dtype='datetime64[ns]'),
                          'FSPB': pd.Series(dtype='str'),
                          'FS_LA': pd.Series(dtype='str'),
                          'lA': pd.Series(dtype='int'),
                          'avg': pd.Series(dtype='float64'),
                          'nw': pd.Series(dtype='float64')})
pl.from_pandas(pandas_df)
shape: (0, 6)
┌──────────────┬──────┬───────┬─────┬─────┬─────┐
│ EDT          ┆ FSPB ┆ FS_LA ┆ lA  ┆ avg ┆ nw  │
│ ---          ┆ ---  ┆ ---   ┆ --- ┆ --- ┆ --- │
│ datetime[ns] ┆ str  ┆ str   ┆ i64 ┆ f64 ┆ f64 │
╞══════════════╪══════╪═══════╪═════╪═════╪═════╡
└──────────────┴──────┴───────┴─────┴─────┴─────┘

Related

Polars Adding Days to a date [duplicate]

I am using Polars in Python to try to add thirty days to a date.
The code runs without errors, but I also get no new date columns.
Can anyone see my mistake?
import polars as pl
mydf = pl.DataFrame(
    {"start_date": ["2020-01-02", "2020-01-03", "2020-01-04"]})
mydf = mydf.with_column(
    pl.col("start_date").str.strptime(pl.Date, "%Y-%m-%d"),
)
# Generate the days above and below
mydf = mydf.with_column(
    pl.col('start_date') + pl.duration(days=30).alias('date_plus_delta')
)
mydf = mydf.with_column(
    pl.col('start_date') + pl.duration(days=-30).alias('date_minus_delta')
)
print(mydf)
shape: (3, 1)
┌────────────┐
│ start_date │
│ ---        │
│ date       │
╞════════════╡
│ 2020-01-02 │
│ 2020-01-03 │
│ 2020-01-04 │
└────────────┘
Quick References
The Manual: https://pola-rs.github.io/polars-book/user-guide/howcani/data/timestamps.html
strftime formats: https://docs.rs/chrono/latest/chrono/format/strftime/index.html
SO Answer from a previous Post: How to add a duration to datetime in Python polars

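You need to call .alias on the entire expression pl.col('start_date') + pl.duration(days=30). Because a method call binds more tightly than +, your code aliases only pl.duration(days=30); the sum keeps the name start_date and overwrites that column. Adding 30 days and then subtracting 30 days therefore leaves start_date looking unchanged.
So the correct way would be: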
import polars as pl
mydf = pl.DataFrame({"start_date": ["2020-01-02", "2020-01-03", "2020-01-04"]})
mydf = mydf.with_columns(pl.col("start_date").str.strptime(pl.Date, r"%Y-%m-%d"))
# Generate the days above and below
mydf = mydf.with_columns((pl.col('start_date') + pl.duration(days=30)).alias('date_plus_delta'))
mydf = mydf.with_columns((pl.col('start_date') - pl.duration(days=30)).alias('date_minus_delta'))
print(mydf)
Output
shape: (3, 3)
┌────────────┬─────────────────┬──────────────────┐
│ start_date ┆ date_plus_delta ┆ date_minus_delta │
│ ---        ┆ ---             ┆ ---              │
│ date       ┆ date            ┆ date             │
╞════════════╪═════════════════╪══════════════════╡
│ 2020-01-02 ┆ 2020-02-01 ┆ 2019-12-03 │
│ 2020-01-03 ┆ 2020-02-02 ┆ 2019-12-04 │
│ 2020-01-04 ┆ 2020-02-03 ┆ 2019-12-05 │
└────────────┴─────────────────┴──────────────────┘

Is there a way to calculate slope, intercept in python pandas

Is there a way to calculate the slope and intercept in pandas? For example, for the DataFrame below, can we populate additional columns with m, c and y = mx + c?
from datetime import date
import numpy as np
import pandas as pd
date_range = pd.date_range(date(2021, 11, 7), date.today())
index = date_range
value = np.random.rand(len(index))
historical = pd.DataFrame({'date': date_range, 'Sales': value})
historical
Out[300]:
        date     Sales    m    c  y = mx+c
0 2021-11-07  0.210038  ---  ---      ----
1 2021-11-08  0.918222  ---  ---      ----
2 2021-11-09  0.202677  ---  ---      ----
3 2021-11-10  0.620185  ---  ---      ----
4 2021-11-11  0.299857  ---  ---      ----
So m and c will be constant for every row here.
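
One possible approach, not taken from the original thread, is an ordinary least-squares line fitted with numpy.polyfit. The sketch below assumes that x is the row position (days since the first date) and uses the five Sales values from the sample output above. The column name y_fit is my own choice.

```python
import numpy as np
import pandas as pd

historical = pd.DataFrame({
    "date": pd.date_range("2021-11-07", periods=5, freq="D"),
    "Sales": [0.210038, 0.918222, 0.202677, 0.620185, 0.299857],
})

# Assumption: x is the number of days since the first observation.
x = np.arange(len(historical))

# A degree-1 fit returns the slope first, then the intercept.
m, c = np.polyfit(x, historical["Sales"].to_numpy(), deg=1)

historical["m"] = m              # same value in every row
historical["c"] = c              # same value in every row
historical["y_fit"] = m * x + c  # fitted y = mx + c
print(historical)
```

If the dates are not evenly spaced, x could instead be derived from the date column, for example as the number of days elapsed since the first date.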

Add timedelta to a date column above weeks

How would I add 1 year to a column?
I've tried using map and apply but I failed miserably.
I also wonder why pl.date() accepts integers while it advertises that it only accepts str or pli.Expr.
A small hack workaround is:
col = pl.col('date').dt
df = df.with_column(pl.when(pl.col(column).is_not_null())
.then(pl.date(col.year() + 1, col.month(), col.day()))
.otherwise(pl.date(col.year() + 1,col.month(), col.day()))
.alias("date"))
but this won't work for months or days. I can't just add a number or I'll get a:
> thread '<unnamed>' panicked at 'invalid or out-of-range date', /github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/chrono-0.4.19/src/naive/date.rs:173:51
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Most likely because day and month cycle while year goes to infinity.
I could also do this:
df = df.with_column(
pl.when(col.month() == 1)
.then(pl.date(col.year(), 2, col.day()))
.when(col.month() == 2)
.then(pl.date(col.year(), 3, col.day()))
.when(col.month() == 3)
.then(pl.date(col.year(), 4, col.day()))
.when(col.month() == 4)
.then(pl.date(col.year(), 5, col.day()))
.when(col.month() == 5)
.then(pl.date(col.year(), 6, col.day()))
.when(col.month() == 6)
.then(pl.date(col.year(), 7, col.day()))
.when(col.month() == 7)
.then(pl.date(col.year(), 8, col.day()))
.when(col.month() == 8)
.then(pl.date(col.year(), 9, col.day()))
.when(col.month() == 9)
.then(pl.date(col.year(), 10, col.day()))
.when(col.month() == 10)
.then(pl.date(col.year(), 11, col.day()))
.when(col.month() == 11)
.then(pl.date(col.year(), 12, col.day()))
.otherwise(pl.date(col.year() + 1, 1, 1))
.alias("valid_from")
)

Polars supports addition and subtraction with Python's timedelta objects. For units larger than a week, however, things get more complicated, because months have different numbers of days and leap years have to be taken into account.
For this, polars has offset_by under the dt namespace.
from datetime import datetime
import polars as pl
(pl.DataFrame({
    "dates": pl.date_range(datetime(2000, 1, 1), datetime(2026, 1, 1), "1y")
}).with_columns([
    pl.col("dates").dt.offset_by("1y").alias("dates_and_1_yr")
]))
shape: (27, 2)
┌─────────────────────┬─────────────────────┐
│ dates               ┆ dates_and_1_yr      │
│ ---                 ┆ ---                 │
│ datetime[ns]        ┆ datetime[ns]        │
╞═════════════════════╪═════════════════════╡
│ 2000-01-01 00:00:00 ┆ 2001-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2001-01-01 00:00:00 ┆ 2002-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2002-01-01 00:00:00 ┆ 2003-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2003-01-01 00:00:00 ┆ 2004-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                 ┆ ...                 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-01 00:00:00 ┆ 2024-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2024-01-01 00:00:00 ┆ 2025-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2025-01-01 00:00:00 ┆ 2026-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2026-01-01 00:00:00 ┆ 2027-01-01 00:00:00 │
└─────────────────────┴─────────────────────┘

You can use apply together with dateutil.relativedelta, which works for years, months, days and much more, but can be slow for large amounts of data.
from datetime import date
from dateutil.relativedelta import relativedelta
import polars as pl
df = pl.DataFrame(pl.date_range(date(2019, 1, 1), date(2020, 10, 1), '3mo', name='date'))
df.with_column(pl.col('date').apply(lambda x: x + relativedelta(years=1)))
Update: Since the offset_by method is now also available for months, it should be used whenever possible (see accepted answer). I leave this answer here because the approach can be used for more complicated cases that are not supported by offset_by.

How do I calculate a 12-month return based on monthly observations within dataframe in Python?

How to calculate rolling cumulative product on Pandas DataFrame.
I have a time series of returns in a pandas DataFrame. How can I calculate a rolling annualized alpha for the relevant columns in the DataFrame? I would normally use Excel and do: =PRODUCT(1+[trailing 12 months])-1
My DataFrame looks like the below (a small portion):
                     Unnamed: 1  Unnamed: 2  Unnamed: 3  Unnamed: 4  \
2009-08-31 00:00:00         ---         ---      0.1489    0.072377
2009-09-30 00:00:00         ---         ---      0.0662    0.069608
2009-10-31 00:00:00         ---         ---     -0.0288   -0.016967
2009-11-30 00:00:00         ---         ---     -0.0089      0.0009
2009-12-31 00:00:00         ---         ---       0.044    0.044388
2010-01-31 00:00:00         ---         ---     -0.0301   -0.054953
2010-02-28 00:00:00         ---         ---     -0.0014     0.00821
2010-03-31 00:00:00         ---         ---      0.0405    0.049959
2010-04-30 00:00:00         ---         ---      0.0396   -0.007146
2010-05-31 00:00:00         ---         ---     -0.0736   -0.079834
2010-06-30 00:00:00         ---         ---     -0.0658   -0.028655
2010-07-31 00:00:00         ---         ---      0.0535    0.038826
2010-08-31 00:00:00         ---         ---     -0.0031   -0.013885
2010-09-30 00:00:00         ---         ---      0.0503    0.045781
2010-10-31 00:00:00         ---         ---      0.0499    0.025335
2010-11-30 00:00:00         ---         ---       0.012   -0.007495
I've tried the code below provided for a similar question, but it looks like it doesn't work anymore ...
import pandas as pd
import numpy as np
# your DataFrame; df = ...
pd.rolling_apply(df, 12, lambda x: np.prod(1 + x) - 1)
... and the pages that I'm redirected seem not to be as relevant.
Ideally, I'd like to reproduce the DataFrame but with 12 month returns, not monthly so I can locate the relevant 12 month return depending on the month.

If I understand correctly, you could try something like the below:
import pandas as pd
import numpy as np
# Dummy DataFrame of monthly growth factors (1 + monthly return)
df = pd.DataFrame(1 + np.random.rand(20), columns=['returns'])
# 12-month rolling return: product of the growth factors, minus 1
df_roll = df.rolling(window=12).apply(np.prod) - 1
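
Applied to plain monthly returns, as in the question, the same idea mirrors the Excel formula =PRODUCT(1+[trailing 12 months])-1. The sketch below assumes the returns are stored as decimals and uses the sixteen values of the "Unnamed: 3" column from the question. The column name fund and the monthly PeriodIndex are my own choices.

```python
import numpy as np
import pandas as pd

monthly = pd.DataFrame(
    {"fund": [0.1489, 0.0662, -0.0288, -0.0089, 0.044, -0.0301, -0.0014, 0.0405,
              0.0396, -0.0736, -0.0658, 0.0535, -0.0031, 0.0503, 0.0499, 0.012]},
    index=pd.period_range("2009-08", periods=16, freq="M"),
)

# Turn returns into growth factors, multiply each trailing 12-month window,
# then subtract 1 to get back to a return.
trailing_12m = (1 + monthly).rolling(window=12).apply(np.prod, raw=True) - 1
print(trailing_12m.dropna())
```

The first eleven rows are NaN because a full 12-month window is not yet available for them.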

Pandas set_Value with DatetimeIndex [Python]

I'm trying to add the row-wise result of a function to my DataFrame using df.set_value.
The df has the following format:
Count DTW
DateTime
2015-01-16 10 0
2015-01-17 28 0
Using df.set_value:
dw.set_Value(idx, 'col', dtw) # idx and dtw are int values
TypeError: cannot insert DatetimeIndex with incompatible label
How do I solve this error or what alternative method with comparable efficiency is there?

I think you have a Series, not a DataFrame, so use Series.set_value with the index label converted to datetime:
dw = pd.Series([-2374], index = [pd.to_datetime('2015-01-18')])
dw.index.name = 'DateTime'
print (dw)
DateTime
2015-01-18 -2374
dtype: int64
print (dw.set_value(pd.to_datetime('2015-01-19'), 1))
DateTime
2015-01-18 -2374
2015-01-19 1
dtype: int64
print (dw.set_value(pd.datetime(2015, 1, 19), 1))
DateTime
2015-01-18 -2374
2015-01-19 1
dtype: int64
A more standard way is to use ix or iloc:
print (dw)
Count DTW
DateTime
2015-01-16 10 0
2015-01-17 28 0
dw.ix[1, 'DTW'] = 10
#dw.DTW.iloc[1] = 10
print (dw)
Count DTW
DateTime
2015-01-16 10 0
2015-01-17 28 10
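
set_value and ix were deprecated and then removed in pandas 1.0, so the snippets above only run on older versions. Here is a minimal sketch of the same scalar assignments with .at and .iat, assuming a recent pandas version and the two-row frame from the question:

```python
import pandas as pd

df = pd.DataFrame(
    {"Count": [10, 28], "DTW": [0, 0]},
    index=pd.to_datetime(["2015-01-16", "2015-01-17"]),
)
df.index.name = "DateTime"

# Label-based scalar assignment: the row label must be a Timestamp,
# matching the DatetimeIndex.
df.at[pd.Timestamp("2015-01-17"), "DTW"] = 10

# Position-based scalar assignment: row 0, column position looked up by name.
df.iat[0, df.columns.get_loc("DTW")] = 5
print(df)
```

Both accessors are meant for single values; for setting several cells at once, .loc and .iloc are the usual choices.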
