Time series forecasting with a small dataset (few observations) using Python

I have a small dataset (9 data points), so I think it is hard to apply an ARIMA or SARIMA model (due to the few observations). How could I forecast the next 4 weeks?
Data:
{'Week': {0: Timestamp('2022-03-07 00:00:00'),
1: Timestamp('2022-03-14 00:00:00'),
2: Timestamp('2022-03-21 00:00:00'),
3: Timestamp('2022-03-28 00:00:00'),
4: Timestamp('2022-04-04 00:00:00'),
5: Timestamp('2022-04-11 00:00:00'),
6: Timestamp('2022-04-18 00:00:00'),
7: Timestamp('2022-04-25 00:00:00'),
8: Timestamp('2022-05-02 00:00:00')},
'Runs': {0: 191,
1: 190,
2: 198,
3: 245,
4: 179,
5: 219,
6: 195,
7: 159,
8: 177}}
Please take a look and give me some suggestions I can try!
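With only 9 points, heavily parameterized models like (S)ARIMA are hard to justify; simple baselines are often the honest choice. A minimal sketch of two such baselines on the data above (pure Python; the 4-week horizon and the 3-week window are assumptions, not from the question):

```python
from statistics import mean

runs = [191, 190, 198, 245, 179, 219, 195, 159, 177]
horizon = 4  # forecast the next 4 weeks

# Naive baseline: repeat the last observed value.
naive_forecast = [runs[-1]] * horizon

# Moving-average baseline: mean of the last 3 weeks.
ma_forecast = [mean(runs[-3:])] * horizon
```

With this little data it is also worth reporting a range (e.g. the min/max of recent weeks) rather than a single number, and revisiting a proper model once more observations accumulate.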

Related

Error: "cannot reindex from a duplicate axis" when looping sns relplots

I've got a DataFrame that is structured similarly to this one, but with more variables:
{'Date': {0: Timestamp('2021-01-01 00:00:00'),
1: Timestamp('2021-01-01 00:00:00'),
2: Timestamp('2021-01-01 00:00:00'),
3: Timestamp('2021-02-01 00:00:00'),
4: Timestamp('2021-02-01 00:00:00'),
5: Timestamp('2021-02-01 00:00:00'),
6: Timestamp('2021-03-01 00:00:00'),
7: Timestamp('2021-03-01 00:00:00'),
8: Timestamp('2021-03-01 00:00:00')},
'Share': {0: 'nflx',
1: 'aapl',
2: 'amzn',
3: 'nflx',
4: 'aapl',
5: 'amzn',
6: 'nflx',
7: 'aapl',
8: 'amzn'},
'x': {0: 534,
1: 126,
2: 3270,
3: 590,
4: 172,
5: 3059,
6: 552,
7: 160,
8: 3462}}
I'm trying to loop sns relplots but I'm getting the error "cannot reindex from a duplicate axis".
Code I've tried:
for i in [df["x"], df["y"], df["z"]]:
    sns.relplot(data=df.reset_index(),
                x="Date",
                y=i,
                hue="Share",
                kind="line",
                height=10,
                aspect=1.7).savefig("{i}.png");
I know the error tells me that the index has duplicates, but to my knowledge that should be temporarily fixed by reset_index(). Moreover, I think I have to keep the date variable in the index in order to do some of the variable-specific calculations. Is the issue in the plot code, or what's the solution?
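Passing the Series itself as y=i keeps the original (duplicated) index, while data=df.reset_index() gets a fresh one, so seaborn's internal alignment hits a duplicate axis. A minimal pandas reproduction of the underlying failure, with the usual fix sketched in the comments (assuming columns x/y/z exist):

```python
import pandas as pd

# The same failure in miniature: reindexing against an index with duplicates.
s = pd.Series([534, 126, 3270], index=[0, 0, 1])
try:
    s.reindex([0, 1, 2])
    error = ""
except ValueError as exc:
    error = str(exc)

# The usual fix is to loop over column *names* and let seaborn resolve them
# against `data`, e.g.:
#     for col in ["x", "y", "z"]:
#         sns.relplot(data=df, x="Date", y=col, hue="Share",
#                     kind="line").savefig(f"{col}.png")
# Note the f-string: the original "{i}.png" writes a literal file name.
```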

Filtering Dataframe based on Multiple Date Conditions

I'm working with the following DataFrame:
id slotTime EDD EDD-10M
0 1000000101068957 2021-05-12 2021-12-26 2021-02-26
1 1000000100849718 2021-03-20 2021-04-05 2020-06-05
2 1000000100849718 2021-03-20 2021-04-05 2020-06-05
3 1000000100849718 2021-03-20 2021-04-05 2020-06-05
4 1000000100849718 2021-03-20 2021-04-05 2020-06-05
I would like to keep only the rows where slotTime is between EDD-10M and EDD:
df['EDD-10M'] < df['slotTime'] < df['EDD']
I have tried using the following method:
df.loc[df[df['slotTime'] < df['EDD']] & df[df['EDD-10M'] < df['slotTime']]]
However it yields the following error
TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Please advise.
To replicate the above DataFrame use the below snippet:
import pandas as pd
from pandas import Timestamp
df = {
'id': {0: 1000000101068957,
1: 1000000100849718,
2: 1000000100849718,
3: 1000000100849718,
4: 1000000100849718,
5: 1000000100849718,
6: 1000000100849718,
7: 1000000100849718,
8: 1000000100849718,
9: 1000000100849718},
'EDD': {0: Timestamp('2021-12-26 00:00:00'),
1: Timestamp('2021-04-05 00:00:00'),
2: Timestamp('2021-04-05 00:00:00'),
3: Timestamp('2021-04-05 00:00:00'),
4: Timestamp('2021-04-05 00:00:00'),
5: Timestamp('2021-04-05 00:00:00'),
6: Timestamp('2021-04-05 00:00:00'),
7: Timestamp('2021-04-05 00:00:00'),
8: Timestamp('2021-04-05 00:00:00'),
9: Timestamp('2021-04-05 00:00:00')},
'EDD-10M': {0: Timestamp('2021-02-26 00:00:00'),
1: Timestamp('2020-06-05 00:00:00'),
2: Timestamp('2020-06-05 00:00:00'),
3: Timestamp('2020-06-05 00:00:00'),
4: Timestamp('2020-06-05 00:00:00'),
5: Timestamp('2020-06-05 00:00:00'),
6: Timestamp('2020-06-05 00:00:00'),
7: Timestamp('2020-06-05 00:00:00'),
8: Timestamp('2020-06-05 00:00:00'),
9: Timestamp('2020-06-05 00:00:00')},
'slotTime': {0: Timestamp('2021-05-12 00:00:00'),
1: Timestamp('2021-03-20 00:00:00'),
2: Timestamp('2021-03-20 00:00:00'),
3: Timestamp('2021-03-20 00:00:00'),
4: Timestamp('2021-03-20 00:00:00'),
5: Timestamp('2021-03-20 00:00:00'),
6: Timestamp('2021-03-20 00:00:00'),
7: Timestamp('2021-03-20 00:00:00'),
8: Timestamp('2021-03-20 00:00:00'),
9: Timestamp('2021-03-20 00:00:00')}}
df = pd.DataFrame(df)
You just need to group your conditions:
df[(df['slotTime'] < df['EDD']) & (df['EDD-10M'] < df['slotTime'])]
Otherwise operator precedence evaluates & first and it all falls apart.
Alternatively you may wish to use the .between() method (assuming you have datetime series); note that the lower bound comes first and that between() is inclusive of both bounds by default:
df[df['slotTime'].between(df['EDD-10M'], df['EDD'])]
You can use the between() method as already answered, or try it like this:
df.loc[(df['EDD-10M'] < df['slotTime']) & (df['slotTime'] < df['EDD'])]
You should use ( and ) around multiple conditions.
You can do this by using query:
df.query("(slotTime < EDD) & (`EDD-10M` < slotTime)")
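As a sanity check, the three approaches agree on a reduced version of the sample data (a sketch; between() is inclusive of both bounds, unlike the strict < comparisons, which does not matter here because no slotTime equals a bound):

```python
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({
    'slotTime': [Timestamp('2021-05-12'), Timestamp('2021-03-20')],
    'EDD':      [Timestamp('2021-12-26'), Timestamp('2021-04-05')],
    'EDD-10M':  [Timestamp('2021-02-26'), Timestamp('2020-06-05')],
})

# Parenthesized boolean mask, between(), and query() with a backticked column.
masked  = df[(df['slotTime'] < df['EDD']) & (df['EDD-10M'] < df['slotTime'])]
between = df[df['slotTime'].between(df['EDD-10M'], df['EDD'])]
queried = df.query("(slotTime < EDD) & (`EDD-10M` < slotTime)")
```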

Create a temporal network with teneto (Python)?

I have the following dataframe (here is a portion of it):
{'date': {0: Timestamp('2020-10-03 00:00:00'),
1: Timestamp('2020-10-03 00:00:00'),
2: Timestamp('2020-10-03 00:00:00'),
3: Timestamp('2020-10-03 00:00:00'),
4: Timestamp('2020-10-24 00:00:00'),
5: Timestamp('2020-10-24 00:00:00'),
6: Timestamp('2020-10-24 00:00:00'),
7: Timestamp('2020-10-24 00:00:00'),
8: Timestamp('2020-10-25 00:00:00'),
9: Timestamp('2020-10-25 00:00:00')},
'from': {0: 7960001,
1: 25500005,
2: 4660001,
3: 91000032,
4: 280001,
5: 26100016,
6: 30001114,
7: 12000016,
8: 79000523,
9: 74000114},
'to': {0: 30000934,
1: 74000351,
2: 4660001,
3: 91000031,
4: 66000413,
5: 26100022,
6: 26100024,
7: 12000016,
8: 79000321,
9: 74000122},
'weight': {0: 17.1,
1: 15.0,
2: 931.6,
3: 145.9,
4: 29.3,
5: 25.8,
6: 15.0,
7: 132.4,
8: 51.5,
9: 492.9}}
And I want to build a temporal network out of this time series - graph/network data.
I would like to build a network with respect to time + clusters.
Here is what I am trying to do:
df is the dataframe above
import teneto
t = list(df.index())
netin = {'i': df['from'], 'j': df['to'], 't': t, 'weight': df['weight']}
df = pd.DataFrame(data=netin)
tnet = TemporalNetwork(from_df=df)
tnet.network
Keep getting:
TypeError: 'RangeIndex' object is not callable
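The TypeError comes from calling df.index as if it were a method; it is an attribute, so drop the parentheses. A minimal sketch of building the netin frame (pure pandas; the teneto call itself is kept as a comment, and whether the raw IDs need remapping to contiguous integer node indices for teneto is an assumption worth checking in its docs):

```python
import pandas as pd

df = pd.DataFrame({
    'from':   [7960001, 25500005, 4660001],
    'to':     [30000934, 74000351, 4660001],
    'weight': [17.1, 15.0, 931.6],
})

t = list(df.index)  # .index is an attribute, not a callable
netin = pd.DataFrame({'i': df['from'], 'j': df['to'], 't': t, 'weight': df['weight']})
# Then, per the question: tnet = teneto.TemporalNetwork(from_df=netin)
```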

How to pretty print labels on chart

I have this dataframe:
{'date': {0: Timestamp('2019-10-31 00:00:00'),
1: Timestamp('2019-11-30 00:00:00'),
2: Timestamp('2019-12-31 00:00:00'),
3: Timestamp('2020-01-31 00:00:00'),
4: Timestamp('2020-02-29 00:00:00'),
5: Timestamp('2020-03-31 00:00:00'),
6: Timestamp('2020-04-30 00:00:00'),
7: Timestamp('2020-05-31 00:00:00'),
8: Timestamp('2020-06-30 00:00:00'),
9: Timestamp('2020-07-31 00:00:00'),
10: Timestamp('2020-08-31 00:00:00')},
'rate': {0: 100.0,
1: 99.04595078851037,
2: 101.09797599729458,
3: 102.29581878702609,
4: 104.72409825791058,
5: 109.45297539163114,
6: 118.24943699089361,
7: 119.65432196709045,
8: 117.82108184647535,
9: 118.6223497519237,
10: 120.32838345607335}}
When I plot it, I get a clogged x axis.
How do I format it so that I get month name and year on the x axis, for instance Nov, 19?
I am using this to plot
chart = sns.lineplot('date', 'rate', data=cdf,marker="o")
If I add more data points, it doesn't display them even if I change the size:
Data:
{'date': {0: Timestamp('2019-01-31 00:00:00'),
1: Timestamp('2019-02-28 00:00:00'),
2: Timestamp('2019-03-31 00:00:00'),
3: Timestamp('2019-04-30 00:00:00'),
4: Timestamp('2019-05-31 00:00:00'),
5: Timestamp('2019-06-30 00:00:00'),
6: Timestamp('2019-07-31 00:00:00'),
7: Timestamp('2019-08-31 00:00:00'),
8: Timestamp('2019-09-30 00:00:00'),
9: Timestamp('2019-10-31 00:00:00'),
10: Timestamp('2019-11-30 00:00:00'),
11: Timestamp('2019-12-31 00:00:00'),
12: Timestamp('2020-01-31 00:00:00'),
13: Timestamp('2020-02-29 00:00:00'),
14: Timestamp('2020-03-31 00:00:00'),
15: Timestamp('2020-04-30 00:00:00'),
16: Timestamp('2020-05-31 00:00:00'),
17: Timestamp('2020-06-30 00:00:00'),
18: Timestamp('2020-07-31 00:00:00')},
'rate': {0: 100.0,
1: 98.1580633244672,
2: 102.03029115707123,
3: 107.12429902683576,
4: 112.60187555657997,
5: 108.10306860500229,
6: 105.35473845070196,
7: 105.13286204895526,
8: 106.11760178061557,
9: 107.76819930852,
10: 106.77041938461862,
11: 108.84840098309556,
12: 110.29751856107903,
13: 112.93762886874026,
14: 118.04947620270883,
15: 127.80912766377679,
16: 128.90556903738158,
17: 126.96768455091889,
18: 127.95060601426769}}
I have posted the updated data.
from pandas import Timestamp
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(
{'date':
{0: Timestamp('2019-10-31 00:00:00'),
1: Timestamp('2019-11-30 00:00:00'),
2: Timestamp('2019-12-31 00:00:00'),
3: Timestamp('2020-01-31 00:00:00'),
4: Timestamp('2020-02-29 00:00:00'),
5: Timestamp('2020-03-31 00:00:00'),
6: Timestamp('2020-04-30 00:00:00'),
7: Timestamp('2020-05-31 00:00:00'),
8: Timestamp('2020-06-30 00:00:00'),
9: Timestamp('2020-07-31 00:00:00'),
10: Timestamp('2020-08-31 00:00:00')},
'rate':
{0: 100.0,
1: 99.04595078851037,
2: 101.09797599729458,
3: 102.29581878702609,
4: 104.72409825791058,
5: 109.45297539163114,
6: 118.24943699089361,
7: 119.65432196709045,
8: 117.82108184647535,
9: 118.6223497519237,
10: 120.32838345607335}})
df['datelabel'] = df['date'].apply(lambda x: x.strftime('%b %d'))
chart = sns.lineplot('date', 'rate', data=df,marker="o")
chart.set_xticklabels(df.datelabel, rotation=45)
plt.show()
Here's one approach:
Build a lambda function to apply strftime with our target representation over each record in date.
For date formatting, see: https://strftime.org/
%b - Month as locale's abbreviated name.
%d - Day of the month as a zero-padded decimal number.
We can save it as a set of reference labels to be applied to the chart via set_xticklabels.
Additionally, you can rotate the labels with the rotation parameter.
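A fixed label list breaks as soon as the number of ticks and the number of rows diverge (the follow-up problem above). A sketch of a matplotlib-native alternative that ties formatting to the date axis itself, so it scales to any number of points (the "Nov, 19" format is taken from the question; the Agg backend is only for running headless):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
from datetime import date

dates = [date(2019, 10, 31), date(2019, 11, 30), date(2019, 12, 31)]
rates = [100.0, 99.05, 101.10]

fig, ax = plt.subplots()
ax.plot(dates, rates, marker='o')
ax.xaxis.set_major_locator(mdates.MonthLocator())             # one tick per month
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b, %y'))  # e.g. "Nov, 19"
fig.autofmt_xdate(rotation=45)
```

The same two set_major_* calls work on the Axes returned by sns.lineplot.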

koalas groupby -> apply returns 'cannot insert "key", already exists'

I've been struggling with this issue and haven't been able to solve it. I have the following dataframe:
from pandas import Timestamp
import databricks.koalas as ks
x = ks.DataFrame.from_records(
{'ds': {0: Timestamp('2018-10-06 00:00:00'),
1: Timestamp('2017-06-08 00:00:00'),
2: Timestamp('2018-10-22 00:00:00'),
3: Timestamp('2017-02-08 00:00:00'),
4: Timestamp('2019-02-03 00:00:00'),
5: Timestamp('2019-02-26 00:00:00'),
6: Timestamp('2017-04-15 00:00:00'),
7: Timestamp('2017-07-02 00:00:00'),
8: Timestamp('2017-04-04 00:00:00'),
9: Timestamp('2017-03-20 00:00:00'),
10: Timestamp('2018-06-09 00:00:00'),
11: Timestamp('2017-01-15 00:00:00'),
12: Timestamp('2018-05-07 00:00:00'),
13: Timestamp('2018-01-17 00:00:00'),
14: Timestamp('2017-07-11 00:00:00'),
15: Timestamp('2018-12-17 00:00:00'),
16: Timestamp('2018-12-05 00:00:00'),
17: Timestamp('2017-05-22 00:00:00'),
18: Timestamp('2017-08-13 00:00:00'),
19: Timestamp('2018-05-21 00:00:00')},
'store': {0: 81,
1: 128,
2: 81,
3: 128,
4: 25,
5: 128,
6: 11,
7: 124,
8: 43,
9: 25,
10: 25,
11: 124,
12: 124,
13: 128,
14: 81,
15: 11,
16: 124,
17: 11,
18: 167,
19: 128},
'stock': {0: 1,
1: 236,
2: 3,
3: 9,
4: 36,
5: 78,
6: 146,
7: 20,
8: 12,
9: 12,
10: 15,
11: 25,
12: 10,
13: 7,
14: 0,
15: 230,
16: 80,
17: 6,
18: 110,
19: 8},
'sells': {0: 1.0,
1: 17.0,
2: 1.0,
3: 2.0,
4: 1.0,
5: 2.0,
6: 7.0,
7: 1.0,
8: 1.0,
9: 1.0,
10: 2.0,
11: 1.0,
12: 1.0,
13: 1.0,
14: 1.0,
15: 1.0,
16: 1.0,
17: 3.0,
18: 2.0,
19: 1.0}}
)
and this function that I want to use in a groupby - apply:
import numpy as np
def compute_indicator(df):
return (
df.copy()
.assign(
indicator=lambda x: x['a'] < np.percentile(x['b'], 80)
)
.astype(int)
.fillna(1)
)
Here df is meant to be a pandas DataFrame. If I do a groupby-apply using pandas, the code executes as expected:
import pandas as pd
# This runs
a = pd.DataFrame.from_dict(x.to_dict()).groupby('store').apply(compute_indicator)
but when trying to run the same on koalas it gives me the following error: ValueError: cannot insert store, already exists
x.groupby('store').apply(compute_indicator)
# ValueError: cannot insert store, already exists
I cannot use the typing annotation in compute_indicator because some columns are not fixed (they travel around with the dataframe, meant to be used by another transformations).
What should I do to run the code in koalas?
As of Koalas 0.29.0, when koalas.DataFrame.groupby(keys).apply(f) runs for the first time over an untyped function f, it has to infer the schema, and to do this it runs pandas.DataFrame.head(n).groupby(keys).apply(f). The problem is that pandas apply receives as its argument the dataframe with the groupby keys both as index and as columns (see this issue).
The result of pandas.DataFrame.head(n).groupby(keys).apply(f) is then converted to a koalas.DataFrame, so if f doesn't drop the key columns, this conversion raises an exception because of duplicated column names (see issue).
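Given that diagnosis, the usual workaround is to drop the groupby key inside the function before returning. A pandas-only sketch of the shape of that fix (the column names 'stock'/'sells' stand in for the anonymized 'a'/'b' of the question, so the indicator logic here is an assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'store': [81, 81, 128, 128],
                   'stock': [1, 3, 236, 9],
                   'sells': [1.0, 1.0, 17.0, 2.0]})

def compute_indicator(g):
    # Drop the groupby key so the returned frame carries no 'store' column;
    # the conversion back to koalas then has nothing to collide with.
    # errors='ignore' keeps this safe on pandas versions that already
    # exclude the key column from the group passed to apply().
    return (
        g.drop(columns='store', errors='ignore')
        .assign(indicator=lambda x: x['stock'] < np.percentile(x['sells'], 80))
        .astype(int)
        .fillna(1)
    )

out = df.groupby('store').apply(compute_indicator)
```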
