Extrapolation of dataset based on index - python

I am trying to extrapolate my dataset. A snippet looks as follows. A simple linear extrapolation is fine here:
Index Value
3000 NaN
4000 NaN
5000 10
6000 20
6500 33
7000 44
8300 60
9300 NaN
9400 NaN
The extrapolation should consider the index values. As the pandas package only provides a function for interpolation, I am stuck. I looked at the scipy package, but can't seem to implement my idea. Would really appreciate any help.

I'm more familiar with scikit-learn:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

df = pd.DataFrame([(3000, np.nan),
                   (4000, np.nan),
                   (5000, 10),
                   (6000, 20),
                   (6500, 33),
                   (7000, 44),
                   (8300, 60),
                   (9300, np.nan),
                   (9400, np.nan)], columns=['Index', 'Value'])

def extrapolate(df, X_col, y_col):
    # Fit a linear model on the rows where y is known, then predict for all rows
    df_ = df[[X_col, y_col]].dropna()
    return LinearRegression().fit(
        df_[X_col].values.reshape(-1, 1), df_[y_col]).predict(
        df[X_col].values.reshape(-1, 1))

df['Value_'] = extrapolate(df, 'Index', 'Value')
df
You should obtain something like this:
Index Value Value_
0 3000 NaN -23.219022
1 4000 NaN -7.314802
2 5000 10.0 8.589417
3 6000 20.0 24.493637
4 6500 33.0 32.445747
5 7000 44.0 40.397857
6 8300 60.0 61.073342
7 9300 NaN 76.977562
8 9400 NaN 78.567984
# I assume you don't want to overwrite the original values
df['Value'] = df['Value'].fillna(df['Value_'])
df
Gives:
Index Value Value_
0 3000 -23.219022 -23.219022
1 4000 -7.314802 -7.314802
2 5000 10.000000 8.589417
3 6000 20.000000 24.493637
4 6500 33.000000 32.445747
5 7000 44.000000 40.397857
6 8300 60.000000 61.073342
7 9300 76.977562 76.977562
8 9400 78.567984 78.567984
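Since the question mentions SciPy, here is a minimal alternative sketch with scipy.interpolate.interp1d, run against the original frame before the fillna step above. Note that interp1d extrapolates linearly from the two nearest edge points rather than fitting all points by least squares, so the edge values will differ from the regression result; the Value_scipy column name is just for illustration:
from scipy.interpolate import interp1d

known = df.dropna(subset=['Value'])  # rows where Value is observed
f = interp1d(known['Index'], known['Value'], fill_value='extrapolate')
df['Value_scipy'] = f(df['Index'])   # linear inside and outside the known range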

Related

Merge several Dataframes with outside temperature and power generation

I have several dataframes of heating devices, each containing data over 1 year. One time step is 15 min, and each df has two columns: outside_temp and heat_production. Each df looks like this:
outside_temp heat_production
0 11.1 200
1 11.1 150
2 11.0 245
3 11.0 0
4 11.0 300
5 10.9 49
...
35037 -5.1 450
35038 -5.1 450
35039 -5.1 450
35040 -5.2 600
I now want to know, for all heating devices (and therefore for all dataframes), how much heat_production is needed at which outside_temp. I was thinking about groupby or something else, but I don't know the best way to handle this amount of data. When directly merging the dfs there is the problem that the same outside temperature appears several times and the heat production of course differs. To solve this, I could imagine taking the average heat_production for each device at a given outside_temp. Of course it can also be the case that a device never measured a specific temperature (e.g. the device is located in a warmer or colder area), so NaN values are possible.
In the end I want to fit a kind of polynomial/sigmoid function to see how much heat_production is necessary at a given outside temperature.
I want to end up with a dataframe like this:
outside_temp heat_production_average_device_1 heat_production_average_device_2 ...etc
-20.0 790 NaN
-19.9 789 NaN
-19.8 788 790
-19.7 NaN 780
-19.6 770 NaN
.
.
.
19.6 34 0
19.7 32 0
19.8 30 0
19.9 32 0
20.0 0 0
Any idea what's the best way to do so?
Given:
>>> df1
outside_temp heat_production
0 11.1 200
1 11.1 150
2 11.0 245
>>> df2
outside_temp heat_production
3 11.0 0
4 11.0 300
5 10.9 49
Doing:
def my_func(i, df):
    # Average heat production per outside temperature, one column per device
    renamer = {'heat_production': f'heat_production_average_device_{i}'}
    return (df.groupby('outside_temp')
              .mean()
              .rename(columns=renamer))

dfs = [df1, df2]
dfs = [my_func(i + 1, df) for i, df in enumerate(dfs)]
df = pd.concat(dfs, axis=1)
print(df)
Output:
heat_production_average_device_1 heat_production_average_device_2
outside_temp
11.0 245.0 150.0
11.1 175.0 NaN
10.9 NaN 49.0
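The question also asks, at the end, for a polynomial/sigmoid fit of heat_production against outside_temp. A hedged sketch with numpy.polyfit (the degree is an arbitrary choice, and the full-year data is assumed since the toy frame above has too few points; a sigmoid would instead need something like scipy.optimize.curve_fit):
import numpy as np

deg = 3  # arbitrary polynomial degree
for col in df.columns:
    mask = df[col].notna()  # polyfit cannot handle NaNs
    coeffs = np.polyfit(df.index[mask], df.loc[mask, col], deg)
    print(col, np.poly1d(coeffs))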

How do I use Pandas to interpolate on the first few rows of a dataframe?

I've got some astrophysics data with unfilled rows, which I'm trying to fill using the pandas interpolate method. It works on the np.NaN values found everywhere except in the first three rows: there it fills the values, but instead of a linear fill, it just copies in the values of the fourth row. The first chunk of the dataframe (called avgdf_final) looks like this:
day lon lat rad
nums
0 319.0 NaN NaN NaN
1 320.0 NaN NaN NaN
2 321.0 NaN NaN NaN
3 322.0 56.485 2.7800 1.158
4 323.0 43.300 2.6800 1.166
5 324.0 30.100 2.5775 1.174
I've tried this (with like a million different minor variations) to no avail:
avgdf_final.interpolate(limit_direction='backward')
Every time, I get this result:
day lon lat rad
nums
0 319.0 56.485 2.7800 1.158
1 320.0 56.485 2.7800 1.158
2 321.0 56.485 2.7800 1.158
3 322.0 56.485 2.7800 1.158
4 323.0 43.300 2.6800 1.166
5 324.0 30.100 2.5775 1.174
Clearly, this isn't interpolated: it's just the same rows pasted in again. What can I do to make this work? Thanks in advance for any replies!!
Interpolation with "linear" requires the missing values to lie between existing data points, which is not the case here (you actually want to extrapolate).
You could try to use a spline:
df2 = df.interpolate(method='spline', limit_direction='backward', order=1)
See the interpolate documentation for other methods.
Output:
day lon lat rad
0 319.0 96.0650 3.084167 1.134
1 320.0 82.8725 2.982917 1.142
2 321.0 69.6800 2.881667 1.150
3 322.0 56.4850 2.780000 1.158
4 323.0 43.3000 2.680000 1.166
5 324.0 30.1000 2.577500 1.174
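For reference, a self-contained version of the above (rebuilding the question's frame by hand, without the nums index name):
import numpy as np
import pandas as pd

df = pd.DataFrame({'day': [319.0, 320.0, 321.0, 322.0, 323.0, 324.0],
                   'lon': [np.nan, np.nan, np.nan, 56.485, 43.3, 30.1],
                   'lat': [np.nan, np.nan, np.nan, 2.78, 2.68, 2.5775],
                   'rad': [np.nan, np.nan, np.nan, 1.158, 1.166, 1.174]})
df2 = df.interpolate(method='spline', limit_direction='backward', order=1)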

Why are all elements NaN when constructing a MultiIndex DataFrame?

Suppose I have a DataFrame like this, which I want to convert to a 2-level MultiIndex DataFrame.
dt st close volume
0 20100101 000001.sz 1 10000
1 20100101 000002.sz 10 50000
2 20100101 000003.sz 5 1000
3 20100101 000004.sz 15 7000
4 20100101 000005.sz 100 100000
5 20100102 000001.sz 2 20000
6 20100102 000002.sz 20 60000
7 20100102 000003.sz 6 2000
8 20100102 000004.sz 20 8000
9 20100102 000005.sz 110 110000
But when I try this code:
data = pd.read_csv('data/trial.csv')
print(data)

idx = pd.MultiIndex.from_product([data.dt.unique(),
                                  data.st.unique()],
                                 names=['dt', 'st'])
col = ['close', 'volume']
df = pd.DataFrame(data, idx, col)
print(df)
I find that all the elements are NaN:
close volume
dt st
20100101 000001.sz NaN NaN
000002.sz NaN NaN
000003.sz NaN NaN
000004.sz NaN NaN
000005.sz NaN NaN
20100102 000001.sz NaN NaN
000002.sz NaN NaN
000003.sz NaN NaN
000004.sz NaN NaN
000005.sz NaN NaN
How can I handle this situation? Thanks.
You only need the index_col parameter of read_csv:
#by positions of columns
data = pd.read_csv('data/trial.csv', index_col=[0,1])
Or:
#by names of columns
data = pd.read_csv('data/trial.csv', index_col=['dt', 'st'])
print (data)
close volume
dt st
20100101 000001.sz 1 10000
000002.sz 10 50000
000003.sz 5 1000
000004.sz 15 7000
000005.sz 100 100000
20100102 000001.sz 2 20000
000002.sz 20 60000
000003.sz 6 2000
000004.sz 20 8000
000005.sz 110 110000
Why are all elements NaN when constructing a MultiIndex DataFrame?
The reason is in the DataFrame constructor:
df = pd.DataFrame(data, idx, col)
The DataFrame called data has a RangeIndex, which does not align with the new MultiIndex, so you get NaNs in the result.
A possible solution, if each dt always has the same st values, is to filter the DataFrame by column names and then convert it to a NumPy array (the values carry no index, so nothing is aligned), but the index_col and set_index solutions are better:
df = pd.DataFrame(data[col].values, idx, col)
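To see the alignment behaviour in isolation, here is a tiny hypothetical example: constructing a DataFrame from another DataFrame reindexes by label, and labels missing from the source come out as NaN.
small = pd.DataFrame({'x': [1, 2]})    # RangeIndex: 0, 1
pd.DataFrame(small, index=['a', 'b'])  # 'a' and 'b' are not in small's index
#      x
# a  NaN
# b  NaN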
Try using set_index() like this:
new_df = df.set_index(['dt', 'st'])
Result:
>>> new_df
close volume
dt st
20100101 000001.sz 1 10000
000002.sz 10 50000
000003.sz 5 1000
000004.sz 15 7000
000005.sz 100 100000
20100102 000001.sz 2 20000
000002.sz 20 60000
000003.sz 6 2000
000004.sz 20 8000
000005.sz 110 110000
>>> new_df.index
MultiIndex(levels=[[20100101, 20100102], ['000001.sz', '000002.sz', '000003.sz', '000004.sz', '000005.sz']],
labels=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]],
names=['dt', 'st'])

Python Pandas: Custom rolling window calculation

I'm looking to take the most recent value in a rolling window and divide it by the mean of all numbers in said window.
What I tried:
df.a.rolling(window=7).mean()/df.a[-1]
This doesn't work because df.a[-1] is always the most recent of the entire dataset. I need the last value of the window.
I've done a ton of searching today. I may be searching the wrong terms, or not understanding the results, because I have not gotten anything useful.
Any pointers would be appreciated.
Aggregation (e.g. mean()) on a rolling window returns a pandas Series with the same index as the original column. You can simply aggregate over the rolling window and then combine the result element-wise with the original column.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(30), columns=['A'])
df
# returns:
A
0 0
1 1
2 2
...
27 27
28 28
29 29
You can use a rolling mean to get a series with the same index.
df.A.rolling(window=7).mean()
# returns:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 3.0
7 4.0
...
26 23.0
27 24.0
28 25.0
29 26.0
Because it has the same index, you can simply divide by df.A to get your desired results.
df.A.rolling(window=7).mean() / df.A
# returns:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 0.500000
7 0.571429
8 0.625000
9 0.666667
10 0.700000
11 0.727273
12 0.750000
13 0.769231
14 0.785714
15 0.800000
16 0.812500
17 0.823529
18 0.833333
19 0.842105
20 0.850000
21 0.857143
22 0.863636
23 0.869565
24 0.875000
25 0.880000
26 0.884615
27 0.888889
28 0.892857
29 0.896552
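Note that the question as worded asks for the most recent value divided by the window mean, which is the inverse of the ratio above. You can invert it directly, or compute it in a single pass with a custom window function (a sketch):
# Inverse ratio: last value of each window divided by the window mean
df.A / df.A.rolling(window=7).mean()

# Equivalent single pass with a custom function applied to each window
df.A.rolling(window=7).apply(lambda w: w[-1] / w.mean(), raw=True)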

Python pandas rolling.apply two time-series input into function

I have a DatetimeIndex indexed dataframe with two columns. The index is uneven.
A B
Date
2016-01-04 1 20
2016-01-12 2 10
2016-01-21 3 10
2016-01-25 2 20
2016-02-08 2 30
2016-02-15 1 20
2016-02-21 3 20
2016-02-25 2 20
I want to compute the dot product of time-series A and B over a rolling window of length 20 days.
It should return:
dot
Date
2016-01-04 NaN
2016-01-12 NaN
2016-01-21 NaN
2016-01-25 90
2016-02-08 130
2016-02-15 80
2016-02-21 140
2016-02-25 180
Here is how these are obtained:
90 = 2*10+3*10+2*20 (product obtained in the period from 2016-01-06 to 2016-01-25 included)
130 = 3*10+2*20+2*30 (product obtained in period from 2016-01-20 to 2016-02-08)
80 = 1*20+2*30 (product obtained in period from 2016-01-27 to 2016-02-15)
140 = 3*20+1*20+2*30 (product obtained in period from 2016-02-02 to 2016-02-21)
180 = 2*20+3*20+1*20+2*30 (product obtained in period from 2016-02-06 to 2016-02-25)
The dot product is an example that should be generalizable to any function taking two series and returning a value.
I think this should work: df.product() across rows, then df.rolling(period).sum().
Dates = pd.to_datetime(['2016-01-04',
                        '2016-01-12',
                        '2016-01-21',
                        '2016-01-25',
                        '2016-02-08',
                        '2016-02-15',
                        '2016-02-21',
                        '2016-02-25',
                        '2016-02-26'])
data = {'A': [i * 10 for i in range(1, 10)], 'B': [i for i in range(1, 10)]}
df1 = pd.DataFrame(data=data, index=Dates)
df2 = df1.product(axis=1).rolling(3).sum()
df2.name = 'Dot'  # df2 is a Series, so it has a name, not columns
df2
Output:
2016-01-04 NaN
2016-01-12 NaN
2016-01-21 140.0
2016-01-25 290.0
2016-02-08 500.0
2016-02-15 770.0
2016-02-21 1100.0
2016-02-25 1490.0
2016-02-26 1940.0
Name: Dot, dtype: float64
And if your data is daily and you first want to aggregate it into 20-day chunks, group by 20 days and sum (or take the last value, according to what you want):
Dates1 = pd.date_range(start='2016-03-31', end='2016-07-31')
data1 = {'A': [np.pi * i * np.random.rand()
               for i in range(1, len(Dates1) + 1)],
         'B': [i * np.random.randn() * 10
               for i in range(1, len(Dates1) + 1)]}
df3 = pd.DataFrame(data=data1, index=Dates1)
# pd.TimeGrouper is deprecated; pd.Grouper is the current equivalent
df3.groupby(pd.Grouper(freq='20d')).sum()
A B
2016-03-31 274.224084 660.144639
2016-04-20 1000.456615 -2403.034012
2016-05-10 1872.422495 -1737.571080
2016-05-30 2121.497529 1157.710510
2016-06-19 3084.569208 -1854.258668
2016-07-09 3324.775922 -9743.113805
2016-07-29 505.162678 -1179.730820
and then use the dot product like I did above.
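For the dot product specifically there is also a direct time-based route: with a DatetimeIndex, pandas accepts an offset string as the rolling window, so no fixed row count is needed. A minimal sketch against the question's frame (this shortcut only works for functions that, like the dot product, reduce to an element-wise step followed by a rolling reduction):
# '20D' looks back 20 calendar days from each row's timestamp
dot = (df['A'] * df['B']).rolling('20D').sum()
# Early rows get partial-window sums instead of NaN; raise min_periods
# (or mask rows within 20 days of the start) if you need NaN there.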
