I'm using scipy interpolate to find a smooth approximation of different columns of a fairly large data frame (around 240,000 rows), and then find the slope at interval midpoints, like so:
tck = sc.interpolate.splrep(data['Time'], np.array(data[columname]), s=3)
slope = sc.interpolate.splev(interval_midpoints_array, tck, der=1)
For some columns this works well and very fast, but for others it takes what seems to me a long time (at least 15-20 minutes before I gave up).
I read here that the univariate spline has speed problems with NaN values; however, I checked and there are no NaN values in my data frame. I also tried increasing the s value, but it had no significant effect on the time.
Is there a way to speed up the interpolation? Or a better way to approximate these slope values that is faster? Am I just impatient?
Is there a way to characterize data sets on which different approaches are faster?
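In case it helps, the simplest alternative I can think of is plain finite differences (a rough sketch, with placeholder names standing in for my real columns):
import numpy as np
t = data['Time'].to_numpy()
y = data['Fast'].to_numpy()
# slope of each interval, reported at the interval midpoint
midpoints = (t[:-1] + t[1:]) / 2
slopes = np.diff(y) / np.diff(t)
But that gives no smoothing at all, which is why I reached for splrep/splev with s=3 in the first place.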
Edit: added some sample data below
AT.data['Fast'][0:25]
0 0.9531
1 0.9536
2 0.9557
3 0.9578
4 0.9599
5 0.9625
6 0.9538
7 0.9143
8 0.9429
9 0.9773
10 0.9802
11 0.9831
12 0.9846
13 0.9849
14 0.9849
15 0.9826
16 0.9811
17 0.9791
18 0.9780
19 0.9773
20 0.9758
21 0.9752
22 0.9743
23 0.9737
24 0.9742
Name: Fast, dtype: float64
AT.data['Slow'][0:25]
0 105.1
1 105.1
2 105.1
3 105.1
4 105.1
5 105.1
6 105.1
7 105.0
8 105.0
9 105.0
10 105.0
11 105.0
12 105.0
13 105.0
14 104.9
15 104.9
16 104.9
17 104.8
18 104.8
19 104.8
20 104.8
21 104.7
22 104.7
23 104.7
24 104.7
Name: Slow, dtype: float64
Related
I have a question about the pandas rolling function.
I am currently using it to get the mean of the last 10 days of my time series data.
Example df:
column
2020-12-04 14
2020-12-05 15
2020-12-06 16
2020-12-07 17
2020-12-08 18
2020-12-09 19
2020-12-13 20
2020-12-14 11
2020-12-16 12
2020-12-17 13
Usage:
df['column'].rolling('10D').mean()
But the function calculates the rolling mean over 10 calendar days. For example, if the current row's date is 2020-12-17, it only looks back as far as 2020-12-07.
However, I would like the rolling mean over the last 10 days that are actually present in the data frame, i.e. in this case it should go back to 2020-12-04.
How can I achieve it?
Edit: my datetime index can also be at 15-minute intervals, so doing window=10 does not help in that case, although it works here.
As said in the comments by @cs95, if you want to consider only the rows that are in the dataframe, you can ignore that your data is part of a time series and just specify a window sized by a number of rows instead of by a number of days. In essence:
df['column'].rolling(window=10).mean()
Just one little detail to remember: you have missing dates in your dataframe. You should fill those, otherwise it will not be a 10-day window but a 10-dates rolling window, which would be pretty meaningless if dates are randomly missing.
# build the complete daily date range and reindex, filling the missing days with 0
r = pd.date_range(start=df1.Date.min(), end=df1.Date.max())
df1 = df1.set_index('Date').reindex(r).fillna(0).rename_axis('Date').reset_index()
which gives you the dataframe:
Date column
0 2020-12-04 14.0
1 2020-12-05 15.0
2 2020-12-06 16.0
3 2020-12-07 17.0
4 2020-12-08 18.0
5 2020-12-09 19.0
6 2020-12-10 0.0
7 2020-12-11 0.0
8 2020-12-12 0.0
9 2020-12-13 20.0
10 2020-12-14 11.0
11 2020-12-15 0.0
12 2020-12-16 12.0
13 2020-12-17 13.0
Then applying:
df1['Mean'] = df1['column'].rolling(window=10).mean()
returns
Date column Mean
0 2020-12-04 14.0 NaN
1 2020-12-05 15.0 NaN
2 2020-12-06 16.0 NaN
3 2020-12-07 17.0 NaN
4 2020-12-08 18.0 NaN
5 2020-12-09 19.0 NaN
6 2020-12-10 0.0 NaN
7 2020-12-11 0.0 NaN
8 2020-12-12 0.0 NaN
9 2020-12-13 20.0 11.9
10 2020-12-14 11.0 11.6
11 2020-12-15 0.0 10.1
12 2020-12-16 12.0 9.7
13 2020-12-17 13.0 9.3
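For the 15-minute case mentioned in the edit, the same trick should carry over (a sketch, assuming the dataframe keeps its DatetimeIndex and that filling the gaps with 0 is acceptable):
r = pd.date_range(start=df.index.min(), end=df.index.max(), freq='15min')
df_full = df.reindex(r).fillna(0)
# after reindexing, 10 rows again correspond to a fixed span of time (10 * 15 minutes)
df_full['Mean'] = df_full['column'].rolling(window=10).mean()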
I have several pandas dataframes (stored in a normal Python list) which look like the following two. Note that there can be (in fact there are) some missing values at random dates. I need to compute percentiles of TMAX and/or TMAX_ANOM across the several dataframes, for each date, ignoring the missing values.
YYYY MM DD TMAX TMAX_ANOM
0 1980 7 1 13.0 2.333333
1 1980 7 2 14.3 2.566667
2 1980 7 3 15.6 2.800000
3 1980 7 4 16.9 3.033333
4 1980 8 1 18.2 3.266667
5 1980 8 2 19.5 3.500000
6 1980 8 3 20.8 3.733333
7 1980 8 4 22.1 3.966667
8 1981 7 1 10.0 -0.666667
9 1981 7 2 11.0 -0.733333
10 1981 7 3 12.0 -0.800000
11 1981 7 4 13.0 -0.866667
12 1981 8 1 14.0 -0.933333
13 1981 8 2 15.0 -1.000000
14 1981 8 3 16.0 -1.066667
15 1981 8 4 17.0 -1.133333
16 1982 7 1 9.0 -1.666667
17 1982 7 2 9.9 -1.833333
18 1982 7 3 10.8 -2.000000
19 1982 7 4 11.7 -2.166667
20 1982 8 1 12.6 -2.333333
21 1982 8 2 13.5 -2.500000
22 1982 8 3 14.4 -2.666667
23 1982 8 4 15.3 -2.833333
YYYY MM DD TMAX TMAX_ANOM
0 1980 7 1 14.0 3.666667
1 1980 7 2 15.4 4.033333
2 1980 7 3 16.8 4.400000
3 1980 7 4 18.2 4.766667
4 1980 8 1 19.6 5.133333
6 1980 8 3 22.4 5.866667
7 1980 8 4 23.8 6.233333
8 1981 7 1 10.0 -0.333333
9 1981 7 2 11.0 -0.366667
10 1981 7 3 12.0 -0.400000
11 1981 7 4 13.0 -0.433333
12 1981 8 1 14.0 -0.466667
13 1981 8 2 15.0 -0.500000
14 1981 8 3 16.0 -0.533333
15 1981 8 4 17.0 -0.566667
16 1982 7 1 7.0 -3.333333
17 1982 7 2 7.7 -3.666667
18 1982 7 3 8.4 -4.000000
19 1982 7 4 9.1 -4.333333
20 1982 8 1 9.8 -4.666667
21 1982 8 2 10.5 -5.000000
23 1982 8 4 11.9 -5.666667
So just to be clear, in this example with just two dataframes (and supposing the percentile is the median, to simplify the discussion), as output I need a dataframe with 24 elements, the same YYYY/MM/DD fields, and the TMAX (and/or TMAX_ANOM) replaced as follows: for 1980/7/1 it must be the median of 13 and 14, for 1980/7/2 the median of 14.3 and 15.4, and so on. When there are missing values (for example 1980/8/2 in the second dataframe here), the median must be computed from just the remaining dataframes -- so in this case the value would simply be 19.5.
I have not been able to find a clean way to accomplish this, with either numpy or pandas. Any suggestions or should I just resort to manual looping?
#dates as indexes
df1.index = pd.to_datetime(dict(year = df1.YYYY, month = df1.MM, day = df1.DD))
df2.index = pd.to_datetime(dict(year = df2.YYYY, month = df2.MM, day = df2.DD))
#binding useful columns
new_df = df1[['TMAX','TMAX_ANOM']].join(df2[['TMAX','TMAX_ANOM']], lsuffix = '_df1', rsuffix = '_df2')
#calculating quantile
new_df['TMAX_quantile'] = new_df[['TMAX_df1', 'TMAX_df2']].quantile(0.5, axis = 1)
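If there are more than two dataframes in the list, one way to generalise the same idea (a sketch, assuming the list is called dfs) is to stack them and group by date, since groupby aggregations skip missing values:
import pandas as pd
frames = []
for df in dfs:
    d = df[['TMAX', 'TMAX_ANOM']].copy()
    d.index = pd.to_datetime(dict(year=df.YYYY, month=df.MM, day=df.DD))
    frames.append(d)
# per-date median across all frames; use .quantile(q) instead for other percentiles
result = pd.concat(frames).groupby(level=0).median()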
I'm looking to take the most recent value in a rolling window and divide it by the mean of all numbers in said window.
What I tried:
df.a.rolling(window=7).mean()/df.a[-1]
This doesn't work because df.a[-1] is always the most recent of the entire dataset. I need the last value of the window.
I've done a ton of searching today. I may be searching the wrong terms, or not understanding the results, because I have not gotten anything useful.
Any pointers would be appreciated.
Aggregation (using mean()) on a rolling window returns a pandas Series object with the same index as the original column. You can simply aggregate the rolling window and then divide the aggregated values by the original column.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(30), columns=['A'])
df
# returns:
A
0 0
1 1
2 2
...
27 27
28 28
29 29
You can use a rolling mean to get a series with the same index.
df.A.rolling(window=7).mean()
# returns:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 3.0
7 4.0
...
26 23.0
27 24.0
28 25.0
29 26.0
Because it is indexed, you can simply divide by df.A to get your desired results.
df.A.rolling(window=7).mean() / df.A
# returns:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 0.500000
7 0.571429
8 0.625000
9 0.666667
10 0.700000
11 0.727273
12 0.750000
13 0.769231
14 0.785714
15 0.800000
16 0.812500
17 0.823529
18 0.833333
19 0.842105
20 0.850000
21 0.857143
22 0.863636
23 0.869565
24 0.875000
25 0.880000
26 0.884615
27 0.888889
28 0.892857
29 0.896552
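If instead you want the latest value divided by the mean of the window (as the first sentence of the question reads), just flip the division:
df.A / df.A.rolling(window=7).mean()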
I plotted a data frame like this:
Date Quote-Spread
0 2013-11-17 2.0
1 2013-12-10 8.0
2 2013-12-11 8.0
3 2014-06-01 5.0
4 2014-06-23 15.0
5 2014-06-24 45.0
6 2014-06-25 10.0
7 2014-06-28 20.0
8 2014-09-13 50000.0
9 2015-03-30 250000.0
10 2016-04-02 103780.0
11 2016-04-03 119991.0
12 2016-04-04 29994.0
13 2016-04-05 69993.0
14 2016-04-06 39997.0
15 2016-04-09 490321.0
16 2016-04-10 65485.0
17 2016-04-11 141470.0
18 2016-04-12 109939.0
19 2016-04-13 29983.0
20 2016-04-16 39964.0
21 2016-04-17 39964.0
22 2016-04-18 79920.0
23 2016-04-19 29997.0
24 2016-04-20 108414.0
25 2016-04-23 126849.0
26 2016-04-24 206853.0
27 2016-04-25 37559.0
28 2016-04-26 22817.0
29 2016-04-27 37506.0
30 2016-04-30 37597.0
31 2016-05-01 18799.0
32 2016-05-02 18799.0
33 2016-05-03 9400.0
34 2016-05-07 29890.0
35 2016-05-08 29193.0
36 2016-05-09 7792.0
37 2016-05-10 3199.0
38 2016-05-11 8538.0
39 2016-05-14 49937.0
I use this command to plot them in ipython:
df2.plot(x= 'Date', y='Quote-Spread')
plt.show()
But my figure is plotted like this:
As you can see, on 2016-04-23 the Quote-Spread has a value of about 126,000, but in the plot it shows up as zero.
My whole plot looks like this:
Here is my code of original data:
Sachad = df.loc[df['SID']== 40065016131938148]
df1 = Sachad  # (presumably -- df1 is not defined anywhere else in the snippet)
#Drop rows with any zero
df1 = df1[~(df1 == 0).any(axis = 1)]
df1['Quote-Spread'] = (df1['SellPrice'].mask(df1['SellPrice'].eq(0))-
df1['BuyPrice'].mask(df1['BuyPrice'].eq(0))).abs()
df2 = df1.groupby('Date' , as_index = False )['Quote-Spread'].mean()
df2.plot(x= 'Date', y='Quote-Spread')
plt.show()
Another question is how I can plot only specific dates, for example between 2014-04-01 and 2016-06-01, and draw vertical red lines at the dates 2014-06-06 and 2016-01-06?
Please provide the code that produced the working plot. Any warning messages?
As for your last questions: to select the rows you want, you can simply use > and < operators to compare two datetimes in conditional statements.
For vertical lines, you can use plt.axvline(x=date, color = 'r')
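A minimal sketch of both, assuming df2 from the question with its 'Date' column (the cut-off dates below are the ones you mention):
import pandas as pd
import matplotlib.pyplot as plt
df2['Date'] = pd.to_datetime(df2['Date'])
# keep only the rows between the two cut-off dates
mask = (df2['Date'] > '2014-04-01') & (df2['Date'] < '2016-06-01')
df2[mask].plot(x='Date', y='Quote-Spread')
# vertical red lines at the two dates of interest
plt.axvline(x=pd.Timestamp('2014-06-06'), color='r')
plt.axvline(x=pd.Timestamp('2016-01-06'), color='r')
plt.show()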
I have two numpy arrays of identical length (398 rows), with the first 5 values for each as follows:
y_predicted =
[[-0.85908649]
[-1.19176482]
[-0.93658361]
[-0.83557211]
[-0.80681243]]
y_norm =
mpg
0 -0.705551
1 -1.089379
2 -0.705551
3 -0.961437
4 -0.833494
That is, the first has square brackets around each value, and the second has indexing and no square brackets.
The data is a normalised version of the first column (MPG) of the Auto-MPG dataset. The y_predicted values are results of a linear regression.
https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
Would anyone know how I might convert these arrays to the same type so I can plot a scatter plot of them?
Both have shape: (398, 1)
Both have type: class 'numpy.ndarray', dtype float64
Data from the link provided
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
15.0 8 429.0 198.0 4341. 10.0 70 1 "ford galaxie 500"
The second of these looks like a pandas Series to me. If so you can do y_norm.values to get the underlying numpy array.
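For what it's worth, a minimal sketch of getting them onto a scatter plot, assuming the names from the question and flattening both to 1-D first so the shapes match:
import numpy as np
import matplotlib.pyplot as plt
# works whether y_norm is a (398, 1) ndarray, a Series, or a one-column DataFrame
y_pred_flat = np.asarray(y_predicted).ravel()
y_norm_flat = np.asarray(y_norm).ravel()
plt.scatter(y_norm_flat, y_pred_flat)
plt.xlabel('normalised mpg')
plt.ylabel('predicted mpg')
plt.show()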