I'm looking to take the most recent value in a rolling window and divide it by the mean of all the numbers in that window.
What I tried:
df.a.rolling(window=7).mean()/df.a[-1]
This doesn't work because df.a[-1] is always the most recent of the entire dataset. I need the last value of the window.
I've done a ton of searching today. I may be searching the wrong terms, or not understanding the results, because I have not gotten anything useful.
Any pointers would be appreciated.
Aggregating (e.g. with mean()) over a rolling window returns a pandas Series with the same index as the original column. You can simply aggregate over the rolling window and then divide the result by the original column (or the other way around, depending on which ratio you want).
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(30), columns=['A'])
df
# returns:
A
0 0
1 1
2 2
...
27 27
28 28
29 29
You can use a rolling mean to get a series with the same index.
df.A.rolling(window=7).mean()
# returns:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 3.0
7 4.0
...
26 23.0
27 24.0
28 25.0
29 26.0
Because it is indexed, you can simply divide by df.A to get your desired result.
df.A.rolling(window=7).mean() / df.A
# returns:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 0.500000
7 0.571429
8 0.625000
9 0.666667
10 0.700000
11 0.727273
12 0.750000
13 0.769231
14 0.785714
15 0.800000
16 0.812500
17 0.823529
18 0.833333
19 0.842105
20 0.850000
21 0.857143
22 0.863636
23 0.869565
24 0.875000
25 0.880000
26 0.884615
27 0.888889
28 0.892857
29 0.896552
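If you want the ratio the other way around (the most recent value in each window divided by the window mean, as worded in the question), the same index alignment works with the division flipped. A minimal sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30), columns=['A'])

# Each row's value divided by the mean of the 7-row window ending at that row.
ratio = df.A / df.A.rolling(window=7).mean()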
I would like to know how many 0.5/1/1.5/2/2.5/3/3.5/4/4.5/5 ratings that rated by every user in a data frame of a certain movie which is Ocean's Eleven (2001) in order to calculate Pearson Correlation using the formula.
Below is the code
import numpy as np
import pandas as pd
ratings_data = pd.read_csv("D:\\ratings.csv")
movies_name = pd.read_csv("D:\\movies.csv")
movies_data = pd.merge(ratings_data, movies_name, on='movieId')
movies_data.groupby('title')['rating'].mean()
movies_data.groupby('title')['rating'].count()
# average_ratings_count is assumed to be built from the per-title mean first (not shown in the original)
average_ratings_count = pd.DataFrame(movies_data.groupby('title')['rating'].mean())
average_ratings_count['rating_counts'] = pd.DataFrame(movies_data.groupby('title')['rating'].count())
(Screenshot of the resulting ratings table: https://i.stack.imgur.com/1eFLV.png)
matrix_user_ratings = movies_data.pivot_table(index='userId', columns='title', values='rating')
oceanRatings = matrix_user_ratings["Ocean's Eleven (2001)"]
oceanRatings.head(20)
userId
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 4.0
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 4.0
19 NaN
20 NaN
Name: Ocean's Eleven (2001), dtype: float64
In this case I can only see that there are two 4.0 ratings, but I have around 600+ users, because I am using the MovieLens dataset.
You can use groupby, grouping the column by its own values (the Series has no column named 'rating', so group by the values directly):
oceanRatings = matrix_user_ratings["Ocean's Eleven (2001)"]
ratingCounts = oceanRatings.groupby(oceanRatings).count()
Or value_counts():
oceanRatings = matrix_user_ratings["Ocean's Eleven (2001)"].value_counts()
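A minimal sketch on toy data (the userId index and rating values below are made up) showing how value_counts() tallies each rating level while ignoring the NaN entries for users who did not rate the movie:
import pandas as pd

# Hypothetical slice of one movie's column from the pivoted user-by-title matrix.
oceanRatings = pd.Series([4.0, None, 3.5, 4.0, None, 5.0],
                         index=[1, 2, 3, 4, 5, 6],
                         name="Ocean's Eleven (2001)")

print(oceanRatings.value_counts().sort_index())
# 3.5    1
# 4.0    2
# 5.0    1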
I want to merge the following 2 data frames in Pandas but the result isn't containing all the relevant columns:
L1aIn[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L1aIn
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a 2021 3 29 1
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a 2021 3 29 1
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b 2021 3 29 1
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a 2021 3 29 1
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a 2021 3 29 1
L2Std[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L2Std
0 oco2_L2StdGL_35861a_210329_B10206r_21042704283... 35861 GL a 2021 3 29 1
1 oco2_L2StdXS_35860a_210329_B10206r_21042700342... 35860 XS a 2021 3 29 1
2 oco2_L2StdND_35852a_210329_B10206r_21042622540... 35852 ND a 2021 3 29 1
3 oco2_L2StdGL_35862a_210329_B10206r_21042622403... 35862 GL a 2021 3 29 1
4 oco2_L2StdTG_35856a_210329_B10206r_21042622422... 35856 TG a 2021 3 29 1
>>> df = L1aIn.copy(deep=True)
>>> df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a ... NaN NaN NaN NaN
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a ... NaN NaN NaN NaN
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b ... NaN NaN NaN NaN
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a ... NaN NaN NaN NaN
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a ... NaN NaN NaN NaN
5 NaN 35861 GL a ... 2021.0 3.0 29.0 1.0
6 NaN 35860 XS a ... 2021.0 3.0 29.0 1.0
7 NaN 35852 ND a ... 2021.0 3.0 29.0 1.0
8 NaN 35862 GL a ... 2021.0 3.0 29.0 1.0
9 NaN 35856 TG a ... 2021.0 3.0 29.0 1.0
[10 rows x 13 columns]
>>> df.columns
Index(['Filename', 'OrbitNumber', 'OrbitMode', 'OrbitModeCounter', 'Year',
'Month', 'Day', 'L1aIn'],
dtype='object')
I want the resulting merged table to include both the "L1aIn" and "L2Std" columns but as you can see it doesn't and only picks up the original columns from L1aIn.
I'm also puzzled about why it seems to be returning a dataframe object rather than None.
A toy example works fine for me, but the real-life one does not. What circumstances provoke this kind of behavior for merge?
It seems to me that you just need to assign the output to a variable:
merged_df = df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
print(merged_df.columns)
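A toy reproduction (column names from the question, values made up) showing that merge returns a new DataFrame rather than modifying df in place, and that the new frame carries both the L1aIn and L2Std columns:
import pandas as pd

# Hypothetical miniature versions of L1aIn and L2Std sharing the three join keys.
L1aIn = pd.DataFrame({'Filename': ['a.h5', 'b.h5'],
                      'OrbitNumber': [35861, 35862],
                      'OrbitMode': ['DP', 'DP'],
                      'OrbitModeCounter': ['a', 'a'],
                      'L1aIn': [1, 1]})
L2Std = pd.DataFrame({'Filename': ['c.h5'],
                      'OrbitNumber': [35861],
                      'OrbitMode': ['GL'],
                      'OrbitModeCounter': ['a'],
                      'L2Std': [1]})

# merge does not work in place; the result must be assigned to a variable.
merged_df = L1aIn.merge(L2Std, how='outer', on=['OrbitNumber', 'OrbitMode', 'OrbitModeCounter'])
print(merged_df.columns)  # includes Filename_x, Filename_y, L1aIn and L2Std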
Hey, I have a doubt about the pandas rolling function.
I am currently using it to get the mean of the last 10 days of my time series data.
Example df:
column
2020-12-04 14
2020-12-05 15
2020-12-06 16
2020-12-07 17
2020-12-08 18
2020-12-09 19
2020-12-13 20
2020-12-14 11
2020-12-16 12
2020-12-17 13
Usage:
df['column'].rolling('10D').mean()
But the function calculates the rolling mean over 10 calendar days: if the current row's date is 2020-12-17, it calculates back to 2020-12-07.
However, I would like the rolling mean over the last 10 dates that are actually in the data frame, i.e. back to 2020-12-04.
How can I achieve it?
Edit: I can also have a datetime index at 15-minute intervals, so window=10 does not help in that case, though it works here.
As said in the comments by @cs95, if you want to consider only the rows that are in the dataframe, you can ignore the fact that your data is a time series and just specify a window sized by a number of rows instead of by a number of days. In essence:
df['column'].rolling(window=10).mean()
Just one little detail to remember: you have missing dates in your dataframe. You should fill those in, otherwise it will not be a 10-day window; instead you would have a 10-dates rolling window, which would be pretty meaningless if dates are randomly missing.
r = pd.date_range(start=df1.Date.min(), end=df1.Date.max())
df1 = df1.set_index('Date').reindex(r).fillna(0).rename_axis('Date').reset_index()
which gives you the dataframe:
Date column
0 2020-12-04 14.0
1 2020-12-05 15.0
2 2020-12-06 16.0
3 2020-12-07 17.0
4 2020-12-08 18.0
5 2020-12-09 19.0
6 2020-12-10 0.0
7 2020-12-11 0.0
8 2020-12-12 0.0
9 2020-12-13 20.0
10 2020-12-14 11.0
11 2020-12-15 0.0
12 2020-12-16 12.0
13 2020-12-17 13.0
Then applying:
df1['Mean']=df1['column'].rolling(window=10).mean()
returns
Date column Mean
0 2020-12-04 14.0 NaN
1 2020-12-05 15.0 NaN
2 2020-12-06 16.0 NaN
3 2020-12-07 17.0 NaN
4 2020-12-08 18.0 NaN
5 2020-12-09 19.0 NaN
6 2020-12-10 0.0 NaN
7 2020-12-11 0.0 NaN
8 2020-12-12 0.0 NaN
9 2020-12-13 20.0 11.9
10 2020-12-14 11.0 11.6
11 2020-12-15 0.0 10.1
12 2020-12-16 12.0 9.7
13 2020-12-17 13.0 9.3
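For completeness, a minimal construction of the df1 assumed above (dates and values taken from the question's example), so the reindex and rolling snippets can be run end to end:
import pandas as pd

# 'Date' is a regular column here, which is what the set_index('Date') call expects.
df1 = pd.DataFrame({
    'Date': pd.to_datetime(['2020-12-04', '2020-12-05', '2020-12-06', '2020-12-07',
                            '2020-12-08', '2020-12-09', '2020-12-13', '2020-12-14',
                            '2020-12-16', '2020-12-17']),
    'column': [14, 15, 16, 17, 18, 19, 20, 11, 12, 13],
})

r = pd.date_range(start=df1.Date.min(), end=df1.Date.max())
df1 = df1.set_index('Date').reindex(r).fillna(0).rename_axis('Date').reset_index()
df1['Mean'] = df1['column'].rolling(window=10).mean()
print(df1)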
If I have two data frames df1 and df2:
df1
yr
24 1984
30 1985
df2
d m yr
16 12 4 2012
17 13 10 1976
18 24 4 98
I would like a dataframe that gives the output below. Could you help with the function that would help me achieve this?
d m yr
16 12 4 2012
17 13 10 1976
18 24 4 98
24 NaN NaN 1984
30 NaN NaN 1985
You are looking to concat two dataframes:
res = pd.concat([df2, df1], sort=False)
print(res)
d m yr
16 12.0 4.0 2012
17 13.0 10.0 1976
18 24.0 4.0 98
24 NaN NaN 1984
30 NaN NaN 1985
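For reference, a minimal construction of the two frames (index labels taken from the question) so the concat call can be run as-is; concat aligns columns by name and fills the missing d and m values with NaN:
import pandas as pd

df1 = pd.DataFrame({'yr': [1984, 1985]}, index=[24, 30])
df2 = pd.DataFrame({'d': [12, 13, 24], 'm': [4, 10, 4], 'yr': [2012, 1976, 98]},
                   index=[16, 17, 18])

res = pd.concat([df2, df1], sort=False)
print(res)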
I'm using scipy interpolate to find a smooth approximation of different columns of a quasi-large data frame (of length around 240,000), then find the slope at interval midpoints like so:
# Fit a smoothing spline to one column, then evaluate its first derivative at the interval midpoints
tck = sc.interpolate.splrep(data['Time'], np.array(data[columname]), s=3)
slope = sc.interpolate.splev(interval_midpoints_array, tck, der=1)
For some columns this works well and very fast, but for others it takes what seems to me a long time (at least 15-20 minutes before I gave up).
I read here that the univariate spline has speed problems with NaN values; however, I checked and there are no NaN values in my data frame. I also tried increasing the s value, but it had no significant effect on the time.
Is there a way to speed up the interpolation? Or a better way to approximate these slope values that is faster? Am I just impatient?
Is there a way to characterize data sets on which different approaches are faster?
Edit: added some sample data below
AT.data['Fast'][0:25]
0 0.9531
1 0.9536
2 0.9557
3 0.9578
4 0.9599
5 0.9625
6 0.9538
7 0.9143
8 0.9429
9 0.9773
10 0.9802
11 0.9831
12 0.9846
13 0.9849
14 0.9849
15 0.9826
16 0.9811
17 0.9791
18 0.9780
19 0.9773
20 0.9758
21 0.9752
22 0.9743
23 0.9737
24 0.9742
Name: Fast, dtype: float64
AT.data['Slow'][0:25]
0 105.1
1 105.1
2 105.1
3 105.1
4 105.1
5 105.1
6 105.1
7 105.0
8 105.0
9 105.0
10 105.0
11 105.0
12 105.0
13 105.0
14 104.9
15 104.9
16 104.9
17 104.8
18 104.8
19 104.8
20 104.8
21 104.7
22 104.7
23 104.7
24 104.7
Name: Slow, dtype: float64
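For reference, a minimal self-contained sketch of the splrep/splev workflow described in the question, run on synthetic data (t, y and the midpoints array are stand-ins for data['Time'], the column, and interval_midpoints_array):
import numpy as np
from scipy import interpolate

# Synthetic stand-ins; the real data is around 240,000 rows.
t = np.linspace(0.0, 100.0, 5000)
y = np.sin(t / 10.0) + 0.01 * np.random.randn(t.size)

# Fit a smoothing spline, then evaluate its first derivative at interval midpoints.
tck = interpolate.splrep(t, y, s=3)
interval_midpoints = (t[:-1] + t[1:]) / 2.0
slope = interpolate.splev(interval_midpoints, tck, der=1)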