Rolling calculation across a column - array-wise - python

I'm trying to compute a rolling n-day annualized equity return volatility but am having trouble implementing it. Basically, in the last row (index 10) I would want the result of something like np.std(df["log returns"]) * np.sqrt(252) computed over a rolling n-day window (e.g. indices 6-10 for a 5-day window). If there aren't n values available, leave the cell empty / fill it with np.nan.
Index  log returns  annualized volatility
0       0.01
1      -0.005
2       0.021
3       0.01
4      -0.01
5       0.02
6       0.012
7       0.022
8      -0.001
9      -0.01
10      0.01
I thought about doing this with a while loop, but since I'm working with a lot of data I thought an array-wise operation may be smarter. Unfortunately I can't come up with one for the life of me.
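One possible vectorized approach, a minimal sketch using pandas' built-in rolling window (note that pandas' rolling .std() defaults to the sample standard deviation, ddof=1, while np.std defaults to ddof=0, so ddof=0 is passed here to match the formula in the question):
import numpy as np
import pandas as pd

# example data matching the question's table
df = pd.DataFrame({"log returns": [0.01, -0.005, 0.021, 0.01, -0.01,
                                   0.02, 0.012, 0.022, -0.001, -0.01, 0.01]})

n = 5  # rolling window length in days
# rows with fewer than n observations so far are left as NaN automatically
df["annualized volatility"] = df["log returns"].rolling(window=n).std(ddof=0) * np.sqrt(252)
print(df)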

Related

Python PuLP optimization problem for average values with increase and decrease constraints

I am trying to use PuLP to solve the following problem.
I want it to adjust the values in the value column, either freely or, where the max_decrease/max_increase fields are populated, by at most those amounts, so that a target of 0.27 is achieved.
band  term  value  max_decrease  max_increase  target
A     1     0.259  -0.01         0             0.27
A     2     0.239   0            0.01          0.27
B     1     0.17    0            0.01          0.27
C     1     0.245  -0.01         0             0.27
I have tried setting this problem up but am not making any progress. I am currently trying to set it up by minimizing the objective Minimize(target - avg(value)), driving it to 0, so that by changing the values the average reaches 0.27.
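One possible way to set this up, a sketch that builds a hypothetical df mirroring the table above and treats max_decrease/max_increase as lower/upper bounds on a per-row adjustment; since PuLP objectives must be linear, the absolute gap |target - avg(value)| is minimized through an auxiliary variable:
import pulp
import pandas as pd

# hypothetical input mirroring the question's table
df = pd.DataFrame({
    "band": ["A", "A", "B", "C"],
    "term": [1, 2, 1, 1],
    "value": [0.259, 0.239, 0.17, 0.245],
    "max_decrease": [-0.01, 0.0, 0.0, -0.01],
    "max_increase": [0.0, 0.01, 0.01, 0.0],
})
target = 0.27

prob = pulp.LpProblem("hit_target_average", pulp.LpMinimize)

# one adjustment variable per row, bounded by the allowed decrease/increase
adjust = [pulp.LpVariable(f"adj_{i}",
                          lowBound=float(df["max_decrease"].iloc[i]),
                          upBound=float(df["max_increase"].iloc[i]))
          for i in range(len(df))]

# average of the adjusted values, as a linear expression
avg = pulp.lpSum(float(df["value"].iloc[i]) + adjust[i] for i in range(len(df))) * (1.0 / len(df))

# minimize the absolute gap |target - avg| via an auxiliary variable
gap = pulp.LpVariable("gap", lowBound=0)
prob += gap
prob += avg - target <= gap
prob += target - avg <= gap

prob.solve()
print([pulp.value(a) for a in adjust], pulp.value(gap))
Whether the target is reachable depends on how tight the per-row bounds are; the formulation simply gets the average as close to 0.27 as the bounds allow.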

How to generate datetimeindex for 200 observations per second?

I have data from many sensors, and observations arrive 200 times every second. I want to resample at a lower rate to make the dataset manageable computation-wise, but the time column in my data is relative elapsed time rather than an absolute datetime; please see the first column below. I want to create an index of absolute datetimes so that I can easily use resample() for resampling and aggregation over different durations.
Example:
0.000000 1.397081 -0.672387 0.552749
0.005000 2.374832 -0.221770 1.348744
0.010000 3.191852 0.776504 0.044648
0.015000 2.304027 0.188047 0.433253
0.020000 2.331740 -0.000074 0.424112
0.025000 2.869129 0.282714 1.081615
0.030000 3.312915 0.997374 0.456503
0.035000 2.044041 -0.114705 0.993204
I want a method to generate timestamps 200 times a second, starting at the timestamp when this run of the experiment was started, e.g. 2020/03/14 23:49:19. This will help me generate a DatetimeIndex and then resample and aggregate down to 10 observations per second.
I could find no example at this frequency and granularity after reading the pandas date functionality documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timestamps-vs-time-spans
The real data files are of course extremely big, and confidential, so I cannot post them.
Assuming we have, for example:
df
Out[52]:
t v1 v2 v3
0 0.000 1.397081 -0.672387 0.552749
1 0.005 2.374832 -0.221770 1.348744
2 0.010 3.191852 0.776504 0.044648
3 0.015 2.304027 0.188047 0.433253
4 0.020 2.331740 -0.000074 0.424112
5 0.025 2.869129 0.282714 1.081615
6 0.030 3.312915 0.997374 0.456503
7 0.035 2.044041 -0.114705 0.993204
We can define a start date/time, add the existing time axis as a timedelta (assuming seconds here), and set that as the index:
start = pd.Timestamp("2020/03/14 23:49:19")
df.index = pd.DatetimeIndex(start + pd.to_timedelta(df['t'], unit='s'))
df
Out[55]:
t v1 v2 v3
t
2020-03-14 23:49:19.000 0.000 1.397081 -0.672387 0.552749
2020-03-14 23:49:19.005 0.005 2.374832 -0.221770 1.348744
2020-03-14 23:49:19.010 0.010 3.191852 0.776504 0.044648
2020-03-14 23:49:19.015 0.015 2.304027 0.188047 0.433253
2020-03-14 23:49:19.020 0.020 2.331740 -0.000074 0.424112
2020-03-14 23:49:19.025 0.025 2.869129 0.282714 1.081615
2020-03-14 23:49:19.030 0.030 3.312915 0.997374 0.456503
2020-03-14 23:49:19.035 0.035 2.044041 -0.114705 0.993204
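From there, aggregating down to 10 observations per second (100 ms bins), for example by taking the mean of each bin, could look like this sketch:
# drop the now-redundant 't' column and aggregate each 100 ms bin by mean
resampled = df.drop(columns='t').resample('100ms').mean()
print(resampled.head())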

Percent of total clusters per cluster per season using pandas

I have a pandas DataFrame with 12 clusters in total; certain clusters don't appear in some seasons.
I want to create a multi-line graph over the seasons showing the percentage of each cluster in each season. So if there are 30 teams in the 97-98 season and 10 of them are in Cluster 1, that value would be 0.33, since Cluster 1 holds one third of the total possible spots.
I want the resulting dataset to hold, for each season, each cluster's share of that season's total. I've tried using pandas' groupby to get a bunch of lists and then calling value_counts() on them, but that doesn't work, since looping through df.groupby(['SEASON']) yields tuples, not a Series.
Thanks so much
Use .groupby combined with .value_counts and .unstack:
temp_df = df.groupby(['SEASON'])['Cluster'].value_counts(normalize=True).unstack().fillna(0.0)
temp_df.plot()
print(temp_df.round(2))
Cluster 0 1 2 4 5 6 7 10 11
SEASON
1996-97 0.1 0.21 0.17 0.21 0.07 0.1 0.03 0.07 0.03
1997-98 0.2 0.00 0.20 0.20 0.00 0.0 0.20 0.20 0.00

Pandas - question about Groupby - 2 columns

I'm trying to print two columns against one another to see how much one depends on the other, i.e. how the chance of admission depends on research experience: print the average chance of admit against research.
I'm not sure I'm getting the command right, or whether size() or something else should be used at the end:
df.groupby(['Chance of Admit', 'Research']).size()
this is the result when I run the above:
Chance of Admit Research
0.34 0 2
0.36 0 1
1 1
0.37 0 1
0.38 0 2
..
0.93 1 12
0.94 1 13
0.95 1 5
0.96 1 8
0.97 1 4
Length: 99, dtype: int64
To see the mean admission rate by number of research papers, you should group by 'Research' and take the mean:
df.groupby('Research').mean()
To see additional stats grouped by the number of research papers, use .describe():
df.groupby('Research').describe()
Finally, it may be useful to plot a correlation of the chance of admission vs. the amount of research:
df.plot.scatter(x='Research', y='Chance of Admit')
Syntactically speaking, to take the mean of the chance to admit column grouped by research, all you need is the following:
df[['Chance of Admit', 'Research']].groupby('Research').mean()
But be wary here, as the number may not actually tell you what the chance of getting admitted is given a particular amount of research. That is because you don't know the denominators in those probabilities.
For example,
Suppose I had a dataset that contained the following two rows:
Chance of Admit Research
0.75 1
0.25 1
The 'mean' "chance of admit" for research==1 would be .50 when calculated this way, but suppose that the first change came from a population of 100 students where 75% (75) we admitted. And the second from a population of 900, where 25% (225) were admitted.
Then over all the data we have, we would see 300 with research==1 admitted from a total of 1000 students or 30% change to admit with reseach==1
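If the underlying population sizes were available (hypothetical column n_students below), a weighted mean would recover that 30%; a minimal sketch:
import pandas as pd

# hypothetical data: the two rows above, plus assumed population sizes
df = pd.DataFrame({'Chance of Admit': [0.75, 0.25],
                   'Research': [1, 1],
                   'n_students': [100, 900]})

admitted = df['Chance of Admit'] * df['n_students']  # 75 and 225 admitted
weighted = admitted.groupby(df['Research']).sum() / df['n_students'].groupby(df['Research']).sum()
print(weighted)  # Research == 1 -> 0.30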

Create a row in pandas dataframe

I am trying to create a row in my existing pandas DataFrame, and the value of the new row should be a computation.
I have a dataframe that looks like the below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
I want to add a row called "Metric", which is the sum of the "LE_St" variable for "Rating" >= 4 and < 6, divided by "LE_St" for "All", i.e. Metric = (0.05 + 1.77) / 10.17.
My output dataframe should look like below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
Metric 0.44
I believe your approach to the dataframe is wrong.
Usually rows hold values that correspond to the columns in a way that makes sense, rather than holding unrelated summary information. The power of pandas and Python is in holding and manipulating data: you can easily compute a value from one column, or from all columns, and store the results in a separate "summary" DataFrame or in standalone variables. That might help you here as well.
For computation on a column (i.e. a Series object) you can use the .sum() method (or any of the other computational tools) and slice your DataFrame by values in the "Rating" column.
For one-off computation of small statistics you may be better off with Excel :)
An example of a solution might look like this:
total = 10.17  # I don't know where this value comes from
sliced_df = df[df['Rating'].between(4, 6, inclusive='both')]
metric = sliced_df['LE_St'].sum() / total
print(metric)  # or store it somewhere, however you like
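If you do want the result appended to the DataFrame as a "Metric" row, as in your desired output, a minimal sketch (assuming "Rating" is an ordinary column and putting the metric in the LE_St column, with % Total left empty):
# append a new row labelled 'Metric'; the % Total cell is left as NaN
df.loc[len(df)] = ['Metric', metric, float('nan')]
print(df)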
