Python: How to split dataframe with datetime index by number of observations?

import pandas as pd

data = {
    'aapl': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    'aal':  [33, 33, 33, 32, 31, 30, 34, 29, 27, 26],
}
data = pd.DataFrame(data)
data.index = pd.date_range('2011-01-01', '2011-01-10')
n_obs = len(data) * 0.3
train, test = data[:n_obs], data[n_obs:]
TypeError: cannot do slice indexing on DatetimeIndex with these indexers [3.0] of type float
I can slice the dataframe by date, like df[:'2011-01-05'], but I want to split the data by number of observations, which is where the approach above fails.

You need to make sure you slice with an integer:
n_obs = int(len(data) * 0.3)
train, test = data[:n_obs], data[n_obs:]
Output:
# train
            aapl  aal
2011-01-01    11   33
2011-01-02    12   33
2011-01-03    13   33

# test
            aapl  aal
2011-01-04    14   32
2011-01-05    15   31
2011-01-06    16   30
2011-01-07    17   34
2011-01-08    18   29
2011-01-09    19   27
2011-01-10    20   26
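The same split written with iloc makes the positional intent explicit; this is just a stylistic alternative to the plain slice above:

n_obs = int(len(data) * 0.3)
train, test = data.iloc[:n_obs], data.iloc[n_obs:]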
If you want to train/test a model you might be interested in getting a random sample:
test = data.sample(frac=0.3)
train = data.loc[data.index.difference(test.index)]
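As a side note, random sampling discards the time order, so for time-series models the chronological split above is usually the safer choice. If you do sample randomly, passing random_state makes the split reproducible. A minimal sketch (the seed value 42 is arbitrary):

test = data.sample(frac=0.3, random_state=42)  # fixed seed, reproducible sample
train = data.drop(test.index)                  # the complement of the sample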

Related

Get 25 quantile in cumsum pandas

Suppose I have the following DataFrame:
import pandas as pd

df = pd.DataFrame({'id': [2, 4, 10, 12, 13, 14, 19, 20, 21, 22, 24, 25, 27, 29, 30, 31, 42, 50, 54],
                   'value': [37410.0, 18400.0, 200000.0, 392000.0, 108000.0, 423000.0, 80000.0, 307950.0,
                             50807.0, 201740.0, 182700.0, 131300.0, 282005.0, 428800.0, 56000.0, 412400.0,
                             1091595.0, 1237200.0, 927500.0]})
And I do the following:
df.sort_values(by='id').set_index('id').cumsum()
value
id
2 37410.0
4 55810.0
10 255810.0
12 647810.0
13 755810.0
14 1178810.0
19 1258810.0
20 1566760.0
21 1617567.0
22 1819307.0
24 2002007.0
25 2133307.0
27 2415312.0
29 2844112.0
30 2900112.0
31 3312512.0
42 4404107.0
50 5641307.0
54 6568807.0
I want to know the first element of id whose cumulative sum is bigger than 25% of the total. In this example, 25% of the cumsum would be 1,642,201.75, and the first element to exceed that is 22. I know it can be done with a for loop, but I think that would be pretty inefficient.
You could do:
percentile_25 = df['value'].sum() * 0.25
res = df[df['value'].cumsum() > percentile_25].head(1)
print(res)
Output
id value
9 22 201740.0
Or use searchsorted to do the search in O(log N):
percentile_25 = df['value'].sum() * 0.25
i = df['value'].cumsum().searchsorted(percentile_25)
res = df.iloc[i]
print(res)
Output
id 22.0
value 201740.0
Name: 9, dtype: float64
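One caveat: searchsorted requires the cumulative sums to be in ascending order, which holds here because all the values are non-negative. A small sketch wrapping this approach for an arbitrary fraction (the helper name first_id_exceeding is my own):

def first_id_exceeding(df, frac):
    threshold = df['value'].sum() * frac
    # side='right' would make the comparison strictly greater-than on exact ties
    i = df['value'].cumsum().searchsorted(threshold)
    return df['id'].iloc[i]

print(first_id_exceeding(df, 0.25))  # 22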

Can you explain the output: diff.sort_values(ascending=False).index.astype

Can anyone explain the following statement?
list(diff.sort_values(ascending=False).index.astype(int)[0:5])
Output: Int64Index([24, 26, 17, 2, 1], dtype='int64')
It sorts first, but what is the index doing, and how do I get 24, 26, 17, 2, 1?
diff is a Series:
ipdb> diff
1 0.017647
2 0.311765
3 -0.060000
4 -0.120000
5 -0.040000
6 -0.120000
7 -0.190000
8 -0.200000
9 -0.100000
10 -0.011176
11 -0.130000
12 0.008824
13 -0.060000
14 -0.090000
15 -0.060000
16 0.008824
17 0.341765
18 -0.140000
19 -0.050000
20 -0.060000
21 -0.040000
22 -0.210000
23 0.008824
24 0.585882
25 -0.060000
26 0.555882
27 -0.031176
28 -0.060000
29 -0.170000
30 -0.220000
31 -0.170000
32 -0.040000
dtype: float64
Your code returns a list of the index values of the top 5 values of the Series, sorted in descending order.
The first 'column' printed for a pandas Series is its index, so after sorting, your code converts the index values to integers and slices off the first five by position.
print (diff.sort_values(ascending=False))
24 0.585882
26 0.555882
17 0.341765
2 0.311765
1 0.017647
12 0.008824
23 0.008824
16 0.008824
10 -0.011176
27 -0.031176
32 -0.040000
21 -0.040000
5 -0.040000
19 -0.050000
15 -0.060000
3 -0.060000
13 -0.060000
25 -0.060000
28 -0.060000
20 -0.060000
14 -0.090000
9 -0.100000
6 -0.120000
4 -0.120000
11 -0.130000
18 -0.140000
31 -0.170000
29 -0.170000
7 -0.190000
8 -0.200000
22 -0.210000
30 -0.220000
Name: a, dtype: float64
print (diff.sort_values(ascending=False).index.astype(int))
Int64Index([24, 26, 17, 2, 1, 12, 23, 16, 10, 27, 32, 21, 5, 19, 15, 3, 13,
25, 28, 20, 14, 9, 6, 4, 11, 18, 31, 29, 7, 8, 22, 30],
dtype='int64')
print (diff.sort_values(ascending=False).index.astype(int)[0:5])
Int64Index([24, 26, 17, 2, 1], dtype='int64')
print (list(diff.sort_values(ascending=False).index.astype(int)[0:5]))
[24, 26, 17, 2, 1]
Here's what's happening:
diff.sort_values(ascending=False) - sorts the Series. By default ascending is True, but you've set it to False, so the Series comes back sorted in descending order.
Series.index returns the row labels (the numbers 1-32 in your case, now in sorted-value order).
.astype(int) casts the index labels to integers.
[0:5] picks elements 0 through 4, i.e. the first five.
Let me know if this helps!
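For what it's worth, Series.nlargest expresses the same top-5 selection more compactly, assuming (as here) that you want the labels of the five largest values:

print(list(diff.nlargest(5).index.astype(int)))
[24, 26, 17, 2, 1]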

Appending predicted residuals and rsquared to pandas dataframe - by groups

There is a question like this already, but I want modifications and have tried a few methods without much luck.
I have data and want to add the R-squared of a regression by groups as a separate column in the pandas dataframe. The caveat here is that I only want to run the regression on values which do not have extreme residual values within each group (i.e., within 1 standard deviation, or a z-score between -1 and 1).
Here is the SAMPLE data frame:
df = pd.DataFrame({'gp': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'x1': [3.17, 4.76, 4.17, 8.70, 11.45, 3.17, 4.76, 4.17, 8.70, 1.45],
                   'x2': [23, 26, 73, 72, 16, 26, 73, 72, 16, 25],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25, 980.37, 816.20, 1074.79, 522.80, 1254.25]},
                  index=np.arange(10, 30, 2))
An answer on another post works for me to get the residuals within each group. This was the solution:
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
regmodel = 'y ~ x1 + x2'

def groupreg(g):
    g['residual'] = sm.ols(formula=regmodel, data=g).fit().resid
    return g

df = df.groupby('gp').apply(groupreg)
print(df)
Now this is great because I have a residual column giving the residual of the linear regression within each group.
However, I now want to add another column with the R-squared of the regression within each group, computed only from the points whose residual z-score is between -1 and +1 within that group. So the goal is an R-squared that strips out extreme outliers from the regression (this should improve on a plain R-squared computed from all the data). Any help would be appreciated.
Edit:
FYI, to add just a normal R-squared, the function would be this:
def groupreg(g):
    g['residual'] = sm.ols(formula=regmodel, data=g).fit().resid
    g['rsquared'] = sm.ols(formula=regmodel, data=g).fit().rsquared
    return g
Edit 2:
Here is my code:
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
df = pd.DataFrame({'gp': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'x1': [3.17, 4.76, 4.17, 8.70, 11.45, 3.17, 4.76, 4.17, 8.70, 1.45],
                   'x2': [23, 26, 73, 72, 16, 26, 73, 72, 16, 25],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25, 980.37, 816.20, 1074.79, 522.80, 1254.25]},
                  index=np.arange(10, 30, 2))
regmodel = 'y ~ x1 + x2'
def groupreg(g):
    g['residual'] = sm.ols(formula=regmodel, data=g).fit().resid
    return g
df = df.groupby('gp').apply(groupreg)
print(df)
df['z_score'] = df.groupby('gp')['residual'].apply(lambda x: (x - x.mean())/x.std())
Output:
gp x1 x2 y residual z_score
10 1 3.17 23 880.37 -43.579309 -0.173726
12 1 4.76 26 716.20 -174.532201 -0.695759
14 1 4.17 73 974.79 318.634921 1.270214
16 1 8.70 72 322.80 -287.710952 -1.146938
18 1 11.45 16 1054.25 187.187542 0.746209
20 2 3.17 26 980.37 -67.245089 -0.822329
22 2 4.76 73 816.20 -96.883281 -1.184770
24 2 4.17 72 1074.79 104.400010 1.276691
26 2 8.70 16 522.80 21.017543 0.257020
28 2 1.45 25 1254.25 38.710817 0.473388
Here I would like another column with the R-squared per group, computed without the points whose z-score is greater than 1 or less than -1 (e.g., indices 14, 16, 22 and 24 would be excluded from the grouped R-squared calculation).
First, use your full definition of groupreg that assigns both resid and rsquared columns:
def groupreg(g):
    g['residual'] = sm.ols(formula=regmodel, data=g).fit().resid
    g['rsquared'] = sm.ols(formula=regmodel, data=g).fit().rsquared
    return g
Then, at the very end of your current code (after creating the z_score column), try the following to delete the rsquared entries in rows where -1 < z_score < 1:
df.loc[df['z_score'].abs() < 1, 'rsquared'] = np.nan  # np.nan (the np.NaN alias was removed in NumPy 2.0)
Output:
gp x1 x2 y residual rsquared z_score
10 1 3.17 23 880.37 -43.579309 NaN -0.173726
12 1 4.76 26 716.20 -174.532201 NaN -0.695759
14 1 4.17 73 974.79 318.634921 0.250573 1.270214
16 1 8.70 72 322.80 -287.710952 0.250573 -1.146938
18 1 11.45 16 1054.25 187.187542 NaN 0.746209
20 2 3.17 26 980.37 -67.245089 NaN -0.822329
22 2 4.76 73 816.20 -96.883281 0.912987 -1.184770
24 2 4.17 72 1074.79 104.400010 0.912987 1.276691
26 2 8.70 16 522.80 21.017543 NaN 0.257020
28 2 1.45 25 1254.25 38.710817 NaN 0.473388
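Note that this keeps the R-squared fitted on all points and merely blanks it on the non-outlier rows. If you instead want an R-squared refitted after dropping the outliers, which is what the question describes, a sketch along these lines should work, reusing regmodel and the z_score column from above. Be aware that on this small sample each group keeps only three points for a three-parameter model, so the refit can be degenerate and return an R-squared of 1:

# refit each group on the rows with |z_score| < 1, broadcast the scalar back
clean_r2 = df.groupby('gp').apply(
    lambda g: sm.ols(formula=regmodel, data=g[g['z_score'].abs() < 1]).fit().rsquared)
df['rsquared_clean'] = df['gp'].map(clean_r2)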

Pandas DataFrame: Complex linear interpolation

I have a dataframe with 4 sections
Section 1: Product details
Section 2: 6 Potential product values based on a range of simulations
Section 3: Upper and lower bound for the input parameter to the simulations
Section 4: Randomly generated values for the input parameters
Section 2 is generated by pricing the product at equal intervals between the upper and lower bound.
I need to take the values in Section 4 and figure out the corresponding product value. Here is a possible setup for this dataframe:
import pandas as pd

table2 = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2
0 40 50 60 1.000 0.0 1.0
1 41 51 61 2.000 0.0 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.0 9.0
I will run through a couple of examples of this calculation to make clear what my question is.
Product A - sim_2
The input here is 1.0. This is equal to the upper bound for this product, so the simulation value equals the State_6 value: 60.
Product B - sim_2
The input here is 1.5. The LB-to-UB range is (1, 2), so the 6 states are {1, 1.2, 1.4, 1.6, 1.8, 2}. 1.5 lies exactly midway between state 3 (value 31) and state 4 (value 41), so the simulation value is 36.
Product C - sim_1
The input here is .61. The LB-to-UB range is (.5, .625), so the 6 states are {.5, .525, .55, .575, .6, .625}. .61 falls between states 5 and 6. Specifically, the bucket it falls under is 5*(.61-.5)/(.625-.5)+1 = 5.4 (it is multiplied by 5 because that is the number of intervals; you can calculate it other ways and get the same result). To calculate the value we use that bucket in a weighting of the state 5 and state 6 values: (62-52)*(5.4-5)+52 = 56.
Product B - sim_1
The input here is 0, which is below the lower bound of 1, so we need to extrapolate the value. We use the same formula as above, just with the values of state 1 and state 2. The bucket would be 5*(0-1)/(2-1)+1 = -4. The two values used are 11 and 21, so the value is (21-11)*(-4-1)+11 = -39.
I've also simplified the problem to make a solution easier to visualize; my final code needs to run on 500 values and 10,000 simulations, and the dataframe will have about 200 rows.
Here are the formulas I've used for the interpolation, although I'm not committed to them specifically:
Bucket = N*(sim_value - LB)/(UB - LB) + 1
where N is the number of intervals.
Then nLower is the state value directly below the bucket and nHigher is the state value directly above it. If the bucket falls outside the LB/UB range, force nLower and nHigher to be either the first two or the last two state values.
Final_value = (nHigher - nLower)*(Bucket - state_number_of_nLower) + nLower
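As a sanity check, here is a direct transcription of these formulas as a plain Python function (a helper of my own, using the clamping rule just described); it reproduces the worked examples above:

def state_interp(sim, lb, ub, states, n=5):
    bucket = n * (sim - lb) / (ub - lb) + 1
    lo = min(max(int(bucket), 1), n)  # clamp so out-of-range buckets use the edge states
    hi = lo + 1
    return (states[hi - 1] - states[lo - 1]) * (bucket - lo) + states[lo - 1]

print(state_interp(0.61, 0.5, 0.625, [12, 22, 32, 42, 52, 62]))  # 56.0 (Product C, sim_1)
print(state_interp(0, 1, 2, [11, 21, 31, 41, 51, 61]))           # -39.0 (Product B, sim_1)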
To summarize, my question is how I can generate the final results based on the combination of input data provided. The most challenging part to me is how to make the connection from the Bucket number to the nLower and nHigher values.
I was able to generate the result using the following code. I'm not sure of the memory implications on a large dataframe, so still interested in better answers or improvements.
Edit: Ran this code on the full dataset, 141 rows, 500 intervals, 10,000 simulations, and it took slightly over 1.5 hours. So not quite as useless as I assumed, but there is probably a smarter way of doing this in a tiny fraction of that time.
for i in range(1, 3):
    table2['Bucket%s' % i] = 5 * (table2['sim_%s' % i] - table2['Lower_Bound']) / (table2['Upper_Bound'] - table2['Lower_Bound']) + 1
    table2['lv'] = table2['Bucket%s' % i].map(int)
    table2['hv'] = table2['Bucket%s' % i].map(int) + 1
    # clamp to valid state numbers (.loc replaces the removed .ix indexer)
    table2.loc[table2['lv'] < 1, 'lv'] = 1
    table2.loc[table2['lv'] > 5, 'lv'] = 5
    table2.loc[table2['hv'] > 6, 'hv'] = 6
    table2.loc[table2['hv'] < 2, 'hv'] = 2
    table2['nLower'] = table2.apply(lambda row: row['State_%s_Value' % row['lv']], axis=1)
    table2['nHigher'] = table2.apply(lambda row: row['State_%s_Value' % row['hv']], axis=1)
    table2['Final_value_%s' % i] = (table2['nHigher'] - table2['nLower']) * (table2['Bucket%s' % i] - table2['lv']) + table2['nLower']
Output:
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2 \
0 40 50 60 1.000 0.00 1.0
1 41 51 61 2.000 0.00 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.00 9.0
Bucket1 lv hv nLower nHigher Final_value_1 Bucket2 Final_value_2
0 3.5 5 6 50 60 35.0 6.0 60.0
1 -4.0 3 4 31 41 -39.0 3.5 36.0
2 5.4 5 6 52 62 56.0 9.0 92.0
3 2.0 3 4 33 43 23.0 3.0 33.0
I posted a superior solution with no loops here:
Alternate method to avoid loop in pandas dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})
# bucket numbers for all sim columns at once (.iloc replaces the removed .ix indexer)
buckets = df.iloc[:, -2:].sub(df['Lower_Bound'], axis=0).div(df['Upper_Bound'].sub(df['Lower_Bound'], axis=0), axis=0) * 5 + 1
low = buckets.applymap(int)
high = buckets.applymap(int) + 1
# clamp the state numbers so out-of-range buckets extrapolate from the edge states
low = low.applymap(lambda x: 1 if x < 1 else x)
low = low.applymap(lambda x: 5 if x > 5 else x)
high = high.applymap(lambda x: 6 if x > 6 else x)
high = high.applymap(lambda x: 2 if x < 2 else x)
# fancy indexing: for each row, pick the state column whose number is in low/high
low_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(low.shape[0])[:, None], low])
high_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(high.shape[0])[:, None], high])
df1 = (high_value - low_value).mul((buckets - low).values) + low_value
df1['Product Type'] = df['Product Type']
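The key trick in this version is NumPy fancy indexing: for each row, the integer in low/high picks which state column to read. A minimal, self-contained illustration of the pattern with toy numbers:

import numpy as np

vals = np.array([[10, 20, 30],
                 [40, 50, 60]])
cols = np.array([[0, 2],
                 [0, 1]])
# row i of the result takes vals[i, cols[i, j]] for each j
print(vals[np.arange(2)[:, None], cols])
# [[10 30]
#  [40 50]]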

Pandas dot product ValueError

I am trying to calculate the dot product of a data frame and a series, but I am getting ValueError: matrices are not aligned and I do not really understand why. I get
if (len(common) > len(self.columns) or len(common) > len(other.index)):
    raise ValueError('matrices are not aligned')
together with the error message; the check itself I understand. But when I check my series, it has 25 values:
weights
Out[193]:
0 0.000002
1 0.000577
2 0.002480
3 0.004720
4 0.003640
5 0.001480
6 0.000054
7 0.000022
8 0.009060
9 0.000511
10 0.034900
11 0.140000
12 0.065600
13 0.325000
14 0.072900
15 0.031100
16 0.209000
17 0.003280
18 0.001390
19 0.002100
20 0.000847
21 0.009560
22 0.006320
23 0.014000
24 0.061900
Name: 3, dtype: float64
And when I check my data frame, it also has 25 columns:
In [195]: data
Out[195]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 0 to 130
Data columns (total 25 columns):
(etc)
So I don't understand why I get the error message. What am I missing here?
Some additional information:
I am using weightedave = data.dot(weights).
And I just figured out in the dot source code that it does common = data.columns.union(weights.index) to get the common referred to in the error message. So I tested that, and in my case it becomes:
In[220]: common
Out[220]: Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, u'100_AVET', u'101_AVET', u'102_AVET', u'13_AVET', u'14_AVET', u'15_AVET', u'18_AVET', u'19_AVET', u'20_AVET', u'22_AVET', u'36_AVET', u'62_AVET', u'74_AVET', u'78_AVET', u'79_AVET', u'80_AVET', u'83_AVET', u'85_AVET', u'86_AVET', u'88_AVET', u'94_AVET', u'95_AVET', u'96_AVET', u'97_AVET', u'99_AVET'], dtype=object)
Which indeed is longer (50) than my number of columns/indices (25). Should I rename either my series or the columns in my data frame?
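For completeness, the usual fix is to make the labels match, since DataFrame.dot aligns on labels rather than on position. A sketch, assuming your 25 weights really do correspond to the 25 columns in order:

weights.index = data.columns  # give the weights the same labels as the columns
weightedave = data.dot(weights)

Or sidestep label alignment entirely by dropping to NumPy:

weightedave = pd.Series(data.values.dot(weights.values), index=data.index)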
