Consider this simple example:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
df = pd.DataFrame({'a': [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9, 1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9],
                   'b': [3, 5, 6, 2, 4, 6, 2, 5, 7, 1, 9, 5, 3, 2, 5, 4, 3, 6, 4, 1, 1, 9]})
I am trying to perform a rolling regression of a on b using the simplest pandas tool available: apply. I want to use apply because it keeps the flexibility of returning any parameter of the regression.
However, the simple code below does not work:
df.rolling(10).apply(lambda x: smf.ols('a ~ b', data = x).fit())
File "<string>", line 1, in <module>
PatsyError: Error evaluating factor: NameError: name 'b' is not defined
a ~ b
^
What is the issue?
Thanks!
rolling.apply is not capable of interacting with multiple columns simultaneously, nor is it able to produce non-numeric values. We instead need to take advantage of the iterable nature of rolling objects. We also need to handle min_periods ourselves, since iterating a rolling object yields every window regardless of the other rolling arguments.
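As a quick illustration of that iterable behaviour (a sketch; rolling objects are iterable since pandas 1.1):
# every window is yielded, including the short leading ones,
# so a length check stands in for min_periods when iterating
for window in df.rolling(10):
    print(window.shape)  # (1, 2), (2, 2), ..., (10, 2), (10, 2), ...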
We can then create a function that produces each output row from a window's regression results:
def process(x):
    if len(x) >= 10:
        reg = smf.ols('a ~ b', data=x).fit()
        return [
            # b from params
            reg.params['b'],
            # b from tvalues
            reg.tvalues['b'],
            # Both lower and upper bounds for b from conf_int()
            *reg.conf_int().loc['b', :].tolist()
        ]
    # Return NaN in the same dimension as the results
    return [np.nan] * 4
df = df.join(
    # join new DataFrame back to original
    pd.DataFrame(
        (process(x) for x in df.rolling(10)),
        columns=['coef', 't', 'lower', 'upper']
    )
)
df:
a b coef t lower upper
0 1 3 NaN NaN NaN NaN
1 3 5 NaN NaN NaN NaN
2 5 6 NaN NaN NaN NaN
3 7 2 NaN NaN NaN NaN
4 4 4 NaN NaN NaN NaN
5 5 6 NaN NaN NaN NaN
6 6 2 NaN NaN NaN NaN
7 4 5 NaN NaN NaN NaN
8 7 7 NaN NaN NaN NaN
9 8 1 -0.216802 -0.602168 -1.047047 0.613442
10 9 9 0.042781 0.156592 -0.587217 0.672778
11 1 5 0.032086 0.097763 -0.724742 0.788913
12 3 3 0.113475 0.329006 -0.681872 0.908822
13 5 2 0.198582 0.600297 -0.564258 0.961421
14 7 5 0.203540 0.611002 -0.564646 0.971726
15 4 4 0.236599 0.686744 -0.557872 1.031069
16 5 3 0.293651 0.835945 -0.516403 1.103704
17 6 6 0.314286 0.936382 -0.459698 1.088269
18 4 4 0.276316 0.760812 -0.561191 1.113823
19 7 1 0.346491 1.028220 -0.430590 1.123572
20 8 1 -0.492424 -1.234601 -1.412181 0.427332
21 9 9 0.235075 0.879433 -0.381326 0.851476
Setup:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
df = pd.DataFrame({
    'a': [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9, 1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9],
    'b': [3, 5, 6, 2, 4, 6, 2, 5, 7, 1, 9, 5, 3, 2, 5, 4, 3, 6, 4, 1, 1, 9]
})
Rolling.apply applies the rolling operation to each column separately (see the related question).
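A quick way to see this (a sketch): the applied function only ever receives one column's window as a Series, so the formula can never see a and b together.
# each window arrives as a single-column Series, never the full frame,
# which is why patsy cannot find 'b' when evaluating 'a ~ b'
df.rolling(3).apply(lambda x: x.size)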
Following user3226167's answer in this thread, it seems that the easiest way to accomplish what you want is to use RollingOLS.from_formula from statsmodels.regression.rolling.
from statsmodels.regression.rolling import RollingOLS

df = pd.DataFrame({'a': [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9, 1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9],
                   'b': [3, 5, 6, 2, 4, 6, 2, 5, 7, 1, 9, 5, 3, 2, 5, 4, 3, 6, 4, 1, 1, 9]})

model = RollingOLS.from_formula('a ~ b', data=df, window=10)
reg_obj = model.fit()

# estimated coefficient
b_coeff = reg_obj.params['b'].rename('coef')
# t-value of b
b_t_val = reg_obj.tvalues['b'].rename('t')
# 95% confidence interval of b
b_conf_int = reg_obj.conf_int(cols=[1]).droplevel(level=0, axis=1)

# join all the desired information to the original df
df = df.join([b_coeff, b_t_val, b_conf_int])
where reg_obj is a RollingRegressionResults object, which holds lots of information about the regression (see all its attributes in the docs).
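For example, a few of the other per-window results it exposes (a non-exhaustive sketch; the attribute names follow the statsmodels docs):
reg_obj.params    # DataFrame with per-window 'Intercept' and 'b' estimates
reg_obj.bse       # per-window standard errors of the coefficients
reg_obj.rsquared  # per-window R-squared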
Output
>>> type(reg_obj)
<class 'statsmodels.regression.rolling.RollingRegressionResults'>
>>> df
a b coef t lower upper
0 1 3 NaN NaN NaN NaN
1 3 5 NaN NaN NaN NaN
2 5 6 NaN NaN NaN NaN
3 7 2 NaN NaN NaN NaN
4 4 4 NaN NaN NaN NaN
5 5 6 NaN NaN NaN NaN
6 6 2 NaN NaN NaN NaN
7 4 5 NaN NaN NaN NaN
8 7 7 NaN NaN NaN NaN
9 8 1 -0.216802 -0.602168 -0.922460 0.488856
10 9 9 0.042781 0.156592 -0.492679 0.578240
11 1 5 0.032086 0.097763 -0.611172 0.675343
12 3 3 0.113475 0.329006 -0.562521 0.789472
13 5 2 0.198582 0.600297 -0.449786 0.846949
14 7 5 0.203540 0.611002 -0.449372 0.856452
15 4 4 0.236599 0.686744 -0.438653 0.911851
16 5 3 0.293651 0.835945 -0.394846 0.982147
17 6 6 0.314286 0.936382 -0.343553 0.972125
18 4 4 0.276316 0.760812 -0.435514 0.988146
19 7 1 0.346491 1.028220 -0.313981 1.006963
20 8 1 -0.492424 -1.234601 -1.274162 0.289313
21 9 9 0.235075 0.879433 -0.288829 0.758978
I have a csv that I am loading into a dataframe, and I need to identify every time the values change in a column, label each group of adjacent rows with similar values, AND have the count ignore rows that are not the values I care about.
Using the code below I can successfully identify and label the clusters, but the count does not factor in only the value I want (Desire 1).
import pandas as pd
import numpy as np
import os
InputPath = r'C:\Users\YYYY\Desktop\File1.csv'
df = pd.read_csv(InputPath)
df['Result'] = ((df['Mark'] != df['Mark'].shift(1)).cumsum()).where(df['Mark'] == 1)
Data:
data = {'Series': ['A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B'],
        'Time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        'Mark': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0]}
df = pd.DataFrame(data, columns=['Series', 'Time', 'Mark'])
df
(Desire 2) Additionally, how would I have the count restart at 1 for each "Series", ensuring it still increases with each new cluster as Time increases?
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([['A'] * 9 + ['B'] * 9,
                            list(range(1, 10)) + list(range(1, 10)),
                            [0]*2 + [1]*2 + [0]*2 + [1]*2 + [0]*5 + [1]*1 + [0]*1 + [1]*2 + [0]*1]).transpose(),
                  columns=['Series', 'Time', 'Mark'])
df['Mark'] = [int(x) for x in df['Mark']]

df['Result'] = ((df['Mark'] != df['Mark'].shift(1)).cumsum()).where(df['Mark'] == 1)
df['Desire1'] = ((df['Mark'] != df['Mark'].shift(1)).cumsum() / 2).where(df['Mark'] == 1)

# turn this into a function so that we can use it in the following step:
def get_desire1(df):
    return ((df['Mark'] != df['Mark'].shift(1)).cumsum() / 2).where(df['Mark'] == 1)

df['Desire2'] = df.groupby('Series').apply(get_desire1).to_numpy().flatten()
# or try the older solution:
df['Desire2'] = np.ndarray.flatten(np.array([get_desire1(x[1]) for x in df.groupby('Series')]))
# 'Desire2' is more of a hack, because I dislike how aggregation works in pandas.
# For this kind of task I use R more often ;) .
Then it looks like this:
Series Time Mark Result Desire1 Desire2
0 A 1 0 NaN NaN NaN
1 A 2 0 NaN NaN NaN
2 A 3 1 2.0 1.0 1.0
3 A 4 1 2.0 1.0 1.0
4 A 5 0 NaN NaN NaN
5 A 6 0 NaN NaN NaN
6 A 7 1 4.0 2.0 2.0
7 A 8 1 4.0 2.0 2.0
8 A 9 0 NaN NaN NaN
9 B 1 0 NaN NaN NaN
10 B 2 0 NaN NaN NaN
11 B 3 0 NaN NaN NaN
12 B 4 0 NaN NaN NaN
13 B 5 1 6.0 3.0 1.0
14 B 6 0 NaN NaN NaN
15 B 7 1 8.0 4.0 2.0
16 B 8 1 8.0 4.0 2.0
17 B 9 0 NaN NaN NaN
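A slightly cleaner variant of the Desire2 step (a sketch): with group_keys=False, each per-group Series keeps the original row index, so it aligns on assignment without any flattening.
# index-aligned assignment, no flatten() needed
df['Desire2'] = df.groupby('Series', group_keys=False).apply(get_desire1)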
I have a pandas data frame that I made and pivoted into exactly the shape I want. Now I want to unpivot everything to recover the position data (row and column) from the newly formed data frame alongside each value. For example, I want the first row (in the unpivoted data frame with the position data) to have 1 under "row", 1 under "a", and 13 as the value (example below). Can someone please explain how I can unpivot to get the row and column values? I have tried pd.melt but it didn't seem to work (it made no difference). Thanks! Directly below is the code to make the pivoted data frame.
import pandas as pd

row = [1, 2, 3, 4, 5]
df67 = {'row': row}
df67 = pd.DataFrame(df67, columns=['row'])
df67['a'] = [1, 2, 3, 4, 5]
df67['b'] = [13, 18, 5, 10, 6]
#df67 (dataframe before pivot)
df68 = df67.pivot(index='row', columns='a')
#df68 (dataframe after pivot)
What I want the result to be for the first line:
row | a | value
1 | 1 | 13
Use DataFrame.stack with DataFrame.reset_index:
df = df68.stack().reset_index()
print (df)
row a b
0 1 1 13.0
1 2 2 18.0
2 3 3 5.0
3 4 4 10.0
4 5 5 6.0
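For reference, stack works here because df68 carries a MultiIndex on its columns ('b' over the 'a' values), and stack() moves that inner 'a' level into the row index. A quick check (the exact repr may vary by pandas version):
print(df68.columns)
# MultiIndex([('b', 1), ('b', 2), ('b', 3), ('b', 4), ('b', 5)], names=[None, 'a'])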
EDIT:
To avoid dropping missing values, use the dropna=False parameter:
df = df68.stack(dropna=False).reset_index()
print (df)
row a b
0 1 1 13.0
1 1 2 NaN
2 1 3 NaN
3 1 4 NaN
4 1 5 NaN
5 2 1 NaN
6 2 2 18.0
7 2 3 NaN
8 2 4 NaN
9 2 5 NaN
10 3 1 NaN
11 3 2 NaN
12 3 3 5.0
13 3 4 NaN
14 3 5 NaN
15 4 1 NaN
16 4 2 NaN
17 4 3 NaN
18 4 4 10.0
19 4 5 NaN
20 5 1 NaN
21 5 2 NaN
22 5 3 NaN
23 5 4 NaN
24 5 5 6.0
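Since the question mentions pd.melt: melt can also produce this shape, but only after the column MultiIndex is flattened and row is restored as a column. A sketch:
flat = df68.copy()
flat.columns = flat.columns.droplevel(0)  # drop the constant 'b' level
df = flat.reset_index().melt(id_vars='row', var_name='a', value_name='b')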
For example:
import pandas as pd

df_1 = pd.DataFrame({"A": [1, 5, 3, 4, 2],
                     "B": [3, 2, 4, 3, 4],
                     "C": [2, 2, 7, 3, 4],
                     "D": [4, 3, 6, 12, 7]})
df_2 = pd.DataFrame(index=list(range(5)), columns=['a', 'c'])
df_2.loc[2, ['a', 'c']] = df_1.loc[2, ['A', 'C']]
print(df_1.loc[2, ['A', 'C']])
print(df_2)
I got:
A 3
C 7
Name: 2, dtype: int64
a c
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
Obviously I failed to set multiple cells at the same time in one row. Is there any way to achieve this? (except using loops)
Index alignment is at work here: because the labels a, c do not match the columns A, C, missing values are assigned (so nothing changes). The solution is to assign the underlying numpy array, which bypasses the alignment:
df_2.loc[2,['a','c']] = df_1.loc[2,['A','C']].values
print (df_2)
a c
0 NaN NaN
1 NaN NaN
2 3 7
3 NaN NaN
4 NaN NaN
If you rename the column labels so they match, it works nicely:
df_2.loc[2,['a','c']] = df_1.loc[2,['A','C']].rename({'A':'a','C':'c'})
#alternative
#df_2.loc[2,['a','c']] = df_1.rename(columns={'A':'a','C':'c'}).loc[2,['a','c']]
print (df_2)
a c
0 NaN NaN
1 NaN NaN
2 3 7
3 NaN NaN
4 NaN NaN
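The same idea extends to several rows at once, provided the labels are stripped on the right-hand side (a sketch; to_numpy() is the modern spelling of .values):
# assign rows 1 and 3 in one shot, bypassing label alignment
df_2.loc[[1, 3], ['a', 'c']] = df_1.loc[[1, 3], ['A', 'C']].to_numpy()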
I have a pandas dataframe with two dimensions. I want to calculate the rolling standard deviation along axis 1 while also including datapoints in the rows above and below.
So say I have this df:
data = {'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8],
'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
I want a rectangular window 3 rows high and 2 columns across, moving from left to right. So, for example,
std_df.loc[1, 'C']
would be equal to
np.std([1, 5, 9, 2, 6, 10, 3, 7, 11])
But no idea how to achieve this without very slow iteration
Looks like what you want is DataFrame.shift.
import pandas as pd
import numpy as np
data = {'A': [1,2,3,4], 'B': [5,6,7,8], 'C': [9,10,11,12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Shifting the dataframe you provided by 1 yields the row above
print(df.shift(1))
A B C
0 NaN NaN NaN
1 1.0 5.0 9.0
2 2.0 6.0 10.0
3 3.0 7.0 11.0
Similarly, shifting the dataframe you provided by -1 yields the row below
print(df.shift(-1))
A B C
0 2.0 6.0 10.0
1 3.0 7.0 11.0
2 4.0 8.0 12.0
3 NaN NaN NaN
So the code below should do what you're looking for (add_prefix makes the column names unique):
above_df = df.shift(1).add_prefix('above_')
below_df = df.shift(-1).add_prefix('below_')
lagged = pd.concat([df, above_df, below_df], axis=1)
lagged['std'] = lagged.apply(np.std, axis=1)
print(lagged)
A B C above_A above_B above_C below_A below_B below_C std
0 1 5 9 NaN NaN NaN 2.0 6.0 10.0 3.304038
1 2 6 10 1.0 5.0 9.0 3.0 7.0 11.0 3.366502
2 3 7 11 2.0 6.0 10.0 4.0 8.0 12.0 3.366502
3 4 8 12 3.0 7.0 11.0 NaN NaN NaN 3.304038
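If speed matters, the same 3-row, all-column computation can be expressed with numpy's sliding_window_view (a sketch, assuming numpy >= 1.20; np.nanstd reproduces the NaN-skipping population std used above):
# pad one NaN row above and below, take 3-row windows over all columns,
# then reduce each window with the NaN-aware std
arr = np.pad(df.to_numpy(float), ((1, 1), (0, 0)), constant_values=np.nan)
windows = np.lib.stride_tricks.sliding_window_view(arr, 3, axis=0)
df['std'] = np.nanstd(windows, axis=(1, 2))  # matches the 'std' column above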
Given the following DataFrame in pandas:
user item rating
1 3 2
1 4 5
2 1 5
3 5 1
3 1 3
4 4 4
4 1 1
....
I'd like to transform it into a numpy array, with the user column as the y-axis and the item column as the x-axis, like this:
1 2 3 4 5
1 nan nan 2 5 nan
2 5 nan nan nan nan
3 3 nan nan nan 1
4 1 nan nan 4 nan
How can I use the apply function to do it quickly?
You need a pivot table:
>>> df.pivot_table(index='user', columns='item', values='rating')
1 3 4 5
user
1 NaN 2 5 NaN
2 5 NaN NaN NaN
3 3 NaN NaN 1
4 1 NaN 4 NaN
Note that columns which would be entirely NaN (here item 2) are absent; you can reindex to include them if needed (in modern pandas use reindex(columns=...), since reindex_axis has been removed):
>>> (df.pivot_table(index='user', columns='item', values='rating')
...    .reindex_axis([1, 2, 3, 4, 5], axis=1))
item 1 2 3 4 5
user
1 NaN NaN 2 5 NaN
2 5 NaN NaN NaN NaN
3 3 NaN NaN NaN 1
4 1 NaN NaN 4 NaN
To put these values into a NumPy array, access the .values attribute:
_.values  # _ is the last value returned in the REPL
To do it quickly, build it with numpy tools:
def pivotarray(df):
    users, i = np.unique(df['user'], return_inverse=True)
    item, j = np.unique(df['item'], return_inverse=True)
    a = np.zeros((len(users), len(item)), int)
    a[i, j] = df['rating']
    return a
Then (you can pre-fill a with NaN if required):
In [464]: pivotarray(df)
Out[464]:
array([[0, 2, 5, 0],
[5, 0, 0, 0],
[3, 0, 0, 1],
[1, 0, 4, 0]])
Column 2 is not there because there is no item 2.
The speed gain is significant:
In [465]: %timeit pivotarray(df)
1000 loops, best of 3: 417 µs per loop
In [466]: %timeit df.pivot(index='user', columns='item', values='rating')
100 loops, best of 3: 6.38 ms per loop
In [467]: %timeit df.pivot_table(index='user', columns='item', values='rating')
100 loops, best of 3: 18.6 ms per loop
EDIT
To include the missing items, a possible hack:
def pivotarraywithallitems(df):
    users, i = np.unique(df['user'], return_inverse=True)
    item, j = np.unique(df['item'], return_inverse=True)
    miss = (~np.isin(np.arange(1, 6), item)).cumsum()
    j += miss[j]
    a = np.full((len(users), len(item) + miss[-1]), np.nan)
    a[i, j] = df['rating']
    return a
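Calling it on the sample frame should then give the all-NaN column for the missing item 2:
>>> pivotarraywithallitems(df)
array([[nan, nan,  2.,  5., nan],
       [ 5., nan, nan, nan, nan],
       [ 3., nan, nan, nan,  1.],
       [ 1., nan, nan,  4., nan]])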
You can use pivot:
print(df.pivot(index='user', columns='item', values='rating'))
item 1 3 4 5
user
1 NaN 2 5 NaN
2 5 NaN NaN NaN
3 3 NaN NaN 1
4 1 NaN 4 NaN
Then you need to add the missing columns: find the min and max values and build a range for the labels parameter of reindex_axis:
print(df['item'].min())
1
print(df['item'].max())
5
rng = range(df['item'].min(), df['item'].max() + 1)
print(list(rng))
[1, 2, 3, 4, 5]
print(df.pivot(index='user', columns='item', values='rating').reindex_axis(labels=rng, axis=1))
item 1 2 3 4 5
user
1 NaN NaN 2 5 NaN
2 5 NaN NaN NaN NaN
3 3 NaN NaN NaN 1
4 1 NaN NaN 4 NaN
Last, use values to generate the numpy array:
print(df.pivot(index='user', columns='item', values='rating')
        .reindex_axis(labels=rng, axis=1)
        .values)
[[ nan nan 2. 5. nan]
[ 5. nan nan nan nan]
[ 3. nan nan nan 1.]
[ 1. nan nan 4. nan]]
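On modern pandas (1.0+), where reindex_axis has been removed and to_numpy() is preferred over .values, an equivalent sketch is:
arr = (df.pivot(index='user', columns='item', values='rating')
         .reindex(columns=rng)   # rng as built above
         .to_numpy())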