Consider this simple example:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

df = pd.DataFrame({'a': [1,3,5,7,4,5,6,4,7,8,9,1,3,5,7,4,5,6,4,7,8,9],
                   'b': [3,5,6,2,4,6,2,5,7,1,9,5,3,2,5,4,3,6,4,1,1,9]})
I am trying to perform a rolling regression of a on b using the simplest pandas tool available: apply. I want to use apply because it keeps the flexibility of returning any parameter of the regression.
However, the simple code below does not work:
df.rolling(10).apply(lambda x: smf.ols('a ~ b', data = x).fit())
File "<string>", line 1, in <module>
PatsyError: Error evaluating factor: NameError: name 'b' is not defined
a ~ b
^
What is the issue?
Thanks!
rolling apply is not capable of interacting with multiple columns simultaneously, nor is it able to produce non-numeric values. We instead need to take advantage of the iterable nature of rolling objects. We also need to handle min_periods ourselves, since an iterated rolling object generates every window regardless of the other rolling arguments.
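To see what we are working with: iterating a rolling object yields one sub-DataFrame per window, starting at a single row and growing up to the window size, which is why the length check in the function below is needed. A quick sketch (assuming a pandas version where Rolling objects are iterable, 1.1+):

for window in df.rolling(10):
    # each `window` is a DataFrame containing both columns a and b
    print(window.shape)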
We can then write a function that produces one row of results per window from the fitted regression, something like:
def process(x):
    if len(x) >= 10:
        reg = smf.ols('a ~ b', data=x).fit()
        print(reg.params)
        return [
            # b from params
            reg.params['b'],
            # b from tvalues
            reg.tvalues['b'],
            # Both lower and upper b from conf_int()
            *reg.conf_int().loc['b', :].tolist()
        ]
    # Return NaN in the same dimension as the results
    return [np.nan] * 4
df = df.join(
    # join new DataFrame back to original
    pd.DataFrame(
        (process(x) for x in df.rolling(10)),
        columns=['coef', 't', 'lower', 'upper']
    )
)
df:
a b coef t lower upper
0 1 3 NaN NaN NaN NaN
1 3 5 NaN NaN NaN NaN
2 5 6 NaN NaN NaN NaN
3 7 2 NaN NaN NaN NaN
4 4 4 NaN NaN NaN NaN
5 5 6 NaN NaN NaN NaN
6 6 2 NaN NaN NaN NaN
7 4 5 NaN NaN NaN NaN
8 7 7 NaN NaN NaN NaN
9 8 1 -0.216802 -0.602168 -1.047047 0.613442
10 9 9 0.042781 0.156592 -0.587217 0.672778
11 1 5 0.032086 0.097763 -0.724742 0.788913
12 3 3 0.113475 0.329006 -0.681872 0.908822
13 5 2 0.198582 0.600297 -0.564258 0.961421
14 7 5 0.203540 0.611002 -0.564646 0.971726
15 4 4 0.236599 0.686744 -0.557872 1.031069
16 5 3 0.293651 0.835945 -0.516403 1.103704
17 6 6 0.314286 0.936382 -0.459698 1.088269
18 4 4 0.276316 0.760812 -0.561191 1.113823
19 7 1 0.346491 1.028220 -0.430590 1.123572
20 8 1 -0.492424 -1.234601 -1.412181 0.427332
21 9 9 0.235075 0.879433 -0.381326 0.851476
Setup:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
df = pd.DataFrame({
    'a': [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9, 1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9],
    'b': [3, 5, 6, 2, 4, 6, 2, 5, 7, 1, 9, 5, 3, 2, 5, 4, 3, 6, 4, 1, 1, 9]
})
Rolling.apply applies the rolling operation to each column separately, so the function passed to it never sees both a and b at once.
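A minimal sketch that makes this visible (with raw=False each window arrives as a Series): the callable only ever receives one column's window at a time, never a two-column frame, which is why the formula 'a ~ b' cannot find b:

def peek(x):
    # x is a window of a SINGLE column (a Series), not the full DataFrame
    print(type(x), len(x))
    return 0  # Rolling.apply must return a single numeric value

df.rolling(10, min_periods=1).apply(peek, raw=False)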
Following user3226167's answer in this thread, it seems the easiest way to accomplish what you want is to use RollingOLS.from_formula from statsmodels.regression.rolling.
from statsmodels.regression.rolling import RollingOLS

df = pd.DataFrame({'a': [1,3,5,7,4,5,6,4,7,8,9,1,3,5,7,4,5,6,4,7,8,9],
                   'b': [3,5,6,2,4,6,2,5,7,1,9,5,3,2,5,4,3,6,4,1,1,9]})

model = RollingOLS.from_formula('a ~ b', data=df, window=10)
reg_obj = model.fit()

# estimated coefficient of b
b_coeff = reg_obj.params['b'].rename('coef')

# t-value of b
b_t_val = reg_obj.tvalues['b'].rename('t')

# 95% confidence interval of b
b_conf_int = reg_obj.conf_int(cols=[1]).droplevel(level=0, axis=1)

# join all the desired information to the original df
df = df.join([b_coeff, b_t_val, b_conf_int])
where reg_obj is a RollingRegressionResults object that holds lots of information about the regression (see all of its available attributes in the docs).
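For example, other per-window statistics can be pulled out the same way as params and tvalues above. A sketch (bse is a standard statsmodels results attribute; check the docs for the exact layout in your version):

# rolling standard error of b, assumed to follow the same layout as params / tvalues above
b_std_err = reg_obj.bse['b'].rename('se')
print(b_std_err.tail())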
Output
>>> type(reg_obj)
<class 'statsmodels.regression.rolling.RollingRegressionResults'>
>>> df
a b coef t lower upper
0 1 3 NaN NaN NaN NaN
1 3 5 NaN NaN NaN NaN
2 5 6 NaN NaN NaN NaN
3 7 2 NaN NaN NaN NaN
4 4 4 NaN NaN NaN NaN
5 5 6 NaN NaN NaN NaN
6 6 2 NaN NaN NaN NaN
7 4 5 NaN NaN NaN NaN
8 7 7 NaN NaN NaN NaN
9 8 1 -0.216802 -0.602168 -0.922460 0.488856
10 9 9 0.042781 0.156592 -0.492679 0.578240
11 1 5 0.032086 0.097763 -0.611172 0.675343
12 3 3 0.113475 0.329006 -0.562521 0.789472
13 5 2 0.198582 0.600297 -0.449786 0.846949
14 7 5 0.203540 0.611002 -0.449372 0.856452
15 4 4 0.236599 0.686744 -0.438653 0.911851
16 5 3 0.293651 0.835945 -0.394846 0.982147
17 6 6 0.314286 0.936382 -0.343553 0.972125
18 4 4 0.276316 0.760812 -0.435514 0.988146
19 7 1 0.346491 1.028220 -0.313981 1.006963
20 8 1 -0.492424 -1.234601 -1.274162 0.289313
21 9 9 0.235075 0.879433 -0.288829 0.758978
I have two Series
n1 = pd.Series([1,2,3, np.nan, np.nan, 4, 5], index=[3,4,5,6,7,8,9])
n2 = pd.Series([np.nan, np.nan, 4, 5, 3,], index=[2, 4, 5, 10, 11])
The data format is like the following, and the last column is the result I want to get:
index  n1   n2   result expected (n1 < n2)
2      na   na   na
3      1    na   na
4      2    na   na
5      3    4    True
6      na   na   na
7      na   na   na
8      4    na   na
9      5    na   na
10     na   5    na
11     na   3    na
Here is my solution and it is very inefficient.
import datetime
import numpy as np
import pandas as pd

n1 = pd.Series([1, 2, 3, np.nan, np.nan, 4, 5], index=[3, 4, 5, 6, 7, 8, 9])
n2 = pd.Series([np.nan, np.nan, 4, 5, 3], index=[2, 4, 5, 10, 11])

def GT(n1, n2):
    n1_index = n1.index.values
    n2_index = n2.index.values
    index = np.sort(list(set(list(n1_index) + list(n2_index))))
    new_n1 = pd.Series(np.nan, index=index)
    new_n1.loc[n1_index] = n1.values
    new_n2 = pd.Series(np.nan, index=index)
    new_n2.loc[n2_index] = n2.values
    output = pd.Series(new_n1.values < new_n2.values, index=index)
    output.loc[n1[n1.isnull()].index] = np.nan
    output.loc[n2[n2.isnull()].index] = np.nan
    return output

starttime = datetime.datetime.now()
for i in range(500):
    GT(n1, n2)
endtime = datetime.datetime.now()
print(endtime - starttime)
My rough idea is to rebuild the two arrays with an identical index and compare them, but the current solution is very slow. The for loop is only there to measure the computation cost.
The difficult part for me is how to efficiently compare the two values at the same index, and what the best way is to nullify the output when a value is missing in n1 or n2.
Is there a better solution, especially a more time-efficient one?
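A compact way to express that "identical index" idea is to reindex both Series onto the union of their indexes before comparing. A sketch of that idea (the concat-based answer follows below):

idx = n1.index.union(n2.index)
a, b = n1.reindex(idx), n2.reindex(idx)
# compare, then blank out positions where either input is missing
result = (a < b).where(a.notna() & b.notna())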
Use concat to build a two-column DataFrame and compare the aligned columns:
df = pd.concat([n1, n2], axis=1, keys=('n1','n2'))
print (df['n1'] < df['n2'])
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 False
11 False
dtype: bool
If you need a new column with missing values, it is possible to use map here:
df = pd.concat([n1, n2], axis=1, keys=('n1','n2'))
df['resultexpected'] = (df['n1'] < df['n2']).map({True:True, False: np.nan})
print (df)
n1 n2 resultexpected
2 NaN NaN NaN
3 1.0 NaN NaN
4 2.0 NaN NaN
5 3.0 4.0 True
6 NaN NaN NaN
7 NaN NaN NaN
8 4.0 NaN NaN
9 5.0 NaN NaN
10 NaN 5.0 NaN
11 NaN 3.0 NaN
I have a csv that I am loading into a dataframe, and I need to identify every time values change in a column, label each group of adjacent rows with similar values, AND have the count ignore rows that are not the values I care about.
Using the code below I can successfully identify and label the clusters, but it fails to have the count only factor in the value I want (Desire 1).
import pandas as pd
import numpy as np
import os

InputPath = r'C:\Users\YYYY\Desktop\File1.csv'
df = pd.read_csv(InputPath)
df['Result'] = ((df['Mark'] != df['Mark'].shift(1)).cumsum()).where(df['Mark'] == 1)
Data:
data = {'Series': ['A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B'],
        'Time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        'Mark': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0]}
df = pd.DataFrame(data, columns=['Series', 'Time', 'Mark'])
df
(Desire 2) Additionally, how would I have it restart the count at 1 for each "Series", ensuring the count still increases with each new cluster as Time increases?
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([['A'] * 9 + ['B'] * 9,
                            list(range(1, 10)) + list(range(1, 10)),
                            [0]*2 + [1]*2 + [0]*2 + [1]*2 + [0]*5 + [1]*1 + [0]*1 + [1]*2 + [0]*1]).transpose(),
                  columns=['Series', 'Time', 'Mark'])
df['Mark'] = [int(x) for x in df['Mark']]

df['Result'] = ((df['Mark'] != df['Mark'].shift(1)).cumsum()).where(df['Mark'] == 1)
df['Desire1'] = ((df['Mark'] != df['Mark'].shift(1)).cumsum() / 2).where(df['Mark'] == 1)

# turn this into a function, so that we can use it in the following step:
def get_desire1(df):
    return ((df['Mark'] != df['Mark'].shift(1)).cumsum() / 2).where(df['Mark'] == 1)

df['Desire2'] = df.groupby('Series').apply(get_desire1).to_numpy().flatten()

# or try the older solution:
df['Desire2'] = np.ndarray.flatten(np.array([get_desire1(x[1]) for x in df.groupby('Series')]))

# 'Desire2' is more of a hack, because I dislike how aggregation works in pandas.
# For such stuff I use R more often ;)
Then it looks like this:
Series Time Mark Result Desire1 Desire2
0 A 1 0 NaN NaN NaN
1 A 2 0 NaN NaN NaN
2 A 3 1 2.0 1.0 1.0
3 A 4 1 2.0 1.0 1.0
4 A 5 0 NaN NaN NaN
5 A 6 0 NaN NaN NaN
6 A 7 1 4.0 2.0 2.0
7 A 8 1 4.0 2.0 2.0
8 A 9 0 NaN NaN NaN
9 B 1 0 NaN NaN NaN
10 B 2 0 NaN NaN NaN
11 B 3 0 NaN NaN NaN
12 B 4 0 NaN NaN NaN
13 B 5 1 6.0 3.0 1.0
14 B 6 0 NaN NaN NaN
15 B 7 1 8.0 4.0 2.0
16 B 8 1 8.0 4.0 2.0
17 B 9 0 NaN NaN NaN
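If you prefer to avoid the flatten hack, the same per-group logic can be wrapped in groupby(...)['Mark'].transform, which keeps the result aligned with the original index. A sketch reusing the same cumsum/2 trick, so it assumes each Series group starts with Mark == 0, as in the example data (label_runs and Desire2_alt are illustrative names):

def label_runs(mark):
    # label each run of 1s within a single Series group, counting only rows where Mark == 1
    return ((mark != mark.shift(1)).cumsum() / 2).where(mark == 1)

df['Desire2_alt'] = df.groupby('Series')['Mark'].transform(label_runs)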
https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_quantile.html
I can't see how best to ignore NaNs in the rolling quantile function. Would anyone know?
import pandas as pd
import numpy as np

seriestest = pd.Series([1, 5, 7, 2, 4, 6, 9, 3, 8, 10])
and insert NaNs:
seriestest2 = pd.Series([1, 5, np.nan, 2, 4, np.nan, 9, 3, 8, 10])
Now, on the first series, I get the expected output using:
seriestest.rolling(window = 3).quantile(.5)
But, I wish to do the same and ignore NaNs on the test2 series.
seriestest2.rolling(window = 3).quantile(.5)
Gives:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 8.0
9 8.0
dtype: float64
But I think it would give something like this if we could pass skipna=True, which doesn't work for me:
0 NaN
1 NaN
2 5.0
3 2.0
4 4.0
5 4.0
6 4.0
7 3.0
8 8.0
9 8.0
dtype: float64
The issue is that having NaN values gives you fewer than the required number of elements (3) in your rolling window. You can allow fewer valid observations per window by setting the min_periods parameter.
seriestest2.rolling(window=3, min_periods=1).quantile(.5)
Alternatively, if you simply want to replace nan values, with say 0, you can use fillna:
seriestest2.fillna(value=0).rolling(window=3).quantile(.5)
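If you instead want the quantile of the last three non-NaN observations (a stricter reading of "skipna"), one option is to drop the NaNs before rolling and then re-align to the original index; a sketch (positions that were NaN in the input stay NaN after the reindex):

# roll over the non-NaN values only, then put the results back on the original index
seriestest2.dropna().rolling(window=3).quantile(.5).reindex(seriestest2.index)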
I have a dataframe with two rows:
df = pd.DataFrame({'group': ['c'] * 2,
                   'num_column': range(2),
                   'num_col_2': range(2),
                   'seq_col': [[1,2,3,4,5]] * 2,
                   'seq_col_2': [[1,2,3,4,5]] * 2,
                   'grp_count': [2] * 2})
With 8 nulls, it looks like this:
df = df.append(pd.DataFrame({'group': group}, index=[0] * size))
group grp_count num_col_2 num_column seq_col seq_col_2
0 c 2.0 0.0 0.0 [1, 2, 3, 4, 5] [1, 2, 3, 4, 5]
1 c 2.0 1.0 1.0 [1, 2, 3, 4, 5] [1, 2, 3, 4, 5]
0 c NaN NaN NaN NaN NaN
0 c NaN NaN NaN NaN NaN
0 c NaN NaN NaN NaN NaN
0 c NaN NaN NaN NaN NaN
0 c NaN NaN NaN NaN NaN
0 c NaN NaN NaN NaN NaN
0 c NaN NaN NaN NaN NaN
0 c NaN NaN NaN NaN NaN
What I want
Replace NaN values in the sequence columns (seq_col, seq_col_2, seq_col_3, etc.) with a list of my own.
Note:
In this data there are only 2 sequence columns, but there could be many more.
I cannot replace the lists already in the columns, ONLY NaNs.
I could not find a solution that replaces NaN with a user-provided list value from, say, a dictionary.
Pseudo Code:
for each key, value in dict:
    for each column in df:
        if column matches key in dict:
            # here "matches" means the 'seq_col_n' key of the dict matches the df
            # column named 'seq_col_n'
            replace NaN with value in seq_col_n (which is a list of numbers)
I tried the code below; it works for the first column you pass, but then for the second column it doesn't, which is weird.
df.loc[df['seq_col'].isnull(),['seq_col']] = df.loc[df['seq_col'].isnull(),'seq_col'].apply(lambda m: fill_values['seq_col'])
The above works, but try it again on seq_col_2 and it gives weird results.
Expected Output:
Given the input:
my_dict = {'seq_col': [1,2,3], 'seq_col_2': [6,7,8]}
# after executing the code from pseudo code given, it should look like
group grp_count num_col_2 num_column seq_col seq_col_2
0 c 2.0 0.0 0.0 [1, 2, 3, 4, 5] [1, 2, 3, 4, 5]
1 c 2.0 1.0 1.0 [1, 2, 3, 4, 5] [1, 2, 3, 4, 5]
0 c NaN NaN NaN [1,2,3] [6,7,8]
0 c NaN NaN NaN [1,2,3] [6,7,8]
0 c NaN NaN NaN [1,2,3] [6,7,8]
0 c NaN NaN NaN [1,2,3] [6,7,8]
0 c NaN NaN NaN [1,2,3] [6,7,8]
0 c NaN NaN NaN [1,2,3] [6,7,8]
0 c NaN NaN NaN [1,2,3] [6,7,8]
0 c NaN NaN NaN [1,2,3] [6,7,8]
With input arrays, you can use pd.DataFrame.loc with pd.Series.isnull:
import pandas as pd, numpy as np

df = pd.DataFrame({'group': ['c'] * 2,
                   'num_column': range(2),
                   'num_col_2': range(2),
                   'seq_col': [[1,2,3,4,5]] * 2,
                   'seq_col_2': [[1,2,3,4,5]] * 2,
                   'grp_count': [2] * 2})
df = df.append(pd.DataFrame({'group': ['c']*8}, index=[0] * 8))

L1 = np.array([0, 1, 2, 3, 4, 5, 6, 7])
L2 = np.array([10, 11, 12, 13, 14, 15, 16, 17])

df.loc[df['seq_col'].isnull(), 'seq_col'] = L1
df.loc[df['seq_col_2'].isnull(), 'seq_col_2'] = L2

print(df[['seq_col', 'seq_col_2']])
seq_col seq_col_2
0 [1, 2, 3, 4, 5] [1, 2, 3, 4, 5]
1 [1, 2, 3, 4, 5] [1, 2, 3, 4, 5]
0 0 10
0 1 11
0 2 12
0 3 13
0 4 14
0 5 15
0 6 16
0 7 17
If you need list values in your series, then you can convert to a series explicitly before assignment:
df.loc[df['seq_col'].isnull(), 'seq_col'] = pd.Series([[1, 2, 3]]*len(df))
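To cover the dict-driven version from the pseudo code, here is a sketch that loops over my_dict (the column-to-list mapping given in the question) and only touches NaN cells. It assumes you start from the appended DataFrame before the L1/L2 assignment above, and that the sequence columns contain either lists or NaN; note that every filled row shares the same list object:

my_dict = {'seq_col': [1, 2, 3], 'seq_col_2': [6, 7, 8]}

for col, fill_list in my_dict.items():
    # keep existing lists untouched, replace everything else (the NaNs) with fill_list
    df[col] = df[col].apply(lambda v: v if isinstance(v, list) else fill_list)

print(df[['seq_col', 'seq_col_2']])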