Getting a list of column names and coefficients from ols.params - python

I'm fitting an OLS model using two slices of a dataframe:
gab = ols(formula= 'only_volume ~ all_but_volume', data=data_p ).fit()
where
only_volume = data_p.iloc[:, 0]                    # only the first column
all_but_volume = data_p.iloc[:, 1:data_p.shape[1]] # all but the first column
When I try to extract something, say parameter or pvals, I get something like this:
In [3]: gab.params
Out[3]:
Intercept 2.687598e+06
all_but_volume[0] 5.500544e+01
all_but_volume[1] 2.696902e+02
all_but_volume[2] 3.389568e+04
all_but_volume[3] -2.385838e+04
all_but_volume[4] 5.419860e+02
all_but_volume[5] 3.815161e+02
all_but_volume[6] -2.281344e+04
all_but_volume[7] 1.794128e+04
...
all_but_volume[22] 1.374321e+00
Since gab.params provides 23 coefficients (plus the intercept) and all_but_volume has 23 columns, I was hoping there was a way to get a list/zip of the params with the actual column names, instead of params labelled all_but_volume[i].
Like,
TMC 9.801195e+01
TAC 2.214464e+02
...
What I've tried:
removing all_but_volume and simply using data_p.iloc[:, 1:data_p.shape[1]]
Didn't work:
...
data_p.iloc[:, 1:data_p.shape[1]][21] 2.918531e+04
data_p.iloc[:, 1:data_p.shape[1]][22] 1.395342e+00
Edit:
Sample Data:
data_p.iloc[1:5,:]
Out[31]:
Volume A B C\
1 569886.171878 759.089217 272.446022 4.163908
2 561695.886128 701.165406 330.301260 4.136530
3 627221.486089 377.746089 656.838394 4.130720
4 625181.750625 361.489041 670.575110 4.134467
D E F G H I \
1 1.000842 12993.06 3371.28 236.90 4.92 6.13
2 0.981514 13005.44 3378.69 236.94 4.92 6.13
3 0.836920 13017.22 3384.47 236.98 4.93 6.13
4 0.810541 13028.56 3388.85 237.01 4.94 6.13
J K L M N \
1 ... 0 0 0 0 0
2 ... 0 0 0 0 0
3 ... 0 0 0 0 0
4 ... 0 0 0 0 0
O P Q R S
1 0 0 0 1 9202.171648
2 0 0 0 0 4381.373520
3 0 0 0 0 -13982.443554
4 0 0 0 0 -22878.843149
only_volume is the first column 'Volume'
all_but_volume is all columns except 'volume'

You can use the DataFrame constructor or rename, because gab.params is a Series:
Sample:
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

np.random.seed(2018)
data_p = pd.DataFrame(np.random.rand(10, 5), columns=['Volume','A','B','C','D'])
print (data_p)
Volume A B C D
0 0.882349 0.104328 0.907009 0.306399 0.446409
1 0.589985 0.837111 0.697801 0.802803 0.107215
2 0.757093 0.999671 0.725931 0.141448 0.356721
3 0.942704 0.610162 0.227577 0.668732 0.692905
4 0.416863 0.171810 0.976891 0.330224 0.629044
5 0.160611 0.089953 0.970822 0.816578 0.571366
6 0.345853 0.403744 0.137383 0.900934 0.933936
7 0.047377 0.671507 0.034832 0.252691 0.557125
8 0.525823 0.352968 0.092983 0.304509 0.862430
9 0.716937 0.964071 0.539702 0.950540 0.667982
only_volume = data_p.iloc[:,0] #Only first column
all_but_volume = data_p.iloc[:, 1:data_p.shape[1]] #All but first column
gab = sm.ols(formula= 'only_volume ~ all_but_volume', data=data_p ).fit()
print (gab.params)
Intercept 0.077570
all_but_volume[0] 0.395072
all_but_volume[1] 0.313150
all_but_volume[2] -0.100752
all_but_volume[3] 0.247532
dtype: float64
print (type(gab.params))
<class 'pandas.core.series.Series'>
df = pd.DataFrame({'cols':data_p.columns[1:], 'par': gab.params.values[1:]})
print (df)
cols par
0 A 0.395072
1 B 0.313150
2 C -0.100752
3 D 0.247532
If you want a Series instead:
s = gab.params.rename(dict(zip(gab.params.index, data_p.columns)))
print (s)
Volume 0.077570
A 0.395072
B 0.313150
C -0.100752
D 0.247532
dtype: float64
Or a Series without the first value (the intercept):
s = gab.params.iloc[1:].rename(dict(zip(gab.params.index, data_p.columns)))
print (s)
A 0.395072
B 0.313150
C -0.100752
D 0.247532
dtype: float64
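Alternatively, you can build the formula string from the column names themselves, so statsmodels labels the parameters directly. A minimal sketch (assuming the column names are plain identifiers; names with spaces would need patsy's Q() quoting):
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

np.random.seed(2018)
data_p = pd.DataFrame(np.random.rand(10, 5), columns=['Volume','A','B','C','D'])

# Builds 'Volume ~ A + B + C + D' from the columns of data_p
formula = '{} ~ {}'.format(data_p.columns[0], ' + '.join(data_p.columns[1:]))
gab = sm.ols(formula=formula, data=data_p).fit()
print (gab.params)
# The index is now Intercept, A, B, C, D instead of all_but_volume[i]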


Pandas set value if most columns are equal in a dataframe

Starting from another question of mine from yesterday, Pandas set value if all columns are equal in a dataframe, and from #anky_91's solution there, I'm working on something similar.
Instead of putting 1 or -1 only if all columns are equal, I want something more flexible.
In fact I want 1 if (for example) more than 70% of the columns are 1, -1 for the same but inverse condition, and 0 otherwise.
So this is what I've wrote:
# Instead of using .all I use .sum to count the occurrences of 1 and 0 in each row
m1 = local_df.eq(1).sum(axis=1)
m2 = local_df.eq(0).sum(axis=1)
# Debug print, it works
print(m1)
print(m2)
But I don't know how to change this part:
local_df['enseamble'] = np.select([m1, m2], [1, -1], 0)
m = local_df.drop(local_df.columns.difference(['enseamble']), axis=1)
I write in pseudo code what I want:
tot = m1 + m2
if m1 > m2:
    if (m1 * 100) / tot > 0.7:  # simple percentage calculation
        df['enseamble'] = 1
elif m2 > m1:
    if (m2 * 100) / tot > 0.7:  # simple percentage calculation
        df['enseamble'] = -1
else:
    df['enseamble'] = 0
Thanks
Edit 1
This is an example of expected output:
NET_0 NET_1 NET_2 NET_3 NET_4 NET_5 NET_6
date
2009-08-02 0 1 1 1 0 1
2009-08-03 1 0 0 0 1 0
2009-08-04 1 1 1 0 0 0
date enseamble
2009-08-02 1 # because 1 is more than 70%
2009-08-03 -1 # because 0 is more than 70%
2009-08-04 0 # because 0 and 1 are 50-50
You could obtain the specified output from the following conditions:
thr = 0.7
c1 = (df.eq(1).sum(1)/df.shape[1]).gt(thr)
c2 = (df.eq(0).sum(1)/df.shape[1]).gt(thr)
c2.astype(int).mul(-1).add(c1)
Output
2009-08-02 0
2009-08-03 0
2009-08-04 0
2009-08-05 0
2009-08-06 -1
2009-08-07 1
dtype: int64
Or using np.select:
pd.DataFrame(np.select([c1,c2], [1,-1], 0), index=df.index, columns=['result'])
result
2009-08-02 0
2009-08-03 0
2009-08-04 0
2009-08-05 0
2009-08-06 -1
2009-08-07 1
Try with the following (m1, m2 and tot are the same as what you have):
cond1 = (m1 > m2) & ((m1 * 100 / tot).gt(0.7))
cond2 = (m2 > m1) & ((m2 * 100 / tot).gt(0.7))
df['enseamble'] = np.select([cond1, cond2], [1, -1], 0)
m = df.drop(df.columns.difference(['enseamble']), axis=1)
print(m)
enseamble
date
2009-08-02 1
2009-08-03 -1
2009-08-04 0

Find longest run of consecutive zeros for each user in dataframe

I'm looking to find the max run of consecutive zeros in a DataFrame with the result grouped by user. I'm interested in running the RLE on usage.
sample input:
user--day--usage
A-----1------0
A-----2------0
A-----3------1
B-----1------0
B-----2------1
B-----3------0
Desired output
user---longest_run
a - - - - 2
b - - - - 1
mydata <- mydata[order(mydata$user, mydata$day),]
user <- unique(mydata$user)
d2 <- data.frame(matrix(NA, ncol = 2, nrow = length(user)))
names(d2) <- c("user", "longest_no_usage")
d2$user <- user
for (i in user) {
  if (0 %in% mydata$usage[mydata$user == i]) {
    run <- rle(mydata$usage[mydata$user == i]) # Run Length Encoding
    d2$longest_no_usage[d2$user == i] <- max(run$length[run$values == 0])
  } else {
    d2$longest_no_usage[d2$user == i] <- 0 # some users did not have no-usage days
  }
}
d2 <- d2[order(-d2$longest_no_usage),]
This works in R, but I want to do the same thing in Python and I'm totally stumped.
Use groupby with size on the user column, the usage column, and a helper Series that labels consecutive values, first:
print (df)
user day usage
0 A 1 0
1 A 2 0
2 A 3 1
3 B 1 0
4 B 2 1
5 B 3 0
6 C 1 1
df1 = (df.groupby([df['user'],
                   df['usage'].rename('val'),
                   df['usage'].ne(df['usage'].shift()).cumsum()])
         .size()
         .to_frame(name='longest_run'))
print (df1)
longest_run
user val usage
A 0 1 2
1 2 1
B 0 3 1
5 1
1 4 1
C 1 6 1
Then filter only the zero rows, get the max per user, and reindex to add back users with no zero runs:
df2 = (df1.query('val == 0')
          .max(level=0)
          .reindex(df['user'].unique(), fill_value=0)
          .reset_index())
print (df2)
user longest_run
0 A 2
1 B 1
2 C 0
Detail:
print (df['usage'].ne(df['usage'].shift()).cumsum())
0 1
1 1
2 2
3 3
4 4
5 5
6 6
Name: usage, dtype: int32
Get the max number of consecutive zeros in a series:
def max0(sr):
    return (sr != 0).cumsum().value_counts().max() - (0 if (sr != 0).cumsum().value_counts().idxmax() == 0 else 1)
max0(pd.Series([1,0,0,0,0,2,3]))
4
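The one-liner works because (sr != 0).cumsum() stays constant across a run of zeros, so every zero run is grouped together with the non-zero value just before it. An expanded, commented version of the same idea (a sketch with the same logic as max0 above):
import pandas as pd

def max0_expanded(sr):
    # Label elements: the counter only increases on non-zero values, so a run
    # of zeros shares the label of the non-zero element that precedes it.
    labels = (sr != 0).cumsum()
    counts = labels.value_counts()
    # The largest group is the longest zero run plus its leading non-zero
    # element, unless the series starts with zeros (label 0 has no leader).
    return counts.max() - (0 if counts.idxmax() == 0 else 1)

print (max0_expanded(pd.Series([1, 0, 0, 0, 0, 2, 3])))  # 4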
I think the following does what you are looking for, where the consecutive_zero function is an adaptation of the top answer here.
Hope this helps!
import pandas as pd
from itertools import groupby
df = pd.DataFrame([['A', 1], ['A', 0], ['A', 0], ['B', 0],['B',1],['C',2]],
columns=["user", "usage"])
def len_iter(items):
    return sum(1 for _ in items)

def consecutive_zero(data):
    x = list(len_iter(run) for val, run in groupby(data) if val == 0)
    if len(x) == 0:
        return 0
    else:
        return max(x)
df.groupby('user').apply(lambda x: consecutive_zero(x['usage']))
Output:
user
A 2
B 1
C 0
dtype: int64
If you have a large dataset and speed is essential, you might want to try the high-performance pyrle library.
Setup:
# pip install pyrle
# or
# conda install -c bioconda pyrle
import numpy as np
np.random.seed(0)
import pandas as pd
from pyrle import Rle
size = int(1e7)
number = np.random.randint(2, size=size)
user = np.random.randint(5, size=size)
df = pd.DataFrame({"User": np.sort(user), "Number": number})
df
# User Number
# 0 0 0
# 1 0 1
# 2 0 1
# 3 0 0
# 4 0 1
# ... ... ...
# 9999995 4 1
# 9999996 4 1
# 9999997 4 0
# 9999998 4 0
# 9999999 4 1
#
# [10000000 rows x 2 columns]
Execution:
for u, udf in df.groupby("User"):
    r = Rle(udf.Number)
    is_0 = r.values == 0
    print("User", u, "Max", np.max(r.runs[is_0]))
# (Wall time: 1.41 s)
# User 0 Max 20
# User 1 Max 23
# User 2 Max 20
# User 3 Max 22
# User 4 Max 23
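If you want the results in a Series rather than printed, the same loop can collect them (a small usage sketch reusing df, Rle and np from the setup above):
longest = {}
for u, udf in df.groupby("User"):
    r = Rle(udf.Number)
    is_0 = r.values == 0
    longest[u] = int(np.max(r.runs[is_0]))
result = pd.Series(longest, name="longest_run")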

Mapping values inside pandas column

I used the code below to map the value 2 in the S column to 0, but it didn't work. Any suggestion on how to solve this?
N.B.: I want to use an external function inside the map.
df = pd.DataFrame({
'Age': [30,40,50,60,70,80],
'Sex': ['F','M','M','F','M','F'],
'S' : [1,1,2,2,1,2]
})
def app(value):
    for n in df['S']:
        if n == 1:
            return 1
        if n == 2:
            return 0

df["S"] = df.S.map(app)
Use eq to create a boolean series and convert that boolean series to int with astype:
df['S'] = df['S'].eq(1).astype(int)
OR
df['S'] = (df['S'] == 1).astype(int)
Output:
Age Sex S
0 30 F 1
1 40 M 1
2 50 M 0
3 60 F 0
4 70 M 1
5 80 F 0
Don't use apply, simply use loc to assign the values:
df.loc[df.S.eq(2), 'S'] = 0
Age Sex S
0 30 F 1
1 40 M 1
2 50 M 0
3 60 F 0
4 70 M 1
5 80 F 0
If you need a more performant option, use np.select. This is also more scalable, as you can always add more conditions:
df['S'] = np.select([df.S.eq(2)], [0], 1)
You're close, but you need a few corrections. Since you want to use a function, remove the for loop and work with the single value passed in. Additionally, use apply instead of map; Series.apply calls the function on each value of the column. See this answer for how to properly use apply vs applymap vs map.
def app(value):
    if value == 1:
        return 1
    elif value == 2:
        return 0

df['S'] = df.S.apply(app)
Age Sex S
0 30 F 1
1 40 M 1
2 50 M 0
3 60 F 0
4 70 M 1
5 80 F 0
If you only wish to change values equal to 2, you can use pd.DataFrame.loc:
df.loc[df['S'] == 2, 'S'] = 0
pd.Series.apply is not recommended here, as it is just a thinly veiled, inefficient loop.
You could use .replace as follows:
df["S"] = df["S"].replace([2], 0)
This will replace all 2 values with 0 in one line.
Go with a vectorized numpy operation:
df['S'] = np.abs(df['S'] - 2)
and stand out from the competition in interviews and SO answers :)
>>> df = pd.DataFrame({'Age': [30,40,50,60,70,80],
...                    'Sex': ['F','M','M','F','M','F'],
...                    'S': [1,1,2,2,1,2]})
>>> def app(value):
...     return 1 if value == 1 else 0
>>> # or app = lambda value: 1 if value == 1 else 0
>>> df["S"] = df["S"].map(app)
>>> df
Age S Sex
0 30 1 F
1 40 1 M
2 50 0 M
3 60 0 F
4 70 1 M
5 80 0 F
You can do:
import numpy as np
df['S'] = np.where(df['S'] == 2, 0, df['S'])
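For completeness, a dictionary passed to Series.map is another concise option; unmapped values become NaN, so this sketch assumes S only contains 1 and 2:
df['S'] = df['S'].map({1: 1, 2: 0})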

Simple method to change DataFrame column type but use a default value for errors?

Suppose I have the following column.
>>> import pandas
>>> a = pandas.Series(['0', '1', '5', '1', None, '3', 'Cat', '2'])
I would like to be able to convert all the data in the column to type int, and any element that cannot be converted should be replaced with a 0.
My current solution to this is to use to_numeric with the 'coerce' option, fill any NaN with 0, and then convert to int (since the presence of NaN made the column float instead of int).
>>> pandas.to_numeric(a, errors='coerce').fillna(0).astype(int)
0 0
1 1
2 5
3 1
4 0
5 3
6 0
7 2
dtype: int64
Is there any method that would allow me to do this in one step rather than having to go through two intermediate states? I am looking for something that would behave like the following imaginary option to astype:
>>> a.astype(int, value_on_error=0)
Option 1
pd.to_numeric(a, 'coerce').fillna(0).astype(int)
Option 2
b = pd.to_numeric(a, 'coerce')
b.mask(b.isnull(), 0).astype(int)
Option 3
def try_int(x):
    try:
        return int(x)
    except:
        return 0

a.apply(try_int)
Option 4
b = np.empty(a.shape, dtype=int)
i = np.core.defchararray.isdigit(a.values.astype(str))
b[i] = a[i].astype(int)
b[~i] = 0
pd.Series(b, a.index)
All produce
0 0
1 1
2 5
3 1
4 0
5 3
6 0
7 2
dtype: int64
Timing
Code below:
from timeit import timeit

def pir1(a):
    return pd.to_numeric(a, 'coerce').fillna(0).astype(int)

def pir2(a):
    b = pd.to_numeric(a, 'coerce')
    return b.mask(b.isnull(), 0).astype(int)

def try_int(x):
    try:
        return int(x)
    except:
        return 0

def pir3(a):
    return a.apply(try_int)

def pir4(a):
    b = np.empty(a.shape, dtype=int)
    i = np.core.defchararray.isdigit(a.values.astype(str))
    b[i] = a[i].astype(int)
    b[~i] = 0
    return pd.Series(b, a.index)

def alt1(a):
    return pd.to_numeric(a.where(a.str.isnumeric(), 0))

results = pd.DataFrame(
    index=[1, 3, 10, 30, 100, 300, 1000, 3000, 10000],
    columns='pir1 pir2 pir3 pir4 alt1'.split()
)

for i in results.index:
    c = pd.concat([a] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(c)'.format(j)
        setp = 'from __main__ import c, {}'.format(j)
        # set_value is deprecated in newer pandas; results.at[i, j] = ... works there
        results.set_value(i, j, timeit(stmt, setp, number=10))

results.plot(logx=True, logy=True)
a.where(a.str.isnumeric(),0).astype(int)
Output:
0 0
1 1
2 5
3 1
4 0
5 3
6 0
7 2
dtype: int64
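Note that str.isnumeric is False for strings such as '-3' or '2.5', so the where-based approach replaces them with 0, while to_numeric with errors='coerce' still parses them. A quick check (a sketch):
b = pandas.Series(['-3', '2.5', 'Cat'])
print(pandas.to_numeric(b, errors='coerce').fillna(0))  # -3.0, 2.5, 0.0
print(b.where(b.str.isnumeric(), 0))                    # 0, 0, 0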

Counting how many times in a row the result of a sum is positive (or negative)

First Part
I have a dataframe with finance data (33023 rows; here is the link to the data: https://mab.to/Ssy3TelRs); df.open is the opening price of the title and df.close is the closing price.
I have been trying to see how many times in a row the title closed with a gain and with a loss.
The result that I'm looking for should tell me that the title was positive 2 days in a row x times, 3 days in a row y times, 4 days in a row z times and so forth.
I started with a for loop:
for x in range(1, df.close.count()):
    y = df.close[x] - df.open[x]
and then an unsuccessful series of if statements...
Thank you for your help.
CronosVirus00
EDITS:
>>> df.head(7)
data ora open max min close Unnamed: 6
0 20160801 0 1.11781 1.11781 1.11772 1.11773 0
1 20160801 100 1.11774 1.11779 1.11773 1.11777 0
2 20160801 200 1.11779 1.11800 1.11779 1.11795 0
3 20160801 300 1.11794 1.11801 1.11771 1.11771 0
4 20160801 400 1.11766 1.11772 1.11763 1.11772 0
5 20160801 500 1.11774 1.11798 1.11774 1.11796 0
6 20160801 600 1.11796 1.11796 1.11783 1.11783 0
Ifs:
for x in range(1, df.close.count()):
    y = df.close[x] - df.open[x]
    if y > 0:
        green += 1
        y = df.close[x+1] - df.close[x+1]
        twotimes += 1
        if y > 0:
            green += 1
            y = df.close[x+2] - df.close[x+2]
            threetimes += 1
            if y > 0:
                green += 1
                y = df.close[x+3] - df.close[x+3]
                fourtimes += 1
FINAL SOLUTION
Thank you all! In the end I did this:
df['test'] = df.close - df.open > 0
green = df.test  # days that it was positive

def gg(z):
    tot = green.count()
    giorni = range(1, z+1)  # days in a row I want to check
    for x in giorni:
        y = (green.rolling(x).sum() > x-1).sum()
        print(x, " ", y, " ", round((y/tot)*100, 1), "%")

gg(5)
1 14850 45.0 %
2 6647 20.1 %
3 2980 9.0 %
4 1346 4.1 %
5 607 1.8 %
If I understood your question correctly, you can do it this way:
In [76]: df.groupby((df.close.diff() < 0).cumsum()).cumcount()
Out[76]:
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 0
dtype: int64
The result that I'm looking for should tell me that the title was
positive 2 days in a row x times, 3 days in a row y times, 4 days in a
row z times and so forth.
In [114]: df.groupby((df.close.diff() < 0).cumsum()).cumcount().value_counts().to_frame('count')
Out[114]:
count
0 4
2 2
1 2
Data set:
In [78]: df
Out[78]:
data ora open max min close
0 20160801 0 1.11781 1.11781 1.11772 1.11773
1 20160801 100 1.11774 1.11779 1.11773 1.11777
2 20160801 200 1.11779 1.11800 1.11779 1.11795
3 20160801 300 1.11794 1.11801 1.11771 1.11771
4 20160801 400 1.11766 1.11772 1.11763 1.11772
5 20160801 500 1.11774 1.11798 1.11774 1.11796
6 20160801 600 1.11796 1.11796 1.11783 1.11783
7 20160801 700 1.11783 1.11799 1.11783 1.11780
In [80]: df.close.diff()
Out[80]:
0 NaN
1 0.00004
2 0.00018
3 -0.00024
4 0.00001
5 0.00024
6 -0.00013
7 -0.00003
Name: close, dtype: float64
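The grouping key is the other half of the trick: (df.close.diff() < 0).cumsum() increments every time the close drops, so each stretch of non-decreasing closes forms its own group and cumcount numbers the rows inside it. A sketch of the intermediate series (same df as above):
print ((df.close.diff() < 0).cumsum())
0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    3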
It sounds like what you want to do is:
1. compute the difference of the two series (close & open), e.g. diff = df.close - df.open
2. apply a condition to the result to get a boolean series, diff > 0
3. pass the resulting boolean series to the DataFrame to get the subset of the DataFrame where the condition is true, df[diff > 0]
4. find all contiguous subsequences by applying a column-wise function to identify and count them
I need to board a plane, but I will provide a sample of what the last step looks like when I can.
If I understood you correctly, you want the number of days that have at least n positive days in a row ending on (and including) that day.
Similarly to what #Thang suggested, you can use rolling:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 2), columns=["open", "close"])
# This just sets up random test data, for example:
# open close
# 0 0.997986 0.594789
# 1 0.052712 0.401275
# 2 0.895179 0.842259
# 3 0.747268 0.919169
# 4 0.113408 0.253440
# 5 0.199062 0.399003
# 6 0.436424 0.514781
# 7 0.180154 0.235816
# 8 0.750042 0.558278
# 9 0.840404 0.139869
positiveDays = df["close"]-df["open"] > 0
# This will give you a series that is True for positive days:
# 0 False
# 1 True
# 2 False
# 3 True
# 4 True
# 5 True
# 6 True
# 7 True
# 8 False
# 9 False
# dtype: bool
daysToCheck = 3
positiveDays.rolling(daysToCheck).sum()>daysToCheck-1
This will now give you a series, indicating for every day, whether it has been positive for daysToCheck number of days in a row:
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 False
9 False
dtype: bool
Now you can use (positiveDays.rolling(daysToCheck).sum()>daysToCheck-1).sum() to get the number of days (in the example 3) that obey this, which is what you want, as far as I understand.
This should work:
import pandas as pd
import numpy as np
test = pd.DataFrame(np.random.randn(100,2), columns = ['open','close'])
test['gain?'] = (test['open']-test['close'] < 0)
test['cumulative'] = 0
for i in test.index[1:]:
    if test['gain?'][i]:
        test['cumulative'][i] = test['cumulative'][i-1] + 1
        test['cumulative'][i-1] = 0
results = test['cumulative'].value_counts()
Ignore the '0' row in the results. It can be modified without too much trouble if you want to, e.g., count both days in a run of two as runs of one as well.
Edit: without the warnings -
import pandas as pd
import numpy as np
test = pd.DataFrame(np.random.randn(100,2), columns = ['open','close'])
test['gain?'] = (test['open']-test['close'] < 0)
test['cumulative'] = 0
for i in test.index[1:]:
    if test['gain?'][i]:
        test.loc[i, 'cumulative'] = test.loc[i-1, 'cumulative'] + 1
        test.loc[i-1, 'cumulative'] = 0
results = test['cumulative'].value_counts()
