I'm looking to find the max run of consecutive zeros in a DataFrame with the result grouped by user. I'm interested in running the RLE on usage.
Sample input:
user  day  usage
A     1    0
A     2    0
A     3    1
B     1    0
B     2    1
B     3    0
Desired output:
user  longest_run
A     2
B     1
mydata <- mydata[order(mydata$user, mydata$day),]
user <- unique(mydata$user)
d2 <- data.frame(matrix(NA, ncol = 2, nrow = length(user)))
names(d2) <- c("user", "longest_no_usage")
d2$user <- user
for (i in user) {
  if (0 %in% mydata$usage[mydata$user == i]) {
    run <- rle(mydata$usage[mydata$user == i]) # Run Length Encoding
    d2$longest_no_usage[d2$user == i] <- max(run$length[run$values == 0])
  } else {
    d2$longest_no_usage[d2$user == i] <- 0 # some users did not have no-usage days
  }
}
d2 <- d2[order(-d2$longest_no_usage),]
This works in R, but I want to do the same thing in Python and I'm totally stumped.
First use groupby with size, grouping by user, usage, and a helper Series that marks each run of consecutive values:
print (df)
user day usage
0 A 1 0
1 A 2 0
2 A 3 1
3 B 1 0
4 B 2 1
5 B 3 0
6 C 1 1
df1 = (df.groupby([df['user'],
df['usage'].rename('val'),
df['usage'].ne(df['usage'].shift()).cumsum()])
.size()
.to_frame(name='longest_run'))
print (df1)
longest_run
user val usage
A 0 1 2
1 2 1
B 0 3 1
5 1
1 4 1
C 1 6 1
Then filter to the zero rows only, take the max per user, and reindex to bring back users that have no zero runs (filled with 0):
df2 = (df1.query('val == 0')
.max(level=0)
.reindex(df['user'].unique(), fill_value=0)
.reset_index())
print (df2)
user longest_run
0 A 2
1 B 1
2 C 0
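Note (my addition, not part of the original answer): on newer pandas versions the level argument of max has been removed, so .max(level=0) no longer works; an equivalent using the same df1 is:
df2 = (df1.query('val == 0')
          .groupby(level=0).max()
          .reindex(df['user'].unique(), fill_value=0)
          .reset_index())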
Detail:
print (df['usage'].ne(df['usage'].shift()).cumsum())
0 1
1 1
2 2
3 3
4 4
5 5
6 6
Name: usage, dtype: int32
Get the max number of consecutive zeros in a Series:
def max0(sr):
    # each run of zeros gets grouped together with the non-zero value that precedes it
    counts = (sr != 0).cumsum().value_counts()
    # subtract 1 for that leading non-zero value, unless the series starts with zeros
    return counts.max() - (0 if counts.idxmax() == 0 else 1)
max0(pd.Series([1,0,0,0,0,2,3]))
4
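A usage sketch (my addition), applying the helper per user to the df from the question:
df.groupby('user')['usage'].apply(max0)
# gives A -> 2, B -> 1, C -> 0 for the sample data above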
I think the following does what you are looking for, where the consecutive_zero function is an adaptation of the top answer here.
Hope this helps!
import pandas as pd
from itertools import groupby
df = pd.DataFrame([['A', 1], ['A', 0], ['A', 0], ['B', 0],['B',1],['C',2]],
columns=["user", "usage"])
def len_iter(items):
    # length of an iterator without materialising it
    return sum(1 for _ in items)

def consecutive_zero(data):
    # lengths of all runs of zeros in the column; 0 if there are none
    x = list(len_iter(run) for val, run in groupby(data) if val == 0)
    if len(x) == 0:
        return 0
    else:
        return max(x)

df.groupby('user').apply(lambda x: consecutive_zero(x['usage']))
Output:
user
A 2
B 1
C 0
dtype: int64
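If you want the exact two-column frame shown in the question, reset the index (a small addition to the answer above):
df.groupby('user').apply(lambda x: consecutive_zero(x['usage'])).reset_index(name='longest_run')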
If you have a large dataset and speed is essential, you might want to try the high-performance pyrle library.
Setup:
# pip install pyrle
# or
# conda install -c bioconda pyrle
import numpy as np
np.random.seed(0)
import pandas as pd
from pyrle import Rle
size = int(1e7)
number = np.random.randint(2, size=size)
user = np.random.randint(5, size=size)
df = pd.DataFrame({"User": np.sort(user), "Number": number})
df
# User Number
# 0 0 0
# 1 0 1
# 2 0 1
# 3 0 0
# 4 0 1
# ... ... ...
# 9999995 4 1
# 9999996 4 1
# 9999997 4 0
# 9999998 4 0
# 9999999 4 1
#
# [10000000 rows x 2 columns]
Execution:
for u, udf in df.groupby("User"):
r = Rle(udf.Number)
is_0 = r.values == 0
print("User", u, "Max", np.max(r.runs[is_0]))
# (Wall time: 1.41 s)
# User 0 Max 20
# User 1 Max 23
# User 2 Max 20
# User 3 Max 22
# User 4 Max 23
Related
I'm using Python 3. I would like to remove incorrect ids from my dataframe column.
Example:
d = {'name': ['a', 'b', 'c', 'd'], 'id': [9356622,9030321,9408530, 1112200]}
df = pd.DataFrame(data=d)
I need to verify each id by multiplying each of its first six digits by a factor of 2 to 7 corresponding to the digit's position from right to left. For example, for id 9356622:
(9×7) + (3×6) + (5×5) + (6×4) + (6×3) + (2×2) = 152. The last digit of this sum is 2, which matches the last digit of 9356622, so this id is correct. So after performing this calculation I need to compare its last digit with the last digit of the id.
Input data:
>>> df
name id
0 a 9356622
1 b 9030321
2 c 9408530
3 d 1112200
Explode the id numbers to digits:
df1 = df['id'].astype(str).map(list).apply(pd.Series).astype(int)
>>> df1
0 1 2 3 4 5 6
0 9 3 5 6 6 2 2 # 152 -> modulo(10) = 2 -> True
1 9 0 3 0 3 2 1 # 91 -> modulo(10) = 1 -> True
2 9 4 0 8 5 3 0 # 140 -> modulo(10) = 0 -> True
3 1 1 1 2 2 0 0 # 32 -> modulo(10) = 2 -> False
Now check your math operation:
>>> df1.iloc[:, :6].mul(range(7, 1, -1)).sum(axis=1).mod(10) == df1.iloc[:, 6]
0 True
1 True
2 True
3 False
dtype: bool
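To actually drop the invalid ids from df (the stated goal), you can reuse that boolean mask, for example:
mask = df1.iloc[:, :6].mul(range(7, 1, -1)).sum(axis=1).mod(10) == df1.iloc[:, 6]
df = df[mask]   # keeps rows a, b, c and drops d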
import re

def fun_IMO(string):
    try:
        pattern = r"([0-9]{7})"
        regexFinder = re.compile(pattern)
        string = string.lower()
        res = regexFinder.search(string)
        if res.groups():
            try:
                numberIMO = res.groups()[0]
                # multiply the first six digits by 7, 6, 5, 4, 3, 2 and sum them
                numberIMO_calc = (int(numberIMO[0])*7) + (int(numberIMO[1])*6) + (int(numberIMO[2])*5) + (int(numberIMO[3])*4) + (int(numberIMO[4])*3) + (int(numberIMO[5])*2)
                # the last digit of the sum must equal the last (seventh) digit of the id
                if str(numberIMO_calc)[-1] == numberIMO[6]:
                    return True
                else:
                    return False
            except Exception as e:
                return e
    except Exception as e:
        return e
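A usage sketch for this function (my addition, assuming the sample df above and fun_IMO as defined):
df = df[df['id'].astype(str).apply(fun_IMO) == True]   # keep only the valid ids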
I'm fitting an OLS model using two slices of a dataframe:
gab = ols(formula= 'only_volume ~ all_but_volume', data=data_p ).fit()
where,
only_volume = data_p.iloc[:,0] #Only first colum
all_but_volume = data_p.iloc[:, 1:data_p.shape[1]] #All but first column
When I try to extract something, say the params or pvalues, I get something like this:
In [3]: gab.params
Out[3]:
Intercept 2.687598e+06
all_but_volume[0] 5.500544e+01
all_but_volume[1] 2.696902e+02
all_but_volume[2] 3.389568e+04
all_but_volume[3] -2.385838e+04
all_but_volume[4] 5.419860e+02
all_but_volume[5] 3.815161e+02
all_but_volume[6] -2.281344e+04
all_but_volume[7] 1.794128e+04
...
all_but_volume[22] 1.374321e+00
Since gab.params provides 23 values and all_but_volume has 23 columns, I was hoping there was a way to get a list/zip of the params with the column names, instead of params labelled all_but_volume[i].
Like,
TMC 9.801195e+01
TAC 2.214464e+02
...
What I've tried:
removing all_but_volume and simply using data_p.iloc[:, 1:data_p.shape[1]]
Didn't work:
...
data_p.iloc[:, 1:data_p.shape[1]][21] 2.918531e+04
data_p.iloc[:, 1:data_p.shape[1]][22] 1.395342e+00
Edit:
Sample Data:
data_p.iloc[1:5,:]
Out[31]:
Volume A B C\
1 569886.171878 759.089217 272.446022 4.163908
2 561695.886128 701.165406 330.301260 4.136530
3 627221.486089 377.746089 656.838394 4.130720
4 625181.750625 361.489041 670.575110 4.134467
D E F G H I \
1 1.000842 12993.06 3371.28 236.90 4.92 6.13
2 0.981514 13005.44 3378.69 236.94 4.92 6.13
3 0.836920 13017.22 3384.47 236.98 4.93 6.13
4 0.810541 13028.56 3388.85 237.01 4.94 6.13
J K L M N \
1 ... 0 0 0 0 0
2 ... 0 0 0 0 0
3 ... 0 0 0 0 0
4 ... 0 0 0 0 0
O P Q R S
1 0 0 0 1 9202.171648
2 0 0 0 0 4381.373520
3 0 0 0 0 -13982.443554
4 0 0 0 0 -22878.843149
only_volume is the first column, 'Volume'
all_but_volume is all columns except 'Volume'
You can use the DataFrame constructor or rename, because gab.params is a Series:
Sample:
np.random.seed(2018)
import statsmodels.formula.api as sm
data_p = pd.DataFrame(np.random.rand(10, 5), columns=['Volume','A','B','C','D'])
print (data_p)
Volume A B C D
0 0.882349 0.104328 0.907009 0.306399 0.446409
1 0.589985 0.837111 0.697801 0.802803 0.107215
2 0.757093 0.999671 0.725931 0.141448 0.356721
3 0.942704 0.610162 0.227577 0.668732 0.692905
4 0.416863 0.171810 0.976891 0.330224 0.629044
5 0.160611 0.089953 0.970822 0.816578 0.571366
6 0.345853 0.403744 0.137383 0.900934 0.933936
7 0.047377 0.671507 0.034832 0.252691 0.557125
8 0.525823 0.352968 0.092983 0.304509 0.862430
9 0.716937 0.964071 0.539702 0.950540 0.667982
only_volume = data_p.iloc[:,0] #Only first colum
all_but_volume = data_p.iloc[:, 1:data_p.shape[1]] #All but first column
gab = sm.ols(formula= 'only_volume ~ all_but_volume', data=data_p ).fit()
print (gab.params)
Intercept 0.077570
all_but_volume[0] 0.395072
all_but_volume[1] 0.313150
all_but_volume[2] -0.100752
all_but_volume[3] 0.247532
dtype: float64
print (type(gab.params))
<class 'pandas.core.series.Series'>
df = pd.DataFrame({'cols':data_p.columns[1:], 'par': gab.params.values[1:]})
print (df)
cols par
0 A 0.395072
1 B 0.313150
2 C -0.100752
3 D 0.247532
If you want to return a Series:
s = gab.params.rename(dict(zip(gab.params.index, data_p.columns)))
print (s)
Volume 0.077570
A 0.395072
B 0.313150
C -0.100752
D 0.247532
dtype: float64
Series without the first value (the intercept):
s = gab.params.iloc[1:].rename(dict(zip(gab.params.index, data_p.columns)))
print (s)
A 0.395072
B 0.313150
C -0.100752
D 0.247532
dtype: float64
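An alternative sketch (my addition, not from the original answer): build the formula string from the column names, so the fitted params carry the column names directly. This assumes the column names are valid formula terms:
formula = 'Volume ~ ' + ' + '.join(data_p.columns[1:])
gab = sm.ols(formula=formula, data=data_p).fit()
print(gab.params)   # index is now Intercept, A, B, C, D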
I used the code below to map the value 2 in the S column to 0, but it didn't work. Any suggestion on how to solve this?
N.B.: I want to use an external function inside the map.
df = pd.DataFrame({
'Age': [30,40,50,60,70,80],
'Sex': ['F','M','M','F','M','F'],
'S' : [1,1,2,2,1,2]
})
def app(value):
for n in df['S']:
if n == 1:
return 1
if n == 2:
return 0
df["S"] = df.S.map(app)
Use eq to create a boolean series and convert that boolean series to int with astype:
df['S'] = df['S'].eq(1).astype(int)
OR
df['S'] = (df['S'] == 1).astype(int)
Output:
Age Sex S
0 30 F 1
1 40 M 1
2 50 M 0
3 60 F 0
4 70 M 1
5 80 F 0
Don't use apply, simply use loc to assign the values:
df.loc[df.S.eq(2), 'S'] = 0
Age Sex S
0 30 F 1
1 40 M 1
2 50 M 0
3 60 F 0
4 70 M 1
5 80 F 0
If you need a more performant option, use np.select. This is also more scalable, as you can always add more conditions:
df['S'] = np.select([df.S.eq(2)], [0], 1)
You're close, but you need a few corrections. Since you want to use a function, remove the for loop and use value directly: the loop over df['S'] inside app always returns on its first iteration, so every row gets 1. With that fixed, either Series.map or Series.apply will call the function once per element. See this answer for how to properly use apply vs applymap vs map
def app(value):
if value == 1:
return 1
elif value == 2:
return 0
df['S'] = df.S.apply(app)
Age Sex S
0 30 F 1
1 40 M 1
2 50 M 0
3 60 F 0
4 70 M 1
5 80 F 0
If you only wish to change values equal to 2, you can use pd.DataFrame.loc:
df.loc[df['S'] == 2, 'S'] = 0
pd.Series.apply is not recommended here; it is just a thinly veiled, inefficient loop.
You could use .replace as follows:
df["S"] = df["S"].replace([2], 0)
This replaces all 2 values with 0 in one line.
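As a small addition (not from the original answers): Series.map also accepts a dict, which covers the original goal without writing a function:
df['S'] = df['S'].map({1: 1, 2: 0})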
Go with a vectorized NumPy operation (since S only contains 1 and 2, |S - 2| maps 1 to 1 and 2 to 0):
df['S'] = np.abs(df['S'] - 2)
and stand out from the competition in interviews and SO answers :)
>>> df = pd.DataFrame({'Age': [30,40,50,60,70,80],
                       'Sex': ['F','M','M','F','M','F'],
                       'S': [1,1,2,2,1,2]})
>>> def app(value):
        return 1 if value == 1 else 0
# or: app = lambda value: 1 if value == 1 else 0
>>> df["S"] = df["S"].map(app)
>>> df
   Age  S Sex
0   30  1   F
1   40  1   M
2   50  0   M
3   60  0   F
4   70  1   M
5   80  0   F
You can do:
import numpy as np
df['S'] = np.where(df['S'] == 2, 0, df['S'])
First Part
I have a dataframe with finance data (33023 rows, here the link to the
data: https://mab.to/Ssy3TelRs); df.open is the opening price of the security and
df.close is the closing price.
I have been trying to see how many times in a row the security closed
with a gain and with a loss.
The result that I'm looking for should tell me that the title was
positive 2 days in a row x times, 3 days in a row y times, 4 days in a
row z times and so forth.
I have started with a for loop:
for x in range(1, df.close.count()):
    y = df.close[x] - df.open[x]
and then an unsuccessful series of if statements...
Thank you for your help.
CronosVirus00
EDITS:
>>> df.head(7)
data ora open max min close Unnamed: 6
0 20160801 0 1.11781 1.11781 1.11772 1.11773 0
1 20160801 100 1.11774 1.11779 1.11773 1.11777 0
2 20160801 200 1.11779 1.11800 1.11779 1.11795 0
3 20160801 300 1.11794 1.11801 1.11771 1.11771 0
4 20160801 400 1.11766 1.11772 1.11763 1.11772 0
5 20160801 500 1.11774 1.11798 1.11774 1.11796 0
6 20160801 600 1.11796 1.11796 1.11783 1.11783 0
Ifs:
for x in range(1, df.close.count()):
    y = df.close[x] - df.open[x]
    if y > 0:
        green += 1
        y = df.close[x+1] - df.close[x+1]
        twotimes += 1
        if y > 0:
            green += 1
            y = df.close[x+2] - df.close[x+2]
            threetimes += 1
            if y > 0:
                green += 1
                y = df.close[x+3] - df.close[x+3]
                fourtimes += 1
FINAL SOLUTION
Thank you all! In the end I did this:
df['test'] = df.close - df.open > 0
green = df.test  # days that were positive

def gg(z):
    tot = green.count()
    giorni = range(1, z + 1)  # days in a row I want to check
    for x in giorni:
        y = (green.rolling(x).sum() > x - 1).sum()
        print(x, " ", y, " ", round((y / tot) * 100, 1), "%")

gg(5)
gg(5)
1 14850 45.0 %
2 6647 20.1 %
3 2980 9.0 %
4 1346 4.1 %
5 607 1.8 %
If I understood your question correctly, you can do it this way:
In [76]: df.groupby((df.close.diff() < 0).cumsum()).cumcount()
Out[76]:
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 0
dtype: int64
The result that I'm looking for should tell me that the title was
positive 2 days in a row x times, 3 days in a row y times, 4 days in a
row z times and so forth.
In [114]: df.groupby((df.close.diff() < 0).cumsum()).cumcount().value_counts().to_frame('count')
Out[114]:
count
0 4
2 2
1 2
Data set:
In [78]: df
Out[78]:
data ora open max min close
0 20160801 0 1.11781 1.11781 1.11772 1.11773
1 20160801 100 1.11774 1.11779 1.11773 1.11777
2 20160801 200 1.11779 1.11800 1.11779 1.11795
3 20160801 300 1.11794 1.11801 1.11771 1.11771
4 20160801 400 1.11766 1.11772 1.11763 1.11772
5 20160801 500 1.11774 1.11798 1.11774 1.11796
6 20160801 600 1.11796 1.11796 1.11783 1.11783
7 20160801 700 1.11783 1.11799 1.11783 1.11780
In [80]: df.close.diff()
Out[80]:
0 NaN
1 0.00004
2 0.00018
3 -0.00024
4 0.00001
5 0.00024
6 -0.00013
7 -0.00003
Name: close, dtype: float64
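One caveat (my note, not part of the original answer): df.close.diff() compares each close with the previous close, while the question defines a positive day as close greater than open. If that is the intent, the same trick works with that condition instead, e.g.:
gain = (df.close - df.open) > 0
df.groupby((~gain).cumsum()).cumcount().value_counts().to_frame('count')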
It sounds like what you want to do is:
1. compute the difference of the two series (open & close), e.g. diff = df.close - df.open
2. apply a condition to the result to get a boolean series, diff > 0
3. pass the resulting boolean series to the DataFrame to get a subset of the DataFrame where the condition is true, df[diff > 0]
4. find all contiguous subsequences by applying a column-wise function to identify and count them
I need to board a plane, but I will provide a sample of what the last step looks like when I can.
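In the meantime, a minimal sketch of that last step (my addition, not the original answer's promised sample), counting each maximal run of consecutive gains once:
gain = (df.close - df.open) > 0
# label each run of equal values (gain / no-gain) with an increasing id
run_id = (gain != gain.shift()).cumsum()
# length of every run, keeping only the runs made of gains
run_lengths = gain.groupby(run_id).sum()
run_lengths = run_lengths[run_lengths > 0]
# how many runs of each length occurred (2 days in a row, 3 days in a row, ...)
print(run_lengths.value_counts().sort_index())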
If I understood you correctly, you want the number of days that have at least n positive days in a row ending on them, the day itself included.
Similarly to what #Thang suggested, you can use rolling:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 2), columns=["open", "close"])
# This just sets up random test data, for example:
# open close
# 0 0.997986 0.594789
# 1 0.052712 0.401275
# 2 0.895179 0.842259
# 3 0.747268 0.919169
# 4 0.113408 0.253440
# 5 0.199062 0.399003
# 6 0.436424 0.514781
# 7 0.180154 0.235816
# 8 0.750042 0.558278
# 9 0.840404 0.139869
positiveDays = df["close"]-df["open"] > 0
# This will give you a series that is True for positive days:
# 0 False
# 1 True
# 2 False
# 3 True
# 4 True
# 5 True
# 6 True
# 7 True
# 8 False
# 9 False
# dtype: bool
daysToCheck = 3
positiveDays.rolling(daysToCheck).sum()>daysToCheck-1
This will now give you a series indicating, for every day, whether it has been positive for daysToCheck days in a row:
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 False
9 False
dtype: bool
Now you can use (positiveDays.rolling(daysToCheck).sum()>daysToCheck-1).sum() to get the number of days (3 in the example) that satisfy this, which is what you want, as far as I understand.
This should work:
import pandas as pd
import numpy as np
test = pd.DataFrame(np.random.randn(100,2), columns = ['open','close'])
test['gain?'] = (test['open']-test['close'] < 0)
test['cumulative'] = 0
for i in test.index[1:]:
if test['gain?'][i]:
test['cumulative'][i] = test['cumulative'][i-1] + 1
test['cumulative'][i-1] = 0
results = test['cumulative'].value_counts()
Ignore the '0' row in the results. The code can be modified without too much trouble if you want to, e.g., count both days in a run-of-two as runs-of-one as well.
Edit: without the warnings -
import pandas as pd
import numpy as np
test = pd.DataFrame(np.random.randn(100,2), columns = ['open','close'])
test['gain?'] = (test['open']-test['close'] < 0)
test['cumulative'] = 0
for i in test.index[1:]:
if test['gain?'][i]:
test.loc[i,'cumulative'] = test.loc[i-1,'cumulative'] + 1
test.loc[i-1,'cumulative'] = 0
results = test['cumulative'].value_counts()
For the dataframe below, how do I return all opposite pairs?
import pandas as pd
df1 = pd.DataFrame([1,2,-2,2,-1,-1,1,1], columns=['a'])
a
0 1
1 2
2 -2
3 2
4 -1
5 -1
6 1
7 1
The output should be as below:
(1) the sum of all rows is 0;
(2) as there are three "1"s and two "-1"s in the original data, the output includes two "1"s and two "-1"s.
a
0 1
1 2
2 -2
4 -1
5 -1
6 1
Thank you very much.
Well, I thought this would take fewer lines (and probably can) but this does work. First just create a couple of new columns to simplify the later syntax:
>>> df1['abs_a'] = np.abs( df1['a'] )
>>> df1['ones'] = 1
Then the main thing you need is to do some counting. For example, are there fewer 1s or fewer -1s?
>>> df2 = df1.groupby(['abs_a','a']).count()
ones
abs_a a
1 -1 2
1 3
2 -2 1
2 2
>>> df3 = df2.groupby(level=0).min()
ones
abs_a
1 2
2 1
That's basically the answer right there, but I'll put it closer to the form you asked for:
>>> lst = [ [i]*j for i, j in zip( df3.index.tolist(), df3['ones'].tolist() ) ]
>>> arr = np.array( [item for sublist in lst for item in sublist] )
>>> np.hstack( [arr,-1*arr] )
array([ 1, 1, 2, -1, -1, -2], dtype=int64)
Or if you want to put it back into a dataframe:
>>> pd.DataFrame( np.hstack( [arr,-1*arr] ) )
0
0 1
1 1
2 2
3 -1
4 -1
5 -2
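If you want to keep the original rows and index (as in the desired output), here is an alternative sketch (my addition, not from the original answer): for each value keep only as many rows as its opposite occurs.
counts = df1['a'].value_counts()
# per row, how many occurrences of this value may be kept
keep = df1['a'].map(lambda v: min(counts.get(v, 0), counts.get(-v, 0)))
out = df1[df1.groupby('a').cumcount() < keep]
# out has index 0, 1, 2, 4, 5, 6 for the sample data, matching the desired output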