I have the DataFrame below, whose columns contain only 0s and 1s, and I want to count the number of 0->1 and 1->0 transitions in every column. In the DataFrame below, column 'a' has 6 state changes, 'b' has 3, and 'c' has 2. I don't know how to code this in pandas.
number a b c
1 0 0 0
2 1 0 1
3 0 1 1
4 1 1 1
5 0 0 0
6 1 0 0
7 0 1 0
I really have no idea how to do this in pandas, because until recently I only used R, but now I have to use Python and pandas, so I'm in a bit of a difficult situation. Can anybody help? Thanks in advance!
Use rolling and compare each value, then count all True values by sum:
df = df[['a','b','c']].rolling(2).apply(lambda x: x[0] != x[-1], raw=True).sum().astype(int)
a 6
b 3
c 2
dtype: int64
Bitwise XOR (^)
Use the NumPy array df.values and compare the shifted elements with ^
This is meant to be a fast solution.
XOR is true only when exactly one of the two operands is true, as shown in this truth table
A B XOR
T T F
T F T
F T T
F F F
And replicated in 0/1 form
a = np.array([1, 1, 0, 0])
b = np.array([1, 0, 1, 0])
pd.DataFrame(dict(A=a, B=b, XOR=a ^ b))
A B XOR
0 1 1 0
1 1 0 1
2 0 1 1
3 0 0 0
Demo
v = df.values
pd.Series((v[1:] ^ v[:-1]).sum(0), df.columns)
a 6
b 3
c 2
dtype: int64
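Note that the demo assumes df holds only the 0/1 columns (i.e. 'number' is the index or has been dropped). If 'number' is still an ordinary column, select the 0/1 columns first; a minimal sketch:
v = df[['a', 'b', 'c']].values
pd.Series((v[1:] ^ v[:-1]).sum(0), ['a', 'b', 'c'])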
Time Testing
Functions
def pir_xor(df):
    v = df.values
    return pd.Series((v[1:] ^ v[:-1]).sum(0), df.columns)

def pir_diff1(df):
    v = df.values
    return pd.Series(np.abs(np.diff(v, axis=0)).sum(0), df.columns)

def pir_diff2(df):
    v = df.values
    return pd.Series(np.diff(v.astype(bool), axis=0).sum(0), df.columns)

def cold(df):
    return df.ne(df.shift(-1)).sum(0) - 1

def jez(df):
    return df.rolling(2).apply(lambda x: x[0] != x[-1]).sum().astype(int)

def naga(df):
    return df.diff().abs().sum().astype(int)
Testing
from timeit import timeit

np.random.seed([3, 1415])
idx = [10, 30, 100, 300, 1000, 3000, 10000, 30000, 100000, 300000]
col = 'pir_xor pir_diff1 pir_diff2 cold jez naga'.split()
res = pd.DataFrame(np.nan, idx, col)

for i in idx:
    df = pd.DataFrame(np.random.choice([0, 1], size=(i, 3)), columns=[*'abc'])
    for j in col:
        stmt = f"{j}(df)"
        setp = f"from __main__ import {j}, df"
        res.at[i, j] = timeit(stmt, setp, number=100)
Results
res.div(res.min(1), 0)
pir_xor pir_diff1 pir_diff2 cold jez naga
10 1.06203 1.119769 1.000000 21.217555 16.768532 6.601518
30 1.00000 1.075406 1.115743 23.229013 18.844025 7.212369
100 1.00000 1.134082 1.174973 22.673289 21.478068 7.519898
300 1.00000 1.119153 1.166782 21.725495 26.293712 7.215490
1000 1.00000 1.106267 1.167786 18.394462 37.925160 6.284253
3000 1.00000 1.118554 1.342192 16.053097 64.953310 5.594610
10000 1.00000 1.163557 1.511631 12.008129 106.466636 4.503359
30000 1.00000 1.249835 1.431120 7.826387 118.380227 3.621455
100000 1.00000 1.275272 1.528840 6.690012 131.912349 3.150155
300000 1.00000 1.279373 1.528238 6.301007 140.667427 3.190868
res.plot(loglog=True, figsize=(15, 8))
shift and compare:
df.ne(df.shift(-1)).sum(0) - 1
a 6
b 3
c 2
dtype: int64
...Assuming "number" is the index, otherwise precede your solution with
df.set_index('number', inplace=True).
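A combined sketch of that note, using the non-inplace form of set_index:
df = df.set_index('number')
df.ne(df.shift(-1)).sum(0) - 1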
You can try taking the difference of each row with the previous one and summing the absolute values:
df.diff().abs().sum().astype(int)
Out:
a    6
b    3
c    2
dtype: int32
Use:
def agg_columns(x):
    shifted = x.shift()
    return sum(x[1:] != shifted[1:])
df[['a','b','c']].apply(agg_columns)
a 6
b 3
c 2
dtype: int64
You can also try:
((df!=df.shift()).cumsum() - 1).iloc[-1:]
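Note that .iloc[-1:] keeps the result as a one-row DataFrame; to get a plain Series like the other answers, take the last row with .iloc[-1] instead:
((df != df.shift()).cumsum() - 1).iloc[-1]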
Related
This is my first question on Stack Overflow.
I have two DataFrames of different sizes, df1 (266808 rows) and df2 (201 rows).
I want to write into df2['count'] the number of values in df1['WS_140m'] that fall in each class interval given in df2['Class_interval'].
I have tried
1)
df2['count']=pd.cut(x=df1['WS_140m'], bins=df2['Class_interval'])
2)
df2['count'] = df1['WS_140m'].groupby(df1['Class_interval'])
3)
for anum in df1['WS_140m']:
    if anum in df2['Class_interval']:
        df2['count'] = df2['count'] + 1
Please guide me if you know how to do this.
Please try something like:
def in_class_interval(value, interval):
    # TODO: return True if value falls inside interval
    ...

def in_class_interval_closure(interval):
    return lambda x: in_class_interval(x, interval)

df2['count'] = df2['Class_interval'].apply(
    lambda x: df1[in_class_interval_closure(x)(df1['WS_140m'])].size
)
Define your own in_class_interval(value, interval) function, which returns a boolean.
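A minimal sketch of such a function, assuming Class_interval holds interval strings like '(0.05,0.15]' (left-open, right-closed); the parsing details are an assumption about the data:
def in_class_interval(value, interval):
    # Parse a string such as "(0.05,0.15]" into its numeric bounds.
    if pd.isnull(interval):
        return False
    left, right = (float(s) for s in interval.strip('(]').split(','))
    return left < value <= right
Since this version checks a single value, the mask would then be built element-wise, e.g. df1[df1['WS_140m'].apply(in_class_interval_closure(x))].size, rather than by calling the closure on the whole Series.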
I guess something like this would do it:
In [330]: df1
Out[330]:
WS_140m
0 5.10
1 5.16
2 5.98
3 5.58
4 4.81
In [445]: df2
Out[445]:
count Class_interval
0 0 NaN
1 0 (0.05,0.15]
2 0 (0.15,0.25]
3 0 (0.25,0.35]
4 0 (3.95,5.15]
In [446]: df2.Class_interval = df2.Class_interval.str.replace(']', ')')
In [451]: from ast import literal_eval
In [449]: for i, v in df2.Class_interval.iteritems():
...: if pd.notnull(v):
...: df2.at[i, 'Class_interval'] = literal_eval(df2.Class_interval[i])
In [342]: df2['falls_in_range'] = df1.WS_140m.between(df2.Class_interval.str[0], df2.Class_interval.str[1])
You can increase the count wherever True comes like below :
In [360]: df2['count'] = df2.loc[df2.index[df2['falls_in_range'] == True].tolist()]['count'] +1
In [361]: df2
Out[361]:
count Class_interval falls_in_range
0 NaN NaN False
1 NaN (0.05, 0.15) False
2 NaN (0.15, 0.25) False
3 NaN (0.25, 0.35) False
4 1.0 (3.95, 5.15) True
I have a DataFrame as below.
test = pd.DataFrame({'col1':[0,0,1,0,0,0,1,2,0], 'col2': [0,0,1,2,3,0,0,0,0]})
col1 col2
0 0 0
1 0 0
2 1 1
3 0 2
4 0 3
5 0 0
6 1 0
7 2 0
8 0 0
For each column, I want to find the index of the last value 1 before the column's maximum. For example, for the first column the max is 2 and the index of the value 1 before it is 6; for the second column the max is 3 and the index of the value 1 before it is 2.
In summary, I am looking to get [6, 2] as the output for this test DataFrame. Is there a quick way to achieve this?
Use Series.mask to hide everything except the 1s that come before each column's max, then apply Series.last_valid_index to each column.
m = test.eq(test.max()).cumsum().gt(0) | test.ne(1)
test.mask(m).apply(pd.Series.last_valid_index)
col1 6
col2 2
dtype: int64
Using numpy to vectorize, you can use numpy.cumsum and argmax:
idx = ((test.eq(1) & test.eq(test.max()).cumsum().eq(0))
.values
.cumsum(axis=0)
.argmax(axis=0))
idx
# array([6, 2])
pd.Series(idx, index=[*test])
col1 6
col2 2
dtype: int64
Using #cs95's idea of last_valid_index:
test.apply(lambda x: x[:x.idxmax()].eq(1)[lambda i:i].last_valid_index())
Output:
col1 6
col2 2
dtype: int64
Explained:
Use index slicing to cut each column from the start up to its max value, then keep the values equal to one and take the index of the last True value.
Or as #QuangHoang suggests:
test.apply(lambda x: x[:x.idxmax()].eq(1).cumsum().idxmax())
Overkill with Numpy
t = test.to_numpy()
a = t.argmax(0)            # row position of each column's max
i, j = np.where(t == 1)    # row/column positions of every 1
mask = i <= a[j]           # keep only the 1s at or before that column's max
i = i[mask]
j = j[mask]
b = np.empty_like(a)
b.fill(-1)
np.maximum.at(b, j, i)     # per column, the largest surviving row position
pd.Series(b, test.columns)
col1 6
col2 2
dtype: int64
apply
test.apply(lambda s: max(s.index, key=lambda x: (s[x] == 1, s[x] <= s.max(), x)))
col1 6
col2 2
dtype: int64
cummax
test.eq(1).where(test.cummax().lt(test.max())).iloc[::-1].idxmax()
col1 6
col2 2
dtype: int64
Timing
I just wanted to use a new tool and do some benchmarking; see this post.
Results
r.to_pandas_dataframe().T
10 31 100 316 1000 3162 10000
al_0 0.003696 0.003718 0.005512 0.006210 0.010973 0.007764 0.012008
wb_0 0.003348 0.003334 0.003913 0.003935 0.004583 0.004757 0.006096
qh_0 0.002279 0.002265 0.002571 0.002643 0.002927 0.003070 0.003987
sb_0 0.002235 0.002246 0.003072 0.003357 0.004136 0.004083 0.005286
sb_1 0.001771 0.001779 0.002331 0.002353 0.002914 0.002936 0.003619
cs_0 0.005742 0.005751 0.006748 0.006808 0.007845 0.008088 0.009898
cs_1 0.004034 0.004045 0.004871 0.004898 0.005769 0.005997 0.007338
pr_0 0.002484 0.006142 0.027101 0.085944 0.374629 1.292556 6.220875
pr_1 0.003388 0.003414 0.003981 0.004027 0.004658 0.004929 0.006390
pr_2 0.000087 0.000088 0.000089 0.000093 0.000107 0.000145 0.000300
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 10))
ax = plt.subplot()
r.plot(ax=ax)
Setup
from simple_benchmark import BenchmarkBuilder
b = BenchmarkBuilder()
def al_0(test): return test.apply(lambda x: x.where(x[:x.idxmax()].eq(1)).drop_duplicates(keep='last').idxmin())
def wb_0(df): return (df.iloc[::-1].cummax().eq(df.max())&df.eq(1).iloc[::-1]).idxmax()
def qh_0(test): return (test.eq(1) & (test.index.values[:,None] < test.idxmax().values)).cumsum().idxmax()
def sb_0(test): return test.apply(lambda x: x[:x.idxmax()].eq(1)[lambda i:i].last_valid_index())
def sb_1(test): return test.apply(lambda x: x[:x.idxmax()].eq(1).cumsum().idxmax())
def cs_0(test): return (lambda m: test.mask(m).apply(pd.Series.last_valid_index))(test.eq(test.max()).cumsum().gt(0) | test.ne(1))
def cs_1(test): return pd.Series((test.eq(1) & test.eq(test.max()).cumsum().eq(0)).values.cumsum(axis=0).argmax(axis=0), test.columns)
def pr_0(test): return test.apply(lambda s: max(s.index, key=lambda x: (s[x] == 1, s[x] <= s.max(), x)))
def pr_1(test): return test.eq(1).where(test.cummax().lt(test.max())).iloc[::-1].idxmax()
def pr_2(test):
    t = test.to_numpy()
    a = t.argmax(0)
    i, j = np.where(t == 1)
    mask = i <= a[j]
    i = i[mask]
    j = j[mask]
    b = np.empty_like(a)
    b.fill(-1)
    np.maximum.at(b, j, i)
    return pd.Series(b, test.columns)
import math

def gen_test(n):
    a = np.random.randint(100, size=(n, int(math.log10(n)) + 1))
    idx = a.argmax(0)
    while (idx == 0).any():
        a = np.random.randint(100, size=(n, int(math.log10(n)) + 1))
        idx = a.argmax(0)
    for j, i in enumerate(idx):
        a[np.random.randint(i), j] = 1
    return pd.DataFrame(a).add_prefix('col')
@b.add_arguments('DataFrame Size')
def argument_provider():
    for exponent in np.linspace(1, 3, 5):
        size = int(10 ** exponent)
        yield size, gen_test(size)
b.add_functions([al_0, wb_0, qh_0, sb_0, sb_1, cs_0, cs_1, pr_0, pr_1, pr_2])
r = b.run()
A little bit of logic here: reverse the frame and take the cumulative max, then flag the rows that are 1 and whose running max already equals the column max (i.e. the column max lies at or below them); idxmax on the reversed frame returns the last such row.
(df.iloc[::-1].cummax().eq(df.max())&df.eq(1).iloc[::-1]).idxmax()
Out[187]:
col1 6
col2 2
dtype: int64
Here's a mixed numpy and pandas solution:
(test.eq(1) & (test.index.values[:,None] < test.idxmax().values)).cumsum().idxmax()
which is a bit faster than the other solutions.
I would use where to keep only the 1s before the max, drop duplicates keeping the last one, and call idxmin on it.
test.apply(lambda x: x.where(x[:x.idxmax()].eq(1)).drop_duplicates(keep='last').idxmin())
Out[1433]:
col1 6
col2 2
dtype: int64
I'm fitting an OLS model using two slices of a DataFrame:
gab = ols(formula= 'only_volume ~ all_but_volume', data=data_p ).fit()
where,
only_volume = data_p.iloc[:,0] #Only first colum
all_but_volume = data_p.iloc[:, 1:data_p.shape[1]] #All but first column
When I try to extract something, say the parameters or p-values, I get something like this:
In [3]: gab.params
Out[3]:
Intercept 2.687598e+06
all_but_volume[0] 5.500544e+01
all_but_volume[1] 2.696902e+02
all_but_volume[2] 3.389568e+04
all_but_volume[3] -2.385838e+04
all_but_volume[4] 5.419860e+02
all_but_volume[5] 3.815161e+02
all_but_volume[6] -2.281344e+04
all_but_volume[7] 1.794128e+04
...
all_but_volume[22] 1.374321e+00
Since gab.params provides 23 values (plus the intercept) and all_but_volume has 23 columns, I was hoping there was a way to get a list/zip of the params with the column names instead of with all_but_volume[i], like:
TMC 9.801195e+01
TAC 2.214464e+02
...
What I've tried:
removing all_but_volume and simply using data_p.iloc[:, 1:data_p.shape[1]]
Didn't work:
...
data_p.iloc[:, 1:data_p.shape[1]][21] 2.918531e+04
data_p.iloc[:, 1:data_p.shape[1]][22] 1.395342e+00
Edit:
Sample Data:
data_p.iloc[1:5,:]
Out[31]:
Volume A B C\
1 569886.171878 759.089217 272.446022 4.163908
2 561695.886128 701.165406 330.301260 4.136530
3 627221.486089 377.746089 656.838394 4.130720
4 625181.750625 361.489041 670.575110 4.134467
D E F G H I \
1 1.000842 12993.06 3371.28 236.90 4.92 6.13
2 0.981514 13005.44 3378.69 236.94 4.92 6.13
3 0.836920 13017.22 3384.47 236.98 4.93 6.13
4 0.810541 13028.56 3388.85 237.01 4.94 6.13
J K L M N \
1 ... 0 0 0 0 0
2 ... 0 0 0 0 0
3 ... 0 0 0 0 0
4 ... 0 0 0 0 0
O P Q R S
1 0 0 0 1 9202.171648
2 0 0 0 0 4381.373520
3 0 0 0 0 -13982.443554
4 0 0 0 0 -22878.843149
only_volume is the first column 'volume'
all_but_volume is all columns except 'volume'
You can use the DataFrame constructor or rename, because gab.params is a Series:
Sample:
np.random.seed(2018)
import statsmodels.formula.api as sm
data_p = pd.DataFrame(np.random.rand(10, 5), columns=['Volume','A','B','C','D'])
print (data_p)
Volume A B C D
0 0.882349 0.104328 0.907009 0.306399 0.446409
1 0.589985 0.837111 0.697801 0.802803 0.107215
2 0.757093 0.999671 0.725931 0.141448 0.356721
3 0.942704 0.610162 0.227577 0.668732 0.692905
4 0.416863 0.171810 0.976891 0.330224 0.629044
5 0.160611 0.089953 0.970822 0.816578 0.571366
6 0.345853 0.403744 0.137383 0.900934 0.933936
7 0.047377 0.671507 0.034832 0.252691 0.557125
8 0.525823 0.352968 0.092983 0.304509 0.862430
9 0.716937 0.964071 0.539702 0.950540 0.667982
only_volume = data_p.iloc[:,0] #Only first colum
all_but_volume = data_p.iloc[:, 1:data_p.shape[1]] #All but first column
gab = sm.ols(formula= 'only_volume ~ all_but_volume', data=data_p ).fit()
print (gab.params)
Intercept 0.077570
all_but_volume[0] 0.395072
all_but_volume[1] 0.313150
all_but_volume[2] -0.100752
all_but_volume[3] 0.247532
dtype: float64
print (type(gab.params))
<class 'pandas.core.series.Series'>
df = pd.DataFrame({'cols':data_p.columns[1:], 'par': gab.params.values[1:]})
print (df)
cols par
0 A 0.395072
1 B 0.313150
2 C -0.100752
3 D 0.247532
If you want to return a Series:
s = gab.params.rename(dict(zip(gab.params.index, data_p.columns)))
print (s)
Volume 0.077570
A 0.395072
B 0.313150
C -0.100752
D 0.247532
dtype: float64
Series without first value:
s = gab.params.iloc[1:].rename(dict(zip(gab.params.index, data_p.columns)))
print (s)
A 0.395072
B 0.313150
C -0.100752
D 0.247532
dtype: float64
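If all you need is the pairing the question asked for, a minimal sketch (assuming, as here, that the parameter order matches the column order) is to zip the non-intercept params with the original column names:
# Pair each slope parameter with its original column name.
named_params = dict(zip(data_p.columns[1:], gab.params.values[1:]))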
I'm looking to find the max run of consecutive zeros in a DataFrame with the result grouped by user. I'm interested in running the RLE on usage.
sample input:
user--day--usage
A-----1------0
A-----2------0
A-----3------1
B-----1------0
B-----2------1
B-----3------0
Desired output
user---longest_run
a - - - - 2
b - - - - 1
mydata <- mydata[order(mydata$user, mydata$day),]
user <- unique(mydata$user)
d2 <- data.frame(matrix(NA, ncol = 2, nrow = length(user)))
names(d2) <- c("user", "longest_no_usage")
d2$user <- user
for (i in user) {
  if (0 %in% mydata$usage[mydata$user == i]) {
    run <- rle(mydata$usage[mydata$user == i]) # Run Length Encoding
    d2$longest_no_usage[d2$user == i] <- max(run$length[run$values == 0])
  } else {
    d2$longest_no_usage[d2$user == i] <- 0 # some users did not have no-usage days
  }
}
d2 <- d2[order(-d2$longest_no_usage),]
This works in R, but I want to do the same thing in Python and I'm totally stumped.
First use groupby with size, grouping by the user column, the usage values, and a helper Series that marks runs of consecutive values:
print (df)
user day usage
0 A 1 0
1 A 2 0
2 A 3 1
3 B 1 0
4 B 2 1
5 B 3 0
6 C 1 1
df1 = (df.groupby([df['user'],
                   df['usage'].rename('val'),
                   df['usage'].ne(df['usage'].shift()).cumsum()])
         .size()
         .to_frame(name='longest_run'))
print (df1)
longest_run
user val usage
A 0 1 2
1 2 1
B 0 3 1
5 1
1 4 1
C 1 6 1
Then filter only the zero rows, take the max per user, and reindex to append users that have no zero runs:
df2 = (df1.query('val == 0')
          .max(level=0)
          .reindex(df['user'].unique(), fill_value=0)
          .reset_index())
print (df2)
user longest_run
0 A 2
1 B 1
2 C 0
Detail:
print (df['usage'].ne(df['usage'].shift()).cumsum())
0 1
1 1
2 2
3 3
4 4
5 5
6 6
Name: usage, dtype: int32
Get the maximum number of consecutive zeros in a Series:
def max0(sr):
    return (sr != 0).cumsum().value_counts().max() - (0 if (sr != 0).cumsum().value_counts().idxmax() == 0 else 1)
max0(pd.Series([1,0,0,0,0,2,3]))
4
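To get the per-user output the original question asks for, apply this helper per group (a small sketch assuming the question's user/usage columns):
df.groupby('user')['usage'].apply(max0)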
I think the following does what you are looking for, where the consecutive_zero function is an adaptation of the top answer here.
Hope this helps!
import pandas as pd
from itertools import groupby
df = pd.DataFrame([['A', 1], ['A', 0], ['A', 0], ['B', 0],['B',1],['C',2]],
columns=["user", "usage"])
def len_iter(items):
    return sum(1 for _ in items)

def consecutive_zero(data):
    x = list(len_iter(run) for val, run in groupby(data) if val == 0)
    if len(x) == 0:
        return 0
    else:
        return max(x)

df.groupby('user').apply(lambda x: consecutive_zero(x['usage']))
Output:
user
A 2
B 1
C 0
dtype: int64
If you have a large dataset and speed is essential, you might want to try the high-performance pyrle library.
Setup:
# pip install pyrle
# or
# conda install -c bioconda pyrle
import numpy as np
np.random.seed(0)
import pandas as pd
from pyrle import Rle
size = int(1e7)
number = np.random.randint(2, size=size)
user = np.random.randint(5, size=size)
df = pd.DataFrame({"User": np.sort(user), "Number": number})
df
# User Number
# 0 0 0
# 1 0 1
# 2 0 1
# 3 0 0
# 4 0 1
# ... ... ...
# 9999995 4 1
# 9999996 4 1
# 9999997 4 0
# 9999998 4 0
# 9999999 4 1
#
# [10000000 rows x 2 columns]
Execution:
for u, udf in df.groupby("User"):
    r = Rle(udf.Number)
    is_0 = r.values == 0
    print("User", u, "Max", np.max(r.runs[is_0]))
# (Wall time: 1.41 s)
# User 0 Max 20
# User 1 Max 23
# User 2 Max 20
# User 3 Max 22
# User 4 Max 23
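To collect the per-user maxima into a Series (matching the question's user/longest_run shape) rather than printing them, a small variation on the loop above:
maxima = {}
for u, udf in df.groupby("User"):
    r = Rle(udf.Number)
    maxima[u] = np.max(r.runs[r.values == 0])
longest_runs = pd.Series(maxima, name="longest_run")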
Suppose I have the following column.
>>> import pandas
>>> a = pandas.Series(['0', '1', '5', '1', None, '3', 'Cat', '2'])
I would like to be able to convert all the data in the column to type int, and any element that cannot be converted should be replaced with a 0.
My current solution to this is to use to_numeric with the 'coerce' option, fill any NaN with 0, and then convert to int (since the presence of NaN made the column float instead of int).
>>> pandas.to_numeric(a, errors='coerce').fillna(0).astype(int)
0 0
1 1
2 5
3 1
4 0
5 3
6 0
7 2
dtype: int64
Is there any method that would allow me to do this in one step rather than having to go through two intermediate states? I am looking for something that would behave like the following imaginary option to astype:
>>> a.astype(int, value_on_error=0)
Option 1
pd.to_numeric(a, 'coerce').fillna(0).astype(int)
Option 2
b = pd.to_numeric(a, 'coerce')
b.mask(b.isnull(), 0).astype(int)
Option 3
def try_int(x):
    try:
        return int(x)
    except:
        return 0

a.apply(try_int)
Option 4
b = np.empty(a.shape, dtype=int)
i = np.core.defchararray.isdigit(a.values.astype(str))
b[i] = a[i].astype(int)
b[~i] = 0
pd.Series(b, a.index)
All produce
0 0
1 1
2 5
3 1
4 0
5 3
6 0
7 2
dtype: int64
Timing
Code Below
def pir1(a):
    return pd.to_numeric(a, 'coerce').fillna(0).astype(int)

def pir2(a):
    b = pd.to_numeric(a, 'coerce')
    return b.mask(b.isnull(), 0).astype(int)

def try_int(x):
    try:
        return int(x)
    except:
        return 0

def pir3(a):
    return a.apply(try_int)

def pir4(a):
    b = np.empty(a.shape, dtype=int)
    i = np.core.defchararray.isdigit(a.values.astype(str))
    b[i] = a[i].astype(int)
    b[~i] = 0
    return pd.Series(b, a.index)

def alt1(a):
    return pd.to_numeric(a.where(a.str.isnumeric(), 0))
from timeit import timeit

results = pd.DataFrame(
    index=[1, 3, 10, 30, 100, 300, 1000, 3000, 10000],
    columns='pir1 pir2 pir3 pir4 alt1'.split()
)

for i in results.index:
    c = pd.concat([a] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(c)'.format(j)
        setp = 'from __main__ import c, {}'.format(j)
        results.at[i, j] = timeit(stmt, setp, number=10)
results.plot(logx=True, logy=True)
a.where(a.str.isnumeric(),0).astype(int)
Output:
0 0
1 1
2 5
3 1
4 0
5 3
6 0
7 2
dtype: int64
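One caveat with the str.isnumeric approach: it only recognizes non-negative integer strings, so values like '-1' or '1.5' would also be replaced with 0. A quick check:
pd.Series(['-1', '1.5', '7']).str.isnumeric()
# 0    False
# 1    False
# 2     True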