Get row count in DataFrame without for loop - python

I need to check whether the last value of dataframe['position'] is different from 0 and, if so, count backwards how many consecutive trailing values equal it, storing that count in a variable. This should be done without for loops, using loc or iloc for example...
dataframe:
   position
0         1
1         0
2         1  <4
3         1  <3
4         1  <2
5         1  <1
count = 4
I achieved this with a for loop, but I need to avoid it:
count = 1
if data['position'].iloc[-1] != 0:
    for i in data['position']:
        if data['position'].iloc[-count] == data['position'].iloc[-1]:
            count = count + 1
        else:
            break
    if data['position'].iloc[-count] != data['position'].iloc[-1]:
        count = count - 1

You can reverse your Series, convert it to boolean using the target condition (here "not equal 0" with ne), apply cummin to propagate the first False backwards, and sum to count the trailing True values:
count = df.loc[::-1, 'position'].ne(0).cummin().sum()
Output: 4
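A self-contained sketch reproducing the question's data (the 'position' column and its values are taken from the question):
import pandas as pd

df = pd.DataFrame({'position': [1, 0, 1, 1, 1, 1]})

# reverse, flag non-zeros, propagate the first False, count trailing Trues
count = df.loc[::-1, 'position'].ne(0).cummin().sum()
print(count)  # 4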
If you have multiple columns:
counts = df.loc[::-1].ne(0).cummin().sum()
alternative
A slightly faster alternative (~25% faster), relying on the assumptions that there is at least one zero and no duplicated indices, is to find the last zero and use indexing:
m = df['position'].eq(0)
count = len(df.loc[m[m].index[-1]:])-1
Without the requirement to have at least one zero:
m = df['position'].eq(0)
m = m[m]
count = len(df) if m.empty else len(df.loc[m.index[-1]:])-1
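For illustration, a quick sketch of the zero-free branch with hypothetical all-ones data:
import pandas as pd

df = pd.DataFrame({'position': [1, 1, 1]})
m = df['position'].eq(0)
m = m[m]

# no zero anywhere, so the whole Series is one trailing run
count = len(df) if m.empty else len(df.loc[m.index[-1]:]) - 1
print(count)  # 3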

This should do the trick:
((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])).cumprod().sum()
This builds a condition ((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])) indicating whether the value in each row (counting backwards from the end) is nonzero and equals the last value. Then, the values are coerced into 0 or 1 and the cumulative product is taken, so that the first non-matching zero will break the sequence and all subsequent values will be zero. Then the flags are summed to get the count of these consecutive matched values.
Depending on your data, though, stepping iteratively backwards from the end may be faster. This solution is vectorized, but it requires working with the entire column of data and doing several computations which are the same size as the original series.
Example:
In [12]: data = pd.DataFrame(np.random.randint(0, 3, size=(10, 5)), columns=list('ABCDE'))
...: data
Out[12]:
A B C D E
0 2 0 1 2 0
1 1 0 1 2 1
2 2 1 2 1 0
3 1 0 1 2 2
4 1 1 0 0 2
5 2 2 1 0 2
6 2 1 1 2 2
7 0 0 0 1 0
8 2 2 0 0 1
9 2 0 0 2 1
In [13]: ((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])).cumprod().sum()
Out[13]:
A 2
B 0
C 0
D 1
E 2
dtype: int64
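For reference, a minimal sketch of the iterative backwards scan mentioned above, assuming a single 'position' column; it stops at the first non-match instead of touching the whole Series:
import pandas as pd

df = pd.DataFrame({'position': [1, 0, 1, 1, 1, 1]})

count = 0
last = df['position'].iloc[-1]
if last != 0:
    # walk backwards until the value changes
    for val in df['position'].iloc[::-1]:
        if val != last:
            break
        count += 1
print(count)  # 4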

Related

Count rows that have same value in all columns

If I have a dataframe like this:
A B C D E F
------------------
1 2 3 4 5 6
1 1 1 1 1 1
0 0 0 0 0 0
1 1 1 1 1 1
How can I get the number of rows that have value 1 in every column?
In this case, 2 rows have 1 in every field.
I know one way, for example if we only take columns A and B:
count = df2.query("A == 1 & B == 1").shape[0]
But then I would have to spell out the name of every column; is there a fancier approach?
Thanks in advance
Try:
(df == 1).all(axis=1).sum()
Output:
2
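A quick sketch reproducing the question's frame (column names as in the question):
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [1, 1, 1, 1, 1, 1],
                   [0, 0, 0, 0, 0, 0],
                   [1, 1, 1, 1, 1, 1]],
                  columns=list('ABCDEF'))

print((df == 1).all(axis=1).sum())  # 2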
For a large data frame with many rows, you may want to try any, since it can yield the result as soon as it detects the first non-matching item:
sum(~df.ne(1).any(axis=1))
2

Check if n consecutive elements equals x and any previous element is greater than x

I have a pandas dataframe with 6 mins readings. I want to mark each row as either NF or DF.
NF = rows with 5 consecutive entries being 0 and at least one prior reading being greater than 0
DF = All other rows that do not meet the NF rule
[[4,6,7,2,1,0,0,0,0,0]
[6,0,0,0,0,0,2,2,2,5]
[0,0,0,0,0,0,0,0,0,0]
[0,0,0,0,0,4,6,7,2,1]]
Expected Result:
[NF, NF, DF, DF]
Can I use a sliding window for this? What is a good pythonic way of doing this?
Using a numpy vectorised solution, with two conditions operating on a truth matrix:
- it uses the fact that True is 1, so cumsum() can be used
- the position of the 5th zero should be 4 places after the 1st
- if you just want the array, np.where() gives that without assigning it back to a dataframe column
- an extra test case [1,0,0,0,0,1,0,0,0,0] is included, where there are many zeros but not 5 consecutive
df = pd.DataFrame([[4,6,7,2,1,0,0,0,0,0],
                   [6,0,0,0,0,0,2,2,2,5],
                   [0,0,0,0,0,0,0,0,0,0],
                   [1,0,0,0,0,1,0,0,0,0],
                   [0,0,0,0,0,4,6,7,2,1]])
df = df.assign(res=np.where(
    # five consecutive zeros
    ((np.argmax(np.cumsum(df.eq(0).values, axis=1) == 1, axis=1) + 4) ==
     np.argmax(np.cumsum(df.eq(0).values, axis=1) == 5, axis=1)) &
    # first zero somewhere other than 0th position
    (np.argmax(df.eq(0).values, axis=1) > 0),
    "NF", "DF")
)
   0  1  2  3  4  5  6  7  8  9 res
0  4  6  7  2  1  0  0  0  0  0  NF
1  6  0  0  0  0  0  2  2  2  5  NF
2  0  0  0  0  0  0  0  0  0  0  DF
3  1  0  0  0  0  1  0  0  0  0  DF
4  0  0  0  0  0  4  6  7  2  1  DF
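Since the question asks about a sliding window, here is a hedged alternative sketch using numpy's sliding_window_view (numpy >= 1.20), under a slightly looser reading of the rule: any window of 5 consecutive zeros, with the first zero not in column 0. It agrees with the answer above on all the test cases shown.
import numpy as np
import pandas as pd

df = pd.DataFrame([[4, 6, 7, 2, 1, 0, 0, 0, 0, 0],
                   [6, 0, 0, 0, 0, 0, 2, 2, 2, 5],
                   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                   [1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
                   [0, 0, 0, 0, 0, 4, 6, 7, 2, 1]])

z = df.eq(0).to_numpy()

# every width-5 window along each row; True where a window is all zeros
windows = np.lib.stride_tricks.sliding_window_view(z, 5, axis=1)
has_run = windows.all(axis=2).any(axis=1)

# first zero not in column 0, i.e. a positive reading precedes it
prior_positive = z.argmax(axis=1) > 0

df['res'] = np.where(has_run & prior_positive, 'NF', 'DF')
print(df['res'].tolist())  # ['NF', 'NF', 'DF', 'DF', 'DF']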

Python Dataframe: How to check specific columns for elements

I want to check whether all elements from a certain column contain the number 0?
I have a dataset that I read with df = pd.read_table('ad-data').
From this I get a data frame with elements:
     [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  ... 1559
[1]    3    2    3    0    0    0    0
[2]    2    3    2    0    0    0    0
[3]    3    2    2    0    0    0    0
[4]    6    7    3    0    0    0    0
[5]    3    2    1    0    0    0    0
...
3220
I would like to check whether the data set from column 4 to 1559 contains only 0 or also other values.
You can check for equality with 0 element-wise and use all for rows:
df['all_zeros'] = (df.iloc[:, 4:1560] == 0).all(axis=1)
Small example to demonstrate it (based on columns 1 to 3 here):
N = 5
df = pd.DataFrame(np.random.binomial(1, 0.4, size=(N, N)))
df['all_zeros'] = (df.iloc[:, 1:4] == 0).all(axis=1)
df
Output:
0 1 2 3 4 all_zeros
0 0 1 1 0 0 False
1 0 0 1 1 1 False
2 0 1 1 0 0 False
3 0 0 0 0 0 True
4 1 0 0 0 0 True
Update: Filtering non-zero values:
df[~df['all_zeros']]
Output:
0 1 2 3 4 all_zeros
0 0 1 1 0 0 False
1 0 0 1 1 1 False
2 0 1 1 0 0 False
Update 2: To show only non-zero values:
pd.melt(
    df_filtered.iloc[:, 1:4].reset_index(),
    id_vars='index', var_name='column'
).query('value != 0').sort_values('index')
Output:
index column value
0 0 1 1
3 0 2 1
4 1 2 1
7 1 3 1
2 2 1 1
5 2 2 1
df['Check'] = df.loc[:, 4:].sum(axis=1)  # a row sum of 0 means only zeros, assuming non-negative values
here is a way to check whether all of the values are zero or not: it's simple
and doesn't need advanced functions like the answers above, only basics
such as filtering, loops, and variable assignment, and it prints an
answer statement.
First is how to check whether one column contains only zeros, and
second is how to find out whether all the columns do.
The method to check if one column has only zero values or not:
first make a boolean series:
has_zero = df[4] == 0
# has_zero is a series which contains one bool per row, e.g. True, False.
# wherever the value is zero the result will be "row_number : True"
next:
rows_which_have_zero = df[has_zero]
# stores the rows which have zero, as a data frame
next:
if len(rows_which_have_zero) == total_number_rows:
    print("contains only zeros")
else:
    print("contains other numbers than zero")
# substitute 3220 for total_number_rows
The above method simply checks whether rows_which_have_zero covers every row of the column.
The method to see if all of the columns have only zeros or not:
it puts the check above into a while loop over the columns:
no_of_columns = 1559
no_of_cols_with_only_zero = 0
value_1 = 4  # since it doesn't matter whether the first 3 columns have zeros
while value_1 <= no_of_columns:
    has_zero = df[value_1] == 0
    rows_which_have_zero = df[has_zero]
    if len(rows_which_have_zero) == len(df):
        no_of_cols_with_only_zero += 1
    value_1 += 1
To check whether every inspected column held only zeros:
if no_of_cols_with_only_zero == no_of_columns - 3:
    print("there are only zero values")
else:
    print("there are numbers which are not zero")
The above compares no_of_cols_with_only_zero with the number of columns checked (1559 minus 3, because only columns 4 to 1559 need to be checked).
update:
# convert value_1 to str if the column titles are str instead of int
# when incrementing value_1, convert it back to int and then back to str

Count how many initial elements in Pandas Series equal to a certain value?

As in the question. I know how to compute it, but is there a better/faster/more elegant way to do this?
cnt is the result.
s = pd.Series(np.random.randint(2, size=10))
cnt = 0
for n in s:
    if n != 0:
        break
    else:
        cnt += 1
Use Series.eq to create a boolean mask, then use Series.cummin to take a cumulative minimum over this series, and finally use Series.sum to get the total count:
cnt = s.eq(0).cummin().sum()
Example:
np.random.seed(9)
s = pd.Series(np.random.randint(2, size=10))
# print(s)
0 0
1 0
2 0
3 1
4 0
5 0
6 1
7 0
8 1
9 1
dtype: int64
cnt = s.eq(0).cummin().sum()
#print(cnt)
3
I have done it in a dataframe as it is easier to reproduce, but you can use the vectorized .cumsum to speed up your code, with .loc to select the values == 0. Then just find the length with len:
import pandas as pd, numpy as np
s = pd.DataFrame(pd.Series(np.random.randint(2, size=10)))
s['t'] = s[0].cumsum()
o = len(s.loc[s['t']==0])
o
If you assign o to a column with s['o'] = o, the output looks like this:
0 t o
0 0 0 2
1 0 0 2
2 1 1 2
3 1 2 2
4 0 2 2
5 1 3 2
6 1 4 2
7 1 5 2
8 1 6 2
9 0 6 2
You can use cumsum() in a mask and then sum() to get the number of initial 0s in the sequence:
s = pd.Series(np.random.randint(2, size=10))
(s.cumsum() == 0).sum()
Note that this method only works if you want to count leading 0s. To count leading occurrences of some other value (here the first element), you can generalize it by flagging mismatches first, i.e.:
(s.ne(s[0]).cumsum() == 0).sum()
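A quick usage sketch of the generalized form, with hypothetical data:
import pandas as pd

s = pd.Series([2, 2, 2, 5, 2])

# three leading elements equal to the first one
print((s.ne(s[0]).cumsum() == 0).sum())  # 3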

Pandas: conditional rolling count

I have a Series that looks the following:
col
0 B
1 B
2 A
3 A
4 A
5 B
It's a time series, therefore the index is ordered by time.
For each row, I'd like to count how many times the value has appeared consecutively, i.e.:
Output:
col count
0 B 1
1 B 2
2 A 1 # Value does not match previous row => reset counter to 1
3 A 2
4 A 3
5 B 1 # Value does not match previous row => reset counter to 1
I found 2 related questions, but I can't figure out how to "write" that information as a new column in the DataFrame, for each row (as above). Using rolling_apply does not work well.
Related:
Counting consecutive events on pandas dataframe by their index
Finding consecutive segments in a pandas data frame
I think there is a nice way to combine the solutions of @chrisb and @CodeShaman (as was pointed out, CodeShaman's solution counts total and not consecutive values).
df['count'] = df.groupby((df['col'] != df['col'].shift(1)).cumsum()).cumcount()+1
col count
0 B 1
1 B 2
2 A 1
3 A 2
4 A 3
5 B 1
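A short sketch of why this works: the shift comparison marks where each run starts, and its cumsum gives every run a distinct id to group on (sample data as above):
import pandas as pd

df = pd.DataFrame({'col': list('BBAAAB')})

blocks = (df['col'] != df['col'].shift(1)).cumsum()
# blocks: 1 1 2 2 2 3 -- a new id starts whenever the value changes

df['count'] = df.groupby(blocks).cumcount() + 1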
One-liner:
df['count'] = df.groupby('col').cumcount()
or
df['count'] = df.groupby('col').cumcount() + 1
if you want the counts to begin at 1. (As noted above, this counts total occurrences of each value so far, not consecutive runs.)
Based on the second answer you linked, assuming s is your series.
df = pd.DataFrame(s)
df['block'] = (df['col'] != df['col'].shift(1)).astype(int).cumsum()
df['count'] = df.groupby('block').transform(lambda x: range(1, len(x) + 1))
In [88]: df
Out[88]:
col block count
0 B 1 1
1 B 1 2
2 A 2 1
3 A 2 2
4 A 2 3
5 B 3 1
I like the answer by @chrisb but wanted to share my own solution, since some people might find it more readable and easier to adapt to similar problems...
1) Create a function that uses static variables
def rolling_count(val):
    if val == rolling_count.previous:
        rolling_count.count += 1
    else:
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count

rolling_count.count = 0  # static variable
rolling_count.previous = None  # static variable
2) apply it to your Series after converting to dataframe
df = pd.DataFrame(s)
df['count'] = df['col'].apply(rolling_count) #new column in dataframe
output of df
col count
0 B 1
1 B 2
2 A 1
3 A 2
4 A 3
5 B 1
If you wish to do the same thing but filter on two columns, you can use this.
def count_consecutive_items_n_cols(df, col_name_list, output_col):
    cum_sum_list = [
        (df[col_name] != df[col_name].shift(1)).cumsum().tolist()
        for col_name in col_name_list
    ]
    df[output_col] = df.groupby(
        ["_".join(map(str, x)) for x in zip(*cum_sum_list)]
    ).cumcount() + 1
    return df
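For example, a call that would produce the table below (sample data inferred from the output):
import pandas as pd

df = pd.DataFrame({'col_a': [1, 1, 1, 2, 2, 2],
                   'col_b': list('BBAAAB')})
count_consecutive_items_n_cols(df, ['col_a', 'col_b'], 'count')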
col_a col_b count
0 1 B 1
1 1 B 2
2 1 A 1
3 2 A 1
4 2 A 2
5 2 B 1
