Python Dataframe: How to check specific columns for elements - python

I want to check whether all elements from a certain column contain the number 0?
I have a dataset that I read with df=pd.read_table('ad-data')
From this I built a data frame with elements:
        [0] [1] [2] [3] [4] [5] [6] [7] ... [1559]
[1]      3   2   3   0   0   0   0
[2]      2   3   2   0   0   0   0
[3]      3   2   2   0   0   0   0
[4]      6   7   3   0   0   0   0
[5]      3   2   1   0   0   0   0
...
[3220]
I would like to check whether columns 4 to 1559 of the data set contain only 0, or other values as well.

You can check for equality with 0 element-wise and use all for rows:
df['all_zeros'] = (df.iloc[:, 4:1560] == 0).all(axis=1)
Small example to demonstrate it (based on columns 1 to 3 here):
import numpy as np
import pandas as pd

N = 5
df = pd.DataFrame(np.random.binomial(1, 0.4, size=(N, N)))
df['all_zeros'] = (df.iloc[:, 1:4] == 0).all(axis=1)
df
Output:
   0  1  2  3  4  all_zeros
0  0  1  1  0  0      False
1  0  0  1  1  1      False
2  0  1  1  0  0      False
3  0  0  0  0  0       True
4  1  0  0  0  0       True
Update: filtering the rows that contain non-zero values:
df[~df['all_zeros']]
Output:
   0  1  2  3  4  all_zeros
0  0  1  1  0  0      False
1  0  0  1  1  1      False
2  0  1  1  0  0      False
Update 2: To show only non-zero values:
df_filtered = df[~df['all_zeros']]
pd.melt(
    df_filtered.iloc[:, 1:4].reset_index(),
    id_vars='index', var_name='column'
).query('value != 0').sort_values('index')
Output:
   index column  value
0      0      1      1
3      0      2      1
4      1      2      1
7      1      3      1
2      2      1      1
5      2      2      1

df['Check'] = df.loc[:, 4:].sum(axis=1)
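This sums the checked columns per row; assuming all values are non-negative (true for the question's data), the sum is 0 exactly when every entry is 0. A minimal sketch on a made-up frame, checking columns 2 onwards instead of 4 to 1559:

```python
import pandas as pd

df = pd.DataFrame({0: [1, 2], 1: [3, 0], 2: [0, 0],
                   3: [0, 0], 4: [0, 5]})

# sum of the checked columns per row: with no negative values,
# the sum is 0 exactly when every entry in those columns is 0
df['Check'] = df.loc[:, 2:].sum(axis=1)
print(df['Check'].tolist())   # [0, 5] -> row 0 is all zeros, row 1 is not
```

Note the caveat: with negative values a row like [-1, 1] would also sum to 0, so the equality-based check above is the safer general solution.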

here is a way to check whether all of the values are zero or not: it's simple
and doesn't need advanced functions as in the above answers, only basic
things like filtering, loops and variable assignment.
first is the way to check whether one column contains only zeros, and
second is how to find out whether all the columns do. and it
prints an answer statement.
the method to check whether one column has only zero values:
first make a boolean series:
has_zero = df[4] == 0
# has_zero is a series which contains a bool value for each row, e.g. True, False.
# a row whose value is zero shows up as "row_number: True"
next:
rows_which_have_zero = df[has_zero]
# stores the rows which have zero, as a data frame
next:
total_number_rows = len(df)  # 3220 in the question's data
if len(rows_which_have_zero) == total_number_rows:
    print("contains only zeros")
else:
    print("contains other numbers than zero")
the above method only checks whether the number of rows in rows_which_have_zero equals the total number of rows in the column.
the method to see whether all of the columns have only zeros or not:
it puts the check above into a loop over the columns:
no_of_columns_with_only_zero = 0
for col in range(4, 1560):  # columns 4 to 1559
    has_zero = df[col] == 0
    rows_which_have_zero = df[has_zero]
    if len(rows_which_have_zero) == len(df):
        no_of_columns_with_only_zero += 1
then compare the counter with the number of checked columns:
# since it doesn't matter whether the first 4 columns have zeros or not,
# only columns 4 to 1559 are checked: 1556 columns in total
if no_of_columns_with_only_zero == 1556:
    print("there are only zero values")
else:
    print("there are numbers which are not zero")
the above checks whether no_of_columns_with_only_zero equals the number of checked columns (columns 4 to 1559, i.e. 1556 of them).
update:
# use df[str(col)] in the loop if the column labels are strings instead of ints

Related

Get row count in DataFrame without for loop

I need to find out whether the last value of dataframe['position'] is different from 0; if so, count the previous values (so in reverse) until they change, and store the count before the change in a variable, all without for loops. By loc or iloc for example...
dataframe:
   position
0         1
1         0
2         1   <4
3         1   <3
4         1   <2
5         1   <1
count = 4
I achieved this by a for loop, but I need to avoid it:
count = 1
if data['position'].iloc[-1] != 0:
    for i in data['position']:
        if data['position'].iloc[-count] == data['position'].iloc[-1]:
            count = count + 1
        else:
            break
    if data['position'].iloc[-count] != data['position'].iloc[-1]:
        count = count - 1
You can reverse your Series, convert to boolean using the target condition (here "not equal 0" with ne), and apply a cummin to propagate the False upwards and sum to count the trailing True:
count = df.loc[::-1, 'position'].ne(0).cummin().sum()
Output: 4
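A minimal runnable check of the trick, assuming the same 'position' data as in the question:

```python
import pandas as pd

df = pd.DataFrame({'position': [1, 0, 1, 1, 1, 1]})

# reverse the column, mark non-zero entries as True, then cummin turns
# everything at or before the last zero into False; sum counts the
# trailing run of True values
count = df.loc[::-1, 'position'].ne(0).cummin().sum()
print(count)   # 4
```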
If you have multiple columns:
counts = df.loc[::-1].ne(0).cummin().sum()
alternative
A slightly faster alternative (~25% faster), relying on the assumptions that you have at least one zero and non-duplicated indices, is to find the last zero and use indexing:
m = df['position'].eq(0)
count = len(df.loc[m[m].index[-1]:])-1
Without the requirement to have at least one zero:
m = df['position'].eq(0)
m = m[m]
count = len(df) if m.empty else len(df.loc[m.index[-1]:])-1
This should do the trick:
((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])).cumprod().sum()
This builds a condition ((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])) indicating whether the value in each row (counting backwards from the end) is nonzero and equals the last value. Then, the values are coerced into 0 or 1 and the cumulative product is taken, so that the first non-matching zero will break the sequence and all subsequent values will be zero. Then the flags are summed to get the count of these consecutive matched values.
Depending on your data, though, stepping iteratively backwards from the end may be faster. This solution is vectorized, but it requires working with the entire column of data and doing several computations which are the same size as the original series.
Example:
In [12]: data = pd.DataFrame(np.random.randint(0, 3, size=(10, 5)), columns=list('ABCDE'))
...: data
Out[12]:
A B C D E
0 2 0 1 2 0
1 1 0 1 2 1
2 2 1 2 1 0
3 1 0 1 2 2
4 1 1 0 0 2
5 2 2 1 0 2
6 2 1 1 2 2
7 0 0 0 1 0
8 2 2 0 0 1
9 2 0 0 2 1
In [13]: ((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])).cumprod().sum()
Out[13]:
A 2
B 0
C 0
D 1
E 2
dtype: int64

Get maximum occurrence of one specific value per row with pandas

I have the following dataframe:
   1  2  3  4  5  6  7  8  9
0  0  0  1  0  0  0  0  0  1
1  0  0  0  0  1  1  0  1  0
2  1  1  0  1  1  0  0  1  1
...
I want to get for each row the longest sequence of value 0 in the row.
so, the expected result for this dataframe will be an array that looks like this:
[5,4,2,...]
as in the first row, the maximum sequence of the value 0 is 5, etc.
I have seen this post and tried, for a start, to get this for the first row (though I would like to do it at once for the whole dataframe), but I got errors:
s=df_day.iloc[0]
(~s).cumsum()[s].value_counts().max()
TypeError: ufunc 'invert' not supported for the input types, and the
inputs could not be safely coerced to any supported types according to
the casting rule ''safe''
when I inserted manually the values like this:
s=pd.Series([0,0,1,0,0,0,0,0,1])
(~s).cumsum()[s].value_counts().max()
>>>7
I got 7, which is the total number of 0s in the row, but not the max sequence.
However, I don't understand why it raises the error at first, and, more importantly, I would like to run it in the end on the whole dataframe, per row.
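The linked trick assumes a boolean Series: the 'invert' error comes from applying ~ to a non-boolean Series. Converting with eq(0) first makes it work on a single row; a sketch on the series from the question:

```python
import pandas as pd

s = pd.Series([0, 0, 1, 0, 0, 0, 0, 0, 1])

m = s.eq(0)                        # boolean mask: True where the value is 0
# (~m).cumsum() increments at every non-zero, so zeros in the same run
# share one label; value_counts then measures each run's length
longest = (~m).cumsum()[m].value_counts().max()
print(longest)   # 5
```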
My end goal: get the maximum uninterrupted occurance of value 0 in a row.
A vectorized solution that counts consecutive 0s per row; for the maximum per row, use max on the DataFrame c:
# more explanation: https://stackoverflow.com/a/52718619/2901002
m = df.eq(0)
b = m.cumsum(axis=1)
c = b.sub(b.mask(m).ffill(axis=1).fillna(0)).astype(int)
print (c)
   1  2  3  4  5  6  7  8  9
0  1  2  0  1  2  3  4  5  0
1  1  2  3  4  0  0  1  0  1
2  0  0  1  0  0  1  2  0  0
df['max_consecutive_0'] = c.max(axis=1)
print (df)
   1  2  3  4  5  6  7  8  9  max_consecutive_0
0  0  0  1  0  0  0  0  0  1                  5
1  0  0  0  0  1  1  0  1  0                  4
2  1  1  0  1  1  0  0  1  1                  2
Use:
df = df.T.apply(lambda x: (x != x.shift()).astype(int).cumsum().where(x.eq(0)).dropna().value_counts().max())
OUTPUT
0 5
1 4
2 2
The following code should do the job.
The function longest_streak counts the lengths of the runs of consecutive zeros and returns the longest one; you can then apply it row-wise on your df.
from itertools import groupby

def longest_streak(l):
    lst = []
    for n, c in groupby(l):
        if n == 0:
            lst.append(sum(1 for i in c))
    return max(lst, default=0)  # 0 if the row contains no zeros

df.apply(longest_streak, axis=1)

Check if n consecutive elements equals x and any previous element is greater than x

I have a pandas dataframe with 6 mins readings. I want to mark each row as either NF or DF.
NF = rows with 5 consecutive entries being 0 and at least one prior reading being greater than 0
DF = All other rows that do not meet the NF rule
[[4,6,7,2,1,0,0,0,0,0]
[6,0,0,0,0,0,2,2,2,5]
[0,0,0,0,0,0,0,0,0,0]
[0,0,0,0,0,4,6,7,2,1]]
Expected Result:
[NF, NF, DF, DF]
Can I use a sliding window for this? What is a good pythonic way of doing this?
a straight numpy vectorised solution, with two conditions operating on a truth matrix
uses the fact that True is 1, so cumsum() can be used
the position of the 5th zero should be 4 places after the 1st
if you just want the array, np.where() gives that without assigning it back to a dataframe column
used another test case [1,0,0,0,0,1,0,0,0,0] where there are many zeros, but not 5 consecutive
df = pd.DataFrame([[4,6,7,2,1,0,0,0,0,0],
                   [6,0,0,0,0,0,2,2,2,5],
                   [0,0,0,0,0,0,0,0,0,0],
                   [1,0,0,0,0,1,0,0,0,0],
                   [0,0,0,0,0,4,6,7,2,1]])

df = df.assign(res=np.where(
    # five consecutive zeros
    ((np.argmax(np.cumsum(df.eq(0).values, axis=1) == 1, axis=1) + 4) ==
     np.argmax(np.cumsum(df.eq(0).values, axis=1) == 5, axis=1)) &
    # first zero somewhere other than 0th position
    (np.argmax(df.eq(0).values, axis=1) > 0),
    "NF", "DF")
)
   0  1  2  3  4  5  6  7  8  9 res
0  4  6  7  2  1  0  0  0  0  0  NF
1  6  0  0  0  0  0  2  2  2  5  NF
2  0  0  0  0  0  0  0  0  0  0  DF
3  1  0  0  0  0  1  0  0  0  0  DF
4  0  0  0  0  0  4  6  7  2  1  DF
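Since the question asks about a sliding window: here is a hedged sketch using pandas' rolling instead (classify is a made-up helper name, not from the answer above). Unlike the cumsum check, which compares the 1st and 5th zero, this also detects a 5-zero run that does not start at the row's first zero:

```python
import pandas as pd

df = pd.DataFrame([[4, 6, 7, 2, 1, 0, 0, 0, 0, 0],
                   [6, 0, 0, 0, 0, 0, 2, 2, 2, 5],
                   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                   [1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
                   [0, 0, 0, 0, 0, 4, 6, 7, 2, 1]])

def classify(row):
    # True at every position where a window of 5 consecutive zeros ends
    ends = row.eq(0).astype(int).rolling(5).sum().eq(5).to_numpy()
    if not ends.any():
        return 'DF'                    # no run of 5 zeros at all
    start = ends.argmax() - 4          # start of the first 5-zero window
    # NF also needs at least one reading > 0 before that window
    return 'NF' if (row.to_numpy()[:start] > 0).any() else 'DF'

print(df.apply(classify, axis=1).tolist())
# ['NF', 'NF', 'DF', 'DF', 'DF']
```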

Grouped by set of columns, first non zero value and one of all zeros in a column needs to be flagged as 1 and rest as 0

import pandas as pd
df = pd.DataFrame({'Org1': [1,1,1,1,2,2,2,2,3,3,3,4,4,4],
                   'Org2': ['x','x','y','y','z','y','z','z','x','y','y','z','x','x'],
                   'Org3': ['a','a','b','b','c','b','c','c','a','b','b','c','a','a'],
                   'Value': [0,0,3,1,0,1,0,5,0,0,0,1,1,1]})
df
For each unique set of "Org1, Org2, Org3" and based on the "Value"
The first non zero "value" should have "FLAG" = 1 and others = 0
If all "value" are 0 then one of the row's "FLAG" = 1 and others = 0
If "value" are all NON ZERO in a Column then first instance to have FLAG = 1 and others 0
I was using the solution provided in
Flag the first non zero column value with 1 and rest 0 having multiple columns
One difference is that Point 2 above isn't covered:
"If all "Value"s are 0 then one of the rows' "FLAG" = 1 and the others = 0"
You can modify the linked solution by removing .where:
m = df['Value'].ne(0)
idx = m.groupby([df['Org1'],df['Org2'],df['Org3']]).idxmax()
df['FLAG'] = df.index.isin(idx).astype(int)
print (df)
Org1 Org2 Org3 Value FLAG
0 1 x a 0 1
1 1 x a 0 0
2 1 y b 3 1
3 1 y b 1 0
4 2 z c 0 0
5 2 y b 1 1
6 2 z c 0 0
7 2 z c 5 1
8 3 x a 0 1
9 3 y b 0 1
10 3 y b 0 0
11 4 z c 1 1
12 4 x a 1 1
13 4 x a 1 0

numpy where - how to set condition on whole column?

How to implement :
t=np.where(<exists at least 1 zero in the same column of t>,t,np.zeros_like(t))
in the "pythonic" way?
this code should set a whole column of t to zero if that column of t contains at least one zero
Example :
1 1 1 1 1 1
0 1 1 1 1 1
1 1 0 1 0 1
should turn to
0 1 0 1 0 1
0 1 0 1 0 1
0 1 0 1 0 1
any is what you need:
~(arr == 0).any(0, keepdims=True) * arr
0 1 0 1 0 1
0 1 0 1 0 1
0 1 0 1 0 1
this code should set a whole column of t to zero if t has at least 1 zero
in that column
The simplest way to do this particular task (the array contains only 0s and 1s, so a column's minimum is 0 exactly when the column contains a zero):
t * t.min(0)
A more general way to do it (in case you have an array with different values and the condition is: if a column has at least one occurrence of some_value, then set that column to some_value).
cond = (arr == some_value).any(0)
arr[:, cond] = some_value
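A quick runnable sketch of this general recipe on the question's example array (with some_value = 0):

```python
import numpy as np

t = np.array([[1, 1, 1, 1, 1, 1],
              [0, 1, 1, 1, 1, 1],
              [1, 1, 0, 1, 0, 1]])

cond = (t == 0).any(axis=0)   # columns that contain at least one zero
t[:, cond] = 0                # zero out those whole columns in place
print(t)                      # columns 0, 2 and 4 are now all zeros
```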
