I am a beginner and I really need help with the following.
I need to do something similar to "Identifying consecutive occurrences of a value", but on a two-dimensional dataframe.
I want to use that answer on a two-dimensional dataframe: within each column, I need to count the ones that belong to a run of at least 2 consecutive ones. Here is a sample dataframe:
my_df=
0 1 2
0 1 0 1
1 0 1 0
2 1 1 1
3 0 0 1
4 0 1 0
5 1 1 0
6 1 1 1
7 1 0 1
The output I am looking for is:
0 1 2
0 3 5 4
Instead of the column 'consecutive', I need a new output called "out_1_df" produced by the line
df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
So that later I can do
threshold = 2;
out_2_df= (out_1_df > threshold).astype(int)
I tried the following:
out_1_df= my_df.groupby(( my_df != my_df.shift(axis=0)).cumsum(axis=0))
out_2_df = (out_1_df > threshold).astype(int)
How can I modify this?
Try:
import pandas as pd
df=pd.DataFrame({0:[1,0,1,0,0,1,1,1], 1:[0,1,1,0,1,1,1,0], 2: [1,0,1,1,0,0,1,1]})
out_2_df=((df.diff(axis=0).eq(0)|df.diff(periods=-1,axis=0).eq(0))&df.eq(1)).sum(axis=0)
>>> out_2_df
[3 5 4]
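The intermediate out_1_df that the question asks for can also be built by applying the linked single-column idiom to each column in turn. This is only a sketch: the helper name run_lengths is made up for illustration, and the comparison uses >= rather than the question's > because the stated goal is runs of at least 2.

```python
import pandas as pd

my_df = pd.DataFrame({0: [1, 0, 1, 0, 0, 1, 1, 1],
                      1: [0, 1, 1, 0, 1, 1, 1, 0],
                      2: [1, 0, 1, 1, 0, 0, 1, 1]})

def run_lengths(col):
    # A new run starts wherever the value differs from the previous row
    run_id = (col != col.shift()).cumsum()
    # Broadcast each run's size to its rows; multiplying by col zeroes out runs of 0
    return col.groupby(run_id).transform('size') * col

out_1_df = my_df.apply(run_lengths)             # run length at every 1, else 0
threshold = 2
out_2_df = (out_1_df >= threshold).astype(int)  # 1 where the run is long enough
counts = out_2_df.sum(axis=0)
print(counts.tolist())  # [3, 5, 4]
```

Keeping out_1_df around makes it easy to try other thresholds without recomputing the runs.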
Related
I have the following dataframe:
1 2 3 4 5 6 7 8 9
0 0 0 1 0 0 0 0 0 1
1 0 0 0 0 1 1 0 1 0
2 1 1 0 1 1 0 0 1 1
...
I want to get, for each row, the longest sequence of the value 0 in that row.
So the expected result for this dataframe is an array that looks like this:
[5,4,2,...]
since in the first row the longest sequence of 0s has length 5, etc.
I have seen this post and tried, to begin with, to get this for the first row only (though I would like to do it for the whole dataframe at once), but I got an error:
s=df_day.iloc[0]
(~s).cumsum()[s].value_counts().max()
TypeError: ufunc 'invert' not supported for the input types, and the
inputs could not be safely coerced to any supported types according to
the casting rule ''safe''
When I entered the values manually like this:
s=pd.Series([0,0,1,0,0,0,0,0,1])
(~s).cumsum()[s].value_counts().max()
>>>7
I got 7, which is the total number of 0s in the row, not the longest sequence.
However, I don't understand why it raised the error in the first place and, more importantly, I would like to run it on the whole dataframe, per row.
My end goal: get the maximum uninterrupted occurrence of the value 0 in each row.
A vectorized solution that counts consecutive 0s per row; for the maximum, take the max of DataFrame c:
# more explanation: https://stackoverflow.com/a/52718619/2901002
m = df.eq(0)
b = m.cumsum(axis=1)
c = b.sub(b.mask(m).ffill(axis=1).fillna(0)).astype(int)
print (c)
1 2 3 4 5 6 7 8 9
0 1 2 0 1 2 3 4 5 0
1 1 2 3 4 0 0 1 0 1
2 0 0 1 0 0 1 2 0 0
df['max_consecutive_0'] = c.max(axis=1)
print (df)
1 2 3 4 5 6 7 8 9 max_consecutive_0
0 0 0 1 0 0 0 0 0 1 5
1 0 0 0 0 1 1 0 1 0 4
2 1 1 0 1 1 0 0 1 1 2
Use:
df = df.T.apply(lambda x: (x != x.shift()).astype(int).cumsum().where(x.eq(0)).dropna().value_counts().max())
OUTPUT
0 5
1 4
2 2
The following code should do the job.
The function longest_streak measures every run of consecutive zeros and returns the length of the longest one; you can apply it to each row of your df.
from itertools import groupby
def longest_streak(values):
    # groupby yields one (value, run) pair per block of equal neighbours;
    # keep the length of each block of zeros
    zero_runs = [sum(1 for _ in run) for value, run in groupby(values) if value == 0]
    # default=0 avoids a ValueError on rows that contain no zeros
    return max(zero_runs, default=0)
df.apply(longest_streak, axis=1)
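On the TypeError from the attempt in the question: ~ is a bitwise NOT, so the linked idiom only behaves as intended on a boolean Series. On an integer Series it flips bits (0 becomes -1), and on a float Series NumPy rejects the operation with the 'invert' error quoted in the question. The dtype of df_day is not shown, so that cause is an assumption. A minimal sketch that builds the boolean mask first, using the hand-typed row from the question:

```python
import pandas as pd

s = pd.Series([0, 0, 1, 0, 0, 0, 0, 0, 1])

is_zero = s.eq(0)  # boolean mask: True where the value is 0
# Every non-zero value bumps the counter, so each run of zeros shares one id
run_id = (~is_zero).cumsum()
# Count how many zeros carry each id and keep the largest count
longest = run_id[is_zero].value_counts().max()
print(longest)  # 5
```

For a row without any zeros the selection is empty and max() returns NaN, so such rows would need separate handling.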
I have a dataset that looks like below.
df=pd.DataFrame({'unit': ['ABC', 'DEF', 'GEH','IJK','DEF','XRF','BRQ'], 'A': [1,1,1,0,0,0,1], 'B': [1,1,1,1,1,1,0],'C': [1,1,1,0,0,0,1],'row_num': [7,6,5,4,3,2,1]})
I am trying to implement the following logic:
Step 1 - Consider the subset with row_num <= 4.
Step 2 - Columns A, B, C then hold 12 values in total (0's and 1's).
Step 3 - Count the number of 1's within columns A, B, C. In the example there are five 1's and seven 0's, so 1's make up about 42% (5/12).
Step 4 - Since the share of 1's is greater than 40%, create a column flag with 1; else, if the share of 1's is less than 10%, set it to 0.
Hopefully I got it this time:
subdf = df.iloc[3:, 1:4]
df['flag'] = 1 if subdf.values.sum()/subdf.size >= 0.1 else 0
output:
unit A B C row_num flag
0 ABC 1 1 1 7 1
1 DEF 1 1 1 6 1
2 GEH 1 1 1 5 1
3 IJK 0 1 0 4 1
4 DEF 0 1 0 3 1
5 XRF 0 1 0 2 1
6 BRQ 1 0 1 1 1
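A variant that selects the subset by the row_num condition from Step 1 instead of by position, so it keeps working if the rows are reordered. This is a sketch only: the names subset and share_of_ones are illustrative, and the 0.1 cutoff is copied from the answer above because the question does not say what should happen between 10% and 40%.

```python
import pandas as pd

df = pd.DataFrame({'unit': ['ABC', 'DEF', 'GEH', 'IJK', 'DEF', 'XRF', 'BRQ'],
                   'A': [1, 1, 1, 0, 0, 0, 1],
                   'B': [1, 1, 1, 1, 1, 1, 0],
                   'C': [1, 1, 1, 0, 0, 0, 1],
                   'row_num': [7, 6, 5, 4, 3, 2, 1]})

# Step 1: rows with row_num <= 4, restricted to the three indicator columns
subset = df.loc[df['row_num'] <= 4, ['A', 'B', 'C']]

# Steps 2-3: the mean of a 0/1 block is the share of 1's (5 of 12 here)
share_of_ones = subset.to_numpy().mean()

# Step 4: one flag value for the whole frame (cutoff is an assumption, see above)
df['flag'] = int(share_of_ones >= 0.1)
print(round(share_of_ones, 4))  # 0.4167
```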
I have a pandas dataframe with 6-minute readings. I want to mark each row as either NF or DF.
NF = rows with 5 consecutive entries being 0 and at least one prior reading being greater than 0
DF = All other rows that do not meet the NF rule
[[4,6,7,2,1,0,0,0,0,0],
 [6,0,0,0,0,0,2,2,2,5],
 [0,0,0,0,0,0,0,0,0,0],
 [0,0,0,0,0,4,6,7,2,1]]
Expected Result:
[NF, NF, DF, DF]
Can I use a sliding window for this? What is a good pythonic way of doing this?
A vectorised numpy solution: two conditions operating on a truth matrix.
It uses the fact that True is 1, so cumsum() can be used.
The position of the 5th zero should be 4 places higher than that of the 1st.
If you just want the array, np.where() gives that without assigning it back to a dataframe column.
I added another test case, [1,0,0,0,0,1,0,0,0,0], where there are many zeros but not 5 consecutive ones.
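import numpy as np
import pandas as pd
df = pd.DataFrame([[4,6,7,2,1,0,0,0,0,0],
                   [6,0,0,0,0,0,2,2,2,5],
                   [0,0,0,0,0,0,0,0,0,0],
                   [1,0,0,0,0,1,0,0,0,0],
                   [0,0,0,0,0,4,6,7,2,1]])
zeros = df.eq(0).values  # truth matrix: True where a reading is 0
df = df.assign(res=np.where(
    # five consecutive zeros: the 5th zero sits 4 places after the 1st
    ((np.argmax(np.cumsum(zeros, axis=1)==1, axis=1)+4) ==
     np.argmax(np.cumsum(zeros, axis=1)==5, axis=1)) &
    # first zero somewhere other than the 0th position
    # (parenthesised: & binds tighter than >, so without the brackets
    # only odd first-zero positions would pass)
    (np.argmax(zeros, axis=1)>0)
    ,"NF","DF")
)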
   0  1  2  3  4  5  6  7  8  9 res
0  4  6  7  2  1  0  0  0  0  0  NF
1  6  0  0  0  0  0  2  2  2  5  NF
2  0  0  0  0  0  0  0  0  0  0  DF
3  1  0  0  0  0  1  0  0  0  0  DF
4  0  0  0  0  0  4  6  7  2  1  DF
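The question also asks whether a sliding window could be used. The check above only looks at whether the first five zeros of a row are adjacent, so a window-based version may be worth having when a later run should count too. The following is a sketch under assumptions: it relies on numpy.lib.stride_tricks.sliding_window_view (available from NumPy 1.20), and the function name label_rows and its run parameter are made up for illustration.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def label_rows(frame, run=5):
    values = frame.to_numpy()
    # windows[i, j] is True when columns j .. j+run-1 of row i are all zero
    windows = sliding_window_view(values == 0, run, axis=1).all(axis=2)
    # positive_so_far[i, j]: some reading in columns 0 .. j is greater than 0
    positive_so_far = np.cumsum(values > 0, axis=1) > 0
    # Shift right by one so column j only looks at readings strictly before it
    seen_before = np.zeros_like(positive_so_far)
    seen_before[:, 1:] = positive_so_far[:, :-1]
    # NF: at least one all-zero window that starts after a positive reading
    is_nf = (windows & seen_before[:, :windows.shape[1]]).any(axis=1)
    return np.where(is_nf, "NF", "DF")

readings = pd.DataFrame([[4, 6, 7, 2, 1, 0, 0, 0, 0, 0],
                         [6, 0, 0, 0, 0, 0, 2, 2, 2, 5],
                         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                         [1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0, 4, 6, 7, 2, 1]])
labels = label_rows(readings)
print(labels.tolist())  # ['NF', 'NF', 'DF', 'DF', 'DF']
```

A row such as [0,0,3,0,0,0,0,0,1,1], whose first five zeros are not adjacent but which has a later run of five after a positive reading, comes out as NF with this version.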
I have a df with badminton scores. Each set of a game for a team is on its own row, and the score at each point is in the columns, like so:
0 0 1 1 2 3 4
0 1 2 3 3 4 4
I want to obtain only 0s and 1s, with a 1 wherever a point is scored, like so (to analyse whether there is any pattern in the points):
0 0 1 0 1 1 1
0 1 1 1 0 1 0
I was thinking of using df.itertuples() and iloc with conditions, putting 1 in a new dataframe if next score = score + 1, or 0 if next score = score.
But I don't know how to iterate through the generated tuples or how to build my new df with the 0s and 1s in the right locations.
I hope that is clear; thanks for your help.
Oh, also, any suggestions for analysing the patterns after that?
You just need diff (if you need to convert it back, try cumsum):
df.diff(axis=1).fillna(0).astype(int)
Out[1382]:
1 2 3 4 5 6 7
0 0 0 1 0 1 1 1
1 0 1 1 1 0 1 0
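A self-contained sketch of the round trip mentioned above, built from the two score rows in the question. The names scores, points and rebuilt are illustrative, and the cumsum step recovers the original only because every set starts at 0.

```python
import pandas as pd

scores = pd.DataFrame([[0, 0, 1, 1, 2, 3, 4],
                       [0, 1, 2, 3, 3, 4, 4]])

# Difference between neighbouring columns: 1 where the score went up, else 0.
# The first column has no left neighbour, so its NaN is filled with 0.
points = scores.diff(axis=1).fillna(0).astype(int)

# A running total along each row turns the 0/1 indicators back into scores
rebuilt = points.cumsum(axis=1)

print(points.values.tolist())
# [[0, 0, 1, 0, 1, 1, 1], [0, 1, 1, 1, 0, 1, 0]]
```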
I want to get a row count of the frequency of each value, even if that value doesn't exist in the dataframe.
d = {'light' : pd.Series(['b','b','c','a','a','a','a'], index=[1,2,3,4,5,6,9]),'injury' : pd.Series([1,5,5,5,2,2,4], index=[1,2,3,4,5,6,9])}
testdf = pd.DataFrame(d)
injury light
1 1 b
2 5 b
3 5 c
4 5 a
5 2 a
6 2 a
9 4 a
I want to get a count of the number of occurrences of each unique value of 'injury' for each unique value in 'light'.
Normally I would just use groupby(), or (in this case, since I want it to be in a specific format), pivot_table:
testdf.reset_index().pivot_table(index='light',columns='injury',fill_value=0,aggfunc='count')
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
But in this case I actually want to compare the records in the dataframe to an external list of values-- in this case, ['a','b','c','d']. So if 'd' doesn't exist in this dataframe, then I want it to return a count of zero:
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
d 0 0 0 0
The closest I've come is filtering the dataframe based on each value, and then getting the size of that dataframe:
for v in sorted(['a','b','c','d']):
    # keep only the rows whose 'light' equals the current value
    idx2 = (testdf['light'].isin([v]))
    df2 = testdf[idx2]
    print(df2.shape[0])
4
2
1
0
But that only returns counts from the 'light' column-- instead of a cross-tabulation of both columns.
Is there a way to make a pivot table, or a groupby() object, that groups things based on values in a list, rather than in a column in a dataframe? Or is there a better way to do this?
Try this:
df = pd.crosstab(df.light, df.injury,margins=True)
df
injury 1 2 4 5 All
light
a 0 2 1 1 4
b 1 0 0 1 2
c 0 0 0 1 1
All 1 2 1 3 7
df["All"]
light
a 4
b 2
c 1
All 7
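To get a zero row for a value such as 'd' that never occurs in the data, one option is to reindex the crosstab against the external list. A minimal sketch using the testdf from the question; the variable names wanted and table are illustrative.

```python
import pandas as pd

testdf = pd.DataFrame({
    'light': pd.Series(['b', 'b', 'c', 'a', 'a', 'a', 'a'], index=[1, 2, 3, 4, 5, 6, 9]),
    'injury': pd.Series([1, 5, 5, 5, 2, 2, 4], index=[1, 2, 3, 4, 5, 6, 9]),
})

wanted = ['a', 'b', 'c', 'd']  # external list of values to report on

# Cross-tabulate, then force the row index to match the external list;
# labels missing from the data become rows filled with 0
table = pd.crosstab(testdf['light'], testdf['injury']).reindex(wanted, fill_value=0)
print(table)
```

The same reindex call accepts columns=... if the injury codes also need to follow an external list.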