Subsetting the data frame and applying cumulative operation on multiple columns

I have a dataset that looks like the one below.
df=pd.DataFrame({'unit': ['ABC', 'DEF', 'GEH','IJK','DEF','XRF','BRQ'], 'A': [1,1,1,0,0,0,1], 'B': [1,1,1,1,1,1,0],'C': [1,1,1,0,0,0,1],'row_num': [7,6,5,4,3,2,1]})
I am trying to implement the following logic:
Step 1 - Consider the subset with row_num <= 4.
Step 2 - Columns A, B and C then hold 12 values in total (0's and 1's).
Step 3 - Count the number of 1's within columns A, B and C. In the example
there are five 1's and seven 0's, so 1's make up about 42% (5/12) of the
values.
Step 4 - Since the share of 1's is greater than 40%, create a column flag
with the value 1; if the share of 1's is less than 10%, set it to 0.

Hopefully I got it this time:
subdf = df.iloc[3:, 1:4]
df['flag'] = 1 if subdf.values.sum()/subdf.size >= 0.1 else 0
output:
unit A B C row_num flag
0 ABC 1 1 1 7 1
1 DEF 1 1 1 6 1
2 GEH 1 1 1 5 1
3 IJK 0 1 0 4 1
4 DEF 0 1 0 3 1
5 XRF 0 1 0 2 1
6 BRQ 1 0 1 1 1
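The positional slice df.iloc[3:, 1:4] only works because the rows happen to be ordered by row_num. A minimal sketch that selects the subset through the row_num condition from Step 1 instead; the 10% threshold is taken from the answer above and is an assumption, and the names subset and share_of_ones are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'unit': ['ABC', 'DEF', 'GEH', 'IJK', 'DEF', 'XRF', 'BRQ'],
    'A': [1, 1, 1, 0, 0, 0, 1],
    'B': [1, 1, 1, 1, 1, 1, 0],
    'C': [1, 1, 1, 0, 0, 0, 1],
    'row_num': [7, 6, 5, 4, 3, 2, 1],
})

# Step 1: keep only the rows with row_num <= 4, and only columns A, B, C
subset = df.loc[df['row_num'] <= 4, ['A', 'B', 'C']]

# Steps 2-3: the values are 0/1, so the mean is the share of 1's (5/12 here)
share_of_ones = subset.to_numpy().mean()

# Step 4: a single flag value for the whole frame (threshold is an assumption)
df['flag'] = int(share_of_ones >= 0.1)
print(df)
```

Because the selection is a boolean mask, it gives the same subset even if the rows are shuffled.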

Related

Count number of consecutive rows that are greater than current row value but less than the value from other column

Say I have the following sample dataframe (there are about 25k rows in the real dataframe)
df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
A B
0 0 9
1 3 8
2 2 3
3 9 5
4 1 5
5 0 5
6 4 5
7 7 8
8 3 0
9 2 4
For the column A I need to know how many next and previous rows are greater than current row value but less than value in column B.
So my expected output is :
A B next count previous count
0 9 2 0
3 8 0 0
2 3 0 1
9 5 0 0
1 5 0 0
0 5 2 1
4 5 1 0
7 8 0 0
3 0 0 2
2 4 0 0
Explanation :
First row: the next count is 2, since 3 and 2 are greater than 0 and less than their corresponding B values 8 and 3; the following value 9 is not less than its B value 5, so counting stops there.
Second row: the next count is 0, since the next value 2 is not greater than 3.
Third row: the next count is 0, since 9 is greater than 2 but not less than its corresponding B value 5.
The previous count is calculated the same way, looking backwards.
Note: I know how to solve this problem by looping with a list comprehension or with the pandas apply method, but I would not mind a clear and concise apply approach. I was looking for a more idiomatic pandas approach.
My Solution
Here is the apply solution, which I think is inefficient. People have also said that there might be no vectorized solution for this question, so, as mentioned, a more efficient apply solution will be accepted.
This is what I have tried.
This function gets the number of previous/next rows that satisfy the condition.
def get_prev_next_count(row):
    next_nrow = df.loc[row['index']+1:,['A', 'B']]
    prev_nrow = df.loc[:row['index']-1,['A', 'B']][::-1]
    if (next_nrow.size == 0):
        return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
    if (prev_nrow.size == 0):
        return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
    return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())
Generating output :
df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")
Output :
This gives us the expected output
df
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0

I made some optimizations:
You don't need reset_index(); you can access the index label with .name.
Passing only df[['A']] instead of the whole frame may help.
prev_nrow.empty is the same as (prev_nrow.size == 0).
A different approach, first_false, finds the desired value with an early exit; this speeds things up significantly.
def first_false(val1, val2, A):
    i = 0
    for x, y in zip(val1, val2):
        if A < x < y:
            i += 1
        else:
            break
    return i
def get_prev_next_count(row):
    A = row['A']
    next_nrow = df.loc[row.name+1:,['A', 'B']]
    prev_nrow = df2.loc[row.name-1:,['A', 'B']]
    if next_nrow.empty:
        return 0, first_false(prev_nrow.A, prev_nrow.B, A)
    if prev_nrow.empty:
        return first_false(next_nrow.A, next_nrow.B, A), 0
    return (first_false(next_nrow.A, next_nrow.B, A),
            first_false(prev_nrow.A, prev_nrow.B, A))
df2 = df[::-1].copy() # Shave a tiny bit of time by only reversing it once~
df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
print(df)
Output:
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
Timing
Expanding the data:
df = pd.concat([df]*(10000//4), ignore_index=True)
# df.shape == (25000, 2)
Original Method:
Gave up at 15 minutes.
New Method:
1m 20sec
Throw pandarallel at it:
from pandarallel import pandarallel
pandarallel.initialize()
df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')
26sec
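For comparison, a sketch that drops apply altogether and walks plain NumPy arrays with an early exit, which is the same idea as first_false. The function name count_runs is hypothetical, and no timing is claimed for it:

```python
import numpy as np
import pandas as pd

def count_runs(a, b, step):
    """For each i, count consecutive neighbours j (moving by step)
    that satisfy a[i] < a[j] < b[j]; stop at the first failure."""
    n = len(a)
    out = np.zeros(n, dtype=int)
    for i in range(n):
        j = i + step
        while 0 <= j < n and a[i] < a[j] < b[j]:
            out[i] += 1
            j += step
    return out

df = pd.DataFrame({'A': [0, 3, 2, 9, 1, 0, 4, 7, 3, 2],
                   'B': [9, 8, 3, 5, 5, 5, 5, 8, 0, 4]})
a, b = df['A'].to_numpy(), df['B'].to_numpy()
df['next count'] = count_runs(a, b, step=1)       # look forwards
df['previous count'] = count_runs(a, b, step=-1)  # look backwards
print(df)
```

The loop is still quadratic in the worst case, but it avoids building a sliced DataFrame for every row.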

Get maximum occurrence of one specific value per row with pandas

I have the following dataframe:
1 2 3 4 5 6 7 8 9
0 0 0 1 0 0 0 0 0 1
1 0 0 0 0 1 1 0 1 0
2 1 1 0 1 1 0 0 1 1
...
I want to get, for each row, the longest sequence of the value 0 in that row.
So the expected result for this dataframe is an array that looks like this:
[5,4,2,...]
since in the first row the longest sequence of the value 0 has length 5, etc.
I have seen this post and tried, as a start, to get this for the first row only (though I would like to do it for the whole dataframe at once), but I got an error:
s=df_day.iloc[0]
(~s).cumsum()[s].value_counts().max()
TypeError: ufunc 'invert' not supported for the input types, and the
inputs could not be safely coerced to any supported types according to
the casting rule ''safe''
When I entered the values manually like this:
s=pd.Series([0,0,1,0,0,0,0,0,1])
(~s).cumsum()[s].value_counts().max()
>>>7
I got 7, which is the total number of 0s in the row, not the longest sequence.
However, I don't understand why it raises the error in the first place, and, more importantly, I would like to run it in the end on the whole dataframe, per row.
My end goal: get the maximum uninterrupted occurrence of the value 0 in each row.
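A note on the error: the idiom from the linked post expects a boolean Series. On integers, ~ is a bitwise NOT (0 becomes -1, 1 becomes -2) rather than a logical negation, and on a float row NumPy refuses to invert at all, which is most likely what produced the TypeError. A minimal sketch that builds the boolean mask explicitly first; the function name longest_zero_run and the sample frame are illustrative:

```python
import pandas as pd

def longest_zero_run(row):
    is_zero = row.eq(0)            # boolean mask, safe to negate with ~
    run_id = (~is_zero).cumsum()   # stays constant while the zeros continue
    # number of zeros inside each run id; the largest is the longest streak
    return int(is_zero.groupby(run_id).sum().max())

s = pd.Series([0, 0, 1, 0, 0, 0, 0, 0, 1])
longest = longest_zero_run(s)

df_day = pd.DataFrame(
    [[0, 0, 1, 0, 0, 0, 0, 0, 1],
     [0, 0, 0, 0, 1, 1, 0, 1, 0],
     [1, 1, 0, 1, 1, 0, 0, 1, 1]],
    columns=range(1, 10),
)
per_row = df_day.apply(longest_zero_run, axis=1)
print(longest, per_row.tolist())
```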

A vectorized solution that counts consecutive 0s per row; for the longest run, take the row-wise max of the DataFrame c:
# more explanation: https://stackoverflow.com/a/52718619/2901002
m = df.eq(0)
b = m.cumsum(axis=1)
c = b.sub(b.mask(m).ffill(axis=1).fillna(0)).astype(int)
print (c)
1 2 3 4 5 6 7 8 9
0 1 2 0 1 2 3 4 5 0
1 1 2 3 4 0 0 1 0 1
2 0 0 1 0 0 1 2 0 0
df['max_consecutive_0'] = c.max(axis=1)
print (df)
1 2 3 4 5 6 7 8 9 max_consecutive_0
0 0 0 1 0 0 0 0 0 1 5
1 0 0 0 0 1 1 0 1 0 4
2 1 1 0 1 1 0 0 1 1 2

Use:
df = df.T.apply(lambda x: (x != x.shift()).astype(int).cumsum().where(x.eq(0)).dropna().value_counts().max())
OUTPUT
0 5
1 4
2 2

The following code should do the job.
The function longest_streak counts the lengths of the runs of consecutive zeros and returns the longest one; you can apply it to each row of your df.
from itertools import groupby
def longest_streak(l):
    lst = []
    # groupby yields one (value, run) pair per run of equal values
    for n,c in groupby(l):
        num,count = n,sum(1 for i in c)
        if num==0:
            lst.append((num,count))
    # default=0 covers rows that contain no zeros at all
    maxx = max([y for x,y in lst], default=0)
    return(maxx)
df.apply(lambda x: longest_streak(x),axis=1)

Counting number of consecutive more than 2 occurrences

I am a beginner, and I really need help with the following:
I need to do something similar to the following question, but on a two-dimensional dataframe: Identifying consecutive occurrences of a value
I need to use that answer, but for a two-dimensional dataframe. I need to count the ones that occur at least 2 times consecutively along the columns dimension. Here is a sample dataframe:
my_df=
0 1 2
0 1 0 1
1 0 1 0
2 1 1 1
3 0 0 1
4 0 1 0
5 1 1 0
6 1 1 1
7 1 0 1
The output I am looking for is:
0 1 2
0 3 5 4
Instead of the column 'consecutive', I need a new output called "out_1_df" for line
df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
So that later I can do
threshold = 2;
out_2_df= (out_1_df > threshold).astype(int)
I tried the following:
out_1_df= my_df.groupby(( my_df != my_df.shift(axis=0)).cumsum(axis=0))
out_2_df = (out_1_df > threshold).astype(int)
How can I modify this?

Try:
import pandas as pd
df=pd.DataFrame({0:[1,0,1,0,0,1,1,1], 1:[0,1,1,0,1,1,1,0], 2: [1,0,1,1,0,0,1,1]})
out_2_df=((df.diff(axis=0).eq(0)|df.diff(periods=-1,axis=0).eq(0))&df.eq(1)).sum(axis=0)
>>> print(out_2_df.values)
[3 5 4]
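If the intermediate out_1_df from the question is also wanted, the one-dimensional run-length idiom can be applied column by column. A sketch under two assumptions: the helper name run_lengths is hypothetical, and the comparison uses >= rather than the question's >, because only "at least 2" reproduces the expected output 3 5 4:

```python
import pandas as pd

my_df = pd.DataFrame({0: [1, 0, 1, 0, 0, 1, 1, 1],
                      1: [0, 1, 1, 0, 1, 1, 1, 0],
                      2: [1, 0, 1, 1, 0, 0, 1, 1]})

def run_lengths(col):
    # a new run starts wherever the value differs from the previous row
    run_id = col.ne(col.shift()).cumsum()
    # length of the run each cell belongs to, zeroed out where the cell is 0
    return col.groupby(run_id).transform('size') * col

out_1_df = my_df.apply(run_lengths)          # one column at a time
threshold = 2
out_2_df = (out_1_df >= threshold).astype(int)
counts = out_2_df.sum(axis=0)
print(counts.tolist())
```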

Classifying according to number of consecutive values with pandas

I have a dataframe column with 1s and 0s like this:
df['working'] =
1
1
0
0
0
1
1
0
0
1
which represents when a machine is working (1) or stopped (0). I need to classify these stops depending on their length, i.e. if there are n or fewer consecutive 0s, change all of them to short-stop (2); if there are more than n, change them to long-stop (3). The expected result should look like this when applied to the example with n=2:
df[['working', 'result']]=
1 1
1 1
0 3
0 3
0 3
1 1
1 1
0 2
0 2
1 1
Of course this is just an example; my df has more than 1M rows.
I tried looping through it, but it's really slow, and I also tried using this. But I couldn't manage to adapt it to my problem.
Can anyone help? Thanks so much in advance.

Series.map combined with Series.value_counts should improve performance:
import numpy as np
n = 2
# mask of the 0 values
m = df['working'].eq(0)
# the cumulative sum is constant within each run of 0s, so it labels the runs
s = df['working'].cumsum()[m]
# length of the run each 0 belongs to
out = s.map(s.value_counts())
# set the new values through the mask
df['result'] = 1
df.loc[m, 'result'] = np.where(out > n, 3, 2)
print (df)
working result
0 1 1
1 1 1
2 0 3
3 0 3
4 0 3
5 1 1
6 1 1
7 0 2
8 0 2
9 1 1

Here's one approach:
# Counter that increases at every change of value, labelling each group
m = df.working.ne(df.working.shift()).cumsum()
# mask where working is 0
eq0 = df.working.eq(0)
# Get a count of consecutive 0s
count = df[eq0].groupby(m[eq0]).transform('count')
# replace 0s accordingly
df.loc[eq0, 'result'] = np.where(count > 2, 3, 2).ravel()
# fill the remaining values with 1
df['result'] = df.result.fillna(1)
print(df)
working result
0 1 1.0
1 1 1.0
2 0 3.0
3 0 3.0
4 0 3.0
5 1 1.0
6 1 1.0
7 0 2.0
8 0 2.0
9 1 1.0
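The result column above comes out as float because it is first created with missing values and then filled. A sketch of a variant that keeps integers by deciding all three cases in one np.select call; the names run_id and run_len are illustrative, and n is the threshold from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'working': [1, 1, 0, 0, 0, 1, 1, 0, 0, 1]})
n = 2

# label every run of equal values, then get the length of each run
run_id = df['working'].ne(df['working'].shift()).cumsum()
run_len = df.groupby(run_id)['working'].transform('size')

# conditions are checked in order: working -> 1, long stop -> 3, otherwise 2
df['result'] = np.select(
    [df['working'].eq(1), run_len.gt(n)],
    [1, 3],
    default=2,
)
print(df)
```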

pandas: Grouping or filtering based on values in list, instead of dataframe

I want to get a row count of the frequency of each value, even if that value doesn't exist in the dataframe.
d = {'light' : pd.Series(['b','b','c','a','a','a','a'], index=[1,2,3,4,5,6,9]),'injury' : pd.Series([1,5,5,5,2,2,4], index=[1,2,3,4,5,6,9])}
testdf = pd.DataFrame(d)
injury light
1 1 b
2 5 b
3 5 c
4 5 a
5 2 a
6 2 a
9 4 a
I want to get a count of the number of occurrences of each unique value of 'injury' for each unique value in 'light'.
Normally I would just use groupby(), or (in this case, since I want it to be in a specific format), pivot_table:
testdf.reset_index().pivot_table(index='light',columns='injury',fill_value=0,aggfunc='count')
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
But in this case I actually want to compare the records in the dataframe to an external list of values-- in this case, ['a','b','c','d']. So if 'd' doesn't exist in this dataframe, then I want it to return a count of zero:
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
d 0 0 0 0
The closest I've come is filtering the dataframe based on each value, and then getting the size of that dataframe:
for v in sorted(['a','b','c','d']):
    idx2 = (df['light'].isin([v]))
    df2 = df[idx2]
    print(df2.shape[0])
4
2
1
0
But that only returns counts from the 'light' column-- instead of a cross-tabulation of both columns.
Is there a way to make a pivot table, or a groupby() object, that groups things based on values in a list, rather than in a column in a dataframe? Or is there a better way to do this?

Try this:
df = pd.crosstab(df.light, df.injury,margins=True)
df
injury 1 2 4 5 All
light
a 0 2 1 1 4
b 1 0 0 1 2
c 0 0 0 1 1
All 1 2 1 3 7
df["All"]
light
a 4
b 2
c 1
All 7
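The crosstab above only contains the values of light that actually occur in the data. A sketch of one way to add the missing categories from the external list, using reindex with fill_value=0; the variable names wanted and table are illustrative:

```python
import pandas as pd

testdf = pd.DataFrame(
    {'light': ['b', 'b', 'c', 'a', 'a', 'a', 'a'],
     'injury': [1, 5, 5, 5, 2, 2, 4]},
    index=[1, 2, 3, 4, 5, 6, 9],
)

wanted = ['a', 'b', 'c', 'd']   # external list; 'd' is absent from the data

# count the combinations, then force one row per wanted value (0 if absent)
table = pd.crosstab(testdf['light'], testdf['injury']).reindex(wanted, fill_value=0)
print(table)
```

The same reindex call with columns=... would do the same for injury values that are missing from the data.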
