Finding consecutive numbers in a column and printing the corresponding value from another column - python

I want to find runs of consecutive numbers in ColA, then select the middle number of each consecutive sequence and print the value corresponding to it in column Freq1.
This code does not print any value (col is a scalar here, so col == col + 1 can never be True):
for col in df.ColA:
    if col == col + 1 and col + 1 == col + 2:
        print(col)
Can anyone suggest an approach?
ColA Freq1
4 0
5 100
6 200
18 5
19 600
20 700

This will return the rows you are looking for:
Basically, keep the rows where ColA is equal to the previous row's value + 1 and the next row's value - 1.
This of course operates under the assumption that there are always exactly 3 consecutive numbers.
df.loc[(df['ColA'].eq(df['ColA'].shift(1).add(1))) & (df['ColA'].eq(df['ColA'].shift(-1).sub(1)))]
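Applied to the sample data, this should keep only the middle rows (a quick check, assuming the sample frame from the question):
import pandas as pd

df = pd.DataFrame({'ColA': [4, 5, 6, 18, 19, 20],
                   'Freq1': [0, 100, 200, 5, 600, 700]})

# keep rows whose ColA equals previous + 1 AND next - 1
mid = df.loc[df['ColA'].eq(df['ColA'].shift(1).add(1))
             & df['ColA'].eq(df['ColA'].shift(-1).sub(1))]
print(mid)
#    ColA  Freq1
# 1     5    100
# 4    19    600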

You can use a custom grouper to identify the stretches of contiguous values, then use the size of each group to locate its middle value:
# find consecutive values
group = df['ColA'].diff().ne(1).cumsum()
# make grouper
g = df.groupby(group)['ColA']
# get index of row in group
# compare to group size to find mid-point
m = g.cumcount().eq(g.transform('size').floordiv(2))
# perform boolean indexing
out = df.loc[m]
output:
ColA Freq1
1 5 100
4 19 600
intermediates:
ColA Freq1 diff group cumcount size size//2 m
0 4 0 NaN 1 0 3 1 False
1 5 100 1.0 1 1 3 1 True
2 6 200 1.0 1 2 3 1 False
3 18 5 12.0 2 0 3 1 False
4 19 600 1.0 2 1 3 1 True
5 20 700 1.0 2 2 3 1 False
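To get just the Freq1 values the question asks for, the same boolean mask can index that column directly (a short follow-up, reusing m from above):
print(df.loc[m, 'Freq1'].tolist())  # [100, 600]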

Related

I have a dataframe where some index numbers are missing; how do I cut the dataframe before the missing index number?

Index number 72 is missing from the original dataframe. I want to cut the dataframe like [0:71, :], with the condition that whenever the index sequence breaks, the dataframe is automatically cut at the previous index value.
Compare the index values shifted by -1 with the original values: a difference greater than 1 marks the break. Invert the ordering with [::-1], propagate the break backwards with Series.cummax, and finally filter with boolean indexing:
df = pd.DataFrame({'a': range(3,13)}).drop(3)
print (df)
a
0 3
1 4
2 5
4 7
5 8
6 9
7 10
8 11
9 12
df = df[df.index.to_series().shift(-1, fill_value=0).sub(df.index).gt(1)[::-1].cummax()]
print (df)
a
0 3
1 4
2 5
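To see why this works, the intermediate mask can be inspected step by step (rebuilding the original frame, since df was overwritten above):
import pandas as pd

df = pd.DataFrame({'a': range(3, 13)}).drop(3)

# next index label minus current label; a gap > 1 marks the break
breaks = df.index.to_series().shift(-1, fill_value=0).sub(df.index).gt(1)
print(breaks.tolist())
# [False, False, True, False, False, False, False, False, False]

# reversed cummax marks the break row and everything before it;
# boolean indexing aligns by label, so the reversed order does not matter
print(breaks[::-1].cummax().sort_index().tolist())
# [True, True, True, False, False, False, False, False, False]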
I came up with this:
df = pd.DataFrame({'col':[1,2,3,4,5,6,7,8,9]}, index=[-1,0,1,2,3,4,5,7,8])
ind = next((i for i in range(len(df)-1) if df.index[i]+1 != df.index[i+1]), len(df)) + 1  # position of the first break in the index (default: no break), +1 to make it a slice end
>>> df.iloc[:ind]
col
-1 1
0 2
1 3
2 4
3 5
4 6
5 7
With numpy, keep the values whose index matches a regular range starting from the first index, up to the first mismatch (excluded):
df[np.minimum.accumulate(df.index==np.arange(df.index[0], df.index[0]+len(df)))]
Example:
col
-1 1
0 2
1 3
3 4
4 5
output:
col
-1 1
0 2
1 3
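A self-contained version of the numpy approach (a sketch, using the example frame above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4, 5]}, index=[-1, 0, 1, 3, 4])

# True while the index matches a contiguous range from the first label;
# minimum.accumulate turns everything after the first mismatch False
mask = np.minimum.accumulate(df.index == np.arange(df.index[0], df.index[0] + len(df)))
print(df[mask])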

Count the number of consecutive rows that are greater than the current row value but less than the value from another column

Say I have the following sample dataframe (there are about 25k rows in the real dataframe)
df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
A B
0 0 9
1 3 8
2 2 3
3 9 5
4 1 5
5 0 5
6 4 5
7 7 8
8 3 0
9 2 4
For column A, I need to know how many of the following and preceding rows have an A value that is greater than the current row's A value but less than their own B value.
So my expected output is:
A B next count previous count
0 9 2 0
3 8 0 0
2 3 0 1
9 5 0 0
1 5 0 0
0 5 2 1
4 5 1 0
7 8 0 0
3 0 0 2
2 4 0 0
Explanation:
First row: the next count is 2, since the next A values 3 and 2 are greater than 0 and less than their corresponding B values 8 and 3.
Second row: the next count is 0, since the next A value 2 is not greater than 3.
Third row: the next count is 0, since 9 is greater than 2 but not less than its corresponding B value 5.
The previous counts are calculated similarly.
Note: I know how to solve this problem by looping with a list comprehension or using the pandas apply method, but I wouldn't mind a clear and concise apply approach. Ideally I was looking for a more idiomatic, vectorized pandas approach.
My Solution
Here is the apply solution, which I think is inefficient. Also, as people have said, there might be no vectorized solution for this question, so a more efficient apply solution will be accepted. This is what I have tried.
This function gets the number of previous/next rows that satisfy the condition:
def get_prev_next_count(row):
    # rows after the current one, and rows before it in reverse order
    next_nrow = df.loc[row['index']+1:, ['A', 'B']]
    prev_nrow = df.loc[:row['index']-1, ['A', 'B']][::-1]
    if next_nrow.size == 0:
        return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
    if prev_nrow.size == 0:
        return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
    # argmin on the boolean mask returns the position of the first False,
    # i.e. the length of the initial run of rows satisfying the condition
    return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(),
            ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())
Generating the output:
df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")
Output :
This gives us the expected output
df
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
I made some optimizations:
You don't need reset_index(); you can access the index with .name.
If you only pass df[['A']] instead of the whole frame, that may help.
prev_nrow.empty is the same as prev_nrow.size == 0.
Applied different logic to get the desired value via first_false; this speeds things up significantly.
def first_false(val1, val2, A):
    # count the leading rows where A < x < y holds; stop at the first failure
    i = 0
    for x, y in zip(val1, val2):
        if A < x < y:
            i += 1
        else:
            break
    return i

def get_prev_next_count(row):
    A = row['A']
    next_nrow = df.loc[row.name+1:, ['A', 'B']]
    prev_nrow = df2.loc[row.name-1:, ['A', 'B']]
    if next_nrow.empty:
        return 0, first_false(prev_nrow.A, prev_nrow.B, A)
    if prev_nrow.empty:
        return first_false(next_nrow.A, next_nrow.B, A), 0
    return (first_false(next_nrow.A, next_nrow.B, A),
            first_false(prev_nrow.A, prev_nrow.B, A))

df2 = df[::-1].copy()  # shave a tiny bit of time by only reversing it once~
df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
print(df)
Output:
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
Timing
Expanding the data:
df = pd.concat([df]*(10000//4), ignore_index=True)
# df.shape == (25000, 2)
Original Method:
Gave up at 15 minutes.
New Method:
1m 20sec
Throw pandarallel at it:
from pandarallel import pandarallel
pandarallel.initialize()
df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')
26sec

How to create a new column based on a condition in another column

In pandas, how can I create a new column B based on a column A in df, such that:
B = 1 if A_(i+1) - A_(i) > 5 or A_(i) <= 10
B = 0 if A_(i+1) - A_(i) <= 5
However, the first B_i value is always 1.
Example:
A    B
5    1 (the first B_i)
12   1
14   0
22   1
20   0
33   1
Use diff with a comparison to your threshold via le, then invert and convert the boolean to int:
N = 5
df['B'] = (~df['A'].diff().le(N)).astype(int)
NB: using a le(5) comparison with inversion (rather than gt(5)) is what yields 1 for the first value, since the NaN produced by diff() compares as False.
output:
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
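A quick illustration of why the first row comes out as 1: diff() produces NaN in the first position, and comparisons with NaN are False, so the inversion turns it into True:
import pandas as pd

s = pd.Series([5, 12, 14])
print(s.diff().tolist())           # [nan, 7.0, 2.0]
print(s.diff().le(5).tolist())     # [False, False, True]  (NaN compares False)
print((~s.diff().le(5)).tolist())  # [True, True, False]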
Updated answer: simply combine a second condition with OR (|):
df['B'] = (~df['A'].diff().le(5)|df['A'].lt(10)).astype(int)
output: same as above with the provided data
I was a little confused by your row numbering, because if we calculate B_i based on the condition A_(i+1) - A_(i), the missing value should be on the last row instead of the first (the first row has both A_(i) and A_(i+1), while the last row is missing A_(i+1)).
Anyway, based on your example I assumed that we calculate B_(i+1).
import pandas as pd
df = pd.DataFrame(columns=["A"],data=[5,12,14,22,20,33])
df['shifted_A'] = df['A'].shift(1)  # this row can be removed - added only to show how shift works on the final dataframe
df['B'] = ''
df.loc[((df['A'] - df['A'].shift(1)) > 5) + (df['A'].shift(1) <= 10), 'B'] = 1  # update rows that fulfill one of the conditions with 1
df.loc[(df['A'] - df['A'].shift(1)) <= 5, 'B'] = 0  # update rows that fulfill the condition with 0
df.loc[df.index == 0, 'B'] = 1  # update the first row in column B
print(df)
That prints:
A shifted_A B
0 5 NaN 1
1 12 5.0 1
2 14 12.0 0
3 22 14.0 1
4 20 22.0 0
5 33 20.0 1
I am not sure if it is the fastest way, but I guess it is one of the easier ones to understand.
A little explanation:
df.loc[mask, columnname] = newvalue allows us to update a value in the given column where the condition (mask) is fulfilled.
((df['A'] - df['A'].shift(1)) > 5) + (df['A'].shift(1) <= 10)
Each condition here returns True or False. If we add them, the result is True if either one is True (which is simply OR). If we needed AND, we could multiply the conditions instead.
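A tiny demonstration of + acting as element-wise OR and * as element-wise AND on boolean Series:
import pandas as pd

a = pd.Series([True, False, False])
b = pd.Series([False, False, True])
print((a + b).tolist())  # [True, False, True]   -> behaves like OR
print((a * b).tolist())  # [False, False, False] -> behaves like AND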
Use Series.diff and replace the first missing value with N, so the subsequent greater-or-equal comparison with Series.ge returns True for the first row:
N = 5
df['B'] = (df.A.diff().fillna(N).ge(N) | df.A.lt(10)).astype(int)
print (df)
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
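The fillna(N) step is what makes the first row compare True (step by step, on the sample A column):
import pandas as pd

s = pd.Series([5, 12, 14, 22, 20, 33])
print(s.diff().fillna(5).tolist())        # [5.0, 7.0, 2.0, 8.0, -2.0, 13.0]
print(s.diff().fillna(5).ge(5).tolist())  # [True, True, False, True, False, True]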

Pandas: return the occurrences of the most frequent value for each group (possibly without apply)

Let's assume the input dataset:
test1 = [[0,7,50], [0,3,51], [0,3,45], [1,5,50],[1,0,50],[2,6,50]]
df_test = pd.DataFrame(test1, columns=['A','B','C'])
that corresponds to:
A B C
0 0 7 50
1 0 3 51
2 0 3 45
3 1 5 50
4 1 0 50
5 2 6 50
I would like to obtain a dataset grouped by 'A', together with the most common value of 'B' in each group and the number of occurrences of that value:
A most_freq freq
0 3 2
1 5 1
2 6 1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems with the last column.
Also: is there a faster way that doesn't involve apply? This solution doesn't scale well with larger inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts, which sorts by count in descending order by default, then add DataFrame.drop_duplicates to keep the top value per group after Series.reset_index:
df = (df_test.groupby('A')['B']
             .value_counts()
             .rename_axis(['A', 'most_freq'])
             .reset_index(name='freq')
             .drop_duplicates('A'))
print (df)
A most_freq freq
0 0 3 2
2 1 0 1
4 2 6 1
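Note that for A=1 both B values occur only once, so which one is kept depends on the tie order inside value_counts, which is not guaranteed among equal counts; that is why the output above shows 0 rather than the 5 from the question's expected output. The intermediate result can be inspected directly (on the sample data):
counts = df_test.groupby('A')['B'].value_counts()
print(counts)
# A  B
# 0  3    2
#    7    1
# 1  0    1   <- tie with B=5; ordering among equal counts is not guaranteed
#    5    1
# 2  6    1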

Pandas - Delete only contiguous rows that equal zero

I have a large time series df (2.5 million rows) that contains 0 values in a given column, some of which are legitimate. However, if there are repeated contiguous occurrences of zero values, I would like to remove them from my df.
Example:
Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9]. I would like to remove the [0,0,0] and [0,0,0,0] runs from the middle and leave the remaining 0s, giving a new df [1,2,3,0,4,5,1,2,3,0,8,8,9].
The length of a zero run that triggers deletion should be a settable parameter - in this case > 2.
Is there a clever way to do this in pandas?
It looks like you want to remove a row if it is 0 and either the previous or the next row in the same column is also 0. You can use shift to look at the previous and next values and compare them with the current value, as below:
result_df = df[~(((df.ColA.shift(-1) == 0) & (df.ColA == 0)) | ((df.ColA.shift(1) == 0) & (df.ColA == 0)))]
print(result_df)
Result:
ColA
0 1
1 2
2 3
3 0
4 4
5 5
9 1
10 2
11 3
12 0
13 8
14 8
19 9
Update for more than 2 consecutive zeros
Following the example in the linked answer, add a new column tracking the size of each run of consecutive values, then check it to filter:
# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
df[~((df.consecutive > 2) & (df.ColA == 0))]  # threshold parameter: remove runs of zeros longer than 2
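For completeness, here is a sketch of the same idea with the run-length threshold as an explicit parameter, reproducing the desired output from the question:
import pandas as pd

df = pd.DataFrame({'ColA': [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9]})
n = 2  # remove runs of zeros longer than n

# label each run of identical consecutive values, then measure its size
run = (df.ColA != df.ColA.shift()).cumsum()
size = df.ColA.groupby(run).transform('size')
out = df[~((df.ColA == 0) & (size > n))]
print(out.ColA.tolist())  # [1, 2, 3, 0, 4, 5, 1, 2, 3, 0, 8, 8, 9]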
We need to build a new grouping parameter here, then use drop_duplicates:
df['New'] = df.A.eq(0).astype(int).diff().ne(0).cumsum()
s = pd.concat([df.loc[df.A.ne(0), :], df.loc[df.A.eq(0), :].drop_duplicates(keep=False)]).sort_index()
s
Out[190]:
A New
0 1 1
1 2 1
2 3 1
3 0 2
4 4 3
5 5 3
9 1 5
10 2 5
11 3 5
12 0 6
13 8 7
14 8 7
19 9 9
Explanation:
# df.A.eq(0): find the values equal to 0
# diff().ne(0).cumsum(): start a new group whenever the eq(0) flag changes, so consecutive zeros share the same group number
