Get rows from Pandas DataFrame from index until condition - python

Let's say I have a Pandas DataFrame:
x = pd.DataFrame(data=[5,4,3,2,1,0,1,2,3,4,5],columns=['value'])
x
Out[9]:
value
0 5
1 4
2 3
3 2
4 1
5 0
6 1
7 2
8 3
9 4
10 5
Now, I want to, given an index, find rows in x until a condition is met.
For example, if index = 2:
x.loc[2]
Out[14]:
value 3
Name: 2, dtype: int64
Starting from that index, I want the following rows for as long as the value is greater than some threshold, up to and including the first row that fails the test. For example, if the threshold is 0, the result should be:
x
Out[9]:
value
2 3
3 2
4 1
5 0
How can I do this?
I have tried:
x.loc[2:x['value']>0,:]
But of course this will not work because x['value']>0 returns a boolean array of:
Out[20]:
0 True
1 True
2 True
3 True
4 True
5 False
6 True
7 True
8 True
9 True
10 True
Name: value, dtype: bool

Using idxmin and slicing
x.loc[2:x['value'].gt(0).idxmin(),:]
value
2 3
3 2
4 1
5 0
Edit:
For a general formula, use
index = 7
threshold = 2
x.loc[index:x.loc[index:,'value'].gt(threshold).idxmin(),:]
From your description in the comments, it seems you want to begin from index+1 rather than index. If that is the case, use
x.loc[index+1:x.loc[index+1:,'value'].gt(threshold).idxmin(),:]
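One edge case worth guarding against: if every row from the start label onward is above the threshold, gt(threshold) is all True and idxmin() returns the first label, so the slice collapses to a single row. Below is a minimal sketch of a wrapper that handles this. The name rows_until and its parameters are hypothetical, and it assumes a sorted, unique index as in the question.

```python
import pandas as pd

x = pd.DataFrame(data=[5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5], columns=['value'])

def rows_until(df, start, threshold, col='value'):
    """Rows from label `start` up to and including the first row whose
    value is not greater than `threshold` (hypothetical helper)."""
    above = df.loc[start:, col].gt(threshold)
    if above.all():
        # No row fails the test, so idxmin() would just return `start`;
        # return everything from `start` onward instead.
        return df.loc[start:]
    return df.loc[start:above.idxmin()]

print(rows_until(x, 2, 0))   # rows 2..5
print(rows_until(x, 6, 0))   # rows 6..10, the threshold is never reached
```

Whether "never reached" should return the whole tail or nothing is a design choice; the sketch returns the tail.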

You want to filter for index greater than or equal to your index=2, and for x['value']>=threshold, and then select the first n of these rows, which can be accomplished with .head(n).
Say:
idx = 2
threshold = 0
n = 4
x[(x.index>=idx) & (x['value']>=threshold)].head(n)
Out:
# value
# 2 3
# 3 2
# 4 1
# 5 0
Edit: changed to >=, and updated example to match OP's example.
Edit 2 due to clarification from OP: since n is unknown:
idx = 2
threshold = 0
x.loc[idx:(x['value']<=threshold).loc[x.index>=idx].idxmax()]
This is selecting from the starting idx, in this case idx=2, up to and including the first row where the condition is not met (in this case index 5).

Related

I have a dataframe where some index number are missing how do I cut dataframe previous to that missing index number

Index number 72 is missing from the original dataframe (shown in an image in the original post). I want to cut the dataframe like [0:71,:], so that whenever the index sequence breaks, the dataframe is automatically cut at the previous index value.
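Subtract the index from its shifted values and flag the positions where the difference is greater than 1, then reverse the order with [::-1] and apply Series.cummax so that every row up to the flagged position becomes True. Finally, filter with boolean indexing: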
df = pd.DataFrame({'a': range(3,13)}).drop(3)
print (df)
a
0 3
1 4
2 5
4 7
5 8
6 9
7 10
8 11
9 12
df = df[df.index.to_series().shift(-1, fill_value=0).sub(df.index).gt(1)[::-1].cummax()]
print (df)
a
0 3
1 4
2 5
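Tracing the one-liner above on other inputs suggests two limitations: when the index has no gap the mask is all False and the result is empty, and when there are several gaps the cut lands at the last one rather than the first. A minimal sketch of a variant based on diff and cumsum follows. The helper name before_first_gap is hypothetical, and it assumes an increasing integer index.

```python
import pandas as pd

def before_first_gap(df):
    """Keep the rows that come before the first break in an integer index
    (hypothetical helper, assumes the index is increasing)."""
    steps = df.index.to_series().diff()   # NaN for the first row, then step sizes
    breaks_seen = steps.gt(1).cumsum()    # stays 0 until the first step larger than 1
    return df[breaks_seen.eq(0)]

gapped = pd.DataFrame({'a': range(3, 13)}).drop(3)   # index 0,1,2,4,...,9
whole = pd.DataFrame({'a': range(3, 13)})            # index 0..9, no gap

print(before_first_gap(gapped))   # rows 0, 1, 2
print(before_first_gap(whole))    # all ten rows
```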
I came up with this:
df = pd.DataFrame({'col':[1,2,3,4,5,6,7,8,9]}, index=[-1,0,1,2,3,4,5,7,8])
ind = next((i for i in range(len(df)-1) if df.index[i]+1!=df.index[i+1]),len(df))+1
>>> df.iloc[:ind]
col
-1 1
0 2
1 3
2 4
3 5
4 6
5 7
With numpy, get the values that are equal to a normal range starting from the first index, up to the first mismatch (excluded):
df[np.minimum.accumulate(df.index==np.arange(df.index[0], df.index[0]+len(df)))]
Example:
col
-1 1
0 2
1 3
3 4
4 5
output:
col
-1 1
0 2
1 3

Python/Pandas: How to find the number of occurrences of a specific value in each column a data frame?

So I have a dataframe that has 300 columns and thousands of rows. Each column contains values between 0 and 30. I am trying to find the total number of times that the number 8 occurs in each column.
Ideally, I'm trying to create a list of length 300 with each index corresponding to the index of the column, with a value corresponding to the number of rows that contain the number 8 in that column.
Any help or guidance is appreciated, thank you.
You can use a boolean test and .sum()
>>> df
a b c d
0 8 6 7 8
1 8 8 7 6
2 1 2 3 4
>>> df == 8
a b c d
0 True False False True
1 True True False False
2 False False False False
>>> (df == 8).sum()
a 2
b 1
c 0
d 1
dtype: int64
Something like this should help:
df.isin([8]).sum(axis=0)
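Since the question asks for a plain list with one count per column, the Series returned by .sum() can be converted with .tolist(). A small sketch using the four-column frame from the first answer (the real data would have 300 columns):

```python
import pandas as pd

df = pd.DataFrame({'a': [8, 8, 1], 'b': [6, 8, 2], 'c': [7, 7, 3], 'd': [8, 6, 4]})

# One count per column, in column order
counts = (df == 8).sum().tolist()
print(counts)  # [2, 1, 0, 1]
```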

drop group by number of occurrence

Hi, I want to delete the rows whose entry occurs fewer than a given number of times, for example:
df = pd.DataFrame({'a': [1,2,3,2], 'b':[4,5,6,7], 'c':[0,1,3,2]})
df
a b c
0 1 4 0
1 2 5 1
2 3 6 3
3 2 7 2
Here I want to delete every row whose value in column 'a' occurs fewer than two times.
Wanted output:
a b c
1 2 5 1
3 2 7 2
What I know:
we can find the number of occurrences with condition = df['a'].value_counts() < 2, which gives something like:
2 False
3 True
1 True
Name: a, dtype: bool
But I don't know how I should approach from here to delete the rows.
Thanks in advance!
groupby + size
res = df[df.groupby('a')['b'].transform('size') >= 2]
The transform method maps df.groupby('a')['b'].size() to df aligned with df['a'].
value_counts + map
s = df['a'].value_counts()
res = df[df['a'].map(s) >= 2]
print(res)
a b c
1 2 5 1
3 2 7 2
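You can use df.where and then dropna. Note that df.where aligns the condition on the row index, so the counts must first be mapped back onto the rows. Passing df['a'].value_counts() < 2 directly only produces the wanted output here because the values in 'a' happen to coincide with the row labels to keep. The NaN fill also converts the columns to float:
df.where(df['a'].map(df['a'].value_counts()) >= 2).dropna()
a b c
1 2.0 5.0 1.0
3 2.0 7.0 2.0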
You could try something like this: get the length of each group, transform it back onto the original index, and index the df by it:
df[df.groupby("a").transform(len)["b"] >= 2]
a b c
1 2 5 1
3 2 7 2
Breaking it into individual steps you get:
df.groupby("a").transform(len)["b"]
0 1
1 2
2 1
3 2
Name: b, dtype: int64
These are the group sizes transformed back onto your original index
df.groupby("a").transform(len)["b"] >=2
0 False
1 True
2 False
3 True
Name: b, dtype: bool
We then turn this into the boolean index and index our original dataframe by it
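Another option, not mentioned in the answers above, is groupby(...).filter, which keeps or drops whole groups based on a function of each group. A minimal sketch on the question's data; min_count is a name introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 2], 'b': [4, 5, 6, 7], 'c': [0, 1, 3, 2]})
min_count = 2  # assumed parameter: minimum number of occurrences to keep

# Keep only the groups of 'a' that have at least min_count rows
res = df.groupby('a').filter(lambda g: len(g) >= min_count)
print(res)
```

This calls a Python function once per group, so on large frames with many groups the transform('size') or value_counts + map versions are likely to be faster.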

Pandas - Delete only contiguous rows that equal zero

I have a large time series df (2.5 million rows) that contains 0 values in some rows, some of which are legitimate. However, if there are repeated consecutive occurrences of zero values, I would like to remove them from my df.
Example:
Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9] I would like to remove the [0,0,0] and [0,0,0,0] from the middle and leave the remaining 0 to make a new df [1,2,3,0,4,5,1,2,3,0,8,8,9].
The run length of zeros that triggers deletion should be a parameter that can be set; in this case, runs longer than 2.
Is there a clever way to do this in pandas?
It looks like you want to remove the row if it is 0 and either previous or next row in same column is 0. You can use shift to look for previous and next value and compare with current value as below:
result_df = df[~(((df.ColA.shift(-1) == 0) & (df.ColA == 0)) | ((df.ColA.shift(1) == 0) & (df.ColA == 0)))]
print(result_df)
Result:
ColA
0 1
1 2
2 3
3 0
4 4
5 5
9 1
10 2
11 3
12 0
13 8
14 8
19 9
Update for more than 2 consecutive zeros
Following the linked example, add a new column that tracks the length of each consecutive run, then filter on it:
# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
df[~((df.consecutive > 2) & (df.ColA == 0))]  # drop zeros that sit in runs longer than 2
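For reference, here is a self-contained sketch of the run-length approach applied to the data from the question, with the run length exposed as a parameter. The name max_run is introduced here for illustration and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'ColA': [1, 2, 3, 0, 4, 5, 0, 0, 0, 1, 2, 3, 0, 8, 8, 0, 0, 0, 0, 9]})
max_run = 2  # assumed parameter: zero runs longer than this are dropped

# A new run id starts wherever the value differs from the previous row
run_id = df['ColA'].ne(df['ColA'].shift()).cumsum()
# Length of the run each row belongs to, aligned with the original index
run_len = df['ColA'].groupby(run_id).transform('size')

result = df[~(df['ColA'].eq(0) & run_len.gt(max_run))]
print(result['ColA'].tolist())
# [1, 2, 3, 0, 4, 5, 1, 2, 3, 0, 8, 8, 9]
```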
We need to build a new helper column here, then use drop_duplicates:
df['New']=df.A.eq(0).astype(int).diff().ne(0).cumsum()
s=pd.concat([df.loc[df.A.ne(0),:],df.loc[df.A.eq(0),:].drop_duplicates(keep=False)]).sort_index()
s
Out[190]:
A New
0 1 1
1 2 1
2 3 1
3 0 2
4 4 3
5 5 3
9 1 5
10 2 5
11 3 5
12 0 6
13 8 7
14 8 7
19 9 9
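Explanation:
# df.A.eq(0) flags the values equal to 0
# .astype(int).diff().ne(0).cumsum() starts a new group id wherever that flag changes, so each run of consecutive zeros shares one id
# drop_duplicates(keep=False) then removes every zero row whose (A, New) pair repeats, i.e. zeros in runs of two or more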

Groupby Column if Condition is fulfilled

I have the following dataframe in pandas:
a b
0 0 0
1 1 1
2 2 0
3 3 0
4 4 1
I want to group by column b (as in groupby('b')), but only if simultaneously the values of column a are consecutive (monotonically increasing). E.g. the output should be:
Group 1: Row 0
Group 2: Row 1
Group 3: Row 2, 3
Group 4: Row 4
How can I do that?
Thanks!
IIUC, construct temporary series based on your conditions -
i = df.a.eq(df.a.shift() + 1) # True where a is exactly one more than in the previous row
j = df.b.ne(df.b.shift()).cumsum() # run id for equal consecutive values in b
Now, call groupby -
for _, g in df.groupby([i, j]):
    print(g, '\n')
a b
0 0 0
a b
1 1 1
a b
2 2 0
3 3 0
a b
4 4 1
Details
i is a boolean Series that says whether each value in a is exactly one greater than the value in the row above.
i
0 False
1 True
2 True
3 True
4 True
Name: a, dtype: bool
j is a series that designates groups for consecutive values in df.b.
j
0 1
1 2
2 3
3 3
4 4
Name: b, dtype: int64
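One caveat: grouping on [i, j] gives the expected result for this data, but i is only a True/False flag, so two separate consecutive stretches of a that fall inside the same run of b would end up in the same group. A minimal sketch of a single-key variant that starts a new group whenever either condition breaks; this is an alternative formulation, not the answerer's code:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [0, 1, 0, 0, 1]})

# A row starts a new group if a is not previous+1 OR b differs from the previous row
new_group = df['a'].ne(df['a'].shift() + 1) | df['b'].ne(df['b'].shift())
key = new_group.cumsum()

groups = [g.index.tolist() for _, g in df.groupby(key)]
print(groups)  # [[0], [1], [2, 3], [4]]
```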
