Count of values between other values in a pandas DataFrame - python

I have a column of a pandas DataFrame that looks like this:
1 False
2 False
3 False
4 True
5 True
6 False
7 False
8 False
9 False
10 False
11 True
12 False
I would like to get the count of Falses between the Trues. Something like this:
1 3
2 0
3 5
4 1
This is what I've done:
counts = []
count = 0
for k in df['result'].index:
    if df['result'].loc[k] == False:
        count += 1
    else:
        counts.append(count)
        count = 0
where counts would be the result. Is there a simpler way?

Group the series by its own cumulative sum, then count the False values in each group with sum:
s = pd.Series([False, False, False, True, True, False, False, False, False, False, True, False])
(~s).groupby(s.cumsum()).sum()
#0 3.0
#1 0.0
#2 5.0
#3 1.0
#dtype: float64
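The cumulative sum increases by one at every True, so each False inherits the label of the run it follows, and summing the inverted series per label counts the Falses in each gap. If you prefer integer counts over the float output above, a cast works (a minimal sketch of the same idea):

import pandas as pd

s = pd.Series([False, False, False, True, True, False,
               False, False, False, False, True, False])

# s.cumsum() labels each element with the number of Trues seen so far,
# so every False shares a label with the True that precedes it
counts = (~s).groupby(s.cumsum()).sum().astype(int)
print(counts.tolist())  # [3, 0, 5, 1]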

You can use the groupby function from the itertools module to group the consecutive False and True values together and append each count to a list.
from itertools import groupby

s = pd.Series([False, False, False, True, True, False, False,
               False, False, False, True, False],
              index=range(1, 13))

out = []
for v, g in groupby(s):
    if not v:  # v is False: record the length of this run of Falses
        out.append(len(tuple(g)))
    else:      # v is True: n consecutive Trues have n-1 empty gaps between them
        out.extend([0] * (len(tuple(g)) - 1))
out
[3, 0, 5, 1]

Related

Transform false in between trues

Hello, I have a dataframe like the following one:
df = pd.DataFrame({"a": [True, True, False, True, True], "b": [True, True, False, False, True]})
df
I would like to be able to transform the False values in between Trues to obtain a result like this (depending on a threshold).
# Threshold = 1
df = pd.DataFrame({"a": [True, True, True, True, True], "b": [True, True, False, False, True]})
df
# Threshold = 2
df = pd.DataFrame({"a": [True, True, True, True, True], "b": [True, True, True, True, True]})
df
Any suggestions to do this apart from a for loop?
Edit: The threshold value defines how many consecutive Falses you will take into account to do the transformation.
Edit 2: At the beginning and end of the array you should not consider any special case.
To replace groups of Falses whose length is at most the Threshold value: first label the separate groups with DataFrame.cumsum masked by DataFrame.mask, count each group's size with Series.map and Series.value_counts, and finally compare with DataFrame.le and pass the result to DataFrame.mask:
Threshold = 1
m = df.cumsum().mask(df).apply(lambda x: x.map(x.value_counts())).le(Threshold)
df = df.mask(m, True)
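To see why this works, it helps to print the intermediate steps on the sample frame (a sketch; the variable names here are only illustrative):

import pandas as pd

df = pd.DataFrame({"a": [True, True, False, True, True],
                   "b": [True, True, False, False, True]})

labels = df.cumsum().mask(df)   # NaN where True; each False keeps its group id
sizes = labels.apply(lambda x: x.map(x.value_counts()))  # size of each False group
print(sizes)
#      a    b
# 0  NaN  NaN
# 1  NaN  NaN
# 2  1.0  2.0
# 3  NaN  2.0
# 4  NaN  NaN

With Threshold = 1, only the single False in column a has a group size of at most 1, so only that cell is flipped to True.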
If the first or last groups of Falses should not be replaced:
df = pd.DataFrame({"a": [False, False, True, False, True, False],
"b": [True, True, False, False, True, True]})
print (df)
a b
0 False True
1 False True
2 True False
3 False False
4 True True
5 False True
Threshold = 1
df1 = df.cumsum().mask(df)
m1 = df1.apply(lambda x: x.map(x.value_counts())).le(Threshold)
m2 = df1.ne(df1.iloc[0]) & df1.ne(df1.iloc[-1])
df = df.mask(m1 & m2, True)
print (df)
a b
0 False True
1 False True
2 True False
3 True False
4 True True
5 False True
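The m2 mask is what protects the boundary groups: a group whose label equals the label in the first or last row touches an edge of the frame. A small sketch on column a of the sample above (illustrative only):

import pandas as pd

df = pd.DataFrame({"a": [False, False, True, False, True, False],
                   "b": [True, True, False, False, True, True]})

df1 = df.cumsum().mask(df)
print(df1['a'].tolist())   # [0.0, 0.0, nan, 1.0, nan, 2.0]

# a group labelled like the first or last row touches a boundary;
# NaN compares unequal to everything, so True positions stay True here
m2 = df1.ne(df1.iloc[0]) & df1.ne(df1.iloc[-1])
print(m2['a'].tolist())    # [False, False, True, True, True, False]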
One way would be to use itertools.groupby to generate counts of each group of adjacent items, though sadly it does involve a couple of loops:
from itertools import groupby

def how_many_identical_elements(itter):
    # for each run of identical values, repeat the run's length once per element
    return sum([[x]*x for x in [len(list(v)) for g, v in groupby(itter)]], [])

def fill_up_df(df, th):
    df = df.copy()
    for c in df.columns:
        df[f'{c}_count'] = how_many_identical_elements(df[c].values)
        df[c] = [False if x[0] == False and x[1] > th else True
                 for x in zip(df[c], df[f'{c}_count'])]
    return df[[c for c in df.columns if 'count' not in c]]
then
fill_up_df(df, 1)

      a      b
0  True   True
1  True   True
2  True  False
3  True  False
4  True   True

fill_up_df(df, 2)

      a     b
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
This code looks from -threshold to +threshold on a column-by-column basis and ORs the results together to create a masking dataframe that meets your criteria. The last line is just the logical OR of your original data and the new mask, as we only need to fill False values. It should be one of the faster solutions if speed is an issue.
from functools import reduce

threshold = 2
filling_mask = reduce(
    lambda x, y: x | y,
    (
        df.shift(-i, fill_value=True) & df.shift(i, fill_value=True)
        for i in range(1, threshold + 1)
    )
)
df |= filling_mask
Threshold 1:
>>> df # Threshold 1
a b
0 True True
1 True True
2 True False
3 True False
4 True True
Threshold 2:
>>> df # Threshold 2
a b
0 True True
1 True True
2 True True
3 True True
4 True True
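The intuition behind the shifts: a False at position j can be filled when, for some i up to the threshold, the values i rows above and i rows below are both True, bracketing a short run of Falses. A one-column sketch of the threshold = 1 case:

import pandas as pd

s = pd.Series([True, True, False, True, True])

# fill_value=True makes positions shifted in from outside the frame permissive
bracketed = s.shift(-1, fill_value=True) & s.shift(1, fill_value=True)
print((s | bracketed).tolist())  # [True, True, True, True, True]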

What is the best way to loop through a Pandas dataframe, employing a sequentially counted value in each row where a condition is true?

Business Problem: For each row in a Pandas data frame where condition is true, set value in a column. When successive rows meet condition, then increase the value by one. The end goal is to create a column containing integers (e.g., 1, 2, 3, 4, ... , n) upon which a pivot table can be made. As a side note, there will be a second index upon which the pivot will be made.
Below is my attempt, but I'm new to using Pandas.
sales_data_cleansed_2.loc[sales_data_cleansed_2['Duplicate'] == 'FALSE', 'sales_index'] = 1
j = 2

# loop through whether duplicate exists.
for i in range(0, len(sales_data_cleansed_2)):
    while sales_data_cleansed_2.loc[i, 'Duplicate'] == 'TRUE':
        sales_data_cleansed_2.loc[i, 'sales_index'] = j
        j = j + 1
        break
    j = 2
You can try:
import numpy as np
import pandas as pd

# sample DataFrame
df = pd.DataFrame(np.random.randint(0, 2, 15).astype(str), columns=["Duplicate"])
df = df.replace({'1': 'TRUE', '0': 'FALSE'})

df['sales_index'] = ((df['Duplicate'] == 'TRUE')
                     .groupby((df['Duplicate'] != 'TRUE')
                     .cumsum()).cumsum() + 1)
print(df)
This gives:
Duplicate sales_index
0 FALSE 1
1 FALSE 1
2 TRUE 2
3 TRUE 3
4 TRUE 4
5 TRUE 5
6 TRUE 6
7 TRUE 7
8 TRUE 8
9 FALSE 1
10 FALSE 1
11 TRUE 2
12 TRUE 3
13 TRUE 4
14 FALSE 1
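Unpacking the one-liner may help: the inner cumsum assigns a new group id at every FALSE row, and the outer cumsum then counts the TRUE rows within each group, restarting after each FALSE. A sketch with hypothetical intermediate names:

is_true = df['Duplicate'] == 'TRUE'

group_id = (~is_true).cumsum()                      # increments at every FALSE row
run_position = is_true.groupby(group_id).cumsum()   # 0 for the FALSE, then 1, 2, ... for TRUEs

df['sales_index'] = run_position + 1                # FALSE rows get 1, TRUE runs get 2, 3, ...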

Drop rows from pandas DataFrame based on alternating columns

I'm attempting to remove all rows in a dataframe between an entry and exit point in a time series of price data, based on boolean Entry and Exit columns.
data = {'Entry': [True, True, True, False, False, False, False, True, False, False, False],
        'Exit': [False, False, True, False, False, True, True, False, False, False, True]}
df = pd.DataFrame(data)
Entry Exit
0 True False
1 True False
2 True True
3 False False
4 False False
5 False True
6 False True
7 True False
8 False False
9 False False
10 False True
So given the above I want to be left with
Entry Exit
0 True False
2 True True
7 True False
10 False True
I need to get the first True from the Entry column, then the following True in the Exit column, followed by the next True in the Entry column and so on.
You can do it the old-fashioned way using zip:
df = pd.DataFrame(data)

group = None
idx = []
for num, (a, b) in enumerate(zip(df["Entry"], df["Exit"])):
    if a is True and not group:
        idx.append(num)
        group = True
    if b is True and group:
        if idx[-1] != num:
            idx.append(num)
        group = False

print(idx)  # [0, 2, 7, 10]
print(df.loc[idx])
Entry Exit
0 True False
2 True True
7 True False
10 False True
Try this:
entry = df[df['Entry']]
exit = df[df['Exit']]
idx = []
pos = 0
for i in range(entry.shape[0]):
    if i % 2 == 0:
        idx.append([entry.iloc[pos][0], entry.iloc[pos][1]])
    else:
        idx.append([exit.iloc[pos][0], exit.iloc[pos][1]])
        pos += 1

Hope this helps!

Get the index name AND column name for each cell in pandas dataframe

I have a dataframe populated with True/False values. I want to iterate through each cell of the dataframe and if the value is True, return the index name and column name at that cell location (or else, their row and column index).
I thought there might be a quick pandas way using dataframe.applymap with a function that returns the col, row of the cell, but I don't know how to call the index and column name for a particular cell. Basically it's like looking for .iloc[] in reverse.
Or else I'd be happy with getting back a separate dataframe containing only the True values.
You could try melting (pd.melt) your dataframe while keeping the index as an id variable:
df = pd.DataFrame(
    np.random.choice([True, False], (5, 5)), columns=list("abcde"), index=list("fghij")
)
# a b c d e
# f False True True True False
# g True True False False True
# h False True True False False
# i True True False False False
# j True True False True True
df.reset_index().melt(id_vars='index').query('value == True')
which outputs:
index variable value
1 g a True
3 i a True
4 j a True
5 f b True
6 g b True
7 h b True
8 i b True
9 j b True
10 f c True
12 h c True
15 f d True
19 j d True
21 g e True
24 j e True
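As an alternative sketch that answers the literal question, keeping only the True cells with where and stacking drops everything else, leaving a MultiIndex of (index, column) pairs (this swaps stack in for melt; the explicit dropna guards against pandas versions where stack keeps NaN):

# where() turns False cells into NaN; after stacking and dropping NaN,
# the remaining MultiIndex is exactly the (row, column) of each True cell
true_cells = df.where(df).stack().dropna().index.tolist()
print(true_cells)  # e.g. [('f', 'b'), ('f', 'c'), ('g', 'a'), ...]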

Part II: Counting how many times in a row the result of a sum is positive (or negative)

Second part. The first part can be found here: Click me
Hi all, I have been practising with the gg function that you guys helped me create -- see part one. Now, I realized that the outputs of the function are not unique series but overlapping counts: for instance, a series of 3 positives in a row is also shown as 2 series of two positives in a row and as 3 single positives.
Let's say I got this:
df = pd.DataFrame(np.random.rand(15, 2), columns=["open", "close"])
df['test'] = df.close-df.open > 0
open close test
0 0.769829 0.261478 False
1 0.770246 0.128516 False
2 0.266448 0.346099 True
3 0.302941 0.065790 False
4 0.747712 0.730082 False
5 0.382923 0.751792 True
6 0.028505 0.083543 True
7 0.137558 0.243148 True
8 0.456349 0.649780 True
9 0.041046 0.163488 True
10 0.291495 0.617486 True
11 0.046561 0.038747 False
12 0.782994 0.150166 False
13 0.435168 0.080925 False
14 0.679253 0.478050 False
df.test
Out[113]:
0 False
1 False
2 True
3 False
4 False
5 True
6 True
7 True
8 True
9 True
10 True
11 False
12 False
13 False
14 False
As output, I would like the number of unique runs of Trues of each length; something like:
1: 1
2: 0
3: 0
4: 0
5: 0
6: 1
7: 0
8: 0
What I've tried so far:
(green.rolling(x).sum() > x-1).sum()  # gives me how many times there is a series of x Trues in a row; yet, this is not unique, as explained beforehand
However, I do not feel the rolling is the solution over here...
Thank you again for your help,
CronosVirus00
What you are looking for are the groupby function from itertools and the Counter class from collections. Here is how to achieve what you want:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(15, 2), columns=["open", "close"])
df['test'] = df.close - df.open > 0

from itertools import groupby
from collections import Counter

# we group each sequence of True and False
seq_len = [(k, len(list(g))) for k, g in groupby(list(df['test']))]
# we filter to keep only the True sequence lengths
true_seq_len = [n for k, n in seq_len if k == True]
# we count each length
true_seq_count = Counter(true_seq_len)
Output :
>>> print(df['test'])
0 True
1 True
2 False
3 True
4 True
5 False
6 True
7 False
8 True
9 True
10 True
11 True
12 False
13 False
14 True
>>> print(seq_len)
[(True, 2), (False, 1), (True, 2), (False, 1), (True, 1), (False, 1), (True, 4), (False, 2), (True, 1)]
>>> print(true_seq_count)
Counter({1: 2, 2: 2, 4: 1})
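If a pure-pandas version is preferred, the usual shift/cumsum run-labelling trick yields the same counts (a sketch; the names are illustrative):

s = df['test']

run_id = (s != s.shift()).cumsum()              # new id whenever the value changes
runs = s.groupby(run_id).agg(value='first', length='size')

# value_counts over the lengths of the True runs only
true_run_counts = runs.loc[runs['value'], 'length'].value_counts()
print(true_run_counts.to_dict())  # e.g. {2: 2, 1: 2, 4: 1}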
