How to Detect a Streak of Certain Values in a DataFrame? - python

In a pandas DataFrame, I want to detect the start and end positions of each block of consecutive False values in a column. If a block contains just one False, I would like to get that single position.
Example:
df = pd.DataFrame({"a": [True, True, True,False,False,False,True,False,True],})
In [110]: df
Out[110]:
a
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 False
8 True
In this example, I would like to get the positions
`3`, `5`
and
`7`, `7`.

Use:
a = (df.a.cumsum()[~df.a]
       .reset_index()
       .groupby('a')['index']
       .agg(['first','last'])
       .values
       .tolist())
print(a)
[[3, 5], [7, 7]]
Explanation:
First get the cumulative sum with cumsum; this assigns a unique group number to each block of consecutive False values (the count of True values seen so far):
print (df.a.cumsum())
0 1
1 2
2 3
3 3
4 3
5 3
6 4
7 4
8 5
Name: a, dtype: int32
Filter only the False rows by boolean indexing with the inverted boolean column:
print (df.a.cumsum()[~df.a])
3 3
4 3
5 3
7 4
Name: a, dtype: int32
Create a column from the index with reset_index:
print (df.a.cumsum()[~df.a].reset_index())
index a
0 3 3
1 4 3
2 5 3
3 7 4
For each group, aggregate with agg using the functions first and last:
print (df.a.cumsum()[~df.a].reset_index().groupby('a')['index'].agg(['first','last']))
first last
a
3 3 5
4 7 7
Finally, convert to a nested list:
print (df.a.cumsum()[~df.a].reset_index().groupby('a')['index'].agg(['first','last']).values.tolist())
[[3, 5], [7, 7]]
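An alternative, more general pattern labels runs of identical values with shift and cumsum, so it can find blocks of any value, not only False. A minimal sketch of that idea:
import pandas as pd

df = pd.DataFrame({"a": [True, True, True, False, False, False, True, False, True]})

# each run of identical values gets its own group id
group = df['a'].ne(df['a'].shift()).cumsum()

# keep only the False rows and take the first/last index of each run
runs = (df.index.to_series()[~df['a']]
          .groupby(group)
          .agg(['first', 'last'])
          .values
          .tolist())
print(runs)
#[[3, 5], [7, 7]]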

Related

Replace specific values in a data frame with column mean

I have a DataFrame and I want to replace each value 7 with the rounded mean of its column, computed without counting the other 7s in that column. Here is a simple example:
import pandas as pd
df = pd.DataFrame()
df['a'] = [1, 2, 3]
df['b'] =[3, 0, -1]
df['c'] = [4, 7, 6]
df['d'] = [7, 7, 6]
a b c d
0 1 3 4 7
1 2 0 7 7
2 3 -1 6 6
And here is the output I want:
a b c d
0 1 3 4 2
1 2 0 3 2
2 3 -1 6 6
For example, in row 1 the mean of column c is 3.33, which rounds to 3, and for column d it is 2 (since we do not count the other 7s in that column).
Can you please help me with that?
Here is one way to do it:
import numpy as np

# replace 7 with np.nan
df.replace(7, np.nan, inplace=True)
# fill each NaN with its column's mean, counting the removed 7s as 0
(df.fillna(df.apply(lambda x: x.replace(np.nan, 0).mean(skipna=False)))
   .round(0)
   .astype(int))
a b c d
0 1 3 4 2
1 2 0 3 2
2 3 -1 6 6
Alternatively, compute the replacement values first:
# zero out the 7s so they count as 0 in the column means
temp = df.replace(to_replace=7, value=0, inplace=False).copy()
# note: astype(int) truncates the means; the example values happen to match
df.replace(to_replace=7, value=temp.mean().astype(int), inplace=True)
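The same idea can also be written without mutating df in place; a minimal sketch, assuming the 7s should count as 0 in the column means, as in the example above:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 0, -1],
                   'c': [4, 7, 6], 'd': [7, 7, 6]})

# per-column means with the 7s counted as 0, rounded to the nearest integer
col_means = df.replace(7, 0).mean().round().astype(int)

# replace each 7 with its own column's rounded mean
out = df.apply(lambda col: col.replace(7, col_means[col.name]))
print(out)
#    a  b  c  d
# 0  1  3  4  2
# 1  2  0  3  2
# 2  3 -1  6  6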

How to get a subgroup's start/finish indexes in a DataFrame

df=pd.DataFrame({"C1":['USA','USA','USA','USA','USA','JAPAN','JAPAN','JAPAN','USA','USA'],'C2':['A','B','A','A','A','A','A','A','B','A']})
C1 C2
0 USA A
1 USA B
2 USA A
3 USA A
4 USA A
5 JAPAN A
6 JAPAN A
7 JAPAN A
8 USA B
9 USA A
This is a watered-down version of my problem, so to keep it simple: my objective is to iterate over each subgroup of the dataframe where C2 contains a B. If a B is in C2, I look at C1 and need the entire group. So in this example, I see USA and it starts at index 0 and finishes at 4. Another one is between 8 and 9.
So my desired result would be the indexes such that:
[[0,4],[8,9]]
I tried to use groupby, but it wouldn't work because it groups all the USA rows together.
my_index = list(df[df['C2']=='B'].index)
my_index
would give 1, 8, but how do I get the start/finish?
Here is one approach where you first mask the dataframe on the groups which have at least one B, then grab the index and split it on the gaps to take the first and last index of each run:
import numpy as np

# group id for each run of identical C1 values
s = df['C1'].ne(df['C1'].shift()).cumsum()
# indexes of rows whose group contains at least one B
i = df.index[s.isin(s[df['C2'].eq("B")])]
# split positions: wherever consecutive indexes jump by more than 1
p = np.where(np.diff(i) > 1)[0] + 1
split_ = np.split(i, p)
out = [[idx[0], idx[-1]] for idx in split_]
print(out)
[[0, 4], [8, 9]]
Solution
b = df['C1'].ne(df['C1'].shift()).cumsum()
m = b.isin(b[df['C2'].eq('B')])
i = m.index[m].to_series().groupby(b).agg(['first', 'last']).values.squeeze()
Explanations
Shift column C1 and compare the shifted column with the non-shifted one to create a boolean mask, then take a cumulative sum over this mask to identify the blocks of rows where the value in column C1 stays the same:
>>> b
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
Name: C1, dtype: int64
Create a boolean mask m to identify the blocks of rows that contain at least one B:
>>> m
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 False
8 True
9 True
Name: C1, dtype: bool
Filter the index by boolean masking with mask m, then group the filtered index by the identified blocks b and aggregate with first and last to get the indices:
>>> i
array([[0, 4],
[8, 9]])
Another approach, using more_itertools:
import more_itertools as mit

# keep all the needed indexes
temp = df['C1'].ne(df['C1'].shift()).cumsum()
stored_index = df.index[temp.isin(temp[df['C2'].eq("B")])]
# group the indexes into runs of consecutive numbers
out = [list(i) for i in mit.consecutive_groups(stored_index)]
# take the first and last element of each run
final = [a[:1] + a[-1:] for a in out]
>>> print(final)
[[0, 4], [8, 9]]
Another version:
x = (
df.groupby((df.C1 != df.C1.shift(1)).cumsum())["C2"]
.apply(lambda x: [x.index[0], x.index[-1]] if ("B" in x.values) else np.nan)
.dropna()
.to_list()
)
print(x)
Prints:
[[0, 4], [8, 9]]

Count Total number of sequences that meet condition, without for-loop

I have the following Dataframe as input:
l = [2,2,2,5,5,5,3,3,2,2,4,4,6,5,5,3,5]
df = pd.DataFrame(l)
print(df)
0
0 2
1 2
2 2
3 5
4 5
5 5
6 3
7 3
8 2
9 2
10 4
11 4
12 6
13 5
14 5
15 3
16 5
As output I would like a final count of the total number of sequences that meet a certain condition. For example, in this case, I want the number of sequences in which the values are greater than 3.
So, the output is 3.
1st Sequence = [555]
2nd Sequence = [44655]
3rd Sequence = [5]
Is there a way to calculate this without a for-loop in pandas?
I have already implemented a solution using a for-loop, and I wonder if there is a better approach using pandas in O(N) time.
Thanks very much!
Related to this question: How to count the number of time intervals that meet a boolean condition within a pandas dataframe?
You can use:
m = df[0] > 3
df[1] = (~m).cumsum()
df = df[m]
print (df)
0 1
3 5 3
4 5 3
5 5 3
10 4 7
11 4 7
12 6 7
13 5 7
14 5 7
16 5 8
#create tuples
df = df.groupby(1)[0].apply(tuple).value_counts()
print (df)
(5, 5, 5) 1
(4, 4, 6, 5, 5) 1
(5,) 1
Name: 0, dtype: int64
#alternatively create strings
df = df.astype(str).groupby(1)[0].apply(''.join).value_counts()
print (df)
5 1
44655 1
555 1
Name: 0, dtype: int64
If need output as list:
print (df.astype(str).groupby(1)[0].apply(''.join).tolist())
['555', '44655', '5']
Detail:
print (df.astype(str).groupby(1)[0].apply(''.join))
3 555
7 44655
8 5
Name: 0, dtype: object
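If only the final count is needed, you can count run starts directly: a True whose previous value is not True begins a new sequence. A minimal sketch on the original data:
import pandas as pd

l = [2,2,2,5,5,5,3,3,2,2,4,4,6,5,5,3,5]
s = pd.Series(l)

m = s > 3
# a new run starts wherever m is True and the previous value was not
print((m & ~m.shift(fill_value=False)).sum())
#3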
If you don't need pandas this will suit your needs:
l = [2,2,2,5,5,5,3,3,2,2,4,4,6,5,5,3,5]

def consecutive(array, value):
    result = []
    sub = []
    for item in array:
        if item > value:
            sub.append(item)
        else:
            if sub:
                result.append(sub)
                sub = []
    if sub:
        result.append(sub)
    return result

print(consecutive(l, 3))
#[[5, 5, 5], [4, 4, 6, 5, 5], [5]]
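The standard library's itertools.groupby expresses the same grouping in a couple of lines; a minimal sketch:
from itertools import groupby

l = [2,2,2,5,5,5,3,3,2,2,4,4,6,5,5,3,5]

# group consecutive items by whether they exceed 3 and keep the True groups
runs = [list(g) for k, g in groupby(l, key=lambda x: x > 3) if k]
print(runs)
#[[5, 5, 5], [4, 4, 6, 5, 5], [5]]
print(len(runs))
#3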

Identifying closest value in a column for each filter using Pandas

I have a data frame with categories and values. I need to find the value in each category closest to a given value. I think I'm close, but I can't really get the right output when applying the results of argsort to the original dataframe.
For example, if the input is defined as in the code below, the output should have only (a, 1, True), (b, 2, True), (c, 2, True); all other isClosest values should be False.
If multiple values are equally close, the first value listed should be the one marked.
Here is the code I have, which works, but I can't get it to reapply to the dataframe correctly. I would love some pointers.
df = pd.DataFrame()
df['category'] = ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a']
df['values'] = [1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
df['isClosest'] = False

uniqueCategories = df['category'].unique()
for c in uniqueCategories:
    filteredCategories = df[df['category']==c]
    sortargs = (filteredCategories['values']-2.0).abs().argsort()
    #how to use sortargs so that we set isClosest=True in df for the value closest to 2.0 in each category?
You can create a column of absolute differences:
df['dif'] = (df['values'] - 2).abs()
df
Out:
category values dif
0 a 1 1
1 b 2 0
2 b 3 1
3 b 4 2
4 c 5 3
5 a 4 2
6 b 3 1
7 c 2 0
8 c 1 1
9 a 0 2
And then use groupby.transform to check whether the minimum value of each group is equal to the difference you calculated:
df['is_closest'] = df.groupby('category')['dif'].transform('min') == df['dif']
df
Out:
category values dif is_closest
0 a 1 1 True
1 b 2 0 True
2 b 3 1 False
3 b 4 2 False
4 c 5 3 False
5 a 4 2 False
6 b 3 1 False
7 c 2 0 True
8 c 1 1 False
9 a 0 2 False
df.groupby('category')['dif'].idxmin() would also give you the indices of the closest values for each category. You can use that for mapping too.
For selection:
df.loc[df.groupby('category')['dif'].idxmin()]
Out:
category values dif
0 a 1 1
1 b 2 0
7 c 2 0
For assignment:
df['is_closest'] = False
df.loc[df.groupby('category')['dif'].idxmin(), 'is_closest'] = True
df
Out:
category values dif is_closest
0 a 1 1 True
1 b 2 0 True
2 b 3 1 False
3 b 4 2 False
4 c 5 3 False
5 a 4 2 False
6 b 3 1 False
7 c 2 0 True
8 c 1 1 False
9 a 0 2 False
The difference between these approaches is that if you check equality against the difference, you would get True for all rows in case of ties. However, with idxmin it will return True for the first occurrence (only one for each group).
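To see the difference concretely, here is a small hypothetical tie where both values in category 'a' are at distance 1 from the target 2:
import pandas as pd

tie = pd.DataFrame({'category': ['a', 'a'], 'values': [1, 3]})
tie['dif'] = (tie['values'] - 2).abs()

# the transform-based check marks every tied row
print((tie.groupby('category')['dif'].transform('min') == tie['dif']).tolist())
#[True, True]

# idxmin marks only the first occurrence
print(tie.index.isin(tie.groupby('category')['dif'].idxmin()).tolist())
#[True, False]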
Solution with DataFrameGroupBy.idxmin: get the indexes of the minimal values per group, then assign a boolean mask created with Index.isin to column isClosest:
idx = (df['values'] - 2).abs().groupby([df['category']]).idxmin()
print (idx)
category
a 0
b 1
c 7
Name: values, dtype: int64
df['isClosest'] = df.index.isin(idx)
print (df)
category values isClosest
0 a 1 True
1 b 2 True
2 b 3 False
3 b 4 False
4 c 5 False
5 a 4 False
6 b 3 False
7 c 2 True
8 c 1 False
9 a 0 False

Pandas Series with column names for each value above a minimum

I am trying to get a new Series from a DataFrame. For each row of the DataFrame, this Series should contain the name of the leftmost column whose value is above some minimum, like this:
df = pd.DataFrame(np.random.randint(0,10,size=(5, 6)), columns=list('ABCDEF'))
>>> df
A B C D E F
0 2 4 6 8 8 4
1 2 0 9 7 7 1
2 1 7 7 7 3 0
3 5 4 4 0 1 7
4 9 6 1 5 1 5
min = 3
Expected Output:
0 B
1 C
2 B
3 A
4 A
dtype: object
Here the output's row 0 is "B" because, in DataFrame row index 0, column "B" is the leftmost column with a value greater than or equal to min = 3.
I know that I can use df.idxmin(axis=1) to get the column names of the minimum for each row, but I have no clue at all how to tackle this more complex problem.
Thanks for any help or hints!
UPDATE: index of the first element in each row satisfying the condition.
A more elegant and more efficient version from @DSM (idxmax returns the label of the first occurrence of the maximum; since True > False, it picks the leftmost column meeting the condition):
In [156]: (df>=3).idxmax(1)
Out[156]:
0 B
1 C
2 B
3 A
4 A
dtype: object
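One caveat worth knowing: if a row has no value meeting the condition, idxmax still returns the first column label, because the maximum of an all-False row is False and its first occurrence is in the first column. A small hypothetical check:
import pandas as pd

df2 = pd.DataFrame({'A': [1, 5], 'B': [2, 1]})
mask = df2 >= 3

print(mask.idxmax(1))
#0    A   <- misleading: no value in row 0 is >= 3
#1    A

# hide rows that have no qualifying value
print(mask.idxmax(1).where(mask.any(1)))
#0    NaN
#1      A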
my version:
In [149]: df[df>=3].apply(lambda x: x.first_valid_index(), axis=1)
Out[149]:
0 B
1 C
2 B
3 A
4 A
dtype: object
Old answer - index of the minimum element for each row:
In [27]: df[df>=3].idxmin(1)
Out[27]:
0 E
1 A
2 C
3 C
4 F
dtype: object
