Let's say I have a pd.Series like the one below:
s = pd.Series([False, True, False, True, True, True, False, False])
0 False
1 True
2 False
3 True
4 True
5 True
6 False
7 False
dtype: bool
I want to know how long the longest True sequence is; in this example, it is 3.
I tried it in a stupid way.
s_list = s.tolist()
count = 0
max_count = 0
for item in s_list:
    if item:
        count += 1
    else:
        if count > max_count:
            max_count = count
        count = 0
print(max_count)
It will print 3 here, but for a Series that is all True it will print 0, because the final run is only counted when a False follows it.
Option 1
Use the series itself to mask the cumulative sum of its negation, then use value_counts:
(~s).cumsum()[s].value_counts().max()
3
explanation
(~s).cumsum() is a pretty standard way to produce distinct True/False groups
0 1
1 1
2 2
3 2
4 2
5 2
6 3
7 4
dtype: int64
But you can see that the group we care about is represented by the 2s and there are four of them. That's because the group is initiated by the first False (which becomes True with (~s)). Therefore, we mask this cumulative sum with the boolean mask we started with.
(~s).cumsum()[s]
1 1
3 2
4 2
5 2
dtype: int64
Now we see the three 2s pop out and we just have to use a method to extract them. I used value_counts and max.
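One edge case worth flagging (my addition, not part of the original answer): if the series contains no True at all, the masked selection is empty and value_counts().max() returns NaN. A minimal sketch of a guard, using the same pipeline:

import pandas as pd

s_all_false = pd.Series([False, False, False])

# value_counts() on an empty selection yields an empty series, so
# max() would return NaN; coalesce to 0 for a clean integer answer.
counts = (~s_all_false).cumsum()[s_all_false].value_counts()
longest = counts.max() if not counts.empty else 0
print(longest)  # 0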
Option 2
Use factorize and bincount
a = s.values
b = pd.factorize((~a).cumsum())[0]
np.bincount(b[a]).max()
3
explanation
This is a similar explanation as for option 1. The main difference is in how I found the max. I use pd.factorize to tokenize the values into integers ranging from 0 to the total number of unique values. Given the actual values we had in (~a).cumsum() we didn't strictly need this part. I used it because it's a general-purpose tool that can be used on arbitrary group names.
After pd.factorize I use those integer values in np.bincount which accumulates the total number of times each integer is used. Then take the maximum.
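To make the factorize/bincount step concrete, here is a small standalone illustration with arbitrary group labels (the labels are made up for the example):

import numpy as np
import pandas as pd

# factorize maps arbitrary labels to integers 0..n_unique-1;
# bincount then tallies how often each integer occurs.
labels = np.array(['b', 'a', 'b', 'b', 'c'])
codes = pd.factorize(labels)[0]
print(codes)                     # [0 1 0 0 2]
print(np.bincount(codes))        # [3 1 1]
print(np.bincount(codes).max())  # 3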
Option 3
As stated in the explanation of option 2, this also works:
a = s.values
np.bincount((~a).cumsum()[a]).max()
3
I think this could work
pd.Series(s.index[~s].values).diff().max()-1
Out[57]: 3.0
Also, outside of pandas we can go back to Python's groupby:
from itertools import groupby
max([len(list(group)) for key, group in groupby(s.tolist())])
Out[73]: 3
Update: the above measures the longest run of either value; to restrict it to runs of True, filter by the group key:
from itertools import compress
max(list(compress([len(list(group)) for key, group in groupby(s.tolist())],[key for key, group in groupby(s.tolist())])))
Out[84]: 3
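The same filtering can arguably be done in one pass by testing the key inside the generator; this is a sketch equivalent to the compress version above, with default=0 guarding the all-False case:

from itertools import groupby

import pandas as pd

s = pd.Series([False, True, False, True, True, True, False, False])

# Only measure groups whose key is True; default covers no True at all.
longest = max((len(list(g)) for k, g in groupby(s) if k), default=0)
print(longest)  # 3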
You can use (inspired by @piRSquared's answer):
s.groupby((~s).cumsum()).sum().max()
Out[513]: 3.0
Another option is to use a lambda function to do this:
s.to_frame().apply(lambda x: s.loc[x.name:].idxmin() - x.name, axis=1).max()
Out[429]: 3
Edit: As piRSquared mentioned, my previous solution needs a False appended at both the beginning and the end of the series. piRSquared kindly gave an answer based on that.
(np.diff(np.flatnonzero(np.append(True, np.append(~s.values, True)))) - 1).max()
My original attempt was
(np.diff(s.where(~s).dropna().index.values) - 1).max()
(This will not give the correct answer if the longest run of True starts at the beginning or ends at the end of the series, as pointed out by piRSquared. Please use the solution above given by piRSquared; this remains only for explanation.)
Explanation:
This finds the indices of the False values, and from the gaps between those indices we can deduce the longest run of True.
s.where(~s).dropna().index.values finds all the indices of False:
array([0, 2, 6, 7])
We know that Trues live between the Falses, so we can use np.diff to find the gaps between these indices:
array([2, 4, 1])
Subtract 1 at the end, since the Trues lie strictly between these indices.
Find the maximum of the difference.
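Walking through piRSquared's padded version step by step may help (same series as above):

import numpy as np
import pandas as pd

s = pd.Series([False, True, False, True, True, True, False, False])

# Pad a virtual False at both ends so runs touching either edge are
# bounded, then measure the gaps between consecutive False positions.
padded = np.append(True, np.append(~s.values, True))
false_pos = np.flatnonzero(padded)
print(false_pos)                       # [0 1 3 7 8 9]
print(np.diff(false_pos) - 1)          # [0 1 3 0 0]
print((np.diff(false_pos) - 1).max())  # 3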
Your code was actually very close. It becomes perfect with a minor fix:
count = 0
maxCount = 0
for item in s:
    if item:
        count += 1
        if count > maxCount:
            maxCount = count
    else:
        count = 0
print(maxCount)
I'm not exactly sure how to do it with pandas, but what about using itertools.groupby?
>>> from itertools import groupby
>>> import pandas as pd
>>> s = pd.Series([False, True, False, True, True, True, False, False])
>>> max(sum(1 for _ in g) for k, g in groupby(s) if k)
3
Related
I'm looking for a way (ideally a built-in pandas function) to scan a column of a DataFrame, comparing its own values at different indices.
Here is an example using a for loop. I have a dataframe with a single column col_1, and I want to create a column col_2 with TRUE/FALSE in this way:
df["col_2"] = "False"
N=5
for idx in range(0,len(df)-N):
for i in range (idx+1,idx+N+1):
if(df["col_1"].iloc[idx]==df["col_1"].iloc[i]):
df["col_2"].iloc[idx]=True
What I'm trying to do is compare the value of col_1 at the i-th index with the next N indices.
I'd like to do the same operation without using a for loop. I've already tried shift and df.loc, but the computational time is similar.
Have you tried doing something like:
df["col_1_shifted"] = df["col_1"].shift(N)
df["col_2"] = (df["col_1"] == df["col_1_shifted"])
Update: looking more carefully at your double loop, it seems you want to flag all duplicates except the last. That's done by just changing the keep argument to 'last' instead of the default 'first'.
As suggested by @QuangHoang in the comments, duplicated() works nicely for this:
newdf = df.assign(col_2=df.duplicated(subset='col_1', keep='last'))
Example:
df = pd.DataFrame(np.random.randint(0, 5, 10), columns=['col_1'])
newdf = df.assign(col_2=df.duplicated(subset='col_1', keep='last'))
>>> newdf
col_1 col_2
0 2 False
1 0 True
2 1 True
3 0 True
4 0 False
5 3 False
6 1 True
7 1 False
8 4 True
9 4 False
I have a dataframe like df = pd.DataFrame({'ID':[1,1,2,2,3,3,4,4,5,5,5],'Col1':['Y','Y','Y','N','N','N','Y','Y','Y','N','N']}). What I would like to do is group by the 'ID' column and then get statistics on three conditions:
How many groups have only 'Y's
How many groups have at least 1 'Y' and at least 1 'N'
How many groups have only 'N's
groups = df.groupby('ID')
groups.Col1.value_counts()
gives me a visual representation of what I'm looking for, but how can I then iterate over the results of the value_counts() method to check for these conditions?
I think pd.crosstab() may be more suitable for your use case.
Code
df_crosstab = pd.crosstab(df["ID"], df["Col1"])
Col1 N Y
ID
1 0 2
2 1 1
3 2 0
4 0 2
5 2 1
Groupby can also do the job, but it is much more tedious:
df_crosstab = df.groupby('ID')["Col1"]\
                .value_counts()\
                .rename("count")\
                .reset_index()\
                .pivot(index="ID", columns="Col1", values="count")\
                .fillna(0)
Filtering the groups
After producing df_crosstab, the filters for your 3 questions could be easily constructed:
# 1. How many groups have only 'Y's
df_crosstab[df_crosstab['N'] == 0]
Col1 N Y
ID
1 0 2
4 0 2
# 2. How many groups have at least 1 'Y' and at least 1 'N'
df_crosstab[(df_crosstab['N'] > 0) & (df_crosstab['Y'] > 0)]
Col1 N Y
ID
2 1 1
5 2 1
# 3. How many groups have only 'N's
df_crosstab[df_crosstab['Y'] == 0]
Col1 N Y
ID
3 2 0
If you want the number of groups only, just take the length of the filtered crosstab dataframe. I believe this also makes automation much easier.
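Putting that together, counting the groups for all three questions could look like this (reusing the df_crosstab built above):

# Lengths of the filtered crosstabs answer the three questions directly.
only_y = len(df_crosstab[df_crosstab['N'] == 0])
mixed = len(df_crosstab[(df_crosstab['N'] > 0) & (df_crosstab['Y'] > 0)])
only_n = len(df_crosstab[df_crosstab['Y'] == 0])
print(only_y, mixed, only_n)  # 2 2 1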
groups = df.groupby('ID')
answers = groups.Col1.value_counts()
for item in answers.items():  # iteritems() was removed in pandas 2.0; items() is the current equivalent
    print(item)
What you are making is a series from value_counts(), and you can iterate over it. Note that this alone is not what you want: you would still have to check each of these items against the conditions you are looking for.
If you group by 'ID' and use the 'sum' function, the letters of each group are concatenated into one string. Then you can just count occurrences within the strings to check your conditions, and take their sums to get the exact numbers for all of the groups:
strings = df.groupby(['ID']).sum()
only_y = sum(strings['Col1'].str.count('N') == 0)
only_n = sum(strings['Col1'].str.count('Y') == 0)
both = sum((strings['Col1'].str.count('Y') > 0) & (strings['Col1'].str.count('N') > 0))
print('Number of groups with Y only: ' + str(only_y),
      'Number of groups with N only: ' + str(only_n),
      'Number of groups with at least one Y and one N: ' + str(both),
      sep='\n')
I'd like to calculate a rolling sum of elements the way R's rollapply does:
s = pd.Series([1,2,3,4,5,6])
As a result I'd like to receive a new series with the sum of elements over non-overlapping intervals (window size is 2):
3
7
11
Pandas' Series.rolling works differently, producing sums over overlapping intervals. Please tell me how to do what I want...
You can try
s.groupby(s.index//2).sum()
0 3
1 7
2 11
dtype: int64
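One assumption worth making explicit: s.index // 2 relies on the default RangeIndex. A sketch that groups by position instead, so it works for any index:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6], index=list('abcdef'))

# Group positions 0-1, 2-3, 4-5 regardless of the index labels.
out = s.groupby(np.arange(len(s)) // 2).sum()
print(out.tolist())  # [3, 7, 11]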
Here is a direct solution:
import numpy as np
s = pd.Series([1, 2, 3, 4, 5, 6])
pd.Series([np.sum(s[x:x + 2]) for x in range(0, len(s), 2)])
For example, I have a dataframe A like the one below:
a b c
x 0 2 1
y 1 3 2
z 0 2 4
I want to get the number of 0s in column 'a', which should return 2 (A['a']['x'] and A['a']['z']).
Is there a simple way or a function with which I can easily do this?
I've Googled for it, but there are only articles like this.
count the frequency that a value occurs in a dataframe column
That approach makes a new dataframe, and is more complicated than what I need.
Use sum with a boolean mask - Trues are processed like 1s, so the output is the count of 0 values:
out = A.a.eq(0).sum()
print (out)
2
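The same idea extends to the whole frame at once, if that's ever needed:

# Per-column zero counts, and the grand total.
print(A.eq(0).sum())        # a: 2, b: 0, c: 0
print(A.eq(0).sum().sum())  # 2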
Try value_counts from pandas:
df.a.value_counts()[0]
If the searched value varies, use df[column_name].value_counts()[searched_value].
When you index into a pandas dataframe using a list of ints, it returns columns.
e.g. df[[0, 1, 2]] returns the first three columns.
Why does indexing with a boolean vector return rows instead?
e.g. df[[True, False, True]] returns the first and third rows (and errors out if there aren't exactly 3 rows).
Why? Shouldn't it return the first and third columns?
Thanks!
Because if you use:
df[[True, False, True]]
it is called boolean indexing by mask:
[True, False, True]
Sample:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
print (df)
A B C
0 1 4 7
1 2 5 8
2 3 6 9
print (df[[True, False, True]])
A B C
0 1 4 7
2 3 6 9
The boolean mask is the same as:
print (df.B != 5)
0 True
1 False
2 True
Name: B, dtype: bool
print (df[df.B != 5])
A B C
0 1 4 7
2 3 6 9
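For completeness, a boolean mask can select columns too, but you have to say so explicitly with .loc; a short sketch using the same df:

# Explicit axis: a boolean list on the column axis picks columns A and C.
print(df.loc[:, [True, False, True]])
#    A  C
# 0  1  7
# 1  2  8
# 2  3  9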
There are very specific slicing accessors to target rows and columns in specific ways.
Mixed Position and Label Based Selection
Select by position
Selection by Label
loc[], at[], and get_value() take row and column labels and return the appropriate slice
iloc[] and iat[] take row and column positions and return the appropriate slice
What you are seeing is the result of pandas trying to infer what you are trying to do. As you have noticed, this is inconsistent at times. In fact, it is more pronounced than just what you've highlighted... but I won't go into that now.
See also
pandas docs
However, when an axis is integer based,
ONLY label based access and not positional access is supported.
Thus, in such cases, it’s usually better to be explicit and use .iloc or .loc.
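A tiny example of why the explicitness matters with an integer axis (the index values are chosen to disagree with positions on purpose):

import pandas as pd

ser = pd.Series(['a', 'b', 'c'], index=[2, 0, 1])
print(ser.loc[0])   # 'b' -- label-based lookup
print(ser.iloc[0])  # 'a' -- position-based lookup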