Groupby Column if Condition is fulfilled - python

I have the following dataframe in pandas:
a b
0 0 0
1 1 1
2 2 0
3 3 0
4 4 1
I want to group by column b (as in groupby('b')), but only where the values of column a are simultaneously consecutive (monotonically increasing by exactly 1). E.g. the output should be:
Group 1: Row 0
Group 2: Row 1
Group 3: Row 2, 3
Group 4: Row 4
How can I do that?
Thanks!

IIUC, construct temporary series based on your conditions -
i = df.a.eq(df.a.shift() + 1)       # True where a is exactly one more than the row above
j = df.b.ne(df.b.shift()).cumsum()  # label that increments whenever b changes
Now, call groupby -
for _, g in df.groupby([i, j]):
    print(g, '\n')
a b
0 0 0
a b
1 1 1
a b
2 2 0
3 3 0
a b
4 4 1
Details
i is a series of bools which says whether each value of a is exactly one greater than the element above it.
i
0 False
1 True
2 True
3 True
4 True
Name: a, dtype: bool
j is a series that designates groups for consecutive values in df.b.
j
0 1
1 2
2 3
3 3
4 4
Name: b, dtype: int64
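Putting it all together as a self-contained sketch (the group column, ngroup, and sort=False are my additions, used only to label the groups in order of appearance):

import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [0, 1, 0, 0, 1]})

i = df.a.eq(df.a.shift() + 1)       # True where a is exactly one more than the row above
j = df.b.ne(df.b.shift()).cumsum()  # label that increments whenever b changes

# ngroup() numbers each (i, j) group; sort=False keeps order of appearance
df['group'] = df.groupby([i, j], sort=False).ngroup() + 1
print(df)
#    a  b  group
# 0  0  0      1
# 1  1  1      2
# 2  2  0      3
# 3  3  0      3
# 4  4  1      4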

Related

How to remove columns if the value of one specific row is 0

I have a fairly straightforward question, but I could not find the answer on Stack Overflow.
I have a pd.df
Index A B C
0 1 1 0
1 0 0 0
2 1 1 1
3 0 0 1
I simply wish to remove all columns where the fourth row (3) is 0. So only column C would remain. Cheers.
Assuming "Index" the index, you can use boolean indexing:
df2 = df.loc[:, df.iloc[3].ne(0)]
output:
C
0 0
1 0
2 1
3 1
output of df.iloc[3].ne(0):
A False
B False
C True
Name: 3, dtype: bool
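For reference, a minimal runnable version that reconstructs the example frame:

import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1, 0],
                   'B': [1, 0, 1, 0],
                   'C': [0, 0, 1, 1]})

# Keep only the columns whose value in the fourth row (position 3) is non-zero
df2 = df.loc[:, df.iloc[3].ne(0)]
print(df2)
#    C
# 0  0
# 1  0
# 2  1
# 3  1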

Pandas Counting Each Column with its Specific Thresholds

If I have a following dataframe:
A B C D E
1 1 2 0 1 0
2 0 0 0 1 -1
3 1 1 3 -5 2
4 -3 4 2 6 0
5 2 4 1 9 -1
T 1 2 2 4 1
The last row holds my threshold value for each column. In Python pandas, I want to count, for each column, how many values fall below that column's threshold.
Desired Output is;
A B C D E
Count 2 2 3 3 4
But I need a general solution, not one written for these specific columns: I have a large dataset and cannot spell out each column name in the code.
Could you please help me with this?
Select all rows except the last by indexing and compare them against the last row with DataFrame.lt, then sum and convert the resulting Series to a one-row DataFrame with Series.to_frame and a transpose via DataFrame.T:
df = df.iloc[:-1].lt(df.iloc[-1]).sum().to_frame('count').T
print(df)
A B C D E
count 2 2 3 3 4
NumPy alternative with the DataFrame constructor:
import numpy as np

arr = df.values
df = pd.DataFrame([np.sum(arr[:-1] < arr[-1], axis=0)], columns=df.columns, index=['count'])
print(df)
A B C D E
count 2 2 3 3 4
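For reference, a self-contained sketch that reconstructs the example frame (the 'T' index label is only illustrative):

import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1, -3, 2, 1],
                   'B': [2, 0, 1, 4, 4, 2],
                   'C': [0, 0, 3, 2, 1, 2],
                   'D': [1, 1, -5, 6, 9, 4],
                   'E': [0, -1, 2, 0, -1, 1]},
                  index=[1, 2, 3, 4, 5, 'T'])

# Compare every data row against the last (threshold) row, column by column
counts = df.iloc[:-1].lt(df.iloc[-1]).sum().to_frame('count').T
print(counts)
#        A  B  C  D  E
# count  2  2  3  3  4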

Drop groups by number of occurrences

Hi, I want to delete the rows whose entries occur fewer than a given number of times. For example:
df = pd.DataFrame({'a': [1,2,3,2], 'b':[4,5,6,7], 'c':[0,1,3,2]})
df
a b c
0 1 4 0
1 2 5 1
2 3 6 3
3 2 7 2
Here I want to delete all rows whose value in column 'a' occurs fewer than two times.
Wanted output:
a b c
1 2 5 1
3 2 7 2
What I know:
we can find the number of occurrences with condition = df['a'].value_counts() < 2, and it will give me something like:
2 False
3 True
1 True
Name: a, dtype: int64
But I don't know how I should approach from here to delete the rows.
Thanks in advance!
groupby + size
res = df[df.groupby('a')['b'].transform('size') >= 2]
The transform method maps df.groupby('a')['b'].size() back onto df, aligned with df['a'].
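The intermediate series looks like this (a quick sketch on the question's frame):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 2], 'b': [4, 5, 6, 7], 'c': [0, 1, 3, 2]})

# Each row receives the size of its 'a' group, aligned with the original index
print(df.groupby('a')['b'].transform('size'))
# 0    1
# 1    2
# 2    1
# 3    2
# Name: b, dtype: int64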
value_counts + map
s = df['a'].value_counts()
res = df[df['a'].map(s) >= 2]
print(res)
a b c
1 2 5 1
3 2 7 2
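For clarity, s is indexed by the unique values of 'a', so df['a'].map(s) looks up each row's own count:

print(df['a'].map(s))
# 0    1
# 1    2
# 2    1
# 3    2
# Name: a, dtype: int64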
You can use df.where and then dropna. Note that df['a'].value_counts() is indexed by the values of 'a' rather than by row, so it has to be mapped back onto the rows first, and the comparison must keep counts of at least 2; where fills the remaining rows with NaN (which upcasts to float) and dropna removes them:
df.where(df['a'].map(df['a'].value_counts()) >= 2).dropna()
a b c
1 2.0 5.0 1.0
3 2.0 7.0 2.0
You could try something like this: get the length of each group, transform it back onto the original index, and index the df by the result
df[df.groupby("a").transform(len)["b"] >= 2]
a b c
1 2 5 1
3 2 7 2
Breaking it into individual steps you get:
df.groupby("a").transform(len)["b"]
0 1
1 2
2 1
3 2
Name: b, dtype: int64
These are the group sizes transformed back onto your original index
df.groupby("a").transform(len)["b"] >=2
0 False
1 True
2 False
3 True
Name: b, dtype: bool
We then use this boolean series to index our original dataframe.

Get rows from Pandas DataFrame from index until condition

Let's say I have a Pandas DataFrame:
x = pd.DataFrame(data=[5,4,3,2,1,0,1,2,3,4,5],columns=['value'])
x
Out[9]:
value
0 5
1 4
2 3
3 2
4 1
5 0
6 1
7 2
8 3
9 4
10 5
Now, I want to, given an index, find rows in x until a condition is met.
For example, if index = 2:
x.loc[2]
Out[14]:
value 3
Name: 2, dtype: int64
Now I want to, from that index, find the next n rows where the value is greater than some threshold. For example, if the threshold is 0, the results should be:
x
Out[9]:
value
2 3
3 2
4 1
5 0
How can I do this?
I have tried:
x.loc[2:x['value']>0,:]
But of course this will not work because x['value']>0 returns a boolean array of:
Out[20]:
0 True
1 True
2 True
3 True
4 True
5 False
6 True
7 True
8 True
9 True
10 True
Name: value, dtype: bool
Using idxmin and slicing
x.loc[2:x['value'].gt(0).idxmin(),:]
value
2 3
3 2
4 1
5 0
Edit:
For a general formula, use
index = 7
threshold = 2
x.loc[index:x.loc[index:,'value'].gt(threshold).idxmin(),:]
From your description in the comments, it seems you want to begin from index+1 rather than index. If that is the case, just use
x.loc[index+1:x.loc[index+1:,'value'].gt(threshold).idxmin(),:]
You want to filter for index greater than or equal to your index=2 and for x['value']>=threshold, and then select the first n of these rows, which can be accomplished with .head(n).
Say:
idx = 2
threshold = 0
n = 4
x[(x.index>=idx) & (x['value']>=threshold)].head(n)
Out:
# value
# 2 3
# 3 2
# 4 1
# 5 0
Edit: changed to >=, and updated example to match OP's example.
Edit 2 due to clarification from OP: since n is unknown:
idx = 2
threshold = 0
x.loc[idx:(x['value']<=threshold).loc[x.index>=idx].idxmax()]
This is selecting from the starting idx, in this case idx=2, up to and including the first row where the condition is not met (in this case index 5).
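Broken into steps as a runnable sketch (the stop variable is mine, for readability):

import pandas as pd

x = pd.DataFrame(data=[5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5], columns=['value'])

idx = 2
threshold = 0

# First index at or after idx where the value drops to the threshold or below
stop = (x['value'] <= threshold).loc[x.index >= idx].idxmax()
print(x.loc[idx:stop])
#    value
# 2      3
# 3      2
# 4      1
# 5      0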

pandas: Grouping or filtering based on values in list, instead of dataframe

I want to get a row count of the frequency of each value, even if that value doesn't exist in the dataframe.
d = {'light' : pd.Series(['b','b','c','a','a','a','a'], index=[1,2,3,4,5,6,9]),'injury' : pd.Series([1,5,5,5,2,2,4], index=[1,2,3,4,5,6,9])}
testdf = pd.DataFrame(d)
injury light
1 1 b
2 5 b
3 5 c
4 5 a
5 2 a
6 2 a
9 4 a
I want to get a count of the number of occurrences of each unique value of 'injury' for each unique value in 'light'.
Normally I would just use groupby(), or (in this case, since I want it to be in a specific format), pivot_table:
testdf.reset_index().pivot_table(index='light',columns='injury',fill_value=0,aggfunc='count')
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
But in this case I actually want to compare the records in the dataframe to an external list of values-- in this case, ['a','b','c','d']. So if 'd' doesn't exist in this dataframe, then I want it to return a count of zero:
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
d 0 0 0 0
The closest I've come is filtering the dataframe based on each value, and then getting the size of that dataframe:
for v in sorted(['a','b','c','d']):
    idx2 = testdf['light'].isin([v])
    df2 = testdf[idx2]
    print(df2.shape[0])
4
2
1
0
But that only returns counts from the 'light' column-- instead of a cross-tabulation of both columns.
Is there a way to make a pivot table, or a groupby() object, that groups things based on values in a list, rather than in a column in a dataframe? Or is there a better way to do this?
Try this:
df = pd.crosstab(testdf.light, testdf.injury, margins=True)
df
injury 1 2 4 5 All
light
a 0 2 1 1 4
b 1 0 0 1 2
c 0 0 0 1 1
All 1 2 1 3 7
df["All"]
light
a 4
b 2
c 1
All 7
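Note that crosstab only produces rows for values that actually occur in the data, so 'd' is missing. Continuing from the question's testdf, one way to bring in the external list is to reindex with fill_value=0 (a sketch):

ct = pd.crosstab(testdf.light, testdf.injury).reindex(['a', 'b', 'c', 'd'], fill_value=0)
print(ct)
# injury  1  2  4  5
# light
# a       0  2  1  1
# b       1  0  0  1
# c       0  0  0  1
# d       0  0  0  0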
