Slicing Pandas Dataframe according to number of lines - python

I suppose this is something rather simple, but I can't find how to do it. I've been searching tutorials and Stack Overflow.
Suppose I have a dataframe df looking like this:
Group  Id_In_Group  SomeQuantity
    1            1            10
    1            2            20
    2            1             7
    3            1            16
    3            2            22
    3            3             5
    3            4            12
    3            5            28
    4            1             1
    4            2            18
    4            3            14
    4            4             7
    5            1            36
I would like to select only the rows belonging to groups with at least 4 members (so there are at least 4 rows sharing the same Group number), and for which the 4th SomeQuantity in the group, after sorting the group by ascending SomeQuantity, is at least 20 (for example).
In the given dataframe it would return only group 3, since it has 5 (>= 4) members and its 4th SomeQuantity after sorting is 22 (>= 20), so it should construct this dataframe:
Group  Id_In_Group  SomeQuantity
    3            1            16
    3            2            22
    3            3             5
    3            4            12
    3            5            28
(sorted by SomeQuantity or not, either is fine).
Could somebody be kind enough to help me? :)

I would use .groupby() + .filter() methods:
In [66]: df.groupby('Group').filter(lambda x: len(x) >= 4 and x['SomeQuantity'].sort_values().iloc[3] >= 20)
Out[66]:
   Group  Id_In_Group  SomeQuantity
3      3            1            16
4      3            2            22
5      3            3             5
6      3            4            12
7      3            5            28

A slightly different approach using map, value_counts, groupby, filter:
import numpy as np

(df[df.Group.map(df.Group.value_counts().ge(4))]
   .groupby('Group')
   .filter(lambda x: np.any(x['SomeQuantity'].sort_values().iloc[3] >= 20)))
Breakdown of steps:
Perform value_counts to compute the total counts of the distinct elements present in the Group column:
>>> df.Group.value_counts()
3 5
4 4
1 2
5 1
2 1
Name: Group, dtype: int64
Use map, which works like a dictionary lookup (the index becomes the keys and the series elements become the values), to map these results back onto the original DF:
>>> df.Group.map(df.Group.value_counts())
0 2
1 2
2 1
3 5
4 5
5 5
6 5
7 5
8 4
9 4
10 4
11 4
12 1
Name: Group, dtype: int64
Then we check for the elements having a count of 4 or more, which is our threshold, and take only that subset of the entire DF:
>>> df[df.Group.map(df.Group.value_counts().ge(4))]
    Group  Id_In_Group  SomeQuantity
3       3            1            16
4       3            2            22
5       3            3             5
6       3            4            12
7       3            5            28
8       4            1             1
9       4            2            18
10      4            3            14
11      4            4             7
In order to use the groupby.filter operation on this, we must make sure we return a single boolean value for each group key: we sort within the group and compare the fourth element against the threshold of 20.
Since .iloc[3] yields a single scalar, the comparison already produces one boolean per group; np.any simply makes that explicit.
>>> df[df.Group.map(df.Group.value_counts().ge(4))] \
...     .groupby('Group').apply(lambda x: x['SomeQuantity'].sort_values().iloc[3])
Group
3 22
4 18
dtype: int64
From these, we take the fourth element with .iloc[3], as indexing is 0-based, and keep only the groups that meet the threshold.

This is how I have worked through your question, warts and all. I'm sure there are much nicer ways to do this.
Find groups with "4 objects in the group"
import collections
groups = list({k for k, v in collections.Counter(df.Group).items() if v > 3}); groups
Out: [3, 4]
Use these groups to filter to a new df containing these groups:
df2 = df[df.Group.isin(groups)]
"4th SomeQuantity (after sorting) is 22 (>=20)"
df3 = df2.sort_values(by='SomeQuantity', ascending=False)
df3.groupby('Group').filter(lambda grp: grp['SomeQuantity'].sort_values().iloc[3] >= 20).sort_index()
   Group  Id_In_Group  SomeQuantity
3      3            1            16
4      3            2            22
5      3            3             5
6      3            4            12
7      3            5            28
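A compact alternative, as a minimal sketch (the names counts, fourth and result are mine, not from the answers above): build both conditions as boolean masks with transform, so no Python-level filter callback is needed.
import numpy as np
import pandas as pd

# group size, broadcast back onto every row
counts = df.groupby('Group')['SomeQuantity'].transform('size')

# 4th-smallest SomeQuantity per group, broadcast onto every row
# (NaN for groups with fewer than 4 members, so they fail the comparison)
fourth = df.groupby('Group')['SomeQuantity'].transform(
    lambda s: s.sort_values().iloc[3] if len(s) >= 4 else np.nan
)

result = df[(counts >= 4) & (fourth >= 20)]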

Related

Ordering a dataframe by each column

I have a dataframe that looks like this:
   ID  Age  Score
0   9    5      3
1   4    6      1
2   9    7      2
3   3    2      1
4  12    1     15
5   2   25      6
6   9    5      4
7   9    5     61
8   4    2     12
I want to sort based on the first column, then the second column, and so on.
So I want my output to be this:
   ID  Age  Score
5   2   25      6
3   3    2      1
8   4    2     12
1   4    6      1
0   9    5      3
6   9    5      4
7   9    5     61
2   9    7      2
4  12    1     15
I know I can do the above with df.sort_values(df.columns.to_list()); however, I'm worried this might be quite slow for much larger dataframes (in terms of both columns and rows).
Is there a more optimal solution?
You can use numpy.lexsort to improve performance. np.lexsort uses the last key as the primary sort key, so rotating the array with np.rot90 puts the first column last and the frame ends up sorted primarily by its first column:
import numpy as np

a = df.to_numpy()
order = np.lexsort(np.rot90(a))
out = pd.DataFrame(a[order], index=df.index[order], columns=df.columns)
Assuming as input a random square DataFrame of side n:
df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))
here is the comparison for 100 to 100M items (lower runtime is better):
[benchmark plot: runtime of df.sort_values vs. the lexsort approach for 100 to 100M items]
[same benchmark, shown as speed relative to pandas]
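If you want to reproduce the comparison yourself, a rough timing sketch along these lines should work (the size n and repeat count here are illustrative, not the settings behind the plots above):
import numpy as np
import pandas as pd
from timeit import timeit

n = 1000
df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))

t_sort_values = timeit(lambda: df.sort_values(df.columns.to_list()), number=5)
t_lexsort = timeit(lambda: df.to_numpy()[np.lexsort(np.rot90(df.to_numpy()))], number=5)
print(f'sort_values: {t_sort_values:.3f}s, lexsort: {t_lexsort:.3f}s')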
By still using df.sort_values() you can speed it up a bit by selecting the type of sorting algorithm. By default it's quicksort, but there are the alternatives 'mergesort', 'heapsort' and 'stable'.
Maybe specifying one of these would improve it?
df.sort_values(df.columns.to_list(), kind="mergesort")
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

How to repeat the cumsum for previous values in a Pandas Series, when the count group is restarted?

I have a Pandas Series that represents a running count within groups; the count restarts whenever a new group begins.
How do I create a new series holding, for each position, the maximum count reached in its group?
Minimal example:
import pandas as pd
s_count = pd.Series([1,2,3,1,2,3,4,5,1,2,3,4])
Desired:
s_max_count_group = pd.Series([3,3,3,5,5,5,5,5,4,4,4,4])
Print result:
df = pd.DataFrame({
'counts': s_count,
'expected': s_max_count_group
})
print(df)
Display:
    counts  expected
0        1         3
1        2         3
2        3         3
3        1         5
4        2         5
5        3         5
6        4         5
7        5         5
8        1         4
9        2         4
10       3         4
11       4         4
I looked for similar questions and tested some answers; I'm trying to use the fill, cumsum, diff and mask methods, but no success so far.
We can identify the individual groups by comparing the count with 1 followed by cumsum, then group the given series on these identified groups and transform with max:
s_count.groupby(s_count.eq(1).cumsum()).transform('max')
0 3
1 3
2 3
3 5
4 5
5 5
6 5
7 5
8 4
9 4
10 4
11 4
dtype: int64
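To see why this works, it can help to print the intermediate group ids: each restart of the count (a value equal to 1) opens a new group. A small illustration, not part of the original answer:
import pandas as pd

s_count = pd.Series([1, 2, 3, 1, 2, 3, 4, 5, 1, 2, 3, 4])
print(s_count.eq(1).cumsum().tolist())
# [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3]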

return first column number that fulfills a condition in pandas

I have a dataset with several columns of cumulative sums. For every row, I want to return the first column number that satisfies a condition.
Toy example:
df = pd.DataFrame(np.array(range(20)).reshape(4,5).T).cumsum(axis=1)
>>> df
   0   1   2   3
0  0   5  15  30
1  1   7  18  34
2  2   9  21  38
3  3  11  24  42
4  4  13  27  46
Say I want to return the first column whose value is greater than 20, for instance.
Desired output:
3
3
2
2
2
Many thanks as always!
Try with idxmax
df.gt(20).idxmax(1)
Out[66]:
0 3
1 3
2 2
3 2
4 2
dtype: object
Not as short as @YOBEN_S's answer, but what also works is chaining index.get_loc and first_valid_index:
df[df>20].apply(lambda x: x.index.get_loc(x.first_valid_index()), axis=1)
0 3
1 3
2 2
3 2
4 2
dtype: int64
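If you prefer positional column numbers directly, a hedged numpy sketch along these lines should also work; it adds a guard for rows where nothing exceeds the threshold, which idxmax would otherwise silently report as the first column:
import numpy as np

mask = df.gt(20).to_numpy()
first_pos = mask.argmax(axis=1)                        # position of first True per row
first_pos = np.where(mask.any(axis=1), first_pos, -1)  # -1 where no value matches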

How do you make combined rolling groups in Pandas

How do you get rolling groups in Pandas? I need group (1,2), then group (2,3), then group (3,4), etc. The best I can do is group (1,2), then group (3,4). I take group 1 and add its values to group 2. The next iteration is group (2,3): I take group 2's newly updated values and add them to group 3's original values. I then take those newly updated group 3 values and add them to group 4's original values.
The most important part: don't get stuck on adding the values in the right order. What really matters is that I want to update a group. I want to update group 2 with group 1's values (my post is just an example); then, in the next transform, I want the new values I put into group 2 to update the next group, group 3. Then, in the next transform or apply, I want those new group 3 values so I can update group 4. I hope that makes sense.
num group
1 1
2 1
2 1
4 1
5 2
6 2
7 2
8 2
9 3
10 3
11 3
12 3
13 4
14 4
15 4
16 4
df=pd.read_clipboard()
I want my first group to be the following; group 2 has had its values increased by group 1's:
1 1
2 1
3 1
4 1
6 2
8 2
10 2
14 2
My second group will then hopefully have the new, modified values from adding group 1. Group 3 will have its original values increased by group 2's new values:
6 2
8 2
10 2
14 2
15 3
18 3
21 3
26 3
My third group will be group 3's new values, and group 4 will have its original values increased by group 3's, in order:
15 3
18 3
21 3
26 3
29 4
33 4
36 4
42 4
I tried
df.groupby(np.arange(len(df)) // 4)
, except it only splits into groups (1,2) and then (3,4). I need (1,2), (2,3), (3,4). This is because I process group 1 to make group 2's values, then use group 2 to create group 3's values, then use group 3 to make group 4's values. Any help on this would be appreciated. I made a simple example because I don't need help with what I'm doing with the groups; I just need to know how to group like that.
I will do
s = df.group.drop_duplicates()
l = [df.loc[df.group.isin([x, y])] for x, y in zip(s.iloc[1:], s.shift().iloc[1:])]
Update
df['num'] = df['num'].groupby(df.groupby('group').cumcount()).cumsum()
s = df.group.drop_duplicates()
l = [df.loc[df.group.isin([x, y])] for x, y in zip(s.iloc[1:], s.shift().iloc[1:])]
l[0]
   num  group
0    1      1
1    2      1
2    2      1
3    4      1
4    6      2
5    8      2
6    9      2
7   12      2
df['grp'] = df.apply(lambda x: x.iloc[::4]).groupby('num').cumsum()
df['grp'] = df['grp'].ffill()
print(df)
    num  group  grp
0     1      1  1.0
1     2      1  1.0
2     3      1  1.0
3     4      1  1.0
4     5      2  2.0
5     6      2  2.0
6     7      2  2.0
7     8      2  2.0
8     9      3  3.0
9    10      3  3.0
10   11      3  3.0
11   12      3  3.0
12   13      4  4.0
13   14      4  4.0
14   15      4  4.0
15   16      4  4.0
Prerequisites:
df["group_sub"]=df.groupby("group").cumcount()
dfprev=df["num"]
for i in range(1, df.group.nunique()):
dfprev+=df["num"].groupby(df["group_sub"]).shift(i).fillna(0)
df.drop("group_sub", axis=1, inplace=True)
You can do:
df_series = [df.loc[df.group.isin(df.group.unique()[i:i+2])] for i in range(df.group.nunique() - 1)]
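Since the real goal is to update each group with the previous group's already-updated values, a plain loop over consecutive group labels may be the clearest sketch. This assumes, as in the example, that consecutive groups have the same number of rows, so the values line up positionally:
labels = df['group'].unique()
for prev, cur in zip(labels, labels[1:]):
    # add the (already updated) previous group's values to the current group
    df.loc[df['group'] == cur, 'num'] = (
        df.loc[df['group'] == cur, 'num'].to_numpy()
        + df.loc[df['group'] == prev, 'num'].to_numpy()
    )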

Select particular rows from inside groups in pandas dataframe

Suppose I have a dataframe that looks like this:
   group  level
0      1     10
1      1     10
2      1     11
3      2      5
4      2      5
5      3      9
6      3      9
7      3      9
8      3      8
The desired output is this:
   group  level
0      1     10
5      3      9
Namely, this is the logic: look inside each group, if there is more than 1 distinct value present in the level column, return the first row in that group. For example, no row from group 2 is selected, because the only value present in the level column is 5.
In addition, how does the situation change if I want the last, instead of the first row of such groups?
What I attempted was combining groupby statements with creating sets from entries in the level column, but I failed to produce anything even nearly sensible.
This can be done with groupby and using apply to run a simple function on each group:
def get_first_val(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group['level'].loc[group['level'].first_valid_index()]
    else:
        return None
df.groupby('group').apply(get_first_val).dropna()
Out[8]:
group
1 10
3 9
dtype: float64
There's also a last_valid_index() method, so you wouldn't have to
make any huge changes to get the last row instead.
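For instance, a minimal sketch of the last-row variant, mirroring the function above:
def get_last_val(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group['level'].loc[group['level'].last_valid_index()]
    else:
        return None

df.groupby('group').apply(get_last_val).dropna()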
If you have other columns that you want to keep, you just need a slight tweak:
import numpy as np
df['col1'] = np.random.randint(10, 20, 9)
df['col2'] = np.random.randint(20, 30, 9)
df
Out[17]:
   group  level  col1  col2
0      1     10    19    21
1      1     10    18    24
2      1     11    14    23
3      2      5    14    26
4      2      5    10    22
5      3      9    13    27
6      3      9    16    20
7      3      9    18    26
8      3      8    11     2
def get_first_val_keep_cols(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group.loc[group['level'].first_valid_index(), :]
    else:
        return None
df.groupby('group').apply(get_first_val_keep_cols).dropna()
Out[20]:
       group  level  col1  col2
group
1          1     10    19    21
3          3      9    13    27
This would be simpler:
In [121]:
print(df.groupby('group')
        .agg(lambda x: x.values[0] if (x.values != x.values[0]).any() else np.nan)
        .dropna())
       level
group
1         10
3          9
For each group, if any of the values are not the same as the first value, aggregate that group into its first value; otherwise, aggregate it to nan.
Finally, dropna().
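On a reasonably recent pandas, a hedged alternative sketch that avoids the custom lambda entirely: mark the groups with more than one distinct level via transform('nunique'), then take one row per group with head (or tail for the last row):
multi = df.groupby('group')['level'].transform('nunique') > 1
out = df[multi].groupby('group').head(1)  # use .tail(1) for the last row instead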
