I have a Pandas DataFrame with a meaningful index and various groups of repeating rows. Suppose it looks like this:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 1, 1], [2, 3, 4], [2, 3, 4], [1, 1, 1], [1, 1, 1], [1, 1, 1], [3, 3, 3]], columns=["a", "b", "c"])
>>> df
a b c
0 1 1 1
1 2 3 4
2 2 3 4
3 1 1 1
4 1 1 1
5 1 1 1
6 3 3 3
I am trying to remove the repeated rows (apart from the first one in each repeating batch), but keep the index of the last row from the batch.
The result I am looking for is this (i.e. a new "last" column containing the index of the last repeated row from the batch, which will be equal to the index if there is no repeat):
>>> df2
last a b c
0 0 1 1 1
1 2 2 3 4
3 5 1 1 1
6 6 3 3 3
Notice that the [1, 1, 1] entries appear twice, and are treated as separate blocks.
I have tried various combinations of groupby, duplicated, etc., but without stumbling on the necessary formulation. This feels like it should be a fairly standard thing to want to do. Is there a straightforward way to achieve this for an arbitrary DataFrame?
Edit:
Note that I would like to preserve the original index from the first items in the batch, and add a new column called, say, last for the last index from the batch.
So in your case
out = df[~df.shift().ne(df).cumsum().duplicated(keep='last')]
Out[19]:
a b c
0 1 1 1
2 2 3 4
5 1 1 1
6 3 3 3
One way of doing this, similar to BENY's approach but using pandas.DataFrame.diff:
df[~df.diff().cumsum().duplicated(keep='last')]
a b c
0 1 1 1
2 2 3 4
5 1 1 1
6 3 3 3
Thanks to @BENY and @jab for your answers, which were very close to what I needed. I added the extra last index column with some simple tweaks as follows:
last_indices = df[~df.diff().cumsum().duplicated(keep='last')].index
df2 = df[~df.diff().cumsum().duplicated(keep='first')]
df2.insert(0, "last", last_indices)
This yields:
last a b c
0 0 1 1 1
1 2 2 3 4
3 5 1 1 1
6 6 3 3 3
Extension:
Although not requested in the question, a useful extension is to add a column containing the counts from each group. The following code achieves this (without relying on dense integer indexes):
count_groups = df.ne(df.shift()).cumsum().max(axis=1)
counts = count_groups.groupby(count_groups).agg("count")
df2.insert(1, "counts", counts.values)
Yielding this:
last counts a b c
0 0 1 1 1 1
1 2 2 2 3 4
3 5 3 1 1 1
6 6 1 3 3 3
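For completeness (not part of the answers above), the whole recipe can be collapsed into a single pass by labelling each run of identical consecutive rows; this is only a sketch, and it uses ne/shift instead of diff so that non-numeric columns also work (the block Series is my own helper name):
block = df.ne(df.shift()).any(axis=1).cumsum()   # run label for each row
first_mask = ~block.duplicated(keep="first")     # first row of each run
last_mask = ~block.duplicated(keep="last")       # last row of each run

df2 = df[first_mask].copy()                                       # keep first rows, original index intact
df2.insert(0, "last", df.index[last_mask])                        # index of the last row in each run
df2.insert(1, "counts", block.groupby(block).size().to_numpy())   # run lengths
The .copy() also avoids the SettingWithCopyWarning that insert can otherwise trigger on a filtered frame.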
Related
I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [4, 5, 2, 7, 4, 6],
'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
data.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
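As a side note, GroupBy.head can likely be applied directly instead of going through apply, and it is just as pipe-friendly; a minimal sketch with the question's column names:
out = (df.sort_values('B')
         .groupby('A')
         .head(1)
         .reset_index(drop=True))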
I found an answer that is a little more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this series result on the original data frame
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution, as written before, is:
df.loc[df.groupby('A')['B'].idxmin()]
If you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
it is probably because column B contains NaN values (that was my case). Applying dropna() first made it work:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
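An alternative sketch (same idea, not from the original answer) is to drop the NaN rows from column B before grouping, so idxmin never produces a missing label in the first place:
clean = df.dropna(subset=['B'])
clean.loc[clean.groupby('A')['B'].idxmin()]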
You can also use boolean indexing to select the rows where column B equals its group minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
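One difference worth noting, illustrated with made-up data: the transform approach keeps every row that ties for the group minimum, whereas idxmin returns a single label per group:
tied = pd.DataFrame({'A': [1, 1, 2], 'B': [2, 2, 4], 'C': [3, 9, 4]})
tied[tied['B'] == tied.groupby('A')['B'].transform('min')]   # keeps both tied rows for A == 1
tied.loc[tied.groupby('A')['B'].idxmin()]                    # keeps only the first tied row per group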
I've got a pandas df that contains values at various time points. I perform a groupby of these time points and values. I'm hoping to filter the output so that both groups contain values at each time point. If either group does not contain a value at a time point, I want to drop those rows.
Using the df below, there are values for Group A and Group B at various time points. However, time points 3, 4 and 6 only contain an item from one of Group A or Group B. When both groups aren't present at a time point, I want to drop those rows altogether.
The ordering matters, not the total amount. So if there are missing items for either group at a specific time point, I want to drop those rows.
Note: this df only contains at most one value per group at each time point, but my actual data could contain numerous. The main concern is dropping rows where at least one group is absent.
df1 = pd.DataFrame({
'Time' : [1,1,1,2,2,3,4,5,5,6],
'Group' : ['A','B','B','A','B','A','B','A','B','B'],
'Val_A' : [6,7,4,5,4,4,9,6,7,8],
'Val_B' : [1,2,2,3,2,1,2,1,4,9],
'Val_C' : [1,2,2,3,4,5,7,8,9,7],
})
Group_A = df1.loc[df1['Group'] == 'A']
Group_B = df1.loc[df1['Group'] == 'B']
Group_A = list(Group_A.groupby(['Time'])['Val_A'].apply(list))
Group_B = list(Group_B.groupby(['Time'])['Val_B'].apply(list))
print(df1)
print(Group_A)
print(Group_B)
Time Group Val_A Val_B Val_C
0 1 A 6 1 1
1 1 B 7 2 2
2 1 B 4 2 2
3 2 A 5 3 3
4 2 B 4 2 4
5 3 A 4 1 5
6 4 B 9 2 7
7 5 A 6 1 8
8 5 B 7 4 9
9 6 B 8 9 7
[[6], [5], [4], [6]]
[[2, 2], [2], [2], [4], [9]]
I can't use dropna or drop_duplicates. Furthermore, data may contain items for Group B and not Group A. So I'm hoping to find a function that can handle both instances.
Intended Output:
Time Group Val_A Val_B Val_C
0 1 A 6 1 1
1 1 B 7 2 2
2 1 B 4 2 2
3 2 A 5 3 3
4 2 B 4 2 4
7 5 A 6 1 8
8 5 B 7 4 9
[[6], [5], [6]]
[[2, 2], [2], [4]]
If you don't care about which row you drop, you could pick the first n rows in each group where n is the smallest number of rows in any group:
df1.groupby('Group').head(df1.groupby('Group')['Val_A'].count().min())
Or, if you only want time points that contain more than one row, you could do the following:
df1.groupby('Time').filter(lambda x: len(x['Val_A']) > 1)
Or, if you want to check that you have each group (e.g. A and B) at each point in time and they only appear once at that point
df1.groupby('Time').filter(lambda x: {'A','B'} == set(x['Group']) and len(x) == 2)
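Since the question mentions that the real data can contain several rows per group at a time point, a variation of the last filter (my own tweak, not from the answer) only checks that both groups are present, regardless of how many rows each contributes:
df1.groupby('Time').filter(lambda x: {'A', 'B'}.issubset(set(x['Group'])))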
Group by 'Time' and use set() to do a single comparison: the symmetric difference of the time points seen in each group gives the times to drop. (This assumes Group_A and Group_B are the filtered DataFrames from the question, before they are converted to lists.) Does this fit your intent?
mask = list(set(Group_A['Time'])^set(Group_B['Time']))
df1[~(df1['Time'].isin(mask))]
Time Group Val_A Val_B Val_C
0 1 A 6 1 1
1 1 B 7 2 2
2 1 B 4 2 2
3 2 A 5 3 3
4 2 B 4 2 4
7 5 A 6 1 8
8 5 B 7 4 9
Having a large DataFrame as follows:
userid user_mentions
1 [2, 3, 4]
1 [3]
2 NaN
2 [1,3]
3 [1,4,5]
3 [4]
The user_mentions column contains lists of userids that have been mentioned by each user. For example, the first line means:
user 1 has mentioned users 2, 3, and 4.
I need to create a mention network among the users in the userid column. That is, I want the number of times each user in the userid column has been mentioned by other users in the userid column. So basically, first I need something like this:
filtered = df[df['user_mentions'].isin(df['userid'].unique())]
But this doesn't work on a column of lists.
If I resolve the above issue, then I can groupby(['userid', 'user_mentions']).
EDIT
The final output should be:
Source Target Number
1 2 1
1 3 2
2 1 1
2 3 1
3 1 1
3 5 1
This isn't a task well suited to Pandas / NumPy. So I suggest you use collections.defaultdict to create a dictionary of counts, then construct a dataframe from the dictionary:
from collections import defaultdict

# Nested dict of counts: dd[source][target] -> number of mentions
dd = defaultdict(lambda: defaultdict(int))

for row in df.itertuples(index=False):
    vals = row.user_mentions
    if vals == vals:  # NaN != NaN, so this skips rows with missing mentions
        for val in vals:
            dd[row.userid][val] += 1

df = pd.DataFrame([(k, w, dd[k][w]) for k, v in dd.items() for w in v],
                  columns=['source', 'target', 'number'])
print(df)
source target number
0 1 2 1
1 1 3 2
2 1 4 1
3 2 1 1
4 2 3 1
5 3 1 1
6 3 4 2
7 3 5 1
Of course, you shouldn't put lists in Pandas series in the first place. It's a nested layer of pointers, which should be avoided if at all possible.
Following your edit, I would have to agree with @jpp.
To your (unedited) original question, in terms of gathering the number of mentions of each user, you can do:
df['counts'] = df['userid'].apply(lambda x: df['user_mentions'].dropna().sum().count(x))
df[['userid','counts']].groupby('userid').first()
Yields:
counts
userid
1 2
2 1
3 3
Here's one way.
# Remove the `NaN` rows
df = df.dropna()
# Construct a new DataFrame
df2 = pd.DataFrame(df.user_mentions.tolist(),
index=df.userid.rename('source')
).stack().astype(int).to_frame('target')
# Groupby + size
df2.groupby(['source', 'target']).size().rename('counts').reset_index()
source target counts
0 1 2 1
1 1 3 2
2 1 4 1
3 2 1 1
4 2 3 1
5 3 1 1
6 3 4 2
7 3 5 1
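On newer pandas (0.25 and later), DataFrame.explode offers a similar route; this sketch should reproduce the same counts as the answers above, with the column names taken from the question's frame:
edges = (df.dropna(subset=['user_mentions'])
           .explode('user_mentions')
           .rename(columns={'userid': 'source', 'user_mentions': 'target'}))
edges.groupby(['source', 'target']).size().rename('number').reset_index()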
I've consistently run into this issue of having to assign a unique ID to each group in a data set. I've used this when zero padding for RNNs, when generating graphs, and on many other occasions.
This can usually be done by concatenating the values in each pd.groupby column. However, it is often the case the number of columns that define a group, their dtype, or the value sizes make concatenation an impractical solution that needlessly uses up memory.
I was wondering if there was an easy way to assign a unique numeric ID to groups in pandas.
You just need ngroup, as in seeiespi's answer (or pd.factorize):
df.groupby('C').ngroup()
Out[322]:
0 0
1 0
2 2
3 1
4 1
5 1
6 1
7 2
8 2
dtype: int64
More options:
pd.factorize(df.C)[0]
Out[323]: array([0, 0, 1, 2, 2, 2, 2, 1, 1], dtype=int64)
df.C.astype('category').cat.codes
Out[324]:
0 0
1 0
2 2
3 1
4 1
5 1
6 1
7 2
8 2
dtype: int8
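A small note: ngroup numbers groups in sorted key order by default, which is why its labels differ from the pd.factorize output above; passing sort=False numbers groups in order of first appearance instead, matching factorize:
df.groupby('C', sort=False).ngroup()   # 0, 0, 1, 2, 2, 2, 2, 1, 1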
I managed a simple solution that I constantly reference and wanted to share:
df = pd.DataFrame({'A':[1,2,3,4,6,3,7,3,2],'B':[4,3,8,2,6,3,9,1,0], 'C':['a','a','c','b','b','b','b','c','c']})
df = df.sort_values('C')
df['gid'] = (df.groupby(['C']).cumcount()==0).astype(int)
df['gid'] = df['gid'].cumsum()
In [17]: df
Out[17]:
A B C gid
0 1 4 a 1
1 2 3 a 1
2 3 8 b 2
3 4 2 b 2
4 6 6 b 2
5 3 3 b 2
6 7 9 c 3
7 3 1 c 3
8 2 0 c 3
This might be a very simple question, but I am trying to understand how grouping and indexing work in pandas.
Let's say I have a DataFrame with the following data:
df = pd.DataFrame(data={
'p_id': [1, 1, 1, 2, 3, 3, 3, 4, 4],
'rating': [5, 3, 2, 2, 5, 1, 3, 4, 5]
})
Now, the index would be assigned automatically, so the DataFrame looks like:
p_id rating
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
When I try to group it by p_id, I get:
>> df[['p_id', 'rating']].groupby('p_id').count()
rating
p_id
1 3
2 1
3 3
4 2
I noticed that p_id now becomes an index for the grouped DataFrame, but the first row looks weird to me -- why does it have p_id index in it with empty rating?
I know how to fix it, kind of, if I do this:
>> df[['p_id', 'rating']].groupby('p_id', as_index=False).count()
p_id rating
0 1 3
1 2 1
2 3 3
3 4 2
Now I don't have this weird first row, but I have both the index and p_id.
So my question is: where is this extra row coming from when I don't use as_index=False, and is there a way to group the DataFrame and keep p_id as the index without having to deal with this extra row? If there are any docs I can read on this, that would also be greatly appreciated.
It's just an index name...
Demo:
In [46]: df
Out[46]:
p_id rating
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
In [47]: df.index.name = 'AAA'
Pay attention to the index name: AAA
In [48]: df
Out[48]:
p_id rating
AAA
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
You can get rid of it using rename_axis() method:
In [42]: df[['p_id', 'rating']].groupby('p_id').count().rename_axis(None)
Out[42]:
rating
1 3
2 1
3 3
4 2
There is no "extra row", it's simply how pandas visually renders a GroupBy object, i.e. how pandas.core.groupby.generic.DataFrameGroupBy.__str__ method renders a grouped dataframe object: rating is the column, but now p_id has now gone from being a column to being the (row) index.
Another reason they stagger them (i.e. the row with the column names, and the row with the index/multi-index name) is because the index can be a MultiIndex (if you grouped-by multiple columns).
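To make this concrete, a short sketch (names follow the question's frame): the grouped result is an ordinary DataFrame whose index happens to be named p_id, and that name is what prints on the seemingly empty line:
g = df.groupby('p_id').count()
g.index.name          # 'p_id' -- this is the "extra row"
g.rename_axis(None)   # drop the index name, as in the answer above
g.reset_index()       # or turn p_id back into a regular column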