I've consistently run into this issue of having to assign a unique ID to each group in a data set. I've used this when zero padding for RNNs, generating graphs, and on many other occasions.
This can usually be done by concatenating the values of the pd.groupby columns. However, it is often the case that the number of columns that define a group, their dtypes, or the value sizes make concatenation an impractical solution that needlessly uses up memory.
I was wondering if there was an easy way to assign a unique numeric ID to groups in pandas.
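For context, a minimal sketch of the string-concatenation workaround described above (the column names here are made up purely for illustration):
import pandas as pd

# Hypothetical example: the group is defined by two columns, 'city' and 'year'.
df = pd.DataFrame({'city': ['NY', 'NY', 'LA', 'LA'],
                   'year': [2020, 2021, 2020, 2020],
                   'value': [1, 2, 3, 4]})

# Every row now carries a full string copy of its key, which is what gets expensive.
df['group_key'] = df['city'].astype(str) + '_' + df['year'].astype(str)
print(df)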
You just need GroupBy.ngroup, as in seeiespi's answer (or pd.factorize):
df.groupby('C').ngroup()
Out[322]:
0 0
1 0
2 2
3 1
4 1
5 1
6 1
7 2
8 2
dtype: int64
More options:
pd.factorize(df.C)[0]
Out[323]: array([0, 0, 1, 2, 2, 2, 2, 1, 1], dtype=int64)
df.C.astype('category').cat.codes
Out[324]:
0 0
1 0
2 2
3 1
4 1
5 1
6 1
7 2
8 2
dtype: int8
I put together a simple solution that I constantly reference and wanted to share:
df = pd.DataFrame({'A':[1,2,3,4,6,3,7,3,2],'B':[4,3,8,2,6,3,9,1,0], 'C':['a','a','c','b','b','b','b','c','c']})
df = df.sort_values('C')
df['gid'] = (df.groupby(['C']).cumcount()==0).astype(int)
df['gid'] = df['gid'].cumsum()
In [17]: df
Out[17]:
A B C gid
0 1 4 a 1
1 2 3 a 1
3 4 2 b 2
4 6 6 b 2
5 3 3 b 2
6 7 9 b 2
2 3 8 c 3
7 3 1 c 3
8 2 0 c 3
I have a Pandas DataFrame with a meaningful index and various groups of repeating rows. Suppose it looks like this:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 1, 1], [2, 3, 4], [2, 3, 4], [1, 1, 1], [1, 1, 1], [1, 1, 1], [3, 3, 3]], columns=["a", "b", "c"])
>>> df
a b c
0 1 1 1
1 2 3 4
2 2 3 4
3 1 1 1
4 1 1 1
5 1 1 1
6 3 3 3
I am trying to remove the repeated rows (apart from the first one in each repeating batch), but keep the index of the last row from the batch.
The result I am looking for is this (i.e. a new "last" column containing the index of the last repeated row from the batch, which will be equal to the index if there is no repeat):
>>> df2
last a b c
0 0 1 1 1
1 2 2 3 4
3 5 1 1 1
6 6 3 3 3
Notice that the [1, 1, 1] entries appear twice, and are treated as separate blocks.
I have tried various combinations of groupby, duplicated, etc. but without stumbling on the necessary formulation. This feels like it should be a fairly standard thing to want to do. Is there a straightforward way to achieve this for an arbitrary DataFrame?
Edit:
Note that I would like to preserve the original index from the first items in the batch, and add a new column called, say, last for the last index from the batch.
So, in your case:
out = df[~df.shift().ne(df).cumsum().duplicated(keep='last')]
Out[19]:
a b c
0 1 1 1
2 2 3 4
5 1 1 1
6 3 3 3
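To see why this works, here is a sketch of the intermediate block labels, using the example DataFrame from the question:
import pandas as pd

df = pd.DataFrame([[1, 1, 1], [2, 3, 4], [2, 3, 4], [1, 1, 1],
                   [1, 1, 1], [1, 1, 1], [3, 3, 3]], columns=["a", "b", "c"])

# A row that differs from the previous row starts a new block, so the cumulative
# sum gives every row of a consecutive block the same label in each column.
blocks = df.shift().ne(df).cumsum()
print(blocks)

# duplicated(keep='last') then marks everything except the last row of each block.
print(df[~blocks.duplicated(keep='last')])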
One way of doing this, similar to BENY's approach but using pandas.DataFrame.diff:
df[~df.diff().cumsum().duplicated(keep='last')]
a b c
0 1 1 1
2 2 3 4
5 1 1 1
6 3 3 3
Thanks to @BENY and @jab for your answers, which were very close to what I needed. I added the extra last index column with some simple tweaks as follows:
last_indices = df[~df.diff().cumsum().duplicated(keep='last')].index
df2 = df[~df.diff().cumsum().duplicated(keep='first')]
df2.insert(0, "last", last_indices)
This yields:
last a b c
0 0 1 1 1
1 2 2 3 4
3 5 1 1 1
6 6 3 3 3
Extension:
Although not requested in the question, a useful extension is to add a column containing the counts from each group. The following code achieves this (without relying on dense integer indexes):
count_groups = df.ne(df.shift()).cumsum().max(axis=1)
counts = count_groups.groupby(count_groups).agg("count")
df2.insert(1, "counts", counts.values)
Yielding this:
last counts a b c
0 0 1 1 1 1
1 2 2 2 3 4
3 5 3 1 1 1
6 6 1 3 3 3
I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates which rows I will keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
I had a similar situation, but with a more complex column heading (e.g. "B val"), in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
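For illustration, a sketch of how this fits into a longer method chain via .pipe; the keep_n_smallest helper and the assign step are made up for the example, and GroupBy.head is used here as a shortcut that selects the same rows as apply(pd.DataFrame.head, n=...):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})

def keep_n_smallest(frame, by, col, n=1):
    # Sort first, then keep the first n rows of each group.
    return frame.sort_values(col).groupby(by).head(n)

result = (df
          .assign(D=lambda d: d['B'] * d['C'])   # some earlier step in the chain
          .pipe(keep_n_smallest, by='A', col='B')
          .reset_index(drop=True))
print(result)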
I found an answer that is a little more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First, we get the minimum values of B as a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this Series onto the original DataFrame:
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we keep only the rows where B equals B_min and drop B_min, since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can use sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution, as written before, is:
df.loc[df.groupby('A')['B'].idxmin()]
But if you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were NaN values in column B, so I used dropna() and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to select the rows where column B equals the group minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
I have the following column:
column
0 10
1 10
2 8
3 8
4 6
5 6
My goal is to find the total number of unique values (3 in this case) and create a new column that looks like the following:
new_column
0 3
1 3
2 2
3 2
4 1
5 1
The numbering starts from the number of unique values (3), and the same number is repeated while the current row has the same value as the previous row in the original column. The number decreases whenever the value changes. All unique values in the original column have the same number of rows (2 rows per unique value in this case).
My solution was to groupby the original column and create a new list like below:
i = 1
new_time = []
for j, v in df.groupby('column'):
    new_time.append([i] * 2)
    i = i + 1
Then I'd flatten the list and sort it in decreasing order. Is there any simpler solution?
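For reference, a runnable sketch of the loop approach just described, with the flatten-and-sort step filled in as described:
import pandas as pd

df = pd.DataFrame({'column': [10, 10, 8, 8, 6, 6]})

i = 1
new_time = []
for j, v in df.groupby('column'):
    new_time.append([i] * len(v))  # len(v) rather than hard-coding 2
    i = i + 1

# Flatten the list of lists, sort in decreasing order, then attach it.
df['new_column'] = sorted((x for sub in new_time for x in sub), reverse=True)
print(df)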
Thanks.
pd.factorize
i, u = pd.factorize(df.column)
df.assign(new=len(u) - i)
column new
0 10 3
1 10 3
2 8 2
3 8 2
4 6 1
5 6 1
dict.setdefault
d = {}
for k in df.column:
    d.setdefault(k, len(d))
df.assign(new=len(d) - df.column.map(d))
Use GroupBy.ngroup with ascending=False:
df.groupby('column', sort=False).ngroup(ascending=False)+1
0 3
1 3
2 2
3 2
4 1
5 1
dtype: int64
For a DataFrame that looks like this,
df = pd.DataFrame({'column': [10, 10, 8, 8, 10, 10]})
...where only consecutive values are to be grouped, you'll need to modify your grouper:
(df.groupby(df['column'].ne(df['column'].shift()).cumsum(), sort=False)
   .ngroup(ascending=False)
   .add(1))
0 3
1 3
2 2
3 2
4 1
5 1
dtype: int64
Actually, we can use rank with method='dense', i.e.
dense: like ‘min’, but rank always increases by 1 between groups
df['column'].rank(method='dense')
0 3.0
1 3.0
2 2.0
3 2.0
4 1.0
5 1.0
The rank version of @cs95's solution would be:
df['column'].ne(df['column'].shift()).cumsum().rank(method='dense',ascending=False)
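As a quick check of the difference, here is a sketch running both variants on the consecutive-duplicates DataFrame from the answer above:
import pandas as pd

df = pd.DataFrame({'column': [10, 10, 8, 8, 10, 10]})

# Plain dense rank treats both 10-blocks as the same group:
print(df['column'].rank(method='dense'))
# 2.0, 2.0, 1.0, 1.0, 2.0, 2.0

# Ranking the consecutive-block labels keeps the two 10-blocks separate:
print(df['column'].ne(df['column'].shift()).cumsum().rank(method='dense', ascending=False))
# 3.0, 3.0, 2.0, 2.0, 1.0, 1.0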
Try with unique and map
df.column.map(dict(zip(df.column.unique(),reversed(range(df.column.nunique())))))+1
Out[350]:
0 3
1 3
2 2
3 2
4 1
5 1
Name: column, dtype: int64
IIUC, you want the group ID of same-value consecutive groups in reversed order. If so, I think this should work too:
df.column.nunique() - df.column.ne(df.column.shift()).cumsum().sub(1)
Out[691]:
0 3
1 3
2 2
3 2
4 1
5 1
Name: column, dtype: int32
I want to sort a subset of a dataframe (say, between indexes i and j) according to some value. I tried
df2=df.iloc[i:j].sort_values(by=...)
df.iloc[i:j]=df2
There is no problem with the first line, but nothing happens when I run the second one (not even an error). How should I do this? (I also tried the update function, but that didn't work either.)
I believe you need to assign back to the filtered DataFrame, converting to a NumPy array with .values to avoid index alignment:
df = pd.DataFrame({'A': [1,2,3,4,3,2,1,4,1,2]})
print (df)
A
0 1
1 2
2 3
3 4
4 3
5 2
6 1
7 4
8 1
9 2
i = 2
j = 7
df.iloc[i:j] = df.iloc[i:j].sort_values(by='A').values
print (df)
A
0 1
1 2
2 1
3 2
4 3
5 3
6 4
7 4
8 1
9 2
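For completeness, a small sketch of why the plain assignment appears to do nothing: without .values the sorted rows keep their original index labels, so pandas aligns them straight back to where they came from.
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 3, 2, 1, 4, 1, 2]})
i, j = 2, 7

# The sorted slice keeps its original index labels, so the assignment aligns
# each row back to its old position and the frame is unchanged.
df.iloc[i:j] = df.iloc[i:j].sort_values(by='A')
print(df['A'].tolist())   # still [1, 2, 3, 4, 3, 2, 1, 4, 1, 2]

# Stripping the index with .values assigns purely by position.
df.iloc[i:j] = df.iloc[i:j].sort_values(by='A').values
print(df['A'].tolist())   # [1, 2, 1, 2, 3, 3, 4, 4, 1, 2]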
This might be a very simple question, but I am trying to understand how grouping and indexing work in pandas.
Let's say I have a DataFrame with the following data:
df = pd.DataFrame(data={
    'p_id': [1, 1, 1, 2, 3, 3, 3, 4, 4],
    'rating': [5, 3, 2, 2, 5, 1, 3, 4, 5]
})
Now, the index would be assigned automatically, so the DataFrame looks like:
p_id rating
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
When I try to group it by p_id, I get:
>> df[['p_id', 'rating']].groupby('p_id').count()
rating
p_id
1 3
2 1
3 3
4 2
I noticed that p_id now becomes an index for the grouped DataFrame, but the first row looks weird to me: why does it have the p_id index in it with an empty rating?
I know how to fix it, kind of, if I do this:
>> df[['p_id', 'rating']].groupby('p_id', as_index=False).count()
p_id rating
0 1 3
1 2 1
2 3 3
3 4 2
Now I don't have this weird first column, but I have both index and p_id.
So my question is: where is this extra row coming from when I don't use as_index=False, and is there a way to group the DataFrame and keep p_id as the index without having to deal with this extra row? If there are any docs I can read on this, that would also be greatly appreciated.
It's just an index name...
Demo:
In [46]: df
Out[46]:
p_id rating
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
In [47]: df.index.name = 'AAA'
Pay attention to the index name: AAA
In [48]: df
Out[48]:
p_id rating
AAA
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
You can get rid of it using the rename_axis() method:
In [42]: df[['p_id', 'rating']].groupby('p_id').count().rename_axis(None)
Out[42]:
rating
1 3
2 1
3 3
4 2
There is no "extra row"; it is simply how pandas renders a DataFrame whose index has a name: rating is the column, while p_id has gone from being a column to being the (row) index.
Another reason they are staggered (i.e. the row with the column names and the row with the index/MultiIndex names) is that the index can be a MultiIndex (if you grouped by multiple columns).
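A brief sketch of that multi-column case (the extra 'source' column is made up purely for illustration):
import pandas as pd

df = pd.DataFrame(data={
    'p_id': [1, 1, 1, 2, 3, 3, 3, 4, 4],
    'source': ['a', 'b', 'a', 'b', 'a', 'a', 'b', 'b', 'a'],
    'rating': [5, 3, 2, 2, 5, 1, 3, 4, 5]
})

# Grouping by two columns produces a MultiIndex; its level names
# ('p_id', 'source') sit on the staggered row below the column header.
print(df.groupby(['p_id', 'source']).count())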