Pandas indexing behavior after grouping: do I see an "extra row"? - python

This might be a very simple question, but I am trying to understand how grouping and indexing work in pandas.
Let's say I have a DataFrame with the following data:
df = pd.DataFrame(data={
    'p_id': [1, 1, 1, 2, 3, 3, 3, 4, 4],
    'rating': [5, 3, 2, 2, 5, 1, 3, 4, 5]
})
Now, the index would be assigned automatically, so the DataFrame looks like:
p_id rating
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
When I try to group it by p_id, I get:
>>> df[['p_id', 'rating']].groupby('p_id').count()
rating
p_id
1 3
2 1
3 3
4 2
I noticed that p_id now becomes the index of the grouped DataFrame, but the first row looks weird to me -- why is there a row showing p_id with an empty rating cell?
I know how to fix it, kind of, if I do this:
>>> df[['p_id', 'rating']].groupby('p_id', as_index=False).count()
p_id rating
0 1 3
1 2 1
2 3 3
3 4 2
Now I don't have this weird extra row, but I have both a numeric index and a p_id column.
So my question is: where is this extra row coming from when I don't use as_index=False, and is there a way to group the DataFrame and keep p_id as the index while not having to deal with this extra row? If there are any docs I can read on this, that would also be greatly appreciated.

It's just an index name...
Demo:
In [46]: df
Out[46]:
p_id rating
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
In [47]: df.index.name = 'AAA'
Pay attention to the index name: AAA
In [48]: df
Out[48]:
p_id rating
AAA
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
You can get rid of it using the rename_axis() method:
In [42]: df[['p_id', 'rating']].groupby('p_id').count().rename_axis(None)
Out[42]:
rating
1 3
2 1
3 3
4 2
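Since rename_axis(None) just clears the index name (and returns a copy), you could also clear it in place, assuming the grouped result is bound to a name first (a minimal sketch):
res = df[['p_id', 'rating']].groupby('p_id').count()
res.index.name = None  # in-place equivalent of rename_axis(None)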

There is no "extra row", it's simply how pandas visually renders a GroupBy object, i.e. how pandas.core.groupby.generic.DataFrameGroupBy.__str__ method renders a grouped dataframe object: rating is the column, but now p_id has now gone from being a column to being the (row) index.
Another reason they stagger them (i.e. the row with the column names, and the row with the index/multi-index name) is because the index can be a MultiIndex (if you grouped-by multiple columns).
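For example, grouping by two keys produces a MultiIndex whose two names share that second header line (a small sketch with a hypothetical year column, not from the question):
import pandas as pd

df2 = pd.DataFrame({
    'p_id': [1, 1, 2, 2],
    'year': [2020, 2021, 2020, 2021],  # hypothetical second grouping key
    'rating': [5, 3, 2, 4],
})
print(df2.groupby(['p_id', 'year']).count())
#            rating
# p_id year
# 1    2020       1
#      2021       1
# 2    2020       1
#      2021       1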

Related

Group Pandas DataFrame repeated rows, preserving last index

I have a Pandas DataFrame with a meaningful index and various groups of repeating rows. Suppose it looks like this:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 1, 1], [2, 3, 4], [2, 3, 4], [1, 1, 1], [1, 1, 1], [1, 1, 1], [3, 3, 3]], columns=["a", "b", "c"])
>>> df
a b c
0 1 1 1
1 2 3 4
2 2 3 4
3 1 1 1
4 1 1 1
5 1 1 1
6 3 3 3
I am trying to remove the repeated rows (apart from the first one in each repeating batch), but keep the index of the last row from the batch.
The result I am looking for is this (i.e. a new "last" column containing the index of the last repeated row from the batch, which will be equal to the index if there is no repeat):
>>> df2
last a b c
0 0 1 1 1
1 2 2 3 4
3 5 1 1 1
6 6 3 3 3
Notice that the [1, 1, 1] entries appear twice, and are treated as separate blocks.
I have tried various combinations of groupby, duplicated, etc. but without stumbling on the necessary formulation. This feels like it should be a fairly standard thing to want to do. Is there a straightforward way to achieve this for an arbitrary DataFrame?
Edit:
Note that I would like to preserve the original index from the first items in the batch, and add a new column called, say, last for the last index from the batch.
So, in your case:
out = df[~df.shift().ne(df).cumsum().duplicated(keep='last')]
Out[19]:
a b c
0 1 1 1
2 2 3 4
5 1 1 1
6 3 3 3
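To see why this works, it can help to unpack the chain (a sketch, assuming df as defined in the question):
changed = df.shift().ne(df)  # True wherever a row differs from the row above
blocks = changed.cumsum()    # per-column running change count; rows of one
                             # batch share identical counters in every column
out = df[~blocks.duplicated(keep='last')]  # drop all but the last row per batch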
One way of doing this, similar to BENY's approach but using pandas.DataFrame.diff (note that diff only works on numeric columns, whereas shift/ne also handles non-numeric data):
df[~df.diff().cumsum().duplicated(keep='last')]
a b c
0 1 1 1
2 2 3 4
5 1 1 1
6 3 3 3
Thanks to @BENY and @jab for your answers, which were very close to what I needed. I added the extra last index column with some simple tweaks as follows:
last_indices = df[~df.diff().cumsum().duplicated(keep='last')].index
df2 = df[~df.diff().cumsum().duplicated(keep='first')]
df2.insert(0, "last", last_indices)
This yields:
last a b c
0 0 1 1 1
1 2 2 3 4
3 5 1 1 1
6 6 3 3 3
Extension:
Although not requested in the question, a useful extension is to add a column containing the counts from each group. The following code achieves this (without relying on dense integer indexes):
count_groups = df.ne(df.shift()).cumsum().max(axis=1)
counts = count_groups.groupby(count_groups).agg("count")
df2.insert(1, "counts", counts.values)
Yielding this:
last counts a b c
0 0 1 1 1 1
1 2 2 2 3 4
3 5 3 1 1 1
6 6 1 3 3 3

Filtering DataFrame on groups with at least one matching criterion

I'm working with a DataFrame having the following structure:
import pandas as pd
df = pd.DataFrame({'group': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 5],
                   'id': ['A_410', 'B_171', 'X_218', 'A_685', 'B_305', 'C_407', 'X_202', 'B_989', 'C_616', 'X_267', 'A_112', 'C_358'],
                   'active': [-1, -1, 999, -1, -1, 1, 999, 1, 1, 999, -1, 1]})
print(df)
group id active
0 1 A_410 -1
1 1 B_171 -1
2 1 X_218 999
3 2 A_685 -1
4 2 B_305 -1
5 2 C_407 1
6 2 X_202 999
7 3 B_989 1
8 3 C_616 1
9 3 X_267 999
10 4 A_112 -1
11 5 C_358 1
My goal is simple enough to formulate: I want to view only the groups having at least one active id in them (an active id is flagged with a 1).
The resulting DataFrame should look like this:
group id active
1 2 A_685 -1
2 2 B_305 -1
3 2 C_407 1
4 2 X_202 999
5 3 B_989 1
6 3 C_616 1
7 3 X_267 999
8 5 C_358 1
Unfortunately, I don't know how to formulate this in Python/Pandas syntax. I searched previous post using appropriate keywords but could not find a similar problem. Any help would be appreciated.
Compare the column to 1 and test whether each group contains at least one True with GroupBy.transform('any'), then filter by boolean indexing:
df = df[df['active'].eq(1).groupby(df['group']).transform('any')]
print (df)
group id active
3 2 A_685 -1
4 2 B_305 -1
5 2 C_407 1
6 2 X_202 999
7 3 B_989 1
8 3 C_616 1
9 3 X_267 999
11 5 C_358 1
Another solution: find all groups containing a 1 and filter the original group column with Series.isin:
df = df[df['group'].isin(df.loc[df['active'].eq(1), 'group'])]
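A third option, arguably the most readable though typically slower on large frames, is GroupBy.filter, which keeps all rows of every group for which the function returns True (a sketch on the same df):
df = df.groupby('group').filter(lambda g: g['active'].eq(1).any())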

Replacing all values in a Pandas column, with no conditions

I have a Pandas dataframe with a column full of values I want to replace with another value, unconditionally.
For the purpose of this question, let's assume I don't know how long this column is and I don't want to iterate over its values.
Using .replace() is not appropriate since I don't know which values are in that column: I want to replace all values, unconditionally.
Using df.loc[<row selection>, <column selection>] is not appropriate since there is no row selection logic: I want all the rows, and simply writing True (as in data.loc[True, 'ColumnName'] = new_value) raises KeyError(True,). I tried data.loc[1, 'ColumnName'] = new_value and it works, but it really looks like a shitty solution.
If I knew the len() of data['ColumnName'] I could create an array of that size, filled with as many copies of my new_value, and simply replace the column with that array. Ten lines of code to do something that should be simpler than the conditional case (which takes one line): this is also not OK.
How can I tell Pandas in 1 line: all the values in ColumnName are now new_value? I refuse to believe there's no way to tell Pandas not to bother me with conditions.
As I explained in the comment, you don't need to create an array.
Let's say you have df:
InvoiceNO Month Year Size
0 1 1 2 7
1 2 1 2 8
2 3 2 2 11
3 4 3 2 9
4 5 7 2 8.5
...and you want to change all values in InvoiceNO to 1234:
df['InvoiceNO'] = 1234
Output:
InvoiceNO Month Year Size
0 1234 1 2 7
1 1234 1 2 8
2 1234 2 2 11
3 1234 3 2 9
4 1234 7 2 8.5
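The scalar is broadcast to every row, however long the column is. If you would rather not modify df in place, assign returns a new DataFrame instead (a sketch):
df2 = df.assign(InvoiceNO=1234)  # df itself is left unchanged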
import pandas as pd
df = pd.DataFrame(
    {'num1': [3, 5, 9, 9, 14, 1],
     'num2': [3, 5, 9, 9, 14, 1]},
    index=[0, 1, 2, 3, 4, 5])
print(df)
print('\n')
df['num1'] = 100
print(df)
df['num1'] = 'Hi'
print('\n')
print(df)
The output is
num1 num2
0 3 3
1 5 5
2 9 9
3 9 9
4 14 14
5 1 1
num1 num2
0 100 3
1 100 5
2 100 9
3 100 9
4 100 14
5 100 1
num1 num2
0 Hi 3
1 Hi 5
2 Hi 9
3 Hi 9
4 Hi 14
5 Hi 1
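Note the contrast with assigning a sequence: a scalar broadcasts, but a list or array must match the column length exactly, which is why the array-building workaround from the question is unnecessary (a sketch):
df['num1'] = ['Hi'] * len(df)  # works, but the scalar form above is simpler
# df['num1'] = ['Hi', 'Bye']   # ValueError: length of values does not match length of index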

Assign Unique Numeric Group IDs to Groups in Pandas [duplicate]

I've consistently run into this issue of having to assign a unique ID to each group in a data set. I've used this when zero padding for RNN's, generating graphs, and many other occasions.
This can usually be done by concatenating the values in each pd.groupby column. However, it is often the case the number of columns that define a group, their dtype, or the value sizes make concatenation an impractical solution that needlessly uses up memory.
I was wondering if there was an easy way to assign a unique numeric ID to groups in pandas.
You just need GroupBy.ngroup, applied to the data from seeiespi's answer (or pd.factorize):
df.groupby('C').ngroup()
Out[322]:
0 0
1 0
2 2
3 1
4 1
5 1
6 1
7 2
8 2
dtype: int64
More options:
pd.factorize(df.C)[0]
Out[323]: array([0, 0, 1, 2, 2, 2, 2, 1, 1], dtype=int64)
df.C.astype('category').cat.codes
Out[324]:
0 0
1 0
2 2
3 1
4 1
5 1
6 1
7 2
8 2
dtype: int8
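Note that the numbering differs: ngroup and cat.codes follow sorted key order by default (a -> 0, b -> 1, c -> 2), while factorize numbers groups by order of appearance. Passing sort=False makes ngroup match factorize (a sketch):
df.groupby('C', sort=False).ngroup().values
# array([0, 0, 1, 2, 2, 2, 2, 1, 1]) -- same as pd.factorize(df.C)[0]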
I came up with a simple solution that I constantly reference and wanted to share:
df = pd.DataFrame({'A':[1,2,3,4,6,3,7,3,2],'B':[4,3,8,2,6,3,9,1,0], 'C':['a','a','c','b','b','b','b','c','c']})
df = df.sort_values('C')
df['gid'] = (df.groupby(['C']).cumcount()==0).astype(int)
df['gid'] = df['gid'].cumsum()
In [17]: df
Out[17]:
A B C gid
0 1 4 a 1
1 2 3 a 1
2 3 8 b 2
3 4 2 b 2
4 6 6 b 2
5 3 3 b 2
6 7 9 c 3
7 3 1 c 3
8 2 0 c 3
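For what it's worth, on the sorted frame the two cumcount/cumsum steps collapse to a single ngroup call (a sketch; the +1 merely matches the 1-based gid above):
df['gid'] = df.groupby('C').ngroup() + 1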

Python - Get group names from aggregated results in pandas

I have a dataframe like this:
minute values
0 1 3
1 2 4
2 1 1
3 4 6
4 3 7
5 2 2
When I apply
df.groupby('minute').sum().sort_values('values', ascending=False)
This gives:
values
minute
3 7
2 6
4 6
1 4
I want to get the first two values of the minute column as a list, like [3, 2]. How can I access the values in the minute column?
If what you want is the values from the minute column in the grouped dataframe (which would be the index column as well), you can use DataFrame.index to access that column. Example -
grouped = df.groupby('minute').sum().sort_values('values', ascending=False)
grouped.index[:2]
If you really want it as a list, you can use .tolist() to convert it to a list. Example -
grouped.index[:2].tolist()
Demo -
In [3]: df
Out[3]:
minute values
0 1 3
1 2 4
2 1 1
3 4 6
4 3 7
5 2 2
In [4]: grouped = df.groupby('minute').sum().sort_values('values', ascending=False)
In [5]: grouped.index[:2]
Out[5]: Int64Index([3, 2], dtype='int64', name='minute')
In [6]: grouped.index[:2].tolist()
Out[6]: [3, 2]
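For just the top two group keys, nlargest gives a compact one-liner in current pandas (a sketch on the same df):
In [7]: df.groupby('minute')['values'].sum().nlargest(2).index.tolist()
Out[7]: [3, 2]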
