Python Pandas groupby: Group A vs non-Group A?

Let's say the data looks like:

df = pd.DataFrame({'Group': ['A', 'B', 'A', 'B', 'C'],
                   'Value': [1, 4, 3, 2, 3]})

  Group  Value
0     A      1
1     B      4
2     A      3
3     B      2
4     C      3
Normally, when grouping by "Group" and taking the sum, I would get:

df.groupby(by="Group").agg(["sum"])

      Value
        sum
Group
A         4
B         6
C         3
Is there a way to get "Group A" vs "non-Group A", so something like:

df.groupby(by="Group A vs non-Group A").agg(["sum"])

       Value
         sum
Group
A          4
non-A      9

Thanks everyone!

Use groupby and replace:

In [566]: df.groupby(
     ...:     df.Group.eq('A').replace({True: 'A', False: 'non-A'})
     ...: )['Value'].sum().reset_index()
Out[566]:
   Group  Value
0      A      4
1  non-A      9

Details:

In [567]: df.Group.eq('A').replace({True: 'A', False: 'non-A'})
Out[567]:
0        A
1    non-A
2        A
3    non-A
4    non-A
Name: Group, dtype: object

In [568]: df.groupby(df.Group.eq('A').replace({True: 'A', False: 'non-A'}))['Value'].sum()
Out[568]:
Group
A        4
non-A    9
Name: Value, dtype: int64
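An equivalent sketch that builds the "A"/"non-A" labels with numpy.where instead of replace; pandas accepts any array of group labels aligned with the frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'B', 'A', 'B', 'C'],
                   'Value': [1, 4, 3, 2, 3]})

# Label each row 'A' or 'non-A', then group by that label array.
label = np.where(df['Group'].eq('A'), 'A', 'non-A')
out = df.groupby(label)['Value'].sum()
print(out)
```

This avoids building an intermediate boolean Series and makes the two-bucket intent explicit.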

Get last observations from Pandas

Assuming the following dataframe:

  variable  value
0        A     12
1        A     11
2        B      4
3        A      2
4        B      1
5        B      4

I want to extract the last observation for each variable. In this case, it would give me:

  variable  value
3        A      2
5        B      4

How would you do this in the most pandas/Pythonic way?
I'm not worried about performance. Clarity and conciseness are important.
The best way I came up with (note that DataFrame.append has no inplace argument and has since been removed from pandas, so the loop uses pd.concat):

df = pd.DataFrame({'variable': ['A', 'A', 'B', 'A', 'B', 'B'],
                   'value': [12, 11, 4, 2, 1, 4]})
variables = df['variable'].unique()
new_df = df.iloc[0:0]  # empty frame with the same columns
for v in variables:
    new_df = pd.concat([new_df, df[df['variable'] == v].tail(1)])
Use drop_duplicates:

new_df = df.drop_duplicates('variable', keep='last')

Out[357]:
  variable  value
3        A      2
5        B      4
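Another idiomatic option, if you prefer to phrase it as a grouping operation: GroupBy.tail(1) returns the last row of each group while preserving the original index labels. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'variable': ['A', 'A', 'B', 'A', 'B', 'B'],
                   'value': [12, 11, 4, 2, 1, 4]})

# Last row per group; rows keep their original index (3 and 5 here).
last = df.groupby('variable').tail(1)
print(last)
```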

How to reorder rows of a dataframe based on values in a column

I have a dataframe like this:

A  B  C  D
b  3  3  4
a  1  2  1
a  1  2  1
d  4  4  1
d  1  2  1
c  4  5  6

Now I hope to reorder the rows based on the values in column A.
I don't want to sort the values but to reorder them with a specific order like ['b', 'd', 'c', 'a'].
What I expect is:

A  B  C  D
b  3  3  4
d  4  4  1
d  1  2  1
c  4  5  6
a  1  2  1
a  1  2  1
This is a good use case for pd.Categorical, since you have ordered categories. Just make that column a categorical and mark ordered=True. Then, sort_values should do the rest.

df['A'] = pd.Categorical(df.A, categories=['b', 'd', 'c', 'a'], ordered=True)
df.sort_values('A')

If you want to keep your column as is, you can just use loc and the indexes.

df.loc[pd.Series(pd.Categorical(df.A,
                                categories=['b', 'd', 'c', 'a'],
                                ordered=True))
         .sort_values()
         .index]
Use a dictionary mapping for the order of the strings, then sort the mapped values and reindex:

order = ['b', 'd', 'c', 'a']
df = df.reindex(df['A'].map(dict(zip(order, range(len(order))))).sort_values().index)
print(df)

   A  B  C  D
0  b  3  3  4
3  d  4  4  1
4  d  1  2  1
5  c  4  5  6
1  a  1  2  1
2  a  1  2  1
Without changing the datatype of A, you can set 'A' as the index and select rows in the desired order defined by sk:

sk = ['b', 'd', 'c', 'a']
df.set_index('A').loc[sk].reset_index()

Or use a temporary column for sorting:

sk = ['b', 'd', 'c', 'a']
(
    df.assign(S=df.A.map({v: k for k, v in enumerate(sk)}))
      .sort_values(by='S')
      .drop('S', axis=1)
)
I'm taking the solution provided by rafaelc a step further. If you want to do it in a chained process, here is how you'd do it:

df = (
    df
    .assign(A=lambda x: pd.Categorical(x['A'], categories=['b', 'd', 'c', 'a'], ordered=True))
    .sort_values('A')
)
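If you're on pandas 1.1 or newer, sort_values also accepts a key callable, which lets you express the same custom order without a helper column or a dtype change. A sketch, assuming the same example frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['b', 'a', 'a', 'd', 'd', 'c'],
                   'B': [3, 1, 1, 4, 1, 4],
                   'C': [3, 2, 2, 4, 2, 5],
                   'D': [4, 1, 1, 1, 1, 6]})

order = ['b', 'd', 'c', 'a']
rank = {v: i for i, v in enumerate(order)}

# key= receives the whole column; map each label to its rank in `order`.
out = df.sort_values('A', key=lambda s: s.map(rank))
print(out)
```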

Pandas - aggregate over inconsistent values types (string vs list)

Given the following DataFrame, I try to aggregate over columns 'A' and 'C': for 'A', count unique appearances of the strings, and for 'C', sum the values.
The problem arises when some of the samples in 'A' are actually lists of those strings.
Here's a simplified example:

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2],
                   'A' : ['a', 'a', 'a', 'b', ['b', 'c', 'd'], 'a', 'a', ['a', 'b', 'c']],
                   'C' : [1, 2, 15, 5, 13, 6, 7, 1]})
df
Out[100]:
   ID          A   C
0   1          a   1
1   1          a   2
2   1          a  15
3   1          b   5
4   1  [b, c, d]  13
5   2          a   6
6   2          a   7
7   2  [a, b, c]   1

aggs = {'A': lambda x: x.nunique(dropna=True),
        'C': 'sum'}

# This will result in an error: TypeError: unhashable type: 'list'
agg_df = df.groupby('ID').agg(aggs)
I'd like the following output:

print(agg_df)
    A   C
ID
1   4  36
2   3  14

This is because for ID = 1 we had 'a', 'b', 'c' and 'd', and for ID = 2 we had 'a', 'b' and 'c'.
One solution is to split your problem into two parts. First flatten your dataframe to ensure df['A'] consists only of strings, then concatenate a couple of GroupBy operations.

Step 1: Flatten your dataframe
You can use itertools.chain and numpy.repeat to chain and repeat values as appropriate:

import numpy as np
from itertools import chain

A = df['A'].apply(lambda x: [x] if not isinstance(x, list) else x)
lens = A.map(len)
res = pd.DataFrame({'ID': np.repeat(df['ID'], lens),
                    'A': list(chain.from_iterable(A)),
                    'C': np.repeat(df['C'], lens)})
print(res)

#    A   C  ID
# 0  a   1   1
# 1  a   2   1
# 2  a  15   1
# 3  b   5   1
# 4  b  13   1
# 4  c  13   1
# 4  d  13   1
# 5  a   6   2
# 6  a   7   2
# 7  a   1   2
# 7  b   1   2
# 7  c   1   2
Step 2: Concatenate GroupBy results on the original and the flattened frame

agg_df = pd.concat([res.groupby('ID')['A'].nunique(),
                    df.groupby('ID')['C'].sum()], axis=1)

print(agg_df)
#     A   C
# ID
# 1   4  36
# 2   3  14
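On pandas 0.25 or newer, DataFrame.explode can replace the manual chain/repeat flattening: it expands list entries into one row per element and passes scalars through. 'C' is still summed on the original frame so the repeated rows don't double-count. A sketch under that version assumption:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2],
                   'A': ['a', 'a', 'a', 'b', ['b', 'c', 'd'],
                         'a', 'a', ['a', 'b', 'c']],
                   'C': [1, 2, 15, 5, 13, 6, 7, 1]})

# explode('A') flattens the lists; nunique then counts distinct strings.
# 'C' is aggregated on the un-exploded frame to avoid double-counting.
agg_df = pd.concat([df.explode('A').groupby('ID')['A'].nunique(),
                    df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
```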

How to do group by according to sorted value in Pandas, Python?

I have a data set like:

Type1  Value
A      1
B      6
C      4
A      3
C      1
B      2

For each element in Type1, I want to sum over Value and then display the result sorted by the summed value.
I want my result like:

Type1  Value
A      4
C      5
B      8
Use DataFrame.groupby:

df = pd.DataFrame([['A', 'B', 'C', 'A', 'C', 'B'], [1, 6, 4, 3, 1, 2]], index=['Type1', 'Value']).T
df2 = df.groupby('Type1').sum()

This gives:

      Value
Type1
A         4
B         8
C         5

To order the rows by the summed value, as in your expected output, use df2.sort_values('Value'); to order by the group labels instead, use df2.sort_index().
If you want to turn Type1 back into a column, you can do df3 = df2.reset_index().
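Put together, the whole pipeline (sum, sort by the summed value, restore Type1 as a column) can be sketched as one chain:

```python
import pandas as pd

df = pd.DataFrame({'Type1': ['A', 'B', 'C', 'A', 'C', 'B'],
                   'Value': [1, 6, 4, 3, 1, 2]})

# Sum per group, order rows by the summed Value, move Type1 back to a column.
out = df.groupby('Type1')['Value'].sum().sort_values().reset_index()
print(out)
```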

python pandas groupby filter

Concretely, say I have a DataFrame like this:

appid  mac_id  count
1      a       1
2      b       1
2      c       1
3      d       1
3      e       1

And I also have a list:

mac_list = ['b', 'd', 'e']

I want to group this data frame on appid and, for every group, keep count only where mac_id is in mac_list. Finally, sum(count) for every group.
For this DataFrame the result is:

appid  count
1      0
2      1
3      2

How can I do this with Pandas?
>>> df = pd.DataFrame({"appid": [1, 2, 2, 3, 3], "mac_id": ['a', 'b', 'c', 'd', 'e'], "count": [1, 1, 1, 1, 1]})
>>> summer = lambda x: x[x["mac_id"].isin(mac_list)].sum()
>>> df.groupby("appid").apply(summer)["count"]
appid
1    0
2    1
3    2
Name: count, dtype: object
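A vectorized alternative that avoids the per-group apply: zero out the counts whose mac_id is not in mac_list, then sum per appid. Groups with no matching rows keep a 0 rather than disappearing, which matches the expected output. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'appid': [1, 2, 2, 3, 3],
                   'mac_id': ['a', 'b', 'c', 'd', 'e'],
                   'count': [1, 1, 1, 1, 1]})
mac_list = ['b', 'd', 'e']

# where() keeps counts for rows in mac_list and replaces the rest with 0,
# so every appid still appears in the grouped sum.
out = df['count'].where(df['mac_id'].isin(mac_list), 0).groupby(df['appid']).sum()
print(out)
```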
