Concretely, say I have a DataFrame like this:
appid  mac_id  count
1      a       1
2      b       1
2      c       1
3      d       1
3      e       1
And I also have a list:
mac_list = ['b', 'd', 'e']
I want to group this DataFrame on appid, keep in each group only the rows whose mac_id is in mac_list, and finally sum count for every group.
For this DataFrame the result is:
appid  count
1      0
2      1
3      2
How can I do this with Pandas?
>>> df = pd.DataFrame({"appid": [1,2,2,3,3], "mac_id": ['a', 'b', 'c', 'd', 'e'], "count": [1,1,1,1,1]})
>>> summer = lambda x: x[x["mac_id"].isin(mac_list)].sum()
>>> df.groupby("appid").apply(summer)["count"]
appid
1    0
2    1
3    2
Name: count, dtype: object
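If you'd rather avoid apply, the same result can be computed in a vectorized way, which tends to scale better on large frames; a minimal sketch assuming df and mac_list as defined above:
# zero out counts whose mac_id is not in mac_list, then sum per appid
df["count"].where(df["mac_id"].isin(mac_list), 0).groupby(df["appid"]).sum()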
I have a dataframe like this:
A  B  C  D
b  3  3  4
a  1  2  1
a  1  2  1
d  4  4  1
d  1  2  1
c  4  5  6
Now I hope to reorder the rows based on the values in column A.
I don't want to sort the values, but rather reorder them with a specific order like ['b', 'd', 'c', 'a'].
What I expect is:
A  B  C  D
b  3  3  4
d  4  4  1
d  1  2  1
c  4  5  6
a  1  2  1
a  1  2  1
This is a good use case for pd.Categorical, since you have ordered categories. Just make that column a categorical and mark ordered=True. Then, sort_values should do the rest.
df['A'] = pd.Categorical(df.A, categories=['b', 'd', 'c', 'a'], ordered=True)
df.sort_values('A')
If you want to keep your column as is, you can just use loc and the indexes.
df.loc[pd.Series(pd.Categorical(df.A,
                                categories=['b', 'd', 'c', 'a'],
                                ordered=True))
       .sort_values()
       .index]
Use a dictionary-like mapping for the order of the strings, then sort the values and reindex:
order = ['b', 'd', 'c', 'a']
df = df.reindex(df['A'].map(dict(zip(order, range(len(order))))).sort_values().index)
print(df)
   A  B  C  D
0  b  3  3  4
3  d  4  4  1
4  d  1  2  1
5  c  4  5  6
1  a  1  2  1
2  a  1  2  1
Without changing the datatype of A, you can set 'A' as the index and select the elements in the desired order defined by sk:
sk = ['b', 'd', 'c', 'a']
df.set_index('A').loc[sk].reset_index()
Or use a temp column for sorting:
sk = ['b', 'd', 'c', 'a']
(
    df.assign(S=df.A.map({v: k for k, v in enumerate(sk)}))
      .sort_values(by='S')
      .drop('S', axis=1)
)
I'm taking the solution provided by rafaelc a step further. If you want to do it in a chained process, here is how you'd do it:
df = (
df
.assign(A = lambda x: pd.Categorical(x['A'], categories = ['b', 'd', 'c', 'a'], ordered = True))
.sort_values('A')
)
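On pandas 1.1+ there is also the key parameter of sort_values, which sidesteps both the Categorical dtype and the temporary column; a short sketch assuming the same sk order list:
sk = ['b', 'd', 'c', 'a']
# map each value of A to its position in sk, only for the comparison
df.sort_values('A', key=lambda s: s.map({v: i for i, v in enumerate(sk)}))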
I have a list of group IDs:
letters = ['A', 'A/D', 'B', 'B/D', 'C', 'C/D', 'D']
and a dataframe of groups:
groups = pd.DataFrame({'group': ['B', 'A/D', 'D', 'D', 'A']})
I'd like to create a column in the dataframe that gives the position of the group ids in the list, like so:
  group  group_idx
0     B          2
1   A/D          1
2     D          6
3     D          6
4     A          0
My current solution is this:
group_to_num = {hsg: i for i, hsg in enumerate(letters)}
groups['group_idx'] = groups.applymap(lambda x: group_to_num.get(x)).max(axis=1).fillna(-1).astype(np.int32)
but it seems inelegant. Is there a simpler way of doing this?
You can try merge after a DataFrame constructor:
groups.merge(pd.DataFrame(letters).reset_index(), left_on='group', right_on=0).\
    rename(columns={'index': 'group_idx'}).drop(0, axis=1)
  group  group_idx
0     B          2
1   A/D          1
2     D          6
3     D          6
4     A          0
Use map:
import pandas as pd
letters = ['A', 'A/D', 'B', 'B/D', 'C', 'C/D', 'D']
group_to_num = {hsg: i for i, hsg in enumerate(letters)}
groups = pd.DataFrame({'group': ['B', 'A/D', 'D', 'D', 'A']})
groups['group_idx'] = groups.group.map(group_to_num)
print(groups)
Output
  group  group_idx
0     B          2
1   A/D          1
2     D          6
3     D          6
4     A          0
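If you'd rather not build the dictionary at all, pd.Index.get_indexer does the lookup in one vectorized call and returns -1 for values that are not found, matching the fillna(-1) in the question; a minimal sketch with the same letters and groups:
# position of each group in letters, -1 where missing
groups['group_idx'] = pd.Index(letters).get_indexer(groups['group'])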
My data frame looks like this:
[image: pandas data frame with multiple categorical variables for a user]
I made sure there are no duplicates in it. I want to encode it, and I want my final output like this:
[image: expected encoded output]
I tried using pandas get_dummies directly but I am not getting the desired result.
Can anyone help me with this?
IIUC, your user column is empty and everything is in name. If that's the case, you can use:
pd.pivot_table(df, index=df.name.str[0], columns=df.name.str[1:].values, aggfunc='count').fillna(0)
You can split each row in name using r'(\d+)' to separate digits from letters, and use pd.crosstab:
d = pd.DataFrame(df.name.str.split(r'(\d+)').values.tolist())
pd.crosstab(columns=d[2], index=d[1], values=d[1], aggfunc='count')
You could try the str accessor get_dummies with a groupby on the user column:
df.name.str.get_dummies().groupby(df.user).sum()
Example
Given your sample DataFrame
df = pd.DataFrame({'user': [1]*4 + [2]*4 + [3]*3,
                   'name': ['a', 'b', 'c', 'd']*2 + ['d', 'e', 'f']})
df_dummies = df.name.str.get_dummies().groupby(df.user).sum()
print(df_dummies)
[out]
      a  b  c  d  e  f
user
1     1  1  1  1  0  0
2     1  1  1  1  0  0
3     0  0  0  1  1  1
Assuming the following dataframe:
   user name
0     1    a
1     1    b
2     1    c
3     1    d
4     2    a
5     2    b
6     2    c
7     3    d
8     3    e
9     3    f
You could groupby user and then use get_dummies:
import pandas as pd
# create data-frame
data = [[1, 'a'], [1, 'b'], [1, 'c'], [1, 'd'], [2, 'a'],
        [2, 'b'], [2, 'c'], [3, 'd'], [3, 'e'], [3, 'f']]
df = pd.DataFrame(data=data, columns=['user', 'name'])
# group and get_dummies
grouped = df.groupby('user')['name'].apply(lambda x: '|'.join(x))
print(grouped.str.get_dummies())
Output
      a  b  c  d  e  f
user
1     1  1  1  1  0  0
2     1  1  1  0  0  0
3     0  0  0  1  1  1
As a side-note, you can do it all in one line:
result = df.groupby('user')['name'].apply(lambda x: '|'.join(x)).str.get_dummies()
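Assuming the tidy user/name layout from the sample above, pd.crosstab arguably gets there even more directly, counting one row per user/name pair without an explicit groupby:
# same user-by-name 0/1 table, since there are no duplicate pairs
pd.crosstab(df.user, df.name)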
Let's say the data looks like:
df = pd.DataFrame({'Group': ['A', 'B', 'A', 'B', 'C'],
                   'Value': [1, 4, 3, 2, 3]})
  Group  Value
0     A      1
1     B      4
2     A      3
3     B      2
4     C      3
Normally, when grouping by "Group" and taking the sum, I would get:
df.groupby(by="Group").agg(["sum"])
      Value
        sum
Group
A         4
B         6
C         3
Is there a way to get "Group A" vs "non-Group A", so something like:
df.groupby(by="Group A vs non-Group A").agg(["sum"])
       Value
         sum
Group
A          4
non-A      9
Thanks everyone!
Use groupby and replace
In [566]: df.groupby(
df.Group.eq('A').replace({True: 'A', False: 'non-A'})
)['Value'].sum().reset_index()
Out[566]:
   Group  Value
0      A      4
1  non-A      9
Details
In [567]: df.Group.eq('A').replace({True: 'A', False: 'non-A'})
Out[567]:
0        A
1    non-A
2        A
3    non-A
4    non-A
Name: Group, dtype: object
In [568]: df.groupby(df.Group.eq('A').replace({True: 'A', False: 'non-A'}))['Value'].sum()
Out[568]:
Group
A         4
non-A     9
Name: Value, dtype: int64
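For the same result without replace, you can also build the labels with numpy and group by the resulting array; a minimal sketch, assuming numpy is imported as np:
import numpy as np
# boolean mask -> two labels, then an ordinary groupby-sum
df.groupby(np.where(df.Group.eq('A'), 'A', 'non-A'))['Value'].sum()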
I'm trying to group by a column containing tuples. Each tuple has a different length.
I'd like to perform simple groupby operations on this column of tuples, such as sum or count.
Example:
df = pd.DataFrame(data={
    'col1': [1, 2, 3, 4],
    'col2': [('a', 'b'), ('a', 'm'), ('b', 'n', 'k'), ('a', 'c', 'k', 'z')],
})
print(df)
outputs:
   col1          col2
0     1        (a, b)
1     2        (a, m)
2     3     (b, n, k)
3     4  (a, c, k, z)
I'd like to be able to group by col2 on col1, with for instance a sum.
Expected output would be :
col2 sum_col1
0 a 7
1 b 4
2 c 4
3 n 3
3 m 2
3 k 7
3 z 4
I feel that pd.melt might be of use here, but I can't see exactly how.
Here is an approach using .get_dummies and .melt:
import pandas as pd

df = pd.DataFrame(data={
    'col1': [1, 2, 3, 4],
    'col2': [('a', 'b'), ('a', 'm'), ('b', 'n', 'k'), ('a', 'c', 'k', 'z')],
})
value_col = 'col1'
id_col = 'col2'
Unpack tuples to DataFrame:
df = df.join(df.col2.apply(lambda x: pd.Series(x)))
Create dummy columns from the tuple values:
dummy_cols = df.columns.difference([value_col, id_col])
dfd = pd.get_dummies(df[[value_col] + dummy_cols.tolist()])
Producing:
   col1  0_a  0_b  1_b  1_c  1_m  1_n  2_k  3_z
0     1    1    0    1    0    0    0    0    0
1     2    1    0    0    0    1    0    0    0
2     3    0    1    0    0    0    1    1    0
3     4    1    0    0    1    0    0    1    1
Then melt it and strip the numeric prefixes from the variable column:
dfd = pd.melt(dfd, value_vars=dfd.columns.difference([value_col]).tolist(), id_vars=value_col)
dfd['variable'] = dfd.variable.str.replace(r'\d_', '', regex=True)
print(dfd.head())
Yielding:
   col1 variable  value
0     1        a      1
1     2        a      1
2     3        a      0
3     4        a      1
4     1        b      0
And finally get your output:
dfd[dfd.value != 0].groupby('variable')[value_col].sum()
variable
a    7
b    4
c    4
k    7
m    2
n    3
z    4
Name: col1, dtype: int64
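On pandas 0.25+, DataFrame.explode collapses most of this into a single chain; a sketch assuming every entry of col2 is a tuple (note the ('a', 'm') fix above, since a bare ('a') is just the string 'a', not a one-element tuple):
(df.explode('col2')          # one row per (col1, element-of-col2) pair
   .groupby('col2')['col1']
   .sum()
   .rename('sum_col1')
   .reset_index())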