I have a list of group IDs:
letters = ['A', 'A/D', 'B', 'B/D', 'C', 'C/D', 'D']
and a dataframe of groups:
groups = pd.DataFrame({'group': ['B', 'A/D', 'D', 'D', 'A']})
I'd like to create a column in the dataframe that gives the position of the group ids in the list, like so:
group group_idx
0 B 2
1 A/D 1
2 D 6
3 D 6
4 A 0
My current solution is this:
group_to_num = {hsg: i for i, hsg in enumerate(letters)}
groups['group_idx'] = groups.applymap(lambda x: group_to_num.get(x)).max(axis=1).fillna(-1).astype(np.int32)
but it seems inelegant. Is there a simpler way of doing this?
You can try merge after a dataframe constructor:
groups.merge(pd.DataFrame(letters).reset_index(), left_on='group', right_on=0).\
rename(columns={'index': 'group_idx'}).drop(columns=0)
group group_idx
0 B 2
1 A/D 1
2 D 6
3 D 6
4 A 0
Use map:
import pandas as pd
letters = ['A', 'A/D', 'B', 'B/D', 'C', 'C/D', 'D']
group_to_num = {hsg: i for i, hsg in enumerate(letters)}
groups = pd.DataFrame({'group': ['B', 'A/D', 'D', 'D', 'A']})
groups['group_idx'] = groups.group.map(group_to_num)
print(groups)
Output
group group_idx
0 B 2
1 A/D 1
2 D 6
3 D 6
4 A 0
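If some group IDs might be missing from letters, map leaves NaN in those rows. A minimal sketch of one alternative, assuming letters contains no duplicates: pd.Index.get_indexer returns -1 for values it cannot find, which matches the fillna(-1) behaviour in the question. The 'Z' entry below is a made-up ID added only to show the missing case.

```python
import pandas as pd

letters = ['A', 'A/D', 'B', 'B/D', 'C', 'C/D', 'D']
# 'Z' is a hypothetical ID that is not in letters, added only for illustration
groups = pd.DataFrame({'group': ['B', 'A/D', 'D', 'Z', 'A']})

# get_indexer gives the position of each value in the index, or -1 if absent
groups['group_idx'] = pd.Index(letters).get_indexer(groups['group'])
print(groups)
```

This keeps the column as an integer dtype, so no astype call is needed afterwards.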
Related
I want to find local duplicates and give them a unique id, directly in pandas.
Real-life example:
Time-ordered purchase data where a customer ID occurs multiple times (because the customer visits a shop multiple times a week), but I want to identify occasions where the customer purchases multiple items at the same time.
My current approach would look like this:
def follow_ups(lst):
    lst2 = [None] + lst[:-1]
    i = 0
    l = []
    for e1, e2 in zip(lst, lst2):
        if e1 != e2:
            i += 1
        l.append(i)
    return l
follow_ups(['A', 'B', 'B', 'C', 'B', 'D', 'D', 'D', 'E', 'A', 'B', 'C'])
# [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9]
# for pandas
df['out'] = follow_ups(df['test'].tolist())
But I have the feeling there might be a much simpler and cleaner approach in pandas which I am unable to find.
Pandas Sample data
import pandas as pd
df = pd.DataFrame({'test':['A', 'B', 'B', 'C', 'B', 'D', 'D', 'D', 'E', 'A', 'B', 'C']})
# test
# 0 A
# 1 B
# 2 B
# 3 C
# 4 B
# 5 D
# 6 D
# 7 D
# 8 E
# 9 A
# 10 B
# 11 C
df_out = pd.DataFrame({'test':['A', 'B', 'B', 'C', 'B', 'D', 'D', 'D', 'E', 'A', 'B', 'C'], 'out':[1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9]})
# test out
# 0 A 1
# 1 B 2
# 2 B 2
# 3 C 3
# 4 B 4
# 5 D 5
# 6 D 5
# 7 D 5
# 8 E 6
# 9 A 7
# 10 B 8
# 11 C 9
You can check where your column test is not equal to its shifted version, using shift() with ne(), and then apply cumsum() to the result:
df['out'] = df['test'].ne(df['test'].shift()).cumsum()
Which prints:
df
test out
0 A 1
1 B 2
2 B 2
3 C 3
4 B 4
5 D 5
6 D 5
7 D 5
8 E 6
9 A 7
10 B 8
11 C 9
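To see why this works, here is a minimal sketch that splits the one-liner into its intermediate steps on the same sample data. The names prev, is_new_run, run_id and multi are just illustrative. The last line is an assumption about what "multiple items at the same time" means: a run of identical consecutive values longer than one row.

```python
import pandas as pd

s = pd.Series(['A', 'B', 'B', 'C', 'B', 'D', 'D', 'D', 'E', 'A', 'B', 'C'])

prev = s.shift()              # previous row's value; NaN for the first row
is_new_run = s.ne(prev)       # True wherever the value differs from the previous row
run_id = is_new_run.cumsum()  # each True bumps the counter, so a run shares one id

# Flag rows that belong to a run of more than one consecutive identical value
multi = run_id.groupby(run_id).transform('size') > 1

print(pd.DataFrame({'test': s, 'out': run_id, 'multi': multi}))
```

The first row always starts a new run because comparing against NaN yields True, which is why the ids start at 1.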
I have a dataframe like this:
A B C D
b 3 3 4
a 1 2 1
a 1 2 1
d 4 4 1
d 1 2 1
c 4 5 6
Now I hope to reorder the rows based on values in column A.
I don't want to sort the values but reorder them with a specific order like ['b', 'd', 'c', 'a']
what I expect is:
A B C D
b 3 3 4
d 4 4 1
d 1 2 1
c 4 5 6
a 1 2 1
a 1 2 1
This is a good use case for pd.Categorical, since you have ordered categories. Just make that column a categorical and mark ordered=True. Then, sort_values should do the rest.
df['A'] = pd.Categorical(df.A, categories=['b', 'd', 'c', 'a'], ordered=True)
df.sort_values('A')
If you want to keep your column as is, you can just use loc and the indexes.
df.loc[pd.Series(pd.Categorical(df.A,
categories=['b', 'd', 'c', 'a'],
ordered=True))\
.sort_values()\
.index\
]
Use a dictionary that maps each string to its position in the desired order, then sort the mapped values and reindex:
order = ['b', 'd', 'c', 'a']
df = df.reindex(df['A'].map(dict(zip(order, range(len(order))))).sort_values().index)
print(df)
A B C D
0 b 3 3 4
3 d 4 4 1
4 d 1 2 1
5 c 4 5 6
1 a 1 2 1
2 a 1 2 1
Without changing datatype of A, you can set 'A' as index and select elements in the desired order defined by sk.
sk = ['b', 'd', 'c', 'a']
df.set_index('A').loc[sk].reset_index()
Or use a temp column for sorting:
sk = ['b', 'd', 'c', 'a']
(
    df.assign(S=df.A.map({v: k for k, v in enumerate(sk)}))
    .sort_values(by='S')
    .drop('S', axis=1)
)
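The temp-column idea can also be written without the extra column. This is a sketch, assuming pandas 1.1 or later, where sort_values accepts a key argument; the DataFrame is rebuilt from the question's table, and rank is an illustrative name. A stable sort (mergesort) is requested so rows with the same value in A keep their original relative order.

```python
import pandas as pd

df = pd.DataFrame({'A': ['b', 'a', 'a', 'd', 'd', 'c'],
                   'B': [3, 1, 1, 4, 1, 4],
                   'C': [3, 2, 2, 4, 2, 5],
                   'D': [4, 1, 1, 1, 1, 6]})

sk = ['b', 'd', 'c', 'a']
rank = {v: i for i, v in enumerate(sk)}  # 'b' -> 0, 'd' -> 1, 'c' -> 2, 'a' -> 3

# key receives the column and must return a same-length Series to sort by
result = df.sort_values('A', key=lambda col: col.map(rank), kind='mergesort')
print(result)
```

Column A keeps its original dtype, since the mapping is only used for ordering.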
I'm taking the solution provided by rafaelc a step further. If you want to do it in a chained process, here is how you'd do it:
df = (
    df
    .assign(A=lambda x: pd.Categorical(x['A'], categories=['b', 'd', 'c', 'a'], ordered=True))
    .sort_values('A')
)
My data frame looks like this
[Image: Pandas data frame with multiple categorical variables for a user]
I made sure there are no duplicates in it. I want to encode it and I want my final output like this
I tried using pandas dummies directly but I am not getting the desired result.
Can anyone help me with this?
If I understand correctly, your user column is empty and everything is in name. If that's the case, you can use:
pd.pivot_table(df, index=df.name.str[0], columns=df.name.str[1:].values, aggfunc='count').fillna(0)
You can split each row in name using r'(\d+)' to separate digits from letters, and use pd.crosstab:
d = pd.DataFrame(df.name.str.split(r'(\d+)').values.tolist())
pd.crosstab(columns=d[2], index=d[1], values=d[1], aggfunc='count')
You could try the str accessor's get_dummies, grouped by the user column:
df.name.str.get_dummies().groupby(df.user).sum()
Example
Given your sample DataFrame
df = pd.DataFrame({'user': [1]*4 + [2]*4 + [3]*3,
'name': ['a', 'b', 'c', 'd']*2 + ['d', 'e', 'f']})
df_dummies = df.name.str.get_dummies().groupby(df.user).sum()
print(df_dummies)
[out]
a b c d e f
user
1 1 1 1 1 0 0
2 1 1 1 1 0 0
3 0 0 0 1 1 1
Assuming the following dataframe:
user name
0 1 a
1 1 b
2 1 c
3 1 d
4 2 a
5 2 b
6 2 c
7 3 d
8 3 e
9 3 f
You could group by user and then use get_dummies:
import pandas as pd
# create data-frame
data = [[1, 'a'], [1, 'b'], [1, 'c'], [1, 'd'], [2, 'a'],
[2, 'b'], [2, 'c'], [3, 'd'], [3, 'e'], [3, 'f']]
df = pd.DataFrame(data=data, columns=['user', 'name'])
# group and get_dummies
grouped = df.groupby('user')['name'].apply(lambda x: '|'.join(x))
print(grouped.str.get_dummies())
Output
a b c d e f
user
1 1 1 1 1 0 0
2 1 1 1 0 0 0
3 0 0 0 1 1 1
As a side-note, you can do it all in one line:
result = df.groupby('user')['name'].apply(lambda x: '|'.join(x)).str.get_dummies()
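Another option that may be worth trying is pd.crosstab, which counts how often each (user, name) pair occurs. This is only a sketch on the same sample data as above; it yields 0/1 indicators here because the data has no duplicate pairs, as the question states. The name table is illustrative.

```python
import pandas as pd

data = [[1, 'a'], [1, 'b'], [1, 'c'], [1, 'd'], [2, 'a'],
        [2, 'b'], [2, 'c'], [3, 'd'], [3, 'e'], [3, 'f']]
df = pd.DataFrame(data=data, columns=['user', 'name'])

# Rows are users, columns are names, cells are occurrence counts
table = pd.crosstab(df['user'], df['name'])
print(table)
```

If duplicates could occur, the cells would hold counts greater than 1; clipping with table.clip(upper=1) would turn them back into indicators.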
Given the following DataFrame, I am trying to aggregate over columns 'A' and 'C': for 'A', count the unique strings, and for 'C', sum the values.
The problem arises when some of the entries in 'A' are actually lists of those strings.
Here's a simplified example:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2],
'A' : ['a', 'a', 'a', 'b', ['b', 'c', 'd'], 'a', 'a', ['a', 'b', 'c']],
'C' : [1, 2, 15, 5, 13, 6, 7, 1]})
df
Out[100]:
ID A C
0 1 a 1
1 1 a 2
2 1 a 15
3 1 b 5
4 1 [b, c, d] 13
5 2 a 6
6 2 a 7
7 2 [a, b, c] 1
aggs = {'A' : lambda x: x.nunique(dropna=True),
'C' : 'sum'}
# This raises an error: TypeError: unhashable type: 'list'
agg_df = df.groupby('ID').agg(aggs)
I'd like the following output:
print(agg_df)
A C
ID
1 4 36
2 3 14
This is because for 'ID' = 1 we have 'a', 'b', 'c' and 'd', and for 'ID' = 2 we have 'a', 'b' and 'c'.
One solution is to split your problem into 2 parts. First flatten your dataframe to ensure df['A'] consists only of strings. Then concatenate a couple of GroupBy operations.
Step 1: Flatten your dataframe
You can use itertools.chain and numpy.repeat to chain and repeat values as appropriate.
from itertools import chain
import numpy as np
import pandas as pd
A = df['A'].apply(lambda x: [x] if not isinstance(x, list) else x)
lens = A.map(len)
res = pd.DataFrame({'ID': np.repeat(df['ID'], lens),
'A': list(chain.from_iterable(A)),
'C': np.repeat(df['C'], lens)})
print(res)
# A C ID
# 0 a 1 1
# 1 a 2 1
# 2 a 15 1
# 3 b 5 1
# 4 b 13 1
# 4 c 13 1
# 4 d 13 1
# 5 a 6 2
# 6 a 7 2
# 7 a 1 2
# 7 b 1 2
# 7 c 1 2
Step 2: Concatenate GroupBy on original and flattened
agg_df = pd.concat([res.groupby('ID')['A'].nunique(),
df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
# A C
# ID
# 1 4 36
# 2 3 14
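On pandas 0.25 or later, the flattening step can probably be replaced by DataFrame.explode, which expands list entries into one row each and leaves scalar entries untouched. A minimal sketch on the question's data follows; flat is an illustrative name, and the sum of 'C' is still taken from the original frame so the repeated rows are not counted more than once.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2],
                   'A': ['a', 'a', 'a', 'b', ['b', 'c', 'd'], 'a', 'a', ['a', 'b', 'c']],
                   'C': [1, 2, 15, 5, 13, 6, 7, 1]})

# One row per list element; the index and the other columns are repeated
flat = df.explode('A')

agg_df = pd.concat([flat.groupby('ID')['A'].nunique(),
                    df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
```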
Concretely, say I have a DataFrame like this:
appid mac_id count
1 a 1
2 b 1
2 c 1
3 d 1
3 e 1
And I also have a list:
mac_list = ['b', 'd', 'e']
I want to group this DataFrame by appid and, within each group, keep only the rows whose mac_id is in mac_list. Finally, I want sum(count) for every group.
For this DataFrame the result is:
appid count
1 0
2 1
3 2
How can I do this with Pandas?
>>> df = pd.DataFrame({"appid": [1,2,2,3,3], "mac_id": ['a', 'b', 'c', 'd', 'e'], "count": [1,1,1,1,1]})
>>> summer = lambda x: x[x["mac_id"].isin(mac_list)].sum()
>>> df.groupby("appid").apply(summer)["count"]
appid
1 0
2 1
3 2
Name: count, dtype: object
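The result above has dtype object because the lambda sums every column, including the strings in mac_id. As a sketch of an alternative that avoids apply, you could zero out the counts that should not contribute and then do a plain grouped sum. The names masked and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"appid": [1, 2, 2, 3, 3],
                   "mac_id": ['a', 'b', 'c', 'd', 'e'],
                   "count": [1, 1, 1, 1, 1]})
mac_list = ['b', 'd', 'e']

# Keep count where mac_id is in mac_list, otherwise replace it with 0
masked = df["count"].where(df["mac_id"].isin(mac_list), 0)

# Group by appid; groups with no matching mac_id still appear, with a sum of 0
result = masked.groupby(df["appid"]).sum()
print(result)
```

Filtering the rows first and then grouping would drop appid 1 entirely, which is why the counts are masked rather than the rows removed.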