Python Pandas incorrect date count

Working with the following Python pandas DataFrame "df":
Customer_ID  Transaction_ID  Item_ID
ABC          2017-04-12-333  X8973
ABC          2017-04-12-333  X2468
ABC          2017-05-22-658  X2906
ABC          2017-05-22-757  X8790
ABC          2017-07-13-864  X8790
BCD          2017-08-11-879  X2346
BCD          2017-08-11-879  X2468
I want a column that numbers each client's transactions by date: 1 for the first transaction, 2 for the second, and so forth. (If there are two transactions on the same day, I give them the same number, since I don't have the time of day and can't tell which came first; basically I treat them as one transaction.)
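For reference, the example frame can be rebuilt like this (a minimal sketch; the values are copied from the table above):
import pandas as pd

# reconstruct the sample data shown above
df = pd.DataFrame({
    'Customer_ID': ['ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'BCD', 'BCD'],
    'Transaction_ID': ['2017-04-12-333', '2017-04-12-333', '2017-05-22-658',
                       '2017-05-22-757', '2017-07-13-864', '2017-08-11-879',
                       '2017-08-11-879'],
    'Item_ID': ['X8973', 'X2468', 'X2906', 'X8790', 'X8790', 'X2346', 'X2468'],
})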
#get the date out of the Transaction_ID string
df['date'] = pd.to_datetime(df.Transaction_ID.str[:10])
#calculate the transaction number
df['trans_nr'] = df.groupby(['Customer_ID',"Transaction_ID", df['date'].dt.year]).cumcount()+1
Unfortunately, this is my output with the code above:
Customer_ID  Transaction_ID  Item_ID  date        trans_nr
ABC          2017-04-12-333  X8973    2017-04-12  1
ABC          2017-04-12-333  X2468    2017-04-12  2
ABC          2017-05-22-658  X2906    2017-05-22  1
ABC          2017-05-22-757  X8790    2017-05-22  1
ABC          2017-07-13-864  X8790    2017-07-13  1
BCD          2017-08-11-879  X2346    2017-08-11  1
BCD          2017-08-11-879  X2468    2017-08-11  2
This is incorrect. Here is the output I am looking for:
Customer_ID  Transaction_ID  Item_ID  date        trans_nr
ABC          2017-04-12-333  X8973    2017-04-12  1
ABC          2017-04-12-333  X2468    2017-04-12  1
ABC          2017-05-22-658  X2906    2017-05-22  2
ABC          2017-05-22-757  X8790    2017-05-22  2
ABC          2017-07-13-864  X8790    2017-07-13  3
BCD          2017-08-11-879  X2346    2017-08-11  1
BCD          2017-08-11-879  X2468    2017-08-11  1
Maybe the logic should be based only on Customer_ID and date (without Transaction_ID)?
I tried this:
df['trans_nr'] = df.groupby(['Customer_ID','date']).cumcount()+1
But it also counts incorrectly (it restarts the count within each date and numbers the rows, instead of numbering each customer's distinct dates).

Let's try:
df['trans_nr'] = df.groupby(['Customer_ID', df['date'].dt.year])['date']\
.transform(lambda x: (x.diff() != pd.Timedelta('0 days')).cumsum())
Output:
Customer_ID Transaction_ID Item_ID date trans_nr
0 ABC 2017-04-12-333 X8973 2017-04-12 1
1 ABC 2017-04-12-333 X2468 2017-04-12 1
2 ABC 2017-05-22-658 X2906 2017-05-22 2
3 ABC 2017-05-22-757 X8790 2017-05-22 2
4 ABC 2017-07-13-864 X8790 2017-07-13 3
5 BCD 2017-08-11-879 X2346 2017-08-11 1
6 BCD 2017-08-11-879 X2468 2017-08-11 1
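Note that this diff-based count assumes the rows are already ordered by date within each customer; if that may not hold, sorting first keeps the running count correct (a small sketch):
# sort so consecutive rows within a customer are in date order
df = df.sort_values(['Customer_ID', 'date'])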

Use a dual groupby with ngroup(), i.e.:
df['trans_nr'] = df.groupby('Customer_ID').apply(
    lambda x: x.groupby([x['date'].dt.date]).ngroup() + 1).values
Customer_ID Transaction_ID Item_ID date trans_nr
0 ABC 2017-04-12-333 X8973 2017-04-12 1
1 ABC 2017-04-12-333 X2468 2017-04-12 1
2 ABC 2017-05-22-658 X2906 2017-05-22 2
3 ABC 2017-05-22-757 X8790 2017-05-22 2
4 ABC 2017-07-13-864 X8790 2017-07-13 3
5 BCD 2017-08-11-879 X2346 2017-08-11 1
6 BCD 2017-08-11-879 X2468 2017-08-11 1

One way would be to drop duplicate values before making the cumulative count:
trans_nr = (df
            .drop_duplicates(subset=['Customer_ID', 'date'])
            .set_index(['Customer_ID', 'date'])
            .groupby(level='Customer_ID')
            .cumcount() + 1
            )
df.set_index(['Customer_ID', 'date'], inplace=True)
df['trans_nr'] = trans_nr
df.reset_index(inplace=True)
To get the transaction number, you first remove rows with duplicate Customer_ID and date values. Then you set the index to Customer_ID and date (for merging later) and perform the groupby and cumcount. This produces a series whose values are the cumulative count of distinct dates for each Customer_ID.
You also set the same index on the original dataframe (again to allow for merging). Then you simply assign the trans_nr series to a column in df; the indices take care of the alignment.
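Another option, not part of the answers above, is to dense-rank the dates within each customer, which numbers distinct dates directly (a sketch; it ignores the per-year restart implied by the question's original groupby):
# distinct dates per customer get consecutive numbers 1, 2, 3, ...
df['trans_nr'] = (df.groupby('Customer_ID')['date']
                    .rank(method='dense')
                    .astype(int))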

Related

Create "Yes" column according to another column value pandas dataframe

Imagine I have a dataframe with employee IDs, their Contract Number, and the Company they work for. Each employee can have as many contracts as they want for the same company or even for different companies:
ID Contract Number Company
10000 1 Abc
10000 2 Zxc
10000 3 Abc
10001 1 Zxc
10002 2 Abc
10002 1 Cde
10002 3 Zxc
I need to identify the company of contract number "1" for each ID and then create a column "Primary Company" that is set to "Yes" if the contract is with the same company as contract number 1, resulting in this dataframe:
ID Contract Number Company Primary Company
10000 1 Abc Yes
10000 2 Zxc No
10000 3 Abc Yes
10001 1 Zxc Yes
10002 2 Abc No
10002 1 Cde Yes
10002 3 Zxc No
What would be the best way to achieve it?
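For anyone who wants to try the answers below, the example frame can be rebuilt like this (values copied from the table above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [10000, 10000, 10000, 10001, 10002, 10002, 10002],
    'Contract Number': [1, 2, 3, 1, 2, 1, 3],
    'Company': ['Abc', 'Zxc', 'Abc', 'Zxc', 'Abc', 'Cde', 'Zxc'],
})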
You can use groupby.apply with isin and numpy.where:
df['Primary Company'] = np.where(
    df.groupby('ID', group_keys=False)
      .apply(lambda g: g['Company'].isin(g.loc[g['Contract Number'].eq(1), 'Company'])),
    'Yes', 'No'
)
Output:
ID Contract Number Company Primary Company
0 10000 1 Abc Yes
1 10000 2 Zxc No
2 10000 3 Abc Yes
3 10001 1 Zxc Yes
4 10002 2 Abc No
5 10002 1 Cde Yes
6 10002 3 Zxc No
If you can just use a boolean (True/False) instead of 'Yes'/'No':
df['Primary Company'] = (
    df.groupby('ID', group_keys=False)
      .apply(lambda g: g['Company'].isin(g.loc[g['Contract Number'].eq(1), 'Company']))
)
Filter rows where Contract Number is 1, use a left join in DataFrame.merge, and compare the _merge column generated by the indicator=True parameter:
mask = (df.merge(df[df['Contract Number'].eq(1)],
                 how='left', on=['ID','Company'], indicator=True)['_merge'].eq('both'))
df['Primary Company'] = np.where(mask, 'Yes','No')
print (df)
ID Contract Number Company Primary Company
0 10000 1 Abc Yes
1 10000 2 Zxc No
2 10000 3 Abc Yes
3 10001 1 Zxc Yes
4 10002 2 Abc No
5 10002 1 Cde Yes
6 10002 3 Zxc No
Another idea is to compare a MultiIndex using Index.isin:
idx = df[df['Contract Number'].eq(1)].set_index(['ID','Company']).index
df['Primary Company'] = np.where(df.set_index(['ID','Company']).index.isin(idx),
                                 'Yes', 'No')
print (df)
ID Contract Number Company Primary Company
0 10000 1 Abc Yes
1 10000 2 Zxc No
2 10000 3 Abc Yes
3 10001 1 Zxc Yes
4 10002 2 Abc No
5 10002 1 Cde Yes
6 10002 3 Zxc No
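A non-apply variant (a sketch, assuming each ID has exactly one row with Contract Number 1, as in the example) maps every ID to its primary company and compares:
# Series mapping each ID to the company of its contract number 1
primary = df.loc[df['Contract Number'].eq(1)].set_index('ID')['Company']
df['Primary Company'] = np.where(df['Company'].eq(df['ID'].map(primary)), 'Yes', 'No')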

How to group a sequence based on a group column and assign a group ID

Below is the dataframe I have
ColA ColB Time ColC
A B 01-01-2022 ABC
A B 02-01-2022 ABC
A B 07-01-2022 XYZ
A B 11-01-2022 IJK
A B 14-01-2022 ABC
Desired result:
ColA ColB Time ColC groupID
A B 01-01-2022 ABC 1
A B 02-01-2022 ABC 1
A B 07-01-2022 XYZ 2
A B 11-01-2022 IJK 3
A B 14-01-2022 ABC 4
UPDATE:
Below is the cumsum-based code I executed and its output:
df['ColC'] = df['ColC'].ne(df['ColC'].shift(1)).groupby([df['ColA'],
                                                         df['ColB']]).cumsum()
ColA ColB Time ColC groupID
A B 01-01-2022 ABC 1
A B 02-01-2022 ABC 1
A B 07-01-2022 XYZ 2
A B 11-01-2022 XYZ 3
A B 14-01-2022 XYZ 4
Thank you in advance
The logic is not fully clear, but it looks like you're trying to group by week number (and ColC):
df['groupID'] = (df
                 .groupby([pd.to_datetime(df['Time'], dayfirst=True).dt.isocalendar().week,
                           'ColC'], sort=False)
                 .ngroup().add(1)
                 )
Output:
ColA ColB Time ColC groupID
0 A B 01-01-2022 ABC 1
1 A B 02-01-2022 ABC 1
2 A B 07-01-2022 XYZ 2
3 A B 11-01-2022 IJK 3
4 A B 14-01-2022 ABC 4
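If the intent is instead to start a new group whenever ColC changes from the previous row within each (ColA, ColB) pair, which also reproduces the desired result above, a sketch:
# increment the counter on every change of ColC within a (ColA, ColB) group
df['groupID'] = (df.groupby(['ColA', 'ColB'])['ColC']
                   .transform(lambda s: s.ne(s.shift()).cumsum()))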

"Rank" DataFrame columns per row

Given a time-series DataFrame, is it possible to create a new DataFrame with the same dimensions, where the values are the rank of each column within its row compared to the other columns (smallest value ranked first)?
Example:
ABC DEFG HIJK XYZ
date
2018-01-14 0.110541 0.007615 0.063217 0.002543
2018-01-21 0.007012 0.042854 0.061271 0.007988
2018-01-28 0.085946 0.177466 0.046432 0.069297
2018-02-04 0.018278 0.065254 0.038972 0.027278
2018-02-11 0.071785 0.033603 0.075826 0.073270
The first row would become:
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
as XYZ has the smallest value in that row and ABC the largest.
numpy.argsort looks like it might help; however, since it outputs the locations themselves, I have not managed to get it to work.
Many thanks
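To experiment with the answers below, the example frame can be reconstructed like this (values copied from the question):
import pandas as pd

df = pd.DataFrame(
    {'ABC':  [0.110541, 0.007012, 0.085946, 0.018278, 0.071785],
     'DEFG': [0.007615, 0.042854, 0.177466, 0.065254, 0.033603],
     'HIJK': [0.063217, 0.061271, 0.046432, 0.038972, 0.075826],
     'XYZ':  [0.002543, 0.007988, 0.069297, 0.027278, 0.073270]},
    index=pd.to_datetime(['2018-01-14', '2018-01-21', '2018-01-28',
                          '2018-02-04', '2018-02-11']))
df.index.name = 'date'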
Use a double argsort to rank within each row and pass the result to the DataFrame constructor:
df1 = pd.DataFrame(df.values.argsort().argsort() + 1, index=df.index, columns=df.columns)
print (df1)
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
2018-01-21 1 3 4 2
2018-01-28 3 4 1 2
2018-02-04 1 4 3 2
2018-02-11 2 1 4 3
Or use DataFrame.rank with method='dense':
df1 = df.rank(axis=1, method='dense').astype(int)
print (df1)
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
2018-01-21 1 3 4 2
2018-01-28 3 4 1 2
2018-02-04 1 4 3 2
2018-02-11 2 1 4 3
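One behavioural difference worth noting: with tied values in a row, the double argsort assigns distinct, position-dependent ranks, while rank lets you pick the tie-handling method explicitly. A small illustrative sketch:
row = pd.DataFrame([[0.5, 0.5, 0.1]], columns=['A', 'B', 'C'])
print(row.rank(axis=1, method='min'))       # A=2.0, B=2.0, C=1.0 (tied values share a rank)
print(row.values.argsort().argsort() + 1)   # [[2 3 1]] (ties broken by position)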

Pandas deleting rows in order

Given a particular df:
ID Text
1 abc
1 xyz
2 xyz
2 abc
3 xyz
3 abc
3 ijk
4 xyz
I want to apply a condition where, grouping by ID, if abc exists then the row with xyz is deleted. The outcome would be:
ID Text
1 abc
2 abc
3 abc
3 ijk
4 xyz
Usually I would group them by ID and apply np.where(...). However, I don't think this approach would work for this case, since the condition depends on other rows.
Many thanks!
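For reference, a minimal reconstruction of the example frame (values from the question; numpy is imported because the second answer below uses np.nan):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID':   [1, 1, 2, 2, 3, 3, 3, 4],
    'Text': ['abc', 'xyz', 'xyz', 'abc', 'xyz', 'abc', 'ijk', 'xyz'],
})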
To the best of my knowledge, you can vectorize this with a groupby + transform:
df[~(df.Text.eq('abc').groupby(df.ID).transform('any') & df.Text.eq('xyz'))]
ID Text
0 1 abc
3 2 abc
5 3 abc
6 3 ijk
7 4 xyz
I am using crosstab:
s = pd.crosstab(df.ID, df.Text)
s.xyz = s.xyz.mask(s.abc.eq(1) & s.xyz.eq(1))
s
Out[162]:
Text abc ijk xyz
ID
1 1 0 NaN
2 1 0 NaN
3 1 1 NaN
4 0 0 1.0
s.replace(0, np.nan).stack().reset_index().drop(columns=0)
Out[167]:
ID Text
0 1 abc
1 2 abc
2 3 abc
3 3 ijk
4 4 xyz

How can I make hierarchical columns based on unique values in that column

I have a pandas DataFrame which looks like below:
S.No Name1 Name2 Size
1 ABC XYZ 12
2 BCA XCZ 15
3 DAB ZXM 20
How do I make a hierarchical column level with all unique values in the Name1 column, followed by a level with all unique values in Name2, which would make the DataFrame look like below:
      ABC              BCA              DAB
      XYZ  XCZ  ZXM    XYZ  XCZ  ZXM    XYZ  XCZ  ZXM
S.No
1      12  N/A  N/A    N/A  N/A  N/A    N/A  N/A  N/A
2     N/A  N/A  N/A    N/A   15  N/A    N/A  N/A  N/A
3     N/A  N/A  N/A    N/A  N/A  N/A    N/A  N/A   20
Consider filling in the empty rows with a merge against a helper dataframe built from the cartesian product of unique values (all possible combinations of S.No, Name1, Name2) using itertools.product:
from io import StringIO
from itertools import product
import pandas as pd
txt = '''S.No Name1 Name2 Size
1 ABC XYZ 12
2 BCA XCZ 15
3 DAB ZXM 20'''
df = pd.read_table(StringIO(txt), sep=r"\s+")
fill_df = pd.DataFrame(list(product(df['S.No'].unique(), df['Name1'].unique(), df['Name2'].unique())),
                       columns=['S.No', 'Name1', 'Name2'])
df = df.merge(fill_df, on=['S.No', 'Name1', 'Name2'], how='right')
pvtdf = df.pivot_table(index='S.No', columns=['Name1', 'Name2'],
                       values='Size', aggfunc='max', dropna=False)\
          .rename_axis([None, None], axis="columns")
print(pvtdf)
# ABC BCA DAB
# XCZ XYZ ZXM XCZ XYZ ZXM XCZ XYZ ZXM
# S.No
# 1 NaN 12.0 NaN NaN NaN NaN NaN NaN NaN
# 2 NaN NaN NaN 15.0 NaN NaN NaN NaN NaN
# 3 NaN NaN NaN NaN NaN NaN NaN NaN 20.0
You can also use .unstack to get the desired MultiIndex format. Let's say df is your data frame. Do this:
df = df.set_index(['S.No','Name1','Name2'])['Size'].unstack(level=-2).unstack(level=-1)
df.columns.names = [None, None]
df = df.reindex(columns=['XYZ', 'XCZ', 'ZXM'], level = 1)
df.fillna('', inplace=True) # if you want to replace NAs with blanks
print(df)
ABC BCA DAB
XYZ XCZ ZXM XYZ XCZ ZXM XYZ XCZ ZXM
S.No
1 12
2 15
3 20
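If only the observed (Name1, Name2) pairs are needed as columns (three columns here rather than the full cartesian product), a shorter sketch starting again from the original three-row frame, assuming pandas >= 1.1 where pivot accepts a list of column keys:
# one column per observed (Name1, Name2) pair, NaN everywhere else
out = df.pivot(index='S.No', columns=['Name1', 'Name2'], values='Size')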
