Pandas Merge function creating incorrect data - python

I am attempting to join two data frames. However, when I use the merge function in pandas, in some cases it creates rows that should not be present and are not accurate.
import pandas as pd
df = pd.DataFrame({'TimeAway Type': ['Sick', 'Vacation', 'Late'], 'Date': ['2022-03-09', '2022-03-09', '2022-03-15'], 'Hours Requested': [0.04, 3, 5]})
df2 = pd.DataFrame({'Schedule Segment': ['Sick', 'VTO', 'Tardy'], 'Date': ['2022-03-09', '2022-03-09', '2022-03-15'], 'Duration': [2, 3, 1]})
merged = pd.merge(df, df2, on=['Date'])
print(merged)
Output:
  TimeAway Type        Date  Hours Requested Schedule Segment  Duration
0          Sick  2022-03-09             0.04             Sick         2
1          Sick  2022-03-09             0.04              VTO         3
2      Vacation  2022-03-09             3.00             Sick         2
3      Vacation  2022-03-09             3.00              VTO         3
4          Late  2022-03-15             5.00            Tardy         1
As you can see from the output above, on the date where there is only one instance in each DF, everything works perfectly fine. However, on the dates where there is more than one instance in each DF, it produces extra rows. It's almost as if it's saying "I don't know how to match this data, so here's every single possible combination."
Desired output: one row per matching pair (Sick with Sick, Vacation with VTO, Late with Tardy), i.e. three rows in total rather than every combination.
This is just a subset of the data. The DF is quite large and there are many more instances where this occurs with different values. Is there anything that can be done to stop this from happening?
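What is happening here is a many-to-many join: Date is not unique in either frame, so pandas pairs every row with a matching date. A minimal sketch, assuming you only want the merge to fail loudly instead of silently multiplying rows, using merge's validate argument (the real fix is to merge on a key that is unique per row, as the answers below do):

import pandas as pd

df = pd.DataFrame({'TimeAway Type': ['Sick', 'Vacation', 'Late'],
                   'Date': ['2022-03-09', '2022-03-09', '2022-03-15'],
                   'Hours Requested': [0.04, 3, 5]})
df2 = pd.DataFrame({'Schedule Segment': ['Sick', 'VTO', 'Tardy'],
                    'Date': ['2022-03-09', '2022-03-09', '2022-03-15'],
                    'Duration': [2, 3, 1]})

# validate='one_to_one' raises MergeError when the join key is duplicated
# on either side, instead of silently producing every combination
try:
    pd.merge(df, df2, on=['Date'], validate='one_to_one')
except pd.errors.MergeError as err:
    print(err)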

Update, per a comment on the question, to join on the order of records within each date:
df_o = df.assign(date_order=df.groupby('Date').cumcount())
df2_o = df2.assign(date_order=df2.groupby('Date').cumcount())
df_o.merge(df2_o, on=['Date', 'date_order'])
Output:
TimeAway Type Date Hours Requested date_order Schedule Segment Duration
0 Sick 2022-03-09 0.04 0 Sick 2
1 Vacation 2022-03-09 3.00 1 VTO 3
2 Late 2022-03-15 5.00 0 Tardy 1
Create a pseudo-key to join on: the order of records, by cumulative count, within each day.
Try this:
df2['TimeAway Type'] = (df2['Schedule Segment'].map({'VTO': 'Vacation',
                                                     'Tardy': 'Late'})
                        .fillna(df2['Schedule Segment']))
merged = pd.merge(df, df2, on=['TimeAway Type', 'Date'])
merged
Output:
TimeAway Type Date Hours Requested Schedule Segment Duration
0 Sick 2022-03-09 0.04 Sick 2
1 Vacation 2022-03-09 3.00 VTO 3
2 Late 2022-03-15 5.00 Tardy 1

Related

How do I aggregate rows in a pandas dataframe according to the latest dates in a column?

I have a dataframe containing materials, dates of purchase and purchase prices. I want to filter my dataframe such that I only keep one row containing each material, and that row contains the material at the latest purchase date and corresponding price.
How could I achieve this? I have racked my brains trying to apply aggregation functions to this, but I just can't work out how.
Do a multi-column sort and then use drop_duplicates, keeping the first occurrence.
import pandas as pd
df.sort_values(by=['materials', 'purchase_date'], ascending=[True, False], inplace=True)
df.drop_duplicates(subset=['materials'], keep='first', inplace=True)
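A minimal self-contained sketch of those two lines, using hypothetical column names and data (the question's frame is not shown):

import pandas as pd

# hypothetical data matching the column names used above
df = pd.DataFrame({'materials': ['steel', 'steel', 'copper'],
                   'purchase_date': pd.to_datetime(['2020-01-05', '2020-06-01', '2020-03-15']),
                   'purchase_price': [100, 120, 80]})

df.sort_values(by=['materials', 'purchase_date'], ascending=[True, False], inplace=True)
df.drop_duplicates(subset=['materials'], keep='first', inplace=True)
print(df)  # one row per material, at its latest purchase_date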
Two steps:
sort_values() by material (ascending) and purchaseDate (descending)
groupby() material and take the first row
import numpy as np
import pandas as pd

d = pd.date_range("1-apr-2020", "30-oct-2020", freq="W")
df = pd.DataFrame({"material": np.random.choice(list("abcd"), len(d)), "purchaseDate": d, "purchasePrice": np.random.randint(1, 100, len(d))})
df.sort_values(["material", "purchaseDate"], ascending=[1, 0]).groupby("material", as_index=False).first()
Output:
  material         purchaseDate  purchasePrice
0        a  2020-09-27 00:00:00             85
1        b  2020-10-25 00:00:00             54
2        c  2020-10-11 00:00:00             21
3        d  2020-10-18 00:00:00             45

How to count the sales of each day according to stocks with pandas

I want to count the sales of each day. The values in the original table are stock levels, not sales.
I used Excel to solve the problem, but now I have millions of products, so I want to solve it with pandas.
I am still new to programming and pandas; I have read up on the pandas docs but am still unable to do it.
pandas.DataFrame.diff() is enough.
df['STOCK'] = df['STOCK'].diff()
df.rename(columns={'STOCK': 'SALE'}, inplace=True)
df.rename(columns={'ID1_stock': 'ID1_sale', 'ID2_stock': 'ID2_sale', 'ID3_stock': 'ID3_sale'}, level=1, inplace=True)
Use DataFrame.diff with rename: first the first level of the MultiIndex, and then the second level with a lambda function:
print (df)
      STOCK
  ID1_stock ID2_stock
0        20        21
1        18        20
2        16        19

df = (df.diff()
        .rename(columns={'STOCK': 'SALE'}, level=0)
        .rename(columns=lambda x: x.replace('stock', 'sale'), level=1))
print (df)
      SALE
  ID1_sale ID2_sale
0      NaN      NaN
1     -2.0     -1.0
2     -2.0     -1.0
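One caveat not covered above: diff() is current minus previous, so falling stock comes out negative. Assuming stock only ever goes down through sales (no restocking), negating the diff gives positive sale counts; a minimal sketch:

import pandas as pd

stock = pd.DataFrame({'ID1_stock': [20, 18, 16], 'ID2_stock': [21, 20, 19]})
sale = -stock.diff()   # previous minus current = units sold that day
sale = sale.rename(columns=lambda c: c.replace('stock', 'sale'))
print(sale)
#    ID1_sale  ID2_sale
# 0       NaN       NaN
# 1       2.0       1.0
# 2       2.0       1.0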

Computing average time between sales

I am working on a transactions dataset and I need to figure out the average time between purchases for each client.
I managed to get the diff between the latest date and the earliest (in months) and divide by the total purchases (NumPurchases). But I am not sure of this approach, as it does not take into consideration the fact that not every customer bought on multiple occasions.
Given the following dataset, how would you extract the average time between purchases?
CustomerID EarliestSale LatestSale NumPurchases
0 1 2017-01-05 2017-12-23 11
1 10 2017-06-20 2017-11-17 5
2 100 2017-05-10 2017-12-19 2
3 1000 2017-02-19 2017-12-30 9
4 1001 2017-02-07 2017-11-18 7
Apologies for the rookie question in advance and thanks StackOverflow community :).
Given your revised question and initial dataset (I've revised your dataset slightly to include two customers):
df = pd.DataFrame({'CustomerId': ['001', '001', '002', '002'],
                   'SaleDate': ['2017-01-10', '2017-04-10', '2017-08-10', '2017-09-10'],
                   'Quantity': [5, 1, 1, 6]})
You can easily include the average time between transactions (in days) in your groupby with the following code:
NOTE: This will only work if your dataset is ordered by CustomerId and then SaleDate.
import pandas as pd

df = pd.DataFrame({'CustomerId': ['001', '001', '002', '002'],
                   'SaleDate': ['2017-01-10', '2017-04-10', '2017-08-10', '2017-09-10'],
                   'Quantity': [5, 1, 1, 6]})

# convert the string date to a datetime
df['SaleDate'] = pd.to_datetime(df.SaleDate)

# sort the dataset
df = df.sort_values(['CustomerId', 'SaleDate'])

# calculate the difference between each date in days
# (the .shift method offsets the rows; play around with it to understand how it works)
# - we apply this to every customer using a groupby
df2 = df.groupby("CustomerId").apply(
    lambda df: (df.SaleDate - df.SaleDate.shift(1)).dt.days).reset_index()
df2 = df2.rename(columns={'SaleDate': 'time-between-sales'})
df2.index = df2.level_1

# then join the result back on to the original dataframe
df = df.join(df2['time-between-sales'])

# add the mean time to your groupby
grouped = df.groupby("CustomerId").agg({
    "SaleDate": ["min", "max"],
    "Quantity": "sum",
    "time-between-sales": "mean"})

# rename columns per your original specification
grouped.columns = grouped.columns.get_level_values(0) + grouped.columns.get_level_values(1)
grouped = grouped.rename(columns={
    'SaleDatemin': 'EarliestSale',
    'SaleDatemax': 'LatestSale',
    'Quantitysum': 'NumPurchases',
    'time-between-salesmean': 'avgTimeBetweenPurchases'})
print(grouped)
EarliestSale LatestSale NumPurchases avgTimeBetweenPurchases
CustomerId
001 2017-01-10 2017-04-10 6 90.0
002 2017-08-10 2017-09-10 7 31.0
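For reference, the per-customer gap itself can be computed more directly with groupby().diff(); a shorter sketch of the same idea, assuming SaleDate is already a datetime and the frame is sorted:

import pandas as pd

df = pd.DataFrame({'CustomerId': ['001', '001', '002', '002'],
                   'SaleDate': pd.to_datetime(['2017-01-10', '2017-04-10',
                                               '2017-08-10', '2017-09-10']),
                   'Quantity': [5, 1, 1, 6]})
df = df.sort_values(['CustomerId', 'SaleDate'])

# days since the same customer's previous sale (NaN for the first sale)
df['time-between-sales'] = df.groupby('CustomerId')['SaleDate'].diff().dt.days
print(df.groupby('CustomerId')['time-between-sales'].mean())
# CustomerId
# 001    90.0
# 002    31.0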

Find first unique items selected by user and rank them in order of user selection by date

I am trying to identify only first orders of unique "items" purchased by "test" customers in a simplified sample dataframe from the dataframe created below:
df=pd.DataFrame({"cust": ['A55', 'A55', 'A55', 'B080', 'B080', 'D900', 'D900', 'D900', 'D900', 'C019', 'C019', 'Z09c', 'A987', 'A987', 'A987'],
"date":['01/11/2016', '01/11/2016', '01/11/2016', '08/17/2016', '6/17/2016','03/01/2016',
'04/30/2016', '05/16/2016','09/27/2016', '04/20/2016','04/29/2016', '07/07/2016', '1/29/2016', '10/17/2016', '11/11/2016' ],
"item": ['A10BABA', 'A10BABA', 'A10DBDB', 'A9GABA', 'A11AD', 'G198A', 'G198A', 'F673', 'A11BB', 'CBA1', 'CBA1', 'DA21',
'BG10A', 'CG10BA', 'BG10A']
})
df.date = pd.to_datetime(df.date)
df = df.sort_values(["cust", "date"], ascending = True)
The desired output would look as shown in the picture: all unique items ordered by date of purchase in a new column called "cust_item_rank", with any repeated (duplicated) orders of the same item by the same user removed.
To clarify further, items purchased on the same date by the same user should have the same order/rank, as shown in the picture for customer A55 (A10BABA and A10DBDB are both ranked 1).
I have spent a fair bit of time using a combination of groupby and/or rank operations but have been unsuccessful thus far. As an example:
df["cust_item_rank"] = df.groupby("cust")["date"]["item"].rank(ascending = 1, method = "min")
Yields an error (Exception: Column(s) date already selected).
Can somebody please guide me to the desired solution here?
# Remove duplicates
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
.reset_index(drop=True))
df2['cust_item_rank'] = df2.groupby('cust').cumcount().add(1)
df2
cust date item cust_item_rank
0 A55 2016-01-11 A10BABA 1
1 A55 2016-01-11 A10DBDB 2
2 A987 2016-01-29 BG10A 1
3 A987 2016-10-17 CG10BA 2
4 B080 2016-06-17 A11AD 1
5 B080 2016-08-17 A9GABA 2
6 C019 2016-04-20 CBA1 1
7 D900 2016-03-01 G198A 1
8 D900 2016-05-16 F673 2
9 D900 2016-09-27 A11BB 3
10 Z09c 2016-07-07 DA21 1
To solve this question, I built upon the excellent initial answer by cs95 and called on the rank function in pandas, as follows:
# remove duplicates as recommended by cs95
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
         .reset_index(drop=True))
# rank by date after grouping by customer
df2["cust_item_rank"] = df2.groupby(["cust"])["date"].rank(ascending=1, method='dense').astype(int)
This resulted in the desired output, with same-date purchases sharing a rank:
    cust       date     item  cust_item_rank
0    A55 2016-01-11  A10BABA               1
1    A55 2016-01-11  A10DBDB               1
2   A987 2016-01-29    BG10A               1
3   A987 2016-10-17   CG10BA               2
4   B080 2016-06-17    A11AD               1
5   B080 2016-08-17   A9GABA               2
6   C019 2016-04-20     CBA1               1
7   D900 2016-03-01    G198A               1
8   D900 2016-05-16     F673               2
9   D900 2016-09-27    A11BB               3
10  Z09c 2016-07-07     DA21               1
It appears that this problem is solved using either "min" or "dense" method of ranking but I chose the latter "dense" method to potentially avoid skipping any rank.

Shift time in multi-index to merge

I want to merge two datasets that are indexed by time and id. The problem is, the time is slightly different in each dataset. In one dataset, the time (Monthly) is mid-month, so the 15th of every month. In the other dataset, it is the last business day. This should still be a one-to-one match, but the dates are not exactly the same.
My approach is to shift mid-month dates to business day end-of-month dates.
Data:
import numpy as np
import pandas as pd
from pandas.tseries.offsets import BMonthEnd

dt = pd.date_range('1/1/2011', '12/31/2011', freq='D')
dt = dt[dt.day == 15]
lst = [1, 2, 3]
idx = pd.MultiIndex.from_product([dt, lst], names=['date', 'id'])
df = pd.DataFrame(np.random.randn(len(idx)), index=idx)
df.head()
output:
0
date id
2011-01-15 1 -0.598584
2 -0.484455
3 -2.044912
2011-02-15 1 -0.017512
2 0.852843
This is what I want (I removed the performance warning):
In[83]:df.index.levels[0] + BMonthEnd()
Out[83]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
'2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
'2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
dtype='datetime64[ns]', freq='BM')
However, indexes are immutable, so this does not work:
In: df.index.levels[0] = df.index.levels[0] + BMonthEnd()
TypeError: 'FrozenList' does not support mutable operations.
The only solution I've got is to reset_index(), change the dates, then set_index() again:
df.reset_index(inplace=True)
df['date'] = df['date'] + BMonthEnd()
df.set_index(['date','id'], inplace=True)
This gives what I want, but is this the best way? Is there a set_level_values() function (I didn't see it in the API)?
Or maybe I'm taking the wrong approach to the merge. I could merge the dataset with keys df.index.get_level_values(0).year, df.index.get_level_values(0).month and id but this doesn't seem much better.
You can use set_levels in order to set multiindex levels:
df.index.set_levels(df.index.levels[0] + pd.tseries.offsets.BMonthEnd(),
                    level='date', inplace=True)
>>> df.head()
0
date id
2011-01-31 1 -1.410646
2 0.642618
3 -0.537930
2011-02-28 1 -0.418943
2 0.983186
You could just build it again:
df.index = pd.MultiIndex.from_arrays(
    [
        df.index.get_level_values(0) + BMonthEnd(),
        df.index.get_level_values(1)
    ])
set_levels implicitly rebuilds the index under the covers. If you have more than two levels, this solution becomes unwieldy, so consider using set_levels for typing brevity.
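On newer pandas versions, set_levels no longer accepts inplace=True, so assign the result back instead; a minimal sketch of the same idea in that style:

from pandas.tseries.offsets import BMonthEnd

# set_levels returns a new MultiIndex rather than modifying the frame in place
df.index = df.index.set_levels(df.index.levels[0] + BMonthEnd(), level='date')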
Since you want to merge anyway, you can forget about changing the index and use pandas.merge_asof()
Data
df1
0
date id
2011-01-15 1 -0.810581
2 1.177235
3 0.083883
2011-02-15 1 1.217419
2 -0.970804
3 1.262364
2011-03-15 1 -0.026136
2 -0.036250
3 -1.103929
2011-04-15 1 -1.303298
And here is one with the last business day of the month, df2
0
date id
2011-01-31 1 -0.277675
2 0.086539
3 1.441449
2011-02-28 1 1.330212
2 -0.028398
3 -0.114297
2011-03-31 1 -0.031264
2 -0.787093
3 -0.133088
2011-04-29 1 0.938732
merge
Use df1 as your left DataFrame and then choose the merge direction as forward, since the last business day is always after the 15th. Optionally, you can set a tolerance. This is useful when you are missing a month in the right DataFrame: it will prevent you from merging 03-31-2011 to 02-15-2011 if you are missing data for the last business day of February.
import pandas as pd

pd.merge_asof(df1.reset_index(), df2.reset_index(), by='id', on='date',
              direction='forward', tolerance=pd.Timedelta(days=20)).set_index(['date', 'id'])
Results in
0_x 0_y
date id
2011-01-15 1 -0.810581 -0.277675
2 1.177235 0.086539
3 0.083883 1.441449
2011-02-15 1 1.217419 1.330212
2 -0.970804 -0.028398
3 1.262364 -0.114297
2011-03-15 1 -0.026136 -0.031264
2 -0.036250 -0.787093
3 -1.103929 -0.133088
2011-04-15 1 -1.303298 0.938732
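One caveat worth noting: merge_asof requires both frames to be sorted by the on key and raises a ValueError otherwise. Here the dates come out of the MultiIndex already ordered, but with arbitrary input a defensive sort is cheap; a sketch, assuming the df1/df2 frames above:

# sort both sides by the asof key before merging
left = df1.reset_index().sort_values('date')
right = df2.reset_index().sort_values('date')
merged = pd.merge_asof(left, right, by='id', on='date',
                       direction='forward', tolerance=pd.Timedelta(days=20))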
