Problem
I have a dataframe df:
Index Client_ID Date
1 johndoe 2019-01-15
2 johndoe 2015-11-25
3 pauldoe 2015-05-26
And I have another dataframe df_prod, with products like this:
Index Product-Type Product-Date Buyer Price
1 A 2020-01-01 pauldoe 300
2 A 2018-01-01 pauldoe 200
3 A 2019-01-01 johndoe 600
4 A 2017-01-01 johndoe 800
5 A 2020-11-05 johndoe 100
6 B 2014-12-12 johndoe 200
7 B 2016-11-15 johndoe 300
What I want is to add a column to df that sums the Prices of the last product of each type known at the current date (i.e. with Product-Date <= df.Date). An example is the best way to explain:
For the first row of df:
1 johndoe 2019-01-15
The last A-Product known at this date bought by johndoe is this one:
3 A 2019-01-01 johndoe 600
(since the 4th one is older, and the 5th one has a Product-Date > Date)
The last B-Product known at this date bought by johndoe is this one:
7 B 2016-11-15 johndoe 300
So the row in df, after transformation, will look like this (900 being 600 + 300, the prices of the 2 products of interest):
1 johndoe 2019-01-15 900
The full df after transformation will then be:
Index Client_ID Date LastProdSum
1 johndoe 2019-01-15 900
2 johndoe 2015-11-25 200
3 pauldoe 2015-05-26 0
As you can see, there are multiple possibilities:
Buyers didn't necessarily buy every product type (see pauldoe, who only bought A-products)
Sometimes, no product is known at df.Date (see row 3 of the new df: in 2015, we don't know of any product bought by pauldoe yet)
Sometimes, only one product is known at df.Date, and the sum is just that product's price (see row 2 of the new df: in 2015, we only know one product for johndoe, a B-product bought in 2014, whose price is 200)
What I did:
I found a solution to this problem, but it takes far too long to run, since my dataframe is huge.
I iterate over the rows of df with iterrows; for each row, I select the products in df_prod linked to the Buyer with Product-Date <= Date, keep the most recent one per Product-Type by grouping and taking the max date, and finally sum the prices of those products.
Solving the problem row by row (with a for loop over iterrows), extracting a slice of df_prod for each row of df, is what makes it so slow.
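Here is a simplified sketch of what I do (not my exact code; it assumes both date columns are already datetimes):
# slow baseline: one iterrows pass over df, one filter of df_prod per row
last_prod_sums = []
for _, row in df.iterrows():
    # products of this buyer already known at row['Date']
    known = df_prod[(df_prod['Buyer'] == row['Client_ID']) &
                    (df_prod['Product-Date'] <= row['Date'])]
    # keep the most recent product of each type, then sum their prices
    last = known.sort_values('Product-Date').groupby('Product-Type').tail(1)
    last_prod_sums.append(last['Price'].sum())
df['LastProdSum'] = last_prod_sums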
I'm almost sure there's a better way to solve this with pandas functions (pivot, for example), but I couldn't find it. I've been searching a lot.
Thanks in advance for your help
Edit after Dani's answer
Thanks a lot for your answer. It looks really good, and I accepted it since you spent a lot of time on it.
Execution is still pretty slow, though, because of something I failed to specify.
In fact, Product-Types are not shared across buyers: each buyer has their own set of product types. The true picture looks like this:
Index Product-Type Product-Date Buyer Price
1 pauldoe-ID1 2020-01-01 pauldoe 300
2 pauldoe-ID1 2018-01-01 pauldoe 200
3 johndoe-ID2 2019-01-01 johndoe 600
4 johndoe-ID2 2017-01-01 johndoe 800
5 johndoe-ID2 2020-11-05 johndoe 100
6 johndoe-ID3 2014-12-12 johndoe 200
7 johndoe-ID3 2016-11-15 johndoe 300
As you can understand, product types are not shared across different buyers (it can happen, but only in situations rare enough to ignore here)
The problem remains the same: since you want to sum prices, you add the prices of the last occurrences of johndoe-ID2 and johndoe-ID3 to get the same final result row
1 johndoe 2019-01-15 900
But as you now understand, there are actually more Product-Types than Buyers, so the "get unique product types" step from your answer, which looked fast on the initial problem, actually takes a lot of time.
Sorry for being unclear on this point; I didn't think of the possibility of creating a new df based on product types.
The main idea is to use merge_asof to fetch the last date for each Product-Type and Client_ID, so do the following:
import pandas as pd

# get unique product types
product_types = list(df_prod['Product-Type'].unique())
# create a new DataFrame with a row for each Product-Type for each Client_ID
df['Product-Type'] = [product_types for _ in range(len(df))]
df_with_prod = df.explode('Product-Type')
# merge only the closest date by each client and product type
merge = pd.merge_asof(df_with_prod.sort_values(['Date', 'Client_ID']),
df_prod.sort_values(['Product-Date', 'Buyer']),
left_on='Date',
right_on='Product-Date',
left_by=['Client_ID', 'Product-Type'], right_by=['Buyer', 'Product-Type'])
# fill na in prices
merge['Price'] = merge['Price'].fillna(0)
# sum Price by client and date
res = merge.groupby(['Client_ID', 'Date'], as_index=False)['Price'].sum().rename(columns={'Price' : 'LastProdSum'})
print(res)
Output
Client_ID Date LastProdSum
0 johndoe 2015-11-25 200.0
1 johndoe 2019-01-15 900.0
2 pauldoe 2015-05-26 0.0
The problem is that merge_asof matches at most one row per left-hand row, so the left frame needs a separate row for each (Client_ID, Product-Type) pair. These new rows are the cartesian product of Client_ID and Product-Type, built in this part:
# get unique product types
product_types = list(df_prod['Product-Type'].unique())
# create a new DataFrame with a row for each Product-Type for each Client_ID
df['Product-Type'] = [product_types for _ in range(len(df))]
df_with_prod = df.explode('Product-Type')
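On the example data, df_with_prod then has one row per (client, product type) pair:
Index Client_ID Date Product-Type
1 johndoe 2019-01-15 A
1 johndoe 2019-01-15 B
2 johndoe 2015-11-25 A
2 johndoe 2015-11-25 B
3 pauldoe 2015-05-26 A
3 pauldoe 2015-05-26 B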
Finally, do a groupby and sum the Price, after first using fillna to replace the missing prices with 0.
UPDATE
You could try:
# get unique product types
product_types = df_prod.groupby('Buyer')['Product-Type'].apply(lambda x: list(set(x)))
# create a new DataFrame with a row for each Product-Type for each Client_ID
df['Product-Type'] = df['Client_ID'].map(product_types)
df_with_prod = df.explode('Product-Type')
# merge only the closest date by each client and product type
merge = pd.merge_asof(df_with_prod.sort_values(['Date', 'Client_ID']),
df_prod.sort_values(['Product-Date', 'Buyer']),
left_on='Date',
right_on='Product-Date',
left_by=['Client_ID', 'Product-Type'], right_by=['Buyer', 'Product-Type'])
# fill na in prices
merge['Price'] = merge['Price'].fillna(0)
# sum Price by client and date
res = merge.groupby(['Client_ID', 'Date'], as_index=False)['Price'].sum().rename(columns={'Price' : 'LastProdSum'})
print(res)
The idea here is to change how you generate the unique values: build the list of product types per buyer instead of globally.
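On the edited example data, that per-buyer mapping would look something like this (the order inside each list is not guaranteed, since it comes from a set):
Buyer
johndoe    [johndoe-ID2, johndoe-ID3]
pauldoe    [pauldoe-ID1]
Name: Product-Type, dtype: object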
Hello Stack Overflow community,
I've got this DataFrame:
code sum of August
AA 1000
BB 4000
CC 72262
So there are two columns: ['code', 'sum of August'].
I have to convert this DataFrame to one with the columns ['month', 'year', 'code', 'sum of August']:
month year code sum of August
8 2020 AA 1000
8 2020 BB 4000
8 2020 CC 72262
The ['sum of August'] column is sometimes named just ['August'] or ['august']. It can also be ['sum of November'], ['November'], or ['november'].
I thought of using regex to extract the month name and convert it to a month number.
Can anyone please help me with this?
Thanks in advance!
You can do the following:
month = {1:'january',
2:'february',
3:'march',
4:'april',
5:'may',
6:'june',
7:'july',
8:'august',
9:'september',
10:'october',
11:'november',
12:'december'}
Let's say your DataFrame is called df. Then you can create the month column automatically using the following:
df['month'] = [i for i, j in month.items() if j in " ".join(df.columns).lower()][0]
code sum of August month
0 AA 1000 8
1 BB 4000 8
2 CC 72262 8
That means that if a month's name appears anywhere in the column names, the expression returns that month's number.
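If you also need the other requested columns, one possible extension (note: the year is not encoded in the column name, so here it is assumed to be a known constant, 2020 as in the example):
# the sum column is whichever column isn't 'code' or one we just added
sum_col = [c for c in df.columns if c not in ('code', 'month', 'year')][0]
df['year'] = 2020  # assumed constant; adjust if the year comes from elsewhere
df = df[['month', 'year', 'code', sum_col]]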
It looks like you're trying to convert month names to their numbers, and the column names can be uppercase or lowercase.
This might work:
months = ['january','february','march','april','may','june','july','august','september','october','november','december']
monthNum = []  # if you're using a list, just to make this run
sumOfMonths = ['sum of august', 'sum of NovemBer']  # just to show functionality
for sumOfMonth in sumOfMonths:
    for idx, month in enumerate(months):
        if month in sumOfMonth.lower():  # if the column name contains any month keyword
            monthNum.append(str(idx + 1))  # assuming a list; append the index + 1
I hope this helps! Of course, this isn't exactly what you'll do; fill in the variables and change append() if you're not using a list.
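Since the question mentions regex, here is a minimal sketch of that route as well, using the standard calendar module (month_to_num and extract_month are names I made up):
import calendar
import re

# map lowercase month names to their numbers: {'january': 1, ..., 'december': 12}
month_to_num = {name.lower(): num for num, name in enumerate(calendar.month_name) if name}

def extract_month(col_name):
    # find a month name anywhere inside a header like 'sum of August'
    match = re.search('|'.join(month_to_num), col_name.lower())
    return month_to_num[match.group(0)] if match else None

print(extract_month('sum of August'))  # 8
print(extract_month('NovemBer'))       # 11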
I am trying to identify only the first orders of unique "items" purchased by "test" customers in a simplified sample dataframe, created below:
df=pd.DataFrame({"cust": ['A55', 'A55', 'A55', 'B080', 'B080', 'D900', 'D900', 'D900', 'D900', 'C019', 'C019', 'Z09c', 'A987', 'A987', 'A987'],
"date":['01/11/2016', '01/11/2016', '01/11/2016', '08/17/2016', '6/17/2016','03/01/2016',
'04/30/2016', '05/16/2016','09/27/2016', '04/20/2016','04/29/2016', '07/07/2016', '1/29/2016', '10/17/2016', '11/11/2016' ],
"item": ['A10BABA', 'A10BABA', 'A10DBDB', 'A9GABA', 'A11AD', 'G198A', 'G198A', 'F673', 'A11BB', 'CBA1', 'CBA1', 'DA21',
'BG10A', 'CG10BA', 'BG10A']
})
df.date = pd.to_datetime(df.date)
df = df.sort_values(["cust", "date"], ascending = True)
The desired output would look as shown in the picture: all unique items ranked by date of purchase in a new column called "cust_item_rank", with any repeated (duplicate) orders of the same item by the same user removed.
To clarify further, items purchased on the same date by the same user should share the same rank, as shown in the picture for customer A55 (A10BABA and A10DBDB are both ranked 1).
I have spent a fair bit of time trying combinations of groupby and/or rank operations, but without success so far. As an example:
df["cust_item_rank"] = df.groupby("cust")["date"]["item"].rank(ascending = 1, method = "min")
Yields an error (Exception: Column(s) date already selected).
Can somebody please guide me to the desired solution here?
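As an aside, the exception in your attempt comes from the chained selection: df.groupby("cust")["date"] already selects the date column, so the second ["item"] subscript is rejected. Selecting several columns from a groupby is written with a list instead (though that alone still wouldn't produce the desired rank):
# select multiple columns from a groupby with a list, not chained subscripts
grouped = df.groupby("cust")[["date", "item"]]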
# Remove duplicates
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
.reset_index(drop=True))
# number the remaining (first-occurrence) rows within each customer, starting at 1
df2['cust_item_rank'] = df2.groupby('cust').cumcount().add(1)
df2
cust date item cust_item_rank
0 A55 2016-01-11 A10BABA 1
1 A55 2016-01-11 A10DBDB 2
2 A987 2016-01-29 BG10A 1
3 A987 2016-10-17 CG10BA 2
4 B080 2016-06-17 A11AD 1
5 B080 2016-08-17 A9GABA 2
6 C019 2016-04-20 CBA1 1
7 D900 2016-03-01 G198A 1
8 D900 2016-05-16 F673 2
9 D900 2016-09-27 A11BB 3
10 Z09c 2016-07-07 DA21 1
To solve this question, I built upon the excellent initial answer by cs95 and called on the rank function in pandas as follows:
# remove duplicates, as recommended by cs95
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
       .reset_index(drop=True))
# rank by date after grouping by customer
df2["cust_item_rank"] = df2.groupby(["cust"])["date"].rank(ascending=1, method='dense').astype(int)
This resulted in the following (desired) output, reproduced here from running the two snippets above on the sample data:
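cust date item cust_item_rank
0 A55 2016-01-11 A10BABA 1
1 A55 2016-01-11 A10DBDB 1
2 A987 2016-01-29 BG10A 1
3 A987 2016-10-17 CG10BA 2
4 B080 2016-06-17 A11AD 1
5 B080 2016-08-17 A9GABA 2
6 C019 2016-04-20 CBA1 1
7 D900 2016-03-01 G198A 1
8 D900 2016-05-16 F673 2
9 D900 2016-09-27 A11BB 3
10 Z09c 2016-07-07 DA21 1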
It appears that this problem can be solved with either the "min" or the "dense" ranking method, but I chose "dense" to avoid skipping any ranks.
I have data on three variables and want to find the largest X values of one variable on a per-day basis. Previously I wrote some code to find the hour when the max value of the day occurred, but now I want to add options to find more max hours per day.
I've been able to find the top X values per day for all the days, but I've gotten stuck on narrowing that down to the top X values from the top X days. I've included pictures detailing what the end result would hopefully look like.
Data
Identified Top 2 Hours
Code
df = pd.DataFrame(
    {'ID': ['ID_1'] * 24,
     'Year': [2018] * 24,
     'Month': [6] * 24,
     'Day': [12]*4 + [13]*4 + [14]*4 + [15]*4 + [16]*4 + [17]*4,
     'Hour': [19,20,21,22,11,12,13,19,19,20,21,22,18,19,20,21,19,20,21,23,19,20,21,22],
     'var_1': [0.83,0.97,0.69,0.73,0.66,0.68,0.78,0.82,1.05,1.05,1.08,0.88,0.96,0.81,0.71,0.88,1.08,1.02,0.88,0.79,0.91,0.91,0.80,0.96],
     'var_2': [47.90,42.85,67.37,57.18,66.13,59.96,52.63,54.75,32.54,36.58,36.99,37.23,46.94,52.80,68.79,50.84,37.79,43.54,48.04,38.01,42.22,47.13,50.96,44.19],
     'var_3': [99.02,98.10,98.99,99.12,98.78,98.90,99.09,99.20,99.22,99.11,99.18,99.24,99.00,98.90,98.87,99.07,99.06,98.86,98.92,99.32,98.93,98.97,98.99,99.21]})
# Get the top 2 var2 values each day
top_two_var2_each_day = df.groupby(['ID', 'Year', 'Month', 'Day'])['var_2'].nlargest(2)
top_two_var2_each_day = top_two_var2_each_day.reset_index()
# make the stored original row positions (level_4) the index
top_two_var2_each_day = top_two_var2_each_day.set_index('level_4')
# use that index to pull from df the full rows where the top 2 values of var_2 occurred
top_2_all_vars = df[df.index.isin(top_two_var2_each_day.index)]
End Goal Result
I figure the best way would be to average the two hours to identify which days have the largest averages, then go back into the top_2_all_vars dataframe and grab the rows for those days, but I am unsure how to proceed.
mean_day = top_2_all_vars.groupby(['ID', 'Year', 'Month', 'Day'],as_index=False)['var_2'].mean()
top_2_day = mean_day.nlargest(2, 'var_2')
Final Dataframe
This is the result I am trying to find: a dataframe consisting of the top 2 values of var_2 from each of the top 2 days.
Below is the code I previously used to find the single largest value of each day, but I don't know how to make it work for more than a single max per day.
# For each ID and Day, Find the Hour where the Max Amount of var_2 occurred and save the index location
df_idx = df.groupby(['ID', 'Year', 'Month', 'Day',])['var_2'].transform(max) == df['var_2']
# Now the hour has been found, store the rows in a new dataframe based on the saved index location
top_var2_hour_of_each_day = df[df_idx]
Using groupbys may not be the best way to go about it, but I am open to anything.
This is one approach:
If your data spans multiple months, it's a lot harder to deal with when the month and day are in different columns. So first I make a new column called 'Date' that simply combines the month and the day.
df['Date'] = df['Month'].astype('str')+"-"+df['Day'].astype('str')
Next we need the top two values of var_2 per day, and then average them. So we can create a really simple function to find exactly that.
def topTwoMean(series):
    # sort once, descending, then average the two largest values
    ordered = series.sort_values(ascending=False)
    return (ordered.iloc[0] + ordered.iloc[1]) / 2
We then use our function, sort by the average of var_2 to get the two highest days, and save those dates to a list.
maxDates = df.groupby('Date').agg({'var_2': [topTwoMean]})\
.sort_values(by = ('var_2', 'topTwoMean'), ascending = False)\
.reset_index()['Date']\
.head(2)\
.to_list()
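With the sample data, the two best daily averages are 6-13 ((66.13 + 59.96) / 2 = 63.045) and 6-12 ((67.37 + 57.18) / 2 = 62.275), so maxDates should come out as ['6-13', '6-12'].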
Finally, we filter by the dates chosen above, then take the two highest values of var_2 on each of those days.
df[df['Date'].isin(maxDates)]\
.groupby('Date')\
.apply(lambda x: x.sort_values('var_2', ascending = False).head(2))\
.reset_index(drop = True)
ID Year Month Day Hour var_1 var_2 var_3 Date
0 ID_1 2018 6 12 21 0.69 67.37 98.99 6-12
1 ID_1 2018 6 12 22 0.73 57.18 99.12 6-12
2 ID_1 2018 6 13 11 0.66 66.13 98.78 6-13
3 ID_1 2018 6 13 12 0.68 59.96 98.90 6-13