How to find customers who made 2nd purchase within 30 days? - python

I need your quick help. I want to find a list of customer_id's and first purchase_date for customers who have made their second purchase within 30 days of their first purchase.
i.e. customer_id's 1, 2, 3 have made their 2nd purchase within 30 days.
I need customer_id's 1, 2, 3 and their respective first purchase_date.
I have more than 100k customer_id's.
How can I achieve this in pandas?

You can do it with groupby
# Assumes each customer's rows are already sorted by purchase_date;
# the len() guard skips customers with a single purchase instead of raising an IndexError
s = df.groupby('Customer_id')['purchase_date'].apply(lambda x: len(x) > 1 and (x.iloc[1] - x.iloc[0]).days < 30)
out = df.loc[df.Customer_id.isin(s.index[s])].drop_duplicates('Customer_id')

Here is a way:
# Keep rows whose purchase_date is within 30 days of that customer's first purchase
df2 = df.loc[df['purchase_date'].lt(
    df['Customer_id'].map(
        (df.sort_values('purchase_date').groupby('Customer_id').first()
         + pd.to_timedelta(30, 'd')).squeeze()
    )
)]
# Keep customers with at least two such rows, then take the first purchase per customer
df2 = (df2.loc[df2.duplicated('Customer_id', keep=False)]
       .groupby('Customer_id').first())

You can set a boolean mask to filter the groups of customers who have made their second purchase within 30 days, as follows:
# Pre-processing: convert the date column to datetime, then sort the data
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
df = df.sort_values(['Customer_id', 'purchase_date'])
# Set boolean mask: True for every row of a customer who has two consecutive
# purchases no more than 30 days apart
mask = (((df['purchase_date'] - df['purchase_date'].groupby(df['Customer_id']).shift()).dt.days <= 30)
        .groupby(df['Customer_id'])
        .transform('any'))
Then, you can filter the transaction records of customers whose second purchase came within 30 days:
df[mask]
To further show the customer_id's and their respective first purchase_date, you can use:
df[mask].groupby('Customer_id', as_index=False).first()
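For illustration, a minimal end-to-end sketch of the mask approach on made-up data (the IDs and dates below are assumptions, not the asker's data):
import pandas as pd

# Hypothetical sample: customers 1 and 3 re-purchase within 30 days, customer 2 does not
df = pd.DataFrame({
    'Customer_id': [1, 1, 2, 2, 3, 3],
    'purchase_date': ['2021-01-01', '2021-01-10',
                      '2021-01-01', '2021-03-01',
                      '2021-02-01', '2021-02-20'],
})
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
df = df.sort_values(['Customer_id', 'purchase_date'])

gap = (df['purchase_date'] - df['purchase_date'].groupby(df['Customer_id']).shift()).dt.days
mask = (gap <= 30).groupby(df['Customer_id']).transform('any')

print(df[mask].groupby('Customer_id', as_index=False).first())
# Customers 1 and 3 are returned, each with their first purchase_date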

Related

Pandas - using group by and including value counts which are larger than n

I have a table which includes salary and company_location.
I was trying to calculate the mean salary per country, and it works:
wage = df.groupby('company_location').mean()['salary']
However, many company_location values have fewer than 5 entries, and I would like to exclude them from the report.
I know how to find the 5 countries with the most entries:
Top_5 = df['company_location'].value_counts().head(5)
I am just having a problem connecting those two variables into one and making a graph out of it...
Thank you.
You can remove rows whose value occurrence is below a threshold:
# Keep only rows whose company_location appears at least 5 times
df = df[df.groupby('company_location')['company_location'].transform('size') >= 5]
You can do the following to only apply the groupby and aggregation to locations with at least 5 records:
mask = (df['company_location'].map(df['company_location'].value_counts()) >= 5)
wage = df[mask].groupby('company_location')['salary'].mean()
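To get the graph mentioned in the question, a minimal sketch (assuming matplotlib is installed; the bar chart is just one possible choice):
import matplotlib.pyplot as plt

wage.sort_values(ascending=False).plot(kind='bar')  # mean salary per location
plt.ylabel('mean salary')
plt.tight_layout()
plt.show()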

Grouping information by multiple criteria?

so I have a data frame similar to the one below, where Stamp is a datetime index;
for context, it represents orders received, and my goal is to match orders that may be the same but have come in as two separate orders.
Stamp   Price   indicator   EX   qty
1234    10      1           d    12
2345    30      -1          d    13
I want to group entries that have the same date time stamp, given that those entries have the same EX and Indicator.
I think I know how to do this with just the stamp however I'm unsure how to add the conditions of EX and indicator to the groupby.
Beginner here so any help is greatly appreciated!
Try this:
df.groupby(["Stamp", "EX", "indicator"])
And if you then want to get the sum of quantities and prices you can do this:
df.groupby(["Stamp", "EX", "indicator"]).sum()
You can group by more than one column: df.groupby(['Stamp', 'EX'])
Then you can check the length of each group to see if there are multiple rows that share both columns:
df.groupby(['Stamp', 'EX']).apply(len)
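A quick sketch on made-up rows like the sample above, assuming duplicate orders should have their quantities summed (the aggregation choice is just an assumption):
import pandas as pd

# Hypothetical orders: the two 1234/d/1 rows represent one order split in two
df = pd.DataFrame({
    'Stamp':     [1234, 1234, 2345],
    'Price':     [10, 10, 30],
    'indicator': [1, 1, -1],
    'EX':        ['d', 'd', 'd'],
    'qty':       [12, 5, 13],
})

merged = (df.groupby(['Stamp', 'EX', 'indicator'], as_index=False)
            .agg({'qty': 'sum', 'Price': 'first'}))
print(merged)  # 1234/d/1 collapses to one row with qty 17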

Select every 1 hr timestamp record/based on condtion in Pandas

Here, I need to fetch the records where the Food value is not NaN, keeping one record per 1 hr window of the timestamp.
You can filter out the Nans (and sort by Time, if it isn't sorted):
data_filtered = data[data['Food'].notna()].sort_values('Time', ascending = True)
And then use groupby:
data_filtered.groupby(data_filtered['Time'].dt.to_period('H')).first()
PS. Please paste the example data instead of a screenshot next time (it's easier to test).
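Since the data was only posted as a screenshot, here is a minimal sketch with made-up Time/Food columns to show what the two steps return:
import pandas as pd

data = pd.DataFrame({
    'Time': pd.to_datetime(['2022-01-01 09:05', '2022-01-01 09:40',
                            '2022-01-01 10:15', '2022-01-01 11:30']),
    'Food': [120.0, None, 80.0, 95.0],
})

# Drop NaN Food values, then keep the first reading per 1 hr window
data_filtered = data[data['Food'].notna()].sort_values('Time', ascending=True)
print(data_filtered.groupby(data_filtered['Time'].dt.to_period('H')).first())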

Usage of WHERE CLAUSE in Python

Trying to write SQL queries in Python.
I have a data frame df with columns date, ID and amount
Every month I get a new load of data. I have to calculate the average amount for a particular ID over the last 12 months (meaning we will have 12 records for that one ID).
Currently, my approach is
M1 = pd.date_range(first_day_of_month, last_day_of_12_month, freq='D').strftime("%Y%d%m").tolist()
df["new"] = df[(df['date'].isin(M1))]['amount'].mean()
Now I want to add this average as a new column, so that each ID's row with the current (latest) timestamp carries the average of the last 12 months' amount. I tried using groupby but was not able to apply it properly.
mask = d.date.between(datetime.datetime(2019, 1, 1), datetime.datetime(2019, 12, 31))
d[mask].groupby(['ID'])['amount'].mean()
I guess? Maybe something like that?
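If the goal is to attach the 12-month average as a new column per ID, a hedged sketch (column names date / ID / amount come from the question; the window bounds and the avg_12m column name are assumptions):
import pandas as pd

df['date'] = pd.to_datetime(df['date'])

# Assume the window is the 12 months ending at the latest date in the data
end = df['date'].max()
start = end - pd.DateOffset(months=12)
window = df[df['date'].between(start, end)]

# Average amount per ID over that window, broadcast back onto every row
df['avg_12m'] = df['ID'].map(window.groupby('ID')['amount'].mean())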

how to divide pandas dataframe into different dataframes based on unique values from one column and iterate over that?

I have a dataframe with three columns.
The first column has 3 unique values. I used the code below to create separate dataframes; however, I am unable to iterate over them and am not sure how to do so.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:, 0].unique())  # let's assume the unique values are 0, 1, 2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:, 0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
For example, let's say I want to find out the length of the first unique dataframe.
If I manually type the name of the DF, I get the correct output:
len(df0)
O/P
35
But I am trying to automate the code, so I want to find the length and iterate over that dataframe normally, as I would by typing its name.
What I'm looking for is this: if I try the code below,
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a dictionary using the code below, where the key is the unique group and the value contains the remaining columns of the matching rows, but I can't figure out how to iterate over the dictionary when the DF has more than two columns.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group Description Document
Group A Text to be updated on the ticket 1 doc1.pdf
Group B Text to be updated on the ticket 2 doc2.pdf
Group A Text to be updated on the ticket 3 doc3.pdf
Group B Text to be updated on the ticket 4 doc4.pdf
Group A Text to be updated on the ticket 5 doc5.pdf
Group B Text to be updated on the ticket 6 doc6.pdf
Group C Text to be updated on the ticket 7 doc7.pdf
Group C Text to be updated on the ticket 8 doc8.pdf
Let's assume there are 100 rows of data.
I'm trying to automate ServiceNow ticket creation with the above data.
My end goal: Group A tickets should go to one group, but a unique task has to be created for each description. We can club 10 tasks together and submit them as one request, so if I divide the dataframe into separate dataframes based on Assignment_group it would be easier to iterate over (that's the only idea I could think of).
For example lets say we have REQUEST001
within that request it will have multiple sub tasks such as STASK001,STASK002 ... STASK010.
hope this helps
Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kinds of operations (sum, count, std, etc.) on the remaining columns, like getting the mean price for each group if that were a column.
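And since the end goal is to iterate per group and batch tasks, here is a minimal sketch that loops over the groups directly instead of creating variables with globals() (the batch size of 10 comes from the question; create_request is a hypothetical placeholder for the ServiceNow call):
import pandas as pd

df = pd.read_excel("input.xlsx")

for group_name, group_df in df.groupby('Assignment Group'):
    print(group_name, len(group_df))  # length of each per-group dataframe
    # Club up to 10 sub-tasks into one request
    for start in range(0, len(group_df), 10):
        batch = group_df.iloc[start:start + 10]
        # create_request(group_name, batch)  # hypothetical ServiceNow submission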
I think you want to try something like len(eval('df%s' % 0))
