A fairly beginner-level pandas question here.
I have a DataFrame of transactions:
Customer   Date         Amount
Angus      2009-07-18   $76.46
Bruno      2009-07-21   $68.66
Danno      2009-07-25   $73.52
Chapp      2009-07-11   $56.04
Chapp      2009-07-21   $11.30
Frank      2009-07-07   $52.86
Chapp      2009-07-09   $97.82
Danno      2009-07-11   $84.98
(etc. for thousands of lines)
I'd like to create four DataFrames from this data:
1. For each customer: the customer's name, how many transactions they've done, and the sum of the Amounts of those transactions.
2. For each customer: the date and amount of their most recent transaction.
3. For each customer: the date and amount of their first transaction.
4. For each customer: the date and amount of their largest (amount-wise) transaction.
Can you advise me on the appropriate code?
(Answers along the lines of "Why are you using DataFrames? You should be using ThnargLopes for this!" will be warmly received.)
I think a DataFrame is a great structure for your data. Whenever you're setting up for a "split-apply-combine" set of analysis steps, Pandas excels. You can write a function that assumes you only have one customer and returns a Series like you're looking for.
import pandas as pd
def trans_count(DF):
    return pd.Series({'count': len(DF),
                      'total': DF['Amount'].sum()})
Then use groupby and apply:
yourDF.groupby('Customer').apply(trans_count)
However, since each of your new DataFrames is a summary of a single customer, I would suggest writing one function that can return all of your desired results in a single Series.
untested from my phone!
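For instance, a rough sketch of such a combined function (also untested, and assuming Amount has already been converted to a numeric dtype and Date to datetime) might look like this:

import pandas as pd

def customer_summary(g):
    # g is the sub-DataFrame for a single customer
    g = g.sort_values('Date')
    return pd.Series({
        'count': len(g),
        'total': g['Amount'].sum(),
        'first_date': g['Date'].iloc[0],
        'first_amount': g['Amount'].iloc[0],
        'last_date': g['Date'].iloc[-1],
        'last_amount': g['Amount'].iloc[-1],
        'max_amount': g['Amount'].max(),
    })

summary = yourDF.groupby('Customer').apply(customer_summary)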
OK, I've figured this out. First, we make a transaction field of ones to sum:
df["Trans"] = len(df)*[1]
We group by Customer:
cust_gp = df.groupby("Customer")
The first one's easiest:
cust_gp.sum()
Four is also not hard:
cust_gp.max()
2 and 3 were tricky... I found a solution that seemed to work with my test data: sort by Customer and Date, then take the first (or last) row for each Customer:
df.sort_values(["Customer", "Date"]).groupby("Customer").first()
df.sort_values(["Customer", "Date"]).groupby("Customer").last()
...but when I ran it on my big data set, I was told that some customers' most recent transactions took place before their first transactions, which makes no sense.
It turned out that the date field was being imported as text! So, complete solution:
df.Date = pd.to_datetime(df.Date) # Date field should be date, not text
df = df.sort_values(["Customer", "Date"])
cust_gp = df.groupby("Customer")
total_df = cust_gp.sum()      # 1: count (sum of Trans) and total Amount
largest_df = cust_gp.max()    # 4: largest transaction
first_df = cust_gp.first()    # 3: first transaction
last_df = cust_gp.last()      # 2: most recent transaction
I'm pleased with this, except for the "Trans" column of ones, which I'm sure isn't implemented in the most elegant way.
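As a side note, cust_gp.max() takes the maximum of each column independently, so the Date it reports is not necessarily the date of the largest Amount. A more idiomatic sketch that avoids the column of ones and picks out the actual row of the largest transaction (assuming pandas >= 0.25 for named aggregation, Date already a datetime, and Amount numeric) could be:

gp = df.sort_values(["Customer", "Date"]).groupby("Customer")

totals_df = gp["Amount"].agg(count="size", total="sum")         # 1
last_df = gp[["Date", "Amount"]].last()                         # 2
first_df = gp[["Date", "Amount"]].first()                       # 3
largest_df = df.loc[df.groupby("Customer")["Amount"].idxmax(),  # 4
                    ["Customer", "Date", "Amount"]]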
I have a data set that contains hourly data for marketing campaigns. There are several campaigns, and not all of them are active during all 24 hours of the day. My goal is to eliminate, per campaign, the rows of any day for which I don't have all 24 hourly data rows.
The raw data contains a lot of information like this:
[screenshot: original data set]
I created a dummy variable of ones so I could count individual rows. This is the code I applied to see the results I want:
tmp = df.groupby(['id','date']).count()
tmp.query('Hour > 23')
I get the following results:
[screenshot: results of the two lines of code]
These results illustrate exactly the data that I want to keep in my data frame.
How can I eliminate, per campaign and per day, the data that does not reach 24 rows? The objective is to keep the real data, not the counts, i.e. ungrouped, unlike what I show in the second screenshot.
I appreciate the guidance.
Use transform to broadcast the count over all rows of your dataframe, then use loc as a replacement for query:
# Keep only the rows whose (id, date) group has more than 23 hourly records
out = df.loc[df.groupby(['id', 'date'])['Hour'].transform('count')
               .loc[lambda x: x > 23].index]
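If you prefer to keep the whole rows in one step, an equivalent sketch (assuming a complete campaign-day has exactly 24 hourly rows) uses groupby().filter:

# Keep only the (id, date) groups that have a full 24 hours of data
out = df.groupby(['id', 'date']).filter(lambda g: len(g) == 24)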
Drop the data you don't want before you do the groupby. You can use .loc or .drop; I am unfamiliar with .query.
So I have the following sample dataset:
Column A: Name
Column B: Email
Column C: Products
Column D: Transaction Date
I have two objectives:
1. To determine bulk customers (customers who purchase, let's say, 5 or more products in a single transaction), where each row represents a unique transaction with a unique timestamp.
2. To determine which of the recurring customers (customers frequently making different transactions) are also bulk customers.
Now, I've already determined the list of recurring customers as follows:
import numpy as np

n = 15
custmost1 = Order_Details['Name'].value_counts().index.tolist()[:n]
custmost2 = Order_Details['Name'].value_counts().values.tolist()[:n]
custmost = np.column_stack((custmost1, custmost2))
Here custmost is an array pairing the customers who purchase most frequently with their purchase counts. Order_Details is the dataframe I created for the dataset.
Now, I'm at my wits' end trying to figure out how to maintain a count of the different products purchased in a single transaction (with a unique timestamp) and, ideally, add it as a separate column in the dataframe.
I don't know if it's a feasible approach or not, but two ways come to mind:
One is to count the number of commas, so that the number of commas + 1 is the number of products.
The other is to segregate each product onto a separate line (which I already did, by the way, for maintaining a total count for a different insight) and use the timestamp to check the number of products sold at a given time.
I'd segregated the Products as follows:
# Split the comma-separated Product strings into one row per product
reshaped = \
(Order_Details.set_index(Order_Details.columns.drop('Product').tolist())
   .Product.str.split(',', expand=True)
   .stack()
   .reset_index()
   .rename(columns={0: 'Product'})
   .loc[:, Order_Details.columns]
)
In light of this, I'd like someone to guide me, as I feel the aforementioned approaches are rather messy.
Assuming you already have a proper DataFrame:
>>> df.applymap(lambda e: type(e).__name__).value_counts()
name email product date
str str list Timestamp 29
dtype: int64
(i.e., with columns: ['name', 'email', 'product', 'date'], where the 'product' column contains list objects, and date contains Timestamp),
Then you could do this:
# Bulk: customers with at least one transaction of 5 or more products
bulk_customers = set(df.loc[df['product'].apply(len) >= 5, 'name'])

# Recurring: customers with more than one transaction
s = df.groupby('name').size() > 1
recur_customers = set(s[s].index)
>>> bulk_customers
{'PERSON_108', 'PERSON_123'}
>>> recur_customers
{'PERSON_105'}
Notes
I changed the row of PERSON_125 to be PERSON_105, so that there would be one repeat customer. Likewise, I used a threshold of n_visits > 1 as the criterion for "recurring", but you may want something different.
You'd be well advised to assign a unique ID to each of your customers. This could be based on email, or perhaps you already have a customer ID. In any case, using name is prone to collisions, plus sometimes customers change name (e.g. through marriage) while keeping the same email or customer ID.
You didn't mention over what period of time a customer needs to visit again in order to be considered "frequent". If that matters, you have to be specific about whether it is e.g. "within a calendar month" or "over the past 30 days", as each leads to a slightly different expression.
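For example, if "recurring" were defined as "more than one transaction in the past 30 days", a hypothetical sketch (measured back from the latest date in the data, and assuming 'date' holds Timestamps as above) could be:

import pandas as pd

cutoff = df['date'].max() - pd.Timedelta(days=30)
recent = df[df['date'] >= cutoff]
recur_customers = set(recent.groupby('name').size().loc[lambda s: s > 1].index)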
OK, so after some extensive brainstorming, I've concocted the following way to do this:
In the original dataset's dataframe (Order_Details), I figured out how to get the count of commas in each row of the Product column, which gives the number of products purchased in a single transaction. The code for that is:
Order_Details['Number Of Products'] = Order_Details['Product'].str.count(",")+1
To get the customer names sorted by the number of products purchased per transaction, I applied sort_values():
Dup_Order_Details = Order_Details.copy()
Dup_Order_Details.sort_values(["Number Of Products", "Name"], axis=0, ascending=False, inplace=True, na_position='first')
Finally, a filter for those buying 'N' or more products (here I took N=10, as that's the insight I wanted; you can take 'N' as input if you want):
Dup_Order_Details = Dup_Order_Details[Dup_Order_Details["Number Of Products"] >= 10]
Then the result can be displayed directly as per your need, or converted into a list or similar in case any visualization is needed (which is what I did).
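To tie the two objectives together, one possible (hypothetical) way to find the recurring customers who are also bulk customers is simply to intersect the two sets of names produced above:

# custmost1: names of the top-n most frequent customers (from above)
# Dup_Order_Details: transactions with 'Number Of Products' >= N (from above)
bulk_names = set(Dup_Order_Details["Name"])
recurring_bulk = [name for name in custmost1 if name in bulk_names]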
I have the following dataframe called df1 that contains data for a number of regions in the column NUTS_ID:
The index, called Date, has all the days of 2010. That is, for each code in NUTS_ID I have every day of 2010 (all days of the year for AT1, AT2, and so on). I created a list of the dates that are workdays (workdays_list), and I want to add a column with 0 for non-workdays and 1 for workdays.
For this, I simply used a for loop that checks day by day if it's in the workday list I created:
for day in df1.index:
    if day not in workdays_list:
        df1.loc[day, 'Workday'] = 0  # assigning 0 to non-workdays
    else:
        df1.loc[day, 'Workday'] = 1  # assigning 1 to workdays
This works well enough if the dataset is not big. But with some of the datasets I'm processing this takes a very long time. I would like to ask for ideas in order to do the process faster and more efficient. Thank you in advance for your input.
EDIT: One of the things I have thought is that maybe a groupby could be helpful, but I don't know if that is correct.
You can use np.where with isin to check if your Date (i.e. your index) is in the list you created:
import numpy as np
df1['Workday'] = np.where(df1.index.isin(workdays_list), 1, 0)
I can't reproduce your dataset, but something along those lines should work.
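An equivalent one-liner without np.where, in case you prefer it, just casts the boolean mask to integers:

# True/False -> 1/0
df1['Workday'] = df1.index.isin(workdays_list).astype(int)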
Dataset
I have a movie dataset where there are over half a million rows, and this dataset looks like following (with made-up numbers)
MovieName   Date         Rating   Revenue
A           2019-01-15   3        3.4 million
B           2019-02-03   3        1.2 million
...         ...          ...      ...
Objective
Select movies that were released "close enough" in terms of date (for example, the release-date difference between movie A and movie B is less than a month) and see, when the rating is the same, how the revenue differs.
Question
I know I could write a double loop to achieve this goal. However, I doubt this is the right/efficient way to do it, because:
Some posts (see the comment by cs95 on the question) suggest that iterating over a dataframe is an "anti-pattern" and therefore not advisable.
The dataset has over half a million rows, so I am not sure whether a double loop would be efficient.
Could someone provide some pointers on this? Thank you in advance.
In general, it is true that you should try to avoid loops when working with pandas. My idea is not ideal, but it might point you in the right direction:
Retrieve month and year from the date column in every row to create new columns "month" and "year". You can see how to do it here
Afterwards, group your dataframe by month and year (grouped_df = df.groupby(by=["month", "year"])); the resulting groups are dataframes with movies from the same month and year. Now it's up to you what further analysis you want to perform, for example the mean (grouped_df = df.groupby(by=["month", "year"]).mean()), the standard deviation, or something fancier with the apply() function.
You can also extract weeks if you want a period shorter than a month
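A minimal sketch of those steps (assuming the Date column can be parsed by pd.to_datetime and Revenue is already numeric) might look like:

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month   # or df['Date'].dt.isocalendar().week for weekly buckets

# Movies released in the same month/year with the same rating:
# compare their revenues within each group
grouped = df.groupby(['year', 'month', 'Rating'])['Revenue'].agg(['mean', 'std', 'count'])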
(The same question applies to doing it directly in SQL, with views or a stored procedure.)
I have a sales table simplified to 4 columns, namely id, product_id, sale_date and quantity. I would like to build a query returning:
1. for each product_id, the total sales by date in one row;
2. on the same row, for each id, the total sales over the last 7, 15 and 30 days.
For now, I use multiple WITH-views, one for each column:
days7=session.query(Sales.product_id,func.sum(Sales.quantity).label('count')).\
filter(Sales.sale_date > now() - timedelta(days=7)).\
group_by(Sales.product_id).cte('days7')
...
req=session.query(Sales.product_id,days7.c.count.label('days7'),\
days15.c.count.label('days15'),\
days30.c.count.label('days30')).\
outerjoin(days7,days7.c.product_id==Sales.product_id).\
outerjoin(days15,days15.c.product_id==Sales.product_id).\
outerjoin(days30,days30.c.product_id==Sales.product_id).\
all()
It works pretty well, but I'm not sure this is the best way of doing it. Moreover, if I want to add the count for each of the 30 (or 360) previous days, it becomes crazy. The idea would be to use a simple for loop:
viewSumByDay=[]
for day in range(180):
date=now()-timedelta(days=day)
viewSumByDay.append(session.query(...).cte(str(date.date())))
which is OK for creating the views. And although the left joins should also be OK with req = req.outerjoin(viewSumByDay[day], ...), I'm now stuck on how to use the loop to add the columns to the main query.
Do you see another nice solution?
Thanks a lot for your help!
OK, sorry, a simple req = req.add_column(...) is described in the documentation.
However, if there is a prettier way, I would like to know it.
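For reference, a rough sketch of how the loop could both join and add the columns (untested, and assuming viewSumByDay is built as above with one CTE per day) might look like:

# Build the query incrementally: one outer join and one column per daily CTE
req = session.query(Sales.product_id)
for day, view in enumerate(viewSumByDay):
    req = req.outerjoin(view, view.c.product_id == Sales.product_id)
    req = req.add_column(view.c.count.label('day%d' % day))
results = req.all()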