Counting number of records per month in Python

Please help me with creating a dataframe.
There is a table containing the data: id - user number, FirstVisit - date of the first visit, LastVisit - date of the last visit.
I need to create a table with the fields: Month - the month, MAU - the number of unique users for that month.
So far I have created a list of months from the minimum and maximum dates in the table:
df = pd.DataFrame(pd.date_range(visit.FirstVisit.min(), visit.LastVisit.max(), freq='M').strftime('%Y-%m'), columns=['Month'])
I came up with an approximate way to calculate MAU: the number of unique ids per month that have FirstVisit <= Month and LastVisit >= Month.
Note that for a given user, FirstVisit and LastVisit are always the same, but a record can be repeated (because of the other columns). So in fact duplicates can be dropped and the remaining records simply counted.
I tried to write this as a function, but so far it does not work. Please help.
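For what is described above, here is one loop-free sketch: compare each month against the FirstVisit/LastVisit months. This assumes visit holds the table described in the question; the sample data below is made up.

import pandas as pd

# Made-up sample data standing in for the question's `visit` table.
visit = pd.DataFrame({
    'id': [1, 2, 3],
    'FirstVisit': pd.to_datetime(['2021-01-15', '2021-02-03', '2021-02-20']),
    'LastVisit': pd.to_datetime(['2021-03-10', '2021-02-25', '2021-04-01']),
})

# Duplicates carry no extra information, so drop them first.
visit = visit.drop_duplicates(subset=['id', 'FirstVisit', 'LastVisit'])

# Same month range as in the question, but as monthly periods for easy comparison.
months = pd.date_range(visit.FirstVisit.min(),
                       visit.LastVisit.max(), freq='M').to_period('M')

first = visit.FirstVisit.dt.to_period('M')
last = visit.LastVisit.dt.to_period('M')

# A user is active in month m if FirstVisit's month <= m <= LastVisit's month.
mau = [((first <= m) & (last >= m)).sum() for m in months]

df = pd.DataFrame({'Month': months.strftime('%Y-%m'), 'MAU': mau})
print(df)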

Related

How to make a complex analysis/calculation in pandas/python without a for loop

I have a database DF1 of sales data with columns DATE_SOLD, PRICE, and NAME of product, indexed on ID. Products of the same NAME sell sometimes fast, sometimes slowly.
I'd like to add a column that flags an entry if the following holds. For a given entry, look at all other entries with the same product NAME and count how many were sold, using their DATE_SOLD. If there is exactly one other entry within the 365 days prior to the row being analyzed, and that one sale was within 90 days, then compare prices: if that previous sale's PRICE is more than 25% higher than the price of the row being analyzed, flag the row.
I have no idea how to do this in Pandas. I can do it with a for loop, but that would take forever given the roughly one million entries I have.
Help?
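One loop-free way to express this is a time-based rolling count per NAME plus a per-group shift: if exactly one same-NAME sale occurred in the prior 365 days, it must be the immediately preceding row of that group, so its date and price can be checked directly. Below is a sketch under that reading of the rule, with made-up sample data.

import pandas as pd

# Made-up sample data; column names are from the question.
df = pd.DataFrame({
    'NAME': ['widget', 'widget', 'widget', 'gadget'],
    'DATE_SOLD': pd.to_datetime(['2020-01-01', '2020-03-01', '2020-04-15', '2020-02-01']),
    'PRICE': [130.0, 100.0, 90.0, 50.0],
})

df = df.sort_values(['NAME', 'DATE_SOLD']).reset_index(drop=True)

# Sales of the same NAME in the 365 days up to and including each row.
win = (df.set_index('DATE_SOLD')
         .groupby('NAME')['PRICE']
         .rolling('365D')
         .count())
df['n_prior_365'] = win.values - 1  # exclude the row itself

# If there is exactly one prior sale in the window, it is the previous row of the group.
g = df.groupby('NAME')
prev_date = g['DATE_SOLD'].shift()
prev_price = g['PRICE'].shift()

df['flag'] = ((df['n_prior_365'] == 1)
              & (df['DATE_SOLD'] - prev_date <= pd.Timedelta(days=90))
              & (prev_price > 1.25 * df['PRICE']))
print(df)

In this sample only the widget sold on 2020-03-01 is flagged: it has exactly one same-NAME sale in the prior year, 60 days earlier, at a price more than 25% higher.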

How To Determine Bulk Customers (Customers Buying More Than 'N' Items Or So In A Single Transaction) In Python (Sales Data Analytics)?

So I have the following sample dataset:
Column A: Name
Column B: Email
Column C: Products
Column D: Transaction Date
I have two objectives:
To determine bulk customers (customers who purchase, let's say, 5 or more products in a single transaction), where each row represents a unique transaction with a unique timestamp.
To determine which of the recurring customers (customers frequently making different transactions) are also bulk customers.
Now, I've already determined the list of recurring customers as follows:
n = 15
custmost1 = Order_Details['Name'].value_counts().index.tolist()[:n]
custmost2 = Order_Details['Name'].value_counts().values.tolist()[:n]
custmost = np.column_stack((custmost1,custmost2))
Here, custmost is an array pairing the most frequent customers with their purchase counts (the two series stacked column-wise). Order_Details is the dataframe I created for the dataset.
Now, I'm at my wits' end trying to figure out how to maintain a count of the different products purchased in a single transaction (with a unique timestamp), and possibly add it as a separate column in the dataframe.
I don't know if it's a feasible approach or not, but two ways came to mind:
Count the number of commas in the Products field, so that the number of commas + 1 is the number of products.
Split each product onto a separate line (which I already did, by the way, to maintain a total count for a different insight), and use the timestamp to check the number of products sold at a given time (see the sketch below).
I split the Products as follows:
reshaped = (
    Order_Details
    .set_index(Order_Details.columns.drop('Product').tolist())  # keep the other columns as the index
    .Product.str.split(',', expand=True)  # one column per product
    .stack()                              # one row per product
    .reset_index()
    .rename(columns={0: 'Product'})
    .loc[:, Order_Details.columns]
)
In this light, I'd appreciate some guidance, as I feel the aforementioned approaches are rather messy.
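Building on the second approach (one product per row in reshaped), the per-transaction product count is then just a groupby size on the unique timestamp. A sketch, assuming the columns are named 'Name' and 'Transaction Date' as in the description above:

# Count products per transaction (one row per product in `reshaped`).
per_txn = (reshaped.groupby(['Name', 'Transaction Date'])
                   .size()
                   .rename('Number Of Products')
                   .reset_index())

# Bulk customers: 5 or more products in a single transaction.
bulk = per_txn.loc[per_txn['Number Of Products'] >= 5, 'Name'].unique()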
Assuming you already have a proper DataFrame:
>>> df.applymap(lambda e: type(e).__name__).value_counts()
name email product date
str str list Timestamp 29
dtype: int64
(i.e., with columns ['name', 'email', 'product', 'date'], where the 'product' column contains list objects and 'date' contains Timestamps), then you could do this:
bulk_customers = set(df.loc[df['product'].apply(len) >= 5, 'name'])
s = df.groupby('name').size() > 1
recur_customers = set(s[s].index)
>>> bulk_customers
{'PERSON_108', 'PERSON_123'}
>>> recur_customers
{'PERSON_105'}
Notes
I changed the row of PERSON_125 to PERSON_105, so that there would be one repeat customer. Also, I used a threshold of n_visits > 1 as the criterion for "recurring", but you may want something different.
You'd be well inspired to assign a unique ID to each of your customers. This could be based on email or perhaps you already have a customer ID. In any case, using name is prone to collisions, plus sometimes customers change name (e.g. through marriage) while keeping the same email or customer ID.
You didn't mention over what period of time a customer needs to visit again in order to be considered "frequent". If that is to be considered, you have to be specific whether it is e.g. "within a calendar month", or "over the past 30 days", etc., as each leads to slightly different expressions.
OK, so after a bit of extensive brainstorming, I concocted the following way to do this:
In the original dataset's dataframe (Order_Details), I get the count of commas in each row of the Product column, which gives the number of products purchased in a single transaction. The code for that is:
Order_Details['Number Of Products'] = Order_Details['Product'].str.count(",")+1
To make sure I get the names of customers sorted by purchase frequency, I applied the following sort_values() call:
Dup_Order_Details = Order_Details.copy()  # copy, so the in-place sort does not mutate the original
Dup_Order_Details.sort_values(['Number Of Products', 'Name'], axis=0, ascending=False, inplace=True, na_position='first')
Finally, a filter for those buying 'N' or more products (here I took N=10, as I wanted this insight; you can take 'N' as input if you want):
Dup_Order_Details = Dup_Order_Details[Dup_Order_Details["Number Of Products"] >= 10]
Then a simple direct display can be done as per your need, or you can convert the result to a list or similar in case any visualization is needed (which I did).

Ignore rows if the criteria are not met either in SQL/Python

I have two tables: the first is a list of customers browsing/checking in on the website, and the second is a list of members and the actual transactions they made.
I am trying to see, for every check a customer has done, the total transactions from that day to the next 30 days. But I want to ignore any later check that falls within 30 days of the initial point; I will include it only if it is more than 30 days later.
I have tried, in SQL, adding an additional column (if next date - current date < 30 then 0) and also using ROW_NUMBER, but I am unsure about the next steps. This rule rolls over, and the final step is just to exclude check dates that are < 30 days out, so that my total cost is accurate.
Thanks
[This is sample table]
[Result that I am aiming for]
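The rule described above (drop any check within 30 days of the last kept check) resets its anchor each time a check is kept, so it is a greedy scan rather than a fixed window. A minimal Python sketch, assuming hypothetical column names 'cust_id' and 'check_date' since the question does not show the schema:

import pandas as pd

# Made-up sample data; the real schema is not given in the question.
checks = pd.DataFrame({
    'cust_id': [1, 1, 1, 2],
    'check_date': pd.to_datetime(['2022-01-01', '2022-01-20', '2022-02-15', '2022-03-01']),
})

def keep_mask(dates, gap_days=30):
    # Greedy scan: keep the first date, then only dates > gap_days after the last kept one.
    keep, last_kept = [], None
    for d in dates:
        ok = last_kept is None or (d - last_kept).days > gap_days
        keep.append(ok)
        if ok:
            last_kept = d
    return keep

checks = checks.sort_values(['cust_id', 'check_date'])
checks['keep'] = checks.groupby('cust_id')['check_date'].transform(keep_mask).astype(bool)
kept = checks[checks['keep']]

The kept checks can then be joined back to the transactions table to sum each member's transactions in the 30 days following each kept check.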

Searching for values from one dataframe in another and returning information in the corresponding row/different column

I have 2 Dataframes:
df_Billed = pd.DataFrame({'Bill_Number': [220119, 220120, 220219, 220219, 220419, 220519, 220619, 221219],
                          'Date': ['1/31/2019', '2/20/2020', '2/28/2019', '6/30/2019', '6/30/2019', '6/30/2019', '6/30/2019', '12/31/2019'],
                          'Amount': [3312.5, 832.0, 10000.0, -3312.5, 8725.0, 1862.5, 3637.5, 1587.5]})
df_Received = pd.DataFrame({'Bill_Number': [220119, 220219, 220419, 220519, 220619],
                            'Date': ['4/16/2019', '5/21/2019', '8/2/2019', '8/2/2019', '8/2/2019'],
                            'Amount': [3312.5, 6687.5, 8725.0, 1862.5, 3637.5]})
I am trying to search for each Bill_Number in df_Billed to see if it is present in df_Received. Ideally, if it is present, I would like to take the difference between the dates in df_Billed and df_Received for that particular bill number (to see how many days it took to get paid). If the bill number is not present in df_Received, I would like to simply return all rows for that bill number from df_Billed.
EX: Since df_Billed Bill_Number 220119 is in df_Received, it would return 75 (the number of days it took for the bill to be paid: 4/16/2019 - 1/31/2019).
EX: Since df_Billed Bill_Number 221219 is not in df_Received, it would return 12/31/2019 (the date it was billed).
You could probably use merge on Bill_Number initially:
df_Billed = df_Billed.merge(df_Received, on='Bill_Number', how='left')
Then use apply and pandas.to_datetime to compute the difference between the dates:
df_Billed['result'] = df_Billed.apply(lambda x: x.Date_x if pd.isnull(x.Date_y)
                                      else abs(pd.to_datetime(x.Date_x) - pd.to_datetime(x.Date_y)).days,
                                      axis=1)
And finally, since you probably want the result in a clean frame, I drop the merged Date_y and Amount_y columns and rename Date_x and Amount_x back to Date and Amount below:
df_Billed.drop(['Date_y','Amount_y'],axis=1,inplace=True)
df_Billed.rename(columns={"Date_x": "Date","Amount_x":"Amount"},inplace=True)
The final dataframe then contains Bill_Number, Date, Amount, and the new result column (days to payment where the bill was received, otherwise the billed date).

WHERE ... IN on a list of strings in SQLite 3

I am training an ML model on a huge dataset. I have converted the CSV to a SQL database. But I have encountered a problem in correctly retrieving the data for each batch.
custs= #list of 1000 customers
df=pd.read_sql_query('SELECT * FROM transactions WHERE cust_id IN (%s)'% (",".join(map(str,custs))) ,con=disk_engine)
This gives me all the records of those customers. But say I want only the last three transactions of each customer:
months= #list containing the required months, custs= #each cust_id is repeated thrice
df=pd.read_sql_query('SELECT * FROM transactions WHERE cust_id IN (%s) AND month IN (%s)'
% (",".join(map(str,custs)),",".join(map(str,months))) ,con=disk_engine)
This doesn't work because the two IN conditions are applied independently (any listed customer with any listed month), not as (cust_id, month) pairs. Therefore, I created a new column that is a string combination of customer_id and month; this key is unique for each record.
key= #str(cust_id) +'_'+ str(month)
I can't use the same join trick for a list of string keys, since the strings need quoting. What method will work here? I tried to use placeholders, but since I am new to SQL, I didn't make much progress.
EDIT
The data has 116 columns: 113 are modelling variables and there is a DV column. The other two columns are cust_id and month. The data spans 8 months, and each customer has 1 or 0 transactions per month. Therefore, a customer might have between 1 and 8 transactions in the database.
For training a batch, I pick 1000 customers and a month at random from a list in memory. For example, if the picked list contains cust_id=1 and month=4, then I need this customer's transactions for months [2, 3, 4]. For cust_id=20 and month=2, I'll use [1, 1, 2]. If a transaction is missing, I pad with the last available transaction.
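Placeholders are indeed the usual fix here: build one '?' per key and pass the values separately, so the driver quotes the strings itself. A minimal sketch, assuming a combined key column named cust_month; the lists, table, and file name below are made up:

import sqlite3
import pandas as pd

custs = [1, 1, 1, 20, 20, 20]   # each cust_id repeated thrice (made up)
months = [2, 3, 4, 1, 1, 2]
keys = [f"{c}_{m}" for c, m in zip(custs, months)]  # e.g. "1_2", "20_1", ...

con = sqlite3.connect('transactions.db')  # hypothetical database file
placeholders = ','.join('?' * len(keys))  # "?,?,?,...", one per key
query = f'SELECT * FROM transactions WHERE cust_month IN ({placeholders})'
df = pd.read_sql_query(query, con, params=keys)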
