WHERE IN on a list of strings in SQLite 3 - Python

I am training an ML model on a huge dataset. I have converted the CSV to a SQL database, but I have encountered a problem in correctly retrieving the data for each batch.
custs= #list of 1000 customers
df=pd.read_sql_query('SELECT * FROM transactions WHERE cust_id IN (%s)'% (",".join(map(str,custs))) ,con=disk_engine)
This gives me all the records for those customers. But, say, I want only the last three transactions of each customer.
months= #list containing the required months, custs= #each cust_id is repeated thrice
df=pd.read_sql_query('SELECT * FROM transactions WHERE cust_id IN (%s) AND month IN (%s)'
% (",".join(map(str,custs)),",".join(map(str,months))) ,con=disk_engine)
This doesn't work because the AND condition is applied to the whole list, so I get every listed month for every listed customer rather than the specific month per customer. Therefore, I created a new column that is a string combination of cust_id and month; this key is unique for each record.
key= #str(cust_id) +'_'+ str(month)
I can't use the same join approach for a list of string keys. What method will work here? I tried to use placeholders, but since I am new to SQL I didn't make much progress.
EDIT
The data has 116 columns. 113 of them are modelling variables and there is a DV column. The other two columns are cust_id and month. The data is for 8 months and each customer has 1 or 0 transactions per month. Therefore, a customer might have transactions ranging from 1 to 8 in the database.
For each training batch I pick 1000 customers and a month at random from a list in memory. For example, if the picked list contains cust_id=1 and month=4, then I need this customer's transactions for months [2, 3, 4]. For cust_id=20 and month=2, I'll use [1, 1, 2]. If a transaction is missing, I pad with the last available transaction.
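A minimal sketch of the placeholder idea for the key column described above, assuming disk_engine is a plain sqlite3 connection (with a SQLAlchemy engine the placeholder style differs), that custs/months are the in-memory lists with each cust_id repeated for its required months, and that the combined column is (hypothetically) named cust_month:
import pandas as pd

# Build one composite key per required (cust_id, month) pair.
keys = [f"{c}_{m}" for c, m in zip(custs, months)]

# One "?" placeholder per key; the driver handles string quoting.
placeholders = ",".join("?" * len(keys))
# 'cust_month' is a hypothetical name for the concatenated cust_id/month column.
query = f"SELECT * FROM transactions WHERE cust_month IN ({placeholders})"

df = pd.read_sql_query(query, con=disk_engine, params=keys)
One caveat: older SQLite builds cap the number of bound parameters at 999, so a large batch of keys may need to be split into chunks.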

Related

How to make a complex analysis/calculation in pandas/python without a for loop

I have a database DF1 of sales data: DATE_SOLD, PRICE, NAME of product, indexed on ID. Products of the same NAME sell sometimes fast, sometimes slowly.
I'd like to add a column that flags an entry based on all the other entries with the same product NAME, counting how many of them were sold by looking at their DATE_SOLD. If there is only one other entry within 365 days and that one entry is within 90 days (i.e. only one product with that NAME sold within the year prior to the row being analyzed, and that one sale was within 90 days), then I want to look at the price of that previous sale and see whether it is more than 25% more expensive than the row being analyzed. If it is, I want to flag the row being analyzed.
I have no idea how to do this in pandas. I can do it with a for loop, but that will take forever given the million or so entries I have.
Help?
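A hedged sketch of one vectorized way to approach this, assuming DF1 has columns NAME, PRICE and a datetime DATE_SOLD: use a per-NAME rolling 365-day count to find rows with exactly one prior sale in the last year, and shift() to fetch that previous sale's date and price.
import pandas as pd

# Sort by date so the rolling window and shift() see sales in time order.
df1 = DF1.sort_values("DATE_SOLD")

# Sales of the same NAME in the 365 days up to and including each row,
# minus one to exclude the row itself.
prior_365 = (
    df1.groupby("NAME")
       .rolling("365D", on="DATE_SOLD")["PRICE"]
       .count()
       .reset_index(level=0, drop=True)
       .sub(1)
)

# Date and price of the immediately preceding sale of the same NAME.
prev_date = df1.groupby("NAME")["DATE_SOLD"].shift()
prev_price = df1.groupby("NAME")["PRICE"].shift()

within_90 = (df1["DATE_SOLD"] - prev_date) <= pd.Timedelta(days=90)

# Flag: exactly one prior sale of this NAME in the last year, it fell within
# 90 days, and it was more than 25% more expensive than this row.
df1["flag"] = (prior_365 == 1) & within_90 & (prev_price > 1.25 * df1["PRICE"])
The boolean pieces align on the original (unique ID) index, so row order does not matter when the flag is assigned back.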

How To Determine Bulk Customers (Customers Buying More Than 'N' Items Or So In A Single Transaction) In Python (Sales Data Analytics)?

So I have the following sample dataset:
Column A: Name
Column B: Email
Column C: Products
Column D: Transaction Date
I've two objectives:
To determine bulk customers (customers who purchase, let's say, 5 products or more in a single transaction), where each row represents a unique transaction with a unique timestamp.
To determine, from the recurring customers (customers who frequently make separate transactions), which of them are also bulk customers.
Now, I've already determined the list of recurring customers as follows:
n = 15
custmost1 = Order_Details['Name'].value_counts().index.tolist()[:n]
custmost2 = Order_Details['Name'].value_counts().values.tolist()[:n]
custmost = np.column_stack((custmost1,custmost2))
Here custmost is an array pairing the customers who purchase most frequently with their purchase counts; Order_Details is the dataframe I created for the dataset.
Now, I'm at my wit's end trying to figure out how to maintain a count of the different products purchased in a single transaction (with a unique timestamp), and possibly add it as a separate column in the dataframe.
I don't know if this is a feasible approach, but two ways came to mind:
Count the number of commas in the Products field, so that the number of commas + 1 gives the number of products.
Split each product onto a separate line (which I already did, by the way, to maintain a total count for a different insight), and use the timestamp to check how many products were sold in each transaction.
I'd segregated the Products as follows:
# Split the comma-separated Product strings so each product gets its own row,
# carrying all other columns along.
reshaped = (
    Order_Details.set_index(Order_Details.columns.drop('Product').tolist())
                 .Product.str.split(',', expand=True)
                 .stack()
                 .reset_index()
                 .rename(columns={0: 'Product'})
                 .loc[:, Order_Details.columns]
)
So I would like some guidance here, as I feel the aforementioned approaches are rather messy.
Assuming you already have a proper DataFrame:
>>> df.applymap(lambda e: type(e).__name__).value_counts()
name email product date
str str list Timestamp 29
dtype: int64
(i.e., with columns: ['name', 'email', 'product', 'date'], where the 'product' column contains list objects, and date contains Timestamp),
Then you could do this:
bulk_customers = set(df.loc[df['product'].apply(len) >= 5, 'name'])
s = df.groupby('name').size() > 1
recur_customers = set(s[s].index)
>>> bulk_customers
{'PERSON_108', 'PERSON_123'}
>>> recur_customers
{'PERSON_105'}
Notes
I changed the row of PERSON_125 to be PERSON_105, so that there would be one repeat customer. Likewise, I used a threshold of n_visits > 1 as the criterion for "recurring", but you may want something different.
You'd be well inspired to assign a unique ID to each of your customers. This could be based on email or perhaps you already have a customer ID. In any case, using name is prone to collisions, plus sometimes customers change name (e.g. through marriage) while keeping the same email or customer ID.
You didn't mention over what period of time a customer needs to visit again in order to be considered "frequent". If that is to be considered, you have to be specific whether it is e.g. "within a calendar month", or "over the past 30 days", etc., as each leads to slightly different expressions.
Ok, so after a bit of extensive brainstorming, I've concocted the following way to do this:
In the original dataset's dataframe (Order_Details), I counted the commas in each row of the Product column, which gives the number of products purchased in a single transaction. The code for that is:
Order_Details['Number Of Products'] = Order_Details['Product'].str.count(",")+1
To make sure that I get the names of customers in a sorted order according to the frequency of purchases, I applied the following sort_values() function:
Dup_Order_Details = Order_Details.copy()  # copy so the original frame is not reordered by the in-place sort
Dup_Order_Details.sort_values(["Number Of Products", "Name"], axis=0, ascending=False, inplace=True, na_position='first')
Finally, a filter for those buying 'N' or more products (here I took N=10, as that was the insight I wanted; you can take 'N' as input if you like):
Dup_Order_Details = Dup_Order_Details[Dup_Order_Details["Number Of Products"] >= 10]
Then a simple direct display can be done as needed, or you can convert the result to a list in case any visualization is required (which I did).
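For example, using the column names from this answer, the filtered frame can be shown directly or reduced to a plain list of names:
# Show the bulk buyers together with how many products each transaction had.
print(Dup_Order_Details[["Name", "Number Of Products"]].head(20))

# Or keep just the unique customer names for later use / plotting.
bulk_buyers = Dup_Order_Details["Name"].unique().tolist()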

Counting number of records per month in Python

Please help me with creating a dataframe.
There is a table containing data: id - user number, FirstVisit - date of the first visit, LastVisit - date of the last visit.
I need to create a table with fields: Month - month, MAU - number of unique users for this month.
So far I have created a list of months, spanning the minimum and maximum dates in the table.
df=pd.DataFrame(pd.date_range(visit.FirstVisit.min(),visit.LastVisit.max(), freq='M').strftime('%Y-%m'),columns=['Month'])
I came up with an approximate way to calculate MAU: the number of unique ids per month that have FirstVisit <= Month and LastVisit >= Month.
Note that for a given user, FirstVisit and LastVisit are always the same, but a record can be repeated (the other columns are omitted here). So in fact duplicates can simply be dropped and the remaining records counted.
I tried to write this as a function, but so far it does not work. Please help.
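A minimal sketch of the approach described above, assuming the table is loaded into a dataframe called visit with datetime columns FirstVisit and LastVisit: drop the duplicate records first, then for each month count the unique ids whose visit window covers it.
import pandas as pd

# Duplicates carry no extra information, so drop them first.
visit = visit.drop_duplicates(subset=["id", "FirstVisit", "LastVisit"])

first = visit["FirstVisit"].dt.to_period("M")
last = visit["LastVisit"].dt.to_period("M")

months = pd.period_range(visit["FirstVisit"].min(),
                         visit["LastVisit"].max(), freq="M")

# MAU for a month = unique ids active in that month
# (FirstVisit <= month and LastVisit >= month).
df = pd.DataFrame({
    "Month": months.strftime("%Y-%m"),
    "MAU": [visit.loc[(first <= m) & (last >= m), "id"].nunique() for m in months],
})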

Find list of IDs for which another column changes values multiple times

I have a dataframe new_df which has a list of customer id's, dates, and a customer segment for each day. Customer segment can take multiple values. I am looking to identify a list of customers whose segment has changed more than twice in the past 15 days.
Currently, I am using the following to check how many times each segment appears for each customer id.
segment_count = new_df.groupby(new_df['customer_id'].ne(new_df['customer_id'].shift()).cumsum())['segment'].value_counts()
My thinking is that if a customer has more than 2 segments with a count of more than 1, then they must have migrated from one segment to another at least twice. Two sample customers may look like this:
|customer_id|day|segment|
|-----------|---|-------|
|12345|'2021-01-01'|G|
|12345|'2021-01-02'|G|
|12345|'2021-01-03'|M|
|12345|'2021-01-04'|G|
|12345|'2021-01-05'|M|
|12345|'2021-01-06'|M|
|6789|'2021-01-01'|G|
|6789|'2021-01-02'|G|
|6789|'2021-01-03'|G|
|6789|'2021-01-04'|G|
|6789|'2021-01-05'|G|
|6789|'2021-01-06'|M|
As an output, I would want to return the following:
|customer_id|migration_count|
|-----------|---------------|
|12345|3|
|6789|1|
Does anyone have advice on the best way to tackle this, or are there any built-in functions I could use to simplify it? Thanks!
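A hedged sketch of a simpler route than inspecting value_counts per segment, assuming new_df is already restricted to the 15-day window of interest: sort by customer and day, then count the rows where the segment differs from the previous row for the same customer.
import pandas as pd

# Sort so that shift() compares consecutive days within each customer.
new_df = new_df.sort_values(["customer_id", "day"])

prev_segment = new_df.groupby("customer_id")["segment"].shift()
changed = (new_df["segment"] != prev_segment) & prev_segment.notna()

migration_count = (
    changed.groupby(new_df["customer_id"])
           .sum()
           .rename("migration_count")
           .reset_index()
)

# Customers whose segment changed more than twice in the window.
flagged = migration_count.loc[migration_count["migration_count"] > 2, "customer_id"].tolist()
On the two sample customers above this gives 3 for 12345 and 1 for 6789, so only 12345 would be flagged.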

Usage of WHERE CLAUSE in Python

Trying to write SQL queries in Python.
I have a data frame df with columns date, ID and amount.
Every month I get a new load of data. I have to calculate the average amount for a particular ID over the last 12 months (meaning we will have 12 records for that one ID).
Currently, my approach is
M1 = pd.date_range(first_day_of_month, last_day_of_12_month, freq='D').strftime("%Y%d%m").tolist()
df["new"] = df[(df['date'].isin(M1))]['amount'].mean()
Now I want to add this average as a new column, so that each ID (at its current, latest timestamp) has the average of the last 12 months' amounts. I tried using groupby but was not able to apply it properly.
mask = d.date.between(datetime.datetime(2019,1,1), datetime.datetime(2019,12,31))
d[mask].groupby(['ID'])['amount'].mean()
I guess? Maybe?
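A minimal sketch of one way to finish this, assuming df has a datetime date column and that "last 12 months" means the 12 months up to the latest date in the data: filter to that window, average per ID, and map the result back onto the frame.
import pandas as pd

# Window: the 12 months leading up to the most recent date in the data.
cutoff = df["date"].max() - pd.DateOffset(months=12)
last_12m = df[df["date"] > cutoff]

# Average amount per ID within that window.
avg_12m = last_12m.groupby("ID")["amount"].mean()

# Attach the average to every row for its ID (NaN for IDs with no recent records).
df["avg_amount_12m"] = df["ID"].map(avg_12m)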
