How to randomise the rebalancing of a dataset - python

In Python I am trying to rebalance a dataset which contains approximately 4000 transactions for a single credit card number, all ordered by time.
There is a large class imbalance between genuine and fraudulent transactions: the data contain only about 15 fraudulent transactions, all of which occurred within a two-day window.
Naturally, I want to rebalance the dataset. However, when I did this using SMOTE, I noticed that there are now approximately 4000 additional synthetic fraudulent transactions, all of which occur during the same two days as the original fraudulent transactions.
Is there any way to generate synthetic fraudulent transactions which are more randomised than this?
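A hedged sketch of one possible approach, assuming the data sit in a pandas DataFrame with a numeric time column and an is_fraud label (column names are placeholders): apply SMOTE only to the non-time features with a modest sampling_strategy, then draw timestamps for the synthetic rows uniformly from the whole observation window instead of letting them cluster around the original two days.

    import numpy as np
    import pandas as pd
    from imblearn.over_sampling import SMOTE

    df = pd.read_csv("transactions.csv")  # hypothetical file with 'time', feature columns, 'is_fraud'
    feature_cols = [c for c in df.columns if c not in ("time", "is_fraud")]

    # Oversample the minority class to only ~10% of the majority, ignoring the time column
    smote = SMOTE(sampling_strategy=0.1, k_neighbors=5, random_state=42)
    X_res, y_res = smote.fit_resample(df[feature_cols], df["is_fraud"])

    # imblearn appends synthetic rows after the originals, so re-attach times accordingly:
    # real rows keep their timestamps, synthetic rows get uniform random ones
    n_orig = len(df)
    times = np.empty(len(X_res))
    times[:n_orig] = df["time"].to_numpy()
    times[n_orig:] = np.random.uniform(df["time"].min(), df["time"].max(), len(X_res) - n_orig)

    resampled = X_res.assign(time=times, is_fraud=y_res)

Whether uniform timestamps are realistic is a modelling judgement; the point is only that the time column need not be fed to SMOTE at all.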

Related

Efficient algorithms to perform Market Basket Analysis

I want to perform Market Basket Analysis (or Association Analysis) on a retail e-commerce dataset.
The problem I am facing is the huge data size: 3.3 million transactions in a single month. I cannot cut down the transactions, as I may miss some products. The structure of the data is provided below:
Order_ID = Unique transaction identifier
Customer_ID = Identifier of the customer who placed the order
Product_ID = List of all the products the customer has purchased
Date = Date on which the sale has happened
When I feed this data to the Apriori algorithm in Python, my system cannot handle the huge memory requirement; it can only run with about 100K transactions. I have 16 GB of RAM.
Any help in suggesting a better (and faster) algorithm is much appreciated.
I can use SQL as well to sort out the data size issues, but then I only get 1-antecedent --> 1-consequent rules. Is there a way to get multi-item rules such as {A,B,C} --> {D,E}, i.e. if a customer purchases products A, B and C, then there is a high chance they will also purchase products D and E?
For a huge dataset, try FP-Growth, which is an improvement over the Apriori method.
It also scans the data only twice, whereas Apriori makes repeated passes.
from mlxtend.frequent_patterns import fpgrowth
Then just change:
apriori(df, min_support=0.6)
To
fpgrowth(df, min_support=0.6)
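For the multi-item rules asked about above ({A,B,C} --> {D,E}), a rough end-to-end sketch with mlxtend follows; the file name and the step that groups order lines into baskets are assumptions about the schema described in the question.

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import fpgrowth, association_rules

    # Hypothetical input: one row per order line (Order_ID, Customer_ID, Product_ID, Date)
    orders = pd.read_csv("orders.csv")

    # Build one basket (list of products) per order
    baskets = orders.groupby("Order_ID")["Product_ID"].apply(list).tolist()

    # One-hot encode the baskets into a boolean DataFrame
    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

    # Mine frequent itemsets with FP-Growth, then derive rules;
    # antecedents and consequents are frozensets, so rules like {A,B,C} -> {D,E} come out directly
    itemsets = fpgrowth(onehot, min_support=0.01, use_colnames=True)
    rules = association_rules(itemsets, metric="confidence", min_threshold=0.5)
    print(rules[["antecedents", "consequents", "support", "confidence", "lift"]].head())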
There is also research comparing these algorithms; for memory issues I recommend:
Evaluation of Apriori, FP growth and Eclat association rule mining algorithms, or
Comparing the Performance of Frequent Pattern Mining Algorithms.

How do I find the correlation between time events and time series data in Python?

I have two different Excel files. One of them contains time series data (268,943 accident time rows).
The other file contains values for 14 workers, measured daily from 8:00 to 17:00 over 4 months (all data merged into one file).
I am trying to understand the correlation between accident times and these values (hourly from 8 to 17, daily from Monday to Friday, and monthly).
Which statistical method fits here (normalized auto-correlation or cross-correlation), and how can I apply it?
Generally, in other questions, correlation analysis is performed between two time-series values; I think this case is a little different, since the time points of the two datasets do not coincide.
Thanks in advance.
I think the accident times and the blood sugar levels do not come from the same source, so I think it is not possible to draw a correlation between these two separate datasets. If you are willing to assume that the blood sugar levels of the 14 workers are representative of the workers in the accident dataset, that is a different story. But what if those who had accidents had a significantly different blood sugar profile from the rest, and what if your tiny dataset of 14 workers contains no such examples? I think the best you can do is graph the blood sugar levels of your 14-worker dataset, analyse the accident dataset separately in the same way, and try to see visually whether there is any correlation.
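As a rough illustration of that separate, visual look at the two datasets (file names and column names are assumptions), one could bin accidents by hour of day and plot them next to the hourly mean of the worker measurements:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical inputs
    accidents = pd.read_excel("accidents.xlsx", parse_dates=["accident_time"])
    workers = pd.read_excel("workers.xlsx", parse_dates=["measurement_time"])

    # Accidents per hour of day, restricted to the 8:00-17:00 working window
    acc_by_hour = accidents["accident_time"].dt.hour.value_counts().sort_index().loc[8:17]

    # Mean measured value per hour of day across the 14 workers
    val_by_hour = workers.groupby(workers["measurement_time"].dt.hour)["value"].mean().loc[8:17]

    fig, ax1 = plt.subplots()
    ax1.bar(acc_by_hour.index, acc_by_hour.values, alpha=0.5)
    ax1.set_xlabel("hour of day")
    ax1.set_ylabel("number of accidents")
    ax2 = ax1.twinx()
    ax2.plot(val_by_hour.index, val_by_hour.values, color="red")
    ax2.set_ylabel("mean worker value")
    plt.show()

The same grouping can be repeated by weekday (.dt.dayofweek) and by month (.dt.month) to cover the daily and monthly comparisons mentioned in the question.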

Predict Sales as Counterfactual for Experiment

Which modelling strategy (time frame, features, technique) would you recommend to forecast 3-month sales for total customer base?
At my company, we often analyse the effect of e.g. marketing campaigns that run at the same time for the total customer base. In order to get an idea of the true incremental impact of the campaign, we - among other things - want to use predicted sales as a counterfactual for the campaign, i.e. what sales were expected to be assuming no marketing campaign.
Time frame used to train the model: I'm currently considering 2 options (static time frame and rolling window) - let me know what you think.
1. Static: use the same period last year as the dependent variable to build a specific model for this particular 3-month time frame. Data from the 12 months before are used to generate features.
2. Rolling window: use a rolling-window logic of 3 months, dynamically creating dependent time frames and features. I'm not yet sure what the benefit of that would be: it uses more recent data for model creation, but feels less specific because it treats any 3-month period in the year as the dependent window.
Not sure what the theory says for this particular example. Experiences, thoughts?
Features - currently building features per customer from one year pre-period data, e.g. Sales in individual months, 90,180,365 days prior, max/min/avg per customer, # of months with sales, tenure, etc. Takes quite a lot of time - any libraries/packages you would recommend for this?
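As an illustration of that per-customer aggregation, here is a hedged pandas sketch; the transaction table layout, column names and cutoff date are assumptions, and packages such as featuretools or tsfresh can automate much of this kind of work.

    import pandas as pd

    # Hypothetical transaction-level input: one row per purchase (customer_id, date, sales)
    tx = pd.read_csv("transactions.csv", parse_dates=["date"])

    cutoff = pd.Timestamp("2024-01-01")        # assumed start of the 3-month target window
    pre = tx[tx["date"] < cutoff]

    def window_sum(days):
        # Total sales per customer in the last `days` before the cutoff
        recent = pre[pre["date"] >= cutoff - pd.Timedelta(days=days)]
        return recent.groupby("customer_id")["sales"].sum()

    # Sales per customer per calendar month in the pre-period
    monthly = pre.groupby(["customer_id", pre["date"].dt.to_period("M")])["sales"].sum()

    features = pd.DataFrame({
        "sales_90d": window_sum(90),
        "sales_180d": window_sum(180),
        "sales_365d": window_sum(365),
        "max_monthly_sales": monthly.groupby(level="customer_id").max(),
        "avg_monthly_sales": monthly.groupby(level="customer_id").mean(),
        "months_with_sales": monthly.groupby(level="customer_id").size(),
        "tenure_days": (cutoff - pre.groupby("customer_id")["date"].min()).dt.days,
    }).fillna(0)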
Modelling technique - currently considering GLM, XGBoost, S/ARIMA, and LSTM networks. Any experience here?
To clarify, even though I'm considering e.g. ARIMA, I do not need to predict any seasonal patterns of the 3 month window. As a first stab, a single number, total predicted sales of customer base for these 3 months, would be sufficient.
Any experience or comment would be highly appreciated.
Thanks,
F

How to treat missing data involving multiple datasets

I'm developing a model used to predict the probability of a client changing telephone companies based on their daily usage. My dataset has information from two weeks (14 days).
My datasets include in each row:
User ID, day (number from 1 to 14), a list of 15 more values.
The problem comes from the fact that some clients don't use their telephones every day, so for each client we have a variable number of rows (from 1 to 14) depending on the days they used their telephones. Therefore we have some missing client-day combinations.
Removing the missing values is not an option, since the dataset is small and it would hurt the predictive methods.
What kind of treatment could I apply to these missing day values for each client?
I've tried building a new dataset with only one entry per client: a new value quantifies the number of days of telephone usage, and the remaining values are the mean over all the days present in the original dataset. This shrinks the dataset, so we end up with the same problem as just removing the missing values.
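A minimal pandas sketch of that one-row-per-client aggregation, with column names assumed; the day count is kept as an explicit feature next to the per-client means.

    import pandas as pd

    # Hypothetical input: one row per client per day of usage (user_id, day, v1 ... v15)
    usage = pd.read_csv("daily_usage.csv")
    value_cols = [c for c in usage.columns if c not in ("user_id", "day")]

    # Mean of each daily value per client, plus how many of the 14 days were active
    per_client = usage.groupby("user_id")[value_cols].mean()
    per_client["days_used"] = usage.groupby("user_id")["day"].nunique()
    per_client = per_client.reset_index()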
I've also thought about adding values for the missing days for each client (using interpolation methods), but that would distort the results, since it would make the dataset look as if every client used their phone every day, which would affect the predictive model.

Can training and evaluation sets be the same in predictive analytics?

I'm creating a model to predict the probability that customers will buy product A in a department store that sells products A through Z. The store has its own credit card, with demographic and transactional information on 140,000 customers.
There is a subset of customers (say 10,000) who currently buy A. The goal is to learn from these 10,000 customers, score the remaining 130,000 with their probability of buying A, and then target the ones with the highest scores with marketing campaigns to increase sales of A.
How should I define my training and evaluation sets?
Training set:
Should it be only the 10,000 who bought A, or all 140k customers?
Evaluation set: (where the model will be used in production)
I believe this should be the 130k who haven't bought A.
The question about time:
Another alternative is to take a snapshot of the database from last year, use it as the training set, then take the database as it is today and score all customers with the model built on last year's data. Is this correct? When should I do this?
Which option is correct for all sets?
The training set and the evaluation set must be different. The whole point of having an evaluation set is to guard against over-fitting.
In this case, what you should do is take, say, 100,000 customers picked at random. Use their data to learn what it is about customers that makes them likely to purchase A. Then use the remaining 40,000 to test how well your model works.
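A hedged sketch of that split; the feature set, label column and model choice are assumptions rather than a prescription, and the label here is simply "has bought A".

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score

    # Hypothetical input: one row per cardholder with numeric features and a 0/1 'bought_A' label
    customers = pd.read_csv("customers.csv")
    X = customers.drop(columns=["customer_id", "bought_A"])
    y = customers["bought_A"]

    # Hold out ~40k of the 140k customers, stratified so both sets contain buyers of A
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=40_000 / 140_000, stratify=y, random_state=0)

    model = GradientBoostingClassifier().fit(X_train, y_train)
    print("held-out AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

    # In production, score the customers who have not bought A yet and rank them
    non_buyers = customers[customers["bought_A"] == 0]
    scores = model.predict_proba(non_buyers[X.columns])[:, 1]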
