I want to perform Market Basket Analysis (also called Association Analysis) on a retail e-commerce dataset.
The problem I am facing is the huge data size: 3.3 million transactions in a single month. I cannot cut down the transactions, as I may miss some products. The structure of the data is provided below:
Order_ID = Unique transaction identifier
Customer_ID = Identifier of the customer who placed the order
Product_ID = List of all the products the customer has purchased
Date = Date on which the sale has happened
When I feed this data to the apriori algorithm in Python, my system cannot handle the memory requirements; it only manages to run with about 100K transactions. I have 16 GB of RAM.
Any help in suggesting a better (and faster) algorithm is much appreciated.
I can use SQL as well to sort out the data size issues, but then I will only get 1 antecedent --> 1 consequent rules. Is there a way to get multi-item rules such as {A,B,C} --> {D,E}, i.e., if a customer purchases products A, B and C, then there is a high chance they will also purchase products D and E?
For a huge data size, try FP-Growth, as it is an improvement on the Apriori method.
It also scans the data only twice, whereas Apriori scans it repeatedly.
from mlxtend.frequent_patterns import fpgrowth
Then just change:
apriori(df, min_support=0.6)
To
fpgrowth(df, min_support=0.6)
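For completeness, a rough end-to-end sketch with mlxtend might look like the following. It assumes the raw orders have already been grouped into one list of Product_IDs per Order_ID; the toy transactions and thresholds are only illustrative (a min_support of 0.6 would be far too high for 3.3 million transactions). Because association_rules returns antecedents and consequents as frozensets, multi-item rules such as {A,B} --> {C} come out naturally, which also addresses the SQL question above.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Toy transactions: one list of Product_IDs per Order_ID
transactions = [
    ['A', 'B', 'C', 'D'],
    ['A', 'B', 'C', 'E'],
    ['A', 'B', 'D'],
    ['B', 'C', 'D'],
]

# One-hot encode into a sparse boolean DataFrame to keep memory usage down
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions, sparse=True)
df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)

# Mine frequent itemsets with FP-Growth (only two passes over the data)
frequent_itemsets = fpgrowth(df, min_support=0.5, use_colnames=True)

# antecedents/consequents are frozensets, so multi-item rules come out naturally
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])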
There is also research comparing the algorithms; for memory issues I recommend:
Evaluation of Apriori, FP-Growth and Eclat Association Rule Mining Algorithms or
Comparing the Performance of Frequent Pattern Mining Algorithms.
For a university project I'm trying to see what relation oil production/consumption and the crude oil price have to certain oil stocks, and I'm a bit confused about how to organize this data.
I basically have 4 datasets:
-Oil production
-Oil consumption
-Crude oil price
-Historical price of certain oil company stock
If I am trying to find how these 4 tables relate, what is the recommended way of organizing the data? Should I manually combine all this data into a single Excel sheet (seems like the most straightforward way), or is there a more efficient way to go about this?
I am brand new to PyTorch and data, so I apologise if this is a very basic question. Also, the data can basically get infinitely larger, by adding data from additional countries, other stock indexes, etc. So is there a way I can organize the data so it’s easy to add additional related data?
Finally, I have the month-to-month values for certain data (eg: oil production), and day-to-day values for other data (eg: oil price). What is the best way I can adjust the data to make up for this discrepancy?
Thanks in advance!
You can use pandas.DataFrame to create a dataframe for each of the 4 datasets, then combine them into one dataframe using merge.
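A minimal sketch of that, assuming each file has a 'date' column; the file and column names are hypothetical. Resampling the daily series down to monthly also deals with the frequency mismatch mentioned in the question.

import pandas as pd

production = pd.read_csv('oil_production.csv', parse_dates=['date'])    # monthly
consumption = pd.read_csv('oil_consumption.csv', parse_dates=['date'])  # monthly
price = pd.read_csv('crude_oil_price.csv', parse_dates=['date'])        # daily
stock = pd.read_csv('oil_stock_price.csv', parse_dates=['date'])        # daily

# Bring the daily series down to monthly frequency so all four tables align
price_monthly = price.set_index('date').resample('MS').mean(numeric_only=True).reset_index()
stock_monthly = stock.set_index('date').resample('MS').mean(numeric_only=True).reset_index()

# Merge everything on the shared date column; 'outer' keeps months that are
# missing from one of the sources, so new countries/indexes can be added later
df = (production
      .merge(consumption, on='date', how='outer', suffixes=('_prod', '_cons'))
      .merge(price_monthly, on='date', how='outer')
      .merge(stock_monthly, on='date', how='outer'))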
I'm trying to see if a warehouse can help me improve my service level to my customers. If so, I want to identify where to establish the warehouse. I used the following logic in Python to find the warehouse location:
Identified the lat, long and volume district-wise
Ran k-means using volume as the sample weight and identified the optimum lat-long for the warehouse location
But I'm confused about how to incorporate the factory location in my analysis, because even after establishing a warehouse, I will still be servicing a few nearby districts from the factory itself. Could you please suggest how to go about this?
TIA
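For reference, the weighted k-means step described above might look roughly like this with scikit-learn; the district coordinates, volumes, and the choice of a single cluster are made-up assumptions.

import pandas as pd
from sklearn.cluster import KMeans

districts = pd.DataFrame({
    'lat':    [12.97, 13.08, 12.83, 13.34],
    'lon':    [77.59, 80.27, 79.70, 77.10],
    'volume': [500, 1200, 300, 800],
})

# Volume acts as the sample weight, pulling the centroid toward high-demand districts.
# Note that plain Euclidean k-means on lat/lon is only an approximation of distance.
km = KMeans(n_clusters=1, n_init=10, random_state=0)
km.fit(districts[['lat', 'lon']], sample_weight=districts['volume'])

warehouse_lat, warehouse_lon = km.cluster_centers_[0]
print(warehouse_lat, warehouse_lon)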
k-means is a clustering algorithm. It serves to group things into sub-groups, so you can treat each group the same way.
For example, it may classify your clients into groups, and when you inspect them, you may find that one group is families, another males, another teens, another small stores, and another large companies.
But you do not feed that information into the algorithm. You feed it data like the amount of purchases, the type of goods traded, the time of day of sales, and so on.
Whether it serves your purpose or not depends on the data you have.
From the little info you give, it looks like you have something closer to a transport problem, where the important data is the distance and time spent transporting things.
TSP-style routing algorithms are probably more appropriate for your problem.
I'm searching for a way to apply an arbitrage algorithm across multiple exchanges, multiple currencies, and multiple trading amounts. I've seen examples using Bellman-Ford and Floyd-Warshall, but the ones I've tried all seem to assume the graph is made up of prices for multiple currencies on one single exchange. I've tried tinkering to make them support prices across multiple exchanges, but I haven't had any success.
One article I read said to use Bellman-Ford and simply put only the best exchange's price in the graph (as opposed to all the exchanges' prices). While it sounds like that should work, I feel like I could be missing out on value that way. Is this the right way to go about it?
And regarding multiple amounts, should I just make one graph per trade amount? So, say I want to run the algorithm for $100 and for $1000, do I just literally populate the graph twice, once for each set of data? The prices will be different at $100 than at $1000, so the exchange that has the best price at $100 may be different from the one at $1000.
Examples:
The graph would look like this:
rates = [
[1, 0.23, 0.26, 17.41],
[4.31, 1, 1.14, 75.01],
[3.79, 0.88, 1, 65.93],
[0.057, 0.013, 0.015, 1],
]
currencies = ('PLN', 'EUR', 'USD', 'RUB')
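To illustrate the "best exchange's price per pair" idea mentioned above, a minimal sketch could merge the rate matrices element-wise and then run Bellman-Ford on -log(rate), looking for a negative cycle (which corresponds to a rate product above 1, i.e. an arbitrage loop). The second exchange's rates below are invented purely for illustration.

import math

currencies = ('PLN', 'EUR', 'USD', 'RUB')

exchange_1 = [
    [1, 0.23, 0.26, 17.41],
    [4.31, 1, 1.14, 75.01],
    [3.79, 0.88, 1, 65.93],
    [0.057, 0.013, 0.015, 1],
]
exchange_2 = [          # invented rates for a hypothetical second exchange
    [1, 0.24, 0.25, 17.10],
    [4.20, 1, 1.16, 74.50],
    [3.90, 0.87, 1, 66.10],
    [0.056, 0.014, 0.015, 1],
]

n = len(currencies)
# Best rate across exchanges for every currency pair
best = [[max(exchange_1[i][j], exchange_2[i][j]) for j in range(n)] for i in range(n)]
# Edge weight -log(rate): a negative cycle means the product of rates exceeds 1
w = [[-math.log(best[i][j]) for j in range(n)] for i in range(n)]

# Bellman-Ford from a virtual source connected to every node with weight 0
dist = [0.0] * n
for _ in range(n - 1):
    for i in range(n):
        for j in range(n):
            if dist[i] + w[i][j] < dist[j] - 1e-12:
                dist[j] = dist[i] + w[i][j]

# One extra relaxation pass: any further improvement reveals a negative cycle
arbitrage = any(dist[i] + w[i][j] < dist[j] - 1e-12
                for i in range(n) for j in range(n))
print("arbitrage opportunity:", arbitrage)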
REFERENCES:
Here is the code I've been using, but this assumes one exchange and one single trade quantity
Here is where someone mentions you can just include the best exchange's price in the graph in order to support multiple exchanges
Trying for accuracy over speed, there's a way to represent the whole order book of each exchange inside of a linear program. For each bid, we have a real-valued variable that represents how much we want to sell at that price, ranging between zero and the amount bid for. For each ask, we have a real-valued variable that represents how much we want to buy at that price, ranging between zero and the amount asked for. (I'm assuming that it's possible to get partial fills, but if not, you could switch to integer programming.) The constraints say, for each currency aside from dollars (or whatever you want more of), the total amount bought equals the total amount sold. You can strengthen this by requiring detailed balance for each (currency, exchange) pair, but then you might leave some opportunities on the table. Beware counterparty risk and slippage.
For different amounts of starting capital, you can split dollars into "in-dollars" and "out-dollars" and constrain your supply of "in-dollars", maximizing "out-dollars", with a one-to-one conversion with no limit from in- to out-dollars. Then you can solve for one in-dollars constraint, adjust the constraint, and use dual simplex to re-solve the LP faster than from scratch.
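A minimal sketch of that linear program with SciPy might look like the following. The single currency pair, the order book entries, and the budget are toy assumptions; a real model would have one bounded variable per resting order on every exchange, plus one balance constraint per currency (or per currency-exchange pair).

from scipy.optimize import linprog

budget_usd = 1000.0          # "in-dollars" we are willing to spend

# Toy order book:
#   exchange A asks: we can BUY up to 1000 EUR at 1.10 USD each
#   exchange B bids: we can SELL up to  800 EUR at 1.12 USD each
ask_price, ask_size = 1.10, 1000.0
bid_price, bid_size = 1.12, 800.0

# Variables: x = [eur_bought_on_A, eur_sold_on_B]
# Objective: maximize USD out = bid_price * x[1] - ask_price * x[0]
# linprog minimizes, so negate the objective coefficients.
c = [ask_price, -bid_price]

# Balance constraint for EUR: amount bought equals amount sold
A_eq = [[1.0, -1.0]]
b_eq = [0.0]

# Spend no more USD than the starting capital ("in-dollars")
A_ub = [[ask_price, 0.0]]
b_ub = [budget_usd]

# Partial fills allowed: each trade ranges from zero up to the order size
bounds = [(0.0, ask_size), (0.0, bid_size)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=bounds, method="highs")
print("EUR traded:", res.x, "USD profit:", -res.fun)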
Which modelling strategy (time frame, features, technique) would you recommend to forecast 3-month sales for the total customer base?
At my company, we often analyse the effect of e.g. marketing campaigns that run at the same time for the total customer base. In order to get an idea of the true incremental impact of the campaign, we - among other things - want to use predicted sales as a counterfactual for the campaign, i.e. what sales were expected to be assuming no marketing campaign.
Time frame used to train the model - I'm currently considering 2 options (static time frame and rolling window); let me know what you think.
1. Static: Use the same period last year as the dependent variable to build a specific model for this particular 3-month time frame. Data from the 12 months before are used to generate features.
2. Rolling: Use a rolling-window logic of 3 months, dynamically creating dependent time frames and features. I'm not yet sure what the benefit of that would be: it uses more recent data for model creation but feels less specific, because it uses any 3-month period in a year as the dependent window. Not sure what the theory says for this particular example. Experiences, thoughts?
Features - I'm currently building features per customer from one year of pre-period data, e.g. sales in individual months, sales 90/180/365 days prior, max/min/avg per customer, # of months with sales, tenure, etc. This takes quite a lot of time - any libraries/packages you would recommend for this?
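For what it's worth, a rough sketch of that per-customer rollup in pandas could look like the following; the file and column names (customer_id, date, sales) and the cutoff date are assumptions. Packages such as tsfresh or featuretools can automate much of this kind of windowed feature generation.

import pandas as pd

tx = pd.read_csv('transactions.csv', parse_dates=['date'])   # hypothetical file
cutoff = pd.Timestamp('2023-01-01')            # start of the 3-month target window
pre = tx[tx['date'] < cutoff].copy()
pre['month'] = pre['date'].dt.to_period('M')

def window_sum(days):
    # Sum of sales in the last `days` days before the cutoff, per customer
    recent = pre[pre['date'] >= cutoff - pd.Timedelta(days=days)]
    return recent.groupby('customer_id')['sales'].sum().rename(f'sales_{days}d')

features = pd.concat(
    [
        pre.groupby('customer_id')['sales'].agg(['sum', 'mean', 'max', 'min']),
        pre.groupby('customer_id')['month'].nunique().rename('months_with_sales'),
        window_sum(90),
        window_sum(180),
        window_sum(365),
    ],
    axis=1,
).fillna(0)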
Modelling technique - currently considering GLM, XGBoost, S/ARIMA, and LSTM networks. Any experience here?
To clarify, even though I'm considering e.g. ARIMA, I do not need to predict any seasonal patterns of the 3 month window. As a first stab, a single number, total predicted sales of customer base for these 3 months, would be sufficient.
Any experience or comment would be highly appreciated.
Thanks,
F
I'm doing an ongoing survey, every quarter. We get people to sign up (where they give extensive demographic info).
Then we get them to answer six short questions with 5 possible values: much worse, worse, same, better, much better.
Of course, over time we will not get the same participants - some will drop out and some new ones will sign up - so I'm trying to decide how best to build a DB and code (hoping to use Python, NumPy?) to allow for ongoing collection and analysis by the various categories defined by the initial demographic data. As of now we have 700 or so participants, so the dataset is not too big.
I.e.:
demographic: UID, North/South, residential/commercial; then answers for the 6 questions for Q1.
Same for Q2 and so on; then I need to be able to slice, dice, and average the values for the quarterly answers by the various demographics to see trends over time.
The averaging, grouping and so forth is modestly complicated by having differing participants each quarter.
Any pointers to design patterns for this sort of DB and analysis? Is this a sparse matrix?
Regarding the survey analysis portion of your question, I would strongly recommend looking at the survey package in R (which includes a number of useful vignettes, including "A survey analysis example"). You can read about it in detail on the webpage "survey analysis in R". In particular, you may want to have a look at the page entitled database-backed survey objects which covers the subject of dealing with very large survey data.
You can integrate this analysis into Python with RPy2 as needed.
This is a Data Warehouse. Small, but a data warehouse.
You have a Star Schema.
You have Facts:
response values are the measures
You have Dimensions:
time period. This has many attributes (year, quarter, month, day, week, etc.) This dimension allows you to accumulate unlimited responses to your survey.
question. This has some attributes. Typically your questions belong to categories or product lines or focus areas or anything else. You can have lots of question "category" columns in this dimension.
participant. Each participant has unique attributes and reference to a Demographic category. Your demographic category can -- very simply -- enumerate your demographic combinations. This dimension allows you to follow respondents or their demographic categories through time.
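As a rough, non-prescriptive illustration, that star schema could be expressed in SQLite from Python along these lines; the table and column names are made up.

import sqlite3

conn = sqlite3.connect('survey.db')
conn.executescript("""
CREATE TABLE dim_time (
    time_id  INTEGER PRIMARY KEY,
    year     INTEGER, quarter INTEGER, month INTEGER, week INTEGER, day DATE
);
CREATE TABLE dim_question (
    question_id INTEGER PRIMARY KEY,
    text        TEXT,
    category    TEXT            -- add as many "category" columns as you need
);
CREATE TABLE dim_participant (
    participant_id  INTEGER PRIMARY KEY,
    region          TEXT,       -- North / South
    segment         TEXT,       -- residential / commercial
    demographic_id  INTEGER     -- enumerated demographic combination
);
-- Fact table: one row per participant, per question, per period
CREATE TABLE fact_response (
    time_id         INTEGER REFERENCES dim_time(time_id),
    question_id     INTEGER REFERENCES dim_question(question_id),
    participant_id  INTEGER REFERENCES dim_participant(participant_id),
    response_value  INTEGER     -- 1 = much worse ... 5 = much better
);
""")
conn.commit()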
Get Ralph Kimball's The Data Warehouse Toolkit and follow those design patterns: http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247
Buy the book. It's absolutely essential that you fully understand it all before you start down a wrong path.
Also, since you're doing data warehousing, look at all the [Data Warehousing] questions on Stack Overflow, and read every data warehousing blog you can find.
There's only one relevant design pattern -- the Star Schema. If you understand that, you understand everything.
On the analysis: if your six questions have been posed in a way that would lead you to believe the answers will be correlated, consider conducting a factor analysis on the raw scores first. Comparing the factors across regions or customer types often has more statistical power than comparing across questions alone. Also, the factor scores are more likely to be normally distributed (they are a weighted sum of 6 observations), while the six questions alone would not be. This allows you to apply t-tests based on the normal distribution when comparing factor scores.
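A small sketch of that workflow, assuming the six answers are coded 1-5 in columns q1..q6 with a 'region' column; the file name and the choice of two factors are assumptions.

import pandas as pd
from sklearn.decomposition import FactorAnalysis
from scipy.stats import ttest_ind

df = pd.read_csv('responses.csv')          # q1..q6 coded 1-5, plus 'region'
questions = ['q1', 'q2', 'q3', 'q4', 'q5', 'q6']

# Reduce the six correlated questions to a couple of latent factors
fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(df[questions])
df['factor1'] = scores[:, 0]

# Compare the first factor's scores between two demographic groups
north = df.loc[df['region'] == 'North', 'factor1']
south = df.loc[df['region'] == 'South', 'factor1']
# Welch's t-test does not assume equal variances or equal group sizes
t, p = ttest_ind(north, south, equal_var=False)
print(t, p)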
One watch-out, though: if you assign numeric values to the answers - 1 = much worse, 2 = worse, etc. - you are implying that the distance between "much worse" and "worse" is the same as the distance between "worse" and "same". This is generally not true: you might really have to screw up to get a vote of "much worse", while just being a passive screw-up might get you a "worse" score. So the assignment of cardinal values (numerics) to ordinal ones (orderings) carries a bias of its own.
The unequal number of participants per quarter isn't a problem - there are statistical t-tests that deal with unequal sample sizes.