Using multivariable LSTM to predict only certain values - python

Okay, so, the question might be a bit tricky.
For a project I'm working on, I'm supposed to predict sales values of certain products in a store. Easy enough; I've built two working models that, by analyzing the past 10 years of sales for a single product, are capable of predicting its future sales.
However, here's where it gets complicated:
My dataframe looks something like this:
df={month : [...], id : [...], n_sales : [...], group : [...], brand : [...]}
Id refers to the product, whereas group refers to the type of product and brand is just the brand.
It's important to understand that, of course, a single id has only one group and one brand, whereas a group or a brand can each be associated with many different ids.
Finally, my data is organized by month (ascending) and then by id (also ascending).
Meaning that, let's say the store has 50 products (50 id's).
Then the first 50 rows of my dataset would be:
----Date----|--Id--|--n_sales--|......
2012-01-01 | 1 ......
2012-01-01 | 2 ......
2012-01-01 | 3 ......
......
2012-01-01 | 50 .....
Then the next 50 rows would be the respective sales of each product for the month 2012-02-01 and so on until now.
I'm sorry if this is confusing; I'm trying to explain it as clearly as I can.
Okay, I'm almost done. It's understandable that, if I isolate a single product, it would be easy to analyze the data.
I could just plot the sales from the known months alongside the predicted sales.
However, in order to make a more accurate prediction, I was asked to build a multivariate LSTM model, meaning that I have to take into account both group and brand. This, of course, means training my model with the data from all the products. This is better understood with an example:
Let's say a new ice cream from Nestle launched last November. Analyzing only the sales of that ice cream could never predict that sales will go up in summer, since the only data the model would have is the few sales made in the cold months.
However, if I analyze all the products, the LSTM would learn that Nestle products sell considerably more in summer and would take this into account when making the prediction for this new product.
And there's the problem. So, getting to the question: how can I train on all the data, from all the products, but get predictions for only a single id?
Note: It has to be with LSTM, other models aren't an option.
And to anyone making it this far, even if you are not able to help, thank you for reading such a mess!
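For concreteness, here is a minimal sketch of one way to do this with Keras: one-hot encode group and brand alongside n_sales, build training windows from every product, train a single LSTM, and simply filter the windows by id at prediction time. Column names follow the dataframe above; the window length, layer sizes, and training settings are illustrative assumptions, not tuned values.

    import numpy as np
    import pandas as pd
    from tensorflow.keras import layers, models

    def make_windows(df, lookback=12):
        """Build (lookback months -> next month) samples from every product."""
        # one-hot encode group/brand so the model can share patterns across products
        feats = pd.get_dummies(df[["n_sales", "group", "brand"]],
                               columns=["group", "brand"]).astype("float32")
        X, y, ids = [], [], []
        for pid, g in df.groupby("id"):  # rows within each id stay in month order
            vals = feats.loc[g.index].values
            sales = g["n_sales"].values
            for i in range(len(g) - lookback):
                X.append(vals[i:i + lookback])
                y.append(sales[i + lookback])
                ids.append(pid)
        return np.array(X), np.array(y), np.array(ids)

    X, y, ids = make_windows(df)

    model = models.Sequential([
        layers.Input(shape=X.shape[1:]),
        layers.LSTM(64),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=20, batch_size=32)   # trained on ALL products

    # prediction for a single product: just filter the windows by id
    target_id = 7                                # hypothetical product id
    pred = model.predict(X[ids == target_id])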

Related

Predict real estate price of available sales in the market based on past transaction data - machine learning

I am building a project for a data science course.
I decided to go for a familiar theme of real estate but with some additions of my own.
I took a city and collected all the real estate transactions from 2000 until today.
In the model-training phase I used RandomForestRegressor and, after processing all the data, reached a decent score of 0.8.
In the second step, I collected data from a website where people advertise second-hand apartments for sale, so that I have the same fields there as in the first model.
Now the problem:
The results I get are far from reality - the price of the model in many apartments is far from the price at which the apartment is sold.
I processed the data and did some comparisons.
I think the problem stems from the fact that the first model is trained on all the transactions from the year 2000 until today, while the new data it is run on is not treated the same way:
that is, the year of every listing is 2023, not spread across 2000-2023, and the model does not know how to take the increase in prices into account.
I will point out that there is some reasonable correlation between the real price of the assets and the price that the model shows, but the spread is still too large and inaccurate.
In addition, another figure seems a little strange to me. I noticed that the average price per meter of an apartment sold in the city in 2022 is $11621.85, compared to $18405.18 for an apartment currently on the market.
The average increase in recent years is 7.5 percent
I would appreciate your ideas on how to make the model more accurate.
*The data in my model contains:
['price',
'year',
'neighborhood',
'buildingMR',
'TotalFloors',
'rooms',
'floor',
'lat',
'long',
'buildyear',
'Distance_From_Sea',
'avgroomsize']
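One common fix for the trend problem described above is to express every historical price in current money before training, so the model never has to learn the time trend itself. Here is a minimal sketch of that idea; the 7.5% yearly growth figure comes from the question, while the file names and the neighborhood encoding are assumptions.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    GROWTH = 1.075      # ~7.5% average yearly increase, from the question
    REF_YEAR = 2023

    train = pd.read_csv("transactions.csv")    # hypothetical file names
    market = pd.read_csv("listings.csv")

    # express every historical price in REF_YEAR money so the model
    # never has to learn the time trend itself
    train["price_adj"] = train["price"] * GROWTH ** (REF_YEAR - train["year"])

    # encode neighborhood with integer codes shared by both frames
    hoods = pd.Categorical(pd.concat([train["neighborhood"], market["neighborhood"]]))
    train["hood"] = pd.Categorical(train["neighborhood"], categories=hoods.categories).codes
    market["hood"] = pd.Categorical(market["neighborhood"], categories=hoods.categories).codes

    features = ["hood", "buildingMR", "TotalFloors", "rooms", "floor",
                "lat", "long", "buildyear", "Distance_From_Sea", "avgroomsize"]

    model = RandomForestRegressor(n_estimators=300, random_state=0)
    model.fit(train[features], train["price_adj"])   # 'year' deliberately left out

    # predictions come out in 2023 money, directly comparable to asking prices
    market["predicted_price"] = model.predict(market[features])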

How to analyze numerical and categorical variables at the same time?

I'm trying to analyze the data of a food-ordering application.
The data consists of both numerical and categorical variables. The main variable I'm studying is the total delivery time of an order, which represents the time from placing the order to closing it, and I want to study which variables affect it the most.
an example of rows in the data is the following:
order id | branch id | date     | time placed | day     | period    | items id | no. items | total no. items | total delivery time | total time in seconds
113113   | 31        | 2/2/2021 | 13:32:24    | Tuesday | afternoon | 571      | 4         | 11              | 00:46:19            | 2805
113113   | 31        | 2/2/2021 | 13:32:24    | Tuesday | afternoon | 573      | 4         | 11              | 00:46:19            | 2805
I want to study the effect of all the variables on the total time, even items id and branch id: does a certain item affect the time? Do the day and the period of the day affect it as well?
I used linear regression to get the correlation between total time and the numerical variables, and tried a one-way ANOVA for some of the categorical variables, but I wasn't happy with the results. Is there a way to analyze all the variables together without encoding the categorical ones?
I'm looking forward to seeing what other people say about this. Here's my two cents.
ML algos like regression love numbers; ML algos like classifiers love labels (non-numbers). You can certainly convert labeled data to numbered data. One example is to code ['red','green','blue'] as [1,2,3], which produces weird artifacts: 'red' is lower than 'blue', and if you average a 'red' and a 'blue' you get a 'green'. A more subtle example can happen when you code ['low','medium','high'] as [1,2,3]. In that case the ordering happens to make sense, but subtle inconsistencies can still appear when 'medium' is not exactly midway between 'low' and 'high'. Now, under the hood, I think classifiers convert labels to numbers, so if you feed in large, medium, and small, the algorithm isn't using those words to do its analysis; it's converting those categories to numbers. I think. Maybe someone can confirm this for me.
Thus, I don't think it makes sense to try to measure a relationship between IDs and specific outcomes like 'totaltime' or 'totaldays'. If you kick off a project on a Monday or a Friday, does the project end sooner or later than projects that don't start on those days? Well, maybe it does. But is that correlation or causation? You can find correlations between all kinds of things, but they don't necessarily imply causation. Say you find that projects starting on the second Monday of the month finish much faster than all other projects. That seems like pure coincidence rather than causation, or there is some other factor driving the outcome: maybe projects that start on the second Monday are typically small upgrades rather than full-blown new undertakings, so the volume of work is smaller and the project is done faster. But starting the work on the second Monday of the month doesn't CAUSE the project to finish faster. Tell me if I am wrong. I'm always open to feedback.
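To make the encoding route concrete, here is a minimal sklearn sketch: one-hot encode the categorical columns (avoiding the fake ordering described above), fit one model on all the variables together, and rank them with permutation importance, which reports importances per original (unencoded) column. The file name and exact column subset are assumptions based on the table above.

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv("orders.csv")                   # hypothetical file
    cat_cols = ["branch id", "day", "period", "items id"]
    num_cols = ["no. items", "total no. items"]
    X, y = df[cat_cols + num_cols], df["total time in seconds"]

    # one-hot encode categoricals; numeric columns pass through untouched
    pre = ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
        remainder="passthrough")

    model = make_pipeline(pre, RandomForestRegressor(random_state=0))
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model.fit(X_tr, y_tr)

    # permutation importance shuffles the ORIGINAL (unencoded) columns,
    # so it answers "which variable matters most?" directly
    imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
    for name, score in sorted(zip(X.columns, imp.importances_mean),
                              key=lambda t: -t[1]):
        print(f"{name}: {score:.3f}")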

Predict Sales as Counterfactual for Experiment

Which modelling strategy (time frame, features, technique) would you recommend to forecast 3-month sales for total customer base?
At my company, we often analyse the effect of e.g. marketing campaigns that run at the same time for the total customer base. In order to get an idea of the true incremental impact of the campaign, we - among other things - want to use predicted sales as a counterfactual for the campaign, i.e. what sales were expected to be assuming no marketing campaign.
Time frame used to train the model: I'm currently considering 2 options (static time frame and rolling window) - let me know what you think.
1. Static: use the same period last year as the dependent variable to build a model specific to this particular 3-month time frame. Data from the 12 months before are used to generate features.
2. Rolling window: use a rolling 3-month window, dynamically creating dependent time frames and features. I'm not yet sure what the benefit of that would be: it uses more recent data for model creation, but feels less specific because any 3-month period in the year can serve as the dependent window. Not sure what the theory says for this particular example. Experiences, thoughts?
Features - I'm currently building features per customer from one year of pre-period data, e.g. sales in individual months; sales 90/180/365 days prior; max/min/avg per customer; # of months with sales; tenure; etc. This takes quite a lot of time - any libraries/packages you would recommend for this? (A rough sketch of this feature build is below.)
Modelling technique - currently considering GLM, XGBoost, S/ARIMA, and LSTM networks. Any experience here?
To clarify, even though I'm considering e.g. ARIMA, I do not need to predict any seasonal patterns of the 3 month window. As a first stab, a single number, total predicted sales of customer base for these 3 months, would be sufficient.
Any experience or comment would be highly appreciated.
Thanks,
F
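A minimal pandas sketch of the per-customer feature build described above (the static variant: features from the year before the target 3-month window). The column names 'customer_id', 'date', 'sales' and the file name are assumptions; for automating this, libraries such as featuretools or tsfresh target exactly this kind of work.

    import pandas as pd

    df = pd.read_csv("transactions.csv", parse_dates=["date"])   # hypothetical
    target_start = pd.Timestamp("2023-01-01")    # start of the 3-month window
    # one year of pre-period data, per the feature description above
    hist = df[(df["date"] < target_start)
              & (df["date"] >= target_start - pd.Timedelta(days=365))]

    def window_sum(days):
        """Total sales per customer in the `days` before the target window."""
        cutoff = target_start - pd.Timedelta(days=days)
        return (hist[hist["date"] >= cutoff]
                .groupby("customer_id")["sales"].sum()
                .rename(f"sales_{days}d"))

    feats = pd.concat([window_sum(d) for d in (90, 180, 365)], axis=1).fillna(0)

    monthly = (hist.assign(month=hist["date"].dt.to_period("M"))
                   .groupby(["customer_id", "month"])["sales"].sum())
    feats["avg_monthly"] = monthly.groupby("customer_id").mean()
    feats["max_monthly"] = monthly.groupby("customer_id").max()
    feats["months_with_sales"] = monthly[monthly > 0].groupby("customer_id").size()
    feats["tenure_days"] = (target_start
                            - hist.groupby("customer_id")["date"].min()).dt.days

    # dependent variable: total sales per customer in the 3-month window
    target = (df[(df["date"] >= target_start)
                 & (df["date"] < target_start + pd.DateOffset(months=3))]
              .groupby("customer_id")["sales"].sum())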

How to pre-process transactional data to predict probability to buy?

I'm working on a model for a department store that uses data from previous purchases to predict the customer's probability of buying today. For the sake of simplicity, say that we have 3 categories of products (A, B, C) and I want to use the purchase history of the customers in Q1, Q2 and Q3 2017 to predict the probability of buying in Q4 2017.
How should I structure my indicators file?
My try:
The variables I want to predict are the red colored cells in the production set.
Please note the following:
Since my set of customers is the same for both years, I'm using a snapshot of how customers acted last year to predict what they will do at the end of this year (which is unknown).
Data is separated by quarter; a co-worker suggested this is not correct, because I'm unintentionally giving greater weight to the indicators by splitting each one into four, when there should be only one per category.
Alternative:
Another approach that was suggested to me was to use two indicators per category, e.g. 'bought_in_category_A' and 'days_since_bought_A'. To me this looks simpler, but then the model will only be able to predict IF the customer will buy Y, not WHEN they will buy Y. Also, what happens if the customer never bought A? I cannot use a 0, since that would imply customers who never bought are closer to customers who bought just a few days ago.
Questions:
Is this structure ok or would you structure the data in another way?
Is it ok to use information from last year in this case?
Is it ok to 'split' a categorical variable into several binary variables? Does this affect the importance given to that variable?
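For concreteness, here is a minimal pandas sketch of both layouts discussed above, assuming a transactions frame with 'customer_id', 'category', and 'date' columns (the file name is hypothetical).

    import pandas as pd

    tx = pd.read_csv("purchases.csv", parse_dates=["date"])    # hypothetical file
    tx["quarter"] = tx["date"].dt.to_period("Q").astype(str)

    # Layout 1: one binary indicator per (category, quarter)
    quarterly = (tx.assign(flag=1)
                   .pivot_table(index="customer_id",
                                columns=["category", "quarter"],
                                values="flag", aggfunc="max", fill_value=0))
    quarterly.columns = [f"bought_{c}_{q}" for c, q in quarterly.columns]

    # Layout 2: per category, a 'bought' flag plus recency; for customers who
    # never bought, the flag stays 0 and recency stays missing (NaN) rather
    # than 0, for exactly the reason raised in the question
    ref_date = tx["date"].max()
    last = tx.groupby(["customer_id", "category"])["date"].max().unstack()
    recency = (ref_date - last).apply(lambda col: col.dt.days)
    bought = last.notna().astype(int)
    bought.columns = [f"bought_in_category_{c}" for c in bought.columns]
    recency.columns = [f"days_since_bought_{c}" for c in recency.columns]
    indicators = pd.concat([bought, recency], axis=1)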
Unfortunately, you need a different approach in order to achieve predictive analysis.
For example, the products' properties are unknown here (color, taste, size, seasonality, ...).
There is no information about the customers (age, gender, living area, etc.).
You need more "transactional" information (when, why, how did they buy, etc.).
What is the product's "lifecycle"? Does it have to do with fashion?
What branch are you in? (Retail, Bulk, Finance, Clothing...)
Meanwhile, have you done any campaign? How will it be measured?
I would first (if applicable) concentrate on the categories' relations and behaviour for each quarter:
for example, when n1 decreases then n2 decreases; when q1 is lower than q2; or q1/2016 vs q2/2017.
I think you should, first of all, work this out with a business analyst in order to find the right "rules" and approach.
I do not think you can get a concrete answer with these generic, assumed data.
Usually you need data from at least the 3-5 most recent years to do decent predictive analysis, depending, of course, on the nature of your product.
Hope this helped a bit.
;-)
-mwk

How to generate features of data in my case so that I can use some tools like LinearRegression to predict?

I am new to this field. As a retailer, I want to analyze which of the customers who bought my goods during a promotion will become loyal customers. I have a list of user actions during the promotion, user information, and a list of customer&merchant pairs in which the customer is known to be loyal to the merchant. I have another list of customer&merchant pairs for which I need to predict whether they will develop a loyal relationship. The data is quite large, so I just put some lines here.
user_id | item_id | cat_id | merchant_id | brand_id | time_stamp | action_type
168006  | 348194  | 544    | 692         | 517      | 625        | 0
168006  | 768080  | 984    | 706         | 1060     | 1016       | 1
168006  | 810877  | 284    | 692         | 517      | 625        | 2

user_id#merchant_id | prob
7562#3571           | 0
7562#4784           | 0
7562#3404           | 1
cat_id: product category
action_type: represents actions such as add to cart, purchase, add to favorite
I think I can use something like sklearn.linear_model.LinearRegression to predict the prob item in my prediction list, treating every user#merchant pair as one item. Here prob means loyal if 1 and not loyal if 0; in the new list, prob would be a float.
I also have a list of user information, which is quite easy to deal with, so I won't put it here. But I don't know how to generate features from my user action list. Could you give me some ideas? In fact, I also don't know whether I should use LinearRegression, or whether a better tool is available.
It's a great question. What you've captured here is that you have some data (user actions associated with merchants) and you want to translate that into an insight (probability of user having more action with a merchant).
This problem is very similar to Netflix's problem. They have viewers (like your customers) and movies (like your merchants or products) and they need to know what movies a viewer will like before they watch the movie. This is a recommender system.
There is a good course on this topic here : https://www.coursera.org/learn/ml-recommenders
From a conceptual point of view, what you are looking to do is to learn the preference vectors for each customer and description vectors for each merchant. The preference and description vectors combine to determine the recommendation.
If I was going to implement this in a Neural Net I'd do the following:
One-hot encode the customers and connect that, in a feed-forward net, to a few nodes. If you think there are 5 distinct dimensions of preference, then use 5 nodes; perhaps as many as 10.
One-hot encode the merchants and connect that to a few nodes as well. If you think there are 5 distinct dimensions of merchant value, then use 5 nodes; perhaps as few as 3.
These two smaller layers on top of the one-hot encodings make a dimensionality-reduction layer for the merchant and the customer and will encode their preferences.
Now feed the two nets into another small layer (perhaps 5 nodes again) and then into two output nodes (one for "doesn't fit" and one for "does fit").
Train the whole configuration on your training data. At the end you'll have learned the preferences of your customers, the descriptions of your merchants, and how they match up conceptually.
When you have a new customer or merchant, your task is to discover their vector, not to recalculate everything. You can use techniques like k-means to find customers similar to that customer, or merchants similar to a merchant.
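Here is a minimal Keras sketch of the architecture described above. An Embedding layer stands in for the explicit one-hot encoding followed by a small dense layer (the two are equivalent), and a single sigmoid output stands in for the two-node fit/doesn't-fit output; the cardinalities and layer sizes are illustrative assumptions.

    from tensorflow.keras import Model, layers

    n_customers, n_merchants = 10_000, 500   # assumed cardinalities

    cust_in = layers.Input(shape=(1,), name="customer_id")
    merch_in = layers.Input(shape=(1,), name="merchant_id")

    # Embedding == one-hot encoding followed by a small dense layer:
    # 5 preference dims per customer, 5 description dims per merchant
    cust_vec = layers.Flatten()(layers.Embedding(n_customers, 5)(cust_in))
    merch_vec = layers.Flatten()(layers.Embedding(n_merchants, 5)(merch_in))

    x = layers.Concatenate()([cust_vec, merch_vec])
    x = layers.Dense(5, activation="relu")(x)        # small combining layer
    out = layers.Dense(1, activation="sigmoid")(x)   # P(loyal); equivalent to
                                                     # the two-node output above

    model = Model([cust_in, merch_in], out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    # model.fit([customer_ids, merchant_ids], prob_labels, epochs=10)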
