I have data whose structure is as below (a fictitious example):
data
There are 3 predictor variables and 1 response variable.
We have data for 5 students, and each student has 3 observations for times 1, 2, 3. Thus the total number of observations is 15.
But I don't have an idea of how to analyze the effect of X1, X2, X3 on Y in this kind of longitudinal data (I will use Python).
Can anyone give me some ideas?
Thank you.
Since you have longitudinal data and a continuous response you have a few different options:
Ignore the grouping structure. I would not recommend doing this, since you may be ignoring information.
Model your groups separately. This is usually not a good idea, and most certainly not in the case where the groups have low sample size.
Treat your grouping variable as a categorical predictor. Again, this may not be ideal when the number of groups is high, even with recent boosting packages that handle high-cardinality categorical predictors well (e.g. CatBoost).
Use a mixed effects model.
If you want to proceed with the fourth option, I recommend taking a look at the Gaussian Process Boosting (GPBoost) package first. However, there are other Python packages to consider: MERF and MixedLM in statsmodels.
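As a rough illustration of the mixed-effects route, here is a minimal sketch using statsmodels' MixedLM. The simulated values are made up; the student/time layout and the column names X1, X2, X3, Y simply mirror the structure described in the question:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated long-format data: 5 students, 3 time points each (15 rows).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "student": np.repeat(np.arange(1, 6), 3),
    "time": np.tile([1, 2, 3], 5),
    "X1": rng.normal(size=15),
    "X2": rng.normal(size=15),
    "X3": rng.normal(size=15),
})
df["Y"] = 2 * df["X1"] - df["X2"] + rng.normal(size=15)

# Linear mixed-effects model: fixed effects for X1, X2, X3 and a
# random intercept per student to account for the grouping.
model = smf.mixedlm("Y ~ X1 + X2 + X3", data=df, groups=df["student"])
result = model.fit()
print(result.summary())
```

A random intercept per student is the simplest random-effects structure; if trajectories differ across students, a random slope on time can be added through MixedLM's re_formula argument.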
I am trying to cluster retail data in order to extract groupings of customers based on 6 input features. The data has a shape of (1712594, 6) in the following format:
I've split the 'Department' categorical variable into binary dummy columns using Pandas get_dummies(). I'm aware this is not optimal, but I just wanted to test it out before trying Gower distances.
The Elbow method gives the following output:
USING:
I'm using Python and scikit-learn's KMeans because the dataset is so large and the more complex models are too computationally demanding for Google Colab.
OBSERVATIONS:
I'm aware that columns 1-5 are extremely correlated, but the data is limited sales data and little to no data is captured about customers. KMeans is very sensitive to its inputs, and this may affect the WCSS in the elbow method and cause the near-straight line, but this is just a hunch and I don't have any quantitative backing to support the argument. I'm a junior data scientist, so my knowledge of the technical foundations of clustering models and algorithms is still developing; forgive me if I'm missing something.
WHAT I'VE DONE:
There were massive outliers that were skewing the data (this is a building goods company, so most of their sale prices and quantities fall within a certain range, but ~5% of the data contained massive quantity entries (e.g. a company buying 300000 bricks at R3/brick) or massive price entries (e.g. a company buying an expensive piece of equipment)).
I've removed them and kept ~94% of the data. I've also removed the returns made by customers (i.e. negative quantities and prices) on the assumption that I may create a binary variable 'Returned' to capture this feature. Here are some metrics:
These are some metrics before removing the outliers:
and these are the metrics after outlier removal:
KMeans uses Euclidean distances. I've used both scikit-learn's StandardScaler and RobustScaler when scaling, without any significant change in either case. Here are some distribution plots and scatter plots for the 3 numeric variables:
Does anybody have any practical/intuitive reasoning as to why this may be happening? I'm open to alternative methods as well, and any help would be much appreciated! Thanks.
I am not an expert, but in my experience with scikit-learn cluster analysis, when the features are very similar in magnitude, K-means clustering usually does not do the job well. I would first try a StandardScaler to see whether normalizing the data makes the clustering more effective. The elbow plot shows the WCSS decreasing smoothly as the number of clusters grows, and from the looks of that plot and the plots you provide, I would think the data is too similar, making it hard to separate into groups (clusters). Adding an additional feature engineered from your existing data may do the trick.
I would try normalizing the data first with StandardScaler.
If the groups are still not very clear in a simple plot of the data, I would create another column made up of a combination of the other columns.
I would not suggest using DBSCAN, since the eps (distance) parameter would have to be tuned very finely and, as you mention, it is more computationally expensive.
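As a rough sketch of the scale-then-elbow workflow suggested above (the feature matrix here is randomly generated; substitute your own retail features):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Placeholder numeric feature matrix standing in for the retail data.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 6))

# Scale first so no single feature dominates the Euclidean distances.
X_scaled = StandardScaler().fit_transform(X)

# Elbow method: plot within-cluster sum of squares (inertia) vs. k.
inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia (WCSS)")
plt.show()
```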
I am stuck in the process of building a model. Basically I have 10 parameters, all of which are categorical variables, and the categories have a large number of unique values (one category has 1335 unique values across 300,000 records). The y value to be predicted is the number of days (numerical). I am using RandomForestRegressor and getting an accuracy of around 55-60%. I am not sure if this is the maximum achievable or whether I really need to change the algorithm itself. I am flexible with any kind of solution.
Having up to 1335 categories for a categorical dimension might cause a random forest regressor (or classifier) some headache depending on how categorical dimensions are handled internally, and things will also depend on the distribution frequencies of the categories. What library are you using for the random forest regression?
Have you tried converting the categorical dimensions into unique integer IDs and interpreting this representation as a real-number dimension? In my experience this can raise the variable importance of many kinds of categorical dimensions. (At times the inherent/initial ordering of the categories can provide useful grouping/partitioning information.)
You can even shuffle your dimensions a few times and use these as input dimensions. I'll try to explain with an example:
You have a categorical dimension x1 with categories [c11,c12,...,c1n]
Map these categories to numerical values: x1 takes the value 1 for category c11, the value 2 for category c12, and in general the value i for category c1i, etc.
Use this new non-categorical dimension as an input dimension for training (you will have to change your input to the regressor accordingly later on).
You can go further than this. Shuffle the order of the categories of x1 randomly so you get, for example, [c13,c19,c1n,c1i,...,c12]. Do the same thing as above and you have another new non-categorical input dimension. (Note that you'll have to remember the shuffling order for the sake of prediction later on.)
I'm curious if adding a few (anywhere between 1 to 100, or whatever number you choose) dimensions like this can improve your performance.
Please see how performance changes for different numbers of such dimensions. (But be aware that more such dimensions will cost you preprocessing time at regression.)
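A minimal sketch of this integer-ID-plus-shuffle encoding, assuming a pandas Series stands in for one of your categorical columns (the column and category names are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column standing in for x1.
rng = np.random.default_rng(0)
x1 = pd.Series(rng.choice([f"c1{i}" for i in range(1, 8)], size=1000))

# 1. Map each category to an integer in its initial (sorted) order.
categories = sorted(x1.unique())
base_codes = x1.map({c: i + 1 for i, c in enumerate(categories)})

# 2. Shuffle the category order and map again; keep the mapping so new data
#    can be encoded with exactly the same category-to-integer assignment later.
shuffled = list(categories)
rng.shuffle(shuffled)
shuffle_map = {c: i + 1 for i, c in enumerate(shuffled)}
shuffled_codes = x1.map(shuffle_map)

# Both encodings can be fed to the regressor as extra numeric input columns.
X_extra = pd.DataFrame({"x1_code": base_codes, "x1_code_shuffled": shuffled_codes})
print(X_extra.head())
```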
The statement in the codeblock below would require combining multiple categorical dimensions at once. Consider it only for inspiration.
Another idea would be to check whether some form of linear classifier on one-hot encodings of the individual categories of multiple categorical dimensions might be able to improve things (this can help you find useful orderings more quickly than the approach above).
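For instance, a quick sketch of the one-hot-plus-linear idea in scikit-learn; the column names and data are invented, and since the target here is a number of days, a linear regressor (Ridge) stands in for the "linear classifier":

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame with two higher-cardinality categorical columns and a
# numeric target (e.g. number of days).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cat1": rng.choice([f"a{i}" for i in range(50)], size=2000),
    "cat2": rng.choice([f"b{i}" for i in range(30)], size=2000),
    "days": rng.integers(1, 60, size=2000),
})

# One-hot encode every category and fit a simple linear model on top.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["cat1", "cat2"])]
)
model = make_pipeline(encoder, Ridge())
model.fit(df[["cat1", "cat2"]], df["days"])
print(model.predict(df[["cat1", "cat2"]].head()))
```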
I am sure you need to do more processing on your data.
Having 1335 unique values in one variable is quite unusual.
Please, if the data is public, share it with me; I want to take a look.
I have tried to find basic answers to this question, but none on Stack Overflow seems to be a good fit.
I have a dataset with 40 columns and 55,000 rows. Only 8 out of these columns are numerical. The remaining 32 are categorical with string values in each.
Now I wish to do an exploratory data analysis for a predictive model, and I need to drop certain irrelevant columns that do not show high correlation with the target (the variable to predict). But since all of these 32 variables are categorical, what can I do to see their relevance to the target variable?
What I am thinking of trying:
LabelEncoding all 32 columns, then running dimensionality reduction via PCA, and then building a predictive model (sketched briefly after this list). (If I do this, then how can I clean my data by removing the irrelevant columns that have a low corr() with the target?)
One-hot encoding all 32 columns and directly running a predictive model on it.
(If I do this, then the concept of cleaning the data is lost entirely, the number of columns will skyrocket, and the model will consider all relevant and irrelevant variables in its prediction!)
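A rough sketch of what I mean by the first option, with a small made-up frame of categorical columns standing in for my 32:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import OrdinalEncoder

# Made-up frame with a few categorical columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    f"cat{i}": rng.choice(list("abcde"), size=1000) for i in range(5)
})

# Option 1: integer-encode the categoricals, then reduce with PCA.
encoded = OrdinalEncoder().fit_transform(df)
components = PCA(n_components=3).fit_transform(encoded)
print(components[:3])
```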
What is the best practice in such a situation for building a predictive model when you have many categorical columns?
You have to check the correlation. There are two scenarios I can think of:
If the target variable is continuous and the independent variable is categorical, you can go with the Kendall Tau correlation.
If both the target and the independent variable are categorical, you can go with Cramér's V correlation.
There's a package in Python which can do all of this for you, and you can select only the columns that you need:
pip install ctrl4ai
from ctrl4ai import automl
automl.preprocess(dataframe, learning_type)
Use help(automl.preprocess) to understand more about the hyperparameters, and you can customise your preprocessing the way you want.
Please check automl.master_correlation, which checks correlation based on the approach I explained above.
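If you prefer to compute these measures directly, here is a rough sketch with scipy and pandas (the data and column names are invented; Cramér's V is computed by hand from the chi-squared statistic of the contingency table):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: two categorical columns and one continuous target.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cat_a": rng.choice(["red", "green", "blue"], size=500),
    "cat_b": rng.choice(["low", "mid", "high"], size=500),
    "target": rng.normal(size=500),
})

# Kendall's tau between an (integer-coded) categorical column and a
# continuous target.
codes = df["cat_a"].astype("category").cat.codes
tau, p_value = stats.kendalltau(codes, df["target"])
print(f"Kendall tau: {tau:.3f} (p={p_value:.3f})")

# Cramér's V between two categorical columns, from the chi-squared statistic
# of their contingency table.
def cramers_v(x, y):
    table = pd.crosstab(x, y)
    chi2 = stats.chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

print(f"Cramér's V: {cramers_v(df['cat_a'], df['cat_b']):.3f}")
```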
You can have a look at whether your categorical variables are suitable for a Spearman rank correlation, which ranks the categorical variables and calculates the correlation coefficient on those ranks. However, be careful about collinearity between the categorical variables.
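A tiny sketch of that idea, assuming an ordinal categorical column (the example values are made up):

```python
import pandas as pd
from scipy import stats

# Hypothetical ordinal categorical column and numeric target.
sizes = pd.Categorical(["S", "M", "L", "M", "L", "S"],
                       categories=["S", "M", "L"], ordered=True)
target = [1.0, 2.1, 3.2, 1.9, 2.8, 1.2]

# Spearman correlation on the rank codes of the ordered categories.
rho, p = stats.spearmanr(sizes.codes, target)
print(f"Spearman rho: {rho:.3f} (p={p:.3f})")
```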
I am very sorry if this question violates SO's question guidelines, but I am stuck and I cannot find anywhere else to ask this type of question. Suppose I have a dataset containing experimental data obtained under three different conditions (hot, cold, comfortable). The measurements are arranged in a pandas dataframe consisting of 4 columns (time, cold, comfortable and hot): one column per condition plus the time column.
When I plot the data, I can visually see the separation of the three experiments, but I would like to do it automatically with machine learning.
The x-axis represents the time and the y-axis represents the magnitude of the data. I have read about different machine learning classification techniques, but I do not understand how to set up my data so that I can 'feed' it into a classification algorithm. Namely, my questions are:
Is this programmatically feasible?
How can I set up (arrange) my data so that it can be easily fed into a classification algorithm? From what I have read so far, it seems that for the algorithm to work the data has to be in a certain order (see for example the iris dataset, where the data is nicely labeled). How can I customize the algorithms to fit my needs?
NOTE: Ideally, I would like the program, given a magnitude value, to classify the value as hot, comfortable or cold. The time series is not of much relevance in my case.
Of course this is feasible.
It's not entirely clear from the original post exactly what variables/features you have available for your model, but here is a bit of general guidance. All of these machine learning problems, from classification to regression, rely on the same core assumption that you are trying to predict some outcome based on a bunch of inputs. Usually this relationship is modeled like this: y ~ X1 + X2 + X3 ..., where y is your outcome ("dependent") variable, and X1, X2, etc. are features ("explanatory" variables). More simply, we can say that using our entire feature-set matrix X (i.e. the matrix containing all of our x-variables), we can predict some outcome variable y using a variety of ML techniques.
So in your case, you'd try to predict whether it's Cold, Comfortable, or Hot based on time. This is really more of a forecasting problem than it is a ML problem, since you have a time component that looks to be one of the most important (if not the only) features in your dataset. You may want to look at some simpler time-series forecasting methods (e.g. ARIMA) instead of ML algorithms, as some of the time-series ML approaches may not be well-suited for a beginner.
In any case, this should get you started, I think.
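Given the NOTE that only the magnitude matters, one possible way to arrange the data (a sketch with made-up numbers, following the wide time/cold/comfortable/hot layout from the question) is to melt the dataframe so each row becomes a (magnitude, condition) pair and then fit any scikit-learn classifier:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical wide dataframe: one column per condition plus time.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "time": np.arange(100),
    "cold": rng.normal(0, 1, 100),
    "comfortable": rng.normal(5, 1, 100),
    "hot": rng.normal(10, 1, 100),
})

# Reshape to long format: each row is (magnitude, condition label),
# which is the y ~ X layout a classifier expects.
long = df.melt(id_vars="time", var_name="condition", value_name="magnitude")

X = long[["magnitude"]]
y = long["condition"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
print(clf.predict(pd.DataFrame({"magnitude": [9.5]})))  # classify a new magnitude
```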
Disclaimer: I'm new to the field of Machine Learning, and even though I have done my fair share of research during the past month I still lack deep understanding on this topic.
I have been playing around with the scikit library with the objective of learning how to predict new data based on historic information, and classify existing information.
I'm trying to solve 2 different problems which may be correlated:
Problem 1
Given a data set containing rows R1 ... RN with features F1 ... FN, and a target per each group of rows, determine which group row R(N+1) belongs to.
Now, the target value is not singular; it's a set of values. The best solution I have been able to come up with is to represent those sets of values as a concatenation; this creates an artificial class and allows me to represent multiple values using only one attribute. Is there a better approach to this?
What I'm expecting is to be able to pass a totally new set of rows and be told the target values for each of them.
Problem 2
Given a data set containing rows R1 ... RN with features F1 ... FN, predict the values of R(N+1) based on the frequency of the features.
A few considerations here:
Most of the features are categorical in nature.
Some of the features are dates, so when doing the prediction the date should be in the future relative to the historic data.
The frequency analysis needs to be done per row, because certain sets of values may be invalid.
My question here is: is there any process/ML algorithm which, given historic data, would be able to predict a new set of values based on just the frequency of the parameters?
If you have any suggestions, please let me know.
Regarding Problem 1, if you expect the different components of the target value to be independent, you can approach the problem as building a classifier for every component. That is, if the features are F = (F_1, F_2, ..., F_N) and the targets Y = (Y_1, Y_2, ..., Y_N), create a classifier with features F and target Y_1, a second classifier with features F and target Y_2, etc.
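A minimal sketch of that one-classifier-per-component idea using scikit-learn's MultiOutputClassifier, which fits one clone of the base estimator per target column (the data here is random and purely illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Feature matrix F and a multi-component target Y, where each column of Y
# is one independent component (Y_1, Y_2, ...).
rng = np.random.default_rng(0)
F = rng.normal(size=(200, 5))
Y = rng.integers(0, 3, size=(200, 2))  # two target components, 3 classes each

# One classifier per target component.
clf = MultiOutputClassifier(RandomForestClassifier(random_state=0)).fit(F, Y)
print(clf.predict(F[:3]))  # one predicted value per component, per row
```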
Regarding Problem 2, if you are not dealing with a time series, IMO the best you can do is simply predict the most frequent value for each feature.
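For instance, a trivial per-feature baseline along those lines with pandas (the rows are invented):

```python
import pandas as pd

# Hypothetical historic rows with categorical features.
hist = pd.DataFrame({
    "color": ["red", "red", "blue", "red"],
    "size": ["L", "M", "L", "L"],
})

# Baseline prediction for a new row: the most frequent (modal) value per feature.
prediction = hist.mode().iloc[0]
print(prediction.to_dict())  # e.g. {'color': 'red', 'size': 'L'}
```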
That said, I believe your question is a better fit for another Stack Exchange site, such as Cross Validated.