Causal Impact: Python, how to show all data points in predicted series?

My question is the same as the one asked here, but regarding the Python port of the package rather than the R version.
https://stats.stackexchange.com/questions/424433/causal-impact-how-to-show-all-data-points-in-predicted-series
I am using the code below to run the model and plot the results and summary, but I'd like to access the predicted values so I can export them. Is this possible in Python?
# Assumes the causalimpact Python port that exposes run() and inferences
from causalimpact import CausalImpact

# Pre-period runs up to the last pre-intervention week; post-period starts
# at the intervention and runs to the end of the data
pre_period = [0, index_dict['2022-08-29']]
post_period = [index_dict['2022-09-05'], len(model_df) - 1]

ci = CausalImpact(model_df, pre_period, post_period)
ci.run()
ci.plot()
print(ci.summary())
print(ci.summary(output='report'))

For anyone wondering: it looks like running ci.inferences gives you the outputs in a DataFrame, with ci.inferences.point_pred being the predicted output from the synthetic control.
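As a minimal sketch of exporting those predictions: point_pred is the column named above, while the interval column names (point_pred_lower, point_pred_upper) are an assumption and may differ between ports.

# Select the synthetic-control prediction and its interval, then export
predictions = ci.inferences[['point_pred', 'point_pred_lower', 'point_pred_upper']]
predictions.to_csv('causal_impact_predictions.csv')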
Using inspect on the plot function revealed a lot about how the plots are constructed, as explained in this thread: Is there a way to save a plot generated by causalimpact in Python?
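Since the plot is built with matplotlib, one hedged way to save it is to grab the current figure after plotting (this assumes ci.plot() draws onto the current figure; if it calls plt.show() internally, use a non-interactive backend so the figure survives).

import matplotlib.pyplot as plt

ci.plot()
# Save whatever figure ci.plot() just drew
plt.gcf().savefig('causal_impact.png', dpi=150, bbox_inches='tight')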

Related

How to plot a regression curve of Random Forest Model

I am currently working on a project for which I have simulated/mock-up data. This data consists of multiple features, of which only one affects the response variable. This is a very simplified use case because it is only for demo purposes.
I have used a basic random forest regression (scikit-learn) to predict the dependent variable. This model performs rather well, which was expected given its simplicity. What I am having problems with is plotting a regression curve of the model (Remaining Useful Life is the dependent variable and temp is the feature affecting it). I am using pyplot to do this, but I am not getting the expected result (see below). I would have expected the plot to be roughly the bottom curve; I am not sure why the straight lines above it are there.
To clarify what I was expecting to get:
Below is a scatter plot of the same data
My questions regarding this:
Why is the plot coming out like this? Does it have something to do with how RF works?
Is there a way of getting a "clean" regression curve? (e.g. the shape of the scatter plot but one line) If so: how can this be achieved?
Code I am using for the plot:
plt.plot(y_hat_train_rf, X_train[['temp']], color='k')
Thanks to F. Gyllenhammar's comment, I have found a solution. This should be obvious to experienced people, but I will share it nevertheless.
Steps to solve:
Create a new DataFrame that joins x and y.
Sort by x.
Plot.
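A sketch of those steps, reusing the names from the question (X_train and y_hat_train_rf are assumed to be the training features and the RF predictions); note that the original plt.plot call also had the feature and the prediction swapped between the axes.

import pandas as pd
import matplotlib.pyplot as plt

# Join the feature and the predictions in one DataFrame
curve = pd.DataFrame({'temp': X_train['temp'].to_numpy(), 'pred': y_hat_train_rf})

# Sort by x: plt.plot connects points in the order given, so unsorted
# x values produce the zig-zag lines seen in the question
curve = curve.sort_values('temp')

plt.plot(curve['temp'], curve['pred'], color='k')
plt.xlabel('temp')
plt.ylabel('Remaining Useful Life (predicted)')
plt.show()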

How to proceed forward with my data with python for process parameters?

I am working on an injection molding machine, analyzing process data for some parameters to determine which parameters are related and, if they are, which ones matter for detecting a change in the condition of the machine.
I have plotted the correlation matrix heatmap for the parameters, from which I can see some positive and negative correlations between different parameters. I have attached a picture (Heat Map for selected parameters). The problem I am facing now is that some parameters may or may not be related theoretically, and some of the related parameters (from the figure) have completely different units.
I want to analyze this data further. Please suggest how I should proceed: PCA, regression analysis, or something else?
PS: I wanted to share my data regarding the same but I don't know how or where to upload it.
Thank you in advance
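If PCA turns out to be the route, one point follows directly from the units problem above: standardize the parameters first, or the ones with the largest numeric scales will dominate the components. A minimal, hypothetical sketch (df is assumed to hold one column per process parameter):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize so parameters with different units contribute equally
X = StandardScaler().fit_transform(df)

pca = PCA()
scores = pca.fit_transform(X)

# A few dominant components suggest groups of parameters moving together
print(pca.explained_variance_ratio_)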

How to evaluate HDBSCAN text clusters?

I'm currently trying to use HDBSCAN to cluster movie data. The goal is to cluster similar movies together (based on movie info like keywords, genres, actor names, etc.) and then apply LDA to each cluster to get the representative topics. However, I'm having a hard time evaluating the results (apart from visual analysis, which doesn't scale as the data grows). With LDA, although it's hard to evaluate, I've been using the coherence measure. Does anyone have any idea how to evaluate the clusters made by HDBSCAN? I haven't been able to find much info on it, so if anyone has any idea, I'd very much appreciate it!
HDBSCAN implements Density-Based Clustering Validation (DBCV), exposed as relative_validity. It allows you to compare one clustering, obtained with a given set of hyperparameters, to another one.
In general, read about cluster analysis and cluster validation.
Here's a good discussion about this with the author of the HDBSCAN library.
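A minimal sketch of such a comparison, assuming the hdbscan library's API (relative_validity_ is derived from the minimum spanning tree, so gen_min_span_tree=True is required; X is your feature matrix):

import hdbscan

# Two clusterings with different hyperparameters on the same data
a = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True).fit(X)
b = hdbscan.HDBSCAN(min_cluster_size=15, gen_min_span_tree=True).fit(X)

# A fast DBCV approximation: higher is better, but the score is only
# meaningful relative to other runs on the same data
print(a.relative_validity_, b.relative_validity_)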
It's the same problem everywhere in unsupervised learning.
It is unsupervised: you are trying to discover something new and interesting, and there is no way for the computer to decide whether something is actually interesting or new. It can decide only trivial cases, where the prior knowledge is already coded in a machine-processable form, and you can compute some heuristic values as a proxy for interestingness. But such measures (including density-based measures such as DBCV) are in no way better at judging this than the clustering algorithm itself is at choosing the "best" solution.
In the end, there is no way around manually looking at the data and taking the next step: try to put what you learned from the data to use. Presumably you are not an ivory-tower academic doing this just to make up yet another useless method... So use it; don't fake using it.
You can try the clusteval library. This library helps you find the optimal number of clusters in your dataset, including for HDBSCAN. Once you have the cluster labels, you can start enrichment analysis using hnet.
pip install clusteval
pip install hnet
Example:
# Import library
from clusteval import clusteval
# Set the method
ce = clusteval(method='hdbscan')
# Evaluate
results = ce.fit(X)
# Make plot of the evaluation
ce.plot()
# Make scatter plot using the first two coordinates.
ce.scatter(X)
So at this point you have the optimal detected cluster labels, and now you may want to know whether there is an association between any of the clusters and a (group of) feature(s) in your metadata. The idea is to compute, for each cluster label, how often it is seen for a particular class in your metadata. This can be expressed as a P-value: the lower the P-value (below alpha=0.05), the less likely it happened by random chance.
results is a dict and contains the optimal cluster labels in the key labx. With hnet we can compute the enrichment very easily. More information can be found here: https://erdogant.github.io/hnet
# Import library
import hnet
# Get labels
clusterlabels = results['labx']
# Compute the enrichment of the cluster labels with the dataframe df
enrich_results = hnet.enrichment(df, clusterlabels)
When we look at enrich_results, there is a column category_label: these are the metadata variables of the dataframe df that we gave as input. The column P is the P-value, the computed significance of the category_label with respect to the target variable y; in this case, y is the cluster labels clusterlabels.
The target labels in y can be significantly enriched more than once, meaning that certain y are enriched for multiple variables in the dataframe. This can occur because we may need to estimate the cluster labels better, or because it is a mixed group, or something else.
More information about cluster enrichment can be found here:
https://erdogant.github.io/hnet/pages/html/Use%20Cases.html#cluster-enrichment

Understanding Multiple-linear regression and using python to accomplish this?

I am wondering what package/library would be best suited for performing multiple linear regression. I've read about it, but the concept still confuses me. Currently I am trying to perform MLR on two files I have. One is a Shapefile, which is basically a vector image with a lot of data about a particular state. The other is a raster image of that same state, which has a lot of associated data about the state: number of pixels, areas, things like that. What I am trying to do is perform multiple linear regression on the three variables I have:
impervious surface
developed class
planted/cultivated class.
The instructions I have ask me to:
"Perform multiple linear regression between population density and area percentage of the following surface covers and calculate the R2 of the regression"
I'm not sure what this means. When I asked for further clarification, thinking it meant doing combinations of those three variables and correlating each with a variable called Population_density, an associate told me this:
"By multiple regression, I don't mean to run three regressions separately, with each one using one independent variable. A multiple regression is one regression with any number of independent variables, not just two. For this project, you need to use three independent variables in the regression. Search the internet to understand what a multiple linear regression is if you don't have a good understanding of it yet."
I need help understanding MLR in this context and how I would go about programming it into python.
Thank you
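In code, such a regression, one model with all three predictors fitted together, might look like this hedged sketch (statsmodels; the column names are placeholders for however the shapefile/raster attributes end up in a DataFrame df):

import pandas as pd
import statsmodels.api as sm

# Hypothetical column names; substitute your extracted attribute names
X = df[['impervious_surface', 'developed', 'planted_cultivated']]
y = df['population_density']

# One regression with three independent variables
model = sm.OLS(y, sm.add_constant(X)).fit()

print(model.summary())   # coefficients for all three predictors at once
print(model.rsquared)    # the R2 the instructions ask for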

Python: Create Nomograms from Data (using PyNomo)

I am working in Python 2.7. I want to create nomograms based on data for various variables in order to predict one variable. I have been looking into, and have installed, the PyNomo package.
However, from the documentation here and here and the examples, it seems that nomograms can only be made when you have equation(s) relating these variables, not from the data itself. For example, the examples here show how to use equations to create nomograms. What I want is to create a nomogram from the data and use that to predict things. How do I do that? In other words, how do I make the nomograph take data as input rather than a function? Is it even possible?
Any input would be helpful. If PyNomo cannot do it, please suggest another package (in any language). For example, I am trying the nomogram function from the rms package in R, but I am not having luck figuring out how to use it properly. I have asked a separate question about that here.
The term "nomogram" has become somewhat confused of late as it now refers to two entirely different things.
A classic nomogram performs a full calculation - you mark two scales, draw a straight line across the marks and read your answer from a third scale. This is the type of nomogram that pynomo produces, and as you correctly say, you need a formula. As mentioned above, producing nomograms like this is definitely a two-step process.
The other use of the term (very popular recently) refers to regression nomograms. These are graphical depictions of regression models (usually logistic regression models). For these, a group of parallel predictor variables is depicted with a common scale on the bottom; for each predictor you read the 'score' from its scale and add these up. These types of nomograms have become very popular in the last few years, and that's what the rms package will draft. I haven't used it, but my understanding is that it works directly from the data.
Hope this is of some use! :-)
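As a hedged sketch of the first step of that two-step process: fit a formula to the data (here, hypothetically, a plane z = a*x + b*y + c over observed 1-D arrays x, y, z), and that fitted equation is what you would then encode in PyNomo's usual equation-based workflow.

import numpy as np

# Least-squares fit of z = a*x + b*y + c to the observed data
A = np.column_stack([x, y, np.ones_like(x)])
a, b, c = np.linalg.lstsq(A, z)[0]

# Step two (not shown): pass the fitted relation, lambda x, y: a*x + b*y + c,
# to PyNomo's block definitions as the equation relating the scales
print(a, b, c)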
