I am trying to do multiple imputation in Python.
My motivation comes from the mice package in R; I am looking for something equivalent in Python and found sklearn's IterativeImputer.
Following the documentation and some posts on SO, I am able to produce multiple imputed sets. However, the imputed values are drawn from a distribution by setting sample_posterior = True, and this is not what I am looking for. I would like the imputed values to be real samples from the data rather than draws from a distribution, i.e. as in R, drawn from the values that fall in the same leaf of a decision tree (see page 94 of https://cran.r-project.org/web/packages/mice/mice.pdf). Is there a way to change the "prediction" of a decision tree within the IterativeImputer so that it draws a random observation from the same leaf?
Documentation: https://scikit-learn.org/stable/modules/impute.html
Posts on SO: "IterativeImputer - sample_posterior" and "Imputing missing values using sklearn IterativeImputer class for MICE"
miceforest does what you are looking for. It implements mean matching by default, which will pull from real samples in the data.
However, miceforest uses lightgbm as a backend. This may or may not be what you want.
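For reference, here is a minimal sketch of how miceforest might be used. It assumes the ImputationKernel API; parameter names (e.g. datasets vs. num_datasets) differ between miceforest versions, so check the documentation of the version you have installed.

import miceforest as mf
from sklearn.datasets import load_iris

# small example frame with 25% of the values knocked out at random
iris = load_iris(as_frame=True).frame
iris_amp = mf.ampute_data(iris, perc=0.25, random_state=42)

# mean matching (the default) imputes with values pulled from observed donor
# rows, not draws from a parametric posterior; the "datasets" argument may be
# named "num_datasets" in newer miceforest versions
kernel = mf.ImputationKernel(iris_amp, datasets=4, random_state=42)
kernel.mice(3)  # run 3 MICE iterations

completed = kernel.complete_data(dataset=0)  # first completed dataset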
I have a dataset with more than 50 columns and I'm trying to find a way in Python to make a simple linear regression between each combination of variables. The goal here is to find a starting point for furthering my analysis (i.e., I will delve deeper into those pairs that have a somewhat significant R squared).
I've put all my columns in a list of numpy arrays. How could I go about making a simple linear regression between each combination, and for that combination, print the R square? Is there a possibility to try also a multiple linear regression, with up to 5-6 variables, again with each combination?
Each array has ~200 rows, so code efficiency in terms of speed would not be a big issue for this personal project.
If you are looking for columns with high R squared values, just try a correlation matrix. To ease visualization, I would recommend plotting a heat map using seaborn:
import seaborn as sns
import matplotlib.pyplot as plt
df_corr = df.corr()
sns.heatmap(df_corr, cmap="coolwarm", annot=True)
plt.show()
Another suggestion is to run a Principal Component Analysis (PCA) on your dataset to find the features with the highest variability. Usually these variables are the most important and can be used to make the best predictions. Just let me know if you want more info on this technique.
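In case it helps, here is a minimal PCA sketch with scikit-learn, assuming df is a DataFrame holding your numeric columns:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# scale first so PCA is not dominated by columns with large units
X_scaled = StandardScaler().fit_transform(df)

pca = PCA()
pca.fit(X_scaled)

# proportion of total variance explained by each principal component
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i}: {ratio:.3f}")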
This is more of an EDA problem than a Python problem. Look into some regression resources, specifically a correlation matrix. However, one possible solution could use itertools.combinations with a group size of 6. With 50 columns that already gives 15,890,700 different combinations, so unless you want to run more than 15 million regressions you should do some EDA to find the important features in your dataset.
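To make the itertools.combinations idea concrete, here is a hedged sketch. It assumes your columns live in a pandas DataFrame df and that the response column is named "y" (a hypothetical name); it prints R squared for every pair and for every 3-predictor combination (the group size is easy to change).

from itertools import combinations

from scipy import stats
from sklearn.linear_model import LinearRegression

# --- simple (pairwise) regressions: R^2 for every pair of columns ---
results = []
for x_col, y_col in combinations(df.columns, 2):
    slope, intercept, r, p, se = stats.linregress(df[x_col], df[y_col])
    results.append((x_col, y_col, r ** 2))

# show the pairs with the highest R^2 first
for x_col, y_col, r2 in sorted(results, key=lambda t: t[2], reverse=True)[:10]:
    print(f"{y_col} ~ {x_col}: R^2 = {r2:.3f}")

# --- multiple regression: R^2 for every combination of k predictors ---
target = "y"  # hypothetical name of the response column
predictors = [c for c in df.columns if c != target]
for combo in combinations(predictors, 3):  # 3 predictors per model here
    model = LinearRegression().fit(df[list(combo)], df[target])
    print(combo, round(model.score(df[list(combo)], df[target]), 3))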
I am currently working on a project for which I have simulated/mock-up data. This data consists of multiple features of which only one is affecting the response variable. This is a very simplified use case because it is only for demo purposes.
I have used a basic random forest regression (scikit-learn) to predict the dependent variable. This model is performing rather well which was expected due to its simplicity. The thing I am having problems with is plotting a regression curve of the model (Remaining Useful Life is the dependent variable and temp is the feature which is affecting it). I am using pyplot to do this but I am not getting the expected result (see below). I would have expected the plot to be roughly the bottom curve. I am not sure why the straight lines above are there.
To clarify what I was expecting to get:
Below is a scatter plot of the same data
My questions regarding this:
Why is the plot coming out like this? Does it have something to do with how RF works?
Is there a way of getting a "clean" regression curve? (e.g. the shape of the scatter plot but one line) If so: how can this be achieved?
Code I am using for the plot:
plt.plot(y_hat_train_rf, X_train[['temp']], color='k')
Thanks to F. Gyllenhammar's comment I have found a solution now. This should be obvious to experienced people but I will share my solution nevertheless.
Steps to solve:
Create a new DataFrame that joins x and y.
Sort by x.
Plot (see the sketch below).
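A minimal sketch of those steps, assuming the X_train and y_hat_train_rf names from the question:

import pandas as pd
import matplotlib.pyplot as plt

# 1. join the feature and the predictions in one DataFrame
plot_df = pd.DataFrame({
    "temp": X_train["temp"].to_numpy(),
    "prediction": y_hat_train_rf,
})

# 2. sort by the feature so the line is drawn left to right
plot_df = plot_df.sort_values("temp")

# 3. plot a single clean curve
plt.plot(plot_df["temp"], plot_df["prediction"], color="k")
plt.xlabel("temp")
plt.ylabel("Predicted Remaining Useful Life")
plt.show()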
I have a set of raw data and I have to identify the distribution of that data. What is the easiest way to plot a probability distribution function? I have tried fitting it to a normal distribution.
But I am more curious to know which distribution the data itself follows.
I have no code to show my progress, as I have failed to find any functions in Python that will allow me to test the distribution of the dataset. I do not want to slice the data and force it to fit a normal or skewed distribution.
Is there any way to determine the distribution of the dataset? Any suggestion is appreciated.
Is this a correct approach? Example
This is something close to what I am looking for, but again it fits the data to a normal distribution. Example
EDIT:
The input has million rows and the short sample is given below
Hashtag,Frequency
#Car,45
#photo,4
#movie,6
#life,1
The frequencies range from 1 to 20,000 and I am trying to identify the distribution of the frequencies of the keywords. I tried plotting a simple histogram but I get the output as a single bar.
Code:
import pandas
import matplotlib.pyplot as plt
df = pandas.read_csv('Paris_random_hash.csv', sep=',')
plt.hist(df['Frequency'])
plt.show()
Output
This is a minimal working example for showing a histogram. It only solves part of your question, but it can be a step towards your goal. Note that np.histogram gives you the bin edges, so you have to take the midpoint of two consecutive edges to get a bin's center value.
import numpy as np
import matplotlib.pyplot as pl

x = np.random.randn(10000)
nbins = 20

# normalized histogram: n holds the density in each bin, bins holds the edges
n, bins = np.histogram(x, nbins, density=True)

# take the midpoint of each pair of edges to get the bin centers
pdfx = np.zeros(n.size)
pdfy = np.zeros(n.size)
for k in range(n.size):
    pdfx[k] = 0.5 * (bins[k] + bins[k + 1])
    pdfy[k] = n[k]

pl.plot(pdfx, pdfy)
pl.show()
You can fit your data using the example shown in:
Fitting empirical distribution to theoretical ones with Scipy (Python)?
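As a condensed version of what that linked answer does, here is a hedged sketch that fits a few candidate scipy distributions to the Frequency column (assuming the df read in your code) and compares them with a Kolmogorov-Smirnov statistic. The candidate list is just an illustration, and with heavily tied integer data the K-S p-values are only approximate.

import numpy as np
from scipy import stats

data = df["Frequency"].to_numpy()  # the observed frequencies from the question

# a few candidate distributions to compare
candidates = {
    "lognorm": stats.lognorm,
    "expon": stats.expon,
    "gamma": stats.gamma,
    "powerlaw": stats.powerlaw,
}

for name, dist in candidates.items():
    params = dist.fit(data)                        # maximum-likelihood fit
    ks_stat, p_value = stats.kstest(data, name, args=params)
    print(f"{name:10s} KS statistic = {ks_stat:.4f}  p = {p_value:.4g}")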
Definitely a stats question - it sounds like you're trying to do a probability test of whether the data are significantly similar to the normal, lognormal, binomial, etc. distributions. The easiest is to test for normality or lognormality, as explained below.
Set your p-value cutoff; usually if your p-value <= 0.05 you reject the null hypothesis, i.e. the data is NOT normally distributed.
In Python, use SciPy. You just need the p-value returned by the test, which is the second of the two return values from this function:
import scipy.stats
W, Pvalue = scipy.stats.shapiro(x)
Perform the Shapiro-Wilk test for normality. The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.
If you want to see if it is lognormally distributed (provided it doesn't pass the P test above), you can try:
import numpy
W, Pvalue = scipy.stats.shapiro(numpy.log(x))
Interpret it the same way: I just tested a known lognormally distributed simulation and got a p-value of 0.17 on the np.log(x) test and a value close to 0 for the standard shapiro(x) test. That tells me a lognormal distribution is the better choice, while a normal distribution fails miserably.
I kept it simple, which is what I gathered you are looking for. For other distributions you may need to do more work along the lines of Q-Q plots (https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot) rather than simply following the few tests I proposed. That means plotting the distribution you are trying to fit against your data. Here's a quick example that can get you down that path if you so desire:
import numpy as np
import pylab
import scipy.stats as stats

mydata = np.random.randn(1000)  # placeholder: use whatever data you are looking to fit to a distribution
stats.probplot(mydata, dist='norm', plot=pylab)
pylab.show()
Above you can substitute anything for dist='norm' from the scipy library (http://docs.scipy.org/doc/scipy/reference/tutorial/stats/continuous.html#continuous-distributions-in-scipy-stats): find the distribution's scipy name and add shape parameters where the documentation requires them, e.g. stats.probplot(mydata, dist='loggamma', sparams=(1,1), plot=pylab) or, for Student's t, stats.probplot(mydata, dist='t', sparams=(1,), plot=pylab). Then look at the plot and see how closely your data follow that distribution. If the data points are close, you've found your distribution. The graph also gives you an R^2 value; the closer to 1, the better the fit, generally.
And if you want to continue trying to do what you're doing with the dataframe, try changing to: plt.hist(df['Frequency'].values)
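Since the frequencies span roughly 1 to 20,000 and are heavily skewed, equal-width bins will lump almost everything into the first bar. One possible workaround (a sketch, assuming the same df as in your code) is to use logarithmically spaced bins:

import numpy as np
import matplotlib.pyplot as plt

freq = df["Frequency"].values

# log-spaced bins spread out data that ranges over several orders of magnitude
bins = np.logspace(0, np.log10(freq.max()), 30)
plt.hist(freq, bins=bins)
plt.xscale("log")
plt.xlabel("hashtag frequency")
plt.ylabel("number of hashtags")
plt.show()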
Did you try using the seaborn library? They have a nice kernel density estimation function. Try:
import seaborn as sns
sns.kdeplot(df['frequency'])
You can find installation instructions here
The only distribution the data carry within themselves is the empirical probability. If you have your data as a 1d numpy array data, you can compute the value of the empirical distribution function at x as the cumulative relative frequency of the values less than or equal to x:
data[data <= x].size / data.size
This is a step function so it does not have an associated probability density function but a probability mass function where the mass of each observed value is its relative frequency. To compute the relative frequencies:
values, freqs = np.unique(data, return_counts=True)
rfreqs = freqs / data.size
This does not mean that the data are a random sample from their empirical distribution. If you want to know what distribution your data are a sample from (if any) just by looking at the data, the answer is that you can't. But that is more about statistics than about programming.
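To make that concrete, here is a small self-contained sketch (with toy data standing in for your frequencies) that computes the empirical CDF and the empirical PMF described above:

import numpy as np
import matplotlib.pyplot as plt

data = np.array([1, 1, 2, 2, 2, 4, 6, 45])  # toy sample in place of the real frequencies

# empirical CDF at a point x: relative frequency of values <= x
def ecdf(data, x):
    return data[data <= x].size / data.size

print("F(2) =", ecdf(data, 2))  # 0.625 for this toy sample

# empirical PMF: each observed value weighted by its relative frequency
values, counts = np.unique(data, return_counts=True)
rfreqs = counts / data.size

plt.step(values, np.cumsum(rfreqs), where="post", label="empirical CDF")
plt.stem(values, rfreqs, label="empirical PMF")
plt.legend()
plt.show()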
The histogram does not do what you think it does; you are trying to show a bar graph. The histogram needs each data point separately in a list, not the frequency itself. You have [3,2,0,4,...] but should have [1,1,1,2,2,4,4,4,4]. You cannot determine a probability distribution automatically.
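To illustrate what that means in code, here is a small sketch that expands counts into individual observations with np.repeat before calling plt.hist (the counts are just the toy example from above):

import numpy as np
import matplotlib.pyplot as plt

counts = np.array([3, 2, 0, 4])             # counts per category, as in the answer
categories = np.arange(1, len(counts) + 1)  # category labels 1, 2, 3, 4

# expand the counts into one entry per observation: [1,1,1,2,2,4,4,4,4]
observations = np.repeat(categories, counts)
print(observations)

plt.hist(observations, bins=np.arange(0.5, len(counts) + 1.5))
plt.show()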
I think you are asking a slightly different question:
What is the correlation between my raw data and the curve to which I have mapped it?
This is a conceptual problem, and you're trying to understand the meanings of the values R and R squared. Start by working through this Minitab blog post. You may want to skim this non-Python KaleidaGraph guide to understand the classes of curves to fit and the use of least squares in fitting the curves.
You were probably downvoted because it is a math question more than a programming question.
I may be missing something, but it seems that a major point is being overlooked across the board: the data set you are describing is a categorical data set. That is, the x-values are not numeric, they're just words (#Car, #photo, etc.). The concept of the shape of a probability distribution is meaningless for a categorical data set, since there is no logical ordering for the categories. What would a histogram even look like? Would #Car be the first bin? Or would it be all the way to the right of your graph? Unless you have some criterion for quantifying your categories, trying to make judgments based on the shape of the distribution is meaningless.
Here's a small text-based example to clarify what I'm saying. Suppose I survey a group of people and ask their favorite color. I plot the results:
Red | ##
Green | #####
Blue | #######
Yellow | #####
Orange | ##
Huh, looks like color preferences are normally distributed. Wait, what if I had randomly put the colors in a different order in my graph:
Blue | #######
Yellow | #####
Green | #####
Orange | ##
Red | ##
I guess the data is actually positively skewed? Not so, of course: for a categorical data set the shape of the distribution is meaningless. Only if you were to decide to somehow quantify each hashtag in your data set would the problem have meaning. Do you want to compare the length of a hashtag to its frequency? Or the alphabetical ordering of a hashtag to its frequency? Etc.
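For example, if you chose hashtag length as the quantification, a quick (purely illustrative) check of how it relates to frequency could look like this, assuming the CSV from the question:

import pandas as pd

df = pd.read_csv("Paris_random_hash.csv", sep=",")

# one possible quantification: hashtag length (in characters) vs. frequency
df["length"] = df["Hashtag"].str.len()
print(df[["length", "Frequency"]].corr(method="spearman"))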
I am trying to find a numerical package which will fit a natural spline which minimizes weighted least squares.
There is a package in scipy which does what I want for unnatural splines.
import numpy as np
import matplotlib.pyplot as plt
from scipy import interpolate
x = np.arange(0,5,1.0/6)
xs = np.arange(0,5,1.0/500)
y = np.sin(x+1) + .2*np.random.rand(len(x)) -.1
knots = np.array([1,2,3,4])
tck = interpolate.splrep(x,y,s=0,k=3,t=knots,task=-1)
ys = interpolate.splev(xs,tck,der=0)
plt.figure()
plt.plot(xs,ys,x,y,'x')
The spline.py file inside of this tar file from this page does a natural spline fit by default. There is also some code on this page that claims to do mostly what you want. The pyD3D package also has a natural spline function in its pyDataUtils module. This last one looks the most promising to me. However, it doesn't appear to have the option of setting your own knots. Maybe if you look at the source you can find a way to rectify that.
Also, I found this message on the SciPy mailing list which says that using s=0.0 (as in your given code) makes splines fitted with your above procedure natural, according to the writer of the message. I did find this splmake function that has an option to do a natural spline fit, but upon looking at the source I found that it isn't implemented yet.
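As an alternative not mentioned above, if you are open to other libraries, one way to get a natural cubic spline fitted by weighted least squares is to build the spline basis with patsy's cr() term and fit it with statsmodels WLS. This is only a sketch under the assumption that both libraries are available; note that cr() places its knots at quantiles of the data by default rather than at the hand-picked knots in your splrep call.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from patsy import dmatrix, build_design_matrices

# same toy data as the question
x = np.arange(0, 5, 1.0 / 6)
y = np.sin(x + 1) + 0.2 * np.random.rand(len(x)) - 0.1
w = np.ones_like(x)  # per-observation weights

# natural cubic regression spline basis (patsy's cr term), fitted by WLS
basis = dmatrix("cr(x, df=6)", {"x": x}, return_type="dataframe")
fit = sm.WLS(y, basis, weights=w).fit()

# evaluate the fitted spline on a fine grid, reusing the training basis
xs = np.linspace(0, 5, 500)
basis_s = build_design_matrices([basis.design_info], {"x": xs})[0]
ys = fit.predict(basis_s)

plt.plot(xs, ys, x, y, 'x')
plt.show()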