Best model for variable selection with big data? - python
I posted a question earlier about some code, but now I realize I should ask about the general idea more broadly. I'm trying to build a statistical model with about 1000 observations and 2000 variables, and I would like to determine which variables most strongly affect my dependent variable, with high significance. I don't plan to use the model for prediction, just for variable selection. My independent variables are binary and my dependent variable is continuous. I've tried multiple linear regression and fixed models with tools such as statsmodels and scikit-learn, but I ran into issues such as having more variables than observations. I would prefer to solve the problem in Python since I have basic knowledge of it, but statistics is very new to me, so I don't know the best direction. Any help is appreciated.
Tree method
import pandas as pd
from sklearn import tree
from sklearn import preprocessing

data = pd.read_excel('data_file.xlsx')
y = data.iloc[:, -1]   # continuous dependent variable (last column)
X = data.iloc[:, :-1]  # binary independent variables

# LabelEncoder maps each distinct y value to its own class label,
# so the tree treats this as a classification problem
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
tree.export_graphviz(clf, out_file='tree.dot')
Or if I output to text file, the first few lines are:
digraph Tree {
node [shape=box] ;
0 [label="X[685] <= 0.5\ngini = 0.995\nsamples = 1097\nvalue = [2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1\n1, 1, 1, 8, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1\n1, 1, 1, 2, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 4, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1\n1, 3, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2\n1, 1, 1, 1, 1, 1, 30, 3, 1, 3, 1, 1, 2, 1\n1, 5, 1, 2, 1, 4, 2, 1, 1, 1, 1, 1, 1, 1\n1, 1, 2, 1, 1, 1, 3, 1, 1, 3, 1, 2, 1, 1\n1, 7, 3, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1\n6, 2, 1, 2, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 3, 7, 6, 1, 1, 1\n1, 1, 3, 4, 1, 1, 1, 1, 1, 4, 1, 2, 1, 1\n1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1\n1, 4, 1, 1, 4, 2, 1, 1, 1, 2, 1, 1, 2, 2\n11, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 12, 1\n1, 1, 3, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1\n6, 1, 1, 1, 1, 1, 4, 2, 1, 2, 1, 1, 1, 1\n1, 1, 1, 1, 3, 1, 1, 3, 1, 1, 1, 1, 1, 1\n1, 1, 1, 1, 1, 11, 1, 2, 1, 2, 1, 1, 1, 1\n4, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1\n1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2\n1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3\n1, 7, 1, 1, 2, 1, 2, 7, 1, 1, 1, 3, 1, 11\n1, 1, 2, 2, 2, 1, 1, 10, 1, 1, 5, 21, 1, 1\n11, 1, 2, 1, 1, 1, 1, 1, 5, 15, 3, 1, 1, 1\n1, 1, 1, 3, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1\n1, 1, 6, 1, 1, 1, 1, 1, 1, 14, 1, 1, 1, 1\n17, 1, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 1, 4\n1, 1, 1, 6, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1\n1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 14, 1\n3, 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 3, 1\n1, 2, 1, 12, 1, 1, 1, 1, 8, 2, 1, 1, 1, 2\n1, 1, 3, 1, 1, 6, 1, 1, 1, 3, 1, 1, 2, 1\n1, 1, 1, 1, 4, 1, 1, 2, 1, 3, 2, 4, 1, 3\n1, 1, 1, 1, 1, 7, 1, 1, 2, 1, 1, 2, 13, 2\n1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1\n9, 1, 2, 5, 7, 1, 1, 1, 2, 9, 2, 2, 13, 1\n1, 1, 1, 2, 1, 3, 1, 1, 6, 1, 3, 1, 1, 3\n1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 4, 1, 5, 1\n4, 1, 2, 3, 3]"] ;
1 [label="X[990] <= 0.5\ngini = 0.995\nsamples = 1040\nvalue = [2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1\n1, 1, 1, 8, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1\n1, 1, 1, 2, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 4, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1\n1, 3, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2\n1, 1, 1, 1, 1, 1, 30, 3, 1, 3, 1, 1, 2, 1\n1, 5, 1, 2, 1, 4, 2, 1, 1, 1, 1, 1, 1, 1\n1, 1, 2, 1, 1, 1, 3, 1, 1, 3, 1, 2, 1, 1\n1, 7, 3, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1\n6, 2, 1, 2, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 3, 7, 6, 1, 1, 1\n1, 1, 3, 4, 1, 1, 1, 1, 1, 4, 1, 2, 1, 1\n1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1\n1, 4, 1, 1, 4, 2, 1, 1, 1, 2, 1, 1, 2, 2\n11, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 12, 1\n1, 1, 3, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1\n6, 1, 0, 1, 1, 1, 4, 2, 1, 2, 1, 1, 1, 1\n1, 1, 1, 1, 3, 1, 1, 3, 1, 1, 1, 0, 1, 1\n1, 1, 1, 1, 1, 9, 1, 2, 1, 2, 1, 1, 1, 1\n4, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1\n1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2\n1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3\n1, 7, 1, 1, 2, 1, 2, 7, 1, 1, 1, 1, 1, 11\n1, 1, 2, 2, 2, 1, 1, 10, 1, 1, 5, 21, 1, 1\n1, 1, 2, 1, 1, 1, 1, 1, 5, 15, 3, 1, 1, 1\n1, 1, 1, 3, 1, 1, 2, 1, 3, 1, 1, 0, 1, 1\n1, 1, 6, 1, 1, 1, 1, 1, 1, 14, 1, 1, 1, 1\n16, 1, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 1, 4\n1, 1, 1, 6, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1\n1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 0, 1\n3, 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 3, 1\n1, 2, 1, 12, 1, 1, 1, 1, 8, 2, 0, 1, 1, 2\n1, 1, 3, 1, 1, 6, 1, 1, 1, 3, 1, 1, 2, 0\n1, 1, 1, 1, 4, 1, 1, 2, 1, 3, 2, 4, 1, 3\n1, 1, 1, 1, 1, 7, 1, 1, 2, 1, 0, 1, 3, 2\n1, 1, 1, 0, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1\n9, 1, 2, 5, 6, 1, 1, 1, 2, 9, 2, 2, 13, 1\n1, 1, 1, 2, 1, 3, 1, 1, 6, 1, 3, 1, 0, 3\n1, 0, 1, 1, 2, 0, 1, 2, 1, 1, 0, 1, 5, 1\n4, 1, 0, 3, 3]"] ;
I would recommend taking a closer look at the variance of your variables, keeping those with the largest spread (pandas.DataFrame.var()), and eliminating the variables that correlate most strongly with others (pandas.DataFrame.corr()). As further steps, I'd suggest trying any of the methods mentioned below.
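A minimal sketch of that variance/correlation filter (the column names, data, and both thresholds here are made-up choices for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real binary feature matrix
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(100, 5)),
                  columns=[f'x{i}' for i in range(5)])

# 1. Drop near-constant columns (variance below an arbitrary threshold)
variances = df.var()
df = df[variances[variances > 0.05].index]

# 2. Drop one column of each highly correlated pair (|r| > 0.9, arbitrary cutoff)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)
```

The surviving columns of `df` are the candidates to feed into any of the selection methods below.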
1. Variant A: Feature selection with scikit-learn
For feature selection, scikit-learn offers a lot of different approaches:
https://scikit-learn.org/stable/modules/feature_selection.html
It sums up the comments from above.
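As one hedged sketch of those approaches, univariate selection with SelectKBest (f_regression suits the binary-features/continuous-target setup; the data and k=10 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in: binary features, continuous target driven by feature 0
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 200)).astype(float)
y = 2.0 * X[:, 0] + rng.normal(size=1000)

# Score each feature individually and keep the 10 strongest
selector = SelectKBest(score_func=f_regression, k=10)
selector.fit(X, y)
top = np.argsort(selector.scores_)[::-1][:10]  # indices of strongest features
```

`selector.get_support()` gives a boolean mask over the original columns, which keeps the selected variables interpretable.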
2. Variant B: Feature selection with linear regression
You can also read off feature importance by running a linear regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html. The attribute reg.coef_ gives you the coefficients of your features; the larger the absolute value, the more important the feature. For example, 0.8 marks a really important feature, while 0.00001 marks an unimportant one. (Note that comparing raw coefficients like this only makes sense when the features are on comparable scales, and with more variables than observations an ordinary least-squares fit is ill-posed, so a regularized variant is the usual choice.)
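A sketch of that idea using Lasso, a regularized linear regression that copes with more features than observations and drives unimportant coefficients to exactly zero (the data and alpha=0.1 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in: 200 binary features, only two actually matter
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 200)).astype(float)
y = 3.0 * X[:, 5] - 2.0 * X[:, 42] + rng.normal(scale=0.5, size=100)

reg = Lasso(alpha=0.1).fit(X, y)
important = np.nonzero(reg.coef_)[0]            # features with non-zero coefficients
ranked = np.argsort(np.abs(reg.coef_))[::-1]    # most influential first
```

Here the coefficients of features 5 and 42 come out large in absolute value while most others are shrunk to zero, which is exactly the selection behaviour the question is after.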
3. Variant C: PCA (not for the binary case)
Why do you want to drop variables entirely? I would recommend PCA (principal component analysis): https://en.wikipedia.org/wiki/Principal_component_analysis.
The basic concept is to transform your 2000 features into a smaller space (maybe 1000 or whatever) while keeping the result mathematically useful.
scikit-learn has a good package for it: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
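A minimal sketch (the data and component count are arbitrary; note that PCA builds combinations of the original variables, so it reduces dimensionality rather than selecting individual features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))  # stand-in for the real feature matrix

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)           # shape (1000, 10)
explained = pca.explained_variance_ratio_  # fraction of variance per component
```

Inspecting `explained` shows how much of the original variance each retained component carries, which helps pick the target dimensionality.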
Related
How to get a colormap for a multiple histogram plot in python?
I have 10 network realisations of the number of occurrences of certain substructures in a web decomposition algorithm. I am considering the 10 most important webs, so I have ten entries in each list, where each list is a realisation of the network. Basically I have a list of lists:

full_l2 = [[1, 1, 1, 1, 1, 1, 1, 1, 3, 1], [1, 1, 1, 1, 1, 2, 2, 2, 1, 1], [1, 1, 1, 1, 1, 2, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 3, 1, 1, 2, 2], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 3, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 2, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 2, 1]]

The numbers in the lists give the number of substructures, and each list has the webs in decreasing order of importance. So I used:

occ = []
for i in range(10):
    a = list(zip(*full_l2))[i]
    occ.append(a)

to get the 1st, 2nd, and so on up to the 10th most important webs. Now the occurrences look like:

occ = [(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 3, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (1, 2, 2, 1, 3, 1, 1, 1, 1, 1), (1, 2, 1, 1, 1, 1, 1, 1, 2, 1), (1, 2, 1, 1, 1, 1, 1, 1, 1, 1), (3, 1, 1, 1, 2, 1, 1, 1, 1, 2), (1, 1, 1, 1, 2, 1, 1, 1, 1, 1)]

So I plot the histogram of the number of occurrences. I am showing just 10 realisations so that the lists are easier to understand, but I want to do it for 1000. I just used:

plt.hist(occ)
plt.yscale('log')

and I get a plot like this. But I need it to use a colormap. I tried:

cm = plt.cm.get_cmap('jet')

like this answer here: Plot histogram with colors taken from colormap, but it has a problem:

ValueError: color kwarg must have one color per dataset

I need it to look like the linked example. Does anyone know if I am missing something?
Keras GridSearch model prediction
I'm battling with a weird issue that I can't seem to figure out. I used KerasClassifier and GridSearch to build and search for the best parameters for my model. This part worked fine. After this, I tried predicting on my test data, which is where the weird thing happened. Assuming my grid-search object is grid and my test data is X_test, I noticed that the result of grid.best_estimator_.predict(X_test) is completely different from the result of grid.best_estimator_.model.predict(X_test).

For more context, here's a sample of the result from grid.best_estimator_.predict(X_test):

1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 3, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 3, 1, 1, 1, 1, 3, 3, 1, 3, 1, 3, 1, 1, 1, 1, 0, 1, 1, 1, 3, 1, 3, 3, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 1, 3, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 1, 1, 3, 1, 1, 3, 1, 3, 1, 1, 0, 1, 1, 3, 1, 1, 3, 3, 1, 1, 1, 3, 1, 1, 3, 1, 3, 1, 3, 1, 1, 3, 1, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3, 3, 3, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 3])

and here's the result from grid.best_estimator_.model.predict(X_test):

[[4.16748554e-01 3.44473362e-01 1.22328281e-01 1.16449825e-01]
 [3.47690374e-01 4.35497969e-01 9.62351710e-02 1.20576508e-01]
 [4.16748554e-01 3.44473362e-01 1.22328281e-01 1.16449825e-01]
 [4.16748554e-01 3.44473362e-01 1.22328281e-01 1.16449825e-01]
 [4.16748554e-01 3.44473362e-01 1.22328281e-01 1.16449825e-01]
 [4.48489130e-01 3.48928362e-01 1.13302141e-01 8.92804191e-02]
 [4.16748554e-01 3.44473362e-01 1.22328281e-01 1.16449825e-01]
 [2.65852152e-03 2.72439304e-03 5.55709645e-04 9.94061410e-01]
 [4.16748554e-01 3.44473362e-01 1.22328281e-01 1.16449825e-01]
 [4.16748554e-01 3.44473362e-01 1.22328281e-01 1.16449825e-01]
 [4.16748554e-01 3.44473362e-01 1.22328281e-01 1.16449825e-01]
 [1.14751011e-01 2.33341262e-01 3.13971192e-02 6.20510638e-01]
 [8.30730610e-03 1.07289189e-02 1.87594432e-03 9.79087830e-01]

In an attempt to debug this, I've tried calling np.argmax() on the output of grid.best_estimator_.model.predict(X_test), then checking (result_of_best_estimator == result_of_model).all(), which returns False. So, am I missing something? Or do I misunderstand how this is supposed to work?
convert values and calculate stddev
I have a dataframe like:

+---+---+
|  A|  B|
+---+---+
|  1|  2|
|200|  0|
|300|  4|
+---+---+

I want to convert that to a list of 1s for each A and 0s for each B, create a list from them, calculate their standard deviation, and add that as column C to the dataframe. So for example, for the first row we would calculate the standard deviation of [1, 0, 0]. Is that possible in PySpark?
A simple udf function should get your requirement fulfilled:

import pyspark.sql.functions as F
import pyspark.sql.types as T

def stdev(x, y):
    return [1]*x + [0]*y

stdevUdf = F.udf(stdev, T.ArrayType(T.IntegerType()))

df.withColumn('stdev', stdevUdf(df.A, df.B)).show(truncate=False)

which should give you a stdev column holding [1, 0, 0] for the first row (A=1, B=2), two hundred 1s for the second (A=200, B=0), and three hundred 1s followed by four 0s for the third (A=300, B=4).
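Note that a udf returning [1]*x + [0]*y only builds the 1/0 list; the standard deviation the question asks for still has to be computed. A sketch of that remaining step in plain Python/NumPy (np.std is the population standard deviation, matching the [1, 0, 0] example; returning this value from a DoubleType udf instead would complete the answer):

```python
import numpy as np

def stdev_of_counts(a, b):
    """Population standard deviation of a list of `a` ones and `b` zeros."""
    return float(np.std([1] * a + [0] * b))

# The three rows from the question's dataframe
rows = [(1, 2), (200, 0), (300, 4)]
stdevs = [stdev_of_counts(a, b) for a, b in rows]
```

For the first row this gives the standard deviation of [1, 0, 0]; for the second, a list of all ones has standard deviation 0.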
How to get matplotlib bar chart to match numeric count in python terminal
My main objective is to be consistent with both my numeric output and my visual output. However, I can't seem to get them to match. Here is my setup using Python 3.x:

df = pd.DataFrame([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], columns=['Expo'])

Followed by my setup for the bar chart in matplotlib:

x = df['Expo']
N = len(x)
y = range(N)
width = 0.125
plt.bar(x, y, width, color="blue")
fig = plt.gcf()

A nice pretty graph is produced. However, using this snippet to check what the actual numeric counts of both classes are...

print("Class 1: "+str(df['Expo'].value_counts()[1]),"Class 2: "+str(df['Expo'].value_counts()[2]))

I get the below:

Class 1: 85 Class 2: 70

Since I have 155 records in the data frame, numerically this makes sense. Having a single bar in the bar chart be at 155 does not. I appreciate any help in advance.
I guess something like this is what you're after:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], columns=['Expo'])

# Count number of '1' and '2' elements in df
N1, N2 = len(df[df['Expo'] == 1]), len(df[df['Expo'] == 2])

width = 0.125
# Plot the counts at x positions [1, 2]
plt.bar([1, 2], [N1, N2], width, color="blue")
fig = plt.gcf()
plt.show()

Which produces one bar per class, at heights 85 and 70.
You may use a histogram:

plt.hist(df["Expo"])

or, specifying the bins:

plt.hist(df["Expo"], bins=[0.5, 1.5, 2.5], ec="k")
plt.xticks([1, 2])
Python: TypeError: Only 2-D and 3-D images supported with scikit-image regionprops
Given a numpy.ndarray of the kind

myarray = array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                 1])

I want to use scikit-image on the array (which is already labelled) to derive some properties. This is what I do:

myarray.reshape((11,11))
labelled = label(myarray)
props = sk.measure.regionprops(labelled)

But then I get this error, pointing at props:

TypeError: Only 2-D and 3-D images supported.

What is the problem? The image I am passing to regionprops is already a 2-D object. Shape of myarray:

In [17]: myarray
Out[17]:
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
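One detail worth noting in the snippet above, independent of the labelling issue: numpy's reshape returns a new array rather than modifying the array in place, so calling myarray.reshape((11,11)) as a bare statement leaves myarray 1-D; the result has to be assigned back. A minimal demonstration:

```python
import numpy as np

myarray = np.ones(121, dtype=int)

myarray.reshape((11, 11))       # returns a reshaped array; myarray itself is unchanged
print(myarray.shape)            # still (121,)

myarray = myarray.reshape((11, 11))  # assign the result back
print(myarray.shape)            # now (11, 11)
```

That would explain why regionprops saw a 1-D input even though the interactive-session dump shows a 2-D array.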
I tried this code and I got no errors:

import numpy as np
from skimage.measure import label, regionprops

myarray = np.random.randint(1, 4, (11, 11), dtype=np.int64)
labelled = label(myarray)
props = regionprops(labelled)

Sample output:

In [714]: myarray
Out[714]:
array([[1, 2, 1, 1, 3, 3, 1, 1, 3, 3, 3],
       [1, 1, 3, 1, 3, 2, 2, 2, 3, 3, 2],
       [3, 3, 3, 1, 3, 3, 1, 1, 2, 3, 1],
       [1, 3, 1, 1, 1, 2, 1, 3, 1, 3, 3],
       [3, 2, 3, 3, 1, 1, 2, 1, 3, 2, 3],
       [3, 2, 1, 3, 1, 1, 3, 1, 1, 2, 2],
       [1, 3, 1, 1, 1, 1, 3, 3, 1, 2, 2],
       [3, 3, 1, 1, 3, 2, 1, 2, 2, 2, 1],
       [1, 1, 1, 3, 3, 2, 2, 3, 3, 3, 1],
       [1, 2, 2, 2, 2, 2, 1, 3, 3, 2, 2],
       [3, 2, 2, 3, 1, 3, 3, 1, 3, 3, 2]], dtype=int64)

In [715]: labelled
Out[715]:
array([[ 0,  1,  0,  0,  2,  2,  3,  3,  4,  4,  4],
       [ 0,  0,  5,  0,  2,  6,  6,  6,  4,  4,  7],
       [ 5,  5,  5,  0,  2,  2,  0,  0,  6,  4,  8],
       [ 9,  5,  0,  0,  0, 10,  0,  4,  0,  4,  4],
       [ 5, 11,  5,  5,  0,  0, 10,  0,  4, 12,  4],
       [ 5, 11,  0,  5,  0,  0, 13,  0,  0, 12, 12],
       [14,  5,  0,  0,  0,  0, 13, 13,  0, 12, 12],
       [ 5,  5,  0,  0, 15, 12,  0, 12, 12, 12, 16],
       [ 0,  0,  0, 15, 15, 12, 12, 17, 17, 17, 16],
       [ 0, 12, 12, 12, 12, 12, 18, 17, 17, 19, 19],
       [20, 12, 12, 21, 22, 17, 17, 18, 17, 17, 19]], dtype=int64)

In [716]: props[0].area
Out[716]: 1.0

In [717]: props[1].centroid
Out[717]: (1.0, 4.4000000000000004)

I noticed that when all the elements of myarray have the same value (as in your example), labelled is an array of zeros. I also read this in the regionprops documentation:

Parameters: label_image : (N, M) ndarray
    Labeled input image. Labels with value 0 are ignored.

Perhaps you should use a myarray with more than one distinct value in order to get meaningful properties.
I was having this same issue; after checking Tonechas' answer I realized I was importing label from scipy instead of skimage:

from scipy.ndimage.measurements import label

I just replaced it with

from skimage.measure import label, regionprops

and everything worked :)