PyMC3 failing to broadcast correct dimensions for inference - python
I am trying to extend the ideas of item response theory to multiple responses. Consider a marketing survey that asks customers, "What's the deciding factor in whether or not you purchase product X?", where the answers are {0: price, 1: durability, 2: ease-of-use}.
Here is some synthetic data (rows are customers, columns are products, and each cell is the class of the response):
import numpy as np
import pymc3 as pm
import theano.tensor as tt
import arviz as az

responses = np.array([
    [0, 1, 2, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 2, 2, 1],
    [1, 1, 2, 2, 1],
    [1, 1, 0, 0, 0]
])
students = 5
questions = 5
categories = 3
with pm.Model() as model:
    z_student = pm.Normal("z_student", mu=0, sigma=1, shape=(students, categories))
    z_question = pm.Normal("z_question", mu=0, sigma=1, shape=(categories, questions))
    # Transformed parameter
    theta = pm.Deterministic("theta", tt.nnet.softmax(z_student - z_question))
    # Likelihood
    kij = pm.Categorical("kij", p=theta, observed=responses)
    trace = pm.sample(chains=4)

az.plot_trace(trace, var_names=["z_student", "z_question"], compact=False);
This code produces the following error: ValueError: Input dimension mis-match. (input[0].shape[0] = 5, input[1].shape[0] = 3).
However, when I change the theta line to theta = pm.Deterministic("theta", tt.nnet.softmax(z_student - z_question.transpose())), the sampler no longer fails immediately; instead, it samples the wrong model.
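In NumPy terms, a (5, 3) array minus a (3, 5) array cannot broadcast, which is the first error; after the transpose both operands are (5, 3), so the subtraction works element-wise, but row i then only ever pairs student i with question i. A quick check with throwaway arrays of my own:

import numpy as np
a = np.zeros((5, 3))     # like z_student: (students, categories)
b = np.zeros((3, 5))     # like z_question: (categories, questions)
# a - b                  # ValueError: operands could not be broadcast together
print((a - b.T).shape)   # (5, 3): one row per student, paired only with the matching question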
az.summary(trace)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_mean ess_sd ess_bulk ess_tail r_hat
z_student[0,0] 0.150 0.893 -1.620 1.752 0.012 0.013 5789.0 2327.0 5771.0 2991.0 1.0
z_student[0,1] 0.393 0.879 -1.319 1.980 0.012 0.012 5150.0 2610.0 5153.0 3195.0 1.0
z_student[0,2] -0.591 0.915 -2.254 1.108 0.011 0.012 6408.0 2737.0 6415.0 2830.0 1.0
z_student[1,0] -0.064 0.860 -1.676 1.538 0.011 0.014 5748.0 1942.0 5747.0 2850.0 1.0
z_student[1,1] 0.602 0.864 -0.982 2.185 0.012 0.011 4921.0 3028.0 4920.0 3269.0 1.0
z_student[1,2] -0.548 0.906 -2.218 1.137 0.012 0.012 6076.0 2870.0 6083.0 3410.0 1.0
z_student[2,0] -0.166 0.907 -1.974 1.450 0.013 0.014 4681.0 2121.0 4692.0 3108.0 1.0
z_student[2,1] -0.188 0.875 -1.776 1.472 0.011 0.014 5923.0 2073.0 5945.0 3333.0 1.0
z_student[2,2] 0.344 0.865 -1.288 1.951 0.012 0.012 4828.0 2750.0 4822.0 3039.0 1.0
z_student[3,0] -0.212 0.892 -1.980 1.395 0.011 0.013 6019.0 2504.0 5996.0 3391.0 1.0
z_student[3,1] 0.097 0.876 -1.573 1.713 0.012 0.013 5304.0 2252.0 5332.0 2971.0 1.0
z_student[3,2] 0.096 0.851 -1.583 1.645 0.011 0.012 5554.0 2678.0 5543.0 3288.0 1.0
z_student[4,0] 0.160 0.881 -1.367 1.947 0.012 0.013 5421.0 2189.0 5413.0 2927.0 1.0
z_student[4,1] 0.414 0.863 -1.255 2.026 0.012 0.012 4900.0 2548.0 4897.0 3248.0 1.0
z_student[4,2] -0.558 0.901 -2.266 1.130 0.011 0.012 6551.0 2728.0 6582.0 3142.0 1.0
z_question[0,0] -0.179 0.883 -1.795 1.488 0.011 0.015 6317.0 1769.0 6315.0 3389.0 1.0
z_question[0,1] 0.107 0.886 -1.511 1.807 0.012 0.013 5236.0 2431.0 5209.0 3503.0 1.0
z_question[0,2] 0.164 0.878 -1.450 1.834 0.012 0.013 5131.0 2248.0 5106.0 3102.0 1.0
z_question[0,3] 0.186 0.904 -1.450 1.882 0.011 0.014 6228.0 2175.0 6219.0 3335.0 1.0
z_question[0,4] -0.187 0.877 -1.790 1.508 0.011 0.014 5819.0 2089.0 5834.0 3198.0 1.0
z_question[1,0] -0.389 0.849 -1.948 1.219 0.012 0.012 4726.0 2494.0 4713.0 3146.0 1.0
z_question[1,1] -0.600 0.858 -2.249 0.946 0.012 0.011 5093.0 3247.0 5116.0 3312.0 1.0
z_question[1,2] 0.179 0.868 -1.520 1.763 0.012 0.012 5204.0 2514.0 5201.0 3418.0 1.0
z_question[1,3] -0.103 0.862 -1.683 1.561 0.013 0.013 4608.0 2212.0 4615.0 3163.0 1.0
z_question[1,4] -0.381 0.866 -2.047 1.147 0.011 0.012 6181.0 2735.0 6188.0 3038.0 1.0
z_question[2,0] 0.565 0.908 -1.125 2.337 0.012 0.012 6022.0 2879.0 6045.0 3173.0 1.0
z_question[2,1] 0.536 0.923 -1.192 2.241 0.012 0.013 6041.0 2476.0 6046.0 3059.0 1.0
z_question[2,2] -0.325 0.856 -1.918 1.289 0.012 0.012 5429.0 2741.0 5418.0 3004.0 1.0
z_question[2,3] -0.107 0.881 -1.953 1.363 0.012 0.012 5834.0 2545.0 5841.0 3332.0 1.0
z_question[2,4] 0.576 0.910 -1.202 2.253 0.011 0.013 6385.0 2606.0 6371.0 2905.0 1.0
theta[0,0] 0.360 0.173 0.072 0.685 0.003 0.002 4309.0 3774.0 4256.0 2846.0 1.0
theta[0,1] 0.528 0.182 0.208 0.857 0.003 0.002 4949.0 4563.0 4908.0 3050.0 1.0
theta[0,2] 0.113 0.104 0.001 0.304 0.001 0.001 6095.0 4045.0 7146.0 2780.0 1.0
theta[1,0] 0.216 0.144 0.007 0.477 0.002 0.002 6149.0 4576.0 6493.0 3116.0 1.0
theta[1,1] 0.678 0.168 0.381 0.962 0.002 0.002 5954.0 5954.0 6180.0 3320.0 1.0
theta[1,2] 0.107 0.100 0.000 0.294 0.001 0.001 6321.0 3863.0 7623.0 3252.0 1.0
theta[2,0] 0.234 0.150 0.010 0.509 0.002 0.002 6154.0 4352.0 6684.0 3252.0 1.0
theta[2,1] 0.230 0.152 0.005 0.506 0.002 0.001 6885.0 5424.0 6459.0 2923.0 1.0
theta[2,2] 0.536 0.186 0.194 0.858 0.002 0.002 5595.0 5250.0 5622.0 2805.0 1.0
theta[3,0] 0.239 0.157 0.007 0.526 0.002 0.002 5843.0 4627.0 5789.0 2853.0 1.0
theta[3,1] 0.381 0.178 0.065 0.703 0.003 0.002 4927.0 4377.0 5009.0 3315.0 1.0
theta[3,2] 0.380 0.174 0.069 0.692 0.003 0.002 4653.0 4176.0 4624.0 2562.0 1.0
theta[4,0] 0.361 0.175 0.057 0.668 0.002 0.002 5185.0 4637.0 5269.0 2985.0 1.0
theta[4,1] 0.527 0.184 0.186 0.852 0.003 0.002 4614.0 4445.0 4668.0 2497.0 1.0
theta[4,2] 0.111 0.100 0.002 0.303 0.001 0.001 6159.0 3978.0 7520.0 3473.0 1.0
Of note, look at the theta values that were learned. Their names run theta[0,0] ... theta[0,2], ..., theta[4,2]. So, in the first entry, what PyMC3 has learned is the strength of the relation between (z_student[0] - z_question[0]) and class/response 0.
This is not the effect I want to accomplish. I want to learn a 3D tensor that accounts for every possible {student, question, category} combination; there should be 75 thetas (5 students x 5 questions x 3 categories), not 15, where theta[0,0,0] refers to the learned value for {student_0, question_0, response_0}. However, my code is currently not accomplishing this.
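To make the target shape concrete, here is a plain NumPy sketch of the broadcasting I am after (illustrative only; the indexing and names are mine):

import numpy as np

rng = np.random.default_rng(0)
z_student = rng.normal(size=(5, 3))   # (students, categories)
z_question = rng.normal(size=(3, 5))  # (categories, questions)

# (students, 1, categories) - (1, questions, categories) -> (students, questions, categories)
diff = z_student[:, None, :] - z_question.T[None, :, :]

# softmax over the category axis, so every (student, question) pair gets its own probability simplex
theta = np.exp(diff) / np.exp(diff).sum(axis=-1, keepdims=True)
print(theta.shape)  # (5, 5, 3) -> 75 values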
Any ideas?
Edit: More recently, I've built a function in Theano to demonstrate my goal:
responses = np.array([
[0,1,2,2,2],
[0,1,2,1,1],
[0,1,2,0,0],
[0,1,2,0,1],
[0,1,2,1,0]
])
students = 5
questions = 5
categories = 3
import theano
from theano import tensor

a = tensor.matrix()
b = tensor.matrix()
elem_sub = a[0, 0] - b[0, 0], a[0, 1] - b[1, 0], a[0, 2] - b[2, 0]
function = theano.function([a, b], elem_sub)
with pm.Model() as model:
    z_student = pm.Normal("student_dim1", mu=0, sigma=1, shape=(students, categories))
    z_question = pm.Normal("question_dim1", mu=0, sigma=1, shape=(categories, questions))
    # Transformed parameter
    theta = pm.Deterministic("theta", tt.nnet.softmax(function(z_student, z_question)))
    # Likelihood
    kij = pm.Categorical("kij", p=theta, observed=responses)
However, the following error is triggered:
TypeError: Bad input argument with name "z_student" to theano function with name "<ipython-input-2-2a16f255dca1>:23" at index 0 (0-based).
Backtrace when that variable is created:
.
.
.
Expected an array-like object, but found a Variable: maybe you are trying to call a function on a (possibly shared) variable instead of a numeric array?
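My understanding (which may well be wrong) is that a compiled theano.function expects numeric arrays, whereas z_student and z_question are symbolic random variables, so the operation has to be built symbolically inside the model rather than through a compiled function. A sketch of what I think that would look like, using dimshuffle for the broadcasting and a reshape because tt.nnet.softmax expects a 2D input (untested):

with pm.Model() as model:
    z_student = pm.Normal("z_student", mu=0, sigma=1, shape=(students, categories))
    z_question = pm.Normal("z_question", mu=0, sigma=1, shape=(categories, questions))
    # (students, 1, categories) - (1, questions, categories) -> (students, questions, categories)
    diff = z_student.dimshuffle(0, 'x', 1) - z_question.dimshuffle('x', 1, 0)
    theta_flat = tt.nnet.softmax(diff.reshape((students * questions, categories)))
    theta = pm.Deterministic("theta", theta_flat.reshape((students, questions, categories)))
    # I believe pm.Categorical accepts a batched p whose last axis holds the category probabilities
    kij = pm.Categorical("kij", p=theta, observed=responses)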
Related
predict value with interactions in statsmodel
I have code like this:

m2 = smf.ols(formula='demand ~ year + C(months) + year*C(months)', data=df).fit()
m2.summary()

The dataframe has three columns and 144 rows: demand, year (2000-2011) and months (1-12). Now I want to get the predicted demand using the interaction between year and month as a predictor (month is treated as a categorical variable here). What should I do?

m2.predict( # what should I enter here?)

Here is the fitted linear regression model, if it is helpful:

OLS Regression Results
Dep. Variable: demand    R-squared: 0.985
Model: OLS               Adj. R-squared: 0.982
Method: Least Squares    F-statistic: 343.4
Date: Thu, 08 Oct 2020   Prob (F-statistic): 2.78e-98
Time: 00:38:14           Log-Likelihood: -590.64
No. Observations: 144    AIC: 1229.
Df Residuals: 120        BIC: 1301.
Df Model: 23
Covariance Type: nonrobust

                          coef    std err         t    P>|t|     [0.025     0.975]
Intercept           -5.548e+04   2686.757   -20.651    0.000  -6.08e+04  -5.02e+04
C(months)[T.2]       6521.6434   3799.648     1.716    0.089  -1001.396    1.4e+04
C(months)[T.3]        217.7471   3799.648     0.057    0.954  -7305.292   7740.786
C(months)[T.4]      -3200.2960   3799.648    -0.842    0.401  -1.07e+04   4322.743
C(months)[T.5]      -7465.9988   3799.648    -1.965    0.052   -1.5e+04     57.040
C(months)[T.6]      -1.832e+04   3799.648    -4.822    0.000  -2.58e+04  -1.08e+04
C(months)[T.7]      -3.072e+04   3799.648    -8.086    0.000  -3.82e+04  -2.32e+04
C(months)[T.8]      -3.013e+04   3799.648    -7.929    0.000  -3.77e+04  -2.26e+04
C(months)[T.9]      -1.265e+04   3799.648    -3.328    0.001  -2.02e+04  -5122.469
C(months)[T.10]     -5374.5897   3799.648    -1.414    0.160  -1.29e+04   2148.449
C(months)[T.11]      3139.5781   3799.648     0.826    0.410  -4383.461   1.07e+04
C(months)[T.12]     -1122.9114   3799.648    -0.296    0.768  -8645.950   6400.127
year                   27.7867      1.340    20.741    0.000     25.134     30.439
year:C(months)[T.2]    -3.2552      1.895    -1.718    0.088     -7.006      0.496
year:C(months)[T.3]    -0.0944      1.895    -0.050    0.960     -3.846      3.657
year:C(months)[T.4]     1.6084      1.895     0.849    0.398     -2.143      5.360
year:C(months)[T.5]     3.7378      1.895     1.973    0.051     -0.013      7.489
year:C(months)[T.6]     9.1713      1.895     4.841    0.000      5.420     12.923
year:C(months)[T.7]    15.3741      1.895     8.115    0.000     11.623     19.125
year:C(months)[T.8]    15.0769      1.895     7.958    0.000     11.326     18.828
year:C(months)[T.9]     6.3357      1.895     3.344    0.001      2.584     10.087
year:C(months)[T.10]    2.6923      1.895     1.421    0.158     -1.059      6.444
year:C(months)[T.11]   -1.5699      1.895    -0.829    0.409     -5.321      2.181
year:C(months)[T.12]    0.5699      1.895     0.301    0.764     -3.181      4.321
m2.predict(df.loc[:,['year', 'months']])
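Because the model was fit through the formula interface, predict re-applies the C(months) encoding and the interaction to whatever DataFrame you pass, so you can also predict for specific year/month combinations (the values below are just hypothetical examples):

import pandas as pd

new = pd.DataFrame({'year': [2012, 2012, 2012], 'months': [1, 6, 12]})
m2.predict(new)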
Different Scikit Learn R^2 results on different computers
In the code below the 'correct' R2 value for sigma = 1 is 0.33, which I receive when run on my work computer. However on my personal computer I receive R2 = 0.119. This has been confirmed across multiple other computers running my exact code. Only my personal computer produces this strange 0.119 result (even running the 'solution' code produces the 0.119 result). I have tried multiple clean installs of Anaconda to no avail. Only thing I can think is that maybe my 'clean' installs aren't 'clean' enough. I have tried a few methods of fully deleting Anaconda and Python, maybe someone has a robust method for this? x_peak [2688.126327 2692.813829 2697.501331 2702.188833 2706.876334 2711.563836 2716.251338 2720.93884 2725.626341 2730.313843 2735.001345 2739.688846 2744.376348 2749.06385 2753.751352 2758.438853 2763.126355 2767.813857 2772.501359 2777.18886 2781.876362 2786.563864 2791.251366 2795.938867 2800.626369 2805.313871 2810.001373 2814.688874 2819.376376 2824.063878 2828.75138 2833.438881 2838.126383 2842.813885 2847.501387 2852.188888 2856.87639 2861.563892 2866.251394 2870.938895 2875.626397 2880.313899 2885.0014 2889.688902 2894.376404 2899.063906 2903.751407 2908.438909 2913.126411 2917.813913 2922.501414 2927.188916 2931.876418 2936.56392 2941.251421 2945.938923 2950.626425 2955.313927 2960.001428 2964.68893 2969.376432 2974.063934 2978.751435 2983.438937 2988.126439 2992.813941 2997.501442 3002.188944 3006.876446 3011.563948 3016.251449 3020.938951 3025.626453 3030.313954 3035.001456 3039.688958 3044.37646 3049.063961 3053.751463 3058.438965 3063.126467 3067.813968 3072.50147 3077.188972 3081.876474 3086.563975 3091.251477 3095.938979 3100.626481 3105.313982 3110.001484 3114.688986 3119.376488 3124.063989 3128.751491 3133.438993 3138.126495 3142.813996 3147.501498 3152.189 ] y_peak [0.01 0.011 0.011 0.012 0.013 0.015 0.017 0.018 0.02 0.021 0.024 0.027 0.029 0.03 0.031 0.033 0.034 0.036 0.037 0.039 0.04 0.043 0.047 0.049 0.052 0.055 0.058 0.062 0.066 0.071 0.077 0.085 0.097 0.111 0.141 0.169 0.183 0.235 0.265 0.324 0.35 0.396 0.421 0.45 0.467 0.486 0.514 0.51 0.464 0.444 0.437 0.432 0.432 0.437 0.442 0.45 0.475 0.501 0.541 0.553 0.594 0.611 0.611 0.607 0.612 0.607 0.521 0.471 0.424 0.331 0.264 0.216 0.161 0.114 0.094 0.054 0.034 0.021 0.014 0.008 0.007 0.005 0.004 0.003 0.003 0.002 0.002 0.002 0.001 0.001 0.001 0.001 0. 0. 0. 0. 0. 0. 0. 0. 
]

import numpy as np
import pandas as pd
import pylab as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv('data/ethanol_IR.csv')
x_all = df['wavenumber [cm^-1]'].values
y_all = df['absorbance'].values
x_peak = x_all[475:575]
y_peak = y_all[475:575]
x_train = x_peak[::3]
y_train = y_peak[::3]
sigmas = [1, 10, 50, 100, 150]

def rbf(x_train, x_test=None, gamma=1):
    if x_test is None:
        x_test = x_train
    N = len(x_test)   #<- number of data points
    M = len(x_train)  #<- number of features
    X = np.zeros((N,M))
    for i in range(N):
        for j in range(M):
            X[i,j] = np.exp(-gamma*(x_test[i] - x_train[j])**2)
    return X

model_rbf = LinearRegression()  #create a linear regression model instance
n = len(sigmas)

def gam(sigma):
    gam = 1./(2*sigma**2)
    return gam

for i in range(n):
    total = []
    gamma = gam(sigmas[i])
    print('Sigma = {}'.format(sigmas[i]))
    X_train = rbf(x_train, gamma=gamma)
    model_rbf.fit(X_train, y_train)  #fit the model
    r2 = model_rbf.score(X_train, y_train)  #get the "score", which is equivalent to r^2
    print('r^2 training = {}'.format(r2))
    X_all = rbf(x_train, x_test=x_peak, gamma=gamma)
    yhat = model_rbf.predict(X_all)
    r2 = model_rbf.score(X_all, y_peak)  #get the "score", which is equivalent to r^2
    print('r^2 testing = {}'.format(r2))
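One cheap check before another reinstall: confirm that both machines are actually running the same numpy, scipy and scikit-learn versions (and the same BLAS/LAPACK build), since mismatched library versions are a much more common cause of diverging results than a "dirty" install:

import sys
import numpy, scipy, sklearn

print(sys.version)
print('numpy        ', numpy.__version__)
print('scipy        ', scipy.__version__)
print('scikit-learn ', sklearn.__version__)
numpy.show_config()  # shows which BLAS/LAPACK the numpy build is linked against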
Difference in Linear Regression using Statsmodels between Patsy version and Dummy lists version
I am having differences in the coefficient values and coefficient errors using smf.ols and sm.OLS functions of statsmodels. Even though matematically, they should be the same regression formula and give the same results. I have done a 100% reproducible example of my question, the dataframe df can be downloaded from here: https://drive.google.com/drive/folders/1i67wztkrAeEZH2tv2hyOlgxG7N80V3pI?usp=sharing Case 1: Linear Model using Patsy from Statsmodels # First we load the libraries: import statsmodels.api as sm import statsmodels.formula.api as smf import random import pandas as pd # We define a specific seed to have the same results: random.seed(1234) # Now we read the data that can be downloaded from Google Drive link provided above: df = pd.read_csv("/Users/user/Documents/example/cars.csv", sep = "|") # We create the linear regression: lm1 = smf.ols('price ~ make + fuel_system + engine_type + num_of_doors + bore + compression_ratio + height + peak_rpm + 1', data = df) # We see the results: lm1.fit().summary() The result of lm1 is: OLS Regression Results ============================================================================== Dep. Variable: price R-squared: 0.894 Model: OLS Adj. R-squared: 0.868 Method: Least Squares F-statistic: 35.54 Date: Mon, 18 Feb 2019 Prob (F-statistic): 5.24e-62 Time: 17:19:14 Log-Likelihood: -1899.7 No. Observations: 205 AIC: 3879. Df Residuals: 165 BIC: 4012. Df Model: 39 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- Intercept 1.592e+04 1.21e+04 1.320 0.189 -7898.396 3.97e+04 make[T.audi] 6519.7045 2371.807 2.749 0.007 1836.700 1.12e+04 make[T.bmw] 1.427e+04 2292.551 6.223 0.000 9740.771 1.88e+04 make[T.chevrolet] -571.8236 2860.026 -0.200 0.842 -6218.788 5075.141 make[T.dodge] -1186.3430 2261.240 -0.525 0.601 -5651.039 3278.353 make[T.honda] 2779.6496 2891.626 0.961 0.338 -2929.709 8489.009 make[T.isuzu] 3098.9677 2592.645 1.195 0.234 -2020.069 8218.004 make[T.jaguar] 1.752e+04 2416.313 7.252 0.000 1.28e+04 2.23e+04 make[T.mazda] 306.6568 2134.567 0.144 0.886 -3907.929 4521.243 make[T.mercedes-benz] 1.698e+04 2320.871 7.318 0.000 1.24e+04 2.16e+04 make[T.mercury] 2958.1002 3605.739 0.820 0.413 -4161.236 1.01e+04 make[T.mitsubishi] -1188.8337 2284.697 -0.520 0.604 -5699.844 3322.176 make[T.nissan] -1211.5463 2073.422 -0.584 0.560 -5305.405 2882.312 make[T.peugot] 3057.0217 4255.809 0.718 0.474 -5345.841 1.15e+04 make[T.plymouth] -894.5921 2332.746 -0.383 0.702 -5500.473 3711.289 make[T.porsche] 9558.8747 3688.038 2.592 0.010 2277.044 1.68e+04 make[T.renault] -2124.9722 2847.536 -0.746 0.457 -7747.277 3497.333 make[T.saab] 3490.5333 2319.189 1.505 0.134 -1088.579 8069.645 make[T.subaru] -1.636e+04 4002.796 -4.087 0.000 -2.43e+04 -8456.659 make[T.toyota] -770.9677 1911.754 -0.403 0.687 -4545.623 3003.688 make[T.volkswagen] 406.9179 2219.714 0.183 0.855 -3975.788 4789.623 make[T.volvo] 5433.7129 2397.030 2.267 0.025 700.907 1.02e+04 fuel_system[T.2bbl] 2142.1594 2232.214 0.960 0.339 -2265.226 6549.545 fuel_system[T.4bbl] 464.1109 3999.976 0.116 0.908 -7433.624 8361.846 fuel_system[T.idi] 1.991e+04 6622.812 3.007 0.003 6837.439 3.3e+04 fuel_system[T.mfi] 3716.5201 3936.805 0.944 0.347 -4056.488 1.15e+04 fuel_system[T.mpfi] 3964.1109 2267.538 1.748 0.082 -513.019 8441.241 fuel_system[T.spdi] 3240.0003 2719.925 1.191 0.235 -2130.344 8610.344 fuel_system[T.spfi] 
932.1959 4019.476 0.232 0.817 -7004.041 8868.433 engine_type[T.dohcv] -1.208e+04 4205.826 -2.872 0.005 -2.04e+04 -3773.504 engine_type[T.l] -4833.9860 3763.812 -1.284 0.201 -1.23e+04 2597.456 engine_type[T.ohc] -4038.8848 1213.598 -3.328 0.001 -6435.067 -1642.702 engine_type[T.ohcf] 9618.9281 3504.600 2.745 0.007 2699.286 1.65e+04 engine_type[T.ohcv] 3051.7629 1445.185 2.112 0.036 198.323 5905.203 engine_type[T.rotor] 1403.9928 3217.402 0.436 0.663 -4948.593 7756.579 num_of_doors[T.two] -419.9640 521.754 -0.805 0.422 -1450.139 610.211 bore 3993.4308 1373.487 2.908 0.004 1281.556 6705.306 compression_ratio -1200.5665 460.681 -2.606 0.010 -2110.156 -290.977 height -80.7141 146.219 -0.552 0.582 -369.417 207.988 peak_rpm -0.5903 0.790 -0.747 0.456 -2.150 0.970 ============================================================================== Omnibus: 65.777 Durbin-Watson: 1.217 Prob(Omnibus): 0.000 Jarque-Bera (JB): 399.594 Skew: 1.059 Prob(JB): 1.70e-87 Kurtosis: 9.504 Cond. No. 3.26e+05 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 3.26e+05. This might indicate that there are strong multicollinearity or other numerical problems. """ Case 2: Linear Model using Dummy Variables from Statsmodels as well # We define a specific seed to have the same results: random.seed(1234) # First we check what `object` type variables we have in our dataset: df.dtypes # We create a list where we save the `object` type variables names: object = ['make', 'fuel_system', 'engine_type', 'num_of_doors' ] # Now we convert those object variables to numeric with get_dummies function to have 1 unique numeric dataframe: df_num = pd.get_dummies(df, columns = object) # We ensure the dataframe is numeric casting all values to float64: df_num = df_num[df_num.columns].apply(pd.to_numeric, errors='coerce', axis = 1) # We define the predictive variables dataset: X = df_num.drop('price', axis = 1) # We define the response variable values: y = df_num.price.values # We add a constant as we did in the previous example (adding "+1" to Patsy): Xc = sm.add_constant(X) # Adds a constant to the model # We create the linear model and obtain results: lm2 = sm.OLS(y, Xc) lm2.fit().summary() The result of lm2 is: OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.894 Model: OLS Adj. R-squared: 0.868 Method: Least Squares F-statistic: 35.54 Date: Mon, 18 Feb 2019 Prob (F-statistic): 5.24e-62 Time: 17:28:16 Log-Likelihood: -1899.7 No. Observations: 205 AIC: 3879. Df Residuals: 165 BIC: 4012. 
Df Model: 39 Covariance Type: nonrobust ====================================================================================== coef std err t P>|t| [0.025 0.975] -------------------------------------------------------------------------------------- const 1.205e+04 6811.094 1.769 0.079 -1398.490 2.55e+04 bore 3993.4308 1373.487 2.908 0.004 1281.556 6705.306 compression_ratio -1200.5665 460.681 -2.606 0.010 -2110.156 -290.977 height -80.7141 146.219 -0.552 0.582 -369.417 207.988 peak_rpm -0.5903 0.790 -0.747 0.456 -2.150 0.970 make_alfa-romero -2273.9631 1865.185 -1.219 0.225 -5956.669 1408.743 make_audi 4245.7414 1324.140 3.206 0.002 1631.299 6860.184 make_bmw 1.199e+04 1232.635 9.730 0.000 9559.555 1.44e+04 make_chevrolet -2845.7867 1976.730 -1.440 0.152 -6748.733 1057.160 make_dodge -3460.3061 1170.966 -2.955 0.004 -5772.315 -1148.297 make_honda 505.6865 2049.865 0.247 0.805 -3541.661 4553.034 make_isuzu 825.0045 1706.160 0.484 0.629 -2543.716 4193.725 make_jaguar 1.525e+04 1903.813 8.010 0.000 1.15e+04 1.9e+04 make_mazda -1967.3063 982.179 -2.003 0.047 -3906.564 -28.048 make_mercedes-benz 1.471e+04 1423.004 10.338 0.000 1.19e+04 1.75e+04 make_mercury 684.1370 2913.361 0.235 0.815 -5068.136 6436.410 make_mitsubishi -3462.7968 1221.018 -2.836 0.005 -5873.631 -1051.963 make_nissan -3485.5094 946.316 -3.683 0.000 -5353.958 -1617.060 make_peugot 783.0586 3513.296 0.223 0.824 -6153.754 7719.871 make_plymouth -3168.5552 1293.376 -2.450 0.015 -5722.256 -614.854 make_porsche 7284.9115 2853.174 2.553 0.012 1651.475 1.29e+04 make_renault -4398.9354 2037.945 -2.159 0.032 -8422.747 -375.124 make_saab 1216.5702 1487.192 0.818 0.415 -1719.810 4152.950 make_subaru -1.863e+04 3263.524 -5.710 0.000 -2.51e+04 -1.22e+04 make_toyota -3044.9308 776.059 -3.924 0.000 -4577.218 -1512.644 make_volkswagen -1867.0452 1170.975 -1.594 0.113 -4179.072 444.981 make_volvo 3159.7498 1327.405 2.380 0.018 538.862 5780.638 fuel_system_1bbl -2790.4092 2230.161 -1.251 0.213 -7193.740 1612.922 fuel_system_2bbl -648.2498 1094.525 -0.592 0.554 -2809.330 1512.830 fuel_system_4bbl -2326.2983 3094.703 -0.752 0.453 -8436.621 3784.024 fuel_system_idi 1.712e+04 6154.806 2.782 0.006 4971.083 2.93e+04 fuel_system_mfi 926.1109 3063.134 0.302 0.763 -5121.881 6974.102 fuel_system_mpfi 1173.7017 1186.125 0.990 0.324 -1168.238 3515.642 fuel_system_spdi 449.5911 1827.318 0.246 0.806 -3158.349 4057.531 fuel_system_spfi -1858.2133 3111.596 -0.597 0.551 -8001.891 4285.464 engine_type_dohc 2703.6445 1803.080 1.499 0.136 -856.440 6263.729 engine_type_dohcv -9374.0342 3504.717 -2.675 0.008 -1.63e+04 -2454.161 engine_type_l -2130.3416 3357.283 -0.635 0.527 -8759.115 4498.431 engine_type_ohc -1335.2404 1454.047 -0.918 0.360 -4206.177 1535.696 engine_type_ohcf 1.232e+04 2850.883 4.322 0.000 6693.659 1.8e+04 engine_type_ohcv 5755.4074 1669.627 3.447 0.001 2458.820 9051.995 engine_type_rotor 4107.6373 3032.223 1.355 0.177 -1879.323 1.01e+04 num_of_doors_four 6234.8048 3491.722 1.786 0.076 -659.410 1.31e+04 num_of_doors_two 5814.8408 3337.588 1.742 0.083 -775.045 1.24e+04 ============================================================================== Omnibus: 65.777 Durbin-Watson: 1.217 Prob(Omnibus): 0.000 Jarque-Bera (JB): 399.594 Skew: 1.059 Prob(JB): 1.70e-87 Kurtosis: 9.504 Cond. No. 1.01e+16 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The smallest eigenvalue is 5.38e-23. 
This might indicate that there are strong multicollinearity problems or that the design matrix is singular. """

As we can see, some variables, such as height, have the same coefficient in both outputs. However, others do not (the isuzu level of make, the ohc level of engine_type, the intercept, etc.). Shouldn't both outputs give the same results? What am I missing or doing wrong? Thanks in advance for your help.

P.S. As clarified by #sukhbinder, even when I fit the Patsy formula without an intercept (adding "-1" to the formula, since Patsy includes one by default) and drop the constant from the dummy formulation, I still get different results.
The reason why the results do not match is because Statsmodels does a pre-selection on predictive variables depending on high multicollinearity. Exactly the same results are accomplished going through descriptive summary of the regression and identifying variables missing: deletex = [ 'make_alfa-romero', 'fuel_system_1bbl', 'engine_type_dohc', 'num_of_doors_four' ] df_num.drop( deletex, axis = 1, inplace = True) df_num = df_num[df_num.columns].apply(pd.to_numeric, errors='coerce', axis = 1) X = df_num.drop('price', axis = 1) y = df_num.price.values Xc = sm.add_constant(X) # Adds a constant to the model random.seed(1234) linear_regression = sm.OLS(y, Xc) linear_regression.fit().summary() Which prints the result: OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.894 Model: OLS Adj. R-squared: 0.868 Method: Least Squares F-statistic: 35.54 Date: Thu, 21 Feb 2019 Prob (F-statistic): 5.24e-62 Time: 18:16:08 Log-Likelihood: -1899.7 No. Observations: 205 AIC: 3879. Df Residuals: 165 BIC: 4012. Df Model: 39 Covariance Type: nonrobust ====================================================================================== coef std err t P>|t| [0.025 0.975] -------------------------------------------------------------------------------------- const 1.592e+04 1.21e+04 1.320 0.189 -7898.396 3.97e+04 bore 3993.4308 1373.487 2.908 0.004 1281.556 6705.306 compression_ratio -1200.5665 460.681 -2.606 0.010 -2110.156 -290.977 height -80.7141 146.219 -0.552 0.582 -369.417 207.988 peak_rpm -0.5903 0.790 -0.747 0.456 -2.150 0.970 make_audi 6519.7045 2371.807 2.749 0.007 1836.700 1.12e+04 make_bmw 1.427e+04 2292.551 6.223 0.000 9740.771 1.88e+04 make_chevrolet -571.8236 2860.026 -0.200 0.842 -6218.788 5075.141 make_dodge -1186.3430 2261.240 -0.525 0.601 -5651.039 3278.353 make_honda 2779.6496 2891.626 0.961 0.338 -2929.709 8489.009 make_isuzu 3098.9677 2592.645 1.195 0.234 -2020.069 8218.004 make_jaguar 1.752e+04 2416.313 7.252 0.000 1.28e+04 2.23e+04 make_mazda 306.6568 2134.567 0.144 0.886 -3907.929 4521.243 make_mercedes-benz 1.698e+04 2320.871 7.318 0.000 1.24e+04 2.16e+04 make_mercury 2958.1002 3605.739 0.820 0.413 -4161.236 1.01e+04 make_mitsubishi -1188.8337 2284.697 -0.520 0.604 -5699.844 3322.176 make_nissan -1211.5463 2073.422 -0.584 0.560 -5305.405 2882.312 make_peugot 3057.0217 4255.809 0.718 0.474 -5345.841 1.15e+04 make_plymouth -894.5921 2332.746 -0.383 0.702 -5500.473 3711.289 make_porsche 9558.8747 3688.038 2.592 0.010 2277.044 1.68e+04 make_renault -2124.9722 2847.536 -0.746 0.457 -7747.277 3497.333 make_saab 3490.5333 2319.189 1.505 0.134 -1088.579 8069.645 make_subaru -1.636e+04 4002.796 -4.087 0.000 -2.43e+04 -8456.659 make_toyota -770.9677 1911.754 -0.403 0.687 -4545.623 3003.688 make_volkswagen 406.9179 2219.714 0.183 0.855 -3975.788 4789.623 make_volvo 5433.7129 2397.030 2.267 0.025 700.907 1.02e+04 fuel_system_2bbl 2142.1594 2232.214 0.960 0.339 -2265.226 6549.545 fuel_system_4bbl 464.1109 3999.976 0.116 0.908 -7433.624 8361.846 fuel_system_idi 1.991e+04 6622.812 3.007 0.003 6837.439 3.3e+04 fuel_system_mfi 3716.5201 3936.805 0.944 0.347 -4056.488 1.15e+04 fuel_system_mpfi 3964.1109 2267.538 1.748 0.082 -513.019 8441.241 fuel_system_spdi 3240.0003 2719.925 1.191 0.235 -2130.344 8610.344 fuel_system_spfi 932.1959 4019.476 0.232 0.817 -7004.041 8868.433 engine_type_dohcv -1.208e+04 4205.826 -2.872 0.005 -2.04e+04 -3773.504 engine_type_l -4833.9860 3763.812 -1.284 0.201 -1.23e+04 2597.456 engine_type_ohc 
-4038.8848 1213.598 -3.328 0.001 -6435.067 -1642.702 engine_type_ohcf 9618.9281 3504.600 2.745 0.007 2699.286 1.65e+04 engine_type_ohcv 3051.7629 1445.185 2.112 0.036 198.323 5905.203 engine_type_rotor 1403.9928 3217.402 0.436 0.663 -4948.593 7756.579 num_of_doors_two -419.9640 521.754 -0.805 0.422 -1450.139 610.211 ============================================================================== Omnibus: 65.777 Durbin-Watson: 1.217 Prob(Omnibus): 0.000 Jarque-Bera (JB): 399.594 Skew: 1.059 Prob(JB): 1.70e-87 Kurtosis: 9.504 Cond. No. 3.26e+05 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 3.26e+05. This might indicate that there are strong multicollinearity or other numerical problems. Results that is completely equal to the first call with Statsmodels: random.seed(1234) lm_python = smf.ols('price ~ make + fuel_system + engine_type + num_of_doors + bore + compression_ratio + height + peak_rpm + 1', data = df) lm_python.fit().summary() OLS Regression Results ============================================================================== Dep. Variable: price R-squared: 0.894 Model: OLS Adj. R-squared: 0.868 Method: Least Squares F-statistic: 35.54 Date: Thu, 21 Feb 2019 Prob (F-statistic): 5.24e-62 Time: 18:17:37 Log-Likelihood: -1899.7 No. Observations: 205 AIC: 3879. Df Residuals: 165 BIC: 4012. Df Model: 39 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- Intercept 1.592e+04 1.21e+04 1.320 0.189 -7898.396 3.97e+04 make[T.audi] 6519.7045 2371.807 2.749 0.007 1836.700 1.12e+04 make[T.bmw] 1.427e+04 2292.551 6.223 0.000 9740.771 1.88e+04 make[T.chevrolet] -571.8236 2860.026 -0.200 0.842 -6218.788 5075.141 make[T.dodge] -1186.3430 2261.240 -0.525 0.601 -5651.039 3278.353 make[T.honda] 2779.6496 2891.626 0.961 0.338 -2929.709 8489.009 make[T.isuzu] 3098.9677 2592.645 1.195 0.234 -2020.069 8218.004 make[T.jaguar] 1.752e+04 2416.313 7.252 0.000 1.28e+04 2.23e+04 make[T.mazda] 306.6568 2134.567 0.144 0.886 -3907.929 4521.243 make[T.mercedes-benz] 1.698e+04 2320.871 7.318 0.000 1.24e+04 2.16e+04 make[T.mercury] 2958.1002 3605.739 0.820 0.413 -4161.236 1.01e+04 make[T.mitsubishi] -1188.8337 2284.697 -0.520 0.604 -5699.844 3322.176 make[T.nissan] -1211.5463 2073.422 -0.584 0.560 -5305.405 2882.312 make[T.peugot] 3057.0217 4255.809 0.718 0.474 -5345.841 1.15e+04 make[T.plymouth] -894.5921 2332.746 -0.383 0.702 -5500.473 3711.289 make[T.porsche] 9558.8747 3688.038 2.592 0.010 2277.044 1.68e+04 make[T.renault] -2124.9722 2847.536 -0.746 0.457 -7747.277 3497.333 make[T.saab] 3490.5333 2319.189 1.505 0.134 -1088.579 8069.645 make[T.subaru] -1.636e+04 4002.796 -4.087 0.000 -2.43e+04 -8456.659 make[T.toyota] -770.9677 1911.754 -0.403 0.687 -4545.623 3003.688 make[T.volkswagen] 406.9179 2219.714 0.183 0.855 -3975.788 4789.623 make[T.volvo] 5433.7129 2397.030 2.267 0.025 700.907 1.02e+04 fuel_system[T.2bbl] 2142.1594 2232.214 0.960 0.339 -2265.226 6549.545 fuel_system[T.4bbl] 464.1109 3999.976 0.116 0.908 -7433.624 8361.846 fuel_system[T.idi] 1.991e+04 6622.812 3.007 0.003 6837.439 3.3e+04 fuel_system[T.mfi] 3716.5201 3936.805 0.944 0.347 -4056.488 1.15e+04 fuel_system[T.mpfi] 3964.1109 2267.538 1.748 0.082 -513.019 8441.241 fuel_system[T.spdi] 3240.0003 
2719.925 1.191 0.235 -2130.344 8610.344 fuel_system[T.spfi] 932.1959 4019.476 0.232 0.817 -7004.041 8868.433 engine_type[T.dohcv] -1.208e+04 4205.826 -2.872 0.005 -2.04e+04 -3773.504 engine_type[T.l] -4833.9860 3763.812 -1.284 0.201 -1.23e+04 2597.456 engine_type[T.ohc] -4038.8848 1213.598 -3.328 0.001 -6435.067 -1642.702 engine_type[T.ohcf] 9618.9281 3504.600 2.745 0.007 2699.286 1.65e+04 engine_type[T.ohcv] 3051.7629 1445.185 2.112 0.036 198.323 5905.203 engine_type[T.rotor] 1403.9928 3217.402 0.436 0.663 -4948.593 7756.579 num_of_doors[T.two] -419.9640 521.754 -0.805 0.422 -1450.139 610.211 bore 3993.4308 1373.487 2.908 0.004 1281.556 6705.306 compression_ratio -1200.5665 460.681 -2.606 0.010 -2110.156 -290.977 height -80.7141 146.219 -0.552 0.582 -369.417 207.988 peak_rpm -0.5903 0.790 -0.747 0.456 -2.150 0.970 ============================================================================== Omnibus: 65.777 Durbin-Watson: 1.217 Prob(Omnibus): 0.000 Jarque-Bera (JB): 399.594 Skew: 1.059 Prob(JB): 1.70e-87 Kurtosis: 9.504 Cond. No. 3.26e+05 ============================================================================== There is the need to check correspondence in predictive variables as pd.get_dummies does an extensive obtaining of all dummy variables, and Statsmodels applies an N-1 levels inside the categorical variable selection.
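In practice the mismatch comes down to the encoding: pd.get_dummies keeps every level of each categorical, while Patsy drops one reference level per factor, so the full-dummy design matrix is singular and the coefficients are no longer identified in the same way. A shorter route to the Patsy-style encoding, assuming the same df as above, is drop_first=True:

import pandas as pd
import statsmodels.api as sm

categorical = ['make', 'fuel_system', 'engine_type', 'num_of_doors']
# drop_first=True removes one reference level per factor, mirroring Patsy's default treatment coding
df_num = pd.get_dummies(df, columns=categorical, drop_first=True)
df_num = df_num.apply(pd.to_numeric, errors='coerce')

X = sm.add_constant(df_num.drop('price', axis=1))
y = df_num['price'].values
print(sm.OLS(y, X).fit().summary())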
Speed up extraction of coordinates from DICOM structure set
Using numpy.reshape helped a lot and using map helped a little. Is it possible to speed this up some more? import pydicom import numpy as np import cProfile import pstats def parse_coords(contour): """Given a contour from a DICOM ROIContourSequence, returns coordinates [loop][[x0, x1, x2, ...][y0, y1, y2, ...][z0, z1, z2, ...]]""" if not hasattr(contour, "ContourSequence"): return [] # empty structure def _reshape_contour_data(loop): return np.reshape(np.array(loop.ContourData), (3, len(loop.ContourData) // 3), order='F') return list(map(_reshape_contour_data,contour.ContourSequence)) def profile_load_contours(): rs = pydicom.dcmread('RS.gyn1.dcm') structs = [parse_coords(contour) for contour in rs.ROIContourSequence] cProfile.run('profile_load_contours()','prof.stats') p = pstats.Stats('prof.stats') p.sort_stats('cumulative').print_stats(30) Using a real structure set exported from Varian Eclipse. ncalls tottime percall cumtime percall filename:lineno(function) 1 0.000 0.000 12.165 12.165 {built-in method builtins.exec} 1 0.151 0.151 12.165 12.165 <string>:1(<module>) 1 0.000 0.000 12.014 12.014 load_contour_time.py:19(profile_load_contours) 1 0.000 0.000 11.983 11.983 load_contour_time.py:21(<listcomp>) 56 0.009 0.000 11.983 0.214 load_contour_time.py:7(parse_coords) 50745/33837 0.129 0.000 11.422 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/dataset.py:455(__getattr__) 50741/33825 0.152 0.000 10.938 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/dataset.py:496(__getitem__) 16864 0.069 0.000 9.839 0.001 load_contour_time.py:12(_reshape_contour_data) 16915 0.101 0.000 9.780 0.001 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/dataelem.py:439(DataElement_from_raw) 16915 0.052 0.000 9.300 0.001 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/values.py:320(convert_value) 16864 0.038 0.000 7.099 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/values.py:89(convert_DS_string) 16870 0.042 0.000 7.010 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/valuerep.py:495(MultiString) 16908 1.013 0.000 6.826 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/multival.py:29(__init__) 3004437 3.013 0.000 5.577 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/multival.py:42(number_string_type_constructor) 3038317/3038231 1.037 0.000 3.171 0.000 {built-in method builtins.hasattr} Much of the time is in convert_DS_string. Is it possible to make it faster? I guess part of the problem is that the coordinates are not stored very efficiently in the DICOM file. EDIT: As a way of avoiding the loop at the end of MultiVal.__init__ I am wondering about getting the raw double string of each ContourData and using numpy.fromstring on it. However, I have not been able to get the raw double string.
Eliminating the loop in MultiVal.__init__ and using numpy.fromstring provides more than 4 times speedup. I will post on the pydicom github see if there is some interest in taking this into the library code. It is a little ugly. I would welcome advice on further improvement. import pydicom import numpy as np import cProfile import pstats def parse_coords(contour): """Given a contour from a DICOM ROIContourSequence, returns coordinates [loop][[x0, x1, x2, ...][y0, y1, y2, ...][z0, z1, z2, ...]]""" if not hasattr(contour, "ContourSequence"): return [] # empty structure cd_tag = pydicom.tag.Tag(0x3006, 0x0050) # ContourData tag def _reshape_contour_data(loop): val = super(loop.__class__, loop).__getitem__(cd_tag).value try: double_string = val.decode(encoding='utf-8') double_vec = np.fromstring(double_string, dtype=float, sep=chr(92)) # 92 is '/' except AttributeError: # 'MultiValue' has no 'decode' (bytes does) # It's already been converted to doubles and cached double_vec = loop.ContourData return np.reshape(np.array(double_vec), (3, len(double_vec) // 3), order='F') return list(map(_reshape_contour_data, contour.ContourSequence)) def profile_load_contours(): rs = pydicom.dcmread('RS.gyn1.dcm') structs = [parse_coords(contour) for contour in rs.ROIContourSequence] profile_load_contours() cProfile.run('profile_load_contours()','prof.stats') p = pstats.Stats('prof.stats') p.sort_stats('cumulative').print_stats(15) Result ncalls tottime percall cumtime percall filename:lineno(function) 1 0.000 0.000 2.800 2.800 {built-in method builtins.exec} 1 0.017 0.017 2.800 2.800 <string>:1(<module>) 1 0.000 0.000 2.783 2.783 load_contour_time3.py:29(profile_load_contours) 1 0.000 0.000 2.761 2.761 load_contour_time3.py:31(<listcomp>) 56 0.006 0.000 2.760 0.049 load_contour_time3.py:9(parse_coords) 153/109 0.001 0.000 2.184 0.020 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/dataset.py:455(__getattr__) 149/97 0.001 0.000 2.182 0.022 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/dataset.py:496(__getitem__) 51 0.000 0.000 2.178 0.043 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/dataelem.py:439(DataElement_from_raw) 51 0.000 0.000 2.177 0.043 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/values.py:320(convert_value) 44 0.000 0.000 2.176 0.049 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/values.py:255(convert_SQ) 44 0.035 0.001 2.176 0.049 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/filereader.py:427(read_sequence) 152/66 0.000 0.000 2.171 0.033 {built-in method builtins.hasattr} 16920 0.147 0.000 1.993 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/filereader.py:452(read_sequence_item) 16923 0.116 0.000 1.267 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/filereader.py:365(read_dataset) 84616 0.113 0.000 0.699 0.000 /home/cf/python/venv/lib/python3.5/site-packages/pydicom/dataset.py:960(__setattr__)
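One small note on the decoding step: the DICOM DS multi-value separator is the backslash (ASCII 92), so chr(92) is '\' rather than '/'. If numpy.fromstring's text mode is a concern, splitting the raw byte string should give the same vector (a sketch, assuming val still holds the raw bytes, e.g. b"1.0\\2.0\\3.0"):

double_vec = np.array(val.split(b'\\'), dtype=float)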
3D Surface Colormap in Python
There's another problem I'm facing now. I have a data file like the following: # Time THR-1 ALA-2 PRO-3 VAL-4 PRO-5 MET-6 PRO-7 ASP-8 LEU-9 LYS-10 ASN-11 VAL-12 LYS-13 SER-14 LYS-15 ILE-16 GLY-17 SER-18 THR-19 GLU-20 ASN-21 LEU-22 LYS-23 HIS-24 GLN-25 PRO-26 GLY-27 GLY-28 GLY-29 LYS-30 VAL-31 GLN-32 ILE-33 ILE-34 ASN-35 LYS-36 LYS-37 LEU-38 ASP-39 LEU-40 SER-41 ASN-42 VAL-43 GLN-44 SER-45 LYS-46 CYS-47 GLY-48 SER-49 LYS-50 ASP-51 ASN-52 ILE-53 LYS-54 HIS-55 VAL-56 PRO-57 GLY-58 GLY-59 GLY-60 SER-61 VAL-62 GLN-63 ILE-64 VAL-65 TYR-66 LYS-67 PRO-68 VAL-69 ASP-70 LEU-71 SER-72 LYS-73 VAL-74 THR-75 SER-76 LYS-77 CYS-78 GLY-79 SER-80 LEU-81 GLY-82 ASN-83 ILE-84 HIS-85 HIS-86 LYS-87 PRO-88 GLY-89 GLY-90 GLY-91 GLN-92 VAL-93 GLU-94 VAL-95 LYS-96 SER-97 GLU-98 LYS-99 LEU-100 ASP-101 PHE-102 LYS-103 ASP-104 ARG-105 VAL-106 GLN-107 SER-108 LYS-109 ILE-110 GLY-111 SER-112 LEU-113 ASP-114 ASN-115 ILE-116 THR-117 HIS-118 VAL-119 PRO-120 GLY-121 GLY-122 GLY-123 ASN-124 DA-1 DA-2 DA-3 DC-4 DA-5 DT-6 DG-7 DT-8 DT-9 DA-10 DA-11 DA-12 DC-13 DA-14 DT-15 DG-16 DT-17 DT-18 DT-19 DA-1 DA-2 DA-3 DC-4 DA-5 DT-6 DG-7 DT-8 DT-9 DT-10 DA-11 DA-12 DC-13 DA-14 DT-15 DG-16 DT-17 DT-18 DT-19 0.000 84.841 0.274 8.595 -4.939 1.713 -1.704 0.768 -127.825 5.554 108.207 5.297 8.390 212.124 2.830 39.479 8.168 0.458 8.848 6.897 -83.882 29.016 9.647 308.856 6.400 32.481 11.327 10.372 0.247 -3.669 45.391 7.648 -6.990 16.870 11.946 18.778 29.161 127.841 -1.885 -49.943 4.716 6.552 16.029 4.803 7.307 5.423 35.449 -1.362 0.703 0.817 5.544 -14.168 -2.450 0.138 10.984 2.680 -0.238 -0.204 -1.814 -0.273 0.971 -0.256 2.553 -1.172 0.337 0.659 -3.890 8.570 1.180 2.319 -10.711 0.433 0.320 7.904 -0.021 1.672 -0.895 -1.804 -0.317 0.233 0.013 1.462 -1.310 -3.139 -1.453 -4.536 0.559 59.050 -10.891 3.089 5.579 9.818 6.599 -1.635 -34.622 2.576 14.145 9.062 -82.518 51.319 -5.944 -42.734 -0.065 5.200 -18.819 -1.670 0.354 -0.142 -0.938 -4.108 -0.582 -0.511 -0.452 0.763 -21.291 2.587 -5.088 -0.458 5.958 -0.746 -0.587 0.600 6.134 9.432 -47.476 0.517 -0.958 -1.246 0.005 -1.422 -5.105 -2.815 -6.459 -1.618 56.055 117.408 92.845 60.554 -6.065 -9.293 -3.752 -5.407 -1.491 -4.924 -0.944 13.894 32.688 15.937 2.866 -0.934 25.169 1.291 -5.292 -8.727 5.852 -8.092 -40.334 -18.542 0.468 -6.011 -2.043 -1.305 -0.959 10.000 127.315 0.993 15.230 12.627 0.804 0.642 -2.810 -101.634 5.500 114.097 3.368 9.100 162.819 -10.033 39.935 6.920 9.887 9.732 4.997 -79.368 25.134 -5.714 307.359 5.781 34.996 8.885 7.234 -5.875 -0.094 31.674 3.963 -8.064 14.720 12.726 25.431 25.011 108.108 -0.293 -63.815 4.442 1.071 12.768 2.871 1.451 2.179 30.666 -2.066 0.995 1.496 3.384 -1.398 -0.776 -0.101 5.159 1.092 -0.829 -0.205 -0.125 1.054 0.574 -0.291 1.106 0.875 -1.106 -1.955 1.153 4.273 0.628 1.305 -5.547 0.755 0.126 3.704 0.925 0.074 -0.516 3.643 -0.133 -0.064 0.717 0.547 0.197 -0.408 -0.912 -1.296 0.508 35.027 -3.056 10.216 5.885 8.755 -0.792 -1.442 -28.498 2.122 6.803 1.344 -58.583 47.395 -2.332 -32.863 -2.826 5.311 -23.087 6.478 -0.205 0.288 -0.373 4.358 0.362 -1.010 -0.352 2.271 -13.406 -2.747 -4.616 -2.275 3.943 -4.391 -7.063 -0.599 3.081 12.778 -40.043 0.327 -1.940 -2.012 2.592 2.909 1.041 0.658 -0.868 -3.206 16.355 109.843 107.372 63.801 8.499 0.931 2.639 -0.884 0.214 1.880 -2.379 8.408 12.583 10.883 23.083 7.955 31.277 0.539 3.992 -0.887 12.925 -4.248 -31.420 -4.812 1.125 3.287 -0.532 -0.438 0.291 20.000 84.636 5.538 15.954 10.437 0.439 1.773 -1.913 -96.625 5.704 132.598 -0.572 6.877 174.628 -9.400 32.417 -0.264 3.812 6.175 5.056 -62.617 25.479 -1.171 288.031 8.114 37.636 10.461 
4.612 -3.521 -0.335 37.957 6.596 -11.250 12.510 11.557 21.128 37.344 135.293 -2.163 -80.896 0.912 1.963 1.101 2.815 6.051 5.374 28.443 0.905 1.734 0.813 5.060 -1.365 1.653 -0.415 4.862 1.758 -0.572 -0.339 0.423 0.759 1.036 -0.543 0.783 0.102 -0.971 -1.529 -1.595 5.519 0.587 1.306 -2.813 0.605 0.761 4.542 0.698 0.767 -0.050 2.201 -0.084 0.563 0.357 0.422 0.642 0.588 -1.426 -1.375 1.455 31.332 -3.390 16.696 15.616 13.449 0.096 -2.711 -24.804 1.969 4.095 2.078 -58.303 47.776 -1.047 -22.013 -2.270 4.204 -11.059 3.952 0.382 -0.863 0.010 3.473 0.375 -1.301 -0.037 1.396 -14.392 -2.887 -5.915 -2.315 5.888 -3.365 -5.950 -2.439 4.814 7.125 -46.399 4.393 5.939 -0.508 2.461 2.562 -0.717 4.225 3.642 4.664 27.859 104.835 114.077 74.730 8.410 1.862 0.061 -1.288 -1.181 2.106 4.346 9.017 29.050 -5.088 14.618 4.149 5.062 1.369 15.083 9.537 18.306 -1.165 -8.966 3.864 3.523 7.232 4.275 1.888 4.708 30.000 91.953 11.008 15.794 12.043 0.596 4.611 1.048 -70.764 7.475 72.100 1.360 6.891 150.455 -7.180 11.932 4.845 9.519 6.184 4.684 -57.283 24.797 0.393 275.626 14.021 22.233 10.877 0.934 -7.551 -2.439 27.929 5.098 -6.797 12.784 12.140 19.698 25.762 108.882 0.267 -54.801 1.470 2.139 1.302 1.996 2.021 3.090 22.690 0.669 1.347 0.113 5.378 -1.570 0.585 -0.143 1.156 -0.050 -1.086 0.148 -0.017 -0.417 -0.201 -1.304 0.808 -0.950 -0.958 -1.741 0.200 2.846 0.633 1.279 -3.693 0.338 -1.058 3.651 0.009 0.202 -1.009 0.037 -0.245 -0.183 -0.615 0.192 -0.386 0.426 -1.800 -2.009 0.496 33.517 -4.213 15.421 16.942 14.559 0.109 -2.553 -25.113 1.199 2.074 -0.265 -56.399 40.657 -0.746 -24.020 -1.986 3.400 -9.631 1.384 0.502 -1.001 0.547 2.622 -0.201 -1.062 -0.916 0.493 -14.621 -2.660 -4.459 -1.066 3.788 -4.289 -7.086 2.460 5.341 8.759 -39.474 -0.051 2.116 0.498 1.267 0.728 1.071 1.155 0.824 3.214 32.413 124.028 144.011 80.795 11.199 5.365 1.969 0.659 2.780 2.311 1.671 14.244 33.170 -6.859 -6.106 13.690 4.742 0.645 17.301 12.245 15.829 -11.976 -22.289 3.100 1.725 5.538 5.041 3.517 -0.205 40.000 149.956 11.453 22.603 13.125 1.909 5.563 1.533 -90.126 5.479 90.590 4.141 6.652 173.681 -3.703 24.551 3.012 10.247 12.607 7.241 -64.707 21.636 -0.285 276.445 6.223 29.727 8.346 5.092 -5.591 -2.969 27.881 3.581 -6.824 13.884 11.709 21.034 25.732 104.610 -0.237 -54.221 1.960 1.674 2.394 1.727 6.499 3.453 25.335 0.636 0.754 -0.591 5.789 -3.344 1.182 -0.366 0.810 0.901 -0.625 -0.997 -0.241 0.214 0.311 -0.312 0.498 -1.336 -0.911 -1.210 -2.459 3.182 0.599 0.713 -4.273 0.326 0.522 3.207 0.312 0.830 -0.558 1.351 -0.017 0.569 -0.367 0.966 -0.637 -2.392 -2.722 -3.405 0.818 39.708 -2.537 16.297 14.229 10.427 0.837 -1.855 -24.033 0.996 5.579 -1.055 -65.068 48.891 -2.411 -21.785 -2.094 1.285 -3.668 1.264 0.463 0.070 -0.034 2.779 0.115 -0.947 1.107 0.337 -16.009 -3.881 -5.203 -1.503 0.358 -4.410 -8.007 -1.383 10.872 17.390 -47.147 1.140 -2.218 -0.597 -0.312 0.685 1.781 5.662 1.917 1.504 32.806 123.230 132.991 68.245 11.523 3.048 0.389 -0.890 0.170 2.100 1.166 11.693 31.756 2.595 19.844 24.565 30.414 11.828 18.563 22.426 20.596 -13.383 -18.574 -2.142 4.737 1.680 0.071 3.983 -0.001 For which, I'm trying to plot a 3D colormap with time in the x-axis, number of columns in the y-axis and their respective values in the z-axis. 
I have written the following code to extract data from the file and to plot it:

#!/usr/bin/python
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import numpy as np

data = np.loadtxt('contrib_pol.dat', skiprows=1)
x = data[:,0]
y = range(1,len(data[0,:]))
z = []

fig=plt.figure()
ax=fig.gca(projection='3d')

for r, row in enumerate(data):
    for c, col in enumerate(row[1:], start=1):
        z.append(col)

surf = ax.plot_surface(x, y, z, cmap=cm.coolwarm, linewidth=0, antialiased=False)
plt.show()

And I'm having the errors:

Traceback (most recent call last):
  File "./barplot.py", line 31, in <module>
    linewidth=0, antialiased=False)
  File "/usr/lib/python2.7/dist-packages/mpl_toolkits/mplot3d/axes3d.py", line 1586, in plot_surface
    X, Y, Z = np.broadcast_arrays(X, Y, Z)
  File "/home/microbio/.local/lib/python2.7/site-packages/numpy/lib/stride_tricks.py", line 250, in broadcast_arrays
    shape = _broadcast_shape(*args)
  File "/home/microbio/.local/lib/python2.7/site-packages/numpy/lib/stride_tricks.py", line 185, in _broadcast_shape
    b = np.broadcast(*args[:32])
ValueError: shape mismatch: objects cannot be broadcast to a single shape

Can you help???
Check the docstring of plot_surface: it states that you need to supply the data as 2D arrays. With two additional lines of code you can make it work, using numpy.meshgrid to get the base grid and numpy.reshape to get your z values into the right format.

from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import numpy as np

data = np.loadtxt('contrib_pol.dat', skiprows=1)
x = data[:,0]
y = range(1,len(data[0,:]))
z = []

fig=plt.figure()
ax=fig.gca(projection='3d')

for r, row in enumerate(data):
    for c, col in enumerate(row[1:], start=1):
        z.append(col)

# generate the grid
xx, yy = np.meshgrid(x, y)

# reshaping your data to match the grid shape
zz = np.reshape(z, (len(y), len(x)))

surf = ax.plot_surface(xx, yy, zz, cmap=cm.coolwarm, linewidth=0, antialiased=False)
plt.show()
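As a quick sanity check, all three arrays handed to plot_surface should end up with the same 2D shape; np.meshgrid(x, y) returns arrays of shape (len(y), len(x)) by default, which is why zz is reshaped that way:

print(xx.shape, yy.shape, zz.shape)  # all three should be (len(y), len(x))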