I am facing a small problem when accessing a variable from a different file. I
have two files, train_dataset.py and test_dataset.py. I run train_dataset.py
from my IDE and note the value of the array variable array_val, shown below.
array([[ 0.08695652, 0.66459627, 0.08695652, 0.07453416, 0.07453416,
... 0.15217391]])
Now I switch to test_dataset.py, import train_dataset, and print the value of
array_val via train_dataset.array_val, but I see a very different output. The
output is given below.
array([[ 8.11594203e-01, 1.15942029e-01, 4.05797101e-01,
... 1.30434783e-01, 5.65217391e-01, 2.02898551e-01]])
Please suggest how I can get rid of this and explain the reason for the discrepancy.
Here is the relevant code from my train_dataset.py:
from sklearn.cluster import KMeans

no_of_clusters = 9
cluster_centroids = []
# fit k-means on the feature matrix
k_means = KMeans(n_clusters=no_of_clusters, n_init=14, max_iter=400)
k_means.fit(matrix_for_cluster)
labels = k_means.labels_                # cluster label of each sample
array_val = k_means.cluster_centers_    # cluster centre coordinates
Here matrix_for_cluster is a NumPy n-dimensional array.
In my test_dataset.py all I do is
import train_dataset
print train_dataset.array_val
This is probably due to the random initialization of the k-means algorithm
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
As @ali_m explains nicely in the comments, the line import train_dataset re-runs the clustering; the cluster centers are not actually saved from the previous time you ran the code. To persist them you can serialise the array with numpy.save and load it back with numpy.load:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html#numpy.load
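For example, a minimal sketch of that approach (the file name centers.npy is just a placeholder):

# at the end of train_dataset.py, after fitting
import numpy as np
np.save('centers.npy', array_val)        # writes the cluster centres to disk

# in test_dataset.py
import numpy as np
array_val = np.load('centers.npy')       # loads exactly what was saved, no re-clustering
print array_val

Alternatively, guarding the clustering in train_dataset.py with an if __name__ == '__main__': block would stop the import itself from re-running it, but you would still need to save the centres somewhere if you want the same values back later.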
Related
I am using oct2py to call corrcoef.m on several large (10MM+) dataframes to return [R, P] matrices that are used to generate training sets for an ML algorithm. Yesterday I had this working with no problem. I ran the script from the top this morning, producing an identical test set to be passed to Octave through oct2py.
I am being returned:
Oct2PyError: Octave evaluation error:
error: isnan: not defined for cell
error: called from:
corrcoef at line 152, column 5
CorrCoefScript at line 1, column 7
First, there are no null/NaN values in the set; in fact, there aren't even any zeros. No column is uniform, so a zero standard deviation is not the issue in the corrcoef calculation. It is mathematically sound.
Second, when I load the test set into Octave through the GUI and execute the same .m on the same data, no errors are returned, and the [R,P] matrices are identical to the saved outputs from last night. I tested whether the matrix var is being passed to Octave through oct2py correctly, and Octave is receiving an identical matrix. However, oct2py can no longer execute ANY .m with a NaN check in the source code. The error above is returned for any packaged Octave .m script that calls isnan at any point.
For s&g, I modified my .m to receive the matrix var and write it to a flat file like so:
csvwrite ('filename', data);
This also fails, with an fprintf error; if I run the same code on the same dataset inside the Octave GUI, it works fine.
I'm at a loss here. I updated conda, oct2py, and Octave with the same results. Again, the exact code with the exact data behaved as expected less than 24 hours prior.
I'm using the code below in Jupyter Notebook to test:
%env OCTAVE_EXECUTABLE = F:\Octave\Octave-5.1.0.0\mingw32\bin\octave-cli-5.1.0.exe
import oct2py
from oct2py import octave
octave.addpath('F:\\FinanceServer\\Python\\Secondary Docs\\autotesting\\atOctave_Scripts');
data = x  # x is the current dataframe (see the loop description below)
octave.push('data',data)
octave.eval('CorrCoefScript')
cmat = octave.pull('R')
Side note - I am only having this issue inside of a specific .ipynb script. Through some stroke of luck, no other scripts using oct2py seem to be affected.
Got it fixed, but it generates more questions than answers. I was using a list of dataframes to loop by type, so that for each iteration i, x was generated through x = dflst[i]. For reasons beyond my understanding, that stopped working over time. By wrapping my loop body in a custom function and explicitly passing each dataframe to that function, like so: oct_func(type1df), I see the expected behavior and the desired outcome. However, I still cannot use a loop to pass the dataframes to oct_func(), so it's a band-aid solution that fits my purposes but is frustratingly unable to scale.
Edit:
The loop works fine if iterating through a dict of dataframes instead of a list.
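For reference, a minimal sketch of the dict-based loop that now works for me; the dict keys and dataframe names are placeholders, and pushing the underlying NumPy array with .values (rather than the DataFrame itself) is my own assumption for keeping Octave from receiving a cell array:

from oct2py import octave

# hypothetical dict of dataframes, keyed by type name
df_dict = {'type1': type1df, 'type2': type2df}

results = {}
for key, frame in df_dict.items():
    octave.push('data', frame.values)   # push the plain NumPy array behind the dataframe
    octave.eval('CorrCoefScript')       # runs corrcoef on 'data' inside Octave
    results[key] = octave.pull('R')     # pull the correlation matrix back into Python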
I'm not sure if I'm missing something (I'm still new to Python), but I am reading a lot of MATLAB files in a folder. Each MATLAB file contains multiple arrays, and for each array I can do things like plot it and find the average, max, min, etc. My code works perfectly and reads the data correctly. Now I want to add a while loop so it keeps running until I tell it to stop, i.e. prompting the user to keep choosing folders whose data should be read. However, when I run it the second time, it gives me this error: TypeError: 'list' object is not callable
Correct me if I'm wrong, but I feel like the code is adding the next set of data to the data the program already holds, which is why it gives me an error at the line maxGF=max(GainF): it becomes an array of arrays, and it can't take the maximum of that.
When I load the data from each MATLAB file, this is how I do it:
import os
import numpy as np
import scipy.io

Files = []  # list of loaded .mat files
for s in os.listdir(filename):
    Files.append(scipy.io.loadmat(filename + s))

length = len(Files)  # number of MATLAB files found

# pre-allocate one slot per file (same style as GainF = [None]*length further down)
results, Gain_loading, Gain, PL_f0_dBm, Pavs_dBm, PL_f0, PL_2f0_dBm, CarrierFrequency, Gamma_In, Delta = ([None]*length for _ in range(10))

for filenumber in range(0, length):
    # Gets information from the MATLAB file and inserts it into an array
    results[filenumber] = Files[filenumber]['results']
    # Parameters within MATLAB
    Gain_loading[filenumber] = results[filenumber]['PowerDomain'][0,0]['Gain'][0,0]
    length_of_data_array = len(Gain_loading[filenumber])  # length of the data within the data array
    # reshape for graphing purposes
    Gain[filenumber] = np.reshape(results[filenumber]['PowerDomain'][0,0]['Gain'][0,0], length_of_data_array)
    PL_f0_dBm[filenumber] = np.reshape(results[filenumber]['PowerDomain'][0,0]['PL_f0_dBm'][0,0], length_of_data_array)
    Pavs_dBm[filenumber] = np.reshape(results[filenumber]['PowerDomain'][0,0]['Pavs_dBm'][0,0], length_of_data_array)
    PL_f0[filenumber] = np.reshape(results[filenumber]['PowerDomain'][0,0]['PL_f0'][0,0], length_of_data_array)
    PL_2f0_dBm[filenumber] = np.reshape(results[filenumber]['PowerDomain'][0,0]['PL_2f0_dBm'][0,0], length_of_data_array)
    CarrierFrequency[filenumber] = np.reshape(results[filenumber]['MeasurementParameters'][0,0]['CarrierFrequency'][0,0], length_of_data_array)
    Gamma_In[filenumber] = np.reshape(abs(results[filenumber]['PowerDomain'][0,0]['Gin_f0'][0,0]), length_of_data_array)
    Delta[filenumber] = PL_2f0_dBm[filenumber] - PL_f0_dBm[filenumber]
When I start working with the data as shown below, it works and displays the correct data, up until I run the max(GainF) command.
GainF = [None]*length
MaxPL_Gain = [None]*length

# finds the highest value of PL at each frequency
for c in range(0, length):
    MaxPL_Gain[c] = max(PL_f0_dBm[c])

MaxPL_Gain[:] = [x - 6 for x in MaxPL_Gain]  # subtracts 6 dB from the highest PL value at each frequency

for c in range(0, length):
    # interpolates PL vs Gain; finds the value of Gain at 6 dB backoff from max PL
    GainF[c] = np.interp(MaxPL_Gain[c], PL_f0_dBm[c], Gain[c])

maxGF = max(GainF)
I read other threads that said to use the seek(0) function, and I tried that with Files.seek(0), since that is where all of my data is initially saved, but when I run it, it gives me this error: AttributeError: 'list' object has no attribute 'seek'
How do I reset all of my data? Help
UPDATE:
I tried the following code
for name in dir():
    if not name.startswith('_'):
        del globals()[name]
And it works the way I want it to... or so I think. When I look at the PDF output by the program, I get distorted plots. It looks like the axes from the previous run are still in the PDFs. Not only that, but when I run it 4-5 times the spacing gets bigger and bigger and the graphs drift further apart. How can I fix this?
(screenshot: distorted graphs)
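Rather than deleting globals, a cleaner way to "reset" everything between runs is to keep each run's state inside a function and to close the figure you create for it. This is only a sketch of that idea; process_folder, the PDF name, and the use of matplotlib for the plots are assumptions based on the description above, not the original code:

import os
import scipy.io
import matplotlib.pyplot as plt

def process_folder(filename):
    # everything created in here is local to this call, so nothing leaks into the next run
    Files = [scipy.io.loadmat(filename + s) for s in os.listdir(filename)]
    # ... the loading / interpolation code from above goes here, using local variables ...

    fig, ax = plt.subplots()              # a fresh figure for this run only
    # ... ax.plot(...) calls for this folder go here ...
    fig.savefig(filename.rstrip('/\\') + '_plots.pdf')
    plt.close(fig)                        # stops old axes from piling up in later PDFs

while True:
    folder = raw_input('Folder to process (blank to stop): ')  # use input() on Python 3
    if not folder:
        break
    process_folder(folder)

Because all the lists live inside process_folder, they are rebuilt from scratch on every call, and plt.close(fig) keeps a previous run's axes out of the next PDF.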
So my research mates and I are trying to save a pretty big (47104,5) array into a TTree in a ROOT file. The array on the Python side works fine. We can access everything and run normal commands, but when we run the root_numpy.array2root() command, we get a weird error.
Object of type 'NoneType' has no len()
The code we are running for this portion is as follows:
import root_numpy as rnp
import numpy as np
import scipy
import logging

def save_array(outputArray, outputName):
    outputString = str(outputName)
    logging.info("Creating .Root file")
    rnp.array2root(outputArray, outputString, treename="Training_Variables", mode="recreate")
We added the outputString variable to make sure we were passing the filename in as a string. (In our Python terminal, we add .root to the end of outputName to save it as a .root file.)
Here is a picture of the terminal (screenshot: showing the exact error location in root_numpy).
Pretty much, we are confused about why array2root() is asking for the len() of an object that we don't think should have a len(); it should just have a shape. Any insight would be greatly appreciated.
The conversion routines from NumPy arrays to ROOT data types work with structured arrays. See the two following links. (Not tested, but this is very likely the problem, as the routines use the arr.dtype.names and arr.dtype.fields attributes, and for a plain float array those are None.)
http://rootpy.github.io/root_numpy/reference/generated/root_numpy.array2tree.html#root_numpy.array2tree
http://rootpy.github.io/root_numpy/reference/generated/root_numpy.array2root.html#root_numpy.array2root
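A minimal sketch of that conversion, assuming outputArray is the plain (47104, 5) float array from the question; the branch names and the output file name are placeholders:

import numpy as np
import root_numpy as rnp

# plain 2-D float arrays have dtype.names == None, which is what trips up array2root;
# give each column a field name so the array becomes a structured (record) array
branch_names = ['var1', 'var2', 'var3', 'var4', 'var5']   # placeholders for your real variable names
structured = np.core.records.fromarrays(outputArray.T, names=branch_names)

rnp.array2root(structured, 'training.root', treename='Training_Variables', mode='recreate')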
I am trying to create a PETSc matrix from an already existing CSC matrix. With this in mind I created the following example code:
import numpy as np
import scipy.sparse as sp
import math as math
from petsc4py import PETSc
n=100
A = sp.csc_matrix((n,n),dtype=np.complex128)
print A.shape
A[1:5,:]=1+1j*5*math.pi
p1=A.indptr
p2=A.indices
p3=A.data
petsc_mat = PETSc.Mat().createAIJ(size=A.shape,csr=(p1,p2,p3))
This works perfectly well as long as the matrix only consists of real values. When the matrix is complex, running this piece of code results in a
TypeError: Cannot cast array data from dtype('complex128') to dtype('float64') according to the rule 'safe'.
I tried to figure out where the error occurs exactly, but could not make much sense of the traceback:
petsc_mat = PETSc.Mat().createAIJ(size=A.shape,csr=(p1,p2,p3))
  File "Mat.pyx", line 265, in petsc4py.PETSc.Mat.createAIJ (src/petsc4py.PETSc.c:98970)
File "petscmat.pxi", line 662, in petsc4py.PETSc.Mat_AllocAIJ (src/petsc4py.PETSc.c:24264)
File "petscmat.pxi", line 633, in petsc4py.PETSc.Mat_AllocAIJ_CSR (src/petsc4py.PETSc.c:23858)
File "arraynpy.pxi", line 136, in petsc4py.PETSc.iarray_s (src/petsc4py.PETSc.c:8048)
File "arraynpy.pxi", line 117, in petsc4py.PETSc.iarray (src/petsc4py.PETSc.c:7771)
Is there an efficient way of creating a PETSc matrix (from which I want to retrieve some eigenpairs later) from a complex SciPy CSC matrix?
I would be really happy if you guys could help me find my (hopefully not too obvious) mistake.
I had trouble getting PETSc to work, so I configured it more than once, and in the last run I evidently forgot the option --with-scalar-type=complex.
This is what I should have done:
Either check the log file $PETSC_DIR/arch-linux2-c-opt/conf/configure.log.
Or take a look at the reconfigure-arch-linux2-c-opt.py.
There you can find all the options you used to configure PETSc. In case you use SLEPc as well, you also need to recompile it. After I added the option (--with-scalar-type=complex) to the reconfigure script and ran it, everything works perfectly fine.
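As a quick sanity check before pushing complex data, petsc4py exposes the scalar type the installed PETSc was built with; this is just a small sketch of how one might verify it:

import numpy as np
from petsc4py import PETSc

# PETSc.ScalarType mirrors the --with-scalar-type configure option
print PETSc.ScalarType          # numpy.complex128 for a complex build, numpy.float64 otherwise

if not np.issubdtype(PETSc.ScalarType, np.complexfloating):
    raise RuntimeError("PETSc was built with real scalars; "
                       "reconfigure with --with-scalar-type=complex")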
I just started to use ArcPy to analyse geo-data with ArcGIS. The analysis has different steps, which are to be executed one after the other.
Here is some pseudo-code:
import arcpy

# create a masking variable
mask1 = "mask.shp"

# create a list of raster files
files_to_process = ["raster1.tif", "raster2.tif", "raster3.tif"]

# step 1 (e.g. clipping of each raster to study extent)
for index, item in enumerate(files_to_process):
    raster_i = "temp/ras_tem_" + str(index) + ".tif"
    arcpy.Clip_management(item, '#', raster_i, mask1)

# step 2 (e.g. change projection of raster files)
...

# step 3 (e.g. calculate some statistics for each raster)
...

etc.
This code works amazingly well so far. However, the raster files are big and some steps take quite long to execute (5-60 minutes). Therefore, I would like to execute those steps only if the input raster data changes. From the GIS-workflow point of view, this shouldn't be a problem, because each step saves a physical result on the hard disk which is then used as input by the next step.
I guess if I want to temporarily disable e.g. step 1, I could simply put a # in front of every line of that step. However, in the real analysis each step can have a lot of lines of code, so I would prefer to move the code for each step into a separate file (e.g. "step1.py", "step2.py", ...) and then execute each file.
I experimented with execfile("step1.py"), but received the error NameError: global name 'files_to_process' is not defined. It seems that variables defined in the main script are not automatically passed to the scripts called by execfile.
I also tried this, but I received the same error as above.
I'm a total Python newbie (as you might have figured out from my misuse of Python terminology), and I would be very thankful for any advice on how to organize such a GIS project.
I think what you want to do is build each step into a function. These functions can be stored in the same script file or in their own module that gets loaded with the import statement (just like arcpy). The pseudo code would be something like this:
#file 1: steps.py
def step1(input_files):
    # step 1 code goes here
    print 'step 1 complete'
    return

def step2(input_files):
    # step 2 code goes here
    print 'step 2 complete'
    return output  # optionally return a derivative here

#...and so on
Then in a second file in the same directory, you can import and call the functions passing the rasters as your inputs.
#file 2: analyze.py
import steps
files_to_process = ["raster1.tif", "raster2.tif", "raster3.tif"]
steps.step1(files_to_process)
#steps.step2(files_to_process) # uncomment this when you're ready for step 2
Now you can selectively call different steps of your code, and it only requires commenting/excluding one line instead of a whole chunk of code. Hopefully I understood your question correctly.
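If you later want steps to skip themselves automatically when their outputs are already up to date (as mentioned in the question), you could guard each call with a freshness check on the files a step writes. This is only a sketch of the idea; is_up_to_date and the temp/ras_tem_*.tif naming are assumptions taken from the pseudo-code above:

import os

def is_up_to_date(output_path, input_paths):
    # a step's output is fresh if it exists and is newer than every input file
    if not os.path.exists(output_path):
        return False
    out_time = os.path.getmtime(output_path)
    return all(os.path.getmtime(p) <= out_time for p in input_paths)

# in analyze.py: only re-run step 1 if any clipped raster is missing or stale
clipped = ["temp/ras_tem_" + str(i) + ".tif" for i in range(len(files_to_process))]
if not all(is_up_to_date(out, [src]) for out, src in zip(clipped, files_to_process)):
    steps.step1(files_to_process)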