Using oct2py to call corrcoef.m on several (10MM+) size dataframes to return [R,P] matrices to generate training sets for a ML algorithm. Yesterday, I had this working no problem. Ran the script from the top this morning, returning an identical test set to be passed to Octave through oct2py.
I am being returned:
Oct2PyError: Octave evaluation error:
error: isnan: not defined for cell
error: called from:
corrcoef at line 152, column 5
CorrCoefScript at line 1, column 7
First, there are no null/nan values in the set. In fact, there aren't even any zeros. There is no uniformity in any column such that there is no standard deviation being returned in the corrcoef calculation. It is mathmatically sound.
Second, when I load the test set into Octave through the GUI and execute the same .m on the same data no errors are returned, and the [R,P] matrices are identical to the saved outputs from last night. I tested to see if the matrix var is being passed to Octave through oct2py correctly, and Octave is receiving an identical matrix. However, oct2py can no longer execute ANY .m with a nan check in the source code. The error above is returned for any Octave packaged .m script that contains .isnan at any point.
For s&g, I modified my .m to receive the matrix var and write it to a flat file like so:
csvwrite ('filename', data);
This also fails with an fprintf error; if I run the same code on the same dataset inside of the Octave GUI, works fine.
I'm at a loss here. I updated conda, oct2py, and Octave with the same results. Again, the exact code with the exact data ran behaved as expected less than 24 hours prior.
I'm using the code below in Jupyter Notebook to test:
%env OCTAVE_EXECUTABLE = F:\Octave\Octave-5.1.0.0\mingw32\bin\octave-cli-5.1.0.exe
import oct2py
from oct2py import octave
octave.addpath('F:\\FinanceServer\\Python\\Secondary Docs\\autotesting\\atOctave_Scripts');
data = x
octave.push('data',data)
octave.eval('CorrCoefScript')
cmat = octave.pull('R')
enter code here
Side note - I am only having this issue inside of a specific .ipynb script. Through some stroke of luck, the no other scripts using oct2py seem to be affected.
Got it fixed, but it generates more questions than answers. I was using a list of dataframes to loop by type, such that for each iteration i, x was generated through x = dflst[i]. For reasons beyond my understanding, that failed with the passage of time. However, by writing my loop into a custom function and explicitly calling each data frame within that function as so: oct_func(type1df) I am seeing the expected behavior and desired outcome. However, I still cannot use a loop to pass the dataframes to oct_func(). So, it's a band-aid solution that will fit my purposes, but is frustratingly unable to scale .
Edit:
The loop works fine if iterating through a dict of dataframes instead of a list.
Related
After building and installing the Python engine shipped with Matlab 2019b in Anaconda
(TestEnvironment) PS C:\Program Files\MATLAB\R2019b\extern\engines\python> C:\Users\USER\Anaconda3\envs\TestEnvironment\python.exe .\setup.py build -b C:\Users\USER\MATLAB\build_temp install
for Python 3.7 I wrote a simple script to test a couple of features I'm interested in:
import matlab.engine as ml_e
# Start Matlab engine
eng = ml_e.start_matlab()
# Load MAT file into engine. The result is a dictionary
mat_file = "samples/lena.mat"
lenaMat = eng.load("samples/lena.mat")
print("Variables found in \"" + mat_file + "\"")
for key in lenaMat.keys():
print(key)
# print(lenaMat["lena512"])
# Use the variable from the MAT file to display it as an image
eng.imshow(lenaMat["lena512"], [])
I have a problem with imshow() (or any similar function that displays a figure in the Matlab GUI on the screen) namely that it shows quickly and then disappears, which - I guess - at least confirms that it is possible to use it. The only possibility to keep it on the screen is to add an infinite loop at the end:
while True:
continue
For obvious reasons this is not a good solution. I am not looking for a conversion of Matlab data to NumPy or similar and displaying it using matplotlib or similar third party libraries (I am aware that SciPy can load MAT files for example). The reason is simple - I would like to use Matlab (including loading whole environments) and for debugging purposes I'd like to be able to show this and that result without having to go through loops and hoops of converting the data manually.
I'm not sure if I'm missing something I'm still new at Python, but I am reading a lot of Matlab files in a folder. In each matlab file, there are multiple arrays and I can do things with each array like plot and find the average, max min etc. My code works perfectly and reads the data correctly. Now, I want to add a while loop so it can keep running until I tell it to stop, meaning queuing the user to keep choosing folders that need the data to be read. However, when I run it the second time, it gives me this error TypeError: 'list' object is not callable
Correct me if I'm wrong, but I feel like what the code is doing is adding the next set of data, to the overall data of the program. Which is why it gives me an error when at this line maxGF=max(GainF). Because then it becomes an array of arrays...and it can't take the maximum of that.
when I load data from each matlabfile, this is how I did it:
Files=[] #list of files
for s in os.listdir(filename):
Files.append(scipy.io.loadmat(filename+s))
for filenumber in range(0,length):
#Gets information from MATLAB file and inserts it into array
results[filenumber]=Files[filenumber]['results']
#Paramaters within MATLAB
Gain_loading[filenumber]=results[filenumber]['PowerDomain'][0,0]['Gain'][0,0] #gets length of data within data array
length_of_data_array=len(Gain_loading[filenumber])
Gain[filenumber]=np.reshape(results[filenumber]['PowerDomain'][0,0]['Gain'][0,0],length_of_data_array) #reshapes for graphing purposes
PL_f0_dBm[filenumber]=np.reshape(results[filenumber]['PowerDomain'][0,0]['PL_f0_dBm'][0,0],length_of_data_array)
Pavs_dBm[filenumber]=np.reshape(results[filenumber]['PowerDomain'][0,0]['Pavs_dBm'][0,0],length_of_data_array)
PL_f0[filenumber]=np.reshape(results[filenumber]['PowerDomain'][0,0]['PL_f0'][0,0],length_of_data_array)
PL_2f0_dBm[filenumber]=np.reshape(results[filenumber]['PowerDomain'][0,0]['PL_2f0_dBm'][0,0],length_of_data_array)
CarrierFrequency[filenumber]=np.reshape(results[filenumber]['MeasurementParameters'][0,0]['CarrierFrequency'][0,0],length_of_data_array)
Gamma_In[filenumber]=np.reshape(abs(results[filenumber]['PowerDomain'][0,0]['Gin_f0'][0,0]),length_of_data_array)
Delta[filenumber]=PL_2f0_dBm[filenumber]-PL_f0_dBm[filenumber]
When I start doing stuff with the data like below, it works and it displays the correct data up until I run the max(GainF) command.
GainF=[None]*length
MaxPL_Gain=[None]*length
#finds highest value of PL # each frequency
for c in range(0,length):
MaxPL_Gain[c]=max(PL_f0_dBm[c])
MaxPL_Gain[:]=[x-6 for x in MaxPL_Gain] #subtracts 6dB from highest PL Values # each frequency
for c in range (0,length):
GainF[c]=np.interp(MaxPL_Gain[c],PL_f0_dBm[c],Gain[c]) #interpolates PL vs Gain. Finds value of Gain at 6dB backoff form max PL
maxGF=max(GainF)
I read other threads that said to use the seek(0) function. And I tried that with Files.seek(0) since that is where all my of data is initially saved, but when I run it it gives me the same error: AttributeError: 'list' object has no attribute 'seek'
How do I reset all of my data? Help
UPDATE:
I tried the following code
for name in dir():
if not name.startswith('_'):
del globals()[name]
And it works the way I want it to...or so I think. When I look at the PDF outputed from the program, I get distorted plots. It looks like the axis from the last program is still in the pdfs. Not just that, but when I run it 4-5 times, the spacing gets bigger and bigger and the graphs are further away from each other. How can I fix this error?
distorted graphs
I am facing a little problem while calling a variable from a different file. I
have two different files train_dataset.py and test_dataset.py. I run the
train_dataset.py file from my IDE and note the value of the array variable
array_val as given below.
array([[ 0.08695652, 0.66459627, 0.08695652, 0.07453416, 0.07453416,
... 0.15217391]])
Now I switch on to test_dataset.py and import import train_dataset and print
the value of array_val by calling train_dataset.array_val, I see a very
different output. The output is given below.
array([[ 8.11594203e-01, 1.15942029e-01, 4.05797101e-01,
... 1.30434783e-01, 5.65217391e-01, 2.02898551e-01]])
Please suggest how do I get rid of it and state the reason of the discrepancy.
Please find the code that I have embedded in my train_dataset.py
no_of_clusters=9
cluster_centroids=[]
k_means=KMeans(n_clusters=no_of_clusters,n_init=14, max_iter=400)
k_means.fit(matrix_for_cluster)
labels=k_means.labels_
array_val=k_means.cluster_centers_
i.e matrix_for_cluster is a numpy n-dimensional array.
In my test_dataset.py all I do is
import train_dataset
print train_dataset.array_val
This is probably due to the random initialization of the k-means algorithm
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
As #ali_m explains nicely in the comments, the line import train_dataset re-runs the clustering and the cluster centers are not actually saved from the previous time you ran the code. To do that you can serialise the data with
http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html#numpy.load
I am currently using a Python script to process information stored in the EnSight Gold format. My Python (2.6) scipt uses VTK (5.10.0) to process the file, where I used the vtkEnSightGoldReader for reading the data, and loop over time steps. In principle this works find for smaller datasets, however, for large datasets (GBs), I see the memory usage (via top) increasing with time while the process is running. This filling of the memory goes slow, but in some cases problems are inevitable.
The following script is the minimal productive script that I reduced my issue to.
import vtk
reader = vtk.vtkEnSightGoldReader()
reader.SetCaseFileName("case.case")
reader.Update()
# Get time values
timeset=reader.GetTimeSets()
time=timeset.GetItem(0)
timesteps=time.GetSize()
#reader.ReleaseDataFlagOn()
for j in range(timesteps):
curTime=time.GetTuple(j)[0]
print curTime
reader.SetTimeValue(curTime)
reader.Update()
#reader.RemoveAllInputs()
My question is, how can I unload/replace the data that is stored in the memory, instead of using more memory continuously?
As you can see in my source code, I tried member functions "RemoveAllInputs" and "ReleaseDataFlagOn", but they don't work or I used them in the wrong way. Infortunately, I am not getting any closer to a solution.
Something else I tried is the DeepCopy() approach, which I found on the VTK website. However, it seems that this approach is not useful for me, because I get the memory issues even before calling GetOutput()
There is indeed a (minor) memory leak in the vtkEnsightGoldReader. The memory leak is a result of not properly clear collection object, which becomes apparent only for processing very large datasets. Technically it is not a memoryleak, since it gets properly cleared after a run.
This can only be solved by applying a patch to the VTK source and recompiling. I received the patch below via people from Kitware, so I would assume this rolled out in later versions of VTK.
diff --git a/IO/vtkEnSightReader.cxx b/IO/vtkEnSightReader.cxx
index 68a9b8f..7ab8ddd 100644
--- a/IO/vtkEnSightReader.cxx
+++ b/IO/vtkEnSightReader.cxx
## -985,6 +985,8 ## int vtkEnSightReader::ReadCaseFileTime(char* line)
int timeSet, numTimeSteps, i, filenameNum, increment, lineRead;
float timeStep;
+ this->TimeSetFileNameNumbers->RemoveAllItems();
+
// found TIME section
int firstTimeStep = 1;
I just started to use ArcPy to analyse geo-data with ArcGIS. The analysis has different steps, which are to be executed one after the other.
Here is some pseudo-code:
import arcpy
# create a masking variable
mask1 = "mask.shp"
# create a list of raster files
files_to_process = ["raster1.tif", "raster2.tif", "raster3.tif"]
# step 1 (e.g. clipping of each raster to study extent)
for index, item in enumerate(files_to_process):
raster_i = "temp/ras_tem_" + str(index) + ".tif"
arcpy.Clip_management(item, '#', raster_i, mask1)
# step 2 (e.g. change projection of raster files)
...
# step 3 (e.g. calculate some statistics for each raster)
...
etc.
This code works amazingly well so far. However, the raster files are big and some steps take quite long to execute (5-60 minutes). Therefore, I would like to execute those steps only if the input raster data changes. From the GIS-workflow point of view, this shouldn't be a problem, because each step saves a physical result on the hard disk which is then used as input by the next step.
I guess if I want to temporarily disable e.g. step 1, I could simply put a # in front of every line of this step. However, in the real analysis, each step might have a lot of lines of code, and I would therefore prefer to outsource the code of each step into a separate file (e.g. "step1.py", "step2.py",...), and then execute each file.
I experimented with execfile(step1.py), but received the error NameError: global name 'files_to_process' is not defined. It seems that the variables defined in the main script are not automatically passed to scripts called by execfile.
I also tried this, but I received the same error as above.
I'm a total Python newbie (as you might have figured out by the misuse of any Python-related expressions), and I would be very thankful for any advice on how to organize such a GIS project.
I think what you want to do is build each step into a function. These functions can be stored in the same script file or in their own module that gets loaded with the import statement (just like arcpy). The pseudo code would be something like this:
#file 1: steps.py
def step1(input_files):
# step 1 code goes here
print 'step 1 complete'
return
def step2(input_files):
# step 2 code goes here
print 'step 2 complete'
return output # optionally return a derivative here
#...and so on
Then in a second file in the same directory, you can import and call the functions passing the rasters as your inputs.
#file 2: analyze.py
import steps
files_to_process = ["raster1.tif", "raster2.tif", "raster3.tif"]
steps.step1(files_to_process)
#steps.step2(files_to_process) # uncomment this when you're ready for step 2
Now you can selectively call different steps of your code and it only requires commenting/excluding one line instead of a whle chunk of code. Hopefully I understood your question correctly.