I just started using ArcPy to analyse geo-data with ArcGIS. The analysis has several steps, which are to be executed one after the other.
Here is some pseudo-code:
import arcpy
# create a masking variable
mask1 = "mask.shp"
# create a list of raster files
files_to_process = ["raster1.tif", "raster2.tif", "raster3.tif"]
# step 1 (e.g. clipping of each raster to study extent)
for index, item in enumerate(files_to_process):
    raster_i = "temp/ras_tem_" + str(index) + ".tif"
    arcpy.Clip_management(item, '#', raster_i, mask1)
# step 2 (e.g. change projection of raster files)
...
# step 3 (e.g. calculate some statistics for each raster)
...
etc.
This code works amazingly well so far. However, the raster files are big and some steps take quite a long time to execute (5-60 minutes). Therefore, I would like to execute those steps only if the input raster data has changed. From the GIS-workflow point of view, this shouldn't be a problem, because each step saves a physical result on the hard disk, which is then used as input by the next step.
I guess if I want to temporarily disable e.g. step 1, I could simply put a # in front of every line of that step. However, in the real analysis each step might have a lot of lines of code, so I would prefer to move the code of each step into a separate file (e.g. "step1.py", "step2.py", ...) and then execute each file.
I experimented with execfile("step1.py"), but received the error NameError: global name 'files_to_process' is not defined. It seems that variables defined in the main script are not automatically passed to scripts called via execfile.
I also tried this, but I received the same error as above.
I'm a total Python newbie (as you might have figured out from my misuse of Python terminology), and I would be very thankful for any advice on how to organize such a GIS project.
I think what you want to do is build each step into a function. These functions can be stored in the same script file or in their own module that gets loaded with the import statement (just like arcpy). The pseudo code would be something like this:
#file 1: steps.py
def step1(input_files):
    # step 1 code goes here
    print 'step 1 complete'
    return

def step2(input_files):
    # step 2 code goes here
    print 'step 2 complete'
    return output  # optionally return a derivative here
#...and so on
Then in a second file in the same directory, you can import and call the functions passing the rasters as your inputs.
#file 2: analyze.py
import steps
files_to_process = ["raster1.tif", "raster2.tif", "raster3.tif"]
steps.step1(files_to_process)
#steps.step2(files_to_process) # uncomment this when you're ready for step 2
Now you can selectively call different steps of your code, and it only requires commenting out or excluding one line instead of a whole chunk of code. Hopefully I understood your question correctly.
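If you also want to skip a step automatically when its results are already on disk and up to date (the situation described in the question), one option is to guard each function call with a simple file check. Below is a minimal sketch of that idea; the helper name is_up_to_date and the example output path "temp/ras_tem_0.tif" are purely illustrative, not part of the original code:

import os

def is_up_to_date(output_path, input_paths):
    """Return True if output_path exists and is newer than all of its inputs."""
    if not os.path.exists(output_path):
        return False
    out_mtime = os.path.getmtime(output_path)
    return all(os.path.getmtime(p) <= out_mtime for p in input_paths)

# Only re-run step 1 if its result is missing or older than the input rasters.
if not is_up_to_date("temp/ras_tem_0.tif", files_to_process):
    steps.step1(files_to_process)

If this check lived in analyze.py, the steps module and files_to_process would already be defined there, as in the snippet above.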
Context
While training a neural network, I realized the time spent per batch increased when I increased the size of my dataset (without changing the batch size). The important part is that I need to fetch 20 .npy files per data point; this number doesn't depend on the dataset size.
Problem
Training goes from 2s/iteration to 10s/iteration...
There is no apparent reason why training would take longer. However, I managed to track down the bottleneck. It seems to have to do with the loading of the .npy files.
To reproduce this behavior, here's a small script you can run to generate 10,000 dummy .npy files:
import os
import random
import numpy as np

def path(i):
    return os.sep.join(('../datasets/test', str(i)))

def create_dummy_files(N=10000):
    for i in range(N):
        x = np.random.random((100, 100))
        np.save(path(random.getrandbits(128)), x)
Then you can run the following two scripts and compare them yourself:
The first script where 20 .npy files are randomly selected and loaded:
L = os.listdir('../datasets/test')
S = random.sample(L, 20)
for s in S:
    np.load(path(s))  # <- timed this
The second version, where 20 'sequential' .npy files are selected and loaded:
L = os.listdir('../datasets/test')
i = 100
S = L[i: i + 20]
for s in S:
    np.load(path(s))  # <- timed this
I tested both scripts and ran them 100 times each (in the 2nd script I used the iteration count as the value for i so the same files are not loaded twice). I wrapped the np.load(path(s)) line with time.time() calls. I'm not timing the sampling, only the loading. Here are the results:
Random loads: times roughly stay between 0.1 s and 0.4 s, with an average of 0.25 s.
Non-random loads: times roughly stay between 0.010 s and 0.014 s, with an average of 0.01 s.
I'm assuming those times are related to the CPU's activity when the scripts are loaded. However, that doesn't explain this gap. Why are these two results so different? Does it have something to do with the way files are indexed?
Edit: I printed S in the random-sample script, copied the list of 20 filenames, and then ran the script again with S defined literally as that list. The time it took is comparable to the 'sequential' script. This means it's not related to the files not being sequential on the filesystem or anything like that. It seems the random sampling gets counted in the timer, yet the timing is defined as:
t = time.time()
np.load(path(s))
print(time.time() - t)
I tried as well wrapping np.load (exclusively) with cProfile: same result.
I did say:
I tested both scripts and ran them 100 times each (in the 2nd script I used the iteration count as the value for i so the same files are not loaded twice)
But as tevemadar mentioned
i should be randomized
I completely messed up the operation of selecting different files in the second version. My code was timing the scripts 100 times like so:
for i in trange(100):
    if rand:
        S = random.sample(L, 20)
    else:
        S = L[i: i+20]  # <- every loop there's only 1 new file added to the selection;
                        #    the other 19 files will already have been cached by the previous fetch
For the second script, it should rather be S = L[100*i : 100*i + 20]!
And yes, when timing, the results are comparable.
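For reference, here is a minimal, self-contained sketch of the corrected comparison. It uses the same directory layout and time.time() timing as above; the bench helper and the final print lines are illustrative rather than from the original post, and OS file caching between the two runs can still skew the absolute numbers:

import os
import random
import time
import numpy as np

DATA_DIR = '../datasets/test'
L = os.listdir(DATA_DIR)

def bench(rand, batches=100, batch_size=20):
    """Time only the np.load calls, over `batches` batches of `batch_size` files."""
    total = 0.0
    for i in range(batches):
        if rand:
            S = random.sample(L, batch_size)
        else:
            S = L[100 * i : 100 * i + batch_size]  # non-overlapping slices, so no file is re-read
        for s in S:
            t = time.time()
            np.load(os.path.join(DATA_DIR, s))
            total += time.time() - t
    return total / batches

print('random:    ', bench(True))
print('sequential:', bench(False))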
So I have this bit of code, which clips a shapefile of a tree out of a lidar point cloud. When doing this for a single shapefile it works well.
What I want to do: I have 180 individual tree shapefiles and want to clip each one out of the same point cloud and save it as an individual .las file.
So in the end I should have 180 .las files. E.g. Input_shp: Tree11.shp -> Output_las: Tree11.las
I am sure that there is a way to do all of this at once. I just don't know how to select all the shapefiles and save the output to 180 individual .las files.
I'm really new to Python and any help would be appreciated.
I already tried to get this working with placeholders (.format()) but couldn't really get anywhere.
from WBT.whitebox_tools import WhiteboxTools
wbt = WhiteboxTools()
wbt.work_dir = "/home/david/Documents/Masterarbeit/Pycrown/Individual Trees/"
wbt.clip_lidar_to_polygon(i="Pointcloud_to_clip.las", polygons="tree_11.shp", output="Tree11.las")
I don't have the plugin you are using, but you may be looking for this code snippet:
import os
from WBT.whitebox_tools import WhiteboxTools

wbt = WhiteboxTools()
workDir = "/home/david/Documents/Masterarbeit/Pycrown/Individual Trees/"
wbt.work_dir = workDir

# If you want to select all the files in your work dir you can use the following,
# though you may need to make the path absolute, depending on where you run this:
filesInFolder = os.listdir(workDir)
numberOfShapeFiles = len([f for f in filesInFolder if f.endswith('.shp')])

# assume the shapefiles are numbered 0 to n-1
# loop over all your shapefiles
for fileNumber in range(numberOfShapeFiles):
    wbt.clip_lidar_to_polygon(
        i="Pointcloud_to_clip.las",
        polygons=f"tree_{fileNumber}.shp",
        output=f"Tree{fileNumber}.las"
    )
This makes use of Python f-strings (formatted string literals), along with the os.listdir function.
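If the shapefile names are not strictly numbered 0 to n-1 (the question mentions names like tree_11.shp), a variation that loops over the actual .shp filenames may be more robust. This is only a sketch built around the same clip_lidar_to_polygon call shown above; the glob pattern and the output-name construction are assumptions:

import glob
import os
from WBT.whitebox_tools import WhiteboxTools

wbt = WhiteboxTools()
workDir = "/home/david/Documents/Masterarbeit/Pycrown/Individual Trees/"
wbt.work_dir = workDir

# Clip the point cloud once per shapefile, naming each output after its input.
for shp in glob.glob(os.path.join(workDir, "tree_*.shp")):
    stem = os.path.splitext(os.path.basename(shp))[0]            # e.g. "tree_11"
    wbt.clip_lidar_to_polygon(
        i="Pointcloud_to_clip.las",
        polygons=os.path.basename(shp),
        output=stem.capitalize().replace("_", "") + ".las"       # e.g. "Tree11.las"
    )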
I'm using oct2py to call corrcoef.m on several large (10MM+) dataframes to return [R,P] matrices, which are used to generate training sets for an ML algorithm. Yesterday, I had this working with no problem. I ran the script from the top this morning, returning an identical test set to be passed to Octave through oct2py.
I am being returned:
Oct2PyError: Octave evaluation error:
error: isnan: not defined for cell
error: called from:
corrcoef at line 152, column 5
CorrCoefScript at line 1, column 7
First, there are no null/NaN values in the set. In fact, there aren't even any zeros. No column is uniform, so there is no zero standard deviation in the corrcoef calculation. It is mathematically sound.
Second, when I load the test set into Octave through the GUI and execute the same .m on the same data, no errors are returned, and the [R,P] matrices are identical to the saved outputs from last night. I tested whether the matrix variable is being passed to Octave through oct2py correctly, and Octave is receiving an identical matrix. However, oct2py can no longer execute ANY .m with a NaN check in the source code. The error above is returned for any Octave-packaged .m script that contains isnan at any point.
For s&g, I modified my .m to receive the matrix var and write it to a flat file like so:
csvwrite ('filename', data);
This also fails, with an fprintf error; if I run the same code on the same dataset inside the Octave GUI, it works fine.
I'm at a loss here. I updated conda, oct2py, and Octave, with the same results. Again, the exact same code with the exact same data behaved as expected less than 24 hours prior.
I'm using the code below in Jupyter Notebook to test:
%env OCTAVE_EXECUTABLE = F:\Octave\Octave-5.1.0.0\mingw32\bin\octave-cli-5.1.0.exe
import oct2py
from oct2py import octave
octave.addpath('F:\\FinanceServer\\Python\\Secondary Docs\\autotesting\\atOctave_Scripts');
data = x
octave.push('data',data)
octave.eval('CorrCoefScript')
cmat = octave.pull('R')
Side note - I am only having this issue inside a specific .ipynb script. Through some stroke of luck, no other scripts using oct2py seem to be affected.
Got it fixed, but it generates more questions than answers. I was using a list of dataframes to loop by type, such that for each iteration i, x was generated through x = dflst[i]. For reasons beyond my understanding, that stopped working over time. However, by writing my loop into a custom function and explicitly calling each dataframe within that function, like oct_func(type1df), I see the expected behavior and desired outcome. However, I still cannot use a loop to pass the dataframes to oct_func(). So it's a band-aid solution that fits my purposes, but it is frustratingly unable to scale.
Edit:
The loop works fine if iterating through a dict of dataframes instead of a list.
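For reference, here is a minimal sketch of the working pattern described above, reusing the push/eval/pull calls from the question. The dict keys, the dummy dataframes, and the oct_func wrapper name are illustrative only, and pushing the underlying NumPy array is my assumption (the original pushed the dataframe directly):

import numpy as np
import pandas as pd
from oct2py import octave

octave.addpath('F:\\FinanceServer\\Python\\Secondary Docs\\autotesting\\atOctave_Scripts')

def oct_func(df):
    """Push one dataframe's values to Octave, run CorrCoefScript, and pull back R."""
    octave.push('data', df.values)   # push a plain numeric matrix (assumption, not from the original post)
    octave.eval('CorrCoefScript')
    return octave.pull('R')

# Illustrative stand-ins for the real dataframes; looping over a dict works where a list did not.
df_dict = {
    'type1': pd.DataFrame(np.random.random((100, 5))),
    'type2': pd.DataFrame(np.random.random((100, 5))),
}
results = {name: oct_func(df) for name, df in df_dict.items()}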
I'm not sure if I'm missing something (I'm still new at Python), but I am reading a lot of Matlab files in a folder. In each Matlab file there are multiple arrays, and I can do things with each array like plot it and find the average, max, min, etc. My code works perfectly and reads the data correctly. Now I want to add a while loop so it can keep running until I tell it to stop, i.e. prompting the user to keep choosing folders whose data should be read. However, when I run it the second time, it gives me this error: TypeError: 'list' object is not callable.
Correct me if I'm wrong, but I feel like what the code is doing is adding the next set of data to the overall data of the program, which is why it gives me an error at the line maxGF=max(GainF): it becomes an array of arrays, and it can't take the maximum of that.
When I load data from each Matlab file, this is how I do it:
Files = []  # list of files
for s in os.listdir(filename):
    Files.append(scipy.io.loadmat(filename + s))

for filenumber in range(0, length):
    # Gets information from MATLAB file and inserts it into array
    results[filenumber] = Files[filenumber]['results']
    # Parameters within MATLAB
    Gain_loading[filenumber] = results[filenumber]['PowerDomain'][0,0]['Gain'][0,0]  # gets length of data within data array
    length_of_data_array = len(Gain_loading[filenumber])
    Gain[filenumber] = np.reshape(results[filenumber]['PowerDomain'][0,0]['Gain'][0,0], length_of_data_array)  # reshapes for graphing purposes
    PL_f0_dBm[filenumber] = np.reshape(results[filenumber]['PowerDomain'][0,0]['PL_f0_dBm'][0,0], length_of_data_array)
    Pavs_dBm[filenumber] = np.reshape(results[filenumber]['PowerDomain'][0,0]['Pavs_dBm'][0,0], length_of_data_array)
    PL_f0[filenumber] = np.reshape(results[filenumber]['PowerDomain'][0,0]['PL_f0'][0,0], length_of_data_array)
    PL_2f0_dBm[filenumber] = np.reshape(results[filenumber]['PowerDomain'][0,0]['PL_2f0_dBm'][0,0], length_of_data_array)
    CarrierFrequency[filenumber] = np.reshape(results[filenumber]['MeasurementParameters'][0,0]['CarrierFrequency'][0,0], length_of_data_array)
    Gamma_In[filenumber] = np.reshape(abs(results[filenumber]['PowerDomain'][0,0]['Gin_f0'][0,0]), length_of_data_array)
    Delta[filenumber] = PL_2f0_dBm[filenumber] - PL_f0_dBm[filenumber]
When I start doing stuff with the data like below, it works and it displays the correct data up until I run the max(GainF) command.
GainF = [None]*length
MaxPL_Gain = [None]*length

# finds highest value of PL at each frequency
for c in range(0, length):
    MaxPL_Gain[c] = max(PL_f0_dBm[c])

MaxPL_Gain[:] = [x - 6 for x in MaxPL_Gain]  # subtracts 6 dB from highest PL values at each frequency

for c in range(0, length):
    GainF[c] = np.interp(MaxPL_Gain[c], PL_f0_dBm[c], Gain[c])  # interpolates PL vs Gain; finds value of Gain at 6 dB backoff from max PL

maxGF = max(GainF)
I read other threads that said to use the seek(0) function, and I tried that with Files.seek(0), since that is where all of my data is initially saved, but when I run it, it gives me this error: AttributeError: 'list' object has no attribute 'seek'.
How do I reset all of my data?
UPDATE:
I tried the following code
for name in dir():
    if not name.startswith('_'):
        del globals()[name]
And it works the way I want it to... or so I think. When I look at the PDF output from the program, I get distorted plots. It looks like the axes from the last run are still in the PDFs. Not only that, but when I run it 4-5 times, the spacing gets bigger and bigger and the graphs end up further away from each other. How can I fix this?
[image: distorted graphs]
I have an in-house distributed computing library that we use all the time for parallel computing jobs. After the processes are partitioned, they run their data loading and computation steps and then finish with a "save" step. Usually this involves writing data to database tables.
But for a specific task, I need the output of each process to be a .png file with some data plots. There are 95 processes in total, so 95 .pngs.
Inside of my "save" step (executed on each process), I have some very simple code that makes a boxplot with matplotlib's boxplot function and some code that uses savefig to write it to a .png file that has a unique name based on the specific data used in that process.
However, I occasionally see output where it appears that two or more sets of data were written into the same output file, despite the unique names.
Does matplotlib use temporary file saves when making boxplots or saving figures? If so, does it always use the same temp file names (thus leading to over-write conflicts)? I have run my process using strace and cannot see anything that obviously looks like temp file writing from matplotlib.
How can I ensure that this is thread-safe? I definitely want to do the file saving in parallel, as I am looking to expand the number of output .pngs considerably, so the option of first storing all the data and then serially executing the plot/save portion is very undesirable.
It's impossible for me to reproduce the full parallel infrastructure we are using, but below is the function that gets called to create the plot handle, and then the function that gets called to save the plot. You should assume, for the sake of the question, that the thread-safety issue has nothing to do with our distributed library. We know it's not coming from our code, which has been used for years for our multiprocessing jobs without threading issues like this (especially not for something we don't directly control, like temp files from matplotlib).
import pandas
import numpy as np
import matplotlib.pyplot as plt
def plot_category_data(betas, category_name):
    """
    Function to organize beta data by date into vectors and pass to box plot
    code for producing a single chart of multi-period box plots.
    """
    beta_vector_list = []
    yms = np.sort(betas.yearmonth.unique())
    for ym in yms:
        beta_vector_list.append(betas[betas.yearmonth==ym].Beta.values.flatten().tolist())
    ###
    plot_output = plt.boxplot(beta_vector_list)
    axs = plt.gcf().gca()
    axs.set_xticklabels(betas.FactorDate.unique(), rotation=40, horizontalalignment='right')
    axs.set_xlabel("Date")
    axs.set_ylabel("Beta")
    axs.set_title("%s Beta to BMI Global"%(category_name))
    axs.set_ylim((-1.0, 3.0))
    return plot_output
### End plot_category_data

def save(self):
    """
    Make calls to store the plot to the desired output file.
    """
    out_file = self.output_path + "%s.png"%(self.category_name)
    fig = plt.gcf()
    fig.set_figheight(6.5)
    fig.set_figwidth(10)
    fig.savefig(out_file, bbox_inches='tight', dpi=150)
    print "Finished and stored output file %s"%(out_file)
    return None
### End save
In your two functions, you're calling plt.gcf(). I would try generating a new figure every time you plot, with plt.figure(), and referencing that figure explicitly, so you sidestep the whole issue entirely.
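Here is a minimal sketch of that suggestion, adapted from the two functions above. The function signatures are simplified for illustration (the original save was a method using self.output_path and self.category_name), and the use of plt.subplots and plt.close is my addition rather than anything from the original code:

import numpy as np
import matplotlib.pyplot as plt

def plot_category_data(betas, category_name):
    # Create a dedicated figure/axes per call instead of relying on plt.gcf().
    fig, axs = plt.subplots(figsize=(10, 6.5))
    beta_vector_list = [betas[betas.yearmonth == ym].Beta.values.flatten().tolist()
                        for ym in np.sort(betas.yearmonth.unique())]
    axs.boxplot(beta_vector_list)
    axs.set_xticklabels(betas.FactorDate.unique(), rotation=40, horizontalalignment='right')
    axs.set_xlabel("Date")
    axs.set_ylabel("Beta")
    axs.set_title("%s Beta to BMI Global" % category_name)
    axs.set_ylim((-1.0, 3.0))
    return fig  # hand this specific figure to the save step

def save(fig, out_file):
    # Save and then close this specific figure so no state leaks between plots.
    fig.savefig(out_file, bbox_inches='tight', dpi=150)
    plt.close(fig)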