Pythonic way of importing 'complex' script result to another script - python

This is probably going to be flagged as a duplicate, as I have gone through a lot of posts on this matter, but I really need help with this particular example.
Let's say I have a main script built for plotting some data, plus a few other scripts that I would like to import into that main plotting script in order to retrieve those data.
I found that the most Pythonic way of doing so is to import the script and then use it like a module, but since I am pretty new to Python I can't figure out how to do that properly with these pieces of code.
I am using code written by someone else and adapted to fit my purpose, so that might be an issue too.
I want to save some resources and avoid writing and reading files to disk at every step, which is why I want to extract the data within my plotting script and keep the pandas Panel in memory for plotting.
Ideally my master_plot.py would look like this:
# functions for the plotting part

if __name__ == '__main__':
    import Tab
    data = Tab.main(some_path)  # this script would calculate tab and return a pd.Panel that will be used by the plotting section
    plot(data)
Now for the actual tab script:
#!/usr/bin/env python
class SubDomains( object ):
    '''
    a class that reads in and prepares a dataset to be used in masking
    '''
    def __init__( self, fiona_shape, rasterio_raster, id_field, **kwargs ):
        # do stuff
    @staticmethod
    def rasterize_subdomains( fiona_shape, rasterio_raster, id_field ):
        from rasterio.features import rasterize
        import six
        import numpy as np
        # do stuff
    def _domains_name( self, fiona_shape, rasterio_raster, **kwargs ):
        # do stuff
    def _domains_generator( self, **kwargs ):
        # do stuff

def extract_tab( filelist, subdomains_arr ):
    '''
    extract the number of burned pixels across subdomains
    '''
    def read_firescar( x ):
        # do stuff
    def tab_counts( x ):
        # do stuff
    def get_year_fn( x ):
        # do stuff
    return pd.DataFrame( rep_arr ).T

def tab_processing(maps_path):
    l = pd.Series( sorted( glob.glob( os.path.join( maps_path, 'FireScar*.tif' ) ) ) )
    # now let's group by the repnum by splitting the filenames with a function
    def get_rep_fn( x ):
        ''' function to split the repnum out of an ALF output filename '''
        return os.path.basename( x ).split( '_' )[ 1 ]
    rep_groups = l.groupby( l.map( get_rep_fn ) )
    # initialize a SubDomains object
    sub_domains = SubDomains( shp, rasterio.open( l[0] ), 'Id' )
    # now extract the newly rasterized sub domains numpy array
    subdomains_arr = sub_domains.sub_domains
    pool = mp.Pool( 32 )
    out = pool.map( lambda x: extract_tab( x, subdomains_arr ), [ group.tolist() for rep_num, group in rep_groups ] )
    pool.close()
    # now there is a list of pd.DataFrame objects that we can collapse into a single 3-D pd.Panel
    rep_list = sorted( [ int(rep_num) for rep_num, group in rep_groups ] )
    year_list = sorted( out[0].index.astype(int) )
    # rep_list = np.repeat( rep_list, len(year_list) )
    # year_list = np.array([year_list for i in range(len(rep_list)) ]).ravel()
    tab_panel = pd.Panel( { i:j for i,j in zip(rep_list, out) } )
    return tab_panel

if __name__ == '__main__':
    import os, sys, re, rasterio, glob, fiona, shapely
    import pandas as pd
    import numpy as np
    from pathos import multiprocessing as mp
    from collections import defaultdict

    path = sys.argv[1]
    tab_processing(path)
This script was standalone before, so I am not sure whether the best approach would be to call subprocess or whether it could still be used as a module. One of the issues with this exact version is global name 'pd' is not defined, which makes sense, but I don't really see how to fix that problem, as all the functions and classes are used in the tab_processing function. I wish I could just run the whole thing instead of running a single function, so maybe subprocess is the best practice here?
Not sure if it changes anything, but this goes through a lot of data and so needs to be pretty efficient resource-wise.
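For what it's worth, here is a minimal sketch of how Tab.py could be reorganized so that it can be imported: keep the imports at module level (which also fixes the pd NameError) and expose the same entry point to both the command line and the importing script. The tab_processing body below is only a stand-in for the real one above.
import sys
import pandas as pd
import numpy as np

def tab_processing(maps_path):
    # stand-in for the real function above; because pd/np are imported at
    # module level, every function in this file can see them after an import
    return pd.DataFrame(np.zeros((1, 1)))

def main(maps_path):
    return tab_processing(maps_path)

if __name__ == '__main__':
    # the command-line entry point just calls the same function an importer would
    main(sys.argv[1])
With that layout, master_plot.py can keep data = Tab.main(some_path) and nothing has to be written to disk.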


Python multiprocessing can't use functions from other module

Update: it's working after updating my Spyder to 5.0.5. Thanks everyone!
I am trying to speed up a loop using multiprocessing. The code below aims to generate 10000 random vectors.
My idea is to split the task into 5 processes and store the output in result. However, it returns an empty list when I run the code.
But if I remove result = add_one(result) in the randomize_data function, the code runs perfectly. So the error must come from using a function from another module (Testing.test) inside multiprocessing.
Here is the add_one function from Testing.test:
def add_one(x):
    return x + 1
How can I use a function from another module inside a process? Thank you.
import multiprocessing
import numpy as np
import pandas as pd

def randomize_data(mean, cov, n_init, proc_num, return_dict):
    result = pd.DataFrame()
    for _ in range(n_init):
        temp = np.random.multivariate_normal(mean, cov)
        result = result.append(pd.Series(temp), ignore_index=True)
    result = add_one(result)
    return_dict[proc_num] = result

if __name__ == "__main__":
    from Testing.test import add_one

    mean = np.arange(0, 1, 0.1)
    cov = np.identity(len(mean))
    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=randomize_data, args=(mean, cov, 2000, i, return_dict))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    result = return_dict.values()
The issue here is pretty obvious:
You imported add_one in a local scope, not in the global one. Because of this, the reference to that function only exists inside your if __name__ == "__main__": block.
Move the import statement to the top of your file with the other imports, and your code should work:
import multiprocessing
import numpy as np
import pandas as pd
from Testing.test import add_one

dask.delayed and import statements

I'm learning dask and I want to generate random strings. But this only works if the import statements are inside the function f.
This works:
import dask
from dask.distributed import Client, progress

c = Client(host='scheduler')

def f():
    from random import choices
    from string import ascii_letters
    rand_str = lambda n: ''.join(choices(population=list(ascii_letters), k=n))
    return rand_str(5)

xs = []
for i in range(3):
    x = dask.delayed(f)()
    xs.append(x)

res = c.compute(xs)
print([r.result() for r in res])
This prints something like ['myvDi', 'rZnYO', 'MyzaG']. This is good, as the strings are random.
This, however, doesn't work:
from random import choices
from string import ascii_letters
import dask
from dask.distributed import Client, progress

c = Client(host='scheduler')

def f():
    rand_str = lambda n: ''.join(choices(population=list(ascii_letters), k=n))
    return rand_str(5)

xs = []
for i in range(3):
    x = dask.delayed(f)()
    xs.append(x)

res = c.compute(xs)
print([r.result() for r in res])
This prints something like ['tySQP', 'tySQP', 'tySQP'], which is bad because all the random strings are the same.
So I'm curious how to distribute larger, non-trivial code. My goal is to be able to pass arbitrary JSON to a dask.delayed function and have that function perform analysis using other modules, like Google's ortools.
Any suggestions?
Python's random module is odd.
It creates some state when it is first imported and uses that state when generating random numbers. Unfortunately, having this state around makes it difficult to serialize and move between processes.
Your solution of importing random within your function is what I do.
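An alternative sketch, if importing inside the function is not convenient: make the randomness explicit per task by giving each delayed call its own seed, so nothing depends on module-level generator state.
import dask
from random import Random
from string import ascii_letters

def f(seed):
    rng = Random(seed)  # a fresh generator per call, no shared module-level state
    return ''.join(rng.choices(list(ascii_letters), k=5))

xs = [dask.delayed(f)(i) for i in range(3)]
# res = c.compute(xs) as in the question; each task now produces a distinct string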

How to easily and efficiently store simulation data for numpy ufuncs in OO code

In a Jupyter notebook I modeled a resource in an OO way, but in the control loop I need to aggregate data over multiple objects, which is inefficient compared to ufuncs and similar operations. I chose OO to package functionality, but for efficient and concise code I probably have to pull the data out into a storage class (maybe) and push all the ri[0] lines into a 2D array, in this case of shape (2, K).
The class does not need the whole log, only the last entries.
import numpy as np

K = 100

class Resource:
    def __init__(self):
        self.log = np.random.random( (5, K) )
        # log gets filled during simulation

r0 = Resource()
r1 = Resource()

# while control loop:
# aggregate control data
for k in range(K):
    total_row_0 = r0.log[0][k] + r1.log[0][k]
    # do sth with the totals and loop again
This would greatly improve performance, but I have difficulty linking the data back to the class if it is stored separately. How would you approach this? pandas DataFrames, a NumPy view, or a shallow copy?
[[...],   # r0
 [...]]   # r1 - same data in one array, efficient, but mapping back to the class is difficult
Here is my take on it:
import numpy as np

K = 3

class Res:
    logs = 2

    def __init__(self):
        self.log = None

    def set_log(self, view):
        self.log = view

batteries = [Res(), Res()]
d = {'Res': np.random.random( (Res.logs * len(batteries), K) )}

for i in range(len(batteries)):
    view = d['Res'].view()[i::len(batteries)][:]
    batteries[i].set_log(view)

print(d)
batteries[1].log[1][2] = 1  # test: modifies the view of the last entry of the second log of the second Res
print(d)
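With that layout, the aggregation from the question's control loop can also be done in one vectorized call, since the first len(batteries) rows of d['Res'] hold log row 0 of every Res object (a sketch):
# element-wise total of r0.log[0] and r1.log[0] over all K entries
total_row_0 = d['Res'][:len(batteries)].sum(axis=0)  # shape (K,)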

How to plot binary data in python?

First of all, I know my question is frequently asked, but I have not found a solution in any of those posts.
I work with USBTMC to control an oscilloscope (here you can find more information about it). I am able to capture the screen and write it into a file (see picture). But I want to plot the screen every n seconds in real time, for example with matplotlib.pyplot.
Here is my code (with a desperate attempt to plot data with pyplot):
import usbtmc
from time import sleep
import matplotlib.pyplot as plot
import numpy as np
import subprocess

maxTries = 3

scope = usbtmc.Instrument(0x0699, 0x03a6)
print scope.ask("*IDN?")

scope.write("ACQ:STOPA SEQ")
scope.write("ACQ:STATE ON")

while ( True ):
    # get trigger state
    trigState = scope.ask("TRIG:STATE?")
    # check if Acq complete
    if ( trigState.endswith('SAVE') ):
        print 'Acquisition complete. Writing into file ...'
        # save screen
        scope.write("SAVE:IMAG:FILEF PNG")
        scope.write("HARDCOPY START")
        # HERE I get binary data
        screenData = scope.read_raw()
        # HERE I try to convert it?
        strData = np.fromstring( screenData, dtype=np.uint8 )
        # HERE I try to plot the previous
        plot.plot( strData )
        plot.show()
        # rewrite in file (this works fine)
        try:
            outFile = open("screen.png", "wb")
            outFile.write( screenData )
        except IOError:
            print 'Error: cannot write to file'
        else:
            print 'Data was written successfully in file: ', outFile.name
        finally:
            outFile.close()
    # continue doing something
After running this code I get ... (look at the picture).
Unfortunately I cannot test it, but you may try something like this
import io
import matplotlib.pyplot as plt

screenData = scope.read_raw()
arrayData = plt.imread(io.BytesIO(screenData))
plt.imshow(arrayData)
plt.show()
I would also like to note that for live plotting it is probably better not to fetch the image of the scope's screen but the raw waveform data. This should allow for much faster operation.
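For example, a rough live-update sketch, reusing the scope object from the question; the "CURVE?" query and the int8 decoding are placeholders that depend on the instrument's data-transfer settings:
import numpy as np
import matplotlib.pyplot as plt

plt.ion()
fig, ax = plt.subplots()
line, = ax.plot([], [])

while True:
    scope.write("CURVE?")                      # placeholder query for the raw waveform
    raw = scope.read_raw()
    trace = np.frombuffer(raw, dtype=np.int8)  # decoding depends on the scope's DATA settings
    line.set_data(np.arange(trace.size), trace)
    ax.relim()
    ax.autoscale_view()
    plt.pause(1)                               # redraw roughly every second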

How can I speed up my Python Code by multiprocessing (parallel-processing)?

My Python code is very straightforward: it reads netCDF files from a file list and returns the mean value in each case.
However, reading the netCDF files takes time. I am wondering whether I can speed up this process with multiprocessing (parallel processing), since my workstation has a 32-core processor.
The code looks like:
from netCDF4 import Dataset
import numpy as np

for i in filerange:
    print "Reading the", i, "file", "Wait"
    infile_Radar = Dataset(file_list[i], 'r')
    # Read the hourly data
    Radar_rain = np.array(infile_Radar.variables['rain'][:])
    for h in range(0, 24):
        hourly_rain = Radar_rain[h, :]
        hourly_mean[i, h] = np.mean(hourly_rain)

np.savetxt('Hourly_Spatial_mean.txt', hourly_mean, delimiter='\t')
Since reading each file is independent of the others, how can I make the best use of my workstation? Thanks.
It seems like you're looking for a fairly standard threading implementation. Assuming that it's the Dataset constructor that's the blocking part, you may want to do something like this:
from threading import Thread

def CreateDataset( offset, files, datasets ):
    datasets[offset] = Dataset( files[offset], 'r' )

threads = [None] * len( filerange )
data_sets = [None] * len( filerange )

for i in filerange:
    threads[i] = Thread( None, CreateDataset, None, ( i, file_list, data_sets ) )
    threads[i].start()

for t in threads:
    t.join()

# Resume work with each item in the data_sets list
print "All Done"
Then for each dataset do the rest of the work you detailed. Wherever the actual "slow stuff" is, that's the basic approach.
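If you would rather use processes than threads on the 32-core machine, a hedged sketch along the same lines, computing the hourly means per file (file_list and the 'rain' variable layout are taken from the question), could look like this:
import numpy as np
from netCDF4 import Dataset
from multiprocessing import Pool

def hourly_means(path):
    # one worker reads one file and returns its 24 hourly spatial means
    nc = Dataset(path, 'r')
    rain = np.array(nc.variables['rain'][:])
    nc.close()
    return [np.mean(rain[h, :]) for h in range(24)]

if __name__ == '__main__':
    pool = Pool(32)  # one worker per core
    hourly_mean = np.array(pool.map(hourly_means, file_list))
    pool.close()
    pool.join()
    np.savetxt('Hourly_Spatial_mean.txt', hourly_mean, delimiter='\t')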
