Storing terms in fipy as arrays instead of fipy objects - python

I am new to fipy, so I apologise if this is a stupid question (and this doesn't seem to help me).
But is there a way to store fipy objects in human-readable (or python-readable) form, other than suggested in the question above? This is only applicable to the cell variable.
If I want to do some more fancy/customized plotting than what is in the default fipy viewer, how can I do it?
Take for example a simple 1D diffusion:
from fipy import *
# USER-DEFINED PARAMETERS
nx = 100
dx = 0.1
D = 1.0
bound1 = 30
bound2 = 70
# PREPARED FOR SOLUTION
mesh = Grid1D(nx=nx, dx=dx)
print("mesh", mesh)
# define some parameters specific to this solution
T0 = bound2
Tinf = bound1
hour = 3600
day = hour*24
ndays = 1
duration = ndays*day
T = CellVariable(name="Temperature", mesh=mesh, value=bound1)
# Constant temperature boundary condition
T.constrain(T0, mesh.facesLeft)
T.constrain(Tinf, mesh.facesRight)
# SOLUTION
eq = (TransientTerm() == DiffusionTerm(coeff=D))
timeStepDuration = 0.5*hour
steps = int(duration/timeStepDuration)
for step in range(steps):
    eq.solve(var=T, dt=timeStepDuration)
But could I, for example, store the mesh as an array? Or could I store the value of the DiffusionTerm instead of the CellVariable in each step?
In my case, I would like to plot the thermal gradient (so extract it from the diffusion term) with distance for each time step.
Can I do it? How?

But is there a way to store fipy objects in human-readable (or
python-readable) form, other than suggested in the question above?
There are a number of options. Any FiPy object can be pickled using fipy.dump, which will gather data when running in parallel. For example,
import fipy
mesh = fipy.Grid2D(nx=3, ny=3)
var = fipy.CellVariable(mesh=mesh)
var[:] = mesh.x * mesh.y
fipy.dump.write(var, 'dump.gz')
You can then read this back in another Python session with
var = fipy.dump.read('dump.gz')
However, pickle isn't great for long-term storage, as reading the data back depends on using the same version of the code. An alternative is to save a NumPy array using
import numpy as np
np.save('dump.npy', var)
and then read in with
var_array = np.load('dump.npy')
var = fipy.CellVariable(mesh=mesh, value=var_array)
If I want to do some more fancy/customized plotting than what is in
the default fipy viewer, how can I do it?
To save the data in a human-readable form, with the location and value data for plotting in another package, you might try using pandas:
import pandas
df = pandas.DataFrame({'x' : mesh.x, 'y': mesh.y, 'value': var})
df.to_csv('dump.csv')
But could I, for example, store the mesh as an array?
You can of course Pickle any Python object, but using knowledge of the actual object is better for long term storage. For a grid mesh, only dx, dy, nx, ny are required to reinstantiate. Mesh objects have a __getstate__ method that gives the requirements for pickling the object. All that needs to be stored is what this method returns.
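For the uniform 1D grid in the question, that idea can be sketched as follows. This is a minimal sketch, not FiPy's own serialization: the file name grid.json and the helper cell_centers are hypothetical, and the cell centres are rebuilt with NumPy on the assumption that the grid starts at x = 0 with uniform spacing.

```python
import json
import numpy as np

# Parameters that fully describe the uniform 1D grid from the question.
grid_params = {"nx": 100, "dx": 0.1}

# Store them in a human-readable file (hypothetical file name).
with open("grid.json", "w") as fh:
    json.dump(grid_params, fh)

# Later, possibly in another session: read the parameters back.
with open("grid.json") as fh:
    restored = json.load(fh)

def cell_centers(nx, dx):
    """Cell-centre coordinates of a uniform grid starting at x = 0 (assumed layout)."""
    return (np.arange(nx) + 0.5) * dx

x = cell_centers(restored["nx"], restored["dx"])
# With FiPy installed, the mesh itself could be rebuilt from the same
# parameters, e.g. fipy.Grid1D(nx=restored["nx"], dx=restored["dx"]).
```

The array x can be saved or plotted like any other NumPy array, and the JSON file stays readable regardless of which FiPy version wrote it.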
Or could I store the value of the DiffusionTerm instead of the
CellVariable in each step?
The DiffusionTerm doesn't really store anything other than its coefficient. The equation stores its matrix and b vector.
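For the thermal gradient asked about in the question, one option is to compute it from the stored cell values rather than from the term. The sketch below is an illustration under assumptions, using only NumPy: the temperature profile is a made-up stand-in for T.value at one time step, and np.gradient gives a finite-difference estimate that need not match FiPy's own discretization exactly. FiPy's CellVariable also exposes a grad property that could be stored each step in the same way.

```python
import numpy as np

nx, dx = 100, 0.1
x = (np.arange(nx) + 0.5) * dx          # cell centres of the uniform grid

# Stand-in for the cell values at one time step (in the real loop this
# would be a copy of T.value taken after eq.solve).
T_values = 70.0 + (30.0 - 70.0) * x / (nx * dx)

history = []                             # one gradient profile per stored step
history.append(np.gradient(T_values, dx))

# Each entry of history can then be plotted against x with any
# plotting package, or written to disk with np.save.
```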

Related

Coordinate offsets in xarray and dask

I'm making use of xarray because its coordinates and automatic alignment are really useful, and I've been using Dask because the datasets I generally deal with are on the order of terabytes.
I have a 3D source array generated (or loaded) and dependent on wavelength (wl) and x position and y position at origin zero.
I also have a 2D output array dependent only on x and y, which accumulates all of the wavelengths from the source array. Ideally the output would be:
output = source.sum('wl')
However, the wavelength dependence means that each wavelength offsets the source origin by a certain amount. The best (and ugliest) solution I could come up with is to loop through each wavelength, reassign coordinates, interp up to the output coordinates, stack them into a new array and then sum.
I have an example code that shows what I'm trying to do:
from dask.distributed import Client
import xarray as xr
import dask.array as da
import numpy as np
client = Client(n_workers=2, threads_per_worker=2, memory_limit='2GB')
client
# Generate some offset data here
wavelengths = np.linspace(0.1,10,1000)
x_offsets = np.linspace(100,400,1000)
y_offsets = np.linspace(100,400,1000)
# Coordinate offsets for each wavelength
offset = xr.Dataset(
    {
        'x': (['wl'], x_offsets),
        'y': (['wl'], y_offsets)
    },
    coords={
        'wl': wavelengths
    })
# Our example source function
source_shape = (1000, 10000, 10000,)
wl_source = np.linspace(0.4,5,source_shape[0])
x_source = np.linspace(-6,6, source_shape[1])
y_source = np.linspace(-6,6, source_shape[2])
source = xr.DataArray(da.random.random(size=source_shape, chunks=(10,400,400)),
                      coords=[wl_source, x_source, y_source],
                      dims=['wl', 'x', 'y'])
out_shape = (10000, 10000,)
# Our final output array
x_out = np.linspace(-1000,1000,out_shape[0])
y_out = np.linspace(-1000,1000,out_shape[1])
out = xr.DataArray(da.random.random(size=out_shape, chunks=(4000,4000)),coords=[x_out, y_out], dims=['x','y'])
accum = []
for wl in source.wl:
    # Build our map from source -> output space
    x_map = offset.interp(wl=wl).x + source.x
    y_map = offset.interp(wl=wl).y + source.y
    # Remap coordinates
    source_mapped = source.sel(wl=wl).assign_coords({'x': x_map,
                                                     'y': y_map})
    # Interpolate up to the output coordinates.
    # interp_like unchunks the result, so it needs rechunking here
    accum.append(
        source_mapped.interp_like(out, method='nearest', kwargs={'fill_value': 0}).chunk({'x': 4000, 'y': 4000})
    )
# Accumulate and add to the output
out += xr.concat(accum, dim='wl').sum('wl')
out
This solution ends up with over 1 million tasks. Because of that, building the task graph takes a long time, and during computation either garbage collection takes a long time, memory is exhausted, or I spill so much to disk that I run out of storage. Manually slicing has the same issue.
Additionally, this can't scale if I have more than one source. I've been racking my brain trying to figure out a better solution.
I'm wondering if there's a more efficient way of doing this, either through dask, xarray or some other library. I'm fairly new to dask and xarray, so I'm still trying to get to grips with how they work and how to better chunk and distribute tasks.
Sorry for the long winded question!

using scipy curve_fit with dask/xarray

I'm trying to use scipy.optimize.curve_fit on a large latitude/longitude/time xarray using dask.distributed as computing backend.
The idea is to run an individual data fitting for every (latitude, longitude) using the time series.
All of this runs fine outside xarray/dask. I tested it using the time series of a single location passed as a pandas dataframe. However, if I try to run the same process on the same (latitude, longitude) directly on the xarray, the curve_fit operation returns the initial parameters.
I am performing this operation using xr.apply_ufunc like so (here I'm providing only the code that is strictly relevant to the problem):
# function to perform the fit
def _fit_rti_curve(data, data_rti, fit, loc=False):
    fit_func, linearize, find_init_params = _get_fit_functions(fit)
    # remove nans
    x, y = _filter_nodata(data_rti, data)
    # remove outliers
    x, y = _filter_for_outliers(x, y, linearize=linearize)
    # find a first guess for maximum achievable value
    yscale = np.max(y) * 1.05
    # find a first guess for the other parameters
    # here loc can be manually passed if you have a good estimation
    init_parms = find_init_params(x, y, yscale, loc=loc, linearize=linearize)
    # fit the curve and return parameters
    parms = curve_fit(fit_func, x, y, p0=init_parms, maxfev=10000)
    parms = parms[0]
    return parms
# shell around _fit_rti_curve
def find_rti_func_parms(data, rti, fit):
    # sort and fit highest n values
    top_data = np.sort(data)
    top_data = top_data[-len(rti):]
    # convert to float64 if needed
    top_data = top_data.astype(np.float64)
    rti = rti.astype(np.float64)
    # run the fit
    parms = _fit_rti_curve(top_data, rti, fit, loc=0)  # TODO maybe add function to allow a free loc
    return parms
# call for the apply_ufunc
# `fit` is a string that defines the distribution type
# `rti` is an array for the x values
parms_data = xr.apply_ufunc(
    find_rti_func_parms,
    xr_obj,
    input_core_dims=[['time']],
    output_core_dims=[[fit + ' parameters']],
    output_sizes={fit + ' parameters': len(signature(fit_func).parameters) - 1},
    vectorize=True,
    kwargs={'rti': return_time_interval, 'fit': fit},
    dask='parallelized',
    output_dtypes=['float64']
)
My guess would be that this is a problem related to threading, or at least to some shared memory space that is not properly passed between workers and scheduler.
However, I am just not knowledgeable enough to test this within dask.
Any idea on this problem?
You should have a look at this issue https://github.com/pydata/xarray/issues/4300
I had the same problem and solved it using apply_ufunc. It is not optimized, since it has to perform rechunking operations, but it works!
I've created a GitHub Gist for it https://gist.github.com/clausmichele/8350e1f7f15e6828f29579914276de71
This previous answer might be helpful? It's using numpy.polyfit but I think the general approach should be similar.
Applying numpy.polyfit to xarray Dataset
Also, I haven't tried it, but DataArray.polyfit() was merged recently! It could also be something to look into. http://xarray.pydata.org/en/stable/generated/xarray.DataArray.polyfit.html#xarray.DataArray.polyfit

SVM with python and CPLEX, load the quadratic part of the objective function

''In general, you would get better performance creating batches of linear constraints rather than creating them one at a time. I'm just wondering if that still holds for a huge problem.'' - The wise programmer.
To be clear, I have a (35k x 40) dataset and I want to run an SVM on it. I need to build the Gram matrix of this dataset, which is fine, but passing the coefficients to CPLEX is a mess and takes hours. Here is my code:
nn = 35000
XXt = np.random.rand(nn,nn) # the Gram matrix of the dataset
yy = np.random.rand(nn) # the label vector of the dataset
temp = ((yy*XXt).T)*yy
xg, yg = np.meshgrid(range(nn), range(nn))
indici = np.dstack([yg,xg])
quadratic_part = []
for ii in range(nn):
    for indd in indici[ii][ii:]:
        quadratic_part.append([indd[0], indd[1], temp[indd[0], indd[1]]])
The 'quadratic_part' is a list of entries of the form [i,j,c_ij], where c_ij is the coefficient stored in temp. It will be passed to the function 'objective.set_quadratic_coefficients()' of the CPLEX Python API.
Is there a wiser way to do that?
P.S. I may have a memory problem, so instead of storing the whole list 'quadratic_part', it would be better to call the function 'objective.set_quadratic_coefficients()' several times... you know what I mean?!
Under the hood, objective.set_quadratic makes use of the CPXXcopyquad function in the C Callable Library, whereas objective.set_quadratic_coefficients uses CPXXchgqpcoef, which changes one coefficient at a time.
Here is an example (bear in mind that I am not a numpy expert; it's quite possible there's a better way to do that part):
import numpy as np
import cplex
nn = 5 # a small example size here
XXt = np.random.rand(nn,nn) # the gramm matrix of the dataset
yy = np.random.rand(nn) # the label vector of the dataset
temp = ((yy*XXt).T)*yy
# create symmetric matrix
tempu = np.triu(temp) # upper triangle
iu1 = np.triu_indices(nn, 1)
tempu.T[iu1] = tempu[iu1] # copy upper into lower
ind = np.array([[x for x in range(nn)] for x in range(nn)])
qmat = []
for i in range(nn):
    qmat.append([np.arange(nn), tempu[i]])
c = cplex.Cplex()
c.variables.add(lb=[0]*nn)
c.objective.set_quadratic(qmat)
c.write("test2.lp")
Your Q matrix is completely dense so depending on the amount of memory you have, this technique may not scale. When it's possible, though, you should get better performance initializing your Q matrix with objective.set_quadratic. Perhaps you'll need to use some hybrid technique where you use both set_quadratic and set_quadratic_coefficients.
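On the NumPy side, the nested Python loop from the question can be replaced by index arrays, and the triples can be produced in batches to limit memory use. This is a sketch under assumptions: the generator name upper_triples and the batch size are hypothetical, and the CPLEX call is only indicated in a comment because it has not been run here.

```python
import numpy as np

def upper_triples(q, rows_per_batch):
    """Yield batches of (i, j, q[i, j]) triples covering the upper triangle, j >= i."""
    n = q.shape[0]
    cols = np.arange(n)
    for start in range(0, n, rows_per_batch):
        rows = np.arange(start, min(start + rows_per_batch, n))
        # Boolean mask selecting j >= i for this block of rows.
        mask = cols[None, :] >= rows[:, None]
        block_i, j = np.nonzero(mask)
        i = rows[block_i]
        yield list(zip(i.tolist(), j.tolist(), q[i, j].tolist()))

nn = 5                                   # small example size
rng = np.random.default_rng(0)
XXt = rng.random((nn, nn))               # stand-in for the Gram matrix
yy = rng.random(nn)                      # stand-in for the label vector
temp = ((yy * XXt).T) * yy

batches = list(upper_triples(temp, rows_per_batch=2))
# Each batch could then be handed to
# c.objective.set_quadratic_coefficients(batch) on a cplex.Cplex instance.
```

Only one block of rows is held in memory at a time, so the batch size can be tuned to the memory that is available.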

Using adaptive time step for scipy.integrate.ode when solving ODE systems

I have just read "Using adaptive step sizes with scipy.integrate.ode" and the accepted solution to that problem, and have even reproduced the results by copy-and-paste in my Python interpreter.
My problem is that when I try and adapt the solution code to my own code I only get flat lines.
My code is as follows:
from scipy.integrate import ode
from matplotlib.pyplot import plot, show
initials = [1,1,1,1,1]
integration_range = (0, 100)
f = lambda t,y: [1.0*y[0]*y[1], -1.0*y[0]*y[1], 1.0*y[2]*y[3] - 1.0*y[2], -1.0*y[2]*y[3], 1.0*y[2], ]
y_solutions = []
t_solutions = []
def solution_getter(t,y):
    t_solutions.append(t)
    y_solutions.append(y)
backend = "dopri5"
ode_solver = ode(f).set_integrator(backend)
ode_solver.set_solout(solution_getter)
ode_solver.set_initial_value(y=initials, t=0)
ode_solver.integrate(integration_range[1])
plot(t_solutions,y_solutions)
show()
And the plot it yields:
In the line
y_solutions.append(y)
you think that you are appending the current vector. What actually happens is that you are appending the object reference to y. Since the integrator apparently reuses the vector y during the integration loop, you are always appending the same object reference. Thus, at the end, each position of the list holds the same reference, pointing to the vector of the last state of y.
Long story short: replace with
y_solutions.append(y.copy())
and everything is fine.
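The effect can be reproduced without the integrator. In this sketch a single NumPy buffer is overwritten in a loop, standing in for the state vector that the solver reuses; the names are illustrative only.

```python
import numpy as np

state = np.zeros(2)          # buffer reused on every step, like the solver's y
by_reference = []
by_copy = []

for step in range(3):
    state[:] = step                   # overwrite the buffer in place
    by_reference.append(state)        # stores the same object each time
    by_copy.append(state.copy())      # stores a snapshot of the current values

first_ref = by_reference[0].tolist()  # reflects the last state written
first_copy = by_copy[0].tolist()      # keeps the values from step 0
```

Every entry of by_reference is the same array, which is why the plot in the question shows flat lines, while by_copy keeps one distinct snapshot per step.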

Ways to Create Tables and Presentable Objects Other than Plots in Python

I have code that runs through the following steps:
Draw a number of points from a true distribution.
Use those points with curve_fit to extract the parameters.
Check if those parameters are, on average, close to the true values.
(You can do this by creating the "pull distribution" and seeing whether it returns
a standard normal variable.)
# This script calculates the mean and standard deviation for
# the pull distributions on the estimators that curve_fit returns
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import gauss
import format
numTrials = 10000
# Pull given by (a_j - a_true)/a_error
error_vec_A = []
error_vec_mean = []
error_vec_sigma = []
# Loop to determine pull distribution
for i in range(0, numTrials):
    # Draw from primary distribution
    mean = 0; var = 1; sigma = np.sqrt(var)
    N = 20000
    A = 1/np.sqrt((2*np.pi*var))
    points = gauss.draw_1dGauss(mean,var,N)
    # Histogram parameters
    bin_size = 0.1; min_edge = mean-6*sigma; max_edge = mean+9*sigma
    # np.linspace needs an integer number of points
    Nn = int(round((max_edge-min_edge)/bin_size)); Nplus1 = Nn + 1
    bins = np.linspace(min_edge, max_edge, Nplus1)
    # Obtain histogram from primary distributions
    hist, bin_edges = np.histogram(points,bins,density=True)
    bin_centres = (bin_edges[:-1] + bin_edges[1:])/2
    # Initial guess
    p0 = [5, 2, 4]
    coeff, var_matrix = curve_fit(gauss.gaussFun, bin_centres, hist, p0=p0)
    # Get the fitted curve
    hist_fit = gauss.gaussFun(bin_centres, *coeff)
    # Error on the estimates
    error_parameters = np.sqrt(np.array([var_matrix[0][0],var_matrix[1][1],var_matrix[2][2]]))
    # Obtain the error for each value: A,mu,sigma
    A_std = (coeff[0]-A)/error_parameters[0]
    mean_std = ((coeff[1]-mean)/error_parameters[1])
    sigma_std = (np.abs(coeff[2])-sigma)/error_parameters[2]
    # Store results in container
    error_vec_A.append(A_std)
    error_vec_mean.append(mean_std)
    error_vec_sigma.append(sigma_std)
# Plot the distribution of each estimator
plt.figure(1); plt.hist(error_vec_A,bins,density=True); plt.title('Pull of A')
plt.figure(2); plt.hist(error_vec_mean,bins,density=True); plt.title('Pull of Mu')
plt.figure(3); plt.hist(error_vec_sigma,bins,density=True); plt.title('Pull of Sigma')
# Store key information regarding distribution
mean_A = np.mean(error_vec_A); sigma_A = np.std(error_vec_A)
mean_mu = np.mean(error_vec_mean); sigma_mu = np.std(error_vec_mean)
mean_sigma = np.mean(error_vec_sigma); sigma_sig = np.std(error_vec_sigma)
info = np.array([[mean_A,sigma_A],[mean_mu,sigma_mu],[mean_sigma,sigma_sig]])
My problem is I don't know how to use python to format the data into a table. I have to manually go into the variables and go to google docs to present the information. I'm just wondering how I can do that using pandas or some other library.
Here's an example of the manual insertion:
                          Trial 1     Trial 2     Trial 3
Seed                      [0.2,0,1]   [10,2,5]    [5,2,4]
Bins for individual runs  20          20          20
Points Thrown             1000        1000        1000
Number of Runs            5000        5000        5000
Bins for pull dist fit    20          20          20
Mean_A                    -0.11177    -0.12249    -0.10965
sigma_A                   1.17442     1.17517     1.17134
Mean_mu                   0.00933     -0.02773    -0.01153
sigma_mu                  1.38780     1.38203     1.38671
Mean_sig                  0.05292     0.06694     0.04670
sigma_sig                 1.19411     1.18438     1.19039
I would like to automate this table so that if I change the parameters in my code, I get a new table with the new data.
I would go with the CSV module to generate a presentable table.
If you're not already using it, the IPython notebook is really good for rendering rich display formats. It's really good in a lot of other ways, too.
It will render pandas DataFrame objects as an HTML table when they're either the last, unreturned value in a cell or when you explicitly call the IPython.core.display.display function instead of print.
If you're not already using pandas, I highly recommend it. It's basically a wrapper around 2D & 3D numpy arrays; it's just as fast, but it has nice naming conventions, data grouping and filtering functions, and some other cool stuff.
At that point, it depends on how you want to present it. You can use nbconvert to render a whole notebook as static html or a pdf. You can copy-paste the html table into Excel or PowerPoint or an E-mail.
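As a sketch of the pandas route for the summary statistics in the question: the numbers below are the Trial 1 values from the manual table, standing in for the info array, and the output file name is hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for the info array built at the end of the script
# (rows: A, mu, sigma; columns: mean and standard deviation of the pull).
info = np.array([[-0.11177, 1.17442],
                 [0.00933, 1.38780],
                 [0.05292, 1.19411]])

table = pd.DataFrame(info,
                     index=["A", "mu", "sigma"],
                     columns=["pull mean", "pull std"])

# Run settings can sit next to the results in a second small table.
settings = pd.Series({"Seed": "[0.2,0,1]",
                      "Points thrown": "1000",
                      "Number of runs": "5000"}, name="Trial 1")

table.to_csv("pull_summary.csv")      # plain-text table for other tools
html = table.to_html()                # HTML table for a notebook or e-mail
text = table.to_string()              # aligned text for the terminal
```

Because the table is built from the variables themselves, rerunning the script with different parameters regenerates it without any manual copying.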
