Python + GNU Plot: dealing with missing values - python

For clarity, I have isolated my problem and used a small but complete snippet to describe it.
I have a lot of data, but many pieces are missing. I want to ignore the missing values (producing a break in the graph if it were a line graph). I have set "?" to be the symbol for missing data. Here is my snippet:
import math
import Gnuplot
gp = Gnuplot.Gnuplot(persist=1)
gp("set datafile missing '?'")
x = range(1000)
y = [math.sin(a) + math.cos(a) + math.tan(a) for a in x]
# Force a piece of missing data
y[4] = '?'
data = Gnuplot.Data(x, y, title='Plotting from Python')
gp.plot(data);
gp.hardcopy(filename="pyplot.png",terminal="png")
But it doesn't work:
> python missing_test.py
Traceback (most recent call last):
File "missing_test.py", line 8, in <module>
data = Gnuplot.Data(x, y, title='Plotting from Python')
File "/usr/lib/python2.6/dist-packages/Gnuplot/PlotItems.py", line 560, in Data
data = utils.float_array(data)
File "/usr/lib/python2.6/dist-packages/Gnuplot/utils.py", line 33, in float_array
return numpy.asarray(m, numpy.float32)
File "/usr/lib/python2.6/dist-packages/numpy/core/numeric.py", line 230, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
What's going wrong?

Gnuplot is calling numpy.asarray to convert your Python list into a numpy array.
Unfortunately, this command (with dtype=numpy.float32) is incompatible with a Python list that contains strings.
You can reproduce the error like this:
In [36]: np.asarray(['?',1.0,2.0],np.float32)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/lib/python2.6/dist-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
228
229 """
--> 230 return array(a, dtype, copy=False, order=order)
231
232 def asanyarray(a, dtype=None, order=None):
ValueError: setting an array element with a sequence.
Furthermore, the Gnuplot python module (version 1.7) docs say
There is no provision for missing data points in array data (which
gnuplot allows via the 'set missing' command).
I'm not sure if this has been fixed in version 1.8.
How married are you to gnuplot? Have you tried matplotlib?
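If switching is an option, one way to get the break in the line is to turn the '?' markers into NaN before plotting, since matplotlib leaves a gap wherever a value is NaN. A minimal sketch, assuming the missing marker is the string '?'; the helper name to_float_array is made up for this example:

```python
import math
import numpy as np

def to_float_array(values, missing='?'):
    """Convert a list to a float array, mapping the missing marker to NaN."""
    return np.array(
        [np.nan if v == missing else float(v) for v in values],
        dtype=float,
    )

x = np.arange(10)
y = [math.sin(a) + math.cos(a) for a in x]
y[4] = '?'                    # force a piece of missing data
y_clean = to_float_array(y)   # index 4 is now NaN, the rest are floats
```

Passing x and y_clean to plt.plot(x, y_clean) should then draw the curve with a gap at index 4.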

Related

Cannot plot my function : return array(a, dtype, copy=False, order=order) TypeError: float() argument must be a string or a number

I'm trying to plot a function that gives the arctan of the angle of several scatterplots (it's a physics experiment):
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
filename='rawPhaseDataf2f_17h_15m.dat'
datatype=np.dtype( [('Shotnumber',np.dtype('>f8')),('A1',np.dtype('>f8')), ('A2',np.dtype('>f8')), ('f2f',np.dtype('>f8')), ('intensity',np.dtype('>f8'))])
data=np.fromfile(filename,dtype=datatype)
#time=data['Shotnumber']/9900 # reprate is 9900 Hz -> time in seconds
A1=data['A1']
A2=data['A2']
#np.sort()
i=range(1,209773)
def x(i) :
    return arctan((A1.item(i)/A2.item(i))*(i/209772))
def y(i) :
    return i*2*pi/209772
plot(x,y)
plt.figure('Scatterplot')
plt.plot(A1,A2,',') #Scatterplot
plt.xlabel('A1')
plt.ylabel('A2')
plt.figure('2D Histogram')
plt.hist2d(A1,A2,100) # 2D Histogram
plt.xlabel('A1')
plt.ylabel('A2')
plt.show()
My error is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 540, in runfile
execfile(filename, namespace)
File "/home/nelly/Bureau/ Téléchargements/Kr4 Experiment/read_rawPhaseData.py", line 21, in <module>
plot(x,y)
File "/usr/lib/pymodules/python2.7/matplotlib/pyplot.py", line 2987, in plot
ret = ax.plot(*args, **kwargs)
File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 4138, in plot
self.add_line(line)
File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 1497, in add_line
self._update_line_limits(line)
File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 1508, in _update_line_limits
path = line.get_path()
File "/usr/lib/pymodules/python2.7/matplotlib/lines.py", line 743, in get_path
self.recache()
File "/usr/lib/pymodules/python2.7/matplotlib/lines.py", line 420, in recache
x = np.asarray(xconv, np.float_)
File "/usr/lib/python2.7/dist-packages/numpy/core/numeric.py", line 460, in asarray
return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number
I know that the problem comes from plot(x,y), and I think the error is in how I define x and y. A1 and A2 are arrays, N is the number of points, and k is the index into each array. I want to compute arctan(A1k/A2k)*(k/N).
There are lots of problems with your code, and your understanding of python and array operations. I'm just going to handle the first part of the code (and the error you get), and hopefully you can continue to fix it from there.
This should fix the error you're getting and generate a plot:
# size = 209772
size = A1.size # I'm assuming that the size of the array is 209772
# Construct the array k/N for k = 1..N, i.e. [1/209772, ..., 1.0].
# Dividing by float(size) avoids integer division under Python 2.
z = np.arange(1, size + 1) / float(size)
# Calculate the x and y arrays
x = np.arctan((A1/A2)*z)
y = z*2*np.pi
# Plot x and y
plt.plot(x, y)
Discussion:
There are lots of issues with this chunk of code:
i=range(1,209773)
def x(i) :
    return arctan((A1.item(i)/A2.item(i))*(i/209772))
def y(i) :
    return i*2*pi/209772
plot(x, y)
You're defining two functions called x and y, and then you are passing those functions to the plotting method. The plotting method accepts numbers (in lists or arrays), not functions. That is the reason for the error that you are getting. So you instead need to construct a list/array of numbers and pass that to the function.
You're defining a variable i which is a list of numbers. But when you define the functions x and y, you are creating new variables named i which have nothing to do with the list you created earlier. This is because of how "scope" works in python.
The functions arctan and plot are not defined "globally", instead they are only defined in the packages numpy and matplotlib. So you need to call them from those packages.
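The scope point is easy to check in isolation. A tiny sketch with made-up numbers (nothing here comes from the original data file): the i inside the function is its parameter, not the module-level list, and the function only produces numbers once it is called.

```python
# Module-level name: a list of indices.
i = list(range(1, 6))

def y(i):
    # This i is the function's own parameter; it shadows the module-level list.
    return i * 2

single = y(3)                # the function applied to one number
values = [y(k) for k in i]   # the numbers a plotting call actually needs
```

The list values (or a NumPy array built the same way) is what should be handed to plt.plot, not the function y itself.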

NumPy: TypeError: reshape() got an unexpected keyword argument 'order'

I get the following error while reshaping a numpy ndarray
DeprecationWarning: :func:`reshape` is deprecated, use :func:`numerix.reshape()<numpy.reshape>` instead!
return reshape(newshape, order=order)
Traceback (most recent call last):
File "./render2.py", line 374, in <module>
,u=np.reshape(voltage.grad[0], (ny, nx))
File "/home/jana/Builds/lib/python2.6/site-packages/numpy/core/fromnumeric.py", line 172, in reshape
return reshape(newshape, order=order)
File "/home/jana/Builds/lib/python2.6/site-packages/fipy/tools/decorators.py", line 151, in newfunc
return func(*args, **kwds)
TypeError: reshape() got an unexpected keyword argument 'order'
Below is the part of the code that gives this error. Note: plot.py is a user defined module.
plot.streamlinePlot(x = x
,y = y
,u=np.reshape(voltage.grad[0], (ny, nx))
,v=np.reshape(voltage.grad[1], (ny, nx))
,filename='Analysis/electricFieldStreamPlot_%s.png'
,show=False
,clear=True)
The output of
print "Voltage shape =", voltage.shape
print "Voltage.grad[0] shape =", voltage.grad[0].shape
print "ny times nx =", ny*nx
is
Voltage shape = (269700,)
Voltage.grad[0] shape = (269700,)
ny times nx = 269700
I am running FiPy 3.0 and NumPy 1.7.2.
Any clues? Thanks!
You should get the desired result by calling
from fipy import numerix
numerix.reshape(voltage.grad[0], (ny, nx))
FiPy overrides a number of NumPy routines for working with its own data structures in a self-consistent way. You should always use fipy.numerix instead of numpy when working with FiPy objects.
In case you aren't aware, FiPy now includes a MatplotlibStreamViewer that may either serve your needs or at least show you the data manipulations you'll need to perform for your own display.
There's definitely something wrong in the interaction between numpy.reshape(), fipy.numerix.reshape(), and fipy.CellVariable.reshape(). I've filed a ticket to look into this. Thanks for raising the question.

How could I use a custom distance metric for KNeighborsRegressor?

I'm trying to apply my own custom distance metric function when using a KNN regression model.
My dataset is a mixture of nominal, ordinal, numeric and binary fields.
Code:
def cus_distance(array1, array2, **kwargs):
    # calculate the distance, return a float
    pass
knn = neighbors.KNeighborsRegressor(weights='distance', metric=cus_distance)
# train_data is a pandas dataframe obj
knn.fit(train_data.ix[:, fields_list], train_data['time_costs'])
The last line will cause an exception:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-284-04520b227b8a> in <module>()
----> 1 knn.fit(train_data.ix[:, fields_list], train_data['time_costs'])
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in fit(self, X, y)
587 X, y = check_arrays(X, y, sparse_format="csr")
588 self._y = y
--> 589 return self._fit(X)
590
591
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in _fit(self, X)
214 self._tree = BallTree(X, self.leaf_size,
215 metric=self.effective_metric_,
--> 216 **self.effective_metric_kwds_)
217 elif self._fit_method == 'kd_tree':
218 self._tree = KDTree(X, self.leaf_size,
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/ball_tree.so in sklearn.neighbors.ball_tree.BinaryTree.__init__ (sklearn/neighbors/ball_tree.c:7983)()
/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
318
319 """
--> 320 return array(a, dtype, copy=False, order=order)
321
322 def asanyarray(a, dtype=None, order=None):
ValueError: could not convert string to float: Unknown
I know this error is caused by string values ('Unknown' is one of them) in my dataset.
This confuses me: as I understand it, the function cus_distance should take care of these str values, and KNeighborsRegressor should just use the return value of my function.
Q:
* Is this the right way to use a custom distance metric in KNN regression?
* If it is, why do I get this exception?
* If not, what is the right way?
The Ball Tree and KD Tree require floating point data, regardless of the metric used. If your data cannot be converted to floating point, then you will get this sort of error.
>>> import numpy as np
>>> data = [1, "Unknown", 2]
>>> np.asarray(data, dtype=float)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
----> 1 np.asarray(data, dtype=float)
ValueError: could not convert string to float: Unknown
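To see which entries would trip the conversion before fitting, something like the following can help. It is only a sketch: find_non_numeric is a hypothetical helper, and the rows are made-up data.

```python
def find_non_numeric(rows):
    """Return (row_index, column_index, value) for every entry float() rejects."""
    bad = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            try:
                float(value)
            except (TypeError, ValueError):
                bad.append((r, c, value))
    return bad

rows = [[1, 2.5, "3"], [4, "Unknown", 6], [7, 8, None]]
bad_entries = find_non_numeric(rows)
```

Note that a numeric string such as "3" converts fine; only values like "Unknown" or None are reported.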
Thanks @jakevdp.
scikit-learn supports Brute Force, Ball Tree and KD Tree, and according to @jakevdp's answer, the only one I can use is the Brute Force algorithm, so I changed my code to:
knn = neighbors.KNeighborsRegressor(weights='distance', metric=cus_distance, algorithm='brute')
knn.fit(train_data.ix[:, fields_list], train_data['time_costs'])
This time it no longer raises an error. Thanks, jakevdp!
But a new problem came up when I tried to use this knn object:
knn.predict(check_data.ix[:, fields_list])
This causes the same error as in my question. So I looked into scikit-learn's source code and found that these lines cause the error:
elif callable(metric):
    # Check matrices first (this is usually done by the metric).
    X, Y = check_pairwise_arrays(X, Y)
    n_x, n_y = X.shape[0], Y.shape[0]
The function check_pairwise_arrays tries to convert all values to float, so "Unknown" causes the error again.
I think this is a kind of bug: scikit-learn's built-in metrics don't support mixed-type datasets, so I wrote a custom metric function, but this line still forces the dataset to be pure float.
As the comment above this line says, the checking should be done by the custom metric, so I commented this line out and reloaded the module. My knn object works perfectly now :)
PS: I'm working on pushing this change to the official scikit-learn GitHub repo.
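An alternative that avoids editing the library is to encode the string columns as numbers first, so every array the estimator sees is already float, and let the metric treat those columns as categories. This is only a sketch under assumptions: the column layout, the helper encode_categorical, and the metric mixed_distance are all invented for illustration, and each categorical column is assumed to hold strings only.

```python
import numpy as np

def encode_categorical(rows, categorical_cols):
    """Build a float array, replacing labels in categorical columns by integer codes."""
    encoded = np.empty((len(rows), len(rows[0])), dtype=float)
    for col in range(len(rows[0])):
        column = [row[col] for row in rows]
        if col in categorical_cols:
            codes = {label: k for k, label in enumerate(sorted(set(column)))}
            column = [codes[label] for label in column]
        encoded[:, col] = column
    return encoded

def mixed_distance(a, b, categorical_cols=(0,)):
    """Mismatch count on categorical columns plus absolute difference on the rest."""
    total = 0.0
    for col in range(len(a)):
        if col in categorical_cols:
            total += 0.0 if a[col] == b[col] else 1.0
        else:
            total += abs(a[col] - b[col])
    return total

rows = [("red", 1.0), ("blue", 2.5), ("Unknown", 1.0)]
X = encode_categorical(rows, categorical_cols=(0,))
d_same = mixed_distance(X[0], X[0])
d_diff = mixed_distance(X[0], X[2])
d_both = mixed_distance(X[0], X[1])
```

A function like mixed_distance could then be passed as metric= together with algorithm='brute', as in the code above, without touching check_pairwise_arrays.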

Value Error :Storing data from binary file into numpy 3d arrays

I am trying to read float numbers from a binary file using the struct module and then store them in a NumPy 3D array. When I run it as an independent script, it works fine. But when I call it as a class method from another script (after importing it), it raises a ValueError.
Here is my code.
import struct
from numpy import *
class DCD_read:
    def read_cord(self,total_atoms,dcd_data):
        cord_data=dcd_data[276:len(dcd_data)] ## binary data string
        byte=0
        count=0
        total_frames=info_dict['nset']
        coord=numpy.zeros((total_frames,total_atoms,3)) ## 3d array
        for frames in range(0,total_frames):
            for atoms in range(0,total_atoms):
                x = list(struct.unpack('<f',cord_data[60+byte:64+byte])) ###reading float
                byte+=4
                y = list(struct.unpack('<f',cord_data[60+byte:64+byte]))
                byte+=4
                z = list(struct.unpack('<f',cord_data[60+byte:64+byte]))
                byte+=4
                ls=x
                ls.extend(y)
                ls.extend(z)
                coord[frames][atoms]=ls
        return coord
Error:
Traceback (most recent call last):
File "C:\Users\Hira\Documents\PROJECT\md.py", line 24, in <module>
coord=dcd.read_cord(total_atoms,dcd_data)
File "C:\Users\Hira\Documents\PROJECT\DCD_read.py", line 51, in read_cord
coord=numpy.zeros((total_frames,total_atoms,3))
File "C:\Python27\numpy\core\numeric.py", line 148, in ones
a = empty(shape, dtype, order)
ValueError: negative dimensions are not allowed
md.py is the main (calling) script, while DCD_read.py is the module. Here is the code for md.py:
from DCD_read import *
import numpy
dcd_file=open('frame3.dcd',"rb")
dcd_data=dcd_file.read()
dcd=read_dcd()
total_atoms=6141
coord=dcd.read_cord(total_atoms,dcd_data)
Can anyone help? I hope I have explained it completely and clearly. Thanks.
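Reading the traceback: numpy.zeros complains about a negative dimension, and since total_atoms is 6141 and the last dimension is 3, the negative value has to be total_frames, i.e. info_dict['nset'] as the module sees it when imported. Below is a minimal sketch of decoding the coordinates with an explicit check on the counts. It assumes the payload is simply consecutive little-endian float32 x, y, z triples (the same assumption the loop above makes, not a statement about the DCD format), and unpack_coords is a hypothetical helper.

```python
import struct
import numpy as np

def unpack_coords(raw, total_frames, total_atoms):
    """Decode consecutive little-endian float32 x, y, z triples into a 3D array."""
    if total_frames <= 0 or total_atoms <= 0:
        raise ValueError(
            "frame and atom counts must be positive, got %r and %r"
            % (total_frames, total_atoms)
        )
    n_values = total_frames * total_atoms * 3
    flat = np.frombuffer(raw, dtype='<f4', count=n_values)
    return flat.reshape(total_frames, total_atoms, 3)

# Synthetic payload: 2 frames x 2 atoms x 3 coordinates, values 0.0 .. 11.0
raw = struct.pack('<12f', *[float(v) for v in range(12)])
coord = unpack_coords(raw, 2, 2)
```

Failing early with a clear message makes it easier to see that the frame count, not the unpacking loop, is what differs between the standalone run and the imported one.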

ValueError when trying to save ndarray (Numpy)

I am trying to translate a project I have in MATLAB to Python+Numpy because MATLAB keeps running out of memory. The file I have is rather long, so I have tried to make a minimal example that shows the same error.
Basically I'm making a 2d histogram of a dataset, and want to save it after some processing. The problem is that the numpy.save function throws a "ValueError: setting an array element with a sequence" when I try to save the output of the histogram function. I can't find the problem when I look at the docs of Numpy.
My version of Python is 2.6.6, Numpy version 1.4.1 on a Debian distro.
import numpy as np
import random
n_samples = 5
rows = 5
out_file = file('dens.bin','wb')
x_bins = np.arange(-2.005,2.005,0.01)
y_bins = np.arange(-0.5,n_samples+0.5)
listy = [random.gauss(0,1) for r in range(n_samples*rows)]
dens = np.histogram2d( listy, \
                       range(n_samples)*rows, \
                       [y_bins, x_bins])
print 'Write data'
np.savez(out_file, dens)
out_file.close()
Full output:
$ python error.py
Write data
Traceback (most recent call last):
File "error.py", line 19, in <module>
np.savez(out_file, dens)
File "/usr/lib/pymodules/python2.6/numpy/lib/io.py", line 439, in savez
format.write_array(fid, np.asanyarray(val))
File "/usr/lib/pymodules/python2.6/numpy/core/numeric.py", line 312, in asanyarray
return array(a, dtype, copy=False, order=order, subok=True)
ValueError: setting an array element with a sequence.
Note that np.histogram2d actually returns a tuple of three arrays: the histogram and the two bin-edge arrays. If you want to save all three of these, you have to unpack them as @Francesco said. Use a filename ending in .npz, because np.savez appends that extension to any string filename that lacks it, so a later np.load('dens.bin') would not find the file.
dens = np.histogram2d(listy,
                      range(n_samples)*rows,
                      [y_bins, x_bins])
np.savez('dens.npz', *dens)
Alternatively, if you only need the histogram itself, you could save just that.
np.savez('dens.npz', dens[0])
If you want to keep track of which of these is which, use the **kwds instead of the *args
denskw = dict(zip(['hist','y_bins','x_bins'], dens))
np.savez('dens.npz', **denskw)
Then, you can load it like
dens = np.load('dens.npz')
hist = dens['hist']  # etc.
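A minimal round-trip sketch of the keyword approach, using synthetic data and an in-memory buffer instead of a file. The array names (hist, sample_edges, value_edges) are chosen for this example only.

```python
import io
import numpy as np

rng = np.random.RandomState(0)
values = rng.normal(size=25)              # 5 samples x 5 rows of Gaussian data
sample_ids = np.tile(np.arange(5), 5)     # which sample each value belongs to

sample_edges = np.arange(-0.5, 5.5)       # 6 edges -> 5 bins, one per sample
value_edges = np.linspace(-2.0, 2.0, 41)  # 41 edges -> 40 bins

# histogram2d returns a 3-tuple: the counts and the two edge arrays.
result = np.histogram2d(sample_ids, values, bins=[sample_edges, value_edges])

buf = io.BytesIO()
np.savez(buf, hist=result[0], sample_edges=result[1], value_edges=result[2])
buf.seek(0)

loaded = np.load(buf)
hist = loaded['hist']
```

Saving the three arrays under explicit names means the loading side does not have to remember the order of the tuple.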
