Python: ValueError: could not convert string to float: 'D'

I am loading a train.csv file to fit it with a RandomForestClassifier.
Loading and processing the .csv file works fine; I am able to play around with my dataframe.
When I try:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=150, min_samples_split=2, n_jobs=-1)
rf.fit(train, target)
I get this:
ValueError: could not convert string to float: 'D'
I have tried:
train=train.astype(float)
Replacing all 'D' with another value.
train.convert_objects(convert_numeric=True)
But the issue still persists.
I also tried printing all the ValueErrors in my csv file, but could not find a reference to 'D'.
This is my trace:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-20-9d8e309c06b6> in <module>()
----> 1 rf.fit(train, target)
\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py in fit(self, X, y, sample_weight)
222
223 # Convert data
--> 224 X, = check_arrays(X, dtype=DTYPE, sparse_format="dense")
225
226 # Remap output
\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_arrays(*arrays, **options)
279 array = np.ascontiguousarray(array, dtype=dtype)
280 else:
--> 281 array = np.asarray(array, dtype=dtype)
282 if not allow_nans:
283 _assert_all_finite(array)
\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
460
461 """
--> 462 return array(a, dtype, copy=False, order=order)
463
464 def asanyarray(a, dtype=None, order=None):
ValueError: could not convert string to float: 'D'
How should I approach this problem?

Since RandomForestClassifier is not part of the Python standard library (your traceback shows it comes from scikit-learn), it's difficult to know exactly what's going on in your case. What's really happening, though, is that at some point you're trying to convert the string 'D' into a float.
I can reproduce your error by doing:
float('D')
Now, to be able to debug this problem, I recommend you to catch the exception:
try:
    rf.fit(train, target)
except ValueError as e:
    print(e)
    # do something clever with train and target, like pprint them
Then you can look into what's really going on. (Note that the random-forest-classifier package at https://www.npmjs.com/package/random-forest-classifier is a JavaScript library, unrelated to scikit-learn's RandomForestClassifier.)

You should explore and clean your data. You probably have a 'D' somewhere in your data which your code tries to convert to a float. Tracing it within a try-except block is a good idea.
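As a sketch of that exploration step (the train DataFrame and the stray 'D' below are hypothetical stand-ins for the question's data), pandas can locate every cell that fails numeric conversion:

```python
import pandas as pd

# Hypothetical stand-in for the question's train DataFrame
train = pd.DataFrame({"a": [1, 2, 3], "b": ["4.5", "D", "6.1"]})

# Coerce everything to numeric; unparseable cells become NaN
numeric = train.apply(pd.to_numeric, errors="coerce")

# Cells that were non-null but still failed to convert are the culprits
mask = numeric.isna() & train.notna()
for col in train.columns[mask.any()]:
    print(col, "->", train.loc[mask[col], col].tolist())
```

This prints the column name and the offending values, so you know exactly what to replace or drop before calling fit.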

Related

How to convert dtype from 'O' to 'int64'?

I started working with a dataset, which is a collection of murder reports. There is a column "Perpetrator Age" that contains simple integers, but when I checked its type, it turned out to be dtype('O').
In order to work with this column further, I want to change its type to dtype('int64'). I tried to do it like this:
data['Perpetrator Age'] = data['Perpetrator Age'].astype(int)
and got this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-64-50a3c796ab1e> in <module>()
----> 1 data['Perpetrator Age'] = data['Perpetrator Age'].astype(int)
4 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
972 # work around NumPy brokenness, #1987
973 if np.issubdtype(dtype.type, np.integer):
--> 974 return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
975
976 # if we have a datetime/timedelta array of objects
pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()
ValueError: invalid literal for int() with base 10: ' '
I saw advice that an "object" column must first be converted to a string and then to "int". I tried that as well, but the same error appeared. Please tell me how I can fix this?
As mentioned in the comments, the first row of your df is apparently an empty space (' '). You can either remove it, replace it with something else, or skip it:
df['column_1'].iloc[1:].astype('int')
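A more general sketch (assuming blank strings could appear anywhere in the column, not only in the first row; the sample Series below is hypothetical) is to coerce unparseable entries to NaN and then decide how to handle them:

```python
import pandas as pd

# Hypothetical stand-in for the "Perpetrator Age" column
ages = pd.Series(["21", " ", "34"], dtype=object)

# Blanks and any other junk become NaN instead of raising ValueError
clean = pd.to_numeric(ages, errors="coerce")

# Drop (or fill) the bad rows before casting to int64
result = clean.dropna().astype("int64")
print(result.tolist())   # [21, 34]
```

This way the cast succeeds no matter where the bad values sit.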

what reliable method to save huge numpy arrays

I saved some arrays using numpy.savez_compressed(). One of the arrays is gigantic, it has the shape (120000,7680), type float32.
Trying to load the array gave me the error below (message caught using IPython).
It seems like this is a NumPy limitation:
Numpy: apparent memory error
What are other ways to save such a huge array? (I had problems with cPickle as well)
In [5]: t=numpy.load('humongous.npz')
In [6]: humg = (t['arr_0.npy'])
/usr/lib/python2.7/dist-packages/numpy/lib/npyio.pyc in __getitem__(self, key)
229 if bytes.startswith(format.MAGIC_PREFIX):
230 value = BytesIO(bytes)
--> 231 return format.read_array(value)
232 else:
233 return bytes
/usr/lib/python2.7/dist-packages/numpy/lib/format.pyc in read_array(fp)
456 # way.
457 # XXX: we can probably chunk this to avoid the memory hit.
--> 458 data = fp.read(int(count * dtype.itemsize))
459 array = numpy.fromstring(data, dtype=dtype, count=count)
460
SystemError: error return without exception set
System: Ubuntu 12.04 64 bit, Python 2.7, numpy 1.6.1
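One workaround worth considering (a sketch not from the original thread, and only viable if the uncompressed size is acceptable) is to store each array in its own plain .npy file and memory-map it on load, so the whole array never has to be read into RAM at once:

```python
import os
import tempfile
import numpy as np

# Small stand-in for the (120000, 7680) float32 array from the question
arr = np.arange(12, dtype=np.float32).reshape(3, 4)

path = os.path.join(tempfile.mkdtemp(), "humongous.npy")
np.save(path, arr)                  # uncompressed .npy, one array per file

# mmap_mode="r" maps the file instead of reading it into memory
view = np.load(path, mmap_mode="r")
print(view.shape, view[1, 2])
```

Slices of the memory-mapped view are read from disk on demand, which sidesteps the allocation that fails inside format.read_array.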

Why the source code of conjugate in Numpy cannot be found by using the inspect module?

I want to see the implementation of the conjugate function used in Numpy. Then I tried the following:
import numpy as np
import inspect
inspect.getsource(np.conjugate)
However, I received the following error message stating that <ufunc 'conjugate'> is not a module, class, method, function, traceback, frame, or code object. May someone answer why?
In [8]: inspect.getsource(np.conjugate)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-821ecfb71e08> in <module>()
----> 1 inspect.getsource(np.conj)
/Users/duanlx/anaconda/python.app/Contents/lib/python2.7/inspect.pyc in getsource(object)
699 or code object. The source code is returned as a single string. An
700 IOError is raised if the source code cannot be retrieved."""
--> 701 lines, lnum = getsourcelines(object)
702 return string.join(lines, '')
703
/Users/duanlx/anaconda/python.app/Contents/lib/python2.7/inspect.pyc in getsourcelines(object)
688 original source file the first line of code was found. An IOError is
689 raised if the source code cannot be retrieved."""
--> 690 lines, lnum = findsource(object)
691
692 if ismodule(object): return lines, 0
/Users/duanlx/anaconda/lib/python2.7/site-packages/IPython/core/ultratb.pyc in findsource(object)
149 FIXED version with which we monkeypatch the stdlib to work around a bug."""
150
--> 151 file = getsourcefile(object) or getfile(object)
152 # If the object is a frame, then trying to get the globals dict from its
153 # module won't work. Instead, the frame object itself has the globals
/Users/duanlx/anaconda/python.app/Contents/lib/python2.7/inspect.pyc in getsourcefile(object)
442 Return None if no way can be identified to get the source.
443 """
--> 444 filename = getfile(object)
445 if string.lower(filename[-4:]) in ('.pyc', '.pyo'):
446 filename = filename[:-4] + '.py'
/Users/duanlx/anaconda/python.app/Contents/lib/python2.7/inspect.pyc in getfile(object)
418 return object.co_filename
419 raise TypeError('{!r} is not a module, class, method, '
--> 420 'function, traceback, frame, or code object'.format(object))
421
422 ModuleInfo = namedtuple('ModuleInfo', 'name suffix mode module_type')
TypeError: <ufunc 'conjugate'> is not a module, class, method, function, traceback, frame, or code object
Thanks!
NumPy's ufuncs, such as np.conjugate, are implemented in C for speed, so there is no Python source for inspect to retrieve. inspect.getsource only works on objects written in Python.
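A quick sketch of the distinction: getsource fails on the C-implemented ufunc but succeeds on any pure-Python function (the stdlib function used below is just an arbitrary example of one):

```python
import inspect
import numpy as np

print(type(np.conjugate).__name__)      # 'ufunc', implemented in C

try:
    inspect.getsource(np.conjugate)     # no Python source to retrieve
except TypeError as e:
    print("getsource failed:", e)

# A function written in Python works fine:
src = inspect.getsource(inspect.getmodulename)
print(src.splitlines()[0])              # the 'def getmodulename(...)' line
```

To see the C implementation you would have to read the NumPy source tree itself rather than use inspect.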

Python splines or other interpolations that work with time on x-axis?

Trying to use the awfully useful pandas to deal with data as time series, I am now stumbling over the fact that there do not seem to be libraries that can directly interpolate (with a spline or similar) over data that has DateTime on the x-axis. I always seem to be forced to convert the dates to floating-point numbers first, like seconds since 1980 or something like that.
I was trying the following things so far, sorry for the weird formatting, I have this stuff only in the ipython notebook, and I can't copy cells from there:
from scipy.interpolate import InterpolatedUnivariateSpline as IUS
type(bb2temp): pandas.core.series.TimeSeries
s = IUS(bb2temp.index.to_pydatetime(), bb2temp, k=1)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-67-19c6b8883073> in <module>()
----> 1 s = IUS(bb2temp.index.to_pydatetime(), bb2temp, k=1)
/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/scipy/interpolate/fitpack2.py in __init__(self, x, y, w, bbox, k)
335 #_data == x,y,w,xb,xe,k,s,n,t,c,fp,fpint,nrdata,ier
336 self._data = dfitpack.fpcurf0(x,y,k,w=w,
--> 337 xb=bbox[0],xe=bbox[1],s=0)
338 self._reset_class()
339
TypeError: float() argument must be a string or a number
By using bb2temp.index.values (that look like these:
array([1970-01-15 184:00:35.884999, 1970-01-15 184:00:58.668999,
1970-01-15 184:01:22.989999, 1970-01-15 184:01:45.774000,
1970-01-15 184:02:10.095000, 1970-01-15 184:02:32.878999,
1970-01-15 184:02:57.200000, 1970-01-15 184:03:19.984000,
) as x-argument, interestingly, the Spline class does create an interpolator, but it still breaks when trying to interpolate/extrapolate to a larger DateTimeIndex (which is my final goal here). Here is how that looks:
all_times = divcal.timed.index.levels[2] # part of a MultiIndex
all_times
<class 'pandas.tseries.index.DatetimeIndex'>
[2009-07-20 00:00:00.045000, ..., 2009-07-20 00:30:00.018000]
Length: 14063, Freq: None, Timezone: None
s(all_times.values) # applying the above generated interpolator
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-74-ff11f6d6d7da> in <module>()
----> 1 s(tall.values)
/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/scipy/interpolate/fitpack2.py in __call__(self, x, nu)
219 # return dfitpack.splev(*(self._eval_args+(x,)))
220 # return dfitpack.splder(nu=nu,*(self._eval_args+(x,)))
--> 221 return fitpack.splev(x, self._eval_args, der=nu)
222
223 def get_knots(self):
/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/scipy/interpolate/fitpack.py in splev(x, tck, der, ext)
546
547 x = myasarray(x)
--> 548 y, ier =_fitpack._spl_(x, der, t, c, k, ext)
549 if ier == 10:
550 raise ValueError("Invalid input data")
TypeError: array cannot be safely cast to required type
I tried to use s(all_times) and s(all_times.to_pydatetime()) as well, with the same TypeError: array cannot be safely cast to required type.
Am I, sadly, correct? Did everybody get used to convert times to floating points so much, that nobody thought it's a good idea that these interpolations should work automatically? (I would finally have found a super-useful project to contribute..) Or would you like to prove me wrong and earn some SO points? ;)
Edit: Warning: Check your pandas data for NaNs before you hand it to the interpolation routines. They will not complain about anything but just silently fail.
The problem is that those fitpack routines that are used underneath require floats. So, at some point there has to be a conversion from datetime to floats. This conversion is easy. If bb2temp.index.values is your datetime array, just do:
In [1]: bb2temp.index.values.astype('d')
Out[1]:
array([ 1.22403588e+12, 1.22405867e+12, 1.22408299e+12,
1.22410577e+12, 1.22413010e+12, 1.22415288e+12,
1.22417720e+12, 1.22419998e+12])
You just need to pass that to your spline. And to convert the results back to datetime objects, you do results.astype('datetime64').
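The round trip described above can be sketched like this (the times and temperatures are made up, and np.interp stands in for the k=1 spline so the example needs only NumPy):

```python
import numpy as np

# Hypothetical stand-in for bb2temp: datetime64 index, float values
times = np.array(["2009-07-20T00:00:00", "2009-07-20T00:10:00",
                  "2009-07-20T00:20:00", "2009-07-20T00:30:00"],
                 dtype="datetime64[s]")
temps = np.array([1.0, 2.0, 4.0, 8.0])

x = times.astype("d")                          # datetime64 -> float seconds

# Query at a new instant, converted the same way
xq = np.datetime64("2009-07-20T00:05:00", "s").astype("d")
print(np.interp(xq, x, temps))                 # halfway between 1.0 and 2.0
```

The same float conversion works as the x-argument for scipy's spline classes; only the unit of the floats (seconds vs. milliseconds) depends on the datetime64 resolution, so convert query points with the identical astype call.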

scipy, fftpack and float64

I would like to use the dct functionality from the scipy.fftpack with an array of numpy float64. However, it seems it is only implemented for np.float32. Is there any quick workaround I could do to get this done? I looked into it quickly but I am not sure of all the dependencies. So, before messing everything up, I thought I'd ask for tips here!
The only thing I have found so far about this is this link : http://mail.scipy.org/pipermail/scipy-svn/2010-September/004197.html
Thanks in advance.
Here is the ValueError it raises:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-f09567c28e37> in <module>()
----> 1 scipy.fftpack.dct(c[100])
/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/scipy/fftpack/realtransforms.pyc in dct(x, type, n, axis, norm, overwrite_x)
118 raise NotImplementedError(
119 "Orthonormalization not yet supported for DCT-I")
--> 120 return _dct(x, type, n, axis, normalize=norm, overwrite_x=overwrite_x)
121
122 def idct(x, type=2, n=None, axis=-1, norm=None, overwrite_x=0):
/usr/local/Cellar/python/2.7.3/lib/python2.7/site-packages/scipy/fftpack/realtransforms.pyc in _dct(x, type, n, axis, overwrite_x, normalize)
215 raise ValueError("Type %d not understood" % type)
216 else:
--> 217 raise ValueError("dtype %s not supported" % tmp.dtype)
218
219 if normalize:
ValueError: dtype >f8 not supported
The problem is not the double precision; double precision is of course supported. The problem is that you have a little-endian computer but (maybe from loading a file?) big-endian data. Note the > in dtype >f8 not supported. It seems you will simply have to cast it to native byte order yourself. If you know it's double precision, you probably just want to convert everything to your native order once:
c = c.astype(float)
Though I guess you could also check c.dtype.byteorder, which I think should be '=', and switch if it isn't, something along the lines of:
if c.dtype.byteorder != '=':
    c = c.astype(c.dtype.newbyteorder('='))
This should also work if you happen to have single precision or integers.
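A minimal sketch of that byte-order check (the big-endian array here is constructed by hand to simulate data read from a file written on a big-endian machine):

```python
import numpy as np

# Simulated big-endian float64 data, e.g. read from a foreign file
c = np.array([1.0, 2.0, 3.0], dtype=">f8")
print(c.dtype.byteorder)      # '>' when running on a little-endian machine

if c.dtype.byteorder != "=":
    c = c.astype(c.dtype.newbyteorder("="))

print(c.dtype.byteorder)      # '=' (native order)
print(c.tolist())             # values are unchanged: [1.0, 2.0, 3.0]
```

After the byte swap the array is an ordinary native-order float64 array, which fftpack's dct accepts.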