error with persisted sklearn.feature_extraction.text.TfidfVectorizer - python

I persisted a TfidfVectorizer using the joblib module. The object I ran through the fit_transform method was a list of strings, and the resulting matrix had 263744 columns.
Now, when I run a list of strings through the transform method, I get the following error.
Any clues?
File "/usr/local/lib/python2.7/dist- packages/sklearn/feature_extraction/text.py",
line 1334, in transform
return self._tfidf.transform(X, copy=False)
File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py",
line 1037, in transform
X = X * self._idf_diag
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.py", line
318, in __mul__
return self._mul_sparse_matrix(other)
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.py",
line 487, in _mul_sparse_matrix
other = self.__class__(other) # convert to this format
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.py",
line 31, in __init__
arg1 = arg1.asformat(self.format)
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.py",
line 219, in asformat
return getattr(self,'to' + format)()
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/dia.py",
line 241, in tocsr
return self.tocoo().tocsr()
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/dia.py",
line 249, in tocoo
num_offsets, offset_len = self.data.shape
AttributeError: 'NDArrayWrapper' object has no attribute 'shape'

Assuming you are persisting the trained transformer or pipeline to disk and then reloading it before seeing the error, you could:
Try saving the original (working) object using the compress keyword argument to joblib.dump, with an integer value greater than 0:
_ = joblib.dump(python_object, persisted_file_name, compress=3)
If the persisted file is being moved to a new location, make sure to copy all of the file pieces. If the object is large, joblib will split it across several files, e.g.:
persisted_model.joblib.pkl
persisted_model.joblib.pkl_01.npy
persisted_model.joblib.pkl_02.npy
See the joblib docs for details.
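For reference, here is a minimal sketch of the dump/reload round trip with compression enabled (the file and variable names are placeholders, and the sklearn.externals.joblib import assumed here matches older scikit-learn versions). With compress greater than 0, joblib writes a single self-contained file, so there are no companion *.npy pieces to lose when copying:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.externals import joblib  # on newer scikit-learn: import joblib

# Fit the vectorizer on the training corpus (a list of strings).
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(["first training document", "second training document"])

# compress > 0 makes joblib write one self-contained file instead of
# a .pkl plus several _NN.npy companion files.
joblib.dump(vectorizer, "tfidf_vectorizer.joblib.pkl", compress=3)

# Later, possibly in another process: reload and transform new strings.
loaded = joblib.load("tfidf_vectorizer.joblib.pkl")
X_new = loaded.transform(["an unseen document"])
print(X_new.shape)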

Related

Numba Indexing Error: TypeError: Can't index at [0] in i8*

I'm learning how to use Numba to speed up functions with jit and vectorize. I didn't have any issues with the jit version of this code, but I am getting an indexing error with vectorize. I suspect this question's answer is getting at the right idea that there is a type error, but I'm not confident about which direction to take in changing the indexing. Included below is the function I've been playing around with, which outputs the Fibonacci numbers up to a chosen index of the sequence. What is going wrong with the indexing, and how can I correct my code to account for it?
from numba import vectorize
import numpy as np
from timeit import timeit

@vectorize
def fib(n):
    '''
    Adjusted from:
    https://lectures.quantecon.org/py/numba.html
    https://en.wikipedia.org/wiki/Fibonacci_number
    https://www.geeksforgeeks.org/program-for-nth-fibonacci-number/
    '''
    if n == 1:
        return np.ones(1)
    elif n > 1:
        x = np.empty(n)
        x[0] = 1
        x[1] = 1
        for i in range(2, n):
            x[i] = x[i-1] + x[i-2]
        return x
    else:
        print('WARNING: Check validity of input.')

print(timeit('fib(10)', globals={'fib': fib}))
Which results in the following error output.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/llvmlite/ir/instructions.py", line 619, in __init__
typ = typ.elements[i]
AttributeError: 'PointerType' object has no attribute 'elements'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/galen/Projects/myjekyllblog/test_code/quantecon_2.py", line 27, in <module>
print(timeit('fib(10)', globals={'fib':fib}))
File "/usr/lib/python3.6/timeit.py", line 233, in timeit
return Timer(stmt, setup, timer, globals).timeit(number)
File "/usr/lib/python3.6/timeit.py", line 178, in timeit
timing = self.inner(it, self.timer)
File "<timeit-src>", line 6, in inner
File "/usr/local/lib/python3.6/dist-packages/numba/npyufunc/dufunc.py", line 166, in _compile_for_args
return self._compile_for_argtys(tuple(argtys))
File "/usr/local/lib/python3.6/dist-packages/numba/npyufunc/dufunc.py", line 188, in _compile_for_argtys
cres, actual_sig)
File "/usr/local/lib/python3.6/dist-packages/numba/npyufunc/ufuncbuilder.py", line 157, in _build_element_wise_ufunc_wrapper
cres.objectmode, cres)
File "/usr/local/lib/python3.6/dist-packages/numba/npyufunc/wrappers.py", line 220, in build_ufunc_wrapper
env=envptr)
File "/usr/local/lib/python3.6/dist-packages/numba/npyufunc/wrappers.py", line 130, in build_fast_loop_body
env=env)
File "/usr/local/lib/python3.6/dist-packages/numba/npyufunc/wrappers.py", line 23, in _build_ufunc_loop_body
store(retval)
File "/usr/local/lib/python3.6/dist-packages/numba/npyufunc/wrappers.py", line 126, in store
out.store_aligned(retval, ind)
File "/usr/local/lib/python3.6/dist-packages/numba/npyufunc/wrappers.py", line 276, in store_aligned
self.context.pack_value(self.builder, self.fe_type, value, ptr)
File "/usr/local/lib/python3.6/dist-packages/numba/targets/base.py", line 482, in pack_value
dataval = self.data_model_manager[ty].as_data(builder, value)
File "/usr/local/lib/python3.6/dist-packages/numba/datamodel/models.py", line 558, in as_data
elems = self._as("as_data", builder, value)
File "/usr/local/lib/python3.6/dist-packages/numba/datamodel/models.py", line 530, in _as
self.get(builder, value, i)))
File "/usr/local/lib/python3.6/dist-packages/numba/datamodel/models.py", line 558, in as_data
elems = self._as("as_data", builder, value)
File "/usr/local/lib/python3.6/dist-packages/numba/datamodel/models.py", line 530, in _as
self.get(builder, value, i)))
File "/usr/local/lib/python3.6/dist-packages/numba/datamodel/models.py", line 624, in get
name="extracted." + self._fields[pos])
File "/usr/local/lib/python3.6/dist-packages/llvmlite/ir/builder.py", line 911, in extract_value
instr = instructions.ExtractValue(self.block, agg, idx, name=name)
File "/usr/local/lib/python3.6/dist-packages/llvmlite/ir/instructions.py", line 622, in __init__
% (list(indices), agg.type))
TypeError: Can't index at [0] in i8*
The error occurs because you are trying to vectorize a function that is essentially not vectorizable. I think you are confusing how @jit and @vectorize work: to speed up your functions you use @jit, while @vectorize is used to create NumPy universal functions. See the official documentation:
Using vectorize(), you write your function as operating over input scalars, rather than arrays. Numba will generate the surrounding loop (or kernel) allowing efficient iteration over the actual inputs.
So it is essentially not possible to create a NumPy universal function with the same functionality as your Fibonacci function. Here is the link to the official documentation on universal functions if you are interested.
So in order to use @vectorize, you need to write a function that can be used as a NumPy universal function. For your purpose of speeding up your code, you simply need to use @jit.
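As a rough illustration of the difference (a simplified sketch, not the poster's exact code): the array-building Fibonacci function can be compiled with @jit, while @vectorize fits scalar kernels that Numba then broadcasts over arrays; the scaled_add function and the signature string below are just examples.

import numpy as np
from numba import jit, vectorize

# @jit compiles the whole function, loops and array allocation included.
@jit(nopython=True)
def fib(n):
    x = np.empty(n)
    x[0] = 1.0
    if n > 1:
        x[1] = 1.0
        for i in range(2, n):
            x[i] = x[i - 1] + x[i - 2]
    return x

# @vectorize compiles a scalar kernel into a NumPy universal function.
@vectorize(['float64(float64, float64)'])
def scaled_add(a, b):
    return 2.0 * a + b

print(fib(10))
print(scaled_add(np.arange(3.0), np.ones(3)))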

2nd Iteration adding extra character in Pandas/Numpy

I am running the code below and it runs fine for the first iteration, but when the second iteration starts it gives me a KeyError. I notice that an "L" is appended to the key automatically when the second iteration starts.
Link to my code below:
Code for KNN having issues here
Link for the data that I am using is below:
Data used for the code
I'm not sure why this is happening. Can someone please let me know what is causing the issue? Help is greatly appreciated!
Traceback (most recent call last):
File "C:/Python27/myScripts/KNN.py", line 114, in <module>
pred_lst.append(predict_output_of_query(10.0, features_train, df_housePrice_train, features_test[i]))
File "C:/Python27/myScripts/KNN.py", line 96, in predict_output_of_query
avg1 += output_train["price"][i]
File "C:\Python27\lib\site-packages\pandas\core\series.py", line 557, in __getitem__
result = self.index.get_value(self, key)
File "C:\Python27\lib\site-packages\pandas\core\index.py", line 1790, in get_value
return self._engine.get_value(s, k)
File "pandas\index.pyx", line 103, in pandas.index.IndexEngine.get_value (pandas\index.c:3204)
File "pandas\index.pyx", line 111, in pandas.index.IndexEngine.get_value (pandas\index.c:2903)
File "pandas\index.pyx", line 157, in pandas.index.IndexEngine.get_loc (pandas\index.c:3843)
File "pandas\hashtable.pyx", line 303, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6525)
File "pandas\hashtable.pyx", line 309, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6463)
KeyError: 6818L
Now, I looked only at your get_numpy_data definition and think that it doesn't work as you'd expect. For example, the line
features_train, output_train = get_numpy_data(df_housePrice_train, feature_list, 'price')
seems to modify df_housePrice_train. And output_train becomes a NumPy array containing the string "price".
Update:
The line distances = [] should really be inside the function compute_distances. As written, the function appends elements to distances on every execution, and then indices (positions) of some of those elements are used to index into a data frame. On the first execution everything works fine, but later the list grows and some indices become larger than the size of the data frame.
Update 2
For completeness: KeyError: 6818L means that the long integer 6818 (L denotes a type here) is not a valid key in df_housePrice_train.
Needed code modification:
## KNN.py, line 61:
# distances = []  # <- delete this line

def compute_distances(features_instances, features_query):
    distances = []  # <-- add here
    # rest of the function body...
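Here is a hedged sketch of what the corrected function might look like; the body (row-by-row Euclidean distances) is guessed, since the linked KNN.py isn't reproduced here. The point is only that distances is rebuilt on every call, so its indices always line up with the rows of the training data:

import numpy as np

def compute_distances(features_instances, features_query):
    # Recreate the list on every call so positions in `distances`
    # always correspond to rows of `features_instances`.
    distances = []
    for row in features_instances:
        distances.append(np.sqrt(np.sum((row - features_query) ** 2)))
    return np.array(distances)

# Example usage with a tiny 2-D feature matrix and one query point.
train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
query = np.array([1.0, 1.0])
print(compute_distances(train, query))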

How do I fix the 'TypeError: hasattr(): attribute name must be string' error?

I have the following code:
import pymc as pm
from matplotlib import pyplot as plt
from pymc.Matplot import plot as mcplot
import numpy as np
from matplotlib import rc
res = [18.752, 12.450, 11.832]
v = pm.Uniform('v', 0, 20)
errors = pm.Uniform('errors', 0, 100, size = 3)
taus = 1/(errors ** 2)
mydist = pm.Normal('mydist', mu = v, tau = taus, value = res, observed = True)
model=pm.Model([mydist, errors, taus, v, res])
mcmc=pm.MCMC(model) # This is line 19 where the TypeError originates
mcmc.sample(20000,10000)
mcplot(mcmc.trace('mydist'))
For some reason it doesn't work; I get the 'TypeError: hasattr(): attribute name must be string' error, with the following trace:
Traceback (most recent call last):
File "<ipython-input-49-759ebaf4321c>", line 1, in <module>
runfile('C:/Users/Paul/.spyder2-py3/temp.py', wdir='C:/Users/Paul/.spyder2-py3')
File "C:\Users\Paul\Miniconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile
execfile(filename, namespace)
File "C:\Users\Paul\Miniconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 85, in execfile
exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
File "C:/Users/Paul/.spyder2-py3/temp.py", line 19, in <module>
mcmc=pm.MCMC(model)
File "C:\Users\Paul\Miniconda3\lib\site-packages\pymc\MCMC.py", line 82, in __init__
**kwds)
File "C:\Users\Paul\Miniconda3\lib\site-packages\pymc\Model.py", line 197, in __init__
Model.__init__(self, input, name, verbose)
File "C:\Users\Paul\Miniconda3\lib\site-packages\pymc\Model.py", line 99, in __init__
ObjectContainer.__init__(self, input)
File "C:\Users\Paul\Miniconda3\lib\site-packages\pymc\Container.py", line 606, in __init__
conservative_update(self, input_to_file)
File "C:\Users\Paul\Miniconda3\lib\site-packages\pymc\Container.py", line 549, in conservative_update
if not hasattr(obj, k):
TypeError: hasattr(): attribute name must be string
How do I make it work and output "mydist"?
Edit: I posted a wrong trace at first by accident.
Edit 2: It must all be because res doesn't have a name, since it's a plain array, but I don't know how to assign a name to it so that this works.
I must admit that I'm not familiar with pymc, but changing it to the following at least made the application run:
mydist = pm.Normal('mydist', mu = v, tau = taus, value = res, observed = False)
mcmc=pm.MCMC([mydist, errors, taus, v, res])
This seems to be because you were wrapping everything in a Model, which is an extension of ObjectContainer. Since you passed it a list, MCMC's file_items in Container.py tried to assign index 4 of a list using replace, but because Model is an ObjectContainer it assigned the key 4 in its __dict__, causing the weird TypeError you got. Removing the wrapping Model causes MCMC to correctly use a ListContainer instead.
Now, there's probably a bug in Model.py on line 543 where observed stochastics aren't stored in the database: the expression is for object in self.stochastics | self.deterministics:, but I suspect it should include self.observable_stochastics too. That is why I needed to change observed to False, or the last line would throw a KeyError.
I'm not familiar enough with pymc to determine whether it's actually a bug or desired behaviour, so I leave it up to you to submit an issue about it.
You simply need to define res as a numpy array:
res = np.array([18.752, 12.450, 11.832])
Then you'll get an error at mcmc.trace('mydist') because mydist is observed data and therefore is not sampled. You probably want to plot other variables...
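Putting the two suggestions together, a minimal sketch of how the script might look (assuming pymc 2.x and keeping the poster's variable names): res is a NumPy array, the list is passed to MCMC without a wrapping Model, and a sampled variable is plotted instead of the observed one.

import numpy as np
import pymc as pm
from pymc.Matplot import plot as mcplot

# res as a numpy array so pymc can wrap it as observed data
res = np.array([18.752, 12.450, 11.832])

v = pm.Uniform('v', 0, 20)
errors = pm.Uniform('errors', 0, 100, size=3)
taus = 1 / (errors ** 2)
mydist = pm.Normal('mydist', mu=v, tau=taus, value=res, observed=True)

# Pass the list directly; no wrapping pm.Model(...)
mcmc = pm.MCMC([mydist, errors, taus, v, res])
mcmc.sample(20000, 10000)

# mydist is observed, so it is not sampled; plot a sampled variable instead
mcplot(mcmc.trace('v'))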

How to fix "ValueError: need at least one array to concatenate" error

Following up from here: Calculating percentage of Bounding box overlap, for image detector evaluation, I'm getting an error at this line:
poly_clipped = poly.clip_to_bbox(clip_rect).to_polygons()[0]
This is the error:
File "C:\work_asaaki\code\detection.py", line 32, in clip_boxes
poly_clipped = poly.clip_to_bbox(clip_rect).to_polygons()[0]
File "C:\Anaconda\lib\site-packages\matplotlib\path.py", line 909, in clip_to_bbox
return self.make_compound_path(*paths)
File "C:\Anaconda\lib\site-packages\matplotlib\path.py", line 328, in make_compound_path
vertices = np.vstack([x.vertices for x in args])
File "C:\Anaconda\lib\site-packages\numpy\core\shape_base.py", line 228, in vstack
return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: need at least one array to concatenate
This doesn't always happen; it depends on the specific set of polygons. What I'm trying to understand is when exactly it fails, and how can I solve the issue?

Remove the numba.lowering.LoweringError: Internal error

I'm using Numba to speed up my code, which works fine without Numba. But after adding @jit, it crashes with this error:
Traceback (most recent call last):
File "C:\work_asaaki\code\gbc_classifier_train_7.py", line 54, in <module>
gentlebooster.train(X_train, y_train, boosting_rounds)
File "C:\work_asaaki\code\gentleboost_c_class_jit_v7_nolimit.py", line 298, in train
self.g_per_round, self.g = train_function(X, y, H)
File "C:\Anaconda\lib\site-packages\numba\dispatcher.py", line 152, in _compile_for_args
return self.jit(sig)
File "C:\Anaconda\lib\site-packages\numba\dispatcher.py", line 143, in jit
return self.compile(sig, **kws)
File "C:\Anaconda\lib\site-packages\numba\dispatcher.py", line 250, in compile
locals=self.locals)
File "C:\Anaconda\lib\site-packages\numba\compiler.py", line 183, in compile_bytecode
flags.no_compile)
File "C:\Anaconda\lib\site-packages\numba\compiler.py", line 323, in native_lowering_stage
lower.lower()
File "C:\Anaconda\lib\site-packages\numba\lowering.py", line 219, in lower
self.lower_block(block)
File "C:\Anaconda\lib\site-packages\numba\lowering.py", line 254, in lower_block
raise LoweringError(msg, inst.loc)
numba.lowering.LoweringError: Internal error:
NotImplementedError: ('cast', <llvm.core.Instruction object at 0x000000001801D320>, slice3_type, int64)
File "gentleboost_c_class_jit_v7_nolimit.py", line 103
Line 103 is below, in a loop:
weights = np.empty([n,m])
for curr_n in range(n):
    weights[curr_n,:] = 1.0/(n)  # this is line 103
where n is a constant already defined somewhere above in my code.
How can I remove the error? What "lowering" is going on? I'm using Anaconda 2.0.1 with Numba 0.13.x and Numpy 1.8.x on a 64-bit machine.
Based on this: https://gist.github.com/cc7768/bc5b8b7b9052708f0c0a,
I figured out what to do to avoid the issue. Instead of using the colon : to refer to a whole row/column, I expanded the loop into two nested loops to refer explicitly to the indices in each dimension of the array:
weights = np.empty([n,m])
for curr_n in range(n):
    for curr_m in range(m):
        weights[curr_n,curr_m] = 1.0/(n)
There were other places later in my code where I used the colon, but they didn't cause errors further down; I'm not sure why.
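A self-contained sketch of the workaround under @jit (the function name and standalone setting are hypothetical, since the surrounding training code isn't shown):

import numpy as np
from numba import jit

@jit(nopython=True)
def init_weights(n, m):
    # Fill element by element instead of assigning to a slice,
    # which the old Numba version could not lower in this context.
    weights = np.empty((n, m))
    for curr_n in range(n):
        for curr_m in range(m):
            weights[curr_n, curr_m] = 1.0 / n
    return weights

print(init_weights(3, 4))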
