I want to import data from a text file and make a vector-space representation out of the words:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(input="file")
f = open('D:\\test\\17.txt')
bag_of_words = vectorizer.fit(f)
bag_of_words = vectorizer.transform(f)
print(bag_of_words)
But I get this error:
Traceback (most recent call last):
File "D:\test\test.py", line 5, in <module>
bag_of_words = vectorizer.fit(f)
File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 776, in fit
self.fit_transform(raw_documents)
File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
self.fixed_vocabulary_)
File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 739, in _count_vocab
for feature in analyze(doc):
File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 236, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 110, in decode
doc = doc.read()
AttributeError: 'str' object has no attribute 'read'
Any ideas?
The vectorizer.fit method expects an iterable of file or string objects (not a single file object), so you should call vectorizer.fit([f]).
In addition, you cannot reuse f in the second call to vectorizer.transform, because the file has already been read by then. What you probably want to do is the following:
vectorizer = CountVectorizer(input="file")
f = open('D:\\test\\17.txt')
bag_of_words = vectorizer.fit_transform([f])
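As a quick sanity check, here is a minimal sketch of the corrected flow (the file path is the one from the question) that also inspects the learned vocabulary and the resulting matrix:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(input="file")
with open('D:\\test\\17.txt') as f:
    bag_of_words = vectorizer.fit_transform([f])

print(vectorizer.vocabulary_)   # token -> column index mapping
print(bag_of_words.toarray())   # dense counts, one row per document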
I am using the code below to simulate a model.
import numpy as np
from pyfmi import load_fmu

def run_demo(with_plots=True):
    # start_time and stop_time are defined elsewhere in the script
    # First column: time; second column: the value of input_1[1]
    traj = np.array([[start_time, 2.25]])
    input_object = ('input_1[1]', traj)
    model = load_fmu('pyfmimodel.fmu', log_level=7)
    opts = model.simulate_options()
    opts['ncp'] = 266
    # Simulate
    res = model.simulate(options=opts, input=input_object, final_time=stop_time)
This is the error I am getting; I need help resolving it.
Traceback (most recent call last):
File "D:\Projects\Python\DOCKER\model_2.py", line 55, in <module>
run_demo()
File "D:\Projects\Python\DOCKER\model_2.py", line 38, in run_demo
res = model.simulate(options=opts, input=input_object,final_time=stop_time )
File "src\pyfmi\fmi.pyx", line 7519, in pyfmi.fmi.FMUModelCS2.simulate
File "src\pyfmi\fmi.pyx", line 378, in pyfmi.fmi.ModelBase._exec_simulate_algorithm
File "src\pyfmi\fmi.pyx", line 372, in pyfmi.fmi.ModelBase._exec_simulate_algorithm
File "C:\Users\tcto5k\Miniconda3\lib\site-packages\pyfmi\fmi_algorithm_drivers.py", line 984, in __init__
self.result_handler.simulation_start()
File "C:\Users\tcto5k\Miniconda3\lib\site-packages\pyfmi\common\io.py", line 2553, in simulation_start
[parameter_data, sorted_vars_real_vref, sorted_vars_int_vref, sorted_vars_bool_vref] = fmi_util.prepare_data_info(data_info, sorted_vars,
File "src\pyfmi\fmi_util.pyx", line 257, in pyfmi.fmi_util.prepare_data_info
File "src\pyfmi\fmi_util.pyx", line 337, in pyfmi.fmi_util.prepare_data_info
File "src\pyfmi\fmi.pyx", line 4377, in pyfmi.fmi.FMUModelBase2.get_boolean
pyfmi.fmi.FMUException: Failed to get the Boolean values.
This is the FMU model variable definition, which accepts a 1D array as input:
<ScalarVariable name="input_1[1]" valueReference="0" description="u" causality="input" variability="continuous">
<Real start="2.0"/>
</ScalarVariable>
<!-- 2 -->
<ScalarVariable name="dense_3[1]" valueReference="614" description="y (1st order)" causality="output" variability="continuous" initial="calculated">
<Real/>
</ScalarVariable>
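For reference (a sketch of the documented input format, not a fix for the Boolean-read failure itself): PyFMI's input argument takes a (name, data) pair where the first column of the data matrix is time and the remaining columns are the input values, and the trajectory normally spans the whole simulation window rather than a single point. A minimal sketch with placeholder times:
import numpy as np

start_time, stop_time = 0.0, 10.0  # placeholder values for illustration
# First column: time; second column: the value fed to input_1[1]
traj = np.array([[start_time, 2.25],
                 [stop_time,  2.25]])
input_object = ('input_1[1]', traj)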
I have a variable, data_words, which is my corpus and is a list of lists of strings (tokens).
I also have a variable topics, a list of lists of strings (tokens).
Now, I want to find the 'c_v' score for my topics. To do so, I run the following code:
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel
id2word = corpora.Dictionary(data_words)
corpus = [id2word.doc2bow(text) for text in data_words]
coherence_score = CoherenceModel(topics=topics,
                                 texts=data_words,
                                 corpus=corpus,
                                 dictionary=id2word,
                                 coherence='c_v',
                                 topn=20).get_coherence()
However, when I run this, I get the following errors:
Traceback (most recent call last):
File "C:\Users\20200016\Anaconda3\lib\site-packages\gensim\models\coherencemodel.py", line 448, in _ensure_elements_are_ids
return np.array([self.dictionary.token2id[token] for token in topic])
File "C:\Users\20200016\Anaconda3\lib\site-packages\gensim\models\coherencemodel.py", line 448, in <listcomp>
return np.array([self.dictionary.token2id[token] for token in topic])
KeyError: 'afgelopen'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<ipython-input-570-8aef06174d6c>", line 1, in <module>
coherence_score = CoherenceModel(topics=topics,
File "C:\Users\20200016\Anaconda3\lib\site-packages\gensim\models\coherencemodel.py", line 215, in __init__
self.topics = topics
File "C:\Users\20200016\Anaconda3\lib\site-packages\gensim\models\coherencemodel.py", line 430, in topics
topic_token_ids = self._ensure_elements_are_ids(topic)
File "C:\Users\20200016\Anaconda3\lib\site-packages\gensim\models\coherencemodel.py", line 451, in _ensure_elements_are_ids
return np.array([self.dictionary.token2id[token] for token in topic])
File "C:\Users\20200016\Anaconda3\lib\site-packages\gensim\models\coherencemodel.py", line 451, in <listcomp>
return np.array([self.dictionary.token2id[token] for token in topic])
File "C:\Users\20200016\Anaconda3\lib\site-packages\gensim\models\coherencemodel.py", line 450, in <genexpr>
topic = (self.dictionary.id2token[_id] for _id in topic)
KeyError: 'lamp'
The error indicates that I am passing a str where I should have passed an id.
However, the variables and their types align with the formats described in the documentation.
What can I do to get the coherence scores?
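One hedged suggestion, based on the KeyErrors rather than anything the documentation confirms: tokens such as 'afgelopen' and 'lamp' appear in topics but not in the dictionary built from data_words, and CoherenceModel can only look up tokens that exist in dictionary.token2id. A minimal sketch that drops the unknown tokens before scoring:
# Keep only the topic tokens that exist in the dictionary's vocabulary
filtered_topics = [
    [token for token in topic if token in id2word.token2id]
    for topic in topics
]

coherence_score = CoherenceModel(topics=filtered_topics,
                                 texts=data_words,
                                 dictionary=id2word,
                                 coherence='c_v',
                                 topn=20).get_coherence()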
I'm working on slowly converting my very serial text analysis engine to use Modin and Ray. It feels like I'm nearly there; however, I seem to have hit a stumbling block. My code looks like this:
vectorizer = TfidfVectorizer(
    analyzer=ngrams, encoding="ascii", stop_words="english", strip_accents="ascii"
)
tf_idf_matrix = vectorizer.fit_transform(r_strings["name"])
r_vectorizer = ray.put(vectorizer)
r_tf_idf_matrix = ray.put(tf_idf_matrix)
n = 2
match_results = []
for fn in files["c.file"]:
    match_results.append(
        match_name.remote(fn, r_vectorizer, r_tf_idf_matrix, r_strings, n)
    )
match_returns = ray.get(match_results)
I'm following the guidance from the "anti-patterns" section of the Ray documentation on what to avoid, and this code closely follows the documented "better" pattern. Nevertheless, I get this error:
Traceback (most recent call last):
File "alt.py", line 213, in <module>
match_returns = ray.get(match_results)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
return func(*args, **kwargs)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/worker.py", line 1501, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(PicklingError): ray::match_name() (pid=23393, ip=192.168.1.173)
File "python/ray/_raylet.pyx", line 564, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 565, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1652, in ray._raylet.CoreWorker.store_task_outputs
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 327, in serialize
return self._serialize_to_msgpack(value)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 307, in _serialize_to_msgpack
self._serialize_to_pickle5(metadata, python_objects)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 267, in _serialize_to_pickle5
raise e
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 264, in _serialize_to_pickle5
value, protocol=5, buffer_callback=writer.buffer_callback)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 580, in dump
return Pickler.dump(self, obj)
_pickle.PicklingError: args[0] from __newobj__ args has the wrong class
Definitely an unexpected result. I'm not sure where to go next with this and would appreciate help from folks who have more experience with Ray and Modin.
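For what it's worth, one direction to investigate (an assumption on my part, not a confirmed diagnosis): the traceback fails inside store_task_outputs, i.e. while serializing the return value of match_name, so the task may be returning a Modin object that cloudpickle cannot rebuild on the way back. A minimal sketch of that workaround, with a hypothetical stub in place of the real matching logic, using Modin's _to_pandas() conversion:
import ray
import modin.pandas as mpd

ray.init(ignore_reinit_error=True)

@ray.remote
def match_name_stub():
    # Hypothetical stand-in for the real match_name logic
    df = mpd.DataFrame({"name": ["a", "b"], "score": [0.9, 0.8]})
    # Convert to a plain pandas DataFrame before returning, so Ray
    # serializes a vanilla pandas object instead of a Modin one
    return df._to_pandas()

print(ray.get(match_name_stub.remote()))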
I want to test my Keras model, but I've run into a problem. I have an image for checking at the "path" below.
path = 'C:\\Users\\Администратор\\AppData\\Local\\Programs\\Python\\Python36-32\\577793008_ef4345205b.jpg'
model = keras.models.load_model('C:\\Users\\Администратор\\AppData\\Local\\Programs\\Python\\Python36-32\\model1.h5')
predictions = model.predict(path)
print (predictions[0])
Here is the error:
Traceback (most recent call last):
File "C:\Users\Администратор\AppData\Local\Programs\Python\Python36-32\load1.py", line 11, in <module>
predictions = model.predict(path)
File "C:\Users\Администратор\AppData\Local\Programs\Python\Python36-32\lib\site-packages\keras\engine\training.py", line 1441, in predict
x, _, _ = self._standardize_user_data(x)
File "C:\Users\Администратор\AppData\Local\Programs\Python\Python36-32\lib\site-packages\keras\engine\training.py", line 579, in _standardize_user_data
exception_prefix='input')
File "C:\Users\Администратор\AppData\Local\Programs\Python\Python36-32\lib\site-packages\keras\engine\training_utils.py", line 99, in standardize_input_data
data = [standardize_single_array(x) for x in data]
File "C:\Users\Администратор\AppData\Local\Programs\Python\Python36-32\lib\site-packages\keras\engine\training_utils.py", line 99, in <listcomp>
data = [standardize_single_array(x) for x in data]
File "C:\Users\Администратор\AppData\Local\Programs\Python\Python36-32\lib\site-packages\keras\engine\training_utils.py", line 34, in standardize_single_array
elif x.ndim == 1:
AttributeError: 'str' object has no attribute 'ndim'
The predict method can take several types of input, but not a string: it cannot read a file directly from a path.
You need to transform the file into something the model can consume, for instance by reading the image and converting its content to an array.
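For example, a minimal sketch using Keras' image utilities, reusing the path and model file from the question (the target_size is an assumption and must match the input shape model1.h5 was trained on):
import numpy as np
import keras
from keras.preprocessing import image

model = keras.models.load_model('C:\\Users\\Администратор\\AppData\\Local\\Programs\\Python\\Python36-32\\model1.h5')
path = 'C:\\Users\\Администратор\\AppData\\Local\\Programs\\Python\\Python36-32\\577793008_ef4345205b.jpg'

# target_size is an assumption; use the shape the model expects
img = image.load_img(path, target_size=(224, 224))
x = image.img_to_array(img)      # shape: (height, width, channels)
x = np.expand_dims(x, axis=0)    # add a batch dimension -> (1, h, w, c)

predictions = model.predict(x)
print(predictions[0])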
I have trained an HMM model to add punctuation to Arabic text, and I want to save it so that I don't have to repeat the training phase every time I give the model a text to tag. I use pickle for this task, as I have seen in tutorials. I do exactly what they do, but it fails and gives me this error!
Traceback (most recent call last):
File "C:\Python27\file_pun_tag.py", line 205, in <module>
hmm_tagger("test_file.txt")
File "C:\Python27\file_pun_tag.py", line 179, in hmm_tagger
hmm = pickle.load(saved_model)
File "C:\Python27\lib\pickle.py", line 1378, in load
return Unpickler(file).load()
File "C:\Python27\lib\pickle.py", line 858, in load
dispatch[key](self)
File "C:\Python27\lib\pickle.py", line 1133, in load_reduce
value = func(*args)
TypeError: __init__() takes at least 3 arguments (2 given)
I tried several solutions, but none of them worked for me...
Here is the code where I save my model. It works correctly, saving the model and creating "hmm.pickle":
import codecs
import pickle
import nltk
from nltk.probability import LidstoneProbDist

file = codecs.open("train_sents_hmm.txt", "r", "utf_8")
train_sents = file.readlines()
labelled_sequences, tag_set, symbols = load_pun(train_sents)
trainer = nltk.HiddenMarkovModelTrainer(tag_set, symbols)
hmm = trainer.train_supervised(labelled_sequences, estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))
# save object
save_model = open("hmm.pickle", "wb")
pickle.dump(hmm, save_model, -1)
save_model.close()
And here is the code where I try to load the model after saving it; this is where it gives me the error:
saved_model = open("hmm.pickle", "rb")
hmm = pickle.load(saved_model)
saved_model.close()
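A common workaround when plain pickle cannot reconstruct NLTK objects trained with lambda estimators is to serialize with the dill library instead, since dill can handle lambdas; this is a hedged suggestion rather than a confirmed fix. A minimal sketch, reusing the hmm object and file name from above:
import dill

# Save with dill instead of pickle
with open("hmm.pickle", "wb") as save_model:
    dill.dump(hmm, save_model)

# Load it back
with open("hmm.pickle", "rb") as saved_model:
    hmm = dill.load(saved_model)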