KeyError when running CoherenceModel - python

I have a variable, data_words, which is my corpus: a list of lists of strings (tokens).
I also have a variable, topics, which is likewise a list of lists of strings (tokens).
I want to compute the 'c_v' coherence score for my topics, so I run the following code:
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel
id2word = corpora.Dictionary(data_words)
corpus = [id2word.doc2bow(text) for text in data_words]
coherence_score = CoherenceModel(topics=topics,
                                 texts=data_words,
                                 corpus=corpus,
                                 dictionary=id2word,
                                 coherence='c_v',
                                 topn=20).get_coherence()
However, when I run this, I get the following errors:
Traceback (most recent call last):
File "C:\Users\20200016\Anaconda3\lib\site-packages\gensim\models\coherencemodel.py", line 448, in _ensure_elements_are_ids
return np.array([self.dictionary.token2id[token] for token in topic])
File "C:\Users\20200016\Anaconda3\lib\site-packages\gensim\models\coherencemodel.py", line 448, in <listcomp>
return np.array([self.dictionary.token2id[token] for token in topic])
KeyError: 'afgelopen'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<ipython-input-570-8aef06174d6c>", line 1, in <module>
coherence_score = CoherenceModel(topics=topics,
File "C:\Users\20200016\Anaconda3\lib\site-packages\gensim\models\coherencemodel.py", line 215, in __init__
self.topics = topics
File "C:\Users\20200016\Anaconda3\lib\site-packages\gensim\models\coherencemodel.py", line 430, in topics
topic_token_ids = self._ensure_elements_are_ids(topic)
File "C:\Users\20200016\Anaconda3\lib\site-packages\gensim\models\coherencemodel.py", line 451, in _ensure_elements_are_ids
return np.array([self.dictionary.token2id[token] for token in topic])
File "C:\Users\20200016\Anaconda3\lib\site-packages\gensim\models\coherencemodel.py", line 451, in <listcomp>
return np.array([self.dictionary.token2id[token] for token in topic])
File "C:\Users\20200016\Anaconda3\lib\site-packages\gensim\models\coherencemodel.py", line 450, in <genexpr>
topic = (self.dictionary.id2token[_id] for _id in topic)
KeyError: 'lamp'
The error indicates that I am passing a str where I should have passed an id.
However, the variables and their types match the formats described in the documentation.
What can I do to get the coherence scores?
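The first KeyError in the traceback is raised by `self.dictionary.token2id[token]`, which suggests that at least one topic token (here 'afgelopen') does not occur in the dictionary built from data_words. A minimal sketch for checking this, assuming the topics may contain tokens that never appear in the corpus. The helper names are hypothetical, and a plain dict stands in for `id2word.token2id`:

```python
def missing_topic_tokens(topics, token2id):
    """For each topic, list the tokens that have no entry in token2id."""
    return [[tok for tok in topic if tok not in token2id] for topic in topics]


def keep_known_tokens(topics, token2id):
    """Drop tokens that the dictionary does not know about."""
    return [[tok for tok in topic if tok in token2id] for topic in topics]


# Stand-in for id2word.token2id (assumption: a token -> id mapping).
token2id = {"lamp": 0, "tafel": 1}
topics = [["lamp", "afgelopen"], ["tafel"]]

missing = missing_topic_tokens(topics, token2id)
filtered = keep_known_tokens(topics, token2id)
print(missing)
print(filtered)
```

With gensim, `id2word.token2id` could be passed as `token2id` and the filtered topics handed to CoherenceModel. Whether dropping unknown tokens is acceptable depends on how the topics were produced.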


Error executing FMU model with pyFMI: "pyfmi.fmi.FMUException: Failed to get the Boolean values"

I am using the code below to simulate a model.
def run_demo(with_plots=True):
    traj = np.array([[start_time, 2.25]])
    input_object = ('input_1[1]', traj)
    model = load_fmu('[pyfmimodel.fmu', log_level=7)
    opts = model.simulate_options()
    opts['ncp'] = 266
    # Simulate
    res = model.simulate(options=opts, input=input_object, final_time=stop_time)
This is the error I am getting. I need help to resolve this error.
Traceback (most recent call last):
File "D:\Projects\Python\DOCKER\model_2.py", line 55, in <module>
run_demo()
File "D:\Projects\Python\DOCKER\model_2.py", line 38, in run_demo
res = model.simulate(options=opts, input=input_object,final_time=stop_time )
File "src\pyfmi\fmi.pyx", line 7519, in pyfmi.fmi.FMUModelCS2.simulate
File "src\pyfmi\fmi.pyx", line 378, in pyfmi.fmi.ModelBase._exec_simulate_algorithm
File "src\pyfmi\fmi.pyx", line 372, in pyfmi.fmi.ModelBase._exec_simulate_algorithm
File "C:\Users\tcto5k\Miniconda3\lib\site-packages\pyfmi\fmi_algorithm_drivers.py", line 984, in __init__
self.result_handler.simulation_start()
File "C:\Users\tcto5k\Miniconda3\lib\site-packages\pyfmi\common\io.py", line 2553, in simulation_start
[parameter_data, sorted_vars_real_vref, sorted_vars_int_vref, sorted_vars_bool_vref] = fmi_util.prepare_data_info(data_info, sorted_vars,
File "src\pyfmi\fmi_util.pyx", line 257, in pyfmi.fmi_util.prepare_data_info
File "src\pyfmi\fmi_util.pyx", line 337, in pyfmi.fmi_util.prepare_data_info
File "src\pyfmi\fmi.pyx", line 4377, in pyfmi.fmi.FMUModelBase2.get_boolean
pyfmi.fmi.FMUException: Failed to get the Boolean values.
This is the FMU model variable definition, which accepts a 1D array as input:
<ScalarVariable name="input_1[1]" valueReference="0" description="u" causality="input" variability="continuous">
<Real start="2.0"/>
</ScalarVariable>
<!-- 2 -->
<ScalarVariable name="dense_3[1]" valueReference="614" description="y (1st order)" causality="output" variability="continuous" initial="calculated">
<Real/>
</ScalarVariable>

Pyspark UDF unable to use large dictionary

I have a dictionary whose keys are words and whose values are arrays of 300 floats.
I'm unable to use this dictionary in my PySpark UDF.
With 2 million keys it does not work, but when I reduce it to 200K keys it works.
This is the function I want to convert to a UDF:
def get_sentence_vector(sentence, dictionary_containing_word_vectors):
    cleanedSentence = list(clean_text(sentence))
    words_vector_list = np.zeros(300)  # 300-dimensional vector
    for x in cleanedSentence:
        try:
            words_vector_list = np.add(words_vector_list, dictionary_containing_word_vectors[str(x)])
        except Exception as e:
            print("Exception caught while finding word vector from Fast text pretrained model Dictionary: ", e)
    return words_vector_list.tolist()
This is my UDF
get_sentence_vector_udf = F.udf(lambda val: get_sentence_vector(val, fast_text_dictionary), ArrayType(FloatType()))
This is how I'm calling the UDF to add a column to my DataFrame:
dmp_df_with_vectors = df.filter(df.item_name.isNotNull()).withColumn("sentence_vector", get_sentence_vector_udf(df.item_name))
And this is the stack trace for the error
Traceback (most recent call last):
File "/usr/lib/spark/python/pyspark/broadcast.py", line 83, in dump
pickle.dump(value, f, 2)
SystemError: error return without exception set
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1957, in wrapper
return udf_obj(*args)
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1916, in __call__
judf = self._judf
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1900, in _judf
self._judf_placeholder = self._create_judf()
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1909, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1866, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/usr/lib/spark/python/pyspark/rdd.py", line 2377, in _prepare_for_python_RDD
broadcast = sc.broadcast(pickled_command)
File "/usr/lib/spark/python/pyspark/context.py", line 799, in broadcast
return Broadcast(self, value, self._pickled_broadcast_vars)
File "/usr/lib/spark/python/pyspark/broadcast.py", line 74, in __init__
self._path = self.dump(value, f)
File "/usr/lib/spark/python/pyspark/broadcast.py", line 90, in dump
raise pickle.PicklingError(msg)
cPickle.PicklingError: Could not serialize broadcast: SystemError: error return without exception set
How big is your fast_text_dictionary in the 2M case? It might be too big.
Try broadcasting it first, before running the UDF, e.g.
broadcastVar = sc.broadcast(fast_text_dictionary)
Then use broadcastVar.value instead of the dictionary inside your UDF.
See the Spark documentation on broadcast variables.
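The broadcast suggestion could be wired up roughly as follows. This is a sketch under assumptions, not a tested fix for the 2M-key case: `sum_word_vectors`, `make_sentence_vector_udf`, and the `tokenize` parameter are hypothetical names, `tokenize` stands in for the question's clean_text, and the PySpark imports are deferred so the pure-Python part can be tried without a cluster.

```python
def sum_word_vectors(words, vectors, dim=300):
    """Add up the vectors of all known words; unknown words are skipped."""
    total = [0.0] * dim
    for word in words:
        vec = vectors.get(word)
        if vec is None:
            continue
        total = [a + float(b) for a, b in zip(total, vec)]
    return total


def make_sentence_vector_udf(sc, word_vectors, tokenize=str.split):
    """Build a UDF that reads the word vectors from a broadcast variable."""
    # Imported here so the module loads even where PySpark is not installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, FloatType

    # Broadcast the dictionary explicitly instead of capturing it in the closure.
    broadcast_vectors = sc.broadcast(word_vectors)

    def sentence_vector(sentence):
        # .value gives the executor-side copy of the dictionary.
        return sum_word_vectors(tokenize(sentence), broadcast_vectors.value)

    return F.udf(sentence_vector, ArrayType(FloatType()))


# Pure-Python check of the summing logic with tiny 2-dimensional vectors.
demo = sum_word_vectors(["a", "b", "zzz"], {"a": [1.0, 2.0], "b": [3.0, 4.0]}, dim=2)
print(demo)
```

The returned UDF would be applied as in the question, e.g. `df.withColumn("sentence_vector", udf(df.item_name))`. The point of the design is that the UDF closure only references the broadcast handle, not the dictionary itself.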

How to parse email FROM headers with parentheses in Python?

I'm having trouble using the Python email module to parse emails where the FROM header has parentheses in it. This only seems to be the problem when using email.policy.default as opposed to email.policy.compat32.
Is there a solution to this problem, other than switching policies?
A minimum working example is below, for Python 3.6.5:
import email
import email.policy as email_policy
raw_mime_msg = b"from: James Mishra \\(says hi\\) <james@example.com>"
compat32_obj = email.message_from_bytes(
    raw_mime_msg, policy=email_policy.compat32)
default_obj = email.message_from_bytes(
    raw_mime_msg, policy=email_policy.default)
print(compat32_obj['from'])
print(default_obj['from'])
The first print statement returns:
James Mishra \(says hi\) <james@example.com>
and the second print statement raises:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/email/_header_value_parser.py", line 1908, in get_address
token, value = get_group(value)
File "/usr/local/lib/python3.6/email/_header_value_parser.py", line 1867, in get_group
"display name but found '{}'".format(value))
email.errors.HeaderParseError: expected ':' at end of group display name but found '\(says hi\) <james@example.com>'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/email/_header_value_parser.py", line 1734, in get_mailbox
token, value = get_name_addr(value)
File "/usr/local/lib/python3.6/email/_header_value_parser.py", line 1720, in get_name_addr
token, value = get_angle_addr(value)
File "/usr/local/lib/python3.6/email/_header_value_parser.py", line 1646, in get_angle_addr
"expected angle-addr but found '{}'".format(value))
email.errors.HeaderParseError: expected angle-addr but found '\(says hi\) <james@example.com>'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test_email.py", line 12, in <module>
print(default_obj['from'])
File "/usr/local/lib/python3.6/email/message.py", line 391, in __getitem__
return self.get(name)
File "/usr/local/lib/python3.6/email/message.py", line 471, in get
return self.policy.header_fetch_parse(k, v)
File "/usr/local/lib/python3.6/email/policy.py", line 162, in header_fetch_parse
return self.header_factory(name, value)
File "/usr/local/lib/python3.6/email/headerregistry.py", line 589, in __call__
return self[name](name, value)
File "/usr/local/lib/python3.6/email/headerregistry.py", line 197, in __new__
cls.parse(value, kwds)
File "/usr/local/lib/python3.6/email/headerregistry.py", line 340, in parse
kwds['parse_tree'] = address_list = cls.value_parser(value)
File "/usr/local/lib/python3.6/email/headerregistry.py", line 331, in value_parser
address_list, value = parser.get_address_list(value)
File "/usr/local/lib/python3.6/email/_header_value_parser.py", line 1931, in get_address_list
token, value = get_address(value)
File "/usr/local/lib/python3.6/email/_header_value_parser.py", line 1911, in get_address
token, value = get_mailbox(value)
File "/usr/local/lib/python3.6/email/_header_value_parser.py", line 1737, in get_mailbox
token, value = get_addr_spec(value)
File "/usr/local/lib/python3.6/email/_header_value_parser.py", line 1583, in get_addr_spec
token, value = get_local_part(value)
File "/usr/local/lib/python3.6/email/_header_value_parser.py", line 1413, in get_local_part
obs_local_part, value = get_obs_local_part(str(local_part) + value)
File "/usr/local/lib/python3.6/email/_header_value_parser.py", line 1454, in get_obs_local_part
token, value = get_word(value)
File "/usr/local/lib/python3.6/email/_header_value_parser.py", line 1340, in get_word
if value[0]=='"':
IndexError: string index out of range
email.policy.default is intended to be compliant with the email RFCs, and your message is not compliant with RFC 5322. If the parenthesized part is supposed to be a comment, then the message should look like
raw_mime_msg=b"from: James Mishra (says hi) <james@example.com>"
to be compliant. If it is not supposed to be a comment, then the parentheses should appear inside a quoted string. That might look something like
raw_mime_msg=b'from: "James Mishra (says hi)" <james@example.com>'
Since your message is not compliant, using the policy that expects compliance is a poor fit. If you want to handle non-compliant messages, email.policy.compat32 is a better choice than email.policy.default.
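As a quick check of the two compliant forms, here is a minimal sketch that parses both with email.policy.default. The helper name `parse_from` is hypothetical, and the address is assumed to be written with the usual `@` separator:

```python
import email
from email import policy

# Parenthesized part as an RFC 5322 comment.
comment_form = b"From: James Mishra (says hi) <james@example.com>\n\n"
# Parentheses kept as literal text inside a quoted string.
quoted_form = b'From: "James Mishra (says hi)" <james@example.com>\n\n'


def parse_from(raw):
    """Return (display_name, addr_spec) of the first From address."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    address = msg["from"].addresses[0]
    return address.display_name, address.addr_spec


for raw in (comment_form, quoted_form):
    print(parse_from(raw))
```

In the quoted form the parentheses stay part of the display name, whereas in the comment form they are treated as a comment by the parser.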

Error when creating vector in python from text file

I want to import data from a text file and build a vector space representation of its words:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(input="file")
f = open('D:\\test\\17.txt')
bag_of_words = vectorizer.fit(f)
bag_of_words = vectorizer.transform(f)
print(bag_of_words)
But I get this error:
Traceback (most recent call last):
File "D:\test\test.py", line 5, in <module>
bag_of_words = vectorizer.fit(f)
File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 776, in fit
self.fit_transform(raw_documents)
File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
self.fixed_vocabulary_)
File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 739, in _count_vocab
for feature in analyze(doc):
File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 236, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 110, in decode
doc = doc.read()
AttributeError: 'str' object has no attribute 'read'
Any ideas?
The vectorizer.fit method expects an iterable of file or string objects, not a single file object, so you should call vectorizer.fit([f]).
In addition, you cannot reuse f in the second call, vectorizer.transform, because the file has already been read to the end by then. What you probably want is the following:
vectorizer = CountVectorizer(input="file")
f = open('D:\\test\\17.txt')
bag_of_words = vectorizer.fit_transform([f])

Unable to read column family using pycassa

I've just started using pycassa, so if this is a stupid question, I apologize upfront.
I have a column family with the following schema:
create column family MyColumnFamilyTest
with column_type = 'Standard'
and comparator = 'CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.TimeUUIDType)'
and default_validation_class = 'BytesType'
and key_validation_class = 'UTF8Type'
and read_repair_chance = 0.1
and dclocal_read_repair_chance = 0.0
and populate_io_cache_on_flush = false
and gc_grace = 864000
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = true
and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
and caching = 'KEYS_ONLY'
and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};
When I try to do a get() with a valid key (works fine in cassandra-cli) I get:
Traceback (most recent call last):
File "<pyshell#19>", line 1, in <module>
cf.get('mykey',column_count=3)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 664, in get
return self._cosc_to_dict(list_col_or_super, include_timestamp, include_ttl)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 368, in _cosc_to_dict
ret[self._unpack_name(col.name)] = self._col_to_dict(col, include_timestamp, include_ttl)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 444, in _unpack_name
return self._name_unpacker(b)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 140, in unpack_composite
components.append(unpacker(bytestr[2:2 + length]))
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 374, in <lambda>
return lambda v: uuid.UUID(bytes=v)
File "/usr/lib/python2.7/uuid.py", line 144, in __init__
raise ValueError('bytes is not a 16-char string')
ValueError: bytes is not a 16-char string
Here's some more information I've discovered:
When using cassandra-cli I can see the data as:
% cassandra-cli -h 10.249.238.131
Connected to: "LocalDB" on 10.249.238.131/9160
Welcome to Cassandra CLI version 1.2.10-SNAPSHOT
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default@unknown] use Keyspace;
[default@Keyspace] list ColumnFamily;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: urn:keyspace:ColumnFamily:a36e8ab1-7032-4e4c-a53d-e3317f63a640:
=> (name=autoZoning:::, value=01, timestamp=1391298393966000)
=> (name=creationTime:::, value=00000143efd8b76e, timestamp=1391298393966000)
=> (name=inactive:::14fe78e0-8b9b-11e3-b171-005056b700bb, value=00, timestamp=1391298393966000)
=> (name=label:::14fe78e0-8b9b-11e3-b171-005056b700bb, value=726a6d2d766e782d76613031, timestamp=1391298393966000)
1 Row Returned.
Elapsed time: 16 msec(s).
Since it was unclear what was causing the exception, I decided to add a print prior to the 'return self._name_unpacker(b)' line in columnfamily.py and I see:
>>> cf.get(dict(cf.get_range(column_count=0,filter_empty=False)).keys()[0])
Attempting to unpack: <00>\rautoZoning<00><00><00><00><00><00><00><00><00><00>
Traceback (most recent call last):
File "<pyshell#172>", line 1, in <module>
cf.get(dict(cf.get_range(column_count=0,filter_empty=False)).keys()[0])
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 665, in get
return self._cosc_to_dict(list_col_or_super, include_timestamp, include_ttl)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 368, in _cosc_to_dict
ret[self._unpack_name(col.name)] = self._col_to_dict(col, include_timestamp, include_ttl)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 445, in _unpack_name
return self._name_unpacker(b)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 140, in unpack_composite
components.append(unpacker(bytestr[2:2 + length]))
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 374, in <lambda>
return lambda v: uuid.UUID(bytes=v)
File "/usr/lib/python2.7/uuid.py", line 144, in __init__
raise ValueError('bytes is not a 16-char string')
ValueError: bytes is not a 16-char string
I have no idea where the extra characters are coming from around the column name. But that got me curious so I added another print in _cosc_to_dict in columnfamily.py and I see:
>>> cf.get(dict(cf.get_range(column_count=0,filter_empty=False)).keys()[0])
list_col_or_super is: []
list_col_or_super is: [ColumnOrSuperColumn(column=Column(timestamp=1391298393966000,
name='\x00\rautoZoning\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', value='\x01', ttl=None),
counter_super_column=None, super_column=None, counter_column=None),
ColumnOrSuperColumn(column=Column(timestamp=1391298393966000,
name='\x00\x0ccreationTime\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
value='\x00\x00\x01C\xef\xd8\xb7n', ttl=None), counter_super_column=None, super_column=None,
counter_column=None), ColumnOrSuperColumn(column=Column(timestamp=1391298393966000,
name='\x00\x08inactive\x00\x00\x00\x00\x00\x00\x00\x00\x10\x14\xfex\xe0\x8b\x9b\x11\xe3\xb1q\x00PV\xb7\x00\xbb\x00', value='\x00', ttl=None), counter_super_column=None, super_column=None,
counter_column=None), ColumnOrSuperColumn(column=Column(timestamp=1391298393966000,
name='\x00\x05label\x00\x00\x00\x00\x00\x00\x00\x00\x10\x14\xfex\xe0\x8b\x9b\x11\xe3\xb1q\x00PV\xb7\x00\xbb\x00', value='thisIsATest', ttl=None), counter_super_column=None, super_column=None, counter_column=None)]
autoZoning unpack:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/columnfamily.py", line 666, in get
return self._cosc_to_dict(list_col_or_super, include_timestamp, include_ttl)
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/columnfamily.py", line 369, in _cosc_to_dict
ret[self._unpack_name(col.name)] = self._col_to_dict(col, include_timestamp, include_ttl)
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/columnfamily.py", line 446, in _unpack_name
return self._name_unpacker(b)
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/marshal.py", line 140, in unpack_composite
components.append(unpacker(bytestr[2:2 + length]))
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/marshal.py", line 374, in <lambda>
return lambda v: uuid.UUID(bytes=v)
File "/usr/lib64/python2.6/uuid.py", line 144, in __init__
raise ValueError('bytes is not a 16-char string')
ValueError: bytes is not a 16-char string
Am I correct in assuming that the extra characters around the column names are what is responsible for the 'ValueError: bytes is not a 16-char string' exception?
Also if I try to use the column name and select it I get:
>>> cf.get(u'urn:keyspace:ColumnFamily:a36e8ab1-7032-4e4c-a53d-e3317f63a640:',columns=['autoZoning:::'])
Traceback (most recent call last):
File "<pyshell#184>", line 1, in <module>
cf.get(u'urn:keyspace:ColumnFamily:a36e8ab1-7032-4e4c-a53d-e3317f63a640:',columns=['autoZoning:::'])
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 651, in get
cp = self._column_path(super_column, column)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 383, in _column_path
self._pack_name(column, False))
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 426, in _pack_name
return self._name_packer(value, slice_start)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 115, in pack_composite
packed = packer(item)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 298, in pack_uuid
randomize=True)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/util.py", line 75, in convert_time_to_uuid
'neither a UUID, a datetime, or a number')
ValueError: Argument for a v1 UUID column name or value was neither a UUID, a datetime, or a number
Any further thoughts?
Thanks,
Rob
It turns out that the problem wasn't with the key. It was caused, in part, by a bug in pycassa, which did not handle an empty (null) string in the UUID component of the column name. A short-term fix is described in this Google Groups answer:
https://groups.google.com/d/msg/pycassa-discuss/Vf_bSgDIi9M/KTA1kbE9IXAJ
The other part of the answer was to address the columns using tuples (with the UUID passed as a UUID object, not a str) instead of a single string with ':' separators, because that notation, as I found out, is specific to cassandra-cli.
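To illustrate why the empty UUID component trips the unpacker, here is a small standalone sketch (Python 3, standard library only) that decodes the raw column names printed above. It assumes the composite layout visible in that dump: each component is a 2-byte big-endian length, the value, and one end-of-component byte. The function names are hypothetical, and this is not pycassa's code or the fix from the linked thread:

```python
import struct
import uuid


def unpack_composite(raw):
    """Split a packed composite column name into its raw components."""
    components = []
    offset = 0
    while offset < len(raw):
        (length,) = struct.unpack_from(">H", raw, offset)
        offset += 2
        components.append(raw[offset:offset + length])
        offset += length + 1  # skip the end-of-component byte
    return components


def decode_column_name(raw):
    """Decode (text, text, text, TimeUUID), tolerating an empty UUID part."""
    *text_parts, uuid_part = unpack_composite(raw)
    # uuid.UUID(bytes=...) requires exactly 16 bytes, so map an empty part to None.
    time_uuid = uuid.UUID(bytes=uuid_part) if len(uuid_part) == 16 else None
    return tuple(part.decode("utf-8") for part in text_parts) + (time_uuid,)


# Names rebuilt from the dump: one with an empty UUID, one with a real one.
known_uuid = uuid.UUID("14fe78e0-8b9b-11e3-b171-005056b700bb")
raw_empty = b"\x00\x0ccreationTime" + b"\x00" * 10
raw_full = (b"\x00\x05label\x00" + b"\x00\x00\x00" * 2
            + b"\x00\x10" + known_uuid.bytes + b"\x00")

print(decode_column_name(raw_empty))
print(decode_column_name(raw_full))
```

The decoded tuples have the same shape as the tuples the answer recommends for addressing columns: three strings followed by a UUID object.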
