Pyspark UDF unable to use large dictionary

Pyspark UDF unable to use large dictionary - python

I've a dictionary consisting of keys = word, value = Array of 300 float numbers.
I'm unable to use this dictionary in my pyspark UDF.
When size of this dictionary is 2Million keys it does not work. But when I reduce the size to 200K it works.
This is my code for the function to be converted to UDF
def get_sentence_vector(sentence, dictionary_containing_word_vectors):
cleanedSentence = list(clean_text(sentence))
words_vector_list = np.zeros(300)# 300 dimensional vector
for x in cleanedSentence:
try:
words_vector_list = np.add(words_vector_list, dictionary_containing_word_vectors[str(x)])
except Exception as e:
print("Exception caught while finding word vector from Fast text pretrained model Dictionary: ",e)
return words_vector_list.tolist()
This is my UDF
get_sentence_vector_udf = F.udf(lambda val: get_sentence_vector(val, fast_text_dictionary), ArrayType(FloatType()))
This is how i'm calling the udf to be added as a column in my dataframe
dmp_df_with_vectors = df.filter(df.item_name.isNotNull()).withColumn("sentence_vector", get_sentence_vector_udf(df.item_name))
And this is the stack trace for the error
Traceback (most recent call last):
File "/usr/lib/spark/python/pyspark/broadcast.py", line 83, in dump
pickle.dump(value, f, 2)
SystemError: error return without exception set
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1957, in wrapper
return udf_obj(*args)
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1916, in __call__
judf = self._judf
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1900, in _judf
self._judf_placeholder = self._create_judf()
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1909, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1866, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/usr/lib/spark/python/pyspark/rdd.py", line 2377, in _prepare_for_python_RDD
broadcast = sc.broadcast(pickled_command)
File "/usr/lib/spark/python/pyspark/context.py", line 799, in broadcast
return Broadcast(self, value, self._pickled_broadcast_vars)
File "/usr/lib/spark/python/pyspark/broadcast.py", line 74, in __init__
self._path = self.dump(value, f)
File "/usr/lib/spark/python/pyspark/broadcast.py", line 90, in dump
raise pickle.PicklingError(msg)
cPickle.PicklingError: Could not serialize broadcast: SystemError: error return without exception set

How big is your fast_text_dictionary in the 2M case? It might be too big.
Try broadcast it first before running udf. e.g.
broadcastVar = sc.broadcast(fast_text_dictionary)
Then use broadcastVar instead in your udf.
See the document for broadcast

Related

My first hello word program in MistQL is throwing exception Expected RuntimeValueType.Number, got RuntimeValueType.String

I was trying this tutorial given in the MistQL for a personal work but this below code is throwing exception as given below
import mistql
data="{\"foo\": \"bar\"}"
query = '#.foo'
results = mistql.query(query, data)
print(results)
Traceback (most recent call last): File "C:\Temp\sample.py", line 4,
in
purchaserEmails = mistql.query(query, data) File "C:\Python3.10.0\lib\site-packages\mistql\query.py", line 18, in query
result = execute_outer(ast, data) File "C:\Python3.10.0\lib\site-packages\mistql\execute.py", line 73, in
execute_outer
return execute(ast, build_initial_stack(data, builtins)) File "C:\Python3.10.0\lib\site-packages\typeguard_init_.py", line 1033,
in wrapper
retval = func(*args, **kwargs) File "C:\Python3.10.0\lib\site-packages\mistql\execute.py", line 60, in
execute
return execute_fncall(ast.fn, ast.args, stack) File "C:\Python3.10.0\lib\site-packages\typeguard_init_.py", line 1033,
in wrapper
retval = func(*args, **kwargs) File "C:\Python3.10.0\lib\site-packages\mistql\execute.py", line 28, in
execute_fncall
return function_definition(arguments, stack, execute) File "C:\Python3.10.0\lib\site-packages\mistql\builtins.py", line 37, in
wrapped
return fn(arguments, stack, exec) File "C:\Python3.10.0\lib\site-packages\mistql\builtins.py", line 168, in
dot
return _index_single(RuntimeValue.of(right.name), left) File "C:\Python3.10.0\lib\site-packages\mistql\builtins.py", line 295, in
_index_single
assert_type(index, RVT.Number) File "C:\Python3.10.0\lib\site-packages\mistql\runtime_value.py", line 344,
in assert_type
raise MistQLTypeError(f"Expected {expected_type}, got {value.type}") mistql.exceptions.MistQLTypeError: Expected
RuntimeValueType.Number, got RuntimeValueType.String

Strings passed as data into MistQL's query method correspond to MistQL string types, and thus the #.foo query is attempting to index the string "{\"foo\": \"bar\"}") using the "foo" value, leading to the error above.
MistQL is expecting native dictionaries and lists. The correspondence between Python data types and MistQL data types can be seen here
That being said, the error message is indeed very confusing and should be fixed post-haste! Issue here

PicklingError when getting the result from ray

I'm working on slowly converting my very serialized text analysis engine to use Modin and Ray. Feels like I'm nearly there, however, I seem to have hit a stumbling block. My code looks like this:
vectorizer = TfidfVectorizer(
analyzer=ngrams, encoding="ascii", stop_words="english", strip_accents="ascii"
)
tf_idf_matrix = vectorizer.fit_transform(r_strings["name"])
r_vectorizer = ray.put(vectorizer)
r_tf_idf_matrix = ray.put(tf_idf_matrix)
n = 2
match_results = []
for fn in files["c.file"]:
match_results.append(
match_name.remote(fn, r_vectorizer, r_tf_idf_matrix, r_strings, n)
)
match_returns = ray.get(match_results)
I'm following the guidance from the "anti-patterns" section in the Ray documentation, on what to avoid, and this is very similar to that of the "better" pattern.
Traceback (most recent call last):
File "alt.py", line 213, in <module>
match_returns = ray.get(match_results)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
return func(*args, **kwargs)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/worker.py", line 1501, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(PicklingError): ray::match_name() (pid=23393, ip=192.168.1.173)
File "python/ray/_raylet.pyx", line 564, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 565, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1652, in ray._raylet.CoreWorker.store_task_outputs
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 327, in serialize
return self._serialize_to_msgpack(value)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 307, in _serialize_to_msgpack
self._serialize_to_pickle5(metadata, python_objects)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 267, in _serialize_to_pickle5
raise e
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 264, in _serialize_to_pickle5
value, protocol=5, buffer_callback=writer.buffer_callback)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 580, in dump
return Pickler.dump(self, obj)
_pickle.PicklingError: args[0] from __newobj__ args has the wrong class
Definitely an unexpected result. I'm not sure where to go next with this and would appreciate help from folks who have more experience with Ray and Modin.

how to explode dict (or list of dict) object in multiple column in dask.dataframe

When I try to convert some xml to dataframe using xmltodict it happens that a particular column contains all the info I need as dict or list of dict. I'm able to convert this column in multiple ones with pandas but I'm not able to perform the similar operation in dask.
Is not possible to use meta because I've no idea of all the possible fields that are available in the xml and dask is necessary because the true xml files are bigger than 1Gb each.
example.xml:
<?xml version="1.0" encoding="UTF-8"?>
<itemList>
<eventItem uid="1">
<timestamp>2019-07-04T09:57:35.044Z</timestamp>
<eventType>generic</eventType>
<details>
<detail>
<name>columnA</name>
<value>AAA</value>
</detail>
<detail>
<name>columnB</name>
<value>BBB</value>
</detail>
</details>
</eventItem>
<eventItem uid="2">
<timestamp>2019-07-04T09:57:52.188Z</timestamp>
<eventType>generic</eventType>
<details>
<detail>
<name>columnC</name>
<value>CCC</value>
</detail>
</details>
</eventItem>
</itemList>
Working pandas code:
import xmltodict
import collections
import pandas as pd
def pd_output_dict(details):
detail = details.get("detail", [])
ret_value = {}
if type(detail) in (collections.OrderedDict, dict):
ret_value[detail["name"]] = detail["value"]
elif type(detail) == list:
for i in detail:
ret_value[i["name"]] = i["value"]
return pd.Series(ret_value)
with open("example.xml", "r", encoding="utf8") as f:
df_dict_list = xmltodict.parse(f.read()).get("itemList", {}).get("eventItem", [])
df = pd.DataFrame(df_dict_list)
df = pd.concat([df, df.apply(lambda row: pd_output_dict(row.details), axis=1, result_type="expand")], axis=1)
print(df.head())
Not working dask code:
import xmltodict
import collections
import dask
import dask.bag as db
import dask.dataframe as dd
def dd_output_dict(row):
detail = row.get("details", {}).get("detail", [])
ret_value = {}
if type(detail) in (collections.OrderedDict, dict):
row[detail["name"]] = detail["value"]
elif type(detail) == list:
for i in detail:
row[i["name"]] = i["value"]
return row
with open("example.xml", "r", encoding="utf8") as f:
df_dict_list = xmltodict.parse(f.read()).get("itemList", {}).get("eventItem", [])
df_bag = db.from_sequence(df_dict_list)
df = df_bag.to_dataframe()
df = df.apply(lambda row: dd_output_dict(row), axis=1)
The idea is to have in dask similar result I've in pandas but a the moment I'm receiving errors:
>>> df = df.apply(lambda row: output_dict(row), axis=1)
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\dask\dataframe\utils.py", line 169, in raise_on_meta_error
yield
File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 4711, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "C:\Anaconda3\lib\site-packages\dask\utils.py", line 854, in __call__
return getattr(obj, self.method)(*args, **kwargs)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 6487, in apply
return op.get_result()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 151, in get_result
return self.apply_standard()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 257, in apply_standard
self.apply_series_generator()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 286, in apply_series_generator
results[i] = self.f(v)
File "<stdin>", line 1, in <lambda>
File "<stdin>", line 4, in output_dict
AttributeError: ("'str' object has no attribute 'get'", 'occurred at index 0')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 3964, in apply
M.apply, self._meta_nonempty, func, args=args, udf=True, **kwds
File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 4711, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "C:\Anaconda3\lib\contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "C:\Anaconda3\lib\site-packages\dask\dataframe\utils.py", line 190, in raise_on_meta_error
raise ValueError(msg)
ValueError: Metadata inference failed in `apply`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
AttributeError("'str' object has no attribute 'get'", 'occurred at index 0')
Traceback:
---------
File "C:\Anaconda3\lib\site-packages\dask\dataframe\utils.py", line 169, in raise_on_meta_error
yield
File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 4711, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "C:\Anaconda3\lib\site-packages\dask\utils.py", line 854, in __call__
return getattr(obj, self.method)(*args, **kwargs)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 6487, in apply
return op.get_result()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 151, in get_result
return self.apply_standard()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 257, in apply_standard
self.apply_series_generator()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 286, in apply_series_generator
results[i] = self.f(v)
File "<stdin>", line 1, in <lambda>
File "<stdin>", line 4, in output_dict

Right, so operations like map_partitions will need to know the column names and data types. As you've mentioned, you can specify this with the meta= keyword.
Perhaps you can run through your data once to compute what these will be, and then construct a proper meta object, and pass that in? This is inefficient, and requires reading through all of your data, but I'm not sure that there is another way.

error in creating series in pandas

I am getting an error when creating a series in pandas.
Whenever I try to print the series I have created, I get an error.
The code I am running:
import pandas as pd
data2 = [1,2,3,4]
index = ['a','b','c','d']
s = pd.Series(data2, index)
print(s.shape)
s
The error:
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
s
File "C:\Python34\lib\idlelib\rpc.py", line 611, in displayhook
text = repr(value)
File "C:\Python34\lib\site-packages\pandas\core\base.py", line 80, in __repr__
return str(self)
File "C:\Python34\lib\site-packages\pandas\core\base.py", line 59, in __str__
return self.__unicode__()
File "C:\Python34\lib\site-packages\pandas\core\series.py", line 1060, in __unicode__
width, height = get_terminal_size()
File "C:\Python34\lib\site-packages\pandas\io\formats\terminal.py", line 33, in get_terminal_size
return shutil.get_terminal_size()
File "C:\Python34\lib\shutil.py", line 1071, in get_terminal_size
size = os.get_terminal_size(sys.__stdout__.fileno())
AttributeError: 'NoneType' object has no attribute 'fileno'

Your error is related to pyshell, not to pandas.
Try to run it through python directly or jupyter console, because the code you provided is correct.

Unable to read column family using pycassa

I've just started using pycassa, so if this is a stupid question, I apologize upfront.
I have a column family with the following schema:
create column family MyColumnFamilyTest
with column_type = 'Standard'
and comparator = 'CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.TimeUUIDType)'
and default_validation_class = 'BytesType'
and key_validation_class = 'UTF8Type'
and read_repair_chance = 0.1
and dclocal_read_repair_chance = 0.0
and populate_io_cache_on_flush = false
and gc_grace = 864000
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = true
and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
and caching = 'KEYS_ONLY'
and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};
When I try to do a get() with a valid key (works fine in cassandra-cli) I get:
Traceback (most recent call last):
File "<pyshell#19>", line 1, in <module>
cf.get('mykey',column_count=3)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 664, in get
return self._cosc_to_dict(list_col_or_super, include_timestamp, include_ttl)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 368, in _cosc_to_dict
ret[self._unpack_name(col.name)] = self._col_to_dict(col, include_timestamp, include_ttl)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 444, in _unpack_name
return self._name_unpacker(b)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 140, in unpack_composite
components.append(unpacker(bytestr[2:2 + length]))
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 374, in <lambda>
return lambda v: uuid.UUID(bytes=v)
File "/usr/lib/python2.7/uuid.py", line 144, in __init__
raise ValueError('bytes is not a 16-char string')
ValueError: bytes is not a 16-char string
Here's some more information I've discovered:
When using cassandra-cli I can see the data as:
% cassandra-cli -h 10.249.238.131
Connected to: "LocalDB" on 10.249.238.131/9160
Welcome to Cassandra CLI version 1.2.10-SNAPSHOT
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default#unknown] use Keyspace;
[default#Keyspace] list ColumnFamily;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: urn:keyspace:ColumnFamily:a36e8ab1-7032-4e4c-a53d-e3317f63a640:
=> (name=autoZoning:::, value=01, timestamp=1391298393966000)
=> (name=creationTime:::, value=00000143efd8b76e, timestamp=1391298393966000)
=> (name=inactive:::14fe78e0-8b9b-11e3-b171-005056b700bb, value=00, timestamp=1391298393966000)
=> (name=label:::14fe78e0-8b9b-11e3-b171-005056b700bb, value=726a6d2d766e782d76613031, timestamp=1391298393966000)
1 Row Returned.
Elapsed time: 16 msec(s).
Since it was unclear what was causing the exception, I decided to add a print prior to the 'return self._name_unpacker(b)' line in columnfamily.py and I see:
>>> cf.get(dict(cf.get_range(column_count=0,filter_empty=False)).keys()[0])
Attempting to unpack: <00>\rautoZoning<00><00><00><00><00><00><00><00><00><00>
Traceback (most recent call last):
File "<pyshell#172>", line 1, in <module>
cf.get(dict(cf.get_range(column_count=0,filter_empty=False)).keys()[0])
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 665, in get
return self._cosc_to_dict(list_col_or_super, include_timestamp, include_ttl)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 368, in _cosc_to_dict
ret[self._unpack_name(col.name)] = self._col_to_dict(col, include_timestamp, include_ttl)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 445, in _unpack_name
return self._name_unpacker(b)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 140, in unpack_composite
components.append(unpacker(bytestr[2:2 + length]))
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 374, in <lambda>
return lambda v: uuid.UUID(bytes=v)
File "/usr/lib/python2.7/uuid.py", line 144, in __init__
raise ValueError('bytes is not a 16-char string')
ValueError: bytes is not a 16-char string
I have no idea where the extra characters are coming from around the column name. But that got me curious so I added another print in _cosc_to_dict in columnfamily.py and I see:
>>> cf.get(dict(cf.get_range(column_count=0,filter_empty=False)).keys()[0])
list_col_or_super is: []
list_col_or_super is: [ColumnOrSuperColumn(column=Column(timestamp=1391298393966000,
name='\x00\rautoZoning\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', value='\x01', ttl=None),
counter_super_column=None, super_column=None, counter_column=None),
ColumnOrSuperColumn(column=Column(timestamp=1391298393966000,
name='\x00\x0ccreationTime\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
value='\x00\x00\x01C\xef\xd8\xb7n', ttl=None), counter_super_column=None, super_column=None,
counter_column=None), ColumnOrSuperColumn(column=Column(timestamp=1391298393966000,
name='\x00\x08inactive\x00\x00\x00\x00\x00\x00\x00\x00\x10\x14\xfex\xe0\x8b\x9b\x11\xe3\xb1q\x00PV\xb7\x00\xbb\x00', value='\x00', ttl=None), counter_super_column=None, super_column=None,
counter_column=None), ColumnOrSuperColumn(column=Column(timestamp=1391298393966000,
name='\x00\x05label\x00\x00\x00\x00\x00\x00\x00\x00\x10\x14\xfex\xe0\x8b\x9b\x11\xe3\xb1q\x00PV\xb7\x00\xbb\x00', value='thisIsATest', ttl=None), counter_super_column=None, super_column=None, counter_column=None)]
autoZoning unpack:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/columnfamily.py", line 666, in get
return self._cosc_to_dict(list_col_or_super, include_timestamp, include_ttl)
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/columnfamily.py", line 369, in _cosc_to_dict
ret[self._unpack_name(col.name)] = self._col_to_dict(col, include_timestamp, include_ttl)
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/columnfamily.py", line 446, in _unpack_name
return self._name_unpacker(b)
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/marshal.py", line 140, in unpack_composite
components.append(unpacker(bytestr[2:2 + length]))
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/marshal.py", line 374, in <lambda>
return lambda v: uuid.UUID(bytes=v)
File "/usr/lib64/python2.6/uuid.py", line 144, in __init__
raise ValueError('bytes is not a 16-char string')
ValueError: bytes is not a 16-char string
Am I correct in assuming that the extra characters around the column names are what is responsible for the 'ValueError: bytes is not a 16-char string' exception?
Also if I try to use the column name and select it I get:
>>> cf.get(u'urn:keyspace:ColumnFamily:a36e8ab1-7032-4e4c-a53d-e3317f63a640:',columns=['autoZoning:::'])
Traceback (most recent call last):
File "<pyshell#184>", line 1, in <module>
cf.get(u'urn:keyspace:ColumnFamily:a36e8ab1-7032-4e4c-a53d-e3317f63a640:',columns=['autoZoning:::'])
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 651, in get
cp = self._column_path(super_column, column)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 383, in _column_path
self._pack_name(column, False))
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 426, in _pack_name
return self._name_packer(value, slice_start)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 115, in pack_composite
packed = packer(item)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 298, in pack_uuid
randomize=True)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/util.py", line 75, in convert_time_to_uuid
'neither a UUID, a datetime, or a number')
ValueError: Argument for a v1 UUID column name or value was neither a UUID, a datetime, or a number
Any further thoughts?
Thanks,
Rob

Turns out that the problem wasn't with the key, it was being caused, in part, by a bug in pycassa that wasn't handling an empty (null) string in the column UUID. A short-term fix is in the answer in google groups:
https://groups.google.com/d/msg/pycassa-discuss/Vf_bSgDIi9M/KTA1kbE9IXAJ
The other part of the answer was to get at the columns by using tuples (with the UUID as a UUID and not a str) instead of a string with ':' separators because that's, as I found out, a cassandra-cli thing.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pyspark UDF unable to use large dictionary - python

How big is your fast_text_dictionary in the 2M case? It might be too big. Try broadcast it first before running udf. e.g. broadcastVar = sc.broadcast(fast_text_dictionary) Then use broadcastVar instead in your udf. See the document for broadcast

Related

My first hello word program in MistQL is throwing exception Expected RuntimeValueType.Number, got RuntimeValueType.String

PicklingError when getting the result from ray

how to explode dict (or list of dict) object in multiple column in dask.dataframe

error in creating series in pandas

Unable to read column family using pycassa

Categories

Resources