Dask agg functions pickle error - python

I have a Dask dataframe with the following dtypes:
#timestamp datetime64[ns]
#version object
dst object
dst_port object
host object
http_req_header_contentlength object
http_req_header_host object
http_req_header_referer object
http_req_header_useragent object
http_req_method object
http_req_secondleveldomain object
http_req_url object
http_req_version object
http_resp_code object
http_resp_header_contentlength object
http_resp_header_contenttype object
http_user object
local_time object
path object
src object
src_port object
tags object
type int64
dtype: object
I am trying to perform a groupby operation:
grouped_by_df = df.groupby(['http_user', 'src'])['#timestamp'].agg(['min', 'max']).reset_index()
When I run grouped_by_df.count().compute() I get the following error:
Traceback (most recent call last):
File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-62-9acb48b4ac67>", line 1, in <module>
user_host_map.count().compute()
File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/dask/base.py", line 98, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/dask/base.py", line 205, in compute
results = get(dsk, keys, **kwargs)
File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/distributed/client.py", line 1893, in get
results = self.gather(packed)
File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/distributed/client.py", line 1355, in gather
direct=direct, local_worker=local_worker)
File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/distributed/client.py", line 531, in sync
return sync(self.loop, func, *args, **kwargs)
File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/distributed/utils.py", line 234, in sync
six.reraise(*error[0])
File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/distributed/utils.py", line 223, in f
result[0] = yield make_coro()
File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/distributed/client.py", line 1235, in _gather
traceback)
File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/distributed/protocol/pickle.py", line 59, in loads
return pickle.loads(x)
TypeError: itemgetter expected 1 arguments, got 0
I am using dask version 0.15.1 with a LocalCluster Client. What could be causing the issue?

We just had a similar error. We were running something of the form:
df[['col1','col2']].groupby('col1').agg("count")
and getting a similar error with this at the end:
return pickle.loads(x)
TypeError: itemgetter expected 1 arguments, got 0
but then when we reformatted the groupby to be of the form:
df.groupby('col1')['col2'].count()
We stopped getting that error. We have now repeated this a few times, so it doesn't seem to be a fluke. I'm not sure at all why it happens, but it is worth a try if someone is struggling with the same issue.
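For reference, here is a minimal sketch (toy, made-up data, illustrative only) contrasting the two formulations on a small Dask dataframe; the column names col1/col2 are placeholders:
# Minimal sketch with made-up data contrasting the two groupby formulations.
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"col1": ["a", "a", "b"], "col2": [1, 2, 3]})
df = dd.from_pandas(pdf, npartitions=2)

# Formulation that hit the itemgetter pickle error for us (on the distributed scheduler):
# df[["col1", "col2"]].groupby("col1").agg("count").compute()

# Reformulated version that ran cleanly:
print(df.groupby("col1")["col2"].count().compute())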

Related

PicklingError when getting the result from ray

I'm working on slowly converting my very serial text analysis engine to use Modin and Ray. It feels like I'm nearly there; however, I seem to have hit a stumbling block. My code looks like this:
vectorizer = TfidfVectorizer(
    analyzer=ngrams, encoding="ascii", stop_words="english", strip_accents="ascii"
)
tf_idf_matrix = vectorizer.fit_transform(r_strings["name"])
r_vectorizer = ray.put(vectorizer)
r_tf_idf_matrix = ray.put(tf_idf_matrix)
n = 2
match_results = []
for fn in files["c.file"]:
    match_results.append(
        match_name.remote(fn, r_vectorizer, r_tf_idf_matrix, r_strings, n)
    )
match_returns = ray.get(match_results)
I'm following the guidance from the "anti-patterns" section of the Ray documentation on what to avoid, and this code is very close to the recommended "better" pattern. Even so, I get the following error:
Traceback (most recent call last):
File "alt.py", line 213, in <module>
match_returns = ray.get(match_results)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
return func(*args, **kwargs)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/worker.py", line 1501, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(PicklingError): ray::match_name() (pid=23393, ip=192.168.1.173)
File "python/ray/_raylet.pyx", line 564, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 565, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1652, in ray._raylet.CoreWorker.store_task_outputs
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 327, in serialize
return self._serialize_to_msgpack(value)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 307, in _serialize_to_msgpack
self._serialize_to_pickle5(metadata, python_objects)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 267, in _serialize_to_pickle5
raise e
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 264, in _serialize_to_pickle5
value, protocol=5, buffer_callback=writer.buffer_callback)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 580, in dump
return Pickler.dump(self, obj)
_pickle.PicklingError: args[0] from __newobj__ args has the wrong class
Definitely an unexpected result. I'm not sure where to go next with this and would appreciate help from folks who have more experience with Ray and Modin.

Error - unhashable type: 'list' when performing groupby using multi values

I'm pretty new to Python, and I've been struggling to figure out how to solve an error that I started seeing in certain instances of the following code:
for idKey, idKey_dataFrame in dataFrame[~pd.isna(dataFrame.idKey)].groupby("idKey", sort=False):
    idKey_dataFrame = idKey_dataFrame[
        idKey_dataFrame.matched_customerProps.astype(str).str.contains("entitlement")
        & idKey_dataFrame.matched_customerProps.astype(str).str.contains("entitlement address")
    ]
    for customer_type, idKey_sub_dataFrame in idKey_dataFrame.groupby("customer_type"):
        if customer_type != "repeat" and customer_type != "mvp":
            continue
        self.log.info(f"Link made to {idKey}")
        idKey_csv_export = self.make_csv(idKey_sub_dataFrame, inventory=False)
I noticed that whenever I have more than one value in customer_type, I get an
"ERROR - unhashable type: 'list'" error.
I'm not sure how to execute a groupby when there are multiple values and also perform a comparison against the object type. Any help would be greatly appreciated.
These are my columns
customer_name customer_id customer_type
xyz 003 repeat
zzz 389 repeat, mvp, intl
yyy 002 repeat
yay 005 repeat
kdi 083 mvp, repeat
If I provide a single-valued customer_type, I get no error. Whenever I test with customers that have more than one value under customer_type, I get an error. The offending line is the first for statement: for idKey, idKey_dataFrame in dataFrame[~pd.isna(dataFrame.idKey)].groupby("idKey", sort=False):
Here is the complete error:
ERROR - unhashable type: 'list' in line 258, in execute for customer_type, idKey_sub_dataFrame in idKey_dataFrame.groupby("customer_type"):
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/pandas/core/groupby/ops.py", line 162, in get_iterator
splitter = self._get_splitter(data, axis=axis)
File "/usr/local/lib/python3.7/site-packages/pandas/core/groupby/ops.py", line 168, in _get_splitter
comp_ids, _, ngroups = self.group_info
File "pandas/_libs/properties.pyx", line 34, in pandas._libs.properties.CachedProperty.__get__
File "/usr/local/lib/python3.7/site-packages/pandas/core/groupby/ops.py", line 296, in group_info
comp_ids, obs_group_ids = self._get_compressed_labels()
File "/usr/local/lib/python3.7/site-packages/pandas/core/groupby/ops.py", line 312, in _get_compressed_labels
all_labels = [ping.labels for ping in self.groupings]
File "/usr/local/lib/python3.7/site-packages/pandas/core/groupby/ops.py", line 312, in <listcomp>
all_labels = [ping.labels for ping in self.groupings]
File "/usr/local/lib/python3.7/site-packages/pandas/core/groupby/grouper.py", line 397, in labels
self._make_labels()
File "/usr/local/lib/python3.7/site-packages/pandas/core/groupby/grouper.py", line 421, in _make_labels
labels, uniques = algorithms.factorize(self.grouper, sort=self.sort)
File "/usr/local/lib/python3.7/site-packages/pandas/util/_decorators.py", line 208, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/pandas/core/algorithms.py", line 672, in factorize
values, na_sentinel=na_sentinel, size_hint=size_hint, na_value=na_value
File "/usr/local/lib/python3.7/site-packages/pandas/core/algorithms.py", line 508, in _factorize_array
values, na_sentinel=na_sentinel, na_value=na_value
File "pandas/_libs/hashtable_class_helper.pxi", line 1798, in pandas._libs.hashtable.PyObjectHashTable.factorize
File "pandas/_libs/hashtable_class_helper.pxi", line 1718, in pandas._libs.hashtable.PyObjectHashTable._unique

(Casting) errors using extract_(relevant_)features from tsfresh

Trying out the Python package tsfresh, I ran into issues in the first steps. Given a series, how do I (automatically) make features for it? This snippet produces different errors depending on which part I try.
import tsfresh
import pandas as pd
import numpy as np
#tfX, tfy = tsfresh.utilities.dataframe_functions.make_forecasting_frame(pd.Series(np.random.randn(1000)/50), kind='float64', max_timeshift=50, rolling_direction=1)
#rf = tsfresh.extract_relevant_features(tfX, y=tfy, n_jobs=1, column_id='id')
tfX, tfy = tsfresh.utilities.dataframe_functions.make_forecasting_frame(pd.Series(np.random.randn(1000)/50), kind=1, max_timeshift=50, rolling_direction=1)
rf = tsfresh.extract_relevant_features(tfX, y=tfy, n_jobs=1, column_id='id')
The errors are in the first case
""" Traceback (most recent call last): File "C:\Users\user\Anaconda3\envs\env1\lib\multiprocessing\pool.py", line
119, in worker
result = (True, func(*args, **kwds)) File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\tsfresh\utilities\distribution.py",
line 38, in _function_with_partly_reduce
results = list(itertools.chain.from_iterable(results)) File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\tsfresh\utilities\distribution.py",
line 37, in
results = (map_function(chunk, **kwargs) for chunk in chunk_list) File
"C:\Users\user\Anaconda3\envs\env1\lib\site-packages\tsfresh\feature_extraction\extraction.py",
line 358, in _do_extraction_on_chunk
return list(_f()) File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\tsfresh\feature_extraction\extraction.py",
line 350, in _f
result = [("", func(data))] File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\tsfresh\feature_extraction\feature_calculators.py",
line 193, in variance_larger_than_standard_deviation
y = np.var(x) File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\numpy\core\fromnumeric.py",
line 3157, in var
**kwargs) File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\numpy\core_methods.py",
line 110, in _var
arrmean, rcount, out=arrmean, casting='unsafe', subok=False) TypeError: unsupported operand type(s) for /: 'str' and 'int' """
and in the second case
""" Traceback (most recent call last): File
"C:\Users\user\Anaconda3\envs\env1\lib\multiprocessing\pool.py", line
119, in worker
result = (True, func(*args, **kwds)) File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\tsfresh\utilities\distribution.py",
line 38, in _function_with_partly_reduce
results = list(itertools.chain.from_iterable(results)) File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\tsfresh\utilities\distribution.py",
line 37, in
results = (map_function(chunk, **kwargs) for chunk in chunk_list) File
"C:\Users\user\Anaconda3\envs\env1\lib\site-packages\tsfresh\feature_extraction\extraction.py",
line 358, in _do_extraction_on_chunk
return list(_f()) File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\tsfresh\feature_extraction\extraction.py",
line 345, in _f
result = func(data, param=parameter_list) File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\tsfresh\feature_extraction\feature_calculators.py",
line 1752, in friedrich_coefficients
coeff = _estimate_friedrich_coefficients(x, m, r) File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\tsfresh\feature_extraction\feature_calculators.py",
line 145, in _estimate_friedrich_coefficients
result.dropna(inplace=True) File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\pandas\core\frame.py",
line 4598, in dropna
result = self.loc(axis=axis)[mask] File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\pandas\core\indexing.py",
line 1500, in getitem
return self._getitem_axis(maybe_callable, axis=axis) File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\pandas\core\indexing.py",
line 1859, in _getitem_axis
if is_iterator(key): File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\pandas\core\dtypes\inference.py",
line 157, in is_iterator
return hasattr(obj, 'next') File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\pandas\core\generic.py",
line 5065, in getattr
if self._info_axis._can_hold_identifiers_and_holds_name(name): File
"C:\Users\user\Anaconda3\envs\env1\lib\site-packages\pandas\core\indexes\base.py",
line 3984, in _can_hold_identifiers_and_holds_name
return name in self File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\pandas\core\indexes\category.py",
line 327, in contains
return contains(self, key, container=self._engine) File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\pandas\core\arrays\categorical.py",
line 188, in contains
loc = cat.categories.get_loc(key) File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\pandas\core\indexes\interval.py",
line 770, in get_loc
start, stop = self._find_non_overlapping_monotonic_bounds(key) File
"C:\Users\user\Anaconda3\envs\env1\lib\site-packages\pandas\core\indexes\interval.py",
line 717, in _find_non_overlapping_monotonic_bounds
start = self._searchsorted_monotonic(key, 'left') File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\pandas\core\indexes\interval.py",
line 681, in _searchsorted_monotonic
return sub_idx._searchsorted_monotonic(label, side) File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\pandas\core\indexes\base.py",
line 4755, in _searchsorted_monotonic
return self.searchsorted(label, side=side) File "C:\Users\user\Anaconda3\envs\env1\lib\site-packages\pandas\core\base.py",
line 1501, in searchsorted
return self._values.searchsorted(value, side=side, sorter=sorter) TypeError: Cannot cast array data from dtype('float64') to
dtype('
np.__version__ and tsfresh.__version__ are ('1.15.4', 'unknown'). I installed tsfresh using conda, probably from conda-forge. I am on Windows 10. Using another kernel with np.__version__, tsfresh.__version__ of ('1.15.4', '0.11.2') led to the same results.
Trying the first couple of cells from timeseries_forecasting_basic_example.ipynb yields the casting error as well.
Fixed it. Either the version on conda(-forge) or one of its dependencies was the issue. Running "conda uninstall tsfresh", "conda install patsy future six tqdm" and "pip install tsfresh" in combination did the trick.

How to add an edge in Python Gremlin variant

I'm trying to create a graph using gremlin-python, but I can't seem to work out how to add an edge.
Using the standard Gremlin console I can do the following:
gremlin> a = g.addV().next()
==>v[0]
gremlin> b = g.addV().next()
==>v[1]
gremlin> g.V()
==>v[0]
==>v[1]
gremlin> a.addEdge('conn', b)
==>e[2][0-conn->1]
gremlin> g.E()
==>e[2][0-conn->1]
gremlin>
But when I try to do the same via Python connected to Gremlin Server, it doesn't work:
>>> a = g.addV().next()
>>> b = g.addV().next()
>>> g.V().toList()
[v[1519], v[1520]]
>>> a.addEdge('conn', b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Vertex' object has no attribute 'addEdge'
I've tried various incantations, but can't seem to work it out, and can't find any examples anywhere. Also, I see references in the Gremlin docs to both addE and addEdge, but I can't work out what the difference is (neither appears to work above).
Edit: I'm getting a bit further, but still no luck. It seems GraphTraversal.addE() exists, so if I don't call next() then I can call addE... but I still can't seem to find arguments that it likes.
>>> a = g.addV()
>>> b = g.addV()
>>> a.addE('foo', b).toList()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Development/matt/lib/python2.7/site-packages/gremlin_python/process/traversal.py", line 52, in toList
return list(iter(self))
File "/Development/matt/lib/python2.7/site-packages/gremlin_python/process/traversal.py", line 70, in next
return self.__next__()
File "/Development/matt/lib/python2.7/site-packages/gremlin_python/process/traversal.py", line 43, in __next__
self.traversal_strategies.apply_strategies(self)
File "/Development/matt/lib/python2.7/site-packages/gremlin_python/process/traversal.py", line 284, in apply_strategies
traversal_strategy.apply(traversal)
File "/Development/matt/lib/python2.7/site-packages/gremlin_python/driver/remote_connection.py", line 95, in apply
remote_traversal = self.remote_connection.submit(traversal.bytecode)
File "/Development/matt/lib/python2.7/site-packages/gremlin_python/driver/driver_remote_connection.py", line 53, in submit
traversers = self._loop.run_sync(lambda: self.submit_traversal_bytecode(request_id, bytecode))
File "/Development/matt/lib/python2.7/site-packages/tornado/ioloop.py", line 457, in run_sync
return future_cell[0].result()
File "/Development/matt/lib/python2.7/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "/Development/matt/lib/python2.7/site-packages/tornado/gen.py", line 1021, in run
yielded = self.gen.throw(*exc_info)
File "/Development/matt/lib/python2.7/site-packages/gremlin_python/driver/driver_remote_connection.py", line 73, in submit_traversal_bytecode
traversers = yield self._execute_message(message)
File "/Development/matt/lib/python2.7/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/Development/matt/lib/python2.7/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "/Development/matt/lib/python2.7/site-packages/tornado/gen.py", line 1021, in run
yielded = self.gen.throw(*exc_info)
File "/Development/matt/lib/python2.7/site-packages/gremlin_python/driver/driver_remote_connection.py", line 149, in _execute_message
recv_message = yield response.receive()
File "/Development/matt/lib/python2.7/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/Development/matt/lib/python2.7/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "/Development/matt/lib/python2.7/site-packages/tornado/gen.py", line 1024, in run
yielded = self.gen.send(value)
File "/Development/matt/lib/python2.7/site-packages/gremlin_python/driver/driver_remote_connection.py", line 236, in receive
"{0}: {1}".format(status_code, recv_message["status"]["message"]))
gremlin_python.driver.driver_remote_connection.GremlinServerError: 599: Could not locate method: DefaultGraphTraversal.addE([foo, [AddVertexStep({})]])
As far as I know, addEdge() works on the graph object and addE() works on the graph traversal object. Since you were using g, which is the latter, you need addE().
Seems the following syntax works:
>>> a = g.addV()
>>> b = g.addV()
>>> a.addE('foo').to(b).toList()
[e[1534][1532-foo->1533]]
I'm still not clear on the difference between addE and addEdge, but I guess the latter is not available in Python and I was confusing their signatures.
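As a further sketch (assuming a more recent TinkerPop/gremlin-python release where addE() accepts from_()/to() modulators, which may not be the version used above), the whole edge creation can also be written against vertices returned by next():
# Sketch assuming TinkerPop 3.3+-style gremlin-python, where addE() takes
# from_()/to() modulators and these accept Vertex objects.
a = g.addV().next()
b = g.addV().next()
g.V(a).addE('conn').to(b).next()
# equivalently: g.addE('conn').from_(a).to(b).next()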

Unable to read column family using pycassa

I've just started using pycassa, so if this is a stupid question, I apologize upfront.
I have a column family with the following schema:
create column family MyColumnFamilyTest
with column_type = 'Standard'
and comparator = 'CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.TimeUUIDType)'
and default_validation_class = 'BytesType'
and key_validation_class = 'UTF8Type'
and read_repair_chance = 0.1
and dclocal_read_repair_chance = 0.0
and populate_io_cache_on_flush = false
and gc_grace = 864000
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = true
and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
and caching = 'KEYS_ONLY'
and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};
When I try to do a get() with a valid key (works fine in cassandra-cli) I get:
Traceback (most recent call last):
File "<pyshell#19>", line 1, in <module>
cf.get('mykey',column_count=3)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 664, in get
return self._cosc_to_dict(list_col_or_super, include_timestamp, include_ttl)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 368, in _cosc_to_dict
ret[self._unpack_name(col.name)] = self._col_to_dict(col, include_timestamp, include_ttl)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 444, in _unpack_name
return self._name_unpacker(b)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 140, in unpack_composite
components.append(unpacker(bytestr[2:2 + length]))
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 374, in <lambda>
return lambda v: uuid.UUID(bytes=v)
File "/usr/lib/python2.7/uuid.py", line 144, in __init__
raise ValueError('bytes is not a 16-char string')
ValueError: bytes is not a 16-char string
Here's some more information I've discovered:
When using cassandra-cli I can see the data as:
% cassandra-cli -h 10.249.238.131
Connected to: "LocalDB" on 10.249.238.131/9160
Welcome to Cassandra CLI version 1.2.10-SNAPSHOT
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default#unknown] use Keyspace;
[default#Keyspace] list ColumnFamily;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: urn:keyspace:ColumnFamily:a36e8ab1-7032-4e4c-a53d-e3317f63a640:
=> (name=autoZoning:::, value=01, timestamp=1391298393966000)
=> (name=creationTime:::, value=00000143efd8b76e, timestamp=1391298393966000)
=> (name=inactive:::14fe78e0-8b9b-11e3-b171-005056b700bb, value=00, timestamp=1391298393966000)
=> (name=label:::14fe78e0-8b9b-11e3-b171-005056b700bb, value=726a6d2d766e782d76613031, timestamp=1391298393966000)
1 Row Returned.
Elapsed time: 16 msec(s).
Since it was unclear what was causing the exception, I decided to add a print prior to the 'return self._name_unpacker(b)' line in columnfamily.py and I see:
>>> cf.get(dict(cf.get_range(column_count=0,filter_empty=False)).keys()[0])
Attempting to unpack: <00>\rautoZoning<00><00><00><00><00><00><00><00><00><00>
Traceback (most recent call last):
File "<pyshell#172>", line 1, in <module>
cf.get(dict(cf.get_range(column_count=0,filter_empty=False)).keys()[0])
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 665, in get
return self._cosc_to_dict(list_col_or_super, include_timestamp, include_ttl)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 368, in _cosc_to_dict
ret[self._unpack_name(col.name)] = self._col_to_dict(col, include_timestamp, include_ttl)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 445, in _unpack_name
return self._name_unpacker(b)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 140, in unpack_composite
components.append(unpacker(bytestr[2:2 + length]))
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 374, in <lambda>
return lambda v: uuid.UUID(bytes=v)
File "/usr/lib/python2.7/uuid.py", line 144, in __init__
raise ValueError('bytes is not a 16-char string')
ValueError: bytes is not a 16-char string
I have no idea where the extra characters are coming from around the column name. But that got me curious so I added another print in _cosc_to_dict in columnfamily.py and I see:
>>> cf.get(dict(cf.get_range(column_count=0,filter_empty=False)).keys()[0])
list_col_or_super is: []
list_col_or_super is: [ColumnOrSuperColumn(column=Column(timestamp=1391298393966000,
name='\x00\rautoZoning\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', value='\x01', ttl=None),
counter_super_column=None, super_column=None, counter_column=None),
ColumnOrSuperColumn(column=Column(timestamp=1391298393966000,
name='\x00\x0ccreationTime\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
value='\x00\x00\x01C\xef\xd8\xb7n', ttl=None), counter_super_column=None, super_column=None,
counter_column=None), ColumnOrSuperColumn(column=Column(timestamp=1391298393966000,
name='\x00\x08inactive\x00\x00\x00\x00\x00\x00\x00\x00\x10\x14\xfex\xe0\x8b\x9b\x11\xe3\xb1q\x00PV\xb7\x00\xbb\x00', value='\x00', ttl=None), counter_super_column=None, super_column=None,
counter_column=None), ColumnOrSuperColumn(column=Column(timestamp=1391298393966000,
name='\x00\x05label\x00\x00\x00\x00\x00\x00\x00\x00\x10\x14\xfex\xe0\x8b\x9b\x11\xe3\xb1q\x00PV\xb7\x00\xbb\x00', value='thisIsATest', ttl=None), counter_super_column=None, super_column=None, counter_column=None)]
autoZoning unpack:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/columnfamily.py", line 666, in get
return self._cosc_to_dict(list_col_or_super, include_timestamp, include_ttl)
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/columnfamily.py", line 369, in _cosc_to_dict
ret[self._unpack_name(col.name)] = self._col_to_dict(col, include_timestamp, include_ttl)
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/columnfamily.py", line 446, in _unpack_name
return self._name_unpacker(b)
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/marshal.py", line 140, in unpack_composite
components.append(unpacker(bytestr[2:2 + length]))
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/marshal.py", line 374, in <lambda>
return lambda v: uuid.UUID(bytes=v)
File "/usr/lib64/python2.6/uuid.py", line 144, in __init__
raise ValueError('bytes is not a 16-char string')
ValueError: bytes is not a 16-char string
Am I correct in assuming that the extra characters around the column names are responsible for the 'ValueError: bytes is not a 16-char string' exception?
Also if I try to use the column name and select it I get:
>>> cf.get(u'urn:keyspace:ColumnFamily:a36e8ab1-7032-4e4c-a53d-e3317f63a640:',columns=['autoZoning:::'])
Traceback (most recent call last):
File "<pyshell#184>", line 1, in <module>
cf.get(u'urn:keyspace:ColumnFamily:a36e8ab1-7032-4e4c-a53d-e3317f63a640:',columns=['autoZoning:::'])
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 651, in get
cp = self._column_path(super_column, column)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 383, in _column_path
self._pack_name(column, False))
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 426, in _pack_name
return self._name_packer(value, slice_start)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 115, in pack_composite
packed = packer(item)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 298, in pack_uuid
randomize=True)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/util.py", line 75, in convert_time_to_uuid
'neither a UUID, a datetime, or a number')
ValueError: Argument for a v1 UUID column name or value was neither a UUID, a datetime, or a number
Any further thoughts?
Thanks,
Rob
It turns out that the problem wasn't with the key; it was caused, in part, by a bug in pycassa that wasn't handling an empty (null) string in the column's UUID component. A short-term fix is in this answer on Google Groups:
https://groups.google.com/d/msg/pycassa-discuss/Vf_bSgDIi9M/KTA1kbE9IXAJ
The other part of the answer was to address the columns using tuples (with the UUID as a uuid.UUID rather than a str) instead of a ':'-separated string, because the ':' syntax is, as I found out, a cassandra-cli convention.
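As an illustration of that tuple form, here is a sketch (not tested against the original cluster; the keyspace, server, row key and UUID are taken from the cassandra-cli output above) of fetching the inactive column with the TimeUUID component passed as a uuid.UUID:
# Sketch only: addressing a CompositeType(UTF8, UTF8, UTF8, TimeUUID) column
# with a 4-tuple instead of a ':'-separated string.
import uuid
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('Keyspace', ['10.249.238.131:9160'])
cf = ColumnFamily(pool, 'MyColumnFamilyTest')

row_key = u'urn:keyspace:ColumnFamily:a36e8ab1-7032-4e4c-a53d-e3317f63a640:'
col = (u'inactive', u'', u'',
       uuid.UUID('14fe78e0-8b9b-11e3-b171-005056b700bb'))
print(cf.get(row_key, columns=[col]))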
