I want to insert data into a DataFrame like:
df = pd.DataFrame(columns=["Date", "Title", "Artist"])
The insertion happens here:
df.insert(loc=0, column="Date", value=dateTime.group(0), allow_duplicates=True)
df.insert(loc=0, column="Title", value=title, allow_duplicates=True)
df.insert(loc=0, column="Artist", value=artist, allow_duplicates=True)
Sadly, I don't know how to handle these errors:
Traceback (most recent call last):
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 697, in _try_cast
subarr = maybe_cast_to_datetime(arr, dtype)
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1067, in maybe_cast_to_datetime
value = maybe_infer_to_datetimelike(value)
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 865, in maybe_infer_to_datetimelike
if isinstance(value, (ABCDatetimeIndex, ABCPeriodIndex,
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/dtypes/generic.py", line 9, in _check
return getattr(inst, attr, '_typ') in comp
TypeError: 'in <string>' requires string as left operand, not NoneType
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/sashakaun/IdeaProjects/scrapyscrape/test.py", line 24, in <module>
df.insert(loc=0, column="Title", value=title, allow_duplicates=True)
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 3470, in insert
self._ensure_valid_index(value)
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 3424, in _ensure_valid_index
value = Series(value)
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/series.py", line 261, in __init__
data = sanitize_array(data, index, dtype, copy,
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 625, in sanitize_array
subarr = _try_cast(data, False, dtype, copy, raise_cast_failure)
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 720, in _try_cast
subarr = np.array(arr, dtype=object, copy=copy)
File "/Users/sashakaun/IdeaProjects/python/venv/lib/python3.8/site-packages/bs4/element.py", line 971, in __getitem__
return self.attrs[key]
KeyError: 0
It's my first question, so please be kind.
Thanks in advance.
The error seems to come from your value=dateTime.group(0) argument. Can you elaborate on the structure of dateTime?
Plus, df.insert() inserts a column rather than appending a row of data to the DataFrame.
You should first transform your data into a Series (or a one-row DataFrame) and then use pd.concat() to concatenate it with the original DataFrame.
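A minimal sketch of that approach, with placeholder strings standing in for the scraped dateTime.group(0), title and artist values:
import pandas as pd

df = pd.DataFrame(columns=["Date", "Title", "Artist"])

# placeholder values standing in for dateTime.group(0), title and artist
row = pd.DataFrame([{"Date": "2020-01-01", "Title": "Some Title", "Artist": "Some Artist"}])

# append the one-row frame to the existing one
df = pd.concat([df, row], ignore_index=True)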
Below are some references:
https://kite.com/python/answers/how-to-insert-a-row-into-a-pandas-dataframe
Add one row to pandas DataFrame
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
So, I have this code to process CTD data. It works normally until I try to process the salinity column, which gives me this error:
KeyError: "['SALINITY;PSU'] not in index"
But when I check the columns, they are all there. Here is the code:
df = pd.read_csv('/home/labdino/PycharmProjects/CTDprocessing/venv/DadosCTD_tabulacao.csv',
                 sep='\t',
                 skiprows=header,
                 )
down, up = df.split()
down = down[["TEMPERATURE;C", "SALINITY;PSU"]]
process = (down.remove_above_water()
               .remove_up_to(idx=7)
               .despike(n1=2, n2=20, block=100)
               .lp_filter()
               .press_check()
               .interpolate()
               .bindata(delta=1, method="average")
               .smooth(window_len=21, window="hanning")
           )
process.head()
Output:
Traceback (most recent call last):
File "/home/labdino/PycharmProjects/CTDprocessing/venv/CTDLab.py", line 47, in <module>
down = down[["TEMPERATURE;C", "SALINITY;PSU"]]
File "/home/labdino/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 3811, in __getitem__
indexer = self.columns._get_indexer_strict(key, "columns")[1]
File "/home/labdino/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6108, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "/home/labdino/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6171, in _raise_if_missing
raise KeyError(f"{not_found} not in index")
KeyError: "['SALINITY;PSU'] not in index"
When I use this code with any other column it works, but with the salinity column it doesn't, even though I checked the CSV file and it all looks normal.
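One quick check worth running right after read_csv, since hidden whitespace or a stray character in the header would produce exactly this KeyError even though the name looks right when printed:
# show the exact labels pandas parsed; repr() makes trailing spaces, tabs or BOMs visible
for col in df.columns:
    print(repr(col))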
When I try to convert some XML to a DataFrame using xmltodict, a particular column contains all the info I need as a dict or a list of dicts. I'm able to expand this column into multiple columns with pandas, but I'm not able to perform the same operation in Dask.
It's not possible to use meta because I have no idea of all the possible fields available in the XML, and Dask is necessary because the real XML files are bigger than 1 GB each.
example.xml:
<?xml version="1.0" encoding="UTF-8"?>
<itemList>
<eventItem uid="1">
<timestamp>2019-07-04T09:57:35.044Z</timestamp>
<eventType>generic</eventType>
<details>
<detail>
<name>columnA</name>
<value>AAA</value>
</detail>
<detail>
<name>columnB</name>
<value>BBB</value>
</detail>
</details>
</eventItem>
<eventItem uid="2">
<timestamp>2019-07-04T09:57:52.188Z</timestamp>
<eventType>generic</eventType>
<details>
<detail>
<name>columnC</name>
<value>CCC</value>
</detail>
</details>
</eventItem>
</itemList>
Working pandas code:
import xmltodict
import collections
import pandas as pd
def pd_output_dict(details):
    detail = details.get("detail", [])
    ret_value = {}
    if type(detail) in (collections.OrderedDict, dict):
        ret_value[detail["name"]] = detail["value"]
    elif type(detail) == list:
        for i in detail:
            ret_value[i["name"]] = i["value"]
    return pd.Series(ret_value)

with open("example.xml", "r", encoding="utf8") as f:
    df_dict_list = xmltodict.parse(f.read()).get("itemList", {}).get("eventItem", [])

df = pd.DataFrame(df_dict_list)
df = pd.concat([df, df.apply(lambda row: pd_output_dict(row.details), axis=1, result_type="expand")], axis=1)
print(df.head())
Not working dask code:
import xmltodict
import collections
import dask
import dask.bag as db
import dask.dataframe as dd
def dd_output_dict(row):
    detail = row.get("details", {}).get("detail", [])
    ret_value = {}
    if type(detail) in (collections.OrderedDict, dict):
        row[detail["name"]] = detail["value"]
    elif type(detail) == list:
        for i in detail:
            row[i["name"]] = i["value"]
    return row

with open("example.xml", "r", encoding="utf8") as f:
    df_dict_list = xmltodict.parse(f.read()).get("itemList", {}).get("eventItem", [])

df_bag = db.from_sequence(df_dict_list)
df = df_bag.to_dataframe()
df = df.apply(lambda row: dd_output_dict(row), axis=1)
The idea is to get the same result in Dask that I have in pandas, but at the moment I'm receiving errors:
>>> df = df.apply(lambda row: output_dict(row), axis=1)
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\dask\dataframe\utils.py", line 169, in raise_on_meta_error
yield
File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 4711, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "C:\Anaconda3\lib\site-packages\dask\utils.py", line 854, in __call__
return getattr(obj, self.method)(*args, **kwargs)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 6487, in apply
return op.get_result()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 151, in get_result
return self.apply_standard()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 257, in apply_standard
self.apply_series_generator()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 286, in apply_series_generator
results[i] = self.f(v)
File "<stdin>", line 1, in <lambda>
File "<stdin>", line 4, in output_dict
AttributeError: ("'str' object has no attribute 'get'", 'occurred at index 0')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 3964, in apply
M.apply, self._meta_nonempty, func, args=args, udf=True, **kwds
File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 4711, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "C:\Anaconda3\lib\contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "C:\Anaconda3\lib\site-packages\dask\dataframe\utils.py", line 190, in raise_on_meta_error
raise ValueError(msg)
ValueError: Metadata inference failed in `apply`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
AttributeError("'str' object has no attribute 'get'", 'occurred at index 0')
Traceback:
---------
File "C:\Anaconda3\lib\site-packages\dask\dataframe\utils.py", line 169, in raise_on_meta_error
yield
File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 4711, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "C:\Anaconda3\lib\site-packages\dask\utils.py", line 854, in __call__
return getattr(obj, self.method)(*args, **kwargs)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 6487, in apply
return op.get_result()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 151, in get_result
return self.apply_standard()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 257, in apply_standard
self.apply_series_generator()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 286, in apply_series_generator
results[i] = self.f(v)
File "<stdin>", line 1, in <lambda>
File "<stdin>", line 4, in output_dict
Right, so operations like map_partitions will need to know the column names and data types. As you've mentioned, you can specify this with the meta= keyword.
Perhaps you can run through your data once to compute what these will be, and then construct a proper meta object, and pass that in? This is inefficient, and requires reading through all of your data, but I'm not sure that there is another way.
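A rough sketch of that two-pass idea, reusing df_bag and dd_output_dict from the question; the base column names ('@uid', 'timestamp', 'eventType', 'details') are assumptions based on xmltodict's default attribute naming:
import pandas as pd

# first pass: collect every detail name that occurs anywhere in the data
detail_names = (df_bag
                .map(lambda item: item.get("details", {}).get("detail", []))
                .map(lambda d: d if isinstance(d, list) else [d])
                .flatten()
                .map(lambda d: d["name"])
                .distinct()
                .compute())

# build an empty frame describing the expected output columns (all object dtype here)
base_cols = ["@uid", "timestamp", "eventType", "details"]  # assumed xmltodict key names
meta = pd.DataFrame({c: pd.Series(dtype="object") for c in base_cols + sorted(detail_names)})

# second pass: the real computation, with the metadata supplied explicitly
df = df_bag.to_dataframe()
df = df.apply(lambda row: dd_output_dict(row), axis=1, meta=meta)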
I have a dictionary with keys = word and value = array of 300 floats.
I'm unable to use this dictionary in my PySpark UDF.
When the dictionary has 2 million keys it does not work, but when I reduce the size to 200K keys it works.
This is my code for the function to be converted to a UDF:
def get_sentence_vector(sentence, dictionary_containing_word_vectors):
    cleanedSentence = list(clean_text(sentence))
    words_vector_list = np.zeros(300)  # 300-dimensional vector
    for x in cleanedSentence:
        try:
            words_vector_list = np.add(words_vector_list, dictionary_containing_word_vectors[str(x)])
        except Exception as e:
            print("Exception caught while finding word vector from FastText pretrained model dictionary: ", e)
    return words_vector_list.tolist()
This is my UDF
get_sentence_vector_udf = F.udf(lambda val: get_sentence_vector(val, fast_text_dictionary), ArrayType(FloatType()))
This is how I'm calling the UDF to add the result as a column in my DataFrame:
dmp_df_with_vectors = df.filter(df.item_name.isNotNull()).withColumn("sentence_vector", get_sentence_vector_udf(df.item_name))
And this is the stack trace for the error
Traceback (most recent call last):
File "/usr/lib/spark/python/pyspark/broadcast.py", line 83, in dump
pickle.dump(value, f, 2)
SystemError: error return without exception set
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1957, in wrapper
return udf_obj(*args)
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1916, in __call__
judf = self._judf
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1900, in _judf
self._judf_placeholder = self._create_judf()
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1909, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "/usr/lib/spark/python/pyspark/sql/functions.py", line 1866, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/usr/lib/spark/python/pyspark/rdd.py", line 2377, in _prepare_for_python_RDD
broadcast = sc.broadcast(pickled_command)
File "/usr/lib/spark/python/pyspark/context.py", line 799, in broadcast
return Broadcast(self, value, self._pickled_broadcast_vars)
File "/usr/lib/spark/python/pyspark/broadcast.py", line 74, in __init__
self._path = self.dump(value, f)
File "/usr/lib/spark/python/pyspark/broadcast.py", line 90, in dump
raise pickle.PicklingError(msg)
cPickle.PicklingError: Could not serialize broadcast: SystemError: error return without exception set
How big is your fast_text_dictionary in the 2M case? It might be too big.
Try broadcasting it first before running the UDF, e.g.:
broadcastVar = sc.broadcast(fast_text_dictionary)
Then use broadcastVar (accessed via broadcastVar.value) instead of the plain dictionary in your UDF.
See the documentation for broadcast.
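A minimal sketch of that suggestion, reusing get_sentence_vector, fast_text_dictionary, sc and df from the question:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

# ship the big dictionary to the executors once, instead of pickling it into the UDF closure
fast_text_broadcast = sc.broadcast(fast_text_dictionary)

def get_sentence_vector_bc(sentence):
    # .value gives back the plain dict on each executor
    return get_sentence_vector(sentence, fast_text_broadcast.value)

get_sentence_vector_udf = F.udf(get_sentence_vector_bc, ArrayType(FloatType()))

dmp_df_with_vectors = (df.filter(df.item_name.isNotNull())
                         .withColumn("sentence_vector", get_sentence_vector_udf(df.item_name)))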
I read a CSV file with df = pd.read_csv(encoding='utf-8', engine='python'), and while running print(df) I got the following TypeError:
Traceback (most recent call last):
File "C:\Users\User\Anaconda3\lib\site-packages\IPython\core\formatters.py", line 702, in __call__
printer.pretty(obj)
File "C:\Users\User\Anaconda3\lib\site-packages\IPython\lib\pretty.py", line 400, in pretty
return _repr_pprint(obj, self, cycle)
File "C:\Users\User\Anaconda3\lib\site-packages\IPython\lib\pretty.py", line 695, in _repr_pprint
output = repr(obj)
File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\base.py", line 82, in __repr__
return str(self)
File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\base.py", line 61, in __str__
return self.__unicode__()
File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\frame.py", line 663, in __unicode__
line_width=width, show_dimensions=show_dimensions)
File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1971, in to_string
formatter.to_string()
File "C:\Users\User\Anaconda3\lib\site-packages\pandas\io\formats\format.py", line 620, in to_string
max_len = Series(text).str.len().max()
File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\generic.py", line 9611, in stat_func
numeric_only=numeric_only)
File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\series.py", line 3221, in _reduce
return op(delegate, skipna=skipna, **kwds)
File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\nanops.py", line 131, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\nanops.py", line 507, in reduction
result = getattr(values, meth)(axis)
File "C:\Users\User\Anaconda3\lib\site-packages\numpy\core\_methods.py", line 28, in _amax
return umr_maximum(a, axis, None, out, keepdims, initial)
TypeError: int() argument must be a string, a bytes-like object or a number, not '_NoValueType'
When I read the column names I get Index(['names\t', 'matched_names', 'ratio'], dtype='object').
If someone knows what caused this TypeError, please help me. Thanks in advance.
Update: for better understanding, I've added the code showing how I generate this CSV file:
with open('result.csv', 'w', encoding='utf_8_sig') as f1:
    writer = csv.writer(f1, delimiter='\t', lineterminator='\n')
    writer.writerow(('names', 'matched_names', 'ratio'))
    for dish1, dish2 in itertools.combinations(enumerate(processedDishes), 2):
        matcher = matchers[dish1[0]]
        matcher.set_seq2(dish2[1])
        ratio = int(round(100 * matcher.ratio()))
        if ratio >= threshold_ratio:
            # print(dishes[dish1[0]], dishes[dish2[0]])
            my_list = (dishes[dish1[0]], dishes[dish2[0]], ratio)
            print(my_list)
            writer.writerow([my_list])
my_list has rows with the following format: ('steve', 'john', 0)
With df.info() I get:
Traceback (most recent call last):
File "<ipython-input-142-83941e9879da>", line 1, in <module>
df.info()
File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2252, in info
_verbose_repr()
File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2210, in _verbose_repr
counts = self.count()
File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\frame.py", line 6800, in count
result = Series(counts, index=frame._get_agg_axis(axis))
File "C:\Users\User\Anaconda3\lib\site-packages\pandas\core\series.py", line 262, in __init__
.format(val=len(data), ind=len(index)))
ValueError: Length of passed values is 0, index implies 3
I've just started using pycassa, so if this is a stupid question, I apologize upfront.
I have a column family with the following schema:
create column family MyColumnFamilyTest
with column_type = 'Standard'
and comparator = 'CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.TimeUUIDType)'
and default_validation_class = 'BytesType'
and key_validation_class = 'UTF8Type'
and read_repair_chance = 0.1
and dclocal_read_repair_chance = 0.0
and populate_io_cache_on_flush = false
and gc_grace = 864000
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = true
and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
and caching = 'KEYS_ONLY'
and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};
When I try to do a get() with a valid key (works fine in cassandra-cli) I get:
Traceback (most recent call last):
File "<pyshell#19>", line 1, in <module>
cf.get('mykey',column_count=3)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 664, in get
return self._cosc_to_dict(list_col_or_super, include_timestamp, include_ttl)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 368, in _cosc_to_dict
ret[self._unpack_name(col.name)] = self._col_to_dict(col, include_timestamp, include_ttl)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 444, in _unpack_name
return self._name_unpacker(b)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 140, in unpack_composite
components.append(unpacker(bytestr[2:2 + length]))
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 374, in <lambda>
return lambda v: uuid.UUID(bytes=v)
File "/usr/lib/python2.7/uuid.py", line 144, in __init__
raise ValueError('bytes is not a 16-char string')
ValueError: bytes is not a 16-char string
Here's some more information I've discovered:
When using cassandra-cli I can see the data as:
% cassandra-cli -h 10.249.238.131
Connected to: "LocalDB" on 10.249.238.131/9160
Welcome to Cassandra CLI version 1.2.10-SNAPSHOT
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default#unknown] use Keyspace;
[default#Keyspace] list ColumnFamily;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: urn:keyspace:ColumnFamily:a36e8ab1-7032-4e4c-a53d-e3317f63a640:
=> (name=autoZoning:::, value=01, timestamp=1391298393966000)
=> (name=creationTime:::, value=00000143efd8b76e, timestamp=1391298393966000)
=> (name=inactive:::14fe78e0-8b9b-11e3-b171-005056b700bb, value=00, timestamp=1391298393966000)
=> (name=label:::14fe78e0-8b9b-11e3-b171-005056b700bb, value=726a6d2d766e782d76613031, timestamp=1391298393966000)
1 Row Returned.
Elapsed time: 16 msec(s).
Since it was unclear what was causing the exception, I decided to add a print prior to the 'return self._name_unpacker(b)' line in columnfamily.py and I see:
>>> cf.get(dict(cf.get_range(column_count=0,filter_empty=False)).keys()[0])
Attempting to unpack: <00>\rautoZoning<00><00><00><00><00><00><00><00><00><00>
Traceback (most recent call last):
File "<pyshell#172>", line 1, in <module>
cf.get(dict(cf.get_range(column_count=0,filter_empty=False)).keys()[0])
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 665, in get
return self._cosc_to_dict(list_col_or_super, include_timestamp, include_ttl)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 368, in _cosc_to_dict
ret[self._unpack_name(col.name)] = self._col_to_dict(col, include_timestamp, include_ttl)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 445, in _unpack_name
return self._name_unpacker(b)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 140, in unpack_composite
components.append(unpacker(bytestr[2:2 + length]))
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 374, in <lambda>
return lambda v: uuid.UUID(bytes=v)
File "/usr/lib/python2.7/uuid.py", line 144, in __init__
raise ValueError('bytes is not a 16-char string')
ValueError: bytes is not a 16-char string
I have no idea where the extra characters are coming from around the column name. But that got me curious so I added another print in _cosc_to_dict in columnfamily.py and I see:
>>> cf.get(dict(cf.get_range(column_count=0,filter_empty=False)).keys()[0])
list_col_or_super is: []
list_col_or_super is: [ColumnOrSuperColumn(column=Column(timestamp=1391298393966000,
name='\x00\rautoZoning\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', value='\x01', ttl=None),
counter_super_column=None, super_column=None, counter_column=None),
ColumnOrSuperColumn(column=Column(timestamp=1391298393966000,
name='\x00\x0ccreationTime\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
value='\x00\x00\x01C\xef\xd8\xb7n', ttl=None), counter_super_column=None, super_column=None,
counter_column=None), ColumnOrSuperColumn(column=Column(timestamp=1391298393966000,
name='\x00\x08inactive\x00\x00\x00\x00\x00\x00\x00\x00\x10\x14\xfex\xe0\x8b\x9b\x11\xe3\xb1q\x00PV\xb7\x00\xbb\x00', value='\x00', ttl=None), counter_super_column=None, super_column=None,
counter_column=None), ColumnOrSuperColumn(column=Column(timestamp=1391298393966000,
name='\x00\x05label\x00\x00\x00\x00\x00\x00\x00\x00\x10\x14\xfex\xe0\x8b\x9b\x11\xe3\xb1q\x00PV\xb7\x00\xbb\x00', value='thisIsATest', ttl=None), counter_super_column=None, super_column=None, counter_column=None)]
autoZoning unpack:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/columnfamily.py", line 666, in get
return self._cosc_to_dict(list_col_or_super, include_timestamp, include_ttl)
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/columnfamily.py", line 369, in _cosc_to_dict
ret[self._unpack_name(col.name)] = self._col_to_dict(col, include_timestamp, include_ttl)
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/columnfamily.py", line 446, in _unpack_name
return self._name_unpacker(b)
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/marshal.py", line 140, in unpack_composite
components.append(unpacker(bytestr[2:2 + length]))
File "/usr/local/lib64/python2.6/site-packages/pycassa-1.11.0-py2.6.egg/pycassa/marshal.py", line 374, in <lambda>
return lambda v: uuid.UUID(bytes=v)
File "/usr/lib64/python2.6/uuid.py", line 144, in __init__
raise ValueError('bytes is not a 16-char string')
ValueError: bytes is not a 16-char string
Am I correct in assuming that the extra characters around the column names are what is responsible for the 'ValueError: bytes is not a 16-char string' exception?
Also if I try to use the column name and select it I get:
>>> cf.get(u'urn:keyspace:ColumnFamily:a36e8ab1-7032-4e4c-a53d-e3317f63a640:',columns=['autoZoning:::'])
Traceback (most recent call last):
File "<pyshell#184>", line 1, in <module>
cf.get(u'urn:keyspace:ColumnFamily:a36e8ab1-7032-4e4c-a53d-e3317f63a640:',columns=['autoZoning:::'])
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 651, in get
cp = self._column_path(super_column, column)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 383, in _column_path
self._pack_name(column, False))
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/columnfamily.py", line 426, in _pack_name
return self._name_packer(value, slice_start)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 115, in pack_composite
packed = packer(item)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/marshal.py", line 298, in pack_uuid
randomize=True)
File "/usr/local/lib/python2.7/dist-packages/pycassa-1.11.0-py2.7.egg/pycassa/util.py", line 75, in convert_time_to_uuid
'neither a UUID, a datetime, or a number')
ValueError: Argument for a v1 UUID column name or value was neither a UUID, a datetime, or a number
Any further thoughts?
Thanks,
Rob
Turns out that the problem wasn't with the key; it was caused, in part, by a bug in pycassa that wasn't handling an empty (null) string in the column UUID. A short-term fix is in the answer on Google Groups:
https://groups.google.com/d/msg/pycassa-discuss/Vf_bSgDIi9M/KTA1kbE9IXAJ
The other part of the answer was to get at the columns by using tuples (with the UUID as a UUID object rather than a str) instead of a string with ':' separators, because that notation is, as I found out, a cassandra-cli thing.
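For reference, a rough sketch of what that tuple-based lookup looks like, using the row key and the 'inactive' column UUID from the cassandra-cli output above (treat the exact component values as illustrative):
import uuid

# composite column names are built as tuples, one element per comparator component,
# with the TimeUUID component passed as a uuid.UUID object rather than a string
row_key = u'urn:keyspace:ColumnFamily:a36e8ab1-7032-4e4c-a53d-e3317f63a640:'
col_uuid = uuid.UUID('14fe78e0-8b9b-11e3-b171-005056b700bb')
result = cf.get(row_key, columns=[(u'inactive', u'', u'', col_uuid)])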