IndexError preventing code from working with a larger csv file - python

I have code that sorts a csv using groupby and then plots the information. I wrote it against a small sample of the data and it ran smoothly, so I then tried running it with the huge data file.
I am pretty new at Python and this problem has been quite frustrating so even suggestions on how to troubleshoot this problem would be helpful.
My code is stopping in this section:
import pandas as pd
df = pd.DataFrame.from_csv('MYDATA.csv')
mode = lambda ts: ts.value_counts(sort=True).index[0]
I tried selecting only parts of the huge data file and it ran, but for the entire thing I am getting this error:
IndexError: index 0 is out of bounds for axis 0 with size 0
But I've looked at the two data sets side by side and the columns are the same! I noticed that the big file has some UTF-8 issues with accented characters, which I am working on combing out, but this IndexError is perplexing me.
Here is the traceback:
runfile('C:/Users/jbyrusb/Documents/Python Scripts/Tests/tests/TopSixCustomersExecute.py', wdir='C:/Users/jbyrusb/Documents/Python Scripts/Tests/tests')
Traceback (most recent call last):
File "<ipython-input-45-53a2a006076e>", line 1, in <module>
runfile('C:/Users/jbyrusb/Documents/Python Scripts/Tests/tests/TopSixCustomersExecute.py', wdir='C:/Users/jbyrusb/Documents/Python Scripts/Tests/tests')
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 682, in runfile
execfile(filename, namespace)
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/jbyrusb/Documents/Python Scripts/Tests/tests/TopSixCustomersExecute.py", line 23, in <module>
df = df.groupby('CompanyName')[['Column1','Name', 'Birthday', 'Country', 'County']].agg(mode).T.reindex(columns=cols)
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\groupby.py", line 676, in agg
return self.aggregate(func, *args, **kwargs)
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2674, in aggregate
result = self._aggregate_generic(arg, *args, **kwargs)
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2722, in _aggregate_generic
return self._aggregate_item_by_item(func, *args, **kwargs)
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2751, in _aggregate_item_by_item
colg.aggregate(func, *args, **kwargs), data)
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2307, in aggregate
result = self._aggregate_named(func_or_funcs, *args, **kwargs)
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2394, in _aggregate_named
output = func(group, *args, **kwargs)
File "C:/Users/jbyrusb/Documents/Python Scripts/Tests/tests/TopSixCustomersExecute.py", line 20, in <lambda>
mode = lambda ts: ts.value_counts(sort=True).index[0]
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\index.py", line 915, in __getitem__
return getitem(key)
IndexError: index 0 is out of bounds for axis 0 with size 0

It is difficult to say without seeing the data causing the error, but try this:
mode = (lambda ts: ts.value_counts(sort=True).index[0]
        if len(ts.value_counts(sort=True)) else None)
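Or, equivalently, computing value_counts only once (same behavior, just restructured):
def mode(ts):
    # Most frequent value in the group, or None for a group with no
    # countable values (which is what triggers the IndexError above).
    counts = ts.value_counts(sort=True)
    return counts.index[0] if len(counts) else None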

I had the same issue and resolved it by changing the sep argument from
sep='\t'
to
sep=','
Hope it saves someone.
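For reference, a minimal sketch of that (assuming a comma-separated file, and using read_csv in place of the older DataFrame.from_csv):
import pandas as pd

# With the wrong sep, pandas parses every row into a single column;
# downstream selections then come back empty and value_counts() has
# nothing at index 0 -- hence the IndexError.
df = pd.read_csv('MYDATA.csv', sep=',', encoding='utf-8')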

Related

My first hello world program in MistQL is throwing the exception Expected RuntimeValueType.Number, got RuntimeValueType.String

I was trying the tutorial given in the MistQL docs for a personal project, but the code below throws the exception shown underneath it.
import mistql
data="{\"foo\": \"bar\"}"
query = '#.foo'
results = mistql.query(query, data)
print(results)
Traceback (most recent call last):
File "C:\Temp\sample.py", line 4, in <module>
purchaserEmails = mistql.query(query, data)
File "C:\Python3.10.0\lib\site-packages\mistql\query.py", line 18, in query
result = execute_outer(ast, data)
File "C:\Python3.10.0\lib\site-packages\mistql\execute.py", line 73, in execute_outer
return execute(ast, build_initial_stack(data, builtins))
File "C:\Python3.10.0\lib\site-packages\typeguard\__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "C:\Python3.10.0\lib\site-packages\mistql\execute.py", line 60, in execute
return execute_fncall(ast.fn, ast.args, stack)
File "C:\Python3.10.0\lib\site-packages\typeguard\__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "C:\Python3.10.0\lib\site-packages\mistql\execute.py", line 28, in execute_fncall
return function_definition(arguments, stack, execute)
File "C:\Python3.10.0\lib\site-packages\mistql\builtins.py", line 37, in wrapped
return fn(arguments, stack, exec)
File "C:\Python3.10.0\lib\site-packages\mistql\builtins.py", line 168, in dot
return _index_single(RuntimeValue.of(right.name), left)
File "C:\Python3.10.0\lib\site-packages\mistql\builtins.py", line 295, in _index_single
assert_type(index, RVT.Number)
File "C:\Python3.10.0\lib\site-packages\mistql\runtime_value.py", line 344, in assert_type
raise MistQLTypeError(f"Expected {expected_type}, got {value.type}")
mistql.exceptions.MistQLTypeError: Expected RuntimeValueType.Number, got RuntimeValueType.String
Strings passed as data into MistQL's query method correspond to MistQL string types, and thus the #.foo query is attempting to index the string "{\"foo\": \"bar\"}" using the "foo" value, leading to the error above.
MistQL is expecting native dictionaries and lists. The correspondence between Python data types and MistQL data types can be seen here
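For example, a minimal sketch of the fix, parsing the JSON into a dict first (json.loads is just one way to get native structures; @ refers to the data passed in, and a bare foo query should behave the same):
import json
import mistql

data = json.loads('{"foo": "bar"}')  # a dict, not a JSON string
results = mistql.query('@.foo', data)
print(results)  # bar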
That being said, the error message is indeed very confusing and should be fixed post-haste! Issue here

PicklingError when getting the result from ray

I'm working on slowly converting my text analysis engine, which is currently heavily serial, to use Modin and Ray. It feels like I'm nearly there; however, I seem to have hit a stumbling block. My code looks like this:
vectorizer = TfidfVectorizer(
    analyzer=ngrams, encoding="ascii", stop_words="english", strip_accents="ascii"
)
tf_idf_matrix = vectorizer.fit_transform(r_strings["name"])
r_vectorizer = ray.put(vectorizer)
r_tf_idf_matrix = ray.put(tf_idf_matrix)
n = 2
match_results = []
for fn in files["c.file"]:
    match_results.append(
        match_name.remote(fn, r_vectorizer, r_tf_idf_matrix, r_strings, n)
    )
match_returns = ray.get(match_results)
I'm following the guidance from the "anti-patterns" section of the Ray documentation on what to avoid, and this is very similar to its "better" pattern.
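match_name itself isn't shown here; as a rough, hypothetical stand-in for what a function like it might do (the real body is more involved):
import ray

@ray.remote
def match_name(fn, vectorizer, tf_idf_matrix, strings, n):
    # Hypothetical stand-in: vectorize the query name and score it against
    # the precomputed tf-idf matrix, returning the indices of the n best hits.
    query_vec = vectorizer.transform([fn])
    scores = (tf_idf_matrix * query_vec.T).toarray().ravel()
    return scores.argsort()[-n:][::-1]
The failure only shows up when collecting the results: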
Traceback (most recent call last):
File "alt.py", line 213, in <module>
match_returns = ray.get(match_results)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
return func(*args, **kwargs)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/worker.py", line 1501, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(PicklingError): ray::match_name() (pid=23393, ip=192.168.1.173)
File "python/ray/_raylet.pyx", line 564, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 565, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1652, in ray._raylet.CoreWorker.store_task_outputs
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 327, in serialize
return self._serialize_to_msgpack(value)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 307, in _serialize_to_msgpack
self._serialize_to_pickle5(metadata, python_objects)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 267, in _serialize_to_pickle5
raise e
File "/home/myuser/.local/lib/python3.7/site-packages/ray/serialization.py", line 264, in _serialize_to_pickle5
value, protocol=5, buffer_callback=writer.buffer_callback)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "/home/myuser/.local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 580, in dump
return Pickler.dump(self, obj)
_pickle.PicklingError: args[0] from __newobj__ args has the wrong class
Definitely an unexpected result. I'm not sure where to go next with this and would appreciate help from folks who have more experience with Ray and Modin.

Why can dask.dataframe.apply only process a column called 'name'?

I am attempting to port some Pandas (Python) code to Dask instead. I am using Pandas 1.1.3 and Dask 2.30.0. I keep ramming my head against a wall I can't see. That is, I cannot understand what is going on here. I have boiled it down to the following minimal working example:
My data is the file 'test.csv' containing the following:
age,name
28,Alice
The following Python script (using Pandas) works fine:
import pandas as pd
df = pd.read_csv("test.csv", dtype={'name': str})
result = df['name'].apply(lambda text: text.upper())
#result = df['age'].apply(lambda num: num + 1)
print(result)
and prints:
0 ALICE
Name: name, dtype: object
The commented-out line operating on the 'age' column also works and prints:
0 29
Name: age, dtype: int64
Now, with Dask instead, my example becomes:
import dask.dataframe as dd
df = dd.read_csv("test.csv", dtype={'name': str})
result = df['name'].apply(lambda text: text.upper(), meta={'name': str})
#result = df['age'].apply(lambda num: num + 1, meta={'age': int})
print(result.compute())
which works fine just like the Pandas example. However, if I try the commented-out line operating on the 'age' column instead, Python complains with the following error message:
Traceback (most recent call last):
File "test_dask.py", line 7, in <module>
print(result.compute())
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/base.py", line 167, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/base.py", line 452, in compute
results = schedule(dsk, keys, **kwargs)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/threaded.py", line 76, in get
results = get_async(
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/local.py", line 486, in get_async
raise_exception(exc, tb)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/local.py", line 316, in reraise
raise exc
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/local.py", line 222, in execute_task
result = _execute_task(task, data)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/optimization.py", line 961, in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/core.py", line 151, in get
result = _execute_task(task, cache)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/utils.py", line 29, in apply
return func(*args, **kwargs)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/dataframe/core.py", line 5306, in apply_and_enforce
c = meta.name
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/pandas/core/generic.py", line 5139, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'name'
Even if I rename the 'name' column to something else, it fails the same way. It is as if Dask is only able to work on columns of a DataFrame that are called 'name'. This seems extraordinarily weird to me, and I must be misunderstanding something. What is really going on here?
The docs seem to suggest that the dict form should work, so that's weird, but the traceback hints at what is happening: a dict meta describes a DataFrame result, so apply_and_enforce reads meta.name expecting a Series name, and attribute access on a pandas DataFrame falls back to column lookup, which is presumably why it only appears to work when a column happens to be called 'name'. If you replace the meta argument with a (name, dtype) tuple, which describes a Series, your code runs as expected:
df = dd.read_csv("test.csv")
result = df['age'].apply(lambda num: num + 1, meta=('age', 'int64'))
print(result.compute())
and the output becomes
0 29
Name: age, dtype: int64
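For reference, a quick sketch of the two meta forms, using the columns from the example above: a (name, dtype) tuple declares a Series result, while a {column: dtype} dict declares a DataFrame result.
import dask.dataframe as dd

df = dd.read_csv("test.csv", dtype={'name': str})

# Series result: meta is a (name, dtype) tuple
ages = df['age'].apply(lambda num: num + 1, meta=('age', 'int64'))

# DataFrame result: meta is a {column: dtype} mapping
rows = df.apply(lambda row: row, axis=1, meta={'age': 'int64', 'name': str})

print(ages.compute())
print(rows.compute())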

Dask Dataframe GroupBy.size() raises MemoryError

I have two large CSV files, ~28 million rows each. I am performing an inner join, adding columns to the new Dask DataFrame, then requesting a groupby().size() on certain columns to return a count. In this example the inputs come from two Parquet files, which were generated from the original CSVs.
The end-to-end program works on an 8-core / 32GB RAM computer and produces a 4x6 pandas DataFrame of the groupby sizes, but when running on 16GB and 10GB RAM devices, I get a memory error.
What can I do to avoid this memory error?
Here is the code in question:
def merge(ubs_dd, br_dd):
    return dd.merge(ubs_dd, br_dd, left_on='mabid', right_on='brid', how='inner', suffixes=('_ubs', '_br'),)  # slow
    #return dd.merge(ubs_dd, br_dd, left_index=True, right_index=True)  # fast

def reconcile(merged_dd):
    merged_dd['amount_different'] = merged_dd['AMOUNT_ubs'].astype(float) - merged_dd['AMOUNT_br'].astype(float)
    merged_dd['amount_break'] = merged_dd['amount_different'].abs() >= 1  # +/- $1 tolerance
    merged_dd['billable_break'] = merged_dd['BILLABLE_ubs'] == merged_dd['BILLABLE_br']
    merged_dd['eligible_break'] = merged_dd['ELIGIBLE_ubs'] == merged_dd['ELIGIBLE_br']
    return merged_dd

def metrics_report(merged_dd):
    return merged_dd.groupby(['amount_break', 'billable_break', 'eligible_break']).size().reset_index().rename(columns={0:'count'}).compute()

merged_dd = merge(ubs_dd, br_dd)
merged_dd = reconcile(merged_dd)
metrics = metrics_report(merged_dd)
When running on the low-memory devices, here is the error I receive at about 70% complete:
generating final outputs
[############################ ] | 70% Completed | 29min 19.5s
Traceback (most recent call last):
File "c:/Users/<>/git/repository/<>/wma_billing_rec.py", line 155, in <module>
metrics = metrics_report(merged_dd)
File "c:/Users/<>/git/repository/<>/wma_billing_rec.py", line 115, in metrics_report
return merged_dd.groupby(['amount_break', 'billable_break', 'eligible_break']).size().reset_index().rename(columns={0:'count'}).compute()
File "C:\Programs\Miniconda3_64\envs\WMABillingRecEnv\lib\site-packages\dask\base.py", line 167, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "C:\Programs\Miniconda3_64\envs\WMABillingRecEnv\lib\site-packages\dask\base.py", line 452, in compute
results = schedule(dsk, keys, **kwargs)
File "C:\Programs\Miniconda3_64\envs\WMABillingRecEnv\lib\site-packages\dask\threaded.py", line 84, in get
**kwargs
File "C:\Programs\Miniconda3_64\envs\WMABillingRecEnv\lib\site-packages\dask\local.py", line 486, in get_async
raise_exception(exc, tb)
File "C:\Programs\Miniconda3_64\envs\WMABillingRecEnv\lib\site-packages\dask\local.py", line 316, in reraise
raise exc
File "C:\Programs\Miniconda3_64\envs\WMABillingRecEnv\lib\site-packages\dask\local.py", line 222, in execute_task
result = _execute_task(task, data)
File "C:\Programs\Miniconda3_64\envs\WMABillingRecEnv\lib\site-packages\dask\core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "C:\Programs\Miniconda3_64\envs\WMABillingRecEnv\lib\site-packages\dask\dataframe\shuffle.py", line 780, in collect
res = p.get(part)
File "C:\Programs\Miniconda3_64\envs\WMABillingRecEnv\lib\site-packages\partd\core.py", line 73, in get
return self.get([keys], **kwargs)[0]
File "C:\Programs\Miniconda3_64\envs\WMABillingRecEnv\lib\site-packages\partd\core.py", line 79, in get
return self._get(keys, **kwargs)
File "C:\Programs\Miniconda3_64\envs\WMABillingRecEnv\lib\site-packages\partd\encode.py", line 28, in _get
raw = self.partd._get(keys, **kwargs)
File "C:\Programs\Miniconda3_64\envs\WMABillingRecEnv\lib\site-packages\partd\buffer.py", line 54, in _get
self.slow.get(keys, lock=False)))
MemoryError
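One thing worth trying (a sketch, not a verified fix): project down to just the three grouping columns before the groupby, so the shuffle behind size() has far less data to hold:
def metrics_report(merged_dd):
    # Only the grouping columns are needed for the count; dropping the
    # rest shrinks what the shuffle has to keep in memory.
    slim = merged_dd[['amount_break', 'billable_break', 'eligible_break']]
    return (slim.groupby(['amount_break', 'billable_break', 'eligible_break'])
                .size().reset_index().rename(columns={0: 'count'}).compute())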

"Already tz-aware" error when reading h5 file using pandas, python 3 (but not 2)

I have an h5 store named weather.h5. My default Python environment is 3.5.2. When I try to read this store I get TypeError: Already tz-aware, use tz_convert to convert.
I've tried both pd.read_hdf('weather.h5','weather_history') and pd.io.pytables.HDFStore('weather.h5')['weather_history'], but I get the error no matter what.
I can open the h5 in a Python 2.7 environment. Is this a bug in Python 3 / pandas?
I have the same issue. I'm using Anaconda Python 3.4.5 and 2.7.3; both are using pandas 0.18.1.
Here is a reproducible example:
generate.py (to be executed with Python2):
import pandas as pd
from pandas import HDFStore
index = pd.DatetimeIndex(['2017-06-20 06:00:06.984630-05:00', '2017-06-20 06:03:01.042616-05:00'], dtype='datetime64[ns, CST6CDT]', freq=None)
p1 = [0, 1]
p2 = [0, 2]
# Saving either of these dataframes causes issues
df1 = pd.DataFrame({"p1":p1, "p2":p2}, index=index)
df2 = pd.DataFrame({"p1":p1, "p2":p2, "i":index})
store = HDFStore("./test_issue.h5")
store['df'] = df1
#store['df'] = df2
store.close()
read_issue.py:
import pandas as pd
from pandas import HDFStore
store = HDFStore("./test_issue.h5", mode="r")
df = store['/df']
store.close()
print(df)
Running read_issue.py in Python2 has no issues and produces this output:
p1 p2
2017-06-20 11:00:06.984630-05:00 0 0
2017-06-20 11:03:01.042616-05:00 1 2
But running it in Python3 produces an error with this traceback:
Traceback (most recent call last):
File "read_issue.py", line 5, in
df = store['df']
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 417, in getitem
return self.get(key)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 634, in get
return self._read_group(group)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 1272, in _read_group
return s.read(**kwargs)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 2779, in read
ax = self.read_index('axis%d' % i)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 2367, in read_index
_, index = self.read_index_node(getattr(self.group, key))
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 2492, in read_index_node
_unconvert_index(data, kind, encoding=self.encoding), **kwargs)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/indexes/base.py", line 153, in new
result = DatetimeIndex(data, copy=copy, name=name, **kwargs)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/util/decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/tseries/index.py", line 321, in new
raise TypeError("Already tz-aware, use tz_convert "
TypeError: Already tz-aware, use tz_convert to convert.
Closing remaining open files:./test_issue.h5...done
So, there is an issue with indices. However, if you save df2 in generate.py (datetime as a column, not as an index), then Python3 in read_issue.py produces a different error:
Traceback (most recent call last):
File "read_issue.py", line 5, in
df = store['/df']
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 417, in getitem
return self.get(key)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 634, in get
return self._read_group(group)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 1272, in _read_group
return s.read(**kwargs)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 2788, in read
placement=items.get_indexer(blk_items))
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 2518, in make_block
return klass(values, ndim=ndim, fastpath=fastpath, placement=placement)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 90, in init
len(self.mgr_locs)))
ValueError: Wrong number of items passed 2, placement implies 1
Closing remaining open files:./test_issue.h5...done
Also, if you execute generate.py in Python3 (saving either df1 or df2), then there is no problem executing read_issue.py in either Python3 or Python2.
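One possible workaround (a sketch, untested against this exact setup): make the index tz-naive before writing from Python2, so no timezone has to round-trip through the store:
# In generate.py, before storing:
df1.index = df1.index.tz_convert('UTC').tz_localize(None)
store['df'] = df1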
