I am attempting to port some Pandas (Python) code to Dask instead. I am using Pandas 1.1.3 and Dask 2.30.0. I keep ramming my head against a wall I can't see. That is, I cannot understand what is going on here. I have boiled it down to the following minimal working example:
My data is the file 'test.csv' containing the following:
age,name
28,Alice
The following Python script (using Pandas) works fine:
import pandas as pd
df = pd.read_csv("test.csv", dtype={'name': str})
result = df['name'].apply(lambda text: text.upper())
#result = df['age'].apply(lambda num: num + 1)
print(result)
and prints:
0 ALICE
Name: name, dtype: object
The commented-out line operating on the 'age' column also works and prints:
0 29
Name: age, dtype: int64
Now, with Dask instead, my example becomes:
import dask.dataframe as dd
df = dd.read_csv("test.csv", dtype={'name': str})
result = df['name'].apply(lambda text: text.upper(), meta={'name': str})
#result = df['age'].apply(lambda num: num + 1, meta={'age': int})
print(result.compute())
which works fine just like the Pandas example. However, if I try the commented-out line operating on the 'age' column instead, Python complains with the following error message:
Traceback (most recent call last):
File "test_dask.py", line 7, in <module>
print(result.compute())
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/base.py", line 167, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/base.py", line 452, in compute
results = schedule(dsk, keys, **kwargs)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/threaded.py", line 76, in get
results = get_async(
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/local.py", line 486, in get_async
raise_exception(exc, tb)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/local.py", line 316, in reraise
raise exc
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/local.py", line 222, in execute_task
result = _execute_task(task, data)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/optimization.py", line 961, in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/core.py", line 151, in get
result = _execute_task(task, cache)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/utils.py", line 29, in apply
return func(*args, **kwargs)
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/dask/dataframe/core.py", line 5306, in apply_and_enforce
c = meta.name
File "/some/path/miniconda3/envs/testdask/lib/python3.8/site-packages/pandas/core/generic.py", line 5139, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'name'
Even if I just call the 'name' column something else, it also fails like this. It is as if Dask is only able to work on columns of a DataFrame that are called 'name'. This seems extraordinarily weird to me, and I must be misunderstanding something. What is really going on here?
The docs seem to suggest that the dict should work, so that's weird, but if you replace the meta argument with a tuple instead, your code runs as expected:
df = dd.read_csv("test.csv")
result = df['age'].apply(lambda num: num + 1, meta=('age', 'int64'))
print(result.compute())
becomes
0 29
Name: age, dtype: int64
Related
There are two function :
func_1(arg1, arg2):
...
return df1. # Returns a pandas dataframe
func_2(arg3):
....
return df1 # Returns a pandas dataframe
My goal is to run those functions parallelly and collect df1 and df2.
Then join those Dataframe to a single df.
I tried to use joblib.Parallel like:
delayed_funcs = [delayed(func_1)(arg1, arg2), delayed(func_2)(arg3)]
raw_df = Parallel(n_jobs=-1, verbose=5)(delayed_funcs)
print(raw_df)
And the following error:
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 436, in _process_worker
r = call_item()
File "/opt/anaconda3/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 288, in __call__
return self.fn(*self.args, **self.kwargs)
File "/opt/anaconda3/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 595, in __call__
return self.func(*args, **kwargs)
File "/opt/anaconda3/lib/python3.9/site-packages/joblib/parallel.py", line 262, in __call__
return [func(*args, **kwargs)
File "/opt/anaconda3/lib/python3.9/site-packages/joblib/parallel.py", line 262, in <listcomp>
return [func(*args, **kwargs)
TypeError: 'DataFrame' object is not callable
Am I heading to the right approach? Can you please help me with this?
When I try to convert some xml to dataframe using xmltodict it happens that a particular column contains all the info I need as dict or list of dict. I'm able to convert this column in multiple ones with pandas but I'm not able to perform the similar operation in dask.
Is not possible to use meta because I've no idea of all the possible fields that are available in the xml and dask is necessary because the true xml files are bigger than 1Gb each.
example.xml:
<?xml version="1.0" encoding="UTF-8"?>
<itemList>
<eventItem uid="1">
<timestamp>2019-07-04T09:57:35.044Z</timestamp>
<eventType>generic</eventType>
<details>
<detail>
<name>columnA</name>
<value>AAA</value>
</detail>
<detail>
<name>columnB</name>
<value>BBB</value>
</detail>
</details>
</eventItem>
<eventItem uid="2">
<timestamp>2019-07-04T09:57:52.188Z</timestamp>
<eventType>generic</eventType>
<details>
<detail>
<name>columnC</name>
<value>CCC</value>
</detail>
</details>
</eventItem>
</itemList>
Working pandas code:
import xmltodict
import collections
import pandas as pd
def pd_output_dict(details):
detail = details.get("detail", [])
ret_value = {}
if type(detail) in (collections.OrderedDict, dict):
ret_value[detail["name"]] = detail["value"]
elif type(detail) == list:
for i in detail:
ret_value[i["name"]] = i["value"]
return pd.Series(ret_value)
with open("example.xml", "r", encoding="utf8") as f:
df_dict_list = xmltodict.parse(f.read()).get("itemList", {}).get("eventItem", [])
df = pd.DataFrame(df_dict_list)
df = pd.concat([df, df.apply(lambda row: pd_output_dict(row.details), axis=1, result_type="expand")], axis=1)
print(df.head())
Not working dask code:
import xmltodict
import collections
import dask
import dask.bag as db
import dask.dataframe as dd
def dd_output_dict(row):
detail = row.get("details", {}).get("detail", [])
ret_value = {}
if type(detail) in (collections.OrderedDict, dict):
row[detail["name"]] = detail["value"]
elif type(detail) == list:
for i in detail:
row[i["name"]] = i["value"]
return row
with open("example.xml", "r", encoding="utf8") as f:
df_dict_list = xmltodict.parse(f.read()).get("itemList", {}).get("eventItem", [])
df_bag = db.from_sequence(df_dict_list)
df = df_bag.to_dataframe()
df = df.apply(lambda row: dd_output_dict(row), axis=1)
The idea is to have in dask similar result I've in pandas but a the moment I'm receiving errors:
>>> df = df.apply(lambda row: output_dict(row), axis=1)
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\dask\dataframe\utils.py", line 169, in raise_on_meta_error
yield
File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 4711, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "C:\Anaconda3\lib\site-packages\dask\utils.py", line 854, in __call__
return getattr(obj, self.method)(*args, **kwargs)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 6487, in apply
return op.get_result()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 151, in get_result
return self.apply_standard()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 257, in apply_standard
self.apply_series_generator()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 286, in apply_series_generator
results[i] = self.f(v)
File "<stdin>", line 1, in <lambda>
File "<stdin>", line 4, in output_dict
AttributeError: ("'str' object has no attribute 'get'", 'occurred at index 0')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 3964, in apply
M.apply, self._meta_nonempty, func, args=args, udf=True, **kwds
File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 4711, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "C:\Anaconda3\lib\contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "C:\Anaconda3\lib\site-packages\dask\dataframe\utils.py", line 190, in raise_on_meta_error
raise ValueError(msg)
ValueError: Metadata inference failed in `apply`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
AttributeError("'str' object has no attribute 'get'", 'occurred at index 0')
Traceback:
---------
File "C:\Anaconda3\lib\site-packages\dask\dataframe\utils.py", line 169, in raise_on_meta_error
yield
File "C:\Anaconda3\lib\site-packages\dask\dataframe\core.py", line 4711, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "C:\Anaconda3\lib\site-packages\dask\utils.py", line 854, in __call__
return getattr(obj, self.method)(*args, **kwargs)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 6487, in apply
return op.get_result()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 151, in get_result
return self.apply_standard()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 257, in apply_standard
self.apply_series_generator()
File "C:\Anaconda3\lib\site-packages\pandas\core\apply.py", line 286, in apply_series_generator
results[i] = self.f(v)
File "<stdin>", line 1, in <lambda>
File "<stdin>", line 4, in output_dict
Right, so operations like map_partitions will need to know the column names and data types. As you've mentioned, you can specify this with the meta= keyword.
Perhaps you can run through your data once to compute what these will be, and then construct a proper meta object, and pass that in? This is inefficient, and requires reading through all of your data, but I'm not sure that there is another way.
I want to get live prices of concurrency from rest api of binance.
I am using:
def inCoin(coin):
url = 'https://api.binance.com/api/v3/ticker/price?symbol='+coin+'USDT'
df = pd.read_json(url)
df.columns = ["symbol","price"]
return df
It gives the following error when this function is called:
Traceback (most recent call last):
File "ee2.py", line 201, in <module>
aa = inCoin('BTC')
File "ee2.py", line 145, in inCoin
df = pd.read_json(url, orient='index')
File "/home/hspace/.local/lib/python3.6/site-packages/pandas/io/json/json.py", line 422, in read_json
result = json_reader.read()
File "/home/hspace/.local/lib/python3.6/site-packages/pandas/io/json/json.py", line 529, in read
obj = self._get_object_parser(self.data)
File "/home/hspace/.local/lib/python3.6/site-packages/pandas/io/json/json.py", line 546, in _get_object_parser
obj = FrameParser(json, **kwargs).parse()
File "/home/hspace/.local/lib/python3.6/site-packages/pandas/io/json/json.py", line 638, in parse
self._parse_no_numpy()
File "/home/hspace/.local/lib/python3.6/site-packages/pandas/io/json/json.py", line 861, in _parse_no_numpy
loads(json, precise_float=self.precise_float), dtype=None).T
File "/home/hspace/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 348, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/home/hspace/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 459, in _init_dict
return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "/home/hspace/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 7356, in _arrays_to_mgr
index = extract_index(arrays)
File "/home/hspace/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 7393, in extract_index
raise ValueError('If using all scalar values, you must pass'
ValueError: If using all scalar values, you must pass an index
Previously, I used this function to call historical data from binance api:
def Cryptodata2(symbol,tick_interval='1m'):
url = 'https://api.binance.com/api/v1/klines?symbol='+symbol+'&interval='+tick_interval
df = pd.read_json(url)
df.columns = [ "date","open","high","low","close","volume",
"close time","quote asset volume","number of trades","taker buy base asset volume",
"Taker buy quote asset volume","ignore"]
df['date'] = pd.to_datetime(df['date'],dayfirst=True, unit = 'ms')
df.set_index('date',inplace=True)
del df['ignore']
return df
And this works fluently.
I just want price of that coin and show it as an integer or dataframe from this url:
https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT
Thanks for helping me.
Also, it would be great if you could provide more detail on debugging such "value" errors.
I have an h5 store named weather.h5. My default Python environment is 3.5.2. When I try to read this store I get TypeError: Already tz-aware, use tz_convert to convert.
I've tried both pd.read_hdf('weather.h5','weather_history') and pd.io.pytables.HDFStore('weather.h5')['weather_history], but I get the error no matter what.
I can open the h5 in a Python 2.7 environment. Is this a bug in Python 3 / pandas?
I have the same issue. I'm using Anaconda Python: 3.4.5 and 2.7.3. Both are using pandas 0.18.1.
Here is a reproducible example:
generate.py (to be executed with Python2):
import pandas as pd
from pandas import HDFStore
index = pd.DatetimeIndex(['2017-06-20 06:00:06.984630-05:00', '2017-06-20 06:03:01.042616-05:00'], dtype='datetime64[ns, CST6CDT]', freq=None)
p1 = [0, 1]
p2 = [0, 2]
# Saving any of these dataframes cause issues
df1 = pd.DataFrame({"p1":p1, "p2":p2}, index=index)
df2 = pd.DataFrame({"p1":p1, "p2":p2, "i":index})
store = HDFStore("./test_issue.h5")
store['df'] = df1
#store['df'] = df2
store.close()
read_issue.py:
import pandas as pd
from pandas import HDFStore
store = HDFStore("./test_issue.h5", mode="r")
df = store['/df']
store.close()
print(df)
Running read_issue.py in Python2 has no issues and produces this output:
p1 p2
2017-06-20 11:00:06.984630-05:00 0 0
2017-06-20 11:03:01.042616-05:00 1 2
But running it in Python3 produces Error with this traceback:
Traceback (most recent call last):
File "read_issue.py", line 5, in
df = store['df']
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 417, in getitem
return self.get(key)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 634, in get
return self._read_group(group)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 1272, in _read_group
return s.read(**kwargs)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 2779, in read
ax = self.read_index('axis%d' % i)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 2367, in read_index
_, index = self.read_index_node(getattr(self.group, key))
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 2492, in read_index_node
_unconvert_index(data, kind, encoding=self.encoding), **kwargs)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/indexes/base.py", line 153, in new
result = DatetimeIndex(data, copy=copy, name=name, **kwargs)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/util/decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/tseries/index.py", line 321, in new
raise TypeError("Already tz-aware, use tz_convert "
TypeError: Already tz-aware, use tz_convert to convert.
Closing remaining open files:./test_issue.h5...done
So, there is an issue with indices. However, if you save df2 in generate.py (datetime as a column, not as an index), then Python3 in read_issue.py produces a different error:
Traceback (most recent call last):
File "read_issue.py", line 5, in
df = store['/df']
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 417, in getitem
return self.get(key)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 634, in get
return self._read_group(group)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 1272, in _read_group
return s.read(**kwargs)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/io/pytables.py", line 2788, in read
placement=items.get_indexer(blk_items))
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 2518, in make_block
return klass(values, ndim=ndim, fastpath=fastpath, placement=placement)
File "/home/denper/anaconda3/envs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 90, in init
len(self.mgr_locs)))
ValueError: Wrong number of items passed 2, placement implies 1
Closing remaining open files:./test_issue.h5...done
Also, if you execute generate_issue.py in Python3 (saving either df1 or df2), then there is no problem executing read_issue.py in either Python3 or Python2
I have data that sorts a csv by using groupby and then plots the information. I used a small sample of information to create the code. It ran smoothly and so then I tried running it with the huge file of data.
I am pretty new at Python and this problem has been quite frustrating so even suggestions on how to troubleshoot this problem would be helpful.
My code is stopping in this section:
import pandas as pd
df =pd.DataFrame.from_csv('MYDATA.csv')
mode = lambda ts: ts.value_counts(sort=True).index[0]
I tried selecting only parts of the huge data file and it ran, but for the entire thing I am getting this error:
IndexError: index 0 is out of bounds for axis 0 with size 0
But I've looked at the two data set side-by-side and the columns are the same! I noticed that the big file has some utf8 issues with accents and I am working on combing those out, but this IndexError is perplexing me.
Here is the traceback
runfile('C:/Users/jbyrusb/Documents/Python Scripts/Tests/tests/TopSixCustomersExecute.py', wdir='C:/Users/jbyrusb/Documents/Python Scripts/Tests/tests')
Traceback (most recent call last):
File "<ipython-input-45-53a2a006076e>", line 1, in <module>
runfile('C:/Users/jbyrusb/Documents/Python Scripts/Tests/tests/TopSixCustomersExecute.py', wdir='C:/Users/jbyrusb/Documents/Python Scripts/Tests/tests')
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 682, in runfile
execfile(filename, namespace)
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/jbyrusb/Documents/Python Scripts/Tests/tests/TopSixCustomersExecute.py", line 23, in <module>
df = df.groupby('CompanyName')[['Column1','Name', 'Birthday', 'Country', 'County']].agg(mode).T.reindex(columns=cols)
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\groupby.py", line 676, in agg
return self.aggregate(func, *args, **kwargs)
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2674, in aggregate
result = self._aggregate_generic(arg, *args, **kwargs)
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2722, in _aggregate_generic
return self._aggregate_item_by_item(func, *args, **kwargs)
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2751, in _aggregate_item_by_item
colg.aggregate(func, *args, **kwargs), data)
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2307, in aggregate
result = self._aggregate_named(func_or_funcs, *args, **kwargs)
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2394, in _aggregate_named
output = func(group, *args, **kwargs)
File "C:/Users/jbyrusb/Documents/Python Scripts/Tests/tests/TopSixCustomersExecute.py", line 20, in <lambda>
mode = lambda ts: ts.value_counts(sort=True).index[0]
File "C:\Users\jbyrusb\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\index.py", line 915, in __getitem__
return getitem(key)
IndexError: index 0 is out of bounds for axis 0 with size 0
It is difficult without seeing the data causing the error, but try this:
mode = (lambda ts: ts.value_counts(sort=True).index[0]
if len(ts.value_counts(sort=True)) else None)
Had the same issue i resolved by changing the sep argument from
sep='\t'
to
sep=','.
Hope it saves someone.