`ValueError: cannot reindex from a duplicate axis` using Dask DataFrame - python

I've been trying to adapt my code to use Dask so that the processing can be spread across multiple machines. While the initial data load is not time-consuming, the subsequent processing takes roughly 12 hours on an 8-core i5. That isn't ideal, and I figured Dask would help by distributing the work across machines. The following code works fine with the standard Pandas approach:
import pandas as pd

artists = pd.read_csv("artists.csv")
print(f"... shape before cleaning {artists.shape}")
artists["name"] = artists["name"].astype("str")
artists["name"] = (
    artists["name"]
    .str.encode("ascii", "ignore")
    .str.decode("ascii")
    .str.lower()
    .str.replace("&", " and ", regex=False)
    .str.strip()
)
Converting to Dask seemed straightforward, but I'm hitting hiccups along the way. The following Dask-adapted code throws ValueError: cannot reindex from a duplicate axis:
import dask.dataframe as dd
from dask.distributed import Client

artists = dd.read_csv("artists.csv")
print(f"... shape before cleaning {artists.shape}")
artists["name"] = artists["name"].astype(str).compute()
artists["name"] = (
    artists["name"]
    .str.encode("ascii", "ignore")
    .str.decode("ascii")
    .str.lower()
    .str.replace("&", " and ", regex=False)
    .str.strip().compute()
)

if __name__ == '__main__':
    client = Client()
The best I can discern is that Dask won't allow reassignment to an existing Dask DataFrame. So this works:
...
artists_new = artists["name"].astype("str").compute()
...
However, I really don't want to create a new DataFrame each time. I'd rather replace the existing DataFrame with a new one, mainly because I have multiple data cleaning steps before processing.
While the tutorial and guides are useful, they are pretty basic and don't cover such use cases.
What are the preferred approaches here with Dask DataFrames?

Every time you call .compute() on a Dask DataFrame or Series, it is converted into a pandas object. So what is happening in this line
artists["name"] = artists["name"].astype(str).compute()
is that you are computing the string column and then assigning a pandas Series back to a Dask Series (without ensuring alignment of partitions), which is what triggers the reindexing error. The solution is to call .compute() only on the final result, while intermediate steps can use regular pandas syntax:
# modified example (.compute is removed)
artists["name"] = artists["name"].astype(str).str.lower()

Related

Best way to perform arbitrary operations on groups with Dask DataFrames

I want to use Dask for operations of the form
df.groupby(some_columns).apply(some_function)
where some_function() may compute some summary statistics, perform timeseries forecasting, or even just save the group to a single file in AWS S3.
Dask documentation states (and several other StackOverflow answers cite) that groupby-apply is not appropriate for aggregations:
Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
It is not clear whether Aggregation supports operations on multiple columns. However, this DataFrames tutorial seems to do exactly what I'm suggesting, with roughly some_function = lambda x: LinearRegression().fit(...). The example seems to work as intended, and I've similarly had no problems so far with e.g. some_function = lambda x: x.to_csv(...).
Under what conditions can I expect that some_function will be passed all rows of the group? If this is never guaranteed, is there a way to break the LinearRegression example? And most importantly, what is the best way to handle these use cases?
It appears that the current version of documentation and the source code are not in sync. Specifically, in the source code for dask.groupby, there is this message:
Dask groupby supports reductions, i.e., mean, sum and alike, and apply. The former do not shuffle the data and are efficiently implemented as tree reductions. The latter is implemented by shuffling the underlying partitions such that all items of a group can be found in the same partition.
This is not consistent with the warning in the docs about partition-group. The snippet below and task graph visualization also show that there is shuffling of data to ensure that partitions contain all members of the same group:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({'group': [0,0,1,1,2,0,1,2], 'npeople': [1,2,3,4,5,6,7,8]})
ddf = dd.from_pandas(df, npartitions=2)
def myfunc(df):
    return df['npeople'].sum()
results_pandas = df.groupby('group').apply(myfunc).sort_index()
results_dask = ddf.groupby('group').apply(myfunc).compute().sort_index()
print(sum(results_dask != results_pandas))
# will print 0, so there are no differences
# between dask and pandas groupby
This is speculative, but maybe one way to work around the scenario where rows from a single group end up split across partitions is to explicitly re-partition the data in a way that ensures each group is associated with a unique partition.
One way to achieve that is by creating an index that is the same as the group identifier column. This in general is not a cheap operation, but it can be helped by pre-processing the data in a way that the group identifier is already sorted.
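A minimal sketch of that idea, reusing the toy frame from the snippet above (the only assumption is that the group identifier is suitable as an index):
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({'group': [0, 0, 1, 1, 2, 0, 1, 2],
                   'npeople': [1, 2, 3, 4, 5, 6, 7, 8]})
ddf = dd.from_pandas(df, npartitions=2)

# set_index shuffles and sorts, so every value of 'group' lands in exactly one partition
ddf_by_group = ddf.set_index('group')

# each partition now holds whole groups, so a per-partition function
# sees all rows of any group it receives
per_group = ddf_by_group.map_partitions(
    lambda part: part.groupby(level=0)['npeople'].sum()
)
print(per_group.compute())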

Pandas parallel apply with koalas (pyspark)

I'm new to Koalas (PySpark), and I was trying to use Koalas for a parallel apply, but it seemed like it was using a single core for the whole operation (correct me if I'm wrong), so I ended up using Dask for the parallel apply (using map_partitions), which worked pretty well.
However, I would like to know if there's a way to utilize Koalas for parallel apply.
I used basic code for the operation, like below.
import pandas as pd
import databricks.koalas as ks
my_big_data = ks.read_parquet('my_big_file') # file is single partitioned parquet file
my_big_data['new_column'] = my_big_data['string_column'].apply(my_prep) # my_prep does string operations
my_big_data.to_parquet('my_big_file_modified') # needed because Koalas evaluates lazily
I found a link that discusses this problem: https://github.com/databricks/koalas/issues/1280
If the number of rows handled by the applied function is less than 1,000 (the default value), a pandas DataFrame is used to do the operation.
The user-defined function above, my_prep, is applied to each row, so single-core pandas was being used.
In order to force it to work in a PySpark (parallel) manner, the user should modify the configuration as below.
import databricks.koalas as ks
ks.set_option('compute.default_index_type','distributed') # when .head() call is too slow
ks.set_option('compute.shortcut_limit',1) # forces Koalas to use PySpark instead of the pandas shortcut
Also, explicitly specifying the return type (a type hint) on the user-defined function keeps Koalas from taking the shortcut path and makes the apply run in parallel.
def my_prep(row) -> str:
    return row
kdf['my_column'].apply(my_prep)
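Putting the two fragments together with the question's own read/apply/write flow, a rough sketch might look like this (column and file names come from the question; the body of my_prep is a made-up placeholder):
import databricks.koalas as ks

ks.set_option('compute.default_index_type', 'distributed')
ks.set_option('compute.shortcut_limit', 1)  # don't fall back to the pandas shortcut

def my_prep(value) -> str:  # the return-type hint keeps Koalas on the Spark path
    return value.strip().lower()  # placeholder string operation

my_big_data = ks.read_parquet('my_big_file')
my_big_data['new_column'] = my_big_data['string_column'].apply(my_prep)
my_big_data.to_parquet('my_big_file_modified')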

Dask doesn't group/apply the results properly compared to pandas

I have a very large (150M rows - 30GB RAM) dataframe. I do a groupby (around 40 groups) and apply a function to each group. It takes about 30 minutes to perform everything. Here is the code I used:
df = df.groupby(by='date').apply(func=my_func)
Since the operations are not interdependent, I figured I'd use Dask to parallelize the processing of each group separately.
So I use this code:
from dask import dataframe as dd
df_dask = dd.from_pandas(df_pandas, npartitions=40)
template = pd.DataFrame(columns=['A','B','C','D','E'])
df_dask = df_dask.groupby(by='date').apply(func=my_func, meta=template)
df_dask = df_dask.compute()
However, when I run this, I get different results depending on the value of npartitions I give it. If I give a value of 1, it gives me the same (and correct) results, but then it takes the same amount of time as with pandas. If I give it a higher number, it performs faster, but returns way fewer rows. I don't understand the relationship between npartitions and the groupby.
Also, if I try with a slightly larger DataFrame (40GB), Dask runs out of memory, even though I have 64GB on my machine, while pandas is fine.
Any ideas?
Dask's DataFrameGroupBy.apply applies the user-provided function to each partition: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.groupby.DataFrameGroupBy.apply.
If you need a custom reduction, use dask.dataframe.groupby.Aggregation: https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate
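For reference, a hedged sketch of a custom reduction via dd.Aggregation, adapted from the custom-mean example in the Dask documentation (the 'date' grouping column is taken from the question; the aggregated column 'value' and the toy data are hypothetical):
import dask.dataframe as dd
import pandas as pd

# custom mean built as a tree reduction: no shuffle, one row per group
custom_mean = dd.Aggregation(
    name='custom_mean',
    chunk=lambda s: (s.count(), s.sum()),                 # per partition-group pair
    agg=lambda count, total: (count.sum(), total.sum()),  # combine partial results
    finalize=lambda count, total: total / count,          # final value per group
)

pdf = pd.DataFrame({'date': ['a', 'a', 'b', 'b', 'b'], 'value': [1.0, 2.0, 3.0, 4.0, 5.0]})
ddf = dd.from_pandas(pdf, npartitions=2)
print(ddf.groupby('date').agg({'value': custom_mean}).compute())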

Building a dataframe from many InfluxDB result series

I am using the influxdb Python library to get a list of series in my database. There are about 20k series. I am then trying to build a Pandas DataFrame out of those series. I am rounding on 15S (I'd like to get rid of the nanoseconds, and I wonder why InfluxDB's Python library's documented get_list_series() call is missing in all versions I've tried, but those are other questions...); I'd like to end up with one big DataFrame.
Here's the code:
from influxdb import DataFrameClient
import pandas as pd

.... get series list ...

temp_df = pd.DataFrame()
for series in series_list:
    df = dfclient.query('select time,temp from {} where "series" = \'{}\''.format(location, series))[location].asfreq('15S')
    df.columns = [series]
    if temp_df.empty:
        temp_df = df
    else:
        temp_df = temp_df.join(df, how='outer')
This starts out fine but after a few hundred series, slows down quickly, grinding nearly to a halt. I am sure I'm not using Pandas the right way, and I'm hoping you can tell me how to do it the right way.
For what it's worth, I'm running this on relatively powerful hardware (which is why I believe I'm doing this the wrong way.)
One more thing: the time index for each series I pull from InfluxDB may be different from all the others, which is why I'm using join. I'd like to end up with a DataFrame with a column for each series, and with the datetimes appropriately sorted in the index; join does that.
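One likely culprit is the repeated join: each iteration re-aligns the ever-growing temp_df, so the cost rises with every series added. A minimal sketch of an alternative, assuming the same series_list, dfclient, and location objects as in the question, is to collect the per-series frames in a list and align them once with pd.concat, which also performs an outer join on the index:
import pandas as pd

frames = []
for series in series_list:
    df = dfclient.query('select time,temp from {} where "series" = \'{}\''.format(location, series))[location].asfreq('15S')
    df.columns = [series]
    frames.append(df)

# one outer join on the index across all frames, instead of 20k incremental joins
temp_df = pd.concat(frames, axis=1)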

basic groupby operations in Dask

I am attempting to use Dask to handle a large file (50 GB). Typically, I would load it into memory and use Pandas. I want to group by the two columns "A" and "B", and whenever column "C" starts with a value, I want that value repeated (forward-filled) within that particular group.
In pandas, I would do the following:
df['C'] = df.groupby(['A','B'])['C'].fillna(method = 'ffill')
What would be the equivalent in Dask?
Also, I am a little bit lost as to how to structure problems in Dask as opposed to Pandas.
Thank you.
My progress so far:
First set index:
df1 = df.set_index(['A','B'])
Then groupby:
df1.groupby(['A','B']).apply(lambda x: x.fillna(method='ffill')).compute()
It appears dask does not currently implement the fillna method for GroupBy objects. I tried PRing it some time ago and gave up quite quickly.
Also, dask doesn't support the method parameter (as it isn't always trivial to implement with delayed algorithms).
A workaround for this could be using fillna before grouping, like so:
df['C'] = df.fillna(0).groupby(['A','B'])['C']
Although this wasn't tested.
You can find my (failed) attempt here: https://github.com/nirizr/dask/tree/groupy_fillna
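An alternative sketch, under the assumption that groupby-apply is acceptable here: per the shuffling discussion earlier on this page, dask's groupby-apply gathers each group into a single partition, so the forward fill can be done inside the applied function. The toy data below is made up to stand in for the real 50 GB file, and on data of that size the shuffle itself can be expensive:
import dask.dataframe as dd
import pandas as pd

# toy stand-in for the real data; column names come from the question
pdf = pd.DataFrame({
    'A': [1, 1, 1, 2, 2],
    'B': ['x', 'x', 'x', 'y', 'y'],
    'C': [10.0, None, None, 7.0, None],
})
ddf = dd.from_pandas(pdf, npartitions=2)

# forward-fill column C within each (A, B) group;
# meta tells dask the output is a float64 series named 'C'
filled = ddf.groupby(['A', 'B'])['C'].apply(lambda s: s.ffill(), meta=('C', 'f8'))
print(filled.compute())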
