I am attempting to use Dask to handle a large file (50 GB). Typically, I would load it into memory and use pandas. I want to group by two columns, "A" and "B", and whenever column "C" starts with a value, I want to repeat that value in that column for that particular group.
In pandas, I would do the following:
df['C'] = df.groupby(['A','B'])['C'].fillna(method = 'ffill')
What would be the equivalent in Dask?
Also, I am a little bit lost as to how to structure problems in Dask as opposed to in Pandas,
thank you,
My progress so far:
First set index:
df1 = df.set_index(['A','B'])
Then groupby:
df1.groupby(['A','B']).apply(lambda x: x.fillna(method='ffill')).compute()
It appears dask does not currently implement the fillna method for GroupBy objects. I've tried PRing it some time ago and gave up quite quickly.
Also, dask doesn't support the method parameter (as it isn't always trivial to implement with delayed algorithms).
A workaround for this could be using fillna before grouping, like so:
df['C'] = df.fillna(0).groupby(['A','B'])['C']
Although this wasn't tested.
You can find my (failed) attempt here: https://github.com/nirizr/dask/tree/groupy_fillna
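For reference, here is a minimal pandas sketch of the per-group forward fill the question describes, next to a constant fill, to show that the two are not interchangeable. The toy data is hypothetical, and the groupby `.ffill()` method is used in place of `fillna(method='ffill')`, which newer pandas releases deprecate.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two groups keyed by (A, B), with gaps in C.
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2],
    "B": ["x", "x", "x", "y", "y", "y"],
    "C": [10.0, np.nan, np.nan, np.nan, 20.0, np.nan],
})

# Forward-fill C within each (A, B) group; values never leak across groups,
# so the leading gap in the second group stays missing.
filled = df.groupby(["A", "B"])["C"].ffill()

# Filling with a constant instead replaces every gap with 0.
constant = df["C"].fillna(0)

print(filled.tolist())
print(constant.tolist())
```

The forward fill leaves the first row of group (2, "y") as NaN because no earlier value exists in that group, whereas the constant fill overwrites it.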
Related
I want to use Dask for operations of the form
df.groupby(some_columns).apply(some_function)
where some_function() may compute some summary statistics, perform timeseries forecasting, or even just save the group to a single file in AWS S3.
Dask documentation states (and several other StackOverflow answers cite) that groupby-apply is not appropriate for aggregations:
Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
It is not clear whether Aggregation supports operations on multiple columns. However, this DataFrames tutorial seems to do exactly what I'm suggesting, with roughly some_function = lambda x: LinearRegression().fit(...). The example seems to work as intended, and I've similarly had no problems so far with e.g. some_function = lambda x: x.to_csv(...).
Under what conditions can I expect that some_function will be passed all rows of the group? If this is never guaranteed, is there a way to break the LinearRegression example? And most importantly, what is the best way to handle these use cases?
It appears that the current version of documentation and the source code are not in sync. Specifically, in the source code for dask.groupby, there is this message:
Dask groupby supports reductions, i.e., mean, sum and alike, and apply. The former do not shuffle the data and are efficiently implemented as tree reductions. The latter is implemented by shuffling the underlying partitions such that all items of a group can be found in the same partition.
This is not consistent with the warning in the docs about partition-group. The snippet below and task graph visualization also show that there is shuffling of data to ensure that partitions contain all members of the same group:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({'group': [0,0,1,1,2,0,1,2], 'npeople': [1,2,3,4,5,6,7,8]})
ddf = dd.from_pandas(df, npartitions=2)
def myfunc(df):
    return df['npeople'].sum()
results_pandas = df.groupby('group').apply(myfunc).sort_index()
results_dask = ddf.groupby('group').apply(myfunc).compute().sort_index()
print(sum(results_dask != results_pandas))
# will print 0, so there are no differences
# between dask and pandas groupby
This is speculative, but maybe one way to work around the scenario where partition-group leads to rows from a single group split across partitions is to explicitly re-partition the data in a way that ensures each group is associated with a unique partition.
One way to achieve that is by creating an index that is the same as the group identifier column. This in general is not a cheap operation, but it can be helped by pre-processing the data in a way that the group identifier is already sorted.
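To make the partition-group concern concrete, here is a pandas-only sketch that simulates partitions by hand. It is not how Dask is implemented; the two "partitions" are just slices of a small frame, and the modulo routing is a hypothetical stand-in for a shuffle on the group key.

```python
import pandas as pd

df = pd.DataFrame({
    "group": [0, 0, 1, 1, 2, 0, 1, 2],
    "npeople": [1, 2, 3, 4, 5, 6, 7, 8],
})

def myfunc(s):
    # A reduction: one number per group.
    return s.sum()

def per_partition(parts):
    # Apply the reduction independently inside each simulated partition.
    return pd.concat([p.groupby("group")["npeople"].apply(myfunc) for p in parts])

# Split by row position: groups 0 and 1 straddle both partitions,
# so the result has one row per partition-group pair (5 rows, not 3).
naive = per_partition([df.iloc[:4], df.iloc[4:]])

# Route rows by group key instead, so each group lives in exactly one partition.
shuffled = per_partition([df[df["group"] % 2 == i] for i in range(2)]).sort_index()

expected = df.groupby("group")["npeople"].apply(myfunc)
print(len(naive), shuffled.tolist(), expected.tolist())
```

Once every group is confined to a single partition, the per-partition results agree with the plain pandas groupby.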
I'm trying to organise data using a pandas dataframe.
Given the structure of the data it seems logical to use a composite index; 'league_id' and 'fixture_id'. I believe I have implemented this according to the examples in the docs, however I am unable to access the data using the index.
My code can be found here;
https://repl.it/repls/OldCorruptRadius
** I am very new to Pandas and programming in general, so any advice would be much appreciated! Thanks! **
For multi-indexing, you would need to use the pandas MultiIndex API, which brings its own learning curve; thus, I would not recommend it for beginners. Link: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
The way that I use multi-indexing is only to display a final product to others (i.e. making it easy/pretty to view). Before the multi-indexing, you filter the fixture_id and league_id as columns first:
df = pd.DataFrame(fixture, columns=features)
df[(df['fixture_id'] == 592) & (df['league_id'] == 524)]
This way, you are still targeting the same values you would have used as index levels had you gone through with multi-indexing the two columns.
If you have to use multi-indexing, try the transpose feature (`.T`) of a pandas DataFrame. This turns the indexes into columns and vice versa. For example, you can do something like this:
df = pd.DataFrame(fixture, columns=features).set_index(['league_id', 'fixture_id'])
df.T[524][592].loc['event_date'] # gets you the row of `event_dates`
df.T[524][592].loc['event_date'].iloc[0] # gets you the first instance of event_dates
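As an alternative to transposing, a MultiIndex can also be queried directly with `.loc` and `.xs`. The sketch below uses a small hypothetical stand-in for the fixture data (the real column set in the question is not shown), so treat the values as placeholders.

```python
import pandas as pd

# Hypothetical stand-in for the fixture data in the question.
df = pd.DataFrame({
    "league_id": [524, 524, 525],
    "fixture_id": [592, 593, 700],
    "event_date": ["2019-08-09", "2019-08-10", "2019-08-11"],
}).set_index(["league_id", "fixture_id"])

# One row: pass the full (league_id, fixture_id) key as a tuple.
row = df.loc[(524, 592)]

# One cell: full key plus a column label.
date = df.loc[(524, 592), "event_date"]

# All fixtures in one league: select on the outer level only.
league = df.xs(524, level="league_id")

print(date)
print(league)
```

Here `row` is a Series for the single matching fixture, and `league` is a DataFrame indexed by the remaining `fixture_id` level.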
I have a very large (150M rows - 30GB RAM) dataframe. I do a groupby (around 40 groups) and apply a function on each group. Takes about 30 minutes to perform everything. Here was the code I used:
df = df.groupby(by='date').apply(func=my_func)
Since the operations are not interdependent, I figured I'd use Dask to parallelize the processing of each group separately.
So I use this code:
from dask import dataframe as dd
df_dask = dd.from_pandas(df_pandas, npartitions=40)
template = pd.DataFrame(columns=['A','B','C','D','E'])
df_dask = df_dask.groupby(by='date').apply(func=my_func, meta=template)
df_dask = df_dask.compute()
However, when I run this, I get different results depending on the value of npartitions I give it. If I give a value of 1, it gives me the same (and correct) results, but then it takes the same amount of time as with pandas. If I give it a higher number, it performs faster, but returns way fewer rows. I don't understand the relationship between npartitions and the groupby.
Also, if I try with a slightly larger DataFrame (40GB), Dask runs out of memory, even though I have 64GB on my machine, while pandas is fine.
Any ideas?
Dask's DataFrameGroupBy.apply applies the user-provided function to each partition: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.groupby.DataFrameGroupBy.apply.
If you need a custom reduction, use Aggregation (described in the Aggregate section of the docs): https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate
I've got a small problem. I have two Dask dataframes with the following format:
#DF1.csv
DATE|EVENTNAME|VALUE
#DF2.csv
DATE|EVENTNAME0|EVENTNAME1|...|EVENTNAMEX
I want to merge the value from DF1.csv into DF2.csv, at time t (Date) and column (EventName). I am using Dask at the moment because I'm working with huge datasets (~50 GB). I noticed that you can't use direct assignment of values in Dask, so I tried dd.Series.where:
df[nodeid].where(time, value)  # => results in an error
# (for row in df.iterrows():
#      df2.loc[row[0], row[1][0]] = row[1][1])
I also tried a merge, but the resulting Dask dataframe had no partitions, which results in a MemoryError because all datasets are loaded into memory when I use the .to_csv('data-*.csv') method. It should be easy to merge the dataframes, but I have no clue at the moment. Is there a Dask pro who could help me out?
Edit://
This works well in pandas but not with dask:
for row in df.iterrows():
    df2.loc[row[0], row[1][0]] = row[1][1]
Tried something like that:
for row in df.iterrows():
    df2[row[1][0]] = df2[row[1][0]].where(row[0], row[1][1])
#Result in Error => raise ValueError('Array conditional must be same shape as '
Any ideas?
For everyone who is interested, you can use:
#DF1
df.pivot(index='date', columns='event', values='value') #to create DF2 Memory efficient
see also: https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
Before, it took a very long time, was horribly memory hungry, and did not produce the results I was looking for. Just use pandas pivot if you are trying to change your dataframe's schema.
Edit: There is no reason to use Dask anymore, which speeds up the whole process even further ;)
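A small sketch of the pivot described above, turning the long DF1 layout (date, event, value) into the wide DF2 layout with one column per event. The rows are hypothetical placeholders; note that `pivot` raises a ValueError if the same (date, event) pair appears more than once.

```python
import pandas as pd

# Hypothetical rows in the long DF1 layout.
df1 = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "event": ["EVENTNAME0", "EVENTNAME1", "EVENTNAME0"],
    "value": [1.0, 2.0, 3.0],
})

# One row per date, one column per event; missing combinations become NaN.
df2 = df1.pivot(index="date", columns="event", values="value")
print(df2)
```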
I am using the influxdb Python library to get a list of series in my database. There are about 20k series. I am then trying to build a Pandas dataframe out of the series'. I am rounding on 15S (I'd like to get rid of the nanoseconds, and I wonder why InfluxDB's Python library's documented get_list_series() call is missing in all versions I've tried, but those are other questions...); I'd like to end up with one big dataframe.
Here's the code:
from influxdb import DataFrameClient
import pandas as pd

# ... get series list ...
temp_df = pd.DataFrame()
for series in series_list:
    df = dfclient.query('select time,temp from {} where "series" = \'{}\''.format(location, temp))[location].asfreq('15S')
    df.columns = [series]
    if temp_df.empty:
        temp_df = df
    else:
        temp_df = temp_df.join(df, how='outer')
This starts out fine but after a few hundred series, slows down quickly, grinding nearly to a halt. I am sure I'm not using Pandas the right way, and I'm hoping you can tell me how to do it the right way.
For what it's worth, I'm running this on relatively powerful hardware (which is why I believe I'm doing this the wrong way.)
One more thing: the time series' for each series I pull from InfluxDB may be different than all the others, which is why I'm using join. I'd like to end up with a DF with a column for each series, with the datetimes appropriately sorted in the index; join does that.
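One commonly suggested alternative to joining inside the loop is to collect the per-series frames in a list and combine them once with `pd.concat(..., axis=1)`. Repeated joins likely slow down because each one copies the ever-growing frame. This is a sketch, not a tested fix for the InfluxDB case: the frames below are hypothetical stand-ins for the query results, with deliberately offset time indexes.

```python
import pandas as pd

# Hypothetical stand-ins for the per-series frames returned by the query;
# each one starts 15 seconds later than the previous one.
starts = ["2020-01-01 00:00:00", "2020-01-01 00:00:15", "2020-01-01 00:00:30"]
frames = []
for i, start in enumerate(starts):
    idx = pd.date_range(start, periods=3, freq=pd.Timedelta(seconds=15))
    frames.append(pd.DataFrame({f"series_{i}": [1.0, 2.0, 3.0]}, index=idx))

# A single outer alignment over the union of all timestamps,
# producing one column per series and NaN where a series has no sample.
combined = pd.concat(frames, axis=1).sort_index()
print(combined)
```

In the original loop, this would mean appending each `df` to a list and calling `pd.concat` once after the loop ends.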