Python PANDAS: Converting from pandas/numpy to dask dataframe/array

I am trying to convert a program to be parallelizable/multithreaded with the excellent dask library. Here is the program I am working on converting:
Python PANDAS: Stack by Enumerated Date to Create Records Vectorized
import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask.array as da
from io import StringIO
test_data = '''id,transaction_dt,units,measures
1,2018-01-01,4,30.5
1,2018-01-03,4,26.3
2,2018-01-01,3,12.7
2,2018-01-03,3,8.8'''
df_test = pd.read_csv(StringIO(test_data), sep=',')
df_test['transaction_dt'] = pd.to_datetime(df_test['transaction_dt'])
df_test = df_test.loc[np.repeat(df_test.index, df_test['units'])]
df_test['transaction_dt'] += pd.to_timedelta(df_test.groupby(level=0).cumcount(), unit='d')
df_test = df_test.reset_index(drop=True)
expected results:
id,transaction_dt,measures
1,2018-01-01,30.5
1,2018-01-02,30.5
1,2018-01-03,30.5
1,2018-01-04,30.5
1,2018-01-03,26.3
1,2018-01-04,26.3
1,2018-01-05,26.3
1,2018-01-06,26.3
2,2018-01-01,12.7
2,2018-01-02,12.7
2,2018-01-03,12.7
2,2018-01-03,8.8
2,2018-01-04,8.8
2,2018-01-05,8.8
It occurred to me that this might be a good candidate to try to parallelize because the separate dask partitions should not need to know anything about each other to accomplish the required operations. Here is a naive representation of how I thought it might work:
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test = dd_test.loc[da.repeat(dd_test.index, dd_test['units'])]
dd_test['transaction_dt'] += dd_test.to_timedelta(dd.groupby(level=0).cumcount(), unit='d')
dd_test = dd_test.reset_index(drop=True)
So far I have been trying to work through the following errors or idiomatic differences:
"NotImplementedError: Only integer valued repeats supported."
I have tried to convert the index into a int column/array to try as well but still run into the issue.
2. dask does not support the mutating operator: "+="
3. No dask .to_timedelta() argument
4. No dask .cumcount() (but I think .cumsum() is interchangable?!)
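On point 4, the two do seem interchangeable in the sense that cumcount() can be emulated with cumsum() over a helper column of ones; here is a minimal pandas sketch of the equivalence (independent of dask, using a made-up toy frame):
import pandas as pd

# toy illustration: groupby().cumcount() equals groupby().cumsum()
# over a column of ones, minus 1
df = pd.DataFrame({'id': [1, 1, 1, 2, 2]})
df['helper'] = 1
via_cumcount = df.groupby('id').cumcount()
via_cumsum = df.groupby('id')['helper'].cumsum() - 1
assert (via_cumcount == via_cumsum).all()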
If there are any dask experts out there who might be able to let me know whether there are fundamental impediments that preclude me from trying this, or who have any tips on implementation, that would be a great help!
Edit:
I think I have made a bit of progress on this since posting the question:
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test['helper'] = 1
dd_test = dd_test.loc[da.repeat(dd_test.index, dd_test['units'])]
dd_test['transaction_dt'] = dd_test['transaction_dt'] + (dd_test.groupby('id')['helper'].cumsum()).astype('timedelta64[D]')
dd_test = dd_test.reset_index(drop=True)
However, I am still stuck on the dask array repeats error. Any tips still welcome.
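One partition-local idea I have not been able to verify yet: since the repeats and the date offsets only ever need rows from the same partition, perhaps the plain pandas logic could be wrapped in map_partitions instead of translating it operation by operation (a rough sketch, assuming the imports above):
def expand_rows(pdf):
    # plain pandas, applied independently to each partition
    pdf = pdf.loc[np.repeat(pdf.index, pdf['units'])]
    pdf['transaction_dt'] = pdf['transaction_dt'] + pd.to_timedelta(
        pdf.groupby(level=0).cumcount(), unit='d')
    return pdf.reset_index(drop=True)

dd_test = dd.from_pandas(df_test, npartitions=3)
dd_expanded = dd_test.map_partitions(expand_rows)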

Not sure if this is exactly what you are looking for, but I replaced da.repeat with np.repeat, explicitly cast dd_test.index and dd_test['units'] to numpy arrays, and finally added dd_test['transaction_dt'].astype('M8[us]') to your timedelta calculation.
df_test = pd.read_csv(StringIO(test_data), sep=',')
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test['helper'] = 1
dd_test = dd_test.loc[np.repeat(np.array(dd_test.index),
                                np.array(dd_test['units']))]
dd_test['transaction_dt'] = dd_test['transaction_dt'].astype('M8[us]') + (dd_test.groupby('id')['helper'].cumsum()).astype('timedelta64[D]')
dd_test = dd_test.reset_index(drop=True)
df_expected = dd_test.compute()

Related

Why is pandas eval not working anymore with where?

I was using pandas eval within a where() that sits inside a function in order to create a column in a data frame. While it was working in the past, now it doesn't. There was a recent move to Python 3 within our Dataiku software. Could that be the reason for it?
Below is the code that is now in place:
import pandas as pd, numpy as np
from numpy import where, nan

d = {'ASSET': ['X','X','A','X','B'], 'PRODUCT': ['Z','Y','Z','C','Y']}
MAIN_df = pd.DataFrame(data=d)

def val_per(ASSET, PRODUCT):
    return (
        where(pd.eval("ASSET == 'X' & PRODUCT == 'Z'"), 0.04,
              where(pd.eval("PRODUCT == 'Y'"), 0.08, 1.5)
        )
    )

MAIN_2_df = (MAIN_df.eval("PCT = @val_per(ASSET, PRODUCT)"))
The error received now is <class 'TypeError'>: unhashable type: 'numpy.ndarray'
You can change the last two lines with:
MAIN_2_df = MAIN_df.copy()
MAIN_2_df['PCT'] = val_per(MAIN_2_df.ASSET, MAIN_2_df.PRODUCT)
This vectorized approach will also be faster for large dataframes.
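For reference, a roughly equivalent vectorized sketch using np.select instead of the nested where() calls (reusing the MAIN_df defined in the question) might look like this:
import numpy as np
import pandas as pd

d = {'ASSET': ['X', 'X', 'A', 'X', 'B'], 'PRODUCT': ['Z', 'Y', 'Z', 'C', 'Y']}
MAIN_df = pd.DataFrame(data=d)

# conditions are evaluated in order; the first match wins
conditions = [
    (MAIN_df['ASSET'] == 'X') & (MAIN_df['PRODUCT'] == 'Z'),
    MAIN_df['PRODUCT'] == 'Y',
]
choices = [0.04, 0.08]

MAIN_2_df = MAIN_df.copy()
MAIN_2_df['PCT'] = np.select(conditions, choices, default=1.5)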

Reading partitioned data (parquets) using dask with 'int64' vs 'int64 not null'

I have this annoying situation where some of my parquet files have:
x: int64
and others have
x: int64 not null
and ergo (in dask 2.8.0/numpy 1.15.1/pandas 0.25.3) I can't run the following:
test: Union[pd.Series, pd.DataFrame, np.ndarray] = dd.read_parquet(input_path).query(filter_string)[input_columns].compute()
Anyone know what I can do short of upgrading dask/numpy (as I know the latest dask/numpy seem to work)?
Thanks in advance!
If you know which files contain the different dtypes, then it's best to re-process them (load/convert dtype/save).
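A one-off re-processing pass could look roughly like this (the file list and the target dtype 'Int64' for column x are assumptions based on the question):
import pandas as pd

files_to_fix = ['part_001.parquet', 'part_002.parquet']  # hypothetical paths

for fpath in files_to_fix:
    df = pd.read_parquet(fpath)
    df['x'] = df['x'].astype('Int64')  # normalize to one consistent dtype
    df.to_parquet(fpath)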
If that's not an option, then you can create a dask dataframe from delayed objects with something like this:
import pandas as pd
from dask import delayed
import dask.dataframe as dd

@delayed
def custom_load(fpath):
    df = pd.read_parquet(fpath)
    df = df.astype({'x': 'Int64'})  # the appropriate dtype
    return df

delayed_dfs = [custom_load(f) for f in files]  # where files is the list of files
ddf = dd.from_delayed(delayed_dfs)  # can also provide the meta option if known
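If the schema is known up front, passing meta explicitly (as a dict of column names to dtypes; the names below are placeholders) saves dask from having to infer it:
meta = {'x': 'Int64', 'other_col': 'float64'}  # placeholder columns/dtypes
ddf = dd.from_delayed(delayed_dfs, meta=meta)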

How to calculate averages of datetime64[ns] numpy.ndarray?

With the following data, I would like to show the mean and other averages:
time = ['2020-01-01T00:00:00.000000000' '2020-01-02T00:00:00.000000000'
'2020-01-03T00:00:00.000000000' '2020-01-04T00:00:00.000000000'
'2020-01-05T00:00:00.000000000' '2020-01-06T00:00:00.000000000'
'2020-01-07T00:00:00.000000000' '2020-01-08T00:00:00.000000000'
'2020-01-09T00:00:00.000000000' '2020-01-10T00:00:00.000000000'
'2020-01-11T00:00:00.000000000' '2020-01-12T00:00:00.000000000'
'2020-01-13T00:00:00.000000000' '2020-01-14T00:00:00.000000000'
'2020-01-15T00:00:00.000000000' '2020-01-16T00:00:00.000000000'
'2020-01-17T00:00:00.000000000' '2020-01-18T00:00:00.000000000'
'2020-01-19T00:00:00.000000000' '2020-01-20T00:00:00.000000000'
'2020-01-21T00:00:00.000000000' '2020-01-22T00:00:00.000000000'
'2020-01-23T00:00:00.000000000' '2020-01-24T00:00:00.000000000'
'2020-01-25T00:00:00.000000000' '2020-01-26T00:00:00.000000000'
'2020-01-27T00:00:00.000000000' '2020-01-28T00:00:00.000000000'
'2020-01-29T00:00:00.000000000' '2020-01-30T00:00:00.000000000'
'2020-01-31T00:00:00.000000000']
print(np.mean(time)) has an error: TypeError: cannot perform reduce with flexible type
I think I may need to use pandas / a DataFrame / slicing, however I am unsure how to do this.
First you need to add commas between the entries of your list. Then, a possible option is to use pandas:
import pandas as pd
import numpy as np
You can convert your string list to a pandas datetime list.
time_pd = pd.to_datetime(time)
Then turn this into an integer list and perform all the calculations you want. For example, calculating the mean:
time_np = time_pd.astype(np.int64)
average_time_np = np.average(time_np)
average_time_pd = pd.to_datetime(average_time_np)
print(average_time_pd)
Which prints: 2020-01-16 00:00:00
There are certainly ways to cast the time strings directly to numpy without using pandas, but that's the solution that I could figure out without much more research.
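As a side note, recent pandas versions can average datetimes directly, so the intermediate integer cast may not even be needed (a sketch, assuming a pandas version in which DatetimeIndex.mean() is available and that commas have been added to the list):
time_pd = pd.to_datetime(time)  # DatetimeIndex
print(time_pd.mean())           # expected: 2020-01-16 00:00:00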
Here is one approach based on converting to and from Unix time:
dt = np.array(time, dtype='datetime64')
delta_sec = np.timedelta64(1, 's')
epoch = '1970-01-01T00:00:00'
epoch_sec = (dt - np.datetime64(epoch)) / delta_sec
epoch_sec_mean = np.mean(epoch_sec)
dt_mean = np.datetime64(epoch) + np.timedelta64(int(epoch_sec_mean), 's')
print(dt_mean)
Output
2020-01-16T00:00:00

How to remedy excessive hard disk usage (>>100GB) by Dask Dataframe when shuffling data

I need to calculate statistics per segment of large (15 - 20 GB) CSV files. This I do with groupby() in Dask Dataframe.
The problem is that I need custom functions, because I need kurtosis and skew, which are not part of Dask. Therefore I use groupby().apply(). However, this makes Dask use tremendous amounts of disk drive space in my Temp directory: more than 150 GB just running the script once! This causes my hard drive to run out of space, making the script crash.
Is there a way to rewrite the code which makes it avoid writing such an enormous amount of junk to my Temp directory?
Example code is given below:
Example 1 runs relatively fast, and doesn't generate tons of Temp output, but it doesn't support kurtosis or skew.
Example 2 calculates also kurtosis and skew, but fills up my hard disk if I run it for the full dataset.
Any help would be appreciated!
By the way, this page (https://docs.dask.org/en/latest/dataframe-groupby.html) suggests using an indexed column for the groupby(). Unfortunately, multi-indexing is not supported by Dask Dataframe, so that does not solve my problem.
import dask.dataframe as dd
import pandas as pd
import numpy as np
import scipy.stats as sps
from dask.diagnostics import ProgressBar

ddf = dd.read_csv('18_GB_csv_file.csv')

segmentations = { 'seg1' : ['col1', 'col2'],
                  'seg2' : ['col1', 'col2', 'col3', 'col4'],
                  'seg3' : ['col3', 'col4'],
                  'seg4' : ['col1', 'col2', 'col5']
                }
data_cols = [ 'datacol1', 'datacol2', 'datacol3' ]

# Example 1: this runs fast and doesn't generate needless temp output.
# But it does not support "kurt" or "skew":
dd_comp = {}
for seg_group, seg_cols in segmentations.items():
    df_grouped = ddf.groupby(seg_cols)[data_cols]
    dd_comp[seg_group] = df_grouped.aggregate( ['mean', 'std', 'min', 'max'])

with ProgressBar():
    segmented_stats = dd.compute(dd_comp)

# Example 2: includes also "kurt" and "skew". But it is painfully slow
# and generates >150 GB of Temp output before running out of disk space
empty_segment = pd.DataFrame( index=data_cols,
                              columns=['mean', 'std',
                                       'min', 'max', 'kurt', 'skew']
                            )

def segment_statistics(segment):
    stats = empty_segment.copy()
    for col in data_cols:
        stats.loc[col, 'mean'] = np.mean(segment[col])
        stats.loc[col, 'std'] = np.std(segment[col])
        stats.loc[col, 'min'] = np.min(segment[col])
        stats.loc[col, 'max'] = np.max(segment[col])
        stats.loc[col, 'skew'] = sps.skew(segment[col])
        stats.loc[col, 'kurt'] = sps.kurtosis(segment[col]) + 3
    return stats

dd_comp = {}
for seg_group, seg_cols in segmentations.items():
    df_grouped = ddf.groupby(seg_cols)[data_cols]
    dd_comp[seg_group] = df_grouped.apply( segment_statistics,
                                           meta=empty_segment )

with ProgressBar():
    segmented_stats = dd.compute(dd_comp)
It sounds like you might benefit from custom aggregations: https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate
If you're able to come up with some nice implementations for higher-order moments, those also sound like they would be nice contributions to the project.
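For illustration, here is a sketch of what a custom skewness aggregation built from raw moments might look like (a biased Fisher-Pearson estimate, not tested against the exact versions in the question; a kurtosis aggregation would follow the same pattern with a fourth-power term):
import dask.dataframe as dd

# Per-group raw moments are cheap to combine across partitions; the
# (biased) skewness is recovered from them in the finalize step.
skew_agg = dd.Aggregation(
    name='skew',
    # per partition: count, sum, sum of squares, sum of cubes for each group
    chunk=lambda s: (
        s.count(),
        s.sum(),
        s.agg(lambda x: (x ** 2).sum()),
        s.agg(lambda x: (x ** 3).sum()),
    ),
    # combine the partition results by summing each piece per group
    agg=lambda n, s1, s2, s3: (n.sum(), s1.sum(), s2.sum(), s3.sum()),
    # central moments: m2 = E[x^2] - mean^2, m3 = E[x^3] - 3*mean*E[x^2] + 2*mean^3
    finalize=lambda n, s1, s2, s3: (
        (s3 / n - 3 * (s1 / n) * (s2 / n) + 2 * (s1 / n) ** 3)
        / (s2 / n - (s1 / n) ** 2) ** 1.5
    ),
)

# Hypothetical usage, mirroring Example 1 above:
# dd_comp[seg_group] = ddf.groupby(seg_cols)[data_cols].agg(
#     ['mean', 'std', 'min', 'max', skew_agg])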

xarray groupby: Apply different reducers to variables

I'm using xarray's groupby + reducer to perform spatial overlay/aggregation on spatial rasters. I'm wondering if there is a way to use a different reducer for certain data variables. In the code below, for instance, I would like categorical_variable to be reduced with first() (or mode(), but that doesn't seem to be implemented), and continuous_variable to be reduced with mean().
import xarray as xr
import numpy as np

categorical_variable = np.array([[1,1,1,1,1],
                                 [1,1,1,1,2],
                                 [1,1,1,2,2],
                                 [1,1,2,2,2],
                                 [1,2,2,2,2]], dtype='int16')
grouping_variable = np.array([[1,1,1,2,2],
                              [1,1,3,2,2],
                              [1,3,3,3,3],
                              [3,3,3,3,3],
                              [4,4,4,4,4]], dtype='int16')
continuous_variable = np.random.rand(5,5)

xr_dataset = xr.Dataset({'grouping_variable': xr.DataArray(grouping_variable,
                                                           dims=['x', 'y']),
                         'categorical_variable': xr.DataArray(categorical_variable,
                                                              dims=['x', 'y']),
                         'continuous_variable': xr.DataArray(continuous_variable,
                                                             dims=['x', 'y'])})

xr_grouped = xr_dataset.groupby('grouping_variable')
xr_reduced = xr_grouped.mean()
This isn't currently possible in one go in xarray AFAIK, but since you're losing the spatial structure anyway, you can go via pandas quite simply and use agg:
>>> df = xr_dataset.to_dataframe()
>>> df.groupby('grouping_variable').agg({"categorical_variable": "first",
                                         "continuous_variable": "mean"})
                   categorical_variable  continuous_variable
grouping_variable
1                                     1             0.458534
2                                     1             0.822294
3                                     1             0.539483
4                                     1             0.515586
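If an xarray object is needed afterwards, the aggregated frame can presumably be converted back with to_xarray():
ds_reduced = (
    df.groupby('grouping_variable')
      .agg({'categorical_variable': 'first',
            'continuous_variable': 'mean'})
      .to_xarray()
)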
The performance is not optimal but this is what I ended up doing:
xr_dataset = xr.merge([
    xr_dataset.categorical_variable.groupby('grouping_variable').first(),
    xr_dataset.continuous_variable.groupby('grouping_variable').mean(),
    ...
])
