Dask .categorize very slow - python

I am trying to use Dask in an effort to perform feature extraction on a very large dataset (feature extraction using tsfresh), however I am having trouble with very long processing times.
My data consists of an 'id' column, a 'time' column, and 21 value columns (FP1-F7, F7-T7, etc.). I have it all stored in Parquet files on my hard drive.
To begin with I import the data into a Dask dataframe using the following code.
import dask
from dask import dataframe as dd
df = dd.read_parquet("/Users/oskar/Library/Mobile Documents/com~apple~CloudDocs/Documents/Studies/BSc Sem 7/Bachelor Project/programs/python/data/*/data.parquet")
I then initialise a Dask cluster.
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=8,
                       threads_per_worker=1,
                       scheduler_port=8786,
                       memory_limit='2GB')
cluster.scheduler_address
After that I start a Dask client.
from tsfresh.utilities.distribution import ClusterDaskDistributor
dask_distributor = ClusterDaskDistributor(address="127.0.0.1:8786")
dask_distributor.client
I then melt 'df'...
dfm = df.melt(id_vars=["id", "time"],
              value_vars=['FP1-F7','F7-T7','T7-P7','P7-O1','FP1-F3','F3-C3','C3-P3','P3-O1','FP2-F4','F4-C4',
                          'C4-P4','P4-O2','FP2-F8','F8-T8','T8-P8','P8-O2','FZ-CZ','CZ-PZ','T7-FT9','FT9-FT10',
                          'FT10-T8'],
              var_name="kind",
              value_name="value")
... and group it.
dfm_grouped = dfm.groupby(["id", "kind"])
I then create an instance of 'dask_feature_extraction_on_chunk'.
from tsfresh.convenience.bindings import dask_feature_extraction_on_chunk
from tsfresh.feature_extraction.settings import MinimalFCParameters
features = dask_feature_extraction_on_chunk(dfm_grouped,
                                            column_id="id",
                                            column_kind="kind",
                                            column_sort="time",
                                            column_value="value",
                                            default_fc_parameters=MinimalFCParameters())
Finally, I try to categorize it. This is the step that takes absolutely forever, and I'm wondering whether it is possible to speed it up.
features = features.categorize(columns=["variable"])
After that I intend to reset the index and pivot the table; presumably this will take forever as well.
features = features.reset_index(drop=True)
feature_table = features.pivot_table(index="id",
                                     columns="variable",
                                     values="value",
                                     aggfunc="sum")
Not to mention the actual computation...
df_features = feature_table.compute()
Again - is there any way I can set up my Dask to allow for faster computation? My computer has 16GB of memory. Thank you.

Related

Reading partitioned data (parquets) using dask with 'int64' vs 'int64 not null'

I have this annoying situation where some of my parquet files have:
x: int64
and others have
x: int64 not null
and ergo (in dask 2.8.0/numpy 1.15.1/pandas 0.25.3) I can't run the following:
test: Union[pd.Series, pd.DataFrame, np.ndarray] = dd.read_parquet(input_path).query(filter_string)[input_columns].compute()
Anyone know what I can do short of upgrading dask/numpy (as I know the latest dask/numpy seem to work)?
Thanks in advance!
If you know which files contain the different dtypes, then it's best to re-process them (load/convert dtype/save).
If that's not an option, then you can create a dask dataframe from delayed objects with something like this:
import pandas as pd
from dask import delayed
import dask.dataframe as dd
@delayed
def custom_load(fpath):
    df = pd.read_parquet(fpath)
    df = df.astype({'x': 'Int64'})  # the appropriate (nullable integer) dtype
    return df

delayed_dfs = [custom_load(f) for f in files]  # where files is the list of files
ddf = dd.from_delayed(delayed_dfs)  # can also provide the meta option if known
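For completeness, the re-processing route mentioned first could be a simple loop along these lines (a rough sketch, assuming files is the same list of offending parquet paths):
import pandas as pd

# rewrite each file so all of them share a consistent nullable integer dtype
for fpath in files:
    df = pd.read_parquet(fpath)
    df = df.astype({'x': 'Int64'})  # pandas nullable integer dtype
    df.to_parquet(fpath)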

Fastest way to read large text files into a Pandas Dataframe

I have several large files (> 4 gb each). Some of them are in a fixed width format and some are pipe delimited. The files have both numeric and text data. Currently I am using the following approach:
df1 = pd.read_fwf(fwFileName, widths = [2, 3, 5, 2, 16],
                  names = columnNames, dtype = columnTypes,
                  skiprows = 1, engine = 'c',
                  keep_default_na = False)
df2 = pd.read_csv(pdFileName, sep = '|', names = columnNames,
                  dtype = columnTypes, usecols = colNumbers,
                  skiprows = 1, engine = 'c',
                  keep_default_na = False)
However, this seems to be slower than for example R's read_fwf (from readr) and fread (from data.table). Can I use some other methods that will help speed up reading these files?
I am working on a big server with several cores so memory is not an issue. I can safely load the whole files into memory. Maybe they are the same thing in this case but my goal is to optimize on time and not on resources.
Update
Based on the comments so far here are a few additional details about the data and my ultimate goal.
These files are compressed (fixed width are zip and pipe delimited are gzip). Therefore, I am not sure if things like Dask will add value for loading. Will they?
After loading these files, I plan to apply a computationally expensive function to groups of the data, so I need the whole dataset. The data is sorted by group, i.e. the first x rows are group 1, the next y rows are group 2, and so on. Would forming the groups on the fly be more productive? Is there an efficient way of doing that, given that I don't know how many rows to expect for each group?
Since we are using time as the metric here, memory size is not the main factor to look at. On the contrary, methods that use lazy loading (less memory, objects are only loaded when needed) are much faster than loading all the data into memory at once. You can check out Dask, which provides such a lazy read function: https://dask.org/
import time
import dask.dataframe as dd

start_time = time.time()
data = dd.read_csv('rg.csv')
duration = time.time() - start_time
print(f"Time taken {duration} seconds")  # less than a second
But as I said, this won't load the data into memory; it only loads portions of the data when needed. You can, however, load it fully into memory using:
data.compute()
If you want to load things into memory faster, you need good computing capabilities in your server. A good candidate that can benefit from such capabilities is ParaText: https://github.com/wiseio/paratext
You can benchmark ParaText against pandas read_csv using the following code:
import time
import paratext
start_time = time.time()
df = paratext.load_csv_to_pandas("rg.csv")
duration = time.time() - start_time
print(f"Time taken {duration} seconds")
import time
import pandas as pd
start_time = time.time()
df = pd.read_csv("rg.csv")
duration = time.time() - start_time
print(f"Time taken {duration} seconds")
Please note that results may be worse if you don't have enough compute power to support ParaText.
You can check out the benchmarks for ParaText loading large files here:
https://deads.gitbooks.io/paratext-bench/content/results_csv_throughput.html

How to remedy excessive hard disk usage (>>100GB) by Dask Dataframe when shuffling data

I need to calculate statistics per segment of large (15 - 20 GB) CSV files. This I do with groupby() in Dask Dataframe.
The problem is that I need custom functions, because I need kurtosis and skew, which are not part of Dask. Therefore I use groupby().apply(). However, this makes Dask use tremendous amounts of disk drive space in my Temp directory: more than 150 GB just running the script once! This causes my hard drive to run out of space, making the script crash.
Is there a way to rewrite the code which makes it avoid writing such an enormous amount of junk to my Temp directory?
Example code is given below:
Example 1 runs relatively fast, and doesn't generate tons of Temp output, but it doesn't support kurtosis or skew.
Example 2 calculates also kurtosis and skew, but fills up my hard disk if I run it for the full dataset.
Any help would be appreciated!
By the way: this page (https://docs.dask.org/en/latest/dataframe-groupby.html) suggests using an indexed column for the groupby(). Unfortunately, multi-indexing is not supported by Dask Dataframe, so that does not solve my problem.
import pandas as pd
import numpy as np
import scipy.stats as sps
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

ddf = dd.read_csv('18_GB_csv_file.csv')

segmentations = { 'seg1' : ['col1', 'col2'],
                  'seg2' : ['col1', 'col2', 'col3', 'col4'],
                  'seg3' : ['col3', 'col4'],
                  'seg4' : ['col1', 'col2', 'col5']
                }
data_cols = [ 'datacol1', 'datacol2', 'datacol3' ]

# Example 1: this runs fast and doesn't generate needless temp output.
# But it does not support "kurt" or "skew":
dd_comp = {}
for seg_group, seg_cols in segmentations.items():
    df_grouped = ddf.groupby(seg_cols)[data_cols]
    dd_comp[seg_group] = df_grouped.aggregate(['mean', 'std', 'min', 'max'])

with ProgressBar():
    segmented_stats = dd.compute(dd_comp)

# Example 2: includes also "kurt" and "skew". But it is painfully slow
# and generates >150 GB of Temp output before running out of disk space
empty_segment = pd.DataFrame(index=data_cols,
                             columns=['mean', 'std',
                                      'min', 'max', 'kurt', 'skew'])

def segment_statistics(segment):
    stats = empty_segment.copy()
    for col in data_cols:
        stats.loc[col, 'mean'] = np.mean(segment[col])
        stats.loc[col, 'std'] = np.std(segment[col])
        stats.loc[col, 'min'] = np.min(segment[col])
        stats.loc[col, 'max'] = np.max(segment[col])
        stats.loc[col, 'skew'] = sps.skew(segment[col])
        stats.loc[col, 'kurt'] = sps.kurtosis(segment[col]) + 3
    return stats

dd_comp = {}
for seg_group, seg_cols in segmentations.items():
    df_grouped = ddf.groupby(seg_cols)[data_cols]
    dd_comp[seg_group] = df_grouped.apply(segment_statistics,
                                          meta=empty_segment)

with ProgressBar():
    segmented_stats = dd.compute(dd_comp)
It sounds like you might benefit from custom aggregations: https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate
If you're able to come up with nice implementations for higher-order moments, those also sound like they would be good contributions to the project.
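As a rough sketch of what a custom aggregation can look like (using the grouping columns and data_cols from the question, with variance as a stand-in for the higher moments):
import dask.dataframe as dd

# collect count, sum and sum of squares per group within each partition,
# combine them across partitions, then finish with the population variance
custom_var = dd.Aggregation(
    name='custom_var',
    chunk=lambda s: (s.count(), s.sum(), s.apply(lambda x: (x ** 2).sum())),
    agg=lambda counts, sums, sq: (counts.sum(), sums.sum(), sq.sum()),
    finalize=lambda counts, sums, sq: sq / counts - (sums / counts) ** 2,
)

segment_variances = ddf.groupby(['col1', 'col2'])[data_cols].agg(custom_var)
The same count/sum/sum-of-powers pattern should extend to skew and kurtosis by also carrying third and fourth powers through chunk and agg, which avoids the groupby().apply() shuffle.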

Read Large data from database table in pandas or dask

I want to read all the data from a table with 10+ GB of data into a dataframe. When I try to read it with read_sql I get a memory overload error. I want to do some processing on that data and update the table with the new data. How can I do this efficiently? My PC has 26 GB of RAM and the data is at most 11 GB in size, yet I still get a memory overload error.
With Dask it takes a very long time. The code is below.
import dateparser
import dask.dataframe as dd
import numpy as np
df = dd.read_sql_table('fbo_xml_json_raw_data', index_col='id', uri='postgresql://postgres:passwordk@address:5432/database')
def make_year(data):
    if data and data.isdigit() and int(data) >= 0:
        data = '20' + data
    elif data and data.isdigit() and int(data) < 0:
        data = '19' + data
    return data

def response_date(data):
    if data and data.isdigit() and int(data[-2:]) >= 0:
        data = data[:-2] + '20' + data[-2:]
    elif data and data.isdigit() and int(data[-2:]) < 0:
        data = data[:-2] + '19' + data[-2:]
    if data and dateparser.parse(data):
        return dateparser.parse(data).date().strftime('%Y-%m-%d')

def parse_date(data):
    if data and dateparser.parse(data):
        return dateparser.parse(data).date().strftime('%Y-%m-%d')
df.ARCHDATE = df.ARCHDATE.apply(parse_date)
df.YEAR = df.YEAR.apply(make_year)
df.DATE = df.DATE + df.YEAR
df.DATE = df.DATE.apply(parse_date)
df.RESPDATE = df.RESPDATE.apply(response_date)
See here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html
See that chunksize arg? You can chunk your data so it fits into memory.
It will return a chunk reading object so you can apply operations iteratively over the chunks.
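As a rough sketch of that chunked pattern (the table name comes from the question; the connection string and chunk size are placeholders):
import pandas as pd
from sqlalchemy import create_engine

# with chunksize, read_sql returns an iterator of DataFrames instead of one large frame
engine = create_engine('postgresql://user:password@host:5432/database')

processed = []
for chunk in pd.read_sql('SELECT * FROM fbo_xml_json_raw_data', engine, chunksize=50000):
    # apply the per-column processing to each manageable chunk here
    processed.append(chunk)

df = pd.concat(processed, ignore_index=True)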
You could probably also incorporate multiprocessing. This adds a layer of complexity, since you're no longer working on the DataFrame itself but on an object containing chunks. Since you're using Dask this "should" apply; I'm not sure how Dask handles chunking, as it's been a while since I touched Pandas/Dask compatibility.
The main issue seems to be the exclusive use of pd.Series.apply. But apply is just a row-wise Python-level loop. It will be slow in Pandas and Dask. For performance-critical code, you should favour column-wise operations.
In fact, dask.dataframe supports a useful subset of the Pandas API. Here are a couple of examples:
Avoid string operations
Convert data to numeric types first; then perform vectorisable operations. For example:
df['YEAR'] = df['YEAR'].astype(int)
df['YEAR'] = df['YEAR'].mask(df['YEAR'] >= 0, df['YEAR'] + 2000)
df['YEAR'] = df['YEAR'].mask(df['YEAR'] < 0, df['YEAR'] + 1900)
Convert to datetime
If you have datetime strings in an appropriate format:
df['ARCHDATE'] = df['ARCHDATE'].astype('M8[us]')
See also "dask dataframe how to convert column to to_datetime".
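For reference, dask.dataframe also provides a vectorised to_datetime that mirrors pandas.to_datetime; a minimal sketch (errors='coerce' is an assumption about how unparseable values should be handled):
import dask.dataframe as dd

# vectorised alternative to the row-wise parse_date apply from the question
df['ARCHDATE'] = dd.to_datetime(df['ARCHDATE'], errors='coerce')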

Python PANDAS: Converting from pandas/numpy to dask dataframe/array

I am working to try to convert a program to be parallelizable/multithreaded with the excellent dask library. Here is the program I am working on converting:
Python PANDAS: Stack by Enumerated Date to Create Records Vectorized
import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask.array as da
from io import StringIO
test_data = '''id,transaction_dt,units,measures
1,2018-01-01,4,30.5
1,2018-01-03,4,26.3
2,2018-01-01,3,12.7
2,2018-01-03,3,8.8'''
df_test = pd.read_csv(StringIO(test_data), sep=',')
df_test['transaction_dt'] = pd.to_datetime(df_test['transaction_dt'])
df_test = df_test.loc[np.repeat(df_test.index, df_test['units'])]
df_test['transaction_dt'] += pd.to_timedelta(df_test.groupby(level=0).cumcount(), unit='d')
df_test = df_test.reset_index(drop=True)
expected results:
id,transaction_dt,measures
1,2018-01-01,30.5
1,2018-01-02,30.5
1,2018-01-03,30.5
1,2018-01-04,30.5
1,2018-01-03,26.3
1,2018-01-04,26.3
1,2018-01-05,26.3
1,2018-01-06,26.3
2,2018-01-01,12.7
2,2018-01-02,12.7
2,2018-01-03,12.7
2,2018-01-03,8.8
2,2018-01-04,8.8
2,2018-01-05,8.8
It occurred to me that this might be a good candidate to try to parallelize because the separate dask partitions should not need to know anything about each other to accomplish the required operations. Here is a naive representation of how I thought it might work:
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test = dd_test.loc[da.repeat(dd_test.index, dd_test['units'])]
dd_test['transaction_dt'] += dd_test.to_timedelta(dd.groupby(level=0).cumcount(), unit='d')
dd_test = dd_test.reset_index(drop=True)
So far I have been trying to work through the following errors and idiomatic differences:
1. "NotImplementedError: Only integer valued repeats supported." I have tried converting the index into an int column/array as well, but I still run into the issue.
2. Dask does not support the mutating operator "+=".
3. There is no dask .to_timedelta() equivalent.
4. There is no dask .cumcount() (but I think .cumsum() is interchangeable?).
If there are any dask experts out there who could let me know whether there are fundamental impediments that preclude me from trying this, or who have any tips on implementation, that would be a great help!
Edit:
I think I have made a bit of progress on this since posting the question:
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test['helper'] = 1
dd_test = dd_test.loc[da.repeat(dd_test.index, dd_test['units'])]
dd_test['transaction_dt'] = dd_test['transaction_dt'] + (dd_test.groupby('id')['helper'].cumsum()).astype('timedelta64[D]')
dd_test = dd_test.reset_index(drop=True)
However, I am still stuck on the dask array repeats error. Any tips still welcome.
Not sure if this is exactly what you are looking for, but I replaced the da.repeat with np.repeat, explicitly cast dd_test.index and dd_test['units'] to numpy arrays, and finally added dd_test['transaction_dt'].astype('M8[us]') to your timedelta calculation.
df_test = pd.read_csv(StringIO(test_data), sep=',')
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test['helper'] = 1
dd_test = dd_test.loc[np.repeat(np.array(dd_test.index),
                                np.array(dd_test['units']))]
dd_test['transaction_dt'] = dd_test['transaction_dt'].astype('M8[us]') + (dd_test.groupby('id')['helper'].cumsum()).astype('timedelta64[D]')
dd_test = dd_test.reset_index(drop=True)
df_expected = dd_test.compute()
