I am doing a (binary) prediction exercise with 47 variables but many millions of rows, so pure pandas (or now pydaal/daal4py) is not enough to set up the data; I am adapting a dask example to run locally. I am not scheduling tasks on a cluster, but I hope to stream my data rather than hold it all in memory.
Whether for dask-xgboost or for other dask-supported tools (in scikit-learn), I would need to convert my categorical variables into numerical values.
In Matthew Rocklin's exercise, this is done in this line:
df2 = dd.get_dummies(df.categorize()).persist()
where df is a dask DataFrame and dd is the dask DataFrame module imported.
However, I would hope to avoid the persist() in an earlier line (df, is_delayed = dask.persist(df, is_delayed) in Matthew's example), as I cannot do this in memory. Maybe this is why I get errors when I try to use the resulting DataFrame, like this one:
Length mismatch: Expected axis has 39 elements, new values have 57 elements
How should one use dask.dataframe.get_dummies(), and in particular in a use case like mine?
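For reference, here is a minimal sketch of the pattern described above without the earlier dask.persist() step; the file path is a placeholder and this is only an illustration of the approach, not a verified fix for the error:
import dask.dataframe as dd

# Lazy, chunked read; nothing is held in memory yet
df = dd.read_csv('training-data-*.csv')
# categorize() scans the data once to learn the category levels,
# which get_dummies needs, but it does not persist the full frame
df = df.categorize()
df2 = dd.get_dummies(df)
# Stream the encoded result to disk instead of persisting it in memory
df2.to_parquet('encoded.parq')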
I have a Raspberry Pi 3B+ with 1 GB of RAM on which I am running a Telegram bot.
The bot uses a database that I store in CSV format; it contains about 100k rows and four columns:
The first two are for searching.
The third is a result.
Those use about 20-30 MB of RAM, which is manageable.
The last column is the real problem: it pushes RAM usage up to 180 MB, which is impossible to manage on the RPi. This column is also used for searching, but I only need it sometimes.
I started by loading the df with read_csv once at the start of the script and then letting the script poll, but as the database grows I have realized that this is too much for the RPi.
What do you think is the best way to do this? Thanks!
Setting the dtype according to the data can reduce memory usage a lot.
With read_csv you can directly set the dtype for each column:
dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
Example:
df = pd.read_csv(my_csv, dtype={0:'int8',1:'float32',2:'category'}, header=None)
See below for some dtype examples (with an existing df).
To change that on an existing dataframe you can use astype on the respective column.
Use df.info() to check the df memory usage before and after the change.
Some examples:
# Columns with integer values
# check first if the min & max of that column are within the respective int range so you don't lose info
df['column_with_integers'] = df['column_with_integers'].astype('int8')
# Columns with float values
# mind potential calculation precision = don't go too low
df['column_with_floats'] = df['column_with_floats'].astype('float32')
# Columns with categorical values (strings)
# e.g. when the rows repeatedly contain the same strings
# like 'TeamRed', 'TeamBlue', 'TeamYellow' spread over 10k rows
df['Team_Name'] = df['Team_Name'].astype('category')
# Change boolean-like string columns to actual booleans
df['Yes_or_No'] = df['Yes_or_No'].map({'yes':True, 'no':False})
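To verify the gain, a quick before/after check with df.info() (a sketch reusing the hypothetical column names above; memory_usage='deep' gives a more accurate accounting for object/string columns):
df.info(memory_usage='deep')   # footprint before the dtype changes
df['column_with_integers'] = df['column_with_integers'].astype('int8')
df['Team_Name'] = df['Team_Name'].astype('category')
df.info(memory_usage='deep')   # footprint after; should be noticeably smaller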
Kudos to Medallion Data Science and his YouTube video Efficient Pandas Dataframes in Python - Make your code run fast with these tricks!, where I learned these tips.
Kudos to Robert Haas for the additional link in the comments to Pandas Scaling to large datasets - Use efficient datatypes
Not sure if this is a good idea in this case but it might be worth trying.
The Dask package was designed to allow Pandas-like data analysis on dataframes that are too big to fit in memory (RAM) (as well as other things). It does this by only loading chunks of the complete dataframe into memory at a time.
However, I'm not sure it was designed for machines like the Raspberry Pi (I'm not even sure there is a distribution for it).
The good thing is Dask will slide seamlessly into your Pandas script so it might not be too much effort to try it out.
Here is a simple example I made before:
dask-examples.ipynb
If you try it let me know if it works, I'm also interested.
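For what it's worth, here is a rough sketch of what that could look like for the CSV described in the question; the file name, column names, and dtypes are placeholders, and I have not tested this on a Pi:
import dask.dataframe as dd

# Lazy, chunked read: only small blocks are loaded into memory at a time.
ddf = dd.read_csv(
    'database.csv',
    dtype={'key1': 'category', 'key2': 'category',
           'result': 'category', 'big_column': 'object'},
    blocksize='16MB',   # keep partitions small on a 1 GB machine
)

# Example lookup: filter on the two search columns,
# then materialize only the matching rows.
matches = ddf[(ddf['key1'] == 'foo') & (ddf['key2'] == 'bar')].compute()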
If I have a dataset with unknown divisions and would like to sort it according to a column and output to Parquet, it seems to me that Dask does at least some of the work twice:
import dask
import dask.dataframe as dd
def my_identity(x):
    """Does nothing, but shows up on the Dask dashboard"""
    return x
df = dask.datasets.timeseries()
df = df.map_partitions(my_identity)
df = df.set_index(['name']) # <- `my_identity` is calculated here, as well as other tasks
df.to_parquet('temp.parq') # <- previous tasks seem to be recalculated here
If my_identity was computationally demanding, then recomputing it would be really costly.
Am I correct in my understanding that Dask does some work twice here? Is there any way to prevent that?
The explanation below may not be accurate, but hopefully helps a bit.
Let's try to put ourselves in Dask's shoes here. We are asking Dask to create an index based on some variable... Dask only works with sorted indexes, so it needs to know how to rearrange the data to make it sorted, and also what the appropriate divisions for the partitions will be. The first calculation you see is doing exactly that, and Dask stores only the parts of the calculation necessary for the divisions/data reshuffling.
Then when we ask Dask to save the data, it computes the variables, shuffles the data (in line with the previous computations) and stores it in corresponding partitions.
How to avoid this? Possible options:
Persist before setting the index. Once you persist, Dask will compute the variable and keep the results on the workers, so setting the index will refer to the results of that computation (some reshuffling of the data will still be needed). Note that the documentation suggests persisting after setting the index, but that case assumes the column already exists (and does not require a separate computation). See the sketch after this list.
Sort within partitions; this can be done lazily, but of course it is only an option if you do not need a global sort.
Use plain pandas; this may require some manual chunking of the data (this is what I tend to use for sorting).
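For the first option, here is a minimal sketch based on the example in the question; it assumes the intermediate result fits in the workers' memory, since persist materializes it:
import dask

def my_identity(x):
    """Does nothing, but shows up on the Dask dashboard"""
    return x

df = dask.datasets.timeseries()
df = df.map_partitions(my_identity)
df = df.persist()               # my_identity runs once; results stay on the workers
df = df.set_index(['name'])     # reuses the persisted partitions for the shuffle
df.to_parquet('temp.parq')      # my_identity is not recomputed here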
I'm looking to apply the same function to multiple sub-combinations of a pandas DataFrame. Imagine the full DataFrame has 15 columns and I want to draw from it sub-frames containing 10 columns each; there would be 3,003 such sub-frames in total. My current approach uses multiprocessing, which works well for a full DataFrame with about 20 columns (184,756 combinations), but the real full frame has 50 columns, leading to more than 10 billion combinations, at which point it takes far too long. Is there any library that would be suitable for this type of calculation? I have used dask before and it's incredibly powerful, but dask seems only suitable for calculations on a single DataFrame, not on many different ones.
Thanks.
It's hard to answer this question without a MVCE. The best path forward depends on what you want to do with your 10 billion DataFrame combinations (write them to disk, train models, aggregations, etc.).
I'll provide some high level advice:
Using a columnar file format like Parquet allows for column pruning (grabbing certain columns rather than all of them), which can be memory efficient; see the sketch after this list
Persisting the entire DataFrame in memory with ddf.persist() may be a good way for you to handle this combinations problem so you're not constantly reloading it
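For instance, here is a rough sketch of the column-pruning idea, assuming the full frame has already been written to Parquet; the file name, the column names, and the process_subframe function are placeholders:
from itertools import combinations
import dask.dataframe as dd

all_columns = ['col{}'.format(i) for i in range(15)]   # placeholder column names

for subset in combinations(all_columns, 10):
    # Read only the 10 columns of this combination from the Parquet file;
    # the remaining columns are never loaded into memory.
    sub = dd.read_parquet('full_frame.parq', columns=list(subset))
    process_subframe(sub)   # placeholder for whatever you do with each sub-frame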
Feel free to add more detail about the problem and I can add a more detailed solution.
I am trying to use groupby and apply a custom function on a huge dataset, which is giving me memory errors, and the workers are getting killed because of the shuffling. How can I avoid the shuffle and do this efficiently?
I am reading around fifty 700MB (each) parquet files and the data in those files is isolated, i.e. no group exists in more than one file. If I try running my code on one file, it works fine but fails when I try to run on the complete dataset.
Dask documentation talks about problems with groupby when you apply a custom function on groups, but they do not offer a solution for such data:
http://docs.dask.org/en/latest/dataframe-groupby.html#difficult-cases
How can I process my dataset in a reasonable timeframe (it takes around 6 minutes for the groupby-apply on a single file), and hopefully avoid the shuffle? I do not need my results to be sorted, or groupby trying to sort my complete dataset across the different files.
I have tried using persist, but the data does not fit into RAM (32 GB). Dask does not support a multi-column index, but I tried adding an index on one column to support the groupby, to no avail. Below is what the structure of the code looks like:
import pandas as pd
from dask.dataframe import read_parquet

# custom_function sorts the data within a group (the groups are small, fewer than
# 50 entries) on a field and computes some values based on heuristics (it computes
# 4 values, but I am showing 1 in this example; the other 3 calculations are similar)
def custom_function(group):
    results = {}
    sorted_group = group.sort_values(['C']).reset_index(drop=True)
    sorted_group['delta'] = sorted_group['D'].diff()
    sorted_group.delta = sorted_group.delta.shift(-1)
    results['res1'] = (sorted_group[sorted_group.delta < -100]['D'].sum() - sorted_group.iloc[0]['D'])
    # similarly, 3 more results are generated
    results_df = pd.DataFrame(results, index=[0])
    return results_df

df = read_parquet('s3://s3_directory_path')
results = df.groupby(['A', 'B']).apply(custom_function).compute()
One possibility is that I process one file at a time and do it multiple times, but in that case dask seems useless (no parallel processing) and it will take hours to achieve the desired results. Is there any way to do this efficiently using dask, or any other library? How do people deal with such data?
If you want to avoid shuffling and can promise that groups are well isolated, then you could just call a pandas groupby-apply within every partition with map_partitions:
df.map_partitions(lambda part: part.groupby(...).apply(...))
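Spelled out against the code in the question, that could look roughly like this; passing meta is optional but spares Dask from inferring the output schema by calling the function on an empty frame (the dtype below is an assumption, and meta should list every column that custom_function returns):
import pandas as pd
from dask.dataframe import read_parquet

df = read_parquet('s3://s3_directory_path')

# Each parquet file / partition holds complete groups, so a plain pandas
# groupby-apply inside the partition gives the same result as a global
# groupby, without any shuffle.
results = df.map_partitions(
    lambda part: part.groupby(['A', 'B']).apply(custom_function),
    meta=pd.DataFrame({'res1': pd.Series(dtype='float64')}),
).compute()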
Update:
The pandas df was created like this:
df = pd.read_sql(query, engine)
encoded = pd.get_dummies(df, columns=['account'])
Creating a dask df from this df looks like this:
df = dd.from_pandas(encoded, 50)
Performing the operation with dask results in no visible progress being made (checking with dask diagnostics):
result = df.groupby('journal_entry').max().reset_index().compute()
Original:
I have a large pandas df with 2.7M rows and 4,000 columns. All but four of the columns are of dtype uint8. The uint8 columns only hold values of 1 or 0. I am attempting to perform this operation on the df:
result = df.groupby('id').max().reset_index()
Predictably, this operation immediately returns a memory error. My initial thought is to chunk the df both horizontally and vertically. However, this creates a messy situation, since the .max() needs to be performed across all the uint8 columns, not just a pair of columns. In addition, it is still extremely slow to chunk the df like this. I have 32 GB of RAM on my machine.
What strategy could mitigate the memory issue?
If you have any categorical columns in your data (rather than categories stored as object columns or strings), make sure you use the observed=True option in your groupby command. This makes sure it only creates lines where an entry is present, e.g. only one line per customer_id,order_id combination, rather than creating n_custs * n_orders lines!
I just did a groupby-sum on a 26M row dataset, never going above 7GB of RAM. Before adding the observed=True option, it was going up to 62GB and then running out.
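A tiny self-contained example of the difference (column names are made up):
import pandas as pd

df = pd.DataFrame({
    'customer_id': pd.Categorical(['a', 'a', 'b']),
    'order_id': pd.Categorical(['x', 'y', 'x']),
    'amount': [1.0, 2.0, 3.0],
})

# With observed=True only the (customer_id, order_id) combinations that actually
# occur in the data are created, instead of the full cross-product of all
# category levels (which is what blows up memory on wide categoricals).
result = df.groupby(['customer_id', 'order_id'], observed=True)['amount'].sum()
print(result)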
You could use dask.dataframe for this task:
import dask.dataframe as dd
df = dd.from_pandas(df, npartitions=8)  # from_pandas requires npartitions (or chunksize)
result = df.groupby('id').max().reset_index().compute()
All you need to do is convert your pandas.DataFrame into a dask.dataframe. Dask is a Python out-of-core parallelization framework that offers various parallelized container types, one of which is the dataframe. It lets you perform most common pandas.DataFrame operations in parallel and/or distributed with data that is too large to fit in memory. The core of dask is a set of schedulers and an API for building computation graphs, hence we have to call .compute() at the end in order for any computation to actually take place. The library is easy to install because it is written in pure Python for the most part.
As another idea, you could split the data column-wise into, say, four subsets, keep the id column in each subset, perform the operation on each subset, and then merge the results back together.
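A rough sketch of that idea, assuming df is the pandas DataFrame from the question and that four splits are enough to keep each piece manageable:
import numpy as np
import pandas as pd

value_cols = [c for c in df.columns if c != 'id']

partial_results = []
for chunk in np.array_split(value_cols, 4):          # four column-wise subsets
    part = df[['id'] + list(chunk)]                  # keep the id in every subset
    partial_results.append(part.groupby('id').max()) # aggregate each subset separately

# The partial results share the same 'id' index, so they can be glued back together.
result = pd.concat(partial_results, axis=1).reset_index()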