Avoiding Memory Issues For GroupBy on Large Pandas DataFrame - python

Update:
The pandas df was created like this:
df = pd.read_sql(query, engine)
encoded = pd.get_dummies(df, columns=['account'])
Creating a dask df from this df looks like this:
df = dd.from_pandas(encoded, 50)
Performing the operation with dask results in no visible progress being made (checking with dask diagnostics):
result = df.groupby('journal_entry').max().reset_index().compute()
Original:
I have a large pandas df with 2.7M rows and 4,000 columns. All but four of the columns are of dtype uint8. The uint8 columns only hold values of 1 or 0. I am attempting to perform this operation on the df:
result = df.groupby('id').max().reset_index()
Predictably, this operation immediately returns a memory error. My initial thought is to chunk the df both horizontally and vertically. However, this creates a messy situation, since the .max() needs to be performed across all the uint8 columns, not just a pair of columns. In addition, it is still extremely slow to chunk the df like this. I have 32 GB of RAM on my machine.
What strategy could mitigate the memory issue?

If you have any categorical columns in your data (rather than categories stored as object columns or strings), make sure you use the observed=True option in your groupby command. This ensures it only creates rows for combinations that are actually present, e.g. only one row per observed customer_id/order_id combination, rather than creating n_custs * n_orders rows!
I just did a groupby-sum on a 26M row dataset, never going above 7GB of RAM. Before adding the observed=True option, it was going up to 62GB and then running out.
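A minimal sketch of the pattern (the frame, column names, and values here are illustrative, not from the question):
import pandas as pd

df = pd.DataFrame({
    'customer_id': pd.Categorical(['a', 'a', 'b']),
    'order_id': pd.Categorical(['x', 'y', 'x']),
    'amount': [10, 20, 30],
})

# observed=True keeps only the category combinations that actually occur,
# instead of the full cartesian product of all customer_id x order_id categories
result = df.groupby(['customer_id', 'order_id'], observed=True)['amount'].sum().reset_index()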

You could use dask.dataframe for this task:
import dask.dataframe as dd

# from_pandas needs npartitions (or chunksize); the value here is just an example
df = dd.from_pandas(df, npartitions=32)
result = df.groupby('id').max().reset_index().compute()
All you need to do is convert your pandas.DataFrame into a dask.dataframe. Dask is a Python out-of-core parallelization framework that offers various parallelized container types, one of which is the dataframe. It lets you perform most common pandas.DataFrame operations in parallel and/or distributed with data that is too large to fit in memory. The core of dask is a set of schedulers and an API for building computation graphs, hence we have to call .compute() at the end for any computation to actually take place. The library is easy to install because it is written mostly in pure Python.

As an idea, you could split the data column-wise into, say, four subsets, keep the id column in each subset, perform the groupby/max on each subset, and then merge the results back together, as sketched below.
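A minimal sketch of that idea, assuming df and its id column from the question (the number of splits is arbitrary):
import numpy as np
import pandas as pd

value_cols = [c for c in df.columns if c != 'id']
pieces = []
for cols in np.array_split(value_cols, 4):
    # group only a slice of the columns at a time to keep peak memory low
    pieces.append(df[['id', *cols]].groupby('id').max())

# the partial results share the id index, so they can be concatenated column-wise
result = pd.concat(pieces, axis=1).reset_index()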

Related

How to reduce the amount of RAM used by pandas

I have a Raspberry Pi 3B+ with 1 GB of RAM on which I am running a Telegram bot.
This bot uses a database that I store in CSV format, which contains about 100k rows and four columns:
The first two are used for searching
The third is a result
These use about 20-30 MB of RAM, which is acceptable.
The last column is the real problem: it pushes RAM usage up to 180 MB, which is impossible to manage on the RPi. This column is also used for searching, but I only need it sometimes.
I started by loading the df with read_csv at the start of the script and letting the script poll, but as the database grows, I realized that this is too much for the RPi.
What do you think is the best way to handle this? Thanks!
Setting the dtype according to the data can reduce memory usage a lot.
With read_csv you can directly set the dtype for each column:
dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
Example:
df = pd.read_csv(my_csv, dtype={0:'int8',1:'float32',2:'category'}, header=None)
See the next section on some dtype examples (with an existing df).
To change that on an existing dataframe you can use astype on the respective column.
Use df.info() to check the df memory usage before and after the change.
Some examples:
# Columns with integer values
# check first that the min & max of the column fit the target int range so you don't lose info
df['column_with_integers'] = df['column_with_integers'].astype('int8')
# Columns with float values
# mind potential loss of calculation precision: don't go too low
df['column_with_floats'] = df['column_with_floats'].astype('float32')
# Columns with categorical values (strings)
# e.g. when the rows repeatedly contain the same strings,
# like 'TeamRed', 'TeamBlue', 'TeamYellow' spread over 10k rows
df['Team_Name'] = df['Team_Name'].astype('category')
# Change boolean-like string columns to actual booleans
df['Yes_or_No'] = df['Yes_or_No'].map({'yes': True, 'no': False})
Kudos to Medallion Data Science for his YouTube video Efficient Pandas Dataframes in Python - Make your code run fast with these tricks!, where I learned these tips.
Kudos to Robert Haas for the additional link in the comments to Pandas Scaling to large datasets - Use efficient datatypes.
I'm not sure if this is a good idea in this case, but it might be worth trying.
The Dask package was designed to allow pandas-like data analysis on dataframes that are too big to fit in memory (RAM), as well as other things. It does this by loading only chunks of the complete dataframe into memory at a time.
However, I'm not sure it was designed for machines like the Raspberry Pi (I'm not even sure there is a distribution for it).
The good thing is that Dask slots almost seamlessly into a pandas script, so it might not be too much effort to try it out.
Here is a simple example I made before:
dask-examples.ipynb
If you try it let me know if it works, I'm also interested.
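A minimal sketch of how that could look for this use case (the file name, column names, and dtypes are illustrative assumptions, not from the question):
import dask.dataframe as dd

# lazy read with explicit dtypes so each partition stays small
ddf = dd.read_csv(
    'database.csv',
    dtype={'key1': 'category', 'key2': 'category', 'result': 'float32', 'big_text': 'object'},
)

# filter first, compute last, so only the matching rows are materialized in RAM
match = ddf[(ddf.key1 == 'foo') & (ddf.key2 == 'bar')].compute()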

Using dask to work chunkwise with smaller pandas dfs breaks memory limits

I have the following problem:
The downside:
I have a large amount of data that does not fit into a pandas df in memory.
The upside:
The data is mostly independent. The only restriction is that all elements with the same id have to be calculated together in one chunk.
What I tried:
Dask looked like a perfect fit for this problem.
I have a dask Kubernetes cluster with multiple workers and had no problem loading the data from a SQL database into a dask df.
The calculation itself is not that easy to implement in dask because some functions are missing or problematic in dask (e.g. MultiIndex, pivot). However, because the data is mostly independent, I tried to do the calculations chunkwise in pandas. When I call .set_index("id") on the dask df, all equal ids end up in the same partition. I then wanted to iterate over the partitions, do the calculations on a (temporary) pandas df, and store the result right away.
The code basically looks like this:
from dask import dataframe as dd
from distributed import Client

client = Client("kubernetes address")

# Loading:
futures = []
for i in range(x):
    df_chunk = load_from_sql(...)  # pseudocode: load one chunk from the SQL database
    future = client.scatter(df_chunk)
    futures.append(future)

dask_df = dd.from_delayed(futures, meta=df_chunk)
dask_df = dask_df.set_index("id")
dask_df.map_partitions(lambda part: calculation(part)).compute()
where
def calculation(part):
    part = ...  # do something in pandas
    part.to_csv(...)  # or part.to_sql(...): store the data somewhere
    del part  # or client.cancel(part): release the memory of the temporary pandas df
With small amounts of data this runs smoothly, but with a lot of data the workers' memory fills up and the process stops with a cancelled error.
What am I doing wrong, or are there alternatives in dask for working memory-efficiently with chunkwise data from a dask df?
Loading the data directly in chunks from the database is currently not an option.

Iterating Dask Dataframe

I'm trying to create a Keras Tokenizer out of a single column from hundreds of large CSV files. Dask seems like a good tool for this. My current approach eventually causes memory issues:
df = dd.read_csv('data/*.csv', usecols=['MyCol'])
# Process column and get underlying Numpy array.
# This greatly reduces memory consumption, but eventually materializes
# the entire dataset into memory
my_ids = df.MyCol.apply(process_my_col).compute().values
tokenizer = Tokenizer()
tokenizer.fit_on_texts(my_ids)
How can I do this by parts? Something along the lines of:
df = pd.read_csv('a-single-file.csv', chunksize=1000)
for chunk in df:
    # Process a chunk at a time
    ...
A Dask DataFrame is technically a set of pandas dataframes, called partitions. When you get the underlying numpy array you destroy the partitioning structure and end up with one big array. I recommend using the map_partitions function of Dask DataFrames to apply regular pandas functions on each partition separately.
I also recommend map_partitions when it suits your problem. However, if you really just want sequential access and an API similar to read_csv(chunksize=...), then you might be looking for the partitions attribute:
for part in df.partitions:
    process(model, part.compute())
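A minimal sketch of that loop feeding the Keras Tokenizer incrementally (it reuses process_my_col and the column name from the question and relies on fit_on_texts accumulating word counts across repeated calls):
import dask.dataframe as dd
from tensorflow.keras.preprocessing.text import Tokenizer

df = dd.read_csv('data/*.csv', usecols=['MyCol'])

tokenizer = Tokenizer()
for part in df.partitions:
    # materialize one partition at a time as a pandas Series
    texts = part['MyCol'].compute().apply(process_my_col)
    # repeated fit_on_texts calls extend the same vocabulary
    tokenizer.fit_on_texts(texts)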

What is an efficient way to use groupby & apply a custom function for a huge dataset and avoid shuffling?

I am trying to use groupby and apply a custom function on a huge dataset, which is giving me memory errors, and the workers are getting killed because of the shuffling. How can I avoid the shuffle and do this efficiently?
I am reading around fifty 700 MB (each) parquet files, and the data in those files is isolated, i.e. no group exists in more than one file. If I try running my code on one file it works fine, but it fails when I run it on the complete dataset.
The Dask documentation talks about problems with groupby when you apply a custom function on groups, but it does not offer a solution for such data:
http://docs.dask.org/en/latest/dataframe-groupby.html#difficult-cases
How can I process my dataset in a reasonable timeframe (it takes around 6 minutes for a groupby-apply on a single file), and hopefully avoid the shuffle? I do not need my results to be sorted, or groupby trying to sort my complete dataset across the different files.
I have tried using persist, but the data does not fit into RAM (32 GB). Dask does not support a multi-column index, but I tried adding an index on one column to support the groupby, to no avail. Below is what the structure of the code looks like:
import pandas as pd
from dask.dataframe import read_parquet

df = read_parquet('s3://s3_directory_path')
results = df.groupby(['A', 'B']).apply(custom_function).compute()

# custom_function sorts the data within a group (the groups are small, fewer than 50 entries)
# on a field and computes some values based on heuristics (it computes 4 values, but I am
# showing 1 in the example below; the other 3 calculations are similar)
def custom_function(group):
    results = {}
    sorted_group = group.sort_values(['C']).reset_index(drop=True)
    sorted_group['delta'] = sorted_group['D'].diff()
    sorted_group.delta = sorted_group.delta.shift(-1)
    results['res1'] = (sorted_group[sorted_group.delta < -100]['D'].sum() - sorted_group.iloc[0]['D'])
    # similarly, 3 more results are generated
    results_df = pd.DataFrame(results, index=[0])
    return results_df
One possibility is to process one file at a time and do it multiple times, but in that case dask seems useless (no parallel processing) and it will take hours to achieve the desired results. Is there any way to do this efficiently using dask, or any other library? How do people deal with such data?
If you want to avoid shuffling and can promise that groups are well isolated, then you could just call a pandas groupby-apply across every partition with map_partitions:
df.map_partitions(lambda part: part.groupby(...).apply(...))
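A slightly fuller sketch for this question, assuming no ('A', 'B') group is split across partitions and reusing custom_function from above (the meta dtypes are assumptions that should match custom_function's real output):
import pandas as pd
from dask.dataframe import read_parquet

df = read_parquet('s3://s3_directory_path')

def per_partition(part):
    # a plain pandas groupby-apply inside each partition, so no cluster-wide shuffle
    out = part.groupby(['A', 'B']).apply(custom_function)
    # flatten the group keys back into ordinary columns
    return out.reset_index(level=['A', 'B']).reset_index(drop=True)

# meta describes the output frame so dask does not have to guess it
meta = pd.DataFrame({'A': pd.Series(dtype='object'),
                     'B': pd.Series(dtype='object'),
                     'res1': pd.Series(dtype='float64')})

results = df.map_partitions(per_partition, meta=meta).compute()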

Is dask get_dummies supposed to work without persist?

I am doing a (binary) prediction exercise with 47 variables but many millions of rows, so pure pandas (or now pydaal/daal4py) is not enough to set up the data; I am adapting a dask example to run locally. I am not scheduling tasks on a cluster, but I hope to stream my data rather than hold it all in memory.
Whether for dask-xgboost or other dask-supported tools (in scikit-learn), I would need to convert my categorical variables into numerical values.
In Matthew Rocklin's exercise, this is done in this line:
df2 = dd.get_dummies(df.categorize()).persist()
where df is a dask DataFrame and dd is the dask DataFrame module imported.
However, I would hope to avoid the persist() in an earlier line (df, is_delayed = dask.persist(df, is_delayed) in Matthew's example), as I cannot do this in memory. Maybe this is why I get errors when I try to use the resulting DataFrame, like this one:
Length mismatch: Expected axis has 39 elements, new values have 57 elements
How should one use dask.dataframe.get_dummies(), and in particular in a use case like mine?
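For reference, a minimal sketch of one common pattern: dd.get_dummies needs columns with known categories, and those categories can be declared up front instead of being discovered by persisting the whole frame (the column name and category list below are hypothetical):
import dask.dataframe as dd
from pandas.api.types import CategoricalDtype

# declare the categories explicitly so dask knows them without scanning the data
cat_type = CategoricalDtype(categories=['A', 'B', 'C'])
df['my_cat_col'] = df['my_cat_col'].astype(cat_type)

# with known categories, get_dummies can stay lazy; no persist() required
df2 = dd.get_dummies(df, columns=['my_cat_col'])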
