Fragmenting large data in Python: Pandas vs. Dask

What is the difference between using:
# Dask
import dask.bag as db

b = db.from_sequence(_query, npartitions=2)
df = b.to_dataframe()
df = df.compute()

# Pandas
import pandas as pd

df = pd.DataFrame(_query)
I want to choose the best option for fragmenting large amounts of data without losing performance.

As per Dask's best practices for dataframes (https://docs.dask.org/en/latest/dataframe-best-practices.html): for data that fits into RAM, use Pandas; it will probably be more efficient.
If you choose to use Dask, avoid very large partitions. If you change the partition count manually, take your available memory and number of cores into account. For instance, a machine with 100 GB of RAM and 10 cores would typically want partitions in the 1 GB range.
As of Dask 2.0.0 you can do that by using something like:
df.repartition(partition_size="100MB")
Another tip, if you choose to stick with Dask, is to set up a local client so you can take advantage of Dask Distributed (http://distributed.dask.org/en/latest/client.html). From there, avoid full-data shuffles and reduce as far as you can on the Dask side before computing to Pandas.
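If you do go the Dask route, a minimal sketch of that setup, building on the question's code (the partition size, 'key_col' and the aggregation are illustrative placeholders, not something from the question):
import dask.bag as db
from dask.distributed import Client

client = Client()  # local cluster; the diagnostics dashboard lives at client.dashboard_link

# Build the dataframe as in the question, then keep partitions a reasonable size.
b = db.from_sequence(_query, npartitions=2)
ddf = b.to_dataframe()
ddf = ddf.repartition(partition_size="100MB")

# Reduce on the Dask side first, then bring only the small result back to Pandas.
summary = ddf.groupby("key_col").key_col.count().compute()  # placeholder aggregation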

Related

What is the best way to read big data and pd.concat?

Suppose I have two very big HDF files and I am going to read them and concatenate them:
import pandas as pd

data = pd.concat([
    pd.read_hdf("file1.hdf", key='data'),
    pd.read_hdf("file2.hdf", key='data'),
])
Suppose every file takes 10 GB of memory; as we know, the above code will have a peak memory usage of 40 GB. But the problem is that my computer's memory is only 32 GB. I wonder if there is any good way to read them and concat in place so the peak memory usage would be 20 GB?
If you want to use Pandas, pass a chunksize parameter. This will create an iterator over your data that you can step through.
Alternatively, try PySpark or Dask. Dask is essentially pandas, but it lets you both parallelize your pipeline and avoid loading the entire dataset into memory.
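A rough sketch of the chunked approach with Pandas, assuming the HDF files were written in PyTables "table" format (required for chunked reads) and that writing the combined result back to disk is acceptable:
import pandas as pd

# Stream both files chunk by chunk and append to a combined store on disk,
# so peak memory is roughly one chunk rather than both files at once.
with pd.HDFStore("combined.hdf", mode="w") as out:
    for path in ["file1.hdf", "file2.hdf"]:
        for chunk in pd.read_hdf(path, key="data", chunksize=1_000_000):
            out.append("data", chunk, format="table")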

Reading large database table into Dask dataframe

I have a 7GB postgresql table which I want to read into python and do some analysis. I cannot use Pandas for it because it is larger than the memory on my local machine. I therefore wanted to try reading the table into Dask Dataframe first, perform some aggregation and switch back to Pandas for subsequent analysis. I used the below lines of code for that.
df = dd.read_sql_table('table_xyz', uri = "postgresql+psycopg2://user:pwd@remotehost/dbname", index_col = 'column_xyz', schema = 'private')
The index_col, i.e. 'column_xyz', is indexed in the database. This works, but when I perform an action, for example an aggregation, it takes ages (around an hour) to return the result.
avg = df.groupby("col1").col2.mean().compute()
I understand that Dask is not as fast as Pandas, especially when I am working on a single machine and not a cluster. I am wondering whether I am using Dask the right way. If not, what is a faster alternative for performing analysis on large tables that do not fit in memory, using Python?
If your data fits into the RAM of your machine then you're better off using Pandas; in such cases Dask will not outperform it.
Alternatively, you can play around with the chunk size and see if things improve. The best way to figure this out is to look at the Dask diagnostics dashboard and see what is taking Dask so long. That will help you make a much more informed decision.
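A rough sketch of what that tuning could look like (the bytes_per_chunk value is only an illustrative guess, and the connection string mirrors the question):
import dask.dataframe as dd
from dask.distributed import Client

client = Client()             # local cluster with the diagnostics dashboard
print(client.dashboard_link)  # watch the task stream while the query runs

df = dd.read_sql_table(
    'table_xyz',
    "postgresql+psycopg2://user:pwd@remotehost/dbname",
    index_col='column_xyz',
    schema='private',
    bytes_per_chunk=256_000_000,  # ~256 MB per partition; tune and re-check the dashboard
)
avg = df.groupby("col1").col2.mean().compute()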

Dask: setting index on a big dataframe results in high disk space usage during processing

I am working with a large dataset (220,000,000 rows, ~25 GB as CSV) which is stored as a handful of CSV files.
I have already managed to read these CSVs with Dask and save the data as a Parquet file with the following:
import pandas as pd
from dask.distributed import Client
import dask.dataframe as dd

client = Client()

init_fields = {
    # definition of csv fields
}
raw_data_paths = [
    # filenames with their path
]
read_csv_kwargs = dict(
    sep=";",
    header=None,
    names=list(init_fields.keys()),
    dtype=init_fields,
    parse_dates=['date'],
)
ddf = dd.read_csv(
    raw_data_paths,
    **read_csv_kwargs,
)
ddf.to_parquet(persist_path / 'raw_data.parquet')
It works like a charm, and completes within minutes. I get a parquet file holding a Dask Dataframe with 455 partitions which I can totally use.
However, this dataframe consists of a huge list of client orders, which I would like to index by date for further processing.
When I try to run the code with the adjustment below:
ddf = dd.read_csv(
    raw_data_paths,
    **read_csv_kwargs,
).set_index('date')
ddf.to_parquet(persist_path / 'raw_data.parquet')
the processing gets really long, with 26,000+ tasks (I can understand that; that's a lot of data to sort), but workers start dying after a while from using too much memory.
With each worker death, some progress is lost and it seems the processing will never complete.
I have noticed that the worker deaths are related to the disk of my machine reaching its limit, and whenever a worker dies some space is freed. At the beginning of the processing, I have about 37 GB of free disk space.
I am quite new to Dask, so have a few questions about that:
Is setting the index before dumping to a Parquet file a good idea? I have several groupbys on date coming up in the next steps, and as per the Dask documentation, using this field as the index seemed like a good idea to me.
If I manage to set the index before dumping as a parquet file, will the parquet file be sorted and my further processing require no more shuffling?
Does the behaviour described above (high disk usage leading to memory errors) seem normal, or is something odd in my setup or use of Dask? Are there some parameters I could tweak?
Or do I really need more disk space, because sorting so much data requires it? What would be an estimate of the total disk space required?
Thanks in advance!
EDIT:
I finally managed to set the index by:
adding disk space on my machine
tweaking the client parameters to have more memory per worker
The parameters I used were:
client = Client(
    n_workers=1,
    threads_per_worker=8,
    processes=True,
    memory_limit='31GB'
)
I am less convinced now that disk usage was the root cause of my workers dying from lack of memory, because increasing disk space alone did not enable the processing to complete. It also required that the memory per worker be extended, which I achieved by creating a single worker with the whole memory of my machine.
However, I am quite surprised that so much memory was needed. I thought that one of the aims of Dask (and other big data tools) was to enable "out of core" processing. Am I doing something wrong here, or does setting an index require a large amount of memory no matter what?
Regards,
Here's how I understand things, but I might be missing some important points.
Let's start with a nice indexed dataset to have a reproducible example.
import dask
import dask.dataframe as dd
df = dask.datasets.timeseries(start='2000-01-01', end='2000-01-2', freq='2h', partition_freq='12h')
print(len(df), df.npartitions)
# 12 2
So we are dealing with a tiny dataset, just 12 rows, split across 2 partitions. Since this dataframe is indexed, merges on it will be very fast, because dask knows which partitions contain which (index) values.
%%time
_ = df.merge(df, how='outer', left_index=True, right_index=True).compute()
#CPU times: user 25.7 ms, sys: 4.23 ms, total: 29.9 ms
#Wall time: 27.7 ms
Now if we try to merge on a non-index column, dask will not know which partition contains which values, so it will have to exchange information and shuffle bits of data between workers.
%%time
_ = df.merge(df, how='outer', on=['name']).compute()
#CPU times: user 82.3 ms, sys: 8.19 ms, total: 90.4 ms
#Wall time: 85.4 ms
This might not seem like much on this small dataset, but compare it to the time pandas would take:
%%time
_ = df.compute().merge(df.compute(), how='outer', on=['name'])
#CPU times: user 18.9 ms, sys: 3.39 ms, total: 22.3 ms
#Wall time: 19.7 ms
Another way to see this is with the DAGs: compare the DAG for the merge on the indexed column to the DAG for the merge on the non-indexed column (both can be rendered with .visualize()). The first one is nicely parallel, while the second one (using the non-indexed column) is a lot more complex.
So what happens as the size of the data grows is that it becomes more and more expensive to perform operations on non-indexed columns. This is especially true for columns that contain many unique values (e.g. strings). You can experiment with increasing the number of partitions in the dataframe df constructed above, and you will observe how the non-indexed case becomes more and more complex, while the DAG for the indexed data remains scalable.
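For instance, a quick way to run that experiment (graphviz is needed for .visualize(), and the output file names are arbitrary):
import dask

# Rebuild the toy dataframe with more partitions, then render both merge DAGs.
df = dask.datasets.timeseries(start='2000-01-01', end='2000-01-31',
                              freq='2h', partition_freq='12h')
df.merge(df, how='outer', left_index=True, right_index=True).visualize('indexed_merge.png')
df.merge(df, how='outer', on=['name']).visualize('non_indexed_merge.png')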
Going back to your specific case, you are starting with a non-indexed dataframe, which after indexing is going to be a pretty complex entity. You can see the DAG for the indexed dataframe with .visualize(), and from experience I can guess it will not look pretty.
So when you are saving to Parquet (or initiating any other computation of the dataframe), workers begin to shuffle data around, which eats up memory quickly (especially if there are many columns and/or many partitions and/or columns with a lot of unique values). Once a worker's memory limit is approached, it starts spilling data to disk (if allowed to), which is why you were able to complete your task by increasing both memory and available disk space.
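If you want to make those two knobs explicit, a minimal sketch (the worker count, memory limit and spill path are placeholders to adapt to your machine):
import dask
from dask.distributed import Client

# Point worker spill files at a disk with plenty of free space.
dask.config.set({"temporary-directory": "/path/with/lots/of/space"})
# Give each worker an explicit memory budget; spilling to disk starts before the limit is hit.
client = Client(n_workers=2, threads_per_worker=4, memory_limit="16GB")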
In a situation where neither of those options is possible, you might need to implement a custom workflow that uses the delayed API (or futures for dynamic graphs), such that the workflow makes use of information that is not explicitly available to Dask. For example, if the original CSV files were partitioned by a column of interest, you might want to process them in separate batches (see the sketch below) rather than ingesting them into a single Dask dataframe and then indexing.
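A sketch of what such a batch-wise workflow could look like, reusing the read options from the question; the per-batch aggregation and the assumption that each CSV maps to a clean batch are hypothetical:
import pandas as pd
import dask

@dask.delayed
def process_batch(path):
    # Each file is processed independently; only its (small) result is kept.
    df = pd.read_csv(path, **read_csv_kwargs)
    return df.groupby("date").size()  # placeholder aggregation

results = dask.compute(*[process_batch(p) for p in raw_data_paths])
summary = pd.concat(results)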

Using set_index() on a Dask Dataframe and writing to parquet causes memory explosion

I have a large set of Parquet files that I am trying to sort on a column. Uncompressed, the data is around ~14 GB, so Dask seemed like the right tool for the job. All I'm doing with Dask is:
Reading the parquet files
Sorting on one of the columns (called "friend")
Writing as parquet files in a separate directory
I can't do this without the Dask process (there's just one; I'm using the synchronous scheduler) running out of memory and getting killed. This surprises me, because no single partition is more than ~300 MB uncompressed.
I've written a little script to profile Dask with progressively larger portions of my dataset, and I've noticed that Dask's memory consumption scales with the size of the input. Here's the script:
import os
import dask
import dask.dataframe as dd
from dask.diagnostics import ResourceProfiler, ProgressBar

def run(input_path, output_path, input_limit):
    dask.config.set(scheduler="synchronous")
    filenames = os.listdir(input_path)
    full_filenames = [os.path.join(input_path, f) for f in filenames]
    rprof = ResourceProfiler()
    with rprof, ProgressBar():
        df = dd.read_parquet(full_filenames[:input_limit])
        df = df.set_index("friend")
        df.to_parquet(output_path)
    rprof.visualize(file_path=f"profiles/input-limit-{input_limit}.html")
The resource profiles produced by the visualize() call (for input limits of 2, 4, 8 and 16 files) show memory usage growing steadily with the number of input files.
The full dataset is ~50 input files, so at this rate of growth I'm not surprised that the job eats up all of the memory on my 32 GB machine.
My understanding is that the whole point of Dask is to allow you to operate on larger-than-memory datasets. I get the impression that people are using Dask to process datasets far larger than my ~14 GB one. How do they avoid this issue with scaling memory consumption? What am I doing wrong here?
I'm not interested in using a different scheduler or in parallelism at this point. I'd just like to know why Dask is consuming so much more memory than I would have thought necessary.
This turns out to have been a performance regression in Dask that was fixed in the 2021.03.0 release.
See the corresponding GitHub issue for more info.
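A quick way to check whether your installation is affected (the upgrade command assumes a pip-managed environment):
import dask
print(dask.__version__)
# Anything older than 2021.03.0 still has the set_index regression, e.g. fix with:
#   pip install --upgrade "dask[dataframe]>=2021.3.0"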

Why does running compute() on a filtered Dask dataframe take so long?

I'm reading in data using this:
ddf1 = dd.read_sql_table('mytable', conn_string, index_col='id', npartitions=8)
Of course, this runs instantaneously due to lazy computation. This table has several hundred million rows.
Next, I want to filter this Dask dataframe:
ddf2 = ddf1.query('some_col == "converted"')
Finally, I want to convert this to a Pandas dataframe. The result should only be about 8000 rows:
ddf3 = ddf2.compute()
However, this is taking very long (~1 hour). Can I get any advice on how to speed it up substantially? I've tried using .compute(scheduler='threads') and changing the number of partitions, but nothing has worked so far. What am I doing wrong?
Firstly, you may be able to use SQLAlchemy expression syntax to encode your filter clause in the query and do the filtering server-side. If data transfer is your bottleneck, then that is your best solution, especially if the filter column is indexed.
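A rough sketch of pushing the filter server-side (this assumes SQLAlchemy 1.4+ and a Dask version that provides dd.read_sql_query; the table and column names mirror the question):
import sqlalchemy as sa
import dask.dataframe as dd

engine = sa.create_engine(conn_string)
table = sa.Table('mytable', sa.MetaData(), autoload_with=engine)

# Only the ~8000 matching rows ever leave the database.
query = sa.select(table).where(table.c.some_col == 'converted')
ddf = dd.read_sql_query(query, conn_string, index_col='id', npartitions=1)
df = ddf.compute()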
Depending on your DB backend, sqlalchemy probably does not release the GIL, so your partitions cannot run in parallel in threads. All you are getting is contention between the threads and extra overhead. You should use the distributed scheduler with processes.
Of course, please look at your CPU and memory usage; with the distributed scheduler, you also have access to the diagnostic dashboard. You should also be concerned with how big each partition will be in memory.
