Parallelize pandas function pd.concat - python

I have a huge list of data frames called df_list (with some different and some common columns) which I wish to merge into one big data frame. I have tried the following:
all_dfs = pd.concat(df_list)
However, this takes too much time on a single core; I killed the script after 48 hours. How would you parallelize this process to use all my cores, or rewrite the code to make it faster?

pandas is not built for parallel processing.
The easiest way is to use third-party tools that are designed for huge data frames and can run the computation across multiple cores or nodes.
You can look at dask (it has an interface similar to pandas).
You can look at pyspark.
You can also use swifter to run processing on multiple cores.
There are probably other tools as well. In other words, in your case it is better to run the calculations on a cluster.
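If you try dask, a minimal sketch could look like this (assuming each DataFrame in df_list fits in memory on its own; df_list is the list from the question, and columns missing in some frames are filled with NaN, as with pd.concat):

import dask.dataframe as dd

# Wrap each pandas DataFrame in a one-partition Dask DataFrame, then
# concatenate them lazily into a single Dask DataFrame.
parts = [dd.from_pandas(df, npartitions=1) for df in df_list]
all_dfs = dd.concat(parts)

# Either materialize the result back into pandas...
result = all_dfs.compute()
# ...or, if the combined frame is too large for memory, write it out instead:
# all_dfs.to_parquet("all_dfs_parquet/")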

Related

Reading large database table into Dask dataframe

I have a 7GB postgresql table which I want to read into python and do some analysis. I cannot use Pandas for it because it is larger than the memory on my local machine. I therefore wanted to try reading the table into Dask Dataframe first, perform some aggregation and switch back to Pandas for subsequent analysis. I used the below lines of code for that.
df = dd.read_sql_table('table_xyz', uri = "postgresql+psycopg2://user:pwd@remotehost/dbname", index_col = 'column_xyz', schema = 'private')
The index_col, i.e. 'column_xyz', is indexed in the database. This works, but when I perform an action, for example an aggregation, it takes ages (like an hour) to return the result.
avg = df.groupby("col1").col2.mean().compute()
I understand that Dask is not as fast as Pandas, especially when I am working on a single machine rather than a cluster. I am wondering whether I am using Dask the right way. If not, what is a faster alternative for analyzing large tables that do not fit in memory using Python?
If your data fits into the RAM of your machine, then you're better off using Pandas; there are cases where Dask will not outperform it.
Alternatively, you can play around with the chunk size and see if things improve. The best way to figure this out is to look at the Dask diagnostics dashboard and see what is taking Dask so long. That will help you make a much more informed decision.
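As a rough illustration of both suggestions (the exact read_sql_table keyword arguments vary between Dask versions, so treat bytes_per_chunk here as an assumption to check against your version; the table and column names are taken from the question):

import dask.dataframe as dd
from dask.diagnostics import ProgressBar, Profiler, ResourceProfiler, visualize

# Read with an explicit partition size instead of the default.
df = dd.read_sql_table(
    'table_xyz',
    uri="postgresql+psycopg2://user:pwd@remotehost/dbname",
    index_col='column_xyz',
    schema='private',
    bytes_per_chunk="256 MiB",
)

# Profile the aggregation locally to see where the time goes.
with Profiler() as prof, ResourceProfiler() as rprof, ProgressBar():
    avg = df.groupby("col1").col2.mean().compute()

visualize([prof, rprof])  # renders an HTML report of task and resource usage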

I need to perform an "expensive" loop on a table with millions of rows. What technology (programming language or something else) should I use?

This question is a follow-up to this one: How to increase the performance of a Python loop?
Basically I have a script that takes as inputs a few csv files and after some data manipulation it outputs 2 csv files. In this script there is a loop on a table with ~14 million rows whose objective is to create another table with the same number of rows. I am working with Python on this project but the loop is just too slow (I know this because I used the tqdm package to measure speed).
So I’m looking for suggestions on what I should use in order to achieve my objective. Ideally the technology is free and doesn’t take long to learn. I have already received a few suggestions from other people: Cython and Power BI. The latter is paid and the former seems complicated, but I am willing to learn it if it is indeed useful.
If more details are necessary just ask. Thanks.
Read about vaex. Vaex can help you process your data much, much faster. You should first convert your csv file to HDF5 format using the vaex library; csv files are very slow to read and write.
Vaex will do the multiprocessing for your operations.
Also check whether you can vectorize your computation (you probably can). I glanced at your code: try to avoid using lists and use numpy arrays instead, if you can.
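A minimal sketch of that workflow (the file and column names here are placeholders, not taken from the question):

import vaex

# One-time conversion: CSV -> HDF5. convert=True writes an .hdf5 file next to
# the CSV and returns a memory-mapped dataframe.
df = vaex.from_csv("input.csv", convert=True)

# Later runs can open the HDF5 file directly and work out-of-core,
# with expressions evaluated lazily across all CPU cores.
df = vaex.open("input.csv.hdf5")
df["new_col"] = df["a"] * 2 + df["b"]   # vectorized, multithreaded
df.export_csv("output.csv")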
If you're willing to stay with Python, I would probably recommend using the multiprocessing module. Corey Schafer has a good tutorial on how it works here.
Multiprocessing is a bit like threading, but it uses multiple interpreter processes to complete the main task, unlike the threading module, which switches between threads inside a single interpreter.
Divide up the work with however many cores you have on your CPU with this:
import os
cores = os.cpu_count()
This should speed up the workload by spreading the work across all of your machine's cores.
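A rough sketch of that idea, splitting the dataframe into one chunk per core (process_chunk and the file names are placeholders standing in for the per-row logic from your script):

import os
import multiprocessing as mp
import numpy as np
import pandas as pd

def process_chunk(chunk):
    # Placeholder for the row-wise transformation from the question.
    return chunk

if __name__ == "__main__":
    df = pd.read_csv("big_input.csv")        # placeholder file name
    cores = os.cpu_count()                   # number of worker processes to start
    chunks = np.array_split(df, cores)       # one roughly equal chunk per core

    with mp.Pool(processes=cores) as pool:   # each chunk runs in its own process
        results = pool.map(process_chunk, chunks)

    pd.concat(results).to_csv("output.csv", index=False)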
14 million rows is very achievable with python, but you can't do it with inefficient looping methods. I had a glance at the code you posted here, and saw that you're using iterrows(). iterrows() is fine for small dataframes, but it is (as you know) painfully slow when used on dataframes the size of yours. Instead, I suggest you start by looking into the apply() method (see docs here). That should get you up to speed!
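For illustration, with made-up column names, the progression from iterrows() to apply() (and, where the operation allows it, to a fully vectorized expression) looks like this:

import pandas as pd

df = pd.DataFrame({"a": range(5), "b": range(5)})

# Slow: building the result row by row with iterrows()
out = [row["a"] + row["b"] for _, row in df.iterrows()]

# Better: apply() over rows
out = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Best, when possible: a fully vectorized expression
out = df["a"] + df["b"]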

What explains the performance difference in Dask between DataFrame.assign(**kwargs) and dd[x]=y when appending multiple columns?

While migrating some code from Pandas to Dask I found an enormous performance difference between modifying a Dask dataframe by calling DataFrame.assign() with multiple columns vs modifying it with multiple DataFrame.__setitem__() (aka dataframe[x]=y) calls.
With imports
import pandas, dask, cProfile
For a Dask dataframe defined as:
dd = dask.dataframe.from_pandas(pandas.DataFrame({'a':[1]}), npartitions=1)
cProfile.run('for i in range(100): dd["c"+str(i)]=dd["a"]+i')
takes 1.436 seconds
while
cProfile.run('dd.assign(**{"c"+str(i):dd["a"]+i for i in range(100)})')
only takes 0.211 seconds. A 6.8X difference.
I have already tried looking at the stats with pyprof2calltree but couldn't make sense of it.
What explains this difference? And more importantly, is there any way to get the assign performance without having to refactor code that is using dd[x]=y repeatedly?
This may not matter or happen for large datasets, I haven't checked, but it does for a single row (why I care about Dask being fast for single rows is a separate topic).
For context, there is a difference in Pandas too but it is a lot smaller:
df = pandas.DataFrame({'a':[1]})
cProfile.run('for i in range (100): df["c"+str(i)]=df["a"]+i')
takes 0.116 seconds.
cProfile.run('df.assign(**{"c"+str(i):df["a"]+i for i in range(100)})')
takes 0.096 seconds. Just 1.2X.
Two main reasons:
The for loop generates a larger task graph (one new layer per item in the loop), compared to the single additional task from the assign.
DataFrame.__setitem__ is actually implemented in terms of assign: https://github.com/dask/dask/blob/366c7998398bc778c4aa5f4b6bb22c25b584fbc1/dask/dataframe/core.py#L3424-L3432, so you end up calling the same code, just many more times. Each assign is associated with a copy in pandas.
I have already tried looking at the stats with pyprof2calltree but couldn't make sense of it.
Profilers like this (built on cProfile) aren't well suited for profiling parallel code like Dask.
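For reference, a minimal sketch of the pattern the answer implies: build the new columns first, then add them with a single assign (using the toy dataframe from the question):

import pandas
import dask.dataframe

ddf = dask.dataframe.from_pandas(pandas.DataFrame({'a': [1]}), npartitions=1)

# Collect all new columns, then add them in one call: one extra graph layer
# instead of one hundred.
new_cols = {"c" + str(i): ddf["a"] + i for i in range(100)}
ddf = ddf.assign(**new_cols)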

Why does running compute() on a filtered Dask dataframe take so long?

I'm reading in data using this:
ddf1 = dd.read_sql_table('mytable', conn_string, index_col='id', npartitions=8)
Of course, this runs instantaneously due to lazy computation. This table has several hundred million rows.
Next, I want to filter this Dask dataframe:
ddf2 = ddf1.query('some_col == "converted"')
Finally, I want to convert this to a Pandas dataframe. The result should only be about 8000 rows:
ddf3 = ddf2.compute()
However, this is taking very long (~1 hour). Can I get any advice on how to substantially speed this up? I've tried using .compute(scheduler='threads'), changing up the number of partitions, but none have worked so far. What am I doing wrong?
Firstly, you may be able to use sqlalchemy expression syntax to encode your filter clause in the query and do the filtering server-side. If data transfer is your bottleneck, then that is your best solution, especially if the filter column is indexed.
Depending on your DB backend, sqlalchemy probably does not release the GIL, so your partitions cannot run in parallel in threads. All you are getting is contention between the threads and extra overhead. You should use the distributed scheduler with processes.
Of course, please look at your CPU and memory usage; with the distributed scheduler, you also have access to the diagnostic dashboard. You should also be concerned with how big each partition will be in memory.
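As a rough sketch of those two suggestions (conn_string, ddf2 and the column names are taken from the question; since the filtered result is only ~8000 rows, plain pandas can read it directly once the filter runs server-side):

import pandas as pd
import sqlalchemy as sa
from dask.distributed import Client

# Option 1: push the filter into SQL so only the ~8000 matching rows leave
# the database, then read them straight into pandas.
engine = sa.create_engine(conn_string)
df = pd.read_sql("SELECT * FROM mytable WHERE some_col = 'converted'", engine)

# Option 2: if you stay with Dask, use the distributed scheduler with worker
# processes so partitions are not serialized by the GIL.
client = Client(processes=True)
ddf3 = ddf2.compute()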

Problem running a Pandas UDF on a large dataset

I'm currently working on a project and I am having a hard time understanding how the Pandas UDF in PySpark works.
I have a Spark cluster with one master node (8 cores, 64GB of memory) and two workers (16 cores, 112GB each). My dataset is quite large and divided into seven principal partitions, each consisting of ~78M rows. The dataset has 70 columns.
I defined a Pandas UDF to do some operations on the dataset that can only be done using Python, on a Pandas dataframe.
The Pandas UDF is defined this way:
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def operation(pdf):
    # Some operations
    return pdf

spark.table("my_dataset").groupBy(partition_cols).apply(operation)
There is absolutely no way to get the Pandas UDF to work as it crashes before even doing the operations. I suspect there is an OOM error somewhere. The code above runs for a few minutes before crashing with an error code stating that the connection has reset.
However, if I call the .toPandas() function after filtering on one partition and then display it, it runs fine, with no error. The error seems to happen only when using a PandasUDF.
I fail to understand how it works. Does Spark try to convert one whole partition at once (78M lines) ? If so, what memory does it use ? The driver memory ? The executor's ? If it's on the driver's, is all Python code executed on it ?
The cluster is configured with the following :
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=64g
spark.executor.cores 2
spark.executor.memory 30g (to allow memory for the python instance)
spark.driver.memory 43g
Am I missing something or is there just no way to run 78M lines through a PandasUDF ?
Does Spark try to convert one whole partition at once (78M lines)?
That's exactly what happens. Spark 3.0 adds support for chunked UDFs, which operate on iterators of Pandas DataFrames or Series, but if your operations really can only be done using Python on a full Pandas dataframe, these might not be the right choice for you.
If so, what memory does it use? The driver memory? The executor's?
Each partition is processed locally, on the respective executor, and data is passed to and from Python worker, using Arrow streaming.
Am I missing something or is there just no way to run 78M lines through a PandasUDF?
As long as you have enough memory to handle the Arrow input and output (especially if the data is copied), auxiliary data structures, and the JVM overhead, it should handle large datasets just fine.
But on such a tiny cluster, you'd be better off partitioning the output and reading the data directly with Pandas, without using Spark at all. This way you'd be able to use all the available resources (i.e. > 100GB per interpreter) for data processing instead of wasting them on secondary tasks (having only 16GB minus overhead per interpreter).
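For reference, a hedged sketch of the Spark 3.0 chunked approach mentioned above: mapInPandas feeds each partition to Python as an iterator of smaller pandas DataFrames, so a full 78M-row frame is never materialized at once. Note it works on chunks of a partition rather than on whole groups, which is exactly the caveat above; the schema and per-chunk logic here are placeholders.

from typing import Iterator
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def operation(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in batches:          # each pdf is a modest-sized Arrow batch, not a full partition
        # some pandas-only operations on the chunk
        yield pdf

df = spark.table("my_dataset")
result = df.mapInPandas(operation, schema=df.schema)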
To answer the general question about using a Pandas UDF on a large pyspark dataframe:
If you're getting out-of-memory errors such as java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space, and increasing memory limits hasn't worked, ensure that pyarrow is enabled. It is disabled by default.
In pyspark, you can enable it using:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
More info here.
