I have a simple function that takes four arguments, all data frames. Using those data frames it does pre-processing such as cleaning, filling in null values with machine learning, and manipulating the frames with group by, and in the end it returns two data frames.
Something like this:
def preprocess(df1, df2, df3, df4):
    """cleaning and manipulation of the data frames"""
    return clean, df_new
Since I am performing memory-intensive operations, I was wondering whether I could use Python's concurrent.futures functionality. I did simply try this style:
with concurrent.futures.ProcessPoolExecutor() as executor:
    executor.map(clean, df_new=preprocess(df1, df2, df3, df4))
I am getting an error saying that clean and df_new are not defined. I am a bit confused about how to use concurrent.futures on the preprocess function so that I can utilize all the cores of my laptop and speed this operation up. Any ideas/help is really appreciated.
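For reference, the usual pattern is to submit the function itself rather than its return values; a minimal sketch, assuming preprocess is defined at module level so worker processes can import it:

import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as executor:
    # submit() runs preprocess in a separate worker process;
    # result() blocks until the two data frames come back
    future = executor.submit(preprocess, df1, df2, df3, df4)
    clean, df_new = future.result()

Note that a single submit() like this only occupies one extra core; multiple cores only help when there are several independent calls to submit.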
Related
I have a function f whose input x is a large np.ndarray (length 20000).
Executing f takes very little time (about 5 ms).
A for loop over a matrix M with many rows
for x in M:
    f(x)
takes about 5 times longer than parallelizing using multiprocessing
import multiprocessing

with multiprocessing.Pool() as pool:
    pool.map(f, M)
I have tried to parallelize with dask, but it loses even against sequential execution. A related post is here, but the accepted answer doesn't work for me. I have tried many things, like partitioning the data as the best practices suggest, or using dask.bag. I'm running Dask on a local machine with 4 physical cores.
So the question is: how do you use dask with short tasks that take large data as input?
Firstly, the dask documentation makes clear the following contraindications:
it is a bad idea to create big data in the client and pass it to workers; you should have workers load the data they need
if the data you need fit into memory, the standard python tool (in this case numpy) probably works as well or better than dask
if you want to share memory and are running processes such as numpy that release the GIL, then you should prefer threads over processes.
dask multiprocessing should not generally be used if you can run distributed (i.e., always)
don't use a Python loop over an array; you should vectorize (a sketch follows this list)
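As an illustration of that last point, here is a sketch for the row loop shown above. Note that np.apply_along_axis still loops in Python internally; a truly vectorized f that accepts the whole matrix is better still:

import numpy as np

# apply f to every row of M without an explicit Python for-loop;
# this removes scheduler overhead but not the per-row Python call
results = np.apply_along_axis(f, 1, M)

# if f is built from numpy ufuncs, the ideal is one whole-matrix call:
# results = f(M)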
Since we don't know much about what you are doing or your system, I will offer a guess as to why dask is slower than multiprocessing here. When you use multiprocessing.Pool, the system probably created the processes via fork and copied (or copy-on-write duplicated) the array into each process, so they can access it. Dask requires threads and event loops to run, so it is not safe to use with fork. This means that when you want data in the client to be processed in a worker, it must be serialised and sent over IPC. This is very likely the cause of your slowdown.
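If you want to stay with dask anyway, one way to pay the transfer cost only once is to scatter the rows to the workers up front; a sketch, assuming a local distributed cluster:

from dask.distributed import Client

client = Client()                  # start a local cluster
rows = client.scatter(list(M))     # ship each row to the workers once
futures = client.map(f, rows)      # f runs where its row already lives
results = client.gather(futures)
client.close()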
Problem
I'd like to get the minimum value in a dataframe and apply a function to it.
I also would like to do this lazily. However, doing so appears to introduce a performance cost.
Example
I believe this example captures this behaviour:
import dask
import dask.dataframe as dd
# Sample data
df = dask.datasets.timeseries(end='2002-01-31')
# Sample function
def f(x):
    return 2 * x
task = df['id'].min()
f(task.compute()) # Takes ~1.6s on my machine
dask.delayed(f)(task).compute() # Takes ~3.5s on my machine
Why is the second computation taking longer? Can this be improved somehow?
Additional notes
Looking at the dashboard, it appears that making f delayed slows down the actual processing of the data. That is, the longer time is not caused by f itself becoming slow when wrapped in delayed.
If you persist the dataframe beforehand, the time for both tasks is equal.
The effect also appears when you use read_parquet to read the data.
Snakeviz
I have tried to visualize the tasks using snakeviz. I am showing only what I think are the important parts:
Without delayed: [snakeviz profile]
With delayed: [snakeviz profile]
Depending on the size of your input data, it seems to me that this may be related to hashing of the input to the delayed function. This is an additional operation on top of the actual calculation, which can make the whole thing slower if you have a lot of input data.
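If hashing is indeed the cost, one thing you could try is marking the delayed call as impure, so that dask generates a random key instead of tokenizing (hashing) the inputs; a sketch:

# pure=False skips deterministic hashing of the inputs,
# at the cost of losing caching/deduplication of repeated calls
dask.delayed(f, pure=False)(task).compute()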
I am using Dask for parallel computation and want to detect the language of the sentences in a column using langdetect. However, I am still not gaining any speed in getting the language of the rows in the column.
Below is my code:
import dask.dataframe as dd
import langdetect

data = dd.read_csv('name.csv')  # has a column called short_description

def some_fn(e):
    return e['short_description'].apply(langdetect.detect)

data['Language'] = data.map_partitions(some_fn, meta='string')  # add a new column called Language
This csv file has 800000 rows, each containing a sentence roughly 20 words long.
Any suggestions on how I can make the language detection faster? Currently it takes 2-3 hours.
By default dask dataframe uses a thread pool for processing. My guess is that your language detection algorithm is written in pure Python (rather than C/Cython like most of Pandas) and so is limited by the GIL. This means that you should use processes rather than threads. You can ask Dask to use processes by adding the scheduler="processes" keyword to any compute or persist call:
df.compute(scheduler="processes")
More information on Dask's different schedulers and when to use them is here: https://docs.dask.org/en/latest/scheduling.html
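Applied to the example above, that might look like this sketch; the config variant sets the scheduler for everything inside the block:

# per-call:
result = data.compute(scheduler="processes")

# or for a whole section of code:
import dask
with dask.config.set(scheduler="processes"):
    result = data.compute()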
I'm reading in data using this:
ddf1 = dd.read_sql_table('mytable', conn_string, index_col='id', npartitions=8)
Of course, this runs instantaneously due to lazy computation. This table has several hundred million rows.
Next, I want to filter this Dask dataframe:
ddf2 = ddf1.query('some_col == "converted"')
Finally, I want to convert this to a Pandas dataframe. The result should only be about 8000 rows:
ddf3 = ddf2.compute()
However, this is taking very long (~1 hour). Can I get any advice on how to speed this up substantially? I've tried using .compute(scheduler='threads') and changing the number of partitions, but nothing has worked so far. What am I doing wrong?
Firstly, you may be able to use sqlalchemy expression syntax to encode your filter clause in the query and do the filtering server-side. If data transfer is your bottleneck, then that is your best solution, especially if the filter column is indexed.
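A sketch of what that could look like, assuming a dask version that provides dd.read_sql_query and SQLAlchemy 1.4+ (table and column names are taken from the question):

import sqlalchemy as sa
import dask.dataframe as dd

engine = sa.create_engine(conn_string)
mytable = sa.Table('mytable', sa.MetaData(), autoload_with=engine)

# the WHERE clause is executed by the database, so only the ~8000
# matching rows ever cross the wire
query = sa.select(mytable).where(mytable.c.some_col == 'converted')
ddf = dd.read_sql_query(query, conn_string, index_col='id', npartitions=8)
df = ddf.compute()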
Depending on your DB backend, sqlalchemy probably does not release the GIL, so your partitions cannot run in parallel in threads. All you are getting is contention between the threads and extra overhead. You should use the distributed scheduler with processes.
Of course, please look at your CPU and memory usage; with the distributed scheduler, you also have access to the diagnostic dashboard. You should also be concerned with how big each partition will be in memory.
I have a huge list of data frames called df_list (with some different and some common columns) which I wish to merge into one big data frame. I have tried the following:
all_dfs = pd.concat(df_list)
However, this takes too much time on a single core; I killed the script after 48 hours. How would you parallelize this process to use all my cores, or rewrite the code to make it faster?
pandas is not about parallel processing.
The easiest way is to use third-party tools to process huge data frames; they can spread the computation over multiple cores or nodes.
You can look at dask (its interface is similar to pandas); a sketch follows below.
You can look at pyspark.
You can also use swifter to run processing on multiple cores.
There are probably some other tools... In other words, in your case it is better to run the calculations on a cluster.
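For the dask route, a minimal sketch (assuming the merged result still fits in memory; like pandas, dd.concat aligns the differing columns and fills the gaps with NaN):

import dask.dataframe as dd

# wrap each pandas frame as a single-partition dask frame
ddfs = [dd.from_pandas(df, npartitions=1) for df in df_list]

big = dd.concat(ddfs)      # lazy: only builds the task graph
all_dfs = big.compute()    # materialize as one pandas DataFrame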
Hope this helps.