How to do faster language detection using Dask? - python

I am using Dask for parallel computation and want to detect the language of the sentences in a column using langdetect. However, I am not seeing any speed-up when computing the language of the rows in that column.
Below is my code:
import dask.dataframe as dd
import langdetect

data = dd.read_csv('name.csv')  # has a column called short_description

def some_fn(e):
    return e['short_description'].apply(langdetect.detect)

# adding a new column called Language
data['Language'] = data.map_partitions(some_fn, meta='string')
This CSV file has 800,000 rows, each containing a sentence of roughly 20 words.
Any suggestions on how I can make language detection faster? Currently it takes 2-3 hours.

By default Dask dataframe uses a thread pool for processing. My guess is that your language detection algorithm is written in pure Python (rather than C/Cython like most of Pandas), and so is limited by the GIL. This means that you should use processes rather than threads. You can ask Dask to use processes by adding the scheduler="processes" keyword to any compute or persist call:
df.compute(scheduler="processes")
More information on Dask's different schedulers and when to use them is here: https://docs.dask.org/en/latest/scheduling.html
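Applied to the code in the question, only the final step changes. A minimal sketch, using dask.config.set as an alternative to passing the keyword on every call (the output filename pattern here is made up):
import dask

# after building data['Language'] with map_partitions exactly as above:
with dask.config.set(scheduler="processes"):
    result = data.compute()  # pulls the whole result into a pandas DataFrame
    # or, for a large result, write one CSV per partition instead:
    # data.to_csv('name_with_language-*.csv')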

Related

I need to perform an "expensive" loop on a table with millions of rows. What technology (programming language or something else) should I use?

This question is a follow-up to this one: How to increase the performance of a Python loop?
Basically I have a script that takes as inputs a few csv files and after some data manipulation it outputs 2 csv files. In this script there is a loop on a table with ~14 million rows whose objective is to create another table with the same number of rows. I am working with Python on this project but the loop is just too slow (I know this because I used the tqdm package to measure speed).
So I'm looking for suggestions on what I should use to achieve my objective. Ideally the technology would be free and wouldn't take long to learn. I have already received a few suggestions from other people: Cython and Power BI. The latter is paid and the former seems complicated, but I am willing to learn it if it is indeed useful.
If more details are necessary just ask. Thanks.
Read about vaex. Vaex can help you process your data much, much faster. You should first convert your CSV file to HDF5 format using the vaex library; CSV files are very slow to read and write.
Vaex will do the multiprocessing for your operations.
Also check whether you can vectorize your computation (you probably can). I glanced at your code: try to avoid using lists and use NumPy arrays instead if you can.
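As a rough sketch of the conversion (assuming a file named data.csv and a numeric column x; the exact vaex API may differ between versions):
import vaex

# One-time conversion: reads the CSV, writes an HDF5 copy next to it,
# and memory-maps that copy for fast access afterwards
df = vaex.from_csv('data.csv', convert=True)

# Vectorized, multithreaded expression instead of a Python loop
df['x_squared'] = df.x ** 2
print(df['x_squared'].mean())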
If you're willing to stay with Python, I would recommend using the multiprocessing module. Corey Schafer has a good tutorial on how it works here.
Multiprocessing is a bit like threading, but it uses multiple interpreter processes to complete the main task, unlike the threading module, which switches between threads that each execute code inside a single interpreter.
Divide the work across however many cores your CPU has with this:
import os
cores = os.cpu_count()
This should speed up the workload by dividing the work across your entire computer.
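A minimal sketch of that idea, assuming the per-row work can be wrapped in a function that takes a chunk of the table (process_chunk and the value column are placeholders, not names from the question):
import os
import multiprocessing as mp

import numpy as np
import pandas as pd

def process_chunk(chunk):
    # Placeholder for the real per-row transformation
    chunk = chunk.copy()
    chunk['result'] = chunk['value'] * 2
    return chunk

if __name__ == '__main__':
    df = pd.DataFrame({'value': range(1_000_000)})
    cores = os.cpu_count()
    chunks = np.array_split(df, cores)  # one chunk per core
    with mp.Pool(processes=cores) as pool:
        out = pd.concat(pool.map(process_chunk, chunks))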
14 million rows is very achievable with python, but you can't do it with inefficient looping methods. I had a glance at the code you posted here, and saw that you're using iterrows(). iterrows() is fine for small dataframes, but it is (as you know) painfully slow when used on dataframes the size of yours. Instead, I suggest you start by looking into the apply() method (see docs here). That should get you up to speed!
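For illustration, a toy comparison of the two approaches with made-up columns a and b:
import pandas as pd

df = pd.DataFrame({'a': range(1_000_000), 'b': range(1_000_000)})

# Slow: Python-level loop that builds a Series object for every row
total = 0
for _, row in df.iterrows():
    total += row['a'] * row['b']

# Often faster: row-wise apply
df['product'] = df.apply(lambda row: row['a'] * row['b'], axis=1)

# Fastest: vectorized arithmetic on whole columns
df['product'] = df['a'] * df['b']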

Aggregation/groupby using pymongo versus using pandas

I have an API that loads data from MongoDB (with pymongo) and applies relatively "complex" data transformations with pandas afterwards, such as groupby on datetime columns, parametrizing the frequency and other stuff. Since I'm more expert in pandas than mongo, I do prefer doing it as it is, but I have no idea if writing these transformations as mongo aggregate queries would be significantly faster.
To simplify the question, and leaving aside the difficulty of writing the queries on either side: is it faster to do a [simple groupby in Mongo and select the results] or to [select * and do the groupby in pandas/dask (in a distributed scenario)]? And is the answer different for large datasets versus small ones?
In general aggregation will be much faster than Pandas because the aggregation pipeline is:
Uploaded to the server and processed there, eliminating the network latency associated with round trips
Optimised internally to achieve the fastest execution
Executed in C on the server with all the available threading, as opposed to running on a single thread in Pandas
Working on the same dataset within the same process, which means your working set will be in memory, eliminating disk accesses
As a stopgap, I recommend pre-processing your data into a new collection using $out and then using Pandas to process that.
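A rough sketch of that stopgap, with made-up database, collection, and field names (mydb, events, category, amount, events_summary):
import pandas as pd
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.mydb  # hypothetical database

# Group on the server and write the result to a new collection with $out
pipeline = [
    {"$group": {"_id": "$category", "total": {"$sum": "$amount"}}},
    {"$out": "events_summary"},
]
db.events.aggregate(pipeline)

# Load the much smaller pre-aggregated collection into Pandas
summary = pd.DataFrame(list(db.events_summary.find()))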

Leveraging Concurrent future functionality on a simple function

I have a simple function which takes in 4 arguments, all of which are data frames. Using those data frames, it does pre-processing like cleaning, using machine learning to fill in null values, and manipulating the data frames with groupby, and in the end it returns two data frames.
Something like this:
def preprocess(df1, df2, df3, df4):
    """some operations for cleaning and manipulating the data"""
    return clean, df_new
Since I am performing memory-intensive operations, I was wondering if I could use the concurrent.futures functionality in Python. I did try simply using this style:
import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as executor:
    executor.map(clean, df_new=preprocess(df1, df2, df3, df4))
I am getting an error saying that clean and df_new are not defined. I am a bit confused about how to use concurrent.futures on the preprocess function so that I can utilize all the cores of my laptop and speed up this operation. Any ideas/help is really appreciated.

Local use of dask: to Client() or not to Client()?

I am trying to understand the use patterns for Dask on a local machine.
Specifically,
I have a dataset that fits in memory
I'd like to do some pandas operations
groupby...
date parsing
etc.
Pandas performs these operations via a single core and these operations are taking hours for me. I have 8 cores on my machine and, as such, I'd like to use Dask to parallelize these operations as best as possible.
My question is as follows: what is the difference between the following two ways of doing this in Dask:
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
(1)
import dask.dataframe as dd
df = dd.from_pandas(
    pd.DataFrame(iris.data, columns=iris.feature_names),
    npartitions=2
)
df.mean().compute()
(2)
import dask.dataframe as dd
from distributed import Client
client = Client()
df = client.persist(
    dd.from_pandas(
        pd.DataFrame(iris.data, columns=iris.feature_names),
        npartitions=2
    )
)
df.mean().compute()
What is the benefit of one use pattern over the other? Why should I use one over the other?
Version (2) has two differences compared to version (1): the choice to use the distributed scheduler, and persist. These are separate factors. There is a lot of documentation about both: https://distributed.readthedocs.io/en/latest/quickstart.html, http://dask.pydata.org/en/latest/dataframe-performance.html#persist-intelligently , so this answer can be kept brief.
1) The distributed scheduler is newer and smarter than the previous threaded and multiprocess schedulers. As the name suggests, it is able to use a cluster, but also works on a single machine. Although the latency when calling .compute() is generally higher, in many ways it is more efficient, has more advanced features such as real-time dynamic programming and more diagnostics such as the dashboard. When created with Client(), you by default get a number of processes equal to the number of cores, but you can choose the number of processes and threads, and get close to the original threads-only situation with Client(processes=False).
2) Persisting means evaluating a computation and storing it in memory, so that further computations are faster. You can also persist without the distributed client (dask.persist). It effectively trades memory for performance, because you don't need to re-evaluate the computation each time anything that depends on it is used. In the case where you go on to perform only one computation on the intermediate, as in the example, it should make no difference to performance.
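For example, a minimal sketch of those two knobs (a threads-only client and persisting an intermediate result), reusing the iris dataframe from the question:
import pandas as pd
import dask.dataframe as dd
from distributed import Client
from sklearn.datasets import load_iris

# Threads-only client: closest to the default threaded scheduler,
# but with the dashboard and other diagnostics of distributed
client = Client(processes=False)

iris = load_iris()
df = dd.from_pandas(
    pd.DataFrame(iris.data, columns=iris.feature_names),
    npartitions=2
)

# Persist keeps the intermediate result in memory so that several
# later computations do not redo the same work
df = client.persist(df)
print(df.mean().compute())
print(df.std().compute())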

How to improve performance of a Python script using Pandas and the SciKitLearn Stack?

I've created a script to match climate data between Brazilian cities using Python and the SciKitLearn stack. At the moment, I'm using MongoDB with climate collections with 60M+ entries, and Pandas to query and join over these tables.
I compare climate data from every Brazilian city with each other with a simple algorithm to generate a final score for each pair of cities.
The problem is that it takes too long (11 seconds per pair). This is really slow, since I want to get the score for all the 12M+ possible combinations. How can I make it faster? Here is the relevant code:
Connecting with MongoDB (almost instant):
client = MongoClient('localhost', 27017)
db = client.agricredit_datafetcher_development
Fetching climate data (fast, but not instant):
base_weather_data = list(db.meteo_data_weather_data.find({'weather_station_id': ObjectId(base_weather_station['_id'])}))
target_weather_data = list(db.meteo_data_weather_data.find({'weather_station_id': ObjectId(target_weather_station['_id'])}))
return {'base_dataframe': pd.DataFrame(base_weather_data),
        'target_dataframe': pd.DataFrame(target_weather_data)}
Getting data from embedded (nested) Mongo collections (slow):
base_dataframe = dataframes['base_dataframe'].set_index('analysis_date')['weather_forecast'].apply(pd.Series)
target_dataframe = dataframes['target_dataframe'].set_index('analysis_date')['weather_forecast'].apply(pd.Series)
Querying to remove empty values and joining the DataFrames (fast, but not instant):
available_forecast_data = base_dataframe[base_dataframe['weather_forecast'] > 0][['weather_forecast']]
to_be_compared_data = target_dataframe[target_dataframe['weather_forecast'] > 0][['weather_forecast']]
join_dataframe = available_forecast_data.join(to_be_compared_data, how='inner', lsuffix='_base', rsuffix='_target')
Then I apply the scoring algorithm (which is pretty fast, just sums and averages) and then insert it into my Mongo database.
How can I improve my code?
Here are some points:
I'm pretty sure that I could handle the embedded MongoDB data (third step) without creating a new DataFrame via the pd.Series apply. I've already tried json_normalize on my Mongo collection and then transforming it into a DataFrame, but it's slow as well and some of the data gets mangled when I do this.
The bottleneck is on the described steps. All the other steps (scoring algorithm, for instance) are instant;
I'm relying on Pandas to do the querying, grouping, and creation of DataFrames. Is that OK? I'm tempted to try the Mongo Aggregation Framework to see if I get any improvement. Is it worth it?
Is Hadoop/Spark a way to improve this? I'm having a hard time understanding their role in a data science project;
EDIT: Third point solved! Used Mongo Aggregations instead of applying pd.Series method to the dataframe:
pipeline = [
    {"$match": {"weather_station_id": ObjectId(weather_station_id)}},
    {"$project": {
        'analysis_date': "$analysis_date",
        'rainfall': "$rainfall.rainfall",
        'agricultural_drought': "$rainfall.agricultural_drought",
        'weather_forecast': "$weather_forecast.med_temp"
    }}
]
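Presumably the pipeline is then executed against the same collection as in the earlier snippet and loaded into Pandas, along these lines:
results = list(db.meteo_data_weather_data.aggregate(pipeline))
dataframe = pd.DataFrame(results).set_index('analysis_date')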
Now it takes ~1.2 s for each pair of cities. But I feel I can still improve this... the bottleneck is now the weather_data querying. It is indexed, though; is there any way to make it faster?
EDIT 2: Parallelizing my task with joblib (https://pythonhosted.org/joblib/parallel.html) sped my script up by roughly 3-4x :):
from joblib import Parallel, delayed

Parallel(n_jobs=4)(delayed(insert_or_read_scores_on_db)(city['name']) for city in cities)
Still looking for improvements, though
I have 2 suggestions which might be helpful to you.
1) In the 2nd step you are using the dataframe apply function, which runs on each cell of the column. If you can rewrite that logic as a plain for loop, you can use the Python Numba library, which is well known for speeding up this kind of loop. Put the loop in a function and add the @jit decorator, which compiles it to machine code (and, with parallel=True, can spread the work across a multicore CPU). You need to do from numba import jit. (A sketch of this idea follows after point 2.)
2) I have seen a video by a professor at the University of Washington, in a Coursera course, where he teaches how to use the map-reduce algorithm to JOIN two tables in parallel. That way the join can also be fast if it runs in parallel.
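Returning to point 1, here is a hedged sketch of the Numba idea, reusing the join_dataframe column names from the question (the scoring formula here is just an illustrative placeholder, not the real algorithm):
import numpy as np
from numba import jit, prange

@jit(nopython=True, parallel=True)
def score_pair(base, target):
    # Placeholder score: mean absolute difference between the two series
    total = 0.0
    for i in prange(base.shape[0]):
        total += abs(base[i] - target[i])
    return total / base.shape[0]

base = join_dataframe['weather_forecast_base'].to_numpy(dtype=np.float64)
target = join_dataframe['weather_forecast_target'].to_numpy(dtype=np.float64)
score = score_pair(base, target)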
