I have an API that loads data from MongoDB (with pymongo) and then applies relatively "complex" transformations with pandas, such as groupby on datetime columns with a parametrized frequency, among other things. Since I'm more experienced with pandas than with Mongo I'd prefer to keep it that way, but I have no idea whether writing these transformations as Mongo aggregate queries would be significantly faster.
To simplify the question, and setting aside how hard the queries are to write on either side: is it faster to [do a simple groupby in Mongo and return the results] or to [select everything and do the groupby in pandas/dask (in a distributed scenario)]? And does the answer change between large and small datasets?
In general, the aggregation pipeline will be much faster than Pandas because it is:
- Uploaded to the server and processed there, eliminating the network latency of round trips
- Optimised internally to achieve the fastest execution
- Executed in C on the server with all the available threads, as opposed to running on a single thread in Pandas
- Working on the same dataset within the same process, which means your working set will be in memory, eliminating disk accesses
As a stopgap I recommend pre-processing your data into a new collection using $out and then using Pandas to process that.
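For example, a minimal sketch of pushing a parametrized datetime groupby into the aggregation pipeline and materialising the result with $out (the collection and field names here are hypothetical, and $dateTrunc requires MongoDB 5.0+):
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
coll = client.mydb.readings   # hypothetical database/collection

freq = "day"                  # parametrized frequency: "hour", "day", "month", ...

pipeline = [
    {"$group": {
        # truncate the timestamp to the requested frequency
        "_id": {"$dateTrunc": {"date": "$timestamp", "unit": freq}},
        "avg_value": {"$avg": "$value"},
        "count": {"$sum": 1},
    }},
    # materialise the aggregated result into a new collection, as suggested above
    {"$out": "readings_grouped"},
]

coll.aggregate(pipeline)
Pandas then only has to read the (much smaller) readings_grouped collection.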
I am working on a real-time information retrieval system which performs large queries on a local SQL Server database with around 4.5 million rows.
Each query returns, on average, 400,000 rows and is parameterized, like the following example:
SELECT Id, Features, Edges, Cluster, Objects FROM db.Image
WHERE Cluster = 16
AND Features IS NOT NULL
AND Objects IS NOT NULL
These are the times I am getting with my current approach:
Query time: 4.52361000 seconds
Query size: 394048 rows, 5 columns
While not necessarily unusable, query sizes are expected to grow quickly, so I need a more efficient way to read large numbers of rows into a DataFrame.
Currently, I'm using pyodbc to build a connection to SQL Server and pd.read_sql to parse the query directly into a DataFrame, which is then manipulated. I am looking for ways to improve the query times significantly while still being able to work with DataFrame operations after the data is fetched. So far, I have tried dask DataFrames, connectorX, as well as failed attempts at parallelizing the queries with multithreading, but to no avail.
How can one, relying on other solutions, multithreading, or even entirely different file formats, improve the time it takes to read this amount of data?
Code Sample
conn = connection() # I have a function that returns a connector
filter = 16
command = '''SELECT Id, Features, Edges, Cluster, Objects FROM Common.Image
WHERE Cluster = {} AND Features IS NOT NULL AND Objects IS NOT NULL'''.format(filter)
result = pd.read_sql(command, conn)
EDIT
Following @tadman's comment:
Consider caching this if practical, like once you've fetched the data you could save it in a more compact form (Google Protobuf, Parquet, etc.) Reading in that way can be considerably faster, as you're usually just IO bound, not server/CPU bound.
I looked at Parquet caching and landed on a considerably faster way to fetch my data:
- Created compressed Parquet files, one for each of my data clusters (1 to 21).
- Using pyarrow, read the necessary cluster file with:
df_pq = pq.read_table("\\cluster16.parquet")
- Converted the Parquet file to a pandas DataFrame with df = df_pq.to_pandas()
- Proceeded as usual
With this method, I reduced the total time to 1.12400 seconds.
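For completeness, a minimal sketch of that caching pattern, reusing the command and conn names from the code sample above (the file name and compression choice are illustrative):
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# One-off: run the SQL query and cache the result as a compressed Parquet file
df = pd.read_sql(command, conn)
pq.write_table(pa.Table.from_pandas(df), "cluster16.parquet", compression="snappy")

# Later reads skip SQL Server entirely and are mostly IO bound
df = pq.read_table("cluster16.parquet").to_pandas()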
PySpark DataFrames can run faster than pandas DataFrames on large data, because the work is distributed across executors rather than confined to a single process, and they are not limited to one machine's memory.
If you already have the dataframe in pandas, you can convert it like this:
spark_df = spark.createDataFrame(df)           # pandas -> Spark DataFrame
modified_df = spark_df.filter("query here")    # SQL-style condition string; runs on Spark
result = modified_df.toPandas()                # or .collect() for a list of Row objects
You can convert back to pandas with toPandas() (as above) once the main SQL-style querying is done.
link:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html
I have a 7GB postgresql table which I want to read into python and do some analysis. I cannot use Pandas for it because it is larger than the memory on my local machine. I therefore wanted to try reading the table into Dask Dataframe first, perform some aggregation and switch back to Pandas for subsequent analysis. I used the below lines of code for that.
df = dd.read_sql_table('table_xyz', uri = "postgresql+psycopg2://user:pwd@remotehost/dbname", index_col = 'column_xyz', schema = 'private')
The index_col, i.e. 'column_xyz', is indexed in the database. This works, but when I perform an action, for example an aggregation, it takes ages (about an hour) to return the result.
avg = df.groupby("col1").col2.mean().compute()
I understand that Dask is not as fast as Pandas, especially since I am working on a single machine and not a cluster. I am wondering whether I am using Dask the right way. If not, what is a faster alternative for analysing large tables that do not fit in memory using Python?
If your data fits into the RAM of your machine then you're better off using Pandas; Dask will not outperform Pandas in that case.
Alternatively, you can play around with the chunking (for example the bytes_per_chunk / npartitions arguments of read_sql_table) and see if things improve. The best way to figure this out is to look at the Dask diagnostics dashboard and work out what is taking Dask so long. That will help you make a much more informed decision.
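As a rough illustration of that kind of tuning, here is a sketch that keeps the same read_sql_table call as the question but requests only the columns the aggregation needs and smaller chunks (the column list and chunk size are illustrative):
import dask.dataframe as dd

df = dd.read_sql_table(
    'table_xyz',
    uri="postgresql+psycopg2://user:pwd@remotehost/dbname",
    index_col='column_xyz',
    schema='private',
    columns=['col1', 'col2'],     # only transfer the columns the aggregation needs
    bytes_per_chunk="64 MiB",     # smaller partitions -> more, smaller parallel queries
)

avg = df.groupby("col1").col2.mean().compute()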
I am using Dask for parallel computation and want to detect the language of the sentences in a column using langdetect. However, I am not gaining any speed in detecting the language of the rows in that column.
Below is my code:
import dask.dataframe as dd
import langdetect

data = dd.read_csv('name.csv')  # has a column called short_description

def some_fn(e):
    return e['short_description'].apply(langdetect.detect)

data['Language'] = data.map_partitions(some_fn, meta='string')  # adding a new column called Language
This CSV file has 800,000 rows, each containing a sentence of roughly 20 words.
Any suggestions on how I can make the language detection faster? Currently it takes 2-3 hours.
By default dask dataframe uses a thread pool for processing. My guess is that your language detection algorithm is written in pure Python (rather than C/Cython like most of Pandas) and so is limited by the GIL. This means that you should use processes rather than threads. You can ask Dask to use processes by adding the scheduler="processes" keyword to any compute or persist calls
df.compute(scheduler="processes")
More information on Dask's different schedulers and when to use them is here: https://docs.dask.org/en/latest/scheduling.html
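Applied to the example in the question, a minimal sketch might look like this (same file and column names as above):
import dask.dataframe as dd
import langdetect

def detect_language(df):
    return df['short_description'].apply(langdetect.detect)

data = dd.read_csv('name.csv')
data['Language'] = data.map_partitions(detect_language, meta='string')

# The multiprocessing scheduler runs each partition in its own process,
# side-stepping the GIL for the pure-Python langdetect calls
result = data.compute(scheduler="processes")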
I've created a script to match climate data between Brazilian cities using Python and the SciKitLearn stack. At the moment, I'm using MongoDB with climate collections with 60M+ entries, and Pandas to query and join over these tables.
I compare climate data from every Brazilian city with each other with a simple algorithm to generate a final score for each pair of cities.
The problem is that it takes too long (11 seconds per pair). This is really slow, since I want to get the score for all the 12M+ possible combinations. How can I make it faster? Here is the relevant code:
Connecting with MongoDB (almost instant):
client = MongoClient('localhost', 27017)
db = client.agricredit_datafetcher_development
Fetching climate data (fast, but not instant):
base_weather_data = list(db.meteo_data_weather_data.find({'weather_station_id': ObjectId(base_weather_station['_id'])}))
target_weather_data = list(db.meteo_data_weather_data.find({'weather_station_id': ObjectId(target_weather_station['_id'])}))
return {'base_dataframe': pd.DataFrame(base_weather_data),
'target_dataframe': pd.DataFrame(target_weather_data)}
Getting data from embedded (nested) Mongo collections (slow):
base_dataframe = dataframes['base_dataframe'].set_index('analysis_date')['weather_forecast'].apply(pd.Series)
target_dataframe = dataframes['target_dataframe'].set_index('analysis_date')['weather_forecast'].apply(pd.Series)
Querying to remove empty values and joining the DataFrames (fast, but not instant):
available_forecast_data = base_dataframe[base_dataframe['weather_forecast'] > 0][['weather_forecast']]
to_be_compared_data = target_dataframe[target_dataframe['weather_forecast'] > 0][['weather_forecast']]
join_dataframe = available_forecast_data.join(to_be_compared_data, how='inner', lsuffix='_base', rsuffix='_target')
Then I apply the scoring algorithm (which is pretty fast, just sums and averages) and then insert it into my Mongo database.
How can I improve my code?
Here are some points:
I'm pretty sure I could handle the embedded MongoDB data (third step) without creating a new DataFrame via the pd.Series apply. I've already tried pandas json_normalize on my Mongo collection and then converting that to a DataFrame, but it's slow as well and some data gets mangled when I do it.
The bottleneck is in the steps described above; all the other steps (the scoring algorithm, for instance) are nearly instant.
I'm relying on Pandas to do the querying, grouping and DataFrame creation. Is that OK? I'm tempted to try the Mongo Aggregation Framework to see if I get any improvement. Is it worth it?
Is Hadoop/Spark a way to improve this? I'm having a hard time understanding their role in a Data Science project.
EDIT: Third point solved! Used Mongo Aggregations instead of applying pd.Series method to the dataframe:
pipeline = [
{"$match" : {"weather_station_id": ObjectId(weather_station_id)}},
{"$project": {
'analysis_date': "$analysis_date",
'rainfall': "$rainfall.rainfall",
'agricultural_drought': "$rainfall.agricultural_drought",
'weather_forecast': "$weather_forecast.med_temp"
}
}
]
Now it takes ~1.2 s for each pair of cities, but I feel I can still improve this... the bottleneck is now in the weather_data querying. That field is indexed, though; is there any way to make it faster?
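For reference, a minimal sketch of running such a pipeline from pymongo and loading the result straight into pandas (db, pipeline and field names as in the code above):
import pandas as pd

# Run the projection on the server and build the DataFrame from the cursor
cursor = db.meteo_data_weather_data.aggregate(pipeline)
weather_df = pd.DataFrame(list(cursor)).set_index('analysis_date')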
EDIT 2: Parallelizing my task with joblib (https://pythonhosted.org/joblib/parallel.html) sped my script up by roughly 3-4 times :):
Parallel(n_jobs=4)(delayed(insert_or_read_scores_on_db)(city['name']) for city in cities)
Still looking for improvements, though
I have 2 suggestions which might be helpful to you.
1) In the 2nd step you are using the DataFrame apply function, which operates on each cell of the column. If you can write that as a plain for loop instead, you can use the Python Numba library, which is well known for speeding such loops up. Put the loop in a function and decorate it with @jit (from numba import jit), which compiles it to machine code and can use multiple CPU cores (see the sketch after these points).
2) In a University of Washington course on Coursera, there is a lecture on using a MapReduce algorithm to perform a JOIN of two tables in parallel. Doing the join that way, in parallel, can also speed things up.
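Here is a minimal sketch of the Numba idea from point 1; the scoring loop itself is hypothetical and only stands in for whatever per-pair computation runs on the joined forecasts:
import numpy as np
from numba import jit

@jit(nopython=True)
def score_pair(base, target):
    # plain Python loop over the two aligned forecast arrays, compiled to machine code
    total = 0.0
    for i in range(base.shape[0]):
        total += abs(base[i] - target[i])
    return total / base.shape[0]

score = score_pair(
    np.asarray(join_dataframe['weather_forecast_base'], dtype=np.float64),
    np.asarray(join_dataframe['weather_forecast_target'], dtype=np.float64),
)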
I'm a beginner of Spark-DataFrame API.
I use this code to load a tab-separated CSV into a Spark DataFrame:
from pyspark.sql.types import StructType, StructField, StringType

lines = sc.textFile('tail5.csv')
parts = lines.map(lambda l: l.strip().split('\t'))
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts, schemaData)
Suppose I create a DataFrame with Spark from new files and convert it to pandas using the built-in toPandas() method.
Does it store the pandas object in local memory?
Is the pandas low-level computation all handled by Spark?
Does it expose all pandas DataFrame functionality? (I guess yes)
Can I just convert it with toPandas and be done with it, without touching the Spark DataFrame API much?
Using spark to read in a CSV file to pandas is quite a roundabout method for achieving the end goal of reading a CSV file into memory.
It seems like you might be misunderstanding the use cases of the technologies in play here.
Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.
In your example, the sc.textFile method will simply give you a spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference will be performed, so if you want to sum a column of numbers in your CSV file, you won't be able to because they are still strings as far as Spark is concerned.
Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column. Spark doesn't do this.
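For example, a minimal sketch (reusing the fnames list from the question; the file is tab-separated, so pass sep='\t'):
import pandas as pd

# pandas reads the tab-separated file directly and infers column types
df = pd.read_csv('tail5.csv', sep='\t', names=fnames)
print(df.dtypes)   # numeric columns come back as numbers, not strings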
Now to answer your questions:
Does it store the pandas object in local memory?
Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.
Is the pandas low-level computation all handled by Spark?
No. Pandas runs its own computations; there's no interplay between Spark and pandas, there's simply some API compatibility.
Does it expose all pandas DataFrame functionality?
No. For example, Series objects have an interpolate method which isn't available in PySpark Column objects. There are many methods and functions that are in the pandas API but not in the PySpark API.
Can I just convert it with toPandas and be done with it, without touching the Spark DataFrame API much?
Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.
Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.
Using some spark context or hive context method (sc.textFile(), hc.sql()) to read data 'into memory' returns an RDD, but the RDD remains in distributed memory (memory on the worker nodes), not memory on the master node. All the RDD methods (rdd.map(), rdd.reduceByKey(), etc) are designed to run in parallel on the worker nodes, with some exceptions. For instance, if you run a rdd.collect() method, you end up copying the contents of the rdd from all the worker nodes to the master node memory. Thus you lose your distributed compute benefits (but can still run the rdd methods).
Similarly with pandas, when you run toPandas(), you copy the data frame from distributed (worker) memory to the local (master) memory and lose most of your distributed compute capabilities. So, one possible workflow (that I often use) might be to pre-munge your data into a reasonable size using distributed compute methods and then convert to a Pandas data frame for the rich feature set. Hope that helps.
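A minimal sketch of that workflow, assuming a SparkSession named spark and using Spark's CSV reader rather than sc.textFile (the grouping column _c0 is just the default name Spark assigns when there is no header):
# Read and reduce on the cluster, then bring only the small result to the driver
sdf = spark.read.csv('tail5.csv', sep='\t', header=False, inferSchema=True)
small = sdf.groupBy('_c0').count()   # distributed aggregation on the workers
pdf = small.toPandas()               # only the reduced result lands in local (driver) memory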