Reading large database table into Dask dataframe - python

I have a 7GB postgresql table which I want to read into python and do some analysis. I cannot use Pandas for it because it is larger than the memory on my local machine. I therefore wanted to try reading the table into Dask Dataframe first, perform some aggregation and switch back to Pandas for subsequent analysis. I used the below lines of code for that.
import dask.dataframe as dd
df = dd.read_sql_table('table_xyz', uri="postgresql+psycopg2://user:pwd@remotehost/dbname", index_col='column_xyz', schema='private')
The index_col, i.e. 'column_xyz', is indexed in the database. This works, but when I perform an action, for example an aggregation, it takes ages (around an hour) to return the result.
avg = df.groupby("col1").col2.mean().compute()
I understand that Dask is not as fast as Pandas, especially since I am working on a single machine rather than a cluster. I am wondering whether I am using Dask the right way. If not, what is a faster alternative for analysing large tables that do not fit in memory with Python?

If your data fits into the RAM of your machine then you're better off using Pandas; Dask will not outperform Pandas in that case.
Alternatively, you can play around with the chunk size and see if things improve. The best way to figure this out is to look at the Dask diagnostics dashboard and work out what is taking Dask so long; that will let you make a much more informed decision.
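A minimal sketch of both suggestions, reusing the placeholder names from the question and assuming a dask version whose read_sql_table accepts a uri keyword (the chunk size is illustrative):

import dask.dataframe as dd
from dask.distributed import Client

# Starting a local cluster gives you the diagnostics dashboard; its URL is in dashboard_link
client = Client()
print(client.dashboard_link)

# bytes_per_chunk (or npartitions) controls how the table is split into partitions
df = dd.read_sql_table(
    'table_xyz',
    uri="postgresql+psycopg2://user:pwd@remotehost/dbname",
    index_col='column_xyz',
    schema='private',
    bytes_per_chunk="64 MiB",
)

avg = df.groupby("col1").col2.mean().compute()

Watching the task stream while compute() runs usually shows whether the time goes into pulling data over the network or into the groupby itself.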

Related

Faster way to read ~400.000 rows of data with Python

I am working on a real-time information retrieval system which performs large queries on a local SQL Server database with around 4.5 million rows.
Each query returns, on average, 400.000 rows and is parameterized, like the following example:
SELECT Id, Features, Edges, Cluster, Objects FROM db.Image
WHERE Cluster = 16
AND Features IS NOT NULL
AND Objects IS NOT NULL
These are the times I am getting with my current approach:
Query time: 4.52361000 seconds
Query size: 394048 rows, 5 columns
While not necessarily unusable, it is expected that query sizes will grow quickly, and as such I need a more efficient way to read large amounts of rows into a DataFrame.
Currently, I'm using pyodbc to build a connection to SQL Server and pd.read_sql to parse the query directly into a DataFrame, which is then manipulated. I am looking at ways to improve the query times significantly while still being able to work with DataFrame operations after the data is fetched. So far, I have tried dask DataFrames, connectorX, as well as failed attempts at parallelizing the queries with multithreading, but to no avail.
How can one, relying on other solutions, multithreading, or even entirely different file formats, improve the time it takes to read this amount of data?
Code Sample
conn = connection() # I have a function that returns a connector
filter = 16
command = '''SELECT Id, Features, Edges, Cluster, Objects FROM Common.Image
WHERE Cluster = {} AND Features IS NOT NULL AND Objects IS NOT NULL'''.format(filter)
result = pd.read_sql(command, conn)
EDIT
Following @tadman's comment:
Consider caching this if practical, like once you've fetched the data you could save it in a more compact form (Google Protobuf, Parquet, etc.) Reading in that way can be considerably faster, as you're usually just IO bound, not server/CPU bound.
I looked at Parquet caching and landed on a considerably faster way to fetch my data:
1. Created compressed Parquet files for each of my data clusters (1 to 21).
2. Using pyarrow, read the necessary cluster file with df_pq = pq.read_table("\\cluster16.parquet").
3. Converted the Parquet table to a pandas DataFrame with df = df_pq.to_pandas().
4. Proceeded as usual.
With this method, I reduced the total time to 1.12400 seconds.
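For reference, a minimal sketch of that caching flow, assuming pyarrow is installed and that the cluster's rows have already been fetched once into a pandas DataFrame (file name and compression are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

# One-off: cache the fetched cluster as a compressed Parquet file
table = pa.Table.from_pandas(result)  # result is the DataFrame returned by pd.read_sql
pq.write_table(table, "cluster16.parquet", compression="snappy")

# Subsequent reads are mostly I/O bound rather than server/CPU bound
df = pq.read_table("cluster16.parquet").to_pandas()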
PySpark DataFrames can run faster than pandas DataFrames and, because the data is distributed across executors, they are not limited to a single machine's memory.
If you already have the dataframe in pandas, you can convert like this:
spark_df = spark.createDataFrame(df)
modified_df = spark_df.filter("query here").collect()
You can convert back to pandas if you need to after the main SQL querying.
link:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html
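A minimal sketch of that round trip, assuming a local SparkSession and an existing pandas DataFrame df (the filter expression is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pandas -> Spark
spark_df = spark.createDataFrame(df)

# Filter with a SQL expression on the executors, then bring only the result back to pandas
filtered = spark_df.filter("Objects IS NOT NULL").toPandas()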

Dask dataframe saving to_csv for incremental data - Efficient writing to csv

I have existing code that reads streaming data and stores it in a pandas DataFrame (new data arrives every 5 minutes); I then capture this data category-wise (~350 categories).
Next, I write all the new data (since it is stored incrementally) using to_csv in a loop.
The Pseudocode is given below:
for row in parentdf.itertuples():  # insert into <tbl>
    mycat = row.category  # this is the ONLY parameter passed to the key function below
    try:
        df = FnforExtractingNParsingData(mycat, NumericParam1, NumericParam1)
        df.insert(0, 'NewCol', sym)
        df = df.assign(calculatedCol=functions1(params))
        df = df.assign(calculatedCol1=functions2(params, 20))
        df = df.assign(calculatedCol3=functions3(more_params, 20))
        df[20:].to_csv(outfile, mode='a', header=False, index=False)
    except Exception:
        continue  # exception handling details are omitted here
The category-wise reading and storing to CSV takes about 2 minutes per cycle, which works out to roughly 0.34 seconds per write across the 350 categories.
I am wondering whether I can make the above process faster & efficient by using dask dataframes.
I looked up dask.org and didn't get any clear answers, looked at the use cases as well.
Additional details: I am using Python 3.7 and Pandas 0.25.
The code above doesn't return any errors; we have already added a good amount of exception handling around it.
My key function, i.e. FnforExtractingNParsingData, is fairly resilient and has been working as desired for a long time.
Sounds like you're reading data into a Pandas DataFrame every 5 minutes and then writing it to disk. The question doesn't mention some key facts:
how much data is ingested every 5 minutes (10MB or 10TB)?
where is the code being executed (AWS Lambda or a big cluster of machines)?
what data operations does FnforExtractingNParsingData perform?
Dask DataFrames can be written to disk as multiple CSV files in parallel, which can be a lot faster than writing a single file with Pandas, but it depends. Dask is overkill for a tiny dataset. Dask can leverage all the CPUs of a single machine, so it can scale up on a single machine better than most people realize. For large datasets, Dask will help a lot. Feel free to provide more details in your question and I can give more specific suggestions.
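If the 5-minute batch is big enough to justify Dask, here is a minimal sketch of the parallel-write idea; the partition count and output pattern are illustrative, and note that it writes one file per partition rather than appending to a single CSV:

import dask.dataframe as dd

# Wrap the already-built pandas DataFrame and split it into partitions
ddf = dd.from_pandas(parentdf, npartitions=8)

# Writes out-0.csv, out-1.csv, ... in parallel, one file per partition
ddf.to_csv("out-*.csv", index=False)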

Aggregation/groupby using pymongo versus using pandas

I have an API that loads data from MongoDB (with pymongo) and applies relatively "complex" data transformations with pandas afterwards, such as groupby on datetime columns, parametrizing the frequency and other stuff. Since I'm more expert in pandas than mongo, I do prefer doing it as it is, but I have no idea if writing these transformations as mongo aggregate queries would be significantly faster.
To simplify the question (leaving aside the difficulty of writing the queries on either side): is it faster to do a simple groupby in Mongo and select the results, or to select everything and do the groupby in pandas/dask (in a distributed scenario)? Is the former faster or slower than the latter on large datasets, and on smaller ones?
In general the aggregation will be much faster than Pandas because the aggregation pipeline is:
- Uploaded to the server and processed there, eliminating the network latency associated with round trips
- Optimised internally to achieve the fastest execution
- Executed in C on the server with all the available threading, as opposed to running on a single thread in Pandas
- Working on the same dataset within the same process, which means your working set will be in memory, eliminating disk accesses
As a stopgap I recommend pre-processing your data into a new collection using $out and then using Pandas to process that.
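For illustration, a minimal pymongo sketch of the kind of server-side groupby and $out being described; the connection string, collection, and field names are assumptions:

import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["events"]

# The whole pipeline runs on the server; only the result crosses the network
pipeline = [
    {"$group": {"_id": "$category", "avg_value": {"$avg": "$value"}}},
    {"$out": "events_by_category"},  # materialise the result as a new collection
]
coll.aggregate(pipeline)

# Pull only the (much smaller) aggregated collection into pandas
df = pd.DataFrame(list(client["mydb"]["events_by_category"].find()))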

Choosing a framework for larger than memory data analysis with python

I'm solving a problem with a dataset that is larger than memory.
The original dataset is a .csv file.
One of the columns is for track IDs from the musicbrainz service.
What I already did
I read the .csv file with dask and converted it to castra format on disk for higher performance.
I also queried the musicbrainz API and populated an sqlite DB, using peewee, with some relevant results. I chose to use a DB instead of another dask.dataframe because the process took a few days and I didn't want to lose data in case of a failure.
I haven't really started to analyze the data yet; I managed to make enough of a mess just rearranging it.
The current problem
I'm having a hard time joining the columns from the SQL DB to the dask / castra dataframe. Actually, I'm not sure whether this is viable at all.
Alternative approaches
It seems that I made some mistakes in choosing the best tools for the task. Castra is probably not mature enough and I think that it's part of the problem.
In addition, it may be better to choose SQLAlchemy over peewee, as the former is used by pandas and the latter is not.
Blaze + HDF5 might serve as a good alternative to dask + castra, mainly because HDF5 is more stable / mature / complete than castra and Blaze is less opinionated about data storage. For example, it may simplify joining the SQL DB into the main dataset.
On the other hand, I'm familiar with pandas, and dask exposes (nearly) the same API. With dask I also gain parallelism.
TL;DR
I have a larger-than-memory dataset plus an sqlite DB that I need to join into the main dataset.
I'm unsure whether to work with dask + castra (I don't know of other relevant data stores for dask.dataframe) and use SQLAlchemy to load parts of the SQL DB at a time into a pandas DataFrame. The best alternative I see is to switch to Blaze + HDF5 instead.
What would you suggest in this case?
Any other option / opinion is welcome.
I hope that this is specific enough for SO.
You're correct in the following points:
Castra is experimental and immature.
If you want something more mature you could consider HDF5 or CSV (if you're fine with slower performance). Dask.dataframe supports these formats in the same way that pandas does.
It is not clear how to join between two different formats like dask.dataframe and SQL.
Probably you want to use one or the other. If you're interested in reading SQL data into dask.dataframe you could raise an issue. This would not be hard to add in common situations.
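In the meantime, a common workaround is to read the SQL data in chunks with pandas and hand the pieces to dask via dask.delayed; a minimal sketch, where the table name, connection string, and chunk sizes are assumptions:

import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def load_chunk(offset, size):
    # Each call loads one slice of the table into a pandas DataFrame
    query = "SELECT * FROM tracks LIMIT {} OFFSET {}".format(size, offset)
    return pd.read_sql(query, "sqlite:///musicbrainz.db")

# Ten lazy chunks of 100k rows each, combined into a single dask.dataframe
chunks = [load_chunk(i * 100000, 100000) for i in range(10)]
ddf = dd.from_delayed(chunks)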

What is the Spark DataFrame method `toPandas` actually doing?

I'm a beginner of Spark-DataFrame API.
I use this code to load a tab-separated CSV into a Spark DataFrame:
from pyspark.sql.types import StructType, StructField, StringType

lines = sc.textFile('tail5.csv')
parts = lines.map(lambda l: l.strip().split('\t'))
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts, schemaData)
Suppose I create a DataFrame with Spark from new files and convert it to pandas using the built-in method toPandas():
Does it store the pandas object in local memory?
Is the pandas low-level computation all handled by Spark?
Does it expose all pandas DataFrame functionality? (I guess yes)
Can I convert it with toPandas and just be done with it, without touching the DataFrame API so much?
Using spark to read in a CSV file to pandas is quite a roundabout method for achieving the end goal of reading a CSV file into memory.
It seems like you might be misunderstanding the use cases of the technologies in play here.
Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.
In your example, the sc.textFile method will simply give you a spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference will be performed, so if you want to sum a column of numbers in your CSV file, you won't be able to because they are still strings as far as Spark is concerned.
Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column. Spark doesn't do this.
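For the file in the question, that reduces to a couple of lines (fnames is the same column-name list as in the question):

import pandas as pd

# sep='\t' matches the tab-splitting done manually in the Spark version;
# column types are inferred automatically
df = pd.read_csv("tail5.csv", sep="\t", names=fnames)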
Now to answer your questions:
Does it store the pandas object in local memory?
Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.
Is the pandas low-level computation all handled by Spark?
No. Pandas runs its own computations, there's no interplay between spark and pandas, there's simply some API compatibility.
Does it expose all pandas DataFrame functionality?
No. For example, Series objects have an interpolate method which isn't available in PySpark Column objects. There are many many methods and functions that are in the pandas API that are not in the PySpark API.
Can I convert it with toPandas and just be done with it, without touching the DataFrame API so much?
Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.
Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.
Using some spark context or hive context method (sc.textFile(), hc.sql()) to read data 'into memory' returns an RDD, but the RDD remains in distributed memory (memory on the worker nodes), not memory on the master node. All the RDD methods (rdd.map(), rdd.reduceByKey(), etc) are designed to run in parallel on the worker nodes, with some exceptions. For instance, if you run a rdd.collect() method, you end up copying the contents of the rdd from all the worker nodes to the master node memory. Thus you lose your distributed compute benefits (but can still run the rdd methods).
Similarly with pandas, when you run toPandas(), you copy the data frame from distributed (worker) memory to the local (master) memory and lose most of your distributed compute capabilities. So, one possible workflow (that I often use) might be to pre-munge your data into a reasonable size using distributed compute methods and then convert to a Pandas data frame for the rich feature set. Hope that helps.
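A minimal sketch of that pre-munge-then-convert workflow, reusing the ddf built in the question and assuming one of its columns is called 'category' (purely illustrative):

# Filter and aggregate on the workers first, then copy only the small result to the driver
summary = (
    ddf.filter(ddf["category"].isNotNull())
       .groupBy("category")
       .count()
       .toPandas()
)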
