I'm working with small Excel files (10,000 rows or fewer) in Spark (Databricks).
I need to do some transformations on the Excel file. Usually I use pandas to read in the file, convert to a Spark DataFrame and then do the transformations. But since learning more about Spark's distributed architecture, I'm wondering whether this is bad for performance with such small files.
Should I just do all the transformations with pandas (forcing everything to run on the driver node) and convert to a Spark DataFrame only when needed, or does it not really matter?
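For concreteness, a minimal sketch of the workflow being described, assuming a Databricks notebook where spark is the provided session; the file path, sheet and column names are hypothetical:
import pandas as pd
# Read the small Excel file on the driver with pandas (path and sheet are placeholders)
pdf = pd.read_excel("/dbfs/FileStore/data/small_file.xlsx", sheet_name=0)
# Do the transformations in pandas while the data is small (hypothetical columns)
pdf["total"] = pdf["quantity"] * pdf["price"]
# Only convert to a Spark DataFrame when it is actually needed downstream
sdf = spark.createDataFrame(pdf)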
Ideally, I would like to have the data in a dictionary. I am not even sure if a dictionary is better than a dataframe in this context. After a bit of research, I found the following ways to read a parquet file into memory:
Pyarrow (Python API of Apache Arrow):
With pyarrow, I can read a parquet file into a pyarrow.Table. I can also read the data into a pyarrow.DictionaryArray. Both are easily convertible into a dataframe, but wouldn't memory consumption double in this case?
Pandas:
Via pd.read_parquet. The file is read into a dataframe. Again, would a dataframe perform as well as a dictionary?
parquet-python (pure Python, read-only):
Supports reading in each row in a parquet as a dictionary. That means I'd have to merge a lot of nano-dictionaries. I am not sure if this is wise.
The most efficient way to read a huge Parquet file into memory in Python is to use the pyarrow library, which provides high-performance, memory-efficient data structures for working with Parquet files.
import pyarrow.parquet as pq
# Read the Parquet file into a Pandas DataFrame
df = pq.read_pandas(path).to_pandas()
# Convert the DataFrame to a NumPy array
data = df.values
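If you really do want a plain Python dictionary rather than a DataFrame, pyarrow can also go straight from a Table to a dict of column lists; a minimal sketch, with the file path as a placeholder:
import pyarrow.parquet as pq
# Read the Parquet file into a pyarrow Table (columnar, memory-efficient)
table = pq.read_table("data.parquet")
# Convert to a dict mapping column names to Python lists
data = table.to_pydict()
Note that converting the Table to a pandas DataFrame or a dict creates a second copy in memory, which is the doubling concern raised above.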
I have a large dataset, around 6 GB, that I have processed and cleaned using PySpark, and I now want to save it so I can use it elsewhere for machine learning.
I am trying to find the fastest way of saving the dataset.
I followed the link below, but it's taking so long to save the CSV or the Parquet.
How to export a table dataframe in PySpark to csv?
Can someone please provide some information on how I can do this?
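A minimal sketch of the usual approach, assuming df is the cleaned PySpark DataFrame and the output paths are placeholders; writing Parquet is generally much faster than CSV, and keeping the write partitioned keeps the work distributed:
# Write as Parquet (columnar, compressed) - usually the fastest option
df.write.mode("overwrite").parquet("/path/to/output/cleaned_data.parquet")
# If a single CSV file is really required, coalescing to one partition
# funnels everything through a single task and is much slower
df.coalesce(1).write.mode("overwrite").option("header", True).csv("/path/to/output/csv")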
I have a Python script where I'm using pandas for transformations/manipulation of my data. I know I have some "inefficient" blocks of code. My question is: if PySpark is supposed to be much faster, can I just replace those blocks with PySpark instead of pandas, or do I need everything to be in PySpark? If I'm in Databricks, how much does this really matter, since it's already on a Spark cluster?
If the data is small enough that you can use pandas to process it, then you likely don't need PySpark. Spark is useful when your data is so large that it doesn't fit into memory on one machine, since Spark can perform distributed computation. That being said, if the computation is complex enough that it could benefit from a lot of parallelization, then you could see an efficiency boost using PySpark. I'm more comfortable with PySpark's APIs than pandas, so I might end up using PySpark anyway, but whether you'll see an efficiency boost depends a lot on the problem.
Pandas runs operations on a single machine, whereas PySpark runs on multiple machines. If you are working on a machine learning application dealing with larger datasets, PySpark is the better fit, as it can process operations many times (up to 100x) faster than pandas.
PySpark is very efficient for processing large datasets, but after preprocessing and data exploration you can convert the Spark DataFrame to a pandas DataFrame to train machine learning models with sklearn.
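A minimal sketch of that hand-off, assuming sdf is the preprocessed Spark DataFrame and the column names are hypothetical:
from sklearn.linear_model import LogisticRegression
# Pull the (already reduced) data onto the driver as a pandas DataFrame
pdf = sdf.toPandas()
# Train a scikit-learn model on the driver
X = pdf[["feature_1", "feature_2"]]
y = pdf["label"]
model = LogisticRegression().fit(X, y)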
Let's compare apples with apples, please: pandas is not an alternative to PySpark, as pandas cannot do distributed computing or out-of-core computation. What you can pit Spark against is Dask on Ray Core (see docs), and you don't even have to learn a different API like you would with Spark, as Dask is intended to be a distributed drop-in replacement for pandas and numpy (and so is Dask-ML for popular ML packages such as scikit-learn and xgboost).
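To illustrate the "drop-in" claim, a small sketch using dask.dataframe; the file path and column names are placeholders. The API mirrors pandas, but evaluation is lazy and out of core:
import dask.dataframe as dd
# Same call shape as pandas.read_csv, but lazy and chunked under the hood
ddf = dd.read_csv("large_file.csv")
# Familiar pandas-style operations; compute() triggers actual execution
result = ddf.groupby("group_col")["value"].mean().compute()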
I have a large CSV file (30 GB) with 7 columns. Would there be another format for saving the file so that the size is much smaller, given that the first few columns have the same values for many rows?
I was thinking about an XML file. How do I convert this large CSV file to an XML file?
The solution I found involves the pandas package, but since the data is large, pandas would not work on my laptop with 8 GB of RAM.
Pandas is an in-memory package, so the data must be smaller than the amount of RAM. Can you split the original 30 GB file into a collection of smaller files and process them in pandas one at a time? E.g., one file for each fund_ticker.
Dask supports out-of-memory processing for NumPy and pandas, but that is another layer of complexity. https://dask.org
Here is info from pandas docs on scaling to large data sets: https://pandas.pydata.org/docs/user_guide/scale.html
Finally, is a database an option for this use case?
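A sketch of the split-and-process idea using pandas' own chunked reader; the file name, chunk size, and the choice to write each piece out as Parquet are assumptions for illustration (Parquet compresses the repeated leading-column values well):
import pandas as pd
# Stream the large CSV in 1-million-row chunks instead of loading it all at once
for i, chunk in enumerate(pd.read_csv("big_file.csv", chunksize=1_000_000)):
    # Process each chunk independently, e.g. write it back out as a smaller Parquet part
    chunk.to_parquet(f"part_{i:04d}.parquet", index=False)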
I'm a beginner with the Spark DataFrame API.
I use this code to load a tab-separated CSV into a Spark DataFrame:
from pyspark.sql.types import StructType, StructField, StringType
# Read the file as plain text lines and split each line on tabs
lines = sc.textFile('tail5.csv')
parts = lines.map(lambda l: l.strip().split('\t'))
# fnames is the list of column names (left as a placeholder here)
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts, schemaData)
Suppose I create a DataFrame with Spark from new files and convert it to pandas using the built-in method toPandas().
Does it store the pandas object in local memory?
Is the pandas low-level computation handled entirely by Spark?
Does it expose all pandas DataFrame functionality? (I guess yes)
Can I convert it with toPandas and just be done with it, without touching the DataFrame API so much?
Using Spark to read a CSV file into pandas is quite a roundabout way of achieving the end goal of reading a CSV file into memory.
It seems like you might be misunderstanding the use cases of the technologies in play here.
Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.
In your example, the sc.textFile method will simply give you a Spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference will be performed, so if you want to sum a column of numbers in your CSV file, you won't be able to, because they are still strings as far as Spark is concerned.
Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column. Spark doesn't do this.
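For the file in the question that would be something like the following, assuming a tab separator as in the original snippet:
import pandas as pd
# Reads the whole file into memory and infers column types automatically
df = pd.read_csv("tail5.csv", sep="\t")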
Now to answer your questions:
Does it store the pandas object in local memory?
Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.
Is the pandas low-level computation handled entirely by Spark?
No. pandas runs its own computations; there is no interplay between Spark and pandas, only some API compatibility.
Does it expose all pandas DataFrame functionality?
No. For example, Series objects have an interpolate method which isn't available in PySpark Column objects. There are many, many methods and functions in the pandas API that are not in the PySpark API.
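For instance, a quick sketch of that pandas-only method:
import pandas as pd
# Series.interpolate fills gaps between known values; there is no direct
# equivalent on a PySpark Column
s = pd.Series([1.0, None, 3.0])
print(s.interpolate())  # -> 1.0, 2.0, 3.0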
Can I convert it toPandas and just be done with it, without so much touching DataFrame API?
Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.
Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.
Using a Spark context or Hive context method (sc.textFile(), hc.sql()) to read data 'into memory' returns an RDD, but the RDD remains in distributed memory (memory on the worker nodes), not memory on the master node. All the RDD methods (rdd.map(), rdd.reduceByKey(), etc.) are designed to run in parallel on the worker nodes, with some exceptions. For instance, if you run an rdd.collect() method, you end up copying the contents of the RDD from all the worker nodes to the master node's memory. Thus you lose your distributed compute benefits (but can still run the RDD methods).
Similarly with pandas, when you run toPandas(), you copy the data frame from distributed (worker) memory to the local (master) memory and lose most of your distributed compute capabilities. So, one possible workflow (that I often use) might be to pre-munge your data into a reasonable size using distributed compute methods and then convert to a Pandas data frame for the rich feature set. Hope that helps.
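A small sketch of that workflow, with hypothetical paths and column names, assuming spark is the active session:
# Do the heavy filtering/aggregation while the data is still distributed
sdf = (spark.read.parquet("/data/events.parquet")
            .filter("event_date >= '2023-01-01'")
            .groupBy("user_id")
            .count())
# Only the reduced result is copied into driver memory for pandas' richer API
pdf = sdf.toPandas()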