Python multiprocessing tool vs Py(Spark)

A newbie question, as I get increasingly confused with pyspark. I want to scale an existing Python data preprocessing and analysis pipeline. I realize that if I partition my data with pyspark, I can't treat each partition as a standalone pandas data frame anymore; I need to learn to manipulate it with pyspark.sql row/column functions and change a lot of existing code. On top of that, I am bound to the Spark MLlib libraries and can't take full advantage of the more mature scikit-learn package. So why would I ever need to use Spark if I can use multiprocessing tools for cluster computing and parallelize tasks on existing data frames?

True, Spark does have the limitations you mention: you are bound to the functional Spark world (Spark MLlib, DataFrames, etc.). However, what it provides over other multiprocessing tools/libraries is the automatic distribution, partitioning and rescaling of parallel tasks. Scaling and scheduling Spark code is an easier task than having to write custom multiprocessing code that responds to larger amounts of data and computation.
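As a rough, hedged illustration of that difference (the file names and the preprocess step below are made up, not anything from your pipeline): with multiprocessing you split the work by hand and stay within one machine's cores and memory, while the Spark equivalent handles partitioning, scheduling and retries for you.
import pandas as pd
from multiprocessing import Pool

# Hypothetical input files; with multiprocessing you manage the chunking yourself
# and are limited to the cores and memory of a single machine.
paths = ["part-0001.csv", "part-0002.csv", "part-0003.csv"]

def preprocess(path):
    # Each worker process loads and cleans one chunk as a pandas data frame.
    return pd.read_csv(path).dropna()

if __name__ == "__main__":
    with Pool(3) as pool:
        frames = pool.map(preprocess, paths)
    result = pd.concat(frames)

# The Spark counterpart: partitioning and scheduling are handled automatically,
# and the same line scales from a laptop to a cluster.
# cleaned = spark.read.csv("part-*.csv", header=True).dropna()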

Related

Databricks - Pyspark vs Pandas

I have a python script where I'm using pandas for transformations/manipulation of my data. I know I have some "inefficient" blocks of code. My question is, if pyspark is supposed to be much faster, can I just replace these blocks using pyspark instead of pandas or do I need everything to be in pyspark? If I'm in Databricks, how much does this really matter since it's already on a spark cluster?
If the data is small enough that you can use pandas to process it, then you likely don't need pyspark. Spark is useful when you have such large data sizes that it doesn't fit into memory in one machine since it can perform distributed computation. That being said, if the computation is complex enough that it could benefit from a lot of parallelization, then you could see an efficiency boost using pyspark. I'm more comfortable with pyspark's APIs than pandas, so I might end up using pyspark anyways, but whether you'll see an efficiency boost depends a lot on the problem.
Pandas runs operations on a single machine, whereas PySpark runs on multiple machines. If you are working on a machine learning application with larger datasets, PySpark is the better fit and can process operations many times (even 100x) faster than pandas.
PySpark is very efficient for processing large datasets, but you can convert a Spark DataFrame to a pandas DataFrame after preprocessing and data exploration to train machine learning models with sklearn.
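A minimal sketch of that hand-off, with a hypothetical table and column names (nothing here comes from the question itself): do the heavy filtering in Spark, then fit a scikit-learn model on the much smaller pandas result.
from sklearn.ensemble import RandomForestClassifier

# Hypothetical Spark table and columns; the distributed work happens here.
features = spark.table("events").select("f1", "f2", "label").dropna()

# toPandas() pulls the (now reduced) data into driver memory.
pdf = features.toPandas()

# From here on it is ordinary scikit-learn on a local pandas DataFrame.
model = RandomForestClassifier(n_estimators=100)
model.fit(pdf[["f1", "f2"]], pdf["label"])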
Let's compare apples with apples please: pandas is not an alternative to pyspark, as pandas cannot do distributed or out-of-core computation. What you can pit Spark against is Dask on Ray Core (see docs), and you don't even have to learn a different API like you would with Spark, as Dask is intended to be a distributed drop-in replacement for pandas and numpy (and so is Dask-ML for popular ML packages such as scikit-learn and xgboost).
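For a sense of what "drop-in" means here, a small hedged sketch (the file pattern and column names are made up):
import dask.dataframe as dd

# Reads many CSV files as one lazily partitioned data frame.
ddf = dd.read_csv("data/part-*.csv")

# The API mirrors pandas; nothing is executed yet.
per_user = ddf.groupby("user_id")["amount"].sum()

# compute() triggers the parallel (possibly distributed) execution.
result = per_user.compute()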

Pyspark alternative to spark.lapply?

I have a compute intensive python function called repeatedly within a for loop (each iteration is independent i.e. embarrassingly parallel). I am looking for spark.lapply (from SparkR) kind of functionality to utilize the Spark cluster.
Native Spark
If you use Spark data frames and libraries, then Spark will natively parallelize and distribute your task.
Thread Pools
One of the ways that you can achieve parallelism in Spark without using Spark data frames is by using the multiprocessing library's thread pool. However, by default all of your code will run on the driver node.
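A short sketch of that pattern, with a hypothetical df and year column (not from the question): each call below runs as its own distributed Spark job, but the threads that launch them live on the driver.
from multiprocessing.pool import ThreadPool

def yearly_count(year):
    # Each invocation triggers an independent, distributed Spark job.
    return year, df.filter(df.year == year).count()

pool = ThreadPool(4)
results = pool.map(yearly_count, [2019, 2020, 2021, 2022])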
Pandas UDFs
One of the newer features in Spark that enables parallel processing is Pandas UDFs. With this feature, you can partition a Spark data frame into smaller data sets that are distributed and converted to Pandas objects, where your function is applied, and then the results are combined back into one large Spark data frame.
Example from https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
from pyspark.sql.functions import udf

# Use udf to define a row-at-a-time udf
@udf('double')
def plus_one(v):
    # Input/output are both a single double value
    return v + 1

df.withColumn('v2', plus_one(df.v))
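The row-at-a-time udf above is the baseline from that post; the vectorized (Pandas UDF) counterpart looks roughly like this (newer Spark versions express the same thing with Python type hints instead of PandasUDFType):
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Use pandas_udf to define a vectorized udf: the function receives and returns
# a whole pandas.Series per batch instead of one value at a time.
@pandas_udf('double', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return v + 1

df.withColumn('v2', pandas_plus_one(df.v))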

How do I load gigabytes of data from Google Cloud Storage into a pandas dataframe?

I am trying to load gigabytes of data from Google Cloud Storage or Google BigQuery into a pandas dataframe so that I can attempt to run scikit's OneClassSVM and Isolation Forest (or any other unary or PU classification). So I tried pandas-gbq, but attempting to run
pd.read_gbq(query, 'my-super-project', dialect='standard')
causes my machine to SIGKILL it when it's only 30% complete. I can't load the data locally either: my machine does not have enough space, nor would that be reasonably efficient.
I have also tried
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('bucket-id-here')
# Then do other things...
blob = bucket.get_blob('remote/path/to/file.txt')
print(blob.download_as_string())
with which I can load 1/10 or 1/5 of my available data, but then my machine eventually tells me that it ran out of memory.
TLDR: Is there a way I can run my custom code (with numpy, pandas, and even TensorFlow) in the cloud or on some faraway supercomputer where I can easily and efficiently load data from Google Cloud Storage or Google BigQuery?
I don't quite think you are going in the right direction. I'll try to explain how I usually work with data and hopefully this gives you some insights.
I first tend to work with small datasets, either by applying some sampling technique or by querying for fewer days. At this step, it's ok to use pandas or other tools developed for small data to build models, compute some statistics, find moments and so on.
After I get acquainted with the data, I start working with Big Data tools.
Specifically, I have a very small Dataproc cluster where I've already setup a jupyter notebook to run pyspark code.
The total memory of your cluster will have to surpass the total memory you are using as input.
Working with either pandas or Spark dataframes should be straightforward for you; as you can see in this blog post by Databricks, Spark already offers this feature.
After that, comes implementing the algorithms. Spark already offers some built-in algorithms out-of-the-box, you can play around with them.
If the algorithms you want to implement are not available, you can either issue a request in their repository or build it yourself (you can use Python's Scipy implementation as a guide and transpose it to the spark environment).
Here's an example of how I load data for one of the algorithms I use to build a recommender system for our company:
from pyspark.sql import functions as sfunc
from pyspark.sql import types as stypes

# Explicit schema for the training files: fv, sku and score columns.
schema = (stypes.StructType()
          .add("fv", stypes.StringType())
          .add("sku", stypes.StringType())
          .add("score", stypes.FloatType()))

# Read the gzipped CSVs straight from GCS using the schema above.
train_df = spark.read.csv('gs://bucket_name/pyspark/train_data*.gz', header=True, schema=schema)
Spark will automatically distribute this data across the different workers you have available in your cluster. After that I mainly run queries and map / reduce steps to get correlations between skus.
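As a hedged illustration of that kind of step (the aggregation itself is made up, not the actual recommender logic), a per-sku summary over the data loaded above would look something like:
# Count and average score per sku, computed in parallel across the workers.
sku_stats = (train_df
             .groupBy("sku")
             .agg(sfunc.count("*").alias("n"),
                  sfunc.avg("score").alias("avg_score")))
sku_stats.show(10)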
As far as maintaining your current code goes, it probably won't scale to big data anyway. You can nevertheless find lots of resources for combining the power of numpy with Spark, as in this example for instance.

`pyspark mllib` versus `pyspark ml` packages

What is the difference between the pyspark mllib and pyspark ml packages?
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html
pyspark mllib appears to target algorithms at the RDD level, while pyspark ml targets them at the DataFrame level.
One difference I found is pyspark ml implements pyspark.ml.tuning.CrossValidator while pyspark mllib does not.
My understanding was that mllib is the library to use when implementing algorithms on the Apache Spark framework, but there appears to be a split?
There does not appear to be interoperability between the two frameworks without transforming types, as they each have a different package structure.
From my experience, pyspark.mllib classes can only be used with pyspark.RDDs, whereas (as you mention) pyspark.ml classes can only be used with pyspark.sql.DataFrames. There is mention of this in the documentation for pyspark.ml; the first entry for the pyspark.ml package states:
DataFrame-based machine learning APIs to let users quickly assemble and configure practical machine learning pipelines.
Now I am reminded of an article I read a while back about the three APIs available in Spark 2.0, their relative benefits/drawbacks and their comparative performance: A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets. I was in the midst of doing performance testing on new client servers and was interested in whether there would ever be a scenario in which it would be worth developing an RDD-based approach as opposed to a DataFrame-based approach (my approach of choice), but I digress.
The gist was that there are situations in which each is highly suited and others where it might not be. One example I remember is that if your data is already structured, DataFrames confer some performance benefits over RDDs, and this apparently becomes drastic as the complexity of your operations increases. Another observation was that Datasets and DataFrames consume far less memory when caching than RDDs. In summation, the author concluded that for low-level operations RDDs are great, but for high-level operations, viewing, and tying in with other APIs, DataFrames and Datasets are superior.
So to come back full circle to your question, I believe the answer is a resounding pyspark.ml, as the classes in this package are designed to utilize pyspark.sql.DataFrames. I would imagine that the performance difference between complex algorithms implemented in each of these packages would be significant if you were to test against the same data structured as a DataFrame vs an RDD. Furthermore, viewing the data and developing compelling visuals would be both more intuitive and perform better.
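A hedged sketch of the DataFrame-based pyspark.ml workflow described above, including the CrossValidator mentioned in the question. The column names are hypothetical and train is assumed to be a pyspark.sql.DataFrame with a 'label' column:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Assemble hypothetical feature columns into the vector column the estimator expects.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

# Cross-validate over a small regularization grid.
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)
model = cv.fit(train)  # train: an assumed pyspark.sql.DataFrame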

What is the Spark DataFrame method `toPandas` actually doing?

I'm a beginner with the Spark DataFrame API.
I use this code to load a tab-separated CSV into a Spark DataFrame:
from pyspark.sql.types import StructType, StructField, StringType

lines = sc.textFile('tail5.csv')
parts = lines.map(lambda l: l.strip().split('\t'))
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts, schemaData)
Suppose I create a DataFrame with Spark from new files and convert it to pandas using the built-in method toPandas().
Does it store the pandas object in local memory?
Is the pandas low-level computation all handled by Spark?
Does it expose all pandas dataframe functionality? (I guess yes)
Can I convert it to pandas and just be done with it, without touching the DataFrame API so much?
Using spark to read in a CSV file to pandas is quite a roundabout method for achieving the end goal of reading a CSV file into memory.
It seems like you might be misunderstanding the use cases of the technologies in play here.
Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.
In your example, the sc.textFile method will simply give you a spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference will be performed, so if you want to sum a column of numbers in your CSV file, you won't be able to because they are still strings as far as Spark is concerned.
Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column. Spark doesn't do this.
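For the tab-separated file in the question, that could look something like this (assuming the file has no header row and reusing the fnames list from your code):
import pandas as pd

# Reads the whole file into local memory and infers the column types.
df = pd.read_csv('tail5.csv', sep='\t', names=fnames)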
Now to answer your questions:
Does it store the pandas object in local memory?
Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.
Is the pandas low-level computation all handled by Spark?
No. Pandas runs its own computations; there's no interplay between Spark and pandas, there's simply some API compatibility.
Does it expose all pandas dataframe functionality?
No. For example, Series objects have an interpolate method which isn't available in PySpark Column objects. There are many many methods and functions that are in the pandas API that are not in the PySpark API.
Can I convert it to pandas and just be done with it, without touching the DataFrame API so much?
Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.
Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.
Using some Spark context or Hive context method (sc.textFile(), hc.sql()) to read data 'into memory' returns an RDD, but the RDD remains in distributed memory (memory on the worker nodes), not memory on the master node. All the RDD methods (rdd.map(), rdd.reduceByKey(), etc.) are designed to run in parallel on the worker nodes, with some exceptions. For instance, if you run an rdd.collect() method, you end up copying the contents of the RDD from all the worker nodes to the master node's memory. At that point you lose your distributed compute benefits (but can still run the RDD methods).
Similarly with pandas: when you run toPandas(), you copy the data frame from distributed (worker) memory to local (master) memory and lose most of your distributed compute capabilities. So one possible workflow (that I often use) is to pre-munge your data to a reasonable size using distributed compute methods and then convert it to a pandas data frame for the rich feature set. Hope that helps.
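A small sketch of that pre-munge-then-convert workflow, with a made-up table and columns:
from pyspark.sql import functions as F

# Reduce the data with distributed operations first...
daily = (spark.table("events")
              .filter(F.col("country") == "US")
              .groupBy("date")
              .agg(F.sum("revenue").alias("revenue")))

# ...then pull only the small aggregated result onto the driver.
pdf = daily.toPandas()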
