`pyspark mllib` versus `pyspark ml` packages - python

What is the difference between the pyspark mllib and pyspark ml packages?
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html
pyspark.ml appears to target algorithms at the DataFrame level, whereas pyspark.mllib appears to work at the RDD level.
One difference I found is that pyspark.ml implements pyspark.ml.tuning.CrossValidator, while pyspark.mllib does not.
My understanding was that mllib is the library to use when implementing algorithms on the Apache Spark framework, but there appears to be a split?
There does not appear to be interoperability between the two frameworks without transforming types, as they each use a different package structure.
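For concreteness, here is a minimal sketch of the DataFrame-based cross-validation I am referring to (toy data and parameter values, purely to illustrate the API):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.getOrCreate()

# A tiny DataFrame with the (features, label) columns pyspark.ml estimators expect
train_df = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 0.0),
     (Vectors.dense([2.0, 1.0]), 1.0),
     (Vectors.dense([2.0, 1.3]), 1.0),
     (Vectors.dense([0.0, 1.2]), 0.0),
     (Vectors.dense([2.1, 1.1]), 1.0),
     (Vectors.dense([0.1, 1.0]), 0.0)],
    ["features", "label"])

lr = LogisticRegression(maxIter=10)
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

# Cross-validation over the parameter grid; this tuning API lives in pyspark.ml only
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=2)
cv_model = cv.fit(train_df)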

In my experience, pyspark.mllib classes can only be used with pyspark.RDDs, whereas (as you mention) pyspark.ml classes can only be used with pyspark.sql.DataFrames. There is support for this in the documentation for pyspark.ml; the first entry for the pyspark.ml package states:
DataFrame-based machine learning APIs to let users quickly assemble and configure practical machine learning pipelines.
This reminds me of an article I read a while back regarding the three APIs available in Spark 2.0, their relative benefits/drawbacks, and their comparative performance: A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets. I was in the midst of doing performance testing on new client servers and was interested in whether there would ever be a scenario in which it would be worth developing an RDD-based approach as opposed to a DataFrame-based approach (my approach of choice), but I digress.
The gist was that there are situations in which each is highly suited and others where it might not be. One example I remember is that if your data is already structured, DataFrames confer some performance benefits over RDDs, and this apparently becomes drastic as the complexity of your operations increases. Another observation was that Datasets and DataFrames consume far less memory when cached than RDDs. In summary, the author concluded that for low-level operations RDDs are great, but for high-level operations, viewing, and tying in with other APIs, DataFrames and Datasets are superior.
So, to come back full circle to your question, I believe the answer is a resounding pyspark.ml, as the classes in this package are designed to utilize pyspark.sql.DataFrames. I would imagine that the performance difference between complex algorithms implemented in these two packages would be significant if you were to test them against the same data structured as a DataFrame versus an RDD. Furthermore, viewing the data and developing compelling visuals would both be more intuitive and perform better.
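To make the split concrete, here is a minimal sketch (toy data, purely illustrative) of the RDD-based pyspark.mllib usage; the equivalent pyspark.ml estimators take a pyspark.sql.DataFrame instead, so moving between the two packages means converting between RDDs and DataFrames yourself.

from pyspark.sql import SparkSession
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# pyspark.mllib models train on an RDD of LabeledPoint objects, not a DataFrame
training_rdd = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.1]),
    LabeledPoint(1.0, [2.0, 1.0]),
    LabeledPoint(1.0, [2.0, 1.3]),
    LabeledPoint(0.0, [0.0, 1.2]),
])
mllib_model = LogisticRegressionWithLBFGS.train(training_rdd, iterations=10)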

Related

Can everything that can be done with Pandas Dataframes be reproduced with Pyspark Dataframes?

I'm trying to think of a reason (other than you only have a small dataset) that you wouldn't use Pyspark Dataframes.
Can everything that can be done with Pandas Dataframes be reproduced with Pyspark Dataframes?
Are there some Pandas-exclusive functions or some functions that are incredibly difficult to reproduce with Pyspark?
Spark is a distributed processing framework. In addition to supporting the DataFrame functionality, it needs to run a JVM, a scheduler, and cross-process/machine communication, it spins up databases, etc. So while the answer to your question is, of course, no, not exactly everything is implemented the same way, the wider answer is that any distributed processing library naturally involves immense overhead. Lots of work goes into reducing this overhead, but it will never be trivial.
Dask (another distributed processing library with a DataFrame implementation) has a great section on best practices. In it, the first recommendation is not to use Dask unless you have to:
Parallelism brings extra complexity and overhead. Sometimes it’s necessary for larger problems, but often it’s not. Before adding a parallel computing system like Dask to your workload you may want to first try some alternatives:
Use better algorithms or data structures: NumPy, Pandas, Scikit-Learn may have faster functions for what you’re trying to do. It may be worth consulting with an expert or reading through their docs again to find a better pre-built algorithm.
Better file formats: Efficient binary formats that support random access can often help you manage larger-than-memory datasets efficiently and simply. See the Store Data Efficiently section below.
Compiled code: Compiling your Python code with Numba or Cython might make parallelism unnecessary. Or you might use the multi-core parallelism available within those libraries.
Sampling: Even if you have a lot of data, there might not be much advantage from using all of it. By sampling intelligently you might be able to derive the same insight from a much more manageable subset.
Profile: If you’re trying to speed up slow code it’s important that you first understand why it is slow. Modest time investments in profiling your code can help you to identify what is slowing you down. This information can help you make better decisions about if parallelism is likely to help, or if other approaches are likely to be more effective.
There's a very good reason for this. In-memory, single-threaded applications are always going to be much faster for small datasets.
Very simplistically, if you imagine the single-threaded runtime for your workflow is T, the wall time of a distributed workflow will be T_parallelizable / n_cores + T_not_parallelizable + overhead. For pyspark, this overhead is very significant. It's worth it a lot of the time. But it's not nothing.
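Purely as a back-of-the-envelope illustration of that formula (every number below is made up):

def estimated_wall_time(t_parallelizable, t_not_parallelizable, n_cores, overhead):
    # Wall time of the distributed workflow, per the formula above
    return t_parallelizable / n_cores + t_not_parallelizable + overhead

# Hypothetical big job: 100 s parallelizable, 10 s serial, 30 s of scheduling,
# serialization and shuffle overhead, spread over 8 cores
print(estimated_wall_time(100, 10, 8, 30))  # 52.5 s vs 110 s single-threaded

# Hypothetical small job: the fixed overhead dominates and the cluster loses badly
print(estimated_wall_time(5, 1, 8, 30))     # 31.625 s vs 6 s single-threaded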

Databricks - Pyspark vs Pandas

I have a python script where I'm using pandas for transformations/manipulation of my data. I know I have some "inefficient" blocks of code. My question is, if pyspark is supposed to be much faster, can I just replace these blocks using pyspark instead of pandas or do I need everything to be in pyspark? If I'm in Databricks, how much does this really matter since it's already on a spark cluster?
If the data is small enough that you can use pandas to process it, then you likely don't need pyspark. Spark is useful when you have such large data sizes that it doesn't fit into memory in one machine since it can perform distributed computation. That being said, if the computation is complex enough that it could benefit from a lot of parallelization, then you could see an efficiency boost using pyspark. I'm more comfortable with pyspark's APIs than pandas, so I might end up using pyspark anyways, but whether you'll see an efficiency boost depends a lot on the problem.
Pandas runs operations on a single machine, whereas PySpark runs on multiple machines. If you are working on a machine learning application dealing with larger datasets, PySpark is the better fit, as it can process operations many times (100x) faster than Pandas.
PySpark is very efficient for processing large datasets, but you can convert a Spark DataFrame to a Pandas DataFrame after preprocessing and data exploration in order to train machine learning models using sklearn.
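A minimal sketch of that hand-off, assuming the Spark DataFrame (called spark_df here, with a hypothetical label column) has already been reduced to something that fits in driver memory:

from sklearn.ensemble import RandomForestClassifier

# spark_df is assumed to be a Spark DataFrame that has already been
# filtered/aggregated with PySpark so that it fits in driver memory
pdf = spark_df.toPandas()

X = pdf.drop(columns=["label"])  # "label" is a hypothetical target column
y = pdf["label"]

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)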
Let's compare apples with apples, please: pandas is not an alternative to pyspark, as pandas cannot do distributed computing or out-of-core computations. What you can pit Spark against is Dask on Ray Core (see the docs), and you don't even have to learn a different API like you would with Spark, as Dask is intended to be a distributed drop-in replacement for pandas and numpy (and so is Dask-ML for popular ML packages such as scikit-learn and xgboost).

How to use Spark for machine learning workflows with big models

Is there a memory efficient way to apply large (>4GB) models to Spark Dataframes without running into memory issues?
We recently ported a custom pipeline framework over to Spark (using Python and PySpark) and ran into problems when applying large models like Word2Vec and autoencoders to tokenized text inputs. At first I very naively converted the transformation calls to UDFs (both pandas and Spark "native" ones), which was fine as long as the models/utilities used were small enough to either be broadcast or instantiated repeatedly:
from pyspark.sql.functions import pandas_udf
from nltk import tokenize  # assumed source of word_tokenize, as used in the original call
import pandas

@pandas_udf("array<string>")
def tokenize_sentence(sentences: pandas.Series) -> pandas.Series:
    return sentences.map(lambda sentence: tokenize.word_tokenize(sentence))
Trying the same approach with large models (e.g. for embedding those tokens into vector space via word2vec) resulted in terrible performance, and I get why:
from typing import List

import gensim.downloader as api  # assumed source of api.load
import pandas
from pyspark.sql.functions import pandas_udf

@pandas_udf("array<array<double>>")
def rows_to_lists_of_vectors(rows: pandas.Series) -> pandas.Series:
    # Loads the ~4 GB word2vec model from disk on every invocation of the UDF
    model = api.load('word2vec-google-news-300')

    def words_to_vectors(words) -> List[List[float]]:
        vectors = []
        for word in words:
            if word in model:
                vec = model[word]
                vectors.append(vec.tolist())
        return vectors

    return rows.map(words_to_vectors)
The code above would instantiate the ~4 GB word2vec model repeatedly, loading it from disk into RAM, which is very slow. I could remedy this by using mapPartitions, which would at least only load it once per partition. But more importantly, this would crash with memory-related issues (at least on my dev machine) if I didn't heavily restrict the number of tasks, which in turn made the small UDFs very slow. For example, restricting the number of tasks to 2 would solve the memory crashes but make the tokenizing painfully slow.
I understand there is an entire pipeline framework in Spark that would fit our needs, but before committing to that, I'd like to understand how the problems I ran into were solved there. Maybe there are some key practices we could use instead of having to rewrite our framework.
My actual question therefore is twofold:
Would using the Spark pipeline framework solve our issues regarding performance and memory, assuming we wrote custom Estimators and Transformers for the steps that are not covered by Spark out of the box (e.g. tokenizers and Word2Vec)?
How does Spark solve those issues, if at all? Can I improve the current approach, or is this impossible using Python (where, to my understanding, processes don't share memory space)?
If any of the above makes you believe I missed a core principle with Spark, please point it out, after all I'm just getting started with Spark.
This will vary based on many factors (models, cluster resources, pipeline), but trying to answer your main questions:
1) Spark pipelines might solve your problem if they fit your needs in terms of Tokenizer, Word2Vec, etc. However, those are not as powerful as the ones already available off the shelf and loaded with api.load. You might also want to take a look at Deeplearning4j, which brings these to Java/Apache Spark, and see how it can do the same things: tokenize, word2vec, etc.
2) Following the current approach, I would load the model in a foreachPartition or mapPartitions call and ensure the model can fit into memory per partition (a minimal sketch of this pattern follows below). You can shrink the partition size down to a more affordable number based on the cluster resources to avoid memory problems (it's the same idea as creating one database connection per partition instead of one per row).
Typically, Spark UDFs work well when you apply business logic that is Spark-friendly and does not mix in third-party dependencies.
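A minimal sketch of that per-partition loading pattern (the RDD name token_rdd and its contents are assumptions; adapt it to however the tokens come out of the earlier tokenizing step):

import gensim.downloader as api

def embed_partition(token_lists):
    # Load the heavy model once per partition instead of once per row
    model = api.load('word2vec-google-news-300')
    for words in token_lists:
        yield [model[word].tolist() for word in words if word in model]

# token_rdd is an assumed RDD where each element is a list of tokens
vectors_rdd = token_rdd.mapPartitions(embed_partition)

Combined with keeping partitions small enough, as suggested above, this bounds the per-partition memory footprint at the cost of one model load per partition.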

Python multiprocessing tool vs Py(Spark)

A newbie question, as I am getting increasingly confused with pyspark. I want to scale an existing Python data preprocessing and data analysis pipeline. I realize that if I partition my data with pyspark, I can't treat each partition as a standalone pandas DataFrame anymore and need to learn to manipulate it with pyspark.sql row/column functions, which means changing a lot of existing code; in addition, I am bound to Spark's mllib libraries and can't take full advantage of the more mature scikit-learn package. Then why would I ever need to use Spark if I can use multiprocessing tools for cluster computing and parallelize tasks on the existing dataframe?
True, Spark does have the limitations you mentioned, that is, you are bound to the functional Spark world (Spark MLlib, DataFrames, etc.). However, what it provides over other multiprocessing tools/libraries is the automatic distribution, partitioning, and rescaling of parallel tasks. Scaling and scheduling Spark code becomes an easier task than programming your own custom multiprocessing code to respond to larger amounts of data and computation.

Dealing with big data to perform random forest classification

I am currently working on my thesis, which involves dealing with quite a sizable dataset: roughly 4 million observations and around 260 thousand features. It is a dataset of chess games, where most of the features are player dummies (130k for each colour).
As for the hardware and the software, I have around 12GB of RAM on this computer. I am doing all my work in Python 3.5 and use mainly pandas and scikit-learn packages.
My problem is that I obviously can't load this amount of data into RAM. What I would love to do is generate the dummy variables, slice the database into a thousand or so chunks, apply the random forest to each, and aggregate the results again.
However, to do that I would need to be able to create the dummy variables first, which I cannot do due to memory errors, even if I use sparse matrices. Theoretically, I could slice up the database first and then create the dummy variables, but the effect of that would be different features for different slices, so I'm not sure how to aggregate such results.
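For illustration, here is a rough sketch of what encoding the slices against a fixed, precomputed list of player IDs might look like (all file, column, and ID names below are placeholders); whether this is a sensible direction is part of what I am asking:

import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Placeholder ID lists; in practice these would be collected in one cheap
# pass over the raw file so every chunk is encoded against the same columns
white_ids = ["white_player_1", "white_player_2", "white_player_3"]
black_ids = ["black_player_1", "black_player_2", "black_player_3"]

encoder = OneHotEncoder(categories=[white_ids, black_ids], handle_unknown="ignore")

chunk_matrices = []
for chunk in pd.read_csv("games.csv", chunksize=100000):
    # With explicit categories, every chunk produces the same dummy columns
    X_chunk = encoder.fit_transform(chunk[["white_player", "black_player"]])
    chunk_matrices.append(X_chunk)

X = sparse.vstack(chunk_matrices)  # identical columns, so stacking is valid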
My questions:
1. How would you guys approach this problem? Is there a way to "merge" the results of my estimation despite having different features in different "chunks" of data?
2. Perhaps it is possible to avoid this problem altogether by renting a server. Are there any trial versions of such services? I'm not sure exactly how much CPU/RAM I would need to complete this task.
Thanks for your help, any kind of tips will be appreciated :)
I would suggest you give CloudxLab a try.
Though it is not free, it is quite affordable ($25 for a month). It provides a complete environment to experiment with various tools such as HDFS, MapReduce, Hive, Pig, Kafka, Spark, Scala, Sqoop, Oozie, Mahout, MLlib, ZooKeeper, and R. Many of the popular trainers use CloudxLab.
