I have a python script where I'm using pandas for transformations/manipulation of my data. I know I have some "inefficient" blocks of code. My question is, if pyspark is supposed to be much faster, can I just replace these blocks using pyspark instead of pandas or do I need everything to be in pyspark? If I'm in Databricks, how much does this really matter since it's already on a spark cluster?
If the data is small enough that you can use pandas to process it, then you likely don't need pyspark. Spark is useful when you have such large data sizes that it doesn't fit into memory in one machine since it can perform distributed computation. That being said, if the computation is complex enough that it could benefit from a lot of parallelization, then you could see an efficiency boost using pyspark. I'm more comfortable with pyspark's APIs than pandas, so I might end up using pyspark anyways, but whether you'll see an efficiency boost depends a lot on the problem.
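One rough way to judge whether the data is "small enough" for pandas is to measure how much memory a DataFrame actually occupies. The snippet below is a minimal sketch of that check; the DataFrame contents and the helper name `dataframe_megabytes` are made up for illustration and are not from the original question.

```python
import numpy as np
import pandas as pd


def dataframe_megabytes(df: pd.DataFrame) -> float:
    """Return the in-memory size of a DataFrame in megabytes."""
    # deep=True also counts the Python objects held in object columns.
    return df.memory_usage(deep=True).sum() / 1e6


# Hypothetical stand-in data: 100,000 rows, one float and one integer column.
df = pd.DataFrame({
    "value": np.random.default_rng(0).random(100_000),
    "group": np.arange(100_000) % 10,
})

size_mb = dataframe_megabytes(df)
print(f"{size_mb:.2f} MB in memory")
```

If the reported size is a small fraction of the machine's RAM, staying with pandas is usually reasonable; intermediate copies made during transformations can need several times that amount, so leave some headroom.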
pandas runs operations on a single machine, whereas PySpark can run across multiple machines. If you are working on a machine learning application with larger datasets, PySpark is often the better fit and can process operations many times faster than pandas (figures as high as 100x are sometimes quoted, though the actual speedup depends on the workload and the cluster).
PySpark is very efficient for processing large datasets. You can still convert a Spark DataFrame to a pandas DataFrame after preprocessing and data exploration in order to train machine learning models with sklearn.
Let's compare apples with apples please: pandas is not an alternative to pyspark, as pandas cannot do distributed computing or out-of-core computations. What you can pit Spark against is Dask on Ray Core (see docs), and you don't even have to learn a different API like you would with Spark, as Dask is intended to be a distributed drop-in replacement for pandas and numpy (and so is Dask ML for popular ML packages such as scikit-learn and xgboost).
Most of the benchmarks have Dask and cuDF isolated, but I can use them together. Wouldn't Dask with cuDF be faster than Polars?
Also, Polars only runs if the data fits in memory, but this isn't the case with Dask. So why does https://h2oai.github.io/db-benchmark/ show an out-of-memory indication for Dask?
Different dataframe libraries have their strengths and weaknesses. For example, see this blog post for a comparison of different libraries, esp. from a scaling pandas perspective.
Dask DataFrame comes with some default assumptions on how best to divide the workload among multiple tasks. If these assumptions are not valid for the particular use case, then it's not uncommon to see memory-related errors.
I am trying to load gigabytes of data from Google Cloud Storage or Google BigQuery into a pandas DataFrame so that I can attempt to run scikit-learn's OneClassSVM and Isolation Forest (or any other unary or PU classification). So I tried pandas-gbq, but attempting to run
pd.read_gbq(query, 'my-super-project', dialect='standard')
causes my machine to SIGKILL the process when it is only 30% complete. I can't load the data locally either: my machine does not have enough space, and it does not sound reasonably efficient anyway.
I have also tried
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('bucket-id-here')
# Then do other things...
blob = bucket.get_blob('remote/path/to/file.txt')
print(blob.download_as_string())
With this I can load 1/10 or 1/5 of my available data, but then my machine eventually tells me that it ran out of memory.
TLDR: Is there a way that I can run my custom code (with numpy, pandas, and even TensorFlow) in the cloud or on some faraway supercomputer where I can easily and efficiently load data from Google Cloud Storage or Google BigQuery?
I don't quite think you are going in the right direction. I'll try to explain how I usually work with data and hopefully this gives you some insights.
I first tend to work with small datasets, either by applying some sampling technique or by querying for fewer days. In this step, it's OK to use pandas or other tools developed for small data to build models, compute some statistics, find moments and so on.
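The sampling step can be done without ever holding the full file in memory. The following is a minimal sketch, assuming the data is available as CSV; the column names, the in-memory `csv_text` stand-in, and the helper `sample_csv` are hypothetical and only there to make the example self-contained.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large file: 1,000 rows of CSV text held in memory.
csv_text = "day,score\n" + "\n".join(f"{i % 30},{i * 0.5}" for i in range(1000))


def sample_csv(source, fraction, chunksize=100, seed=0):
    """Read a CSV chunk by chunk and keep a random fraction of each chunk."""
    sampled = [
        # A different seed per chunk avoids picking the same row positions every time.
        chunk.sample(frac=fraction, random_state=seed + i)
        for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize))
    ]
    return pd.concat(sampled, ignore_index=True)


small_df = sample_csv(io.StringIO(csv_text), fraction=0.1)
print(len(small_df), "rows kept for exploration")
```

Only one chunk plus the accumulated sample is in memory at any time, so the same pattern applies to a file path in place of the `io.StringIO` object.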
Once I am acquainted with the data, I start working with Big Data tools.
Specifically, I have a very small Dataproc cluster where I've already set up a Jupyter notebook to run pyspark code.
The total memory of your cluster will have to surpass the total memory you are using as input.
Using either pandas or Spark DataFrames should be straightforward for you; as you can see in this blog post by Databricks, Spark already offers this feature.
After that comes implementing the algorithms. Spark already offers some built-in algorithms out of the box that you can play around with.
If the algorithms you want to implement are not available, you can either open a request in their repository or build them yourself (you can use Python's SciPy implementation as a guide and port it to the Spark environment).
Here's an example of how I load data for one of the algorithms I use to build a recommender system for our company:
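from pyspark.sql import functions as sfunc
from pyspark.sql import types as stypes
# Declare the schema up front so Spark does not have to infer column types.
schema = stypes.StructType().add("fv", stypes.StringType()).add("sku", stypes.StringType()).add("score", stypes.FloatType())
# `spark` is assumed to be the SparkSession already available in the notebook.
train_df = spark.read.csv('gs://bucket_name/pyspark/train_data*.gz', header=True, schema=schema)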
Spark will automatically distribute this data across the different workers you have available in your cluster. After that I mainly run queries and map / reduce steps to get correlations between skus.
As for keeping your current code, it probably won't scale to big data as it stands. You can nevertheless find lots of resources for combining the power of numpy with Spark, as in this example for instance.
I am trying to find a means of starting to work with very large CSV files in Pandas, ultimately to be able to do some machine learning with XGBoost.
I am torn between using MySQL or some SQLite framework to manage chunks of my data; my issue is with the machine learning aspect later on, and with loading chunks one at a time to train the model.
My other thought was to use Dask, which is built on top of pandas and also has XGBoost functionality.
I'm not sure what the best starting point is and was hoping to ask for an opinion! I am leaning towards Dask but I have not used it yet.
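For the database option raised in the question, the chunk-at-a-time loading can be sketched with the standard-library `sqlite3` module and the `chunksize` argument of `pandas.read_sql_query`. This is only an illustration: the table name `training_rows` and its columns are hypothetical, and the loop computes a running mean instead of training a model, since whether a given model can learn incrementally is a separate matter.

```python
import sqlite3

import numpy as np
import pandas as pd

# Hypothetical stand-in table: 1,000 rows in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "feature": np.arange(1000, dtype="float64"),
    "label": np.arange(1000) % 2,
}).to_sql("training_rows", conn, index=False)

# chunksize turns the result into an iterator of DataFrames instead of one
# large frame, so only one chunk needs to be in memory at a time.
rows_seen = 0
feature_total = 0.0
for chunk in pd.read_sql_query("SELECT * FROM training_rows", conn, chunksize=250):
    rows_seen += len(chunk)
    feature_total += chunk["feature"].sum()

conn.close()
feature_mean = feature_total / rows_seen
print(rows_seen, feature_mean)
```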
This blog post goes through an example using XGBoost on a large CSV dataset. However, it did so by using a distributed cluster with enough RAM to fit the entire dataset in memory at once. While many dask.dataframe operations can operate in small space, I don't think that XGBoost training is likely to be one of them. XGBoost seems to operate best when all data is available all the time.
I haven't tried this, but I would load your data into an HDF5 file using h5py. This library lets you store data on disk but access it like a numpy array, so you are no longer constrained by memory for your dataset.
For the XGBoost part, I would use the sklearn API and pass in the h5py object as the X value. I recommend the sklearn API since it accepts numpy-like arrays as input, which should let h5py objects work. Make sure to use a small value for subsample, otherwise you'll likely run out of memory fast.
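The idea of an array that lives on disk but is sliced like a numpy array can be sketched as follows. Note that this is a stand-in and not the h5py approach itself: it uses `numpy.memmap`, which illustrates the same access pattern without an extra dependency. The file name, shape, and fill values are hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical file name inside a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "features.dat")

# Create a disk-backed array of 10,000 rows x 4 columns and fill it in blocks.
n_rows, n_cols = 10_000, 4
on_disk = np.memmap(path, dtype="float32", mode="w+", shape=(n_rows, n_cols))
for start in range(0, n_rows, 1000):
    on_disk[start:start + 1000] = start  # each block is written separately
on_disk.flush()
del on_disk

# Reopen read-only; slicing pulls only the requested rows into memory.
features = np.memmap(path, dtype="float32", mode="r", shape=(n_rows, n_cols))
block = np.asarray(features[2000:2005])
print(block.shape)
```

Whether a particular training routine reads such an object lazily or copies it into memory in full depends on the library, so that part still needs to be checked in practice.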
A newbie question, as I get increasingly confused with pyspark. I want to scale an existing python data preprocessing and data analysis pipeline. I realize if I partition my data with pyspark, I can't treat each partition as a standalone pandas data frame anymore, and need to learn to manipulate with pyspark.sql row/column functions, and change a lot of existing code, plus I am bound to spark mllib libraries and can't take full advantage of more mature scikit-learn package. Then why would I ever need to use Spark if I can use multiprocessing tools for cluster computing and parallelize tasks on existing dataframe?
True, Spark does have the limitations you mention: you are bound to the Spark world (Spark MLlib, DataFrames, etc.). However, what it provides over other multiprocessing tools and libraries is the automatic distribution, partitioning, and rescaling of parallel tasks. Scaling and scheduling Spark code is easier than writing custom multiprocessing code that has to cope with larger amounts of data and computation.
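To make the trade-off concrete, here is a minimal sketch of the bookkeeping that falls to you when parallelizing a pandas workload by hand: partitioning, scheduling, and combining. The data, the `summarize` function, and the partition count are hypothetical, and a thread pool is used only to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def summarize(part: pd.DataFrame) -> pd.Series:
    """Per-partition work: sum of `value` for each `key`."""
    return part.groupby("key")["value"].sum()


# Hypothetical stand-in data.
df = pd.DataFrame({"key": np.arange(1000) % 4, "value": np.ones(1000)})

# Step 1: partition by hand. The number of partitions is a manual choice here.
n_parts = 4
bounds = np.linspace(0, len(df), n_parts + 1, dtype=int)
parts = [df.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]

# Step 2: schedule the work. A process pool or a cluster scheduler would take
# the place of this thread pool for CPU-bound work.
with ThreadPoolExecutor(max_workers=n_parts) as pool:
    partials = list(pool.map(summarize, parts))

# Step 3: combine partial results, since a key can appear in several partitions.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals.to_dict())
```

Each of the three steps has to be revisited when the data grows or the cluster changes, which is the part a framework like Spark handles for you.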
What is the difference between the pyspark mllib and pyspark ml packages?
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html
pyspark ml appears to target algorithms at the DataFrame level, unlike pyspark mllib.
One difference I found is pyspark ml implements pyspark.ml.tuning.CrossValidator while pyspark mllib does not.
My understanding is that mllib is the library to use when implementing algorithms on the Apache Spark framework, but there appears to be a split?
There does not appear to be interoperability between each of the frameworks without transforming types as they each contain a different package structure.
In my experience, pyspark.mllib classes can only be used with pyspark.RDDs, whereas (as you mention) pyspark.ml classes can only be used with pyspark.sql.DataFrames. The documentation for pyspark.ml supports this; the first entry in the pyspark.ml package states:
DataFrame-based machine learning APIs to let users quickly assemble and configure practical machine learning pipelines.
Now I am reminded of an article I read a while back regarding the three APIs available in Spark 2.0, their relative benefits and drawbacks, and their comparative performance: A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets. I was in the midst of doing performance testing on new client servers and was interested in whether there would ever be a scenario in which it would be worth developing an RDD-based approach as opposed to a DataFrame-based approach (my approach of choice), but I digress.
The gist was that there are situations for which each is highly suited and others where it might not be. One example I remember is that if your data is already structured, DataFrames confer some performance benefits over RDDs, and this apparently becomes drastic as the complexity of your operations increases. Another observation was that Datasets and DataFrames consume far less memory when caching than RDDs. In summation, the author concluded that for low-level operations RDDs are great, but for high-level operations, viewing, and tying in with other APIs, DataFrames and Datasets are superior.
So to come back full circle to your question, I believe the answer is a resounding pyspark.ml, as the classes in this package are designed to utilize pyspark.sql.DataFrames. I would imagine that the performance difference for complex algorithms implemented in each of these packages would be significant if you were to test against the same data structured as a DataFrame vs. an RDD. Furthermore, viewing the data and developing compelling visuals would be both more intuitive and have better performance.