Spark - is map function available for Dataframe or just RDD? - python

I just realized that I can do the following in Scala:
val df = spark.read.csv("test.csv")
val df1 = df.map(x => x(0).asInstanceOf[String].toLowerCase)
However, in Python, calling map on a DataFrame throws an error:
df = spark.read.csv("Downloads/test1.csv")
df.map(lambda x: x[1].lower())
Error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/apache-spark/2.4.3/libexec/python/pyspark/sql/dataframe.py", line 1300, in __getattr__
"'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'map'
In Python I need to explicitly convert the DataFrame to an RDD first.
My question is: why do I need to do this in Python?
Is this a difference in the Spark API implementations, or does Scala implicitly convert the DataFrame to an RDD and then back to a DataFrame?

The Python DataFrame API doesn't have a map function because of how the Python API works.
In Python, every time you convert to an RDD or use a UDF, you introduce a call into a Python process during execution.
What does that mean? It means that during Spark execution, instead of all the data being processed inside the JVM by the generated Scala code (the DataFrame API), the JVM has to call out to Python to apply the logic you wrote. That, by default, creates a huge overhead during execution.
So the approach taken for Python was to build an API that blocks the use of arbitrary Python code and only runs the generated Scala code of the DataFrame pipeline.
This article will help you understand how UDFs work with Python, which is basically very close to how RDD maps work with Python: https://medium.com/wbaa/using-scala-udfs-in-pyspark-b70033dd69b9
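In PySpark you can either drop down to the RDD explicitly, or stay on the DataFrame API with a built-in column function so the work never leaves the JVM. A minimal sketch of both options, assuming the headerless CSV from the question (so the second column gets the default name _c1):
from pyspark.sql import SparkSession
from pyspark.sql.functions import lower, col

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("Downloads/test1.csv")

# Option 1: explicit conversion to an RDD, then map (ships rows to Python workers)
lowered_rdd = df.rdd.map(lambda row: row[1].lower())

# Option 2: stay on the DataFrame API with a built-in function (runs inside the JVM)
lowered_df = df.select(lower(col("_c1")))
Option 2 is the one the DataFrame API pushes you towards, precisely to avoid the Python round trip described above.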

Related

module 'pandas' has no attribute 'read_csv': AttributeError

I have written an AWS Lambda function which uses pandas to handle a dataframe. When I tested this Lambda function I got the error "No module named pandas".
I then added pandas and the other dependency libraries to the library folder of my repository.
Now I am facing another issue which I am unable to solve.
Current error:
module 'pandas' has no attribute 'read_csv': AttributeError
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 127, in lambda_handler
initial_df = pd.read_csv(obj['Body']) # 'Body' is a key word
AttributeError: module 'pandas' has no attribute 'read_csv'
I checked the solutions available on this site, like: module 'pandas' has no attribute 'read_csv'.
I don't have pandas.py or csv.py in my pandas folder, but rather test_to_csv.py, csvs.py and test_pandas.py, which is what the discussion in the link above requires.
I am unable to figure out a way here.
Pandas is indeed not available by default on AWS lambda.
If you want to use pandas with AWS Lambda, the easiest way is to use the AWS Data Wrangler layer.
When you add a new layer, select AWS layers, then in the dropdown menu select the AWSDataWrangler-Python39 one.
Once you have added the layer, you will be able to use pandas as usual.
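For example, with the layer attached, a handler along these lines should work; this is a minimal sketch, where the bucket and key names are placeholder assumptions modelled on the read_csv(obj['Body']) call from the question:
import boto3
import pandas as pd

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # hypothetical bucket/key; take them from the event in a real setup
    obj = s3.get_object(Bucket="my-bucket", Key="data/test.csv")
    # 'Body' is a streaming object that pandas can read directly
    initial_df = pd.read_csv(obj["Body"])
    return {"rows": len(initial_df)}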

Having trouble converting a pyspark dataframe to a scala dataframe and passing it to a scala function

I am trying to submit a pyspark application with a scala lib/jar as dependency. I pass this scala jar via the --jars parameter when submitting the pyspark job on GCP Dataproc.
In my python driver program, I have a pyspark dataframe df. When I check its type, it shows what is expected
print(type(df)) -> <class 'pyspark.sql.dataframe.DataFrame'>
The scala jar has a function which takes input a scala spark dataframe. To pass the pyspark dataframe df to this scala function, I use the ._jdf attribute -> df._jdf
But I meet with this error:
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1296, in __call__
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1266, in _build_args
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1266, in <listcomp>
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 298, in get_command_part
AttributeError: 'JavaMember' object has no attribute '_get_object_id'
I think this is because df._jdf is not of type 'spark.sql.DataFrame' but of the type below:
print(type(df._jdf)) -> <class 'py4j.java_gateway.JavaMember'>
Is df._jdf not the correct way to convert a pyspark dataframe to a Scala one, or is there a better alternative way to achieve what I am trying to do?
I am following these sources:
https://diogoalexandrefranco.github.io/scala-code-in-pyspark/
https://www.crowdstrike.com/blog/spark-hot-potato-passing-dataframes-between-scala-spark-and-pyspark/
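For reference, the pattern both of those posts describe looks roughly like the sketch below; the Scala class and method names are hypothetical placeholders, and in a working setup df._jdf is the underlying Java DataFrame rather than a JavaMember:
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("gs://some-bucket/test.csv")  # placeholder input path

# call into the Scala jar passed via --jars (class and method names are hypothetical)
result_jdf = spark._jvm.com.example.MyScalaLib.process(df._jdf)

# wrap the returned Java DataFrame back into a PySpark DataFrame
result_df = DataFrame(result_jdf, df.sql_ctx)  # on recent Spark versions the SparkSession can be passed instead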

Pandas read_csv error while using to_dict

Previously I was using Python's DictReader to read and load each row into a Python dict, but now, due to some design changes, I have to use pandas to do the same. While trying to achieve the same behaviour I am using this code:
for idx,row_r in enumerate(pd.read_csv(input_file_path,chunksize=10000, skiprows=skip_n_rows).to_dict()):
But I am getting this error:
AttributeError: 'TextFileReader' object has no attribute 'to_dict'
python 3.7.6
pandas 0.23.0
P.S. I am using chunksize because the CSV file is large.
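With chunksize, pandas.read_csv returns a TextFileReader, an iterator over DataFrame chunks, not a DataFrame itself, so to_dict has to be called on each chunk. A minimal sketch, reusing input_file_path and skip_n_rows from the question, with process() as a hypothetical per-row handler:
import pandas as pd

reader = pd.read_csv(input_file_path, chunksize=10000, skiprows=skip_n_rows)

for chunk in reader:  # each chunk is a regular DataFrame
    for row in chunk.to_dict(orient="records"):  # one dict per row, like DictReader
        process(row)  # hypothetical row handler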

StringIndexerModel inputCol

I have a cluster with Spark 2.1 and a process that, at the end, writes a PipelineModel to file, which contains a StringIndexerModel. I can locally (with Spark 2.3) load the pipeline and inspect the StringIndexerModel. What appears very strange is that the methods and fields differ between the two versions, even though they read the same files. In particular, with Spark 2.1 the inputCol field appears not to be there, even though it's obviously needed for the StringIndexer to work.
This is what I get.
Spark 2.1:
pip1 = PipelineModel.load("somepath")
si = pip1.stages[0]
si
#StringIndexer_494eb1f86ababc8540e2
si.inputCol
#Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
#AttributeError: 'StringIndexerModel' object has no attribute 'inputCol'
Spark 2.3
pip1 = PipelineModel.load("somepath")
si = pip1.stages[0]
si
#StringIndexer_494eb1f86ababc8540e2
si.inputCol
#Param(parent='StringIndexer_494eb1f86ababc8540e2', name='inputCol', doc='input column name')
I understand that methods and fields might change from one version to another, but the inputCol must be somewhere in the object, since it is essential for fit or transform to work. Is there a way to extract the inputCol in Spark 2.1 with PySpark?
The heavy lifting in Spark ML is done by the internal Java objects (_java_obj); that's why the objects can work even though their internals are never fully exposed in the Python API. This of course limits what can be done without drilling into the Java API, and it is only since Spark 2.3 that Params are exposed in PySpark models (SPARK-10931).
In previous versions you can access the internal model and fetch the data from there. However, if you want to get the value of a Param, you should use its get* method, not the Param itself.
si._java_obj.getInputCol()
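So on Spark 2.1 you go through the wrapped Java model, while on 2.3+ you can stay on the Python side; a short sketch of both, assuming the pipeline layout from the question:
from pyspark.ml import PipelineModel

pip1 = PipelineModel.load("somepath")
si = pip1.stages[0]

# Spark 2.1: read the value from the wrapped Java model
input_col = si._java_obj.getInputCol()

# Spark 2.3+: Params are exposed in PySpark, so the Param can be resolved directly
input_col = si.getOrDefault(si.inputCol)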
Related:
pyspark: getting the best model's parameters after a gridsearch is blank {}
How to get best params after tuning by pyspark.ml.tuning.TrainValidationSplit?
Getting labels from StringIndexer stages within pipeline in Spark (pyspark)

Convert list to RDD

I am trying to work on a samplecsv.csv file (64 MB) in pyspark.
This code generates the error: AttributeError: 'list' object has no attribute 'saveAsTextFile'
I think I have already converted the list to an RDD using parallelize. If not, how is it done?
file = sc.textFile('/user/project/samplecsv.csv',5)
rdd = file.map(lambda line: (line.split(',')[0], line.split(',')[1],
line.split(',')[2], line.split(',')[3],
line.split(',')[4])).collect()
temp = sc.parallelize([rdd], numSlices=50000).collect()
temp.saveAsTextFile("/user/project/newfile.txt")
Your problem is that you called collect on the parallelized RDD, turning it back into a normal Python list.
Also, you should not be calling collect at each step unless you are doing it for testing or debugging. Otherwise you are not taking advantage of Spark's computing model.
# loads the file as an rdd
file = sc.textFile('/user/project/samplecsv.csv',5)
# builds a computation graph
rdd = file.map(lambda line: (line.split(',')[0], line.split(',')[1],
line.split(',')[2], line.split(',')[3],
line.split(',')[4]))
# saves the rdd to the filesystem
rdd.saveAsTextFile("/user/project/newfile.txt")
Also, you can make the code more efficient by splitting the line only once, as sketched below.
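For example, a minimal single-split version, keeping the same paths as above:
# split each line once and keep the first five fields
rdd = file.map(lambda line: tuple(line.split(',')[:5]))
rdd.saveAsTextFile("/user/project/newfile.txt")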
I think you should try the code below; it will solve your purpose:
file = sc.textFile("C://Users/Ravi/Desktop/test.csv",5)
rdd = file.map(lambda line: (line.split(',')[0], line.split(',')[1],
line.split(',')[2], line.split(',')[3]))
rdd.coalesce(1).saveAsTextFile("C://Users/Ravi/Desktop/temp")
If you want a partitioned output file, don't use coalesce.
