Python script hangs on input() when running under Spark

I'm running a simple Python script in Spark. When I use the input() function, the running script gets stuck on that line.
Here is the code:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf = conf)
testFileName = input("enter file name: ");
print("okey I'll open " + testFileName)
# load an RDD from a test file
fileRDD = sc.textFile(testFileName)

When you submit an application with spark-submit you don't interact with the Python process directly but with the Java launcher, which doesn't expect any input from stdin.
If you want to make it work you have to skip spark-submit and execute this as a plain Python script directly.
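If you do need spark-submit, one common workaround (a minimal sketch, not part of the answer above) is to take the file name from the command line instead of stdin:
import sys
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)

# e.g. spark-submit my_script.py /path/to/testfile.txt
# (falling back to a hypothetical default name if no argument is given)
testFileName = sys.argv[1] if len(sys.argv) > 1 else "testfile.txt"
print("okay, I'll open " + testFileName)

# load an RDD from the test file
fileRDD = sc.textFile(testFileName)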

Related

How to load an SQL file in Spark using Python

My PySpark version is 2.4 and my Python version is 2.7. I have a multi-line SQL file which needs to run in Spark. Instead of running it line by line, is it possible to keep the SQL file next to the Python script (which initializes Spark) and execute it using spark-submit? I am trying to write a generic Python script so that we only need to swap the SQL file in the HDFS folder later. Below is my code snippet.
import sys
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
args = str(sys.argv[1]).split(',')
fd = args[0]
ld = args[1]
sd = args[2]
#Below line does not work
df = open("test.sql")
query = df.read().format(fd,ld,sd)
#Initiating SparkSession.
spark = SparkSession.builder.appName("PC").enableHiveSupport().getOrCreate()
#Below line works fine
df_s=spark.sql("""select * from test_tbl where batch_date='2021-08-01'""")
#Execute the sql (Does not work now)
df_s1=spark.sql(query)
spark-submit throws the following error for the above code:
Exception in thread "main" org.apache.spark.SparkException:
Application application_1643050700073_7491 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1158)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1606)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:847)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:922)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:931)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
22/02/10 01:24:52 INFO util.ShutdownHookManager: Shutdown hook called
I am relatively new to PySpark. Can anyone please guide me on what I am missing here?
You can't open an HDFS path with plain Python file I/O; open() only sees the local filesystem. If you want to run an SQL statement stored in a file on HDFS, you first have to copy that file from HDFS to the local directory the driver runs in.
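For example, a minimal sketch of that copy step inside the driver script (the paths and the availability of the hdfs CLI on the driver node are assumptions):
import subprocess

# Copy test.sql out of HDFS into the driver's current working directory
# so that the plain open("test.sql") call afterwards can find it
subprocess.check_call(["hdfs", "dfs", "-get", "/test/scripts/test.sql", "test.sql"])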
Referring to the Spark 2.4.0 documentation, you can simply use the PySpark SQL API:
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("YOUR QUERY").show()
or query files directly with:
df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")

Is there a way to read a CSV file in HDFS into a Python DataFrame using InsecureClient in a Jupyter notebook?

I have a CSV file located in HDFS on a remote server.
I want to read the CSV file into a pandas DataFrame using InsecureClient, however I keep getting an error.
1st attempt:
code:
from hdfs import InsecureClient
client_hdfs = InsecureClient('hdfs://host:port', user=user)
with client_hdfs.read('path/to/csv.csv') as reader:
    print(reader)
error:
InvalidSchema: No connection adapters were found for
'host:port:path/to/csv.csv'
I have verified that the path is correct by running 'hdfs -ls path/to/csv.csv' on the server and viewing the file there and have obtained the host name by running 'uname -n' on the server.
2nd attempt (I created a new test file with the contents "this is a test file", placed it in the same HDFS location and tried to read it):
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("myFirstApp").setMaster("local")
sc = SparkContext(conf=conf)
path = "hdfs://host:port/path/to/textfile.txt"
df = sc.textFile(path)
df.first()
Error:
runs indefinitely without ever returning a result
The module you're using connects to WebHDFS (over the http protocol), not to the HDFS NameNode RPC port (the hdfs:// protocol).
Docs - https://hdfscli.readthedocs.io/en/latest/quickstart.html#instantiating-a-client
PySpark, on the other hand, does allow connecting over hdfs://. Also, you should use Spark DataFrames rather than pandas directly - https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html
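Putting that together, a minimal sketch for the first attempt (the WebHDFS port and the user name are assumptions; 9870 is the default on Hadoop 3, 50070 on Hadoop 2):
import pandas as pd
from hdfs import InsecureClient

# WebHDFS speaks HTTP, so the client needs an http:// URL, not hdfs://
client_hdfs = InsecureClient('http://host:9870', user='my_user')  # hypothetical user name

with client_hdfs.read('path/to/csv.csv') as reader:
    df = pd.read_csv(reader)

print(df.head())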

How to run a Python program automatically even if the system restarts

I want to run my Python API continuously (meaning the API stays active at all times and responds whenever it is called), and if the system restarts, my Python API should start again automatically.
I have the API URL http://localhost:8002/city_id_pred?id=1,2, and the Python API is called through this URL.
Program:
import web
import pyodbc
import re
import numpy as np
#from wordcloud import WordCloud, STOPWORDS
from collections import Counter
from sklearn.externals import joblib
import pandas as pd

cnxn = pyodbc.connect('')
cursor = cnxn.cursor()

urls = (
    '/city_id_pred?', 'Predict'
    #'/', 'MyApplication'
)

class Predict(web.application):
    def run(self, port=8080, *middleware):
        func = self.wsgifunc(*middleware)
        return web.httpserver.runsimple(func, ('0.0.0.0', port))
        print("Start class...")

    def GET(self):
        #here prediction model

if __name__ == "__main__":
    app = Predict(urls, globals())
    app.run(port=8002)
Please suggest a solution.
Actually, I want to run this on Windows Server eventually, but currently I am using a desktop Windows OS.
As @Mubarak said, you basically want to convert it to a .exe and then add that .exe to startup.
I would recommend that you do this by using PyInstaller and then following this tutorial on how to add that .exe to your startup.
The following steps will help you:
Save your Python code as, for example, example.py
Convert example.py to an example.exe file using auto-py-to-exe (https://pypi.org/project/auto-py-to-exe/)
Open Task Scheduler on your Windows system
Create Task -> General tab -> give it a name and location
Triggers tab -> Begin the task -> At startup
Hope this helps
Follow these steps:
1. Convert your Python file to .exe format (https://pypi.org/project/auto-py-to-exe/)
2. Make a simple batch file that runs the .exe built from your Python file. A generic example that starts several programs looks like this:
@echo off
cd "C:\Program Files\Google\Chrome\Application\"
Start chrome.exe
start – "C:\Program Files\Microsoft Office\Office15\WINWORD.EXE"
"C:\Work\MUO\How to Batch Rename.docx"
cd "C:\Program Files (x86)\VMware\VMware Player"
start vmplayer.exe
exit
Hope you understood.
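As another option (a minimal sketch of my own, not from the answers above; the .exe path is hypothetical), a small Python script can register the PyInstaller-built executable under the per-user Run registry key so it starts whenever the user logs in:
import winreg

def add_to_startup(exe_path, name="city_id_api"):
    # Programs listed under this per-user key are launched at every login
    key = winreg.OpenKey(
        winreg.HKEY_CURRENT_USER,
        r"Software\Microsoft\Windows\CurrentVersion\Run",
        0,
        winreg.KEY_SET_VALUE,
    )
    winreg.SetValueEx(key, name, 0, winreg.REG_SZ, exe_path)
    winreg.CloseKey(key)

if __name__ == "__main__":
    # hypothetical path to the executable built from the API script
    add_to_startup(r"C:\apps\city_id_pred\example.exe")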

How to execute an HQL script with a transform Python UDF in Spark?

I am new to Spark and learning through a POC. As part of this POC, I am trying to execute an HQL file directly; it uses the transform keyword to call a Python UDF.
I have tested the HQL script on the Hive CLI with "hive -f filename.hql" and it works fine.
I tried the same script in spark-sql but it fails with an "hdfs path not found" error. I tried to give the HDFS path in different ways, as below, but none of them work:
"/test/scripts/test.hql"
"hdfs://test.net:8020/test/scripts/test.hql"
"hdfs:///test.net:8020/test/scripts/test.hql"
I also tried giving the complete path in the Hive transform clause, as below:
USING "scl enable python27 'python hdfs://test.net:8020/user/test/scripts/TestPython.py'"
Hive Code
add file hdfs://test.net:8020/user/test/scripts/TestPython.py;
select * from
(select transform (*)
USING "scl enable python27 'python TestPython.py'"
as (Col_1 STRING,
col_2 STRING,
...
..
col_125 STRING
)
FROM
test.transform_inner_temp1 a) b;
TestPython code:
#!/usr/bin/env python
'''
Created on June 2, 2017
#author: test
'''
import sys
from datetime import datetime
import decimal
import string
D = decimal.Decimal
for line in sys.stdin:
    line = sys.stdin.readline()
    TempList = line.strip().split('\t')
    col_1 = TempList[0]
    ...
    ....
    col_125 = TempList[34] + TempList[32]
    outList.extend((col_1,....col_125))
    outValue = "\t".join(map(str,outList))
    print "%s"%(outValue)
So I tried another method: executing it directly with spark-submit:
spark-submit --master yarn-cluster hdfs://test.net:8020/user/test/scripts/testspark.py
testspark.py
from pyspark.sql.types import StringType
from pyspark import SparkConf, SparkContext
from pyspark import SQLContext
conf = SparkConf().setAppName("gveeran pyspark test")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
with open("hdfs://test.net:8020/user/test/scripts/test.hql") as fr:
query = fr.read()
results = sqlContext.sql(query)
results.show()
But I hit the same kind of issue again, as below:
Traceback (most recent call last):
File "PySparkTest2.py", line 7, in <module>
with open("hdfs://test.net:8020/user/test/scripts/test.hql") as fr:
IOError: [Errno 2] No such file or directory: 'hdfs://test.net:8020/user/test/scripts/test.hql'
You can read the file contents as a query string and then execute it as a Spark SQL job.
Example:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)

with open("/home/hadoop/test/abc.hql") as fr:
    query = fr.read()

print(query)
results = sqlCtx.sql(query)
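Note that this works only when the .hql file is on the local filesystem, because the built-in open() cannot read hdfs:// paths. If the file has to stay in HDFS, one option (a sketch under that assumption, reusing the path from the question) is to let Spark itself read it:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hql runner").enableHiveSupport().getOrCreate()

# wholeTextFiles yields (path, content) pairs; take the content of the single file
query = spark.sparkContext.wholeTextFiles(
    "hdfs://test.net:8020/user/test/scripts/test.hql"
).first()[1]

results = spark.sql(query)
results.show()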

How to load --jars with PySpark on a standalone cluster in client mode

I am using Python 2.7 with a Spark standalone cluster in client mode.
I want to use JDBC for MySQL and found that I need to load the driver jar using the --jars argument. I have the JDBC jar locally and manage to load it with the pyspark console like here.
When I write a Python script inside my IDE using pyspark, I don't manage to load the additional jar mysql-connector-java-5.1.26.jar and keep getting a
no suitable driver
error.
How can I load additional jar files when running a Python script in client mode, using a standalone cluster and referring to a remote master?
Edit: added some code.
This is the basic code that I am using. I use pyspark with a SparkContext in Python, i.e. I do not use spark-submit directly, and I don't understand how to use spark-submit parameters in this case...
def createSparkContext(masterAdress = algoMaster):
    """
    :return: return a spark context that is suitable for my configs
    note the ip for the master
    app name is not that important, just to show off
    """
    from pyspark.mllib.util import MLUtils
    from pyspark import SparkConf
    from pyspark import SparkContext
    import os

    SUBMIT_ARGS = "--driver-class-path /var/nfs/general/mysql-connector-java-5.1.43 pyspark-shell"
    #SUBMIT_ARGS = "--packages com.databricks:spark-csv_2.11:1.2.0 pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
    conf = SparkConf()
    #conf.set("spark.driver.extraClassPath", "var/nfs/general/mysql-connector-java-5.1.43")
    conf.setMaster(masterAdress)
    conf.setAppName('spark-basic')
    conf.set("spark.executor.memory", "2G")
    #conf.set("spark.executor.cores", "4")
    conf.set("spark.driver.memory", "3G")
    conf.set("spark.driver.cores", "3")
    #conf.set("spark.driver.extraClassPath", "/var/nfs/general/mysql-connector-java-5.1.43")
    sc = SparkContext(conf=conf)
    print sc._conf.get("spark.executor.extraClassPath")
    return sc

sql = SQLContext(sc)
df = sql.read.format('jdbc').options(url='jdbc:mysql://ip:port?user=user&password=pass', dbtable='(select * from tablename limit 100) as tablename').load()
print df.head()
Thanks
Your SUBMIT_ARGS is going to be passed to spark-submit when you create a SparkContext from Python. You should use --jars instead of --driver-class-path.
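For example, a minimal sketch of that change (the jar path and version are taken from the question and may need adjusting; the variable must be set before the SparkContext is created):
import os

# --jars ships the connector jar to the driver and the executors;
# "pyspark-shell" has to remain at the end of PYSPARK_SUBMIT_ARGS
SUBMIT_ARGS = "--jars /var/nfs/general/mysql-connector-java-5.1.43.jar pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS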
EDIT
Your problem is actually a lot simpler than it seems: you're missing the parameter driver in the options:
sql = SQLContext(sc)
df = sql.read.format('jdbc').options(
    url='jdbc:mysql://ip:port',
    user='user',
    password='pass',
    driver="com.mysql.jdbc.Driver",
    dbtable='(select * from tablename limit 100) as tablename'
).load()
You can also pass user and password as separate options, as shown above.
