How to run Python code on an AWS cluster using Apache Spark?

I've written Python code that sums up all the numbers in the first column of each csv file, as follows:
import os, sys, inspect, csv
### Current directory path.
curr_dir = os.path.split(inspect.getfile(inspect.currentframe()))[0]
### Setup the environment variables
spark_home_dir = os.path.realpath(os.path.abspath(os.path.join(curr_dir, "../spark")))
python_dir = os.path.realpath(os.path.abspath(os.path.join(spark_home_dir, "./python")))
os.environ["SPARK_HOME"] = spark_home_dir
os.environ["PYTHONPATH"] = python_dir
### Setup pyspark directory path
pyspark_dir = python_dir
sys.path.append(pyspark_dir)
### Import the pyspark
from pyspark import SparkConf, SparkContext
### Specify the data file directory, and load the data files
data_path = os.path.realpath(os.path.abspath(os.path.join(curr_dir, "./test_dir")))
### myfunc is to add all numbers in the first column.
def myfunc(s):
    total = 0
    if s.endswith(".csv"):
        cr = csv.reader(open(s, "rb"))
        for row in cr:
            total += int(row[0])
    return total
def main():
    ### Initialize the SparkConf and SparkContext
    conf = SparkConf().setAppName("ruofan").setMaster("spark://ec2-52-26-177-197.us-west-2.compute.amazonaws.com:7077")
    sc = SparkContext(conf = conf)
    datafile = sc.wholeTextFiles(data_path)
    ### Send the application to each of the slave nodes
    temp = datafile.map(lambda (path, content): myfunc(str(path).strip('file:')))
    ### Collect the result and print it out.
    for x in temp.collect():
        print x

if __name__ == "__main__":
    main()
I would like to use Apache Spark to parallelize the summation process for several csv files using this same Python code. I've already done the following steps:
I've created one master and two slave nodes on AWS.
I've used the bash command $ scp -r -i my-key-pair.pem my_dir root@ec2-52-27-82-124.us-west-2.compute.amazonaws.com to upload the directory my_dir, including my python code and the csv files, onto the cluster's master node.
I've logged into my master node, and from there used the bash command $ ./spark/copy-dir my_dir to send my python code as well as the csv files to all slave nodes.
I've set up the environment variables on the master node:
$ export SPARK_HOME=~/spark
$ export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
However, when I run the python code on the master node with $ python sum.py, it shows the following error:
Traceback (most recent call last):
  File "sum.py", line 18, in <module>
    from pyspark import SparkConf, SparkContext
  File "/root/spark/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/root/spark/python/pyspark/context.py", line 31, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/root/spark/python/pyspark/java_gateway.py", line 31, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
ImportError: No module named py4j.java_gateway
I have no idea what causes this error. Also, I am wondering whether the master node automatically calls all the slave nodes to run in parallel. I'd really appreciate it if anyone can help me.

Here is how I would debug this particular import error.
SSH into your master node
Run the Python REPL with $ python
Try the failing import line: >>> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
If it fails, try simply running: >>> import py4j
If that fails, it means that your system either does not have py4j installed or cannot find it
Exit the REPL: >>> exit()
Try installing py4j: $ pip install py4j (you'll need to have pip installed)
Open the REPL again: $ python
Try importing again: >>> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
If that works, exit() and try running your $ python sum.py again
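If pip isn't available on the node, Spark also ships its own copy of py4j under $SPARK_HOME/python/lib as a source zip. A sketch of that alternative fix (the py4j version in the file name varies by Spark release, so check the lib directory first and adjust it):
$ export SPARK_HOME=~/spark
$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
$ python sum.py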

I think you are asking two separate questions. It looks like you have an import error. Is it possible that you have a different version of the package py4j installed on your local computer that you haven't installed on your master node?
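One quick way to check (assuming pip is available on both machines) is to compare what pip reports on your local computer and on the master node:
$ pip show py4j
and look at the Version line in each output.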
I can't help with running this in parallel.

Related

Python Code Runs in Spyder but not Anaconda Prompt -- ModuleNotFoundError

I am in need of some help with the below. I'm using Python 3 on Windows 10; the code is written in Spyder over an Anaconda install with a custom environment.
I have some code that runs in Spyder but not Anaconda Prompt. It gets a ModuleNotFoundError. The directory/file structure is like this:
my_python
    smb_bau
        etl
            __init__.py
            etl_globals.py
        __init__.py
        my_config.conf
        my_connections.py
        my_definitions.py
    __init__.py
Some notes before the problematic code:
It's just test code for now that will be replaced with the real hive code later.
The code works when all the files are in one folder.
The my_ prefixes are just to avoid any word that might cause issues while testing.
In my_connections I have this code:
class hive_connection:
    def __init__(self):
        self.my_name = None
        self.my_town = None

        import configparser
        from smb_bau.my_definitions import CONFIG_PATH

        config = configparser.ConfigParser()
        config.read(CONFIG_PATH)
        my_details = config["my_details"]
        self.my_name = my_details["my_name"]
        self.my_town = my_details["my_town"]
And this is what's in my_definitions:
import os
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
CONFIG_PATH = os.path.join(ROOT_DIR, 'my_config.conf')
(I gratefully stole this from another thread to solve the issue of getting config info from a different folder).
The file I run is etl_globals which contains this code:
from smb_bau.my_connections import hive_connection
hive = hive_connection()
print(hive.my_name, hive.my_town)
When I run etl_globals in Spyder it works perfectly. But it fails in Anaconda Prompt.
I open up Anaconda Prompt, activate my environment, navigate to the etl folder, and enter python etl_globals.py
The error message I get is:
Traceback (most recent call last):
  File "etl_globals.py", line 1, in <module>
    from smb_bau.my_connections import hive_connection
ModuleNotFoundError: No module named 'smb_bau'
I don't understand why the import of modules works in Spyder but not Anaconda Prompt.
The overarching goal is to be able to not have all the files in one folder.
Any help would be gratefully appreciated.
Many thanks.
Edit:
Just to be sure what the issue is, I simplified it: I created a file called my_functions (in the smb_bau folder) with this in:
def fn_test():
    print(5)
And a file in the etl folder called my_test with this in:
from smb_bau.my_functions import fn_test
fn_test()
And got the same error:
Traceback (most recent call last):
  File "my_test.py", line 1, in <module>
    from smb_bau.my_functions import fn_test
ModuleNotFoundError: No module named 'smb_bau'
Why does from smb_bau.my_functions import xxx work in Spyder but not Anaconda prompt?
Also, this may be relevant: the only reason there is a folder called my_python at the top is that I was getting an error saying no module named smb_bau. So I moved it all down a level, and then it would accept smb_bau as a module (but only in Spyder).
I also just checked PYTHONPATH and the root folder is there.
I have only been using Python for a week, so I'm sure there are other bits of my code that aren't great -- but it does work in the Spyder client.
Thanks and sorry if any of this is unclear.
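For what it's worth, a likely explanation: running python etl_globals.py from inside the etl folder puts only that folder on sys.path, so the top-level smb_bau package is invisible, whereas Spyder adds the project root for you. A minimal sketch of the usual fix, assuming the layout above (the my_python path is a placeholder), is to launch the script as a module from the project root:
cd my_python
python -m smb_bau.etl.etl_globals
or to put the root on PYTHONPATH before running the file directly:
set PYTHONPATH=C:\path\to\my_python;%PYTHONPATH%
python etl_globals.py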

Executing .py file from Command Prompt raises "ImportError: no module named geopandas" - Script works in Spyder (using Anaconda)

I have a python script that accomplishes a few small tasks:
Create new directory structure
Download a .zip file from a URL and unzip contents
Clean up the data
Export data as a .csv
The full .py file runs successfully in Spyder, giving the desired output, but when I try to run the .py from Command Prompt, it raises "ImportError: no module named geopandas".
I am using Windows 10 Enterprise version 1909, conda v4.9.2, Anaconda command line client v 1.7.2, Spyder 4.2.3.
I am in a virtual environment with all the needed packages that my script imports.
The first part of my script only needs the os and requests packages, and it runs fine as its own .py file from Command Prompt:
import os
import requests

#setup folders, download .zip file and unzip it
#working directory is directory the .py file is in
wd = os.path.dirname(__file__)
if not os.path.exists(wd):
    os.mkdir(wd)

#data source directory
src_path = os.path.join(wd, "src")
if not os.path.exists(src_path):
    os.mkdir(src_path)

#data output directory
output_path = os.path.join(wd, "output")
if not os.path.exists(output_path):
    os.mkdir(output_path)

#create new output directories and define as variables
out_parent = os.path.join(wd, "output")
if not os.path.exists(out_parent):
    os.mkdir(out_parent)

folders = ["imgs", "eruptions_processed"]
for folder in folders:
    new_dir = os.path.join(out_parent, folder)
    if not os.path.exists(new_dir):
        os.mkdir(new_dir)

output_imgs = os.path.join(out_parent, "imgs")
if not os.path.exists(output_imgs):
    os.mkdir(output_imgs)

output_eruptions = os.path.join(out_parent, "eruptions_processed")
if not os.path.exists(output_eruptions):
    os.mkdir(output_eruptions)

if not os.path.exists(os.path.join(src_path, "Historical_Significant_Volcanic_Eruption_Locations.zip")):
    url = 'https://opendata.arcgis.com/datasets/3ed5925b69db4374aec43a054b444214_6.zip?outSR=%7B%22latestWkid%22%3A3857%2C%22wkid%22%3A102100%7D'
    doc = requests.get(url)
    os.chdir(src_path)  #change working directory to src folder
    with open('Historical_Significant_Volcanic_Eruption_Locations.zip', 'wb') as f:
        f.write(doc.content)

file = os.path.join(src_path, "Historical_Significant_Volcanic_Eruption_Locations.zip")  #full file path of downloaded
But once I re-introduce my full list of packages in the .py file:
import os
import pandas as pd
import geopandas as gpd
import requests
import datetime
import shutil
and run again from Command Prompt, I get:
Traceback (most recent call last):
  File "C:\Users\KWOODW01\py_command_line_tools\download_eruptions.py", line 17, in <module>
    import geopandas as gpd
ImportError: No module named geopandas
I think the problem is something to do with not finding my installed packages in my Anaconda virtual environment, but I don't have a firm grasp on how to troubleshoot that. I thought I had already added the necessary Anaconda file paths to my Windows PATH variable.
The path to my virtual environment packages is
"C:\Users\KWOODW01\Anaconda3\envs\pygis\Lib\site-packages"
echo %PATH% returns:
C:\Users\KWOODW01\Anaconda3\envs\pygis;C:\Users\KWOODW01\Anaconda3\envs\pygis\Library\mingw-w64\bin;C:\Users\KWOODW01\Anaconda3\envs\pygis\Library\usr\bin;C:\Users\KWOODW01\Anaconda3\envs\pygis\Library\bin;C:\Users\KWOODW01\Anaconda3\envs\pygis\Scripts;C:\Users\KWOODW01\Anaconda3\envs\pygis\bin;C:\Program Files (x86)\Common Files\Oracle\Java\javapath;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0;C:\WINDOWS\System32\OpenSSH;C:\Program Files\McAfee\Solidcore\Tools\GatherInfo;C:\Program Files\McAfee\Solidcore\Tools\Scanalyzer;C:\Program Files\McAfee\Solidcore;C:\Program Files\McAfee\Solidcore\Tools\ScGetCerts;C:\Users\KWOODW01\AppData\Local\Microsoft\WindowsApps;C:\Users\KWOODW01\Anaconda3\Library\bin;C:\Users\KWOODW01\Anaconda3\Scripts;C:\Users\KWOODW01\Anaconda3\condabin;C:\Users\KWOODW01\Anaconda3;.
So it appears that the path to the directory where my pygis venv packages live is already in my PATH variable, yet from Command Prompt the script still raises "ImportError: no module named geopandas". Pretty stuck on this one. Hoping someone can provide some more troubleshooting tips. Thanks.
I figured out I wasn't invoking python in Command Prompt before executing the python file.
The proper command is python modulename.py, not just modulename.py, if you want to execute a .py file from the command prompt. Yikes. Let this be a lesson for other python novices.
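A quick sanity check in the same situation (where is a standard Windows command) is to confirm which interpreter a bare python call resolves to before running the script:
where python
python download_eruptions.py
where python lists every python.exe on PATH in order; with the pygis environment activated, the first entry should be the one inside that environment.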

CONDA Base Env load error / NUMPY: ImportError: DLL load failed

Just installed Miniconda in C:\Anaconda3 and ran conda install numpy using the Anaconda shell (which defaults to the conda base env).
If I run the Anaconda command prompt and type python >> import numpy, all works fine.
If I open a normal command window, go to c:\Anaconda3, and run python >> import numpy, this fails (error below).
I have checked sys.path and they are the same in both CMD windows. The only solution is to run, in the normal CMD window, c:\Anaconda3\Scripts\conda activate base and then python >> import numpy.
I had Miniconda installations in the past that did not have this issue, so I am surprised to suddenly have to activate the environment. I thought the base environment was loaded by default, but it seems that this is not the case and I have to force it.
The error I get is:
>>> import numpy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Anaconda3\lib\site-packages\numpy\__init__.py", line 140, in <module>
    from . import _distributor_init
  File "c:\Anaconda3\lib\site-packages\numpy\_distributor_init.py", line 34, in <module>
    from . import _mklinit
ImportError: DLL load failed: The specified module could not be found.
I found that I could run, from the command line, C:\Anaconda3\python.exe C:\Anaconda3\cwp.py C:\Anaconda3 C:\Anaconda3\python.exe and then run import numpy, and that works. The cwp.py file is as follows:
# this script is used on windows to wrap shortcuts so that they are executed within an environment
# It only sets the appropriate prefix PATH entries - it does not actually activate environments

import os
import sys
import subprocess
from os.path import join, pathsep

from menuinst.knownfolders import FOLDERID, get_folder_path, PathNotFoundException

# call as: python cwp.py PREFIX ARGs...

prefix = sys.argv[1]
args = sys.argv[2:]

new_paths = pathsep.join([prefix,
                          join(prefix, "Library", "mingw-w64", "bin"),
                          join(prefix, "Library", "usr", "bin"),
                          join(prefix, "Library", "bin"),
                          join(prefix, "Scripts")])
env = os.environ.copy()
env['PATH'] = new_paths + pathsep + env['PATH']
env['CONDA_PREFIX'] = prefix

documents_folder, exception = get_folder_path(FOLDERID.Documents)
if exception:
    documents_folder, exception = get_folder_path(FOLDERID.PublicDocuments)
if not exception:
    os.chdir(documents_folder)
sys.exit(subprocess.call(args, env=env))
PS: if you wonder "why is this needed if you can simply activate base": when using xlwings, for instance, the script calls python.exe without activating an environment first (even though I thought that using the python.exe in the root folder meant you did not need to activate the base environment). This is troublesome, as I get the error when I try to load numpy.
Thanks!
It's a DLL error caused by a missing DLL file. Download the files from here, then go to the C:\Windows\System32 and C:\Windows\SysWOW64 folders and paste those files in.
If asked, just replace the existing files.

PySpark - Input path does not exist on YARN. Works fine locally

I am new to Spark and Python, and I am trying to launch a Python script (through a bash run.sh command).
When I run it in local mode, everything is fine. When I try to run it on the cluster (which has Spark 2.1.2 without Hadoop), I receive the error below.
I hope this info is enough.
What should I do so that the script runs on YARN?
from pyspark import SparkContext, SparkConf
import sys
import collections
import os
from subprocess import call, Popen
import numpy
import re
import requests
import json
import math
from bs4 import BeautifulSoup
from bs4.element import Comment
sc = SparkContext("yarn", "test")
record_attribute = sys.argv[1]
in_file = sys.argv[2]
#Read warc file and split in WARC/1.0
rdd = sc.newAPIHadoopFile(in_file,
                          "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
                          "org.apache.hadoop.io.LongWritable",
                          "org.apache.hadoop.io.Text",
                          conf={"textinputformat.record.delimiter": "WARC/1.0"})
And this is the error
17/11/30 14:05:48 INFO spark.SparkContext: Created broadcast 1 from broadcast at PythonRDD.scala:553
Traceback (most recent call last):
  File "/home/test/script.py", line 51, in <module>
    ,conf={"textinputformat.record.delimiter": "WARC/1.0"})
  File "/home/test/spark-2.1.2-bin-without-hadoop/python/lib/pyspark.zip/pyspark/context.py", line 651, in newAPIHadoopFile
  File "/home/test/spark-2.1.2-bin-without-hadoop/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/home/test/spark-2.1.2-bin-without-hadoop/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist
"When I run it on local mode, everything is fine"
That is because you are loading the file from the local filesystem.
When you deploy to the cluster, you need to ensure that in_file exists on all executors (the YARN NodeManagers), or on some shared filesystem such as HDFS or S3.
You should specify where your data is located in that bash script, e.g. hdfs:///some/input/in_file
If you did not copy your Hadoop cluster's core-site.xml and hdfs-site.xml into the directory pointed to by a local HADOOP_CONF_DIR environment variable, or otherwise configure it yourself, then the default behavior is to read the local filesystem, and you will therefore need to use a full external filesystem URI. For example, HDFS paths take the form hdfs://namenode:port/some/input/in_file
Note: you need to upload your file to the remote filesystem first.
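A minimal sketch of that upload step with the HDFS command-line tools (the paths are the illustrative ones from above):
hdfs dfs -mkdir -p /some/input
hdfs dfs -put in_file /some/input/in_file
After that, pass hdfs:///some/input/in_file to the script as its input argument.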
Did you configure HADOOP_CONF_DIR or YARN_CONF_DIR properly? This directory should contain the client configs of the HDFS and YARN services, so that the Spark application can get resources from YARN and can read from and write to HDFS.
Kindly check the doc given below, which tells you about the prerequisites for Spark on YARN.
https://spark.apache.org/docs/2.1.1/running-on-yarn.html
If you deploy Spark from Cloudera Manager or an Ambari server, all environment variables associated with the client configuration will be deployed automatically.
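As a minimal sketch for a manually managed cluster (the config directory and both script arguments below are illustrative; adjust them to your setup):
export HADOOP_CONF_DIR=/etc/hadoop/conf    # must contain core-site.xml, hdfs-site.xml, yarn-site.xml
spark-submit script.py some_attribute hdfs:///some/input/in_file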
Running Python SparkPi in YARN Cluster Mode
Run the pi.py file:
spark-submit --master yarn --deploy-mode cluster SPARK_HOME/lib/pi.py 10
Please refer to the following link for more information: "Running Spark Applications on YARN".

Python external project script execution

After much research on the web, across many different topics, I wasn't able to find a solution to my problem.
I am developing a tool (with a graphical interface) that runs python scripts from different projects.
To avoid dependency problems, I ask the user to select the project directory as well as the script to run.
So while developing this tool I have two projects in my IDE: "benchmark_tool", which is the project containing the GUI, and "test_project", which groups various test scripts. The final goal is for the tool to be packaged, at which point the "benchmark_tool" project disappears from the IDE.
Here is the IDE architecture:
benchmark_tool (python project)
    __init__.py
    main.py
test_project (python project)
    __init__.py
    module1
        __init__.py
        test_script.py
    module2
        __init__.py
        imports.py
How it works: main.py shall call test_script.py.
test_script.py calls imports.py at the very beginning of the script.
UPDATE
I tried some modifications in my main code:
import sys
import subprocess
subprocess.check_call([sys.executable, '-m', 'test_project.module1.test_script'], cwd='D:/project/python')
I got this error
Traceback (most recent call last):
  File "C:/Python31/lib/runpy.py", line 110, in _run_module_as_main
    mod_name, loader, code, fname = _get_module_details(mod_name)
  File "C:/Python31/lib/runpy.py", line 91, in _get_module_details
    code = loader.get_code(mod_name)
  File "C:/Python31/lib/pkgutil.py", line 272, in get_code
    self.code = compile(source, self.filename, 'exec')
  File "D:/project/python/test_project/module1/test_script.py", line 474
SyntaxError: invalid syntax

Traceback (most recent call last):
  File "D:/other_project/benchmark_tool/main.py", line 187, in read
    subprocess.check_call([sys.executable, '-m', 'module1.test_script.py'], cwd='D:/project/python/test_project')
  File "C:/Python31/lib/subprocess.py", line 446, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['C:/Python31/python.exe', '-m', 'module1.test_script.py']' returned non-zero exit status 1
Note: it works with subprocess.Popen(['python', 'D:/project/python/test_project/module1/test_script.py'])
What's the main difference between the two methods? I'll also have to pass arguments to test_script.py; which one is best for communicating with a python script (input and output data are exchanged)?
Thanks in advance
There are 3 problems I see that you have to fix:
In order to import from module2, you need to turn that directory into a package by placing an empty __init__.py file in it
When executing test_script.py, you need the full path to the file, not just the file name
Fixing up sys.path only works for your script. If you want to propagate that to test_script.py, set the PYTHONPATH environment variable
Therefore, my solution is:
import os
import sys
import subprocess

# Here is one way to determine test_project
# You might want to do it differently
script_dir = os.path.abspath(os.path.dirname(__file__))
test_project = os.path.normpath(os.path.join(script_dir, '..', 'test_project'))

# Build PYTHONPATH with the platform's separator (';' on Windows, ':' elsewhere)
python_path = '{}{}{}'.format(os.environ.get('PYTHONPATH', ''), os.pathsep, test_project)
python_path = python_path.lstrip(os.pathsep)  # In case PYTHONPATH was not set

# Need to determine the full path to the script
script = os.path.join(test_project, 'module1', 'test_script.py')

proc = subprocess.Popen(['python', script], env=dict(PYTHONPATH=python_path))
To run an "external project script", install it e.g.: pip install project (run it in a virtualenv if you'd like) and then run the script as you would with any other executable:
#!/usr/bin/env python
import subprocess
subprocess.check_call(['executable', 'arg 1', '2'])
If you don't want to install the project and it has no dependencies (or you assume they are already installed), then to run a script from a given path as a subprocess (assuming there is a module1/__init__.py file):
#!/usr/bin/env python
import subprocess
import sys
subprocess.check_call([sys.executable, '-m', 'module1.test_script'],
                      cwd='/abs/path/to/test_project')
Using -m to run the script avoids many import issues.
The above assumes that test_project is not a Python package (no test_project/__init__.py). If test_project is itself a Python package then include it in the module name and start from the parent directory instead:
subprocess.check_call([sys.executable, '-m', 'test_project.module1.test_script'],
                      cwd='/abs/path/to')
You could pass the path to the test_project directory as a command-line parameter (sys.argv) or read it from a config file (configparser, json). Avoid calculating it relative to the installation path of your parent script (if you have to, use pkgutil or setuptools' pkg_resources to get at the data). To find a place to store user data, you could use the appdirs Python package.
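A minimal sketch of the command-line-parameter variant (the argument handling is deliberately bare; adapt as needed):
#!/usr/bin/env python
# Usage: python main.py /abs/path/to/test_project
import subprocess
import sys

project_dir = sys.argv[1]  # path to test_project, supplied by the user
subprocess.check_call([sys.executable, '-m', 'module1.test_script'],
                      cwd=project_dir)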
