I am looking for a guide on this. Running python DataPrediction.py on its own works fine, but when I submit it with
spark-submit --master yarn --deploy-mode cluster --driver-memory 4g --num-executors 3 --executor-memory 3g --executor-cores 2 --queue default DataPrediction.py
it fails with:
Traceback (most recent call last):
File "/mnt/vol1/hdata/nm-local-dir/usercache/ajit/appcache/application_1674580462889_0114/container_e14_1674580462889_0114_02_000001/DataPrediction.py", line 7, in <module>
from prophet import Prophet
ModuleNotFoundError: No module named 'prophet'
Please help, what should I do now?
The problem is that prophet is not installed on the machines of your YARN cluster. There are multiple ways to package Python modules and use them within a Spark job (venv, conda, pex, ...); see the official documentation.
One solution is to use a venv.
# create and activate a virtual environment
python -m venv my_env
source my_env/bin/activate
# install prophet plus venv-pack, then pack the environment into an archive
pip install prophet venv-pack
venv-pack -o my_env.tar.gz
PYSPARK_PYTHON=./environment/bin/python spark-submit \
--master yarn --deploy-mode cluster --queue default \
--archives my_env.tar.gz#environment DataPrediction.py
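As a quick sanity check (a sketch of my own, not from the documentation), you can run a tiny job that imports prophet on each executor and prints where the module was loaded from; if the archive is picked up, the paths should point inside the shipped environment:
# Minimal check: import prophet on the executors and report its location.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("env-check").getOrCreate()

def where_is_prophet(_):
    import prophet  # raises ModuleNotFoundError if the packed env is not used
    return prophet.__file__

print(spark.sparkContext.parallelize(range(2), 2).map(where_is_prophet).collect())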
I was checking this SO question, PySpark custom UDF ModuleNotFoundError: No module named, but none of the solutions helped.
I have the following repo structure on Azure Databricks:
|-run_pipeline.py
|-__init__.py
|-data_science
|--__init.py__
|--text_cleaning
|---text_cleaning.py
|---__init.py__
In the run_pipeline notebook I have this:
import os
import sys
from pyspark.sql import SparkSession
from data_science.text_cleaning import text_cleaning

path = os.path.join(os.path.dirname(__file__), os.pardir)
sys.path.append(path)

spark = SparkSession.builder.master(
    "local[*]").appName('workflow').getOrCreate()
df = text_cleaning.basic_clean(spark_df)
In text_cleaning.py I have a function called basic_clean that runs something like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def basic_clean(df):
    print('Removing links')
    udf_remove_links = udf(_remove_links, StringType())
    df = df.withColumn("cleaned_message", udf_remove_links("cleaned_message"))
    return df
When I do df.show() in the run_pipeline notebook, I get this error message:
Exception has occurred: PythonException (note: full exception trace is shown but execution is paused at: <module>)
An exception was thrown from a UDF: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'data_science''. Full traceback below:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'data_science'
Shouldn't the imports work? Why is this an issue?
It seems the data_science module is missing on the cluster. Kindly consider
installing it on the cluster.
Please check the link below about installing libraries on a cluster:
https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries
You can run the pip list command to see the libraries installed on the cluster.
You can also consider running the pip install data_science command directly in a notebook cell.
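If data_science is a local package inside the repo rather than something published to PyPI, another option (not from the answer above) is to ship the package with the Spark job so the workers can unpickle the UDF. A rough sketch, with illustrative paths that assume the notebook's working directory is the repo root:
import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName('workflow').getOrCreate()

# Zip the local data_science package and distribute the archive to every executor;
# the zip is added to sys.path on the workers, so pickled UDFs can import it.
shutil.make_archive("data_science", "zip", ".", "data_science")
spark.sparkContext.addPyFile("data_science.zip")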
I've been facing the same issue running PySpark tests with UDFs in Azure DevOps. I've noticed that this happens when running from the pool with vmImage: ubuntu-latest. When I use a custom container built from the following Dockerfile, the tests run fine:
FROM python:3.8.3-slim-buster AS py3
FROM openjdk:8-slim-buster
ENV PYSPARK_VER=3.3.0
ENV DELTASPARK_VER=2.1.0
COPY --from=py3 / /
WORKDIR /setup
COPY requirements.txt .
RUN python -m pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt && \
rm requirements.txt
WORKDIR /code
requirements.txt contains pyspark==3.3.0 and delta-spark==2.1.0.
This led me to conclude that it's due to how Spark runs in the default Ubuntu VM, which uses Python 3.10.6 and Java 11 (at the time of posting). I've tried setting environment variables such as PYSPARK_PYTHON to force PySpark to use the same Python binary that the package under test is installed with, but to no avail.
Maybe you can use this information to find a way to get it working on the default agent pool's Ubuntu VM; otherwise, I recommend just using a pre-configured container like I did.
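For reference, a small diagnostic that can help here (a sketch of my own, not part of the setup above): compare the Python interpreter the driver uses with the one the worker processes use, since a mismatch between the two is exactly what PYSPARK_PYTHON is meant to fix:
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

driver_version = sys.version_info[:3]
# Run a trivial task so a worker process reports its own interpreter version.
worker_version = spark.sparkContext.parallelize([0], 1).map(
    lambda _: __import__("sys").version_info[:3]
).collect()[0]
print("driver:", driver_version, "worker:", worker_version)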
When trying to synthesize my CDK app, I receive the following error:
Traceback (most recent call last):
File "C:\Users\myusername\PycharmProjects\rbds-cdk_testing\app.py", line 2, in <module>
from aws_cdk.core import App, Environment
File "C:\Users\myusername\PycharmProjects\rbds-cdk_testing\.venv\lib\site-packages\aws_cdk\__init__.py", line 1260, in <module>
from .cloud_assembly_schema import (
ImportError: cannot import name 'AssetManifestOptions' from 'aws_cdk.cloud_assembly_schema' (C:\Users\myusername\PycharmProjects\rbds-cdk_testing\.venv\lib\site-packages\aws_cdk\cloud_assembly_schema\__init__.py)
I am using node version 18.0.0. Here are the steps I've taken to create my CDK app:
(FROM c:\Users\myusername\)
installed nvm
installed npm
nvm use 18.0.0
npm install -g yarn
npm install -g aws-cdk
cdk bootstrap aws://account-number/region
cd .\PyCharmProjects\mycdkapp
cdk init app --language python
.venv\Scripts\activate.bat
python -m pip install aws-cdk.aws-glue
python -m pip install aws-cdk
I get the error even when executing cdk ls, since the runtime tries to run app.py, which contains:
import yaml
from aws_cdk.core import App, Environment
from pipeline import PipelineCDKStack
When checking whether the __init__.py file for aws_cdk contains AssetManifestOptions, I discovered that it is completely missing.
Am I missing something here, or is this a unique bug that I am experiencing? Any help much appreciated! I am banging my head on this one.
It's the same here; I think the issue could be a wrong package version.
cloud-assembly-schema==2.50.0 contains AssetManifestOptions.
Can you please paste here the output of
pip list -v | grep aws
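Since you are on Windows, where grep may not be available, roughly the same information can be printed from Python itself (a small sketch of my own, not part of the original answer):
from importlib import metadata

# List every installed distribution whose name looks like an aws-cdk package,
# so mismatched v1/v2 versions are easy to spot.
for dist in metadata.distributions():
    name = (dist.metadata["Name"] or "")
    if name.lower().startswith("aws-cdk"):
        print(name, dist.version)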
I am able to install 2.50.0; however, it depends on other packages of the same version (see the attachment).
And I can't set up the core package because there is no matching CDKv2 distribution at the moment.
I have tried following the Databricks blog post here, but unfortunately I keep getting errors. I'm trying to install pandas, pyarrow, numpy, and the h3 library and then be able to access those libraries on my PySpark cluster, but following these instructions isn't working.
conda init --all (then close and reopen terminal)
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas h3 numpy python=3.7.10 conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
import os
from pyspark.sql import SparkSession
os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder.config(
    "spark.yarn.archive",  # 'spark.yarn.dist.archives' in YARN.
    "~/gzk/pyspark_conda_env.tar.gz#environment").getOrCreate()
I'm able to get this far, but when I actually try to run a pandas UDF I get the error: ModuleNotFoundError: No module named 'numpy'
How can I solve this problem and use pandas UDFs?
I ended up solving this issue by writing a bootstrap script for my AWS EMR cluster that would install all the packages I needed on all the nodes. I was never able to get the directions above to work properly.
Documentation on bootstrap scripts can be found here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html
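Once the bootstrap script has put the packages on every node, a minimal pandas UDF like the sketch below (illustrative only, not from the original setup) is a quick way to confirm that numpy and pandas are importable on the executors:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("udf-smoke-test").getOrCreate()

@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    # Fails with ModuleNotFoundError on the workers if pandas/numpy are missing.
    return v + 1.0

spark.range(5).select(plus_one("id").alias("x")).show()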
I am trying to run a Python script in EC2 user-data. The EC2 instance I am launching uses a custom AMI image that I prepared. I installed the two packages that I need, boto3 and pyodbc, by executing this command (notice, I am installing them as root):
sudo yum -y install boto3 pyodbc
My user-data script:
#!/bin/bash
set -e -x
# set AWS region
echo "export AWS_DEFAULT_REGION=${region}" >> /etc/profile
source /etc/profile
# copy python script from the s3 bucket
aws s3 cp s3://${bucket_name}/ /home/ec2-user --recursive
/usr/bin/python3 /home/ec2-user/my_script.py
After launching a new EC2 instance (using my custom AMI) and checking /var/log/cloud-init-output.log, I see this error:
+ python3 /home/ec2-user/main_rigs_stopgap.py
Traceback (most recent call last):
File "/home/ec2-user/my_script.py", line 1, in <module>
import boto3
ModuleNotFoundError: No module named 'boto3'
util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [1]
cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Any suggestions, please?
To make sure that you installed the modules for the correct version of Python, use the built-in pip module of the Python version you are using:
/usr/bin/python3 -m pip install module_name
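To see which interpreter the user-data script actually ends up using and whether it can find boto3, a tiny check like this can help (a sketch, assuming you invoke it with /usr/bin/python3 just like my_script.py):
import sys

# Print the interpreter path and where boto3 was loaded from, if at all.
print("interpreter:", sys.executable)
try:
    import boto3
    print("boto3:", boto3.__file__)
except ModuleNotFoundError:
    print("boto3 is not installed for this interpreter")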
https://github.com/NVIDIA/DeepRecommender
According to the above page, I tried to run NVIDIA's DeepRecommender program. After I activated the pytorch environment, I ran the program as below, but it failed.
[I ran this command]
$ python run.py --gpu_ids 0 \
--path_to_train_data Netflix/NF_TRAIN \
--path_to_eval_data Netflix/NF_VALID \
--hidden_layers 512,512,1024 \
--non_linearity_type selu \
--batch_size 128 \
--logdir model_save \
--drop_prob 0.8 \
--optimizer momentum \
--lr 0.005 \
--weight_decay 0 \
--aug_step 1 \
--noise_prob 0 \
--num_epochs 12 \
--summary_frequency 1000
[The comments from the guide]
Note that you can run Tensorboard in parallel
$ tensorboard --logdir=model_save
[My Question]
The guide says the above. I don't know how to run it in parallel; please tell me how. Should I open 2 terminal windows?
[Environment]
The details of the environment are as follows:
---> Ubuntu 18.04 LTS, Python 3.6, PyTorch 1.2.0, CUDA V10.1.168
[The 1st trial]
After I activated the pytorch environment:
$ source activate pytorch
$ python run.py --gpu_ids 0 \ (The long parameters are abbreviated here.)
[The Error messages of the 1st trial]
Traceback (most recent call last):
File "run.py", line 13, in
from logger import Logger
File "/home/user/NVIDIA_DeepRecommender/DeepRecommender-mp_branch/logge r.py", line 4, in
import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'
[The 2nd trial]
After I activated the tensorflow-gpu environment:
$ source activate tensorflow-gpu
$ python run.py --gpu_ids 0 \ (The long parameters are abbreviated here.)
[The Error messages of the 2nd trial.]
Traceback (most recent call last):
File "run.py", line 2, in
import torch
ModuleNotFoundError: No module named 'torch'
[Expected result]
$ python run.py --gpu_ids 0 \
The program runs with no errors and finishes training the model.
Try either installing tensorflow-gpu in your pytorch environment, or pytorch in your tensorflow-gpu environment, and use that environment to run your program.
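Before launching run.py again, a quick check like this (a sketch of my own) shows whether the active environment can import both libraries, which is what the script needs:
import importlib

# run.py imports both torch and tensorflow, so the active conda environment
# must provide both of them.
for name in ("torch", "tensorflow"):
    try:
        module = importlib.import_module(name)
        print(name, getattr(module, "__version__", "version unknown"))
    except ModuleNotFoundError:
        print(name, "is missing from this environment")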