What is wrong with my boto elastic mapreduce jar jobflow parameters?

I am using the boto library to create a job flow in Amazon's Elastic MapReduce web service (EMR). The following code should create a step:
step2 = JarStep(name='Find similiar items',
                jar='s3n://recommendertest/mahout-core/mahout-core-0.5-SNAPSHOT.jar',
                main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
                step_args=['s3n://bucket/output/' + run_id + '/aggregate_watched/',
                           's3n://bucket/output/' + run_id + '/similiar_items/',
                           'SIMILARITY_PEARSON_CORRELATION'])
When I run the job flow, it always fails throwing this error:
java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/JobContext
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.JobContext
This is the line in the EMR logs invoking the java code:
2011-01-24T22:18:54.491Z INFO Executing /usr/lib/jvm/java-6-sun/bin/java \
-cp /home/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-0.18-core.jar:/home/hadoop/hadoop-0.18-tools.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* \
-Xmx1000m \
-Dhadoop.log.dir=/mnt/var/log/hadoop/steps/3 \
-Dhadoop.log.file=syslog \
-Dhadoop.home.dir=/home/hadoop \
-Dhadoop.id.str=hadoop \
-Dhadoop.root.logger=INFO,DRFA \
-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/3/tmp \
-Djava.library.path=/home/hadoop/lib/native/Linux-i386-32 \
org.apache.hadoop.mapred.JobShell \
/mnt/var/lib/hadoop/steps/3/mahout-core-0.5-SNAPSHOT.jar \
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
s3n://..../output/job_2011-01-24_23:09:29/aggregate_watched/ \
s3n://..../output/job_2011-01-24_23:09:29/similiar_items/ \
SIMILARITY_PEARSON_CORRELATION
What is wrong with the parameters? The java class definition can be found here:
https://hudson.apache.org/hudson/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html

I found the solution for the problem:

1. You need to specify hadoop version 0.20 in the jobflow parameters.
2. You need to run the JAR step with mahout-core-0.5-SNAPSHOT-job.jar, not with mahout-core-0.5-SNAPSHOT.jar.
3. If you have an additional streaming step in your jobflow, you need to fix a bug in boto:
   - Open boto/emr/step.py
   - Change line 138 to "return '/home/hadoop/contrib/streaming/hadoop-streaming.jar'"
   - Save and reinstall boto
This is how the run_jobflow function should be invoked to run the job flow with Mahout:

jobid = emr_conn.run_jobflow(name=name,
                             log_uri='s3n://' + main_bucket_name + '/emr-logging/',
                             enable_debugging=1,
                             hadoop_version='0.20',
                             steps=[step1, step2])
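For completeness, here is roughly what the corrected JAR step from the question looks like once it points at the -job.jar. This is just a sketch assembled from the question's own values; nothing is new except the jar name and a placeholder run_id:

from boto.emr.step import JarStep

run_id = 'job_2011-01-24_23:09:29'  # placeholder run id, taken from the question's logs

# Same step as in the question, but using the self-contained
# mahout-core-0.5-SNAPSHOT-job.jar instead of mahout-core-0.5-SNAPSHOT.jar.
step2 = JarStep(name='Find similar items',
                jar='s3n://recommendertest/mahout-core/mahout-core-0.5-SNAPSHOT-job.jar',
                main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
                step_args=['s3n://bucket/output/' + run_id + '/aggregate_watched/',
                           's3n://bucket/output/' + run_id + '/similiar_items/',
                           'SIMILARITY_PEARSON_CORRELATION'])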

The boto fix described above (i.e. using the non-versioned hadoop-streaming.jar file) has been incorporated into the GitHub master in this commit:
https://github.com/boto/boto/commit/a4e8e065473b5ff9af554ceb91391f286ac5cac7
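With that fix in place, a streaming step created through boto's StreamingStep should pick up the non-versioned jar by default. A minimal sketch; the step name, script paths and bucket paths below are placeholders, not taken from the question:

from boto.emr.step import StreamingStep

# Hypothetical streaming step; with the patched boto it defaults to
# /home/hadoop/contrib/streaming/hadoop-streaming.jar on the cluster.
step1 = StreamingStep(name='Aggregate watched items',
                      mapper='s3n://bucket/scripts/mapper.py',
                      reducer='s3n://bucket/scripts/reducer.py',
                      input='s3n://bucket/input/',
                      output='s3n://bucket/output/aggregate_watched/')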

For reference, here is how to do this from boto:

import boto.emr.connection as botocon
import boto.emr.step as step

con = botocon.EmrConnection(aws_access_key_id='', aws_secret_access_key='')
step = step.JarStep(name='Find similar items',
                    jar='s3://mahout-core-0.6-job.jar',
                    main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
                    action_on_failure='CANCEL_AND_WAIT',
                    step_args=['--input', 's3://', '--output', 's3://',
                               '--similarityClassname', 'SIMILARITY_PEARSON_CORRELATION'])
con.add_jobflow_steps('jflow', [step])

Obviously you need to upload the mahout-core-0.6-job.jar to an accessible S3 location, and the input and output locations have to be accessible as well.

Related

How to set up Kafka as a dependency when using Delta Lake in PySpark?

This is the code to set up Delta Lake as part of a regular Python script, according to their documentation:
import pyspark
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
The official docs for Kafka integration in Spark show how to set up Kafka when using a spark-submit command (through the --packages parameter), but not in Python.
Digging around, turns out that you can also include this parameter when building the Spark session:
import pyspark
from delta import *

packages = [
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1",
]

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", ",".join(packages)) \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
However, when I try to stream to Kafka using the Spark session created above, I still get the following error:
Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
I'm using Delta 2.1.0 and PySpark 3.3.0.
Turns out that Delta overwrites any packages provided in spark.jars.packages if you're using configure_spark_with_delta_pip (source). The proper way is to make use of the extra_packages parameter when setting up your Spark Session:
import pyspark
from delta import *

packages = [
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1",
]

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder, extra_packages=packages).getOrCreate()
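With the Kafka connector loaded via extra_packages, a quick smoke test like the following should no longer raise the "Failed to find data source: kafka" error. This is just a sketch: it reuses the spark session built above, and the broker address and topic name are placeholders.

# Reuses the `spark` session from the snippet above.
# Broker address and topic are placeholders for this sketch.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "my-topic")
    .load()
)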

`AnalysisException("Database 'delta' not found" ...` when creating delta table over DeltaTableBuilder API locally

I am creating delta tables in Databricks Runtime 9.1 LTS using the DeltaTableBuilder API in PySpark. This works fine. When running the same code locally (for a unit test) I get a strange error.
I'm setting up my local environment and the local PySpark session as described in the Quickstart guide.
Steps to reproduce:
Setup environment as in DB Runtime 9.1: Python 3.8
pip install delta-spark==1.0.0
pip install pyspark==3.1.2
Create SparkSession:
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

session_builder = SparkSession.builder
session_builder = (
    session_builder.config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(session_builder).getOrCreate()
Try to create delta table:
from delta import DeltaTable

DeltaTable.createIfNotExists(spark) \
    .location("test_tables/test.delta") \
    .addColumn("id", "LONG") \
    .execute()
Error:
pyspark.sql.utils.AnalysisException: Database 'delta' not found
In Databricks this error does not appear. It does not require any database "delta"; it just creates the delta table directory with the _delta_log in it, no database involved. But OK, let's create the database locally:
from delta import DeltaTable

spark.sql("CREATE DATABASE delta")

DeltaTable.createOrReplace(spark) \
    .location("test_delta") \
    .addColumn("id", "LONG") \
    .execute()
Error:
pyspark.sql.utils.AnalysisException: `delta`.`test_delta` is not a Delta table.
How can I make the local behaviour match what happens in Databricks?

Dataflow reading from PubSub works at GCP, can't run locally

I have a small test Dataflow job that just reads from a Pub/Sub subscription and discards the message; we're using it to start some proof-of-concept work.
It works just fine running on GCP, but fails locally. My expectation is that the same code should work either way, just by switching the runner, but perhaps that's not the case? Here's the code:
import os
from datetime import datetime
import logging

from apache_beam import Map, io, Pipeline
from apache_beam.options.pipeline_options import PipelineOptions


def noop(element):
    pass


def run(input_subscription, pipeline_args=None):
    pipeline_options = PipelineOptions(
        pipeline_args, streaming=True, save_main_session=True
    )
    with Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Read from Pub/Sub" >> io.ReadFromPubSub(subscription=input_subscription, with_attributes=True)
            | "noop" >> Map(noop)
        )


if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    run(
        os.environ['INPUT_SUBSCRIPTION'],
        [
            '--runner', os.getenv('RUNNER', 'DirectRunner'),
            '--project', os.getenv('PROJECT'),
            '--region', os.getenv('REGION'),
            '--temp_location', os.getenv('TEMP_LOCATION'),
            '--service_account_email', os.getenv('SERVICE_ACCOUNT_EMAIL'),
            '--network', os.getenv('NETWORK'),
            '--subnetwork', os.getenv('SUBNETWORK'),
            '--num_workers', os.getenv('NUM_WORKERS'),
        ]
    )
If I run it with this command line, it creates and runs the job in the Google Cloud just fine:
INPUT_SUBSCRIPTION=subscriptionname \
RUNNER=DataflowRunner \
PROJECT=project \
REGION=region \
TEMP_LOCATION=gs://somewhere/temp \
SERVICE_ACCOUNT_EMAIL=serviceaccount@project.iam.gserviceaccount.com \
NETWORK=network \
SUBNETWORK=https://www.googleapis.com/compute/v1/projects/project/regions/region/subnetworks/subnetwork \
NUM_WORKERS=3 \
python read-pubsub-with-dataflow.py
If I omit the RUNNER option, so it uses DirectRunner:
INPUT_SUBSCRIPTION=subscriptionname \
PROJECT=project \
REGION=region \
TEMP_LOCATION=gs://somewhere/temp \
SERVICE_ACCOUNT_EMAIL=serviceaccount@project.iam.gserviceaccount.com \
NETWORK=network \
SUBNETWORK=https://www.googleapis.com/compute/v1/projects/project/regions/region/subnetworks/subnetwork \
NUM_WORKERS=3 \
python read-pubsub-with-dataflow.py
it fails with a whole flood of error messages, but I'll just include the first one (I think the rest are just cascading):
INFO:apache_beam.runners.direct.direct_runner:Running pipeline with DirectRunner.
/Users/denis/redacted/env/lib/python3.6/site-packages/google/auth/_default.py:70: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK without a quota project. You might receive a "quota exceeded" or "API not enabled" error. We recommend you rerun `gcloud auth application-default login` and make sure a quota project is added. Or you can use service accounts instead. For more information about service accounts, see https://cloud.google.com/docs/authentication/
warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
ERROR:apache_beam.runners.direct.executor:Exception at bundle <apache_beam.runners.direct.bundle_factory._Bundle object at 0x7fed3e368448>, due to an exception.
Traceback (most recent call last):
File "/Users/denis/redacted/env/lib/python3.6/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 694, in _read_from_pubsub
self._sub_name, max_messages=10, return_immediately=True)
File "/Users/denis/redacted/env/lib/python3.6/site-packages/google/cloud/pubsub_v1/_gapic.py", line 40, in <lambda>
fx = lambda self, *a, **kw: wrapped_fx(self.api, *a, **kw) # noqa
File "/Users/denis/redacted/env/lib/python3.6/site-packages/google/pubsub_v1/services/subscriber/client.py", line 1106, in pull
"If the `request` argument is set, then none of "
ValueError: If the `request` argument is set, then none of the individual field arguments should be set.
During handling of the above exception, another exception occurred:
...etc...
I suspect maybe this has to do with credentials? Or our project config? Perhaps I should try in a new blank project.
This turned out to be incompatible package versions. My requirements.txt had been:
apache_beam[gcp]
google_apitools
google-cloud-pubsub
but that was installing a version of the google-cloud-pubsub package that was breaking apache_beam. I changed my requirements.txt to:
apache_beam[gcp]
google_apitools
and it all works now!
And for what it's worth, running locally with DirectRunner I obviously did not need a lot of the options that I needed for DataflowRunner. This sufficed:
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json \
RUNNER=DirectRunner \
INPUT_SUBSCRIPTION=projects/mytopic/subscriptions/mysubscription \
python read-pubsub-with-dataflow.py

flink run -py wordcount.py caused NullPointerException

I want to process data with Flink's Python API on Windows. But when I use the command below to submit a job to the local cluster, it throws a NullPointerException.
bin/flink run -py D:\workspace\python-test\flink-test.py
flink-test.py:
from pyflink.dataset import ExecutionEnvironment
from pyflink.table import TableConfig, DataTypes, BatchTableEnvironment
from pyflink.table.descriptors import Schema, OldCsv, FileSystem

exec_env = ExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
t_config = TableConfig()
t_env = BatchTableEnvironment.create(exec_env, t_config)

t_env.connect(FileSystem().path('D:\\workspace\\python-test\\data.txt')) \
    .with_format(OldCsv()
                 .line_delimiter(' ')
                 .field('word', DataTypes.STRING())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())) \
    .register_table_source('mySource')

t_env.connect(FileSystem().path('D:\\workspace\\python-test\\result.txt')) \
    .with_format(OldCsv()
                 .field_delimiter('\t')
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .register_table_sink('mySink')

t_env.scan('mySource') \
    .group_by('word') \
    .select('word, count(1)') \
    .insert_into('mySink')

t_env.execute("tutorial_job")
Does anyone know why?
I have solved this problem. I read the source code pointed to by the error message.
The NullPointerException is caused by flinkOptPath being empty. I used flink.bat to submit the job, and flink.bat does not set flinkOptPath, so I added the missing code to flink.bat myself. The flink.bat script is incomplete for now; Flink should really be run on Linux.

AWS Glue - Truncate destination postgres table prior to insert

I am trying to truncate a postgres destination table prior to insert, and in general, trying to fire external functions utilizing the connections already created in Glue.
Has anyone been able to do so?
I've tried the DROP/TRUNCATE scenario; I could not do it with the connections already created in Glue, but I could with a pure Python PostgreSQL driver, pg8000:

1. Download the tar of pg8000 from PyPI
2. Create an empty __init__.py in the root folder
3. Zip up the contents & upload to S3
4. Reference the zip file in the Python lib path of the job
5. Set the DB connection details as job params (make sure to prepend all key names with --). Tick the "Server-side encryption" box.
Then you can simply create a connection and execute SQL.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
import pg8000
args = getResolvedOptions(sys.argv, [
    'JOB_NAME',
    'PW',
    'HOST',
    'USER',
    'DB'
])

# ...
# Create Spark & Glue context

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ...

# schema and table are assumed to be defined elsewhere in the job
config_port = 5432
conn = pg8000.connect(
    database=args['DB'],
    user=args['USER'],
    password=args['PW'],
    host=args['HOST'],
    port=config_port
)

query = "TRUNCATE TABLE {0};".format(".".join([schema, table]))

cur = conn.cursor()
cur.execute(query)
conn.commit()
cur.close()
conn.close()
After following step (4) of #thenaturalist's response,
sc.addPyFile("/home/glue/downloads/python/pg8000.zip")
import pg8000
worked for me in a development endpoint (Zeppelin notebook).
More info: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html
To clarify #thenaturalist's instructions for the zip, as I still struggled with this:
Download the tar.gz of pg8000 from pypi.org and extract.
Zip the contents so you have the below structure:

pg8000-1.15.3.zip
|
|-- pg8000 <dir>
    |-- __init__.py
    |-- _version.py (optional)
    |-- core.py

Upload to S3 and then you should be able to just do a simple import pg8000.
NOTE: scramp is also required at the moment so follow the same procedure as above to include the scramp module. But you don't need to import it.
data = spark.sql(sql)

conf = glueContext.extract_jdbc_conf("jdbc-commerce")

data.write \
    .mode('overwrite') \
    .format("jdbc") \
    .option("url", conf['url']) \
    .option("database", 'Pacvue_Commerce') \
    .option("dbtable", "dbo.glue_1") \
    .option("user", conf['user']) \
    .option('truncate', 'true') \
    .option("password", conf['password']) \
    .save()
The Glue API does not support this, but the Spark API does.
jdbc-commerce is the name of the connection created for your crawler.
Use extract_jdbc_conf to get the url, username and password.
