flink run -py wordcount.py caused NullPointerException - python

I want to process data with Flink's Python API on Windows, but when I use the command below to submit a job to a local cluster, it throws a NullPointerException.
bin/flink run -py D:\workspace\python-test\flink-test.py
flink-test.py:
from pyflink.dataset import ExecutionEnvironment
from pyflink.table import TableConfig, DataTypes, BatchTableEnvironment
from pyflink.table.descriptors import Schema, OldCsv, FileSystem

exec_env = ExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
t_config = TableConfig()
t_env = BatchTableEnvironment.create(exec_env, t_config)

t_env.connect(FileSystem().path('D:\\workspace\\python-test\\data.txt')) \
    .with_format(OldCsv()
                 .line_delimiter(' ')
                 .field('word', DataTypes.STRING())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())) \
    .register_table_source('mySource')

t_env.connect(FileSystem().path('D:\\workspace\\python-test\\result.txt')) \
    .with_format(OldCsv()
                 .field_delimiter('\t')
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .register_table_sink('mySink')

t_env.scan('mySource') \
    .group_by('word') \
    .select('word, count(1)') \
    .insert_into('mySink')

t_env.execute("tutorial_job")
Does anyone know why?

I have solved this problem. I read the source code pointed to by the error message.
The NullPointerException is caused by flinkOptPath being empty. I used flink.bat to submit the job, and flink.bat does not set flinkOptPath, so I added code to flink.bat to set it. flink.bat is incomplete for now; we should run Flink on Linux.

Related

How to set up Kafka as a dependency when using Delta Lake in PySpark?

This is the code to set up Delta Lake as part of a regular Python script, according to their documentation:
import pyspark
from delta import *
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
The official docs for Kafka integration in Spark show how to set up Kafka when using a spark-submit command (through the --packages parameter), but not in Python.
Digging around, turns out that you can also include this parameter when building the Spark session:
import pyspark
from delta import *
packages = [
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1",
]
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", ",".join(packages)) \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
However, when I try to stream to Kafka using the spark session created above I still get the following error:
Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
I'm using Delta 2.1.0 and PySpark 3.3.0.
Turns out that Delta overwrites any packages provided in spark.jars.packages if you're using configure_spark_with_delta_pip (source). The proper way is to make use of the extra_packages parameter when setting up your Spark Session:
import pyspark
from delta import *
packages = [
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1",
]
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder, extra_packages=packages).getOrCreate()
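For completeness, here is a rough usage sketch of streaming from a Delta table into Kafka with the session built above. This is my own illustration, not taken from the question; the table path, topic, broker address, and the id column are made-up placeholders.
# Hypothetical usage sketch: with the Kafka connector resolved via extra_packages,
# the "kafka" sink should now be found. All paths and names below are placeholders.
query = (
    spark.readStream.format("delta")
    .load("/tmp/events")                                      # placeholder Delta table path
    .selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
    .writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")      # placeholder broker
    .option("topic", "events")                                 # placeholder topic
    .option("checkpointLocation", "/tmp/checkpoints/events")   # the Kafka sink requires a checkpoint
    .start()
)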

Dataflow reading from PubSub works at GCP, can't run locally

I have a small test Dataflow job that just reads from a PubSub subscription and discards the message, that we're using to start some proof-of-concept work.
It works just fine running at GCP, but fails locally. My expectation is that the same code should work either way, just by switching the Dataflow runner, but perhaps that's not the case? Here's the code:
import os
from datetime import datetime
import logging

from apache_beam import Map, io, Pipeline
from apache_beam.options.pipeline_options import PipelineOptions


def noop(element):
    pass


def run(input_subscription, pipeline_args=None):
    pipeline_options = PipelineOptions(
        pipeline_args, streaming=True, save_main_session=True
    )
    with Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Read from Pub/Sub" >> io.ReadFromPubSub(subscription=input_subscription, with_attributes=True)
            | "noop" >> Map(noop)
        )


if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    run(
        os.environ['INPUT_SUBSCRIPTION'],
        [
            '--runner', os.getenv('RUNNER', 'DirectRunner'),
            '--project', os.getenv('PROJECT'),
            '--region', os.getenv('REGION'),
            '--temp_location', os.getenv('TEMP_LOCATION'),
            '--service_account_email', os.getenv('SERVICE_ACCOUNT_EMAIL'),
            '--network', os.getenv('NETWORK'),
            '--subnetwork', os.getenv('SUBNETWORK'),
            '--num_workers', os.getenv('NUM_WORKERS'),
        ],
    )
If I run it with this command line, it creates and runs the job in the Google Cloud just fine:
INPUT_SUBSCRIPTION=subscriptionname \
RUNNER=DataflowRunner \
PROJECT=project \
REGION=region \
TEMP_LOCATION=gs://somewhere/temp \
SERVICE_ACCOUNT_EMAIL=serviceaccount@project.iam.gserviceaccount.com \
NETWORK=network \
SUBNETWORK=https://www.googleapis.com/compute/v1/projects/project/regions/region/subnetworks/subnetwork \
NUM_WORKERS=3 \
python read-pubsub-with-dataflow.py
If I omit the RUNNER option, so it uses DirectRunner:
INPUT_SUBSCRIPTION=subscriptionname \
PROJECT=project \
REGION=region \
TEMP_LOCATION=gs://somewhere/temp \
SERVICE_ACCOUNT_EMAIL=serviceaccount@project.iam.gserviceaccount.com \
NETWORK=network \
SUBNETWORK=https://www.googleapis.com/compute/v1/projects/project/regions/region/subnetworks/subnetwork \
NUM_WORKERS=3 \
python read-pubsub-with-dataflow.py
it fails with a whole flood of error messages, but I'll just include the first one (I think the rest are just cascading):
INFO:apache_beam.runners.direct.direct_runner:Running pipeline with DirectRunner.
/Users/denis/redacted/env/lib/python3.6/site-packages/google/auth/_default.py:70: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK without a quota project. You might receive a "quota exceeded" or "API not enabled" error. We recommend you rerun `gcloud auth application-default login` and make sure a quota project is added. Or you can use service accounts instead. For more information about service accounts, see https://cloud.google.com/docs/authentication/
warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
ERROR:apache_beam.runners.direct.executor:Exception at bundle <apache_beam.runners.direct.bundle_factory._Bundle object at 0x7fed3e368448>, due to an exception.
Traceback (most recent call last):
File "/Users/denis/redacted/env/lib/python3.6/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 694, in _read_from_pubsub
self._sub_name, max_messages=10, return_immediately=True)
File "/Users/denis/redacted/env/lib/python3.6/site-packages/google/cloud/pubsub_v1/_gapic.py", line 40, in <lambda>
fx = lambda self, *a, **kw: wrapped_fx(self.api, *a, **kw) # noqa
File "/Users/denis/redacted/env/lib/python3.6/site-packages/google/pubsub_v1/services/subscriber/client.py", line 1106, in pull
"If the `request` argument is set, then none of "
ValueError: If the `request` argument is set, then none of the individual field arguments should be set.
During handling of the above exception, another exception occurred:
...etc...
I suspect maybe this has to do with credentials? Or our project config? Perhaps I should try in a new blank project.
This turned out to be incompatible package versions. My requirements.txt had been:
apache_beam[gcp]
google_apitools
google-cloud-pubsub
but that was installing a version of the google-cloud-pubsub package that was breaking apache_beam. I changed my requirements.txt to:
apache_beam[gcp]
google_apitools
and it all works now!
And for what it's worth, running locally with DirectRunner I obviously did not need a lot of the options that I needed for DataflowRunner. This sufficed:
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json \
RUNNER=DirectRunner \
INPUT_SUBSCRIPTION=projects/mytopic/subscriptions/mysubscription \
python read-pubsub-with-dataflow.py

pyspark using mysql database on remote machine

I am using Python 2.7 on Ubuntu and running Spark from a Python script using a SparkContext.
My DB is a remote MySQL database with a username and password.
I try to query it using this code:
sc = createSparkContext()
sql = SQLContext(sc)
df = sql.read.format('jdbc').options(
    url='jdbc:mysql://ip:port?user=user&password=password',
    dbtable='(select * from tablename limit 100) as tablename',
).load()
print df.head()
And get this error
py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: java.sql.SQLException: No suitable driver
I found that I need the JDBC driver for mysql.
I downloaded the platform-independent one from here.
I tried including it when starting the Spark context, using this code:
conf.set("spark.driver.extraClassPath", "/var/nfs/general/mysql-connector-java-5.1.43")
and tried to install it using
sudo apt-get install libmysql-java
on the master machine, on the DB machine, and on the machine running the Python script, with no luck.
Edit 2:
I tried using
conf.set("spark.executor.extraClassPath", "/var/nfs/general/mysql-connector-java-5.1.43")
It seems, from the output of
print sc.getConf().getAll()
which is
[(u'spark.driver.memory', u'3G'),
 (u'spark.executor.extraClassPath', u'file:///var/nfs/general/mysql-connector-java-5.1.43.jar'),
 (u'spark.app.name', u'spark-basic'),
 (u'spark.app.id', u'app-20170830'),
 (u'spark.rdd.compress', u'True'),
 (u'spark.master', u'spark://127.0.0.1:7077'),
 (u'spark.driver.port', u''),
 (u'spark.serializer.objectStreamReset', u'100'),
 (u'spark.executor.memory', u'2G'),
 (u'spark.executor.id', u'driver'),
 (u'spark.submit.deployMode', u'client'),
 (u'spark.driver.host', u''),
 (u'spark.driver.cores', u'3')]
that it includes the correct path, but I still get the same "No suitable driver" error...
What am I missing here?
Thanks
You need to set the classpath for both the driver and the worker nodes. Add the following to the Spark configuration:
conf.set("spark.executor.extraClassPath", "/var/nfs/general/mysql-connector-java-5.1.43")
conf.set("spark.driver.extraClassPath", "/var/nfs/general/mysql-connector-java-5.1.43")
Or you can pass it using
import os
os.environ['SPARK_CLASSPATH'] = "/path/to/driver/mysql.jar"
For Spark >= 2.0.0 you can add a comma-separated list of jars to the spark-defaults.conf file located in the spark_home/conf directory, like this:
spark.jars path_2_jar1,path_2_jar2
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Word Count") \
    .config("spark.driver.extraClassPath", "/home/tuhin/mysql.jar") \
    .getOrCreate()

dataframe_mysql = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://ip:port/db_name") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", "employees") \
    .option("user", "root") \
    .option("password", "12345678") \
    .load()

print(dataframe_mysql.columns)
"/home/tuhin/mysql.jar" is the location of mysql jar file

How to update rrd database in python

I am new to programming and am working from already created scripts. I am trying to update my RRD database in Python. I have managed to create the code below, which doesn't return any errors, but when I try to generate a graph it doesn't contain any data.
#!/usr/bin/python
#modules
import sys
import os
import time
import rrdtool
import Adafruit_DHT as dht
#assign data
h,t = dht.read_retry(dht.DHT22, 22)
#display data
print 'Temp={0:0.1f}*C'.format(t, h)
print 'Humidity={1:0.1f}%'.format(t,h)
#update database
data = "N:h:t"
ret = rrdtool.update("%s/humidity.rrd" % (os.path.dirname(os.path.abspath(__file__))),data)
if ret:
    print rrdtool.error()
time.sleep(300)
Below is my database specification:
#! /bin/bash
rrdtool create humidity.rrd \
--start "01/01/2015" \
--step 300 \
DS:th_dht22:GAUGE:1200:-40:100 \
DS:hm_dht22:GAUGE:1200:-40:100 \
RRA:AVERAGE:0.5:1:288 \
RRA:AVERAGE:0.5:6:336 \
RRA:AVERAGE:0.5:24:372 \
RRA:AVERAGE:0.5:144:732 \
RRA:MIN:0.5:1:288 \
RRA:MIN:0.5:6:336 \
RRA:MIN:0.5:24:372 \
RRA:MIN:0.5:144:732 \
RRA:MAX:0.5:1:288 \
RRA:MAX:0.5:6:336 \
RRA:MAX:0.5:24:372 \
RRA:MAX:0.5:144:732
rrdtool will silently ignore updates that are either too far apart or lie outside the predefined input range. I would add a logging feature to your code to see what you are trying to feed to rrdtool.
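A minimal sketch of what that logging could look like (my own illustration, not part of the answer): it assumes the DS order is temperature then humidity, matching the create script above, and builds the update string from the measured values instead of the literal "N:h:t".
# Hedged sketch, not part of the original answer: build the update string from the
# measured values and log exactly what is handed to rrdtool.
import logging
import rrdtool

logging.basicConfig(level=logging.INFO)

def update_humidity_rrd(rrd_path, temperature, humidity):
    # DS order in humidity.rrd: th_dht22 (temperature), then hm_dht22 (humidity)
    data = "N:%.1f:%.1f" % (temperature, humidity)
    logging.info("rrdtool update %s %s", rrd_path, data)
    try:
        rrdtool.update(rrd_path, data)
    except Exception as exc:  # the exact exception type depends on the rrdtool binding
        logging.error("rrdtool rejected the update: %s", exc)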

What is wrong with my boto elastic mapreduce jar jobflow parameters?

I am using the boto library to create a job flow in Amazon's Elastic MapReduce web service (EMR). The following code should create a step:
step2 = JarStep(name='Find similiar items',
                jar='s3n://recommendertest/mahout-core/mahout-core-0.5-SNAPSHOT.jar',
                main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
                step_args=['s3n://bucket/output/' + run_id + '/aggregate_watched/',
                           's3n://bucket/output/' + run_id + '/similiar_items/',
                           'SIMILARITY_PEARSON_CORRELATION'])
When I run the job flow, it always fails throwing this error:
java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/JobContext
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.JobContext
This is the line in the EMR logs invoking the java code:
2011-01-24T22:18:54.491Z INFO Executing /usr/lib/jvm/java-6-sun/bin/java \
-cp /home/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-0.18-core.jar:/home/hadoop/hadoop-0.18-tools.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* \
-Xmx1000m \
-Dhadoop.log.dir=/mnt/var/log/hadoop/steps/3 \
-Dhadoop.log.file=syslog \
-Dhadoop.home.dir=/home/hadoop \
-Dhadoop.id.str=hadoop \
-Dhadoop.root.logger=INFO,DRFA \
-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/3/tmp \
-Djava.library.path=/home/hadoop/lib/native/Linux-i386-32 \
org.apache.hadoop.mapred.JobShell \
/mnt/var/lib/hadoop/steps/3/mahout-core-0.5-SNAPSHOT.jar \
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
s3n://..../output/job_2011-01-24_23:09:29/aggregate_watched/ \
s3n://..../output/job_2011-01-24_23:09:29/similiar_items/ \
SIMILARITY_PEARSON_CORRELATION
What is wrong with the parameters? The java class definition can be found here:
https://hudson.apache.org/hudson/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html
I found the solution to the problem:

1. You need to specify Hadoop version 0.20 in the jobflow parameters.
2. You need to run the JAR step with mahout-core-0.5-SNAPSHOT-job.jar, not with mahout-core-0.5-SNAPSHOT.jar.
3. If you have an additional streaming step in your jobflow, you need to fix a bug in boto:
   - Open boto/emr/step.py
   - Change line 138 to "return '/home/hadoop/contrib/streaming/hadoop-streaming.jar'"
   - Save and reinstall boto
This is how the run_jobflow function should be invoked to run with Mahout:
jobid = emr_conn.run_jobflow(name=name,
                             log_uri='s3n://' + main_bucket_name + '/emr-logging/',
                             enable_debugging=1,
                             hadoop_version='0.20',
                             steps=[step1, step2])
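And a sketch of the corrected step from point 2, keeping the question's bucket and path placeholders; only the jar name changes to the bundled "-job" artifact:
# Sketch of fix 2: point the step at the "-job" jar, which bundles Mahout's
# Hadoop dependencies. Bucket names and paths are the question's placeholders.
from boto.emr.step import JarStep

run_id = 'job_2011-01-24_23:09:29'  # example run id, as seen in the logs above
step2 = JarStep(name='Find similiar items',
                jar='s3n://recommendertest/mahout-core/mahout-core-0.5-SNAPSHOT-job.jar',
                main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
                step_args=['s3n://bucket/output/' + run_id + '/aggregate_watched/',
                           's3n://bucket/output/' + run_id + '/similiar_items/',
                           'SIMILARITY_PEARSON_CORRELATION'])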
The fix to boto described above (i.e. using the non-versioned hadoop-streaming.jar file) has been incorporated into the GitHub master in this commit:
https://github.com/boto/boto/commit/a4e8e065473b5ff9af554ceb91391f286ac5cac7
For reference, here is how to do this from boto:
import boto.emr.connection as botocon
from boto.emr.step import JarStep

con = botocon.EmrConnection(aws_access_key_id='', aws_secret_access_key='')
step = JarStep(name='Find similar items',
               jar='s3://mahout-core-0.6-job.jar',
               main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
               action_on_failure='CANCEL_AND_WAIT',
               step_args=['--input', 's3://', '--output', 's3://',
                          '--similarityClassname', 'SIMILARITY_PEARSON_CORRELATION'])
con.add_jobflow_steps('jflow', [step])
Obviously you need to upload mahout-core-0.6-job.jar to an accessible S3 location, and the input and output locations have to be accessible as well.
