Dataflow reading from PubSub works at GCP, can't run locally

I have a small test Dataflow job that just reads from a PubSub subscription and discards the message, that we're using to start some proof-of-concept work.
It works just fine running at GCP, but fails locally. My expectation is that the same code should work either way, just by switching the Dataflow runner, but perhaps that's not the case? Here's the code:
import os
from datetime import datetime
import logging
from apache_beam import Map, io, Pipeline
from apache_beam.options.pipeline_options import PipelineOptions
def noop(element):
    pass
def run(input_subscription, pipeline_args=None):
    pipeline_options = PipelineOptions(
        pipeline_args, streaming=True, save_main_session=True
    )
    with Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Read from Pub/Sub" >> io.ReadFromPubSub(subscription=input_subscription, with_attributes=True)
            | "noop" >> Map(noop)
        )
if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    run(
        os.environ['INPUT_SUBSCRIPTION'],
        [
            '--runner', os.getenv('RUNNER', 'DirectRunner'),
            '--project', os.getenv('PROJECT'),
            '--region', os.getenv('REGION'),
            '--temp_location', os.getenv('TEMP_LOCATION'),
            '--service_account_email', os.getenv('SERVICE_ACCOUNT_EMAIL'),
            '--network', os.getenv('NETWORK'),
            '--subnetwork', os.getenv('SUBNETWORK'),
            '--num_workers', os.getenv('NUM_WORKERS'),
        ]
    )
If I run it with this command line, it creates and runs the job in the Google Cloud just fine:
INPUT_SUBSCRIPTION=subscriptionname \
RUNNER=DataflowRunner \
PROJECT=project \
REGION=region \
TEMP_LOCATION=gs://somewhere/temp \
SERVICE_ACCOUNT_EMAIL=serviceaccount@project.iam.gserviceaccount.com \
NETWORK=network \
SUBNETWORK=https://www.googleapis.com/compute/v1/projects/project/regions/region/subnetworks/subnetwork \
NUM_WORKERS=3 \
python read-pubsub-with-dataflow.py
If I omit the RUNNER option, so it uses DirectRunner:
INPUT_SUBSCRIPTION=subscriptionname \
PROJECT=project \
REGION=region \
TEMP_LOCATION=gs://somewhere/temp \
SERVICE_ACCOUNT_EMAIL=serviceaccount@project.iam.gserviceaccount.com \
NETWORK=network \
SUBNETWORK=https://www.googleapis.com/compute/v1/projects/project/regions/region/subnetworks/subnetwork \
NUM_WORKERS=3 \
python read-pubsub-with-dataflow.py
it fails with a whole flood of error messages, but I'll just include the first one (I think the rest are just cascading):
INFO:apache_beam.runners.direct.direct_runner:Running pipeline with DirectRunner.
/Users/denis/redacted/env/lib/python3.6/site-packages/google/auth/_default.py:70: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK without a quota project. You might receive a "quota exceeded" or "API not enabled" error. We recommend you rerun `gcloud auth application-default login` and make sure a quota project is added. Or you can use service accounts instead. For more information about service accounts, see https://cloud.google.com/docs/authentication/
warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
ERROR:apache_beam.runners.direct.executor:Exception at bundle <apache_beam.runners.direct.bundle_factory._Bundle object at 0x7fed3e368448>, due to an exception.
Traceback (most recent call last):
File "/Users/denis/redacted/env/lib/python3.6/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 694, in _read_from_pubsub
self._sub_name, max_messages=10, return_immediately=True)
File "/Users/denis/redacted/env/lib/python3.6/site-packages/google/cloud/pubsub_v1/_gapic.py", line 40, in <lambda>
fx = lambda self, *a, **kw: wrapped_fx(self.api, *a, **kw) # noqa
File "/Users/denis/redacted/env/lib/python3.6/site-packages/google/pubsub_v1/services/subscriber/client.py", line 1106, in pull
"If the `request` argument is set, then none of "
ValueError: If the `request` argument is set, then none of the individual field arguments should be set.
During handling of the above exception, another exception occurred:
...etc...
I suspect maybe this has to do with credentials? Or our project config? Perhaps I should try in a new blank project.

This turned out to be incompatible package versions. My requirements.txt had been:
apache_beam[gcp]
google_apitools
google-cloud-pubsub
but that was installing a version of the google-cloud-pubsub package that was breaking apache_beam. I changed my requirements.txt to:
apache_beam[gcp]
google_apitools
and it all works now!
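To see which versions pip actually resolved in a given environment, something along these lines may help. This is only a minimal sketch: it relies on the standard library's importlib.metadata (Python 3.8 or later, so newer than the 3.6 environment shown in the traceback), and the helper name installed_versions is hypothetical.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


# Distribution names taken from the requirements.txt in this answer.
for name, version in installed_versions(
    ["apache-beam", "google-cloud-pubsub", "google-apitools"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

Comparing the output before and after editing requirements.txt shows whether a separately listed google-cloud-pubsub changed the version that ends up installed.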
And for what it's worth, running locally with DirectRunner I obviously did not need a lot of the options that I needed for DataflowRunner. This sufficed:
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json \
RUNNER=DirectRunner \
INPUT_SUBSCRIPTION=projects/mytopic/subscriptions/mysubscription \
python read-pubsub-with-dataflow.py
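Since the script in the question passes every os.getenv(...) result straight into the argument list, any unset variable ends up as a None entry. One possible way to keep a single script for both runners is to add a flag only when its variable is set. The sketch below is an assumption-laden illustration: the helper name build_pipeline_args and the ENV_TO_FLAG mapping are hypothetical, while the flag names are the ones used in the question.

```python
import os

# Hypothetical mapping from environment variable to pipeline flag;
# the flag names are copied from the question's script.
ENV_TO_FLAG = {
    "PROJECT": "--project",
    "REGION": "--region",
    "TEMP_LOCATION": "--temp_location",
    "SERVICE_ACCOUNT_EMAIL": "--service_account_email",
    "NETWORK": "--network",
    "SUBNETWORK": "--subnetwork",
    "NUM_WORKERS": "--num_workers",
}


def build_pipeline_args(env=None):
    """Build the argument list, skipping flags whose variable is unset or empty."""
    env = os.environ if env is None else env
    args = ["--runner", env.get("RUNNER", "DirectRunner")]
    for var, flag in ENV_TO_FLAG.items():
        value = env.get(var)
        if value:
            args += [flag, value]
    return args
```

With this, run(os.environ['INPUT_SUBSCRIPTION'], build_pipeline_args()) would pass only --runner when nothing else is set.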

Related

ModuleNotFoundError in Dataflow job

I am trying to execute an Apache Beam pipeline as a Dataflow job in Google Cloud Platform.
My project structure is as follows:
root_dir/
    __init__.py
    setup.py
    main.py
    utils/
        __init__.py
        log_util.py
        config_util.py
Here's my setup.py
import setuptools
setuptools.setup(
    name='dataflow_example',
    version='1.0',
    install_requires=[
        "google-cloud-tasks==2.2.0",
        "google-cloud-pubsub>=0.1.0",
        "google-cloud-storage==1.39.0",
        "google-cloud-bigquery==2.6.2",
        "google-cloud-secret-manager==2.0.0",
        "google-api-python-client==2.3.0",
        "oauth2client==4.1.3",
        "apache-beam[gcp]>=2.20.0",
        "wheel>=0.36.2"
    ],
    packages=setuptools.find_packages()
)
Here's my pipeline code:
import math
import apache_beam as beam
from datetime import datetime
from apache_beam.options.pipeline_options import PipelineOptions
from utils.log_util import LogUtil
from utils.config_util import ConfigUtil
class DataflowExample:
    config = {}
    def __init__(self):
        self.config = ConfigUtil.get_config(module_config=["config"])
        self.project = self.config['project']
        self.region = self.config['location']
        self.bucket = self.config['core_bucket']
        self.batch_size = 10
    def execute_pipeline(self):
        try:
            LogUtil.log_n_notify(log_type="info", msg=f"Dataflow started")
            query = "SELECT id, name, company FROM `<bigquery_table>` LIMIT 10"
            beam_options = {
                "project": self.project,
                "region": self.region,
                "job_name": "dataflow_example",
                "runner": "DataflowRunner",
                "temp_location": f"gs://{self.bucket}/temp_location/"
            }
            options = PipelineOptions(**beam_options, save_main_session=True)
            with beam.Pipeline(options=options) as pipeline:
                data = (
                    pipeline
                    | 'Read from BQ ' >> beam.io.Read(beam.io.ReadFromBigQuery(query=query, use_standard_sql=True))
                    | 'Count records' >> beam.combiners.Count.Globally()
                    | 'Print ' >> beam.ParDo(PrintCount(), self.batch_size)
                )
            LogUtil.log_n_notify(log_type="info", msg=f"Dataflow completed")
        except Exception as e:
            LogUtil.log_n_notify(log_type="error", msg=f"Exception in execute_pipeline - {str(e)}")
class PrintCount(beam.DoFn):
    def __init__(self):
        self.logger = LogUtil()
    def process(self, row_count, batch_size):
        try:
            current_date = datetime.today().date()
            total = int(math.ceil(row_count / batch_size))
            self.logger.log_n_notify(log_type="info", msg=f"Records pulled from table on {current_date} is {row_count}")
            self.logger.log_n_notify(log_type="info", msg=f"Records per batch: {batch_size}. Total batches: {total}")
        except Exception as e:
            self.logger.log_n_notify(log_type="error", msg=f"Exception in PrintCount.process - {str(e)}")
if __name__ == "__main__":
    df_example = DataflowExample()
    df_example.execute_pipeline()
The pipeline does the following:
1) Queries a BigQuery table.
2) Counts the total records returned by the query.
3) Prints the count using the custom log module in the utils folder.
I am running the job from Cloud Shell using the command python3 - main.py
Although the Dataflow job starts, the worker nodes throw an error after a few minutes: "ModuleNotFoundError: No module named 'utils'"
The "utils" folder is present, and the same code works fine when executed with "DirectRunner".
log_util and config_util are custom utility files for logging and config fetching respectively.
I also tried running with the setup_file option, as python3 - main.py --setup_file </path/of/setup.py>, which makes the job freeze; it does not proceed even after 15 minutes.
How do I resolve the ModuleNotFoundError with "DataflowRunner"?

Posting as community wiki. As confirmed by @GopinathS, the error and fix are as follows:
The error encountered by the workers is: Beam SDK base version 2.32.0 does not match Dataflow Python worker version 2.28.0. Please check Dataflow worker startup logs and make sure that correct version of Beam SDK is installed.
To fix this, "apache-beam[gcp]>=2.20.0" is removed from install_requires in setup.py, because the '>=' resolves to the latest available version (2.32.0 as of this writing) while the workers are only on version 2.28.0.
Updated setup.py:
import setuptools
setuptools.setup(
    name='dataflow_example',
    version='1.0',
    install_requires=[
        "google-cloud-tasks==2.2.0",
        "google-cloud-pubsub>=0.1.0",
        "google-cloud-storage==1.39.0",
        "google-cloud-bigquery==2.6.2",
        "google-cloud-secret-manager==2.0.0",
        "google-api-python-client==2.3.0",
        "oauth2client==4.1.3",  # removed apache-beam[gcp]>=2.20.0
        "wheel>=0.36.2"
    ],
    packages=setuptools.find_packages()
)
Updated beam_options in the pipeline code:
beam_options = {
    "project": self.project,
    "region": self.region,
    "job_name": "dataflow_example",
    "runner": "DataflowRunner",
    "temp_location": f"gs://{self.bucket}/temp_location/",
    "setup_file": "./setup.py"
}
Also make sure that you pass all the pipeline options together, not only some of them.
If you pass --setup_file </path/of/setup.py> on the command line, make sure your code parses that argument (for example with an argument parser) and adds the setup file path to the already defined beam_options variable.
To avoid parsing the argument and appending it to beam_options, I instead added it directly to beam_options as "setup_file": "./setup.py".
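For the alternative described above, where --setup_file is given on the command line, the parsing and merging step could look roughly like this. It is a minimal sketch using only the standard library's argparse; the helper name merge_setup_file is hypothetical, and the resulting dict is meant to be passed on as PipelineOptions(**beam_options) in the same way as in the question.

```python
import argparse


def merge_setup_file(beam_options, argv=None):
    """Return a copy of beam_options with setup_file taken from argv, if given."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--setup_file", default=None)
    # parse_known_args ignores any other flags instead of raising an error.
    known_args, _unknown = parser.parse_known_args(argv)
    merged = dict(beam_options)
    if known_args.setup_file:
        merged["setup_file"] = known_args.setup_file
    return merged
```

Calling merge_setup_file(beam_options) with no argv reads sys.argv, so the same dict then carries every option in one place rather than splitting them between the code and the command line.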

Dataflow might have problems installing platform-locked packages on an isolated network.
It cannot compile them if there is no network access.
Or maybe it tries to install them but, since it cannot compile, it downloads wheels? No idea.
Still, to be able to use packages like psycopg2 (which ships binaries) or google-cloud-secret-manager (no binaries itself, BUT its dependencies have binaries), you need to install everything that has no binaries (none-any wheels) AND no dependencies with binaries via requirements.txt, and the rest via the --extra_packages parameter with wheels. Example:
--extra_packages=package_1_needed_by_2-manylinux.whl \
--extra_packages=package_2_needed_by_3-manylinux.whl \
--extra_packages=what-you-need_needing_3-none-any.whl
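The split described above can be partly automated. The following is a rough sketch under stated assumptions: is_pure_wheel only looks at the "-none-any.whl" filename suffix mentioned in this answer and cannot tell whether a pure wheel's dependencies ship binaries, so that part still needs a manual check. Both helper names are hypothetical, the flag spelling is the one used in the example above, and the wheel filenames in the usage note are illustrative.

```python
def is_pure_wheel(filename):
    """True if the wheel filename carries the platform-independent none-any tag."""
    return filename.endswith("-none-any.whl")


def extra_packages_flags(wheel_paths):
    """Turn an ordered list of wheel paths into one flag per wheel.

    The order is preserved so that dependencies can be listed before
    the packages that need them, as in the example above.
    """
    flags = []
    for path in wheel_paths:
        if not path.endswith(".whl"):
            raise ValueError(f"not a wheel file: {path}")
        flags.append(f"--extra_packages={path}")
    return flags
```

For instance, extra_packages_flags(["dep-1.0-cp39-cp39-manylinux1_x86_64.whl", "app-1.0-py3-none-any.whl"]) yields the two flags in that order.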

Insert data into mysql using dataflow

The code below builds the pipeline and the DAG is generated, but it fails with this error:
RuntimeError: NotImplementedError [while running 'generatedPtransform-438']
Please let me know if there is any direct connector for MySQL in Python for Beam.
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1
from google.cloud import bigquery
import mysql.connector
import apache_beam as beam
import logging
import argparse
import sys
import re
PROJECT="12344"
TOPIC = "projects/12344/topics/mytopic"
class insertfn(beam.DoFn):
    def insertdata(self,data):
        db_conn=mysql.connector.connect(host="localhost",user="abc",passwd="root",database="new")
        db_cursor=db_conn.cursor()
        emp_sql = " INSERT INTO emp(ename,eid,dept) VALUES (%s,%s,%s)"
        db_cursor.executemany(emp_sql,(data[0],data[1],data[2]))
        db_conn.commit()
        print(db_cursor.rowcount,"record inserted")
class Split(beam.DoFn):
    def process(self, data):
        data = data.split(",")
        return [{
            'ename': data[0],
            'eid': data[1],
            'dept': data[2]
        }]
def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_topic")
    parser.add_argument("--output")
    known_args = parser.parse_known_args(argv)
    p = beam.Pipeline(options=PipelineOptions())
    (p
     | 'ReadData' >> beam.io.ReadFromPubSub(topic=TOPIC).with_output_types(bytes)
     | "Decode" >> beam.Map(lambda x: x.decode('utf-8'))
     | 'ParseCSV' >> beam.ParDo(Split())
     | 'WriteToMySQL' >> beam.ParDo(insertfn())
    )
    result = p.run()
    result.wait_until_finish()

After our discussion in the comment section, I noticed that you are not using the proper command to execute the Dataflow pipeline.
According to the documentation, there are mandatory flags which must be defined in order to run the pipeline on the Dataflow managed service. These flags are described below:
job_name - The name of the Dataflow job being executed.
project - The ID of your Google Cloud project.
runner - The pipeline runner that will parse your program and construct your pipeline. For cloud execution, this must be DataflowRunner.
staging_location - A Cloud Storage path for Dataflow to stage code packages needed by workers executing the job.
temp_location - A Cloud Storage path for Dataflow to stage temporary job files created during the execution of the pipeline.
In addition to these flags, there are others you can use. In your case, since you use a Pub/Sub topic:
--input_topic: sets the input Pub/Sub topic to read messages from.
Therefore, an example to run a Dataflow pipeline would be as follows:
python RunPipelineDataflow.py \
--job_name=jobName \
--project=$PROJECT_NAME \
--runner=DataflowRunner \
--staging_location=gs://YOUR_BUCKET_NAME/AND_STAGING_DIRECTORY \
--temp_location=gs://$BUCKET_NAME/temp \
--input_topic=projects/$PROJECT_NAME/topics/$TOPIC_NAME
I would like to point out the importance of using DataflowRunner: it lets you use the Cloud Dataflow managed service, which provides autoscaling and dynamic work rebalancing. However, it is also possible to use DirectRunner, which executes your pipeline on your own machine and is designed for validating the pipeline.
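The mandatory flags listed in this answer can be checked before the job is submitted. The sketch below is only an illustration using the standard library's argparse: the helper name missing_dataflow_flags is hypothetical, and the list of required names simply mirrors the flags described above rather than anything enforced by Beam itself.

```python
import argparse

# Flag names as listed in this answer.
REQUIRED_FLAGS = ("job_name", "project", "runner", "staging_location", "temp_location")


def missing_dataflow_flags(argv):
    """Return the names of the mandatory flags that are absent from argv."""
    # allow_abbrev=False stops argparse from matching shortened flag names.
    parser = argparse.ArgumentParser(allow_abbrev=False)
    for name in REQUIRED_FLAGS:
        parser.add_argument(f"--{name}", default=None)
    # Unknown flags such as --input_topic are ignored rather than rejected.
    known_args, _unknown = parser.parse_known_args(argv)
    return [name for name in REQUIRED_FLAGS if getattr(known_args, name) is None]
```

A script could call missing_dataflow_flags(sys.argv[1:]) at startup and stop with a clear message if the returned list is not empty.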

KafkaRecord cannot be cast to [B

I'm trying to process the data streaming from Apache Kafka using the Python SDK for Apache Beam with the Flink runner. After running Kafka 2.4.0 and Flink 1.8.3, I follow these steps:
1) Compile and run Beam 2.16 with Flink 1.8 runner.
git clone --single-branch --branch release-2.16.0 https://github.com/apache/beam.git beam-2.16.0
cd beam-2.16.0
nohup ./gradlew :runners:flink:1.8:job-server:runShadow -PflinkMasterUrl=localhost:8081 &
2) Run the Python pipeline.
from apache_beam import Pipeline
from apache_beam.io.external.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions
if __name__ == '__main__':
    with Pipeline(options=PipelineOptions([
        '--runner=FlinkRunner',
        '--flink_version=1.8',
        '--flink_master_url=localhost:8081',
        '--environment_type=LOOPBACK',
        '--streaming'
    ])) as pipeline:
        (
            pipeline
            | 'read' >> ReadFromKafka({'bootstrap.servers': 'localhost:9092'}, ['test'])  # [BEAM-3788] ???
        )
        result = pipeline.run()
        result.wait_until_finish()
3) Publish some data to Kafka.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
>{"hello":"world!"}
The Python script throws this error:
[flink-runner-job-invoker] ERROR org.apache.beam.runners.fnexecution.jobsubmission.JobInvocation - Error during job invocation BeamApp-USER-somejob. org.apache.flink.client.program.ProgramInvocationException: Job failed. (JobID: xxx)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:268)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:483)
at org.apache.beam.runners.flink.FlinkExecutionEnvironments$BeamFlinkRemoteStreamEnvironment.executeRemotely(FlinkExecutionEnvironments.java:360)
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.execute(RemoteStreamEnvironment.java:310)
at org.apache.beam.runners.flink.FlinkStreamingPortablePipelineTranslator$StreamingTranslationContext.execute(FlinkStreamingPortablePipelineTranslator.java:173)
at org.apache.beam.runners.flink.FlinkPipelineRunner.runPipelineWithTranslator(FlinkPipelineRunner.java:104)
at org.apache.beam.runners.flink.FlinkPipelineRunner.run(FlinkPipelineRunner.java:80)
at org.apache.beam.runners.fnexecution.jobsubmission.JobInvocation.runPipeline(JobInvocation.java:78)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:265)
... 13 more
Caused by: java.lang.ClassCastException: org.apache.beam.sdk.io.kafka.KafkaRecord cannot be cast to [B
at org.apache.beam.sdk.coders.ByteArrayCoder.encode(ByteArrayCoder.java:41)
at org.apache.beam.sdk.coders.LengthPrefixCoder.encode(LengthPrefixCoder.java:56)
at org.apache.beam.sdk.values.ValueWithRecordId$ValueWithRecordIdCoder.encode(ValueWithRecordId.java:105)
at org.apache.beam.sdk.values.ValueWithRecordId$ValueWithRecordIdCoder.encode(ValueWithRecordId.java:81)
at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:578)
at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:529)
at org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream(CoderUtils.java:82)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:66)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:51)
at org.apache.beam.sdk.util.CoderUtils.clone(CoderUtils.java:141)
at org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.copy(CoderTypeSerializer.java:67)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:577)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:554)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:534)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:718)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:696)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:305)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:394)
at org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.emitElement(UnboundedSourceWrapper.java:341)
at org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:283)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)
at org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:302)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
... 1 more
ERROR:root:java.lang.ClassCastException: org.apache.beam.sdk.io.kafka.KafkaRecord cannot be cast to [B
[flink-runner-job-invoker] INFO org.apache.beam.runners.fnexecution.artifact.BeamFileSystemArtifactRetrievalService - Manifest at/tmp/artifacts0k1mnin0/somejob/MANIFEST has 0 artifact locations
[flink-runner-job-invoker] INFO org.apache.beam.runners.fnexecution.artifact.BeamFileSystemArtifactStagingService - Removed dir /tmp/artifacts0k1mnin0/job_somejob/
Traceback (most recent call last):
File "main.py", line 40, in <module>
run()
File "main.py", line 37, in run
result.wait_until_finish()
File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/runners/portability/portable_runner.py", line 439, in wait_until_finish
self._job_id, self._state, self._last_error_message()))
RuntimeError: Pipeline BeamApp-USER-somejob failed in state FAILED: java.lang.ClassCastException: org.apache.beam.sdk.io.kafka.KafkaRecord cannot be cast to [B
I tried other deserializers available in Kafka but they did not work: Couldn't infer Coder from class org.apache.kafka.common.serialization.StringDeserializer. This error is originating from this piece of code.
Am I doing something wrong?

Disclaimer: this is my first encounter with the Apache Beam project.
According to this JIRA, Kafka consumer support is quite new in Beam (at least in the Python interface). It seems there is still a problem with FlinkRunner combined with this new API. Even though your code is technically correct, it will not run correctly on Flink. There is a patch available, which looks more like a quick fix than a final solution to me. It requires recompilation, so it is not something I would propose using in production. If you are just getting started with the technology and don't want to be blocked, feel free to try it out.

flink run -py wordcount.py caused NullPointerException

I want to process data with Flink's Python API on Windows. But when I use this command to submit a job to a local cluster, it throws a NullPointerException.
bin/flink run -py D:\workspace\python-test\flink-test.py
flink-test.py:
from pyflink.dataset import ExecutionEnvironment
from pyflink.table import TableConfig, DataTypes, BatchTableEnvironment
from pyflink.table.descriptors import Schema, OldCsv, FileSystem
exec_env = ExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
t_config = TableConfig()
t_env = BatchTableEnvironment.create(exec_env, t_config)
t_env.connect(FileSystem().path('D:\\workspace\\python-test\\data.txt')) \
    .with_format(OldCsv()
                 .line_delimiter(' ')
                 .field('word', DataTypes.STRING())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())) \
    .register_table_source('mySource')
t_env.connect(FileSystem().path('D:\\workspace\\python-test\\result.txt')) \
    .with_format(OldCsv()
                 .field_delimiter('\t')
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .register_table_sink('mySink')
t_env.scan('mySource') \
    .group_by('word') \
    .select('word, count(1)') \
    .insert_into('mySink')
t_env.execute("tutorial_job")
Does anyone know why?

I have solved this problem. I traced the error message through the source code.
The NullPointerException is caused by flinkOptPath being empty. I use flink.bat to submit the job, and flink.bat doesn't set flinkOptPath. So I added some code to flink.bat to set it. flink.bat is incomplete for now; we should run Flink on Linux.

What is wrong with my boto elastic mapreduce jar jobflow parameters?

I am using the boto library to create a job flow in Amazon's Elastic MapReduce web service (EMR). The following code should create a step:
step2 = JarStep(name='Find similiar items',
                jar='s3n://recommendertest/mahout-core/mahout-core-0.5-SNAPSHOT.jar',
                main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
                step_args=['s3n://bucket/output/' + run_id + '/aggregate_watched/',
                           's3n://bucket/output/' + run_id + '/similiar_items/',
                           'SIMILARITY_PEARSON_CORRELATION'
                           ])
When I run the job flow, it always fails throwing this error:
java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/JobContext
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.JobContext
This is the line in the EMR logs invoking the java code:
2011-01-24T22:18:54.491Z INFO Executing /usr/lib/jvm/java-6-sun/bin/java \
-cp /home/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/hadoop:/home/hadoop \
/hadoop-0.18-core.jar:/home/hadoop/hadoop-0.18-tools.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* \
-Xmx1000m \
-Dhadoop.log.dir=/mnt/var/log/hadoop/steps/3 \
-Dhadoop.log.file=syslog \
-Dhadoop.home.dir=/home/hadoop \
-Dhadoop.id.str=hadoop \
-Dhadoop.root.logger=INFO,DRFA \
-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/3/tmp \
-Djava.library.path=/home/hadoop/lib/native/Linux-i386-32 \
org.apache.hadoop.mapred.JobShell \
/mnt/var/lib/hadoop/steps/3/mahout-core-0.5-SNAPSHOT.jar \
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
s3n://..../output/job_2011-01-24_23:09:29/aggregate_watched/ \
s3n://..../output/job_2011-01-24_23:09:29/similiar_items/ \
SIMILARITY_PEARSON_CORRELATION
What is wrong with the parameters? The java class definition can be found here:
https://hudson.apache.org/hudson/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html

I found the solution for the problem:
1) You need to specify Hadoop version 0.20 in the jobflow parameters.
2) You need to run the JAR step with mahout-core-0.5-SNAPSHOT-job.jar, not with mahout-core-0.5-SNAPSHOT.jar.
3) If you have an additional streaming step in your jobflow, you need to fix a bug in boto:
   Open boto/emr/step.py
   Change line 138 to "return '/home/hadoop/contrib/streaming/hadoop-streaming.jar'"
   Save and reinstall boto
This is how the job_flow function should be invoked to run with mahout:
jobid = emr_conn.run_jobflow(name = name,
                             log_uri = 's3n://'+ main_bucket_name +'/emr-logging/',
                             enable_debugging=1,
                             hadoop_version='0.20',
                             steps=[step1,step2])

The fix to boto described in step #3 above (i.e. using the non-versioned hadoop-streaming.jar file) has been incorporated into the GitHub master in this commit:
https://github.com/boto/boto/commit/a4e8e065473b5ff9af554ceb91391f286ac5cac7

For reference, here is how to do this from boto:
import boto.emr.connection as botocon
import boto.emr.step as step
con = botocon.EmrConnection(aws_access_key_id='', aws_secret_access_key='')
# Use a separate name so the imported step module is not shadowed.
jar_step = step.JarStep(
    name='Find similar items',
    jar='s3://mahout-core-0.6-job.jar',
    main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
    action_on_failure='CANCEL_AND_WAIT',
    step_args=['--input', 's3://', '--output', 's3://', '--similarityClassname', 'SIMILARITY_PEARSON_CORRELATION'])
con.add_jobflow_steps('jflow', [jar_step])
Obviously you need to upload mahout-core-0.6-job.jar to an accessible S3 location, and the input and output locations have to be accessible as well.
