BeamRunPythonPipelineOperator on DataFlowRunner keeps throwing error missing service_account - python

I am running the following Dataflow config:
test_dataflow = BeamRunPythonPipelineOperator(
    task_id="xxxx",
    runner="DataflowRunner",
    py_file=xxxxx,
    pipeline_options=dataflow_options,
    py_requirements=['apache-beam[gcp]==2.39.0'],
    py_interpreter='python3',
    dataflow_config=DataflowConfiguration(
        job_name="{{task.task_id}}",
        location=LOCATION,
        project_id=PROJECT,
        wait_until_finished=False,
        gcp_conn_id="google_cloud_default",
    ),
    # dataflow_config={"job_name": "{{task.task_id}}", "location": LOCATION, "project_id": PROJECT,
    #                  "wait_until_finished": True, "gcp_conn_id": "google_cloud_default"}
)
It keeps throwing the following error (Airflow version 2.2.5):
Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/apache/beam/operators/beam.py", line 287, in execute
) = self._init_pipeline_options(format_pipeline_options=True, job_name_variable_key="job_name")
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/apache/beam/operators/beam.py", line 183, in _init_pipeline_options
dataflow_job_name, pipeline_options, process_line_callback = self._set_dataflow(
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/apache/beam/operators/beam.py", line 63, in _set_dataflow
pipeline_options = self.__get_dataflow_pipeline_options(
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/apache/beam/operators/beam.py", line 92, in __get_dataflow_pipeline_options
if self.dataflow_config.service_account:
AttributeError: 'DataflowConfiguration' object has no attribute 'service_account'
If I pass service_account, it errors saying the parameter is invalid.

I ran into the same issue.
This is caused by an inconsistency between the DataflowConfiguration shipped with the Dataflow provider and the one expected by the Beam operator: that version of DataflowConfiguration does not accept service_account.
I resolved my issue by upgrading Composer in place, so it picks up the latest Dataflow-related provider package, where this has been fixed.
The service_account attribute was added in this commit: https://github.com/apache/airflow/commit/de65a5cc5acaa1fc87ae8f65d367e101034294a6
If you can't upgrade Composer, try updating the Google providers package to the latest version, or at least to a version newer than 7.0.
You can check the commit in the commit log and identify the minimum version here: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/commits.html#id6
Even though Composer uses its own fork, the OSS package should work. You can see the list of packages in the Composer version list (https://cloud.google.com/composer/docs/concepts/versioning/composer-versions); it says apache-airflow-providers-google==2022.5.18+composer instead of apache-airflow-providers-google==7.0.
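Once the provider package is new enough, a configuration along these lines should be accepted. This is only a sketch: the service account email, py_file path, and the PROJECT / LOCATION / dataflow_options values below are placeholders, not values from the original DAG.

from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowConfiguration

# Placeholder values; replace with your own project, region and pipeline options.
PROJECT = "my-project"
LOCATION = "us-central1"
dataflow_options = {"temp_location": "gs://my-bucket/tmp"}

# Sketch: assumes apache-airflow-providers-google is recent enough that
# DataflowConfiguration accepts the service_account argument.
test_dataflow = BeamRunPythonPipelineOperator(
    task_id="test_dataflow",
    runner="DataflowRunner",
    py_file="gs://my-bucket/pipelines/my_pipeline.py",
    pipeline_options=dataflow_options,
    py_requirements=["apache-beam[gcp]==2.39.0"],
    py_interpreter="python3",
    dataflow_config=DataflowConfiguration(
        job_name="{{ task.task_id }}",
        location=LOCATION,
        project_id=PROJECT,
        wait_until_finished=False,
        gcp_conn_id="google_cloud_default",
        service_account="my-dataflow-sa@my-project.iam.gserviceaccount.com",
    ),
)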


error using pip search (pip search stopped working)

I am getting this error from pip search while studying Python. Below is the output of running pip search. Can you tell me how to fix it?
$ pip search pdbx
ERROR: Exception:
Traceback (most recent call last):
File "*/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 224, in _main
status = self.run(options, args)
File "*/lib/python3.7/site-packages/pip/_internal/commands/search.py", line 62, in run
pypi_hits = self.search(query, options)
File "*/lib/python3.7/site-packages/pip/_internal/commands/search.py", line 82, in search
hits = pypi.search({'name': query, 'summary': query}, 'or')
File "/usr/lib/python3.7/xmlrpc/client.py", line 1112, in __call__
return self.__send(self.__name, args)
File "/usr/lib/python3.7/xmlrpc/client.py", line 1452, in __request
verbose=self.__verbose
File "*/lib/python3.7/site-packages/pip/_internal/network/xmlrpc.py", line 46, in request
return self.parse_response(response.raw)
File "/usr/lib/python3.7/xmlrpc/client.py", line 1342, in parse_response
return u.close()
File "/usr/lib/python3.7/xmlrpc/client.py", line 656, in close
raise Fault(**self._stack[0])
xmlrpc.client.Fault: <Fault -32500: 'RuntimeError: This API has been temporarily disabled due to unmanageable load and will be deprecated in the near future. Please use the Simple or JSON API instead.'>
The pip search command queries PyPI's servers, and PyPI's maintainers have explained that the API endpoint that the pip search command queries is very resource intensive and too expensive for them to always keep open to the public. Consequently they sometimes throttle access and are actually planning to remove it completely soon.
See this GitHub issues thread ...
The solution I am using for now is to pip install pip-search (a utility created by GitHub user @victorgarric).
So, instead of pip search, I use pip_search. Definitely beats searching PyPI via a web browser.
Following the suggestion from JRK in the discussion on GitHub (last comment): the search command is temporarily disabled, so use your browser to search for packages in the meantime. Check the thread on GitHub and give him a thumbs up ;)
Search on the website, https://pypi.org/,
then install the package you want.
The error says:
Please use the Simple or JSON API instead
You can try pypi-simple to query the PyPI repository: https://pypi.org/project/pypi-simple/
It gives an example too, which I tried to use here. With pypi-simple version 0.8.0, 'DistributionPackage' object has no attribute 'get_digest', so that call is commented out:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Nov 11 17:40:03 2020

@author: Pietro
"""
from pypi_simple import PyPISimple


def simple():
    package = input('\npackage to be checked ')
    try:
        with PyPISimple() as client:
            requests_page = client.get_project_page(package)
    except Exception:
        print("\n SOMETHING WENT WRONG !!!!! \n\n",
              "CHECK INTERNET CONNECTION OR DON'T KNOW WHAT HAPPENED !!!\n")
        return

    pkg = requests_page.packages[0]
    print(pkg)
    print(type(pkg))
    print('\n', pkg, '\n')
    print('\n' + pkg.filename + '\n')
    print('\n' + pkg.url + '\n')
    print('\n' + pkg.project + '\n')
    print('\n' + pkg.version + '\n')
    print('\n' + pkg.package_type + '\n')
    # print('\n' + pkg.get_digest() + '\n', 'ENDs HERE !!!!')  # not available in 0.8.0


if __name__ == '__main__':
    simple()
I got -4 so far for this answer, don't know why. I figured out I can also check for a package with the PyPI JSON API:
# package_name = input('insert package name : ')
package_name = 'numpy'

import requests

url = 'https://pypi.org/pypi/' + package_name + '/json'
r = requests.get(url)
try:
    data = r.json()
    for i in data:
        if i == 'info':
            print('ok')
            for j in data[i]:
                if j == 'name':
                    print((data[i])[j])
    print([k for k in (data['releases'])])
except:
    print('something went south !!!!!!!!!!')

KafkaRecord cannot be cast to [B

I'm trying to process data streaming from Apache Kafka using the Python SDK for Apache Beam with the Flink runner. With Kafka 2.4.0 and Flink 1.8.3 running, I follow these steps:
1) Compile and run Beam 2.16 with the Flink 1.8 runner.
git clone --single-branch --branch release-2.16.0 https://github.com/apache/beam.git beam-2.16.0
cd beam-2.16.0
nohup ./gradlew :runners:flink:1.8:job-server:runShadow -PflinkMasterUrl=localhost:8081 &
2) Run the Python pipeline.
from apache_beam import Pipeline
from apache_beam.io.external.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

if __name__ == '__main__':
    with Pipeline(options=PipelineOptions([
        '--runner=FlinkRunner',
        '--flink_version=1.8',
        '--flink_master_url=localhost:8081',
        '--environment_type=LOOPBACK',
        '--streaming'
    ])) as pipeline:
        (
            pipeline
            | 'read' >> ReadFromKafka({'bootstrap.servers': 'localhost:9092'}, ['test'])  # [BEAM-3788] ???
        )
        result = pipeline.run()
        result.wait_until_finish()
3) Publish some data to Kafka.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
>{"hello":"world!"}
The Python script throws this error:
[flink-runner-job-invoker] ERROR org.apache.beam.runners.fnexecution.jobsubmission.JobInvocation - Error during job invocation BeamApp-USER-somejob. org.apache.flink.client.program.ProgramInvocationException: Job failed. (JobID: xxx)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:268)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:483)
at org.apache.beam.runners.flink.FlinkExecutionEnvironments$BeamFlinkRemoteStreamEnvironment.executeRemotely(FlinkExecutionEnvironments.java:360)
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.execute(RemoteStreamEnvironment.java:310)
at org.apache.beam.runners.flink.FlinkStreamingPortablePipelineTranslator$StreamingTranslationContext.execute(FlinkStreamingPortablePipelineTranslator.java:173)
at org.apache.beam.runners.flink.FlinkPipelineRunner.runPipelineWithTranslator(FlinkPipelineRunner.java:104)
at org.apache.beam.runners.flink.FlinkPipelineRunner.run(FlinkPipelineRunner.java:80)
at org.apache.beam.runners.fnexecution.jobsubmission.JobInvocation.runPipeline(JobInvocation.java:78)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:265)
... 13 more
Caused by: java.lang.ClassCastException: org.apache.beam.sdk.io.kafka.KafkaRecord cannot be cast to [B
at org.apache.beam.sdk.coders.ByteArrayCoder.encode(ByteArrayCoder.java:41)
at org.apache.beam.sdk.coders.LengthPrefixCoder.encode(LengthPrefixCoder.java:56)
at org.apache.beam.sdk.values.ValueWithRecordId$ValueWithRecordIdCoder.encode(ValueWithRecordId.java:105)
at org.apache.beam.sdk.values.ValueWithRecordId$ValueWithRecordIdCoder.encode(ValueWithRecordId.java:81)
at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:578)
at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:529)
at org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream(CoderUtils.java:82)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:66)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:51)
at org.apache.beam.sdk.util.CoderUtils.clone(CoderUtils.java:141)
at org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.copy(CoderTypeSerializer.java:67)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:577)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:554)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:534)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:718)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:696)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:305)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:394) at org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.emitElement(UnboundedSourceWrapper.java:341)
at org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:283)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)
at org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:302)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
... 1 more
ERROR:root:java.lang.ClassCastException: org.apache.beam.sdk.io.kafka.KafkaRecord cannot be cast to [B
[flink-runner-job-invoker] INFO org.apache.beam.runners.fnexecution.artifact.BeamFileSystemArtifactRetrievalService - Manifest at/tmp/artifacts0k1mnin0/somejob/MANIFEST has 0 artifact locations
[flink-runner-job-invoker] INFO org.apache.beam.runners.fnexecution.artifact.BeamFileSystemArtifactStagingService - Removed dir /tmp/artifacts0k1mnin0/job_somejob/
Traceback (most recent call last):
File "main.py", line 40, in <module>
run()
File "main.py", line 37, in run
result.wait_until_finish()
File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/runners/portability/portable_runner.py", line 439, in wait_until_finish self._job_id, self._state, self._last_error_message()))
RuntimeError: Pipeline BeamApp-USER-somejob failed in state FAILED: java.lang.ClassCastException: org.apache.beam.sdk.io.kafka.KafkaRecord cannot be cast to [B
I tried other deserializers available in Kafka, but they did not work either: Couldn't infer Coder from class org.apache.kafka.common.serialization.StringDeserializer. This error originates from this piece of code.
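For reference, that deserializer attempt looked roughly like this. This is only a sketch: key_deserializer and value_deserializer take fully qualified Kafka deserializer class names, and the exact keyword names may differ slightly between Beam releases.

from apache_beam.io.external.kafka import ReadFromKafka

# Sketch of the 'read' step with explicit deserializers; this is the variant
# that produced the "Couldn't infer Coder from class ... StringDeserializer" message.
read_transform = ReadFromKafka(
    {'bootstrap.servers': 'localhost:9092'},
    ['test'],
    key_deserializer='org.apache.kafka.common.serialization.StringDeserializer',
    value_deserializer='org.apache.kafka.common.serialization.StringDeserializer')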
Am I doing something wrong?
Disclaimer: this is my first encounter with the Apache Beam project.
It seems that Kafka consumer support is quite a fresh thing in Beam (at least in the Python interface), according to this JIRA. Apparently there is still a problem with the FlinkRunner combined with this new API. Even though your code is technically correct, it will not run correctly on Flink. There is a patch available, which looks more like a quick fix than a final solution to me. It requires recompilation and thus is not something I would propose using in production. If you are just getting started with the technology and don't want to be blocked, feel free to try it out.

How to commit and push files using python library GitPython

Requirement:
Commit and push files to a GitHub repository from a Python script.
The credentials should be included in the script.
Issue:
If credentials are provided in the script, the commit operation executes but throws the following error:
Traceback (most recent call last):
File "/home/amith/example.py", line 14, in <module>
repo.index.add(folder_path)
AttributeError: 'Repository' object has no attribute 'index'
If credentials are not provided in the script, the commit operation works properly when they are entered on the terminal.
I need to integrate this script into a Django application, which should accept the credentials from a configuration file.
I have tried the following links, but nothing has worked for me yet.
- link1
- link2
- link3
from git import Repo
from github import Github
from pdb import set_trace as bp
repo_dir = '--------'
repo = Repo(repo_dir)
# using username and password
g = Github("-----", "------")
folder_path = '----------'
commit_message = 'Add New file'
repo.index.add(folder_path)
repo.index.commit(commit_message)
origin = repo.remote('origin')
origin.push()
So, I am getting this error: "AttributeError: 'Repository' object has no attribute 'index'".
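For what it's worth, a minimal sketch of staging, committing, and pushing with GitPython alone, authenticating over HTTPS by embedding a token in the remote URL, looks like this. All paths, names, and the token below are placeholders, and this is not presented as the accepted fix for the error above.

from git import Repo

# Sketch only: repo path, file path, username, and token are placeholders.
repo_dir = '/path/to/local/clone'
repo = Repo(repo_dir)

# Stage and commit; git.Repo exposes .index for this.
repo.index.add(['relative/path/to/new_file.txt'])
repo.index.commit('Add new file')

# Carry the credentials in the HTTPS remote URL, then push.
origin = repo.remote('origin')
origin.set_url('https://USERNAME:TOKEN@github.com/USERNAME/REPO.git')
origin.push()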

RPC Error when using jnpr.junos.utils.config Load command

I'm still rather new to Python. I've been referencing a few blogs regarding the jnpr.junos packages, specifically one from Jeremy Schulman (http://forums.juniper.net/t5/Automation/Python-for-Non-Programmers-Part-2/bc-p/277682). I'm simply trying to make sure I have the commands right by passing simple commands to my SRX cluster. I'm attempting to pass the following to an SRX650 cluster.
>>> from jnpr.junos.utils.config import Config
>>> from jnpr.junos import Device
>>> dev = Device(host='devip',user='myuser',password='mypwd')
>>> dev.open()
Device(devip)
>>> cu = Config(dev)
>>> cu
jnpr.junos.utils.Config(devip)
>>> set_cmd = 'set system login message "Hello Admin!"'
>>> cu.load(set_cmd,format='set')
Warning (from warnings module):
File "C:\Python27\lib\site-packages\junos_eznc-1.0.0- py2.7.egg\jnpr\junos\utils\config.py", line 273
if any([e.find('[error-severity="error"]') for e in rerrs]):
FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
Traceback (most recent call last):
File "<pyshell#8>", line 1, in <module>
cu.load(set_cmd,format='set')
File "C:\Python27\lib\site-packages\junos_eznc-1.0.0- py2.7.egg\jnpr\junos\utils\config.py", line 296, in load
return try_load(rpc_contents, rpc_xattrs)
File "C:\Python27\lib\site-packages\junos_eznc-1.0.0-py2.7.egg\jnpr\junos\utils\config.py", line 274, in try_load
raise err
RpcError
I've done quite a bit of searching and can't seem to find anything as to why this RPC error is popping up. I've confirmed that the syntax is correct and read through the jnpr.junos documentation for Junos EZ.
I found that I was using an outdated version of junos-eznc. Running pip install -U junos-eznc updated me to junos-eznc 1.3.1. After doing this, my script worked properly.
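For completeness, the same steps against a current junos-eznc release, with an explicit commit added at the end, look roughly like this. Host and credentials are placeholders, and the pdiff/commit calls are additions not shown in the original session.

from jnpr.junos import Device
from jnpr.junos.utils.config import Config

# Sketch only: host and credentials are placeholders.
dev = Device(host='devip', user='myuser', password='mypwd')
dev.open()

cu = Config(dev)
cu.load('set system login message "Hello Admin!"', format='set')
cu.pdiff()    # print the candidate configuration diff
cu.commit()

dev.close()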

"Failed to import google/appengine/ext/deferred/handler.py" in Google App Engine Flexible Environment

I use the App Engine Flexible Environment (previously called Managed VMs) and recently upgraded to the latest gcloud SDK. The upgrade surfaced some new errors:
ERROR: (gcloud.preview.app.deploy) Error Response: [400] Invalid character in filename: lib/setuptools/script (dev).tmpl
ERROR: The [application] field is specified in file [.../app.yaml]. This field is not used
by gcloud and must be removed. Project name should instead be
specified either by `gcloud config set project MY_PROJECT` or by
setting the `--project` flag on individual command executions.
ERROR: (gcloud.preview.app.deploy) There is a Dockerfile in the
current directory, and the runtime field in
.../app.yaml is currently set to
[runtime: python27]. To use your Dockerfile to build a custom runtime,
set the runtime field in .../app.yaml
to [runtime: custom]. To continue using the [python27] runtime, please
omit the Dockerfile from this directory.
I fixed these errors and was able to publish again, but started seeing errors like this:
Failed to import google/appengine/ext/deferred/handler.py
Traceback (most recent call last):
File "/home/vmagent/python_vm_runtime/google/appengine/ext/vmruntime/meta_app.py", line 549, in GetUserAppAndServe
app, mod_file = self.GetUserApp(script)
File "/home/vmagent/python_vm_runtime/google/appengine/ext/vmruntime/meta_app.py", line 410, in GetUserApp
app = _AppFrom27StyleScript(script)
File "/home/vmagent/python_vm_runtime/google/appengine/ext/vmruntime/meta_app.py", line 270, in _AppFrom27StyleScript
app, filename, err = wsgi.LoadObject(script)
File "/home/vmagent/python_vm_runtime/google/appengine/runtime/wsgi.py", line 85, in LoadObject
obj = __import__(path[0])
ImportError: Import by filename is not supported.
After a bit of digging, I figured out what is going on. Namely, the code that processes this:
builtins:
- remote_api: on
- appstats: on
- deferred: on
is broken with Managed VMs. The correct fix is to eliminate these and inline the builtin includes instead. You can find the relevant includes inside the corresponding subdirectories under google/appengine/ext/builtins/ in the SDK.
In my case, the fix was to add this to my handlers: directive:
- url: /_ah/queue/deferred
  script: google.appengine.ext.deferred.application
  login: admin

- url: /_ah/stats.*
  script: google.appengine.ext.appstats.ui.app

- url: /_ah/remote_api(/.*)?
  script: google.appengine.ext.remote_api.handler.application
As to why, you can understand more here. In google/appengine/ext/builtins/__init__.py#L92, it attempts to find the relevant include file by using the runtime: field in your app.yaml. This means that where it previously looked up deferred/include-python27.yaml, it now attempts to find deferred/include-custom.yaml (due to fixing the errors above) and fails, so it defaults to deferred/include.yaml, which lists the include script by path name instead of module name. This then breaks in a python27-custom-VM setup, since it expects module names.
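The lookup behaviour described above boils down to something like the following simplified sketch. This is not the actual SDK code; the function and variable names are made up for illustration.

import os

def resolve_builtin_include(builtin_dir, runtime):
    """Pick the include file the SDK would load for a builtin.

    It prefers include-<runtime>.yaml and falls back to the generic
    include.yaml, which is what causes the path-name vs module-name
    mismatch described above.
    """
    runtime_specific = os.path.join(builtin_dir, 'include-%s.yaml' % runtime)
    if os.path.exists(runtime_specific):
        return runtime_specific
    return os.path.join(builtin_dir, 'include.yaml')

# With runtime: python27 this resolves deferred/include-python27.yaml;
# with runtime: custom it falls back to deferred/include.yaml.
print(resolve_builtin_include('google/appengine/ext/builtins/deferred', 'custom'))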
