GCP dataflow, argparse.ArgumentError using DataflowRunner but not DirectRunner

My Dataflow pipeline with runtime arguments runs fine with DirectRunner, but raises an argument error when I switch to DataflowRunner:
File "/home/user/miniconda3/lib/python3.8/site-packages/apache_beam/options/pipeline_options.py", line 124, in add_value_provider_argument
self.add_argument(*args, **kwargs)
File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1386, in add_argument
return self._add_action(action)
File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1749, in _add_action
self._optionals._add_action(action)
File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1590, in _add_action
action = super(_ArgumentGroup, self)._add_action(action)
File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1400, in _add_action
self._check_conflict(action)
File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1539, in _check_conflict
conflict_handler(action, confl_optionals)
File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1548, in _handle_conflict_error
raise ArgumentError(action, message % conflict_string)
argparse.ArgumentError: argument --bucket_input: conflicting option string: --bucket_input
Here is how the argument is defined and used:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class CustomPipelineOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Register --bucket_input as a runtime (ValueProvider) argument.
        parser.add_value_provider_argument(
            '--bucket_input',
            default="device-file-dev",
            help='Raw device file bucket')

# pipeline_options and CreateGcsPCol are defined elsewhere in the project.
pipeline = beam.Pipeline(options=pipeline_options)
custom_options = pipeline_options.view_as(CustomPipelineOptions)
_ = (
    pipeline
    | 'Initiate dataflow' >> beam.Create(["Start"])
    | 'Create P collection with file paths' >> beam.ParDo(
        CreateGcsPCol(input_bucket=custom_options.bucket_input)
    )
)
Note that this only happens with DataflowRunner. Does anyone know how to solve it? Thanks a lot.

Copying the answer from the comment here:
The error is caused by importing a local Python sub-module via a relative path. With the DirectRunner, the relative path works because everything runs on the local machine. The DataflowRunner, however, executes on a different machine (a GCE instance) and needs the absolute path. The problem was solved by installing both the Dataflow pipeline module and the sub-module as packages, and importing from the installed sub-module instead of using the relative path.
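
The traceback in the question shows argparse rejecting a second registration of `--bucket_input`. A minimal sketch using only the standard library reproduces that failure mode. It assumes (the source does not confirm this) that the options class ended up being registered twice, for example because the same module was loaded under two different import paths. `add_bucket_option` is a hypothetical helper, not part of Beam.

```python
import argparse


def add_bucket_option(parser):
    """Hypothetical stand-in for an options class registering its argument."""
    parser.add_argument(
        "--bucket_input",
        default="device-file-dev",
        help="Raw device file bucket",
    )


parser = argparse.ArgumentParser()
add_bucket_option(parser)  # first registration succeeds

try:
    add_bucket_option(parser)  # second registration of the same flag
    conflict_message = None
except argparse.ArgumentError as exc:
    # argparse refuses to register an option string twice.
    conflict_message = str(exc)

print(conflict_message)
```

If this is what happens in the pipeline, importing the options class from a single installed package, as the answer describes, leaves only one registration.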

Related

ReadFromKafka throws ValueError: Unsupported signal: 2

I am currently trying to get the hang of Apache Beam together with Apache Kafka.
The Kafka service is running (locally) and I write some test messages with the kafka-console-producer.
First I wrote this Java code snippet to test Apache Beam with a language that I know, and it works as expected.
public class Main {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();
        Read<Long, String> kafkaReader = KafkaIO.<Long, String>read()
                .withBootstrapServers("localhost:9092")
                .withTopic("beam-test")
                .withKeyDeserializer(LongDeserializer.class)
                .withValueDeserializer(StringDeserializer.class);
        kafkaReader.withoutMetadata();
        pipeline
                .apply("Kafka", kafkaReader
                ).apply(
                        "Extract words", ParDo.of(new DoFn<KafkaRecord<Long, String>, String>() {
                            @ProcessElement
                            public void processElement(ProcessContext c){
                                System.out.println("Key:" + c.element().getKV().getKey() + " | Value: " + c.element().getKV().getValue());
                            }
                        })
                );
        pipeline.run();
    }
}
My goal is to write the same thing in Python, and this is what I have so far:
def run_pipe():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | 'Kafka Unbounded' >> ReadFromKafka(consumer_config={'bootstrap.servers' : 'localhost:9092'}, topics=['beam-test'])
         | 'Test Print' >> beam.Map(print)
        )

if __name__ == '__main__':
    run_pipe()
Now to the problem. When I try to run the python code, I get the following error:
(app) λ python ArghKafkaExample.py
Traceback (most recent call last):
File "ArghKafkaExample.py", line 22, in <module>
run_pipe()
File "ArghKafkaExample.py", line 10, in run_pipe
(p
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\transforms\ptransform.py", line 1028, in __ror__
return self.transform.__ror__(pvalueish, self.label)
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\transforms\ptransform.py", line 572, in __ror__
result = p.apply(self, pvalueish, label)
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\pipeline.py", line 648, in apply
return self.apply(transform, pvalueish)
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\pipeline.py", line 691, in apply
pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\runners\runner.py", line 198, in apply
return m(transform, input, options)
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\runners\runner.py", line 228, in apply_PTransform
return transform.expand(input)
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\transforms\external.py", line 322, in expand
self._expanded_components = self._resolve_artifacts(
File "C:\Users\gamef\AppData\Local\Programs\Python\Python38\lib\contextlib.py", line 120, in __exit__
next(self.gen)
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\transforms\external.py", line 372, in _service
yield stub
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\transforms\external.py", line 523, in __exit__
self._service_provider.__exit__(*args)
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\utils\subprocess_server.py", line 74, in __exit__
self.stop()
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\utils\subprocess_server.py", line 133, in stop
self.stop_process()
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\utils\subprocess_server.py", line 179, in stop_process
return super(JavaJarServer, self).stop_process()
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\utils\subprocess_server.py", line 143, in stop_process
self._process.send_signal(signal.SIGINT)
File "C:\Users\gamef\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 1434, in send_signal
raise ValueError("Unsupported signal: {}".format(sig))
ValueError: Unsupported signal: 2
From googling I found out that it has something to do with program exit codes (like Ctrl+C), but overall I have absolutely no idea what the problem is.
Any advice would be helpful!
Greetings Pascal

Your pipeline code seems correct here. The issue is due to the requirements of the Kafka IO in the Python SDK. From the module documentation:
These transforms are currently supported by Beam portable runners (for example, portable Flink and Spark) as well as Dataflow runner.
Transforms provided in this module are cross-language transforms implemented in the Beam Java SDK. During the pipeline construction, Python SDK will connect to a Java expansion service to expand these transforms. To facilitate this, a small amount of setup is needed before using these transforms in a Beam Python pipeline.
In the Python SDK, Kafka IO is implemented as a cross-language transform backed by the Java SDK, and your pipeline is failing because you haven't set up your environment to execute cross-language transforms. To explain what a cross-language transform is in layman's terms: the Kafka transform actually executes on the Java SDK rather than the Python SDK, so it can make use of the existing Kafka code in Java.
There are two barriers preventing your pipeline from working. The easier one to fix is that only the runners I quoted above support cross-language transforms, so this pipeline won't work with the Direct runner. You'll want to switch to either the Flink or Spark runner in local mode.
The more tricky barrier is that you need to start up an Expansion Service to be able to add external transforms to your pipeline. The stacktrace you're getting is happening because Beam is attempting to expand the transform but is unable to connect to the expansion service, and the expansion fails.
If you still want to try running this with cross-language despite the extra setup, the documentation I linked contains instructions for running an expansion service. At the time I am writing this answer this feature is still new, and there might be blind spots in the documentation. If you run into problems, I encourage you to ask questions on the Apache Beam users mailing list or Apache Beam slack channel.
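
A side note on the literal error message, offered as a sketch rather than a fix for the Beam setup: signal 2 is `SIGINT`, and the traceback ends in `subprocess.Popen.send_signal` raising `ValueError` because, on Windows, that method does not accept `SIGINT`. The helper below shows one way to stop a child process portably. The name `stop_child` and the sleeping child process are hypothetical, chosen only for illustration.

```python
import signal
import subprocess
import sys


def stop_child(process, timeout=5):
    """Ask a child process to stop with a request the platform supports."""
    if sys.platform == "win32":
        # Popen.send_signal(signal.SIGINT) raises ValueError on Windows,
        # so fall back to terminate().
        process.terminate()
    else:
        process.send_signal(signal.SIGINT)
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The child ignored the polite request; force it to exit.
        process.kill()
        return process.wait()


# Hypothetical stand-in for a long-running helper process.
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    stderr=subprocess.DEVNULL,
)
exit_code = stop_child(child)
print("child exited with", exit_code)
```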

How can I compile Bootstrap 4 scss with python webassets?

I'm trying to simply compile Bootstrap 4 with python webassets and having zero success. For now I'm just trying to do this within the bootstrap/scss directory so path issues are less of a big deal. Within this directory I have added a main.scss file with one line:
@import "bootstrap.scss";
I have a script called test_scss.py that looks like this:
from webassets import Bundle, Environment
my_env = Environment(directory='.', url='/')
css = Bundle('main.scss', filters='scss', output='all.css')
my_env.register('css_all', css)
print(my_env['css_all'].urls())
When I run this command, I get an error trace like this:
Traceback (most recent call last):
File "./test_scss.py", line 11, in <module>
print(my_env['css_all'].urls())
File "/Users/benlindsay/miniconda/lib/python3.6/site-packages/webassets/bundle.py", line 806, in urls
urls.extend(bundle._urls(new_ctx, extra_filters, *args, **kwargs))
File "/Users/benlindsay/miniconda/lib/python3.6/site-packages/webassets/bundle.py", line 765, in _urls
*args, **kwargs)
File "/Users/benlindsay/miniconda/lib/python3.6/site-packages/webassets/bundle.py", line 619, in _build
force, disable_cache=disable_cache, extra_filters=extra_filters)
File "/Users/benlindsay/miniconda/lib/python3.6/site-packages/webassets/bundle.py", line 543, in _merge_and_apply
kwargs=item_data)
File "/Users/benlindsay/miniconda/lib/python3.6/site-packages/webassets/merge.py", line 276, in apply
return self._wrap_cache(key, func)
File "/Users/benlindsay/miniconda/lib/python3.6/site-packages/webassets/merge.py", line 218, in _wrap_cache
content = func().getvalue()
File "/Users/benlindsay/miniconda/lib/python3.6/site-packages/webassets/merge.py", line 251, in func
getattr(filter, type)(data, out, **kwargs_final)
File "/Users/benlindsay/miniconda/lib/python3.6/site-packages/webassets/filter/sass.py", line 196, in input
self._apply_sass(_in, out, os.path.dirname(source_path))
File "/Users/benlindsay/miniconda/lib/python3.6/site-packages/webassets/filter/sass.py", line 190, in _apply_sass
return self.subprocess(args, out, _in, cwd=child_cwd)
File "/Users/benlindsay/miniconda/lib/python3.6/site-packages/webassets/filter/__init__.py", line 527, in subprocess
proc.returncode, stdout, stderr))
webassets.exceptions.FilterError: scss: subprocess returned a non-success result code: 65, stdout=b'',
stderr=b'DEPRECATION WARNING: Importing from the current working directory will
not be automatic in future versions of Sass. To avoid future errors, you can add it
to your environment explicitly by setting `SASS_PATH=.`, by using the -I command
line option, or by changing your Sass configuration options.
Error: Invalid CSS after "...lor}: #{$value}": expected "{", was ";"
on line 4 of /Users/benlindsay/scratch/python/webassets/test-2/bootstrap/scss/_root.scss
from line 11 of /Users/benlindsay/scratch/python/webassets/test-2/bootstrap/scss/bootstrap.scss
from line 1 of standard input
Use --trace for backtrace.
If I follow the instructions and set environment variable SASS_PATH=., that gets rid of that part of the error message, but I still get the error
Error: Invalid CSS after "...lor}: #{$value}": expected "{", was ";"
on line 4 of /Users/benlindsay/scratch/python/webassets/test-2/bootstrap/scss/_root.scss
from line 11 of /Users/benlindsay/scratch/python/webassets/test-2/bootstrap/scss/bootstrap.scss
from line 1 of standard input
Use --trace for backtrace.
I don't know SCSS syntax well yet, but I'd bet a lot of money this is me doing something wrong and not an error in the Bootstrap SCSS. Any thoughts of what I'm doing wrong would be greatly appreciated.

Turns out it actually kind of was a problem on Bootstrap's end. See https://github.com/sass/sass/issues/2383, specifically the quote:
This is a bug in our implementation—the parser shouldn't crash—but those Bootstrap styles aren't valid for Sass 3.5 as written.
Anyway, I just needed to update to the latest version of Ruby Sass (which apparently the webassets module depends on) and that fixed it.

Windows cmd versus bash for sys.argv - Python

I was trying to run a Python script in Visual Studio 2015 and wanted to pass a path to my argparse function, but kept receiving an OSError. See the update below: the problem appears to be a difference in how argparse receives values from cmd compared with bash.
Whether I specify it like this
C:\Users\Sayth\Documents\Racing\XML\*xml
or like this
C:\\Users\\Sayth\Documents\\Racing\\XML\\*xml
I get an OSError that the path is not found
OSError: Error reading file 'C:\\Users\\Sayth\\Documents\\Racing\\XML\\*xml': failed to load external entity "file:/C://Users//Sayth/Documents//Racing//XML//*xml"
Update
I copied the script and XML file to a test directory. From here I have run the script on 2 different shells on windows.
On command cmd
C:\Users\Sayth\Projects
λ python RaceHorse.py XML\*xml
Traceback (most recent call last):
File "RaceHorse.py", line 42, in <module>
tree = lxml.etree.parse(file)
File "lxml.etree.pyx", line 3427, in lxml.etree.parse (src\lxml\lxml.etree.c:79720)
File "parser.pxi", line 1782, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:115914)
File "parser.pxi", line 1808, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:116264)
File "parser.pxi", line 1712, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:115152)
File "parser.pxi", line 1115, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:109849)
File "parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:103323)
File "parser.pxi", line 683, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:104977)
File "parser.pxi", line 611, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:103843)
OSError: Error reading file 'XML\*xml': failed to load external entity "XML/*xml"
When I change to Git Bash, it reads the file. I still get an error, but the error shows that the file is being read:
Sayth@renshaw-laptop ~/Projects
λ python RaceHorse.py XML/*xml
Traceback (most recent call last):
File "RaceHorse.py", line 50, in <module>
nomination_table.append([race_id] + [nomination.attrib[name] for name in horseattrs])
File "RaceHorse.py", line 50, in <listcomp>
nomination_table.append([race_id] + [nomination.attrib[name] for name in horseattrs])
File "lxml.etree.pyx", line 2452, in lxml.etree._Attrib.__getitem__ (src\lxml\lxml.etree.c:68544)
KeyError: 'race_id'
I have a simple argparse function
parser = argparse.ArgumentParser(description=None)

def GetArgs(parser):
    """Parser function using argparse"""
    # parser.add_argument('directory', help='directory use',
    #                     action='store', nargs='*')
    parser.add_argument("files", nargs="+")
    return parser.parse_args()

fileList = GetArgs(parser)
Update 2
Based on the comments, I am trying to use glob so that the script works in Windows shells.
glob raises an error saying that the parser's Namespace object has no len().
Updated glob parser:
def GetArgs(parser):
    """Parser function using argparse"""
    # parser.add_argument('directory', help='directory use',
    #                     action='store', nargs='*')
    parser.add_argument("files", nargs="+")
    files = glob.glob(parser.parse_args())
    return files

filelist = GetArgs(parser)
Returns this error.
TypeError was unhandled by user code
Message: object of type 'Namespace' has no len()

The following should work with both the Windows cmd shell and bash because it will glob any filenames it receives (which can happen if the shell didn't do it already):
import argparse
from glob import glob

parser = argparse.ArgumentParser(description=None)

def GetArgs(parser):
    """Parser function using argparse"""
    parser.add_argument("files", nargs="+")
    namespace = parser.parse_args()
    files = [filename for filespec in namespace.files for filename in glob(filespec)]
    return files

filelist = GetArgs(parser)
However, I don't think having GetArgs() add arguments to the parser it was passed is a good design choice (because it could be an undesirable side-effect if the parser object is reused).
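
One possible way to avoid that side effect is sketched below: the parser is built inside its own function, and glob expansion lives in a separate helper. The names `build_parser`, `expand_patterns` and `get_files` are hypothetical. Keeping patterns that match nothing (instead of silently dropping them) is a design assumption, so that a later step can report the missing file.

```python
import argparse
import glob
import os
import tempfile


def build_parser():
    """Create a fresh parser so no shared parser object is mutated."""
    parser = argparse.ArgumentParser(description="Process one or more files.")
    parser.add_argument("files", nargs="+", help="file names or glob patterns")
    return parser


def expand_patterns(patterns):
    """Expand each pattern with glob; keep patterns that match nothing."""
    expanded = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        expanded.extend(matches if matches else [pattern])
    return expanded


def get_files(argv=None):
    """Parse argv (or sys.argv when None) and return the expanded file list."""
    args = build_parser().parse_args(argv)
    return expand_patterns(args.files)


# Small demonstration with temporary files instead of real race data.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.xml", "b.xml", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    demo_files = get_files([os.path.join(tmp, "*xml")])
    demo_names = [os.path.basename(f) for f in demo_files]

print(demo_names)
```

Because `get_files` accepts an explicit argument list, it can be exercised without touching `sys.argv`, and it behaves the same whether or not the shell already expanded the pattern.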

This is short and simple, but I think it deserves an answer rather than a comment: Python is cross-platform, so when you work with paths you should prefer
from os import path
to avoid problems when running your app on different platforms.

Py.test collection phase taking very long

I am really quite new to development in Python in general, let alone testing with pytest. My problem is that the pytest collection phase runs unusually slow. I am specifying the test directory which contains only a handful of files with only one file containing three tests. The collection takes pretty much a whole minute, after which the actual tests run in under a few seconds. I have looked at similar questions but couldn't find a solution. I don't think it matters (as py.test is slow even from the command line) but I am using the pycharm IDE. The OS is Ubuntu.
This may be relevant: If I terminate the process after a few seconds I usually end up with a stacktrace ending as follows:
<A FEW LINES OMITTED...>
File "/usr/local/lib/python2.7/dist-packages/_pytest/core.py", line 413, in __call__
return self._docall(methods, kwargs)
File "/usr/local/lib/python2.7/dist-packages/_pytest/core.py", line 424, in _docall
res = mc.execute()
File "/usr/local/lib/python2.7/dist-packages/_pytest/core.py", line 315, in execute
res = method(**kwargs)
File "/usr/local/lib/python2.7/dist-packages/_pytest/helpconfig.py", line 27, in pytest_cmdline_parse
config = __multicall__.execute()
File "/usr/local/lib/python2.7/dist-packages/_pytest/core.py", line 315, in execute
res = method(**kwargs)
File "/usr/local/lib/python2.7/dist-packages/_pytest/config.py", line 636, in pytest_cmdline_parse
self.parse(args)
File "/usr/local/lib/python2.7/dist-packages/_pytest/config.py", line 747, in parse
self._preparse(args)
File "/usr/local/lib/python2.7/dist-packages/_pytest/config.py", line 709, in _preparse
self._initini(args)
File "/usr/local/lib/python2.7/dist-packages/_pytest/config.py", line 704, in _initini
self.inicfg = getcfg(args, ["pytest.ini", "tox.ini", "setup.cfg"])
File "/usr/local/lib/python2.7/dist-packages/_pytest/config.py", line 861, in getcfg
if exists(p):
File "/usr/local/lib/python2.7/dist-packages/_pytest/config.py", line 848, in exists
return path.check()
File "/usr/local/lib/python2.7/dist-packages/py/_path/local.py", line 352, in check
return exists(self.strpath)
File "/usr/lib/python2.7/genericpath.py", line 18, in exists
os.stat(path)
KeyboardInterrupt
Or sometimes...
<STACK TRACE...>
File "/usr/local/lib/python2.7/dist-packages/py/_iniconfig.py", line 50, in __init__
f = open(self.path)
KeyboardInterrupt
Maybe one of the two last calls before the KeyboardInterrupt is very slow?
Please do ask for more detail should you require it!
Cheers!

Add PYTHONDONTWRITEBYTECODE=1 to your environment variables!
Windows Batch: set PYTHONDONTWRITEBYTECODE=1
Unix: export PYTHONDONTWRITEBYTECODE=1
subprocess.run: Add keyword env={'PYTHONDONTWRITEBYTECODE': '1'}
Note that the first two options are only valid for your current terminal session.
Here is how I found this out: pytest was unusably slow from the command line, but worked fine from within PyCharm. Copying the PyCharm command into cmd.exe (it executes a small helper script) was also unusably slow. So I printed out the environment variables from os.environ and ran it with those, and it was fast! Then I eliminated them one by one.
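
For the `subprocess.run` variant, note that passing `env=` replaces the child's entire environment, so a dictionary containing only `PYTHONDONTWRITEBYTECODE` would drop `PATH` and everything else. A minimal sketch that copies `os.environ` first is shown here. The child command is only a stand-in that prints `sys.dont_write_bytecode`; in practice it would be the pytest invocation.

```python
import os
import subprocess
import sys

# Start from the current environment so PATH and friends are preserved.
child_env = dict(os.environ)
child_env["PYTHONDONTWRITEBYTECODE"] = "1"

result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.dont_write_bytecode)"],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)

# The child interpreter reports whether bytecode writing is disabled.
flag_seen_by_child = result.stdout.strip()
print(flag_seen_by_child)
```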

I had the same problem. My fix was to set the Working directory setting in the Run/Debug Configuration to the folder where manage.py is located.

My solution was based off of this answer, where I did pytest dir/to/tests and it skipped the collection step entirely.

cx_freeze / pptx "PackageNotFoundError"

I'm working on a Python program that uses the module "pptx" (to edit and manipulate PowerPoint files). The program works without problems in the interpreter, but once I start it as an .exe file (built with cx_freeze or py2exe) I get this error message when I click on the main button:
Exception in Tkinter callback
Traceback (most recent call last):
File "Tkinter.pyc", line 1470, in __call__
File "pptx_02.py", line 187, in ButtonErstellen
File "pptx\api.pyc", line 29, in __init__
File "pptx\presentation.pyc", line 87, in __init__
File "pptx\presentation.pyc", line 170, in __open
File "pptx\packaging.pyc", line 88, in open
File "pptx\packaging.pyc", line 671, in __new__
PackageNotFoundError: Package not found at 'C:\Users\Moi\Programmation\Python\build\exe.win32-2.7\library.zip\pptx\templates\default.pptx'
The problem is that the file "default.pptx" actually exists at this address.
While looking through the functions of "pptx", I may have found the reason for this problem (but I have no idea how to solve it).
The last frame of the traceback, File "pptx\packaging.pyc", line 671, in __new__, corresponds to the following code:
class FileSystem(object):
    """
    Factory for filesystem interface instances.
    A FileSystem object provides access to on-disk package items via their URI
    (e.g. ``/_rels/.rels`` or ``/ppt/presentation.xml``). This allows parts to
    be accessed directly by part name, which for a part is identical to its
    item URI. The complexities of translating URIs into file paths or zip item
    names, and file and zip file access specifics are all hidden by the
    filesystem class. |FileSystem| acts as the Factory, returning the
    appropriate concrete filesystem class depending on what it finds at *path*.
    """
    def __new__(cls, file):
        # if *file* is a string, treat it as a path
        if isinstance(file, basestring):
            path = file
            if is_zipfile(path):
                fs = ZipFileSystem(path)
            elif os.path.isdir(path):
                fs = DirectoryFileSystem(path)
            else:
                raise PackageNotFoundError("Package not found at '%s'" % path)
        else:
            fs = ZipFileSystem(file)
        return fs
The problem may be that the final file (default.pptx) is not a zip file itself but is located inside a zip file (library.zip). But I have no idea how I could fix it, or whether that really is the problem.
If somebody has an idea ...
Thank you!
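
The hypothesis above can be checked in isolation with the standard library. The sketch below builds a throwaway `library.zip` containing a placeholder member (not a real presentation) and runs the same two checks that `FileSystem.__new__` performs on a path that points inside the archive. It only illustrates why both checks would fail for such a path; it is not a fix for cx_freeze or py2exe.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "library.zip")
    member = "pptx/templates/default.pptx"

    # Placeholder content; only the location inside the archive matters here.
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr(member, b"placeholder bytes")

    # A path that treats the archive as if it were a directory.
    nested_path = os.path.join(archive, "pptx", "templates", "default.pptx")

    archive_is_zip = zipfile.is_zipfile(archive)     # the archive itself
    nested_is_zip = zipfile.is_zipfile(nested_path)  # first check in __new__
    nested_is_dir = os.path.isdir(nested_path)       # second check in __new__

    # The member is only reachable by opening the archive.
    with zipfile.ZipFile(archive) as zf:
        member_found = member in zf.namelist()

print(archive_is_zip, nested_is_zip, nested_is_dir, member_found)
```

Since neither check succeeds for a path inside `library.zip`, the code falls through to the `PackageNotFoundError` branch even though the file is present in the archive.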
