Apache Beam Error: Unable to get file system for GCS - python

I'm trying to write to a GCS bucket via Beam (and TF Transform), but I keep getting the following error:
ValueError: Unable to get the Filesystem for path [...]
The answer here and some other sources suggest that I need to pip install apache-beam[gcp] to get a variant of Apache Beam that works with GCP.
So I tried changing the setup.py of my training package as follows:
REQUIRED_PACKAGES = ['apache_beam[gcp]==2.14.0', 'tensorflow-ranking', 'tensorflow_transform==0.14.0']
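For reference, the rest of my setup.py is standard setuptools boilerplate, roughly like this (the package name and version here are placeholders, not from my actual project):
from setuptools import find_packages, setup

REQUIRED_PACKAGES = ['apache_beam[gcp]==2.14.0', 'tensorflow-ranking', 'tensorflow_transform==0.14.0']

setup(
    name='trainer',    # placeholder name
    version='0.1',     # placeholder version
    packages=find_packages(),
    install_requires=REQUIRED_PACKAGES,
)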
That change didn't help. I also tried adding the following to the beginning of my code:
subprocess.check_call('pip uninstall apache-beam'.split())
subprocess.check_call('pip install apache-beam[gcp]'.split())
which didn't work either.
The logs of the failed GCP job are here. The traceback and the error message appear on line 276.
I should mention that running the same code using Beam's DirectRunner and writing the outputs to local disk runs fine. But I'm now trying to switch to DataflowRunner.
Thanks.

It turns out that you need to uninstall google-cloud-dataflow in addition to installing apache-beam with the gcp option. I guess this happens because google-cloud-dataflow is installed on GCP instances by default. Not sure if the same would be true on other platforms like AWS. But anyway, here are the commands I used:
pip uninstall -y google-cloud-dataflow
pip install apache-beam[gcp]
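If you need to do this from inside a Python script (as with the subprocess calls in the question), the equivalent is something like this sketch; note the -y flag, so that pip uninstall doesn't block waiting for a confirmation prompt:
import subprocess

# -y skips the interactive confirmation, which would otherwise hang under check_call
subprocess.check_call('pip uninstall -y google-cloud-dataflow'.split())
subprocess.check_call('pip install apache-beam[gcp]'.split())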
I noticed this in the very first cell of [this notebook](https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive/10_recommend/wals_tft.ipynb).

Related

Amazon EMR pip install in bootstrap actions runs OK but has no effect

In Amazon EMR, I am using the following script as a custom bootstrap action to install Python packages. The script runs OK (I checked the logs; the packages installed successfully), but when I open a notebook in JupyterLab, I cannot import any of them. If I open a terminal in JupyterLab and run pip list or pip3 list, none of my packages are there. Even if I go to / and run find . -name mleap, for instance, it does not exist.
Something I have noticed is that on the master node I constantly get an error saying bootstrap action 2 has failed (there is no second action, only one). According to this, it is a rare error, which I get in all my clusters. However, my cluster eventually gets created and I can use it.
My script is called aws-emr-bootstrap-actions.sh
#!/bin/bash
sudo python3 -m pip install numpy scikit-learn pandas mleap sagemaker boto3
I suspect it might have something to do with a Docker image being deployed that invalidates my previous installs or something, but I think (from my Google searches) it is common to use bootstrap actions to install Python packages, and it should work ...
The Python interpreter that Spark (PySpark) uses is different from the one into which the OP was installing the modules (as confirmed in the comments).
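A quick way to confirm this kind of mismatch is to print the interpreter and search path from a notebook cell and compare them against the interpreter the bootstrap action installed into (sudo python3 -m pip ...):
import sys

print(sys.executable)  # the interpreter the notebook kernel is actually running
print(sys.path)        # the directories it searches when importing packages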

How can I make this script run

I found this script (tutorial) on GitHub (https://github.com/amyoshino/Dash_Tutorial_Series/blob/master/ex4.py) and I am trying to run it on my local machine.
Unfortunately, I am getting an error.
I would really appreciate it if anyone can help me run this script.
Perhaps this is something easy, but I am new to coding.
Thank you!
You probably just need to pip install the dash-core-components library!
Take a look at the Dash Installation documentation. It currently recommends running these commands:
pip install dash==0.38.0 # The core dash backend
pip install dash-html-components==0.13.5 # HTML components
pip install dash-core-components==0.43.1 # Supercharged components
pip install dash-table==3.5.0 # Interactive DataTable component (new!)
pip install dash-daq==0.1.0 # DAQ components (newly open-sourced!)
For more info on using pip to install Python packages, see: Installing Packages.
If you have run those commands, and Flask still throws that error, you may be having a path/environment issue, and should provide more info in your question about your Python setup.
Also, just to give you a sense of how to interpret this error message:
It's often easiest to start at the bottom and work your way up.
Here, the bottommost message is a FileNotFound error.
The program is looking for the file in your Python37/lib/site-packages folder. That tells you it's looking for a Python package, since that is the directory where Python packages are installed when you use a tool like pip.
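As a sanity check, you can ask Python directly where (or whether) it finds a package; a minimal sketch, using dash_core_components as the example:
import importlib.util

# find_spec returns None when the package is not importable from this interpreter
spec = importlib.util.find_spec('dash_core_components')
print(spec.origin if spec else 'dash_core_components not found on sys.path')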

python: cannot import name beam_runner_api_pb2

I am relatively new to Python and Beam and I have followed the Apache Beam - Python Quickstart (here) to the last letter. My Python 2.7 virtual environment was created with conda.
I cloned the example from https://github.com/apache/beam
When I try to run
python -m apache_beam.examples.wordcount --input sample_text.txt --output counts
I get the following error
/Users/name/anaconda3/envs/py27/bin/python: cannot import name beam_runner_api_pb2
(which after searching I understand means that there is a circular import)
I have no idea where to begin. Is this a bug, or is something wrong with my setup?
(I have now tried redoing the example in three different virtual environments - all with the same result)
It seems it was my mistake. I had not correctly installed the Google Cloud Platform (gcp) components. Once I did this, it all worked:
# As part of the initial setup, install Google Cloud Platform specific extra components.
pip install apache-beam[gcp]
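A quick sanity check afterwards (a minimal sketch) is to import the package and print its version; if the gcp extras are installed correctly, this should run without the beam_runner_api_pb2 error:
import apache_beam as beam

print(beam.__version__)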

Install Package with Pip on Google Datalab - No Space on Device

I'm trying to install the package sodapy using pip on a fresh instance of Google Datalab, but I'm getting the error 'No space left on device.' I created this instance with over 100 GB of disk space, so I'm a bit confused why I would be getting this error, and I've tried deleting instances and creating new ones with no luck.
I'm using the command
!pip install sodapy
as explained in the documentation: https://cloud.google.com/datalab/docs/how-to/adding-libraries
Thanks in advance for your help!
You may have run into a bug where the Disk was not being attached: https://github.com/googledatalab/datalab/issues/1898
(If that is the case, resetting the VM once should fix it, and you should update to gcloud version 186.0.0 or later.)
Try adding the --user flag to the pip install command. This installs the package to your persistent disk (the 100 GB disk) instead of your VM's boot disk.
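In a Datalab notebook cell, that looks like:
!pip install --user sodapy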

Apache Beam in Python, error with beam.io.TextFileSource

I'm trying to run the code in the Data Science on GCP repo and keep hitting an error in the Beam code.
This is the line that gives an error:
beam.Read(beam.io.TextFileSource('airports.csv.gz'))
Here's the error I'm getting:
AttributeError: 'module' object has no attribute 'TextFileSource'
Here's the complete file:
https://github.com/GoogleCloudPlatform/data-science-on-gcp/blob/master/04_streaming/simulate/df01.py
Does anyone know how to get this working, or what I'm missing?
Google Dataflow is migrating to the Apache Beam standard, which means you should be using apache_beam.io.textio.ReadFromText. The standard is still evolving, so it's best to consult the release notes whenever you upgrade the package.
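For example, here is a minimal sketch of the newer API (pipeline options and the downstream transforms are omitted; ReadFromText infers gzip compression from the .gz extension by default):
import apache_beam as beam

with beam.Pipeline() as pipeline:
    # ReadFromText replaces the old beam.Read(beam.io.TextFileSource(...)) pattern
    airports = pipeline | 'ReadAirports' >> beam.io.ReadFromText('airports.csv.gz')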
It appears that you are using an older version of apache-beam/cloud-dataflow.
Do:
pip freeze | grep dataflow
When I do this, I get:
google-cloud-dataflow==0.4.3
If the version you get is older, try:
pip install google-cloud-dataflow
and repeat the pip freeze command. If you keep getting an older version, then you are in Python library hell and I suggest using virtualenv to ensure that you are using the latest version of all packages ...
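A typical virtualenv workflow for that looks something like this (the environment name is arbitrary):
pip install virtualenv
virtualenv dataflow-env
source dataflow-env/bin/activate
pip install google-cloud-dataflow
pip freeze | grep dataflow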
