I'm trying to run a Beam script in Python on GCP, following this tutorial:
https://levelup.gitconnected.com/scaling-scikit-learn-with-apache-beam-251eb6fcf75b
but I keep getting the following error:
AttributeError: module 'google.cloud' has no attribute 'storage'
I have google-cloud-storage in my requirements.txt, so I'm really not sure what I'm missing here.
My full script:
import apache_beam as beam
import json
query = """
SELECT
year,
plurality,
apgar_5min,
mother_age,
father_age,
gestation_weeks,
ever_born,
case when mother_married = true then 1 else 0 end as mother_married,
weight_pounds as weight,
current_timestamp as time,
GENERATE_UUID() as guid
FROM `bigquery-public-data.samples.natality`
order by rand()
limit 100
"""
class ApplyDoFn(beam.DoFn):
    def __init__(self):
        self._model = None
        from google.cloud import storage
        import pandas as pd
        import pickle as pkl
        self._storage = storage
        self._pkl = pkl
        self._pd = pd

    def process(self, element):
        if self._model is None:
            bucket = self._storage.Client().get_bucket('bqr_dump')
            blob = bucket.get_blob('natality/sklearn-linear')
            self._model = self._pkl.loads(blob.download_as_string())
        new_x = self._pd.DataFrame.from_dict(element,
                                             orient='index').transpose().fillna(0)
        pred_weight = self._model.predict(new_x.iloc[:, 1:8])[0]
        return [{'guid': element['guid'],
                 'predicted_weight': pred_weight,
                 'time': str(element['time'])}]
# set up pipeline options
options = {'project': 'my-project-name',
           'runner': 'DataflowRunner',
           'temp_location': 'gs://bqr_dump/tmp',
           'staging_location': 'gs://bqr_dump/tmp'}

pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)

with beam.Pipeline(options=pipeline_options) as pipeline:
    (
        pipeline
        | 'ReadTable' >> beam.io.Read(beam.io.BigQuerySource(
            query=query,
            use_standard_sql=True))
        | 'Apply Model' >> beam.ParDo(ApplyDoFn())
        | 'Save to BigQuery' >> beam.io.WriteToBigQuery(
            'pzn-pi-sto:beam_test.weight_preds',
            schema='guid:STRING,weight:FLOAT64,time:STRING',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
and my requirements.txt:
google-cloud==0.34.0
google-cloud-storage==1.30.0
apache-beam[GCP]==2.20.0
This issue usually comes down to two main causes: either the module was not installed correctly (something broke during the installation), or the module is not being imported correctly.
If the cause is a broken installation, reinstalling the module, or checking it in a virtual environment, should be the solution. As indicated here, in a similar case to yours, this should fix your problem.
For the second cause, try changing your code to import all the modules at the beginning of the file, as demonstrated in this official example here. Your code should look something like this:
import apache_beam as beam
import json
import pandas as pd
import pickle as pkl
from google.cloud import storage
...
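Filling in the rest of the DoFn under that change (just a sketch based on the code in your question, not a tested fix), it would look something like this:
import apache_beam as beam
import pandas as pd
import pickle as pkl
from google.cloud import storage

class ApplyDoFn(beam.DoFn):
    def __init__(self):
        self._model = None

    def process(self, element):
        # Lazily load the model from GCS on the first element processed
        if self._model is None:
            bucket = storage.Client().get_bucket('bqr_dump')
            blob = bucket.get_blob('natality/sklearn-linear')
            self._model = pkl.loads(blob.download_as_string())
        new_x = pd.DataFrame.from_dict(element,
                                       orient='index').transpose().fillna(0)
        pred_weight = self._model.predict(new_x.iloc[:, 1:8])[0]
        return [{'guid': element['guid'],
                 'predicted_weight': pred_weight,
                 'time': str(element['time'])}]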
Let me know if this information helped you!
Make sure you have installed the correct version, because the modules Google maintains receive constant updates. If you just run pip install for the required package, it will install the latest version of that package.
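As a quick sanity check (just a sketch), you can print the versions that are actually importable in the environment running the pipeline and compare them with your requirements.txt:
# Check that the installed packages match the pinned versions
import apache_beam
from google.cloud import storage

print(storage.__version__)      # expect 1.30.0 per requirements.txt
print(apache_beam.__version__)  # expect 2.20.0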
I am building a script which automatically builds an AWS Lambda function. I follow this GitHub repo as an inspiration.
However, the lambda_handler that I want to deploy has extra dependencies, such as numpy, pandas, or even lgbm. A simple example is below:
import numpy as np

def lambda_handler(event, context):
    result = np.power(event['data'], 2)
    response = {'result': result}
    return response
Example of an event and its response:
event = {'data': [1,2,3]}
lambda_handler(event, None)
> {"result": [1,4,9]}
I would like to automatically add the needed layer while creating the AWS Lambda. For that I think I need to change the create_lambda_deployment_package function that is in lambda_basic.py in the repo. What I was thinking of doing is the following:
import zipfile
import glob
import io

def create_full_lambda_deployment_package(function_file_name):
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as zipped:
        # Adding all the files around the lambda_handler directory
        for i in glob.glob(function_file_name):
            zipped.write(i)
        # Adding the numpy directory
        for i in glob.glob('./venv/lib/python3.7/site-packages/numpy/*'):
            zipped.write(i, f'numpy/{i[41:]}')
    buffer.seek(0)
    return buffer.read()
Despite the fact that the Lambda is created and a 'numpy' folder appears in my Lambda environment, this unfortunately doesn't work (error: cannot import name 'integer' from partially initialized module 'numpy' (most likely due to a circular import) (/var/task/numpy/__init__.py)).
How could I fix this issue? Or is there maybe another way to solve my problem?
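One thing I have not tried yet (only a rough sketch; the file names, layer name, handler module, and role ARN below are hypothetical, and the layer zip would need the packages under a top-level python/ directory) would be to publish the heavy dependencies as a proper Lambda layer with boto3 and attach it when creating the function:
import boto3

lambda_client = boto3.client('lambda')

# Publish a prebuilt zip of the dependencies (built for the Lambda runtime) as a layer
with open('numpy_pandas_layer.zip', 'rb') as f:
    layer = lambda_client.publish_layer_version(
        LayerName='numpy-pandas',
        Content={'ZipFile': f.read()},
        CompatibleRuntimes=['python3.7'])

# Deploy only the handler code, with the layer attached
with open('handler_only.zip', 'rb') as f:
    lambda_client.create_function(
        FunctionName='power-function',
        Runtime='python3.7',
        Role='arn:aws:iam::123456789012:role/lambda-basic-execution',
        Handler='lambda_handler.lambda_handler',
        Code={'ZipFile': f.read()},
        Layers=[layer['LayerVersionArn']])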
I am getting the following error when trying to go through the esy-osm example:
INFO:esy.osmfilter.pre_filter:OSM_raw_data does not exist
I am using Python 3.8 on Windows, and the code I am using is below:
import os, sys
import configparser, contextlib
from esy.osmfilter import osm_colors as cc
from esy.osmfilter import run_filter
from esy.osmfilter import Node, Way, Relation
PBF_inputfile = os.path.join(os.getcwd(), 'Geospatial_data\OSM_raw/liechtenstein-latest.osm.pbf')
JSON_outputfile = os.path.join(os.getcwd(), 'Geospatial_data/OSM_filtered/liechtenstein.json')
prefilter = {Node: {}, Way: {"man_made":["pipeline",],}, Relation: {}}
whitefilter = []
blackfilter = []
[Data, _] = run_filter('noname',
                       PBF_inputfile,
                       JSON_outputfile,
                       prefilter,
                       whitefilter,
                       blackfilter,
                       NewPreFilterData=True,
                       CreateElements=False,
                       LoadElements=False,
                       verbose=True)
print(len(Data['Node']))
print(len(Data['Relation']))
print(len(Data['Way']))
Does anyone know where I am going wrong on this?
You don't find the .pbf file.
Please look at the path separator on your machine.
For Windows it's '\' and for Unix it's '/'.
You have used both simultaneously:
'Geospatial_data\OSM_raw/liechtenstein-latest.osm.pbf'
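A small sketch of how I would build those paths, letting os.path.join pick the right separator for the OS:
import os

# Build the paths from components instead of hard-coding separators
PBF_inputfile = os.path.join(os.getcwd(), 'Geospatial_data', 'OSM_raw',
                             'liechtenstein-latest.osm.pbf')
JSON_outputfile = os.path.join(os.getcwd(), 'Geospatial_data', 'OSM_filtered',
                               'liechtenstein.json')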
Cheers,
Adam
I have a data analysis tool that I made a Python package for and I'd like to include some sample datasets, but I don't want to include all the datasets directly in the Python package because it will bloat the size and slow down install for people who don't use them.
The behavior I want is that when a sample dataset is referenced, it automatically gets downloaded from a URL and saved to the package locally, but the next time it is used it reads the local version instead of re-downloading it. This caching should persist permanently for my package, not only for the duration of the Python session.
How can I do this?
I ended up making a folder under AppData using the appdirs package.
datasets.py
import os
import pandas as pd
from pandasgui.utility import get_logger
from appdirs import user_data_dir
from tqdm import tqdm
logger = get_logger(__name__)
__all__ = ["all_datasets",
"country_indicators",
"us_shooting_incidents",
"diamonds",
"pokemon",
"anscombe",
"attention",
"car_crashes",
"dots",
"exercise",
"flights",
"fmri",
"gammas",
"geyser",
"iris",
"mpg",
"penguins",
"planets",
"tips",
"titanic",
"gapminder",
"stockdata"]
dataset_names = [x for x in __all__ if x != "all_datasets"]
all_datasets = {}
root_data_dir = os.path.join(user_data_dir(), "pandasgui", "dataset_files")
# Open local data CSVs if they all exist
if all([os.path.exists(os.path.join(root_data_dir, f"{name}.csv")) for name in dataset_names]):
    for name in dataset_names:
        data_path = os.path.join(root_data_dir, f"{name}.csv")
        if os.path.isfile(data_path):
            all_datasets[name] = pd.read_csv(data_path)
# Download data if it doesn't exist locally
else:
    os.makedirs(root_data_dir, exist_ok=True)
    logger.info(f"Downloading PandasGui sample datasets into {root_data_dir}...")
    pbar = tqdm(dataset_names, bar_format='{percentage:3.0f}% {bar} | {desc}')
    for name in pbar:
        pbar.set_description(f"{name}.csv")
        data_path = os.path.join(root_data_dir, f"{name}.csv")
        if os.path.isfile(data_path):
            all_datasets[name] = pd.read_csv(data_path)
        else:
            all_datasets[name] = pd.read_csv(
                os.path.join("https://raw.githubusercontent.com/adamerose/datasets/master/",
                             f"{name}.csv"))
            all_datasets[name].to_csv(data_path, index=False)

# Add the datasets to globals so they can be imported like `from pandasgui.datasets import iris`
for name in all_datasets.keys():
    globals()[name] = all_datasets[name]
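After the first import has downloaded and cached the files, the datasets load straight from the local folder, for example:
# Second and later runs read the cached CSVs under the appdirs data folder
from pandasgui.datasets import iris, titanic

print(iris.head())
print(titanic.shape)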
I'm trying to link my Kaggle project to Google Cloud Platform, but I can't seem to get it done even after following https://cloud.google.com/docs/authentication/getting-started
I still get this error:
DefaultCredentialsError: File /C:/Users/Jirah Marie Navarro/kagglebqtest-2999bd391350.json was not found.
This is my code:
# Replace 'kaggle-competitions-project' with YOUR OWN project id here --
PROJECT_ID = 'kagglebqtest'
from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID, location="US")
dataset = client.create_dataset('bqml_example', exists_ok=True)
from google.cloud.bigquery import magics
from kaggle.gcp import KaggleKernelCredentials
magics.context.credentials = KaggleKernelCredentials()
magics.context.project = PROJECT_ID
# create a reference to our table
table = client.get_table("kaggle-competition-datasets.geotab_intersection_congestion.train")
# look at five rows from our dataset
client.list_rows(table, max_results=5).to_dataframe()
It seems that the GOOGLE_APPLICATION_CREDENTIALS environment variable is not set correctly.
I suggest you verify that the 'kagglebqtest-2999bd391350.json' file is in the path 'C:/Users/Jirah Marie Navarro/'.
I also recommend using a path without spaces, such as 'C:/' or 'C:/credentials/'; maybe the JSON credential is not recognized because of the spaces in your path. So you can try something like:
$env:GOOGLE_APPLICATION_CREDENTIALS="C:\credentials\kagglebqtest-2999bd391350.json"
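Alternatively (a sketch, assuming you moved the key file to C:\credentials), you can point the client at the key file explicitly from Python instead of relying on the shell environment:
import os
from google.cloud import bigquery

# Assumption: the JSON key was moved to a folder without spaces in the path
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"C:\credentials\kagglebqtest-2999bd391350.json"

client = bigquery.Client(project="kagglebqtest", location="US")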
I'm using Cloud Composer as my Airflow. When I try to use Jinja in my HQL code, it does not translate it correctly.
I know that the HiveOperator has a Jinja translator, as I'm used to it, but the DataProcHiveOperator doesn't.
I've tried to use hiveconf directly in my HQL files, but when setting those values for my partition (i.e. INSERT INTO TABLE abc PARTITION (ds = ${hiveconf:ds})), it doesn't work.
I have also added the following to my HQL file:
SET ds=to_date(current_timestamp());
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
But it didn't work, as Hive turns the expression above into a STRING.
So my idea was to combine both operators to get the Jinja translator working, but when I do that, I get the following error: ERROR - submit() takes from 3 to 4 positional arguments but 5 were given.
I'm not very familiar with Python coding and any help would be great; see below the code for the operator I'm trying to build.
Header of the Python File (please note that the file contains other Operators not mentioned in this question):
import ntpath
import os
import re
import time
import uuid
from datetime import timedelta
from airflow.contrib.hooks.gcp_dataproc_hook import DataProcHook
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.version import version
from googleapiclient.errors import HttpError
from airflow.utils import timezone
from airflow.utils.operator_helpers import context_to_airflow_vars
Modified DataProcHiveOperator:
class DataProcHiveOperator(BaseOperator):
    template_fields = ['query', 'variables', 'job_name', 'cluster_name', 'dataproc_jars']
    template_ext = ('.q',)
    ui_color = '#0273d4'

    @apply_defaults
    def __init__(
            self,
            query=None,
            query_uri=None,
            hiveconfs=None,
            hiveconf_jinja_translate=False,
            variables=None,
            job_name='{{task.task_id}}_{{ds_nodash}}',
            cluster_name='cluster-1',
            dataproc_hive_properties=None,
            dataproc_hive_jars=None,
            gcp_conn_id='google_cloud_default',
            delegate_to=None,
            region='global',
            job_error_states=['ERROR'],
            *args,
            **kwargs):
        super(DataProcHiveOperator, self).__init__(*args, **kwargs)
        self.gcp_conn_id = gcp_conn_id
        self.delegate_to = delegate_to
        self.query = query
        self.query_uri = query_uri
        self.hiveconfs = hiveconfs or {}
        self.hiveconf_jinja_translate = hiveconf_jinja_translate
        self.variables = variables
        self.job_name = job_name
        self.cluster_name = cluster_name
        self.dataproc_properties = dataproc_hive_properties
        self.dataproc_jars = dataproc_hive_jars
        self.region = region
        self.job_error_states = job_error_states

    def prepare_template(self):
        if self.hiveconf_jinja_translate:
            self.query_uri = re.sub(
                r"(\$\{(hiveconf:)?([ a-zA-Z0-9_]*)\})", r"{{ \g<3> }}", self.query_uri)

    def execute(self, context):
        hook = DataProcHook(gcp_conn_id=self.gcp_conn_id,
                            delegate_to=self.delegate_to)
        job = hook.create_job_template(self.task_id, self.cluster_name, "hiveJob",
                                       self.dataproc_properties)
        if self.query is None:
            job.add_query_uri(self.query_uri)
        else:
            job.add_query(self.query)

        if self.hiveconf_jinja_translate:
            self.hiveconfs = context_to_airflow_vars(context)
        else:
            self.hiveconfs.update(context_to_airflow_vars(context))

        job.add_variables(self.variables)
        job.add_jar_file_uris(self.dataproc_jars)
        job.set_job_name(self.job_name)

        job_to_submit = job.build()
        self.dataproc_job_id = job_to_submit["job"]["reference"]["jobId"]

        hook.submit(hook.project_id, job_to_submit, self.region, self.job_error_states)
I would like to be able to use Jinja templating inside my HQL code to allow partition automation on my data pipeline.
P.S: I'll use the Jinja templating mostly for Partition DateStamp
Does anyone know what the error message I'm getting means, and how to solve it?
ERROR - submit() takes from 3 to 4 positional arguments but 5 were given
Thank you!
It is because of the 5th argument job_error_states which is only in master and not in the current stable release (1.10.1).
Source Code for 1.10.1 -> https://github.com/apache/incubator-airflow/blob/76a5fc4d2eb3c214ca25406f03b4a0c5d7250f71/airflow/contrib/hooks/gcp_dataproc_hook.py#L219
So remove that parameter and it should work.
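Applied to the execute() method above, the call would become something like this (a sketch against the 1.10.1 hook signature):
# In Airflow 1.10.1, DataProcHook.submit() accepts project_id, job and an optional region,
# but no job_error_states argument
hook.submit(hook.project_id, job_to_submit, self.region)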