There is a scheduled notebook that uses the BigQuery client and a service account with Owner rights. When I run the cells manually, it updates the BQ table. BigQuery and Vertex AI share one project.
I've found a similar question, but in my case there is no output in the bucket folder:
Google Cloud Vertex AI Notebook Scheduled Runs Aren't Running Code?
In the schedules section this notebook is stuck on Initializing:
Here's the notebook:
Update: I've tried to schedule the cells one by one, and every attempt that gets stuck fails at the BigQuery step:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'dialogflow-293713-f89fd8f4ed2d.json'
bigquery_client = bigquery.Client()
QUERY = f"""
INSERT `dialogflow-293713.chats.Ежедневная сводка маркетплейса` (date, effectiveness, operatorWorkload)
VALUES({period}, {effectiveness}, {redirectedToSales}, {operatorWorkload})
"""
Query_Results = bigquery_client.query(QUERY)
This way of authorization worked!
from google.cloud import bigquery
from google.oauth2 import service_account
import json
# Placeholder: paste the key/value pairs from your credential.json file into this dict.
raw_credential = {}
# The JSON round-trip just yields a plain dict copy of raw_credential.
service_account_info = json.loads(json.dumps(raw_credential))
credentials = service_account.Credentials.from_service_account_info(service_account_info)
client = bigquery.Client(credentials=credentials)
query = """ Your Query """
df = client.query(query).to_dataframe()
# See some results; remove if not needed.
print(df.head())
# OPTIONAL: If you want to move data to a google cloud storage bucket
from google.cloud import storage
client = storage.Client()
bucket_name = 'my-bucket-id'
bucket = client.get_bucket(bucket_name)
# if folder `output` does not exist it will be created. You can use the name as you want.
bucket.blob("output/output.csv").upload_from_string(df.to_csv(), 'text/csv')
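Instead of pasting the key contents into the notebook, the same dictionary can be read from the key file itself. This is a minimal sketch, assuming the standard service account key layout; `load_service_account_info` and `REQUIRED_FIELDS` are illustrative names, not part of any Google library.

```python
import json

# Fields present in a standard service account key file
# (assumed minimum set for this sketch).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def load_service_account_info(path):
    """Read a service account key file and return its contents as a dict."""
    with open(path, "r", encoding="utf-8") as f:
        info = json.load(f)
    missing = [name for name in REQUIRED_FIELDS if name not in info]
    if missing:
        raise ValueError(f"key file is missing fields: {missing}")
    return info
```

The returned dict can be passed to service_account.Credentials.from_service_account_info in place of raw_credential.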
Resolved on Issue Tracker in this thread.
I am consistently running into problems querying in Python using the following libraries. I am given a 403 error saying that the "user does not have 'bigquery.readsessions.create' permissions" for the project I am accessing.
#BQ libs
from google.cloud import bigquery
from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file('path.json')
#BigQuery Connection and query execution
bqProjectId = 'project_id'
project_id = bqProjectId
client = bigquery.Client(credentials=credentials, project=project_id)
query = client.query("SELECT * FROM `table`")
Output = query.to_dataframe()
I am using the same service account JSON file and the same query in Java, R, and even in a BI tool. All three successfully retrieved the data, so this seems to be Python-specific.
I have tried starting with a clean environment. I even reinstalled anaconda. Nothing seems to work. What are some possible culprits here?
*Obviously my path, query, and creds are different for that actual script.
You can try the below code by including access scope https://www.googleapis.com/auth/cloud-platform for your requirement.
from google.cloud import bigquery
from google.oauth2 import service_account
key_path = "path/to/service_account.json"
credentials = service_account.Credentials.from_service_account_file(
    key_path,
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
project_id = "project-id"
client = bigquery.Client(
    credentials=credentials,
    project=credentials.project_id,
)
sql_query = "SELECT * FROM table"
query_job = client.query(sql_query)
results = query_job.result()
df = results.to_dataframe()
print(df)
As per the error message you are getting, the service account is missing a role that includes the permission bigquery.readsessions.create. The BigQuery Admin role includes it, as does the narrower BigQuery Read Session User role.
For more information regarding BigQuery IAM roles you can refer to this document.
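The permission in the error belongs to the BigQuery Storage API, which to_dataframe can use to download results; that may be why other tools work with the same key. If granting the role is not possible, one option is to skip the Storage API for the download. This is a hedged sketch: `query_to_dataframe` is a hypothetical helper, `client` is assumed to be a google.cloud.bigquery.Client, and the REST download is usually slower for large results.

```python
def query_to_dataframe(client, sql, use_storage_api=False):
    """Run `sql` and return the result as a DataFrame.

    `client` is expected to be a google.cloud.bigquery.Client. With
    use_storage_api=False the rows are fetched through the regular REST
    API, so no read session is created.
    """
    rows = client.query(sql).result()
    return rows.to_dataframe(create_bqstorage_client=use_storage_api)
```

For example, Output = query_to_dataframe(client, "SELECT * FROM `table`") with the client built as in the snippet above.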
I have read official and unofficial documentation online, and I am still unable to import BigQuery logs such as "bigquery_resource" (to get all the insert, update, merge ... processing on my GCP project) with Python on my local machine, from a GCP project where I am Owner.
Mandatory prerequisites:
Only use scripts to read and catch the logs with a filter, without creating a Cloud Function, data in a bucket, manual actions by a user on the GCP project, etc.
Use a service account in the process
Import the BigQuery logs from GCP to my local machine when I execute my Python script
Here is the code where I try to get the logs:
import google.protobuf
from google.cloud.bigquery_logging_v1 import AuditData
import google.cloud.logging
from datetime import datetime, timedelta, timezone
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="C:\\mypath\\credentials.json"
project_id = os.environ["GOOGLE_CLOUD_PROJECT"] = "project1"
yesterday = datetime.now(timezone.utc) - timedelta(days=2)
time_format = "%Y-%m-%dT%H:%M:%S.%f%z"
filter_str = (
    f'logName="projects/{project_id}/logs/cloudaudit.googleapis.com%2Factivity"'
    f' AND resource.type="bigquery_resource"'
    f' AND timestamp>="{yesterday.strftime(time_format)}"'
)
client = google.cloud.logging.Client(project="project1")
for entry in client.list_entries(filter_=filter_str):
    decoded_entry = entry.to_api_repr()
    #print(decoded_entry)
    print(entry) #the same output as print(decoded_entry)
open("C:\\mypath\\logs.txt", "w").close()
with open("C:\\mypath\\logs.txt", "w") as f:
    for entry in client.list_entries(filter_=filter_str):
        f.write(entry)
Unfortunately, it doesn't work (and my code is messy). I get a ProtobufEntry in the variable entry, as shown below, and I don't know how to get the data from my GCP project in a proper way.
All help is welcome! (Please don't answer with a deprecated answer from OpenAI ChatGPT.)
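A file opened in text mode accepts only strings, so passing the entry object itself to f.write raises a TypeError. One possible way to serialize the entries is one JSON object per line. This is a sketch, assuming each entry exposes to_api_repr() as used in the loop above; `write_entries_as_json_lines` is a hypothetical helper name.

```python
import json


def write_entries_as_json_lines(entries, path):
    """Write each log entry as one JSON object per line; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            # to_api_repr() returns a dict; default=str covers values
            # that json cannot encode on its own (e.g. datetimes).
            f.write(json.dumps(entry.to_api_repr(), default=str) + "\n")
            count += 1
    return count
```

It could be called as write_entries_as_json_lines(client.list_entries(filter_=filter_str), "C:\\mypath\\logs.txt").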
Here is how I export my logs without creating a bucket, sink, Pub/Sub topic, Cloud Function, BigQuery table, etc.
=> Only one service account with rights on my project and one .py script on my local machine. I added an option in the Python script to scan only BigQuery resources during the last hour.
I pass the full path to gcloud because I have a problem with the PATH environment variable on my local machine when using Popen; you may not need to do this.
from subprocess import Popen, PIPE
import json
from google.cloud.bigquery_logging_v1 import AuditData
import google.cloud.logging
from datetime import datetime, timedelta, timezone
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="C:\\Users\\USERAAAA\\Documents\\Python Scripts\\credentials.json"
gcloud_path = "C:\\Program Files (x86)\\Google\\Cloud SDK\\google-cloud-sdk\\bin\\gcloud.cmd"
process = Popen([gcloud_path, "logging", "read", "resource.type=bigquery_resource AND logName=projects/PROJECTGCP1/logs/cloudaudit.googleapis.com%2Fdata_access", "--freshness=1h"], stdout=PIPE, stderr=PIPE)
stdout, stderr = process.communicate()
output_str = stdout.decode()
# Write the output string to a file
with open("C:\\Users\\USERAAAA\\Documents\\Python_Scripts\\testes.txt", "w") as f:
    f.write(output_str)
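If the entries need to be processed rather than just saved, gcloud can emit JSON through its --format=json flag, which is easier to parse than the default output. This is a sketch under that assumption; `build_read_command` and `parse_entries` are hypothetical helper names, and the filter mirrors the one in the command above.

```python
import json


def build_read_command(gcloud_path, project, freshness="1h"):
    """Build the argument list for `gcloud logging read` with JSON output."""
    log_filter = (
        "resource.type=bigquery_resource AND "
        f"logName=projects/{project}/logs/cloudaudit.googleapis.com%2Fdata_access"
    )
    return [gcloud_path, "logging", "read", log_filter,
            f"--freshness={freshness}", "--format=json"]


def parse_entries(stdout_bytes):
    """Decode gcloud's JSON output into a list of dicts (empty if no output)."""
    text = stdout_bytes.decode("utf-8").strip()
    return json.loads(text) if text else []
```

The list from build_read_command can be passed to Popen as before, and parse_entries(stdout) then yields one dict per log entry.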
One way to achieve this is as follows:
Create a dedicated logging sink for BigQuery logs:
gcloud logging sinks create my-example-sink bigquery.googleapis.com/projects/my-project-id/datasets/auditlog_dataset \
  --log-filter='protoPayload.metadata."@type"="type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata"'
The above command creates a logging sink that writes to a dataset named auditlog_dataset and includes only BigQueryAuditMetadata messages. Refer to BigQueryAuditMetadata for all the events captured as part of GCP AuditData.
Create a service account and give it access to the dataset created above.
For creating a service account refer here, and for granting access to the dataset refer here.
Use this service account to authenticate from your local environment and query the dataset with the BigQuery Python client to get the filtered BigQuery data.
from google.cloud import bigquery
client = bigquery.Client()
# Select rows from log dataset
QUERY = (
    'SELECT name FROM `MYPROJECTID.MYDATASETID.cloudaudit_googleapis_com_activity` '
    'LIMIT 100')
query_job = client.query(QUERY) # API request
rows = query_job.result() # Waits for query to finish
for row in rows:
    print(row.name)
Also, you can query the audit tables from the console directly.
Reference: BigQuery audit logging.
Another option is to use a Python script to query log events. One more option is to use Cloud Pub/Sub to route logs to external (outside GCP) clients.
I mostly prefer to keep the filtered logs in a dedicated Log Analytics bucket, query them as needed, and create custom log-based metrics using Cloud Monitoring. Moving logs out of GCP may incur network egress charges if you are querying a large volume of data; refer to the documentation.
I am trying to run SQL queries on Google BigQuery from a Jupyter notebook.
I did everything as written here: https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas#download_query_results_using_the_client_library.
I opened a Client Account and downloaded the JSON file.
Now I try to run the script:
from google.cloud import bigquery
bqclient = bigquery.Client('c://folder/client_account.json')
# Download query results.
query_string = """
SELECT * from `project.dataset.table`
"""
dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(
        # Optionally, explicitly request to use the BigQuery Storage API. As of
        # google-cloud-bigquery version 1.26.0 and above, the BigQuery Storage
        # API is used by default.
        create_bqstorage_client=True,
    )
)
print(dataframe.head())
But I keep getting an error:
DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
I do not understand what I am doing wrong, because the JSON file looks fine and the path to the file is correct.
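The error means the client library could not find any credentials. In your script the path to the JSON file is passed as the first positional argument of bigquery.Client, which is the project, not a key file, so the client falls back to the default credentials lookup and fails.
To authenticate using a service account, follow the approach below: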
from google.cloud import bigquery
from google.oauth2 import service_account
# TODO(developer): Set key_path to the path to the service account key
# file.
key_path = "path/to/service_account.json"
credentials = service_account.Credentials.from_service_account_file(
    key_path, scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
bqclient = bigquery.Client(credentials=credentials, project=credentials.project_id,)
query_string = """
SELECT * from `project.dataset.table`
"""
dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(
        # Optionally, explicitly request to use the BigQuery Storage API. As of
        # google-cloud-bigquery version 1.26.0 and above, the BigQuery Storage
        # API is used by default.
        create_bqstorage_client=True,
    )
)
print(dataframe.head())
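The error message also names a second route: setting GOOGLE_APPLICATION_CREDENTIALS before the client is created, so that bigquery.Client() with no arguments can find the key. This is a minimal sketch of that route; `use_key_file` is a hypothetical helper, and the existence check is only there to fail early on a wrong path.

```python
import os


def use_key_file(key_path):
    """Point the default credentials lookup at a service account key file."""
    if not os.path.isfile(key_path):
        raise FileNotFoundError(f"key file not found: {key_path}")
    # Client libraries read this variable when no credentials are passed.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.abspath(key_path)
    return os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
```

Call it once at the top of the notebook, before bigquery.Client() is constructed.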
I have a Cloud Function whose code works fine when I test locally. However, it doesn't work as a Cloud Function even though it deploys successfully. After deploying, I tried adding allUsers as a Cloud Function invoker. Ingress settings are set to allow all web traffic.
I get a 500 error, and the page says "Error: could not handle the request" when visiting the URL.
Cloud Scheduler constantly fails, and the logs for the Cloud Function don't help explain why it fails.
When expanded, the logs give no further detail either.
I have no idea what else to try to resolve this issue. I just want to invoke my HTTP Cloud Function on a schedule. The code works fine when run and tested using a service account. Why doesn't it work when deployed as a function?
Here is the code I'm using:
from bs4 import BeautifulSoup
import pandas as pd
import constants as const
from google.cloud import storage
import os
import json
from datetime import datetime
from google.cloud import bigquery
import re
from flask import escape
#service_account_path = os.path.join("/Users/nbamodel/nba-data-keys.json")
#os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = service_account_path
client = storage.Client()
bucket = client.get_bucket(const.destination_gcs_bucket)
def scrape_team_data(request):
    """HTTP Cloud Function.
    Args:
        request (flask.Request): The request object.
        <http://flask.pocoo.org/docs/1.0/api/#flask.Request>
    Returns:
        The response text, or any set of values that can be turned into a
        Response object using `make_response`
        <http://flask.pocoo.org/docs/1.0/api/#flask.Flask.make_response>.
    """
    headers = [
        'Rank',
        'Team',
        'Age',
        'Wins',
        'Losses',
        'PW',
        'PL',
        'MOV',
        'SOS',
        'SRS',
        'ORtg',
        'DRtg',
        'NRtg',
        'Pace',
        'FTr',
        '_3PAr',
        'TS_pct',
        'offense_eFG_pct',
        'offense_TOV_pct',
        'offense_ORB_pct',
        'offense_FT_FGA',
        'defense_eFG_pct',
        'defense_TOV_pct',
        'defense_DRB_pct',
        'defense_FT_FGA',
        'Arena',
        'Attendance',
        'Attendance_Game'
    ]
    r = requests.get('https://www.basketball-reference.com/leagues/NBA_2020.html')
    matches = re.findall(r'id=\"misc_stats\".+?(?=table>)table>', r.text, re.DOTALL)
    find_table = pd.read_html('<table ' + matches[0])
    df = find_table[0]
    df.columns = headers
    filename = 'teams_data_adv_stats' #+ datetime.now().strftime("%Y%m%d")
    df.to_json(filename, orient='records', lines=True)
    print(filename)
    # Push data to GCS
    blob = bucket.blob(filename)
    blob.upload_from_filename(
        filename=filename,
        content_type='application/json'
    )
    # Create BQ table from data in bucket
    client = bigquery.Client()
    dataset_id = 'nba_model'
    dataset_ref = client.dataset(dataset_id)
    job_config = bigquery.LoadJobConfig()
    job_config.create_disposition = 'CREATE_IF_NEEDED'
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    uri = "gs://nba_teams_data/{}".format(filename)
    load_job = client.load_table_from_uri(
        uri,
        dataset_ref.table("teams_data"),
        location="US", # Location must match that of the destination dataset.
        job_config=job_config,
    ) # API request
    print("Starting job {}".format(load_job.job_id))
    load_job.result() # Waits for table load to complete.
    print("Job finished.")
    destination_table = client.get_table(dataset_ref.table("teams_data"))
    print("Loaded {} rows.".format(destination_table.num_rows))
    return
I have deployed your code into a Cloud Function and it fails for two reasons.
First, it's missing the requests dependency, so the line import requests has to be added at the top of the file, with the other imports.
Second, your code tries to write a file on a read-only file system, which is immediately rejected by the OS, and the function gets terminated. The write is done by the method DataFrame.to_json, which tries to write content to the file teams_data_adv_stats in order to later upload it to a GCS bucket.
There are two ways to work around this issue:
Create the file in the temporary folder. As explained in the documentation, you cannot write to the file system, with the exception of the /tmp directory. I succeeded using this method with the following modified lines:
filename = 'teams_data_adv_stats'
path = os.path.join('/tmp', filename)
df.to_json(path, orient='records', lines=True)
blob = bucket.blob(filename)
blob.upload_from_filename(
    filename=path,
    content_type='application/json'
)
Avoid creating a file and work with a string. Instead of using upload_from_filename I suggest you work with upload_from_string. I have managed to succeed using this method with the following modified lines:
filename = 'teams_data_adv_stats'
data_json = df.to_json(orient='records', lines=True)
blob = bucket.blob(filename)
blob.upload_from_string(
    data_json,
    content_type='application/json'
)
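For the first option, the temporary directory does not have to be hard-coded, which keeps the same code runnable on a local machine where /tmp may not exist. This is a sketch, assuming tempfile.gettempdir() resolves to /tmp on the Cloud Functions runtime (the usual Linux default, not verified here); `writable_path` is a hypothetical helper name.

```python
import os
import tempfile


def writable_path(filename):
    """Return a path for `filename` inside the system temporary directory."""
    # basename() drops any directory part so the file always lands
    # directly in the temporary directory.
    return os.path.join(tempfile.gettempdir(), os.path.basename(filename))
```

With it, path = writable_path(filename) replaces os.path.join('/tmp', filename) in the first set of modified lines.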
As a heads-up, you can test your Cloud Functions from the Testing tab on the function's details page. I recommend it, because it is what I used to troubleshoot your issue and it is handy to know about. Also bear in mind that there's an ongoing issue with logs on failing Cloud Functions with the python37 runtime that prevents the error message from appearing. I encountered the issue while working on your CF and got around it with the workaround provided.
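When the logs stay silent, a general pattern (not the linked workaround, whose details are not reproduced here) is to catch exceptions in the handler and print the traceback yourself, since printed output ends up in the function's logs. This is a sketch; `log_exceptions` is a hypothetical decorator name, and the returned tuple relies on Flask accepting (body, status) as a response.

```python
import functools
import traceback


def log_exceptions(handler):
    """Wrap an HTTP handler so any exception is printed before a 500 is returned."""
    @functools.wraps(handler)
    def wrapper(request):
        try:
            return handler(request)
        except Exception:
            # Print the full traceback so it appears in the function's logs.
            print(traceback.format_exc())
            return ("Internal error, see logs", 500)
    return wrapper
```

Applied as @log_exceptions above def scrape_team_data(request), it would surface errors such as the missing import or the read-only file system.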
As a side note I did all the reproduction with the following requirements.txt file in order to deploy and run successfully, since you didn't provide it. I assume this is correct:
beautifulsoup4==4.9.1
Flask==1.1.2
google-cloud-bigquery==1.27.2
google-cloud-storage==1.30.0
lxml==4.5.2
pandas==1.1.1
I would like to automate a csv file extraction process from Google BigQuery to a Google Cloud Storage Bucket, and from the latter to an external server with two Python scripts, could you help me please? I would appreciate it.
For extracting from BigQuery in Python, you can use the Python Client for Google BigQuery.
The below snippet based on this repository should get you going:
# client = bigquery.Client()
# bucket_name = 'my-bucket'
project = "bigquery-public-data"
dataset_id = "samples"
table_id = "shakespeare"
destination_uri = "gs://{}/{}".format(bucket_name, "shakespeare.csv")
dataset_ref = bigquery.DatasetReference(project, dataset_id)
table_ref = dataset_ref.table(table_id)
extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location="US",
) # API request
extract_job.result() # Waits for job to complete.
print(
    "Exported {}:{}.{} to {}".format(project, dataset_id, table_id, destination_uri)
)
In order to post the export to another server, you can use the Cloud Storage Client Library for Python to post the CSV file to your server or service of choice.
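The second step could look like the sketch below: the Cloud Storage library reads the exported object, and the POST itself is done with urllib from the standard library. The target URL, the helper names `build_csv_post` and `forward_blob`, and the assumption that the server accepts a raw CSV request body are all illustrative.

```python
import urllib.request


def build_csv_post(url, csv_bytes):
    """Build a POST request carrying the CSV content as the request body."""
    return urllib.request.Request(
        url,
        data=csv_bytes,
        headers={"Content-Type": "text/csv"},
        method="POST",
    )


def forward_blob(bucket_name, blob_name, url):
    """Download a CSV object from GCS and POST it to `url`; return the HTTP status."""
    # Imported here so build_csv_post stays usable without the GCS library.
    from google.cloud import storage

    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    request = build_csv_post(url, blob.download_as_bytes())
    with urllib.request.urlopen(request) as response:
        return response.status
```

For the export above, forward_blob(bucket_name, "shakespeare.csv", url) would send the file once the extract job has finished.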
As far as I know, BigQuery can't export or download a query result directly to GCS or a local file. You can keep it in a temporary / staging table and then use code like the one below to export to GCS:
https://cloud.google.com/bigquery/docs/exporting-data#exporting_table_data
You can then put this in a container, deploy it as a Cloud Run service, and call it from Cloud Scheduler.