I have a Cloud Function whose code works fine when I test it locally, but it doesn't work once deployed as a Cloud Function, even though the deployment succeeds. After deploying, I tried adding allUsers as a Cloud Functions Invoker, and the ingress settings are set to allow all traffic.
When I visit the URL I get a 500 error with the message Error: could not handle the request.
Cloud Scheduler fails every time, and the Cloud Function's logs don't give any indication of why.
When expanded, the logs give no further detail either.
I'm out of ideas for resolving this. I just want to invoke my HTTP Cloud Function on a schedule; the code runs fine when tested locally using a service account. Why doesn't it work once it's deployed as a function?
Here is the code I'm using:
from bs4 import BeautifulSoup
import pandas as pd
import constants as const
from google.cloud import storage
import os
import json
from datetime import datetime
from google.cloud import bigquery
import re
from flask import escape
#service_account_path = os.path.join("/Users/nbamodel/nba-data-keys.json")
#os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = service_account_path
client = storage.Client()
bucket = client.get_bucket(const.destination_gcs_bucket)
def scrape_team_data(request):
    """HTTP Cloud Function.
    Args:
        request (flask.Request): The request object.
        <http://flask.pocoo.org/docs/1.0/api/#flask.Request>
    Returns:
        The response text, or any set of values that can be turned into a
        Response object using `make_response`
        <http://flask.pocoo.org/docs/1.0/api/#flask.Flask.make_response>.
    """
    headers = [
        'Rank',
        'Team',
        'Age',
        'Wins',
        'Losses',
        'PW',
        'PL',
        'MOV',
        'SOS',
        'SRS',
        'ORtg',
        'DRtg',
        'NRtg',
        'Pace',
        'FTr',
        '_3PAr',
        'TS_pct',
        'offense_eFG_pct',
        'offense_TOV_pct',
        'offense_ORB_pct',
        'offense_FT_FGA',
        'defense_eFG_pct',
        'defense_TOV_pct',
        'defense_DRB_pct',
        'defense_FT_FGA',
        'Arena',
        'Attendance',
        'Attendance_Game'
    ]
    r = requests.get('https://www.basketball-reference.com/leagues/NBA_2020.html')
    matches = re.findall(r'id=\"misc_stats\".+?(?=table>)table>', r.text, re.DOTALL)
    find_table = pd.read_html('<table ' + matches[0])
    df = find_table[0]
    df.columns = headers
    filename = 'teams_data_adv_stats' #+ datetime.now().strftime("%Y%m%d")
    df.to_json(filename, orient='records', lines=True)
    print(filename)

    # Push data to GCS
    blob = bucket.blob(filename)
    blob.upload_from_filename(
        filename=filename,
        content_type='application/json'
    )

    # Create BQ table from data in bucket
    client = bigquery.Client()
    dataset_id = 'nba_model'
    dataset_ref = client.dataset(dataset_id)
    job_config = bigquery.LoadJobConfig()
    job_config.create_disposition = 'CREATE_IF_NEEDED'
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    uri = "gs://nba_teams_data/{}".format(filename)
    load_job = client.load_table_from_uri(
        uri,
        dataset_ref.table("teams_data"),
        location="US",  # Location must match that of the destination dataset.
        job_config=job_config,
    )  # API request
    print("Starting job {}".format(load_job.job_id))
    load_job.result()  # Waits for table load to complete.
    print("Job finished.")
    destination_table = client.get_table(dataset_ref.table("teams_data"))
    print("Loaded {} rows.".format(destination_table.num_rows))
    return
I have deployed your code as a Cloud Function and it's failing for two reasons.
First, it's missing the requests dependency, so the line import requests has to be added at the top of the file with the other imports.
Second, your code is trying to write a file on a read-only file system; the write is immediately rejected by the OS and the function gets terminated. The write is done by DataFrame.to_json, which writes the content to the file teams_data_adv_stats so it can later be uploaded to a GCS bucket.
There are two ways you can work around this issue:
Create the file in the temporary folder. As explained in the documentation, you cannot write to the file system except for the /tmp directory. I managed to get this working with the following modified lines:
filename = 'teams_data_adv_stats'
path = os.path.join('/tmp', filename)
df.to_json(path, orient='records', lines=True)
blob = bucket.blob(filename)
blob.upload_from_filename(
    filename=path,
    content_type='application/json'
)
Avoid creating a file and work with a string instead. Rather than upload_from_filename, I suggest you use upload_from_string. I managed to get this working with the following modified lines:
filename = 'teams_data_adv_stats'
data_json = df.to_json(orient='records', lines=True)
blob = bucket.blob(filename)
blob.upload_from_string(
    data_json,
    content_type='application/json'
)
As a heads-up, you can test your Cloud Function from the Testing tab on the function's details page. I recommend it, since it's what I used to troubleshoot your issue, and it's handy to know about. Also bear in mind that there's an ongoing issue with logs on failing Cloud Functions on the python37 runtime that prevents the error message from appearing; I hit it while working on your function and used the workaround provided in that issue.
As a side note, since you didn't provide a requirements.txt, I used the following one to deploy and run successfully. I assume this is correct:
beautifulsoup4==4.9.1
Flask==1.1.2
google-cloud-bigquery==1.27.2
google-cloud-storage==1.30.0
lxml==4.5.2
pandas==1.1.1
I have read some official and unofficial documentation online, but I'm currently unable to import the BigQuery logs (resource.type="bigquery_resource", to get all the insert, update, merge, ... operations on my GCP project) from a GCP project where I'm Owner, using Python on my local machine.
Mandatory prerequisites:
Only use scripts to read and capture the logs with a filter, without creating Cloud Functions, data in buckets, manual actions by a user on the GCP project, etc.
Use a service account in the process.
Import the BigQuery logs from GCP to my local machine when I run my Python script.
Here is the code where I try to get the logs:
import google.protobuf
from google.cloud.bigquery_logging_v1 import AuditData
import google.cloud.logging
from datetime import datetime, timedelta, timezone
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="C:\\mypath\\credentials.json"
project_id = os.environ["GOOGLE_CLOUD_PROJECT"] = "project1"
yesterday = datetime.now(timezone.utc) - timedelta(days=2)
time_format = "%Y-%m-%dT%H:%M:%S.%f%z"
filter_str = (
    f'logName="projects/{project_id}/logs/cloudaudit.googleapis.com%2Factivity"'
    f' AND resource.type="bigquery_resource"'
    f' AND timestamp>="{yesterday.strftime(time_format)}"'
)
client = google.cloud.logging.Client(project="project1")

for entry in client.list_entries(filter_=filter_str):
    decoded_entry = entry.to_api_repr()
    #print(decoded_entry)
    print(entry) #the same output as print(decoded_entry)

open("C:\\mypath\\logs.txt", "w").close()
with open("C:\\mypath\\logs.txt", "w") as f:
    for entry in client.list_entries(filter_=filter_str):
        f.write(entry)
Unfortunately, it doesn't work (and my code is messy): I get a ProtobufEntry in the entry variable and I don't know how to get my data from my GCP project in a proper way.
Any help is welcome! (Please don't answer with a deprecated answer from OpenAI ChatGPT.)
Here is how I export my logs without creating a bucket, sink, Pub/Sub topic, Cloud Function, BigQuery table, etc.
=> Only one service account with rights on my project and one .py script on my local machine, with an option in the script to scan only the bigquery_resource logs from the last hour.
I add the path to gcloud because I have some problems with the PATH environment variable on my machine when using Popen; you may not need to do this.
from subprocess import Popen, PIPE
import json
from google.cloud.bigquery_logging_v1 import AuditData
import google.cloud.logging
from datetime import datetime, timedelta, timezone
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="C:\\Users\\USERAAAA\\Documents\\Python Scripts\\credentials.json"
gcloud_path = "C:\\Program Files (x86)\\Google\\Cloud SDK\\google-cloud-sdk\\bin\\gcloud.cmd"
process = Popen([gcloud_path, "logging", "read", "resource.type=bigquery_resource AND logName=projects/PROJECTGCP1/logs/cloudaudit.googleapis.com%2Fdata_access", "--freshness=1h"], stdout=PIPE, stderr=PIPE)
stdout, stderr = process.communicate()
output_str = stdout.decode()
# write the output string to a file
with open("C:\\Users\\USERAAAA\\Documents\\Python_Scripts\\testes.txt", "w") as f:
    f.write(output_str)
One way to achieve this is as follows:
Create a dedicated logging sink for BigQuery logs:
gcloud logging sinks create my-example-sink bigquery.googleapis.com/projects/my-project-id/datasets/auditlog_dataset \
--log-filter='protoPayload.metadata."#type"="type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata"'
The above command creates a logging sink to a dataset named auditlog_dataset that only includes BigQueryAuditMetadata messages. Refer to BigQueryAuditMetadata for all the events captured as part of GCP AuditData.
Create a service account and give it access to the dataset created above.
For creating a service account refer here, and for granting access to a dataset refer here.
Use this service account to authenticate from your local environment and query the dataset created above using the BigQuery Python client to get the filtered BigQuery data.
from google.cloud import bigquery

client = bigquery.Client()

# Select rows from the log dataset
QUERY = (
    'SELECT name FROM `MYPROJECTID.MYDATASETID.cloudaudit_googleapis_com_activity` '
    'LIMIT 100')
query_job = client.query(QUERY)  # API request
rows = query_job.result()  # Waits for query to finish

for row in rows:
    print(row.name)
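If you want the script to authenticate with the service account key explicitly rather than relying on the GOOGLE_APPLICATION_CREDENTIALS environment variable, a minimal sketch could look like the following (the key path, project ID, dataset, table name, and selected columns are placeholders, not values from your project):
from google.cloud import bigquery
from google.oauth2 import service_account

# Load the key of the service account that was granted access to the sink dataset (path is a placeholder)
credentials = service_account.Credentials.from_service_account_file("path/to/service_account_key.json")
client = bigquery.Client(credentials=credentials, project="MYPROJECTID")

# Query the audit log table in the sink dataset (table name is a placeholder)
QUERY = (
    'SELECT timestamp, logName FROM `MYPROJECTID.auditlog_dataset.cloudaudit_googleapis_com_activity` '
    'LIMIT 100')
for row in client.query(QUERY).result():
    print(row.timestamp, row.logName)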
Also, you can query the audit tables from the console directly.
Reference: BigQuery audit logging.
Another option is to use a Python script to query log events, and yet another is to use Cloud Pub/Sub to route logs to external (out-of-GCP) clients.
I mostly prefer to keep the filtered logs in a dedicated Log Analytics bucket, query them as needed, and create custom log-based metrics using Cloud Monitoring. Moving logs out of GCP may incur network egress charges; refer to the documentation if you are querying a large volume of data.
There is a scheduled notebook that uses the BigQuery client and a service account with Owner rights. When I run the cells manually, it updates the BQ table. BQ and Vertex AI are in the same project.
I've found a similar question, but there is no output in the bucket folder:
Google Cloud Vertex AI Notebook Scheduled Runs Aren't Running Code?
In the Schedules section this notebook is stuck on Initializing. (Screenshots of the schedule and the notebook omitted.)
Update: I've tried scheduling the cells one by one, and all of the stuck attempts fail to get past BigQuery:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'dialogflow-293713-f89fd8f4ed2d.json'
bigquery_client = bigquery.Client()
QUERY = f"""
INSERT `dialogflow-293713.chats.Ежедневная сводка маркетплейса` (date, effectiveness, operatorWorkload)
VALUES({period}, {effectiveness}, {redirectedToSales}, {operatorWorkload})
"""
Query_Results = bigquery_client.query(QUERY)
This way of authorization worked!
from google.cloud import bigquery
from google.oauth2 import service_account
import json
raw_credential = { "dictionary. copy the dict elements of your credential.json file" }
service_account_info = json.loads(json.dumps(raw_credential))
credentials = service_account.Credentials.from_service_account_info(service_account_info)
client = bigquery.Client(credentials=credentials)
query = """ Your Query """
df = client.query(query).to_dataframe()
# see some results; remove if it's not needed
print(df.head())
# OPTIONAL: If you want to move data to a google cloud storage bucket
from google.cloud import storage
client = storage.Client()
bucket_name = 'my-bucket-id'
bucket = client.get_bucket(bucket_name)
# if the folder `output` does not exist it will be created; you can name it as you want
bucket.blob("output/output.csv").upload_from_string(df.to_csv(), 'text/csv')
Resolved on Issue Tracker in this thread.
I am starting a small Python script (not an application) that can upload my *.fit activity files to Strava whenever they are created in a particular folder.
The main steps I plan to take are:
1. Monitor the folder for new or modified *.fit files.
2. Authenticate with Strava so my program can upload files.
(This tool is for personal use only, so I'd like to avoid authenticating on every upload.)
3. Upload the files to my Strava account.
4. Run this routine automatically with the Windows Task Scheduler.
(For example, 4-5 new riding activities will accumulate in my folder; I expect the tool to upload all of them automatically once a week so that I don't have to do it manually.)
For step 2, I really have no idea how to implement it, even after reading through the Strava authentication documentation and several projects other people have developed (e.g. toravir's "rk2s (RunKeeper 2 Strava)" project on GitHub). I gathered that Python modules like stravalib, swagger_client, requests, and json, as well as concepts like OAuth2, may be relevant to step 2, but I still cannot put everything together...
Can anyone experienced give me some advice on implementing step 2? Any related reading would also be perfect!
Advice on other parts of this project is also very welcome and appreciated.
Thank you very much in advance :)
Here is a code example of how you can access the Strava API; check out this gist or use the code below:
import time
import pickle

from fastapi import FastAPI
from fastapi.responses import RedirectResponse
from stravalib.client import Client

CLIENT_ID = 'GET FROM STRAVA API SITE'
CLIENT_SECRET = 'GET FROM STRAVA API SITE'
REDIRECT_URL = 'http://localhost:8000/authorized'

app = FastAPI()
client = Client()

def save_object(obj, filename):
    with open(filename, 'wb') as output:  # Overwrites any existing file.
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)

def load_object(filename):
    with open(filename, 'rb') as input:
        loaded_object = pickle.load(input)
        return loaded_object

def check_token():
    if time.time() > client.token_expires_at:
        refresh_response = client.refresh_access_token(client_id=CLIENT_ID, client_secret=CLIENT_SECRET, refresh_token=client.refresh_token)
        access_token = refresh_response['access_token']
        refresh_token = refresh_response['refresh_token']
        expires_at = refresh_response['expires_at']
        client.access_token = access_token
        client.refresh_token = refresh_token
        client.token_expires_at = expires_at

@app.get("/")
def read_root():
    authorize_url = client.authorization_url(client_id=CLIENT_ID, redirect_uri=REDIRECT_URL)
    return RedirectResponse(authorize_url)

@app.get("/authorized/")
def get_code(state=None, code=None, scope=None):
    token_response = client.exchange_code_for_token(client_id=CLIENT_ID, client_secret=CLIENT_SECRET, code=code)
    access_token = token_response['access_token']
    refresh_token = token_response['refresh_token']
    expires_at = token_response['expires_at']
    client.access_token = access_token
    client.refresh_token = refresh_token
    client.token_expires_at = expires_at
    save_object(client, 'client.pkl')
    return {"state": state, "code": code, "scope": scope}

try:
    client = load_object('client.pkl')
    check_token()
    athlete = client.get_athlete()
    print("For {id}, I now have an access token {token}".format(id=athlete.id, token=client.access_token))
    # To upload an activity
    # client.upload_activity(activity_file, data_type, name=None, description=None, activity_type=None, private=None, external_id=None)
except FileNotFoundError:
    print("No access token stored yet, visit http://localhost:8000/ to get it")
    print("After visiting that url, a pickle file is stored, run this file again to upload your activity")
Download that file, install the requirements, and run it (assuming the file is named main.py):
pip install stravalib
pip install fastapi
pip install uvicorn
uvicorn main:app --reload
I believe you need to authenticate using OAuth in order to upload your activity, which pretty much requires you to have a web server set up that Strava can redirect back to after you click "Authorize". I just set the authentication piece up using Rails & Heroku.
This link has a pretty good flowchart of what needs to happen.
https://developers.strava.com/docs/authentication/
Actually, it looks like you can get your access token and refresh token from the API Settings page of your application. I would also check out the Python Strava library, stravalib; it looks like you could do something like:
from stravalib.client import Client
access_token = 'your_access_token_from_your_api_application_settings_page'
refresh_token = 'your_refresh_token_from_your_api_application_settings_page'
client = Client(access_token=access_token)
athlete = client.get_athlete()
You may need to dig into that library a little more to figure out the upload piece.
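For the upload piece itself, a minimal sketch using stravalib's upload_activity might look like the following; the token, file path, and activity name are placeholders, and it assumes the client already holds a valid access token:
from stravalib.client import Client

client = Client(access_token='your_access_token')

# Upload a FIT file; data_type must match the file format ('fit', 'tcx', 'gpx', ...)
with open('my_ride.fit', 'rb') as fit_file:
    uploader = client.upload_activity(
        activity_file=fit_file,
        data_type='fit',
        name='Morning Ride',  # optional
    )

# wait() polls Strava until processing finishes and returns the new activity
activity = uploader.wait()
print('Uploaded activity:', activity.id)
For step 1, a simple approach could be to let the Windows Task Scheduler run the script periodically and glob the folder for *.fit files that haven't been uploaded yet, rather than monitoring the file system continuously.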
I'm trying to automate some data cleaning tasks by uploading the files to Cloud Storage, running them through a pipeline, and downloading the results.
I have created the template for my pipeline to execute using the GUI in Dataprep, and am attempting to automate the upload and execution of the template using the Google Client Libraries, specifically in Python.
However, I have found that when I run the job with the Python script, the full template is not executed: sometimes some of the steps aren't completed, and sometimes the output file, which should be megabytes in size, is less than 500 bytes. This depends on the template I use; each template has its own issue.
I've tried breaking the large template into smaller templates to apply consecutively so I can see where the issue is, which is how I discovered that each template has its own issue. I have also tried creating the job from the Dataflow Monitoring Interface and found that anything created there runs perfectly, which means there must be some issue with the script I've created.
from oauth2client.client import GoogleCredentials
from googleapiclient.discovery import build

def runJob(bucket, template, fileName):
    # open connection with the needed credentials
    credentials = GoogleCredentials.get_application_default()
    service = build('dataflow', 'v1b3', credentials=credentials)

    # name the job after the file being processed
    jobName = fileName.replace('.csv', '')
    projectId = 'my-project'

    # find the template to run on the dataset
    templatePath = "gs://{bucket}/me@myemail.com/temp/{template}".format(bucket=bucket, template=template)

    # construct the job JSON
    body = {
        "jobName": "{jobName}".format(jobName=jobName),
        "parameters": {
            "inputLocations": "{\"location1\":\"gs://" + bucket + "/me@myemail.com/RawUpload/" + fileName + "\"}",
            "outputLocations": "{\"location1\":\"gs://" + bucket + "/me@myemail.com/CleanData/" + fileName.replace('.csv', '_auto_delete_2') + "\"}",
        },
        "environment": {
            "tempLocation": "gs://{bucket}/me@myemail.com/temp".format(bucket=bucket),
            "zone": "us-central1-f"
        }
    }

    # create and execute the HTTP request
    request = service.projects().templates().launch(projectId=projectId, gcsPath=templatePath, body=body)
    response = request.execute()

    # notify the user
    print(response)
Using the JSON format, my input to the parameters is the same as when I use the Monitoring Interface. This tells me that either something is going on in the background of the Monitoring Interface that I am unaware of and thus not including, or there is an issue with the code I have created.
As I said above, the issue varies depending on the template I try to run, but the most common one is the extremely small output file. The output file is orders of magnitude smaller than it should be, because it contains only the CSV headers and some random samples of the first row of the data, and it is also formatted incorrectly for a CSV file in the first place.
Does anyone know what I'm missing or recognize what I'm doing wrong?
I'm building an API that uploads images to Firebase Storage, and everything works as expected in that regard. The problem is that the syntax makes me specify the file name on each upload, and in production the API will receive upload requests from multiple devices. So I need the code to check for an available ID, set it on the blob() object, and then do a normal upload, but I have no idea how to do that. A random name is also fine; I don't care, as long as it doesn't overwrite another picture.
Here is my current code:
from flask_pymongo import PyMongo
import firebase_admin
from firebase_admin import credentials, auth, storage, firestore
import os
import io
cred = credentials.Certificate('service_account_key.json')
firebase_admin.initialize_app(cred, {'storageBucket': 'MY-DATABASE-NAME.appspot.com'})
bucket = storage.bucket()
blob = bucket.blob("images/newimage.png")  # here is where I'm guessing I should put the next available name

# "apple.png" is a sample image for testing in my directory
with open("apple.png", "rb") as f:
    blob.upload_from_file(f)
As "Klaus D."'s comment said the solution was to implement the "uuid" module
import uuid
.....
.....
blob = bucket.blob("images/" + str(uuid.uuid4()))