I read some documentation on internet official and non official and i'm currently unable to import the logs from bigquery like "bigquery_resource" (for getting all my insert, update, merge ... processing on my gcp project ) from a gcp project where i'm owner with python on my local.
Mandatory prerequisite :
Only use the scripts to read and catch the logs with a filter without creating CF, data in bucket, manual action from user on the gcp project etc...
Using a service account in the process
Import the bigquery logs from the gcp on a local when i execute my script python
Here the code below where i try to get the logs :
import google.protobuf
from google.cloud.bigquery_logging_v1 import AuditData
import google.cloud.logging
from datetime import datetime, timedelta, timezone
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="C:\\mypath\\credentials.json"
project_id = os.environ["GOOGLE_CLOUD_PROJECT"] = "project1"
yesterday = datetime.now(timezone.utc) - timedelta(days=2)
time_format = "%Y-%m-%dT%H:%M:%S.%f%z"
filter_str = (
f'logName="projects/{project_id}/logs/cloudaudit.googleapis.com%2Factivity"'
f' AND resource.type="bigquery_resource"'
f' AND timestamp>="{yesterday.strftime(time_format)}"'
)
client = google.cloud.logging.Client(project="project1")
for entry in client.list_entries(filter_=filter_str):
decoded_entry = entry.to_api_repr()
#print(decoded_entry)
print(entry) #the same output as print(decoded_entry)
open("C:\\mypath\\logs.txt", "w").close()
with open("C:\\mypath\\logs.txt", "w") as f:
for entry in client.list_entries(filter_=filter_str):
f.write(entry)
Unfortunately , it doesn't work(and my code is messy), i get a ProtobufEntry with the var entry like below and i don't know how get my data from my gcp project in a proper way.
All the help is welcome ! (please don't answer me with a deprecated answer from openaichatgpt )
Here how i export my logs without creating bucket, sink, pubsub, cloud function, table in bigquery etc..
=> Only 1 Service account with rights on my project and 1 script .py on my local and added an option in the python script for scan only bigquery ressource during the last hour.
I add the path of gcloud because i have some problem with path in my envvar in my local with the popen lib, maybe you won't need to do it.
from subprocess import Popen, PIPE
import json
from google.cloud.bigquery_logging_v1 import AuditData
import google.cloud.logging
from datetime import datetime, timedelta, timezone
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="C:\\Users\\USERAAAA\\Documents\\Python Scripts\\credentials.json"
gcloud_path = "C:\\Program Files (x86)\\Google\\Cloud SDK\\google-cloud-sdk\\bin\\gcloud.cmd"
process = Popen([gcloud_path, "logging", "read", "resource.type=bigquery_resource AND logName=projects/PROJECTGCP1/logs/cloudaudit.googleapis.com%2Fdata_access", "--freshness=1h"], stdout=PIPE, stderr=PIPE)
stdout, stderr = process.communicate()
output_str = stdout.decode()
# data string into a a file
with open("C:\\Users\\USERAAAA\\Documents\\Python_Scripts\\testes.txt", "w") as f:
f.write(output_str)
One way to achieve this as follows:
Create a dedicated logging sink for BigQuery logs:
gcloud logging sinks create my-example-sink bigquery.googleapis.com/projects/my-project-id/datasets/auditlog_dataset \
--log-filter='protoPayload.metadata."#type"="type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata"'
The above command will create logging sink in a dataset named auditlog_dataset that only includes BigQueryAuditMetadata messages. Refer BigQueryAuditMetadata for all the events which are captured as part of GCP AuditData.
Create a service account and give access to above created dataset.
For creating service account refer here and for granting access to dataset refer here.
Use this service account to authenticate from your local environment and query the above created dataset using BigQuery Python client to get filtered BigQuery data.
from google.cloud import bigquery
client = bigquery.Client()
# Select rows from log dataset
QUERY = (
'SELECT name FROM `MYPROJECTID.MYDATASETID.cloudaudit_googleapis_com_activity`'
'LIMIT 100')
query_job = client.query(QUERY) # API request
rows = query_job.result() # Waits for query to finish
for row in rows:
print(row.name)
Also, you can query the audit tables from the console directly.
Reference BigQuery audit logging.
Another option is to use Python Script to query log events. And one more option is to use Cloud Pub/Sub to route logs to external (out of gcp) clients.
I mostly prefer to keep the filtered logs in dedicated Log Analytics bucket and query as per needs and create custom log based metrics using Cloud Monitoring. Moving logs out of GCP may incur network egress charges, refer the documentation, if you are querying large volume of data.
Related
There is a scheduled notebook, that uses BigQuery client and service account with Owner rights. When I run the cells manually, it makes an update to BQ table. There is one project for both BQ and Vertex AI.
I've found a similar question, but there is no output in bucket folder:
Google Cloud Vertex AI Notebook Scheduled Runs Aren't Running Code?
In schedules section this notebook is stuck on Initializing:
Here's the notebook:
Update: I've tried to schedule cells one by one, and all of the stuck attempts cannot get through BigQuery:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'dialogflow-293713-f89fd8f4ed2d.json'
bigquery_client = bigquery.Client()
QUERY = f"""
INSERT `dialogflow-293713.chats.Ежедневная сводка маркетплейса` (date, effectiveness, operatorWorkload)
VALUES({period}, {effectiveness}, {redirectedToSales}, {operatorWorkload})
"""
Query_Results = bigquery_client.query(QUERY)
This way of authorization worked!
from google.cloud import bigquery
from google.oauth2 import service_account
import json
raw_credential = { "dictionary. copy the dict elements of your credential.json file" }
service_account_info = json.loads(json.dumps(raw_credential))
credentials = service_account.Credentials.from_service_account_info(service_account_info)
client = bigquery.Client(credentials=credentials)
query = """ Your Query """
df = client.query(query).to_dataframe()
#see some results. remove if its not needed.
print(df.head())
# OPTIONAL: If you want to move data to a google cloud storage bucket
from google.cloud import storage
client = storage.Client()
bucket_name = 'my-bucket-id'
bucket = client.get_bucket(bucket_name)
# if folder `output` does not exist it will be created. You can use the name as you want.
bucket.blob("output/output.csv").upload_from_string(df.to_csv(), 'text/csv')
Resolved on Issue Tracker in this thread.
I try to run SQL queries from Google BigQuery in the Jupyter notebook.
I do everything as written here https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas#download_query_results_using_the_client_library.
I opened a Client Account and download the JSON file.
Now I try to run the script :
from google.cloud import bigquery
bqclient = bigquery.Client('c://folder/client_account.json')
# Download query results.
query_string = """
SELECT * from `project.dataset.table`
"""
dataframe = (
bqclient.query(query_string)
.result()
.to_dataframe(
# Optionally, explicitly request to use the BigQuery Storage API. As of
# google-cloud-bigquery version 1.26.0 and above, the BigQuery Storage
# API is used by default.
create_bqstorage_client=True,
)
)
print(dataframe.head())
But I keep getting an error:
DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
I do not understand what I am doing wrong, because the JSON file looks fine and the path to the file is correct.
The error suggests that your GCP environment is not able to identify and configure the required application credentials.
To authenticate using service account follow the below approach :
from google.cloud import bigquery
from google.oauth2 import service_account
# TODO(developer): Set key_path to the path to the service account key
# file.
key_path = "path/to/service_account.json"
credentials = service_account.Credentials.from_service_account_file(
key_path, scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
bqclient = bigquery.Client(credentials=credentials, project=credentials.project_id,)
query_string = """
SELECT * from `project.dataset.table`
"""
dataframe = (
bqclient.query(query_string)
.result()
.to_dataframe(
# Optionally, explicitly request to use the BigQuery Storage API. As of
# google-cloud-bigquery version 1.26.0 and above, the BigQuery Storage
# API is used by default.
create_bqstorage_client=True,
)
)
print(dataframe.head())
I have a cloud function, the code is fine when I test locally. However, it doesn't work as a cloud function even though it deploys successfully. When deployed, I tried adding allUsers as a Cloud Function invoker. Ingress settings are set to allow all web traffic.
I get a 500 error and it says >Error: could not handle the request when visiting the URL.
Cloud Scheduler constantly fails, and the logs for the cloud function don't really help give any understanding as to why it fails.
When expanded, the logs give no further detail either.
I've got no idea what else to try and resolve this issue. I just want to be able to invoke my HTTP cloud function on a schedule, the code works fine when run and tested using a service account. Why doesn't it work when added to the function?
Here is the code I'm using;
from bs4 import BeautifulSoup
import pandas as pd
import constants as const
from google.cloud import storage
import os
import json
from datetime import datetime
from google.cloud import bigquery
import re
from flask import escape
#service_account_path = os.path.join("/Users/nbamodel/nba-data-keys.json")
#os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = service_account_path
client = storage.Client()
bucket = client.get_bucket(const.destination_gcs_bucket)
def scrape_team_data(request):
"""HTTP Cloud Function.
Args:
request (flask.Request): The request object.
<http://flask.pocoo.org/docs/1.0/api/#flask.Request>
Returns:
The response text, or any set of values that can be turned into a
Response object using `make_response`
<http://flask.pocoo.org/docs/1.0/api/#flask.Flask.make_response>.
"""
headers = [
'Rank',
'Team',
'Age',
'Wins',
'Losses',
'PW',
'PL',
'MOV',
'SOS',
'SRS',
'ORtg',
'DRtg',
'NRtg',
'Pace',
'FTr',
'_3PAr',
'TS_pct',
'offense_eFG_pct',
'offense_TOV_pct',
'offense_ORB_pct',
'offense_FT_FGA',
'defense_eFG_pct',
'defense_TOV_pct',
'defense_DRB_pct',
'defense_FT_FGA',
'Arena',
'Attendance',
'Attendance_Game'
]
r = requests.get('https://www.basketball-reference.com/leagues/NBA_2020.html')
matches = re.findall(r'id=\"misc_stats\".+?(?=table>)table>', r.text, re.DOTALL)
find_table = pd.read_html('<table ' + matches[0])
df = find_table[0]
df.columns = headers
filename = 'teams_data_adv_stats' #+ datetime.now().strftime("%Y%m%d")
df.to_json(filename, orient='records', lines=True)
print(filename)
# Push data to GCS
blob = bucket.blob(filename)
blob.upload_from_filename(
filename=filename,
content_type='application/json'
)
# Create BQ table from data in bucket
client = bigquery.Client()
dataset_id = 'nba_model'
dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.create_disposition = 'CREATE_IF_NEEDED'
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
uri = "gs://nba_teams_data/{}".format(filename)
load_job = client.load_table_from_uri(
uri,
dataset_ref.table("teams_data"),
location="US", # Location must match that of the destination dataset.
job_config=job_config,
) # API request
print("Starting job {}".format(load_job.job_id))
load_job.result() # Waits for table load to complete.
print("Job finished.")
destination_table = client.get_table(dataset_ref.table("teams_data"))
print("Loaded {} rows.".format(destination_table.num_rows))
return
I have deployed your code into a Cloud Function and it's failing due to two reasons.
First, it's missing the requests dependency, so the line import requests has to be added on top of the file, with the other imports.
Second, it seems like your code is trying to write a file on a read-only file system, which is immediately rejected the os and the function gets terminated. Said write operation is being done by the method DataFrame.to_json, which is trying to write content to the file teams_data_adv_stats to later upload it to a GCS bucket.
There are two ways that you can work around this issue:
Create the file in the temporary folder. As explained on the documentation you cannot write in the file system with the exception of the /tmp directory. I have managed to succeed using this method with the following modified lines:
filename = 'teams_data_adv_stats'
path = os.path.join('/tmp', filename)
df.to_json(path, orient='records', lines=True)
blob = bucket.blob(filename)
blob.upload_from_filename(
filename=path,
content_type='application/json'
)
Avoid creating a file and work with a string. Instead of using upload_from_filename I suggest you work with upload_from_string. I have managed to succeed using this method with the following modified lines:
filename = 'teams_data_adv_stats'
data_json = df.to_json(orient='records', lines=True)
blob = bucket.blob(filename)
blob.upload_from_string(
data_json,
content_type='application/json'
)
As a heads up, you can test your Cloud Functions from testing tab on the function's details. I recommend you use it because it's what I have worked with in order to troubleshoot your issue and could be handy to know about it. Also bear in mind that there's an on-going issue with logs on failing Cloud Functions with the python37 runtime that prevents the error message to appear. I encountered the issue while working on your CF and I worked around it with the workaround provided.
As a side note I did all the reproduction with the following requirements.txt file in order to deploy and run successfully, since you didn't provide it. I assume this is correct:
beautifulsoup4==4.9.1
Flask==1.1.2
google-cloud-bigquery==1.27.2
google-cloud-storage==1.30.0
lxml==4.5.2
pandas==1.1.1
I am writing a python program using boto3 that grabs all of the queries made by a master account and pushes them out to all of the master account's sub accounts.
Grabbing the query IDs from the master instance is done, but I'm having trouble pushing them out to the sub accounts. With my authentication information AWS is connecting to the master account by default, but I can't figure out how to get it to connect to a sub account. Generally AWS services do this by switching roles, but Athena doesn't have a built in method for this. I could manually create different profiles but I'm not sure how to switch them manually in the middle of code execution
Here's Amazon's code example for switching in STS, which does support assuming different roles https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-api.html
Here's what my program looks like so far
#!/usr/bin/env python3
import boto3
dev = boto3.session.Session(profile_name='dev')
#Function for executing athena queries
client = dev.client('athena')
s3_input = 's3://dev/test/'
s3_output = 's3://dev/testOutput'
database = 'ex_athena_db'
table = 'test_data'
response = client.list_named_queries(
MaxResults=50,
WorkGroup='primary'
)
print response
So I have the "dev" profile, but I'm not sure how to differentiate this profile to indicate to AWS that I'd like to access one of the child accounts. Is it just the name, or do I need some other paramter? I don't think I can (or need to) generate a seperate authentication token for this
I solved this by creating a new user profile for the sub account with a new ARN
sample config
[default]
region = us-east-1
[profile ecr-dev]
role_arn = arn:aws:iam::76532435:role/AccountRole
source_profile = default
sample code
#!/usr/bin/env python3
import boto3
dev = boto3.session.Session(profile_name='name', region_name="us-east-1")
#Function for executing athena queries
client = dev.client('athena')
s3_input = 's3:/test/'
s3_output = 's3:/test'
database = 'ex_athena_db'
response = client.list_named_queries(
MaxResults=50,
WorkGroup='primary'
)
print response
Is there a library or package which we can use with python to connect to salesforce and get data?
I use beatbox
Example to query for a lead by email address
import beatbox
sf_username = "Username"
sf_password = "password"
sf_api_token = "api token"
def get_lead_records_by_email(email)
sf_client = beatbox.PythonClient()
password = str("%s%s" % (sf_password, sf_api_token))
sf_client.login(sf_username, password)
lead_qry = "SELECT id, Email, FirstName, LastName, OwnerId FROM Lead WHERE Email = '%s'" % (email)
records = sf_client.query(lead_qry)
return records
To get other data look at the salesforce api docs
view other beatbox examples here
There's also a package called simple_salesforce .
You can install it with:
$ pip install simple_salesforce
You can gain access to your Salesforce account with the following:
from simple_salesforce import Salesforce
sf = Salesforce(username='youremail#abc.com', password='password', security_token='token')
The readme is helpful with regards to details...
This one is the best in my experience:
http://code.google.com/p/salesforce-python-toolkit/
Here is the ready code to get anyone started. For fetching reports from SFDC.
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
from simple_salesforce import Salesforce #imported salesforce
sf = Salesforce(username='youremail#domain.com', password='enter_password', security_token = 'Salesforce_token')
salesforce token is received in email everytime you change your password.
import requests #imported requests
session = requests.Session() #starting sessions
from io import StringIO #to read web data
error_report_defined = session.get("https://na4.salesforce.com/xxxxxxxxxxxx?export=1&enc=UTF-8&xf=csv".format('xxxxxxxxxxxx'), headers=sf.headers, cookies={'sid': sf.session_id})
df_sfdc_error_report_defined = pd.DataFrame.from_csv(StringIO(error_report_defined.text))
df_sfdc_error_report_defined = df_sfdc_error_report_defined.to_csv('defined.csv', encoding = 'utf-8')
error_report = pd.read_csv('defined.csv') #your report is saved in csv format
print (error_report)
Although this is not Python specific. I came across a cool tool for the command line. You could run bash commands as an option..
https://force-cli.heroku.com/
Usage: force <command> [<args>]
Available commands:
login force login [-i=<instance>] [<-u=username> <-p=password>]
logout Log out from force.com
logins List force.com logins used
active Show or set the active force.com account
whoami Show information about the active account
describe Describe the object or list of available objects
sobject Manage standard & custom objects
bigobject Manage big objects
field Manage sobject fields
record Create, modify, or view records
bulk Load csv file use Bulk API
fetch Export specified artifact(s) to a local directory
import Import metadata from a local directory
export Export metadata to a local directory
query Execute a SOQL statement
apex Execute anonymous Apex code
trace Manage trace flags
log Fetch debug logs
eventlogfile List and fetch event log file
oauth Manage ConnectedApp credentials
test Run apex tests
security Displays the OLS and FLS for a give SObject
version Display current version
update Update to the latest version
push Deploy artifact from a local directory
aura force aura push -resourcepath=<filepath>
password See password status or reset password
notify Should notifications be used
limits Display current limits
help Show this help
datapipe Manage DataPipes