Wondering if someone can help with this.
I have a Cloud Function with Python code that queries a BigQuery table and stores the query result in a GCS bucket as a CSV file.
But in the CSV file I get a strange format like:
Row(('asser',), {'user_login': 0})
Row(('godx',), {'user_login': 0})
Row(('johnw',), {'user_login': 0})
Row(('miki',), {'user_login': 0})
But the expected format of the saved data is:
asser,
godx,
johnw,
miki
When I debug in the GCP Logging console I am able to get the expected format. It seems I am doing something wrong when processing the query result.
I use this code:
def main(event, context):
    from google.cloud import bigquery
    from google.cloud import storage
    import pandas as pd
    import datetime

    project_name = my_project
    destination_bucket = my_bucket
    bq_dataset_name = my_dataset
    bq_table_name = my_table
    bq_table_full_path = f"""{project_name}.{bq_dataset_name}.{bq_table_name}"""

    bq_client = bigquery.Client()
    query_string = """
    SELECT user_login
    FROM `my_table_full_path`
    WHERE DATE(insert_time) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    GROUP BY user_login
    """
    bq_response = bq_client.query(query_string)
    df = pd.DataFrame(bq_response)
    csv_data = df.to_csv(header=False, index=False)

    # create and upload file to Google Storage
    timestr = datetime.datetime.now(datetime.timezone.utc).strftime('%Y-%m-%d')
    file_name = 'daily_active_users_' + timestr + '.csv'
    upload_blob(data=csv_data, destination_blob_name=file_name)

def upload_blob(data, destination_blob_name):
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(destination_bucket)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_string(data, 'text/csv')
Thanks in advance!
Try using the to_dataframe method of QueryJob to get the dataframe.
Instead of:
df = pd.DataFrame(bq_response)
Try this:
df = bq_response.to_dataframe()
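A minimal sketch of how the relevant part of main could look with that change (reusing bq_client, query_string and the to_csv call from the question above):
# Sketch only: bq_client and query_string are the ones defined in the question.
query_job = bq_client.query(query_string)        # start the query
df = query_job.to_dataframe()                    # waits for the job and returns a pandas DataFrame
csv_data = df.to_csv(header=False, index=False)  # plain values, no Row(...) wrappers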
I am working in a Google Cloud Function with the intention of putting the results in a dataframe and then loading all of that into BigQuery. My function deployed without error, but when looking into the associated BQ table I am seeing no data. Below is a view of my code:
# general setup, common imports
import json, requests, time, urllib.parse
import pandas as pd
from pandas import DataFrame
import datetime
import io
import os
from google.cloud import bigquery
from google.cloud.bigquery.client import Client

def crux_data():
    # Read the URLs for auditing
    url_list = open('pagespeedlist', 'r')
    url_list.read()

    results = []
    for x in url_list:
        url = x[0]
        pagespeed_results = urllib.request.urlopen('https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url={}&strategy=mobile&key=API_KEY'\
            .format(url)).read().decode('UTF-8')
        pagespeed_results_json = json.loads(pagespeed_results)

        add_date = datetime.date.today()
        largest_contentful_paint = pagespeed_results_json['lighthouseResult']['audits']['largest-contentful-paint']['displayValue'].replace(u'\xa0', u'')  # Largest Contenful Paint
        first_input_delay = str(round(pagespeed_results_json['loadingExperience']['metrics']['FIRST_INPUT_DELAY_MS']['distributions'][2]['proportion'] * 1000, 1)) + 'ms'  # First Input Delay
        cumulative_layout_shift = pagespeed_results_json['lighthouseResult']['audits']['cumulative-layout-shift']['displayValue']  # CLS
        crux_lcp = pagespeed_results_json['loadingExperience']['metrics']['LARGEST_CONTENTFUL_PAINT_MS']['category']  # Largest Contenful Paint Score
        crux_fid = pagespeed_results_json['loadingExperience']['metrics']['FIRST_INPUT_DELAY_MS']['category']  # First Input Delay Score
        crux_cls = pagespeed_results_json['loadingExperience']['metrics']['CUMULATIVE_LAYOUT_SHIFT_SCORE']['category']  # CLS Score

        result_url = [url, date, largest_contentful_paint, first_input_delay, cumulative_layout_shift, lcp_score, fid_score, cls_score]
        results.append(result_url)

    # Convert to dataframe
    results_csv = DataFrame(results, columns=['URL', 'DATE', 'LCP', 'FID', 'CLS', 'LCP_SCORE', 'FID_SCORE', 'CLS_SCORE'])

    # Construct a BigQuery client object.
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'credentials.json'
    client = Client()

    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "db.datatable.dataLoc"
    job_config = bigquery.LoadJobConfig()

    job = client.load_table_from_dataframe(
        results_csv, table_id, job_config=job_config
    )  # Make an API request.
    job.result()  # Wait for the job to complete.

    table = client.get_table(table_id)  # Make an API request.
    print(
        "Loaded {} rows and {} columns to {}".format(
            table.num_rows, len(table.schema), table_id
        )
    )
I do see the proper schema in the bq table but no actual data. Is there something I am missing with loading a df to bigquery?
Any help is much appreciated!
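For reference, a minimal sketch (reusing the client and results_csv dataframe from the code above, with table_id as a placeholder) of checking whether the load job actually wrote any rows:
# Sketch only: client and results_csv come from the question's code; table_id is a placeholder.
job = client.load_table_from_dataframe(results_csv, table_id)
job.result()               # block until the load finishes
print(job.output_rows)     # number of rows written by this load job
print(job.errors)          # None if the job succeeded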
I want to automatically create BQ table(s) from a desktop folder containing CSV files (i.e. automatically create the schema and load to a new table).
If the same file is loaded next time, just update the existing table; if a new file is loaded, then create a new table. Is it possible to automate this using Python?
Current Code:
import pandas as pd
from google.cloud import bigquery
def bqDataLoad(event, context):
    bucketName = event['test_vs']
    blobName = event['gf-dev-models']
    fileName = "gs://" + bucketName + "/" + blobName

    bigqueryClient = bigquery.Client()
    tableRef = bigqueryClient.dataset("gf-dev-models-204097").table("test_vs")
    dataFrame = pd.read_csv(fileName)
    bigqueryJob = bigqueryClient.load_table_from_dataframe(dataFrame, tableRef)
    bigqueryJob.result()
#Project id = gf-dev-models
#dataset = gf-dev-models-204097
#table name = want a new table created
Here is my answer with reference to your question in the comment section:
Credentials in code:
You can create a service account with the desired BigQuery roles and download its JSON key file (example: data-lab.json).
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "data-lab.json"
Creating the schema automatically & loading data to BigQuery:
from google.cloud import bigquery
bigqueryClient = bigquery.Client()
jobConfig = bigquery.LoadJobConfig()
jobConfig.skip_leading_rows = 1
jobConfig.source_format = bigquery.SourceFormat.CSV
jobConfig.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
jobConfig.autodetect=True
datasetName = "dataset-name"
targetTable = "table-name"
uri = "gs://bucket-name/file-name.csv"
tableRef = bigqueryClient.dataset(datasetName).table(targetTable)
bigqueryJob = bigqueryClient.load_table_from_uri(uri, tableRef, job_config=jobConfig)
bigqueryJob.result()
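Since the question mentions a local desktop folder, here is a sketch along the same lines (the folder path and the file-name-to-table-name rule are assumptions) that loops over local CSV files and loads each one with load_table_from_file, so a repeated file appends to its existing table and a new file creates a new table via autodetect:
import os
from google.cloud import bigquery

bigqueryClient = bigquery.Client()
datasetName = "dataset-name"
csvFolder = "/path/to/csv/folder"   # assumption: local folder containing the CSV files

jobConfig = bigquery.LoadJobConfig()
jobConfig.skip_leading_rows = 1
jobConfig.source_format = bigquery.SourceFormat.CSV
jobConfig.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
jobConfig.autodetect = True

for fileName in os.listdir(csvFolder):
    if not fileName.endswith(".csv"):
        continue
    tableName = os.path.splitext(fileName)[0]   # assumption: one table per file name
    tableRef = bigqueryClient.dataset(datasetName).table(tableName)
    with open(os.path.join(csvFolder, fileName), "rb") as sourceFile:
        bigqueryJob = bigqueryClient.load_table_from_file(sourceFile, tableRef, job_config=jobConfig)
    bigqueryJob.result()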
Is it possible to import data already in Cloud Storage to a temporary table in bigquery using Python? Can I create a BigQuery temporary table in Python and insert data into it?
You can only create temporary tables as part of a BigQuery script or stored procedure.
What you can do is create tables with a random suffix in the name and a short expiry, one hour in my example. The example function creates the temp table definition and only needs a dataset as a parameter.
from google.cloud import bigquery
import datetime, pytz, random

PROJECT = "myproject"

def get_temp_table(dataset: str, table_name: str = None, project=None) -> bigquery.Table:
    prefix = "temp"
    suffix = random.randint(10000, 99999)
    if not table_name:
        table_name = "noname"
    temp_table_name = f"{dataset}.{prefix}_{table_name}_{suffix}"
    if project:
        temp_table_name = f"{project}.{temp_table_name}"
    tmp_table_def = bigquery.Table(temp_table_name)
    tmp_table_def.expires = datetime.datetime.now(pytz.utc) + datetime.timedelta(
        hours=1
    )
    return tmp_table_def

client = bigquery.Client(project=PROJECT)

tmp_table_def = get_temp_table("mydataset", "new_users", project=PROJECT)
tmp_table_def.schema = [
    bigquery.SchemaField("id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("full_name", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("age", "INTEGER", mode="REQUIRED"),
]

tmp_table = client.create_table(tmp_table_def)  # type: bigquery.Table

data = [
    {"id": "c-1234", "full_name": "John Smith", "age": 39},
    {"id": "c-1234", "full_name": "Patricia Smith", "age": 41},
]
errors = client.insert_rows(tmp_table, data)

print(f"Loaded {len(data)} rows into {tmp_table.dataset_id}:{tmp_table.table_id} with {len(errors)} errors")
(This draft doesn't use a temporary table, but I think it can help.)
I used this with Google Cloud Functions and Python 3.7 and it works fine.
from google.cloud import storage, bigquery
import json
import os
import csv
import io
import pandas as pd

def upload_dataframe_gbq(df, table_name):
    bq_client = bigquery.Client()
    dataset_id = 'your_dataset_id'
    dataset_ref = bq_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_name)
    job = bq_client.load_table_from_dataframe(df, table_ref)
    job.result()  # Waits for table load to complete.
    assert job.state == "DONE"

    table = bq_client.get_table(table_ref)
    print(table.num_rows)

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "your_credentials.json"

client = storage.Client()
bucket = client.get_bucket('your_bucket_name')
blob = bucket.blob('sample.csv')
content = blob.download_as_string()

csv_content = io.BytesIO(content)  # use io.BytesIO, since only the io module is imported
df = pd.read_csv(csv_content, sep=",", header=0)

table_name = "your_big_query_table_name"
upload_dataframe_gbq(df, table_name)
I am having trouble writing a python script that loads or exports a file from google cloud storage to google bigquery.
#standardSQL
import json
import argparse
import time
import uuid
from google.cloud import bigquery
from google.cloud import storage
dataset = 'dataworks-356fa'
source = 'gs://dataworks-356fa-backups/pullnupload.json'
# def load_data_from_gcs(dataset, source):
# # load_data_from_gcs(dataworks-356fa, 'test10', gs://dataworks-356fa-backups/pullnupload.json):
# bigquery_client = bigquery.Client('dataworks-356fa')
# dataset = bigquery_client.dataset(FirebaseArchive)
# table = dataset.table(test10)
# job_name = str(uuid.uuid4())
#
# job = bigquery_client.load_table_from_storage(
# job_name, test10, 'gs://dataworks-356fa-backups/pullnupload.json')
#
# job.source_format = 'NEWLINE_DELIMITED_JSON'
# job.begin()
def load_data_from_gcs(dataset, test10, source):
    bigquery_client = bigquery.Client(dataset)
    dataset = bigquery_client.dataset('FirebaseArchive')
    table = dataset.table(test10)
    job_name = str(uuid.uuid4())

    job = bigquery_client.load_table_from_storage(
        job_name, table, "gs://dataworks-356fa-backups/pullnupload.json")

    job.source_format = 'NEWLINE_DELIMITED_JSON'
    job.begin()
    job.errors
So far this is my code. The file runs, but it does not load anything into BigQuery or come back with an error message; it just returns me to the normal terminal view.
From your previous question, you have the wait_for_job function. You should use it before checking for errors, like so:
def load_data_from_gcs(dataset, test10, source):
    bigquery_client = bigquery.Client(dataset)
    dataset = bigquery_client.dataset('FirebaseArchive')
    table = dataset.table(test10)
    job_name = str(uuid.uuid4())

    job = bigquery_client.load_table_from_storage(
        job_name, table, "gs://dataworks-356fa-backups/pullnupload.json")

    job.source_format = 'NEWLINE_DELIMITED_JSON'
    job.begin()

    wait_for_job(job)

    print("state of job is: " + job.state)
    print("errors: {}".format(job.errors))
You can also use IPython to run each step by hand and observe the result of each line.
Notice that job.state must reach 'DONE' status before looking for errors.
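For reference, a wait_for_job helper along these lines is typically used with this older google-cloud-bigquery client (a sketch, not necessarily identical to the one from your previous question):
import time

def wait_for_job(job, polling_interval=1):
    # Poll the job until BigQuery reports it as done, then surface any error.
    while True:
        job.reload()  # refresh the job state from the API
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        time.sleep(polling_interval)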
I want to use the Pandas library to read BigQuery data. How do I allow large results?
For non-Pandas BigQuery interactions, this can be achieved like this.
Current code with Pandas:
sProjectID = "project-id"
sQuery = '''
SELECT
column1, column2
FROM [dataset_name.tablename]
'''
from pandas.io import gbq
df = gbq.read_gbq(sQuery, sProjectID)
EDIT: I've posted the proper way to do this in my other answer, by dropping the data off in Google Storage first. This way you'll never have data that is too large.
Ok, I didn't find a direct way to do it with pandas, so I had to write a little extra with the normal API. Here is my fix (also most of the work to do it natively without Pandas):
sProjectID = "project-id"
sQuery = '''
SELECT
column1, column2
FROM [dataset_name.tablename]
'''
df = create_dataframe(sQuery, sProjectID, bLargeResults=True)
#*******Functions to make above work*********

def create_dataframe(sQuery, sProjectID, bLargeResults=False):
    "takes a BigQuery sql query and returns a Pandas dataframe"
    if bLargeResults:
        oService = create_service()
        dDestinationTable = run_query(sQuery, oService, sProjectID)
        df = pandas_get_table(dDestinationTable)
    else:
        df = pandas_query(sQuery, sProjectID)
    return df

def pandas_query(sQuery, sProjectID):
    "go into bigquery and get the table with sql query and return dataframe"
    from pandas.io import gbq
    df = gbq.read_gbq(sQuery, sProjectID)
    return df

def pandas_get_table(dTable):
    "fetch a table and return dataframe"
    from pandas.io import gbq
    sProjectID = dTable['projectId']
    sDatasetID = dTable['datasetId']
    sTableID = dTable['tableId']
    sQuery = "SELECT * FROM [{}.{}]".format(sDatasetID, sTableID)
    df = gbq.read_gbq(sQuery, sProjectID)
    return df

def create_service():
    "create google service"
    from oauth2client.client import GoogleCredentials
    from apiclient.discovery import build
    credentials = GoogleCredentials.get_application_default()
    oService = build('bigquery', 'v2', credentials=credentials)
    return oService

def run_query(sQuery, oService, sProjectID):
    "runs the bigquery query"
    dQuery = {
        'configuration': {
            'query': {
                'writeDisposition': 'OVERWRITE',
                'useQueryCache': False,
                'allowLargeResults': True,
                'query': sQuery,
                'destinationTable': {
                    'projectId': sProjectID,
                    'datasetId': 'sandbox',
                    'tableId': 'api_large_result_dropoff',
                },
            }
        }
    }
    job = oService.jobs().insert(projectId=sProjectID, body=dQuery).execute()
    return job['configuration']['query']['destinationTable']
Decided to post the proper way to do this via the python3 google.cloud API. Looking at my previous answer I see that it would fail like yosemite_k said.
Large results really need to follow BigQuery -> Storage -> local -> dataframe pattern.
BigQuery resources:
https://cloud.google.com/bigquery/docs/reference/libraries
https://googlecloudplatform.github.io/google-cloud-python/stable/bigquery-client.html
http://google-cloud-python.readthedocs.io/en/latest/bigquery-usage.html
Storage resources:
https://googlecloudplatform.github.io/google-cloud-python/stable/storage-client.html
https://googlecloudplatform.github.io/google-cloud-python/stable/storage-blobs.html
Pandas Resources:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
Installation:
pip install pandas
pip install google-cloud-storage
pip install google-cloud-bigquery
Full implementation (bigquery_to_dataframe.py):
"""
We require python 3 for the google cloud python API
mkvirtualenv --python `which python3` env3
And our dependencies:
pip install pandas
pip install google-cloud-bigquery
pip install google-cloud-storage
"""
import os
import time
import uuid
from google.cloud import bigquery
from google.cloud import storage
import pandas as pd
def bq_to_df(project_id, dataset_id, table_id, storage_uri, local_data_path):
    """Pipeline to get data from BigQuery into a local pandas dataframe.

    :param project_id: Google project ID we are working in.
    :type project_id: str
    :param dataset_id: BigQuery dataset id.
    :type dataset_id: str
    :param table_id: BigQuery table id.
    :type table_id: str
    :param storage_uri: Google Storage uri where data gets dropped off.
    :type storage_uri: str
    :param local_data_path: Path where data should end up.
    :type local_data_path: str
    :return: Pandas dataframe from BigQuery table.
    :rtype: pd.DataFrame
    """
    bq_to_storage(project_id, dataset_id, table_id, storage_uri)
    storage_to_local(project_id, storage_uri, local_data_path)
    data_dir = os.path.join(local_data_path, "test_data")
    df = local_to_df(data_dir)
    return df
def bq_to_storage(project_id, dataset_id, table_id, target_uri):
    """Export a BigQuery table to Google Storage.

    :param project_id: Google project ID we are working in.
    :type project_id: str
    :param dataset_id: BigQuery dataset name where source data resides.
    :type dataset_id: str
    :param table_id: BigQuery table name where source data resides.
    :type table_id: str
    :param target_uri: Google Storage location where table gets saved.
    :type target_uri: str
    :return: The random ID generated to identify the job.
    :rtype: str
    """
    client = bigquery.Client(project=project_id)
    dataset = client.dataset(dataset_name=dataset_id)
    table = dataset.table(name=table_id)

    job = client.extract_table_to_storage(
        str(uuid.uuid4()),  # id we assign to be the job name
        table,
        target_uri
    )
    job.destination_format = 'CSV'
    job.write_disposition = 'WRITE_TRUNCATE'
    job.begin()  # async execution

    if job.errors:
        print(job.errors)

    while job.state != 'DONE':
        time.sleep(5)
        print("exporting '{}.{}' to '{}': {}".format(
            dataset_id, table_id, target_uri, job.state
        ))
        job.reload()

    print(job.state)
    return job.name
def storage_to_local(project_id, source_uri, target_dir):
    """Save a file or folder from google storage to a local directory.

    :param project_id: Google project ID we are working in.
    :type project_id: str
    :param source_uri: Google Storage location where file comes from.
    :type source_uri: str
    :param target_dir: Local file location where files are to be stored.
    :type target_dir: str
    :return: None
    :rtype: None
    """
    client = storage.Client(project=project_id)
    bucket_name = source_uri.split("gs://")[1].split("/")[0]
    file_path = "/".join(source_uri.split("gs://")[1].split("/")[1::])
    bucket = client.lookup_bucket(bucket_name)
    folder_name = "/".join(file_path.split("/")[0:-1]) + "/"
    blobs = [o for o in bucket.list_blobs() if o.name.startswith(folder_name)]

    # get files if we wanted just files
    blob_name = file_path.split("/")[-1]
    if blob_name != "*":
        print("Getting just the file '{}'".format(file_path))
        our_blobs = [o for o in blobs if o.name.endswith(blob_name)]
    else:
        print("Getting all files in '{}'".format(folder_name))
        our_blobs = blobs

    print([o.name for o in our_blobs])

    for blob in our_blobs:
        filename = os.path.join(target_dir, blob.name)
        # create a complex folder structure if necessary
        if not os.path.isdir(os.path.dirname(filename)):
            os.makedirs(os.path.dirname(filename))
        with open(filename, 'wb') as f:
            blob.download_to_file(f)
def local_to_df(data_path):
    """Import local data files into a single pandas dataframe.

    :param data_path: File or folder path where csv data are located.
    :type data_path: str
    :return: Pandas dataframe containing data from data_path.
    :rtype: pd.DataFrame
    """
    # if data_dir is a file, then just load it into pandas
    if os.path.isfile(data_path):
        print("Loading '{}' into a dataframe".format(data_path))
        df = pd.read_csv(data_path, header=1)
    elif os.path.isdir(data_path):
        files = [os.path.join(data_path, fi) for fi in os.listdir(data_path)]
        print("Loading {} into a single dataframe".format(files))
        df = pd.concat((pd.read_csv(s) for s in files))
    else:
        raise ValueError(
            "Please enter a valid path. {} does not exist.".format(data_path)
        )
    return df
if __name__ == '__main__':
    PROJECT_ID = "my-project"
    DATASET_ID = "bq_dataset"
    TABLE_ID = "bq_table"
    STORAGE_URI = "gs://my-bucket/path/for/dropoff/*"
    LOCAL_DATA_PATH = "/path/to/save/"

    bq_to_df(PROJECT_ID, DATASET_ID, TABLE_ID, STORAGE_URI, LOCAL_DATA_PATH)
You can do it by changing the default dialect from legacy to standard in the pd.read_gbq function.
pd.read_gbq(query, 'my-super-project', dialect='standard')
Indeed, you can read in the BigQuery documentation for the parameter AllowLargeResults:
AllowLargeResults: For standard SQL queries, this flag is ignored and large results are always allowed.