Live data from BigQuery into a Python DataFrame

I am exploring ways to bring BigQuery data into Python; here is my code so far:
from google.cloud import bigquery
from pandas.io import gbq
client = bigquery.Client.from_service_account_json("path_to_my.json")
project_id = "my_project_name"
query_job = client.query("""
#standardSQL
SELECT date,
SUM(totals.visits) AS visits
FROM `projectname.dataset.ga_sessions_20*` AS t
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 3 day) and
DATE_sub(current_date(), interval 1 day)
GROUP BY date
""")
results = query_job.result() # Waits for job to complete.
#for row in results:
# print("{}: {}".format(row.date, row.visits))
results_df = gbq.read_gbq(query_job,project_id=project_id)
The commented-out lines
#for row in results:
#    print("{}: {}".format(row.date, row.visits))
print the correct results from my query, but they aren't usable in that form. As a next step I'd like to get them into a DataFrame, but the last line returns the error TypeError: Object of type 'QueryJob' is not JSON serializable.
Can anyone tell me what is wrong with my code that generates this error, or perhaps suggest a better way to bring BigQuery data into a DataFrame?

The read_gbq method expects a query string as input, not a QueryJob object.
Try running it like this instead:
query = """
#standardSQL
SELECT date,
SUM(totals.visits) AS visits
FROM `projectname.dataset.ga_sessions_20*` AS t
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 3 day) and
DATE_sub(current_date(), interval 1 day)
GROUP BY date
"""
results_df = gbq.read_gbq(query, project_id=project_id, private_key='path_to_my.json')
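Alternatively, since you already have a QueryJob from the google-cloud-bigquery client, you can skip pandas-gbq entirely and let the client library build the DataFrame. This is a minimal sketch assuming a reasonably recent version of google-cloud-bigquery (where to_dataframe() is available), reusing the query from the question:
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("path_to_my.json")

query_job = client.query("""
#standardSQL
SELECT date, SUM(totals.visits) AS visits
FROM `projectname.dataset.ga_sessions_20*`
WHERE parse_date('%y%m%d', _table_suffix) BETWEEN
      DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY) AND
      DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY date
""")

# result() waits for the job to complete; to_dataframe() converts the rows.
results_df = query_job.result().to_dataframe()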

Related

Delete datetime from SQL database based on hour

I'm a Python dev handling an SQL database through sqlite3, and I need to perform a certain SQL query to delete data.
I have tables which contain datetime objects as keys.
I want to keep only one row per hour (the last record for that specific time) and delete the rest.
I also need this to only happen on data older than 1 week.
Here's my attempt:
import sqlite3
c = db.cursor()  # db is an existing sqlite3 connection
c.execute('''DELETE FROM TICKER_AAPL WHERE time < 2022-07-11 AND time NOT IN
( SELECT * FROM
(SELECT min(time) FROM TICKER_AAPL GROUP BY hour(time)) AS temp_tab);''')
Here's a screenshot of the table itself:
First change the format of your dates from yyyyMMdd ... to yyyy-MM-dd ..., because that is the text date format SQLite's date and time functions understand.
Then use the function strftime() in your query to get the hour of each value in the column time:
DELETE FROM TICKER_AAPL
WHERE time < date(CURRENT_DATE, '-7 day')
AND time NOT IN (SELECT MAX(time) FROM TICKER_AAPL GROUP BY strftime('%Y-%m-%d %H', time));
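Wrapped in Python, the corrected query might look like the sketch below; the database file name is an assumption, and the table name TICKER_AAPL is taken from the question:
import sqlite3

# Hypothetical database file; substitute your own path.
db = sqlite3.connect("tickers.db")
c = db.cursor()

# Keep only the latest row per hour, and only touch rows older than one week.
c.execute('''
    DELETE FROM TICKER_AAPL
    WHERE time < date(CURRENT_DATE, '-7 day')
      AND time NOT IN (
          SELECT MAX(time)
          FROM TICKER_AAPL
          GROUP BY strftime('%Y-%m-%d %H', time)
      );
''')

db.commit()
db.close()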

BigQuery automatically converts timestamp timezone to UTC

I have a table with two timestamp columns, and a CSV file of timestamps: https://storage.googleapis.com/test_share_file/testTimestamp.csv
I load the file into BigQuery using Python as follows:
from google.cloud import bigquery as bq
gs_path = 'gs://test_share_file/testTimestamp.csv'
bq_client = bq.Client.from_service_account_json(gcp_creds_fp)
ds = bq_client.dataset('test1')
tbl = ds.table('testTimestamp')
job_config = bq.LoadJobConfig()
job_config.write_disposition = bq.job.WriteDisposition.WRITE_APPEND
job_config.skip_leading_rows = 1 # skip header
load_job = bq_client.load_table_from_uri(gs_path, tbl, job_config=job_config)
res = load_job.result()
and yet in the table, both timestamps are in UTC time!
How do I get the second column to be in eastern time?
You can "transform" first column into eastern time on-fly - something like in below example
#standardSQL
WITH t AS (
SELECT TIMESTAMP '2018-05-07 22:40:00+00:00' AS ts
)
SELECT ts, STRING(ts, '-04:00') timestamp_eastern
FROM t
I am dealing with ... stubbornness ...
You can create a view that contains all the logic you need, so the client will query that view instead of the original table:
#standardSQL
CREATE VIEW `project.dataset.your_view` AS
SELECT ts, STRING(ts, '-04:00') timestamp_eastern
FROM `project.dataset.your_table`
I do think it odd that BigQuery can't display a time in a time zone.
A timestamp represents an absolute point in time, independent of any time zone or convention such as Daylight Savings Time.
Time zones are used when parsing timestamps or formatting timestamps for display. The timestamp value itself does not store a specific time zone. A string-formatted timestamp may include a time zone. When a time zone is not explicitly specified, the default time zone, UTC, is used.
See more about Timestamp type
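If the conversion only needs to happen on the Python side after the table is read back, pandas can also localize and convert the values. A minimal sketch, where the column name ts is an assumption standing in for your timestamp column:
import pandas as pd

# Hypothetical frame standing in for a query result; 'ts' is an assumed column name.
df = pd.DataFrame({"ts": pd.to_datetime(["2018-05-07 22:40:00"], utc=True)})

# Convert the UTC timestamps to US eastern time for display.
df["ts_eastern"] = df["ts"].dt.tz_convert("US/Eastern")

print(df)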

How to sort a pandas dataframe by date

I am importing data into a pandas dataframe from Google BigQuery and I'd like to sort the results by date. My code is as follows:
import sys, getopt
import pandas as pd
from datetime import datetime
# set the path to your BigQuery service account private key
pkey ='#REMOVED#'
destination_table = 'test.test_table_2'
project_id = '#REMOVED#'
# write your query
query = """
SELECT date, SUM(totals.visits) AS Visits
FROM `#REMOVED#.#REMOVED#.ga_sessions_20*`
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 3 day) and
DATE_sub(current_date(), interval 1 day)
GROUP BY Date
"""
data = pd.read_gbq(query, project_id, dialect='standard', private_key=pkey, parse_dates=True, index_col='date')
date = data.sort_index()
data.info()
data.describe()
print(data.head())
My output is shown below; as you can see, the dates are not sorted.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
date 3 non-null object
Visits 3 non-null int32
dtypes: int32(1), object(1)
memory usage: 116.0+ bytes
date Visits
0 20180312 207440
1 20180310 178155
2 20180311 207452
I have read several questions and so far tried the below, which resulted in no change to my output:
Removing index_col='date' and adding date = data.sort_values(by='date')
Setting the date column as the index, then sorting the index (shown above).
Setting headers (headers = ['Date', 'Visits']) and dtypes (dtypes = [datetime, int]) on my read_gbq line (parse_dates=True, names=headers)
What am I missing?
I managed to solve this by transforming my date field into a datetime object. I assumed this would be done automatically by parse_dates=True, but it seems that only applies to columns that are already datetime-like.
I added the following after my query to create a new datetime column from my date string; then I was able to use data.sort_index() and it worked as expected:
time_format = '%Y%m%d'  # the GA export 'date' field is a 'YYYYMMDD' string
data = pd.read_gbq(query, project_id, dialect='standard', private_key=pkey)
data['n_date'] = pd.to_datetime(data['date'], format=time_format)
data.index = data['n_date']
del data['date']
del data['n_date']
data.index.names = ['Date']
data = data.sort_index()
As most of the work is done on the Google BigQuery side, I'd do sorting there as well:
query = """
SELECT date, SUM(totals.visits) AS Visits
FROM `#REMOVED#.#REMOVED#.ga_sessions_20*`
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 3 day) and
DATE_sub(current_date(), interval 1 day)
GROUP BY Date
ORDER BY Date
"""
This should work:
data.sort_values('date', inplace=True)
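If you want a true DatetimeIndex rather than sorting date strings, a more compact variant of the accepted approach is sketched below, reusing the query, project_id and pkey variables from the question:
import pandas as pd

data = pd.read_gbq(query, project_id, dialect='standard', private_key=pkey)

# GA export dates come back as 'YYYYMMDD' strings, so parse them explicitly.
data['date'] = pd.to_datetime(data['date'], format='%Y%m%d')

data = data.set_index('date').sort_index()
print(data.head())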

How to import Datetime function into python

I am running a query that I plan on using multiple times. However, the job name 'my-job1a' has to be different every time the query runs, so I was planning to base it on the current date and time. Does anybody know how to implement this with datetime?
from google.cloud import bigquery
client = bigquery.Client('dataworks-356fa')
query = query
dataset = client.dataset('FirebaseArchive')
table = dataset.table(name='test1')
tbl = dataset.table(name='test12')
job = client.run_async_query('my-job1a', query)
job.destination = tbl
job.write_disposition= 'WRITE_TRUNCATE'
job.begin()
I believe "my-job1a" is a constant string, and you want a different string for each new query.
import datetime
# Replace the constant job name with one based on the current date and time.
# BigQuery job IDs may only contain letters, digits, dashes and underscores,
# so avoid spaces and colons in the strftime format.
job = client.run_async_query("my-job1a-" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S"), query)
This gives a new job name every second. If you need finer granularity, add %f (microseconds) to the strftime format; if you prefer a shorter string, trim the format as you see fit.
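In newer versions of google-cloud-bigquery the run_async_query method was replaced by Client.query, which also accepts an explicit job_id. A minimal sketch under that assumption, reusing the query variable and the dataset/table names from the question:
import datetime
from google.cloud import bigquery

client = bigquery.Client('dataworks-356fa')

# Build a unique, valid job ID (letters, digits, dashes and underscores only).
job_id = 'my-job1a-' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

job_config = bigquery.QueryJobConfig()
job_config.destination = client.dataset('FirebaseArchive').table('test12')
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

job = client.query(query, job_config=job_config, job_id=job_id)
job.result()  # wait for the job to finish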

Loop through list of dates in Python query string

I am trying to use Pandas and SQLAlchemy to run a query on a MySQL instance. In the actual query, there is a 'WHERE' statement referencing a specific date. I'd like to run this query separately for each date in a Python list, and append each date's dataframe iteratively to another Master dataframe. My code right now looks like this (excluding SQLAlchemy engine creation):
dates = ['2016-01-12','2016-01-13','2016-01-14']
for day in dates:
    query = """SELECT * from schema.table WHERE date = '%s' """
    df = pd.read_sql_query(query, engine)
    frame.append(df)
My error is
/opt/rh/python27/root/usr/lib64/python2.7/site-packages/MySQLdb/cursors.pyc in execute(self, query, args)
157 query = query.encode(charset)
158 if args is not None:
--> 159 query = query % db.literal(args)
160 try:
161 r = self._query(query)
TypeError: not enough arguments for format string
What is the best way to insert each date from the list into my query string?
Use params to parameterize your query:
dates = ['2016-01-12', '2016-01-13', '2016-01-14']
query = """SELECT * from schema.table WHERE date = %s"""
for day in dates:
    df = pd.read_sql_query(query, engine, params=(day,))
    frame.append(df)
Note that I've removed the quotes around the %s placeholder: data type conversion is handled by the database driver itself, and it adds quotes implicitly if needed.
And, you can define the query before the loop once - no need to do it inside.
I also think that you may need to have a list of date or datetime objects instead of strings:
from datetime import date
dates = [
date(year=2016, month=1, day=12),
date(year=2016, month=1, day=13),
date(year=2016, month=1, day=14),
]
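To end up with a single master DataFrame, it is usually cleaner to collect the per-day results in a list and concatenate them once at the end. A minimal sketch, reusing the table and column names from the question and assuming the SQLAlchemy engine has already been created:
import pandas as pd
from datetime import date

dates = [date(2016, 1, 12), date(2016, 1, 13), date(2016, 1, 14)]
query = """SELECT * FROM schema.table WHERE date = %s"""

frames = []
for day in dates:
    # Each per-day result is appended to a plain Python list.
    frames.append(pd.read_sql_query(query, engine, params=(day,)))

# A single concat at the end is cheaper than growing a DataFrame row by row.
master = pd.concat(frames, ignore_index=True)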
