Get number of rows in a BigQuery table (streaming buffer) - python

I am doing inserts via Streaming. In the UI, I can see the following row counts:
Is there a way to get that via the API? Currently, when I do:
from google.cloud import bigquery
client = bigquery.Client()
dataset = client.dataset("bqtesting")
table = client.get_table(dataset.table('table_streaming'))
table.num_rows
0
Obviously 0 is not the number that I'm looking to get. From the API documentation it says:
numRows unsigned long [Output-only] The number of rows of data in this table, excluding any data in the streaming buffer.
So then, my question is: how do we get the exact number of rows in a table? Currently I'm doing:
count=[item[0] for item in client.query('SELECT COUNT(*) FROM `bqtesting.table_streaming`').result()][0]
But this takes about 5s just to get the count (and I need to execute this query quite frequently to see if all streaming inserts have 'finished').

SELECT COUNT(1) and SELECT COUNT(*) etc. have 0 scanned and billed bytes (you can see this in the job metadata after you run it, or in a dry run), so you should be able to run them as often as you like.
If I'm reading the documentation correctly, the numbers there are not guaranteed to include rows in the buffer which have not yet been flushed to BigQuery storage.
You can also use the API mentioned here: https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability, i.e. check the streamingBuffer.oldestEntryTime field from the tables.get result.
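In the Python client this surfaces as the streaming_buffer property on the table returned by get_table(); a minimal sketch (assuming the same dataset/table names as above and a client version that exposes this property):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("bqtesting.table_streaming")

# num_rows excludes the streaming buffer; streaming_buffer (if populated)
# carries estimates for rows not yet flushed to managed storage.
print("flushed rows:", table.num_rows)
if table.streaming_buffer is not None:
    print("estimated buffered rows:", table.streaming_buffer.estimated_rows)
    print("oldest entry time:", table.streaming_buffer.oldest_entry_time)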

You can use the __TABLES__ metadata table to get the information that you want. Querying __TABLES__ incurs no charges.
The query that you need is:
SELECT table_id, row_count, size_bytes
FROM `your-project-name.bqtesting.__TABLES__`
WHERE STARTS_WITH(table_id, "table_streaming")
ORDER BY table_id DESC
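Run from Python, this could look roughly like the following (a sketch; your-project-name is a placeholder, as in the query above):

from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT table_id, row_count, size_bytes
    FROM `your-project-name.bqtesting.__TABLES__`
    WHERE STARTS_WITH(table_id, "table_streaming")
    ORDER BY table_id DESC
"""
for row in client.query(query).result():
    print(row.table_id, row.row_count, row.size_bytes)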

Related

How to perform the UPSERT operation using the python BigQuery client when writing JSON record

I am writing JSON records into a BigQuery table using the function bq.insert_rows_json(f'{project}.{dataset}.{table_name}', rows_to_insert). This operation is done in INSERT mode. I was wondering if I could use the same function but in UPSERT mode. Is it possible? I checked the documentation here but did not find an argument for that.
I can't seem to find a built-in UPSERT function for Python. However, you may try and consider the approach below, which is derived from the comment of @Mr.Nobody.
from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query(
    """
    MERGE `my-dataset.json_table` T
    USING `my-dataset.json_table_source` S
    ON T.int64_field_0 = S.int64_field_0
    WHEN MATCHED THEN
      UPDATE SET string_field_1 = S.string_field_1
    WHEN NOT MATCHED THEN
      INSERT (int64_field_0, string_field_1) VALUES (int64_field_0, string_field_1)
    """
)
results = query_job.result()  # Waits for the job to complete.
In this approach, you will need to ingest all of your supposedly "updated" JSON data into a source table before inserting or updating it in your main BigQuery table. The query will then match each row against the main table: if the primary ID (uniqueness checker) is already there, the query will do an UPDATE; if not yet, it will do an INSERT.
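As a rough sketch of that staging step, reusing the placeholder table names from the MERGE above (the row values are illustrative only):

from google.cloud import bigquery

client = bigquery.Client()

# Load the "updated" JSON records into the source (staging) table first;
# the MERGE statement above then reconciles them into the main table.
rows_to_upsert = [
    {"int64_field_0": 4, "string_field_1": "6.5.1"},
    {"int64_field_0": 5, "string_field_1": "1.0.0"},
]
errors = client.insert_rows_json("my-dataset.json_table_source", rows_to_upsert)
if errors:
    raise RuntimeError(errors)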
Screenshots (omitted here) showed the Main and Source tables before running the Python code, and the Main table after the code finished executing.
Conclusion: the row with int64_field_0 = 4 was updated (from version 1.0.0 to 6.5.1) because it already existed in the Main table. The row with int64_field_0 = 5 was inserted because it did not yet exist in the Main table.

Google Dataflow: insert + update in BigQuery in a streaming pipeline

The main objective
A Python streaming pipeline in which I read the input from Pub/Sub.
After the input is analyzed, two options are available:
If x=1 -> insert
If x=2 -> update
Testing
This cannot be done using the Apache Beam functions, so you need to develop it using the 0.25 version of the BigQuery API (currently this is the version supported in Google Dataflow).
The problem
The inserted records are still in the BigQuery streaming buffer, so the update statement fails:
UPDATE or DELETE statement over table table would affect rows in the streaming buffer, which is not supported
The code
Insert
def insertCanonicalBQ(input):
    from google.cloud import bigquery
    client = bigquery.Client(project='project')
    dataset = client.dataset('dataset')
    table = dataset.table('table')
    table.reload()
    table.insert_data(rows=[[values]])
Update
def UpdateBQ(input):
    from google.cloud import bigquery
    import uuid
    import time
    client = bigquery.Client()
    STD = "#standardSQL"
    QUERY = STD + "\n" + """UPDATE table SET field1 = 'XXX' WHERE field2 = 'YYY'"""
    client.use_legacy_sql = False
    query_job = client.run_async_query(query=QUERY, job_name='temp-query-job_{}'.format(uuid.uuid4()))  # API request
    query_job.begin()
    while True:
        query_job.reload()  # Refreshes the state via a GET request.
        if query_job.state == 'DONE':
            if query_job.error_result:
                raise RuntimeError(query_job.errors)
            print("done")
            return input
        time.sleep(1)
Even if the row wasn't in the streaming buffer, this still wouldn't be the way to approach this problem in BigQuery. BigQuery storage is better suited for bulk mutations rather than mutating individual entities like this via UPDATE. Your pattern is aligned with something I'd expect from a transactional rather than an analytical use case.
Consider an append-based pattern for this. Each time you process an entity message, write it to BigQuery via a streaming insert. Then, when needed, you can get the latest version of all entities via a query.
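As a rough sketch of that append step with the Python client (idfield and message_time follow the arbitrary schema described just below, and the table name matches the query's placeholder):

from google.cloud import bigquery
import datetime

client = bigquery.Client()

def handle_entity_message(entity_id, other_fields):
    # Append every incoming message as a new row; no UPDATE is ever issued.
    row = dict(other_fields)
    row["idfield"] = entity_id
    row["message_time"] = datetime.datetime.utcnow().isoformat()
    errors = client.insert_rows_json("myproject.mydata.mytable", [row])
    if errors:
        raise RuntimeError(errors)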
As an example, let's assume an arbitrary schema: idfield is your unique entity key/identifier, and message_time represents the time the message was emitted. Your entities may have many other fields. To get the latest version of the entities, we could run the following query (and possibly write this to another table):
#standardSQL
SELECT
  idfield,
  ARRAY_AGG(
    t ORDER BY message_time DESC LIMIT 1
  )[OFFSET(0)].* EXCEPT (idfield)
FROM `myproject.mydata.mytable` AS t
GROUP BY idfield
An additional advantage of this approach is that it also allows you to perform analysis at arbitrary points of time. To perform an analysis of the entities as of their state an hour ago would simply involve adding a WHERE clause: WHERE message_time <= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
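If you do want to materialize that latest-version snapshot into another table, a hedged sketch with the current client uses a query job with a destination table (mytable_latest is a hypothetical name):

from google.cloud import bigquery

client = bigquery.Client()

# Write the "latest version per entity" result to a snapshot table.
job_config = bigquery.QueryJobConfig(
    destination="myproject.mydata.mytable_latest",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = """
    SELECT
      idfield,
      ARRAY_AGG(t ORDER BY message_time DESC LIMIT 1)[OFFSET(0)].* EXCEPT (idfield)
    FROM `myproject.mydata.mytable` AS t
    GROUP BY idfield
"""
client.query(sql, job_config=job_config).result()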

Partitioning BigQuery Tables via API in python

I'm using Python to hit the BigQuery API. I've been successful at running queries and writing new tables, but would like to ensure those output tables are partitioned per https://cloud.google.com/bigquery/docs/creating-partitioned-tables
The output of the query would have the columns: event_date[string in the format "2017-11-12"], metric[integer]
Per the code below, I've been assigning "partitioning_type" to various objects, but it never returns an error.
(I guess it'd also be useful to know how to tell whether my partitioning efforts are actually working, i.e. how to identify the _PARTITIONTIME pseudo-column.)
dest_table_id = "BQresults"
query_job = client.run_async_query(str(uuid.uuid4()), query)
query_job.allow_large_results = True
dest_dataset = client.dataset(dest_dataset_id)
dest_table = dest_dataset.table(dest_table_id)
dest_table.partitioning_type = "DAY"
query_job.destination = dest_table
query_job.write_disposition = 'WRITE_TRUNCATE'
query_job.use_legacy_sql = False
query_job.begin()
query_job.result()
If you want to check whether the table is partitioned on a time column or not, use the get_table() method and check the partitioning_type property of the returned object.
You can check for integer-range partitioning via the range_partitioning property. You can also get the job object using get_job() with the job ID and check whether time_partitioning was set in the configuration.
I don't think the query job you're running results in a partitioned table, since time_partitioning should be set in the job configuration, and it seems like the client doesn't do that here. If that's the case, you can create the partitioned table first and use the existing table as a destination.
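For reference, a rough sketch with the current google-cloud-bigquery client that sets day partitioning on the destination through the job configuration (project and dataset names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Ingestion-time DAY partitioning on the query's destination table.
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.BQresults",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY
    ),
    use_legacy_sql=False,
)
client.query(query, job_config=job_config).result()

# Verify: time_partitioning is populated only on a partitioned table.
table = client.get_table("my-project.my_dataset.BQresults")
print(table.time_partitioning)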

Postgresql error statement is too large

I developed a script in Python and SQLAlchemy to get and update the last activity of my active users.
But the users are increasing a lot, and now I'm getting the following error:
psycopg2.ProgrammingError: Statement is too large. Statement Size: 16840277 bytes. Maximum Allowed: 16777216 bytes
I was thinking that updating postgresql.conf would make it work, so with the help of pgtune I updated the file, but it did not work. Then I updated my kernel parameters in /etc/sysctl.conf with the following:
kern.sysv.shmmax=4194304
kern.sysv.shmmin=1
kern.sysv.shmmni=32
kern.sysv.shmseg=8
kern.sysv.shmall=1024
and again it did not work.
After that I divided my query into slices to reduce its size, but I got the same error.
How can I know which parameter I need to update to increase the maximum size of my statement?
Workflow
query = "SELECT id FROM {}.{} WHERE status=TRUE".format(schema, customer_table)
ids = ["{}".format(i)for i in pd.read_sql(query, insert_uri).id.tolist()]
read_query = """
SELECT id,
MAX(CONVERT_TIMEZONE('America/Mexico_City', last_activity)) lastactivity
FROM activity WHERE
DATE_TRUNC('d', CONVERT_TIMEZONE('America/Mexico_City', last_activity)) =
DATE_TRUNC('d', CONVERT_TIMEZONE('America/Mexico_City', CURRENT_DATE))-{} and
id in ({})
GROUP BY id
""".format(day, ",".join(ids))
last_activity = pd.read_sql(read_query, read_engine, parse_dates=True)
If you are only fetching the IDs from the database and not filtering them in any other way, there is no need to fetch them at all; you can just insert the first SQL statement as a subquery into the second:
SELECT id,
       MAX(CONVERT_TIMEZONE('America/Mexico_City', last_activity)) lastactivity
FROM activity
WHERE DATE_TRUNC('d', CONVERT_TIMEZONE('America/Mexico_City', last_activity)) =
      DATE_TRUNC('d', CONVERT_TIMEZONE('America/Mexico_City', CURRENT_DATE)) - %s and
      id in (
          SELECT id FROM customerschema.customer WHERE status = TRUE
      )
GROUP BY id
Also, as Antti Haapala said, don't use string formatting for SQL parameters, because it is insecure, and if any parameter contains appropriate quotes, Postgres will interpret them as commands instead of data.
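A minimal sketch of the same pandas call with a bound parameter instead of string formatting (assuming read_engine is backed by psycopg2; depending on your pandas/SQLAlchemy versions you may need SQLAlchemy's :day style instead of the pyformat %(day)s shown here):

read_query = """
    SELECT id,
           MAX(CONVERT_TIMEZONE('America/Mexico_City', last_activity)) AS lastactivity
    FROM activity
    WHERE DATE_TRUNC('d', CONVERT_TIMEZONE('America/Mexico_City', last_activity)) =
          DATE_TRUNC('d', CONVERT_TIMEZONE('America/Mexico_City', CURRENT_DATE)) - %(day)s
      AND id IN (SELECT id FROM customerschema.customer WHERE status = TRUE)
    GROUP BY id
"""
# The value is passed separately, so the driver escapes it rather than
# splicing it into the SQL text.
last_activity = pd.read_sql(read_query, read_engine, params={"day": day}, parse_dates=True)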

Memory efficient way of fetching postgresql uniqueue dates?

I have a database with roughly 30 million entries, which is a lot, and I don't expect anything but trouble working with larger databases.
But using py-postgresql and the .prepare() statement, I would hope I could fetch entries on a "yield" basis and thus avoid filling up my memory with the results from the database, which apparently I can't?
This is what I've got so far:
import postgresql

user = 'test'
passwd = 'test'
db = postgresql.open('pq://' + user + ':' + passwd + '@192.168.1.1/mydb')
results = db.prepare("SELECT time FROM mytable")
uniqueue_days = []
with db.xact():
    for row in results():
        if not row['time'] in uniqueue_days:
            uniqueue_days.append(row['time'])
print(uniqueue_days)
Before even getting to if not row['time'] in uniqueue_days: I run out of memory, which isn't so strange considering results() probably fetches all the rows before looping through them?
Is there a way to get the postgresql library to "page" or batch down the results to, say, 60k rows per round, or perhaps even to rework the query to do more of the work?
Thanks in advance!
Edit: I should mention that the dates in the database are Unix timestamps, and I intend to convert them into %Y-%m-%d format before adding them to the uniqueue_days list.
If you were using the better-supported psycopg2 extension, you could use a loop over the client cursor, or fetchone, to get just one row at a time, as psycopg2 uses a server-side portal to back its cursor.
If py-postgresql doesn't support something similar, you could always explicitly DECLARE a cursor on the database side and FETCH rows from it progressively. I don't see anything in the documentation that suggests py-postgresql can do this for you automatically at the protocol level like psycopg2 does.
Usually you can switch between database drivers pretty easily, but py-postgresql doesn't seem to follow the Python DB-API, so testing the switch will take a few more changes. I still recommend it.
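As a rough illustration of the psycopg2 route, a server-side (named) cursor streams rows in batches instead of loading everything at once (connection details reuse the question's placeholders):

import psycopg2

conn = psycopg2.connect(host="192.168.1.1", dbname="mydb",
                        user="test", password="test")

uniqueue_days = set()
with conn, conn.cursor(name="time_cursor") as cur:  # named cursor => server-side
    cur.itersize = 60000  # rows fetched per network round trip
    cur.execute("SELECT time FROM mytable")
    for (t,) in cur:
        uniqueue_days.add(t)

print(sorted(uniqueue_days))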
You could let the database do all the heavy lifting.
Ex: Instead of reading all the data into Python and then calculating the unique dates, why not try something like this:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES FROM mytable;
If you want to strictly enforce a sort order on the unique dates returned, then do the following:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES
FROM mytable
order by 1;
Useful references for the functions used above:
Date/Time Functions and Operators
Data Type Formatting Functions
If you would like to read the data in chunks, you could use the dates you get from the above query to subset your results further down the line:
Ex:
'SELECT * FROM mytable WHERE time BETWEEN ' + UNIQUE_DATES[i] + ' AND ' + UNIQUE_DATES[j] + ';'
where UNIQUE_DATES[i] and UNIQUE_DATES[j] are parameters you would pass from Python.
I will leave it for you to figure out how to convert the dates into Unix timestamps.
