AWS Glue Search Option - python

I'm currently using the AWS Glue Data Catalog to organize my database. Once I set up the connection and ran my crawler to gather information, I was able to see the generated metadata.
One feature that would be nice to have is the ability to SEARCH the entire Data Catalog on ONE column name. For example, if I have 5 tables in my Data Catalog, and one of those tables happens to have a field "age", I'd like to be able to see that table.
I was also wondering whether I can search on the "comment" field that every column of a table has in the AWS Glue Data Catalog.
Hope to get some help!

You can do that with the AWS Glue API. For example, you can use the Python SDK boto3 and its get_tables() method to retrieve all the metadata about the tables in a particular database. Have a look at the response syntax returned by get_tables(); you then only need to parse it, for example:
import boto3

glue_client = boto3.client('glue')
response = glue_client.get_tables(
    DatabaseName='__SOME_NAME__'
)

for table in response['TableList']:
    columns = table['StorageDescriptor']['Columns']
    for col in columns:
        col_name = col['Name']
        col_comment = col.get('Comment', '')  # 'Comment' is optional, so use .get()
        # Here you do the search for what you need
Note: if you have a partitioned table, the partition keys are stored as separate (artificial) columns, so you would also need to search through
columns_as_partitions = table['PartitionKeys']
for col in columns_as_partitions:
    col_name = col['Name']
    col_comment = col.get('Comment', '')
    # Here you do the search for what you need
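Putting this together, here is a minimal sketch of the search described in the question (find every table that has a column named "age"); it walks all databases and tables via paginators, and target_column is just an illustrative name:

import boto3

glue_client = boto3.client('glue')
target_column = 'age'  # column name to search for (example from the question)

db_paginator = glue_client.get_paginator('get_databases')
table_paginator = glue_client.get_paginator('get_tables')

for db_page in db_paginator.paginate():
    for db in db_page['DatabaseList']:
        for table_page in table_paginator.paginate(DatabaseName=db['Name']):
            for table in table_page['TableList']:
                # regular columns plus partition keys
                all_columns = (table.get('StorageDescriptor', {}).get('Columns', [])
                               + table.get('PartitionKeys', []))
                if any(col['Name'] == target_column for col in all_columns):
                    print(f"{db['Name']}.{table['Name']}")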

Related

BigQuery client python get column based partitioning column name

I am looking to add a WHERE clause to an existing query to exclude data in the BigQuery streaming buffer.
To do this I would like to get the partition column name so I can add
WHERE partition_column IS NOT NULL;
to my existing query.
I have been looking at the CLI and the get_table method; however, that just returns the value of the column, not the column name.
I get the same when querying .INFORMATION_SCHEMA.PARTITIONS: it returns a partition_id field, but I would prefer the column name itself. Is there a way to get this?
Additionally, the table is set up with column-based partitioning.
Based on the Python BigQuery client documentation, use the time_partitioning attribute:
from google.cloud import bigquery

client = bigquery.Client()
bq_table = client.get_table('my_partioned_table')

bq_table.time_partitioning        # TimePartitioning(field='column_name', type_='DAY')
bq_table.time_partitioning.field  # 'column_name'
A small tip: if you don't know where to look, print the full API representation:
bq_table.to_api_repr()
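To then build the WHERE clause from the question, a minimal sketch (reusing the my_partioned_table name above, and assuming the table really is column-partitioned) could be:

from google.cloud import bigquery

client = bigquery.Client()
bq_table = client.get_table('my_partioned_table')

partition_column = bq_table.time_partitioning.field
query = f"SELECT * FROM `my_partioned_table` WHERE {partition_column} IS NOT NULL"
rows = client.query(query).result()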

How to get number of entities returned in a Azure Tables query?

I am using python to make a query to Azure tables.
query = table_service.query_entities(table_name, filter=filter_string)
How can I see the number of entities returned by this query? I have tried using
query.count
query.count()
but have had no luck. I get the following error:
'ListGenerator' object has no attribute 'count'
Searching online keeps bringing back results about getting the count of all the rows in the table, which is not relevant.
There is a new SDK for the Azure Tables service; you can install it from pip with the command pip install azure-data-tables. The new SDK can target either a Storage account or a Cosmos account. Here is a sample of how you can find the total number of entities returned by a query. You have to iterate through each entity, because the new Tables SDK uses paging on calls to query_entities and list_entities: entities are returned in an ItemPaged, which only yields a subset of the entities at a time.
from azure.data.tables import TableClient

connection_string = "<your_conn_str>"
table_name = "<your_table_name>"

with TableClient.from_connection_string(connection_string, table_name) as table_client:
    f = "value gt 25"
    query = table_client.query_entities(filter=f)

    count = 0
    for entity in query:
        count += 1
    print(count)
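If you only need the count, a compact equivalent inside the same with block is to exhaust the iterator with a generator expression:
count = sum(1 for _ in table_client.query_entities(filter=f))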
If you can clarify why you need the number of entities in a query I might be able to give better advice.
(Disclaimer, I work on the Azure SDK for Python team)
You should use len(query.items) to get the number of returned entities.
The code looks like this:
query = table_service.query_entities(table_name, filter=filter_string)
print(len(query.items))

Writing to BigQuery dynamic table name Python SDK

I'm working on an ETL job which pulls data from a database, does minor transformations, and outputs to BigQuery. I have written my pipeline in Apache Beam 2.26.0 using the Python SDK. I'm loading a dozen or so tables, and I'm passing their names as arguments to beam.io.WriteToBigQuery.
Now, the documentation (https://beam.apache.org/documentation/io/built-in/google-bigquery) says:
When writing to BigQuery, you must supply a table schema for the destination table that you want to write to, unless you specify a create disposition of CREATE_NEVER.
I believe this is not exactly true. In my tests I saw that this is the case only when passing a static table name.
If you have a bunch of tables and want to pass the table name as an argument, it throws an error:
ErrorProto message: 'No schema specified on job or table.'
My code:
bq_data | "Load data to BQ" >> beam.io.WriteToBigQuery(
table=lambda row: bg_config[row['table_name']],
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER
)
bq_data is a dict built from a row of a pandas data frame, and it includes a table_name column.
bq_config is a dictionary where the key is row['table_name'] and the value is of the format:
[project_id]:[dataset_id].[table_id]
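For illustration, a hypothetical bq_config of that shape might look like:
bq_config = {
    'customers': 'my-project:my_dataset.customers',  # hypothetical entries
    'orders': 'my-project:my_dataset.orders',
}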
Anyone have some thoughts on this?
Have a look at this thread; I addressed it there. In short, I used Python's built-in time/date functions to render the table-name variable before executing the BigQuery API request.
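Separately, one thing worth trying (only a sketch, based on the WriteToBigQuery documentation stating that schema may also be a callable that receives the destination; bq_schemas is a hypothetical dict you would have to build yourself) is to supply a per-destination schema alongside the dynamic table name:
# bq_schemas is hypothetical: it maps the destination strings from bq_config
# to schema strings such as 'name:STRING,age:INTEGER'.
bq_data | "Load data to BQ" >> beam.io.WriteToBigQuery(
    table=lambda row: bq_config[row['table_name']],
    schema=lambda dest: bq_schemas[dest],
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER
)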

Partitioning BigQuery Tables via API in python

I'm using Python to hit the BigQuery API. I've been successful at running queries and writing new tables, but would like to ensure those output tables are partitioned, per https://cloud.google.com/bigquery/docs/creating-partitioned-tables
The output of the query would have the columns: event_date [string in the format "2017-11-12"], metric [integer].
Per the code below, I've been assigning the "partitioning_type" property to various objects, but it never returns an error.
(I guess it'd also be useful to know how to tell whether my partitioning efforts are actually working, i.e. how to identify the _PARTITIONTIME pseudo-column.)
dest_table_id = "BQresults"
query_job = client.run_async_query(str(uuid.uuid4()), query)
query_job.allow_large_results = True
dest_dataset = client.dataset(dest_dataset_id)
dest_table = dest_dataset.table(dest_table_id)
dest_table.partitioning_type ="DAY"
query_job.destination = dest_table
query_job.write_disposition = 'WRITE_TRUNCATE'
query_job.use_legacy_sql = False
query_job.begin()
query_job.result()
If you want to check whether a table is partitioned on a time column or not, use the get_table() method and check the partitioning_type property of the returned object.
You can check for integer-range partitioning by checking the range_partitioning property. You can also get the job object using get_job() with the job id and check whether time_partitioning was set in the configuration.
I don't think the query job you're running results in a partitioned table, since time_partitioning should be set in the job configuration, and it seems the client doesn't do this. If that is the case, you can create the partitioned table first and use the existing table as the destination.
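For illustration, here is a sketch with the current google-cloud-bigquery client (the project, dataset and query are placeholders) that sets ingestion-time partitioning on the job configuration and then verifies it on the destination table:

from google.cloud import bigquery

client = bigquery.Client()
query = "SELECT event_date, metric FROM `my-project.my_dataset.source_table`"  # placeholder

job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.BQresults",  # placeholder destination
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    # No field given, so rows are partitioned by ingestion time (_PARTITIONTIME)
    time_partitioning=bigquery.TimePartitioning(type_=bigquery.TimePartitioningType.DAY),
)
client.query(query, job_config=job_config).result()

# Check that the destination table is actually partitioned
table = client.get_table("my-project.my_dataset.BQresults")
print(table.time_partitioning)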

BigQuery : add new column to existing tables using python BQ API

Related question: Bigquery add columns to table schema using BQ command line tools
I want to add a new column to existing tables (i.e. update an existing table's schema) in BigQuery using the BigQuery Python API.
However, my code doesn't seem to work.
Here's my code:
import httplib2
from googleapiclient.discovery import build
from oauth2client import tools
from oauth2client.client import flow_from_clientsecrets
from oauth2client.file import Storage

flow = flow_from_clientsecrets('secret_key_path', scope='my_scope')
storage = Storage('CREDENTIAL_PATH')
credentials = storage.get()
if credentials is None or credentials.invalid:
    credentials = tools.run_flow(flow, storage, tools.argparser.parse_args([]))

http = httplib2.Http()
http = credentials.authorize(http)
bigquery_service = build('bigquery', 'v2', http=http)

tbObject = bigquery_service.tables()
query_body = {'schema': {'name': 'new_column_name', 'type': 'STRING'}}
tbObject.update(projectId='projectId', datasetId='datasetId', tableId='tableId', body=query_body).execute()
It returns a Provided schema doesn't match existing table's schema error.
Can anyone give me a working Python example?
Many thanks!
Based on Mikhail Berlyant's comments, I have to pass the existing table's schema, with the new field (column) appended, to the update() method in order to update the existing table's schema.
A Python code example is given below:
...
tbObject = bigquery_service.tables()

# get the current table schema
table_data = tbObject.get(projectId=projectId, datasetId=datasetId, tableId=tableId).execute()
schema = table_data.get('schema')

# append the new field to the current table's schema
new_column = {'name': 'new_column_name', 'type': 'STRING'}
schema.get('fields').append(new_column)

query_body = {'schema': schema}
tbObject.update(projectId=projectId, datasetId=datasetId, tableId=tableId, body=query_body).execute()
Also, there's no way to set a value for the new column in existing rows. Thanks to Mikhail Berlyant's suggestion: the way to get values for existing rows is to create a separate table with the new column's values and a join key, and join the existing table with that table to replace the old one.
Summary of my comments (as I've got some minutes now for this): the whole schema (along with the new field) needs to be supplied to the API. The new field will be added with NULL for existing rows; there is no way to set a value. You can add some logic to the queries you run against this table to compensate for this, or you can keep a separate table with just this new field and some key, and join your existing table with that new table to get this field.
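For reference, the same schema update with the current google-cloud-bigquery client is much shorter; this is a sketch with placeholder table and column names:

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table('my_dataset.my_table')  # placeholder table reference

# Append the new column to the existing schema and push only the schema change
new_schema = list(table.schema)
new_schema.append(bigquery.SchemaField('new_column_name', 'STRING'))
table.schema = new_schema
client.update_table(table, ['schema'])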
