I have a table in a BigQuery dataset and I'm trying to find out when the table was last modified via the BigQuery client API.
I have tried (in Python)
from gcloud import bigquery
client = bigquery.Client(project="my_project")
dataset = client.dataset("my_dataset")
tables = dataset.list_tables()
table = tables[0][5] # Extract the table that I want
I can check that I've got the right table by running print(table.name); however, I don't know how to get the table metadata. In particular, I want to know how to find out when the table was last modified.
Although I've written the above in Python (I'm more familiar with it than with other programming languages), I don't mind whether the answer is in Python or JavaScript (I think I'm going to have to implement it in the latter).
Under the hood, tables = dataset.list_tables() is making an API request to Tables.list. The result of this request does not contain all of the table metadata, such as the last modified time.
The Tables.get API request is needed for this type of table information. To make this request, you need to call reload() on the table. For example:
from gcloud import bigquery  # as in the question; newer releases use `from google.cloud import bigquery`

bigquery_service = bigquery.Client()
dataset = bigquery_service.dataset("<your-dataset>")
tables = dataset.list_tables()
for table in tables:
    table.reload()           # issues a Tables.get request and populates the full metadata
    print(table.modified)
In my test/dataset, this prints:
2016-12-30 08:57:15.679000+00:00
2016-12-18 23:57:24.570000+00:00
2016-12-19 05:18:28.371000+00:00
See here (Github) and here (Python docs) for more details.
I've written some Python code to extract data from a REST API and load it into an Azure SQL database, but this process is taking almost half an hour for 20,000 rows. Is there a more efficient way of doing this? I'm thinking maybe extract the data as a JSON file, put it in Blob Storage, and then use Azure Data Factory to load the data into SQL, but I have no idea how to code it that way.
def manualJournalLineItems(tenantid):
    endpoint = "api.xro/2.0/manualjournals/?page=1"
    result = getAPI(endpoint, token, tenantid)
    page = 1
    while result['ManualJournals']:
        endpoint = "api.xro/2.0/manualjournals/?page=" + str(page)
        result = getAPI(endpoint, token, tenantid)
        for inv in result['ManualJournals']:
            for li in inv['JournalLines']:
                cursor.execute(
                    "INSERT INTO [server].dbo.[Xero_ManualJournalLines]"
                    "(ManualJournalID,AccountID,Description,LineAmount,TaxAmount,AccountCode,Region) "
                    "VALUES(?,?,?,?,?,?,?)",
                    inv['ManualJournalID'], li['AccountID'], li.get('Description', ''),
                    li.get('LineAmount', 0), li.get('TaxAmount', 0), li.get('AccountCode', 0), tenantid)
                conn.commit()
        page = int(page) + 1
If Python is not a mandatory requirement, yes, you can use Data Factory.
You will need to create a pipeline with the following components:
'Copy Data' Activity
Source Dataset (REST API)
Sink Dataset (Azure SQL)
Also, may I know where your REST API is hosted? Is it within Azure, through App Service? If not, you will also need to set up a Self-Hosted Integration Runtime.
You can refer to the steps here, which copy data from Blob Storage to Azure SQL.
You can also follow the steps below to create a REST API dataset as the source.
Create a new pipeline.
Type 'copy' in the 'Activity' search box. Drag the 'Copy Data' activity to the pipeline
Click on 'Source' tab, and click on 'New' to create a new Source Dataset.
Type 'REST' in the 'data source' search box.
In the 'REST' dataset window, click on 'Connection' tab. Click on 'New' to create a linked service to point to the REST API.
Here fill up the credentials to the REST API.
Continue setting up the Sink dataset to point to Azure SQL, and test your pipeline to make sure it works. Hope it helps!
Found the answer: append() the values to a list and insert the list into SQL with executemany().
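For reference, here is a minimal sketch of that approach. It assumes the same getAPI helper, token, tenantid, pyodbc cursor and conn objects as in the question; fast_executemany is a pyodbc-specific flag, so drop it if your driver does not support it.

def manualJournalLineItems(tenantid):
    rows = []
    page = 1
    while True:
        endpoint = "api.xro/2.0/manualjournals/?page=" + str(page)
        result = getAPI(endpoint, token, tenantid)
        if not result['ManualJournals']:
            break
        for inv in result['ManualJournals']:
            for li in inv['JournalLines']:
                # Collect each row as a tuple instead of executing an INSERT per row.
                rows.append((
                    inv['ManualJournalID'],
                    li['AccountID'],
                    li.get('Description', ''),
                    li.get('LineAmount', 0),
                    li.get('TaxAmount', 0),
                    li.get('AccountCode', 0),
                    tenantid,
                ))
        page += 1

    # One batched round trip and one commit instead of one per row.
    cursor.fast_executemany = True  # pyodbc-specific speed-up; remove if unsupported
    cursor.executemany(
        "INSERT INTO [server].dbo.[Xero_ManualJournalLines]"
        "(ManualJournalID,AccountID,Description,LineAmount,TaxAmount,AccountCode,Region) "
        "VALUES(?,?,?,?,?,?,?)",
        rows)
    conn.commit()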
I just use this simple code to post data to Firebase, but I don't know why it appears 6 times in the Firebase Realtime Database.
from firebase import firebase
url = "https://xxx.firebaseio.com/"
fb = firebase.FirebaseApplication(url, None)
fb.post("/posts", {'ID':123})
I run "python fb.py" one time only.
However the result is:
I am very confused.
Are you trying to update this field or create a new entry with a unique ID when you run your code?
Maybe try using fb.put() instead.
fb.post() is the equivalent of .push() in the JavaScript API, so it creates a unique ID for you. fb.put() is equivalent to .set() and will just set the data.
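For example, here is a minimal sketch with put(), assuming the same python-firebase package and URL as in the question ("123" is just a hypothetical key name):

from firebase import firebase

url = "https://xxx.firebaseio.com/"
fb = firebase.FirebaseApplication(url, None)

# put(path, name, data) writes the data at /posts/123 every time,
# instead of creating a new auto-generated key on each call.
fb.put("/posts", "123", {'ID': 123})

Running the script repeatedly will then overwrite the same entry rather than adding new ones.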
I need to create a table in BigQuery partitioned by a specific field. I have noticed that this is only available via the REST API. Is there a way to do this via the Python API?
Any help would be appreciated.
My guess is that the docs just haven't been updated yet (not that rolling an HTTP request and calling the API would be hard anyway), because if you look at the code for the BigQuery Python client library, it does indeed appear to support specifying the field when creating a partitioned table.
Expanding on Graham Polley's answer: you can do this by setting the time_partitioning property.
Something like this:
import google.cloud.bigquery as bq

bq_client = bq.Client()
dataset = bq_client.dataset('dataset_name')
table_ref = dataset.table('table_name')
table = bq.Table(table_ref, schema=[
    bq.SchemaField('timestamp', 'TIMESTAMP', 'REQUIRED'),
    bq.SchemaField('col_name', 'STRING', 'REQUIRED'),
])
# Partition the table on the 'timestamp' column.
table.time_partitioning = bq.TimePartitioning(field='timestamp')
bq_client.create_table(table)
I am facing a couple of issues in figuring out what-is-what; in spite of the humongous documentation, I am unable to figure out the following:
1. Which report type should be used to get the campaign-level totals? I am trying to get the data with the headers
campaign_id | campaign_name | Clicks | Impressions | Cost | Conversions
2. I have tried to use "CAMPAIGN_PERFORMANCE_REPORT", but I get the information broken up at a keyword level, whereas I am trying to pull the data at a campaign level.
3. I also need to push the data to a database. In the API documentation, I only get samples which either print the results on my screen or create a file on my machine. Is there a way to get the data as JSON so that I can push it to the database?
4. I have 7 accounts under my MCC account as of now, and the number will increase in the coming days. I don't want to manually hard-code the client customer IDs, as new accounts will keep being created. Is there a way to get the list of client customer IDs under my MCC account?
I am trying to get this data using Python as my code base and AdWords API v201710.
To retrieve campaign performance data you need to run the CAMPAIGN_PERFORMANCE_REPORT. Follow this link to view all available columns for the Campaign Performance report.
The campaign performance report does not include stats aggregated at a keyword level. Are you using AWQL to pull your report?
Can you paste your code here? I find it odd that you are getting keyword-level data.
Run this Python example code to get campaign data (you should definitely not be getting keyword-level data with this example code).
Firstly, the Google AdWords API only returns report data in the following file formats: CSVFOREXCEL, CSV, TSV, XML, GZIPPED_CSV and GZIPPED_XML. Unfortunately, JSON is not supported for your use case. I would recommend GZIPPED_CSV and setting the following properties to true:
skipReportHeader
skipColumnHeader
skipReportSummary
This will skip all headers, report titles and totals in the report, making it very simple to upsert the data into a table (a sketch follows below).
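To be concrete, here is a hedged sketch of points 1-3 using the googleads client library and AWQL. It assumes a googleads.yaml file with valid AdWords credentials, and report.csv.gz is just a placeholder file name:

from googleads import adwords

client = adwords.AdWordsClient.LoadFromStorage()
report_downloader = client.GetReportDownloader(version='v201710')

# Campaign-level totals, one row per campaign.
report_query = (
    'SELECT CampaignId, CampaignName, Clicks, Impressions, Cost, Conversions '
    'FROM CAMPAIGN_PERFORMANCE_REPORT '
    'DURING LAST_30_DAYS')

with open('report.csv.gz', 'wb') as output_file:
    report_downloader.DownloadReportWithAwql(
        report_query, 'GZIPPED_CSV', output_file,
        skip_report_header=True,    # no report title line
        skip_column_header=True,    # no column names
        skip_report_summary=True,   # no totals row
        include_zero_impressions=False)

You can then gunzip and parse the CSV and insert the rows into your database.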
It is not possible to enter an MCC ID and expect the API to fetch a report for all client accounts. Each API report request contains a single client ID, so you need to build a list of all client IDs and iterate through them. If you are using the client library (recommended), you can simply set the client ID within the session, i.e. session.setClientCustomerId("xxx") in Java, or client.SetClientCustomerId("xxx") in the Python library.
To automate this, use the ManagedCustomerService to retrieve all client IDs and then iterate through them, so you don't need to hard-code each client ID. Google has created a handy Python example which returns the account hierarchy, including the child account IDs (click here); a sketch of the idea follows below.
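Here is a hedged sketch of that iteration with the googleads library (again assuming googleads.yaml credentials; the selector fields shown are the commonly used ones, adjust as needed):

from googleads import adwords

client = adwords.AdWordsClient.LoadFromStorage()
managed_customer_service = client.GetService(
    'ManagedCustomerService', version='v201710')

selector = {
    'fields': ['CustomerId', 'Name', 'CanManageClients'],
    'paging': {'startIndex': '0', 'numberResults': '500'},
}
page = managed_customer_service.get(selector)

# Keep only child (non-manager) accounts.
child_ids = [entry['customerId'] for entry in page['entries']
             if not entry['canManageClients']]

for customer_id in child_ids:
    client.SetClientCustomerId(customer_id)
    # ...download the campaign performance report for this account...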
Lastly, based on your question, I assume you are attempting to run an ETL process. Google has an open-source AdWords extractor which I highly recommend.
In my Python project, I need to fill a BigQuery table with a relational dataframe. I'm having a lot of trouble creating a new table from scratch and being sure that the first data I upload to it is actually written to the table.
I've read the page https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency and have seen that applying an insertId to the insert query would solve the problem, but since I use pandas dataframes, the to_gbq function of the pandas-gbq package seems to be perfect for this task. Yet, when using the to_gbq function and a new table is created or replaced, sometimes (apparently at random) the first data chunk is not written to the table.
Does anybody know how to ensure the complete insertion of a DataFrame into a newly created BigQuery table? Thanks
I believe you are encountering https://github.com/pydata/pandas-gbq/issues/75. Basically, pandas uses the BigQuery streaming API to write data into tables, but the streaming API has a delay after table creation before it starts working.
Edit: Version 0.3.0 of pandas-gbq fixes this issue by using a load job to upload data frames to BigQuery instead of streaming.
In the meantime, I'd recommend using a "load job" to create the tables. For example, using the client.load_table_from_file method in the google-cloud-bigquery package.
from google.cloud.bigquery import LoadJobConfig
from six import StringIO
destination_table = client.dataset(dataset_id).table(table_id)
job_config = LoadJobConfig()
job_config.write_disposition = 'WRITE_APPEND'
job_config.source_format = 'NEWLINE_DELIMITED_JSON'
rows = []
for row in maybe_a_dataframe:
    row_json = row.to_json(force_ascii=False, date_unit='s', date_format='iso')
    rows.append(row_json)

body = StringIO('{}\n'.format('\n'.join(rows)))
client.load_table_from_file(
    body,
    destination_table,
    job_config=job_config).result()
Edit: This code sample fails for columns containing non-ASCII characters. See https://github.com/pydata/pandas-gbq/pull/108
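A possible workaround (a hedged sketch, not the fix that eventually landed in pandas-gbq) is to hand the load job UTF-8 encoded bytes instead of a text buffer:

from io import BytesIO

# Same rows as above, but encoded to UTF-8 bytes so non-ASCII characters survive the upload.
body = BytesIO('{}\n'.format('\n'.join(rows)).encode('utf-8'))
client.load_table_from_file(
    body,
    destination_table,
    job_config=job_config).result()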