How to create date partitioned tables in GBQ? Can you use python?

I have just under 100M records of data that I wish to transform by denormalising a field and then input into a date partitioned GBQ table. The dates go back to 2001.
I had hoped that I could transform it with Python and then use GBQ directly from the script to accomplish this, but after reading up on this and particularly this document it doesn't seem straightforward to create date-partitioned tables. I'm looking for a steer in the right direction.
Is there any working example of a python script that can do this? Or is it not possible to do via Python? Or is there another method someone can point me in the direction of?
Update
I'm not sure if I've missed something, but the tables created appear to be partitioned by the insert date of when I'm creating the table, whereas I want to partition by a date set within the existing dataset. I can't see any way of changing this.
Here's what I've been experimenting with:
import uuid
import os
import csv
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField
from google.cloud.bigquery import Client
from google.cloud.bigquery import Table
import logging
import json
import pprint

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path to service account credentials'

# Note: this uses the legacy (pre-1.0) google-cloud-bigquery client API.
client = bigquery.Client()
dataset = client.dataset('test_dataset')
dataset.create()

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
]

table = dataset.table('table_name', SCHEMA)
table.partitioning_type = "DAY"
table.create()

rows = [
    ('bob', 30),
    ('bill', 31),
]
table.insert_data(rows)
Is it possible to modify this to take control of the partitions as I create tables and insert data?
Update 2
It turns out I wasn't looking for table partitioning; for my use case it's enough to simply append a date serial to the end of my table name and then query with something along the lines of:
SELECT * FROM `dataset.test_dataset.table_name_*`
WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170702'
I don't know whether this is technically still partitioning or not, but as far as I can see it has the same benefits.
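For completeness, here's a minimal sketch of that sharded-table approach with the Python client (google-cloud-bigquery 1.x), reusing the toy schema from above; the dataset and table names are illustrative:
from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = client.dataset('test_dataset')

SCHEMA = [
    bigquery.SchemaField('full_name', 'STRING', mode='required'),
    bigquery.SchemaField('age', 'INTEGER', mode='required'),
]

# One physical table per day, named table_name_YYYYMMDD, so that
# _TABLE_SUFFIX queries like the one above can prune by date range.
rows_by_date = {
    '20170701': [('bob', 30)],
    '20170702': [('bill', 31)],
}

for date_suffix, rows in rows_by_date.items():
    table_ref = dataset_ref.table('table_name_' + date_suffix)
    table = client.create_table(bigquery.Table(table_ref, schema=SCHEMA))
    # Streaming inserts right after table creation can briefly fail while the
    # new table propagates; a retry or short delay may be needed in practice.
    client.insert_rows(table, rows)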

Updated to the latest version (google-cloud-bigquery==1.4.0)
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField

client = bigquery.Client()
dataset_ref = client.dataset('test_dataset')
table_ref = dataset_ref.table('test_table')

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='required'),
    SchemaField('age', 'INTEGER', mode='required'),
]

partition = 'DAY'  # the only time-partitioning type handled here
table = bigquery.Table(table_ref, schema=SCHEMA)
if partition not in ('DAY', ):
    raise NotImplementedError(f"BigQuery partition type unknown: {partition}")
table.time_partitioning = bigquery.table.TimePartitioning(type_=partition)
table = client.create_table(table)  # API request

You can easily create date-partitioned tables using the API and the Python SDK. Simply set the timePartitioning field to DAY in your script:
https://github.com/GoogleCloudPlatform/google-cloud-python/blob/a14905b6931ba3be94adac4d12d59232077b33d2/bigquery/google/cloud/bigquery/table.py#L219
Or roll your own table insert request with the following body:
{
  "tableReference": {
    "projectId": "myProject",
    "tableId": "table1",
    "datasetId": "mydataset"
  },
  "timePartitioning": {
    "type": "DAY"
  }
}
Everything is just backed by the REST API here.
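If you do go the raw-REST route, a hedged sketch of posting that body to the tables.insert endpoint might look like this; the project/dataset/table names are placeholders, and google-auth application-default credentials are assumed to be configured:
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Obtain default credentials scoped for BigQuery.
credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/bigquery"])
session = AuthorizedSession(credentials)

body = {
    "tableReference": {
        "projectId": "myProject",
        "datasetId": "mydataset",
        "tableId": "table1",
    },
    "timePartitioning": {"type": "DAY"},
}

# tables.insert: POST /bigquery/v2/projects/{projectId}/datasets/{datasetId}/tables
url = "https://bigquery.googleapis.com/bigquery/v2/projects/myProject/datasets/mydataset/tables"
response = session.post(url, json=body)
response.raise_for_status()
print(response.json().get("timePartitioning"))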

Be aware that different versions of google-api-core handle time-partitioned tables differently. For example, using google-cloud-core==0.29.1, you must use the bigquery.Table object to create time-partitioned tables:
from google.cloud import bigquery
MY_SA_PATH = "/path/to/my/service-account-file.json"
MY_DATASET_NAME = "example"
MY_TABLE_NAME = "my_table"
client = bigquery.Client.from_service_account_json(MY_SA_PATH)
dataset_ref = client.dataset(MY_DATASET_NAME)
table_ref = dataset_ref.table(MY_TABLE_NAME)
actual_table = bigquery.Table(table_ref)
actual_table.partitioning_type = "DAY"
client.create_table(actual_table)
I only discovered this by looking at the 0.20.1 Table source code. I didn't see this in any docs or examples. If you're having problems creating time-partitioned tables, I suggest that you identify the version of each Google library that you're using (for example, using pip freeze), and check your work against the library's source code.
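As a quick way to check those versions from Python itself (rather than pip freeze), something like the following works on Python 3.8+; the package names are the usual PyPI distribution names:
from importlib.metadata import version, PackageNotFoundError

# Print the installed version of each relevant Google library, if present.
for package in ("google-cloud-bigquery", "google-cloud-core", "google-api-core"):
    try:
        print(package, version(package))
    except PackageNotFoundError:
        print(package, "not installed")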

Related

how to commit changes to SQLAlchemy object in the database table

I am trying to translate a set of columns in my MySQL database using Python's googletrans library.
Sample MySQL table Data:
Label        Answer               Label_Translated  Answer_Translated
cómo estás   Wie heißen sie?      NULL              NULL
wie gehts    per favore rivisita  NULL              NULL
元気ですか    Cuántos años tienes  NULL              NULL
Below is my sample code:
import pandas as pd
import googletrans
from googletrans import Translator
import sqlalchemy
import pymysql
import numpy as np
from sqlalchemy import create_engine, MetaData, Table
from sqlalchemy.orm import sessionmaker
engine = create_engine("mysql+pymysql:.....")
Session = sessionmaker(bind = engine)
session = Session()
translator = Translator()
I read the database table using:
sql_stmt = "SELECT * FROM translate"
data = session.execute(sql_stmt)
I perform the translation steps using:
for to_translate in data:
    to_translate.Answer_Translated = translator.translate(to_translate.Answer, dest = 'en')
    to_translate.Label_Translated = translator.translate(to_translate.Label, dest = 'en')
I tried session.commit(), but the changes are not reflected in the database. Could someone please let me know how to make the changes permanent in the database?
Also when I try:
for rows in data:
    print(rows)
I don't see any output. Before committing the changes to the database, is there a way to view them in Python?
Rewriting my answer because I missed that the OP was using a raw query to get the result set.
Your issue seems to be that there is no real update logic in your code (although you might have left that out). Here is what you could do. Keep in mind that it's not the most efficient or elegant way to deal with this, but it might get you going in the right direction.
# assuming: import sqlalchemy as sa
for to_translate in data:
    session = Session()
    print(to_translate)
    mappings = {}
    mappings['Label'] = to_translate[0]
    # googletrans returns a Translated object; store its .text
    mappings['Answer_Translated'] = translator.translate(to_translate.Answer, dest="en").text
    mappings['Label_Translated'] = translator.translate(to_translate.Label, dest="en").text
    update_str = "update Data set Answer_Translated=:Answer_Translated, Label_Translated=:Label_Translated where Label = :Label"
    session.execute(sa.text(update_str), mappings)
    session.commit()
This will update your db. Now I can't guarantee it will work out of the box, because your actual table might differ from the sample you posted, but the print statement should be able to guide you in fixing update_str. Note that using the ORM would make this a lot nicer.
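For reference, a hedged sketch of the ORM route; the Translate mapped class, its column types, and the choice of Label as primary key are assumptions based on the sample table in the question:
import sqlalchemy as sa
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from googletrans import Translator

Base = declarative_base()

class Translate(Base):
    __tablename__ = "translate"
    Label = sa.Column(sa.String(255), primary_key=True)  # assumed primary key
    Answer = sa.Column(sa.Text)
    Label_Translated = sa.Column(sa.Text)
    Answer_Translated = sa.Column(sa.Text)

engine = sa.create_engine("mysql+pymysql:.....")  # same connection string as above
Session = sessionmaker(bind=engine)
session = Session()
translator = Translator()

# The session tracks attribute changes on mapped objects, so one commit persists them all.
for row in session.query(Translate).all():
    row.Label_Translated = translator.translate(row.Label, dest="en").text
    row.Answer_Translated = translator.translate(row.Answer, dest="en").text

session.commit()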

How do I configure and execute BatchStatement in Cassandra correctly?

In my Python (3.8) application, I make a request to the Cassandra database via DataStax Python Driver 3.24.
I have several CQL operations that I am trying to execute with a single query via BatchStatement according to the official documentation. Unfortunately, my code causes an error with the following content:
"errorMessage": "retry_policy should implement cassandra.policies.RetryPolicy"
"errorType": "ValueError"
As you can see from my code, I set the value for the retry_policy attribute inside BatchStatement. Nevertheless, my code raises the error you see above. What kind of value should the retry_policy property contain? What is the reason for the current conflict?
Code Snippet:
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.auth import PlainTextAuthProvider
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra import ConsistencyLevel
from cassandra.query import dict_factory
from cassandra.query import BatchStatement, SimpleStatement
from cassandra.policies import RetryPolicy
auth_provider = PlainTextAuthProvider(username=db_username, password=db_password)
default_profile = ExecutionProfile(
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc=db_local_dc),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    request_timeout=60,
    row_factory=dict_factory
)
cluster = Cluster(
    db_host,
    auth_provider=auth_provider,
    port=db_port,
    protocol_version=4,
    connect_timeout=60,
    idle_heartbeat_interval=0,
    execution_profiles={EXEC_PROFILE_DEFAULT: default_profile}
)
session = cluster.connect()
name_1, name_2, name_3 = "Bob", "Jack", "Alex"
age_1, age_2, age_3 = 25, 30, 18
cql_statement = "INSERT INTO users (name, age) VALUES (%s, %s)"
batch = BatchStatement(retry_policy=RetryPolicy)
batch.add(SimpleStatement(cql_statement, (name_1, age_1)))
batch.add(SimpleStatement(cql_statement, (name_2, age_2)))
batch.add(SimpleStatement(cql_statement, (name_3, age_3)))
session.execute(batch)
Well, I finally found the errors.
I removed the retry_policy property from the BatchStatement. My other mistake was that I had put the CQL arguments inside SimpleStatement instead of passing them to batch.add().
Here is a working example code snippet:
...
# BatchType also needs to be imported: from cassandra.query import BatchType
batch = BatchStatement(batch_type=BatchType.UNLOGGED)
batch.add(SimpleStatement(cql_statement), (name_1, age_1))
batch.add(SimpleStatement(cql_statement), (name_2, age_2))
batch.add(SimpleStatement(cql_statement), (name_3, age_3))
session.execute(batch)
EDITED:
As a result, I abandoned BatchStatement after the comments left at the bottom of this post. I beg you to pay attention to them! CQL batches are not the same as RDBMS batches: they are not an optimization, but a way to achieve atomic updates of a denormalized record across multiple tables.
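For inserts like these that don't need atomicity, a common alternative is a prepared statement executed concurrently. A minimal sketch using the driver's execute_concurrent_with_args helper, with the table and column names taken from the question:
from cassandra.concurrent import execute_concurrent_with_args

# Prepare once, then run the inserts concurrently instead of batching them.
prepared = session.prepare("INSERT INTO users (name, age) VALUES (?, ?)")
rows = [("Bob", 25), ("Jack", 30), ("Alex", 18)]

results = execute_concurrent_with_args(session, prepared, rows, concurrency=50)
for success, result_or_exc in results:
    if not success:
        print("insert failed:", result_or_exc)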

How to pass dataset id to bigquery client for python

I just started playing around with bigquery, and I am trying to pass the dataset id to the python client. It should be a pretty basic operation, but I can't find it on other threads.
In practice I would like to take the following example
# import packages
import os
from google.cloud import bigquery
# set current work directory to the one with this script.
os.chdir(os.path.dirname(os.path.abspath(__file__)))
# initialize client object using the bigquery key I generated from Google clouds
google_credentials_path = 'bigquery-stackoverflow-DC-fdb49371cf87.json'
client = bigquery.Client.from_service_account_json(google_credentials_path)
# create simple query
query_job = client.query(
    """
    SELECT
      CONCAT(
        'https://stackoverflow.com/questions/',
        CAST(id as STRING)) as url,
      view_count
    FROM `bigquery-public-data.stackoverflow.posts_questions`
    WHERE tags like '%google-bigquery%'
    ORDER BY view_count DESC
    LIMIT 10"""
)
# store results in dataframe
dataframe_query = query_job.result().to_dataframe()
and make it look something like
# import packages
import os
from google.cloud import bigquery
# set current work directory to the one with this script.
os.chdir(os.path.dirname(os.path.abspath(__file__)))
# initialize client object using the bigquery key I generated from Google clouds
google_credentials_path = 'bigquery-stackoverflow-DC-fdb49371cf87.json'
client = bigquery.Client.from_service_account_json(google_credentials_path)\
.A_function_to_specify_id(bigquery-public-data.stackoverflow)
# create simple query
query_job = client.query(
    """
    SELECT
      CONCAT(
        'https://stackoverflow.com/questions/',
        CAST(id as STRING)) as url,
      view_count
    FROM `posts_questions` -- No dataset ID here anymore
    WHERE tags like 'google-bigquery'
    ORDER BY view_count DESC
    LIMIT 10"""
)
# store results in dataframe
dataframe_query = query_job.result().to_dataframe()
The documentation eludes me, so any help would be appreciated.
The closest thing to what you're asking for is the default_dataset (reference) property of the query job config. The query job config is an optional object that can be passed into the query() method of the instantiated BigQuery client.
You don't set a default dataset as part of instantiating a client, as not all resources are dataset-scoped. You're implicitly working with a query job in your example, which is a project-scoped resource.
So, to adapt your sample a bit, it might look something like this:
# skip the irrelevant bits like imports and client construction
job_config = bigquery.QueryJobConfig(default_dataset="bigquery-public-data.stackoverflow")
sql = "SELECT COUNT(1) FROM posts_questions WHERE tags like 'google-bigquery'"
dataframe = client.query(sql, job_config=job_config).to_dataframe()
If you're issuing multiple queries against this same dataset you could certainly reuse the same job config object with multiple query invocations.
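For example, a sketch reusing the client setup from the question; the second query against posts_answers is just an assumed illustration of reusing the same config:
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("bigquery-stackoverflow-DC-fdb49371cf87.json")
job_config = bigquery.QueryJobConfig(default_dataset="bigquery-public-data.stackoverflow")

# Both queries resolve unqualified table names against the default dataset.
top_questions = client.query(
    "SELECT title, view_count FROM posts_questions ORDER BY view_count DESC LIMIT 10",
    job_config=job_config,
).to_dataframe()

answer_counts = client.query(
    "SELECT owner_user_id, COUNT(1) AS answers FROM posts_answers "
    "GROUP BY owner_user_id ORDER BY answers DESC LIMIT 10",
    job_config=job_config,
).to_dataframe()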

How to load mongodb databases with special character in their names in pandas dataframe?

I'm trying to import the mongodb collection data into a pandas dataframe. When the database name is simple, like 'admin', it loads into the dataframe fine. However, when I try with one of my required databases, named asdev-Admin (line 5), I get an empty dataframe. Apparently the error is related to the special character in the db name, but I don't know how to get around it. How do I resolve this?
import pymongo
import pandas as pd
from pymongo import MongoClient
client = MongoClient()
db = client.asdev-Admin
collection = db.system.groups
data = pd.DataFrame(list(collection.find()))
print(data)
The error states: NameError: name 'Admin' is not defined
You can change db = client.asdev-Admin to db = client['asdev-Admin'].
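Putting it together, a minimal sketch of the fixed snippet (bracket access also works for the dotted collection name):
import pandas as pd
from pymongo import MongoClient

client = MongoClient()
# Dictionary-style access avoids the hyphen being parsed as a minus sign.
db = client["asdev-Admin"]
collection = db["system.groups"]  # equivalent to db.system.groups
data = pd.DataFrame(list(collection.find()))
print(data)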

How to create a SQLite Table with a JSON column using SQLAlchemy?

According to this answer, SQLite supports JSON data since version 3.9. I use version 3.24 in combination with SQLAlchemy (1.2.8) and Python 3.6, but I cannot create any tables containing JSON columns.
What am I missing or doing wrong? A minimal (not) working example is given below:
import sqlalchemy as sa
import os
import tempfile
metadata = sa.MetaData()
foo = sa.Table(
    'foo',
    metadata,
    sa.Column('bar', sa.JSON)
)
tmp_dir = tempfile.mkdtemp()
dbname = os.path.join(tmp_dir, 'foo.db')
engine = sa.create_engine('sqlite:////' + dbname)
metadata.bind = engine
metadata.create_all()
This fails giving the following error:
sqlalchemy.exc.CompileError: (in table 'foo', column 'bar'): Compiler <sqlalchemy.dialects.sqlite.base.SQLiteTypeCompiler object at 0x7f1eae1dab70> can't render element of type <class 'sqlalchemy.sql.sqltypes.JSON'>
Thanks!
Use a TEXT column. SQLite has a JSON1 extension with functions for working with JSON data, but no dedicated JSON column type.
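A minimal sketch of that workaround, serializing with the json module around a plain Text column (an in-memory SQLite database is used here purely for illustration):
import json
import sqlalchemy as sa

metadata = sa.MetaData()
foo = sa.Table(
    'foo',
    metadata,
    sa.Column('bar', sa.Text)  # store JSON as serialized text
)

engine = sa.create_engine('sqlite://')  # in-memory database for the example
metadata.create_all(engine)

with engine.connect() as conn:
    conn.execute(foo.insert(), {'bar': json.dumps({'answer': 42})})
    stored = conn.execute(sa.select([foo.c.bar])).scalar()
    print(json.loads(stored))  # -> {'answer': 42}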
