How to define BigQuery schema using Standard SQL? - python

I'd like to use BigQuery Standard SQL in a new project, however I am not able to find any examples on how to define the schema, everything points at Legacy SQL. In particular, I want to use ARRAY and STRUCT.

One way to create a table in BigQuery is through API calls; there is no CREATE TABLE syntax.
Creating a table
BigQuery offers various ways to create a new table as detailed here:
You can create an empty table by using the command line tool's bq mk command or by using the BigQuery API tables.insert() method.
You can load a table from a CSV or JSON data file (compressed or uncompressed), from an Avro file, or from a Cloud Datastore backup.
You can create a table from a query result.
You can copy a table.
You can define a table over a file in Cloud Storage.
You can use standard SQL types when you define your table schema (see Elliott's answer below), and there is a ticket open to update the docs as well. Vote/star it here.
There are lots of Python samples on GitHub, as simple as:
from google.cloud import bigquery


def create_table(dataset_name, table_name, project=None):
    """Creates a simple table in the given dataset.

    If no project is specified, then the currently active project is used.
    """
    bigquery_client = bigquery.Client(project=project)
    dataset = bigquery_client.dataset(dataset_name)

    if not dataset.exists():
        print('Dataset {} does not exist.'.format(dataset_name))
        return

    table = dataset.table(table_name)

    # Set the table schema
    table.schema = (
        bigquery.SchemaField('Name', 'STRING'),
        bigquery.SchemaField('Age', 'INTEGER'),
        bigquery.SchemaField('Weight', 'FLOAT'),
    )
    table.create()

    print('Created table {} in dataset {}.'.format(table_name, dataset_name))
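Calling it is then just (assuming a dataset named my_dataset already exists in the active project):

create_table('my_dataset', 'my_new_table')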

You can create a table with a schema that uses standard SQL types. Here is an example of a valid schema:
{
  "a": "ARRAY<STRUCT<x INT64, y STRING>>",
  "b": "STRUCT<z DATE>",
  "c": "INT64"
}
If you put this in a file such as sample_schema.json, you can create a table from it using bq mk:
bq mk --schema sample_schema.json -t your_dataset.YourTableName
Outside of the bq client, the tables.insert API also supports standard SQL type names.
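For completeness, here is a rough, untested Python sketch of such a tables.insert call through the BigQuery REST API (using google-api-python-client; the project, dataset and table names are placeholders, and the schema is written here in the API's name/type field-list form rather than the flat mapping shown above):

from googleapiclient import discovery
import google.auth

# Authenticated BigQuery API client using Application Default Credentials
credentials, project_id = google.auth.default()
service = discovery.build('bigquery', 'v2', credentials=credentials)

# Table definition whose schema uses standard SQL type names
table_body = {
    'tableReference': {
        'projectId': project_id,
        'datasetId': 'your_dataset',
        'tableId': 'YourTableName',
    },
    'schema': {
        'fields': [
            {'name': 'a', 'type': 'ARRAY<STRUCT<x INT64, y STRING>>'},
            {'name': 'b', 'type': 'STRUCT<z DATE>'},
            {'name': 'c', 'type': 'INT64'},
        ]
    },
}

service.tables().insert(
    projectId=project_id,
    datasetId='your_dataset',
    body=table_body,
).execute()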

Related

Can't associate temp view with database in spark session

I'm trying to create a temp view using Spark, from a CSV file.
To reproduce my production scenario, I need to test my script locally; however, in production I'm using AWS Glue Jobs, where there are databases and tables.
In the code below, I'm creating a database in my spark session and using it, after that, I create a temp view.
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName("pulsar_data").getOrCreate()
df = spark.read.format('csv')\
.options(infer_schema=True)\
.options(header=True)\
.load('pulsar_stars.csv')
spark.sql('CREATE DATABASE IF NOT EXISTS MYDB')
spark.sql('USE MYDB')
df.createOrReplaceTempView('MYDB.TB_PULSAR_STARS')
spark.catalog.listTables()
spark.sql('SELECT * FROM MYDB.TB_PULSAR_STARS').show()
However, when I try to select db.table, Spark can't find the relation between my temp view and my database and throws the following error:
*** pyspark.sql.utils.AnalysisException: Table or view not found: MYDB.TB_PULSAR_STARS; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [MYDB, TB_PULSAR_STARS], [], false
Debugging my code with pdb, I have listed my spark session catalog, where I find that my table is in fact associated:
(Pdb) spark.catalog.listTables()
[Table(name='tb_pulsar_stars', database='MYDB', description=None, tableType='TEMPORARY', isTemporary=True)]
How can I make this relationship work?
A temporary view name associated with a DataFrame can only be a single segment. This is explicitly checked here in the Spark code. I would expect your code to throw AnalysisException: CREATE TEMPORARY VIEW or the corresponding Dataset APIs only accept single-part view names, but got: MYDB.TB_PULSAR_STARS - not sure why in your case the error is a bit different.
Anyway, use:
df.createOrReplaceTempView('TB_PULSAR_STARS')
spark.sql('SELECT * FROM TB_PULSAR_STARS').show()
And if you need to actually write this data to a table, create it using:
spark.sql("CREATE TABLE MYDB.TB_PULSAR_STARS AS SELECT * FROM TB_PULSAR_STARS")

Migrating multiple tables data from SQL Server to Oracle

I have a scenario where I need to migrate SQL Server tables (30-40 tables) to Oracle. I cannot depend on SSIS, as the number of tables to be migrated to Oracle will change regularly and I cannot always create or update a DFT whenever there is a change in schema.
Is there any other way in which the movement of data can be handled dynamically and work effectively, for example using Python or another programming language?
C# approach - SchemaMapper library
Since you are open to a solution using a programming language, I think you can benefit from the SchemaMapper class library, an open-source project published on GitHub. A full description can be found in the Readme file at the link above.
Important note: Yesterday I added support for reading data from databases (SQL Server, Oracle, ...) and the ability to export data to Oracle.
In this answer I will show how to import SQL Server tables, create the appropriate SchemaMapper class for each one (since they have different schemas and you need to import them into different schemas), and export the data to Oracle.
//First of all, list the table names that need to be imported
string[] TableNameFilter = new[] { "Table1", "Table2" };

//Create an instance of the Oracle export class
SchemaMapper.Exporters.OracleExport expOracle = new SchemaMapper.Exporters.OracleExport(oracleconnectionstring);

//Create a SQL Server import class
using (SchemaMapper.Converters.SqlServerCeImport ssImport = new SchemaMapper.Converters.SqlServerCeImport(sqlconnectionstring))
{
    //Retrieve table names
    ssImport.getSchemaTable();

    //Loop over tables matching the filter
    foreach (DataRow drRowSchema in ssImport.SchemaTable.AsEnumerable().Where(x =>
             TableNameFilter.Contains(x["TABLE_NAME"].ToString())).ToList())
    {
        string SQLTableName = drRowSchema["TABLE_NAME"].ToString();
        string SQLTableSchema = drRowSchema["TABLE_SCHEMA"].ToString();
        DataTable dtSQL = ssImport.GetDataTable(SQLTableSchema, SQLTableName);

        //Create a schema mapping class
        using (SchemaMapper.SchemaMapping.SchemaMapper sm = new SchemaMapper.SchemaMapping.SchemaMapper(SQLTableSchema, SQLTableName))
        {
            foreach (DataColumn dc in dtSQL.Columns)
            {
                SchemaMapper_Column smCol = new SchemaMapper_Column();
                smCol.Name = dc.ColumnName;
                smCol.DataType = smCol.GetCorrespondingDataType(dc.DataType.ToString(), dc.MaxLength);
                sm.Columns.Add(smCol);
            }

            //Create the destination table in Oracle
            expOracle.CreateDestinationTable(sm);

            //Insert data
            expOracle.InsertUsingOracleBulk(sm, dtSQL);

            //There are other methods, such as:
            //expOracle.InsertIntoDb(sm, dtSQL);
            //expOracle.InsertIntoDbWithParameters(sm, dtSQL);
        }
    }
}
Note: this is an open-source project; it is not fully tested and not all data types are supported. If you encounter errors, feel free to give feedback or open an issue on GitHub.
Other approach - SQL Server Import and Export Wizard
If you can do this without scheduling a Job, then you can use the Import and Export Wizard which allows you to import multiple tables into Oracle without the need to build the packages manually. It will create packages, destination tables, map columns and import data.
Start the SQL Server Import and Export Wizard
Connect to an Oracle Data Source (SQL Server Import and Export Wizard)
Here is the approach I decided to go with, considering the time constraint (using C# was taking more time). For an 8 GB table it takes 11 minutes to move the data from SQL Server to Oracle.
Steps:
Dump the SQL Server table data into flat files (used BIML to automate the DFT creation).
Transfer these flat files to the destination server.
Use SQL*Loader to load the data from the flat files into Oracle.
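Since the question also asks about Python, below is a rough, untested sketch of the same dynamic idea using pyodbc and python-oracledb. The connection strings and the table list are placeholders, and it assumes the destination tables already exist in Oracle with matching column names.

import pyodbc
import oracledb

# Placeholder connection details
SQL_SERVER_CONN = 'DRIVER={ODBC Driver 17 for SQL Server};SERVER=src_host;DATABASE=src_db;UID=user;PWD=pwd'
ORACLE_USER, ORACLE_PWD, ORACLE_DSN = 'user', 'pwd', 'ora_host/service'
TABLES_TO_COPY = ['Table1', 'Table2']  # changes regularly, so keep it as data, not as packages

src = pyodbc.connect(SQL_SERVER_CONN)
dst = oracledb.connect(user=ORACLE_USER, password=ORACLE_PWD, dsn=ORACLE_DSN)

for table in TABLES_TO_COPY:
    src_cur = src.cursor()
    src_cur.execute('SELECT * FROM [{}]'.format(table))

    # Column names come from the cursor metadata, so nothing is hard-coded per table
    columns = [col[0] for col in src_cur.description]
    placeholders = ', '.join(':{}'.format(i + 1) for i in range(len(columns)))
    insert_sql = 'INSERT INTO {} ({}) VALUES ({})'.format(table, ', '.join(columns), placeholders)

    dst_cur = dst.cursor()
    while True:
        rows = src_cur.fetchmany(10000)  # stream in batches to keep memory bounded
        if not rows:
            break
        dst_cur.executemany(insert_sql, [tuple(r) for r in rows])
    dst.commit()

src.close()
dst.close()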

AWS Glue and update duplicating data

I'm using AWS Glue to move multiple files from S3 to an RDS instance. Each day I get a new file into S3 which may contain new data, but can also contain a record I have already saved, with some updated values. If I run the job multiple times I will of course get duplicate records in the database. Instead of multiple records being inserted, I want Glue to try to update that record if it notices a field has changed; each record has a unique id. Is this possible?
I followed the approach suggested as the 2nd option by Yuriy: get the existing data as well as the new data, then do some processing to merge the two of them, and write with overwrite mode. The following code should help you get an idea of how to solve this problem.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Get your source data
src_data = glueContext.create_dynamic_frame.from_catalog(database=src_db, table_name=src_tbl)
src_df = src_data.toDF()

# Get your destination data
dst_data = glueContext.create_dynamic_frame.from_catalog(database=dst_db, table_name=dst_tbl)
dst_df = dst_data.toDF()

# Merge the two DataFrames; deduplicate on your unique id as needed
# (union alone keeps both the old and the updated version of a row)
merged_df = dst_df.union(src_df)

# Finally, save the data to the destination with OVERWRITE mode
merged_df.write.format('jdbc').options(
    url=dest_jdbc_url,
    user=dest_user_name,
    password=dest_password,
    dbtable=dest_tbl
).mode("overwrite").save()
Unfortunately, there is no elegant way to do it with Glue. If you were writing to Redshift, you could use postactions to implement a Redshift merge operation. However, it's not possible for other JDBC sinks (as far as I know).
Alternatively in your ETL script you can load existing data from a database to filter out existing records before saving. However if your DB table is big then the job may take a while to process it.
Another approach is to write into a staging table with mode 'overwrite' first (replace existing staging data) and then make a call to a DB via API to copy new records only into a final table.
I have used INSERT INTO table ... ON DUPLICATE KEY for UPSERTs into an Aurora RDS instance running the MySQL engine. Maybe this would be a reference for your use case. We cannot do this through the JDBC writer, since only APPEND, OVERWRITE and ERROR modes are currently supported.
I am not sure which RDS database engine you are using; the following is an example of MySQL UPSERTs.
Please see this reference, where I have posted a solution using INSERT INTO TABLE ... ON DUPLICATE KEY for MySQL:
Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array
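For illustration, here is a minimal sketch of that upsert pattern with PyMySQL (hypothetical table and column names; it assumes id is the primary or a unique key):

import pymysql

conn = pymysql.connect(host='your-rds-endpoint', user='user', password='pwd', database='your_db')

rows = [(1, 'Alice', 30), (2, 'Bob', 31)]  # e.g. collected from the Glue DynamicFrame

upsert_sql = (
    "INSERT INTO my_table (id, name, age) "
    "VALUES (%s, %s, %s) "
    "ON DUPLICATE KEY UPDATE name = VALUES(name), age = VALUES(age)"
)

with conn.cursor() as cur:
    cur.executemany(upsert_sql, rows)  # inserts new ids, updates existing ones
conn.commit()
conn.close()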

Google Cloud DataLab + BigQuery: how to set region/zone/location

I'm using Datalab for a Python notebook that loads data from Cloud Storage into BigQuery, basically following this example.
I then saw that my original data in the Cloud Storage bucket is in the EU (eu-west3-a), the VM that executes the Datalab is in the same region, but the final data in BigQuery is in the US.
According to this post I tried setting the location for the dataset in code, but it did not work. This is because there is no such option defined in the Datalab.Bigquery Python module.
So my question is: How do I set the location (zone and region) for the BigQuery dataset and its containing tables?
This is my code:
# data: https://www.kaggle.com/benhamner/sf-bay-area-bike-share/data
%%gcs read --object gs://my_bucket/kaggle/station.csv --variable stations
# CSV will be read as bytes first
df_stations = pd.read_csv(StringIO(stations))
schema = bq.Schema.from_data(df_stations)
# Create an empty dataset
#bq.Dataset('kaggle_bike_rentals').create(location='europe-west3-a')
bq.Dataset('kaggle_bike_rentals').create()
# Create an empty table within the dataset
table_stations = bq.Table('kaggle_bike_rentals.stations').create(schema = schema, overwrite = True)
# load data directly from cloud storage into the bigquery table. the locally loaded Pandas dataframe won't be used here
table_stations.load('gs://my_bucket/kaggle/station.csv', mode='append', source_format = 'csv', csv_options=bq.CSVOptions(skip_leading_rows = 1))
Update: In the meantime, I manually created the dataset in the BigQuery web UI and used it in the code without creating it there. Now an exception is raised if the dataset does not exist, which prevents the code from creating one that would end up in the default US location.
Have you tried bq.Dataset('[your_dataset]').create(location='EU')?
BigQuery locations are set on a dataset level. Tables take their location based on the dataset they are in.
This is how you set the location of a dataset, at least outside of Datalab:
from google.cloud import bigquery
bigquery_client = bigquery.Client(project='your_project')
dataset_ref = bigquery_client.dataset('your_dataset_name')
dataset = bigquery.Dataset(dataset_ref)
dataset.location = 'EU'
dataset = bigquery_client.create_dataset(dataset)
Based on the code snippet from here: https://cloud.google.com/bigquery/docs/datasets
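As a follow-up sketch (reusing bigquery_client and dataset_ref from the snippet above, with placeholder names), once the dataset exists in the EU you can load the Cloud Storage CSV into it with the same client, keeping the job location in line with the dataset:

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1
job_config.autodetect = True  # or supply an explicit schema

load_job = bigquery_client.load_table_from_uri(
    'gs://my_bucket/kaggle/station.csv',
    dataset_ref.table('stations'),
    location='EU',  # must match the dataset location
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish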

Dynamically creating table from csv file using psycopg2

I would like to get some clarity on a question I was pretty sure was clear to me. Is there any way to create a table using psycopg2 (or any other Python Postgres database adapter) with a name corresponding to the .csv file and, probably most importantly, with the columns that are specified in the .csv file?
I'll leave you to look at the psycopg2 library properly - this is off the top of my head (I've not had to use it for a while, but IIRC the documentation is ample).
The steps are:
Read the column names from the CSV file
Build a "CREATE TABLE whatever (...)" statement
Maybe INSERT the data
import csv
import os.path

my_csv_file = '/home/somewhere/file.csv'

# Table name = file name without path or extension
table_name = os.path.splitext(os.path.split(my_csv_file)[1])[0]

# Column names = first row of the CSV
cols = next(csv.reader(open(my_csv_file)))
You can go from there...
Build the SQL statement (possibly using a templating engine for the fields) and then issue the INSERTs if need be.
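A rough sketch of the whole flow with psycopg2 (the connection string is a placeholder, every column is created as TEXT for simplicity, and the CSV is assumed to have a header row):

import csv
import os.path

import psycopg2
from psycopg2 import sql

my_csv_file = '/home/somewhere/file.csv'
table_name = os.path.splitext(os.path.split(my_csv_file)[1])[0]

with open(my_csv_file, newline='') as f:
    cols = next(csv.reader(f))

conn = psycopg2.connect('dbname=mydb user=me password=secret host=localhost')
with conn, conn.cursor() as cur:
    # Build CREATE TABLE with safely quoted identifiers; all columns as TEXT
    create_stmt = sql.SQL('CREATE TABLE {} ({})').format(
        sql.Identifier(table_name),
        sql.SQL(', ').join(
            sql.SQL('{} TEXT').format(sql.Identifier(c)) for c in cols
        ),
    )
    cur.execute(create_stmt)

    # Bulk-load the rows with COPY, skipping the header line
    copy_stmt = sql.SQL('COPY {} FROM STDIN WITH CSV HEADER').format(
        sql.Identifier(table_name)
    )
    with open(my_csv_file, newline='') as f:
        cur.copy_expert(copy_stmt.as_string(conn), f)
conn.close()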
