How to convert Hive schema to Bigquery schema using Python? - python

What i get from api:
"name":"reports"
"col_type":"array<struct<imageUrl:string,reportedBy:string>>"
So in hive schema I got:
reports array<struct<imageUrl:string,reportedBy:string>>
Note: I got hive array schema as string from api
My target:
bigquery.SchemaField("reports", "RECORD", mode="NULLABLE",
fields=(
bigquery.SchemaField('imageUrl', 'STRING'),
bigquery.SchemaField('reportedBy', 'STRING')
)
)
Note: I would like to create universal code that can handle when i receive any number of struct inside of the array.
Any tips are welcome.

I tried creating a script that parses your input which is reports array<struct<imageUrl:string,reportedBy:string>>. This converts your input to a dictionary that could be used as schema when creating a table. The main idea of the apporach is instead of using SchemaField(), you can create a dictionary which is much easier than creating SchemaField() objects with parameters using your example input.
NOTE: The script is only tested based on your input and it can parse more fields if added in struct<.
import re
from google.cloud import bigquery
def is_even(number):
if (number % 2) == 0:
return True
else:
return False
def clean_string(str_value):
return re.sub(r'[\W_]+', '', str_value)
def convert_to_bqdict(api_string):
"""
This only works for a struct with multiple fields
This could give you an idea on constructing a schema dict for BigQuery
"""
num_even = True
main_dict = {}
struct_dict = {}
field_arr = []
schema_arr = []
# Hard coded this since not sure what the string will look like if there are more inputs
init_struct = sample.split(' ')
main_dict["name"] = init_struct[0]
main_dict["type"] = "RECORD"
main_dict["mode"] = "NULLABLE"
cont_struct = init_struct[1].split('<')
num_elem = len(cont_struct)
# parse fields inside of struct<
for i in range(0,num_elem):
num_even = is_even(i)
# fields are seen on even indices
if num_even and i != 0:
temp = list(filter(None,cont_struct[i].split(','))) # remove blank elements
for elem in temp:
fields = list(filter(None,elem.split(':')))
struct_dict["name"] = clean_string(fields[0])
# "type" works for STRING as of the moment refer to
# https://cloud.google.com/bigquery/docs/schemas#standard_sql_data_types
# for the accepted data types
struct_dict["type"] = clean_string(fields[1]).upper()
struct_dict["mode"] = "NULLABLE"
field_arr.append(struct_dict)
struct_dict = {}
main_dict["fields"] = field_arr # assign dict to array of fields
schema_arr.append(main_dict)
return schema_arr
sample = "reports array<struct<imageUrl:string,reportedBy:string,newfield:bool>>"
bq_dict = convert_to_bqdict(sample)
client = bigquery.Client()
project = client.project
dataset_ref = bigquery.DatasetReference(project, '20211228')
table_ref = dataset_ref.table("20220203")
table = bigquery.Table(table_ref, schema=bq_dict)
table = client.create_table(table)
Output:

Related

Databricks DLT reading a table from one schema(bronze), process CDC data and store to another schema (processed)

I am developing an ETL pipeline using databricks DLT pipelines for CDC data that I recieve from kafka. I have created 2 pipelines successfully for landing, and raw zone. The raw one will have operation flag, a sequence column, and I would like to process the CDC and store the clean data in processed layer (SCD 1 type). I am having difficulties in reading table from one schema, apply CDC changes, and load to target db schema tables.
I have 100 plus tables, so i am planning to loop through the tables in RAW layer and apply CDC, move to processed layer. Following is my code that I have tried (I have left the commented code just for your reference).
import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *
raw_db_name = "raw_db"
processed_db_name = "processed_db_name"
def generate_curated_table(src_table_name, tgt_table_name, df):
# #dlt.view(
# name= src_table_name,
# spark_conf={
# "pipelines.incompatibleViewCheck.enabled": "false"
# },
# comment="Processed data for " + str(src_table_name)
# )
# # def create_target_table():
# # return (df)
# dlt.create_target_table(name=tgt_table_name,
# comment= f"Clean, merged {tgt_table_name}",
# #partition_cols=["topic"],
# table_properties={
# "quality": "silver"
# }
# )
# #dlt.view
# def users():
# return spark.readStream.format("delta").table(src_table_name)
#dlt.view
def raw_tbl_data():
return df
dlt.create_target_table(name=tgt_table_name,
comment="Clean, merged customers",
table_properties={
"quality": "silver"
})
dlt.apply_changes(
target = tgt_table_name,
source = f"{raw_db_name}.raw_tbl_data,
keys = ["id"],
sequence_by = col("timestamp_ms"),
apply_as_deletes = expr("op = 'DELETE'"),
apply_as_truncates = expr("op = 'TRUNCATE'"),
except_column_list = ["id", "timestamp_ms"],
stored_as_scd_type = 1
)
return
tbl_name = 'raw_po_details'
df = spark.sql(f'select * from {raw_dbname}.{tbl_name}')
processed_tbl_name = tbl_name.replace("raw", "processed") //processed_po_details
generate_curated_table(tbl_name, processed_tbl_name, df)
I have tried with dlt.view(), dlt.table(), dlt.create_streaming_live_table(), dlt.create_target_table(), but ending up with either of the following errors:
AttributeError: 'function' object has no attribute '_get_object_id'
pyspark.sql.utils.AnalysisException: Failed to read dataset '<raw_db_name.mytable>'. Dataset is not defined in the pipeline
.Expected result:
Read the dataframe which is passed as a parameter (RAW_DB) and
Create new tables in PROCESSED_DB which is configured in DLT pipeline settings
https://www.databricks.com/blog/2022/04/27/how-uplift-built-cdc-and-multiplexing-data-pipelines-with-databricks-delta-live-tables.html
https://cprosenjit.medium.com/databricks-delta-live-tables-job-workflows-orchestration-patterns-bc7643935299
Appreciate any help please.
Thanks in advance
I got the solution myself and got it working, thanks to all. Am adding my solution so it could be a reference to others.
import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *
def generate_silver_tables(target_table, source_table):
#dlt.table
def customers_filteredB():
return spark.table("my_raw_db.myraw_table_name")
### Create the target table definition
dlt.create_target_table(name=target_table,
comment= f"Clean, merged {target_table}",
#partition_cols=["topic"],
table_properties={
"quality": "silver",
"pipelines.autoOptimize.managed": "true"
}
)
## Do the merge
dlt.apply_changes(
target = target_table,
source = "customers_filteredB",
keys = ["id"],
apply_as_deletes = expr("operation = 'DELETE'"),
sequence_by = col("timestamp_ms"),#primary key, auto-incrementing ID of any kind that can be used to identity order of events, or timestamp
ignore_null_updates = False,
except_column_list = ["operation", "timestamp_ms"],
stored_as_scd_type = "1"
)
return
raw_dbname = "raw_db"
raw_tbl_name = 'raw_table_name'
processed_tbl_name = raw_tbl_name.replace("raw", "processed")
generate_silver_tables(processed_tbl_name, raw_tbl_name)

Problems querying with python to BigQuery (Python String Format)

I am trying to make a query to BigQuery in order to modify all the values of a row (in python). When I use a simple string to query, I have no problems. Nevertheless, when I introduce the string formatting the query does not work. As follows I'm presenting the same query, but diminishing the number of columns that I am modifying.
I already made the connection to BigQuery, by defining the Client, etc (and works properly).
I tried:
"UPDATE `riscos-dev.survey_test.data-test-bdrn` SET informaci_meteorol_gica = {inf}, risc = {ri} WHERE objectid = {obj_id}".format(inf = df.informaci_meteorol_gica[index], ri = df.risc[index], obj_id = df.objectid[index])
To specify the input values in format:
df.informaci_meteorol_gica[index] = 'Neu' , also a string for df.risc[index] and df.objectid[index] = 3
I am obtaining the following error message:
BadRequest: 400 Braced constructors are not supported at [1:77]
Instead of using format method of string, I propose you another approach with the f string formating in Python :
def build_query():
inf = "'test_inf'"
ri = "'test_ri'"
obj_id = "'test_obj_id'"
return f"UPDATE `riscos-dev.survey_test.data-test-bdrn` SET informaci_meteorol_gica = {inf}, risc = {ri} WHERE objectid = {obj_id}"
if __name__ == '__main__':
query = build_query()
print(query)
The result is :
UPDATE `riscos-dev.survey_test.data-test-bdrn` SET informaci_meteorol_gica = 'test_inf', risc = 'test_ri' WHERE objectid = 'test_obj_id'
I mocked the query params in my example with :
inf = "'test_inf'"
ri = "'test_ri'"
obj_id = "'test_obj_id'"

Data output is not the same as inside function

I am currently having an issue where I am trying to store data in a list (using dataclasses). When I print the data inside the list in the function (PullIncursionData()) it responded with a certain amount of numbers (never the same, not possible due to it's nature). When printing it after it being called to store it's return in a Var it somehow prints only the same number.
I cannot share the numbers, as they update with EVE Online's API, so the only way is to run it locally and read the first list yourself.
The repository is Here: https://github.com/AtherActive/EVEAPI-Demo
Heads up! Inside the main.py (the file with issues) (a snippet of code is down below) are more functions. All functions from line 90 and forward are important, the rest can be ignored for this question, as they do not interact with the other functions.
def PullIncursionData():
#Pulls data from URL and converts it into JSON
url = 'https://esi.evetech.net/latest/incursions/?datasource=tranquility'
data = rq.get(url)
jsData = data.json()
#Init var to store incursions
incursions = []
#Set lenght for loop. yay
length = len(jsData)
# Every loop incursion data will be read by __parseIncursionData(). It then gets added to var Incursions.
for i in range(length):
# Add data to var Incursion.
incursions.append(__parseIncursionData(jsData, i))
# If Dev mode, print some debug. Can be toggled in settings.py
if settings.developerMode == 1:
print(incursions[i].constellation_id)
return incursions
# Basically parses the input data in a decent manner. No comments needed really.
def __parseIncursionData(jsData, i):
icstruct = stru.Incursion
icstruct.constellation_id = jsData[i]['constellation_id']
icstruct.constellation_name = 'none'
icstruct.staging = jsData[i]['staging_solar_system_id']
icstruct.region_name = ResolveSystemNames(icstruct.constellation_id, 'con-reg')
icstruct.status = jsData[i]['state']
icstruct.systems_id = jsData[i]['infested_solar_systems']
icstruct.systems_names = ResolveSystemNames(jsData[i]['infested_solar_systems'], 'system')
return icstruct
# Resolves names for systems, regions and constellations. Still WIP.
def ResolveSystemNames(id, mode='constellation'):
#init value
output_name = 'none'
# If constellation, pull data and find region name.
if mode == 'con-reg':
url = 'https://www.fuzzwork.co.uk/api/mapdata.php?constellationid={}&format=json'.format(id)
data = rq.get(url)
jsData = data.json()
output_name = jsData[0]['regionname']
# Pulls system name form Fuzzwork.co.uk.
elif mode == 'system':
#Convert output to a list.
output_name = []
lenght = len(id)
# Pulls system name from Fuzzwork. Not that hard.
for i in range(lenght):
url = 'https://www.fuzzwork.co.uk/api/mapdata.php?solarsystemid={}&format=json'.format(id[i])
data = rq.get(url)
jsData = data.json()
output_name.append(jsData[i]['solarsystemname'])
return output_name
icdata = PullIncursionData()
print('external data check:')
length = len(icdata)
for i in range(length):
print(icdata[i].constellation_id)
structures.py (custom file)
#dataclass
class Incursion:
constellation_id = int
constellation_name = str
staging = int
staging_name = str
systems_id = list
systems_names = list
region_name = str
status = str
def ___init___(self):
self.constellation_id = -1
self.constellation_name = 'undefined'
self.staging = -1
self.staging_name = 'undefined'
self.systems_id = []
self.systems_names = []
self.region_name = 'undefined'
self.status = 'unknown'

how to call _all_docs primary index with list of keys using python-cloudant?

I'm currently using the primary index to query for a list of keys in a Cloudant database:
class DAO:
#staticmethod
def get_movie_names(movie_ids: List[int]) -> Dict[int, str]:
# The movie_ids in cloudant are stored as strings so convert to
# correct format for querying
movie_ids = [ str(id) for id in movie_ids ]
keys = urllib.parse.quote_plus(json.dumps(movie_ids))
# The movie id is stored in the _id field, so we query it
# using the 'keys' parameter
end_point = '{0}/{1}/_all_docs?keys={2}&include_docs=true'.format (
CL_URL, CL_MOVIEDB, keys
)
response = cloudant_client.r_session.get(end_point)
movie_data = json.loads(response.text)
movie_names = {}
if 'rows' in movie_data:
for row in movie_data['rows']:
if 'doc' in row:
movie_id = int(row['key'])
movie_name = row['doc']['name']
movie_names[movie_id] = movie_name
return movie_names
It looks as though I may be able to achieve this using the cloudant.result.Result. If I've understood the documentation correctly, this will return all documents and you can then filter the returned results. However, I would like to filter by passing parameters to the Cloudant request so that I only return the data I'm interested in. Is this possible?
Chris - CloudantDatabase gives you access to all_docs, is that what you want?
http://python-cloudant.readthedocs.io/en/latest/database.html#cloudant.database.CouchDatabase.all_docs

Selecting values from a JSON file in Python

I am getting JIRA data using the following python code,
how do I store the response for more than one key (my example shows only one KEY but in general I get lot of data) and print only the values corresponding to total,key, customfield_12830, summary
import requests
import json
import logging
import datetime
import base64
import urllib
serverURL = 'https://jira-stability-tools.company.com/jira'
user = 'username'
password = 'password'
query = 'project = PROJECTNAME AND "Build Info" ~ BUILDNAME AND assignee=ASSIGNEENAME'
jql = '/rest/api/2/search?jql=%s' % urllib.quote(query)
response = requests.get(serverURL + jql,verify=False,auth=(user, password))
print response.json()
response.json() OUTPUT:-
http://pastebin.com/h8R4QMgB
From the the link you pasted to pastebin and from the json that I saw, its a you issues as list containing key, fields(which holds custom fields), self, id, expand.
You can simply iterate through this response and extract values for keys you want. You can go like.
data = response.json()
issues = data.get('issues', list())
x = list()
for issue in issues:
temp = {
'key': issue['key'],
'customfield': issue['fields']['customfield_12830'],
'total': issue['fields']['progress']['total']
}
x.append(temp)
print(x)
x is list of dictionaries containing the data for fields you mentioned. Let me know if I have been unclear somewhere or what I have given is not what you are looking for.
PS: It is always advisable to use dict.get('keyname', None) to get values as you can always put a default value if key is not found. For this solution I didn't do it as I just wanted to provide approach.
Update: In the comments you(OP) mentioned that it gives attributerror.Try this code
data = response.json()
issues = data.get('issues', list())
x = list()
for issue in issues:
temp = dict()
key = issue.get('key', None)
if key:
temp['key'] = key
fields = issue.get('fields', None)
if fields:
customfield = fields.get('customfield_12830', None)
temp['customfield'] = customfield
progress = fields.get('progress', None)
if progress:
total = progress.get('total', None)
temp['total'] = total
x.append(temp)
print(x)

Categories