I have a file.json in my S3 bucket; it contains a list of JSON objects.
For instance, when I download it and parse it with Python's json.load I get a list:
[{'k': 'calendar#event'}, {'k': 'calendar#event'}]
Loading it into an external table works:
create external table if not exists TEST_111
with location = #TESt111
auto_refresh = true
file_format = (type = json);
but instead of getting a table with 2 rows, I get one row containing the whole list.
Any ideas?
If the value is provided as an array, then STRIP_OUTER_ARRAY could be used:
create external table if not exists TEST_111
with location = #TESt111
auto_refresh = true
file_format = (type = json, STRIP_OUTER_ARRAY=TRUE);
Additionally, if the JSON keys are known in advance, they could be exposed as columns directly in the external table's definition:
create external table if not exists TEST_111
(
filename TEXT AS (metadata$filename)
,k TEXT AS (value:"k"::TEXT)
)
with location = #TESt111
auto_refresh = true
file_format = (type = json, STRIP_OUTER_ARRAY=TRUE);
This is the same as this question, but I also want to limit the depth returned.
Currently, all answers return all the objects after the specified prefix. I want to see just what's in the current hierarchy level.
Current code that returns everything:
self._session = boto3.Session(
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)
self._s3 = self._session.resource("s3")
bucket = self._s3.Bucket(bucket_name)
detections_contents = bucket.objects.filter(Prefix=prefix)
for object_summary in detections_contents:
    print(object_summary.key)
How can I see only the files and folders directly under the prefix? How can I go n levels deep?
I could parse everything locally, but that is clearly not what I am looking for here.
There is no definite way to do this using list objects without getting all the objects under the prefix.
But there is a way using S3 Select, which uses an SQL-like query format, to go n levels deep and get the file content as well as the object keys.
If you are fine with writing SQL then use this.
Reference doc
import boto3
import json
s3 = boto3.client('s3')
bucket_name = 'my-bucket'
prefix = 'my-directory/subdirectory/'
input_serialization = {
'CompressionType': 'NONE',
'JSON': {
'Type': 'LINES'
}
}
output_serialization = {
'JSON': {}
}
# select_object_content queries a single object, so the key of the object to
# query must be supplied (the file name below is a placeholder)
object_key = prefix + 'data.json'
# Set the SQL expression to select the key field from the JSON records
expression = "SELECT s.key FROM S3Object s WHERE s.key LIKE '" + prefix + "%'"
response = s3.select_object_content(
    Bucket=bucket_name,
    Key=object_key,
    ExpressionType='SQL',
    Expression=expression,
    InputSerialization=input_serialization,
    OutputSerialization=output_serialization
)
# The response contains a Payload event stream with the selected data
payload = response['Payload']
for event in payload:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        # Each selected record is returned as one JSON object per line,
        # each with a "key" field taken from the JSON content
        for line in records.splitlines():
            if line:
                item = json.loads(line)
                print(item['key'])
There is no built-in way with the Boto3 or S3 APIs to do this. You'll need some version of processing each level and asking in turn for a list of objects at that level:
import boto3

s3 = boto3.client('s3')

bucket_name = 'my-bucket'  # name of the bucket to list (placeholder)
max_depth = 2
paginator = s3.get_paginator('list_objects_v2')

# Track all prefixes to show with a list of (depth, prefix) tuples
common_prefixes = [(0, "")]

while len(common_prefixes) > 0:
    # Pull out the next prefix to show
    current_depth, current_prefix = common_prefixes.pop(0)

    # Loop through all of the items using a paginator to handle common prefixes with more
    # than a thousand items
    for page in paginator.paginate(Bucket=bucket_name, Prefix=current_prefix, Delimiter='/'):
        for cur in page.get("CommonPrefixes", []):
            # Show each common prefix, here just use a format like the AWS CLI does
            print(" " * 27 + f"PRE {cur['Prefix']}")
            if current_depth < max_depth:
                # This is below the max depth we want to show, so
                # add it to the list to be shown
                common_prefixes.append((current_depth + 1, cur['Prefix']))
        for cur in page.get("Contents", []):
            # Show each item sharing this common prefix using a format like the AWS CLI
            print(f"{cur['LastModified'].strftime('%Y-%m-%d %H:%M:%S')}{cur['Size']:11d} {cur['Key']}")
What I get from the API:
"name":"reports"
"col_type":"array<struct<imageUrl:string,reportedBy:string>>"
So in the Hive schema I got:
reports array<struct<imageUrl:string,reportedBy:string>>
Note: I get the Hive array schema as a string from the API.
My target:
bigquery.SchemaField("reports", "RECORD", mode="NULLABLE",
fields=(
bigquery.SchemaField('imageUrl', 'STRING'),
bigquery.SchemaField('reportedBy', 'STRING')
)
)
Note: I would like to create universal code that can handle any number of struct fields inside the array.
Any tips are welcome.
I tried creating a script that parses your input, which is reports array<struct<imageUrl:string,reportedBy:string>>. It converts your input into a dictionary that can be used as the schema when creating a table. The main idea of the approach is that instead of using SchemaField(), you create a dictionary, which is much easier than constructing SchemaField() objects with parameters from your example input.
NOTE: The script is only tested against your input, though it can parse additional fields if they are added inside struct<.
import re
from google.cloud import bigquery

def is_even(number):
    if (number % 2) == 0:
        return True
    else:
        return False

def clean_string(str_value):
    return re.sub(r'[\W_]+', '', str_value)

def convert_to_bqdict(api_string):
    """
    This only works for a struct with multiple fields
    This could give you an idea on constructing a schema dict for BigQuery
    """
    num_even = True
    main_dict = {}
    struct_dict = {}
    field_arr = []
    schema_arr = []

    # Split the "name array<struct<...>>" string into the field name and its type
    init_struct = api_string.split(' ')
    main_dict["name"] = init_struct[0]
    main_dict["type"] = "RECORD"
    main_dict["mode"] = "NULLABLE"

    cont_struct = init_struct[1].split('<')
    num_elem = len(cont_struct)

    # parse fields inside of struct<
    for i in range(0, num_elem):
        num_even = is_even(i)
        # fields are seen on even indices
        if num_even and i != 0:
            temp = list(filter(None, cont_struct[i].split(',')))  # remove blank elements
            for elem in temp:
                fields = list(filter(None, elem.split(':')))
                struct_dict["name"] = clean_string(fields[0])
                # "type" works for STRING as of the moment; refer to
                # https://cloud.google.com/bigquery/docs/schemas#standard_sql_data_types
                # for the accepted data types
                struct_dict["type"] = clean_string(fields[1]).upper()
                struct_dict["mode"] = "NULLABLE"
                field_arr.append(struct_dict)
                struct_dict = {}

    main_dict["fields"] = field_arr  # assign dict to array of fields
    schema_arr.append(main_dict)
    return schema_arr
sample = "reports array<struct<imageUrl:string,reportedBy:string,newfield:bool>>"
bq_dict = convert_to_bqdict(sample)
client = bigquery.Client()
project = client.project
dataset_ref = bigquery.DatasetReference(project, '20211228')
table_ref = dataset_ref.table("20220203")
table = bigquery.Table(table_ref, schema=bq_dict)
table = client.create_table(table)
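For reference, tracing convert_to_bqdict over the sample string above by hand, bq_dict should come out roughly like this (a derived sketch, not captured program output):

[
    {
        'name': 'reports',
        'type': 'RECORD',
        'mode': 'NULLABLE',
        'fields': [
            {'name': 'imageUrl', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'reportedBy', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'newfield', 'type': 'BOOL', 'mode': 'NULLABLE'}
        ]
    }
]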
I am trying to create Azure Data Factory pipelines and resources using Python. I was successful with certain ADF activities like Lookup and Copy, but the problem I am facing here is that when I try to copy a few tables from SQL to blob using a ForEach activity, it throws the error below.
How would you create activities inside a ForEach activity? Any input is greatly appreciated. Thanks!
Ref: https://learn.microsoft.com/en-us/python/api/azure-mgmt-datafactory/azure.mgmt.datafactory.models.foreachactivity?view=azure-python
Error Message
TypeError: 'CopyActivity' object is not iterable
Code Block
## Lookup Activity
ls_sql_name = 'ls_'+project_name+'_'+src_svr_type+'_dev'
linked_service_name =LinkedServiceReference(reference_name=ls_sql_name)
lkp_act_name ='Get Table Names'
sql_reader_query = "SELECT top 3 name from sys.tables where name like '%dim'"
source = SqlSource(sql_reader_query= sql_reader_query)
dataset= {"referenceName": "ds_sql_Dim_input","type": "DatasetReference"}
LookupActivity_ = LookupActivity(name =lkp_act_name, linked_service_name= linked_service_name, source = source, dataset = dataset
,first_row_only =False)
#create copy activity
ds_name = 'ds_sql_dim_input' #these datasets already created
dsOut_name ='ds_blob_dim_output' #these datasets already created
copy_act_name = 'Copy SQL to Blob(parquet)'
sql_reader_query = {"value": "#item().name","type": "Expression"}
sql_source = SqlSource(sql_reader_query=sql_reader_query)
blob_sink = ParquetSink()
dsin_ref = DatasetReference(reference_name=ds_name)
dsOut_ref = DatasetReference(reference_name=dsOut_name)
copy_activity = CopyActivity(name=copy_act_name,inputs=[dsin_ref], outputs=[dsOut_ref], source=sql_source, sink=blob_sink)
## For Each Activity
pl_name ='pl_Test'
items= {"value": "#activity('Get Table Names').output.value","type": "Expression"}
dependsOn = [{"activity": "Get Table Names","dependencyConditions": ["Succeeded"]}]
ForEachActivity_= ForEachActivity(name = 'Copy tables in loop',items=items,depends_on=dependsOn ,activities =copy_activity)
params_for_pipeline = {}
p_obj = PipelineResource(activities=[LookupActivity_,ForEachActivity_], parameters=params_for_pipeline)
p = adf_client.pipelines.create_or_update(rg_name, df_name, pl_name, p_obj)
activities needs to be a list of Activity objects, and you are passing a single one. Try creating a list, adding the copy activity to it, and then passing that list in the activities parameter.
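For example, building on the code from the question, the ForEach definition would become something like this (only the activities argument changes; everything else stays as in the question):

## For Each Activity - pass the copy activity wrapped in a list
ForEachActivity_ = ForEachActivity(
    name='Copy tables in loop',
    items=items,
    depends_on=dependsOn,
    activities=[copy_activity]  # a list of Activity objects, not a single activity
)

p_obj = PipelineResource(activities=[LookupActivity_, ForEachActivity_], parameters=params_for_pipeline)
p = adf_client.pipelines.create_or_update(rg_name, df_name, pl_name, p_obj)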
I have a zip file in an S3 bucket. I read that zip file in memory (without extracting it to disk) and dump the data from each .csv into tables in my database. But while dumping the tables I get foreign key constraint errors, because the program dumps the CSVs in the order in which they come, so some tables are dumped before others. For example, I have two tables, dealer_data and billing_data, with the following structure:
1) dealer_data
dealer_id : primary key
country
pincode
address
create_date
2) billing_data
bill_id : primary-key
dealer_id : foreign_key
bill_amount
bill_date
In the zip file, billing_data comes ahead of dealer_data, so I get a 'foreign key constraint' error. As a workaround I turned off foreign key checks while making the connection to the database. Is there any other way I can dump the tables into my database in the correct order?
Can I hold the tables in memory for a while and later dump them in the order I want (see the sketch after the code below)?
My code goes like this :
import io
import json
import zipfile

import boto3
import pandas as pd
from sqlalchemy.exc import SQLAlchemyError

# helpers, database and bucket_name come from the project's own modules/config (not shown)

def etl_job():
    data = json.load(open('path_to_json'))
    logger = helpers.setup_logging()
    s3_client = boto3.client('s3', aws_access_key_id=data['aws_access_key_id'],
                             aws_secret_access_key=data['aws_secret_access_key'])
    s3_resource = boto3.resource('s3', aws_access_key_id=data['aws_access_key_id'],
                                 aws_secret_access_key=data['aws_secret_access_key'])
    keys = []
    resp = s3_client.list_objects_v2(Bucket=bucket_name)
    for obj in resp['Contents']:
        keys.append(obj['Key'])
    for key in keys:
        names = key.split("/")
        obj = s3_resource.Bucket(bucket_name).Object(helpers.zip_file_name())
        buffer = io.BytesIO(obj.get()["Body"].read())
        zip_file = zipfile.ZipFile(buffer, 'r')
        logger.info("Name of csv in zip file :%s", zip_file.namelist())
        logs = ""
        dataframe = pd.DataFrame()
        for name_of_zipfile in zip_file.namelist():
            zip_open = pd.read_csv(zip_file.open(name_of_zipfile))
            zip_open = zip_open.dropna()
            table_name = "{name}".format(name=name_of_zipfile.replace('.csv', ''))
            try:
                zip_open.to_sql(name=name_of_zipfile.replace('.csv', ''), con=database.db_connection(),
                                if_exists='append', index=False)
            except SQLAlchemyError as sqlalchemy_error:
                print(sqlalchemy_error)
                database.db_connection().execute('SET FOREIGN_KEY_CHECKS=1;')
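A minimal sketch of the "hold the tables in memory, then dump them in the order you want" idea, assuming the parent-to-child order of the tables is known up front (the function name and the load_order list are illustrative, and connection stands for whatever database.db_connection() returns):

import zipfile

import pandas as pd

def load_zip_in_order(buffer, connection):
    """Read every CSV from the zip into memory, then write them parent-first."""
    zip_file = zipfile.ZipFile(buffer, 'r')

    # 1) Read all CSVs into a dict keyed by table name
    frames = {}
    for name in zip_file.namelist():
        table_name = name.replace('.csv', '')
        frames[table_name] = pd.read_csv(zip_file.open(name)).dropna()

    # 2) Write them in an explicit dependency order: parents before children
    load_order = ['dealer_data', 'billing_data']  # illustrative; extend as needed
    for table_name in load_order:
        if table_name in frames:
            frames[table_name].to_sql(name=table_name, con=connection,
                                      if_exists='append', index=False)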
I want to add a links property to each CouchDB document based on data in a CSV file.
The value of the links property is to be an array of dicts containing the CouchDB _id of the linked document and the linkType.
When I run the script I get a KeyError on links (see error info below).
I am not sure how to create the dict key links if it doesn't exist and add the link data, or otherwise append to the links array if it does exist.
An example of a document with the links will look like this:
{
  _id: p_3,
  name: 'Smurfette',
  links: [
    {to_id: p_2, linkType: 'knows'},
    {to_id: o_56, linkType: 'follows'}
  ]
}
Python script for processing the CSV file:
#!/usr/bin/python
# coding: utf-8
# Version 1
#
# csv fields: ID,fromType,fromID,toType,toID,LinkType,Directional

import csv, sys, couchdb

def csv2couchLinks(database, csvfile):
    # CouchDB Database Connection etc
    server = couchdb.Server()
    # assumes that couchdb runs on http://localhost:5984
    db = server[database]
    # assumes that db is already created

    # CSV file
    data = csv.reader(open(csvfile, "rb"))  # Read in the CSV file rb=read/binary
    csv_links = csv.DictReader(open(csvfile, "rb"))

    def makeLink(from_id, to_id, linkType):
        # get doc from db
        doc = db[from_id]
        # construct link object
        link = {'to_id': to_id, 'linkType': linkType}
        # add link reference to array at key 'links'
        if doc['links'] in doc:
            doc['links'].append(link)
        else:
            doc['links'] = [link]
        # update the record in the database
        db[doc.id] = doc

    # read each row in csv file
    for row in csv_links:
        # get entityTypes as lowercase and entityIDs
        fromType = row['fromType'].lower()
        fromID = row['fromID']
        toType = row['toType'].lower()
        toID = row['toID']
        linkType = row['LinkType']
        # concatenate 'entity type' and 'id' to make couch '_id'
        fromIDcouch = fromType[0]+'_'+fromID  # eg 'p_2' <= person 2
        toIDcouch = toType[0]+'_'+toID
        makeLink(fromIDcouch, toIDcouch, linkType)
        makeLink(toIDcouch, fromIDcouch, linkType)

# Run csv2couchLinks() if this is not an imported module
if __name__ == '__main__':
    DATABASE = sys.argv[1]
    CSVFILE = sys.argv[2]
    csv2couchLinks(DATABASE, CSVFILE)
error info:
$ python LINKS_csv2couchdb_v1.py "qmhonour" "./tablesAsCsv/links.csv"
Traceback (most recent call last):
File "LINKS_csv2couchdb_v1.py", line 65, in <module>
csv2couchLinks(DATABASE,CSVFILE)
File "LINKS_csv2couchdb_v1.py", line 57, in csv2couchLinks
makeLink(fromIDcouch, toIDcouch, linkType)
File "LINKS_csv2couchdb_v1.py", line 33, in makeLink
if doc['links'] in doc:
KeyError: 'links'
Another option is condensing the if block to this:
doc.setdefault('links', []).append(link)
The dictionary's setdefault method checks to see if links exists in the dictionary, and if it doesn't, it creates a key and makes the value an empty list (the default). It then appends link to that list. If links does exist, it just appends link to the list.
def makeLink(from_id, to_id, linkType):
    # get doc from db
    doc = db[from_id]
    # construct link object
    link = {'to_id': to_id, 'linkType': linkType}
    # add link reference to array at key 'links'
    doc.setdefault('links', []).append(link)
    # update the record in the database
    db[doc.id] = doc
Replace:
if doc['links'] in doc:
With:
if 'links' in doc:
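Applied to the makeLink function from the question, that corrected check would look like this (same logic as the original; only the condition changes, so a document without a links key no longer raises KeyError):

def makeLink(from_id, to_id, linkType):
    # db comes from the enclosing csv2couchLinks scope, as in the question
    doc = db[from_id]
    link = {'to_id': to_id, 'linkType': linkType}
    # test for the 'links' key itself instead of indexing doc['links']
    if 'links' in doc:
        doc['links'].append(link)
    else:
        doc['links'] = [link]
    db[doc.id] = doc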