I have a JSON object in S3 which follows this structure:
<code>: {
    <client>: <value>
}
For example,
{
    "code_abc": {
        "client_1": 1,
        "client_2": 10
    },
    "code_def": {
        "client_2": 40,
        "client_3": 50,
        "client_5": 100
    },
    ...
}
I am trying to retrieve the numerical value with an S3 Select query, where the "code" and the "client" are populated dynamically with each query.
So far I have tried:
sql_exp = f"SELECT * from s3object[*][*] s where s.{proc}.{client_name} IS NOT NULL"
sql_exp = f"SELECT * from s3object s where s.{proc}[*].{client_name}[*] IS NOT NULL"
as well as without the asterisks inside the square brackets, but nothing works; I get ClientError: An error occurred (ParseUnexpectedToken) when calling the SelectObjectContent operation: Unexpected token found LITERAL:UNKNOWN at line 1, column X (the column depending on the length of the query string).
Within the function that makes the request, I have:
resp = s3.select_object_content(
    Bucket=<bucket>,
    Key=<filename>,
    ExpressionType="SQL",
    Expression=sql_exp,
    InputSerialization={'JSON': {"Type": "Document"}},
    OutputSerialization={"JSON": {}},
)
Is there something off in the way I define the object serialization? How can I fix the query so I can retrieve the desired numerical value on the fly when I provide "code" and "client"?
I did some tinkering based on the documentation, and it works!
I need to access the individual events in the EventStream (resp['Payload']) as follows:
event_stream = resp['Payload']

# unpack successful query response
for event in event_stream:
    if "Records" in event:
        output_str = event["Records"]["Payload"].decode("utf-8")  # bytes to string
        output_dict = json.loads(output_str)  # string to dict
Now the correct SQL expression is:
sql_exp = f"SELECT s['{code}']['{client}'] FROM S3Object s"
where I have gotten (dynamically) my values for code and client beforehand.
For example, based on the dummy JSON structure above, if code = "code_abc" and client = "client_2", I want this S3 Select query to return the value 10.
The f-string resolves to sql_exp = "SELECT s['code_abc']['client_2'] FROM S3Object s", and when we read the response we retrieve output_dict = {'client_2': 10}. (I am not sure there is a clean way to get the value by itself without the client key; this is how it looks in the documentation as well.)
So, the final step is to retrieve value = output_dict['client_2'], which in our case is equal to 10.
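Putting those pieces together, here is a minimal end-to-end sketch; the bucket name, object key, and the code/client values below are placeholders for the ones supplied dynamically:

import json
import boto3

s3 = boto3.client("s3")

code, client = "code_abc", "client_2"  # resolved dynamically in practice
sql_exp = f"SELECT s['{code}']['{client}'] FROM S3Object s"

resp = s3.select_object_content(
    Bucket="my-bucket",      # placeholder bucket
    Key="my-file.json",      # placeholder key
    ExpressionType="SQL",
    Expression=sql_exp,
    InputSerialization={"JSON": {"Type": "Document"}},
    OutputSerialization={"JSON": {}},
)

value = None
for event in resp["Payload"]:
    if "Records" in event:
        output_dict = json.loads(event["Records"]["Payload"].decode("utf-8"))
        value = output_dict[client]  # -> 10 for the dummy data above

print(value)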
This is the same as this question, but I also want to limit the depth returned.
Currently, all answers return all the objects after the specified prefix. I want to see just what's in the current hierarchy level.
Current code that returns everything:
self._session = boto3.Session(
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)
self._s3 = self._session.resource("s3")
bucket = self._s3.Bucket(bucket_name)
detections_contents = bucket.objects.filter(Prefix=prefix)
for object_summary in detections_contents:
    print(object_summary.key)
How to see only the files and folders directly under prefix? How to go n levels deep?
I could parse everything locally, but that is clearly not what I am looking for here.
There is no definite way to do this using list-objects calls without getting all the objects under the prefix.
But there is a way using S3 Select, which uses a SQL-like query format, to go n levels deep and get file content as well as object keys.
If you are fine with writing SQL, then use this.
Reference doc
import boto3
import json

s3 = boto3.client('s3')

bucket_name = 'my-bucket'
prefix = 'my-directory/subdirectory/'

input_serialization = {
    'CompressionType': 'NONE',
    'JSON': {
        'Type': 'LINES'
    }
}
output_serialization = {
    'JSON': {}
}

# Set the SQL expression to select the key field for all records under the subdirectory
expression = "SELECT s.key FROM S3Object s WHERE s.key LIKE '" + prefix + "%'"

response = s3.select_object_content(
    Bucket=bucket_name,
    Key='my-object-key',  # select_object_content reads one object, so a key is required
    ExpressionType='SQL',
    Expression=expression,
    InputSerialization=input_serialization,
    OutputSerialization=output_serialization
)

# The response contains a Payload event stream with the selected data
payload = response['Payload']
for event in payload:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        # Each output record is a JSON object on its own line, with a "key"
        # field representing the file name
        for line in records.splitlines():
            if line:
                print(json.loads(line)['key'])
There is no built-in way with the Boto3 or S3 APIs to do this. You'll need some version of processing each level and asking in turn for a list of objects at that level:
import boto3

s3 = boto3.client('s3')
bucket_name = 'my-bucket'  # the bucket to list
max_depth = 2
paginator = s3.get_paginator('list_objects_v2')

# Track all prefixes to show with a list
common_prefixes = [(0, "")]

while len(common_prefixes) > 0:
    # Pull out the next prefix to show
    current_depth, current_prefix = common_prefixes.pop(0)

    # Loop through all of the items using a paginator to handle common prefixes with more
    # than a thousand items
    for page in paginator.paginate(Bucket=bucket_name, Prefix=current_prefix, Delimiter='/'):
        for cur in page.get("CommonPrefixes", []):
            # Show each common prefix, here just use a format like the AWS CLI does
            print(" " * 27 + f"PRE {cur['Prefix']}")
            if current_depth < max_depth:
                # This is below the max depth we want to show, so
                # add it to the list to be shown
                common_prefixes.append((current_depth + 1, cur['Prefix']))
        for cur in page.get("Contents", []):
            # Show each item sharing this common prefix using a format like the AWS CLI
            print(f"{cur['LastModified'].strftime('%Y-%m-%d %H:%M:%S')}{cur['Size']:11d} {cur['Key']}")
I have a Mongo collection with a data structure in the following way:
content: {
    'description': {
        'text': [
            {'_date': '2019-05-21', '_sectionId': 'a13a', '_objectId': 'f637cee'},
            {'_date': '2019-05-21', '_objectId': '8b2ed183', '_source': 'f637cee'},
            { etc.... },
            {'_date': '2019-05-21', '_sectionId': 'a13a', '_objectId': 'XXXcee'}
        ]
    },
    'client': {.....},
}
I am looking for a way to query the collection and get a list of tuples in the following way:
given a sectionId, I would like to get the corresponding objectId.
In this case the result would be:
('a13a','f637cee'), ('a13a','XXXcee')
I started to do something like this:
import pymongo

myclient = pymongo.MongoClient(mongoconnection)

print('database names:')
print(myclient.list_database_names())

# getting the database that holds the collection:
mydb = myclient["clients"]

query = {'content.description.text._sectionId': 'a13a'}
cur = mydb.find(query)
But I don't know how to extract the information from the cursor.
Any help?
Note: the info might be nested in different places, i.e. there can be more nodes preceding "content" that vary.
Thanks a lot
Use the second parameter of find() to get the required fields.
Ex:
query = {'content.description.text._sectionId': 'a13a'}
cur = mydb.find(query, { "_id": 0, "_sectionId": 1, "_objectId": 1 })
print([tuple(i.values()) for i in cur])
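Note that in the sample document the two fields live inside the nested text array rather than at the top level, so a flat projection like the one above may not surface them. If that is the case, a hedged alternative is an aggregation pipeline with $unwind (reusing myclient from the question; the collection name here is an assumption):

# Sketch: unwind the nested "text" array, keep the entries with the wanted
# _sectionId, and project just the two fields. "mycollection" is an assumed
# collection name inside the "clients" database.
collection = myclient["clients"]["mycollection"]

pipeline = [
    {"$unwind": "$content.description.text"},
    {"$match": {"content.description.text._sectionId": "a13a"}},
    {"$project": {
        "_id": 0,
        "sectionId": "$content.description.text._sectionId",
        "objectId": "$content.description.text._objectId",
    }},
]

print([(doc["sectionId"], doc["objectId"]) for doc in collection.aggregate(pipeline)])
# -> [('a13a', 'f637cee'), ('a13a', 'XXXcee')]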
So I'm trying to insert multiple items into a DynamoDB table. I'm reading my data from a CSV file and everything's going right (my logs in AWS CloudWatch show that I'm correctly extracting my data from the CSV file).
I first tried, in a loop, to write each element to the table like this:
for item in Items:
    response = dynamodb.put_item(
        TableName='some_table_name',
        Item=item
    )
For this syntax I'm using the dynamodb client like this:
dynamodb = boto3.client('dynamodb', region_name=region)
After this attempt, I tried to use batch_write_item in a loop like this:
for item in Items:
    response = dynamodb.batch_write_item(
        RequestItems={
            'some_table': [
                {
                    'PutRequest': {
                        'Item': item
                    }
                }
            ]
        }
    )

print('Successfully uploaded to DynamoDB')
And I'm still using the dynamodb client.
After those two attempts, I tried the same functions with the DynamoDB resource (dynamodb = boto3.resource('dynamodb', region_name=region)).
The problem is that my list Items has 42 items, and all those attempts put a different number of items (25, 27, 29) but never 42. So where am I going wrong? Can you guys help me please?
Try implementing a retry mechanism for your batch write:
result, batchError := svc.DynamoDBBatchWrite(input)
if result != nil && len(result.UnprocessedItems) > 0 {
    input = &dynamodb.BatchWriteItemInput{
        RequestItems: result.UnprocessedItems,
    }
}
and then retry until result.UnprocessedItems is empty.
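In boto3 terms, a sketch of the same loop (region, the Items list, and the table name are the placeholders from the question); note that batch_write_item accepts at most 25 items per call, so the list is chunked first and anything reported back in UnprocessedItems is resent:

import time
import boto3

dynamodb = boto3.client('dynamodb', region_name=region)
table_name = 'some_table_name'  # placeholder

# BatchWriteItem takes at most 25 items per request, so write in chunks of 25
for start in range(0, len(Items), 25):
    request_items = {
        table_name: [{'PutRequest': {'Item': item}} for item in Items[start:start + 25]]
    }
    while request_items:
        response = dynamodb.batch_write_item(RequestItems=request_items)
        # Anything DynamoDB could not write comes back here and must be resent
        request_items = response.get('UnprocessedItems', {})
        if request_items:
            time.sleep(1)  # simple backoff before retrying the leftovers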
I would like to retrieve items that have an attribute value present in the list of values I provide. Below is the query I have for searching. Unfortunately the response returns an empty list of items. I don't understand why this is the case and would like to know the correct query.
def search(self, src_words, translations):
    entries = []
    query_src_words = [word.decode("utf-8") for word in src_words]
    params = {
        "TableName": self.table,
        "FilterExpression": "src_word IN (:src_words) AND src_language = :src_language AND target_language = :target_language",
        "ExpressionAttributeValues": {
            ":src_words": {"SS": query_src_words},
            ":src_language": {"S": config["source_language"]},
            ":target_language": {"S": config["target_language"]}
        }
    }
    page_iterator = self.paginator.paginate(**params)
    for page in page_iterator:
        for entry in page["Items"]:
            entries.append(entry)
    return entries
Below is the table that I would like to query from. For example, if my list of query_src_words has ["soccer ball", "dog"], then only the row with entry_id=2 should be returned.
Any insights would be much appreciated.
I think this is because in your query_src_words you have "soccer_ball" (with an underscore), while in the database you have "soccer ball" (without an underscore).
Change "soccer_ball" to "soccer ball" in your query_src_words and it should work fine.
I am using named parameters in BigQuery SQL and want to write the results to a permanent table. I have two functions: one for using named query parameters and one for writing query results to a table. How do I combine the two so that the results of a query with named parameters get written to a table?
This is the function using parameterized queries:
from google.cloud import bigquery

def sync_query_named_params(column_name, min_word_count, value):
    query = """with lsq_results as
    (select "%s" = @min_word_count)
    replace (%s AS %s)
    from lsq.lsq_results
    """ % (min_word_count, value, column_name)
    client = bigquery.Client()
    query_results = client.run_sync_query(
        query,
        query_parameters=(
            bigquery.ScalarQueryParameter('column_name', 'STRING', column_name),
            bigquery.ScalarQueryParameter(
                'min_word_count',
                'STRING',
                min_word_count),
            bigquery.ScalarQueryParameter('value', 'INT64', value)
        ))
    query_results.use_legacy_sql = False
    query_results.run()
This is the function to write the results to a permanent table:
class BigQueryClient(object):

    def __init__(self, bq_service, project_id, swallow_results=True):
        self.bigquery = bq_service
        self.project_id = project_id
        self.swallow_results = swallow_results
        self.cache = {}

    def write_to_table(
            self,
            query,
            dataset=None,
            table=None,
            external_udf_uris=None,
            allow_large_results=None,
            use_query_cache=None,
            priority=None,
            create_disposition=None,
            write_disposition=None,
            use_legacy_sql=None,
            maximum_billing_tier=None,
            flatten=None):

        configuration = {
            "query": query,
        }

        if dataset and table:
            configuration['destinationTable'] = {
                "projectId": self.project_id,
                "tableId": table,
                "datasetId": dataset
            }

        if allow_large_results is not None:
            configuration['allowLargeResults'] = allow_large_results

        if flatten is not None:
            configuration['flattenResults'] = flatten

        if maximum_billing_tier is not None:
            configuration['maximumBillingTier'] = maximum_billing_tier

        if use_query_cache is not None:
            configuration['useQueryCache'] = use_query_cache

        if use_legacy_sql is not None:
            configuration['useLegacySql'] = use_legacy_sql

        if priority:
            configuration['priority'] = priority

        if create_disposition:
            configuration['createDisposition'] = create_disposition

        if write_disposition:
            configuration['writeDisposition'] = write_disposition

        if external_udf_uris:
            configuration['userDefinedFunctionResources'] = \
                [{'resourceUri': u} for u in external_udf_uris]

        body = {
            "configuration": {
                'query': configuration
            }
        }

        logger.info("Creating write to table job %s" % body)
        job_resource = self._insert_job(body)
        self._raise_insert_exception_if_error(job_resource)
        return job_resource
How do I combine the two functions to run a parameterized query and write the results to a permanent table? Or, if there is another, simpler way, please suggest it.
You appear to be using two different client libraries.
Your first code sample uses a beta version of the BigQuery client library, but for the time being I would recommend against using it, since it needs substantial revision before it is considered generally available. (And if you do use it, I would recommend using run_async_query() to create a job using all available parameters, and then call results() to get the QueryResults object.)
Your second code sample is creating a job resource directly, which is a lower-level interface. When using this approach, you can specify the configuration.query.queryParameters field on your query configuration directly. This is the approach I'd recommend right now.
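For illustration, here is a hedged sketch of what that lower-level query configuration could look like when it carries both named parameters and a destination table. The field names follow the configuration.query section of the jobs.insert REST API; the query text, project, dataset, table, and parameter values are placeholders:

# Sketch of a jobs.insert query configuration combining named query
# parameters with a destination table; all concrete values are placeholders.
query = "SELECT word FROM lsq.lsq_results WHERE word_count >= @min_word_count"

configuration = {
    "query": query,
    "useLegacySql": False,   # query parameters require standard SQL
    "parameterMode": "NAMED",
    "queryParameters": [
        {
            "name": "min_word_count",
            "parameterType": {"type": "INT64"},
            "parameterValue": {"value": "400"},
        },
    ],
    "destinationTable": {
        "projectId": "my-project",      # placeholder
        "datasetId": "my_dataset",      # placeholder
        "tableId": "my_results_table",  # placeholder
    },
    "writeDisposition": "WRITE_TRUNCATE",
}

body = {"configuration": {"query": configuration}}
# body can then be handed to the same low-level job-insert helper
# (self._insert_job(body)) used in write_to_table above.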