I am using the DataflowStartFlexTemplateOperator in an Airflow DAG to export my BigQuery external table data to Parquet files with the desired number of output files.
But how do I set the output file names here, or at least a filename prefix?
Below is my code:
export_to_gcs = DataflowStartFlexTemplateOperator(
    task_id=f'export_to_gcs_day_{day_num}',
    project_id=PROJECT_ID,
    body={
        'launchParameter': {
            'containerSpecGcsPath': 'gs://dataflow-templates-us-central1/latest/flex/BigQuery_to_Parquet',
            'jobName': f'tivo-export-to-gcs-{run_date}',
            'environment': {
                'stagingLocation': f'gs://{GCS_BUCKET_NAME}/{STAGING_LOCATION}',
                'numWorkers': '1',
                'maxWorkers': '20',
                'workerRegion': 'us-central1',
                'serviceAccountEmail': DF_SA_NAME,
                'machineType': 'n1-standard-4',
                'ipConfiguration': 'WORKER_IP_PRIVATE',
                'tempLocation': f'gs://{GCS_BUCKET_NAME}/{TEMP_LOCATION}',
                'subnetwork': SUBNETWORK,
                'enableStreamingEngine': False
            },
            'parameters': {
                'tableRef': f'{PROJECT_ID}:{DATASET_NAME}.native_firehose_table',
                'bucket': f'gs://{GCS_BUCKET_NAME}/Tivo/site_activity/{run_date}/',
                'numShards': '25',
            },
        }
    },
    location=REGION_NAME,
    wait_until_finished=True,
    dag=dag
)
"BigQuery_to_Parquet" template does not allow you to add a prefix. To do that, you would actually need to download the template and create a custom version.
This is the part in the source code where the files are written:
 * Step 2: Write records to Google Cloud Storage as one or more Parquet files
 * via {@link ParquetIO}.
 */
.apply(
    "WriteToParquet",
    FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(schema))
        .to(options.getBucket())
        .withNumShards(options.getNumShards())
        .withSuffix(FILE_SUFFIX));
Maybe you can take some inspiration from another template that does support a prefix, such as the Streaming Data Generator, but keep in mind that its purpose is different and it is streaming rather than batch.
StreamingDataGeneratorWriteToGcs.java takes the prefix from the pipeline options:
/** Converts the fake messages in bytes to json format and writes to a text file. */
private void writeAsParquet(PCollection<GenericRecord> genericRecords, Schema avroSchema) {
  genericRecords.apply(
      "Write Parquet output",
      FileIO.<GenericRecord>write()
          .via(ParquetIO.sink(avroSchema))
          .to(getPipelineOptions().getOutputDirectory())
          .withPrefix(getPipelineOptions().getOutputFilenamePrefix())
          .withSuffix(getPipelineOptions().getOutputType().getFileExtension())
          .withNumShards(getPipelineOptions().getNumShards()));
}
Another option is just to export the files into a very specific folder where they cannot be mixed with any other process, and apply a mass renaming in a subsequent task, as sketched below.
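For that renaming step, here is a minimal sketch (not part of the template itself) using the google-cloud-storage client. The helper name, the prefix value, and the assumption that the files land under the Tivo/site_activity/{run_date}/ folder from the question are all illustrative:
from google.cloud import storage

def add_prefix_to_exported_files(bucket_name, folder, new_prefix):
    """Rename exported Parquet files so each object name starts with new_prefix."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    for blob in client.list_blobs(bucket_name, prefix=folder):
        if not blob.name.endswith('.parquet'):
            continue
        filename = blob.name.rsplit('/', 1)[-1]
        # GCS has no in-place rename: rename_blob copies the object to the new name
        # and deletes the original.
        bucket.rename_blob(blob, f'{folder}{new_prefix}-{filename}')

# e.g. called from a PythonOperator task that runs after export_to_gcs:
# add_prefix_to_exported_files(GCS_BUCKET_NAME, f'Tivo/site_activity/{run_date}/', 'site-activity')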
def add_graph(file, file_name):
    file.seek(0)
    file_content = file.read()
    if 'snomed' in file_name:
        conn.add(stardog.content.Raw(file_content, content_type='bytes', content_encoding='utf-8'),
                 graph_uri='sct:900000000000207008')
Here I'm facing issues pushing this file, which I downloaded from an S3 bucket and which is in bytes form. It throws stardog.Exception 500 when pushing this data to the Stardog database.
I tried pushing the bytes directly as shown below, but that also didn't help.
conn.add(content.File(file),
         graph_uri='<http://www.example.com/ontology#concept>')
Can someone help me push a Turtle file that is in bytes form into the Stardog database using the pystardog Python library?
I believe this is what you are looking for:
import stardog

conn_details = {
    'endpoint': 'http://localhost:5820',
    'username': 'admin',
    'password': 'admin'
}
conn = stardog.Connection('myDb', **conn_details)  # assuming you have this since you already have 'conn'; just connecting to a DB named 'myDb'

file = open('snomed.ttl', 'rb')  # just opening a file as a binary object to mimic your setup
file_name = 'snomed.ttl'  # adding this to keep your function as it is

def add_graph(file, file_name):
    file.seek(0)
    file_content = file.read()  # this will be of type bytes
    if 'snomed' in file_name:
        conn.begin()  # added this to begin a transaction, but I do not think it is required
        conn.add(stardog.content.Raw(file_content, content_type='text/turtle'), graph_uri='sct:900000000000207008')
        conn.commit()  # added this to commit the added data

add_graph(file, file_name)  # I just ran this directly in the Python file for the example.
Take note of the conn.add line where I used text/turtle as the content-type. I added some more context so it can be a running example.
Here is the sample file as well snomed.ttl:
@prefix : <http://api.stardog.com#> .

<http://api.stardog.com/id=1> a :person ;
    <http://api.stardog.com#first_name> "John" ;
    <http://api.stardog.com#id> "1" ;
    <http://api.stardog.com#dob> "1995-01-05" ;
    <http://api.stardog.com#email> "john.doe@example.com" ;
    <http://api.stardog.com#last_name> "Doe" .
EDIT - Query Test
If it runs successfully and there are no errors in stardog.log, you should be able to see results using this query. Note that you have to specify the named graph, since the data was added there; if you query without specifying it, you will get no results.
SELECT * {
    GRAPH <sct:900000000000207008> {
        ?s ?p ?o
    }
}
You can run that query in Stardog Studio, but if you want it in Python, this will print the JSON result:
print(conn.select('SELECT * { GRAPH <sct:900000000000207008> { ?s ?p ?o } }'))
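For the original scenario, where the Turtle content arrives as bytes from S3 instead of a local file, a minimal sketch along the same lines could look like this; the boto3 download and the bucket/key names are assumptions, and conn is the connection created above:
import boto3
import stardog

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-bucket', Key='snomed.ttl')  # hypothetical bucket/key
file_content = obj['Body'].read()  # bytes, just like file.read() above

conn.begin()
conn.add(stardog.content.Raw(file_content, content_type='text/turtle'),
         graph_uri='sct:900000000000207008')
conn.commit()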
I'm working with the Azure CLI to script out a storage upgrade as well as add a policy, all in a Python script. However, when I run the script I'm getting some expected and some very NOT expected output.
What I'm using so far:
from azure.cli.core import get_default_cli

def az_cli(args_str):
    args = args_str.split()
    cli = get_default_cli()
    cli.invoke(args)
    if cli.result.result:
        return cli.result.result
    elif cli.result.error:
        raise cli.result.error
    return True

sas = az_cli("storage account list --query [].{Name:name,ResourceGroup:resourceGroup,Kind:kind}")
print(sas)
Using this SO article as a reference, I'm pretty easily making Azure CLI calls; however, my output is the following:
[
{
"Kind": "StorageV2",
"Name": "TestStorageName",
"ResourceGroup": "my_test_RG"
},
{
"Kind": "Storage",
"Name": "TestStorageName2",
"ResourceGroup": "my_test_RG_2"
}
]
[OrderedDict([('Name', 'TestStorageName'), ('ResourceGroup', 'my_test_RG'), ('Kind', 'StorageV2')]), OrderedDict([('Name', 'TestStorageName2'), ('ResourceGroup', 'my_test_RG_2'), ('Kind', 'Storage')])]
I appear to be getting two arrays back, and I'm unsure of the cause. I'm assuming it has to do with using --query to narrow down the output I get back, but I'm at a loss as to why it then repeats itself. The expected result would just be the first part, which is in JSON format. I have also tried tsv output, with the same results. I appreciate any insight!
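For what it's worth, a likely explanation (not verified here) is that the first JSON block is printed by cli.invoke itself, which writes the command's formatted output to stdout, while the second line comes from your own print(sas) showing the Python objects held in cli.result.result. A minimal sketch that redirects the CLI's own output, assuming invoke accepts an out_file argument as in knack-based CLIs:
import os
from azure.cli.core import get_default_cli

def az_cli(args_str):
    args = args_str.split()
    cli = get_default_cli()
    # Send the CLI's own formatted output to /dev/null so it is not printed separately.
    with open(os.devnull, 'w') as devnull:
        cli.invoke(args, out_file=devnull)
    if cli.result.result:
        return cli.result.result
    elif cli.result.error:
        raise cli.result.error
    return True

sas = az_cli("storage account list --query [].{Name:name,ResourceGroup:resourceGroup,Kind:kind}")
print(sas)  # now only the returned Python objects are printed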
So I'm fairly new to both AWS and Python. I'm on a uni assignment and have hit a roadblock.
I'm uploading data to AWS S3; this information is being sent to an SQS queue and passed into AWS Lambda. I know, it would be much easier to just go straight from S3 to Lambda... but apparently "that's not the brief".
So I've got my event accurately coming into AWS Lambda, but no matter how deep I dig, I can't reach the information I need. In AWS Lambda, I run the following:
def lambda_handler(event, context):
    print(event)
Via CloudWatch, I get the output
{'Records': [{'messageId': '1d8e0a1d-d7e0-42e0-9ff7-c06610fccae0', 'receiptHandle': 'AQEBr64h6lBEzLk0Xj8RXBAexNukQhyqbzYIQDiMjJoLLtWkMYKQp5m0ENKGm3Icka+sX0HHb8gJoPmjdTRNBJryxCBsiHLa4nf8atpzfyCcKDjfB9RTpjdTZUCve7nZhpP5Fn7JLVCNeZd1vdsGIhkJojJ86kbS3B/2oBJiCR6ZfuS3dqZXURgu6gFg9Yxqb6TBrAxVTgBTA/Pr35acEZEv0Dy/vO6D6b61w2orabSnGvkzggPle0zcViR/shLbehROF5L6WZ5U+RuRd8tLLO5mLFf5U+nuGdVn3/N8b7+FWdzlmLOWsI/jFhKoN4rLiBkcuL8UoyccTMJ/QTWZvh5CB2mwBRHectqpjqT4TA3Z9+m8KNd/h/CIZet+0zDSgs5u', 'body': '{"Records":[{"eventVersion":"2.1","eventSource":"aws:s3","awsRegion":"eu-west-2","eventTime":"2021-03-26T01:03:53.611Z","eventName":"ObjectCreated:Put","userIdentity":{"principalId":"MY_ID"},"requestParameters":{"sourceIPAddress":"MY_IP_ADD"},"responseElements":{"x-amz-request-id":"BQBY06S20RYNH1XJ","x-amz-id-2":"Cdo0RvX+tqz6SZL/Xw9RiBLMCS3Rv2VOsu2kVRa7PXw9TsIcZeul6bzbAS6z4HF6+ZKf/2MwnWgzWYz+7jKe07060bxxPhsY"},"s3":{"s3SchemaVersion":"1.0","configurationId":"test","bucket":{"name":"MY_BUCKET","ownerIdentity":{"principalId":"MY_ID"},"arn":"arn:aws:s3:::MY_BUCKET"},"object":{"key":"test.jpg","size":246895,"eTag":"c542637a515f6df01cbc7ee7f6e317be","sequencer":"00605D33019AD8E4E5"}}}]}', 'attributes': {'ApproximateReceiveCount': '1', 'SentTimestamp': '1616720643174', 'SenderId': 'AIDAIKZTX7KCMT7EP3TLW', 'ApproximateFirstReceiveTimestamp': '1616720648174'}, 'messageAttributes': {}, 'md5OfBody': '1ab703704eb79fbbb58497ccc3f2c555', 'eventSource': 'aws:sqs', 'eventSourceARN': 'arn:aws:sqs:eu-west-2:ARN', 'awsRegion': 'eu-west-2'}]}
[Disclaimer, I've tried to edit out any identifying information but if there's any sensitive data I'm not understanding or missed, please let me know]
Anyway, just for a sample, I want to get the object key, which is test.jpg. I tried to drill down as much as I can, finally getting to:
def lambda_handler(event, context):
    print(event['Records'][0]['body'])
This returned the following (which was nice to see fully stylized):
{
"Records": [
{
"eventVersion": "2.1",
"eventSource": "aws:s3",
"awsRegion": "eu-west-2",
"eventTime": "2021-03-26T01:08:16.823Z",
"eventName": "ObjectCreated:Put",
"userIdentity": {
"principalId": "MY_ID"
},
"requestParameters": {
"sourceIPAddress": "MY_IP"
},
"responseElements": {
"x-amz-request-id": "ZNKHRDY8GER4F6Q5",
"x-amz-id-2": "i1Cazudsd+V57LViNWyDNA9K+uRbSQQwufMC6vf50zQfzPaH7EECsvw9SFM3l3LD+TsYEmnjXn1rfP9GQz5G5F7Fa0XZAkbe"
},
"s3": {
"s3SchemaVersion": "1.0",
"configurationId": "test",
"bucket": {
"name": "MY_BUCKET",
"ownerIdentity": {
"principalId": "MY_ID"
},
"arn": "arn:aws:s3:::MY_BUCKET"
},
"object": {
"key": "test.jpg",
"size": 254276,
"eTag": "b0052ab9ba4b9395e74082cfd51a8f09",
"sequencer": "00605D3407594DE184"
}
}
}
]
}
However, from this stage on, if I try to write print(event['Records'][0]['body']['Records']) or print(event['Records'][0]['s3']), I get told that an integer is required, not a string. If I try to write print(event['Records'][0]['body'][0]), I'm given a single character every time (in this case the first { bracket).
I'm not sure if this has something to do with tuples, or if at this stage it's all saved as one large string, but at least in the output view it doesn't appear to be saved that way.
Does anyone have any idea what I'd do from this stage to access the further information? In the full release after I'm done testing, I'll be wanting to save an audio file and the file name, as opposed to a picture.
Thanks.
You are having this problem because the contents of body is JSON, but in string format. You should parse it to be able to access it like a normal dictionary, like so:
import json

def handler(event: dict, context: object):
    body = event['Records'][0]['body']
    body = json.loads(body)
    # use the body as a normal dictionary
You are getting only a single char when using integer indexes because body is a string, so using [n] on a string returns the nth char.
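To tie this back to the sample event in the question, once the body is parsed the object key can be reached like so (the path below just follows the JSON shown above):
import json

def lambda_handler(event, context):
    # body is a JSON string, so parse it first, then drill into the S3 record.
    body = json.loads(event['Records'][0]['body'])
    key = body['Records'][0]['s3']['object']['key']
    print(key)  # "test.jpg" for the sample event above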
It's because you're getting stringified JSON data. You need to load it back into its Python dict form.
There is a useful package called lambda_decorators. You can install it with pip install lambda_decorators,
so you can do this:
from lambda_decorators import load_json_body

@load_json_body
def lambda_handler(event, context):
    print(event['Records'][0]['body'])
    # Now you can access the items in the body using their indexes and keys.
This will extract the JSON for you.
I use dynamic mapping in Elasticsearch to load my JSON file into Elasticsearch, like this:
import json
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

def extract():
    f = open('tmdb.json')
    if f:
        return json.loads(f.read())

movieDict = extract()

def index(movieDict={}):
    for id, body in movieDict.items():
        es.index(index='tmdb', id=id, doc_type='movie', body=body)

index(movieDict)
How can I update the mapping for a single field? I have a field title to which I want to add a different analyzer.
title_settings = {"properties" : { "title": {"type" : "text", "analyzer": "english"}}}
es.indices.put_mapping(index='tmdb', body=title_settings)
This fails.
I know that I cannot update an already existing index, but what is the proper way to reindex with a mapping generated from a JSON file? My file has a lot of fields, so creating the mapping/settings manually would be very troublesome.
I am able to specify an analyzer for a query, like this:
query = {"query": {
    "multi_match": {
        "query": userSearch, "analyzer": "english", "fields": ['title^10', 'overview']}}}
How do I specify it for an index or a field?
I am also able to put an analyzer into the settings after closing and reopening the index:
analysis = {'settings': {'analysis': {'analyzer': 'english'}}}
es.indices.close(index='tmdb')
es.indices.put_settings(index='tmdb', body=analysis)
es.indices.open(index='tmdb')
Copying the exact settings for the english analyzer doesn't 'activate' it for my data.
https://www.elastic.co/guide/en/elasticsearch/reference/7.6/analysis-lang-analyzer.html#english-analyzer
By 'activate' I mean that search results are not returned in a form processed by the english analyzer, i.e. there are still stopwords.
Solved it with a massive amount of googling...
You cannot change the analyzer on already indexed data. This includes opening/closing the index. You have to create a new index with a new mapping and load your data into it (quickest way).
Specifying the analyzer for the whole index isn't a good solution, as the 'english' analyzer is specific to 'text' fields. It's better to specify the analyzer per field.
If analyzers are specified per field, you also need to specify the type.
You need to remember that analyzers can be used at index and/or search time. Reference: Specifying analyzers
Code:
import time

def create_index(movieDict={}, mapping={}):
    es.indices.create(index='test_index', body=mapping)
    start = time.time()
    for id, body in movieDict.items():
        es.index(index='test_index', id=id, doc_type='movie', body=body)
    print("--- %s seconds ---" % (time.time() - start))
Now, I've got the mapping from the dynamic mapping of my JSON file. I just saved it back to a JSON file for ease of processing (editing), because I have over 40 fields to map and doing it by hand would be tiresome.
mapping = es.indices.get_mapping(index='tmdb')
This is an example of how the title key should be specified to use the english analyzer:
'title': {'type': 'text', 'analyzer': 'english','fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}
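Putting the pieces together, here is a rough sketch, assuming Elasticsearch 7.x (where get_mapping returns the mappings without a type level) and using a hypothetical new index name tmdb_v2: pull the dynamic mapping, patch the title field, create the new index, and copy the documents over with the _reindex API.
# Rough sketch, assuming Elasticsearch 7.x and the 'es' client from above.
mapping = es.indices.get_mapping(index='tmdb')['tmdb']['mappings']
mapping['properties']['title'] = {
    'type': 'text',
    'analyzer': 'english',
    'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}},
}

# Create the new index with the patched mapping (index name is just an example).
es.indices.create(index='tmdb_v2', body={'mappings': mapping})

# Copy the existing documents into the new index so they are re-analyzed.
es.reindex(body={'source': {'index': 'tmdb'}, 'dest': {'index': 'tmdb_v2'}},
           wait_for_completion=True)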
I'm trying to set data validation rules for my current spreadsheet. One thing that would help me would be to be able to view, in JSON, the data validation rules I have already set (in the spreadsheet UI or via an API call).
Example.
request = {
    "requests": [
        {
            "setDataValidation": {
                "range": {
                    "sheetId": SHEET_ID,
                    "startRowIndex": 1,
                    "startColumnIndex": 0,
                    "endColumnIndex": 1
                },
                "rule": {
                    "condition": {
                        "type": "BOOLEAN"},
                    "inputMessage": "Value MUST BE BOOLEAN",
                    "strict": "True"
                }
            }
        }
    ]
}
service.spreadsheets().batchUpdate(spreadsheetId=SPREADSHEET_ID, body=request).execute()
But what API call do I use to see the data validation on this range of cells? This is useful when I set data validation rules in the spreadsheet and want to see how Google interprets them. I'm having a lot of trouble setting complex data validations through the API.
Thank you
To obtain only the "Data Validation" components of a given spreadsheet, you simply request the appropriate field in the call to spreadsheets.get:
service = get_authed_sheets_service_somehow()
params = {
    'spreadsheetId': 'your ssid',
    # 'ranges': 'some range',
    'fields': 'sheets(data/rowData/values/dataValidation,properties(sheetId,title))'
}
request = service.spreadsheets().get(**params)
response = request.execute()

# Example print code (not tested :p )
for sheet in response['sheets']:
    for grid in sheet['data']:
        for r, row in enumerate(grid['rowData']):
            for c, col in enumerate(row['values']):
                if 'dataValidation' in col:
                    # print "Sheet1!R1C1" & associated data validation object.
                    # Assumes whole grid was requested (add appropriate indices if not).
                    print(f'\'{sheet["properties"]["title"]}\'!R{r}C{c}', col['dataValidation'])
By specifying fields, includeGridData is not required to obtain data on a per-cell basis from the range you requested. By not supplying a range, we target the entire file. This particular fields specification requests the rowData.values.dataValidation object and the sheetId and title of the properties object, for every sheet in the spreadsheet.
You can use the Google APIs Explorer to interactively determine the appropriate valid "fields" specification, and additionally examine the response:
https://developers.google.com/apis-explorer/#p/sheets/v4/sheets.spreadsheets.get
For more about how "fields" specifiers work, read the documentation: https://developers.google.com/sheets/api/guides/concepts#partial_responses
(For certain write requests, field specifications are not optional so it is in your best interest to determine how to use them effectively.)
I think I found the answer: includeGridData=True in your spreadsheets().get call.
from pprint import pprint

response = service.spreadsheets().get(
    spreadsheetId=SPREADSHEETID, fields='*',
    ranges='InputWorking!A2:A', includeGridData=True).execute()
You get a monster data structure back, so to look at the very first piece of data in your range you could do:
pprint(response['sheets'][0]['data'][0]['rowData'][0]['values'][0]['dataValidation'])
{'condition': {'type': 'BOOLEAN'},
'inputMessage': 'Value MUST BE BOOLEAN',
'strict': True}