Cannot ingest data using snowpipe more than once - python

I am using the sample program from the Snowflake documentation on using Python to ingest data into the destination table.
So basically, I have to execute the PUT command to load data to the internal stage and then run the Python program to notify Snowpipe to ingest the data into the table.
This is how I create the internal stage and pipe:
create or replace stage exampledb.dbschema.example_stage;

create or replace pipe exampledb.dbschema.example_pipe
  as copy into exampledb.dbschema.example_table
  from
  (
    select t.*
    from @exampledb.dbschema.example_stage t
  )
  file_format = (TYPE = CSV) ON_ERROR = SKIP_FILE;
The PUT command:
put file://E:\\example\\data\\a.csv @exampledb.dbschema.example_stage OVERWRITE = TRUE;
This is the sample program I use:
from logging import getLogger
from snowflake.ingest import SimpleIngestManager
from snowflake.ingest import StagedFile
from snowflake.ingest.utils.uris import DEFAULT_SCHEME
from datetime import timedelta
from requests import HTTPError
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.serialization import load_pem_private_key
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.serialization import Encoding
from cryptography.hazmat.primitives.serialization import PrivateFormat
from cryptography.hazmat.primitives.serialization import NoEncryption
import time
import datetime
import os
import logging

logging.basicConfig(
    filename='/tmp/ingest.log',
    level=logging.DEBUG)
logger = getLogger(__name__)

# If you generated an encrypted private key, implement this method to return
# the passphrase for decrypting your private key.
def get_private_key_passphrase():
    return '<private_key_passphrase>'

with open("E:\\ssh\\rsa_key.p8", 'rb') as pem_in:
    pemlines = pem_in.read()
    private_key_obj = load_pem_private_key(pemlines,
                                           get_private_key_passphrase().encode(),
                                           default_backend())

private_key_text = private_key_obj.private_bytes(
    Encoding.PEM, PrivateFormat.PKCS8, NoEncryption()).decode('utf-8')

# Assume the public key has been registered in Snowflake:
# private key in PEM format

# List of files in the stage specified in the pipe definition
file_list = ['a.csv.gz']

ingest_manager = SimpleIngestManager(account='<account_identifier>',
                                     host='<account_identifier>.snowflakecomputing.com',
                                     user='<user_login_name>',
                                     pipe='exampledb.dbschema.example_pipe',
                                     private_key=private_key_text)

# List of files, but wrapped into a class
staged_file_list = []
for file_name in file_list:
    staged_file_list.append(StagedFile(file_name, None))

try:
    resp = ingest_manager.ingest_files(staged_file_list)
except HTTPError as e:
    # HTTP error, may need to retry
    logger.error(e)
    exit(1)

# This means Snowflake has received the file and will start loading
assert(resp['responseCode'] == 'SUCCESS')

# Needs to wait for a while to get the result in history
while True:
    history_resp = ingest_manager.get_history()

    if len(history_resp['files']) > 0:
        print('Ingest Report:\n')
        print(history_resp)
        break
    else:
        # wait for 20 seconds
        time.sleep(20)

hour = timedelta(hours=1)
date = datetime.datetime.utcnow() - hour
history_range_resp = ingest_manager.get_history_range(date.isoformat() + 'Z')

print('\nHistory scan report: \n')
print(history_range_resp)
After running the program, I just need to remove the file in the internal stage:
REMOVE @exampledb.dbschema.example_stage;
The code works as expected the first time, but when I truncate the table and run the code again, the table in Snowflake doesn't get any data.
Am I missing something here? How can I make this code run multiple times?
Update:
I found that if I use a file with a different name on each run, the data does load into the Snowflake table.
So how can I run this code without changing the data filename?

Snowflake uses file loading metadata to prevent reloading the same files (and duplicating data) in a table. Snowpipe prevents loading files with the same name even if they were later modified (i.e. have a different eTag).
The file loading metadata is associated with the pipe object rather than the table. As a result:
Staged files with the same name as files that were already loaded are ignored, even if they have been modified, e.g. if new rows were added or errors in the file were corrected.
Truncating the table using the TRUNCATE TABLE command does not delete the Snowpipe file loading metadata.
However, note that pipes only maintain the load history metadata for 14 days. Therefore:
Files modified and staged again within 14 days:
Snowpipe ignores modified files that are staged again. To reload modified data files, it is currently necessary to recreate the pipe object using the CREATE OR REPLACE PIPE syntax (a sketch of this workaround is shown below).
Files modified and staged again after 14 days:
Snowpipe loads the data again, potentially resulting in duplicate records in the target table.
For more information, have a look at the Snowflake documentation on Snowpipe load history and file loading metadata.
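If the filename really has to stay the same within that 14-day window, a minimal sketch of the recreate-the-pipe workaround (assuming the snowflake-connector-python package and the object names from the question; credentials are placeholders) could look like this:

# Hypothetical sketch: re-issue the pipe definition to reset its load-history
# metadata before staging and ingesting a file with the same name again.
# Assumes snowflake-connector-python; account/user/password are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account='<account_identifier>',
    user='<user_login_name>',
    password='<password>')
try:
    conn.cursor().execute("""
        create or replace pipe exampledb.dbschema.example_pipe
          as copy into exampledb.dbschema.example_table
          from (select t.* from @exampledb.dbschema.example_stage t)
          file_format = (TYPE = CSV) ON_ERROR = SKIP_FILE
    """)
finally:
    conn.close()

After this, PUT the file and notify the pipe as before. Note that recreating the pipe this way forgets the entire load history for that pipe, not just the history for the one file.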

Related

How to properly read J1939 messages from .asc file with cantools?

I'm trying to create a CAN log converter from .asc files to .csv files (in human-readable form). I'm somewhat successful. My code works fine with almost any database but j1939.dbc.
The thing is that if I print out the messages read from the dbc file, I can see that the messages from j1939.dbc are read into the database. But it fails to find any of those messages in the processed log file. At the same time, I can read the same file using Vector CANalyzer with no issues.
I wonder why this may happen and why it only affects j1939.dbc and not the others.
I suspect that maybe the way I convert those messages is wrong, because the code never gets past the if msg_id in database: line (and, as mentioned above, those messages are certainly there because Vector CANalyzer works fine with them).
EDIT: I realized that maybe the problem is not cantools but the python-can package; maybe can.ASCReader() doesn't handle j1939 frames well and omits them? I'm going to investigate myself, but I hope someone better at coding will help.
import pandas as pd
import can
import cantools
import time as t
from tqdm import tqdm
import re
import os
from binascii import unhexlify

dbcs = [filename.split('.')[0] for filename in os.listdir('./dbc/') if filename.endswith('.dbc')]
files = [filename.split('.')[0] for filename in os.listdir('./asc/') if filename.endswith('.asc')]

start = t.time()

db = cantools.database.Database()
for dbc in dbcs:
    with open(f'./dbc/{dbc}.dbc', 'r') as f:
        db.add_dbc(f)

f_num = 1
for fname in files:
    print(f'[{f_num}/{len(files)}] Parsing data from file: {fname}')
    log = can.ASCReader(f'./asc/{fname}.asc')
    entries = []
    all_msgs = []
    message = {'Time [s]': ''}
    database = list(db._frame_id_to_message.keys())
    print(database)
    lines = sum(1 for line in open(f'./asc/{fname}.asc'))
    msgs = iter(log)
    try:
        for msg, i in zip(msgs, tqdm(range(lines))):
            msg = re.split("\\s+", str(msg))
            timestamp = round(float(msg[1]), 0)
            msg_id = int(msg[3], 16)
            try:
                data = unhexlify(''.join(msg[7:15]))
            except:
                continue
            if msg_id in database:
                if timestamp != message['Time [s]']:
                    entries.append(message.copy())
                    message.update({'Time [s]': timestamp})
                message.update(db.decode_message(msg_id, data))
    except ValueError:
        print('ValueError')

    df = pd.DataFrame(entries[1:])
    duration = t.time() - start
    df.to_csv(f'./csv/{fname}.csv', index=False)
    print(f'DONE IN {int(round(duration, 2)//60)}min{round(duration % 60, 2)}s!\n{len(df.columns)} signals extracted!')
    f_num += 1
class can.ASCReader(file, base='hex')
Bases: can.io.generic.BaseIOHandler
Iterator of CAN messages from an ASC logging file. Meta data (comments, bus statistics, J1939 Transport Protocol messages) is ignored.
Might answer your question...
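If it is indeed ASCReader dropping or mangling the J1939 frames, a small diagnostic like the one below might confirm it. This is only a sketch (file paths are placeholders): it compares the arbitration IDs that ASCReader actually yields with the frame IDs defined in the DBC, using the Message attributes directly instead of re-parsing str(msg).

# Hypothetical diagnostic: which IDs does can.ASCReader actually yield, and how do
# they compare to the frame IDs in the DBC? Paths are placeholders.
import can
import cantools

db = cantools.database.load_file('./dbc/j1939.dbc')
dbc_ids = {m.frame_id for m in db.messages}

seen_ids = set()
for msg in can.ASCReader('./asc/example.asc'):
    # msg is a can.Message: use its attributes instead of splitting str(msg)
    seen_ids.add(msg.arbitration_id)

print('IDs defined in the DBC but never read from the .asc file:',
      sorted(hex(i) for i in dbc_ids - seen_ids))
print('IDs read from the .asc file but not defined in the DBC:',
      sorted(hex(i) for i in seen_ids - dbc_ids))

If the first set is non-empty while CANalyzer shows those frames, that points to ASCReader (or to the ID matching, since J1939 IDs embed the source address) rather than to cantools.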

Using Python Logging module to save logs to S3, how to capture all levels of logs for all modules?

It's my first time using the logging module in Python (3.7). My code uses imported modules that also have their own log statements. When I first added log statements to my code, I didn't use getLogger(). I just used logging.basicConfig(filename) and called logging.debug() directly to log statements. When I did this, all the logs from both my script and all the imported modules were output to the same file together.
Now I need to convert my code to save logs to s3 instead of a file. I tried the solution mentioned in How Can I Write Logs Directly to AWS S3 from Memory Without First Writing to stdout? (Python, boto3) - Stack Overflow but I have two issues with it:
None of the 'prefixes' are present in the output when I check on s3.
Only INFO statements are showing up. I was under the impression that logging.basicConfig(level=logging.INFO) would make it output all logs at or above level INFO, but I'm only seeing INFO. Also, only INFO logs get printed to stdout, whereas before all levels were. I don't know why the 'prefixes' are missing.
from psaw import PushshiftAPI
api = PushshiftAPI()
import time
import logging
import boto3
import io
import atexit

def write_logs(body, bucket, key):
    s3 = boto3.client("s3")
    s3.put_object(Body=body.getvalue(), Bucket=bucket, Key=key)

logging.basicConfig(level=logging.INFO)
log = logging.getLogger()
log_stringio = io.StringIO()
handler = logging.StreamHandler(log_stringio)
log.addHandler(handler)

def collectRange(sub, start, end):
    atexit.register(write_logs, body=log_stringio, bucket="<...>", key=f'{sub}/log.txt')
    s3 = boto3.resource('s3')
    object = s3.Object('<...>', f'{sub}/{sub}#{start}-{end}.csv')
    now = time.time()
    logging.info(f'Start Time:{now}')
    logging.debug('First request')
    gen = api.search_comments(after=start, before=end, <...>, subreddit=sub)
    r = next(gen)
    <...>
    quit()
Output:
Found credentials in shared credentials file: ~/.aws/credentials
Start Time:1591310443.7060978
https://api.pushshift.io/reddit/comment/search?<...>
https://api.pushshift.io/reddit/comment/search?<...>
Desired output:
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:root:Start Time:1591310443.7060978
DEBUG:root:First request
INFO:psaw.PushshiftAPI:https://api.pushshift.io/reddit/comment/search?<...>
DEBUG:psaw.PushshiftAPI:<whatever is usually here>
DEBUG:psaw.PushshiftAPI:<whatever is usually here>
INFO:psaw.PushshiftAPI:https://api.pushshift.io/reddit/comment/search?<...>
DEBUG:psaw.PushshiftAPI:<whatever is usually here>
DEBUG:psaw.PushshiftAPI:<whatever is usually here>
Any help is appreciated. Thanks.
You can at least add the level name (and the logger name or time) by following this part of the documentation: Changing the format of displayed messages.
And to get DEBUG records as well, pass level=logging.DEBUG instead of INFO:
logging.basicConfig(..., level=logging.DEBUG)
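Putting both suggestions together, a minimal sketch (reusing the StringIO handler from the question) could look like the following. Note that the extra StreamHandler needs its own Formatter; otherwise only the default stderr handler gets the prefixes.

# Sketch only: DEBUG level plus a format string so the "LEVEL:logger:" prefixes
# appear both on stderr and in the StringIO buffer that gets uploaded to S3.
import io
import logging

LOG_FORMAT = '%(levelname)s:%(name)s:%(message)s'
logging.basicConfig(level=logging.DEBUG, format=LOG_FORMAT)

log_stringio = io.StringIO()
handler = logging.StreamHandler(log_stringio)
handler.setFormatter(logging.Formatter(LOG_FORMAT))
logging.getLogger().addHandler(handler)

logging.debug('First request')  # now captured as "DEBUG:root:First request"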

How to handle BigQuery insert errors in a Dataflow pipeline using Python?

I'm trying to create a streaming pipeline with Dataflow that reads messages from a Pub/Sub topic and ends up writing them into a BigQuery table. I don't want to use any Dataflow template.
For the moment I just want to create a pipeline in a Python 3 script executed from a Google Compute Engine VM instance, which loads and transforms every message that arrives from Pub/Sub (parsing the records that it contains and adding a new field) and finally writes the results to a BigQuery table.
Simplifying, my code would be:
#!/usr/bin/env python
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1
import apache_beam as beam
import apache_beam.io.gcp.bigquery
import logging
import argparse
import sys
import json
from datetime import datetime, timedelta

def load_pubsub(message):
    try:
        data = json.loads(message)
        records = data["messages"]
        return records
    except:
        raise ImportError("Something went wrong reading data from the Pub/Sub topic")

class ParseTransformPubSub(beam.DoFn):
    def __init__(self):
        self.water_mark = (datetime.now() + timedelta(hours=1)).strftime("%Y-%m-%d %H:%M:%S.%f")

    def process(self, records):
        for record in records:
            record["E"] = self.water_mark
            yield record

def main():
    # parse_table_schema_from_json expects a JSON string, so read the file contents
    table_schema = apache_beam.io.gcp.bigquery.parse_table_schema_from_json(open("TableSchema.json").read())

    parser = argparse.ArgumentParser()
    parser.add_argument('--input_topic')
    parser.add_argument('--output_table')
    known_args, pipeline_args = parser.parse_known_args(sys.argv)

    # the with-block runs the pipeline and waits for it to finish on exit
    with beam.Pipeline(argv=pipeline_args) as p:
        pipe = (p
                | 'ReadDataFromPubSub' >> beam.io.ReadStringsFromPubSub(known_args.input_topic)
                | 'LoadJSON' >> beam.Map(load_pubsub)
                | 'ParseTransform' >> beam.ParDo(ParseTransformPubSub())
                | 'WriteToAvailabilityTable' >> beam.io.WriteToBigQuery(
                    table=known_args.output_table,
                    schema=table_schema,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
                )

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    main()
The messages published in the Pub/Sub topic come, for example, as follows:
'{"messages":[{"A":"Alpha", "B":"V1", "C":3, "D":12},{"A":"Alpha", "B":"V1", "C":5, "D":14},{"A":"Alpha", "B":"V1", "C":3, "D":22}]}'
Once the field "E" is added to the record, the structure of the record (a Python dictionary) and the data types of its fields are what the BigQuery table expects.
The problems that I want to handle are:
If some messages arrive with an unexpected structure, I want to fork the pipeline and write them to a separate BigQuery table.
If some messages arrive with an unexpected data type in a field, then an error will occur at the last stage of the pipeline, when they are written to the table. I want to manage this type of error by diverting the record to a third table.
I read the documentation found on the following pages but I found nothing:
https://cloud.google.com/dataflow/docs/guides/troubleshooting-your-pipeline
https://cloud.google.com/dataflow/docs/guides/common-errors
By the way, if I choose the option to configure the pipeline through the template that reads from a PubSubSubscription and writes into BigQuery I get the following schema which turns out to be the same one I'm looking for:
Template: Cloud Pub/Sub Subscription to BigQuery
You can't catch the errors that occur in the BigQuery sink; the messages that you write into BigQuery must already be valid.
The best pattern is to perform a transform which checks your message structure and field types. In case of error, you create an error flow and write that flow somewhere else, for example to a file, or to a table without a schema where you store the raw message as plain text.
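A minimal sketch of that pattern (the expected schema and table names here are assumptions, not the questioner's exact ones) is a DoFn that emits well-formed records on the main output and everything else on a tagged 'error' output, which can then be written to a separate BigQuery table:

# Sketch of a validation DoFn with a tagged side output for bad records.
import json
import apache_beam as beam

class ValidateRecord(beam.DoFn):
    EXPECTED_TYPES = {"A": str, "B": str, "C": int, "D": int}  # assumed schema

    def process(self, record):
        try:
            if all(isinstance(record.get(k), t) for k, t in self.EXPECTED_TYPES.items()):
                yield record
            else:
                yield beam.pvalue.TaggedOutput('error', {'raw': json.dumps(record)})
        except Exception:
            yield beam.pvalue.TaggedOutput('error', {'raw': str(record)})

# Usage inside the pipeline (table names are placeholders):
# valid, errors = (records
#                  | 'Validate' >> beam.ParDo(ValidateRecord()).with_outputs('error', main='valid'))
# valid  | 'WriteGood'   >> beam.io.WriteToBigQuery('project:dataset.good_table', ...)
# errors | 'WriteErrors' >> beam.io.WriteToBigQuery('project:dataset.error_table', ...)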
We do the following when errors occur at the BigQuery sink.
send a message (without stacktrace) to GCP Error Reporting, for developers to be notified
log the error to StackDriver
stop the pipeline execution (the best place for messages to wait until a developer has fixed the issue is the incoming Pub/Sub subscription)

FileUploadMiscError while persisting output file from Azure Batch

I'm facing the following error while trying to persist log files to Azure Blob Storage from an Azure Batch execution: "FileUploadMiscError - A miscellaneous error was encountered while uploading one of the output files". This error doesn't give a lot of information as to what might be going wrong. I tried checking the Microsoft documentation for this error code, but it doesn't mention it.
Below is the relevant code for adding the task to Azure Batch that I have ported from C# to Python for persisting the log files.
Note: The container that I have configured gets created when the task is added, but there's no blob inside.
import datetime
import logging
import os

import azure.storage.blob.models as blob_model
import yaml
from azure.batch import models
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.common.cloudstorageaccount import CloudStorageAccount
from dotenv import load_dotenv

LOG = logging.getLogger(__name__)

def add_tasks(batch_client, job_id, task_id, io_details, blob_details):
    task_commands = "This is a placeholder. Actual code has an actual task. This gets completed successfully."

    LOG.info("Configuring the blob storage details")
    base_blob_service = BaseBlobService(
        account_name=blob_details['account_name'],
        account_key=blob_details['account_key'])
    LOG.info("Base blob service created")

    base_blob_service.create_container(
        container_name=blob_details['container_name'], fail_on_exist=False)
    LOG.info("Container present")

    container_sas = base_blob_service.generate_container_shared_access_signature(
        container_name=blob_details['container_name'],
        permission=blob_model.ContainerPermissions(write=True),
        expiry=datetime.datetime.now() + datetime.timedelta(days=1))
    LOG.info(f"Container SAS created: {container_sas}")

    container_url = base_blob_service.make_container_url(
        container_name=blob_details['container_name'], sas_token=container_sas)
    LOG.info(f"Container URL created: {container_url}")

    # fpath = task_id + '/output.txt'
    fpath = task_id
    LOG.info(f"Creating output file object:")
    out_files_list = list()
    out_files = models.OutputFile(
        file_pattern=r"../stderr.txt",
        destination=models.OutputFileDestination(
            container=models.OutputFileBlobContainerDestination(
                container_url=container_url, path=fpath)),
        upload_options=models.OutputFileUploadOptions(
            upload_condition=models.OutputFileUploadCondition.task_completion))
    out_files_list.append(out_files)
    LOG.info(f"Output files: {out_files_list}")

    LOG.info(f"Creating the task now: {task_id}")
    task = models.TaskAddParameter(
        id=task_id, command_line=task_commands, output_files=out_files_list)
    batch_client.task.add(job_id=job_id, task=task)
    LOG.info(f"Added task: {task_id}")
There is a bug in Batch's OutputFile handling which causes it to fail to upload to containers if the full container URL includes any query-string parameters other than the ones included in the SAS token. Unfortunately, the azure-storage-blob Python module includes an extra query string parameter when generating the URL via make_container_url.
This issue was just raised to us, and a fix will be released in the coming weeks, but an easy workaround is instead of using make_container_url to craft the URL, craft it yourself like so: container_url = 'https://{}/{}?{}'.format(blob_service.primary_endpoint, blob_details['container_name'], container_sas).
The resulting URL should look something like this: https://<account>.blob.core.windows.net/<container>?se=2019-01-12T01%3A34%3A05Z&sp=w&sv=2018-03-28&sr=c&sig=<sig> - specifically it shouldn't have restype=container in it (which is what the azure-storage-blob package is including)
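Applied to the code in the question, the workaround might look like this (a sketch that reuses base_blob_service, blob_details and container_sas from the add_tasks function above):

# Build the container URL by hand instead of calling make_container_url, so no
# extra query-string parameters (such as restype=container) end up in the URL.
container_url = 'https://{}/{}?{}'.format(
    base_blob_service.primary_endpoint,   # e.g. <account>.blob.core.windows.net
    blob_details['container_name'],
    container_sas)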

Django - Heroku - Serving dynamic large file timing out

I have a function view that creates a report using xlsxwriter. The report is built on the fly using a StringIO as a buffer and is finally sent through an HttpResponse. It works well on the local development server.
The problem is that on Heroku, after some seconds (the documentation mentions a 30-second timeout that is not modifiable), the server hangs and reboots the web process, returning an error as the response.
What is the best way to:
create an xlsx file on the fly (dynamically) in memory,
serve the entire file to the client, and
prevent the server from hanging because of the long-running process?
This is a piece of the code I am using:
def reporte_usuarios(request):
    from xlsxwriter.workbook import Workbook
    try:
        import cStringIO as StringIO
    except ImportError:
        import StringIO

    # create a workbook in memory
    output = StringIO.StringIO()
    workbook = Workbook(output)
    bold = workbook.add_format({'bold': True})

    # get the data
    from django.db.models import Count
    usuarios = User.objects.filter(....... # all filter stuff

    for usr in usuarios:
        if usr.activos > 0:
            # create a workbook sheet for every registered User
            ws = workbook.add_worksheet(u'%s' % usr.username)
            # some relevant user data
            ws.write(1, 1, u'USUARIO: %s' % usr.username)
            ...
            # get rows for user
            log = LogActivos.objects.filter(usuario=usr).select_related('activo__unidad__id', 'activo__unidad__nombre', 'activo__nombre')
            # write headers
            ws.write(3, 0, u'FECHA', bold)
            ...
            sig_fila = 4  # starting row for data (after headers)
            for l in log:
                # write all data
                ws.write(sig_fila, 0, u'%s' % l.fecha)
                ...
                sig_fila += 1

    # close the workbook
    workbook.close()
    # go to the beginning of the buffer
    output.seek(0)
    # response using the buffer
    response = HttpResponse(output.read(), content_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
    response['Content-Disposition'] = 'attachment; filename="ACTIVOS_USUARIOS__%s.xlsx"' % datetime.now().strftime("%Y%m%d_%H%M")
    return response
Note: I am using Gunicorn on Heroku, Django 1.9.13 and Python 2.7.11.
IMHO you should follow a totally different approach in this case.
As you are generating a rather big file, it's normal for the request to hit the timeout.
What you could do instead is deploy a background task queue, like Celery or DjangoRQ. With that, a background task creates the file from your user's data, and then you can let the user know that it's ready by any means you like, such as a notification or an email.
If you need more details on how to do something like this, let me know and I can help :)
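For example, a rough sketch of the Celery variant could look like the following (all names are hypothetical, and the worksheet-building loop is the same one from the view above): the view only enqueues the task, and the task writes the finished workbook to the default file storage so the user can be pointed at it later.

# Sketch only: build the report in a Celery worker instead of inside the request.
from io import BytesIO

from celery import shared_task
from django.core.files.base import ContentFile
from django.core.files.storage import default_storage
from xlsxwriter.workbook import Workbook

@shared_task
def build_usuarios_report(filename):
    output = BytesIO()
    workbook = Workbook(output)
    # ... same worksheet-building loop as in reporte_usuarios() ...
    workbook.close()
    output.seek(0)
    default_storage.save(filename, ContentFile(output.read()))
    return filename

# In the view, enqueue and return immediately:
# build_usuarios_report.delay('reports/ACTIVOS_USUARIOS.xlsx')
# then notify the user (email, notification, or polling) once the file exists.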
