I am new-ish to GCP, Dataflow, Apache Beam, Python, and OOP in general. I come from the land of functional JavaScript, for context.
Right now I have a streaming pipeline built with the Apache Beam python sdk, and I deploy it to GCP's Dataflow. The pipeline's source is a pubsub subscription, and the sink is a datastore.
The pipeline picks up a message from a pubsub subscription, makes a decision based on a configuration object + the contents of the message, and then puts it in the appropriate spot in the datastore depending on what decision it makes. This is all working presently.
Now I am in a situation where the configuration object, which is currently hardcoded, needs to be more dynamic. By that I mean: instead of just hardcoding the configuration object, we are now instead going to make an API call that will return the configuration. That way, we can update the configuration without having to redeploy the pipeline. This also works presently.
But! We are anticipating heavy traffic, so it is NOT ideal to fetch the configuration for every single message that comes in. So we are moving the fetch to the beginning, right before the actual pipeline starts. But this means we immediately lose the value of having it come from an API call, because the call now happens only once, at pipeline startup.
Here is what we have so far (stripped out irrelevant parts for clarity):
def run(argv=None):
    options = PipelineOptions(
        streaming=True,
        save_main_session=True
    )

    configuration = get_configuration()  # api call to fetch config

    with beam.Pipeline(options=options) as pipeline:
        # read incoming messages from pubsub
        incoming_messages = (
            pipeline
            | "Read Messages From PubSub"
            >> beam.io.ReadFromPubSub(subscription="our subscription here", with_attributes=True))

        # make a decision based off of the message + the config
        decision_messages = (
            incoming_messages
            | "Create Decision Messages" >> beam.FlatMap(create_decision_message, configuration)
        )
create_decision_message takes in the incoming message from the stream plus the configuration and then, you guessed it, makes a decision. The logic is pretty simple. Think "if the message is apples, and the configuration says we only care about oranges, then do nothing with the message". We need to be able to update the configuration on the fly to say "never mind, we suddenly care about apples too".
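For concreteness, a rough sketch of what create_decision_message looks like (the attribute and config field names here are made up for illustration; the real logic is a bit more involved):

def create_decision_message(message, configuration):
    # message is a PubsubMessage (we read with with_attributes=True),
    # configuration is a plain dict fetched from our API
    fruit = message.attributes.get("fruit")  # hypothetical attribute
    if fruit in configuration.get("fruits_we_care_about", []):
        yield (fruit, message)  # emit a decision; FlatMap flattens these
    # otherwise yield nothing, and the message is dropped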
I need to figure out a way to let the pipeline know it needs to re-fetch that configuration every 15 minutes. I'm not totally sure of the best way to do that with the tools I'm using. If it were JavaScript, I would do something like:
(please forgive the pseudo-code, not sure if this would actually run but you get the idea)
let fetch_time = Date.now() // initialized when app starts
let expiration = 900 * 1000 // 15 mins in ms (Date.now() returns milliseconds)
let config = getConfigFromApi() // fetch config right when app starts

function fetchConfig(now) {
  if (fetch_time + expiration < now) {
    // the cached config has expired, so we need to re-fetch it
    config = getConfigFromApi() // assign new value to config var
    fetch_time = now // assign new value to fetch_time var
  }
  return config
}

...

const someLaterTime = Date.now() // later in the code, within the pipeline, I need to use the config object
const validConfig = fetchConfig(someLaterTime) // I pass in the current time and get back either the memory-cached config, or a just-recently-fetched config
I'm not really sure how to translate this concept to Python, and I'm not really sure if I should. Is this a reasonable thing to try to pull off? Or is this type of behavior not congruent with the stack I'm using? I'm in a position where I'm the only one on my team working on this, and it's a greenfield project, so there are no examples anywhere of how it's been done in the past. I'm not sure if I should try to figure this out, or if I should say "sorry bossman, we need another solution".
Any help is appreciated, no matter how small... thank you!
I think there are multiple ways to implement what you want to achieve. The most straightforward is probably stateful processing, in which you record your config as state in a stateful DoFn and set a looping timer to refresh the record.
You can read more about stateful processing here: https://beam.apache.org/blog/timely-processing/
and more about state and timers in the Beam programming guide: https://beam.apache.org/documentation/programming-guide/#types-of-state.
I would imagine you can define your processing logic, which requires the config, in a ParDo like:
import json

import apache_beam as beam
from apache_beam import DoFn
from apache_beam.coders import coders
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec, TimerSpec, on_timer
from apache_beam.utils.timestamp import Duration, Timestamp

class MakeDecision(beam.DoFn):
    CONFIG = ReadModifyWriteStateSpec('config', coders.StrUtf8Coder())
    REFRESH_TIMER = TimerSpec('refresh', TimeDomain.REAL_TIME)

    def process(self,
                element,
                config=DoFn.StateParam(CONFIG),
                timer=DoFn.TimerParam(REFRESH_TIMER)):
        valid_config = {}
        cached = config.read()
        if cached:
            valid_config = json.loads(cached)
        else:  # config is None and hasn't been fetched before.
            valid_config = fetch_config()  # your own fetch function.
            config.write(json.dumps(valid_config))
            timer.set(Timestamp.now() + Duration(seconds=900))
        # Do whatever you need with the config.
        ...

    @on_timer(REFRESH_TIMER)
    def refresh_config(self,
                       config=DoFn.StateParam(CONFIG),
                       timer=DoFn.TimerParam(REFRESH_TIMER)):
        valid_config = fetch_config()
        config.write(json.dumps(valid_config))
        timer.set(Timestamp.now() + Duration(seconds=900))  # loop: fire again in 15 minutes
And then you can process your messages with the stateful DoFn. One caveat: state and timers in Beam are scoped per key, so the PubSub messages need to be keyed first (a single dummy key, as below, keeps the example simple but funnels all state through one key; in practice you'd pick a real key):
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read Messages From PubSub"
        >> beam.io.ReadFromPubSub(subscription="our subscription here", with_attributes=True)
        | "Key Messages" >> beam.Map(lambda msg: (None, msg))
        | "Make decision" >> beam.ParDo(MakeDecision())
    )
I have a simple DAG that takes arguments from a MySQL db (like sql, subject).
Then I have a function that creates a report and sends it to a particular email.
Here is a code snippet.
def s_report(k, **kwargs):
    body_sql = list2[k][4]
    request1 = "({})".format(body_sql)
    dwh_hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
    df1 = dwh_hook.get_pandas_df(request1)
    df2 = df1.to_html()
    body_Text = list2[k][3]
    html_content = f"""HI Team, Please find report<br><br>
    {df2} <br> </br>
    <b>Thank you!</b><br>
    """
    return EmailOperator(task_id="send_email_snowflake{}".format(k), to=list2[k][1],
                         subject=f"{list2[k][2]}", html_content=html_content, dag=dag)

for j in range(len(list)):
    mysql_list >> [s_report(j)] >> end_operator
The s_report tasks are generated dynamically. But the real problem is that the hook is continuously submitting queries in the backend: even while the DAG is stopped, it is still submitting queries.
I can use PythonOperator, but it doesn't generate dynamic tasks.
A couple of things:
By looking at your code, in particular the lines:
for j in range(len(list)):
    mysql_list >> [s_report(j)] >> end_operator

we can determine that if your first task succeeds, namely mysql_list, then the tasks downstream of it, namely the s_report calls, should begin executing. You have precisely len(list) of them. Within each s_report call there is exactly one dwh_hook.get_pandas_df(request1) call, so I believe your DAG should be making len(list) calls of this type, provided the mysql_list task succeeds.
As for the mismatch you see in your Snowflake logs, I can't advise you here; I'd need more details. Keep in mind that get_pandas_df might have a retry mechanism (i.e. if it cannot reach Snowflake, retry), which might explain why your Snowflake logs show a bunch of requests.
If your DAG finishes successfully (i.e. the end_operator task finishes successfully), you are correct: there should be no requests in your Snowflake logs after the DAG ends.
If you want more insight as to how your DAG interacts with your Snowflake resource, I'd suggest having a single s_report task like so:
mysql_list >> [s_report(0)] >> end_operator
and see the behaviour in the logs.
I have done some Python programming, but I'm not a seasoned developer by any stretch of the imagination. We have a Python ETL programme which was set up as a Cloud Function, but it is timing out as there is just too much data to load, and we are looking to re-write it to work in Dataflow.
The code at the moment simply connects to an API, which returns newline-delimited JSON, and then the data is loaded into a new table in BigQuery.
This is our first time using Dataflow and we are just trying to get to grips with how it works. It seems pretty easy to get the data into BigQuery; the stumbling block we are hitting is how to get the data out of the API. It's not clear to us how we can make this work: do we need to go down the route of developing a new I/O connector as per [Develop IO Connector]? Or is there another option, as developing a new connector seems complex?
We've done a lot of googling, but haven't found anything obvious to help.
Here is a sample of our code, but we are not 100% sure it's on the right track. The code doesn't work, and we think it needs to be a .io.read and not a .ParDo initially, but we aren't quite sure where to go with that. Some guidance would be much appreciated!
class callAPI(beam.DoFn):
    def __init__(self, input_header, input_uri):
        self.headers = input_header
        self.remote_url = input_uri

    def process(self):
        try:
            res = requests.get(self.remote_url, headers=self.headers)
            res.raise_for_status()
        except HTTPError as message:
            logging.error(message)
            return
        return res.text

with beam.Pipeline() as p:
    data = (p
            | 'Call API ' >> beam.ParDo(callAPI(HEADER, REMOTE_URI))
            | beam.Map(print))
Thanks in advance.
You are on the right track, but there are a couple of things to fix.
As you point out, the root of a pipeline needs to be a read of some kind. The ParDo operation processes a set of elements (ideally in parallel), but needs some input to process. You could do

p | beam.Create(['a', 'b', 'c']) | beam.ParDo(SomeDoFn())

in which SomeDoFn will be passed a, b, and c in its process method. There is a special p | beam.Impulse() operation that will produce a single None element if there's no reasonable input and you want to ensure your DoFn is called just once. You can also read elements from a file (or similar). Note that your process method takes both self and the element to be processed, and returns an iterable (to allow zero or more outputs; there are also beam.Map and beam.FlatMap, which encapsulate the simpler patterns). So you could do something like
import logging

import apache_beam as beam
import requests
from requests.exceptions import HTTPError

class CallAPI(beam.DoFn):
    def __init__(self, input_header):
        self.headers = input_header

    def process(self, input_uri):
        try:
            res = requests.get(input_uri, headers=self.headers)
            res.raise_for_status()
        except HTTPError as message:
            logging.error(message)
            return  # emit nothing for a failed request
        yield res.text

with beam.Pipeline() as p:
    data = (
        p
        | beam.Create([REMOTE_URI])
        | 'Call API' >> beam.ParDo(CallAPI(HEADER))
        | beam.Map(print))
which would allow you to read from more than one URI (in parallel) in the same pipeline.
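If it helps, here is a hedged sketch of how that output could then flow into BigQuery, assuming the API returns newline-delimited JSON. Take data as the output of the ParDo (i.e. drop the final beam.Map(print)); the table spec and schema below are placeholders, not from your code:

import json

rows = (
    data
    | 'Split Lines' >> beam.FlatMap(lambda text: text.splitlines())
    | 'Parse JSON' >> beam.Map(json.loads))

rows | 'Write To BigQuery' >> beam.io.WriteToBigQuery(
    'my-project:my_dataset.my_table',    # placeholder table spec
    schema='name:STRING,value:INTEGER',  # placeholder schema
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)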
You could write a full IO connector if your source is such that it can be split (ideally dynamically) rather than only read in one huge request.
Can you share the code from your cloud function?
Is this a scheduled task or triggered by an event? If it is a scheduled task, Apache Airflow may be a better option; you could use the Dataflow Python operator and BigQuery operators to do what you're looking for:
Apache Airflow https://airflow.apache.org/
DataFlowPythonOperator https://airflow.apache.org/docs/apache-airflow/1.10.6/_api/airflow/contrib/operators/dataflow_operator/index.html#airflow.contrib.operators.dataflow_operator.DataFlowPythonOperator
BigQueryOperator https://airflow.apache.org/docs/apache-airflow/1.10.14/_api/airflow/contrib/operators/bigquery_operator/index.html
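For illustration, a minimal sketch of such a DAG (untested; the import path matches the Airflow 1.10.x contrib operators linked above, and the GCS path and project are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

with DAG("api_to_bigquery", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    run_pipeline = DataFlowPythonOperator(
        task_id="run_dataflow_pipeline",
        py_file="gs://my-bucket/pipelines/api_to_bq.py",  # placeholder pipeline file
        options={"project": "my-project"},                # placeholder options
    )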
Has anyone gotten parallel tests to work in Django with Elasticsearch? If so, can you share what configuration changes were required to make it happen?
I've tried just about everything I can think of to make it work, including the solution outlined here. Taking inspiration from how Django itself handles parallel DBs, I currently have a custom ParallelTestSuite that overrides init_worker to iterate through each index/doctype and change the index names, roughly as follows:
_worker_id = 0

def _elastic_search_init_worker(counter):
    global _worker_id

    with counter.get_lock():
        counter.value += 1
        _worker_id = counter.value

    for alias in connections:
        connection = connections[alias]
        settings_dict = connection.creation.get_test_db_clone_settings(_worker_id)
        # connection.settings_dict must be updated in place for changes to be
        # reflected in django.db.connections. If the following line assigned
        # connection.settings_dict = settings_dict, new threads would connect
        # to the default database instead of the appropriate clone.
        connection.settings_dict.update(settings_dict)
        connection.close()

    ### Everything above this is from the Django version of this function ###

    # Update index names in doctypes
    for doc in registry.get_documents():
        doc._doc_type.index += f"_{_worker_id}"

    # Update index names for indexes and create new indexes
    for index in registry.get_indices():
        index._name += f"_{_worker_id}"
        index.delete(ignore=[404])
        index.create()

    print(f"Started thread # {_worker_id}")
This seems to generally work; however, there's some weirdness that happens seemingly randomly (i.e. running the test suite again doesn't reliably reproduce the issue and/or the error messages change). These are the various errors I've gotten, and each test run seems to fail randomly on one of them:
a 404 raised when trying to create the index in the function above (I've confirmed that it's the 404 coming back from the PUT request; however, the Elasticsearch server logs say the index was created without issue)
a 500 when trying to create the index, although this one hasn't happened in a while, so I think it was fixed by something else
query responses sometimes not having an items dictionary value inside the _process_bulk_chunk function from the elasticsearch library
I'm thinking that there's something weird going on at the connection layer (like somehow the connections between Django test runner processes are getting the responses mixed up?) but I'm at a loss as to how that would be even possible since Django uses multiprocessing to parallelize the tests and thus they are each running in their own process. Is it somehow possible that the spun-off processes are still trying to use the connection pool of the original process or something? I'm really at a loss of other things to try from here and would greatly appreciate some hints or even just confirmation that this is in fact possible to do.
I'm thinking that there's something weird going on at the connection layer (like somehow the connections between Django test runner processes are getting the responses mixed up?) but I'm at a loss as to how that would be even possible since Django uses multiprocessing to parallelize the tests and thus they are each running in their own process. Is it somehow possible that the spun-off processes are still trying to use the connection pool of the original process or something?
This is exactly what is happening. From the Elasticsearch DSL docs:
Since we use persistent connections throughout the client it means that the client doesn’t tolerate fork very well. If your application calls for multiple processes make sure you create a fresh client after call to fork. Note that Python’s multiprocessing module uses fork to create new processes on POSIX systems.
What I observed happening is that the responses get very weirdly interleaved with a seemingly random client that may have started the request. So a request to index a document might end up receiving the response for creating an index, and the two have very different attributes on them.
The fix is to ensure that each test worker has its own Elasticsearch client. This can be done by creating worker-specific connection aliases and then overwriting the current connection aliases (via the private attribute _using) with the worker-specific ones. Below is a modified version of the code you posted with that change:
_worker_id = 0

def _elastic_search_init_worker(counter):
    global _worker_id

    with counter.get_lock():
        counter.value += 1
        _worker_id = counter.value

    for alias in connections:
        connection = connections[alias]
        settings_dict = connection.creation.get_test_db_clone_settings(_worker_id)
        # connection.settings_dict must be updated in place for changes to be
        # reflected in django.db.connections. If the following line assigned
        # connection.settings_dict = settings_dict, new threads would connect
        # to the default database instead of the appropriate clone.
        connection.settings_dict.update(settings_dict)
        connection.close()

    ### Everything above this is from the Django version of this function ###

    from elasticsearch_dsl.connections import connections as es_connections

    # Each worker needs its own connection to Elasticsearch; the client uses
    # global connection objects that do not play nice otherwise. Register a
    # worker-specific alias for every alias in the project settings.
    worker_connection_postfix = f"_worker_{_worker_id}"
    es_connections.configure(**{
        alias + worker_connection_postfix: params
        for alias, params in settings.ELASTICSEARCH_DSL.items()
    })

    # Update index names in doctypes and use the worker-specific connection
    for doc in registry.get_documents():
        doc._doc_type.index += f"_{_worker_id}"
        doc._doc_type._using = doc._doc_type._using + worker_connection_postfix

    # Update index names for indexes, use the worker-specific connection,
    # and create the new indexes
    for index in registry.get_indices():
        index._name += f"_{_worker_id}"
        index._using = index._using + worker_connection_postfix
        index.delete(ignore=[404])
        index.create()

    print(f"Started thread # {_worker_id}")
I am new to ZeroMQ, so I am struggling with some code.
If I run the following code, no error is shown:
import zmq.asyncio
ctx = zmq.asyncio.Context()
rcv_socket = ctx.socket(zmq.PULL)
rcv_socket.connect("ipc:///tmp/test")
rcv_socket.bind("ipc:///tmp/test")
But, if I try to use the function zmq_getsockopt(), it fails :
import zmq.asyncio
ctx = zmq.asyncio.Context()
rcv_socket = ctx.socket(zmq.PULL)
rcv_socket.connect("ipc:///tmp/test")
socket_path = rcv_socket.getsockopt(zmq.LAST_ENDPOINT)
rcv_socket.bind("ipc://%s" % socket_path)
Then I get :
zmq.error.ZMQError: No such file or directory for ipc path "b'ipc:///tmp/test'".
new to ZeroMQ, so I am struggling with some code.
Well, you will be way, way better off if you start by first understanding the Rules of the Game, rather than learning from crashes (yes, quite contrary to what wannabe-evangelisation-gurus pump into the crowds, "just coding" is not enough for doing serious business).
This is why:
If you read the published API, it will still confuse you most of the time if you have no picture of the structure of the system and do not understand its internal and external behaviours (the framework's Rules of the Game):
The ZMQ_LAST_ENDPOINT option shall retrieve the last endpoint bound for TCP and IPC transports. The returned value will be a string in the form of a ZMQ DSN. Note that if the TCP host is INADDR_ANY, indicated by a *, then the returned address will be 0.0.0.0 (for IPv4).
This states the point, yet without knowing the concept, the point stays hidden from you.
The Best Next Step
If you are indeed serious about low-latency and distributed computing, the best next step, after reading the link above, is to stop coding and first take some time to read and understand Pieter Hintjens' fabulous book "Code Connected, Volume 1" before going any further. It is definitely worth your time.
Then, you will see why this will never fly:
import zmq.asyncio

ctx = zmq.asyncio.Context()
rcv_socket = ctx.socket(zmq.PULL)
rcv_socket.connect("ipc:///tmp/test")
socket_path = rcv_socket.getsockopt(zmq.LAST_ENDPOINT)  # nothing has been bound yet
rcv_socket.bind("ipc://%s" % socket_path)  # "ipc://" + repr(bytes) is not a valid ipc path
whereas this one may, once the returned value (a NUL-terminated character string, arriving in Python as bytes) is properly decoded before being reused:
import zmq.asyncio

ctx = zmq.asyncio.Context()
rcv_socket = ctx.socket(zmq.PULL)
rcv_socket.bind("ipc:///tmp/test")  # bind first: LAST_ENDPOINT reports the last endpoint *bound*
socket_path = rcv_socket.getsockopt(zmq.LAST_ENDPOINT)  # bytes, e.g. b'ipc:///tmp/test'
rcv_socket.connect(socket_path.decode())  # already a full DSN, so no extra "ipc://" prefix
I'm working on a Django application which reads a csv file from Dropbox, parses the data, and stores it in a database. For this purpose I need a background task which checks if the file is modified or changed (updated) and then updates the database.
I've tried Celery but failed to configure it with Django. Then I found django-background-tasks, which is much simpler than Celery to configure.
My question here is how to initialize repeating tasks?
It is described in the documentation,
but I'm unable to find any example which explains how to use repeat, repeat_until, or the other constants mentioned in the documentation.
Can anyone explain the following with examples, please?
notify_user(user.id, repeat=<number of seconds>, repeat_until=<datetime or None>)
repeat is given in seconds. The following constants are provided:
Task.NEVER (default), Task.HOURLY, Task.DAILY, Task.WEEKLY,
Task.EVERY_2_WEEKS, Task.EVERY_4_WEEKS.
You have to call the particular function (notify_user()) when you actually need to execute it.
Suppose you need to execute the task when a request comes to the server; then it would be like this:

from background_task import background

@background(schedule=60)
def get_csv(creds):
    # read csv from Dropbox with credentials, "creds"
    # then update the DB
    ...

def myview(request):
    # do something with my view
    get_csv(creds, repeat=100)
    return SomeHttpResponse
Execution procedure:
1. A request comes to the URL, so it is dispatched to the corresponding view, here myview().
2. The line get_csv(creds, repeat=100) executes and creates an async task in the DB (it won't execute the function now).
3. The HTTP response is returned to the user.
After 60 seconds from task creation, get_csv(creds) will execute, and it will then repeat every 100 seconds.
For example, suppose you have this function from the documentation:

from background_task import background

@background(schedule=60)
def notify_user(user_id):
    # look up the user by id and send them a message
    user = User.objects.get(pk=user_id)
    user.email_user('Here is a notification', 'You have been notified')

Suppose you want to repeat this task daily until New Year's Day of 2019; you would do the following:

import datetime
from background_task.models import Task

new_years_2019 = datetime.datetime(2019, 1, 1)
notify_user(some_id, repeat=Task.DAILY, repeat_until=new_years_2019)
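Applied to the Dropbox use case from the question, the wiring might look roughly like this (sync_dropbox_csv and the schedule values are made up for illustration; take care to schedule it only once, e.g. from a one-off management command, or you will pile up duplicate tasks):

from background_task import background
from background_task.models import Task

@background(schedule=60)
def sync_dropbox_csv():
    # check whether the csv in Dropbox was modified and update the DB if so
    ...

# schedule once; it then repeats hourly with no end date
sync_dropbox_csv(repeat=Task.HOURLY, repeat_until=None)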