Task processing in Python

I need help with a technology choice for the following problem:
Data files arrive at the system for processing. Some of them are self-contained (.a) and can be processed immediately, and some (.b) need to wait for more files to make up a full set. I'm loading everything that arrives into a DB, assigning package ids, and I can send a message on the MQ.
What I need here is a component that connects to that queue and listens to those messages. When it receives a message that a file has arrived, it needs to do more or less the following:
If the file name is taskA.a, then create a request to workerA.
If the file name is taskB.a, then create a request to workerB.
If the file name is taskA.b and we got taskA.c before, then send a request to workerA with both ids.
Which technology should I use? There are a few, like Celery, but it's hard to find the proper one just by reading the docs.
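To illustrate, the dispatch rules above can be sketched independently of the broker. In the hedged sketch below, send_to_worker_a/send_to_worker_b and the in-memory pending dict are placeholders for the real worker requests and the DB-backed package state:

# Broker-agnostic sketch of the dispatch rules; wire handle_file_arrived()
# into whatever MQ consumer you end up choosing.
pending = {}  # task name -> {extension: package id} for multi-part sets

def send_to_worker_a(*package_ids):
    print("workerA request:", package_ids)  # placeholder for a real worker call

def send_to_worker_b(*package_ids):
    print("workerB request:", package_ids)  # placeholder for a real worker call

def handle_file_arrived(filename, package_id):
    task, ext = filename.rsplit(".", 1)
    if ext == "a":
        # self-contained files are dispatched immediately
        if task == "taskA":
            send_to_worker_a(package_id)
        elif task == "taskB":
            send_to_worker_b(package_id)
    else:
        # multi-part files wait until the full set has arrived
        parts = pending.setdefault(task, {})
        parts[ext] = package_id
        if "b" in parts and "c" in parts:
            send_to_worker_a(parts["b"], parts["c"])
            del pending[task]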

Related

Kafka connector sink for topic not known in advance

Generic explanation: My application consumes messages from a topic and then splits them into separate topics according to their id, so the topics are named like topic_name_id. My goal is to connect those new topics to a certain sink (S3 or Snowflake, haven't decided) so that the messages published in those topics will end up there. However, I've only found ways to do this using a configuration file, where you connect the sink to a topic that already exists and whose name you know. But here the goal would be to connect the sink to the topic created during the process. Is there a way this can be achieved?
If the above is not possible, is there a way to connect to the common topic with all the messages, but create different tables (in Snowflake) or S3 directories according to the message ID? Adding to that, in the case of S3, the messages are added as individual JSON files, right? Is there no way to combine them into one file?
Thanks
The outgoing IDs are known, right?
Kafka Connect exposes a REST API: you can generate a JSON HTTP body using those IDs and the finalized topic names, then use requests, for example, to create and start connectors for those topics. You can do that directly from the process before starting the producer, or you can send a request with the ID/topic name to a Lambda job instead, which communicates with the Connect API.
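For illustration, creating and starting a sink connector for one of those finalized topics via the Connect REST API could look roughly like this (a sketch; the Confluent S3 sink config keys, bucket, region and topic naming are assumptions to adapt to your setup):

import requests

CONNECT_URL = "http://localhost:8083"  # your Kafka Connect REST endpoint

def create_s3_sink(topic_id):
    topic = f"topic_name_{topic_id}"
    body = {
        "name": f"s3-sink-{topic_id}",
        "config": {
            # keys below are for the Confluent S3 sink connector; swap in the
            # connector class and settings of whatever sink you decide on
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "topics": topic,
            "s3.bucket.name": "my-bucket",
            "s3.region": "us-east-1",
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            "flush.size": "100",
        },
    }
    resp = requests.post(f"{CONNECT_URL}/connectors", json=body)
    resp.raise_for_status()
    return resp.json()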
When using different topics with the S3 sink connector, there will be separate S3 paths and separate files, based on the number of partitions in the topic and the other partitioner settings defined in your connector properties. Most S3 processes are able to read full S3 prefixes, though, so I don't imagine that being an issue.
I don't have experience with the Snowflake connector to know how it handles different topic names.

How do I get the value of the 'Branch files on creation' checkbox in the Stream creation form from a form-in trigger?

I am setting up a Perforce trigger on a Helix Core server which is triggered on stream creation forms when the user submits the form for the stream to be created. I want to validate the name of the stream the user has selected. I also want to warn the user if they decide not to select the 'Branch files from parent on stream creation' setting.
Stream creation form example
So far I have set up the trigger, which works and can currently verify the stream name. However, I am unable to find a way to get the boolean value of the checkbox for the 'Branch files...' setting.
Trigger line:
check-stream-name form-save stream "python /p4/common/bin/triggers/check_stream_name_trigger.py %formfile%"
I get all the form fields by grabbing the 'formfile' variable when the trigger is hit, which gives me a temporary file name that I can read to get values for most of the fields on the form. But the 'Branch files...' field value is not shown. The formfile contents look like this:
Stream: //TestDepot/devs/TestStream_dev
Update:
Access:
Owner: me
Name: TestStream_dev
Parent: //TestDepot/Mainlines/ShortStream
Type: development
Description:
Created by me.
Options: allsubmit unlocked toparent fromparent mergedown
Paths:
share ...
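For reference, the trigger script reads those fields from the %formfile% temp file roughly like this (a rough, hypothetical parse that is good enough for simple specs):

import sys

def read_form_fields(formfile_path):
    # Very rough parse of the spec file Perforce passes as %formfile%.
    fields = {}
    current = None
    with open(formfile_path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            if not line[0].isspace() and ":" in line:
                current, _, value = line.partition(":")
                fields[current] = value.strip()
            elif current:
                # indented continuation lines (e.g. Paths entries)
                fields[current] += "\n" + line.strip()
    return fields

if __name__ == "__main__":
    spec = read_form_fields(sys.argv[1])
    print(spec.get("Stream"), spec.get("Options"))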
How do I get the value of the 'Branch files...' checkbox in the trigger?
You can't, because this value isn't part of the actual stream spec on the server. When you check that box in the P4V UI, it will run the p4 populate command immediately after saving the spec (after your form trigger runs). The server doesn't know anything about it until that command is executed.
It's pretty difficult to trigger on something that doesn't happen, unfortunately, especially if you're trying to trigger on it not happening in the future. A somewhat complicated solution might be to have a form-commit trigger on streams that kicks off a delayed check to see if the stream creation is quickly followed by a populate (i.e. check the type of the stream and its paths, wait at least a few seconds to allow time for a p4 populate, and then see if there are depot files that should exist but don't yet), and then potentially sends the user a followup email letting them know they haven't populated the stream yet (and here's how to do it).
Alternatively, you could just have that form-commit trigger run the populate (assuming you want to always populate a stream with the head revisions of its parent -- that might not actually be something you'd want to enforce across the board though, since it can be useful to create a stream that's based on earlier revisions).
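A minimal sketch of that always-populate variant, assuming a trigger line along the lines of the one above but using form-commit and %formname% (verify the exact p4 populate flags against your server version before relying on this):

# e.g. populate-stream form-commit stream "python /p4/common/bin/triggers/populate_stream.py %formname%"
import subprocess
import sys

def populate_stream(stream_name):
    subprocess.run(
        ["p4", "populate",
         "-d", "Initial branch from parent (form-commit trigger)",
         "-S", stream_name,
         "-r"],  # assumption: -r branches files from the parent into the new stream
        check=True,
    )

if __name__ == "__main__":
    populate_stream(sys.argv[1])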

Concurrent file upload/download and running background processes

I want to create a minimal webpage where concurrent users can upload a file, and I can process the file (which is expected to take some hours) and email the results back to the user later on.
Since I am hosting this on AWS, I was thinking of invoking some background process once I receive the file, so that even if the user closes the browser window, the processing keeps taking place and I am able to send the results after a few hours, all through some pre-written scripts.
Can you please help me with the logistics of how I should do this?
Here's how it might look (hosting-agnostic):
A user uploads a file on the web server
The file is saved in storage that can be accessed later by the background jobs
Some metadata (location in the storage, user's email etc.) about the file is saved in a DB/message broker
Background jobs tracking the DB/message broker pick up the metadata, start handling the file (this is why it needs to be accessible to them, per point 2) and notify the user
More specifically, in the case of Python/Django + AWS you might use the following stack:
Let's assume you're using Python + Django
You can save the uploaded files in a private AWS S3 bucket
Some metadata might be saved in the DB, or use Celery + AWS SQS, or AWS SQS directly, or bring up something like RabbitMQ or Redis (+ pub/sub)
Have Python code handling the job - this depends on what you opt for in point 3. The only requirement is that it can pull data from your S3 bucket. After the job is done, notify the user via AWS SES (see the sketch after this list)
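A rough sketch of how the Celery task for that stack could look (the queue transport, bucket, and addresses are placeholders; the upload view would call process_file.delay(bucket, key, email) after saving the file to S3):

import boto3
from celery import Celery

app = Celery("jobs", broker="sqs://")  # SQS transport; uses the host's/role's AWS credentials

@app.task
def process_file(bucket, key, user_email):
    s3 = boto3.client("s3")
    local_path = f"/tmp/{key.rsplit('/', 1)[-1]}"
    s3.download_file(bucket, key, local_path)

    # ... hours of processing on local_path ...

    ses = boto3.client("ses")
    ses.send_email(
        Source="noreply@example.com",
        Destination={"ToAddresses": [user_email]},
        Message={
            "Subject": {"Data": "Your file has been processed"},
            "Body": {"Text": {"Data": "Results are ready."}},
        },
    )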
The simplest single-server setup that doesn't require any intermediate components:
A Python script that simply saves the file in a folder and gives it a name like someuser#yahoo.com-f9619ff-8b86-d011-b42d-00cf4fc964ff
A cron job looking for any files in this folder, which handles the files it finds and notifies the user. Note that if you need multiple background jobs running in parallel, you'll need to complicate the scheme slightly to avoid race conditions (i.e. rename the file being processed so that only a single job handles it); see the sketch below
In a prod app you'll likely need something in between depending on your needs
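And a minimal sketch of the cron-driven single-server variant (the folder, the file-name encoding, and the SMTP relay are placeholders; claiming a file by renaming it keeps parallel jobs from processing the same upload):

import os
import smtplib
from email.message import EmailMessage

INBOX = "/var/uploads/pending"  # folder the upload script writes into

def claim_and_process():
    for name in os.listdir(INBOX):
        if name.endswith(".processing"):
            continue
        src = os.path.join(INBOX, name)
        claimed = src + ".processing"
        try:
            os.rename(src, claimed)  # atomic on the same filesystem: only one job wins
        except OSError:
            continue  # another worker claimed it first
        user_email = name.split("-", 1)[0]  # however the address was encoded in the name

        # ... long-running processing of `claimed` goes here ...

        msg = EmailMessage()
        msg["To"] = user_email
        msg["From"] = "noreply@example.com"
        msg["Subject"] = "Your file has been processed"
        msg.set_content("Results are ready.")
        with smtplib.SMTP("localhost") as smtp:  # or an SES/SMTP relay
            smtp.send_message(msg)
        os.remove(claimed)

if __name__ == "__main__":
    claim_and_process()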

MongoDB => trigger => Python

I'm currently building a pipeline that reads data from MongoDB every time a new document gets inserted and sends it to an external data source after some preprocessing. The preprocessing and sending parts work well the way I designed them.
The problem, however, is that I can't read the data from MongoDB. I'm trying to build a trigger that reads data from MongoDB when a certain collection gets updated and then sends it to Python. I'm not considering polling MongoDB since it's too resource-intensive.
I've found the library mongotriggers (https://github.com/drorasaf/mongotriggers/) and am now taking a look at it.
In summary, how can I build a trigger that sends data from MongoDB to Python when a new document gets inserted into a specific collection?
Any comment or feedback would be appreciated.
Thanks in advance.
Best
Gee
In MongoDB v3.6+, you can now use MongoDB Change Streams. Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them.
For example, to listen to the change stream when a new document gets inserted:
import logging
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["mydb"]  # placeholder database name

try:
    # Watch only insert operations on this collection
    with db.collection.watch([{'$match': {'operationType': 'insert'}}]) as stream:
        for insert_change in stream:
            # Do something with the inserted document
            print(insert_change)
except pymongo.errors.PyMongoError:
    # The ChangeStream encountered an unrecoverable error or the
    # resume attempt failed to recreate the cursor.
    logging.error('...')
pymongo.collection.Collection.watch() is available from PyMongo 3.6.0+.

jabber messages for file transfer

I'm interested in finding out the XML messages which are sent between two clients when a file is transferred from one to the other (just for fun).
So far I've been finding out the XML messages involved in actions such as authentication, setting the status, sending a message, etc. by using jabber.py: I modified xmlstream.py's network write function to print the data just before it writes it to the network.
However, jabber.py does not provide functions for file transfer. Could someone:
Suggest a Python library that does that?
Or, show me some place where the XML messages sent from client to client are documented.
Thanks.
Take a look at
XEP-0096: SI File Transfer
XEP-0234: Jingle File Transfer
for details (Edit: about the XML).
Edit: I don't know about the current support for these XEPs in Python libraries.
