Dagster: can you trigger a job to run via an API? (Python)

I have been looking all over for the answer but can't seem to find what I'm looking for.
I want to create an API endpoint that can pass information to Dagster assets and trigger a run. For example, I have the following asset in Dagster:
@asset
def player_already_registered(player_name: str):
    q = text(
        f'''
        SELECT
            COUNT(*)
        FROM
            `player_account_info`
        WHERE
            summonerName = :player_name
        '''
    )
    result = database.conn.execute(q, player_name=player_name).fetchone()[0]
    return bool(result)
Say that I have an endpoint already made where I can pass the player_name via a get-parameter. How can I pass the parameter to the asset and then run the job itself?
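One possible approach (not from the original post, just a sketch): put the asset in a job and trigger that job over Dagster's GraphQL API with the dagster-graphql client from your endpoint. The job name register_player_job below is made up, the Dagster webserver is assumed to be running on localhost:3000, and the asset is assumed to read player_name from op config (e.g. @asset(config_schema={"player_name": str}) plus context.op_config["player_name"]) rather than taking it as a function argument, since plain asset parameters are treated as upstream dependencies, not runtime inputs.

# Hypothetical sketch: trigger a Dagster job from a web endpoint.
# Assumes a job named "register_player_job" that materializes the asset and a
# Dagster webserver/GraphQL endpoint reachable at localhost:3000.
from dagster_graphql import DagsterGraphQLClient

dagster_client = DagsterGraphQLClient("localhost", port_number=3000)

def trigger_registration_check(player_name: str) -> str:
    run_id = dagster_client.submit_job_execution(
        "register_player_job",
        run_config={
            "ops": {
                "player_already_registered": {
                    "config": {"player_name": player_name}
                }
            }
        },
    )
    return run_id  # can be used later to poll the run's status

The endpoint would then call trigger_registration_check(...) with the value of the GET parameter and return the run id to the caller.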

Related

Azure Functions keep on running, inserting 10x values into the database

I have built a pipeline with Stream Analytics data triggering Azure Functions.
About 5,000 values arrive merged into a single payload. I wrote a simple Python program in the Function to validate the data, parse the bulk payload, and save each value in Cosmos DB as an individual document. The problem is that my function doesn't stop: after 30 minutes it fails with a timeout error, and in those 30 minutes I can see more than 300k values in my database, duplicating themselves. I thought the problem was with my code (the for loop), but when I ran it locally everything worked, so I'm not sure what is going wrong. The only statement in the whole code that I am unable to understand is the container.upsert line.
This is my code:
import logging
import azure.functions as func
import hashlib as h
from azure.cosmos import CosmosClient
import random, string

def generateRandomID(length):
    # choose from all lowercase letters
    letters = string.ascii_lowercase
    result_str = ''.join(random.choice(letters) for i in range(length))
    return result_str

URL = dburl
KEY = dbkey
client = CosmosClient(URL, credential=KEY)
DATABASE_NAME = dbname
database = client.get_database_client(DATABASE_NAME)
CONTAINER_NAME = containername
container = database.get_container_client(CONTAINER_NAME)

def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    req_body = req.get_json()
    try:
        # Level 1
        rawMsg = req_body[0]
        filteredMsg = rawMsg['message']
        metaData = rawMsg['metaData']
        logging.info(metaData)
        encodeMD5 = filteredMsg.encode('utf-8')
        generateMD5 = h.md5(encodeMD5).hexdigest()
        parsingMetaData = metaData.split(',')
        parsingMD5Hex = parsingMetaData[3]
        splitingHex = parsingMD5Hex.split(':')
        parsingMD5Value = splitingHex[1]
    except:
        logging.info("Failed to parse the Data and Generate MD5 Checksums. Error at the level 1")
    finally:
        logging.info("Execution Successful | First level Completed")
        # return func.HttpResponse(f"OK")
    try:
        # Level 2
        if generateMD5 == parsingMD5Value:
            # parsing the ecg values
            logging.info('MD5 Checksums matched!')
            splitValues = filteredMsg.split(',')
            for eachValue in range(len(splitValues)):
                ecgRawData = splitValues[eachValue]
                divideEachValue = ecgRawData.split(':')
                timeData = divideEachValue[0]
                ecgData = divideEachValue[1]
                container.upsert_item({'id': generateRandomID(10), 'time': timeData, 'ecgData': ecgData})
        elif generateMD5 != parsingMD5Hex:
            logging.info('The MD5s did not match and could not execute the code properly')
            logging.info(generateMD5)
        else:
            logging.info('Something is going wrong. Please check.')
    except:
        logging.info("Failed to parse ECG Values into the DB Container. Error at the level 2")
    finally:
        logging.info("Execution Successful | Second level complete")
        # return func.HttpResponse(f"OK")
    # Return a 200 status
    return func.HttpResponse(f"OK")
A test I performed: I commented out the for-loop block and deployed the Function, and it executed normally without any error.
Please let me know how I can address this issue, and also whether there is anything wrong with my coding practice.
I found the solution! (I am the OP)
In my resource group, an App Service plan is enabled for a web application, so when creating an Azure Function it doesn't let me deploy with the Serverless option. So I deployed with the same App Service plan used for the web application. While testing, the function works completely except for the container.upsert line: when I add that line, it fails to stop and keeps writing 10x values into the database until it is stopped by a timeout error after more than 30 minutes.
I tried creating an App Service plan dedicated to this Function, but the issue stayed the same.
While testing hundreds of corner-case scenarios, I found that my function runs perfectly when I deploy it in the other resource group. The only catch is that there I opted for the Serverless option while deploying the Function.
(If you are using an App Service plan in your Azure resource group, you cannot deploy Azure Functions with the Serverless option; the deployment does not come out right. You need to create a dedicated App Service plan for that function, or use the existing App Service plan.)
As per my research, when dealing with bulk data and inserting it into the database, the usual App Service plan doesn't hold up: the plan has to be large enough to sustain the load. Otherwise, choose the Serverless option while deploying the Function, since the compute is then managed entirely by Azure.
Hope this helps.
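A side note on the container.upsert_item line the question singles out (this is general Cosmos DB behaviour, not part of the accepted fix): upsert only overwrites an existing document when the id (and partition key) already match, so generating a random id for every value means each execution, and every retry of the trigger, writes brand-new documents. Deriving the id deterministically from the reading itself would at least make repeated processing of the same payload idempotent, for example:

# Hypothetical tweak: build the document id from the data instead of a random
# string, so re-processing the same value overwrites rather than duplicates.
import hashlib

def deterministic_id(time_data: str, ecg_data: str) -> str:
    return hashlib.md5(f"{time_data}:{ecg_data}".encode("utf-8")).hexdigest()

# inside the for loop, instead of generateRandomID(10):
# container.upsert_item({'id': deterministic_id(timeData, ecgData),
#                        'time': timeData, 'ecgData': ecgData})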

Flask: send JSON to the front end from within a backend Python function that also needs to return a `send_file()`?

I'm trying to see if there is a simple way to send information to the front end of my website from within a backend Python function that has to return a send_file() call at the end. I'm downloading files from YouTube using pytube and then sending those videos to the end user, and during the download I want to show the progress of the running function by passing a percentage-of-completion number to the front end.
This is basically what I have.
@app.route('/main', methods=['GET', 'POST'])
def main():
    if request.method == 'POST':
        files = get_files(current_user)
        return files or redirect(url_for('home'))
    return render_template('main.html', user=current_user)

def get_files(user):
    if not user:
        return None
    zip_bytes = BytesIO()
    counter = 0
    with ZipFile(zip_bytes, "w") as zip:
        while True:
            counter += 1
            # send counter value to front-end to update progress bar
            # code for generating files and adding them into the zip file
    return send_file(zip_bytes, download_name="something.zip", as_attachment=True)
So what I want to do is send the counter over to my front-end and change some HTML on my page to say something like "3/10 files completed", "4/10 files completed", etc.
If possible, please explain it in simple terms because I don't know much terminology in the world of web development.
You need a table to keep track of tasks, with roughly the following attributes:
id
uuid
percentage_completed
Then, in the function where the task is being processed, add a record to the tasks table with the percentage at 0 and return the task id to the front-end so that it can poll for status. Update the task percentage after each video:
# pseudo code
@app.route('/some-url')
def function():
    uuid = random_string  # e.g. a freshly generated random string
    task = Task(uuid, percentage_completed=0)
    db.session.add(task)
    # run download_vids(uuid) in the background (thread / subprocess / celery)
    return jsonify({'task_id': uuid})

def download_vids(uuid):
    for i, vid in enumerate(vids):
        # do stuff
        task = db.query(Task).filter_by(uuid=uuid).first()
        task.percentage_completed = ((i + 1) / len(vids)) * 100
        db.session.commit()
    # db.session.delete(task)
    # db.session.commit()

@app.route('/progress/<task_id>')  # front-end will ping this each time
def return_stats(task_id):
    task = db.query(Task).filter_by(uuid=task_id).first()
    return jsonify({'progress': task.percentage_completed})
This answer uses SQLAlchemy for the database transactions.
You might want to have a look at Celery for better handling of background tasks.
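The "run download_vids in the background" step is the part the pseudocode glosses over. A minimal way to do it without extra infrastructure is a plain thread; this is only a sketch (route name and Task fields are illustrative), and with Flask-SQLAlchemy the thread would also need an application context around its database work:

# Sketch: start the download in a background thread so the route can return
# the task id immediately; the front-end then polls /progress/<task_id>.
import threading
import uuid as uuidlib

@app.route('/start-download', methods=['POST'])
def start_download():
    task_id = uuidlib.uuid4().hex
    db.session.add(Task(uuid=task_id, percentage_completed=0))
    db.session.commit()
    threading.Thread(target=download_vids, args=(task_id,), daemon=True).start()
    return jsonify({'task_id': task_id})

For anything beyond a toy setup, a proper task queue such as Celery (as mentioned above) is the more robust choice.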

Cannot access user_data when using job queue

I have a function that I need to be able to both execute by sending a command and execute automatically using a job queue. And in that function I need to have access to user_data to store data that is specific to each user. Now the problem is that when I try to execute it via job queue, user_data is None and thus unavailable. How can I fix that?
Here's how I'm currently doing this, simplified:
import datetime
from settings import TEST_TOKEN, BACKUP_USER
from telegram.ext import Updater, CommandHandler, CallbackContext
from pytz import timezone

def job_daily(context: CallbackContext):
    job(BACKUP_USER, context)

def job_command(update, context):
    job(update.message.chat_id, context)

def job(chat_id, context):
    print(context.user_data)

def main():
    updater = Updater(TEST_TOKEN, use_context=True)
    dispatcher = updater.dispatcher
    job_queue = updater.job_queue

    # To run it automatically
    tehran = timezone("Asia/Tehran")
    due = datetime.time(15, 3, tzinfo=tehran)
    job_queue.run_daily(job_daily, due)

    # To run it via command
    dispatcher.add_handler(CommandHandler("job", job_command))

    updater.start_polling()
    updater.idle()

if __name__ == "__main__":
    main()
Now, when I send the command /job and thereby execute job_command, the job function prints {}, which means I can access user_data. But when job_daily is executed, the job function prints None, meaning I don't have access to user_data. The same goes for chat_data.
In a callback function of python-telegram-bot, context.user_data and context.chat_data depend on the update. More precisely, PTB takes update.effective_user/chat.id and provides the corresponding user/chat_data. Job callbacks are not triggered by an update (but by a time-based trigger), so there's no reasonable way to provide context.user_data.
What you can do, when scheduling the job from within a handler callback, where user_data is available, is to pass it as context argument to the job:
context.job_queue.run_*(..., context=context.user_data)
Then within the job callback, you can retrieve it as user_data = context.job.context
In your case, you schedule the job in main, which is not a handler callback and hence you don't have context.user_data (not even context). If you have a specific user id for which you'd like to pass the user_data, you can get that user_data as user_data = updater.dispatcher.user_data[user_id], which is the same object as context.user_data (for updates from this particular user).
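Applied to the code in the question, that could look roughly like this (a sketch, assuming BACKUP_USER is the Telegram user id whose user_data the daily job should operate on):

# Sketch: attach the relevant user_data dict to the job when scheduling it,
# so the job callback can reach it via context.job.context.
def job_daily(context: CallbackContext):
    job(BACKUP_USER, context.job.context)

def job_command(update, context):
    job(update.message.chat_id, context.user_data)

def job(chat_id, user_data):
    print(user_data)

def main():
    updater = Updater(TEST_TOKEN, use_context=True)
    tehran = timezone("Asia/Tehran")
    due = datetime.time(15, 3, tzinfo=tehran)
    # user_data of BACKUP_USER; the same dict handlers see as context.user_data
    backup_user_data = updater.dispatcher.user_data[BACKUP_USER]
    updater.job_queue.run_daily(job_daily, due, context=backup_user_data)
    updater.dispatcher.add_handler(CommandHandler("job", job_command))
    updater.start_polling()
    updater.idle()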
Disclaimer: I'm currently the maintainer of python-telegram-bot.

Using Dataloader at request level (graphene + tornado-graphql)

I'm trying to integrate GraphQL into my web service, which is written in Tornado (Python). By using dataloaders I can speed up requests and avoid sending multiple queries to my database. The problem is that I can't find any example or definition of something equivalent to a request-level "context" variable in which to store the dataloaders for the GraphQLView. I found an example written for Sanic (refer to this link). Is there anything in Tornado equivalent to "context" (get_context) in Sanic, or any example of resolving attributes like this:
class Bandwidth(ObjectType):
    class Meta:
        interfaces = (Service, )
    min_inbits_value = Field(Point)
    max_inbits_value = Field(Point)

    def resolve_min_inbits_value(context, resolve_info, arg1, arg2):
        ...
Finally, I managed to access and modify the context at request level, and I want to share how I did it. I can override the context property in TornadoGraphQLHandler, but I need to parse the raw query:
from graphene_tornado import tornado_graphql_handler
from graphql import parse

class TornadoGraphQLHandler(tornado_graphql_handler.TornadoGraphQLHandler):
    @property
    def context(self):
        data = self.parse_body()
        query, variables, operation_name, id = self.get_graphql_params(self.request, data)
        try:
            document = parse(query)
            args = dict()
            for member in document.definitions[0].selection_set.selections[0].arguments:
                args[member.name.value] = member.value.value
            return <dataloaders with the arguments in request here>
        except:
            return self.request
This way, I can access the dataloaders via info.context in the graphene resolvers at the next level.
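For illustration (not part of the original answer; the loader container and attribute names are made up), a resolver can then pick the loaders up from info.context:

# Sketch: whatever object the overridden `context` property above returns is
# what graphene hands to every resolver as `info.context`.
class Bandwidth(ObjectType):
    min_inbits_value = Field(Point)

    def resolve_min_inbits_value(root, info):
        loaders = info.context                       # e.g. a dataloader container
        return loaders.point_loader.load("min_in")   # hypothetical loader and key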

How can I subscribe a consumer and notify them of any changes in Django Channels

I'm currently building an application that allows users to collaborate and create things together, and for that I need a Discord-like group chat feed. I need to be able to subscribe logged-in users to a project for notifications.
I have a method open_project that retrieves details of the project the user has selected, which I use to subscribe him to any updates for that project.
I can think of two ways of doing this. I have created an instance variable in my connect function, like this:
def connect(self):
    print("connected to projectconsumer...")
    self.accept()
    self.projectSessions = {}
And here is the open_project method:
def open_project(self, message):
    p = Project.objects.values("projectname").get(id=message)
    if len(self.projectSessions) == 0:
        self.projectSessions[message] = []
        pass
    self.projectSessions[message] = self.projectSessions[message].append(self)
    print(self.projectSessions[message])
    self.userjoinedMessage(self.projectSessions[message])
    message = {}
    message["command"] = "STC-openproject"
    message["message"] = p
    self.send_message(json.dumps(message))
Then, when the user opens a project, he is added to the projectSessions list. This however doesn't work (I think) because whenever a new user connects to the websocket, he gets his own project consumer.
The second way I thought of is to create a managing class that has only one instance and keeps track of all the users connected to a project. I have not tried this yet, as I would like some feedback on whether I'm even swinging in the right ballpark. Any and all feedback is appreciated.
EDIT 1:
I forgot to add the userjoinedMessage method to the question. This method is simply there to mimic future mechanics and to give feedback on whether my solution actually works, but here it is:
def userjoinedMessage(self, pointer):
    message = {}
    message["command"] = "STC-userjoinedtest"
    message["message"] = ""
    pointer.send_message(json.dumps(message))
Note that I attempt to reference the instance of the consumer.
I will also attempt to implement a consumer manager whose role is to keep track of which consumers are browsing which projects and to send updates to the relevant channels.
From the question, the issue is how to save projectSessions and have it accessible across multiple instances of the consumer. Instead of trying to keep it in memory, you can save it in a database. It is a dictionary keyed by project, so you can model it as a table with a ForeignKey to the Project model.
That way it is persisted, and there is no issue retrieving it even across multiple Channels server instances, should you ever decide to scale your Channels deployment across multiple servers.
Also, if you feel that a traditional database would slow down retrieval of the sessions, you can use a faster storage system like Redis.
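A rough sketch of such a table (field names are illustrative; the stored channel name is what you would later use to send a message back to that consumer):

# Hypothetical model: one row per (project, connected consumer) subscription.
from django.db import models

class ProjectSession(models.Model):
    project = models.ForeignKey("project.Project", on_delete=models.CASCADE,
                                related_name="sessions")
    channel_name = models.CharField(max_length=255)   # the consumer's self.channel_name
    created_at = models.DateTimeField(auto_now_add=True)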
Right, this is probably a horrible way of doing things and I should be taken out back and shot for it, but I have a fix for my problem. I have made a ProjectManager class that handles subscriptions and updates for the users of a project:
import json

class ProjectManager():
    def __init__(self):
        if(hasattr(self, 'projectSessions')):
            pass
        else:
            self.projectSessions = {}

    def subscribe(self, projectid, consumer):
        print(projectid not in self.projectSessions)
        if(projectid not in self.projectSessions):
            self.projectSessions[projectid] = []
        self.projectSessions[projectid].append(consumer)
        self.update(projectid)

    def unsubscribe(self, projectid, consumer):
        pass

    def update(self, projectid):
        if projectid in self.projectSessions:
            print(self.projectSessions[projectid])
            for consumer in self.projectSessions[projectid]:
                message = {}
                message["command"] = "STC-userjoinedtest"
                message["message"] = ""
                consumer.send_message(json.dumps(message))
            pass
In my apps.py file I initialize the above ProjectManager class and assign it to a variable:
from django.apps import AppConfig
from .manager import ProjectManager

class ProjectConfig(AppConfig):
    name = 'project'
    manager = ProjectManager()
I then use this in my consumers.py file. I import the manager from the ProjectConfig class and assign it to an instance variable inside the consumer whenever it connects:
# assumes something like: from .apps import ProjectConfig
def connect(self):
    print("connected to projectconsumer...")
    self.accept()
    self.manager = ProjectConfig.manager
And whenever open_project is called, I subscribe to that project with the project id received from the front-end:
def open_project(self, message):
    p = Project.objects.values("projectname").get(id=message)
    self.manager.subscribe(message, self)
    message = {}
    message["command"] = "STC-openproject"
    message["message"] = p
    self.send_message(json.dumps(message))
As I said, I in no way claim that this is the correct way of doing it, and I am also aware that channel layers supposedly do this for you in a neat way. However, I don't really have the time to get into channel layers, so I will be using this for now.
I am still open to suggestions, of course, and am always happy to learn more.
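For reference (not part of the original answer), the channel-layers approach mentioned above keeps no state in Python at all: each consumer joins a per-project group in open_project, and updates are broadcast to that group. Roughly (the group name format is illustrative, and a CHANNEL_LAYER backend such as channels_redis must be configured):

# Sketch of the equivalent pattern using Channels groups.
import json
from asgiref.sync import async_to_sync
from channels.generic.websocket import WebsocketConsumer

class ProjectConsumer(WebsocketConsumer):
    def open_project(self, project_id):
        # subscribe this consumer to the project's group
        async_to_sync(self.channel_layer.group_add)(
            f"project_{project_id}", self.channel_name
        )
        # notify every consumer subscribed to this project
        async_to_sync(self.channel_layer.group_send)(
            f"project_{project_id}", {"type": "project.userjoined"}
        )

    def project_userjoined(self, event):
        # handler invoked for "project.userjoined" messages sent to the group
        self.send(text_data=json.dumps(
            {"command": "STC-userjoinedtest", "message": ""}
        ))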
