I've got a Google App Engine app which runs some code on a dynamic backend defined as follows:
backends:
- name: downloadfilesbackend
  class: B1
  instances: 1
  options: dynamic
I've recently made some changes to my code and added a second backend. I've moved some tasks from the front end to the new backend and they work fine. However, I want to move the tasks that originally ran on downloadfilesbackend to the new backend (to save on instance hours). I am doing this simply by changing the name of the target to the new backend, i.e.
taskqueue.add(queue_name="organise-files",
              url=queue_organise_files,
              target='organise-files-backend')
However, despite giving the new backend name as the target the tasks are still being run by the old backend. Any idea why this is happening or how I can fix it?
EDIT:
The old backend is running new tasks - I've checked this.
I've also been through all of my code to check to see if anything is calling the old backend and nothing is. There are only two methods which added tasks to the old backend, and both of these methods have been changed as detailed above.
I stopped the old backend for a few hours, to see whether this would change anything, but all that happened was that the tasks got jammed until I restarted the backend. The new backend is running other tasks fine, so it's definitely been updated correctly...
It's taken a while but I've finally discovered that it is not enough to just change the code and upload it using the SDK. If the code running on the backend or sending tasks to the backend handler is changed then you must run
appcfg backends <dir> update [backend]
The documentation for this command is here. I haven't seen any documentation that says this; it was another error I was experiencing that prompted me to try it. Just thought I'd let people know in case they are having a similar problem.
We use Python(2.7)/Django(1.8.1) and Gunicorn(19.4.5) for our web application and supervisor(3.0) to monitor it. I have recently encountered 2 issues in logging:
Django was logging into the previous day's logs (we have log rotation enabled).
Django was not logging anything at all.
The first scenario is understandable: log rotation changed the file but Django was not updated.
The second scenario was fixed when I restarted the supervisor process, which again led me to believe that the file descriptor was not updated in the Django process.
I came by this SO thread which states:
Each child is an independent process, and file handles in the parent
may be closed in the child after a fork (assuming POSIX). In any case,
logging to the same file from multiple processes is not supported.
So I have a few questions:
My gunicorn has 4 child processes. If one of them fails while
writing to a log file, will the other child processes be unable to
use it? And how do I debug these kinds of scenarios?
Personally I found debugging errors in the Python logging module to be
difficult. Can someone point out how to debug errors such as this, or is
there any way I can monkey patch logging to not fail silently?*(Kindly read update section)*
I have seen that Django's log rotation, and not some script scheduled via cron, causes issue type 1 as explained above. So which is preferable?
Note: The logging config is not a problem. I have already spent fair amount of time trying to figure that out. Also if the config is the issue Django will not write log files after a process restart.
Update:
For my second question, I see that the logging module provides a raiseExceptions option for failures, although this is discouraged in a production environment. Documentation here. So now my question becomes: how do I set this in Django?
I felt like closing this question. It is a bit awkward and seems stupid after 2 months. But I guess being stupid is part of learning, and I want this to serve as a reference for people who stumble across it.
Scenario 1: When using TimedRotatingFileHandler, Django sometimes does not seem to update the file descriptor and hence writes to old log files unless we restart the supervisor. We have yet to find the reason for this behaviour and will update this post if we find it. For now we are using WatchedFileHandler and rotating the logs with the logrotate utility.
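A minimal sketch of what the WatchedFileHandler setup could look like, assuming a Django-style LOGGING dictionary (Django passes this setting to logging.config.dictConfig). The log path, the formatter and the logger name example_app are hypothetical placeholders, not taken from our actual config:

```python
import logging
import logging.config
import os
import tempfile

# Hypothetical path; in a real deployment this would be the file that
# logrotate is configured to rotate.
LOG_PATH = os.path.join(tempfile.gettempdir(), "django_app_example.log")

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(asctime)s %(levelname)s %(name)s %(message)s"},
    },
    "handlers": {
        "file": {
            # WatchedFileHandler reopens the file if it was moved or deleted,
            # so an external tool can rotate it safely (POSIX only).
            "class": "logging.handlers.WatchedFileHandler",
            "filename": LOG_PATH,
            "formatter": "plain",
        },
    },
    "loggers": {
        # "example_app" is a placeholder logger name.
        "example_app": {"handlers": ["file"], "level": "INFO"},
    },
}

# In a Django project the dict would live in settings.py; applying it
# directly here keeps the sketch runnable on its own.
logging.config.dictConfig(LOGGING)
logging.getLogger("example_app").info("handler configured")
```

With this arrangement the rotation itself is left entirely to logrotate, and each worker process notices the replaced file on its next write.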
Scenario 2: This is the stupid question. When I was logging with some string formatting, I forgot to pass enough variables, which is why the logger was erroring. But the error didn't get propagated. When I tested locally, I found that the logging module was actually throwing that error, but silently, and any logs after it in the module were not getting printed. Lessons learned from this scenario were:
If there is a problem in logging, check whether the string formatting is failing.
Use log.debug('example: {msg}'.format(msg=msg)) instead of log.debug('example: %s', msg).
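For anyone who wants such formatting mistakes to fail loudly while developing, here is a rough sketch (not something from our production setup). It relies on the fact that the standard handlers call handleError from inside the except block of emit, so a subclass can simply re-raise. The class name RaisingStreamHandler, the logger name and the helper try_bad_call are made up for illustration:

```python
import logging


class RaisingStreamHandler(logging.StreamHandler):
    """Hypothetical handler for development: re-raise errors hit in emit()."""

    def handleError(self, record):
        # handleError is called from inside emit()'s except block, so a bare
        # raise re-raises the original formatting error.
        raise


log = logging.getLogger("format_check_example")
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(RaisingStreamHandler())


def try_bad_call():
    """Log with two placeholders but one argument; return the error, if any."""
    try:
        log.debug("user %s did %s", "alice")
    except TypeError as exc:
        return exc
    return None
```

The default handleError only prints a traceback to stderr (when logging.raiseExceptions is true), which is easy to miss when stderr is not being watched. A handler like this is probably best kept to tests and local runs.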
When I put a new DAG Python script in the dags folder, I can see a new DAG entry in the DAG UI, but it is not enabled automatically. On top of that, it does not seem to be loaded properly either. I have to click the Refresh button on the right side of the list a few times and toggle the on/off button on the left side of the list before the DAG can be scheduled. These are manual steps: I need to trigger something even though the DAG script was put inside the dags folder.
Can anyone help me with this? Did I miss something? Or is this the correct behavior in Airflow?
By the way, as mentioned in the post title, there is an indicator with the message "This DAG isn't available in the webserver DagBag object. It shows up in this list because the scheduler marked it as active in the metdata database" next to the DAG title before I trigger this manual process.
It is not you, nor is it correct or expected behavior.
It is a current 'bug' with Airflow.
The web server is caching the DagBag in a way that you cannot really use it as expected.
"Attempt removing DagBag caching for the web server" remains on the official TODO as part of the roadmap, indicating that this bug may not yet be fully resolved, but here are some suggestions on how to proceed:
only use builders in airflow v1.9+
Prior to airflow v1.9 this occurs when a DAG is instantiated by a function which is imported into the file where instantiation happens. That is: when a builder or factory pattern is used. Some reports of this issue on GitHub and JIRA led to a fix released in airflow v1.9.
If you are using an older version of airflow, don't use builder functions.
airflow backfill to reload the cache
As Dmitri suggests, running airflow backfill '<dag_id>' -s '<date>' -e '<date>' with the same start and end date can sometimes help. Thereafter you may end up with the (non-)issue that Priyank points out, but that is expected behavior (state: paused or not) depending on the configuration you have in your installation.
Restarting the airflow webserver solved my issue.
This error can be misleading. If hitting refresh button or restarting airflow webserver doesn't fix this issue, check the DAG (python script) for errors.
Running airflow list_dags can display the DAG errors (in addition to listing the DAGs). You can also try running/testing your DAG as a normal Python script.
After fixing the error, this indicator should go away.
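As a rough illustration of the "run it as a normal Python script" check, the sketch below executes a DAG file and returns the traceback if it fails. It does not use any Airflow API; check_dag_file is a made-up helper name and the path you pass in is whatever your DAG file is called:

```python
import runpy
import traceback


def check_dag_file(path):
    """Execute a DAG file as a plain script.

    Returns None if the file runs cleanly, otherwise the traceback text.
    """
    try:
        runpy.run_path(path)
    except Exception:
        return traceback.format_exc()
    return None
```

Calling check_dag_file on a file in your dags folder and printing the result should surface import or syntax errors in much the same way as running python on the file directly.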
The issue is because the DAG by default is put in the DagBag in paused state so that the scheduler is not overwhelmed with lots of backfill activity on start/restart.
To work around this, change the setting below in your airflow.cfg file:
# Are DAGs paused by default at creation
dags_are_paused_at_creation = False
Hope this helps. Cheers!
I have a theory about a possible cause of this issue in Google Composer. There is a section about DAG failures on the webserver in the troubleshooting documentation for Composer, which says:
Avoid running heavyweight computation at DAG parse time. Unlike the
worker and scheduler nodes, whose machine types can be customized to
have greater CPU and memory capacity, the webserver uses a fixed
machine type, which can lead to DAG parsing failures if the parse-time
computation is too heavyweight.
And I was trying to load configuration from an external source (which actually took a negligible amount of time compared to the other operations needed to create the DAG, but still broke something, because the Airflow webserver in Composer runs on App Engine, which has strange behaviours).
I found the workaround in the discussion of this Google issue: create a separate DAG with a task which loads all the data needed and stores that data in an Airflow variable:
Variable.set("pipeline_config", config, serialize_json=True)
Then I could do
Variable.get("pipeline_config", deserialize_json=True)
And successfully generate pipeline from that. Additional benefit is that I get logs from that task, which I get from web server, because of this issue.
After upgrading from Freeswitch 1.2.9 (1.2.9+git~20130506T233047Z~7c88f35451) to Freeswitch 1.4.21 (1.4.21-35~64bit), freeswitch stopped dropping channels after they were hung up, and when we tried to do a manual uuid_kill, it gave us this lovely error:
-ERR No such channel!
This happens even though show channels clearly shows that channel. From the bugs on jira.freeswitch.com that I've seen, it looks like it may be a code problem. A little more info on our environment/code:
We have a python twisted loop that connects to the client so the client can run commands on the server and vice versa. As soon as that twisted connection dies (the client is closed/disconnected) the channels are killed as well, but we need the channel to die before then as we're taking a lot of calls per second and need them to die when the other end is disconnected. We can't close and reopen the client every time a call is done, or reconnect as that would take way too much time and defeats the purpose of our use of the software.
Once again, this error only started happening when we changed to installing the freeswitch server using apt-get instead of directly from source. This lets us get a new server up and running much faster, and we would rather not take the extra time to use our previous method. Please tell me if there's any code you would like to look at, and ask for any clarification you need, but we would really like this to be fixed soon. Thanks in advance!
Edit: For more clarification, we're mainly using mod_callcenter, mod_conference, and mod_sofia with our software.
Edit 2: For a little more clarification, we're running this on Ubuntu 14.04 Server
We are using an ESL connection to connect and run commands in freeswitch from python, and we think that's the root of the problem. We tried exiting the connection, but that destroys both channels.
Also, all of the bugs filed already for this problem on Jira are closed for not being bugs. I thought I may have a bit more success here, as it is a programming type question.
You need to reproduce the issue in a test environment and file a bug report in Jira. Ideally, you should also try reproducing it with the latest master branch (only Debian 8 is supported):
https://freeswitch.org/confluence/display/FREESWITCH/Debian+8+Jessie
I had a similar problem when I used mod_perl, and a Perl object was referring to a session, and it was not properly destructed (if I remember it right, I had two Perl objects attached to the same session). That resulted in channels which were impossible to kill.
I suppose you are using an ESL connection between your application and FreeSWITCH, right?
This is probably a truly basic thing that I'm simply having an odd time figuring out in a Python 2.5 app.
I have a process that will take roughly an hour to complete, so I made a backend. To that end, I have a backend.yaml that has something like the following:
- name: mybackend
  options: dynamic
  start: /path/to/script.py
(The script is just raw computation. There's no notion of an active web session anywhere.)
On toy data, this works just fine.
This used to be public, so I would navigate to the page, the script would start, and it would time out after about a minute (HTTP + 30s shutdown grace period, I assume). I figured this was a browser issue, so I repeated the same thing with a cron job. No dice. I then switched to using a push queue and adding a targeted task, since on paper it looks like it would wait for 10 minutes. Same thing.
All 3 time out after that minute, which means I'm not decoupling the request from the backend like I believe I am.
I'm assuming that I need to write a proper Handler for the backend to do work, but I don't exactly know how to write the Handler/webapp2Route. Do I handle _ah/start/ or make a new endpoint for the backend? How do I handle the subdomain? It still seems like the wrong thing to do (I'm sticking a long-process directly into a request of sorts), but I'm at a loss otherwise.
So the root cause ended up being doing the following in the script itself:
models = MyModel.all()
for model in models:
    # Magic happens
I was basically taking for granted that the query would automatically batch my Query.all() over many entities, but it was dying at the 1000th entry or so. I originally wrote that it was purely computational because I completely ignored the fact that the reads can fail.
The actual solution for solving the problem we wanted ended up being "Use the map-reduce library", since we were trying to look at each model for analysis.
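For readers who only need to walk a large result set rather than run a full map-reduce, explicit paging is one alternative. The sketch below is generic Python, not the App Engine datastore API: iter_in_batches, fake_fetch_page and ROWS are hypothetical names, and fetch_page stands in for whatever call returns one page of results plus a cursor for the next page.

```python
def iter_in_batches(fetch_page, batch_size=100):
    """Yield items one page at a time.

    fetch_page(cursor, limit) must return (items, next_cursor), where
    next_cursor is None once there is nothing left to read.
    """
    cursor = None
    while True:
        items, cursor = fetch_page(cursor, batch_size)
        for item in items:
            yield item
        if cursor is None:
            break


# Stand-in data source so the sketch runs on its own.
ROWS = list(range(2500))


def fake_fetch_page(cursor, limit):
    """Hypothetical page fetcher backed by an in-memory list."""
    start = cursor or 0
    page = ROWS[start:start + limit]
    end = start + len(page)
    next_cursor = end if end < len(ROWS) else None
    return page, next_cursor


processed = sum(1 for _ in iter_in_batches(fake_fetch_page, batch_size=500))
```

Because each page is a separate read, a failure partway through can be retried from the last cursor instead of restarting the whole scan.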
I am starting to work with GAE Task Queue system and things seem to be working fine except for one issue. Everything works fine in my Django-nonrel project with the default queue but breaks with named queues and says it can not find them. I also noticed that they do not show up in the Console as expected. I followed the guide and would assume that just having the queue.yaml in the project would be enough to see them.
Here is an example:
queue:
- name: bob
  max_concurrent_requests: 200
  rate: 20/s
- name: default
  rate: 10/s
I would expect to see the default and another task queue called "bob" in the Console.
Am I missing something in my configuration of this? Doesn't the presence of the queue.yaml set things up properly?
I am running GAEL 1.6.1
Thanks,
RB