I've run into a really frustrating bug, but I'm not sure exactly how to phrase my question. The core behavior seems to be as follows:
1) Create a new db.session, bound to an existing PostgreSQL database
2) Run db.session.add(myObj)
3) Run db.session.commit()
>>> Check the database using PGAdmin, myObj was successfully uploaded
4) *
5) Run db.session.query(myClass) as many times as I want
>>> Returns [myObj]
6) Run db.session.query(myClass).filter(anyFilterThatDoesNotActuallyChangeResult)
>>> Returns [myObj]
>>> BUG >>> 5 seconds later, another copy of myObj is added to the database (visible in PGAdmin)
7) Repeat step 6 as many times as you want
>>> Returns [myObj, myObj]
8) Repeat step 5
>>> Returns [myObj, myObj]
>>> BUG >>> 5 seconds later, another copy of myObj is added to the database (visible in PGAdmin)
Further confusing information: I can completely close and restart my text editor and python environment at step 4, and the buggy behavior persists.
My intuition is that the COMMIT string is somehow being cached somewhere (in SQLAlchemy or in PostgreSQL) and whenever the query command is changed, that triggers some sort of autoflush on the DB, thereby rerunning the commit string, but not actually clearing that cache upon success.
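For concreteness, here is a minimal sketch of steps 1-6, assuming a Flask-SQLAlchemy setup (which the db.session calls suggest); the model, object and connection string are placeholders for myClass/myObj:

from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql+psycopg2://user:pass@localhost/mydb"
db = SQLAlchemy(app)

class MyClass(db.Model):  # stand-in for myClass
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(80))

with app.app_context():
    db.create_all()
    my_obj = MyClass(name="example")   # stand-in for myObj
    db.session.add(my_obj)             # step 2
    db.session.commit()                # step 3: row visible in PGAdmin
    db.session.query(MyClass).all()    # step 5: returns [my_obj]
    db.session.query(MyClass).filter(MyClass.id > 0).all()  # step 6: returns [my_obj]
    # ...and a few seconds later a duplicate row shows up, as described above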
----------------- EDIT -----------------
IGNORE THE REST OF THIS QUESTION, AS IT WAS NOT RELEVANT TO THE BUG AT HAND.
To further explore this behavior, I ran the following code:
1) Create a new db.session, bound to an existing PostgreSQL database
2) Run db.session.add(myObj)
3) Run db.session.commit()
4) Run db.session.commit()
Which I would expect to add only ONE copy of myObj, but it actually adds TWO!!! This breaks my understanding of what commit is doing, specifically autoflushing, ending the transaction, and removing add(myObj) from its "to do" list. Furthermore, none of the code I try running between steps 3 and 4 prevents this behavior: for example, db.session.expire_all().
I am a complete noob around databases (this is my first project), so I would appreciate any suggestions, especially explicit step-by-step recommendations for how I can overcome this bug. E.g. What code should I add in, and where, to clear such a cache?
Turns out the problem was more nefarious than I imagined. The steps to reproduce were actually more basic than that:
1) Save any file in the same directory as my session manager
>>> BUG >>> 15 seconds later another copy of myObj is added to the database
I am using VS Code (Version: 1.47.3), and the bug only happens while the Python extension is enabled.
My running hypothesis is that because one of the files in the directory auto-initializes a database session (via psycopg2), there is some caching mechanism that executes that code in a poorly-managed state, which somehow manages to establish a new engine connection, followed by whatever the last commit statement was.
I have stopped trying to debug it and moved to a refactor of the session management structure, so that the connection is only established within a function call, as opposed to whenever the file is run.
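For reference, a rough sketch of that refactor (the connection string and names are made up): the engine and session are created lazily inside a function rather than at import time, so merely importing or re-running the module opens no connection:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

_engine = None  # created on first use, not when the module is imported

def get_session():
    """Return a new Session, creating the engine lazily on first call."""
    global _engine
    if _engine is None:
        _engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")
    return sessionmaker(bind=_engine)()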
Thanks for reading. Hope this helps someone else hitting this infuriatingly unreproducible bug. Literally thought I was going crazy: each time I absent-mindedly saved my file, a mystery object would appear. I would save at different points and at different frequencies, so the behavior appeared utterly random. The only reason I found the original steps to reproduce was because the debugger I was using saved the file before running it.
----------------- FINAL SOLUTION -----------------
It turns out the root of all my woes was my choice of names.
I had written some code that tested my SQL code, but foolishly named it test_XXX.py
Then, whenever any file was saved, pytest would do an automatic sweep of all files whose names start with test_ and run them, thus causing my entire SQL example to be run behind the scenes.
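Besides renaming the file, one way to keep pytest's automatic sweep from collecting such a script (a sketch assuming standard pytest conventions; the file name is a placeholder) is a conftest.py next to it:

# conftest.py: exclude this one file from collection even though its
# name matches the test_*.py pattern pytest looks for.
collect_ignore = ["test_sql_sandbox.py"]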
Tune in next week for more adventures in Things That I Could Have Prevented.
I need to modify an option of the accounting configuration (menu Accounting > Configuration > Accounting).
As you know, those options belong to a Transient Model named account.config.settings, which inherits from res.config.settings.
The problem is that even if I modify no options and click on Apply, Odoo keeps loading forever. I put the log in debug_sql mode and realised that after clicking on Apply, Odoo starts making thousands of SQL queries, which is why it never stops loading.
I made a database backup and restored it in a newer instance of Odoo 8. In this instance, when I click on Apply, Odoo makes several SQL queries, but not as many as in the other instance, so it works perfectly.
My conclusion was that the problem could be in the instance code (not in the database), so I looked for all the modules inheriting from account.config.settings and checked their repositories out at the same commits as the failing instance (with git checkout xxx).
Afterwards I was expecting the newer instance to start failing when clicking on Apply, but it kept working OK.
So I am running out of ideas. I am thinking about running the backup database in the newer instance just to change the option I need, and after that restoring it again in the older instance, but I prefer to avoid that since I think it is a bit risky.
Any ideas? What more can I try to find out the problem?
Finally I found the guilty module. It was account_due_list from the repository account-payment of the Odoo Community Association. The commit which fixes the problem is https://github.com/OCA/account-payment/commit/d7a09399982c80bb0f9465c44b9dc2a2b17e557a#diff-57131fd364915a56cbf8696d74e19478, merged on September 22nd, 2016. Its title is "check if currency id not changed per company, remove it from create values".
The computed field maturity_residual depended on company_id.currency_id. That dependency had to be removed because it was the cause of the whole problem: it triggered thousands of SQL queries, which made Odoo load forever.
Old and wrong code
@api.depends('date_maturity', 'debit', 'credit', 'reconcile_id',
             'reconcile_partial_id', 'account_id.reconcile',
             'amount_currency', 'reconcile_partial_id.line_partial_ids',
             'currency_id', 'company_id.currency_id')
def _maturity_residual(self):
    ...
New and right code
@api.depends('date_maturity', 'debit', 'credit', 'reconcile_id',
             'reconcile_partial_id', 'account_id.reconcile',
             'amount_currency', 'reconcile_partial_id.line_partial_ids',
             'currency_id')
def _maturity_residual(self):
    ...
I find it very risky to update repositories to the latest version, due to exactly what @CZoellner says: sometimes there are weird commits which can destroy some database data. So these are the consequences of not doing that.
I am debugging a Python (3.5) program with PyCharm (PyCharm Community Edition 2016.2.2 ; Build #PC-162.1812.1, built on August 16, 2016 ; JRE: 1.8.0_76-release-b216 x86 ; JVM: OpenJDK Server VM by JetBrains s.r.o) on Windows 10.
The problem: when stopped at some breakpoints, the Debugger window is stuck at "Collecting data", which eventually times out (with "Unable to display frame variables").
The data to be displayed is neither special nor particularly large. It is somehow available to PyCharm, since a conditional breakpoint on some values of said data works fine (the program breaks); it looks like only the process that gathers it for display (as opposed to operational purposes) fails.
When I step into a function around the place I have my breakpoint, its data is displayed correctly. When I go up the stack (to the calling function, the one I stepped down from and where I initially wanted to have the breakpoint), I am stuck with the "Collecting data" timeout again.
There have been numerous issues raised with the same point since at least 2005. Some were fixed, some not. The fixes were usually updates to the latest version (which I have).
Is there a general direction I can go to in order to fix or work around this family of problems?
EDIT: a year later the problem is still there and there is still no reaction from the devs/support after the bug was raised.
EDIT April 2018: It looks like the problem is solved in the 2018.1 version; the following code, which was hanging when setting a breakpoint on the print line, now works (I can see the variables):
import threading

def worker():
    a = 3
    print('hello')

threading.Thread(target=worker).start()
I had the same issue with PyCharm 2018.2 when working on a complex Flask project with SocketIO.
When I put a breakpoint in the code and pressed the debug button, it stopped at the breakpoint, but the variables didn't load; it was just infinitely collecting data. Enabling Gevent compatibility resolved the issue. The setting lives under File > Settings > Build, Execution, Deployment > Python Debugger > Gevent compatible.
In case you landed here because you are using PyTorch (or any other deep learning library) and are trying to debug in PyCharm (torch 1.31, PyCharm 2019.2 in my case) but it's super slow:
Enable Gevent compatible in the Python Debugger settings, as linkliu mayuyu pointed out. The problem might be caused by debugging large deep learning models (a BERT transformer in my case), but I'm not entirely sure about this.
I'm adding this answer as it's end of 2019 and this doesn't seem to be fixed yet. Further I think this is affecting many engineers using deep learning, so I hope my answer-formatting triggers their stackoverflow algorithm :-)
Note (June 2020):
While enabling Gevent compatible allows you to debug PyTorch models, it will prevent you from debugging your Flask application in PyCharm! My breakpoints stopped working and it took me a while to figure out that this flag was the reason. So make sure to enable it only on a per-project basis.
I also had this issue when I was working on code using sympy and the Python module 'Lea' aiming to calculate probability distributions.
The action I took that resolved the timeout issue was to change the 'Variables Loading Policy' in the debug setting from the default 'Asynchronously' to 'Synchronously'.
I think this is caused by some classes having a default __str__() method that is too verbose. PyCharm calls this method to display the local variables when it hits a breakpoint, and it gets stuck while loading the string.
A trick I use to overcome this is to manually edit the class that is causing the error and substitute its __str__() method with something less verbose.
As an example, it happens for the PyTorch _TensorBase class (and all tensor classes extending it), and can be solved by editing the PyTorch source torch/tensor.py, changing the __str__() method as follows:
def __str__(self):
    # All strings are unicode in Python 3, while we have to encode unicode
    # strings in Python 2. If we can't, let python decide the best
    # characters to replace unicode characters with.
    return str() + ' Use .numpy() to print'
    # if sys.version_info > (3,):
    #     return _tensor_str._str(self)
    # else:
    #     if hasattr(sys.stdout, 'encoding'):
    #         return _tensor_str._str(self).encode(
    #             sys.stdout.encoding or 'UTF-8', 'replace')
    #     else:
    #         return _tensor_str._str(self).encode('UTF-8', 'replace')
Far from optimal, but it comes in handy.
UPDATE: The error seems solved in the latest PyCharm version (2018.1), at least for the case that was affecting me.
I ran into the same problem when trying to run some deep learning scripts written in PyTorch (PyCharm 2019.3).
I finally figured out that the problem was that I had set num_workers in the DataLoader to a large value (in my case 20).
So, in debug mode, I would suggest setting num_workers to 1.
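A small sketch of that suggestion (the dataset, batch size and DEBUGGING flag are placeholders): keep worker processes at zero or one while stepping through the debugger, and only raise the count for real runs:

import torch
from torch.utils.data import DataLoader, TensorDataset

DEBUGGING = True  # flip off for normal training runs
dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

# Worker subprocesses make frame inspection slow or unreliable in the
# debugger, so fall back to in-process loading while debugging.
loader = DataLoader(dataset, batch_size=32,
                    num_workers=0 if DEBUGGING else 20)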
For me, the solution was removing manual watches every time before starting to debug. If there were any existing manual watches in the "Variables" window, it would remain stuck at "Collecting data...".
Using Odoo or Other Large Python Server
None of the above solutions worked for me, despite trying them all.
It normally works, but occasionally gives this annoying "Collecting data..." or sometimes "Timed out...".
The solution is to restart PyCharm and set as few breakpoints as possible; after that it starts to work again.
I don't know why it does that (maybe too many breakpoints), but it worked.
I'm trying hard to like IPython Notebook, but maybe because I'm so used to writing code in vi and executing it at the command line I'm finding some of its defaults challenging. Can anything be done (perhaps in a configuration file somewhere) about the following?
I'd like %hist to output line numbers by default without having to remember the -n and without having to set up an alias every time.
How do I set %automagic to "off" by default, to stop IPython polluting my namespace with its un-percented magics and shell commands? I know I can use the --no-import-all option with the --pylab option: is there an equivalent --no-automagic option?
It drives me mad that I'm never quite sure what is the status of the objects bound to my variable names: changing and running a cell beneath the one I'm using can alter an object I'm referring to in the current cell. To avoid this, I've got into the habit of using Run All or Run All Above, but that sometimes repeats lengthy calculations and reimports stuff I'd rather not bother with: can I flag some cells to be not-rerun by Run All?
Can I get vi-style key-bindings for editing cells?
Why does IPython notebook hang my browser if the kernel is using lots of memory? I thought they were separate processes with the kernel just reporting back its results.
(Please try to ask one question per question - SO is designed to work that way. However, I'm not feeling so mean that I'd refuse to answer)
I don't think the behaviour of %hist is configurable, sorry.
Set c.InteractiveShell.automagic = False in your config files.
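Concretely, that line goes into the profile's config file, typically ~/.ipython/profile_default/ipython_config.py (create it with ipython profile create if it does not exist yet):

# ~/.ipython/profile_default/ipython_config.py
c = get_config()  # provided by IPython when it loads this file
c.InteractiveShell.automagic = False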
There has been some discussion of a %%cache cell magic that could avoid re-running long running cells by storing values computed in that cell. It has not been implemented yet, though.
Yes: https://github.com/ivanov/ipython-vimception
It shouldn't hang just because of kernel memory use - if your code is producing a lot of output, though, that can hang the browser because adding lots of things to the DOM gums it up.
I am implementing an import tool (Django 1.6) that takes a potentially very large CSV file, validates it, and imports it or not depending on user confirmation. Given the potentially large file size, the processing of the file is done via flowy (a Python wrapper over Amazon's SWF). Each import job is saved in a table in the DB, and the workflow, which is quite simple and consists of only one activity, basically calls a method that runs the import and saves all necessary information about the processing of the file in the job's record in the database.
The tricky thing is: We now have to make this import atomic. Either all records are saved or none. But one of the things saved in the import table is the progress of the import, which is calculated based on the position of the file reader:
progress = (raw_data.tell() * 100.0) / filesize
And this progress is used by an AJAX progress bar widget on the client side. So simply adding @transaction.atomic to the method that loops through the file and imports the rows is not a solution, because the progress will only be saved on commit.
The CSV files only contain one type of record and affect a single table. If I could somehow do a transaction only on this table, leaving the job table free for me to update the progress column, it would be ideal. But from what I've found so far it seems impossible. The only solution I could think of so far is opening a new thread and a new database connection inside it every time I need to update the progress. But I keep wondering… will this even work? Isn't there a simpler solution?
One simple approach would be to use the READ UNCOMMITTED transaction isolation level. That could allow dirty reads, which would allow your other processes to see the progress even though the transaction hasn't been committed. However, whether this works or not will be database-dependent. (I'm not familiar with MySQL, but this wouldn't work in PostgreSQL because READ UNCOMMITTED works the same way as READ COMMITTED.)
Regarding your proposed solution, you don't necessarily need a new thread; you really just need a fresh connection to the database. One way to do that in Django might be to take advantage of the multiple database support. I'm imagining something like this:
As described in the documentation, add a new entry to DATABASES with a different name, but the same setup as default. From Django's perspective we are using multiple databases, even though we in fact just want to get multiple connections to the same database.
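A sketch of what that entry might look like (placeholder credentials; both aliases point at the same physical database, so Django simply opens a second connection):

# settings.py
_JOB_DB = {
    'ENGINE': 'django.db.backends.postgresql_psycopg2',
    'NAME': 'mydb',
    'USER': 'myuser',
    'PASSWORD': 'secret',
    'HOST': 'localhost',
}

DATABASES = {
    'default': _JOB_DB,
    'second_db': dict(_JOB_DB),  # same database, separate connection
}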
When it's time to update the progress, do something like:
JobData.objects.using('second_db').filter(id=5).update(progress=0.5)
That should take place in its own autocommitted transaction, allowing the progress to be seen by your web server.
Now, does this work? I honestly don't know, I've never tried anything like it!
I run parallel write requests against my ZODB, which contains multiple BTree instances. As soon as the server accesses the same objects inside such a BTree, I get a ConflictError for the IOBucket class. For all my Django-based classes I have _p_resolveConflict set up, but I can't implement it for IOBucket because it is a C-based class.
I did a deeper analysis, but still don't understand why it complains about the IOBucket class and what it writes into it. Additionally, what would be the right strategy to resolve it?
A thousand thanks for any help!
IOBucket is part of the persistence structure of a BTree; it exists to try and reduce conflict errors, and it does try and resolve conflicts where possible.
That said, conflicts are not always avoidable, and you should restart your transaction. In Zope, for example, the whole request is re-run up to 5 times if a ConflictError is raised. Conflicts are ZODB's way of handling the (hopefully rare) occasion where two different requests tried to change the exact same data structure.
Restarting your transaction means calling transaction.begin() and applying the same changes again. The .begin() will fetch any changes made by the other process and your commit will be based on the fresh data.
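Outside Zope you have to write that retry loop yourself; a minimal sketch (the apply_changes callable and the attempt count are placeholders):

import transaction
from ZODB.POSException import ConflictError

def commit_with_retry(apply_changes, attempts=5):
    # Re-run the unit of work until it commits or we give up. begin() starts
    # a fresh transaction that sees the other process's committed changes.
    for _ in range(attempts):
        try:
            transaction.begin()
            apply_changes()
            transaction.commit()
            return
        except ConflictError:
            transaction.abort()
    raise RuntimeError("Could not commit after %d attempts" % attempts)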