FastText version before most recent change - python

I was going through some old FastText code and realized it no longer works and expects different parameters. Looking at the documentation, it appears the documentation has only been partially updated:
you can see that size and iter are not in the class definition shown in the docs despite being listed among the parameters. I was wondering if anyone knew the exact version where this change occurred, as it appears I've accidentally updated to something newer.

Most changes occurred in gensim-4.0.0. There are a series of notes on the changes & how to adapt your code in the project wiki page, "Migrating from Gensim 3.x to 4":
https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4
In most cases, small changes to the method & variable names your older code uses will restore full functionality.
There have been significant fixes and optimizations to the FastText implementation, especially in the realm of reducing memory usage, so you probably don't want to stay on an older version (like gensim-3.8.3) except as a temporary quick workaround.
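For illustration, a minimal sketch of the renamed constructor arguments in gensim 4 (size became vector_size, iter became epochs); the toy corpus and dimensions below are made up:

from gensim.models import FastText

# made-up toy corpus, just for illustration
sentences = [["hello", "world"], ["machine", "learning", "with", "gensim"]]

# gensim 3.x:  FastText(sentences, size=100, iter=5, min_count=1)
# gensim 4.x:  size -> vector_size, iter -> epochs
model = FastText(sentences, vector_size=100, epochs=5, min_count=1)
print(model.wv["hello"][:5])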

Related

ctypes.c_int.from_address does not work in RStudio

I am trying to count references to an object in Python, using RStudio. I use the following function:
ctypes.c_int.from_address(id(an_object)).value
This works perfectly in PyCharm and Jupyter, but the result in RStudio is not correct.
The question is: why is the result not correct in RStudio, and how can I fix it?
I also tried to use the sys.getrefcount function, but it does not work in RStudio either!
I also did it without using the "id" function, but the result in RStudio is still not correct! It may occasionally happen in PyCharm too (I did not see it, so perhaps there is no guarantee), but in RStudio something is completely wrong!
Why is this important, and why do I care about it?
Consider the following example: sometimes it is important to know about "a" before changing "b".
The big problem in RStudio is that the result increases randomly! In PyCharm and other Python tools I did not see that happen.
I am not an expert on this, so if I am wrong about it, please correct me.
The problem seems to be just that RStudio creates a lot of extra references to the objects you create. On the same Python version you would otherwise see no divergence there.
That said, you are definitely taking the wrong approach there: either you alias some data structure expecting it to be mutated "under your feet", or you refrain from making changes at all to data that "may" be in use in other parts of your code (notebook). That is the reasoning behind Pandas, for example, changing its default operations on DataFrames from "inplace" to returning a new copy, despite the memory cost of doing so.
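As a rough illustration of that design choice (the column names here are made up, not from the original post):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
other_view = df  # another name bound to the same DataFrame

# Default: returns a new DataFrame; other_view still sees the old column name.
renamed = df.rename(columns={"a": "b"})

# inplace=True mutates the shared object, so other_view changes too.
df.rename(columns={"a": "b"}, inplace=True)
print(other_view.columns.tolist())  # ['b']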
Other than that, as stated in the comments, the reason this is so hidden in Python - to the point that you have to use ctypes (you could also use the higher-level gc module, actually) - is that it should not matter.
There are other things you could do - like creating objects that use locks so that they don't allow changes while they are in use in some other critical place - that would be higher level and reproducible.
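For completeness, a minimal CPython-only sketch of the two counting approaches discussed above; the exact numbers it prints depend on how many hidden references the hosting console (Jupyter, PyCharm, reticulate in RStudio) keeps, which is precisely the divergence in question:

import ctypes
import sys

a = [1, 2, 3]
b = a  # a second name bound to the same list object

# The trick from the question: read ob_refcnt straight from the object's memory.
print(ctypes.c_int.from_address(id(a)).value)

# Higher-level equivalent; it reports one extra reference, because passing `a`
# as an argument temporarily creates a reference of its own.
print(sys.getrefcount(a))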

Difference between approxCountDistinct and approx_count_distinct in spark functions

Can anyone tell me the difference between pyspark.sql.functions.approxCountDistinct (I know it is deprecated) and pyspark.sql.functions.approx_count_distinct? I have used both versions in a project and have seen different values.
As you mentioned, pyspark.sql.functions.approxCountDistinct is deprecated. The reason is most likely just a style concern: they probably wanted everything to be in snake case. As you can see in the source code, pyspark.sql.functions.approxCountDistinct simply calls pyspark.sql.functions.approx_count_distinct, nothing more except giving you a warning. So regardless of which one you use, the very same code runs in the end.
Also, still according to the source code, approx_count_distinct is based on the HyperLogLog++ algorithm. I am not very familiar with the algorithm, but it relies on repeated set merging. Therefore, the result will most likely depend on the order in which the executors' partial results are merged. Since this order is not deterministic in Spark, this could explain why you witness different results.
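As a quick sanity check, here is a minimal sketch (with a made-up DataFrame) showing that both names resolve to the same aggregate and accept the same optional rsd (relative standard deviation) parameter:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
# made-up data: 10,000 rows with 100 distinct values
df = spark.createDataFrame([(x % 100,) for x in range(10_000)], ["value"])

df.select(
    F.approx_count_distinct("value").alias("snake_case"),
    F.approxCountDistinct("value").alias("camelCase"),  # same call, plus a deprecation warning
).show()

# rsd bounds the maximum estimation error of HyperLogLog++ (default 0.05).
df.select(F.approx_count_distinct("value", rsd=0.01).alias("tighter")).show()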

LightGBM ignore warning about "boost_from_average"

I use the LightGBM model (version 2.2.1). It shows the following warning on train:
[LightGBM] [Warning] Starting from the 2.1.2 version, default value
for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the
previous versions of LightGBM. Try to set boost_from_average=false, if
your old models produce bad results
I found what it is about: github link.
But I don't use any old models or legacy code (it's a new project created on version 2.2.1 of LightGBM), so I don't need to see this warning every time.
Also, I know I can change verbose and turn off all warnings. But that's not really good - some other warnings can be useful!
So my question is: is it possible to turn off (hide) just this warning?
Mikhail
Try setting the parameter boost_from_average explicitly when you create the model, to either True or False.
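A minimal sketch of that suggestion, assuming the scikit-learn wrapper (with the native lgb.train API the parameter would go into the params dict instead); the data here is made up:

import numpy as np
import lightgbm as lgb

# made-up data for illustration
X = np.random.rand(200, 4)
y = np.random.randint(0, 2, size=200)

# Passing boost_from_average explicitly (True or False) should stop LightGBM
# from printing the notice about its changed default.
model = lgb.LGBMClassifier(objective="binary", boost_from_average=True)
model.fit(X, y)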
Best Regards,

Debugger times out at "Collecting data..."

I am debugging a Python (3.5) program with PyCharm (PyCharm Community Edition 2016.2.2 ; Build #PC-162.1812.1, built on August 16, 2016 ; JRE: 1.8.0_76-release-b216 x86 ; JVM: OpenJDK Server VM by JetBrains s.r.o) on Windows 10.
The problem: when stopped at some breakpoints, the Debugger window is stuck at "Collecting data", which eventually times out (with Unable to display frame variables).
The data to be displayed is neither special nor particularly large. It is somehow available to PyCharm, since a conditional breakpoint on some values of said data works fine (the program breaks) -- it looks like only the process of gathering it for display (as opposed to operational purposes) fails.
When I step into a function around the place I have my breakpoint, its data is displayed correctly. When I go up the stack (to the calling function, the one I stepped down from and where I wanted initially to have the breakpoint) - I am stuck with the "Collecting data" timeout again.
There have been numerous issues raised about the same problem since at least 2005. Some were fixed, some not. The fixes were usually updates to the latest version (which I have).
Is there a general direction I can go to in order to fix or work around this family of problems?
EDIT: a year later the problem is still there and there is still no reaction from the devs/support after the bug was raised.
EDIT April 2018: It looks like the problem is solved in the 2018.1 version, the following code which was hanging when setting a breakpoint on the print line now works (I can see the variables):
import threading

def worker():
    a = 3
    print('hello')

threading.Thread(target=worker).start()
I had the same issue with PyCharm 2018.2 when working on a complex Flask project with SocketIO.
When I put a debug breakpoint inside the code and pressed the debug button, it stopped at the breakpoint, but the variables didn't load; it was just infinitely collecting data. I enabled Gevent compatibility and that resolved the issue. The setting is under Settings > Build, Execution, Deployment > Python Debugger ("Gevent compatible").
In case you landed here because you are using PyTorch (or any other deep learning library) and are trying to debug in PyCharm (torch 1.3.1, PyCharm 2019.2 in my case) but it's super slow:
enable "Gevent compatible" in the Python Debugger settings, as linkliu mayuyu pointed out. The problem might be caused by debugging large deep learning models (a BERT transformer in my case), but I'm not entirely sure about this.
I'm adding this answer as it's the end of 2019 and this doesn't seem to be fixed yet. Furthermore, I think this affects many engineers using deep learning, so I hope my answer formatting triggers their Stack Overflow algorithm :-)
Note (June 2020):
While enabling "Gevent compatible" allows you to debug PyTorch models, it will prevent you from debugging your Flask application in PyCharm! My breakpoints stopped working, and it took me a while to figure out that this flag was the reason. So make sure to enable it only on a per-project basis.
I also had this issue when I was working on code using sympy and the Python module 'Lea' aiming to calculate probability distributions.
The action I took that resolved the timeout issue was to change the 'Variables Loading Policy' in the debug setting from the default 'Asynchronously' to 'Synchronously'.
I think this is caused by some classes having a default __str__() method that is too verbose. PyCharm calls this method to display the local variables when it hits a breakpoint, and it gets stuck while building the string.
A trick I use to overcome this is to manually edit the class that is causing the error and substitute its __str__() method with something less verbose.
As an example, it happens for the PyTorch _TensorBase class (and all tensor classes extending it), and can be solved by editing the PyTorch source torch/tensor.py, changing the __str__() method as follows:
def __str__(self):
    # All strings are unicode in Python 3, while we have to encode unicode
    # strings in Python 2. If we can't, let Python decide the best
    # characters to replace unicode characters with.
    return str() + ' Use .numpy() to print'
    # if sys.version_info > (3,):
    #     return _tensor_str._str(self)
    # else:
    #     if hasattr(sys.stdout, 'encoding'):
    #         return _tensor_str._str(self).encode(
    #             sys.stdout.encoding or 'UTF-8', 'replace')
    #     else:
    #         return _tensor_str._str(self).encode('UTF-8', 'replace')
Far from optimal, but it comes in handy.
UPDATE: The error seems solved in the latest PyCharm version (2018.1), at least for the case that was affecting me.
I met the same problem when trying to run some deep learning scripts written in PyTorch (PyCharm 2019.3).
I finally figured out that the problem was that I had set num_workers in DataLoader to a large value (in my case 20).
So, in debug mode, I would suggest setting num_workers to 1.
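A minimal sketch of that change (the dataset here is made up for illustration):

import torch
from torch.utils.data import DataLoader, TensorDataset

# made-up dataset: 1000 samples of 10 features with binary labels
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

# A high num_workers (e.g. 20) spawns many worker processes, which can leave the
# PyCharm debugger stuck at "Collecting data..."; keep it at 1 (or 0) when debugging.
loader = DataLoader(dataset, batch_size=32, num_workers=1)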
For me, the solution was removing manual watches every time before starting to debug. If there were any existing manual watches in the "Variables" window, it would remain stuck at "Collecting data...".
Using Odoo or another large Python server:
None of the above solutions worked for me, even though I tried them all.
It normally works, but occasionally gives this annoying Collecting data... or sometimes Timed Out....
The solution is to restart PyCharm and set as few breakpoints as possible. After that it starts to work again.
I don't know why it does that (maybe too many breakpoints), but it worked.

Vim Python omni-completion failing to work on system modules

I'm noticing that even for system modules, code completion doesn't work too well.
For example, if I have a simple file that does:
import re
p = re.compile(pattern)  # pattern: some regular-expression string
m = p.search(line)       # line: some text to search
If I type p., I don't get completion for methods I'd expect to see (I don't see search() for example, but I do see others, such as func_closure(), func_code()).
If I type m., I don't get any completion whatsoever (I'd expect .groups(), in this case).
This doesn't seem to affect all modules. Has anyone seen this behaviour and knows how to correct it?
I'm running Vim 7.2 on WinXP, with the latest pythoncomplete.vim from vim.org (0.9), running python 2.6.2.
Completion for this kind of thing is tricky, because it would need to execute the actual code to work.
For example, p.search() could return None or a MatchObject, depending on the data that is passed to it.
This is why omni-completion does not work here, and probably never will. It works for things that can be statically determined, for example a module's contents.
I never got the built-in omnicomplete to work for any language. I had the most success with pysmell (which seems to have been updated slightly more recently on GitHub than in the official repo). I still didn't find it reliable enough to use consistently, but I can't remember exactly why.
I've resorted to building an extensive set of snipMate snippets for my primary libraries and using the default tab completion to supplement them.
