How to solve memory error in mtrand.RandomState.choice? - python

I'm trying to sample 1e7 items from 1e5 strings but I get a MemoryError. Sampling 1e6 items from 1e4 strings works fine. I'm on a 64-bit machine with 4 GB of RAM and don't think I should be hitting any memory limit at 1e7. Any ideas?
$ python3
Python 3.3.3 (default, Nov 27 2013, 17:12:35)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> K = 100
Works fine with 1e6 :
>>> N = int(1e6)
>>> np.random.choice(["id%010d"%x for x in range(N//K)], N)
array(['id0000005473', 'id0000005694', 'id0000004115', ..., 'id0000006958',
'id0000009972', 'id0000003009'],
dtype='<U12')
Error with N=1e7 :
>>> N = int(1e7)
>>> np.random.choice(["id%010d"%x for x in range(N//K)], N)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "mtrand.pyx", line 1092, in mtrand.RandomState.choice (numpy/random/mtrand/mtrand.c:8229)
MemoryError
>>>
I found the question "Python not catching MemoryError", but it seems to be about catching an error like this rather than avoiding it.
I'd be happy with either a solution still using random.choice or a different method to do this. Thanks.

You can work around this with a generator function:
def item():
    for i in range(N):
        yield "id%010d" % np.random.choice(N // K)
This draws one index at a time, so the full array of sampled strings never has to be in memory at once.
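The same lazy idea can be sketched without NumPy. This is only an illustration, not the answerer's code: it swaps in the standard-library random module, and the names sample_ids, n_samples and n_ids are made up for the example.

```python
import random

def sample_ids(n_samples, n_ids, seed=None):
    """Lazily yield n_samples ids drawn uniformly (with replacement) from n_ids labels."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    for _ in range(n_samples):
        # Format each label only when it is requested; nothing is stored.
        yield "id%010d" % rng.randrange(n_ids)

# Small demonstration; the question's sizes would be N = 10**7 and N // K = 10**5.
sampled = list(sample_ids(1000, 10, seed=0))
print(sampled[:3])
```

Consuming the generator in a loop (or writing each id straight to a file) keeps memory use flat, whereas collecting it into a list, as the demonstration does, gives up that benefit.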

Related

Why is the result different between the Python interpreter and a Python script?

I wrote some simple code in the Python interpreter and ran it.
Python 3.5.3 (v3.5.3:1880cb95a742, Jan 16 2017, 16:02:32) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> x=np.array([0,1])
>>> w=np.array([0.5,0.5])
>>> b=-0.7
>>> np.sum(w*x)+b
-0.19999999999999996
The result -0.19999999999999996 looks odd; I think it is caused by IEEE 754 floating-point representation. But when I run almost the same code from a file, the result is quite different.
import numpy as np
x = np.array([0,1])
w = np.array([0.5,0.5])
b = -0.7
print(np.sum(w * x) + b)
The result is "-0.2", so IEEE 754 does not seem to affect it.
What is the difference between running from a file and running in the interpreter?
The difference is due to how the interpreter displays output.
The print function will try to use an object's __str__ method, but the interpreter will use an object's __repr__.
If, in the interpreter you wrote:
...
z = np.sum(w*x)+b
print(z)
(which is what you're doing in your code) you'd see -0.2.
Similarly, if in your code you wrote:
print(repr(np.sum(w * x) + b))
(which is what you're doing in the interpreter) you'd see -0.19999999999999996
I think the difference lies in the fact that your file-based code calls print(), which converts the number to a string with str(), while in the interpreter you don't call print() and simply ask the interpreter to echo the result, which uses repr().
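The effect of the two conversions can be imitated with plain Python floats. This is a sketch under an assumption: the "%.12g" format merely stands in for the shorter str() output seen in the question (newer NumPy and Python versions may print the full digits in both cases).

```python
# Same arithmetic as in the question, using plain Python floats.
z = sum([0.5 * 0, 0.5 * 1]) + (-0.7)

full = repr(z)         # shortest string that round-trips to the same float
rounded = "%.12g" % z  # 12 significant digits: an assumed stand-in for the shorter display

print(full)
print(rounded)
```

Both strings describe the same stored value; only the number of digits shown differs.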

In PySide: QProcess.write(u'Test') returns 0L

My understanding is that QString has been dropped in PySide. One can write a Python string into a QLineEdit, and when the QLineEdit is read, the text is returned as a unicode string (16 bits per character).
Writing this string from my GUI process to a sub-process started with QProcess does not seem to work: write() just returns 0L (see below). If I convert the unicode string back to a Python byte string with str(), then self.my_process.write(str(u'test')) returns 4L. This behaviour does not seem correct to me.
Could someone explain why QProcess.write() does not seem to work on unicode strings?
(Pdb) PySide.QtCore.QString()
*** AttributeError: 'module' object has no attribute 'QString'
(Pdb) self.myprocess.write(u'test')
0L
(Pdb) self.myprocess.write(str(u'test'))
4L
(Pdb)
PySide has never provided classes like QString, QStringList, QVariant, etc. It has always done implicit conversion to and from the equivalent python types - that is, in PyQt terminology, it only implements the v2 API (see PSEP 101 for more details).
However, the behaviour of QProcess when attempting to write unicode strings seems somewhat broken in PySide compared with PyQt4. Here's a simple test in PyQt4:
Python 2.7.8 (default, Sep 24 2014, 18:26:21)
[GCC 4.9.1 20140903 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from PyQt4 import QtCore
>>> QtCore.PYQT_VERSION_STR
'4.11.2'
>>> p = QtCore.QProcess()
>>> p.start('cat'); p.waitForStarted()
True
>>> p.write(u'fóó'); p.waitForReadyRead()
3L
True
>>> p.readAll()
PyQt4.QtCore.QByteArray('f\xf3\xf3')
So it seems that PyQt will implicitly encode unicode strings as 'latin-1' before passing them to QProcess.write() (which of course expects either const char * or a QByteArray). If you want a different encoding, it must be done explicitly:
>>> p.write(u'fóó'.encode('utf-8')); p.waitForReadyRead()
5L
True
>>> p.readAll()
PyQt4.QtCore.QByteArray('f\xc3\xb3\xc3\xb3')
Now let's see what happens with PySide:
Python 2.7.8 (default, Sep 24 2014, 18:26:21)
[GCC 4.9.1 20140903 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from PySide import QtCore, __version__
>>> __version__
'1.2.2'
>>> p = QtCore.QProcess()
>>> p.start('cat'); p.waitForStarted()
True
>>> p.write(u'fóó'); p.waitForReadyRead()
0L
^C
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyboardInterrupt
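So: no implicit encoding. write() returns 0L, meaning nothing was written, and waitForReadyRead() then blocks because the child process never receives any input. Returning 0 silently instead of raising an error would seem to be a bug. However, retrying with an explicit encoding works as expected: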
>>> p.start('cat'); p.waitForStarted()
True
>>> p.write(u'fóó'.encode('utf-8')); p.waitForReadyRead()
5L
True
>>> p.readAll()
PySide.QtCore.QByteArray('fóó')
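A small helper can make the explicit encoding step harder to forget. This is a minimal sketch in Python 3 syntax; to_bytes is a hypothetical name, not part of PySide or PyQt, and the QProcess call appears only in a comment because it needs a running Qt process.

```python
def to_bytes(text, encoding="utf-8"):
    """Return bytes suitable for a byte-oriented write() call."""
    if isinstance(text, bytes):
        return text  # already encoded, pass through unchanged
    return text.encode(encoding)

utf8_payload = to_bytes(u"fóó")               # 5 bytes, as in the utf-8 sessions above
latin1_payload = to_bytes(u"fóó", "latin-1")  # 3 bytes, as in the PyQt4 implicit case

# Hypothetical usage with a started QProcess `p`:
#     p.write(to_bytes(line_edit_text))
print(len(utf8_payload), len(latin1_payload))
```

The byte counts line up with the 5L and 3L return values shown in the sessions above.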

Why does the interpreter hang when evaluating the expression?

Here's my experiment:
$ python
Python 2.7.5 (default, Feb 19 2014, 13:47:28)
[GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 3
>>> while True:
... a = a * a
...
^CTraceback (most recent call last):
File "<stdin>", line 2, in <module>
KeyboardInterrupt
>>> a
(seems to go on forever)
I understand that the interpreter looped forever at the "while True: " part, but why did it get stuck evaluating a?
a is now an extremely large number, and converting it to decimal for display takes a long time. If you print a inside the loop, you'll see it become huge within a few iterations; without the print it grows far larger still, because each print slows the loop down. By contrast, starting from a = 1 always returns 1 immediately, since 1 * 1 never grows.
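To see how quickly the number grows without printing it, one can track its bit length instead. This is an illustrative sketch; square_repeatedly is a made-up helper name.

```python
def square_repeatedly(start, iterations):
    """Square `start` the given number of times, recording the bit length after each step."""
    a = start
    bit_lengths = []
    for _ in range(iterations):
        a = a * a
        bit_lengths.append(a.bit_length())
    return a, bit_lengths

# After k squarings a equals 3 ** (2 ** k), so its size roughly doubles each time.
a, bit_lengths = square_repeatedly(3, 20)
print(bit_lengths[:5])
print(bit_lengths[-1])
```

Only 20 iterations already produce a number well over a million bits long; a loop left running for a few seconds goes much further, which is why echoing a at the prompt appears to hang.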

OpenCV and Numpy interacting badly

Can anyone explain why importing cv and numpy would change the behaviour of python's struct.unpack? Here's what I observe:
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from struct import pack, unpack
>>> unpack("f",pack("I",31))[0]
4.344025239406933e-44
This is correct
>>> import cv
libdc1394 error: Failed to initialize libdc1394
>>> unpack("f",pack("I",31))[0]
4.344025239406933e-44
Still ok, after importing cv
>>> import numpy
>>> unpack("f",pack("I",31))[0]
4.344025239406933e-44
And OK after importing cv and then numpy
Now I restart python:
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from struct import pack, unpack
>>> unpack("f",pack("I",31))[0]
4.344025239406933e-44
>>> import numpy
>>> unpack("f",pack("I",31))[0]
4.344025239406933e-44
So far so good, but now I import cv AFTER importing numpy:
>>> import cv
libdc1394 error: Failed to initialize libdc1394
>>> unpack("f",pack("I",31))[0]
0.0
I've repeated this a number of times, including on multiple servers, and it always goes the same way. I've also tried calling the functions as struct.unpack and struct.pack instead of importing the names directly, which makes no difference.
I can't understand how importing numpy and cv could have any impact at all on the output of struct.unpack (pack remains the same, btw).
The "libdc1394" message is, I believe, a red herring: ctypes error: libdc1394 error: Failed to initialize libdc1394
Any ideas?
tl;dr: importing numpy and then opencv changes the behaviour of struct.unpack.
UPDATE: Paulo's answer below shows that this is reproducible. Seborg's comment suggests that it's something to do with the way python handles subnormals, which sounds plausible. I looked into Contexts but that didn't seem to be the problem, as the context was the same after the imports as it had been before them.
This isn't an answer, but it's too big for a comment. I played with the values a bit to find the limits.
Without loading numpy and cv:
>>> unpack("f", pack("i", 8388608))
(1.1754943508222875e-38,)
>>> unpack("f", pack("i", 8388607))
(1.1754942106924411e-38,)
After loading numpy and cv, the first line is the same, but the second:
>>> unpack("f", pack("i", 8388607))
(0.0,)
You'll notice that the first result is the smallest positive normal 32-bit float. I then tried the same with d.
Without loading the libraries:
>>> unpack("d", pack("xi", 1048576))
(2.2250738585072014e-308,)
>>> unpack("d", pack("xi", 1048575))
(2.2250717365114104e-308,)
And after loading the libraries:
>>> unpack("d",pack("xi", 1048575))
(0.0,)
Now the first result is the smallest positive normal 64-bit float.
It seems that, for some reason, loading the numpy and cv libraries in that order makes unpack return 0 for any value below the smallest normal 32-bit or 64-bit float, that is, for subnormal (denormal) numbers.
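The boundary found above follows from the IEEE 754 single-precision layout, which can be checked with struct alone. This is a sketch, and float32_from_bits is a hypothetical helper name; it shows what an unaffected interpreter returns and does not reproduce the numpy/cv effect.

```python
import struct

def float32_from_bits(bits):
    """Reinterpret a 32-bit unsigned integer as an IEEE 754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

SMALLEST_NORMAL_BITS = 1 << 23  # exponent field = 1, mantissa = 0 (8388608)

smallest_normal = float32_from_bits(SMALLEST_NORMAL_BITS)
largest_subnormal = float32_from_bits(SMALLEST_NORMAL_BITS - 1)  # exponent field = 0
from_question = float32_from_bits(31)  # also subnormal: 31 * 2**-149

print(smallest_normal, largest_subnormal, from_question)
```

Every bit pattern below 8388608 has a zero exponent field and is therefore subnormal, which matches the set of values that turned into 0.0 after the two imports.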

Using msvcrt in 64-bit python ctypes

I want to call msvcrt functions from 64-bit python using the ctypes package. I'm obviously doing it wrong. Is the right way to do it obvious?
Python 2.7.2 (default, Jun 12 2011, 14:24:46) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import ctypes
>>> libc = ctypes.cdll.msvcrt
>>> fp = libc.fopen('text.txt', 'wb') #Seems to work, creates a file
>>> libc.fclose(ctypes.c_void_p(fp))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
WindowsError: exception: access violation reading 0xFFFFFFFFFF082B28
>>>
If this code did what I want, it would have opened and closed a text file without crashing.
The default ctypes result type is a 32-bit integer, but fopen returns a FILE* pointer, which is pointer width, i.e. 64 bits in a 64-bit process. You are therefore losing half of the information in the file pointer.
Before you call fopen you must declare that the result type is a pointer:
libc.fopen.restype = ctypes.c_void_p
fp = libc.fopen(...)
libc.fclose(fp)
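The truncation can be imitated on any platform without calling msvcrt. This is a sketch under a stated assumption: real_pointer is a made-up 64-bit address, chosen only so that its low 32 bits match the address in the traceback.

```python
import ctypes

# Hypothetical 64-bit FILE* value (an assumption for illustration only).
real_pointer = 0x000007FEFF082B28

# With the default restype (a C int), only the low 32 bits survive, read as a signed value.
truncated = ctypes.c_int32(real_pointer).value

# Handing that negative int back as a pointer sign-extends it to 64 bits.
seen_by_fclose = truncated & 0xFFFFFFFFFFFFFFFF

print(hex(real_pointer))
print(hex(seen_by_fclose))
```

The second value is 0xffffffffff082b28, the same address reported in the access violation, which is consistent with the upper half of the pointer having been lost and replaced by sign extension.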
