I want to put numerics and strings into the same numpy array. However, I very rarely (difficult to replicate, but it does happen) run into an error where the numeric-to-string conversion produces a value that cannot be translated back into a decimal (i.e., I get "9.8267567e" instead of "9.8267567e-5" in the array). This causes problems after writing files. Here is an example of what I am doing (though on a much smaller scale):
import numpy as np
x = np.array(.94749128494582)
y = np.array(x, dtype='|S100')
My understanding is that this should allow 100 string characters, but sometimes I am seeing a cut-off after ~10. Is there another type that I should be assigning, or a way to limit the number of characters in my array (x)?
First of all, x = np.array(.94749128494582) may not be doing what you think, because passing a bare scalar gives you a zero-dimensional array rather than a one-element array. Perhaps you meant x = np.array([.94749128494582])?
Now, as for preserving the strings properly, you could solve this by using
y = np.array(x, dtype=object)
However, as Joe has mentioned in his comment, it's not very numpythonic and you may as well be using plain old python lists.
I would recommend examining carefully why you seem to need to hold strings and numbers in the same array; it smells to me like you might have inappropriate data structures set up and could benefit from redesigning/refactoring. numpy arrays are built for fast numerical operations; they are not really suited to string manipulation or to serving as some kind of storage/database.
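For illustration, a minimal sketch (not from the original post) of what an object array preserves, compared with the truncating string dtype:
import numpy as np

# With dtype=object each element stays a normal Python object,
# so the float keeps its full precision and the string is untouched.
mixed = np.array([0.94749128494582, "some label"], dtype=object)
print(mixed[0] * 2)      # still a real float: 1.89498256989164
print(type(mixed[1]))    # <class 'str'>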
I am very new to Python and I would like to write the following (something like fprintf in MATLAB), but I do not know why this string conversion is not working.
Here is the code
import numpy as np
coord=np.linspace(0,10,5)
keyy=("LE")
key=np.repeat(keyy,5)
out_arr=np.array_str(key)
zip=np.array([coord,out_arr])
zzip=zip.T
print(zzip)
savefile=np.savetxt("nam.dat",zzip,fmt="%f %s")
The problem is with the following line:
out_arr=np.array_str(key)
This is converting the array ['LE' 'LE' 'LE' 'LE' 'LE'] to the string "['LE' 'LE' 'LE' 'LE' 'LE']". Note the quotes. This is no longer an array, it is a single string, and numpy interprets it as a length-1 array. You first need to drop that line:
key=np.repeat(keyy,5)
zip=np.array([coord,key])
The next problem you will run into is that this will convert the coord numbers into strings, so every element becomes a string. This is because numpy arrays have a single, fixed type (there are exceptions, but they are more complicated), and the only type common to both columns here is a string.
The simple way around this is to use an "object" array (basically numpy's equivalent of a MATLAB cell array), which stores arbitrary Python objects rather than data of a single fixed type:
zip=np.array([coord,key], dtype='object')
However, the better solution if you can is to use pandas. Pandas is sort of like MATLAB tables, but much more powerful. It is designed for this sort of data, and has very nice functions for writing text files like you want to do here in a cleaner, more explicit way.
Also, zip is a built-in function, and it is better not to give variables the same names as built-in functions. It is allowed, but zip is an important function and you don't want to block access to it.
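Putting the advice together, a sketch of a corrected script might look like this (variable renamed from zip to zipped; the object-dtype route is kept so the floats stay numeric until savetxt formats them):
import numpy as np

coord = np.linspace(0, 10, 5)
key = np.repeat("LE", 5)

# object dtype keeps floats and strings side by side without converting either
zipped = np.array([coord, key], dtype=object).T
np.savetxt("nam.dat", zipped, fmt="%f %s")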
Calling arr.max() on an empty numpy array causes an error. But it's a common convention in math to say that the maximum of the empty set is negative infinity. And since numpy supports infinities, why doesn't np.max() behave this way? It would save me a couple lines of additional logic to handle empty arrays. I'm sure there's a good reason, just curious what that is.
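As a hedged aside on the "couple lines of additional logic": newer NumPy versions (1.15+) let you pass an identity value via initial=, which makes the empty case return that value instead of raising:
import numpy as np

empty = np.array([])
# empty.max() raises ValueError because the maximum reduction has no identity
print(np.max(empty, initial=-np.inf))   # -inf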
I tried to compute math.exp(9500) but encountered an OverflowError: math range error (the result is roughly 6.3e4125). From this question it seems like it's caused by a float that is too large; the accepted answer says "(...) is slightly outside of the range of a double, so it causes an overflow".
I know that Python can deal with arbitrarily large integers (long type), is there a way to deal with arbitrarily large floats in the same manner ?
Edit: my original question was about using integers for calculating exp(n), but as Eric Duminil said, the simplest way to do that would be 3**n, which doesn't provide any useful result. I now realize this question might be similar to this one.
I don't think it's possible to approximate exp() with integers. If you use 3**n instead of 2.71828182845905**n, your calculations will be completely useless.
One possible solution would be to use Sympy. According to the documentation:
There is essentially no upper precision limit
>>> from sympy import *
>>> exp(9500)
exp(9500)
>>> exp(9500).evalf()
6.27448493490172e+4125
You can also specify the desired precision:
>>> exp(9500).evalf(1000)
6.274484934901720177929867046175406311474380389941415760684209191232450360090766458256588885184199320756050569665785657269735313171886975309933254563488343491718198237894473901620914303565550450204805537225888529509352754121292701357622411614860860409639719786022989336837263283678476008817556351031696366815467221836948040042378034720460820127399855873232167818091083005170669482845098735176209372328114732133251096196535355946589133977397512846130629857604295369747597459602137604440011394793443041829253598478244189078131130488653468669559814695095974271938947640276013215753183113041899037415404445478806695965167014404297848725756879184380559837391976534521522360723388582608454995349380217499779247330557664230806254642768796486899322646423713763772064068933790640394967085887914192401473425799354391464743910233873602389444180426155866237536459654917521713769608318128404177877383203786348495822099924812081683286880293701785567962687838594752986160305764297117036426951203418854463404773701882e+4125
With exp(9500).evalf(5000), you even get the integer closest to exp(9500).
Here's another way to calculate the result with plain Python. exp(9500) itself is too big to represent as a float, but log10(exp(9500)) isn't. You can't compute it in that form directly, but log10(exp(9500)) equals log(exp(9500))/ln(10), which is simply 9500/ln(10):
>>> from math import log
>>> 9500/log(10)
4125.797578080892
>>> int(9500/log(10))
4125
>>> 10**(9500/log(10) % 1)
6.274484934896202
This way, you can calculate that exp(9500) is 6.27448493 * 10**4125 in plain Python, without any library!
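The same idea can be wrapped into a small helper (a sketch with my own function name) that splits exp(x) into a mantissa and a decimal exponent:
from math import floor, log

def exp_sci(x):
    # log10(exp(x)) = x / ln(10); split it into integer and fractional parts
    log10_value = x / log(10)
    exponent = floor(log10_value)
    mantissa = 10 ** (log10_value - exponent)
    return mantissa, exponent

print(exp_sci(9500))   # approximately (6.2744849349, 4125)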
Try the long type. (Note: the separate long type was removed in Python 3.0; since then, int itself is arbitrary precision.)
I'm trying to use the pack function in the struct module to encode data into formats required by a network protocol. I've run into a problem in that I don't see any way to encode arrays of anything other than 8-bit characters.
For example, to encode "TEST", I can use format specifier "4s". But how do I encode an array or list of 32-bit integers or other non-string types?
Here is a concrete example. Suppose I have a function doEncode which takes an array of 32-bit values. The protocol requires a 32-bit length field, followed by the array itself. Here is what I have been able to come up with so far.
from array import *
from struct import *

def doEncode(arr):
    bin = pack('>i' + len(arr)*'I', len(arr), ???)

arr = array('I', [1, 2, 3])
doEncode(arr)
The best I have been able to come up with is generating the format string for pack dynamically from the length of the array. Is there some way of specifying that I have an array so I don't need to do this, like there is with a string (which would be, e.g., pack('>i' + str(len(arr)) + 's'))?
Even with the above approach, I'm not sure how I would go about actually passing the elements of the array in a similarly dynamic way, i.e. I can't just write arr[0], arr[1], ... because I don't know ahead of time what the length will be.
I suppose I could just pack each individual integer in the array in a loop, and then join all the results together, but this seems like a hack. Is there some better way to do this? The array and struct modules each seem to do their own thing, but in this case what I'm trying to do is a combination of both, which neither wants to do.
data = pack('>i', len(arr)) + arr.tobytes()  # tobytes() replaces the old tostring() name
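An alternative sketch (hypothetical function name mirroring doEncode) that also answers the "how do I pass the elements dynamically" part: build the format string from the length and unpack the array into pack, which keeps the byte order of the elements consistent with the length field:
from array import array
from struct import pack

def do_encode(arr):
    # '>' forces big-endian for the length field and every element
    return pack('>i%dI' % len(arr), len(arr), *arr)

print(do_encode(array('I', [1, 2, 3])).hex())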
Python chooses the size of its integers automatically, based on the underlying system architecture. Unfortunately I have a huge dataset which needs to be loaded fully into memory.
So, is there a way to force Python to use only 2 bytes for some integers (equivalent of C++ 'short')?
Nope. But you can use short integers in arrays:
from array import array
a = array("h") # h = signed short, H = unsigned short
As long as the value stays in that array it will be a short integer.
documentation for the array module
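A small usage sketch (itemsize is 2 on common platforms, but it is whatever the C compiler uses for short):
from array import array

a = array("h", [100, -200, 300])     # three signed shorts
print(a.itemsize)                    # bytes per element, typically 2
print(len(a) * a.itemsize)           # raw payload size for the three values
print(a[1])                          # reading it back gives a normal Python int: -200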
Thanks to Armin for pointing out the 'array' module. I also found the 'struct' module, which packs C-style structs into a byte string:
From the documentation (https://docs.python.org/library/struct.html); the example output assumes a big-endian machine:
>>> from struct import *
>>> pack('hhl', 1, 2, 3)
'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('hhl', '\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
>>> calcsize('hhl')
8
You can use NumPy's fixed-width integer types, such as np.int8 or np.int16.
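For instance, a sketch assuming NumPy is already available:
import numpy as np

a = np.arange(1000, dtype=np.int16)   # each element occupies 2 bytes
print(a.dtype, a.itemsize, a.nbytes)  # int16 2 2000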
Armin's suggestion of the array module is probably best. Two possible alternatives:
1. You can create an extension module yourself that provides the data structure that you're after. If it's really just something like a collection of shorts, then that's pretty simple to do.
2. You can cheat and manipulate bits, so that you're storing one number in the lower half of the Python int, and another one in the upper half. You'd write some utility functions to convert to/from these within your data structure. Ugly, but it can be made to work (a sketch follows below).
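A minimal sketch of that bit-packing idea (my own helper names), storing two unsigned 16-bit values in one Python int:
MASK = 0xFFFF

def pack_pair(low, high):
    # each value must fit in 16 bits (0..65535)
    return ((high & MASK) << 16) | (low & MASK)

def unpack_pair(packed):
    return packed & MASK, (packed >> 16) & MASK

p = pack_pair(123, 456)
print(unpack_pair(p))   # (123, 456)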
It's also worth realising that a Python integer object is not 4 bytes - there is additional overhead. So if you have a really large number of shorts, then you can save more than two bytes per number by using a C short in some way (e.g. the array module).
I had to keep a large set of integers in memory a while ago, and a dictionary with integer keys and values was too large (I had 1GB available for the data structure IIRC). I switched to using an IIBTree (from ZODB) and managed to fit it. (The ints in an IIBTree are real C ints, not Python integers, and I hacked up an automatic switch to an IOBTree when a number was larger than 32 bits.)
You can also store multiple integers of any size inside a single large Python integer.
For example, as seen below, in Python 3 on a 64-bit x86 system a 1024-bit integer takes 164 bytes of memory, so on average one byte stores around 6.24 bits. With even larger integers the storage density rises further, e.g. around 7.50 bits per byte for an integer 2**20 bits wide.
Obviously you will need some wrapper logic to access the individual short numbers stored in the larger integer, which is easy to implement (a sketch follows after the interpreter session below).
One issue with this approach is that access slows down because of the large-integer arithmetic involved; if you read or write a big batch of consecutively stored values at once, the slower access to the long integers won't matter much.
I guess using numpy would be the easier approach.
>>> a = 2**1024
>>> sys.getsizeof(a)
164
>>> 1024/164
6.2439024390243905
>>> a = 2**(2**20)
>>> sys.getsizeof(a)
139836
>>> 2**20 / 139836
7.49861266054521
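A sketch of the wrapper logic mentioned above (hypothetical helper names), treating the big integer as an array of 16-bit slots; note that every write builds a new integer, which is exactly where the slowdown comes from:
WIDTH = 16
MASK = (1 << WIDTH) - 1

def get_slot(big, index):
    return (big >> (index * WIDTH)) & MASK

def set_slot(big, index, value):
    shift = index * WIDTH
    return (big & ~(MASK << shift)) | ((value & MASK) << shift)

store = 0
store = set_slot(store, 0, 1000)
store = set_slot(store, 1, 65535)
print(get_slot(store, 0), get_slot(store, 1))   # 1000 65535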
Using a bytearray, which is basically a C unsigned char array under the hood, is a better solution than using large integers. There is no arithmetic overhead when manipulating a byte array, and it has much less storage overhead than large integers: it is possible to reach a storage density of 7.99+ bits per byte.
>>> import sys
>>> a = bytearray(2**32)
>>> sys.getsizeof(a)
4294967353
>>> 8 * 2**32 / 4294967353
7.999999893829228
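For comparison, the same 16-bit slots in a bytearray can be read and written in place with struct.pack_into/unpack_from (a sketch; little-endian chosen arbitrarily):
import struct

buf = bytearray(2 * 1000)                    # room for 1000 unsigned shorts
struct.pack_into('<H', buf, 2 * 42, 65535)   # write slot 42
value, = struct.unpack_from('<H', buf, 2 * 42)
print(value)                                 # 65535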