I'm pretty new to NumPy. I tried to use a numpy array as a dictionary key in one of my functions, and the Python interpreter told me that numpy arrays are not hashable. I've found that one way to work around this issue is to use the repr() function to convert the array to a string, but that seems very expensive. Is there a better way to achieve the same effect?
Update: I could create a new class to contain the numpy array, which seems to be the right way to achieve what I want. Just wondering if there is any better method?
Update 2: Using a class to contain the data in the array and then overriding the __hash__ function is acceptable; however, I'd prefer the solution provided by @hpaulj. Converting the array/list to a tuple fits my need better, as it does not require an additional class.
If you want to quickly use a numpy.ndarray as a key in a dictionary, a fast option is ndarray.tobytes(), which returns a raw Python bytes object, which is immutable. Note that two arrays with the same data but different shapes (e.g. a 1×4 and a 2×2) produce the same bytes, so include the shape in the key if that matters for your use case.
import numpy

my_array = numpy.arange(4).reshape((2, 2))
my_dict = {}
my_dict[my_array.tobytes()] = None
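If you later need the array back from such a key, the bytes can be decoded with numpy.frombuffer, provided you keep track of the original dtype and shape yourself (a minimal sketch):

```python
import numpy as np

my_array = np.arange(4).reshape((2, 2))
key = my_array.tobytes()

# Recover the array from the key: frombuffer needs the original dtype,
# and reshape needs the original shape, since tobytes() stores neither.
restored = np.frombuffer(key, dtype=my_array.dtype).reshape(2, 2)
```

Note that frombuffer returns a read-only view over the bytes; call .copy() on it if you need a writable array.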
After doing some research and reading through all the comments, I think I know the answer to my own question, so I'll just write it down.
Write a class to contain the data in the array and then override the __hash__ function to change how it is hashed, as mentioned by ZdaR.
Convert the array to a tuple, which makes it hashable instantly. Thanks to hpaulj.
I'd prefer method No. 2 because it fits my needs better and is simpler. However, using a class might bring some additional benefits, so it could also be useful.
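A minimal sketch of the tuple approach; for a 2-D array, nested tuples are needed so that every level is hashable:

```python
import numpy as np

my_array = np.arange(4).reshape((2, 2))

# Convert each row to a tuple, then the whole array to a tuple of tuples,
# which is hashable and can be used as a dictionary key.
key = tuple(map(tuple, my_array))

my_dict = {key: "some value"}
print(my_dict[tuple(map(tuple, my_array))])  # some value
```

For a 1-D array, a plain tuple(my_array) is enough.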
I just ran into that issue and there's a very simple solution to it using a list comprehension:
import numpy as np

# avoid naming the dictionary `dict`, which shadows the built-in
lookup = {'key1': 1, 'key2': 2}
my_array = np.array(['key1', 'key2'])
result = np.array([lookup[element] for element in my_array])
print(result)
The result should be:
[1 2]
I don't know how efficient this is, but it seems like a very practical and straightforward solution; no conversions or new classes needed :)
Related
I want to perform calculations on a list and assign the results to a second list, and I want to do this as efficiently as possible because I'll be working with a lot of data. What is the best way to do this? My current version uses append:
output = []
for i, f in enumerate(time_series_data):
    if f > x:
        output.append(calculation with f)
etc etc
should I use append or declare the output list as a list of zeros at the beginning?
Appending the values is not slower than the other possible ways to accomplish this.
The code looks fine, and creating a list of zeros would not help. It could even create problems, since you might not know in advance how many values will pass the condition f > x.
Since you wrote etc etc, I'm not sure how long the loop body is or what operations you need there. If possible, try a list comprehension; that would be a little faster.
You can have a look at the article below, which compares the speed of list creation using three methods: list comprehension, append, and pre-initialization.
https://levelup.gitconnected.com/faster-lists-in-python-4c4287502f0a
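A rough sketch of how you might time the two approaches yourself with timeit; the calculation (f * 2) is a placeholder for whatever the real loop body does:

```python
import timeit

time_series_data = list(range(100_000))
x = 50_000

def with_append():
    output = []
    for f in time_series_data:
        if f > x:
            output.append(f * 2)  # placeholder calculation
    return output

def with_comprehension():
    # Same filter and calculation, expressed as a list comprehension.
    return [f * 2 for f in time_series_data if f > x]

print(timeit.timeit(with_append, number=10))
print(timeit.timeit(with_comprehension, number=10))
```

Both produce the same list; the comprehension is usually somewhat faster because the append call and loop bookkeeping happen inside the interpreter's optimized bytecode.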
When I want a list from a DataFrame column (pandas 1.0.1), I can do:
df['column'].to_list()
or I can use:
list(df['column'])
Both alternatives work well, but what are the differences between them? Is one method better than the other?
list receives an iterable and returns a pure Python list. It is the built-in Python way to convert any iterable into a list.
to_list is a method on the core pandas object classes which converts those objects to pure Python lists. The difference is that the implementation is done by the pandas core developers, who may optimize the process according to their understanding, and/or add extra functionality to the conversion that a plain list(...) wouldn't have.
For example, the source_code for this piece is:
def tolist(self):
    '''(...)
    '''
    if self.dtype.kind in ["m", "M"]:
        return [com.maybe_box_datetimelike(x) for x in self._values]
    elif is_extension_array_dtype(self._values):
        return list(self._values)
    else:
        return self._values.tolist()
Which basically means to_list will end up using one of three paths: a normal list comprehension, analogous to list(...) but ensuring the final objects are of pandas' datetime type instead of Python's; a straight list(...) conversion; or numpy's tolist() implementation.
The differences between the latter and python's list(...) can be found in this thread.
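The numpy-level difference in that last branch is easy to see: ndarray.tolist() converts elements to native Python scalars, while list(...) keeps numpy scalar objects (a sketch using numpy directly):

```python
import numpy as np

arr = np.arange(3)

via_list = list(arr)       # elements stay numpy scalar objects
via_tolist = arr.tolist()  # elements become plain Python ints

print(type(via_list[0]))    # a numpy integer scalar type
print(type(via_tolist[0]))  # <class 'int'>
```

The same distinction carries over to a pandas Series backed by a numpy array: Series.to_list() delegates to tolist() and yields native Python objects.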
It is illegal to use assignment inside a map function, for example:
map(lambda i: test[i] += value[i], somelist)  # SyntaxError
So what is a good alternative? You could use a for loop, but it seems to me that at large scale the for loop solution is very slow. Is there a better way?
Use this, it's preferred:
for i in somelist:
    test[i] += value[i]
And anyway, your example is not a good case for using map. You use map, or even better a list comprehension, when you want to create a new list as a result. In this case an assignment is performed on each item, so there's no point in creating a new list here!
If you don't mind using numpy (and I don't see why you would), then this should be a lot more performant:
test[somelist] += value[somelist]
assuming you have first converted your lists to numpy arrays (negligible overhead):
import numpy as np
test = np.array(test)
value = np.array(value)
Note that this would also work with the plain assignment operator =, as well as -=, *=, etc. One caveat: if somelist contains duplicate indices, += is applied only once per unique index; use np.add.at if you need the repeats to accumulate.
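A self-contained sketch of the numpy version, with example data filled in for the names from the question:

```python
import numpy as np

test = np.array([10, 20, 30, 40])
value = np.array([1, 2, 3, 4])
somelist = [0, 2]

# Fancy indexing selects the chosen positions on both sides,
# so the whole update runs in vectorized C code instead of a
# Python-level loop.
test[somelist] += value[somelist]
print(test)  # [11 20 33 40]
```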
print(sum(1 for x in alist if x[1] == 8))
This code runs fine, but it is slow. My list is very large and the computation takes a lot of time. Do you know a better and faster way to do it?
You'd have to create indexes or cached counts to speed up such code; trade memory for speed.
Wherever you handle your list (add to it, remove from it, edit entries), you also maintain your indices. For example, if you kept a counts dict with ids as keys and their frequencies as values, all you would have to do is look up the count directly, while ensuring the counts stay up to date as you manipulate alist.
The best way to manage this is by encapsulating your list in a custom type, so that you can control all manipulations of the data structure and maintain the extra information.
Not sure how much faster it would be but
len([x for x in alist if x[1] == 8])
is a little clearer.
I would use numpy. My numpy skills are a little bit rusty, but (np_array == 8).sum() would give you what you need for a single-depth array (note that len() would only return the array's length, not the match count). For your case it would be (np_array[:, 1] == 8).sum() (this assumes your problem could use numpy arrays).
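A short sketch of the numpy counting approach, assuming alist holds pairs as the indexing x[1] in the question suggests:

```python
import numpy as np

alist = [(1, 8), (2, 3), (5, 8), (7, 8)]
arr = np.array(alist)

# Compare the second column to 8; summing the boolean mask
# counts the True entries in one vectorized pass.
count = (arr[:, 1] == 8).sum()
print(count)  # 3
```

np.count_nonzero(arr[:, 1] == 8) is an equivalent, often slightly faster, spelling.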
I am implementing a GA in Python and need to store a sequence of ones and zeros, so I am representing my data as binaries. What is the best data structure for that? A simple string?
If your chromosomes are fixed-length bitstrings, consider using Numpy arrays and vectorized operations on them instead of lists. These may be much faster than Python lists. E.g., one-point crossover can be done with
import numpy as np

def crossover(a, b):
    """Return new individual by combining parents a and b
    with a random crossover point."""
    c = np.empty(a.shape, dtype=bool)
    k = np.random.randint(a.shape[0])
    c[:k] = a[:k]
    c[k:] = b[k:]
    return c
If you don't want to use Numpy, then strings seem quite appropriate; they're much more compact than lists, which store pointers to elements rather than actual elements.
Finally, be sure to have a look at how Pyevolve represents chromosomes; it seems to do so using Numpy.
I think sticking with strings is a good idea. You can easily chop strings into pieces. If you need to act on one as a list, you can convert it with list(s). Once you have a list, you can alter it and turn it back into a string with ''.join(lst).
Personally, I wouldn't use a long or another integer type to store the bits. It may be more space-efficient, but the headache of working with the data when you want to do recombination would be considerable. Mutations would be problematic as well if the mutation consists of anything other than a bit flip. Plus, the code would be much harder to read.
Just my 2 cents. Hope that helps you out.
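A sketch of the string round-trip described above, using a single-bit-flip mutation as the example operation:

```python
import random

chromosome = "101011"

# Convert to a list so individual genes can be modified.
genes = list(chromosome)

# Flip one randomly chosen bit.
k = random.randrange(len(genes))
genes[k] = "1" if genes[k] == "0" else "0"

# Join back into the compact string representation.
mutated = "".join(genes)
print(mutated)
```

Crossover is just as direct with slicing: child = a[:k] + b[k:].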
You can try using bitarray.
Or you can play with buffers.