For example, suppose we have a large list of objects like this:
class KeyStatisticEntry:
    def __init__(self, value=""):
        self.usedBytes = len(value)
        self.encoding = get_string_encoding(value)

    @property
    def total(self):
        overhead = get_object_overhead(self.usedBytes)
        if self.encoding == 'some value':
            return overhead
        else:
            return self.usedBytes + overhead

    @property
    def aligned(self):
        return some_func_with(self.usedBytes)

    # ... lots more properties calculated from the existing attributes
And we need to aggregate lots of metrics about these objects: min, max, sum, mean, and stdev values of their properties. Currently I do it with code like this:
used_bytes = []
total_bytes = []
aligned_bytes = []
encodings = []

for obj in keys.items():
    used_bytes.append(obj.usedBytes)
    total_bytes.append(obj.total)
    aligned_bytes.append(obj.aligned)
    encodings.append(obj.encoding)

total_elements = len(used_bytes)
used_user = sum(used_bytes)
used_real = sum(total_bytes)
aligned = sum(aligned_bytes)
mean = statistics.mean(used_bytes)
Question:
Is there a more "pythonic" way to do this, with better performance and memory usage?
You can use operator.attrgetter in order to get multiple attributes of your objects, then use itertools.zip_longest (itertools.izip_longest in Python 2.x) to group the corresponding attributes together.
from operator import attrgetter

all_result = [attrgetter('usedBytes', 'total', 'aligned', 'encoding')(obj) for obj in keys.items()]
Or use a generator expression to create a generator instead of a list:
all_result = (attrgetter('usedBytes', 'total', 'aligned', 'encoding')(obj) for obj in keys.items())
Then use zip_longest:

from itertools import zip_longest

used_bytes, total_bytes, aligned_bytes, encodings = zip_longest(*all_result)
Then use the map function to apply sum to the iterables for which you need a sum:
used_user, used_real, aligned = map(sum,(used_bytes, total_bytes, aligned_bytes))
And separately for len and mean:
total_elements = len(used_bytes)
mean = statistics.mean(used_bytes)
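
Putting those pieces together, here is a minimal runnable sketch (assuming, as in the question, that keys.items() yields KeyStatisticEntry objects):

import statistics
from operator import attrgetter
from itertools import zip_longest

# build the attrgetter once and reuse it for every object
getter = attrgetter('usedBytes', 'total', 'aligned', 'encoding')
used_bytes, total_bytes, aligned_bytes, encodings = zip_longest(
    *(getter(obj) for obj in keys.items()))

used_user, used_real, aligned = map(sum, (used_bytes, total_bytes, aligned_bytes))
total_elements = len(used_bytes)
mean = statistics.mean(used_bytes)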
And if you want to handle all the sub-lists as generators (which is better in terms of memory usage, at the cost of some runtime performance) you can use a new class in order to calculate the desired results separately, using generators:
import statistics
from itertools import tee

class Aggregator:
    def __init__(self, all_obj):
        self.all_obj = all_obj
        self.used_user, self.mean = self.getTotalBytesAndMean()
        self.total_elements = len(self.all_obj)
        self.aligned = self.getAligned()

    def getTotalBytesAndMean(self):
        # tee lets us consume the same generator twice (for sum and mean),
        # but note that it buffers items until both iterators are consumed
        iter_1, iter_2 = tee(obj.usedBytes for obj in self.all_obj)
        return sum(iter_1), statistics.mean(iter_2)

    def getTotal(self):
        return sum(obj.total for obj in self.all_obj)

    def getAligned(self):
        return sum(obj.aligned for obj in self.all_obj)

    def getEncoding(self):
        return (obj.encoding for obj in self.all_obj)
Then you can do:

Agg = Aggregator(keys.items())

# And simply access the attributes
Agg.used_user
There is probably a better way for memory usage: using (implicit) generators instead of lists for getting all your info. I am not sure it will be better if you are doing many computations on the same list (for example for usedBytes). Note however that you cannot use len on a generator (but here the length would be the length of your input list anyway):
total_elements = len(keys.items())
used_user = sum(obj.usedBytes for obj in keys.items())
used_real = sum(obj.total for obj in keys.items())
aligned = sum(obj.aligned for obj in keys.items())
mean = statistics.mean(obj.usedBytes for obj in keys.items())
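
If you need several statistics on the same attribute (say usedBytes), a possible middle ground, sketched here rather than taken from the answer above, is to materialize only that one list and keep the rest as generators:

import statistics

used_bytes = [obj.usedBytes for obj in keys.items()]  # reused by len, sum and mean
total_elements = len(used_bytes)
used_user = sum(used_bytes)
mean = statistics.mean(used_bytes)

# attributes that are only consumed once can stay lazy
used_real = sum(obj.total for obj in keys.items())
aligned = sum(obj.aligned for obj in keys.items())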
Given this Python code snippet:
import numpy as np

rng = np.random.default_rng(42)

class Agent:
    def __init__(self, id):
        self.id = id
        self.friends = set()

    def __repr__(self):
        return str(self.id)

group_list = list()
for i in range(100):
    new_obj = Agent(i)
    group_list.append(new_obj)

for person in group_list:
    pool = rng.choice([p for p in group_list if p != person], 6)
    for p in pool:
        person.friends.add(p)

def set_to_list_ordered(a_set):
    return sorted(list(a_set), key=lambda x: x.id)

print("This will change: ")
print(rng.choice(list(group_list[0].friends), 2))
print("This will not change: ")
print(rng.choice(set_to_list_ordered(group_list[0].friends), 2))
The purpose of this code is to perform a random extraction of 2 elements from a set. The problem is that np.random.choice does not accept a set, so you have to turn it into a list. But, doing this, the order of the elements is arbitrary, and given the same seed, the result of the random extraction is not replicable. In this case I implemented a function that sorts the elements, but it is costly.
You will rightly say, use a list instead of a set. To this I reply that sets fit my use case perfectly. For example, this structure guarantees that the Agent.friends attribute has no duplicate elements.
So, my question is: what is the most convenient method, other than the function I implemented, to use sets and still have the random extraction from a set be deterministic? Is it better to use lists instead of sets? Is there any way to make the transformation deterministic?
Thanks in advance.
EDIT:
Some observed that internally the transformation from set to list is consistent. My objective is for this transformation to be consistent externally as well, so that by running the same script numerous times, the extraction of the default_rng instance is the same.
You can use an ordered set, e.g. the third-party ordered-set package.
From the documentation:

>>> from ordered_set import OrderedSet
>>> OrderedSet('abracadabra')
OrderedSet(['a', 'b', 'r', 'c', 'd'])
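
Applied to the question, a sketch (assuming ordered-set is installed via pip): make friends an OrderedSet, so its iteration order is the seeded insertion order and the list() conversion becomes reproducible across runs:

from ordered_set import OrderedSet

class Agent:
    def __init__(self, id):
        self.id = id
        self.friends = OrderedSet()  # insertion-ordered, still no duplicates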
Solved by overriding the __hash__() method. A set's iteration order depends on its elements' hash values: the default __hash__ is derived from the object's memory address, which changes between runs, while the id-based hash below is the same in every run, so the set-to-list conversion becomes reproducible. Source: https://www.youtube.com/watch?v=C4Kc8xzcA68
import numpy as np

rng = np.random.default_rng(42)

class Agent:
    def __init__(self, id):
        self.id = id
        self.friends = set()

    def __repr__(self):
        return str(self.id)

    def __hash__(self):
        return self.id

group_list = list()
for i in range(100):
    new_obj = Agent(i)
    group_list.append(new_obj)

for person in group_list:
    pool = rng.choice([p for p in group_list if p != person], 6)
    for p in pool:
        person.friends.add(p)

def set_to_list_ordered(a_set):
    return sorted(list(a_set), key=lambda x: x.id)

print("This will change: ")
print(rng.choice(list(group_list[0].friends), 2))
print("This will not change: ")
print(rng.choice(set_to_list_ordered(group_list[0].friends), 2))
def find_closest(data, target, key=lambda x: f(x)):
This is my function definition, where data is a set of values, and I want to find the value that evaluates closest to target in as few evaluations as possible, i.e. abs(target - f(x)) is minimal. f(x) is monotonic.
I've heard that binary search can do this in O(log(n)) time; is there a library implementation in Python? Are there more efficient search algorithms?
EDIT: I'm looking to minimize complexity in terms of evaluating f(x), because that's the expensive part. I want to find the x in data that, when evaluated with f(x), comes closest to the target. data is in the domain of f, target is in the range of f. Yes, data can be sorted quickly.
You can use the utilities in the bisect module. You will have to evaluate f on data though, i.e. list(f(x) for x in data), to get a monotonic / sorted list to bisect.
I am not aware of a binary search in the standard library that works directly on f and data.
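
A minimal sketch of that idea, assuming data is already sorted and f is increasing:

import bisect

values = [f(x) for x in data]            # one evaluation of f per element
i = bisect.bisect_left(values, target)   # index of the first value >= target
# the closest candidate is values[i] or values[i - 1]; compare their distances to target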
If the data presented is already sorted and the function is strictly monotonic, apply the function f on the data and then perform a binary search using bisect.bisect_left:

import bisect

def find_closest(data, target, key=f):
    data = list(map(key, data))       # map() alone is not subscriptable in Python 3
    if data[0] > data[-1]:            # f is decreasing: negate values and target
        data = [-e for e in data]
        target = -target
    try:
        return data[bisect.bisect_left(data, target)]
    except IndexError:
        return data[-1]
Use the bisect_left() function to find the lower bound.
bisect_left() accepts any random-access sequence of elements; to avoid calculating all of them up front you can use a lazy collection of calculated function values, with __len__ and __getitem__ methods defined.
Carefully check the return value for border conditions.
Your heavy calculation will be called O(log(N) + 1) = O(log(N)) times.
from bisect import bisect_left

class Cache(dict):
    def __init__(self, method):
        self.method = method

    def __missing__(self, key):
        # compute once, then store, so repeated lookups are free
        value = self[key] = self.method(key)
        return value

class MappedList(object):
    def __init__(self, method, input):
        self.method = method
        self.input = input
        self.cache = Cache(method)

    def __len__(self):
        return len(self.input)

    def __getitem__(self, i):
        return self.cache[self.input[i]]

def find_closest(data, target, key=lambda x: x):
    s = sorted(data)
    evaluated = MappedList(key, s)
    index = bisect_left(evaluated, target)
    if index == 0:
        return s[0]
    if index == len(s):
        return s[index - 1]
    if target - evaluated[index - 1] <= evaluated[index] - target:
        return s[index - 1]
    else:
        return s[index]
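
For example, with a hypothetical expensive key function:

import math

data = [1, 2, 4, 8, 16, 32]
print(find_closest(data, 3.5, key=math.log2))  # log2(8) = 3 and log2(16) = 4 tie; ties go left, so 8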
I would like to use in Python something akin to, or better than, R arrays. R arrays are tensor-like objects with a dimnames attribute, which straightforwardly allows subsetting tensors by names (strings). In numpy, recarrays allow for column names, and pandas allows flexible and efficient subsetting of 2-dimensional arrays. Is there something in Python that allows similar slicing and subsetting of ndarrays by names (or better, by objects that are hashable and immutable)?
How about this quick and dirty mapping from lists of strings to indices? You could clean up the notation with callable classes.
import numpy as np

def make_dimnames(names):
    return [{n: i for i, n in enumerate(dim)} for dim in names]

def foo(d, *args):
    return [d[x] for x in args]

A = np.arange(9).reshape(3, 3)
dimnames = [('x', 'y', 'z'), ('a', 'b', 'c')]
Adims = make_dimnames(dimnames)

A[foo(Adims[0], 'x', 'z'), foo(Adims[1], 'b')]               # A[[0,2],[1]]
A[foo(Adims[0], 'x', 'z'), slice(*foo(Adims[1], 'b', 'c'))]  # A[[0,2], slice(1,2)]
Or does R do something more significant with the dimnames?
A class compresses the syntax a bit:
class bar(object):
    def __init__(self, dimnames):
        self.dd = {n: i for i, n in enumerate(dimnames)}

    def __call__(self, *args):
        return [self.dd[x] for x in args]

    def __getitem__(self, key):
        return self.dd[key]

d0, d1 = bar(['x', 'y', 'z']), bar(['a', 'b', 'c'])
A[d0('x', 'z'), slice(*d1('a', 'c'))]
http://docs.scipy.org/doc/numpy/user/basics.subclassing.html describes subclassing ndarray, with a simple example of adding an attribute (which could be dimnames). Presumably extending the indexing to use that attribute shouldn't be hard.
Inspired by the use of __getitem__ in numpy/index_tricks, I've generalized the indexing:
class DimNames(object):
    def __init__(self, dimnames):
        self.dd = [{n: i for i, n in enumerate(names)} for names in dimnames]

    def __getitem__(self, key):
        if isinstance(key, tuple):
            return tuple(self.parse_key(k, self.dd[i]) for i, k in enumerate(key))
        else:
            return self.parse_key(key, self.dd[0])

    def parse_key(self, key, dd):
        if key is None:
            return key
        if isinstance(key, int):
            return key
        if isinstance(key, str):
            return dd[key]
        if isinstance(key, tuple):
            return tuple(self.parse_key(k, dd) for k in key)
        if isinstance(key, list):
            return [self.parse_key(k, dd) for k in key]
        if isinstance(key, slice):
            return slice(self.parse_key(key.start, dd),
                         self.parse_key(key.stop, dd),
                         self.parse_key(key.step, dd))
        raise KeyError(key)
dd = DimNames([['x','y','z'], ['a','b','c']])
print A[dd['x']] # A[0]
print A[dd['x','c']] # A[0,2]
print A[dd['x':'z':2]] # A[0:2:2]
print A[dd[['x','z'],:]] # A[[0,2],:]
print A[dd[['x','y'],'b':]] # A[[0,1], 1:]
print A[dd[:'z', :2]] # A[:2,:2]
I suppose further steps would be to subclass A, add dd as an attribute, and change its __getitem__, simplifying the notation to A[['x','z'],'b':].
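
A rough sketch of that subclassing step (my own illustration, ignoring the bookkeeping needed when indexing reduces dimensions):

import numpy as np

class NamedArray(np.ndarray):
    def __new__(cls, arr, dimnames):
        obj = np.asarray(arr).view(cls)
        obj.dd = DimNames(dimnames)
        return obj

    def __array_finalize__(self, obj):
        # propagate the name table to views created by slicing
        self.dd = getattr(obj, 'dd', None)

    def __getitem__(self, key):
        if self.dd is not None:
            key = self.dd[key]  # translate names to indices first
        return super(NamedArray, self).__getitem__(key)

A = NamedArray(np.arange(9).reshape(3, 3), [['x', 'y', 'z'], ['a', 'b', 'c']])
print A[['x', 'z'], 'b':]   # same as A[[0, 2], 1:]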
I am trying to make a class that extends from list return a slice of itself instead of a plain list. The reason I want to do this is because I have many other methods to manipulate the instance of Alist.
I am running python 2.7.3
Say I have:
class B():
    def __init__(self, t, p):
        self.t = t
        self.p = p

class Alist(list):
    def __init__(self, a_list_of_times=[]):
        for a_time in a_list_of_times:
            self.append(a_time)

    def __getslice__(self, i, j):
        return super(Alist, self).__getslice__(i, j)

    def plot_me(self):
        pass

    # other code goes here!

alist1 = Alist()
for i in range(0, 1000000):
    alist1.append(B(i, i))  # yes, one million, very large list!

alist = alist1[1000:200000]   # will return a list!
alist2 = Alist(alist)         # will return an Alist instance
The problem is that remaking the entire list, as seen in making alist2, is VERY VERY SLOW (compared to the slice). What I want to do is simply change the class of alist (currently of type list) to Alist.
When I try:

alist.__class__ = Alist

I get:

TypeError: __class__ assignment: only for heap types
Which is very sad since I can do this for my own object types.
I understand that it is not standard, but it is done.
Reclassing an instance in Python.
Is there a way around this? Also, I have obviously simplified the problem; my objects are a bit more complex. Mainly what I am finding is that remaking the list into my Alist version is slow, and I need to do this operation a lot (unavoidably). Is there a way to remake the Alist, or a solution to speed this up?
In my version, I can do a 10,000-element slice in 0.07 seconds, but converting it to my version of Alist takes 3 seconds.
The UserList class (moved to the collections module in Python 3) is perfectly designed for this. It is a list by all other means, but it has a data attribute that you can store an underlying list in without copying.
from UserList import UserList

class Alist(UserList):
    def __init__(self, iterable, copy=True):
        if copy:
            super(Alist, self).__init__(iterable)
        else:
            self.data = iterable

    def plot_me(self):
        pass
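
For example, a sketch following the question's setup (the slice itself still has to be built once, but it is not copied a second time):

big = Alist(B(i, i) for i in range(1000000))
piece = Alist(big.data[1000:200000], copy=False)  # wraps the slice, no extra copy
piece.plot_me()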
I just learned yesterday from this site that I can:
class Seq(object):
    def __init__(self, seq):
        self.seq = seq

    def __repr__(self):
        return repr(self.seq)

    def __str__(self):
        return str(self.seq)

    def all(self):
        return Seq(self.seq[:])

    def head(self, count):
        return Seq(self.seq[:count])

    def tail(self, count):
        return Seq(self.seq[-count:])

    def odd(self):
        return Seq(self.seq[1::2])

    def even(self):
        return Seq(self.seq[::2])

    def reverse(self):
        return Seq(self.seq[::-1])
>>> s = Seq(range(0, 100))
>>> print s.head(10).odd().even().reverse()
[9, 5, 1]
I want to enumerate the possible combinations of those sequence methods as method chains on Seq, something like:

itertools.product([s.head, s.odd, s.even, s.reverse], repeat=4)
# not necessarily limited to just those 4 functions

How can I use itertools.product() to
1) generate a list of invokable function chains, like this:
foo = s.head().odd().even().reverse()
2) generate eval()-able chain strings that I can store in an ASCII file, eval() later, or use for logging purposes?
head() and tail() accept a parameter, while even() and odd() do not. The parameters of head() and tail() may come from lists:

head_lmt = [10, 20, 30]
tail_lmt = [30, 40, 50]

foo = s.head().odd().tail().reverse()
        ^---------------------------------- head_lmt: 10, 20 or 30
                     ^--------------------- tail_lmt: 30, 40 or 50

If my Q1 is possible, how can I fill those parameters into the invokable list and the eval()-able string, i.e. generate more specific invokable lists and eval()-able strings?
Thanks!
Note that something like s.head refers to a method which is "bound" to that specific instance of Seq, namely s. Something like Seq.head refers to an unbound method, so one can still pass in different instances of Seq.
From there it simply requires basic functional composition and string concatenation.
import itertools

def chain_method(from_method, to_method):
    def inner(arg):
        return to_method(from_method(arg))
    return inner

possible_funcs = []
log_strings = []

for possible_combo in itertools.product([Seq.head, Seq.odd, Seq.even, Seq.reverse], repeat=4):
    meta_method = possible_combo[0]
    for method in possible_combo[1:]:
        meta_method = chain_method(meta_method, method)

    log_string = []
    for method in possible_combo:
        log_string.extend(['.', method.__name__, '()'])

    possible_funcs.append(meta_method)
    log_strings.append("".join(log_string))
I'm not sure what you mean by the examples for the additional parameters, though. How do you intend to combine the different parameter values with the different combinations of functions?
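
One possible way, not from the answer above but a sketch using functools.partial, is to pre-bind each parameter value so that every candidate in the product is a one-argument callable:

import itertools
from functools import partial

head_lmt = [10, 20, 30]
tail_lmt = [30, 40, 50]

# one callable per (method, argument) pair; odd/even/reverse need no argument
heads = [partial(Seq.head, count=c) for c in head_lmt]
tails = [partial(Seq.tail, count=c) for c in tail_lmt]

for head_f, tail_f in itertools.product(heads, tails):
    foo = chain_method(chain_method(chain_method(head_f, Seq.odd), tail_f), Seq.reverse)
    print foo(Seq(range(100)))  # e.g. head(10).odd().tail(30).reverse()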