Class method return iterator - python

I implemented an iterator class as follows:
import numpy as np
import time

class Data:
    def __init__(self, filepath):
        # Computationally expensive
        print("Computationally expensive")
        time.sleep(10)
        print("Done!")

    def __iter__(self):
        return self

    def __next__(self):
        return np.zeros((2, 2)), np.zeros((2, 2))

count = 0
for batch_x, batch_y in Data("hello.csv"):
    print(batch_x, batch_y)
    count = count + 1
    if count > 5:
        break

count = 0
for batch_x, batch_y in Data("hello.csv"):
    print(batch_x, batch_y)
    count = count + 1
    if count > 5:
        break
However, the constructor is computationally expensive, and the for loop may run multiple times. In the code above, for example, the constructor is called twice (each for loop creates a new Data object).
How do I separate the constructor from the iterator? I am hoping to end up with code like the following, where the constructor is called only once:
data = Data(filepath)
for batch_x, batch_y in data.get_iterator():
    print(batch_x, batch_y)
for batch_x, batch_y in data.get_iterator():
    print(batch_x, batch_y)

You can just iterate over an iterable object directly; for..in doesn't require anything else:
data = Data(filepath)
for batch_x, batch_y in data:
    print(batch_x, batch_y)
for batch_x, batch_y in data:
    print(batch_x, batch_y)
That said, depending on how you implement __iter__(), this could be buggy.
E.g.:
Bad
class Data:
    def __init__(self, filepath):
        self._items = load_items(filepath)
        self._i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._i >= len(self._items):  # Or however you check if data is available
            raise StopIteration
        result = self._items[self._i]
        self._i += 1
        return result
Because then you couldn't iterate over the same object twice: after the first loop, self._i would still point past the end of the data.
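For example (a minimal demonstration, assuming load_items returns a non-empty list):

data = Data("hello.csv")
print(list(data))  # first pass: all the items
print(list(data))  # second pass: [] -- self._i still points past the end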
Good-ish
class Data:
    def __init__(self, filepath):
        self._items = load_items(filepath)

    def __iter__(self):
        self._i = 0
        return self

    def __next__(self):
        if self._i >= len(self._items):
            raise StopIteration
        result = self._items[self._i]
        self._i += 1
        return result
This resets the index every time you're about to iterate, fixing the above. However, it still won't work if you nest iteration over the same object.
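A sketch of the failure mode: nested loops share the single self._i, so the inner loop exhausts the state the outer loop depends on.

data = Data("hello.csv")
for x in data:      # iter() resets self._i to 0
    for y in data:  # iter() resets self._i again, then runs it to the end
        pass
    # the outer loop's next __next__() now sees self._i >= len(self._items),
    # so the outer loop stops after a single iteration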
Better
To fix that, keep the iteration state in a separate iterator object:
class Data:
    class Iter:
        def __init__(self, data):
            self._data = data
            self._i = 0

        def __next__(self):
            if self._i >= len(self._data._items):  # check for available data
                raise StopIteration
            result = self._data._items[self._i]
            self._i += 1
            return result

    def __init__(self, filepath):
        self._items = load_items(filepath)

    def __iter__(self):
        return self.Iter(self)
This is the most flexible approach, but it's unnecessarily verbose if you can use either of the approaches below.
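For example, nested loops now work, because each for statement gets its own Data.Iter with an independent index (a quick sketch):

data = Data("hello.csv")
for x in data:       # one Iter instance for the outer loop
    for y in data:   # a second, independent Iter instance
        print(x, y)  # visits the full cross product, as expected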
Simple, using yield
If you use Python's generators, the language will take care of keeping track of iteration state for you, and it should do so correctly even when nesting loops:
class Data:
    def __init__(self, filepath):
        self._items = load_items(filepath)

    def __iter__(self):
        for it in self._items:  # Or whatever is appropriate
            yield it
Simple, pass-through to underlying iterable
If the "computationally expensive" part is loading all the data into memory, you can just use the cached data directly.
class Data:
    def __init__(self, filepath):
        self._items = load_items(filepath)

    def __iter__(self):
        return iter(self._items)
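Since every call to iter(self._items) returns a fresh iterator over the cached list, repeated and nested loops over the same Data object both behave (a quick check, assuming load_items returns a list):

data = Data("hello.csv")
print(list(data))  # first pass over the cached items
print(list(data))  # second pass prints the same items again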

Instead of creating a new instance of Data each time, create a second class IterData whose __init__ runs a process that is not as computationally expensive as instantiating Data. Then add a classmethod to Data as an alternative constructor that returns an IterData:
class IterData:
    def __init__(self, filepath):
        # only pass the necessary data
        ...

    def __iter__(self):
        # implement iter here
        ...

class Data:
    def __init__(self, filepath):
        # computationally expensive
        ...

    @classmethod
    def new_iter(cls, filepath):
        return IterData(filepath)

results = Data.new_iter('path')
for batch_x, batch_y in results:
    pass
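A hypothetical sketch of what IterData might contain, assuming the expensive part of Data.__init__ is not needed for iteration (read_rows is an illustrative placeholder for a cheap row reader, not a real helper):

class IterData:
    def __init__(self, filepath):
        self._filepath = filepath  # keep only what iteration needs

    def __iter__(self):
        # read_rows is hypothetical; substitute your own cheap reader
        for batch_x, batch_y in read_rows(self._filepath):
            yield batch_x, batch_y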

Related

Adding class objects to Pytorch Dataloader: batch must contain tensors

I have a custom Pytorch dataset that returns a dictionary containing a class object "queries".
class QueryDataset(torch.utils.data.Dataset):
    def __init__(self, queries, values, targets):
        super(QueryDataset).__init__()
        self.queries = queries
        self.values = values
        self.targets = targets

    def __len__(self):
        return self.values.shape[0]

    def __getitem__(self, idx):
        sample = DeviceDict({'query': self.queries[idx],
                             "values": self.values[idx],
                             "targets": self.targets[idx]})
        return sample
The problem is that when I put the queries in a data loader I get default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'query.Query'>. Is there a way to have a class object in my data loader? It blows up at next(iterator) in the code below.
train_queries = QueryDataset(train_queries)
train_loader = torch.utils.data.DataLoader(train_queries,
                                           batch_size=10,
                                           shuffle=True,
                                           drop_last=False)
for i in range(epochs):
    iterator = iter(train_loader)
    for i in range(len(train_loader)):
        batch = next(iterator)
        out = model(batch)
        loss = criterion(out["pred"], batch["targets"])
        self.optimizer.zero_grad()
        loss.sum().backward()
        self.optimizer.step()
You need to define your own collate_fn in order to do this.
A sloppy approach, just to show how things work, would be something like this:
import torch

class DeviceDict:
    def __init__(self, data):
        self.data = data

    def print_data(self):
        print(self.data)

class QueryDataset(torch.utils.data.Dataset):
    def __init__(self, queries, values, targets):
        super(QueryDataset).__init__()
        self.queries = queries
        self.values = values
        self.targets = targets

    def __len__(self):
        return 5

    def __getitem__(self, idx):
        sample = {'query': self.queries[idx],
                  "values": self.values[idx],
                  "targets": self.targets[idx]}
        return sample

def custom_collate(batch):
    return DeviceDict(batch)

dt = QueryDataset("q", "v", "t")
dl = torch.utils.data.DataLoader(dt, batch_size=1, collate_fn=custom_collate)
t = next(iter(dl))
t.print_data()
Basically, collate_fn lets you implement custom batching or add support for custom data types, as explained in the link I provided earlier.
As you can see, this just shows the concept; you need to adapt it to your own needs.
For those curious, this is the DeviceDict and custom collate function that I used to get things to work.
class DeviceDict(dict):
    def __init__(self, *args):
        super(DeviceDict, self).__init__(*args)

    def to(self, device):
        dd = DeviceDict()
        for k, v in self.items():
            if torch.is_tensor(v):
                dd[k] = v.to(device)
            else:
                dd[k] = v
        return dd

def collate_helper(elems, key):
    if key == "query":
        return elems
    else:
        return torch.utils.data.dataloader.default_collate(elems)

def custom_collate(batch):
    elem = batch[0]
    return DeviceDict({key: collate_helper([d[key] for d in batch], key) for key in elem})
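Hypothetical usage, assuming dataset is the QueryDataset from the question and a CUDA device is available:

loader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True,
                                     collate_fn=custom_collate)
for batch in loader:
    batch = batch.to("cuda")  # DeviceDict.to moves only the tensor values
    # batch["query"] remains a plain Python list of Query objects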

Dataset.from_generator: TypeError: `generator` must be callable

I am currently using a generator to produce my training and validation datasets using tf.data.Dataset.from_generator. I have a class method that takes care of this for me:
def build_dataset(self, batch_size=16, shuffle=16, validation=None):
    train_dataset = tf.data.Dataset.from_generator(import_images(validation=validation),
                                                   (tf.float32, tf.float32))
    self.train_dataset = train_dataset.shuffle(shuffle).repeat(-1).batch(batch_size).prefetch(1)
    if validation is not None:
        val_dataset = tf.data.Dataset.from_generator(import_images(validation=validation),
                                                     (tf.float32, tf.float32))
        self.val_dataset = val_dataset.repeat(1).batch(batch_size).prefetch(1)
The problem is that calling import_images(validation=validation) creates a generator object, which is not what TensorFlow wants, and it gives me the error:
TypeError: `generator` must be callable.
Because I have to pass in validation to tell my generator whether to produce the training or the validation version, I am forced to create two versions of the same generator. It also doesn't let me pass in other arguments to control the percentage of training and validation examples, meaning the generator has to be static. Any suggestions?
I recently encountered a similar problem, but I'm a beginner, so I'm not sure if this will help.
Try adding a __call__ method to your class.
Below is the original class, which raises TypeError: `generator` must be callable.
class DataGen:
    def __init__(self, files, data_path):
        self.i = 0
        self.files = files
        self.data_path = data_path

    def __load__(self, files_name):
        data_path = os.path.join(self.data_path, files_name)
        arr_img, arr_mask = load_patch(data_path)
        return arr_img, arr_mask

    def getitem(self, index):
        _img, _mask = self.__load__(self.files[index])
        return _img, _mask

    def __iter__(self):
        return self

    def __next__(self):
        if self.i < len(self.files):
            img_arr, mask_arr = self.getitem(self.i)
            self.i += 1
        else:
            raise StopIteration()
        return img_arr, mask_arr
Then I revised the code as below and it worked for me.
class DataGen:
    def __init__(self, files, data_path):
        self.i = 0
        self.files = files
        self.data_path = data_path

    def __load__(self, files_name):
        data_path = os.path.join(self.data_path, files_name)
        arr_img, arr_mask = load_patch(data_path)
        return arr_img, arr_mask

    def getitem(self, index):
        _img, _mask = self.__load__(self.files[index])
        return _img, _mask

    def __iter__(self):
        return self

    def __next__(self):
        if self.i < len(self.files):
            img_arr, mask_arr = self.getitem(self.i)
            self.i += 1
        else:
            raise StopIteration()
        return img_arr, mask_arr

    def __call__(self):
        self.i = 0
        return self
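With __call__ defined, the instance itself is callable, so it can be passed to from_generator directly (a minimal sketch; files and data_path are placeholders):

gen = DataGen(files, data_path)
dataset = tf.data.Dataset.from_generator(gen, (tf.float32, tf.float32))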

Sharing resources between two classes

I have provided a working example where I have a dynamic array implemented as a custom type in Python 3. I wish to use a single instance of this dynamic array as a common resource for implementing, say, N stacks. How do I do that?
I imagine I would like to give each stack access to only a certain part of the DynamicArray by making demarcation points _start and _end. In order to have _start and _end for each stack, I would like to wrap them in a helper class _StackRecord. If I am successful in providing a modifiable view of DynamicArray, I want _StackRecord to do all the heavy lifting of popping and pushing, such that the stacks don't collide while the underlying DynamicArray expands/shrinks as needed. I know I am asking for too much, but I might learn some useless skills while I fail to do this.
Any suggestions/criticism towards modularity, maintainability and good practices are wholeheartedly welcome.
import ctypes

class DynamicArray:
    """Expandable array class similar to Python list"""

    def __init__(self, size=0):
        self._n = size
        self._capacity = size + 1
        self._A = self._make_low_level_array(self._capacity)

    def _make_low_level_array(self, capacity):
        return (capacity * ctypes.py_object)()

    # following two methods are needed for Python to implement __iter__
    def __len__(self):
        return self._n

    def __getitem__(self, index_key):
        if isinstance(index_key, slice):
            start, stop, step = index_key.indices(len(self))
            return [self._A[i] for i in range(start, stop, step)]
        elif isinstance(index_key, int):
            if 0 <= index_key < self._n:
                return self._A[index_key]
            else:
                raise IndexError("index out of bounds")
        elif isinstance(index_key, tuple):
            raise NotImplementedError('Tuple as index')
        else:
            raise TypeError('Invalid argument type: {}'.format(type(index_key)))

    def __setitem__(self, index_k, value):
        if 0 <= index_k < self._n:
            self._A[index_k] = value
        else:
            raise IndexError("index out of bounds")
###################################################################
class FixedMultiStack:
    class _StackRecord(DynamicArray):
        def __init__(self, array: DynamicArray, stack_number=0, size_of_each=10):
            self._stack = stack_number
            self._start = stack_number * size_of_each
            self._end = self._start + size_of_each
            # try commenting the following lines
            self._n = size_of_each
            self._A = DynamicArray(self._n)
            # If I have to use self._A then I would like it to point
            # to array[self._start:self._end]
            for i in range(self._start, self._end):
                array[i] = i
            for i in range(self._n):
                self._A[i] = array[self._start + i]

    def __init__(self, numStack=1, sizeEach=10):
        self._stacks = []
        self._items = DynamicArray(numStack * sizeEach)
        for i in range(numStack):
            self._stacks.append(self._StackRecord(self._items, i, sizeEach))

    def __getitem__(self, stack_number):
        return self._stacks[stack_number]

if __name__ == "__main__":
    fms = FixedMultiStack(3, 10)
    print(list(fms[0]))
    print(list(fms[1]))
    print(list(fms[2]))
    print(list(fms._items))
Issues
I am doing the wasteful act of making a local copy called self._A. How do I avoid that? Why can't I just work on the global dynamic array passed to my local record keeper _StackRecord?
What do I expect?
fms = FixedMultiStack(3, 10): a fixed multi-stack packing 3 stacks of size 10 each, such that:
I would like, if self._A is necessary, the local self._A to refer to the part of the DynamicArray that corresponds to the given stack number.
So that print(list(fms[n])) gives me the contents of the nth stack.
While print(list(fms._items)) should give me the combined state of all the stacks. Yikes! print(list(fms._items)) is ugly. How about print(list(fms))?
I should be able to write something like self._items[n].push(val) and self._items[n].pop() to push onto and pop from the nth stack.
You can use memoryview to create the different views over the entire array:
class FixedMultiStack:
    def __init__(self, m, n):
        self.data = bytearray(m * n)
        view = memoryview(self.data)
        self.stacks = [view[i * n:(i + 1) * n] for i in range(m)]

    def __getitem__(self, index):
        return self.stacks[index]
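Each stacks[i] is a writable slice over the shared buffer, so writes through a view land in the single underlying bytearray (a quick sketch):

fms = FixedMultiStack(3, 10)
fms[0][0] = 65          # write into stack 0
fms[2][9] = 66          # write into stack 2
print(bytes(fms.data))  # both bytes show up in the one shared buffer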

TypeError: object takes no parameters

I'm trying to write code that uses the __iter__() method as a generator, but I am getting an error saying:
TypeError: object() takes no parameters.
Additionally, I am unsure whether my yield should live inside try: or within the main() function.
I am fairly new to Python and coding, so any suggestions and advice would be greatly appreciated so that I can learn. Thanks!
class Counter(object):
    def __init__(self, filename, characters):
        self._characters = characters
        self.index = -1
        self.list = []
        f = open(filename, 'r')
        for word in f.read().split():
            n = word.strip('!?.,;:()$%')
            n_r = n.rstrip()
            if len(n) == self._characters:
                self.list.append(n)

    def __iter(self):
        return self

    def next(self):
        try:
            self.index += 1
            yield self.list[self.index]
        except IndexError:
            raise StopIteration
            f.close()

if __name__ == "__main__":
    for word in Counter('agency.txt', 11):
        print "%s' " % word
Use yield in the __iter__ method:
class A(object):
    def __init__(self, count):
        self.count = count

    def __iter__(self):
        for i in range(self.count):
            yield i

for i in A(10):
    print i
In your case, __iter__ might look something like this:
def __iter__(self):
    for i in self.list:
        yield i
You mistyped the declaration of the __init__ method; you typed:
def __init
instead of:
def __init__
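A minimal reproduction of that error (Python 2, to match the question; the trailing underscores are missing, so the arguments go to object's default constructor instead):

class Broken(object):
    def __init(self, filename):  # typo: should be __init__
        self.filename = filename

Broken('agency.txt')  # TypeError: object() takes no parameters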

Python Printing a Deque

I have an entire Deque Array class that looks like this:
from collections import deque
import ctypes

class dequeArray:
    DEFAULT_CAPACITY = 10  # moderate capacity for all new queues

    def __init__(self):
        self.capacity = 5
        capacity = self.capacity
        self._data = self._make_array(self.capacity)
        self._size = 0
        self._front = 0

    def __len__(self):
        return self._size

    def __getitem__(self, k):  # Return element at index k
        if not 0 <= k < self._size:
            raise IndexError('invalid index')
        return self._data[k]

    def isEmpty(self):
        if self._data == 0:
            return False
        else:
            return True

    def append(self, item):  # add an element to the back of the queue
        if self._size == self.capacity:
            self._data.pop(0)
        else:
            avail = (self._front + self._size) % len(self._data)
            self._data[avail] = item
            self._size += 1

    # def _resize(self, c):
    #     B = self._make_array(c)
    #     for k in range(self._size):
    #         B[k] = self._A[k]
    #     self._data = B
    #     self.capacity = capacity

    def _make_array(self, c):
        capacity = self.capacity
        return (capacity * ctypes.py_object)()

    def removeFirst(self):
        if self._size == self.capacity:
            self._data.pop(0)
        else:
            answer = self._data[self._front]
            self._data[self._front] = None
            self._front = (self._front + 1) % len(self._data)
            self._size -= 1
            print(answer)

    def removeLast(self):
        return self._data.popleft()

    def __str__(self):
        return str(self._data)
and when I try to print the deque in main it prints out something like this,
<bound method dequeArray.__str__ of <__main__.dequeArray object at 0x1053aec88>>
when it should be printing the entire array. I think I need to use the str function, and I tried adding
def __str__(self):
    return str(self._data)
and that failed to give me the output. I also tried just
def __str__(self):
    return str(d)
d being the deque array, but I am still not having any success. How do I get it to print correctly?
You should call the str function on each element of the array that is not NULL; this can be done with the following __str__ method:
def __str__(self):
    contents = ", ".join(map(str, self._data[:self._size]))
    return "dequeArray[{}]".format(contents)
What I get when I try q = dequeArray(); print(q) is <__main__.py_object_Array_5 object at 0x006188A0>, which makes sense. If you want it list-like, use something like this (print uses the __str__ method implicitly):
def __str__(self):
    values = []
    for i in range(5):
        try:
            values.append(self._data[i])
        except ValueError:
            # accessing a ctypes array element before it has been assigned
            # raises this exception
            values.append('NULL (never used)')
    return repr(values)
Also, several things about the code:
from collections import deque
This import is never used and should be removed.
DEFAULT_CAPACITY = 10
is never used. Consider using it in __init__:
def __init__(self, capacity=None):
    self.capacity = capacity or self.DEFAULT_CAPACITY
This variable inside __init__ is never used and should be removed:
capacity = self.capacity
def _make_array(self, c):
    capacity = self.capacity
    return (capacity * ctypes.py_object)()
Though this is valid code, you're doing it wrong unless you're absolutely required to do it in your assignment. ctypes shouldn't be used like this; Python is a language with automated memory management. Just returning [] would be fine. And yes, the variable c is never used and should be removed from the signature.
The check
if self._data == 0
in isEmpty always evaluates to False, because you're comparing a ctypes object with zero, and a ctypes object is definitely not zero.
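A minimal fix for isEmpty is to test the element count the class already tracks (a sketch; _size is updated by append and removeFirst):

def isEmpty(self):
    return self._size == 0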
