Can I treat a file as a list in python?

This is partly a question and partly me hoping I don't have to write a bunch of code to get the behavior I want. (Plus, if it already exists, it probably runs faster than what I would write anyway.) I have a number of large lists of numbers that cannot fit into memory, at least not all at the same time. That is fine, because I only need a small portion of each list at a time, and I know how to save the lists to files and read back the part I need. The problem is that my method is inefficient: it iterates through the file to reach the part I want. So I was wondering whether there is a library I'm not finding that lets me index a file as though it were a list, using the [] notation I'm familiar with. Since I'm writing the files myself, I can format them however I need to, but currently they contain nothing but the elements of the list, with \n as a delimiter between values.
To recap and make it more specific, this is what I'm looking for:
I want to use list indexing notation (including slicing into sub-lists and negative indexing) to access the contents of a list written to a file.
An accessed sub-list (e.g. f[1:3]) should come back as a Python list object in memory.
I would like to be able to assign to indices of the file (e.g. f[i] = x should write the value x to the file f at the location corresponding to index i).
To be honest, I don't expect this to exist, but you never know what you missed in your research, so I figured I'd ask. On a side note, if this doesn't exist, is it possible to overload the [] operator in Python?

If your data is purely numeric you could consider using numpy arrays, and storing the data in npy format. Once stored in this format, you could load the memory-mapped file as:
>>> import numpy as np
>>> X = np.load("some-file.npy", mmap_mode="r")
>>> X[1000:1003]
memmap([4, 5, 6])
This access reads directly from disk, without first loading the data that precedes the requested range.

You can actually do this by writing a simple class, I think:
class FileWrapper:
    def __init__(self, path, **kwargs):
        # 'r+' opens an existing file for both reading and writing.
        self._file = open(path, 'r+', **kwargs)

    def _do_single(self, where, s=None):
        # Read (s is None) or write one character at offset `where`.
        if where >= 0:
            self._seek(where)
        else:
            # Negative offsets count back from the end of the file.
            # (Python 3 only allows this on files opened in binary mode.)
            self._seek(where, 2)
        if s is None:
            return self._read(1)
        else:
            return self._write(s)

    def _do_slice_contiguous(self, start, end, s=None):
        if start is None:
            start = 0
        self._seek(start)
        if s is not None:
            return self._write(s)
        if end is None:
            # No stop given: read through to the end of the file.
            return self._read()
        return self._read(end - start)

    def _do_slice(self, where, s=None):
        # Non-contiguous slice: visit each offset in turn.
        if s is None:
            result = []
            for index in where:
                self._seek(index)
                result.append(self._read(1))
            return result
        else:
            for index, char in zip(where, s):
                self._seek(index)
                self._write(char)
            return len(s)

    def __getitem__(self, key):
        if isinstance(key, int):
            return self._do_single(key)
        elif isinstance(key, slice):
            if self._is_contiguous(key):
                return self._do_slice_contiguous(key.start, key.stop)
            else:
                return self._do_slice(self._process_slice(key))
        else:
            raise ValueError('File indices must be ints or slices.')

    def __setitem__(self, key, value):
        if isinstance(key, int):
            return self._do_single(key, value)
        elif isinstance(key, slice):
            if self._is_contiguous(key):
                return self._do_slice_contiguous(key.start, key.stop, value)
            else:
                where = self._process_slice(key)
                if len(where) == len(value):
                    return self._do_slice(where, value)
                else:
                    raise ValueError('Length of slice not equal to length of string to be written.')

    def __del__(self):
        self._file.close()

    def _is_contiguous(self, key):
        return key.step is None or key.step == 1

    def _process_slice(self, key):
        # Stepped slices need an explicit start and stop.
        return range(key.start, key.stop, key.step)

    def _read(self, size=-1):
        return self._file.read(size)

    def _seek(self, offset, whence=0):
        return self._file.seek(offset, whence)

    def _write(self, s):
        return self._file.write(s)
I'm sure many optimisations could be made, since I rushed through this, but it was fun to write.
This does not answer the question in full, because it supports random access to characters, as opposed to lines, which are at a higher level of abstraction and more complicated to handle (since they can be of variable length).
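One way around the variable-length problem, since the asker controls the file format, is to store every number as a fixed-width binary record, so that element i always starts at byte i * record_size. The following is only a sketch under that assumption: the FileList class, the write_list helper and the 8-byte little-endian float format are illustrative choices of mine, not something from the question.

```python
import os
import struct
import tempfile


class FileList:
    """List-like view of a binary file of fixed-width 64-bit floats."""

    _FMT = "<d"                     # assumed record format: little-endian double
    _SIZE = struct.calcsize(_FMT)   # 8 bytes per record

    def __init__(self, path):
        self._file = open(path, "r+b")

    def __len__(self):
        return os.fstat(self._file.fileno()).st_size // self._SIZE

    def _read_one(self, i):
        self._file.seek(i * self._SIZE)
        return struct.unpack(self._FMT, self._file.read(self._SIZE))[0]

    def _write_one(self, i, value):
        self._file.seek(i * self._SIZE)
        self._file.write(struct.pack(self._FMT, value))

    def _index(self, i):
        # Resolve a negative index and check bounds, as a list would.
        n = len(self)
        if i < 0:
            i += n
        if not 0 <= i < n:
            raise IndexError("file index out of range")
        return i

    def __getitem__(self, key):
        if isinstance(key, slice):
            # slice.indices() resolves None, negative values and steps.
            return [self._read_one(i) for i in range(*key.indices(len(self)))]
        return self._read_one(self._index(key))

    def __setitem__(self, key, value):
        if isinstance(key, slice):
            indices = range(*key.indices(len(self)))
            values = list(value)
            if len(values) != len(indices):
                raise ValueError("slice assignment must not change the length")
            for i, v in zip(indices, values):
                self._write_one(i, v)
        else:
            self._write_one(self._index(key), value)

    def close(self):
        self._file.close()


def write_list(path, values):
    """Hypothetical helper: dump an iterable of numbers as fixed-width records."""
    with open(path, "wb") as out:
        for v in values:
            out.write(struct.pack("<d", v))


path = os.path.join(tempfile.mkdtemp(), "numbers.bin")
write_list(path, range(10))
f = FileList(path)
print(f[3])      # 3.0
f[3] = 42.5      # overwrites the record in place on disk
print(f[2:5])    # [2.0, 42.5, 4.0]
print(f[-1])     # 9.0
```

Each access seeks straight to the records it needs, so nothing before them is read. The sketch reads one record per seek, which is simple but not fast for long slices, and it cannot grow the file.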

Related

using python 3 stacks to ensure symbols match in correct pairs and the types of symbols match as well

def spc(sym):
    stk1=myStack()
    stkall=myStack()
    for i in sym:
        if i not in stk1:
            stk1.push(i)
        else:
            stkall.push(i)
    for j in stk1:
        for k in stkall:
            if j==k:
                stk1.pop(j)
                stkall.pop(k)
            else:
                pass
    if len(stk1) == len(stkall):
        print("all symbols match")
    else:
        print(("these symbols, %s, have no matches")%(stkall))
The above code originally gave me this error:
"TypeError: argument of type 'myStack' is not iterable"
I fixed that with the answer from @SergeBallesta, and then edited the code to look as you see it now.
Now I'm getting this error:
"return self.container.pop(item) # pop from the container, this was fixed from the old version which was wrong
TypeError: 'str' object cannot be interpreted as an integer"
What I want to achieve is for parentheses and all other symbols to be properly balanced: not only should each opening symbol have a corresponding closing symbol, but the types of the symbols should match as well.
The code for my stack class is below. Please help me implement this using STACKS.
class myStack:
    def __init__(self):
        self.container = []
    def isEmpty(self):
        return self.size() == 0
    def push(self, item):
        self.container.append(item)
    def pop(self, item):
        return self.container.pop(item)
    def size(self):
        return len(self.container) # length of the container
    def __iter__(self):
        return iter(self.container)

These lines
if i not in stk1:
and
for j in stk1:
require myStack to be iterable. In Python, that means it must have an __iter__ method that returns an iterator over its elements. As you already have an internal container, the __iter__ method can be as simple as:
class myStack:
    ...
    def __iter__(self):
        return iter(self.container)

To do bracket validation using a stack, we only need one stack. As we come across opening brackets, we push them onto the stack. When we come across a closing bracket, we pop the top opening bracket off the stack, and compare the two. If they are the same bracket type, we continue, otherwise the string is invalid. If we ever try to pop an empty stack, the string is invalid. If we reach the end of the string without clearing the stack, the string is invalid.
opening = '[{<('
closing = ']}>)'
d = dict(zip(opening, closing))

def validate(s):
    stack = []
    for c in s:
        if c in opening:
            stack.append(c)
        elif c in closing:
            if not stack:
                # tried to pop empty stack
                return False
            popped = stack.pop()
            if d[popped] != c:
                # bracket mismatch
                return False
    return not stack
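Since the question asked for a solution built on a stack class, here is the same single-stack idea as a sketch that uses one. The Stack class and the is_balanced name are hypothetical; the important difference from the question's myStack is that pop() takes no argument and always removes the top item.

```python
class Stack:
    """Minimal stack; unlike the question's class, pop() takes no argument."""

    def __init__(self):
        self._items = []

    def is_empty(self):
        return not self._items

    def push(self, item):
        self._items.append(item)

    def pop(self):
        # Remove and return the most recently pushed item.
        return self._items.pop()


# Maps each opening symbol to the closing symbol it must be paired with.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def is_balanced(text):
    stack = Stack()
    for ch in text:
        if ch in PAIRS:
            stack.push(ch)
        elif ch in PAIRS.values():
            # A closer needs a matching opener on top of the stack.
            if stack.is_empty() or PAIRS[stack.pop()] != ch:
                return False
    # Any opener still on the stack was never closed.
    return stack.is_empty()


print(is_balanced("{[()]}"))  # True
print(is_balanced("([)]"))    # False
```

Characters that are not brackets are ignored, so the function can be run over ordinary text or source code.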

Which is the cleaner way to get a Python @property as a list with particular conditions?

Now I have the source code below:
class Stats(object):
    def __init__(self):
        self._pending = []
        self._done = []

    @property
    def pending(self):
        return self._pending
The way those lists are filled is not important for my question.
The situation is that I'm getting a sublist of these lists this way:
stats = Stats()
# code to fill the lists
stats.pending[2:10]
The problem here is that I expect to get as many elements as I requested.
In the example above I expect a sublist that contains 8 elements (10-2).
Of course, I'll actually get fewer than 8 elements if the list is shorter.
So, what I need is:
When the list has enough items, it returns the corresponding sublist.
When the list is shorter, it returns a sublist with the expected length, filled with the last elements of the original list and a default value (for example None) for the extra items.
This way, if I did:
pending_tasks = stats.pending[44:46]
and the pending list only contained 30 elements, it should return a list of two default elements, for example [None, None], instead of an empty list ([]), which is the default behaviour of lists.
I guess I already know how to do it inside a normal method/function, but I want to do it in the cleanest way, trying to follow the @property approach, if possible.
Thanks a lot!

This is not easy to do because the slicing operation is what you want to modify, and that happens after the original list has been returned by the property. It's not impossible though, you'll just need to wrap the regular list with another object that will take care of padding the slices for you. How easy or difficult that will be may depend on how much of the list interface you need your wrapper to implement. If you only need indexing and slicing, it's really easy:
from operator import getitem

class PadSlice(object):
    def __init__(self, lst, default_value=None):
        self.lst = lst
        self.default_value = default_value

    def __getitem__(self, index):
        item = getitem(self.lst, index)
        if isinstance(index, slice):
            expected_length = (index.stop - index.start) // (index.step or 1)
            if len(item) != expected_length:
                item.extend([self.default_value] * (expected_length - len(item)))
        return item
This code probably won't work right for negative step slices, or for slices that don't specify one of the end points (it does have logic to detect an omitted step, since that's common). If those cases matter to you, you could probably fix up the corner cases.
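As a rough sketch of how some of those corner cases could be handled, the wrapper below builds the padded result from the requested index range instead of computing a length. PaddedList is a hypothetical name, and the choices it makes are assumptions of mine: an omitted stop means "to the end of the list" (so nothing is padded), negative bounds fall back to ordinary slicing, and non-positive steps are rejected.

```python
class PaddedList:
    """Read-only wrapper whose slices are padded to the requested length."""

    def __init__(self, items, default=None):
        self._items = items
        self._default = default

    def __getitem__(self, index):
        if not isinstance(index, slice):
            return self._items[index]
        step = 1 if index.step is None else index.step
        if step <= 0:
            raise ValueError("this sketch only supports positive steps")
        start = 0 if index.start is None else index.start
        # An omitted stop means "to the end", so there is nothing to pad.
        stop = len(self._items) if index.stop is None else index.stop
        if start < 0 or stop < 0:
            # Negative bounds: fall back to ordinary, unpadded slicing.
            return self._items[index]
        n = len(self._items)
        # Walk the requested indices; anything past the end gets the default.
        return [self._items[i] if i < n else self._default
                for i in range(start, stop, step)]


pending = PaddedList(list(range(30)))
print(pending[44:46])  # [None, None]
print(pending[28:32])  # [28, 29, None, None]
```

In the Stats class from the question, the property would then return PaddedList(self._pending).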

This is not easy. How would the object (list) you return know how it will be sliced later? You could subclass list, however, and override __getitem__ and __getslice__ (Python2 only):
class L(list):
    def __getitem__(self, key):
        if isinstance(key, slice):
            return [list(self)[i] if 0 <= i < len(self) else None for i in xrange(key.start, key.stop, key.step or 1)]
        return list(self)[key]

    def __getslice__(self, i, j):
        return self.__getitem__(slice(i, j))
This will pad all slices with None, fully compatible with negative indexing and steps != 1. And in your property, return an L version of the actual list:
@property
def pending(self):
    return L(self._pending)

You can construct a new class, which is a subclass of list. Then you can overload the __getitem__ magic method to overload [] operator to the appropriate behavior. Consider this subclass of list called MyList:
class MyList(list):
    def __getitem__(self, index):
        """Modify index [] operator"""
        result = super(MyList, self).__getitem__(index)
        if isinstance(index, slice):
            # Get the requested sublist length.
            if index.step:  # step is None when it was omitted
                sublist_len = (index.stop - index.start) // index.step
            else:
                sublist_len = index.stop - index.start
            # If the slice ran past the end of the list, pad the result
            # to the requested length with the default value None.
            if sublist_len > len(result):
                result.extend([None] * (sublist_len - len(result)))
        return result
Then you can just change the pending method to return a MyList type instead of list.
class Stats(object):
    @property
    def pending(self):
        return MyList(self._pending)
Hopefully this helps.

Recursion: design a recursive function called replicate_recur which will receive two arguments:

The code below is a recursive function which takes two arguments and should return something like [5,5,5].
def recursive(times, data):
    if not isinstance(times,int):
        raise ValueError("times must be an int")
    if not (isinstance(data,int) or isinstance(data, str)):
        raise ValueError("data must be an int or a string")
    if times <= 0:
        return []
    return [data] + recursive(times, data - 1)
print(recursive(3, 5))
Why is the code throwing a recursion error?

Let's think about how we would repeat any data item N times recursively:
If times is 0 or less, we return an empty list, as per the requirements.
If times is greater than 0, we return a list that contains data once, followed by another times - 1 repetitions of data, built recursively.
Another requirement is to check the validity of the arguments and raise a ValueError if they are invalid. While this can be done in the same recursive function, it carries a performance hit, as we'd repeat the same validation times times. The textbook solution is to split the function in two: an "outer" function that handles the validation and an "inner" function that handles the recursive logic.
Put it all together, and you'll get something like this:
def replicate_recur(times, data):
    if not isinstance(times, int):
        raise ValueError("times must be an int")
    return real_replicate_recur(times, data)

def real_replicate_recur(times, data):
    if times <= 0:
        return []
    return [data] + real_replicate_recur(times - 1, data)

You can use a list to store the result of current recursive call.
def replicate_recur(times, data, ret=None):
    if ret is None:
        ret = []
    # Stop before appending, so times <= 0 returns an empty list.
    if times <= 0:
        return ret
    ret.append(data)
    return replicate_recur(times - 1, data, ret)

Your code is very nearly right. The problem is that the base case tests the times variable, while the recursive call decrements the data argument instead of the times argument, so times never reaches 0 and the recursion never stops.
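Applying that one-line fix to the function from the question gives the version below. It is the asker's own code with times decremented instead of data; the only other change is writing the two isinstance checks as a single call with a tuple.

```python
def recursive(times, data):
    if not isinstance(times, int):
        raise ValueError("times must be an int")
    if not isinstance(data, (int, str)):
        raise ValueError("data must be an int or a string")
    if times <= 0:
        return []
    # Decrement times, not data, so the base case is eventually reached.
    return [data] + recursive(times - 1, data)


print(recursive(3, 5))  # [5, 5, 5]
```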

Dynamic list that automatically expands

How can I make a Python equivalent of pdtolist from Pop-11?
Assume I have a generator called g that returns (say) integers one at a time. I'd like to construct a list a that grows automatically as I ask for values beyond the current end of the list. For example:
print a # => [ 0, 1, 2, g]
print a[0] # => 0
print a[1] # => 1
print a[2] # => 2
# (obvious enough up to here)
print a[6] # => 6
print a # => [ 0, 1, 2, 3, 4, 5, 6, g]
# list has automatically expanded
a = a[4:] # discard some previous values
print a # => [ 4, 5, 6, g]
print a[0] # => 4
Terminology - to anticipate a likely misunderstanding: a list is a "dynamic array" but that's not what I mean; I'd like a "dynamic list" in a more abstract sense.
To explain the motivation better, suppose you have 999999999 items to process. Trying to fit all those into memory (in a normal list) all at once would be a challenge. A generator solves that part of the problem by presenting them one at a time; each one created on demand or read individually from disk. But suppose during processing you want to refer to some recent values, not just the current one? You could remember the last (say) ten values in a separate list. But a dynamic list is better, as it remembers them automatically.

This might get you started:
class DynamicList(list):
    def __init__(self, gen):
        self._gen = gen

    def __getitem__(self, index):
        while index >= len(self):
            self.append(next(self._gen))
        return super(DynamicList, self).__getitem__(index)
You'll need to add some special handling for slices (currently, they just return a normal list, so you lose the dynamic behavior). Also, if you want the generator itself to be a list item, that'll add a bit of complexity.

I just answered another, similar question and decided to update my answer for you.
How's this?
class dynamic_list(list):
    def __init__(self,num_gen):
        self._num_gen = num_gen

    def __getitem__(self,index):
        if isinstance(index, int):
            self.expandfor(index)
            return super(dynamic_list,self).__getitem__(index)
        elif isinstance(index, slice):
            if index.stop<index.start:
                return super(dynamic_list,self).__getitem__(index)
            else:
                self.expandfor(index.stop if abs(index.stop)>abs(index.start) else index.start)
            return super(dynamic_list,self).__getitem__(index)

    def __setitem__(self,index,value):
        if isinstance(index, int):
            self.expandfor(index)
            return super(dynamic_list,self).__setitem__(index,value)
        elif isinstance(index, slice):
            if index.stop<index.start:
                return super(dynamic_list,self).__setitem__(index,value)
            else:
                self.expandfor(index.stop if abs(index.stop)>abs(index.start) else index.start)
            return super(dynamic_list,self).__setitem__(index,value)

    def expandfor(self,index):
        # Work out how many values to pull from the generator so that
        # `index` becomes a valid position in the list.
        rng = []
        if abs(index)>len(self)-1:
            if index<0:
                rng = range(abs(index)-len(self))
            else:
                rng = range(abs(index)-len(self)+1)
        for i in rng:
            self.append(next(self._num_gen))

Many thanks to all who contributed ideas! Here's what I have gathered together from all the responses. This retains most functionality from the normal list class, adding additional behaviours where necessary to meet additional requirements.
class DynamicList(list):
    def __init__(self, gen):
        self.gen = gen

    def __getitem__(self, index):
        while index >= len(self):
            self.append(next(self.gen))
        return super(DynamicList, self).__getitem__(index)

    def __getslice__(self, start, stop):
        # treat request for "last" item as "most recently fetched"
        if stop == 2147483647: stop = len(self)
        while stop > len(self):
            self.append(next(self.gen))
        return super(DynamicList, self).__getslice__(start, stop)

    def __iter__(self):
        return self

    def next(self):
        n = next(self.gen)
        self.append(n)
        return n

a = DynamicList(iter(xrange(10)))
Previously generated values can be accessed individually as items or slices. The recorded history expands as necessary if the requested item(s) are beyond the current end of the list. The entire recorded history can be accessed all at once, using print a, or assigned to a normal list using b = a[:]. A slice of the recorded history can be deleted using del a[0:4]. You can iterate over the whole list using for, deleting as you go, or whenever it suits. Should you reach the end of the generated values, StopIteration is raised.
Some awkwardness remains. Assignments like a = a[0:4] successfully truncate the history, but the resulting list no longer auto-expands. Instead use del a[0:4] to retain the automatic growth properties. Also, I'm not completely happy with having to recognise a magic value, 2147483647, representing the most recent item.
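In Python 3 the magic value is not needed: __getslice__ no longer exists, every slice goes through __getitem__, and an omitted stop arrives as None. The sketch below shows that variant. It is not the code above ported line for line: LazyList is a hypothetical name, and it wraps a plain list rather than subclassing list, which is a design choice of mine to keep the example short.

```python
import itertools


class LazyList:
    """Caches values from an iterator and fetches more on demand."""

    def __init__(self, iterable):
        self._source = iter(iterable)
        self._cache = []

    def _fill_to(self, size):
        # Pull values until the cache holds `size` items or the source ends.
        needed = size - len(self._cache)
        if needed > 0:
            self._cache.extend(itertools.islice(self._source, needed))

    def __getitem__(self, index):
        if isinstance(index, slice):
            if index.stop is not None and index.stop >= 0:
                self._fill_to(index.stop)
            # An omitted stop means "up to the most recently fetched value".
            return self._cache[index]
        if index >= 0:
            self._fill_to(index + 1)
        return self._cache[index]  # IndexError if the source ran dry

    def __delitem__(self, index):
        # Discard history; later values shift down, as in a normal list.
        del self._cache[index]

    def __len__(self):
        return len(self._cache)


a = LazyList(itertools.count())
print(a[2])    # 2
print(a[6])    # 6, the cache has grown to seven values
print(a[:])    # [0, 1, 2, 3, 4, 5, 6]
del a[0:4]     # discard some previous values, keep auto-expansion
print(a[0])    # 4
```

As with the class above, del a[0:4] trims the history without losing the automatic growth, whereas a = a[4:] would rebind a to an ordinary list.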

Thanks for this thread; it helped me solve my own problem. Mine was a bit simpler: I wanted a list that automatically extended if indexed past its current length --> allow reading and writing past current length. If reading past current length, return 0 values.
Maybe this helps someone:
class DynamicList(list):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def __getitem__(self, idx):
        self.expand(idx)
        return super().__getitem__(idx)

    def __setitem__(self, idx, val):
        self.expand(idx)
        return super().__setitem__(idx, val)

    def expand(self, idx):
        # Pad with zeros so that idx (an int or a slice) is within range.
        if isinstance(idx, int):
            idx += 1
        elif isinstance(idx, slice):
            # An omitted start or stop is None; treat it as 0 here.
            idx = max(idx.start or 0, idx.stop or 0)
        if idx > len(self):
            self.extend([0] * (idx - len(self)))

How do you make an object return a sorted array instead of an empty one in python?

I'm trying to create a library of some common algorithms so that people will be able to use them easily. I created an object called Compare, which has some methods that would be useful in these algorithms.
Code for Compare:
class Compare(list):
    def __init__(self,arr):
        self.arr = arr

    def __compare(self,u,v):
        # Compares one item of a Compare
        # object to another
        if u < v:
            return 1
        if u == v:
            return 0
        if u > v:
            return -1

    def __swap(self,arr,i,j):
        # Exchanges i and j
        temp = arr[i]
        arr[i] = arr[j]
        a[j] = temp

    def __determine(self,arr):
        # Determines if the array is sorted or not
        for i in range(0,len(array)):
            if self.__compare(arr[i], arr[i+1]) == -1:
                return False
        return True

    def __printout(self,arr):
        for i in range(0,len(array)):
            return arr[i] + '\n'

    def sorted(self):
        if self.__determine(arr):
            return True
        return False
Here's one of the algorithms that uses this class:
def SelectionSort(array):
    try:
        array = Compare(array)
        for ix in range(0, len(array)):
            m = ix
            j = ix+1
            for j in range(0,len(array)):
                if array.__compare(array[j], array[m]) == -1:
                    m = j
            array.__swap(arr, ix, m)
        return array
    except(TypeError) as error:
        print "Must insert array for sort to work."
The problem I'm having is that whenever I try to use this or any of the other algorithms, it returns an empty array instead of the sorted array. I'm not sure how to get the Compare object to return the sorted array.

I'm pretty sure this is what is happening. When you call:
array = Compare(array)
you overwrite the reference to the original list: array is now a reference to a Compare object. Replace array with array.arr (or give it a better name) and this should work, I think! :)
Remember that Python is dynamically typed, so your "array" variable is just a reference to some data. In this case, you are switching it from a reference to a list to a reference to a Compare object.
Think about:
>>> x = 1
>>> x
1
>>> x = 's'
>>> x
's'
And think about what happens to the 1 ;)

Your code has many problems, and some of them make it fail. For example:
In sorted you are using arr, which would have to be a global and doesn't exist, instead of self.arr.
In swap you also use a[j] = temp, but a is not defined anywhere in the method.
You are using two leading underscores for your method names. This triggers name mangling, so the calls from the function do not work the way you wrote them. You probably want a single underscore to indicate that these are private methods.
But the main problem is that Compare is not returning a list. For that you need:
class Compare(list):
    def __init__(self, arr):
        list.__init__(self, arr)
then:
>>> print Compare([1,2,3,4])
[1, 2, 3, 4]
This way, your methods should use self instead of self.arr, because your instance is a list (an instance of a subclass of list).
The following is your code modified to actually run. The only remaining problem is that your sorting algorithm is wrong, so it does not sort correctly. But you can take it from here, I suppose:
class Compare(list):
    def __init__(self, arr):
        list.__init__(self, arr)

    def _compare(self, u, v):
        # Compares one item of a Compare
        # object to another
        if u < v:
            return 1
        if u == v:
            return 0
        if u > v:
            return -1

    def _swap(self, i, j):
        # Exchanges i and j
        temp = self[i]
        self[i] = self[j]
        self[j] = temp

    def _determine(self):
        # Determines if the array is sorted or not
        # (stop one short of the end, since each step looks at i+1)
        for i in range(len(self) - 1):
            if self._compare(self[i], self[i+1]) == -1:
                return False
        return True

    def _printout(self):
        for i in self:
            return i + '\n'

    def sorted(self):
        if self._determine():
            return True
        return False

def SelectionSort(array):
    try:
        array = Compare(array)
        for ix in range(len(array)):
            m = ix
            j = ix + 1
            for j in range(len(array)):
                if array._compare(array[j], array[m]) == -1:
                    m = j
            array._swap(ix, m)
        return array
    except(TypeError) as error:
        print "Must insert array for sort to work."

You're not returning the array, you're returning a Compare wrapped around the array. If you intend Compare to be a proxy, the wrapping is incomplete, as you don't forward the standard container operations to the proxied array. In addition, you'll need to consistently use the Compare instance. Currently, you sometimes use the Compare and other times use the original sequence object, such as every place you pass the sequence to a method. Instead, use the Compare object within its own methods.
However, that's having Compare do two things: be an algorithm collection, and be a sequence. If you keep the Compare object separate and work on the list directly, you can switch out the algorithms easily. This is the more typical approach; list.sort works this way, taking a comparator as an argument. You'll also need to fix your implementation of Compare, which uses the wrong variable name in numerous places (array, when the local variable is named arr). If you want anyone to use your library, it's going to have to be much better designed.
As further reasons not to make Compare a sequence, consider what happens when you need to change comparison methods: you end up wrapping the Compare in another, making the wrapped Compare useless.
Consider the approach used in math: an order is a relationship defined on a set, not an intrinsic part of the set, and it especially isn't a part of sequences of items from the set. This reveals another conceptual error with your original approach: it couples an ordering (which is a set relationship) with operations on sequences of elements from the set. The two should be kept separate, so that you can use different comparisons with the sequence operations.
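To illustrate keeping the two separate, here is a sketch of a selection sort that works on a plain list and receives the ordering as an argument. The selection_sort name and its key and reverse parameters are my own choices, modelled on how Python 3's list.sort accepts a key function; this is not the asker's library.

```python
def selection_sort(items, key=lambda x: x, reverse=False):
    """Return a new sorted list; the ordering comes from `key`, not the data."""
    result = list(items)  # work on a copy, leave the input untouched
    for i in range(len(result) - 1):
        best = i
        # Find the element that belongs at position i among the rest.
        for j in range(i + 1, len(result)):
            a, b = key(result[j]), key(result[best])
            if (a > b) if reverse else (a < b):
                best = j
        result[i], result[best] = result[best], result[i]
    return result


print(selection_sort([3, 1, 2]))                    # [1, 2, 3]
print(selection_sort([3, 1, 2], reverse=True))      # [3, 2, 1]
print(selection_sort(["ccc", "a", "bb"], key=len))  # ['a', 'bb', 'ccc']
```

Switching to a different ordering means passing a different key; neither the list nor the algorithm has to be wrapped or changed.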
Off-Topic
There are a number of other mistakes of various types in the code. For example, in SelectionSort you assume that type errors must be due to a non-sequence being passed as array. Comparing instances of uncomparable types (such as 0 and 'd') will also result in a type error. For another example, Compare.sorted is useless; it's of the pattern:
if test:
    return True
return False
This is logically equivalent to:
return test
which means Compare.sorted is equivalent to Compare.__determine. Make the latter the former, as sorted is a more descriptive name. "determine" is too ambiguous; it begs the question of what's being determined.
You can get more code reviews at codereview.stackexchange.com.
