I have very large files containing 2d arrays of positive integers
Each file contains a matrix
I would like to process them without reading the files into memory. Luckily I only need to look at the values from left to right in the input file. I was hoping to be able to mmap each file so I can process them as if they were in memory, but without actually reading the files into memory.
Example of smaller version:
[[2, 2, 6, 10, 2, 6, 7, 15, 14, 10, 17, 14, 7, 14, 15, 7, 17],
[3, 3, 7, 11, 3, 7, 0, 11, 7, 16, 0, 17, 17, 7, 16, 0, 0],
[4, 4, 8, 7, 4, 13, 0, 0, 15, 7, 8, 7, 0, 7, 0, 15, 13],
[5, 5, 9, 12, 5, 14, 7, 13, 9, 14, 16, 12, 13, 14, 7, 16, 7]]
Is it possible to mmap such a file so I can then process the np.int64 values with
for i in range(rownumber):
    for j in range(rowlength):
        process(M[i, j])
To be clear, I don't want ever to have all my input file in memory as it won't fit.
Updated Answer
On the basis of your comments and clarifications, it appears you actually have a text file with a bunch of square brackets in it that is around 4 lines long with 1,000,000,000 ASCII integers per line separated by commas. Not a very efficient format! I would suggest you simply pre-process the file to remove all square brackets, linefeeds, and spaces and convert the commas to newlines so that you get one value per line which you can easily deal with.
Using the tr command to transliterate, that would be this:
# Delete all square brackets, newlines and spaces, change commas into newlines
tr -d '[] \n' < YourFile.txt | tr , '\n' > preprocessed.txt
Your file then looks like this and you can readily process one value at a time in Python.
2
2
6
10
2
6
...
...
If you are on Windows, the tr tool is available in GNUWin32, in the Windows Subsystem for Linux, and in Git Bash.
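For example, a minimal per-value processing loop over that file could look like the sketch below, where process() stands for whatever work you need to do (it is just the placeholder from your question):

# Stream the preprocessed file; only one line is held in memory at a time
with open('preprocessed.txt') as f:
    for line in f:
        value = int(line)
        process(value)   # placeholder for your own processing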
You can go further still and make a file that you can memmap() as in the second part of my answer; you could then seek to any value in the file at random. So, taking the preprocessed.txt created above, you can make a binary version like this:
import struct
# Make a binary, memmap-able version
with open('preprocessed.txt', 'r') as ifile, open('preprocessed.bin', 'wb') as ofile:
    for line in ifile:
        ofile.write(struct.pack('q', int(line)))
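You could then memory-map that binary file just like in the original answer below. A sketch, assuming the 4 x 1,000,000,000 shape from your description (adjust to your real dimensions):

import numpy as np

# Memory-map the binary file; values are read in on demand, not all at once
M = np.memmap('preprocessed.bin', dtype=np.int64, mode='r', shape=(4, 1_000_000_000))
print(M[2, 5])   # random access to any element without loading the file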
Original Answer
You can do that like this. The first part is just setup:
#!/usr/bin/env python3
import numpy as np
# Create 2,4 Numpy array of int64
a = np.arange(8, dtype=np.int64).reshape(2,4)
# Write to file as binary
a.tofile('a.dat')
Now check the file by hex-dumping it in the shell:
xxd a.dat
00000000: 0000 0000 0000 0000 0100 0000 0000 0000 ................
00000010: 0200 0000 0000 0000 0300 0000 0000 0000 ................
00000020: 0400 0000 0000 0000 0500 0000 0000 0000 ................
00000030: 0600 0000 0000 0000 0700 0000 0000 0000 ................
Now that we are all set up, let's memmap() the file:
# Memmap file and access values via 'mm'
mm = np.memmap('a.dat', dtype=np.int64, mode='r', shape=(2,4))
print(mm[1,2]) # prints 6
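Since mm behaves like an ordinary 2-D array, the left-to-right scan from your question works on it directly, with pages read in on demand rather than all at once; process() is just the placeholder from the question:

rows, cols = mm.shape
for i in range(rows):
    for j in range(cols):
        process(mm[i, j])   # placeholder for your own processing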
The primary problem is that the file is too large, and it doesn't seem to be split into lines either. (For reference, array.txt is the example you provided and arr_map.dat is an empty file.)
import re
import numpy as np

N = [str(i) for i in range(10)]

arrayfile = 'array.txt'
mmapfile = 'arr_map.dat'
R = 4
C = 17
CHUNK = 20

def read_by_chunk(file, chunk_size=CHUNK):
    return file.read(chunk_size)

fp = np.memmap(mmapfile, dtype=np.uint8, mode='w+', shape=(R, C))

with open(arrayfile, 'r') as f:
    curr_row = curr_col = 0
    while True:
        data = read_by_chunk(f)
        if not data:
            break
        # Make sure that chunk reading does not break a number
        while data[-1] in N:
            data += read_by_chunk(f, 1)
        # Convert chunk into numpy array
        nums = np.array(re.findall(r'[0-9]+', data)).astype(np.uint8)
        num_len = len(nums)
        if num_len == 0:
            break
        # CASE 1: Number chunk can fit into current row
        if curr_col + num_len <= C:
            fp[curr_row, curr_col : curr_col + num_len] = nums
            curr_col = curr_col + num_len
        # CASE 2: Number chunk has to be split into current and next row
        else:
            col_remaining = C - curr_col
            fp[curr_row, curr_col : C] = nums[:col_remaining]  # Fill in row i
            curr_row, curr_col = curr_row + 1, 0  # Move to row i+1 and fill the rest
            fp[curr_row, :num_len - col_remaining] = nums[col_remaining:]
            curr_col = num_len - col_remaining
        if curr_col >= C:
            curr_col = curr_col % C
            curr_row += 1
        # print('\n--debug--\n', fp, '\n--debug--\n')
Basically, read small parts of the array file at a time (making sure not to break a number across chunks), extract the numbers from the surrounding junk characters (commas, brackets, etc.) with a regex, and then insert the numbers into the memory map.
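Once the map has been filled you can flush it and reopen it read-only, then scan it without loading everything at once. A small sketch, reusing fp, R, C and mmapfile from the code above:

fp.flush()   # make sure everything written above reaches the file

# Reopen the same file read-only and scan it left to right
out = np.memmap(mmapfile, dtype=np.uint8, mode='r', shape=(R, C))
for i in range(R):
    for j in range(C):
        print(out[i, j], end=' ')
    print()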
The situation you describe seems to be more suitable for a generator that fetches the next integer, or the next row from the file and allows you to process that.
def sanify(s):
    while s.startswith('['):
        s = s[1:]
    while s.endswith(']'):
        s = s[:-1]
    return int(s)

def get_numbers(file_obj):
    file_obj.seek(0)
    i = j = 0
    for line in file_obj:
        for item in line.split(', '):
            if item and not item.isspace():
                yield sanify(item), i, j
                j += 1
        i += 1
        j = 0
This ensures only one line at a time ever resides in memory.
This can be used like:
import io
s = '''[[2, 2, 6, 10, 2, 6, 7, 15, 14, 10, 17, 14, 7, 14, 15, 7, 17],
[3, 3, 7, 11, 3, 7, 0, 11, 7, 16, 0, 17, 17, 7, 16, 0, 0],
[4, 4, 8, 7, 4, 13, 0, 0, 15, 7, 8, 7, 0, 7, 0, 15, 13],
[5, 5, 9, 12, 5, 14, 7, 13, 9, 14, 16, 12, 13, 14, 7, 16, 7]]'''
items = get_numbers(io.StringIO(s))
for item, i, j in items:
    print(item, i, j)
If you really want to be able to access an arbitrary element of the matrix, you could adapt the above logic into a class implementing __getitem__ and you would only need to keep track of the position of the beginning of each line.
In code, this would look like:
class MatrixData(object):
    def __init__(self, file_obj):
        self._file_obj = file_obj
        self._line_offsets = list(self._get_line_offsets(file_obj))[:-1]
        file_obj.seek(0)
        row = list(self._read_row(file_obj.readline()))
        self.shape = len(self._line_offsets), len(row)
        self.length = self.shape[0] * self.shape[1]

    def __len__(self):
        return self.length

    def __iter__(self):
        self._file_obj.seek(0)
        i = j = 0
        for line in self._file_obj:
            for item in self._read_row(line):
                yield item, i, j
                j += 1
            i += 1
            j = 0

    def __getitem__(self, indices):
        i, j = indices
        self._file_obj.seek(self._line_offsets[i])
        line = self._file_obj.readline()
        row = list(self._read_row(line))
        return row[j]

    @staticmethod
    def _get_line_offsets(file_obj):
        file_obj.seek(0)
        yield file_obj.tell()
        for line in file_obj:
            yield file_obj.tell()

    @staticmethod
    def _read_row(line):
        for item in line.split(', '):
            if item and not item.isspace():
                yield MatrixData._sanify(item)

    @staticmethod
    def _sanify(item, dtype=int):
        while item.startswith('['):
            item = item[1:]
        while item.endswith(']'):
            item = item[:-1]
        return dtype(item)
To be used as:
m = MatrixData(io.StringIO(s))
# get total number of elements
len(m)
# get number of row and col
m.shape
# access a specific element
m[3, 12]
# iterate through
for x, i, j in m:
    ...
This seems to be exactly what the mmap module does in Python. See: https://docs.python.org/3/library/mmap.html
Example from documentation
import mmap
# write a simple example file
with open("hello.txt", "wb") as f:
f.write(b"Hello Python!\n")
with open("hello.txt", "r+b") as f:
# memory-map the file, size 0 means whole file
mm = mmap.mmap(f.fileno(), 0)
# read content via standard file methods
print(mm.readline()) # prints b"Hello Python!\n"
# read content via slice notation
print(mm[:5]) # prints b"Hello"
# update content using slice notation;
# note that new content must have same size
mm[6:] = b" world!\n"
# ... and read again using standard file methods
mm.seek(0)
print(mm.readline()) # prints b"Hello world!\n"
# close the map
mm.close()
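Applied to the matrix question, the same pattern can walk a preprocessed one-value-per-line file without loading it. A rough sketch (numbers.txt is a hypothetical file with one ASCII integer per line):

import mmap

# Iterate a large one-value-per-line file via mmap, one line at a time
with open('numbers.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for line in iter(mm.readline, b''):
        value = int(line)
        # ... process value here ...
    mm.close()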
It depends on the operations you want to perform on your input matrix. If they are matrix operations, you can usually work with a partial matrix: most of the time you can process small batches of the input file as partial matrices, which makes the processing very efficient. You just have to design your algorithm to read the input in pieces, process each piece, and cache the results. For some operations you may only need to decide on the best representation of your input matrix (i.e. row major or column major).
The main advantage of the partial-matrix approach is that you can apply parallel-processing techniques to process n partial matrices at each iteration, for example on a CUDA GPU. If you are familiar with C or C++, using the Python C API may improve the running time of the partial-matrix operations a lot, but even plain Python is not much worse, because you only need to process each partial matrix with NumPy.
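As a rough illustration of the partial-matrix idea, here is a minimal sketch that assumes the matrix has already been converted to a flat binary file of int64 values (as in the earlier answer); the file name, dimensions and block size are made up for the example:

import numpy as np

ROWS, COLS = 4, 1_000_000_000   # example dimensions from the question
BLOCK_COLS = 1_000_000          # how many columns to keep in memory at once

# Memory-map the binary matrix, then walk it in column blocks so only one
# small partial matrix is resident at a time
M = np.memmap('preprocessed.bin', dtype=np.int64, mode='r', shape=(ROWS, COLS))
for start in range(0, COLS, BLOCK_COLS):
    block = np.array(M[:, start:start + BLOCK_COLS])   # copy out the partial matrix
    # ... run your (possibly parallel) computation on `block` and cache the result ...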
Related
So, I have a txt file with some integers which are between 0 and 50. I want to extract them and to use their values.
The txt file looks like:
1 2 40 23
2 34 12
3 12 1
I have tried something like:
with open(input_file, "r") as file:
lines = file.readlines()
for i in range(len(lines)):
l = lines[i].strip()
for c in range(1, len(l)-1):
if(l[c] >= '0' and l[c] <= '9' and (l[c+1] < '0' or l[c+1] > '9')):
# other code with those numbers
elif(l[c] >= '0' and l[c] <= '9' and (l[c+1] >= '0' and l[c+1] <= '9')):
# other code with those numbers
The problem is that while I extract the two-digit numbers, I also extract their individual digits as separate one-digit numbers.
Any solution?
Or this way:
import io

my_array = []
with io.open(inputfile, mode="r", encoding="utf-8") as f:
    for line in f:
        my_array = my_array + line.split()

results = list(map(int, my_array))  # convert to int
print(results)
Output:
[1, 2, 40, 23, 2, 34, 12, 3, 12, 1]
You can gather all the numbers in the file into a list like this:
import re
with open(input_file) as f:
    print(list(map(int, re.findall(r'\d+', f.read()))))
Output:
[1, 2, 40, 23, 2, 34, 12, 3, 12, 1]
Note:
Use of re may be unnecessary in the OP's case but is included here because it tolerates potential garbage in the input file.
I have a project wherein you need to read data from an Excel file. I use openpyxl to read the said file. I tried reading the data as strings first before converting them to integers; however, an error occurs because, I think, some cells contain several numbers separated by commas. I am trying to build a nested list, but I am still new to Python.
My code looks like this:
# storing S
S_follow = []

for row in range(2, max_row + 1):
    if sheet.cell(row, 3).value is not None:
        S_follow.append(sheet.cell(row, 3).value)

# to convert the list from string to int, nested list
for i in range(0, len(S_follow)):
    S_follow[i] = int(S_follow[i])

print(S_follow)
The data I am trying to read is:
['2,3', 4, '5,6', 8, 7, 9, 8, 9, 3, 11, 0]
hoping for your help
When you are about to convert the values to integers in the loop near the end of your script, you can check whether each value is an integer or a string. If it is a string, split it, convert the split values to integers, push them onto a temporary list (say, strVal), and then append that temp list to a new list (say, S_follow_int). If the value is not a string, just append it to S_follow_int as it is.
data = ['2,3', 4, '5,6', 8, 7, 9, 8, 9, 3, 11, 0]

S_follow = []
S_follow_int = []

for row in range(0, len(data)):
    if sheet.cell(row, 3).value is not None:
        S_follow.append(sheet.cell(row, 3).value)

# to convert the list from string to int, nested list
for i in range(0, len(S_follow)):
    # if the current value is a string, split it, convert the values to integers,
    # put them in a temp list called strVal and then append it to S_follow_int
    if type(S_follow[i]) is str:
        x = S_follow[i].split(',')
        strVal = []
        for y in x:
            strVal.append(int(y))
        S_follow_int.append(strVal)
    # else if it is already an integer, just append it to S_follow_int without doing anything
    else:
        S_follow_int.append(S_follow[i])

print(S_follow_int)
However, I would recommend checking the datatype (str/int) of each value in the initial loop that retrieves the data from the Excel file, rather than pushing all values to S_follow and converting the types afterwards, like this:
# simplified representation of the logic you can use for your script
data = ['2,3', 4, '5,6', 8, 7, 9, 8, 9, 3, 11, 0]
x = []

for dat in data:
    if dat is not None:
        if type(dat) is str:
            y = dat.split(',')
            strVal = []
            for z in y:
                strVal.append(int(z))
            x.append(strVal)
        else:
            x.append(dat)

print(x)
S_follow = ['2,3', 4, '5,6', 8, 7, 9, 8, 9, 3, 11, 0]

for i in range(0, len(S_follow)):
    try:
        s = S_follow[i].split(',')
        del S_follow[i]
        for j in range(len(s)):
            s[j] = int(s[j])
        S_follow.insert(i, s)
    except AttributeError as e:
        S_follow[i] = int(S_follow[i])

print(S_follow)
I am trying to find the elements of an integer array or list that are unique and not divisible by any other element of the same array or list.
You can answer in any language, e.g. Python, Java, C, C++, etc.
I have tried this code in Python 3 and it works correctly, but I am looking for a better, more optimal solution in terms of time complexity.
Assuming the array or list A is already sorted and contains unique elements:
A = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
i, j = 0, 1

while i < len(A) - 1:
    while j < len(A):
        if A[j] % A[i] == 0:
            A.pop(j)
        else:
            j += 1
    i += 1
    j = i + 1
For the given array A = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] the answer would be ans = [2,3,5,7,11,13].
Another example: for A = [4,5,15,16,17,23,39] the answer would be ans = [4,5,17,23,39].
ans contains unique numbers, and an element i of the array is kept only if i % j != 0 for every other element j (i != j).
I think it's more natural to do it in reverse, by building a new list containing the answer instead of removing elements from the original list. If I'm thinking correctly, both approaches do the same number of mod operations, but you avoid the issue of removing an element from a list.
A = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
ans = []

for x in A:
    for y in ans:
        if x % y == 0:
            break
    else:
        ans.append(x)
Edit: Promoting the completion else.
This algorithm will perform much faster:
A = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]

if (A[-1] - A[0]) / A[0] > len(A) * 2:
    result = list()
    for v in A:
        for f in result:
            d, m = divmod(v, f)
            if m == 0:
                v = 0
                break
            if d < f:
                break
        if v:
            result.append(v)
else:
    retain = set(A)
    minMult = 1
    maxVal = A[-1]
    for v in A:
        if v not in retain:
            continue
        minMult = v * 2
        if minMult > maxVal:
            break
        if v * len(A) < maxVal:
            retain.difference_update([m for m in retain if m >= minMult and m % v == 0])
        else:
            retain.difference_update(range(minMult, maxVal + 1, v))
        if maxVal % v == 0:
            maxVal = max(retain)
    result = list(retain)

print(result)  # [2, 3, 5, 7, 11, 13]
In the spirit of the sieve of Eratosthenes, each number that is retained removes its multiples from the remaining eligible numbers. Depending on the magnitude of the highest value, it is sometimes more efficient to exclude multiples than to check for divisibility. The divisibility check takes several times longer for an equivalent number of factors to check.
At some point, when the data is widely spread out, assembling the result instead of removing multiples becomes faster (this last addition was inspired by Imperishable Night's post).
TEST RESULTS
A = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] (100000 repetitions)
Original: 0.55 sec
New: 0.29 sec
A = list(range(2,5000))+[9697] (100 repetitions)
Original: 3.77 sec
New: 0.12 sec
A = list(range(1001,2000))+list(range(4000,6000))+[9697**2] (10 repetitions)
Original: 3.54 sec
New: 0.02 sec
I know that this is totally insane but I want to know what you think about this:
A = [4,5,15,16,17,23,39]
prova=[[x for x in A if x!=y and y%x==0] for y in A]
print([A[idx] for idx,x in enumerate(prova) if len(prova[idx])==0])
And I think it's still O(n^2).
If you care about speed more than algorithmic efficiency, numpy would be the package to use here in python:
import numpy as np
# Note: doesn't have to be sorted
a = [2, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 16, 29, 29]
a = np.unique(a)
result = a[np.all((a % a[:, None] + np.diag(a)), axis=0)]
# array([2, 3, 5, 7, 11, 13, 29])
This divides all elements by all other elements and stores the remainder in a matrix, checks which columns contain only non-0 values (other than the diagonal), and selects all elements corresponding to those columns.
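To see what the one-liner is doing, here is the same computation on a tiny input (using a separate array so the names above stay untouched):

small = np.array([2, 3, 4])
print(small % small[:, None])                    # [[0 1 0]
                                                 #  [2 0 1]
                                                 #  [2 3 0]]
print(small % small[:, None] + np.diag(small))   # [[2 1 0]
                                                 #  [2 3 1]
                                                 #  [2 3 4]]
# Column 2 (the element 4) still contains a 0, so 4 is divisible by another element
print(small[np.all(small % small[:, None] + np.diag(small), axis=0)])   # [2 3]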
This is O(n*M), where M is the maximum value of an integer in your list. The integers are all assumed to be non-negative. This also assumes your input list is sorted (I came to that assumption since all the lists you provided are sorted).
a = [4, 7, 7, 8]
# a = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
# a = [4, 5, 15, 16, 17, 23, 39]

M = max(a)
used = set()
final_list = []

for e in a:
    if e in used:
        continue
    else:
        used.add(e)
        for i in range(e, M + 1):
            if not (i % e):
                used.add(i)
        final_list.append(e)

print(final_list)
Maybe this can be optimized even further...
If the list is not sorted, then for the above method to work it must be sorted first. The time complexity then becomes O(n log n + Mn), which equals O(n log n) when n >> M.
NumPy has a repeat function that repeats each element of an array a given (per-element) number of times.
I want to implement a function that does a similar thing but repeats not individual elements, but variably sized blocks of consecutive elements. Essentially I want the following function:
import numpy as np

def repeat_blocks(a, sizes, repeats):
    b = []
    start = 0
    for i, size in enumerate(sizes):
        end = start + size
        b.extend([a[start:end]] * repeats[i])
        start = end
    return np.concatenate(b)
For example, given
a = np.arange(20)
sizes = np.array([3, 5, 2, 6, 4])
repeats = np.array([2, 3, 2, 1, 3])
then
repeat_blocks(a, sizes, repeats)
returns
array([ 0, 1, 2,
0, 1, 2,
3, 4, 5, 6, 7,
3, 4, 5, 6, 7,
3, 4, 5, 6, 7,
8, 9,
8, 9,
10, 11, 12, 13, 14, 15,
16, 17, 18, 19,
16, 17, 18, 19,
16, 17, 18, 19 ])
I want to push these loops into numpy in the name of performance. Is this possible? If so, how?
Here's one vectorized approach using cumsum -
# Get repeats for each group using group lengths/sizes
r1 = np.repeat(np.arange(len(sizes)), repeats)
# Get total size of output array, as needed to initialize output indexing array
N = (sizes*repeats).sum() # or np.dot(sizes, repeats)
# Initialize indexing array with ones as we need to setup incremental indexing
# within each group when cumulatively summed at the final stage.
# Two steps here:
# 1. Within each group, we have multiple sequences, so setup the offsetting
# at each sequence length by the seq. lengths preceding those.
id_ar = np.ones(N, dtype=int)
id_ar[0] = 0
insert_index = sizes[r1[:-1]].cumsum()
insert_val = (1-sizes)[r1[:-1]]
# 2. For each group, make sure the indexing starts from the next group's
# first element. So, simply assign 1s there.
insert_val[r1[1:] != r1[:-1]] = 1
# Assign index-offseting values
id_ar[insert_index] = insert_val
# Finally index into input array for the group repeated o/p
out = a[id_ar.cumsum()]
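As a quick sanity check (assuming a, sizes, repeats and the reference repeat_blocks function from the question are in scope), the vectorized result matches the loop version:

# Compare the cumsum-based output with the reference implementation from the question
assert np.array_equal(out, repeat_blocks(a, sizes, repeats))
print(out)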
This function is a great candidate to speed up using Numba:
import numba
import numpy as np

@numba.njit
def repeat_blocks_jit(a, sizes, repeats):
    out = np.empty((sizes * repeats).sum(), a.dtype)
    start = 0
    oi = 0
    for i, size in enumerate(sizes):
        end = start + size
        for rep in range(repeats[i]):
            oe = oi + size
            out[oi:oe] = a[start:end]
            oi = oe
        start = end
    return out
This is significantly faster than Divakar's pure NumPy solution, and a lot closer to your original code. I made no effort at all to optimize it. Note that np.dot() and np.repeat() can't be used here, but that doesn't matter when all the code gets compiled.
Plus, since it is njit, meaning "nopython" mode, you can even use @numba.njit(nogil=True) and get a multicore speedup if you have many of these calls to make.
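For completeness, a small usage sketch with the arrays from the question (the first call triggers JIT compilation; later calls run the compiled code):

a = np.arange(20)
sizes = np.array([3, 5, 2, 6, 4])
repeats = np.array([2, 3, 2, 1, 3])

print(repeat_blocks_jit(a, sizes, repeats))   # matches the expected output shown in the question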
I have a bytes object or bytearray object representing a packed stream of 11-bit integers. (Edit: Stream is 11-bit big-endian integers without padding.)
Is there a reasonably efficient way of copying this to a stream of 16-bit integers? Or any other integer type?
I know that ctypes supports bit fields but I am not sure whether this helps me here at all.
Can I maybe "abuse" some part of the standard library that already does such bit-fiddling for other purposes?
If I have to resort to Cython, is there a good implementation that can deal with variable bit-lengths? I.e. not only for 11-bit input but also 12, 13, etc.?
Edit: Pure python solution based on PM2 Ring's answer
from math import ceil, log

# Format codes for memoryview.cast, keyed by unpacked byte width
# (assumed mapping; adjust if your platform's struct sizes differ)
FORMAT_CODES = {1: 'B', 2: 'H', 4: 'I', 8: 'Q'}

def unpackIntegers(data, num_points, bit_len):
    """Unpacks an array of integers of arbitrary bit-length into a
    system-word aligned array of integers"""
    # TODO: deal with native integer types separately for speedups
    mask = (1 << bit_len) - 1
    unpacked_bit_len = 2 ** ceil(log(bit_len, 2))
    unpacked_byte_len = ceil(unpacked_bit_len / 8)
    unpacked_array = bytearray(num_points * unpacked_byte_len)
    unpacked = memoryview(unpacked_array).cast(
        FORMAT_CODES[unpacked_byte_len])
    num_blocks = num_points // 8
    # Note: zipping generators is faster than calculating offsets
    # from a block count
    for idx1_start, idx1_stop, idx2_start, idx2_stop in zip(
            range(0, num_blocks * bit_len, bit_len),
            range(bit_len, (num_blocks + 1) * bit_len, bit_len),
            range(7, num_points, 8),
            range(-1, num_points - 8, 8),
            ):
        n = int.from_bytes(data[idx1_start:idx1_stop], 'big')
        for i in range(idx2_start, idx2_stop, -1):
            unpacked[i] = n & mask
            n >>= bit_len
    # process left-over part (missing from PM2 Ring's answer)
    else:
        points_left = num_points % 8
        bits_left = points_left * bit_len
        bytes_left = len(data) - num_blocks * bit_len
        num_unused_bits = bytes_left * 8 - bits_left
        n = int.from_bytes(data[num_blocks * bit_len:], 'big')
        n >>= num_unused_bits
        for i in range(num_points - 1, num_points - points_left - 1, -1):
            unpacked[i] = n & mask
            n >>= bit_len
    return unpacked
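A quick usage sketch (here data would be a bytes object of packed 11-bit big-endian integers, e.g. the test bytes from the answer below, which encode the values 0 to 23):

unpacked = unpackIntegers(data, num_points=24, bit_len=11)
print(list(unpacked))   # [0, 1, 2, ..., 23]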
There may be a more efficient way to do this with a 3rd-party library, but here's one way to do it with standard Python.
The unpack generator iterates over its data argument in chunks; data can be any iterable that yields bytes. To unpack 11-bit data we read chunks of 11 bytes, combine those bytes into a single integer, and then slice that integer into 8 pieces, so each piece contains the data from the corresponding 11 source bits.
def unpack(data, bitlen):
    mask = (1 << bitlen) - 1
    for chunk in zip(*[iter(data)] * bitlen):
        n = int.from_bytes(chunk, 'big')
        a = []
        for i in range(8):
            a.append(n & mask)
            n >>= bitlen
        yield from reversed(a)

# Test

# 0 to 23 in 11 bit integers, packed into bytes
data = bytes([
    0, 0, 4, 1, 0, 48, 8, 1, 64, 48, 7,
    1, 0, 36, 5, 0, 176, 24, 3, 64, 112, 15,
    2, 0, 68, 9, 1, 48, 40, 5, 64, 176, 23,
])

print(list(unpack(data, 11)))
output
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
Note that if data does not contain a multiple of bitlen bytes then it will end in a partial chunk which will be ignored.