I'm trying to read a text file and collect its column values into lists. My text file looks like this:
40 10 5 5
30 20 10 0
30 30 10 5
and desired output is
(40,30,30),(10,20,30),(5,10,10),(5,0,5)
I tried this code
def contest(filename):
    contestFile = open(filename, 'r')
    contestFileLines = contestFile.readlines()
    startColumn = 0
    contestResult = []
    for x in contestFileLines:
        contestResult.append(x.split()[startColumn])
    contestFile.close()
    print(contestResult)

contest("testing.txt")
and its output is just
['40', '30', '30']
What should I do?
Try reading every line into a list, splitting on each space and mapping the values to int. Then you can use this answer (which Barmar suggested in the comments) to transpose the list of rows. Like this:
def cols(path):
    rows = []
    with open(path) as f:
        for line in f:
            rows.append(map(int, line.split(' ')))
    return list(map(list, zip(*rows)))

print(cols('test.txt'))  # => [[40, 30, 30], [10, 20, 30], [5, 10, 10], [5, 0, 5]]
Alternatively, if you need the inner sequences as tuples, just change this line:
return list(map(list, zip(*rows)))
to
return list(map(tuple, zip(*rows)))
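One caveat: line.split(' ') assumes the values are separated by exactly one space. If the file might contain runs of spaces or tabs, split with no argument instead:

rows.append(map(int, line.split()))  # split() handles any whitespace run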
I have very large files, each containing a 2D array (matrix) of positive integers.
I would like to process them without reading the files into memory. Luckily I only need to look at the values from left to right in the input file. I was hoping to be able to mmap each file so I can process it as if it were in memory, but without actually reading the whole file into memory.
Example of smaller version:
[[2, 2, 6, 10, 2, 6, 7, 15, 14, 10, 17, 14, 7, 14, 15, 7, 17],
[3, 3, 7, 11, 3, 7, 0, 11, 7, 16, 0, 17, 17, 7, 16, 0, 0],
[4, 4, 8, 7, 4, 13, 0, 0, 15, 7, 8, 7, 0, 7, 0, 15, 13],
[5, 5, 9, 12, 5, 14, 7, 13, 9, 14, 16, 12, 13, 14, 7, 16, 7]]
Is it possible to mmap such a file so I can then process the np.int64 values with
for i in range(rownumber):
    for j in range(rowlength):
        process(M[i, j])
To be clear, I don't want ever to have all my input file in memory as it won't fit.
Updated Answer
On the basis of your comments and clarifications, it appears you actually have a text file with a bunch of square brackets in it that is around 4 lines long with 1,000,000,000 ASCII integers per line separated by commas. Not a very efficient format! I would suggest you simply pre-process the file to remove all square brackets, linefeeds, and spaces and convert the commas to newlines so that you get one value per line which you can easily deal with.
Using the tr command to transliterate, that would be this:
# Delete all square brackets, newlines and spaces, change commas into newlines
tr -d '[] \n' < YourFile.txt | tr , '\n' > preprocessed.txt
Your file then looks like this and you can readily process one value at a time in Python.
2
2
6
10
2
6
...
...
In case you are on Windows, the tr tool is available via GNUWin32, the Windows Subsystem for Linux, or Git Bash.
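If you would rather stay in pure Python (or cannot install tr), the same one-value-at-a-time consumption works directly on the preprocessed file; a minimal sketch, with process() as a placeholder for your real per-value work:

def process(value):  # placeholder
    pass

with open('preprocessed.txt') as f:
    for line in f:  # iterating reads one line at a time, not the whole file
        process(int(line))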
You can go still further and make a file that you can memmap() as in the Original Answer below; then you can randomly access any value in the file. So, taking the preprocessed.txt created above, you can make a binary version like this:
import struct

# Make a binary, memmap-able version
with open('preprocessed.txt', 'r') as ifile, open('preprocessed.bin', 'wb') as ofile:
    for line in ifile:
        ofile.write(struct.pack('q', int(line)))
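The shape for a later memmap() call can be recovered from the file size; a sketch, assuming the row count is known (4 in your example) and the file is read on the machine that wrote it, so the native 'q' byte order matches np.int64:

import os
import numpy as np

rows = 4                                             # known number of rows
n_values = os.path.getsize('preprocessed.bin') // 8  # 8 bytes per int64
mm = np.memmap('preprocessed.bin', dtype=np.int64, mode='r',
               shape=(rows, n_values // rows))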
Original Answer
You can do that like this. The first part is just setup:
#!/usr/bin/env python3
import numpy as np

# Create a 2x4 Numpy array of int64
a = np.arange(8, dtype=np.int64).reshape(2, 4)

# Write it to a file as binary
a.tofile('a.dat')
Now check the file by hex-dumping it in the shell:
xxd a.dat
00000000: 0000 0000 0000 0000 0100 0000 0000 0000 ................
00000010: 0200 0000 0000 0000 0300 0000 0000 0000 ................
00000020: 0400 0000 0000 0000 0500 0000 0000 0000 ................
00000030: 0600 0000 0000 0000 0700 0000 0000 0000 ................
Now that we are all set up, let's memmap() the file:
# Memmap file and access values via 'mm'
mm = np.memmap('a.dat', dtype=np.int64, mode='r', shape=(2,4))
print(mm[1,2]) # prints 6
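With mm in place, the loop from your question works unchanged; the OS pages data in on demand instead of reading the whole file up front:

rownumber, rowlength = mm.shape
for i in range(rownumber):
    for j in range(rowlength):
        process(mm[i, j])  # process() is the placeholder from your question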
The primary problem is that the file is too large, and it doesn't seem to be split in lines either. (For reference, array.txt is the example you provided and arr_map.dat is an empty file)
import re
import numpy as np

N = [str(i) for i in range(10)]  # digit characters, used to detect a split number
arrayfile = 'array.txt'
mmapfile = 'arr_map.dat'
R = 4
C = 17
CHUNK = 20

def read_by_chunk(file, chunk_size=CHUNK):
    return file.read(chunk_size)

fp = np.memmap(mmapfile, dtype=np.uint8, mode='w+', shape=(R, C))
with open(arrayfile, 'r') as f:
    curr_row = curr_col = 0
    while True:
        data = read_by_chunk(f)
        if not data:
            break
        # Make sure that chunk reading does not break a number in half
        while data[-1] in N:
            extra = read_by_chunk(f, 1)
            if not extra:  # end of file
                break
            data += extra
        # Convert chunk into numpy array
        nums = np.array(re.findall(r'[0-9]+', data)).astype(np.uint8)
        num_len = len(nums)
        if num_len == 0:
            break
        # CASE 1: Number chunk can fit into the current row
        if curr_col + num_len <= C:
            fp[curr_row, curr_col : curr_col + num_len] = nums
            curr_col = curr_col + num_len
        # CASE 2: Number chunk has to be split between the current and next row
        else:
            col_remaining = C - curr_col
            fp[curr_row, curr_col : C] = nums[:col_remaining]  # fill row i
            curr_row, curr_col = curr_row + 1, 0  # move to row i+1 and fill the rest
            fp[curr_row, :num_len - col_remaining] = nums[col_remaining:]
            curr_col = num_len - col_remaining
        if curr_col >= C:
            curr_col = curr_col % C
            curr_row += 1
        # print('\n--debug--\n', fp, '\n--debug--\n')
Basically: read small parts of the array file at a time (making sure not to break a number in half), extract the numbers from the surrounding junk characters like commas and brackets with a regex, and insert them into the memory map.
The situation you describe seems more suitable for a generator that fetches the next integer (or the next row) from the file and lets you process it.
def sanify(s):
    # strip whitespace, trailing commas and surrounding brackets, then convert
    s = s.strip().rstrip(',')
    while s.startswith('['):
        s = s[1:]
    while s.endswith(']'):
        s = s[:-1]
    return int(s)

def get_numbers(file_obj):
    # yield (value, row, col), one value at a time
    file_obj.seek(0)
    i = j = 0
    for line in file_obj:
        for item in line.split(', '):
            if item and not item.isspace():
                yield sanify(item), i, j
                j += 1
        i += 1
        j = 0
This ensures only one line at a time ever resides in memory.
This can be used like:
import io
s = '''[[2, 2, 6, 10, 2, 6, 7, 15, 14, 10, 17, 14, 7, 14, 15, 7, 17],
[3, 3, 7, 11, 3, 7, 0, 11, 7, 16, 0, 17, 17, 7, 16, 0, 0],
[4, 4, 8, 7, 4, 13, 0, 0, 15, 7, 8, 7, 0, 7, 0, 15, 13],
[5, 5, 9, 12, 5, 14, 7, 13, 9, 14, 16, 12, 13, 14, 7, 16, 7]]'''
items = get_numbers(io.StringIO(s))
for item, i, j in items:
    print(item, i, j)
If you really want to be able to access an arbitrary element of the matrix, you could adapt the above logic into a class implementing __getitem__ and you would only need to keep track of the position of the beginning of each line.
In code, this would look like:
class MatrixData(object):
    def __init__(self, file_obj):
        self._file_obj = file_obj
        self._line_offsets = list(self._get_line_offsets(file_obj))[:-1]
        file_obj.seek(0)
        row = list(self._read_row(file_obj.readline()))
        self.shape = len(self._line_offsets), len(row)
        self.length = self.shape[0] * self.shape[1]

    def __len__(self):
        return self.length

    def __iter__(self):
        self._file_obj.seek(0)
        i = j = 0
        for line in self._file_obj:
            for item in self._read_row(line):
                yield item, i, j
                j += 1
            i += 1
            j = 0

    def __getitem__(self, indices):
        i, j = indices
        self._file_obj.seek(self._line_offsets[i])
        line = self._file_obj.readline()
        row = list(self._read_row(line))
        return row[j]

    @staticmethod
    def _get_line_offsets(file_obj):
        # use readline() instead of iteration: tell() is disabled while
        # iterating over a text-mode file
        file_obj.seek(0)
        while True:
            yield file_obj.tell()
            if not file_obj.readline():
                break

    @staticmethod
    def _read_row(line):
        for item in line.split(', '):
            if item and not item.isspace():
                yield MatrixData._sanify(item)

    @staticmethod
    def _sanify(item, dtype=int):
        # strip whitespace, trailing commas and surrounding brackets
        item = item.strip().rstrip(',')
        while item.startswith('['):
            item = item[1:]
        while item.endswith(']'):
            item = item[:-1]
        return dtype(item)
To be used as:
m = MatrixData(io.StringIO(s))
# get total number of elements
len(m)
# get number of rows and cols
m.shape

# access a specific element
m[3, 12]

# iterate through all values
for x, i, j in m:
    ...
This seems to be exactly what the mmap module in Python is for. See: https://docs.python.org/3/library/mmap.html
Example from documentation
import mmap

# write a simple example file
with open("hello.txt", "wb") as f:
    f.write(b"Hello Python!\n")

with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello world!\n"
    # close the map
    mm.close()
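Note that mmap by itself gives you raw bytes, not parsed integers, so for the matrix file in the question you would still need to scan for the numbers. A sketch of one way to do that, assuming the bracketed text format from the question (with 'array.txt' as in the other answer); re accepts any bytes-like object, including an mmap:

import mmap
import re

def process(value):  # placeholder for the real per-value work
    pass

with open('array.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # the regex scans the mapped bytes; the OS pages them in on demand
    for match in re.finditer(rb'\d+', mm):
        process(int(match.group()))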
It depends on the operation you want to perform on your input matrix. If it is a matrix operation, you can often work with partial matrices: most of the time you can process small batches of the input file as partial matrices, caching intermediate results, and process the file very efficiently. For some operations you may only need to decide on the best representation of the input matrix (i.e. row-major or column-major).
The main advantage of the partial-matrix approach is that you can apply parallel-processing techniques, e.g. processing n partial matrices per iteration on a CUDA GPU. If you are familiar with C or C++, using the Python C API can speed up partial-matrix operations a lot, but even plain Python is not much worse, because you only need to process each partial matrix with Numpy.
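As a concrete illustration, a minimal sketch of row-major batch processing with NumPy, assuming the data has already been converted to a flat binary file of int64 (as in the answers above; 'matrix.bin' and the batch size are arbitrary):

import numpy as np

BATCH = 1_000_000  # values per batch; tune to the available memory

with open('matrix.bin', 'rb') as f:
    while True:
        # fromfile reads at most BATCH values, so only one batch is in memory
        batch = np.fromfile(f, dtype=np.int64, count=BATCH)
        if batch.size == 0:
            break
        partial = batch.sum()  # stand-in for the real partial-matrix operation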
I'm a Python noob and I'm stuck on a problem.
filehandler = open("data.txt", "r")
alist = filehandler.readlines()

def insertionSort(alist):
    for line in alist:
        line = list(map(int, line.split()))
        print(line)
        for index in range(2, len(line)):
            currentvalue = line[index]
            position = index
            while position > 1 and line[position-1] > currentvalue:
                line[position] = line[position-1]
                position = position - 1
            line[position] = currentvalue
        print(line)

insertionSort(alist)
for line in alist:
    print(line)
Output:
[4, 19, 2, 5, 11]
[4, 2, 5, 11, 19]
[8, 1, 2, 3, 4, 5, 6, 1, 2]
[8, 1, 1, 2, 2, 3, 4, 5, 6]
4 19 2 5 11
8 1 2 3 4 5 6 1 2
I am supposed to sort lines of values from a file. The first value in the line represents the number of values to be sorted. I am supposed to display the values in the file in sorted order.
The print calls in insertionSort are just for debugging purposes.
The top four lines of output show that the insertion sort seems to be working. I can't figure out why when I print the lists after calling insertionSort the values are not sorted.
I am new to Stack Overflow and Python so please let me know if this question is misplaced.
for line in alist:
    line = list(map(int, line.split()))
line starts out as a string, e.g. "4 19 2 5 11". You split it and convert to int, i.e. [4, 19, 2, 5, 11].
You then assign this new value to line, but line is a local variable: the new value never gets stored back into alist.
(By the way, avoid naming such a variable list; that would shadow the built-in list data-type.)
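The minimal fix is to write the converted-and-sorted list back by index (a sketch; the second answer below shows this in full):

for k, line in enumerate(alist):
    line = list(map(int, line.split()))
    # ... sort `line` as before ...
    alist[k] = line  # store the result back so it is visible after the call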
Let's reorganize your program:
def load_file(fname):
    with open(fname) as inf:
        # -> list of list of int
        data = [[int(i) for i in line.split()] for line in inf]
    return data

def insertion_sort(row):
    # `row` is a list of int
    #
    # your sorting code goes here
    #
    return row

def save_file(fname, data):
    with open(fname, "w") as outf:
        # list of list of int -> list of str
        lines = [" ".join(str(i) for i in row) for row in data]
        outf.write("\n".join(lines))

def main():
    data = load_file("data.txt")
    data = [insertion_sort(row) for row in data]
    save_file("sorted_data.txt", data)

if __name__ == "__main__":
    main()
Actually, with your data, where the first number in each row isn't actually data to sort, you would do better to write
data = [row[:1] + insertion_sort(row[1:]) for row in data]
so that the logic of insertion_sort is cleaner.
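For reference, a standard insertion-sort body that could fill the insertion_sort stub above (a sketch, equivalent to your original logic but indexed to sort the whole row):

def insertion_sort(row):
    # sort `row` (a list of int) in place and return it
    for index in range(1, len(row)):
        currentvalue = row[index]
        position = index
        # shift larger elements one slot right until the insertion point is found
        while position > 0 and row[position - 1] > currentvalue:
            row[position] = row[position - 1]
            position -= 1
        row[position] = currentvalue
    return row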
As @Barmar mentioned above, you are not modifying the input to the function. You could do the following:
def insertionSort(alist):
    blist = []
    for line in alist:
        line = list(map(int, line.split()))
        for index in range(2, len(line)):
            currentvalue = line[index]
            position = index
            while position > 1 and line[position-1] > currentvalue:
                line[position] = line[position-1]
                position = position - 1
            line[position] = currentvalue
        blist.append(line)
    return blist

blist = insertionSort(alist)
print(blist)
Alternatively, modify alist "in-place":
def insertionSort(alist):
    for k, line in enumerate(alist):
        line = list(map(int, line.split()))
        for index in range(2, len(line)):
            currentvalue = line[index]
            position = index
            while position > 1 and line[position-1] > currentvalue:
                line[position] = line[position-1]
                position = position - 1
            line[position] = currentvalue
        alist[k] = line

insertionSort(alist)
print(alist)
My file is this one:
14
3
21
37
48
12
4
6
22
4
How can I read M numbers at a time, for example 4 at a time? Is it necessary to use two for loops?
My goal is to create (N/M)+1 lists with M numbers in each list, except the final list, which holds the remainder of N/M.
You can use Python's list slice operator to fetch the required elements from a file, reading it with readlines() so that each list element is one line of the file. (Note that readlines() still reads the whole file into memory; it is the slice that selects the lines.)
with open("filename") as myfile:
firstNtoMlines = myfile.readlines()[N:N+M] # the interval you want to read
print firstNtoMlines
Use itertools.islice:
import itertools
import math

filename = 'test.dat'
N = 9  # total number of lines to read
M = 4  # numbers per list
num_rest_lines = N
nrof_lists = int(math.ceil(N * 1.0 / M))

with open(filename, 'r') as f:
    for i in range(nrof_lists):
        num_lines = min(num_rest_lines, M)
        lines_gen = itertools.islice(f, num_lines)
        l = [int(line.rstrip()) for line in lines_gen]
        num_rest_lines = num_rest_lines - M
        print(l)
# Output
[14, 3, 21, 37]
[48, 12, 4, 6]
[22]
Previous answer: Iterate over a file (N lines) in chunks (every M lines), forming a list of N/M+1 lists.
import itertools

def grouper(iterable, n, fillvalue=None):
    """Iterate in chunks of n, padding the last chunk with fillvalue."""
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)  # izip_longest in Python 2

# Test
filename = 'test.dat'
m = 4
fillvalue = '0'

with open(filename, 'r') as f:
    lists = [[int(item.rstrip()) for item in chunk]
             for chunk in grouper(f, m, fillvalue=fillvalue)]
    print(lists)
# Output
[[14, 3, 21, 37], [48, 12, 4, 6], [22, 4, 0, 0]]
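If the 0-padding of the last chunk is unwanted, one option is to group with fillvalue=None and drop the padding afterwards (a sketch, reusing the grouper above):

with open(filename, 'r') as f:
    lists = [[int(item) for item in chunk if item is not None]
             for chunk in grouper(f, m, fillvalue=None)]
    print(lists)

# Output
[[14, 3, 21, 37], [48, 12, 4, 6], [22, 4]]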
Now my code is this one:
N = 4
M = 0
while M < 633:
    with open("/Users/Lorenzo/Desktop/X", "r") as myFile:
        res = myFile.readlines()[M:N]
        print(res)
    M += 4
    N += 4
so it should work; my file contains 633 numbers.
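It does work, but it re-opens the file and re-reads it from the beginning on every pass. A variant that opens the file once, using itertools.islice as in the answer above:

from itertools import islice

with open("/Users/Lorenzo/Desktop/X", "r") as myFile:
    while True:
        res = [int(line) for line in islice(myFile, 4)]
        if not res:
            break
        print(res)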
This has been asked before.
from itertools import zip_longest  # izip_longest in Python 2

zip_longest(*(iter(yourlist),) * yourgroupsize)

For the case of grouping lines in a file into lists of size 4:

with open("file.txt", "r") as f:
    res = list(zip_longest(*(iter(f),) * 4))
    print(res)
See also: Alternative way to split a list into groups of n.