I'm trying to read a text file and collect its column values into lists. My text file looks like this:
40 10 5 5
30 20 10 0
30 30 10 5
and desired output is
(40,30,30),(10,20,30),(5,10,10),(5,0,5)
I tried this code
def contest(filename):
    contestFile = open(filename, 'r')
    contestFileLines = contestFile.readlines()
    startColumn = 0
    contestResult = []
    for x in contestFileLines:
        contestResult.append(x.split()[startColumn])
    contestFile.close()
    print(contestResult)

contest("testing.txt")
and its output is just
['40', '30', '30']
What should I do?
Try reading every line into a list, splitting on each space and mapping the values to int. Then you can use this answer (which Barmar suggested in the comments) to transpose the list of rows. Like this:
def cols(path):
    rows = []
    with open(path) as f:
        for line in f:
            rows.append(map(int, line.split(' ')))
    return list(map(list, zip(*rows)))

print(cols('test.txt'))  # => [[40, 30, 30], [10, 20, 30], [5, 10, 10], [5, 0, 5]]
Alternatively, if you need the inner sequences as tuples, just change this line:
return list(map(list, zip(*rows)))
to
return list(map(tuple, zip(*rows)))
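One caveat: line.split(' ') assumes the values are separated by exactly one space. If the file might contain runs of spaces or tabs, split with no argument instead:

rows.append(map(int, line.split()))  # split() handles any whitespace run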
I have very large files, each containing a 2D array (matrix) of positive integers.
I would like to process them without reading the files into memory. Luckily I only need to look at the values from left to right in the input file. I was hoping to be able to mmap each file so I can process it as if it were in memory, but without actually reading the whole file into memory.
Example of smaller version:
[[2, 2, 6, 10, 2, 6, 7, 15, 14, 10, 17, 14, 7, 14, 15, 7, 17],
[3, 3, 7, 11, 3, 7, 0, 11, 7, 16, 0, 17, 17, 7, 16, 0, 0],
[4, 4, 8, 7, 4, 13, 0, 0, 15, 7, 8, 7, 0, 7, 0, 15, 13],
[5, 5, 9, 12, 5, 14, 7, 13, 9, 14, 16, 12, 13, 14, 7, 16, 7]]
Is it possible to mmap such a file so I can then process the np.int64 values with
for i in range(rownumber):
    for j in range(rowlength):
        process(M[i, j])
To be clear, I don't want ever to have all my input file in memory as it won't fit.
Updated Answer
On the basis of your comments and clarifications, it appears you actually have a text file with a bunch of square brackets in it that is around 4 lines long with 1,000,000,000 ASCII integers per line separated by commas. Not a very efficient format! I would suggest you simply pre-process the file to remove all square brackets, linefeeds, and spaces and convert the commas to newlines so that you get one value per line which you can easily deal with.
Using the tr command to transliterate, that would be this:
# Delete all square brackets, newlines and spaces, change commas into newlines
tr -d '[] \n' < YourFile.txt | tr , '\n' > preprocessed.txt
Your file then looks like this and you can readily process one value at a time in Python.
2
2
6
10
2
6
...
...
In case you are on Windows, the tr tool is available via GNUWin32, the Windows Subsystem for Linux, or Git Bash.
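If you would rather stay in pure Python (or cannot install tr), the same one-value-at-a-time consumption works directly on the preprocessed file; a minimal sketch, with process() as a placeholder for your real per-value work:

def process(value):  # placeholder
    pass

with open('preprocessed.txt') as f:
    for line in f:  # iterating reads one line at a time, not the whole file
        process(int(line))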
You can go still further and make a file that you can memmap() as in the Original Answer below; then you can randomly access any value in the file. So, taking the preprocessed.txt created above, you can make a binary version like this:
import struct

# Make a binary, memmap-able version
with open('preprocessed.txt', 'r') as ifile, open('preprocessed.bin', 'wb') as ofile:
    for line in ifile:
        ofile.write(struct.pack('q', int(line)))
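The shape for a later memmap() call can be recovered from the file size; a sketch, assuming the row count is known (4 in your example) and the file is read on the machine that wrote it, so the native 'q' byte order matches np.int64:

import os
import numpy as np

rows = 4                                             # known number of rows
n_values = os.path.getsize('preprocessed.bin') // 8  # 8 bytes per int64
mm = np.memmap('preprocessed.bin', dtype=np.int64, mode='r',
               shape=(rows, n_values // rows))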
Original Answer
You can do that like this. The first part is just setup:
#!/usr/bin/env python3
import numpy as np

# Create a 2x4 Numpy array of int64
a = np.arange(8, dtype=np.int64).reshape(2, 4)

# Write it to a file as binary
a.tofile('a.dat')
Now check the file by hex-dumping it in the shell:
xxd a.dat
00000000: 0000 0000 0000 0000 0100 0000 0000 0000 ................
00000010: 0200 0000 0000 0000 0300 0000 0000 0000 ................
00000020: 0400 0000 0000 0000 0500 0000 0000 0000 ................
00000030: 0600 0000 0000 0000 0700 0000 0000 0000 ................
Now that we are all set up, let's memmap() the file:
# Memmap file and access values via 'mm'
mm = np.memmap('a.dat', dtype=np.int64, mode='r', shape=(2,4))
print(mm[1,2]) # prints 6
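With mm in place, the loop from your question works unchanged; the OS pages data in on demand instead of reading the whole file up front:

rownumber, rowlength = mm.shape
for i in range(rownumber):
    for j in range(rowlength):
        process(mm[i, j])  # process() is the placeholder from your question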
The primary problem is that the file is too large, and it doesn't seem to be split in lines either. (For reference, array.txt is the example you provided and arr_map.dat is an empty file)
import re
import numpy as np

N = [str(i) for i in range(10)]  # digit characters, used to detect a split number
arrayfile = 'array.txt'
mmapfile = 'arr_map.dat'
R = 4
C = 17
CHUNK = 20

def read_by_chunk(file, chunk_size=CHUNK):
    return file.read(chunk_size)

fp = np.memmap(mmapfile, dtype=np.uint8, mode='w+', shape=(R, C))
with open(arrayfile, 'r') as f:
    curr_row = curr_col = 0
    while True:
        data = read_by_chunk(f)
        if not data:
            break
        # Make sure that chunk reading does not break a number in half
        while data[-1] in N:
            extra = read_by_chunk(f, 1)
            if not extra:  # end of file
                break
            data += extra
        # Convert chunk into numpy array
        nums = np.array(re.findall(r'[0-9]+', data)).astype(np.uint8)
        num_len = len(nums)
        if num_len == 0:
            break
        # CASE 1: Number chunk can fit into the current row
        if curr_col + num_len <= C:
            fp[curr_row, curr_col : curr_col + num_len] = nums
            curr_col = curr_col + num_len
        # CASE 2: Number chunk has to be split between the current and next row
        else:
            col_remaining = C - curr_col
            fp[curr_row, curr_col : C] = nums[:col_remaining]  # fill row i
            curr_row, curr_col = curr_row + 1, 0  # move to row i+1 and fill the rest
            fp[curr_row, :num_len - col_remaining] = nums[col_remaining:]
            curr_col = num_len - col_remaining
        if curr_col >= C:
            curr_col = curr_col % C
            curr_row += 1
        # print('\n--debug--\n', fp, '\n--debug--\n')
Basically: read small parts of the array file at a time (making sure not to break a number in half), extract the numbers from the surrounding junk characters like commas and brackets with a regex, and insert them into the memory map.
The situation you describe seems more suitable for a generator that fetches the next integer (or the next row) from the file and lets you process it.
def sanify(s):
    # strip whitespace, trailing commas and surrounding brackets, then convert
    s = s.strip().rstrip(',')
    while s.startswith('['):
        s = s[1:]
    while s.endswith(']'):
        s = s[:-1]
    return int(s)

def get_numbers(file_obj):
    # yield (value, row, col), one value at a time
    file_obj.seek(0)
    i = j = 0
    for line in file_obj:
        for item in line.split(', '):
            if item and not item.isspace():
                yield sanify(item), i, j
                j += 1
        i += 1
        j = 0
This ensures only one line at a time ever resides in memory.
This can be used like:
import io
s = '''[[2, 2, 6, 10, 2, 6, 7, 15, 14, 10, 17, 14, 7, 14, 15, 7, 17],
[3, 3, 7, 11, 3, 7, 0, 11, 7, 16, 0, 17, 17, 7, 16, 0, 0],
[4, 4, 8, 7, 4, 13, 0, 0, 15, 7, 8, 7, 0, 7, 0, 15, 13],
[5, 5, 9, 12, 5, 14, 7, 13, 9, 14, 16, 12, 13, 14, 7, 16, 7]]'''
items = get_numbers(io.StringIO(s))
for item, i, j in items:
    print(item, i, j)
If you really want to be able to access an arbitrary element of the matrix, you could adapt the above logic into a class implementing __getitem__ and you would only need to keep track of the position of the beginning of each line.
In code, this would look like:
class MatrixData(object):
    def __init__(self, file_obj):
        self._file_obj = file_obj
        self._line_offsets = list(self._get_line_offsets(file_obj))[:-1]
        file_obj.seek(0)
        row = list(self._read_row(file_obj.readline()))
        self.shape = len(self._line_offsets), len(row)
        self.length = self.shape[0] * self.shape[1]

    def __len__(self):
        return self.length

    def __iter__(self):
        self._file_obj.seek(0)
        i = j = 0
        for line in self._file_obj:
            for item in self._read_row(line):
                yield item, i, j
                j += 1
            i += 1
            j = 0

    def __getitem__(self, indices):
        i, j = indices
        self._file_obj.seek(self._line_offsets[i])
        line = self._file_obj.readline()
        row = list(self._read_row(line))
        return row[j]

    @staticmethod
    def _get_line_offsets(file_obj):
        # use readline() instead of iteration: tell() is disabled while
        # iterating over a text-mode file
        file_obj.seek(0)
        while True:
            yield file_obj.tell()
            if not file_obj.readline():
                break

    @staticmethod
    def _read_row(line):
        for item in line.split(', '):
            if item and not item.isspace():
                yield MatrixData._sanify(item)

    @staticmethod
    def _sanify(item, dtype=int):
        # strip whitespace, trailing commas and surrounding brackets
        item = item.strip().rstrip(',')
        while item.startswith('['):
            item = item[1:]
        while item.endswith(']'):
            item = item[:-1]
        return dtype(item)
To be used as:
m = MatrixData(io.StringIO(s))
# get total number of elements
len(m)
# get number of rows and cols
m.shape

# access a specific element
m[3, 12]

# iterate through all values
for x, i, j in m:
    ...
This seems to be exactly what the mmap module in Python is for. See: https://docs.python.org/3/library/mmap.html
Example from documentation
import mmap

# write a simple example file
with open("hello.txt", "wb") as f:
    f.write(b"Hello Python!\n")

with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello world!\n"
    # close the map
    mm.close()
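Note that mmap by itself gives you raw bytes, not parsed integers, so for the matrix file in the question you would still need to scan for the numbers. A sketch of one way to do that, assuming the bracketed text format from the question (with 'array.txt' as in the other answer); re accepts any bytes-like object, including an mmap:

import mmap
import re

def process(value):  # placeholder for the real per-value work
    pass

with open('array.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # the regex scans the mapped bytes; the OS pages them in on demand
    for match in re.finditer(rb'\d+', mm):
        process(int(match.group()))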
It depends on the operation you want to perform on your input matrix. If it is a matrix operation, you can often work with partial matrices: most of the time you can process small batches of the input file as partial matrices, caching intermediate results, and process the file very efficiently. For some operations you may only need to decide on the best representation of the input matrix (i.e. row-major or column-major).
The main advantage of the partial-matrix approach is that you can apply parallel-processing techniques, e.g. processing n partial matrices per iteration on a CUDA GPU. If you are familiar with C or C++, using the Python C API can speed up partial-matrix operations a lot, but even plain Python is not much worse, because you only need to process each partial matrix with Numpy.
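As a concrete illustration, a minimal sketch of row-major batch processing with NumPy, assuming the data has already been converted to a flat binary file of int64 (as in the answers above; 'matrix.bin' and the batch size are arbitrary):

import numpy as np

BATCH = 1_000_000  # values per batch; tune to the available memory

with open('matrix.bin', 'rb') as f:
    while True:
        # fromfile reads at most BATCH values, so only one batch is in memory
        batch = np.fromfile(f, dtype=np.int64, count=BATCH)
        if batch.size == 0:
            break
        partial = batch.sum()  # stand-in for the real partial-matrix operation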
I'm a Python noob and I'm stuck on a problem.
filehandler = open("data.txt", "r")
alist = filehandler.readlines()

def insertionSort(alist):
    for line in alist:
        line = list(map(int, line.split()))
        print(line)
        for index in range(2, len(line)):
            currentvalue = line[index]
            position = index
            while position > 1 and line[position-1] > currentvalue:
                line[position] = line[position-1]
                position = position - 1
            line[position] = currentvalue
        print(line)

insertionSort(alist)
for line in alist:
    print(line)
Output:
[4, 19, 2, 5, 11]
[4, 2, 5, 11, 19]
[8, 1, 2, 3, 4, 5, 6, 1, 2]
[8, 1, 1, 2, 2, 3, 4, 5, 6]
4 19 2 5 11
8 1 2 3 4 5 6 1 2
I am supposed to sort lines of values from a file. The first value in the line represents the number of values to be sorted. I am supposed to display the values in the file in sorted order.
The print calls in insertionSort are just for debugging purposes.
The top four lines of output show that the insertion sort seems to be working. I can't figure out why when I print the lists after calling insertionSort the values are not sorted.
I am new to Stack Overflow and Python so please let me know if this question is misplaced.
for line in alist:
    line = list(map(int, line.split()))
line starts out as a string, e.g. "4 19 2 5 11". You split it and convert to int, i.e. [4, 19, 2, 5, 11].
You then assign this new value to line, but line is a local variable: the new value never gets stored back into alist.
(By the way, avoid naming such a variable list; that would shadow the built-in list data-type.)
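The minimal fix is to write the converted-and-sorted list back by index (a sketch; the second answer below shows this in full):

for k, line in enumerate(alist):
    line = list(map(int, line.split()))
    # ... sort `line` as before ...
    alist[k] = line  # store the result back so it is visible after the call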
Let's reorganize your program:
def load_file(fname):
    with open(fname) as inf:
        # -> list of list of int
        data = [[int(i) for i in line.split()] for line in inf]
    return data

def insertion_sort(row):
    # `row` is a list of int
    #
    # your sorting code goes here
    #
    return row

def save_file(fname, data):
    with open(fname, "w") as outf:
        # list of list of int -> list of str
        lines = [" ".join(str(i) for i in row) for row in data]
        outf.write("\n".join(lines))

def main():
    data = load_file("data.txt")
    data = [insertion_sort(row) for row in data]
    save_file("sorted_data.txt", data)

if __name__ == "__main__":
    main()
Actually, with your data, where the first number in each row isn't actually data to sort, you would do better to write
data = [row[:1] + insertion_sort(row[1:]) for row in data]
so that the logic of insertion_sort is cleaner.
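For reference, a standard insertion-sort body that could fill the insertion_sort stub above (a sketch, equivalent to your original logic but indexed to sort the whole row):

def insertion_sort(row):
    # sort `row` (a list of int) in place and return it
    for index in range(1, len(row)):
        currentvalue = row[index]
        position = index
        # shift larger elements one slot right until the insertion point is found
        while position > 0 and row[position - 1] > currentvalue:
            row[position] = row[position - 1]
            position -= 1
        row[position] = currentvalue
    return row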
As @Barmar mentioned above, you are not modifying the input to the function. You could do the following:
def insertionSort(alist):
    blist = []
    for line in alist:
        line = list(map(int, line.split()))
        for index in range(2, len(line)):
            currentvalue = line[index]
            position = index
            while position > 1 and line[position-1] > currentvalue:
                line[position] = line[position-1]
                position = position - 1
            line[position] = currentvalue
        blist.append(line)
    return blist

blist = insertionSort(alist)
print(blist)
Alternatively, modify alist "in-place":
def insertionSort(alist):
    for k, line in enumerate(alist):
        line = list(map(int, line.split()))
        for index in range(2, len(line)):
            currentvalue = line[index]
            position = index
            while position > 1 and line[position-1] > currentvalue:
                line[position] = line[position-1]
                position = position - 1
            line[position] = currentvalue
        alist[k] = line

insertionSort(alist)
print(alist)
My file is this one:
14
3
21
37
48
12
4
6
22
4
How can I read M numbers at a time, for example 4 at a time? Is it necessary to use two for loops?
My goal is to create (N/M)+1 lists with M numbers in each list, except the final list, which holds the remainder of N/M.
You can use Python's list slice operator to fetch the required elements from a file, reading it with readlines() so that each list element is one line of the file. (Note that readlines() still reads the whole file into memory; it is the slice that selects the lines.)
with open("filename") as myfile:
firstNtoMlines = myfile.readlines()[N:N+M] # the interval you want to read
print firstNtoMlines
Use itertools.islice:
import itertools
import math

filename = 'test.dat'
N = 9  # total number of lines to read
M = 4  # numbers per list
num_rest_lines = N
nrof_lists = int(math.ceil(N * 1.0 / M))

with open(filename, 'r') as f:
    for i in range(nrof_lists):
        num_lines = min(num_rest_lines, M)
        lines_gen = itertools.islice(f, num_lines)
        l = [int(line.rstrip()) for line in lines_gen]
        num_rest_lines = num_rest_lines - M
        print(l)
# Output
[14, 3, 21, 37]
[48, 12, 4, 6]
[22]
Previous answer: Iterate over a file (N lines) in chunks (every M lines), forming a list of N/M+1 lists.
import itertools

def grouper(iterable, n, fillvalue=None):
    """Iterate in chunks of n, padding the last chunk with fillvalue."""
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)  # izip_longest in Python 2

# Test
filename = 'test.dat'
m = 4
fillvalue = '0'

with open(filename, 'r') as f:
    lists = [[int(item.rstrip()) for item in chunk]
             for chunk in grouper(f, m, fillvalue=fillvalue)]
    print(lists)
# Output
[[14, 3, 21, 37], [48, 12, 4, 6], [22, 4, 0, 0]]
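If the 0-padding of the last chunk is unwanted, one option is to group with fillvalue=None and drop the padding afterwards (a sketch, reusing the grouper above):

with open(filename, 'r') as f:
    lists = [[int(item) for item in chunk if item is not None]
             for chunk in grouper(f, m, fillvalue=None)]
    print(lists)

# Output
[[14, 3, 21, 37], [48, 12, 4, 6], [22, 4]]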
Now my code is this one:
N = 4
M = 0
while M < 633:
    with open("/Users/Lorenzo/Desktop/X", "r") as myFile:
        res = myFile.readlines()[M:N]
        print(res)
    M += 4
    N += 4
so it should work; my file contains 633 numbers.
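It does work, but it re-opens the file and re-reads it from the beginning on every pass. A variant that opens the file once, using itertools.islice as in the answer above:

from itertools import islice

with open("/Users/Lorenzo/Desktop/X", "r") as myFile:
    while True:
        res = [int(line) for line in islice(myFile, 4)]
        if not res:
            break
        print(res)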
This has been asked before.
from itertools import zip_longest  # izip_longest in Python 2

zip_longest(*(iter(yourlist),) * yourgroupsize)

For the case of grouping lines in a file into lists of size 4:

with open("file.txt", "r") as f:
    res = list(zip_longest(*(iter(f),) * 4))
    print(res)
See also: Alternative way to split a list into groups of n.