I have a text file containing an upper triangular matrix, with the lower values omitted (example below):
3 5 3 5 1 8 1 6 5 8
5 8 1 1 6 2 9 6 4
2 0 5 2 1 0 0 3
2 2 5 1 0 1 0
1 3 6 3 6 1
4 2 4 3 7
4 0 0 1
0 1 8
2 1
1
Since the file in question is ~10000 lines long, I was wondering if there was a 'smart' way to generate a numpy matrix from it, e.g. using the genfromtxt function. However, using it directly throws an error along the lines of
Line #12431 (got 6 columns instead of 12437)
and filling_values won't work, since the file contains no placeholders designating the missing values.
Right now I have to resort to manually opening and reading the file:
import numpy as np

def load_updiag(filename, size):
    output = np.zeros((size, size))
    with open(filename) as f:
        for line_count, line in enumerate(f):
            data = line.split()
            output[line_count, line_count:size] = data
    return output
I feel this is probably not very scalable for large file sizes.
Is there a way to properly use genfromtxt (or any other optimized function from numpy's library) on such matrices?
You can read the raw data from the file into a string, and then use np.fromstring to get a 1-d array of the upper triangular part of the matrix:
with open('data.txt') as data_file:
    data = data_file.read()

arr = np.fromstring(data, sep=' ')
Alternatively, you can define a generator to read one line of your file at a time, then use np.fromiter to read a 1-d array from this generator:
def iter_data(path):
    with open(path) as data_file:
        for line in data_file:
            yield from line.split()

arr = np.fromiter(iter_data('data.txt'), int)
If you know the size of the matrix (which you can determine from the first line of the file), you can specify the count keyword argument of np.fromiter so that the function will pre-allocate exactly the right amount of memory, which will be faster. That's what these functions do:
def iter_data(fileobj):
    for line in fileobj:
        yield from line.split()

def read_triangular_array(path):
    with open(path) as fileobj:
        n = len(fileobj.readline().split())
    count = int(n*(n+1)/2)
    with open(path) as fileobj:
        return np.fromiter(iter_data(fileobj), int, count=count)
This "wastes" a little work, since it opens the file twice: once to read the first line and compute the count of entries, and again to read the data. An "improvement" is to save the first line and chain it with an iterator over the rest of the file, as in this code:
from itertools import chain

def iter_data(fileobj):
    for line in fileobj:
        yield from line.split()

def read_triangular_array(path):
    with open(path) as fileobj:
        first = fileobj.readline().split()
        n = len(first)
        count = int(n*(n+1)/2)
        data = chain(first, iter_data(fileobj))
        return np.fromiter(data, int, count=count)
All of these approaches yield
>>> arr
array([ 3.,  5.,  3.,  5.,  1.,  8.,  1.,  6.,  5.,  8.,  5.,  8.,  1.,
        1.,  6.,  2.,  9.,  6.,  4.,  2.,  0.,  5.,  2.,  1.,  0.,  0.,
        3.,  2.,  2.,  5.,  1.,  0.,  1.,  0.,  1.,  3.,  6.,  3.,  6.,
        1.,  4.,  2.,  4.,  3.,  7.,  4.,  0.,  0.,  1.,  0.,  1.,  8.,
        2.,  1.,  1.])
This compact representation might be all you need, but if you want the full square matrix you can allocate a zeros matrix of the right size and copy arr into it using np.triu_indices_from (see the sketch after the output below), or you can use scipy.spatial.distance.squareform:
>>> from scipy.spatial.distance import squareform
>>> squareform(arr)
array([[ 0.,  3.,  5.,  3.,  5.,  1.,  8.,  1.,  6.,  5.,  8.],
       [ 3.,  0.,  5.,  8.,  1.,  1.,  6.,  2.,  9.,  6.,  4.],
       [ 5.,  5.,  0.,  2.,  0.,  5.,  2.,  1.,  0.,  0.,  3.],
       [ 3.,  8.,  2.,  0.,  2.,  2.,  5.,  1.,  0.,  1.,  0.],
       [ 5.,  1.,  0.,  2.,  0.,  1.,  3.,  6.,  3.,  6.,  1.],
       [ 1.,  1.,  5.,  2.,  1.,  0.,  4.,  2.,  4.,  3.,  7.],
       [ 8.,  6.,  2.,  5.,  3.,  4.,  0.,  4.,  0.,  0.,  1.],
       [ 1.,  2.,  1.,  1.,  6.,  2.,  4.,  0.,  0.,  1.,  8.],
       [ 6.,  9.,  0.,  0.,  3.,  4.,  0.,  0.,  0.,  2.,  1.],
       [ 5.,  6.,  0.,  1.,  6.,  3.,  0.,  1.,  2.,  0.,  1.],
       [ 8.,  4.,  3.,  0.,  1.,  7.,  1.,  8.,  1.,  1.,  0.]])
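For reference, here is a minimal sketch of the np.triu_indices_from route mentioned above. It assumes arr holds the upper triangle including the diagonal (as in the original file, a 10x10 matrix); note that squareform instead interprets the 55 values as the strictly-upper part of an 11x11 matrix:
import numpy as np

n = 10                                  # matrix size; arr then holds n*(n+1)//2 values
full = np.zeros((n, n))
full[np.triu_indices_from(full)] = arr  # fill the upper triangle, diagonal included
# mirror into the lower triangle if a symmetric matrix is wanted:
full = full + full.T - np.diag(full.diagonal())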
Related
I have the following PyTorch tensor, long_format:
tensor([[ 1.,  1.],
        [ 1.,  2.],
        [ 1.,  3.],
        [ 1.,  4.],
        [ 0.,  5.],
        [ 0.,  6.],
        [ 0.,  7.],
        [ 1.,  8.],
        [ 0.,  9.],
        [ 0., 10.]])
I would like to group by the first column and store the 2nd column as a tensor. The result is NOT guaranteed to be the same size for each grouping. See the example below.
[tensor([ 1.,  2.,  3.,  4.,  8.]),
 tensor([ 5.,  6.,  7.,  9., 10.])]
Is there any nice way to do this using purely PyTorch operators? I would like to avoid using for loops for traceability purposes.
I have tried using a for loop and a list of empty tensors, but this results in an incorrect trace (different input values gave the same results):
n_groups = 2
inverted = [torch.empty([0]) for _ in range(n_groups)]
for index, value in long_format:
    value = value.unsqueeze(dim=0)
    index = index.int()
    if type(inverted[index]) != torch.Tensor:
        inverted[index] = value
    else:
        inverted[index] = torch.cat((inverted[index], value))
You can use this code:
import torch

x = torch.tensor([[ 1.,  1.],
                  [ 1.,  2.],
                  [ 1.,  3.],
                  [ 1.,  4.],
                  [ 0.,  5.],
                  [ 0.,  6.],
                  [ 0.,  7.],
                  [ 1.,  8.],
                  [ 0.,  9.],
                  [ 0., 10.]])

result = [x[x[:, 0] == i][:, 1] for i in x[:, 0].unique()]
Output:
[tensor([ 5., 6., 7., 9., 10.]), tensor([1., 2., 3., 4., 8.])]
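If even the comprehension's Python-level loop over the unique labels is a concern for tracing, a hedged loop-free sketch using sort, bincount and split is possible. It assumes the group labels are small non-negative integers, and torch.split still needs the group sizes as a Python list:
import torch

x = torch.tensor([[1., 1.], [1., 2.], [1., 3.], [1., 4.], [0., 5.],
                  [0., 6.], [0., 7.], [1., 8.], [0., 9.], [0., 10.]])

keys, idx = torch.sort(x[:, 0], stable=True)      # sort rows by group label
counts = torch.bincount(keys.long())              # size of each group
groups = torch.split(x[idx, 1], counts.tolist())  # one tensor per group
# -> (tensor([ 5.,  6.,  7.,  9., 10.]), tensor([1., 2., 3., 4., 8.]))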
I have a numpy array like this:
array([[ 3.,  2.,  3., ...,  0.,  0.,  0.],
       [ 3.,  2., -4., ...,  0.,  0.,  0.],
       [ 3., -4.,  1., ...,  0.,  0.,  0.],
       ...,
       [-1., -2.,  4., ...,  0.,  0.,  0.],
       [ 4., -2., -2., ...,  0.,  0.,  0.],
       [-2.,  2.,  4., ...,  0.,  0.,  0.]], dtype=float32)
what I want to do is remove all the rows that do not sum to zero, while also saving the indexes/positions of those rows so that I can eliminate the same rows from another array.
I'm trying the following:
for i in range(len(arr1)):
    count = 0
    for j in arr1[i]:
        count += j
    if count != 0:
        arr_1 = np.delete(arr1, i, axis=0)
        arr_2 = np.delete(arr2, i, axis=0)
The resulting arr_1 and arr_2 still contain rows that do not sum to zero. What am I doing wrong?
You can compute the sum of each row, then keep the rows whose sum == 0, like below:
a = np.array([
    [ 3.,  2.,  3., 0., 0., 0.],
    [ 3.,  2., -4., 0., 0., 0.],
    [ 3., -4.,  1., 0., 0., 0.]])

b = a.sum(axis=1)
# array([8., 1., 0.])

print(a[b == 0])
Output:
array([[ 3., -4., 1., 0., 0., 0.]])
Just use sum(axis=1):
mask = a.sum(axis=1) != 0
do_sum_to_0 = a[~mask]
dont_sum_to_0 = a[mask]
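Since the question also asks for the positions of the removed rows, here is a small hedged sketch combining that mask with np.where, reusing the sample array a from the first answer; the same mask can then be applied to the second array, e.g. arr2[~mask]:
import numpy as np

a = np.array([
    [ 3.,  2.,  3., 0., 0., 0.],
    [ 3.,  2., -4., 0., 0., 0.],
    [ 3., -4.,  1., 0., 0., 0.]])

mask = a.sum(axis=1) != 0
removed_idx = np.where(mask)[0]  # indexes of the rows being dropped
print(removed_idx)               # [0 1]
print(a[~mask])                  # [[ 3. -4.  1.  0.  0.  0.]]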
I have a numpy array:
arr = np.array([[ 1.,  2., 0.],
                [ 2.,  4., 1.],
                [ 1.,  3., 2.],
                [-1., -2., 4.],
                [-1., -2., 5.],
                [ 1.,  2., 6.]])
I want to flip the second half of this array upward. I mean I want to have:
flipped_arr = np.array([[-1., -2., 4.],
                        [-1., -2., 5.],
                        [ 1.,  2., 6.],
                        [ 1.,  2., 0.],
                        [ 2.,  4., 1.],
                        [ 1.,  3., 2.]])
When I try this code:
flipped_arr = np.flip(arr, 0)
It gives me:
flipped_arr = array([[ 1.,  2., 6.],
                     [-1., -2., 5.],
                     [-1., -2., 4.],
                     [ 1.,  3., 2.],
                     [ 2.,  4., 1.],
                     [ 1.,  2., 0.]])
In advance, I do appreciate any help.
You can simply concatenate the rows from the nth row (included) onward on top and the remaining rows at the bottom with np.r_, for a row index n of your choice:
import numpy as np

n = 3
arr_flip_n = np.r_[arr[n:], arr[:n]]
array([[-1., -2.,  4.],
       [-1., -2.,  5.],
       [ 1.,  2.,  6.],
       [ 1.,  2.,  0.],
       [ 2.,  4.,  1.],
       [ 1.,  3.,  2.]])
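For what it's worth, np.roll performs the same rotation in a single call; a hedged alternative sketch, not part of the original answer:
import numpy as np

n = 3  # same split point as above
arr_flip_n = np.roll(arr, -n, axis=0)  # move the first n rows to the bottom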
You can do this by slicing the array at the midpoint:
ans = np.vstack((arr[int(arr.shape[0]/2):], arr[:int(arr.shape[0]/2)]))
To break this down a little:
find the midpoint of arr by taking its shape, whose first entry is the number of rows, dividing by two and converting to an integer:
midpoint = int(arr.shape[0]/2)
the two halves of the array can then be sliced like so:
a = arr[:midpoint]
b = arr[midpoint:]
then stack them back together, second half first, using np.vstack:
ans = np.vstack((b, a))
(note vstack takes a single argument, which is a tuple containing b and a: (b, a))
You can do this with array slicing and vstack:
arr = np.array([[ 1.,  2., 0.],
                [ 2.,  4., 1.],
                [ 1.,  3., 2.],
                [-1., -2., 4.],
                [-1., -2., 5.],
                [ 1.,  2., 6.]])
mid = arr.shape[0]//2
np.vstack([arr[mid:],arr[:mid]])
array([[-1., -2.,  4.],
       [-1., -2.,  5.],
       [ 1.,  2.,  6.],
       [ 1.,  2.,  0.],
       [ 2.,  4.,  1.],
       [ 1.,  3.,  2.]])
I have a 2d numpy array my_array that starts out like this:
array([[1., 2., 3., 4., 5., 6., 7., 8., 9.],
       [1., 2., 3., 4., 5., 6., 7., 8., 9.],
       [1., 2., 3., 4., 5., 6., 7., 8., 9.],
       [1., 2., 3., 4., 5., 6., 7., 8., 9.],
       [1., 2., 3., 4., 5., 6., 7., 8., 9.],
       [1., 2., 3., 4., 5., 6., 7., 8., 9.],
       [1., 2., 3., 4., 5., 6., 7., 8., 9.],
       [1., 2., 3., 4., 5., 6., 7., 8., 9.],
       [1., 2., 3., 4., 5., 6., 7., 8., 9.]])
But after some processing, which is irrelevant here, it now looks like this:
array([[1., 2., 0., 4., 5., 6., 0., 8., 9.],
       [0., 2., 0., 0., 5., 6., 7., 8., 9.],
       [0., 2., 0., 4., 5., 0., 7., 0., 9.],
       [1., 2., 0., 4., 5., 6., 7., 8., 9.],
       [1., 2., 3., 4., 5., 0., 7., 8., 9.],
       [0., 2., 0., 4., 5., 6., 0., 8., 9.],
       [1., 2., 0., 4., 5., 6., 7., 8., 9.],
       [1., 2., 0., 4., 5., 6., 7., 8., 9.],
       [1., 2., 0., 4., 5., 6., 7., 8., 0.]])
As you can see, some of the items have been "zeroed out" quite randomly, but only the value 3 was left with a single item that isn't zero. I'm looking for a function that takes this array and returns the index / row number of that value (3 here, or whichever value appears once and only once in the array).
To explain this differently:
I first have to figure out if there is such an item that only appears once (in this example the answer is yes and that item is the number 3), and then I need to return its row number (in this case 4 since the only line with 3 in it is: my_array[4])
I have successfully done that with iterating over the array, item by item, and counting the number of times each number appears (and returning only the item whose count is 1) and then iterating over everything a second time to find the correct index / row number of where that item is located.
This seems very inefficient, especially if the array will be larger. Is there a better way in numpy to do this?
EDIT: if the number that appears only once is 0, that shouldn't count; I'm only looking for the "column" that was zeroed out completely except for one item in it.
Try using the numpy.count_nonzero method
numpy.count_nonzero(arr, axis=0)
This will count the non-zero values columnwise
I will leave the rest to you. Good Luck
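For completeness, here is one hedged sketch finishing that idea (not part of the original answer): find the column whose non-zero count is exactly 1, then locate the row of its lone surviving entry:
import numpy as np

x = np.array([[1., 2., 0., 4., 5., 6., 0., 8., 9.],
              [0., 2., 0., 0., 5., 6., 7., 8., 9.],
              [0., 2., 0., 4., 5., 0., 7., 0., 9.],
              [1., 2., 0., 4., 5., 6., 7., 8., 9.],
              [1., 2., 3., 4., 5., 0., 7., 8., 9.],
              [0., 2., 0., 4., 5., 6., 0., 8., 9.],
              [1., 2., 0., 4., 5., 6., 7., 8., 9.],
              [1., 2., 0., 4., 5., 6., 7., 8., 9.],
              [1., 2., 0., 4., 5., 6., 7., 8., 0.]])

counts = np.count_nonzero(x, axis=0)  # non-zero entries per column
col = np.where(counts == 1)[0][0]     # the column with a single survivor
row = np.nonzero(x[:, col])[0][0]     # the row holding that lone value
print(row, x[row, col])               # 4 3.0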
Edit: I wasn't even using the mask; you can just use the first and last lines:
x = np.array([[1., 2., 0., 4., 5., 6., 0., 8., 9.],
              [0., 2., 0., 0., 5., 6., 7., 8., 9.],
              [0., 2., 0., 4., 5., 0., 7., 0., 9.],
              [1., 2., 0., 4., 5., 6., 7., 8., 9.],
              [1., 2., 3., 4., 5., 0., 7., 8., 9.],
              [0., 2., 0., 4., 5., 6., 0., 8., 9.],
              [1., 2., 0., 4., 5., 6., 7., 8., 9.],
              [1., 2., 0., 4., 5., 6., 7., 8., 9.],
              [1., 2., 0., 4., 5., 6., 7., 8., 0.]])
res = (x == 3)
print(np.where(res * x)[0])
Output:
[4]
The full response to np.where() is:
(array([4], dtype=int64), array([2], dtype=int64))
So if you wanted both the column and the row number, you could use both of these.
I have a set of data which is in columns, where the first column is the x values. How do I read this in?
If you want to store both the x and y values, you can do
ydat = np.zeros((data.shape[1]-1,data.shape[0],2))
# write the x data
ydat[:,:,0] = data[:,0]
# write the y data
ydat[:,:,1] = data[:,1:].T
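Here, data is assumed to be the 2-D array read from the file; a minimal sketch of that reading step, assuming a plain whitespace-delimited text file (the filename is hypothetical):
import numpy as np

# hypothetical filename; columns separated by whitespace
data = np.loadtxt('data.txt')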
Edit:
If you want to store only the y-data in the sub arrays you can simply do
ydat = data[:,1:].T
Working example:
t = np.array([[ 0., 0., 1., 2.],
              [ 1., 0., 1., 2.],
              [ 2., 0., 1., 2.],
              [ 3., 0., 1., 2.],
              [ 4., 0., 1., 2.]])
a = t[:,1:].T
a
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.,  2.]])