How to convert occurence matrix to co-occurence matrix in Python

How to convert occurence matrix to co-occurence matrix in Python - python

I asked this question here: How to convert occurence matrix to co-occurence matrix
I realized that my data is so big that it is not possible to do this using R. My computer hangs. The actual data is a text file with ~5 million rows and 600 columns. I think Python may be an alternate option to do this.

This would be the way you translate the R code to Python code.
>>> import numpy as np
>>> a=np.array([[0, 1, 0, 0, 1, 1],
[0, 0, 1, 1, 0, 1],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 1, 1]])
>>> acov=np.dot(a.T, a)
>>> acov[np.diag_indices_from(acov)]=0
>>> acov
array([[0, 2, 2, 1, 1, 1],
[2, 0, 2, 1, 2, 2],
[2, 2, 0, 2, 1, 2],
[1, 1, 2, 0, 0, 1],
[1, 2, 1, 0, 0, 2],
[1, 2, 2, 1, 2, 0]])
However, you have a very big dataset. If you don't want to assemble the co-occurence matrix piece by piece and you store your values in int64, with 3e+9 numbers it will take 24GB of RAM alone just to hold the data http://www.wolframalpha.com/input/?i=3e9+*+8+bytes. So you probably want to think over and decide which dtype you want to store your data in: http://docs.scipy.org/doc/numpy/user/basics.types.html. Using int16 probably will make the dot product operation possible on a decent desktop PC nowadays.

Related

Is there a python or pandas function to cut a list into groups that are equal sized or as symmetric as possible?

I would like a function that functions like pandas.qcut however, gives me as 'balanced' or 'symmetric' splits as possible.
At the moment, if I use:
pd.qcut(range(1, 11), 3, labels=False, duplicates="drop")
I get:
array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2], dtype=int64)
But I would like the middle group to have four entries instead i.e.:
array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2], dtype=int64).

Integer arithmetic will do it.
import pandas as pd
pd.Series([i%3 for i in range(1, 11)]).sort_values().values
output
array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])

How to change all values to the left of a particular cell in every row

I have an array which contains 1's and 0's. A very small section of it looks like this:
arr=[[0,0,0,0,1],
[0,0,1,0,0],
[0,1,0,0,0],
[1,0,1,0,0]]
I want to change the value of every cell to 1, if it is to the left of a cell with a value of 1. I want all other cells to keep their value of 0, i.e:
arrOut=[[1,1,1,1,1],
[1,1,1,0,0],
[1,1,0,0,0]
[1,1,1,0,0]
Some rows have >1 cell with a value =1.
I have managed to do this using a very ugly double for-loop:
for i in range(len(arr)):
for j in range(len(arr[i])):
if arr[i][j]==1:
arrOut[i][0:j]=1
Does anyone know of another way to do this with using for loops? I'm relatively comfortable with numpy and pandas, but also open to other libraries.
Thanks!

You can flip using it, and use np.cumsum:
>>> arr[:, ::-1].cumsum(axis=1)[:, ::-1]
array([[1, 1, 1, 1, 1],
[1, 1, 1, 0, 0],
[1, 1, 0, 0, 0]], dtype=int32)
Or the same using np.fliplr,
>>> np.fliplr(np.fliplr(arr).cumsum(axis=1))
array([[1, 1, 1, 1, 1],
[1, 1, 1, 0, 0],
[1, 1, 0, 0, 0]], dtype=int32)
Using np.where:
>>> np.where(arr.cumsum(1)==0, 1, arr)
array([[1, 1, 1, 1, 1],
[1, 1, 1, 0, 0],
[1, 1, 0, 0, 0]], dtype=int32)
If array has more than one 1, use np.clip:
>>> arr
array([[0, 0, 0, 0, 1],
[0, 0, 1, 0, 0],
[0, 1, 0, 1, 0]])
>>> np.clip(arr[:, ::-1].cumsum(axis=1)[:, ::-1], 0, 1)
array([[1, 1, 1, 1, 1],
[1, 1, 1, 0, 0],
[1, 1, 1, 1, 0]], dtype=int32)
# If you want to make all 0s before the leftmost 1 to 1:
>>> np.where(arr.cumsum(1)==0, 1, arr)
array([[1, 1, 1, 1, 1],
[1, 1, 1, 0, 0],
[1, 1, 0, 1, 0]])

A possibility without using libraries:
for subarray in arr:
#Get the index of the 1 starting from behind so that if there are 2 1s,
#you get the index of the rightmost one
indexof1 = len(subarray) -1 - subarray[::-1].index(1)
#Until the 1, replace with 1s
subarray[:indexof1] = [1]*len(subarray[:indexof1])

Another solution using np.maximum.accumulate:
np.maximum.accumulate(arr[:,::-1],axis=1)[:,::-1]
Where np.maximum.accumulate simply apply a cumulative maximum (cummax).

Compute distances between all points in array efficiently using Python

I have a list of N=3 points like this as input:
points = [[1, 1], [2, 2], [4, 4]]
I wrote this code to compute all possible distances between all elements of my list points, as dist = min(∣x1−x2∣,∣y1−y2∣):
distances = []
for i in range(N-1):
for j in range(i+1,N):
dist = min((abs(points[i][0]-points[j][0]), abs(points[i][1]-points[j][1])))
distances.append(dist)
print(distances)
My output will be the array distances with all the distances saved in it: [1, 3, 2]
It works fine with N=3, but I would like to compute it in a more efficiently way and be free to set N=10^5.
I am trying to use also numpy and scipy, but I am having a little trouble with replacing the loops and use the correct method.
Can anybody help me please? Thanks in advance

The numpythonic solution
To compute your distances using the full power of Numpy, and do it
substantially faster:
Convert your points to a Numpy array:
pts = np.array(points)
Then run:
dist = np.abs(pts[np.newaxis, :, :] - pts[:, np.newaxis, :]).min(axis=2)
Here the result is a square array.
But if you want to get a list of elements above the diagonal,
just like your code generates, you can run:
dist2 = dist[np.triu_indices(pts.shape[0], 1)].tolist()
I ran this code for the following 9 points:
points = [[1, 1], [2, 2], [4, 4], [3, 5], [2, 8], [4, 10], [3, 7], [2, 9], [4, 7]]
For the above data, the result saved in dist (a full array) is:
array([[0, 1, 3, 2, 1, 3, 2, 1, 3],
[1, 0, 2, 1, 0, 2, 1, 0, 2],
[3, 2, 0, 1, 2, 0, 1, 2, 0],
[2, 1, 1, 0, 1, 1, 0, 1, 1],
[1, 0, 2, 1, 0, 2, 1, 0, 1],
[3, 2, 0, 1, 2, 0, 1, 1, 0],
[2, 1, 1, 0, 1, 1, 0, 1, 0],
[1, 0, 2, 1, 0, 1, 1, 0, 2],
[3, 2, 0, 1, 1, 0, 0, 2, 0]])
and the list of elements from upper diagonal part is:
[1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 0, 2, 1, 0, 2, 1, 2, 0, 1, 2, 0, 1, 1, 0, 1, 1,
2, 1, 0, 1, 1, 1, 0, 1, 0, 2]
How faster is my code
It turns out that even for such small sample like I used (9
points), my code works 2 times faster. For a sample of 18 points
(not presented here) - 6 times faster.
This difference in speed has been gained even though my function
computes "2 times more than needed" i.e. it generates a full
array, whereas the lower diagonal part of the result in a "mirror
view" of the upper diagonal part (what computes your code).
For bigger number of points the difference should be much bigger.
Make your test on a bigger sample of points (say 100 points) and write how
many times faster was my code.

making an array of n columns where each successive row increases by one

In numpy, I would like to be able to input n for rows and m for columns and end with the array that looks like:
[(0,0,0,0),
(1,1,1,1),
(2,2,2,2)]
So that would be a 3x4. Each column is just a copy of the previous one and the row increases by one each time. As an example:
input would be 4, then 6 and the output would be and array
[(0,0,0,0,0,0),
(1,1,1,1,1,1),
(2,2,2,2,2,2),
(3,3,3,3,3,3)]
4 rows and 6 columns where the row increases by one each time. Thanks for your time.

So many possibilities...
In [51]: n = 4
In [52]: m = 6
In [53]: np.tile(np.arange(n), (m, 1)).T
Out[53]:
array([[0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1],
[2, 2, 2, 2, 2, 2],
[3, 3, 3, 3, 3, 3]])
In [54]: np.repeat(np.arange(n).reshape(-1,1), m, axis=1)
Out[54]:
array([[0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1],
[2, 2, 2, 2, 2, 2],
[3, 3, 3, 3, 3, 3]])
In [55]: np.outer(np.arange(n), np.ones(m, dtype=int))
Out[55]:
array([[0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1],
[2, 2, 2, 2, 2, 2],
[3, 3, 3, 3, 3, 3]])
Here's one more. The neat trick here is that the values are not duplicated--only memory for the single sequence [0, 1, 2, ..., n-1] is allocated.
In [67]: from numpy.lib.stride_tricks import as_strided
In [68]: seq = np.arange(n)
In [69]: rep = as_strided(seq, shape=(n,m), strides=(seq.strides[0],0))
In [70]: rep
Out[70]:
array([[0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1],
[2, 2, 2, 2, 2, 2],
[3, 3, 3, 3, 3, 3]])
Be careful with the as_strided function. If you don't get the arguments right, you can crash Python.
To see that seq has not been copied, change seq in place, and then check rep:
In [71]: seq[1] = 99
In [72]: rep
Out[72]:
array([[ 0, 0, 0, 0, 0, 0],
[99, 99, 99, 99, 99, 99],
[ 2, 2, 2, 2, 2, 2],
[ 3, 3, 3, 3, 3, 3]])

import numpy as np
def foo(n, m):
return np.array([np.arange(n)] * m).T

Natively (no Python lists):
rows, columns = 4, 6
numpy.arange(rows).reshape(-1, 1).repeat(columns, axis=1)
#>>> array([[0, 0, 0, 0, 0, 0],
#>>> [1, 1, 1, 1, 1, 1],
#>>> [2, 2, 2, 2, 2, 2],
#>>> [3, 3, 3, 3, 3, 3]])

You can easily do this using built in python functions. The program counts to 3 converting each number to a string and repeats the string 6 times.
print [6*str(n) for n in range(0,4)]
Here is the output.
ks-MacBook-Pro:~ kyle$ pbpaste | python
['000000', '111111', '222222', '333333']

On more for fun
np.zeros((n, m), dtype=np.int) + np.arange(n, dtype=np.int)[:,None]

As has been mentioned, there are many ways to do this.
Here's what I'd do:
import numpy as np
def makearray(m, n):
A = np.empty((m,n))
A.T[:] = np.arange(m)
return A
Here's an amusing alternative that will work if you aren't going to be changing the contents of the array.
It should save some memory.
Be careful though because this doesn't allocate a full array, it will have multiple entries pointing to the same memory address.
import numpy as np
from numpy.lib.stride_tricks import as_strided
def makearray(m, n):
A = np.arange(m)
return as_strided(A, strides=(A.strides[0],0), shape=(m,n))
In either case, as I have written them, a 3x4 array can be created by makearray(3, 4)

Using count from the built-in module itertools:
>>> from itertools import count
>>> rows = 4
>>> columns = 6
>>> cnt = count()
>>> [[cnt.next()]*columns for i in range(rows)]
[[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1], [2, 2, 2, 2, 2, 2], [3, 3, 3, 3, 3, 3]]

you can simply
>>> nc=5
>>> nr=4
>>> [[k]*nc for k in range(nr)]
[[0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [2, 2, 2, 2, 2], [3, 3, 3, 3, 3]]

Several other possibilities using a (n,1) array
a = np.arange(n)[:,None] (or np.arange(n).reshape(-1,1))
a*np.ones((m),dtype=int)
a[:,np.zeros((m),dtype=int)]
If used with a (m,) array, just leave it (n,1), and let broadcasting expand it for you.

Python: multidimensional array masking

What is equivalent pythonic implementation for the following simple piece of code in Matlab.
Matlab:
B = 2D array of integers as indices [1...100]
A = 2D array of numbers: [10x10]
A[B] = 0
which works well as for example for B[i]=42 it finds the position 2 of column 5 to be set.
In Python it causes an error: out of bound which is logical. However to translate the above Matlab code into Python we are looking for pythonic ways.
Please also consider the problem for higher dimensions such as:
B = 2D array of integers as indices [1...3000]
C = 3D array of numbers: [10x10x30]
C[B] = 0
One way we thought about it is to reform indices array elements as i,j instead of being absolute position. That is, position 42 to divmod(42,m=10)[::-1] >>> (2,4). So we will have a nx2 >>> ii,jj vectors of indices which can be used for indexing A easily.
We thought that it might be a better way, efficient also for higher dimensions in Python.

You can use .ravel() on the array (A) before indexing it, and then .reshape() after.
Alternatively, since you know A.shape, you can use np.unravel_index on the other array (B) before indexing.
Example 1:
>>> import numpy as np
>>> A = np.ones((5,5), dtype=int)
>>> B = [1, 3, 7, 23]
>>> A
array([[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1]])
>>> A_ = A.ravel()
>>> A_[B] = 0
>>> A_.reshape(A.shape)
array([[1, 0, 1, 0, 1],
[1, 1, 0, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 0, 1]])
Example 2:
>>> b_row, b_col = np.vstack([np.unravel_index(b, A.shape) for b in B]).T
>>> A[b_row, b_col] = 0
>>> A
array([[1, 0, 1, 0, 1],
[1, 1, 0, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 0, 1]])
Discovered later: you can use numpy.put
>>> import numpy as np
>>> A = np.ones((5,5), dtype=int)
>>> B = [1, 3, 7, 23]
>>> A.put(B, [0]*len(B))
>>> A
array([[1, 0, 1, 0, 1],
[1, 1, 0, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 0, 1]])

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to convert occurence matrix to co-occurence matrix in Python - python

Related

Is there a python or pandas function to cut a list into groups that are equal sized or as symmetric as possible?

How to change all values to the left of a particular cell in every row

Compute distances between all points in array efficiently using Python

making an array of n columns where each successive row increases by one

Python: multidimensional array masking

Categories

Resources