I am new to Python programming and I have a problem assigning specific values to the first column of a very large numpy.array.
This is the code I use:
import numpy as np
a = np.zeros((365343020, 9), dtype=np.float32)
for n in range(0, 36534302):
    a[n*10:(n+1)*10, 0] = n
where the second line creates an array of 365343020 rows and 9 columns filled with zeros, and the following "for" loop is meant to replace the first column of the array with a vector whose elements are 36534302 sequential integers, each repeated 10 times (e.g. [0,0,…,0, 1,1,…,1, 2,2,…, 36534301, 36534301,…, 36534301]).
The code seems to behave as desired until around row 168000000 of the array; after that, the 10 repetitions of each odd number are replaced by a second set of 10 repetitions of the (even) number that comes before it.
I have looked for explanations regarding the difference between views and copies. However, even when I try to manually set the content of a specific cell of the array (one of those the loop fills wrongly), the value does not change.
Could you please help me in solving this problem?
Thanks
Maybe your program is consuming too much memory. Here is some basic math for your code.
Data type: float32
Bits used: 32 bits
Number of elements: 3288087180 (365343020 * 9)
Total memory consumed: 105218789760 bits (13.15234872 GB)
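These figures can be double-checked without actually allocating the array, for example:

import numpy as np

rows, cols = 365343020, 9
itemsize = np.dtype(np.float32).itemsize   # 4 bytes per float32 element
print(rows * cols)                          # 3288087180 elements
print(rows * cols * itemsize)               # 13152348720 bytes, about 13.15 GB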
1. Try using a smaller dtype (float16 is the smallest float NumPy offers) if the values being stored in the array are not large.
2. Try to decrease your array size.
3. Both 1 and 2.
In my original code I have the following function:
B = np.inner(A,x)
where A.shape = [307_200] and has values -1 or 1
where x.shape = [307_200] and has values 0 to 256
where B results in an integer with a large value.
Assuming I know A and B, but don't know x, how can I solve for x?
To simplify the problem...
import numpy as np
A = np.random.choice(a=[-1,1], size=10)
x = np.random.choice(a=range(0,256), size=10)
B = np.inner(A, x)
I want to solve for x now. So something like one of the following...
x_solved = np.linalg.solve(A, B)
x_solved = np.linalg.lstsq(A, B)
Is it possible?
Extra info...
I could change A to be an n x m matrix, but since I am dealing with large matrices, when I try to use lstsq I quickly run out of memory. This is bad because 1. I can't run it on my local machine and 2. the end-use application needs to limit RAM.
However, for the problem above, I can accept RAM-intensive solutions, since I might be able to moderate the compute resources with some clever tricks.
Also, we could switch A to boolean values if that would help.
Apologies if the solution is obvious or simple.
Thanks for any help.
Here is your problem re-stated:
I have an array A containing many 1s and -1s. I want to make another array x containing integers 0-255 so that when I multiply each entry of x by the corresponding entry of A, then add up all the products, I get some target number B.
Notice that the problem is just as difficult if you shuffle the array elements. So let's shuffle them so all the 1s are at the start and all the -1s are at the end. After solving this simplified version of the problem, we can shuffle them back.
Now the simplified problem is this:
I have some number of 1s and some number of -1s. I want to make two arrays, x1 (matching the 1s) and x-1 (matching the -1s), containing numbers from 0-255, so that when I add up all the numbers in x1 and subtract all the numbers in x-1 I get some target number B.
Can you work out how to solve this?
I'd start by filling x1 with 255s until the next 255 would make the sum too high, then fill the next entry with the number that makes the sum equal the target, then fill the rest with 0s. Then fill x-1 with 0s. If the target number is negative, do the opposite. Then un-shuffle it: match up the x1 and x-1 arrays with the positions of the 1s and -1s in your array A. And you're done.
You can actually write that algorithm so it puts the numbers directly in x without needing to make the temporary arrays x1 and x-1.
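Here is a minimal sketch of that direct version (the function name solve_for_x and its structure are illustrative, not part of the original question):

import numpy as np

def solve_for_x(A, B):
    """Return one x with entries in 0..255 such that np.inner(A, x) == B."""
    x = np.zeros_like(A)
    remaining = B
    # Give each slot as much of the remaining target as its sign allows,
    # at most 255 per slot; slots with the "wrong" sign are left at 0.
    for i, sign in enumerate(A):
        if sign * remaining > 0:
            step = min(abs(remaining), 255)
            x[i] = step
            remaining -= sign * step
    if remaining != 0:
        raise ValueError("target not reachable with entries in 0..255")
    return x

A = np.random.choice([-1, 1], size=10)
x_true = np.random.randint(0, 256, size=10)
B = np.inner(A, x_true)
x_solved = solve_for_x(A, B)
assert np.inner(A, x_solved) == B   # x_solved need not equal x_true

As the answer implies, many different x satisfy the equation; this just produces one of them.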
So I am currently trying to store a large sparse dataset (4.9 million rows and 6000 columns) in CSR format. The dense format causes a memory error, so I am loading it in line by line from a TSV file.
Here is how I do that:
import numpy as np
from scipy.sparse import csr_matrix

rows = np.empty(4865518, dtype=np.int16)
cols = np.empty(165050535, dtype=np.int16)
values = np.empty(165050535, dtype=np.int16)
labels = np.empty(4865517, dtype=np.int8)

file = open(r'HI-union-allFeatures\HI-union-allFeatures-nonZero-train0.tsv', 'r')
count = 0
nnz = 0
col_count = 0
for l in file:
    if count > 0:
        l = l.strip().split("\t")
        line = l[2:-1]
        labels[count-1] = l[-1]
        for pair in line:
            pair = pair.split()
            cols[col_count] = pair[0]
            cols[col_count] -= 3
            values[col_count] = pair[1]
            col_count += 1
        nnz += len(line)
        rows[count] = nnz
    count += 1
cols.astype(np.int16, copy=False)  # cols gets stored as 32 bit for some reason
cols.shape    # (165050535,)
rows.shape    # (4865518,)
values.shape  # (165050535,)
data = csr_matrix((values, cols, rows), copy=False)
data.nnz            # 30887
data.data.shape     # should match values.shape but output is (30887,)
data.indices.shape  # should match cols.shape but output is (30887,)
data.indptr.shape   # matches rows.shape (4865518,)
However, after creating the csr_matrix, it just eliminates some of the values, and I don't understand why. Here is the screenshot showing that data.data.shape does not match values.shape. I also verified the data in the original rows, cols and values arrays, and they represent the data perfectly, so I don't understand this behaviour. My PC is not running out of memory: I have 16 GB of RAM and this program barely takes up 1 GB. EDIT: This is my first question here, so I'm sorry if I didn't post it correctly. Any help would be great.
Link to the screenshot
np.empty doesn't initialize arrays to zero. The value of rows[0] could be anything.
empty, unlike zeros, does not set the array values to zero, and may therefore be marginally faster. On the other hand, it requires the user to manually set all the values in the array, and should be used with caution
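A tiny illustration of that point (the exact garbage values will differ from run to run):

import numpy as np

print(np.zeros(3, dtype=np.int16))  # [0 0 0] -- always zeros
print(np.empty(3, dtype=np.int16))  # whatever happened to be in that memory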
int16 has a maximum value of 32767, but your row pointers need to reach about 165 million. Values that large wrap around when stored in an int16 array, which is why your data now looks much smaller than it should be.
Both of these things are huge problems. Without example data, providing a working fix as an answer is not possible.
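As an illustration of the wrap-around, using the total non-zero count from the question: 165050535 modulo 2**16 is 30887, which is exactly the data.nnz reported above.

import numpy as np

nnz_total = 165050535                          # the value rows[-1] should hold
wrapped = np.array([nnz_total]).astype(np.int16)
print(wrapped)                                 # [30887] -- truncated to 16 bits
print(nnz_total % 2**16)                       # 30887, matching the reported data.nnz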
I'm trying to use this 1000 dimension wikipedia word2vec model to analyze some documents.
Using introspection I found out that the vector representation of a word is a 1000-dimension numpy.ndarray; however, whenever I try to create an ndarray to find the nearest words I get a ValueError:
ValueError: maximum supported dimension for an ndarray is 32, found 1000
and from what I can tell by looking around online 32 is indeed the maximum supported number of dimensions for an ndarray - so what gives? How is gensim able to output a 1000 dimension ndarray?
Here is some example code:
doc = [model[word] for word in text if word in model.vocab]
out = []
n = len(doc[0])
print(n)
print(len(model["hello"]))
print(type(doc[0]))
for i in range(n):
    sum = 0
    for d in doc:
        sum += d[i]
    out.append(sum/n)
out = np.ndarray(out)
which outputs:
1000
1000
<class 'numpy.ndarray'>
ValueError: maximum supported dimension for an ndarray is 32, found 1000
The goal here would be to compute the average vector of all words in the corpus, in a format that can be used to find nearby words in the model, so any alternative suggestions to that effect are welcome.
You're calling numpy's ndarray() constructor-function with a list that has 1000 numbers in it – your hand-calculated averages of each of the 1000 dimensions.
The ndarray() function expects its argument to be the shape of the matrix being constructed, so it's trying to create a new matrix of shape (out[0], out[1], ..., out[999]) – and then every individual value inside that matrix would be addressed with a 1000-int set of coordinates. And indeed, numpy arrays can only have 32 independent dimensions.
But even if you reduced the list you're supplying to ndarray() to just 32 numbers, you'd still have a problem, because your 32 numbers are floating-point values, and ndarray() is expecting integral counts. (You'd get a TypeError.)
Along the approach you're trying to take – which isn't quite optimal, as we'll get to below – you really want to create a single vector of 1000 floating-point dimensions. That is, 1000 cell-like values – not out[0] * out[1] * ... * out[999] separate cell-like values.
So a crude fix along the lines of your initial approach could be replacing your last line with:
result = np.ndarray(len(out))
for i in range(len(out)):
    result[i] = out[i]
But there are many ways to incrementally make this more efficient, compact, and idiomatic – a number of which I'll mention below, even though the best approach, at bottom, makes most of these interim steps unnecessary.
For one, instead of that assignment-loop in my code just above, you could use Python's bracket-indexing assignment option:
result = np.ndarray(len(out))
result[:] = out  # same result as the previous 3 lines with the loop
But in fact, numpy's array() function can essentially create the necessary numpy-native ndarray from a given list, so instead of using ndarray() at all, you could just use array():
result = np.array(out)  # same result as the previous 2 lines
But further, numpy's many functions for natively working with arrays (and array-like lists) already include things to do averages-of-many-vectors in a single step (where even the looping is hidden inside very-efficient compiled code or CPU bulk-vector operations). For example, there's a mean() function that can average lists of numbers, or multi-dimensional arrays of numbers, or aligned sets of vectors, and so forth.
This allows faster, clearer, one-liner approaches that can replace your entire original code with something like:
# get a list of available word-vectors
doc = [model[word] for word in text if word in model.vocab]
# average all those vectors
out = np.mean(doc, axis=0)
(Without the axis argument, it'd average together all the individual dimension-values, in all slots, into just one single final average number.)
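A tiny illustration of that difference, with two made-up 3-dimensional vectors:

import numpy as np

vecs = [np.array([1.0, 2.0, 3.0]),
        np.array([3.0, 4.0, 5.0])]

print(np.mean(vecs, axis=0))  # [2. 3. 4.] -- one average per dimension
print(np.mean(vecs))          # 3.0 -- every value collapsed into a single number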
I'm trying to find an efficient way to transform an array in the following way:
Each element will get transformed into either None, a real number, or a tuple/list/array of size 2 (containing real numbers).
The transformation function I'm using is simple and just does numeric comparisons, so my first thought was to use np.where for fast comparisons. Now, if the transformation result is None or a real number, I have no problems.
But when the transformation result is a tuple/list/array, np.where gives me errors. This is of course because numpy arrays demand regularity of dimensions. So now I'm forced to work with lists...
So my idea now is, instead of transforming the element into a tuple/list/array of size 2, to transform it into a complex number. But then I have an array of complex numbers consisting mostly of numbers with zero imaginary part (since most transformations will be None or real numbers). I can't afford this, memory-wise. (Or can I?)
Once I have the transformed list/array/whatever, I will be doing sign operations, arithmetic between its elements, and comparisons again; that's why I would like to keep it as a numpy array.
Am I forced to work with lists in this scenario or would you do something else?
EDIT:
I'm asked to give concrete examples of my transformation:
Input: an array containing elements whose values are None or real numbers in [0, 360)
Transformation (simplified):
None goes to None
an element in [0, 45) goes to 2 real numbers (left, right), say 2 random real numbers between 0 and the element
an element in [45, 360) goes to 1 real number
What I do is, for example:
arrayTransformed = np.where((array >= 0) & (array < 45), transform(array), array)
# this gives problems, of course
arrayTransformed = np.where((array >= 45) & (array <= 360), transform(array), arrayTransformed)
Basically I have over 1000 3D arrays with the shape (100, 100, 1000), so some pretty large arrays, which I need to use in some calculations. The great thing about Python and NumPy is that, instead of iterating, calculations on each element and such can be done very quickly. For example, I can compute the element-wise sum over all the 3D arrays almost instantly; the result is one large array holding, at each index, the sum over the arrays at that index. In principle, that is ALMOST what I want to do; however, there is a bit of a problem.
What I need to do is use an equation with two terms (the equation itself was an image in the original post).
So, as stated, I have around 1000 3D arrays; stacked together, the total array has shape (1000, 100, 100, 1000). I also have a list of length 1000 that corresponds to the 1000 3D arrays, and each index of that list contains either a 1 or a 0. If it is a 1, the entire 3D array at that index should go into the first term of the equation; if it is a 0, it goes into the other term.
However, I am very much in doubt about how to do this without resorting to some kind of looping that might slow the calculations down a great deal.
You could sort this out by locating the 1s and 0s.
Something like:
# `indicator` here stands for the length-1000 array of 1s and 0s from the question
# (as a NumPy array), and `Array` for the stacked (1000, 100, 100, 1000) array.
list_ones = np.where(indicator == 1)[0]
list_zeros = np.where(indicator == 0)[0]
Then Array[list_ones, :, :, :] will contain all the 3D arrays corresponding to a one, and Array[list_zeros, :, :, :] all those corresponding to a zero.
Then you can just put
first_term = Array[list_ones, :, :, :]
second_term = Array[list_zeros, :, :, :]
And sum as appropriate.
Would this work for your purpose?
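If the 0/1 list is already a NumPy array, a boolean mask does the same split and sum even more directly; a small sketch with made-up data (the names Array and indicator stand in for the question's actual data):

import numpy as np

# Tiny stand-ins: 6 "3D arrays" of shape (2, 2, 3) and a 0/1 flag per array
Array = np.random.rand(6, 2, 2, 3)
indicator = np.array([1, 0, 1, 1, 0, 0])

first_term = Array[indicator == 1].sum(axis=0)   # sum of the arrays flagged with 1
second_term = Array[indicator == 0].sum(axis=0)  # sum of the arrays flagged with 0
print(first_term.shape, second_term.shape)       # (2, 2, 3) (2, 2, 3)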