Wrong CSV printing in Python (enumerating numpy array)

I apologize if this question looks like a duplicate. I am trying to write a 7x2 array to a .csv file. The array I want to print is called x5:
x5
Out[47]:
array([[ 0.5,  1. ],
       [ 0.7,  3. ],
       [ 1.1,  5. ],
       [ 1.9,  6. ],
       [ 2. ,  7. ],
       [ 2.2,  9. ],
       [ 3.1, 10. ]])
The code I use:
import time
import csv
import numpy
timestr = time.strftime("%Y%m%d-%H%M%S")
with open('mydir\\AreaIntCurve' + '_' + str(timestr) + '.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Unique value', ' Occurrences'])
    for m, val in numpy.ndenumerate(x5):
        writer.writerow([m, val])
The result I get:
Unique value, Occurrences
"(0, 0)",0.5
"(0, 1)",1.0
"(1, 0)",0.69999999999999996
"(1, 1)",3.0
"(2, 0)",1.1000000000000001
"(2, 1)",5.0
"(3, 0)",1.8999999999999999
"(3, 1)",6.0
"(4, 0)",2.0
"(4, 1)",7.0
"(5, 0)",2.2000000000000002
"(5, 1)",9.0
"(6, 0)",3.1000000000000001
"(6, 1)",10.0
The result I want:
Unique value, Occurrences
0.5, 1
0.7, 3
1.1, 5
1.9, 6
2.0, 7
2.2, 9
3.1, 10
I assume the problem is with ndenumerate(x5), which yields the coordinates of my values along with the values themselves. I have tried different approaches like numpy.savetxt, but it did not produce what I want and also did not put the current date in the file name. How can I amend the ndenumerate() call to get rid of the value coordinates, while keeping the current date in the file name? Thanks a lot!

Here's an alternative that uses numpy.savetxt instead of the csv library:
In [17]: x5
Out[17]:
array([[ 0.5,  1. ],
       [ 0.7,  3. ],
       [ 1.1,  5. ],
       [ 1.9,  6. ],
       [ 2. ,  7. ],
       [ 2.2,  9. ],
       [ 3.1, 10. ]])
In [18]: np.savetxt('foo.csv', x5, fmt=['%4.1f', '%4i'], header='Unique value, Occurrences', delimiter=',', comments='')
In [19]: !cat foo.csv
Unique value, Occurrences
0.5, 1
0.7, 3
1.1, 5
1.9, 6
2.0, 7
2.2, 9
3.1, 10
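Note that savetxt takes an ordinary file path, so the timestamped file name from the question works here unchanged. A minimal sketch combining the two (assuming x5 is the array shown above):
import time
import numpy as np

timestr = time.strftime("%Y%m%d-%H%M%S")
# savetxt accepts any path, so the date lands in the file name exactly as before
np.savetxt('mydir\\AreaIntCurve_' + timestr + '.csv', x5,
           fmt=['%4.1f', '%4i'], header='Unique value, Occurrences',
           delimiter=',', comments='')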

Replace these lines
for m, val in numpy.ndenumerate(x5):
    writer.writerow([m, val])
with:
for val in x5:
    writer.writerow(val)
You don't need ndenumerate here; iterating over the array directly yields one row at a time.
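For context, the whole script with that change applied would look like this (a sketch, assuming x5 is the 7x2 array from the question; newline='' is added because the csv module recommends it when writing files, which avoids blank lines on Windows):
import time
import csv

timestr = time.strftime("%Y%m%d-%H%M%S")
with open('mydir\\AreaIntCurve' + '_' + timestr + '.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Unique value', ' Occurrences'])
    for val in x5:            # each val is one row of the array, e.g. [0.5, 1.0]
        writer.writerow(val)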

Have you tried replacing your last two lines of code with
for x in x5:
    writer.writerow(x)
?
You may be surprised to see 1.8999999999999999 instead of 1.9 in your csv result; that is because 1.9 cannot be represented exactly in floating-point arithmetic (see this question).
If you want to limit the number of digits to 3, you can replace the last line with writer.writerow(["{0:.3f}".format(val) for val in x])
But this will also add three zeroes to integer values. Since you can check if a float is an integer with is_integer(), you can avoid this with
writer.writerow([str(y) if y.is_integer() else "{0:.3f}".format(y) for y in x])
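If the goal is exactly the output shown in the question (0.7, 3, and so on), the general-purpose %g format handles both cases in one expression, since it trims trailing zeros; a minimal sketch replacing the same line:
for x in x5:
    # '{0:g}' renders 0.7 as 0.7 and 3.0 as 3, with no trailing zeros
    writer.writerow(["{0:g}".format(y) for y in x])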

Related

turning a list of numpy.ndarray to a matrix in order to perform multiplication

I have vectors of this form:
test = np.linspace(0, 1, 10)
I want to stack them horizontally in order to make a matrix. The problem is that I define them in a loop, so the first stack is between an empty matrix and the first column vector, which gives the following error:
ValueError: all the input arrays must have same number of dimensions
Bottom line: I have a for loop that creates a vector p1 on every iteration, and I want to add each one to a final matrix of the form
[p1 p2 p3 p4]
which I could then do matrix operations on, such as multiplying by the transpose.
If you've got a list of 1D arrays that you want horizontally stacked, you could convert them all to columns first, but it's probably easier to just vertically stack them and then transpose:
In [6]: vector_list = [np.linspace(0, 1, 10) for _ in range(3)]
In [7]: np.vstack(vector_list).T
Out[7]:
array([[0.        , 0.        , 0.        ],
       [0.11111111, 0.11111111, 0.11111111],
       [0.22222222, 0.22222222, 0.22222222],
       [0.33333333, 0.33333333, 0.33333333],
       [0.44444444, 0.44444444, 0.44444444],
       [0.55555556, 0.55555556, 0.55555556],
       [0.66666667, 0.66666667, 0.66666667],
       [0.77777778, 0.77777778, 0.77777778],
       [0.88888889, 0.88888889, 0.88888889],
       [1.        , 1.        , 1.        ]])
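Since the question builds the vectors one at a time in a loop, the usual pattern is to collect them in a plain Python list and make a single stacking call at the end. A sketch, with the linspace call standing in for however each p1 is really computed:
import numpy as np

columns = []
for i in range(4):
    p1 = np.linspace(0, 1, 10) * i   # placeholder for the real per-iteration vector
    columns.append(p1)

# one stacking call after the loop; axis=1 makes each vector a column
matrix = np.stack(columns, axis=1)   # shape (10, 4), i.e. [p1 p2 p3 p4]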
How did you get this dimension error? What does empty array have to do with it?
A list of arrays of the same length:
In [610]: alist = [np.linspace(0,1,6), np.linspace(10,11,6)]
In [611]: alist
Out[611]:
[array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ]),
 array([10. , 10.2, 10.4, 10.6, 10.8, 11. ])]
Several ways of making an array from them:
In [612]: np.array(alist)
Out[612]:
array([[ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ],
       [10. , 10.2, 10.4, 10.6, 10.8, 11. ]])
In [614]: np.stack(alist)
Out[614]:
array([[ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ],
       [10. , 10.2, 10.4, 10.6, 10.8, 11. ]])
If you want to join them in columns, you can transpose one of the above, or use:
In [615]: np.stack(alist, axis=1)
Out[615]:
array([[ 0. , 10. ],
       [ 0.2, 10.2],
       [ 0.4, 10.4],
       [ 0.6, 10.6],
       [ 0.8, 10.8],
       [ 1. , 11. ]])
np.column_stack is also handy.
In newer numpy versions you can do:
In [617]: np.linspace((0,10),(1,11),6)
Out[617]:
array([[ 0. , 10. ],
       [ 0.2, 10.2],
       [ 0.4, 10.4],
       [ 0.6, 10.6],
       [ 0.8, 10.8],
       [ 1. , 11. ]])
You don't specify how you create the 'empty array' or how you attempt to stack, and I can't exactly recreate the error message (a full traceback would have helped). But given that message, did you check the number of dimensions of the inputs? Did they match?
Array stacking in a loop is tricky. You have to pay close attention to the shapes, especially of the initial 'empty' array. There isn't a close analog to the empty list []. np.array([]) is 1d with shape (0,). np.empty((0,6)) is 2d with shape (0,6). Also, all the stacking functions create a new array with each call (none operate in-place), so they are inefficient compared to appending to a list; see the sketch below.
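To make the shape issue concrete, here is a sketch of a working 'empty start' next to the cheaper list-append pattern (the linspace calls are illustrative):
import numpy as np

# stacking onto an 'empty' array only works if its dimensions match the rows
result = np.empty((0, 6))                  # 2d: zero rows, 6 columns
for start in (0, 10):
    row = np.linspace(start, start + 1, 6)
    result = np.vstack([result, row])      # re-allocates the array every iteration

# the list-append equivalent defers stacking to a single call
rows = [np.linspace(start, start + 1, 6) for start in (0, 10)]
result2 = np.vstack(rows)                  # same values as result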

compress list of numbers into unique non overlapping time ranges using python

I'm from biology and very new to Python and ML. The lab has a black-box ML model which outputs a sequence like this:
Predictions =
[1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,1,0,1,0,1,1,1,1,1,0,0,0,1,1,1,1,1,1,0]
Each value represents a predicted time frame of duration 0.25 seconds.
1 means High.
0 means Not High.
How do I convert these predictions into [start, stop, label] triples, so that longer runs are grouped? For example, the first 10 ones represent 0 to 10*0.25 s, so the first range and label would be
[0.0, 2.5, High]
Next there are 13 zeros, so start = 2.5, stop = 13*0.25 + 2.5 = 5.75, label = Not-High, thus
[2.5, 5.75, Not-High]
So the final result would be a list of lists/ranges with unique, non-overlapping intervals, each with a label:
[[0.0, 2.5, High],
 [2.5, 5.75, Not-High],
 [5.75, 6.50, High] ..
What I tried:
1. Count the number of values in Predictions.
2. Generate two ranges, one starting at zero and another starting at 0.25.
3. Merge these two lists into tuples.
import numpy as np
len_pred = len(Predictions)
range_1 = np.arange(0,len_pred,0.25)
range_2 = np.arange(0.25,len_pred,0.25)
new_range = zip(range_1,range_2)
Here I'm able to get the ranges, but I'm missing the labels.
It seems like a simple problem, but I'm running in circles.
Please advise.
Thanks.
You can iterate through the list and create a range when you detect a change. You'll also need to account for the final range when using this method. It might not be super clean, but it should be effective.
current_time = 0
range_start = 0
current_value = predictions[0]
ranges = []
for p in predictions:
    if p != current_value:
        ranges.append([range_start, current_time, 'high' if current_value == 1 else 'not high'])
        range_start = current_time
        current_value = p
    current_time += .25
ranges.append([range_start, current_time, 'high' if current_value == 1 else 'not high'])
Updated to fix a few off-by-one errors.
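Wrapped as a function for reuse, the same loop might look like this (a sketch; the short input is just for demonstration):
def compress(predictions, step=0.25):
    # collapse a 0/1 sequence into [start, stop, label] ranges
    ranges = []
    range_start, current_value, current_time = 0, predictions[0], 0
    for p in predictions:
        if p != current_value:
            ranges.append([range_start, current_time,
                           'high' if current_value == 1 else 'not high'])
            range_start, current_value = current_time, p
        current_time += step
    ranges.append([range_start, current_time,
                   'high' if current_value == 1 else 'not high'])
    return ranges

print(compress([1, 1, 1, 1, 0, 0, 1]))
# [[0, 1.0, 'high'], [1.0, 1.5, 'not high'], [1.5, 1.75, 'high']]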
By using diff() and where() you can find all the indices where the value changes:
import numpy as np
p = np.array([1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,1,0,1,0,1,1,1,1,1,0,0,0,1,1,1,1,1,1,0])
idx = np.r_[0, np.where(np.diff(p) != 0)[0]+1, len(p)]
t = idx * 0.25
np.c_[t[:-1], t[1:], p[idx[:-1]]]
output:
array([[ 0.  ,  2.5 ,  1.  ],
       [ 2.5 ,  5.75,  0.  ],
       [ 5.75,  6.5 ,  1.  ],
       [ 6.5 ,  6.75,  0.  ],
       [ 6.75,  7.  ,  1.  ],
       [ 7.  ,  7.25,  0.  ],
       [ 7.25,  7.5 ,  1.  ],
       [ 7.5 ,  7.75,  0.  ],
       [ 7.75,  8.  ,  1.  ],
       [ 8.  ,  8.25,  0.  ],
       [ 8.25,  9.5 ,  1.  ],
       [ 9.5 , 10.25,  0.  ],
       [10.25, 11.75,  1.  ],
       [11.75, 12.  ,  0.  ]])
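The third column stays numeric (1.0/0.0) because a numpy array cannot mix floats and strings. If string labels are wanted, the same idx and t arrays feed a plain list comprehension; a self-contained sketch with a shortened input:
import numpy as np

p = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0])   # shortened example
idx = np.r_[0, np.where(np.diff(p) != 0)[0] + 1, len(p)]
t = (idx * 0.25).tolist()
result = [[start, stop, 'High' if label == 1 else 'Not-High']
          for start, stop, label in zip(t[:-1], t[1:], p[idx[:-1]])]
print(result)   # [[0.0, 2.5, 'High'], [2.5, 3.25, 'Not-High']]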
If I understood you correctly, I think something like this should work.
compact_prediction = list()
sequence = list()  # This will contain each sequence list [start, end, label]
last_prediction = 0
for index, prediction in enumerate(Predictions):
    if index == 0:
        sequence.append(0)  # It's the first sequence, so it will start at zero
    # Apart from the first prediction, we only end the sequence
    # when the last prediction is different from the current one,
    # signaling a change
    elif prediction != last_prediction:
        sequence.append((index - 1) * 0.25)  # We append the end of the sequence
        # And we put the label based on the last prediction
        if last_prediction == 1:
            sequence.append('High')
        else:
            sequence.append('Not-High')
        # Append to our compact list and reset the sequence
        compact_prediction.append(sequence)
        sequence = list()
        # After resetting the sequence we append the start of the new one
        sequence.append(index * 0.25)
    # Save the last prediction so we can check if it changed
    last_prediction = prediction
print(compact_prediction)
Result: [[0.0, 2.25, 'High'], [2.5, 5.5, 'Not-High'], [5.75, 6.25, 'High'], [6.5, 6.5, 'Not-High'], [6.75, 6.75, 'High'], [7.0, 7.0, 'Not-High'], [7.25, 7.25, 'High'], [7.5, 7.5, 'Not-High'], [7.75, 7.75, 'High'],
[8.0, 8.0, 'Not-High'], [8.25, 9.25, 'High'], [9.5, 10.0, 'Not-High'], [10.25, 11.5, 'High']]

Delete columns based on repeat value in one row in numpy array

I'm hoping to delete columns in my arrays that have repeat entries in row 1, as shown below (row 1 has repeats of the values 1 and 2.5, so one of each of those values has been deleted, together with the column each deleted value lies within).
initial_array =
row 0  [[  1,    1,    1,    1,    1,    1,    1,    1],
row 1   [0.5,    1,  2.5,    4,  2.5,    2,    1,  3.5],
row 2   [  1,  1.5,    3,  4.5,    3,  2.5,  1.5,    4],
row 3   [228,  314,  173,  452,  168,  351,  300,  396]]
final_array =
row 0  [[  1,    1,    1,    1,    1,    1],
row 1   [0.5,    1,  2.5,    4,    2,  3.5],
row 2   [  1,  1.5,    3,  4.5,  2.5,    4],
row 3   [228,  314,  173,  452,  351,  396]]
Ways I was thinking of included using some function that checks for repeats, giving a True response for the second (or subsequent) time a value turns up in the dataset, then using that response to delete the column. That, or possibly using the return-indices option of numpy.unique. I just can't quite find a way through it or find the right function.
If I could find a way to put the mean of the retained repeat and the deleted one into row 3, that would be even better (see below).
final_array_averaged =
row 0  [[  1,    1,      1,    1,    1,    1],
row 1   [0.5,    1,    2.5,    4,    2,  3.5],
row 2   [  1,  1.5,      3,  4.5,  2.5,    4],
row 3   [228,  307,  170.5,  452,  351,  396]]
Thanks in advance for any help you can give to a beginner who is stumped!
You can use the optional arguments of np.unique, and then np.bincount with the last row as weights, to get the final averaged output, like so -
_, unqID, tag, C = np.unique(arr[1], return_index=1, return_inverse=1, return_counts=1)
out = arr[:, unqID]
out[-1] = np.bincount(tag, arr[3]) / C
Sample run -
In [212]: arr
Out[212]:
array([[   1. ,    1. ,    1. ,    1. ,    1. ,    1. ,    1. ,    1. ],
       [   0.5,    1. ,    2.5,    4. ,    2.5,    2. ,    1. ,    3.5],
       [   1. ,    1.5,    3. ,    4.5,    3. ,    2.5,    1.5,    4. ],
       [ 228. ,  314. ,  173. ,  452. ,  168. ,  351. ,  300. ,  396. ]])
In [213]: out
Out[213]:
array([[   1. ,    1. ,    1. ,    1. ,    1. ,    1. ],
       [   0.5,    1. ,    2. ,    2.5,    3.5,    4. ],
       [   1. ,    1.5,    2.5,    3. ,    4. ,    4.5],
       [ 228. ,  307. ,  351. ,  170.5,  396. ,  452. ]])
As can be seen, the output is now ordered, with the second row sorted. If you are looking to keep the original order, use np.argsort of unqID, like so -
In [221]: out[:,unqID.argsort()]
Out[221]:
array([[   1. ,    1. ,    1. ,    1. ,    1. ,    1. ],
       [   0.5,    1. ,    2.5,    4. ,    2. ,    3.5],
       [   1. ,    1.5,    3. ,    4.5,    2.5,    4. ],
       [ 228. ,  307. ,  170.5,  452. ,  351. ,  396. ]])
You can find the indices of the wanted columns using unique:
>>> indices = np.sort(np.unique(A[1], return_index=True)[1])
Then use a simple indexing to get the desire columns:
>>> A[:,indices]
array([[   1. ,    1. ,    1. ,    1. ,    1. ,    1. ],
       [   0.5,    1. ,    2.5,    4. ,    2. ,    3.5],
       [   1. ,    1.5,    3. ,    4.5,    2.5,    4. ],
       [ 228. ,  314. ,  173. ,  452. ,  351. ,  396. ]])
This is a typical grouping problem, which can be solved elegantly and efficiently using the numpy_indexed package (disclaimer: I am its author):
import numpy_indexed as npi
unique, final_array = npi.group_by(initial_array[1]).mean(initial_array, axis=1)
Note that there are many other reductions besides mean; if you want the original behavior you described, you could replace 'mean' with 'first', for instance.

Working with multiple columns from a data file

I have a file in which I need to use the first column. The remaining columns need to be integrated with respect to the first. Let's say my file looks like this:
100 1.0 1.1 1.2 1.3 0.9
110 1.8 1.9 2.0 2.1 2.2
120 1.8 1.9 2.0 2.1 2.2
130 2.0 2.1 2.3 2.4 2.5
Could I write a piece of code that takes the second column and integrates it with respect to the first, then the third and integrates it with respect to the first, and so on? For my code I have:
import numpy as np
from scipy import integrate

dat = np.loadtxt('data.txt')  # however the file shown above is loaded
first_col = dat[:, 0]         # first column from data file
cols = dat[:, 1:]             # other columns from data file
col2 = cols[:, 0]             # gets the first column from variable cols
I = integrate.cumtrapz(col2, first_col, initial=0)  # integration step
This works only for the first column of the variable cols; however, I don't want to write this out for all the other columns, it would look disgusting (the thought of it makes me shiver). I have seen similar questions but haven't been able to relate the answers to mine, and the ones that are more or less the same have vague answers. Any ideas?
The function cumtrapz accepts an axis argument. For example, suppose you put your first column in x and the remaining columns in y, and they have these values:
In [61]: x
Out[61]: array([100, 110, 120, 130])
In [62]: y
Out[62]:
array([[ 1.1,  2.1,  2. ,  1.1,  1.1],
       [ 2. ,  2.1,  1. ,  1.2,  2.1],
       [ 1.2,  1. ,  1.1,  1. ,  1.2],
       [ 2. ,  1.1,  1.2,  2. ,  1.2]])
You can integrate each column of y with respect to x as follows:
In [63]: cumtrapz(y, x=x, axis=0, initial=0)
Out[63]:
array([[  0. ,   0. ,   0. ,   0. ,   0. ],
       [ 15.5,  21. ,  15. ,  11.5,  16. ],
       [ 31.5,  36.5,  25.5,  22.5,  32.5],
       [ 47.5,  47. ,  37. ,  37.5,  44.5]])
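One caveat: in recent SciPy releases cumtrapz was renamed to cumulative_trapezoid (the old name was deprecated in SciPy 1.6 and later removed). The arguments are the same, so the call above would become, sketched here with stand-in data:
import numpy as np
from scipy.integrate import cumulative_trapezoid  # modern name for cumtrapz

x = np.array([100, 110, 120, 130])
y = np.random.rand(4, 5)                              # stand-in for the remaining columns
I = cumulative_trapezoid(y, x=x, axis=0, initial=0)   # one integral per column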

matlab ismember function in python

Although similar questions have been raised a couple of times, I still cannot make a function similar to MATLAB's ismember function in Python. In particular, I want to use this function in a loop, and compare in each iteration a whole matrix to an element of another matrix. Wherever the same value occurs, I want to print 1, and in any other case 0.
Let's say that I have the following matrices
d = np.reshape(np.array([ 2.25, 1.25, 1.5 , 1. , 0. , 1.25, 1.75, 0. , 1.5 , 0. ]),(1,10))
d_unique = np.unique(d)
then I have
d_unique
array([ 0. , 1. , 1.25, 1.5 , 1.75, 2.25])
Now I want to iterate like
J = np.zeros(np.size(d_unique))
for i in xrange(len(d_unique)):
    J[i] = np.sum(ismember(d, d_unique[i]))
so as to get the following output:
J = [3,1,2,2,1,1]
Does anybody have any idea? Many thanks in advance.
In contrast to other answers, numpy has the built-in numpy.in1d for doing that.
Usage in your case:
bool_array = numpy.in1d(array1, array2)
Note: It also accepts lists as inputs.
EDIT (2021):
numpy now recommends using np.isin instead of np.in1d. np.isin preserves the shape of the input array, while np.in1d returns a flattened output.
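For the counting loop in the question there is an even shorter route, since np.unique can return the counts directly; a sketch:
import numpy as np

d = np.array([[2.25, 1.25, 1.5, 1., 0., 1.25, 1.75, 0., 1.5, 0.]])  # shape (1, 10)
mask = np.isin(d, [0., 1.5])       # keeps the (1, 10) shape, unlike in1d
print(mask.sum())                  # 5 (three 0.0 values and two 1.5 values)

d_unique, J = np.unique(d, return_counts=True)
print(J)                           # [3 1 2 2 1 1], the J from the question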
To answer your question, I guess you could define ismember similarly to:
def ismember(d, k):
    return [1 if (i == k) else 0 for i in d]
But I am not familiar with numpy, so a little adjustment may be in order.
I guess you could also use Counter from collections:
>>> from collections import Counter
>>> a = [2.25, 1.25, 1.5, 1., 0., 1.25, 1.75, 0., 1.5, 0. ]
>>> Counter(a)
Counter({0.0: 3, 1.25: 2, 1.5: 2, 2.25: 1, 1.0: 1, 1.75: 1})
>>> Counter(a).keys()
[2.25, 1.25, 0.0, 1.0, 1.5, 1.75]
>>> c =Counter(a)
>>> [c[i] for i in sorted(c.keys())]
[3, 1, 2, 2, 1, 1]
Once again, not numpy, you will probably have to do some list(d) somewhere.
Try the following function:
def ismember(A, B):
    return [np.sum(a == B) for a in A]
This should behave very much like the corresponding MATLAB function.
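With that definition, the J vector from the question drops out directly (a sketch using the data defined there):
import numpy as np

d = np.reshape(np.array([2.25, 1.25, 1.5, 1., 0., 1.25, 1.75, 0., 1.5, 0.]), (1, 10))
d_unique = np.unique(d)

def ismember(A, B):
    # for each element of A, count how many entries of B equal it
    return [np.sum(a == B) for a in A]

J = ismember(d_unique, d)   # count occurrences of each unique value in d
print(np.array(J))          # [3 1 2 2 1 1]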
Try the ismember library from pypi.
pip install ismember
Example:
# Import library
import numpy as np
from ismember import ismember

# data (as numpy arrays, so the fancy indexing below works)
d = np.array([2.25, 1.25, 1.5, 1., 0., 1.25, 1.75, 0., 1.5, 0.])
d_unique = np.array([0., 1., 1.25, 1.5, 1.75, 2.25])

# Lookup
Iloc, idx = ismember(d, d_unique)

# Iloc is a boolean array marking which elements of d exist in d_unique
print(Iloc)
# [ True  True  True  True  True  True  True  True  True  True]

# indexes of d_unique that exist in d
print(idx)
# [5 2 3 1 0 2 4 0 3 0]

print(d_unique[idx])
# [2.25 1.25 1.5  1.   0.   1.25 1.75 0.   1.5  0.  ]

print(d[Iloc])
# [2.25 1.25 1.5  1.   0.   1.25 1.75 0.   1.5  0.  ]

# These vectors will match element-wise
d[Iloc] == d_unique[idx]
