numpy.genfromtxt: delimiter=',' fails to split string - python

I don't understand why numpy.genfromtxt doesn't split the following string correctly using delimiter="," while it works for most of the other strings in my chunk.
chunk[12968]
Out[143]: '2901869281,3279442095,2012-12-15T23:00:00.003Z,Sacramento,CA,R#3817874,United States,38.583,-121.498,11, 8, 6, 5, 1, 0, 2, 3, 3, 5, 3, 3, 2, 2, 6, 6, 1, 2, 3, 0, 1, 1, 0, 0, 2, 2, 2, 2, 1, 0, 0, 2, 1, 0, 1, 1, 2, 0, 3, 1, 1, 1, 1, 0, 0, 4, 0, 0, 0, 1, 3, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 9, 0, 0, 0, 2, 3, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,130\n'
I would expect an array of shape (110,) but get the following
genfromtxt([chunk[12968]],delimiter=",",dtype=np.int64)
Out[142]:
array([2901869281, 3279442095, -1, -1, -1,
-1], dtype=int64)
Note that I am using izip_longest from itertools to read a large CSV file in chunks, this way:
with open('events.csv', 'r') as f:
    for chunk in izip_longest(*[f] * 50000):
        ...
Thanks for help.

The comments argument to genfromtxt() defaults to '#', so everything past the # in your input is getting ignored:
2901869281,3279442095,2012-12-15T23:00:00.003Z,Sacramento,CA,R#3817874,United States,...
                                                              ^ start of comment
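One possible workaround, sketched here on the assumption that the file contains no genuine comment lines, is to pass comments=None so that '#' is treated as ordinary data. The line below is a shortened, made-up variant of the row from the question, and dtype=str is used only because some fields are not numeric.

```python
import numpy as np

# Shortened, made-up variant of the problematic row (illustration only)
line = "2901869281,3279442095,Sacramento,CA,R#3817874,38.583,11,8\n"

# Default behaviour: everything from '#' onwards is treated as a comment
truncated = np.genfromtxt([line], delimiter=",", dtype=str)

# comments=None disables comment handling, so the whole row is split
fields = np.genfromtxt([line], delimiter=",", dtype=str, comments=None)

print(len(truncated), len(fields))
```

If the file does contain real comment lines, a different marker that never appears in the data could be passed as comments instead.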

Related

Change zeros into their closest left nonzero neighbors in an array

Assuming I have an array:
[1, 0, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 1, 0, 0, 0, 2]
How can I change each zero into the value of its closest non-zero neighbor to the left?
[1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 1, 1, 1, 1, 2]
l = [1, 0, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 1, 0, 0, 0, 2]
arr = []
for i in l:
    if i != 0:
        arr.append(i)
        left_element = i
    else:
        arr.append(left_element)
print(arr)
Keep track of the most recent non-zero element and append it to the new list whenever a zero is encountered.
Space: O(n)
Runtime: O(n)
Or, modifying the list in place:
l = [1, 0, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 1, 0, 0, 0, 2]
for i in range(len(l)):
    if l[i] != 0:
        left_element = l[i]
    else:
        l[i] = left_element
print(l)
I would iterate through your array and evaluate each item, replacing it when necessary:
arr = [1, 0, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 1, 0, 0, 0, 2]
# remember the last non-zero value and use it to replace zeros
previous_x = None
for i, x in enumerate(arr):
    if x != 0:
        previous_x = x
    else:
        arr[i] = previous_x
If your array is massive (more than 100,000 elements), I would look into leveraging numpy for a solution.
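As a rough sketch of what such a numpy solution might look like (the function name fill_zeros_with_last is made up for this illustration), one option is to record the index of every non-zero entry and carry the latest index forward with a running maximum:

```python
import numpy as np

def fill_zeros_with_last(values):
    """Replace each zero with the closest non-zero value to its left."""
    arr = np.asarray(values)
    # Position of each non-zero entry; zeros contribute index 0
    idx = np.where(arr != 0, np.arange(arr.size), 0)
    # A running maximum carries the most recent non-zero index forward
    np.maximum.accumulate(idx, out=idx)
    return arr[idx]

filled = fill_zeros_with_last(
    [1, 0, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 1, 0, 0, 0, 2])
print(filled.tolist())
```

Zeros that appear before the first non-zero value have no left neighbor and are left as zeros by this sketch.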

Optimize Dict values(List) Multiplication

I have two dictionaries, as follows: Initial (25 key-value pairs) and Results (100 key-value pairs).
Initial: {0: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0],....... 24: [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0]}
Results: {'0': [360, 0, 0, 0, 0, 1, 0, 0, 3, 3, 0, 0, 15, 0, 14, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 2, 0, 0, 0, 0, 1, 0, 3, 3, 1, 0, 0, 0, 0, 0, 4, 0, 0, 0, 1, 2, 0, 1, 0, 0, 3, 1, 0, 1, 0, 0, 0, 1, 2, 0, 2, 0, 0, 0, 137, 21, 78, 65, 241, 31, 30, 88, 152, 3, 13, 67, 31, 145, 132, 37, 1, 107, 120, 171, 39, 35, 31, 8, 24, 0, 0, 0, 0, 0],......'100': [183, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 4, 0, 12, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 2, 8, 1, 3, 1, 0, 3, 3, 0, 1, 1, 3, 2, 1, 1, 4, 0, 2, 1, 3, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 76, 10, 25, 33, 121, 14, 6, 40, 62, 2, 5, 34, 23, 66, 61, 28, 1, 56, 46, 69, 23, 10, 14, 1, 13, 1, 0, 0, 0, 0]}
In each iteration I multiply every value of the Results dictionary element-wise by one value of the Initial dictionary and pass the product to a function that returns another value. I repeat this for every value in the Initial dictionary, using the code below:
for z in Initial.keys():
    for i in sorted(Results.keys()):
        result = {i: [x*y for x, y in zip(Initial[z], Results[i])]}
One complete cycle takes about 1 minute, and I will need to perform at least 5000 cycles to observe the final results. Any suggestions for improving the performance of this code would be much appreciated.
Your values are lists, so you have to multiply them one element at a time. You can convert the values (lists) to NumPy arrays once and then use vectorized multiplication, which removes the list comprehension and the element-wise loop:
import numpy as np
# Convert the values to arrays once
Initial = {k: np.array(v) for k, v in Initial.items()}
Results = {k: np.array(v) for k, v in Results.items()}
# Now use vectorized multiplication
for z in Initial.keys():
    for i in sorted(Results.keys()):
        result = {i: Initial[z] * Results[i]}
Since you did not provide complete data, I tried your code for some 1 million iterations and found the vectorized code much faster. Try it on your original data and see whether you get a speed-up (which you should).
Test case for comparing times
Your list comprehension version took 1 minute 6 seconds:
for ii in range(500000):
    for z in Initial.keys():
        for i in sorted(Results.keys()):
            result = {i: [x*y for x, y in zip(Initial[z], Results[i])]}
The following vectorized operation took 2.9 seconds:
for ii in range(500000):
    for z in Initial.keys():
        for i in sorted(Results.keys()):
            result = {i: Initial[z] * Results[i]}
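Going one step further, the two Python loops could probably be removed as well by stacking each dictionary into a 2-D array and letting broadcasting compute every pairwise product at once. This is only a sketch: the tiny dictionaries below are made-up stand-ins for the real data, and it assumes all lists have the same length.

```python
import numpy as np

# Made-up stand-ins for the real dictionaries (3 and 4 entries of length 5)
rng = np.random.default_rng(0)
Initial = {k: rng.integers(0, 2, size=5).tolist() for k in range(3)}
Results = {str(k): rng.integers(0, 50, size=5).tolist() for k in range(4)}

initial_keys = list(Initial)
result_keys = sorted(Results)

init_arr = np.array([Initial[k] for k in initial_keys])  # shape (3, 5)
res_arr = np.array([Results[k] for k in result_keys])    # shape (4, 5)

# products[a, b] holds Initial[initial_keys[a]] * Results[result_keys[b]]
products = init_arr[:, None, :] * res_arr[None, :, :]    # shape (3, 4, 5)
```

Whether this helps in practice depends on the function that consumes each product; if it can only accept one product at a time, the loop over the first two axes of products is still needed.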

Append rows in array

I am making a Draughts game in Python. I made a 10 by 10 array and I need to fill in the values of each row so that it eventually looks like this:
(
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 2, 0, 2, 0, 2, 0, 2, 0, 2],
[2, 0, 2, 0, 2, 0, 2, 0, 2, 0],
[0, 2, 0, 2, 0, 2, 0, 2, 0, 2],
[2, 0, 2, 0, 2, 0, 2, 0, 2, 0],
)
Here is my attempt at it so far; I know it's incorrect:
__author__ = 'Matt'
import array
Board_Array = array(10, 10)
pieces = ['Empty', 'White_Piece', 'Black_Piece', 'Upgraded_White_Piece', 'Upgraded_Black_Piece']
list(enumerate(pieces))
if Board_Array.array_equals == [1, 0]:
    for i in range(10):
        if (i%2) == 0:
            array.pop([i])
            array.insert(i,1)
You could use a nested list comprehension:
In [173]: [[((i+j) % 2)*k for i in range(10)] for k in (1,1,0,2,2)
           for j in (0,1)]
Out[173]:
[[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 2, 0, 2, 0, 2, 0, 2, 0, 2],
[2, 0, 2, 0, 2, 0, 2, 0, 2, 0],
[0, 2, 0, 2, 0, 2, 0, 2, 0, 2],
[2, 0, 2, 0, 2, 0, 2, 0, 2, 0]]
This is equivalent to
result = []
for k in (1,1,0,2,2):
    for j in (0,1):
        row = []
        for i in range(10):
            row.append(((i+j) % 2)*k)
        result.append(row)
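If a NumPy array is preferred over nested lists, the same board can be sketched with a checkerboard mask multiplied by a per-row piece value. The variable names here are illustrative and not taken from the question.

```python
import numpy as np

# Row and column index grids for a 10 x 10 board
rows, cols = np.indices((10, 10))

# 1 on the squares that can hold a piece, 0 elsewhere
checker = (rows + cols) % 2

# Piece value for each row: four rows of 1, two empty rows, four rows of 2
piece = np.repeat([1, 1, 0, 2, 2], 2)

board = checker * piece[:, None]
print(board)
```

Each row of checker is scaled by the matching entry of piece, which mirrors the role of k in the list comprehension.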

Sorting the order of entire columns in numpy

How can I sort this (3, 20) ndarray by column? np.sort() doesn't seem to do what I expected. I want to turn this:
a
array([[0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 0, 0, 0, 1, 1, 2, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3],
[0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0, 1, 0, 0]])
into this:
a
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3],
[0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 0, 0, 0, 1, 1, 2, 0, 0, 1, 0],
[0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0, 1, 0, 0]])
Note: the columns are kept intact - see column a. They are sorted first by the first element in the column, then the second, then the third.
Thanks
Maybe you could use lexsort?
>>> arr[:,np.lexsort(arr[::-1])]
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3],
[0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 0, 0, 0, 1, 1, 2, 0, 0, 1, 0],
[0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0, 1, 0, 0]])
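The row reversal matters because np.lexsort treats the last key it receives as the primary one. A small sketch on a made-up 3 x 4 array, checked against a plain Python sort of the column tuples:

```python
import numpy as np

# Made-up example: each column is one record
a = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [2, 3, 1, 9]])

# lexsort uses its LAST key as the primary key, so reverse the rows
# to make row 0 primary, row 1 secondary, and row 2 the tie-breaker
order = np.lexsort(a[::-1])
sorted_cols = a[:, order]

# Reference result: sort the columns as tuples, then transpose back
reference = np.array(sorted(a.T.tolist())).T
print(np.array_equal(sorted_cols, reference))
```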
Make each column a tuple and use it as the sort key. Transpose first so that sorted() iterates over columns rather than rows, then transpose back:
a = np.array([[0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 0, 0, 0, 1, 1, 2, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3],
[0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0, 1, 0, 0]])
np.array(sorted(a.T, key=tuple)).T
Out:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3],
[0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 0, 0, 0, 1, 1, 2, 0, 0, 1, 0],
[0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0, 1, 0, 0]])

k-means in python: Determine which data are associated with each centroid

I've been using scipy.cluster.vq.kmeans to do some k-means clustering, but was wondering if there's a way to determine which centroid each data point is (putatively) associated with.
Clearly you could do this manually, but as far as I can tell the kmeans function doesn't return this information.
There is a function kmeans2 in scipy.cluster.vq that returns the labels, too.
In [8]: X = scipy.randn(100, 2)
In [9]: centroids, labels = kmeans2(X, 3)
In [10]: labels
Out[10]:
array([2, 1, 2, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 2, 2, 1, 2, 1, 2, 1, 2, 0,
1, 0, 2, 0, 1, 2, 0, 1, 0, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 0, 0,
2, 2, 0, 1, 0, 0, 0, 2, 2, 2, 0, 0, 1, 2, 1, 0, 0, 0, 2, 1, 1, 1, 1,
1, 0, 0, 1, 0, 1, 2, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 2, 0, 2, 2, 0,
1, 1, 0, 1, 0, 0, 0, 2])
Otherwise, if you must use kmeans, you can also use vq to get labels:
In [17]: from scipy.cluster.vq import kmeans, vq
In [18]: codebook, distortion = kmeans(X, 3)
In [21]: code, dist = vq(X, codebook)
In [22]: code
Out[22]:
array([1, 0, 1, 0, 2, 2, 2, 0, 1, 1, 0, 2, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1,
2, 2, 1, 2, 0, 1, 1, 0, 2, 2, 0, 1, 0, 1, 0, 2, 1, 2, 0, 2, 1, 1, 1,
0, 1, 2, 0, 1, 2, 2, 1, 1, 1, 2, 2, 0, 0, 2, 2, 2, 2, 1, 0, 2, 2, 2,
0, 1, 1, 2, 1, 0, 0, 0, 0, 1, 2, 1, 2, 0, 2, 0, 2, 2, 1, 1, 1, 1, 1,
2, 0, 2, 0, 2, 1, 1, 1])
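Both approaches can be put together in a self-contained sketch. The data below are made up (two well-separated blobs generated with NumPy's random generator rather than scipy.randn, which may not be available in newer SciPy releases), and the cluster count of 2 is an assumption chosen to match them.

```python
import numpy as np
from scipy.cluster.vq import kmeans, kmeans2, vq

# Made-up data: two well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

# Option 1: kmeans2 returns the centroids and a label per observation
centroids, labels = kmeans2(X, 2, minit='points')

# Option 2: run kmeans, then assign each observation to its nearest centroid
codebook, distortion = kmeans(X, 2)
code, dist = vq(X, codebook)

# Observations assigned to the first centroid of the codebook
members_of_first = X[code == 0]
```

Note that the cluster numbering is arbitrary, so label 0 from kmeans2 need not correspond to code 0 from vq.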
Documentation: scipy.cluster.vq
