numpy.genfromtxt: delimiter=',' fails to split string


I don't understand why numpy.genfromtxt doesn't split the following string correctly using delimiter="," while it works for most of the other strings in my chunk.
chunk[12968]
Out[143]: '2901869281,3279442095,2012-12-15T23:00:00.003Z,Sacramento,CA,R#3817874,United States,38.583,-121.498,11, 8, 6, 5, 1, 0, 2, 3, 3, 5, 3, 3, 2, 2, 6, 6, 1, 2, 3, 0, 1, 1, 0, 0, 2, 2, 2, 2, 1, 0, 0, 2, 1, 0, 1, 1, 2, 0, 3, 1, 1, 1, 1, 0, 0, 4, 0, 0, 0, 1, 3, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 9, 0, 0, 0, 2, 3, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,130\n'
I would expect an array of shape (110,) but get the following
genfromtxt([chunk[12968]],delimiter=",",dtype=np.int64)
Out[142]:
array([2901869281, 3279442095, -1, -1, -1,
-1], dtype=int64)
Note that I am using izip_longest from itertools to read a large CSV file in chunks, like this:
with open('events.csv','r') as f:
    for chunk in izip_longest(*[f] *50000):
        ...
Thanks for help.

The comments argument to genfromtxt() defaults to '#', so everything past the # in your input is getting ignored:
2901869281,3279442095,2012-12-15T23:00:00.003Z,Sacramento,CA,R#3817874,United States,...
                                                              ^ start of comment
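One possible workaround, sketched here under the assumption that the file contains no genuine comment lines, is to pass comments=None so that '#' is treated as ordinary data. The shortened sample line and the choice of dtype=str are made up for this illustration and are not the original data.

```python
import numpy as np

# Shortened, made-up stand-in for the problematic row (note the '#' in the third field).
line = "2901869281,3279442095,R#3817874,United States,38.583\n"

# Default behaviour: '#' starts a comment, so the row is cut off after "R".
truncated = np.genfromtxt([line], delimiter=",", dtype=str)

# comments=None disables comment handling, so every field survives.
full = np.genfromtxt([line], delimiter=",", dtype=str, comments=None)

print(truncated.size, full.size)
```

Reading the row as strings sidesteps the separate problem that fields such as the timestamp and the city name cannot be converted to int64.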

Related

Change zeros into their closest left nonzero neighbors in an array

Assuming I have an array :
[1, 0, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 1, 0, 0, 0, 2]
How can I change the zeros into the value of its closest left non-zero neighbor?
[1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 1, 1, 1, 1, 2]

l=[1, 0, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 1, 0, 0, 0, 2]
arr=[]
left_element=0  # used if the list starts with a zero
for i in l:
    if i!=0:
        arr.append(i)
        left_element=i
    else:
        arr.append(left_element)
print(arr)
Keep track of the most recent nonzero element and append it to the new list whenever a zero is encountered.
Space: O(n)
Runtime: O(n)
Or, to modify the list in place (O(1) extra space):
l=[1, 0, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 1, 0, 0, 0, 2]
left_element=0  # used if the list starts with a zero
for i in range(len(l)):
    if l[i]!=0:
        left_element=l[i]
    else:
        l[i]=left_element
print(l)

I would iterate through your array and evaluate each item, replacing when necessary:
arr = [1, 0, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 1, 0, 0, 0, 2]
# replace values with the following
previous_x = None
for i,x in enumerate(arr):
    if x>0:
        previous_x = x
    else:
        arr[i] = previous_x
If your array is massive (more than 100,000 elements), I would look into using NumPy for a vectorized solution.
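A minimal sketch of what such a NumPy approach might look like is shown here. The helper name fill_zeros_with_last is hypothetical; the idea is to record the index of each nonzero element and carry the latest one forward with a running maximum. Leading zeros, which have no left neighbor, are left as zeros in this sketch.

```python
import numpy as np

def fill_zeros_with_last(values):
    """Replace each zero with the closest nonzero value to its left."""
    arr = np.asarray(values)
    # Index of every nonzero element; zeros get index 0 as a placeholder.
    idx = np.where(arr != 0, np.arange(arr.size), 0)
    # A running maximum carries the most recent nonzero index forward.
    idx = np.maximum.accumulate(idx)
    return arr[idx]

filled = fill_zeros_with_last(
    [1, 0, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 1, 0, 0, 0, 2]
)
print(filled.tolist())
```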

Optimize Dict values(List) Multiplication

I have two dictionaries: Initial (25 key-value pairs) and Results (100 key-value pairs).
Initial: {0: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0],....... 24: [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0]}
Results: {'0': [360, 0, 0, 0, 0, 1, 0, 0, 3, 3, 0, 0, 15, 0, 14, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 2, 0, 0, 0, 0, 1, 0, 3, 3, 1, 0, 0, 0, 0, 0, 4, 0, 0, 0, 1, 2, 0, 1, 0, 0, 3, 1, 0, 1, 0, 0, 0, 1, 2, 0, 2, 0, 0, 0, 137, 21, 78, 65, 241, 31, 30, 88, 152, 3, 13, 67, 31, 145, 132, 37, 1, 107, 120, 171, 39, 35, 31, 8, 24, 0, 0, 0, 0, 0],......'100': [183, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 4, 0, 12, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 2, 8, 1, 3, 1, 0, 3, 3, 0, 1, 1, 3, 2, 1, 1, 4, 0, 2, 1, 3, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 76, 10, 25, 33, 121, 14, 6, 40, 62, 2, 5, 34, 23, 66, 61, 28, 1, 56, 46, 69, 23, 10, 14, 1, 13, 1, 0, 0, 0, 0]}
In each iteration, I multiply every list in the Results dictionary element-wise by one list from the Initial dictionary and pass the product to a function that returns another value. I repeat this for every value in the Initial dictionary. I am doing this with the code below:
for z in Initial.keys():
    for i in sorted(Results.keys()):
        result = {i :[x*y for x, y in zip(Initial[z], Results[i])]}
One complete cycle takes about 1 minute, and I need to run at least 5000 cycles to observe the final results. Any suggestions for improving the performance of this code would be much appreciated.

Your values are lists, so you have to multiply them one element at a time. You can convert the lists to NumPy arrays once and then use vectorized multiplication, which removes the list comprehension and the element-by-element Python loop:
import numpy as np
# Convert the values to arrays once, up front
Initial = {k:np.array(v) for k,v in Initial.items()}
Results = {k:np.array(v) for k,v in Results.items()}
# Now just use vectorized multiplication
for z in Initial.keys():
    for i in sorted(Results.keys()):
        result = {i :Initial[z] * Results[i]}
Since you did not provide complete data, I tried your code for some 1 million iterations and found the vectorized code much faster. Try it out on your original data and see if you get a speed up (which you should).
Test case for comparing times
Your list comprehension version took 1 minute 6 seconds:
for ii in range(500000):
    for z in Initial.keys():
        for i in sorted(Results.keys()):
            result = {i :[x*y for x, y in zip(Initial[z], Results[i])]}
The following vectorized operation took 2.9 seconds:
for ii in range(500000):
    for z in Initial.keys():
        for i in sorted(Results.keys()):
            result = {i :Initial[z] * Results[i]}
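If memory allows, the two Python loops could be removed as well by stacking each dictionary into a 2-D array and letting broadcasting form every pairwise product at once. This is only a sketch under assumptions: the tiny dictionaries are made-up stand-ins for the real data, and it presumes that all lists have the same length and that the downstream function can work from the combined array.

```python
import numpy as np

# Hypothetical, shortened stand-ins for the Initial and Results dictionaries.
initial = {0: [0, 1, 1], 1: [1, 0, 1]}
results = {"0": [360, 2, 5], "1": [183, 4, 7]}

init_keys = list(initial)
res_keys = sorted(results)

init_mat = np.array([initial[k] for k in init_keys])  # shape (n_initial, length)
res_mat = np.array([results[k] for k in res_keys])    # shape (n_results, length)

# Broadcasting: products[a, b] is initial[init_keys[a]] * results[res_keys[b]].
products = init_mat[:, None, :] * res_mat[None, :, :]

print(products.shape)
```

The row for a given pair of keys can then be looked up as products[init_keys.index(z), res_keys.index(i)].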

Append rows in array

I am making a Draughts game in Python. I made a 10 by 10 array and need to fill in each row so that it eventually looks like this:
(
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 2, 0, 2, 0, 2, 0, 2, 0, 2],
[2, 0, 2, 0, 2, 0, 2, 0, 2, 0],
[0, 2, 0, 2, 0, 2, 0, 2, 0, 2],
[2, 0, 2, 0, 2, 0, 2, 0, 2, 0],
)
Here is my attempt so far; I know it's incorrect:
__author__ = 'Matt'
import array
Board_Array = array(10, 10)
pieces = ['Empty', 'White_Piece', 'Black_Piece', 'Upgraded_White_Piece', 'Upgraded_Black_Piece']
list(enumerate(pieces))
if Board_Array.array_equals == [1, 0]:
    for i in range(10):
        if (i%2) == 0:
            array.pop([i])
            array.insert(i,1)

You could use a nested list comprehension:
In [173]: [[((i+j) % 2)*k for i in range(10)] for k in (1,1,0,2,2)
for j in (0,1)]
Out[173]:
[[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 2, 0, 2, 0, 2, 0, 2, 0, 2],
[2, 0, 2, 0, 2, 0, 2, 0, 2, 0],
[0, 2, 0, 2, 0, 2, 0, 2, 0, 2],
[2, 0, 2, 0, 2, 0, 2, 0, 2, 0]]
This is equivalent to
result = []
for k in (1,1,0,2,2):
    for j in (0,1):
        row = []
        for i in range(10):
            row.append(((i+j) % 2)*k)
        result.append(row)
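The same board could also be built with NumPy, if an actual array is preferred over nested lists. This is just one possible sketch: the names dark_squares and piece_per_row are made up for this illustration, and it relies on the same (row + column) % 2 pattern as the comprehension above.

```python
import numpy as np

rows, cols = np.indices((10, 10))

# 1 on the squares that can hold a piece, 0 elsewhere.
dark_squares = (rows + cols) % 2

# Piece code for each row: four rows of 1, two empty rows, four rows of 2.
piece_per_row = np.repeat([1, 1, 0, 2, 2], 2)[:, None]

board = dark_squares * piece_per_row
print(board)
```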

Sorting the order of entire columns in numpy

How can I sort this (3, 20) ndarray by column? np.sort() doesn't seem to do what I expected. I want to turn this:
a
array([[0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 0, 0, 0, 1, 1, 2, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3],
[0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0, 1, 0, 0]])
into this:
a
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3],
[0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 0, 0, 0, 1, 1, 2, 0, 0, 1, 0],
[0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0, 1, 0, 0]])
Note: the columns are kept intact - see column a. They are sorted first by the first element in the column, then the second, then the third.
Thanks

Maybe you could use lexsort?
>>> arr[:,np.lexsort(arr[::-1])]
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3],
[0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 0, 0, 0, 1, 1, 2, 0, 0, 1, 0],
[0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0, 1, 0, 0]])
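The row reversal matters because np.lexsort treats the last key it is given as the primary sort key. A small made-up array (not the data from the question) may make the effect easier to see:

```python
import numpy as np

# Made-up 3x4 example; each column is one record to keep intact.
a = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [2, 3, 1, 5]])

# Reversing the rows makes row 0 the primary key, row 1 the secondary key,
# and row 2 the final tie-breaker.
order = np.lexsort(a[::-1])
sorted_cols = a[:, order]

print(order)
print(sorted_cols)
```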

Make each column a tuple and use it as a sort criterion. Iterating over an array yields its rows, so transpose first and transpose back afterwards:
a = np.array([[0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 0, 0, 0, 1, 1, 2, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3],
[0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0, 1, 0, 0]])
np.array(sorted(a.T, key=tuple)).T
Out:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3],
[0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 0, 0, 0, 1, 1, 2, 0, 0, 1, 0],
[0, 1, 2, 3, 0, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0, 1, 0, 0]])

k-means in python: Determine which data are associated with each centroid

I've been using scipy.cluster.vq.kmeans for k-means clustering, but was wondering if there's a way to determine which centroid each data point is (putatively) associated with.
Clearly you could do this manually, but as far as I can tell the kmeans function doesn't return this?

There is a function kmeans2 in scipy.cluster.vq that returns the labels, too.
In [7]: import numpy as np; from scipy.cluster.vq import kmeans2
In [8]: X = np.random.randn(100, 2)
In [9]: centroids, labels = kmeans2(X, 3)
In [10]: labels
Out[10]:
array([2, 1, 2, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 2, 2, 1, 2, 1, 2, 1, 2, 0,
1, 0, 2, 0, 1, 2, 0, 1, 0, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 0, 0,
2, 2, 0, 1, 0, 0, 0, 2, 2, 2, 0, 0, 1, 2, 1, 0, 0, 0, 2, 1, 1, 1, 1,
1, 0, 0, 1, 0, 1, 2, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 2, 0, 2, 2, 0,
1, 1, 0, 1, 0, 0, 0, 2])
Otherwise, if you must use kmeans, you can also use vq to get labels:
In [17]: from scipy.cluster.vq import kmeans, vq
In [18]: codebook, distortion = kmeans(X, 3)
In [21]: code, dist = vq(X, codebook)
In [22]: code
Out[22]:
array([1, 0, 1, 0, 2, 2, 2, 0, 1, 1, 0, 2, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1,
2, 2, 1, 2, 0, 1, 1, 0, 2, 2, 0, 1, 0, 1, 0, 2, 1, 2, 0, 2, 1, 1, 1,
0, 1, 2, 0, 1, 2, 2, 1, 1, 1, 2, 2, 0, 0, 2, 2, 2, 2, 1, 0, 2, 2, 2,
0, 1, 1, 2, 1, 0, 0, 0, 0, 1, 2, 1, 2, 0, 2, 0, 2, 2, 1, 1, 1, 1, 1,
2, 0, 2, 0, 2, 1, 1, 1])
Documentation: scipy.cluster.vq
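To check what vq returns, its labels can be compared with a manual nearest-centroid assignment. The snippet below is an illustrative sketch only: the two synthetic clusters and the random seed are assumptions made for this example, and the manual step uses plain Euclidean distance.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Synthetic data: two well-separated 2-D clusters (made up for this sketch).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(5.0, 0.5, (50, 2))])

codebook, distortion = kmeans(X, 2)

# vq assigns every observation to its nearest centroid in the codebook.
labels, dists = vq(X, codebook)

# The same assignment done by hand: distance from every point to every centroid.
all_dists = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
manual_labels = np.argmin(all_dists, axis=1)

print(np.array_equal(labels, manual_labels))
```

Here labels[i] is the row index in codebook of the centroid that X[i] belongs to, so X[labels == k] selects the points assigned to centroid k.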
