Unsuccessful in Appending Numpy Arrays - python

I am trying to iterate through a CSV file and create a numpy array for each row in the file, where the first column represents the x-coordinates and the second column the y-coordinates. I am then trying to append each array to a master array and return it.
import numpy as np

thedoc = open("data.csv")
headers = thedoc.readline()

def generatingArray(thedoc):
    masterArray = np.array([])
    for numbers in thedoc:
        editDocument = numbers.strip().split(",")
        x = editDocument[0]
        y = editDocument[1]
        createdArray = np.array((x, y))
        masterArray = np.append([createdArray])
    return masterArray

print(generatingArray(thedoc))
I am hoping to see an array with all the CSV info in it. Instead, I receive an error: "append() missing 1 required positional argument: 'values'".
Any help on where my error is and how to fix it is greatly appreciated!

The immediate error is that np.append needs both the target array and the values: np.append(masterArray, createdArray). But the deeper issue is that numpy arrays don't magically grow in the same way that Python lists do; you need to allocate the space for the array (your masterArray = np.array([]) call) before you add everything to it.
The best answer is to import directly to a numpy array using something like genfromtxt (https://docs.scipy.org/doc/numpy-1.10.1/user/basics.io.genfromtxt.html).
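For example (a minimal sketch, assuming data.csv has one header row and two numeric columns as described):

import numpy as np

# skip_header=1 skips the header line; the result is an (N, 2) float array
masterArray = np.genfromtxt("data.csv", delimiter=",", skip_header=1)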
But if you know the number of lines you're reading in, or can get it with something like this:
file_length = len(open("data.csv").readlines()) - 1  # -1 so the header line read earlier isn't counted
Then you can preallocate the numpy array and fill it like this:
masterArray = np.empty((file_length, 2))
for i, numbers in enumerate(thedoc):
    editDocument = numbers.strip().split(",")
    x = editDocument[0]
    y = editDocument[1]
    masterArray[i] = [x, y]
I would recommend the first method, but if you're lazy then you can always just build a Python list and then make a numpy array:
def generatingArray(thedoc):
    masterArray = []
    for numbers in thedoc:
        editDocument = numbers.strip().split(",")
        x = editDocument[0]
        y = editDocument[1]
        createdArray = [x, y]
        masterArray.append(createdArray)
    return np.array(masterArray)


Why isn't Python writing elements from one 3-D list to another?

I'm trying to create a program that averages two images: it loops through each RGBA value in the two images, averages them, and writes the result into one composite image. But the right values aren't being written to my list comp_img, which holds all the new image data.
I'm using these 256x256 images for debugging.
But it just creates a single flat color as output: while this is the composite color, the 64 gets wiped out entirely. Help is very much appreciated, thank you.
from PIL import Image
import numpy as np
from math import ceil
from time import sleep

red = Image.open("64red.png")
grn = Image.open("64green.png")

comp_img = []
temp = [0,0,0,0] #temp list used for appending
#temp is a blank pixel template
for i in range(red.width):
    comp_img.append(temp)
temp = comp_img
#temp should now be a row template composed of pixel templates
#2d to 3d array code goes here
comp_img = []
for i in range(red.height):
    comp_img.append(temp)

reddata = np.asarray(red)
grndata = np.asarray(grn)
reddata = reddata.tolist() #it's uncanny how easy it is
grndata = grndata.tolist()

for row, elm in enumerate(reddata):
    for pxl, subelm in enumerate(elm):
        for vlu_index, vlu in enumerate(subelm):
            comp_img[row][pxl][vlu_index] = ceil((reddata[row][pxl][vlu_index] + grndata[row][pxl][vlu_index])/2)
#These print statements dramatically slow down the otherwise remarkably quick program, and are solely for debugging/demonstration.

output = np.array(comp_img, dtype=np.uint8) #the ostensible troublemaker
outputImg = Image.fromarray(output)
outputImg.save("output.png")
You could simply do
comp_img = np.ceil((reddata + grndata) / 2)
This gives me the composite image.
To get correct values it needs to work with 16-bit values, because with uint8 it works only with values 0..255, and 255 + 255 gives 254 instead of 510 (it is calculated modulo 256, and (255 + 255) % 256 gives 254). So convert first:
reddata = np.asarray(red, dtype=np.uint16)
grndata = np.asarray(grn, dtype=np.uint16)
and then it gives the correct result. Full code:
from PIL import Image
import numpy as np
red = Image.open("64red.png")
grn = Image.open("64green.png")
reddata = np.asarray(red, dtype=np.uint16)
grndata = np.asarray(grn, dtype=np.uint16)
#print(reddata[128,128])
#print(grndata[128,128])
#comp_img = (reddata + grndata) // 2
comp_img = np.ceil((reddata + grndata) / 2)
#print(comp_img[128,128])
output = np.array(comp_img, dtype=np.uint8)
outputImg = Image.fromarray(output)
outputImg.save("output.png")
You really should work with just NumPy arrays and functions, but I'll explain the bug here. It's rooted in how you make the nested list. Let's look at the first level:
temp = [0,0,0,0] #temp list used for appending
#temp is a blank pixel template
for i in range(red.width):
    comp_img.append(temp)
At this stage, comp_img is a list of size red.width whose every element/pixel references the same RGBA-list [0,0,0,0]. I don't just mean the values are the same; it's one RGBA-list object. When you edit the values of that one RGBA-list, you edit all the pixels' colors at once.
Just fixing that step isn't enough. You also make the same error in the next step of expanding comp_img to a 2D matrix of RGBA-lists; every row is the same list.
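A tiny sketch of that aliasing (the names are illustrative, not from the question):

row = [0, 0, 0, 0]
grid = []
for _ in range(3):
    grid.append(row)   # appends a reference to the SAME list every time

grid[0][0] = 255
print(grid)  # [[255, 0, 0, 0], [255, 0, 0, 0], [255, 0, 0, 0]] -- every "row" changed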
If you really want to make a blank comp_img first, you should just make a NumPy array of a numerical scalar dtype; there you can guarantee every element is independent:
comp_img = np.zeros((red.height, red.width, 4), dtype = np.uint8)
If you really want a nested list, you have to properly instantiate (make new) lists at every level for them to be independent. A list comprehension is easy to write:
comp_img = [[[0,0,0,0] for i in range(red.width)] for j in range(red.height)]

produce vector output from a dask array

I have a large dask array (labeled_arr) that is actually a labeled raster image (dtype is int64). I want to use rasterio to turn the labeled regions into polygons and combine them into a single list of polygons (or geoseries with just a geometry column). This is a straightforward task on a single array, but I'm having trouble figuring out how to tell dask that I want it to do this operation on each chunk and return something that is not an array.
function to apply to each chunk:
def get_polys(labeled_blocks):
    polys = list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(
        labeled_blocks.astype('int32'), transform=trans))[:-1]
    # Note: rasterio.features.shapes returns an iterator, hence the conversion to a list here
    return polys
line of code trying to get dask to do this:
test_polygons = da.blockwise(get_polys, '', labeled_arr, 'ij')
test_polygons.compute()
where labeled_arr is the input chunked dask array.
Running as is returns an error saying I have to specify a dtype for da.blockwise. Specifying a dtype returns an AttributeError since the output list type does not have a dtype attribute. I discovered the meta keyword, but still have been unable to get the right syntax to turn my output into a Series or list.
I'm not attached to the above approach, but my overarching goal is: take a labeled, chunked dask dataarray (which does not all fit in memory), extract a list based on computations for each chunk, and generate a concatenated list (or pandas data object) with the outputs from all the chunks in my original chunked array.
This might work:
import dask
import dask.array as da
# we expect to see 4 blocks here
test_array = da.random.random((4, 4), chunks=(2, 2))
#dask.delayed
def my_func(block):
# do something fancy
return list(block)
results = dask.compute([my_func(x) for x in test_array.to_delayed().ravel()])
As you noted, the problem is that list has no dtype. A way around this would be to convert the list into a np.array, but I'm not sure this will work with all geometry objects (it should be OK for Points, but polygons might be problematic due to their varying length). Since you are not interested in forcing these geometries into an array, it's best to treat individual blocks as delayed objects, feeding them into your function one at a time (but scaled across workers/processes).
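For reference, dask.compute above returns a one-element tuple wrapping the list of per-block results; a small sketch (under that assumption) of flattening them into a single list:

(per_block,) = results
flat = [item for block_result in per_block for item in block_result]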
Here's the solution I ended up with initially, though it still requires a lot of RAM given the concatenate=True kwarg.
poss_list = []

def get_polys(labeled_blocks):
    polys = list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(
        labeled_blocks.astype('int32'), transform=trans))[:-1]
    poss_list.append(polys)

da.blockwise(get_polys, '', labeled_arr, 'ij',
             meta=pd.DataFrame({'c': []}), concatenate=True).compute()
If I'm interpreting correctly, this doesn't feed the chunks into my function across workers/processes though (which it seems I can get away with for now).
Update - improved answer using dask.delayed, building on the accepted answer by @SultanOrazbayev
import dask

# onedem = original_xarray_dataarray
poss_list = []

@dask.delayed
def get_bergs(labeled_blocks, pointer, chunk0, chunk1):
    # Note: I'm using this in a CRS (polar stereo) with negative y coordinates - it hasn't been tested for other CRSs
    def getpx(chunkid, chunksz):
        amin = chunkid[0] * chunksz[0][0]
        amax = amin + chunksz[0][0]
        bmin = chunkid[1] * chunksz[1][0]
        bmax = bmin + chunksz[1][0]
        return (amin, amax, bmin, bmax)

    # order of all inputs (and outputs) should be y, x when axis order is used
    chunksz = (onedem.chunks['y'], onedem.chunks['x'])
    ymini, ymaxi, xmini, xmaxi = getpx((chunk0, chunk1), chunksz)

    # use rasterio Windows and rioxarray to construct the transform
    # https://rasterio.readthedocs.io/en/latest/topics/windowed-rw.html#window-transforms
    chwindow = rasterio.windows.Window(xmini, ymini, xmaxi-xmini, ymaxi-ymini)
    trans = onedem.rio.isel_window(chwindow).rio.transform(recalc=True)

    return list(poly[0]['coordinates'][0] for poly in rasterio.features.shapes(labeled_blocks.astype('int32'), transform=trans))[:-1]

for __, obj in enumerate(labeled_arr.to_delayed()):
    for bl in obj:
        piece = dask.delayed(get_bergs)(bl, *bl.key)
        poss_list.append(piece)

poss_list = dask.compute(*poss_list)

# unnest the list of polygons returned by using dask to polygonize
concat_list = [item for sublist in poss_list for item in sublist if len(item) != 0]
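If you want the geoseries mentioned in the question rather than a plain list, one possible last step is to wrap each coordinate ring in a shapely Polygon (a sketch, assuming shapely and geopandas are installed and each item in concat_list is an exterior-ring coordinate sequence):

from shapely.geometry import Polygon
import geopandas as gpd

# each entry of concat_list is a list of (x, y) coordinates for one region's exterior ring
polygons = gpd.GeoSeries([Polygon(ring) for ring in concat_list])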

How to create an empty numpy array with semi-specified dims?

I am trying to read data from a server like this:
with requests.Session() as s:
    data = {}
    r = s.get('https://something.com', json=data).json()

    training_set1 = np.empty([-1,4])
    training_set1[:,0] = r["o"]
    training_set1[:,1] = r["h"]
    training_set1[:,2] = r["l"]
    training_set1[:,3] = r["c"]
But I don't know the length of the arrays, so I used -1 and then got this error message:
ValueError: negative dimensions are not allowed
How can I fix this code? The response r is a JSON object:
{"t":[1322352000,1322438400],
"o":[123,123],
"h":[123,123],
"l":[123,123],
"c":[123,123]}
which I am trying to rearrange into a numpy array.
Numpy arrays have fixed sizes, so you cannot initialize a dynamically sized array. What you can do is use a list of lists and later convert the list to a numpy array.
Something like this should work assuming r["x"] is a list. (Untested code)
with requests.Session() as s:
    data = {}
    r = s.get('https://something.com', json=data).json()

    t_set1 = []
    t_set1.append(r["o"])
    t_set1.append(r["h"])
    t_set1.append(r["l"])
    t_set1.append(r["c"])
    training_set1 = np.array(t_set1)
Edit: edited for the order "o","h","l","c" after OP edited the question.
You cannot declare a numpy array with an unknown dimension. But you can declare it in one single operation:
training_set1 = np.array([r["o"], r["h"], r["l"], r["c"]])
or even better:
training_set1 = np.array([r[i] for i in "ohlc"])
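Note that both snippets above stack the four series as rows, giving shape (4, N). If you want the (N, 4) layout sketched in the question, with o, h, l, c as columns, transpose the result:

training_set1 = np.array([r[i] for i in "ohlc"]).T  # shape (N, 4): columns o, h, l, c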

Particle Tracking by coordinates from txt file

I have some particle track data from an OpenFoam simulation.
The data looks like this:
0.005 0.00223546 1.52096e-09 0.00503396
0.01 0.00220894 3.92829e-09 0.0101636
0.015 0.00218103 5.37107e-09 0.0154245
.....
The first column is time, then the x, y, z coordinates.
In my folder, I have a file for every tracked particle.
I would like to calculate the velocity and the displacement for each particle in each timestep.
It would be nice to access the position data with something like particle[1].time[0.01].
Is there already a python tool for that kind of problem?
Thanks a lot
If you have regular time steps, you can use a pandas dataframe to find the differences:
import pandas as pd

dt = .005  # or whatever time difference you have
df = pd.read_csv(<a bunch of stuff indicating how to read the file>)
df['v_x'] = df[<the x column>].diff() / dt
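A fuller sketch of that idea, assuming one whitespace-separated file per particle with columns t, x, y, z (the filename is hypothetical):

import numpy as np
import pandas as pd

dt = 0.005  # constant time step from the simulation

# hypothetical filename; columns assumed to be t, x, y, z
df = pd.read_csv("particle_0001.txt", sep=r"\s+", names=["t", "x", "y", "z"])

# per-step displacement components, then velocity components
for axis in ("x", "y", "z"):
    df["d" + axis] = df[axis].diff()
    df["v" + axis] = df["d" + axis] / dt

# magnitudes of displacement and velocity per time step
df["displacement"] = np.sqrt(df.dx**2 + df.dy**2 + df.dz**2)
df["speed"] = df["displacement"] / dt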
You "almost" don't need numpy for that. I created a simple class hierarchy with some initial methods. You could improve from that if you like the approach. Note that I am creating from a string, you should use for line in file instead of the string.split way.
import numpy

class Track(object):
    def __init__(self):
        self.trackpoints = []

    def AddTrackpoint(self, line):
        tpt = self.Trackpoint(line)
        if self.trackpoints and tpt.t < self.trackpoints[-1].t:
            raise ValueError("timestamps should be in ascending order")
        self.trackpoints.append(tpt)
        return tpt

    def length(self):
        pairs = zip(self.trackpoints[:-1], self.trackpoints[1:])
        dists = map(self.distance, pairs)
        return sum(dists)

    def distance(self, points):
        p1, p2 = points
        return numpy.sqrt(sum((p2.pos - p1.pos)**2))  # only convenient use of numpy so far

    class Trackpoint(object):
        def __init__(self, line):
            t, x, y, z = line.split(' ')
            self.t = float(t)  # convert so timestamps compare numerically, not as strings
            self.pos = numpy.array((x, y, z), dtype=float)

entries = """
0.005 0.00223546 1.52096e-09 0.00503396
0.01 0.00220894 3.92829e-09 0.0101636
0.015 0.00218103 5.37107e-09 0.0154245
""".strip()

lines = entries.split("\n")
track = Track()
for line in lines:
    track.AddTrackpoint(line)

print(track.length())
Individual files would be easily loaded with something like:
import numpy as np
t, x, y, z = np.loadtxt(filename, delimiter=' ', unpack=True)
Now there is an issue, as you would like to index particle position by time, whereas Numpy will only accept integers as indices.
edit: in Python you could make "position" a dictionary so that you can index it with a float, or anything else. But it comes down to the amount of data you have and what you want to do with it, because dictionaries will be less efficient than Numpy arrays for anything a little more 'advanced' than just picking the position at time t.
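A small sketch of that dictionary idea (filename hypothetical), giving roughly the particle-by-time access the question asks for:

import numpy as np

# columns: t, x, y, z
t, x, y, z = np.loadtxt("particle_0001.txt", unpack=True)

# map each timestamp to its position vector
position = {ti: np.array([xi, yi, zi]) for ti, xi, yi, zi in zip(t, x, y, z)}

print(position[0.01])  # position at t = 0.01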

Write & read ndarray

I would like to save an array with shape (5,2), named sorted_cube_station_list.
In the print it looks OK, but when I save it with numpy.tofile and later read it with numpy.fromfile it becomes a 1-D array.
Can you help me with that?
import numpy as num

nx = 5
ny = 5
nz = 5
stations = ['L001', 'L002', 'L003', 'L004', 'L005']

for x in range(nx):
    for y in range(ny):
        for z in range(nz):
            cube_station_list = []
            i = -1
            for sta in stations:
                i = i + 1
                cube = [int(i), num.random.randint(2500, size=1)[0]]
                cube_station_list.append(cube)
            cub_station_list_arr = num.asarray(cube_station_list)
            sorted_cube_station_list_arr = cub_station_list_arr[cub_station_list_arr[:, 1].argsort()]
            print(x, y, z, sorted_cube_station_list_arr)
            num.ndarray.tofile(sorted_cube_station_list_arr, str(x)+'_'+str(y)+'_'+str(z))
I suggest you use np.save:
a = np.ones(16).reshape([8, 2])
np.save("fileName.npy", a)
See the docs: the first parameter must not be the variable you want to save, but the path to the file where you want to save it. Hence the error you got when using np.save(yourArray).
You can load the saved array using np.load(pathToArray)
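A quick sketch of the round trip; unlike tofile/fromfile (which store only raw bytes, so the shape must be restored manually with reshape), np.save/np.load keep shape and dtype:

import numpy as np

a = np.ones(16).reshape([8, 2])
np.save("fileName.npy", a)

b = np.load("fileName.npy")
print(b.shape)  # (8, 2) -- shape is preserved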
