Combining/merging multiple 2D arrays into a single array using Python

I have four 2-dimensional NumPy arrays, each of shape (203, 135). I want to join all of them into one single array with respect to latitude and longitude.
I used the code below to read the data:
import pandas as pd
import numpy as np
import os
import glob
from pyhdf import SD
import datetime
import mpl_toolkits.basemap.pyproj as pyproj
DATA = ({})
files = glob.glob('MOD04*')
files.sort()
for n, f in enumerate(files):
    SDS_NAME='Deep_Blue_Aerosol_Optical_Depth_550_Land'
    hdf=SD.SD(f)
    lat = hdf.select('Latitude')
    latitude = lat[:]
    min_lat=latitude.min()
    max_lat=latitude.max()
    lon = hdf.select('Longitude')
    longitude = lon[:]
    min_lon=longitude.min()
    max_lon=longitude.max()
    sds=hdf.select(SDS_NAME)
    data=sds.get()
    p = pyproj.Proj(proj='utm', zone=45, ellps='WGS84')
    x,y = p(longitude, latitude)
def set_element(elements, x, y, data):
    # Set element with two coordinates.
    elements[x + (y * 10)] = data
elements = []
set_element(elements,x,y,data)
But I got this error: only integer arrays with one element can be converted to an index
The data can be found here: https://drive.google.com/open?id=0B2rkXkOkG7ExMElPRDd5YkNEeDQ
As requested, I have created a toy dataset for this problem.
What I want is to get one single array from the four arrays (a, b, c, d), with a shape of something like (406, 270).
a = (np.random.rand(27405)).reshape(203,135)
b = (np.random.rand(27405)).reshape(203,135)
c = (np.random.rand(27405)).reshape(203,135)
d = (np.random.rand(27405)).reshape(203,135)
a_x = (np.random.uniform(10,145,27405)).reshape(203,135)
a_y = (np.random.uniform(204,407,27405)).reshape(203,135)
d_x = (np.random.uniform(150,280,27405)).reshape(203,135)
d_y = (np.random.uniform(204,407,27405)).reshape(203,135)
b_x = (np.random.uniform(150,280,27405)).reshape(203,135)
b_y = (np.random.uniform(0,202,27405)).reshape(203,135)
c_x = (np.random.uniform(10,145,27405)).reshape(203,135)
c_y = (np.random.uniform(0,202,27405)).reshape(203,135)
any help?

This should be a comment, but the comment space is not enough for these questions, so I am posting it here:
You say that you have four input arrays (a, b, c, d) which are somehow to be integrated into an output array. As far as I understand, two of these arrays contain positional information (x, y) such as longitude and latitude. The only line in your code where you combine several input arrays is here:
def set_element(elements, x, y, data):
    # Set element with two coordinates.
    elements[x + (y * 10)] = data
Here you have four input variables (elements, x, y, data), which I assume to be your input arrays (a, b, c, d). Yet this operation does not combine them; it overwrites one element of elements (index: x + 10y) with a new value (data).
Therefore, I do not understand your target output.
When I asked for toy data, I had something like this in mind:
a = [[1,2]]
b = [[3,4]]
c = [[5,6]]
d = [[7,8]]
With an example this simple, you could easily say:
What I want is this:
res = [[[1,2],[3,4]],[[5,6],[7,8]]]
Then we could help you find an answer.
So please provide more information about the operation you want to perform, either in mathematical notation (such as x = a + b*c + d) or with toy data, so that we can deduce the function you are asking for.
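For reference, if the four tiles from the question's toy data simply sit next to each other on a regular grid, the (406, 270) target can be sketched with np.block. This is only a sketch under assumptions: the placement of each tile is read off the toy x/y ranges, row 0 is assumed to hold the largest y values, and the per-pixel coordinates are ignored entirely. Real swath data would normally need regridding instead.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c, d = (rng.random((203, 135)) for _ in range(4))

# Assumed layout, read off the toy coordinate ranges in the question:
#   a: x in [10, 145],  y in [204, 407]  -> upper left
#   d: x in [150, 280], y in [204, 407]  -> upper right
#   c: x in [10, 145],  y in [0, 202]    -> lower left
#   b: x in [150, 280], y in [0, 202]    -> lower right
mosaic = np.block([[a, d],
                   [c, b]])
print(mosaic.shape)  # (406, 270)
```

np.block keeps every tile intact, so mosaic[:203, :135] is exactly a and mosaic[203:, 135:] is exactly b.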

Related

Re-distributing 2d data with max in middle

Hey all, I have a set of seemingly random 2D data that I want to reorder. This is for an image with specific values at each pixel, but the concept is the same.
I have a large 2D array that looks very random, say:
x = 100
y = 120
np.random.random((x,y))
and I want to redistribute the 2D matrix so that the maximum value is in the center and the remaining values surround it in decreasing order, giving a sort of Gaussian falloff from the center.
Small example:
output = [[0.0,0.5,1.0,1.0,1.0,0.5,0.0],
          [0.0,1.0,1.0,1.5,1.0,0.5,0.0],
          [0.5,1.0,1.5,2.0,1.5,1.0,0.5],
          [0.0,1.0,1.0,1.5,1.0,0.5,0.0],
          [0.0,0.5,1.0,1.0,1.0,0.5,0.0]]
I know it won't really be a Gaussian; I am just trying to give a visualization of what I would like. I was thinking of sorting the 2D array into a list from max to min and then using that to create a new 2D array, but I'm not sure how to distribute the values to fill the matrix the way I want.
Thank you very much!
If anyone looks at this in the future and needs help, here is some advice on how to do this effectively for a lot of data. The code is posted below.
import itertools
import numpy as np
def datasort(inputarray,spot_in_x,spot_in_y):
    #get the data read
    center_of_y = spot_in_y
    center_of_x = spot_in_x
    M = len(inputarray[0]) #number of columns
    N = len(inputarray) #number of rows
    l_list = list(itertools.chain(*inputarray)) #listed data
    l_sorted = sorted(l_list,reverse=True) #sorted listed data
    #Reorder
    to_reorder = list(np.arange(0,len(l_sorted),1))
    x = np.linspace(-1,1,M)
    y = np.linspace(-1,1,N)
    centerx = int(M/2 - center_of_x)*0.01
    centery = int(N/2 - center_of_y)*0.01
    [X,Y] = np.meshgrid(x,y)
    R = np.sqrt((X+centerx)**2 + (Y+centery)**2) #distance of each cell from the (offset) center
    R_list = list(itertools.chain(*R))
    values = zip(R_list,to_reorder)
    sortedvalues = sorted(values) #flat indices ordered by distance
    unzip = list(zip(*sortedvalues))
    unzip2 = unzip[1]
    l_reorder = zip(unzip2,l_sorted) #pair nearest cells with largest values
    l_reorder = sorted(l_reorder) #back to flat-index order
    l_unzip = list(zip(*l_reorder))
    l_unzip2 = l_unzip[1]
    sorted_list = np.reshape(l_unzip2,(N,M))
    return(sorted_list)
This code flattens your data and sorts it in descending order. It then pairs each grid position with its distance from the chosen center, based on a circular distribution. Sorting those pairs by distance determines which position receives which sorted value, so the zip and sort steps produce the distribution you want; in my case the distribution function is a circle that can be offset.
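The same idea can be sketched more compactly with np.argsort. This is an alternative sketch, not the code above: the function name radial_sort is made up for illustration, and it takes the center as a (row, column) pixel index rather than the scaled offset used by datasort.

```python
import numpy as np

def radial_sort(arr, center_row, center_col):
    """Rearrange the values of arr so they decrease with distance from a center cell."""
    arr = np.asarray(arr)
    rows, cols = np.indices(arr.shape)
    # Euclidean distance of every cell from the requested center
    dist = np.hypot(rows - center_row, cols - center_col)
    # Flat indices of the cells, nearest first
    order = np.argsort(dist, axis=None, kind="stable")
    out = np.empty(arr.size, dtype=arr.dtype)
    # Largest value goes to the nearest cell, second largest to the next, and so on
    out[order] = np.sort(arr, axis=None)[::-1]
    return out.reshape(arr.shape)

data = np.random.default_rng(0).random((5, 7))
result = radial_sort(data, 2, 3)
```

The output holds exactly the same values as the input, only in different positions, and the maximum ends up at the requested center cell.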

Generate simulated data in Python while meeting a range of correlations with respect to a predefined variable

Let's denote refVar, a variable of interest that contains experimental data.
For the simulation study, I would like to generate other variables V0.05, V0.10, V0.15, and so on up to V0.95.
Note that in each variable name, the value following V is the correlation between that variable and refVar (to make it quick to track in the final dataframe).
My reading led me to multivariate_normal() from numpy. However, this function generates two 1D arrays that are both filled with random numbers. What I want is to always keep refVar and generate other arrays filled with random numbers that meet the specified correlation.
Please find my code below. To cut it short, I have no clue how to generate other variables relative to my experimental variable refVar. Ideally, I would like to build a data frame containing the following columns: refVar,V0.05,V0.10,...,V0.95. I hope you get my point, and thank you in advance for your time.
import numpy as np
import pandas as pd
from numpy.random import multivariate_normal as mvn
refVar = [75.25,77.93,78.2,61.77,80.88,71.95,79.88,65.53,85.03,61.72,60.96,56.36,23.16,73.36,64.18,83.07,63.25,49.3,78.2,30.96]
mean_refVar = np.mean(refVar)
for r in np.arange(0,1,0.05):
    var1 = 1
    var2 = 1
    cov = r
    cov_matrix = [[var1,cov],
                  [cov,var2]]
    data = mvn([mean_refVar,mean_refVar],cov_matrix,size=len(refVar))
    output = 'corr_'+str(r.round(2))+'.txt'
    df = pd.DataFrame(data,columns=['refVar','v'+str(r.round(2))])
    df.to_csv(output,sep='\t',index=False) # Ideally, instead of creating an output for each correlation, I would like to generate a DF with refVar and all these newly created Series
Following this answer, we can generate the sequence as follows:
def rand_with_corr(refVar, corr):
    # center and normalize refVar
    X = np.array(refVar) - np.mean(refVar)
    X = X/np.linalg.norm(X)
    # random sampling Y
    Y = np.random.rand(len(X))
    # center Y
    Y = Y - Y.mean()
    # find the component of Y orthogonal to X
    Y = Y - Y.dot(X) * X
    # normalize Y
    Y = Y/np.linalg.norm(Y)
    # output
    return Y + (1/np.tan(np.arccos(corr))) * X
# test
out = rand_with_corr(refVar, 0.05)
pd.Series(out).corr(pd.Series(refVar))
# out
# 0.050000000000000086
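To get the single data frame the question asks for (refVar plus V0.05 through V0.95), the function can be called once per target correlation. The following is a minimal sketch: it restates the function so the block runs on its own, and the rng argument and the use of np.linspace for the correlation grid are my own choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

def rand_with_corr(ref, corr, rng):
    # Same construction as above: unit-norm centered reference ...
    x = np.asarray(ref, dtype=float)
    x = x - x.mean()
    x = x / np.linalg.norm(x)
    # ... plus a unit-norm random vector orthogonal to it
    y = rng.random(len(x))
    y = y - y.mean()
    y = y - y.dot(x) * x
    y = y / np.linalg.norm(y)
    # Mixing the two with cot(arccos(corr)) gives exactly the requested correlation
    return y + x / np.tan(np.arccos(corr))

ref_var = [75.25, 77.93, 78.2, 61.77, 80.88, 71.95, 79.88, 65.53, 85.03, 61.72,
           60.96, 56.36, 23.16, 73.36, 64.18, 83.07, 63.25, 49.3, 78.2, 30.96]

rng = np.random.default_rng(0)
df = pd.DataFrame({"refVar": ref_var})
# linspace avoids the floating-point endpoint surprises of np.arange
for r in np.linspace(0.05, 0.95, 19):
    df[f"V{r:.2f}"] = rand_with_corr(ref_var, r, rng)

print(df.corr()["refVar"].round(2))
```

The generated columns are centered near zero and are not on the scale of refVar. Since the Pearson correlation is unchanged by adding a constant or multiplying by a positive factor, they can be shifted and rescaled afterwards if a particular mean or spread is needed.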

xarray coordinate-dependent computation

I'm using xarray with data for which I have measurements and errors.
I store these along a dimension moment in the dataset, with coordinates value and variance.
When I compute, for example, the mean along a dimension, values and variances need to be treated differently: the former should be combined as
mean_values = sum(values)/len(values)
but the latter as
mean_variance = sum(variances**2)/len(variances).
Currently I'm doing this by forming two new datasets and concatenating them. This is ugly, convoluted, and not suited to more complex calculations. I would like to be able to do this kind of operation in one step, perhaps by defining a function that takes values and variances as input and then broadcasting the dataset dimension moment onto it.
Given a dataset q_lp with dimensions moment, time, position:
q_lp_av = q_lp.sel(moment='value').mean(dim='time')
q_lp_var = q_lp.sel(moment='variance').reduce(average_of_squares, dim='time')
q_lp = xr.concat([q_lp_av, q_lp_var], dim='moment')
where average_of_squares is defined by
def average_of_squares(data, axis=None):
    sums = np.sum(data**2, axis=axis)
    if axis is not None:  # a plain `if axis:` would skip axis=0
        return sums/np.shape(data)[axis]**2
    return sums/len(data)**2
What better ways are there to handle this?
Is it possible to use xr.apply_ufunc and a my_average function to do this in one step and in-place?
Should I not be putting these into one dataset together at all? q_lp is later combined with other quantities, also with dimensions moment, pos and time, into a Dataset.
I'm grateful for discussion, ideas, tips and links to examples.
Edit:
To clarify, I don't like splitting the DataArray, handling each moment separately and concatenating them again. I would prefer to be able to do the following (untested pseudocode for illustration):
def multi_moment_average(mean, variance):
    mean = np.average(mean)
    variance = np.sum(variance**2)/len(variance)
    return mean, variance
q_lp.reduce(multi_moment_average, broadcast='moment', dim='time')
Minimal working example:
import numpy as np
import xarray as xr
def average_of_squares(data, axis=None):
    sums = np.sum(data**2, axis=axis)
    if axis is not None:  # a plain `if axis:` would skip axis=0
        return sums/np.shape(data)[axis]**2
    return sums/len(data)**2
times = np.arange(10)
positions = np.array([1, 3, 5])
values = np.ones((len(times), len(positions))) * (2 + np.random.rand())
variance = np.ones((len(times), len(positions))) * np.random.rand()
q_lp = xr.DataArray(np.array([values, variance]),
                    coords=[['value', 'variance'], times, positions],
                    dims=['moment', 'time', 'position'])
q_lp_av = q_lp.sel(moment='value').mean(dim='time')
q_lp_var = q_lp.sel(moment='variance').reduce(average_of_squares, dim='time')
q_lp = xr.concat([q_lp_av, q_lp_var], dim='moment')
I think you can write your function in an xarray-friendly way and then call it on your data, i.e.:
def average_of_squares(data, dim=None):
    sums = (data ** 2).sum(dim)
    return sums/data.count(dim)**2
q_lp_var = q_lp.sel(moment='variance').pipe(average_of_squares, dim='time')
Having them concatenated in the same DataArray is fine; it might be a more natural fit to hold them as items on a Dataset, though.
Does that answer your question?
Edit: regarding the edited question, I think holding the items in a Dataset rather than a DataArray is most coherent with the data structures. It seems like the mean and variance are two different arrays that you want aligned on the same indexes, so a Dataset is ideal.
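A minimal sketch of that Dataset layout follows. The variable names value and variance, the helper average_moments and the random test data are assumptions for illustration; the variance formula is the one from average_of_squares in this answer (sum of squares divided by the squared count).

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
times = np.arange(10)
positions = np.array([1, 3, 5])

# One Dataset, two aligned variables instead of a 'moment' dimension
ds = xr.Dataset(
    {
        "value": (("time", "position"), 2 + rng.random((10, 3))),
        "variance": (("time", "position"), rng.random((10, 3))),
    },
    coords={"time": times, "position": positions},
)

def average_moments(ds, dim):
    """Reduce each variable with its own rule in a single call."""
    n = ds["variance"].count(dim)
    return xr.Dataset(
        {
            "value": ds["value"].mean(dim),
            "variance": (ds["variance"] ** 2).sum(dim) / n**2,
        }
    )

reduced = average_moments(ds, "time")
print(reduced)
```

Each variable gets its own reduction rule inside one function, and nothing has to be selected and concatenated afterwards.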
I found a solution that suits my needs, but I am still grateful for more suggestions:
groupby can separate a Dataset or DataArray along a specified dimension; calling list on the result yields (key, value) tuples, and passing those to dict gives a mapping that can be unpacked as keyword arguments. See http://xarray.pydata.org/en/stable/groupby.html
My current solution thus looks like this:
import xarray as xr
def function_applier(data, function, split_dimension=None, **function_kwargs):
    return xr.concat(
        function(
            **dict(list(data.groupby(split_dimension))),
            **function_kwargs),
        dim=split_dimension)
Now I can define functions taking specific coordinates as inputs which can be written to also work for e.g. numpy arrays.
(MWE using the specific example of my original question here)
import numpy as np
def average_of_gaussians(val, var, dim=None):
    return val.mean(dim), (var ** 2).sum(dim)/var.count(dim)
val = np.random.rand(12).reshape(2,6)
var = 0.1*np.random.rand(12).reshape(2,6)
# the stacked array has shape (2, 2, 6), so the dims must be ordered (moment, position, time)
da = xr.DataArray([val, var],
                  dims=['moment','position','time'],
                  coords=[['val','var'],
                          ['a','b'],
                          np.arange(6)])
>>> da
<xarray.DataArray (moment: 2, position: 2, time: 6)>
array([[[0.66233728, 0.71419351, 0.96758741, 0.96949021, 0.94594299,
         0.05080628],
        [0.44005458, 0.64616657, 0.69865189, 0.84970553, 0.19561433,
         0.8529829 ]],
       [[0.02209967, 0.02152369, 0.09181031, 0.00223527, 0.01448938,
         0.01484197],
        [0.05651841, 0.04942305, 0.08250529, 0.04258035, 0.00184209,
         0.0957248 ]]])
Coordinates:
  * moment    (moment) <U3 'val' 'var'
  * position  (position) <U1 'a' 'b'
  * time      (time) int32 0 1 2 3 4 5
>>> function_applier(da,
...                  average_of_gaussians,
...                  split_dimension='moment',
...                  dim='time')
<xarray.DataArray (moment: 2, position: 2)>
array([[0.71839295, 0.61386263],
       [0.001636  , 0.00390397]])
Coordinates:
  * position  (position) <U1 'a' 'b'
  * moment    (moment) object 'val' 'var'
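Note that the argument names of average_of_gaussians must equal the coordinate values along the split dimension (here val and var), because the groups are passed in as keyword arguments. Applying a different operation to each variable within one function, and having no references to xarray inside that function, are the properties I am after.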

Structuring data in NumPy for LSTM (examples)

I am having trouble understanding how data should be prepared for different models:
One to many
Many to one
Many to many (A)
Many to many (B)
Is this the right way to think of it? The shape numbers are not relevant and do not match those in the picture; I am just trying to understand the logic behind it:
import numpy as np
#1. one to many
# X for input y for output
X = np.ones([10,1,5])
y = np.zeros([10,3]) #3 represents size of output vector
#2. many to one
X = np.ones([10,5,5])
y = np.zeros([10,1])
#3. many to many
X = np.ones([10,5,5])
y = np.zeros([10,5])
# in this case cell should be different than y. It must be bigger to shift some data
#4. many to many
X = np.ones([10,5,5])
y = np.zeros([10,5])
# in this case cell is the same shape as y
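As a concrete illustration of these shapes, here is a minimal sketch of how a 2D series of shape (time, features) might be cut into the 3D layout (samples, timesteps, features) that Keras-style LSTM layers expect. The helper make_windows, and the choice of feature 0 as the prediction target, are assumptions made only for this example.

```python
import numpy as np

def make_windows(series, timesteps):
    """Slice a (time, features) array into overlapping windows."""
    series = np.asarray(series)
    n_samples = len(series) - timesteps
    # X: (samples, timesteps, features)
    X = np.stack([series[i:i + timesteps] for i in range(n_samples)])
    # Many to one: one target per window, the value right after the window
    y_last = series[timesteps:, 0]
    # Many to many: one target per timestep, the input shifted by one step
    y_seq = np.stack([series[i + 1:i + timesteps + 1, 0] for i in range(n_samples)])
    return X, y_last, y_seq

series = np.arange(40).reshape(20, 2)   # 20 time steps, 2 features
X, y_last, y_seq = make_windows(series, timesteps=5)
print(X.shape, y_last.shape, y_seq.shape)  # (15, 5, 2) (15,) (15, 5)
```

In this sketch the many-to-one target has one entry per sample and the many-to-many target has one entry per sample and timestep, which mirrors the difference between cases 2 and 4 above.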

netCDF grid file: Extracting information from 1D array using 2D values

I am trying to work in Python 3 with topography/bathymetry-information (basically a grid containing x [longitude in decimal degrees], y [latitude in decimal degrees] and z [meter]).
The grid file has the extension .nc and is therefore a netCDF-file. Normally I would use it in mapping tools like Generic Mapping Tools and don't have to bother with how a netCDF file works, but I need to extract specific information in a Python script. Right now this is only limiting the dataset to certain longitude/latitude ranges.
However, right now I am a bit lost on how to get to the z-information for specific x and y values. Here's what I know about the data so far
import netCDF4
#----------------------
# Load netCDF file
#----------------------
bathymetry_file = 'C:/Users/te279/Matlab/data/gebco_08.nc'
fh = netCDF4.Dataset(bathymetry_file, mode='r')
#----------------------
# Getting information about the file
#----------------------
print(fh.file_format)
NETCDF3_CLASSIC
print(fh)
root group (NETCDF3_CLASSIC data model, file format NETCDF3):
title: GEBCO_08 Grid
source: 20100927
dimensions(sizes): side(2), xysize(933120000)
variables(dimensions): float64 x_range(side), float64 y_range(side), int16 z_range(side), float64 spacing(side), int32 dimension(side), int16 z(xysize)
groups:
print(fh.dimensions.keys())
odict_keys(['side', 'xysize'])
print(fh.dimensions['side'])
: name = 'side', size = 2
print(fh.dimensions['xysize'])
: name = 'xysize', size = 933120000
#----------------------
# Variables
#----------------------
print(fh.variables.keys()) # returns all available variable keys
odict_keys(['x_range', 'y_range', 'z_range', 'spacing', 'dimension', 'z'])
xrange = fh.variables['x_range'][:]
print(xrange)
[-180. 180.] # contains the values -180 to 180 for the longitude of the whole world
yrange = fh.variables['y_range'][:]
print(yrange)
[-90. 90.] # contains the values -90 to 90 for the latitude of the whole world
zrange = fh.variables['z_range'][:]
[-10977 8685] # contains the depths/topography range for the world
spacing = fh.variables['spacing'][:]
[ 0.00833333 0.00833333] # spacing in both x and y. Equals the dimension, if multiplied with x and y range
dimension = fh.variables['dimension'][:]
[43200 21600] # corresponding to the shape of z if it was the 2D array I would've hoped for (it's currently a 1D array of 933120000 - which is 43200*21600)
z = fh.variables['z'][:] # currently an 1D array of the depth/topography/z information I want
fh.close()
Based on this information I still don't know how to access z for specific x/y (longitude/latitude) values. I think I basically need to convert the 1D array of z into a 2D array corresponding to longitude/latitude values, but I don't have a clue how to do that. I saw some posts where people tried to convert a 1D array into a 2D array, but I have no way of knowing in which corner of the world the data start and in which order they progress.
I know there is a 3 year old similar post, however, I don't know how to find an analogue "index of the flattened array" for my problem - or how to exactly work with that. Can somebody help?
You need to first read in all three of z's dimensions (lat, lon, depth) and then extract values across each of those dimensions. Here are a few examples.
# Read in all 3 dimensions [lat x lon x depth]
z = fh.variables['z'][:,:,:]
# Topography at a single lat/lon/depth (1 value):
z_1 = z[5,5,5]
# Topography at all depths for a single lat/lon (1D array):
z_2 = z[5,5,:]
# Topography at all latitudes and longitudes for a single depth (2D array):
z_3 = z[:,:,5]
Note that the number you enter for lat/lon/depth is the index in that dimension, not an actual latitude, for instance. You'll need to determine the indices of the values you are looking for beforehand.
I just found the solution in this post. Sorry that I didn't see that before. Here's what my code looks like now. Thanks to Dave (he answered his own question in the post above). The only thing I had to work on was that the dimensions have to stay integers.
import netCDF4
import numpy as np
#----------------------
# Load netCDF file
#----------------------
bathymetry_file = 'C:/Users/te279/Matlab/data/gebco_08.nc'
fh = netCDF4.Dataset(bathymetry_file, mode='r')
#----------------------
# Extract variables
#----------------------
xrange = fh.variables['x_range'][:]
yrange = fh.variables['y_range'][:]
spacing = fh.variables['spacing'][:] # needed below for the grid size
zz = fh.variables['z'][:]
fh.close()
#----------------------
# Compute Lat/Lon
#----------------------
# Round before casting so floating-point error cannot drop a grid point
nx = int(round((xrange[-1]-xrange[0])/spacing[0])) # num pts in x-dir
ny = int(round((yrange[-1]-yrange[0])/spacing[1])) # num pts in y-dir
lon = np.linspace(xrange[0],xrange[-1],nx)
lat = np.linspace(yrange[0],yrange[-1],ny)
#----------------------
# Reshape the 1D to an 2D array
#----------------------
bathy = zz[:].reshape(ny, nx)
So now, when I look at the shape of both zz and bathy (following code), the former is a 1D array with a length of 933120000 and the latter is a 2D array of shape (ny, nx), i.e. 21600 rows by 43200 columns.
print(zz.shape)
print(bathy.shape)
The next step is to use indices to access the bathymetry/topography data correctly, just as N1B4 described in his post
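The index lookup can be sketched as follows. To stay runnable without the GEBCO file, the sketch uses a small synthetic grid in place of bathy, lon and lat. It assumes that the rows of the 2D array follow lat and the columns follow lon in the same order as the coordinate vectors, which should be checked against the real data, for example by plotting. The helper nearest_index is made up for this example.

```python
import numpy as np

# Small synthetic stand-in for the real grid (10 degree spacing)
lon = np.linspace(-180, 180, 37)
lat = np.linspace(-90, 90, 19)
bathy = np.arange(lat.size * lon.size).reshape(lat.size, lon.size)

def nearest_index(axis_values, target):
    """Index of the grid coordinate closest to target."""
    return int(np.abs(axis_values - target).argmin())

# Single point: z at the grid node nearest to 52 N, 4 E
row = nearest_index(lat, 52.0)
col = nearest_index(lon, 4.0)
depth = bathy[row, col]

# Rectangular region: limit the grid to a longitude/latitude range
lon_mask = (lon >= -10) & (lon <= 30)
lat_mask = (lat >= 40) & (lat <= 60)
subset = bathy[np.ix_(lat_mask, lon_mask)]
sub_lon, sub_lat = lon[lon_mask], lat[lat_mask]
print(row, col, subset.shape)  # 14 18 (3, 5)
```

Because the full grid has more than 900 million cells, it can be worth cutting out the region of interest right after reshaping and working only with the subset from then on.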
