If I have a rainfall map with three dimensions (latitude, longitude and rainfall value) and I put it in an array, do I need a 2D or 3D array? What would the array look like?
If I have a series of daily rainfall maps with four dimensions (lat, long, rainfall value and time) and I put them in an array, do I need a 3D or 4D array?
I am thinking that I would need 2D and 3D arrays, respectively, because the latitude and longitude can be represented by a 1D array alone (but reshaped so that it has more than one row and column). Enlighten me please.
I think that both propositions from @Genhis and @Bitzel are right, depending on what you want to do...
If you want to be effective, I would recommend putting both in a 2D data structure, and I would specifically advise a pandas DataFrame, which stores your data in a matrix-like structure but lets you use multiple indexes if you need to "think" in 3D or 4D.
It will be especially helpful with the second kind of data you mention, "(lat, long, rainfall value and time)", as it is part of what is called a "time series". Pandas has a lot of methods to help you average over some period of time (you can also group your data by longitude, latitude or location if needed); see the sketch below.
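A minimal sketch of that idea (made-up numbers, column names assumed): a "long" DataFrame indexed by (time, lat, lon) keeps everything in 2D while still letting you group or average along any of the conceptual dimensions.

import pandas as pd

df = pd.DataFrame({
    "time": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "lat":  [10.0, 10.5, 10.0, 10.5],
    "lon":  [20.0, 20.0, 20.0, 20.0],
    "rainfall": [1.2, 0.0, 3.4, 0.7],
}).set_index(["time", "lat", "lon"])

print(df.groupby(level="time").mean())          # daily mean rainfall
print(df.groupby(level=["lat", "lon"]).mean())  # mean per location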
On the contrary, if your objective is to learn how to compute those numbers in Python, then you can use 2D arrays for the first case and 2D or 3D for the second one, as the previous answers recommended. You could use something like numpy arrays as the data structure instead of pure Python lists, but that's debatable...
One important point: choosing 3D arrays for the time series, as @Genhis proposes, would require you to convert time into indexes (through lookup tables or a hash function), which is some more work...
As I said, you could also read about tidy, wide and long formats if you want to learn more about these questions...
For the rainfall map, the values you're describing are (latitude, longitude, rainfall value), so you need to use a 2D array (a matrix), since all you need is 3 columns and some number of rows. It will look like:
[image: rainfall matrix]
For the values (lat, long, rainfall value, time) it's the same case: you need a 2D array with 4 columns and some number of rows:
[image: rainfall matrix with 4 columns]
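For illustration, a rough numpy sketch of that column layout (made-up numbers), one observation per row:

import numpy as np

# columns: lat, lon, rainfall
rainfall = np.array([
    [10.0, 20.0, 1.2],
    [10.5, 20.0, 0.0],
])

# columns: lat, lon, rainfall, time
rainfall_t = np.array([
    [10.0, 20.0, 1.2, 0.0],
    [10.5, 20.0, 0.0, 0.0],
])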
I believe that the rainfall value shouldn't be a dimension. Therefore, you could use a 2D array, array[lat][lon] = rainfall_value, or a 3D array, array[time][lat][lon] = rainfall_value, respectively.
If you want to reduce the number of dimensions further, you can combine latitude and longitude into one dimension as you suggested, which would make the arrays 1D/2D.
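A short sketch of that gridded layout (grid sizes made up): rainfall is the stored value, and latitude/longitude/time become positions in the array.

import numpy as np

n_time, n_lat, n_lon = 31, 180, 360
rain_map = np.zeros((n_lat, n_lon))              # 2D: one rainfall map
rain_series = np.zeros((n_time, n_lat, n_lon))   # 3D: one map per day

rain_map[45, 120] = 3.2        # rainfall at one grid cell
rain_series[0, 45, 120] = 3.2  # same cell on day 0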
Background: I have a numpy array of float entries. This is basically a set of observations of something, say temperature measured during 24 hours. Imagine that the person recording the temperature is not available for the entire day; instead, he/she takes a few (say 5) readings during an hour and then, a few hours later, takes readings again (say 8 times). He/she puts all the measurements in a single np.array and hands it over to me!
Problem: I have no idea when the readings were taken. So I decide to cluster the observations in the following way: first recognize local peaks in the array, and group together all entries that are close enough (within a chosen tolerance, say 1 deg); in other words, I want to split the array into a list of sub-arrays. Note that any entry should belong to exactly one group.
One possible approach: first sort the array, then split it into sub-arrays under two conditions: (1) the difference between the first and last entries of a sub-array is not more than 1 deg, and (2) the difference between the last entry of a sub-array and the first entry of the next sub-array is greater than 1 deg. How can I achieve this fast (the numpy way)?
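For what it's worth, here is a hedged numpy sketch of the gap-based part of that approach (condition 2): sort, find where the jump between consecutive values exceeds the tolerance, and split there. The function name and tolerance are just illustrative, and condition 1 (the within-group range) may still need an extra pass.

import numpy as np

def split_by_gap(values, tol=1.0):
    s = np.sort(values)
    # positions where the gap to the next value exceeds the tolerance
    breaks = np.where(np.diff(s) > tol)[0] + 1
    return np.split(s, breaks)

temps = np.array([20.1, 25.3, 20.4, 25.0, 19.8, 25.9])
print(split_by_gap(temps))
# e.g. [array([19.8, 20.1, 20.4]), array([25. , 25.3, 25.9])]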
So, I'm doing something that is maybe a bit unorthodox. I have a number of 9-billion-pixel raster maps based on the NLCD, and I want to get the values from these rasters for the pixels which have ever been built up, which number about 500 million:
built_up_index = pandas.DataFrame(np.column_stack(np.where(unbuilt == 0)), columns = ["row", "column"]).sort_values(["row", "column"])
That piece of code gives me a dataframe where one column is the row index and the other is the column index of all the pixels which show construction in any of the NLCD raster maps (unbuilt is the ones-and-zeros raster which encodes that).
I want to use this to then read values from these NLCD maps and others, so that each pixel is a row and each column is a variable, say, its value in NLCD 2001, then its value in 2004, 2006 and so on (as well as other indices I have calculated). So the dataframe would look like this:
|row | column | value_2001 | value_2004 | var3 | ...
(VALUES HERE)
I have tried the following thing:
test = sprawl_2001.isel({'y': np.array(built_up_frame.iloc[:,0]), 'x': np.array(built_up_frame.iloc[:,1])}, drop = True).to_dataset(name="var").to_dataframe()
which works if I take a subsample as such:
test = sprawl_2001.isel({'y': np.array(built_up_frame.iloc[0:10000,0]), 'x': np.array(built_up_frame.iloc[0:10000,1])}, drop = True).to_dataset(name="var").to_dataframe()
but it doesn't do what I want: the result's length is squared, as it seems to build a 2D array from the cross-product of the two indexers and then flatten it, when what I want is a vector containing the values of just the pixels I subsampled.
I could obviously do this in a loop, pixel by pixel, but I imagine this would be extremely slow for 500 million values and there has to be a more efficient way.
Any advice here?
EDIT: In the end I gave up on using the index, because I get the impression xarray will only make an array with the same dimensions as my original dataset (about 161000 columns and 104000 rows) with a bunch of missing values, rather than creating a column vector with the values I want. I'm using np.extract instead:
def src_to_frame(src, unbuilt, varname):
    # keep only the pixels where the mask is 0 (ever built up), as a one-column DataFrame
    return pd.DataFrame(np.extract(unbuilt == 0, src), columns=[varname])
where src is the raster containing the variable of interest, unbuilt is the raster of the same size where 0s are the pixels that have ever been built, and varname is the name of the variable. It does what I want and fits in the RAM I have. Maybe not the most optimal, but it works!
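Hypothetical usage (nlcd_2001 here is a made-up stand-in, not a real variable from the question), assuming each raster is loaded as a 2D numpy array with the same shape as unbuilt:

import numpy as np
import pandas as pd

nlcd_2001 = np.random.randint(0, 95, size=(4, 4))  # stand-in raster
unbuilt = np.random.randint(0, 2, size=(4, 4))     # stand-in mask

value_2001 = src_to_frame(nlcd_2001, unbuilt, "value_2001")
print(value_2001.head())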
This looks like a good application for advanced indexing with DataArrays:
sprawl_2001.isel(
y=built_up_frame.iloc[0:10000,0].to_xarray(),
x=built_up_frame.iloc[0:10000,1].to_xarray(),
).to_dataset(name="var").to_dataframe()
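For what it's worth, a small self-contained sketch (toy data, made-up sizes) of why this gives pointwise selection rather than an outer product: both indexers are 1D DataArrays sharing the DataFrame's index dimension, so xarray pairs them element by element.

import numpy as np
import pandas as pd
import xarray as xr

sprawl_2001 = xr.DataArray(np.arange(16).reshape(4, 4), dims=("y", "x"))
built_up_frame = pd.DataFrame({"row": [0, 2, 3], "column": [1, 1, 3]})

out = sprawl_2001.isel(
    y=built_up_frame["row"].to_xarray(),
    x=built_up_frame["column"].to_xarray(),
).to_dataset(name="var").to_dataframe()

print(out)  # one row per (row, column) pair, not a 3x3 block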
I'm trying to open multiple netCDF files with xarray in Python. The files have data with same shape and I want to join them, creating a new dimension.
I tried to use the concat_dim argument of xarray.open_mfdataset(), but it doesn't work as expected. An example is given below, which opens two files with temperature data for 124 times, 241 latitudes and 480 longitudes:
DS = xr.open_mfdataset( 'eraINTERIM_t2m_*.nc', concat_dim='cases' )
da_t2m = DS.t2m
print( da_t2m )
With this code, I expect that the result data array will have a shape like (cases: 2, time: 124, latitude: 241, longitude: 480). However, its shape was (cases: 2, time: 248, latitude: 241, longitude: 480).
It creates a new dimension, but it also sums the 'time' dimensions of the two datasets (124 + 124 = 248).
I was wondering whether this is an error in xarray.open_mfdataset or expected behaviour, because the 'time' dimension is UNLIMITED in both datasets.
Is there a way to join data from these files directly using xarray and get the above expected return?
Thank you.
Mateus
Extending from my comment I would try this:
def preproc(ds):
    ds = ds.assign({'stime': (['time'], ds.time)}).drop('time').rename({'time': 'ntime'})
    # we might need to tweak this a bit further, depending on the actual data layout
    return ds
DS = xr.open_mfdataset( 'eraINTERIM_t2m_*.nc', concat_dim='cases', preprocess=preproc)
The good thing here is that you keep the original time coordinate in stime while renaming the original dimension (time -> ntime).
If everything works well, you should get resulting dimensions as (cases, ntime, latitude, longitude).
Disclaimer: I do something similar in a loop with a final concat (which works very well), but I did not test the preprocess approach.
Thank you @AdrianTompkins and @jhamman. After your comments I realized that, due to the different time periods, I really can't get what I want with xarray.
My main purpose in creating such an array is to get, in one single N-D array, all the data for different events with the same time duration. That way I can easily compute, for example, composite fields of all events for each time (hour, day, etc.).
I'm trying to do the same as I do with NCL. Below is NCL code that works as expected (for me) for the same data:
f = addfiles( (/"eraINTERIM_t2m_201812.nc", "eraINTERIM_t2m_201901.nc"/), "r" )
ListSetType( f, "join" )
temp = f[:]->t2m
printVarSummary( temp )
The final result is an array with 4 dimensions, with the new one automatically named as ncl_join.
However, NCL doesn't respect the time axis: it joins the arrays and gives the resulting time axis the coordinates of the first file, so the time axis becomes useless.
As @AdrianTompkins said, though, the time periods are different and xarray can't join the data like this. So, to create such an array in Python with xarray, I think the only way is to delete the time coordinate from the arrays; the time dimension would then have only integer indexes.
The array given by xarray works as @AdrianTompkins said in his small example. Since it keeps the time coordinates for all merged data, I think the xarray behaviour is the correct one in comparison with NCL. But now I think that a computation of composites (following the same example given above) wouldn't be as easy as it is with NCL.
In a small test, I print two values from the merged array with xarray:
print( da_t2m[ 0, 0, 0, 0 ].values )
print( da_t2m[ 1, 0, 0, 0 ].values )
which results in:
252.11412
nan
For the second case, there isn't data for the first time, as expected.
UPDATE: all the answers helped me understand this problem better, so I'm adding an update here to also thank @kmuehlbauer for his answer, indicating that his code gives the expected array.
Again, thank you all for help!
Mateus
The result makes sense if the times are different.
To simplify it, forget about the lat-lon dimensions for a moment and imagine you have two files that simply contain data at 2 timeslices each. The first has data at timesteps 1 and 2, and the second file has timesteps 3 and 4. You can't create a combined dataset with a time dimension that only spans 2 timeslices; the time dimension variable has to have the times 1, 2, 3, 4. So if you say you want a new dimension "cases", the data is combined as a 2D array and would look like this:
times: 1, 2, 3, 4
cases: 1, 2

data:
                time
            1    2    3    4
cases  1:   x1   x2   -    -
       2:   -    -    x3   x4
Think of the netCDF file that would be the equivalent: the time dimension has to span the range of values present in both files. The only way you could combine two files and get (cases: 2, time: 124, latitude: 241, longitude: 480) would be if both files had the same time, lat AND lon values, i.e. pointed to exactly the same region in time-lat-lon space.
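A toy reproduction of this (dummy data, not the ERA-Interim files): concatenating two arrays with disjoint time coordinates along a new "cases" dimension keeps the union of the times and pads the gaps with NaN.

import xarray as xr

a = xr.DataArray([1.0, 2.0], dims="time", coords={"time": [1, 2]})
b = xr.DataArray([3.0, 4.0], dims="time", coords={"time": [3, 4]})

combined = xr.concat([a, b], dim="cases")
print(combined.sizes)   # {'cases': 2, 'time': 4}
print(combined.values)  # [[ 1.  2. nan nan]
                        #  [nan nan  3.  4.]]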
PS: Somewhat off-topic for the question, but if you are just starting a new analysis, why not switch to the new-generation, higher-resolution ERA-5 reanalysis, which is now available back to 1979 (and will eventually be extended further back)? You can download it straight to your desktop with the Python API scripts from here:
https://cds.climate.copernicus.eu/cdsapp#!/search?type=dataset
I have CSV files which are 1200 rows x 3 columns. The number of rows can differ, from as low as 500 to as large as 5000, but the number of columns stays the same.
I want to create a feature vector from these files which maintains a consistent number of cells (vector length), and thus makes it possible to compute the distance between these vectors.
FILE_1
A, B, C
(267.09669678867186, 6.3664069175720197, 1257325.5809999991),
(368.24070923984374, 9.0808353424072301, 49603.662999999884),
(324.21470826328124, 11.489830970764199, 244391.04699999979),
(514.33452027500005, 7.5162401199340803, 56322.424999999988),
(386.19673340976561, 9.4927110671997106, 175958.77100000033),
(240.09965330898439, 10.3463039398193, 457819.8519411764),
(242.17559998691405, 8.4401674270629901, 144891.51100000029),
(314.23066895664061, 7.4405002593994096, 58433.818999999959),
(933.3073596304688, 7.1564397811889604, 41977.960000000014),
(274.04136473476564, 4.8482465744018599, 48782.314891525479),
(584.2639294320312, 7.90128517150879, 49730.705000000096),
(202.13173096835936, 10.559995651245099, 20847.805144088608),
(324.98563963710939, 2.2546300888061501, 43767.774800000007),
(464.35059935390626, 11.573680877685501, 1701597.3915132943),
(776.28339964687495, 8.7755222320556605, 106882.2469999999),
(310.11652952968751, 10.3175926208496, 710341.19162800116),
(331.19962889492189, 10.7578010559082, 224621.80632433048),
(452.31337752387947, 7.3100395202636701, 820707.26700000139),
(430.16615111171876, 10.134071350097701, 18197.691999999963),
(498.24687010585939, 11.0102319717407, 45423.269964585743),
.....,
.....,
500th row
FILE_2
(363.02781861484374, 8.8369808197021502, 72898.479666666608),
(644.20353882968755, 8.6263589859008807, 22776.78799999999),
(259.25105469882811, 9.8575859069824201, 499615.64068339905),
(410.19474608242189, 9.8795070648193395, 316146.18800000293),
(288.12153809726561, 4.7451887130737296, 58615.577999999943),
(376.25868409335936, 10.508985519409199, 196522.12200000012),
(261.11118895351564, 8.5228433609008807, 32721.110000000026),
(319.98896605312501, 3.2100667953491202, 60587.077000000027),
(286.94926268398439, 4.7687568664550799, 47842.133999999867),
(121.00206177890625, 7.9372291564941397, 239813.20531182736),
(308.19895750820314, 6.0029039382934597, 26354.519000000011),
(677.17011839687495, 9.0299625396728498, 10391.757655172449),
(182.1304913216797, 8.0010566711425799, 145583.55700000061),
(187.06341736972655, 9.9460496902465803, 77488.229000000007),
(144.07867615878905, 3.6044106483459499, 104651.56499999999),
(288.92317015468751, 4.3750333786010698, 151872.1949999998),
(228.2089825326172, 4.4475774765014604, 658120.07628214348),
(496.18831055820311, 11.422966003418001, 2371155.6659999997),
(467.30134398281251, 11.0771179199219, 109702.48440899582),
(163.08418089687501, 5.7271881103515598, 38107.106791666629),
.....,
.....,
3400th row
You can see that there is no correspondence between the two files, i.e. if someone asked you to calculate the distance between these two vectors, it's not possible.
The aim is to interpolate the rows of both files in such a manner that there is consistency across all such files, i.e. when I look up the first row, it should represent the same feature across all the files. Now let's look at FILE_1.
The range of values for the three columns is (considering only 20 rows for the time being):
A: 202.13173096835936 to 933.3073596304688
B: 2.2546300888061501 to 11.573680877685501
C: 18197.691999999963 to 1701597.3915132943
I want to put these points on a 3D grid whose cell size will be 0.1 x 0.1 x 0.1 (or let's say 10 x 10 x 10, or any arbitrary cell size).
But for that to work, we need to normalize the data (mean normalization, etc.).
The data we have is 3D data, which needs to be normalized in order to interpolate it into this 3D array. The result needn't be 3D; if it's a vector, that will also do.
When I say I need to average the points, I mean that if more than one point happens to fall in a cell (which will happen if the cell size is big, e.g. 100 x 100 x 100), then we take the average of the x, y, z coordinates as the value of that cell.
These interpolated vectors will then have the same length and correspondence, because a given position in one vector represents the same cell as the same position in every other vector.
NOTE: The min and max ranges for the coordinates across all files are A: 100 to 1000, B: 2 to 12, C: 10000 to 2000000.
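A rough sketch of this binning idea (untested against the real files; the bin count, the hard-coded ranges, and the reading of "average of the x, y, z coordinates" as the mean of the three normalized coordinates are all assumptions): scale each column with the global min/max above, assign each row to a cell of a bins x bins x bins grid, and average the points that land in the same cell, so that every file yields a vector of the same length.

import numpy as np

def file_to_feature_vector(points, bins=10,
                           lo=(100.0, 2.0, 10000.0),
                           hi=(1000.0, 12.0, 2000000.0)):
    points = np.asarray(points, dtype=float)       # shape (n_rows, 3): A, B, C
    lo, hi = np.asarray(lo), np.asarray(hi)
    scaled = (points - lo) / (hi - lo)             # normalize to roughly [0, 1]
    idx = np.clip((scaled * bins).astype(int), 0, bins - 1)
    flat = np.ravel_multi_index(idx.T, (bins, bins, bins))
    sums = np.zeros(bins ** 3)
    counts = np.zeros(bins ** 3)
    np.add.at(sums, flat, scaled.mean(axis=1))     # accumulate per-cell values
    np.add.at(counts, flat, 1)
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

The distance between two files is then well defined, e.g. np.linalg.norm(vec1 - vec2), because position i always refers to the same cell of (A, B, C) space.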
Suppose that you have hundreds of numpy arrays and you want to calculate the correlation between each pair of them. I calculated it with nested for loops, but execution took a huge amount of time (20 minutes!). One way to make this calculation more efficient is to calculate only one half of the correlation table (one side of the diagonal), copy it to the other half, and set the diagonal to 1. What I mean is that correlation(x, y) = correlation(y, x), and correlation(x, x) is always equal to 1. However, even with these corrections the code still takes a long time (approx. 7-8 minutes). Any other suggestions?
My code
for x in data_set:
    for y in data_set:
        correlation = np.corrcoef(x, y)[1][0]
I am quite sure you can achieve much faster results by creating a 2D array and calculating its correlation matrix (as opposed to calculating pairwise correlations one by one).
From numpy's corrcoef documentation the input can be:
" 1-D or 2-D array containing multiple variables and observations. Each row of m represents a variable, and each column a single observation of all those variables."
https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html
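As a concrete illustration (dummy data; this assumes all arrays in data_set have the same length), stacking them into one 2D array and making a single corrcoef call returns the full correlation matrix at once:

import numpy as np

data_set = [np.random.rand(1000) for _ in range(300)]  # stand-in for your arrays

stacked = np.vstack(data_set)        # shape (n_series, n_observations)
corr_matrix = np.corrcoef(stacked)   # shape (n_series, n_series)

print(corr_matrix.shape)      # (300, 300)
print(corr_matrix[1, 0])      # same value the loop computes as np.corrcoef(x, y)[1][0]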