How to rasterize a pandas dataframe with many points per pixel? - python

I am using rasterio to convert a geopandas dataframe of points to a GeoTIFF raster.
For that I am using this Python code:
import rasterio
from rasterio import features

# meta (the raster profile) and gdf (the geopandas dataframe of points)
# are assumed to be defined earlier
with rasterio.open("somepath/rasterized.tif", 'w+', **meta) as out:
    out.nodata = 0
    out_arr = out.read(1)
    # this is where we create a generator of (geom, value) pairs to use in rasterizing
    shapes = ((geom, value * 2) for geom, value in zip(gdf.geometry, gdf["PositionConfidence"]))
    burned = features.rasterize(shapes=shapes, fill=0, out=out_arr, transform=out.transform)
    out.write_band(1, burned)
    out.write_band(2, burned)
    out.write_band(3, burned)
    out_arr1 = out.read(1)
The code writes not a fixed value but a value derived from each point being converted.
The problem is that there are many points per raster pixel, and with the approach above only a single point's value is burned into each pixel.
What I am looking for is to burn the average of all point values per pixel into that pixel.
Thanks for your help

I found an approach that works:
First, sort all the points by pixel. Then, for each pixel, compute the value needed, in this case the average of the point values. Finally, assign this value to a new point at the coordinates of the pixel and rasterize those points instead of the originals. A sketch of this follows.
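A hedged sketch of that idea (my illustration, not the asker's exact code), assuming transform is the raster's affine transform (out.transform above) and reusing gdf and the "PositionConfidence" column from the question:

import geopandas as gpd
from rasterio.transform import rowcol, xy

# Map each point to the pixel (row, col) it falls in.
rows, cols = rowcol(transform, gdf.geometry.x, gdf.geometry.y)
gdf = gdf.assign(row=rows, col=cols)
# Average the point values per pixel.
per_pixel = gdf.groupby(["row", "col"], as_index=False)["PositionConfidence"].mean()
# Build one point per pixel centre carrying the averaged value,
# then rasterize this GeoDataFrame instead of the original points.
xs, ys = xy(transform, per_pixel["row"], per_pixel["col"])
averaged = gpd.GeoDataFrame(per_pixel, geometry=gpd.points_from_xy(xs, ys), crs=gdf.crs)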

Related

'Lining up' large lat/lon grid with smaller lat/lon grid

Let's say I have a large array of values that represents terrain latitude locations, of shape [x]. I also have another array of values that represents terrain longitude values, of shape [y]. All of the values in x as well as y are equally spaced at 0.005 degrees. In other words:
lons[0:10] = [-130.0, -129.995, -129.99, -129.985, -129.98, -129.975, -129.97, -129.965, -129.96, -129.955]
lats[0:10] = [55.0, 54.995, 54.99, 54.985, 54.98, 54.975, 54.97, 54.965, 54.96, 54.955]
I have a second dataset that is projected on an irregularly spaced lat/lon grid (but roughly equally spaced, ~25 meters apart) that is [m,n] dimensions big and falls within the domain of x and y. We also have all of the lat/lon points within this second dataset. I would like to 'line up' the grids such that every value of [m,n] matches the nearest-neighbour terrain value within the larger grid. I am able to do this with the following code, where I basically loop through every lat/lon value in dataset two and find the argmin of the calculated lat/lon values from dataset one:
for a in range(0, lats.shape[0]):
    # Loop through the ranges
    for r in range(0, lons.shape[0]):
        # Access the elements
        tmp_lon = lons[r]
        tmp_lat = lats[a]
        # Find where tmp_lon and tmp_lat best match the indices of new_lats and new_lons
        idx = (np.abs(new_lats - tmp_lat)).argmin()
        idy = (np.abs(new_lons - tmp_lon)).argmin()
        # Make our final array!
        second_dataset_trn[a, r] = first_dataset_trn[idy, idx]
Except it is exceptionally slow. Is there another method, either through a package, library, etc. that can speed this up?
Please take a look at the following previous question for iterating over two lists, which may improve the speed: Is there a better way to iterate over two lists, getting one element from each list for each iteration?
A possible correction to the sample code: assuming the arrays are organized in the standard GIS fashion of latitude, longitude, I believe there is an error in the idx and idy assignments - the variables receiving the assignments should be swapped (idx should be idy, and vice versa). For example:
# Now we need to find where the tmp_lon and tmp_lat match best with the index from new_lats and new_lons
idy = (np.abs(new_lats - tmp_lat)).argmin()
idx = (np.abs(new_lons - tmp_lon)).argmin()
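As a further, fully vectorized alternative (my own sketch, not from the original answers): since lats, lons, new_lats, and new_lons are treated as 1-D here, the nearest index along each axis can be computed once per axis with numpy broadcasting, eliminating the double loop:

import numpy as np

# Nearest source index for every target latitude and longitude (vectorized).
# The broadcasted difference arrays have shape (len(lats), len(new_lats)),
# so this assumes the 1-D axes are of moderate size.
idy = np.abs(new_lats[None, :] - lats[:, None]).argmin(axis=1)
idx = np.abs(new_lons[None, :] - lons[:, None]).argmin(axis=1)
# Fancy-index the source grid in one shot (axis order as in the correction above).
second_dataset_trn = first_dataset_trn[np.ix_(idy, idx)]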

Build Shapely point objects from .TIF

I would like to convert an image (.tiff) into Shapely points. There are 45 million pixels, and I need a way to accomplish this without a loop (it currently takes 15+ hours).
For example, I have a .tiff file which when opened is a 5000x9000 array. The values are pixel values (colors) that range from 1 to 215.
I open the tif with rasterio.open("xxxx.tif").
The desired EPSG is 32615.
I need to preserve the pixel value but also attach geospatial positioning. This is to be able to sjoin over a polygon to see if the points are inside. I can handle the transform after processing, but I cannot figure a way to accomplish this without a loop. Any help would be greatly appreciated!
If you just want a boolean array indicating whether the points are within any of the geometries, I'd dissolve the shapes into a single MultiPolygon then use shapely.vectorized.contains. The shapely.vectorized module is currently not covered in the documentation, but it's really good to know about!
Something along the lines of:
import shapely.ops
import shapely.vectorized

# for a gridded dataset with 2-D arrays lats, lons
# and a list of shapely polygons/multipolygons all_shapes
XX = lons.ravel()
YY = lats.ravel()
single_multipolygon = shapely.ops.unary_union(all_shapes)
in_any_shape = shapely.vectorized.contains(single_multipolygon, XX, YY)
If you're looking to identify which shape the points are in, use geopandas.points_from_xy to convert your x, y point coordinates into a GeometryArray, then use geopandas.sjoin to find the index of the shape corresponding to each (x, y) point:
import geopandas

geoarray = geopandas.points_from_xy(XX, YY)
points_gdf = geopandas.GeoDataFrame(geometry=geoarray)
shapes_gdf = geopandas.GeoDataFrame(geometry=all_shapes)
shape_index_by_point = geopandas.sjoin(
    shapes_gdf, points_gdf, how='right', predicate='contains',
)
This is still a large operation, but it's vectorized and will be significantly faster than a looped solution. The geopandas route is also a good option if you'd like to convert the projection of your data or use other geopandas functionality.
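The snippets above assume the 2-D lats and lons coordinate arrays already exist. A hedged sketch of how they could be built from the raster itself without a Python loop (using the placeholder path from the question; in a projected CRS such as EPSG:32615 these are really x/y coordinates playing the role of lons/lats):

import numpy as np
import rasterio

with rasterio.open("xxxx.tif") as src:
    values = src.read(1)  # pixel values, e.g. shape (5000, 9000)
    rows, cols = np.meshgrid(np.arange(src.height), np.arange(src.width), indexing="ij")
    # Apply the affine transform to all pixel centres at once (vectorized).
    lons, lats = src.transform * (cols + 0.5, rows + 0.5)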

2D histogram colour by "label fraction" of data in each bin

Following on from the post found here: 2D histogram coloured by standard deviation in each bin
I would like to colour each bin in a 2D grid by the fraction of points whose label values are below a certain threshold in Python.
Note that, in this dataset, each point has a continuous label value between 0-1.
For example, here is a histogram I made in which the colour denotes the standard deviation of the label values of all points in each bin.
This was done using scipy.stats.binned_statistic_2d()
(see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic_2d.html)
and setting the statistic argument to 'std'.
But is there a way to change this kind of plot so that the colouring represents the fraction of points in each bin with a label value below, for example, 0.5?
It could be that the only way to do this is by explicitly defining a grid of some kind and calculating the fractions, but I'm not sure of the best way to do that, so any help would be greatly appreciated!
Maybe using scipy.stats.binned_statistic_2d or numpy.histogram2d and being able to return the raw data values in each bin as a multi-dimensional array would help in quickly computing the fractions explicitly.
The fraction of elements in an array below a threshold can be calculated as
fraction = lambda a, threshold: len(a[a < threshold]) / len(a)
Hence you can call
scipy.stats.binned_statistic_2d(x, y, values, statistic=lambda a: fraction(a, 0.5))
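For completeness, a self-contained sketch of this suggestion with synthetic data (the names x, y, and labels are placeholders of my own) that guards against empty bins and plots the result:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic_2d

rng = np.random.default_rng(0)
x, y = rng.normal(size=10_000), rng.normal(size=10_000)
labels = rng.uniform(0, 1, size=10_000)  # continuous label in [0, 1] per point

# Fraction of points per bin with label below 0.5; NaN for empty bins.
frac = lambda a: np.mean(a < 0.5) if a.size else np.nan
stat, xedges, yedges, _ = binned_statistic_2d(x, y, labels, statistic=frac, bins=40)

plt.pcolormesh(xedges, yedges, stat.T)  # transpose: the statistic is indexed (x, y)
plt.colorbar(label="fraction of labels < 0.5")
plt.show()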

Healpy: changing coordinates to a map and saving the new one

I have a map in galactic coordinates and I need to save it in equatorial coordinates in another file. I know I can use:
import healpy as hp
map = hp.read_map('file.fits')
map_rot = hp.mollview(map, coord=['G', 'C'], return_projected_map=True)
and this should return a 2D numpy array stored in map_rot. But when I read map_rot, I found it is a masked_array filled ONLY with -inf values, with mask=False and fill_value=-1.6735e+30 (so, apparently, -inf is not a mask). Moreover, the total number of elements of map_rot does not match the number of pixels I would expect for a map (npix = 12 * nside**2). For example, for nside=256 I would expect npix=786432, while map_rot has 400*800 = 320000 elements. What's going on?
(I have already seen this post, but I have a map in polarization, so I need to rotate Stokes' parameters. Since mollview knows how to do that, I was trying to obtain the new map directly from mollview. )
One way to work around this is to save the output, for instance with pickle:
import healpy as hp, pickle
map = hp.read_map('file.fits')
map_rot = hp.mollview(map, coord=['G', 'C'], return_projected_map=True)
pickle.dump(map_rot, open("/path/map.p", "wb"))
The return value of hp.mollview() has a format that can be displayed with the standard imshow() function, so next time you want to plot it, just do the following:
import matplotlib.pyplot as plt
map_rot = pickle.load(open("/path/map.p", "rb"))
plt.imshow(map_rot)
map_rot describes the pixels of the entire matplotlib window, including the white area around the ellipsoid (-inf is colour-coded as white).
In contrast, mollview() accepts only an array of the pixels that reside inside the ellipsoid, i.e. an array of length hp.pixelfunc.nside2npix(NSIDE).
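If the goal is a rotated HEALPix map (rather than a projected image), an alternative worth checking (my suggestion, assuming a recent healpy version): hp.Rotator can rotate a full IQU map directly, handling the Stokes parameters, and the result keeps the HEALPix pixelization so it can be written with write_map:

import healpy as hp

m = hp.read_map('file.fits', field=(0, 1, 2))  # I, Q, U Stokes maps
rot = hp.Rotator(coord=['G', 'C'])             # galactic -> equatorial
m_rot = rot.rotate_map_alms(m)                 # rotates the Stokes vectors too
hp.write_map('file_equatorial.fits', m_rot, overwrite=True)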

Python: histogram/binning data from 2 arrays

I have two arrays of data: one holds radius values and the other the corresponding intensity reading at that radius.
For example, here is a small section of the data; the first column is the radius and the second the intensity:
29.77036614 0.04464427
29.70281027 0.07771409
29.63523525 0.09424901
29.3639355 1.322793
29.29596385 2.321502
29.22783249 2.415751
29.15969437 1.511504
29.09139827 1.01704
29.02302068 0.9442765
28.95463729 0.3109002
28.88609766 0.162065
28.81754446 0.1356054
28.74883612 0.03637681
28.68004928 0.05952569
28.61125036 0.05291172
28.54229804 0.08432806
28.4732599 0.09950128
28.43877462 0.1091304
28.40421016 0.09629156
28.36961249 0.1193614
28.33500089 0.102711
28.30037503 0.07161685
How can I bin the radius data and find the average intensity corresponding to each binned radius?
The aim is then to use the average intensity to assign an intensity value to radius data with a missing (NaN) data point.
I've never had to use the histogram functions before and have very little idea of how they work or whether this is possible with them. The full data set is large, with 336,622 data points, so I don't really want to use loops or if statements to achieve this.
Many thanks for any help.
If you only need to do this for a handful of points, you could do something like the following.
If intensities and radius are numpy arrays of your data:
bin_width = 0.1  # depending on how narrow you want your bins

def get_avg(rad):
    average_intensity = intensities[(radius >= rad - bin_width / 2.) & (radius < rad + bin_width / 2.)].mean()
    return average_intensity

# This returns the average intensity in the bin 27.95 <= rad < 28.05
average = get_avg(28.)
It's not really histogramming you are after. A histogram is more a count of the items that fall into each bin. What you want is more of a group-by operation, where you group your intensities by radius intervals and apply some aggregation method, like average or median, to the intensities in each group.
What you are describing, however, sounds a lot more like some sort of interpolation you want to perform. So I would suggest thinking about interpolation as an alternative to solve your problem. Anyway, here's a suggestion for how you can achieve what you asked for (assuming you can use numpy) - I'm using random inputs to illustrate:
import random
import numpy

radius = numpy.fromiter((random.random() * 10 for i in range(1000)), dtype=numpy.float64)
intensities = numpy.fromiter((random.random() * 10 for i in range(1000)), dtype=numpy.float64)
# group your radius input into 20 equally spaced bins
bins = numpy.linspace(radius.min(), radius.max(), 20)
groups = numpy.digitize(radius, bins)
# groups now holds the index of the bin into which radius[i] falls;
# loop through all bin indexes, select the corresponding intensities,
# and perform your aggregation on each selection, keeping the result in a dict
aggregated = {}
for i in range(len(bins) + 1):
    selected_intensities = intensities[groups == i]
    aggregated[i] = selected_intensities.mean()
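Since the asker wants to avoid explicit loops over 336,622 points, a loop-free variant (my addition, not from the original answer) is scipy.stats.binned_statistic, which computes the per-bin mean in one vectorized call:

from scipy.stats import binned_statistic

# Mean intensity per radius bin, plus the bin edges for locating each average.
mean_intensity, bin_edges, _ = binned_statistic(radius, intensities, statistic='mean', bins=20)
bin_centres = 0.5 * (bin_edges[:-1] + bin_edges[1:])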
