I have a dataframe where, columns with subscript 1 are starting points and with 2 are end points.
I want to find a difference in kilometers between them.
I tried following code however got an error
import mpu
import pandas as pd
import numpy as np
data = {'lat1': [116.51172,116.51135,116.51135,116.51627,116.47186],
'lon1': [39.92123,39.93883,39.93883,39.91034,39.91248],
'lat2': [np.nan,116.51172,116.51135,116.51135,116.51627],
'lon2': [np.nan,39.92123,39.93883,39.93883,39.91034]}
# Create DataFrame
df_test = pd.DataFrame(data)
mpu.haversine_distance((df.lat1, df.lon1), (df.lat2, df.lon2))
Related
I am generating area average daily time series from Netcdf4 files of gridded precipitation GPM Early for 23 polygon(sub-basins) in one shape file.
But output (avg_rf) list of 23 element is having empty values (inside dictionry: value = None Type object
Output (avg_rf) is this:
0 dict 1 {'mean': None}
1 dict 1 {'mean': None}
and so on.....
It should be some value indeed.
I am using this code ;
import pandas as pd
import xarray as xr
import rasterio as rio
import geopandas as gpd
import rasterstats
import numpy as np
# load and read shp-file with geopandas
shp_fo = r'D:/codes_fr_model_data/Sub_basins_shp/WGS_1984_Basin-shape/catchment_swat_shap_wgs1984.shp'
shp_df = gpd.read_file(shp_fo)
# load and read netCDF-file to dataset and get datarray for variable
nc_fo = r'D:/Data_Collection_thesis/GPM_h_hourly/daily_gpm_03utc_2002_22/HH_files_daily_0.1_resample_2022/2002-01-02.nc'
nc_ds = xr.open_dataset(nc_fo)
nc_arr = nc_ds['precipitationCal']
np_array = nc_arr.values
# get affine of nc-file with rasterio
affine = rio.open(nc_fo).transform
##compute the zonal statistics
avg_rf = rasterstats.zonal_stats(shp_df.geometry,
np_array,
affine=affine,
stats=['mean'],
nodata=-999,
geojason_out = False)
can i anybody solve the issue why the output list avg_rf is empty or is it correct way to do it??
then i want to iterate over days to extract daily average time series:
There are several tutorials (example 1, example 2, example 3) about masking NetCDF using shapefile and calculating average measures. However, I was confused with those workflows about masking NetCDF and extracting measures such as average, and those tutorials did not include extract anomaly (for example, the difference between temperature in 2019 and a baseline average temperature).
I make an example here. I have downloaded monthly temperature (download temperature file) from 2000 to 2019 and the state-level US shapefile (download shapefile). I want to get the state-level average temperature based on the monthly average temperature from 2000 to 2019 and the temperature anomaly of year 2019 relative to baseline temperature from 2000 to 2010. Specifically, the final dataframe looks as follow:
state
avg_temp
anom_temp2019
AL
xx
xx
AR
xx
xx
...
...
...
WY
xx
xx
# Load libraries
%matplotlib inline
import regionmask
import numpy as np
import xarray as xr
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
# Read shapefile
us = gpd.read_file('./shp/state_cus.shp')
# Read gridded data
ds = xr.open_mfdataset('./temp/monthly_mean_t2m_*.nc')
......
I really appreciate your help that providing an explicit workflow that could do the above task. Thanks a lot.
This can be achieved using regionmask. I don't use your files but the xarray tutorial data and naturalearth data for the US states.
import numpy as np
import regionmask
import xarray as xr
# load polygons of US states
us_states_50 = regionmask.defined_regions.natural_earth.us_states_50
# load an example dataset
air = xr.tutorial.load_dataset("air_temperature")
# turn into monthly time resolution
air = air.resample(time="M").mean()
# create a mask
mask3D = us_states_50.mask_3D(air)
# latitude weights
wgt = np.cos(np.deg2rad(air.lat))
# calculate regional averages
reg_ave = air.weighted(mask3D * wgt).mean(("lat", "lon"))
# calculate the average temperature (over 2013-2014)
avg_temp = reg_ave.sel(time=slice("2013", "2014")).mean("time")
# calculate the anomaly (w.r.t. 2013-2014)
reg_ave_anom = reg_ave - avg_temp
# select a single timestep (January 2013)
reg_ave_anom_ts = reg_ave_anom.sel(time="2013-01")
# remove the time dimension
reg_ave_anom_ts = reg_ave_anom_ts.squeeze(drop=True)
# convert to a pandas dataframe so it's in tabular form
df = reg_ave_anom_ts.air.to_dataframe()
# set the state codes as index
df = df.set_index("abbrevs")
# remove other columns
df = df.drop(columns="names")
You can find info how to use your own shapefile on the regionmask docs (Working with geopandas).
disclaimer: I am the main author of regionmask.
Let's say I have data of 16 grid cells (4 * 4) which has a corresponding index (0~15) as dimension & coordinates and variables (a, longitude and latitude) for each cell. Here is the code to create this data.
import xarray as xr
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(16,3), \
columns=['a','longitude', 'latitude'], \
index=range(16))
ds = df.to_xarray()
ds
What I want to do is:
Expand the coordination of data a from (index) to (longitude, latitude) using longitude and latitude variables of each cell.
So, the resulting DataSet will include longitude and latitude as its dimension and coordinates as well as variable a of (longitude, latitude)
How can I do this within xarray functionality?
Thanks!
Here is a way to solve that:
# convert the index as a MultiIndex containing longitude and latitude
dat_2d = ds.set_index({'index': ['longitude', 'latitude']})
# unstack the MultiIndex
unstacked = dat_2d.unstack('index')
# plot
unstacked['a'].plot()
This is undoubtedly a bit of a "can't see the wood for the trees" moment. I've been staring at this code for an hour and can't see what I've done wrong. I know it's staring me in the face but I just can't see it!
I'm trying to convert between two geographical co-ordinate systems using Python.
I have longitude (x-axis) and latitude (y-axis) values and want to convert to OSGB 1936. For a single point, I can do the following:
import numpy as np
import pandas as pd
import shapefile
import pyproj
inProj = pyproj.Proj(init='epsg:4326')
outProj = pyproj.Proj(init='epsg:27700')
x1,y1 = (-2.772048, 53.364265)
x2,y2 = pyproj.transform(inProj,outProj,x1,y1)
print(x1,y1)
print(x2,y2)
This produces the following:
-2.772048 53.364265
348721.01039783185 385543.95241055806
Which seems reasonable and suggests that longitude of -2.772048 is converted to a co-ordinate of 348721.0103978.
In fact, I want to do this in a Pandas dataframe. The dataframe contains columns containing longitude and latitude and I want to add two additional columns that contain the converted co-ordinates (called newLong and newLat).
An exemplar dataframe might be:
latitude longitude
0 53.364265 -2.772048
1 53.632481 -2.816242
2 53.644596 -2.970592
And the code I've written is:
import numpy as np
import pandas as pd
import shapefile
import pyproj
inProj = pyproj.Proj(init='epsg:4326')
outProj = pyproj.Proj(init='epsg:27700')
df = pd.DataFrame({'longitude':[-2.772048,-2.816242,-2.970592],'latitude':[53.364265,53.632481,53.644596]})
def convertCoords(row):
x2,y2 = pyproj.transform(inProj,outProj,row['longitude'],row['latitude'])
return pd.Series({'newLong':x2,'newLat':y2})
df[['newLong','newLat']] = df.apply(convertCoords,axis=1)
print(df)
Which produces:
latitude longitude newLong newLat
0 53.364265 -2.772048 385543.952411 348721.010398
1 53.632481 -2.816242 415416.003113 346121.990302
2 53.644596 -2.970592 416892.024217 335933.971216
But now it seems that the newLong and newLat values have been mixed up (compared with the results of the single point conversion shown above).
Where have I got my wires crossed to produce this result? (I apologise if it's completely obvious!)
When you do df[['newLong','newLat']] = df.apply(convertCoords,axis=1), you are indexing the columns of the df.apply output. However, the column order is arbitrary because your series was defined using a dictionary (which is inherently unordered).
You can opt to return a Series with a fixed column ordering:
return pd.Series([x2, y2])
Alternatively, if you want to keep the convertCoords output labelled, then you can use .join to combine results instead:
return pd.Series({'newLong':x2,'newLat':y2})
...
df = df.join(df.apply(convertCoords, axis=1))
Please note that the transform function of pyproj accepts also arrays, which is quite useful when it comes to large dataframes, and much faster than using lambda/apply function
import pandas as pd
from pyproj import Proj, transform
inProj, outProj = Proj(init='epsg:4326'), Proj(init='epsg:27700')
df['newLon'], df['newLat'] = transform(inProj, outProj, df['longitude'].tolist(), df['longitude'].tolist())
I have the following problem:
Given a 2D dataframe, first column with values and second giving categories of the points, I would like to compute a k-means dictionary of the means of each category and assign the centroid that the group mean of a particular value is closest to as a new column in the original data frame.
I would like to do this using groupby.
More generally, my problem is, that apply (to my knowledge) only can use functions that are defined on the individual groups (like mean()). k-means needs information on all the groups. Is there a nicer way than transforming everything to numpy arrays and working with these?
import pandas as pd
import numpy as np
from scipy.cluster.vq import kmeans2
k=4
raw_data = np.random.randint(0,100,size=(100, 4))
f = pd.DataFrame(raw_data, columns=list('ABCD'))
df = pd.DataFrame(f, columns=['A','B'])
groups = df.groupby('A')
means = groups.mean().unstack()
centroids, dictionary = kmeans2(means,k)
fig, ax = plt.subplots()
print dictionary
What I would like to get now, is a new column in df, that gives the value in dictionary for each entry.
You can achieve it by the following:
import pandas as pd
import numpy as np
from scipy.cluster.vq import kmeans2
k = 4
raw_data = np.random.randint(0,100,size=(100, 4))
f = pd.DataFrame(raw_data, columns=list('ABCD'))
df = pd.DataFrame(f, columns=['A','B'])
groups = df.groupby('A')
means_data_frame = pd.DataFrame(groups.mean())
centroid, means_data_frame['cluster'] = kmeans2(means_data_frame['B'], k)
df.join(means_data_frame, rsuffix='_mean', on='A')
This will append 2 more columns to df B_mean and cluster denoting the group's mean and the cluster that group's mean is closest to, respectively.
If you really want to use apply, you can write a function to read the cluster value from means_data_frame and assign it to a new column in df