Converting between projections using pyproj in Pandas dataframe - python

This is undoubtedly a bit of a "can't see the wood for the trees" moment. I've been staring at this code for an hour and can't see what I've done wrong. I know it's staring me in the face but I just can't see it!
I'm trying to convert between two geographical co-ordinate systems using Python.
I have longitude (x-axis) and latitude (y-axis) values and want to convert to OSGB 1936. For a single point, I can do the following:
import numpy as np
import pandas as pd
import shapefile
import pyproj
inProj = pyproj.Proj(init='epsg:4326')
outProj = pyproj.Proj(init='epsg:27700')
x1,y1 = (-2.772048, 53.364265)
x2,y2 = pyproj.transform(inProj,outProj,x1,y1)
print(x1,y1)
print(x2,y2)
This produces the following:
-2.772048 53.364265
348721.01039783185 385543.95241055806
Which seems reasonable and suggests that longitude of -2.772048 is converted to a co-ordinate of 348721.0103978.
In fact, I want to do this in a Pandas dataframe. The dataframe contains columns containing longitude and latitude and I want to add two additional columns that contain the converted co-ordinates (called newLong and newLat).
An exemplar dataframe might be:
latitude longitude
0 53.364265 -2.772048
1 53.632481 -2.816242
2 53.644596 -2.970592
And the code I've written is:
import numpy as np
import pandas as pd
import shapefile
import pyproj
inProj = pyproj.Proj(init='epsg:4326')
outProj = pyproj.Proj(init='epsg:27700')
df = pd.DataFrame({'longitude':[-2.772048,-2.816242,-2.970592],'latitude':[53.364265,53.632481,53.644596]})
def convertCoords(row):
x2,y2 = pyproj.transform(inProj,outProj,row['longitude'],row['latitude'])
return pd.Series({'newLong':x2,'newLat':y2})
df[['newLong','newLat']] = df.apply(convertCoords,axis=1)
print(df)
Which produces:
latitude longitude newLong newLat
0 53.364265 -2.772048 385543.952411 348721.010398
1 53.632481 -2.816242 415416.003113 346121.990302
2 53.644596 -2.970592 416892.024217 335933.971216
But now it seems that the newLong and newLat values have been mixed up (compared with the results of the single point conversion shown above).
Where have I got my wires crossed to produce this result? (I apologise if it's completely obvious!)

When you do df[['newLong','newLat']] = df.apply(convertCoords,axis=1), you are indexing the columns of the df.apply output. However, the column order is arbitrary because your series was defined using a dictionary (which is inherently unordered).
You can opt to return a Series with a fixed column ordering:
return pd.Series([x2, y2])
Alternatively, if you want to keep the convertCoords output labelled, then you can use .join to combine results instead:
return pd.Series({'newLong':x2,'newLat':y2})
...
df = df.join(df.apply(convertCoords, axis=1))

Please note that the transform function of pyproj accepts also arrays, which is quite useful when it comes to large dataframes, and much faster than using lambda/apply function
import pandas as pd
from pyproj import Proj, transform
inProj, outProj = Proj(init='epsg:4326'), Proj(init='epsg:27700')
df['newLon'], df['newLat'] = transform(inProj, outProj, df['longitude'].tolist(), df['longitude'].tolist())

Related

"leak" converting .csv to .nc using xarray in some points

I'm trying to transform some points that are tabulated .csv in a netcdf file.
This is my .csv file: https://1drv.ms/u/s!AhZf0QH5jEVSjWfnPtJjJgmXf-i0?e=WEpMyU
In my spreadsheet, I have the unique location of each point, not regular for all area but points are spaced by 0.1 degree, an SP value per year up to 100 years forward.
To work with this data, I needed something like other sources that use netcdf data tabled in sp(time, lat, lon). So, I can evaluate and visualize the values ​​of this specific region by year (using panoply or ncview for example).
For that, I came up with this code:
import pandas as pd
import xarray as xr
import numpy as np
csv_file = 'example.csv'
df = pd.read_csv(csv_file)
df = pd.melt(df, id_vars=["lon", "lat"], var_name="time", value_name="sp")
df['time']= pd.to_datetime(df['time'])
df = df.set_index(["time", "lat", "lon"])
df = df.astype('float32')
xr = df.to_xarray()
xc = xr.fillna(0)
xc.to_netcdf(csv_file + '.nc')
And I got a netcdf file like this: https://1drv.ms/u/s!AhZf0QH5jEVSjWfnPtJjJgmXf-i0?e=WEpMyU
At first, my code seems to work and create my netcdf file without problems, however, I noticed that in some places I am creating some "leakage" of points, or interpolating the same values ​​in some direction (north-south and west-east) when it shouldn't happen.
If you do a simple plot before converting to xarray you can see there are 3 west segments and one south segment
xr.sp[0].plot()
And this ends up being masked a bit when I fill the NaN with 0 and plot it again:
xc.sp[0].plot()
Checking the netcdf file using panoply I got something similar as well:
So I've start to check every-step of my code to see if I miss something.. my first guess was the melt part but I not 100% sure because if I plot df I can't see any leaking or extrapolation in the same region:
joint_axes = seaborn.jointplot(
x="lon", y="lat", data=df, s=0.5
)
contextily.add_basemap(
joint_axes.ax_joint,
crs="EPSG:4326",
source=contextily.providers.CartoDB.PositronNoLabels,
);
So anyone have any idea what's happening here?
EDIT:
Now a solution that would help me at the moment would be to fill in the missing coordinates with a value equal to 0 within my domain area using the minimum and maximum latitudes and longitudes.
My first (and unconventional) idea was to create a 0.1 x 0.1 grid with values equal to zero and feed this grid with my existing values.
However, the method using reindex would help me and I would be able to execute it in a few lines. My doubt is whether I should do this before or after the df.melt in my code.
I'm in this situation:
csv_file = '/Users/helioguerraneto/Desktop/example.csv'
df = pd.read_csv(csv_file)
lonmin, lonmax = df['lon'].min(), df['lon'].max()
latmin, latmax = df['lat'].min(), df['lat'].max()
df = pd.melt(df, id_vars=["lon", "lat"], var_name="time", value_name="sp")
df['time']= pd.to_datetime(df['time'])
df = df.set_index(["time", "lat", "lon"])
df = df.astype('float32')
xr = df.to_xarray()
xc = xr.reindex(lat=np.arange(latmin, latmax, 0.1), lon=np.arange(lonmin, lonmax, 0.1), fill_value=0)
xc.to_netcdf(csv_file + '.nc')
Seems like reindex is the way but I need to keep original data. I was expecting some zeros but not in all area:
EDIT2:
I think I found something might help! My goal now could be same what's happing here: How to interpolate latitude/longitude and heading in Pandas
But instead of interpolation by the nearest I just could match with the exactly coordinates. Maybe the real problem here is mix 100 hundred grids in the end..
Any suggestions?

How calculate distance between two points measured in longitude and latitude

I have a dataframe where, columns with subscript 1 are starting points and with 2 are end points.
I want to find a difference in kilometers between them.
I tried following code however got an error
import mpu
import pandas as pd
import numpy as np
data = {'lat1': [116.51172,116.51135,116.51135,116.51627,116.47186],
'lon1': [39.92123,39.93883,39.93883,39.91034,39.91248],
'lat2': [np.nan,116.51172,116.51135,116.51135,116.51627],
'lon2': [np.nan,39.92123,39.93883,39.93883,39.91034]}
# Create DataFrame
df_test = pd.DataFrame(data)
mpu.haversine_distance((df.lat1, df.lon1), (df.lat2, df.lon2))

Convert lists of Coordinates to Polygons with GeoPandas

I got lists of coordinates in the csv file(please click the pic). How should I convert them to polygons in GeoDataFrame?
Below is the coordinates of one polygon and I have thousands rows of this.
[118.103198,24.527338],[118.103224,24.527373],[118.103236,24.527366],[118.103209,24.527331],[118.103198,24.527338]
I tried the following codes:
def bike_fence_format(s):
s = s.replace('[', '').replace(']', '').split(',')
return s
df['FENCE_LOC'] = df['FENCE_LOC'].apply(bike_fence_format)
df['LAT'] = df['FENCE_LOC'].apply(lambda x: x[1::2])
df['LON'] = df['FENCE_LOC'].apply(lambda x: x[::2])
df['geom'] = Polygon(zip(df['LON'].astype(str),df['LAT'].astype(str)))
But I failed in the last step, since df['LON'] returns 'series' not 'string' type. How should I get over this problem? It's better if there is an easier way to achieve my goal.
Recreated a sample df of what your .csv file would give (depending on how your read it in with .read_csv()).
import pandas as pd
import geopandas as gpd
df = pd.DataFrame({'FENCE_LOC': ['[32250,175889],[33913,180757],[29909,182124],[28246,177257],[32250,175889]',
'[32250,175889],[33913,180757],[29909,182124],[28246,177257],[32250,175889]',
'[32250,175889],[33913,180757],[29909,182124],[28246,177257],[32250,175889]']}, index=[0, 1, 2])
Modified your function slightly because we want numeric values, not strings
def bike_fence_format(s):
s = s.replace('[', '').replace(']', '').split(',')
s = [float(x) for x in s]
return s
df['FENCE_LOC'] = df['FENCE_LOC'].apply(bike_fence_format)
df['LAT'] = df['FENCE_LOC'].apply(lambda x: x[1::2])
df['LON'] = df['FENCE_LOC'].apply(lambda x: x[::2])
We can use some list comprehensions to build a list of Shapely polygons.
geom_list = [(x, y) for x, y in zip(df['LON'],df['LAT'])]
geom_list_2 = [Polygon(tuple(zip(x, y))) for x, y in geom_list]
Finally, we can create a gdf using our list of Shapely polygons.
polygon_gdf = gpd.GeoDataFrame(geometry=geom_list_2)
To make available a small representative dataset similar to what the OP posts as an image, I create this rows of data (sorry for too many decimal digits):
[[-2247824.100899419,-4996167.43201861],[-2247824.100899419,-4996067.43201861],[-2247724.100899419,-4996067.43201861],[-2247724.100899419,-4996167.43201861],[-2247824.100899419,-4996167.43201861]]
[[-2247724.100899419,-4996167.43201861],[-2247724.100899419,-4996067.43201861],[-2247624.100899419,-4996067.43201861],[-2247624.100899419,-4996167.43201861],[-2247724.100899419,-4996167.43201861]]
[[-2247624.100899419,-4996167.43201861],[-2247624.100899419,-4996067.43201861],[-2247524.100899419,-4996067.43201861],[-2247524.100899419,-4996167.43201861],[-2247624.100899419,-4996167.43201861]]
[[-2247824.100899419,-4996067.43201861],[-2247824.100899419,-4995967.43201861],[-2247724.100899419,-4995967.43201861],[-2247724.100899419,-4996067.43201861],[-2247824.100899419,-4996067.43201861]]
[[-2247724.100899419,-4996067.43201861],[-2247724.100899419,-4995967.43201861],[-2247624.100899419,-4995967.43201861],[-2247624.100899419,-4996067.43201861],[-2247724.100899419,-4996067.43201861]]
[[-2247624.100899419,-4996067.43201861],[-2247624.100899419,-4995967.43201861],[-2247524.100899419,-4995967.43201861],[-2247524.100899419,-4996067.43201861],[-2247624.100899419,-4996067.43201861]]
[[-2247824.100899419,-4995967.43201861],[-2247824.100899419,-4995867.43201861],[-2247724.100899419,-4995867.43201861],[-2247724.100899419,-4995967.43201861],[-2247824.100899419,-4995967.43201861]]
[[-2247724.100899419,-4995967.43201861],[-2247724.100899419,-4995867.43201861],[-2247624.100899419,-4995867.43201861],[-2247624.100899419,-4995967.43201861],[-2247724.100899419,-4995967.43201861]]
[[-2247624.100899419,-4995967.43201861],[-2247624.100899419,-4995867.43201861],[-2247524.100899419,-4995867.43201861],[-2247524.100899419,-4995967.43201861],[-2247624.100899419,-4995967.43201861]]
This data is saved as polygon_data.csv file.
For the code, modules are loaded first as
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon
Then, the data is read to create a dataframe by pandas.read_csv(). To get each row of data into a single column of the dataframe, delimiter="x" is used. Since there is no x within any row of data, the whole row of data as a long string is the result.
df3 = pd.read_csv('polygon_data.csv', header=None, index_col=None, delimiter="x")
To view the content of df3, you can run
df3.head()
and get single column (with header: 0) dataframe:
0
0 [[-2247824.100899419,-4996167.43201861],[-2247...
1 [[-2247724.100899419,-4996167.43201861],[-2247...
2 [[-2247624.100899419,-4996167.43201861],[-2247...
3 [[-2247824.100899419,-4996067.43201861],[-2247...
4 [[-2247724.100899419,-4996067.43201861],[-2247...
Next, df3 is used to create a geoDataFrame. Data in each row of df3 is used to create a Polygon object to act as the geometry of the geoDataFrame polygon_df3.
geometry = [Polygon(eval(xy_string)) for xy_string in df3[0]]
polygon_df3 = gpd.GeoDataFrame(df3, \
#crs={'init': 'epsg:4326'}, #uncomment this if (x,y) is long/lat
geometry=geometry)
Finally, the geoDataFrame can be plotted with a simple command:
# this plot the geoDataFrame
polygon_df3.plot(edgecolor='black')
In this particular case with my proposed data, the output plot is:

Pandas correlation on just one column containing np arrays

I'm working with a dataframe with a column containing a np.array per row (in this case representing the mean waveform of brain recordings trought the time). I want to calculate the pearson correlation of this column (array by array).
This is my code
lenght = len(df.Mean)
Mean = []
for i in range(len(df.Mean)):
Mean.append(df.Mean[i])
Correlation_p = np.zeros((lenght,lenght))
P_Value_p = np.zeros((lenght,lenght))
for i in range(lenght):
for j in range(lenght):
Correlation_p[i][j],P_Value_p[i][j] = stats.pearsonr(df.Mean[i],df.Mean[j])
This works, but I want to know if there is a more pythonic way to do it, maybe using df.corr(). I tried but I failed in how to do it.
EDIT: the output of df.Mean.head()
0 [-0.2559348091247745, 0.02743063113723536, 0.3...
1 [-0.37025615099744325, -0.11299328141596175, 0...
2 [-1.0543681894876467, -0.8452798699354909, -0....
3 [-0.23527437766943646, -0.28657810260136585, -...
4 [0.45557980303095674, 0.6055674269814991, 0.74...
Name: Mean, dtype: object
The arrays that you would like to correlate seem in single cells of the DataFrame, if I am not mistaken. The following brings it in a format where each single array occupies a single column.
I made an data example that resembles the format of df.Mean.head():
df = pd.DataFrame({'x':[np.random.randint(0,5,10), np.random.randint(0,5,10), np.random.randint(0,5,10)]})
You can turn these arrays into columns using this:
df = pd.DataFrame(np.array(df['x'].tolist()).transpose())
Adapt the reshape parameters according to your own dimensions.
From there, it would be fairly straightforward.
A correlation matrix can be created by:
df.corr()
A visualization of the correlation matrix:
import matplotlib.pyplot as plt
plt.matshow(df.corr())
plt.show()

Plotting with folium

The task is to make an adress popularity map for Moscow. Basically, it should look like this:
https://nbviewer.jupyter.org/github/python-visualization/folium/blob/master/examples/GeoJSON_and_choropleth.ipynb
For my map I use public geojson: http://gis-lab.info/qa/moscow-atd.html
The only data I have - points coordinates and there's no information about the district they belong to.
Question 1:
Do I have to manually calculate for each disctrict if the point belongs to it, or there is more effective way to do this?
Question 2:
If there is no way to do this easier, then, how can I get all the coordinates for each disctrict from the geojson file (link above)?
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point
Reading in the Moscow area shape file with geopandas
districts = gpd.read_file('mo-shape/mo.shp')
Construct a mock user dataset
moscow = [55.7, 37.6]
data = (
np.random.normal(size=(100, 2)) *
np.array([[.25, .25]]) +
np.array([moscow])
)
my_df = pd.DataFrame(data, columns=['lat', 'lon'])
my_df['pop'] = np.random.randint(500, 100000, size=len(data))
Create Point objects from the user data
geom = [Point(x, y) for x,y in zip(my_df['lon'], my_df['lat'])]
# and a geopandas dataframe using the same crs from the shape file
my_gdf = gpd.GeoDataFrame(my_df, geometry=geom)
my_gdf.crs = districts.crs
Then the join using default value of 'inner'
gpd.sjoin(districts, my_gdf, op='contains')
Thanks to #BobHaffner, I tried to solve the problem using geopandas.
Here are my steps:
I download a shape-files for Moscow using this link click
From a list of tuples containing x and y (latitude and logitude) coordinates I create list of Points (docs)
Assuming that in the dataframe from the first link I have polygons I can write a simple loop for checking if the Point is inside this polygon. For details read this.

Categories