I'm trying to transform some points that are tabulated in a .csv into a netCDF file.
This is my .csv file: https://1drv.ms/u/s!AhZf0QH5jEVSjWfnPtJjJgmXf-i0?e=WEpMyU
In my spreadsheet, I have the unique location of each point (the grid is not regular over the whole area, but the points are spaced by 0.1 degree) and one SP value per year, up to 100 years forward.
To work with this data, I wanted something like other sources that provide netCDF data tabled as sp(time, lat, lon), so that I can evaluate and visualize the values of this specific region by year (using Panoply or ncview, for example).
For that, I came up with this code:
import pandas as pd
import xarray as xr
import numpy as np

csv_file = 'example.csv'
df = pd.read_csv(csv_file)
# reshape from one column per year into long format: (lon, lat, time, sp)
df = pd.melt(df, id_vars=["lon", "lat"], var_name="time", value_name="sp")
df['time'] = pd.to_datetime(df['time'])
df = df.set_index(["time", "lat", "lon"])
df = df.astype('float32')
ds = df.to_xarray()  # renamed from `xr`, which shadowed the xarray import
xc = ds.fillna(0)
xc.to_netcdf(csv_file + '.nc')
And I got a netcdf file like this: https://1drv.ms/u/s!AhZf0QH5jEVSjWfnPtJjJgmXf-i0?e=WEpMyU
At first, my code seemed to work and created my netCDF file without problems. However, I noticed that in some places I am creating some "leakage" of points, or interpolating the same values along certain directions (north-south and west-east), when that shouldn't happen.
If you do a simple plot of the xarray dataset, before writing the netCDF, you can see there are three west segments and one south segment:
ds.sp[0].plot()
And this ends up being masked a bit when I fill the NaN with 0 and plot it again:
xc.sp[0].plot()
Checking the netCDF file using Panoply, I got something similar as well.
So I started to check every step of my code to see if I had missed something. My first guess was the melt part, but I'm not 100% sure, because if I plot df I can't see any leaking or extrapolation in the same region:
import seaborn
import contextily

# reset_index() so lon/lat are ordinary columns again after set_index above
joint_axes = seaborn.jointplot(
    x="lon", y="lat", data=df.reset_index(), s=0.5
)
contextily.add_basemap(
    joint_axes.ax_joint,
    crs="EPSG:4326",
    source=contextily.providers.CartoDB.PositronNoLabels,
)
So, does anyone have any idea what's happening here?
EDIT:
Now, a solution that would help me at the moment would be to fill in the missing coordinates with a value equal to 0 within my domain area, using the minimum and maximum latitudes and longitudes.
My first (and unconventional) idea was to create a 0.1 x 0.1 grid filled with zeros and feed this grid with my existing values.
However, the method using reindex would help me, and I would be able to do it in a few lines. My doubt is whether I should do this before or after the df.melt in my code.
I'm in this situation:
csv_file = '/Users/helioguerraneto/Desktop/example.csv'
df = pd.read_csv(csv_file)
lonmin, lonmax = df['lon'].min(), df['lon'].max()
latmin, latmax = df['lat'].min(), df['lat'].max()
df = pd.melt(df, id_vars=["lon", "lat"], var_name="time", value_name="sp")
df['time'] = pd.to_datetime(df['time'])
df = df.set_index(["time", "lat", "lon"])
df = df.astype('float32')
ds = df.to_xarray()
# NOTE: np.arange stops before latmax/lonmax, and its floating-point steps may
# not line up exactly with the original coordinates, so reindex can drop them
xc = ds.reindex(lat=np.arange(latmin, latmax, 0.1), lon=np.arange(lonmin, lonmax, 0.1), fill_value=0)
xc.to_netcdf(csv_file + '.nc')
reindex seems to be the way, but I need to keep the original data. I was expecting some zeros, but not over the whole area:
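A minimal sketch of what I might try next (assuming my coordinates are meant to be exact multiples of 0.1 degree): snap them onto a clean grid before reindexing, so the new axes match the existing values exactly and reindex keeps my data:
# round existing coords so they align exactly with the target grid
ds = ds.assign_coords(lat=np.round(ds.lat.values, 1),
                      lon=np.round(ds.lon.values, 1))
# build the full grid the same way (+0.05 so the maximum is included)
full_lat = np.round(np.arange(latmin, latmax + 0.05, 0.1), 1)
full_lon = np.round(np.arange(lonmin, lonmax + 0.05, 0.1), 1)
xc = ds.reindex(lat=full_lat, lon=full_lon, fill_value=0)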
EDIT2:
I think I found something that might help! My goal now could be the same as what's happening here: How to interpolate latitude/longitude and heading in Pandas
But instead of interpolating by the nearest value, I could just match against the exact coordinates. Maybe the real problem here is mixing the 100 grids at the end.
Any suggestions?
I have a dataframe where columns with subscript 1 are starting points and columns with subscript 2 are end points.
I want to find the distance in kilometers between them.
I tried the following code, however I got an error:
import mpu
import pandas as pd
import numpy as np

data = {'lat1': [116.51172, 116.51135, 116.51135, 116.51627, 116.47186],
        'lon1': [39.92123, 39.93883, 39.93883, 39.91034, 39.91248],
        'lat2': [np.nan, 116.51172, 116.51135, 116.51135, 116.51627],
        'lon2': [np.nan, 39.92123, 39.93883, 39.93883, 39.91034]}
# Create DataFrame
df_test = pd.DataFrame(data)
# this call raises the error: the original used `df` (never defined, should be
# df_test), and haversine_distance does not accept whole Series anyway
mpu.haversine_distance((df_test.lat1, df_test.lon1), (df_test.lat2, df_test.lon2))
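For reference, a row-wise version I could try (a sketch, assuming mpu.haversine_distance takes two (lat, lon) scalar tuples rather than whole Series; note my sample values also look swapped, since 116 degrees is not a valid latitude and mpu validates the range):
def row_distance(row):
    # skip rows where the end point is missing
    if np.isnan(row['lat2']) or np.isnan(row['lon2']):
        return np.nan
    # haversine_distance expects (lat, lon) scalar tuples
    return mpu.haversine_distance((row['lat1'], row['lon1']),
                                  (row['lat2'], row['lon2']))

df_test['dist_km'] = df_test.apply(row_distance, axis=1)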
I have lists of coordinates in a csv file. How should I convert them to polygons in a GeoDataFrame?
Below are the coordinates of one polygon, and I have thousands of rows like this.
[118.103198,24.527338],[118.103224,24.527373],[118.103236,24.527366],[118.103209,24.527331],[118.103198,24.527338]
I tried the following code:
from shapely.geometry import Polygon

def bike_fence_format(s):
    s = s.replace('[', '').replace(']', '').split(',')
    return s

df['FENCE_LOC'] = df['FENCE_LOC'].apply(bike_fence_format)
df['LAT'] = df['FENCE_LOC'].apply(lambda x: x[1::2])
df['LON'] = df['FENCE_LOC'].apply(lambda x: x[::2])
df['geom'] = Polygon(zip(df['LON'].astype(str), df['LAT'].astype(str)))  # this is where it fails
But I failed at the last step, since df['LON'] returns a Series, not a string. How can I get past this problem? Even better if there is an easier way to achieve my goal.
I recreated a sample df of what your .csv file would give (depending on how you read it in with .read_csv()).
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon

df = pd.DataFrame({'FENCE_LOC': ['[32250,175889],[33913,180757],[29909,182124],[28246,177257],[32250,175889]',
                                 '[32250,175889],[33913,180757],[29909,182124],[28246,177257],[32250,175889]',
                                 '[32250,175889],[33913,180757],[29909,182124],[28246,177257],[32250,175889]']},
                  index=[0, 1, 2])
I modified your function slightly, because we want numeric values, not strings:
def bike_fence_format(s):
    s = s.replace('[', '').replace(']', '').split(',')
    s = [float(x) for x in s]
    return s

df['FENCE_LOC'] = df['FENCE_LOC'].apply(bike_fence_format)
df['LAT'] = df['FENCE_LOC'].apply(lambda x: x[1::2])
df['LON'] = df['FENCE_LOC'].apply(lambda x: x[::2])
We can use some list comprehensions to build a list of Shapely polygons.
geom_list = [(x, y) for x, y in zip(df['LON'], df['LAT'])]
geom_list_2 = [Polygon(tuple(zip(x, y))) for x, y in geom_list]
Finally, we can create a gdf using our list of Shapely polygons.
polygon_gdf = gpd.GeoDataFrame(geometry=geom_list_2)
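As a quick optional check, the resulting GeoDataFrame can be inspected and plotted:
print(polygon_gdf.geometry.head())  # should list POLYGON (...) geometries
polygon_gdf.plot(edgecolor='black')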
To make available a small representative dataset similar to what the OP posted as an image, I created these rows of data (sorry for the many decimal digits):
[[-2247824.100899419,-4996167.43201861],[-2247824.100899419,-4996067.43201861],[-2247724.100899419,-4996067.43201861],[-2247724.100899419,-4996167.43201861],[-2247824.100899419,-4996167.43201861]]
[[-2247724.100899419,-4996167.43201861],[-2247724.100899419,-4996067.43201861],[-2247624.100899419,-4996067.43201861],[-2247624.100899419,-4996167.43201861],[-2247724.100899419,-4996167.43201861]]
[[-2247624.100899419,-4996167.43201861],[-2247624.100899419,-4996067.43201861],[-2247524.100899419,-4996067.43201861],[-2247524.100899419,-4996167.43201861],[-2247624.100899419,-4996167.43201861]]
[[-2247824.100899419,-4996067.43201861],[-2247824.100899419,-4995967.43201861],[-2247724.100899419,-4995967.43201861],[-2247724.100899419,-4996067.43201861],[-2247824.100899419,-4996067.43201861]]
[[-2247724.100899419,-4996067.43201861],[-2247724.100899419,-4995967.43201861],[-2247624.100899419,-4995967.43201861],[-2247624.100899419,-4996067.43201861],[-2247724.100899419,-4996067.43201861]]
[[-2247624.100899419,-4996067.43201861],[-2247624.100899419,-4995967.43201861],[-2247524.100899419,-4995967.43201861],[-2247524.100899419,-4996067.43201861],[-2247624.100899419,-4996067.43201861]]
[[-2247824.100899419,-4995967.43201861],[-2247824.100899419,-4995867.43201861],[-2247724.100899419,-4995867.43201861],[-2247724.100899419,-4995967.43201861],[-2247824.100899419,-4995967.43201861]]
[[-2247724.100899419,-4995967.43201861],[-2247724.100899419,-4995867.43201861],[-2247624.100899419,-4995867.43201861],[-2247624.100899419,-4995967.43201861],[-2247724.100899419,-4995967.43201861]]
[[-2247624.100899419,-4995967.43201861],[-2247624.100899419,-4995867.43201861],[-2247524.100899419,-4995867.43201861],[-2247524.100899419,-4995967.43201861],[-2247624.100899419,-4995967.43201861]]
This data is saved as the file polygon_data.csv.
For the code, the modules are loaded first:
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon
Then, the data is read into a dataframe with pandas.read_csv(). To get each row of data into a single column of the dataframe, delimiter="x" is used. Since there is no "x" within any row of data, each whole row ends up as one long string.
df3 = pd.read_csv('polygon_data.csv', header=None, index_col=None, delimiter="x")
To view the content of df3, you can run
df3.head()
and get single column (with header: 0) dataframe:
0
0 [[-2247824.100899419,-4996167.43201861],[-2247...
1 [[-2247724.100899419,-4996167.43201861],[-2247...
2 [[-2247624.100899419,-4996167.43201861],[-2247...
3 [[-2247824.100899419,-4996067.43201861],[-2247...
4 [[-2247724.100899419,-4996067.43201861],[-2247...
Next, df3 is used to create a geoDataFrame. Data in each row of df3 is used to create a Polygon object to act as the geometry of the geoDataFrame polygon_df3.
geometry = [Polygon(eval(xy_string)) for xy_string in df3[0]]
polygon_df3 = gpd.GeoDataFrame(df3,
                               #crs='EPSG:4326',  # uncomment this if (x, y) is long/lat
                               geometry=geometry)
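As a side note: each row is already valid JSON, so json.loads could replace eval here if you prefer not to evaluate strings as code (a sketch of the same step):
import json

# parse a row string like "[[x1,y1],[x2,y2],...]" into a list of pairs
geometry = [Polygon(json.loads(xy_string)) for xy_string in df3[0]]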
Finally, the geoDataFrame can be plotted with a simple command:
# this plot the geoDataFrame
polygon_df3.plot(edgecolor='black')
In this particular case with my proposed data, the output plot is:
I'm working with a dataframe with a column containing a np.array per row (in this case representing the mean waveform of brain recordings through time). I want to calculate the Pearson correlation of this column (array by array).
This is my code:
from scipy import stats
import numpy as np

length = len(df.Mean)
Mean = []
for i in range(len(df.Mean)):
    Mean.append(df.Mean[i])

Correlation_p = np.zeros((length, length))
P_Value_p = np.zeros((length, length))
for i in range(length):
    for j in range(length):
        Correlation_p[i][j], P_Value_p[i][j] = stats.pearsonr(df.Mean[i], df.Mean[j])
This works, but I want to know if there is a more pythonic way to do it, maybe using df.corr(). I tried, but I couldn't figure out how to do it.
EDIT: this is the output of df.Mean.head():
0 [-0.2559348091247745, 0.02743063113723536, 0.3...
1 [-0.37025615099744325, -0.11299328141596175, 0...
2 [-1.0543681894876467, -0.8452798699354909, -0....
3 [-0.23527437766943646, -0.28657810260136585, -...
4 [0.45557980303095674, 0.6055674269814991, 0.74...
Name: Mean, dtype: object
The arrays that you would like to correlate seem to be in single cells of the DataFrame, if I am not mistaken. The following brings them into a format where each single array occupies a single column.
I made a data example that resembles the format of df.Mean.head():
df = pd.DataFrame({'x': [np.random.randint(0, 5, 10), np.random.randint(0, 5, 10), np.random.randint(0, 5, 10)]})
You can turn these arrays into columns using this:
df = pd.DataFrame(np.array(df['x'].tolist()).transpose())
Adapt this transformation according to your own dimensions.
From there, it would be fairly straightforward.
A correlation matrix can be created by:
df.corr()
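Applied to the original df.Mean column directly, a minimal sketch (assuming all the arrays have the same length) would be:
# stack the per-row arrays as columns, then correlate column-wise; this
# reproduces the Pearson matrix from the nested loop (without the p-values)
wave_df = pd.DataFrame(np.column_stack(df.Mean.to_list()))
correlation_matrix = wave_df.corr().to_numpy()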
A visualization of the correlation matrix:
import matplotlib.pyplot as plt
plt.matshow(df.corr())
plt.show()
The task is to make an address popularity map for Moscow. Basically, it should look like this:
https://nbviewer.jupyter.org/github/python-visualization/folium/blob/master/examples/GeoJSON_and_choropleth.ipynb
For my map I use public geojson: http://gis-lab.info/qa/moscow-atd.html
The only data I have is point coordinates, and there's no information about the district they belong to.
Question 1:
Do I have to manually calculate for each district whether a point belongs to it, or is there a more effective way to do this?
Question 2:
If there is no easier way, how can I get all the coordinates for each district from the geojson file (link above)?
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point
Reading in the Moscow area shape file with geopandas:
districts = gpd.read_file('mo-shape/mo.shp')
Construct a mock user dataset:
moscow = [55.7, 37.6]
data = (
np.random.normal(size=(100, 2)) *
np.array([[.25, .25]]) +
np.array([moscow])
)
my_df = pd.DataFrame(data, columns=['lat', 'lon'])
my_df['pop'] = np.random.randint(500, 100000, size=len(data))
Create Point objects from the user data:
geom = [Point(x, y) for x,y in zip(my_df['lon'], my_df['lat'])]
# and a geopandas dataframe using the same crs from the shape file
my_gdf = gpd.GeoDataFrame(my_df, geometry=geom)
my_gdf.crs = districts.crs
Then the join, using the default value of 'inner':
gpd.sjoin(districts, my_gdf, op='contains')
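To turn the join into per-district counts for the popularity map, one possible follow-up (a sketch, reusing the names above) is:
joined = gpd.sjoin(districts, my_gdf, op='contains')
# one row per (district, point) containment pair, indexed by district
counts = joined.groupby(joined.index).size()
districts['popularity'] = counts.reindex(districts.index, fill_value=0)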
Thanks to @BobHaffner, I tried to solve the problem using geopandas.
Here are my steps:
I downloaded a shapefile for Moscow using the link above.
From a list of tuples containing x and y (latitude and longitude) coordinates, I created a list of Points (see the shapely docs).
Assuming that the dataframe from the first link gives me the polygons, I can write a simple loop to check whether each Point is inside a given polygon, as sketched below.
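A minimal sketch of that loop (an illustration only; `districts` as a GeoDataFrame of polygons and `coords` as the list of (lat, lon) tuples are assumed names):
from shapely.geometry import Point

# points built from (lat, lon) tuples as in the step above
points = [Point(lon, lat) for lat, lon in coords]
# count how many points fall inside each district polygon
districts['popularity'] = [sum(poly.contains(pt) for pt in points)
                           for poly in districts.geometry]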