I am using Python and I would like to represent a set of time series on a heatmap.
My static representation of the geographical space, which is a "fixed-time slice" of all the time series, looks like this:
Each time series corresponds to a cell, and every cell is associated with a specific geometric shape on the plot above. Here is an example of one of the time series:
Is there any way I can "animate" the heatmap above, or at least set a time parameter that I can adjust in order to see the time evolution of the entire map? Obviously, the arrays all have the same length and are stored as NumPy arrays in a DataFrame like this:
SQUAREID
155 [0.057949285005512684, 0.04961411245865491, 0....
272 [0.4492307820512821, 0.3846153846153846, 0.415...
273 [0.09658214167585447, 0.08269018743109151, 0.0...
276 [0.03208695579710145, 0.03234782536231884, 0.0...
277 [0.82994485446527, 0.8366923737596471, 0.79620...
...
10983 [0.6770833333333334, 0.6865036231884057, 0.692...
10984 [0.21875, 0.22179347826086956, 0.2236956521739...
11097 [0.5921739130434782, 0.5934782608695652, 0.598...
11098 [0.06579710144927536, 0.06594202898550725, 0.0...
11099 [0.21273428886438808, 0.21320286659316426, 0.2...
Name: wp, Length: 2020, dtype: object
and SQUAREID is matched with the cellId column in a GeoDataFrame that looks like this:
cellId geometry
0 38 POLYGON ((10.91462 45.68201, 10.92746 45.68179...
1 39 POLYGON ((10.92746 45.68179, 10.94029 45.68157...
2 40 POLYGON ((10.94029 45.68157, 10.95312 45.68136...
3 154 POLYGON ((10.90209 45.69122, 10.91493 45.69100...
4 155 POLYGON ((10.91493 45.69100, 10.92777 45.69079...
... ... ...
6570 11336 POLYGON ((11.80475 46.52767, 11.81777 46.52735...
6571 11337 POLYGON ((11.81777 46.52735, 11.83080 46.52703...
6572 11452 POLYGON ((11.79219 46.53698, 11.80521 46.53666...
6573 11453 POLYGON ((11.80521 46.53666, 11.81824 46.53634...
6574 11454 POLYGON ((11.81824 46.53634, 11.83126 46.53601...
6575 rows × 2 columns
Thanks in advance.
You can use the animation module from the matplotlib library: https://towardsdatascience.com/learn-how-to-create-animated-graphs-in-python-fce780421afe
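For instance, a minimal sketch using matplotlib's FuncAnimation, assuming the polygons live in a GeoDataFrame called geo_df and the time series in the wp Series shown above (both names are placeholders):

import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

# Attach each cell's time series to its polygon (wp is indexed by SQUAREID).
merged = geo_df.merge(wp, left_on='cellId', right_index=True)
n_frames = len(merged['wp'].iloc[0])

fig, ax = plt.subplots()

def update(t):
    ax.clear()
    # Take the "fixed-time slice" at index t from every series.
    merged['frame'] = merged['wp'].apply(lambda arr: arr[t])
    # Fixed vmin/vmax keep the colour scale comparable across frames.
    merged.plot(column='frame', ax=ax, vmin=0, vmax=1, cmap='viridis')
    ax.set_title(f't = {t}')

anim = FuncAnimation(fig, update, frames=n_frames, interval=200)
plt.show()

For the adjustable time parameter, the same update function can be wired to an interactive slider (e.g. ipywidgets.interact) instead of FuncAnimation.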
I am trying to draw a choropleth map of municipalities in Denmark, with color encoded as the sum of crimes in each municipality.
I have several entries for each municipality, since the data spans a time period and several types of crime, and I have a single geometry entry per municipality.
I want to perform a transform_lookup on the geometry field in the geopandas dataframe on the label_dk key, but I can't seem to get the map to render.
I could always merge the dataframes, but I am trying to save space by not repeating the geometry for every crime entry, since I also want to plot the data in different charts and allow for slicing and dicing over time and offence.
Bear in mind that this crime data is just a small example, and the real data I want to use has around 30,000 entries, so a merged geojson file takes up 647,000 KB and the map won't render.
Does anybody know why this transform_lookup doesn't work?
The data looks like this:
label_dk geometry
0 Aabenraa MULTIPOLYGON Z (((9.51215 54.85672 -999.00000,...
1 Aalborg MULTIPOLYGON Z (((9.84688 57.04365 -999.00000,...
2 Aarhus POLYGON Z ((9.99682 56.17872 -999.00000, 9.990...
3 Albertslund POLYGON Z ((12.35234 55.70461 -999.00000, 12.3...
4 Allerød POLYGON Z ((12.31845 55.88305 -999.00000, 12.3...
.. ... ...
94 Vejle POLYGON Z ((9.11714 55.76669 -999.00000, 9.100...
95 Vesthimmerlands MULTIPOLYGON Z (((9.17798 56.91745 -999.00000,...
96 Viborg POLYGON Z ((9.29501 56.59336 -999.00000, 9.297...
97 Vordingborg MULTIPOLYGON Z (((12.04479 54.95566 -999.00000...
98 Ærø MULTIPOLYGON Z (((10.43467 54.87952 -999.00000...
[99 rows x 2 columns]
tid offence label_dk Anmeldte forbrydelser
0 2021K1 Seksualforbrydelser i alt København 133
1 2021K1 Voldsforbrydelser i alt København 900
2 2021K2 Seksualforbrydelser i alt København 244
3 2021K2 Voldsforbrydelser i alt København 996
4 2021K3 Seksualforbrydelser i alt København 174
.. ... ... ... ...
787 2021K2 Voldsforbrydelser i alt Aalborg 178
788 2021K3 Seksualforbrydelser i alt Aalborg 53
789 2021K3 Voldsforbrydelser i alt Aalborg 185
790 2021K4 Seksualforbrydelser i alt Aalborg 43
791 2021K4 Voldsforbrydelser i alt Aalborg 205
[792 rows x 4 columns]
The code is below:
import altair as alt
import geopandas as gpd
import pandas as pd
import altair_viewer
alt.data_transformers.enable('data_server')
path = "data/small_few_umbrella_terms_crimes_2021.csv"
df = pd.read_csv(path,encoding="utf_8",index_col='Unnamed: 0')
geometry = gpd.read_file("data_with_geo/geometry.geojson")
map_chart = alt.Chart(df).mark_geoshape(
).transform_aggregate(
crime='sum(Anmeldte forbrydelser)',
groupby=["label_dk"]
).transform_lookup(
lookup='label_dk',
from_=alt.LookupData(geometry, 'label_dk', ['geometry'])
).encode(
color=alt.Color(
"crime:Q",
scale=alt.Scale(
scheme='viridis')
)
)
altair_viewer.show(map_chart)
The data can be found here:
https://github.com/Joac1137/Data-Visualization/blob/main/data_with_geo/geometry.geojson
and
https://github.com/Joac1137/Data-Visualization/blob/main/data/small_few_umbrella_terms_crimes_2021.csv
I think you're running into an issue similar to HConcat of mark_geoshape and mark_bar breaks depending of order (and the comments in the linked vega-lite issue). If you change the order of the data frames it will work.
There also seems to be some issue with the aggregation, which I think is related to https://github.com/altair-viz/altair/issues/1357, so I just used pandas to aggregate here:
grouped_sums = df.groupby('label_dk').sum().reset_index()
alt.Chart(geometry).mark_geoshape().transform_lookup(
lookup='label_dk',
from_=alt.LookupData(grouped_sums, 'label_dk', grouped_sums.columns.tolist())
).encode(
color=alt.Color("Anmeldte forbrydelser:Q"),
tooltip=['label_dk', 'Anmeldte forbrydelser:Q']
)
We're working on a revamp of the geo docs, which you might find useful: https://deploy-preview-1--spontaneous-sorbet-49ed10.netlify.app/user_guide/marks/geoshape.html#lookup-datasets
Thanks a lot @joelostblom!
I found the solution in the new docs you linked.
The trick was that I was missing the "type" column in my geojson, which usually only contains the string "Feature", but whatever.
The geojson data now looks like this:
label_dk type geometry
0 Aabenraa Feature MULTIPOLYGON Z (((9.51215 54.85672 -999.00000,...
1 Aalborg Feature MULTIPOLYGON Z (((9.84688 57.04365 -999.00000,...
2 Aarhus Feature POLYGON Z ((9.99682 56.17872 -999.00000, 9.990...
3 Albertslund Feature POLYGON Z ((12.35234 55.70461 -999.00000, 12.3...
4 Allerød Feature POLYGON Z ((12.31845 55.88305 -999.00000, 12.3...
And the code like this:
import altair as alt
import geopandas as gpd
import pandas as pd
import altair_viewer
path = "data/small_few_umbrella_terms_crimes_2021.csv"
df = pd.read_csv(path,encoding="utf_8",index_col='Unnamed: 0')
geometry = gpd.read_file("data_with_geo/geometry.geojson")
map_chart = alt.Chart(df).transform_lookup(
lookup='label_dk',
from_=alt.LookupData(geometry, 'label_dk',['geometry','type'])
).transform_aggregate(
crime='sum(Anmeldte forbrydelser)',
groupby=["label_dk","type","geometry"]
).mark_geoshape(
).encode(
color=alt.Color(
"crime:Q",
scale=alt.Scale(
scheme='viridis')
)
)
altair_viewer.show(map_chart)
Changing from the merged data that I previously used to this lookup method resulted in a significant speedup when initializing. It used to take around 10 minutes to start up, but now it does it in a matter of seconds.
When using networkx, I only know that there are several ways of plotting graphs with edges and nodes.
Is it possible to plot only a lot of nodes, without connections between them? The points all have x and y coordinates, and are saved in a pandas dataframe with only 3 columns: ID, X, Y
g = nx.from_pandas_dataframe(df1, source='x', target='y')
I tried something like this, but I don't want edges, only points.
This is a part of the dataframe:
id x y
0 550 1005.600 1539.400
1 551 1006.600 1549.400
2 705 1029.997 2140.001
3 706 1030.997 2141.001
4 478 180.000 1354.370
5 479 190.000 1354.370
.. ... ... ...
500 237 1135.000 2615.000
501 238 1145.000 2615.000
You can draw nodes and edges separately. Use the following to only draw the nodes:
nodes = nx.draw_networkx_nodes(G, pos)  # pos (a dict of node -> (x, y)) is required
If you want the nodes at specific positions, create pos out of the x and y values, as sketched below. (At that point I would rather not use networkx at all...)
See the docs...
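A minimal sketch, assuming the dataframe from the question is named df1 with columns id, x and y:

import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_nodes_from(df1['id'])                                 # nodes only, no edges
pos = {row.id: (row.x, row.y) for row in df1.itertuples()}  # node -> (x, y)
nx.draw_networkx_nodes(G, pos, node_size=10)
plt.show()

If you never need edges, a plain df1.plot.scatter(x='x', y='y') gives the same picture without networkx.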
My goal here is to make a geodataframe from a couple of columns of coordinates in an existing dataframe, take those 1677 geographic points and add a buffer circle around each, then union the resulting polygons into a multipolygon. Where I keep getting wrapped around the axle is that geopandas' .buffer() doesn't seem to use the units of measure of the CRS I've selected.
In []: ven_coords
Out []: VenLat VenLon
0 42.34768 -71.085359
1 42.349014 -71.081096
2 42.347627 -71.081685
3 42.348718 -71.077984
4 42.34896 -71.081467
... ... ...
1672 42.308962 -71.073516
1673 42.313169 -71.089027
1674 42.309717 -71.08247
1675 42.356336 -71.074386
1676 42.313005 -71.089887
1677 rows × 2 columns
In []: ven_coords_gdf = geopandas.GeoDataFrame(ven_coords,
geometry=geopandas.points_from_xy(ven_coords.VenLon, ven_coords.VenLat))
ven_coords_gdf
Out []: VenLat VenLon geometry
0 42.34768 -71.085359 POINT (-71.08536 42.34768)
1 42.349014 -71.081096 POINT (-71.08110 42.34901)
2 42.347627 -71.081685 POINT (-71.08168 42.34763)
3 42.348718 -71.077984 POINT (-71.07798 42.34872)
4 42.34896 -71.081467 POINT (-71.08147 42.34896)
... ... ... ...
1672 42.308962 -71.073516 POINT (-71.07352 42.30896)
1673 42.313169 -71.089027 POINT (-71.08903 42.31317)
1674 42.309717 -71.08247 POINT (-71.08247 42.30972)
1675 42.356336 -71.074386 POINT (-71.07439 42.35634)
1676 42.313005 -71.089887 POINT (-71.08989 42.31300)
1677 rows × 3 columns
So far so good, let's see what sort of thing I got back:
In []: print('Type:', type(ven_coords_gdf), "/ current CRS is:",ven_coords_gdf.crs)
Out []: Type: <class 'geopandas.geodataframe.GeoDataFrame'> / current CRS is: None
It has no CRS, so I assign it the one relevant to what I'm working on:
In []: ven_coords_gdf.crs = ("epsg:2249")
print('Type:', type(ven_coords_gdf), "/ current CRS is:",ven_coords_gdf.crs)
Out []: Type: <class 'geopandas.geodataframe.GeoDataFrame'> / current CRS is: epsg:2249
It appears to have "taken" the CRS I added, and just to double-check, let's take a look at the details for the CRS in question:
In []: CRS.from_epsg(2249)
Out []: <Projected CRS: EPSG:2249>
Name: NAD83 / Massachusetts Mainland (ftUS)
Axis Info [cartesian]:
- X[east]: Easting (US survey foot)
- Y[north]: Northing (US survey foot)
Area of Use:
- name: United States (USA) - Massachusetts onshore - counties of Barnstable; Berkshire; Bristol; Essex; Franklin; Hampden; Hampshire; Middlesex; Norfolk; Plymouth; Suffolk; Worcester.
- bounds: (-73.5, 41.46, -69.86, 42.89)
Coordinate Operation:
- name: SPCS83 Massachusetts Mainland zone (US Survey feet)
- method: Lambert Conic Conformal (2SP)
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
2249 uses the U.S. Survey Foot as its unit of measure, so I'll set my buffer to 1000 to get a 1000-foot radius around each of the points in my data:
In []: ven_coords_buffer = ven_coords_gdf.geometry.buffer(distance = 1000)
ven_coords_buffer
Out []: 0 POLYGON ((928.915 42.348, 924.099 -55.669, 909...
1 POLYGON ((928.919 42.349, 924.104 -55.668, 909...
2 POLYGON ((928.918 42.348, 924.103 -55.670, 909...
3 POLYGON ((928.922 42.349, 924.107 -55.668, 909...
4 POLYGON ((928.919 42.349, 924.103 -55.668, 909...
...
1672 POLYGON ((928.926 42.309, 924.111 -55.708, 909...
1673 POLYGON ((928.911 42.313, 924.096 -55.704, 909...
1674 POLYGON ((928.918 42.310, 924.102 -55.707, 909...
1675 POLYGON ((928.926 42.356, 924.110 -55.661, 909...
1676 POLYGON ((928.910 42.313, 924.095 -55.704, 909...
Length: 1677, dtype: geometry
Those coordinates are just a wee bit off. Clearly the buffer was applied as 1000°, not 1000 ft, resulting in a glob of 1677 massive overlapping circles that cover the entire globe. Not quite what I'm looking for. Obviously I'm missing something; any suggestions?
As with any fun code problem, I swear it worked earlier, honest. I futzed around for a while before I finally got it to output the right thing, then I shut it down, went to dinner, came back and re-ran it, and got the above. The obvious deduction is that something I'd done in the aforementioned futzing around had been key to getting it to work, some re-used variable or whatever, but I can't figure out what's missing in the code above.
GeoPandas 0.9.0, pyproj 3.0.1
screenshot from happier times when it worked and I got it onto a map
GeoPandas does exactly what it is expected to do. You have to re-project your geometries to the target CRS; simply assigning a CRS does not transform anything.
When creating the GeoDataFrame, make sure you specify which CRS your data is in. In this case it is EPSG:4326, i.e. geographic coordinates in degrees.
ven_coords_gdf = geopandas.GeoDataFrame(ven_coords,
geometry=geopandas.points_from_xy(ven_coords.VenLon, ven_coords.VenLat),
crs=4326)
Once properly set, you have to reproject (transform) your coordinates to a target CRS using to_crs.
ven_coords_gdf_projected = ven_coords_gdf.to_crs("epsg:2249")
Now you can use the buffer in feet. If you want to store the result in 4326 again, you just reproject it back using to_crs(4326).
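Putting it together, a minimal sketch of the whole pipeline, including the union into a multipolygon mentioned in the question:

import geopandas

ven_coords_gdf = geopandas.GeoDataFrame(
    ven_coords,
    geometry=geopandas.points_from_xy(ven_coords.VenLon, ven_coords.VenLat),
    crs=4326,  # the data is lon/lat degrees, so declare WGS84 here
)
projected = ven_coords_gdf.to_crs("epsg:2249")  # coordinates now in US survey feet
buffers = projected.geometry.buffer(1000)       # genuine 1000 ft circles
union = buffers.unary_union                     # one (multi)polygon
union_wgs84 = geopandas.GeoSeries([union], crs="epsg:2249").to_crs(4326)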
I swear it worked earlier, honest.
I am pretty sure it did not :).
I have a list of points (longitude and latitude), as well as their associated point geometries, in a geodataframe. All of the points should be able to be subdivided into individual polygons, as the points are generally clustered in several areas. What I would like to do is have some sort of algorithm that loops over the points and checks the distance between the previous and current point. If the distance is sufficiently small, it would group those points together. This process would continue until the current point is too far away. It would make a polygon out of those close points, and then continue the process with the next group of points.
gdf
longitude latitude geometry
0 -76.575249 21.157229 POINT (-76.57525 21.15723)
1 -76.575035 21.157453 POINT (-76.57503 21.15745)
2 -76.575255 21.157678 POINT (-76.57526 21.15768)
3 -76.575470 21.157454 POINT (-76.57547 21.15745)
5 -112.973177 31.317333 POINT (-112.97318 31.31733)
... ... ... ...
2222 -113.492501 47.645914 POINT (-113.49250 47.64591)
2223 -113.492996 47.643609 POINT (-113.49300 47.64361)
2225 -113.492379 47.643557 POINT (-113.49238 47.64356)
2227 -113.487443 47.643142 POINT (-113.48744 47.64314)
2230 -105.022627 48.585669 POINT (-105.02263 48.58567)
So in the data above, the first 4 points would be grouped together and turned into a polygon. Then, it would move onto the next group, and so forth. Each group of points is not evenly spaced, i.e., the next group might be 7 pairs of points, and the following could be 3. Ideally, the final output would be another geodataframe that is just a bunch of polygons.
You can try DBSCAN clustering, as it automatically finds the best number of clusters and lets you specify a maximum distance between points (ε).
Using your example, the algorithm identifies two clusters.
import pandas as pd
from sklearn.cluster import DBSCAN
df = pd.DataFrame(
[
[-76.575249, 21.157229, (-76., 21.15723)],
[-76.575035, 21.157453, (-76.57503, 21.15745)],
[-76.575255, 21.157678, (-76.57526, 21.15768)],
[-76.575470, 21.157454, (-76.57547, 21.15745)],
[-112.973177, 31.317333, (-112.97318, 31.31733)],
[-113.492501, 47.645914, (-113.49250, 47.64591)],
[-113.492996, 47.643609, (-113.49300, 47.64361)],
[-113.492379, 47.643557, (-113.49238, 47.64356)],
[-113.487443, 47.643142, (-113.48744, 47.64314)],
[-105.022627, 48.585669, (-105.02263, 48.58567)]
], columns=["longitude", "latitude", "geometry"])
clustering = DBSCAN(eps=0.3, min_samples=4).fit(df[['longitude','latitude']].values)
gdf = pd.concat([df, pd.Series(clustering.labels_, name='label')], axis=1)
print(gdf)
gdf.plot.scatter(x='longitude', y='latitude', c='label')
longitude latitude geometry label
0 -76.575249 21.157229 (-76.0, 21.15723) 0
1 -76.575035 21.157453 (-76.57503, 21.15745) 0
2 -76.575255 21.157678 (-76.57526, 21.15768) 0
3 -76.575470 21.157454 (-76.57547, 21.15745) 0
4 -112.973177 31.317333 (-112.97318, 31.31733) -1 # not in cluster
5 -113.492501 47.645914 (-113.4925, 47.64591) 1
6 -113.492996 47.643609 (-113.493, 47.64361) 1
7 -113.492379 47.643557 (-113.49238, 47.64356) 1
8 -113.487443 47.643142 (-113.48744, 47.64314) 1
9 -105.022627 48.585669 (-105.02263, 48.58567) -1 # not in cluster
If we add random data to your data set, run the clustering algorithm, and filter out those data points not in clusters, you get a clearer idea of how it's working.
import numpy as np
rng = np.random.default_rng(seed=42)
arr2 = pd.DataFrame(rng.random((3000, 2)) * 100, columns=['latitude', 'longitude'])
randdf = pd.concat([df[['latitude', 'longitude']], arr2]).reset_index()
clustering = DBSCAN(eps=1, min_samples=4).fit(randdf[['longitude','latitude']].values)
labels = pd.Series(clustering.labels_, name='label')
gdf = pd.concat([randdf[['latitude', 'longitude']], labels], axis=1)
subgdf = gdf[gdf['label']> -1]
subgdf.plot.scatter(x='longitude', y='latitude', c='label', colormap='viridis', figsize=(20,10))
print(gdf['label'].value_counts())
-1 2527
16 10
3 8
10 8
50 8
...
57 4
64 4
61 4
17 4
0 4
Name: label, Length: 99, dtype: int64
Getting the clustered points from this dataframe would be relatively simple. Something like this:
subgdf['point'] = subgdf.apply(lambda x: (x['latitude'], x['longitude']), axis=1)
subgdf.groupby(['label'])['point'].apply(list)
label
0 [(21.157229, -76.575249), (21.157453, -76.5750...
1 [(47.645914, -113.492501), (47.643609, -113.49...
2 [(46.67210037270342, 4.380376578722878), (46.5...
3 [(85.34030732681661, 23.393948586534073), (86....
4 [(81.40203846660347, 16.697291990770392), (82....
...
93 [(61.419880354359925, 23.25522624430636), (61....
94 [(50.893415175135424, 90.70863269095085), (52....
95 [(88.80586950148697, 81.17523712192651), (88.6...
96 [(34.23624333000541, 40.8156668231013), (35.86...
97 [(16.10456828199399, 67.41443008931344), (15.9...
Name: point, Length: 98, dtype: object
Although you'd probably need to do some kind of sorting to make sure you were connecting the closest points when drawing the polygons.
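One option for that last step is to skip the sorting altogether and take the convex hull of each cluster; a hedged sketch, reusing the labelled gdf from above:

import geopandas as gpd
from shapely.geometry import MultiPoint

clustered = gdf[gdf['label'] > -1]  # drop the noise points (label == -1)
hulls = clustered.groupby('label').apply(
    lambda g: MultiPoint(list(zip(g['longitude'], g['latitude']))).convex_hull
)
poly_gdf = gpd.GeoDataFrame({'label': hulls.index}, geometry=list(hulls), crs=4326)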
Similar SO question
DBSCAN from sklearn
Haversine Formula in Python (Bearing and Distance between two GPS points)
https://gis.stackexchange.com/questions/121256/creating-a-circle-with-radius-in-metres
You may be able to use the haversine formula to group points within a distance. Create a polygon around each point (function below), then filter the points that fall inside it out of the master list, and repeat until there are no more points.
#import modules
import numpy as np
import pandas as pd
import geopandas as gpd
from geopandas import GeoDataFrame, GeoSeries
from shapely import geometry
from shapely.geometry import Polygon, Point
from functools import partial
import pyproj
from shapely.ops import transform
#function to create polygons on radius
def polycir(lat, lon, radius):
    local_azimuthal_projection = (
        "+proj=aeqd +R=6371000 +units=m +lat_0={} +lon_0={}".format(lat, lon)
    )
    wgs84_to_aeqd = partial(
        pyproj.transform,
        pyproj.Proj("+proj=longlat +datum=WGS84 +no_defs"),
        pyproj.Proj(local_azimuthal_projection),
    )
    aeqd_to_wgs84 = partial(
        pyproj.transform,
        pyproj.Proj(local_azimuthal_projection),
        pyproj.Proj("+proj=longlat +datum=WGS84 +no_defs"),
    )
    center = Point(float(lon), float(lat))
    point_transformed = transform(wgs84_to_aeqd, center)
    buffer = point_transformed.buffer(radius)
    # Get the polygon with lat lon coordinates
    circle_poly = transform(aeqd_to_wgs84, buffer)
    return circle_poly
#Convert df to gdf
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.longitude, df.latitude))
#Create circle polygons col
gdf['polycir'] = [polycir(x, y, <'Radius in Meters'>)
                  for x, y in zip(gdf.latitude, gdf.longitude)]
gdf.set_geometry('polycir', inplace=True)
#You should be able to loop through the polygons and find the geometries that overlap with
# gdf_filtered = gdf[gdf.polycir.within(gdf.iloc[0,4])]
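A hedged sketch of the filter-and-repeat loop described above, assuming the gdf from the snippet just shown:

remaining = gdf.copy()
groups = []
while len(remaining) > 0:
    circle = remaining['polycir'].iloc[0]                     # circle around the first remaining point
    inside = remaining[remaining['geometry'].within(circle)]  # everything inside it
    groups.append(inside)                                     # one group of nearby points
    remaining = remaining.drop(inside.index)                  # repeat with what's left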
Looks like a job for k-means clustering.
You may need to be careful about how you define your distance (actual distance "through" the earth, or the shortest path around it?).
Turning each cluster into a polygon depends on what you want to do... just chain the points or look for their convex envelope...
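A rough sketch, assuming you can estimate the number of groups k up front (unlike DBSCAN, k-means needs it), and reusing the longitude/latitude columns from the question:

from sklearn.cluster import KMeans

k = 3  # assumed number of point groups
labels = KMeans(n_clusters=k, n_init=10).fit_predict(gdf[['longitude', 'latitude']])
gdf['label'] = labels  # then build one polygon (e.g. a convex hull) per label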
I have two shapefiles of a city. The first one is extremely detailed, down to the level of blocks, and has various information about each block, including the population density. The second one is the same city divided into a square grid of 1.45 km² cells, with no other information.
I want to calculate the population density in each cell of the square grid. I tried with
enriched=gpd.read_file('enriched.shp') #gdf with pop density info
grid=gpd.read_file('grid.shp') #grid gdf
popd=gpd.sjoin(grid[['cell_id','geometry']],enriched, op='intersects') #merge grid with enriched shp
popd=popd[['cell_id','popdens']].groupby(['cell_id']).sum().reset_index() #groupby cell and sum the densities of the blocks within
grid=grid.merge(popd,on='cell_id', how='left').fillna(0)
but I am not sure this is the proper way, since I am getting very high density values in some cells (like > 200k per km²). Is this right? How can I check that I am not missing anything?
EDIT: Here are the column headers of the two shapefiles
enriched.columns
Index(['REGION', 'PROVINCIA', 'COMUNA', 'COD_DISTRI', 'COD_ZONA', 'area', 'popdens', 'geometry'],
dtype='object')
enriched.head(2)
REGION PROVINCIA COMUNA COD_DISTRI COD_ZONA area popdens geometry
0 13 131 13121 2.0 1.0 0.442290 4589.75053 POLYGON ((-70.65571 -33.47856, -70.65575 -33.4...
1 13 131 13121 6.0 1.0 0.773985 7661.64421 POLYGON ((-70.68182 -33.47654, -70.68144 -33.4...
Don't worry about the first 5 columns; you can see them as a primary key in the dataset: all together they uniquely identify a zone.
grid.columns
Index(['cell_id', 'geometry'], dtype='object')
grid.head(2)
cell_id geometry
0 sq00024 POLYGON ((-70.79970 -33.50447, -70.78894 -33.5...
1 sq00025 POLYGON ((-70.79989 -33.51349, -70.78913 -33.5...
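As a quick plausibility check (a sketch, assuming popdens is people per km² and area is in km², which the column names suggest), the total population implied by each layer should roughly agree:

# Population implied by the blocks: density times block area, summed.
block_total = (enriched['popdens'] * enriched['area']).sum()
# Population implied by the grid: each cell covers 1.45 km2.
grid_total = (grid['popdens'] * 1.45).sum()
print(block_total, grid_total)  # a large gap suggests the sjoin double-counts blocks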