I have two sets of shapefiles with polygons. One set contains the US counties I'm interested in, and this varies across firms and years. The other set contains the business areas of firms, which of course also vary across firms and years. I need the intersection of these two layers for each firm in each year. So far, overlay(df1, df2, how='intersection') accomplishes my goal, but it takes around 300 s per firm-year. Given that I have a long list of firms and many years, this would take days to finish. Is there any way to improve the performance?
I notice that if I do the same thing in ArcGIS, the 300 s comes down to a few seconds. But I'm a new ArcGIS user and not familiar with its Python scripting yet.
If you look at the current GeoPandas overlay source code, they've actually updated the overlay function to use R-tree spatial indexing! I don't think doing the R-tree lookup manually would be any faster at this point (it will probably be slower).
See source code here: https://github.com/geopandas/geopandas/blob/master/geopandas/tools/overlay.py
Hopefully you've figured this out by now, but the solution is to use GeoPandas' R-tree spatial index. You can achieve orders-of-magnitude improvements by applying it appropriately.
Geoff Boeing has written an excellent tutorial.
http://geoffboeing.com/2016/10/r-tree-spatial-index-python/
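For illustration, here is a minimal sketch of that pre-filtering idea, assuming two GeoDataFrames loaded from hypothetical file names; the R-tree narrows the candidate counties on bounding boxes before the exact (and expensive) overlay runs:

import geopandas as gpd

counties = gpd.read_file("counties.shp")              # hypothetical inputs
business_areas = gpd.read_file("business_areas.shp")

# Query the R-tree on bounding boxes to collect candidate counties
sindex = counties.sindex
candidate_idx = set()
for geom in business_areas.geometry:
    candidate_idx.update(sindex.intersection(geom.bounds))
candidates = counties.iloc[sorted(candidate_idx)]

# Exact intersection only on the reduced candidate set
result = gpd.overlay(candidates, business_areas, how="intersection")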
I am new to OpenMM and I would appreciate some guidance on the following matter:
Currently I am not interested in running molecular dynamics simulations; for starters, I would just like to compute the forces or free energies between individual pairs of atoms, using OpenMM's AMBER force field for example. Essentially, I would like to end up with a heat map representing forces between atom pairs, where the numbers represent the strength of the force or the value of the free energy.
I have trouble finding out how to access this lower-level functionality of OpenMM, where I could write a custom script that calculates only the desired forces given the 3D coordinates of atoms and their types. In the tutorials I have only found how to run fully-fledged simulations by providing force field data and PDB files of molecular systems.
Preferably I would like to achieve this with Python.
Any concrete example or guidance is much appreciated.
I have found an answer in OpenMM's issue tracker on GitHub.
In short: there is no API to achieve exactly that in OpenMM, as what I am trying to do is not well defined from a purely physical/chemical perspective. My best bet is to compute something that looks like an energy based only on pairwise inter-atom distances, which can be queried from an OpenMM state like this (as suggested in the discussion referenced above):
from openmm.unit import nanometer  # older OpenMM versions: from simtk.unit import nanometer

# Extract the current atomic positions from the simulation context
state = simulation.context.getState(getPositions=True)
positions = state.getPositions(asNumpy=True).value_in_unit(nanometer)
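Building on that, a minimal sketch of the pairwise-distance computation (assuming the positions array from the snippet above and NumPy; what comes out is a distance matrix, not a physically meaningful energy):

import numpy as np

# Pairwise inter-atom distance matrix, shape (n_atoms, n_atoms), in nm
diff = positions[:, None, :] - positions[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))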
I am starting a web app in Django which must provide one simple task: get all records from the DB that are close enough to another record.
For example: I am at lat/lng (50, 10), and I need to get all records with a lat/lng closer than 5 km to me.
I found the Django extension called GeoDjango, but it comes with a lot of other dependencies and libraries, like GEOS, PostGIS, and other stuff I don't really need. I only need this one range functionality.
So should I use GeoDjango, or just write my own range calculation query?
Most definitely do not write your own. As you get more familiar with geographic data, you will realize that this particular calculation isn't at all simple; see, for example, this question for a detailed discussion. Moreover, most of the solutions (answers) given in that question only produce approximate results, partly because the earth is not a perfect sphere.
On the other hand, if you use the geospatial extensions for MySQL (5.7 onwards) or PostgreSQL, you can make use of the ST_DWithin function.
ST_DWithin — Returns true if the geometries are within the specified distance of one another. For geometry, units are those of the spatial reference; for geography, units are meters and measurement defaults to use_spheroid=true (measure around the spheroid). For a faster check, use use_spheroid=false to measure along the sphere.
ST_DWithin makes use of spatial indexes, which home-made solutions cannot. When GeoDjango is enabled, ST_DWithin becomes available as a filter on Django querysets in the form of dwithin.
Last but not least, if you write your own code, you will have to write a lot of code to test it too. Whereas dwithin is thoroughly tested.
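For a sense of what that looks like, here is a minimal sketch using a hypothetical Station model (the model name and field are illustrative, not from the question):

from django.contrib.gis.db import models
from django.contrib.gis.geos import Point
from django.contrib.gis.measure import D

class Station(models.Model):
    # geography=True stores points on the spheroid, so distances are in meters
    location = models.PointField(geography=True)

# All stations within 5 km of lat/lng (50, 10); note Point takes (lng, lat)
origin = Point(10, 50, srid=4326)
nearby = Station.objects.filter(location__dwithin=(origin, D(km=5)))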
I have a text file with about 8.5 million data points in the form:
Company 87178481
Company 893489
Company 2345788
[...]
I want to use Python to create a connection graph to see what the network between companies looks like. From the above sample, two companies share an edge if the value in the second column is the same (clarification for Hooked).
I've been using the NetworkX package and have been able to generate a network for a few thousand points, but it isn't making it through the full 8.5-million-line text file. I ran it and left for about 15 hours, and when I came back the cursor in the shell was still blinking, but there was no output graph.
Is it safe to assume that it was still running? Is there a better/faster/easier approach to graph millions of points?
If you have millions of data points, you'll need some way of looking at the broad picture. Depending on what exactly you are looking for, if you can assign a "distance" between companies (say, the number of connections apart), you can visualize relationships (or clustering) via a dendrogram.
SciPy does clustering:
http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html#module-scipy.cluster.hierarchy
and has a function to turn them into dendrograms for visualization:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html#scipy.cluster.hierarchy.dendrogram
An example for a shortest path distance function via networkx:
http://networkx.lanl.gov/reference/generated/networkx.algorithms.shortest_paths.generic.shortest_path.html#networkx.algorithms.shortest_paths.generic.shortest_path
Ultimately you'll have to decide how you want to weight the distance between two companies (vertices) in your graph.
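As a minimal sketch of that pipeline (with a random stand-in distance matrix; in practice the distances would come from your graph, e.g. shortest paths):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Stand-in condensed distance matrix for 10 companies (10*9/2 pairs)
rng = np.random.default_rng(0)
pairwise = rng.random(45)

Z = linkage(pairwise, method="average")   # hierarchical clustering
dendrogram(Z, labels=[f"co{i}" for i in range(10)])
plt.show()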
You have too many data points; even if you did visualize the network, it wouldn't make any sense. You need ways to 1) reduce the number of companies by removing those that are less important/less connected, and 2) summarize the graph somehow, and only then visualize it.
To reduce the size of the data, it might be better to create the network independently (using your own code to build an edge list of companies, as sketched below). This way you can reduce the size of your graph (by removing singletons, for example, which may be numerous).
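A minimal sketch of that edge-list construction, assuming the two-column format from the question (the file name is hypothetical; companies sharing a second-column value get an edge):

from collections import defaultdict
from itertools import combinations

# Group companies by the shared identifier in column two
groups = defaultdict(set)
with open("companies.txt") as fh:
    for line in fh:
        company, value = line.split()
        groups[value].add(company)

# Every pair of companies sharing an identifier becomes an edge
edges = set()
for members in groups.values():
    edges.update(combinations(sorted(members), 2))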
For summarization, I recommend running a clustering or community-detection algorithm. This can be done very fast, even for very large networks. Use the "fastgreedy" method in the igraph package: http://igraph.sourceforge.net/doc/R/fastgreedy.community.html
(There is a faster algorithm available as well, by Blondel et al.: http://perso.uclouvain.be/vincent.blondel/publications/08BG.pdf. I know their code is available online somewhere.)
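Although the link above is for the R interface, the same method is exposed in python-igraph; a minimal sketch with a toy edge list (hypothetical data, not from the question's file):

import igraph as ig

# Toy edge list standing in for the real company-company edges
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "E")]
g = ig.Graph.TupleList(edges)

# Greedy modularity optimization ("fastgreedy") community detection
communities = g.community_fastgreedy().as_clustering()
print(communities.membership)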
So I have a 2D vector field {u(x, y, t), v(x, y, t)} representing the velocities of an unsteady flow at different instants in time. I don't have an analytical description of the flow, just the two components u and v over time.
I am aware of matplotlib.quiver and the answer to this question, which suggests using it for plotting streamlines.
Now I want to also plot a couple of pathlines and streaklines of the vector field.
Is there any tool capable of doing this (preferably a Python package)? This seems like a common task, but I couldn't find anything and don't want to waste time reinventing the wheel.
Currently, there is no functionality in matplotlib to plot streaklines. However, Tom Flannaghan's streamline plotting utility has been improved and merged into the codebase. It will be available in matplotlib version 1.2, which is to be released in the next few weeks.
At present, your best bet is to solve the streakline ODE in the Wikipedia page you linked to. If you want to use python to do this, you can use scipy.integrate.odeint. This is exactly what matplotlib.axes.streamplot does at present for streamlines.
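As a minimal sketch of that ODE approach for a pathline (with an analytic stand-in velocity field; in practice u and v would be interpolated from your gridded, time-dependent data):

import numpy as np
from scipy.integrate import odeint

# Stand-in unsteady velocity field; replace with interpolation of your data
def velocity(pos, t):
    x, y = pos
    return [-y + 0.1 * np.sin(t), x]

# Pathline: follow one particle released at (1, 0) through time
times = np.linspace(0.0, 10.0, 200)
pathline = odeint(velocity, y0=[1.0, 0.0], t=times)  # shape (200, 2)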
I have many lines of georeferenced hydrological data with weekly resolution:
Station name, Lat, Long, Week 1 average, Week 2 average ... Week 52 average
Unfortunately, I also have some data with only monthly resolution:
Station name, Lat, Long, January average, February average ... December average
Rather than "reinventing the wheel," can anyone recommend a favorite module, package, or technique that would provide a reasonable interpolation of weekly values from monthly values? Linear would be fine, but it would be nice if we could use the coordinates to improve the interpolation based on nearby stations.
I've tagged this post with python because it's the language I've been using recently (although not its statistical functions). If the answer is "use a stats program like R," so be it, but I'm curious what's out there for Python. Thanks!
I haven't had a chance to dig into it, but the hpgl (High Performance Geostatistics Library) provides a number of kriging (geospatial interpolation) methods:
Algorithms
Simple Kriging (SK)
Ordinary Kriging (OK)
Indicator Kriging (IK)
Local Varying Mean Kriging (LVM Kriging)
Simple CoKriging (Markov Models 1 & 2)
Sequential Indicator Simulation (SIS)
Correlogram Local Varying Mean SIS (CLVM SIS)
Local Varying Mean SIS (LVM SIS)
Sequential Gaussian Simulation (SGS)
If you are interested in expanding your experience into R, there are a number of good, well-used, and well-documented packages out there. I would start by looking at the Spatial Task View, which lists which packages can be used for spatial data. One of its paragraphs deals with interpolation. I am most familiar with automap/gstat (I wrote automap); gstat in particular is a powerful geostatistics package which supports a wide range of methods.
http://cran.r-project.org/web/views/Spatial.html
Integrating Python and R can be done in multiple ways, e.g. using system calls or an in-memory link via RPy. See also:
Python interface for R Programming Language
I am looking into doing the same thing, and I found this kriging module written by Sat Kumar Tomer at AMBHAS.
There appear to be methods for producing variograms and performing ordinary kriging.
I'll update this answer if I use this and make further discoveries.
Since I originally posted this question (in 2012!), an actively developed Python kriging module has been released: https://github.com/bsmurphy/PyKrige
There's also this older option:
https://github.com/capaulson/pyKriging
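For reference, a minimal ordinary-kriging sketch with PyKrige (toy station coordinates and values, purely for illustration):

import numpy as np
from pykrige.ok import OrdinaryKriging

# Toy station data: longitude, latitude, and a monthly average value
lon = np.array([10.1, 10.5, 10.9, 11.2])
lat = np.array([50.0, 50.3, 50.1, 50.6])
val = np.array([3.2, 4.1, 2.8, 3.9])

ok = OrdinaryKriging(lon, lat, val, variogram_model="spherical")
grid_lon = np.linspace(10.0, 11.5, 30)
grid_lat = np.linspace(49.9, 50.7, 30)
z, ss = ok.execute("grid", grid_lon, grid_lat)  # kriged field + variance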