Having to save a lot of geo-spatial data is a task rarely one passes without being scared for one’s life. Whether you are using a DB like postGIS, an in-memory data store like geoRedis or some other tool, handling geo-spatial data usually mean using RTree. Using RTree directly can ave some very positive effects on your system performance, so we thought we’ll let you in on this industry secret.
What is RTree, you ask? It is a N-dimensional tree data structure for storing and indexing geometrical spaces. The documentation (which seems to be available only for v0.7) gives example for typical commands. Just to have a more visual example, here’s what Wikipedia has to demonstrate this useful data structure:
Today we will use it’s Python variant, just for the kicks. This Python package (v0.8.2 at the time of writing) is wrapping libspatialindex (written in C), providing a simple-to-use-not-so-simple-to-optimize RTree. Not perfect bud good enough for most needs – we handled several millions of requests per second on a single machine with this, so you know it’s just OK.
Yeah Yeah, now how would I use it?
Suppose you have a big geo-spatial DB and you’re worried about request latency. Sure, there are some fine solutions out there, with caching and parallelization, but what if you want some small custom query, done on a small subset of items? well – look no further than python RTree.
Here are some typical examples for integrating RTree, possibly along-side with your fancy geo-spatial DB thingy (postGIS anyone?):
- Match-making: if you want to match two (or more) items based on location, buyers to sellers, questions to answers, boys to girls – a set of two (or more) RTrees fits like a glove. Each type of request compared against the other’s RTree, and if a match is not found – put into its own RTree for future matching.
- spatial-based load-balancing: if you had several servers, each handling a different area, you could use RTree to decide which server to defer the request to. Furthermore, you could even add servers and change the area partitioning in flight, by changing the RTree (thread-safety caution is advised).
- Manual caching: if you want to cache items not by time of insert. You can have one or more RTree(s) by different criteria, to check against upon every incoming request (even in parallel!) before giving up to a sluggish DB with a complex query.
how fast is it?
Well, pretty darn fast, thanks for asking! We ran an incoming queue (using kombu) and were able to handle millions of request per second – RTree
intersection() call per-request on a tree with approx. 10K items. Keep in mind that the real performance comes from the C library, and the wrapper just makes it easy for Python coders to use.
Pitfalls & CAVEATS?
Well, a nasty (and of course, undocumented) bug in tree initialization with Index.bounds, for starters. It’s supposed to give you coordinates capturing all the items in the tree, but for some reason when the tree is almost empty (single node, I assume) the coordinates have X and Y swapped.
Another one, documented this time, is the
objects="raw" keyword for enumerating functions (
nearest()) which saves some time on python-wrapping actions.
The GEOREDIS DISCUSSION
As we mentioned in the beginning of this post, RTree is used be many other tools we use daily, including Redis. Moving from an in-code RTree may save you a lot of headache but due to the BETA nature of the geo-spatial part of Redis, and the fact that you will be unable to tweak performance by hand, using RTree in an end product may be inadvisable. Just make sure you do your math.
oh, You’ll need this
When using RTree it’s very helpful to know how to convert from polar to Cartesian coordinates in Python? (in plain English: I have (X,Y) and I want a rectangle of Z meters around it)
from geographiclib import geodesic def polar2cartezian(latitude, longitude, radius): if radius is None: return (latitude, longitude, latitude, longitude) lat1 = geodesic.Geodesic.WGS84.Direct(latitude, longitude, 0, radius)["lat2"] lon1 = geodesic.Geodesic.WGS84.Direct(latitude, longitude, 90, radius)["lon2"] lat2 = geodesic.Geodesic.WGS84.Direct(latitude, longitude, 180, radius)["lat2"] lon2 = geodesic.Geodesic.WGS84.Direct(latitude, longitude, 270, radius)["lon2"] return (min(lat1,lat2), min(lon1,lon2), max(lat1,lat2), max(lon1,lon2))