I’ve been in the HPC (≈cloud ≈scalable software) field for almost a decade. During these joyful years, the family of problems that keeps popping back up the most is the one that depends heavily on geographic data as its main input (not that there’s anything wrong with that).
One would expect these kinds of problems to be solved by now, but the main source of (relatively) reliable GPS data today, in terms of volume, is smartphones, and those weren’t really around a decade ago.
Knowing that, it’s easy to understand why the existing solutions were mostly confined to universities and research centers until very recently. However, it does not explain why programmers and engineers all around the world refuse to use Google, and just shovel all their geo-points into some SQL-based monstrosity and call it “Big Data”, as if saying “this shit is hard” makes it easier.
So, what SHOULD we do when using geographic data to perform some critical operation in our system? Well… nice of you to ask. As it turns out (from like, two paragraphs ago), there has been some research done on the subject, and there is a lot that can be done.
Here are some examples:
Forget the DB
There are several data structures you could use to store geo data more efficiently, depending on the intended use of that data. I especially like the R-Tree: it is simple to understand and manipulate, and will do the job very well without growing huge. We actually used an R-Tree in a project before, with great results. BTW, as it turns out, I’m not the only one who likes it – it is the data structure behind GEO-Redis.
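To make the idea concrete, here is a minimal sketch of the R-Tree principle: group nearby points into leaves, keep a bounding rectangle per leaf, and only look inside leaves whose rectangle overlaps the query. This is a toy bulk-loaded, single-level version for illustration (the names `build_rtree`, `Leaf`, and `query` are mine, not from any library); a real R-Tree nests bounding boxes recursively and handles inserts and splits.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)

@dataclass
class Leaf:
    bbox: Rect
    points: List[Point]

def build_rtree(points: List[Point], leaf_size: int = 4) -> List[Leaf]:
    # Bulk-load: sort by coordinate, slice into fixed-size leaves,
    # and record each leaf's bounding rectangle.
    pts = sorted(points)
    leaves = []
    for i in range(0, len(pts), leaf_size):
        chunk = pts[i:i + leaf_size]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        leaves.append(Leaf((min(xs), min(ys), max(xs), max(ys)), chunk))
    return leaves

def query(leaves: List[Leaf], rect: Rect) -> List[Point]:
    # Only descend into leaves whose bbox overlaps the query rectangle;
    # non-overlapping leaves are skipped entirely.
    qx0, qy0, qx1, qy1 = rect
    hits = []
    for leaf in leaves:
        bx0, by0, bx1, by1 = leaf.bbox
        if bx0 <= qx1 and bx1 >= qx0 and by0 <= qy1 and by1 >= qy0:
            hits.extend(p for p in leaf.points
                        if qx0 <= p[0] <= qx1 and qy0 <= p[1] <= qy1)
    return hits
```

The payoff is that a range query touches only a handful of leaves instead of scanning every point – the same pruning trick that makes production R-Trees (and spatial indexes in general) fast.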
It’s just a Tuple
Don’t get too hung up on the geographic data. As I explained before, geo data isn’t always as sensitive as you think: you can tinker with it, make assumptions, and even do some very rudimentary arithmetic with it to get big savings on storage and compute resources. So feel free to break it down, to average it, or to ignore it altogether – as long as the users get what they need.
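Two examples of the kind of “rudimentary arithmetic” I mean – both treat a coordinate as just a tuple of floats. The helper names (`snap`, `centroid`) and the choice of four decimal places are mine; pick whatever precision your use case tolerates.

```python
def snap(lat: float, lon: float, places: int = 4) -> tuple:
    # Rounding to 4 decimal places is roughly 11 m of precision at the
    # equator - often plenty, and it collapses near-duplicate fixes
    # into identical tuples you can dedupe or count.
    return (round(lat, places), round(lon, places))

def centroid(points: list) -> tuple:
    # Plain averaging: fine for points that are close together.
    # (For points spread across the globe you'd want proper
    # spherical math, but for a local cluster this is enough.)
    lats, lons = zip(*points)
    return (sum(lats) / len(lats), sum(lons) / len(lons))
```

Replacing a thousand raw fixes with one snapped centroid is exactly the kind of “break it down, average it” move that cuts storage without the users ever noticing.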
stream stream Stream
Geographic data loves to be streamed – to flow through the system without ever being stored. This is perfect for a lot of use cases, and some light caching can add even more use cases into the mix (did someone say Redis?). Finding yourself in a Big Data type situation? Try and see if looking at shorter time frames (seconds, minutes, even hours) instead of “forever” minimizes the data handling without hurting accuracy too much. If so – your system is a “Fast Data” (= streaming) creature, not a Big Data one.
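A minimal sketch of that “shorter time frames” idea: keep only the last N seconds of GPS fixes in memory and evict everything older as new fixes arrive. The class name `WindowedStream` and its API are illustrative, not from any particular streaming framework.

```python
from collections import deque

class WindowedStream:
    """Keep only the last `window_seconds` of GPS fixes.

    Memory stays bounded by the window, no matter how long
    the stream runs - nothing is stored "forever".
    """

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.fixes = deque()  # (timestamp, lat, lon), oldest first

    def push(self, ts: float, lat: float, lon: float) -> None:
        self.fixes.append((ts, lat, lon))
        # Evict fixes that fell out of the time window.
        while self.fixes and self.fixes[0][0] < ts - self.window:
            self.fixes.popleft()

    def count(self) -> int:
        return len(self.fixes)
```

If your queries only ever care about the last few minutes, this bounded window is the whole “database” – which is the tell that you have a Fast Data problem, not a Big Data one.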
These are only the three tips I could muster in 10 minutes, and without knowing your problem. With some ingenuity and some Google-type know-how you could, with very little work, improve your scale, performance, and code readability, just by using tools that were designed for the problem you are trying to solve.
Also! We now offer pro-bono (= FREE) consulting sessions. These 60-minute sessions will help you see what actions you could take to drastically improve performance with minimal code changes. To book your session, drop me a line at adam (at) tamarlabs (dot) com