Monday, November 10, 2014

Time Series Database

I've been thinking about the best way to store the various sensor data that'll be generated, in a way that's scalable, easy to visualize, and easy to analyze.  In reality, the amount of data generated by a handful of sensors scattered around the house will be pretty small, and whatever I end up building or using will be complete overkill...  but it's always fun thinking about big infrastructure to solve minor problems. :-)

Sensor data has some unusual attributes.  First, each piece of data has a time associated with it, which needs to be captured and stored.  In a sensor data stream, each reading is also completely independent of the others.  This means that in terms of consistency, even if you lose a data point or ten, it's still OK.  Another interesting attribute is that once a data point has been generated, it's completely immutable.  There's no need to ever go back to a previous data point to update some value.  And for analysis, you'll never be looking at an individual data point, but rather at a series of data points over time.

A tool that I instinctively reach for is a relational database - MySQL, SQLite, whatever - it's flexible, easy to use, and most importantly, I know how to use it.  But in this case, it just seemed ill-suited to me.  The sensor data has enough unusual attributes that even just trying to map the data to a database schema in an efficient way ran into roadblocks.  A typical analysis might be to find the minimum or maximum value in a sensor stream - so do I need to create an index on the data values as well as ones on the timestamp and sensor ID?
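To make the schema question concrete, here's a rough sketch of one possible mapping using Python's built-in sqlite3 module.  The table name, column names, sensor IDs, and sample readings are all made up for illustration, and this is just one way to lay it out, not a recommendation.

```python
import sqlite3

# In-memory database, just for the sketch.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per reading. The composite primary key
# gives an index on (sensor_id, ts), so no separate index is declared.
conn.execute(
    """
    CREATE TABLE readings (
        sensor_id TEXT    NOT NULL,
        ts        INTEGER NOT NULL,  -- Unix time in seconds
        value     REAL    NOT NULL,
        PRIMARY KEY (sensor_id, ts)
    )
    """
)

# Made-up sample data.
rows = [
    ("attic_temp", 1415577600, 12.5),
    ("attic_temp", 1415577660, 12.7),
    ("attic_temp", 1415577720, 12.1),
    ("porch_temp", 1415577600, 4.0),
]
# Readings are immutable, so a duplicate (sensor_id, ts) is simply ignored.
conn.executemany("INSERT OR IGNORE INTO readings VALUES (?, ?, ?)", rows)


def min_max(conn, sensor_id, start, end):
    """Return (min, max) of one sensor's values with start <= ts <= end."""
    return conn.execute(
        "SELECT MIN(value), MAX(value) FROM readings "
        "WHERE sensor_id = ? AND ts BETWEEN ? AND ?",
        (sensor_id, start, end),
    ).fetchone()


result = min_max(conn, "attic_temp", 1415577600, 1415577720)
print(result)
```

With this layout, the key on (sensor_id, ts) narrows the query down to one sensor's time window, and the minimum and maximum are computed over just the rows in that window.  Whether a separate index on the values would ever pay off probably depends on how big those windows get.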

The other way to move forward was to just store it in a flat file of some sort, which would work just fine.  However, one of the things I wanted to do is create a way to replicate and sync across different stores of data - so that the system can be decentralized and resilient in the face of network disruptions (like the satellite internet link going down in a storm).  The fact that the data points are independent and immutable makes this easy, but with a flat file, scenarios like a data point from the past showing up after newer data has been written and flushed to the file become annoying to deal with (unless, of course, I just drop that old data point).
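Here's a minimal sketch of why independent, immutable readings make syncing easy: merging two stores is just a set union keyed on sensor and timestamp, and a late-arriving old data point slots into place when the result is sorted.  The Reading type, the merge_stores function, and the sample values are all hypothetical, and the sketch assumes a sensor never reports two different values for the same timestamp.

```python
from typing import NamedTuple


class Reading(NamedTuple):
    """One immutable data point (hypothetical layout)."""
    sensor_id: str
    ts: int        # Unix time in seconds
    value: float


def merge_stores(a, b):
    """Union two lists of readings, keyed on (sensor_id, ts).

    Because readings never change after they are created, the same key
    appearing in both stores is a duplicate and either copy can be kept.
    The result is sorted by time, so old points that arrive late end up
    in the right place.
    """
    merged = {(r.sensor_id, r.ts): r for r in a}
    for r in b:
        merged.setdefault((r.sensor_id, r.ts), r)
    return sorted(merged.values(), key=lambda r: (r.ts, r.sensor_id))


# Two stores that drifted apart while the link was down (made-up data).
house = [Reading("attic_temp", 100, 12.5), Reading("attic_temp", 300, 12.1)]
barn = [Reading("attic_temp", 200, 12.7), Reading("attic_temp", 300, 12.1)]

synced = merge_stores(house, barn)
for r in synced:
    print(r)
```

The merge gives the same result no matter which store goes first or how many times it's repeated, which is what lets each store sync whenever the link happens to be up.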

If pretty graphs are the goal, something like RRDtool might be exactly what's needed here - I've spent enough time working online to have seen plenty of MRTG graphs, and with RRDtool, you have a fixed data set size regardless of how long you capture the data for.  But this means some of the data is being thrown away, and in this age of people having a terabyte hard drive full of cat pictures, it seems... wasteful to throw any valuable data away!
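The fixed-size trade-off can be illustrated with a toy ring buffer.  To be clear, this is not how RRDtool is implemented (it also consolidates older data into coarser archives); it's just a sketch of the basic idea using Python's collections.deque, with a made-up capacity and made-up readings.

```python
from collections import deque

# A fixed-capacity archive: once it is full, each new reading pushes
# the oldest one out. The capacity of 3 is arbitrary.
archive = deque(maxlen=3)

# Made-up (timestamp, value) readings.
for ts, value in [(100, 12.5), (200, 12.7), (300, 12.1), (400, 11.8), (500, 11.6)]:
    archive.append((ts, value))

kept = list(archive)
print(kept)
```

Storage never grows past three entries, but the readings at timestamps 100 and 200 are gone for good.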

Googling around to see what others have done, I stumbled across the Time Series Database Wikipedia page!  Now that I think about it, things like stock ticks and interest rates over time behave in a very similar way to sensor data.  There's a name for the problem that I'm trying to solve, though these people are trying to solve it on a much bigger scale. :-)

Wikipedia-surfing from the Time Series Database page, I ended up with two open source packages that try to solve the IoT data storage problem - nimbits and OpenTSDB.  Both seem to be Java services that expose a REST API front-end for devices to report to, somewhat like what the Sparkfun folks are doing at data.sparkfun.com with Phant.

After half a day of clicking around and reading documentation and papers, it feels like I'm no further along than where I started...  But at least I know the name of what it is that I'm looking for now! :-)
