Big Data at Work

The Situation

At work we are concerned about the amount of data that our next project will generate. Once the system is up with a few thousand customers, we estimate that it could take in 10 GB of data per day. That volume is beyond our experience, and we are looking for new systems to help us deal with it.
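As a back-of-envelope check on that estimate, here is a minimal sketch of how the raw volume adds up over time. The only number taken from our estimate is the 10 GB per day. The function names, the time horizons, and the replica count of 3 are assumptions for illustration, and the math ignores compression, indexes, and growth in the customer base.

```python
GB_PER_DAY = 10  # the daily estimate quoted above


def projected_storage_gb(days, gb_per_day=GB_PER_DAY, replicas=1):
    """Raw storage after `days` of ingest, ignoring compression and indexes.

    `replicas` is an assumption: the number of full copies kept, for
    example one per node in a cluster where every node holds all data.
    """
    return days * gb_per_day * replicas


def average_ingest_kb_per_sec(gb_per_day=GB_PER_DAY):
    """Average sustained write rate, using decimal units (1 GB = 1,000,000 KB)."""
    seconds_per_day = 24 * 60 * 60
    return gb_per_day * 1_000_000 / seconds_per_day


if __name__ == "__main__":
    # Hypothetical horizons, chosen only to show the trend.
    for label, days in [("30 days", 30), ("1 year", 365), ("3 years", 3 * 365)]:
        single = projected_storage_gb(days)
        tripled = projected_storage_gb(days, replicas=3)
        print(f"{label}: {single} GB raw, {tripled} GB with 3 full copies")
    print(f"average ingest: {average_ingest_kb_per_sec():.1f} KB/s")
```

At 10 GB per day, one year comes to 3,650 GB of raw data, or about 3.65 TB, before any replication. The average write rate is modest; it is the accumulated total that is large.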

My coworker has been charged with deciding on the database system to use to store the bulk of the data that will be coming in. It is all "log" style data: thousands of devices will report in and we will store the data for later aggregation. Once the data is reported it becomes historical and will not be changed.
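To make the shape of that workload concrete, here is a minimal sketch of an append-only log table with a later aggregation pass. It uses Python's built-in sqlite3 module purely as a stand-in; it is not one of the systems we are evaluating. The table name, column names, and the daily-average query are all hypothetical, since the real record format is not described here.

```python
import sqlite3


def open_log_store(path=":memory:"):
    """Open a store with one insert-only table (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS device_log (
            device_id   TEXT NOT NULL,
            reported_at TEXT NOT NULL,  -- ISO 8601 timestamp
            metric      TEXT NOT NULL,
            value       REAL NOT NULL
        )
        """
    )
    # Index chosen for per-device, time-ordered reads.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_log_device_time "
        "ON device_log (device_id, reported_at)"
    )
    return conn


def record(conn, device_id, reported_at, metric, value):
    """Append one report. Rows are never updated or deleted afterwards."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO device_log VALUES (?, ?, ?, ?)",
            (device_id, reported_at, metric, value),
        )


def daily_average(conn, metric):
    """Aggregate historical rows: average value per calendar day."""
    rows = conn.execute(
        "SELECT substr(reported_at, 1, 10) AS day, AVG(value) "
        "FROM device_log WHERE metric = ? GROUP BY day ORDER BY day",
        (metric,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_log_store()
    record(conn, "dev-1", "2024-01-01T00:00:00", "temp", 10.0)
    record(conn, "dev-2", "2024-01-01T12:00:00", "temp", 20.0)
    record(conn, "dev-1", "2024-01-02T00:00:00", "temp", 30.0)
    print(daily_average(conn, "temp"))
```

Because rows are only ever inserted and then read in bulk, the workload is write-heavy with large sequential scans for aggregation, which is the property any candidate database has to handle well.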

Hadoop

The Hadoop name is very closely associated with the "big data" world and my coworker has spent a lot of time diving into it. However, it's a diverse ecosystem and there are many systems and subsystems that come into play. Mostly he has focused on the HBase system to organize our data in a column-oriented database.

There have been two major problems with HBase. The first is conceptual. Neither he nor anyone else on the team has experience with column-oriented databases. We all have strong backgrounds in relational databases with tables and schemas. HBase represents a different way of thinking, and while we have ideas on how to use it in our project, we don't know if we are making good decisions. Additionally, without operational experience, we have had a lot of difficulty just getting the system installed and working. There are many vendors and packaged versions of Hadoop + HBase with monitoring and management consoles, and each one has frustrated us in a different way when we tried to get it running. It has been slow going and the project needs to move forward.

The other problem, possibly related to the first, is performance. Once HBase was set up and filled with some test data for us to play with, the time to get results was unacceptable. Just doing the equivalent of "SELECT * FROM TABLE" took 20 seconds on a table with about 800 rows. While we are not experts, it's hard to imagine us setting up a system so unoptimized that a simple operation could take so long.

Other Systems

While considering the HBase system he also looked at other NoSQL options. Splice Machine looked promising but is very new, and we fear it is not battle-tested yet. Other systems had various shortcomings in how data is stored or retrieved.

Apache Cassandra was evaluated briefly but rejected. Cassandra has an "eventual consistency" model that my coworker found very suspicious. He also objected that a Cassandra table is a single column family, while an HBase table can have several. Personally I feel Cassandra should be given another look, but it still brings with it a steep learning curve and a mental shift to NoSQL ideas.

Big Relational

Our current front-runner is MySQL in the form of a Galera cluster of MariaDB instances. We hope this will allow us to scale our current understanding of databases to deal with the increase in data. If we can find a good criterion for sharding our data, we will then use Galera to provide robustness and high uptime.
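One candidate sharding criterion is the reporting device itself. The following is a minimal sketch under that assumption, not a decision we have made: the shard count, the device ID format, and the function names are all hypothetical. It uses a stable hash from hashlib rather than Python's built-in hash(), which is randomized per process for strings and so would route the same device differently after a restart.

```python
import hashlib
from collections import Counter

SHARD_COUNT = 4  # hypothetical number of shards


def shard_for_device(device_id, shard_count=SHARD_COUNT):
    """Map a device ID to a shard index in the range [0, shard_count).

    SHA-256 gives the same result in every process and on every machine,
    so a device's reports always land on the same shard.
    """
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def shard_distribution(device_ids, shard_count=SHARD_COUNT):
    """Count how many devices fall on each shard, to eyeball the balance."""
    return Counter(shard_for_device(d, shard_count) for d in device_ids)


if __name__ == "__main__":
    # Made-up device IDs, only to show how evenly the hash spreads them.
    devices = [f"device-{n:05d}" for n in range(10_000)]
    for shard, count in sorted(shard_distribution(devices).items()):
        print(f"shard {shard}: {count} devices")
```

Sharding by device keeps each device's history together, which suits per-device aggregation, but any query that spans all devices has to visit every shard. A plain modulo scheme like this also moves most devices to a different shard if the shard count changes, so the count would need to be chosen with growth in mind.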

Never Enough Time

The project we are working on has a pretty strict deadline. Our team isn't known for hitting deadlines and I worry that we won't have enough time to really vet a database solution. The database that we choose will be the backbone of the whole system, so it's very important to get it right. Even after meeting with multiple NoSQL vendors like Hortonworks and Cloudera, we still do not feel confident using a NoSQL system. However, I'm not convinced that sticking with MySQL is a real win either.

The search continues.
