Just a Theory

Trans rights are human rights

Posts about Hadoop

Whither Impala Fault Tolerance?

Justin Erickson on the Cloudera Blog

In December 2012, while Cloudera Impala was still in its beta phase, we provided a roadmap for planned functionality in the production release. In the same spirit of keeping Impala users, customers, and enthusiasts well informed, this post provides an updated roadmap for upcoming releases later this year and in early 2014.

Impala is a pretty nice-looking SQLish query engine that runs on Hadoop. It provides the same basic interface as Hive, but circumvents MapReduce to access data directly from data nodes in parallel. But, in my opinion, to be useful as a real-time query engine, it needs fault tolerance. From the Impala FAQ, under the list of unsupported features:

Fault tolerance for running queries (not currently). In the current release, Impala aborts the query if any host on which the query is executing fails. In the future, we will consider adding fault tolerance to Impala, so that a running query would complete even in the presence of host failures.

Sounds like an unfortunate issue. Since it’s pretty typical for data nodes to go down, this seems like an essential feature. Products like CitusDB offer fault tolerance:

Does CitusDB recover from failures?

Yes. The CitusDB master node intelligently re-routes the work on any failed nodes to the remaining nodes in real-time. Since the underlying data are kept in fixed-size blocks in HDFS, a failed node’s work can evenly be distributed among the remaining nodes in the cluster.

That sound exactly right. I’m excited about Citus, and if it adds solid support for more data formats, such as ORC and Parquet, it may well be the way to go. But Impala will be a nice alternative if it can get fault tolerance figured out. I’m disappointed it’s not on the road map.