Simple and efficient coupling of Hadoop with a database engine

Lu, Jiamin GND; Güting, Ralf Hartmut GND

The growing need of processing massive amounts of data leads database researchers to explore the possibility of combining their existing single-computer database systems with Hadoop. Such hybrid systems not only can keep the efficiency of database processing, but also achieve a remarkable scalability. However, current Hadoop extensions usually rely on Hadoop to shuffle intermediate data, in order to get a balanced workload assignment on cluster nodes. During the process, data in databases have to be transformed to key-value pairs and delivered to HDFS, hence can be processed in Hadoop jobs. These overheads cause a performance degradation of Hadoop hybrid systems. In this paper, we propose a novel method to couple Hadoop and databases at the engine level rather than the SQL level, introducing query processing operators to interact with Hadoop. Distributed single-computer databases instead of Hadoop perform the task of shuffling intermediate data. Therefore, all data are exchanged between database engines directly, to reduce the unnecessary transform and transfer overhead. Query plans in the DBMS can be written to include the construction of Hadoop jobs, so that the user can write parallel queries in the DBMS without learning too many details of the MapReduce paradigm. As a prototype of this method, Parallel S ECONDO is demonstrated in this paper. It is fully evaluated with two parallel join methods, HDJ and SDJ, which respectively rely on Hadoop and S ECONDO to shuffle data. The evaluation is not only made in our own small-scale cluster, but also in large-scale clusters consisting of hundreds of instances rented from the Amazon EC2 Cloud Service. In addition, Parallel S ECONDO inherits the extensibility of S ECONDO , it can also process specialized database queries on spatial and spatio-temporal data (or moving objects). As a result, we obtain for the first time a highly scalable generic system for moving objects management.




Lu, Jiamin / Güting, Ralf: Simple and efficient coupling of Hadoop with a database engine. Hagen 2012. FernUniversität in Hagen.


12 Monate:

Grafik öffnen


Nutzung und Vervielfältigung:
Alle Rechte vorbehalten


powered by MyCoRe