Parallel SECONDO: processing moving objects data at large scale
In recent years, with the spread of portable positioning devices such as smartphones and vehicle navigators, it has become simpler to generate and collect end users' continuous position information (termed moving objects data) in order to support various location-based services. Against this background, our group's SECONDO system was developed. It is designed as an extensible database system, providing a large number of data types and algorithms to represent and efficiently process moving objects on top of static geographical information (termed spatial data). However, like many other standalone databases, SECONDO faces challenges from Big Data, since it was developed as a single-computer system and its capability is restricted by the resources of the underlying computer. Many parallel processing platforms, such as Hadoop, have been developed for analyzing massive data on computer clusters. However, they usually put more weight on efficiency and scalability and less on processing specialized data types. In order to scale SECONDO's capability to a cluster of computers, this Ph.D. project proposes a hybrid system combining the Hadoop platform and SECONDO databases, taking the best technologies from both sides. This new system is named Parallel SECONDO. In this dissertation, the following issues concerning this novel system are studied. (1) A hybrid structure is established that combines Hadoop and SECONDO to achieve the best performance. Specifically, a native store mechanism is developed to minimize the data migration overhead between them. (2) A parallel data model is proposed that lets end users state their queries in the SECONDO executable language, freeing them from the low-level and rigid programming model of Hadoop.
It also enables Parallel SECONDO to inherit most existing SECONDO data types and operations, so that any heavy sequential query can easily be converted into corresponding parallel statements. As an example, a join method named PBSM is used extensively in this thesis. It can process join operations on both spatial and moving objects data. Several variants of it are also proposed, which use different distributed file systems to shuffle the intermediate results in order to achieve the best performance. All these variants can be expressed as SECONDO queries with slight adjustments, demonstrating the flexibility of the parallel data model. (3) Parallel SECONDO is evaluated not only on our small private cluster, but also on large clusters consisting of hundreds of virtual computers provided by AWS (Amazon Web Services). Across these environments of different scales, Parallel SECONDO maintains stable speed-up and scale-up performance, showing the scalability it gains from being built on the Hadoop platform. (4) Regarding the special storage for spatial and moving objects data, a set of optimization techniques is also developed to improve data access in the cluster environment. Furthermore, we intend Parallel SECONDO to be a user-friendly system. A set of auxiliary tools is developed to easily deploy and manage the system on large-scale clusters. Two virtual machine images are also provided, so that end users can become familiar with the system immediately and use it to address their own problems. The graphical user interface of SECONDO is also inherited, so query results can be displayed as animations.
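To make the PBSM join mentioned above more concrete, the following is a minimal single-process sketch in Python of the general partition-based spatial-merge idea: both inputs are partitioned on a uniform grid, candidates are joined cell by cell, and a reference point is used so that each intersecting pair is reported exactly once. It is an illustration under stated assumptions, not the implementation used in Parallel SECONDO: objects are reduced to axis-aligned bounding rectangles, the per-cell join is a plain nested loop rather than a plane sweep, nothing is distributed over Hadoop, and the names `cells_for`, `partition`, `pbsm_join` and the parameter `cell_size` are hypothetical.

```python
from collections import defaultdict

# A rectangle is (xmin, ymin, xmax, ymax); this representation is an assumption.


def cells_for(rect, cell_size):
    """Return every grid cell (cx, cy) that the rectangle overlaps."""
    xmin, ymin, xmax, ymax = rect
    cx0, cy0 = int(xmin // cell_size), int(ymin // cell_size)
    cx1, cy1 = int(xmax // cell_size), int(ymax // cell_size)
    return [(cx, cy) for cx in range(cx0, cx1 + 1) for cy in range(cy0, cy1 + 1)]


def intersects(a, b):
    """True if two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def partition(rects, cell_size):
    """Map each grid cell to the ids of the rectangles overlapping it.

    A rectangle spanning several cells is replicated into each of them.
    """
    grid = defaultdict(list)
    for rid, rect in rects.items():
        for cell in cells_for(rect, cell_size):
            grid[cell].append(rid)
    return grid


def pbsm_join(left, right, cell_size):
    """Return sorted (left_id, right_id) pairs whose rectangles intersect."""
    left_grid = partition(left, cell_size)
    right_grid = partition(right, cell_size)
    result = []
    for cell, left_ids in left_grid.items():
        for lid in left_ids:
            for rid in right_grid.get(cell, []):
                a, b = left[lid], right[rid]
                if not intersects(a, b):
                    continue
                # Reference point: lower-left corner of the intersection.
                # Only the cell containing it reports the pair, which
                # removes duplicates caused by replication.
                ref_cell = (int(max(a[0], b[0]) // cell_size),
                            int(max(a[1], b[1]) // cell_size))
                if ref_cell == cell:
                    result.append((lid, rid))
    return sorted(result)


if __name__ == "__main__":
    left = {"a": (0, 0, 5, 5), "b": (10, 10, 11, 11)}
    right = {"x": (1, 1, 6, 6), "y": (20, 20, 21, 21)}
    print(pbsm_join(left, right, cell_size=2))
```

In a cluster setting each grid cell (or group of cells) would be handled by a separate worker, which is where the shuffling of intermediate results described above comes in; this sketch keeps everything in one process for clarity.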
Use and reproduction:
All rights reserved