GeoJinni

Primary tabs

GeoJinni (formerly known as SpatialHadoop) is a comprehensive extension to Hadoop that allows efficient processing of spatial data. It injects spatial awareness in the different layers and components of Hadoop to make it more suitable and more efficient to store and process bug spatial data. More specifically, it modifies the storage layer, MapReduce layer and adds a new operations layer.

In the lowest level, it adds new data types for spatial data that can be used as keys or values in a MapReduce program. It also adds parsers and writers for interaction with files containing spatial data. Unlike traditional Hadoop where data in files are unorganized, GeoJinni provides efficient spatial indexes which are organized in two layers, global index and local index. The global index partitions data across different machines whereas local indexes organizes data inside each machine. This design is utilized to build three different indexes in GeoJinni, namely, Grid File, R-tree and R+-tree. All these indexes can be constructed upon user request and stored in the Hadoop Distributed File System (HDFS).

To allow MapReduce programs to use the constructed indexes, we add two new components to the MapReduce layer, namely, SpatialInputFormat and SpatialRecordReader. The SpatialInputFormat utilizes the global index by early pruning file partitions that are outside query range. The developer can choose which partitions to load based on a spatial FILTER function which is defined in the MapReduce program the same way the map and reduce functions are defined. The SpatialRecordReader utilizes the local index by allows the map function to choose only a subset of records to process instead of processing all records in a block.

The new components in the MapReduce layer allows GeoJinni to run many spatial operations efficiently by utilizing the underlying indexes. The operations layer encapsulates all the spatial operations to allow end users to access them quickly and efficiently. Current, GeoJinni has a dozen of operations including range query, k nearest neighbors, spatial join, convex hull, polygon union, closest pair, farthest pair, and skyline. Developers can add more operations by utilizing the new components added to GeoJinni.

GeoJinni can be installed as an extension to an existing Hadoop cluster. This means that you can run it without the need to give up your current configuration or distribution of Hadoop. This makes it portable to run with a wide range of Hadoop distributions including Apache Hadoop, Cloudera, and Hortonworks.