GeoJinni is concerned with efficient analysis of big spatial data. Like Hadoop, GeoJinni focuses on batch processing of large data analysis tasks. Interactive queries that must be answered in milliseconds, and analyses of small datasets, are not meant to run in GeoJinni. Examples of operations include spatial join, data partitioning, and large-scale image generation. GeoJinni should also interface with other Hadoop tools and projects such as Hive, Pig, and HBase. The indexes built in GeoJinni should be accessible to all these tools through suitable APIs so that they reach a wide range of users.
GeoJinni is a comprehensive extension to Hadoop that allows efficient processing of spatial data. It injects spatial awareness into the different layers and components of Hadoop to make it more suitable and more efficient for storing and processing big spatial data. More specifically, it modifies the storage layer and the MapReduce layer, and adds a new operations layer.
At the lowest level, it adds new data types for spatial data that can be used as keys or values in a MapReduce program. It also adds parsers and writers for interacting with files that contain spatial data. Unlike traditional Hadoop, where data in files is unorganized, GeoJinni provides efficient spatial indexes organized in two layers: a global index and local indexes. The global index partitions data across different machines, whereas the local indexes organize data inside each machine. This design is used to build three different indexes in GeoJinni, namely, Grid File, R-tree, and R+-tree. All these indexes can be constructed upon user request and stored in the Hadoop Distributed File System (HDFS).
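The two-layer layout can be illustrated with a small standalone sketch. It is written in Python for brevity and is not GeoJinni code: the function names `grid_cell` and `build_two_level_index` are hypothetical, a uniform grid stands in for the global index, and a sorted list stands in for a real local index such as an R-tree.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Cell = Tuple[int, int]                    # (column, row)


def grid_cell(p: Point, space: Rect, cols: int, rows: int) -> Cell:
    """Return the uniform grid cell of `space` that contains point `p`."""
    x1, y1, x2, y2 = space
    # Points on the upper boundary are clamped into the last column/row.
    col = min(int((p[0] - x1) / (x2 - x1) * cols), cols - 1)
    row = min(int((p[1] - y1) / (y2 - y1) * rows), rows - 1)
    return col, row


def build_two_level_index(points: List[Point], space: Rect,
                          cols: int, rows: int) -> Dict[Cell, List[Point]]:
    """Global level: assign each point to a grid cell (one partition per cell).
    Local level: keep each partition sorted (a stand-in for a local index)."""
    partitions: Dict[Cell, List[Point]] = defaultdict(list)
    for p in points:
        partitions[grid_cell(p, space, cols, rows)].append(p)
    return {cell: sorted(pts) for cell, pts in partitions.items()}
```

In this sketch the dictionary keys play the role of the global index (which partition holds which region), and the ordering inside each list plays the role of the local index.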
To allow MapReduce programs to use the constructed indexes, we add two new components to the MapReduce layer, namely, SpatialInputFormat and SpatialRecordReader. The SpatialInputFormat uses the global index to prune, at an early stage, file partitions that lie outside the query range. The developer chooses which partitions to load through a spatial filter function, which is defined in the MapReduce program in the same way the map and reduce functions are defined. The SpatialRecordReader uses the local index to let the map function process only a subset of the records in a block instead of all of them.
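The pruning idea can be sketched without Hadoop. The following Python example is only an illustration under stated assumptions: the global index is modeled as a dictionary from a partition name to its bounding rectangle, and `range_filter` and `range_query` are hypothetical names, not the signatures of SpatialInputFormat or SpatialRecordReader.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def overlaps(a: Rect, b: Rect) -> bool:
    """True if rectangles a and b share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contains(r: Rect, p: Point) -> bool:
    """True if point p lies inside rectangle r (boundaries included)."""
    return r[0] <= p[0] <= r[2] and r[1] <= p[1] <= r[3]


def range_filter(query: Rect) -> Callable[[Dict[str, Rect]], List[str]]:
    """Build a filter function: keep only partitions whose bounding
    rectangle overlaps the query range (partition-level pruning)."""
    def select(global_index: Dict[str, Rect]) -> List[str]:
        return [name for name, mbr in global_index.items()
                if overlaps(mbr, query)]
    return select


def range_query(global_index: Dict[str, Rect],
                blocks: Dict[str, List[Point]],
                query: Rect) -> Tuple[List[str], List[Point]]:
    """Prune partitions first, then filter records inside the survivors."""
    selected = range_filter(query)(global_index)
    hits: List[Point] = []
    for name in selected:
        # Record-level filtering; a real local index would avoid a full scan.
        hits.extend(p for p in blocks[name] if contains(query, p))
    return selected, hits
```

Partitions rejected by the filter are never read, which is where most of the savings come from when the query range is small relative to the whole file.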
The new components in the MapReduce layer allow GeoJinni to run many spatial operations efficiently by using the underlying indexes. The operations layer encapsulates all the spatial operations so that end users can access them quickly and efficiently. Currently, GeoJinni has a dozen operations, including range query, k nearest neighbors, spatial join, convex hull, polygon union, closest pair, farthest pair, and skyline. Developers can add more operations by using the new components added to GeoJinni.
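As an illustration of one listed operation, here is a minimal single-machine skyline sketch in Python. It is not GeoJinni's implementation. It assumes the "max-max" definition, in which a point dominates another if it is at least as large in both coordinates and the two points differ; the function name `skyline` is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def skyline(points: List[Point]) -> List[Point]:
    """Return the points not dominated by any other point (max-max skyline).

    Sort by x descending (ties: y descending), then sweep: a point survives
    only if its y is strictly larger than every y seen so far, because every
    earlier point already has an x that is at least as large.
    """
    result: List[Point] = []
    best_y = float("-inf")
    for p in sorted(set(points), key=lambda q: (-q[0], -q[1])):
        if p[1] > best_y:
            result.append(p)
            best_y = p[1]
    return result
```

A distributed version could plausibly compute a skyline per partition and then merge the partial results, since a point dominated within its own partition cannot appear in the final answer.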
GeoJinni can be installed as an extension to an existing Hadoop cluster. This means that you can run it without giving up your current Hadoop configuration or distribution. This makes it portable across a wide range of Hadoop distributions, including Apache Hadoop, Cloudera, and Hortonworks.
GeoJinni is the first full-fledged MapReduce framework for spatial data. Since the release of the first version in March 2013, we have received more than 28,000 downloads and attracted the attention of people from both the industrial and academic communities. We believe that it will be very useful to users who have terabytes of spatial data ready to be processed. Moreover, its open source nature helps developers provide new functions and operations for spatial data analysis and processing and make them available to the community.
GeoJinni uses, in part, the Java Topology Suite (JTS) library. JTS is released under the LGPL license, which might conflict with the Apache license under which Hadoop and GeoJinni are released. Notice, however, that JTS is used only as a library: we did not edit its code or copy any code from it. It is used like any other third-party library in the project.
We plan to add more sophisticated spatial operations, including more efficient spatial join operations. We also plan to add interfaces to other Hadoop-related projects such as Pig, Hive, and HBase. We need to reach out to users of those projects and allow them to use the functionality provided in GeoJinni. As a long-term objective, we also plan to add temporal support to make GeoJinni a spatio-temporal MapReduce framework.
Currently, a full prototype of GeoJinni is available as open source, and it is up and running. Since the release of the first version in March 2013, the binaries have been downloaded more than 28,000 times. The current implementation includes the spatial data types Point, Rectangle, Linestring, MultiLinestring, Polygon, MultiPolygon, and GeometryCollection. These data types are wrappers around the Geometry data type from the Java Topology Suite (JTS) and the ESRI Geometry API. They can be imported from and exported to text and binary files. They can also be used in MapReduce programs as keys and/or values.
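The text import/export round trip can be pictured with a short Python sketch. It is illustrative only: the comma-separated `x,y` line format and the method names `from_text` and `to_text` are assumptions made for this example, not GeoJinni's actual file format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Point:
    """A 2D point that is hashable and sortable, so it can act as a key."""
    x: float
    y: float

    @classmethod
    def from_text(cls, line: str) -> "Point":
        """Parse one text line of the assumed form 'x,y'."""
        x, y = line.strip().split(",")
        return cls(float(x), float(y))

    def to_text(self) -> str:
        """Serialize back to the assumed 'x,y' form."""
        return f"{self.x!r},{self.y!r}"
```

Making the type immutable and ordered mirrors the requirement that a value used as a key must be comparable and must not change after it is emitted.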
GeoJinni also contains a new operation, INDEX, which is used to index an existing file that contains spatial data. The spatial index to build can be either a Grid File, an R-tree, or an R+-tree, based on the user's choice. The index is constructed in a distributed manner using a MapReduce program and stored efficiently in HDFS. Our experiments showed that we can index a 128 GB dataset using an R-tree in less than one hour on a 10-machine cluster.
To use the spatial index in MapReduce programs, we provide a SpatialInputFormat and a SpatialRecordReader. The SpatialInputFormat can be plugged in with a user-defined filter function that filters out unnecessary partitions of the file. The SpatialRecordReader allows map functions to filter out unnecessary records from a block.
As a case study, we provide at least 20 operations that are implemented efficiently using the constructed indexes and the new components in the MapReduce layer. The list of operations includes range query, kNN, spatial join (two algorithms), index building, minimal bounding rectangle, polygon union, skyline, convex hull, closest pair, farthest pair, image creation, high-resolution image creation, and other operations.
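To make the map/reduce structure of such an operation concrete, here is a hedged Python sketch of the minimal bounding rectangle computed in two phases. It runs on one machine and is not GeoJinni code; the names `mbr_of_points`, `merge_mbr`, and `distributed_mbr` are hypothetical.

```python
from functools import reduce
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def mbr_of_points(points: List[Point]) -> Rect:
    """'Map' step: bounding rectangle of one non-empty partition."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def merge_mbr(a: Rect, b: Rect) -> Rect:
    """'Reduce' step: smallest rectangle that covers both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def distributed_mbr(partitions: List[List[Point]]) -> Rect:
    """Compute one partial rectangle per partition, then merge them.
    Empty partitions are skipped; at least one must be non-empty."""
    partial = [mbr_of_points(pts) for pts in partitions if pts]
    return reduce(merge_mbr, partial)
```

Because merging rectangles is associative and commutative, the partial results can be combined in any order, which is what makes the operation a natural fit for MapReduce.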
This project was created in the Data Management Lab at the University of Minnesota by Ahmed Eldawy under the supervision of Prof. Mohamed Mokbel. It is part of the research going on there, and it is released to the public as open source in the hope that the community will respond and contribute to the project.