GeoJinni


This proposal has been approved and the GeoJinni project has been created.
Basics

This proposal is in the Project Proposal Phase (as defined in the Eclipse Development Process) and is written to declare its intent and scope. We solicit additional participation and input from the community. Please add your feedback in the comments section.

Parent Project: 
Background: 
Since its release in 2007, Hadoop has been adopted as a solution for scalable processing of huge datasets in many applications, e.g., machine learning, graph processing, and behavioral simulations. Hadoop employs MapReduce, a simplified programming paradigm for distributed processing, to build an efficient framework for processing large-scale data. Meanwhile, there has been an explosion in the amount of spatial data produced by devices such as smart phones, space telescopes, and medical devices. Space telescopes generate up to 50 GB of spatial data per month, and medical devices produce spatial images (X-rays) at a rate of 50 PB per year. To this end, big spatial data calls for Hadoop-like environments for scalable distributed processing. Yet, Hadoop is ill equipped to support spatial data. The only way to handle spatial data in Hadoop is through an on-top approach, where map and reduce functions are written to realize the various spatial operations. Although such an on-top approach is simple, as it treats Hadoop as a black box, it is inefficient because the internal Hadoop system is unaware of the properties of spatial data.
 
GeoJinni is a full-fledged MapReduce framework for efficient processing of spatial operations. GeoJinni adds new spatial data types (e.g., Point, Rectangle, and Polygon) that can be used in MapReduce programs. It also adds spatial indexes that organize the data in HDFS to allow efficient retrieval and processing; GeoJinni supports the Grid File, R-tree, and R+-tree indexes, all implemented and stored in HDFS. GeoJinni also adds new components to the MapReduce layer to allow MapReduce programs to access the constructed indexes. Finally, GeoJinni ships with a dozen spatial operations, such as range query, spatial join, convex hull, and polygon union, all implemented efficiently as MapReduce programs. Users can use the built-in operations as-is, while developers can write more spatial operations using the same API.
 
GeoJinni used to be called SpatialHadoop. However, to avoid conflicts with the name 'Hadoop', which is registered to Apache, we chose to rename it to GeoJinni.
Scope: 

GeoJinni is concerned with efficient analysis of big spatial data. Like Hadoop, GeoJinni focuses on batch processing of large data analysis tasks; interactive queries that must be answered in milliseconds, or analyses of small datasets, are not meant to run in GeoJinni. Examples of supported operations include spatial join, data partitioning, and large-scale image generation. GeoJinni should also have interfaces to other Hadoop tools and projects such as Hive, Pig, and HBase. The indexes built in GeoJinni should be accessible to all of these tools through suitable APIs in order to reach a wide range of users.

Description: 

GeoJinni is a comprehensive extension to Hadoop that allows efficient processing of spatial data. It injects spatial awareness into the different layers and components of Hadoop to make it more suitable and more efficient for storing and processing big spatial data. More specifically, it modifies the storage and MapReduce layers and adds a new operations layer.

At the lowest level, it adds new data types for spatial data that can be used as keys or values in a MapReduce program. It also adds parsers and writers for interacting with files containing spatial data. Unlike traditional Hadoop, where data in files is unorganized, GeoJinni provides efficient spatial indexes organized in two layers: a global index and local indexes. The global index partitions data across different machines, whereas local indexes organize data inside each machine. This design is used to build three different indexes in GeoJinni, namely the Grid File, the R-tree, and the R+-tree. All of these indexes can be constructed upon user request and are stored in the Hadoop Distributed File System (HDFS).
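
As a concrete illustration, the sketch below shows how a spatial type might appear as the value type of a map function. This is a minimal sketch only: the Point class and its x/y fields stand in for GeoJinni's actual spatial data types, and the grid-cell bucketing is an arbitrary example rather than part of the framework.

    // Minimal sketch: buckets input points into coarse grid cells and counts them.
    // The Point class and its x/y fields are illustrative placeholders for
    // GeoJinni's spatial data types, not the exact API.
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class GridCountMapper extends Mapper<LongWritable, Point, Text, IntWritable> {
      private static final double CELL_SIZE = 1000.0;   // assumed cell width/height
      private final IntWritable one = new IntWritable(1);
      private final Text cellKey = new Text();

      @Override
      protected void map(LongWritable offset, Point p, Context context)
          throws IOException, InterruptedException {
        // Assign each point to a grid cell; a reducer would sum the counts per cell.
        long col = (long) Math.floor(p.x / CELL_SIZE);
        long row = (long) Math.floor(p.y / CELL_SIZE);
        cellKey.set(col + "," + row);
        context.write(cellKey, one);
      }
    }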

To allow MapReduce programs to use the constructed indexes, we add two new components to the MapReduce layer, namely SpatialInputFormat and SpatialRecordReader. The SpatialInputFormat utilizes the global index by early pruning of file partitions that fall outside the query range. The developer can choose which partitions to load through a spatial FILTER function, which is defined in the MapReduce program in the same way the map and reduce functions are defined. The SpatialRecordReader utilizes the local index by allowing the map function to process only a subset of the records in a block instead of all of them.
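
The sketch below shows how these components might be wired into a job driver. SpatialInputFormat is the component named above; how the query range is passed to it (here a plain configuration property named geojinni.query.rectangle) is an assumption made purely for illustration.

    // Sketch of a job driver using the spatial input format for partition pruning.
    // The configuration property name and the use of SpatialInputFormat here are
    // assumptions, not the exact GeoJinni API.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SpatialJobDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical property holding the query rectangle (x1,y1,x2,y2);
        // SpatialInputFormat would use it to prune partitions via the global index.
        conf.set("geojinni.query.rectangle", "100,100,500,500");

        Job job = Job.getInstance(conf, "spatial-job-sketch");
        job.setJarByClass(SpatialJobDriver.class);
        job.setInputFormatClass(SpatialInputFormat.class);  // GeoJinni component (placeholder)
        job.setMapperClass(GridCountMapper.class);           // mapper from the sketch above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }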

These new components in the MapReduce layer allow GeoJinni to run many spatial operations efficiently by utilizing the underlying indexes. The operations layer encapsulates all the spatial operations so that end users can access them quickly and easily. Currently, GeoJinni has a dozen operations, including range query, k nearest neighbors, spatial join, convex hull, polygon union, closest pair, farthest pair, and skyline. Developers can add more operations by utilizing the new components added to GeoJinni.
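
For end users, invoking a built-in operation could look like the following sketch. The RangeQuery class name and the assumption that the built-in operations implement Hadoop's Tool interface are illustrative; the proposal only states that the operations are shipped as ready-to-run MapReduce programs.

    // Sketch of running a built-in operation programmatically via ToolRunner.
    // RangeQuery and its argument convention are hypothetical placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;

    public class RunRangeQuery {
      public static void main(String[] args) throws Exception {
        // Hypothetical arguments: indexed input file, output path, query rectangle.
        String[] opArgs = {"hdfs:///data/points", "hdfs:///out/rq", "rect:100,100,500,500"};
        int rc = ToolRunner.run(new Configuration(), new RangeQuery(), opArgs);
        System.exit(rc);
      }
    }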

GeoJinni can be installed as an extension to an existing Hadoop cluster. This means that you can run it without giving up your current configuration or distribution of Hadoop, which makes it portable across a wide range of Hadoop distributions, including Apache Hadoop, Cloudera, and Hortonworks.

Why Here?: 

GeoJinni is the first full-fledged MapReduce framework for spatial data. Since the release of the first version in March 2013, we have received more than 28,000 downloads and attracted attention from both the industrial and academic communities. We believe that it will be very useful to users who have terabytes of spatial data ready to be processed. Moreover, its open source nature helps developers provide new functions and operations for spatial data analysis and processing and make them available to the community.

Future Work: 

We plan to add more sophisticated spatial operations, including more efficient spatial join algorithms. We also plan to add interfaces to other Hadoop-related projects such as Pig, Hive, and HBase; we need to reach out to the users of those projects and allow them to use the functionality provided in GeoJinni. As a long-term objective, we also plan to add temporal support to build a spatio-temporal MapReduce framework.

People
Project Leads: 
Committers: 
Interested Parties: 

After releasing the initial version of GeoJinni, we received a lot of interest from both the industrial and academic communities. We have had direct contact with people from Oracle, Hortonworks, and ESRI, and we have also received considerable interest in the project from Azavea and Boundlessgeo.

Recently, we presented a demo of GeoJinni at VLDB 2013 in Italy, where we met people from all around the world who were very interested in GeoJinni. We are also going to present a full research paper at ACM SIGSPATIAL 2013 in Florida about the use of GeoJinni in computational geometry.

Initial Contribution: 

Currently, a full prototype of GeoJinni is available as open source and is up and running. Since the release of the first version in March 2013, it has received more than 28,000 downloads of the binaries. The current implementation includes the spatial data types Point, Rectangle, Linestring, MultiLinestring, Polygon, MultiPolygon, and GeometryCollection. These datatypes are wrappers around the Geometry data types from the Java Topology Suite (JTS) and the ESRI Geometry API. They can be imported from and exported to text and binary files, and they can also be used in MapReduce programs as keys and/or values.
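
The short sketch below illustrates the text import/export described above. The fromText/toText method names and the comma-separated point format are assumptions about the serialization API, used here only to make the idea concrete.

    // Sketch of round-tripping a spatial value through its text representation.
    // The fromText/toText methods and the "x,y" record format are assumed names.
    import org.apache.hadoop.io.Text;

    public class PointTextDemo {
      public static void main(String[] args) {
        Point p = new Point();
        p.fromText(new Text("123.5,87.2"));  // parse one record from a text file
        Text out = new Text();
        p.toText(out);                       // serialize it back for a text output file
        System.out.println(out);
      }
    }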

GeoJinni also contains a new operation, INDEX, which is used to index an existing file of spatial data. The spatial index to build can be either a Grid File, an R-tree, or an R+-tree, based on the user's choice. The index is constructed in a distributed manner using a MapReduce program and is stored efficiently in HDFS. Our experiments showed that we can index a 128 GB dataset with an R-tree in less than one hour on a 10-machine cluster.
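
A driver that triggers index construction might look like the following sketch. The Indexer class name and the sindex property are illustrative assumptions; the proposal only specifies that the INDEX operation runs as a MapReduce job and writes the index to HDFS.

    // Sketch of building an R-tree index over an existing file from a driver program.
    // The Indexer class and the "sindex" property are hypothetical placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;

    public class BuildIndex {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("sindex", "rtree");  // assumed switch: grid, rtree, or r+tree
        String[] indexArgs = {"hdfs:///data/points", "hdfs:///data/points.rtree"};
        System.exit(ToolRunner.run(conf, new Indexer(), indexArgs));
      }
    }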

To use the spatial index in MapReduce programs, we provide a SpatialInputFormat and a SpatialRecordReader. The SpatialInputFormat can be plugged in with a user-defined filter function that filters out unnecessary partitions of the file. The SpatialRecordReader allows map functions to filter out unnecessary records within a block.
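
The record-level side of this pruning could look like the mapper sketched below: after the SpatialRecordReader has narrowed a block down, the map function keeps only the shapes that actually overlap the query rectangle. The Rectangle class, its constructor, and its isIntersected method are illustrative placeholders, not the exact GeoJinni API.

    // Sketch of record-level filtering inside the map function.
    // Rectangle, its constructor, and isIntersected() are assumed names.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RangeFilterMapper extends Mapper<LongWritable, Rectangle, NullWritable, Rectangle> {
      private Rectangle query;

      @Override
      protected void setup(Context context) {
        // Hypothetical: the query rectangle (x1,y1,x2,y2) is read from the job configuration.
        String[] c = context.getConfiguration().get("geojinni.query.rectangle").split(",");
        query = new Rectangle(Double.parseDouble(c[0]), Double.parseDouble(c[1]),
                              Double.parseDouble(c[2]), Double.parseDouble(c[3]));
      }

      @Override
      protected void map(LongWritable offset, Rectangle shape, Context context)
          throws IOException, InterruptedException {
        if (shape.isIntersected(query)) {      // keep only shapes overlapping the query
          context.write(NullWritable.get(), shape);
        }
      }
    }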

As a case study, we provide at least 20 operations that are implemented efficiently using the constructed indexes and the new components in the MapReduce layer. The list of operations includes range query, k nearest neighbors, spatial join (two algorithms), index building, minimal bounding rectangle, polygon union, skyline, convex hull, closest pair, farthest pair, image creation, high-resolution image creation, and other operations.

This project was created in the Data Management Lab at the University of Minnesota by Ahmed Eldawy under the supervision of Prof. Mohamed Mokbel. It is part of the ongoing research there and is released to the public as open source in the hope that the community will engage with it and contribute more to the project.