Software Release

I have contributed to the design and/or implementation of various software artifacts. More specifically, my work includes the following:

Hippo – a fast, yet lightweight, indexing scheme for modern database systems:
Classic database indexes (e.g., the B+-Tree) speed up queries but suffer from two main drawbacks: (1) An index usually adds 5% to 15% storage overhead, which translates into a non-negligible dollar cost in big data scenarios, especially when deployed on modern storage devices. (2) Maintaining an index incurs high latency because the DBMS has to locate and update the index pages affected by changes to the underlying table. Hippo significantly shrinks index storage and reduces maintenance overhead without compromising much on query execution performance. To reduce the space occupied by the index, Hippo stores disk page ranges instead of pointers to individual tuples in the indexed table. It maintains simplified histograms that represent the data distribution and adopts a page grouping technique that groups contiguous pages into page ranges based on the similarity of their index key attribute distributions. When a query is issued, Hippo uses the page ranges and histogram-based page summaries to identify the pages whose tuples are guaranteed not to satisfy the query predicates, and inspects only the remaining pages. Hippo occupies up to two orders of magnitude less storage space than the B+-Tree while still achieving comparable query execution performance. Furthermore, Hippo achieves up to three orders of magnitude less maintenance overhead and up to an order of magnitude higher throughput (for hybrid query/update workloads) than its counterparts. For more details, please visit the project website.
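The page-range idea described above can be illustrated with a small in-memory sketch. This is not Hippo's actual implementation: the function names (`bucket_of`, `build_index`, `range_query`), the use of plain Python lists as "pages", and the grouping rule (close a range once its summary would cover more than `max_buckets_per_range` histogram buckets) are all assumptions made for illustration only.

```python
from bisect import bisect_right


def bucket_of(value, boundaries):
    """Return the id of the histogram bucket that contains value."""
    return bisect_right(boundaries, value)


def build_index(pages, boundaries, max_buckets_per_range):
    """Group contiguous pages into ranges, each summarised by a set of bucket ids.

    Assumed grouping rule (illustrative): a range is closed as soon as adding
    the next page would make its summary cover too many buckets.
    """
    index = []  # entries are (first_page, last_page, bucket_id_set)
    start, summary = 0, set()
    for page_no, page in enumerate(pages):
        page_buckets = {bucket_of(v, boundaries) for v in page}
        if summary and len(summary | page_buckets) > max_buckets_per_range:
            index.append((start, page_no - 1, summary))
            start, summary = page_no, set()
        summary |= page_buckets
    if summary:
        index.append((start, len(pages) - 1, summary))
    return index


def range_query(pages, index, boundaries, low, high):
    """Return the values in [low, high] and the number of pages inspected."""
    wanted = set(range(bucket_of(low, boundaries), bucket_of(high, boundaries) + 1))
    results, pages_read = [], 0
    for first, last, summary in index:
        if summary.isdisjoint(wanted):
            continue  # no tuple in this page range can satisfy the predicate
        for page in pages[first:last + 1]:
            pages_read += 1
            results.extend(v for v in page if low <= v <= high)
    return results, pages_read


# Toy table: five "disk pages" holding the indexed key values.
pages = [[1, 2, 3], [2, 4, 5], [50, 51, 52], [53, 55, 58], [90, 95, 99]]
boundaries = [10, 20, 30, 40, 50, 60, 70, 80, 90]  # histogram bucket edges
index = build_index(pages, boundaries, max_buckets_per_range=1)
matches, pages_read = range_query(pages, index, boundaries, 51, 56)
```

In this toy run the index holds three entries for five pages, and the query inspects only the two pages whose range summary overlaps the queried bucket. The index stores one small summary per page range rather than one pointer per tuple, which is the source of the storage savings the paragraph describes.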

GeoSpark – A Cluster Computing System for Processing Large-Scale Spatial Data:
GeoSpark is a cluster computing system for processing large-scale spatial data. GeoSpark extends Apache Spark with a set of out-of-the-box Spatial Resilient Distributed Datasets (SRDDs) that efficiently load, process, and analyze large-scale spatial data across machines. Processing such data is challenging because (1) spatial data may be quite complex, e.g., the geometrical boundaries of rivers and cities, and (2) spatial (and geometric) operations (e.g., Overlap, Intersect, Convex Hull, Cartographic Distances) cannot be easily and efficiently expressed using regular RDD transformations and actions. GeoSpark provides APIs that let Apache Spark programmers easily develop spatial analysis programs with SRDDs, which have built-in support for geometrical and distance operations. Experiments show that GeoSpark is scalable and exhibits faster run-time performance than Hadoop-based systems in spatial analysis applications such as spatial join, spatial aggregation, spatial autocorrelation analysis, and spatial co-location pattern recognition. For more details, please visit the GeoSpark project website.

RecDB – A Recommendation Database Management System:
Database management systems have traditionally expected users to know in advance what kind of data they need to query. In many cases, however, users do not know exactly what data they need and prefer to explore the database instead. To this end, this project extends existing database systems to support recommendation as a means of data exploration. In this project, we designed RecDB, a full-fledged database system that produces data recommendations for end-users. The system incorporates state-of-the-art recommendation algorithms into the core functionality of a database query execution engine. RecDB allows its users to write SQL queries that seamlessly integrate the recommendation functionality with traditional relational operators, i.e., SELECT, PROJECT, and JOIN. The system optimizes incoming recommendation queries (written in SQL) and hence provides near real-time personalized recommendations to a large number of end-users who have expressed their opinions over a large pool of data items. For more details, please visit the RecDB project website.

 

MNTG – A Web-based System for Spatial Road Network Traffic Generation and Visualization:
Traffic data consists of a set of spatial locations with timestamps, reported by a set of objects moving over an underlying road network. Traffic data has already been leveraged by researchers in different areas, e.g., spatio-temporal databases, transportation, urban computing, and data mining. MNTG is an extensible web-based road network traffic generator that overcomes the hurdles of existing traffic generators. MNTG has three main features that make it significantly easier for researchers to obtain traffic data: (1) MNTG is a web service with a user-friendly map interface. Behind the scenes, MNTG carries the burden of configuring and running existing traffic generators. Thus, MNTG users do not need to install or configure anything on their local machines. (2) MNTG can be used for any arbitrary spatial area worldwide with the user-selected traffic generator. Users simply mark their area of interest on a map interface. Once the traffic generation request is submitted, MNTG extracts the road network for the requested area and generates traffic on that area using the user-selected traffic generator. (3) MNTG users do not need to worry about processing time or computing resources, because MNTG has its own dedicated server machine that (a) receives a traffic request from the user, (b) internally processes the request in a multi-core, multi-threaded paradigm, and (c) emails the user when the data has been generated. For more information, please visit: http://mntg.cs.umn.edu/
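The receive, process, and notify flow in feature (3) can be sketched roughly as follows. This is only an illustration of the general pattern, not MNTG's actual code: `TrafficRequest`, `handle`, `serve`, and the callables `extract_network`, `generators`, and `notify` are hypothetical names, and a thread pool stands in for whatever concurrency mechanism the real server uses.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class TrafficRequest:
    """A hypothetical traffic generation request submitted through the map UI."""
    email: str
    bbox: tuple      # assumed layout: (min_lat, min_lon, max_lat, max_lon)
    generator: str   # name of the user-selected traffic generator


def handle(request, extract_network, generators, notify):
    """Process one request: extract the road network, generate traffic, notify."""
    network = extract_network(request.bbox)
    output = generators[request.generator](network)
    notify(request.email, output)  # stands in for the "email the user" step
    return output


def serve(requests, extract_network, generators, notify, workers=4):
    """Process requests concurrently and return their outputs in request order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(handle, r, extract_network, generators, notify)
            for r in requests
        ]
        return [f.result() for f in futures]
```

Passing the network extractor, the generators, and the notifier in as plain callables mirrors the description of MNTG as a wrapper around existing traffic generators: the service layer stays the same whichever generator the user selects.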