ISU Electrical and Computer Engineering Archives

A study of time-decayed aggregate computation on data streams

Xu, Bojian (2009) A study of time-decayed aggregate computation on data streams. PhD thesis, Iowa State University.

Full text available as:

PDF - Registered users only - Requires Adobe Acrobat Reader or other PDF viewer.

Abstract

Many real world data naturally arrive as rapid paced and virtually unbounded streams. Examples of such streams include network traffic at a router, events observed by a sensor network, accesses to a web server and transactional updates to a large database. Such streaming data need to be monitored online to collect traffic statistics, detect trends and anomalies, tune system performance and help make business decisions. However, because of the large size and rapid pace of the data, as well as the online processing requirement, conventional data processing methods, such as storing the data in a database and issuing offline SQL queries thereafter, are not feasible. Data stream processing is a new diagram of massive data set processing and creates new challenges in the algorithm design and implementation. In this thesis, we consider time-decayed data aggregation for data streams, where the importance or contribution of each data element decays over time, since recent data are usually considered of more importance in applications, and therefore are given heavier weights. We design small space data structures and algorithms for maintaining fundamental aggregates of the streams if it is possible and otherwise show large space lower bounds. We consider the data aggregation over a robust data stream model called asynchronous data stream, motivated by the streaming data transmitted in distributed systems, including computer networks, where the asynchrony in the data transmission is inevitable. In asynchronous data stream, the arrival order of the data elements at the receiver side is not necessarily the same as the order in which the data elements were generated. Asynchronous data stream is a robuster and generalized model of the previous synchronous data stream model. In summary, this thesis presents the following results: 1. We formalize the model of asynchronous data stream and the notion of timestamp sliding window. We propose the first small space sketch for summarizing the data elements over timestamp sliding windows of multiple geographically distributed asynchronous data streams. The sketch can return accuracy guaranteed estimates for basic aggregates, such as: Sum, Median and Quantiles. 2. We design the first small space sketch for general purpose network streaming data aggregation. The sketch has the following properties that make it useful in communication-efficient aggregation in distributed streaming scenarios: (1) The sketch can handle multiple geographically distributed asynchronous data streams. (2) The sketch is duplicate-insensitive, i.e. reinsertions of the same data will not affect the sketch, and hence the estimates of aggregates. (3) The sketch is also time-decaying, so that the weight of each data element summarized in the sketch decreases over time. (4) The sketch returns accuracy guaranteed estimates for a variety of core aggregates, including the sum, median, quantiles, frequent elements and selectivity. 3. We conduct a comprehensive study on the time-decayed correlated data aggregation over asynchronous data streams. For each class of time decay function, we either propose space efficient algorithms or show large space lower bounds. We not only closes the open problem of correlated data aggregation under sliding windows decay, but also presents negative results for the case of exponential decay, which however is highly used in the non-correlated scenarios. 4. We propose the forward decay model to simplify the time-decayed data stream aggregation and sampling. Forward decay captures a variety of usual decay functions (or called backward decay), such as exponential decay. We design efficient algorithms for data aggregation and sampling under the forward decay model, and show that they are easy to implement scalably.

EPrint Type:Thesis (PhD)
Subjects:Computer Engineering > SOFTWARE SYSTEMS > Parallel and Distributed Computing
ID Code:522
Identification Number:Identification Number UNSPECIFIED
Deposited By:Bojian Xu
Deposited On:10 November 2009

Archive Staff Only: edit this record