Public Data Sets List

Many students ask me where to find good data sets for class or thesis projects. The following link leads to a collection of public data sets.

Public Data Sets for Data Analytics Projects

Program Code Examples

R Code Examples related to CS555 “Data Analysis and Visualization” are here https://github.com/kiat/R-Examples
Code Examples related to CS777 “Big Data Analysis” are here https://github.com/kiat/MET-CS777
Code Examples related to CS665 “Software Design and Patterns” are here https://github.com/kiat/MET-CS665
Code Examples related to CS755 “Cloud Computing” are here https://github.com/kiat/MET-CS755

Open Master Thesis

Here I list some open master thesis project ideas. If you are interested in writing a master thesis with me, contact me to talk about the details. I may also have more topics than those listed here.

Open Thesis-1

Title: Real-time Anomaly Detection from Data Streams

Abstract. The main goal of this project is to develop novel approaches for detecting anomalies in high-throughput data streams. Such streams are generated by sensing devices at very high rates, and anomalies should be detected in real time. An anomaly can be any kind of unusual behavior or event, such as a rapid change, that can be detected by pattern matching on a high-velocity data stream. In this project we study state-of-the-art approaches for data stream anomaly detection and extend them to improve detection throughput by using data stream distribution.
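As a point of reference for what "detecting a rapid change" can mean, here is a minimal baseline sketch in Python. It is not one of the state-of-the-art approaches the project will study: it simply flags a value that lies more than a chosen number of standard deviations away from the mean of a trailing window. The function name `detect_anomalies` and the parameters `window` and `threshold` are assumptions made for this illustration.

```python
from collections import deque
import math


def detect_anomalies(stream, window=20, threshold=3.0):
    """Yield (index, value) for values far from the trailing-window mean.

    Illustrative baseline only: a value is flagged when it deviates from
    the mean of the previous `window` values by more than `threshold`
    standard deviations.
    """
    recent = deque(maxlen=window)  # holds only the last `window` values
    for i, x in enumerate(stream):
        if len(recent) == window:
            mean = sum(recent) / window
            var = sum((v - mean) ** 2 for v in recent) / window
            std = math.sqrt(var)
            # Skip the check when the window is constant (std == 0).
            if std > 0 and abs(x - mean) > threshold * std:
                yield i, x
        recent.append(x)


# Usage: a steady signal with one sudden spike at index 20.
readings = [10.0, 10.1, 9.9, 10.2, 9.8] * 4 + [25.0] + [10.0] * 5
print(list(detect_anomalies(readings)))  # [(20, 25.0)]
```

The detector processes one value at a time and keeps only a fixed-size window in memory, which is the property that matters for streams. A thesis would replace the per-value statistics with incremental updates and study how the work can be spread over several machines.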

Open Thesis-2

Title: Fast Approximation of Feature Matrix Entries

Abstract. In many data mining and machine learning algorithms, a feature vector represents the features of a complex object. Many feature values can be extracted by static processing of the stored objects, but some depend on the specific user query, because the user-provided object has to be compared with the stored objects to obtain them. For example, if objects are represented as graphs and we search for the stored graph most similar to a user-provided graph, then some features are based on graph properties that come from comparisons between the user query and all stored objects.
Such dynamic feature values have to be computed at query time by comparing the query object against all stored objects. Retrieving a large number of dynamic feature entries is expensive, because both the number of features and the number of objects are large. If we cannot compute the exact values in time, we might be able to replace them with a fast approximation.
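To make the idea of trading exactness for speed concrete, the following Python sketch approximates one simple kind of dynamic feature, the dot-product similarity between a query vector and every stored vector, using a Gaussian random projection. This is a standard textbook technique chosen only for illustration; it is an assumption that objects are already encoded as numeric vectors, and the helper names `make_projection`, `project`, and `approx_similarities` are hypothetical. The stored objects are projected once offline, so at query time only short sketches are compared.

```python
import random


def make_projection(dim, k, seed=0):
    """Return a k x dim matrix with independent standard normal entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def project(matrix, vec):
    """Map a dim-dimensional vector to its k-dimensional sketch."""
    return [sum(r * v for r, v in zip(row, vec)) for row in matrix]


def approx_similarities(matrix, stored_sketches, query):
    """Estimate the dot product of `query` with every stored object.

    Dividing by k makes each estimate unbiased for the exact dot product;
    the error shrinks as k grows.
    """
    k = len(matrix)
    q = project(matrix, query)
    return [sum(a * b for a, b in zip(q, s)) / k for s in stored_sketches]


# Usage: sketches of the stored objects are computed once, offline.
dim, k = 50, 16
stored = [[1.0] * dim, [0.0] * dim]
R = make_projection(dim, k)
sketches = [project(R, obj) for obj in stored]

# At query time, only k-dimensional sketches are compared.
estimates = approx_similarities(R, sketches, [1.0] * dim)
print(estimates)
```

Each query costs one projection plus k multiplications per stored object instead of dim multiplications, at the price of an approximation error. Features based on graph comparisons, as in the example above, would need a different kind of sketch.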

Related work for this project includes:

Manas Joglekar, Hector Garcia-Molina, Aditya Parameswaran, and Christopher Re. 2015. Exploiting Correlations for Expensive Predicate Evaluation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD ’15). ACM, New York, NY, USA, 1183-1198. DOI: https://doi.org/10.1145/2723372.2723715
Christina Teflioudi, Rainer Gemulla, and Olga Mykytiuk. 2015. LEMP: Fast Retrieval of Large Entries in a Matrix Product. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD ’15). ACM, New York, NY, USA, 107-122. DOI: https://doi.org/10.1145/2723372.2747647
Lester Mackey, Ameet Talwalkar, and Michael I. Jordan. 2015. Distributed matrix completion and robust factorization. J. Mach. Learn. Res. 16, 1 (January 2015), 913-960.

Open Thesis-3

Title: Mining Design Patterns from Code Repositories

Abstract. In this project, our goal is to develop a mining system that identifies the design structures of a software package by running a pattern matching algorithm that compares the software structure against the specifications of well-known design patterns. The system will report its results as ratios: for a given code repository, or a subset of it, it states how similar the code is to each known pattern and with what level of confidence. We will start by extracting the static structure of programs. We think that we can begin with the code graph structure and later add the dynamic behavior of programs, provided that we can compile and execute them, for example by running the provided unit tests. One idea is to use a combination of graph languages and active rule representation languages to specify software design patterns. Some patterns are highly similar and closely related, and can be differentiated only by details and by dynamic program behavior. Some of the most important state-of-the-art approaches are [Dabain 2015, Bernardi 2014/12, Rasool 2011, Stella 2011, Prasad 2010, Dong 2007, Shi 2005, Balanyi 2003]. The goal of this project is to extend the existing approaches for the specific purpose of code mining and code similarity matching in a large-scale environment.
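The first step, extracting static structure and scoring it against a pattern specification, can be sketched in a few lines of Python using the standard `ast` module. This is a toy illustration and not any of the cited approaches: it only handles Python source, and the three rules in `strategy_score` are a hypothetical, heavily simplified stand-in for a real specification of the Strategy pattern. The function names are assumptions made for this example.

```python
import ast


def extract_structure(source):
    """Map each class name to its direct base classes and method names."""
    classes = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            classes[node.name] = {
                "bases": [b.id for b in node.bases if isinstance(b, ast.Name)],
                "methods": [n.name for n in node.body
                            if isinstance(n, ast.FunctionDef)],
            }
    return classes


def strategy_score(classes):
    """Return the best fraction of (hypothetical) Strategy rules satisfied."""
    best = 0.0
    for name, info in classes.items():
        subs = [c for c, i in classes.items() if name in i["bases"]]
        rules = [
            # Rule 1: the candidate interface has at least two variants.
            len(subs) >= 2,
            # Rule 2: every variant implements all interface methods.
            bool(subs) and bool(info["methods"]) and all(
                set(info["methods"]) <= set(classes[s]["methods"])
                for s in subs),
            # Rule 3: some unrelated class exists that could be the client.
            any(c != name and c not in subs for c in classes),
        ]
        best = max(best, sum(rules) / len(rules))
    return best


SAMPLE = """
class Sorter:
    def sort(self, data): raise NotImplementedError
class QuickSort(Sorter):
    def sort(self, data): return sorted(data)
class MergeSort(Sorter):
    def sort(self, data): return sorted(data)
class Client:
    def __init__(self, sorter): self.sorter = sorter
"""

print(strategy_score(extract_structure(SAMPLE)))  # 1.0
```

The score is the kind of ratio mentioned above, but a purely structural check like this cannot separate patterns with the same class layout, which is why the project also considers dynamic program behavior.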