Public Data Sets List

Many students ask me where to find good data sets for class or thesis projects. The following link leads to a collection of public data sets.

Public Data Sets for Data Analytics Projects


Program Code Examples 

Open Master Thesis

Here I list some open master thesis project ideas. If you are interested in writing a master thesis with me, please contact me to discuss the details. I may also have more topics than those listed here.



Open Thesis-1

Title: Real-time Anomaly Detection from Data Streams 

Abstract. The main goal of this project is to develop novel approaches for detecting anomalies in high-throughput data streams. Such streams are generated by sensing devices at very high rates, and anomalies should be detected in real time. An anomaly can be any kind of unusual behavior or event, such as a rapid change, that can be detected by pattern matching on high-velocity data streams. In this project we study state-of-the-art approaches for data stream anomaly detection and extend them to improve detection throughput by exploiting data stream distribution.
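
To make the notion of a "rapid change" concrete, here is a minimal sketch of one simple baseline: a sliding-window z-score detector. It is an illustration only, not the approach the thesis is expected to produce. The function name `detect_anomalies` and the parameters `window` and `threshold` are assumptions chosen for this example.

```python
from collections import deque
from math import sqrt


def detect_anomalies(stream, window=20, threshold=3.0):
    """Yield (index, value) for points that deviate strongly from the recent window.

    Assumption for this sketch: an anomaly is a value more than `threshold`
    standard deviations away from the mean of the previous `window` values.
    """
    recent = deque(maxlen=window)  # holds only the most recent `window` values
    for i, x in enumerate(stream):
        if len(recent) == window:
            mean = sum(recent) / window
            var = sum((v - mean) ** 2 for v in recent) / window
            std = sqrt(var)
            # A perfectly constant window has std == 0; it is skipped here.
            if std > 0 and abs(x - mean) > threshold * std:
                yield i, x
        recent.append(x)


# Usage: a noisy but stable signal with one sudden spike.
readings = [10.0, 10.1, 9.9, 10.0, 10.2] * 4 + [25.0] + [10.0] * 5
anomalies = list(detect_anomalies(readings, window=10, threshold=3.0))
```

The detector processes one value at a time and keeps only a bounded window in memory, which is the property that matters for streams. A real system would also need to handle throughput, concept drift, and distributed execution.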


Open Thesis-2

Title: Fast Approximation of Feature Matrix Entries

Abstract. In many data mining and machine learning algorithms, a feature vector represents the features of a complex object. Many feature values can be extracted by static processing of the stored objects, but some depend on specific user queries, because the user's object must be compared with the stored objects to extract them. For example, if objects are represented as graphs and we search for the stored graph most similar to a user-provided graph, then some features are based on graph properties that come from comparisons between the user query and all stored objects.
Such dynamic feature values have to be computed at query time by comparing the query object against all stored objects. Retrieving a large number of dynamic feature entries is expensive when both the number of features and the number of objects are large. As usual, if we cannot compute the values in time, we may be able to replace them with a fast approximation.
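
As a rough illustration of replacing an exact comparison with a fast approximation, the sketch below uses MinHash signatures to estimate the Jaccard similarity between a query object and a stored object. MinHash is a well-known technique that I have swapped in as an example; it is not taken from the project description. The sketch assumes each object can be described as a non-empty set of string tokens (for a graph, these could be edge labels). The names `signature` and `estimate_similarity` are hypothetical.

```python
import hashlib


def _hash(item: str, seed: int) -> int:
    """Deterministic 64-bit hash of `item`; `seed` selects one hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def signature(items, num_hashes=128):
    """MinHash signature of a non-empty set of string tokens.

    Signatures of stored objects can be precomputed offline (static part).
    """
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Usage: the stored object is hashed once; the query is hashed at query time.
stored = {f"edge{i}" for i in range(0, 100)}
query = {f"edge{i}" for i in range(50, 150)}
exact = len(stored & query) / len(stored | query)  # exact Jaccard similarity
approx = estimate_similarity(signature(stored), signature(query))
```

At query time, only the query signature has to be computed, and each comparison then costs time proportional to the signature length rather than to the object size. The estimate becomes more accurate as `num_hashes` grows.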

Some related work for this project:

  1. Manas Joglekar, Hector Garcia-Molina, Aditya Parameswaran, and Christopher Re. 2015. Exploiting Correlations for Expensive Predicate Evaluation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD ’15). ACM, New York, NY, USA, 1183-1198. DOI:
  2. Christina Teflioudi, Rainer Gemulla, and Olga Mykytiuk. 2015. LEMP: Fast Retrieval of Large Entries in a Matrix Product. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD ’15). ACM, New York, NY, USA, 107-122. DOI:
  3. Lester Mackey, Ameet Talwalkar, and Michael I. Jordan. 2015. Distributed matrix completion and robust factorization. J. Mach. Learn. Res. 16, 1 (January 2015), 913-960.


Open Thesis-3

Title: Mining Design Patterns from Code Repositories


In this project, our goal is to develop a mining system that identifies the design structures of a software package by running a pattern matching algorithm that matches the software structure against the specifications of well-known design patterns. It will report the matching results as ratios: a given code repository, or a subset of it, is similar to one of the known patterns with a given degree of confidence. We will start by extracting the static structure of programs. We think that we can start with the code graph structure and later add the dynamic behavior of programs, provided we can compile and execute them, for example based on the unit tests they ship with. One idea here is to use a combination of graph languages and active rule representation languages to specify software design patterns. Some patterns are highly similar and related to each other, and can be differentiated only by details and dynamic program behavior. Some of the most important state-of-the-art approaches are [Dabain 2015, Bernardi 2014/12, Rasool 2011, Stella 2011, Prasad 2010, Dong 2007, Shi 2005, Balanyi 2003]. The goal of this project is to extend the existing approaches for the specific purpose of code mining and code similarity matching in a large-scale environment.
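
The idea of matching a static code graph against a pattern specification and reporting a ratio can be sketched as follows. This is only a toy illustration under strong assumptions: the code graph is a set of (class, relation, class) triples, the pattern is a set of (role, relation, role) triples, and the score is the fraction of pattern edges found. The relation names, the `COMPOSITE` specification, and the function `best_match` are all hypothetical. The brute-force search does not scale to large repositories.

```python
from itertools import permutations

# Hypothetical specification of the Composite pattern as role-labeled edges.
COMPOSITE = {
    ("Leaf", "inherits", "Component"),
    ("Composite", "inherits", "Component"),
    ("Composite", "aggregates", "Component"),
}


def best_match(code_edges, pattern):
    """Return (score, role_assignment) for the best mapping of roles to classes.

    score is the fraction of pattern edges present in the code graph
    under the chosen assignment (1.0 means every pattern edge was found).
    """
    roles = sorted({r for s, _, t in pattern for r in (s, t)})
    classes = sorted({c for s, _, t in code_edges for c in (s, t)})
    best_score, best_assign = 0.0, {}
    # Try every assignment of distinct classes to roles (exponential cost).
    for combo in permutations(classes, len(roles)):
        assign = dict(zip(roles, combo))
        hits = sum(
            (assign[s], rel, assign[t]) in code_edges for s, rel, t in pattern
        )
        score = hits / len(pattern)
        if score > best_score:
            best_score, best_assign = score, assign
    return best_score, best_assign


# Usage: a small file-system model extracted from (imaginary) source code.
code_edges = {
    ("File", "inherits", "Node"),
    ("Directory", "inherits", "Node"),
    ("Directory", "aggregates", "Node"),
    ("Main", "uses", "Directory"),
}
score, roles = best_match(code_edges, COMPOSITE)
```

A partial match, for example a repository where the aggregation edge is missing, yields a score below 1.0. This corresponds to the ratio-style output described above, although a real system would need a more principled confidence measure.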

Some related work for this project:

  1. Marco Zanoni, Francesca Arcelli Fontana, and Fabio Stella. 2015. On applying machine learning techniques for design pattern detection. J. Syst. Softw. 103, C (May 2015), 102-117. DOI:
  2. Jing Dong, Dushyant S. Lad, and Yajing Zhao. 2007. DP-Miner: Design Pattern Discovery Using Matrix. In Proceedings of the 14th Annual IEEE International Conference and Workshops on the Engineering of Computer-Based Systems (ECBS ’07). IEEE Computer Society, Washington, DC, USA, 371-380. DOI:
  3. Haneen Dabain, Ayesha Manzer, and Vassilios Tzerpos. 2015. Design pattern detection using FINDER. In Proceedings of the 30th Annual ACM Symposium on Applied Computing (SAC ’15). ACM, New York, NY, USA, 1586-1593. DOI:
  4. Nija Shi and Ronald A. Olsson. 2006. Reverse Engineering of Design Patterns from Java Source Code. In Proceedings of the 21st IEEE/ACM International Conference on Automated Software Engineering (ASE ’06). IEEE Computer Society, Washington, DC, USA, 123-134. DOI:


(Last Update: December, 2018)
