Bachelor Thesis: Data Smell Detection with Machine Learning

A data smell is an irregularity in a dataset that can be caused by low-quality or poorly handled data, or by violations of best practices when working with data. The presence of a data smell can, but does not have to, imply deeper issues in the dataset. Several data smells and their descriptions have already been identified in a literature study. However, not all data smells can be detected using simple algorithms. This thesis aims to explore machine learning techniques that may be used to find data smells whose detection depends on a deeper understanding of the data. To achieve this, the identified data smells first have to be evaluated in terms of their detectability using traditional, non-learning algorithms as well as learning algorithms. Once a suitable subset of data smells and corresponding machine learning techniques has been defined, an intuitive interface for applying these methods to a chosen dataset and displaying the results should be provided.

Finally, the chosen techniques and their implementation will be evaluated for effectiveness on one or more test datasets.