Development of a Software Library and Web Application for Rule-based Data Smell Detection

Low-quality data can have significant consequences for companies. As machine learning approaches become more widely used, quality control of training data becomes increasingly essential. The term data smell denotes an indication of poor data quality caused by violations of recommended best practices, poor quality of data sources, or poor data handling in preceding processes. Software developers and data scientists currently cannot make practical use of data smells because no tool support exists for detecting them. This bachelor thesis aims to close this gap. Specifically, a software library will be developed and made openly available so that developers can integrate data smell detection into their own software applications. In addition, a web application will be developed that allows users to check their datasets for the presence of data smells. The implementation will be written in Python, since it is a widely used language in the field of machine learning. Finally, the web application and the smell detection will be evaluated with real-world datasets.
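
To illustrate what rule-based data smell detection could look like in Python, the following minimal sketch applies a few rules to the columns of a small dataset and reports the row indices at which each rule fires. It is not the library developed in this thesis: the names SmellRule, detect_smells and RULES, as well as the three example rules, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class SmellRule:
    """A hypothetical rule: a name plus a predicate that flags a single value."""
    name: str
    predicate: Callable[[Any], bool]


def is_missing_placeholder(value: Any) -> bool:
    # Strings that commonly stand in for a missing value (illustrative list).
    placeholders = {"", "n/a", "na", "null", "none", "-"}
    return isinstance(value, str) and value.strip().lower() in placeholders


def is_number_as_string(value: Any) -> bool:
    # A string whose content parses as a number.
    if not isinstance(value, str):
        return False
    try:
        float(value.strip())
    except ValueError:
        return False
    return True


def has_surrounding_whitespace(value: Any) -> bool:
    # A non-empty string with leading or trailing whitespace.
    return isinstance(value, str) and value.strip() != "" and value != value.strip()


# Assumed rule set for this sketch; a real catalogue would be larger.
RULES = [
    SmellRule("missing_placeholder", is_missing_placeholder),
    SmellRule("number_as_string", is_number_as_string),
    SmellRule("surrounding_whitespace", has_surrounding_whitespace),
]


def detect_smells(
    columns: Dict[str, List[Any]], rules: List[SmellRule]
) -> Dict[str, Dict[str, List[int]]]:
    """Return, per column and rule name, the row indices where the rule fires."""
    report: Dict[str, Dict[str, List[int]]] = {}
    for column_name, values in columns.items():
        for rule in rules:
            hits = [i for i, value in enumerate(values) if rule.predicate(value)]
            if hits:
                report.setdefault(column_name, {})[rule.name] = hits
    return report


example_columns = {
    "age": [34, "27", 41, "n/a"],
    "city": ["Berlin", " Vienna", "", "Graz"],
}

report = detect_smells(example_columns, RULES)
print(report)
```

Keeping each rule as an independent predicate means new smells can be added without changing the detection loop, which is one plausible way to make such a library extensible.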