Data analytics are widely used in scientific domains, government entities, and commercial organizations. But they are only effective if the data is clean and easy to use. This is where database researchers like Xu Chu come in.
Chu joined the School of Computer Science as an assistant professor in Spring 2018 to work on problems like these.
“Valuable insights can be gained by analyzing relevant data, but there are a lot of steps to go through for that to happen,” Chu said. These steps include data preprocessing (data curation, data cleaning, and feature engineering and selection) and post-processing (model interpretation and evaluation, and model maintenance).
Chu first encountered databases in the third year of his undergraduate degree at Nanjing University in China. He didn’t really delve into the research, though, until he studied abroad at the University of Waterloo in Canada, whose computer science department is known for its database expertise.
Chu was so engrossed in the practical problems of data management that he decided to complete his Ph.D. at Waterloo, too. His graduate study focused on data cleaning methods, from rule-based approaches to advanced statistical and learning-based approaches.
“Errors can happen in various forms, so there’s not really one solution that solves all problems,” he said. “I am looking at more holistic solutions.”
Mixing ML and databases
Data cleaning is just one of the many painful steps users have to go through to do data analytics. Chu is building a research team to develop algorithms, tools, and systems to make the entire data analytics pipeline more usable and accessible, in particular the machine learning pipeline.
“Every step needs human attention, and every step is difficult if you don’t have the right tools,” Chu said.
Building these diverse tools for analytics requires expertise in more than just databases. Aside from a database course and a data mining course in his undergraduate years, Chu had to learn the area on his own, an experience he brings to the graduate course he teaches.
For every class, he provides an overview of the topic, then tasks his students with researching the technical details to the point where they know the material better than he does. He wants to teach them not just database knowledge but also the study skills they will need for any future research they might pursue.
“I am a strong believer in picking a right problem, and learning whatever is necessary to solve it,” he said. “Students should not be afraid of anything.”