Georgia Tech researchers have developed a new algorithm to mitigate bias in one of the first steps of the machine learning (ML) process. Known as fair principal component analysis (PCA), the new algorithm runs as fast as existing PCA methods but can reduce bias in low-dimensional representations of large datasets.
Bias is one of the most pressing issues as ML is used for everything from image classification to loan decisions. Although there are plenty of stories about obvious bias, like ML algorithms showing only images of white men when queried with the term “CEO,” much of the bias is more insidious.
Many researchers believe unfair ML is the result of biased data or faulty algorithms, but Tech researchers determined it can start as early as the data processing step.
Reducing the dimension, increasing bias
High-dimensional data is often the start of the problem. When a dataset needs to be mathematically represented, each feature is represented as one dimension. For example, a 200x200-pixel image transforms into a vector with 40,000 dimensions. Such a large representation is often too difficult to process efficiently, so computer scientists use standard PCA to reduce the dimension while keeping as much information from the original dataset as possible.
PCA works by looking for the main directions along which the data is distributed and projecting the data onto those directions. Scientists then evaluate the accuracy by determining how far the projection is from the original data, or how much information is being lost.
Although this makes the data easier to work with, the low-dimensional representation can be biased, according to the researchers. They ran PCA on a dataset of 1,300 images of males and females and calculated the average error for the different populations. The representation was consistently more accurate for the male population than for the female population.
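The projection and error measurement described above can be illustrated with a minimal sketch. It assumes NumPy, uses small random arrays as stand-ins for images, and the helper name `pca_project` is hypothetical. It shows standard PCA only, not the researchers' code.

```python
import numpy as np

def pca_project(X, d):
    """Project the rows of X onto the top-d principal directions.

    Returns the low-dimensional representation, the reconstruction in the
    original space, and the mean squared reconstruction error per row.
    """
    mean = X.mean(axis=0)
    Xc = X - mean  # center the data
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:d].T                # (n_features, d)
    Z = Xc @ V                  # low-dimensional representation
    X_hat = Z @ V.T + mean      # projection mapped back to the original space
    error = float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))
    return Z, X_hat, error

# Synthetic stand-in data: 50 random 20x20 "images" (an assumption for illustration).
rng = np.random.default_rng(0)
images = rng.random((50, 20, 20))
X = images.reshape(len(images), -1)  # each image becomes a 400-dimensional vector

Z, X_hat, err = pca_project(X, d=10)
print(X.shape, Z.shape, err)
```

The returned error is the "how much information is being lost" quantity: it shrinks as `d` grows and reaches zero once enough directions are kept to span the centered data.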
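A per-population comparison of this kind can be sketched as follows. The data here is synthetic and deliberately constructed so the two groups vary along different directions. The group labels, sizes, and the helper name `per_group_pca_error` are assumptions for illustration, not the researchers' dataset or code.

```python
import numpy as np

def per_group_pca_error(X, labels, d):
    """Run one PCA on all rows of X, then report the mean squared
    reconstruction error separately for each group in `labels`."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:d].T
    residual = Xc - Xc @ V @ V.T          # part of each row lost by the projection
    sq_err = np.sum(residual ** 2, axis=1)
    return {g: float(sq_err[labels == g].mean())
            for g in np.unique(labels).tolist()}

# Synthetic groups: "A" is larger and varies mostly along axes 0-1,
# "B" is smaller and varies mostly along axes 2-3.
rng = np.random.default_rng(0)
scale_a = np.array([5.0, 5.0, 0.1, 0.1, 0.1, 0.1])
scale_b = np.array([0.1, 0.1, 1.0, 1.0, 0.1, 0.1])
X = np.vstack([rng.normal(scale=scale_a, size=(200, 6)),
               rng.normal(scale=scale_b, size=(50, 6))])
labels = np.array(["A"] * 200 + ["B"] * 50)

errors = per_group_pca_error(X, labels, d=2)
print(errors)
```

Because group A dominates the pooled variance in this construction, the two retained directions follow A's structure, so group B is reconstructed less accurately even though the same projection is applied to everyone.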
“If you’re already representing one population much better at the preprocessing step, it injects some bias no matter what you’re trying to do,” School of Computer Science (SCS) Ph.D. student Samira Samadi said.
One way to combat this bias is to define what a fair projection means. The researchers evaluate this through marginal error, which measures how far the output projection is from the best projection for each population in the data. They concluded that, for a projection to be considered fair, the maximum marginal error across the populations must be low, or the marginal errors must be equal.
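One reading of the marginal error can be sketched as the gap between a group's reconstruction error under a shared projection and its error under the best projection fitted to that group alone. The sketch below assumes NumPy, data that is already mean-centered, and hypothetical helper names. It only evaluates the fairness criterion for a given projection and does not implement the fair PCA algorithm itself.

```python
import numpy as np

def top_directions(X, d):
    """Top-d right singular vectors of X as columns (X assumed centered)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:d].T

def recon_error(X, V):
    """Mean squared error per row after projecting X onto the columns of V."""
    return float(np.mean(np.sum((X - X @ V @ V.T) ** 2, axis=1)))

def marginal_error(X_group, V, d):
    """Extra error a group incurs under V compared with its own best rank-d projection."""
    best = top_directions(X_group, d)
    return recon_error(X_group, V) - recon_error(X_group, best)

# Synthetic, roughly centered groups that vary along different axes.
rng = np.random.default_rng(0)
A = rng.normal(scale=[5.0, 5.0, 0.1, 0.1, 0.1, 0.1], size=(200, 6))
B = rng.normal(scale=[0.1, 0.1, 1.0, 1.0, 0.1, 0.1], size=(50, 6))

d = 2
V_shared = top_directions(np.vstack([A, B]), d)   # standard PCA on pooled data
marg = {"A": marginal_error(A, V_shared, d),
        "B": marginal_error(B, V_shared, d)}
worst = max(marg.values())                         # the quantity a fair projection keeps low
print(marg, worst)
```

A projection fitted to a single group has a marginal error of zero for that group by construction, so the maximum over groups captures how much the worst-served population loses by sharing a projection.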
This became the basis for the new fair PCA algorithm.
“If you use PCA and care about fairness, now you can use fair PCA,” Samadi said.
The researchers presented the work at one of the leading machine learning conferences of the year, the Conference on Neural Information Processing Systems (NeurIPS), held in Montreal Dec. 2-8. Samadi co-authored the paper, “The Price of Fair PCA: One Extra Dimension,” with SCS Ph.D. student Uthaipon (Tao) Tantipongpipat, School of Industrial and Systems Engineering Associate Professor Mohit Singh, SCS Assistant Professor Jamie Morgenstern, and SCS Professor Santosh Vempala.