Unsupervised learning relies on finding relevant information in large databases. This is possible thanks in part to groundbreaking research by School of Computer Science Professor Santosh Vempala and his collaborators 20 years ago.
The team won a test-of-time award for their impactful work on information retrieval. The annual conference on Principles of Database Systems (PODS) gave their paper, "Latent Semantic Indexing: A Probabilistic Analysis," its Gems of PODS honor at this year's conference in May.
The 1998 paper analyzed a popular spectral algorithm and introduced the first topic model, now a standard tool in unsupervised learning.
The researchers discovered that if a database or corpus is viewed as a matrix, a computer algorithm can perform singular-value decomposition, a matrix reduction technique that pulls out the most significant directions to explain the data. This step not only involves minimal distortion of the data but actually yields better retrieval results than using the full original matrix.
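A minimal sketch of that reduction step, using NumPy on a toy term-document matrix (the counts and term labels are illustrative, not from the paper): the matrix is decomposed, only the top few directions are kept, and the result is the best low-rank approximation of the original data.

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
# (Counts are made up for illustration.)
A = np.array([
    [2, 1, 0, 0],   # "matrix"
    [1, 2, 0, 1],   # "retrieval"
    [0, 0, 3, 1],   # "neural"
    [0, 1, 2, 2],   # "learning"
], dtype=float)

# Singular-value decomposition: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k most significant directions ("topics").
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# By the Eckart-Young theorem, A_k is the best rank-k
# approximation of A in the Frobenius norm; the error equals
# the norm of the discarded singular values.
error = np.linalg.norm(A - A_k)
```

Retrieval then compares queries against the columns of the reduced matrix `A_k` (or, equivalently, in the k-dimensional latent space) rather than against the raw counts.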
Their topic model was able to identify the original underlying topics. The model and its guarantees have been significantly enhanced in the decades since the paper was published.
“This was one of the first provable techniques for automatically extracting information from data,” Vempala said.
Their work has influenced prominent computing fields such as spectral methods, data mining, machine learning, and deep neural networks.
Vempala wrote the paper when he was a summer intern at IBM with Prabhakar Raghavan (now VP of Engineering at Google), together with Columbia University Professor Christos Papadimitriou (then at Berkeley), and Meiji University Professor Hisao Tamaki. Papadimitriou gave the award talk.