Building on the success of OpenAI, Sean Ren, a tenured faculty member at the University of Southern California and an AI researcher, co-created Sahara Labs with the goal of helping those who create the data that feeds AI get a piece of the pie.
The Playa Vista-based company announced on Wednesday it raised $37 million in fresh funding to launch Sahara AI, a blockchain offering that allows people to copyright assets that may be used by artificial intelligence. Sahara Labs has raised $43 million in total, and the company is advised by the likes of Midjourney cofounder Elvis Zhang, Anthropic research scientist Rohan Taori and Together AI chief executive Vipul Prakash.
Ren launched the company in 2023 with the goal of compensating everyone who participates in the AI economy – from those who create the underlying data to those who develop the training models.
“We are seeing rising ethical concerns over things like copyrights, privacy, resource access and economic imbalances. This issue just continues to grow as AI becomes more widely adopted and also more capable,” Ren said. “Stakeholders in this whole ecosystem, including the users, the data contributors, the model creators and the application builders, are all seeking solutions to secure their ownership.”
Who gets paid for generating data?
The merit of generative AI comes primarily from its data, which is becoming increasingly valuable as companies compete for data sets. A 2022 report from the Massachusetts Institute of Technology and Epoch AI projected that machine learning companies will have exhausted all high-quality text data for training large language models by 2026, which would halt AI companies’ ability to scale and evolve.
Acquiring free data from the internet isn’t a stable strategy for companies anymore because players are already scrambling to secure earnings from the data they create or store – social media platforms Reddit and X have imposed strict restrictions around third-party tools and are charging companies for access. In 2023, Getty Images sued Stability AI and the New York Times sued OpenAI over alleged copyright infringement. And several Southern California record labels sued generative AI music makers Udio and Suno in July.
The problem becomes even more severe as generative AI develops specific applications. While ChatGPT is trained on billions of language data points that have been collected over years to convey tone and authenticity, it is much harder for AI to find free data sets for medical or manufacturing applications.
According to Vishal Gupta, an associate professor of data sciences and operations at USC, large language models are often trained using unskilled, low-paid labor. This likely won’t be possible as generative AI becomes more specialized, and the low-risk business model used by first movers won’t be viable.
“If you think about looking at tax fraud cases and understanding if this particular tax filing likely has fraud in it, the only way to generate data about it is to pay a tax expert to look at it and spend their time,” said Gupta, who is also an affiliate faculty member with the USC Center for AI and Society.