SEOUL -- South Korean artificial intelligence developer Upstage has launched an alliance of private and public sector partners, including media companies, academic institutions, and corporations, to establish an ecosystem for Korean language-based large language models capable of understanding and generating natural conversations in Korean.
A large language model (LLM) is a computer program that understands and generates human-like text. Through machine learning, an LLM can engage in conversations, answer questions on different topics, and translate languages. An LLM is typically built on a billion or more parameters, the internal variables the model learns during training. OpenAI's ChatGPT, one of the world's most famous LLM-based chatbot services, is known to have some 100 billion parameters.
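To make the notion of "parameters" concrete, here is a minimal sketch in Python that counts the learned weights of a toy network. The layer sizes are hypothetical and far smaller than any real LLM; the point is only that parameters are the numbers a model learns, and their count grows with model size.

```python
def count_parameters(layer_sizes):
    """Count parameters of a simple fully connected network.

    Each dense layer mapping n_in inputs to n_out outputs has
    n_in * n_out weights plus n_out bias terms.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# A toy network: 512-dim input -> 1024 hidden units -> 512 outputs.
print(count_parameters([512, 1024, 512]))  # 1,050,112 parameters
```

Scaling the same arithmetic up to the transformer layers of a real LLM is what yields the billions of parameters cited above.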
While English language-based LLM developers can draw on abundant training data, Korean language-based AI developers have a harder time training their models because far fewer datasets are available for the language, which is spoken by some 77 million people worldwide.
Upstage said in a statement on August 14 that it has launched the "1T Club," an alliance designed to address the shortage of Korean-language data and reduce South Korea's dependency on foreign LLM solutions through the development of a high-performance LLM. "1T" stands for one trillion tokens, and some 20 partner companies and organizations will contribute tokens to the alliance in the form of news articles, academic journals, novels, and other texts.
"A token is like a raw material for data. It is processed into parameters and corpora for machine learning. The more tokens we gather, the more accurate an LLM becomes," Bae Sung-beom, Upstage's communications manager, told Aju Korea Daily.
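A rough sense of how text is counted in tokens can be sketched in a few lines of Python. Production LLM pipelines use subword tokenizers such as byte-pair encoding rather than whitespace splitting, so this is an illustration of the counting, not of any tokenizer Upstage actually uses.

```python
def count_tokens(texts):
    """Count tokens across a corpus, approximating one token per
    whitespace-separated word (real tokenizers split into subwords)."""
    return sum(len(text.split()) for text in texts)

# A tiny example corpus; the 1T Club aims for one trillion such tokens.
corpus = [
    "A token is like a raw material for data.",
    "More tokens make a language model more accurate.",
]
print(count_tokens(corpus))  # 17 tokens across the two sentences
```

Contributed news articles, journals, and novels would be tallied the same way, with the alliance's one-trillion-token target measured against such counts.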
LLMs operated by big tech companies are trained mostly on data in foreign languages such as English, so they lack a thorough knowledge of Korean culture and regional information. By building up a large pool of Korean-language tokens, Upstage expects to establish a high-performance LLM. Such a model, capable of understanding Korean culture, could then be used in a wide range of generative AI applications in South Korea.
"Through the 1T Club, we will do our best to upgrade South Korea's AI capabilities and solidify its position as a top player in the global AI industry," Upstage said, adding that the company will focus on resolving intellectual property (IP) issues that arise when AI models learn from web-crawled data. "We will create an ecosystem where both data providers and model creators can benefit," the AI company said.
Upstage's language model with about 70 billion parameters topped Hugging Face's AI language model leaderboard on August 1, beating OpenAI's GPT-3. GPT-3, which has 175 billion machine learning parameters, uses deep learning to produce human-like text. Upstage's model scored 72.3 points against GPT-3's 71.9. The score was determined across four categories -- reasoning, common sense, language comprehension, and hallucination prevention.