South Korea to Open Modern-Era Texts and Folk Painting Data for AI Training

by Yoon Juhye Posted: February 9, 2026, 00:03 Updated: February 9, 2026, 00:03
Exterior view of the National Library of Korea [Photo=National Library of Korea]

The South Korean government is accelerating efforts to build high-quality training data for the artificial intelligence industry, starting with texts that can be used without copyright concerns. 

The National Library of Korea said on Saturday it will open a new section on its website, called “Shared Bookshelf,” around March to provide text data that private companies can freely use for AI development. 

A library official said the institution has converted its digitized holdings into formats suitable for AI training, such as text files, focusing on materials whose copyright issues have been resolved. The official added that the data could be released as early as March.

The release will be limited to publications issued in South Korea, mostly from the modern era. The collection will center on works whose copyright protection has expired and on materials published by the library itself. Under South Korea’s Copyright Act, protection lasts 70 years after an author’s death, so many works from the early 1900s are expected to be included. The official said the main categories will include modern-era magazines and literature, classical literature and textbooks.

The data will also be provided to the Ministry of Science and ICT’s “Independent AI Foundation Model Project.” An independent AI foundation model is a general-purpose AI model trained and operated directly with domestic technology and resources.
 
Major national libraries overseas are also moving to build and open AI training data. Sweden’s National Library, which opened in 1661, has used text accumulated over about 500 years, including medieval manuscripts, to build more than 20 open-source transformer models through an affiliated research institute. Up to 200,000 developers a month reportedly use them for research and technology development.
 
Tiger character before training (left) and after training [Photo=Korea Heritage Service Foundation]

The government is also speeding up work on image datasets. The Korea Heritage Service Foundation, an affiliate of the Korea Heritage Service, said it has prepared high-quality training data designed to capture the distinct characteristics of traditional Korean folk paintings, known as minhwa, through its “Korean Traditional Minhwa Production Data Project.” Existing generative AI models have shown limitations, such as distorting or inaccurately depicting minhwa styles and motifs.

The dataset includes 3,779 minhwa images organized by genre (such as flower-and-bird paintings, landscapes, tiger-and-magpie paintings and bookshelf paintings), along with 5,340 detailed description images and 77,388 Korean-English multimodal caption entries. The foundation defined multimodal caption data as training data that pairs images with information about the artwork so that AI can understand, generate and describe them in language. It said it thoroughly verified the artists’ periods and the iconographic systems, drawing on collections including those of the Gahoe Museum of Minhwa.
 
The foundation said it expects the dataset to be used in industrial design and the development of products such as merchandise, in digital content and media art, and in global promotion. The minhwa data will be fully opened on AI Hub in the first half of this year.



* This article has been translated by AI.