Sejong University


Research News

2025.03.27

The development of robotic manipulation task learning based on the foundation model in order to understand and reason about task situations


Professor Gu Young-hyeon, Department of Artificial Intelligence and Data Science


1. Introduction


Recent advances in Large Language Models (LLMs), which have shown exceptional capabilities in understanding and generating natural language, have brought revolutionary changes to the field of artificial intelligence, and they are being utilized across various industries and applications. In particular, the integration of LLMs with robotics has become a highly noteworthy topic, with immense potential to enable autonomous decision-making in complex environments and intuitive interaction with humans. Robot learning, which allows machines to learn and adapt to tasks on their own from data, has traditionally relied on sensor data and structured commands. With the advent of LLMs, however, robots can now understand unstructured natural language commands and thereby perform complex tasks.


Figure 1. Google DeepMind's AutoRT [1], which can understand commands, perform tasks, interpret previously unseen objects in its surroundings, and autonomously generate commands via an LLM


For instance, if a user says, “Clean the table,” the LLM analyzes the meaning of the sentence, generates a task plan that the robot can execute, and further creates the action code required to carry out the task, making it possible to control the robot. A robot's task plan clearly defines the specific mission to accomplish and sequences the necessary component actions in an optimal order. Previously, algorithms such as the Hungarian algorithm were commonly used to perform robot task planning efficiently, or task planning datasets, such as assembly plans, were used to train deep learning models. Recently, the reasoning capabilities of large language models have made it possible to give a robot a task and plan the component actions required to execute it.
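To make this concrete, the sketch below shows one way a command could be turned into a structured plan and dispatched to robot action primitives. This is only an illustrative mock-up: the llm_plan function, the JSON plan format, and the pick/place/wipe primitives are assumptions for the example, not the system described here.

```python
# A minimal sketch of mapping a natural-language command to robot action code.
# The LLM call is mocked; pick/place/wipe are hypothetical primitives standing
# in for a real robot control API.
import json

def pick(obj):            # hypothetical primitive: grasp an object
    print(f"picking {obj}")

def place(obj, target):   # hypothetical primitive: put an object somewhere
    print(f"placing {obj} on {target}")

def wipe(surface):        # hypothetical primitive: wipe a surface
    print(f"wiping {surface}")

PRIMITIVES = {"pick": pick, "place": place, "wipe": wipe}

def llm_plan(command: str) -> str:
    """Stand-in for an LLM that turns a command into a JSON task plan."""
    # In practice this would be a call to a large language model.
    return json.dumps([
        {"action": "pick",  "args": ["cup"]},
        {"action": "place", "args": ["cup", "shelf"]},
        {"action": "wipe",  "args": ["table"]},
    ])

def execute(command: str) -> None:
    # Parse the plan and dispatch each step to the matching primitive.
    for step in json.loads(llm_plan(command)):
        PRIMITIVES[step["action"]](*step["args"])

execute("Clean the table")
```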


Large language models reason based on the knowledge contained in their pre-training datasets, so they may lack awareness of recent information or knowledge that was not included during training. As a consequence, reasoning about such information may be inaccurate or incomplete. In other words, the results of the reasoning process may lack precision when a task involves environments or products that were not part of a model's training.


Figure 2. Concept diagram of large-AI-model-based robot task learning technology for understanding and reasoning about task situations


This study therefore aims to propose a direction for addressing the inaccuracy of reasoning results derived from large language models.


2. Chain-of-Thought


The chain-of-thought approach enhances the reasoning performance of language models without retraining them on additional datasets in specific domains. Instead, it guides the models to derive solutions in the desired field by providing a few examples [2].


Figure 3. A comparison of a standard prompt and a chain-of-thought prompt. CoT logically breaks down problems and enhances the accuracy of the results [2]


This approach plays a crucial role in generating task plans for robotic AI systems. Beyond simply understanding commands, large language models can design robotic task plans through a chain of thought: the process of analyzing and synthesizing information step by step to reach a conclusion, much like the flow of human thought.


When a user commands, “Please clean the table,” a large language model analyzes the intent of the sentence, plans where the robot should move objects, optimizes the sequence of tasks, and produces a systematic plan. It also helps the robot select the most efficient and appropriate method among the various possible approaches to the task. This is more flexible and intuitive than traditional methods because it mimics human logical thinking: breaking down the problem, analyzing it, and determining an optimal solution. The chain-of-thought technique also helps the robot respond effectively in unexpected situations. Rather than simply moving objects, it infers the reason behind cleaning the table from the user's intent and thereby designs a task plan with an optimal sequence grounded in logic and reasoning.
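As a rough illustration of the prompting difference, the sketch below contrasts a standard prompt with a chain-of-thought prompt for task planning. The prompt wording, example reasoning, and action names are hypothetical, not taken from the study or from [2].

```python
# A minimal sketch of a chain-of-thought prompt for robot task planning.
# A real system would send the prompt to an LLM and parse the returned plan.

standard_prompt = (
    "Command: Please clean the table.\n"
    "Task plan:"
)

cot_prompt = (
    "Command: Please tidy the desk.\n"
    "Reasoning: The desk has a mug and loose papers. The mug belongs in the\n"
    "kitchen, so move it first; then stack the papers and wipe the surface.\n"
    "Task plan: 1) pick(mug) 2) place(mug, kitchen) 3) stack(papers) 4) wipe(desk)\n"
    "\n"
    "Command: Please clean the table.\n"
    "Reasoning:"  # the model is asked to reason step by step before planning
)

print(cot_prompt)
```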


The Chain-of-Thought technique elevates a robot's task execution ability, enables it to perform tasks efficiently in complex environments, and plays a significant role in enhancing collaboration with humans.


3. Unseen Object Recognition Technology


An unseen object, also called an unknown object, is an object that an artificial intelligence (AI) model has not been trained on. This includes objects or classes that were not included in the model's training data, that is, objects with characteristics not observed in the existing dataset. In reality, there are many objects and situations that training data cannot cover. Collecting and labeling data for every object is inefficient in terms of cost, time, and resources, and it is further limited by privacy concerns and data accessibility issues. There is therefore a need for unseen object recognition technology that can recognize objects a model was not trained on. The classes of these unseen objects are defined through an open vocabulary.


The concept of an open vocabulary emerged from machine learning and natural language processing (NLP) as an attempt to overcome the limitations of the traditional closed-vocabulary approach. NLP models traditionally relied on a predefined, fixed vocabulary to process language, which limited their ability to deal with unknown words or domain-specific terms. Open vocabulary was proposed to overcome this constraint: it allows a model to learn new words and concepts flexibly during training without being restricted to a fixed vocabulary set. In systems that learn from multimodal data, such as text, images, video, and audio, it is difficult to capture all concepts with a predefined vocabulary, and the need to expand the vocabulary by learning new formats and contexts has driven the concept of an open vocabulary.
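As one possible illustration of open-vocabulary recognition, the sketch below scores an image region against free-form text labels with a CLIP-style vision-language model, shown here through the Hugging Face transformers API. The model name, image file, and candidate labels are assumptions for the example only, not the components used in this work.

```python
# A minimal open-vocabulary recognition sketch with a CLIP-style model.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate classes come from free-form text, so previously unseen objects
# can be named without retraining the vision model.
candidate_labels = ["a stapler", "a coffee mug", "a screwdriver", "a potted plant"]

image = Image.open("scene_crop.jpg")  # hypothetical crop from the robot camera
inputs = processor(text=candidate_labels, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-text similarity scores

best = candidate_labels[probs.argmax().item()]
print(f"predicted open-vocabulary label: {best}")
```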


Unseen object recognition plays a key role in developing general-purpose AI models that, much like humans, can adapt flexibly to new situations without relying solely on existing data. This technology is expected to be an essential element in achieving general AI, and it will continue to be researched in both academia and industry. In particular, it plays an important role in the convergence of large language models and robotics, and it is expected to become a core technology in fields such as autonomous driving, security, and healthcare.


Robots require vast amounts of training data and highly specialized datasets to perform various tasks. However, specialized datasets for specific tasks are often scarce, and acquiring them takes considerable time and resources. Unseen object recognition has therefore attracted strong attention in robotics research as a way to overcome slow adaptation to new environments and tasks.


By applying unseen object recognition, robots can recognize new objects they were not trained on and perform tasks in various environments, and their performance remains stable even when the work environment changes or new objects are introduced. There is also no need to collect large datasets or retrain the model in order to recognize and manipulate various objects, so robots can expand their range of tasks and their task efficiency, which significantly reduces the time and cost of data labeling and model training. Unseen object recognition enables robots to understand objects based on patterns and features rather than relying only on pre-trained data, allowing them to quickly recognize and respond to new objects without retraining and facilitating flexible human-robot interaction in dynamic environments.


4. Retrieval-Augmented Generation


Retrieval-Augmented Generation (RAG) is an AI technique that combines large language models (LLMs) with information retrieval systems to answer questions or perform specific tasks. Although large language models are actively used, they still face several key challenges, such as hallucination, difficulty updating knowledge, and a lack of domain-specific expertise, which cause problems in robotics as well as in other areas. Retrieval-augmented generation, which enhances LLMs with external knowledge databases, emerged to address these issues [3]. A retrieval-augmented generation model searches an external database for information relevant to the given input; the retrieved information is then combined with the input data via fusion techniques, and finally the generator makes predictions based on the input and the retrieved information.


Figure 4. Overview of Retrieval-Augmented Generation, which is primarily composed of the Retriever, Retrieval Fusions, and Generators [3]


4.1. Retriever


The Retriever is the component responsible for finding documents or information in a database that are highly relevant to the input query. Highly optimized search algorithms are used at this stage so that the necessary information can be obtained quickly and accurately.
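A minimal retriever can be sketched as ranking knowledge-base entries by their similarity to the query. The example below uses a simple TF-IDF index so that it is self-contained; production RAG retrievers typically use dense neural embeddings with an approximate nearest-neighbor index. The knowledge-base sentences are invented for illustration.

```python
# A minimal retriever sketch: rank documents in an external knowledge base by
# similarity to the query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "A microfiber cloth is used for wiping dust off flat surfaces.",
    "A label printer produces adhesive labels for office folders.",
    "A docking station connects a laptop to external monitors.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(knowledge_base)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k knowledge-base entries most similar to the query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [knowledge_base[i] for i in top]

print(retrieve("What is a microfiber cloth used for?"))
```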


4.2. Retrieval Fusions


Retrieval Fusions aim to enhance the generation process by incorporating the retrieved information. Fusion techniques are generally categorized into query-based fusion, latent fusion, and logits-based fusion. Query-based fusion augments the input by adding the retrieved information before feeding it into the generator. Latent fusion improves a model's performance by injecting the retrieved representations into the generator's latent representations. Logits-based fusion operates on the generator's output logits, merging them with logits derived from the retrieved information to produce more robust logits.
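Query-based fusion, the simplest of the three, can be sketched as prepending the retrieved passages to the command before it is sent to the generator. The retrieve function below is the hypothetical retriever from the previous sketch, and the prompt wording is illustrative.

```python
# A minimal query-based fusion sketch: retrieved passages are prepended to the
# user command to form the prompt that is passed to the generator.

def build_augmented_prompt(command: str) -> str:
    context = "\n".join(retrieve(command))  # retriever from the earlier sketch
    return (
        "Use the following object descriptions when planning the task.\n"
        f"Context:\n{context}\n\n"
        f"Command: {command}\n"
        "Task plan:"
    )

print(build_augmented_prompt("Put the microfiber cloth back in the supply drawer"))

# Logits-based fusion, by contrast, would combine output distributions, e.g.
# fused_logits = alpha * generator_logits + (1 - alpha) * retrieval_logits
```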


4.3. Generators


Generators are categorized into basic generators and retrieval-augmented generators. A basic generator is typically a large pre-trained or fine-tuned language model, such as a GPT-series or Gemini-series model. A retrieval-augmented generator is a pre-trained or fine-tuned generator that includes a module for fusing the retrieved information.


Large language models rely on pre-trained knowledge for inference, so they have no information about newly updated knowledge. In addition, even if the name of an object that appears for the first time in a task environment, or that was never learned, is recognized through unseen object recognition technology, the model still struggles to make accurate inferences because it has never learned the object's definition or function. In robot tasks, a retrieval-augmented generation model helps a large language model reason more accurately about task manipulation planning by using information about new or previously unseen objects in the task environment.


Figure 5. Task plan generation based on ambiguous command inference through technologies, such as Chain-of-Thought, Unseen Object Recognition, and Retrieval-Augmented Generation


5. In-Context Learning 


In-Context Learning refers to the ability of large language models to learn and perform task patterns from examples provided in the given context, without additional fine-tuning for new tasks. This approach maximizes the capabilities of the language model and is designed to enable rapid adaptation to specific tasks. When In-Context Learning is applied, the prompt used as input is the text that instructs the large language model on the task to perform or provides the necessary context. The prompt is designed to help the model understand what task it needs to perform and to generate an appropriate response.
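A minimal illustration of in-context learning for task planning is a few-shot prompt in which worked examples precede the new command. The example commands, plans, and action names below are hypothetical and serve only to show the prompt structure.

```python
# A minimal in-context learning sketch: a few worked examples are placed in the
# prompt so the model can imitate the pattern for a new command, with no
# fine-tuning.

examples = [
    ("Bring me the stapler from the meeting room",
     "1) move_to(meeting_room) 2) pick(stapler) 3) move_to(user) 4) handover(stapler)"),
    ("Throw away the empty cup on my desk",
     "1) move_to(desk) 2) pick(empty_cup) 3) move_to(trash_bin) 4) drop(empty_cup)"),
]

def build_icl_prompt(new_command: str) -> str:
    shots = "\n\n".join(
        f"Command: {cmd}\nTask plan: {plan}" for cmd, plan in examples
    )
    return f"{shots}\n\nCommand: {new_command}\nTask plan:"

print(build_icl_prompt("Deliver this document to the printer room"))
```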


Figure 6. An example of In-Context Learning [4]


Traditional robots often relied on reinforcement learning and supervised learning, but these methods come with challenges such as the time-consuming generation of training data, long training times, expensive sensors, the difficulty of setting up simulation environments, and heavy computational resource requirements. Furthermore, robots need adaptability to perform a variety of tasks, and it is inefficient to retrain or reprogram a robot every time it must perform a new task.


In-Context Learning is a key technology that helps robots flexibly adapt to complex, changing environments and efficiently perform various tasks, much like humans. For example, a robot model used in a home environment may fail to generate appropriate task plans when it is moved to an office environment because the environment has changed. In this case, when In-Context Learning is applied and just a few examples of the office environment are provided, the robot can quickly adapt to the new environment. This enables efficient application not only in office settings but also in various service environments, such as manufacturing plants.


Figure 7. A comparison between an example of applying In-Context Learning to an LLM in robotic tasks and an example of applying In-Context Learning to an LLM-based ambiguous command understanding and inference model


6. The Development of Task Situation Inference LLM Robot Technology


The technologies needed to overcome the various limitations of large-scale AI models in robotics were examined above. This section describes the potential effects of applying these technologies together.


Chain-of-Thought technology generates a robot's task plan with logic and grounding, but it can still produce incomplete or erroneous plans because it relies on the pre-learned knowledge of a large language model. Unseen object recognition helps the robot better understand dynamic task environments by informing the large language model of an object's class, that is, its name. However, knowing just the name of the object is not enough; understanding its function and how it interacts with the environment is crucial. Retrieval-augmented generation is therefore applied to perform more accurate step-by-step inference on top of unseen object recognition and Chain-of-Thought. Lastly, In-Context Learning is introduced so that these technologies can rapidly adapt to various environments, as sketched below.
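The sketch below illustrates, at a very high level, how these four components could be chained into a single planning pipeline. Every function is a hypothetical stand-in for the corresponding module described above, not the implementation developed in this work.

```python
# A minimal sketch chaining unseen object recognition, retrieval-augmented
# generation, in-context examples, and chain-of-thought prompting.

def recognize_unseen_objects(camera_image) -> list[str]:
    # Open-vocabulary recognition (e.g., a CLIP-style model) would go here.
    return ["label printer", "microfiber cloth"]

def retrieve_object_knowledge(objects: list[str]) -> str:
    # Retrieval-augmented generation: look up definitions and functions of
    # objects the LLM has never learned about.
    return "A label printer produces adhesive labels; a microfiber cloth wipes surfaces."

def build_prompt(command: str, objects: list[str], knowledge: str) -> str:
    # In-context examples would precede this text; a chain-of-thought
    # instruction asks the model to reason before planning.
    return (
        f"Objects in the scene: {', '.join(objects)}\n"
        f"Object knowledge: {knowledge}\n"
        f"Command: {command}\n"
        "Think step by step, then output a numbered task plan."
    )

def generate_task_plan(command: str, camera_image=None) -> str:
    objects = recognize_unseen_objects(camera_image)
    knowledge = retrieve_object_knowledge(objects)
    prompt = build_prompt(command, objects, knowledge)
    return prompt  # in practice, this prompt would be sent to the LLM

print(generate_task_plan("Tidy up the shared office desk"))
```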


Figure 8. An example of a human-robot interface


7. Conclusion


Robotic automation technologies for real-time response are the subject of active international competition and are being researched and developed by domestic institutions, but they have not yet reached a level applicable to real-world environments. Demand for robotic automation systems is increasing in various sectors, including manufacturing, office environments, and service industries, driven by labor shortages and the diversification of service demands. By analyzing tasks with robotic technologies based on large language models and adapting flexibly to environmental changes, such systems can be expected to become universally applicable in diverse settings. They are also expected to lower technological access barriers for small and medium-sized enterprises, support and enhance their technologies, and increase automation rates across industries rather than in specific sectors alone. Finally, by extending automation technologies to specialized medical services and to public services that assist people who cannot manipulate objects independently due to illness or aging, they are expected to broaden the scope of service applications and improve service quality.


8. References


[1] Ahn, M., Dwibedi, D., Finn, C., Arenas, M. G., Gopalakrishnan, K., Hausman, K., ... & Xu, Z. (2024). AutoRT: Embodied foundation models for large scale orchestration of robotic agents. arXiv preprint arXiv:2401.12963.


[2] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., ... & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, 24824-24837.


[3] Wu, S., Xiong, Y., Cui, Y., Wu, H., Chen, C., Yuan, Y., ... & Xue, C. J. (2024). Retrieval-augmented generation for natural language processing: A survey. arXiv preprint arXiv:2407.13193.


[4] Dong, Q., Li, L., Dai, D., Zheng, C., Ma, J., Li, R., ... & Sui, Z. (2022). A survey on in-context learning. arXiv preprint arXiv:2301.00234.

