Foundation Models for Robotics Series Part 3 - Types of Foundation Models

The third post in this series covers the various types of foundation models that are commonly used in robotics.


In the previous post in this series we looked at the foundational concepts and building blocks used to architect and train foundation models. This post focuses on the types of foundation models that are used in robotics. We divide these models into categories and describe the defining properties and representative examples of each type.

Large Language Models (LLMs)

LLMs are the most common type of foundation model in use today. They have billions of parameters and are trained on trillions of text tokens, and this large scale allows them to achieve state-of-the-art performance on benchmarks such as the General Language Understanding Evaluation (GLUE) [2]. GPT-2 [3] and BERT [4] are early examples of this model type: GPT-2 is a decoder-only model, while BERT is an encoder-only model. Decoder-only models are more common today because they perform next-token prediction given an existing sequence of previous tokens. Successors of this decoder architecture include GPT-3 [5], LLaMA [6] and PaLM [7]. The number of parameters in these models has grown significantly, reaching up to trillions of parameters as with GPT-4 [8]. GPT-3 was trained on Common Crawl, which contains petabytes of publicly available data gathered over 12 years of web crawling [9]. After training, these language models are fine-tuned for chat and instruction following. GPT-3 and GPT-4 use a process known as reinforcement learning from human feedback (RLHF), in which a human instructor guides the model towards a desired output pattern by providing prompts and rating the outputs generated by the model [10].
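
To make next-token prediction concrete, here is a minimal sketch using the Hugging Face transformers library with the publicly available GPT-2 checkpoint; the prompt and generation settings are illustrative only.

```python
# Minimal sketch: greedy next-token prediction with a decoder-only LLM (GPT-2).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "A robot must be able to"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate 20 new tokens; each token is predicted from all previous tokens.
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```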

Vision Transformers (ViT)

A Vision Transformer is a transformer architecture used for computer vision tasks such as object detection, image classification and segmentation [11]. The ViT architecture splits an image into patches and treats each patch as a token. To maintain the spatial relationships between patches, positional information is added to each token, a process known as positional embedding. The tokens and their positional embeddings are fed as a sequence to the transformer encoder, where the self-attention mechanism captures dependencies and patterns across the image. Similar to LLMs, ViT models have also been scaled to large numbers of parameters, such as ViT-22B, a vision transformer with 22 billion parameters [12]. DINO is a self-supervised learning method for training ViT models [13]. It uses a form of knowledge distillation, a learning framework in which a student model is trained to mimic the behavior of a teacher model. Self-supervised ViT features learned using DINO contain explicit information about the semantic segmentation of an image, including scene layout and object boundaries; this clarity is not achieved with supervised ViTs or Convolutional Neural Networks (CNNs).
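
To make the ViT input pipeline described above concrete, here is a minimal PyTorch sketch of splitting an image into patches, projecting them into token embeddings and adding positional embeddings; the image size, patch size and embedding dimension follow the base ViT configuration but are illustrative.

```python
# Minimal sketch of the ViT input pipeline: patchify, embed, add positions.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)            # (batch, channels, height, width)
patch_size, embed_dim = 16, 768
num_patches = (224 // patch_size) ** 2          # 14 x 14 = 196 patches

# A strided convolution performs "split into patches + linear projection" in one step.
to_patch_embeddings = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
patches = to_patch_embeddings(image)            # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)     # (1, 196, 768) -- one token per patch

# Learned positional embeddings preserve the spatial layout of the patches.
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
encoder_input = tokens + pos_embed              # sequence fed to the transformer encoder
print(encoder_input.shape)                      # torch.Size([1, 196, 768])
```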

Multimodal Vision-Language Models (VLMs)

As discussed in previous posts, multimodal refers to the ability of a model to accept different “modalities” of input, such as images, text, audio or video signals. Vision-Language Models are a type of multimodal model that takes both images and text. The most commonly used VLM is the Contrastive Language Image Pre-training (CLIP) model [14]. As the name implies, CLIP was trained using contrastive learning, as discussed in the previous post. Other CLIP variants that are used to build VLMs include BLIP [15], $CLIP^2$ [16], FILIP [17] and FLIP [18]; each introduces its own variation on the training or similarity-matching process between texts and images.
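
To illustrate the contrastive objective behind CLIP, the sketch below computes similarity scores for a batch of image-text pairs using stand-in feature vectors; in the real model the features come from jointly trained image and text encoders, and the temperature value here is illustrative.

```python
# Minimal sketch of CLIP-style contrastive similarity matching.
import torch
import torch.nn.functional as F

batch, dim = 8, 512
image_features = torch.randn(batch, dim)    # stand-in for the image encoder output
text_features = torch.randn(batch, dim)     # stand-in for the text encoder output

# Normalize and compute pairwise cosine similarities, scaled by a temperature.
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)
logits = image_features @ text_features.t() / 0.07    # (batch, batch)

# Matching image-text pairs lie on the diagonal; all other pairs act as negatives.
targets = torch.arange(batch)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```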

Embodied Multimodal Language Models

An embodied agent is an AI system that interacts with a virtual or physical world; examples include virtual assistants, game characters and robots. Embodied language models are foundation models that incorporate real-world sensor and actuation modalities into pre-trained large language models. For example, PaLM-E [19] is a multimodal language model trained simultaneously on internet-scale general vision-language data and on embodied robotics data. PaLM-E is built from an LLM and a ViT: the ViT transforms an image into a sequence of embedding vectors, which are projected into the language embedding space through an affine transformation. The whole model is trained end-to-end starting from a pre-trained LLM and ViT. The outputs of the model are then connected to a robot for control through a high-level control policy.
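
As a rough illustration of how the visual and language streams are combined, the sketch below projects ViT patch embeddings into a language model's embedding space with an affine (linear) map and concatenates them with text token embeddings; the dimensions and prefix ordering are illustrative, not the exact PaLM-E configuration.

```python
# Minimal sketch of PaLM-E-style multimodal input construction.
import torch
import torch.nn as nn

vit_dim, llm_dim = 1024, 4096
num_patches, num_text_tokens = 196, 32

vit_embeddings = torch.randn(1, num_patches, vit_dim)         # from a pre-trained ViT
text_embeddings = torch.randn(1, num_text_tokens, llm_dim)    # from the LLM's token embedder

# Affine projection of visual features into the language embedding space.
projection = nn.Linear(vit_dim, llm_dim)
visual_tokens = projection(vit_embeddings)                    # (1, 196, 4096)

# The combined sequence is consumed by the language model like ordinary text.
multimodal_sequence = torch.cat([visual_tokens, text_embeddings], dim=1)
print(multimodal_sequence.shape)                              # torch.Size([1, 228, 4096])
```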

Visual Generative Models

Diffusion models trained on web-scale data, such as DALL-E and Stable Diffusion, provide zero-shot text-to-image generation. They are trained on hundreds of millions of image-caption pairs from the internet. The models learn a language-conditioned distribution over images, from which an image can be generated given a text prompt.
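
For completeness, here is a minimal sketch of language-conditioned image generation with a pre-trained diffusion model via the Hugging Face diffusers library; the checkpoint name and prompt are illustrative, and other Stable Diffusion checkpoints work the same way.

```python
# Minimal sketch: zero-shot text-to-image generation with a pre-trained diffusion model.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe("a robot arm stacking colored blocks on a table").images[0]
image.save("generated_scene.png")
```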

Conclusion

In this post we took a brief look at the various foundation model types that are used in robotics. We did not explicitly cover how each has been applied in robotics applications, but instead gave a high-level overview of how these models are trained and used. In the next part of this series we will cover core concepts and terminology in AI-driven robotics and see how these foundation models fit into the robotics stack.

References