Pix2Struct is an image-encoder/text-decoder model based on the Vision Transformer (ViT).
Summary:
• They introduce the area of general-purpose visually-situated language understanding, which consists of diverse tasks that share common challenges.
• They propose a screenshot parsing pretraining objective based on the HTML source of web pages, and show that this objective is more effective than previous attempts at enabling the elegant pixel-to-text design for general-purpose visually-situated language understanding.
• They introduce variable-resolution input representations for the Vision Transformer and new finetuning strategies that seamlessly integrate language and vision inputs by rendering any language prompts directly on top of the input image.
Problem with ViT
Before extracting fixed-size patches, the standard ViT scales the input image to a predefined resolution, which creates two undesirable effects: (1) rescaling distorts the true aspect ratio, which can be highly variable for documents, mobile UIs, and figures, and (2) transferring these models to downstream tasks with higher resolution is non-trivial.
Solution:
1) Always scale the input image up or down such that the maximal number of patches fits within the given sequence length (see the sketch below).
2) To handle variable resolutions unambiguously, use 2-dimensional absolute positional embeddings for the input patches.
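A minimal sketch of the aspect-ratio-preserving rescaling, assuming a patch size of 16 and a hypothetical helper name (not the authors' code):

```python
import math

def rescale_to_patch_budget(height, width, patch_size=16, max_patches=2048):
    """Find the largest aspect-ratio-preserving rescale such that the number
    of patch_size x patch_size patches fits within max_patches."""
    # choose scale s so that (s*h/p) * (s*w/p) <= max_patches
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    num_rows = max(1, math.floor(scale * height / patch_size))
    num_cols = max(1, math.floor(scale * width / patch_size))
    new_height, new_width = num_rows * patch_size, num_cols * patch_size
    return new_height, new_width, num_rows, num_cols

# Each extracted patch keeps its (row, col) index; the 2-D absolute positional
# embedding can then be e.g. row_embedding[row] + col_embedding[col].
```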
Pretraining
Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML.
The masking provides a BART-like learning signal: 50% of the text is masked while the decoder still produces the entire subtree. The masked regions are randomly sampled spans of text from the chosen subtree, over which crossed-out opaque bounding boxes are drawn in the screenshot.
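A hedged sketch of how the image-side masking might look (PIL-based for illustration; the box coordinates and helper name are assumptions, not the paper's actual pipeline):

```python
import random
from PIL import ImageDraw

def mask_text_regions(screenshot, text_boxes, mask_ratio=0.5):
    """Cover a random subset (~mask_ratio) of text regions with crossed-out
    opaque boxes; the decoding target remains the full simplified-HTML
    subtree, so the hidden text must be predicted from context (BART-like)."""
    draw = ImageDraw.Draw(screenshot)
    k = max(1, int(mask_ratio * len(text_boxes)))
    for (x0, y0, x1, y1) in random.sample(text_boxes, k):
        draw.rectangle([x0, y0, x1, y1], fill="white", outline="black")
        draw.line([x0, y0, x1, y1], fill="black", width=2)  # the "cross"
        draw.line([x0, y1, x1, y0], fill="black", width=2)
    return screenshot
```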
**Warmup stage before pretraining**
If the model is first exposed to a short, intense "warmup" stage of simply learning to read, there is a strong curriculum-learning effect: (1) pretraining is more stable and converges faster, and (2) finetuning performance improves.
Specifically, they create images of text snippets with random colors and fonts on a white background; the model simply needs to decode the original text (see the sketch below).
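A rough sketch of how such warmup examples could be generated (PIL-based, with made-up image sizes and font handling; not the authors' exact renderer):

```python
import random
from PIL import Image, ImageDraw, ImageFont

def render_warmup_example(text, font_paths, image_size=(640, 160)):
    """Render a text snippet in a random font and color on a white
    background; the decoding target is simply the original text."""
    img = Image.new("RGB", image_size, "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(random.choice(font_paths), size=random.randint(16, 40))
    color = tuple(random.randint(0, 200) for _ in range(3))
    draw.text((10, 10), text, fill=color, font=font)
    return img, text  # (input image, target string)
```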
Pretraining setup:
They pretrain two model variants:
(a) a base model with 282M parameters, comprising 12 transformer layers with a hidden size of 768, and
(b) a large model with 1.3B parameters, comprising 18 layers with a hidden size of 1536.
Both models have the same warmup stage using text rendered from BooksCorpus (Zhu et al., 2015) lasting 30K steps with a maximum input sequence length of 128 patches.
The base model - 270K steps with the screenshot parsing objective, batch size of 3072, on 64 Google Cloud TPUs.
The large model - 170K steps, batch size of 1024, on 128 Google Cloud TPUs.
Input sequence length - 2048 patches
Optimizer - Adafactor
Learning rate schedule - linear warmup of 1000 steps to 0.01, followed by cosine decay to 0 (sketched below)
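A small sketch of that schedule (total_steps here is illustrative; plug in the actual number of pretraining steps):

```python
import math

def learning_rate(step, peak_lr=0.01, warmup_steps=1000, total_steps=270_000):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```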