🚀 AI News: How to Instruction Tune Code LLMs without GPT4 Data? + Meet 3D-VisTA + Meet Cheetor...(Aug 17, 2023 Edition)

This newsletter brings you AI research news that is more technical than most resources, but still digestible and applicable.

1️⃣ How to Instruction Tune Code LLMs without GPT4 Data? Meet OctoPack: A Set of AI Models for Instruction Tuning Code Large Language Models [Paper] [Blog]

The study shows the benefits of instruction tuning large language models on code. Git commits naturally pair a code change with a human-written instruction (the commit message), and the researchers exploit this structure to build CommitPack, a compilation of 4 terabytes of Git commits across 350 programming languages. When used to tune the 16B-parameter StarCoder model, CommitPack outperformed other natural and synthetic code instruction datasets, achieving the best performance on the HumanEval Python benchmark among models not trained on OpenAI outputs. The paper also introduces HumanEvalPack, which expands the HumanEval benchmark to three coding tasks across six languages. The resulting models, OctoCoder and OctoGeeX, achieve the best performance on HumanEvalPack among permissively licensed models, showing that CommitPack generalizes to a wider set of languages and coding tasks.
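
To make the core idea concrete, here is a minimal sketch of how a commit could be turned into an instruction-tuning sample. The `Commit` class, the field names, and the toy `is_usable` filter are assumptions made for illustration; they are not the actual CommitPack schema or the paper's filtering rules.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    """A single-file Git commit (hypothetical structure, for illustration only)."""
    message: str      # human-written commit message
    code_before: str  # file contents before the commit
    code_after: str   # file contents after the commit


def commit_to_sample(commit: Commit) -> dict:
    """Map a commit onto an (instruction, input, output) training sample."""
    return {
        "instruction": commit.message.strip(),  # the commit message acts as the instruction
        "input": commit.code_before,            # the code to be changed
        "output": commit.code_after,            # the target the model should produce
    }


def is_usable(commit: Commit, min_words: int = 2) -> bool:
    """Toy quality filter (an assumption): keep commits with a
    multi-word message and an actual code change."""
    has_message = len(commit.message.split()) >= min_words
    has_change = commit.code_before != commit.code_after
    return has_message and has_change


commits = [
    Commit(
        message="Fix off-by-one error in loop range",
        code_before="for i in range(n - 1):\n    print(i)\n",
        code_after="for i in range(n):\n    print(i)\n",
    ),
    Commit(message="wip", code_before="x = 1\n", code_after="x = 2\n"),
]

# Only the first commit survives the filter; "wip" is too short to be an instruction.
samples = [commit_to_sample(c) for c in commits if is_usable(c)]
```

The appeal of this setup is that the instruction and the target both come from human-written repository history, so no model-generated data is needed to build the training set.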

2️⃣ Meet 3D-VisTA: A Pre-Trained Transformer for 3D Vision and Text Alignment that can be Easily Adapted to Various Downstream Tasks [Blog] [Paper]

The paper addresses the growing field of 3D vision-language grounding (3D-VL), which connects the 3D physical world with natural language for embodied intelligence. Recognizing the complexity of current 3D-VL models, the study introduces 3D-VisTA, a pre-trained Transformer that simplifies the alignment of 3D vision and text. Unlike previous designs, 3D-VisTA relies solely on self-attention layers for both single-modal modeling and multi-modal fusion. The research also presents ScanScribe, the first large-scale dataset of 3D scene-text pairs for 3D-VL pre-training, built from ScanNet, 3R-Scan, and GPT-3. Pre-trained on ScanScribe, 3D-VisTA delivers state-of-the-art results on multiple 3D-VL tasks and is highly data-efficient, especially when annotations are sparse.

3️⃣ This AI Research from UCLA Indicates Large Language Models (such as GPT-3) have Acquired an Emergent Ability to Find Zero-Shot Solutions to a Broad Range of Analogy Problems [Paper] [Blog]

The study examines the analogical reasoning capability of large language models, specifically the text-davinci-003 variant of GPT-3, and compares it with human performance. The focus is on zero-shot reasoning, in which novel problems are solved without any direct training on them, something humans often achieve through analogy. Using tests similar to Raven's Standard Progressive Matrices, the researchers found that GPT-3 showed a strong aptitude for abstract pattern induction, often matching or surpassing human performance. Preliminary tests with GPT-4 show even better results, suggesting that such models have acquired an emergent ability to solve a broad range of analogy problems zero-shot.
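
As a rough illustration of what a zero-shot, text-based matrix problem could look like, the sketch below renders a 3x3 grid of digits with the final cell left open and scores a reply by exact match. The bracket layout, the `[?]` placeholder, and both function names are assumptions for illustration, not the paper's actual prompt format or evaluation code.

```python
def format_matrix_prompt(matrix):
    """Render a grid of digits as text; a None cell becomes the open slot."""
    rows = []
    for row in matrix:
        cells = ["[?]" if cell is None else f"[{cell}]" for cell in row]
        rows.append(" ".join(cells))
    return "\n".join(rows)


def is_correct(model_reply, expected):
    """Exact-match scoring after stripping whitespace and brackets."""
    return model_reply.strip().strip("[]").strip() == str(expected)


# A simple progression: each cell is one more than the previous one.
problem = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, None],
]
prompt = format_matrix_prompt(problem)
# The prompt is sent to the model as-is, with no worked examples (zero-shot).
```

Because the prompt contains no solved examples, any correct completion has to come from the model inferring the pattern on the spot.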

4️⃣ Meet Cheetor: A Transformer-based Multimodal Large Language Model (MLLM) that can Effectively Handle a Wide Variety of Interleaved Vision-Language Instructions and Achieve State-of-the-Art Zero-Shot Performance [Paper] [Blog]

Recent Multimodal Large Language Models (MLLMs) have shown promising capabilities on many vision-language tasks. However, current techniques are mostly restricted to a single visual context and a limited range of instruction types. To better assess these models, this study introduces the I4 benchmark, which evaluates how well MLLMs handle complex interleaved vision-language instructions, such as those found in visually rich content. One challenge the authors identify is that the Visual Prompt Generator (VPG) struggles to extract task-specific visual details. To overcome this, they propose a knowledge re-injection module and a cross-attention guided training strategy. Combining the two, they present Cheetor, a Transformer-based MLLM that achieves superior zero-shot performance on the I4 benchmark without requiring high-quality instruction tuning data. Cheetor is also competitive with other top-performing models on the MME benchmark.


