🚀 Exciting AI Updates: Do Video-Language Models really understand actions? If not, how can we fix it? | SambaNova Systems open-source BLOOMChat a multilingual chat LLM | Meet Drag Your GAN....

This newsletter brings AI research news that is much more technical than most resources but still digestible and applicable.

SambaNova Systems open-sources BLOOMChat, a multilingual chat LLM. Built on top of the BLOOM model, BLOOMChat is a new, open, multilingual chat LLM trained on SambaNova RDUs (Reconfigurable Dataflow Units). In a human preference study across 6 languages, it achieves a win rate of 45.25% against GPT-4's 54.75%. It is preferred 66% of the time over mainstream open-source chat LLMs in a human preference study across 6 languages. It also performs strongly on WMT translation tasks, leading the results among BLOOM variants and mainstream open-source chat models.

Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models. CMU researchers propose a novel approach that combines zero-shot text-to-video generation with ControlNet to improve the output of these models. The proposed method takes multiple sketched frames as input and generates video output that matches the flow of these frames, building upon the Text-to-Video Zero architecture and incorporating ControlNet to enable additional input conditions.

What if LLM Hallucinations Were A Feature And Not A Bug? Meet dreamGPT: An Open-Source GPT-Based Solution That Uses Hallucinations From Large Language Models (LLMs) As A Feature. This approach helps generate unique and creative ideas. Hallucinations typically carry a negative connotation and are mostly regarded as a drawback of LLMs, but dreamGPT turns them into something valuable for generating innovative solutions.

ONE-PEACE: A general representation model across vision, audio, and language modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results in vision, audio, audio-language, and vision-language tasks. Furthermore, ONE-PEACE possesses a strong emergent zero-shot retrieval capability, enabling it to align modalities that are not paired in the training data.

Do Video-Language Models really understand actions? If not, how can we fix it? In this work, a research group first proposes ActionBench to diagnose VidLMs’ action knowledge. Surprisingly, SOTA VidLMs still struggle to distinguish “falling” vs. “rising”, and to tell original videos from reversed ones. To resolve the issue, the research group proposes a novel framework, PAXION, along with the Discriminative Video Dynamics Modeling (DVDM) objective. Together, they patch the missing action knowledge (~50% -> 80%) into frozen VidLMs without compromising their general VL capabilities.

Meet Drag Your GAN: An Interactive Point-based Manipulation on the Generative Image Manifold. Researchers from Max Planck Institute for Informatics, MIT CSAIL, and Google AR/VR propose DragGAN, which enables interactive point-based manipulation by handling two sub-problems: 1) supervising the handle points so that they move towards the targets and 2) tracking the handle points so that their locations are known at each editing step. Their method is based on the fundamental observation that a GAN’s feature space has enough discriminative power to support both motion supervision and accurate point tracking. In particular, motion supervision is provided by a shifted feature patch loss that optimizes the latent code. Because each optimization step moves the handle points nearer to the targets, point tracking is then carried out using nearest neighbor search in the feature space.
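The point-tracking step described above can be illustrated with a small sketch. This is not the DragGAN implementation: the feature map here is a plain nested list rather than a GAN feature tensor, and the names `track_point`, `squared_distance`, and the `radius` parameter are hypothetical choices made for illustration. It only shows the idea of searching a local window for the location whose feature vector is closest to the handle point's original feature.

```python
from typing import List, Tuple

# Hypothetical stand-in for a GAN feature map, indexed as [row][col][channel].
FeatureMap = List[List[List[float]]]


def squared_distance(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def track_point(
    feature_map: FeatureMap,
    handle_feature: List[float],
    prev_point: Tuple[int, int],
    radius: int,
) -> Tuple[int, int]:
    """Return the location near prev_point whose feature is closest to handle_feature.

    Only a square window of the given radius around prev_point is searched.
    If no location is strictly closer, prev_point itself is returned.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    r0, c0 = prev_point
    best_point = prev_point
    best_dist = squared_distance(feature_map[r0][c0], handle_feature)
    for r in range(max(0, r0 - radius), min(rows, r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(cols, c0 + radius + 1)):
            dist = squared_distance(feature_map[r][c], handle_feature)
            if dist < best_dist:
                best_dist = dist
                best_point = (r, c)
    return best_point


# Toy example: a 5x5 map of zero features with one distinctive feature at (2, 3),
# standing in for the handle point after one optimization step has moved it.
toy_map = [[[0.0, 0.0] for _ in range(5)] for _ in range(5)]
toy_map[2][3] = [1.0, 1.0]
new_location = track_point(toy_map, [1.0, 1.0], prev_point=(2, 2), radius=1)
```

In an editing loop, a step like this would run after every latent-code update so that the next motion-supervision step starts from the handle point's current location.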

LLMScore: A new AI framework that offers evaluation scores with multi-granularity compositionality. LLMScore leverages large language models (LLMs) to evaluate text-to-image models. First, it transforms the image into image-level and object-level visual descriptions. Then an evaluation instruction is fed into the LLM to measure the alignment between the synthesized image and the text, ultimately generating a score accompanied by a rationale.
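The two-stage flow described above (descriptions first, then an evaluation instruction that yields a score and a rationale) might be wired together roughly as follows. Everything in this sketch is an assumption made for illustration: the function names, the instruction wording, the 0 to 100 scale, and the `Score:` / `Rationale:` reply format are hypothetical and are not taken from LLMScore. The LLM is passed in as a plain callable so that no particular API is implied.

```python
from typing import Callable, List, Tuple


def build_evaluation_instruction(
    text_prompt: str,
    image_description: str,
    object_descriptions: List[str],
) -> str:
    """Assemble a hypothetical evaluation instruction from both description levels."""
    objects = "\n".join(f"- {d}" for d in object_descriptions)
    return (
        "Rate how well the image matches the text prompt on a scale of 0 to 100.\n"
        f"Text prompt: {text_prompt}\n"
        f"Image-level description: {image_description}\n"
        f"Object-level descriptions:\n{objects}\n"
        "Answer with 'Score: <number>' on the first line, "
        "then 'Rationale: <text>' on the second."
    )


def parse_response(response: str) -> Tuple[float, str]:
    """Parse the assumed 'Score: ...' / 'Rationale: ...' reply format."""
    score_line, _, rest = response.partition("\n")
    score = float(score_line.split(":", 1)[1])
    rationale = rest.split(":", 1)[1].strip() if ":" in rest else rest.strip()
    return score, rationale


def evaluate(
    text_prompt: str,
    image_description: str,
    object_descriptions: List[str],
    llm: Callable[[str], str],
) -> Tuple[float, str]:
    """Run the instruction through any text-in, text-out LLM callable."""
    instruction = build_evaluation_instruction(
        text_prompt, image_description, object_descriptions
    )
    return parse_response(llm(instruction))


# Usage with a stub in place of a real LLM (the reply below is made up):
def stub_llm(instruction: str) -> str:
    return "Score: 80\nRationale: The dog is present but the hat is missing."


score, rationale = evaluate(
    "a dog wearing a hat",
    "A brown dog sitting on grass.",
    ["dog: brown, sitting", "grass: green"],
    stub_llm,
)
```

Keeping the image-level and object-level descriptions as separate inputs is what would let the instruction reference individual objects, which is the multi-granularity aspect the item mentions.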

Tired of trying to get RL to work with Human Feedback? Try this new method - SLiC-HF: Sequence likelihood calibration with human feedback. The research paper shows how the recently introduced Sequence Likelihood Calibration (SLiC) can also be used to learn effectively from human preferences (SLiC-HF). Furthermore, the team demonstrates that this can be done with human feedback data collected for a different model, similar to off-policy, offline RL data. Automatic and human evaluation experiments on the TL;DR summarization task show that SLiC-HF significantly improves on supervised fine-tuning (SFT) baselines. SLiC-HF also presents a competitive alternative to the PPO RLHF implementation used in past work while being much simpler to implement, easier to tune, and more computationally efficient in practice.
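To give a feel for why a calibration-style objective can be simpler than PPO, here is a minimal sketch of a margin-based preference loss on sequence log-likelihoods. The newsletter item does not state the loss, so the exact form below (a hinge on the log-likelihood gap plus a likelihood regularizer toward a reference sequence), the function names, and the default values of `margin` and `reg_weight` are assumptions for illustration, not the paper's verified formulation.

```python
def rank_calibration_loss(
    logp_preferred: float,
    logp_rejected: float,
    margin: float = 1.0,
) -> float:
    """Hinge loss that is zero once the preferred sequence's log-likelihood
    exceeds the rejected sequence's by at least `margin`."""
    return max(0.0, margin - logp_preferred + logp_rejected)


def calibration_with_regularizer(
    logp_preferred: float,
    logp_rejected: float,
    logp_reference: float,
    margin: float = 1.0,
    reg_weight: float = 0.5,
) -> float:
    """Assumed total loss: the hinge term plus a negative log-likelihood term
    on a reference sequence, which discourages drifting from the SFT model."""
    hinge = rank_calibration_loss(logp_preferred, logp_rejected, margin)
    return hinge - reg_weight * logp_reference


# Preferred sequence already wins by more than the margin: hinge term is 0.
satisfied = rank_calibration_loss(-1.0, -3.0, margin=1.0)

# Rejected sequence is currently more likely: hinge term is positive.
violated = rank_calibration_loss(-3.0, -1.0, margin=1.0)
```

A loss of this shape needs only the log-likelihoods of fixed preference pairs, so it can be computed on feedback data collected offline, with no sampling loop or separate value network during training.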

Featured AI Tools For This Newsletter Issue:

Bright Data

Cody AI

Find 100s of cool artificial intelligence (AI) tools. Our expert team reviews and provides insights into some of the most cutting-edge AI tools available. Check out AI Tools Club