
🚀 AI News: Discover OpenLLaMA 13B, TAPIR, MIME, Infinigen, Seal, BITE, and WebGLM: The Groundbreaking Models Revolutionizing Computer Vision and AI Learning from Top Global Researchers!

This newsletter brings you AI research news that is more technical than most resources but still digestible and applicable.

1️⃣ A research team from DeepMind, UCL, and Oxford has introduced a pioneering model for Tracking Any Point (TAP) called TAPIR, which can follow any specified point on any physical surface throughout a video sequence. The approach consists of two stages: first, a matching stage, which independently pinpoints a suitable candidate match for the query point in each frame; second, a refinement stage, which updates both the trajectory and the query features based on local correlations. The model significantly outperforms all existing methods on the TAP-Vid benchmark, as evidenced by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS, and it is designed to enable rapid analysis of long, high-resolution video sequences.
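To make the two-stage idea concrete, here is a minimal NumPy sketch of per-frame matching followed by local refinement. The feature grids, the 5×5 search window, and the soft update rule are illustrative assumptions, not TAPIR's actual architecture.

```python
import numpy as np

def match_stage(query_feat, frame_feats):
    """Matching stage: pick, per frame, the grid cell whose feature best matches the query.
    query_feat: (C,) descriptor of the query point; frame_feats: (T, H, W, C)."""
    T, H, W, C = frame_feats.shape
    scores = frame_feats.reshape(T, H * W, C) @ query_feat           # (T, H*W) similarities
    idx = scores.argmax(axis=1)                                       # best cell per frame
    return np.stack([idx // W, idx % W], axis=1).astype(float)        # (T, 2) row/col track

def refine_stage(track, query_feat, frame_feats, iters=4, lr=0.5):
    """Refinement stage: nudge each frame's estimate toward locally better-matching cells."""
    T, H, W, C = frame_feats.shape
    for _ in range(iters):
        for t in range(T):
            r, c = np.clip(track[t], [0, 0], [H - 1, W - 1]).astype(int)
            r0, r1 = max(r - 2, 0), min(r + 3, H)                      # 5x5 local window
            c0, c1 = max(c - 2, 0), min(c + 3, W)
            s = frame_feats[t, r0:r1, c0:c1].reshape(-1, C) @ query_feat
            dr, dc = divmod(int(s.argmax()), c1 - c0)
            target = np.array([r0 + dr, c0 + dc], dtype=float)
            track[t] += lr * (target - track[t])                       # soft positional update
    return track

# Toy usage: 10 frames of a 32x32 feature grid with 64-d descriptors.
rng = np.random.default_rng(0)
feats = rng.standard_normal((10, 32, 32, 64)).astype(np.float32)
query = feats[0, 5, 7].copy()                                          # query point in frame 0
print(refine_stage(match_stage(query, feats), query, feats)[:3])
```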

2️⃣ Introducing MIME (Mining Interaction and Movement to infer 3D Environments): this generative model of indoor environments creates furniture arrangements that are consistent with observed human movement. Its core is an auto-regressive transformer architecture that conditions on the objects already generated in a scene and on the observed human motion to produce the next plausible object. For MIME's training, a specially curated dataset was built by placing 3D human models into the 3D-FRONT scene dataset. According to the paper, experimental results confirm that MIME outperforms recent generative scene methods that do not account for human movement, delivering more diverse and plausible 3D scenes.
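A rough PyTorch sketch of the autoregressive idea follows: a transformer encoder reads a pooled human-motion embedding plus tokens for the objects placed so far and predicts parameters for the next object. The dimensions, pooling, and output head are illustrative assumptions, not MIME's actual model.

```python
import torch
import torch.nn as nn

class SceneAutoregressor(nn.Module):
    """Toy autoregressive scene model: given tokens for already-placed objects and a
    pooled human-motion embedding, predict parameters of the next object."""
    def __init__(self, obj_dim=8, motion_dim=16, d_model=64):
        super().__init__()
        self.obj_in = nn.Linear(obj_dim, d_model)
        self.motion_in = nn.Linear(motion_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.next_obj = nn.Linear(d_model, obj_dim)      # e.g. class logits + box parameters

    def forward(self, objects, motion):
        # objects: (B, N, obj_dim) placed so far; motion: (B, motion_dim) pooled motion.
        tokens = torch.cat([self.motion_in(motion)[:, None], self.obj_in(objects)], dim=1)
        h = self.encoder(tokens)
        return self.next_obj(h[:, -1])                    # condition on the full prefix

model = SceneAutoregressor()
objs = torch.randn(2, 3, 8)        # two scenes, three objects already placed
motion = torch.randn(2, 16)        # pooled human-motion features (illustrative)
print(model(objs, motion).shape)   # torch.Size([2, 8]) -> parameters of the next object
```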

3️⃣ Researchers from the University of Maryland, College Park, introduce a novel approach to reconstructing 3D scenes from the reflections in a person's eyes, captured in portrait images. The technique overcomes challenges such as estimating eye pose and disentangling the iris texture from scene reflections. The method jointly refines the cornea poses, the radiance field depicting the scene, and the iris texture of the observer's eye, and it adds a regularization prior on the iris texture pattern to improve reconstruction quality. Exploiting the consistent cornea geometry across healthy adults, the researchers accurately establish eye location and train a radiance field on the eye reflections, while a texture decomposition method learns and subsequently removes the iris texture from the reconstruction. Experiments on both synthetic and real-world captures show the method's effectiveness in recovering 3D scenes via eye reflections.
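The joint-optimization structure can be illustrated with a toy PyTorch loop that refines a cornea pose, a stand-in "radiance field", and an iris texture against observed reflections, with a smoothness prior on the iris. The renderer and the prior below are deliberately simplistic placeholders, not the paper's method.

```python
import torch

# Illustrative joint optimization over three sets of parameters, mirroring the strategy
# of refining cornea pose, scene radiance, and iris texture against the observations.
obs = torch.rand(32, 32, 3)                        # stand-in for cropped eye reflections
pose = torch.zeros(6, requires_grad=True)          # cornea pose (rotation + translation)
scene = torch.rand(32, 32, 3, requires_grad=True)  # stand-in for the radiance field
iris = torch.rand(32, 32, 3, requires_grad=True)   # iris texture to be separated out

def render(pose, scene, iris):
    # Hypothetical renderer: a real system would ray-trace off the cornea geometry;
    # here the "rendered" image is simply scene + iris plus a pose-dependent bias.
    return scene + iris + 0.01 * pose.sum()

opt = torch.optim.Adam([pose, scene, iris], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    recon = (render(pose, scene, iris) - obs).pow(2).mean()   # photometric term
    reg = iris.diff(dim=0).abs().mean()                        # smoothness prior on the iris
    (recon + 0.1 * reg).backward()
    opt.step()
print(float(recon))
```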

4️⃣ When we run out of good training data in reality, simulation is the next gold mine. Researchers from Princeton introduce Infinigen: an open-source procedural generator of photorealistic 3D worlds for computer vision. The quality is stunning! No two worlds are the same. Infinigen offers broad coverage of objects and scenes in the natural world, including plants, animals, terrains, and natural phenomena such as fire, clouds, rain, and snow. It can be used to generate unlimited, diverse training data for a wide range of computer vision tasks, including object detection, semantic segmentation, optical flow, and 3D reconstruction.
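As a sketch of how such synthetic data might be consumed, here is a hypothetical PyTorch dataset that pairs procedurally generated RGB frames with their ground-truth segmentation labels. The directory layout and the .npy format are assumptions for illustration, not Infinigen's actual output format.

```python
import glob, os
import numpy as np
from torch.utils.data import Dataset

class ProceduralScenes(Dataset):
    """Hypothetical loader for procedurally generated frames paired with ground-truth
    labels (the directory layout below is an assumption, not Infinigen's real format)."""
    def __init__(self, root):
        self.rgb = sorted(glob.glob(os.path.join(root, "rgb", "*.npy")))
        self.seg = sorted(glob.glob(os.path.join(root, "segmentation", "*.npy")))
        assert len(self.rgb) == len(self.seg), "every frame needs a matching label"

    def __len__(self):
        return len(self.rgb)

    def __getitem__(self, i):
        # Synthetic data ships with exact labels, so no manual annotation step is needed.
        return np.load(self.rgb[i]), np.load(self.seg[i])

# Usage (assuming a generated output directory): ProceduralScenes("renders/") can be
# wrapped in a DataLoader for detection or segmentation training.
```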

5️⃣ Meet Seal: an innovative framework that aims to "Segment Any Point Cloud Sequences" by leveraging 2D vision foundation models for self-supervised learning on large-scale 3D point clouds. It has three principal attributes that make it stand out: scalability, consistency, and generalizability. Scalability comes from eliminating the need for annotations during pretraining entirely, by directly distilling Vision Foundation Models (VFMs) into point clouds. Consistency is enforced through spatial and temporal relationships at both the camera-to-LiDAR and point-to-segment stages, promoting robust cross-modal representation learning. Generalizability means Seal can transfer knowledge in a straightforward way to various downstream tasks involving point clouds with differing properties, such as real or synthetic, low or high resolution, and different scales. Seal's efficacy is demonstrated in numerous experiments on eleven different point cloud datasets, where it achieves a remarkable 45.0% mIoU on nuScenes after linear probing, an improvement of 36.9% mIoU over random initialization and a gain of 6.1% mIoU over previous methods.
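A minimal sketch of the camera-to-LiDAR distillation idea: project LiDAR points into the paired image, sample the frozen 2D VFM feature at each projected pixel, and pull the corresponding 3D point feature toward it. The shapes, the cosine objective, and the sampling details are illustrative assumptions rather than Seal's exact losses.

```python
import torch
import torch.nn.functional as F

def camera_to_lidar_distill(point_feats, point_uv, image_feats):
    """Illustrative camera-to-LiDAR distillation step: pull each point's 3D feature toward
    the 2D vision-foundation-model feature at its projected pixel.
    point_feats: (N, C) from a 3D backbone; point_uv: (N, 2) pixel coords in [-1, 1];
    image_feats: (C, H, W) frozen VFM feature map for the paired camera image."""
    grid = point_uv.view(1, 1, -1, 2)                              # (1, 1, N, 2)
    target = F.grid_sample(image_feats[None], grid, align_corners=True)
    target = target.squeeze(0).squeeze(1).t()                      # (N, C) sampled 2D features
    return 1.0 - F.cosine_similarity(point_feats, target.detach(), dim=1).mean()

# Toy shapes: 1000 LiDAR points, 64-d features, a 32x32 VFM feature map.
pts = torch.randn(1000, 64, requires_grad=True)
uv = torch.rand(1000, 2) * 2 - 1
img = torch.randn(64, 32, 32)
loss = camera_to_lidar_distill(pts, uv, img)
loss.backward()
print(float(loss))
```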

6️⃣ A research team from ETH Zurich, the Max Planck Institute, and IMATI-CNR Italy has unveiled BITE, a groundbreaking technique for reconstructing a dog's 3D shape and pose from just a single image. Even challenging poses, such as sitting or lying down, which often cause occlusion and deformation, can now be accurately represented. Previous methods struggled with these complex poses due to a scarcity of 3D training data. BITE's innovation lies in its use of ground contact information to accurately regress 3D dogs from a single RGB image, achieved by augmenting the StanfordExtra dataset with 3D body contact annotations. Further enriching the technique is a new parametric 3D dog shape model, D-SMAL, capable of representing a wide array of breeds and mixed breeds. BITE regresses the camera, ground plane, body contact points, 3D model shape, and pose parameters, while also offering an optional optimization stage to further refine the prediction.
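To illustrate how ground-contact information can act as a constraint, here is a simplified stand-in for an optional refinement stage: a translation offset is optimized so that vertices labelled as in contact lie on the ground plane and no vertex sinks below it. The variables and loss terms are illustrative, not BITE's actual optimization.

```python
import torch

# Illustrative ground-contact refinement: nudge a free translation so that vertices
# predicted to touch the ground actually lie on an assumed ground plane z = 0.
verts = torch.randn(200, 3)                       # stand-in for posed 3D dog vertices
contact = torch.zeros(200, dtype=torch.bool)
contact[:20] = True                               # vertices labelled "in ground contact"
offset = torch.zeros(3, requires_grad=True)       # translation we are allowed to adjust

opt = torch.optim.Adam([offset], lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    moved = verts + offset
    loss = moved[contact, 2].pow(2).mean() + torch.relu(-moved[:, 2]).mean()
    # first term: contact vertices on the plane; second term: no vertex below the ground
    loss.backward()
    opt.step()
print(offset.detach())
```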

7️⃣ A group of researchers from China developed WebGLM: a web-enhanced question-answering system based on the General Language Model (GLM). The approach draws inspiration from the ability of LLMs like GPT-3 to spontaneously adopt the right references, an ability that can be distilled to improve smaller dense retrievers. Its bootstrapped generator is a GLM-10B-based response generator, bootstrapped via LLM in-context learning and trained on quoted long-form QA samples; with adequate citation-based filtering, LLMs can supply this high-quality training data in place of the expensive human experts that WebGPT relies on for writing. Finally, a scorer trained on user thumbs-up signals from online QA forums learns the preferences of the human majority across different replies.
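A hypothetical sketch of the retrieve-generate-score flow described above; every function below is a placeholder standing in for the corresponding WebGLM component, not its real API.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    references: list

def retrieve(question: str, k: int = 5) -> list:
    """LLM-augmented retriever: would return the top-k web passages for the question."""
    return [f"[placeholder passage {i} for: {question}]" for i in range(k)]

def generate(question: str, passages: list, n_candidates: int = 3) -> list:
    """Bootstrapped generator: would produce cited long-form answers from the passages."""
    return [Answer(f"[draft {i} citing {len(passages)} passages]", passages)
            for i in range(n_candidates)]

def score(question: str, answer: Answer) -> float:
    """Human-preference scorer: would rate an answer; here, a trivial length proxy."""
    return float(len(answer.text))

question = "How does WebGLM pick its references?"
candidates = generate(question, retrieve(question))
best = max(candidates, key=lambda a: score(question, a))   # keep the preferred answer
print(best.text)
```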

8️⃣ OpenLLaMA 13B Released: OpenLLaMA aims to train an Apache-licensed, "drop-in" replacement for Meta's LLaMA models. The models have been trained on 1T tokens from the RedPajama dataset. Given the popularity of models based on LLaMA-13B, this release should be quite useful. The model weights can serve as a drop-in replacement for LLaMA in existing implementations, and the team also provides a smaller 3B variant.
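A minimal sketch of loading the weights as a drop-in LLaMA replacement with Hugging Face transformers, assuming the checkpoint is published under openlm-research/open_llama_13b; consult the release notes for the exact repository id and recommended tokenizer settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint id; verify against the OpenLLaMA release before use.
model_id = "openlm-research/open_llama_13b"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)   # slow tokenizer, to be safe
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"            # needs `accelerate` installed
)

inputs = tokenizer("Q: What is a drop-in replacement?\nA:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```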

Turbocharge your Customer Research with AI!

*This section is presented in partnership with ICL PR

Essense.io is an AI tool that synthesizes insights from raw, unstructured customer feedback to inform product decision-making. It's very easy to get started:

Import your customer feedback from sources like the iOS App Store, Typeform, HubSpot, Intercom & many more.

Essense’s AI analyzes thousands of pieces of feedback, delivering results in seconds.

Turn unstructured feedback into insights on your customers' sentiments and pain points. You can also chat with your customer feedback to get specific insights!

Essense is like having a data analyst who never sleeps and delivers insights 100x faster.

Customers are the lifeblood of your business. Book a demo to learn more or start a free trial today!
