Arham Islam, Author at MarkTechPost

Revolutionizing Language Model Fine-Tuning: Achieving Unprecedented Gains with NEFTune’s Noisy Embeddings

Instruction fine-tuning is the process of training an LLM on a small curated instruction dataset, which allows the model to achieve high performance on instruction-based tasks. It offers numerous advantages, such as better interpretability, reduced bias, and enhanced task performance. Instruction fine-tuning is, therefore, vital in harnessing the full potential of LLMs, and as such, it becomes essential to improve the outcome of the process.

The authors of this research paper propose a new method called NEFTune (Noisy Embedding Instruction Fine-Tuning) to improve model performance on instruction-based tasks. They show that adding random noise to the embedding vectors of the training data during the forward pass of fine-tuning significantly improves the model’s performance without requiring extra computational resources or additional data. NEFTune leads to a surprising increase in the LLM’s performance on conversational tasks while maintaining its factual question-answering performance.
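
To make the idea concrete, below is a minimal PyTorch-style sketch of the noise injection described in the paper: uniform noise in [-1, 1] scaled by alpha / sqrt(L * d), where L is the sequence length and d is the embedding dimension. The function name, the default alpha, and the way it would be wired into a training loop are illustrative assumptions.

```python
import torch

def neftune_noisy_embeddings(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add scaled uniform noise to token embeddings during a fine-tuning forward pass.

    embeddings: (batch, seq_len, dim) output of the model's embedding layer.
    alpha: noise scale; the paper sweeps values such as 5, 10, and 15.
    """
    _, seq_len, dim = embeddings.shape
    # Noise is drawn from Uniform(-1, 1) and scaled by alpha / sqrt(L * d).
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-1.0, 1.0) * scale
    return embeddings + noise
```

In a training loop, this function would be applied to the embedding-layer output before the transformer blocks; at evaluation time the embeddings are left untouched.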

The researchers conducted most of their experiments on roughly 7B-parameter LLMs such as LLaMA-1, LLaMA-2, and OPT-6.7B, using fine-tuning datasets like Alpaca and ShareGPT. The results were evaluated with AlpacaEval to calculate the Win Rate, the rate at which the LLM’s outputs are preferred over those of OpenAI’s Text-Davinci-003 model, as judged by the evaluator, GPT-4.

Results show that training these models with NEFT significantly increases conversational ability and answer quality. When fine-tuned with noisy embeddings, the performance of LLaMA-2 7B increased considerably from 29.8% to 64.7%, and the average performance across all the models increased by around 15%. Along with the LLM-based evaluation, the researchers also used human annotators: NEFT was preferred on 88 occasions and 22 instances were a draw, corresponding to a win score of around 74% for NEFT.

In one of the experiments, LLaMA-2 was trained on Alpaca with and without NEFT and was then given a prompt about quantum computing. The response from the model fine-tuned with noisy embeddings was much more fluid, explaining complex concepts like superposition and quantum entanglement more clearly.

The researchers hypothesize that introducing noise to the embeddings during training makes the model less prone to overfitting. Instead of latching onto the specifics of the instruction dataset, such as formatting details, text length, and exact wording, the model provides answers that draw on the knowledge and behaviors of the pre-trained base model.

Given the importance of instruction fine-tuning, researchers have introduced many models and methods over the years, and NEFT is not the first method to improve performance using noisy embeddings. What sets it apart is that it significantly improves the performance of LLMs on conversational tasks, yielding more detailed and clearer explanations of complex topics like quantum computing. Most importantly, the method requires no additional computational resources, which is why the authors call it a “free lunch” for fine-tuning LLMs. NEFTune has the potential to be widely adopted in LLM development, making it a promising tool for enhancing LLMs’ capabilities across a wide range of real-world tasks.


A New AI Research from China Proposes 4K4D: A 4D Point Cloud Representation that Supports Hardware Rasterization and Enables Unprecedented Rendering Speed

Dynamic view synthesis is the process of reconstructing dynamic 3D scenes from captured videos and creating immersive virtual playback. It has been a long-standing research problem in computer vision and graphics and holds significant promise for VR/AR, sports broadcasting, and artistic performance capture.

Traditional methods for representing dynamic 3D scenes use textured mesh sequences, but these methods are complex and computationally expensive, making them impractical for real-time applications.

Recently, several methods have produced impressive rendering quality for dynamic view synthesis. However, they still fall short on rendering speed when producing high-quality images. This research paper introduces 4K4D, a 4D point cloud representation that supports hardware rasterization and enables fast rendering.

4K4D represents dynamic 3D scenes with a 4D feature grid, i.e., three spatial dimensions plus time. Such a representation naturally regularizes the points and makes them easier to optimize. The model first recovers the objects’ coarse geometry in the input video using a space-carving algorithm, and a neural network then learns to represent the 3D scene from the resulting point cloud. A differentiable depth peeling algorithm is developed for rendering the point cloud representation, and a hardware rasterizer is leveraged to improve rendering speed.
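
As a rough illustration of what querying such a time-varying feature grid can look like, here is a small PyTorch sketch that trilinearly samples a per-frame 3D feature volume and linearly interpolates between adjacent time steps. This is not 4K4D’s actual implementation (the paper’s grid structure and interpolation scheme may differ); the shapes and the function name are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def sample_4d_features(feature_grid: torch.Tensor, points: torch.Tensor, t: float) -> torch.Tensor:
    """Query a time-varying feature grid at continuous (x, y, z, t) locations.

    feature_grid: (T, C, D, H, W), one feature volume per time step.
    points:       (N, 3) xyz coordinates normalized to [-1, 1].
    t:            continuous time index in [0, T - 1].
    Returns (N, C) features, trilinearly interpolated in space and linearly in time.
    """
    t0 = int(t)
    t1 = min(t0 + 1, feature_grid.shape[0] - 1)
    w = t - t0
    coords = points.view(1, 1, 1, -1, 3)  # grid_sample expects (N, D, H, W, 3) in xyz order
    f0 = F.grid_sample(feature_grid[t0 : t0 + 1], coords, align_corners=True)
    f1 = F.grid_sample(feature_grid[t1 : t1 + 1], coords, align_corners=True)
    feats = (1.0 - w) * f0 + w * f1       # (1, C, 1, 1, N)
    return feats.view(feature_grid.shape[1], -1).t()
```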

To boost the rendering speed, the following acceleration techniques are applied:

  • Some model parameters are precomputed and stored in memory, allowing the graphics card to render the scene faster.
  • The precision of the model is reduced from 32-bit float to 16-bit float. This increases the FPS by 20 without any visible performance loss.
  • Lastly, the number of rendering passes required for the depth peeling algorithm is reduced, which also increases the FPS by 20 with no visible change in quality.

The researchers evaluated 4K4D on multiple datasets, such as DNA-Rendering and ENeRF-Outdoor. Their method renders 3D scenes at over 400 FPS at 1080p resolution on the former dataset and at 80 FPS at 4K resolution on the latter. This is 30 times faster than ENeRF, the state-of-the-art real-time dynamic view synthesis method, while also delivering superior rendering quality. The ENeRF-Outdoor dataset is a rather challenging one with multiple actors, yet 4K4D still produced better results than the other models, which yielded blurry outputs and exhibited black artifacts around the image edges in some renderings.

In conclusion, 4K4D is a new method that aims to tackle the issue of slow rendering speed when it comes to real-time view synthesis of dynamic 3D scenes at 4K resolution. It is a neural point cloud-based representation that achieves state-of-the-art rendering quality and exhibits a more than 30× increase in rendering speed. However, there are a couple of limitations, such as high storage requirements for long videos and establishing point correspondences across frames, which the researchers plan to address in future work.


Top 40+ Generative AI Tools (October 2023)

ChatGPT – GPT-4

GPT-4 is OpenAI’s latest LLM, which is more inventive, accurate, and safer than its predecessors. It also has multimodal capabilities, meaning it can process images as well as files such as PDFs and CSVs. With the introduction of the Code Interpreter, GPT-4 can now run its own code to avoid hallucinations and provide accurate answers.

Bing AI

Bing AI is powered by the GPT-4 model of OpenAI and can traverse the web to provide accurate answers. It also has the ability to generate images from user prompts.

GitHub Copilot

GitHub Copilot is an AI code completion tool that analyzes code and provides instant feedback and relevant code suggestions.

DALL-E 2

DALL-E 2 is a text-to-image generation tool developed by OpenAI that creates original images based on the user’s prompt. It has been designed to reject inappropriate user requests.

Cohere Generate

Cohere Generate leverages the potential of AI to enhance business operations. It offers personalized content for emails, landing pages, product descriptions, and various other requirements.

AlphaCode

AlphaCode has been developed by DeepMind and is capable of writing computer programs at a competitive level.

Adobe Firefly

Firefly is an image generation and editing tool known for its prompt-to-image output accuracy. It encompasses a wide range of image modification features, including content type, color, tone, lighting, and composition tools.

Bard

Bard is a chatbot developed by Google, which is seen as Google’s counterpart to OpenAI’s ChatGPT.

Claude 2

Claude is a chatbot developed by Anthropic. It is similar to ChatGPT and has the ability to process large amounts of text and automate workflows.

Adobe Enhance

This AI tool removes background noise from audio recordings.

Synthesia

Synthesia is a video generation tool that converts texts into high-quality videos using AI avatars and voiceovers.

Copy.ai

Copy.ai allows users to generate high-quality marketing copies in seconds, whether for a blog post, an email, or a social media update.

Microsoft Designer

This tool creates posters, illustrations, and artwork based on the user’s input.

Midjourney

Midjourney is another image generation tool that creates images according to the user’s prompts. It is excellent at creating environments, especially fantasy and sci-fi scenes, that resemble rendered concept art from a video game.

Poe

Poe is a platform that provides access to the major chatbots like GPT-4, Claude, Llama, etc., in one place.

Kickresume

Kickresume is an AI resume builder that simplifies the process of writing and creating effective resumes.

Perplexity AI

Perplexity AI is an all-in-one search engine utilizing GPT-4, enabling intelligent exploration across diverse databases with ease.

Murf.ai

Murf is a text-to-speech tool that allows users to generate studio-quality voiceovers in minutes. 

Designs.ai

This tool allows users to generate logos, videos, and banners within a few minutes.

Soundraw

Soundraw is a music generator that allows users to create their own unique and royalty-free music.

Replika

Replika is an AI-driven chatbot designed to be a virtual companion. It has the capability to form deep connections with users and even allows them to consider it as a significant other. The platform offers features like video calls to enhance the interactive experience, and Replika keeps a journal where it records its emotions and thoughts.

Socratic

Socratic allows users to upload a photo of their homework, and it solves the problem almost immediately.

Tome

Tome is a storytelling tool that drafts text and generates images.

Chatflash

Chatflash is a tool that allows users to create content through a chat option. 

Type Studio

Type Studio is a video editing tool that enables users to edit their videos by making changes directly to the transcribed text.

Descript

Descript is a versatile video editing tool that allows users to create, record, transcribe, edit, collaborate on, and easily share their videos and podcasts.

Bardeen

Bardeen is an automation platform that automates users’ repetitive tasks.

Engage AI

Engage AI improves prospect engagement by enhancing and adding context to comments, helping users break the ice and build stronger relationships with potential clients.

Palette

Palette colorizes black-and-white images within seconds.

Remove.bg

This tool removes the background of any image.

Picsart AI Writer

This tool has features like an Ad copy generator, LinkedIn headlines generator, rephraser, summarizer, and more.

Gamma

Gamma generates decks based on the user prompts.

Tutor AI

Tutor creates full-scale courses along with modules based on the user’s topic.

ChatPDF

This tool allows users to upload any PDF file and chat with it.

Quizify

This tool allows users to create quizzes on different topics.

Boomy

Boomy allows users to create their own original songs within seconds.

Prompt Vine

Prompt Vine is like a virtual library for ChatGPT prompts.

Pi

Pi is a chatbot that talks almost like a therapist.

GPTZero

GPTZero is a detection tool that checks whether a piece of text was written by an AI.

ElevenLabs

ElevenLabs has developed an advanced text-to-speech and voice cloning tool.

Character.ai

Character AI allows users to create characters and talk to them.

Memecam

This tool generates meme-like captions for the user’s images.

RoomGPT

This tool allows users to leverage AI to redesign their rooms within seconds.

Text to Pokemon

This tool converts the user’s input into a Pokemon.

Extrapolate

Extrapolate takes in user images as input and shows how they will age.

Scribble Diffusion

This tool converts hand sketches into professional images.

Voicemod

Voicemod allows users to create full-fledged songs just by entering text.

CLIP Interrogator

This tool analyses an image and identifies the prompts one might need to input to generate it.


This AI Paper Proposes ‘MotionDirector’: An Artificial Intelligence Approach to Customize Video Motion and Appearance

Text-to-video diffusion models have made significant advancements in recent times. Just by providing textual descriptions, users can now create either realistic or imaginative videos. These foundation models have also been tuned to generate images to match certain appearances, styles, and subjects. However, the area of customizing motion in text-to-video generation still needs to be explored. Users may want to create videos with specific motions, such as a car moving forward and then turning left. It, therefore, becomes important to adapt the diffusion models to create more specific content to cater to the users’ preferences.

The authors of this paper have proposed MotionDirector, which helps foundation models achieve motion customization while maintaining appearance diversity at the same time. The technique uses a dual-path architecture to train the models to learn the appearance and motions in the given single or multiple reference videos separately, which makes it easy to generalize the customized motion to other settings.

The dual architecture comprises a spatial and a temporal pathway. The spatial path uses the foundation model with trainable spatial LoRAs (low-rank adaptations) integrated into its transformer layers for each video. These spatial LoRAs are trained on a randomly selected single frame in each training step to capture the visual attributes of the input videos. The temporal pathway, in contrast, duplicates the foundation model and shares the spatial LoRAs with the spatial path to adapt to the appearance of the given input video. In addition, the temporal transformers in this pathway are enhanced with temporal LoRAs, which are trained on multiple frames from the input videos to capture the underlying motion patterns.
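
For readers unfamiliar with LoRA, the snippet below is a generic low-rank adaptation wrapper around a frozen linear layer, written in PyTorch. It illustrates the mechanism MotionDirector builds on rather than the paper’s exact implementation; the rank, scaling, and initialization values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update: W x + (B A) x."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the foundation model's weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.t() @ self.lora_b.t())
```

Only the small A and B matrices are trained, which is what makes it cheap to keep separate "spatial" and "temporal" LoRA sets for appearance and motion.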

Just by deploying the trained temporal LoRAs, the foundation model can synthesize videos of the learned motions with diverse appearances. The dual architecture allows the models to learn the appearance and motion of objects in videos separately. This decoupling enables MotionDirector to isolate the appearance and motion of videos and then combine them from various source videos.

The researchers compared the performance of MotionDirector on two benchmarks covering more than 80 different motions and 600 text prompts. On the UCF Sports Action benchmark (95 videos and 72 text prompts), MotionDirector was preferred by human raters around 75% of the time for better motion fidelity, compared with roughly 25% for the base models. On the second benchmark, LOVEU-TGVE-2023 (76 videos and 532 text prompts), MotionDirector performed better than other controllable generation and tuning-based methods. The results demonstrate that numerous base models can be customized with MotionDirector to produce videos that combine appearance diversity with the desired motion concepts.

MotionDirector is a promising new method for adapting text-to-video diffusion models to generate videos with specific motions. It excels in learning and adapting specific motions of subjects and cameras, and it can be used to generate videos with a wide range of visual styles.

One area where MotionDirector can be improved is learning the motion of multiple subjects in the reference videos. However, even with this limitation, MotionDirector has the potential to enhance flexibility in video generation, allowing users to craft videos tailored to their preferences and requirements.


From 2D to 3D: Enhancing Text-to-3D Generation Consistency with Aligned Geometric Priors

Lifting 2D images into 3D objects for text-to-3D generation is a daunting task. This is mainly because 2D diffusion models learn only view-agnostic priors and have no understanding of 3D space during lifting. An outcome of this limitation is the multi-view inconsistency problem, i.e., the 3D object is not consistent across viewpoints. For example, if we lift a 2D image of a cube into 3D space, the model might generate a cube that looks perfect from one perspective but distorted from others.

To address this issue of geometric inconsistency, a group of researchers has introduced a new method called SweetDreamer, which aligns the 2D geometric priors in diffusion models with well-defined 3D shapes during lifting. The model achieves this by fine-tuning the 2D diffusion model to be viewpoint-aware (to understand how the object’s appearance changes with viewpoint) and to produce view-specific coordinate maps of canonically oriented 3D objects. This approach is very effective at producing 3D objects that are consistent across viewpoints.

The researchers have realized that the main reason behind 3D inconsistent results is due to geometric inconsistency, and therefore, their goal is to equip 2D priors with the ability to generate 3D objects that look the same from all viewpoints while retaining their generalizability.

The proposed method leverages a comprehensive 3D dataset of diverse, canonically oriented, and normalized 3D models. Depth maps are rendered from random viewpoints and converted into canonical coordinate maps. The 2D diffusion model is then fine-tuned to produce the coordinate map aligned with a specific view, which aligns the geometric priors in the 2D diffusion model. Finally, the aligned geometric priors can be smoothly integrated into various text-to-3D systems, effectively reducing inconsistency issues and producing diverse, high-quality 3D content.
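
To make the "canonical coordinate map" step concrete, the sketch below unprojects a rendered depth map into 3D points and expresses them in the object’s canonical frame using standard pinhole-camera geometry. This is a generic illustration of the idea rather than the paper’s exact pipeline; the function name and the assumption that a camera-to-canonical transform is available are ours.

```python
import numpy as np

def depth_to_canonical_coords(depth, K, cam_to_canonical):
    """Unproject a depth map to 3D points expressed in a canonical object frame.

    depth:            (H, W) depth values.
    K:                (3, 3) camera intrinsics.
    cam_to_canonical: (4, 4) rigid transform from camera to canonical coordinates.
    Returns an (H, W, 3) canonical coordinate map (CCM).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # (3, H*W) homogeneous pixels
    cam_pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)             # back-project to camera space
    cam_pts_h = np.vstack([cam_pts, np.ones((1, cam_pts.shape[1]))])    # homogeneous coordinates
    canon = (cam_to_canonical @ cam_pts_h)[:3]                          # (3, H*W) canonical points
    return canon.T.reshape(H, W, 3)
```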

DMTet and NeRF are two common 3D representations used in text-to-3D generation. In the research paper, the authors showed that their aligned geometric priors can be integrated into both DMTet-based and NeRF-based text-to-3D pipelines to improve the quality of the generated 3D objects. This demonstrates the generality of their approach and its potential to enhance the performance of a wide range of text-to-3D systems.

Due to the lack of well-established metrics to evaluate the results of text-to-3D processes, the researchers focused on evaluating the multi-view consistency of the 3D results. They randomly selected 80 prompts from the DreamFusion gallery and performed text-to-3D generation using each method. 3D inconsistencies were then manually checked to report the success rate. The researchers found that their method significantly outperforms other methods. Their success rates were above 85% in both pipelines (DMTet and NeRF), while the other methods scored around 30%.

In conclusion, SweetDreamer presents a novel way of achieving state-of-the-art performance in text-to-3D generation. It can generate results from a wide array of prompts that are free from multi-view inconsistencies and outperforms previous methods. The researchers believe their work opens up a new direction of using limited 3D data to enhance 2D diffusion priors for text-to-3D generation.


Google AI Introduces SANPO: A Multi-Attribute Video Dataset for Outdoor Human Egocentric Scene Understanding

For tasks like self-driving, an AI model must not only understand the 3D structure of roads and sidewalks but also identify and recognize street signs and stop lights. This task is made easier by a LiDAR sensor mounted on the car that captures 3D data. Such a process is called egocentric scene understanding, i.e., comprehending the environment from one’s own perspective. The problem is that, beyond the autonomous driving domain, there are few publicly available datasets that generalize to egocentric human scene understanding.

Researchers at Google have introduced SANPO (Scene understanding, Accessibility, Navigation, Pathfinding, Obstacle avoidance) dataset, which is a multi-attribute video dataset for human egocentric scene understanding. SANPO consists of both real-world as well as synthetic data, called SANPO-Real and SANPO-Synthetic, respectively. SANPO-Real covers diverse environments and has videos from two stereo cameras to support multi-view methods. The real dataset also includes 11.4 hours of video captured at 15 frames per second (FPS) with dense annotations. 

SANPO is a large-scale video dataset for human egocentric scene understanding, consisting of more than 600K real-world and more than 100K synthetic frames with dense prediction annotations.

Google’s researchers have prioritized privacy protection. They’ve collected data while following the laws at the local, city, and state levels. They’ve also made sure to remove any personal information, like faces and vehicle license plates, before sending the data for annotation.

To compensate for imperfections in captured video, such as motion blur and human rating mistakes, SANPO-Synthetic was introduced to augment the real dataset. The researchers partnered with Parallel Domain to create a high-quality synthetic dataset optimized to match real-world conditions. SANPO-Synthetic consists of 1,961 sessions recorded with virtualized ZED cameras, with an even split between head-mounted and chest-mounted positions.

The synthetic dataset and a part of the real dataset have been annotated with panoptic instance masks, which assign a class and an instance ID to each pixel. In SANPO-Real, only a few frames contain more than 20 instances; SANPO-Synthetic, by contrast, features many more instances per frame than the real dataset.
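
As a quick illustration of what panoptic instance masks encode, the snippet below splits a combined panoptic label map into per-pixel semantic classes and instance IDs, assuming the common class_id * divisor + instance_id encoding used by several panoptic tools. SANPO’s actual file format and label divisor may differ, so treat the constant and function names as placeholders.

```python
import numpy as np

LABEL_DIVISOR = 1000  # assumption: a common panoptic encoding; check the dataset docs

def split_panoptic(panoptic_map: np.ndarray):
    """Split a panoptic label map into per-pixel semantic class IDs and instance IDs."""
    semantic = panoptic_map // LABEL_DIVISOR
    instance = panoptic_map % LABEL_DIVISOR
    return semantic, instance

def instances_per_frame(panoptic_map: np.ndarray) -> int:
    """Count distinct instances in one frame (instance ID 0 treated as 'no instance')."""
    _, instance = split_panoptic(panoptic_map)
    return len(np.unique(instance[instance > 0]))
```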

Some of the other important video datasets in this field are  SCAND, MuSoHu, Ego4D, VIPSeg, and Waymo Open. SANPO was compared to these datasets, and it is the first dataset with panoptic masks, depth, camera pose, multi-view stereo, and both real and synthetic data. Apart from SANPO, only Waymo Open has both panoptic segmentation and depth maps.

The researchers trained two state-of-the-art models, BinsFormer (for depth estimation) and kMaX-DeepLab (for panoptic segmentation), on the SANPO dataset. They observed that the dataset is quite challenging for both dense prediction tasks. Moreover, the models achieve higher accuracy on the synthetic data than on the real data, mainly because real-world environments are far more complex than synthetic ones and because segmentation annotations are more precise for synthetic data.

Introduced to tackle the lack of datasets for human egocentric scene understanding, SANPO is a significant advancement that encompasses both real-world and synthetic data. Its dense annotations, multi-attribute features, and unique combination of panoptic segmentation and depth information set it apart from other datasets in the field. Combined with the researchers’ commitment to privacy, the dataset can support fellow researchers in creating visual navigation systems for the visually impaired and in pushing the boundaries of advanced visual scene understanding.


This AI Research Proposes Kosmos-G: An Artificial Intelligence Model that Performs High-Fidelity Zero-Shot Image Generation from Generalized Vision-Language Input Leveraging the Property of Multimodal LLMs

Recently, there have been significant advancements in creating images from text descriptions and combining text and images to generate new ones. However, one unexplored area is image generation from generalized vision-language inputs (for example, generating an image from a scene description involving multiple objects and people). A team of researchers from Microsoft Research, New York University, and the University of Waterloo introduced KOSMOS-G, a model that leverages Multimodal LLMs to tackle this issue.

KOSMOS-G can create detailed images from complex combinations of text and multiple pictures, even when it has not seen such examples before. It is the first model that can generate images from descriptions involving multiple entities drawn from different input pictures in a zero-shot setting. KOSMOS-G can also be used in place of CLIP, which opens up new possibilities for combining it with techniques like ControlNet and LoRA in various applications.

KOSMOS-G uses a clever approach to generate images from text and pictures. It starts by training a multimodal LLM (which can understand text and images together), whose output is then aligned with that of the CLIP text encoder (which is good at understanding text).

When we give KOSMOS-G a caption with text and segmented images, it’s trained to create images that match the description and follow the instructions. It does this by using a pre-trained image decoder and leveraging what it has learned from the images to generate accurate pictures in different situations.

KOSMOS-G can generate images based on instructions and input data. It has three stages of training. In the first stage, the model is pre-trained on multimodal corpora. In the second stage, an AlignerNet is trained to align the output space of KOSMOS-G to U-Net’s input space through CLIP supervision. In the third stage, KOSMOS-G is fine-tuned through a compositional generation task on curated data. During Stage 1, only the MLLM is trained. In Stage 2, AlignerNet is trained with MLLM frozen. During Stage 3, both AlignerNet and MLLM are jointly trained. The image decoder remains frozen throughout all stages.
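
The Stage 2 alignment step can be pictured as learning a small network that maps the MLLM’s output embeddings into the space the frozen image decoder already understands (the CLIP text-encoder space). The sketch below is a deliberately simplified stand-in: the real AlignerNet is a larger learned module, and the dimensions, architecture, and plain MSE objective here are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AlignerNetSketch(nn.Module):
    """Toy stand-in for AlignerNet: project MLLM embeddings into CLIP's text space."""

    def __init__(self, mllm_dim: int = 2048, clip_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(mllm_dim, clip_dim),
            nn.GELU(),
            nn.Linear(clip_dim, clip_dim),
        )

    def forward(self, mllm_embeddings: torch.Tensor) -> torch.Tensor:
        return self.proj(mllm_embeddings)

def alignment_loss(aligner: AlignerNetSketch,
                   mllm_embeddings: torch.Tensor,
                   clip_embeddings: torch.Tensor) -> torch.Tensor:
    # Pull the aligned MLLM output toward the CLIP text encoder's output for the
    # same caption, so the frozen diffusion decoder can consume either one.
    return nn.functional.mse_loss(aligner(mllm_embeddings), clip_embeddings)
```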

KOSMOS-G performs well at zero-shot image generation across different settings. It can produce images that are semantically coherent, visually appealing, and customizable, for example by changing the context, applying a particular style, making modifications, or adding extra details. KOSMOS-G is the first model to achieve multi-entity VL2I (vision-language-to-image generation) in a zero-shot setting.

KOSMOS-G can easily take the place of CLIP in image generation systems. This opens up exciting new possibilities for applications that were previously impossible. By building on the foundation of CLIP, KOSMOS-G is expected to advance the shift from generating images based on text to generating images based on a combination of text and visual information, creating opportunities for many innovative applications.

In summary, KOSMOS-G is a model that can create detailed images from both text and multiple pictures. It uses a unique “align before instruct” strategy in its training. KOSMOS-G is good at generating images of individual subjects and is the first to do so with multiple subjects in a zero-shot setting. It can also replace CLIP and be combined with techniques like ControlNet and LoRA for new applications. In short, KOSMOS-G is an initial step toward treating “image as a foreign language” in image generation.


Latest Advancements in the Field of Multimodal AI: (ChatGPT + DALLE 3) + (Google BARD + Extensions) and many more

Multimodal AI is a field of Artificial Intelligence (AI) that combines various data types (modalities), such as text, images, video, and audio, to achieve better performance. Most traditional AI models are unimodal, i.e., they can process only one data type: they are trained on, and their algorithms are tailored to, that single modality. An example of a unimodal AI system is the free version of ChatGPT, which uses natural language processing to understand and extract meaning from textual data and can only produce text as output.

On the contrary, Multimodal AI systems can handle multiple modalities simultaneously and produce more than one output type. The paid version of ChatGPT, which uses GPT-4, is an example of multimodal AI. It can handle not only text but also images and can process different files such as PDF, CSV, etc.

In this article, we will discuss the recent advancements made in the field of Multimodal AI.

ChatGPT + DALLE 3

DALLE 3 represents the latest advancement in OpenAI’s text-to-image technology, marking a significant step forward in AI-generated art. The system’s ability to understand the context of the user prompts has increased, and it can better comprehend the details provided by the user.

The model is able to capture all the details of the prompt and create a comprehensive image that adheres to the entered text.

DALL·E 3 is integrated directly into ChatGPT, enabling seamless collaboration. When given an idea, ChatGPT effortlessly generates specific prompts for DALL·E 3, giving life to the user’s concepts. If users want adjustments to an image, they can simply ask ChatGPT with a few words.

Users can request assistance from ChatGPT to create a prompt that DALL·E 3 can use for generating artwork. Even though DALL·E 3 can still handle users’ specific requests, with ChatGPT’s help, AI art creation becomes more accessible to all.

Google BARD + Extensions

Bard, a conversational AI tool developed by Google, recently received significant enhancements through Extensions, which enable it to connect with various Google apps and services. With Extensions, Bard can fetch and display relevant information from everyday Google tools such as Gmail, Docs, Drive, Google Maps, YouTube, and Google Flights and hotels.

Bard can assist even when the needed information spans multiple apps and services. For instance, when planning a trip to the Grand Canyon, users can ask Bard to find suitable dates from Gmail, provide current flight and hotel details, offer Google Maps directions to the airport, and even share YouTube videos about activities at the destination, all within a single conversation.

Claude + File Upload

Claude is an AI chatbot developed by Anthropic that is easy to converse with and less likely to produce harmful outputs. Claude 2 has improved coding, math, and reasoning performance and can produce longer responses. Claude can also process different document types, such as PDF, DOC, and CSV; Claude 2 can analyze up to five documents totaling up to 100,000 tokens.

DeepFloyd IF

DeepFloyd IF is a powerful text-to-image model developed by Stability AI. It is a cascaded pixel diffusion model that generates images in a cascading manner. Initially, a base model produces low-resolution samples, and then a series of upscale models boost the image to create high-resolution images.

DeepFloyd IF is highly efficient and outperforms other leading tools. It demonstrates that larger UNet structures can enhance image generation tools, indicating a promising future for transforming text into images.

DeepFloyd IF’s base and super-resolution models utilize diffusion models, which involve introducing random noise into the data using Markov chain steps and then reversing this process to create new data samples from the noise.
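
To ground the description of diffusion above, here is a generic sketch of the forward (noising) step that such models are trained to reverse. It follows the standard DDPM formulation rather than anything DeepFloyd-specific, and the function and variable names are illustrative.

```python
import torch

def q_sample(x0: torch.Tensor, t: torch.Tensor, alphas_cumprod: torch.Tensor):
    """Forward diffusion: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise.

    x0:             clean images, shape (batch, channels, height, width).
    t:              integer timesteps, shape (batch,).
    alphas_cumprod: cumulative product of (1 - beta_t) over the noise schedule.
    """
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise  # the denoising network learns to predict `noise` from (x_t, t)
```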

ImageBind

ImageBind, created by Meta AI, is the first AI model that can combine data from six modalities without direct supervision. It allows machines to understand and analyze various kinds of information, such as images, video, audio, text, depth, thermal, and IMU data, and to recognize the connections between them.

Some of the capabilities of ImageBind are:

  • It can immediately propose audio based on an image or video input. This can be used to enhance an image or video by adding relevant audio, such as the sound of waves for a beach image.
  • ImageBind can instantly generate images using an audio clip as input. For instance, if we have an audio recording of a bird, the model can create images depicting what that bird could resemble.
  • Individuals can quickly find related images by using a prompt that links audio and images. This could be handy for locating images connected to a video clip’s visual and auditory aspects.

CM3leon

CM3Leon is an advanced model for generating text and images. It’s a versatile model that can create images from text and vice versa. CM3Leon excels in text-to-image generation, achieving top performance while using only a fraction of the training compute compared to similar methods.


What is Model Merging?

Model merging refers to the process of combining multiple distinct models, each designed to perform separate tasks or solve different problems, into a single unified model without requiring additional training. Depending on the specific technique and goal, merging models can also be called ensemble learning, model blending, or model stacking. This technique aims to create a more versatile and comprehensive Machine Learning model capable of handling various tasks simultaneously. 

In the context of LLMs, model merging can involve combining LLMs with different initializations, architectures, or training on different tasks. The primary goal is to leverage the strengths of each individual model and create a multi-task LLM that can address a broader range of tasks. This approach can significantly improve performance and efficiency by allowing the combined model to benefit from the knowledge and capabilities of each constituent model.

Why merge ML models?

Combining Machine Learning models offers several benefits, such as reducing prediction variability and bias through averaging or voting among diverse models. Leveraging complex patterns and features from various data sources and models can enhance prediction accuracy and adaptability. Moreover, model merging can improve prediction diversity and reliability by reducing reliance on a single dataset or algorithm.

Model merging results in better performance, improved efficiency, and broader applicability, making it a valuable strategy for leveraging the strengths of different AI models without the need for extensive additional training.

Strategies for combining LLMs

One common approach is to combine models by averaging their weights or parameters. This can result in a fused model that benefits from the knowledge and expertise embedded in each original model. Model merging may also involve the integration of features from each model. This is particularly useful when the models have learned task-specific features that are valuable for the overall performance of the merged model.
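
As a concrete example of the weight-averaging strategy, the sketch below merges PyTorch models that share an identical architecture by averaging their state dicts. The helper name and the equal-weight default are illustrative; this only works when all models have the same parameter names and shapes.

```python
import torch

def average_state_dicts(state_dicts, weights=None):
    """Merge models with identical architectures by (weighted) parameter averaging."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Usage sketch: merge two fine-tuned checkpoints into a third model of the same class.
# merged = average_state_dicts([model_a.state_dict(), model_b.state_dict()])
# model_c.load_state_dict(merged)
```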

Some model merging techniques allow for merging models up to a specified layer, creating a multi-head model. This approach can be beneficial when different models specialize in different aspects of a task.

Some Recent Research Papers on Model Merging

Fusing fine-tuned models for better pretraining

In this research, the authors acknowledge that pretrained models are widely used as a starting point for natural language processing tasks but can be expensive to create. They propose a novel approach of fusing multiple existing fine-tuned models into one, using an average of their weights. This fused model consistently outperforms pretrained models and is often superior to intertraining, where a base model is fine-tuned on another task. The fusion process is less dependent on the target task and remains effective even with weight decay, providing a more cost-effective and resource-efficient method for improving model initialization in NLP.

Resolving Interference When Merging Models

Transfer learning, which involves further fine-tuning pre-trained models for downstream tasks, offers improved performance, faster convergence, and sample efficiency. However, task-specific fine-tuned models often cannot collaborate effectively. Model merging methods have emerged to address this, but they frequently neglect interference between parameters from different models, causing performance drops. In response, the authors propose TIES-MERGING, which resolves interference issues by resetting parameters, resolving sign conflicts, and merging only compatible parameters. TIES-MERGING outperforms existing methods across diverse settings, emphasizing the importance of addressing interference in model merging for enhanced performance and versatility.

ZipIt! Merging Models from Different Tasks without Training 

This research addresses the challenge of merging distinct models with different initializations, each trained for a separate task, into a single multi-task model without additional training. While previous model merging methods work for models trained on the same task, they fall short when combining models trained for different tasks. To overcome this limitation, the authors introduce “ZipIt,” a general merging method for arbitrary models that share the same architecture. ZipIt incorporates two key strategies: first, it allows features to be merged within each model to account for non-shared features, and second, it supports partial merging up to a specified layer, creating a multi-head model. These innovations result in a significant 20-60% improvement over previous methods, enabling the effective merging of models trained on disparate tasks.


LLMs & Knowledge Graphs

What are LLMs?

Large Language Models (LLMs) are AI tools that can understand and generate human language. They are powerful neural networks with billions of parameters trained on massive amounts of text data. The extensive training of these models gives them a deep understanding of human language’s structure and meaning.

LLMs can perform various language tasks like translation, sentiment analysis, chatbot conversation, etc. LLMs can comprehend intricate textual information, recognize entities and their connections, and produce text that maintains coherence and grammatical correctness.

What are Knowledge Graphs?

A Knowledge Graph is a database that represents and connects data and information about different entities. It comprises nodes representing any object, person, or place and edges defining the relationships between the nodes. This allows machines to understand how the entities relate to each other, share attributes, and draw connections between different things in the world around us.
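
A tiny example makes the node-and-edge structure concrete. The sketch below uses the networkx library to build a toy knowledge graph and query the relationships around one entity; the specific facts and relation names are illustrative only.

```python
import networkx as nx

# Nodes are entities; edges carry a relation type as an attribute.
kg = nx.MultiDiGraph()
kg.add_edge("Sam Altman", "OpenAI", relation="ceo_of")
kg.add_edge("GPT-4", "OpenAI", relation="developed_by")
kg.add_edge("OpenAI", "San Francisco", relation="headquartered_in")

# Query: which entities are directly connected to OpenAI, and how?
for src, dst, data in kg.edges(data=True):
    if "OpenAI" in (src, dst):
        print(f"{src} --{data['relation']}--> {dst}")
```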

Knowledge graphs can be used in various applications, such as recommended videos on YouTube, insurance fraud detection, product recommendations in retail, and predictive modeling. 

[Figure: Example of a knowledge graph. Source: https://arxiv.org/pdf/2306.08302.pdf]

LLMs and Knowledge Graphs

One of the main limitations of LLMs is that they are “black boxes,” i.e., it’s hard to understand how they arrive at a conclusion. Moreover, they frequently struggle to grasp and retrieve factual information, which can result in errors and inaccuracies known as hallucinations. 

This is where knowledge graphs can help LLMs by providing them with external knowledge for inference. However, knowledge graphs are difficult to construct and are continually evolving by nature. So, it is a good idea to use LLMs and knowledge graphs together to make the most of their respective strengths.

LLMs can be combined with Knowledge Graphs (KGs) using three approaches:

  1. KG-enhanced LLMs: These integrate KGs into LLMs during training and use them for better comprehension.
  2. LLM-augmented KGs: LLMs can improve various KG tasks like embedding, completion, and question answering.
  3. Synergized LLMs + KGs: LLMs and KGs work together, enhancing each other for two-way reasoning driven by data and knowledge.

KG-Enhanced LLMs

LLMs are well-known for their ability to excel in various language tasks by learning from vast text data. However, they face criticism for generating incorrect information (hallucination) and lacking interpretability. Researchers propose enhancing LLMs with knowledge graphs (KGs) to address these issues. 

KGs store structured knowledge, which can be used to improve LLMs’ understanding. Some methods integrate KGs during LLM pre-training, aiding knowledge acquisition, while others use KGs during inference to enhance domain-specific knowledge access. KGs are also used to interpret LLMs’ reasoning and facts for improved transparency.

LLM-augmented KGs

Knowledge graphs (KGs) store structured information crucial for real-world applications. However, current KG methods face challenges with incomplete data and text processing for KG construction. Researchers are exploring how to leverage the versatility of LLMs to address KG-related tasks.

One common approach involves using LLMs as text processors for KGs. LLMs analyze textual data within KGs and enhance KG representations. Some studies also employ LLMs to process original text data, extracting relations and entities to build KGs. Recent efforts aim to create KG prompts that make structural KGs understandable to LLMs. This enables direct application of LLMs to tasks like KG completion and reasoning.

Synergized LLMs + KGs

Researchers are increasingly interested in combining LLMs and KGs due to their complementary nature. To explore this integration, a unified framework called “Synergized LLMs + KGs” is proposed, consisting of four layers: Data, Synergized Model, Technique, and Application. 

LLMs handle textual data, KGs handle structural data, and with multi-modal LLMs and KGs, this framework can extend to other data types like video and audio. These layers collaborate to enhance capabilities and improve performance for various applications like search engines, recommender systems, and AI assistants.

Some Applications of LLMs and Knowledge Graphs

Multi-Hop Question Answering

Typically, when we use an LLM to retrieve information from documents, we divide them into chunks and then convert the chunks into vector embeddings. With this approach, we might not be able to find information that spans multiple documents. This is known as the problem of multi-hop question answering.

This issue can be solved using a knowledge graph. We can construct a structured representation of the information by processing each document separately and connecting them in a knowledge graph. This makes it easier to move around and explore connected documents, making it possible to answer complex questions that require multiple steps.

For example, if we want the LLM to answer the question "Did any former employee of OpenAI start their own company?", the LLM might return duplicated information, or other relevant information could be ignored. Extracting entities and relationships from the text to construct a knowledge graph makes it easier for the LLM to answer questions that span multiple documents.

Combining Textual Data with a Knowledge Graph

Another advantage of using a knowledge graph with an LLM is that by using the former, we can store both structured as well as unstructured data and connect them with relationships. This makes information retrieval easier.

For example, a knowledge graph can be used to store:

  • Structured data: Past Employees of OpenAI and the companies they started.
  • Unstructured data: News articles mentioning OpenAI and its employees.

With this setup, we can answer questions like “What’s the latest news about Prosper Robotics founders?” by starting from the Prosper Robotics node, moving to its founders, and then retrieving recent articles about them.
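
Continuing the networkx sketch from earlier, the snippet below shows what that two-hop traversal might look like over a graph that mixes structured facts with article nodes. The founder name, article ID, and relation names are hypothetical placeholders, not data from the article.

```python
import networkx as nx

kg = nx.MultiDiGraph()
# Structured facts (the founder and article names below are placeholders).
kg.add_edge("Founder X", "OpenAI", relation="former_employee_of")
kg.add_edge("Founder X", "Prosper Robotics", relation="founder_of")
# Unstructured data: news articles linked to the entities they mention.
kg.add_edge("article_123", "Founder X", relation="mentions", date="2023-09-12")

def latest_news_about_founders(graph: nx.MultiDiGraph, company: str):
    """Hop 1: company -> founders. Hop 2: founders -> articles that mention them."""
    founders = [u for u, v, d in graph.edges(data=True)
                if v == company and d["relation"] == "founder_of"]
    articles = [(u, d.get("date", "")) for u, v, d in graph.edges(data=True)
                if v in founders and d["relation"] == "mentions"]
    return sorted(articles, key=lambda item: item[1], reverse=True)

print(latest_news_about_founders(kg, "Prosper Robotics"))
```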

This adaptability makes it suitable for a wide range of LLM applications, as it can handle various data types and relationships between entities. The graph structure provides a clear visual representation of knowledge, making it easier for both developers and users to understand and work with.

Conclusion

Researchers are increasingly exploring the synergy between LLMs and KGs, with three main approaches: KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. These approaches aim to leverage both technologies’ strengths to address various language and knowledge-related tasks.

The integration of LLMs and KGs offers promising possibilities for applications such as multi-hop question answering, combining textual and structured data, and enhancing transparency and interpretability. As technology advances, this collaboration between LLMs and KGs holds the potential to drive innovation in fields like search engines, recommender systems, and AI assistants, ultimately benefiting users and developers alike.

