Janhavi Lande, Author at MarkTechPost

Meet LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
https://www.marktechpost.com/2023/10/25/meet-loftq-lora-fine-tuning-aware-quantization-for-large-language-models/
Wed, 25 Oct 2023

The introduction of Pre-trained Language Models (PLMs) has marked a transformative shift in Natural Language Processing. They have demonstrated exceptional proficiency across a wide range of language tasks, including Natural Language Understanding (NLU) and Natural Language Generation (NLG). However, these models typically incorporate millions or even billions of parameters, and their substantial computational and memory requirements present significant challenges for practical deployment.

In this paper, the authors introduce a novel quantization framework known as LoRA-Fine-Tuning-aware Quantization (LoftQ). The framework is specifically tailored to pre-trained models that require both quantization and LoRA fine-tuning: it combines low-rank approximation with quantization to jointly approximate the original high-precision pre-trained weights, providing a better initialization for subsequent LoRA fine-tuning.

The figure above compares QLoRA performance at different bit widths. Left: QLoRA initialization of LLAMA-2-13b on WikiText-2. Right: QLoRA applied to LLAMA-2-13b on the WikiText-2 language modeling task. Lower perplexity indicates better performance.

Quantization methods: the authors apply two quantization schemes to demonstrate that LoftQ is compatible with different quantization functions (a short sketch follows this list):

• Uniform quantization is a classic quantization method. It uniformly divides a continuous interval into 2^N bins and stores the local maximum absolute value for dequantization.

• NF4 and its 2-bit variant NF2 are quantization methods used in QLoRA. They assume that the high-precision values are drawn from a Gaussian distribution and map these values to discrete slots that have equal probability.
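
To make the initialization concrete, here is a minimal sketch, assuming per-tensor absmax uniform quantization and illustrative choices of rank and iteration count; it is not the authors' implementation, but it follows the alternating idea LoftQ describes: repeatedly quantize the residual of the current low-rank approximation and re-factorize what quantization leaves behind.

```python
import torch

def uniform_quantize(W: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Symmetric uniform quantization with per-tensor absmax scaling (a sketch)."""
    half_levels = 2 ** (num_bits - 1)
    scale = W.abs().max().clamp(min=1e-8)
    q = torch.round(W / scale * (half_levels - 1)).clamp(-half_levels, half_levels - 1)
    return q / (half_levels - 1) * scale                  # return the dequantized tensor

def loftq_init(W: torch.Tensor, rank: int = 16, num_bits: int = 4, iters: int = 5):
    """Alternate between quantizing the residual and refitting a rank-r correction,
    approximately minimizing ||W - Q - A @ B.T||_F (a LoftQ-style initialization)."""
    A = torch.zeros(W.shape[0], rank)
    B = torch.zeros(W.shape[1], rank)
    for _ in range(iters):
        Q = uniform_quantize(W - A @ B.T, num_bits)       # quantize what the adapters miss
        U, S, Vh = torch.linalg.svd(W - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank]                        # best rank-r factors of W - Q
        B = Vh[:rank, :].T
    return Q, A, B                                        # Q stays frozen; A, B seed LoRA

W = torch.randn(512, 512)
Q, A, B = loftq_init(W)
print((W - Q - A @ B.T).norm() / W.norm())                # joint approximation error
```

The returned Q would be stored in low precision, while A and B initialize the LoRA adapters that are fine-tuned downstream.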

The authors perform 2-bit and 4-bit quantization on all models, achieving compression ratios of 25-30% at the 4-bit level and 15-20% at the 2-bit level. All experiments are conducted on NVIDIA A100 GPUs.

The evaluation of the quantization framework is carried out through extensive experiments on various downstream tasks, including NLU, question answering, summarization, and NLG. The results demonstrate that LoftQ consistently surpasses QLoRA across all precision levels; for example, with 4-bit quantization, it attains gains of 1.1 and 0.8 Rouge-1 points on XSum and CNN/DailyMail, respectively. As the field of NLP continues to advance, further innovations and optimizations are expected to help bridge the gap between the immense potential of PLMs and their practical deployment, benefiting a wide range of applications and users.


Check out the Paper.
Enhancing Reasoning in Large Language Models: Check Out the Hypotheses-to-Theories (HtT) Framework for Accurate and Transferable Rule-Based Learning
https://www.marktechpost.com/2023/10/19/enhancing-reasoning-in-large-language-models-check-out-the-hypotheses-to-theories-htt-framework-for-accurate-and-transferable-rule-based-learning/
Fri, 20 Oct 2023

In the realm of reasoning tasks, large language models (LLMs) have displayed remarkable performance when provided with examples and intermediate steps. Nevertheless, approaches that depend on implicit knowledge within an LLM can sometimes produce erroneous answers when the implicit knowledge is incorrect or inconsistent with the task at hand. 

To address this issue, a team of researchers from Google, Mila – Québec AI Institute, Université de Montréal, HEC Montréal, the University of Alberta, and the CIFAR AI Chair introduce the Hypotheses-to-Theories (HtT) framework, which focuses on acquiring a rule library for LLM-based reasoning. HtT comprises two key stages: an induction stage and a deduction stage. In the induction stage, an LLM is first tasked with generating and validating rules based on a set of training examples.

The figure above illustrates the application of Hypotheses-to-Theories to the chain-of-thought method for solving base-9 arithmetic problems. For conciseness, the few-shot examples have been omitted. In the induction stage, the chain-of-thought (CoT) technique is used to generate rules and verify them against training samples.

Rules that frequently lead to correct answers are then gathered and refined into a rule library (in the figure, correct rules are marked in green and incorrect ones in red). In the deduction stage, the CoT prompt is augmented with knowledge from this rule library, and the LLM is prompted to apply the acquired rules when reasoning about test questions.
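
A minimal sketch of the two stages might look like the following. The `llm` callable, the `Rule:` line convention, the answer check, and the frequency and accuracy thresholds are hypothetical stand-ins for the paper's prompting setup, not the authors' code.

```python
from collections import Counter

def induce_rules(llm, train_examples, min_count=3, min_accuracy=0.7):
    """Induction stage (a sketch): generate CoT traces, harvest the rules they state,
    and keep rules that are used often and usually lead to the correct answer.
    `llm` is assumed to be a callable mapping a prompt string to generated text."""
    used, correct = Counter(), Counter()
    for question, answer in train_examples:
        trace = llm(f"Solve step by step, writing each rule you use on a 'Rule:' line.\n{question}")
        rules = [line.strip() for line in trace.splitlines() if line.strip().startswith("Rule:")]
        reached_answer = answer in trace                  # crude correctness check
        for rule in rules:
            used[rule] += 1
            correct[rule] += int(reached_answer)
    return [r for r in used
            if used[r] >= min_count and correct[r] / used[r] >= min_accuracy]

def deduce(llm, rule_library, question):
    """Deduction stage: prepend the learned rule library to the CoT prompt."""
    library = "\n".join(rule_library)
    return llm(f"Use only these rules:\n{library}\n\nQuestion: {question}\nAnswer step by step:")
```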

In their evaluation of HtT, the researchers integrate it as an enhancement to pre-existing few-shot prompting techniques, such as chain-of-thought and least-to-most prompting. Performance is assessed on two challenging multi-step reasoning problems that have proven to be problematic for current few-shot prompting approaches.

Experimental results on both numerical reasoning and relational reasoning problems reveal that HtT enhances existing prompting methods, achieving an increase in accuracy ranging from 11% to 27%. Furthermore, the acquired rules can be effectively transferred to different models and various forms of the same problem. The introduced method paves the way for a novel approach to acquiring textual knowledge using LLMs. It is anticipated that HtT will enable a range of applications and inspire further research in the field of LLMs.


Check out the Paper.
This AI Research Presents Neural A*: A Novel Data-Driven Search Method for Path Planning Problems
https://www.marktechpost.com/2023/10/17/this-ai-research-presents-neural-a-a-novel-data-driven-search-method-for-path-planning-problems/
Wed, 18 Oct 2023

Path planning identifies a cost-effective and valid path from an initial point to a target point within an environmental map. Search-based planning methods, which include the well-known A* search, are widely employed in addressing path-planning challenges. These techniques have found application in various domains, including autonomous vehicle navigation and robot arm manipulation.

Recent studies have highlighted the significant benefits of data-driven path planning in two specific scenarios. 

  • The first scenario involves the more efficient discovery of near-optimal paths in point-to-point shortest-path search problems compared to traditional heuristic planners. 
  • The second scenario pertains to enabling path planning using raw image inputs. This task is challenging for classical planners unless there is access to semantic pixel-wise labeling of the environment.

In this research, the authors reformulate the conventional A* search algorithm in a differentiable way and combine it with a convolutional encoder to create a fully trainable, end-to-end neural network planner. This approach, known as Neural A*, addresses path planning problems by transforming a given problem instance into a guidance map and then conducting a differentiable A* search based on that map.

The figure above demonstrates two scenarios of path planning with Neural A*:

  1. Point-to-point shortest path search: finding a near-optimal path (red) with fewer node explorations (green) for an input map.
  2. Path planning on raw image inputs: accurately predicting a human trajectory (red) on a natural image.

Through the process of learning to align search outcomes with expert-provided ground truth paths, Neural A* can generate paths that accurately and efficiently adhere to the ground truth. 

This figure shows the schematic diagram of Neural A*:

(1) A path-planning problem instance is fed to the encoder to produce a guidance map. 

(2) The differentiable A* module performs a point-to-point shortest path search with the guidance map and outputs a search history and a resulting path. 

(3) A loss between the search history and the ground-truth path is back-propagated to train the encoder. 
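
A rough sketch of that training loop is shown below. The encoder architecture, the three-channel input encoding (map, start, goal), the L1 loss, and the interface of the differentiable A* module are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class NeuralAStar(nn.Module):
    """Encoder + differentiable A* (a sketch, not the authors' code)."""
    def __init__(self, differentiable_astar):
        super().__init__()
        # Maps a problem instance (assumed here to be 3 channels: obstacle map, start, goal)
        # to a per-cell guidance (cost) map in [0, 1].
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )
        # Assumed interface: astar(guidance, starts, goals) -> (soft search history, path)
        self.astar = differentiable_astar

    def forward(self, problem_maps, starts, goals):
        guidance = self.encoder(problem_maps)
        return self.astar(guidance, starts, goals)

def training_step(model, batch, optimizer):
    history, _path = model(batch["maps"], batch["starts"], batch["goals"])
    # Penalize explored nodes that lie off the expert path, so the encoder learns
    # guidance maps that steer the search toward the ground truth.
    loss = torch.abs(history - batch["gt_paths"]).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```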

Comprehensive experimental results show that Neural A* surpasses state-of-the-art data-driven planners, achieving a favorable balance between search optimality and efficiency. Furthermore, Neural A* has demonstrated the capability to predict realistic human trajectories by applying search-based planning directly to natural image inputs.


Check out the Paper, Project, and GitHub.
CMU & Google DeepMind Researchers Introduce AlignProp: A Direct Backpropagation-Based AI Approach to Finetune Text-to-Image Diffusion Models for Desired Reward Function https://www.marktechpost.com/2023/10/16/cmu-google-deepmind-researchers-introduce-alignprop-a-direct-backpropagation-based-ai-approach-to-finetune-text-to-image-diffusion-models-for-desired-reward-function/ https://www.marktechpost.com/2023/10/16/cmu-google-deepmind-researchers-introduce-alignprop-a-direct-backpropagation-based-ai-approach-to-finetune-text-to-image-diffusion-models-for-desired-reward-function/#respond Mon, 16 Oct 2023 20:21:35 +0000 https://www.marktechpost.com/?p=44689 Probabilistic diffusion models have become the established norm for generative modeling in continuous domains. Leading the way in text-to-image diffusion models is DALLE. These models have gained prominence for their ability to generate images by training on extensive web-scale datasets. The paper discusses the recent emergence of text-to-image diffusion models at the forefront of image […]

The post CMU & Google DeepMind Researchers Introduce AlignProp: A Direct Backpropagation-Based AI Approach to Finetune Text-to-Image Diffusion Models for Desired Reward Function appeared first on MarkTechPost.

]]>

Probabilistic diffusion models have become the established norm for generative modeling in continuous domains, with text-to-image systems such as DALL-E leading the way. These models have gained prominence for their ability to generate images after training on extensive web-scale, unsupervised or weakly supervised text-to-image datasets. Because of this unsupervised nature, however, controlling their behavior in downstream tasks, such as optimizing human-perceived image quality, image-text alignment, or ethical image generation, remains a challenging endeavor.

Recent research has attempted to fine-tune diffusion models using reinforcement learning techniques, but this approach is known for its high variance in gradient estimators. In response, the paper introduces “AlignProp,” a method that aligns diffusion models with downstream reward functions through end-to-end backpropagation of the reward gradient during the denoising process.

AlignProp’s innovative approach mitigates the high memory requirements that would typically be associated with backpropagation through modern text-to-image models. It achieves this by fine-tuning low-rank adapter weight modules and implementing gradient checkpointing. 
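
As a rough sketch of the idea, a single AlignProp-style update backpropagates a reward gradient through the whole denoising chain while checkpointing each UNet call. The sampler, decoder, and reward interfaces, the latent shape, and the step count are assumptions, not the authors' code.

```python
import torch
from torch.utils.checkpoint import checkpoint

def alignprop_step(unet, sampler_step, decode, reward_fn, cond, optimizer, num_steps=50):
    """One reward-backpropagation update (a sketch).

    Assumed interfaces:
      unet(x, t, cond) -> predicted noise          (only its LoRA weights are trainable)
      sampler_step(x, eps, t) -> next latent       (a deterministic DDIM-like update)
      decode(x) -> images                          (the VAE decoder)
      reward_fn(images, cond) -> per-image reward  (differentiable)
    """
    x = torch.randn(cond.shape[0], 4, 64, 64, device=cond.device)     # initial latent
    for t in reversed(range(num_steps)):
        timestep = torch.full((x.shape[0],), t, device=x.device)
        # Gradient checkpointing: drop activations now and recompute them in the backward
        # pass, keeping memory roughly constant in the number of sampling steps.
        eps = checkpoint(unet, x, timestep, cond, use_reentrant=False)
        x = sampler_step(x, eps, t)
    loss = -reward_fn(decode(x), cond).mean()                          # ascend the reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return -loss.item()
```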

The paper evaluates the performance of AlignProp in fine-tuning diffusion models for various objectives, including image-text semantic alignment, aesthetics, image compressibility, and controllability of the number of objects in generated images, as well as combinations of these objectives. The results demonstrate that AlignProp outperforms alternative methods by achieving higher rewards in fewer training steps. Additionally, it is noted for its conceptual simplicity, making it a straightforward choice for optimizing diffusion models based on differentiable reward functions of interest. 

The AlignProp approach utilizes gradients obtained from the reward function for the purpose of fine-tuning diffusion models, resulting in improvements in both sampling efficiency and computational effectiveness. The experiments conducted consistently demonstrate the effectiveness of AlignProp in optimizing a wide range of reward functions, even for tasks that are difficult to define solely through prompts. In the future, potential research directions could involve extending these principles to diffusion-based language models, with the goal of improving their alignment with human feedback.


Check out the Paper and Project.
This AI Paper Introduces DSPy: A Programming Model that Abstracts Language Model Pipelines as Text Transformation Graphs
https://www.marktechpost.com/2023/10/14/this-ai-paper-introduces-dspy-a-programming-model-that-abstracts-language-model-pipelines-as-text-transformation-graphs/
Sat, 14 Oct 2023

Language models (LMs) have given researchers the ability to create natural language processing systems with less data and at more advanced levels of understanding. This has led to a growing field of “prompting” methods and lightweight fine-tuning techniques to make LMs work for new tasks. However, the problem is that LMs can be quite sensitive to how you ask them questions for each task, and this issue becomes more complex when you have multiple LM interactions in a single process. 

The machine learning (ML) community has been actively exploring methods for prompting language models (LMs) and building pipelines to tackle complex tasks. Unfortunately, existing LM pipelines often rely on hard-coded "prompt templates," lengthy strings discovered through trial and error. In pursuit of a more systematic approach to developing and optimizing LM pipelines, a team of researchers from several institutions, including Stanford, has introduced DSPy, a programming model that abstracts LM pipelines as text transformation graphs: imperative computation graphs in which LMs are invoked through declarative modules.

The modules in DSPy are parameterized, which means they can learn how to apply combinations of prompting, fine-tuning, augmentation, and reasoning techniques by creating and collecting demonstrations. They have designed a compiler to optimize any DSPy pipeline to maximize a specified metric. 

The DSPy compiler is designed to enhance the quality or cost-effectiveness of any DSPy program. It takes as inputs the program itself, a small set of training inputs (with optional labels), and a validation metric for performance assessment. The compiler then simulates different versions of the program on the provided inputs and generates example traces for each module. These traces serve as a means of self-improvement and are used to create effective few-shot prompts or to fine-tune smaller language models at various stages of the pipeline.

DSPy's optimization procedure is also flexible: it relies on "teleprompters," general-purpose optimization strategies that determine how each module in the pipeline should learn from the data.
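
The sketch below illustrates how a small DSPy program and a teleprompter fit together. The API names follow the project's public examples but may differ across DSPy versions, and the model name, metric, and one-example trainset are purely illustrative.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Configure the LM backend (constructor names vary by DSPy version).
lm = dspy.OpenAI(model="gpt-3.5-turbo")
dspy.settings.configure(lm=lm)

class CoTQA(dspy.Module):
    """A pipeline declared as modules instead of hand-written prompt templates."""
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.generate_answer(question=question)

def exact_match(example, prediction, trace=None):
    # Validation metric used by the teleprompter to keep only useful demonstrations.
    return example.answer.lower() in prediction.answer.lower()

trainset = [dspy.Example(question="What is 9 + 7?", answer="16").with_inputs("question")]

# The teleprompter "compiles" the program: it bootstraps few-shot demonstrations from the
# trainset and retains the ones that pass the metric, yielding an optimized pipeline.
compiled_qa = BootstrapFewShot(metric=exact_match).compile(CoTQA(), trainset=trainset)
print(compiled_qa(question="What is 12 * 11?").answer)
```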

Through two case studies, the authors demonstrate that concise DSPy programs can express and optimize sophisticated LM pipelines capable of solving math word problems, handling multi-hop retrieval, answering complex questions, and controlling agent loops. Within minutes of compilation, a few lines of DSPy code enable GPT-3.5 and llama2-13b-chat to self-bootstrap pipelines that outperform standard few-shot prompting by over 25% and 65%, respectively.

In conclusion, this work introduces a groundbreaking approach to natural language processing through the DSPy programming model and its associated compiler. By translating complex prompting techniques into parameterized declarative modules and leveraging general optimization strategies (teleprompters), this research offers a new way to build and optimize NLP pipelines with remarkable efficiency.


Check out the Paper and GitHub.
This AI Paper Introduces FELM: Benchmarking Factuality Evaluation of Large Language Models
https://www.marktechpost.com/2023/10/10/this-ai-paper-introduces-felm-benchmarking-factuality-evaluation-of-large-language-models/
Tue, 10 Oct 2023

Large language models (LLMs) have experienced remarkable success, ushering in a paradigm shift in generative AI through prompting. Nevertheless, a challenge associated with LLMs is their proclivity to generate inaccurate information or hallucinate content, which presents a significant obstacle to their broader applicability. Even cutting-edge LLMs like ChatGPT exhibit vulnerability to this issue. 

The assessment of text factuality generated by Large Language Models (LLMs) is emerging as a crucial research area aimed at improving the reliability of LLM outputs and alerting users to potential errors. However, the evaluators responsible for assessing factuality also require suitable evaluation tools to measure progress and foster advancements in their field. Unfortunately, this aspect of research has remained relatively unexplored, creating significant challenges for factuality evaluators. 

To address this gap, the authors of this study introduce FELM, a benchmark for Factuality Evaluation of Large Language Models. The figure above shows an example of what a factuality evaluation system should do: highlight the text spans in LLM responses that contain factual errors, explain each error, and provide references that justify the decision. The benchmark is built by collecting responses generated by LLMs and annotating factuality labels in a fine-grained manner.

Unlike previous studies that primarily assess the factuality of world knowledge, such as information sourced from Wikipedia, FELM places its emphasis on factuality assessment across diverse domains, spanning from general knowledge to mathematical and reasoning-related content. Annotation is performed segment by segment, so errors can be localized precisely; each erroneous segment is labeled with its error type and accompanied by reference links that support or refute the claim in question.

The authors then evaluate how well different LLM-based factuality evaluators can detect these errors, testing both vanilla prompting and variants augmented with retrieval and reasoning tools. The findings from these experiments reveal that, although retrieval mechanisms can aid factuality evaluation, current LLMs still fall short of accurately detecting factual errors.
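
One simple way to score such an evaluator against FELM's segment-level labels is precision, recall, and F1 over the segments flagged as erroneous. This is a sketch of that bookkeeping, not FELM's official evaluation script.

```python
def segment_f1(gold_labels, pred_labels):
    """Segment-level F1 for factual-error detection.
    Each label is a boolean per segment: True means the segment contains a factual error."""
    tp = sum(g and p for g, p in zip(gold_labels, pred_labels))
    fp = sum((not g) and p for g, p in zip(gold_labels, pred_labels))
    fn = sum(g and (not p) for g, p in zip(gold_labels, pred_labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: four segments of one LLM response, gold vs. predicted error flags.
print(segment_f1([False, True, False, True], [False, True, True, True]))   # 0.8
```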

Overall, this approach not only advances our understanding of factuality assessment but also provides valuable insights into the effectiveness of different computational methods in addressing the challenge of identifying factual errors in text, contributing to the ongoing efforts to enhance the reliability of language models and their applications.


Check out the Paper and Project.
Google DeepMind Introduces Direct Reward Fine-Tuning (DRaFT): An Effective Artificial Intelligence Method for Fine-Tuning Diffusion Models to Maximize Differentiable Reward Functions
https://www.marktechpost.com/2023/10/08/google-deepmind-introduces-direct-reward-fine-tuning-draft-an-effective-artificial-intelligence-method-for-fine-tuning-diffusion-models-to-maximize-differentiable-reward-functions/
Mon, 09 Oct 2023

Diffusion models have revolutionized generative modeling across various data types. However, in practical applications like generating aesthetically pleasing images from text descriptions, fine-tuning is often needed. Text-to-image diffusion models employ techniques like classifier-free guidance and curated datasets such as LAION Aesthetics to improve alignment and image quality.

In their research, the authors present a straightforward and efficient method for gradient-based reward fine-tuning, which involves differentiating through the diffusion sampling process. They introduce the concept of Direct Reward Fine-Tuning (DRaFT), which essentially backpropagates through the entire sampling chain, typically represented as an unrolled computation graph with a length of 50 steps. To manage memory and computational costs effectively, they employ gradient checkpointing techniques and optimize LoRA weights instead of modifying the entire set of model parameters.

The figure above demonstrates DRaFT applied with human preference reward models. The authors also introduce enhancements to the DRaFT method to improve its efficiency and performance. First, they propose DRaFT-K, a variant that limits backpropagation to only the last K steps of sampling when computing the fine-tuning gradient. Empirical results demonstrate that this truncated gradient approach significantly outperforms full backpropagation with the same number of training steps, as full backpropagation can lead to exploding gradients.

Additionally, the authors introduce DRaFT-LV, a variation of DRaFT-1 that computes lower-variance gradient estimates by averaging over multiple noise samples, further improving efficiency in their approach.
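
A sketch of the DRaFT-K idea, sampling for the full chain but building a gradient graph only over the last K steps, is shown below; the sampler, decoder, and reward interfaces and the latent shape are assumptions, not the authors' code. DRaFT-LV would additionally average the last-step gradient over several noise draws to reduce variance.

```python
import torch

def draft_k_step(unet, sampler_step, decode, reward_fn, cond, optimizer, T=50, K=1):
    """One DRaFT-K update (a sketch). unet, sampler_step, decode, and reward_fn are
    assumed callables with standard latent-diffusion interfaces."""
    x = torch.randn(cond.shape[0], 4, 64, 64, device=cond.device)
    for t in reversed(range(T)):
        keep_graph = t < K                    # only the last K denoising steps are differentiated
        with torch.set_grad_enabled(keep_graph):
            eps = unet(x, t, cond)
            x = sampler_step(x, eps, t)
        if not keep_graph:
            x = x.detach()                    # truncate backprop through the earlier steps
    loss = -reward_fn(decode(x), cond).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return -loss.item()
```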

The authors of the study applied DRaFT to Stable Diffusion 1.4 and conducted evaluations using various reward functions and prompt sets. Their methods, which leverage gradients, demonstrated significant efficiency advantages compared to RL-based fine-tuning baselines. For instance, they achieved over a 200-fold speed improvement when maximizing scores from the LAION Aesthetics Classifier compared to RL algorithms.

DRaFT-LV, one of their proposed variations, exhibited exceptional efficiency, learning approximately twice as fast as ReFL, a prior gradient-based fine-tuning method. Furthermore, they demonstrated the versatility of DRaFT by combining or interpolating DRaFT models with pre-trained models, which can be achieved by adjusting LoRA weights through mixing or scaling. 

In conclusion, directly fine-tuning diffusion models on differentiable rewards offers a promising avenue for improving generative modeling techniques, with implications for applications spanning images, text, and more. Its efficiency, versatility, and effectiveness make it a valuable addition to the toolkit of researchers and practitioners in the field of machine learning and generative modeling.


Check out the Paper.
Meet Concept2Box: Bridging the Gap Between High-Level Concepts and Fine-Grained Entities in Knowledge Graphs – A Dual Geometric Approach
https://www.marktechpost.com/2023/10/05/meet-concept2box-bridging-the-gap-between-high-level-concepts-and-fine-grained-entities-in-knowledge-graphs-a-dual-geometric-approach/
Fri, 06 Oct 2023

A lot of research has gone into finding ways to represent big sets of connected data, like knowledge graphs. These methods are called Knowledge Graph Embeddings (KGE), and they help us use this data for various practical purposes in the real world. 

Traditional methods have often overlooked a significant aspect of knowledge graphs, which is the presence of two distinct types of information: high-level concepts that relate to the overall structure (ontology view) and specific individual entities (instance view). Typically, these methods treat all nodes in the knowledge graph as vectors within a single hidden space. 

The figure above demonstrates a two-view knowledge graph, which comprises (1) an ontology-view knowledge graph containing high-level concepts and meta-relations, (2) an instance-view knowledge graph containing specific, detailed instances and relations, and (3) a collection of cross-view links connecting the two views. Concept2Box is designed to learn dual geometric embeddings: each concept is represented as a geometric box in the latent space, while entities are represented as point vectors.

In contrast to using a single geometric representation that cannot adequately capture the structural distinctions between two perspectives within a knowledge graph and lacks probabilistic meaning in relation to the granularity of concepts, the authors introduce Concept2Box. This innovative approach simultaneously embeds both views of a knowledge graph by employing dual geometric representations. Concepts are represented using box embeddings, enabling the learning of hierarchical structures and complex relationships like overlap and disjointness.

 
The volume of each box corresponds to the granularity of the concept, while entities are represented as vectors. To bridge the gap between concept box embeddings and entity vector embeddings, a novel vector-to-box distance metric is proposed, and both embeddings are learned jointly. Experimental evaluations on the publicly available DBpedia knowledge graph and a newly created industrial knowledge graph underscore the effectiveness of Concept2Box. The model is built to handle structural differences between the two views of a knowledge graph; however, modern knowledge graphs can also involve multiple languages, so different parts of a graph may differ not only in structure but also in language, an added complexity the authors leave to future work.
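
A toy sketch of the dual geometry follows: each concept is a box with learnable minimum and maximum corners, each entity is a point, and the distance from an entity to a concept is zero inside the box and the Euclidean distance to its surface outside. The specific distance and the softplus volume are plausible illustrations of the idea, not the paper's exact parameterization.

```python
import torch
import torch.nn.functional as F

def point_to_box_distance(point, box_min, box_max):
    """Distance from an entity vector to a concept box (zero if the point is inside)."""
    below = torch.clamp(box_min - point, min=0)       # violation below the lower corner
    above = torch.clamp(point - box_max, min=0)       # violation above the upper corner
    return torch.sqrt((below ** 2 + above ** 2).sum(-1))

def box_volume(box_min, box_max, temperature=1.0):
    # Softplus keeps side lengths positive; a larger volume reflects a coarser, more general concept.
    side = F.softplus(box_max - box_min, beta=1.0 / temperature)
    return side.prod(-1)

entity = torch.tensor([0.2, 0.9])                      # an instance-view entity embedding
concept_min = torch.tensor([0.0, 0.0])                 # an ontology-view concept box
concept_max = torch.tensor([1.0, 0.5])
print(point_to_box_distance(entity, concept_min, concept_max))   # tensor(0.4000)
print(box_volume(concept_min, concept_max))
```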


Check out the Paper.
This Research Paper Introduces LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models
https://www.marktechpost.com/2023/10/04/this-research-paper-introduces-lavie-high-quality-video-generation-with-cascaded-latent-diffusion-models/
Wed, 04 Oct 2023

In recent years, Diffusion Models (DMs) have made significant strides in the realm of image synthesis. This has led to a heightened focus on generating photorealistic images from text descriptions (T2I). Building upon the accomplishments of T2I models, there has been a growing interest among researchers in extending these techniques to the synthesis of videos controlled by text inputs (T2V). This expansion is driven by the anticipated applications of T2V models in domains such as filmmaking, video games, and artistic creation.

Achieving the right balance between video quality, training cost, and model compositionality remains a complex task, necessitating careful considerations in model architecture, training strategies, and the collection of high-quality text-video datasets.

In response to these challenges, a new integrated video generation framework called LaVie has been introduced. This framework, boasting a total of 3 billion parameters, operates using cascaded video latent diffusion models. LaVie serves as a foundational text-to-video model built upon a pre-trained T2I model (specifically, Stable Diffusion, as presented by Rombach et al., 2022). Its primary goal is to synthesize visually realistic and temporally coherent videos while retaining the creative generation capabilities of the pre-trained T2I model.

Figure 1 above shows text-to-video samples, and Figure 2 shows diverse video generation results from LaVie.

LaVie incorporates two key insights into its design. First, it utilizes simple temporal self-attention coupled with RoPE to effectively capture inherent temporal correlations in video data. Complex architectural modifications provide only marginal improvements in the generated results. Second, LaVie employs joint image-video fine-tuning, which is essential for producing high-quality and creative outcomes. Attempting to fine-tune directly on video datasets can compromise the model’s ability to mix concepts and lead to catastrophic forgetting. Joint image-video fine-tuning facilitates large-scale knowledge transfer from images to videos, encompassing scenes, styles, and characters.
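
The temporal module can be sketched as a self-attention layer that mixes information across frames at each spatial location, with rotary position embeddings (RoPE) encoding frame order. The tensor layout, head count, and RoPE formulation here are illustrative assumptions, not LaVie's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotary(x):
    """Minimal rotary position embedding over the time axis; x is (batch, heads, frames, dim)."""
    half = x.shape[-1] // 2
    freqs = 10000.0 ** (-torch.arange(half, device=x.device) / half)
    angles = torch.arange(x.shape[-2], device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class TemporalSelfAttention(nn.Module):
    """Attention over the frame axis only, one spatial location at a time."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                 # x: (batch, frames, tokens, dim)
        b, t, n, c = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        def split(z):                                     # fold spatial tokens into the batch
            z = z.reshape(b, t, n, self.heads, c // self.heads)
            return z.permute(0, 2, 3, 1, 4).reshape(b * n, self.heads, t, c // self.heads)
        q, k, v = map(split, (q, k, v))
        q, k = rotary(q), rotary(k)                       # RoPE encodes frame order
        out = F.scaled_dot_product_attention(q, k, v)     # attention mixes frames, not pixels
        out = out.reshape(b, n, self.heads, t, -1).permute(0, 3, 1, 2, 4).reshape(b, t, n, c)
        return self.proj(out)

frames = torch.randn(2, 16, 64, 256)                      # (batch, frames, spatial tokens, channels)
print(TemporalSelfAttention(256)(frames).shape)           # torch.Size([2, 16, 64, 256])
```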

Additionally, the publicly available text-video dataset, WebVid10M, is found to be inadequate for supporting the T2V task due to its low resolution and focus on watermark-centered videos. In response, LaVie benefits from a newly introduced text-video dataset named Vimeo25M, which comprises 25 million high-resolution videos (> 720p) accompanied by text descriptions. 

Experiments demonstrate that training on Vimeo25M significantly enhances LaVie’s performance, allowing it to generate superior results in terms of quality, diversity, and aesthetic appeal. Researchers envision LaVie as an initial step towards achieving high-quality T2V generation. Future research directions involve expanding the capabilities of LaVie to synthesize longer videos with intricate transitions and movie-level quality based on script descriptions.


Check out the Paper.
This AI Paper Introduces VidChapters-7M: A Scalable Approach to Segmenting Videos into Chapters Using User-Annotated Data
https://www.marktechpost.com/2023/10/01/this-ai-paper-introduces-vidchapters-7m-a-scalable-approach-to-segmenting-videos-into-chapters-using-user-annotated-data/
Sun, 01 Oct 2023

In the realm of video content organization, segmenting lengthy videos into chapters is an important capability, allowing users to pinpoint their desired information swiftly. Unfortunately, this topic has received hardly any research attention, largely due to the scarcity of publicly available datasets.

To address this challenge, VidChapters-7M is presented, a dataset comprising 817,000 videos that have been meticulously segmented into an impressive 7 million chapters. This dataset is assembled automatically by extracting user-annotated chapters from online videos, bypassing the need for labor-intensive manual annotation.
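
The user-annotated chapters that the dataset is built from typically appear in video descriptions as timestamped lines, so the extraction can be sketched as simple timestamp parsing. The regex and description format below are assumptions for illustration, not the authors' actual pipeline.

```python
import re

TIMESTAMP = re.compile(r"^\s*((?:\d{1,2}:)?\d{1,2}:\d{2})\s+(.+)$")   # e.g. "1:25 Building the dataset"

def to_seconds(ts: str) -> int:
    parts = [int(p) for p in ts.split(":")]
    return sum(p * 60 ** i for i, p in enumerate(reversed(parts)))

def parse_chapters(description: str, video_duration: int):
    """Turn a user-written description into (start, end, title) chapter tuples."""
    stamps = [(to_seconds(m.group(1)), m.group(2).strip())
              for line in description.splitlines()
              if (m := TIMESTAMP.match(line))]
    stamps.sort(key=lambda s: s[0])
    return [(start, stamps[i + 1][0] if i + 1 < len(stamps) else video_duration, title)
            for i, (start, title) in enumerate(stamps)]

desc = "0:00 Intro\n1:25 Building the dataset\n12:40 Results"
print(parse_chapters(desc, video_duration=900))
# [(0, 85, 'Intro'), (85, 760, 'Building the dataset'), (760, 900, 'Results')]
```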

Within the scope of VidChapters-7M, researchers have introduced three distinct tasks. Firstly, there is the video chapter generation task, which entails the temporal division of a video into segments, accompanied by the generation of a descriptive title for each segment. To further deconstruct this task, two variations are defined: video chapter generation with predefined segment boundaries, where the challenge lies in generating titles for segments with annotated boundaries, and video chapter grounding, which necessitates the localization of a chapter’s temporal boundaries based on its annotated title.

Source: https://arxiv.org/abs/2309.13952

These tasks were comprehensively evaluated using both fundamental baseline approaches and cutting-edge video-language models; the figure above illustrates the three tasks defined for VidChapters-7M. Furthermore, pre-training on VidChapters-7M is shown to yield remarkable advancements in dense video captioning, in both zero-shot and fine-tuning settings, notably raising the state of the art on benchmark datasets such as YouCook2 and ViTT. Finally, the experiments reveal a positive correlation between the size of the pre-training dataset and improved performance in downstream applications.

VidChapters-7M inherits certain limitations due to its origin from YT-Temporal-180M. These limitations are associated with the biases in the distribution of video categories that are present in the source dataset. The advancement of video chapter generation models has the potential to facilitate downstream applications, some of which could have negative societal impacts, such as video surveillance.

Additionally, models trained on VidChapters-7M may inadvertently reflect biases that exist within videos sourced from platforms like YouTube. It is necessary to maintain awareness of these considerations when deploying, analyzing, or building upon these models.


Check out the Paper, GitHub, and Project.