Daniele Lorenzi, Author at MarkTechPost
https://www.marktechpost.com/author/daniellorenzi/

Meet FreeU: A Novel AI Technique To Enhance Generative Quality Without Additional Training Or Fine-tuning
https://www.marktechpost.com/2023/10/26/meet-freeu-a-novel-ai-technique-to-enhance-generative-quality-without-additional-training-or-fine-tuning/

Probabilistic diffusion models, a cutting-edge category of generative models, have become a focal point of the research landscape, particularly for computer vision tasks. Distinct from other classes of generative models, such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and vector-quantized approaches, diffusion models introduce a novel generative paradigm. These models employ a fixed Markov chain to map the latent space, facilitating intricate mappings that capture the latent structural complexity of a dataset. Recently, their impressive generative capabilities, from the high level of detail to the diversity of the generated examples, have driven groundbreaking advancements in computer vision applications such as image synthesis, image editing, image-to-image translation, and text-to-video generation.

Diffusion models consist of two primary components: the diffusion process and the denoising process. During the diffusion process, Gaussian noise is progressively added to the input data, gradually transforming it into nearly pure Gaussian noise. The denoising process then aims to recover the original input data from its noisy state through a sequence of learned inverse diffusion operations. Typically, a U-Net is employed to predict, at each denoising step, the noise to be removed. Existing research predominantly focuses on using pre-trained diffusion U-Nets for downstream applications, with limited exploration of the internal characteristics of the diffusion U-Net itself.
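To make the two processes concrete, the snippet below gives a minimal sketch of the standard DDPM-style forward and reverse steps; it is not the authors' code. Here `model` stands in for any U-Net that predicts the added noise, and the linear noise schedule and step count are typical defaults rather than values from the paper.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)       # cumulative products

def diffuse(x0, t):
    """Forward process: q(x_t | x_0) adds Gaussian noise in closed form."""
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    return x_t, noise

@torch.no_grad()
def denoise_step(model, x_t, t):
    """One reverse step: the U-Net predicts the noise, which is then removed."""
    eps_hat = model(x_t, t)                      # predicted noise at step t
    a, a_bar, b = alphas[t], alpha_bars[t], betas[t]
    mean = (x_t - b / (1 - a_bar).sqrt() * eps_hat) / a.sqrt()
    if t == 0:
        return mean
    return mean + b.sqrt() * torch.randn_like(x_t)   # simple sigma_t = sqrt(beta_t) choice
```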

A joint study from S-Lab and Nanyang Technological University departs from the conventional application of diffusion models by investigating the effectiveness of the diffusion U-Net within the denoising process. To gain a deeper understanding of this process, the researchers shift their analysis to the Fourier domain and observe the generation process of diffusion models there, a relatively unexplored research perspective.

A figure in the paper illustrates the progressive denoising process in its top row, showing the generated images at successive iterations. The following two rows present the associated low-frequency and high-frequency components, recovered in the spatial domain via the inverse Fourier transform, for each corresponding step. The figure reveals that low-frequency components are modulated only gradually, indicating a subdued rate of change, whereas high-frequency components exhibit far more pronounced dynamics throughout the denoising process. These findings have an intuitive explanation: low-frequency components represent an image's global structure and characteristics, encompassing global layouts and smooth colors, and drastic alterations to them are generally undesirable during denoising because they can fundamentally reshape the image's essence. High-frequency components, on the other hand, capture rapid changes in the image, such as edges and textures, and are highly sensitive to noise; denoising must remove that noise while preserving these intricate details.
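The frequency-domain analysis behind this observation can be reproduced on any intermediate image with a few lines of NumPy. The sketch below is purely illustrative: it splits an image into low- and high-frequency components with an FFT and a radial mask, where the cutoff radius is an arbitrary choice rather than a value used in the study.

```python
import numpy as np

def split_frequencies(img, cutoff=16):
    """Return low- and high-frequency components of a 2D grayscale image."""
    f = np.fft.fftshift(np.fft.fft2(img))               # centered spectrum
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    low_mask = dist <= cutoff                            # keep frequencies near the center
    low = np.fft.ifft2(np.fft.ifftshift(f * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(f * (~low_mask))).real
    return low, high

# Example: track how each band changes between consecutive denoising iterations.
# imgs = [step_0, step_1, ...]   # intermediate results of the sampler
# low_deltas = [np.abs(split_frequencies(a)[0] - split_frequencies(b)[0]).mean()
#               for a, b in zip(imgs, imgs[1:])]
```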

Considering these observations regarding low-frequency and high-frequency components during denoising, the investigation extends to determine the specific contributions of the U-Net architecture within the diffusion framework. At each stage of the U-Net decoder, skip features from the skip connections and backbone features are combined. The study reveals that the primary backbone of the U-Net plays a significant role in denoising, while the skip connections introduce high-frequency features into the decoder module, aiding in the recovery of fine-grained semantic information. However, this propagation of high-frequency features can inadvertently weaken the inherent denoising capabilities of the backbone during the inference phase, potentially leading to the generation of abnormal image details, as depicted in the first row of Figure 1.

In light of this discovery, the researchers propose a new approach referred to as “FreeU,” which can enhance the quality of generated samples without requiring additional computational overhead from training or fine-tuning. The overview of the framework is reported below.

During the inference phase, two specialized modulation factors are introduced to balance the contributions of features from the primary backbone and from the skip connections of the U-Net architecture. The first, the “backbone feature scaling factors,” amplify the feature maps of the primary backbone, thereby strengthening the denoising process. However, while these backbone scaling factors yield significant improvements, they can occasionally cause undesired over-smoothing of textures. To address this, the second set of factors, the “skip feature scaling factors,” is introduced to mitigate texture over-smoothing.
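In code, the idea reduces to rescaling two tensors at the point where the decoder merges them. The following is a simplified, hedged sketch of that merge; the official FreeU implementation additionally restricts the backbone scaling to a subset of channels and attenuates only the low-frequency part of the skip features in the Fourier domain, and the factor values shown here are illustrative.

```python
import torch

def freeu_merge(backbone_feat, skip_feat, b=1.2, s=0.9):
    """Simplified FreeU-style merge of U-Net decoder features.

    backbone_feat: features coming up the U-Net backbone
    skip_feat:     features arriving through the skip connection
    b > 1 amplifies the backbone (stronger denoising),
    s < 1 attenuates the skip contribution (less high-frequency leakage).
    """
    amplified_backbone = backbone_feat * b
    damped_skip = skip_feat * s
    return torch.cat([amplified_backbone, damped_skip], dim=1)  # channel concat, as in U-Net decoders
```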

The FreeU framework demonstrates seamless adaptability when integrated with existing diffusion models, including applications like text-to-image generation and text-to-video generation. A comprehensive experimental evaluation of this approach is conducted using foundational models such as Stable Diffusion, DreamBooth, ReVersion, ModelScope, and Rerender for benchmark comparisons. When FreeU is applied during the inference phase, these models show a noticeable enhancement in the quality of the generated outputs. The visual representation in the illustration below provides evidence of FreeU’s effectiveness in significantly improving both intricate details and the overall visual fidelity of the generated images.

This was the summary of FreeU, a novel AI technique that enhances generative models’ output quality without additional training or fine-tuning. If you are interested and want to learn more about it, please feel free to refer to the links cited below. 


Check out the Paper and Project Page. All Credit For This Research Goes To the Researchers on This Project.

Meet Decaf: a Novel Artificial Intelligence Monocular Deformation Capture Framework for Face and Hand Interactions
https://www.marktechpost.com/2023/10/15/meet-decaf-a-novel-artificial-intelligence-monocular-deformation-capture-framework-for-face-and-hand-interactions/

Three-dimensional (3D) tracking from monocular RGB videos is a cutting-edge field in computer vision and artificial intelligence. It focuses on estimating the three-dimensional positions and motions of objects or scenes using only a single, two-dimensional video feed. 

Existing methods for 3D tracking from monocular RGB videos primarily focus on articulated and rigid objects, such as two hands or humans interacting with rigid environments. The challenge of modeling dense, non-rigid object deformations, such as hand-face interaction, has largely been overlooked. However, these deformations can significantly enhance the realism of applications like AR/VR, 3D virtual avatar communication, and character animations. The limited attention to this issue is attributed to the inherent complexity of the monocular view setup and associated difficulties, such as acquiring appropriate training and evaluation datasets and determining reasonable non-uniform stiffness for deformable objects.

Therefore, this article introduces a novel method that tackles the aforementioned fundamental challenges. It enables the tracking of human hands interacting with human faces in 3D from single monocular RGB videos. The method models hands as articulated objects that induce non-rigid facial deformations during active interactions. An overview of this technique is reported in the figure below.

This approach relies on a newly created dataset capturing hand-face motion and interaction, including realistic face deformations. In making this dataset, the authors employ position-based dynamics to process the raw 3D shapes and develop a technique for estimating the non-uniform stiffness of head tissues. These steps result in credible annotations of surface deformations, hand-face contact regions, and head-hand positions.

At the heart of their neural approach is a variational auto-encoder that provides the depth information for the hand-face interaction. Additionally, modules are employed to guide the 3D tracking process by estimating contacts and deformations. The final 3D reconstructions of hands and faces produced by this method are both realistic and more plausible when compared to several baseline methods applicable in this context, as supported by quantitative and qualitative evaluations.

Reconstructing both hands and the face simultaneously, considering the surface deformations resulting from their interactions, poses a notably challenging task. This becomes especially crucial for enhancing realism in reconstructions since such interactions are frequently observed in everyday life and significantly influence the impressions others form of an individual. Consequently, reconstructing hand-face interactions is vital in applications like avatar communication, virtual/augmented reality, and character animation, where lifelike facial movements are essential for creating immersive experiences. It also has implications for applications such as sign language transcription and driver drowsiness monitoring.

Despite various studies focusing on the reconstruction of face and hand motions, capturing the interactions between them, along with the corresponding deformations, from a monocular RGB video has remained largely unexplored, as noted by Tretschk et al. in 2023. On the other hand, attempting to use existing template-based methods for hand and face reconstruction often leads to artifacts such as collisions and the omission of interactions and deformations. This is primarily due to the inherent depth ambiguity of monocular setups and the absence of deformation modeling in the reconstruction process.

Several significant challenges are associated with this problem. One challenge (I) is the absence of a markerless RGB capture dataset for face and hand interactions with non-rigid deformations, which is essential for training models and evaluating methods. Creating such a dataset is highly challenging due to frequent occlusions caused by hand and head movements, particularly in regions where non-rigid deformation occurs. Another challenge (II) arises from the inherent depth ambiguity of single-view RGB setups, making it difficult to obtain accurate localization information and resulting in errors like collisions or a lack of contact between the hand and head during interactions.

To address these challenges, the authors introduce “Decaf” (short for deformation capture of faces interacting with hands), a monocular RGB method designed to capture face and hand interactions along with facial deformations. Specifically, they propose a solution that combines a multiview capture setup with a position-based dynamics simulator to reconstruct the interacting surface geometry, even in the presence of occlusions. To incorporate the deformable object simulator, they determine the stiffness values of a head mesh using a method called “skull-skin distance” (SSD), which assigns non-uniform stiffness to the mesh. This approach significantly enhances the qualitative plausibility of the reconstructed geometry compared to using uniform stiffness values.
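As a rough illustration of the SSD idea, assigning softer material where the tissue between the skin and the skull is thicker, consider the sketch below. The nearest-neighbor distance query and the linear mapping from distance to stiffness are assumptions made for this example; the paper's calibrated stiffness values and surface processing may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def skull_skin_stiffness(skin_verts, skull_verts, k_min=0.2, k_max=1.0):
    """Assign a per-vertex stiffness from the skin-to-skull distance.

    skin_verts, skull_verts: (N, 3) and (M, 3) vertex arrays.
    Thin tissue (skin close to bone, e.g., the forehead) -> stiff,
    thick tissue (e.g., the cheeks) -> soft.
    """
    tree = cKDTree(skull_verts)
    dist, _ = tree.query(skin_verts)                         # distance of each skin vertex to the skull
    d = (dist - dist.min()) / (dist.max() - dist.min() + 1e-8)  # normalize to [0, 1]
    return k_max - d * (k_max - k_min)                       # larger distance -> lower stiffness
```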

Using their newly created dataset, the researchers train neural networks to extract 3D surface deformations, contact regions on the head and hand surfaces, and an interaction depth prior from single-view RGB images. In the final optimization stage, this information from various sources is utilized to obtain realistic 3D hand and face interactions with non-rigid surface deformations, resolving the depth ambiguity inherent in the single-view setup. The results illustrated below demonstrate much more plausible hand-face interactions compared to existing approaches.

This was the summary of Decaf, a novel AI framework designed to capture face and hand interactions along with facial deformations. If you are interested and want to learn more about it, please feel free to refer to the links cited below.


Check out the Paper and Project. All Credit For This Research Goes To the Researchers on This Project.

Meet POCO: A Novel Artificial Intelligence Framework for 3D Human Pose and Shape Estimation
https://www.marktechpost.com/2023/10/15/meet-poco-a-novel-artificial-intelligence-framework-for-3d-human-pose-and-shape-estimation/

Estimating 3D Human Pose and Shape (HPS) from images and videos is necessary to reconstruct human actions in real-world settings. Nevertheless, 3D inference from 2D images poses significant challenges due to factors such as depth ambiguity, occlusion, unusual clothing, and motion blur. Even the most advanced HPS methods make errors and are often unaware of these mistakes. HPS is an intermediate task whose output is consumed by downstream tasks such as human behavior understanding or 3D graphics applications. These downstream tasks need a mechanism for assessing the accuracy of HPS results, so HPS methods must produce an uncertainty (or confidence) value that correlates with the quality of their estimates.

One approach to addressing this uncertainty is to output multiple bodies, yet this still lacks an explicit measure of uncertainty. Some exceptions do exist, which estimate a distribution over body parameters. One approach is to compute uncertainty by drawing samples from a distribution over bodies and calculating the standard deviation of these samples. While this method is valid, it suffers from two limitations: it is slow since it necessitates multiple forward network passes to generate samples, and it trades off accuracy for speed. More samples improve accuracy but increase computational demands. 
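To make that trade-off concrete, the sampling-based baseline can be sketched as follows, assuming a `model` that returns one sample of pose parameters per forward pass: the per-parameter standard deviation over N passes serves as the uncertainty estimate, at the cost of N times the inference work.

```python
import torch

@torch.no_grad()
def sampling_uncertainty(model, image, n_samples=25):
    """Estimate pose uncertainty by drawing several pose samples for one image."""
    poses = torch.stack([model(image) for _ in range(n_samples)])  # (N, pose_dim)
    mean_pose = poses.mean(dim=0)
    uncertainty = poses.std(dim=0)      # high standard deviation = low confidence
    return mean_pose, uncertainty
```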

Recently, an approach has been developed to skip explicit supervision by training a network to output both body parameters and uncertainty simultaneously. Inspired by work on semantic segmentation, it uses a Gaussian-based base density function but recognizes the need for more complex distributions for modeling human poses. Methods directly estimating uncertainty typically include a base density function and a scale network. Existing methods use an unconditional bDF and solely rely on image features for the scale network. This approach works well when samples share a similar distribution but falls short when handling diverse datasets required for robust 3D HPS models.

The authors introduce POCO (“POse and shape estimation with COnfidence”), a novel framework applicable to standard HPS methods to address these challenges. POCO extends these methods to estimate uncertainty. In a single feed-forward pass, POCO directly infers both Skinned Multi-Person Linear Model (SMPL) body parameters and its regression uncertainty, which is highly correlated with the reconstruction quality. The key innovation in this framework is the Dual Conditioning Strategy (DCS), which enhances the base density function and scale network. An overview of the framework is presented in the figure below.

Unlike previous approaches, POCO introduces a conditional vector (Cond-bDF) to model the base density function of the inferred pose error. Rather than using a simplistic one-hot data source encoding, POCO employs image features for conditioning, enabling more scalable training on diverse and complex image datasets. Furthermore, POCO’s authors introduce an enhanced approach for estimating uncertainty in HPS models. They use image features and condition the network on the SMPL pose, resulting in improved pose reconstruction and better uncertainty estimation. Their method can be seamlessly integrated into existing HPS models, improving accuracy without downsides. The study claims this approach outperforms state-of-the-art methods in correlating uncertainty with pose errors. The results displayed in their work are reported below.
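As a rough picture of how a single forward pass can yield both body parameters and a correlated confidence, consider the heteroscedastic regression sketch below. It uses a plain Gaussian negative log-likelihood, so a predicted scale that is too small or too large is penalized directly; POCO's actual formulation relies on a richer, image- and pose-conditioned density, and the layer sizes here are placeholders.

```python
import torch
import torch.nn as nn

class PoseWithUncertainty(nn.Module):
    """Predict SMPL-style pose parameters plus a per-sample scale (uncertainty)."""

    def __init__(self, feat_dim=2048, pose_dim=144):
        super().__init__()
        self.pose_head = nn.Linear(feat_dim, pose_dim)
        self.scale_head = nn.Sequential(nn.Linear(feat_dim, 1), nn.Softplus())  # sigma > 0

    def forward(self, img_feat):
        return self.pose_head(img_feat), self.scale_head(img_feat) + 1e-3

def gaussian_nll(pred_pose, sigma, gt_pose):
    """Negative log-likelihood: large errors must come with a large predicted sigma."""
    err = ((pred_pose - gt_pose) ** 2).mean(dim=-1, keepdim=True)
    return (err / (2 * sigma ** 2) + sigma.log()).mean()
```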

This was the summary of POCO, a novel AI framework for 3D human pose and shape estimation. If you are interested and want to learn more about it, please feel free to refer to the links cited below. 


Check out the Paper and Project. All Credit For This Research Goes To the Researchers on This Project.

AI Researchers from Bytedance and the King Abdullah University of Science and Technology Present a Novel Framework For Animating Hair Blowing in Still Portrait Photos
https://www.marktechpost.com/2023/10/11/ai-researchers-from-bytedance-and-the-king-abdullah-university-of-science-and-technology-present-a-novel-framework-for-animating-hair-blowing-in-still-portrait-photos/

Hair is one of the most remarkable features of the human body, impressing with its dynamic qualities that bring scenes to life. Studies have consistently demonstrated that dynamic elements have a stronger appeal and fascination than static images. Social media platforms like TikTok and Instagram witness the daily sharing of vast portrait photos as people aspire to make their pictures both appealing and artistically captivating. This drive fuels researchers’ exploration into the realm of animating human hair within still images, aiming to offer a vivid, aesthetically pleasing, and beautiful viewing experience.

Recent advancements in the field have introduced methods to infuse still images with dynamic elements, animating fluid substances such as water, smoke, and fire within the frame. Yet, these approaches have largely overlooked the intricate nature of human hair in real-life photographs. This article focuses on the artistic transformation of human hair within portrait photography, which involves translating the picture into a cinemagraph.

A cinemagraph represents an innovative short video format that enjoys favor among professional photographers, advertisers, and artists. It finds utility in various digital mediums, including digital advertisements, social media posts, and landing pages. The fascination for cinemagraphs lies in their ability to merge the strengths of still images and videos. Certain areas within a cinemagraph feature subtle, repetitive motions in a short loop, while the remainder remains static. This contrast between stationary and moving elements effectively captivates the viewer’s attention.

Through the transformation of a portrait photo into a cinemagraph, complete with subtle hair motions, the idea is to enhance the photo’s allure without detracting from the static content, creating a more compelling and engaging visual experience.

Existing techniques and commercial software have been developed to generate high-fidelity cinemagraphs from input videos by selectively freezing certain video regions. Unfortunately, these tools are not suitable for processing still images. In contrast, there has been a growing interest in still-image animation. Most of these approaches have focused on animating fluid elements such as clouds, water, and smoke. However, the dynamic behavior of hair, composed of fibrous materials, presents a distinctive challenge compared to fluid elements. Unlike fluid element animation, which has received extensive attention, the animation of human hair in real portrait photos has been relatively unexplored.

Animating hair in a static portrait photo is challenging due to the intricate complexity of hair structures and dynamics. Unlike the smooth surfaces of the human body or face, hair comprises hundreds of thousands of individual components, resulting in complex and non-uniform structures. This complexity leads to intricate motion patterns within the hair, including interactions with the head. While there are specialized techniques for modeling hair, such as using dense camera arrays and high-speed cameras, they are often costly and time-consuming, limiting their practicality for real-world hair animation.

The paper presented in this article introduces a novel AI method for automatically animating hair within a static portrait photo, eliminating the need for user intervention or complex hardware setups. The insight behind this approach lies in the human visual system’s reduced sensitivity to individual hair strands and their motions in real portrait videos, compared to synthetic strands within a digitalized human in a virtual environment. The proposed solution is to animate “hair wisps” instead of individual strands, creating a visually pleasing viewing experience. To achieve this, the paper introduces a hair wisp animation module, enabling an efficient and automated solution. An overview of this framework is illustrated below.

The key challenge in this context is how to extract these hair wisps. While related work, such as hair modeling, has focused on hair segmentation, these approaches primarily target the extraction of the entire hair region, which differs from the objective. To extract meaningful hair wisps, the researchers innovatively frame hair wisp extraction as an instance segmentation problem, where an individual segment within a still image corresponds to a hair wisp. By adopting this problem definition, the researchers leverage instance segmentation networks to facilitate the extraction of hair wisps. This not only simplifies the hair wisp extraction problem but also enables the use of advanced networks for effective extraction. Additionally, the paper presents the creation of a hair wisp dataset containing real portrait photos to train the networks, along with a semi-annotation scheme to produce ground-truth annotations for the identified hair wisps. Some sample results from the paper are reported in the figure below compared with state-of-the-art techniques.
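Framing wisp extraction as instance segmentation means that, in principle, any off-the-shelf instance-segmentation network can be fine-tuned on the wisp dataset. The sketch below uses torchvision's Mask R-CNN purely to illustrate that framing; the network, class definitions, and thresholds are assumptions for this example, not the authors' actual pipeline.

```python
import torch
import torchvision

# One foreground class ("hair wisp") plus background; the model would be
# fine-tuned on the annotated wisp dataset before use.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=2)

@torch.no_grad()
def extract_wisps(image, score_thresh=0.5):
    """Return one binary mask per detected hair wisp for a (3, H, W) image in [0, 1]."""
    model.eval()
    out = model([image])[0]
    keep = out["scores"] > score_thresh
    return out["masks"][keep, 0] > 0.5      # (num_wisps, H, W) boolean masks
```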

This was the summary of a novel AI framework designed to transform still portraits into cinemagraphs by animating hair wisps with pleasing motions without noticeable artifacts. If you are interested and want to learn more about it, please feel free to refer to the links cited below.


Check out the Paper and Project Page. All Credit For This Research Goes To the Researchers on This Project.

Revolutionizing CPR Training With CPR-Coach: Harnessing Artificial Intelligence for Error Recognition and Assessment
https://www.marktechpost.com/2023/10/09/revolutionizing-cpr-training-with-cpr-coach-harnessing-artificial-intelligence-for-error-recognition-and-assessment/

Cardiopulmonary Resuscitation (CPR) is a life-saving medical procedure designed to revive individuals who have experienced cardiac arrest, in which the heart suddenly stops beating effectively or a person stops breathing. The procedure aims to maintain the flow of oxygenated blood to vital organs, particularly the brain, until professional medical help arrives or the person can be transported to a healthcare facility for advanced care. Performing CPR requires endurance but becomes straightforward once the correct movements are followed. However, there are several actions to master, such as chest compressions, rescue breaths, and early defibrillation (when the right equipment is available). Since CPR is a vital emergency skill, it is essential to spread this fundamental expertise as widely as possible. Nevertheless, CPR assessment traditionally relies on physical mannequins and instructors, resulting in high training costs and limited efficiency. Furthermore, because neither instructors nor this very specific equipment is available everywhere, the traditional approach is hard to scale.

In a groundbreaking development, the research presented in this article introduced a vision-based system to enhance error action recognition and skill assessment during CPR. This innovative approach marks a significant departure from conventional training methods. Specifically, 13 distinct single-error actions and 74 composite error actions associated with external cardiac compression have been identified and categorized. This innovative CPR-based research is the first to analyze action-specific errors commonly committed during this procedure. The researchers have curated a comprehensive video dataset called CPR-Coach to facilitate this novel approach. An overview of some of the most typical errors annotated in the dataset is reported below.

Figure source: https://shunli-wang.github.io/CPR-Coach/

Using CPR-Coach as their reference dataset, the authors embarked on a thorough investigation, evaluating and comparing the performance of various action recognition models that leverage different data modalities. Their objective is to address the challenge posed by the single-class training and multi-class testing problem inherent in CPR skill assessment. To tackle this issue, they introduced a pioneering framework called ImagineNet, inspired by human cognition principles. ImagineNet is designed to enhance the model’s capacity for recognizing multiple errors within the CPR context, even under the constraints of limited supervision.
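The single-error training and composite-error testing setting effectively turns inference into a multi-label problem: a clip may contain several of the 13 single-error actions at once. The sketch below illustrates that problem setup rather than ImagineNet itself, with `backbone` standing in for any video action-recognition network that outputs one logit per error class.

```python
import torch

ERROR_CLASSES = 13                       # single-error action categories in CPR-Coach

@torch.no_grad()
def recognize_errors(backbone, clip, threshold=0.5):
    """Multi-label inference: several compression errors may co-occur in one clip."""
    logits = backbone(clip)              # expected shape: (ERROR_CLASSES,)
    probs = torch.sigmoid(logits)        # independent probability per error type
    detected = (probs > threshold).nonzero(as_tuple=True)[0].tolist()
    return detected, probs
```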

An overview of ImagineNet’s workflow is presented in the figure below.

Figure source: https://shunli-wang.github.io/CPR-Coach/

This research represents a significant leap forward in the assessment of CPR skills, offering the potential to reduce training costs and enhance the efficiency of CPR instruction through the innovative application of vision-based technology and advanced deep learning models. Ultimately, this approach has the potential to improve the quality of CPR training and, by extension, the outcomes for individuals experiencing cardiac emergencies.

This was the summary of CPR-Coach and ImagineNet, two AI tools designed to analyze CPR-related errors and automate the CPR assessment task. If you are interested and want to learn more about them, please feel free to refer to the links cited below.


Check out the Project. All Credit For This Research Goes To the Researchers on This Project.

Meet ReVersion: A Novel AI Diffusion-Based Framework to Address the Relation Inversion Task from Images
https://www.marktechpost.com/2023/09/28/meet-reversion-a-novel-ai-diffusion-based-framework-to-address-the-relation-inversion-task-from-images/

Recently, text-to-image (T2I) diffusion models have exhibited promising results, sparking exploration of numerous generative tasks. Some efforts invert pre-trained text-to-image models to obtain text embedding representations that capture object appearances in reference images. However, there has been limited exploration of capturing object relations, a more challenging task that requires understanding the interactions between objects and the composition of an image. Existing inversion methods struggle with this task because of entity leakage from the reference images: the appearances of the exemplar objects leak into the learned prompt, so generations reproduce those specific entities rather than the intended relation.

Nonetheless, addressing this challenge is of significant importance.

This study focuses on the Relation Inversion task, which aims to learn relationships in given exemplar images. The objective is to derive a relation prompt within the text embedding space of a pre-trained text-to-image diffusion model, where objects in each exemplar image follow a specific relation. Combining the relation prompt with user-defined text prompts allows users to generate images corresponding to specific relationships while customizing objects, styles, backgrounds, and more.

A preposition prior is introduced to enhance the representation of high-level relation concepts with the learnable prompt. This prior rests on three observations: (i) prepositions are closely linked to relations; (ii) prepositions and words of other parts of speech form separate clusters in the text embedding space; and (iii) complex real-world relations can be expressed with a basic set of prepositions.

Building upon the preposition prior, a novel framework termed ReVersion is proposed to address the Relation Inversion problem. An overview of the framework is illustrated below. 

This framework incorporates a novel relation-steering contrastive learning scheme to guide the relation prompt toward a relation-dense region in the text embedding space. Basis prepositions are used as positive samples to encourage embedding into the sparsely activated area. At the same time, words of other parts of speech in text descriptions are considered negatives, disentangling semantics related to object appearances. A relation-focal importance sampling strategy is devised to emphasize object interactions over low-level details, constraining the optimization process for improved relation inversion results.
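Conceptually, the steering objective is an InfoNCE-style contrastive loss on the learnable relation token: preposition embeddings act as positives and embeddings of other words as negatives. The sketch below renders that idea with cosine similarities and a temperature; the paper's exact loss, its sampling of positives and negatives, and the relation-focal importance weighting are more involved than this.

```python
import torch
import torch.nn.functional as F

def steering_loss(relation_token, preposition_embs, negative_embs, temperature=0.07):
    """Pull the learnable relation token toward prepositions, away from other words.

    relation_token: (D,) learnable embedding of the <R> token
    preposition_embs: (P, D) basis preposition embeddings (positives)
    negative_embs: (N, D) embeddings of other parts of speech from the captions
    """
    r = F.normalize(relation_token, dim=-1)
    pos = F.normalize(preposition_embs, dim=-1)
    neg = F.normalize(negative_embs, dim=-1)
    pos_logits = pos @ r / temperature           # (P,)
    neg_logits = neg @ r / temperature           # (N,)
    # Contrastive objective: positives should dominate the combined softmax.
    denom = torch.logsumexp(torch.cat([pos_logits, neg_logits]), dim=0)
    return -(torch.logsumexp(pos_logits, dim=0) - denom)
```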

In addition, the researchers introduce the ReVersion Benchmark, which offers a variety of exemplar images featuring diverse relations. This benchmark serves as an evaluation tool for future research in the Relation Inversion task. Results across various relations demonstrate the effectiveness of the preposition prior and the ReVersion framework.

The study presents a selection of the obtained outcomes; since Relation Inversion is a novel task, there is no existing state-of-the-art approach to compare against.

This was the summary of ReVersion, a novel AI diffusion model framework designed to address the Relation Inversion task. If you are interested and want to learn more about it, please feel free to refer to the links cited below. 


Check out the Paper and Project. All Credit For This Research Goes To the Researchers on This Project.

Advancing Image Inpainting: Bridging the Gap Between 2D and 3D Manipulations with this Novel AI Inpainting for Neural Radiance Fields
https://www.marktechpost.com/2023/09/20/advancing-image-inpainting-bridging-the-gap-between-2d-and-3d-manipulations-with-this-novel-ai-inpainting-for-neural-radiance-fields/

There has been enduring interest in the manipulation of images due to its wide range of applications in content creation. One of the most extensively studied manipulations is object removal and insertion, often referred to as the image inpainting task. While current inpainting models are proficient at generating visually convincing content that blends seamlessly with the surrounding image, their applicability has traditionally been limited to single 2D image inputs. However, some researchers are trying to advance the application of such models to the manipulation of complete 3D scenes.

The emergence of Neural Radiance Fields (NeRFs) has made the transformation of real 2D photos into lifelike 3D representations more accessible. As algorithmic enhancements continue and computational demands decrease, these 3D representations may become commonplace. Therefore, the research aims to enable similar manipulations of 3D NeRFs as are available for 2D images, with a particular focus on inpainting.

The inpainting of 3D objects presents unique challenges, including the scarcity of 3D data and the necessity to consider both 3D geometry and appearance. The use of NeRFs as a scene representation introduces additional complexities. The implicit nature of neural representations makes it impractical to directly modify the underlying data structure based on geometric understanding. Additionally, because NeRFs are trained from images, maintaining consistency across multiple views poses challenges. Independent inpainting of individual constituent images can lead to inconsistencies in viewpoints and visually unrealistic outputs.

Various approaches have been attempted to address these challenges. For example, some methods aim to resolve inconsistencies post hoc, such as NeRF-In, which combines views through pixel-wise loss, or SPIn-NeRF, which employs a perceptual loss. However, these approaches may struggle when inpainted views exhibit significant perceptual differences or involve complex appearances.

Alternatively, single-reference inpainting methods have been explored, which avoid view inconsistencies by using only one inpainted view. However, this approach introduces several challenges, including reduced visual quality in non-reference views, a lack of view-dependent effects, and issues with disocclusions.

Considering the mentioned limitations, a new approach has been developed to enable the inpainting of 3D objects.

Inputs to the system are N images from different perspectives with their corresponding camera transformation matrices and masks, delineating the unwanted regions. Additionally, an inpainted reference view related to the input images is required, which provides the information that a user expects to gather from a 3D inpainting of the scene. This reference can be as simple as a text description of the object to replace the mask.

Figure source: https://ashmrz.github.io/reference-guided-3d/paper_lq.pdf

In the example reported above, the “rubber duck” or “flower pot” references can be obtained by employing a single-image text-conditioned inpainter. This way, any user can control and drive the generation of 3D scenes with the desired edits. 

With a module focusing on view-dependent effects (VDEs), the authors try to account for view-dependent changes (e.g., specularities and non-Lambertian effects) in the scene. For this reason, they add VDEs to the masked area from non-reference viewpoints by correcting reference colors to match the surrounding context of the other views.

Furthermore, they introduce monocular depth estimators to guide the geometry of the inpainted region according to the depth of the reference image. Since not all of the masked target pixels are visible in the reference, an approach is devised to supervise these unseen, disoccluded pixels via additional inpaintings.
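Because monocular depth predictions are only defined up to an unknown scale and shift, a common way to use them as geometric guidance is to align them to the rendered NeRF depth inside the masked region before penalizing the difference. The sketch below shows that generic recipe; whether the paper uses exactly this scale-and-shift alignment is an assumption on our part.

```python
import torch

def depth_guidance_loss(rendered_depth, mono_depth, mask):
    """Align a monocular depth map to the NeRF depth (scale + shift), then penalize deviation."""
    d_nerf = rendered_depth[mask]
    d_mono = mono_depth[mask]
    # Least-squares fit of scale a and shift b so that a * d_mono + b ~= d_nerf.
    A = torch.stack([d_mono, torch.ones_like(d_mono)], dim=1)
    sol = torch.linalg.lstsq(A, d_nerf.unsqueeze(1)).solution
    a, b = sol[0, 0], sol[1, 0]
    return torch.abs(a * d_mono + b - d_nerf).mean()
```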

A visual comparison of novel view renderings of the proposed method with the state-of-the-art SPIn-NeRF-Lama is provided below.

Figure source: https://ashmrz.github.io/reference-guided-3d/paper_lq.pdf

This was the summary of a novel AI framework for reference-guided controllable inpainting of neural radiance fields. If you are interested and want to learn more about it, please feel free to refer to the links cited below. 


Check out the Paper and Project Page. All Credit For This Research Goes To the Researchers on This Project.

Meet StableSR: A Novel AI Super-Resolution Approach Exploiting the Power of Pre-Trained Diffusion Models
https://www.marktechpost.com/2023/09/20/meet-stablesr-a-novel-ai-super-resolution-approach-exploiting-the-power-of-pre-trained-diffusion-models/

Significant progress has been observed in the development of diffusion models for various image synthesis tasks in the field of computer vision. Prior research has illustrated the applicability of the diffusion prior, integrated into synthesis models like Stable Diffusion, to a range of downstream content creation tasks, including image and video editing.

In this article, the investigation expands beyond content creation and explores the potential advantages of employing diffusion priors for super-resolution (SR) tasks. Super-resolution, a low-level vision task, introduces an additional challenge due to its demand for high image fidelity, which contrasts with the inherent stochastic nature of diffusion models.

A common solution to this challenge entails training a super-resolution model from the ground up. These methods incorporate the low-resolution (LR) image as an additional input to constrain the output space, aiming to preserve fidelity. While these approaches have achieved commendable results, they often require substantial computational resources for training the diffusion model. Furthermore, initiating network training from scratch can potentially compromise the generative priors captured in synthesis models, potentially leading to suboptimal network performance.

In response to these limitations, an alternative approach has been explored. This alternative involves introducing constraints into the reverse diffusion process of a pre-trained synthesis model. This paradigm eliminates the need for extensive model training while leveraging the benefits of the diffusion prior. However, it’s worth noting that designing these constraints assumes prior knowledge of the image degradations, which is typically both unknown and intricate. Consequently, such methods demonstrate limited generalizability.

To address the mentioned limitations, the researchers introduce StableSR, an approach designed to retain pre-trained diffusion priors without requiring explicit assumptions about image degradations. An overview of the presented technique is illustrated below.

In contrast to prior approaches that concatenate the low-resolution (LR) image with intermediate outputs, necessitating the training of a diffusion model from scratch, StableSR involves fine-tuning a lightweight time-aware encoder and a few feature modulation layers specifically tailored for super-resolution (SR) tasks.

The encoder incorporates a time embedding layer to generate time-aware features, enabling adaptive modulation of features within the diffusion model at different iterations. This not only enhances training efficiency but also maintains the integrity of the generative prior. Additionally, the time-aware encoder provides adaptive guidance during the restoration process, with stronger guidance at earlier iterations and weaker guidance at later stages, contributing significantly to improved performance.

To address the inherent randomness of the diffusion model and mitigate information loss during the encoding process of the autoencoder, StableSR applies a controllable feature wrapping module. This module introduces an adjustable coefficient to refine the outputs of the diffusion model during the decoding process, using multi-scale intermediate features from the encoder in a residual manner. The adjustable coefficient allows for a continuous trade-off between fidelity and realism, accommodating a wide range of degradation levels.
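In equation form, the wrapping step amounts to a residual correction of the decoder features weighted by the adjustable coefficient. The snippet below is a hedged sketch of that operation; the convolutional connector, the channel sizes, and the module name are placeholders rather than the released implementation.

```python
import torch
import torch.nn as nn

class ControllableFeatureWrapping(nn.Module):
    """Refine decoder features with encoder features, scaled by a coefficient w."""

    def __init__(self, channels=256):
        super().__init__()
        self.connector = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, dec_feat, enc_feat, w=0.5):
        # w close to 0 keeps the purely generative result (realism),
        # w close to 1 leans on the encoder features of the input (fidelity).
        correction = self.connector(torch.cat([enc_feat, dec_feat], dim=1))
        return dec_feat + w * correction
```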

Furthermore, adapting diffusion models for super-resolution tasks at arbitrary resolutions has historically posed challenges. To overcome this, StableSR introduces a progressive aggregation sampling strategy. This approach divides the image into overlapping patches and fuses them using a Gaussian kernel at each diffusion iteration. The result is a smoother transition at boundaries, ensuring a more coherent output.
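The aggregation strategy can be summarized as Gaussian-weighted patch blending: at each diffusion step the network is run on overlapping crops, and the per-pixel results are averaged with a window that down-weights patch borders. The sketch below illustrates one such blending pass; the patch size, stride, kernel width, and the `run_model` callable are illustrative assumptions.

```python
import torch

def gaussian_window(size, sigma=None):
    """2D Gaussian window used to down-weight patch borders during fusion."""
    sigma = sigma or size / 4
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    return torch.outer(g, g)

def aggregate_patches(x, run_model, patch=64, stride=48):
    """Run the network on overlapping crops of x (B, C, H, W) and fuse them smoothly.

    Assumes H and W are compatible with the chosen patch/stride so the grid covers the image.
    """
    _, _, h, w = x.shape
    out = torch.zeros_like(x)
    weight = torch.zeros(1, 1, h, w, device=x.device)
    win = gaussian_window(patch).to(x.device)
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            crop = x[:, :, top:top + patch, left:left + patch]
            out[:, :, top:top + patch, left:left + patch] += run_model(crop) * win
            weight[:, :, top:top + patch, left:left + patch] += win
    return out / weight.clamp(min=1e-8)
```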

Some output samples of StableSR presented in the original article compared with state-of-the-art approaches are reported in the figure below.

In summary, StableSR offers a unique solution for adapting generative priors to real-world image super-resolution challenges. This approach leverages pre-trained diffusion models without making explicit assumptions about degradations, addressing issues of fidelity and arbitrary resolution through the incorporation of the time-aware encoder, controllable feature wrapping module, and progressive aggregation sampling strategy. StableSR serves as a robust baseline, inspiring future research in the application of diffusion priors for restoration tasks.

If you are interested and want to learn more about it, please feel free to refer to the links cited below. 


Check out the Paper, Github, and Project Page. All Credit For This Research Goes To the Researchers on This Project.

Meet BLIVA: A Multimodal Large Language Model for Better Handling of Text-Rich Visual Questions
https://www.marktechpost.com/2023/09/15/meet-bliva-a-multimodal-large-language-model-for-better-handling-of-text-rich-visual-questions/

Recently, Large Language Models (LLMs) have played a crucial role in the field of natural language understanding, showcasing remarkable capabilities in generalizing across a wide range of tasks, including zero-shot and few-shot scenarios. Vision Language Models (VLMs), exemplified by OpenAI’s GPT-4 in 2023, have demonstrated substantial progress in addressing open-ended visual question-answering (VQA) tasks, which require a model to answer a question about an image or a set of images. These advancements have been achieved by integrating LLMs with visual comprehension abilities. 

Various methods have been proposed to leverage LLMs for vision-related tasks, including direct alignment with a visual encoder's patch features and the extraction of image information through a fixed number of query embeddings.

However, despite their significant capabilities in image-based human-agent interactions, these models encounter challenges when it comes to interpreting text within images. Text-containing images are prevalent in everyday life, and the ability to comprehend such content is crucial for human visual perception. Previous research has employed an abstraction module with queried embeddings, but this approach limited their capacity to capture textual details within images.

In the study outlined in this article, the researchers introduce BLIVA (InstructBLIP with Visual Assistant), a multimodal LLM strategically engineered to integrate two key components: learned query embeddings closely aligned with the LLM itself and image-encoded patch embeddings, which contain more extensive image-related data. An overview of the proposed approach is presented in the figure below.

Figure source: https://arxiv.org/abs/2308.09936

This technique overcomes the constraints typically associated with the provision of image information to language models, ultimately leading to enhanced text-image visual perception and understanding. The model is initialized using a pre-trained InstructBLIP and an encoded patch projection layer trained from scratch. A two-stage training paradigm is followed. The initial stage involves pre-training the patch embeddings projection layer and fine-tuning both the Q-former and the patch embeddings projection layer using instruction tuning data. Throughout this phase, both the image encoder and LLM remain in a frozen state, based on two key findings from experiments: first, unfreezing the vision encoder leads to catastrophic forgetting of prior knowledge, and second, simultaneous training of the LLM did not yield improvement but introduced significant training complexity.
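At the architectural level, the key change is simply what gets concatenated in front of the text tokens before they reach the LLM. The sketch below shows that assembly step with hypothetical module names and dimensions; it is a schematic of the idea, not the released BLIVA code.

```python
import torch
import torch.nn as nn

class VisualAssembly(nn.Module):
    """Combine Q-Former query embeddings with projected patch embeddings for the LLM."""

    def __init__(self, patch_dim=1408, llm_dim=4096):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, llm_dim)    # trained from scratch

    def forward(self, query_embeds, patch_embeds, text_embeds):
        # query_embeds: (B, num_query, llm_dim)   from the fine-tuned Q-Former
        # patch_embeds: (B, num_patches, patch_dim) from the frozen image encoder
        # text_embeds:  (B, num_text, llm_dim)    instruction tokens
        patches = self.patch_proj(patch_embeds)
        return torch.cat([query_embeds, patches, text_embeds], dim=1)  # fed to the frozen LLM
```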

Two sample scenarios presented by the authors are reported here, showcasing the impact of BLIVA in addressing VQA tasks related to “Detailed caption” and “small caption + VQA.”

Figure source: https://arxiv.org/abs/2308.09936

This was the summary of BLIVA, a novel AI LLM multimodal framework that combines textual and visual-encoded patch embeddings to address VQA tasks. If you are interested and want to learn more about it, please feel free to refer to the links cited below. 


Check out the Paper and Github. All Credit For This Research Goes To the Researchers on This Project.

A New AI Research from Tel Aviv and the University of Copenhagen Introduces a ‘Plug-and-Play’ Approach for Rapidly Fine-Tuning Text-to-Image Diffusion Models by Using a Discriminative Signal
https://www.marktechpost.com/2023/09/14/a-new-ai-research-from-tel-aviv-and-the-university-of-copenhagen-introduces-a-plug-and-play-approach-for-rapidly-fine-tuning-text-to-image-diffusion-models-by-using-a-discriminative-signal/

Text-to-image diffusion models have exhibited impressive success in generating diverse and high-quality images based on input text descriptions. Nevertheless, they encounter challenges when the input text is lexically ambiguous or involves intricate details. This can lead to situations where the intended image content, such as an “iron” for clothes, is misrepresented as the “elemental” metal.

To address these limitations, existing methods have employed pre-trained classifiers to guide the denoising process. One approach involves blending the score estimate of a diffusion model with the gradient of a pre-trained classifier’s log probability. In simpler terms, this approach uses information from both a diffusion model and a pre-trained classifier to generate images that match the desired outcome and align with the classifier’s judgment of what the image should represent. 
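For reference, this classic classifier-guidance update can be written in a few lines: the classifier's gradient with respect to the noisy image is scaled and subtracted from the predicted noise. The sketch below is the textbook formulation, with `guidance_scale` as a free parameter and the model and classifier signatures assumed; it presumes a classifier that can handle noisy inputs, which is exactly the requirement discussed next.

```python
import torch

def guided_noise_prediction(eps_model, classifier, x_t, t, alpha_bar_t,
                            target_class, guidance_scale=3.0):
    """Blend the diffusion model's noise prediction with a classifier gradient."""
    eps = eps_model(x_t, t)                              # unconditional noise prediction
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_prob = torch.log_softmax(classifier(x_in, t), dim=-1)[:, target_class].sum()
        grad = torch.autograd.grad(log_prob, x_in)[0]    # d log p(y | x_t) / d x_t
    # eps_hat = eps - sqrt(1 - alpha_bar_t) * s * grad, as in classic classifier guidance
    return eps - (1.0 - alpha_bar_t) ** 0.5 * guidance_scale * grad
```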

However, this method requires a classifier capable of working with real and noisy data. 

Other strategies have conditioned the diffusion process on class labels using specific datasets. While effective, this approach is far from the full expressive capability of models trained on extensive collections of image-text pairs from the web.

An alternative direction involves fine-tuning a diffusion model or some of its input tokens using a small set of images related to a specific concept or label. Yet, this approach has drawbacks, including slow training for new concepts, potential changes in image distribution, and limited diversity captured from a small group of images.

This article reports a proposed approach that tackles these issues, providing a more accurate representation of desired classes, resolving lexical ambiguity, and improving the depiction of fine-grained details. It achieves this without compromising the original pretrained diffusion model’s expressive power or facing the mentioned drawbacks. The overview of this method is illustrated in the figure below.

Instead of guiding the diffusion process or altering the entire model, this approach focuses on updating the representation of a single added token corresponding to each class of interest. Importantly, this update doesn’t involve model tuning on labeled images.

The method learns the token representation for a specific target class through an iterative process of generating new images with a higher class probability according to a pre-trained classifier. Feedback from the classifier guides the evolution of the designated class token in each iteration. A novel optimization technique called gradient skipping is employed, wherein the gradient is propagated solely through the final stage of the diffusion process. The optimized token is then incorporated as part of the conditioning text input to generate images using the original diffusion model.
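A highly simplified version of such a token-optimization loop might look like the sketch below: only the embedding of the added token receives gradients, and, mirroring the gradient-skipping idea, backpropagation runs only through the final denoising step and the decoder. All callables used here (`generate_with_grad_last_step`, `decode`, `classifier`) are placeholders for components the paper describes, not actual APIs.

```python
import torch

def optimize_class_token(token_embed, generate_with_grad_last_step, decode, classifier,
                         target_class, steps=50, lr=1e-3):
    """Iteratively refine a single token embedding using classifier feedback."""
    token_embed = token_embed.clone().requires_grad_(True)
    opt = torch.optim.Adam([token_embed], lr=lr)
    for _ in range(steps):
        # Gradient skipping: the sampler runs without grad except for its last step.
        latent = generate_with_grad_last_step(token_embed)
        image = decode(latent)
        loss = -torch.log_softmax(classifier(image), dim=-1)[:, target_class].mean()
        opt.zero_grad()
        loss.backward()          # gradients reach only token_embed
        opt.step()
    return token_embed.detach()
```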

According to the authors, this method offers several key advantages. It requires only a pre-trained classifier and doesn’t demand a classifier trained explicitly on noisy data, setting it apart from other class conditional techniques. Moreover, it excels in speed, allowing immediate improvements to generated images once a class token is trained, in contrast to more time-consuming methods.

Sample results selected from the study are shown in the image below. These case studies provide a comparative overview of the proposed and state-of-the-art approaches.

This was the summary of a novel AI non-invasive technique that exploits a pre-trained classifier to fine-tune text-to-image diffusion models. If you are interested and want to learn more about it, please feel free to refer to the links cited below. 


Check out the Paper, Code, and Project. All Credit For This Research Goes To the Researchers on This Project.