Mohammad Arshad, Author at MarkTechPost (An Artificial Intelligence News Platform)
https://www.marktechpost.com/author/mohammadarshad/

Researchers from the University of Washington and NVIDIA Propose Humanoid Agents: An Artificial Intelligence Platform for Human-like Simulations of Generative Agents (Thu, 26 Oct 2023)
https://www.marktechpost.com/2023/10/26/researchers-from-the-university-of-washington-and-nvidia-propose-humanoid-agents-an-artificial-intelligence-platform-for-human-like-simulations-of-generative-agents/


Human-like generative agents are commonly used in chatbots and virtual assistants to provide natural and engaging user interactions. They can understand and respond to user queries, engage in conversations, and perform tasks like answering questions and making recommendations. These agents are often built using natural language processing (NLP) techniques and machine learning models, such as GPT-3, to produce coherent and contextually relevant responses. They can create interactive stories, dialogues, and characters in video games or virtual worlds, enhancing the gaming experience.

Human-like generative agents can assist writers and creatives in brainstorming ideas, generating story plots, or even composing poetry or music. However, this process differs from how humans actually think: humans constantly adapt their plans in response to changes in the physical environment. Researchers at the University of Washington and the University of Hong Kong propose Humanoid Agents, a platform that guides generative agents to behave more like humans by introducing several such elements.

Inspired by human psychology, the researchers propose a two-system mechanism: System 1 handles intuitive, effortless thinking, while System 2 handles deliberate, logical thinking. To influence the agents' behavior, they introduce aspects such as basic needs, emotions, and the closeness of social relationships with other agents.

The agents are designed to interact with one another; when their basic needs go unmet, they receive negative feedback such as loneliness, sickness, and tiredness.
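A minimal sketch of this needs-and-feedback loop. All class names, need names, thresholds, and decay rates below are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class HumanoidAgent:
    name: str
    # Each need is a value in [0, 10]; below a threshold it triggers feedback.
    needs: dict = field(default_factory=lambda: {
        "fullness": 10.0, "energy": 10.0, "social": 10.0, "health": 10.0
    })
    emotion: str = "neutral"

    def step(self, decay: float = 1.0) -> list[str]:
        """Decay all needs one simulation step; return negative feedback for unmet ones."""
        feedback = []
        for need, value in self.needs.items():
            self.needs[need] = max(0.0, value - decay)
            if self.needs[need] < 3.0:
                feedback.append(f"{self.name} suffers from low {need}")
        self.emotion = "sad" if feedback else "neutral"
        return feedback

    def fulfill(self, need: str, amount: float = 5.0) -> None:
        """Acting on a need (eating, socializing, resting) restores it."""
        self.needs[need] = min(10.0, self.needs[need] + amount)

agent = HumanoidAgent("Alice")
for _ in range(8):          # 8 steps of decay with no fulfillment
    messages = agent.step()
print(agent.emotion)        # needs decayed from 10 to 2 -> "sad"
```

The point of the sketch is the feedback loop: unmet needs change the agent's emotional state, which in turn would steer what the underlying language model is prompted to plan next.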

The social brain hypothesis proposes that a large part of our cognitive ability evolved to track the quality of social relationships, and people adapt their interactions accordingly. To mimic this behavior, the researchers let Humanoid Agents adjust their conversations based on how close they are to one another. The agents are visualized through a Unity WebGL game interface, and the statuses of simulated agents are presented over time in an interactive analytics dashboard.

They created a sandbox game environment, built with Unity and deployed via WebGL, to visualize Humanoid Agents in their world. Users can select one of three worlds to see each agent's status and location at every step. The game interface ingests JSON-structured files from the simulated worlds and transforms them into animations, while a dashboard built with Plotly Dash visualizes the status of the various agents over time.

Their system currently supports dialogues between only two agents, with multi-party conversations planned as future work. Because the simulation does not perfectly reflect human behavior in the real world, users must be informed that they are interacting with a simulation. Despite the agents' capabilities, it is essential to consider ethical and privacy concerns when using human-like generative agents, such as the potential for spreading misinformation, biases in the training data, and the need for responsible usage and monitoring.


Check out the Paper and Github. All Credit For This Research Goes To the Researchers on This Project. Also, don't forget to join our 32k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter.

We are also on WhatsApp. Join our AI Channel on WhatsApp.

Blazing a Trail in Interleaved Vision-and-Language Generation: Unveiling the Power of Generative Vokens with MiniGPT-5 (Wed, 25 Oct 2023)
https://www.marktechpost.com/2023/10/24/blazing-a-trail-in-interleaved-vision-and-language-generation-unveiling-the-power-of-generative-vokens-with-minigpt-5/


Large language models excel at understanding and generating human language. This ability is crucial for tasks such as text summarization, sentiment analysis, translation, and chatbots, making them valuable tools for natural language processing. These models can improve machine translation systems, enabling more accurate and context-aware translations between different languages, with numerous global communication and business applications. 

LLMs are proficient at recognizing and categorizing named entities in text, such as names of people, places, organizations, and dates. They can answer questions based on the information presented in a passage or document: they understand the context of the question and extract the relevant information to provide accurate answers. However, current LLMs process text-image pairs and struggle when the task is to generate new images. Emerging vision-and-language tasks depend heavily on topic-centric data and often give only cursory treatment to image descriptions.

Researchers at the University of California, Santa Cruz built a new model named MiniGPT-5, which performs interleaved vision-and-language generation using a technique based on generative vokens. This multimodal approach has proven effective compared to other LLMs: it combines generative vokens with Stable Diffusion to produce both vision and language outputs.

Generative vokens are special visual tokens that can be trained directly on raw images. Visual tokens are elements added to the model's input to incorporate visual information or enable multimodal understanding. When generating image captions, for example, a model may take an image as input, tokenize it into a series of special visual tokens, and combine them with textual tokens representing the context or description of the image. This integration allows the model to generate meaningful and contextually relevant captions.
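The interleaving itself can be illustrated with a small sketch. The `[IMGi]` token names and the count of eight vokens are assumptions for illustration, not MiniGPT-5's exact vocabulary:

```python
# Sketch of "generative vokens": special placeholder tokens interleaved with
# text so the language model's hidden states at those positions can later be
# mapped into an image generator's conditioning space.

NUM_VOKENS = 8
VOKEN_TOKENS = [f"[IMG{i}]" for i in range(NUM_VOKENS)]

def interleave_vokens(text_segments: list[str]) -> list[str]:
    """Append a run of voken tokens after each text segment that should be
    followed by a generated image."""
    sequence: list[str] = []
    for segment in text_segments:
        sequence.extend(segment.split())
        sequence.extend(VOKEN_TOKENS)  # hidden states here condition the image
    return sequence

seq = interleave_vokens(["a cat on a mat"])
print(len(seq))  # 5 text tokens + 8 vokens = 13
```

At generation time, the model emits these voken positions itself, and a separate mapping module turns their hidden states into inputs for the diffusion model.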

The researchers follow a two-stage method: the first stage performs unimodal alignment of high-quality, text-aligned visual features from large text-image pairs, and the second stage ensures that the visual and text prompts are well coordinated during generation. Because the stages are generic, the method eliminates the need for domain-specific annotations, which distinguishes it from existing work. They adopt a dual-loss strategy to balance text and images; their approach also optimizes training efficiency and addresses memory constraints.
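The dual-loss idea reduces to a weighted sum of a language-modeling loss and an image-generation loss. The weighting value and the toy loss numbers below are illustrative assumptions, not the paper's reported hyperparameters:

```python
# Minimal sketch of a dual-loss strategy: total loss is the text
# (next-token cross-entropy) loss plus a weighted image (diffusion
# denoising) loss, so neither modality dominates training.

def dual_loss(text_loss: float, image_loss: float, lam: float = 0.1) -> float:
    """Combined objective balancing language modeling and image generation."""
    return text_loss + lam * image_loss

total = dual_loss(text_loss=2.5, image_loss=0.8)
print(total)  # 2.5 + 0.1 * 0.8
```

Tuning `lam` is the lever: too small and image quality suffers, too large and text fluency degrades.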

The team applied parameter-efficient fine-tuning to the MiniGPT-4 encoder to train the model to better understand instructions and to enhance its performance on novel or zero-shot tasks. They also tried prefix tuning and LoRA on Vicuna, the language model used in MiniGPT-4. Future work on these methods will broaden applications that previously seemed challenging due to the disjointed nature of existing image and text models.


CMU Researchers Introduce MultiModal Graph Learning (MMGL): A New Artificial Intelligence Framework for Capturing Information from Multiple Multimodal Neighbors with Relational Structures Among Them (Sat, 21 Oct 2023)
https://www.marktechpost.com/2023/10/20/cmu-researchers-introduce-multimodal-graph-learning-mmgl-a-new-artificial-intelligence-framework-for-capturing-information-from-multiple-multimodal-neighbors-with-relational-structures-among-them/


Multimodal graph learning is a multidisciplinary field combining concepts from machine learning, graph theory, and data fusion to tackle complex problems involving diverse data sources and their interconnections. Multimodal graph learning can generate descriptive captions for images by combining visual data with textual information. It can improve the accuracy of retrieving relevant images or text documents based on queries. Multimodal graph learning is also used in autonomous vehicles to combine data from various sensors, such as cameras, LiDAR, radar, and GPS, to enhance perception and make informed driving decisions.

Present models generate images or text from given text or images using pre-trained image encoders and language models, taking paired modalities with a clear 1-to-1 mapping as input. In the context of multimodal graph learning, modalities refer to distinct types or modes of data and information sources; each modality represents a specific category or aspect of data and can take different forms. The problem arises when applying these models to many-to-many mappings among modalities.

Researchers at Carnegie Mellon University propose a general and systematic framework of Multimodal graph learning for generative tasks. Their method involves capturing information from multiple multimodal neighbors with relational structures among themselves. They propose to represent the complex relationships as graphs to capture data with any number of modalities and complex relationships between modalities that can flexibly vary from one sample to another.

Their model extracts neighbor encodings and combines them with the graph structure, then optimizes the model with parameter-efficient finetuning. To handle many-to-many mappings, the team studied neighbor encoding models such as self-attention with text and embeddings, self-attention with only embeddings, and cross-attention with embeddings. They compared sequential position encodings against Laplacian eigenvector position encoding (LPE) and graph neural network (GNN) encoding.
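Laplacian eigenvector position encoding can be sketched concretely: the first few non-trivial eigenvectors of the graph Laplacian serve as position features for each neighbor node. The 4-node path graph below is an arbitrary toy example:

```python
import numpy as np

def laplacian_pe(adj: np.ndarray, k: int) -> np.ndarray:
    """Return an (n, k) matrix of Laplacian eigenvector position encodings."""
    degree = np.diag(adj.sum(axis=1))
    laplacian = degree - adj
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    # Skip the trivial constant eigenvector (eigenvalue ~ 0).
    return eigvecs[:, 1:k + 1]

# A 4-node path graph: 0 - 1 - 2 - 3
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
pe = laplacian_pe(adj, k=2)
print(pe.shape)  # (4, 2): one 2-dim position vector per node
```

Unlike sequential position encodings, these features reflect the graph's connectivity, which is what makes them a natural fit for multimodal neighbor structures.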

Finetuning often requires substantial labeled data specific to the target task, but when a relevant dataset already exists or can be obtained at reasonable cost, finetuning is cost-effective compared to training a model from scratch. The researchers use prefix tuning and LoRA for self-attention with text and embeddings (SA-TE), and Flamingo-style finetuning for cross-attention with embeddings (CA-E). They find that prefix tuning uses nearly four times fewer parameters with SA-TE neighbor encoding, which decreases the cost.
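Why such methods are parameter-efficient is easy to see with back-of-the-envelope arithmetic. In LoRA, instead of updating a full weight matrix, one trains two low-rank factors. The hidden size and rank below are illustrative, not the configuration used in the paper:

```python
# Parameter count of a full weight update vs. a LoRA update W + B @ A,
# where B is (d_out x r) and A is (r x d_in) with small rank r.

def full_params(d_out: int, d_in: int) -> int:
    return d_out * d_in

def lora_params(d_out: int, d_in: int, r: int) -> int:
    return d_out * r + r * d_in

d = 4096                            # hypothetical hidden size
full = full_params(d, d)            # 16,777,216 trainable weights
lora = lora_params(d, d, r=8)       # 65,536 trainable weights
print(full // lora)                 # 256x fewer trainable parameters
```

Prefix tuning achieves a similar effect by training only a short sequence of virtual prefix tokens per layer rather than any weight matrices.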

Their research provides an in-depth analysis that lays the groundwork for future MMGL research and exploration. The researchers say the future scope of multimodal graph learning is promising and is expected to expand significantly, driven by advancements in machine learning, data collection, and the growing need to handle complex, multimodal data in various applications.


Researchers from UC Berkeley Propose RingAttention: A Memory-Efficient Artificial Intelligence Approach to Reduce the Memory Requirements of Transformers (Fri, 20 Oct 2023)
https://www.marktechpost.com/2023/10/19/researchers-from-uc-berkeley-propose-ringattention-a-memory-efficient-artificial-intelligence-approach-to-reduce-the-memory-requirements-of-transformers/


Transformers are the deep learning model architecture behind many state-of-the-art AI models. They have revolutionized the field of artificial intelligence, particularly natural language processing and various other machine learning tasks. The architecture is based on a self-attention mechanism, in which the model weighs the importance of different parts of the input sequence when making predictions, and consists of an encoder and a decoder to process inputs.

However, scaling up the context length of Transformers is difficult because of the inherent cost of self-attention: its memory requirement is quadratic in the input sequence length, which makes longer input sequences hard to handle. Researchers at UC Berkeley developed a method called Ring Attention to tackle this, based on a simple observation: when self-attention and feedforward network computations are performed blockwise, the sequence can be distributed across multiple devices and processed easily.

They distribute the outer loop of the blockwise attention computation among hosts, with each device managing its respective input block. In the inner loop, every device computes the blockwise attention and feedforward operations specific to its designated input block. The host devices form a conceptual ring: each sends a copy of the key-value blocks it used for its blockwise computation to the next device in the ring while simultaneously receiving key-value blocks from the previous one.

Because block computations take longer than block transfers, the team overlapped the two, resulting in no added overhead compared to standard Transformers. Each device then requires memory proportional only to the block size, independent of the original input sequence length, which effectively eliminates the memory constraints imposed by individual devices.
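The data movement can be simulated sequentially in a few lines. This is a single-process sketch of the ring schedule, not a distributed implementation, and it uses unstabilized softmax accumulation for clarity (real implementations track running maxima for numerical stability):

```python
import numpy as np

def ring_attention(q_blocks, k_blocks, v_blocks):
    """Blockwise attention where each 'host' keeps its query block fixed while
    key-value blocks rotate around the ring; only one KV block is held at a time."""
    n_hosts = len(q_blocks)
    outputs = []
    for h in range(n_hosts):                 # each host owns one query block
        q = q_blocks[h]
        num = np.zeros_like(q)               # running softmax numerator @ V
        den = np.zeros((q.shape[0], 1))      # running softmax denominator
        for step in range(n_hosts):          # KV blocks arrive around the ring
            idx = (h + step) % n_hosts
            scores = np.exp(q @ k_blocks[idx].T)
            num += scores @ v_blocks[idx]
            den += scores.sum(axis=1, keepdims=True)
        outputs.append(num / den)
    return np.vstack(outputs)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))                  # 8 tokens, dim 4; q = k = v = x
out = ring_attention(np.split(x, 4), np.split(x, 4), np.split(x, 4))

# Reference: monolithic softmax attention over the whole sequence.
full = np.exp(x @ x.T)
full = (full / full.sum(axis=1, keepdims=True)) @ x
print(np.allclose(out, full))  # True
```

Because the blocks partition the sequence, accumulating numerator and denominator over blocks reproduces full softmax attention exactly, while each "host" only ever holds one key-value block.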

Their experiments show that Ring Attention reduces the memory requirements of Transformers, enabling training on sequences more than 500 times longer than prior memory-efficient state-of-the-art methods, and allowing training sequences exceeding 100 million in length without making approximations to attention. Since Ring Attention eliminates the memory constraints imposed by individual devices, near-infinite context sizes are in principle achievable; however, this requires a large number of devices, as the maximum sequence length is proportional to the device count.

The research evaluates only the effectiveness of the method, without large-scale model training. As the achievable context length depends on the number of devices, the model's efficiency depends on optimization; so far the authors have worked only on the low-level operations required for optimal compute performance. They say they would like to work on both maximum sequence length and maximum compute performance in the future. The possibility of near-infinite context introduces many exciting opportunities, such as large video-audio-language models, learning from extended feedback and trial and error, understanding and generating codebases, and adapting AI models to scientific data such as gene sequences.


Fondant AI Releases Fondant-25M Dataset of Image-Text Pairs with a Creative Commons License (Sun, 15 Oct 2023)
https://www.marktechpost.com/2023/10/15/fondant-ai-releases-fondant-25m-dataset-of-image-text-pairs-with-a-creative-commons-license/


The handling and analysis of vast amounts of data is called large-scale data processing. It involves extracting valuable insights, making informed decisions, and solving complex problems, and it is crucial in fields including business, science, and healthcare. The choice of tools and methods depends on the specific requirements of the task and the available resources: programming languages like Python, Java, and Scala are often used, and frameworks like Apache Flink, Apache Kafka, and Apache Storm are also valuable in this context.

Researchers have built a new open-source framework called Fondant to simplify and speed up large-scale data processing. It ships with embedded tools to download, explore, and process data, including components for downloading from URLs and fetching images.

The current challenge with generative AI is that models such as Stable Diffusion and DALL-E are trained on hundreds of millions of images from the public Internet, including copyrighted work. This creates legal risks and uncertainties for users of these models and is unfair toward copyright holders who may not want their proprietary work reproduced without consent.

To tackle this, the researchers developed a data-processing pipeline to create a dataset of 500 million Creative Commons images for training latent diffusion image generation models. Data-processing pipelines are sequences of steps and tasks designed to collect, process, and move data from one source to another, where it can be stored and analyzed for various purposes.

Creating custom data-processing pipelines involves several steps, and the specific approach may vary depending on the data sources, processing requirements, and tools. The researchers use a building-block approach: Fondant pipelines mix reusable components with custom ones. They further deployed the pipeline in a production environment and set up automation for regular data processing.
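The building-block idea can be sketched generically: a pipeline is an ordered list of reusable components, each transforming one batch of records into the next. This mirrors the component model described above but uses plain Python, not Fondant's actual API; all function names and record fields are illustrative:

```python
def load_urls(records):
    return records  # in practice: read image URLs from a crawl index

def filter_cc_license(records):
    """Reusable component: keep only Creative Commons-licensed records."""
    return [r for r in records if r.get("license", "").startswith("CC")]

def download_images(records):
    for r in records:
        r["image"] = f"<bytes of {r['url']}>"  # placeholder for a real fetch
    return records

def run_pipeline(records, components):
    """Thread the record batch through each component in order."""
    for component in components:
        records = component(records)
    return records

data = [{"url": "a.jpg", "license": "CC-BY"},
        {"url": "b.jpg", "license": "proprietary"}]
out = run_pipeline(data, [load_urls, filter_cc_license, download_images])
print(len(out))  # only the CC-licensed record survives -> 1
```

The appeal of the design is that components like license filtering or image downloading are written once and recombined freely across pipelines.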

Fondant-cc-25m contains 25 million image URLs with their Creative Commons license information, accessible in one go. The researchers have released detailed step-by-step installation instructions for local users: to execute the pipelines locally, users must have Docker installed with at least 8GB of RAM allocated to the Docker environment.

Because the released dataset may contain sensitive personal information, the researchers designed it to include only public, non-personal information in support of conducting and publishing their open-access research. The filtering pipeline for the dataset is still in progress, and they welcome contributions from other researchers toward creating anonymization pipelines for the project. In the future, they want to add components such as image-based deduplication, automatic captioning, visual quality estimation, watermark detection, face detection, and text detection.


University of Sharjah Researchers Develop Artificial Intelligence Solutions for Inclusion of Arabic and Its Dialects in Natural Language Processing (Fri, 13 Oct 2023)
https://www.marktechpost.com/2023/10/12/university-of-sharjah-researchers-develop-artificial-intelligence-solutions-for-inclusion-of-arabic-and-its-dialects-in-natural-language-processing/


Arabic is the national language of more than 422 million people and ranks as the fifth most widely used language globally. However, it has been largely overlooked in Natural Language Processing, where English has been the dominant language. Is this because the Arabic script is hard to process? Partly yes, but researchers have been working to develop AI solutions for processing Arabic and its various dialects.

This recent research has the potential to revolutionize the way Arabic speakers use technology and to make it easier to understand and interact with as technology grows. The challenges arise from the complex and rich nature of the Arabic language: Arabic is highly inflected, with rich prefixes, suffixes, and a root-based word-formation system. Words can take multiple forms derived from the same root, and Arabic text may lack diacritics and vowels, affecting the accuracy of text analysis and machine-learning tasks.

Arabic dialects can vary significantly from one region to another, and building models that understand and generate text in multiple dialects is a considerable challenge. Named Entity Recognition (NER), an NLP task that identifies and classifies named entities in text, is also particularly difficult because Arabic lacks the capitalization cues that mark proper nouns in languages such as English. NER is crucial in information extraction, text analysis, and language understanding, and addressing these challenges in Arabic NLP requires developing specialized tools, resources, and models tailored to the language's unique characteristics.
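One of the preprocessing steps implied above, normalizing away optional diacritics so that vowelled and unvowelled spellings of the same word match, has a compact implementation: Arabic diacritics (harakat) are Unicode combining marks (category "Mn"), so NFD decomposition plus filtering removes them. The example word is illustrative:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritics (and other combining marks) from text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

vowelled = "كَتَبَ"   # "kataba" (he wrote), written with diacritics
bare = "كتب"          # the same word without diacritics
print(strip_diacritics(vowelled) == bare)  # True
```

Normalization like this lets a model trained mostly on undiacritized web text handle fully vowelled inputs such as religious or educational texts.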

The researchers at the University of Sharjah developed a deep learning system to utilize the Arabic language and its varieties in applications related to Natural Language Processing (NLP), an interdisciplinary subfield of linguistics, computer science, and artificial intelligence. Compared to other AI-based models, their model encompasses a broader range of dialect variations in Arabic. 

Arabic NLP lacks the robust resources available for languages like English, including corpora, labeled data, and pre-trained models, which are crucial for developing and training NLP systems. To tackle this problem, the researchers built a large, diverse, and bias-free dialectal dataset by merging several distinct datasets.

Both classical machine learning and deep learning models were trained on these datasets. The resulting tools enhanced chatbot performance by accurately identifying and understanding various Arabic dialects, enabling chatbots to provide more personalized and relevant responses. The team's work has also drawn significant outside interest, notably from major tech corporations like IBM and Microsoft, because it can ensure greater accessibility for people with disabilities.

The speech recognition systems built upon these specific dialects will enable more accurate voice command recognition and services for people with disabilities. Arabic NLP can also be used in multilingual and cross-lingual applications, such as machine translation and content localization for businesses targeting Arabic-speaking markets. 


UCSD and ByteDance Researchers Present ActorsNeRF: A Novel Animatable Human Actor NeRF Model that Generalizes to Unseen Actors in a Few-Shot Setting (Wed, 11 Oct 2023)
https://www.marktechpost.com/2023/10/10/ucsd-and-bytedance-researchers-present-actorsnerf-a-novel-animatable-human-actor-nerf-model-that-generalizes-to-unseen-actors-in-a-few-shot-setting/


Neural Radiance Fields (NeRF) is a powerful neural network-based technique for capturing 3D scenes and objects from 2D images or sparse 3D data. A NeRF represents a scene with a multilayer perceptron that takes a 3D position and a viewing direction as input and predicts the color and volume density at that point; novel views are then synthesized by volume-rendering these predictions along camera rays.
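A well-known ingredient of NeRF worth illustrating is the sinusoidal positional encoding applied to input coordinates before they enter the network, which helps the MLP represent high-frequency detail. The frequency count L=4 and the sample point below are arbitrary illustration choices:

```python
import numpy as np

def positional_encoding(p: np.ndarray, L: int = 4) -> np.ndarray:
    """Lift each coordinate to [sin(2^k * pi * p), cos(2^k * pi * p)] for k = 0..L-1."""
    feats = []
    for k in range(L):
        feats.append(np.sin(2.0 ** k * np.pi * p))
        feats.append(np.cos(2.0 ** k * np.pi * p))
    return np.concatenate(feats)

point = np.array([0.1, 0.4, 0.9])      # a 3D sample position
encoded = positional_encoding(point)
print(encoded.shape)  # (24,) = 3 coords * 2 functions * 4 frequencies
```

Without this lifting, a plain MLP on raw coordinates tends to produce overly smooth reconstructions.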

To create a NeRF-based human representation, you typically start by capturing images or videos of a human subject from multiple viewpoints, using cameras, depth sensors, or other 3D scanning devices. NeRF-based human representations have several potential applications, including virtual avatars for games and virtual reality, 3D modeling for animation and film production, and medical imaging, where they can produce 3D models of patients for diagnosis and treatment planning. However, the approach can be computationally intensive and requires substantial training data.

Existing approaches typically require synchronized multi-view videos and an instance-level NeRF network trained on a specific human video sequence. The researchers instead propose ActorsNeRF, a category-level human actor NeRF model that generalizes to unseen actors in a few-shot setting. With only a few frames, e.g., 30, sampled from a monocular video, ActorsNeRF synthesizes high-quality novel views of novel actors with unseen poses on the AIST++ dataset.

The researchers follow a two-level canonical space design: given a body pose and a rendering viewpoint, a sampled 3D point is first transformed into a canonical space by linear blend skinning, with skinning weights generated by a skinning-weight network shared across subjects. Skinning weights control how a 3D mesh representing a character deforms when it is animated, and they are crucial for achieving realistic character movements and deformations in 3D computer graphics.
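Linear blend skinning itself is simple to state: a point is deformed by a weighted sum of per-joint rigid transforms. The sketch below uses hand-picked joints and weights in place of the learned skinning-weight network, purely for illustration.

```python
import numpy as np

def linear_blend_skinning(point, joint_rotations, joint_translations, weights):
    """Deform a 3D point as a weighted sum of per-joint rigid transforms.
    In an ActorsNeRF-style pipeline the weights would come from a learned
    skinning-weight network shared across subjects; here they are given."""
    deformed = np.zeros(3)
    for R, t, w in zip(joint_rotations, joint_translations, weights):
        deformed += w * (R @ point + t)
    return deformed

# Two joints: identity and a pure translation along y, blended 50/50.
point = np.array([1.0, 0.0, 0.0])
Rs = [np.eye(3), np.eye(3)]
ts = [np.zeros(3), np.array([0.0, 2.0, 0.0])]
w = np.array([0.5, 0.5])    # skinning weights sum to 1
out = linear_blend_skinning(point, Rs, ts, w)
```

Blending the untransformed point with a copy translated by 2 along y moves the point halfway, which is exactly the smooth deformation behavior skinning is meant to provide.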

To achieve generalization across different individuals, the researchers trained the category-level NeRF model on a diverse set of subjects. During inference, they fine-tuned the pretrained category-level model using only a few images of the target actor, enabling it to adapt to that actor's specific characteristics.

The researchers find that ActorsNeRF significantly outperforms HumanNeRF and, unlike HumanNeRF, maintains a valid shape for unobserved body parts: ActorsNeRF can leverage its category-level prior to smoothly synthesize unobserved portions of the body. Tested on multiple benchmarks, including the ZJU-MoCap and AIST++ datasets, it outperforms prior methods on novel human actors with unseen poses across multiple few-shot settings.


Check out the Paper and Project Page. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 31k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter.

We are also on WhatsApp. Join our AI Channel on WhatsApp.


]]>
https://www.marktechpost.com/2023/10/10/ucsd-and-bytedance-researchers-present-actorsnerf-a-novel-animatable-human-actor-nerf-model-that-generalizes-to-unseen-actors-in-a-few-shot-setting/feed/ 0 44263
Researchers from ITU Denmark Introduce Neural Developmental Programs: Bridging the Gap Between Biological Growth and Artificial Neural Networks https://www.marktechpost.com/2023/10/08/researchers-from-itu-denmark-introduce-neural-developmental-programs-bridging-the-gap-between-biological-growth-and-artificial-neural-networks/ https://www.marktechpost.com/2023/10/08/researchers-from-itu-denmark-introduce-neural-developmental-programs-bridging-the-gap-between-biological-growth-and-artificial-neural-networks/#respond Sun, 08 Oct 2023 18:12:59 +0000 https://www.marktechpost.com/?p=44109 The human brain is an extraordinarily complex organ, often considered one of the most intricate and sophisticated systems in the known universe. The brain is hierarchically organized, with lower-level sensory processing areas sending information to higher-level cognitive and decision-making regions. This hierarchy allows for the integration of knowledge and complex behaviors. The brain processes information […]

The post Researchers from ITU Denmark Introduce Neural Developmental Programs: Bridging the Gap Between Biological Growth and Artificial Neural Networks appeared first on MarkTechPost.

]]>

The human brain is an extraordinarily complex organ, often considered one of the most intricate and sophisticated systems in the known universe. The brain is hierarchically organized, with lower-level sensory processing areas sending information to higher-level cognitive and decision-making regions. This hierarchy allows for the integration of knowledge and complex behaviors. The brain processes information in parallel, with different regions and networks simultaneously working on various aspects of perception, cognition, and motor control. This parallel processing contributes to its efficiency and adaptability.

Can this hierarchical organization and parallel processing be adapted in deep learning? Researchers at the IT University of Copenhagen think so: they present a graph neural network type of encoding in which the growth of a policy network is controlled by another network running in each neuron. They call it a Neural Developmental Program (NDP).

Some biological processes involve mapping a compact genotype to a much larger phenotype. Inspired by this, the researchers build on indirect encoding methods. In indirect encoding, the description of the solution is compressed, which allows information to be reused so that the final solution contains more components than the description itself. However, these encodings, particularly the indirect-encoding family, must be grown through a developmental process.

The NDP architecture comprises a Multilayer Perceptron (MLP) and a Graph Neural Cellular Automaton (GNCA), which updates the node embeddings after each message-passing step during the developmental phase. In general, cellular automata are mathematical models consisting of a grid of cells, each in one of several states; the automata evolve over discrete time steps according to a set of rules that determine how the states of the cells change over time.
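As a toy illustration of a graph cellular automaton update, the sketch below applies one shared rule to every node of a small graph: each node aggregates messages from its neighbors and then updates its own embedding. The weight matrices, sizes, and tanh nonlinearity are illustrative choices, not the NDP's actual architecture.

```python
import numpy as np

def gnca_step(node_states, adjacency, w_msg, w_update):
    """One developmental step of a toy graph cellular automaton: every node
    sums messages from its neighbors, then the SAME update rule (shared
    weights) is applied to every node in parallel."""
    messages = adjacency @ (node_states @ w_msg)    # sum of neighbor messages
    combined = np.concatenate([node_states, messages], axis=-1)
    return np.tanh(combined @ w_update)             # new node embeddings

rng = np.random.default_rng(0)
n_nodes, dim = 4, 8
# A simple path graph: 0-1-2-3.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
states = rng.normal(size=(n_nodes, dim))
w_msg = rng.normal(0, 0.1, (dim, dim))
w_update = rng.normal(0, 0.1, (2 * dim, dim))
states = gnca_step(states, adjacency, w_msg, w_update)
```

Because the rule is shared, the parameter count stays fixed no matter how many nodes the graph grows to, which is the property the NDP exploits.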

In the NDP, the same model is applied to every node, so the number of parameters is constant with respect to the size of the graph on which it operates. This gives the NDP an advantage: it can operate on a neural network of arbitrary size or architecture. The NDP can also be trained with any black-box optimization algorithm to satisfy any objective function, allowing the grown neural networks to solve reinforcement learning and classification tasks and to exhibit particular topological properties.
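To make "any black-box optimization algorithm" concrete, here is a minimal gradient-free hill climber over a parameter vector. The quadratic objective is a hypothetical stand-in for whatever score the grown network would achieve on a task.

```python
import numpy as np

def random_search(objective, dim, iters=2000, step=0.1, seed=0):
    """Minimal black-box optimizer: keep a candidate parameter vector and
    accept Gaussian perturbations that improve (lower) the objective.
    Any gradient-free optimizer of this kind could, in principle, train
    the parameters of a developmental program."""
    rng = np.random.default_rng(seed)
    best = rng.normal(size=dim)
    best_score = objective(best)
    for _ in range(iters):
        cand = best + step * rng.normal(size=dim)
        score = objective(cand)
        if score < best_score:
            best, best_score = cand, score
    return best, best_score

# Toy objective standing in for "loss of the grown network".
target = np.array([1.0, -2.0, 0.5])
params, score = random_search(lambda p: np.sum((p - target) ** 2), dim=3)
```

The only thing the optimizer ever sees is the scalar objective value, which is exactly why such methods work even when the mapping from parameters to grown network is not differentiable.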

The researchers also evaluated the differentiable NDP by training and testing models with different numbers of growth steps. They observed that for most tasks, the network's performance decreased after a certain number of growth steps, because the grown networks became larger than necessary. An automated method for deciding when to stop growing is therefore needed, and the researchers note that such automation would be an important addition to the NDP. In the future, they also want to add activity-dependent and reward-modulated growth and adaptation techniques to the NDP.
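One simple form the automated stopping could take is patience-based early stopping on a validation score; the sketch below is a hypothetical illustration of that idea, not the authors' proposal.

```python
def stop_growth_step(scores, patience=3):
    """Return the growth step to stop at: the last step that improved the
    validation score, once `patience` consecutive steps fail to improve.
    A simple stand-in for the automated stopping the authors call for."""
    best = float("-inf")
    best_step = 0
    waited = 0
    for step, s in enumerate(scores):
        if s > best:
            best, best_step, waited = s, step, 0
        else:
            waited += 1
            if waited >= patience:
                return best_step
    return best_step

# Performance rises, then degrades as the grown network gets too large.
step = stop_growth_step([0.2, 0.5, 0.7, 0.65, 0.6, 0.55])
```

Here the score peaks at the third growth step and then declines for three steps in a row, so the rule stops the growth at the peak.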


Check out the Paper. All Credit For This Research Goes To the Researchers on This Project.




]]>
https://www.marktechpost.com/2023/10/08/researchers-from-itu-denmark-introduce-neural-developmental-programs-bridging-the-gap-between-biological-growth-and-artificial-neural-networks/feed/ 0 44109
Meet ConceptGraphs: An Open-Vocabulary Graph-Structured Representation for 3D Scenes https://www.marktechpost.com/2023/10/06/meet-conceptgraphs-an-open-vocabulary-graph-structured-representation-for-3d-scenes/ https://www.marktechpost.com/2023/10/06/meet-conceptgraphs-an-open-vocabulary-graph-structured-representation-for-3d-scenes/#respond Fri, 06 Oct 2023 13:00:16 +0000 https://www.marktechpost.com/?p=43964 Capturing and encoding information about a visual scene, typically in the context of computer vision, artificial intelligence, or graphics, is called Scene representation. It involves creating a structured or abstract representation of the elements and attributes present in a scene, including objects, their positions, sizes, colors, and relationships. Robots must build these representations online from […]

The post Meet ConceptGraphs: An Open-Vocabulary Graph-Structured Representation for 3D Scenes appeared first on MarkTechPost.

]]>

Capturing and encoding information about a visual scene, typically in the context of computer vision, artificial intelligence, or graphics, is called scene representation. It involves creating a structured or abstract representation of the elements and attributes present in a scene, including objects and their positions, sizes, colors, and relationships. Robots must build these representations online from onboard sensors as they navigate an environment.

The representations must be scalable and efficient enough to keep up with the volume of the scene and the duration of the robot's operation. They should also be open-vocabulary: not limited to concepts predefined in the training data, but capable of handling new objects and concepts during inference. Finally, they demand flexibility to enable planning over a range of tasks, from collecting dense geometric information to the abstract semantic information needed for task planning.

To meet the above requirements, the researchers at the University of Toronto, MIT, and the University of Montreal propose ConceptGraphs, a 3D scene representation method for robot perception and planning. Obtaining 3D scene representations by training foundation models directly would require internet-scale training data, and 3D datasets are still far from comparable size.

Existing dense representations assign every point a redundant semantic feature vector, which consumes more memory than necessary and limits scalability to large scenes. Because these representations are dense, they cannot be dynamically updated on the map and are hard to decompose. The method developed by the team instead describes scenes efficiently as graph structures with node representations, and it can be built into real-time systems that construct hierarchical 3D scene representations.

ConceptGraphs is an object-centric mapping system that integrates geometric data from 3D mapping systems with semantic data from 2D foundation models. This grounding of the 2D representations produced by image and language foundation models in the 3D world shows impressive results on open-vocabulary tasks, including language-guided object grounding, 3D reasoning, and navigation.

ConceptGraphs constructs open-vocabulary 3D scene graphs efficiently, yielding structured semantic abstractions for perception and planning. The team also implemented ConceptGraphs on real-world wheeled and legged robotic platforms and demonstrated that those robots can perform task planning for abstract language queries with ease.

Given RGB-D frames, the team runs a class-agnostic segmentation model to obtain candidate objects, associates them across multiple views using geometric and semantic similarity measures, and instantiates nodes in a 3D scene graph. They then use an LVLM to caption each node and an LLM to infer relationships between adjoining nodes, building the edges of the scene graph.
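The multi-view association step can be sketched as follows: a new detection is merged with an existing scene-graph node only when both its semantic feature and its 3D centroid are close enough to that node, and otherwise it becomes a new node. The thresholds, the tiny 2D features, and the dict layout here are illustrative assumptions, not ConceptGraphs' actual values.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def associate(detection, objects, sem_thresh=0.8, geo_thresh=0.5):
    """Merge a detection into an existing node when semantic AND geometric
    similarity pass their thresholds; otherwise instantiate a new node.
    Returns the index of the node the detection ended up in."""
    for idx, obj in enumerate(objects):
        sem_ok = cosine(detection["feature"], obj["feature"]) >= sem_thresh
        geo_ok = np.linalg.norm(detection["centroid"] - obj["centroid"]) <= geo_thresh
        if sem_ok and geo_ok:
            return idx           # merge with existing object node
    objects.append(detection)    # no match: new object node
    return len(objects) - 1

objects = [{"feature": np.array([1.0, 0.0]),
            "centroid": np.array([0.0, 0.0, 0.0])}]
det = {"feature": np.array([0.95, 0.05]),
       "centroid": np.array([0.1, 0.0, 0.0])}
idx = associate(det, objects)
```

Requiring both cues to agree is what keeps two different nearby objects, or two similar objects far apart, from being collapsed into one node.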

The researchers say that future work will involve integrating temporal dynamics into the model and assessing its performance in less structured and more challenging environments. Their model addresses key limitations in the existing landscape of dense and implicit representations.


Check out the Paper, GitHub, and Project. All Credit For This Research Goes To the Researchers on This Project.



]]>
https://www.marktechpost.com/2023/10/06/meet-conceptgraphs-an-open-vocabulary-graph-structured-representation-for-3d-scenes/feed/ 0 43964
Beyond the Fitzpatrick Scale: This AI Paper From Sony Introduces a Multidimensional Approach to Assess Skin Color Bias in Computer Vision https://www.marktechpost.com/2023/10/03/beyond-the-fitzpatrick-scale-this-ai-paper-from-sony-introduces-a-multidimensional-approach-to-assess-skin-color-bias-in-computer-vision/ https://www.marktechpost.com/2023/10/03/beyond-the-fitzpatrick-scale-this-ai-paper-from-sony-introduces-a-multidimensional-approach-to-assess-skin-color-bias-in-computer-vision/#respond Tue, 03 Oct 2023 10:44:28 +0000 https://www.marktechpost.com/?p=43772 Discrimination based on color continues to persist in many societies worldwide despite the progress in civil rights and social justice movements. They can harm individuals, communities, and society as a whole. These effects can manifest in various aspects of life, including psychological, social, economic, and health-related consequences. Efforts to address discrimination based on color should […]

The post Beyond the Fitzpatrick Scale: This AI Paper From Sony Introduces a Multidimensional Approach to Assess Skin Color Bias in Computer Vision appeared first on MarkTechPost.

]]>

Discrimination based on color continues to persist in many societies worldwide despite the progress of civil rights and social justice movements. It harms individuals, communities, and society as a whole, with effects that manifest in many aspects of life, including psychological, social, economic, and health-related consequences. Efforts to address discrimination based on color should involve comprehensive strategies that promote equity, inclusion, and diversity.

Can you even imagine that machine learning models can also discriminate on attributes like race and gender? Recent studies by the researchers at Sony AI and the University of Tokyo have shown that these models can produce wrong skin lesion (an abnormal change in skin structure) diagnostics or incorrect heart rate measurements for individuals with darker skin tones. Most computer vision models rely on the Fitzpatrick Skin Type scale, which is also used to determine appropriate treatments for various skin conditions and recommend sun protection measures. The researchers argue that depending solely on such scales will produce adverse decisions.

Describing apparent skin color perception remains an open challenge, as the final visual color perception results from complex physical and biological phenomena. Multilayered skin varies among individuals, who have different amounts and distributions of carotene, hemoglobin, and melanin throughout these layers. Fitzpatrick skin scores are assigned from spectrophotometer measurements of skin reflectance during the annotation process, and misclassification can occur because the present skin types are not objective and descriptive enough.

To overcome these limitations, the researchers use colorimetry to derive quantitative metrics, enabling more reliable skin color scores. Following the CIE standards for illuminants and tristimulus values, they work with images in the standard RGB (sRGB) space. To obtain a more comprehensive assessment of skin color, they use the hue angle to describe the perceived gradation of color.
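The hue angle is a standard colorimetric quantity: in CIELAB coordinates it is the angle of the (a*, b*) chroma vector, with larger angles shifted toward yellow and smaller angles toward red. A small sketch (the sample a*/b* values are illustrative, not measurements from the paper):

```python
import math

def hue_angle_deg(a_star, b_star):
    """CIELAB hue angle h = atan2(b*, a*), expressed in degrees in [0, 360).
    Used as a second, hue-based dimension of apparent skin color alongside
    a lightness-based dimension."""
    h = math.degrees(math.atan2(b_star, a_star))
    return h % 360.0

# A typical skin tone has positive a* (toward red) and b* (toward yellow).
h = hue_angle_deg(12.0, 16.0)   # falls between 0 deg (red) and 90 deg (yellow)
```

Because the angle separates "how red vs. how yellow" from "how light vs. how dark", it captures variation that a one-dimensional light-to-dark scale collapses.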

They first quantified the skin color bias in face datasets and in generative models trained on such datasets. They then broke down the results of saliency-based image cropping and face verification algorithms by skin color. Instead of using a uni-dimensional skin color score for fairness benchmarking, they use multidimensional skin color scores. For the generative-model analysis, they generated about 10,000 images with generative adversarial networks and diffusion models.

Their multidimensional skin color scale offers a more representative assessment that surfaces socially relevant biases due to skin color effects in computer vision. This can enhance diversity in the data collection process by encouraging specifications that better represent skin color variability, and it can improve the identification of dataset and model biases by highlighting their limitations, leading to fairness-aware training methods.


Check out the Paper and Sony Article. All Credit For This Research Goes To the Researchers on This Project.



]]>
https://www.marktechpost.com/2023/10/03/beyond-the-fitzpatrick-scale-this-ai-paper-from-sony-introduces-a-multidimensional-approach-to-assess-skin-color-bias-in-computer-vision/feed/ 0 43772