Dhanshree Shripad Shenwai, Author at MarkTechPost

This AI Research Introduces ‘RAFA’: A Principled Artificial Intelligence Framework for Autonomous LLM Agents with Provable Sample Efficiency

While LLMs’ reasoning capabilities are excellent, they still need to be improved to apply those capabilities in practical settings. In particular, how to provably accomplish a task with minimal interactions with the outside world (e.g., via an internal mechanism of reasoning) remains an open question.

To choreograph reasoning and action, a new study by Northwestern University, Tsinghua University, and the Chinese University of Hong Kong presents a principled framework called “reason for future, act for now” (RAFA), which provides provable regret guarantees. More precisely, they design a long-horizon trajectory planner (“reason for future”) whose reasoning is prompted with the contents of a memory buffer.

Within a Bayesian adaptive MDP framework, they formally describe how to reason and act with LLMs. At each step, the LLM agent executes the first action of the planned trajectory (“act for now”), stores the gathered feedback in the memory buffer, and then re-invokes the reasoning routine to replan the future trajectory from the current state.

The central principle is learning and planning in Bayesian adaptive Markov decision processes (MDPs), which is used to cast reasoning in LLMs as planning in an MDP. Concretely, the LLM is prompted to learn a more accurate posterior distribution over the unknown environment by consulting the memory buffer and to design a sequence of actions that maximizes a value function. When the state of the external environment changes, the LLM agent again calls the reasoning routine to plan a new course of action. To keep learning and planning consistent, the researchers use a switching condition to decide whether the more recent history should be incorporated.
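
A minimal sketch of this “reason for future, act for now” loop is shown below. The planner call, environment interface, and switching test are hypothetical placeholders standing in for the components described above, not the authors’ implementation.

```python
# Minimal sketch of the RAFA "reason for future, act for now" loop.
# `llm_plan_trajectory`, `switching_condition`, and `env` are illustrative
# placeholders, not the authors' actual code.

def rafa_episode(env, llm_plan_trajectory, switching_condition, max_steps=50):
    memory = []                                  # buffer of (state, action, feedback)
    state = env.reset()
    plan = llm_plan_trajectory(state, memory)    # "reason for future": plan a trajectory
    for _ in range(max_steps):
        action = plan[0]                         # "act for now": execute only the first action
        next_state, feedback, done = env.step(action)
        memory.append((state, action, feedback))
        if done:
            break
        # Replan from the new state. The switching condition decides whether the
        # newly gathered history should be used to update the learned posterior.
        if switching_condition(memory):
            plan = llm_plan_trajectory(next_state, memory)
        else:
            plan = plan[1:] or llm_plan_trajectory(next_state, memory)
        state = next_state
    return memory
```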

RAFA’s performance is assessed on several text-based benchmarks, including Game of 24, ALFWorld, BlocksWorld, and Tic-Tac-Toe, where the language model must carry out reinforcement-learning and planning tasks. The main findings are summarized below.

  • In Game of 24, RAFA must reach 24 by combining four natural numbers with basic arithmetic operations. The algorithm keeps track of the latest partial formula and proposes the next operation toward this target. RAFA performs exceptionally well in terms of sample efficiency.
  • ALFWorld is a virtual world where users may run simulations of household chores using embodied agents. RAFA achieves better results than competing frameworks like AdaPlanner, ReAct, and Reflexion.
  • In BlocksWorld, players are tasked with building structures out of blocks. Compared to other models such as Vicuna, RAP, and CoT, RAFA’s success rates are significantly higher.
  • RAFA plays “O” in a game of Tic-Tac-Toe against a language model playing “X.” Despite the disadvantage of moving second as “O,” RAFA competes with and even outperforms the language model in some settings. The researchers note that choosing a different planning depth (B = 3 or B = 4) may increase or decrease sample efficiency.

In conclusion, RAFA is a flexible algorithm that excels across a variety of settings and tasks, demonstrating strong sample efficiency and often exceeding existing frameworks.


How Does Retrieval Augmentation Impact Long-Form Question Answering? This AI Study Provides New Insights into How Retrieval Augmentation Impacts Long-Form, Knowledge-Rich Text Generation of Language Models

Long-form question answering (LFQA) aims to provide a complete and thorough response to any query. Parametric knowledge in large language models (LLMs), together with retrieved documents presented at inference time, enables LFQA systems to construct complex, paragraph-length answers rather than extracting spans from an evidence document. Recent years have revealed both the impressiveness and the fragility of large-scale LLMs’ LFQA capabilities. Retrieval has been proposed as a powerful way to supply LMs with up-to-date, relevant information, yet it is still unclear how retrieval augmentation influences LMs during generation, and it does not always have the expected effects.

Researchers from the University of Texas at Austin investigate how retrieval influences answer generation for LFQA, a challenging long-text generation problem. Their study sets up two controlled research settings: one in which the LM is held constant while the evidence documents are varied, and another in which the opposite holds. Because LFQA quality is hard to assess, they begin by measuring surface-level indicators (e.g., length, perplexity) associated with distinct answer attributes such as coherence. A particularly attractive property of retrieval-augmented LFQA systems is the ability to attribute the generated answer to the provided evidence documents, so newly collected human annotations of sentence-level attribution are used to test off-the-shelf attribution detection methods.
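
As a rough illustration of sentence-level attribution checking with an off-the-shelf NLI model (the study’s exact detectors, models, and thresholds are not reproduced here; the model name and threshold below are assumptions), one can score whether an evidence passage entails each generated sentence:

```python
# Hedged sketch: score whether an evidence passage entails a generated sentence
# with an off-the-shelf NLI model. Model choice and threshold are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "facebook/bart-large-mnli"   # one publicly available NLI model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def entailment_prob(evidence: str, sentence: str) -> float:
    """Probability that `evidence` (premise) entails `sentence` (hypothesis)."""
    inputs = tokenizer(evidence, sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1)[0]
    return probs[2].item()   # label order for this model: [contradiction, neutral, entailment]

evidence = "The Eiffel Tower, finished in 1889, is a wrought-iron tower in Paris."
answer_sentences = ["The Eiffel Tower was completed in 1889.",
                    "It is located in Berlin."]
attributed = [s for s in answer_sentences if entailment_prob(evidence, s) > 0.5]
print(attributed)   # only the first sentence should be attributable
```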

Based on their examination of surface patterns, the team concluded that retrieval augmentation significantly changes what the LM generates. The effects are not muted even when the retrieved documents are irrelevant; for example, the length of the generated responses may change. In contrast to irrelevant documents, documents that provide important in-context evidence cause LMs to produce more unexpected phrases. Even with an identical set of evidence documents, different base LMs can be affected by retrieval augmentation in contrasting ways. Their newly annotated dataset provides a gold standard against which to measure attribution detection. The findings show that NLI models that identified attribution in factoid QA also do well in the LFQA setting, surpassing chance by a wide margin but falling short of human agreement by about 15% in accuracy.

The research shows that even when given an identical set of documents, attribution quality can differ widely between base LMs. The study also sheds light on attribution patterns in long-text generation: the generated text tends to follow the order of the in-context evidence documents, even when the in-context input is a concatenation of several documents, and the last sentence of an answer is much less attributable than earlier sentences. Overall, the study clarifies how LMs leverage in-context evidence documents to answer in-depth questions and points toward actionable research directions.


Video Editing Enters a New Age with VideoCrafter: Open Diffusion AI Models for High-Quality Video Generation

VideoCrafter is a new open-source video creation and editing suite powered by diffusion models, a class of machine learning models that can generate photo- and video-realistic outputs from textual descriptions. Although it is still under active development, VideoCrafter could significantly alter the production process. With VideoCrafter, even those with zero experience in video editing or animation can easily produce professional-quality results.

How does VideoCrafter work?

VideoCrafter creates a visual sequence from a written description. A video is made by splicing together many such stills. VideoCrafter’s realistic image and video generation is made possible by diffusion models trained on a large corpus of text and images.
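
The article does not show VideoCrafter’s own programming interface, so the snippet below is only an analogy: a generic text-to-video diffusion pipeline from the Hugging Face diffusers library illustrating the same “text prompt in, frame sequence out” idea. The model ID and output attributes are assumptions about that library, not about VideoCrafter.

```python
# Illustrative analogy only: a generic diffusers text-to-video pipeline,
# NOT VideoCrafter's actual API. The model ID and output attribute names are
# assumptions and may differ across diffusers versions.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
).to("cuda")

result = pipe("a timelapse of a flower blooming, studio lighting", num_frames=16)
frames = result.frames          # in recent diffusers versions this may be result.frames[0]
export_to_video(frames, "flower.mp4")
```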

The versatile video editing software VideoCrafter can be used to make:

  • Explanatory animations
  • Product showcases
  • Instructional videos
  • Promotional videos
  • Music videos
  • Short video clips
  • Or anything else that comes to mind!

VideoCrafter: How to Use It

VideoCrafter must be downloaded and installed before it can be used; it is available for the supported operating systems. After you have downloaded and installed VideoCrafter, you can follow these steps to begin making videos:

  • Create a written outline of the video you intend to make.
  • Change the video parameters to your liking, including resolution and frame rate.
  • Click the button labeled “Generate.”
  • VideoCrafter will generate a video from your description.
  • The resulting video can then be modified with the help of the integrated editor.
  • After making adjustments, you can save the video in several formats, including MP4, MOV, and AVI.

Key Benefits

Ease of use: VideoCrafter’s user-friendliness means that it may be used effectively even by those who have never worked with video or animation software.

Superior quality: VideoCrafter’s output is on par with commercially produced films.

Versatility: Whether you want to make an explainer video or a short film, VideoCrafter can handle it.

Price: VideoCrafter is free because it is an open-source project.

Please visit https://github.com/AILab-CVC/VideoCrafter for more information on how to install and set up VideoCrafter.

Conclusion

Create professional-quality videos with ease using the innovative new VideoCrafter. It is free, flexible, easy to use, and produces professional-looking results. VideoCrafter is ideal for anyone who wants to create professional-looking videos but lacks the time or expertise to learn video editing software or animation programs.

Although it is still under active development, VideoCrafter could significantly alter the production process. It could be used to make interactive videos, add realistic special effects, and even generate personalized versions of a video.

VideoCrafter may also make video creation more accessible to people with disabilities.

In conclusion, VideoCrafter is a groundbreaking new app that has the potential to revolutionize the video production and distribution industries.


Deciphering Memorization in Neural Networks: A Deep Dive into Model Size, Memorization, and Generalization on Image Classification Benchmarks

Statistical learning theory suggests a trade-off between memorizing training data and generalizing to test samples. However, the success of overparameterized neural models casts doubt on this view; such models can memorize, as evidenced by their ability to fit random labels, yet still generalize well. In practice, these models are commonly trained to perfect classification accuracy, i.e., to interpolate the training set, which has sparked a slew of studies investigating their generalization.

Feldman recently showed that memorization may be required for generalization in certain settings. Here, “memorization” is defined by a stability-based quantity with theoretical underpinnings: high-memorization examples are those that the model can only classify correctly if they are included in the training set. For practical neural networks, this quantity allows the degree of memorization of a training sample to be estimated. Feldman and Zhang examined a ResNet’s memorization profile when classifying images on standard benchmarks.
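
A back-of-the-envelope way to estimate this stability-based score is to train many models on random subsets of the data and compare each example’s accuracy when it is included versus held out. The sketch below is a simplified illustration of that idea (Feldman and Zhang use a more careful subsampled estimator), and `train_model` is a hypothetical routine.

```python
# Simplified sketch of a stability-based memorization estimate:
# mem(x) ~= P[correct | x in training set] - P[correct | x held out],
# averaged over models trained on random subsets of the data.
# `train_model(indices)` is a hypothetical routine returning a predict function.
import numpy as np

def estimate_memorization(train_model, X, y, n_trials=20, subset_frac=0.7, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    hits_in = np.zeros(n); count_in = np.zeros(n)
    hits_out = np.zeros(n); count_out = np.zeros(n)
    for _ in range(n_trials):
        in_subset = rng.random(n) < subset_frac            # random training subset
        predict = train_model(np.flatnonzero(in_subset))   # train on that subset
        correct = predict(X) == y                          # per-example correctness
        hits_in += correct & in_subset;    count_in += in_subset
        hits_out += correct & ~in_subset;  count_out += ~in_subset
    return hits_in / np.maximum(count_in, 1) - hits_out / np.maximum(count_out, 1)
```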

While this is an intriguing first look at what real-world models memorize, a fundamental question remains: do larger neural models memorize more? Google researchers based in New York answer this question empirically, providing a comprehensive look at image classification benchmarks. They discover that training examples exhibit a surprising variety of memorization trajectories across model sizes: some samples show cap-shaped or increasing memorization, while others show decreasing memorization as models grow larger.

To produce high-quality models of varied sizes, practitioners use a systematic process known as knowledge distillation. Specifically, it entails training high-quality small (student) models with guidance from high-performing large (teacher) models.

Feldman’s concept of memorization has been used to theoretically examine the relationship between memorization and generalization across a range of model sizes. The following are their contributions based on the results of controlled experiments: 

  • A quantitative investigation of the relationship between model complexity (such as the depth or width of a ResNet) and memorization for image classifiers is presented. The primary finding is that as model complexity increases, the distribution of memorization across examples becomes increasingly bi-modal. They also note that other computationally tractable proxies for memorization and example difficulty fail to capture this essential trend.
  • To investigate the bi-modal memorization trend further, they give examples displaying different memorization-score trajectories across model sizes and identify the four most frequent trajectory types, including one where memorization increases with model complexity. Ambiguous and mislabeled examples, in particular, are found to follow this pattern.
  • For samples that the one-hot (i.e., non-distilled) student memorizes, the researchers conclude with a quantitative study showing that distillation tends to impede memorization. Interestingly, memorization is hampered primarily for the examples whose memorization increases with model size, which suggests that distillation aids generalization by reducing the need to memorize such difficult examples.

The researchers begin by quantitatively analyzing the relationship between model complexity (the depth and width of a ResNet used for image classification) and memorization. They provide a graphic representation of the relationship between ResNet depth and memorization score on two well-known datasets (CIFAR-100 and ImageNet). Their investigation reveals that contrary to their initial beliefs, the memorization score decreases after reaching a depth of 20.

The researchers conclude that the distribution of memorization across examples becomes increasingly bi-modal as model complexity grows. They also point out a problem with current computationally feasible approaches for evaluating memorization and example difficulty: these methods fail to capture this crucial pattern.

To dig deeper into the bi-modal memorization pattern, the study gives examples with varied memorization-score trajectories across different model sizes. They single out four main classes of trajectories, one of which involves memorization increasing with model complexity. In particular, they find that both ambiguous and mislabeled samples tend to follow this pattern.

The study concludes with a quantitative analysis showing that distillation, by which knowledge is transferred from a large teacher model to a smaller student model, is associated with a decrease in memorization. This effect is most noticeable for samples memorized by the one-hot, non-distilled student model. Interestingly, distillation predominantly reduces memorization in cases where memorization rises with model size. Based on this evidence, the authors conclude that distillation improves generalization by preventing the student from memorizing too many difficult examples.

In Conclusion:

The findings have substantial practical implications and suggest future directions for research. First, caution is needed when estimating memorization through proxies alone. Various metrics defined in terms of model training or model inference have been proposed as effective surrogates for the memorization score in prior publications. These proxies show a high agreement rate with memorization, yet the researchers found that they differ greatly in distribution and fail to capture essential features of the memorization behavior of real-world models. This suggests a path forward for finding efficiently computable proxies for memorization scores. Example difficulty has also previously been characterized for a single, predetermined model size; the results highlight the value of considering several model sizes when characterizing examples. For instance, Feldman defines the long-tail examples of a dataset as the ones with the highest memorization score for a certain architecture. The results show that what is memorized at one model size may not carry over to another.


Meet FastEmbed: A Fast and Lightweight Text Embedding Generation Python Library

Words and phrases can be effectively represented as vectors in a high-dimensional space using embeddings, making them a crucial tool in the field of natural language processing (NLP). Machine translation, text classification, and question answering are just a few of the numerous applications that can benefit from the ability of this representation to capture semantic connections between words.

However, when dealing with large datasets, the computational requirements for generating embeddings can be daunting. Traditional embedding approaches like Word2Vec and GloVe rely on statistics gathered over an entire corpus; GloVe, in particular, requires constructing a large co-occurrence matrix, which can become unmanageably large for very large corpora or vocabulary sizes.

To address the challenges of slow embedding generation, the Python community has developed FastEmbed. FastEmbed is designed for speed, minimal resource usage, and precision. This is achieved through its cutting-edge embedding generation method, which eliminates the need for a co-occurrence matrix.

Rather than simply mapping words into a high-dimensional space, FastEmbed employs a technique called random projection. By utilizing the dimensionality reduction approach of random projection, it becomes possible to reduce the number of dimensions in a dataset while preserving its essential characteristics.

FastEmbed randomly projects words into a space where they are likely to be close to other words with similar meanings. This process is facilitated by a random projection matrix designed to preserve word meanings.
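
To make the random-projection idea concrete (a toy, self-contained illustration of the technique described above, not FastEmbed’s internal code), a fixed Gaussian random matrix can map high-dimensional vectors down to a much smaller space while approximately preserving distances:

```python
# Toy illustration of random projection (not FastEmbed's internals): a fixed
# Gaussian matrix maps high-dimensional vectors to a low-dimensional space
# while approximately preserving pairwise distances (Johnson-Lindenstrauss).
import numpy as np

rng = np.random.default_rng(42)
d_high, d_low = 10_000, 256

x1 = rng.random(d_high)          # two high-dimensional word/document vectors
x2 = rng.random(d_high)

# Projection matrix scaled so expected norms are preserved after projection.
R = rng.normal(0.0, 1.0 / np.sqrt(d_low), size=(d_low, d_high))

z1, z2 = R @ x1, R @ x2
print("original distance :", np.linalg.norm(x1 - x2))
print("projected distance:", np.linalg.norm(z1 - z2))   # close to the original
```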

Once words are mapped into the high-dimensional space, FastEmbed employs a straightforward linear transformation to learn embeddings for each word. This linear transformation is learned by minimizing a loss function designed to capture semantic connections between words.

It has been demonstrated that FastEmbed is significantly faster than standard embedding methods while maintaining a high level of accuracy. FastEmbed can also be used to create embeddings for extensive datasets while remaining relatively lightweight.

FastEmbed’s Advantages

  • Speed: Compared to other popular embedding methods like Word2Vec and GloVe, FastEmbed offers remarkable speed improvements.
  • Lightweight: FastEmbed is a compact yet powerful library for generating embeddings in large databases.
  • Accuracy: FastEmbed is as accurate as other embedding methods, if not more so.

Applications of FastEmbed

  • Machine Translation
  • Text Categorization
  • Answering Questions and Summarizing Documents
  • Information Retrieval and Summarization

FastEmbed is an efficient, lightweight, and precise toolkit for generating text embeddings. If you need to create embeddings for massive datasets, FastEmbed is an indispensable tool.


This AI Research Developed a Noise-Resistant Method for Detecting Object Edges Without Prior Imaging

Significant attention in computer vision has been devoted to developing robust and efficient edge detection algorithms. Edge detection approaches, which span from traditional algorithms based on differential operators to cutting-edge algorithms based on neural networks, have contributed significantly to security, environmental sensing, and healthcare. Because these image-processing methods recover edge information from images, conventional edge extraction requires a complete photograph of the target object in advance, so the success of edge identification depends on the quality of the input image. However, standard optical imaging technologies have difficulty obtaining clear images of targets in complicated settings, such as objects hidden behind fog, in murky water, or within biological tissue, especially in scenes with heavy light pollution. Edge detection quality in the final image may suffer as a result.

The study, published in Intelligent Computing, introduces edge-sensitive single-pixel imaging. The method is particularly useful for accurately detecting object edges despite noise in settings where acquiring good images through standard optical methods is difficult, for example under severe light pollution.

The viability of SI-based edge detection algorithms has only recently been established. Direct noise-resistant edge-sensitive single-pixel imaging (ESI) provides high-quality edge extraction without preliminary imaging or post-processing, illuminating an object with carefully crafted modulation patterns to extract its edges. In Hadamard single-pixel imaging (HSI), a set of Hadamard basis patterns is projected onto the object and the total light is recorded by a single-pixel detector to reconstruct a full image. ESI instead obtains edge-sensitive modulation patterns by convolving the Hadamard basis patterns with second-order differential operators. This approach directly acquires the edge-sensitive Hadamard spectrum of the object to detect edges, bypassing the need for any preliminary imaging. ESI uses binary modulation patterns to speed up edge detection and boost the signal-to-noise ratio (SNR).
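
A simplified sketch of this construction (not the authors’ exact pipeline or pattern ordering) is to take 2D Hadamard basis patterns and convolve each with a discrete Laplacian kernel to obtain edge-sensitive, binarized modulation patterns:

```python
# Simplified sketch: build edge-sensitive modulation patterns by convolving
# 2D Hadamard basis patterns with a discrete Laplacian kernel. This mirrors
# the idea described above but is not the authors' exact implementation.
import numpy as np
from scipy.linalg import hadamard
from scipy.signal import convolve2d

N = 8                                     # pattern resolution (N x N), power of 2
H = hadamard(N)                           # 1D Hadamard matrix with +/-1 entries
laplacian = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)

patterns = []
for u in range(N):
    for v in range(N):
        basis = np.outer(H[u], H[v])                       # 2D Hadamard basis pattern
        edge_pattern = convolve2d(basis, laplacian,
                                  mode="same", boundary="symm")
        patterns.append(np.sign(edge_pattern))             # binarize for projection
```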

A variant, SESI, was developed that uses half as many modulation patterns as ESI while still detecting edges quickly. As a result, SESI can detect edges in half the time, making SI-based edge detection more practical. The Laplacian and the Laplacian of Gaussian (LoG), two common second-order differential operators, take up most of the discussion here. Both theoretical and practical evaluations examine their effect on the outcomes of edge detection simulations and experiments. These tests demonstrate that ESI and SESI can directly extract sharp edges despite substantial background noise.

SI is a form of computational imaging that tailors the scene’s illumination to a given objective. In this research, the illumination patterns were created with the specific goal of edge detection in mind. This is analogous to end-to-end optimized computational imaging, which also creates illumination patterns for a specific task (such as edge detection). Here, however, the edge-detection illumination patterns are built from a mathematical model that is deterministic and interpretable, whereas end-to-end optimized illumination patterns are derived through data-driven optimization, which faces a high bar for achieving global optimality. This study focuses only on the illumination patterns generated by a pair of representative second-order differential operators.

Traditional SI acquires an object’s picture by projecting its matching modulation basis patterns and then uses an inverse transform or compressive sensing image recovery technique to rebuild the target image. 

Typical Hadamard single-pixel imaging patterns were convolved using second-order differential operators to create the modulation patterns developed by the researchers. The noise immunity of this differential edge detection method is much improved, allowing for the clear and accurate detection of edges. Particularly impressive is the method’s ability to detect edges in real-time, even on moving objects, indicating its promise for use in covert security checks in invisible spectrums. The research also presents a one-round variant of the new method, which cuts the detection time in half by using fewer modulation patterns for edge detection. Despite this simplification, the system still uses fewer modulation patterns and has a higher signal-to-noise ratio than previously published edge detection schemes.

By pre-coding modulation patterns, the novel technology can produce immediate results in an “image-free” fashion, allowing for a large range of applications in image processing. As a result, homomorphic filtering and other image-processing techniques can be incorporated with less interference from background noise. Future improvements are expected to include the researchers’ exploration of end-to-end optimization and the optimization of the illumination patterns used in this work.

The authors also analyzed how the Laplacian and LoG operators affect the resilience of the ESI schemes. Simulation studies showed that Laplacian ESI and LoG ESI have similar noise robustness in terms of SNR, while Laplacian ESI produces sharper edges; the experimental findings agreed with the simulations, with LoG ESI yielding coarser edges. The proposed ESI methodology thus offers an alternative way to retrieve an object’s edge image, and the idea that conventional image-processing operations can be pre-coded into modulation patterns to produce direct, “image-free” results opens a new dimension, since pre-coded modulation patterns are immune to disturbances and noise in the surrounding environment. Homomorphic filtering is just one of many image-processing techniques that could be pre-coded in this way for enhanced outcomes. The illumination patterns developed in this work can be fine-tuned and used as a starting point for full system optimization.


M42 Introduces Med42: An Open-Access Clinical Large Language Model (LLM) to Expand Access to Medical Knowledge

M42 Health, based in Abu Dhabi, UAE, has just published Med42, a promising new open-access clinical large language model. The release of this 70 billion parameter model is a watershed moment in the effort to increase public access to advanced AI capabilities that can revolutionize healthcare.

Med42, fine-tuned from Meta’s Llama-2 – 70B model, outperforms its predecessors in open-source medical AI by a wide margin. The model surpasses OpenAI’s ChatGPT 3.5 across many medical question-answering datasets, achieving up to 72% accuracy in a zero-shot evaluation on the USMLE. This demonstrates Med42’s ability to help with clinical decision-making by giving doctors easy access to medical knowledge that has been synthesized.

The M42 Health AI team built Med42 using a massive, human-curated dataset of medical literature and patient information. M42, Cerebras, and Core42 (an M42 subsidiary) worked together to fine-tune the model on the Condor Galaxy 1 supercomputer. The model’s efficacy was also assessed by experts at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI).

M42’s Med42 is a free, publicly available clinical large language model (LLM) created to make medical information more accessible. Based on LLaMA-2 with 70 billion parameters, this generative AI system offers accurate responses to medical inquiries.

One of Med42’s strongest points is its adaptability. As an AI helper, it has the potential to alter medical judgment significantly. It may be used for everything from generating personalized treatment plans based on medical records to speeding up the process of combing through mountains of medical material.

As an AI helper with the potential to improve clinical decision-making and expand access to an LLM for healthcare use, Med42 is now available for testing and evaluation. Examples of possible applications are:

  • Answering health-related questions
  • Summarizing medical histories
  • Supporting medical diagnosis
  • Addressing common health questions

The code and weights of Med42 have been released on Hugging Face, inviting broad scientific examination and feedback to foster collaboration and continued development. Med42’s licensing terms are modeled after those of Meta’s Llama 2 model, making it available for free research and non-commercial use while imposing appropriate constraints to account for the risks and obligations associated with using AI in healthcare.
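
A hedged sketch of loading the released weights with Hugging Face transformers follows; the repository name and prompt format are assumptions (the official model card should be consulted for the exact identifier, license gating, and recommended prompt template), and a 70-billion-parameter model requires multiple GPUs or quantization.

```python
# Hedged sketch: querying Med42 via Hugging Face transformers. The repository
# name and prompt format are assumptions; check the official model card.
# A 70B model needs several GPUs or quantization to run.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m42-health/med42-70b"   # assumed repository name; verify on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What are common first-line treatments for type 2 diabetes?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```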

Key indicators of performance:

  • Med42 outperforms other publicly available medical LLMs with an accuracy of 72% on a sample USMLE exam.
  • 61.5% accuracy on the MedQA dataset (GPT-3.5 reaches about 50%).
  • Results on MMLU clinical topics are consistently better than GPT-3.5’s.

Limitations:

  • The clinical application of Med42 is still in its early stages; extensive human evaluation is underway to ensure safety.
  • Risk of generating misleading or harmful content.
  • Possible bias inherited from the training data.

Though the findings are encouraging, the researchers warn that further real-world validation of Med42 is necessary before it can be used in clinical practice. Problems may arise from producing inaccurate or harmful results or failing to address existing training data biases. As Med42 moves beyond baselines and toward potentially substantial patient benefits, M42 emphasizes the importance of responsible testing.

Med42 showcases the remarkable development of medical AI while stressing the importance of ethics and safety in research and development. Researchers all over the world will be able to benefit from its open-access publication because of this. Models like Med42 can improve healthcare decision-making and expand access to treatment on a global scale if subjected to thorough validation. Its release is a significant step forward in healthcare AI, but realizing its full potential will require continued openness and teamwork.


Microsoft Azure AI Introduces Idea2Img: A Self-Refining Multimodal AI Framework For The Development And Design Of Images Automatically

The goal of “image design and generation” is to produce an image based on a broad concept provided by the user. This input IDEA may include reference images, such as “the dog looks like the one in the image,” or instructions that further define the design’s intended application, such as “a logo for the Idea2Img system.” Humans can use text-to-image (T2I) models to create a picture based on a thorough description of an imagined image, but they must manually explore several options until they find the prompt that best expresses the IDEA (the T2I prompt).

In light of the impressive capabilities of large multimodal models (LMMs), the researchers investigate whether or not we can train systems based on LMMs to acquire the same iterative self-refinement ability, freeing people from the laborious task of translating concepts into visuals. When venturing into the unknown or tackling difficult tasks, humans have the innate propensity to continually enhance their methods. Natural language processing tasks like acronym generation, sentiment retrieval, text-based environment exploration, etc., can be better addressed with the help of self-refinement, as shown by large language model (LLM) agent systems. Challenges in enhancing, grading, and verifying multimodal contents, such as many interleaved image-text sequences, arise when we move from text-only activities to multimodal settings.

Self-exploration enables an LMM framework to automatically learn to address a wide range of real-world challenges, such as operating a graphical user interface (GUI) on a digital device, traversing unknown terrain with an embodied agent, playing a digital game, and so on. Researchers from Microsoft Azure study this multimodal capacity for iterative self-refinement by focusing on “image design and generation” as the task to investigate. To this end, they present Idea2Img, a self-refining multimodal framework for automatic image design and generation. In Idea2Img, an LMM, GPT-4V(ision), interacts with a T2I model to probe its behavior and identify an effective T2I prompt. The LMM handles both analyzing the T2I model’s return signal (i.e., draft images) and generating the next round’s queries (i.e., text T2I prompts).

T2I prompt generation, draft image selection, and feedback reflection all contribute to the multimodal iterative self-refinement capability. To be more specific, GPT-4V performs the following steps: 

  1. Prompt generation: GPT-4V generates N text prompts that correspond to the input multimodal user IDEA, conditioned on the previous text feedback and refinement history
  2. Draft image selection: GPT-4V carefully compares N draft images for the same IDEA and selects the most promising one
  3. Feedback reflection: GPT-4V analyzes the discrepancy between the draft image and the IDEA. Then, GPT-4V gives feedback on what went wrong, why it went wrong, and how the T2I prompts could be improved. 

In addition, Idea2Img has a built-in memory module that keeps track of the exploration history for each prompt type (image, text, and feedback). For automatic image design and generation, the Idea2Img framework repeatedly cycles through these three GPT-4V-based steps. As an image design and creation assistant, Idea2Img stands out from plain T2I models by accepting design directions instead of a detailed picture description, accommodating multimodal IDEA input, and producing images with higher semantic and visual quality.
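
The overall loop can be summarized in pseudocode roughly as follows; the GPT-4V and T2I calls are illustrative placeholders for the steps described above, not the authors’ implementation.

```python
# Pseudocode-style sketch of the Idea2Img self-refinement loop. The gpt4v_*
# and t2i_generate functions are placeholders for the GPT-4V and text-to-image
# calls described in the article, not the actual implementation.

def idea2img(idea, gpt4v_generate_prompts, gpt4v_select, gpt4v_reflect,
             t2i_generate, n_prompts=3, n_rounds=5):
    memory = {"images": [], "prompts": [], "feedback": []}
    best_image = None
    for _ in range(n_rounds):
        # 1) Prompt generation, conditioned on the refinement history.
        prompts = gpt4v_generate_prompts(idea, memory, n=n_prompts)
        drafts = [t2i_generate(p) for p in prompts]
        # 2) Draft image selection: pick the most promising candidate.
        best_idx = gpt4v_select(idea, drafts)
        best_image, best_prompt = drafts[best_idx], prompts[best_idx]
        # 3) Feedback reflection: what went wrong and how to improve the prompt.
        feedback, satisfied = gpt4v_reflect(idea, best_image, best_prompt)
        memory["images"].append(best_image)
        memory["prompts"].append(best_prompt)
        memory["feedback"].append(feedback)
        if satisfied:
            break
    return best_image
```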

The team reviewed some sample cases of picture creation and design. For instance, Idea2Img may process IDEA with arbitrarily interleaved picture-text sequences, include the visual design and intended usage description into IDEA, and extract arbitrary visual information from the input image. Based on these updated features and use cases, they created a 104-sample evaluation IDEA set with complex questions that humans might get wrong the first time. The team employs Idea2Img and various T2I models to conduct user preference studies. Improvements in user preference scores across many image-generating models, such as +26.9% with SDXL, demonstrate Idea2Img’s efficacy in this area.


Recognition and Generation of Object-State Compositions in Machine Learning Using “Chop and Learn”

The real world contains objects of varying sizes, hues, and textures. Visual qualities, often called states or attributes, can be innate to an object (such as color) or acquired through actions (such as being cut). Current data-driven recognition models (e.g., deep networks) presuppose that robust training data is available for every object attribute, yet they still struggle to generalize to unseen aspects of objects. Humans and other animals, in contrast, have an inbuilt ability to recognize and envision a wide variety of objects with different properties by composing a small number of known objects and states. Modern deep learning models often lack this compositional generalization: the capacity to synthesize and detect new combinations from a finite set of concepts.

To aid the study of compositional generalization—the ability to recognize and produce unseen compositions of objects in different states—a group of researchers from the University of Maryland propose a new dataset, Chop & Learn (ChopNLearn). To zero in on the compositional component, they restrict the setting to chopping fruits and vegetables: these items change form in recognizable ways depending on the style of cut applied. The purpose is to examine how these cutting styles, learned on some objects, can be recognized on other objects without direct observation. Their choice of 20 objects and seven common cutting styles (including the whole object) yields object-state pairs of varying granularity and size.

The first task requires the system to generate an image from an (object, state) composition not encountered during training. For this purpose, the researchers propose adapting existing large-scale text-to-image generative models. They compare several existing approaches, including Textual Inversion and DreamBooth, by using text prompts to represent the object-state composition. They also propose a variant that adds new tokens for objects and states while jointly fine-tuning the language and diffusion models. Finally, they evaluate the strengths and weaknesses of the proposed generative model and the existing literature.
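
A rough sketch of the token-augmentation variant is shown below; the text-encoder name and placeholder token strings are assumptions for illustration, and the authors’ actual training recipe (which also adjusts the diffusion model) is not reproduced here.

```python
# Hedged sketch: add new object/state tokens to a CLIP text encoder before
# fine-tuning, in the spirit of the token-augmentation variant described above.
# The model name and token strings are illustrative assumptions.
from transformers import CLIPTextModel, CLIPTokenizer

model_name = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(model_name)
text_encoder = CLIPTextModel.from_pretrained(model_name)

new_tokens = ["<obj-apple>", "<state-round-cut>"]   # hypothetical placeholder tokens
tokenizer.add_tokens(new_tokens)
text_encoder.resize_token_embeddings(len(tokenizer))

prompt = "a photo of <obj-apple> cut in <state-round-cut> style"
ids = tokenizer(prompt, return_tensors="pt").input_ids
# The new token embeddings (and, in the joint variant, the diffusion model)
# would then be optimized on images of the corresponding (object, state) pairs.
```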

The second task expands on existing work in compositional action recognition. While past work has focused on long-term activity tracking in videos, this task aims to detect subtle changes in object states, a key first step for activity recognition. By recognizing the object-state compositions at the beginning and end of a video, the model can learn changes in state that are otherwise hard to observe directly. Using the ChopNLearn dataset, the authors compare several state-of-the-art baselines for video tasks. The study concludes by discussing the many image- and video-related tasks that could benefit from the dataset.

Here are some of the contributions:

  • The proposed ChopNLearn dataset includes images and videos captured from multiple camera angles, representing different object-state compositions.
  • They introduce a new task, Compositional Image Generation, which generates images for compositions of objects and states that were not seen during training.
  • They establish a new benchmark for Compositional Action Recognition, which aims to learn and recognize how object states change over time and from diverse viewpoints.

Limitations

Few-shot generalization is becoming increasingly important as foundation models become available. This work investigates ChopNLearn’s potential for studying compositional generation and recognition of highly intricate and interrelated concepts. ChopNLearn is, admittedly, a small-scale dataset with a green-screen background, which limits the generalizability of models trained on it. However, this is the first attempt to learn how different objects might share common fine-grained states (cut styles). The authors investigate this by training and testing more complex models on ChopNLearn, and by fine-tuning those models with and without the green screen. Further, they anticipate that the community will benefit from employing ChopNLearn in even more difficult tasks such as 3D reconstruction, video frame interpolation, state change generation, and more.

Visit https://chopnlearn.github.io/ for further information.

To sum it up

Researchers offer ChopNLearn, a novel dataset for gauging compositional generalization, or the capacity of models to detect and build unseen compositions of objects in different states. In addition, they present two new tasks—Compositional Image Generation and Compositional Action Recognition—on which to evaluate the effectiveness of existing generative models and video recognition techniques. They illustrate the problems with the current methods and their limited generalizability to new compositions. These two activities, however, are merely the tip of the proverbial iceberg. Multiple image and video activities rely on understanding object states, including 3D reconstruction, future frame prediction, video production, summarization, and parsing of long-term video. As a result of this dataset, researchers hope to see new compositional challenges for photos, videos, 3D, and other media proposed and learned by the computer vision community. 


This AI Paper Introduces Lemur and Lemur-Chat For Harmonizing Natural Language and Code For Language Agents

In a broad sense, intelligent agents are autonomous problem solvers endowed with perception, judgment, and action capabilities based on data gathered from their surroundings. Recent applications of this idea have shown promise in developing language agents that can use natural language to carry out a wide range of complex tasks in various contexts, especially when these agents are built on large language models (LLMs). Agents of this type can mimic human thought and language because they draw on human expertise encoded in LLMs. This allows them to use tools flexibly, adapt to new situations, reason over language, and form multi-agent systems on the fly.

To properly serve as the foundation of language agents, LLMs should grasp human interaction, reasoning, and planning, and should be grounded in the relevant environments. However, execution in an environment is typically accomplished through general-purpose code or domain-specific APIs, such as those used to manage web browsers, communicate with operating-system command-line terminals, and control robotic arms.

To fill this gap, a new study by the University of Hong Kong, XLang Lab, Salesforce Research, Sea AI Lab, University of Washington, and MIT CSAIL presents Lemur and Lemur-Chat, two state-of-the-art, publicly available models that have been pre-trained and fine-tuned to achieve harmony between text and code. Through carefully crafted pre-training and instruction fine-tuning steps, the researchers improved the original Llama-2-70B. To boost coding ability while retaining natural-language performance, they constructed a code-centric corpus based on The Stack, comprising 90 billion tokens with a 10:1 code-to-text ratio; the resulting base model is Lemur. To create the instruction-following model, Lemur-Chat, they further fine-tuned it on roughly 100K instruction instances spanning both text and code. Extensive examinations across 8 textual and coding benchmarks show Lemur and Lemur-Chat to be the most well-rounded open-source models.
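
As a toy illustration of assembling such a mixed corpus at a fixed code-to-text ratio (a sketch only; the ratio is just a parameter here, and The Stack preprocessing, deduplication, and the authors’ exact pipeline are not shown):

```python
# Toy sketch: interleave code and text documents at a target code-to-text
# ratio. Real corpus construction (loading The Stack, cleaning, deduplication,
# tokenization) is omitted; the ratio is just a parameter of this illustration.
import random

def mix_corpus(code_docs, text_docs, code_to_text_ratio=10, seed=0):
    """Interleave code and text documents at roughly the requested ratio."""
    rng = random.Random(seed)
    mixed = []
    code_iter, text_iter = iter(code_docs), iter(text_docs)
    try:
        while True:
            for _ in range(code_to_text_ratio):
                mixed.append(next(code_iter))
            mixed.append(next(text_iter))
    except StopIteration:
        pass
    rng.shuffle(mixed)
    return mixed

corpus = mix_corpus(["def f(): ..."] * 100, ["A natural language passage."] * 10)
print(len(corpus))   # roughly ten code documents for every text document
```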

In addition, this effort sets out to provide agent benchmarks for evaluating the core competencies of language agents in various settings. The team focuses particularly on the agents’ skill with tools and their ability to ground themselves in both environmental and social feedback. They also investigate the difficulties inherent in real-world, partially observable situations, where the agent must act on incomplete information and take additional actions to fill in the gaps. Experiments show that Lemur-Chat performs better on 12 of the 13 agent benchmarks than other open-source models. This exemplifies how Lemur-Chat, by combining natural-language and coding abilities, can narrow the gap between open-source and commercial alternatives for language agents.

The results of these tests demonstrate the importance of combining linguistic and computational skills in agent settings. Models like Llama-2-70B-Chat, which excel at natural language processing but struggle with coding, can efficiently use basic tools to aid reasoning, because the action space is constrained and the effort of employing such tools is low. In contrast, in sophisticated decision-making scenarios such as web browsing and household navigation, the action space is typically enormous, and models with strong coding abilities have an edge when constructing complex executable action sequences. In sum, Lemur’s performance can be attributed to its strength in both natural language processing and programming. By shedding light on how to optimize the synergy between natural and programming languages, this study lays the groundwork for sophisticated language agents that can function well in a wide range of settings.

