Magma: Microsoft's Multimodal Foundation Model for AI Agents Across Multiple Scenarios

Published by 映技派 on 2025-03-12 in AI Products

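What is Magma?

Magma is a multimodal foundation model for AI agents from Microsoft. It handles complex interactions in both virtual and real-world environments and supports tasks such as image captioning and question answering, video captioning and question answering, UI navigation, and robot manipulation.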

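Magma features

Multimodal capability: supports image captioning and question answering, video captioning and question answering, UI navigation, robot manipulation, and similar tasks.
Interaction with the digital and physical worlds: handles tasks in both virtual and real-world environments.
Versatility: a single model provides general image and video understanding and can also generate goal-driven visual plans and actions.
Strong performance: performs well on multimodal tasks, particularly those that require spatial understanding and reasoning.
Scalable pretraining strategy: learns from unlabeled video, which gives the model strong generalization ability.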

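How Magma works

Multimodal pretraining: combines image, video, and action data in a unified framework for large-scale pretraining, so the model learns the connections between modalities.
Set-of-Mark (SoM): labels actionable objects in images (for example, clickable UI elements), helping the model ground its actions.
Trace-of-Mark (ToM): labels the motion trajectories of objects in video, strengthening the model's understanding of temporal dynamics.
Combining vision and language: a convolutional network encodes visual input into a sequence of tokens, which is passed to a language model together with the text to generate actions or language descriptions.
Generalization and fine-tuning: the pretrained model has zero-shot generalization ability, and fine-tuning can improve its performance further.
Cross-task adaptation: the same model applies to a range of tasks (such as UI navigation, robot manipulation, and image and video understanding), showing strong generalization.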
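
The Set-of-Mark idea above can be illustrated with a small sketch: candidate regions are numbered, and an action is expressed as a mark id rather than as raw pixel coordinates. This is a simplified illustration, not Magma's actual implementation. The `Box` class, the reading-order numbering, and the sample coordinates are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical detector output)."""
    left: int
    top: int
    right: int
    bottom: int

    def center(self) -> tuple[int, int]:
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def assign_marks(boxes: list[Box]) -> dict[int, Box]:
    """Number candidate boxes 1..N in reading order (top to bottom, then left to right)."""
    ordered = sorted(boxes, key=lambda b: (b.top, b.left))
    return {mark: box for mark, box in enumerate(ordered, start=1)}


def mark_to_click(marks: dict[int, Box], chosen_mark: int) -> tuple[int, int]:
    """Turn a chosen mark id back into a pixel coordinate to click."""
    if chosen_mark not in marks:
        raise KeyError(f"unknown mark id: {chosen_mark}")
    return marks[chosen_mark].center()


# Hypothetical clickable regions on a screenshot.
candidates = [
    Box(200, 20, 280, 50),   # a button to the right of the text field
    Box(20, 20, 180, 50),    # a text field
    Box(20, 100, 120, 130),  # a button on the next row
]
marks = assign_marks(candidates)
click_xy = mark_to_click(marks, 2)  # pretend the model answered "mark 2"
print(click_xy)
```

Choosing among a handful of numbered marks is a simpler prediction target than regressing exact coordinates, which is the intuition behind using marks for action grounding.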

[Figure: How Magma works]
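
Trace-of-Mark, described above, can be sketched in a similar way: given the per-frame positions of each mark, the supervision target at time t is the path that mark follows over the next few frames. The sketch below is an assumption-laden illustration rather than Magma's data pipeline. The `tracks` layout, the `horizon` parameter, and the `min_motion` threshold for discarding near-static marks are hypothetical.

```python
import math

Point = tuple[float, float]


def path_length(points: list[Point]) -> float:
    """Total distance travelled along a sequence of points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def future_traces(
    tracks: dict[int, list[Point]],
    t: int,
    horizon: int,
    min_motion: float = 2.0,
) -> dict[int, list[Point]]:
    """For each mark, return its positions over the next `horizon` frames after t.

    Marks that move less than `min_motion` pixels in that window are dropped,
    on the assumption that static marks carry little action information.
    """
    traces = {}
    for mark, positions in tracks.items():
        window = positions[t : t + horizon + 1]  # current frame plus the future frames
        if len(window) > 1 and path_length(window) >= min_motion:
            traces[mark] = window[1:]
    return traces


# Hypothetical tracks: mark id -> (x, y) position in each frame.
tracks = {
    1: [(10.0, 10.0), (14.0, 13.0), (18.0, 16.0), (22.0, 19.0)],  # moving object
    2: [(50.0, 50.0), (50.0, 50.0), (50.0, 50.5), (50.0, 50.0)],  # near-static background
}
traces = future_traces(tracks, t=0, horizon=2)
print(traces)
```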

Installation and usage

Clone the project:

git clone https://github.com/microsoft/Magma.git
cd Magma

Install the dependencies:

conda create -n magma python=3.10 -y
conda activate magma
pip install --upgrade pip
pip install -e .
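
After installation, a quick check can confirm that the environment is usable before downloading the model weights. The sketch below is only a convenience and not part of the Magma repository. It assumes the required packages are the three imported by the inference example (`torch`, `transformers`, `PIL`) and takes the Python 3.10 minimum from the conda command above.

```python
import importlib.util
import sys

# Assumption: these are the packages the inference example imports.
REQUIRED = ("torch", "transformers", "PIL")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_ok(version=sys.version_info, minimum=(3, 10)):
    """Check the interpreter version against the one used to create the environment."""
    return tuple(version[:2]) >= minimum


missing = missing_packages()
print("Python version OK:", python_ok())
print("Missing packages:", ", ".join(missing) if missing else "none")
```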

Inference example

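from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the model and processor; trust_remote_code is needed because Magma ships custom model code.
dtype = torch.bfloat16
model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
model.to("cuda")

image = Image.open("example.jpg").convert("RGB")
# Image placeholder tokens as used in the upstream README; verify them against the model card.
convs = [
    {"role": "system", "content": "You are an agent that can see, talk and act."},
    {"role": "user", "content": "<image_start><image><image_end>\nWhat is in the image?"},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image], texts=prompt, return_tensors="pt")
# The upstream README adds a batch dimension to the image tensors before generation.
inputs["pixel_values"] = inputs["pixel_values"].unsqueeze(0)
inputs["image_sizes"] = inputs["image_sizes"].unsqueeze(0)
inputs = inputs.to("cuda").to(dtype)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=500, do_sample=False)

# Drop the prompt tokens so that only the newly generated answer is decoded.
generate_ids = generate_ids[:, inputs["input_ids"].shape[-1]:]
response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
print(response)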
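
To reuse the example with different questions, the conversation can be built by a small helper. This is a sketch under stated assumptions: `build_convs` and `IMAGE_PLACEHOLDER` are hypothetical names, and the placeholder string should be checked against the model card before relying on it.

```python
# Assumption: the image placeholder expected by the chat template; confirm on the model card.
IMAGE_PLACEHOLDER = "<image_start><image><image_end>"
DEFAULT_SYSTEM_PROMPT = "You are an agent that can see, talk and act."


def build_convs(question, system_prompt=DEFAULT_SYSTEM_PROMPT):
    """Build a single-image conversation: system prompt, then placeholder plus question."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{IMAGE_PLACEHOLDER}\n{question}"},
    ]


convs = build_convs("What is in the image?")
print(convs[1]["content"])
```

The returned list can be passed to `processor.tokenizer.apply_chat_template` in place of the hand-written `convs` in the example.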