## summary

- A multimodal model
  - audio => text
  - model:
    - AudioEncoder
    - TextDecoder
  - a typical Transformer sequence-to-sequence (encoder-decoder) model
- Installation
 ```
 # https://github.com/openai/whisper
 pip install -U openai-whisper

 ```
- Where downloaded model checkpoints are cached
 - `~/.cache/whisper/xx.pt`
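
To check which checkpoints are already downloaded, the cache directory can be listed from Python. This is a minimal sketch: `default_whisper_cache` and `list_checkpoints` are hypothetical helper names, and the `XDG_CACHE_HOME` fallback is an assumption about how the cache root is resolved.

```python
import os
from pathlib import Path


def default_whisper_cache() -> Path:
    """Return the assumed default cache directory, i.e. ~/.cache/whisper."""
    # Assumption: the cache root honours XDG_CACHE_HOME and falls back to ~/.cache
    cache_root = os.getenv("XDG_CACHE_HOME", os.path.join(os.path.expanduser("~"), ".cache"))
    return Path(cache_root) / "whisper"


def list_checkpoints(cache_dir: Path) -> dict:
    """Map each cached checkpoint name (file stem) to its size in MiB."""
    if not cache_dir.is_dir():
        return {}
    return {
        p.stem: round(p.stat().st_size / 2**20, 1)
        for p in sorted(cache_dir.glob("*.pt"))
    }


if __name__ == "__main__":
    for name, size_mib in list_checkpoints(default_whisper_cache()).items():
        print(f"{name}\t{size_mib} MiB")
```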


## ffmpeg: video => audio

```
sudo apt update && sudo apt install ffmpeg
ffmpeg -i sample.avi -q:a 0 -map a sample.mp3
```

- `-i`: input
- `-q:a 0`: variable bit rate (VBR) at the highest quality setting
 - https://trac.ffmpeg.org/wiki/Encode/MP3
- `-map a`: exclude video/subtitles and only grab audio
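
The same extraction can be scripted, for example to convert many videos in a loop. A minimal sketch that assumes `ffmpeg` is on `PATH`; `build_extract_cmd` and `extract_audio` are hypothetical helper names, and the flags are the ones listed above.

```python
import shutil
import subprocess


def build_extract_cmd(video_path: str, audio_path: str) -> list:
    """Build the ffmpeg argument list: best-quality VBR, audio streams only."""
    return ["ffmpeg", "-i", video_path, "-q:a", "0", "-map", "a", audio_path]


def extract_audio(video_path: str, audio_path: str) -> None:
    """Run ffmpeg and raise if it is missing or exits with an error."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(build_extract_cmd(video_path, audio_path), check=True)


if __name__ == "__main__":
    # Only prints the command; call extract_audio(...) to actually run it
    print(" ".join(build_extract_cmd("sample.avi", "sample.mp3")))
```

Passing the command as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.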

## cli 

```
whisper --language Chinese --model large video.mp3 -o output_dir
```

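- Parameters
  - `--task {transcribe,translate}`: defaults to `transcribe` (i.e. speech recognition, ASR); `translate` translates the speech into English
  - `--language Chinese`: the language to transcribe in
  - `--model {base, medium, large}`: model size (other sizes such as `tiny` and `small` are also available)
  - `--device device`: defaults to `cuda` when a GPU is available, otherwise `cpu`; runs on a single GPU
  - `-o output_dir`: the directory where the generated text files are saved
    - `xx.json`
    - `xx.srt`
    - `xx.tsv`
    - `xx.txt`
    - `xx.vtt`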
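
The CLI call has a Python counterpart in `model.transcribe`, whose result contains a list of segments with `start`, `end` and `text`. The sketch below shows roughly how such segments map onto the `.srt` output. `format_timestamp`, `segments_to_srt` and `transcribe_to_srt` are hypothetical helpers written for illustration, not the writer the CLI uses internally.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render segments (dicts with start, end, text) as SRT blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_name: str = "large", language: str = "zh") -> str:
    """Transcribe a whole file and return SRT text (needs openai-whisper installed)."""
    import whisper  # imported lazily so the helpers above work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, language=language, task="transcribe")
    return segments_to_srt(result["segments"])


if __name__ == "__main__":
    demo = [{"start": 0.0, "end": 2.5, "text": " hello"}]  # made-up segment
    print(segments_to_srt(demo))
```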

## python script

In [1]:
import whisper
import numpy as np

In [2]:
def get_params(model):
    # count only the trainable parameters
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

In [3]:
model = whisper.load_model("base")
print("base\t", get_params(model))
model = whisper.load_model("medium")
print("medium\t", get_params(model))
model = whisper.load_model("large")
print("large\t", get_params(model))
model

base	 71825920
medium	 762321920
large	 1541384960


Whisper(
  (encoder): AudioEncoder(
    (conv1): Conv1d(80, 1280, kernel_size=(3,), stride=(1,), padding=(1,))
    (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
    (blocks): ModuleList(
      (0-31): 32 x ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=1280, out_features=1280, bias=True)
          (key): Linear(in_features=1280, out_features=1280, bias=False)
          (value): Linear(in_features=1280, out_features=1280, bias=True)
          (out): Linear(in_features=1280, out_features=1280, bias=True)
        )
        (attn_ln): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=1280, out_features=5120, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=5120, out_features=1280, bias=True)
        )
        (mlp_ln): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_post): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): TextDecoder(
    (token_embedding): Embedding(51865, 1280)
    (blocks): Modul
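
The layer shapes printed for the large model are enough to work out, by hand, how many parameters one encoder `ResidualAttentionBlock` holds. A small sketch of that arithmetic; the helper names are hypothetical. It covers only the 32 encoder blocks, not the convolutions, `ln_post` or the decoder.

```python
def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Weights plus optional bias of a Linear layer."""
    return n_in * n_out + (n_out if bias else 0)


def layer_norm_params(width: int) -> int:
    """Scale and shift vectors of a LayerNorm with elementwise_affine=True."""
    return 2 * width


def encoder_block_params(width: int, mlp_width: int) -> int:
    """Parameters of one encoder block: attention + MLP + two LayerNorms."""
    attn = (
        linear_params(width, width)                # query
        + linear_params(width, width, bias=False)  # key (no bias in the printout)
        + linear_params(width, width)              # value
        + linear_params(width, width)              # out
    )
    mlp = linear_params(width, mlp_width) + linear_params(mlp_width, width)
    return attn + mlp + 2 * layer_norm_params(width)


per_block = encoder_block_params(1280, 5120)
print(per_block)       # 19676160
print(32 * per_block)  # 629637120
```

So the 32 encoder blocks alone account for roughly 630M of the 1541M parameters reported for `large`.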

In [4]:
# load the audio as a 16 kHz mono float32 waveform
audio = whisper.load_audio("./data/video/video.mp3")
print(audio.shape)
print(audio, type(audio))

(21241974,)
[ 0.00305176 0.00213623 0.00241089 ... -0.00765991 -0.00476074
 -0.00314331] <class 'numpy.ndarray'>


In [15]:
# number of samples / sample rate (16 kHz) / 60 = duration in minutes
21241974 / 16000 / 60

22.12705625
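
The same arithmetic also gives a rough idea of how many 30-second windows the file spans. A minimal sketch; `duration_seconds` and `n_windows` are hypothetical helpers, and fixed back-to-back windows are an assumption made for the estimate (as far as I understand, `model.transcribe` advances its window by predicted timestamps, so the real number of decoding steps can differ).

```python
import math

SAMPLE_RATE = 16_000   # Hz, the rate the audio is resampled to on load
CHUNK_SECONDS = 30     # length of one model input window


def duration_seconds(n_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Length of the waveform in seconds."""
    return n_samples / sample_rate


def n_windows(n_samples: int, sample_rate: int = SAMPLE_RATE,
              chunk_seconds: int = CHUNK_SECONDS) -> int:
    """Number of fixed, non-overlapping windows needed to cover the waveform."""
    return math.ceil(n_samples / (chunk_seconds * sample_rate))


n = 21_241_974  # audio.shape[0] from the cell above
print(duration_seconds(n) / 60)  # duration in minutes
print(n_windows(n))              # 45
```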

In [5]:
# pad/trim the audio to exactly 30 seconds (30 * 16000 = 480000 samples)
audio = whisper.pad_or_trim(audio)
print(audio.shape)

(480000,)
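
What `pad_or_trim` does to a 1-D waveform can be sketched in a few lines: keep the first 480000 samples, or append zeros until that length is reached. `pad_or_trim_list` below is a hypothetical pure-Python stand-in for illustration; the real function works on NumPy arrays and tensors.

```python
N_SAMPLES = 30 * 16_000  # 480000 samples = 30 s at 16 kHz


def pad_or_trim_list(samples, length: int = N_SAMPLES) -> list:
    """Trim to the first `length` samples, or zero-pad at the end."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


print(pad_or_trim_list([1.0, 2.0, 3.0], length=2))  # [1.0, 2.0]
print(pad_or_trim_list([1.0, 2.0], length=4))       # [1.0, 2.0, 0.0, 0.0]
```

This is why the 22-minute file shrinks to `(480000,)`: everything after the first 30 seconds is dropped before decoding.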


In [6]:
model.device

device(type='cuda', index=0)

In [7]:
# make log-Mel spectrogram with the model's own number of mel bins
# and move it to the same device as the model
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
# print(probs)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
# inference
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)

Detected language: zh
好,B站的朋友们大家晚上好今天给大家开启一个插入一个新的系列就是面向小白的深度学习软硬件的装机指南然后这个系列呢我在上一期里面带着大家去实际开箱了一下我周末配置好的一个深度学习服务器然后大致的一个硬件一个配置的情况给大家展示了一把那这一节呢
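
`detect_language` returns a dict that maps language codes to probabilities, and `max(probs, key=probs.get)` picks the single best one. To see the runners-up as well, the dict can be sorted. A minimal sketch: `top_languages` is a hypothetical helper and the probabilities are made-up values, not real model output.

```python
def top_languages(probs: dict, k: int = 3) -> list:
    """Return the k most probable (language, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


toy_probs = {"zh": 0.93, "en": 0.04, "ja": 0.02, "ko": 0.01}  # made-up values
print(max(toy_probs, key=toy_probs.get))  # zh
print(top_languages(toy_probs, k=2))      # [('zh', 0.93), ('en', 0.04)]
```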


In [None]:
import whisper
model = whisper.load_model("large")
audio = whisper.load_audio("./data/video/video.mp3")
# pad/trim the audio to fit 30 seconds
audio = whisper.pad_or_trim(audio)
# make log-Mel spectrogram with the model's own number of mel bins
# and move it to the same device as the model
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
# print(probs)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
# default task: transcribe
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
# print the recognized text
print(result.text)

# task='translate': translate the speech into English instead of transcribing it
options = whisper.DecodingOptions(task='translate')
result = whisper.decode(model, mel, options)
# print the English translation
print(result.text)

Detected language: zh
好,B站的朋友们大家晚上好今天给大家开启一个插入一个新的系列就是面向小白的深度学习软硬件的装机指南然后这个系列呢我在上一期里面带着大家去实际开箱了一下我周末配置好的一个深度学习服务器然后大致的一个硬件一个配置的情况给大家展示了一把那这一节呢
