Audio-visual speech recognition (audio only) model flow
(T, B, D) -> (B, D, T)
(B, D, T) -> conv1d -> (B, D, T) -> both the frame count (T) and the dim (D) change
(B, D, T) -> (T, B, D)
(T, B, D) -> posEn -> (T, B, D)
(T, B, D) -> trans -> (T, B, D)
(T, B, D) -> trans -> (T, B, D)
(T, B, D) -> (B, D, T)
(B, D, T) -> conv1d -> (B, D, T) -> only the dim (D) changes
(B, D, T) -> (T, B, D)
(T, B, D) -> softmax -> (T, B, D)
In the end, each time forward is run, only T changes in (T, B, D). outputBatch : (T 29, B 32, D 40) t..
2021. 7. 7.