BERTphone: Phonetically-Aware Encoder Representations for Utterance-Level Speaker and Language Recognition
2019
Abstract
We introduce BERTphone, a Transformer encoder trained on large speech corpora
that outputs phonetically-aware contextual representation vectors that can be
used for both speaker and language recognition. This is accomplished by
training on two objectives: the first, inspired by adapting BERT to the
continuous domain, involves masking spans of input frames and reconstructing
the whole sequence for acoustic representation learning; the second, inspired
by the success of bottleneck features from ASR, is a sequence-level CTC loss
applied to phoneme labels for phonetic representation learning. We pretrain two
BERTphone models (one on Fisher and one on TED-LIUM) and use them as feature
extractors into x-vector-style DNNs for both tasks. We attain a
state-of-the-art $C_{\text{avg}}$ of 6.16 on the challenging LRE07 3sec
closed-set language recognition task. On Fisher and VoxCeleb speaker
recognition tasks, we see an 18% relative reduction in speaker EER when
training on BERTphone vectors instead of MFCCs. In general, BERTphone
outperforms previous phonetic pretraining approaches on the same data. We
release our code and models at
https://github.com/awslabs/speech-representations.
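The first objective in the abstract — masking spans of input frames and reconstructing the whole sequence — can be sketched in a few lines. This is an illustrative toy implementation in NumPy, not the authors' released code: the function names, span length, and masking probability are assumptions for demonstration, and the second (CTC phoneme) objective is omitted.

```python
import numpy as np

def mask_frame_spans(frames, span_len=3, mask_prob=0.15, seed=0):
    """Zero out random spans of acoustic frames, BERT-style.

    Hypothetical helper: returns the corrupted frames and a boolean
    mask recording which positions were hidden from the encoder.
    """
    rng = np.random.default_rng(seed)
    frames = frames.copy()
    T = frames.shape[0]
    masked = np.zeros(T, dtype=bool)
    t = 0
    while t < T:
        if rng.random() < mask_prob:
            end = min(t + span_len, T)
            masked[t:end] = True
            frames[t:end] = 0.0  # hide this span from the encoder
            t = end
        else:
            t += 1
    return frames, masked

def reconstruction_loss(pred, target):
    """Mean L1 error over the whole sequence (all frames, not only
    the masked ones), matching the abstract's description of
    reconstructing the entire input."""
    return float(np.abs(pred - target).mean())
```

In training, the encoder would receive the corrupted frames and be penalized by `reconstruction_loss` against the originals, with the CTC loss on phoneme labels added as a second term.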
| Reference Key | kirchhoff2019bertphone |
|---|---|
| Authors | Shaoshi Ling; Julian Salazar; Yuzong Liu; Katrin Kirchhoff |
| Journal | arXiv |
| Year | 2019 |
| DOI | Not found |
| URL | |
| Keywords | |